In a conversation with a colleague some months ago I was asked if I knew how to scale the replication factor of a Hadoop Distributed File System (HDFS) cluster as the number of nodes increased in order to keep the probability of experiencing any data loss below a certain threshold. My initial reaction to the question was that it would not be affected, I was naively thinking the data loss probability was a product of the replication factor only.

Thankfully, it didn't take me long to realize I was wrong. What is confusing, is that for a constant replication factor as the cluster grows the probability of data loss increases, but the quantity of data lost decreases (if the quantity of data remains constant).

To see why consider a situation in which we have N nodes in a cluster with replication factor K. We let the probability of a single node failing in a given time period be X. This time period needs to be sufficiently small so that we know that the server administrator will not have enough time to replace the machine or drive and recover the data. The probability of experiencing data loss in that time period is the probability of getting K or more nodes failing. The exact value of which is calculated with the following sum:

Although in general a good approximation (a consistent overestimate) is simply:

Clearly as the size of N increases this probability must get bigger.

This got me thinking about how to determine the way the replication factor should scale with the cluster size in order to keep the probability of data loss constant (ignoring the quantity). This problem may have been solved elsewhere, but it was an enjoyable mathematical exercise to go through.

In essence we want to know if the number of nodes in the cluster increases by some value n, then what is the minimum number k such that the probability of data loss remains the same or smaller. Using the approximation from above we can express this as:

Now if we substitute in the formulas for N-choose-K and perform some simplifications we can transform this into:

I optimistically thought that it might be possible to simplify this using Stirling's Approximation, but I am now fairly certain that this is not possible. Ideally we would be able to express k in terms of N,n,K,X, but I do not think that it is possible. If you are reading this and can see that I am wrong please show me how.

In order to get a sense of the relationship between n and k I decided to do some quick numerical simulations in R to have a look at how k scales with n.

I tried various combinations of X, N and K. Interestingly for a constant X the scaling was fairly robust when you varied the initial values of N and K. I have plotted the results for three different values of X so you can see the effect of different probability of machine failure. In all three plots the baseline case was a cluster of 10 nodes with a replication factor of 3.

You can grab the R code used to generate these plots from my GitHub repository.

Thankfully, it didn't take me long to realize I was wrong. What is confusing, is that for a constant replication factor as the cluster grows the probability of data loss increases, but the quantity of data lost decreases (if the quantity of data remains constant).

To see why consider a situation in which we have N nodes in a cluster with replication factor K. We let the probability of a single node failing in a given time period be X. This time period needs to be sufficiently small so that we know that the server administrator will not have enough time to replace the machine or drive and recover the data. The probability of experiencing data loss in that time period is the probability of getting K or more nodes failing. The exact value of which is calculated with the following sum:

Although in general a good approximation (a consistent overestimate) is simply:

Clearly as the size of N increases this probability must get bigger.

This got me thinking about how to determine the way the replication factor should scale with the cluster size in order to keep the probability of data loss constant (ignoring the quantity). This problem may have been solved elsewhere, but it was an enjoyable mathematical exercise to go through.

In essence we want to know if the number of nodes in the cluster increases by some value n, then what is the minimum number k such that the probability of data loss remains the same or smaller. Using the approximation from above we can express this as:

In order to get a sense of the relationship between n and k I decided to do some quick numerical simulations in R to have a look at how k scales with n.

I tried various combinations of X, N and K. Interestingly for a constant X the scaling was fairly robust when you varied the initial values of N and K. I have plotted the results for three different values of X so you can see the effect of different probability of machine failure. In all three plots the baseline case was a cluster of 10 nodes with a replication factor of 3.

You can grab the R code used to generate these plots from my GitHub repository.