Designing a system from scratch to handle a workload you don’t know is a challenge. If you provision too much hardware, you might be wasting money. If you provision too little, your users will complain about how slow the system is.

If you’re given only a rate, like 6000 hits/hour, you don’t know how the hits are distributed minute by minute or second by second. We can make a guess and say that there are about 100 hits per minute, or 1.67 hits/sec. If hits arrived uniformly at that rate, we could design a system that handles 2 hits/sec and all users would be happy, since every request would be served quickly with no queueing. But we know that’s not going to happen. In some seconds the number of hits will be less than 2, and in others it will be more than 2.

Theoretically, requests to our server arrive randomly. Let’s imagine 60 bins, one for each second in a minute. We also imagine that requests are balls we throw into the bins. A ball is equally likely to land in any bin. It’s even possible that all the balls land in a single bin!

After throwing the balls into bins, let’s see what we have.

As you can see, some bins have more than 2 balls (the average of about 1.67 balls per bin, rounded up). Therefore, if we design our system based on the average, roughly 50% of our users will have a great experience while the other 50% will have a bad one. We need to find how many requests per second our server must handle so that our users have a good experience, without overspending.
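The experiment can be sketched in a few lines of R. This is only an illustration under stated assumptions: 100 requests per minute, a hypothetical `capacity` of 2 requests per second, and an arbitrary seed so the run is reproducible. The variable names are made up for this sketch.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

balls <- 100  # requests in one minute (6000 hits/hour)
bins  <- 60   # seconds in one minute

# Throw each ball into a uniformly chosen bin and count the balls per bin.
counts <- tabulate(sample(bins, balls, replace = TRUE), nbins = bins)

capacity <- 2  # hypothetical design target: requests served per second

# Seconds in which more requests arrive than the server can handle.
overloaded_seconds <- sum(counts > capacity)

# Share of all requests that land in an overloaded second.
overloaded_share <- sum(counts[counts > capacity]) / balls

print(table(counts))
cat("overloaded seconds:", overloaded_seconds, "\n")
cat("share of requests in overloaded seconds:", overloaded_share, "\n")
```

Changing the seed gives a different minute each time, which is the point: the average alone says little about the busiest seconds.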

To determine how many requests per second we need to support, we need the probability of getting 3, 4, 5, 6 or more requests in one second. We will compute the probability starting from 3 requests per second and increment by one until the probability is low enough. If we designed the system for a rate that has a very low probability, we would be spending money on something that rarely occurs.

**Computing the Probability Distribution**

We can view the distribution of balls into bins in another way. Imagine labeling each ball with a number from 1 to 60. Each number has an equal chance to be picked. The meaning of this labeling is this: the number that was assigned to the ball is the bin (time bucket) it belongs to. After labeling all balls, what you have is a distribution of balls into bins.

Since each ball can be labeled in 60 different ways and there are 100 balls, the number of ways we can label 100 different balls is therefore

$$60^{100}.$$

Pick a number from 1 to 60, say number 1. Assume 2 balls out of 100 are labeled with number 1. In how many ways can you do this? Choose the first ball to label. There are 100 ways to choose it. Now choose the second ball. There are 99 ways to choose it. We therefore have 100 × 99 = 9900 ways to select 2 balls and label them 1. Since we don’t really care in what order we picked the balls, we divide 9900 by the number of possible arrangements of ball 1 and ball 2, which is 2! (where the exclamation mark stands for “factorial”). So far, the number of ways to label 2 balls with the same number is

$$\frac{100 \cdot 99}{2!} = 4950.$$

Since these are the only balls with label 1, the third ball can be labeled with anything except number 1. In that case, there are 59 ways to label ball 3. In the same way, there are 59 ways to label ball 4. Continuing this reasoning until ball 100, the total number of ways we can label 2 balls with number 1 and the rest with anything else is therefore:

$$\frac{100 \cdot 99}{2!} \cdot 59^{98}.$$

Notice that the exponent of 59 is 98 since there are 98 balls starting from ball 3 to ball 100.

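Therefore, dividing by the $60^{100}$ possible labelings, the probability of having exactly two balls in a given bin is

$$\frac{100 \cdot 99}{2!} \cdot \frac{59^{98}}{60^{100}} \approx 0.265.$$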

We can also write this as

$$\frac{100!}{2!\,98!} \cdot \frac{59^{98}}{60^{100}}.$$

In general, if *m* is the number of balls, *n* the number of bins and *k* the number of balls with the same label, then the probability of having *k* balls within the same bin is given by

$$P(k) = \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}},$$

where

$$\binom{m}{k} = \frac{m!}{k!\,(m-k)!}$$

is the binomial coefficient.
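As a sanity check, the general formula can be coded directly and compared with R’s built-in binomial density, since choosing one bin out of *n* uniformly for each of *m* balls is a binomial experiment with success probability 1/*n*. The helper name `p_balls` is made up for this sketch. The direct formula works for *m* = 100 and *n* = 60, but its powers would overflow for much larger *m*, where `dbinom` is the safer choice.

```r
# Probability that one particular bin holds exactly k of the m balls,
# when each ball independently picks one of n bins uniformly at random:
#   P(k) = choose(m, k) * (n - 1)^(m - k) / n^m
p_balls <- function(k, m, n) {
  choose(m, k) * (n - 1)^(m - k) / n^m
}

m <- 100  # balls (requests per minute)
n <- 60   # bins (seconds per minute)
k <- 0:10

from_formula <- p_balls(k, m, n)
from_dbinom  <- dbinom(k, size = m, prob = 1 / n)

print(data.frame(k = k, formula = from_formula, dbinom = from_dbinom))
```

The two columns should agree up to floating-point rounding.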

It turns out that this is a probability distribution, since the sum of all probabilities from *k=0* to *k=m* is equal to 1, that is,

$$\sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}} = 1.$$

To see this, recall from the Binomial Theorem that

$$(x+y)^{m} = \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^{k}.$$

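If we let *x=n-1* and *y=1*, we can write the above equation as

$$n^{m} = \sum_{k=0}^{m} \binom{m}{k} (n-1)^{m-k}.$$

Dividing both sides by $n^{m}$ shows that the probabilities sum to 1.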
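The identity is easy to confirm numerically for our case. The following is a small check, assuming *m* = 100 and *n* = 60 as in the rest of the post:

```r
m <- 100  # balls
n <- 60   # bins
k <- 0:m

# Right-hand side of the Binomial Theorem with x = n - 1 and y = 1.
terms <- choose(m, k) * (n - 1)^(m - k)

# Dividing by n^m should give 1, up to floating-point rounding.
total <- sum(terms) / n^m
print(total)
```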

Here is a graph of this probability distribution.

Here’s the plot data:

| k (balls in a bin) | probability |
|---|---|
| 1 | 0.315663315854 |
| 2 | 0.264836171776 |
| 3 | 0.146632456689 |
| 4 | 0.060268424995 |
| 5 | 0.019612775592 |
| 6 | 0.005263315484 |
| 7 | 0.001197945897 |
| 8 | 0.000236035950 |
| 9 | 0.000040895118 |
| 10 | 0.000006307552 |

We can see that for k=9, the probability of it occurring is about 0.004%. Anything beyond that we can call rare, and there is no need to spend money on it.
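For sizing, it can be more convenient to look at the chance of getting *more* than a given number of requests in one second, rather than exactly that number. Here is a hedged sketch using R’s `pbinom`; the cutoff `threshold` is a hypothetical value chosen only for illustration, and what counts as “rare enough” is a business decision.

```r
m <- 100  # requests per minute
n <- 60   # seconds per minute

candidates <- 1:10  # candidate capacities, in requests per second

# P(K > c): chance that a given second receives more than c requests.
overload_prob <- pbinom(candidates, size = m, prob = 1 / n, lower.tail = FALSE)

threshold <- 1e-4  # hypothetical tolerance for an overloaded second

# Smallest capacity whose overload probability is below the threshold.
capacity <- min(candidates[overload_prob < threshold])

print(data.frame(capacity = candidates, overload_prob = overload_prob))
cat("smallest capacity under the threshold:", capacity, "\n")
```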

**Just For Fun**

What’s the probability that a given bin is empty, that is, there are no balls in it?
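One way to work it out, using the same counting argument as above: for a given bin to stay empty, each of the 100 balls must receive one of the other 59 labels, so

$$P(0) = \frac{59^{100}}{60^{100}} = \left(\frac{59}{60}\right)^{100} \approx 0.186.$$

In other words, under this model a bit under one second in five sees no requests at all.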

**Other Probability Distributions**

Our computation above was based on a uniform probability distribution: each request is equally likely to land in any second. However, there are other distributions that are more suitable for modeling the arrival of requests. One of the most widely used is the Poisson distribution.
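To get a feel for how close the two models are in our example, here is a rough comparison of the balls-in-bins probabilities with a Poisson distribution whose rate is the average number of requests per second, *m/n*. This is only a sketch: the choice of rate and the column names are assumptions of this example.

```r
m <- 100          # requests per minute
n <- 60           # seconds per minute
lambda <- m / n   # average requests per second, about 1.67

k <- 0:10

balls_in_bins <- dbinom(k, size = m, prob = 1 / n)  # model used in this post
poisson       <- dpois(k, lambda = lambda)          # Poisson with the same mean

comparison <- data.frame(
  k = k,
  balls_in_bins = balls_in_bins,
  poisson = poisson,
  abs_diff = abs(balls_in_bins - poisson)
)
print(comparison)
```

For these numbers the two columns differ only in the third decimal place, so either model leads to a similar sizing decision.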

**R Code**

The R code to generate the simulation:

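    # Draw a 4 x 4 grid of simulated minutes.
    par(mfrow = c(4, 4))

    # Throw 100 requests into 60 one-second bins and plot the count per second.
    f <- function() {
      x <- sample(1:60, 100, replace = TRUE)
      # nbins = 60 keeps trailing empty seconds on the plot
      plot(tabulate(x, nbins = 60), type = "h", ylab = "tx", xlab = "secs")
    }

    for (i in 1:16) {
      f()
    }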

The R code to generate the probability distribution:

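    # Probability that a given bin holds exactly s of the m balls (n bins).
    # Same value as choose(m, s) * (n - 1)^(m - s) / n^m, arranged to avoid overflow.
    p <- function(m, n, s) {
      choose(m, s) * (1 / n)^s * ((n - 1) / n)^(m - s)
    }

    tt <- sapply(1:10, function(s) p(100, 60, s))
    plot(tt, type = "h", xlab = "Number of Balls", ylab = "Probability")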

**Dedication**

This post is dedicated to my friend Ernesto Adorio, a mathematician. He loves combinatorial mathematics.

Rest in peace my friend! I miss bouncing ideas with you.