An Interview Question: Using Integer Programming

We can solve the Interview Question using a mathematical technique called Integer Programming. Let d_1, d_2, \ldots, d_N be the variables representing diskette 1, diskette 2, diskette 3, etc. The values of the d_k variables can only be 0 or 1. A 0 means the diskette is not used while a 1 means that it is used.

Each file is saved to a certain diskette. We want to know to what diskette d_i a given file f_j is assigned. To represent this, we assign the variable a_{ij} a value of 1 if file f_j is assigned to diskette d_i.

We will normalize the file sizes so that if s_i is the size of f_i, the s_i \le 1. We do this by simply dividing all file sizes by the size of the diskette. For a given diskette d_i, the following constraint should be satisfied:

d_i - s_1a_{i1} - s_2a_{i2} - \ldots - s_N a_{iN} \ge 0

for diskette i = 1, 2, \ldots, N and s_i are the normalized file sizes of file f_i for i=1,2,\ldots,N.

Since each file f_j can only be assigned to one diskette, we have the following constraint:

a_{1j} + a_{2j} + \ldots + a_{Nj} = 1

where a_{1j} is the variable representing the “file f_j is in diskette d_1“, etc.

Finally, we have to constrain the value of d_i to be either 0 or 1, that is,

d_i \le 1

for all i=1,2,\ldots,N.

Integer Programming Formulation

Given the above information, we can formulate the Integer Programming problem as

Minimize:

d_1 + d_2 + d_3 + \ldots + d_N

subject to

\begin{array}{rl}  d_1 - s_1a_{11} - s_2a_{12} - s_3a_{13} - \ldots - s_Na_{1N} &\ge 0\\  d_2 - s_1a_{21} - s_2a_{22} - s_3a_{23} - \ldots - s_Na_{2N} &\ge 0\\  :\\  d_N - s_1a_{N1} - s_2a_{N2} - s_3a_{N3} - \ldots - s_Na_{NN} &\ge 0\\  a_{11} + a_{21} + a_{31} + \ldots + a_{N1} &= 1\\  a_{12} + a_{22} + a_{32} + \ldots + a_{N2} &= 1\\  :\\  a_{1N} + a_{2N} + a_{3N} + \ldots + a_{NN} &= 1\\  d_1 &\le 1\\  d_2 &\le 1\\  :\\  d_n &\le 1  \end{array}

Solving the Problem

We will use R to solve this Integer Programming Formulation. Please see code below:

library("lpSolve")
NUMFILES=4

# Generate random file sizes between 1 and 10
FileSizes=ceiling(10*runif(NUMFILES))
x = -1*FileSizes/10
l=length(x)

# Each files can be in any of the diskettes. Suppose there are N files,
# to determine if a file j is in diskette i, the value of variable x_ij will
# 1 if file j is in diskette i, and 0 otherwise.
# Here we construct the coefficients of variables x_ij which are the
# sizes of the files (normalized to 1)
zz=c()
for(i in 1:(l-1)){
  zz=c(zz,x,rep(0,l*l))
}
zz=c(zz,x)

# Construct the coefficients of the indicator variables representing the 
# diskettes d_i
zzmatrix=matrix(zz,ncol=l*l,byrow=T)
CoefficientsOfDiskettes=c();
for(i in 1:l){
 ttt=rep(0,l)
 ttt[i] = 1
 CoefficientsOfDiskettes= c(CoefficientsOfDiskettes,ttt,zzmatrix[i,])
}

# Construct the coefficients of x_ij for constant j. These variables 
# satisfy the equation \sum_{i=1}^N x_{ij}
SumOfFileAcrossDiskettes=c()
for(i in 1:l){
  ttt=rep(0,l)
  ttt[i]=1
  SumOfFileAcrossDiskettes=c(SumOfFileAcrossDiskettes,rep(ttt,l))
}

# Prepend Coefficients of variables d_i. The value of these coefficients is 0. 
SumOfFileAcrossDiskettesMatrix=matrix(SumOfFileAcrossDiskettes,ncol=l*l,byrow=T)
PrependCoefficientsOfDiskettes=c()
for(i in 1:l){
 PrependCoefficientsOfDiskettes=c(PrependCoefficientsOfDiskettes,c(rep(0,l),SumOfFileAcrossDiskettesMatrix[i,]))
}

# Construct coefficients of d_i to construct constraint d_i <= 1
DisketteConstraints=c()
for(i in 1:l){
 ttt=rep(0,l)
 ttt[i]=1
 DisketteConstraints=c(DisketteConstraints,ttt,rep(0,l*l))
}

# Construct matrix input of lpSolve
const.mat=matrix(c(CoefficientsOfDiskettes,PrependCoefficientsOfDiskettes,DisketteConstraints),ncol=l*(l+1),byrow=T)

print("Matrix Coefficients:")
print(const.mat)

# Construct inequalities/equalities
const.dir=c(rep(">=",l),rep("=",l),rep("<=",l))

# Construct Right-Hand side
const.rhs=c(rep(0,l),rep(1,l),rep(1,l))

# Construct Objective Function
objective.in=c(rep(1,l),rep(0,l*l))

# Invoke lpSolve
mylp=lp(direction="min",objective.in=objective.in,const.mat=const.mat,const.dir=const.dir,const.rhs=const.rhs,all.int=T)

# Print Results
print(paste("Number of Diskettes: ", sum(mylp$solution[1:l])))
tz=matrix(mylp$solution,ncol=l,byrow=T)
print("File Sizes: ")
print(FileSizes)
for(i in 2:(l+1)){
   files = which(tz[i,] == 1)
   if(length(files) > 0){
      print(paste("Files in diskette ", i-1))
      print(files)
   }
}

Most of the code above is setting up the matrix of coefficients. The line 70 then calls on lpSolve to compute the optimal values of the variables

Program Output

Running this code we get the output

[1] "Matrix Coefficients:"
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]    1    0    0    0   -1 -0.2 -0.1 -0.1    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
 [2,]    0    1    0    0    0  0.0  0.0  0.0   -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0     0   0.0   0.0   0.0
 [3,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0
 [4,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1
 [5,]    0    0    0    0    1  0.0  0.0  0.0    1   0.0   0.0   0.0     1   0.0   0.0   0.0     1   0.0   0.0   0.0
 [6,]    0    0    0    0    0  1.0  0.0  0.0    0   1.0   0.0   0.0     0   1.0   0.0   0.0     0   1.0   0.0   0.0
 [7,]    0    0    0    0    0  0.0  1.0  0.0    0   0.0   1.0   0.0     0   0.0   1.0   0.0     0   0.0   1.0   0.0
 [8,]    0    0    0    0    0  0.0  0.0  1.0    0   0.0   0.0   1.0     0   0.0   0.0   1.0     0   0.0   0.0   1.0
 [9,]    1    0    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[10,]    0    1    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[11,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[12,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[1] "Number of Diskettes:  2"
[1] "File Sizes: "
[1] 10  2  1  1
[1] "Files in diskette  1"
[1] 2 3 4
[1] "Files in diskette  2"
[1] 1

Interpreting the Result

Lines 2-14 of the output gives you the matrix of coefficients. Line 15 prints the number of diskettes needed to store the files. Line 17 prints the randomly generated file sizes from 1 to 10. Finally lines 18-21 prints which diskettes contain which files.

The space complexity of this solution is quite substantial. Given N files, we need to specify N^2 + N variables by 3\times N equations for a total of (N^2 + N)\times 3N memory space for coefficients.

An Interview Question

I was given this interview question and I’d like to share it to you. The setting is back in the days when the largest size of a hard disk was 1 GB and there were no CD writers yet and the only way to back up your data is through those 1.44 floppy disks. You want to back up your files but you want to minimize the number of floppy disks you need to use. Assume your most important files are in a single directory. How will you distribute the files across your disks in such a way that the number of disks you use is minimized ?

Click here to see the Integer Programming Solution.

To make this simple, let’s assume the following:

– we will not take into account that every file you copy to the disk has a record of the metadata of the file and stored on the disk as well. This will eat up space as you put more files. For our purposes, we ignore this complexity.
– The size of each file is less than or equal to 1.44

First we need to have a list of those files including the size and sort the list according to size in descending order. If A is the list of files, we can apply this algorithm:

B := list of files to copy to current floppy disk
remaining_size := 1.44 MB
For file in A:
    If remaining_size - file.size > 0:
        B.add(file)
        A.remove(file)
Copy all files listed in B to disk
Empty B
Repeat process for remaining files in A

Although there are other better algorithms than the one above, this is the one I managed to to come up during the interview.

We now need to determine how fast our algorithm can run.

Worst Case Complexity

How slow can this algorithm get ? If for any two files F_i and F_j in A we have F_i + F_j > 1.44, then all files will have their own diskette. If this is the case, for each file, our algorithm will execute step 4. For the first disk, it will execute the step N times. For the second disk, it will execute the step N-1 times, for the third disk it will execute N-2 times, etc. the total number of times it executes step 4 is the total number of comparisons and is equal to the summation:

\displaystyle \sum_{1=1}^{N} i

which is equal to

\displaystyle \frac{N(N+1)}{2}

Therefore, in the worst case, the complexity is O(N^2).

Best Case Complexity

The best case is when all files fit in just one diskette. For this, the total number of comparisons is N

Average Case Complexity

On the average, files have different sizes. We now compute the complexity on the assumption that the probability distribution is uniform.

If k is the number of diskettes, the number of comparisons is a sequence of monotonic decreasing numbers \{ a_1, a_2, a_3, \ldots, a_k \} taken at random from the set \{ 1, 2, \ldots, N\}. Each of the numbers a_j, j\in \{1, 2, \ldots, k\} has a probability 1/N of being chosen. Let X be a random variable such that

\displaystyle Pr(X=a_j) = \frac{1}{N} \text{ for } j=1,2,\ldots,k

then the number of comparisons C is equal to

\displaystyle C = \sum_{i=1}^k X = kX

The expected value of C is given by the

E[C] = E[kX] = kE[X]

However, the expected value of X is given by

\displaystyle E[X] = \sum_{j=1}^N j\cdot Pr(X=j) = \frac{1}{N} \sum_{j=1}^N j = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2}

Therefore,

\displaystyle E[C] = k\frac{N+1}{2}

What remains is to determine the average value of k, which is the number of diskettes. If M=1.44 is the maximum file size, the average file size is M/2. The average total file size is then NM/2. The average number of diskettes is equal to the average total size divided by size of diskette, that is

k = \displaystyle \frac{NM}{2}\frac{1}{M} = \frac{N}{2}

This means that

\displaystyle E[C] = \frac{N}{2} \frac{N+1}{2} = O(N^2)

which is the same as the worst case complexity.

There is another way to solve this problem using Integer Programming.