Difference Between Two Sample Proportions
Consider an infinite (or very large) population, where each observation has a probability pX of being a success, and a probability (1-pX) of being a failure. Let the set of independent and identically distributed random variables X1, X2, ..., Xm represent the observations from a sample of size m, where
Xi = 1 if the ith observation is a success
= 0 if the ith observation is a failure
Consider another similar population, with each observation having a probability pY of being a success, and a probability (1-pY) of being a failure. Let the set of independent and identically distributed random variables Y1, Y2, ..., Yn represent the observations from a sample of size n, where
Yi = 1 if the ith observation is a success
= 0 if the ith observation is a failure
Let X = (X1 + X2 + ... + Xm) and Y = (Y1 + Y2 + ... + Yn). Then X is binomially distributed with parameters pX and m; similarly, Y is binomially distributed with parameters pY and n. Using the normal approximation to the binomial distribution,
X → N(mpX, mpX(1-pX)) as m → ∞
Y → N(npY, npY(1-pY)) as n → ∞
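The quality of the normal approximation can be checked numerically. The following is a minimal sketch, not part of the original derivation: it compares the exact binomial probability P(X <= k) with the value given by N(mpX, mpX(1-pX)). The sample size m = 200, the probability 0.3, and the use of a continuity correction (evaluating the normal distribution at k + 0.5) are illustrative assumptions.

```python
import math
from statistics import NormalDist


def binomial_cdf(k, m, p):
    """Exact P(X <= k) for X ~ Binomial(m, p)."""
    return sum(math.comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k + 1))


def normal_approx_cdf(k, m, p):
    """Approximate P(X <= k) using N(mp, mp(1-p)).

    The + 0.5 continuity correction is an assumption of this sketch;
    it is not mentioned in the text above.
    """
    approx = NormalDist(mu=m * p, sigma=math.sqrt(m * p * (1 - p)))
    return approx.cdf(k + 0.5)


# Illustrative values only: m = 200 observations, success probability 0.3.
m, p_x = 200, 0.3
for k in (50, 60, 70):
    exact = binomial_cdf(k, m, p_x)
    approx = normal_approx_cdf(k, m, p_x)
    print(k, round(exact, 4), round(approx, 4))
```

For a fixed success probability, the two columns should move closer together as m increases, which is the sense in which the approximation holds for large samples.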
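Therefore, dividing X by m and Y by n, the sample proportions satisfy
X/m → N(pX, pX(1-pX)/m) as m → ∞
Y/n → N(pY, pY(1-pY)/n) as n → ∞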
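Because the sample proportions (X/m) and (Y/n) are each approximately normally distributed when m and n are large, and the two samples are independent of each other, the difference (X/m) - (Y/n) is also approximately normally distributed. Its mean is the difference of the two means, and its variance is the sum of the two variances. In fact,
(X/m) - (Y/n) → N(pX - pY, pX(1-pX)/m + pY(1-pY)/n) as m, n → ∞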
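and thus, subtracting the mean and dividing by the standard deviation,
[(X/m) - (Y/n) - (pX - pY)] / sqrt(pX(1-pX)/m + pY(1-pY)/n) → N(0, 1) as m, n → ∞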
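The result can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not part of the original text: it draws two independent samples many times, forms the difference of sample proportions, subtracts pX - pY, and divides by sqrt(pX(1-pX)/m + pY(1-pY)/n). The function names, the sample sizes, the probabilities, the number of repetitions, and the random seed are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import mean, pstdev


def standardized_difference(x, m, y, n, p_x, p_y):
    """Return [(x/m - y/n) - (p_x - p_y)] / sqrt(p_x(1-p_x)/m + p_y(1-p_y)/n)."""
    std_dev = math.sqrt(p_x * (1 - p_x) / m + p_y * (1 - p_y) / n)
    return ((x / m - y / n) - (p_x - p_y)) / std_dev


def simulate(m, n, p_x, p_y, reps=2000, seed=0):
    """Draw `reps` pairs of independent samples and standardize each difference."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = []
    for _ in range(reps):
        # Each observation is 1 (success) with the given probability, else 0.
        x = sum(rng.random() < p_x for _ in range(m))
        y = sum(rng.random() < p_y for _ in range(n))
        values.append(standardized_difference(x, m, y, n, p_x, p_y))
    return values


# Illustrative values only.
zs = simulate(m=200, n=150, p_x=0.3, p_y=0.5)
print(round(mean(zs), 3), round(pstdev(zs), 3))
```

If the approximation is reasonable for the chosen m and n, the printed mean should be close to 0 and the printed standard deviation close to 1.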