Confidence Interval Estimate for the Difference Between Means

(1) Large Samples

The difference between two means is of considerable importance in testing the homogeneity of populations. In this tutorial we are concerned with the confidence interval estimate for the difference between two population means.

By a non-rigorous argument from the central limit theorem, we can state the following: if two populations have means $${\mu _1}$$ and $${\mu _2}$$ and variances $$\sigma _1^2$$ and $$\sigma _2^2$$ respectively, then the sampling distribution of the difference of their sample means $$\left( {{{\overline X }_1} - {{\overline X }_2}} \right)$$ is approximately normal with mean $$\left( {{\mu _1} - {\mu _2}} \right)$$ and standard deviation $$\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} $$, where $${n_1}$$ and $${n_2}$$ are the sizes of independent samples drawn from the two populations, both larger than 30.

This formula for the standard deviation follows from the theorem that the variance of the sum or the difference of two independent random variables is the sum of their variances. Thus,

$$Var\left( {X \pm Y} \right) = Var\left( X \right) + Var\left( Y \right)$$

Hence \[Var\left( {{{\overline X }_1} - {{\overline X }_2}} \right) = Var\left( {{{\overline X }_1}} \right) + Var\left( {{{\overline X }_2}} \right) = \frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}\]

Therefore, the standard deviation of $$\left( {{{\overline X }_1} - {{\overline X }_2}} \right)$$, denoted $${\sigma _{{{\overline X }_1} - {{\overline X }_2}}}$$, is
\[{\sigma _{{{\overline X }_1} - {{\overline X }_2}}} = \sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} \]
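The standard deviation formula can be checked empirically. The following is a minimal simulation sketch, assuming normal populations; the population parameters, the sample sizes, the function name `simulate_diff_of_means`, and the fixed seed are all hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def simulate_diff_of_means(mu1, sigma1, n1, mu2, sigma2, n2, reps, seed=0):
    """Draw `reps` pairs of independent samples and return the
    differences of their sample means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        sample1 = [rng.gauss(mu1, sigma1) for _ in range(n1)]
        sample2 = [rng.gauss(mu2, sigma2) for _ in range(n2)]
        diffs.append(statistics.fmean(sample1) - statistics.fmean(sample2))
    return diffs


# Hypothetical population parameters (not taken from the tutorial).
mu1, sigma1, n1 = 10.0, 3.0, 40
mu2, sigma2, n2 = 8.0, 4.0, 50

diffs = simulate_diff_of_means(mu1, sigma1, n1, mu2, sigma2, n2, reps=5000)

# Standard deviation predicted by the formula versus the simulated one.
theoretical_sd = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
empirical_sd = statistics.stdev(diffs)

print(f"theoretical SD: {theoretical_sd:.4f}")
print(f"empirical SD:   {empirical_sd:.4f}")
```

With enough repetitions the two printed values should agree closely, and the average of the simulated differences should be close to $${\mu _1} - {\mu _2}$$.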

We can also standardize $$\left( {{{\overline X }_1} - {{\overline X }_2}} \right)$$ as follows:
$$Z = \frac{{\left( {{{\overline X }_1} - {{\overline X }_2}} \right) - \left( {{\mu _1} - {\mu _2}} \right)}}{{\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} }}$$

Here $$Z$$ is a standard normal variate. From this expression for $$Z$$ we can directly state the $$100\left( {1 - \alpha } \right)\% $$ confidence limits for the difference between two population means as
$$\left( {{{\overline X }_1} - {{\overline X }_2}} \right) \pm {Z_{\alpha /2}}\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} $$

The confidence interval may then be stated as
\[\left[ {\left( {{{\overline X }_1} - {{\overline X }_2}} \right) - {Z_{\alpha /2}}\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} < \left( {{\mu _1} - {\mu _2}} \right) < \left( {{{\overline X }_1} - {{\overline X }_2}} \right) + {Z_{\alpha /2}}\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} } \right]\]
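The interval can be computed directly from summary statistics. The following is a minimal sketch using only the Python standard library; the function name `z_interval_diff_means` and the input values are hypothetical and chosen for illustration, and the critical value is obtained from `statistics.NormalDist` rather than from a table.

```python
import math
from statistics import NormalDist


def z_interval_diff_means(mean1, mean2, var1, var2, n1, n2, confidence=0.95):
    """Large-sample confidence interval for mu1 - mu2.

    var1 and var2 are the population variances (or, for large samples,
    the sample variances used in their place).
    """
    alpha = 1.0 - confidence
    # Upper alpha/2 critical value of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    std_error = math.sqrt(var1 / n1 + var2 / n2)
    diff = mean1 - mean2
    margin = z_crit * std_error
    return diff - margin, diff + margin


# Hypothetical summary statistics (not taken from the tutorial).
lower, upper = z_interval_diff_means(
    mean1=52.0, mean2=50.0, var1=9.0, var2=16.0, n1=36, n2=64
)
print(f"95% CI for mu1 - mu2: ({lower:.3f}, {upper:.3f})")
```

The interval is symmetric about the observed difference of the sample means, and it widens as the confidence level increases or as either sample size decreases.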

It must be remembered that the above results hold only for large samples, or for small samples from normal populations provided the population variances are known. If $$\sigma _1^2$$ and $$\sigma _2^2$$ are not known, then for large samples they can be replaced by the sample variances $$S_1^2$$ and $$S_2^2$$, which are computed by the formula $${S^2} = \frac{{\sum {{\left( {{X_i} - \overline X } \right)}^2}}}{{n - 1}}$$. By convention, the larger of the two sample means is taken as $${\overline X _1}$$, so that the observed difference is non-negative.
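The sample variance formula with the $$n - 1$$ divisor can be sketched as follows. The data values and the function name `sample_variance` are hypothetical; the result is compared with `statistics.variance` from the Python standard library, which uses the same divisor.

```python
import statistics


def sample_variance(values):
    """Sample variance S^2 with the n - 1 divisor."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical observations (not taken from the tutorial).
scores = [4.0, 7.0, 6.0, 3.0, 5.0]

s_squared = sample_variance(scores)
print(s_squared)                    # variance from the formula above
print(statistics.variance(scores))  # standard-library equivalent
```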

(2) Small Samples

When at least one of the two sample sizes is small, $$t$$ takes the place of $$Z$$. Two different kinds of interval estimates are obtained, depending on whether the two populations are assumed to have equal variances $$\left( {\sigma _1^2 = \sigma _2^2} \right)$$ or unequal variances $$\left( {\sigma _1^2 \ne \sigma _2^2} \right)$$.

If the two populations are assumed to have unknown and unequal population variances $$\left( {\sigma _1^2 \ne \sigma _2^2} \right)$$, then the $$100\left( {1 - \alpha } \right)\% $$ confidence limits may be stated as
$$\left( {{{\overline X }_1} - {{\overline X }_2}} \right) \pm {t_{\alpha /2}}\sqrt {\frac{{S_1^2}}{{{n_1}}} + \frac{{S_2^2}}{{{n_2}}}} $$

Here $$S_1^2$$ and $$S_2^2$$ are calculated by using the formula $${S^2} = \frac{{\sum {{\left( {{X_i} - \overline X } \right)}^2}}}{{n - 1}}$$.

If the two populations are assumed to have equal but unknown population variances $$\left( {\sigma _1^2 = \sigma _2^2} \right)$$, then the $$100\left( {1 - \alpha } \right)\% $$ confidence limits may be stated as
$$\left( {{{\overline X }_1} - {{\overline X }_2}} \right) \pm {t_{\alpha /2}}{S_c}\sqrt {\frac{1}{{{n_1}}} + \frac{1}{{{n_2}}}} $$

Here $$S_c^2 = \frac{{\left( {{n_1} - 1} \right)S_1^2 + \left( {{n_2} - 1} \right)S_2^2}}{{{n_1} + {n_2} - 2}}$$ is the pooled (combined) sample variance.
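The equal-variance interval can be sketched in code as follows. The function names and the summary statistics are hypothetical. The critical value `t_crit` is supplied by the caller, as it would be when read from a $$t$$ table; the value 2.101 used here is the tabulated $${t_{0.025}}$$ for 18 degrees of freedom.

```python
import math


def pooled_variance(s1_sq, s2_sq, n1, n2):
    """Pooled sample variance S_c^2 of two samples."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def pooled_t_interval(mean1, mean2, s1_sq, s2_sq, n1, n2, t_crit):
    """Confidence interval for mu1 - mu2 assuming equal population variances.

    t_crit is the table value t_{alpha/2} for n1 + n2 - 2 degrees of freedom.
    """
    s_c = math.sqrt(pooled_variance(s1_sq, s2_sq, n1, n2))
    margin = t_crit * s_c * math.sqrt(1.0 / n1 + 1.0 / n2)
    diff = mean1 - mean2
    return diff - margin, diff + margin


# Hypothetical summary statistics: two samples of size 10, so the
# degrees of freedom are 10 + 10 - 2 = 18 and t_{0.025} = 2.101.
lower, upper = pooled_t_interval(
    mean1=25.0, mean2=22.0, s1_sq=16.0, s2_sq=9.0, n1=10, n2=10, t_crit=2.101
)
print(f"pooled variance: {pooled_variance(16.0, 9.0, 10, 10)}")
print(f"95% CI for mu1 - mu2: ({lower:.3f}, {upper:.3f})")
```

Passing the critical value in explicitly keeps the sketch free of third-party dependencies; a statistics library could supply it instead.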

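It may be noted that in the equal-variance case the $$t$$-statistic should be read from the table at $${n_1} + {n_2} - 2$$ degrees of freedom. In the unequal-variance case the statistic follows a $$t$$-distribution only approximately, and $${n_1} + {n_2} - 2$$ is an upper bound on the appropriate degrees of freedom; the Welch-Satterthwaite approximation is commonly used to obtain them instead.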
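For the unequal-variance case, a widely used refinement replaces $${n_1} + {n_2} - 2$$ with the Welch-Satterthwaite approximate degrees of freedom. That formula is not part of the tutorial text and is included here as an assumption; the function name and the inputs are hypothetical.

```python
def welch_satterthwaite_df(s1_sq, s2_sq, n1, n2):
    """Approximate degrees of freedom for the unequal-variance t interval
    (Welch-Satterthwaite approximation)."""
    a = s1_sq / n1
    b = s2_sq / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


# Hypothetical summary statistics (not taken from the tutorial).
df_unequal = welch_satterthwaite_df(s1_sq=16.0, s2_sq=9.0, n1=10, n2=10)
df_equal = welch_satterthwaite_df(s1_sq=4.0, s2_sq=4.0, n1=10, n2=10)

print(f"unequal sample variances: df = {df_unequal:.2f}")
print(f"equal sample variances:   df = {df_equal:.2f}")
```

The result is generally not an integer and never exceeds $${n_1} + {n_2} - 2$$; in practice it is usually rounded down before the $$t$$ table is consulted.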

Example:

A random sample of 100 students from an MBA class had an average score of 60 with a standard deviation of 15 in statistics. A random sample of 64 students from a BS class had an average score of 66 with a standard deviation of 16 in the same course. Construct a 95% confidence interval for the difference between the mean scores of the two classes.

Solution:

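Since both sample sizes are large, we will use the $$Z$$-statistic to construct the interval. Following the convention that the larger sample mean is taken as $${\overline X _1}$$, the BS class is sample 1 and the MBA class is sample 2. We have the following information: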

\[\begin{gathered} {\overline X _1} = 66\,\,\,\,\,\,\,{S_1} = 16 \simeq {\sigma _1}\,\,\,\,\,\,\,{n_1} = 64 \\ {\overline X _2} = 60\,\,\,\,\,\,\,{S_2} = 15 \simeq {\sigma _2}\,\,\,\,\,\,\,{n_2} = 100\,\,\,\,\,\,\,\alpha = 0.05 \\ \end{gathered} \]

 

Using the formula $$\left( {{{\overline X }_1} - {{\overline X }_2}} \right) \pm {Z_{\alpha /2}}\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} $$, the 95% lower confidence limit for the difference between two population means $$\left( {{\mu _1} - {\mu _2}} \right)$$ would be

\[\begin{gathered} \left( {{{\overline X }_1} - {{\overline X }_2}} \right) - {Z_{0.025}}\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} \\ \,\,\, = \left( {66 - 60} \right) - 1.96\sqrt {\frac{{256}}{{64}} + \frac{{225}}{{100}}} \\ \,\,\, = 6 - 1.96\left( {2.5} \right) = 6 - 4.90 = 1.10 \\ \end{gathered} \]

Also, the upper limit would be
$$\left( {{{\overline X }_1} - {{\overline X }_2}} \right) + 1.96\sqrt {\frac{{\sigma _1^2}}{{{n_1}}} + \frac{{\sigma _2^2}}{{{n_2}}}} = 6 + 4.90 = 10.90$$

Hence, the 95% confidence interval for the difference between the two population means is
$$1.10 < {\mu _1} - {\mu _2} < 10.90$$
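The arithmetic of this example can be reproduced with a few lines of code. This is only a check of the numbers above; the variable names are illustrative, and the critical value 1.96 is the one used in the solution.

```python
import math

# Summary statistics from the example: BS class is sample 1, MBA class is sample 2.
mean_bs, sd_bs, n_bs = 66.0, 16.0, 64
mean_mba, sd_mba, n_mba = 60.0, 15.0, 100
z_crit = 1.96  # Z_{0.025}, as used in the solution

# Sample standard deviations stand in for the population values (large samples).
std_error = math.sqrt(sd_bs**2 / n_bs + sd_mba**2 / n_mba)
diff = mean_bs - mean_mba
lower = diff - z_crit * std_error
upper = diff + z_crit * std_error

print(f"standard error = {std_error:.2f}")    # standard error = 2.50
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # 95% CI: (1.10, 10.90)
```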