Chapter 7. Estimation Techniques

Chapter Objectives

In this chapter, readers will learn to do the following:
• Define and apply the point estimation of mean for large samples
• Define and apply point estimation of proportion for large samples
• Define and evaluate the margin of error
• Define and apply interval estimation of mean for large samples
• Define and apply interval estimation of proportion for large samples
• Define and apply the Student distribution for interval estimation of mean for small samples
• Evaluate the minimum sample size for the given accuracy of estimation
• Estimate the difference between two means using large samples
• Estimate the difference between two proportions using large samples
• Estimate the difference between two means using small samples
• Define and use matched pairs to estimate the difference

In previous chapters, we emphasized that very often statisticians need to analyze samples instead of the population and extend the results of this analysis to make conclusions with respect to the entire population. We already mentioned that this process is called statistical inference and is considered one of the main objectives of statistics. In this chapter, we will estimate the reliability of statistical inferences.

Recall that both population and sample data sets are characterized by quantities describing the centre and dispersion of observations: the mean (a measure of the centre) and the standard deviation (a measure of dispersion). By convention, the quantities describing populations and samples are called parameters and statistics, respectively.

Later in this book we will also make inferences concerning variables with two outcomes, success and failure, such as “yes” and “no” or “turned on” and “turned off.” The coin toss examples that we analyzed in previous chapters, where the two outcomes were head and tail, are of the same kind. Considering that the probability of success was determined as the proportion of this event, it is reasonable to speak of population and sample proportions.

The following table will help you to memorize notations of all these quantities.

Table 7.1

| Quantity | Parameters | Statistics |
|---|---|---|
| Mean | μ | \(\bar{x}\) |
| Variance | σ² | s² |
| Standard deviation | σ | s |
| Proportion | p | \(\bar{p}\) |
7.1 Sample Estimation

In previous chapters, we briefly discussed some sampling methods. Although sample statistics provide some information about the mean and variance of the population, we still need to estimate the accuracy and intervals within which the sample representation is valid. Let’s consider the following examples.

Example 7.1

Twenty students wrote a test. Their teacher recorded the times that each student spent writing the test (table 7.2).

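Table 7.2

| Student # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Time, minutes | 49 | 45 | 41 | 44 | 32 | 40 | 39 | 42 | 35 | 38 | 40 | 32 | 43 | 30 | 39 | 41 | 43 | 31 | 31 | 32 |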

The mean time of this population is \(\mu\approx 38\) minutes. The teacher then created four samples of five randomly selected students and evaluated the sample means (table 7.3). Note that the number of all possible ordered samples of five, selected with replacement, is \(20\times20\times20\times20\times20=20^5\).

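Table 7.3

| Sample 1: Student # | Sample 1: Time, minutes | Sample 2: Student # | Sample 2: Time, minutes | Sample 3: Student # | Sample 3: Time, minutes | Sample 4: Student # | Sample 4: Time, minutes |
|---|---|---|---|---|---|---|---|
| 5 | 32 | 15 | 39 | 9 | 35 | 19 | 31 |
| 6 | 40 | 9 | 35 | 3 | 41 | 8 | 42 |
| 4 | 44 | 17 | 43 | 8 | 42 | 5 | 32 |
| 3 | 41 | 16 | 41 | 5 | 32 | 17 | 43 |
| 8 | 42 | 4 | 44 | 12 | 32 | 2 | 45 |
| Sample mean, \(\bar{x}\) | 40 | Sample mean, \(\bar{x}\) | 40 | Sample mean, \(\bar{x}\) | 36 | Sample mean, \(\bar{x}\) | 39 |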

Table 7.3 shows that the sample means are close enough to the population mean, although they differ from it and between themselves. Very often, we do not have the opportunity to evaluate the population mean or to create various samples. In most cases, we are able to select only one random sample and make projections based on its statistics. The example below is an instance where the number of samples is limited.
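The comparison made in example 7.1 can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names (`times`, `samples`, `sample_means`) are our own, and the data are copied from tables 7.2 and 7.3.

```python
from statistics import mean

# Writing times in minutes for students 1..20 (table 7.2).
times = [49, 45, 41, 44, 32, 40, 39, 42, 35, 38,
         40, 32, 43, 30, 39, 41, 43, 31, 31, 32]

# Student numbers making up the four samples of table 7.3.
samples = {
    "Sample 1": [5, 6, 4, 3, 8],
    "Sample 2": [15, 9, 17, 16, 4],
    "Sample 3": [9, 3, 8, 5, 12],
    "Sample 4": [19, 8, 5, 17, 2],
}

population_mean = mean(times)

# Student numbers are 1-based, list indices are 0-based.
sample_means = {
    name: mean([times[i - 1] for i in students])
    for name, students in samples.items()
}

print(f"population mean: {population_mean:.2f}")
for name, value in sample_means.items():
    print(f"{name}: {value:.1f}")
```

The unrounded values differ slightly from the whole numbers shown in the tables, which are rounded to the nearest minute.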

Example 7.2

A 1998 Environics poll showed that 85% of Canadians supported registration of all firearms, whereas the CBC, based on its research, concluded that 80% of Canadians are in support of this law (obtained from https://www.justice.gc.ca/eng/rp-pr/csj-sjc/crime/rr01_1/p5.html). Obviously, the researchers did not ask all Canadians about this law; they asked two randomly selected groups (samples) of Canadians. Therefore, 85% and 80% are corresponding sample proportions. Based on the examples analyzed in the previous chapter and above, we are not surprised that these numbers differ. It is also not clear which number is more accurate. But it is more important to clarify whether we can draw conclusions about the opinions of all Canadians based on these surveys for given sample sizes.

Below, we will talk about sample estimation techniques for large and small samples. The outcomes of the Central Limit Theorem will be used to conduct the sample estimation. An estimator is a rule (often expressed as a formula) that determines an estimate based on information obtained from a sample. Estimators are classified as point and interval estimators. Intuitively, we can foresee that the accuracy of estimation depends on the sample size (poorer for small samples and better for large samples).

7.2 Point Estimation

It is logical to use the sample mean, \(\bar{x}\), to estimate the population mean, μ. The mathematical process of calculating this number is called point estimation. The formula used in this process is called a point estimator, while the obtained number is called a point estimate. The distance between the true value of the parameter and its estimate is called the estimation error.

Point Estimation of Mean for Large Samples

In the previous chapter, we noted that by convention, a sample is considered large if the sample size \(n\ge 30\). According to the Central Limit Theorem, the distribution of the sample mean is approximately normal for such samples. Earlier, we applied the empirical rule to large samples and stated that 95% of such data lie approximately within two standard deviations of the mean of this distribution. Let’s determine this quantity more precisely.

image

Figure 7.1. Empirical rule for 95% of data with precisely determined z-scores

Following the procedure explained in example 5.3, we can determine that 95% of data precisely lie within 1.96 standard deviations of the mean (fig. 7.1). The quantity evaluated as “1.96 SE” is called the margin of error (fig. 7.2).

image

Figure 7.2. Empirical rule for 95% of data with precisely determined margin of error

In the previous chapter we determined the standard error, SE, as \(\frac{\sigma}{\sqrt{n}}\) . Consequently,

$$
\text{Margin of error}=\pm1.96\,SE=\pm1.96\left(\frac{s}{\sqrt{n}}\right) \qquad (F7.1)
$$

where s is the sample standard deviation (used as an estimate of σ when σ is unknown) and n is the size of the sample. Once again, let’s note that 95% was chosen by common convention only and is not based on any theoretical consideration.

In the example below, we will show how to construct the point estimation using the point estimator and margin of error.

Example 7.3

Within one of their projects, the authors of this book studied seasonal oscillations of birth time series in Canadian provinces (A. Sardarli, F. Trovato, A. Volodin, Role of Temperature in Forming Seasonal Fertility and Mortality Oscillations, Annual Meeting of Canadian Population Society, University of Ottawa, Ontario, ON, June 2–4, 2015). To conduct the statistical analysis, we randomly selected 40 monthly birth counts in Alberta from the years 1951–1999. The mean and standard deviation of this sample were determined as x̅=3030 and s=405, respectively. Since the sample size n=40 is greater than 30, we can use the normal distribution approach and take x̅=3030 as the point estimate of the population mean. Now, let’s determine the margin of error.

$$
\text{Margin of error}
= \pm1.96\,SE
= \pm1.96\left(\frac{405}{\sqrt{40}}\right)
= \pm1.96\cdot 64
= \pm126
$$

Therefore, we can conclude that the estimate of the mean number of monthly births in Alberta in the years 1951–1999 is \(3030 \pm 126\).
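The calculation of example 7.3 can be sketched in Python as follows. The function name `margin_of_error` and its parameters are hypothetical names chosen for this illustration; the default z = 1.96 corresponds to the 95% level used above.

```python
from math import sqrt

def margin_of_error(s, n, z=1.96):
    """Return z * s / sqrt(n), the margin of error for a large-sample mean."""
    return z * s / sqrt(n)

# Example 7.3: s = 405, n = 40, 95% level (z = 1.96).
x_bar = 3030
moe = margin_of_error(s=405, n=40)
print(f"{x_bar} ± {moe:.0f}")
```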

 

Point Estimation of Proportion for Large Samples

The point estimation approach used for large samples can be applied to samples of binomial variables if the conditions \(n\bar{p} > 5\) and \(n\bar{q} > 5\) are satisfied; where n is the sample size, \(\bar{p}\) and \(\bar{q}\) are estimated probabilities of success and failure, respectively. As was explained in chapter 6, the normal approximation can be applied to \(\bar{p}=\frac{x}{n}\), where x is the number of successes. It is logical to use this quantity as a point estimator with standard error as

$$
SE=\sqrt{\frac{\bar{p}\bar{q}}{n}} \qquad (F7.2)
$$

and the 95% margin of error as

$$
\pm 1.96\sqrt{\frac{\bar{p}\bar{q}}{n}} \qquad (F7.3)
$$

If n is very large, then the margin of error becomes almost negligible, and the point estimate is expected to lie very close to the true proportion. In statistics, this property is called consistency of the estimator.

Example 7.4

A driver training company offers six-hour mandatory driving education courses. The manager randomly selected 125 students and found out that 105 of them passed the test after the first attempt. Determine the estimator and the margin of error of the proportion of successful students trained by the company.

Solution:

The size of the sample is \(n = 125\). The sample proportion can be calculated as \(\bar{p}=\frac{105}{125}=0.84\). Consequently, \(\bar{q}=1-0.84=0.16\). Since \(n\bar{p}=125\cdot0.84=105>5\) and \(n\bar{q}=125\cdot0.16=20>5\), the normal distribution approximation can be applied to this sample. We use the sample proportion as our point estimate. The margin of error can be calculated using formula (F7.3).

$$
\pm1.96\sqrt{\frac{\bar{p}\bar{q}}{n}}
=\pm1.96\sqrt{\frac{0.84\cdot0.16}{125}}
=\pm1.96\cdot0.033
=\pm0.06
$$

Therefore, the proportion of successful students of the driver training company is 0.84 with a margin of error of 0.06. Using more casual language, we could say that \((84\pm6)\%\) of students passed the test on the first attempt.

7.3 Interval Estimation

At the beginning of this chapter, we argued that it is recommended to use the sample mean, \(\bar{x}\), as an estimator of the population mean, \(\mu\). In the case of binomial variables, we use the sample proportion as an estimator of the population proportion. At the same time, we have to realize that the probability that the mean of a randomly selected sample exactly equals the population mean is zero. Therefore, it would be reasonable and more practical to determine an interval of possible values of the population mean instead of a single point.

Another argument supporting the interval estimation is that the point estimation does not provide any information about the accuracy of the estimation itself. For instance, in example 7.3 we concluded that the sample estimate of monthly births in Alberta in the years 1951–1999 was 3,030. This estimation gives us an idea about the monthly births in the shown year interval. However, this statement does not show the level of our confidence that the conclusion based on statistics of the randomly selected sample accurately represents the entire population.

Interval Estimation of Mean for Large Samples

Below, we will learn about confidence interval estimation, which allows us to estimate a large sample mean at the required confidence level.

Example 7.5

Consider a forest in which the mean height of trees is equal to 12.4 metres (m) and the standard deviation is 2.1 m. According to the Central Limit Theorem, if we randomly select samples of size \(n\ge 30\), the sample mean \(\bar{X}\) will be approximately normally distributed with mean \(\mu_{\bar{X}}=\mu\) and standard deviation \(\sigma_{\bar{X}}=\frac{\sigma}{\sqrt{n}}\). Consequently, about 95% of sample means lie within 1.96 standard errors of \(\mu\). This is equivalent to saying that for about 95% of samples the interval \(\left[\bar{X}-1.96\frac{\sigma}{\sqrt{n}},\ \bar{X}+1.96\frac{\sigma}{\sqrt{n}}\right]\), or \(\left[\bar{X}-1.96SE,\ \bar{X}+1.96SE\right]\), contains \(\mu\). Let’s say \(n=30\). Using the Excel program, we randomly selected 85 samples of 30 tree heights, determined the sample mean of each of them, and constructed the intervals \(\left[\bar{x}-1.96SE,\ \bar{x}+1.96SE\right]\). Figure 7.3 demonstrates the 95% intervals for the 85 simulated samples.

image


Figure 7.3. 95% intervals for 85 Excel-simulated samples

Note that 4 out of 85 intervals do not contain the population mean \(\mu\) . In our example, the relative frequency of these intervals equals \(\frac{4}{85}=0.047\approx0.05\). Consequently, the relative frequency of intervals containing the population mean can be evaluated as \(1-0.047=0.953\), which is very close to 95%. In statistics, these two quantities determine the accuracy of the interval estimation. It is more practical to explain the meaning of these quantities using the normal distribution curve and standard scale (fig. 7.4), where the origin \(\mu\) is mapped to 0 and \(\sigma\) scaled to 1.
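A simulation of this kind can also be sketched in Python instead of Excel. The sketch below rests on assumptions that are not part of the example: tree heights are drawn from a normal population, 5,000 samples are used instead of 85 so that the relative frequency is more stable, and the names `interval_covers_mu` and `coverage` are our own.

```python
import random
from math import sqrt

random.seed(7)  # fixed seed so that the run is repeatable

MU, SIGMA = 12.4, 2.1      # population mean and standard deviation (example 7.5)
N = 30                     # sample size
Z = 1.96                   # z-score for the 95% level
SE = SIGMA / sqrt(N)       # standard error of the sample mean

def interval_covers_mu():
    """Draw one sample and report whether x_bar ± 1.96 SE contains MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = sum(sample) / N
    return x_bar - Z * SE <= MU <= x_bar + Z * SE

n_samples = 5000
coverage = sum(interval_covers_mu() for _ in range(n_samples)) / n_samples
print(f"share of intervals containing mu: {coverage:.3f}")
```

The printed share should be close to 0.95, although it changes slightly with the seed and the number of samples.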

image

Figure 7.4. Normal distribution curve with standard scale (95% data)

In statistics, we use the Greek letter \(\alpha\) to denote the relative frequency of intervals that fail to contain the population mean. The quantity \(1-\alpha\) characterizes the accuracy of the estimation and is called the level of confidence. Usually, the level of confidence is given either in terms of α or as a percentage. In our example, \(\alpha=0.05\) and, hence, the level of confidence is equal to \(100(1-\alpha)\%=100(1-0.05)\%=95\%\).

Figure 7.5 generalizes our example to any α: the smaller α is, the higher the level of confidence. This is why we consider small values of α. Since the normal distribution curve is symmetrical, the areas of the tails on both sides are equal to \(\alpha/2\). The z-score that leaves the area \(\alpha/2\) on the right is denoted as \(z_{\alpha/2}\). Consequently, the z-score that leaves \(\alpha/2\) on the left equals \(-z_{\alpha/2}\). The corresponding values of z-scores can be found from the cumulative normal distribution table (table A3, Appendix; backward problem).

image

Figure 7.5. Normal distribution curve with standard scale (\(100(1-\alpha)\%\) of data)

By convention, in statistics we use a 90%, 95%, 98%, or 99% level of confidence for interval estimation.
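As an alternative to the table lookup, the z-scores for these confidence levels can be computed with Python's standard library. This is a minimal sketch; `z_critical` is a hypothetical helper name, while `NormalDist().inv_cdf` is the inverse cumulative distribution function of the standard normal distribution.

```python
from statistics import NormalDist

def z_critical(confidence):
    """z-score that leaves alpha/2 in the right tail of the standard normal curve."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

Printed tables usually round these values to two or three decimal places.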

Now we are ready to construct the confidence interval for large samples at the given \(\alpha\) . If the standard deviation of the population is known,

$$
\bar{x}\pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)
\quad \text{or} \quad
\left[\bar{x}-z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),\
\bar{x}+z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right]
\qquad (F7.4)
$$

where

  • n is the sample size
  • \(\bar{x}\) is the sample mean
  • σ is the standard deviation of population

If σ is unknown, and \(n\geq 30\) , we use the sample standard deviation s:

$$
\bar{x}\pm z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)
\quad \text{or} \quad
\left[\bar{x}-z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right),\
\bar{x}+z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)\right]
\qquad (F7.4)
$$

Example 7.6

Use the data provided in example 7.3 to construct confidence intervals for the following values of confidence levels:

(a) 90%
(b) 95%
(c) 98%
(d) 99%

Solution:

(a) \(\alpha=\frac{100\%-90\%}{100\%}=0.10\) and \(z_{\alpha/2}=1.645\) for the given confidence coefficient. Consequently, the corresponding margin of error is
$$
\pm1.645\left(\frac{405}{\sqrt{40}}\right)=\pm105
$$

Considering the point estimator, \(\bar{x}=3030\) , we can construct the confidence interval for the confidence coefficient of 90%:

 

\(3030\pm105\ \text{or}\ 2925<\mu<3135\)

This means that we can state with 90% confidence that in the years 1951–1999, the mean number of monthly births in Alberta was between 2,925 and 3,135.

Parts (b)–(d) of the example can be solved following the same procedure.

 

(b) \(\alpha=0.05,\ z_{\alpha/2}=1.96\),

$$
\text{Margin of error}
=\pm1.96\left(\frac{405}{\sqrt{40}}\right)
=\pm126
$$

 

Confidence interval: \(3030\pm126\ \text{or}\ 2904<\mu<3156\)

 

(c) \(\alpha=0.02,\ z_{\alpha/2}=2.33\),
$$
\text{Margin of error}
=\pm2.33\left(\frac{405}{\sqrt{40}}\right)
=\pm149
$$

Confidence interval: \(3030\pm149\ \text{or}\ 2881<\mu<3179\)

 

(d) \(\alpha=0.01,\ z_{\alpha/2}=2.58\),

$$
\text{Margin of error}
=\pm2.58\left(\frac{405}{\sqrt{40}}\right)
=\pm165
$$

 

Confidence interval: \(3030\pm165\ \text{or}\ 2865<\mu<3195\)

 

It would be reasonable to visualize the results of the example (fig. 7.6).

image

Figure 7.6. Confidence intervals constructed from example 7.6

As one can see from figure 7.6, the larger the confidence level, the wider the confidence interval.
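The four intervals of example 7.6 can be reproduced with a short Python sketch. The function `mean_confidence_interval` is a hypothetical helper written for this illustration; it computes \(z_{\alpha/2}\) directly instead of reading it from a table, so the unrounded margins differ slightly from the hand calculations.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, s, n, confidence):
    """Large-sample interval x_bar ± z_{alpha/2} * s / sqrt(n), as in (F7.4)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Data of example 7.3: x_bar = 3030, s = 405, n = 40.
intervals = {
    level: mean_confidence_interval(3030, 405, 40, level)
    for level in (0.90, 0.95, 0.98, 0.99)
}
for level, (low, high) in intervals.items():
    print(f"{level:.0%}: {low:.0f} < mu < {high:.0f}")
```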

Table 7.4 presents values of z-scores for the most used confidence levels.

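Table 7.4

| \(100(1-\alpha)\%\) | \(\alpha\) | \(z_{\alpha}\) | \(\alpha/2\) | \(z_{\alpha/2}\) |
|---|---|---|---|---|
| 90 | 0.10 | 1.28 | 0.05 | 1.645 |
| 95 | 0.05 | 1.645 | 0.025 | 1.96 |
| 98 | 0.02 | 2.05 | 0.01 | 2.33 |
| 99 | 0.01 | 2.33 | 0.005 | 2.576 |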


Interval Estimation of Proportion for Large Samples

In this chapter we learned that point estimation can be applied to the proportion for large samples if \(n\bar{p}>5 \ \text{and} \ n\bar{q}>5\), where n is the sample size, and \(\bar{p}\) and \(\bar{q}\) are the estimated probabilities of success and failure, respectively. If these conditions are satisfied, the confidence interval can be determined for the given level of confidence, \(100\left(1-\alpha\right)\%\), using the formula

\(\bar{p}\pm z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}}\ \text{or}\ \left[\bar{p}-z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}},\ \bar{p}+z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}}\right]\ (F7.5)\)

where \(z_{\alpha/2}\) is the z-score, which leaves \(\alpha/2\) on its right under the normal distribution curve.

Example 7.6(a)

Construct a 98% confidence interval for the data set given in example 7.4.

Solution:

First, we determine the sample proportion,

[latex]\bar{p}=\frac{x}{n}=\frac{105}{125}=0.84[/latex]

Consequently,

\(\bar{q}=1-\bar{p}=1-0.84=0.16\)

In example 7.4 we verified that the conditions \(n\bar{p}>5\) and \(n\bar{q}>5\) are satisfied. Hence, we can use the normal distribution approach. For the 98% level of confidence,

\(\alpha=\frac{100\%-98\%}{100\%}=0.02\)

The corresponding z-score

\(z_{\alpha/2}=z_{0.01}=2.33\)

Now we can construct the required interval:

$$
\left[\bar{p}-z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}},\ \bar{p}+z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}}\right]
$$
$$
=\left[0.84-2.33\sqrt{\frac{0.84\cdot0.16}{125}},\ 0.84+2.33\sqrt{\frac{0.84\cdot0.16}{125}}\right]
$$
$$
=\left[0.84-0.08,\ 0.84+0.08\right]
$$
$$
=\left[0.76,\ 0.92\right]
$$

Therefore, at the 98% confidence level, the proportion of students passing the test on the first attempt lies between 0.76 and 0.92.
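The steps of this solution can be sketched in Python. The function `proportion_interval` and its parameter names are hypothetical; the z-score 2.33 is the table value used in the solution above.

```python
from math import sqrt

def proportion_interval(x, n, z):
    """Return (p_bar, low, high) for p_bar ± z * sqrt(p_bar * q_bar / n), as in (F7.5)."""
    p_bar = x / n
    q_bar = 1 - p_bar
    # The normal approximation is used only if n*p_bar > 5 and n*q_bar > 5.
    if n * p_bar <= 5 or n * q_bar <= 5:
        raise ValueError("normal approximation conditions are not satisfied")
    margin = z * sqrt(p_bar * q_bar / n)
    return p_bar, p_bar - margin, p_bar + margin

# Example 7.4 data at the 98% level (z = 2.33).
p_bar, low, high = proportion_interval(x=105, n=125, z=2.33)
print(f"p_bar = {p_bar:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

Raising an error when the conditions fail is a design choice of this sketch; it simply prevents the formula from being applied where the normal approximation is not justified.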

7.4 Minimum Sample Size for Accurate Estimation

In the examples that we analyzed before, the sample sizes were given. However, when working on a real project, statisticians are required to determine the sample size (i.e., the number of measurements) necessary to achieve a certain level of accuracy. In fact, this is an important part of the sampling plan. It seems reasonable to assume that there must be a relationship between the accuracy and the number of measurements. Following this logic, we can use the margin of error to estimate the minimum number of measurements (the sample size). Above, for the confidence level \(100\left(1-\alpha\right)\%\), we determined the margin of error as \(\pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt n}\right)\ \text{or}\ \pm z_{\alpha/2}\left(\frac{s}{\sqrt n}\right)\), depending on whether the standard deviation of the population is known or not. If we require that the margin of error does not exceed a bound, B, the minimum sample size can be found from the inequality:

\(z_{\alpha/2}\left(\frac{\sigma}{\sqrt n}\right)<B\)

Solving the inequality, we obtain the following:

\(n>\left(\frac{z_{\alpha/2}}{B}\right)^2\sigma^2\ (F7.6)\)

Considering that n is a positive integer denoting the minimum sample size, we must round up the number obtained from the right-hand side of (F7.6) to the next whole number.

Example 7.7

A gardener asked his statistician friend to estimate the average weight of Gala apples from his garden within 5 grams at a 95% confidence level. He also told him that previous measurements yielded a standard deviation of 20 grams. What is the minimum number of apples the statistician needs to weigh?

Solution:

The z-score for the given confidence level is 1.96. Considering the bound \(B=5\), the minimum sample size can be found using the formula (F7.6):

\(\left(\frac{z_{\alpha/2}}{B}\right)^2\sigma^2=\left(\frac{1.96}{5}\right)^2\cdot 20^2=61.4656\)

By rounding up we obtain the minimum number of weight measurements equal to 62.

Researchers working in the “field” can confirm that the population variance or standard deviation is not always known when you start your project. Often, we only have information about the data range, R. Based on accumulated knowledge of statistical studies, it is recommended to use a quarter of the data range to approximate the standard deviation.

\(\sigma\approx\frac{R}{4}\)

Example 7.8

According to city hall’s all-time records, the daily dollar amount of parking tickets varies from $21,075 to $32,628. Use \(\alpha=0.10\) to determine the minimum number of randomly selected days needed to estimate the mean daily amount within $500.

Solution:

Since the standard deviation is not given, we will approximate it by a quarter of the data range:

\(\sigma\approx\frac{R}{4}=\frac{32628-21075}{4}=2888\)

The z-score corresponding to the required confidence level is

\(z_{\alpha/2}=1.645\)

Considering \(B=500\) and using the formula (F7.6), we obtain the following:

\(\left(\frac{z_{\alpha/2}}{B}\right)^2\sigma^2=\left(\frac{1.645}{500}\right)^2\cdot 2888^2=90.279\)

After rounding up, we can conclude that one must randomly select at least 91 days of records to construct the required confidence interval.
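The sample size calculations of examples 7.7 and 7.8 can be sketched in Python as follows. The function name `min_sample_size_mean` and its parameters are hypothetical; the sketch simply evaluates the right-hand side of (F7.6) and rounds it up.

```python
from math import ceil

def min_sample_size_mean(sigma, bound, z):
    """Round (z / B)^2 * sigma^2 from (F7.6) up to the next whole number."""
    return ceil((z / bound) ** 2 * sigma ** 2)

# Example 7.7: sigma = 20 g, B = 5 g, 95% level.
n_apples = min_sample_size_mean(sigma=20, bound=5, z=1.96)

# Example 7.8: sigma approximated by a quarter of the range, B = $500, alpha = 0.10.
sigma_tickets = (32628 - 21075) / 4
n_days = min_sample_size_mean(sigma=sigma_tickets, bound=500, z=1.645)

print(n_apples, n_days)
```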

A similar strategy can be applied to estimate the minimum sample size needed to construct the confidence interval of a proportion, requiring that the margin of error of the proportion does not exceed a bound B:

\(z_{\alpha/2}\sqrt{\frac{pq}{n}}<B\)

Squaring both sides and solving the inequality for n, we get

\(n>\frac{pq}{B^2}\,z_{\alpha/2}^2\ (F7.7)\)

where p and q are the probabilities of success and failure, respectively, and \(z_{\alpha/2}\) is the z-score determined for the required confidence level.

Example 7.9

According to the Coffee Association of Canada, two-thirds of Canadians enjoyed at least one cup of coffee a day in 2018 (Obtained from https://coffeebi.com/2019/02/18/the-canadian-coffee-consumption-2019/#:~:text=According%20to%20the%20Coffee%20Association,other%20beverage%2C%20even%20tap%20water). What is the minimum number of Canadians who have to be interviewed to estimate this proportion within 0.01 of the actual proportion at the 90% confidence level?

Solution:

The probability of success is \(p=\frac{2}{3}\approx0.67\). Consequently, \(q=1-0.67=0.33\). It is given that \(B=0.01\). The z-score for \(\alpha=\frac{100-90}{100}=0.1\) is 1.645. Now we can use formula (F7.7):

\(\frac{pq}{B^2}z_{\alpha/2}^2=\frac{0.67\cdot0.33}{0.01^2}\cdot 1.645^2=5983.02\)

Therefore, rounding up, the minimum required sample size is \(n=5984\).

NOTE: As was mentioned above, very often, we start a statistical analysis without having detailed information about the population. In particular, there may not be sufficient knowledge of the population proportion to make an estimation. In these cases, we assume that \(p=0.5\), which gives the largest possible value of the product \(pq=0.25\), and determine the minimum sample size using the following formula:

\(n>\frac{z_{\alpha/2}^2}{4B^2}\ (F7.8)\)

Example 7.10

A statistics student at the First Nations University of Canada, as part of her research project, needs to estimate the probability that a randomly selected Indigenous student at her university speaks an Indigenous language. How many Indigenous students must she survey to complete her estimation within 0.03 of the proportion at the 98% confidence level?

Solution:

Since the population proportion is unknown, we can assume \(p=0.5\) and use formula (F7.8) to determine the sample size (the required minimum number of randomly selected Indigenous students), considering that for a 98% confidence level \(z_{\alpha/2}=2.33\).

\(\frac{z_{\alpha/2}^2}{4B^2}=\frac{2.33^2}{4\cdot0.03^2}=1508.03\)

Rounding up, we conclude that at least 1,509 Indigenous students must participate in the survey to achieve the project's objectives.

7.5 Student’s t-Distribution

We referred to conclusions of the Central Limit Theorem and used the normal distribution approach to design the estimation techniques. However, the application of the Central Limit Theorem requires a large sample size (\(n\geq30\)); otherwise, the population must be normal and its standard deviation \(\sigma\) must be given. Therefore, if the sample is small (\(n<30\)) and the population standard deviation is unknown, we must use a different statistic to make an inference. Intuitively, we can expect that the distribution curve will still have a similar shape and will approach the normal distribution curve as the sample size increases. This statistic, which is described by the t-distribution, was first derived in 1876 by German statisticians Friedrich Robert Helmert (1843–1917) and Jacob Lüroth (1844–1910). In 1908, the English statistician William Sealy Gosset published the first article about this distribution in an English-language scientific journal under the pseudonym of “Student.” This is why the t-distribution is now also known as “Student’s distribution” (Ronald E. Walpole, H. Raymond Myers, Probability & statistics for engineers & scientists, 7th ed, Pearson, New Delhi, 2006).

Student’s or t-distribution curves can be described as follows:

  • Mound-shaped
  • Symmetrical about 0
  • More widely dispersed than the standard normal distribution (heavier tails)
  • Shape depends on a number called the degrees of freedom (df), which identifies each t-distribution

Figure 7.7 presents the t-distribution curves for the degrees of freedom 15, 30, and 120. The t-distribution curve approaches the normal curve as the degrees of freedom (and hence the sample size) get large enough.

image

Figure 7.7. t-distribution curves for the degrees of freedom 15, 30, and 120

The area below each curve is equal to 1 regardless of the degrees of freedom. We leave the explanation of this property to readers. (Hint: The sum of the probabilities of all possible outcomes is equal to 1.)

Note that t-scores, like z-scores, can be calculated using mathematical functions and calculus. In this book, we will not provide these calculations, but will instead use the table of t-scores (table A4, Appendix). The table presents t-scores that leave on their right the area equal to α. The first column of the table shows the degrees of freedom, df. The subscripts in the first row indicate the values of α.

Example 7.11

Solve the following using the t-distribution table (table A4, Appendix):

(a) Determine the t-score, which cuts off 0.025 in the right tail with degrees of freedom df=4.

(b) Determine the area on the left of t=2.5 for the 6 degrees of freedom.

Solution:

(a) It is given that df=4 and α=0.025. We find the corresponding t-score at the intersection of the row "df=4" and column "t0.025" (table 7.5).

Table 7.5

Therefore, the t-score that cuts off 0.025 in the right tail with 4 degrees of freedom equals 2.776 (fig. 7.8).

image

Figure 7.8. t-distribution curve for example 7.11(a)

(b) The t-distribution table (table A4, Appendix) does not provide the t-score, which exactly equals 2.5 on row "df=6" (table 7.6).

Table 7.6

image

The two closest numbers to 2.5 are 2.447 and 3.143. The corresponding values of α are 0.025 and 0.010. As a rough approximation, we take their average, [latex]\displaystyle \frac{0.025+0.010}{2}=0.0175[/latex]. Therefore, the area to the right of [latex]\displaystyle t=2.5[/latex] for 6 degrees of freedom is approximately 0.0175. Considering that the total area below the curve is 1, the area to the left of [latex]\displaystyle t=2.5[/latex] is approximately [latex]\displaystyle 1-0.0175=0.9825[/latex] (fig. 7.9).

image

Figure 7.9. t-distribution curve for example 7.11(b)

Interval Estimation of Mean for Small Samples

Now, if the size n of the sample, randomly selected from a normal population, is less than 30 and the population variance is unknown, we can use the t-score for estimation. The t-score of the sample mean is determined as

[latex]\displaystyle t=\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}\qquad\text{(F7.9)}[/latex]

where [latex]\displaystyle \bar{x}[/latex] is the sample mean, [latex]\displaystyle \mu[/latex] is the population mean, and [latex]\displaystyle s[/latex] is the standard deviation of the sample.

When we do not know the population variance but have an estimate s, we will use the following formula to construct the confidence interval:

[latex]\displaystyle \bar{x}\pm t_{\alpha/2}\frac{s}{\sqrt{n}}\qquad\text{(F7.10)}[/latex]

where t has [latex]\displaystyle df=n-1[/latex] degrees of freedom.

Example 7.12

A statistics professor knows that according to the records of the Registrar’s Office, the GPAs of his current statistics students are normally distributed. He decided to estimate the GPA interval of his students and randomly selected seven students with the following GPAs: 2.61, 2.83, 3.94, 2.54, 3.02, 2.24, and 2.54. Construct the confidence interval using [latex]\displaystyle \alpha=0.05[/latex].

Solution:

Since the sample size [latex]\displaystyle n=7[/latex] is less than 30, and population distribution is normal, we will use formula (F7.10) to construct the confidence interval. To apply the formula, we need to determine the sample mean and sample standard deviation.

[latex]\displaystyle \bar{x}=\frac{2.61+2.83+3.94+2.54+3.02+2.24+2.54}{7}=2.82[/latex]

[latex]\displaystyle s=\sqrt{\frac{\left(\sum_{i=1}^{n} x_i^2\right)-\frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}=\sqrt{\frac{57.386-\frac{(19.72)^2}{7}}{6}}=0.55[/latex]

 

The degrees of freedom

[latex]\displaystyle df=n-1=7-1=6[/latex].

For [latex]\displaystyle df=6[/latex] and given confidence level, [latex]\displaystyle t_{\alpha/2}=t_{0.05/2}=t_{0.025}=2.447[/latex] according to the t-distribution table (table A4, Appendix). Consequently,

$$
\bar{x}\pm t_{\alpha/2}\frac{s}{\sqrt{n}}
$$

$$
=2.82\pm2.447\cdot\frac{0.55}{\sqrt{7}}
$$

$$
=2.82\pm0.51
$$

$$
=[2.82-0.51,\;2.82+0.51]
$$

$$
=[2.31,\;3.33]
$$

Therefore, with 95% confidence, the average GPA of all statistics students of this professor is between 2.31 and 3.33.
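The calculations of example 7.12 can be checked with a short Python sketch. Python's standard library has no quantile function for the t-distribution, so the critical value is copied from the t-table as a constant; the variable names are our own.

```python
from math import sqrt
from statistics import mean, stdev

gpas = [2.61, 2.83, 3.94, 2.54, 3.02, 2.24, 2.54]

n = len(gpas)
x_bar = mean(gpas)
s = stdev(gpas)          # sample standard deviation (divisor n - 1)
df = n - 1

# Critical value t_{0.025} for df = 6, copied from the t-table (table A4).
T_CRITICAL = 2.447

margin = T_CRITICAL * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin
print(f"x_bar = {x_bar:.2f}, s = {s:.2f}, interval = [{low:.2f}, {high:.2f}]")
```

For other sample sizes or confidence levels, the constant `T_CRITICAL` has to be replaced by the corresponding table value.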

7.6 Estimation of the Difference of Samples

Large Sample Estimation of the Difference Between Two Means

One important topic of discussion during the COVID-19 pandemic concerned the question of which vaccine is more effective. In fact, researchers working in many areas need to answer these types of questions, comparing various populations, such as the gross domestic product (GDP) of countries, the duration of actions of pain medications, the success of teaching methods, the safety ratings of vehicles, etc. Below, we will show how the point and interval estimators can be used to compare the means of two populations. The estimation procedure is similar to the one used for one population. The main difference is that we will now estimate the difference between two means instead of one mean.

Consider two populations whose means are \(\mu_1\) and \(\mu_2\) with standard deviations \(\sigma_1\) and \(\sigma_2\), respectively. We already mentioned that it is not always possible to review each observation of the populations. As we did before, one can analyze randomly selected samples and extend the obtained conclusions to the populations. Assume that two samples (one from each population) have been randomly selected from the given populations with the following descriptive statistics (table 7.7):

Table 7.7

| | Sample 1 | Sample 2 |
|---|---|---|
| Mean | \(\bar{x}_1\) | \(\bar{x}_2\) |
| Standard deviation | \(s_1\) | \(s_2\) |
| Sample size | \(n_1\) | \(n_2\) |

According to the Central Limit Theorem, if both populations are normal or sample sizes are not less than 30, we can use the normal distribution approximation to estimate the difference between the means of these populations. Again, we will apply the point and interval estimations to analyze the difference between the two means.

 

As was mentioned, the means of populations 1 and 2 are \(\mu_1\) and \(\mu_2\), respectively, and two samples randomly selected from these populations are described by the statistics provided in Table 7.7. To conduct the point estimation of the difference between \(\mu_1\) and \(\mu_2\), we define the point estimator as

$$
\bar{x}_1-\bar{x}_2 \qquad \text{(F7.11)}
$$

and the standard error as

$$
SE=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}
\qquad \text{(F7.12)}
$$

[latex]\text{where } \sigma_1 \text{ and } \sigma_2 \text{ are standard deviations of populations 1 and 2, respectively.}[/latex] If the standard deviations of the populations are not given, we will estimate the standard error using the standard deviations of the samples:

$$
SE=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\qquad \text{(F7.13)}
$$

Consequently, the 95% margin of error can be evaluated as

$$
\pm 1.96\,SE=\pm 1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}
\;\text{or}\;
\pm 1.96\,SE=\pm 1.96\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\qquad \text{(F7.14)}
$$

Therefore, the interval estimator for \( \mu_1 - \mu_2 \) is

$$
\left(\bar{x}_1-\bar{x}_2\right)\pm1.96\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}
\;\text{or}\;
\left(\bar{x}_1-\bar{x}_2\right)\pm1.96\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\qquad \text{(F7.15)}
$$

According to the logic used above, a \( (1-\alpha)100\% \) large sample confidence interval for the difference \( \mu_1 - \mu_2 \) can be constructed as

$$
\left(\bar{x}_1-\bar{x}_2\right)\pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}
\;\text{or}\;
\left(\bar{x}_1-\bar{x}_2\right)\pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\qquad \text{(F7.16)}
$$

Example 7.13

A pharmacy needs to select one of two offered pain medications, referred to here as Medication 1 and Medication 2. To make the decision, the pharmacy manager decided to check whether there is any difference between the durations of action of the medications. Upon her request, a city hospital provided the following data on the duration of action of each medication in randomly selected patients (table 7.8):

Table 7.8

| | Medication 1 | Medication 2 |
|---|---|---|
| Sample mean of the acting duration, in minutes | 280 | 272 |
| Standard deviation, in minutes | 30 | 20 |
| Sample size | 60 | 50 |

(a) Construct the point estimator for the difference between two population means.
(b) Construct a 98% confidence interval for the difference between two population means.
(c) Based on your analysis, can you conclude if there is a difference between the action durations of these medications?

Solution:

Since both sample sizes exceed 30, we can use the normal distribution approximation to construct estimators.

(a) According to the data provided in table 7.8,

$$
\bar{x}_1 - \bar{x}_2 = 280 - 272 = 8
$$

and

$$
SE=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
=\sqrt{\frac{30^2}{60}+\frac{20^2}{50}}
=4.8
$$

Therefore, the point estimate for the difference between the means of action durations is $8$ minutes with a standard error of 4.8 minutes.

(b) For a 98% confidence interval, [latex]z_{\alpha/2} = 2.33[/latex] (see table 7.4). Therefore, the interval can be constructed as follows:

$$
\left(\bar{x}_1-\bar{x}_2\right)\pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\Rightarrow 8\pm 2.33\cdot 4.8
\Rightarrow 8\pm 11
$$

or

$$
-3<\mu_1-\mu_2<19
$$

(c) The inequality above shows that the interval for the difference of the means extends from a negative number to a positive number, so it contains zero. Because zero is a plausible value of \(\mu_1-\mu_2\), we cannot conclude at the 98% confidence level that the action durations of these medications differ.

In the next chapter, we will test these types of statements using a specific technique.
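As a cross-check of example 7.13, formula (F7.16) can be sketched in Python using only the standard library. The function name `two_mean_interval` is hypothetical. The z-score is computed with `statistics.NormalDist` rather than read from a table, so it is about 2.326 instead of the rounded 2.33, and the endpoints may differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist

def two_mean_interval(x1, s1, n1, x2, s2, n2, confidence):
    """Large-sample interval for mu_1 - mu_2 using sample standard deviations."""
    diff = x1 - x2                                 # point estimate (F7.11)
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)        # standard error (F7.13)
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2}
    margin = z * se
    return diff, se, (diff - margin, diff + margin)

# Data from table 7.8 (Medication 1 vs. Medication 2), 98% confidence
diff, se, (low, high) = two_mean_interval(280, 30, 60, 272, 20, 50, 0.98)
print(f"difference = {diff}, SE = {se:.2f}")
print(f"98% confidence interval: ({low:.1f}, {high:.1f})")
print("interval contains zero:", low < 0 < high)
```

The final line makes the reasoning of part (c) explicit: the interval contains zero, so the data do not establish a difference.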

Large Sample Estimation of the Difference Between Two Proportions

Now we are ready to devise a technique to estimate the difference between two proportions. Consider two binomial populations with parameters $p_1$ and $p_2$, respectively. As we did above, assume that two samples (one from each population) have been randomly selected from the given populations with the following statistics:

Table 7.8(a)

| | Sample 1 | Sample 2 |
|---|---|---|
| Number of successes | \(x_1\) | \(x_2\) |
| Sample size | \(n_1\) | \(n_2\) |

The corresponding sample proportions can be evaluated respectively as

\(\bar{p}_1 = \frac{x_1}{n_1} \text{ and } \bar{p}_2 = \frac{x_2}{n_2}\)

Consequently, the difference between the sample proportions is

$$
\bar{p}_1-\bar{p}_2=\frac{x_1}{n_1}-\frac{x_2}{n_2}
\qquad \text{(F7.17)}
$$

Considering that the mean of \( \bar{p}_1 - \bar{p}_2 \) is \( p_1 - p_2 \), we can construct the point and interval estimators. The standard error for two samples is evaluated as

$$
SE=\sqrt{\frac{p_1q_1}{n_1}+\frac{p_2q_2}{n_2}}
\qquad \text{(F7.18)}
$$

where \( q_1 = 1 - p_1 \) and \( q_2 = 1 - p_2 \).

If the population proportions are not given, the standard error can be estimated using the sample proportions:

$$
SE=\sqrt{\frac{\bar{p}_1\bar{q}_1}{n_1}+\frac{\bar{p}_2\bar{q}_2}{n_2}}
\qquad \text{(F7.19)}
$$

where \( \bar{q}_1 = 1 - \bar{p}_1 \) and \( \bar{q}_2 = 1 - \bar{p}_2 \).
The sampling distribution of \( \bar{p}_1 - \bar{p}_2 \) can be approximated by the normal distribution using the Central Limit Theorem, if the following conditions are satisfied:

$$
n_1\bar{p}_1>5,\quad
n_1\bar{q}_1>5,\quad
n_2\bar{p}_2>5,\quad
\text{and}\quad
n_2\bar{q}_2>5
\qquad \text{(F7.20)}
$$

Then, the 95% margin of error can be evaluated as

$$
\pm 1.96\,SE=\pm 1.96\sqrt{\frac{\bar{p}_1\bar{q}_1}{n_1}+\frac{\bar{p}_2\bar{q}_2}{n_2}}
\qquad \text{(F7.21)}
$$

Consequently, the 95% interval estimator for \(\left(p_1-p_2\right)\) is

$$
\left(\bar{p}_1-\bar{p}_2\right)\pm1.96\sqrt{\frac{\bar{p}_1\bar{q}_1}{n_1}+\frac{\bar{p}_2\bar{q}_2}{n_2}}
\qquad \text{(F7.22)}
$$

If the conditions (F7.20) are satisfied, a \(\left(1-\alpha\right)100\%\) large sample confidence interval for the difference \(\left(p_1-p_2\right)\) can be constructed as

$$
\left(\bar{p}_1-\bar{p}_2\right)\pm z_{\frac{\alpha}{2}}
\sqrt{\frac{\bar{p}_1\bar{q}_1}{n_1}+\frac{\bar{p}_2\bar{q}_2}{n_2}}
\qquad \text{(F7.23)}
$$

Example 7.14

Between 2015 and 2017, we conducted a research project in Prince Albert (Saskatchewan, Canada) in collaboration with the Prince Albert Grand Council to study the correlation between income and educational attainment in Indigenous and non-Indigenous communities within the city (A. Sardarli, S. Pete, T. Ngamkham, S. Suraphee, A. Volodin, The Determinants of Annual Income in Aboriginal and Non-Aboriginal Communities: Comparative Statistical Analysis, Thailand Statistician, 17(2), 2019, 235-241). The survey showed that 47 out of 95 Indigenous and 56 out of 105 non-Indigenous residents of Prince Albert had permanent full-time jobs.

(a) Construct the point estimator for the difference between two population proportions.
(b) Construct a 90% confidence interval for the difference between two population proportions.
(c) Based on your analysis, can you conclude if there is a difference between these two proportions?

Solution:

(a) It is given that \(x_1=47,\; n_1=95\) and \(x_2=56,\; n_2=105\). First, we need to determine the sample proportions.

\(\bar{p}_1=\frac{47}{95}=0.49\)

\(\bar{p}_2=\frac{56}{105}=0.53\)

Hence,

\(\bar{q}_1=1-0.49=0.51\)

\(\bar{q}_2=1-0.53=0.47\)

The standard error of two proportion samples can be estimated using the formula (F7.19).

$$
SE=\sqrt{\frac{\bar{p}_1\bar{q}_1}{n_1}+\frac{\bar{p}_2\bar{q}_2}{n_2}}
=\sqrt{\frac{0.49\cdot0.51}{95}+\frac{0.53\cdot0.47}{105}}
=0.07
$$

Therefore, the point estimate for the difference \(\left(p_1-p_2\right)\) is \(0.49-0.53=-0.04\) with the standard error of \(0.07\).

(b) Note that conditions (F7.20) are satisfied:

$$
n_1\bar{p}_1=95\cdot0.49=46.55>5,\quad
n_1\bar{q}_1=95\cdot0.51=48.45>5
$$

$$
n_2\bar{p}_2=105\cdot0.53=55.7>5,\quad
n_2\bar{q}_2=105\cdot0.47=49.4>5
$$

Therefore, we can apply the normal distribution approximation to construct the confidence interval using (F7.23). For the 90% confidence interval, [latex]z_{\alpha/2}=1.645[/latex]. Therefore, the interval can be constructed as follows:

$$
\left(\bar{p}_1-\bar{p}_2\right)\pm z_{\alpha/2}
\sqrt{\frac{\bar{p}_1\bar{q}_1}{n_1}+\frac{\bar{p}_2\bar{q}_2}{n_2}}
\Rightarrow -0.04\pm1.645(0.07)
\Rightarrow -0.04\pm0.12
$$

or

$$
-0.16 < p_1 - p_2 < 0.08
$$

(c) The inequality above shows that the interval for [latex]p_1-p_2[/latex] extends from a negative number to a positive number, so it contains zero. Similar to the previous example, we cannot conclude at the 90% confidence level that the proportions of permanent full-time employees differ between Indigenous and non-Indigenous communities in Prince Albert.
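The calculation in example 7.14 can be sketched as follows. The function name `two_proportion_interval` is hypothetical, and the code is an illustration under the chapter's assumptions rather than a definitive implementation. It checks conditions (F7.20) before applying (F7.23). Because it uses unrounded proportions and a computed z-score, its output can differ from the hand calculation in the last digit.

```python
import math
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, confidence):
    """Large-sample interval for p_1 - p_2 from success counts."""
    p1, p2 = x1 / n1, x2 / n2
    q1, q2 = 1 - p1, 1 - p2
    # Conditions (F7.20): every expected count must exceed 5
    if min(n1 * p1, n1 * q1, n2 * p2, n2 * q2) <= 5:
        raise ValueError("normal approximation conditions (F7.20) not met")
    se = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)    # standard error (F7.19)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

# Survey counts from example 7.14: 47 of 95 and 56 of 105, 90% confidence
diff, se, (low, high) = two_proportion_interval(47, 95, 56, 105, 0.90)
print(f"difference = {diff:.3f}, SE = {se:.3f}")
print(f"90% confidence interval: ({low:.3f}, {high:.3f})")
```

Raising an error when (F7.20) fails is one possible design choice; it prevents the normal approximation from being used silently where it is not justified.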

Small Sample Interval Estimation for Two Means

As was mentioned, data collection is a challenging task. Often, we are not able to select large samples and use the normal distribution approach. Earlier in this chapter we explained how t-distribution can be used for point and interval estimation of small sample mean if the population is normal. Now readers will learn how t-distribution can be applied to construct a confidence interval to estimate the difference between the means of two normal populations using small-size samples.

Assume that two samples randomly selected from two normal populations are described by table 7.7, where [latex]n_1<30 \text{ and } n_2<30[/latex]. We have to note that the assumption of normality is very important. There are special procedures to check the normality of populations, which are beyond the scope of this textbook. Interested readers can find many helpful resources on this topic (for instance, I.R. Savage, Nonparametric Statistics, Journal of the American Statistical Association, 52 (279), 1957, 331–344). The procedure below also pools the two sample variances, which additionally assumes that the two populations have equal variances. The [latex]100(1-\alpha)\%[/latex] confidence interval for the difference between two population means [latex]\left(\mu_1-\mu_2\right)[/latex] using small samples can be constructed as follows:

$$
\left(\bar{x}_1-\bar{x}_2\right)\pm t_{\alpha/2}
\sqrt{s^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}
\qquad \text{(F7.24)}
$$

where

$$
s^2=\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}
\qquad \text{(F7.25)}
$$

The quantity defined by formula (F7.25) is called the pooled sample variance. The denominator of the quotient on the right side of the formula is the number of degrees of freedom, df.

Example 7.15

A statistics professor teaches the STAT 100 course in two sections: section 1 in the in-classroom format and section 2 in the online format. He decided to estimate the difference between the mean mid-term exam scores (maximum score is 40) in the two sections. He randomly selected 10 exam papers from section 1 and 8 exam papers from section 2 and calculated the sample means and standard deviations presented in table 7.9.

Table 7.9

| | Section 1 | Section 2 |
|---|---|---|
| Mean of scores (out of 40) | 30 | 24 |
| Standard deviation (out of 40) | 4 | 6 |
| Sample size | 10 | 8 |

(a) Construct a 99% confidence interval for the difference between two mean scores of the two sections.

(b) Based on your analysis, can you conclude if there is a difference between the mid-term exam performances of in-classroom and online students?

Solution:

We assume that both populations are normal.

(a) Let’s first determine the pooled sample variance:

$$
s^2=\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}
=\frac{(10-1)\cdot4^2+(8-1)\cdot6^2}{10+8-2}
=24.75\approx 25
$$

The number of degrees of freedom is [latex]df=10+8-2=16[/latex]. For the 99% confidence level, the corresponding t-score is [latex]t_{0.005}=2.921[/latex]. Now we can construct the confidence interval:

$$
(30-24)\pm 2.921 \sqrt{25 \left(\frac{1}{10} + \frac{1}{8}\right)}
\Longrightarrow 6 \pm 6.9
$$

or

$$
-0.9 < \mu_1 - \mu_2 < 12.9
$$

(b) The inequality above shows that the interval for the difference of the means extends from a negative number to a positive number, so it contains zero. Therefore, at the 99% confidence level we cannot conclude that the mean scores of in-classroom and online students differ.
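Formulas (F7.24) and (F7.25) can be sketched in Python as shown below. The function name `pooled_t_interval` is hypothetical, and the critical value is not computed: it is passed in from the t-table (2.921 for df = 16 at 99% confidence, as in the solution). The sketch relies on the same normality assumption as the example.

```python
import math

def pooled_t_interval(x1, s1, n1, x2, s2, n2, t_crit):
    """Small-sample interval for mu_1 - mu_2 with a pooled variance.

    t_crit must be looked up for df = n1 + n2 - 2; it is not computed here.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df   # (F7.25)
    margin = t_crit * math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = x1 - x2
    return pooled_var, df, (diff - margin, diff + margin)     # (F7.24)

# Data from table 7.9 with t_0.005 = 2.921 for df = 16
pooled_var, df, (low, high) = pooled_t_interval(30, 4, 10, 24, 6, 8, 2.921)
print(f"pooled variance = {pooled_var}, df = {df}")
print(f"99% confidence interval: ({low:.1f}, {high:.1f})")
```

The sketch keeps the unrounded pooled variance; after rounding to one decimal place the endpoints agree with the hand calculation.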

Matched Pairs

In example 7.15, we compared the test results of two populations: students taking a statistics course in two different formats, online and in-classroom. One has to admit that the students’ performances are affected not only by the format; their results may depend on many other factors as well, such as the accessibility of course materials, living conditions, time availability, etc. For a more robust estimation of the effect of the teaching format, it would be ideal to eliminate the role of these factors. To conduct these types of estimations, researchers use the so-called matched pairs design, in which participants are arranged in pairs (one participant from each population) with similar relevant characteristics. Examples include pairs of identical twins, or patients before and after a particular treatment. We randomly select the pairs of participants to estimate the difference between the populations. In fact, the differences between the paired measurements become the data. If the sample size exceeds 30, we can use the normal approximation to estimate the difference between the means of the two populations. Consequently, we estimate the mean of the differences,

[latex]\mu_d = \mu_1 - \mu_2[/latex]

Assume that [latex]n[/latex] matched pairs are selected for the data set,

$$
\left\{(a_1, b_1),\ (a_2, b_2),\ \ldots,\ (a_n, b_n)\right\}
$$

and the differences are

$$
\left\{d_i = a_i - b_i\right\}
$$

where [latex]i = 1, 2, \ldots, n[/latex].

The mean and the standard deviation of this set can be evaluated as

$$
\bar{d} = \frac{\sum d_i}{n}
\qquad \text{(F7.26)}
$$

and

$$
s_d = \sqrt{\frac{\sum d_i^2 - \frac{1}{n} \left(\sum d_i\right)^2}{n-1}}
\qquad \text{(F7.27)}
$$

respectively.

Above, we explained how to construct the estimator and confidence interval for the given data. The same strategy can be applied to the set of differences: \(\{d_i = a_i - b_i\}\). Hence, \(\bar{d} = \frac{\sum d_i}{n}\) can be considered as the point estimator of \(\mu_d = \mu_1 - \mu_2\). Consequently, the \(100(1-\alpha)\%\) confidence interval for the paired difference samples can be constructed as

$$
\bar{d} \pm z_{\alpha/2} \frac{s_d}{\sqrt{n}}
\qquad \text{(F7.28)}
$$

As has been mentioned many times, in reality, often we need to deal with small samples. If the population of differences is normally distributed, the \(100(1-\alpha)\%\) confidence interval for a small-size (\(n<30\)) sample of paired differences will be constructed using the t-distribution:

$$
\bar{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}}
\qquad \text{(F7.29)}
$$

Example 7.16

Farmer Randy from Saskatchewan receives offers from feed suppliers for organic and genetically modified (GMO) feed and decides to check which diet (organic or GMO) is more effective at helping his sheep gain weight. Using the knowledge obtained during his STAT 100 class at the First Nations University of Canada, he designed matched pairs to conduct the analysis. Randy had 5 pairs of 40- to 60-day-old twin sheep weighing 20–25 kilograms. One sheep in each pair was fed organic food, the other GMO food. After 30 days, Randy recorded his measurement results.

Table 7.10

| Pair | Weight gain within a month, organic food-fed sheep (kg) | Weight gain within a month, GMO-fed sheep (kg) | Gain difference, \(d_i\) (kg) | \(d_i^2\) |
|---|---|---|---|---|
| 1 | 7.5 | 8.1 | 7.5 - 8.1 = -0.6 | 0.36 |
| 2 | 8.6 | 10.1 | 8.6 - 10.1 = -1.5 | 2.25 |
| 3 | 9.1 | 8.8 | 9.1 - 8.8 = 0.3 | 0.09 |
| 4 | 9.0 | 8.8 | 9.0 - 8.8 = 0.2 | 0.04 |
| 5 | 7.9 | 9.3 | 7.9 - 9.3 = -1.4 | 1.96 |
| Total | | | -3.0 | 4.70 |

Let’s construct a 98% confidence interval to estimate the gain difference for the pairs provided in table 7.10. First, we will determine the mean and the standard deviation of the gain difference using formulae (F7.26) and (F7.27), respectively.

$$
\bar{d} = \frac{-3.0}{5} = -0.6
$$

$$
s_d = \sqrt{\frac{4.7 - \frac{1}{5} \cdot (-3.0)^2}{5-1}} = \sqrt{\frac{4.7-1.8}{4}} = 0.85
$$

Assuming that the population distribution is normal and considering that the sample size is less than 30, we can use formula (F7.29) to construct the confidence interval for [latex]df = 5 - 1 = 4[/latex]. For the 98% confidence level, [latex]t_{\alpha/2}=t_{0.01}=3.747[/latex].

$$
\bar{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}} \\
= -0.6 \pm 3.747 \cdot \frac{0.85}{\sqrt{5}} \\
= -0.6 \pm 1.4
$$

or

$$
-2.0 < \mu_d < 0.8
$$

Again, as in the previous example, the interval contains zero. By inspection, we can conclude that, at the 98% confidence level, the data do not show that the sheep’s weight gain depends on whether they were fed organic or GMO food.
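The matched pairs procedure of (F7.26), (F7.27), and (F7.29) can be sketched in Python directly from the five pairs in table 7.10. The function name `paired_t_interval` is hypothetical. The critical value 3.747 (df = 4, 98% confidence) is taken from the t-table rather than computed, and the population of differences is assumed to be normal.

```python
import math
import statistics

def paired_t_interval(a, b, t_crit):
    """Small-sample interval for mu_d from matched pairs (a_i, b_i).

    t_crit must be looked up for df = n - 1; it is not computed here.
    """
    diffs = [x - y for x, y in zip(a, b)]     # d_i = a_i - b_i
    n = len(diffs)
    d_bar = statistics.mean(diffs)            # (F7.26)
    s_d = statistics.stdev(diffs)             # (F7.27), divides by n - 1
    margin = t_crit * s_d / math.sqrt(n)
    return d_bar, s_d, (d_bar - margin, d_bar + margin)   # (F7.29)

# Weight gains in kilograms from table 7.10
organic = [7.5, 8.6, 9.1, 9.0, 7.9]
gmo = [8.1, 10.1, 8.8, 8.8, 9.3]
d_bar, s_d, (low, high) = paired_t_interval(organic, gmo, 3.747)
print(f"mean difference = {d_bar:.2f} kg, s_d = {s_d:.2f} kg")
print(f"98% confidence interval: ({low:.2f}, {high:.2f})")
print("interval contains zero:", low < 0 < high)
```

Computing the differences from the raw pairs, instead of typing in the column totals, guards against arithmetic slips when summing the table by hand.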

Chapter 7 Summary

• Sample estimation
• Point estimation
o Point estimation of mean for large samples
o Point estimation of proportion for large samples
• Interval estimation
o Interval estimation of mean for large samples
o Interval estimation of proportion for large samples
• Minimum sample size for given accuracy of estimation
• Student’s t-distribution
• Interval estimation of mean for small samples
• Large sample estimation of the difference between two means
• Large sample estimation of the difference between two proportions
• Small sample interval estimation for two means
• Matched pairs


EXERCISES

Point estimation

1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The difference between the point estimate, such as the sample mean, and the value of the population parameter it estimates, such as population mean, is known as the

a. Confidence level
b. Sampling error
c. Parameter estimate
d. Interval estimate
e. None of the above answers are correct.

2. (Introductory Business Statistics, Holmes, A., Illowsky, B., Dean, S., Openstax, 2017) According to a Field Poll, 79% of California adults (actual results are 400 out of 506 surveyed) feel that “education and our schools” is one of the top issues facing California. We wish to construct a 90% confidence interval for the true proportion of California adults who feel that education and the schools is one of the top issues facing California. A point estimate for the true population proportion is:

a. 0.90
b. 1.27
c. 0.79
d. 400

3. (Introductory Business Statistics, Holmes, A., Illowsky, B., Dean, S., Openstax, 2017) Five hundred and eleven (511) homes in a certain southern California community are randomly surveyed to determine if they meet minimal earthquake preparedness recommendations. One hundred seventy-three (173) of the homes surveyed met the minimum recommendations for earthquake preparedness, and 338 did not.
The point estimate for the population proportion of homes that do not meet the minimum recommendations for earthquake preparedness is ______.

a. 0.6614
b. 0.3386
c. 173
d. 338

4. (Introductory Business Statistics, Holmes, A., Illowsky, B., Dean, S., Openstax, 2017) The American Community Survey (ACS), part of the United States Census Bureau, conducts a yearly census similar to the one taken every ten years, but with a smaller percentage of participants. The most recent survey estimates with 90% confidence that the mean household income in the U.S. falls between $69,720 and $69,922. Find the point estimate for mean U.S. household income and the error bound for mean U.S. household income.

5. (Introductory Business Statistics, Holmes, A., Illowsky, B., Dean, S., Openstax, 2017) A national survey of 1,000 adults was conducted on May 13, 2013 by Rasmussen Reports. It concluded with 95% confidence that 49% to 55% of Americans believe that big-time college sports programs corrupt the process of higher education. Find the point estimate and the error bound for this confidence interval.

Confidence interval estimation

1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) An estimate of a population parameter that provides an interval of values believed to contain the value of the parameter is known as the

a. Confidence level
b. Interval estimate
c. Parameter value
d. Population estimate
e. None of the above answers are correct

2. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) If an interval estimate is said to be constructed at the 90% confidence level, the confidence coefficient would be

a. 0.1
b. 0.95
c. 0.9
d. None of the above answers is correct

3. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) In developing an interval estimate, if the population standard deviation is unknown

a. It is impossible to develop an interval estimate
b. The standard deviation is arrived at using historical data
c. The sample standard deviation can be used
d. None of the above answers are correct

4. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The confidence associated with an interval estimate is called

a. Significance
b. Degree of association
c. Confidence level
d. Precision
e. None of the above answers are correct

5. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) A probability statement about sampling error is known as

a. Confidence
b. Precision
c. Interval
d. Error
e. None of the above answers are correct

Large sample estimation. How large is large?

1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) A sample of 100 elements from a population is selected, and the standard deviation of the sample is computed. For interval estimation of μ, the proper distribution to use is

a. Normal distribution
b. T distribution with 100 degrees of freedom
c. T distribution with 99 degrees of freedom
d. None of the above answers are correct

2. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The z value for 97.8% confidence interval estimation is

a. 2.02
b. 1.96
c. 2.00
d. 2.29
e. None of the above answers is correct.

3. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) As the sample size increases, the sampling error

a. Increases
b. Decreases
c. Stays the same
d. None of the above answers are correct

4. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) A 95% confidence interval for the population mean is determined to be 100 to 120. If the confidence coefficient is reduced to 0.90, the interval for μ

a. Becomes narrower
b. Becomes wider
c. Does not change
d. Becomes 0.1
e. None of the above answers are correct.

5. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) In general, higher confidence levels provide

a. Wider confidence intervals
b. Narrower confidence intervals
c. A smaller standard error
d. Unbiased estimates
e. None of the above answers are correct

Large sample estimation of mean

1. A sample of 200 banks, trust companies and credit unions was made across Canada in order to determine how many hours a week they are open to the public.  The mean number of hours was 29.1 with a standard deviation of 3.7 hours.

a. Calculate the 95% confidence interval for the mean opening hours.
b. Calculate the 99% confidence interval.

2. A sample of forty individuals were tested to determine how long it took them to get through a maze. The average time was 133 seconds, with a standard deviation of 43 seconds. Calculate the 99% confidence interval for the appropriate statistic.

3. A sample of 200 banks, trust companies and credit unions was made across Canada in order to determine how many hours a week they are open to the public.  The mean number of hours was 33.1 with a standard deviation of 3.7 hours.  Calculate the 95% and 99% confidence intervals.

4. Zoos around the world cooperated in an effort to measure the weight of pandas in captivity.  A sample of 100 pandas yielded a mean weight of 360 pounds with a standard deviation of 10 pounds.  Calculate the 95% confidence interval.

5. A group of students working on a summer project with a social service agency surveyed a random sample of 150 families in a poor neighbourhood.  The students were interested in estimating the average family income in this neighbourhood.  The mean income was $3,820 with a standard deviation of $860.  Calculate the 99% confidence interval.

6. A sample of 64 bolts is measured.  The average length of the bolts is 34.5 millimeters.  The sample standard deviation is 0.8 millimeters.  Calculate the 95% confidence interval for the mean.

7. Using the following data, calculate the confidence intervals:

| | 1-α | Sample mean | N | Standard dev. |
|---|---|---|---|---|
| a) Systolic BP | .95 | 122 | 61 | 11 |
| b) Serum cholesterol | .95 | 177 | 51 | 21 |
| c) Income (000) | .95 | 48 | 91 | 12.8 |

Large sample estimation of proportion

1. A random sample of automobile owners was asked if they prefer small cars to larger ones. 240 of the 400 sampled stated they preferred smaller cars. Calculate the 95% confidence interval for the proportion who prefer smaller cars.

2. 1200 individuals were polled across Canada. 120 stated that they were against gambling on moral grounds. Estimate the proportion of the population that are against gambling on moral grounds.

a. Calculate the 95% confidence interval.

b. Calculate the 99% confidence interval.

3. A sample of 400 diabetics revealed that 310 were on insulin treatments. Calculate the 99% confidence interval for the proportion of individuals who were on insulin treatments.

4. A total of 166 individuals are sampled and asked if they are in favour of legalizing cannabis. 97 state they are in favour.  Calculate the 95% confidence interval for the proportion in favour of legalization.

5. A forest in eastern Canada is investigated to determine the effects of acid rain. A sample of 67 trees is made, and it is found that 28 are dying due to acid rain.  Calculate the 99% confidence interval for the proportion of trees dying due to acid rain.

6. A peony plant with red petals was crossed with another plant having streaky petals. A geneticist collected and germinated a sample of 100 seeds from this cross and 58 plants had red petals. Construct a 98% confidence interval for the true proportion of offspring that will have red petals.

Minimum sample size for accurate estimation

1. A quality control engineer wants to determine what proportion of defective parts are coming off the assembly line. Past experiments, based on large sample sizes, have shown this proportion to be 0.19. What sample size does the engineer need in order to estimate, with 90% confidence, this proportion with a margin of error of 0.12? Justify your conclusion.

2. (a) A breakfast food manufacturer wants to do a market survey to estimate the proportion of households that consume Quinoa on a regular basis. (Quinoa is an ancient grain made from the Peruvian Goosefoot plant). How large a sample should they use if they want their estimate to be accurate to within 3% with 95% confidence?

(b) Upon completion of your Quinoa study, you collected a sample of size 500. Of the 500, 123 indicated that they consume Quinoa on a regular basis, and the rest did not, or had no idea what Quinoa was. Calculate the 90% confidence interval for the proportion from this sample.

3. A researcher wants to estimate the mean number of days that patients stay at Wascana Hospital. The researcher will take a sample of patients and estimate the mean number of days from this sample.
    a) The researcher wants the mean to be accurate to within 2 days, with a 95% confidence interval. In a previous study the sample standard deviation was 18 days. What should the sample size be?
    b) Assume that no previous study exists. However, it is known that the shortest duration of stay is 7 days, and the longest is 406 days. Estimate the sample size required for a 95% confidence interval. The mean should be accurate to within 2 days.
    c) The researcher wants the mean to be accurate to within 1 day, with a 99% confidence interval. In a previous study the sample standard deviation was 25 days. What should the sample size be?

4. A researcher wants to estimate the mean monthly mortgage payments in the town. The researcher wants to select a large enough sample so that she will be 95% confident that the mean mortgage payment is accurate to within $30 of the population mean. The researcher knows that all mortgage payments in the town are between $680 and $1,280. How large should the sample be?

5. A researcher plans to survey Regina residents regarding their smoking behaviour. She wants to determine how many adults smoke. She wants her survey to be accurate to within 0.02, with a confidence level of 95%.
    a) The researcher is certain that the proportion is less than 0.40, but will give herself a safety margin of .05 when calculating the required sample size. How large will her sample be?
    b) She also wants to carry out a separate survey of 16-year-old students in high schools. She wants this survey to be accurate to within .04 with a confidence level of 99%. She is certain that the proportion of 16-year-old students who smoke is less than 30%, but will give herself a safety margin of .05 when calculating the required sample size. How many students should she survey?

6. A survey of marriages done in 1988 showed that the average age at which people got married was 25.2, with a standard deviation of 2.5 years. A researcher wants to estimate the current age at which people get married. The researcher wants the answer to be accurate to within 1 year, with a confidence level of 99%. How large should the sample be?

7. A loan officer wants to sample post-secondary students to determine what the average student loan is. The officer wants the average to be accurate to within $75 with a 95% level of confidence. A similar survey carried out in the past showed that the standard deviation for student loans was $500. How many students should be sampled to achieve the desired level of accuracy?

Student’s t-Distribution

  1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The t distribution is applicable whenever

a) The sample is considered small (n <30).

b) The population is normal and the sample standard deviation is used to estimate the population standard deviation

c) Both a and b

d) None of the above answers are correct.

  2. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) As the number of degrees of freedom for a t distribution increases, the difference between the t distribution and the standard normal distribution

a) Becomes larger

b) Becomes smaller

c) Stays the same

d) None of the above answers are correct

  3. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) A sample of 20 items from a population with unknown σ is selected in order to develop an interval estimate of μ. Which of the following is not necessary?

a) We must assume the population has a normal distribution

b) We must use a t distribution

c) Sample standard deviation must be used to estimate σ.

d) The sample must have a normal distribution

e) All of the above are necessary.

  4. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) When constructing a confidence interval for the population mean and a small sample is used, the degrees of freedom for the t-distribution equals

a) n-1

b) n

c) 29

d) 30

e) None of the above answers are correct

  1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The t value for 95% confidence and 24 degrees of freedom is

a) 1.711

b) 2.064

c) 2.492

d) 2.069

e) None of the above answers are correct

  1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) From a population which is normally distributed, a sample of 25 elements is selected and the standard deviation of the sample is computed. For interval estimation of \(\mu\), the proper distribution to use is the

a) Normal distribution

b) t distribution

c) t distribution with 26 degrees of freedom

d) t distribution with 24 degrees of freedom

e) none of the above answers are correct.

Small-sample estimation of the mean

  1. The following data are the ages of a sample of children taken into care by social services:

7            14          3            9            11          2

 

Calculate the 95% confidence interval for the mean age.
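One way to check a small-sample interval such as this one is sketched below. It assumes the population is approximately normal and uses the interval x̄ ± t·s/√n with n − 1 degrees of freedom. The critical value is typed in from a t table rather than computed, and the helper name `t_interval` is hypothetical.

```python
import math
from statistics import mean, stdev


def t_interval(data, t_crit):
    """Confidence interval for the mean: x_bar +/- t_crit * s / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


ages = [7, 14, 3, 9, 11, 2]
# Assumption: value read from a t table for 95% confidence and n - 1 = 5 degrees of freedom
T_95_DF5 = 2.571
low, high = t_interval(ages, T_95_DF5)
print(f"({low:.2f}, {high:.2f})")
```

The same function can be reused for the other exercises in this group by changing the data and looking up the critical value for the required confidence level and degrees of freedom.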

 

  1. After a chemical spill on the Ring Road, water samples are taken from a number of homes and checked for the level of pesticides. The following are the concentrations of the pesticide found in the survey of homes:

22          15          13          27          16          16          19          24          18          23          20

Calculate the 95% confidence interval for the mean concentration.

  1. Researchers wanted to check what the average cost of a prescription was at Saskatchewan pharmacies. The following are the costs from a sample of pharmacies:

27.35    30.00    25.65    28.50    33.50

Calculate the 99% confidence interval for the mean cost.

  1. Twenty movies were sampled at random, and the number of violent acts in each was measured. The average number per movie was 21.7, with a standard deviation of 4.3. Calculate the 99% confidence interval for the mean number of violent acts per movie.
  2. The following are the ages of a sample of children in a daycare:

3            6            5            4            4            4            5            3

Calculate the 95% and 99% confidence intervals for the mean age.

  1. The recovery times (in days) for a sample of 12 patients who underwent surgery are:

3.9         4.2         4.2         3.8         3.6         4.4         4.1         3.4         4.0         3.5         3.7         4.2

 

Calculate the 80% confidence interval for the average recovery time.

Comparison Analysis of two large samples using confidence intervals. Means and Proportions

1. Many studies have been conducted to test the effects of marijuana use on mental abilities. In one such study, groups of light and heavy users of marijuana in a university were tested for memory recall, with the following results:

Items sorted correctly by light marijuana users:
\[
n = 64, \quad \bar{x} = 53.3, \quad s = 3.6
\]

Items sorted correctly by heavy marijuana users:
\[
n = 64, \quad \bar{x} = 51.3, \quad s = 4.5
\]

Construct a \(98\%\) confidence interval for the difference between the two population means.
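A result for this exercise can be checked numerically. The sketch below assumes independent large samples, so that the interval is (x̄₁ − x̄₂) ± z·√(s₁²/n₁ + s₂²/n₂); the function name `diff_means_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def diff_means_interval(x1, s1, n1, x2, s2, n2, confidence):
    """Large-sample CI for mu1 - mu2 from summary statistics."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    diff = x1 - x2
    return diff - z * se, diff + z * se


# Light users minus heavy users, 98% confidence
low, high = diff_means_interval(53.3, 3.6, 64, 51.3, 4.5, 64, 0.98)
print(f"({low:.3f}, {high:.3f})")
```

If the resulting interval lies entirely on one side of zero, the data are consistent with a real difference between the two population means at the chosen confidence level.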

2. Suppose that a random sample of 200 entering students in 1989 showed that 144 were enrolled late. Another random sample of 100 entering students in 1999 showed that 66 were enrolled late.

a) Find a point estimate for P₁ − P₂
b) Find a 90% confidence interval for the difference between two proportions.
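Both parts of this exercise can be checked with a few lines of code. The sketch below assumes independent large samples and the unpooled standard error √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂); the function name `diff_props_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def diff_props_interval(successes1, n1, successes2, n2, confidence):
    """Point estimate and large-sample CI for p1 - p2."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)


# 1989 sample minus 1999 sample, 90% confidence
point, (low, high) = diff_props_interval(144, 200, 66, 100, 0.90)
print(f"point estimate: {point:.2f}, interval: ({low:.4f}, {high:.4f})")
```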

3. Analyses of drinking water samples for 100 homes in each of two different sections of a city gave the following means and standard deviations of lead levels.

| | Section 1 | Section 2 |
|---|---|---|
| Sample size (n) | 100 | 100 |
| Mean (x̄) | 34.1 | 34.1 |
| Standard deviation (s) | 5.9 | 5.9 |

a) Find a point estimate for \(\mu_1 - \mu_2\)

b) Find a 98% confidence interval for \(\mu_1 - \mu_2\)

c) How do you explain this interval?

4. Psychologists have made extensive studies on the relationship between child abuse and later criminal behaviour. A study consisted of the follow-ups of 52 boys who were abused in their preschool years and 67 boys who were not abused. The data on the number of criminal offenses of those boys in their teens yielded the following summary statistics:

| | Abused | Nonabused |
|---|---|---|
| Mean number of criminal offenses | 2.52 | 1.63 |
| Standard deviation of criminal offenses | 1.84 | 1.22 |

a) Determine a 99% confidence interval for the difference between the true means for these two groups.

5. A parent believes the average height for 14-year-old girls differs from that of 14-year-old boys. The summary data are listed below.

| | Boy | Girl |
|---|---|---|
| Sample size | 40 | 40 |
| Sample mean | 155 cm | 146 cm |
| Sample standard deviation | 6.1 cm | 9.1 cm |

Construct a 90% confidence interval for the difference between the true mean height of 14-year-old girls and boys.

6. Independent random samples were taken in order to determine information about the longevity of two different winglets. A sample of 55 was taken for part A and had a mean and variance of 80 and 40, respectively. A sample of 85 was taken for part B and had a mean and variance of 44 and 60, respectively. Find a 98% confidence interval for the difference in population means.

7. A researcher wished to estimate the difference between the proportion of users of two shampoos who will never switch to another shampoo. In a sample of 400 users of Shampoo A taken by this researcher, 78 said they would never switch to a new shampoo. In another sample of 500 users of Shampoo B taken by the same researcher, 92 said they would never switch to a new shampoo. Construct a 90% confidence interval for the true difference between the two population proportions.

Comparison Analysis of two small samples using confidence intervals. Means

1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The following are the test scores of two samples of students from University A and University B on a national Biology examination. Develop an interval estimate of the difference between the mean scores of the two populations at the 5% level of significance. Assume the populations are normally distributed and have equal variances.

University A Scores University B Scores
82 75
90 80
65 60
83 90
80 75
70

2. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) Scores of samples of students on a standardized test from two universities are given below.

| | UA | UB |
|---|---|---|
| Sample size | 14 | 12 |
| Average test score | 80 | 84 |
| Variance | 64 | 100 |

Provide a 98% confidence interval estimate for the difference between the test scores of the two universities.

3. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The following shows the number of semester hours taken by random samples of day and evening students.

Day Evening
15 9
8 12
9 9
12 10
16

Develop a 95% confidence interval estimate for the difference between the mean semester hours taken by the two groups of students. Assume the populations are normally distributed and have equal variances.

4. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) Read Corporation advertises that their program will significantly increase the number of words a person can read in a minute. A random sample of 10 people was taken. You are given the per minute reading scores before and after the program.

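| Person | Words per minute (before program) | Words per minute (after program) |
|---|---|---|
| A | 293 | 282 |
| B | 276 | 270 |
| C | 300 | 300 |
| D | 250 | 265 |
| E | 260 | 250 |
| F | 310 | 310 |
| G | 260 | 240 |
| H | 265 | 245 |
| I | 255 | 260 |
| J | 281 | 278 |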

Construct a 99% confidence interval for the true difference in the reading scores.
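Because each person is measured twice, this is a matched-pairs problem, and the interval is built from the per-person differences. The sketch below assumes the differences are approximately normal, defines each difference as after minus before (the opposite convention only flips the signs), and uses a critical value typed in from a t table. The variable names are hypothetical.

```python
import math
from statistics import mean, stdev

before = [293, 276, 300, 250, 260, 310, 260, 265, 255, 281]
after = [282, 270, 300, 265, 250, 310, 240, 245, 260, 278]

# Matched pairs: reduce the two columns to one sample of differences
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)  # sample standard deviation of the differences

# Assumption: value read from a t table for 99% confidence and n - 1 = 9 degrees of freedom
T_99_DF9 = 3.250
margin = T_99_DF9 * s_d / math.sqrt(n)
low, high = d_bar - margin, d_bar + margin
print(f"mean difference: {d_bar}, interval: ({low:.3f}, {high:.3f})")
```

Working with the differences rather than treating the two columns as independent samples removes the person-to-person variation, which is the point of the matched-pairs design.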

5. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) MNM Corporation is trying to determine whether to purchase Machine A or B. It has leased the two machines for a month. A random sample of 5 employees has been taken. These employees have gone through a training session on both machines. Below you are given information on their productivity rate on both machines.

 

| Person | Productivity rate, Machine A | Productivity rate, Machine B |
|---|---|---|
| 1 | 47 | 52 |
| 2 | 53 | 58 |
| 3 | 50 | 47 |
| 4 | 55 | 60 |
| 5 | 45 | 53 |

Construct a 90% confidence interval for the true difference between the machines.

License

Introductory Statistics Copyright © 2026 by Arzu Sardarli and Andrei Volodin is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.