Chapter 6. Sampling Distributions

Chapter Objectives

In this chapter, readers will learn to do the following:

• Specify sampling methods
• Define and evaluate expected value and standard deviation of the sample mean
• State and apply the Central Limit Theorem
• Define and evaluate expected value and standard deviation of the sample proportion when normal approximation is applicable

In the previous chapters, we mainly analyzed data distributions of various populations. However, we noted that analyzing an entire population (i.e., conducting a census) is often complicated and sometimes physically impossible. In these cases, we collect a sample that represents the population. Sample analysis is also less expensive and may save significant time and resources.

6.1 Sampling Methods

Sampling is one of the crucial branches of statistics. Various sampling techniques have been developed to solve different kinds of statistical problems. This book will briefly review some of them: simple random sampling, systematic sampling, and voluntary sampling.

Simple Random Sampling

A simple random sample is a randomly selected subset of the population such that every member of the population is equally likely to be selected. This method is straightforward, which makes it the most popular design of random sampling. Therefore, this book will mainly focus on simple random sampling and on conducting statistical analysis of provided data sets.

Example 6.1

(a) Indigenous language knowledge

Assume that we need to determine the proportion of students at the First Nations University of Canada who speak an Indigenous language. It is not practical and sometimes it is not physically possible to ask each student about his/her knowledge of Indigenous languages. Instead, we can randomly select a certain number of students and find the required proportion for this sample.

(b) Election forecast

The interest in predicting election results is as old as elections themselves. In the days leading up to an election, many institutions attempt to predict the results of upcoming elections of heads of state and parliaments. Scholars have developed various election forecasting techniques. Most of them are based on simple random sampling.

(c) Animal migration tracking

Wildlife biologists use animal migration tracking to study the seasonal behaviour of animals. One of the classic techniques is placing an ID tag on the animal's body to identify the animal in a future capture-and-release. However, modern technologies allow randomly selected animals to be tracked almost continuously.

To conduct simple random sampling, one has to assign a number to each member of the population and use a random number generator to determine which individuals will be measured. Many software packages include a random number generator.

Example 6.2

A landlord of an apartment house with 124 apartments needs to get tenants' opinions about the quality of services her company provides. According to the landlord's report, one tenant lives in each apartment. Due to the lack of time, she contacted 10 randomly selected tenants. Let's use the Excel program to help this landlord.

Solution:

This problem can be solved using various features of Excel. Here is a suggested option with the "RANDARRAY" function, which creates an array of random numbers. To use this function, we specify the number of rows and columns to fill, the minimum and maximum values, and whether to return integers or decimal values. Since the apartments are numbered from 1 to 124, we set the minimum value to 1 and the maximum value to 124. Because some entries of the generated array may repeat, it is safe to generate more entries than we need, say 50. This number of entries can be obtained with various array dimensions; in our example, we set the number of rows to 10 and the number of columns to 5. Finally, we select "TRUE" for "integer" to get integers, so the formula reads =RANDARRAY(10, 5, 1, 124, TRUE). Here is one of the generated arrays.

22   88   49   109  44
49   121  28   50   86
92   38   36   36   124
84   9    18   58   13
62   7    25   114  47
20   8    71   34   48
110  41   57   90   42
26   91   95   95   6
32   35   23   71   106
88   96   26   56   104

Now, we can pick the numbers starting from the entry at the top-left corner and moving along the rows. Since each apartment number is unique, we must skip any repeated entry (here, the second 49). Using this approach, we obtain the following numbers:

22, 88, 49, 109, 44, 121, 28, 50, 86, 92

Therefore, the landlord can contact the tenants living in apartments 22, 88, 49, 109, 44, 121, 28, 50, 86, and 92.

Note: RANDARRAY generates new numbers every time the worksheet is recalculated, even if the arguments stay the same.
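The selection in Example 6.2 can also be sketched in Python. This is only an illustration, not the Excel procedure described above: it assumes Python's standard `random` module, the helper name `select_apartments` is hypothetical, and the seed value 2024 is an arbitrary choice made so that the run is repeatable. Because `random.sample` draws without replacement, no repeated apartment numbers need to be skipped.

```python
import random

def select_apartments(n_apartments, sample_size, seed=None):
    """Return a simple random sample of apartment numbers, without repeats."""
    rng = random.Random(seed)  # private generator, so the seed stays local
    # range(1, n_apartments + 1) enumerates apartments 1..n_apartments
    return rng.sample(range(1, n_apartments + 1), sample_size)

# Hypothetical run for the landlord in Example 6.2 (the seed is arbitrary)
chosen = select_apartments(124, 10, seed=2024)
print(sorted(chosen))
```

Omitting the seed gives a different sample on every run, which mirrors the behaviour of RANDARRAY.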

Systematic Sampling

Usually, systematic sampling is used to select observations from a population sorted by a specific characteristic, such as alphabetical order or descending scores. To create a systematic sample, we randomly pick a starting point and then select the sample elements at fixed intervals.

Example 6.3

A general manager of a private call centre needs to estimate the mean duration of a single call handled by her firm in the last month. According to the records, the call centre received 529 calls that month. Due to the lack of time, she contacted 12 callers selected out of 529. This selection can be conducted in various ways. Here is one suggested procedure of systematic sampling.

(1) Assign numbers from 1 to 529 to the names of callers.

(2) Divide the population size by the sample size: 529 ÷ 12 ≈ 44.08. Round this number down to 44.

(3) Randomly select a start point from 1 to 44, endpoints included. The start point can be selected using various software. For example, assume that the software picked 17.

(4) Start from 17 and add 44 at each of the next 11 steps: 17, 61, 105, 149, 193, 237, 281, 325, 369, 413, 457, 501.

Therefore, the general manager needs to determine the mean of time duration of services provided to callers enumerated by 17, 61, 105, 149, 193, 237, 281, 325, 369, 413, 457, and 501.
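The four steps of Example 6.3 can be sketched as a short Python function. The function name `systematic_sample` and its parameters are hypothetical, and the start point is passed in explicitly so that the result matches the example; in practice it would be drawn at random, for instance with `random.randint(1, step)`.

```python
def systematic_sample(population_size, sample_size, start):
    """Pick `sample_size` positions at a fixed interval, beginning at `start`."""
    step = population_size // sample_size  # 529 // 12 = 44, rounded down
    if not 1 <= start <= step:
        raise ValueError("start must lie between 1 and the step, inclusive")
    return [start + k * step for k in range(sample_size)]

# Start point 17 is the value assumed in Example 6.3
callers = systematic_sample(529, 12, start=17)
print(callers)
# [17, 61, 105, 149, 193, 237, 281, 325, 369, 413, 457, 501]
```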

Voluntary Sampling

Sometimes, researchers ask volunteers to participate in their studies. In other words, the sample consists of those members of the population who agree to take part. In statistics, we call this type of selection voluntary sampling; it should not be confused with random sampling. Volunteers can be invited in various ways: in person, by letter, via the Internet, etc.

Example 6.4

In 2008, within our community-based research project, we invited 120 households to answer questions about the water quality in Calling Lakes (A. Sardarli, Use of Indigenous Knowledge in modeling the water quality dynamics in Peepeekisis and Kahkewistahaw First Nations communities, Pimatiswin: A Journal of Aboriginal and Indigenous Community Health 11(1), 2013, 55–63). The residents of 42 households volunteered to answer the survey questions.

Example 6.5

The MEDICO-19 Research Group conducted a web-based survey among undergraduate medical students throughout Indonesia (G. Lazarus, A. Findyartini, A.M. Putera, N. Gamalliel, D. Nugraha, I. Adli, J. Phowira, L. Azzahra, B. Ariffandi, & I.S. Widyahening, Willingness to volunteer and readiness to practice of undergraduate medical students during the COVID-19 pandemic: A cross-sectional survey in Indonesia, BMC Medical Education, 21, 2021). Socio-demographic and social interaction information, willingness to volunteer, and readiness to practise were obtained using a self-reported questionnaire. Among 4,870 participants, 2,374 (48.7%) expressed their willingness to volunteer, while only 906 (18.6%) had adequate readiness to practise.

This example shows that voluntary sampling is not a straightforward procedure and that the selection is not entirely unbiased. Although almost half of the students were willing to volunteer, only 18.6% were adequately ready to practise. This discrepancy occurred because self-estimation of readiness is subjective.

6.2 Expected Value and Standard Deviation of the Sample Mean

After briefly reviewing sampling techniques, we can start using samples to analyze the population. In previous chapters, we learned that measures of the centre and dispersion describe a data set. Therefore, it would be reasonable, first, to find some connections between the means \((\mu \text{ and } \bar{x})\) and standard deviations \((\sigma \text{ and } s)\) of the population and samples that are randomly selected from this population.

It is easier to start with a relatively small population to minimize calculations.

Example 6.6

Assume that you have five Facebook friends who are 33, 27, 30, 30, and 27 years old. First, let's construct all possible random samples of size 3. Since the order does not matter, the number of samples can be determined using the combination function.

$$ C(5,3)=\frac{5!}{3!(5-3)!}=10 $$

Now, we can easily construct the samples.

{33, 27, 30}, {33, 27, 30}, {33, 27, 27}, {33, 30, 30}, {33, 30, 27}, {33, 30, 27}, {27, 30, 30}, {27, 30, 27}, {27, 30, 27}, {30, 30, 27}

Let's determine the means of each sample.

Sample   Ages, \(x_i\)   Sample mean, \(\bar{x}\)
1        33, 27, 30      30
2        33, 27, 30      30
3        33, 27, 27      29
4        33, 30, 30      31
5        33, 30, 27      30
6        33, 30, 27      30
7        27, 30, 30      29
8        27, 30, 27      28
9        27, 30, 27      28
10       30, 30, 27      29

We can consider the selection of each random sample as a simple event and assign the respective value of the mean to each of these events. Probabilities of these events can be determined as relative frequencies of sample means.

Sample mean \(\bar{x}_i\)   Frequency   Relative frequency/probability \(p(\bar{x}_i)\)
28                          2           2/10 = 0.2
29                          3           3/10 = 0.3
30                          4           4/10 = 0.4
31                          1           1/10 = 0.1
Total                       10          1

In chapter 4, we defined the mean and variance of a probability distribution as

$$
\mu=\sum_{i=1}^{N} x_i p(x_i) \qquad (F4.1)
$$

$$
\sigma^2=\sum_{i=1}^{N} (x_i-\mu)^2 p(x_i) \qquad (F4.2)
$$

In our case, in formulae (F4.1) and (F4.2), we must replace \(x_i \text{ by } \bar{x}_i\)

$$
\mu=\sum_{i=1}^{N} \bar{x}_i\, p(\bar{x}_i) \qquad (F6.1)
$$

$$
\sigma^{2}=\sum_{i=1}^{N} (\bar{x}_i-\mu)^{2} p(\bar{x}_i) \qquad (F6.2)
$$

Let's apply these formulae to the means of samples and their variance, considering them as a set of observations.

$$
\mu=\sum_{i=1}^{4} \bar{x}_i p(\bar{x}_i)
= (28)(0.2) + (29)(0.3) + (30)(0.4) + (31)(0.1)
= 29.4
$$

$$
\sigma^2=\sum_{i=1}^{4} (\bar{x}_i-\mu)^2 p(\bar{x}_i)
=(28-29.4)^2(0.2)+(29-29.4)^2(0.3)+(30-29.4)^2(0.4)+(31-29.4)^2(0.1)
=0.84
$$

Consequently, the standard deviation of sample means can be determined as \(\sigma=\sqrt{\sigma^2}=\sqrt{0.84}\approx 0.92\).
The quantity evaluated using the formula (F6.1) is called the expected value. Note that in our example, the population mean is \(\mu=\frac{33+27+30+30+27}{5}=29.4\), which is equal to the expected value of the sample means.
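The calculations of Example 6.6 can be reproduced with a short Python sketch. It assumes only the standard library; the variable names are illustrative and not part of the example. The script lists all samples of size 3, builds the probability distribution of the sample means, and applies formulae (F6.1) and (F6.2).

```python
from collections import Counter
from itertools import combinations
from math import sqrt

ages = [33, 27, 30, 30, 27]   # the five friends from Example 6.6
sample_size = 3

# Every possible sample of size 3 (order ignored) and its mean
samples = list(combinations(ages, sample_size))
means = [sum(s) / sample_size for s in samples]

# Relative frequency of each distinct sample mean
counts = Counter(means)
probabilities = {m: c / len(means) for m, c in sorted(counts.items())}

# Formulae (F6.1) and (F6.2) applied to the sample means
expected_value = sum(m * p for m, p in probabilities.items())
variance = sum((m - expected_value) ** 2 * p for m, p in probabilities.items())

print(len(samples))
print(probabilities)
print(round(expected_value, 2), round(variance, 2), round(sqrt(variance), 2))
```

The printed values should agree with the tables and results above: 10 samples, an expected value of 29.4, a variance of 0.84, and a standard deviation of about 0.92.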

Figure 6.1 presents the relative frequency histogram/probability distribution of sample means.

Figure 6.1. Frequency histogram/probability distribution of sample means constructed from example 6.6

The histogram does not look exactly mound-shaped. In previous chapters, we explained that the mound shape is typical of normally distributed data. For larger sample sizes, the histogram of sample means becomes almost ideally mound-shaped.

6.3 Central Limit Theorem

It can be proved that the expected value of sample means equals the population mean for any sample size. This conclusion is obtained from a statement known as the Central Limit Theorem.

Consider a population with the mean μ and standard deviation σ. If samples of size n are randomly selected from this population, then

(1) The expected value of the sample means is equal to the population mean:

$$
\mu_{\bar{x}}=\mu
$$

(2) The standard deviation of the sample means can be determined as

$$
\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}
$$

This quantity is also known as the standard error and is denoted as “SE” in most textbooks.

(3) The means of samples randomly selected from the population are approximately normally distributed, provided that the sample size is sufficiently large.

The proof of the theorem is beyond the scope of this book.
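Although the proof is omitted, a small simulation can illustrate the three statements of the theorem. The sketch below is not part of the original text, and every setting in it is an assumption chosen for illustration: an exponential population with rate 1 (so that μ = 1 and σ = 1), a sample size of 30, 5,000 repeated samples, and an arbitrary seed.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)        # arbitrary seed, so the run is repeatable
n = 30                        # sample size (assumed value)
repetitions = 5000            # number of samples drawn (assumed value)

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
# It is strongly skewed, i.e. clearly not normal.
mu, sigma = 1.0, 1.0

# Draw many samples of size n and record the mean of each one
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repetitions)
]

se = sigma / sqrt(n)          # standard error predicted by statement (2)
# Share of sample means within one standard error of mu;
# for a normal distribution this share is about 0.68
within_one_se = sum(abs(m - mu) <= se for m in sample_means) / repetitions

print(round(mean(sample_means), 3), "vs mu =", mu)
print(round(stdev(sample_means), 3), "vs sigma/sqrt(n) =", round(se, 3))
print(round(within_one_se, 3))
```

The mean of the simulated sample means should be close to μ, their standard deviation close to σ/√n, and roughly 68% of them should fall within one standard error of μ, as expected for an approximately normal distribution.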

How Large Is Large Enough?

Earlier in this chapter, we mentioned large samples several times. However, we never specified how large is large enough. In statistical analysis, estimating a sufficient sample size is one of the challenging tasks of sampling. On the one hand, the availability of resources restricts the sample size because collecting data for large samples is time-consuming and expensive. On the other hand, a sample that is too small may cause a loss of accuracy and hence miss some important features of the analyzed population. By convention, statisticians define a large sample size as 30 or higher, \(n \ge 30\). However, one must note that the expected value of the sample means equals the population mean for any sample size.

It is also helpful to consider the following corollaries (conclusions) of the Central Limit Theorem:

  • For any sample size, the sampling distribution of sample means is normal if the population is normal.
  • The sampling distribution becomes approximately normal even for relatively small sample sizes if the sampled population is approximately symmetrical.
  • If the sample size is 30 or higher, the sampling distribution of sample means is approximately normal, even if the sampled population is not normal.

The conclusions of the Central Limit Theorem can be applied to calculate the probabilities of the sample mean using the normal distribution. In this case, we will still use the standard normal distribution table (table A3, Appendix), and the z-score will be calculated as follows:

$$
z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \qquad (F6.3)
$$

where μ and \(\bar{x}\) are the means of the population and sample, respectively, σ is the standard deviation of the population, and n is the sample size.

Example 6.7

A manager of an Internet provider company determined that the monthly data usage of their customers in an apartment building is normally distributed with a mean of 300 GB and a variance of 400 \(\text{GB}^2\). What is the probability that 16 randomly selected customers' average monthly data usage exceeds 310 GB?

Solution:

Given that \(\mu=300\,\text{GB},\ \sigma^2=400\,\text{GB}^2,\ \bar{x}=310\,\text{GB},\ n=16\), we can calculate the standard deviation of the population as \(\sigma=\sqrt{400\,\text{GB}^2}=20\,\text{GB}\). In this example, the sample size is not large enough (n < 30). However, we can still use the formula (F6.3) to evaluate the z-score because the population is normal.

\( z=\frac{310-300}{\frac{20}{\sqrt{16}}}=2 \)

Now, we can use the normal distribution table (table A3, Appendix) to determine the required probability.

\(P(\bar{x}>310)=P(z>2)=1-0.9772=0.0228\)

Therefore, the probability that the mean data usage of the selected sample exceeds 310 GB equals 0.0228. In practice, this means that there is a 2.28% chance that the average monthly data usage of 16 randomly selected customers exceeds 310 GB.
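The table lookup in Example 6.7 can be checked with Python's standard `statistics.NormalDist` class. This is a minimal sketch: the helper name `prob_sample_mean_above` is hypothetical, and the function simply evaluates formula (F6.3) and the upper-tail probability of the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_above(threshold, mu, sigma, n):
    """P(sample mean > threshold) under the normal model of formula (F6.3)."""
    standard_error = sigma / sqrt(n)
    z = (threshold - mu) / standard_error
    return z, 1 - NormalDist().cdf(z)   # NormalDist() is the standard normal

# Values from Example 6.7: mu = 300 GB, variance = 400 GB^2, n = 16
z, p = prob_sample_mean_above(310, mu=300, sigma=sqrt(400), n=16)
print(z, round(p, 4))
```

The result agrees with the table value: z = 2 and a probability of about 0.0228.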

Example 6.8

According to a specialist musher from Kinadapt Outdoor Training and Education Centre, the best age for dogs to start running a dog sled lies between 18 and 24 months (obtained from https://www.aventurequebec.ca/en/article/the-life-of-a-sled-dog). The average age of dogs in a dog training centre is 20 months, with a standard deviation of 10 months. A dog coach trains a group of 36 dogs. Do you think the average age of this group of dogs is in the recommended interval?

Solution:

To answer this question, it is reasonable to determine the probability that the mean age of this group of dogs is between 18 and 24 months. It is given that \(\mu=20,\ \sigma=10,\ n=36,\ x_L=18,\ x_U=24\). Regardless of the population distribution, we can use the normal distribution approach since the sample size exceeds 30.

$$
P(18\le\bar{x}\le24)
= P\!\left(\frac{18-20}{10/\sqrt{36}}\le z\le\frac{24-20}{10/\sqrt{36}}\right)
= P(-1.2\le z\le2.4)
= 0.9918-0.1151
$$

$$
= 0.8767
$$

The probability that the mean age of the selected group of dogs is between 18 and 24 months is 0.8767. In other words, the chance that the average age of this group lies in the recommended interval is high, approximately 88%.
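For an interval such as the one in Example 6.8, it can be convenient to model the sampling distribution of the mean directly instead of converting to z-scores first. The sketch below assumes Python's standard `statistics.NormalDist`; the helper name `prob_sample_mean_between` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_between(lower, upper, mu, sigma, n):
    """P(lower <= sample mean <= upper), treating the sample mean as normal."""
    # Sampling distribution of the mean: centre mu, spread sigma / sqrt(n)
    sampling_dist = NormalDist(mu=mu, sigma=sigma / sqrt(n))
    return sampling_dist.cdf(upper) - sampling_dist.cdf(lower)

# Values from Example 6.8: mu = 20 months, sigma = 10 months, n = 36
p = prob_sample_mean_between(18, 24, mu=20, sigma=10, n=36)
print(round(p, 4))
```

Both routes are equivalent: the cumulative probabilities at 18 and 24 under this distribution equal those of the standard normal distribution at z = -1.2 and z = 2.4.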

6.4 Expected Value and Standard Deviation of the Sample Proportion

In chapter 5, we stated that if a sample size (n), probability of success (p), and probability of failure (q) meet the criteria

$$
np>5 \quad \text{and} \quad nq>5 \qquad (F6.4)
$$

one can apply the normal distribution approximation to this binomial sample. Recall that we also can determine the mean and standard deviation of binomial variables using the following formulae, respectively:

$$
\mu = np \qquad (F6.5)
$$

$$
\sigma = \sqrt{npq} \qquad (F6.6)
$$

If conditions (F6.4) are satisfied, the Central Limit Theorem can be applied. For example, consider a sample of a binomial variable x with a size of n, such that np > 5 and nq > 5, where q = 1 – p.

Then, the sample proportion

$$
\bar{p}=\frac{x}{n} \qquad (F6.7)
$$

has the expected value p and the standard error

$$
SE=\sqrt{\frac{pq}{n}} \qquad (F6.8)
$$

To evaluate the required probabilities for this sample using the cumulative normal distribution table (table A3, Appendix), we will need to evaluate a corresponding z-score as

$$
z=\frac{\bar{p}-p}{\sqrt{\frac{pq}{n}}} \qquad (F6.9)
$$

Example 6.9

On June 3, 2021, The Jerusalem Post published material stating that nearly half of Americans believe that dinosaurs still exist (obtained from https://www.jpost.com/health-science/nearly-half-of-americans-still-believe-dinosaurs-roam-the-earth-669977). A statistics graduate student decided to check if this statement is reasonable. She called 25 people she randomly selected from the local phone book and asked them if they believed dinosaurs still existed. Ten of the respondents answered "Yes." Based on these statistics, determine the proportion of people who are less skeptical about the existence of dinosaurs today than the respondents selected for this research.

Solution:

The sample size is n=25. We are interested in investigating whether more or less than half of Americans believe in dinosaurs' existence. Consequently, we can consider \(p=0.5\). Hence \(q=1-p=1-0.5=0.5\). The conditions (F6.4) are satisfied since \(np=25\bullet0.5=12.5>5\) and \(nq=25\bullet0.5=12.5>5\). Therefore, we can use the normal approximation to solve this problem. First, we need to evaluate the sample proportion.

$$
\bar{p}=\frac{x}{n}=\frac{10}{25}=0.4
$$

Using the cumulative normal distribution table (table A3, Appendix), we can estimate the probability that a sample proportion is greater than the proportion observed in this sample, assuming that half of the population believes in the existence of dinosaurs today.

$$
P\!\left(\bar{p}>0.4\right)
= P\!\left(z>\frac{0.4-0.5}{\sqrt{\frac{0.5\bullet 0.5}{25}}}\right)
= P(z>-1.00)
= 1-0.1587
= 0.8413
$$

Therefore, if half of Americans believed in the existence of dinosaurs today, a random sample of 25 people would show a proportion of believers greater than 0.4 with a probability of approximately 0.84, or 84%.
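The steps of Example 6.9 can be collected in one Python sketch. It assumes the standard `statistics.NormalDist` class; the helper name `prob_proportion_above` is hypothetical. The function checks conditions (F6.4) before applying formulae (F6.8) and (F6.9).

```python
from math import sqrt
from statistics import NormalDist

def prob_proportion_above(p_hat, p, n):
    """P(sample proportion > p_hat) by the normal approximation (F6.9)."""
    q = 1 - p
    if n * p <= 5 or n * q <= 5:          # conditions (F6.4)
        raise ValueError("normal approximation conditions np > 5, nq > 5 fail")
    standard_error = sqrt(p * q / n)       # formula (F6.8)
    z = (p_hat - p) / standard_error       # formula (F6.9)
    return standard_error, z, 1 - NormalDist().cdf(z)

# Values from Example 6.9: 10 "Yes" answers out of 25, assumed p = 0.5
se, z, prob = prob_proportion_above(10 / 25, p=0.5, n=25)
print(round(se, 4), round(z, 2), round(prob, 4))
```

The output should match the worked solution: a standard error of 0.1, z = -1, and a probability of about 0.8413.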

Chapter 6 Summary

• Sampling methods
o Simple random sampling
o Systematic sampling
o Voluntary sampling

• Expected value and standard deviation of the sample mean

• Central Limit Theorem

• How large is large enough?

• Expected value and standard deviation of the sample proportion

EXERCISES

Mean and Standard Deviation of the Sample Mean

1. Calculate the standard error of the mean for the following situations:

a) \( \mu = 100 \quad \sigma = 2 \quad n = 30 \)

b) \( \mu = 10 \quad \sigma = 8 \quad n = 50 \)

c) \( \mu = 200 \quad \sigma = 20 \quad n = 90 \)

d) \( \mu = 47 \quad \sigma = 9 \quad n = 225 \)

e) \( \mu = 16.2 \quad \sigma = 9 \quad n = 900 \)

2. The average life expectancy of all Canadians is 78.2 years with a standard deviation of 6 years. A sample of 100 Canadians is examined.

a) What is the probability that the life expectancy of any one individual is greater than 79 years?

b) What is the standard error of the mean?

3. Calculate the standard error of the mean for the following situations:

a) \( \mu = 200 \quad \sigma = 1 \quad n = 25 \)

b) \( \mu = 40 \quad \sigma = 2 \quad n = 16 \)

c) \( \mu = 100 \quad \sigma = 10 \quad n = 100 \)

d) \( \mu = 67 \quad \sigma = 10 \quad n = 200 \)

e) \( \mu = 36.2 \quad \sigma = 10 \quad n = 400 \)

4. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) If we consider the simple random sampling process as an experiment, the sample mean is

a) Always zero

b) Always smaller than the population mean

c) A random variable

d) Exactly equal to the population mean

e) None of the above answers are correct.

5. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) The standard deviation of all possible \(\bar{x}\) values is called

a) The standard error of the proportion

b) The standard error of the mean

c) Mean deviation

d) Central variation

e) None of the above answers are correct.

6. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) Starting salaries of a sample of five biology majors are shown below.

Employee Salary (in $1000)
1 30
2 28
3 22
4 26
5 19

a) What is the expected value of the starting salaries of all biology majors?
b) Determine the expected value of the variance of the population.

Sampling Distribution of the Sample Mean

1. If the cholesterol level of men in the community is normally distributed with a mean of 200 and a standard deviation of 25, what is the probability that a randomly selected sample of 49 men will have a mean between 190 and 205?

2. If the forced vital capacity of 11-year-old males is normally distributed with a mean of 2400cc and a standard deviation of 400, find the probability that a sample of size n=64 will provide a mean:

a. greater than 2500
b. between 2300 and 2500
c. less than 2350

3. The birth weight of newborn babies is normally distributed with a mean of 3.8 kg and a standard deviation of 0.6 kg.

a) If a single newborn was sampled at random, what is the probability that this baby would weigh more than 3.9 kg?
b) If 36 newborns had been sampled, what is the probability that the sample mean would be more than 3.9 kg?

7. The mean age of all students at SIAST is 25 with a standard deviation of 4.

a) If the ages are normally distributed, between what intervals would you expect to find 95% of all student ages?
b) If repeated samples of 100 students are made, between what intervals would you expect to find 95% of the sample means?
c) What is the probability that a sample mean is greater than 26?
d) What is the probability that any one student is older than 26?

8. The average age of all children admitted to hospital for asthma is 8.92 years, with a standard deviation of 2.43 years.

a) What is the probability that a child admitted to hospital will be over 10 years old?
b) In a sample of 36 children admitted to hospital, what is the probability that the mean age of the sample children will be over 10 years old?
c) What is the probability that a single child will be under 8.5 years old?
d) For a sample of 100 children, what is the probability that the average age will be under 8.5 years old?

9. The number of letters and packages (“mail items”) that Canada Post handles each day is normally distributed with a mean of 29.1 million mail items and standard deviation 3.9 million mail items.

a) What is the probability that on any given day Canada Post handles between 25.0 and 30.0 million mail items?
b) On December 21st, the amount of mail delivered  was higher than on 99% of all other days. How many mail items were delivered on this day?
c) What is the probability that the total amount of mail delivered in April (30 days) was more than 900 million mail items? In other words, find the probability that a sample of size 30 has a mean greater than 30.0 million mail items.

Sample Proportion: Sample Distribution, Mean and Standard Deviation

1. Suppose that 25 year old females have a remaining mean life expectancy of 55 years with a standard deviation of 6. What proportion of 25-year-old females will live past 65? What assumption do you have to make in order to obtain a valid answer?

2. To qualify for a Master's study in statistics, applicants are given a trivia test. The scores are normally distributed, with a mean of 80 and a standard deviation of 8.

a. Find the proportion of applicants who have a score of more than 90.
b. If only the top 15% of applicants are selected, find the cutoff score.

3. In a local university, 10% of the students live in the dormitories. A random sample of 100 students is selected for a particular study.

a) What is the probability that the sample proportion (the proportion living in the dormitories) is between 0.172 and 0.178?
b) What is the probability that the sample proportion (the proportion living in the dormitories) is greater than 0.025?

4. There are 500 employees in a firm; 45% of whom are female. A sample of 60 employees is selected randomly.

a) Determine the standard error of the proportion.
b) What is the probability that the sample proportion (proportion of females) is between 0.40 and 0.55?

5. A new soft drink is being market tested. It is estimated that 60% of the consumers will like the new drink. A sample of 96 consumers taste tested the new drink.

a) Determine the standard error of the proportion.
b) What is the probability that more than 70.4% of consumers will indicate they like the drink?
c) What is the probability that more than 30% of consumers will indicate they do not like the drink?

 

License

Introductory Statistics Copyright © 2026 by Arzu Sardarli and Andrei Volodin is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.