Chapter 2. Numerical Measures of Data

Chapter Objectives

In this chapter, readers will learn to do the following:

• Define numerical measures of the centre of data (Mean, Median, Mode)
• Determine numerical measures of the centre of data (Mean, Median, Mode)
• Define numerical measures of the dispersion of data (Variance, Standard Deviation)
• Evaluate Variance and Standard Deviation of population and sample
• Construct the Box Plot

2.1. Numerical Measures of Data

In the previous chapter, we attempted to describe various data sets by graphing. Graphs are very helpful for visualizing the distribution of data and indicating some of its specific features. However, if you construct graphs for a large amount of data (so-called big data), you might notice that the shapes and the values of some specific parameters may vary depending on the individual choices of the person working with the data. We will go back to example 1.13 to support this observation. In that example, we decided to divide the data into 11 classes for constructing the relative frequency histogram and obtained the following graph.

Figure 2.1. 11-class relative frequency histogram for data obtained from example 1.13


Now, let’s observe the histogram of the same data set with seven classes (fig. 2.2).

Figure 2.2. Seven-class relative frequency histogram for data obtained from example 1.13


As one can see from figures 2.1 and 2.2, the histograms look slightly different, although they were constructed for the same data set. In particular, in figure 2.1, with 11 classes, the histogram has two peaks, allowing us to conclude that this distribution is bimodal. But if we change the number of classes to seven (fig. 2.2), the histogram indicates only one peak, and a conclusion based on only this histogram would be that the distribution is unimodal. Therefore, although graphs give some idea about the distribution of collected data, their centre and dispersion, numerical measures are essential for making reliable decisions.

In this chapter, we will define some numerical measures for characterizing data and learn how to evaluate them. This method of data analysis is called numerical descriptive analysis. In statistics, three groups of numerical measures are used for descriptive analysis of big data: measures of centre, measures of dispersion, and measures of position.

2.2. Numerical Measures of Centre

In statistics, three numerical measures are used to describe the centre of data distribution: mean, median, and mode.

The Mean

The mean is the most common measure used to describe the centre of data. The term “average” is used interchangeably with the word “mean.” The mean is determined by summing up all data values and dividing by the number of data points. For example, the mean or average of three integers, 5, 7, and 12, can be determined as

\[
\frac{5+7+12}{3} = 8
\]

In this textbook, we will use different symbols to distinguish the mean of a sample from the mean of a population. The Greek letter \( \mu \) represents the population mean. The symbol \( \bar{x} \) (read “x bar”) represents the sample mean.

Population mean:

\[
\mu = \frac{\sum_{i=1}^{N} x_i}{N} \tag{F2.1}
\]

Sample mean:

\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{F2.2}
\]

where \( x_i \) is the value of the \( i \)th observation, and \( N \) and \( n \) are the sizes of the population and sample, respectively.

We hope readers recall from school mathematics courses that the symbols \( \sum_{i=1}^{N} x_i \) and \( \sum_{i=1}^{n} x_i \) denote sums of data values:

\( \sum_{i=1}^{N} x_i = x_1 + x_2 + \cdots + x_N \)

\( \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \)

“∑” is known as the summation symbol. Sometimes, we might write the summation symbol without the lower and upper limits, \( \sum x_i \), to ease our writing.

Example 2.1

There are 112 students in the STAT 100 class. They got the following scores (in percentage) for their mid-term test:

51 43 43 81 99 43 92 73 48 57 85 54 90 54
89 36 33 31 88 51 78 51 60 76 98 63 55 67
42 36 65 57 88 45 83 70 83 47 45 43 78 93
76 84 80 75 71 81 30 40 84 36 97 72 66 88
89 76 53 56 52 66 41 61 79 99 58 98 54 75
33 30 74 42 35 85 59 48 42 49 73 47 100 80
51 45 36 62 87 66 97 35 40 51 41 38 35 39
47 90 59 86 30 36 97 42 85 51 31 67 56 62

Determine the mean of the mid-term exam score for the entire class.

Solution:

We will use the formula F2.1 to determine the mean of the population. Readers can add up the values of individual scores and divide by the population size, N = 112.

\[
\mu = \frac{\sum x_i}{N} = \frac{6960}{112} \approx 62
\]

Therefore, the class average for this test is 62%. In statistics, we prefer to say, “The mean score is 62%.”

Example 2.2

Refer to example 2.1. The instructor randomly selected the scores of ten students for analyzing the data.

90 43 72 81 33 43 92 51 43 51

Determine the sample mean.

Solution:

We will use the formula F2.2 to determine the mean of the sample.

\( \bar{x} = \frac{\sum x_i}{n} = \frac{90+43+72+81+33+43+92+51+43+51}{10} = \frac{599}{10} = 59.9 \approx 60 \)

The mean of this sample is approximately 60%. Observe that the sample mean is slightly different from the population mean determined in example 2.1. We have to consider that the sample mean is an estimator of the population mean. Selecting different samples, we can obtain different estimates. Later we will discuss how a set of such estimates can help us draw a conclusion about the population mean.

Sometimes we need to evaluate the mean using frequencies.

Population mean:

\[
\mu = \frac{\sum f_i x_i}{\sum f_i} \tag{F2.3}
\]

Sample mean:

\[
\bar{x} = \frac{\sum f_i x_i}{\sum f_i} \tag{F2.4}
\]

where \( f_i \) is the frequency of observation \( x_i \). To determine the mean using these formulae, we need to get the frequency distribution.

Example 2.3

Let us evaluate the sample mean for the data provided in example 2.2 using the frequency. First, we have to construct the frequency table.

 

| \( x_i \) | \( f_i \) |
|---|---|
| 33 | 1 |
| 43 | 3 |
| 51 | 2 |
| 72 | 1 |
| 81 | 1 |
| 90 | 1 |
| 92 | 1 |
| | \( \sum f_i = 10 \) |

Now we can use formula F2.4 to evaluate the sample mean:

\[
\bar{x}
= \frac{\sum f_i x_i}{\sum f_i}
= \frac{33\cdot1 + 43\cdot3 + 51\cdot2 + 72\cdot1 + 81\cdot1 + 90\cdot1 + 92\cdot1}{1 + 3 + 2 + 1 + 1 + 1 + 1}
= \frac{599}{10}
\approx 60
\]

Observe that the sum of frequencies must be equal to the sample size. This helps us to ensure that all observations are counted.
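The frequency-based calculation in example 2.3 can also be sketched in a few lines of Python. This is only an illustration, not part of the original solution; the names `freq` and `mean_from_frequencies` are our own choices.

```python
# Frequency table from example 2.3: observation x_i -> frequency f_i
freq = {33: 1, 43: 3, 51: 2, 72: 1, 81: 1, 90: 1, 92: 1}


def mean_from_frequencies(table):
    """Return sum(f_i * x_i) / sum(f_i), as in formula F2.4."""
    total_count = sum(table.values())
    weighted_sum = sum(x * f for x, f in table.items())
    return weighted_sum / total_count


sample_size = sum(freq.values())   # should equal the sample size
x_bar = mean_from_frequencies(freq)
print(sample_size, x_bar)
```

The first printed number is the sum of the frequencies, which should match the sample size of 10; the second is the sample mean before rounding.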

The mean is not always a robust indicator of the real centre of the data. Let’s analyze another example.

Example 2.4

Assume that there are 10 students in a class and the following data show the amount of money (in Canadian dollars) in their bank accounts:

2,500   6,100   350   2,100,000   3,200   1,030   7,400   5,200   2,200   4,100

Apparently, one of your classmates is a millionaire! Let’s evaluate the mean of this data.

\[
\mu = \frac{\sum x_i}{N} = \frac{2{,}132{,}080}{10} = 213{,}208
\]

According to this calculation, we could state that an average student has $213,208 in their bank account. Obviously, any student would be happy to have this amount of money. Unfortunately, this does not seem realistic. This does not mean that we made a mistake in our calculations. However, it does tell us that the mean does not look suitable for this particular example. The following characteristic is more appropriate to describe the centre of this type of data.
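To see how strongly a single extreme value affects the mean, readers may wish to experiment with a short Python sketch such as the one below. The variable names are illustrative, and the second calculation (leaving out the largest balance) is our own addition for comparison, not part of example 2.4.

```python
# Bank balances (in Canadian dollars) from example 2.4
balances = [2500, 6100, 350, 2_100_000, 3200, 1030, 7400, 5200, 2200, 4100]

# Mean of all ten balances (formula F2.1)
mu_all = sum(balances) / len(balances)

# For comparison only: mean after leaving out the single largest balance
others = sorted(balances)[:-1]
mu_without_largest = sum(others) / len(others)

print(f"Mean of all balances:         {mu_all:,.2f}")
print(f"Mean without largest balance: {mu_without_largest:,.2f}")
```

One very large balance is enough to move the mean from a few thousand dollars to more than two hundred thousand.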

The Median

In general, it can be shown that when the data are not unimodal and symmetrical (for example, when they are skewed), the mean does not give an adequate idea of the centre of the data. In such cases, we use the median to describe the centre of the data. The median is the middle value when the data are arranged in order. In other words, there are an equal number of observations above and below the median. If there is an odd number of values, the median is the middle value of the ordered observations. Otherwise, it lies between the two middle values and is generally calculated as the average of these two numbers.

The median is determined in two steps.

Step 1.

To locate the median, we arrange the data in increasing order and then use the formula

\[
\text{position of median} = \frac{n+1}{2}
\]

where \( n \) is the number of observations.

Step 2.

If the position of the median is an integer (i.e., n is an odd number), then the median is equal to the observation located at this position in the ordered data. If the position of the median is not an integer (i.e., n is an even number), then the median is equal to the average of the two observations on either side of this position in the ordered data.

Example 2.5

Determine the median for the following data:

49 9 83 52 39 15 75

Solution:

First, we will put this data in increasing order.

| Ordered data | 9 | 15 | 39 | 49 | 52 | 75 | 83 |
|---|---|---|---|---|---|---|---|
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

The number of observations is seven, which is an odd number, so the position of the median is \( \frac{n+1}{2} = \frac{7+1}{2} = 4 \). The fourth number in this sequence is 49. Therefore, the median of the provided data set is 49.

Example 2.6

Determine the median for the data provided in example 2.4.

2500 6100 350 2100000 3200 1030 7400 5200 2200 4100

Solution:

First, we will put these data in increasing order:

| Ordered data | 350 | 1030 | 2200 | 2500 | 3200 | 4100 | 5200 | 6100 | 7400 | 2100000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |

These data contain 10 observations, which is an even number:

\( \frac{n+1}{2} = \frac{10+1}{2} = 5.5 \). This decimal is between 5 and 6. Consequently, the median of this data is equal to the average of the fifth and sixth observations: \( \frac{3200+4100}{2} = 3650 \). Therefore, the median is $3,650. You can agree that this number better describes the possible amount of money in a student’s bank account than the mean value of $213,208, previously found for this data in example 2.4.
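The two-step procedure used in examples 2.5 and 2.6 can be sketched in Python as follows. The function name `median_by_position` is our own, and the code is meant only as an illustration of the rule, not as a replacement for a statistical package.

```python
def median_by_position(values):
    """Median via the two-step rule: locate position (n + 1) / 2, then read it off."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)          # Step 1: arrange in increasing order
    n = len(ordered)
    position = (n + 1) / 2            # 1-based position of the median
    if position.is_integer():         # n is odd: a single middle observation
        return ordered[int(position) - 1]
    # n is even: average the two observations on either side of the position
    below = ordered[int(position) - 1]
    above = ordered[int(position)]
    return (below + above) / 2


example_2_5 = [49, 9, 83, 52, 39, 15, 75]
example_2_6 = [2500, 6100, 350, 2_100_000, 3200, 1030, 7400, 5200, 2200, 4100]

print(median_by_position(example_2_5))
print(median_by_position(example_2_6))
```

Note that Python lists are indexed from 0, which is why 1 is subtracted from the position used in the text.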

The Mode

The mode is the value(s) that occurs most often. It is useful for data where it is not possible to calculate the mean or median. For example, if we deal with data representing eye colours or political parties, the concepts of addition and ordering do not have a logical meaning.

Sometimes, data can have more than one mode. We use a special term, “bimodal,” to describe data with two modes. Figure 2.3 represents histograms of data sets with various modalities.


Figure 2.3. Histograms of data sets with various modalities

Example 2.7

Determine the modes of the following data sets:

(a) 21 12 18 31 12 12 47 18 12 17 38 12 24 21 27 12

(b) 21 12 18 31 12 38 47 38 12 17 38 12 24 21 27 38

(c) 21 12 18 31 15 38 47 32 13 17 29 9 24 25 27 33

Solution:

(a) First, we will construct the frequency table for the given data.

| Data | Frequency |
|---|---|
| 12 | 6 |
| 17 | 1 |
| 18 | 2 |
| 21 | 2 |
| 24 | 1 |
| 27 | 1 |
| 31 | 1 |
| 38 | 1 |
| 47 | 1 |

As one can see from the frequency table, the observation “12” occurs most often, six times. Therefore, the mode of this data set is 12 and it is unimodal. This can be visualized using the dot plot where we see one peak (fig. 2.4).

 

Figure 2.4. Dot plot for the data set provided in example 2.7(a)


(b) Similarly, we construct the frequency table.

| Data | Frequency |
|---|---|
| 12 | 4 |
| 17 | 1 |
| 18 | 1 |
| 21 | 2 |
| 24 | 1 |
| 27 | 1 |
| 31 | 1 |
| 38 | 4 |
| 47 | 1 |

The observations “12” and “38” each occur four times, more often than any other. Therefore, these data have two modes, 12 and 38, and are described as bimodal. As we can see from the dot plot (fig. 2.5), two peaks characterize the modality of these data.

Figure 2.5. Dot plot for the data set provided in example 2.7(b)


(c) One can see that there are no repeating observations in the following data set:

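| Data | Frequency |
|---|---|
| 21 | 1 |
| 12 | 1 |
| 18 | 1 |
| 31 | 1 |
| 15 | 1 |
| 38 | 1 |
| 47 | 1 |
| 32 | 1 |
| 13 | 1 |
| 17 | 1 |
| 29 | 1 |
| 9 | 1 |
| 24 | 1 |
| 25 | 1 |
| 27 | 1 |
| 33 | 1 |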

The data set does not have any mode, since the frequency of each observation is equal to 1. The dot plot supports this conclusion (fig. 2.6). That means that these data have a uniform modality.

Figure 2.6. Dot plot for the data set provided in example 2.7(c)

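The frequency-table approach of example 2.7 can be reproduced with a short Python sketch. The helper `modes` is a hypothetical name of ours; it follows the convention used in this example that a data set in which every observation occurs once has no mode. (Python's built-in `statistics.multimode` uses a different convention and would return every value in that case.)

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)              # frequency table: value -> frequency
    highest = max(counts.values())
    if highest == 1:                      # every observation occurs once
        return []
    return sorted(v for v, c in counts.items() if c == highest)


data_a = [21, 12, 18, 31, 12, 12, 47, 18, 12, 17, 38, 12, 24, 21, 27, 12]
data_b = [21, 12, 18, 31, 12, 38, 47, 38, 12, 17, 38, 12, 24, 21, 27, 38]
data_c = [21, 12, 18, 31, 15, 38, 47, 32, 13, 17, 29, 9, 24, 25, 27, 33]

for label, data in (("a", data_a), ("b", data_b), ("c", data_c)):
    print(label, modes(data))
```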

When working with large data sets, it is more practical to determine the class that includes the greatest number of observations. This class is called the modal interval.

Example 2.8

A company manager recorded the duration of phone calls (in minutes) made by her workers during the day.

1 10 16 19 20 22 24 26 31 39 39 11 26
19 20 31 24 16 23 3 4 12 17 37 32 23
24 26 20 19 24 32 18 20 20 23 6 27 13
30 28 15 25 23 25 20 18 7 32 22 8 15
18 20 22 23 25 29 33 23 22 15 19 20 8

Determine the modal interval for the given phone call durations.

Solution:

Following the instructions provided in the previous chapter, we need to define the number of classes, create the frequency table, and construct the histogram. Let’s define 10 classes for the given data.

| Classes | Frequency |
|---|---|
| [0, 4) | 2 |
| [4, 8) | 3 |
| [8, 12) | 4 |
| [12, 16) | 5 |
| [16, 20) | 10 |
| [20, 24) | 18 |
| [24, 28) | 11 |
| [28, 32) | 5 |
| [32, 36) | 4 |
| [36, 40) | 3 |

Figure 2.7. Frequency histogram of phone call durations provided in example 2.8


The frequency histogram shows that more calls had a duration of 20 to 24 minutes than fell into any other class. Therefore, the modal interval of the given data is [20, 24).
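For readers who prefer to check the class frequencies by computer, the following Python sketch bins the call durations of example 2.8 into the same ten classes of width 4 and picks the class with the highest frequency. The variable names are our own, and the class width is taken from the table above.

```python
from collections import Counter

# Call durations (in minutes) from example 2.8
durations = [
    1, 10, 16, 19, 20, 22, 24, 26, 31, 39, 39, 11, 26,
    19, 20, 31, 24, 16, 23, 3, 4, 12, 17, 37, 32, 23,
    24, 26, 20, 19, 24, 32, 18, 20, 20, 23, 6, 27, 13,
    30, 28, 15, 25, 23, 25, 20, 18, 7, 32, 22, 8, 15,
    18, 20, 22, 23, 25, 29, 33, 23, 22, 15, 19, 20, 8,
]

width = 4  # class width, so the classes are [0, 4), [4, 8), and so on

# Map each duration to the lower limit of its class and count the classes
counts = Counter((d // width) * width for d in durations)

# The modal interval is the class with the greatest frequency
lower, frequency = max(counts.items(), key=lambda item: item[1])
modal_interval = (lower, lower + width)

for start in sorted(counts):
    print(f"[{start}, {start + width}): {counts[start]}")
print("Modal interval:", modal_interval, "with frequency", frequency)
```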

Earlier, in example 1.10, we had constructed the relative frequency histogram of Oscar winners' ages (fig. 2.8).

Figure 2.8. Relative frequency histogram constructed for the data set provided in example 1.10


Using the histogram provided in figure 2.8, one can determine the modal intervals of the data from example 1.10: [42, 46) and [50, 54).

Comparison of Mean, Median, and Mode

When conducting a statistical analysis, we often need to compare the measures of the centre. There exists an interesting relation between such measures (mean, median, and mode) and the shape of a histogram.

In the following example we will analyze three different data sets with various measures of centres.

Example 2.9

A self-service store owner who excelled at statistics in university decided to record the amounts of payments made in a day at three different cash registers of her store and built histograms for each of the data sets.

Cash register 1

Figure 2.9. Unimodal symmetrical data. Mean ≈ median ≈ mode


Cash register 2


Figure 2.10. Right-skewed data. Mean > median > mode

Cash register 3

Figure 2.11. Left-skewed data. Mean < median < mode

If the data are unimodal and symmetrical, the three measures of central tendency will be of similar value (fig. 2.9). When data are skewed, the mean and median will not be equal. The mean will be “pulled toward the skew.” For data skewed to the right, the mean will be greater than the median (fig. 2.10). For data skewed to the left, the mean will be less than the median (fig. 2.11).

2.3. Measures of Dispersion

In the previous section, we discussed the numbers that describe the centre of data. They are very important and useful for analyzing data, but they do not provide complete information about how the data are spread or dispersed. In this section, we will introduce the measures that characterize the dispersion of the data. The three most commonly used measures are the range, variance, and standard deviation.

As we defined earlier in this book, the range is simply the difference between the highest and lowest values in a data set.

Range = \( x_{\max} - x_{\min} \)

It is standard to give the actual values (minimum and maximum). For example, according to Statistics Canada, gas prices across Canada in September 2021 ranged between $129.50 and $156.70 per 100 litres. Therefore,

Range = $156.70 − $129.50 = $27.20

However, the range gives no indication of the dispersion of gas prices between these two extreme values, $129.50 and $156.70. In other words, there may be a lot of values (prices) clumped at either end of the distribution.

To analyze the data distribution, we often need to determine how far each observation is from the mean. The difference between an observation and the mean is called the deviation. The two most commonly used measures that take into account the spread of all the data values are the variance and the standard deviation. A data set that is more variable will have a larger variance than a data set that is relatively homogeneous. The population variance is the sum of the squared deviations from the mean divided by the number of elements. In the example below, we will provide a step-by-step evaluation of the variance and standard deviation.

Example 2.10

Moving from the bottom of the hill to the top, Ann recorded distances between trees in metres (m):

18 7 8 17 15 10 5 7 6 7 12 8

Determine the variance and the standard deviation of these data using the definitions provided above.

Solution:

The mean of the data is

\( \mu = \frac{18+7+8+17+15+10+5+7+6+7+12+8}{12} = 10 \)m.

Now, let’s evaluate the variance of these data using the definition provided above.

| Number \( x_i \), m | Deviation from the mean \( x_i - \mu \), m | Squared deviation \( (x_i - \mu)^2 \), \( \text{m}^2 \) |
|---|---|---|
| 18 | \( 18 - 10 = 8 \) | \( 8^{2} = 64 \) |
| 7 | \( 7 - 10 = -3 \) | \( (-3)^{2} = 9 \) |
| 8 | \( 8 - 10 = -2 \) | \( (-2)^{2} = 4 \) |
| 17 | \( 17 - 10 = 7 \) | \( 7^{2} = 49 \) |
| 15 | \( 15 - 10 = 5 \) | \( 5^{2} = 25 \) |
| 10 | \( 10 - 10 = 0 \) | \( 0^{2} = 0 \) |
| 5 | \( 5 - 10 = -5 \) | \( (-5)^{2} = 25 \) |
| 7 | \( 7 - 10 = -3 \) | \( (-3)^{2} = 9 \) |
| 6 | \( 6 - 10 = -4 \) | \( (-4)^{2} = 16 \) |
| 7 | \( 7 - 10 = -3 \) | \( (-3)^{2} = 9 \) |
| 12 | \( 12 - 10 = 2 \) | \( 2^{2} = 4 \) |
| 8 | \( 8 - 10 = -2 \) | \( (-2)^{2} = 4 \) |

\[
\sum_{i=1}^{12} x_i = 120 \text{ m}
\]

\[
\sum_{i=1}^{12} (x_i - \mu) = 0 \text{ m}
\]

\[
\sum_{i=1}^{12} (x_i - \mu)^2 = 218 \text{ m}^2
\]

Note that the sum of deviations always equals 0. We can evaluate the sum of deviations to ensure that our calculation is correct.

The squared deviations are then summed and divided by the number of observations to give the variance.

\( \text{Variance} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{218}{12} = 18.1\overline{6} \approx 18.2\ \text{m}^2 \)

As one can see, the unit of the variance is \( \text{m}^2 \), whereas the unit of the data was m. It would be more convenient to use a measure of dispersion that has the same unit as the data. That is why we take the square root of the variance to describe the spread of the data. This measure is called the standard deviation and is denoted as \( \sigma \), the Greek letter “sigma.”

Consequently, the variance is denoted as \( \sigma^2 \), considering that \( \sqrt{\sigma^2} = \sigma \). In our example, the variance \( \sigma^2 = 18.2\ \text{m}^2 \) and the standard deviation \( \sigma = \sqrt{18.2\ \text{m}^2} = 4.3\ \text{m} \).

To summarize, in this example we referred to the definition of the variance and standard deviation of population and used the following formulae:

\[
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{F2.5}
\]

\[
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \tag{F2.6}
\]

(F 2.5) and (F 2.6) are called definition formulae for the population variance and population standard deviation, respectively.
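As a cross-check of example 2.10, the definition formulae (F2.5) and (F2.6) can be evaluated step by step in Python. The sketch below uses variable names of our own choosing and compares the result with `statistics.pvariance` from the Python standard library, which computes the population variance.

```python
from math import sqrt
from statistics import pvariance

# Distances between trees (in metres) from example 2.10
distances = [18, 7, 8, 17, 15, 10, 5, 7, 6, 7, 12, 8]

N = len(distances)
mu = sum(distances) / N                      # population mean
deviations = [x - mu for x in distances]     # x_i - mu
sum_sq = sum(d ** 2 for d in deviations)     # sum of squared deviations

variance = sum_sq / N                        # formula F2.5
sigma = sqrt(variance)                       # formula F2.6

print("Sum of deviations:        ", sum(deviations))
print("Sum of squared deviations:", sum_sq)
print("Variance:                 ", round(variance, 1))
print("Standard deviation:       ", round(sigma, 1))
print("Library cross-check:      ", round(pvariance(distances), 1))
```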

As mentioned repeatedly, often we have to work with large-size data sets (populations), and in these cases we randomly select a sample for conducting statistical analysis. The formulae of the sample variance and sample standard deviation are similar to (F 2.5) and (F2.6):

\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \tag{F2.7}
\]

\[
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \tag{F2.8}
\]

Here \( \bar{x} \) is the sample mean and \( n \) is the sample size.

Note that, in these formulae, we divide by \( n - 1 \) rather than by the sample size \( n \). While the exact explanation of this change is beyond the scope of this course, we can say that this replacement reduces the bias of the sample variance as an estimator of the population variance.

We hope careful readers noticed that we use different letters, σ and s, to distinguish standard deviations of populations and samples, respectively.

In the example below, we will evaluate the variance and standard deviation of a sample randomly selected from the data set from example 2.10.

Example 2.11

Refer to example 2.10. Moving from the bottom of the hill to the top, Ann recorded distances between trees in metres (m) and randomly selected four observations from her data:

5 10 7 17

Determine the variance and standard deviation of this sample.

Solution:

First, we will determine the sample mean:

\( \overline{x} = \frac{5+10+7+17}{4} = 9.75\ \text{m} \)

Following the procedure used in example 2.10, we construct the following table:

| Number \( x_i \), m | Deviation from the mean \( x_i - \bar{x} \), m | Squared deviation \( (x_i - \bar{x})^2 \), \( \text{m}^2 \) |
|---|---|---|
| 5 | \( 5 - 9.75 = -4.75 \) | \( (-4.75)^{2} = 22.5625 \) |
| 10 | \( 10 - 9.75 = 0.25 \) | \( (0.25)^{2} = 0.0625 \) |
| 7 | \( 7 - 9.75 = -2.75 \) | \( (-2.75)^{2} = 7.5625 \) |
| 17 | \( 17 - 9.75 = 7.25 \) | \( (7.25)^{2} = 52.5625 \) |

\[
\sum_{i=1}^{4} x_i = 39\ \text{m}
\]

\[
\sum_{i=1}^{4} (x_i - \bar{x}) = 0\ \text{m}
\]

\[
\sum_{i=1}^{4} (x_i - \bar{x})^2 = 82.75\ \text{m}^2
\]

Now we can evaluate the variance and standard deviation of the sample using formulae (F 2.7) and (F 2.8), respectively.

\[
s^2
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
= \frac{\sum_{i=1}^{4} (x_i - 9.75)^2}{4 - 1}
= 27.6\ \text{m}^2
\]

\[
s
= \sqrt{\frac{\sum_{i=1}^{4} (x_i - 9.75)^2}{4 - 1}}
= \sqrt{27.5833}
= 5.2520
\approx 5.3\ \text{m}
\]

As one can note, the population and sample standard deviations (variances) differ. Later in this book we will see that the standard deviations of samples can be used as estimators of population standard deviations.
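The sample calculation of example 2.11 can be checked in the same way. In this sketch (with our own variable names), the formulae (F2.7) and (F2.8) are evaluated directly and then compared with `statistics.variance` and `statistics.stdev` from the Python standard library, both of which divide by n − 1.

```python
from math import sqrt
from statistics import stdev, variance

# Sample of four distances (in metres) from example 2.11
sample = [5, 10, 7, 17]

n = len(sample)
x_bar = sum(sample) / n                            # sample mean
sum_sq = sum((x - x_bar) ** 2 for x in sample)     # sum of squared deviations

s2 = sum_sq / (n - 1)                              # formula F2.7
s = sqrt(s2)                                       # formula F2.8

print("Sample mean:              ", x_bar)
print("Sum of squared deviations:", sum_sq)
print("Sample variance:          ", round(s2, 4))
print("Sample standard deviation:", round(s, 4))
print("Library cross-check:      ", round(variance(sample), 4), round(stdev(sample), 4))
```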

One of the benefits of the standard deviation is that it allows us to estimate the proportion of observations lying within particular intervals of the distribution. The Russian mathematician Chebyshev proved a theorem summarizing this application of the standard deviation. Chebyshev’s Theorem gives the minimum proportion of observations that fall within a specified number of standard deviations from the mean. This theorem applies to any data set.

Chebyshev’s Theorem

For any numerical data set,

  • at least \( \frac{3}{4} \) of the data lie within the interval \( (\mu - 2\sigma,\ \mu + 2\sigma) \);
  • at least \( \frac{8}{9} \) of the data lie within the interval \( (\mu - 3\sigma,\ \mu + 3\sigma) \); and
  • in general, at least \( 1 - \frac{1}{k^{2}} \) of the data lie within the interval \( (\mu - k\sigma,\ \mu + k\sigma) \).

where k is any real number, strictly greater than 1.
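The general bound in Chebyshev’s Theorem is easy to tabulate. The following Python sketch, in which the function name `chebyshev_lower_bound` is our own, evaluates \( 1 - \frac{1}{k^{2}} \) for a few values of k, including a non-integer one.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be strictly greater than 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the data")
```

For k = 2 and k = 3 the function returns the fractions 3/4 and 8/9 stated in the theorem.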

In this book, we will mainly analyze data with an approximately mound-shaped distribution (some textbooks prefer to use the term “bell-shaped” distribution). This is why we will use a method specifically derived for mound-shaped data distributions. This method is called the empirical rule. In general, empirical research is based on observations of various phenomena and provides us with knowledge and rules. One has to emphasize that Chebyshev’s Theorem can be applied to any data distribution, whereas the empirical rule is restricted to mound-shaped data distributions. Because of this restriction, for the same interval, the empirical rule gives a higher percentage than Chebyshev’s Theorem.

The Empirical Rule

If a distribution of data is approximately mound-shaped, then

  • approximately 68% of the data lie within the interval \( (\mu - \sigma,\ \mu + \sigma) \);
  • approximately 95% of the data lie within the interval \( (\mu - 2\sigma,\ \mu + 2\sigma) \); and
  • approximately 99.7% of the data lie within the interval \( (\mu - 3\sigma,\ \mu + 3\sigma) \).

Figure 2.12. Approximately mound-shaped data distribution

Example 2.12

Samantha weighed all 99 potatoes in grams collected from her garden, and recorded the data as follows:

37 38 38 39 39 40 40 40 40 41 41
41 41 41 41 42 42 42 42 42 42 42
43 43 43 43 43 43 43 43 44 44 44
44 44 44 44 44 44 45 45 45 45 45
45 45 45 45 45 46 46 46 46 46 46
46 46 46 46 47 47 47 47 47 47 47
47 47 48 48 48 48 48 48 48 48 49
49 49 49 49 49 49 50 50 50 50 50
50 51 51 51 51 52 52 53 53 54 55

We leave the computation of the mean and standard deviation of this data set to readers: \( \mu \approx 46 \) grams and \( \sigma \approx 3.8 \) grams.

As one can see, the histogram of the data is almost mound-shaped, although not quite symmetrical at the smallest and largest class intervals (fig. 2.13).

Figure 2.13. Frequency histogram constructed from data set provided in example 2.12

Let’s check if the empirical rule is true for these data.

| k | \( \mu \pm k\sigma \) | Interval | Proportion of data in the interval | Empirical rule estimation | Chebyshev’s Theorem |
|---|---|---|---|---|---|
| 1 | 46 ± 3.8 | [42.2, 49.8] | 61/99 = 61.6% | 68% | Not applicable |
| 2 | 46 ± 2∙3.8 = 46 ± 7.6 | [38.4, 53.6] | 94/99 = 94.9% | 95% | \( \left(\frac{3}{4}\right) \times 100\% = 75\% \) or more |
| 3 | 46 ± 3∙3.8 = 46 ± 11.4 | [34.6, 57.4] | 99/99 = 100% | 99.7% | \( \left(\frac{8}{9}\right) \times 100\% = 89\% \) or more |

The table above indicates good agreement between the estimates of the empirical rule and the proportions calculated for the given data. We can see slight discrepancies for the intervals [42.2, 49.8] and [34.6, 57.4], which can be explained if we consider that the histogram is not exactly mound-shaped. As could be expected, for the same intervals, Chebyshev’s Theorem gives lower percentages because its estimate is more conservative: the theorem provides a bound that holds for any shape of data distribution.
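The proportions in the table for example 2.12 can be recounted with a short Python sketch. The dictionary `counts` below is our own tally of how many times each weight appears in the data listing, and the values of the mean and standard deviation are the rounded ones quoted in the text, so this is a check of the counting rather than an exact recomputation.

```python
# Tally of the 99 potato weights (grams) from example 2.12: weight -> count
counts = {37: 1, 38: 2, 39: 2, 40: 4, 41: 6, 42: 7, 43: 8, 44: 9, 45: 10,
          46: 10, 47: 9, 48: 8, 49: 7, 50: 6, 51: 4, 52: 2, 53: 2, 54: 1, 55: 1}

# Expand the tally back into the full list of observations
weights = [w for w, c in counts.items() for _ in range(c)]

mu, sigma = 46, 3.8   # rounded values quoted in the text


def share_within(k):
    """Count and proportion of observations inside [mu - k*sigma, mu + k*sigma]."""
    low, high = mu - k * sigma, mu + k * sigma
    inside = sum(1 for w in weights if low <= w <= high)
    return inside, inside / len(weights)


for k in (1, 2, 3):
    inside, share = share_within(k)
    print(f"k = {k}: {inside}/{len(weights)} = {share:.1%}")
```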

2.4. Position Characteristics

Percentiles and Deciles

A percentage (from the Latin per centum, "by a hundred") is a number or ratio expressed as a fraction of 100. We often use percentages to describe relative amounts of things. Statisticians refer to a similar concept, the “percentile,” or more precisely the “k-th percentile.” A percentile is one kind of quantile. The k-th percentile is a score below which a given percentage, k, of data falls (exclusive definition), or a score at or below which a given percentage of data falls (inclusive definition). Hereafter, we will use the inclusive definition of the percentile. One can state that all scores lie between the 0th and 100th percentiles.

Percentiles are expressed in the same unit of measurement as the input scores; for example, if the scores refer to the heights of trees, the corresponding percentiles will be expressed in metres. Consider a 7-metre-high pine in a park. Assume that the height of this pine is at the 60th percentile of the heights of trees in that park. This means that 60% of the trees there are 7 metres tall or shorter.

Example 2.13

Sandy has 149 classmates in her statistics class. Their instructor marks a test using the grades A, B, C, D, and E, with A being the best mark and E the worst mark. To challenge students even more, the instructor, instead of posting the test marks, emailed students their percentiles in the overall standings and provided the following information: 5% of students received E, 11% received D, 18% received C, 62% received B, and 4% received A. Sandy was informed that she was at the 74th percentile of the standings for the test.

  • What mark did Sandy obtain for the test?
  • How many students’ marks were higher than Sandy’s?

Solution:

First, we have to make clear that 74 is NOT Sandy’s score for the test!

It would be helpful to draw a diagram indicating the provided information on grades and Sandy’s percentile.


Figure 2.14. Percentile diagram for example 2.13

  • Figure 2.14 shows that 5% + 11% + 18% = 34% of the students received a mark below B. Sandy’s percentile, 74, is greater than 34, so her grade is above the lower border of B. Moreover, 5% + 11% + 18% + 62% = 96% of students received a mark below A. The 74th percentile, where Sandy is standing, is between 34% and 96%. Therefore, Sandy’s grade is B.
  • Since Sandy has 149 classmates, there are 150 students in this class. Of these, 4% received A, higher than B. Since 4% of 150 equals 0.04∙150=6, we can determine that 6 of Sandy’s classmates received a higher mark than her.
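The reasoning in example 2.13 amounts to accumulating the percentages of the grades and finding the first grade whose cumulative percentage reaches Sandy’s percentile. A minimal Python sketch of this idea is shown below; the function `grade_at_percentile` and the other names are hypothetical, and the sketch assumes the inclusive definition of the percentile adopted in this section.

```python
from bisect import bisect_left

grades = ["E", "D", "C", "B", "A"]     # from the worst mark to the best
shares = [5, 11, 18, 62, 4]            # percentage of students with each grade

# Cumulative percentages: share of students at or below each grade
cumulative = []
running = 0
for share in shares:
    running += share
    cumulative.append(running)


def grade_at_percentile(p):
    """Grade of a student standing at the p-th percentile (inclusive definition)."""
    return grades[bisect_left(cumulative, p)]


class_size = 149 + 1                              # Sandy plus her classmates
students_above_b = class_size * shares[-1] // 100  # students who received A

print(cumulative)
print(grade_at_percentile(74), students_above_b)
```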

Another interesting sorting procedure is connected with the notion of a decile. We sort data into 10 equal parts, such that each part represents 1/10, or 10%, of the population or sample. Each of the nine values (the 1st through 9th deciles) that divide the sorted data into 10 equal parts is called a decile. Along with the percentile, the decile is one more possible form of a quantile. For instance, in example 2.13, Sandy’s standing (74th percentile) lies between the 7th and 8th deciles.

The position of the \( k \)th decile of a population can be determined as \( \frac{k}{10} \times (N+1) \), where \( N \) is the size of the population. Similarly, the position of the \( k \)th decile of a sample can be determined as \( \frac{k}{10} \times (n+1) \), where \( n \) is the size of the sample.

Example 2.14

A university surveyed a sample of students to determine how far they lived from the campus. The results are shown below:

| Distance (km) | Number of students |
|---|---|
| 0 to 5 | 4 |
| 5 to 10 | 15 |
| 10 to 15 | 27 |
| 15 to 20 | 18 |
| 20 to 25 | 5 |

Determine the value of the 8th decile for this data set.

Solution:

First, we need to create the frequency and cumulative frequency table.

| Distance (km) | Frequency | Cumulative frequency |
|---|---|---|
| 0 to 5 | 4 | 4 |
| 5 to 10 | 15 | 19 |
| 10 to 15 | 27 | 46 |
| 15 to 20 | 18 | 64 |
| 20 to 25 | 5 | 69 |
| Total | 69 | |

The total number of students who participated in the survey is 4 + 15 + 27 + 18 + 5 = 69. Let’s determine the position of the decile.

\[
\frac{8}{10} \times (n+1)
= \frac{8}{10} \times (69+1)
= 56
\]

Since 56 lies between the cumulative frequencies 46 and 64, this means that the 8th decile corresponds to the interval 15 to 20 km. Now we can evaluate the value of the 8th decile as follows:

\[
15 + (56 - 46) \times \frac{20 - 15}{18}
= 17.8\ \text{km}
\]
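The interpolation used in example 2.14 can be written as a small Python function. This is a sketch under the same assumptions as the worked solution: the position of the k-th decile is taken as k(n + 1)/10, and observations are treated as evenly spread within each class. The name `grouped_decile` is our own.

```python
# Classes from example 2.14: (lower limit, upper limit, frequency)
classes = [(0, 5, 4), (5, 10, 15), (10, 15, 27), (15, 20, 18), (20, 25, 5)]


def grouped_decile(table, k):
    """Value of the k-th decile for grouped data, by linear interpolation."""
    n = sum(f for _, _, f in table)
    position = k * (n + 1) / 10        # position of the k-th decile
    cumulative = 0                     # cumulative frequency before the class
    for lower, upper, f in table:
        if cumulative + f >= position:
            # interpolate inside the class that contains the position
            return lower + (position - cumulative) * (upper - lower) / f
        cumulative += f
    raise ValueError("position lies beyond the last class")


d8 = grouped_decile(classes, 8)
print(round(d8, 1))
```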

We can consider the median in terms of the percentages, as the 50th percentile. In other words, 50% of observations are less than or equal to the median. In statistics, to solve practical questions, we also use special percentiles: 25th, 50th, and 75th. Usually, we call the 25th and 75th percentiles the first quartile and the third quartile, respectively. The first quartile lies one-quarter of the way through the data. Consequently, one-quarter of the data values are less than or equal to the first quartile. Similarly, the third quartile lies three-quarters of the way through the data, and three-quarters of the data values are less than or equal to the third quartile. In this book we will denote the first and third quartiles as Q1 and Q3, respectively.

Example 2.15

Determine the median and quartiles of the following data set:

51 44 60 12 38 65 12 41 45 62 59

Solution:

First, we have to put the recorded numbers in ascending order.

12 12 38 41 44 45 51 59 60 62 65

There are 11 values, so the median is the 6th value, m = 45. The following numbers precede the median in the ordered list:

12 12 38 41 44

The middle value of these observations is 38. Therefore, Q1 = 38.

Similarly, the following observations follow the median in the ordered list:

51 59 60 62 65

and Q3 = 60.
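The method of example 2.15 (take the median, then take the medians of the lower and upper halves) can be sketched in Python as follows. The function name `quartiles_by_halves` is our own. The example covers only an odd number of observations; the handling of an even number of observations in the sketch, where the data are simply split into two equal halves, is an assumption on our part. Statistical software may use other quartile conventions and give slightly different values.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                 # observations before the median
    # For odd n the median itself is left out of both halves
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [51, 44, 60, 12, 38, 65, 12, 41, 45, 62, 59]
q1, m, q3 = quartiles_by_halves(data)
print(q1, m, q3)
```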

The Five-Number Summary and Box Plots

The difference between the third and first quartiles is called the interquartile range and is denoted as IQR = Q3 − Q1. The first and third quartiles, together with the median, maximum, and minimum of the data, form the five-number summary. The five-number summary offers a reasonably complete description of the centre and the spread of the data around the centre. The five-number summary lends itself nicely to a new type of graph, called the box plot.

Figure 2.15. Box plot

The box plot allows the viewer to easily assess the range, spread, and centre of a distribution.

Example 2.16

Construct the box plot for the data provided in example 2.15.

Solution:

First, we put the values in ascending order.

12 12 38 41 44 45 51 59 60 62 65

We have already determined the values of the median and quartiles. The minimum and maximum values of the data are 12 and 65, respectively. Therefore, the five-number summary for these data is as follows:

xmin = 12, xmax = 65, m = 45, Q1 = 38, Q3 = 60

Now we can draw the box plot.

Figure 2.16. Horizontal box plot constructed from data provided in example 2.15

Based on the constructed box plot (fig. 2.16), we can conclude that the data distribution is not symmetrical, since the median is not at the centre of distribution. Interestingly, without evaluating the mean, we can confirm that it is not equal to the median of the provided data. (Explain why.)

Box plots can be drawn vertically as well as horizontally. Figure 2.17 shows the vertical box plot constructed for the data provided in example 2.15.

Figure 2.17. Vertical box plot constructed from data provided in example 2.15

Many computer programs can generate box plots, and most of them allow users to identify outliers in the data. An observation is commonly flagged as an outlier if it lies more than 1.5 interquartile ranges below the first quartile or above the third quartile.

To identify outliers, we determine the lower fence, Q1 – 1.5 IQR, and the upper fence, Q3 + 1.5 IQR. Usually, outliers are marked using an asterisk (*).

Example 2.17

Check if the data given in example 2.15 contain outliers.

Solution:

In our example,

IQR = 60 – 38 = 22

Lower fence = 38 – 1.5 (22) = 5

Upper fence = 60 + 1.5 (22) = 93

Now we can complete the box plot by adding the lower and upper fences (dashed vertical lines) to figure 2.16.

Figure 2.18. Horizontal box plot with lower and upper fences constructed from data provided in example 2.15

Since all observations lie between the lower and upper fences, 5 and 93, the given data do not have any outliers.
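The fence calculation of example 2.17 can be reproduced with a short Python sketch. The function name `fences` and the parameter `k` (the multiplier 1.5) are illustrative assumptions; the quartiles are taken from example 2.15.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [12, 12, 38, 41, 44, 45, 51, 59, 60, 62, 65]
lower, upper = fences(q1=38, q3=60)

# An observation is flagged when it falls outside the two fences.
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)  # 5.0 93.0 []
```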

Example 2.18

Samantha recorded her store’s sales for the first 30 days of the year, in thousands of dollars:

3 10 21 3 5 5 8 14 15 24
60 14 6 6 12 23 17 3 3 17
23 6 3 4 18 11 5 11 8 7

Construct the box plot for the given data set. Check the data for outliers.

Solution:

First, let’s put the numbers in ascending order:

3 3 3 3 3 4 5 5 5 6
6 6 7 8 8 10 11 11 12 14
14 15 17 17 18 21 23 23 24 60

There are 30 values, so the median is the average of the 15th and 16th values: \( m = \frac{8 + 10}{2} = 9 \). The quartiles can be determined with the method used in example 2.15: Q1 = 5 and Q3 = 17. Consequently, IQR = 17 – 5 = 12. Now, we evaluate the lower and upper fences to check for outliers.

Lower fence = 5 – 1.5 (12) = –13

Upper fence = 17 + 1.5 (12) = 35

The observation 60 is an outlier since it is greater than the upper fence. Finally, the box plot can be constructed.

Figure 2.19. Box plot constructed for the data provided in example 2.18

Usually, we do not connect outliers with the box as in the figure above.
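The whole of example 2.18 can be verified with the following Python sketch. It assumes the median-of-halves convention of example 2.15; the variable names are our own. It also assumes the common convention that each whisker ends at the most extreme observation still inside the fences, which this chapter does not state explicitly.

```python
sales = [3, 10, 21, 3, 5, 5, 8, 14, 15, 24, 60, 14, 6, 6, 12,
         23, 17, 3, 3, 17, 23, 6, 3, 4, 18, 11, 5, 11, 8, 7]


def median(sorted_values):
    """Middle value of a sorted list (mean of the two middle values if n is even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


data = sorted(sales)
n = len(data)
half = n // 2

m = median(data)
q1 = median(data[:half])            # median of the lower half
q3 = median(data[half + n % 2:])    # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))   # assumed whisker convention

print(m, q1, q3, outliers, whiskers)  # 9.0 5 17 [60] (3, 24)
```

Under this convention, the right whisker stops at 24 and the outlier 60 is plotted as a separate point.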

 

Chapter 2 Summary

Numerical measures of data

  • Numerical measures of centre
  • Mean, median, mode
  • Measures of dispersion
  • Range, variance, standard deviation
  • Chebyshev’s Theorem
  • The empirical rule
  • Position characteristics
  • Percentiles, first quartile, and third quartile
  • The five-number summary
  • Box plots

EXERCISES

2.1. Numerical Measures of Data

  • We need a few characteristics to describe big data.
  • measures of centre: median, mean, mode
  • measures of dispersion (spread): range, standard deviation, variance, IQR
  • measures of position: percentiles, quartiles
  • outliers

 

2.2. Measures of Centre

  • Median
  • Mean
  • Mode

1. SIAST recently surveyed a sample of students to determine how far they lived from the Wascana campus. The results are shown below.

Distance (km), Number of Students

0 to 5, 4
5 to 10, 15
10 to 15, 27
15 to 20, 18
20 to 25, 16

2. These are the ages of a sample of children seen at a health clinic:

2  7  3  9  0  0  8  5  5

Calculate the following:

\[
\begin{aligned}
\text{a) } & \sum x \\
\text{b) } & \sum x^{2} \\
\text{c) } & (\sum x)^{2} \\
\text{d) } & \sum (x - 1)^{2}
\end{aligned}
\]

3. A frequency distribution is skewed to the left.

a) Is the mean greater than, equal to, or less than the median?
b) Is the mode greater than, equal to, or less than the median?

4. Listed below is the monthly percentage increase in nationwide prices of a litre of unleaded gasoline from January to April 2006:

9.4, 13.8, 11.7, 11.9

Calculate the average percentage increase during that time period.

 

5. The following is the number of minutes to commute from home to work for a group of four automobile executives.

 

Number of Minutes
28
25
48
37

Determine:

The mean of the number of minutes.

 

6. According to basic economics, as the demand for a product increases, the price will decrease. Listed below are the number of units demanded and the corresponding price.

 

Price Demand
120.00 2
90.00 5
80.00 8
70.00 12

Determine:

a) The mean of the price.

b) The median of the price.

7. Below is a sample of ages of people killed in traffic accidents in Saskatchewan in 2005.

72 44 60 14 23 19
19 5 22 30 55 29
33 31 18 17 24 66
47 16 19 26 44 30
19 31 4 20 27  

Use the statistical functions in Excel to calculate the following:
a) mean
b) median

8. The following are the ages of all children in a daycare:

 

3 3 5 7 2 3 2 8 6 3

Calculate the following statistics (you may use your calculator):
a) Mean
b) Mode
c) Median

9. Consider the following 20 scores on the Stat 100 midterm exam (out of 20).

8.8, 10.5, 7.8, 6.1, 9.1, 17.2, 9.6, 7.2, 6.6, 7.7, 9.3, 6.8, 7.6, 14.5, 16.9, 8.3, 9.9, 8.7, 9.7, 7.8

Find the mean, median, and mode for the above data.

10. The following data gives the systolic blood pressure of 80 patients at a local hospital. Estimate the mean and standard deviation using the class midpoints as representative values.
Blood Pressure (in mm Hg), Frequency

85 ≤ x < 110, 18
110 ≤ x < 135, 24
135 ≤ x < 160, 26
160 ≤ x < 185, 12

11. The final grade for a (fictional) Stat 100 course is calculated based on the following grading scheme: 3 Quizzes, worth 10% each. 1 Midterm, worth 25%. Final Exam, worth 45%.
A student receives the following scores:

Quiz 1: 7 out of 10,  Quiz 2: 10 out of 10,  Quiz 3: 9 out of 12
Midterm: 72%      Final Exam: 83%

Use a weighted mean to compute the final Stat 100 grade for this student.

2.3. Range, Variance and Standard Deviation

1. Use Question 5 (2.2) to determine:
a) the standard deviation of the number of minutes.
b) the variance of the number of minutes based on your unrounded answer to part (a).

2. Use Question 6 (2.2) to determine the standard deviation of the price.

3. For the data in Question 7 (2.2), use the statistical functions in Excel to calculate the following:

a) Range
b) Variance
c) Standard deviation

4. Use Question 8 (2.2) to calculate the following statistics (you may use your calculator):

a) Variance
b) Standard Deviation.

5. The daily high temperature in Swift Current on December 16th for a random sample of past years is as follows (in degrees Celsius):

-23; -5; -26; -10; 0; -8; -5

a) Determine the range.
b) Determine the standard deviation.

6. The following scores are from a sample of families taken from a major Canadian city. They represent the percentage of family income allotted to rent.

17.2 17.1 17.0 17.1 16.9
17.0 17.1 17.0 17.3 17.2
17.1 17.0 17.1 16.9 17.0
17.1 17.3 17.2 17.4 17.1

Determine:
a) Variance
b) Standard deviation.

7. For Question (10) in the previous section, calculate the standard deviation.

2.4. Position Characteristics
Percentiles
Quartiles
Box Plot and Outliers (Five-numbers Summary, example from PA survey)

1. Use Question 6 (2.2) to determine the 9th decile of the price.

2. Use the data from Question 9 (2.2) to:

a) Find Q1, Q3, and any outlier(s).
b) Construct a box plot for the data.

3. (6 marks) The following stem-and-leaf plot represents the yearly number of deaths from tornadoes in the USA over 15 years:

Stem | Leaf
4 | 0 0 2 3 5 5 6
5 | 1 1 1 1
6 | 0 1
7 | 9
8 | 2

Leaf unit = 1

What is the first quartile of this data set?

4. (4 marks) Given the following data set (n = 12):

0, 1, 3, 4, 4, 5, 6, 6, 6, 7, 7, 14

which has the following Five-Number Summary (do not verify this information):
Min, Q₁, Median, Q₃, Max
0, 3.25, 5.5, 6.75, 14
Construct the box plot for the data set. Are there any outliers? Must show your work!

5. If the Z-score of a data point = 4.5, is this data point considered to be an outlier?

a) Yes, because Z-score > 3
b) No, because Z-score > 3
c) We cannot answer because there is not enough information.

6. The following measurements show the average daily temperature for eleven days of July:

20.1, 20.2, 21.9, 22.6, 23.1, 23.5, 23.6, 24.4, 25.4, 25.5, 30.1

Find the five-number summary (min, first quartile, median, third quartile, max).

7. The following data give the lengths of time (in weeks) taken to find a full-time job by 18 physics majors who graduated from a small university in 2012.

12, 6, 23, 50, 25, 9, 21, 5, 14, 18, 17, 16, 20, 11, 7, 23, 8, 7

a) Construct a box and whisker plot. Be sure to properly label your box and whisker plot with the appropriate title and scale.
b) Are any of the data points considered an outlier?

8. The following measurements show average daily temperature (in degrees centigrade) for eleven days of July. Note that the data is already sorted in increasing order.

20.1, 20.2, 21.9, 22.6, 23.1, 23.5, 23.6, 24.4, 25.4, 25.5, 30.1

Find the five-number-summary for this data set and draw a box plot.

9. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) Just as the median divides the data set into two equal groups, the percentiles divide the data set into

a) Tenths
b) Two unequal parts
c) Hundredths
d) Thousandths
e) None of the above are correct.

 

License

Introductory Statistics Copyright © 2026 by Arzu Sardarli and Andrei Volodin is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.