Chapter 9. Correlation and Regression
Chapter Objectives
In this chapter, readers will learn to do the following:
- Define and specify bivariate data
- Specify a scatter plot and equation of two linearly dependent variables
- Evaluate and interpret the linear correlation coefficient
- Evaluate the goodness of fit of a line to a set of pairs
- Determine the equation of the regression line using the least squares method
- Evaluate and interpret the coefficient of determination
We have no doubt that many of our readers own a car. If you have ever checked advertisements for used cars, you might have noticed how sellers tend to describe cars. Here is a descriptive table of vehicles we constructed using the information obtained from the car advertisement website CarGurus on November 21, 2023.
Table 9.1
| Vehicle # | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Make | Lincoln | Jeep | Ford | Hyundai | Volkswagen | BMW | Hyundai |
| Model | MKX | Grand Cherokee | Edge | Kona Electric | Tiguan | X5 M | Santa Fe Sport |
| Year | 2019 | 2018 | 2018 | 2022 | 2021 | 2021 | 2016 |
| Mileage, km | 88,583 | 104,633 | 150,801 | 28,289 | 35,136 | 100,160 | 201,448 |
| Engine power, hp | 303 | 295 | 245 | 201 | 184 | 567 | 290 |
| Colour | Grey | Rhino Clearcoat | White | Dive In Jeju | Silver | Grey | White |
| Transmission | AWD | 4WD | AWD | FWD | AWD | AWD | AWD |
| Location | Saskatoon | Prince Albert | Saskatoon | North Battleford | Saskatoon | Regina | Saskatoon |
| Price, CAD | 33,481 | 35,500 | 23,495 | 33,988 | 40,990 | 47,990 | 13,999 |
Table 9.1 presents randomly selected SUVs offered by sellers in Saskatchewan. Each SUV is described in this table with reference to nine variables; some (year, mileage, engine power, price) are quantitative, and some (make, model, colour, transmission, location) are qualitative. In statistics, these types of data are called multivariate. Some of the variables may depend on each other. For instance, most likely, the older the car, the higher the mileage. Overall, older cars with higher mileage have lower prices. However, one has to note that the prices of cars depend on other variables, such as engine power, transmission, and model. In general, analyzing relationships between variables characterizing features of things, systems, and processes is an important branch of mathematics and statistics. The variables whose variation depends on the variation of others are called dependent. The variables whose variation presumably does not depend on others are called independent. Philosophically, the definition of dependence and independence is more complex (and sometimes arguable) than we have provided here. However, it can be considered sufficient for our understanding of statistical modelling of the dependencies between variables.
Based on our experience and common sense, we know that the prices of cars depend on various properties, such as the year of production, engine power, mileage, etc. If we study how the price depends on the year of production, the year of production is called the independent variable, and the price, which depends on the year of production, is called the dependent variable. Location, make, and the other properties could serve as independent variables as well. In this chapter, we will analyze the relationship between two variables and solve some statistics problems based on this analysis.
9.1 Relationship Between Two Variables: Linear Dependence
In statistics, data described by several variables are called multivariate. Data described by exactly two variables are called bivariate. In chapter 1 of this book, we discussed using graphs to describe data with one variable. Typically, scatter plot graphs are considered more convenient for presenting bivariate data. We hope readers remember from high school how to plot points on the xy coordinate system using their coordinates. The graphs below represent various bivariate data sets. The first coordinate, x, stands for the independent variable, and the second coordinate, y, for the dependent variable. Below, we will explain how x and y can be assigned for real data.
Table 9.2(a)
| x | y |
| --- | --- |
| 1 | 15 |
| 2 | 12 |
| 2 | 16 |
| 3 | 12 |
| 4 | 11 |
| 5 | 8 |
| 6 | 5 |

(a)
Table 9.2(b)
| x | y |
| --- | --- |
| -4 | 18 |
| -2.5 | 16 |
| -2 | 4 |
| -1 | 7 |
| 0 | 1 |
| 1 | 6 |
| 2 | 4 |
| 3 | 14 |
| 4 | 16 |
(b)
Table 9.2(c)
| x | y |
| --- | --- |
| 1.5 | 18 |
| 2 | 16 |
| 3 | 4 |
| 1.5 | 10 |
| 2.5 | 1 |
| 2 | 4 |
| 3 | 14 |
| 2.5 | 20 |
| 2.25 | 10 |
(c)
Figure 9.1. Scatter plot graphs for data provided in tables 9.2(a), 9.2(b), and 9.2(c)
As one can see, the scatter plot graphs make it easier to understand the features of the dependence between two variables. For example, we can conclude that in figure 9.1(a), the dots lie around a line, and y decreases with the increase of x. The dependence y vs. x presented in figure 9.1(b) looks more complex; for negative values of x, y decreases with the increase of x, whereas for positive values of x, y increases with the increase of x. Unlike figures 9.1(a) and 9.1(b), the graph in figure 9.1(c) does not present any dependence between x and y; the dots do not follow any patterns and look like a cloud.
Mathematicians developed various methods to construct the best-fitting curves for the provided bivariate data set. We have no doubt that many readers are familiar with graphing programs, such as Excel, Baserow, or Google Sheets, which are very useful for graphing the best-fitting curves. Figure 9.2 shows the Excel-made scatter plot graph and the best-fitting line for the bivariate data set provided in table 9.2(a).

Figure 9.2. The scatter plot graph and best-fitting line for data provided in table 9.2(a)
Visually, we can conclude that on this graph, the dots corresponding to the provided bivariate data lie approximately on a line. Later in this chapter, we will show how to construct a formula to describe this dependence. In mathematics, this type of dependence is classified as linear.
Now let’s consider the relation between the years of production and prices of vehicles from table 9.1 and construct another table.
Table 9.3
| Vehicle # | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Year | 2019 | 2018 | 2018 | 2022 | 2021 | 2021 | 2016 |
| Price, CAD | 33,481 | 35,500 | 23,495 | 33,988 | 40,990 | 47,990 | 13,999 |
Again, the scatter plot graph could help us to explore the relation between the variables, prices vs. years.

Figure 9.3. The scatter plot graph and best-fitting line for data provided in table 9.3
The graph in figure 9.3 clearly shows that newer vehicles are more expensive than older ones, which makes perfect sense based on our life experience. We also can draw a line that approximately fits the dependence prices vs. years. In other words, we can conclude that the prices of vehicles approximately linearly depend on years of production. We hope readers remember from high school that if a variable y depends on variable x linearly, then the dependence can be described by the equation
y=a+bx (F9.1)
where a and b are constants. Geometrically, a and b determine the y-intercept and slope of the line, respectively. The line rises when b>0 (fig. 9.4[a]) and falls when b<0 (fig. 9.4[b]).

(a) (b)
Figure 9.4. Lines with positive (a) and negative (b) slopes
In this book, we will talk about bivariate data, where the dependence between variables can be classified as linear.
9.2 Linear Correlation Coefficient
Consider the scatter plot graphs below:
(a)
(b)
(c)
(d)
Figure 9.5. Scatter plot graphs for various bivariate data
One can see that all dots on the graph presented in figure 9.5(a) lie on a straight line, which could be described by the equation (F9.1). We can also state that in this equation, a equals – 5, and b is a positive number because the line is rising with x. Although not all dots on figure 9.5(b) lie on one line, we could still sketch a line that could fit the observed patterns. Later, we will learn how to determine the constants of the equation (F9.1) that provide the best fit. From the graph figure 9.5(b), we can state that for this pattern, b must be a negative number (explain why). The dots presented in figure 9.5(c) form a pattern, which also could be somehow described by a linear equation. It would be reasonable to expect that the accuracy of this estimation will be lower than for the data presented in figure 9.5(b). Figure 9.5(d) represents a strong dependence between the x and y variables. Definitely, this dependence is not linear. Readers familiar with calculus can identify the best-fitting curve of these data as a parabola.
In chapter 2, we learned how to characterize univariate data using numerical measures. In statistics, we use the so-called correlation coefficient to describe the dependence between two variables. In this book, we will define the linear correlation coefficient to analyze dependences that can be classified as linear within some accuracy. The linear correlation coefficient is defined as
$$
r = \frac{s_{xy}}{s_x s_y} \tag{F9.2}
$$
In this formula, \( s_x \) and \( s_y \) represent the standard deviations of the \( x \) and \( y \) variables, respectively. In Chapter 2, we defined these quantities and learned how to evaluate them (see F2.8, F2.8a). Intuitively, we can assume that \( s_{xy} \) would contain information about both variables. This quantity is called covariance between \( x \) and \( y \) variables and can be evaluated as
$$
s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \tag{F9.3}
$$
where \( x_i \) and \( y_i \) are \( x \) and \( y \) values of the \( i \)-th measurements, respectively, and where \( \bar{x} \) and \( \bar{y} \) are sample means of \( x \) and \( y \) values, respectively.
Sometimes, it is more convenient to use the so-called computing formula:
$$
s_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{n - 1} \tag{F9.3a}
$$
where \( \sum x_i \) and \( \sum y_i \) are the sums of \( x \) and \( y \) values, respectively, and where \( \sum x_i y_i \) is the sum of the products of \( x \) and \( y \) values of the \( i \)-th measurements.
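As a quick check that (F9.3) and (F9.3a) agree, here is a minimal Python sketch that evaluates both on the data of table 9.2(a). The function names are illustrative choices, not part of the text.

```python
# Data from table 9.2(a).
x = [1, 2, 2, 3, 4, 5, 6]
y = [15, 12, 16, 12, 11, 8, 5]


def covariance_definition(xs, ys):
    """Sample covariance from deviations about the means (F9.3)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / (n - 1)


def covariance_computing(xs, ys):
    """Sample covariance from raw sums (F9.3a)."""
    n = len(xs)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / (n - 1)


s_xy_def = covariance_definition(x, y)
s_xy_comp = covariance_computing(x, y)
print(round(s_xy_def, 4), round(s_xy_comp, 4))  # both round to -6.4286
```

The negative sign matches the falling pattern of the dots in figure 9.1(a).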
Below, we provide values of the linear correlation coefficient calculated for the data provided in figure 9.5:
| Data | Linear correlation coefficient, r |
| --- | --- |
| Fig. 9.5(a) | 1 |
| Fig. 9.5(b) | -0.95 |
| Fig. 9.5(c) | 0.69 |
| Fig. 9.5(d) | 0 |
By inspection, we can conclude that r=±1 when all dots lie on one line. The stronger the linear dependence between y and x, the closer the value of r is to 1 or -1, that is, the closer its absolute value is to 1. As the linear dependence becomes weaker, the value of r approaches 0. The linear correlation coefficient, evaluated for the data shown in figure 9.5(d), exactly equals 0. We would like to emphasize that a zero value of the linear correlation coefficient does not mean that y and x are independent; it only shows that there is no linear dependence, and the dependence can still be non-linear.
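The last point can be illustrated with a small Python sketch. The data below are hypothetical (they are not the data behind figure 9.5[d]): y is fully determined by x through y = x², yet the linear correlation coefficient evaluates to zero.

```python
from math import sqrt


def linear_correlation(xs, ys):
    """Linear correlation coefficient r = s_xy / (s_x * s_y), as in (F9.2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in xs) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in ys) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical parabolic data, symmetric about x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = linear_correlation(x, y)
print(r)  # zero: no linear dependence, although y depends on x exactly
```

The positive and negative products in the covariance cancel because the parabola is symmetric about the mean of x.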
It can be proved that -1≤r≤1. Based on the value of r, by convention, the dependence y vs. x is classified as follows:
Table 9.4
| Linear correlation coefficient, r | Interpretation |
| --- | --- |
| \( -1 \le r < 0 \) | Negative direction (y decreases as x increases) |
| \( 0 < r \le 1 \) | Positive direction (y increases as x increases) |
| \( r = 0 \) | Uncorrelated |
| \( 0 < \lvert r \rvert < 0.2 \) | Very weak |
| \( 0.2 \le \lvert r \rvert < 0.4 \) | Weak |
| \( 0.4 \le \lvert r \rvert < 0.6 \) | Moderate |
| \( 0.6 \le \lvert r \rvert < 0.8 \) | Strong |
| \( 0.8 \le \lvert r \rvert < 1 \) | Very strong |
| \( \lvert r \rvert = 1 \) | All points lie on a straight line |
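The convention in table 9.4 can be written as a small lookup function. This is only a sketch; the function name and the wording of the returned labels are illustrative choices.

```python
def describe_correlation(r):
    """Return the table 9.4 interpretation of a linear correlation coefficient."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size == 0:
        return "uncorrelated"
    direction = "positive" if r > 0 else "negative"
    if size == 1:
        strength = "all points lie on a straight line"
    elif size >= 0.8:
        strength = "very strong"
    elif size >= 0.6:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size >= 0.2:
        strength = "weak"
    else:
        strength = "very weak"
    return f"{strength}, {direction}"


# The four coefficients reported for figure 9.5.
for r in (1, -0.95, 0.69, 0):
    print(r, "->", describe_correlation(r))
```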
Example 9.1.
In 2008, within our community-based research project, we asked Indigenous Elders and Knowledge Keepers to evaluate water quality using Indigenous Knowledge (A. Sardarli, Use of Indigenous Knowledge in Modeling the Water Quality Dynamics in Peepeekisis and Kahkewistahaw First Nations Communities, Pimatiswin: A Journal of Aboriginal and Indigenous Community Health 11(1), 2013, 55-63). Elders told us that the number of eggs in shorebird nests depends on the water quality. Biological research shows that egg-laying females need calcium to help form strong eggshells (https://saratogasprings.wbu.com/calcium-for-birds). A biologist measured the amount of calcium in the water taken from a local lake and counted the average number of eggs in a dunlin nest on the shore of this lake over five consecutive years (table 9.4[a]).
Table 9.4(a)
| Years | Amount of Calcium (in milligrams per 100 grams of water) | Number of eggs in a nest |
| --- | --- | --- |
| 2008 | 3.2 | 2 |
| 2009 | 4 | 4 |
| 2010 | 2 | 1 |
| 2011 | 4 | 3 |
| 2012 | 3.5 | 4 |
Determine the linear correlation coefficient between the amount of calcium and the number of eggs in a dunlin nest.
Solution:
First, you need to calculate the required squares, products, and sums to determine the standard deviations and covariance between x (amount of calcium) and y (number of eggs).
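| Amount of Calcium (x) | Number of eggs in one nest (y) | \( x^2 \) | \( y^2 \) | \( xy \) |
| --- | --- | --- | --- | --- |
| 3.2 | 2 | 10.24 | 4 | 6.4 |
| 4 | 4 | 16 | 16 | 16 |
| 2 | 1 | 4 | 1 | 2 |
| 4 | 3 | 16 | 9 | 12 |
| 3.5 | 4 | 12.25 | 16 | 14 |
| \( \sum x_i = 16.7 \) | \( \sum y_i = 14 \) | \( \sum x_i^2 = 58.49 \) | \( \sum y_i^2 = 46 \) | \( \sum x_i y_i = 50.4 \) |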
We can evaluate \( s_x \), \( s_y \), and \( s_{xy} \) using formulae (F2.8a) and (F9.3a), considering that \( n = 5 \).
$$
s_x = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}}
= \sqrt{\frac{58.49 - \frac{(16.7)^2}{5}}{5 - 1}}
= 0.82
$$
$$
s_y = \sqrt{\frac{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}{n - 1}}
= \sqrt{\frac{46 - \frac{14^2}{5}}{5 - 1}}
= 1.30
$$
$$
s_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{n - 1}
= \frac{50.4 - \frac{(16.7)(14)}{5}}{5 - 1}
= 0.91
$$
Hence,
$$
r = \frac{s_{xy}}{s_x s_y}
= \frac{0.91}{(0.82)(1.30)}
= 0.85
$$
Based on this value of the linear correlation coefficient, we can suggest that the number of eggs in dunlins’ nests depends on the amount of calcium in the water; the higher the calcium amount, the more eggs will be observed in nests. The scatter plot graph of the data also supports this conclusion (fig. 9.6).

Figure 9.6. Scatter plot graph for data provided in example 9.1
Since the value of the linear correlation coefficient is fairly close to 1, we can conclude that the number of eggs in one dunlin nest shows a very strong positive linear dependence on the amount of calcium in the water (see the classification in table 9.4). One can sketch a straight line that would fit the dots presented in figure 9.6. Below, we will explain how to construct the equation of the best-fitting line. Generally speaking, we will build a mathematical model of the dependence for the provided bivariate data.
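The hand calculation in Example 9.1 can be reproduced with a short Python sketch that follows the computing formulas (F2.8a) and (F9.3a). Variable names are illustrative.

```python
from math import sqrt

# Data from table 9.4(a).
calcium = [3.2, 4, 2, 4, 3.5]  # x, mg per 100 g of water
eggs = [2, 4, 1, 3, 4]         # y, eggs per nest

n = len(calcium)
sum_x = sum(calcium)
sum_y = sum(eggs)
sum_x2 = sum(xi ** 2 for xi in calcium)
sum_y2 = sum(yi ** 2 for yi in eggs)
sum_xy = sum(xi * yi for xi, yi in zip(calcium, eggs))

# Standard deviations and covariance from the raw sums.
s_x = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

r = s_xy / (s_x * s_y)
print(round(s_x, 2), round(s_y, 2), round(s_xy, 2), round(r, 2))
# 0.82 1.3 0.91 0.85
```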
9.3 Modelling the Linear Dependence
Let’s consider two lines given by the equations \( \hat{y}' = -2 + 1.5x \) and \( \hat{y}'' = -5 + 2.4x \). As one can see from figure 9.7, the graphs of both lines roughly fit the dots obtained from the data provided in example 9.1.

Figure 9.7. Scatter plot graph with several variously fitting lines for data provided in example 9.1
For a researcher, it is reasonable to determine which of these lines fits the dots better. One has to note that a conclusion based on visual estimation is subjective and unreliable. In statistics, we use a more robust technique to evaluate the goodness of fit of a line to a set of pairs \( (x_i, y_i) \), where \( i = 1, 2, \ldots, n \) and \( n \) is the sample size.
Assume that the line fitting this data set can be given by the equation \( \hat{y} = a + bx \). Then the error of the \( y \) value for the \( i \)-th pair will be \( \hat{y}_i - y_i \), where \( \hat{y}_i = a + b x_i \). The goodness of fit of the line \( \hat{y} = a + bx \) to the set of \( n \) pairs \( (x_i, y_i) \) is based on the quantity
\[
\sum_{i=1}^{n} (\hat{y}_i - y_i)^2
\]
Since this quantity represents the sum of squares of errors for each pair, we try to minimize it in order to construct the fitting line.
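To make the quantity concrete, the following Python sketch evaluates the sum of squared errors for two candidate lines on the data of table 9.2(a). Both candidate lines are hypothetical and were picked by eye for illustration only.

```python
# Data from table 9.2(a).
x = [1, 2, 2, 3, 4, 5, 6]
y = [15, 12, 16, 12, 11, 8, 5]


def sum_of_squared_errors(a, b, xs, ys):
    """Sum of (y_hat_i - y_i)^2 for the line y_hat = a + b*x."""
    return sum((a + b * xi - yi) ** 2 for xi, yi in zip(xs, ys))


# Two hypothetical candidate lines: y_hat = 18 - 2x and y_hat = 15 - x.
sse_first = sum_of_squared_errors(18, -2, x, y)
sse_second = sum_of_squared_errors(15, -1, x, y)
print(sse_first, sse_second)  # 11 31
```

The first line has the smaller sum of squared errors, so of these two candidates it fits the data better.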
Now, based on the value of the goodness of fit, let us determine which of the lines \( \hat{y}' = -2 + 1.5x \) or \( \hat{y}'' = -5 + 2.4x \) fits better to the data set provided in Example 9.1.
Goodness Table
| \( x_i \) | \( y_i \) | \( \hat{y}'_i = -2 + 1.5x_i \) | \( \hat{y}''_i = -5 + 2.4x_i \) | \( \hat{y}'_i - y_i \) | \( \hat{y}''_i - y_i \) | \( (\hat{y}'_i - y_i)^2 \) | \( (\hat{y}''_i - y_i)^2 \) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3.2 | 2 | 2.8 | 2.68 | 0.8 | 0.68 | 0.64 | 0.4624 |
| 4 | 4 | 4 | 4.6 | 0 | 0.6 | 0 | 0.36 |
| 2 | 1 | 1 | -0.2 | 0 | -1.2 | 0 | 1.44 |
| 4 | 3 | 4 | 4.6 | 1 | 1.6 | 1 | 2.56 |
| 3.5 | 4 | 3.25 | 3.4 | -0.75 | -0.6 | 0.5625 | 0.36 |
| Goodness | | | | | | 2.2025 | 5.1824 |

Comparing the goodness values of the lines shown in the table above (the smaller the sum of squared errors, the better the fit), we can conclude that the line
\[
\hat{y}' = -2 + 1.5x
\]
fits better to the bivariate data provided in Example 9.1.
Obviously, we may try different lines that would fit the provided data to some degree. Now, our goal is to determine the equation of the best-fitting line. In statistics, we use the so-called least squares method to determine this equation, and the line corresponding to this equation is called the least squares regression line.
The least squares method has a wide array of applications in various disciplines, including mathematics, physics, and economics. The detailed analysis of this method requires some specific mathematical knowledge and is beyond the objectives of this book. Below, we show how to construct the equation of the regression line using the least squares method.
Assume that bivariate data
\[
(x_i, y_i), \quad i = 1, 2, \ldots, n
\]
are given. Also, \( s_x \) and \( s_y \) are the standard deviations of the x and y sets, respectively, and \( r \) is the corresponding linear correlation coefficient.
It can be shown that the regression line relating to these bivariate data can be described by the equation
\[
\hat{y} = a + bx \tag{F9.4}
\]
where
\[
b = r \frac{s_y}{s_x} \tag{F9.5}
\]
and
\[
a = \bar{y} - b \bar{x} \tag{F9.6}
\]
considering that \( \bar{x} \) and \( \bar{y} \) are sample means of x and y sets, respectively.
Let us determine the equation of the best-fitting line for the data provided in Example 9.1. We already calculated the standard deviations and the linear correlation coefficient:
\[
s_x = 0.82, \quad s_y = 1.30, \quad r = 0.85
\]
The means can be determined as
\[
\bar{x} = \frac{\sum x_i}{n} = \frac{16.7}{5} = 3.34
\]
and
\[
\bar{y} = \frac{\sum y_i}{n} = \frac{14}{5} = 2.8
\]
Hence,
\[
b = r \frac{s_y}{s_x} = 0.85 \cdot \frac{1.30}{0.82} = 1.35
\]
and
\[
a = \bar{y} - b\bar{x} = 2.8 - 1.35 \cdot 3.34 = -1.71
\]
Therefore, the equation of the best-fitting line (regression line) is
\[
\hat{y} = -1.71 + 1.35x
\]
The goodness of the regression line can be evaluated over the same five pairs as shown above and equals approximately 1.91. For the two other lines, the values were approximately 2.20 and 5.18. This is reasonable because there exists only one best-fitting line, and its goodness value is the smallest.
Therefore, we can state that the line given by the equation
\[
\hat{y} = -1.71 + 1.35x
\]
fits best to describe the linear dependence between the calcium amount in water and the number of eggs in one nest. The graph below visually supports this conclusion.

Figure 9.8. Scatter plot graph with various fitting lines and the best-fitting line for data provided in Example 9.1
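The recipe in (F9.4) to (F9.6) can be sketched in Python for the data of Example 9.1. The function name is an illustrative choice. Note that the sketch keeps full precision throughout, so the slope and intercept come out near 1.34 and -1.68 rather than the 1.35 and -1.71 obtained above from intermediate values rounded to two decimals.

```python
from math import sqrt

# Data from table 9.4(a).
calcium = [3.2, 4, 2, 4, 3.5]  # x
eggs = [2, 4, 1, 3, 4]         # y


def least_squares_line(xs, ys):
    """Return (a, b) of the regression line y_hat = a + b*x via (F9.5), (F9.6)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in xs) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in ys) / (n - 1))
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / (n - 1)
    r = s_xy / (s_x * s_y)
    b = r * s_y / s_x      # slope, (F9.5)
    a = y_bar - b * x_bar  # intercept, (F9.6)
    return a, b


a, b = least_squares_line(calcium, eggs)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = -1.68 + 1.34x
```

The small gap between the two sets of coefficients is purely a rounding effect; both describe essentially the same line over the observed range of x.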
Example 9.2
A literature review shows that the academic performance of post-secondary students is affected by their engagement in social media (Shafiq, M., Parveen, K., Social media usage: Analyzing its effect on academic performance and engagement of higher education students, International Journal of Educational Development, 98, 2023). A social scientist conducted case studies to analyze the relationship between the 4.0-scale grade point average (GPA) of seven randomly selected students and the daily hours that they spend communicating via social media within an academic year (table 9.5).
Table 9.5
| Students | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Hours | 1.2 | 3 | 0.5 | 2 | 3 | 2 | 3 |
| GPA | 3.8 | 2.4 | 4 | 2.4 | 1 | 3.2 | 1 |
- (a) Determine the linear correlation coefficient between daily hours spent by a student on social media and his/her GPA.
- (b) Construct the regression line equation for the dependence GPA vs. hours if the linear correlation exists.
- (c) What GPA is expected for a student spending 1.5 hours daily on social media?
- (d) What GPA is expected for a student spending 5 hours daily on social media?
- (e) If a student’s GPA is 3.0, how many hours daily does he/she presumably spend communicating via social media?
Solution:
First, let’s construct the table with essential quantities to evaluate the linear correlation coefficient.
| | Hours (x) | GPA (y) | \( x_i^2 \) | \( y_i^2 \) | \( x_i y_i \) |
| --- | --- | --- | --- | --- | --- |
| | 1.2 | 3.8 | 1.44 | 14.44 | 4.56 |
| | 3 | 2.4 | 9 | 5.76 | 7.2 |
| | 0.5 | 4 | 0.25 | 16 | 2 |
| | 2 | 2.4 | 4 | 5.76 | 4.8 |
| | 3 | 1 | 9 | 1 | 3 |
| | 2 | 3.2 | 4 | 10.24 | 6.4 |
| | 3 | 1 | 9 | 1 | 3 |
| ∑ | 14.7 | 17.8 | 36.69 | 54.2 | 30.96 |
- (a)
\[
s_x = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}}
= \sqrt{\frac{36.69 - \frac{(14.7)^2}{7}}{7 - 1}}
= 0.9849
\]
\[
s_y = \sqrt{\frac{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}{n - 1}}
= \sqrt{\frac{54.2 - \frac{(17.8)^2}{7}}{7 - 1}}
= 1.2205
\]
\[
s_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{n - 1}
= \frac{30.96 - \frac{(14.7)(17.8)}{7}}{7 - 1}
= -1.07
\]
Therefore,
\[
r = \frac{s_{xy}}{s_x s_y}
= \frac{-1.07}{(0.9849)(1.2205)}
= -0.89
\]
- (b) The means can be determined as
\[
\bar{x} = \frac{\sum x_i}{n} = \frac{14.7}{7} = 2.1
\]
and
\[
\bar{y} = \frac{\sum y_i}{n} = \frac{17.8}{7} = 2.54
\]
Hence,
\[
b = r \frac{s_y}{s_x}
= -0.89 \left(\frac{1.2205}{0.9849}\right)
= -1.10
\]
and
\[
a = \bar{y} - b\bar{x}
= 2.54 - (-1.10)(2.1)
= 4.85
\]
Therefore, the equation of the regression line is
\[
y = 4.85 - 1.10x
\]
- (c) Given: x=1.5. Then y=4.85-1.10∙1.5=3.2. Therefore, a student spending 1.5 hours daily on social media can expect a GPA of about 3.2.
- (d) Given: x=5. Then y=4.85-1.10∙5=-0.65. This result is meaningless since the GPA cannot be negative. This outcome leads us to an important note about one of the restrictions on the application of the least squares regression method. NOTE: The regression line equation can be applied only within the given range of the independent variable. In our example, the daily hours ranged from 0.5 to 3. This means that we cannot use the regression line equation to estimate the GPA of a student spending 5 hours daily on social media.
- (e) Given: y=3.0. Consequently, 3.0=4.85-1.10∙x. Solving this equation, we get x=1.68. Therefore, most likely, a student with a 3.0 GPA spends about 1.68 hours daily on social media.
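The whole of Example 9.2, including the range restriction from the note, can be sketched in Python as follows. The helper names are illustrative, and returning `None` for an out-of-range x is one possible design choice, not a standard convention. Full precision is kept, so the intercept is close to 4.86 rather than the 4.85 obtained from rounded values.

```python
from math import sqrt

# Data from table 9.5.
hours = [1.2, 3, 0.5, 2, 3, 2, 3]  # x
gpa = [3.8, 2.4, 4, 2.4, 1, 3.2, 1]  # y


def fit_line(xs, ys):
    """Return (a, b, r) for the regression line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in xs) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in ys) / (n - 1))
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / (n - 1)
    r = s_xy / (s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b, r


def predict_within_range(x_new, a, b, xs):
    """Predict y only if x_new lies within the observed x range; else None."""
    if not min(xs) <= x_new <= max(xs):
        return None
    return a + b * x_new


a, b, r = fit_line(hours, gpa)
gpa_at_1_5 = predict_within_range(1.5, a, b, hours)
gpa_at_5 = predict_within_range(5, a, b, hours)
hours_for_gpa_3 = (3.0 - a) / b  # solve 3.0 = a + b*x for x

print(round(r, 2), round(gpa_at_1_5, 1), gpa_at_5, round(hours_for_gpa_3, 2))
```

Refusing to predict outside the interval from 0.5 to 3 hours prevents the negative GPA obtained in part (d).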
Example 9.3
Finding the equation of the regression line must not be considered just a mathematical exercise. Let’s go back to our example of vehicles. Based on the data provided in table 9.3, readers can easily construct the equation of the regression line for the dependence between the years of production and prices of vehicles:
y=4133.5x-8314003.0
where x is the year of the production of a vehicle, y is its price in Canadian dollars.
Using this equation, sales managers and potential buyers can answer many practical questions and make projections, which are beneficial for any business. Let’s consider two cases: one for a sales manager and another for a person who is looking for a car within his/her budget.
Case 1. Assume that the manager has to offer a vehicle made in 2020. Using the equation derived above, he/she can calculate a reasonable and competitive price: One just has to calculate y in that equation for x=2020:
y=4133.5∙2020-8314003.0=35667
Hence, $35,667 would be a good price for the vehicle produced in 2020.
Case 2. Let’s find an appropriate vehicle for the buyer whose budget does not exceed $25,000. This is the value of y in the regression line’s equation above, and we need to solve the equation
25000=4133.5x-8314003.0
to determine the projected year of the production of the suitable vehicle. The solution yields approximately 2017. In other words, having $25,000 in his/her pocket, this buyer most likely needs to check the advertisements for vehicles made in 2017 and earlier.
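Both cases can be checked with a short Python sketch that fits the regression line to the data of table 9.3 and then uses it in each direction. Names are illustrative. The slope is computed as the ratio of the summed products of deviations to the summed squared deviations of x, which is algebraically the same as (F9.5). Because x is about 2,000, rounding the slope to 4133.5 changes the predicted price by roughly 60 dollars, so the unrounded prediction for 2020 is about 35,730 rather than 35,667.

```python
# Data from table 9.3.
years = [2019, 2018, 2018, 2022, 2021, 2021, 2016]               # x
prices = [33481, 35500, 23495, 33988, 40990, 47990, 13999]  # y, CAD


def least_squares_line(xs, ys):
    """Return (a, b) of the regression line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b = s_xy / s_xx  # equals r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(years, prices)

# Case 1: projected price for a vehicle made in 2020.
price_2020 = a + b * 2020

# Case 2: production year matching a 25,000 CAD budget.
year_for_budget = (25000 - a) / b

print(round(b, 1), round(a, 1))
print(round(price_2020), round(year_for_budget, 1))
```

Both 2020 and the computed year near 2017 lie inside the observed range of years (2016 to 2022), so the restriction noted in Example 9.2 is respected.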
Returning to the vehicle example, we would like to note again that a certain variable (such as the price of cars) may depend on more than one variable, and these dependencies may be more complex than linear. In statistics, we use multivariate non-linear regression analysis to construct these types of dependencies. Explanation of multivariate non-linear regression analysis methods is beyond the objectives of this book.
The Coefficient of Determination
Above, we learned how to estimate the goodness of fit of a line to the given data. It is understandable that not all observed values of \( y_i \) can be determined by the regression line equation.
In statistics, the proportion of the variation in \( y \) that can be explained by the linear regression between \( y \) and \( x \) is called the coefficient of determination. It can be shown that this proportion equals the square of the linear correlation coefficient, \( r^2 \). Often, statisticians use the term “R-squared” and express this quantity in percentages.
For instance, the coefficient of determination between the hours spent on social media by students and their GPA (Example 9.2) can be evaluated as
\[
r^2 = (-0.89)^2 = 0.79
\]
or 79%.
In other words, 79% of the variation of \( y \) can be explained by the regression line equation
\[
y = 4.85 - 1.10x
\]
Intuitively, we can conclude that the coefficient of determination is also a measure of the strength of the relationship between \( x \) and \( y \).
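The statement that the explained proportion equals \( r^2 \) can be checked numerically. The Python sketch below uses the data of Example 9.2 and compares \( r^2 \) with one minus the ratio of the sum of squared errors of the regression line to the total sum of squared deviations of y from its mean. This second expression is a standard way to write the explained proportion; the variable names are illustrative.

```python
# Data from table 9.5.
hours = [1.2, 3, 0.5, 2, 3, 2, 3]  # x
gpa = [3.8, 2.4, 4, 2.4, 1, 3.2, 1]  # y

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(gpa) / n

s_xx = sum((xi - x_bar) ** 2 for xi in hours)
s_yy = sum((yi - y_bar) ** 2 for yi in gpa)  # total variation of y
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(hours, gpa))

# Squared linear correlation coefficient.
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Regression line and its sum of squared errors.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((a + b * xi - yi) ** 2 for xi, yi in zip(hours, gpa))

explained_share = 1 - sse / s_yy
print(round(r_squared, 2), round(explained_share, 2))  # 0.79 0.79
```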
Chapter 9 Summary
- Relationship between two variables: Linear dependence
- Linear correlation coefficient
- Modelling the linear dependence
- Equation of the best-fitting line
- Least squares regression line
- Coefficient of determination
EXERCISES
Relationship between two variables. Linear dependence
1. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) Regression analysis is a statistical procedure for developing a mathematical equation which describes how
a) One independent and one or more dependent variables are related
b) Several independent and several dependent variables are related
c) One dependent and one or more independent variables are related
d) None of the above answers are correct
2. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) In a simple regression analysis (where Y is a dependent and X is an independent variable) if the Y intercept is positive, then
a) There is a positive correlation between X and Y
b) There is a negative correlation between X and Y
c) If X increased, Y must also increase
d) If Y increased, X must also increase
e) none of the above answers are correct
3. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) A procedure used for finding the equation of a straight line which provides the best approximation for the relationship between the independent and dependent variables is the
a) Correlation analysis
b) Mean squares method
c) Least squares method
d) Most squares method
e) none of the above answers are correct
4. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) In the regression analysis, the variable that is being predicted
a) Must have the same units as the variable doing the predicting
b) Is the independent variable
c) Is the dependent variable
d) Usually is denoted by X
e) None of the above answers are correct
5. (Introduction to statistics, 2nd Ed, Test Bank, Anderson, D. R., Sweeney, D.J., Williams, T.A, 1991) A regression analysis between sales (in $1000) and advertising (in dollars) resulted in the following equation:
\[
\hat{Y} = 20000 + 6X
\]
The above equation implies that:
a) An increase of $6 in advertising is associated with an increase of $6000 in sales
b) An increase of $1000 in advertising is associated with an increase of $6000 in sales
c) An increase of $1 in advertising is associated with an increase of $26000 in sales
d) An increase of $1 in advertising is associated with an increase of $6000 in sales
e) None of the above answers are correct
Bivariate data refers to a type of statistical data that involves pairs of observations or measurements on two different variables.
The linear correlation coefficient, denoted by r, is a statistical measure that quantifies the strength and direction of a linear relationship between two variables.
The goodness of fit of a line is a measure used to assess how well a particular line (such as a regression line) fits a set of data points.
In statistics, a regression line is a straight line that best represents the relationship between a dependent variable (usually denoted as y) and one or more independent variables (usually denoted as x). The regression line is used to predict the value of the dependent variable based on the values of the independent variables.
The coefficient of determination, often denoted as \( R^2 \), is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model. It is a key metric used to evaluate the goodness of fit of a regression model.
The best-fitting line, also known as the regression line, is a line that represents the relationship between two variables in a data set in the best possible way.