Chapter 12. Linear Regression and Correlation

12.1. Linear Regression and Correlation

Student Learning Objectives

By the end of this chapter, the student should be able to:

  • Discuss basic ideas of linear regression and correlation.

  • Create and interpret a line of best fit.

  • Calculate and interpret the correlation coefficient.

  • Calculate and interpret outliers.

Introduction

Professionals often want to know how two or more variables are related. For example, is there a relationship between the grade on the second math exam a student takes and the grade on the final exam? If there is a relationship, what is it and how strong is the relationship?

In another example, your income may be determined by your education, your profession, your years of experience, and your ability. The amount you pay a repair person for labor is often determined by an initial amount plus an hourly fee. These are all examples in which regression can be used.

The type of data described in the examples is bivariate data - “bi” for two variables. In reality, statisticians use multivariate data, meaning many variables.

In this chapter, you will be studying the simplest form of regression, “linear regression” with one independent variable ( x ). This involves data that fits a line in two dimensions. You will also study correlation which measures how strong the relationship is.

12.2. Linear Equations

Linear regression for two variables is based on a linear equation with one independent variable. It has the form:

(12.1) y = a + bx

where a and b are constant numbers.

x is the independent variable, and y is the dependent variable. Typically, you choose a value to substitute for the independent variable and then solve for the dependent variable.

Example 12.1. 

The following examples are linear equations.

(12.2) y = 3 + 2x
(12.3) y = -0.01 + 1.2x

The graph of a linear equation of the form y = a + bx is a straight line. Any line that is not vertical can be described by this equation.

Example 12.2. 

Figure 12.1. 

Graph of the equation y = -1 + 2x. This is a straight line that crosses the y-axis at -1 and is sloped up and to the right, rising 2 units for every one unit of run.
Graph of the equation y = -1 + 2x .


Linear equations of this form occur in applications of life sciences, social sciences, psychology, business, economics, physical sciences, mathematics, and other areas.

Example 12.3. 

Aaron’s Word Processing Service (AWPS) does word processing. Its rate is $32 per hour plus a $31.50 one-time charge. The total cost to a customer depends on the number of hours it takes to do the word processing job.

Problem

Find the equation that expresses the total cost in terms of the number of hours required to finish the word processing job.

Solution

Let x = the number of hours it takes to get the job done.

Let y = the total cost to the customer.

The $31.50 is a fixed cost. If it takes x hours to complete the job, then (32)(x) is the cost of the word processing only. The total cost is:

y = 31.50 + 32x
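As a small illustration, the cost equation can be written as a function and evaluated for a few job lengths. This is a minimal sketch in Python; the function name `total_cost` and the sample hour values are chosen here for illustration and are not part of the original example.

```python
def total_cost(hours):
    """Total AWPS charge in dollars for a job: y = 31.50 + 32x."""
    fixed_charge = 31.50   # one-time charge (the constant a)
    hourly_rate = 32.00    # charge per hour (the constant b)
    return fixed_charge + hourly_rate * hours


# Sample job lengths in hours (hypothetical values for illustration).
for hours in (0, 1, 2.5):
    print(f"{hours} hours -> ${total_cost(hours):.2f}")
```

Choosing a value for x (the hours) and computing y (the cost) is exactly the substitution described at the start of this section.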




12.3. Slope and Y-Intercept of a Linear Equation

For the linear equation y = a + bx , b = slope and a = y-intercept.

From algebra recall that the slope is a number that describes the steepness of a line and the y-intercept is the y coordinate of the point (0,a) where the line crosses the y-axis.

Figure 12.2. 

A straight line with a slope up and to the right.
(a) If b > 0, the line slopes upward to the right.
A flat line from left to right with no slope.
(b) If b = 0, the line is horizontal.
A straight line with a slope down and to the right.
(c) If b < 0, the line slopes downward to the right.
Three possible graphs of y = a + bx .


Example 12.4. 

Svetlana tutors to make extra money for college. For each tutoring session, she charges a one time fee of $25 plus $15 per hour of tutoring. A linear equation that expresses the total amount of money Svetlana earns for each session she tutors is y = 25 + 15x.

Problem

What are the independent and dependent variables? What is the y-intercept and what is the slope? Interpret them using complete sentences.

Solution

The independent variable (x) is the number of hours Svetlana tutors each session. The dependent variable (y) is the amount, in dollars, Svetlana earns for each session.

The y-intercept is 25 (a = 25). At the start of the tutoring session, Svetlana charges a one-time fee of $25 (this is when x = 0). The slope is 15 (b = 15). For each session, Svetlana earns $15 for each hour she tutors.




12.4. Scatter Plots

Before we take up the discussion of linear regression and correlation, we need to examine a way to display the relation between two variables x and y . The most common and easiest way is a scatter plot. The following example illustrates a scatter plot.

Example 12.5. 

From an article in the Wall Street Journal: In Europe and Asia, m-commerce is becoming more popular. M-commerce users have special mobile phones that work like electronic wallets as well as provide phone and Internet services. Users can do everything from paying for parking to buying a TV set or soda from a machine to banking to checking sports scores on the Internet. In the next few years, will there be a relationship between the year and the number of m-commerce users? Construct a scatter plot. Let x = the year and let y = the number of m-commerce users, in millions.

Figure 12.3. 

x (year)    y (# of users)
2000        0.5
2002        20.0
2003        33.0
2004        47.0
(a) Table showing the number of m-commerce users (in millions) by year.
A scatter plot with the x-axis representing the year and the y-axis representing the number of m-commerce users in millions. There are four points plotted, at (2000, 0.5), (2002, 20.0), (2003, 33.0), (2004, 47.0).
(b) Scatter plot showing the number of m-commerce users (in millions) by year.


A scatter plot shows the direction and strength of a relationship between the variables. A clear direction happens when there is either:

  • High values of one variable occurring with high values of the other variable or low values of one variable occurring with low values of the other variable.

  • High values of one variable occurring with low values of the other variable.

You can determine the strength of the relationship by looking at the scatter plot and seeing how close the points are to a line, a power function, an exponential function, or to some other type of function.

When you look at a scatterplot, you want to notice the overall pattern and any deviations from the pattern. The following scatterplot examples illustrate these concepts.

Figure 12.4.  Positive Linear Pattern (Strong)

Scatterplot of 6 points in a straight ascending line from lower left to upper right.
(a)
Scatterplot of 6 points in a straight ascending line from lower left to upper right with one additional point in the upper left corner.
(b)


Figure 12.5.  Negative Linear Pattern (Strong)

Scatterplot of 6 points in a straight descending line from upper left to lower right.
(a)
Scatterplot of 8 points in a wobbly descending line from upper left to lower right.
(b)


Figure 12.6.  Exponential Growth Pattern

Scatterplot of 7 points in an exponential curve, running along the x-axis on the left and ascending slowly toward the upper right of the graph.
(a)
Scatterplot of many points scattered everywhere.
(b)


In this chapter, we are interested in scatter plots that show a linear pattern. Linear patterns are quite common. The linear relationship is strong if the points are close to a straight line. If we think that the points show a linear relationship, we would like to draw a line on the scatter plot. This line can be calculated through a process called linear regression. However, we only calculate a regression line if one of the variables helps to explain or predict the other variable. If x is the independent variable and y the dependent variable, then we can use a regression line to predict y for a given value of x .

12.5. The Regression Equation

Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to “fit” a straight line. This is called a Line of Best Fit or Least Squares Line.

Optional Collaborative Classroom Activity

If you know a person’s pinky (smallest) finger length, do you think you could predict that person’s height? Collect data from your class (pinky finger length, in inches). The independent variable, x , is pinky finger length and the dependent variable, y , is height.

For each set of data, plot the points on graph paper. Make your graph big enough and use a ruler. Then “by eye” draw a line that appears to “fit” the data. For your line, pick two convenient points and use them to find the slope of the line. Find the y-intercept of the line by extending your lines so they cross the y-axis. Using the slopes and the y-intercepts, write your equation of “best fit”. Do you think everyone will have the same equation? Why or why not?

Using your equation, what is the predicted height for a pinky length of 2.5 inches?
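The "by eye" procedure in this activity can be sketched in a few lines of Python. The two points below are hypothetical readings from a hand-drawn line (pinky length in inches, height in inches); they are not real class data, and the helper name `line_through` is chosen here for illustration only.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope")
    slope = (y2 - y1) / (x2 - x1)      # rise over run
    intercept = y1 - slope * x1        # value of y where x = 0
    return intercept, slope


# Two convenient points picked off a hand-drawn line (hypothetical values).
a, b = line_through((2.0, 62.0), (3.0, 70.0))

# Predicted height for a pinky length of 2.5 inches, using y = a + bx.
predicted_height = a + b * 2.5
print(f"y = {a} + {b}x; predicted height = {predicted_height} inches")
```

Two people who pick different points from different hand-drawn lines will generally get different equations, which is the reason for using a single agreed criterion such as least squares.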

Example 12.6. 

A random sample of 11 statistics students produced the following data where x is the third exam score, out of 80, and y is the final exam score, out of 200. Can you predict the final exam score of a random student if you know the third exam score?

Figure 12.7. 

x (third exam score)    y (final exam score)
65                      175
67                      133
71                      185
71                      163
66                      126
75                      198
67                      153
70                      163
71                      159
69                      151
69                      159
(a) Table showing the scores on the final exam based on scores from the third exam.
Scatterplot of exam scores with the third exam score on the x-axis and the final exam score on the y-axis.
(b) Scatter plot showing the scores on the final exam based on scores from the third exam.


The third exam score, x , is the independent variable and the final exam score, y , is the dependent variable. We will plot a regression line that best “fits” the data. If each of you were to fit a line “by eye”, you would draw different lines. We can use what is called a least-squares regression line to obtain the best fit line.

Consider the following diagram. Each point of data is of the form (x, y), and each point of the line of best fit using least-squares linear regression has the form $(x, \hat{y})$.

The $\hat{y}$ is read “y hat” and is the estimated value of y. It is the value of y obtained using the regression line. It is not generally equal to y from data.

Figure 12.8. 

Scatterplot of the exam scores with a line of best fit tying in the relationship between the third exam and final exam scores. A specific point on the line, specific data point, and the distance between these two points are used in order to show an example of how to compute the sum of squared errors in order to find the points on the line of best fit.


The term $y_0 - \hat{y}_0 = \varepsilon_0$ is called the “error” or residual. It is not an error in the sense of a mistake, but measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line.

If the observed data point lies above the line, the residual is positive, and the line underestimates the actual data value for y . If the observed data point lies below the line, the residual is negative, and the line overestimates that actual data value for y .

In the diagram above, $y_0 - \hat{y}_0 = \varepsilon_0$ is the residual for the point shown. Here the point lies above the line and the residual is positive.

ε = the Greek letter epsilon

For each data point, you can calculate the residuals or errors, $y_i - \hat{y}_i = \varepsilon_i$, for i = 1, 2, 3, …, 11.

Each ε is a vertical distance.

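For the example about the third exam scores and the final exam scores for the 11 statistics students, there are 11 data points. Therefore, there are 11 ε values. If you square each ε and add, you get

$(\varepsilon_1)^2 + (\varepsilon_2)^2 + \dots + (\varepsilon_{11})^2 = \sum_{i=1}^{11} \varepsilon_i^2$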

This is called the Sum of Squared Errors (SSE).

Using calculus, you can determine the values of a and b that make the SSE a minimum. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation:

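(12.4) $\hat{y} = a + bx$

where $a = \bar{y} - b\bar{x}$ and $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$.

$\bar{x}$ and $\bar{y}$ are the averages of the x values and the y values, respectively. The best fit line always passes through the point $(\bar{x}, \bar{y})$.

The slope b can be written as $b = r\left(\frac{s_y}{s_x}\right)$, where $s_y$ = the standard deviation of the y values and $s_x$ = the standard deviation of the x values. r is the correlation coefficient, which is discussed in the next section.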
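To make the formulas concrete, the following Python sketch computes the slope and intercept for the third exam/final exam data of Example 12.6. It uses the standard least-squares formulas: the slope is the sum of the products of the deviations of x and y from their means, divided by the sum of the squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The function and variable names are chosen here for illustration.

```python
# Data from Example 12.6: third exam score (x) and final exam score (y).
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept; forces the line through (x-bar, y-bar)
    return a, b


a, b = least_squares_line(third_exam, final_exam)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

The computed values agree with the calculator output discussed later in this section (a of about -173.51 and b of about 4.83).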

Least Squares Criteria for Best Fit

The process of fitting the best fit line is called linear regression. The idea behind finding the best fit line is based on the assumption that the data are scattered about a straight line. The criterion for the best fit line is that the sum of the squared errors (SSE) is minimized, that is, made as small as possible. Any other line you might choose would have a higher SSE than the best fit line. This best fit line is called the least squares regression line.
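The least squares criterion can be checked numerically. The sketch below computes the SSE for the third exam/final exam data using the coefficients reported by the calculator (a = -173.513363, b = 4.827394209) and compares it with the SSE of two other lines. The two alternative lines are hypothetical "by eye" guesses invented for this comparison.

```python
# Data from Example 12.6: third exam score (x) and final exam score (y).
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def sse(a, b, xs, ys):
    """Sum of squared residuals (y - y-hat)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Least-squares coefficients as reported by the calculator output.
best_sse = sse(-173.513363, 4.827394209, third_exam, final_exam)

# Two hypothetical lines drawn "by eye" (illustrative guesses only).
guess_1 = sse(-170.0, 4.8, third_exam, final_exam)
guess_2 = sse(-180.0, 4.9, third_exam, final_exam)

print(f"best fit SSE: {best_sse:.1f}")
print(f"guess 1 SSE:  {guess_1:.1f}")
print(f"guess 2 SSE:  {guess_2:.1f}")
```

Both guesses give a larger SSE than the least squares line, as the criterion requires.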

Note

Computer spreadsheets, statistical software, and many calculators can quickly calculate the best fit line and create the graphs. The calculations tend to be tedious if done by hand. Instructions to use the TI-83, TI-83+, and TI-84+ calculators to find the best fit line and create a scatterplot are shown at the end of this section.

THIRD EXAM vs FINAL EXAM EXAMPLE:

The graph of the line of best fit for the third exam/final exam example is shown below:

Figure 12.9. 

Scatterplot of the third exam scores by final exam scores and its line of best fit.

The least squares regression line (best fit line) for the third exam/final exam example has the equation:

(12.5) $\hat{y} = -173.51 + 4.83x$

Note

Remember, it is always important to plot a scatter diagram first. If the scatter plot indicates that there is a linear relationship between the variables, then it is reasonable to use a best fit line to make predictions for y given x within the domain of x -values in the sample data, but not necessarily for x -values outside that domain.
You could use the line to predict the final exam score for a student who earned a grade of 73 on the third exam.
You should NOT use the line to predict the final exam score for a student who earned a grade of 50 on the third exam, because 50 is not within the domain of the x-values in the sample data, which are between 65 and 75.
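One way to respect this restriction in practice is to have the prediction step refuse x-values outside the observed range. The sketch below uses the rounded coefficients a = -173.51 and b = 4.83 for the third exam/final exam line and the observed range of 65 to 75; the function name and the choice to raise an error are illustrative assumptions, not part of the original text.

```python
def predict_final_exam(third_exam_score, x_min=65, x_max=75):
    """Predict a final exam score from y-hat = -173.51 + 4.83x.

    Raises ValueError for scores outside the observed x range, because
    the line is not necessarily reliable outside the sample data.
    """
    if not (x_min <= third_exam_score <= x_max):
        raise ValueError(
            f"{third_exam_score} is outside the observed range "
            f"[{x_min}, {x_max}]; prediction would be extrapolation"
        )
    return -173.51 + 4.83 * third_exam_score


# A third exam score of 73 lies inside the observed range.
print(f"{predict_final_exam(73):.2f}")

# A score of 50 lies outside the range, so the function refuses to predict.
try:
    predict_final_exam(50)
except ValueError as err:
    print("No prediction:", err)
```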

UNDERSTANDING SLOPE

The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

INTERPRETATION OF THE SLOPE: The slope of the best fit line tells us how the dependent variable (y) changes for every one unit increase in the independent (x) variable, on average.

Slope: The slope of the line is b = 4.83.
Interpretation: For a one point increase in the score on the third exam, the final exam score increases by 4.83 points, on average.

Using the TI-83+ and TI-84+ Calculators

Using the Linear Regression T Test: LinRegTTest

  1. In the STAT list editor, enter the X data in list L1 and the Y data in list L2, paired so that the corresponding (x,y) values are next to each other in the lists. (If a particular pair of values is repeated, enter it as many times as it appears in the data.)

  2. On the STAT TESTS menu, scroll down with the cursor to select the LinRegTTest. (Be careful to select LinRegTTest as some calculators may also have a different item called LinRegTInt.)

  3. On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1

  4. On the next line, at the prompt β or ρ, highlight “≠ 0” and press ENTER

  5. Leave the line for “RegEq:” blank

  6. Highlight Calculate and press ENTER.

Figure 12.10. 

1. Image of calculator input screen for LinRegTTest with input matching the instructions above. 2. Image of the corresponding calculator output screen for LinRegTTest. The output screen shows: Line 1. LinRegTTest; Line 2. y = a + bx; Line 3. beta does not equal 0 and rho does not equal 0; Line 4. t = 2.657560155; Line 5. df = 9; Line 6. a = -173.513363; Line 7. b = 4.827394209; Line 8. s = 16.41237711; Line 9. r squared = .4396931104; Line 10. r = .663093591


The output screen contains a lot of information. For now we will focus on a few items from the output, and will return later to the other items.

The second line says y = a + bx. Scroll down to find the values a = -173.513 and b = 4.8273; the equation of the best fit line is $\hat{y} = -173.51 + 4.83x$.
The two items at the bottom are $r^2$ = .43969 and r = .663. For now, just note where to find these values; we will discuss them in the next two sections.

Graphing the Scatterplot and Regression Line

  1. We are assuming your X data is already entered in list L1 and your Y data is in list L2

  2. Press 2nd STATPLOT ENTER to use Plot 1

  3. On the input screen for PLOT 1, highlight On and press ENTER

  4. For TYPE: highlight the very first icon which is the scatterplot and press ENTER

  5. Indicate Xlist: L1 and Ylist: L2

  6. For Mark: it does not matter which symbol you highlight.

  7. Press the ZOOM key and then the number 9 (for menu item “ZoomStat”) ; the calculator will fit the window to the data

  8. To graph the best fit line, press the “Y=” key and type the equation -173.5+4.83X into equation Y1. (The X key is immediately left of the STAT key). Press ZOOM 9 again to graph it.

  9. Optional: If you want to change the viewing window, press the WINDOW key. Enter your desired window using Xmin, Xmax, Ymin, Ymax

**With contributions from Roberta Bloom

12.6. Correlation Coefficient and Coefficient of Determination

The Correlation Coefficient r

Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatterplot) of the strength of the relationship between x and y .

The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is a numerical measure of the strength of association between the independent variable x and the dependent variable y.

The correlation coefficient is calculated as

$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

where n = the number of data points.
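As an illustration, the sketch below evaluates r for the third exam/final exam data of Example 12.6 from the sums of x, y, xy, x squared, and y squared, which is the computational form of Pearson's formula. The function name `correlation` is chosen here for illustration.

```python
from math import sqrt

# Data from Example 12.6: third exam score (x) and final exam score (y).
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def correlation(xs, ys):
    """Pearson correlation coefficient r from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(third_exam, final_exam)
print(f"r = {r:.4f}")
```

The result agrees with the value r = .663093591 shown on the calculator output screen in the previous section.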

If you suspect a linear relationship between x and y , then r can measure how strong the linear relationship is.

What the VALUE of r tells us:

  • The value of r is always between -1 and +1: -1 ≤ r ≤ 1.

  • The closer the correlation coefficient r is to -1 or 1 (and the further from 0), the stronger the evidence of a significant linear relationship between x and y ; this would indicate that the observed data points fit more closely to the best fit line. Values of r further from 0 indicate a stronger linear relationship between x and y . Values of r closer to 0 indicate a weaker linear relationship between x and y .

  • If r=0 there is absolutely no linear relationship between x and y (no linear correlation).

  • If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.

What the SIGN of r tells us

  • A positive value of r means that when x increases, y increases and when x decreases, y decreases (positive correlation).

  • A negative value of r means that when x increases, y decreases and when x decreases, y increases (negative correlation).

  • The sign of r is the same as the sign of the slope, b , of the best fit line.

Note

Strong correlation does not suggest that x causes y or y causes x . We say “correlation does not imply causation.” For example, every person who learned math in the 17th century is dead. However, learning math does not necessarily cause death!

Figure 12.11. Positive Correlation

Scatterplot of points ascending from the lower left to the upper right.
(a) A scatter plot showing data with a positive correlation. 0 < r < 1
Scatterplot of points descending from the upper left to the lower right.
(b) A scatter plot showing data with a negative correlation. -1 < r < 0
Scatterplot of points in a horizontal configuration.
(c) A scatter plot showing data with zero correlation. r =0


The formula for r looks formidable. However, computer spreadsheets, statistical software, and many calculators can quickly calculate r . The correlation coefficient r is the bottom item in the output screens for the LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous section for instructions).

The Coefficient of Determination

$r^2$ is called the coefficient of determination. $r^2$ is the square of the correlation coefficient, but is usually stated as a percent, rather than in decimal form. $r^2$ has an interpretation in the context of the data:

  • $r^2$, when expressed as a percent, represents the percent of variation in the dependent variable y that can be explained by variation in the independent variable x using the regression (best fit) line.

  • $1 - r^2$, when expressed as a percent, represents the percent of variation in y that is NOT explained by variation in x using the regression line. This can be seen as the scattering of the observed data points about the regression line.

The line of best fit is: $\hat{y} = -173.51 + 4.83x$
The correlation coefficient is r = 0.6631
The coefficient of determination is $r^2 = 0.6631^2 = 0.4397$
Interpretation of $r^2$ in the context of this example:
Approximately 44% of the variation in the final exam grades can be explained by the variation in the grades on the third exam, using the best fit regression line.
Therefore approximately 56% of the variation in the final exam grades can NOT be explained by the variation in the grades on the third exam, using the best fit regression line. (This is seen as the scattering of the points about the line.)
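The "explained variation" interpretation can be verified directly. For a least-squares line, the coefficient of determination equals 1 minus the ratio of the SSE to the total variation of y about its mean. The sketch below computes it this way for the third exam/final exam data; the name `sst` (total sum of squares) is introduced here for illustration and does not appear in the text above.

```python
# Data from Example 12.6: third exam score (x) and final exam score (y).
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(third_exam)
x_bar = sum(third_exam) / n
y_bar = sum(final_exam) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(third_exam, final_exam))
     / sum((x - x_bar) ** 2 for x in third_exam))
a = y_bar - b * x_bar

# Unexplained variation: squared vertical distances from the line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(third_exam, final_exam))
# Total variation of y about its mean.
sst = sum((y - y_bar) ** 2 for y in final_exam)

r_squared = 1 - sse / sst
print(f"explained by the line:     {r_squared:.1%}")
print(f"not explained by the line: {1 - r_squared:.1%}")
```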

**With contributions from Roberta Bloom.

Glossary

Coefficient of Correlation

A measure developed by Karl Pearson (early 1900s) that gives the strength of association between the independent variable and the dependent variable. The formula is:

(12.6) $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

where n is the number of data points. The coefficient cannot be more than 1 or less than -1. The closer the coefficient is to ±1, the stronger the evidence of a significant linear relationship between x and y.

12.7. Testing the Significance of the Correlation Coefficient

Testing the Significance of the Correlation Coefficient

The correlation coefficient, r, tells us about the strength of the linear relationship between x and y. However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient r and the sample size n, together.

We perform a hypothesis test of the “significance of the correlation coefficient” to decide whether the linear relationship in the sample data is strong enough and reliable enough to use to model the relationship in the population.

The sample data are used to compute r, the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we only have sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r, is our estimate of the unknown population correlation coefficient.

The symbol for the population correlation coefficient is ρ, the Greek letter “rho”.
ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is “close to 0” or “significantly different from 0”. We decide this based on the sample correlation coefficient r and the sample size n.

If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is “significant”.

  • Conclusion: “The correlation coefficient IS SIGNIFICANT.”

  • What the conclusion means: We believe that there is a significant linear relationship between x and y. We can use the regression line to model the linear relationship between x and y in the population.

If the test concludes that the correlation coefficient is not significantly different from 0 (it is close to 0), we say that the correlation coefficient is “not significant”.

  • Conclusion: “The correlation coefficient IS NOT SIGNIFICANT.”

  • What the conclusion means: We do NOT believe that there is a significant linear relationship between x and y. Therefore we can NOT use the regression line to model a linear relationship between x and y in the population.

Note

  • If r is significant and the scatter plot shows a reasonable linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x  values.

  • If r is not significant OR if the scatter plot does not show a reasonable linear trend, the line should not be used for prediction.

  • If r is significant and if the scatter plot shows a reasonable linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed x values in the data.

PERFORMING THE HYPOTHESIS TEST

SETTING UP THE HYPOTHESES:

  • Null Hypothesis: Ho: ρ=0

  • Alternate Hypothesis: Ha: ρ≠0

What the hypotheses mean in words:

  • Null Hypothesis Ho: The population correlation coefficient IS NOT significantly different from 0. There IS NOT a significant linear relationship (correlation) between x and y in the population.

  • Alternate Hypothesis Ha: The population correlation coefficient IS significantly DIFFERENT FROM 0. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

There are two methods to make the decision. Both methods are equivalent and give the same result.
Method 1: Using the p-value
Method 2: Using a table of critical values
In this chapter of this textbook, we will always use a significance level of 5%, α = 0.05
Note: Using the p-value method, you could choose any appropriate significance level you want; you are not limited to using α = 0.05. But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, α = 0.05. (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)
The linear regression t-test LinRegTTEST on the TI-83+ or TI-84+ calculators calculates the p-value.
On the LinRegTTEST input screen, on the line prompt for β or ρ, highlight “≠ 0”.
The output screen shows the p-value on the line that reads “p=”.
(Most computer statistical software can calculate the p-value.)

If the p-value is less than the significance level (α = 0.05):

  • Decision: REJECT the null hypothesis.

  • Conclusion: “The correlation coefficient IS SIGNIFICANT.”

  • We believe that there IS a significant linear relationship between x and y because the correlation coefficient is significantly different from 0.

If the p-value is NOT less than the significance level (α = 0.05)

  • Decision: DO NOT REJECT the null hypothesis.

  • Conclusion: “The correlation coefficient is NOT significant.”

  • We believe that there is NOT a significant linear relationship between x and y because the correlation coefficient is NOT significantly different from 0.

You will use technology to calculate the p-value. The following describes the calculations used to compute the test statistic and the p-value:
The p-value is calculated using a t-distribution with n-2 degrees of freedom.
The formula for the test statistic is $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. The value of the test statistic, t, is shown in the computer or calculator output along with the p-value. The test statistic t has the same sign as the correlation coefficient r.
The p-value is the probability (area) in both tails further out beyond the values - t and t .
For the TI-83+ and TI-84+ calculators, the command 2*tcdf(abs(t),10^99, n-2) computes the p-value given by the LinRegTTest; abs(t) denotes absolute value: | t |
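For readers without a calculator, the same quantities can be approximated in Python using only the standard library. The sketch below computes t from r and n (r multiplied by the square root of n - 2, divided by the square root of 1 minus r squared) and then approximates the two-tailed p-value by integrating the t-distribution density numerically with Simpson's rule. The numerical integration is a stand-in chosen for this illustration; it is not how the calculator's tcdf command is implemented, and the function names are assumptions.

```python
from math import gamma, pi, sqrt


def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p_value(t, df, steps=2000):
    """Approximate the area in both tails beyond -|t| and |t|.

    Integrates the density over [0, |t|] with Simpson's rule
    (steps must be even) and uses the symmetry of the distribution.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3       # area between 0 and |t|
    return 1 - 2 * central_area        # area outside (-|t|, |t|)


# Third exam/final exam example: r and n as reported in this chapter.
r, n = 0.663093591, 11
t = t_statistic(r, n)
p = two_tailed_p_value(t, n - 2)
print(f"t = {t:.4f}, df = {n - 2}, p = {p:.3f}")
```

With these inputs, t matches the value 2.657560155 from the calculator output shown earlier, and the approximate p-value falls below the 0.05 significance level.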

THIRD EXAM vs FINAL EXAM EXAMPLE: p value method

  • Consider the third exam/final exam example.

  • The line of best fit is: $\hat{y} = -173.51 + 4.83x$ with r = 0.6631 and there are n = 11 data points.

  • Can the regression line be used for prediction? Given a third exam score ( x value), can we use the line to predict the final exam score (predicted y  value)?

Ho: ρ = 0
Ha: ρ ≠ 0
α = 0.05
The p-value is 0.026 (from LinRegTTest on your calculator or from computer software)
The p-value, 0.026, is less than the significance level of α = 0.05
Decision: Reject the Null Hypothesis Ho
Conclusion: The correlation coefficient IS SIGNIFICANT.
Because r is significant and the scatter plot shows a reasonable linear trend, the regression line can be used to predict final exam scores.

METHOD 2: Using a table of Critical Values to make a decision

The 95% Critical Values of the Sample Correlation Coefficient Table at the end of this chapter (before the Summary) may be used to give you a good idea of whether the computed value of r is significant or not. Compare r to the appropriate critical value in the table. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may want to use the line for prediction.

Example 12.7. 

Suppose you computed r = 0.801 using n = 10 data points. df = n – 2 = 10 – 2 = 8. The critical values associated with df = 8 are -0.632 and + 0.632. If r < negative critical value or r > positive critical value, then r is significant. Since r = 0.801 and 0.801 > 0.632, r is significant and the line may be used for prediction. If you view this example on a number line, it will help you.

Figure 12.12. 

Horizontal number line with values of -1, -0.632, 0, 0.632, 0.801, and 1. A dashed line above values -0.632, 0, and 0.632 indicates not significant values.
r is not significant between -0.632 and +0.632. r = 0.801 > +0.632. Therefore, r is significant.



Example 12.8. 

Suppose you computed r = -0.624 with 14 data points. df = 14 – 2 = 12. The critical values are -0.532 and 0.532. Since -0.624 < -0.532, r is significant and the line may be used for prediction.

Figure 12.13. 

Horizontal number line with values of -0.624, -0.532, and 0.532.
r = -0.624 < -0.532. Therefore, r is significant.



Example 12.9. 

Suppose you computed r = 0.776 and n = 6. df = 6 – 2 = 4. The critical values are -0.811 and 0.811. Since -0.811 < 0.776 < 0.811, r is not significant and the line should not be used for prediction.

Figure 12.14. 

Horizontal number line with values -0.811, 0.776, and 0.811.
-0.811 < r = 0.776 < 0.811. Therefore, r is not significant.



THIRD EXAM vs FINAL EXAM EXAMPLE: critical value method

  • Consider the third exam/final exam example.

  • The line of best fit is: $\hat{y} = -173.51 + 4.83x$ with r = 0.6631 and there are n = 11 data points.

  • Can the regression line be used for prediction? Given a third exam score ( x value), can we use the line to predict the final exam score (predicted y  value)?

Ho: ρ = 0
Ha: ρ ≠ 0
α = 0.05
Use the “95% Critical Value” table for r with df = n – 2 = 11 – 2 = 9
The critical values are -0.602 and +0.602
Since 0.6631 > 0.602, r is significant.
Decision: Reject Ho
Conclusion: The correlation coefficient is significant
Because r is significant and the scatter plot shows a reasonable linear trend, the regression line can be used to predict final exam scores.

Example 12.10. Additional Practice Examples using Critical Values

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if r is significant and the line of best fit associated with each r can be used to predict a y value. If it helps, draw a number line.

  1. r = -0.567 and the sample size, n , is 19. The df = n – 2 = 17. The critical value is -0.456. -0.567 < -0.456 so r is significant.

  2. r = 0.708 and the sample size, n , is 9. The df = n – 2 = 7. The critical value is 0.666. 0.708 > 0.666 so r is significant.

  3. r = 0.134 and the sample size, n , is 14. The df = 14 – 2 = 12. The critical value is 0.532. 0.134 is between -0.532 and 0.532 so r is not significant.

  4. r = 0 and the sample size, n , is 5. No matter what the df is, r = 0 is between the two critical values, so r is not significant.
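The critical value comparison used in these examples can be sketched in Python. This is only an illustration: the dictionary copies a few rows of the 95% critical values table at the end of the chapter, and the names CRITICAL_VALUES_95 and is_significant are hypothetical, chosen for this sketch.

```python
# A few rows copied from the 95% Critical Values table (df -> critical value).
# This is a partial table, included only for the examples below.
CRITICAL_VALUES_95 = {4: 0.811, 7: 0.666, 8: 0.632, 9: 0.602, 12: 0.532, 17: 0.456}


def is_significant(r, n):
    """Return True when r falls outside the interval (-critical, +critical)."""
    df = n - 2                          # degrees of freedom
    critical = CRITICAL_VALUES_95[df]   # raises KeyError if df is not in the partial table
    return abs(r) > critical


# (r, n) pairs taken from the worked examples in this section.
examples = [(-0.567, 19), (0.708, 9), (0.134, 14), (0.6631, 11), (0.776, 6)]
for r, n in examples:
    verdict = "significant" if is_significant(r, n) else "not significant"
    print(f"r = {r}, n = {n}, df = {n - 2}: {verdict}")
```

Comparing |r| with the positive critical value is equivalent to checking whether r lies outside the pair of critical values on the number line.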


Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between x and y in the population.

The regression line equation that we calculate from the sample data gives the best fit line for our particular sample. We want to use this best fit line for the sample as an estimate of the best fit line for the population. Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.

The assumptions underlying the test of significance are:

  • There is a linear relationship in the population that models the average value of y for varying values of x. In other words, the averages of the y values for each particular x value lie on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)

  • The y values for any particular x value are normally distributed about the line. This implies that there are more y values scattered closer to the line than are scattered farther away. The first assumption above implies that these normal distributions are centered on the line: the means of these normal distributions of y values lie on the line.

  • The standard deviations of the population y values about the line are equal for each value of x. In other words, each of these normal distributions of y values has the same shape and spread about the line.

Figure 12.15. 

A downward sloping regression line is shown with the y values normally distributed about the line with equal standard deviations for each x value. For each x value, the mean of the y values lies on the regression line. More y values lie near the line than are scattered further away from the line.


**With contributions from Roberta Bloom

12.8. Prediction*

Recall the third exam/final exam example.

We examined the scatterplot and showed that the correlation coefficient is significant. We found the equation of the best fit line for the final exam grade as a function of the grade on the third exam. We can now use the least squares regression line for prediction.

Suppose you want to estimate, or predict, the final exam score of statistics students who received 73 on the third exam. The exam scores ( x -values) range from 65 to 75. Since 73 is between the x -values 65 and 75, substitute x = 73 into the equation. Then:

(12.7) ŷ = -173.51 + 4.83(73) = 179.08

We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on the final exam, on average.
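The prediction step can be sketched in Python as follows. The sketch assumes the rounded line ŷ = -173.51 + 4.83x from the third exam/final exam example; the function name predict_final and the decision to raise an error outside the observed range are illustrative choices, not part of the original example.

```python
# Range of observed third exam scores (x values) in the example.
X_MIN, X_MAX = 65, 75


def predict_final(third_exam):
    """Predict the final exam score from the line y-hat = -173.51 + 4.83x."""
    if not X_MIN <= third_exam <= X_MAX:
        # Outside the observed x values, the line should not be used for prediction.
        raise ValueError("x is outside the observed range; do not extrapolate")
    return -173.51 + 4.83 * third_exam


print(round(predict_final(73), 2))
```

Refusing to return a value outside 65 to 75 mirrors the advice that the line should only be used for x values within the range of the data.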

Example 12.11. 

Recall the third exam/final exam example.

Problem 1.

What would you predict the final exam score to be for a student who scored a 66 on the third exam?

Solution

 145.27



Problem 2. (Go to Solution)

What would you predict the final exam score to be for a student who scored a 78 on the third exam?



**With contributions from Roberta Bloom

Solutions to Exercises

Solution to Exercise 2. (Return to Problem)

The x values in the data are between 65 and 75. 78 is outside of the domain of the observed x values in the data (independent variable), so you cannot reliably predict the final exam score for this student. (Even though it is possible to enter x into the equation and calculate a y value, you should not do so!)


12.9. Outliers*

In some data sets, there are values (observed data points) called outliers. Outliers are observed data points that are far from the least squares line. They have large “errors”, where the “error” or residual is the vertical distance from the line to the point.

Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included in the analysis of the data. It is possible that an outlier is a result of erroneous data. Other times, an outlier may hold valuable information about the population under study and should remain included in the data. The key is to carefully examine what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points but that greatly influence the line. As a result an influential point may be close to the line, even though it is far from the rest of the data. Because an influential point so strongly influences the best fit line, it generally will not have a large “error” or residual.

Computers and many calculators can be used to identify outliers from the data. Computer output for regression analysis will often identify both outliers and influential points so that you can examine them.

Identifying Outliers

We could guess at outliers by looking at the scatterplot and best fit line. However, we would like some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule of thumb, we can flag any point that is located further than two standard deviations above or below the best fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors.

We can do this visually in the scatterplot by drawing an extra pair of lines that are two standard deviations above and below the best fit line. Any data points that are outside this extra pair of lines are flagged as potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the standard deviation. On the TI-83, 83+, or 84+, the graphical approach is easier. The graphical procedure is shown first, followed by the numerical calculations. You would generally only need to use one of these methods.

Example 12.12. 

Problem

In the third exam/final exam example, you can determine if there is an outlier or not. If there is an outlier, as an exercise, delete it and fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This means the SSE should be smaller and the correlation coefficient ought to be closer to 1 or -1.

Solution

Graphical Identification of Outliers

With the TI-83, 83+, and 84+ graphing calculators, it is easy to identify the outlier graphically. If the vertical distance from a data point to the corresponding point on the line of best fit is 2s or more, we consider the data point to be “too far” from the line of best fit. We need to find and graph the lines that are two standard deviations below and above the regression line. Any points that are outside these two lines are outliers. We will call these lines Y2 and Y3.

As we did with the equation of the regression line and the correlation coefficient, we will use technology to calculate this standard deviation for us. Using the LinRegTTest with this data, scroll down through the output screens to find s=16.412.

Line Y2=-173.5+4.83x-2(16.4) and line Y3=-173.5+4.83x+2(16.4), where ŷ=-173.5+4.83x is the line of best fit. Y2 and Y3 have the same slope as the line of best fit.

Graph the scatterplot with the best fit line in equation Y1, then enter the two extra lines as Y2 and Y3 in the “Y=”equation editor and press ZOOM 9. You will find that the only data point that is not between lines Y2 and Y3 is the point x=65, y=175. On the calculator screen it is just barely outside these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final exam; this point is further than 2 standard deviations away from the best fit line.

Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult to tell if the point is between or outside the lines. On a computer, enlarging the graph may help; on a small calculator screen, zooming in may make the graph more clear. Note that when the graph does not give a clear enough picture, you can use the numerical comparisons to identify outliers.

Figure 12.16. 

Scatterplot of data and best fit line of the exam score data, showing the lines two standard deviations above and below the best fit line. The data value (65,175) lies slightly above the upper line identifying it as an outlier.


Numerical Identification of Outliers

In the table below, the first two columns are the third exam and final exam data. The third column shows the predicted ŷ values calculated from the line of best fit: ŷ = -173.5 + 4.83x. The residuals, or errors, have been calculated in the fourth column of the table: observed y value – predicted y value = y – ŷ.

s is the standard deviation of all the y – ŷ = ε values, where n = the total number of data points. If each residual is calculated and squared, and the results are added up, we get the SSE. The standard deviation of the residuals is calculated from the SSE as:

s = √(SSE / (n – 2))

Rather than calculate the value of s ourselves, we can find s using the computer or calculator. For this example, our calculator LinRegTTest found s=16.4 as the standard deviation of the residuals: 35; -17; 16; -6; -19; 9; 3; -1; -10; -9; -1.

Table 12.1.
x     y     ŷ     y – ŷ
65    175   140   175 – 140 = 35
67    133   150   133 – 150 = -17
71    185   169   185 – 169 = 16
71    163   169   163 – 169 = -6
66    126   145   126 – 145 = -19
75    198   189   198 – 189 = 9
67    153   150   153 – 150 = 3
70    163   164   163 – 164 = -1
71    159   169   159 – 169 = -10
69    151   160   151 – 160 = -9
69    159   160   159 – 160 = -1

We are looking for all data points for which the residual is greater than 2s=2(16.4)=32.8 or less than -32.8. Compare these values to the residuals in column 4 of the table. The only such data point is the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for this student is 35.
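The numerical check can also be carried out on a computer. The following Python sketch fits the least squares line to the 11 data points from the table, computes s from the unrounded residuals, and flags any point whose residual is larger than 2s in absolute value. The variable names are illustrative, and because the residuals are not rounded, the printed values may differ slightly from the rounded table entries.

```python
from math import sqrt

# Third exam (x) and final exam (y) scores from the table above.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
n = len(x)

# Least squares slope b and intercept a.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Residuals, SSE, and the standard deviation of the residuals.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s = sqrt(sse / (n - 2))

# Flag points more than two standard deviations from the line.
outliers = [(xi, yi) for xi, yi, e in zip(x, y, residuals) if abs(e) > 2 * s]
print(f"s = {s:.3f}, 2s = {2 * s:.1f}, flagged: {outliers}")
```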

How does the outlier affect the best fit line?

Numerically and graphically, we have identified the point (65,175) as an outlier. We should re-examine the data for this point to see if there are any problems with the data. If there is an error we should fix the error if possible, or delete the data. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier data was an error. Therefore we will continue on to delete the outlier, so that we can explore how it affects the results, as a learning experience.

Compute a new best-fit line and correlation coefficient using the 10 remaining points:

On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, the new line of best fit and the correlation coefficient are:

ŷ = -355.19 + 7.39x and r = 0.9121

The new correlation coefficient, r = 0.9121, indicates a stronger correlation than the original ( r = 0.6631) because it is closer to 1. This means that the new line is a better fit to the 10 remaining data values. The line can better predict the final exam score given the third exam score.
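To see the effect of the outlier without a calculator, the fit can be repeated with and without the point (65,175). The Python sketch below is one way to do this; the helper name fit is hypothetical, and the formulas are the usual least squares expressions for the slope, intercept, and correlation coefficient.

```python
from math import sqrt


def fit(points):
    """Return (a, b, r) for the least squares line y-hat = a + bx."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / sqrt(sxx * syy)


# Third exam/final exam data, as (x, y) pairs.
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]
without_outlier = [p for p in data if p != (65, 175)]

for label, pts in (("all 11 points", data), ("outlier removed", without_outlier)):
    a, b, r = fit(pts)
    print(f"{label}: y-hat = {a:.2f} + {b:.2f}x, r = {r:.4f}")
```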




Numerical Identification of Outliers: Calculating s and Finding Outliers Manually

If you do not have the function LinRegTTest, then you can calculate the outlier in the first example by doing the following. First, square each |y – ŷ| (see the table above):

The squares are 35²; 17²; 16²; 6²; 19²; 9²; 3²; 1²; 10²; 9²; 1².

Then, add (sum) all the |y – ŷ| squared terms using the formula

Σ(|y – ŷ|)² = Σε²

(Recall that y – ŷ = ε.)

= 35² + 17² + 16² + 6² + 19² + 9² + 3² + 1² + 10² + 9² + 1²

= 2440 = SSE. The result, SSE, is the Sum of Squared Errors.

Next, calculate s , the standard deviation of all the y – ŷ = ε values, where n = the total number of data points. (Calculate the standard deviation of 35; 17; 16; 6; 19; 9; 3; 1; 10; 9; 1.)

The calculation is s = √(SSE / (n – 2)).

For the third exam/final exam problem, s = √(2440 / (11 – 2)) = 16.47.

Next, multiply s by 1.9: (1.9)⋅(16.47) = 31.29. The value 31.29 is almost 2 standard deviations away from the mean of the y – ŷ values.

If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance is at least 1.9s , then we would consider the data point to be “too far” from the line of best fit. We call that point a potential outlier.

For the example, if any of the |y – ŷ| values are at least 31.29, the corresponding (x,y) data point is a potential outlier.

For the third exam/final exam problem, all the |y – ŷ| values are less than 31.29 except for the first one, which is 35.

That is, 35 > 31.29, so |y – ŷ| ≥ (1.9)⋅(s) for that point.

The point which corresponds to |y – ŷ| = 35 is (65,175). Therefore, the data point (65,175) is a potential outlier. For this example, we will delete it. (Remember, we do not always delete an outlier.) The next step is to compute a new best-fit line using the 10 remaining points. The new line of best fit and the correlation coefficient are:

ŷ = -355.19 + 7.39x and r = 0.9121
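The manual calculation above can be checked with a few lines of Python. The sketch starts from the rounded residuals in Table 12.1 and uses the 1.9s cutoff described in this subsection; the variable names are illustrative only.

```python
from math import sqrt

# Residuals y - y-hat, as rounded in Table 12.1.
residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]
n = len(residuals)

sse = sum(e ** 2 for e in residuals)   # sum of squared errors
s = sqrt(sse / (n - 2))                # standard deviation of the residuals
cutoff = 1.9 * s                       # distance treated as "too far" from the line

# Keep the residuals whose absolute value reaches the cutoff.
flagged = [e for e in residuals if abs(e) >= cutoff]
print(f"SSE = {sse}, s = {s:.2f}, flagged residuals: {flagged}")
```

Because s is not rounded before it is multiplied by 1.9, the cutoff computed here can differ in the second decimal place from the hand calculation, which does not change which residual is flagged.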

Example 12.13. 

Problem

Using this new line of best fit (based on the remaining 10 data points), what would a student who receives a 73 on the third exam expect to receive on the final exam? Is this the same as the prediction made using the original line?

Solution

Using the new line of best fit, ŷ = -355.19 + 7.39(73) = 184.28. A student who scored 73 points on the third exam would expect to earn 184 points on the final exam. The original line predicted ŷ = -173.51 + 4.83(73) = 179.08, so the prediction using the new line with the outlier eliminated differs from the original prediction.




Example 12.14. 

(From The Consumer Price Indexes Web site) The Consumer Price Index (CPI) measures the average change over time in the prices paid by urban consumers for consumer goods and services. The CPI affects nearly all Americans because of the many ways it is used. One of its biggest uses is as a measure of inflation. By providing information about price changes in the Nation’s economy to government, business, and labor, the CPI helps them to make economic decisions. The President, Congress, and the Federal Reserve Board use the CPI’s trends to formulate monetary and fiscal policies. In the following table, x is the year and y is the CPI.

Table 12.2. Data:
x       y
1915    10.1
1926    17.7
1935    13.7
1940    14.7
1947    24.1
1952    26.5
1964    31.0
1969    36.7
1975    49.3
1979    72.6
1980    82.4
1986    109.6
1991    130.7
1999    166.6

Problem

  • Make a scatterplot of the data.

  • Calculate the least squares line. Write the equation in the form ŷ = a + bx.

  • Draw the line on the scatterplot.

  • Find the correlation coefficient. Is it significant?

  • What is the average CPI for the year 1990?

Solution

  • Scatter plot and line of best fit.

  • ŷ = -3204 + 1.662x is the equation of the line of best fit.

  • r = 0.8694

  • The number of data points is n = 14. Use the 95% Critical Values of the Sample Correlation Coefficient table at the end of Chapter 12. n – 2 = 12. The corresponding critical value is 0.532. Since 0.8694 > 0.532, r is significant.

  • ŷ = -3204 + 1.662(1990) = 103.4 CPI

  • Using the calculator LinRegTTest, we find that s = 25.4 ; graphing the lines Y2=-3204+1.662X-2(25.4) and Y3=-3204+1.662X+2(25.4) shows that no data values are outside those lines, identifying no outliers. (Note that the year 1999 was very close to the upper line, but still inside it.)

Figure 12.17. 

Scatter plot and line of best fit of the consumer price index data, on the y-axis, and year data, on the x-axis.



Note

In the example, notice the pattern of the points compared to the line. Although the correlation coefficient is significant, the pattern in the scatterplot indicates that a curve would be a more appropriate model to use than a line. In this example, a statistician should prefer to use other methods to fit a curve to this data, rather than model the data with the line we found. In addition to doing the calculations, it is always important to look at the scatterplot when deciding whether a linear model is appropriate.
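One way to see the curved pattern without a graph is to look at the signs of the residuals. The Python sketch below fits the least squares line to the CPI data in Table 12.2, then prints the correlation coefficient, the standard deviation of the residuals, and a string of + and - signs for the residuals in year order. The variable names are illustrative, and the coefficients are not rounded, so predictions may differ slightly from those made with the rounded line.

```python
from math import sqrt

# Year (x) and CPI (y) from Table 12.2.
years = [1915, 1926, 1935, 1940, 1947, 1952, 1964,
         1969, 1975, 1979, 1980, 1986, 1991, 1999]
cpi = [10.1, 17.7, 13.7, 14.7, 24.1, 26.5, 31.0,
       36.7, 49.3, 72.6, 82.4, 109.6, 130.7, 166.6]
n = len(years)

x_bar, y_bar = sum(years) / n, sum(cpi) / n
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in cpi)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, cpi))

b = sxy / sxx              # slope
a = y_bar - b * x_bar      # intercept
r = sxy / sqrt(sxx * syy)  # correlation coefficient

residuals = [y - (a + b * x) for x, y in zip(years, cpi)]
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
outliers = [x for x, e in zip(years, residuals) if abs(e) > 2 * s]

# A run of one sign in the middle with the other sign at both ends suggests curvature.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(f"b = {b:.3f}, r = {r:.4f}, s = {s:.1f}, outliers: {outliers}")
print("residual signs by year:", signs)
```

If the points were scattered randomly about the line, the signs would be mixed; a long block of negative residuals between positive ones is the numerical counterpart of the curve seen in the scatterplot.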

If you are interested in seeing more years of data, visit the Bureau of Labor Statistics CPI website ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt ; our data is taken from the column entitled “Annual Avg.” (third column from the right). For example you could add more current years of data. Try adding the more recent years 2004 : CPI=188.9 and 2008 : CPI=215.3 and see how it affects the model.


**With contributions from Roberta Bloom

Glossary

Outlier

An observation that does not fit the rest of the data.

12.10. 95% Critical Values of the Sample Correlation Coefficient Table*

Table 12.3.
Degrees of Freedom: n – 2    Critical Values: ( + and – )
1                0.997
2                0.950
3                0.878
4                0.811
5                0.754
6                0.707
7                0.666
8                0.632
9                0.602
10               0.576
11               0.555
12               0.532
13               0.514
14               0.497
15               0.482
16               0.468
17               0.456
18               0.444
19               0.433
20               0.423
21               0.413
22               0.404
23               0.396
24               0.388
25               0.381
26               0.374
27               0.367
28               0.361
29               0.355
30               0.349
40               0.304
50               0.273
60               0.250
70               0.232
80               0.217
90               0.205
100 and over     0.195

12.11. Summary*

Bivariate Data: Each data point has two values. The form is (x,y).

Line of Best Fit or Least Squares Line (LSL): ŷ = a + bx

x = independent variable; y = dependent variable

Residual: actual y value – predicted y value = y – ŷ

Correlation Coefficient r:

  1. Used to determine whether a line of best fit is good for prediction.

  2. Between -1 and 1 inclusive. The closer r is to 1 or -1, the closer the original points are to a straight line.

  3. If r is negative, the slope is negative. If r is positive, the slope is positive.

  4. If r = 0, then the line is horizontal.

Sum of Squared Errors (SSE): The smaller the SSE, the better the original set of points fits the line of best fit.

Outlier: A point that does not seem to fit the rest of the data.

12.12. Practice: Linear Regression*

Student Learning Outcomes

  • The student will explore the properties of linear regression.

Given

The data below are real. Keep in mind that these are only reported figures. (Source: Centers for Disease Control and Prevention, National Center for HIV, STD, and TB Prevention, October 24, 2003)

Table 12.4. Adults and Adolescents only, United States
Year        # AIDS cases diagnosed    # AIDS deaths
Pre-1981    91                        29
1981        319                       121
1982        1,170                     453
1983        3,076                     1,482
1984        6,240                     3,466
1985        11,776                    6,878
1986        19,032                    11,987
1987        28,564                    16,162
1988        35,447                    20,868
1989        42,674                    27,591
1990        48,634                    31,335
1991        59,660                    36,560
1992        78,530                    41,055
1993        78,834                    44,730
1994        71,874                    49,095
1995        68,505                    49,456
1996        59,347                    38,510
1997        47,149                    20,736
1998        38,393                    19,005
1999        25,174                    18,454
2000        25,522                    17,347
2001        25,643                    17,402
2002        26,464                    16,371
Total       802,118                   489,093

Note

We will use the columns “year” and “# AIDS cases diagnosed” for all questions unless otherwise stated.

Graphing

Graph “year” vs. “# AIDS cases diagnosed.” Plot the points on the graph located below in the section titled “Plot” . Do not include pre-1981. Label both axes with words. Scale both axes.

Data

Exercise 12.12.1.

Enter your data into your calculator or computer. The pre-1981 data should not be included. Why is that so?


Linear Equation

Write the linear equation below, rounding to 4 decimal places:

Exercise 12.12.2. (Go to Solution)

Calculate the following:

a. a =
b. b =
c. r =
d. n = (# of pairs)
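If you are working on a computer instead of a calculator, the quantities in this exercise can be computed with a short Python sketch such as the one below. It assumes the “year” and “# AIDS cases diagnosed” columns from the table above, with the pre-1981 row left out; the variable names are illustrative. Rounding the results to 4 decimal places, and interpreting them, is left to you.

```python
from math import sqrt

# Years 1981 through 2002 and the number of AIDS cases diagnosed each year.
years = list(range(1981, 2003))
cases = [319, 1170, 3076, 6240, 11776, 19032, 28564, 35447, 42674, 48634,
         59660, 78530, 78834, 71874, 68505, 59347, 47149, 38393, 25174,
         25522, 25643, 26464]
n = len(years)  # number of (x, y) pairs

x_bar, y_bar = sum(years) / n, sum(cases) / n
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in cases)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, cases))

b = sxy / sxx              # slope
a = y_bar - b * x_bar      # intercept
r = sxy / sqrt(sxx * syy)  # correlation coefficient

print(f"a = {a:.4f}, b = {b:.4f}, r = {r:.4f}, n = {n}")
```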


Exercise 12.12.3. (Go to Solution)

equation:


Solve

Exercise 12.12.4. (Go to Solution)

Solve.

a. When x = 1985, ŷ =
b. When x = 1990, ŷ =


Plot

Plot the 2 above points on the graph below. Then, connect the 2 points to form the regression line.

Blank graph with horizontal and vertical axes.

Obtain the graph on your calculator or computer.

Discussion Questions

Look at the graph above.

Exercise 12.12.5.

Does the line seem to fit the data? Why or why not?


Exercise 12.12.6.

Do you think a linear fit is best? Why or why not?


Exercise 12.12.7.

Hand draw a smooth curve on the graph above that shows the flow of the data.


Exercise 12.12.8.

What does the correlation imply about the relationship between time (years) and the number of diagnosed AIDS cases reported in the U.S.?


Exercise 12.12.9.

Why is “year” the independent variable and “# AIDS cases diagnosed.” the dependent variable (instead of the reverse)?


Exercise 12.12.10. (Go to Solution)

Solve.

a. When x = 1970, ŷ =
b. Why doesn’t this answer make sense?


Solutions to Exercises

Solution to Exercise 12.12.2. (Return to Exercise)

a. a = -3,448,225
b. b = 1750
c.
d. n = 22

Solution to Exercise 12.12.3. (Return to Exercise)

ŷ = -3,448,225 + 1750x


Solution to Exercise 12.12.4. (Return to Exercise)

a. 25,082
b. 33,831

Solution to Exercise 12.12.10. (Return to Exercise)

a. -1164

12.13. Homework*

Exercise 12.13.1. (Go to Solution)

For each situation below, state the independent variable and the dependent variable.

a. A study is done to determine if elderly drivers are involved in more motor vehicle fatalities than all other drivers. The number of fatalities per 100,000 drivers is compared to the age of drivers.
b. A study is done to determine if the weekly grocery bill changes based on the number of family members.
c. Insurance companies base life insurance premiums partially on the age of the applicant.
d. Utility bills vary according to power consumption.
e. A study is done to determine if a higher education reduces the crime rate in a population.

Exercise 12.13.2.

In 1990 the number of driver deaths per 100,000 for the different age groups was as follows (Source: The National Highway Traffic Safety Administration’s National Center for Statistics and Analysis):

Table 12.5.
Age      Number of Driver Deaths per 100,000
15-24    28
25-39    15
40-69    10
70-79    15
80+      25

a. For each age group, pick the midpoint of the interval for the x value. (For the 80+ group, use 85.)
b. Using “ages” as the independent variable and “Number of driver deaths per 100,000” as the dependent variable, make a scatter plot of the data.
c. Calculate the least squares (best–fit) line. Put the equation in the form of: ŷ = a + bx
d. Find the correlation coefficient. Is it significant?
e. Pick two ages and find the estimated fatality rates.
f. Use the two points in (e) to plot the least squares line on your graph from (b).
g. Based on the above data, is there a linear relationship between age of a driver and driver fatality rate?
h. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.3. (Go to Solution)

The average number of people in a family that received welfare for various years is given below. (Source: House Ways and Means Committee, Health and Human Services Department)

Table 12.6.
Year    Welfare family size
1969    4.0
1973    3.6
1975    3.2
1979    3.0
1983    3.0
1988    3.0
1991    2.9

a. Using “year” as the independent variable and “welfare family size” as the dependent variable, make a scatter plot of the data.
b. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
c. Find the correlation coefficient. Is it significant?
d. Pick two years between 1969 and 1991 and find the estimated welfare family sizes.
e. Use the two points in (d) to plot the least squares line on your graph from (a).
f. Based on the above data, is there a linear relationship between the year and the average number of people in a welfare family?
g. Using the least squares line, estimate the welfare family sizes for 1960 and 1995. Does the least squares line give an accurate estimate for those years? Explain why or why not.
h. Are there any outliers in the above data?
i. What is the estimated average welfare family size for 1986? Does the least squares line give an accurate estimate for that year? Explain why or why not.
j. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.4.

Use the AIDS data from the practice for this section, but this time use the columns “year” and “# AIDS deaths.” Answer all of the questions from the practice again, using the new columns.


Exercise 12.13.5. (Go to Solution)

The height (sidewalk to roof) of notable tall buildings in America is compared to the number of stories of the building (beginning at street level). (Source: Microsoft Bookshelf)

Table 12.7.
Height (in feet)    Stories
1050                57
428                 28
362                 26
529                 40
790                 60
401                 22
380                 38
1454                110
1127                100
700                 46

a. Using “stories” as the independent variable and “height” as the dependent variable, make a scatter plot of the data.
b. Does it appear from inspection that there is a relationship between the variables?
c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
d. Find the correlation coefficient. Is it significant?
e. Find the estimated heights for 32 stories and for 94 stories.
f. Use the two points in (e) to plot the least squares line on your graph from (a).
g. Based on the above data, is there a linear relationship between the number of stories in tall buildings and the height of the buildings?
h. Are there any outliers in the above data? If so, which point(s)?
i. What is the estimated height of a building with 6 stories? Does the least squares line give an accurate estimate of height? Explain why or why not.
j. Based on the least squares line, adding an extra story adds about how many feet to a building?
k. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.6.

Below is the life expectancy for an individual born in the United States in certain years. (Source: National Center for Health Statistics)

Table 12.8.
Year of Birth    Life Expectancy
1930             59.7
1940             62.9
1950             70.2
1965             69.7
1973             71.4
1982             74.5
1987             75
1992             75.7

a. Decide which variable should be the independent variable and which should be the dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
d. Find the correlation coefficient. Is it significant?
e. Find the estimated life expectancy for an individual born in 1950 and for one born in 1982.
f. Why aren’t the answers to part (e) the values on the above chart that correspond to those years?
g. Use the two points in (e) to plot the least squares line on your graph from (b).
h. Based on the above data, is there a linear relationship between the year of birth and life expectancy?
i. Are there any outliers in the above data?
j. Using the least squares line, find the estimated life expectancy for an individual born in 1850. Does the least squares line give an accurate estimate for that year? Explain why or why not.
k. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.7. (Go to Solution)

The percent of female wage and salary workers who are paid hourly rates is given below for the years 1979 - 1992. (Source: Bureau of Labor Statistics, U.S. Dept. of Labor)

Table 12.9.
Year    Percent of workers paid hourly rates
1979    61.2
1980    60.7
1981    61.3
1982    61.3
1983    61.8
1984    61.7
1985    61.8
1986    62.0
1987    62.7
1990    62.8
1992    62.9

a. Using “year” as the independent variable and “percent” as the dependent variable, make a scatter plot of the data.
b. Does it appear from inspection that there is a relationship between the variables? Why or why not?
c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
d. Find the correlation coefficient. Is it significant?
e. Find the estimated percents for 1991 and 1988.
f. Use the two points in (e) to plot the least squares line on your graph from (a).
g. Based on the above data, is there a linear relationship between the year and the percent of female wage and salary earners who are paid hourly rates?
h. Are there any outliers in the above data?
i. What is the estimated percent for the year 2050? Does the least squares line give an accurate estimate for that year? Explain why or why not.
j. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.8.

The maximum discount value of the Entertainment® card for the “Fine Dining” section, Edition 10, for various pages is given below.

Table 12.10.
Page number    Maximum value ($)
4              16
14             19
25             15
32             17
43             19
57             15
72             16
85             15
90             17

a. Decide which variable should be the independent variable and which should be the dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
d. Find the correlation coefficient. Is it significant?
e. Find the estimated maximum values for the restaurants on page 10 and on page 70.
f. Use the two points in (e) to plot the least squares line on your graph from (b).
g. Does it appear that the restaurants giving the maximum value are placed in the beginning of the “Fine Dining” section? How did you arrive at your answer?
h. Suppose that there were 200 pages of restaurants. What do you estimate to be the maximum value for a restaurant listed on page 200?
i. Is the least squares line valid for page 200? Why or why not?
j. What is the slope of the least squares (best-fit) line? Interpret the slope.

The next two questions refer to the following data: The cost of a leading liquid laundry detergent in different sizes is given below.

Table 12.11.
Size (ounces)    Cost ($)    Cost per ounce
16               3.99
32               4.99
64               5.99
200              10.99

Exercise 12.13.9. (Go to Solution)

a. Using “size” as the independent variable and “cost” as the dependent variable, make a scatter plot.
b. Does it appear from inspection that there is a relationship between the variables? Why or why not?
c. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
d. Find the correlation coefficient. Is it significant?
e. If the laundry detergent were sold in a 40 ounce size, find the estimated cost.
f. If the laundry detergent were sold in a 90 ounce size, find the estimated cost.
g. Use the two points in (e) and (f) to plot the least squares line on your graph from (a).
h. Does it appear that a line is the best way to fit the data? Why or why not?
i. Are there any outliers in the above data?
j. Is the least squares line valid for predicting what a 300 ounce size of the laundry detergent would cost? Why or why not?
k. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.10.

a. Complete the above table for the cost per ounce of the different sizes.
b. Using “Size” as the independent variable and “Cost per ounce” as the dependent variable, make a scatter plot of the data.
c. Does it appear from inspection that there is a relationship between the variables? Why or why not?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. Is it significant?
f. If the laundry detergent were sold in a 40 ounce size, find the estimated cost per ounce.
g. If the laundry detergent were sold in a 90 ounce size, find the estimated cost per ounce.
h. Use the two points in (f) and (g) to plot the least squares line on your graph from (b).
i. Does it appear that a line is the best way to fit the data? Why or why not?
j. Are there any outliers in the above data?
k. Is the least squares line valid for predicting what a 300 ounce size of the laundry detergent would cost per ounce? Why or why not?
l. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.11. (Go to Solution)

According to a flyer by a Prudential Insurance Company representative, the approximate probate fees and taxes for selected net taxable estates are as follows:

Table 12.12.
Net Taxable Estate ($)    Approximate Probate Fees and Taxes ($)
600,000                   30,000
750,000                   92,500
1,000,000                 203,000
1,500,000                 438,000
2,000,000                 688,000
2,500,000                 1,037,000
3,000,000                 1,350,000

a. Decide which variable should be the independent variable and which should be the dependent variable.
b. Make a scatter plot of the data.
c. Does it appear from inspection that there is a relationship between the variables? Why or why not?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. Is it significant?
f. Find the estimated total cost for a net taxable estate of $1,000,000. Find the cost for $2,500,000.
g. Use the two points in (f) to plot the least squares line on your graph from (b).
h. Does it appear that a line is the best way to fit the data? Why or why not?
i. Are there any outliers in the above data?
j. Based on the above, what would be the probate fees and taxes for an estate that does not have any assets?
k. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.12.

The following are advertised sale prices of color televisions at Anderson’s.

Table 12.13.
Size (inches)    Sale Price ($)
9                147
20               197
27               297
31               447
35               1177
40               2177
60               2497

a. Decide which variable should be the independent variable and which should be the dependent variable.
b. Make a scatter plot of the data.
c. Does it appear from inspection that there is a relationship between the variables? Why or why not?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. Is it significant?
f. Find the estimated sale price for a 32 inch television. Find the cost for a 50 inch television.
g. Use the two points in (f) to plot the least squares line on your graph from (b).
h. Does it appear that a line is the best way to fit the data? Why or why not?
i. Are there any outliers in the above data?
j. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.13. (Go to Solution)

Below are the average heights for American boys. (Source: Physician’s Handbook, 1990)

Table 12.14.
Age (years)    Height (cm)
birth          50.8
2              83.8
3              91.4
5              106.6
7              119.3
10             137.1
14             157.5

a. Decide which variable should be the independent variable and which should be the dependent variable.
b. Make a scatter plot of the data.
c. Does it appear from inspection that there is a relationship between the variables? Why or why not?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. Is it significant?
f. Find the estimated average height for a one-year-old. Find the estimated average height for an eleven-year-old.
g. Use the two points in (f) to plot the least squares line on your graph from (b).
h. Does it appear that a line is the best way to fit the data? Why or why not?
i. Are there any outliers in the above data?
j. Use the least squares line to estimate the average height for a sixty-two-year-old man. Do you think that your answer is reasonable? Why or why not?
k. What is the slope of the least squares (best-fit) line? Interpret the slope.

Exercise 12.13.14.

The following chart gives the gold medal times for every other Summer Olympics for the women’s 100 meter freestyle (swimming).

Table 12.15.
Year    Time (seconds)
1912    82.2
1924    72.4
1932    66.8
1952    66.8
1960    61.2
1968    60.0
1976    55.65
1984    55.92
1992    54.64

a. Decide which variable should be the independent variable and which should be the dependent variable.
b. Make a scatter plot of the data.
c. Does it appear from inspection that there is a relationship between the variables? Why or why not?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. Is the decrease in times significant?
f. Find the estimated gold medal time for 1932. Find the estimated time for 1984.
g. Why are the answers from (f) different from the chart values?
h. Use the two points in (f) to plot the least squares line on your graph from (b).
i. Does it appear that a line is the best way to fit the data? Why or why not?
j. Use the least squares line to estimate the gold medal time for the next Summer Olympics. Do you think that your answer is reasonable? Why or why not?

The next three questions use the following state information.

Table 12.16.
State             # letters in name    Year entered the Union    Rank for entering the Union    Area (square miles)
Alabama           7                    1819                      22                             52,423
Colorado                               1876                      38                             104,100
Hawaii                                 1959                      50                             10,932
Iowa                                   1846                      29                             56,276
Maryland                               1788                      7                              12,407
Missouri                               1821                      24                             69,709
New Jersey                             1787                      3                              8,722
Ohio                                   1803                      17                             44,828
South Carolina    13                   1788                      8                              32,008
Utah                                   1896                      45                             84,904
Wisconsin                              1848                      30                             65,499

Exercise 12.13.15. (Go to Solution)

We are interested in whether or not the number of letters in a state name depends upon the year the state entered the Union.

a. Decide which variable should be the independent variable and which should be the dependent variable.
b. Make a scatter plot of the data.
c. Does it appear from inspection that there is a relationship between the variables? Why or why not?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. What does it imply about the significance of the relationship?
f. Find the estimated number of letters (to the nearest integer) a state would have if it entered the Union in 1900. Find the estimated number of letters a state would have if it entered the Union in 1940.
g. Use the two points in (f) to plot the least squares line on your graph from (b).
h. Does it appear that a line is the best way to fit the data? Why or why not?
i. Use the least squares line to estimate the number of letters a new state that enters the Union this year would have. Can the least squares line be used to predict it? Why or why not?

Exercise 12.13.16.

We are interested in whether there is a relationship between the ranking of a state and the area of the state.

a. Let rank be the independent variable and area be the dependent variable.
b. What do you think the scatter plot will look like? Make a scatter plot of the data.
c. Does it appear from inspection that there is a relationship between the variables? Why or why not?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. What does it imply about the significance of the relationship?
f. Find the estimated areas for Alabama and for Colorado. Are they close to the actual areas?
g. Use the two points in (f) to plot the least squares line on your graph from (b).
h. Does it appear that a line is the best way to fit the data? Why or why not?
i. Are there any outliers?
j. Use the least squares line to estimate the area of a new state that enters the Union. Can the least squares line be used to predict it? Why or why not?
k. Delete “Hawaii” and substitute “Alaska” for it. Alaska is the forty-ninth state with an area of 656,424 square miles.
l. Calculate the new least squares line.
m. Find the estimated area for Alabama. Is it closer to the actual area with this new least squares line or with the previous one that included Hawaii? Why do you think that’s the case?
n. Do you think that, in general, newer states are larger than the original states?

Exercise 12.13.17. (Go to Solution)

We are interested in whether there is a relationship between the rank of a state and the year it entered the Union.

a. Let year be the independent variable and rank be the dependent variable.
b. What do you think the scatter plot will look like? Make a scatter plot of the data.
c. Why must the relationship be positive between the variables?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. What does it imply about the significance of the relationship?
f. Let’s say a fifty-first state entered the union. Based upon the least squares line, when should that have occurred?
g. Using the least squares line, how many states do we currently have?
h. Why isn’t the least squares line a good estimator for this year?

Exercise 12.13.18.

Below are the percents of the U.S. labor force (excluding self-employed and unemployed) that are members of a union. We are interested in whether the decrease is significant. (Source: Bureau of Labor Statistics, U.S. Dept. of Labor)

Table 12.17.
Year    Percent
1945    35.5
1950    31.5
1960    31.4
1970    27.3
1980    21.9
1986    17.5
1993    15.8

a. Let year be the independent variable and percent be the dependent variable.
b. What do you think the scatter plot will look like? Make a scatter plot of the data.
c. Why will the relationship between the variables be negative?
d. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
e. Find the correlation coefficient. What does it imply about the significance of the relationship?
f. Based on your answer to (e), do you think that the relationship can be said to be decreasing?
g. If the trend continues, when will there no longer be any union members? Do you think that will happen?

The next two questions refer to the following information: The data below reflects the 1991-92 Reunion Class Giving. (Source: SUNY Albany alumni magazine)

Table 12.18.
Class Year    Average Gift    Total Giving
1922          41.67           125
1927          60.75           1,215
1932          83.82           3,772
1937          87.84           5,710
1947          88.27           6,003
1952          76.14           5,254
1957          52.29           4,393
1962          57.80           4,451
1972          42.68           18,093
1976          49.39           22,473
1981          46.87           20,997
1986          37.03           12,590

Exercise 12.13.19. (Go to Solution)

We will use the columns “class year” and “total giving” for all questions, unless otherwise stated.

a. What do you think the scatter plot will look like? Make a scatter plot of the data.
b. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
c. Find the correlation coefficient. What does it imply about the significance of the relationship?
d. For the class of 1930, predict the total class gift.
e. For the class of 1964, predict the total class gift.
f. For the class of 1850, predict the total class gift. Why doesn’t this value make any sense?

Exercise 12.13.20.

We will use the columns “class year” and “average gift” for all questions, unless otherwise stated.

a. What do you think the scatter plot will look like? Make a scatter plot of the data.
b. Calculate the least squares line. Put the equation in the form of: ŷ = a + bx
c. Find the correlation coefficient. What does it imply about the significance of the relationship?
d. For the class of 1930, predict the average class gift.
e. For the class of 1964, predict the average class gift.
f. For the class of 2010, predict the average class gift. Why doesn’t this value make any sense?

Exercise 12.13.21. (Go to Solution)

We are interested in exploring the relationship between the weight of a vehicle and its fuel efficiency (gasoline mileage). The data in the table show the weights, in pounds, and fuel efficiency, measured in miles per gallon, for a sample of 12 vehicles.

Table 12.19.
Weight    Fuel Efficiency
2715      24
2570      28
2610      29
2750      38
3000      25
3410      22
3640      20
3700      26
3880      21
3900      18
4060      18
4710      15

a. Graph a scatterplot of the data.
b. Find the correlation coefficient and determine if it is significant.
c. Find the equation of the best fit line.
d. Write the sentence that interprets the meaning of the slope of the line in the context of the data.
e. What percent of the variation in fuel efficiency is explained by the variation in the weight of the vehicles, using the regression line? (State your answer in a complete sentence in the context of the data.)
f. Accurately graph the best fit line on your scatterplot.
g. For the vehicle that weighs 3000 pounds, find the residual (y-yhat). Does the value predicted by the line underestimate or overestimate the observed data value?
h. Identify any outliers, using either the graphical or numerical procedure demonstrated in the textbook.
i. The outlier is a hybrid car that runs on gasoline and electric technology, but all other vehicles in the sample have engines that use gasoline only. Explain why it would be appropriate to remove the outlier from the data in this situation. Remove the outlier from the sample data. Find the new correlation coefficient, coefficient of determination, and best fit line.
j. Compare the correlation coefficients and coefficients of determination before and after removing the outlier, and explain in complete sentences what these numbers indicate about how the model has changed.

Exercise 12.13.22. (Go to Solution)

The four data sets below were created by statistician Francis Anscombe. They show why it is important to examine the scatterplots for your data, in addition to finding the correlation coefficient, in order to evaluate the appropriateness of fitting a linear model.

Table 12.20.
Set 1           Set 2           Set 3           Set 4
x     y         x     y         x     y         x     y
10    8.04      10    9.14      10    7.46      8     6.58
8     6.95      8     8.14      8     6.77      8     5.76
13    7.58      13    8.74      13    12.74     8     7.71
9     8.81      9     8.77      9     7.11      8     8.84
11    8.33      11    9.26      11    7.81      8     8.47
14    9.96      14    8.10      14    8.84      8     7.04
6     7.24      6     6.13      6     6.08      8     5.25
4     4.26      4     3.10      4     5.39      19    12.50
12    10.84     12    9.13      12    8.15      8     5.56
7     4.82      7     7.26      7     6.42      8     7.91
5     5.68      5     4.74      5     5.73      8     6.89

a. For each data set, find the least squares regression line and the correlation coefficient. What did you discover about the lines and values of r?

For each data set, create a scatter plot and graph the least squares regression line. Use the graphs to answer the following questions:

b. For which data set does it appear that a curve would be a more appropriate model than a line?
c. Which data set has an influential point (point close to or on the line that greatly influences the best fit line)?
d. Which data set has an outlier (obviously visible on the scatter plot with best fit line graphed)?
e. Which data set appears to be the most appropriate to model using the least squares regression line?


Try these multiple choice questions

Exercise 12.13.23. (Go to Solution)

A correlation coefficient of -0.95 means there is a ____________ between the two variables.

A. Strong positive correlation
B. Weak negative correlation
C. Strong negative correlation
D. No Correlation

Exercise 12.13.24. (Go to Solution)

According to the data reported by the New York State Department of Health regarding West Nile Virus (http://www.health.state.ny.us/nysdoh/westnile/update/update.htm) for the years 2000-2008, the least squares line equation for the number of reported dead birds ( x ) versus the number of human West Nile virus cases ( y ) is . If the number of dead birds reported in a year is 732, how many human cases of West Nile virus can be expected? r = 0.5490

A. No prediction can be made.
B. 19.6
C. 15
D. 38.1

The next three questions refer to the following data, obtained from www.nhc.noaa.gov/gifs/table6.gif, showing the number of hurricanes by category to directly strike the mainland U.S. each decade. A major hurricane is one with a strength rating of 3, 4, or 5.

Table 12.21.
Decade         Total Number of Hurricanes    Number of Major Hurricanes
1941-1950      24                            10
1951-1960      17                            8
1961-1970      14                            6
1971-1980      12                            4
1981-1990      15                            5
1991-2000      14                            5
2001-2004      9                             3

Exercise 12.13.25. (Go to Solution)

Using only completed decades (1941 – 2000), calculate the least squares line for the number of major hurricanes expected based upon the total number of hurricanes.

A.
B.
C.
D.

Exercise 12.13.26. (Go to Solution)

The correlation coefficient is 0.942. Is this considered significant? Why or why not?

A. No, because 0.942 is greater than the critical value of 0.707
B. Yes, because 0.942 is greater than the critical value of 0.707
C. No, because 0.942 is greater than the critical value of 0.811
D. Yes, because 0.942 is greater than the critical value of 0.811

Exercise 12.13.27. (Go to Solution)

The data for 2001-2004 show 9 hurricanes have hit the mainland United States. The line of best fit predicts 2.83 major hurricanes to hit mainland U.S. Can the least squares line be used to make this prediction?

A. No, because 9 lies outside the independent variable values
B. Yes, because, in fact, there have been 3 major hurricanes this decade
C. No, because 2.83 lies outside the dependent variable values
D. Yes, because how else could we predict what is going to happen this decade.

**Exercises 21 and 22 contributed by Roberta Bloom

Solutions to Exercises

Solution to Exercise 12.13.1. (Return to Exercise)

a. Independent: Age; Dependent: Fatalities
d. Independent: Power Consumption; Dependent: Utility

Solution to Exercise 12.13.3. (Return to Exercise)

b.
c. -0.8533, Yes
g. No
h. No.
i. 2.97, Yes
j. slope = -0.0432. As the year increases by one, the welfare family size decreases by 0.0432 people.

Solution to Exercise 12.13.5. (Return to Exercise)

b. Yes
c.
d. 0.9436; yes
e. 478.70 feet; 1207.73 feet
g. Yes
h. Yes; (57, 1050)
i. 172.98; No
j. 11.7585 feet
k. slope = 11.7585. As the number of stories increases by one, the height of the building increases by 11.7585 feet.

Solution to Exercise 12.13.7. (Return to Exercise)

b. Yes
c.
d. 0.9448; Yes
e. 62.9206; 62.4237
h. No
i. 72.639; No
j. slope = 0.1656. As the year increases by one, the percent of workers paid hourly rates increases by 0.1656.

Solution to Exercise 12.13.9. (Return to Exercise)

b. Yes
c.
d. 0.9986; Yes
e. $5.08
f. $6.93
i. No
j. Not valid
k. slope = 0.0371. As the number of ounces increases by one, the cost of the liquid detergent increases by $0.0371 (or about 4 cents).

Solution to Exercise 12.13.11. (Return to Exercise)

c. Yes
d.
e. 0.9964; Yes
f. $208,872.49; $1,028,318.20
h. Yes
i. No
k. slope = 0.5463. As the net taxable estate increases by one dollar, the approximate probate fees and taxes increase by 0.5463 dollars (about 55 cents).

Solution to Exercise 12.13.13. (Return to Exercise)

c. Yes
d.
e. 0.9761; yes
f. 72.2 cm; 143.13 cm
h. Yes
i. No
j. 505.0 cm; No
k. slope = 7.0948. As the age of an American boy increases by one year, the average height increases by 7.0948 cm.
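
The values in parts (f), (j), and (k) can be reproduced with a short calculation. The following is a minimal Python sketch, assuming that "birth" in Table 12.14 is coded as age 0. The helper names least_squares and predict are introduced here for illustration only and do not come from the text.

```python
# Ages (years; birth coded as 0 by assumption) and average heights (cm), Table 12.14.
ages = [0, 2, 3, 5, 7, 10, 14]
heights = [50.8, 83.8, 91.4, 106.6, 119.3, 137.1, 157.5]


def least_squares(x, y):
    """Return (a, b), the intercept and slope of the line yhat = a + bx."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


a, b = least_squares(ages, heights)


def predict(age):
    """Estimated average height (cm) from the fitted line."""
    return a + b * age


print(f"yhat = {a:.4f} + {b:.4f}x")
for age in (1, 11, 62):
    # Flag ages outside the observed range 0 to 14 as extrapolation.
    inside = min(ages) <= age <= max(ages)
    note = "within data range" if inside else "extrapolation"
    print(f"age {age}: {predict(age):.2f} cm ({note})")
```

Ages 1 and 11 fall inside the observed ages, so the line can reasonably be used for them. Age 62 is far outside the data, which is why the estimate of roughly 505 cm in part (j) is not reasonable.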

Solution to Exercise 12.13.15. (Return to Exercise)

c. No
d.
e. -0.4280
f. 6; 5

Solution to Exercise 12.13.17. (Return to Exercise)

d.
e. 0.9553
f. 1934

Solution to Exercise 12.13.19. (Return to Exercise)

b.
c. 0.8302
d. $1577.46
e. $11,642.66
f. -$22,105.34

Solution to Exercise 12.13.21. (Return to Exercise)

b. r = -0.8, significant
c. yhat = 48.4-0.00725x
d. For every one pound increase in weight, the fuel efficiency decreases by 0.00725 miles per gallon. (For every one thousand pound increase in weight, the fuel efficiency decreases by 7.25 miles per gallon.)
e. 64% of the variation in fuel efficiency is explained by the variation in weight using the regression line.
g. yhat=48.4-0.00725(3000)=26.65 mpg. y-yhat=25-26.65=-1.65. Because yhat=26.65 is greater than y=25, the line overestimates the observed fuel efficiency.
h. (2750,38) is the outlier. Be sure you know how to justify it using the requested graphical or numerical methods, not just by guessing.
i. yhat = 42.4-0.00578x
j. Without outlier, r=-0.885, rsquare=0.76; with outlier, r=-0.8, rsquare=0.64. The new linear model is a better fit, after the outlier is removed from the data, because the new correlation coefficient is farther from 0 and the new coefficient of determination is larger.
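
As a check on parts (b), (c), (g), (i), and (j), the fit can be computed with and without the outlier. This is a minimal Python sketch using the data from Table 12.19. The function name fit is a hypothetical helper, not something defined in the text, and the results agree with the answers above only up to rounding.

```python
from math import sqrt

# (weight in pounds, fuel efficiency in miles per gallon), Table 12.19.
cars = [(2715, 24), (2570, 28), (2610, 29), (2750, 38), (3000, 25), (3410, 22),
        (3640, 20), (3700, 26), (3880, 21), (3900, 18), (4060, 18), (4710, 15)]


def fit(points):
    """Return (a, b, r): intercept, slope, and correlation for yhat = a + bx."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / sqrt(sxx * syy)


a, b, r = fit(cars)

# Part (g): residual = observed y minus predicted yhat for the 3000 pound vehicle.
residual_3000 = 25 - (a + b * 3000)

# Part (i): drop the hybrid car identified as the outlier and fit again.
a2, b2, r2 = fit([p for p in cars if p != (2750, 38)])

print(f"with outlier:    yhat = {a:.1f} + ({b:.5f})x, r = {r:.3f}, r^2 = {r * r:.2f}")
print(f"without outlier: yhat = {a2:.1f} + ({b2:.5f})x, r = {r2:.3f}, r^2 = {r2 * r2:.2f}")
print(f"residual at 3000 pounds: {residual_3000:.2f}")
```

A negative residual means the observed value lies below the line, so the line overestimates the fuel efficiency of that vehicle.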


Solution to Exercise 12.13.22. (Return to Exercise)

a. All four data sets have the same correlation coefficient r=0.816 and the same least squares regression line yhat=3+0.5x

b. Set 2 ; c. Set 4 ; d. Set 3 ; e. Set 1
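
The claim in part (a) can be checked numerically. The sketch below, written in Python with a hypothetical helper named fit, computes the least squares line and the correlation coefficient for each data set in Table 12.20. It only confirms that the summary numbers agree; the scatter plots are still needed to see how different the four sets are.

```python
from math import sqrt

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values shared by Sets 1 to 3
quartet = {
    "Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit(x, y):
    """Return (a, b, r): intercept, slope, and correlation for yhat = a + bx."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy / sqrt(sxx * syy)


results = {name: fit(x, y) for name, (x, y) in quartet.items()}
for name, (a, b, r) in results.items():
    print(f"{name}: yhat = {a:.2f} + {b:.3f}x, r = {r:.3f}")
```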

Figure 12.18. 



Solution to Exercise 12.13.23. (Return to Exercise)

C


Solution to Exercise 12.13.24. (Return to Exercise)

A


Solution to Exercise 12.13.25. (Return to Exercise)

A


Solution to Exercise 12.13.26. (Return to Exercise)

D


Solution to Exercise 12.13.27. (Return to Exercise)

A
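
The reasoning behind the three hurricane questions can be sketched in a few lines of Python. The example below uses the completed decades from Table 12.21 and takes the critical value 0.811 from the answer choices in Exercise 12.13.26 rather than computing it. The variable names are chosen here for illustration.

```python
from math import sqrt

# Completed decades 1941-2000, Table 12.21: (total hurricanes, major hurricanes).
decades = [(24, 10), (17, 8), (14, 6), (12, 4), (15, 5), (14, 5)]
xs = [x for x, _ in decades]
ys = [y for _, y in decades]

n = len(decades)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in decades)

b = sxy / sxx                 # slope
a = mean_y - b * mean_x       # intercept
r = sxy / sqrt(sxx * syy)     # correlation coefficient

# Critical value quoted in Exercise 12.13.26 (assumed here, not derived).
critical_value = 0.811
significant = abs(r) > critical_value

# Exercise 12.13.27: is x = 9 inside the range of observed totals?
x_new = 9
in_range = min(xs) <= x_new <= max(xs)
prediction = a + b * x_new

print(f"yhat = {a:.2f} + {b:.2f}x, r = {r:.3f}, significant: {significant}")
print(f"x = {x_new} within observed range {min(xs)} to {max(xs)}: {in_range}")
print(f"line value at x = {x_new}: {prediction:.2f}")
```

The line does give about 2.83 at x = 9, but 9 is below the smallest observed total of 12, so the value is an extrapolation and should not be relied on.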


12.14. Lab 1: Regression (Distance from School)*

Class Time:

Names:

Student Learning Outcomes:

  • The student will calculate and construct the line of best fit between two variables.

  • The student will evaluate the relationship between two variables to determine if that relationship is significant.

Collect the Data

Use 8 members of your class for the sample. Collect bivariate data (distance an individual lives from school, the cost of supplies for the current term).

  1. Complete the table.

    Table 12.22.
    Distance from school    Cost of supplies this term
      
      
      
      
      
      
      
      

  2. Which variable should be the dependent variable and which should be the independent variable? Why?

  3. Graph “distance” vs. “cost.” Plot the points on the graph. Label both axes with words. Scale both axes.

    Figure 12.19. 

    Blank graph with vertical and horizontal axes.


Analyze the Data

Enter your data into your calculator or computer. Write the linear equation below, rounding to 4 decimal places.

  1. Calculate the following:

    a. a =
    b. b =
    c. correlation =
    d. n =
    e. equation: ŷ =
    f. Is the correlation significant? Why or why not? (Answer in 1-3 complete sentences.)

  2. Supply an answer for the following scenarios:

    a. For a person who lives 8 miles from campus, predict the total cost of supplies this term:
    b. For a person who lives 80 miles from campus, predict the total cost of supplies this term:

  3. Obtain the graph on your calculator or computer. Sketch the regression line below.

    Figure 12.20. 

    Blank graph with vertical and horizontal axes.


Discussion Questions

  1. Answer each with 1-3 complete sentences.

    a. Does the line seem to fit the data? Why?
    b. What does the correlation imply about the relationship between the distance and the cost?

  2. Are there any outliers? If so, which point is an outlier?

  3. Should the outlier, if it exists, be removed? Why or why not?

12.15. Lab 2: Regression (Textbook Cost)*

Class Time:

Names:

Student Learning Outcomes:

  • The student will calculate and construct the line of best fit between two variables.

  • The student will evaluate the relationship between two variables to determine if that relationship is significant.

Collect the Data

Survey 10 textbooks. Collect bivariate data (number of pages in a textbook, the cost of the textbook).

  1. Complete the table.

    Table 12.23.
    Number of pages    Cost of textbook
      
      
      
      
      
      
      
      

  2. Which variable should be the dependent variable and which should be the independent variable? Why?

  3. Graph “pages” vs. “cost.” Plot the points on the graph in “Analyze the Data”. Label both axes with words. Scale both axes.

Analyze the Data

Enter your data into your calculator or computer. Write the linear equation below, rounding to 4 decimal places.

  1. Calculate the following:

    a. a =
    b. b =
    c. correlation =
    d. n =
    e. equation: y =
    f. Is the correlation significant? Why or why not? (Answer in 1-3 complete sentences.)

  2. Supply an answer for the following scenarios:

    a. For a textbook with 400 pages, predict the cost:
    b. For a textbook with 600 pages, predict the cost:

  3. Obtain the graph on your calculator or computer. Sketch the regression line below.

    Figure 12.21. 

    Blank graph with vertical and horizontal axes.


Discussion Questions

  1. Answer each with 1-3 complete sentences.

    a. Does the line seem to fit the data? Why?
    b. What does the correlation imply about the relationship between the number of pages and the cost?

  2. Are there any outliers? If so, which point(s) are outliers?

  3. Should the outlier, if it exists, be removed? Why or why not?

12.16. Lab 3: Regression (Fuel Efficiency)*

Class Time:

Names:

Student Learning Outcomes:

  • The student will calculate and construct the line of best fit between two variables.

  • The student will evaluate the relationship between two variables to determine if that relationship is significant.

Collect the Data

Use the most recent April issue of Consumer Reports. It will give the total fuel efficiency (in miles per gallon) and weight (in pounds) of new model cars with automatic transmissions. We will use this data to determine the relationship, if any, between the fuel efficiency of a car and its weight.

  1. Which variable should be the independent variable and which should be the dependent variable? Explain your answer in one or two complete sentences.

  2. Using your random number generator, randomly select 20 cars from the list and record their weights and fuel efficiency into the table below.

    Table 12.24.
    Weight    Fuel Efficiency
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      

  3. Which variable should be the dependent variable and which should be the independent variable? Why?

  4. By hand, do a scatterplot of “weight” vs. “fuel efficiency”. Plot the points on graph paper. Label both axes with words. Scale both axes accurately.

    Figure 12.22. 

    Blank graph with vertical and horizontal axes.


Analyze the Data

Enter your data into your calculator or computer. Write the linear equation below, rounding to 4 decimal places.

  1. Calculate the following:

    a. a =
    b. b =
    c. correlation =
    d. n =
    e. equation: ŷ =

  2. Obtain the graph of the regression line on your calculator. Sketch the regression line on the same axes as your scatterplot.

Discussion Questions

  1. Is the correlation significant? Explain how you determined this in complete sentences.

  2. Is the relationship a positive one or a negative one? Explain how you can tell and what this means in terms of weight and fuel efficiency.

  3. In one or two complete sentences, what is the practical interpretation of the slope of the least squares line in terms of fuel efficiency and weight?

  4. For a car that weighs 4000 pounds, predict its fuel efficiency. Include units.

  5. Can we predict the fuel efficiency of a car that weighs 10000 pounds using the least squares line? Explain why or why not.

  6. Questions. Answer each in 1 to 3 complete sentences.

    a. Does the line seem to fit the data? Why or why not?
    b. What does the correlation imply about the relationship between fuel efficiency and weight of a car? Is this what you expected?

  7. Are there any outliers? If so, which point is an outlier?

** This lab was designed and contributed by Diane Mathios.