What is the significance of the slope in the linear regression relationship?

To avoid making wrong inferences, regression toward the mean must be considered when designing scientific experiments and interpreting data. The conditions under which regression toward the mean occurs depend on the way the term is mathematically defined. Sir Francis Galton first observed the phenomenon in the context of simple linear regression of data points. However, a less restrictive approach is possible. Regression towards the mean can be defined for any bivariate distribution with identical marginal distributions.

Two such definitions exist. Not all bivariate distributions with identical marginals show regression towards the mean under one of these definitions; however, all such bivariate distributions do show regression towards the mean under the other definition. Historically, what is now called regression toward the mean has also been called reversion to the mean and reversion to mediocrity. Consider a simple example: a class of students takes a 100-item true/false test, and suppose that all students choose randomly on all questions. Then each student's expected score is 50, but naturally some students will score substantially above 50 and some substantially below 50 just by chance.

No matter what a student scores on the original test, the best prediction of his score on a second such test is 50. If there were no luck or random guessing involved in the answers supplied by students to the test questions, then all students would score the same on the second test as they scored on the original test, and there would be no regression toward the mean. Most realistic situations fall between these two extremes: for example, one might consider exam scores as a combination of skill and luck.

In this case, the subset of students scoring above average would be composed of those who were skilled and did not have especially bad luck, together with those who were unskilled but were extremely lucky. On a retest of this subset, the unskilled will be unlikely to repeat their lucky break, while the skilled will have a second chance to have bad luck.

Hence, those who did well previously are unlikely to do quite as well in the second test. The following is a second example of regression toward the mean.

A class of students takes two editions of the same test on two successive days. It has frequently been observed that the worst performers on the first day will tend to improve their scores on the second day, and the best performers on the first day will tend to do worse on the second day. The phenomenon occurs because student scores are determined in part by underlying ability and in part by chance. For the first test, some will be lucky, and score more than their ability, and some will be unlucky and score less than their ability.

Some of the lucky students on the first test will be lucky again on the second test, but more of them will have (for them) average or below-average scores. Therefore, a student who was lucky on the first test is more likely to have a worse score on the second test than a better score. Similarly, students who score less than the mean on the first test will tend to see their scores increase on the second test. The concept of regression toward the mean can be misused very easily. In the student test example above, it was assumed implicitly that what was being measured did not change between the two measurements.

Suppose, however, that the exam were pass/fail and students were required to score above 70 on both tests to pass. Then the students who scored under 70 the first time would have no incentive to do well, and might score worse on average the second time. The students just over 70, on the other hand, would have a strong incentive to study and concentrate while taking the test. In that case one might see movement away from 70: scores below it getting lower and scores above it getting higher. It is possible for changes between the measurement times to augment, offset, or reverse the statistical tendency to regress toward the mean.

Statistical regression toward the mean is not a causal phenomenon. A student with the worst score on the test on the first day will not necessarily increase her score substantially on the second day due to the effect. On average, the worst scorers improve, but that is only true because the worst scorers are more likely to have been unlucky than lucky.
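As a rough illustration of this point, the following sketch (not part of the original text) simulates scores as underlying skill plus luck and compares the top scorers' results on a retest. The specific numbers (10,000 students, the skill and luck spreads) are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_students = 10_000
skill = rng.normal(70, 10, n_students)         # stable underlying ability
test1 = skill + rng.normal(0, 10, n_students)  # score = skill + luck (test 1)
test2 = skill + rng.normal(0, 10, n_students)  # fresh luck on the retest

# Look at the students who did best on the first test.
top = test1 >= np.percentile(test1, 90)

print("Top 10% on test 1, mean score on test 1:", round(test1[top].mean(), 1))
print("Same students,     mean score on test 2:", round(test2[top].mean(), 1))
# The second mean is lower: these students' skill is above average, but their
# extreme luck on test 1 is not repeated -- regression toward the mean.
```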

Sir Francis Galton: Sir Francis Galton first observed the phenomenon of regression towards the mean in genetics research.

The Regression Line

Learning Objectives: Model the relationship between variables in regression analysis.

The mathematical function of the regression line is expressed in terms of a number of parameters, which are the coefficients of the equation, and the values of the independent variable.
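For the simplest case of a single independent variable x, this function can be written as below; the symbols b0 (intercept), b1 (slope), and ŷ (the predicted response) are the standard notation and match the usage later in this section.

```latex
% Fitted regression line: predicted response as a function of x,
% with parameters (coefficients) b_0 (intercept) and b_1 (slope).
\hat{y} = b_0 + b_1 x

% With several independent variables the same idea extends to
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
```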

Key Terms
slope: the ratio of the vertical and horizontal distances between two points on a line; zero if the line is horizontal, undefined if it is vertical.

Two Regression Lines

ANCOVA can be used to compare regression lines by testing the effect of a categorical variable on a dependent variable while controlling for a continuous covariate.

Key Points
Researchers, such as those working in the field of biology, commonly wish to compare regressions and determine causal relationships between two variables.

It is also possible to see similar slopes between lines but a different intercept, which can be interpreted as a difference in magnitudes but not in the rate of change.

Key Terms
covariance: a measure of how much two random variables change together.
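As a sketch of how such a comparison might be run in practice (the data, and the column names y, x, and group, are illustrative and not from the original text), one common approach is to fit a model with an interaction term and check whether the slopes and intercepts differ between groups, for example with statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: two groups sharing a predictor x and response y.
rng = np.random.default_rng(1)
n = 50
df = pd.DataFrame({
    "x": np.tile(np.linspace(0, 10, n), 2),
    "group": ["A"] * n + ["B"] * n,
})
df["y"] = (2.0 + 1.5 * df["x"]
           + np.where(df["group"] == "B", 3.0, 0.0)   # group B shifted upward
           + rng.normal(0, 1.0, 2 * n))

# ANCOVA-style model: group effect, continuous covariate x, and interaction.
model = smf.ols("y ~ x * group", data=df).fit()

# The x:group coefficient tests whether the two regression lines have
# different slopes; the group coefficient tests for different intercepts.
print(model.summary())
```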

Least-Squares Regression

The criterion for determining the least-squares regression line is that the sum of the squared errors is made as small as possible.

Key Points
Linear regression dictates that if there is a linear relationship between two variables, you can then use one variable to predict values of the other variable.

The least-squares regression method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation. Least-squares regression provides minimum-variance, mean-unbiased estimation when the errors have finite variances.

Key Terms
least squares regression: a statistical technique based on fitting a straight line to the observed data.
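A minimal sketch of this idea (the x and y values below are made up for illustration): compute the slope and intercept with the standard least-squares formulas, and confirm against numpy's built-in polynomial fit, which solves the same minimization.

```python
import numpy as np

# Illustrative data: a predictor x and a response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Least-squares slope and intercept (minimize the sum of squared
# vertical distances between observed y and the fitted line).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# numpy's polyfit solves the same least-squares problem.
slope, intercept = np.polyfit(x, y, 1)

print(f"formula:  b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"polyfit:  b1 = {slope:.4f}, b0 = {intercept:.4f}")
```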

The approach described in this section is illustrated in the sample problem at the end of this lesson. If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level, and rejecting the null hypothesis when the P-value is less than the significance level.

The local utility company surveys randomly selected customers. For each survey participant, the company collects the following: annual electric bill in dollars and home size in square feet. Output from a regression analysis appears below. Is there a significant linear relationship between annual bill and home size? Use a 0.05 level of significance.
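The survey's actual regression output is not reproduced here, but the following sketch shows how such a slope test could be carried out in software and its P-value compared to 0.05. The bill and home-size numbers below are made up purely for illustration; they are not the company's survey data.

```python
from scipy import stats

# Illustrative data (NOT the utility company's survey results):
home_size = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]    # sq. ft.
annual_bill = [1450, 1520, 1640, 1700, 1230, 1480, 2120, 2290, 1390, 1600]  # dollars

result = stats.linregress(home_size, annual_bill)

print(f"slope   = {result.slope:.4f}")
print(f"p-value = {result.pvalue:.4g}")

# Reject H0: beta1 = 0 when the p-value is below the significance level.
alpha = 0.05
if result.pvalue < alpha:
    print("Significant linear relationship between bill and home size.")
else:
    print("No significant linear relationship detected.")
```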

A strong relationship between the predictor variable and the response variable leads to a good model. A simple linear regression model is a mathematical equation that allows us to predict a response for a given predictor value. The slope describes the change in y for each one-unit change in x. For example, a hydrologist creates a model to predict the volume flow for a stream at a bridge crossing with a predictor variable of daily rainfall in inches. The y-intercept gives the predicted flow on a day with no rainfall. The slope tells us that if it rained one inch that day, the flow in the stream would increase by an additional 29 gal.

If it rained 2 inches that day, the flow would increase by an additional 58 gal.

The least-squares regression line can be computed with shortcut equations: the slope is b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², and the intercept is b0 = ȳ − b1·x̄. An alternate computational equation for the slope is b1 = r·(sy/sx), where r is the correlation coefficient and sx and sy are the sample standard deviations of x and y.

This simple model is the line of best fit for our sample data. The regression line does not go through every point; instead, it balances the difference between all data points and the straight-line model. The difference between the observed data value and the predicted value (the value on the straight line) is the error, or residual.

The criterion to determine the line that best describes the relation between two variables is based on the residuals. For example, if you wanted to predict the chest girth of a black bear given its weight, you could use a regression model fitted to data on bear weights and chest girths.

But a measured bear chest girth (the observed value) will generally differ from the value the model predicts for a bear of that weight; the difference is the residual. A negative residual indicates that the model is over-predicting, meaning the model over-predicted the chest girth of that particular bear. A positive residual indicates that the model is under-predicting. This random error (residual) takes into account all unpredictable and unknown factors that are not included in the model. An ordinary least-squares regression line minimizes the sum of the squared errors between the observed and predicted values to create a best-fitting line.
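A small sketch of the residual calculation follows. The bear weights, girths, and fitted coefficients here are made up for illustration; they are not the values used in the original example.

```python
import numpy as np

# Hypothetical fitted model: chest girth (in.) predicted from weight (lb.).
b0, b1 = 14.0, 0.40   # illustrative intercept and slope only

weights = np.array([120.0, 180.0, 250.0])   # observed bear weights (lb.)
girths = np.array([60.0, 88.0, 118.0])      # observed chest girths (in.)

predicted = b0 + b1 * weights
residuals = girths - predicted               # observed minus predicted

for w, g, p, e in zip(weights, girths, predicted, residuals):
    verdict = "over-predicts" if e < 0 else "under-predicts"
    print(f"weight {w:5.0f} lb.: observed {g:5.1f}, predicted {p:5.1f}, "
          f"residual {e:+5.1f} -> model {verdict}")
```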

The differences between the observed and predicted values are squared to deal with the positive and negative differences. After we fit our regression line (compute b0 and b1), we usually wish to know how well the model fits our data. To determine this, we need to think back to the idea of analysis of variance. In ANOVA, we partitioned the variation using sums of squares so we could identify a treatment effect as opposed to the random variation that occurred in our data.

The idea is the same for regression. We want to partition the total variability into two parts: the variation due to the regression and the variation due to random error. And we are again going to compute sums of squares to help us do this. Suppose the total variability in the sample measurements about the sample mean is denoted by SST = Σ(yi − ȳ)², the sum of squares of total variability about the mean.

The sum of the squared differences between the predicted values and the sample mean is denoted by SSR = Σ(ŷi − ȳ)², the sum of squares due to regression. The SSR represents the variability explained by the regression line. Finally, the variability which cannot be explained by the regression line is called the sum of squares due to error (SSE) and is denoted by SSE = Σ(yi − ŷi)².

SSE is actually the sum of the squared residuals. The sums of squares and mean sums of squares (just as in ANOVA) are typically presented in the regression analysis of variance table. The ratio of the mean sum of squares for the regression (MSR = SSR/1) to the mean sum of squares for error (MSE = SSE/(n − 2)) forms an F-test statistic used to test the regression model.

The larger the explained variation, the better the model is at prediction. The larger the unexplained variation, the worse the model is at prediction. A quantitative measure of the explanatory power of a model is R², the Coefficient of Determination: R² = SSR/SST = 1 − SSE/SST. The Coefficient of Determination measures the percent of variation in the response variable y that is explained by the model. The Coefficient of Determination and the linear correlation coefficient are related mathematically: for simple linear regression, R² = r².
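Putting these pieces together, here is a sketch of computing SST, SSR, SSE, R², and the F statistic; it reuses the illustrative x and y arrays from the earlier least-squares sketch, none of which come from the original text.

```python
import numpy as np

# Illustrative data (same as the earlier least-squares sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit the least-squares line.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Partition the total variability.
sst = np.sum((y - y.mean()) ** 2)      # total variability about the mean
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the line
sse = np.sum((y - y_hat) ** 2)         # unexplained (residual) variability

n = len(x)
msr = ssr / 1                          # 1 regression degree of freedom
mse = sse / (n - 2)                    # error degrees of freedom: n - 2
f_stat = msr / mse
r_squared = ssr / sst

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"R^2 = {r_squared:.3f}, F = {f_stat:.2f}")
```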

Even though you have determined, using a scatterplot, the correlation coefficient, and R², that x is useful in predicting the value of y, the results of a regression analysis are valid only when the data satisfy the necessary regression assumptions. We can use residual plots to check for a constant variance, as well as to make sure that the linear model is in fact adequate. In a residual plot, the center horizontal axis is set at zero. One property of the residuals is that they sum to zero and have a mean of zero.

A residual plot should be free of any patterns and the residuals should appear as a random scatter of points about zero. A residual plot with no appearance of any patterns indicates that the model assumptions are satisfied for these data.

The residuals tend to fan out or fan in as the error variance increases or decreases, indicating non-constant variance. If a pattern such as curvature appears in the residual plot, the model may need higher-order terms of x, or a non-linear model may be needed to better describe the relationship between y and x. Transformations on x or y may also be considered. A normal probability plot allows us to check that the errors are normally distributed. It plots the residuals against the expected values of the residuals as if they had come from a normal distribution.
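A sketch of these two diagnostic plots, reusing the illustrative x and y arrays from the earlier sketches (matplotlib and scipy are assumed to be available; none of this reproduces the original page's figures):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative data and fit (as in the earlier sketches).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual plot: residuals vs. predicted values, centered on zero.
ax1.scatter(b0 + b1 * x, residuals)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("predicted values")
ax1.set_ylabel("residuals")
ax1.set_title("Residual plot")

# Normal probability plot of the residuals.
stats.probplot(residuals, dist="norm", plot=ax2)
ax2.set_title("Normal probability plot")

plt.tight_layout()
plt.show()
```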

Recall that when the residuals are normally distributed, they will follow a straight-line pattern, sloping upward. The most serious violations of normality usually appear in the tails of the distribution because this is where the normal distribution differs most from other types of distributions with a similar mean and spread. Curvature in either or both ends of a normal probability plot is indicative of nonnormality. Our regression model is based on a sample of n bivariate observations drawn from a larger population of measurements.

We use the means and standard deviations of our sample data to compute the slope b1 and y-intercept b0 in order to create an ordinary least-squares regression line.

But we want to describe the relationship between y and x in the population, not just within our sample data. We want to construct a population model. Now we will think of the least-squares line computed from a sample as an estimate of the true regression line for the population. In our population, there could be many different responses for a value of x. In simple linear regression, the model assumes that for each value of x the observed values of the response variable y are normally distributed with a mean that depends on x.

We also assume that these means all lie on a straight line when plotted against x (a line of means). In other words, the noise, or error term, is the variation in y due to other causes that prevents the observed (x, y) values from forming a perfectly straight line. The sample data used for regression are the observed values of y and x. The response y to a given x is a random variable, and the regression model describes the mean and standard deviation of this random variable y.
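In the usual notation, with β0 and β1 for the population intercept and slope, ε for the error term, and σ for its standard deviation (standard symbols, since the original page's formulas were not preserved), the population model just described can be written as:

```latex
% Line of means: the mean response depends linearly on x
\mu_y = \beta_0 + \beta_1 x

% Individual responses: the mean plus random noise
y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)
```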

We now want to use the least-squares line as a basis for inference about a population from which our sample was drawn. Procedures for inference about the population regression line will be similar to those described in the previous chapter for means. As always, it is important to examine the data for outliers and influential observations. A key quantity is the regression standard error, s: it is the standard deviation of the model errors, and it measures the variation of y about the population regression line. We will use the residuals to compute this value.

The residual is ei = yi − ŷi, and the regression standard error is s = √(SSE/(n − 2)). A small value of s suggests that observed values of y fall close to the true regression line and the line should provide accurate estimates and predictions. We relied on sample statistics such as the mean and standard deviation for point estimates, margins of error, and test statistics.

Inference for the slope and intercept is based on the normal distribution using the estimates b0 and b1. Because we use s, we rely on the Student t-distribution with n − 2 degrees of freedom. We can construct confidence intervals for the regression slope and intercept in much the same way as we did when estimating the population mean.
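The standard formulas involved, written in the notation used above (these are the usual simple linear regression results; the original page's equations were not preserved), are the standard error of the slope, the confidence interval for β1, and the test statistic for H0: β1 = 0:

```latex
% Standard error of the estimated slope
SE(b_1) = \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}}

% Confidence interval for the population slope beta_1
b_1 \pm t_{\alpha/2,\, n-2}\, SE(b_1)

% t statistic for testing H_0: beta_1 = 0, with n - 2 degrees of freedom
t = \frac{b_1}{SE(b_1)}
```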

The null hypothesis H0: β1 = 0 tells us that the mean of y does NOT vary with x. In other words, there is no straight-line relationship between x and y, and the regression of y on x is of no value for predicting y.

The index of biotic integrity (IBI) is a measure of water quality in streams.
