|
SUMMER 2004 DR SUSAN CAROL LOSH |
|
|
|
| Assignment 5 | July 28 (Wednesday) |
| Exam 3 | August 4 (Wednesday) |
|
KEY TO: Agresti and Finlay, Chapter 9, pp 301-342; Chapter 11, pp 382-404 and pp 411-421
I treat bivariate and multiple regression as a comprehensive unit because I believe it is easier to learn this way. Therefore, it is a good idea to read through Guides 7 and 8, which translate much of this material, then go back and read both selections of Agresti and Finlay's Chapters 9 and 11.
|
ASSIGNMENT FIVE SPECIFICATIONS |
|
LINE |
VARIANCE EXPLAINED |
ACCIDENT |
|
|
This Guide examines some more technical
aspects of regression analysis. But in the thicket of formulas and terminology,
keep the purpose ahead of you:
|
|
|
|
|
A regression equation creates a line for "simple regression" in 2-dimensional space, and, for multiple regression, a geometric plane in at least 3-dimensional space.
The "a" term is the intercept term or constant, where the predictive regression line crosses the Y-axis. It is the Y score we would expect if the X score were "0".
Example: according to my height and weight formula in Guide 7, the predicted weight score for a woman exactly five feet tall would be 100 pounds.
The "b" term is a slope. It tells us how many units to go up or down on the dependent variable for a one unit change in the independent variable. For example, in the formula that I borrowed from insurance companies, a one inch increase in height is predicted to cause a 5 pound increase in weight.
The b terms ARE ALWAYS IN THE UNITS OF THE DEPENDENT VARIABLE. In my Guide 7 "weighty example", the bs will always come out as pounds. We call this predictive slope the metric or "unstandardized" regression coefficient because it is in the "metric" or the units of the dependent variable.
If we had the entire population, or close to it, we would know the "real" intercept term, or $\beta_0$, and we would also know the "real" slope or slopes terms, or $\beta_{yk}$. My weight example, in large part, was derived from formulas devised by insurance companies, who measured and weighed hundreds of thousands of women, to create these estimates.
But, nearly all of the time, we are working with samples, and therefore we must ESTIMATE the intercept and slopes terms from sample data.
Below, let's review
the pieces of the regression equations, first for "simple" or bivariate
regression, where you have ONE independent variable and one dependent variable,
then for "multiple" or multivariate regression, where you have AT LEAST
TWO independent variables and one dependent variable.
SIMPLE REGRESSION
GENERIC EQUATIONS (OBSERVED)
| | Dependent Variable | Y-axis Intercept | Slope term | Residual term |
| POPULATION GENERIC EQUATION | $Y =$ | $\beta_0$ | $+\ \beta_{yx}X$ | $+\ \varepsilon$ |
| SAMPLE GENERIC EQUATION | $Y =$ | $a$ | $+\ b_{yx}X$ | $+\ e$ |
When we include the "e" term, we have the actual equation for each person. If we omit the "e" term, we have an estimated prediction equation.
MULTIPLE REGRESSION
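GENERIC EQUATIONS (OBSERVED)
POPULATION: $Y = \beta_0 + \beta_{y1}X_1 + \beta_{y2}X_2 + \dots + \beta_{yK}X_K + \varepsilon$
SAMPLE: $Y = b_0 + b_{y1}X_1 + b_{y2}X_2 + \dots + b_{yK}X_K + e$
where K is the NUMBER OF INDEPENDENT VARIABLES.
If you have K independent variables, you will have K regression coefficients.
CONVENTION: the dependent variable comes FIRST in the subscript to each term.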
|
|
|
|
|
|
How do we obtain the intercept and slope
terms?
The intercept and slope terms are estimated
with a specific goal in mind:
The intercept
and slope terms create what is called "the best-fitting straight
line" (or plane) that describes the data.
What does best fitting mean? IT IS
VERY SPECIFIC!
The intercept and slope terms are chosen so as to minimize the AVERAGE SQUARED deviation from the regression line or plane. This means the observed scores for each case are (on the average) as close to the line or plane as possible. We want our line such that the vertical deviations from it are as small as they can possibly be, given the data.
This means that
we MINIMIZE THE (RESIDUAL) SUM OF SQUARED ERROR or the sum of the
squared "e" (residual) terms.
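$$\text{SSE} = \sum (Y - \hat{Y})^2 = \sum e^2$$
Here $Y$ is the observed score and $\hat{Y}$ is the score predicted by the regression equation.
We then call the difference between the observed and the estimated (predicted) score the residual:
$$e = Y - \hat{Y}$$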
To get the unexplained sum of squares or sum of squared errors, just take each "e" or residual score, square the e score, then add up all the squared residual terms and you have the formula in the box above.
The goal is to MINIMIZE ERROR so as to have the very best prediction possible. This means the smallest unexplained sum of squares possible because:
|
|
We obtain the intercept and slope terms through partial differentiation. The slope terms are actually partial derivatives. You will find a mathematical explanation in most introductory calculus books (typically second semester calculus books) under "least squares analysis," so-called because you are minimizing the sum of squared errors.
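For readers who want to see the calculus step, here is a brief sketch for simple regression only. It uses the sample notation of this Guide (a, b, e); the subscript i for individual cases is added here for clarity. Treat it as an outline of the standard textbook derivation rather than a full proof.

```latex
% Sum of squared errors as a function of the intercept a and slope b
\text{SSE}(a, b) = \sum_{i=1}^{n} (Y_i - a - bX_i)^2

% Set both partial derivatives equal to zero
\frac{\partial\,\text{SSE}}{\partial a} = -2 \sum_{i=1}^{n} (Y_i - a - bX_i) = 0
\qquad
\frac{\partial\,\text{SSE}}{\partial b} = -2 \sum_{i=1}^{n} X_i (Y_i - a - bX_i) = 0

% Solving the two equations together gives the least squares estimates
a = \bar{Y} - b\bar{X}
\qquad
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
```

The first condition says the residuals sum to zero; the second says the residuals are uncorrelated with X.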
This type of regression
we are examining is called "Ordinary Least Squares" or OLS Regression
for
short.
|
|
We can use systems
of simultaneous equations to solve for the intercept and slope terms. One
comparatively easy way is to make each row in the matrix an equation for
the dependent variable and make the entries in the matrix the bivariate
correlations among the variables (the math of the actual entries is reserved
for your higher level statistics courses). We solve for the standardized
beta weights (not the metric bs) and at the end of the process multiply
the beta weights back by standard deviation ratios to turn them into metric
regression coefficients. A variation on this process is what most computer
programs do to derive the intercept and slope terms.
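As a rough illustration of the simultaneous-equations idea, here is a minimal Python sketch for two predictors. It borrows three correlations reported in this Guide's science knowledge example (number of college science courses and educational level as predictors of science knowledge score) and solves the two standardized equations by Cramer's rule. The function name is made up for illustration, gender is left out to keep the system 2 x 2 (so these beta weights will not match the three-predictor results), and the standard deviation of the number of science courses in the last step is hypothetical.

```python
def solve_standardized_betas(r_y1, r_y2, r_12):
    """Solve the two standardized equations
         r_y1 = beta1 + beta2 * r_12
         r_y2 = beta1 * r_12 + beta2
    for beta1 and beta2 using Cramer's rule."""
    det = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / det
    beta2 = (r_y2 - r_y1 * r_12) / det
    return beta1, beta2

# Correlations quoted in this Guide's science knowledge example
r_y_courses = 0.44      # science knowledge with number of science courses
r_y_educ = 0.42         # science knowledge with educational level
r_courses_educ = 0.59   # the two predictors with each other

beta_courses, beta_educ = solve_standardized_betas(
    r_y_courses, r_y_educ, r_courses_educ
)
print(round(beta_courses, 3), round(beta_educ, 3))

# Turn a beta weight back into a metric b: b = beta * (s_y / s_x)
s_y = 2.11         # SD of science knowledge score, from this Guide
s_courses = 3.0    # HYPOTHETICAL SD of number of science courses
b_courses = beta_courses * (s_y / s_courses)
print(round(b_courses, 3))
```

With more than two predictors the same logic applies, but the system has one equation per predictor and is usually solved with matrix routines.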
|
|
Starting with simple regression, where you just have one independent and one dependent variable, the formula for b is:
$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$
And the formula for a becomes:
$$a = \bar{Y} - b\bar{X}$$
The NUMERATOR for the b term is identical to the numerator for Pearson's r: it is the sum of cross-products (multiply the mean deviation on the independent variable by the mean deviation on the dependent variable for each case, then add these up), which is also the numerator of the covariance.
The DENOMINATOR for the b term is simply the sum of squared deviation scores between the independent variable and its own mean. It is as though you began to calculate the standard deviation of the independent variable, but you stopped once you had summed the squared deviations.
By using the independent variable sum of squares in the denominator of b, you are setting the metric for the metric regression coefficient, i.e., for a one unit change in "x" (e.g., one inch of height) you have a specified change in "y" (e.g., 5 pounds).
We obtain the b or metric regression slopes
FIRST, then go back and solve for the a or intercept term.
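The slope-first, intercept-second calculation can be sketched in a few lines of Python. The five data points below are hypothetical (made up to resemble the height and weight example), and the function name is only for illustration.

```python
def simple_regression(x, y):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of cross-products of the deviation scores
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: sum of squared deviations of X from its own mean
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum_xy / sum_xx        # the slope comes first
    a = mean_y - b * mean_x    # then the intercept
    return a, b

# HYPOTHETICAL data: inches over five feet, and weight in pounds
inches_over_5ft = [0, 2, 4, 6, 8]
pounds = [102, 108, 121, 129, 140]

a, b = simple_regression(inches_over_5ft, pounds)
print(f"a = {a:.2f} pounds, b = {b:.2f} pounds per inch")
```

Notice that b comes out in pounds per inch, the metric of the dependent variable for a one unit change in the independent variable.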
|
|
Well, I did warn you that this was the technical guide. Below is ONE way to derive the B coefficients in multiple regression when you have only TWO independent variables:
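Here's the formula for b1:
$$b_{y1} = \frac{s_y}{s_1} \cdot \frac{r_{y1} - r_{y2}\,r_{12}}{1 - r_{12}^2}$$
ry1 is the bivariate correlation between the dependent variable and X1
ry2 is the bivariate correlation between the dependent variable and X2
r12 is the bivariate correlation between X1 and X2
sy and s1 are the standard deviations of the dependent variable and of X1
The formula for b2 is the same with the roles of X1 and X2 exchanged. Once you have the slopes, the intercept is:
$$b_0 = \bar{Y} - b_{y1}\bar{X}_1 - b_{y2}\bar{X}_2$$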
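To make the two-predictor case concrete, here is a minimal Python sketch that builds the metric slopes from bivariate correlations and standard deviations, using the standard two-predictor formula $b_{y1} = (s_y / s_1)(r_{y1} - r_{y2} r_{12}) / (1 - r_{12}^2)$, then solves for the intercept from the means. The six cases are hypothetical, and the function names are invented for this illustration.

```python
from statistics import mean, stdev

def pearson_r(u, v):
    """Pearson correlation computed from deviation scores."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

def two_predictor_regression(x1, x2, y):
    """Return (b0, b1, b2) using the correlation-based formulas."""
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    s_y, s_1, s_2 = stdev(y), stdev(x1), stdev(x2)
    b1 = (s_y / s_1) * (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
    b2 = (s_y / s_2) * (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)   # intercept comes last
    return b0, b1, b2

# HYPOTHETICAL data for six women
height = [0, 2, 4, 6, 8, 3]      # inches over five feet
exercise = [4, 1, 3, 0, 2, 5]    # 15-minute exercise periods per day
weight = [97, 109, 118, 131, 139, 110]

b0, b1, b2 = two_predictor_regression(height, exercise, weight)
residuals = [w - (b0 + b1 * h + b2 * e)
             for h, e, w in zip(height, exercise, weight)]
print(round(b0, 2), round(b1, 2), round(b2, 2))
```

A quick check that these really are least squares estimates: the residuals add up to zero and are uncorrelated with each predictor.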
R2 tells us how close the data are to the regression line or regression plane that you draw with the regression equation. When the data are very close to the predicted line, R is close to 1 and prediction is excellent. Use the strength chart in Guide 5 to evaluate R2, from nothing through very weak to very strong to "perfect" (R2 is never negative).
R2 is THE PRE (proportional reduction in error) measure. It tells us how precisely we can predict the dependent variable from knowing the scores on at least one independent variable.
Recall that in Guide 7, I described Pearson's r as:
What does it mean to say that R2 or r2 is a measure of how well we have explained the variance in the dependent variable?
In order to answer that question, we must study three entities and the relationships among them.
The first entity is
the TOTAL SUM OF SQUARES.
Sometimes this is called the "total variation
in y" (y being the dependent variable.)
In the total sum of squares, you begin taking a standard deviation on the dependent variable but you stop at the third step:
1) you subtract the mean of the dependent variable from each dependent variable score
2) you square this difference or deviation
3) you sum or add up all the squared deviations
from the mean score.
$$\text{TSS} = \sum (Y - \bar{Y})^2$$
The second entity
is the SUM OF SQUARED ERROR.
Sometimes called
the "unexplained sum of squares" or the "residual sum of squares."
Or the sum of the squared deviations between the predicted dependent variable score (predicted by the regression equation) and the actual observed dependent variable score.
Although you did your best to find good independent variables, the unexplained (residual) sum of squared error (SSE) is what you COULDN'T predict. It might be random measurement error, it might be all the other predictors that you DIDN'T include in the regression equation -- we don't know exactly what this prediction error is, although we make the statistical assumption that e has a normal distribution and that it is randomly distributed across values of the independent variable.
$$\text{SSE} = \sum (Y - \hat{Y})^2$$
The third entity is the REGRESSION SUM OF SQUARES, sometimes called the "explained sum of squares."
Why "explained"? Because it is how much better you do predicting the dependent variable knowing scores on the independent variable than if you had no information about the independent variables at all.
So, the Explained or Regression Sum of Squares is:
1) The deviation between the PREDICTED
dependent variable score and the MEAN OF THE DEPENDENT VARIABLE.
(The mean would have been your best guess
if
you didn't know the independent variable scores.)
2) Square that deviation (so the positive and negative deviations don't cancel each other out) then
3) Sum up all the squared deviation scores between the dependent variable predicted score and the dependent variable mean.
$$\text{Regression SS} = \sum (\hat{Y} - \bar{Y})^2$$
The Regression or "Explained" Sum
of Squares examines the difference between the regression predicted score
and the mean score (squared).
Thus, it tells us how much more accurate our prediction is using the regression equation than if we only had the mean score on the dependent variable for the entire group.
NOW, we can look
at R-square as the ratio of the "explained" sum of squares to the total
original variation in the dependent variable. Below are several different
formulae for the same entity!
$$R^2 = \frac{\text{Regression SS}}{\text{TSS}} = \frac{\text{TSS} - \text{SSE}}{\text{TSS}} = 1 - \frac{\text{SSE}}{\text{TSS}}$$
All these formulas are just different ways of saying the same thing:
|
|
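To see the three sums of squares side by side, here is a small Python sketch. The five cases are hypothetical, and the variable names are chosen just for this illustration. It fits the least squares line, computes the total, unexplained, and explained sums of squares, and then calculates R2 three different ways.

```python
x = [0, 2, 4, 6, 8]              # HYPOTHETICAL independent variable scores
y = [102, 108, 121, 129, 140]    # HYPOTHETICAL dependent variable scores

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
b = sum_xy / sum_xx
a = mean_y - b * mean_x
predicted = [a + b * xi for xi in x]

# Total: observed score minus the mean, squared and summed
tss = sum((yi - mean_y) ** 2 for yi in y)
# Unexplained: observed score minus predicted score, squared and summed
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
# Explained: predicted score minus the mean, squared and summed
reg_ss = sum((pi - mean_y) ** 2 for pi in predicted)

r2_a = reg_ss / tss
r2_b = (tss - sse) / tss
r2_c = 1 - sse / tss
print(round(tss, 2), round(sse, 2), round(reg_ss, 2))
print(round(r2_a, 4), round(r2_b, 4), round(r2_c, 4))
```

The explained and unexplained pieces add up to the total, which is why all three versions of R2 agree.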
TESTING FOR STATISTICAL SIGNIFICANCE IN REGRESSION |
When you work with multiple regression, you will have two sets of tests for statistical significance.
1. The first is for the overall regression, whether all the independent variables PUT TOGETHER have a TOTAL non-zero influence on the dependent variable.
This is also the test for R2, the squared multiple correlation coefficient.
The null hypothesis, Ho : R2 = 0
The alternative hypothesis is HA: R2 > 0.
[Review null hypotheses in Guide 4 if you need to.]
Because R2 is a squared measure,
it cannot be a negative number.
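The computer reports this global test for you, but it can help to see where the number comes from. The sketch below uses the usual textbook F ratio for R2, F = (R2 / K) / ((1 - R2) / (n - K - 1)), which this Guide does not print; the function name is invented for illustration. The inputs are the R2, sample size, and number of predictors from the science knowledge example in this Guide.

```python
def global_f(r_squared, n, k):
    """F ratio for Ho: R2 = 0, with K predictors and n cases."""
    df_regression = k
    df_residual = n - k - 1
    f = (r_squared / df_regression) / ((1 - r_squared) / df_residual)
    return f, df_regression, df_residual

# Values from the science knowledge regression in this Guide
f, df1, df2 = global_f(r_squared=0.27, n=1882, k=3)
print(f"F({df1}, {df2}) = {f:.2f}")
```

You would then compare this F to the F distribution with those two degrees of freedom to obtain the probability level.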
|
2. In the
second set of tests for statistical significance, you will separately
test whether each B is zero or non-zero. After all, the R2
could be statistically significant, but that could happen because just
one B out of many independent variables had statistically significant effects
on the dependent variable, and the other Bs, within sampling error, were
simply 0.
To say that a B is zero is to say IT HAS NO NET PREDICTIVE EFFECT ON THE DEPENDENT VARIABLE. This also means that the slope of that line is totally flat. As X rises or falls, the value of Y stays the same (e.g. controlling height, no matter how many cigarettes you smoke per day, your weight in pounds is the same).
It is entirely possible that an independent variable can have a strong effect on the dependent variable USING THE BIVARIATE CORRELATION, and NO NET EFFECT WITH THE REGRESSION B, once other variables are controlled.
In fact, you may have detected a spurious or intervening causal relationship to have such a thing happen (although it will take more than one regression model to begin to tease this out). When you have a set of independent variables that are highly intercorrelated with each other, you can also get such a set of findings.
For each B:
The null hypothesis, Ho : Bk = 0
The alternative hypothesis is HA: | Bk | > 0
That is, the alternative is that the absolute value of B is greater than zero. Because Bs are directional, the B could turn out to be positive or negative.
We test for whether B is zero in the
population with a t-test.
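$$t = \frac{B}{se_B}$$
The Standard Error (seB) of the slope term (with one independent variable) is:
$$se_B = \frac{s}{\sqrt{\sum (X - \bar{X})^2}}, \qquad s = \sqrt{\frac{\text{SSE}}{n - 2}}$$
Here s is the standard error of the estimate: the square root of the sum of squared errors divided by its degrees of freedom.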
However, again, most computer programs will either compute the t-test for the B for you, or will give you the Standard Error of the B so you can use a calculator to find it.
If your sample is over 120 people (and you only have a couple of independent variables), a B that is about twice its standard error typically is statistically significant at the .05 probability level.
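Here is a minimal Python sketch of the standard error and t-test for a slope in simple regression. It assumes the standard one-predictor formula (the standard error of the estimate, computed with n - 2 degrees of freedom, divided by the square root of the sum of squares of X), and the five cases are hypothetical.

```python
import math

x = [0, 2, 4, 6, 8]              # HYPOTHETICAL independent variable scores
y = [102, 108, 121, 129, 140]    # HYPOTHETICAL dependent variable scores

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum_xx
a = mean_y - b * mean_x

# Standard error of the estimate: root of SSE over its degrees of freedom
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

se_b = s / math.sqrt(sum_xx)   # standard error of the slope
t = b / se_b                   # t-test for Ho: B = 0, with n - 2 df
print(round(se_b, 4), round(t, 2))
```

With only five cases the "twice its standard error" rule of thumb does not apply; you would look up t with 3 degrees of freedom instead.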
B can be either positive (more calories
mean more weight) or negative (more exercise means less weight).
If you know the direction of B in advance,
you can do a one-tailed test. (did you have to do a study to know that,
all else equal, people who eat more weigh more?)
If you are doing SIMPLE REGRESSION with one independent variable, the sign of r is the same as the sign of the B.
If you are doing MULTIPLE REGRESSION,
R is always the square root of R2 and it is always positive.
|
|
Remember that question from the Surveys of Public Attitudes Toward Science about whether the sun went around the earth or the earth went around the sun? Well, on six separate surveys, spanning the years 1988 to 1999, there were nine more questions on basic science. Here are a few:
I created a science knowledge score from the total 10 items. Each correct response was coded 1, thus the total correct runs from 0 (none correct) to 10 (all correct). The data in the tables below are from the 1999 survey only.
The average for all 1882 respondents was 6.9 correct out of 10 with a standard deviation of 2.11 points.
In my regression to explain science knowledge scores, I use THREE PREDICTOR VARIABLES: gender (as a dummy variable, coded 0-1), educational degree level, and the number of college science courses the person elected. Obviously I should control degree level, because the individuals who never went to college at all had no college science courses.
First, I examined how each independent variable separately related to science knowledge score.
The first is a difference of means test (a t-test) by gender (why a difference of means test? why not a table?)
SCIENCE KNOWLEDGE SCORES BY GENDER
| GROUP | Mean Score (out of 10) | Standard Deviation |
| MALES | 7.6 | 1.93 |
| FEMALES | 6.4 | 2.10 |
| TOTAL | 6.9 | 2.11 |
t-test (1881) = 12.59 p < .001 Eta = 0.278
Notice that on the average, males score 1.2 points higher than females. This difference was (a) highly statistically significant and (b) of moderate strength.
(If you had trouble with either of these two findings, go back and review Guides 4 and 5.)
SCIENCE KNOWLEDGE SCORES BY EDUCATIONAL LEVEL
| GROUP | Mean Score (out of 10) | Standard Deviation |
| Less Than High School | 5.2 | 2.09 |
| High School Degree | 6.6 | 1.97 |
| College Degree | 8.0 | 1.78 |
| Advanced Degree | 8.3 | 1.56 |
| TOTAL | 6.9 | 2.11 |
F-test (3,1878) = 143.38 p < .001 Eta = 0.43
Again, the difference in science knowledge scores across educational levels is highly statistically significant. This relationship, too, is moderate. Scores increase at a constant rate (1.4 points) from the less than high school group to the college degree group, then basically plateau at the graduate level, so the relationship begins to approximate a straight line.
Notice how the standard deviation on our dependent variable, science knowledge scores, gets smaller and smaller with each successive educational level.
This is an example of HETEROSCEDASTICITY, or UNEQUAL VARIANCES ON THE DEPENDENT VARIABLE ACROSS CATEGORIES OF THE INDEPENDENT VARIABLE. Heteroscedasticity can be a problem for several reasons. Among them is that the regression Bs are often no longer as efficient (as precise) as they could be. In later statistics courses, you will learn how to correct for heteroscedasticity. We will also discuss heteroscedasticity briefly in the last section of this Guide.
These correlations are presented below in what is called A CORRELATION MATRIX.
ZERO ORDER (bivariate) CORRELATION MATRIX
AMONG ALL VARIABLES IN THE EQUATION
All correlations are statistically significant
at the p < .001 level
| | Gender (1 = female) | N college science courses | Educational LEVEL | Science Knowledge |
| Gender (1 = female) | 1.00 | | | |
| Number college science courses | | 1.00 | 0.59 | 0.44 |
| Educational LEVEL | | 0.59 | 1.00 | 0.42 |
| Science Knowledge | | 0.44 | 0.42 | 1.00 |
Notice how the matrix is symmetric, that is, the top right hand side is the same as the lower left hand side. Because of this, many people will just present the lower left side of the matrix as you see below. There are "1"s on the diagonal of the matrix because the correlation of a variable with itself equals 1.
ZERO ORDER (bivariate) CORRELATIONS
AMONG ALL VARIABLES IN THE EQUATION
All correlations are statistically significant
at the p < .001 level
| | Gender (1 = female) | N college science courses | Educational LEVEL | Science Knowledge |
| Gender (1 = female) | 1.00 | | | |
| Number college science courses | | 1.00 | | |
| Educational LEVEL | | 0.59 | 1.00 | |
| Science Knowledge | | 0.44 | 0.42 | 1.00 |
From the correlation matrix, we can see that gender has weak negative correlations with the number of science courses and educational level, and gender is moderately and negatively correlated with science knowledge score.
The number of college science classes is strongly and positively related to educational level (r = 0.59), and has a moderate positive relationship (r = 0.44) with science knowledge score.
Educational level is moderately and positively
related (r = 0.42) to science knowledge score.
| Variable | B | Standard Error of B | Beta | t | Significance (p) |
| Gender (1 = female) | -0.86 | .084 | -0.20 | -10.222 | .0000 |
| Number college science courses | 0.17 | .016 | +0.26 | 10.816 | .0000 |
| Educational LEVEL | 0.61 | .063 | +0.23 | 9.618 | .0000 |
| Constant | 5.60 | .150 | | 37.368 | .0000 |
R = 0.520 R2 = 0.270
n = 1882
Standard error of the estimate = 1.80
Standard deviation of science knowledge
score = 2.11
|
Given that each B is statistically significant, i.e., not zero, let's see what each one means IN WORDS.
The B for Gender was -0.86. Given that female = 1 and male = 0, this means that controlling educational level and the number of science courses, women averaged 0.86 fewer right answers than men (the B is negative). However, notice that this difference is a little more than one-quarter smaller (about 28 percent) than the bivariate sex difference of 1.2 answers. Thus, at least some of the sex difference occurred because men take slightly more college science classes than women do.
The B for the number of college science classes was 0.17. This means that for each additional college science class the person takes, he or she scores about 0.17 points higher on the science knowledge questions (controlling gender and degree level).
The B for educational level was 0.61. For each jump in degree level, the person averages 0.61 right answers more. That may not seem like very much, but if the jump is from less than high school to an advanced degree, the person with an advanced degree on the average gets nearly two more answers right out of 10 (3 X 0.61 = 1.83) than someone who never completed high school at all (controlling gender and the number of college science classes).
Finally, the constant term is 5.60. If someone was male, never had any education, and never had a college science course, we would expect their score on the average to be 5.60 out of 10.
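Putting the pieces together, the prediction equation can be written as a small Python function. The coefficients are the ones reported above. The function name is invented, and the coding of educational level as 0 (less than high school) through 3 (advanced degree) is an assumption made for this sketch, since the Guide does not state the codes.

```python
def predict_science_score(female, n_science_courses, education_level):
    """Predicted number correct (out of 10) from the reported Bs.

    female: 1 = female, 0 = male
    education_level: ASSUMED coding, 0 = less than high school
                     through 3 = advanced degree
    """
    return (5.60
            - 0.86 * female
            + 0.17 * n_science_courses
            + 0.61 * education_level)

# A man at the lowest educational level with no college science courses
baseline = predict_science_score(0, 0, 0)
# A woman with the same characteristics
woman = predict_science_score(1, 0, 0)
# A man with an advanced degree and no college science courses
advanced = predict_science_score(0, 0, 3)

print(round(baseline, 2), round(woman, 2), round(advanced - baseline, 2))
```

The last number printed is the three-level jump discussed above (3 X 0.61).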
You could present the numeric results
in a simple chart like this one:
| Variable | B | t | p | Statistically Significant? |
| Gender (1 = female) | -0.86 | -10.222 | < .001 | Yes |
| Number college science courses | 0.17 | 10.816 | < .001 | Yes |
| Educational LEVEL | 0.61 | 9.618 | < .001 | Yes |
| Constant | 5.60 | 37.368 | < .001 | Yes |
| Variable | Beta Weight | Direction | Strength |
| Number college science courses | +0.26 | Positive | Moderate |
| Educational LEVEL | +0.23 | Positive | Weak |
| Gender (1 = female) | -0.20 | Negative | Weak |
The constant disappears because it is zero in a standardized regression equation.
The Standard Error of the Regression (that is, the average deviation around the regression line for science knowledge score) was 1.80
The actual standard deviation of science knowledge score was 2.11
Square each of these numbers. The square of the Standard Error of the Regression is also sometimes called the Mean Square Error or the Mean Square Residual.
If you divide the mean square residual (1.80 squared, or 3.24) by the variance of the knowledge scores (2.11 squared, or about 4.45), that ratio is 0.73. That is, the variance around the regression line is about 27 percent smaller than the variance around the knowledge score mean. This is reflected in the R2 of 0.27. Fortunately, since the computer calculates R2 for you, you don't have to do the math, but it is helpful to know where R2 came from.
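The arithmetic is easy to verify. This short Python sketch uses only the two numbers reported above; the variable names are made up for illustration. Because the two quantities use slightly different degrees of freedom, the result is an approximation to the printed R2, not an exact match.

```python
se_estimate = 1.80   # Standard Error of the Regression, from this Guide
sd_y = 2.11          # standard deviation of science knowledge score

mean_square_residual = se_estimate ** 2   # variance around the regression line
variance_y = sd_y ** 2                    # variance around the mean

ratio = mean_square_residual / variance_y
approx_r2 = 1 - ratio
print(round(ratio, 2), round(approx_r2, 2))
```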
|
|
Use the metric B regression coefficients when:
1. you want to make a definite prediction (e.g., the dollars of a person's salary or someone's GRE score) or
2. when you want to compare two groups (e.g., predicting the salaries for men and women in two separate regression equations).
Use the standardized regression coefficients (the Beta Weights) when:
1. you want to assess how relatively important each independent variable is WITHIN THE SAME EQUATION.
2. you want an approximate indication of how strongly each independent variable influences the dependent variable.
(Note Agresti and Finlay's cautions on use of the Beta Weight. You are probably OK in samples of several hundred cases where the Bs have small standard errors.)
|
|
NOTE: This section is also repeated in Assignment 5.
FIRST
examine
your univariate and bivariate statistics: the means, standard deviations,
and the correlation coefficients.
Note any unusually large or small correlations.
MAKE SURE YOU KNOW WHAT THE METRIC IS OF YOUR DEPENDENT VARIABLE (pounds of weight? number of household computers? number of library books?)! You will use this metric for the Bs.
SECOND see if the overall R2 is significant. Use the Global F-Test results and look at the "P" for probability level.
The null hypothesis, Ho : R2 = 0
The alternative hypothesis is HA: R2 > 0.
Because R2 is a squared measure, it cannot be a negative number.
If the significance level for the F test
is small (p < .05), then the R2 is REAL (non-zero).
Usually this means at least one B is non-zero.
Go to step 3.
If the R2 is basically 0 (p > .05), any apparent influence of the predictors on the dependent variable is an ACCIDENT. STOP HERE! GO NO FURTHER!
THIRD
see if the STRENGTH of R2 is at least weak (.11 plus).
If yes, continue to step 4.
If R2 is smaller than .11,
your results are real but probably not practically important.
Interpret any Bs with extreme caution.
(NOTE: It's true that 10% explained variation MIGHT be a big deal, depending on the state of knowledge in your discipline of study. So interpret strength with your discipline in mind.)
FOURTH
NOW
examine each of the Bs.
Any B less than
twice its own standard error will usually have a significance level
greater
than .05.
This means any apparent influence of that
B is a sampling ACCIDENT and that B is really 0.
Use a marker to note the Bs with statistical
significance < .05.
These are REAL or nonzero.
Discuss how the statistically significant Bs raise or lower scores on the dependent variable, in the units of the dependent variable (pounds of weight in my example: for each 15 minute period a woman exercised, she would weigh 1 pound less).
FIFTH Look at the BETA weights of the SIGNIFICANT Bs. (Remember that the Bs that were not statistically significant are really 0 in the population and so are the corresponding Beta Weights.)
Rank the Beta Weights from most to least
important in terms of absolute value size.
Discuss the strength and direction
of each statistically significant beta weight.
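Steps two through five can be sketched as a short Python checklist. The Bs, standard errors, Beta Weights, and R2 below come from the science knowledge regression in this Guide. The function name is invented, and the probability level for the global F-test is a hypothetical placeholder because the Guide does not print it.

```python
def review_regression(r_squared, global_p, coefficients):
    """Apply steps 2 through 5 of the checklist to one set of results."""
    report = {"go_on": True, "notes": [], "significant": [], "ranked_by_beta": []}
    # SECOND: is the overall R2 statistically significant?
    if global_p >= 0.05:
        report["go_on"] = False
        report["notes"].append("R2 is not significant: stop here.")
        return report
    # THIRD: is R2 at least weak (.11 or more)?
    if r_squared < 0.11:
        report["notes"].append("R2 below .11: interpret Bs with extreme caution.")
    # FOURTH: rule of thumb, keep a B that is at least twice its standard error
    for name, c in coefficients.items():
        if abs(c["b"]) >= 2 * c["se"]:
            report["significant"].append(name)
    # FIFTH: rank the significant predictors by absolute Beta Weight
    report["ranked_by_beta"] = sorted(
        report["significant"],
        key=lambda name: abs(coefficients[name]["beta"]),
        reverse=True,
    )
    return report

coefficients = {
    "Gender (1 = female)": {"b": -0.86, "se": 0.084, "beta": -0.20},
    "Number college science courses": {"b": 0.17, "se": 0.016, "beta": 0.26},
    "Educational LEVEL": {"b": 0.61, "se": 0.063, "beta": 0.23},
}

# global_p is a HYPOTHETICAL placeholder for the F-test probability level
report = review_regression(r_squared=0.27, global_p=0.001, coefficients=coefficients)
print(report["ranked_by_beta"])
```

The "twice its standard error" check is only the large-sample rule of thumb; use the probability levels your program prints when you have them.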
|
|
Here are a few polysyllabic terms that designate potential problems.
1. Remember how the standard deviation on the science knowledge index grew smaller with each jump of educational degree level? There's a fancy name for this: heteroscedasticity, and it violates one important regression assumption:
At each level of the independent variable, the e's or residual terms are supposed to resemble a normal distribution, and the variances on the residuals should be the same no matter where in the distribution of independent variable scores that you look. In other words, the spread of scores and the standard deviation of scores on the dependent variable should be about the same, no matter which category you examine of the independent variable. For example, you might expect the standard deviation on weight for height = 5 feet 3 inches (e.g., 6 pounds) to be identical to the standard deviation on weight for height = 5 feet 6 inches (e.g., also 6 pounds).
The standard deviation of the weight scores should be about the same, whether you look at women who are five feet two inches tall or women who are five feet ten inches tall.
The standard deviation of science knowledge scores should be about the same, whether you look at people with less than a high school degree versus people with an advanced college degree.
Similar variances on the dependent variable across different values of the independent variable is called homoscedasticity (for "the same").
If you violate homoscedasticity, you will probably not have minimum variances for your regression estimates. Your estimates will not be the most efficient (have the smallest possible variance). The estimates of the standard errors of the B terms that you receive from the computer programs (which assume homoscedasticity) could be both incorrect and misleadingly low--that means that you can think some independent variables have a statistically significant effect on the dependent variable when they really don't.
One way to try to work with heteroscedasticity is called "weighted least squares" and higher level statistics courses will address how to do this.
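One quick, informal way to look for unequal variances is to compare the variances across categories of the independent variable. The Python sketch below uses the standard deviations from the educational level table earlier in this Guide. Comparing the largest variance to the smallest is just an eyeball check chosen for this illustration, not a formal test, and no particular cutoff is implied.

```python
# Standard deviations of science knowledge score by educational level,
# taken from the table earlier in this Guide
sd_by_level = {
    "Less Than High School": 2.09,
    "High School Degree": 1.97,
    "College Degree": 1.78,
    "Advanced Degree": 1.56,
}

# Square each standard deviation to get the variance in each category
variance_by_level = {level: sd ** 2 for level, sd in sd_by_level.items()}

largest = max(variance_by_level.values())
smallest = min(variance_by_level.values())
variance_ratio = largest / smallest
print(round(variance_ratio, 2))
```

Under perfect homoscedasticity the ratio would be 1; here the lowest educational group has noticeably more spread than the highest.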
2. Multicollinearity. WHAT? Come again?
Multicollinearity refers to highly intercorrelated INDEPENDENT variables in the multiple regression equation. You can "eyeball" these in the zero order correlation matrix, although more formal tests are available. By highly correlated, some analysts say any correlation that has an absolute value of 0.50 or higher indicates multicollinearity. Most say an absolute value of 0.70 or higher (that corresponds to an R2 of about 50%) designates a problem.
Why is multicollinearity an issue?
When independent variables are very highly intercorrelated, it is difficult to disentangle the unique effects of each independent variable on the dependent variable.
Very large Beta Weights (over an absolute value of 1) may be a diagnostic indication of multicollinearity.
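"Eyeballing" the correlation matrix can be sketched as a simple screen in Python. The one correlation used here (number of college science courses with educational level) is from this Guide; the function name is invented, and the two cutoffs are the ones mentioned above.

```python
def flag_collinear_pairs(correlations, cutoff):
    """Return the predictor pairs whose |r| is at or above the cutoff."""
    return [pair for pair, r in correlations.items() if abs(r) >= cutoff]

# Correlation between two of the predictors, from this Guide
predictor_correlations = {
    ("Number college science courses", "Educational LEVEL"): 0.59,
}

flagged_at_50 = flag_collinear_pairs(predictor_correlations, cutoff=0.50)
flagged_at_70 = flag_collinear_pairs(predictor_correlations, cutoff=0.70)

# Shared variance between the two predictors is r squared
shared_variance = 0.59 ** 2
print(flagged_at_50, flagged_at_70, round(shared_variance, 2))
```

So this pair would concern the stricter analysts (0.50 cutoff) but not those who use 0.70.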
3. NEVER NEVER NEVER omit an important predictor variable THAT IS CORRELATED WITH OTHER INDEPENDENT VARIABLES from the regression equation.
Yes, your remaining B coefficients probably will become larger. Some of the Bs that weren't statistically significant before may become so now. But you have just introduced a different and more serious problem.
NEVER NEVER NEVER try to solve multicollinearity by "throwing out" one (or more) of the intercorrelated independent variables.
The result is B coefficients with systematic biases, i.e., systematic departures from the true population values. That is because the covariance that originally was shared between independent variables now all goes to the independent variables that are left in the equation, artificially and invalidly raising their values and producing invalid, inflated estimates of how those independent variables influence the dependent variable.
This is the most common mistake I have seen novice analysts make who use multiple regression.
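A small simulation can show the bias. In this Python sketch everything is hypothetical: two predictors that correlate about 0.6, each with a true slope of 1.0, and random error. The function names and the seed are arbitrary choices for the illustration. Fitting both predictors should recover a slope near 1.0 for X1; dropping X2 should push the X1 slope well above its true value.

```python
import random

def slope_one_predictor(x, y):
    """Least squares slope of y on a single predictor."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes_two_predictors(x1, x2, y):
    """Least squares slopes (b1, b2) of y on two predictors."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

random.seed(2004)   # arbitrary seed so the run is repeatable
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
# x2 shares variance with x1 (population correlation of 0.6)
x2 = [0.6 * a + 0.8 * random.gauss(0, 1) for a in x1]
# True model: both slopes equal 1.0
y = [1.0 * a + 1.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

b1_full, b2_full = slopes_two_predictors(x1, x2, y)
b1_short = slope_one_predictor(x1, y)   # x2 wrongly thrown out
print(round(b1_full, 2), round(b1_short, 2))
```

The inflated slope in the short model is the shared covariance "going to" the predictor that was left in the equation.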
4. Issues with low R2s
Well, you did your best, but even the biggest optimist would call your entire regression equation results WEAK.
Why?
Low R2s can occur for several reasons:
Susan Carol Losh, July 20, 2004