OVERVIEW

ASSIGNMENT 5 DUE JULY 28



 
EDF 5400 INTRODUCTORY STATISTICS
SUMMER 2004

DR SUSAN CAROL LOSH


 
ASSIGNMENT OR EXAM    DUE DATE
Assignment 5          July 28 (Wednesday)
Exam 3                August 4 (Wednesday)

 

KEY TO: Agresti and Finlay, Chapter 9, pp 301-342; Chapter 11, pp 382-404 and pp 411-421

I treat bivariate and multiple regression as a comprehensive unit because I believe it is easier to learn this way. Therefore, it is a good idea to read through Guides 7 and 8 which translate much of this material, then go back and read both selections of Agresti and Finlay's Chapters 9 and 11.
 


 
GUIDE 8: MORE TECHNICAL ASPECTS OF REGRESSION
ASSIGNMENT FIVE SPECIFICATIONS
FEEDBACK ASSIGNMENT 4
GENERAL FEEDBACK EXAM 2

 
THE REGRESSION LINE
R & THE PERCENT VARIANCE EXPLAINED
REAL OR ACCIDENT
WORKING AN EXAMPLE
ELEMENTARY CAUTIONS

This Guide examines some more technical aspects of regression analysis. But in the thicket of formulas and terminology, keep the purpose ahead of you:
 

 
Regression allows you to assess the NET effect of each independent variable on ONE numeric dependent variable, while statistically controlling for the influence of all the other independent variables in your regression equation. All your independent variables or predictors are entered in JUST ONE equation. The results are more straightforward and easier to interpret than using multiple crosstabulation tables. Further, there are few limits on the number of independent variables you can include (although you may find practical limits). Regression is appropriate to use when you suspect that the independent variables have approximately linear effects on the dependent variable (or if nominal or ordinal independent variables have been recoded as 0-1 dummy variables).

Regression summarizes all this information for you with JUST ONE linear B coefficient per independent variable, as well as a summary correlation coefficient (R2) that describes how well the TOTAL regression equation describes or predicts the dependent variable. Mathematically, regression adjusts the net predictive B coefficient for each independent variable taking into account all other predictors in the equation. 

"Beginner's Rules" regression will not work for you if relationships between an independent variable and a dependent variable are nonlinear. You can't use a dummy dependent variable either. There are other, more technical assumptions that I will address in this Guide.
 


 


HOW IS THE REGRESSION LINE DERIVED?

 
REVIEW!

A regression equation creates a line for "simple regression" in 2-dimensional space, and, for multiple regression, a geometric plane in at least 3-dimensional space.

The "a" or α is the intercept term or constant, where the predictive regression line crosses the Y-axis. It is the Y score we would expect if the X score were "0".

Example: according to my height and weight formula in Guide 7, the predicted weight score for a woman exactly five feet tall would be 100 pounds.

The "b" or β term is a slope. It tells us how many units to go up or down on the dependent variable for a one unit change in the independent variable. For example, in the formula that I borrowed from insurance companies, a one inch increase in height is predicted to cause a 5 pound increase in weight.

The b terms ARE ALWAYS IN THE UNITS OF THE DEPENDENT VARIABLE. In my Guide 7 "weighty example", the bs will always come out as pounds.  We call this predictive slope the metric or "unstandardized" regression coefficient because it is in the "metric" or the units of the dependent variable.

If we had the entire population, or close to it, we would know the "real" intercept term, α or β0, and we would also know the "real" slope or slopes terms, β or βyk. My weight example, in large part, was derived from formulas devised by insurance companies, who measured and weighed hundreds of thousands of women, to create these estimates.

But, nearly all of the time, we are working with samples, and therefore we must ESTIMATE the intercept and slopes terms from sample data.

Below, let's review the pieces of the regression equations, first for "simple" or bivariate regression, where you have ONE independent variable and one dependent variable, then for "multiple" or multivariate regression, where you have AT LEAST TWO independent variables and one dependent variable.
 


SIMPLE REGRESSION GENERIC EQUATIONS (OBSERVED)

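POPULATION GENERIC EQUATION

$$Y = \alpha + \beta X + e$$

SAMPLE GENERIC EQUATION

$$y = a + bx + e$$

Reading each equation from left to right, the pieces are: the dependent variable, the Y-axis intercept, the slope term, and the residual term.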

When we include the "e" term, we have the actual equation for each person. If we omit the "e" term, we have an estimated prediction equation.

MULTIPLE REGRESSION GENERIC EQUATIONS (OBSERVED)
where  K is the NUMBER OF INDEPENDENT VARIABLES
If you have K independent variables, you will have K regression coefficients

CONVENTION: the dependent variable comes FIRST in the subscript to each term.

POPULATION GENERIC EQUATION

$$Y = \beta_0 + \beta_{y1}X_1 + \beta_{y2}X_2 + \beta_{y3}X_3 + \dots + \beta_{yk}X_k + e$$

SAMPLE GENERIC EQUATION

$$y = b_0 + b_{y1}x_1 + b_{y2}x_2 + b_{y3}x_3 + \dots + b_{yk}x_k + e$$

How do we obtain the intercept and slope terms?
The intercept and slope terms are estimated with a specific goal in mind:

The intercept and slope terms  create what is called "the best-fitting straight line" (or plane) that describes the data.
What does best fitting mean? IT IS VERY SPECIFIC!

The intercept and slope terms are chosen so as to minimize the overall (squared) deviation of the observed scores from the regression line or plane. This means the observed scores for the cases are (on the average) as close to the line or plane as possible. We want our line such that the vertical deviations from it are as small as they can possibly be, given the data.

This means that we MINIMIZE THE  (RESIDUAL) SUM OF SQUARED ERROR or the sum of the squared "e" (residual) terms.
 

Recall the following:
call the OBSERVED SCORE simply y
call the PREDICTED (ESTIMATED) SCORE y' or ŷ

We then call the difference between the observed and the estimated (predicted) score the residual:

$$e_i = y_i - \hat{y}_i$$

 
 
TECHNICAL NOTE: if we simply added the positive and negative deviations of the observed scores from the predicted scores, then the positive deviations from those with higher than expected dependent variable scores (for example, overweight people) would cancel out the negative deviations from those with lower than expected dependent variable scores (say, very skinny people), and the sum (and the average) would equal zero.

So, before we add the deviation scores, we square them first, resulting in the equation below:

$$\sum (y_i - \hat{y}_i)^2 = \sum e_i^2$$

 

Each "e" term is the difference between the observed and predicted score on the dependent variable. For example, if the weight formula predicted that a woman would weigh 120 pounds and she actually weighed 130, her residual or deviation "e" score would be +10 (pounds).

To get the unexplained sum of squares or sum of squared errors, just take each "e" or residual score, square the e score, then add up all the squared residual terms and you have the formula in the box above.

The goal is to MINIMIZE ERROR so as to have the very best prediction possible. This means the smallest unexplained sum of squares possible.

So the intercept and slope terms are chosen to minimize the sum of the squared errors, that is, the sum of the squared e terms.

IF YOU HAVE HAD CALCULUS

We obtain the intercept and slope terms through partial differentiation. The slope terms are actually partial derivatives. You will find a mathematical explanation in most introductory calculus books (typically second semester calculus books) under "least squares analysis," so-called because you are minimizing the sum of squared errors.

This type of regression we are examining is called "Ordinary Least Squares" or OLS Regression for short.
 
 
IF YOU HAVE HAD MATRIX OR LINEAR ALGEBRA

We can use systems of simultaneous equations to solve for the intercept and slope terms. One comparatively easy way is to make each row in the matrix an equation for the dependent variable and make the entries in the matrix the bivariate correlations among the variables (the math of the actual entries is reserved for your higher level statistics courses). We solve for the standardized beta weights (not the metric bs) and at the end of the process multiply the beta weights back by standard deviation ratios to turn them into metric regression coefficients. A variation on this process is what most computer programs do to derive the intercept and slope terms.
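As a rough sketch of the simultaneous-equations idea, the snippet below solves for the standardized beta weights using the two-decimal correlations reported in the science knowledge example later in this Guide. The helper `solve_linear_system` and the variable names are illustrative assumptions, not the routine any particular statistics package uses, and the final step of converting betas back to metric bs is omitted because the predictors' standard deviations are not reported here.

```python
# Correlations among the three predictors, in this order:
# gender (1 = female), number of college science courses, educational level.
R_XX = [
    [1.00, -0.16, -0.14],
    [-0.16, 1.00, 0.59],
    [-0.14, 0.59, 1.00],
]
# Correlation of each predictor with the science knowledge score.
R_XY = [-0.28, 0.44, 0.42]


def solve_linear_system(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap the row with the largest entry in this column into the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every row below the pivot.
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    # Back-substitute from the last row upward.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][j] * solution[j] for j in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


betas = solve_linear_system(R_XX, R_XY)
# R-squared equals the sum of each beta weight times its zero-order correlation.
r_squared = sum(beta * r for beta, r in zip(betas, R_XY))

for name, beta in zip(["gender", "science courses", "educational level"], betas):
    print(f"{name}: beta = {beta:+.3f}")
print(f"R-squared = {r_squared:.3f}")
```

The results should land close to the beta weights and R2 reported in the regression example (-0.20, +0.26, +0.23 and 0.27); small differences are expected because the correlations entered here are rounded to two decimals.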
 
 
IF YOU HAVE HAD NEITHER ONE

Starting with simple regression, where you just have one independent and one dependent variable:

(More technically, B is the covariance of X and Y divided by the variance of X.)

The NUMERATOR for the b term is identical to the numerator for Pearson's r: it is the sum of the cross-products of the deviations of the independent variable and the dependent variable from their own means (the covariance before you divide by the number of cases).

The DENOMINATOR for the b term is simply the sum of the squared deviations of the independent variable from its own mean. It is as though you began to calculate the standard deviation of the independent variable, but you stopped partway through, before dividing or taking the square root.

By using the independent variable sum of squares in the denominator of B, you are setting the metric for the metric regression coefficient, i.e., for a one unit change in "x" (e.g., one inch of height) you have a specified change in "y" (e.g., 5 pounds).

We obtain the b or metric regression slopes FIRST, then go back and solve for the a or intercept term.
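The verbal recipe above corresponds to $b = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and then $a = \bar{y} - b\bar{x}$. Here is a minimal sketch of that calculation. The five height and weight values are made up purely for illustration (they are not from the Guide 7 example or any real data set), and the variable names are assumptions.

```python
# Hypothetical data: inches above five feet (x) and weight in pounds (y).
heights_over_60 = [0, 2, 4, 6, 8]
weights = [102, 108, 121, 129, 140]

n = len(heights_over_60)
mean_x = sum(heights_over_60) / n
mean_y = sum(weights) / n

# Numerator: sum of cross-products of the deviations from each mean.
sum_cross_products = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(heights_over_60, weights)
)
# Denominator: sum of squared deviations of x from its own mean.
sum_squares_x = sum((x - mean_x) ** 2 for x in heights_over_60)

b = sum_cross_products / sum_squares_x  # slope, in pounds per inch
a = mean_y - b * mean_x                 # intercept, solved AFTER the slope

# Residuals: observed score minus predicted score for each case.
residuals = [y - (a + b * x) for x, y in zip(heights_over_60, weights)]
sum_squared_error = sum(e ** 2 for e in residuals)

print(f"b = {b:.2f} pounds per inch, a = {a:.2f} pounds")
print(f"sum of residuals = {sum(residuals):.6f}")
print(f"sum of squared residuals = {sum_squared_error:.2f}")
```

Notice that the raw residuals add up to (essentially) zero, which is why the residuals are squared before they are summed.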
 

MULTIPLE REGRESSION Bs 

Well, I did warn you that this was the technical guide. Below is ONE way to derive the B coefficients in multiple regression when you have only TWO independent variables:
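One standard textbook way to write the two-predictor case, offered here as a common form rather than necessarily the exact expression used in this course, builds each slope from the zero-order correlations and the standard deviations. The symbols $r_{y1}$, $r_{y2}$ and $r_{12}$ (the three bivariate correlations) and $s_y$, $s_1$ and $s_2$ (the standard deviations of y, x1 and x2) are notation assumed for this sketch:

```latex
b_{y1} = \frac{s_y}{s_1} \cdot \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^{2}},
\qquad
b_{y2} = \frac{s_y}{s_2} \cdot \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^{2}}
```

The second fraction in each expression is the standardized beta weight; multiplying by the ratio of standard deviations puts the coefficient back into the metric of the dependent variable.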

And the formula for b0 or the intercept term in multiple regression becomes:

$$b_0 = \bar{y} - b_{y1}\bar{x}_1 - b_{y2}\bar{x}_2 - b_{y3}\bar{x}_3 - \dots - b_{yk}\bar{x}_k$$

 


R REVISITED

R2 tells us how close the data are to the regression line or regression plane that you draw with the regression equation. When the data are very close to the predicted line, R is close to 1 and prediction is excellent. Use the strength chart in Guide 5 to evaluate R2 from nothing through very weak to very strong  to "perfect" (R2 is always positive).

R2 is THE PRE (proportional reduction in error) measure. It tells us how precisely we can predict the dependent variable from knowing the scores on at least one independent variable.

Recall that in Guide 7, I described Pearson's r as:

The square of r is r2. When r2 = 1, we say that we have "explained all the variation in the dependent variable."

What does it mean to say that R2 or r2 is a measure of how well we have explained the variance in the dependent variable?

In order to answer that question, we must study three entities and the relationships among them.

The first entity is the TOTAL SUM OF SQUARES.
Sometimes this is called the "total variation in y" (y being the dependent variable.)

In the total sum of squares, you begin taking a standard deviation on the dependent variable but you stop at the third step:

1) you subtract the mean of the dependent variable from each dependent variable score

2) you square this difference or deviation

3) you sum or add up all the squared deviations from the mean score.
 

 
THE FORMULA BELOW IS FOR THE TOTAL SUM OF SQUARES (TSS)

$$\sum (y_i - \bar{y})^2 = TSS$$

The second entity is the SUM OF SQUARED ERROR (SSE).
Sometimes this is called the "unexplained sum of squares" or the "residual sum of squares."

It is the sum of the squared deviations between the actual observed dependent variable score and the dependent variable score predicted by the regression equation.

Although you did your best to find good independent variables, the unexplained (residual) sum of squared error (SSE) is what you COULDN'T predict. It might be random measurement error, it might be all the other predictors that you DIDN'T include in the regression equation -- we don't know exactly what this prediction error is, although we make the statistical assumption that e has a normal distribution and that it is randomly distributed across values of the independent variable.

Third, there is THE REGRESSION (or "EXPLAINED") SUM OF SQUARES.
Agresti and Finlay sometimes call it the "model sum of squares."

Why, explained? Because it is how much better you do predicting the dependent variable knowing scores on the independent variable than if you had no information about the independent variables at all.

So, the Explained or Regression Sum of Squares is:

1) The deviation between the PREDICTED dependent variable score and the MEAN OF THE DEPENDENT VARIABLE.
(The mean would have been your best guess if you didn't know the independent variable scores.)

2) Square that deviation (so the positive and negative deviations don't cancel each other out) then

3) Sum up all the squared deviation scores between the dependent variable predicted score and the dependent variable mean.


The Regression or "Explained" Sum of Squares examines the difference between the regression predicted score and the mean score (squared).

Thus, it tells us how much more accurate our prediction is using the regression equation than if we only had the mean score on the dependent variable for the entire group.

NOW, we can look at R-square as the ratio of the "explained" sum of squares to the total original variation in the dependent variable. Below are several different formulae for the same entity!
 
 

$$R^2 = \frac{TSS - SSE}{TSS} = \frac{ESS}{TSS} = 1 - \frac{SSE}{TSS} = \frac{E_1 - E_2}{E_1} = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}$$

All these formulas are just different ways of saying the same thing.

Because one way to calculate R2 is the ratio of the explained sum of squares to the total sum of squares (the second formula from the left), the outcome is the proportion of explained variation in the dependent variable.


 
 
Multiply the proportion of explained variation in the dependent variable by 100 (that is, multiply R2 by 100) and the result is "the percent of the variance explained in the dependent variable."
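To see the three sums of squares fit together, here is a small sketch that computes TSS, SSE and ESS for a simple regression and checks that the different R2 formulas agree. The five data points are invented for illustration only, and all variable names are assumptions.

```python
# Hypothetical data: one independent variable (x) and one dependent variable (y).
x_scores = [0, 2, 4, 6, 8]
y_scores = [102, 108, 121, 129, 140]

n = len(x_scores)
mean_x = sum(x_scores) / n
mean_y = sum(y_scores) / n

# Least squares slope and intercept for simple regression.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_scores, y_scores)) / sum(
    (x - mean_x) ** 2 for x in x_scores
)
a = mean_y - b * mean_x
predicted = [a + b * x for x in x_scores]

# Total: observed score versus the mean of y.
tss = sum((y - mean_y) ** 2 for y in y_scores)
# Unexplained (residual): observed score versus predicted score.
sse = sum((y - p) ** 2 for y, p in zip(y_scores, predicted))
# Explained (regression): predicted score versus the mean of y.
ess = sum((p - mean_y) ** 2 for p in predicted)

r2_from_ess = ess / tss
r2_from_sse = 1 - sse / tss
r2_from_difference = (tss - sse) / tss

print(f"TSS = {tss:.1f}, SSE = {sse:.1f}, ESS = {ess:.1f}")
print(f"R-squared = {r2_from_ess:.4f}")
print(f"Percent of variance explained = {100 * r2_from_ess:.1f}")
```

Because TSS = ESS + SSE, all three versions of R2 return the same number. The toy data were chosen to fall almost on a straight line, so the R2 here is far higher than you will usually see with real survey data.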

 


  REAL OR ACCIDENT?
TESTING FOR STATISTICAL SIGNIFICANCE IN REGRESSION

When you work with multiple regression, you will have two sets of tests for statistical significance.

1. The first is for the overall regression, whether all the independent variables PUT TOGETHER have a TOTAL non-zero influence on the dependent variable.

This is also the test for R2, the multiple correlation coefficient.

The null hypothesis, Ho : R2 = 0

The alternative hypothesis is  HA:  R2 > 0.

[Review null hypotheses in Guide 4 if you need to.]

Because R2 is a squared measure, it cannot be a negative number.
 

 
We test the significance of R2 with the F-TEST.

THIS PARTICULAR F-TEST HAS TWO SEPARATE SETS OF DEGREES OF FREEDOM.
Since you are looking at a ratio of explained variance to the total variance, you have two separate sets of df.

The FIRST degrees of freedom (df1) are associated with the EXPLAINED variance.
This is "k", the number of independent variables you have in the regression equation. If you have 3 independent variables, you use the column headed with "3".

The SECOND degrees of freedom (df2) are associated with the RESIDUAL variation.
The degrees of freedom here = n - k - 1

Start with the casebase. Subtract the number of INDEPENDENT variables you have. Then subtract 1 (you had to calculate the mean on the dependent variable.)
Thus, if I have 3 independent variables and 34 cases, df2 will be  34 - 3 - 1 = 30. 

If the F-test for my sample IN THIS EXAMPLE is 2.92 OR LARGER (for 3 independent variables and 34 cases), I conclude my results would have happened by chance in 5 or fewer samples in 100 if R2 were really 0 in the population. If so, I reject the null hypothesis that R2 is really 0 in the population and decide that R2 is really nonzero.

The F distribution looks different for each combination of the number of independent variables and the case base. Like Chi-square, there are tables where you can look up the "critical values" of F. An F-ratio larger than the critical value for your chosen alpha level will be statistically significant at that probability level (most tables show only p < .05 and p < .01).

However, as you know, the computer will typically give you the exact probability level out to several decimal places. If the significance level (P) for the F test is small (p < .05), then the R2 is REAL (non-zero). Usually this means at least one B is non-zero.
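If your printout gives R2 but you want to see where the F-ratio comes from, one standard way to write it is the explained variation per degree of freedom divided by the unexplained variation per degree of freedom. The sketch below applies that formula to the R2, case base and number of predictors from the science knowledge example later in this Guide; the function name `f_ratio` is an assumption made for illustration.

```python
def f_ratio(r_squared, n_cases, k_predictors):
    """Return the overall F-ratio and its two degrees of freedom, given R-squared."""
    df1 = k_predictors                # explained (regression) degrees of freedom
    df2 = n_cases - k_predictors - 1  # residual degrees of freedom
    explained_per_df = r_squared / df1
    unexplained_per_df = (1.0 - r_squared) / df2
    return explained_per_df / unexplained_per_df, df1, df2


# Values from the science knowledge regression: R2 = 0.27, n = 1882, k = 3.
f_value, df1, df2 = f_ratio(0.27, 1882, 3)
print(f"F({df1}, {df2}) = {f_value:.2f}")

# The small example in the text: 3 independent variables and 34 cases.
_, small_df1, small_df2 = f_ratio(0.27, 34, 3)
print(f"Degrees of freedom with 34 cases: ({small_df1}, {small_df2})")
```

With 1882 cases the F-ratio works out to roughly 231.5, far beyond any tabled critical value, which is consistent with the very small probability level the computer reports for that regression.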

Notice I haven't said ANYTHING about strength! If you accept the alternative hypothesis that R2 > 0, check the strength of R2 with the strength chart in Guide 5.
 


2. In the second set of tests for statistical significance, you will separately test whether each B is zero or non-zero. After all, the R2 could be statistically significant, but that could happen because just one B out of many independent variables had statistically significant effects on the dependent variable, and the other Bs, within sampling error, were simply 0.

To say that a B is zero is to say IT HAS NO NET PREDICTIVE EFFECT ON THE DEPENDENT VARIABLE. This also means that the slope of that line is totally flat. As X rises or falls, the value of Y stays the same (e.g. controlling height, no matter how many cigarettes you smoke per day, your weight in pounds is the same).

It is entirely possible that an independent variable can have a strong effect on the dependent variable USING THE BIVARIATE CORRELATION, and NO NET EFFECT WITH THE REGRESSION B, once other variables are controlled.

In fact, you may have detected a spurious or intervening causal relationship to have such a thing happen (although it will take more than one regression model to begin to tease this out). When you have a set of independent variables that are highly intercorrelated with each other, you can also get such a set of findings.

For each B:

The null hypothesis, Ho : B= 0

The alternative hypothesis is  HA: | B| > 0

That is, the alternative is that the absolute value of B is greater than zero. Because Bs are directional, the B could turn out to be positive or negative.

We test for whether B is zero in the population with a t-test.
 
 

 
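$$t = \frac{B}{se_B}$$

The Standard Error (seB) of the slope term (with one independent variable) is:

$$se_B = \sqrt{\frac{\sum e_i^2 / (n - 2)}{\sum (x_i - \bar{x})^2}}$$

Here n - 2 is the residual degrees of freedom, n - k - 1, with k = 1 independent variable.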

However, again, most computer programs will either compute the t-test for the B for you, or will give you the Standard Error of the B so you can use a calculator to  find it.

If your sample is over 120 people (and you only have a couple of independent variables), a B that is about twice its standard error typically is statistically significant at the .05 probability level.
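As a quick sketch of the calculator step, the snippet below divides each B by its standard error, using the rounded B and seB values from the science knowledge regression table later in this Guide. The dictionary layout and the function name `t_ratio` are assumptions for illustration.

```python
# (B, seB) pairs as printed in the regression table, rounded as published.
coefficients = {
    "Gender (1 = female)": (-0.86, 0.084),
    "Number college science courses": (0.17, 0.016),
    "Educational LEVEL": (0.61, 0.063),
}


def t_ratio(b, se_b):
    """t-test for a regression slope: the coefficient divided by its standard error."""
    return b / se_b


t_values = {name: t_ratio(b, se_b) for name, (b, se_b) in coefficients.items()}

for name, t in t_values.items():
    if abs(t) >= 2:
        note = "at least twice its standard error"
    else:
        note = "less than twice its standard error"
    print(f"{name}: t = {t:.2f} ({note})")
```

The ratios come out close to, but not exactly equal to, the t values printed in the table (-10.222, 10.816 and 9.618), because the published Bs and standard errors have been rounded.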

B can be either positive (more calories mean more weight) or negative (more exercise means less weight).
If you know the direction of B in advance, you can do a one-tailed test. (did you have to do a study to know that, all else equal, people who eat more weigh more?)

If you are doing SIMPLE REGRESSION with one independent variable, the sign of r is the same as the sign of the B.

If you are doing MULTIPLE REGRESSION, R is always the square root of R2 and it is always positive.
 


 A REGRESSION EXAMPLE

Remember that question from the Surveys of Public Attitudes Toward Science about whether the sun went around the earth or the earth went around the sun? Well, on six separate surveys, spanning the years 1988 to 1999, there were nine more questions on basic science. Here are a few:

(Answers: light waves, false, and false--bacteria only.)

I created a science knowledge score from the total 10 items. Each correct response was coded 1, thus the total correct runs from 0 (none correct) to 10 (all correct). The data in the tables below are from the 1999 survey only.

The average for all 1882 respondents was 6.9 correct out of 10 with a standard deviation of 2.11 points.

In my regression to explain science knowledge scores, I use THREE PREDICTOR VARIABLES: gender (as a dummy variable, coded 0-1), educational degree level, and the number of college science courses the person elected. Obviously I should control degree level, because the individuals who never went to college at all had no college science courses.

First, I examined how each independent variable separately related to science knowledge score.

The first is a difference of means test (a t-test) by gender (why a difference of means test? why not a table?)

SCIENCE KNOWLEDGE SCORES BY GENDER

GROUP      Mean Score (out of 10)   Standard Deviation
MALES      7.6                      1.93
FEMALES    6.4                      2.10
TOTAL      6.9                      2.11

t-test (1881) = 12.59 p < .001   Eta = 0.278

Notice that on the average, males score 1.2 points higher than females, and this difference was (a) highly statistically significant and (b) the relationship was of moderate strength.

(If you had trouble with either of these two findings, go back and review Guides 4 and 5.)



Next, I did a difference of means test (a one-way analysis of variance) on science knowledge scores by educational degree level.

SCIENCE KNOWLEDGE SCORES BY EDUCATIONAL LEVEL

GROUP                   Mean Score (out of 10)   Standard Deviation
Less Than High School   5.2                      2.09
High School Degree      6.6                      1.97
College Degree          8.0                      1.78
Advanced Degree         8.3                      1.56
TOTAL                   6.9                      2.11

F-test (3,1878) = 143.38 p < .001   Eta = 0.43

Again, the difference in science knowledge scores across educational levels is highly statistically significant. This relationship, too, is moderate. Scores increase at a constant rate (1.4 points) from the less than high school group to the college degree group, then basically plateau at the graduate level, so the relationship begins to approximate a straight line.



Notice how the standard deviation on our dependent variable, science knowledge scores, gets smaller and smaller with each successive educational level.

This is an example of HETEROSCEDASTICITY, or UNEQUAL VARIANCES ON THE DEPENDENT VARIABLE ACROSS CATEGORIES OF THE INDEPENDENT VARIABLE. Heteroscedasticity can be a problem for several reasons. Among them is that the regression Bs are often no longer as efficient (as precise) as they could be. In later statistics courses, you will learn how to correct for heteroscedasticity. We will also discuss heteroscedasticity briefly in the last section of this Guide.
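One rough descriptive way to size up how unequal the spreads are (a quick look, not a formal test) is to square each group's standard deviation and compare the largest variance with the smallest. The sketch below does this with the standard deviations from the table above; the variable names are assumptions.

```python
# Standard deviations of science knowledge scores by educational level (from the table).
group_sd = {
    "Less Than High School": 2.09,
    "High School Degree": 1.97,
    "College Degree": 1.78,
    "Advanced Degree": 1.56,
}

# Variance is the standard deviation squared.
variances = {group: sd ** 2 for group, sd in group_sd.items()}
variance_ratio = max(variances.values()) / min(variances.values())

for group, variance in variances.items():
    print(f"{group}: variance = {variance:.2f}")
print(f"Largest variance / smallest variance = {variance_ratio:.2f}")
```

Here the variance in the least educated group is a bit under twice the variance in the most educated group. How large a ratio is large enough to worry about is a judgment call that later courses address.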



Next, I ran the zero-order or bivariate correlations among all the variables that I was examining in this analysis. Gender is coded as a dummy variable with female = 1 and male = 0. These are Pearson's r correlations, the type of correlations that regression programs calculate for you.

These correlations are presented below in what is called A CORRELATION MATRIX.

ZERO ORDER (bivariate) CORRELATION MATRIX AMONG ALL VARIABLES IN THE EQUATION
All correlations are statistically significant at the p < .001 level

                                 Gender         N college         Educational   Science
                                 (1 = female)   science courses   LEVEL         Knowledge
Gender (1 = female)              1.00           -.16              -.14          -.28
Number college science courses   -.16           1.00              .59           .44
Educational LEVEL                -.14           .59               1.00          .42
Science Knowledge                -.28           .44               .42           1.00

Notice how the matrix is symmetric, that is, the top right hand side is the same as the lower left hand side. Because of this, many people will just present the lower left side of the matrix as you see below. There are "1"s on the diagonal of the matrix because the correlation of a variable with itself equals 1.

ZERO ORDER (bivariate) CORRELATIONS AMONG ALL VARIABLES IN THE EQUATION
All correlations are statistically significant at the p < .001 level

                                 Gender         N college         Educational   Science
                                 (1 = female)   science courses   LEVEL         Knowledge
Gender (1 = female)              1.00
Number college science courses   -.16           1.00
Educational LEVEL                -.14           .59               1.00
Science Knowledge                -.28           .44               .42           1.00

From the correlation matrix, we can see that gender has weak negative correlations with the number of science courses and educational level, and gender is moderately and negatively correlated with science knowledge score.

The number of college science classes is strongly and positively related to educational level (r = 0.59), and has a moderate positive relationship (r = 0.44) with science knowledge score.

Educational level is moderately and positively related (r = 0.42) to science knowledge score.



Now, here's the regression of gender, the number of science courses, and educational level on the science knowledge score.
 
INDEPENDENT VARIABLE             B       seB    BETA    t         P
Gender (1 = female)              -0.86   .084   -0.20   -10.222   .0000
Number college science courses   0.17    .016   +0.26   10.816    .0000
Educational LEVEL                0.61    .063   +0.23   9.618     .0000
Constant                         5.60    .150   --      37.368    .0000

R = 0.520 R2 = 0.270    n = 1882
Standard error of the estimate = 1.80
Standard deviation of science knowledge score = 2.11
 

 
The first thing we need to check is whether the ENTIRE REGRESSION is statistically significant. We check this with the significance level for the R2. The p is a row of zeros (.0000). This means that the odds of getting these sample results by chance if R2 were really zero would be LESS THAN ONE IN 10,000 SAMPLES. This is a very rare event, so we reject the null hypothesis that R2 is zero, and accept the alternative, i.e., that R2 is something greater than zero.

Notice that each B also shows a row of zeros under the probability level because the value was truncated at the 4th decimal place.

The R2 is 0.27. This is a MODERATE relationship. It also means that we can explain 27 percent of the variation in science knowledge scores over and above the mean knowledge score by knowing someone's gender, educational level, and their number of college science classes. 

Given that R2 is greater than zero, the next thing we do is check for the statistical significance of each of the separate Bs. The p for each one of the Bs is a row of zeros (.0000). This means that the odds of getting these sample results by chance if B were really zero would be LESS THAN ONE IN 10,000 SAMPLES IN THE CASE OF EACH OF OUR THREE INDEPENDENT VARIABLES. This is a very rare event, so we reject the null hypothesis that each B is zero, and accept the alternative, i.e., that THE ABSOLUTE VALUE OF THE B is something greater than zero.
 

Given that each B is statistically significant, i.e., not zero, let's see what each one means IN WORDS.

The B for Gender was -0.86. Given that female = 1 and male = 0, this means that controlling educational level and the number of science courses, women averaged 0.86 fewer right answers than men (the B is negative). However, notice that this difference is about one-third smaller than the bivariate sex difference of 1.2 answers. Thus, at least some of the sex difference occurred because men take slightly more college science classes than women do.

The B for the number of college science classes was 0.17. This means that for each additional college science class the person takes, he or she scores about 0.17 points higher on the science knowledge questions (controlling gender and degree level).

The B for educational level was 0.61. For each jump in degree level, the person averages 0.61 right answers more. That may not seem like very much, but if the jump is from less than high school to an advanced degree, the person with an advanced degree on the average gets nearly two more answers right out of 10 (3 X 0.61 = 1.83) than someone who never completed high school at all (controlling gender and the number of college science classes).

Finally, the constant term is 5.60. If someone was male, never had any education, and never had a college science course, we would expect their score on the average to be 5.60 out of 10.
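To make the prediction equation concrete, here is a minimal sketch that plugs values into the estimated equation using the Bs and the constant from the table. The numeric coding of educational level is an assumption for this sketch (consecutive integers, with 0 as the lowest category), as are the function and constant names.

```python
# Metric regression coefficients from the regression table.
CONSTANT = 5.60
B_FEMALE = -0.86     # gender dummy: 1 = female, 0 = male
B_COURSES = 0.17     # per college science course
B_EDUCATION = 0.61   # per step in educational level


def predicted_score(female, science_courses, education_level):
    """Predicted science knowledge score (out of 10) from the estimated equation.

    education_level is ASSUMED to be coded in consecutive whole numbers starting at 0.
    """
    return (
        CONSTANT
        + B_FEMALE * female
        + B_COURSES * science_courses
        + B_EDUCATION * education_level
    )


baseline = predicted_score(female=0, science_courses=0, education_level=0)
example = predicted_score(female=1, science_courses=2, education_level=2)
three_level_gap = predicted_score(0, 0, 3) - predicted_score(0, 0, 0)

print(f"All predictors at zero: {baseline:.2f}")
print(f"Woman, 2 science courses, education level 2: {example:.2f}")
print(f"Difference across three degree levels: {three_level_gap:.2f}")
```

When every predictor equals zero, the prediction is simply the constant, and a three-level jump in education reproduces the 3 X 0.61 = 1.83 points discussed above.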

You could present the numeric results in a simple chart like this one:
 

INDEPENDENT VARIABLE             B       t         P        Statistically Significant?
Gender (1 = female)              -0.86   -10.222   <.0001   YES
Number college science courses   0.17    10.816    <.0001   YES
Educational LEVEL                0.61    9.618     <.0001   YES
Constant                         5.60    37.368    <.0001   YES



Now, let's examine the Beta Weights, or the STANDARDIZED regression coefficients. Within this single equation we can directly compare them, again in a simple chart:
 
INDEPENDENT VARIABLE             BETA    Direction   Strength
Number college science courses   +0.26   Positive    Moderate
Educational LEVEL                +0.23   Positive    Weak
Gender (1 = female)              -0.20   Negative    Weak

The constant disappears because it is zero in a standardized regression equation.

All the beta weights in standard deviation units are similar (something you could not tell from the metric regression equation) in terms of their impact on science knowledge score.



Finally, here's one more piece to look at:

The Standard Error of the Regression (that is, the average deviation around the regression line for science knowledge score) was 1.80

The actual standard deviation of science knowledge score was 2.11

Square each of these numbers. The square of the Standard Error of the Regression is also sometimes called the Mean Square Error or the Mean Square Residual.

If you divide the mean square residual by the variance of the knowledge scores, that ratio is 0.73; that is, the variance around the regression line is on the average about 27 percent smaller than the variance around the knowledge score mean. This is reflected in the R2 of 0.27. Fortunately, since the computer calculates R2 for you, you don't have to do the math, but it is helpful to know where R2 came from.
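The arithmetic in this last step can be sketched in a few lines, using the two numbers reported under the regression table. The variable names are assumptions for illustration.

```python
standard_error_of_estimate = 1.80  # spread of scores around the regression line
sd_knowledge = 2.11                # spread of scores around the overall mean

# Square each one to move from standard deviations to variances.
mean_square_residual = standard_error_of_estimate ** 2
variance_knowledge = sd_knowledge ** 2

# Share of the original variance that is still unexplained.
unexplained_ratio = mean_square_residual / variance_knowledge
approx_r_squared = 1.0 - unexplained_ratio

print(f"Mean square residual = {mean_square_residual:.2f}")
print(f"Variance of knowledge scores = {variance_knowledge:.2f}")
print(f"Unexplained ratio = {unexplained_ratio:.2f}")
print(f"Approximate R-squared = {approx_r_squared:.2f}")
```

This shortcut is only approximate: the mean square residual divides by n - k - 1 while the variance divides by n - 1, so the result will not match the printed R2 to every decimal place, although with 1882 cases the two are nearly the same.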



REMEMBER

Use the metric B regression coefficients when:

1. you want to make a definite prediction (e.g., the dollars of a person's salary or someone's GRE score) or

2. you want to compare two groups (e.g., predicting the salaries for men and women in two separate regression equations).

Use the standardized regression coefficients (the Beta Weights) when:

1. you want to assess how relatively important each independent variable is WITHIN THE SAME EQUATION.

2. you want an approximate indication of how strongly each independent variable influences the dependent variable.

(Note Agresti and Finlay's cautions on the use of the Beta Weight. You are probably OK in samples of several hundred cases where the Bs have small standard errors.)
 


HERE'S SOME GUIDANCE TO HELP YOU EVALUATE YOUR MULTIPLE REGRESSION RESULTS

NOTE: This section is also repeated in Assignment 5.

FIRST examine your univariate and bivariate statistics: the means, standard deviations, and the correlation coefficients.
Note any unusually large or small correlations.
MAKE SURE YOU KNOW WHAT THE METRIC IS OF YOUR DEPENDENT VARIABLE (pounds of weight? number of household computers? number of library books?)! You will use this metric for the Bs.

SECOND see if the overall R2 is significant. Use the Global F-Test results and look at the "P" for probability level.

The null hypothesis, Ho : R2 = 0

The alternative hypothesis is  HA:  R2 > 0.

Because R2 is a squared measure, it cannot be a negative number.

If the significance level for the F test is small (p <  .05), then the R2 is REAL (non-zero).
Usually this means at least one B is non-zero.
Go to step 3.

If the R2 is basically 0 (p > .05), any apparent influence of the predictors on the dependent variable is an ACCIDENT. STOP HERE! GO NO FURTHER!

THIRD see if the STRENGTH of R2 is at least weak (.11 plus).
If yes, continue to step 4.
If R2 is smaller than .11, your results are real but probably not practically important.
Interpret any Bs with extreme caution.

(NOTE: It's true that 10% explained variation MIGHT be a big deal, depending on the state of knowledge in your discipline of study. So interpret strength with your discipline in mind.)

FOURTH NOW examine each of the Bs.
Any B less than twice its own standard error will usually have a significance level greater than .05.
This means any apparent influence of that B is a sampling ACCIDENT and that B is really 0.

Use a marker to note the Bs with statistical significance < .05.
These are REAL or nonzero.

Discuss how the statistically significant Bs raise or lower scores on the dependent variable, using its metric (pounds of weight in my example: for each additional 15 minute period a woman exercised, she would weigh 1 pound less).
CLICK HERE TO REVIEW THE WEIGHT EXAMPLE.
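The "twice its own standard error" rule of thumb in this step can be sketched in a few lines of Python. The variable names, Bs, and standard errors below are hypothetical, invented only for illustration. The exact significance level on your computer output always takes precedence over this quick check.

```python
def t_ratio(b, standard_error):
    """Ratio of a B to its own standard error."""
    return b / standard_error


# Hypothetical output: predictor name -> (B, standard error of B).
coefficients = {
    "height_inches": (3.5, 0.8),
    "exercise_15min_periods": (-1.0, 0.3),
    "age_years": (0.10, 0.09),
}

likely_real = []
for name, (b, se) in coefficients.items():
    t = t_ratio(b, se)
    # Rule of thumb: the absolute value of B should be at least twice its standard error.
    if abs(t) >= 2:
        likely_real.append(name)
    print(name, round(t, 2), "likely REAL" if abs(t) >= 2 else "likely ACCIDENT")
```

Note that the sign of B does not matter for this check. A large negative B (such as the hypothetical exercise coefficient) is just as "real" as a large positive one.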

FIFTH Look at the BETA weights of the SIGNIFICANT Bs. (Remember that the Bs that were not statistically significant are really 0 in the population and so are the corresponding Beta Weights.)

Rank the Beta Weights from most to least important in terms of absolute value size.
Discuss the strength and direction of each statistically significant beta weight.
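Here is a small Python sketch of the ranking in this step. It uses the standard conversion from a B to a Beta Weight (multiply B by the standard deviation of the predictor, then divide by the standard deviation of the dependent variable). All names and numbers are hypothetical. Your regression output normally prints the Beta Weights for you.

```python
def beta_weight(b, sd_predictor, sd_dependent):
    """Convert an unstandardized B into a standardized Beta Weight."""
    return b * sd_predictor / sd_dependent


# Hypothetical significant predictors: name -> (B, standard deviation of the predictor).
significant_predictors = {
    "exercise_15min_periods": (-1.0, 6.0),
    "height_inches": (3.5, 2.5),
}
sd_weight_pounds = 20.0  # hypothetical standard deviation of the dependent variable

betas = {
    name: beta_weight(b, sd_x, sd_weight_pounds)
    for name, (b, sd_x) in significant_predictors.items()
}

# Rank by absolute value: the sign gives direction, the size gives strength.
ranked = sorted(betas, key=lambda name: abs(betas[name]), reverse=True)
for name in ranked:
    print(name, round(betas[name], 2))
```

In this made-up example height has the larger Beta Weight in absolute value, so it would be discussed first, and the negative sign on exercise tells you its direction (more exercise, lower weight).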

Regression is a terrific technique IF you can meet the assumptions and IF your data don't present complications.

Here are a few polysyllabic terms that designate potential problems.

1. Remember how the standard deviation on the science knowledge index grew smaller with each jump of educational degree level? There's a fancy name for this: heteroscedasticity, and it violates one important regression assumption:

At each level of the independent variable, the e's or residual terms are supposed to resemble a normal distribution, and the variances on the residuals should be the same no matter where in the distribution of independent variable scores that you look.
In other words, the spread of scores and the standard deviation of scores on the dependent variable should be about the same, no matter which category you examine of the independent variable. For example, you might expect the standard deviation on weight for height = 5 feet 3 inches (e.g., 6 pounds) to be identical to the standard deviation on weight for height = 5 feet 6 inches (e.g., also 6 pounds).

The standard deviation of the weight scores should be about the same, whether you look at women who are five feet two inches tall or women who are five feet ten inches tall.

The standard deviation of science knowledge scores should be about the same, whether you look at people with less than a high school degree versus people with an advanced college degree.

Having similar variances on the dependent variable across different values of the independent variable is called homoscedasticity ("homo" means "the same").

If you violate homoscedasticity, you will probably not have minimum variances for your regression estimates. Your estimates will not be the most efficient (have the smallest possible variance). The estimates of the standard errors of the B terms that you receive from the computer programs (which assume homoscedasticity) could be both incorrect and misleadingly low--that means that you can think some independent variables have a statistically significant effect on the dependent variable when they really don't.

One way to try to work with heteroscedasticity is called "weighted least squares" and higher level statistics courses will address how to do this.
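A quick, informal way to look for unequal spread is to compare the standard deviation of the dependent variable within each category of the independent variable. The Python sketch below does this with made-up science knowledge scores by degree level. The scores and category labels are hypothetical, and this is an eyeball check, not a formal test.

```python
import statistics

# Hypothetical science knowledge scores, grouped by educational degree level.
scores_by_degree = {
    "less than high school": [2, 9, 4, 11, 6, 13],
    "high school": [5, 10, 7, 12, 8, 9],
    "college": [9, 11, 10, 12, 11, 10],
    "advanced degree": [11, 12, 12, 13, 12, 13],
}


def sd_by_group(groups):
    """Sample standard deviation of the scores within each group."""
    return {name: statistics.stdev(values) for name, values in groups.items()}


group_sds = sd_by_group(scores_by_degree)
for name, sd in group_sds.items():
    print(name, round(sd, 2))

# A ratio near 1 suggests similar spread; a large ratio is a warning sign.
spread_ratio = max(group_sds.values()) / min(group_sds.values())
print("largest SD / smallest SD:", round(spread_ratio, 1))
```

In these made-up data the standard deviation shrinks at every step up in degree level, which is the same pattern described above for the science knowledge index.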

2. Multicollinearity. WHAT? Come again?

Multicollinearity refers to highly intercorrelated INDEPENDENT variables in the multiple regression equation. You can "eyeball" these in the zero order correlation matrix, although more formal tests are available. How high is "highly correlated"? Some analysts say any correlation with an absolute value of 0.50 or higher indicates multicollinearity. Most say an absolute value of 0.70 or higher (which corresponds to an R2 of about 50%) designates a problem.

Why is multicollinearity an issue?

When independent variables are very highly intercorrelated, it is difficult to disentangle the unique effects of each independent variable on the dependent variable.

Very large Beta Weights (over an absolute value of 1) may be a diagnostic indication of multicollinearity.
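The "eyeball" screening described above (looking for pairs of independent variables whose correlation has an absolute value of 0.70 or higher) can be sketched in Python as follows. The three predictors and their values are hypothetical, and the 0.70 cutoff is simply the more common of the two conventions mentioned above.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Zero order (Pearson) correlation between two lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


def flag_collinear_pairs(variables, cutoff=0.70):
    """List every pair of predictors whose correlation reaches the cutoff in absolute value."""
    flagged = []
    for name_a, name_b in combinations(variables, 2):
        r = pearson_r(variables[name_a], variables[name_b])
        if abs(r) >= cutoff:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Hypothetical INDEPENDENT variables only; the dependent variable is not part of this check.
predictors = {
    "education_years": [1, 2, 3, 4, 5, 6],
    "occupational_prestige": [2, 4, 5, 4, 5, 7],
    "hours_tv": [3, 1, 4, 1, 5, 2],
}

flagged = flag_collinear_pairs(predictors)
print(flagged)
```

Only the first two hypothetical predictors are flagged here. Remember that this check looks at pairs of predictors, so it can miss a predictor that is well predicted by a combination of several others.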

Again, possible solutions that address multicollinearity are often handled in higher level statistics courses. However, read point #3 for SOMETHING YOU MUST NEVER DO!!

3. NEVER NEVER NEVER omit an important predictor variable THAT IS CORRELATED WITH OTHER INDEPENDENT VARIABLES from the regression equation.

Yes, your remaining B coefficients probably will become larger. Some of the Bs that weren't statistically significant before may become so now. But you have just introduced a different and more serious problem.

NEVER NEVER NEVER try to solve multicollinearity by "throwing out" one (or more) of the intercorrelated independent variables.

The result is B coefficients with systematic biases, i.e., systematic departures from the true population values. That is because the covariance that originally was shared between independent variables now all goes to the independent variables that are left in the equation, artificially and invalidly raising their values and producing invalid, inflated estimates of how those independent variables influence the dependent variable.

This is the most common mistake I have seen novice analysts make when they use multiple regression.
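A tiny made-up example shows the bias at work. In the Python sketch below, the dependent variable is built to equal exactly 2 times the first predictor plus 3 times the second, with no random error at all (an artificial assumption, made so that the true Bs are known). The two predictors are correlated. Everything here, including the function names, is hypothetical and for illustration only.

```python
def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance; cov(a, a) is the variance of a."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def slopes_two_predictors(x1, x2, y):
    """Least squares Bs when BOTH predictors are in the equation."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def slope_one_predictor(x, y):
    """Least squares B when only ONE predictor is in the equation."""
    return cov(x, y) / cov(x, x)


x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 5, 4, 5, 7]                      # correlated with x1
y = [2 * a + 3 * b for a, b in zip(x1, x2)]  # true Bs are 2 and 3 by construction

b1_full, b2_full = slopes_two_predictors(x1, x2, y)
b1_short = slope_one_predictor(x1, y)        # x2 has been "thrown out"

print("both predictors in:", round(b1_full, 2), round(b2_full, 2))
print("x2 thrown out:     ", round(b1_short, 2))
```

With both predictors in the equation, the Bs come back as the true values of 2 and 3. With the second predictor thrown out, the B for the first predictor more than doubles, because it has absorbed the covariance it shares with the omitted variable.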

4. Issues with low R2s

Well, you did your best, but even the biggest optimist would call your entire regression equation results WEAK.

Why?

Low R2s can occur for several reasons:

Very often our regression results reflect our conceptual state of knowledge about the dependent variable. Poor prediction is a sign to "go back to the drawing board" and think some more about how the phenomena under study really operate. Or, like the ideal number of children in Western countries, China, and Japan, there may be such a high level of cultural consensus that there is no variability in the dependent variable to explain.
 

Susan Carol Losh July 20, 2004