

 
EDF 5400 INTRODUCTORY STATISTICS
SUMMER 2004

DR SUSAN CAROL LOSH


 
GUIDE 5: BIVARIATE ASSOCIATIONS AND
CORRELATION COEFFICIENT PROPERTIES

 

 

I treat the material in Guides 4 and 5 as a unit. A lot of material in Agresti and Finlay below will make much more sense after you have finished the material in Guide 5 (this Guide). If you hate that queasy, "I can't understand what's going on" feeling, I strongly recommend that you go over the material below after reading Guide 4. Return and REREAD this material after we complete Guide 5.

KEY TO: Huff, Chapter 7, pp. 74-86.
KEY TO: Agresti and Finlay, in order: Chapter 8, pp. 248-266; Chapter 8, pp. 272-278; Chapter 8, pp. 282-286; Chapter 6, pp. 154-167; pp. 171-179; pp. 193-198 AND
KEY TO: Agresti and Finlay on t-tests: Chapter 7, pp. 210-220; pp. 232-234. 

If you would like to read more on basic Analysis of Variance ("ANOVA") or testing whether there are group differences on the mean scores of a dependent interval or ratio variable, see Agresti and Finlay, Chapter 12, pp. 438-445 (not required, skimming recommended).
 


 
"GOOD CORRELATION COEFFICIENTS"
WIDELY USED CORRELATION COEFFICIENTS
 HOW STRONG IS "STRONG"
FORM OF THE RELATIONSHIP
 CAUSAL ISSUES

STEP 1: You established whether the association between your two variables of interest was likely to be a SAMPLING ACCIDENT, or whether it was PROBABLY REAL, that is, non-zero. But to say that a correlation is non-zero is not really saying very much. Of course, it does mean your correlation is SOMETHING (as opposed to NOTHING) but, especially in a large sample, you could have a correlation that was highly "statistically significant" but relatively trivial.

THINK: Would you publish your results in The Tallahassee Democrat (or anywhere else, for that matter)?

THINK: Would you spend the long distance telephone money to call your family and tell them about it (especially an international call)?

It's time to move on to STEP 2: if your association between two variables is probably REAL, how strong is it? What kind of measure of association do you choose for your data?

We often measure the strength of a relationship with an entity called a correlation coefficient.
Generally, the larger the ABSOLUTE VALUE of the correlation coefficient, the stronger the relationship.

Absolute value means that we disregard the positive or negative sign and simply focus on the numeric value.

Most correlation coefficients are 0 ("zero") when there is "no relationship." They are positive or  negative one (+1 or -1) when the relationship is "perfect," that is, when you can totally determine or predict scores on a second variable through your knowledge of the scores on the first variable.

That's what Guide Five is about. We will examine:

Properties of "good" correlation coefficients.

Some widely used bivariate correlation coefficients.

Some cautions about some widely used measures.

The concept of "tied pairs" for ordinal measures of association.

The concept of PRE, or "percentage reduction in error."

And a bit about "effect size" in the difference between means.



WHAT CHARACTERISTICS DOES A GOOD CORRELATION COEFFICIENT HAVE?

Correlation coefficients are measures of association that address the question: how strong is this relationship? This is the issue of "substantive significance" or "practical significance." Sometimes it is also treated as one form of "effect size."

There are MANY different correlation coefficients (statisticians love to create and name correlation coefficients).

We will concentrate on three widely used correlation coefficients, one primarily for nominal data, one primarily for ordinal data and one primarily for interval or ratio data, keeping in mind that there are many other correlation coefficients:
 
 

Coefficient Symbol    Name                                     Data Level and Properties
φ                     Phi (Cramer's V in non-square tables)    Nominal, Symmetric. Phi-Square is PRE.
τb                    Tau-beta                                 Ordinal, Asymmetric, PRE.
ρ                     rho (Pearson's r)                        Interval, Symmetric. r-Square is PRE.

Descriptions of these three correlation coefficients (and Gamma) follow below under "Some Widely Used Correlation Coefficients." PRE will be examined a bit later, so please have patience.

I will also examine lambda (λ), a coefficient for nominal data. Some statistical computer programs calculate lambda and others do not, but we will study it because it illustrates some useful statistical properties.

We will look at Lambda under the issue of "PRE" or Percentage Reduction in Error methods.

I will mention but do not recommend the ordinal coefficient gamma (γ). Gamma is very widely used. One reason for this is that Gamma "inflates," that is, it gives a larger value than most other ordinal coefficients on the same data because of the way that it is calculated.
 
 
POINTS TO CONSIDER IN CHOOSING A CORRELATION COEFFICIENT

REMEMBER! If your association between two variables is not statistically significant, you cannot rule out that the population value of the correlation coefficient is zero, regardless of what the sample results "look like." Treat it as zero.

1. THE LEVEL OF DATA

Certain correlation coefficients are formulated for use on each level of data, nominal, ordinal, and interval-ratio. For example, phi is typically used for nominal data; tau-b and tau-c for ordinal data and r for interval level data.

In general, you MUST go with the LOWER level of data in your two variables. For example, if one variable is nominal, and the other interval, either use a nominal correlation coefficient or examine a difference of means test (where you can use the correlation coefficient eta).

2. RANGE OF POSSIBLE VALUES

Correlation coefficients should vary between 0 and 1 for nominal measures and between -1 and +1 for ordinal and interval correlation coefficients. Correlation coefficients are standardized measures. So if you see an entity that is 'way larger than 1.0, it is not a correlation coefficient.

Adhering to this range makes it easier to interpret correlations, and to compare the same type of correlation coefficient across groups (e.g., a phi between two variables across women and men).

Depending on the size of the table, some correlation coefficients are unable to reach 1 or -1. Sometimes you can apply a correction factor so that they can do so. And, sometimes you can't. We will examine Phi, which does have a correction factor, which is not always true for nominal correlation coefficients.

3. CAUSAL ISSUES

NOTE: The last section of this guide is devoted to the basics of establishing causality in non-experimental data.

Correlation coefficients can be designed for symmetric relationships, where you cannot designate an independent variable (the coefficients φ or τc, for example).

Other correlation coefficients are designed for asymmetric relationships (such as τb) where we are able to designate a cause and an effect. Typically asymmetric coefficients give slightly larger results on the same set of data than symmetric correlation coefficients do. It is as if you are being "rewarded" for being able to designate more precise information about your variables and the relationships among them.
 

4. DIRECTION

If the two variables in the relationship are at least ordinal, we can discuss the direction of the relationship, positive or negative. That is because we can make "more than" and "less than" statements with ordinal, interval, and ratio variables.
 
 

 
IMPORTANT!

Positive and negative correlation coefficients do NOT operate like univariate numbers.

We evaluate strength by the absolute numeric value (magnitude) of the correlation, not the + or - sign.
For example, a correlation of -.75 is stronger than a correlation of +.50.

Direction refers to the FORM of the relationship.

With a positive correlation, scores on the second variable rise when scores on the first variable rise.

For example, people with more years of education usually earn more money in dollars.
People with more income are also more likely to play state lotteries (yes, they really are, despite what you may have read.)

If we were to plot out the relationship between number of years of education on the horizontal or "x" axis, and annual income in dollars on the vertical or "y" axis, a line connecting the "x"s would slant UPWARDS and TO THE RIGHT. It would look something like this:
 
 

RELATIONSHIP BETWEEN YEARS OF EDUCATION AND ANNUAL
INCOME IN DOLLARS
 
     |                             X
     |                        X
     |                    X
 $   |               X 
     |          X
     |     X
     |X______________________________
               education in years


With a negative correlation, scores on the second variable fall when scores on the first variable rise.

A negative correlation is sometimes called an inverse correlation or an inverse relationship. You can use either negative correlation or inverse correlation (but you need to know what a negative correlation means in statistics).

For example, better educated people smoke fewer cigarettes.
People with higher incomes lose fewer impairment days due to illness.

If we were to plot out the relationship between number of years of education on the horizontal or "x" axis, and average number of daily cigarettes smoked on the vertical or "y" axis, a line connecting the "x"s would slant DOWNWARD and TO THE RIGHT. It would look something like this:
 
 

RELATIONSHIP BETWEEN YEARS OF EDUCATION AND 
AVERAGE NUMBER OF CIGARETTES SMOKED PER DAY
 
                    |   X
                    |        X
                    |             X
             #      |                  X 
       cigarettes   |                       X
       daily        |                            X___

                                                  education in years


 

TIP: If both your variables are ordinal, interval or ratio, a coefficient with direction is preferable  (tau-b, tau-c or r). Because you can make (at a minimum) "more than" or "less than" statements with these variables, you can make directional statements:

The more the education, the fewer the cigarettes smoked. (Negative or inverse relationship.)
The taller the person, the larger their weight in pounds. (Positive relationship.)

5. "PRE" PERCENTAGE (PROPORTION) REDUCTION IN ERROR

Finally, it is desirable to choose a correlation coefficient with a "PRE" (proportion/percentage reduction in error, see section 'way below in this page) interpretation if this is possible.

The value of the coefficient shows how much we have reduced our mistakes in prediction on the second variable if we know scores on the first variable. These measures customarily use ratios of mistakes and correct predictions.

One reason we will look at the correlation coefficient "lambda" for nominal variables is because lambda nicely illustrates PRE properties.

We will consider these five points in more detail throughout this guide.

 SOME WIDELY USED CORRELATION COEFFICIENTS

 
PHI

The correlation coefficient Phi (φ) is based on the Chi-square statistic.

 
Phi is a very useful correlation coefficient, partly because it is easy to calculate, partly because Phi-Square is PRE, partly because it gives results comparable to Tau-beta and Pearson's r (which are VERY widely used correlation coefficients for ordinal and interval-ratio data respectively), and because, with a correction factor, it usually can be employed with any size table.

If your computer printout does not include Phi, do not worry, because you can calculate it in a minute or two with your hand calculator. The SDA system does not calculate Phi. SPSS does calculate Phi.

(Step 1) Take the value of Chi-square and divide by the casebase.

The result is PHI-SQUARED.

(Step 2) Then take the square root of Phi-Squared and you now have Phi.

That's it! That's Phi. Ease of calculation is a point in its favor. You can use this calculation of φ for any table where either one of your two variables takes on ONLY TWO values.

If both your variables have MORE than two values, φ no longer has 1 as its maximum (in larger tables it can exceed 1), so you will have to use a variation of Phi called:

CRAMER'S V
(named after a statistician named Cramer, what else?)
Cramer's V corrects for this, so that the coefficient again runs from 0 to 1 whatever the size of the table.

To obtain Cramer's V, instead of dividing Chi-square by n, divide Chi-Square by [ n * (k-1) ]

k is the LESSER of the number of rows or columns.
For example, if you had 3 rows and 4 columns, k would equal 3 and (k-1) would equal 2.

So the whole formula for Cramer's V looks like this:

$$V = \sqrt{\frac{\chi^2}{N (k - 1)}}$$

Phi-square (φ²) is a PRE measure (see below).
However, φ itself is not PRE.
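
The two-step hand calculation of Phi, and the Cramer's V correction, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `phi` and `cramers_v` are made up for this example (they are not SDA or SPSS commands), and the inputs are the Chi-square of 5.63 on 27 cases from the home computer example later in this guide.

```python
import math

def phi(chi_square, n):
    """Phi: divide Chi-square by the casebase, then take the square root."""
    phi_squared = chi_square / n       # Step 1: Phi-squared
    return math.sqrt(phi_squared)      # Step 2: Phi

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V: divide Chi-square by n * (k - 1) before the square root."""
    k = min(n_rows, n_cols)            # k is the LESSER of rows or columns
    return math.sqrt(chi_square / (n * (k - 1)))

# Hypothetical call using the 2 x 2 gender by computer ownership table.
print(round(phi(5.63, 27), 2))
print(round(cramers_v(5.63, 27, 2, 2), 2))
```

When either variable has only two values, k - 1 = 1, so Cramer's V and Phi come out as the same number.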
 
TAU-BETA
AND A BIT ON GAMMA

THE CONCEPT OF PAIRS

The correlation coefficient Tau-Beta (τb) is a coefficient that requires both your variables to be at least ordinal.

Typically, the level of statistical significance for Tau-Beta and its "relatives" (tau-a or tau-c) is tested with a t-distribution. However, not all statistical programs will give you the t-value (SDA does not, SPSS does). In that case,  it is usually OK to "drop back" and test statistical significance with the Chi-Square to estimate whether the association in the population is ACCIDENTAL or REAL before proceeding further.

The taus vary from -1 to +1.
All the taus are PRE measures.

These two features make them very widely used in analytic presentations.

Tau-Alpha and Tau-Beta are used for asymmetric relationships where you can designate an independent and a dependent variable, while Tau-c (Tau-Gamma) is used for symmetric relationships.


Tau-a (rarely seen on statistical programs), Tau-b, and Tau-c all operate off the concept called "pairs" which we will also demonstrate in class. Once seen, I believe you will be able to easily keep this concept straight.

In the concept of "pairs," each case is sequentially paired off with every other case, thus creating a "pair".
You will have:

[ n * ( n -1 ) ]  / 2  such unique pairs

For example, if you are studying 10 students in a class, you could uniquely pair them off in ( 10 * 9 ) / 2 = 45 different ways.
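
If you would like to check the pair count yourself, here is a small Python sketch. It compares the formula with a brute-force listing of every unique pair; the name `number_of_pairs` and the student ID numbers 1 to 10 are invented for the illustration.

```python
from itertools import combinations

def number_of_pairs(n):
    """Unique pairs among n cases: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Ten hypothetical students, identified by the numbers 1 through 10.
students = list(range(1, 11))

# combinations() pairs each case with every other case exactly once.
all_pairs = list(combinations(students, 2))

print(number_of_pairs(10), len(all_pairs))
```

Both numbers should agree with the 45 pairs worked out above.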

Pairs of cases can further be divided into subgroups:

Pairs where both cases have the same independent variable score. An example would be two people in a sample who both have high school degrees for educational level. We call these pairs "tied on X."

Pairs where both cases have the same dependent variable score. An example would be two separate people in a sample who each have ONE child. We call these pairs "tied on Y."

Pairs where both cases have identical independent AND dependent variable scores. An example would be two people in a sample who both have high school degrees for educational level AND who both have one child. We call these pairs either "tied on both" OR  "tied on X AND Y."
There are also "agreement" pairs and "disagreement" pairs.
In agreement pairs, case number one has BOTH a higher score on the independent variable AND a higher score on the dependent variable than the second case in the pair.
THINK: Jane has more education and more computers than Bob. (Higher on both education and computers.)
THINK: Cherice is taller than Carol and weighs more than Carol. (Taller and heavier.)
In disagreement pairs, the first person in the pair has a HIGHER SCORE on the independent variable and a LOWER score on the dependent variable than the second person in the pair.
THINK: Jim has more education but smokes fewer cigarettes than John.
THINK: Janice works more hours and has less leisure time than Theresa.

These types of relationships are summed up in the chart below:

TYPE OF PAIR                  Person #1 in the pair versus Person #2 in the pair

                              Score on: Variable One     Score on: Variable Two
                              (Independent)              (Dependent)
"Tied on X" ( PX )            Same scores                Two different scores
"Tied on Y" ( PY )            Two different scores       Same scores
"Tied on both" ( PXY )        Same scores                Same scores
Agreement pairs ( PA )        Higher than person #2      Higher than person #2
Agreement pairs ( PA )        Lower than person #2       Lower than person #2
Disagreement pairs ( PD )     Higher than person #2      Lower than person #2
Disagreement pairs ( PD )     Lower than person #2       Higher than person #2
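
The sorting of pairs in the chart can be sketched as a short Python routine. This is a rough illustration rather than what any statistical package actually does internally: the function name `count_pairs` and the seven (education level, number of computers) cases are hypothetical, and the labels PA, PD, PX, PY and PXY are taken from the chart and the Tau-b formula.

```python
from itertools import combinations

def count_pairs(cases):
    """Sort every unique pair of (x, y) cases into one of the five pair types."""
    counts = {"PA": 0, "PD": 0, "PX": 0, "PY": 0, "PXY": 0}
    for (x1, y1), (x2, y2) in combinations(cases, 2):
        if x1 == x2 and y1 == y2:
            counts["PXY"] += 1      # tied on both X and Y
        elif x1 == x2:
            counts["PX"] += 1       # tied on X only
        elif y1 == y2:
            counts["PY"] += 1       # tied on Y only
        elif (x1 - x2) * (y1 - y2) > 0:
            counts["PA"] += 1       # agreement: same ordering on X and Y
        else:
            counts["PD"] += 1       # disagreement: opposite ordering
    return counts

# Hypothetical data: (education level, number of computers) for 7 people.
cases = [(1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 2), (3, 1)]
print(count_pairs(cases))
```

Seven cases make 21 unique pairs, so the five counts should add up to 21.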

Now, we can put all these different pairs together in proportion to examine Tau-Beta and Gamma.
 
 

 
Here's the formula for Tau-b:

$$\tau_b = \frac{P_A - P_D}{\sqrt{(P_A + P_D + P_Y)(P_A + P_D + P_X)}}$$



Gamma (γ) is a widely used correlation coefficient designed for use when both variables are ordinal, interval, or ratio:

It is a measure used in symmetric relationships.
It varies from 0 to +1 or from 0 to -1.
Gamma is also a PRE correlation coefficient.

So, given these desirable qualities, what's the problem?
 
 

 
Here's the formula for Gamma:

$$\gamma = \frac{P_A - P_D}{P_A + P_D}$$

Gamma is very popular because it usually produces a larger value than the Taus do on the same data.

Gamma is larger because it has a smaller denominator. The numerator for Gamma is identical to that for Tau-Beta. However, the denominator for Gamma eliminates all the tied pairs that Tau-Beta (and the other Taus) includes. This is why Gamma "inflates" correlation coefficients. It is an artifact of the formula that creates a smaller denominator.

However, it is VERY common to have pairs of cases (especially in large samples) that have the same independent variable score and the same dependent variable score (e.g., all the high school graduates who have only one child), so Gamma is a poor measure to use.
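
To see the "inflation" in numbers, here is a minimal Python sketch that puts the same pair counts through both formulas. The function names `tau_b` and `gamma` and the pair counts (10 agreement, 1 disagreement, 4 tied on X only, 5 tied on Y only) are hypothetical values chosen for illustration.

```python
import math

def tau_b(pa, pd, px, py):
    """Tau-b: the tied pairs stay in the denominator."""
    return (pa - pd) / math.sqrt((pa + pd + py) * (pa + pd + px))

def gamma(pa, pd):
    """Gamma: same numerator, but all tied pairs are dropped."""
    return (pa - pd) / (pa + pd)

# Hypothetical pair counts.
pa, pd, px, py = 10, 1, 4, 5

print(round(tau_b(pa, pd, px, py), 2))
print(round(gamma(pa, pd), 2))
```

With any tied pairs at all, the Gamma denominator is smaller than the Tau-b denominator, so Gamma comes out larger in absolute value on the same data.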


 
PEARSON'S PRODUCT MOMENT CORRELATION COEFFICIENT
ALSO CALLED "RHO" "r" or PEARSON'S "r"

AND ETA (η), A STATISTICAL RELATIVE

r (rho [ρ] in the population) is a correlation coefficient for interval or ratio-level data.
It is THE most widely used correlation coefficient that you see presented in the statistical literature.

Its level of statistical significance is tested with a t-distribution, although again, not all statistical programs will give you the t-value (SDA does not, SPSS does). In that case, it is usually OK to "drop back" and test statistical significance with the Chi-Square* before proceeding further.

Pearson's r varies from -1 to +1 and is a symmetric correlation coefficient.
r-square (r²) is a PRE measure. However, r itself is not PRE.

r is designated with a capital R on your SDA printout.

*r is a LINEAR correlation. To the extent your plotted correlation departs from a straight line, the Chi-square distribution will not work well as a significance test for r.
 
 

 
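Here's the formula for Pearson's r:

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$

Here $\bar{X}$ and $\bar{Y}$ are the means of the two variables, each sum runs over all the cases, and $\sqrt{\ }$ means to take the square root.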

r is a STANDARDIZED MEASURE. It is sometimes called the standardized ratio of the covariation between "x and y" to the variation in the independent and dependent variables. Sometimes r is also called a "standardized covariance" for short.
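
Here is a minimal Python sketch of r as "covariation divided by variation." It assumes the usual deviation-score form of the formula (each score minus its mean); the function name `pearson_r` and the five education and income scores are invented for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y over the variation in each."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]    # deviations from the mean of x
    dev_y = [y - mean_y for y in ys]    # deviations from the mean of y
    covariation = sum(a * b for a, b in zip(dev_x, dev_y))
    variation_x = sum(a * a for a in dev_x)
    variation_y = sum(b * b for b in dev_y)
    return covariation / (math.sqrt(variation_x) * math.sqrt(variation_y))

# Hypothetical data: years of education and income in thousands of dollars.
education = [10, 12, 14, 16, 18]
income = [20, 30, 35, 50, 65]
print(round(pearson_r(education, income), 2))
```

Because the result is standardized, it would not change if income were measured in dollars instead of thousands of dollars.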

Fortunately the computer will calculate all these correlation coefficients VERY quickly and accurately, so none of us need to!


Eta (η) is a relative of r that we can use when we have a relationship between a nominal INDEPENDENT variable and an interval DEPENDENT variable.

Eta varies from 0 to 1 but it is always a positive number. Eta is symmetric too.

η² (eta-square) is a PRE measure, but η is not.

If you decide to use a difference of means test for statistical significance, eta is the most sensible associated correlation coefficient to use.
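
This guide does not give a formula for eta, so the Python sketch below rests on an assumption: it uses a common textbook definition in which eta-square is the between-group sum of squares divided by the total sum of squares. The function name `eta` and the three small groups of scores are hypothetical.

```python
import math

def eta(groups):
    """Eta, assuming eta-square = between-group SS / total SS."""
    scores = [s for group in groups for s in group]
    grand_mean = sum(scores) / len(scores)
    # Total variation of every score around the grand mean.
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return math.sqrt(ss_between / ss_total)

# Hypothetical data: number of computers for three nominal groups.
groups = [[0, 1, 1], [1, 2, 2], [2, 3, 3]]
print(round(eta(groups), 2))
```

Notice that nothing in the calculation uses the order of the groups, which is why eta has no sign.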
 

RECAP: WHICH IS THE CORRELATION COEFFICIENT FOR YOU?

FIRST, examine the measurement level of your data. Match your correlation coefficient to the LOWEST level of measurement in your variables. AVOID using an ordinal measure, such as the taus, or an interval measure, such as r, if you have a nominal variable.

HOWEVER, if the independent variable is nominal and the dependent variable is interval, strongly consider a difference of means test, and use the correlation coefficient eta. This is especially true if the interval dependent variable has a lot of different values.

SECOND, what kind of statement can you make about the causal connection between your variables? Can you clearly designate an independent variable and a dependent variable? If so, you have an asymmetric relationship and may want to consider an asymmetric correlation coefficient for nominal or ordinal data (we must use different methods for interval data, to be discussed in a later guide).

If you cannot clearly designate cause and effect, you have a symmetric relationship, and need to use a symmetric coefficient, such as phi, tau-c, or r.

 IF THE RELATIONSHIP IS REAL (NON-ZERO), HOW STRONG IS IT?

The following chart can help you describe the numeric value of a correlation coefficient in words. This is a somewhat arbitrary chart which I copied from social scientist Gene Lutz. However, if we can all agree that the numbers are reasonable, it provides a place to begin.

The numbers presented below are ABSOLUTE VALUES, that is, we have disregarded whether the sign of the correlation is positive or negative, and simply presented the number.

PLEASE MEMORIZE THIS CHART!
You will use it on exams and we will refer to it frequently.
 

CORRELATION ABSOLUTE VALUE SIZE     VERBAL DESIGNATION
0                                   No relationship
.01-.10                             Very weak
.11-.25                             Weak
.26-.50                             Moderate
.51-.75                             Strong
.76-.99                             Very strong
1.00                                Perfect association
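
As a memory aid, the chart can be written as a small Python lookup. This is just a sketch: the function name `describe_strength` is invented, and rounding the coefficient to two decimal places first is my assumption, since the chart itself lists two-decimal ranges.

```python
def describe_strength(coefficient):
    """Turn a correlation coefficient into the chart's verbal designation."""
    size = round(abs(coefficient), 2)   # strength ignores the + or - sign
    if size > 1.00:
        raise ValueError("larger than 1.0, so not a correlation coefficient")
    if size == 0:
        return "No relationship"
    if size <= 0.10:
        return "Very weak"
    if size <= 0.25:
        return "Weak"
    if size <= 0.50:
        return "Moderate"
    if size <= 0.75:
        return "Strong"
    if size < 1.00:
        return "Very strong"
    return "Perfect association"

print(describe_strength(-0.75))
print(describe_strength(0.50))
```

The first call shows why -.75 counts as stronger than +.50: only the absolute value enters the lookup.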

IMPORTANT: A complete description of a numeric correlation includes:

EFFECT SIZE AND THE DIFFERENCE ACROSS MEAN SCORES

You may have decided to use a t-test or an F-test to look at the differences in mean scores across the values of a single independent variable. These values typically form what are called "groups" in differences of means statistical tests.

You can assess substantive significance or effect size with the correlation coefficient eta.

You can also look at the difference between means.
We compare the differences between means across different values of the independent variable with respect to the standard deviation of the difference between means.

This is NOT the standard deviation of group one or group two.
Instead it is the standard deviation OF THE DIFFERENCE between group one and group two.

This means looking at the denominator of the t-distribution, or:

S12, or THE SQUARE ROOT OF THE FOLLOWING FORMULA:

$$S_{12} = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{(n_1 + n_2) - 2}}$$

where n1 and n2 are the casebases of group 1 and group 2 respectively and $S_1^2$ and $S_2^2$ are the variances around the means of group 1 and group 2. So this is a weighted kind of standard deviation.

We can now take the absolute difference between the mean scores of group 1 and group 2 ($|\bar{X}_1 - \bar{X}_2|$) and divide that absolute difference by the standard deviation of the differences. The result will be in standard deviation of the differences units.

We can decide a priori how big a difference between means we decide to call "big." One standard deviation of the differences? One-half standard deviation of the differences (this one is usually the minimum for "effect size")?

Why do we go to all this trouble? Because with very large samples, very small or trivial effect sizes will be statistically significant. This way, we warn our reader not to expect too much if the ES or effect size is small. 
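
The effect size calculation described in this section can be sketched in Python as follows. It follows the weighted standard deviation given above; the function names `weighted_sd` and `effect_size` and all the group means, variances and casebases in the example are hypothetical.

```python
import math

def weighted_sd(n1, var1, n2, var2):
    """Square root of the weighted average of the two group variances."""
    return math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / ((n1 + n2) - 2))

def effect_size(mean1, mean2, n1, var1, n2, var2):
    """Absolute difference between the means, in weighted-SD units."""
    return abs(mean1 - mean2) / weighted_sd(n1, var1, n2, var2)

# Hypothetical groups: means 12.0 and 10.5, variances 4.0 and 9.0,
# casebases 40 and 60.
es = effect_size(12.0, 10.5, 40, 4.0, 60, 9.0)
print(round(es, 2))
```

A result a little over one-half would just clear the minimum that this guide suggests for calling a difference an "effect."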



LINEAR, MONOTONIC OR NONLINEAR?
MORE ON RELATIONSHIP FORM

In a monotonic relationship, as the first variable increases, a second variable rises or falls at an irregular rate. But a positive relationship would not turn negative.

For example, personal income might increase a lot when education rises from 11th to 12th grade but rise only a little when education increases from 14 to 15 years. Education in this example always positively affects income, but how much income rises may depend on whether educational level is low or high. As you can see, this is an "ordinal type" of relationship--a "more or less than" type of relationship (even if you have interval or ratio variables).

Tau-b is an ordinal correlation that detects monotonicity.
If you have several values of the independent variable, you might get a small change of direction by chance, i.e., the relationship is "roughly" or "approximately" monotonic.

Linear relationships resemble a straight line. As one variable rises, a second variable rises or falls by a constant amount.

That "by a constant amount" part is critical! For example, for each inch increase in height, someone would average 8 pounds heavier. The 8 pound rise would occur whether we compared people 5'5'' and 5'6" tall or people 5'10" and 5'11".

Linear relationships allow more precise prediction than monotonic relationships.

Because of this precision, this is more of an "interval-ratio" kind of relationship because the line rises or falls at a constant rate.

Pearson's r is an interval-level correlation that is designed to measure linear relationships.

In a curvilinear or non-linear association, the relationship between two variables is irregular in both slope and direction. The graph of the relationship might "bend" or even resemble a "U".

Phi  (Cramer's V) is a correlation coefficient that works well with nonlinear relationships even if the variables are ordinal or interval. Eta also works well if the dependent variable is interval-ratio.

EXAMPLE: Among women, the odds of marriage first increase with higher education, peak at around a B.A. college degree level, then drop somewhat at the advanced degree level. Use Chi-square and φ here.

To "eyeball" whether you have a nonlinear, linear or monotonic relationship, here is what you can check:

Do column percents in a table so that you can compare across values of the independent variable (e.g., what percent of people with a high school degree like rock music compared with college graduates?).

Graph the percent with the highest (OR lowest) value of the dependent variable (e.g., "like" rock music).

If the graph rises or falls by a constant amount for those liking rock with each advance in degree level, the relationship is linear.

If there is an irregular rate of increase or decrease, the relationship is monotonic.

If liking for rock music "jumps around" with no clear increase or decrease, the relationship is non-linear or curvilinear.
 

YOU MUST HAVE AT LEAST THREE VALUES OF YOUR INDEPENDENT VARIABLE TO DETECT A LINEAR OR MONOTONIC TREND!

You can do an analogous "eyeball" examining mean differences across groups--IF AND ONLY IF you can rank the groups from lowest to highest, or least to most, as I do in the tabular example below.

EXAMPLE

Let's look at an SDA example with (collapsed) level of education as the independent variable and mean number of household computers as the dependent variable. These data are from the year 2000 U.S. Census Bureau's "Falling through the Net" series about the "Digital Divide."

TITLE: NUMBER OF COMPUTERS IN HOUSEHOLD BY DEGREE LEVEL

Educational Level                   Mean # Computers in Household
Less than a High School Degree      0.47
High School Degree                  0.59
Some College                        0.91
College Grad                        1.13
Advanced Degree                     1.28

Source: Current Population Survey supplement on computer and internet use, August 2000.
Valid n = 94,821     MD = 40,165

It is clear that there is a POSITIVE relationship between the level of education and the number of computers in the household. As educational level rises, so, too, does the number of household computers. But the level rises at an IRREGULAR rate.

It rises by 0.12 from Less than High School to High School Degree (0.59 - 0.47 = 0.12)

It rises by 0.32 for High School Degree to Some College (0.91 - 0.59 = 0.32 )

It rises by 0.22 for Some College to College Grad (1.13 - 0.91 = 0.22)

And it rises by 0.15 for College Grad to Advanced Degree (1.28 - 1.13 = 0.15)

This is a MONOTONIC relationship because the rate of increase is IRREGULAR.
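
The same "eyeball" check can be written as a short Python sketch. The means are the ones from the table above; the function name `describe_form` and its crude rule (equal steps are linear, same-direction unequal steps are monotonic, anything else is nonlinear) are my own simplification and make no allowance for small chance reversals.

```python
def describe_form(means):
    """Crude check of relationship form from ranked group means."""
    if len(means) < 3:
        raise ValueError("need at least three values of the independent variable")
    # Change in the mean from each group to the next higher group.
    steps = [round(b - a, 2) for a, b in zip(means, means[1:])]
    rising = all(step > 0 for step in steps)
    falling = all(step < 0 for step in steps)
    if not (rising or falling):
        return "nonlinear"      # the direction changes somewhere
    if len(set(steps)) == 1:
        return "linear"         # every step is the same size
    return "monotonic"          # same direction, irregular rate

computer_means = [0.47, 0.59, 0.91, 1.13, 1.28]
steps = [round(b - a, 2) for a, b in zip(computer_means, computer_means[1:])]
print(steps)
print(describe_form(computer_means))
```

The printed steps should match the four increases listed above (0.12, 0.32, 0.22 and 0.15).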


 
PROPORTION [PERCENTAGE] REDUCTION IN ERROR (PRE)
AND THE LAMBDA EXAMPLE

Correlation coefficients with a PRE interpretation tell us the proportion by which we reduce our errors in predicting the dependent variable when we know the scores on the independent variable.

PRE correlation coefficients are ratio measures (they are NOT ratio variables ) that take this form (E stands for "error"):

[ E1 - E2 ]  ÷  E1

This means how many errors or mistakes we make in predicting the dependent variable WITHOUT knowing scores on the independent variable (E1), compared with the amount of errors or mistakes we make in predicting the dependent variable if we DO know scores on the independent variable (E2).

An example with the correlation coefficient lambda will be worked for you below in this section.

Suppose we want to predict someone's weight. With one variable, our best [interval-level] guess is the mean weight of the sample or population that we obtained.

In this case, our "error rate" is the standard deviation, the average distance a score is from the mean.

Could we predict someone's weight score more precisely if we knew what their height is?

r² is a correlation for two interval variables that helps to answer this question. If r² = .25, this means that we have reduced the approximate equivalent of the variance in weight by 25 percent by knowing what someone's height is.
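
Here is a hedged Python sketch of that idea. It assumes that "knowing height" means predicting weight from a least-squares straight line (regression is a topic for later in the course), and the function name `pre_and_r_squared` and the six height and weight scores are invented.

```python
def pre_and_r_squared(xs, ys):
    """Compare (E1 - E2) / E1 with r-square for a straight-line prediction."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # E1: squared errors when we guess the mean weight for everyone.
    e1 = syy
    # E2: squared errors when we guess from each person's height.
    e2 = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    pre = (e1 - e2) / e1
    r_squared = sxy ** 2 / (sxx * syy)
    return pre, r_squared

# Hypothetical heights in inches and weights in pounds.
heights = [60, 62, 64, 66, 68, 70]
weights = [115, 120, 140, 135, 160, 170]
pre, r_squared = pre_and_r_squared(heights, weights)
print(round(pre, 3), round(r_squared, 3))
```

The two printed numbers should be identical, which is the sense in which r-square is a PRE measure.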

PRE interpretation is a desirable characteristic in a correlation coefficient. The following are all PRE correlation coefficients:

r²
φ²
τb and
τc

(So is γ, but γ is a bad, inflated correlation coefficient!)


The nominal and asymmetric correlation coefficient lambda is unfortunately not calculated by all statistical programs. However, it DOES provide a simple illustration of the PRE concept.

Suppose we have the following bivariate table using a sample from a class of undergraduate educational psychology students:
 
 

                            GENDER
OWNS HOME COMPUTER      MALE     FEMALE     ROW TOTALS
YES                       7         4           11
NO                        3        13           16
COLUMN TOTALS            10        17           27

If we calculated the percentages, 70 percent of the men but only 24 percent of the women currently own computers. It LOOKS like we have an association, but, of course, this is only a sample and we want to know about the larger population. Gender is nominal and owns home computer is ordinal, so we want a nominal correlation coefficient. We also have an asymmetric relationship (it is unlikely that buying a computer would cause a sex change.) We can't use eta here because owns computer is ordinal not interval (someone could own 6 or 7 computers and answer "yes" or own 1 computer and answer "yes".)

REVIEW (IS THERE A RELATIONSHIP?)

Calculated, the Pearson Chi Square = 5.63.
Its probability level with one degree of freedom = .018

Thus, if there were no relationship in the population between gender and computer ownership, we would expect this kind of sex difference just by chance in only about 2 in 100 samples.

Of course, our apparent relationship in the sample COULD be an accident and we have one of those two nonrepresentative samples out of 100. It COULD happen (rarely).

However, we will probably reject the null hypothesis of no sex difference, given these sample results, and conclude male "ed psych" undergraduates more often own a computer than women do.
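
For anyone who wants to check these two numbers, here is a minimal Python sketch of the Pearson Chi-square for the table above. The function names are made up, and the probability shortcut is valid ONLY for one degree of freedom (it relies on the fact that a Chi-square with one degree of freedom is a squared standard normal score).

```python
import math

def chi_square(table):
    """Pearson Chi-square: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total

def p_value_one_df(chi2):
    """Probability level for a Chi-square with ONE degree of freedom only."""
    return math.erfc(math.sqrt(chi2 / 2))

# Rows: owns computer YES, NO. Columns: MALE, FEMALE.
table = [[7, 4], [3, 13]]
chi2 = chi_square(table)
print(round(chi2, 2), round(p_value_one_df(chi2), 3))
```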

HOW DOES LAMBDA WORK?

Given there appears to be a relationship, how strong is it?

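Lambda compares differences from the mode for:

The total sample, ignoring the independent variable (this gives E1).

Each category of the independent variable taken separately (added together, these give E2).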

Follow the example below carefully line by line. Satisfy yourself that you are comfortable before proceeding to the next line.

The mode for the entire sample is NO HOME COMPUTER. 16 out of 27 cases have this mode.
However, 11 cases DO have a home computer so they are different from the mode.
So if we were trying to guess computer ownership for the total sample, we would use the mode (nominal data) but we would make 11 mistakes or errors.

Error number one (total sample, univariate) = "E1" = 11

For men, the mode is HAVE HOME COMPUTER. 7 men do. However, 3 men deviate from the male mode and do not have a home computer.
The number of prediction mistakes on "have computer" for men = 3

For women, the mode is NO HOME COMPUTER. 13 women do not. However, 4 women deviate from the female mode and DO have a computer.
The number of prediction mistakes on "have computer" for women = 4

So, the total for "E2" or the mistakes that we make predicting computer ownership IF WE KNOW THE STUDENT'S GENDER is a total of 7:   3 mistakes for men plus the 4 mistakes for women.

Lambda becomes =   (11 - 7)  ÷  11  =   4 ÷  11 = .36

Instead of making 11 mistakes in predicting computer ownership, we make only 7 if we also know each student's gender.

This is a .36 or a .36 X 100 = 36 percent improvement in our prediction of computer ownership.

Putting it in a slightly different way, we have a proportional reduction in error (PRE) of 36 percent (using the coefficient lambda) in guessing a student's computer status if we also know their gender.
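The line-by-line lambda calculation above can also be written as a short Python sketch. The counts come from the table; the function names are invented for this illustration, and the code assumes E1 is not zero (otherwise there would be no errors to reduce).

```python
# Counts from the table above, grouped by the independent variable (gender).
table = {
    "MALE": {"YES": 7, "NO": 3},
    "FEMALE": {"YES": 4, "NO": 13},
}

def errors_from_mode(counts):
    """Number of cases that are NOT in the modal (most common) category."""
    return sum(counts.values()) - max(counts.values())

def asymmetric_lambda(groups):
    # E1: errors when we predict the overall mode for everyone.
    overall = {}
    for counts in groups.values():
        for category, count in counts.items():
            overall[category] = overall.get(category, 0) + count
    e1 = errors_from_mode(overall)
    # E2: errors when we predict each group's own mode instead.
    e2 = sum(errors_from_mode(counts) for counts in groups.values())
    return e1, e2, (e1 - e2) / e1

e1, e2, lam = asymmetric_lambda(table)
print(e1, e2, round(lam, 2))  # 11 7 0.36
```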

 WHAT IS THE TRUE CAUSAL STATUS OF THE RELATIONSHIP?

This is the third question we ask about an association.

First, we must establish that the relationship is REAL, and of at least moderate strength so we know it is not trivial. We then ask whether the two variables have a true causal association or whether a third variable may have caused the original two variables (a spurious relationship). To evaluate the causal status of a bivariate relationship, we need to introduce a third variable. This will be the topic of Guide 6.

However, even with two variables, we need to assess whether we can designate an independent variable (cause) and a dependent variable (effect). If we can designate an independent variable, the correlation is asymmetric. If we cannot designate an independent variable, it is symmetric.

This is relatively easy with experiments. In creating an intervention, you also created the causal variable. For example, if you assigned cigarette smokers to a nicotine patch group versus a placebo patch group, nicotine patch (yes or no) becomes the independent variable.

However, in many cases, we cannot manipulate or intervene with variables, such as sex or ethnicity.

In other cases, we are dealing with naturally observed variables, which is often the case with surveys, ethnographies, content analysis, and many other methods. So we need some guidelines to establish plausible causal order among variables in non-experimental studies.

Statistics, by the way, may be used to DISCONFIRM a postulated causal order, but they can NEVER, NEVER, NEVER establish causal order.

ON PROOF AND CAUSALITY

Concepts of causality are critical: they tell us what is possible, what can be changed and what is difficult, if not impossible, to change. For example, if you are convinced that biological factors cannot be overcome, you probably will not believe that visually impaired children can compensate for their disabilities. Causality tells us what are the “prime movers” of the phenomena that we observe.

Much of the research process centers around explaining what are the true causal or “independent variables.”

According to science rules, definitive proof via empirical testing does not exist. Science uses the term "proof" (or, rather, "disproof") differently from the way attorneys or journalists do. For example, a correlation could have many causes, only some of which have been identified. Later work can show earlier causes to be spurious, that is, both cause and effect depend on some prior causal (often extraneous) variable (see the charts on ice cream consumption and fire engines).

CAUSALITY IN NON-EXPERIMENTAL DATA: AN EXAMPLE
Cancerous Human Lung
This dissection of human lung tissue shows light-colored cancerous tissue in the center of the photograph. While normal lung tissue is light pink in color, the tissue surrounding the cancer is black and airless, the result of a tarlike residue left by cigarette smoke. Lung cancer accounts for the largest percentage of cancer deaths in the United States, and cigarette smoking is directly responsible for the majority of these cases. 

"Cancerous Human Lung," Microsoft(R) Encarta(R) 96 Encyclopedia. (c) 1993-1995 Microsoft Corporation. All rights reserved.

Most people--and most scientists--accept that smoking cigarettes causes lung cancer although the evidence (for humans) is strictly correlational rather than experimental. There are many topics where it is neither possible--nor desirable--to use the experimental method. To accept correlational evidence more readily, it will help to examine the rules below. (SCL)

Many scientists believe that the ONLY way to establish causality is through randomized experiments. However, a moment’s reflection will convince you that this cannot be so. Most people now accept that smoking cigarettes causes lung cancer (see the Encarta selection above), yet no society has ever randomly assigned half its population to smoke cigarettes and the other half not. This causal conclusion about smoking and lung cancer is based on correlational evidence, i.e., observing the systematic covariation of two (or more) variables in a research study, which is exactly what we do when we examine the association between two variables. Cigarette smoking and lung cancer are both "naturalistic" variables, i.e., we must accept the data as nature gave them to us.

There is no doubt that the results from careful, well-controlled experiments are typically easier to interpret in causal terms than results from other methods. However, as you can see, causal inferences are often drawn from correlational studies as well. Non-experimental methods must use a variety of ways to establish causality and ultimately must use statistical control, rather than experimental control.


 
RULES TO HELP ESTABLISH WHICH VARIABLE IS "CAUSE" AND WHICH IS "EFFECT"

If one variable causes a second variable, the two should correlate; thus, causation implies correlation.

However, two variables can be associated without having a causal relationship, for example, because a third variable is the true cause of both the "original" independent and dependent variables. For example, recall that there is a statistical correlation over months of the year between ice cream consumption and the number of assaults. Does this mean ice cream manufacturers are responsible for crime? No! The correlation occurs statistically because the hot temperatures of summer cause both ice cream consumption and assaults to increase. Thus, correlation does NOT imply causation. Factors other than cause and effect can produce an observed correlation.
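To see how a third variable can manufacture a correlation, here is a small simulation sketch. Everything in it is MADE UP for illustration: the temperatures, the effect sizes, and the noise levels are assumptions, not real ice cream or crime data. Temperature drives both variables, and neither variable affects the other. The sketch then removes the straight-line effect of temperature from each variable and correlates what is left over.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing its straight-line dependence on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(xs, ys)]

random.seed(5400)  # fixed seed so the made-up data are reproducible
temperature = [random.uniform(0, 35) for _ in range(2000)]
# Hypothetical effects: temperature raises BOTH variables; neither affects the other.
ice_cream = [2.0 * t + random.gauss(0, 10) for t in temperature]
assaults = [1.0 * t + random.gauss(0, 5) for t in temperature]

raw_r = pearson_r(ice_cream, assaults)
controlled_r = pearson_r(residuals(ice_cream, temperature),
                         residuals(assaults, temperature))
print(round(raw_r, 2), round(controlled_r, 2))
```

In this artificial setup the raw correlation should come out strong, while the correlation with temperature held constant should be close to zero. That pattern is what we mean by a spurious relationship.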

If one variable causes a second, the cause is called the independent variable (sometimes it is called the explanatory variable or the predictor).
The effect is called the dependent variable (sometimes it is called the criterion variable).

If you can designate a distinct cause and effect,  the relationship is called asymmetric.

Two variables may be associated but we may be unable to designate cause and effect. These are symmetric  relationships.

Since we know that we cannot use experimental treatments in naturalistic variables to determine cause and effect, yet we know that scientists do draw causal conclusions in nonexperimental studies, here is a set of helpful rules for tentatively establishing causality in correlational data.
 
 

 
When you are asked to "name a causal rule" on assignments or exams, please use one of the six guidelines below.

GUIDE (1) TIME ORDER. The independent variable came first in time, prior to the second variable.

EXAMPLE: Gender or race are fixed at birth.


GUIDE (2) EASE OF CHANGE. The independent variable is harder to change. The dependent variable is easier to change.

EXAMPLE: One's gender is definitely harder to change than scores on an assessment test or years of school. One's chronological age is not usually changed by attitudes, values, education, or much of anything else.


GUIDE (3) "MAJORITY RULE." The independent variable is the cause for most people.

EXAMPLES: Although some people become so fed up with their jobs that they return to school to train for a better job, most people  complete their education prior to obtaining a regular year-round, full-time job.
Most people marry prior to having children (although some people have their children first, then marry as a result.)


GUIDE (4) NECESSARY OR SUFFICIENT. If one variable is a necessary or sufficient condition for the other variable to occur, or a prerequisite for the second variable, then the first variable is the cause or the independent variable.

EXAMPLES: A certain type of college degree is often required for certain jobs.
At most universities, publications are a prerequisite for being awarded tenure.
If you can come up with the money, you almost certainly can purchase a meal.


GUIDE (5)  GENERAL TO SPECIFIC. If two variables are on the same overall topic and one variable is quite general and the other is more specific, the general variable is usually the cause.

EXAMPLE: Overall ethnic intolerance influences attitudes toward Hispanics.


GUIDE (6) THE "GIGGLE" OR "SANITY" FACTOR. If reversing the causal order of the two variables seems illogical and makes you laugh, reverse the causal order back.

EXAMPLES: We don't  believe choosing a specific college major or engaging in a particular sport determines one's gender.
 
 
SOME STUPID STATEMENTS ABOUT CAUSALITY THAT OCCUR IN THE LITERATURE

1. Only experiments can be used to make causal statements.

ANSWER: Use the guidelines above to investigate causality in non-experimental data. Ninety-six percent of Americans believe that smoking causes lung cancer but the data are NOT experimental (with the exception of a few poor rats.)

2. Well, yes, but the smoking data is based on epidemiological studies with thousands of cases.

ANSWER: The number of cases has NOTHING TO DO with causality. Having a large database means your estimates are relatively stable; that is, they have low sampling error. But low sampling variability has nothing to do with causality. Examine the six guidelines above.

3. Nominal variables can serve as causal variables but numeric variables (interval or ratio) cannot be independent variables.

ANSWER: The level of measurement has NOTHING TO DO with causality. You may be able to do more arithmetically complex statistics with numeric data but that gives it no special status. To believe a statement such as number 3 means that you believe that gender (nominal) can have causal status but variables such as age or years of education (ratio) cannot have causal status. A moment's reflection will show you how silly that is. For example, years of education is one of the most powerful predictors of how people live their lives. Do you honestly believe that education has no effect on the occupation you enter, the salary you earn, the health practices you use, or television programs you watch?

Use one of the guidelines above to determine causal status, not the measurement level of your variables.



 
REVIEW: TYPE 1 AND TYPE 2 ERROR IN CLASSICAL INFERENCE TESTING

Remember: we must address question 1 before examining the type of correlation coefficient to use or possible causal issues. This is question 1: is there a relationship at all?

Our null hypothesis is that there is NO relationship, such as:
Ho: χ² = 0 OR Ho: λ = 0 OR Ho: b = 0, or the equivalent statement for whichever correlation coefficient we are using.

There are two basic mistakes we can make with our assessment of the null hypothesis:

FIRST we can reject the null hypothesis when it is, in fact, true.
SECOND we can accept the null hypothesis when it is, in fact, false.

The first type of error means that we believe our results are real when the population association is truly zero, and we mistook a sample illusion for a real relationship.

In the second type of error, we accept a zero relationship between two variables when, in fact, some relationship exists.
 
 
Type of Mistake    Alternative designation    Mnemonic    What happened
Type One Error     alpha (α) error            FIRST       Reject true Ho
Type Two Error     beta (β) error             SECOND      ACCept false Ho

When we set an alpha (α) level, we typically decide the MAXIMUM acceptable probability of a type one error, such as .05 or .01.
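One way to see what the alpha level means is to simulate many samples from a population where the null hypothesis really is true and count how often we would wrongly reject it. The sketch below does this under assumptions chosen purely for illustration: a made-up population in which 40 percent of both men and women own a computer, samples of 200 students, and the one-degree-of-freedom chi-square cutoff of 3.841 for alpha = .05.

```python
import random

CRITICAL_VALUE = 3.841  # chi-square cutoff for alpha = .05 with one degree of freedom

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2 x 2 table: a, b = top row; c, d = bottom row."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:  # an empty row or column: no test is possible
        return 0.0
    return n * (a * d - b * c) ** 2 / margins

def one_null_sample(n=200, p_male=0.5, p_owns=0.4):
    """Chi-square from one sample in which ownership is unrelated to gender."""
    cells = {(m, o): 0 for m in (True, False) for o in (True, False)}
    for _ in range(n):
        male = random.random() < p_male
        owns = random.random() < p_owns  # same probability for both genders
        cells[(male, owns)] += 1
    return chi_square_2x2(cells[(True, True)], cells[(False, True)],
                          cells[(True, False)], cells[(False, False)])

random.seed(2004)  # fixed seed so the run is reproducible
trials = 2000
rejections = sum(one_null_sample() > CRITICAL_VALUE for _ in range(trials))
type_one_rate = rejections / trials
print(type_one_rate)
```

Because Ho is true in every simulated sample, each rejection is a type one error, so the printed proportion should land near the alpha level of .05.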
 
 

Susan Carol Losh June 21 2004.