|
SUMMER 2004 DR SUSAN CAROL LOSH |
|
CORRELATION COEFFICIENT PROPERTIES |
|
GENERAL FEEDBACK EXAM 1 |
|
I treat the material in Guides 4 and 5 as a unit. A lot of material in Agresti and Finlay below will make much more sense after you have finished the material in Guide 5 (this Guide). If you hate that queasy, "I can't understand what's going on" feeling, I strongly recommend that you go over the material below after reading Guide 4. Return and REREAD this material after we complete Guide 5. KEY TO: Huff, Chapter 7, pp. 74-86.
If you would like to read more on basic
Analysis of Variance ("ANOVA") or testing whether there are group differences
on the mean scores of a dependent interval or ratio variable, see Agresti
and Finlay, Chapter 12, pp. 438-445 (not required, skimming recommended).
STEP 1: You established whether the association between your two variables of interest was likely to be a SAMPLING ACCIDENT, or whether it was PROBABLY REAL, that is, non-zero. But to say that a correlation is non-zero is not really saying very much. Of course, it does mean your correlation is SOMETHING (as opposed to NOTHING) but, especially in a large sample, you could have a correlation that was highly "statistically significant" but relatively trivial.
THINK: Would you publish your results in The Tallahassee Democrat (or anywhere else, for that matter)?
THINK: Would you spend the long distance telephone money to call your family and tell them about it (especially an international call)?
It's time to move on to STEP 2: if your association between two variables is probably REAL, how strong is it? What kind of measure of association do you choose for your data?
We often measure
the strength of a relationship with an entity called a correlation coefficient.
Generally, the
larger the ABSOLUTE VALUE of the correlation coefficient, the stronger
the relationship.
Absolute value means that we disregard the positive or negative sign and simply focus on the numeric value.
Most correlation coefficients are 0 ("zero") when there is "no relationship." They are positive or negative one (+1 or -1) when the relationship is "perfect," that is, when you can totally determine or predict scores on a second variable through your knowledge of the scores on the first variable.
That's what Guide Five is about. We will examine:
Properties of "good" correlation coefficients.
Some widely used bivariate correlation coefficients.
Some cautions about some widely used measures.
The
concept of "tied pairs" for ordinal measures of association.
The concept of PRE, or "percentage reduction in error."
And a bit about "effect size" in the difference between means.
|
|
Correlation coefficients are measures of association that address the question: how strong is this relationship? This is the issue of "substantive significance" or "practical significance." Sometimes it is also treated as one form of "effect size."
There are MANY different correlation coefficients (statisticians love to create and name correlation coefficients).
We
will concentrate on three widely used correlation coefficients, one primarily
for nominal data, one primarily for ordinal data and one primarily for
interval or ratio data, keeping in mind that there are many other correlation
coefficients:
| Coefficient Symbol | Name | Data Level and Properties |
| φ | Phi (Cramer's V in non-square tables) | Nominal, Symmetric. Phi-Square is PRE. |
| τb | Tau-beta | Ordinal, Asymmetric, PRE. |
| r | rho (Pearson's r) | Interval, Symmetric. r-square is PRE. |
Descriptions of these three correlation coefficients (and Gamma) follow below under "Some Widely Used Correlation Coefficients." PRE will be examined a bit later, so please have patience.
I will also examine lambda (λ), a coefficient for nominal data. Some statistical computer programs calculate lambda and others do not, but we will study it because it illustrates some useful statistical properties.
We will look at Lambda under the issue of "PRE" or Percentage Reduction in Error methods.
I will mention but do not recommend the ordinal coefficient gamma (γ). Gamma is very widely used. One reason for this is that Gamma "inflates," that is, it gives a larger value than most other ordinal coefficients on the same data because of the way that it is calculated.
|
|
REMEMBER! If your association between two variables is not statistically significant, you cannot conclude that the population value of the correlation coefficient differs from zero, regardless of what the sample results "look like."
1.
THE LEVEL OF DATA
Certain correlation coefficients are formulated for use on each level of data, nominal, ordinal, and interval-ratio. For example, phi is typically used for nominal data; tau-b and tau-c for ordinal data and r for interval level data.
In general, you MUST go with the LOWER level of data in your two variables. For example, if one variable is nominal, and the other interval, either use a nominal correlation coefficient or examine a difference of means test (where you can use the correlation coefficient eta).
2.
RANGE OF POSSIBLE VALUES
Correlation coefficients should vary between 0 and 1 for nominal measures and between -1 and +1 for ordinal and interval correlation coefficients. Correlation coefficients are standardized measures. So if you see an entity that is 'way larger than 1.0, it is not a correlation coefficient.
Adhering to this range makes it easier to interpret correlations, and to compare the same type of correlation coefficient across groups (e.g., a phi between two variables across women and men).
Depending on the size of the table, some correlation coefficients are unable to reach 1 or -1. Sometimes you can apply a correction factor so that they can do so. And, sometimes you can't. We will examine Phi, which does have a correction factor, which is not always true for nominal correlation coefficients.
3.
CAUSAL ISSUES
NOTE: The last section of this guide is devoted to the basics of establishing causality in non-experimental data.
Correlation coefficients can be designed for symmetric relationships, where you cannot designate an independent variable (the coefficients φ or τc, for example).
Other correlation coefficients are designed for asymmetric relationships (such as τb) where we are able to designate a cause and an effect. Typically asymmetric coefficients give slightly larger results on the same set of data than symmetric correlation coefficients do. It is as if you are being "rewarded" for being able to designate more precise information about your variables and the relationships among them.
4.
DIRECTION
If the two variables in the relationship are at least ordinal, we can discuss the direction of the relationship, positive or negative. That is because we can make "more than" and "less than" statements with ordinal, interval, and ratio variables.
|
We evaluate strength
by the absolute numeric value (magnitude) of the correlation, not the +
or - sign.
For example, a correlation of -.75 is
stronger than a correlation of +.50.
Direction refers to the FORM of the relationship.
With a positive correlation, scores on the second variable rise when scores on the first variable rise.
For example, people
with more years of education usually earn more money in dollars.
People with more
income are also more likely to play state lotteries (yes, they really are,
despite what you may have read.)
If we were to plot out the relationship between number of years of education on the horizontal or "x" axis, and annual income in dollars on the vertical or "y" axis, a line connecting the "x"s would slant UPWARDS and TO THE RIGHT. It would look something like this:
[Figure: line rising from lower left to upper right. Vertical axis: INCOME IN DOLLARS. Horizontal axis: years of education.]
With a negative correlation, scores on the second variable fall when scores on the first variable rise.
A negative correlation is sometimes called an inverse correlation or an inverse relationship. You can use either term (but you need to know what a negative correlation means in statistics).
For example, better educated people smoke
fewer cigarettes.
People with higher incomes lose fewer
impairment days due to illness.
If we were to plot out the relationship between number of years of education on the horizontal or "x" axis, and average number of daily cigarettes smoked on the vertical or "y" axis, a line connecting the "x"s would slant DOWNWARD and TO THE RIGHT. It would look something like this:
[Figure: line falling from upper left to lower right. Vertical axis: AVERAGE NUMBER OF CIGARETTES SMOKED PER DAY. Horizontal axis: years of education.]
TIP:
If both your variables are ordinal, interval or ratio, a coefficient with
direction is preferable (tau-b, tau-c or r). Because you can make
(at a minimum) "more than" or "less than" statements with these variables,
you can make directional statements:
The more the education, the fewer the cigarettes smoked. (Negative or inverse relationship.)
The taller the person, the larger
their weight in pounds. (Positive relationship.)
5. "PRE" PERCENTAGE (PROPORTION) REDUCTION IN ERROR
Finally, it is desirable to choose a correlation coefficient with a "PRE" (proportion/percentage reduction in error, see section 'way below in this page) interpretation if this is possible.
The value of the coefficient shows how much we have reduced our mistakes in prediction on the second variable if we know scores on the first variable. These measures customarily use ratios of mistakes and correct predictions.
One reason we will look at the correlation coefficient "lambda" for nominal variables is because lambda nicely illustrates PRE properties.
We will consider these five points in more detail throughout this guide.
|
|
|
|
The correlation coefficient Phi (φ) is based on the Chi-square statistic.
|
If your computer printout does not include Phi, do not worry, because you can calculate it in a minute or two with your hand calculator. The SDA system does not calculate Phi. SPSS does calculate Phi.
(Step 1) Take the value of Chi-square and divide by the casebase.
The result is PHI-SQUARED.
(Step 2) Then take the square root of Phi-Squared and you now have Phi.
That's it! That's Phi. Ease of calculation is a point in its favor. You can use this calculation of φ for any table where either one of your two variables takes on ONLY TWO values.
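The two steps above are easy to check by machine. Here is a minimal Python sketch. The function name phi_from_chi_square is made up for illustration, and the input values (Chi-square = 5.63 on a casebase of 27) are borrowed from the gender-by-computer example worked later in this guide.

```python
import math

def phi_from_chi_square(chi_square, n):
    """Return Phi given a Chi-square value and the casebase n."""
    phi_squared = chi_square / n      # Step 1: Chi-square divided by the casebase
    return math.sqrt(phi_squared)     # Step 2: square root of Phi-squared

# Values taken from the gender-by-computer table later in this guide.
phi = phi_from_chi_square(5.63, 27)
print(round(phi, 2))
```

For these values Phi comes out at roughly .46.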
If both your variables have MORE than two values, φ calculated this way is no longer confined to the range from 0 to 1 (it can exceed 1), so you will have to use a variation of Phi called:
CRAMER'S V
(named after a statistician named Cramer, what else?)
Cramer's V corrects for this by rescaling, so that the coefficient again runs from 0 to 1 in larger tables, whether square or non-square.
To obtain Cramer's V, instead of dividing Chi-square by n, divide Chi-Square by [ n * (k-1) ]
k is the LESSER of the number of rows or
columns.
For example, if you had 3 rows and 4 columns,
k would equal 3 and (k-1) would equal 2.
So the whole formula for Cramer's V looks like this:
$$V = \sqrt{\frac{\chi^2}{N\,(k-1)}}$$
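As a rough illustration of the Cramer's V calculation just described, the Python sketch below divides Chi-square by [ n * (k-1) ] and takes the square root. The function name cramers_v and the Chi-square and casebase values are hypothetical. Only the 3-row by 4-column layout comes from the example above.

```python
import math

def cramers_v(chi_square, n, rows, cols):
    """Cramer's V: square root of Chi-square / [n * (k - 1)]."""
    k = min(rows, cols)               # k is the LESSER of rows and columns
    return math.sqrt(chi_square / (n * (k - 1)))

# Hypothetical 3-row by 4-column table: Chi-square = 24.0 on n = 200 cases.
v = cramers_v(24.0, 200, rows=3, cols=4)
print(round(v, 3))
```

When one of the variables has only two values, k - 1 equals 1 and the result is the same as Phi.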
Phi-square (φ²) is a PRE measure (see below). However, φ itself is not PRE.
|
|
THE CONCEPT OF PAIRS
The correlation coefficient Tau-Beta (τb) is a coefficient that requires both your variables to be at least ordinal.
Typically, the level of statistical significance for Tau-Beta and its "relatives" (tau-a or tau-c) is tested with a t-distribution. However, not all statistical programs will give you the t-value (SDA does not, SPSS does). In that case, it is usually OK to "drop back" and test statistical significance with the Chi-Square to estimate whether the association in the population is ACCIDENTAL or REAL before proceeding further.
The taus vary from -1 to +1.
All the taus are PRE measures.
These two features make them very widely used in analytic presentations.
Tau-Alpha and Tau-Beta are used for asymmetric relationships where you can designate an independent and a dependent variable, while Tau-c (Tau-Gamma) is used for symmetric relationships.
Tau-a (rarely seen on statistical programs),
Tau-b, and Tau-c all operate off the concept called "pairs" which we will
also demonstrate in class. Once seen, I believe you will be able to easily
keep this concept straight.
In the concept of "pairs," each case is
sequentially paired off with every other case, thus creating a "pair".
You will have:
[ n * ( n -1 ) ] / 2 such unique pairs
For example, if you are studying 10 students in a class, you could uniquely pair them off in ( 10 * 9 ) / 2 = 45 different ways.
Pairs of cases can further be divided into subgroups:
Pairs where both cases have the same independent variable score. An example would be two people in a sample who both have high school degrees for educational level. We call these pairs "tied on X."
Pairs where both cases have the same dependent variable score. An example would be two separate people in a sample who each have ONE child. We call these pairs "tied on Y."
Pairs where both cases have identical independent AND dependent variable scores. An example would be two people in a sample who both have high school degrees for educational level AND who both have one child. We call these pairs either "tied on both" OR "tied on X AND Y."
There are also "agreement" pairs and "disagreement" pairs.
In agreement pairs, case number one has BOTH a higher score on the independent variable AND a higher score on the dependent variable than the second case in the pair.
THINK: Jane has more education and more computers than Bob. (Higher on both education and computers.)
In disagreement pairs, the first person in the pair has a HIGHER SCORE on the independent variable and a LOWER score on the dependent variable than the second person in the pair.
THINK: Jim has more education but smokes fewer cigarettes than John.
These types of relationships are summed up in the chart below:
| TYPE OF PAIR (Person #1 in the pair versus Person #2) | Score on: Variable One (Independent) | Score on: Variable Two (Dependent) |
| "Tied on X" ( PX ) | Same scores | Two different scores |
| "Tied on Y" ( PY ) | Two different scores | Same scores |
| "Tied on both" ( PXY ) | Same scores | Same scores |
| Agreement pairs ( PA ) | Higher than person #2 | Higher than person #2 |
| Agreement pairs | Lower than person #2 | Lower than person #2 |
| Disagreement pairs ( PD ) | Higher than person #2 | Lower than person #2 |
| Disagreement pairs | Lower than person #2 | Higher than person #2 |
Now, we can put all these different pairs
together in proportion to examine Tau-Beta and Gamma.
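To make the pair types concrete, here is a small Python sketch that pairs every case with every other case and sorts each pair into one of the five types in the chart. The data and the function name count_pairs are hypothetical. The sketch only counts pairs and does not compute Tau-Beta or Gamma.

```python
from itertools import combinations

def count_pairs(cases):
    """Sort every unique pair of (x, y) cases into the five pair types."""
    counts = {"agreement": 0, "disagreement": 0,
              "tied_on_x": 0, "tied_on_y": 0, "tied_on_both": 0}
    for (x1, y1), (x2, y2) in combinations(cases, 2):
        if x1 == x2 and y1 == y2:
            counts["tied_on_both"] += 1
        elif x1 == x2:
            counts["tied_on_x"] += 1
        elif y1 == y2:
            counts["tied_on_y"] += 1
        elif (x1 - x2) * (y1 - y2) > 0:
            counts["agreement"] += 1      # higher on both, or lower on both
        else:
            counts["disagreement"] += 1   # higher on one, lower on the other
    return counts

# Hypothetical data: (education level, number of computers) for six people.
cases = [(1, 0), (1, 1), (2, 1), (2, 0), (3, 2), (3, 2)]
counts = count_pairs(cases)
n = len(cases)
total_pairs = n * (n - 1) // 2            # [ n * (n - 1) ] / 2 unique pairs
print(counts, total_pairs)
```

The five counts always add up to [ n * ( n - 1 ) ] / 2, which is a handy check that no pair was missed or counted twice.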
|
Gamma (γ) is a widely used correlation coefficient designed for use when both variables are ordinal, interval, or ratio:
It is a measure used in symmetric relationships.
It varies from -1 to +1.
Gamma is also a PRE correlation coefficient.
So, given these desirable qualities, what's the problem?
Gamma is very popular because it "inflates": it is calculated from the agreement and disagreement pairs only and leaves the tied pairs out, so it gives a larger value than most other ordinal coefficients on the same data.
However, it is VERY common to have tied pairs (especially in large samples), for example pairs of cases that have the same independent variable score and the same dependent variable score (e.g., all the high school graduates who have only one child), so Gamma is a poor measure to use.
|
ALSO CALLED "RHO," "r," or PEARSON'S "r," AND ETA (η)
r (rho [ρ] in the population) is a correlation coefficient for interval or ratio-level data.
It is THE most widely used correlation
coefficient that you see presented in the statistical literature.
Its level of statistical significance is tested with a t-distribution, although again, not all statistical programs will give you the t-value (SDA does not, SPSS does). In that case, it is usually OK to "drop back" and test statistical significance with the Chi-Square* before proceeding further.
Pearson's r varies from -1 to +1 and
is a symmetric correlation coefficient.
r-square (r²) is a PRE measure. However, r itself is not PRE.
r is designated with a capital R on your SDA printout.
*r is a LINEAR correlation.
To the extent your plotted correlation departs from a straight line, the
Chi-square distribution will not work well as a significance test for r.
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$
where √ means to take the square root.
r is a STANDARDIZED MEASURE. It is sometimes called the standardized ratio of the covariation between "x and y" to the variation in the independent and dependent variables. Sometimes r is also called a "standardized covariance" for short.
Fortunately the computer will calculate all these correlation coefficients VERY quickly and accurately, so none of us need to!
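For anyone curious about what the computer is doing, here is a minimal Python sketch of r as a "standardized covariance." It uses the usual textbook definition (the covariation of x and y divided by the square root of the product of the two variations), which is an assumption about the formula intended here. The data and the name pearson_r are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y divided by the square root of
    (variation in x times variation in y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)

# Hypothetical data: years of education and cigarettes smoked per day.
education = [10, 12, 12, 14, 16, 18]
cigarettes = [20, 15, 18, 10, 5, 2]
r = pearson_r(education, cigarettes)
print(round(r, 2))
```

With these made-up numbers r is negative, because cigarettes smoked fall as education rises.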
Eta (η) is a relative of r that we can use when we have a relationship between a nominal INDEPENDENT variable and an interval DEPENDENT variable.
Eta varies from 0 to 1, so it is never negative. Eta is symmetric too.
η² is a PRE measure, but η is not.
If you decide to use a difference of means
test for statistical significance, eta is the most sensible associated
correlation coefficient to use.
|
|
FIRST, examine the measurement level of your data. Match your correlation coefficient to the LOWEST level of measurement in your variables. AVOID using an ordinal measure, such as the taus (τ), or an interval measure, such as r, if you have a nominal variable.
HOWEVER, if the independent variable is nominal and the dependent variable is interval, strongly consider a difference of means test, and use the correlation coefficient eta (η). This is especially true if the interval dependent variable has a lot of different values.
SECOND, what kind of statement can you make about the causal connection between your variables? Can you clearly designate an independent variable and a dependent variable? If so, you have an asymmetric relationship and may want to consider an asymmetric correlation coefficient for nominal or ordinal data (we must use different methods for interval data, to be discussed in a later guide).
If you cannot clearly designate cause and effect, you have a symmetric relationship, and need to use a symmetric coefficient, such as phi, tau-c, or r.
|
|
The following chart can help you describe the numeric value of a correlation coefficient in words. This is a somewhat arbitrary chart which I copied from social scientist Gene Lutz. However, if we can all agree that the numbers are reasonable, it provides a place to begin.
The numbers presented below are ABSOLUTE VALUES, that is, we have disregarded whether the sign of the correlation is positive or negative, and simply presented the number.
PLEASE MEMORIZE
THIS CHART!
You will use it
on exams and we will refer to it frequently.
| CORRELATION ABSOLUTE VALUE SIZE |
IMPORTANT: A complete description of a numeric correlation includes:
|
|
You may have decided to use a t-test or an F-test to look at the differences in mean scores across the values of a single independent variable. These values typically form what are called "groups" in differences of means statistical tests.
You can assess substantive significance or effect size with the correlation coefficient eta.
You can also look at the difference between
means.
We compare the differences between means
across different values of the independent variable with respect to
the standard deviation of the difference between means.
This is NOT the standard deviation of
group one or group two.
Instead it is the standard deviation
OF THE DIFFERENCE between group one and group two.
This means looking at the denominator of the t-distribution, or:
[Formula image not recovered: the standard deviation of the difference between means, given as the square root of an expression in n1, n2, s1² and s2².]
where n1 and n2 are the casebases of group 1 and group 2 respectively and s1² and s2² are the variances around the means of group 1 and group 2. So this is a weighted kind of standard deviation.
We can now take the absolute difference between the mean scores of group 1 and group 2 ( X̄1 - X̄2 ) and divide that absolute difference by the standard deviation of the differences.
The result will be in standard deviation of the differences units.
We can decide a priori how big a difference between means we decide to call "big." One standard deviation of the differences? One-half standard deviation of the differences (this one is usually the minimum for "effect size")?
Why do we go to all this trouble? Because with very large samples, very small or trivial effect sizes will be statistically significant. This way, we warn our reader not to expect too much if the ES or effect size is small.
|
MORE ON RELATIONSHIP FORM |
In a monotonic relationship, as the first variable increases, a second variable rises or falls at an irregular rate. But a positive relationship would not turn negative.
For example, personal income might increase a lot when education rises from 11th to 12th grade but rise only a little when education increases from 14 to 15 years. Education in this example always positively affects income, but how much income rises may depend on whether educational level is low or high. As you can see, this is an "ordinal type" of relationship, a "more or less than" type of relationship (even if you have interval or ratio variables).
Tau-b is an ordinal correlation that detects monotonicity.
If you have several values of the independent
variable, you might get a small change of direction by chance, i.e., the
relationship is "roughly" or "approximately" monotonic.
Linear relationships resemble a straight line. As one variable rises, a second variable rises or falls by a constant amount.
That "by a constant amount" part is critical! For example, for each inch increase in height, someone would average 8 pounds heavier. The 8 pound rise would occur whether we compared people 5'5'' and 5'6" tall or people 5'10" and 5'11".
Linear relationships allow more precise prediction than monotonic relationships.
Because of this precision, this is more of an "interval-ratio" kind of relationship because the line rises or falls at a constant rate.
Pearson's r is an interval-level correlation that is designed to measure linear relationships.
In a curvilinear or non-linear association, the relationship between two variables is irregular in both slope and direction. The graph of the relationship might "bend" or even resemble a "U".
Phi (Cramer's V) is a correlation coefficient that works well with nonlinear relationships even if the variables are ordinal or interval. Eta also works well if the dependent variable is interval-ratio.
EXAMPLE: Among women, the odds of marriage first increase with higher education, peak at around a B.A. college degree level, then drop somewhat at the advanced degree level. Use Chi-square and φ here.
To "eyeball" whether you have a nonlinear, linear or monotonic relationship, here is what you can check:
Do column percents in a table so that you can compare across values of the independent variable (e.g., what percent of people with a high school degree like rock music compared with college graduates?).
Graph the percent with the highest (OR lowest) value of the dependent variable (e.g., "like" rock music).
If the graph rises or falls by a constant amount for those liking rock with each advance in degree level, it is linear.
If there is an irregular rate of increase or decrease, the relationship is monotonic.
If liking for rock music "jumps around" with no clear increase or decrease, the relationship is non-linear or curvilinear.
|
|
You can do an analogous "eyeball" examining mean differences across groups--IF AND ONLY IF you can rank the groups from lowest to highest, or least to most, as I do in the tabular example below.
EXAMPLE
Let's look at an SDA example with (collapsed) level of education as the independent variable and mean number of household computers as the dependent variable. These data are from the year 2000 U.S. Census Bureau's "Falling through the Net" series about the "Digital Divide."
TITLE: NUMBER OF COMPUTERS IN HOUSEHOLD BY DEGREE LEVEL
| Educational Level | Less than a High School Degree | High School Degree | Some College | College Grad | Advanced Degree |
| Mean # Computers | 0.47 | 0.59 | 0.91 | 1.13 | 1.28 |
Source: Current Population Survey supplement
on computer and internet use, August 2000.
Valid n = 94,821
MD = 40,165
It is clear that there is a POSITIVE relationship between the level of education and the number of computers in the household. As educational level rises, so, too, does the number of household computers. But the number rises at an IRREGULAR rate.
It rises by 0.12 from Less than High School to High School Degree (0.59 - 0.47 = 0.12)
It rises by 0.32 for High School Degree to Some College (0.91 - 0.59 = 0.32 )
It rises by 0.22 for Some College to College Grad (1.13 - 0.91 = 0.22)
And it rises by 0.15 for College Grad to Advanced Degree (1.28 - 1.13 = 0.15)
This is a MONOTONIC relationship because the rate of increase is IRREGULAR.
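The same "eyeball" check can be sketched in a few lines of Python. The function describe_form and its tolerance setting are hypothetical, and the rule is only a rough one. It labels any change of direction (or any flat step) as non-linear, even though a small reversal could occur by chance. The five means are the ones used in the example above.

```python
def describe_form(values, tolerance=0.005):
    """Label a series of percents or means as linear, monotonic, or non-linear."""
    steps = [round(b - a, 2) for a, b in zip(values, values[1:])]
    all_rise = all(step > 0 for step in steps)
    all_fall = all(step < 0 for step in steps)
    if not (all_rise or all_fall):
        return steps, "non-linear"        # direction changes somewhere
    if max(steps) - min(steps) <= tolerance:
        return steps, "linear"            # constant rate of change
    return steps, "monotonic"             # same direction, irregular rate

# Mean number of household computers by degree level, from the example above.
means = [0.47, 0.59, 0.91, 1.13, 1.28]
steps, form = describe_form(means)
print(steps, form)
```

The steps come out as 0.12, 0.32, 0.22 and 0.15: always upward, but never by the same amount, so the sketch reports a monotonic relationship.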
|
AND THE LAMBDA EXAMPLE |
Correlation coefficients with a PRE interpretation tell us the percentage by which we reduce our errors in predicting the dependent variable when we also know the scores on a second (independent) variable.
PRE correlation coefficients are ratio measures (they are NOT ratio variables ) that take this form (E stands for "error"):
[ E1 - E2 ] ÷ E1
This means how many errors or mistakes we make in predicting the dependent variable WITHOUT knowing scores on the independent variable (E1), compared with the amount of errors or mistakes we make in predicting the dependent variable if we DO know scores on the independent variable (E2).
An example with the correlation coefficient lambda will be worked for you below in this section.
Suppose we want to predict someone's weight. With one variable, our best [interval-level] guess is the mean weight of the sample or population that we obtained.
In this case, our "error rate" is the standard deviation, the average distance a score is from the mean.
Could we predict someone's weight score more precisely if we knew what their height is?
r² is a correlation for two interval variables that helps to answer this question. If r² = .25, this means that we have reduced the approximate equivalent of the variance in weight by 25 percent by knowing what someone's height is.
PRE interpretation is a desirable characteristic in a correlation coefficient. The following are all PRE correlation coefficients:
r²
φ² and η²
τb and τc
(So is γ, but γ is a bad, inflated correlation coefficient!)
The nominal and asymmetric correlation
coefficient lambda is unfortunately not calculated by all statistical
programs. However, it DOES provide a simple illustration of the PRE concept.
Suppose we have the following bivariate
table using a sample from a class of undergraduate educational psychology
students:
|
|
| OWNS HOME COMPUTER | MALE | FEMALE | ROW TOTALS |
| YES | 7 | 4 | 11 |
| NO | 3 | 13 | 16 |
| COLUMN TOTALS | 10 | 17 | 27 |
If we calculated the percentages, 70 percent of the men but only 24 percent of the women currently own computers. It LOOKS like we have an association, but, of course, this is only a sample and we want to know about the larger population. Gender is nominal and owns home computer is ordinal, so we want a nominal correlation coefficient. We also have an asymmetric relationship (it is unlikely that buying a computer would cause a sex change.) We can't use eta here because owns computer is ordinal not interval (someone could own 6 or 7 computers and answer "yes" or own 1 computer and answer "yes".)
REVIEW (IS THERE A RELATIONSHIP?)
Calculated, the Pearson Chi Square =
5.63.
Its probability level with one degree
of freedom = .018
Thus, if there were no relationship in the population between gender and computer ownership, we would expect this kind of sex difference just by chance in only about 2 in 100 samples.
Of course, our apparent relationship in the sample COULD be an accident and we have one of those two nonrepresentative samples out of 100. It COULD happen (rarely).
However, we will probably reject the null hypothesis of no sex difference, given these sample results, and conclude male "ed psych" undergraduates more often own a computer than women do.
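The Chi-square value and its probability level can be reproduced from the table with the short Python sketch below. The function name pearson_chi_square is made up for illustration. The probability line uses a shortcut that holds only for ONE degree of freedom, as in this 2 x 2 table.

```python
import math

def pearson_chi_square(table):
    """Pearson Chi-square for a table of observed counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return chi_square

# Rows: owns computer YES / NO; columns: MALE / FEMALE.
observed = [[7, 4],
            [3, 13]]
chi_square = pearson_chi_square(observed)
# With ONE degree of freedom, the upper-tail probability of Chi-square
# equals erfc(sqrt(chi_square / 2)); this shortcut does not apply to other df.
p_value = math.erfc(math.sqrt(chi_square / 2))
print(round(chi_square, 2), round(p_value, 3))
```

This prints 5.63 and 0.018, matching the values reported above.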
HOW DOES LAMBDA WORK?
Given there appears to be a relationship, how strong is it?
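Lambda compares differences from the mode for the entire sample with differences from the mode within each category of the independent variable (here, men and women):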
The mode for the entire sample is NO
HOME COMPUTER. 16 out of 27 cases have this mode.
However, 11 cases DO have a home computer
so they are different from the mode.
So if we were trying to guess computer
ownership for the total sample, we would use the mode (nominal data) but
we would make 11 mistakes or errors.
Error number one (total sample, univariate) = "E1" = 11
For men, the mode is HAVE HOME COMPUTER.
7 men do. However, 3 men deviate from the male mode and do not have
a home computer.
The number of prediction mistakes on "have
computer" for men = 3
For women, the mode is NO HOME COMPUTER.
13 women do not. However, 4 women deviate from the female mode and
DO have a computer.
The number of prediction mistakes on "have
computer" for women = 4
So, the total for "E2" or the mistakes that we make predicting computer ownership IF WE KNOW THE STUDENT'S GENDER is a total of 7: 3 mistakes for men plus the 4 mistakes for women.
Lambda becomes = (11 - 7) ÷ 11 = 4 ÷ 11 = .36
Instead of making 11 mistakes in predicting computer ownership, we make only 7 if we also know each student's gender.
This is a .36 or a .36 X 100 = 36 percent improvement in our prediction of computer ownership.
Putting it in a slightly different way, we have a percent reduction in error (PRE) of 36 percent (using the coefficient λ) in guessing a student's computer status if we also know their gender.
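The lambda arithmetic above can be written as a short Python sketch. The function name lambda_coefficient is hypothetical. It assumes the rows of the table are the categories of the dependent variable and the columns are the categories of the independent variable, as in the computer-ownership table.

```python
def lambda_coefficient(table):
    """Lambda for a table whose rows are the dependent variable's categories
    and whose columns are the independent variable's categories."""
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    e1 = n - max(row_totals)              # mistakes when guessing the overall mode
    columns = list(zip(*table))
    # Mistakes when guessing the mode separately within each column (group).
    e2 = sum(sum(col) - max(col) for col in columns)
    return e1, e2, (e1 - e2) / e1

# Rows: owns computer YES / NO; columns: MALE / FEMALE.
observed = [[7, 4],
            [3, 13]]
e1, e2, lam = lambda_coefficient(observed)
print(e1, e2, round(lam, 2))
```

This prints 11, 7 and 0.36, the same E1, E2 and lambda worked out by hand above.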
|
|
This is the third question we ask about an association.
First, we must establish that the relationship is REAL, and of at least moderate strength so we know it is not trivial. We then ask whether the two variables have a true causal association or whether a third variable may have caused the original two variables (a spurious relationship). To evaluate the causal status of a bivariate relationship, we need to introduce a third variable. This will be the topic of Guide 6.
However, even with two variables, we need to assess whether we can designate an independent variable (cause) and a dependent variable (effect). If we can designate an independent variable, the correlation is asymmetric. If we cannot designate an independent variable, it is symmetric.
This is relatively easy with experiments. In creating an intervention, you also created the causal variable. For example, if you assigned cigarette smokers to a nicotine patch group versus a placebo patch group, nicotine patch (yes or no) becomes the independent variable.
However, in many cases, we cannot manipulate or intervene with variables, such as sex or ethnicity.
In other cases, we are dealing with naturally observed variables, which is often the case with surveys, ethnographies, content analysis, and many other methods. So we need some guidelines to establish plausible causal order among variables in non-experimental studies.
Statistics, by the way, may be used to DISCONFIRM a postulated causal order, but NEVER, NEVER, NEVER establish causal order.
|
|
Concepts of causality are critical: they tell us what is possible, what can be changed and what is difficult, if not impossible, to change. For example, if you are convinced that biological factors cannot be overcome, you probably will not believe that visually impaired children can compensate for their disabilities. Causality tells us what are the “prime movers” of the phenomena that we observe.
Much of the research process centers around explaining what are the true causal or “independent variables.”
According to science rules, definitive proof via empirical testing does not exist. Science uses the term "proof" (or, rather, "disproof") differently from the way attorneys or journalists do. For example, a correlation could have many causes, only some of which have been identified. Later work can show earlier causes to be spurious, that is, both cause and effect depend on some prior causal (often extraneous) variable (see the charts on ice cream consumption and fire engines).
CAUSALITY IN NON-EXPERIMENTAL DATA: AN EXAMPLE
Cancerous Human Lung
This dissection of human lung tissue shows light-colored cancerous tissue in the center of the photograph. While normal lung tissue is light pink in color, the tissue surrounding the cancer is black and airless, the result of a tarlike residue left by cigarette smoke. Lung cancer accounts for the largest percentage of cancer deaths in the United States, and cigarette smoking is directly responsible for the majority of these cases. "Cancerous Human Lung," Microsoft(R) Encarta(R) 96 Encyclopedia. (c) 1993-1995 Microsoft Corporation. All rights reserved.
Most people, and most scientists, accept that smoking cigarettes causes lung cancer although the evidence (for humans) is strictly correlational rather than experimental. There are many topics where it is neither possible nor desirable to use the experimental method. To become more comfortable accepting correlational evidence, it will help to examine the rules below. (SCL)
Many scientists believe that the ONLY way to establish causality is through randomized experiments. However, a moment’s reflection will convince you that this cannot be so. Most people now accept that smoking cigarettes causes lung cancer (see the Encarta selection above), yet no society has ever randomly assigned half its population to smoke cigarettes and the other half not. This causal conclusion about smoking and lung cancer is based on correlational evidence, i.e., observing the systematic covariation of two (or more) variables in a research study, which is exactly what we do when we examine the association between two variables. Cigarette smoking and lung cancer are both "naturalistic" variables, i.e., we must accept the data as nature gave them to us.
There is no doubt that the results from careful, well-controlled experiments are typically easier to interpret in causal terms than results from other methods. However, as you can see, causal inferences are often drawn from correlational studies as well. Non-experimental methods must use a variety of ways to establish causality and ultimately must use statistical control, rather than experimental control.
If one variable causes a second variable, the two should correlate; thus, causation implies correlation.
However, two variables can be associated without having a causal relationship, for example, because a third variable is the true cause of both the "original" independent and dependent variables. For example, recall that there is a statistical correlation over months of the year between ice cream consumption and the number of assaults. Does this mean ice cream manufacturers are responsible for crime? No! The correlation occurs statistically because the hot temperatures of summer cause both ice cream consumption and assaults to increase. Thus, correlation does NOT imply causation. Factors other than cause and effect can produce an observed correlation.
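The ice cream example can be made concrete with a small simulation. The sketch below is only an illustration under invented assumptions: the temperatures, slopes, noise levels, and the function names `pearson_r` and `partial_r` are all hypothetical, not data or code from this course. Temperature drives both ice cream consumption and assaults, and neither of those two affects the other. The raw correlation comes out sizable, yet it nearly vanishes once temperature is statistically controlled with a first-order partial correlation.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y with z held constant (first-order partial)."""
    rxy = pearson_r(x, y)
    rxz = pearson_r(x, z)
    ryz = pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical data: 2000 made-up observations, fixed seed for repeatability.
rng = random.Random(2004)
temperature = [rng.uniform(30, 95) for _ in range(2000)]
# Each outcome depends on temperature plus its own random noise, and on nothing else.
ice_cream = [0.5 * t + rng.gauss(0, 8) for t in temperature]
assaults = [0.3 * t + rng.gauss(0, 6) for t in temperature]

r_raw = pearson_r(ice_cream, assaults)
r_controlled = partial_r(ice_cream, assaults, temperature)

print("Ice cream vs. assaults, raw correlation:", round(r_raw, 3))
print("Same pair, controlling for temperature: ", round(r_controlled, 3))
```

The partial correlation is one simple form of the statistical control that non-experimental methods rely on in place of experimental control.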
If one variable causes a second, the cause is the independent variable (also called an explanatory variable or predictor).
The effect is called the dependent variable (sometimes it is called the criterion variable).
If you can designate a distinct cause and effect, the relationship is called asymmetric.
Two variables may be associated but we may be unable to designate cause and effect. These are symmetric relationships.
Since we know that we cannot use experimental treatments with naturalistic variables to determine cause and effect, yet we know that scientists do draw causal conclusions in nonexperimental studies, here is a set of helpful rules for tentatively establishing causality in correlational data.
GUIDE (1) TIME ORDER. The independent variable came first in time, prior to the dependent variable.
EXAMPLE: Gender and race are fixed at birth.
GUIDE (2) EASE OF CHANGE. The independent variable is harder to change. The dependent variable is easier to change.
EXAMPLE: One's gender is definitely harder to change than scores on an assessment test or years of school. One's chronological age is not usually changed by attitudes, values, education, or much of anything else.
GUIDE (3) "MAJORITY RULE." The independent variable is the cause for most people.
EXAMPLES:
Although some people become so fed up with their jobs that they return to school to train for a better job, most people complete their education prior to obtaining a regular year-round, full-time job.
Most people marry prior to having children (although some people have their children first, then marry as a result).
GUIDE (4) NECESSARY OR SUFFICIENT. If one variable is a necessary or sufficient condition for the other variable to occur, or a prerequisite for the second variable, then the first variable is the cause or the independent variable.
EXAMPLES:
A certain type of college degree is often required for certain jobs.
At most universities, publications are a prerequisite for being awarded tenure.
If you can come up with the money, you almost certainly can purchase a meal.
GUIDE (5) GENERAL TO SPECIFIC. If two variables are on the same overall topic and one variable is quite general and the other is more specific, the general variable is usually the cause.
EXAMPLE: Overall ethnic intolerance influences attitudes toward Hispanics.
GUIDE (6) THE "GIGGLE" OR "SANITY" FACTOR. If reversing the causal order of the two variables seems illogical and makes you laugh, reverse the causal order back.
EXAMPLES:
We don't believe choosing a specific college major or engaging in a particular sport determines one's gender.
1. Only experiments can be used to make causal statements.
ANSWER: Use the guidelines above to investigate causality in non-experimental data. Ninety-six percent of Americans believe that smoking causes lung cancer, but the data are NOT experimental (with the exception of a few poor rats).
2. Well, yes, but the smoking data is based on epidemiological studies with thousands of cases.
ANSWER: The number of cases has NOTHING TO DO with causality. Having a large database means your estimates are relatively stable, they have low sampling error. But low sampling variability has nothing to do with causality. Examine the six guidelines above.
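A large sample buys stability, not causal order. The following is a minimal sketch with made-up numbers: a hypothetical population correlation of 0.3, hypothetical sample sizes of 50 and 1000, and helper names (`pearson_r`, `sample_r`) invented for the illustration. It draws many samples of each size and measures how much the sample correlation bounces around from sample to sample. The spread shrinks sharply with more cases, but nothing in the calculation says which variable is the cause.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_r(rng, n, rho):
    """Draw n cases from a population with correlation rho; return the sample r."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # Mix x with independent noise so the population correlation equals rho.
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return pearson_r(xs, ys)


rng = random.Random(5)  # fixed seed so the run is repeatable
spread = {}
for n in (50, 1000):
    sample_rs = [sample_r(rng, n, 0.3) for _ in range(200)]
    # Standard deviation of r across repeated samples = sampling variability.
    spread[n] = statistics.stdev(sample_rs)

for n, sd in spread.items():
    print("n =", n, " spread of sample r across 200 samples:", round(sd, 3))
```

Swapping the roles of x and y in `pearson_r` gives exactly the same coefficient, which is another way to see that sample size and the correlation itself are silent about causal order.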
3. Nominal variables can serve as causal variables but numeric variables (interval or ratio) cannot be independent variables.
ANSWER: The level of measurement has NOTHING TO DO with causality. You may be able to do more arithmetically complex statistics with numeric data but that gives it no special status. To believe a statement such as number 3 means that you believe that gender (nominal) can have causal status but variables such as age or years of education (ratio) cannot have causal status. A moment's reflection will show you how silly that is. For example, years of education is one of the most powerful predictors of how people live their lives. Do you honestly believe that education has no effect on the occupation you enter, the salary you earn, the health practices you use, or television programs you watch?
Use one of the guidelines above to determine causal status, not the measurement level of your variables.
Remember: we must address question 1 before examining the type of correlation coefficient to use or possible causal issues. This is question 1: is there a relationship at all?
Our null hypothesis is that there is NO relationship, such as:
Ho: χ² = 0
OR, for whichever measure of association is being tested,
Ho: (measure of association) = 0
There are two basic mistakes we can make with our assessment of the null hypothesis:
FIRST, we can reject the null hypothesis when it is, in fact, true.
SECOND, we can accept the null hypothesis when it is, in fact, false.
The first type of error means that we believe our results are real when the population association is truly zero, and we mistook a sample illusion for a real relationship.
In the second type of error, we accept a zero relationship between two variables when, in fact, some relationship exists.
TYPE I ERROR: Reject true Ho
TYPE II ERROR: Accept false Ho
When we set an α (alpha) level, we are typically deciding the MAXIMUM acceptable probability of a type one error, such as .05 or .01.
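What a maximum type one error rate of .05 means can be checked by simulation. The sketch below is an illustration under stated assumptions, not the test procedure used in this course: the sample size, number of trials, and function names are invented, and the significance test is the Fisher z approximation for a correlation, chosen only because it needs nothing beyond the Python standard library. The two variables are generated independently, so the null hypothesis is true by construction and every rejection is a type one error.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rejects_null(x, y, alpha):
    """Two-sided test of Ho: correlation = 0, using the Fisher z approximation."""
    n = len(x)
    z = math.atanh(pearson_r(x, y)) * math.sqrt(n - 3)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = .05
    return abs(z) > critical


rng = random.Random(21)  # fixed seed so the run is repeatable
alpha = 0.05
trials = 2000            # hypothetical number of repeated studies
n = 100                  # hypothetical sample size per study

false_alarms = 0
for _ in range(trials):
    # x and y are unrelated, so Ho is true in every simulated study.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    if rejects_null(x, y, alpha):
        false_alarms += 1

type_one_rate = false_alarms / trials
print("Share of true null hypotheses rejected:", type_one_rate)
```

The share of false alarms should land close to the chosen level; changing `alpha` to 0.01 should lower it accordingly.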
Susan Carol Losh, June 21, 2004.