Thank you to Karen Hutzel, survivor of Summer 2003 for providing picture files for some of the most widely used course symbols!
|
SUMMER 2004 DR SUSAN CAROL LOSH |
SUMMER 2003 DR SUSAN CAROL
LOSH
High speed computers have revolutionized data analysis. They accomplish in seconds what it took hours--even weeks--to calculate by hand. BUT most of the time computers are basically robots. They do what they are told--EXACTLY as they are told, even if the results do not make sense. GIGO is short for "garbage in, garbage out", indicating that your results are only as good as your input.
A case in point is how statistical programs such as the SDA system or SPSS (Statistical Package for the Social Sciences) use numbers. Often, the information management specialist who built the computer file for the data assigned numbers to all categories of each variable, even nominal data such as gender. This is for data-processing ease, especially speed, and this does NOT mean that you really have a numeric variable because the "computer said so".
As long as the categories of a variable are stored in the computer as numbers, a computer will calculate numeric means on nominal or ordinal data or give you a "number" for the range on a truly ordinal variable. YOU must decide which kind of data you have and which extenuating circumstances dictate the selection of the proper statistic.
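Here is a minimal Python sketch of that trap. The codes are hypothetical (1 = male, 2 = female is an assumed codebook, not taken from any particular data file). The computer will happily average them, but only the mode is meaningful:

```python
from statistics import mean, mode

# Hypothetical codebook: 1 = male, 2 = female (nominal labels, not quantities).
gender_codes = [1, 2, 2, 1, 2, 2, 1, 2]

# The computer will average the codes, but the result is not a gender.
meaningless_mean = mean(gender_codes)

# The mode only counts cases per category, so it is valid for nominal data.
most_common_code = mode(gender_codes)

print(meaningless_mean, most_common_code)
```

The program runs without complaint in both cases; deciding which result makes sense is up to you.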
Each year, issues of data analysis and computers become more sophisticated.
There are now dozens of online databases and archives that are posted on the Internet. The General Social Survey data that we are working with now is one example of such a database.
OPTIONAL: To see many more online databases as well as some considerations when you use such an archive, click HERE and follow the links.
In the last few years, the Internet has truly become an interactive partner in data analysis. Relatively simple statistical programs, such as the Berkeley SDA system (currently the most common in use), have been developed so that online databases can be analyzed online, instead of downloading data files to your home computer (although SDA can do that too). As you know, SDA is an incredibly fast statistical program that can "tear through" thousands of cases in a couple of seconds.
Computer hardware describes the physical computer equipment. Examples are CPUs (central processing units), RAM (random access memory), modems to connect you to telephone sources, monitors, scanners, CD-"burners," or printers.
Software refers to the actual programs or written commands to computers. There are many kinds of software. Some software are systematic coordinators such as Windows. They contain general commands that allow you to copy data or access programs. (Viruses are programs, too.)
Then there are specialized programs, such as word processors (e.g., Word Perfect or Word), spread sheet programs (e.g., Lotus or EXCEL), learning tools (e.g., Treasure Mountain), card games, Internet connectors and browsers (e.g., Netscape or Explorer) and statistical packages.
Statistical packages such as SDA or SPSS are "bundles" of programs that execute various statistical estimates, such as univariate frequency displays, measures of central tendency and variation, or crosstabulations. Most statistical packages can transform variable categories, create new variables from old ones (e.g., an added index), assign missing data codes, and create tables and charts.
Try as they do, the "thinking" that computer programs do is currently only as good as the human brain that wrote the program. The computer does not know the level of measurement of your data or the nuances of missing cases. Computers are very literal, and they do not do what you meant, only what you told them to do.
Only you can make the selection of the appropriate statistic to use. What the computer WILL do is spare you the drudgery of calculations, or working with n-dimensional matrices. In fact, the computer has made routine the use of more complex analytic techniques, such as logistic regression, where the mathematics had been known for decades, but the calculations were staggeringly cumbersome.
|
|
The problem that motivates us to use measures of central tendency is the same problem that motivates us to construct a table. You (or an archive) have a lot of data. You want to present the information accurately, but succinctly, and with the least amount of detail necessary. In other words, you want to reduce and describe the data.
One way to do so, of course, is to construct a frequency or percentage distribution table. As you know, when a variable has several categories, it is difficult to quickly summarize the meaning of all the data. Remember the General Social Survey computer output on "year of birth?" There were several dozen categories. Yet, even when you condense the data down to four or five categories, this can still be cumbersome when you want to describe the data to someone else.
Furthermore, researchers often work with several variables at a time in a study. This could ultimately lead to a gigantic series of tabular displays, one for each variable, which are lengthy and complicated to read. Ouch!
Ideally, many statisticians and data analysts like to use a SINGLE CATEGORY, or, if our data are actually numeric, a single number to summarize the "average" or the "typical" category or score for an entire distribution. A single category score is easy to remember and, if it is "average" or "typical," it is easy to grasp.
We often summarize a univariate distribution with a measure of central tendency. Sometimes this is called a measure of "central location." This is a single category or "most typical" score from the distribution of your variable.
The measures of central tendency that we will study are modes, medians, and means.
You must make some decisions before you can apply these techniques. For example, you must decide the level of measurement, whether each variable you wish to describe is nominal, ordinal or interval/ratio. You should also examine the shape of the distribution. If your scores are numeric, you need to know whether you have a few extremely large (positive skew) or extremely small (negative skew) scores.
NOTE: At this point, you may have some reservations about describing an "average score." What if scores are "heaped at the extremes"? For example, you might look at the quiz scores from a recent quiz for a course you are teaching. You immediately notice that students tended to do very well or very poorly on the quiz, as shown in my example below:
| Student | Quiz Score |
| Debby A. | 3 |
| Sam B. | 1 |
| Jack C. | 8 |
| Anne D. | 8 |
| Tanisha E. | 9 |
| Sari F. | 2 |
| Zack G. | 10 |
| Juan H. | 3 |
| Ken I. | 9 |
| Janet J. | 2 |
How can one category score or one number hope to do justice to such a distribution?
This is one example of looking at "the shape" of the distribution. And, as we will shortly see, there are ways to augment our measures of central tendency. Later, we will capture some of the diversity in the distribution with a measure of dispersion or variation.
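To see the problem in numbers, here is a small Python sketch using the ten quiz scores listed above. The variable names are my own; the only assumption is that the scores are treated as numeric:

```python
from statistics import mean, median

# The ten quiz scores from the example above.
quiz_scores = [3, 1, 8, 8, 9, 2, 10, 3, 9, 2]

quiz_mean = mean(quiz_scores)
quiz_median = median(quiz_scores)

# Which students actually scored within one point of the mean?
near_mean = [score for score in quiz_scores if abs(score - quiz_mean) <= 1]

print(quiz_mean, quiz_median, near_mean)
```

Both the mean and the median come out to 5.5, yet not a single student scored anywhere near 5.5. The "typical" score describes no one in this class.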
|
|
The mode is the category that contains the largest frequency or the greatest number of scores. In other words, the mode is the score that occurs the most often in your variable of interest.
If your variable is a nominal measure, the mode is the only measure of central tendency that you can use.
(But remember, all is not lost: you still can do percents, rates, ratios and compare groups on a nominal variable.)
Look at the table below to locate the mode. You can use either the frequencies per category or the category percentage to do so.
Number of Computers per United States Household
| How many computers or laptops are there in this household? | Number of Cases | Percent of Total Cases |
| No computer | 50646 | 41.6% |
| 1 MODE--largest # of cases | 50710 | 41.7 |
| 2 | 14075 | 11.5 |
| 3 or more | 6314 | 5.2 |
| Total | 121,745 | 100.0% |
Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
Missing data = 13241
Some data have so many modes ("multi" modal)
that the concept becomes meaningless. On the other hand, you might want
to use the mode when you have an ordinal variable that has only a few (or
even two) categories. All that is required for the mode is to be able to
tally the number of cases in each category. It does not matter if the categories
are ordered or numeric.
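Because the mode only needs a tally, finding it takes one line. Here is a short Python sketch using the frequencies from the computers-per-household table (the dictionary layout is just one convenient way to hold the table):

```python
# Frequencies from the computers-per-household table above.
household_counts = {
    "No computer": 50646,
    "1": 50710,
    "2": 14075,
    "3 or more": 6314,
}

# The mode is the category with the largest frequency.
modal_category = max(household_counts, key=household_counts.get)

print(modal_category)
```

Notice how narrowly "1" beats "No computer" (50710 versus 50646 cases). When two categories are this close, reporting the mode alone hides a lot.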
|
|
The median is a measure of central tendency that can be used with ordinal, interval, or ratio data.
In a set of cases in which the cases have been rank-ordered from highest to lowest (or lowest to highest), the median is the middle score.
Another way of viewing the median is that the median is the 50th percentile.
YOU MUST RANK ORDER THE CASES FIRST, or the median will be nonsense.
In a set of cases (for example, the order of finishers in a race), you will need the rank order of each case.
EXAMPLE: in a small footrace, we had (in score position:)
1, 2, 3, 4, 5, 6, 7 The middle score is the "4th position" with three scores above it, and three below.
EXAMPLE: Here are the grade point averages of the top seven students at Lion High School ranked from the highest down to the lowest:
4.00 3.80 3.70 3.69 3.68 3.60 3.60 The value of the score in the "4th position" = the median = 3.69
In general, with an odd-number set of ranked cases, the median is the [(n+1)/2]th case. In my example above, it would be (7+1)/2 = 8/2 = the 4th score position.
What if you have an even number of cases? Then, you will have two middle scores, and you will take the arithmetic average of those two.
EXAMPLE: 1, 2, 3, 4, 5, 6, 7, 8 The two middle scores are "4th position" and "5th position" and their average in this example is 4.5
EXAMPLE: Adding one lower GPA to the seven grade point averages above gives:
4.00 3.80 3.70 3.69 3.68 3.60 3.60 3.20 Our two middle scores are 3.69 and 3.68.
Their average or the median is 3.685.
The second example illustrates a nice point about the median. The median is less affected than numeric averages by extremely high or low scores. Although the student with the 3.20 average was substantially below the others, adding this lower score only caused the median to drop from 3.69 (with seven students) to 3.685 (with eight students).
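The odd and even rules can be sketched in a few lines of Python. The function name `median_of` is my own, and the GPAs are the Lion High School examples above:

```python
from statistics import mean

def median_of(scores):
    """Return the middle score (odd n) or the average of the two middle scores (even n)."""
    ranked = sorted(scores)          # RANK ORDER THE CASES FIRST
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:                   # odd n: the [(n+1)/2]th case
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2   # even n: average the two middle scores

gpas_7 = [4.00, 3.80, 3.70, 3.69, 3.68, 3.60, 3.60]
gpas_8 = gpas_7 + [3.20]

median_7 = median_of(gpas_7)
median_8 = median_of(gpas_8)

# Compare how far each measure drops when the 3.20 student is added.
median_drop = median_7 - median_8
mean_drop = mean(gpas_7) - mean(gpas_8)

print(median_7, median_8, round(median_drop, 3), round(mean_drop, 3))
```

The median drops by only .005 while the arithmetic mean drops by roughly .066, more than ten times as much.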
The United States federal government typically reports income by median rather than numeric averages (e.g., median income per educational attainment category). That is because income is "skewed," that is, a few extremely high scores (Bill Gates, maybe?) raise arithmetic averages way, way up. But the median will hardly change at all.
Another way to look at the median is that it is the 50th percentile. If you have become comfortable with cumulative percentages, the category containing the median is the lowest-ranked category where the cumulative percentage jumps to over 50 percent. This is an easy way to find the median value in data that are presented in tabular form, as is often the case for data presented in journals or the mass media. Let's stay with our CPS data about computers in the home:
In the category "1 computer in the household" the cumulative percent jumps from 41.6 to over 50.0 percent (in fact, it jumps up to 83.3 percent).
Therefore, the median category score is "1 computer per household".
Number of Computers per United States Household
| How many computers or laptops are there in this household? | Number of Cases | Percent of Total Cases | Cumulative - down |
| No computer | 50646 | 41.6% | 41.6 |
| 1 MEDIAN--50th percentile | 50710 | 41.7 | 83.3 |
| 2 | 14075 | 11.5 | 94.8 |
| 3 or more | 6314 | 5.2 | 100.0 |
| Total | 121,745 | 100.0% | |
Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
Missing data = 13241
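The cumulative-percentage shortcut can be sketched in Python as follows. The categories must be listed in rank order, and the variable names are my own:

```python
# Categories in rank order, with frequencies from the table above.
categories = ["No computer", "1", "2", "3 or more"]
frequencies = [50646, 50710, 14075, 6314]
total = sum(frequencies)

running = 0
median_category = None
for category, count in zip(categories, frequencies):
    running += count
    cumulative_percent = 100 * running / total
    # The median sits in the first category where the cumulative percent reaches 50.
    if cumulative_percent >= 50:
        median_category = category
        break

print(median_category, round(cumulative_percent, 1))
```

The loop stops at "1" because that is where the cumulative percentage first passes 50.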
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
In words, here's how we obtain the arithmetic mean:
STEP ONE: For the chosen variable, start with the score for the first case, add that score to the score for the second case, then add in the score for the third case, and keep adding in the scores until you have added in the score for the very last case on that variable. This is the sum of all the scores on that variable.
STEP TWO: Divide the sum of all the scores from Step One by the number of cases that you have (n).
The result is the arithmetic mean.
Here's what each symbol means:
Σ is a capital, or upper case, Greek letter sigma. Mathematicians use Σ as a shorthand way of saying "to sum" or "to add".
What is added are the scores to the right of the sigma sign.
Xi means a single, particular score.
i = 1 means to start with the very first case on that variable.
n is the total number of cases ("the casebase") IN A SAMPLE OF SCORES.
A sample is some subset of the entire population of scores. Most of the time, in the behavioral and social sciences (and in many biological or physical sciences, too) we work with a sample of scores, not the entire population.
X̄ ("X-bar") is the arithmetic mean for a SAMPLE of scores.
If you have the ENTIRE POPULATION OF SCORES, the symbol for the casebase is capital N.
The symbol for the ENTIRE POPULATION ARITHMETIC MEAN is the Greek letter Mu or µ.
|
Here's the arithmetic mean, applied to the number of computers per household. However, we will have to take some shortcuts because no one is going to add the scores for 121,745 cases by hand.
Number of Computers per United States
Household (in frequencies)
| How many computers in household? | Number of Cases |
| No computer | 50646 |
| 1 | 50710 |
| 2 | 14075 |
| 3 or more | 6314 |
| Total | 121,745 (13241 missing) |
Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
Instead of adding each case separately, we will take the value of each category (call it C) and multiply that value by the number of cases in the category (fc). As follows:
| Category Score (C) | Category Frequency (fc) | C X fc |
| 0 | 50646 | 0 X 50646 = 0 |
| 1 | 50710 | 1 X 50710 = 50710 |
| 2 | 14075 | 2 X 14075 = 28150 |
| 3 | 6314 | 3 X 6314 = 18942 |
| Total Sum | | 97802 |
(for purposes of this exercise, we will treat the category "3 or more" as just "3")
Then Σ(C X fc)/N becomes 97802/121,745 = 0.80 for the mean number of computers per household.
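The same shortcut takes only a few lines of Python. As in the text, this sketch treats "3 or more" as just 3, which is a simplification and will slightly understate the true mean:

```python
# Category score -> number of cases, from the frequency table above.
# ASSUMPTION (as in the text): "3 or more" is scored as exactly 3.
category_frequencies = {0: 50646, 1: 50710, 2: 14075, 3: 6314}

# Multiply each category score by its frequency, then add up the products.
total_sum = sum(score * count for score, count in category_frequencies.items())
n_cases = sum(category_frequencies.values())

grouped_mean = total_sum / n_cases

print(total_sum, n_cases, round(grouped_mean, 2))
```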
|
The arithmetic mean has some interesting properties (if you're curious, you can take a calculator and check them out yourself).
The entity: Xi - X̄ is called the deviation of each score from the mean or the "mean deviation score".
The total sum of all the mean deviation scores added up equals 0 within rounding error. The large and small scores essentially cancel each other out. This is one reason why the mean is considered the "center" of a set of interval-ratio data.
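If you would rather let the computer check this property, here is a small Python sketch. It reuses the ten quiz scores from earlier in these notes, but any set of numeric scores would do:

```python
from statistics import mean

scores = [3, 1, 8, 8, 9, 2, 10, 3, 9, 2]
center = mean(scores)

# Deviation of each score from the mean.
deviations = [x - center for x in scores]

# Positive and negative deviations cancel out.
total_deviation = sum(deviations)

print(deviations, total_deviation)
```

With other data the total may come out as a tiny number such as 0.0000000001 rather than exactly 0; that is the rounding error mentioned above.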
Because you are using numeric operations
(such as addition and division) to calculate the mean, of course, your
data must be numeric too. The mean, or "arithmetic average" is also
a number scaled with equal intervals (such as one year or one dollar).
This means your variable must be interval or ratio.
|
|
|
We become interested in the category midpoint for two main reasons:
First,
our measurements, especially in continuous data, are often approximations,
so there is some "wiggle room" in the category.
Second,
the categories may have been pregrouped, such as "9 to 11 years of school."
For a number of reasons, you may want to estimate the "middle point" of
the category.
Sometimes we use the midpoint of categories to calculate means and standard
deviations in cases where the data categories were grouped or collapsed
(see above).
Finding
the category midpoint means making some estimate of the upper and the lower
boundaries (the "true limits") of the categories. Then we do the arithmetic
average of the upper and lower boundaries. (Most of the time, of course,
we just go with the integer values.)
For a single score, such as "3," typically we can go .5 in either direction. So, in this example, the boundaries would range from 2.5 to 3.5 with an average midpoint of (2.5 + 3.5) / 2 = 3.
For a grouped category, we again add the lower and upper boundary of the category, then divide by 2.
In my example "9 to 11 years of school" category, the midpoint is (8.5+11.5) / 2 =10.
Of course, if you had even more precise estimates of the upper and lower boundaries of the categories, you would use those precise estimates instead of ± .5.
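The midpoint rule can be written as a small Python function. The name `category_midpoint` and the `half_unit` parameter are my own; 0.5 is the default "wiggle room" described above and should be changed if you have better boundary estimates:

```python
def category_midpoint(lowest, highest, half_unit=0.5):
    """Average the true lower and upper limits of a category."""
    lower_limit = lowest - half_unit    # e.g., 9 becomes 8.5
    upper_limit = highest + half_unit   # e.g., 11 becomes 11.5
    return (lower_limit + upper_limit) / 2

single_score_midpoint = category_midpoint(3, 3)    # the single score "3"
grouped_midpoint = category_midpoint(9, 11)        # "9 to 11 years of school"

print(single_score_midpoint, grouped_midpoint)
```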
|
|
Remember those lopsided quiz scores way at the top?
A mean score in these conditions looks misleading and, because of the "heaps" at each end, the median isn't a whole lot better.
However, adding a measure of dispersion will help to describe the data. Consider another example with the following two sets of quiz scores:
QUIZ SCORES
| STUDENT CLASS A | SCORE | STUDENT CLASS B | SCORE |
| 1 | 1 | 1 | 3 |
| 2 | 1 | 2 | 3 |
| 3 | 3 | 3 | 3 |
| 4 | 3 | 4 | 3 |
| 5 | 5 | 5 | 3 |
| 6 | 5 | 6 | 3 |
| CLASS MEAN | 3 | CLASS MEAN | 3 |
Although the two classes look quite different,
the class means are identical.
The purpose of measures of dispersion or variability is to say something about how much, "on the average," a score varies or deviates from a measure of central tendency. For example, Class A above has a greater diversity in quiz scores than Class B. Once we have both a measure of central tendency and a measure of average variability or dispersion, we know a lot about a set of scores.
|
|
Measures of dispersion or variation include the Index of Dispersion, D, (sometimes also called the index of qualitative variation [IQV]), the range, and the standard deviation of the mean. The only measure available for a set of nominal scores is the Index of Dispersion which varies from 0 (all cases are in the same category: a constant) to 1.00 when cases are evenly or uniformly distributed across categories so that each category has the same number of cases.
The quiz scores for Class B above would have a "D" of 0 because all the scores are a "3." There is no dispersion at all. On the other hand, the D for Class A would be 1: there are three categories with scores (1, 3 and 5), and each category has the same number of cases, two.
The D is cumbersome to calculate and impractical
if your variable has over 10 categories. However, it is useful for nominal
OR ordinal data when the variable only has a few categories. Unfortunately,
most statistical software packages do NOT calculate this measure (if there
were many categories and lots of frequencies, this would probably crash
the computer). Below is the formula for D should you want to use it
at some future time:
$$D = \frac{k\left(n^{2} - \sum f^{2}\right)}{n^{2}\,(k-1)}$$
where k = the number of categories
n = the TOTAL sample size or total case base (N in the case of a population)
and f is the observed frequency in each category of the variable.
Or, in words:
Square
the frequency in each category of the variable, add up all the squared
frequencies.
Subtract
this sum from the square of the casebase.
Multiply
this entire numerator mess by the number of categories.
In
the denominator, multiply the square of the sample size by (the number
of categories - 1).
Divide
the numerator by the denominator.
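The verbal recipe above translates directly into Python. This is a sketch under the definitions given in the text; the function name is my own, and it expects one frequency per category, including categories that happen to have zero cases:

```python
def index_of_dispersion(frequencies):
    """Index of Dispersion, D, from a list of category frequencies."""
    k = len(frequencies)                  # number of categories
    n = sum(frequencies)                  # total casebase
    if k < 2 or n == 0:
        raise ValueError("D needs at least two categories and one case")
    sum_of_squares = sum(f * f for f in frequencies)   # square each frequency, add them up
    numerator = k * (n ** 2 - sum_of_squares)          # subtract from n squared, multiply by k
    denominator = n ** 2 * (k - 1)                     # n squared times (k - 1)
    return numerator / denominator

# Class A: two students each at scores 1, 3 and 5.
d_class_a = index_of_dispersion([2, 2, 2])
# Class B: all six students at score 3 (none at 1 or 5).
d_class_b = index_of_dispersion([0, 6, 0])

print(d_class_a, d_class_b)
```

Class A comes out at 1.0 (cases spread evenly) and Class B at 0.0 (a constant), matching the description above.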
|
|
|
For ordinal data, we have two measures of dispersion: the range and the inter-quartile range.
The generic definition of the range is to list the two end-points: the highest and the lowest category scores. This definition will work for both truly ordinal variables and for interval or ratio variables.
A second, very common, definition of the range is the highest category score minus the lowest category score. YOU CANNOT USE THIS VERSION OF THE RANGE ON TRULY ORDINAL DATA BECAUSE THIS VERSION PRODUCES A NUMBER! What is a meaningful number for "strongly agree" minus "strongly disagree"? There isn't one! (However, you can give the two endpoints, and this is meaningful.) Another problem with using the numeric version of the range is that it is insensitive to the absolute magnitude of the scores. 16 - 1 = 15 (say, for years of school completed) is clearly more comprehensive than 1016 - 1001 = 15 (say, for weekly salary in dollars).
The inter-quartile range is probably more useful than the range and spans the middle 50 percent of the cases. The IQR goes from the 25th to the 75th percentile. When your data are numeric, you can subtract the number that corresponds to the category that contains the 25th percentile from the number that corresponds to the category that contains the 75th percentile. Again, more generic for ordinal, interval, and ratio variables are the verbal end points for the categories that contain the 25th and the 75th percentile. Once more, the cumulative percentage makes it easy to find the end points of the inter-quartile range.
Thus, the inter-quartile range contains the middle 50 percent of the scores.
Looking at the Current Population Survey again, the 25th percentile category is "no computer". The 75th cumulative percentile category is "1 computer." The Interquartile Range goes from "no computer" to "1 computer."
Number of Computers per United States Household
| How many computers or laptops are there in this household? | Number of Cases | Percent of Total Cases | Cumulative - down |
| No computer (contains the 25th percentile) | 50646 | 41.6% | 41.6 |
| 1 (contains the 75th percentile) | 50710 | 41.7 | 83.3 |
| 2 | 14075 | 11.5 | 94.8 |
| 3 or more | 6314 | 5.2 | 100.0 |
| Total | 121,745 | 100.0% | |
Source = Current Population Survey Internet and Computer Use Supplement (Aug 2000)
Missing data = 13241
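Finding the end points of the inter-quartile range from cumulative percentages can be sketched in Python. The helper name `percentile_category` is my own, and the categories must be supplied in rank order:

```python
def percentile_category(categories, frequencies, percentile):
    """Return the first ranked category whose cumulative percent reaches `percentile`."""
    total = sum(frequencies)
    running = 0
    for category, count in zip(categories, frequencies):
        running += count
        if 100 * running / total >= percentile:
            return category
    return categories[-1]

categories = ["No computer", "1", "2", "3 or more"]
frequencies = [50646, 50710, 14075, 6314]

q1_category = percentile_category(categories, frequencies, 25)
q3_category = percentile_category(categories, frequencies, 75)

print(q1_category, "to", q3_category)
```

Because the function returns category labels rather than numbers, it is safe to use with truly ordinal variables.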
|
|
With interval or ratio data and the arithmetic mean, we can use the standard deviation of the mean. Here's how:
1. Subtract the mean from each score on your chosen variable.
2. Square each deviation difference.
3. Now add up all the squared differences.
4. This sum is called the "Total Sum of Squares" (TSS for short).
5. Take the TSS and divide it by either the total number of cases for a population or by (the total number of cases - 1) for a sample.
This quantity in step five is called the variance. The variance is the average squared deviation from the mean. We square each deviation from the mean first because if we did not, the sum of the deviations from the mean would be zero in every case. That would not distinguish among the different degrees of variability across samples such as those in the Class A and Class B example above.
6. Now, take the square root of the variance and you have the standard deviation of the mean, or roughly the "average deviation" of a score from the mean.
Everything that I just said in steps
1-6 is summed up in the definitional formula presented below.
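TERMINOLOGY: s for a sample
$$s = \sqrt{\frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^{2}}{n - 1}}$$
For a population, use µ in place of X̄ and divide by N instead of (n - 1).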
Unless you have a very small sample, you will not calculate a standard deviation by hand. Once the casebase exceeds a few dozen, even the short-hand computational formulae that you see in textbooks, or hand-calculation procedures using grouped categories, rapidly become tedious and difficult to execute without error.
Each variable in a particular sample
has its own unique standard deviation.
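For a very small set of scores, the six steps for the standard deviation can be followed line by line in Python. This sketch uses the Class A and Class B quiz scores from earlier; the function name and the `sample` switch are my own:

```python
from math import sqrt

def standard_deviation(scores, sample=True):
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]        # Step 1: subtract the mean from each score
    squared = [d ** 2 for d in deviations]         # Step 2: square each deviation
    tss = sum(squared)                             # Steps 3-4: Total Sum of Squares
    variance = tss / (n - 1 if sample else n)      # Step 5: divide by n - 1 (sample) or N (population)
    return sqrt(variance)                          # Step 6: square root of the variance

class_a = [1, 1, 3, 3, 5, 5]
class_b = [3, 3, 3, 3, 3, 3]

sd_a = standard_deviation(class_a)
sd_b = standard_deviation(class_b)

print(round(sd_a, 2), sd_b)
```

Both classes have a mean of 3, but Class A has a standard deviation of about 1.79 while Class B has a standard deviation of 0.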
|
|
The goal of much research is to predict the true POPULATION VALUE. We want to minimize ANY deviation from the true population value when we make such a prediction. However, because many populations are very large, it is too expensive, time consuming, or even practically impossible to measure every unit or case in the population.
So, most of the time we take a subset, or a SAMPLE, from the population. A well-chosen sample, nearly always a PROBABILITY SAMPLE, in which each case has a KNOWN chance of selection, often allows us to make very good inferences to the total population. However, because a sample is a subset of cases, we do expect random variations from case to case and from sample to sample. Positive fluctuations cancel out negative ones IN THE LONG RUN, although not necessarily in any ONE particular sample.
When we observe sample univariate results, such as a mean or a percentage, we often put error limits around that result in an attempt to estimate what is happening in the population.
What we want to do is make an estimate of the "average sample," the one that would occur if we took repeated samples of the same size and the same type (typically around the same time) from the same population. Each sample would provide an estimate of the parameter we wanted to know.
EXAMPLE: if we wanted to know the average number of years of completed education among general public adults, and we took repeated samples, we could estimate a mean years of completed education for each sample. We could also estimate the average variability from sample to sample.
EXAMPLE: If we wanted to know the population percentage that would vote for President Bush in 2004, we could examine several polls (say, each one 1500 cases, and each one obtained through a Random Digit Dial telephone survey) over the next several months. Each poll would have a sample estimate of the percentage of voters choosing President Bush.
Thus, we have a SET OF SAMPLE ESTIMATES (such as a set of mean years of education or percentage endorsing President Bush) FROM SEVERAL SAMPLES.
We call this set of sample estimates THE SAMPLING DISTRIBUTION.
When we have a subset of individual
cases, we have a sample.
The unit is an individual case, such
as a United States adult.
When we have a set of samples, with
a statistic, such as a mean, from each sample, we have a sampling
distribution.
The unit is AN INDIVIDUAL SAMPLE.
Sometimes, we actually physically take a set of repeated samples and we can create a sampling distribution from these. One example is all the polls estimating who will win an election. These typically occur around the same time, are about the same size, and are taken the same way.
More often, we make generalizations from SAMPLING DISTRIBUTIONS. We do so very often with only a single sample.
SAMPLING DISTRIBUTIONS are hypothetical distributions of a sample statistic (such as a mean) taken from an infinite number of samples of the same size and the same type taken around the same time period (say, n = 900 for each sample and each sample is a Random Digit Dial survey).
Remember that each element in a sampling distribution is a separate sample.
In the long run, we hope that the center of the sampling distribution, such as the "mean of the means" (the grand mean) will be the same as the true population value (such as the true population mean.) This is often called the expected value.
If we do a good job on sampling, we can estimate the population mean or percentage from just one sample and put approximate limits of variability (called "confidence intervals") around our estimate.
The sampling distribution also has a
measure of variation. The standard deviation of the sampling distribution
is calculated in a way similar to that of a sample. Let M equal the number
of SAMPLES. Add up the sample means from all the samples and divide by
M, the total number of samples. This gives us the mean of the sampling
distribution.
When I write √n, that's short for "the square root of n".
To summarize, for measures of central tendency and variation for samples, populations, and sampling distributions:
| | Unit | Measure of central tendency | Measure of variation |
| Sample | An individual case | Mean: X̄ | Standard deviation: s |
| Population | An individual case | Mean: µ | Standard deviation: σ |
| Sampling Distribution | A single SAMPLE | Mean: the grand mean ("mean of the means") | Standard error |
|
|
Standardized scores allow us to compare how extreme a score is across different variables no matter what the metric of the variable may be. For example, Marilyn Vos Savant, a syndicated speaker and columnist, is supposed to have the highest measured intelligence test score in the WORLD.
Let's suppose that Marilyn's IQ score is 175. This metric is in IQ points.
Does Marilyn's stratospheric IQ transfer into megabucks too? (For causal purposes, we will assume that even the entire budget of the United States will not make Marilyn a genius, so the causal arrow must run from Marilyn's IQ to her income in dollars.)
Let's assume that Marilyn's annual income is $80,000 in U.S. dollars. This metric is in dollars.
Standardized or "normal" variables have a mean of 0 and a variance and standard deviation of 1, no matter what the original metric of the variable was (e.g., IQ points, dollars of income or years of age). This is what enables us to compare mean scores across different groups and even different variables.
NOTE: YOU
CAN ONLY CALCULATE STANDARD SCORES WITH INTERVAL-RATIO DATA!
You are using arithmetic operations
to calculate a standard score.
Here's how to obtain a standardized score, often called a "Z score" or a "normal score":
1. Take each score of a given variable
2. Subtract the mean from each score.
3.
Divide the deviation score by the standard deviation for that variable.
In symbols FOR A POPULATION:
Z = (Xi - µ) / σ
For example, common IQ measures are calibrated to have a mean of 100 and a standard deviation of 15.
If Marilyn Vos Savant's IQ is 175, her Z score would be: (175 - 100) / 15 = + 5.00
Marilyn's IQ is five standard deviations above the average U.S. IQ.
How about Marilyn's income? Mean family income in the United States is about $50,000 per year. Let's suppose the standard deviation for income is $15,000.
So Marilyn's Z score on income would be: (80,000 - 50,000) / 15,000 = +2.00 or two standard deviations above the average.
So, although Marilyn's IQ is WAY above average, her income is above average, but not nearly to the same degree. Although IQ and income are two different measures, calibrated on two different metrics, the Z score allows us to directly compare both measures for a particular person.
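The Z score arithmetic above fits in a one-line Python function. The numbers are the assumed values from the example (an IQ of 175, an income of $80,000, and a supposed income standard deviation of $15,000), not real measurements:

```python
def z_score(score, mean, standard_deviation):
    """Deviation from the mean, expressed in standard deviation units."""
    return (score - mean) / standard_deviation

# Assumed values from the example above.
iq_z = z_score(175, 100, 15)
income_z = z_score(80_000, 50_000, 15_000)

print(iq_z, income_z)
```

Once both scores are on the same standardized metric, comparing 5.0 with 2.0 is straightforward even though IQ points and dollars have nothing in common.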
Z scores are particularly valuable to
use with the Normal Curve (see below). Because the areas under the normal
curve are known by definition, if your data conform to a normal distribution,
you can tell whether your score is about average or extremely high or low...and
even a score's percentile if your variable follows a normal distribution.
|
|
The
Normal Curve is a mathematically derived hypothetical distribution of scores.
To understand the basics of the normal curve, you should now be familiar with the mean, median and mode, and standard deviations and standard errors.
Below is the basic function that produces the normal curve. A function of this kind is called a probability density function or PDF; the area under the curve, found through the mathematical process of integration, gives the proportion of cases in any range of scores. Virtually every statistic has its own unique PDF that will draw a curve. The normal curve is actually one of the simplest PDFs in the field of statistics.
PDFs allow us to make very useful inference statements because each PDF is a collection of mathematical properties. Because the PDF itself is hypothetically defined, it always has the same theoretical mathematical properties, regardless of the specific sample involved (although the specific numbers will depend on the data itself.)
Here's the formula that produces the normal curve. (This copy is courtesy of Dr. Brewer in EPLS' book:)
If you look at some of the components of this function for the normal curve, you will see some familiar symbols: the population mean for the variable (µ), the population variance (σ²), and the population standard deviation (σ).
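For reference, the usual textbook form of the normal density function, written with the symbols just listed, is shown below. This is the standard formula rather than a copy of the figure from the book; X stands for a score on the variable, and π and e are the familiar mathematical constants:

```latex
f(X) = \frac{1}{\sigma\sqrt{2\pi}} \; e^{-\frac{(X-\mu)^{2}}{2\sigma^{2}}}
```

Notice that once µ and σ are known, the height of the curve at any score X is completely determined.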
|
|
You can use the normal curve only with numeric
data.
The curve is bell-shaped.
The curve is symmetric:
each side is a "mirror image" of the other.
The
distribution has a center.
The mean, median and mode are the same number and they are all in the exact
center of the distribution of scores.
The
total area under the curve is set to 100% or 1.00.
With normally distributed data, about 68 percent of cases are within ± 1 standard deviation of the mean, 95 percent of cases are within ± 1.96 standard deviations of the mean, and about 99.7 percent of the cases are within ± 3 standard deviations (σ) of the mean.
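You do not need a printed table to find areas under the normal curve. Here is a short sketch using Python's built-in `statistics.NormalDist` (available in Python 3.8 and later); the helper name `area_between` is my own:

```python
from statistics import NormalDist

# Standardized normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def area_between(low_z, high_z):
    """Proportion of cases between two Z scores."""
    return standard_normal.cdf(high_z) - standard_normal.cdf(low_z)

mean_to_one_sd = area_between(0, 1)        # about 0.3413
within_one_sd = area_between(-1, 1)        # about 0.6827
within_196_sd = area_between(-1.96, 1.96)  # about 0.9500
within_three_sd = area_between(-3, 3)      # about 0.9973

print(round(mean_to_one_sd, 4), round(within_one_sd, 4),
      round(within_196_sd, 4), round(within_three_sd, 4))
```

These areas hold only if your variable really follows a normal distribution.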
|
The total area under the normal curve is set to 1.00 or 100 percent. We can calculate the various areas under the normal curve (tables at the very back of your book can help you do so, and computers will calculate this too). For example, 34 percent of the cases (or the area under the curve) is found between the mean and 1 standard deviation, or to use the symbolic terminology:
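The areas described above can be checked with a few lines of code. This is a minimal sketch that assumes a standard normal curve (mean 0, standard deviation 1) and uses the error function from Python's math module. The helper names normal_cdf and area_within are illustrative assumptions.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_within(k):
    """Share of cases within k standard deviations on either side of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1.0, 1.96, 3.0):
    print(f"within +/- {k} standard deviations: {area_within(k):.4f}")

# Area between the mean and one standard deviation above the mean.
area_mean_to_one_sd = normal_cdf(1.0) - normal_cdf(0.0)
print(f"between the mean and +1 standard deviation: {area_mean_to_one_sd:.4f}")
```

These are the same areas you would look up in the table at the back of a statistics book.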
Imagine that you have taken several samples of exactly the same size and the same type (say, n = 1500 telephone Random Digit Dial samples).
You now have M samples. You calculate a mean from each of these M samples. Then you average these separate sample means to find the "mean of the means" (often called the "grand mean").
The standard deviation around the "mean of the means" or the grand mean has a special name: we call it the standard error of the mean, so that we know we are dealing with a sampling distribution of sample means and not one sample of individual cases.
The standard error behaves analogously to the standard deviation for the normal curve, except that the standard error is a measure of variability across separate entire samples.
(The standard deviation is a measure of variability across individual cases in a single sample or a single population.)
We can apply the Normal Curve to either a sample of cases OR to a set of sample statistics (such as a set of means across several samples).
CRITICALLY IMPORTANT: The results from a set of samples may be normally distributed even if the cases from a single sample do NOT have a normal distribution.
If each sample is big enough ("the law of large numbers"), a sample statistic such as a mean or a proportion will vary less from sample to sample, and the results across samples will form an approximately normal distribution (statisticians call this result the central limit theorem).
Because we know the defined math qualities of the normal distribution in advance (they are mathematically defined, remember), we can use these properties with our data if the cases themselves are normally distributed or if the sample contains several hundred cases.
For example, we expect about 95 percent of SAMPLE MEANS to be within two STANDARD ERRORS on either side of the grand mean.
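A short simulation can make the idea of a sampling distribution concrete. This is only a sketch under stated assumptions: the "population" is an invented, strongly skewed (exponential) distribution with mean 10 and standard deviation 10, and the sample size, number of samples, and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(2004)  # arbitrary seed so the sketch is repeatable

POPULATION_MEAN = 10.0     # an exponential population with mean 10 also has SD 10
SAMPLE_SIZE = 400          # n, the number of cases in each sample (hypothetical)
NUMBER_OF_SAMPLES = 2000   # M, the number of separate samples (hypothetical)

def draw_sample_mean(n):
    """Draw one sample of n cases from the skewed population and return its mean."""
    cases = [random.expovariate(1.0 / POPULATION_MEAN) for _ in range(n)]
    return statistics.fmean(cases)

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUMBER_OF_SAMPLES)]

grand_mean = statistics.fmean(sample_means)        # the "mean of the means"
standard_error = statistics.stdev(sample_means)    # SD of the sample means
expected_standard_error = POPULATION_MEAN / SAMPLE_SIZE ** 0.5  # sigma / sqrt(n)

# Share of sample means within two standard errors of the grand mean.
within_two = sum(
    abs(m - grand_mean) <= 2 * standard_error for m in sample_means
) / NUMBER_OF_SAMPLES

print(f"grand mean = {grand_mean:.2f}")
print(f"standard error = {standard_error:.3f} (sigma / sqrt(n) = {expected_standard_error:.3f})")
print(f"share of sample means within two standard errors = {within_two:.3f}")
```

Even though the individual cases here are far from normally distributed, the sample means cluster around the grand mean, and roughly 95 percent of them fall within two standard errors.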
Because we expect some variability in the outcome whenever we take a sample, we can place a confidence interval around our sample estimate of the mean. This gives us some idea of how much we can expect the mean to vary from sample to sample.
If we go out two estimated standard error units on either side of the sample mean, and construct an interval this way for sample after sample, 95 percent of those confidence intervals will contain the true population mean.
Obviously, any ONE confidence interval constructed from a single sample either will contain the population mean or it will not. Our faith is in the long run PROCESS--that 95 percent of the confidence intervals constructed in such a way WILL contain the population parameter.
If we are going out almost two standard errors (1.96 to be exact) about the sample mean in either direction, this will capture the population mean in about 95 percent of the samples (same size and type from the same population at about the same time) that we could take. In 5 percent of the samples we take, the confidence interval will NOT contain the population mean. We never know which sample is a good one or a bad one, but our faith is that we got one of the 95 accurate samples out of every 100 and not one of the 5 bad samples.
The general formula for the 95% confidence interval around the mean is:
$$\text{95\% confidence interval} = \bar{x} \pm 1.96 \times \text{s.e.}_{\bar{x}}$$
The 1.96 means we are capturing the population mean score in 95 percent of the samples, whether it is higher than the sample mean in this particular sample [$+1.96 \times \text{s.e.}_{\bar{x}}$] or lower than the sample mean in this particular sample [$-1.96 \times \text{s.e.}_{\bar{x}}$].
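The "long run process" can be illustrated with a simulation. This is a sketch only: the population mean of 50, the standard deviation of 12, the sample size, and the number of samples are all invented for illustration, and the cases are drawn from a normal population with Python's random module.

```python
import random
import statistics

random.seed(95)  # arbitrary seed so the sketch is repeatable

TRUE_MEAN = 50.0          # hypothetical population mean
TRUE_SD = 12.0            # hypothetical population standard deviation
SAMPLE_SIZE = 200         # cases per sample (hypothetical)
NUMBER_OF_SAMPLES = 2000  # how many samples we pretend to take

def confidence_interval_95(cases):
    """Sample mean plus and minus 1.96 estimated standard errors."""
    mean = statistics.fmean(cases)
    standard_error = statistics.stdev(cases) / len(cases) ** 0.5
    return mean - 1.96 * standard_error, mean + 1.96 * standard_error

hits = 0
for _ in range(NUMBER_OF_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    low, high = confidence_interval_95(sample)
    if low <= TRUE_MEAN <= high:  # did this interval capture the population mean?
        hits += 1

coverage = hits / NUMBER_OF_SAMPLES
print(f"share of intervals containing the population mean: {coverage:.3f}")
```

Any single interval either contains the population mean or it does not, but across many samples the share that do should come out close to 95 percent.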
EXAMPLE:
Suppose we are trying to estimate adult age in the United States.
We can use the 2002 General Social Survey data to do so.
Mean age for 2002 according to the GSS = 46.28 years.
The standard deviation is 17.37 and the n = 2751.
This makes our estimated standard error 17.37 / √2751 or 17.37 / 52.45 = .33
When I write √n, that's short for "the square root of n". So in this case √2751 means "the square root of" 2751.
Our 95 percent confidence interval (going out almost two standard errors or 1.96 standard errors from the sample mean on either side) will be:
46.28 ± 1.96 (.33) or 46.28 ± 0.65
46.28 - 0.65 = 45.63 years of age
46.28 + 0.65 = 46.93 years of age
So our best estimate is that mean adult U.S. age (in 2002) is between 45.63 and 46.93 years of age.
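The arithmetic in this example can be reproduced in a few lines. The mean, standard deviation, and n are the 2002 GSS figures quoted above. The function name confidence_interval_95 is an illustrative assumption.

```python
import math

def confidence_interval_95(mean, sd, n):
    """Return the estimated standard error and the 95 percent confidence limits."""
    standard_error = sd / math.sqrt(n)   # standard deviation divided by root n
    margin = 1.96 * standard_error       # almost two standard errors
    return standard_error, mean - margin, mean + margin

# Figures from the 2002 General Social Survey age example.
se, low, high = confidence_interval_95(mean=46.28, sd=17.37, n=2751)
print(f"estimated standard error = {se:.2f}")
print(f"95 percent confidence interval = {low:.2f} to {high:.2f} years of age")
```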
As the casebase becomes larger, the standard error becomes smaller (remember, you divide the standard deviation by √n). So larger samples have smaller, hence more precise, confidence intervals and estimates of the population parameter.
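To see how sample size drives precision, here is a small sketch that holds the standard deviation at the GSS value of 17.37 and varies n. The sample sizes other than 2751 are hypothetical, chosen only to show the pattern.

```python
import math

SD = 17.37  # standard deviation of age from the GSS example

def standard_error(sd, n):
    """Estimated standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

for n in (100, 400, 1600, 2751):  # 100, 400 and 1600 are hypothetical sample sizes
    se = standard_error(SD, n)
    print(f"n = {n:5d}   s.e. = {se:.3f}   95% margin = +/- {1.96 * se:.2f}")
```

Because we divide by the square root of n, it takes four times as many cases to cut the standard error in half.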
Often the basic elements of tabular displays are presented instead in a chart or a graph. Or icons may be used to represent the frequencies in each category of a selected variable.
Histograms look like bar charts. On the horizontal, or "x" axis, are the categories of the variable. On the vertical, or "y" axis, are the frequencies for each category.
Line graphs, also called frequency polygons, connect the midpoints of (at least ordinal) categories to create lines.
Pie charts use "pie wedges" and percentages (make sure they add to 100%!).
Pictographs use representational icons such as small houses or moneybags to show relative frequencies, often across groups.
Using the number of computers per household, below is one example each of a histogram, a frequency polygon, and a pie chart.
[Figures: a histogram, a frequency polygon, and a pie chart of the number of computers per household.]
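The same ingredients (categories, frequencies, and percentages) sit behind every one of these displays. The sketch below builds them for a rough text histogram. The counts are invented purely for illustration and are NOT results from any survey, and the function name text_histogram is likewise an assumption.

```python
# Hypothetical counts of households by number of computers, invented for
# illustration only (these are not survey results).
computers_per_household = {0: 45, 1: 35, 2: 15, 3: 5}

total = sum(computers_per_household.values())
percentages = {k: 100.0 * v / total for k, v in computers_per_household.items()}

def text_histogram(frequencies, cases_per_symbol=5):
    """One line per category: the category, a bar of # symbols, and the frequency."""
    lines = []
    for category, frequency in frequencies.items():
        bar = "#" * (frequency // cases_per_symbol)  # every symbol stands for the same number of cases
        lines.append(f"{category} | {bar} {frequency}")
    return lines

for line in text_histogram(computers_per_household):
    print(line)

# Pie chart check: the percentages should add to 100.
print(f"percentages add to {sum(percentages.values()):.1f}")
```

Note that every symbol in a bar stands for the same number of cases, so the length of each bar is proportional to its frequency.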
Some writers of reports like to use little pictorial symbols or icons to represent frequencies or relative frequencies.
For example, suppose that we find that 40 percent of the households where the top degree is a high school diploma have at least one personal computer (or laptop) and 80 percent of the households where the top degree is a college degree have at least one personal computer (or laptop). Pictorially, the artist might represent this comparison as follows:
CORRECT DEPICTION
Percent of United States Households
Owning at least One Personal Computer by Education
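[Pictograph: one computer icon for high school diploma households; two computer icons of the same size for college degree households.]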
Source = Current Population Survey Internet
and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)
However, suppose our graphic artist, who doesn't have any statistics background, decides that he doesn't want MULTIPLE icons because he feels that this clutters up his display. Instead, he decides to change the size of a pictorial icon in the college household to show the relative differences. So he tries the following change and makes the "80 percent" icon twice as high as the 40 percent icon:
CORRECT (BUT NOT "PRETTY") DEPICTION
Percent of United States Households
Owning at least One Personal Computer by Education
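[Pictograph: one computer icon for high school diploma households; one icon of the same width but twice the height for college degree households.]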
Source = Current Population Survey Internet
and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)
Only now, the artist decides that the tall, skinny computer looks weird and out of proportion (it does), so he juggles the dimensions of the tall, skinny computer to make it "look better", producing the following comparison:
INCORRECT (but pretty) DEPICTION
Percent of United States Households
Owning at least One Personal Computer by Education
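[Pictograph: one computer icon for high school diploma households; one icon twice as tall and twice as wide for college degree households.]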
Source = Current Population Survey Internet
and Computer Use Supplement Aug 2000.
(n = 121,745; Missing data = 13241)
There! Now the big computer is more in proportion! Isn't that better?
Well, no, it's not. It's actually very misleading. Our artist, in a desire to make the computer icon pretty, now has not only made it twice as tall--but also twice as wide. The total icon for those with a college degree is now FOUR TIMES larger than the icon for those with a high school diploma, even though they are only twice as likely to have a home computer.
Thus, it is easy to misrepresent icons in pictographs, although the artist may have the best of intentions. Be sure that the icon used is always the same size for the entire pictograph. You can use multiples of the same size icon to convey group differences (as in the first, and correct, pictograph of computers).
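The arithmetic behind the misleading icon is simple: the eye reads the AREA of an icon, and area is width times height. A minimal sketch, with icon dimensions in arbitrary hypothetical units:

```python
def icon_area(width, height):
    """Area of a rectangular icon: what the eye actually compares."""
    return width * height

base_width, base_height = 1.0, 1.0  # hypothetical size of the "40 percent" icon

base = icon_area(base_width, base_height)
tall_only = icon_area(base_width, 2 * base_height)          # twice as tall, same width
tall_and_wide = icon_area(2 * base_width, 2 * base_height)  # twice as tall AND twice as wide

print(f"twice as tall only: {tall_only / base:.0f} times the area")
print(f"twice as tall and twice as wide: {tall_and_wide / base:.0f} times the area")
```

Doubling only the height doubles the area, which matches the 80 percent versus 40 percent comparison. Doubling both dimensions quadruples it.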
Other tips: Be sure the x and y axes of frequency polygons and histograms use equal interval units across the bottom and up the side. If you look at the Radon example in my handout (last page), it uses equal intervals up the side but unequal intervals across the bottom so that the effect of Radon exposure on smokers looks much more dramatic than it really is. (There will be a PAPER handout coming on graphic displays!)
Be sure that the y axis either starts at zero and runs unbroken or, if the graphic display is truncated (that is, it omits the middle portion of the y axis to concentrate on where all the frequencies are displayed), uses CLEAR truncation marks (see the page in the handout that compares the "good graph" and the "bad graph" for consumer confidence on the same page).
Be sure that the graphic display, whether histogram, frequency polygon, pie chart, pictorial representation, and so forth, tells you (when appropriate) the total case base, valid case base, and the source of the data.
GREAT BOOK TIP: Darrell Huff does the best job I have ever seen depicting the mismanagement of graphic displays of data. This section of Guide 3 owes a lot to How to Lie with Statistics. So make sure to check out the reading, which has lots of examples.
Susan Carol Losh May 19 2004