QUALITIES OF A GOOD MEASURE
BOTH QUALITIES (RELIABILITY AND VALIDITY) ARE BASED LARGELY ON THE CORRELATION STATISTIC.
RELIABILITY: Consistently yields the same result.
If you took a ruler and repeatedly measured your book, you would always get the same answer. Ruler measurements of a physical object are perfectly reliable. Social scientific measurements of people are not perfectly reliable.
[Figure: a ruler marked 1 through 12, with a book beneath it]
VALIDITY: Really measuring what it intends to measure.
EXAMPLE: SAT analogy item → RUNNER : MARATHON
This item is intended to measure analogical reasoning ability, but may be as much or more a measure of socioeconomic status.
(From Herrnstein & Murray, 1994, The Bell Curve)
Correlation Coefficient (r): How do two variables go together?
Determining the Correlation Coefficient (r)
Take values from data file (below) and plot each person's values (see graph on right)
A computerized statistical program will try to find the "best-fitting line," which comes as close as possible to the full set of data points. The sign of the correlation (r) matches the direction of the best-fitting line's slope (upward = positive, downward = negative), and the size of r reflects how closely the points cluster around the line.
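For anyone who wants to see the arithmetic behind the best-fitting line, here is a minimal sketch in plain Python. The data (`hours_studied`, `exam_score`) and the function name `fit_line_and_r` are made up for illustration; an actual statistical package would do the same calculation behind the scenes, with more output.

```python
from math import sqrt

def fit_line_and_r(xs, ys):
    """Return (slope, intercept, r) for paired scores xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # least-squares slope of the best-fitting line
    intercept = mean_y - slope * mean_x  # the line passes through the point of means
    r = sxy / sqrt(sxx * syy)            # Pearson correlation, between -1 and +1
    return slope, intercept, r

# Hypothetical data: one (x, y) pair per person.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 60, 57, 70, 74, 79]

slope, intercept, r = fit_line_and_r(hours_studied, exam_score)
print(f"upward trend:   slope = {slope:.2f}, r = {r:.2f}")

# Reversing the y values flips the trend, so slope and r both turn negative.
down_slope, _, down_r = fit_line_and_r(hours_studied, exam_score[::-1])
print(f"downward trend: slope = {down_slope:.2f}, r = {down_r:.2f}")
```

Note that the slope and r always share the same sign, because both are built from the same sum of cross-products.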
Several excellent visual/graphical websites for learning about correlations:
This one allows you to see the emergence of the best-fitting line in a sea of data points (a key element of determining the correlation).
This website is very interactive, in that you can try several different examples and get immediate feedback. Scroll down to where you see a chart with a grid. You can make up some data points for fun by moving around on the grid and clicking on each place you want to put a data point. When you click on "Show Line," it will insert the best-fitting line and also show you the value of the correlation (r) to the left of the grid in the upper corner.
Two additional websites (here and here) allow you to slide the cursor over any numerical value of a correlation from +1 to -1 and see what the data points and best-fitting line look like.
A song to nail down our understanding of correlation, best-fitting lines, upward and downward slopes, etc.
Fitting the Line
Lyrics by Alan Reifman
(May be sung to the tune of “Draggin’ the Line,” James/King)
(Back-up vocals in parentheses)
Plotting the data, on X and Y,
Finding the slope, with most points nearby,
We want to find the angle, of the trend’s incline,
Fitting the line (fitting the line),
Upward slopes make r positive,
Slopes trending down, make it negative,
From minus-one to plus-one, r can feel fine,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
Points align, how will the data shine?
If you have upward slopes, it’ll give you a plus sign,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
How strongly will your variables relate?
Is there a trend, or just a zero flat state?
You want to know what your analysis will find,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
Points align, how will the data shine?
Your r will be minus, if the slope declines,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
(Guitar solo)
Points align, how will the data shine?
If you have upward slopes, it’ll give you a plus sign,
Fitting the line (fitting the line),
Fitting the line (fitting the line)…
TYPES OF RELIABILITY

| PROCEDURAL CONDITIONS | TYPE OF RELIABILITY |
|---|---|
| Self-report, multiple occasions to gather data from each participant. | TEST-RETEST: Give same measure twice, separated by days, weeks, or months. Correlation between scores at Time 1 and Time 2. |
| Self-report, single occasion, multiple-item measure, such as Hendrick & Hendrick love scales, each with seven items (four items in the shortened version). You would compute a separate alpha each for Eros, Ludus, Storge, Pragma, Mania, and Agape. | INTERNAL CONSISTENCY (ALPHA, α): If high internal consistency, how a person answered any one item tells you how he/she answered the others. Based in part on correlations, with maximum = 1.0. UPDATED 2/8/08. |
| Observation with two raters (single occasion) | INTER-RATER RELIABILITY: Correlate the two raters' scores for the same observations. |
Test-retest reliability correlations involving people who took the SAT more than once have been reported as .77 for whites and .90 for blacks (Vars & Bowen chapter in Jencks & Phillips, The Black-White Test Score Gap, p. 471, footnote 22).
Here's a sports example of "test-retest reliability" that I came up with.

The item listings above are just shortened, keyword descriptions. The actual wordings are available here. Instead of the True/False format shown on the web document, we used a system of 0 = Strongly Agree to 4 = Strongly Disagree.
For a given set of items (such as the Storge subscale), alpha is based on the number of items and the average of the correlations between each pair of items (Wikipedia page).
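To make the "number of items plus average inter-item correlation" idea concrete, here is a small sketch of the standardized form of alpha: alpha = k × (mean r) / (1 + (k − 1) × (mean r)), where k is the number of items. The responses below are invented (0-to-4 scale, six respondents, four items), and the function names are just for illustration. Statistical packages usually compute alpha from item variances and covariances instead, so their answer can differ slightly from this standardized version.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def alpha_from_mean_r(k, mean_r):
    """Standardized alpha from the number of items and the mean inter-item r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

def standardized_alpha(items):
    """items: one list of responses per item, respondents in the same order."""
    pair_rs = [pearson_r(a, b) for a, b in combinations(items, 2)]
    mean_r = sum(pair_rs) / len(pair_rs)
    return alpha_from_mean_r(len(items), mean_r)

# Hypothetical responses to a four-item subscale (0 to 4 scale, six people).
item1 = [0, 1, 2, 3, 4, 4]
item2 = [1, 1, 2, 3, 3, 4]
item3 = [0, 2, 2, 2, 4, 3]
item4 = [1, 0, 3, 3, 4, 4]

alpha = standardized_alpha([item1, item2, item3, item4])
print(f"alpha for the made-up subscale = {alpha:.2f}")

# Same average correlation, more items: alpha goes up.
print(alpha_from_mean_r(4, 0.5))  # 0.8
print(alpha_from_mean_r(7, 0.5))  # 0.875
```

The last two lines show why the number of items matters: with the same average inter-item correlation of .50, a seven-item scale gets a higher alpha than a four-item scale.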
Both types of reliability shown below should exhibit large positive correlations (high reliability).

TEST-RETEST DEPRESSION EXAMPLE: Most people who score highly on the test the first time would also score highly on the same test the second time, and people with low initial scores would likely also get a low score the second time. A few individuals might get very different scores on the two occasions, but the correlation statistic represents the trend for the whole sample.

INTER-RATER SMILING EXAMPLE: If a particular wife smiles a lot, two well-trained raters should both record a large number of smiles, although they may differ slightly. However, no rater should have a tally of 0 for this wife. If another wife very rarely smiles, both raters should have tallies at or near 0 for her, with neither rater at, say, 10.
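The two examples above can be tried out with made-up numbers. In this sketch, all scores and tallies are hypothetical: seven people take a depression measure twice (one of them changes a lot between occasions), and two raters count smiles for six wives. The helper `pearson_r` is written out by hand so the block runs without any add-on libraries.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Test-retest: hypothetical depression scores for the same seven people.
# The fifth person's score changes a lot (9 to 21); the others stay close.
time1 = [5, 22, 14, 30, 9, 18, 25]
time2 = [7, 20, 15, 27, 21, 16, 26]
test_retest_r = pearson_r(time1, time2)

# Inter-rater: hypothetical smile tallies for six wives, one tally per rater.
# The raters differ slightly, but neither records 0 for a frequent smiler.
rater_a = [12, 0, 7, 3, 15, 1]
rater_b = [11, 1, 8, 3, 14, 0]
inter_rater_r = pearson_r(rater_a, rater_b)

print(f"test-retest r = {test_retest_r:.2f}")
print(f"inter-rater r = {inter_rater_r:.2f}")
```

Both correlations come out large and positive. The one person with very different Time 1 and Time 2 scores pulls the test-retest value down somewhat, but the overall trend still dominates.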
Reliability correlations tend to be much larger than validity correlations. Why might this be so?
Evidence a test is measuring what it intends to measure (from most to least important, in Dr. Reifman's view)

| TYPE | DEFINITION | EXAMPLE |
|---|---|---|
| Predictive | Test scores should correlate with real-world outcomes | SAT (V) & first-year grades correlation = .36; SAT (M) & first-year grades correlation = .35 |
| Construct (convergent) | Test should correlate with other similar measures | SAT should correlate with other academic ability tests |
| Construct (discriminant) | Test should not correlate with irrelevant tests | SAT should not correlate with political attitudes |
| Content | Covers the necessary range of material | Different areas of math and verbal abilities should be covered |
| Face | Items look like they are covering proper topics | Math test should not have history items |
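The three correlation-based kinds of validity evidence in the chart can be sketched with invented data. Everything below is hypothetical (the variable names, the seven test-takers, and all scores); none of it comes from real SAT results. The pattern to look for is a positive correlation with a real-world outcome, a strong correlation with a similar test, and a near-zero correlation with an irrelevant measure.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for seven test-takers.
ability_test = [400, 450, 500, 550, 600, 650, 700]     # the test being validated
first_year_gpa = [2.9, 2.4, 3.1, 2.8, 3.5, 3.0, 3.6]   # real-world outcome
other_ability_test = [18, 21, 20, 25, 27, 26, 31]      # a similar measure
political_attitude = [3, 6, 2, 5, 5, 2, 5]             # an irrelevant measure

predictive_r = pearson_r(ability_test, first_year_gpa)
convergent_r = pearson_r(ability_test, other_ability_test)
discriminant_r = pearson_r(ability_test, political_attitude)

print(f"predictive   (test vs. later grades):      r = {predictive_r:.2f}")
print(f"convergent   (test vs. similar test):      r = {convergent_r:.2f}")
print(f"discriminant (test vs. irrelevant scale):  r = {discriminant_r:.2f}")
```

Content and face validity are left out of the sketch because they rest on judgments about the items rather than on a correlation.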
Let's revisit the question of how to measure happiness, from the introductory measurement lecture.
Source for SAT validity coefficients: David Owen (with Marilyn Doerr), None of the Above: The Truth Behind the SATs (1999, revised and updated edition; p. 197)
Chart to Summarize Reliability and Validity
| RELIABILITY (test-retest preferred, if possible) | | |
|---|---|---|
| Test | Correlated With | Repeat Administration of Test to Same Persons |
| If only one testing session available, correlate items with each other (internal consistency). If a behavioral observation, correlate two judges' scores of same videotapes (inter-rater). | | |
| VALIDITY (predictive preferred, if possible) | | |
| Test | Correlated With | Real-World Behavior |
| If only one testing session available, correlate test with other established tests. | | |
In class discussion, Dr. Reifman asked the class how one might try to validate the state examination for barbers and cosmetologists. In other words, we would correlate barbers' and stylists' scores on the exams to what real-world outcomes? Students in a previous class came up with excellent suggestions, such as customer-satisfaction surveys and observing how often the same customers came back to the same barber/stylist. Also, the barber/stylist's work could be judged by experts, such as Vidal Sassoon.
Real-world examples:
How eHarmony has attempted to validate its measures used for matching singles (and another article about whether online matchmaking companies have the scientific validity to back up their claims; thanks to Dr. Niehuis for sending me the article).
Does the Wonderlic intelligence test, which is given to football players coming out of college at a camp where they work out for NFL team scouts, show validity in predicting on-the-field success once the players begin their pro careers? (the linked study looks at quarterbacks)
Does a popular computerized test for racial bias show reliability and validity?
...and a Song
Reliable and Valid
Lyrics by Alan Reifman
(May be sung to the tune of “Don’t Stop (Thinking About Tomorrow),” Christine McVie, popularized by Fleetwood Mac)
When selecting a questionnaire,
Psychometrics have to be sound,
You can make your own, if you have to,
But try to use one already around,
Make… it… re-liable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
(Guitar solo)
To assess re-li-a-bility,
Use test-retest with two occasions,
Use alpha for a one-time test, and,
Inter-rater for observations,
Make… it… re-liable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
(Guitar solo)
To assess a test’s validity,
There are many forms to make your case,
They may or may not be statistical,
Predictive, construct, content and face,
Make… it… re-liable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
Make… it… re-liable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
Ooh, make your tests sound,
Ooh, make your tests sound,…
(Fade out)