Chapter 10: Measuring Research Variables
• Judging the quality of measures used in collecting research data: validity and reliability
Validity
• Indicates the degree to which the test, or instrument, measures what it is supposed to measure
• Refers to the soundness of the interpretation of the test; this is the most important consideration in measurement
Four Basic Kinds of Validity
• Logical
• Content
• Criterion
• Construct
Logical Validity
• Sometimes referred to as face validity
• Validity is claimed when the measure obviously involves the performance being measured
Example: A static balance test that consists of balancing on one foot has logical validity.
Content Validity
Pertains to learning in educational settings.
A test has content validity if it adequately samples what was covered in the course.
Criterion Validity
Measurements used in research studies are frequently validated against some criterion.
Two types of criterion validity:
• Concurrent validity
• Predictive validity
Concurrent Validity
• Involves correlating an instrument with some criterion that is administered at about the same time (concurrently).
• Many physical performance tests are validated in this manner.
• Criterion measures include an already validated or accepted measure.
• Concurrent validity is usually employed when the researcher wishes to substitute a shorter, more easily administered test for a criterion that is more difficult to measure.
Example: VO2max is regarded as the most valid measure of cardiovascular fitness, but the test requires elaborate equipment and expertise. The objective is to substitute another test that is much simpler, less expensive, and less time consuming. This has been done by Astrand and many others. Submaximal tests on a cycle ergometer or treadmill have been correlated with the VO2max test, and a high correlation has been found.
Predictive Validity
If the criterion is some later behavior and the test is used to predict that behavior, then predictive validity is the major concern.
Examples: college entrance exams; physical performance tests in youth athletes used to predict elite athletic potential
Multiple regression tests of prediction
A multiple regression equation used to predict a variable in a new sample is usually less valid in that sample; it shows shrinkage (the validity coefficient shrinks substantially).
To estimate shrinkage, a technique called cross-validation is used. The same tests are given to a new sample from the same population to check whether the formula is accurate. Example: Test 200 people and use the results of 100 to develop the prediction equation. Apply the prediction equation to the other 100 to see how accurately it predicts the criterion for them. Correlate the predicted scores with the actual scores using Pearson r.
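The 200-person example can be sketched in Python as follows. This is only an illustration: the data are synthetic (generated with a fixed seed), the two predictors and their weights are assumptions, and NumPy's least-squares solver stands in for whatever regression software is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data (illustrative only): 200 people, 2 predictors, 1 criterion.
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Split into a development half and a cross-validation half.
X_dev, y_dev = X[:100], y[:100]
X_new, y_new = X[100:], y[100:]


def add_intercept(features):
    """Prepend a column of ones so the equation includes a constant term."""
    return np.column_stack([np.ones(len(features)), features])


# Develop the prediction equation on the first 100 (ordinary least squares).
coefs, *_ = np.linalg.lstsq(add_intercept(X_dev), y_dev, rcond=None)

# Apply the same equation to both halves; correlate predicted with actual.
r_dev = np.corrcoef(add_intercept(X_dev) @ coefs, y_dev)[0, 1]
r_new = np.corrcoef(add_intercept(X_new) @ coefs, y_new)[0, 1]

print(f"r in development sample:      {r_dev:.2f}")
print(f"r in cross-validation sample: {r_new:.2f}")
```

The difference between the two correlations is an estimate of shrinkage. With a single random split it can be small, or occasionally even negative, so it is best read as a rough check.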
Expectancy Tables
GRE scores may not predict the attainment of advanced degrees in general very well.
However, GRE scores predict much better when the comparison is limited to those who attain Ph.D.s.
An expectancy table can be used to predict the probability of some performance.
It is a two-way grid that provides the probability that individuals with a particular assessment score will attain some criterion score.
Table 10.1 is based on a test that measures sportsmanship, the “A Jolly Good Show Inventory.” It shows how the test relates to ratings of sportsmanship by judges who observed 60 students at play.
The key issue is the relevance of the criterion and whether it is measured reliably.
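A two-way grid of this kind can be built by counting. The sketch below is a minimal illustration: the score bands, rating labels, and observations are hypothetical and are not the data from Table 10.1.

```python
from collections import Counter

# Hypothetical (test score band, judge rating) pairs; NOT the Table 10.1 data.
observations = [
    ("high", "good"), ("high", "good"), ("high", "good"), ("high", "fair"),
    ("middle", "good"), ("middle", "fair"), ("middle", "fair"), ("middle", "poor"),
    ("low", "fair"), ("low", "poor"), ("low", "poor"), ("low", "poor"),
]


def expectancy_table(pairs):
    """Return {score_band: {criterion_level: proportion}} from (band, level) pairs."""
    cell_counts = Counter(pairs)
    row_totals = Counter(band for band, _ in pairs)
    table = {}
    for (band, level), count in cell_counts.items():
        # Proportion of people in this score band who reached this criterion level.
        table.setdefault(band, {})[level] = count / row_totals[band]
    return table


table = expectancy_table(observations)
for band in ("high", "middle", "low"):
    print(band, table[band])
```

Each row sums to 1, so a row reads as the estimated probability of each criterion rating for people who scored in that band.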
Construct Validity
Many human characteristics are theoretical constructs and are not directly observable.
What is relevant is how a person who possesses a certain characteristic would behave under certain circumstances.
Construct validity is the degree to which a test measures a hypothetical construct; it is usually established by relating the test results to some behavior.
Known difference method: the construct validity of a test for anaerobic power can be demonstrated by testing a group of sprinters and jumpers and comparing them with a group of distance runners. If the sprinters and jumpers score higher than the distance runners, the result supports the construct validity of the test.
Reliability
A measure of the consistency and repeatability of a test.
A test cannot be considered valid if it is not reliable.
But a test can be reliable without being valid.
Test Reliability
Discussed in terms of the observed score, which is the test score obtained by an individual.
The observed score may not be a true assessment of the person’s ability because of the emotional or physical state of the performer or measurement error in the instrumentation, test directions, etc.
An observed score consists of the true score and the error score.
One must try to minimize the error score.
Reliability Coefficient
The ratio of the true score variance to the observed score variance.
True score variance is never known, so it is estimated by subtracting the error variance from the observed score variance.
The reliability coefficient reflects the degree to which the measurement is free of error variance.
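Written out, the relationship looks like this. The symbols are assumptions introduced here for illustration: s_o^2 is the observed score variance, s_t^2 the true score variance, s_e^2 the error variance, and r the reliability coefficient.

```latex
s_o^2 = s_t^2 + s_e^2
\qquad\Longrightarrow\qquad
r = \frac{s_t^2}{s_o^2} = \frac{s_o^2 - s_e^2}{s_o^2}
```

When the error variance is zero, r = 1.00; as the error variance approaches the observed score variance, r approaches 0.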
Sources of Measurement Error
The participant – mood, motivation, fatigue, health, fluctuations in memory or performance, previous practice, specific knowledge, and familiarity with the test items.
The testing – lack of clarity or completeness in the directions, how closely the instructions are followed, supplementary directions or motivation added, etc.
The scoring – competence, experience, and dedication of the scorers, and the nature of the scoring itself. Carelessness and inattention to detail can produce error.
The instrumentation – inaccuracy and lack of calibration of mechanical and electronic equipment.
Reliability Using Correlation
The degree of reliability is expressed by a correlation coefficient. The Pearson r is often called an interclass correlation because it is a correlation between two different variables.
Intraclass correlation is used for reliability because the two scores are measures of the same variable.
ANOVA
Used to obtain the reliability coefficient.
Pearson r uses only two scores, but sometimes there are more than two measures of a variable.
Intraclass correlation provides estimates of systematic and error variance; systematic variance includes, for example, the learning that takes place across trials.
Use a one-way ANOVA with repeated measures (Table 10.2).
R = (MSS – MSE) / MSS, where MSS = mean square for subjects and MSE = mean square for error
MSS = 3.73, MSE = 1.27
R = (3.73 – 1.27) / 3.73 = .66
Alternatively, one may discard the trials that are noticeably different from the others and run a second ANOVA with another F test.
If F is not significant, the above equation for R is used.
If F is still significant, more trials are discarded until F is not significant.
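The calculation of R can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the mean squares are taken as given (the values quoted above for Table 10.2) rather than computed from raw trial scores.

```python
def intraclass_r(ms_subjects, ms_error):
    """Intraclass reliability: R = (MS_S - MS_E) / MS_S."""
    if ms_subjects <= 0:
        raise ValueError("ms_subjects must be positive")
    return (ms_subjects - ms_error) / ms_subjects


# Mean squares quoted in the notes for Table 10.2.
R = intraclass_r(ms_subjects=3.73, ms_error=1.27)
print(round(R, 2))  # 0.66
```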
Methods of Establishing Reliability
Look at three coefficients of reliability:
1. Stability – determined by the test-retest method on separate days. Used with physical performance tests, but not paper-and-pencil tests. Intraclass correlation is used.
2. Alternate forms – used for paper-and-pencil tests where two forms of the test are constructed.
a. The parallel-form method is where the two forms are given to the same individuals.
3. Internal consistency –
a. Same-day test-retest is used with physical performance tests.
b. The split-half technique is used for written tests: the test is divided into two parts, and the two halves are correlated. Usually the odd- and even-numbered questions are separated into two forms of the test.
The Spearman-Brown prophecy formula is used to estimate the reliability of the full-length test.
Other methods to determine reliability from split halves are Kuder-Richardson and Flanagan.
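The split-half procedure with the Spearman-Brown step-up can be sketched as below. The notes do not write the formula out; the standard form for a test doubled in length, r_full = 2r_half / (1 + r_half), is assumed here. The response matrix is hypothetical and the function names are illustrative.

```python
import numpy as np


def spearman_brown(r_half):
    """Step up a half-test correlation to estimate full-length test reliability."""
    return (2 * r_half) / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: rows are test takers, columns are items (1 = correct, 0 = wrong)."""
    item_scores = np.asarray(item_scores, dtype=float)
    odd_total = item_scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_total = item_scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_total, even_total)[0, 1]
    return spearman_brown(r_half)


# Hypothetical responses from six test takers on an eight-item test.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

print(round(spearman_brown(0.6), 2))  # 0.75
print(round(split_half_reliability(responses), 2))
```

The step-up is needed because each half contains only half the items, and a shorter test is generally less reliable than the full test.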
Objectivity
Intertester reliability – the degree to which different testers can achieve the same scores on the same subjects.
The degree of objectivity is established by having more than one tester gather data on the same subjects; an intraclass correlation is then used to obtain the intertester reliability coefficient.
Observation of Behavior
Takes place in physical education classes or in sports, in real-world situations.
A coding instrument is developed with categories into which various behaviors may be coded.
Researchers are concerned about coder consistency.
Interobserver Agreement = agreements ÷ (agreements + disagreements)
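The agreement formula can be applied to two coders' records as in the following sketch. The behavior categories and the ten coded intervals are hypothetical.

```python
def interobserver_agreement(coder_a, coder_b):
    """IOA = agreements / (agreements + disagreements) over paired codes."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must code the same number of intervals")
    if not coder_a:
        raise ValueError("at least one coded interval is required")
    agreements = sum(a == b for a, b in zip(coder_a, coder_b))
    disagreements = len(coder_a) - agreements
    return agreements / (agreements + disagreements)


# Hypothetical codes for ten observation intervals.
coder_a = ["active", "waiting", "active", "instruction", "active",
           "waiting", "active", "active", "instruction", "waiting"]
coder_b = ["active", "waiting", "active", "active", "active",
           "waiting", "active", "waiting", "instruction", "waiting"]

print(interobserver_agreement(coder_a, coder_b))  # 0.8
```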
Standard Error of Measurement
Every test yields only observed scores, which are estimates of true scores.
Test scores fall into a range that contains the true score.
$S_{Y \cdot X} = s\sqrt{1.00 - r}$
s = standard deviation of the scores, r = reliability coefficient
Example of Standard Error of Measurement
Body fat measurement test where the s = 5.6% and the test-retest
reliability is .83
$S_{Y \cdot X} = 5.6\sqrt{1.00 - 0.83} = 2.3\%$
If a subject has an estimated body fat of 22.4%, the true score would fall between 20.1% and 24.7% (22.4% ± 2.3%).
We are then 68% sure the true score is within this range.
Doubling the $S_{Y \cdot X}$ from 2.3% to 4.6% gives a range of 17.8% to 27.0%, within which we are 95% sure the true score falls.
The standard error of measurement is governed by the variability of the
test scores and the reliability of the test.
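The body fat example can be reproduced with a short Python sketch. The function names are illustrative, and the standard error is rounded to one decimal before the ranges are built, as in the worked example.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = s * sqrt(1.00 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def true_score_band(observed, sem, multiplier=1):
    """Observed score plus or minus `multiplier` standard errors."""
    return observed - multiplier * sem, observed + multiplier * sem


sem = round(standard_error_of_measurement(sd=5.6, reliability=0.83), 1)
print(sem)  # 2.3

low68, high68 = true_score_band(22.4, sem, multiplier=1)
low95, high95 = true_score_band(22.4, sem, multiplier=2)
print(round(low68, 1), round(high68, 1))  # 20.1 24.7
print(round(low95, 1), round(high95, 1))  # 17.8 27.0
```

Raising the reliability or reducing the spread of scores narrows both ranges, which is the point made in the final sentence above.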
Using Standard Scores to Compare Performance
A z score is a standard score that converts raw scores to units of standard deviation.
z = (X – M)/s, where X = raw score, M = mean, s = standard deviation
Example: for vertical jump, M = 40 cm and s = 6 cm; for push-ups, M = 20 and s = 5
A score of 46 cm would have a z score of:
z = (46 cm – 40 cm)/6 cm = 6 cm/6 cm = 1.0
A score of 25 push-ups would have a z score of:
z = (25 – 20)/5 = 1.0
T-scales
Sets the mean at 50 and standard deviation at 10.
T = 50 + 10z
Makes all scores positive.
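Both conversions are easy to verify in code. The sketch below reuses the vertical jump and push-up figures from the example; the 34 cm jump in the last line is a hypothetical score added to show a below-average result.

```python
def z_score(raw, mean, sd):
    """z = (X - M) / s."""
    return (raw - mean) / sd


def t_score(z):
    """T = 50 + 10z (mean 50, standard deviation 10)."""
    return 50 + 10 * z


jump_z = z_score(46, 40, 6)    # vertical jump, cm
pushup_z = z_score(25, 20, 5)  # push-ups
print(jump_z, pushup_z)        # 1.0 1.0
print(t_score(jump_z))         # 60.0

# Hypothetical below-average jump: z is negative, but T stays positive.
print(t_score(z_score(34, 40, 6)))  # 40.0
```

Because both performances have z = 1.0, they are equally far above their respective means even though the raw units differ.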
Rating Scales
Judges' ratings according to an established scale: diving, gymnastics
Self-rating scale: RPE (Borg Scale)
Scales vary from numerical scales to checklists, etc.
Standardizing ratings of judges
Rating Errors
1. Leniency – tendency for judges to be overly generous (e.g., peer evaluations)
2. Central Tendency Error – inclination of the rater to give middle-of-the-scale ratings and avoid the extremes
a. Ego – an expert judge gives lower scores to really good performers
b. Leaving room for higher scores
3. Halo Effect – allowing previous impressions and knowledge about a certain performer to influence ratings; negative impressions can work against the performer
4. Proximity Errors – result from overly detailed rating scales or insufficient familiarity with the rating criteria. The rater tends to rate behaviors listed close together on the list as nearly the same, compared with behaviors that are separated on the list.
5. Observer Bias – varies with judges' prejudices and characteristics (e.g., racial, sexual, political, philosophical)
6. Observer Expectation Errors – a person who expects certain behaviors is inclined to look more carefully for those behaviors and to interpret observations in a particular direction.
To eliminate or reduce rating errors, define the behavior to be rated as
objectively as possible.
Measuring Knowledge
Analyzing Test Items
The purpose of item analysis is to determine which test items are suitable and which need to be rewritten or discarded.
1. Item difficulty: divide the number of people who answer the item correctly by the total number of people answering that item.
Example: 60 out of 80 answered the item correctly, so the difficulty index is 75%.
The recommendation is to eliminate items with a difficulty index below 10% or above 90%.
2. Item discrimination: did the test item discriminate between the people who did well on the test and those who did poorly?
Index of Discrimination = (nH – nL) / n
where nH = number of high scorers who answered the item correctly, nL = number of low scorers who answered the item correctly, and n = total number in either the high or the low group.
Example: With 30 in the high group and 30 in the low group, if 20 high scorers and 10 low scorers answered an item correctly, the index would be (20 – 10)/30 = .33, or 33%.
Various percentages are used to form the high and low groups: 25%, 30%, or 33%.
The Flanagan method uses the upper and lower 27%.
If approximately the same number of high scorers and low scorers answer an
item correctly, the item is not discriminating.
An index of discrimination of over .20 is what is usually desired.
A negative index of discrimination is unacceptable.
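The two indices and the screening rules above can be combined in a small sketch. The function names and the wording of the flags are illustrative assumptions; the cutoffs (10%, 90%, .20, and negative values) are the ones given in these notes.

```python
def item_difficulty(n_correct, n_answering):
    """Proportion of people answering the item who got it right."""
    return n_correct / n_answering


def discrimination_index(n_high_correct, n_low_correct, group_size):
    """(nH - nL) / n, where n is the size of either the high or the low group."""
    return (n_high_correct - n_low_correct) / group_size


def review_item(difficulty, discrimination):
    """Return a list of flags based on the cutoffs stated in the notes."""
    flags = []
    if difficulty < 0.10 or difficulty > 0.90:
        flags.append("difficulty outside 10%-90%: consider eliminating")
    if discrimination < 0:
        flags.append("negative discrimination: unacceptable")
    elif discrimination <= 0.20:
        flags.append("discrimination not above .20: weak item")
    return flags


difficulty = item_difficulty(60, 80)
discrimination = discrimination_index(20, 10, 30)
print(difficulty)                           # 0.75
print(round(discrimination, 2))             # 0.33
print(review_item(difficulty, discrimination))  # []
```

An empty list means the item passes both checks; in practice the flags would prompt a closer look at the item rather than automatic removal.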
Item Response Theory
Skip this section.