Chapter 10: Measuring Research Variables

• Judging the quality of measures used in collecting research data: Validity and Reliability

Validity

• Indicates the degree to which the test, or instrument, measures what it is supposed to measure

• Refers to the soundness of the interpretation of the test; this is the most important consideration in measurement.

Four Basic Kinds of Validity

• Logical

• Content

• Criterion

• Construct

Logical Validity

• Sometimes referred to as Face Validity

• Validity is claimed when the measure obviously involves the performance being measured

Example: Static Balance Test consisting of balancing on  one foot has logical validity

Content Validity

Pertains to learning in educational settings

A test has content validity if it adequately samples what was covered in the course

Criterion Validity

Measurements used in research studies are frequently validated against some criterion.

Two types of criterion validity:

• Concurrent Validity

• Predictive Validity

Concurrent Validity

• Involves correlating an instrument with some criterion that is administered at about the same time (concurrently).

• Many physical performance tests are validated in this manner.

• Criterion measures include an already validated or accepted measure.

• Concurrent validity is usually employed when the researcher wishes to substitute a shorter, more easily administered test for a criterion that is more difficult to measure.

 

Example: VO2max is regarded as the most valid measure of cardiovascular fitness, but elaborate testing equipment and expertise are required to do the test.

The objective is to substitute another test that is much simpler, less expensive, and less time consuming. This has been done by Astrand and many others. Submaximal testing on a cycle ergometer or treadmill has been correlated with the VO2max test, and a high correlation has been found.

Predictive Validity

If the criterion is some later behavior and the test is used to predict that behavior, then Predictive Validity is the major concern.

Example: college entrance exams

Physical performance tests in youth athletes to predict elite athletic potential

Multiple regression tests of prediction

 

A multiple regression equation used to predict a variable in a new sample is usually less valid in that sample; it shows shrinkage (the validity coefficient decreases and can shrink substantially).

To estimate shrinkage, a technique called Cross Validation is used. The same test is given to a new sample from the same population to check whether the formula is accurate. Example: Test 200 people, and use the results of 100 to develop the prediction equation. Use the prediction equation on the other 100 to see how accurately it predicts the criterion for them. Correlate the predicted scores with the actual scores using Pearson r.
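The 200-person example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are randomly generated (hypothetical), and a single predictor stands in for the multiple regression equation so that the sketch needs nothing beyond the standard library. The helper names `fit_line` and `pearson_r` are made up for this sketch.

```python
import random
from math import sqrt

def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)

# Hypothetical data: 200 people, one predictor test and one criterion.
rng = random.Random(0)
test_scores = [rng.gauss(50, 10) for _ in range(200)]
criterion = [0.6 * t + rng.gauss(0, 8) for t in test_scores]

# Develop the prediction equation on the first 100 people.
slope, intercept = fit_line(test_scores[:100], criterion[:100])

# Apply it to the other 100, then correlate predicted with actual scores.
predicted = [slope * t + intercept for t in test_scores[100:]]
cross_validated_r = pearson_r(predicted, criterion[100:])
print(round(cross_validated_r, 2))
```

Comparing `cross_validated_r` with the correlation obtained in the development half gives a rough estimate of the shrinkage.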

Expectancy Tables

Scores on the GRE may not predict the attainment of advanced degrees in general very well.

However, scores on the GRE predict much better for those who attain Ph.D.s

Expectancy table can be used to predict the probability of some performance.

A two-way grid that provides the probability that individuals with a particular assessment score will attain some criterion score.

 

Table 10.1 presents a test that measures sportsmanship, the “A Jolly Good Show Inventory”, and shows how scores on the test relate to ratings of sportsmanship by judges who observed 60 students at play.
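A minimal sketch of how such a two-way grid could be built. The scores, ratings, score bands, and the function name `expectancy_table` are hypothetical and are not the values from Table 10.1.

```python
from collections import Counter

def expectancy_table(scores, ratings, score_bands, rating_levels):
    """Return {band_label: {rating: proportion}} for a two-way grid."""
    table = {}
    for label, low, high in score_bands:
        in_band = [r for s, r in zip(scores, ratings) if low <= s <= high]
        counts = Counter(in_band)
        total = len(in_band)
        table[label] = {
            level: (counts[level] / total if total else 0.0)
            for level in rating_levels
        }
    return table

# Hypothetical inventory scores and judges' ratings for eight students.
scores = [12, 15, 22, 25, 28, 33, 36, 38]
ratings = ["poor", "poor", "poor", "good",
           "good", "good", "excellent", "excellent"]
bands = [("10-19", 10, 19), ("20-29", 20, 29), ("30-39", 30, 39)]
levels = ["poor", "good", "excellent"]

table = expectancy_table(scores, ratings, bands, levels)
for label, row in table.items():
    print(label, {level: round(p, 2) for level, p in row.items()})
```

Each row gives the proportion of students in a score band who received each rating, which can be read as the probability of attaining that rating given the score.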

 

The relevance of the criterion and whether it is measured reliably are the key issues.

Construct Validity

Many human characteristics are theoretical constructs and not directly observable.

What is relevant is how a person who possesses a certain characteristic would behave under certain circumstances.

Construct Validity is the degree to which a test measures a hypothetical construct and is usually established by relating the test results to some behavior

 

Known difference method – the construct validity of a test for anaerobic power can be demonstrated by testing a group of sprinters and jumpers and comparing them with a group of distance runners. If the sprinters and jumpers score higher than the distance runners, this supports the construct validity of the test.

Reliability

A measure of the consistency and repeatability of a test.

A test cannot be considered valid if it is not reliable.

But a test can be reliable without being valid.

Test Reliability

Discussed in terms of the observed score: the test score obtained by an individual. This may not be a true assessment of the person’s ability because of the emotional or physical state of the performer, or measurement error in the instrumentation, test directions, etc.

An observed score consists of the true score and an error score.

One must try to minimize the error score.

Reliability Coefficient

Is a measure of the ratio of the True Score Variance to the Observed Score Variance

True Score Variance is never known, so it is estimated by subtracting the Error Variance from the Observed Score Variance.

Reliability coefficient reflects the degree to which the measurement is free of error variance.
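The ratio described above can be written as a short function. This is a sketch that assumes observed variance is the sum of true and error variance; the function name and the example variances are hypothetical.

```python
def reliability_coefficient(observed_variance, error_variance):
    """Estimate true-score variance divided by observed-score variance."""
    # True-score variance is estimated by subtraction, since it is never known.
    true_variance = observed_variance - error_variance
    return true_variance / observed_variance

# Hypothetical variances, chosen only to illustrate the arithmetic.
print(reliability_coefficient(10.0, 2.0))  # 0.8
```

The smaller the error variance relative to the observed variance, the closer the coefficient is to 1.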

Sources of Measurement Error

The participant – mood, motivation, fatigue, health, fluctuations in memory or performance, previous practice, specific knowledge, and familiarity with testing items.

The testing – lack of clarity or completeness in directions, how closely the instructions are followed, supplementary directions or motivations added, etc.

 

The scoring – competence, experience, and dedication of the scorers and the nature of the scoring itself. Carelessness and inattention to detail could produce error.

The instrumentation – inaccuracy, lack of calibration of mechanical and electronic equipment.

 

Reliability Using Correlation

The degree of reliability is expressed by a correlation coefficient. The Pearson r is often called the interclass correlation because it is a correlation between two different variables.

The intraclass correlation is used for reliability because the scores being compared are two (or more) scores on the same variable.

ANOVA

Is used to obtain the reliability coefficient.

Pearson r uses only two scores; sometimes there are more than two measures of a variable.

Intraclass Correlation provides estimates of systematic and error variance; systematic variance reflects, for example, the learning that takes place across trials.

Use a one-way ANOVA with repeated measures (Table 10.2).

 

R = (MSS – MSE ) / MSS

MSE = 1.27

R = (3.73 – 1.27) / 3.73 = .66

Or, one may discard the trials that are noticeably different from the others and run a second ANOVA and another F test.

If F is not significant, the above equation for R is used.

If F is still significant, more trials are discarded until the F is not significant.
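The R formula above can be checked with a short sketch. The first function simply applies R = (MSS – MSE) / MSS. The second computes the mean squares from a subjects-by-trials table of scores and is an assumption on my part: it pools the trials and residual sums of squares into one error term, which is only reasonable when the F for trials is not significant. The small data set is hypothetical.

```python
def intraclass_r(ms_subjects, ms_error):
    """R = (MSS - MSE) / MSS."""
    return (ms_subjects - ms_error) / ms_subjects

def mean_squares(data):
    """Mean squares from a subjects x trials table of scores.

    Assumption: the error term pools the trials and residual sums of
    squares (appropriate only when the F for trials is not significant).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subjects = k * sum((sum(row) / k - grand) ** 2 for row in data)
    trial_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_trials = n * sum((m - grand) ** 2 for m in trial_means)
    ss_residual = ss_total - ss_subjects - ss_trials
    ms_subjects = ss_subjects / (n - 1)
    ms_error = (ss_trials + ss_residual) / ((k - 1) + (n - 1) * (k - 1))
    return ms_subjects, ms_error

# Mean squares quoted in the notes.
print(round(intraclass_r(3.73, 1.27), 2))  # 0.66

# Hypothetical scores: 3 subjects x 2 trials.
ms_s, ms_e = mean_squares([[1, 2], [3, 4], [5, 6]])
print(intraclass_r(ms_s, ms_e))  # 0.9375
```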

Methods of establishing Reliability

Look at three coefficients of reliability:

1. Stability – determined by the test-retest method on separate days. Used with physical performance, but not paper-and-pencil tests. Intraclass correlation used.

 

2. Alternate forms – Used for paper-and-pencil tests where two forms of the test are constructed. 

a. Parallel form method is where two tests are given to the same individuals.

3. Internal consistency –

a. same-day test-retest used with physical performance tests

 

b. split-half technique used for written tests where the test is divided into two parts, and the two halves are correlated.  Usually odd and even numbered questions are separated into two forms of the test. 

The Spearman-Brown Prophecy formula is used to estimate reliability. 

Other methods to determine reliability from split halves are the Kuder-Richardson and Flanagan methods.
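For the split-half technique, the Spearman-Brown prophecy formula in its usual two-halves form steps the correlation between the halves up to an estimate for the full-length test. A minimal sketch follows; the function name and the half-test correlation of .70 are hypothetical.

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from the correlation between two halves."""
    return (2 * r_half) / (1 + r_half)

# Hypothetical correlation between odd- and even-numbered items.
print(round(spearman_brown(0.70), 2))  # 0.82
```

The stepped-up value is higher than the half-test correlation because a longer test is generally more reliable than either of its halves.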

Objectivity

Intertester reliability- the degree to which different testers can achieve the same scores on the same subjects.

The degree of objectivity is established by having more than one tester gather data on the same subjects; an intraclass correlation is then used to obtain the intertester reliability coefficient.

Observation of behavior

In physical education classes or in sports in real-world situations

A coding instrument is developed with categories into which various behaviors may be coded. 

Researchers are concerned about coder consistency

Interobserver Agreement = agreements ÷ (agreements + disagreements)
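The agreement formula can be applied to two observers' codes as in the sketch below. The behavior categories and coded intervals are hypothetical.

```python
def interobserver_agreement(codes_a, codes_b):
    """agreements / (agreements + disagreements) for two coders."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both observers must code the same intervals")
    agreements = sum(a == b for a, b in zip(codes_a, codes_b))
    disagreements = len(codes_a) - agreements
    return agreements / (agreements + disagreements)

# Hypothetical codes from two observers watching the same five intervals.
observer_1 = ["instruct", "manage", "instruct", "observe", "instruct"]
observer_2 = ["instruct", "manage", "observe", "observe", "instruct"]
print(interobserver_agreement(observer_1, observer_2))  # 0.8
```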

Standard Error of Measurement

Every test yields only observed scores, which are only estimates of true scores.

Test scores fall into a range that contains the true score.

SY·X = s √(1.00 – r)

 

s = standard deviation of the scores, r = reliability coefficient

Example of Standard Error of Measurement

Body fat measurement test where the s = 5.6% and the test-retest reliability is .83

SY·X = 5.6 √(1.00 – 0.83) = 2.3%

If a subject has an estimated % body fat of 22.4%, the true score would fall between 20.1% and 24.7% (22.4% ± 2.3%).

We are then 68% sure the true score is within this range.

 

Doubling the SY·X from 2.3% to 4.6% gives a range of 17.8% to 27.0%, within which we are about 95% sure the true score lies.

 

The standard error of measurement is governed by the variability of the test scores and the reliability of the test.
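The body fat example can be reproduced with a short sketch that multiplies the standard deviation by the square root of one minus the reliability. The function and variable names are my own.

```python
from math import sqrt

def standard_error_of_measurement(s, r):
    """s = standard deviation of the scores, r = reliability coefficient."""
    return s * sqrt(1.0 - r)

sem = standard_error_of_measurement(5.6, 0.83)
observed = 22.4  # estimated % body fat from the example

# One standard error either side (about 68%), two either side (about 95%).
band_68 = (observed - sem, observed + sem)
band_95 = (observed - 2 * sem, observed + 2 * sem)

print(round(sem, 1))                     # 2.3
print([round(v, 1) for v in band_68])    # [20.1, 24.7]
print([round(v, 1) for v in band_95])    # [17.8, 27.0]
```

A larger standard deviation or a lower reliability widens both bands, which is the sense in which the two quantities govern the standard error.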

Using Standard Scores to Compare Performance

A z score is a standard score that converts raw scores to units of standard deviation.

z = (X – M)/s; X = raw score, M = mean, s = standard deviation

Example: for vertical jump, M = 40 cm and s = 6 cm; for push-ups, M = 20 and s = 5

A score of 46 cm would have a z-score of:

    z = (46 cm – 40 cm)/6cm = 6cm/6cm = 1.0

A score of 25 pushups would have a z-score of:

    z = (25 – 20)/5 = 1.0

T-scales

Sets the mean at 50 and standard deviation at 10. 

T = 50 + 10z

Makes all scores positive.
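The z-score and T-scale conversions can be sketched with the vertical jump and push-up values from the example. The function names are hypothetical.

```python
def z_score(x, mean, sd):
    """Convert a raw score to standard deviation units."""
    return (x - mean) / sd

def t_score(z):
    """Rescale a z score to a mean of 50 and standard deviation of 10."""
    return 50 + 10 * z

jump_z = z_score(46, 40, 6)      # vertical jump, in cm
pushup_z = z_score(25, 20, 5)    # push-ups, in repetitions

print(jump_z, pushup_z)          # 1.0 1.0
print(t_score(jump_z))           # 60.0
```

Because both raw scores sit one standard deviation above their means, the two performances are equivalent on either scale even though the units differ.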

Rating Scales

Judges' ratings according to an established scale: Diving; Gymnastics

Self-rating scale:  RPE (Borg Scale)

Scales vary from numerical to check lists, etc.

Standardizing ratings of judges

 

Rating Errors

1.    Leniency – tendency for judges to be overly generous (e.g., peer evaluations)

2.    Central Tendency Error – inclination of the rater to give middle of the scale ratings and avoid the extremes

a.      Ego – an expert judge gives lower scores to really good performers

b.     Leave room for higher scores

3.    Halo Effect – Allowing previous impressions and knowledge about a certain performer to influence ratings; negative impressions can work against the performer

4. Proximity Errors – result of overly detailed rating scales or insufficient familiarity with the rating criteria. The rater tends to rate behaviors listed close together as more nearly the same than behaviors that are separated in the list.

5. Observer Bias – Varies with judges' prejudices and characteristics (e.g., racial, sexual, political, philosophical).

6. Observer expectation errors – A person who expects certain behaviors is inclined to look more carefully for those behaviors and interpret observations in a particular direction.

To eliminate or reduce rating errors, define the behavior to be rated as objectively as possible.

Measuring Knowledge

Analyzing Test Items

Purpose of Item Analysis is to determine which test items are suitable and which need to be rewritten or discarded.

1. Item Difficulty: divide the number of people who answer the question correctly by the total number of people answering that item.

    Example: 60 out of 80 answered the item correctly, so the difficulty index is 75%. The recommendation is to eliminate items with difficulty indexes below 10% or above 90%.

 

2. Item discrimination – did the test item discriminate between the people who did well on the test and those who did poorly?

 

Index of Discrimination = (nH  - nL ) / n

 

Where nH = number of high scorers who answered the item correctly, nL = number of low scorers who answered the item correctly, and n = total number in either the high or the low group.

 

Example: If there are 30 in the high group and 30 in the low group, and 20 high scorers and 10 low scorers answered an item correctly, the index would be (20 – 10)/30 = .33, or 33%.

Various percentages of the total group are used to form the high and low groups: 25%, 30%, or 33%.

Flanagan Method uses the upper and lower 27%

If approximately the same number of high scorers and low scorers answer an item correctly, the item is not discriminating.

 

An index of discrimination of over .20 is what is usually desired. 

A negative index of discrimination is unacceptable.
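Both item analysis indexes reduce to simple ratios, sketched below with the numbers from the examples in this section. The function and parameter names are hypothetical.

```python
def item_difficulty(n_correct, n_answering):
    """Proportion of people answering the item who got it right."""
    return n_correct / n_answering

def discrimination_index(n_high_correct, n_low_correct, group_size):
    """(nH - nL) / n, where n is the size of either the high or low group."""
    return (n_high_correct - n_low_correct) / group_size

# 60 of 80 people answered the item correctly.
print(item_difficulty(60, 80))                      # 0.75

# 20 of 30 high scorers and 10 of 30 low scorers answered correctly.
print(round(discrimination_index(20, 10, 30), 2))   # 0.33
```

An item that more low scorers than high scorers answer correctly yields a negative index, which is the unacceptable case noted above.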

Item Response Theory

Skip this section.