Information-based decisions should be founded on the most accurate data. One prerequisite
of accurate data is that the data were collected using a reliable testing or
measurement methodology. Reliability can be defined as "the statistical reproducibility of measurements, including the testing of instrumentation or techniques to obtain reproducible results" (https://www.ncbi.nlm.nih.gov/mesh/68015203).
Another prerequisite of
accurate data is validity, which is discussed in another post on this blog.
So, since reliability is part of accuracy, one should consider what makes a test accurate. One component of accuracy (and reliability) is measurement error. In fact, a test or measurement is rarely (if ever) perfectly reliable. Before deciding whether a test is accurate, one must set a threshold of acceptable measurement error. Consider the very objective
and accurate measurement of body mass using a calibrated, digital scale.
What can result in measurement error? Confounding variables like time of
day of measurement, hydration, amount of clothing worn, and even elevation above
sea level could result in measurement error. Measurement error can be defined as the difference
between the true value (very often unknown) and the observed/measured value.
Measurement error is related to bias. Bias is defined as "any deviation of results or inferences from the truth, or processes leading to such deviation; bias can result from several sources: one-sided or systematic variations in measurement from the true value (systematic error), flaws in study design, deviation of inferences, interpretations, or analyses based on flawed data or data collection, etc. There is no sense of prejudice or subjectivity implied in the assessment of bias under these conditions" (https://www.ncbi.nlm.nih.gov/mesh/?term=outcome+measurement+error).
Two kinds of measurement error
exist. Systematic measurement errors are predictable and cause measured scores to be
consistently lower or higher than true scores. For example, a stadiometer
can be used to measure an individual's height. If the stadiometer has an
instrumentation flaw and consistently measures a person's height 1 centimeter
more than a person's true height, then the measurement method contains systematic error. Systematic error is considered constant and predictable and thus does not impact the reliability of a measurement. Systematic error is more of a concern for the validity of testing, because the measured value is not representative of the actual value. Since systematic
error is predictable, the measurement error associated with systematic error
can be addressed. In the case of the stadiometer, one could measure the
height of a person and subtract 1 centimeter from the measured score to get the
true score.
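As a minimal sketch (in Python), assuming the offset is known to be exactly 1 centimeter, correcting for this kind of systematic error amounts to a simple subtraction:

```python
# Minimal sketch: correcting a known, constant systematic error.
# Assumes the stadiometer consistently over-measures by exactly 1 cm.
SYSTEMATIC_OFFSET_CM = 1.0  # hypothetical, known calibration offset

def corrected_height(measured_cm: float) -> float:
    """Subtract the known systematic error from the measured value."""
    return measured_cm - SYSTEMATIC_OFFSET_CM

print(corrected_height(172.0))  # 171.0 -- the estimated true height
```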
A second type of measurement error is random
error. Random error is due
to chance and is unpredictable. Such errors may result in a measured value that is higher or lower than the true value. Consider the stadiometer example for measuring height. While measuring a person's height, the individual may move slightly and cause measurement error. Reliability of measurement focuses on the degree of random error associated with a testing methodology. The reliability of a test is inversely related to the degree of random error: the greater the random error, the less reliable the test.
Measurement error can arise from one or more of the following three sources: 1) the individual performing the test/measurement, 2) the instrument being used to measure the variable, and 3)
variability of the properties of the measured variable. Sources of
measurement error can be minimized by meticulous planning, creating a testing
protocol, training on measurement procedures, objective and operational
definitions in a testing protocol, and proper calibration of equipment.
An example of a measurement that may consist of measurement error is heart rate
assessment. Heart rate data are variable, and one could argue that heart rate is a relatively unstable measurement. A variety of confounding variables can make the measurement of heart rate unreliable. For example, the position of the individual (standing, sitting, or supine) could result in a different heart rate measurement. Other variables that can cause variation of heart rate are caffeine consumption, medications, and physical activity. A heart rate testing protocol should be followed to control for such confounding variables. Also, one could argue that taking multiple measurements and reporting the mean score will improve the stability of the measurement, as sketched below.
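The following minimal Python sketch illustrates that point under a simplifying assumption: each reading equals a hypothetical true heart rate plus zero-mean random noise, so the mean of several readings tends to land closer to the true value than any single reading does.

```python
# Minimal sketch: averaging repeated readings to reduce random error.
# Assumption: each reading = true value + zero-mean random noise.
import random

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_HR = 62.0  # hypothetical true resting heart rate (beats per minute)

def noisy_reading() -> float:
    """Simulate one reading contaminated with random error of up to +/- 4 bpm."""
    return TRUE_HR + random.uniform(-4.0, 4.0)

single = noisy_reading()
mean_of_ten = sum(noisy_reading() for _ in range(10)) / 10

print(f"single reading:      {single:.1f} bpm")
print(f"mean of 10 readings: {mean_of_ten:.1f} bpm")
```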
To have a better understanding of
reliability, one should understand the concepts of correlation and
agreement. Correlation can be defined as the degree
of association between two sets of data.
In a study that investigates the reliability of a test, researchers could perform
a test on a group of study participants at one point in time (initial test),
repeat the test at a different point in time (retest), and analyze the
correlation between the initial test data set and the retest data set. For a reliable test, higher initial test scores would be associated with higher retest scores, and lower initial test scores with lower retest scores.
Consider the following two sets of
measurements.
Study Participant    Initial Test    Retest
A                    1               2
B                    3               4
C                    5               6
D                    7               8
E                    9               10
[Scatterplot: initial test scores vs. retest scores with a line of best fit, visually representing the correlation between the initial test data and retest data.]
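A minimal Python sketch (using numpy and matplotlib, with the tabulated scores typed in directly) that reproduces such a plot might look like this:

```python
# Minimal sketch: scatterplot of initial test vs. retest scores
# with a least-squares line of best fit.
import matplotlib.pyplot as plt
import numpy as np

initial = np.array([1, 3, 5, 7, 9])   # initial test scores (participants A-E)
retest = np.array([2, 4, 6, 8, 10])   # retest scores

slope, intercept = np.polyfit(initial, retest, 1)  # line of best fit

plt.scatter(initial, retest, label="test-retest pairs")
plt.plot(initial, slope * initial + intercept, label="line of best fit")
plt.xlabel("Initial test score")
plt.ylabel("Retest score")
plt.legend()
plt.show()
```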
We assume that any variations in
measurements are due to random error. If systematic errors occur, the
correlation will not be affected because initial test scores will still be
associated with retest scores, in relative terms.
However, the results of a reliable test should be reproducible; that is, scores from repeated testing should be in agreement with each other. The data above are perfectly correlated, but they are not in agreement: each retest score is 1 point higher than the corresponding initial test score.
Consider the following two sets of
measurements.
Study Participant    Initial Test    Retest
A                    1               1
B                    3               3
C                    5               5
D                    7               7
E                    9               9
The data above are perfectly correlated and
in perfect agreement. These data suggest that this test has perfect
reliability. So, the decision as to whether a test or measurement method is reliable should be based on the extent of both correlation and agreement between repeated tests, as the sketch below illustrates.
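As a minimal sketch (Python with scipy, using the two data sets above), one could compute the Pearson correlation alongside the mean test-retest difference as a simple index of agreement. The first data set is perfectly correlated but carries a constant 1-point difference; the second is both perfectly correlated and in perfect agreement:

```python
# Minimal sketch: correlation vs. agreement for the two data sets above.
from scipy.stats import pearsonr

initial = [1, 3, 5, 7, 9]
retest_offset = [2, 4, 6, 8, 10]  # first data set: every retest score is +1
retest_exact = [1, 3, 5, 7, 9]    # second data set: identical scores

for label, retest in [("first", retest_offset), ("second", retest_exact)]:
    r, _ = pearsonr(initial, retest)
    mean_diff = sum(rt - it for it, rt in zip(initial, retest)) / len(initial)
    print(f"{label} data set: r = {r:.2f}, mean difference = {mean_diff:.1f}")

# Expected output:
# first data set: r = 1.00, mean difference = 1.0   (correlated, not in agreement)
# second data set: r = 1.00, mean difference = 0.0  (correlated and in agreement)
```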
Errors in measurements may be due to one or
more sources and cause one to make incorrect inferences as to whether or not a
test is reliable. One possible source of measurement error is the
interval between testing periods. Intervals should be large enough so
that the effects of confounding variables like fatigue and learning are
minimized or nullified. Yet, intervals should be small enough so that a
true change in a measured variable does not occur. An example of
measurement error due to inadequate intervals between testing periods is
measuring acute pain related to orthopaedic trauma over an extended period of
time. A true improvement in pain may result in data that are uncorrelated
or that lack agreement between the times at which pain is assessed.
Another issue that can make reliability
challenging (if not impossible) to determine is the effect of the first test on
the second test, which is referred to as carryover or testing effects.
Carryover effects occur when the act of repeated measurement itself changes the values obtained in subsequent tests. Carryover effects can occur
with repeated physical performance testing. The measurement of physical
performance could change between tests due to a motor learning effect (practice
effect). One way to minimize carryover effects is to require a person to
perform a series of familiarization tests, where measurements are compared
after the outcome measure has stabilized. Once the measurement has
stabilized, an initial test can be performed and compared to retesting.
Human error can also result in measurement
error. Many measurements require an individual to administer a
test. The individual who performs a test or collects the measurement data
can be referred to as a rater. Because of the potential for rater-related measurement error, many studies investigate the intra-rater or inter-rater reliability of a testing methodology. Intra-rater reliability is the ability of one person to conduct a test or collect measurement data in a reproducible manner. Inter-rater reliability concerns the consistency of measurements between different raters.
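As a minimal sketch (Python with scipy, on entirely hypothetical height data), the same correlation-plus-agreement idea can give a rough picture of inter-rater reliability: two raters measure the same participants, and their scores are compared.

```python
# Minimal sketch: inter-rater comparison on hypothetical data.
from scipy.stats import pearsonr

# Hypothetical heights (cm) of five participants, measured by two raters.
rater_a = [170.2, 165.8, 181.0, 158.4, 175.6]
rater_b = [170.5, 166.0, 180.6, 158.9, 175.2]

r, _ = pearsonr(rater_a, rater_b)
mean_diff = sum(b - a for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"inter-rater r = {r:.3f}, mean difference = {mean_diff:.2f} cm")
```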