Thursday, November 15, 2018

Reliability of Tests and Measurements


Certainly, information-based decisions should be founded on the most accurate data.  One prerequisite of accurate data is that the data were collected using a reliable testing or measurement methodology.  Reliability can be defined as "the statistical reproducibility of measurements, including the testing of instrumentation or techniques to obtain reproducible results" (https://www.ncbi.nlm.nih.gov/mesh/68015203).
Another prerequisite of accurate data is validity, which is discussed in another post on this blog.

So, since reliability is part of accuracy, one should consider what makes a test accurate.  One component of accuracy (and reliability) is measurement error.  The fact is that rarely (if ever) is a test or measurement perfectly reliable.  Before deciding whether or not a test is accurate, one must set a threshold of acceptable measurement error.  Consider the very objective and accurate measurement of body mass using a calibrated, digital scale.  What can result in measurement error?  Confounding variables like time of day of measurement, hydration, amount of clothing worn, and even elevation above sea level could result in measurement error.  Measurement error can be defined as the difference between the true value (very often unknown) and the observed/measured value.  Measurement error is related to bias.  Bias is defined as any deviation of results or inferences from the truth, or processes leading to such deviation; bias can result from several sources: one-sided or systematic variations in measurement from the true value (systematic error), flaws in study design, deviation of inferences, interpretations, or analyses based on flawed data or data collection, etc.  There is no sense of prejudice or subjectivity implied in the assessment of bias under these conditions (https://www.ncbi.nlm.nih.gov/mesh/?term=outcome+measurement+error).

Two kinds of measurement error exist.  Systematic measurement errors are predictable and cause measured scores to be consistently lower or higher than true scores.  For example, a stadiometer can be used to measure an individual's height.  If the stadiometer has an instrumentation flaw and consistently measures a person's height as 1 centimeter more than the person's true height, then the measurement method contains systematic error.  Systematic error is considered constant and predictable and thus does not affect the reliability of a measurement.  Systematic error is more relevant to the validity of testing, because the measured value is not representative of the actual value.  Since systematic error is predictable, the measurement error associated with it can be addressed.  In the case of the stadiometer, one could measure the height of a person and subtract 1 centimeter from the measured score to get the true score.
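The stadiometer correction can be sketched in a few lines of Python.  This is only an illustration: the measured heights are invented, and the function name correct_systematic_error is hypothetical.  The 1 centimeter offset is the one from the example above.

```python
# Known, constant instrument offset from the stadiometer example (reads 1 cm too high).
KNOWN_OFFSET_CM = 1.0


def correct_systematic_error(measured_cm, offset_cm=KNOWN_OFFSET_CM):
    """Subtract a known, constant instrument offset from a measured height."""
    return measured_cm - offset_cm


# Hypothetical measured heights in centimeters (invented for illustration).
measured_heights_cm = [171.0, 165.5, 182.0]
corrected_heights_cm = [correct_systematic_error(h) for h in measured_heights_cm]

print(corrected_heights_cm)
```

The correction only works because the offset is constant and known.  Random error, discussed next, cannot be removed this way.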

A second type of measurement error is random error.  Random error is due to chance and is unpredictable.  Such errors may result in a measured value that is higher or lower than the true value.  Consider the stadiometer example for measuring height.  While a person's height is being measured, the individual may move slightly and cause measurement error.  Reliability of measurement focuses on the degree of random error associated with a testing methodology.  The reliability of a test is inversely related to the degree of random error: the less random error, the more reliable the test.

Measurement error can arise from one or more of the following three sources:  1) the individual performing the test/measurement, 2) the instrument being used to measure the variable, and 3) variability of the properties of the measured variable.  Sources of measurement error can be minimized by meticulous planning, creating a testing protocol, training on measurement procedures, using objective and operational definitions in the testing protocol, and properly calibrating equipment.  An example of a measurement that is prone to measurement error is heart rate assessment.  Heart rate data are variable, and one could argue that heart rate is a relatively unstable measurement.  A variety of confounding variables can make the measurement of heart rate unreliable.  For example, the position of the individual (standing, sitting, or supine) could result in a different heart rate measurement.  Other variables that can cause variation in heart rate are caffeine consumption, medications, and physical activity.  A heart rate testing protocol should be followed to control for such confounding variables.  Also, one could argue that taking multiple measurements and reporting the mean score will improve the stability of the measurement.
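As a rough illustration of reporting the mean of multiple measurements, the sketch below summarizes five repeated heart rate readings.  The readings are invented, and the variable names are assumptions made for this example.

```python
from statistics import mean, stdev

# Hypothetical repeated heart rate readings in beats per minute for one person,
# taken under the same protocol (values invented for illustration).
readings_bpm = [72, 76, 74, 70, 78]

# The mean is the single score that would be reported.
mean_bpm = mean(readings_bpm)

# The sample standard deviation describes how much individual readings vary.
spread_bpm = stdev(readings_bpm)

print(f"mean = {mean_bpm} bpm, standard deviation = {spread_bpm:.2f} bpm")
```

Any single reading here could be as low as 70 or as high as 78, while the mean sits in the middle of that range, which is why the mean of several readings is usually the more stable score to report.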

To have a better understanding of reliability, one should understand the concepts of correlation and agreement.  Correlation can be defined as the degree of association between two sets of data.  In a study that investigates the reliability of a test, researchers could perform a test on a group of study participants at one point in time (initial test), repeat the test at a different point in time (retest), and analyze the correlation between the initial test data set and the retest data set.  For a reliable test, higher initial test scores would be associated with higher retest scores, and lower initial test scores would be associated with lower retest scores.

Consider the following two sets of measurements.
Study Participant   Initial Test   Retest
A                   1              2
B                   3              4
C                   5              6
D                   7              8
E                   9              10

Below is a scatterplot with a line of best fit that visually represents the correlation between the initial test data and retest data.


We assume that any variations in measurements are due to random error.  If systematic errors occur, the correlation will not be affected because initial test scores will still be associated with retest scores, in relative terms.
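To make the point about correlation concrete, the sketch below computes the Pearson correlation coefficient for the initial test and retest scores of participants A through E.  Pearson's r is one common way to quantify correlation; the post does not name a specific statistic, so its use here is an assumption, and the function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Scores for participants A through E from the first table.
initial_test = [1, 3, 5, 7, 9]
retest = [2, 4, 6, 8, 10]

r = pearson_r(initial_test, retest)
print(f"r = {r:.2f}")
```

Every retest score is exactly 1 unit higher than the matching initial score, yet r is still 1: a constant shift changes the means but not the association between the two sets of scores.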

However, the results of a reliable test should be reproducible; that is, scores from repeated testing should be in agreement with each other.  The data above are perfectly correlated, but they are not in agreement: every retest score is 1 unit higher than the corresponding initial test score.

Consider the following two sets of measurements.
Study Participant   Initial Test   Retest
A                   1              1
B                   3              3
C                   5              5
D                   7              7
E                   9              9

The data above are perfectly correlated and in perfect agreement.  These data suggest that this test has perfect reliability.  So, the decision as to whether or not a test or measurement method is reliable should be based on the extent of both correlation and agreement between repeated tests.
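Agreement can be checked by looking at the differences between paired scores.  The sketch below does this for both data sets in this post.  Using the mean retest-minus-initial difference as a summary is an assumption made for illustration, and the function names are hypothetical.

```python
def paired_differences(initial, retest):
    """Return retest minus initial for each participant."""
    return [r - i for i, r in zip(initial, retest)]


def mean_difference(initial, retest):
    """Return the average retest-minus-initial difference."""
    diffs = paired_differences(initial, retest)
    return sum(diffs) / len(diffs)


# Participants A through E; the initial scores are the same in both tables.
initial_test = [1, 3, 5, 7, 9]
retest_first_table = [2, 4, 6, 8, 10]
retest_second_table = [1, 3, 5, 7, 9]

for label, retest in [("first table", retest_first_table),
                      ("second table", retest_second_table)]:
    diffs = paired_differences(initial_test, retest)
    print(label, diffs, "mean difference =", mean_difference(initial_test, retest))
```

Both data sets are perfectly correlated, but only the second has differences of zero for every participant.  The first shows a consistent difference of 1 unit, which correlation alone would not reveal.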

Errors in measurements may be due to one or more sources and can cause one to make incorrect inferences as to whether or not a test is reliable.  One possible source of measurement error is the interval between testing periods.  Intervals should be long enough that the effects of confounding variables like fatigue and learning are minimized or nullified.  Yet, intervals should be short enough that a true change in the measured variable does not occur.  An example of measurement error due to an inadequate interval between testing periods is measuring acute pain related to orthopaedic trauma over an extended period of time.  A true improvement in pain may result in data that are uncorrelated or lack agreement between the times at which pain is assessed.

Another issue that can make reliability challenging (if not impossible) to determine is the effect of the first test on the second test, which is referred to as a carryover or testing effect.  Carryover effects occur when the act of being measured changes the result of subsequent measurements.  They can occur with repeated physical performance testing, where performance could change between tests due to a motor learning effect (practice effect).  One way to minimize carryover effects is to require a person to perform a series of familiarization tests until the outcome measure has stabilized.  Once the measurement has stabilized, an initial test can be performed and compared to retesting.

Human error can also result in measurement error.  Many measurements require an individual to administer a test.  The individual who performs a test or collects the measurement data can be referred to as a rater.  Because of the potential for rater measurement error, many studies investigate the intra-rater or inter-rater reliability of a testing methodology.  Intra-rater reliability is the ability of one person to conduct a test or collect measurement data in a reproducible manner.  Inter-rater reliability concerns how consistent measurements are between different raters.
