Monday, December 31, 2018

Quasi-Experimental Designs - One-group Intervention Studies


As I addressed in a previous post, a clinical trial is defined as a research study in which one or more human participants are prospectively assigned to one or more interventions (which may include placebo or other control) to evaluate the effects of those interventions on health-related biomedical or behavioral outcomes.  That post focused on clinical trials that entail at least two independent groups.  The present post describes clinical trials that utilize one-group designs.

An intervention study in which one group of participants undergoes repeated measurements before and after receiving one or more interventions is called a repeated measures design.  Because all participants receive the same interventions and the treatment effects are associated with changes within each participant, the repeated measures design is also referred to as a within-subjects design.

In repeated measures designs, participants/subjects act as their own control, which is considered a study design strength in that the potential influence of individual differences is controlled.  For example, age and gender characteristics remain constant.  Therefore, changes in outcomes are inferred to be due to treatment effects and not differences between participants.  Using study participants as their own control provides the most equal “comparison group”.


So, why is the randomized clinical trial considered the gold standard of intervention study designs?  Why isn’t the repeated measures design better?  One disadvantage of repeated measures studies is the possibility of a practice effect or learning effect.  Study participants can learn how to perform better on an outcome measure by performing the outcome measurement on a repeated basis.  Consider a repeated measures intervention study in which the outcome measure is physical function.  Study participants may appear to show an improvement in physical function, but improvements could be due to participants learning how to perform the physical function test through repeated practice and not due to the treatment.

Another potential disadvantage of the repeated measures design is the carryover effect.  Study participants are at risk of experiencing a carryover effect when they are exposed to multiple forms of treatment.  Consider a study where one group of participants is exposed to three different balance training treatments (treatment A, treatment B, and treatment C) and the outcome measure is balance.  If balance is measured after each treatment, treatment A could have a carryover effect on the balance measurements taken after the participants receive treatments B and C.  Thus, the separate effects of treatments A, B, and C cannot be isolated.

Although practice effects and carryover effects are possible limitations of repeated measures designs, investigators can incorporate methods to control such limitations.  Methods to control for these limitations often depend on the nature of the independent (treatment) and dependent (outcome measure) variables.  One such method, counterbalancing the order of treatments, is sketched below.  I encourage readers of this blog to post comments on how to control for these limitations in different types of studies!
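As one illustration (my own sketch, not a prescription), counterbalancing assigns participants different treatment orders so that carryover effects are distributed evenly across treatments rather than always following the same treatment.  Below is a minimal Python example that generates counterbalanced orders for the hypothetical treatments A, B, and C using a simple Latin square; the participant labels are placeholders.

```python
def latin_square_orders(treatments):
    """Cyclic rotations of the treatment list: each treatment appears
    once in every position across the full set of orders."""
    return [treatments[i:] + treatments[:i] for i in range(len(treatments))]

orders = latin_square_orders(["A", "B", "C"])
# -> [['A', 'B', 'C'], ['B', 'C', 'A'], ['C', 'A', 'B']]

# Assign each participant one of the counterbalanced orders in rotation.
for i, participant in enumerate(["P1", "P2", "P3", "P4", "P5", "P6"]):
    order = orders[i % len(orders)]
    print(f"{participant} receives treatments in order: {' -> '.join(order)}")
```

With three treatments and six participants, each treatment appears first, second, and third equally often, so a carryover effect from any one treatment is not confounded with a particular position in the sequence.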

Thursday, December 13, 2018

Experimental Designs - Clinical Trials


The National Institutes of Health has defined a clinical trial as a research study in which one or more human participants are prospectively assigned to one or more interventions (which may include placebo or other control) to evaluate the effects of those interventions on health-related biomedical or behavioral outcomes.  In terms of strength of study design, the randomized clinical trial is considered the strongest design to evaluate the cause-and-effect relationship of an intervention.

Clinical trials can be classified as either therapeutic or preventive.  Therapeutic trials investigate the effect of a treatment (independent variable) on an outcome (dependent variable).

Preventive trials assess if a treatment (independent variable) is effective in reducing the risk of developing a condition or disease (dependent variable).

Various experimental designs exist, and I will discuss them in future posts.  For now, I will focus on different types of clinical trial designs.

The “gold standard” of experimental designs is the randomized clinical trial (RCT).  The RCT design is also referred to as a pretest-posttest control group design.  This design is used to compare two or more groups that are created by random assignment, where one group receives the experimental treatment and the other group(s) serve as a control or comparison.  The different groups are sometimes referred to as treatment arms.  All groups undergo pre-testing before receiving treatment and post-testing afterwards.  During pre- and post-testing, outcomes data are collected from the study participants.  If the experimental group experiences a greater change in outcomes between pre- and post-testing than the control group, one can infer that the treatment caused an effect.

The information in previous posts on sampling, validity of intervention study design, and threats to study validity helps to explain why the RCT is a strong study design.  Below is a flowchart that illustrates an RCT.


http://www.consort-statement.org/consort-statement/flow-diagram


The posttest-only control group design is the same as the pretest-posttest control group design, except that pre-testing does not occur in either the experimental group or the control/comparison group.  In some situations, pre-testing is not possible.

Factorial designs are clinical trials that test the effect of more than one treatment, where potential interactions between the treatments are evaluated.
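As a hypothetical illustration (the factors, effect sizes, and sample size below are all invented), a 2x2 factorial trial crossing an exercise intervention with a diet intervention could be analyzed with a two-way ANOVA that includes an interaction term.  Here is a minimal Python sketch using the statsmodels package:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated 2x2 factorial trial: exercise (yes/no) crossed with diet (yes/no).
n_per_cell = 25
rows = []
for exercise in (0, 1):
    for diet in (0, 1):
        # Invented effects: each treatment helps, plus a small interaction.
        outcome = (50 + 5 * exercise + 3 * diet
                   + 2 * exercise * diet
                   + rng.normal(0, 4, n_per_cell))
        rows.append(pd.DataFrame({"exercise": exercise, "diet": diet,
                                  "outcome": outcome}))
df = pd.concat(rows, ignore_index=True)

# Two-way ANOVA with an interaction term (exercise:diet).
model = smf.ols("outcome ~ C(exercise) * C(diet)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

The interaction row of the ANOVA table addresses the question factorial designs are built to answer: whether the effect of one treatment depends on the presence of the other.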

Randomized block designs are clinical trials where investigators wish to control for a possible extraneous variable that may influence differences between groups.  In such cases, study participants are divided into homogeneous blocks.  For example, investigators may think that gender could influence the differences in effects between groups.  Each gender group could be randomly assigned to either the experimental group or the control group.  If 24 participants are included in the study (12 male and 12 female), the males could be randomly assigned to groups (6 males in each group) and the females could be randomly assigned to groups (6 females in each group).  Gender can now be considered an independent variable, and differences between males and females can be evaluated.
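Below is a minimal Python sketch of the block randomization described above; the participant labels are placeholders.

```python
import random

random.seed(42)  # fix the seed so the assignment is reproducible

def block_randomize(participants, groups=("experimental", "control")):
    """Shuffle one homogeneous block and split it evenly across groups."""
    shuffled = random.sample(participants, k=len(participants))
    half = len(shuffled) // 2
    return {groups[0]: shuffled[:half], groups[1]: shuffled[half:]}

# Hypothetical blocks of 12 males and 12 females, as in the example above.
males = [f"M{i}" for i in range(1, 13)]
females = [f"F{i}" for i in range(1, 13)]

assignments = {"male": block_randomize(males), "female": block_randomize(females)}
for block, groups in assignments.items():
    for group, members in groups.items():
        print(f"{block} -> {group}: {members}")
```

Randomizing within each block guarantees 6 males and 6 females per group, so gender cannot be unevenly distributed by chance.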

For more information about clinical trial designs, click on the following link.

Monday, December 3, 2018

Threats to Study Validity


Statistical Conclusion Validity
Statistical conclusion validity refers to the appropriate use of statistics for data analyses.  Examples of threats to statistical conclusion validity include:

1.  Low statistical power
2.  Violated assumptions of statistical tests
3.  Error rate
4.  Reliability of outcome measure procedures/tests

I have discussed reliability of outcome measures in a previous post.  I will be discussing statistical power, statistical test violations, and error rate in future posts.

Internal Validity
Internal validity refers to the degree to which observed changes in the dependent variable can be attributed to the independent variable rather than to confounding factors.  I have divided threats to internal validity into three categories.

1.  Single-Group Threats
2.  Multiple-Group Threats
3.  Social Threats  

Single-group threats include:  a) history, b) maturation, c) attrition, d) testing, e) instrumentation, and f) regression.

History is a threat in which the observed study results may be explained by events or experiences (confounding variables) other than the intervention/treatment.  For example, participation in other physical activities may affect the outcome of an exercise training study.

Maturation is a threat that is internal to the individual participant.  It is possible that mental or physical changes occur within the study participants that could account for the study results, simply due to the passage of time.  For example, a study that investigates the effect of a treatment on pain in patients with an acute orthopaedic injury may observe an improvement in pain due to the normal healing process and not the treatment.

Attrition (also referred to as drop-outs, withdrawal, and experimental mortality) is a threat related to participants withdrawing from a study before it is completed.  If participants withdraw from a study, randomization is negatively impacted and data analyses cannot be performed on all pre-treatment and post-treatment data.

Testing (especially multiple testing) can have a potential effect on a dependent measure.  The effect of conducting multiple tests can result in improvement in an outcome measure that is due to a testing effect and not the effect of an intervention.  Because of this potential testing effect, researchers should use tests that are considered reliable.

Instrumentation is a possible threat if an instrument is unreliable and provides data that are unstable and prone to measurement error.  Observed changes seen between observation points (i.e., pre-test and post-test) may also be due to changes in the testing procedure, including the instrument that is being used to collect data.

Regression (or regression toward the mean) is also related to the reliability of a test.  When an unreliable test is used for data collection, a statistical phenomenon sometimes occurs when extreme pre-intervention scores (for example, very high or very low scores) regress toward the group mean at post-intervention.  Again, a reliable test minimizes the threat of regression.  Previous research has identified this statistical phenomenon.
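Regression toward the mean can be demonstrated with a short simulation (my own illustration, with invented numbers):  pre-test scores combine a stable true score with random measurement error, and the participants with the most extreme pre-test scores tend to score closer to the group mean at post-test even though no treatment is applied.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1000
true_score = rng.normal(100, 10, n)        # stable underlying trait
pre = true_score + rng.normal(0, 10, n)    # pre-test = true score + random error
post = true_score + rng.normal(0, 10, n)   # post-test, no treatment applied

# Select the participants with the most extreme (highest) pre-test scores.
extreme = pre > np.percentile(pre, 90)

print(f"mean pre-test of extreme group:  {pre[extreme].mean():.1f}")
print(f"mean post-test of extreme group: {post[extreme].mean():.1f}")
# The extreme group's post-test mean falls back toward the overall mean
# of 100, purely because of random measurement error.
```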


Multiple-group threats to internal validity are related to any variables, other than the experimental intervention, that can have an impact on the post-intervention difference in outcomes between the groups (sometimes referred to as selection interaction), making the groups not comparable.  Below is a description of different selection interaction threats.

The threat of selection-history is when one group of study participants has different experiences than the other group and these different experiences can influence the outcome of the study.  

Selection-maturation occurs when the experimental group experiences changes in the dependent variable at a different rate than the control or comparison group.  For example, a group of 2-year-old children are likely to experience a different rate of change in development than a group of 10-year-old children.

Selection-testing is when pre-intervention testing affects the groups differently.

Selection-instrumentation occurs when the test (outcome measurement procedures) is performed differently between the groups.

Selection-regression is a concern when participants are assigned to groups based on extreme scores.

Social threats to internal validity refer to the social pressures in the research context that can lead to post-intervention differences that are not directly caused by the treatment.  These threats can occur because study participants in one group are aware of the treatment that the other group is receiving.  Below are examples of social threats.

Diffusion or imitation of treatment occurs when a comparison or control group learns about the treatment that the experimental group receives and tries to imitate the treatment.   

Compensatory rivalry is where the comparison or control group knows which intervention the experimental group is receiving and develops a competitive attitude toward the experimental group.

Compensatory equalization of treatments occurs when those delivering the experimental treatment also administer it to the control or comparison group because the treatment is considered more favorable for treating the study participants.

Resentful demoralization can be thought of as the opposite of compensatory rivalry.  For example, the comparison or control group discovers the treatment that the experimental group is receiving.  In this case, instead of developing a rivalry, the comparison group becomes resentful and thus, post-intervention outcome scores may be lower.  Such an impact may be observed when subjective outcome measures (like a questionnaire) are being used.  This threat can result in false, exaggerated differences between groups, making the treatment appear more effective than it actually is.

Construct Validity of Cause and Effect
Construct validity concerns abstract behaviors or events that cannot be directly observed but can impact the interpretation of the cause-and-effect relationship.  The threat related to construct validity that I would like to discuss is experimental bias.  This threat occurs when biases are introduced into a study by investigators or the study participants.  For example, participants may desire to fulfill the expectations of the investigators.  Thus, the participants "try harder" to perform better on post-intervention outcomes or adhere more strictly to the study protocol.  Investigators can introduce this type of bias by making study participants aware of their expectations.  This phenomenon is often referred to as the Hawthorne effect.  Click on the following link for the history of the Hawthorne effect.

External Validity

The findings of a clinical research study must be translational to environments outside of a “lab” in order to be applicable.  External validity is related to the degree to which the results of a study can be generalized beyond the laboratory environment.  I will discuss three threats to external validity.

1.  Interaction of Treatment and Selection
2.  Interaction of Treatment and Setting
3.  Interaction of Treatment and History

When designing a clinical research study, investigators should include a sample of study participants from a target population.  The threat of interaction of treatment and selection arises when the treatment effect does not apply to the entire target population.  For example, consider a study that includes a sample of patients with chronic low back pain.  The study findings may indicate that an intervention was effective for treating the sample of study participants.  However, sub-classifications of patients with chronic low back pain exist (patients have chronic low back pain with different causes).  So, the treatment may not be applicable for every patient with chronic low back pain.

The threat of interaction of treatment and setting occurs when the findings of a study cannot be applied to an environment outside the “lab”.  For example, a treatment may be shown to be effective in a controlled laboratory, but the clinical environment is significantly different and therefore, the results of the study cannot be observed at another site.  This threat can be minimized by replicating the study at multiple locations to determine if the study findings are observed in various settings.

The threat of interaction of treatment and history concerns the ability to generalize the findings of a study to different points in time.  For example, the findings of an older study may have indicated that a drug was effective for reducing hypertension, however the study did not control for confounding variables such as diet and exercise.  Since the results of more recent studies provide evidence that diet and exercise can improve hypertension, the findings of the older study may not presently apply.

Thursday, November 15, 2018

Validity of Intervention Study Design


The purpose of an intervention study is to investigate a cause-and-effect relationship (or lack thereof) between an intervention or treatment (independent variable) and an observed response (dependent variable).  The validity of an intervention study depends on the degree to which researchers can control and manipulate these variables as well as control for any confounding variables.

In practice, researchers can rarely (if ever) completely control for confounding variables, especially in human studies.  Even so, a multitude of studies have been conducted that provide important data for drawing conclusions about the relationship between independent and dependent variables.  However, every study has limitations because no study can be perfectly designed.

Manipulation of variables is the intentional control of variables by a researcher.  For example, a researcher may assign some study participants to receive an experimental intervention and some participants to receive a comparison intervention.  In such an example, the researcher is controlling the intervention (independent variable) and measuring the effect of the experimental intervention (dependent variable).  The manipulation of independent and dependent variables may seem relatively simple.  But, in actuality, manipulation of variables (independent, dependent, confounding) can be challenging.  At this point, we will begin discussing methods for manipulating variables, procedures for appropriate data analyses, and ways to improve the validity of a study design.

In another post, I discussed the importance of random sampling.  In a prospective intervention study where two or more groups are being compared, study participants who are sampled from the target population should be allocated to a group using random assignment to improve study validity.  Random assignment increases study validity by providing confidence that no bias exists with regard to differences between study participants (inter-subject variability) that may impact the measured variable (dependent variable).  In theory, random assignment should result in a balance in inter-subject variability between groups, thus minimizing the influence of inter-subject variability on the dependent variable.

For random assignment to improve study validity, participant characteristics should be considered equivalent between groups.  Consider a study where patients are randomly sampled from a target population and then, randomly assigned to one of two groups.  By chance, some participants may have higher scores for a dependent variable while others have lower scores.  Random assignment is likely to result in a balance of high and low scores between the groups.  However, this balance does not always occur.  I will discuss some ways to address this problem later in this post.

So, how can individuals be randomly assigned to groups?  In my opinion, a computer software program may be the most effective and efficient method for performing random assignment.  Various programs are available for purchase or can be downloaded at no cost; Microsoft Excel can also be used for random assignment.
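For readers who prefer a scripted approach, below is a minimal Python sketch of simple random assignment; the participant labels are placeholders.  (In Excel, the RAND function can be used to a similar end.)

```python
import random

random.seed(2018)  # fix the seed so the assignment is reproducible

participants = [f"participant_{i:02d}" for i in range(1, 21)]

# Shuffle the participant list, then split it evenly into two groups.
shuffled = random.sample(participants, k=len(participants))
half = len(shuffled) // 2
experimental = shuffled[:half]
control = shuffled[half:]

print("Experimental group:", experimental)
print("Control group:     ", control)
```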

Possibly the most effective method for controlling the influence of confounding variables on a dependent variable is the use of a control group.  The change in the dependent variable in the experimental group can be compared to the change in the control group.  If there are no significant differences between the groups before the intervention (baseline), then any difference in the change of the dependent variable between the groups can be inferred to be a treatment effect.  A control group that receives no treatment may be the optimal means of measuring the effect in the experimental group.  However, for various reasons (lack of feasibility, it being unethical to withhold treatment, etc.), a comparison group is often used instead of a control group.  A comparison group may receive a "standard" treatment to determine if the experimental treatment is better than standard care.  A comparison group can also be used when researchers would like to know which of two (or more) treatments is superior.

Certainly, creating and following a research study protocol are very important to the validity of an intervention study (or any other study).  A research study protocol also provides the opportunity for clinicians and practitioners to replicate the study methodology in another environment (for example, a patient care setting).

Although it is not possible to have absolute control of all variables and ensure that every study participant has the same experience, a reasonable degree of control is often possible.  Research study protocols are frequently very detailed and exhaustive.  Click on the following link for more information about the process of writing a research study protocol.

Another issue related to the validity of an intervention study is appropriate data analysis when some data are incomplete.  Incomplete data can be due to events such as participants withdrawing from a study or not adhering to the study protocol.  Incomplete data can compromise the beneficial effect of random assignment and decrease statistical power (I discuss statistical power in another post).  One may think that it is logical to analyze data for only those study participants who completed the study according to protocol (referred to as on-protocol, on-treatment, per-protocol, or completer analysis).  In general, an on-protocol analysis will bias the study results in favor of the treatment, resulting in an inflated treatment effect.  Consider a study in which some participants experienced adverse side effects that resulted in their withdrawing from the study.  If an on-protocol analysis is conducted, the estimated effect of the treatment will reflect only those who experienced benefits and not the adverse side effects.

A more conservative approach is the intention-to-treat (ITT) analysis.  With the ITT analysis, all data are analyzed according to the original random assignment.  The phrase "intention-to-treat" reflects the principle that participants are analyzed in the groups they were intended to be treated in, regardless of what actually occurred.  One could also argue that this approach is more reflective of clinical practice, where some patients will not complete an intervention for various reasons (such as non-adherence).
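To make the contrast concrete, below is a small simulated example (my own illustration; the dropout mechanism and effect sizes are invented, and for simplicity outcomes are assumed to have been collected even for those who withdrew).  Participants who respond poorly to the treatment drop out, so the per-protocol analysis inflates the apparent treatment effect relative to the ITT analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 200
treated = rng.normal(5.0, 3.0, n)   # true mean improvement of 5 points
control = rng.normal(2.0, 3.0, n)   # true mean improvement of 2 points

# Suppose participants with the worst responses to treatment (side effects,
# little benefit) are the ones who drop out before the study ends.
completed = treated > 2.0

itt_effect = treated.mean() - control.mean()            # analyze as randomized
per_protocol_effect = treated[completed].mean() - control.mean()

print(f"intention-to-treat effect: {itt_effect:.2f}")
print(f"per-protocol effect:       {per_protocol_effect:.2f}")
# The per-protocol estimate is inflated because it keeps only the
# participants who responded well to the treatment.
```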

The concept and application of the ITT analysis are in-depth.  The Annals of Internal Medicine provides investigators with some additional information about the ITT analysis and suggestions for analysis of missing data.

Blinding is a method of preventing any potential bias by investigators, study participants, or both.  Blinding can be important for intervention and non-intervention studies.  The British Medical Journal has published a very brief, but informative, article on the topic of blinding.

Earlier in this post, I discussed the issue of inter-subject variability and how methods such as random assignment can reduce its negative impact on intervention study design.  For intervention studies that include only one group of participants, participants may be used as their own control.  One-group intervention studies investigate the response (effect) of a treatment in a single group of individuals (also referred to as a repeated measures design).  The repeated measures design is efficient for controlling inter-subject differences because participants are matched with themselves.  In contrast, multiple-group studies compare responses between groups of different individuals, which can result in greater inter-subject variability.  Yet, issues related to the design of repeated measures studies exist and will be discussed in a future post.

The analysis of covariance (ANCOVA) is a statistical method of controlling for confounding variables.  In short, the ANCOVA allows the researcher to select potential confounding variables (covariates) and then statistically adjusts response scores to control for the selected covariates.  I will discuss the ANCOVA in more detail in a future post.
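As a preview, below is a minimal sketch of an ANCOVA in Python using the statsmodels package with simulated data; the variable names, effect sizes, and sample size are invented.  The baseline (pre-test) score serves as the covariate, and the model estimates the group effect after statistically adjusting for it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

n = 100
baseline = rng.normal(50, 10, n)     # covariate (pre-test score)
group = rng.integers(0, 2, n)        # 0 = control, 1 = experimental
post = 10 + 0.8 * baseline + 4 * group + rng.normal(0, 5, n)

df = pd.DataFrame({"baseline": baseline, "group": group, "post": post})

# ANCOVA: post-test score modeled from group, adjusting for baseline score.
model = smf.ols("post ~ C(group) + baseline", data=df).fit()
print(model.params)    # the adjusted group effect should be near 4
print(model.pvalues)
```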

In theory, the most robust method of controlling for differences between study participants is careful planning and use of inclusion and exclusion study criteria.  The purpose is to choose study participants who are homogeneous in their characteristics.  If study participants are homogeneous, then confounding due to inter-subject variability is minimized.  A major disadvantage of this method is that the study findings can relate only to individuals with the same characteristics as the study participants, which limits the application of the study results.

Sampling


The findings of a research study should be applicable to the population of interest.  The population of interest is the larger group to which study results are generalized.  For example, a group of researchers who intend to study the effect of a treatment in the population of patients with type 2 diabetes should include a representative sample of research participants.  A representative sample is a sub-group of individuals who have similar characteristics to the population of interest.  The term "sampling" refers to the process of selecting a representative sample.  Below is a Venn diagram that illustrates that a representative sample is a subset of the population of interest (target population).


The key to appropriate sampling is selecting a sample that would respond in the study in the same manner as if the entire population were included in the study.  The process of sampling can be challenging for multiple reasons.  In human studies, populations are heterogeneous, meaning variations in different variables exist among individuals.  As an example, previous research studies have investigated the effect of exercise training on hemoglobin A1c (glycated hemoglobin) in patients with type 2 diabetes. (https://www.ncbi.nlm.nih.gov/pubmed/21098771)  Participants in such a study should have a diagnosis of type 2 diabetes and not type 1 diabetes, since the target population of the study is people with type 2 diabetes.

Sampling bias is defined as bias that occurs when the individuals selected to serve as a representative sample of the study population have different characteristics from the study population.  One type of sampling bias occurs when a researcher unknowingly and unintentionally selects participants with different characteristics from the study population.  Another type occurs when a researcher knowingly and intentionally selects participants with different characteristics from the study population.  Why might a researcher knowingly and intentionally introduce sampling bias?  An investigator may select patients with knee osteoarthritis and consciously choose study participants who are most likely to have a positive response to the treatment.  Certainly, intentional sampling bias is a concern.

A method for selecting a representative sample is using inclusion and exclusion criteria.  Inclusion criteria are a set of predefined characteristics used to identify participants who will be included in a research study.  Exclusion criteria are a set of predefined standards used to identify individuals who will not be included or will have to withdraw from a research study after being included.  Proper selection of inclusion and exclusion criteria will minimize sampling bias, improve external and internal validity of the study, minimize ethical concerns, help to ensure the homogeneity of the sample, and reduce confounding.  Click on the following link for an example of a study that describes inclusion and exclusion criteria.
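As a toy illustration (my own sketch, with invented criteria and records), inclusion and exclusion criteria can be written as simple predicates and applied to a list of candidates:

```python
# Hypothetical candidate records for a type 2 diabetes exercise study.
candidates = [
    {"id": 1, "age": 55, "diagnosis": "type 2 diabetes", "on_insulin": False},
    {"id": 2, "age": 72, "diagnosis": "type 1 diabetes", "on_insulin": True},
    {"id": 3, "age": 48, "diagnosis": "type 2 diabetes", "on_insulin": True},
    {"id": 4, "age": 61, "diagnosis": "type 2 diabetes", "on_insulin": False},
]

def meets_inclusion(c):
    # Invented inclusion criteria: adults aged 40-70 with type 2 diabetes.
    return 40 <= c["age"] <= 70 and c["diagnosis"] == "type 2 diabetes"

def meets_exclusion(c):
    # Invented exclusion criterion: currently using insulin.
    return c["on_insulin"]

eligible = [c for c in candidates if meets_inclusion(c) and not meets_exclusion(c)]
print([c["id"] for c in eligible])   # -> [1, 4]
```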

Sampling procedures can be classified as either probability or non-probability methods.  Probability sampling methods are based on random selection.  Random selection means that every person who meets the criteria for being included in a study has an equal chance (probability) of being chosen.  In statistical theory, random selection also implies that, on average, the sample will reflect the characteristics of the target population.  Thus, random selection should eliminate bias in sampling, and the sample should be representative of the target population.  One should consider, however, that random selection is based on probability and therefore does not guarantee that the sample will be a true representation of the target population; by chance, the sample and target population may have different characteristics.
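Below is a minimal sketch of simple random sampling in Python; the sampling frame is a placeholder for a real list of individuals who meet the study criteria.

```python
import random

random.seed(7)  # fix the seed so the selection is reproducible

# Placeholder sampling frame: everyone who meets the study criteria.
sampling_frame = [f"person_{i:03d}" for i in range(1, 501)]

# Simple random sampling: every person has an equal chance of selection.
sample = random.sample(sampling_frame, k=50)
print(sample[:5])
```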

For a description of sampling techniques, click on the following link.  https://towardsdatascience.com/sampling-techniques-a4e34111d808

Validity of Tests and Measurements

Validity of a test or measurement relates to the accuracy of a method to measure what it is intended to measure.  Because measurement error is related to the reliability of a test, one could argue that a test must be reliable in order to be valid.  Yet, an invalid test can be reliable.  For instance, the results of a testing methodology may be reproducible but may not accurately measure the variable of interest.

To illustrate the link between reliability and validity of a measurement, consider the analogy of shooting an arrow at a target.  Shooting the arrow at the bulls-eye of the target would represent performing the test.  The bulls-eye represents the characteristic of the variable being measured.

A tight shot pattern suggests that the test produces reproducible results (high reliability).


A shot pattern that is spread "wide" suggests that the test produces results that are not consistently reproducible (low reliability).


A shot that hits the bulls-eye suggests that the test produces results that reflect the true characteristic of the variable being measured (high validity).


A shot that misses the bulls-eye indicates that the test produces results that do not reflect the true characteristic of the variable being measured (low validity).




A tight shot pattern that consistently hits the bulls-eye indicates that the results of a test are reliable and are measuring the true characteristic of the variable being measured (high reliability and high validity).

A test with acceptable accuracy demonstrates an acceptable degree of reliability and validity.  Of course, "acceptable" must be defined:  what is an acceptable degree of reliability and validity?  The definition of "acceptable" is important since testing methods rarely display perfect reliability and validity.  The acceptable reliability and validity of a test is sometimes debatable, but should be based on evidence-informed decisions.

Several categories of evidence can be used to support the validity of a test.  Below are descriptions of various types of validity with links to example studies.

Face Validity
Indicates that a testing methodology appears to measure what it is supposed to measure.

Content Validity
Indicates that the various components of the testing methodology include the content of the variable being measured. 

Criterion-related Validity
Indicates that the results of a test can be used as a substitute for the findings of a reference-standard test.

Concurrent Validity
Indicates that the results of a test are valid when compared to a reference-standard test; established when both tests are performed at approximately the same time (concurrently); a form of criterion-related validity.

Predictive Validity
Indicates that the results of a test are valid since the test results can predict a future outcome; another form of criterion-related validity.

Construct Validity
Indicates that the results of a test reflect an abstract construct.

Reliability of Tests and Measurements


Certainly, information-based decisions should be founded on the most accurate data.  One prerequisite of accurate data is that the data were collected using a reliable testing or measurement methodology.  Reliability can be defined as "the statistical reproducibility of measurements, including the testing of instrumentation or techniques to obtain reproducible results". (https://www.ncbi.nlm.nih.gov/mesh/68015203)
Another prerequisite of accurate data is validity, which is discussed in another post on this blog.

So, since reliability is part of accuracy, one should consider what makes a test accurate.  One component of accuracy (and reliability) is measurement error.  The fact is that rarely (if ever) is a test or measurement perfectly reliable.  Before one can decide whether or not a test is accurate, a threshold of acceptable measurement error must be set.  Consider the very objective and accurate measurement of body mass using a calibrated, digital scale.  What can result in measurement error?  Confounding variables like time of day of measurement, hydration, amount of clothing worn, and even elevation above sea level could result in measurement error.  Measurement error can be defined as the difference between the true value (very often unknown) and the observed/measured value.  Measurement error is related to bias.  Bias is defined as "any deviation of results or inferences from the truth, or processes leading to such deviation; bias can result from several sources: one-sided or systematic variations in measurement from the true value (systematic error), flaws in study design, deviation of inferences, interpretations, or analyses based on flawed data or data collection, etc.  There is no sense of prejudice or subjectivity implied in the assessment of bias under these conditions." (https://www.ncbi.nlm.nih.gov/mesh/?term=outcome+measurement+error)

Two kinds of measurement error exist.  Systematic measurement errors are predictable and cause measured scores to be consistently lower or higher than true scores.  For example, a stadiometer can be used to measure an individual's height.  If the stadiometer has an instrumentation flaw and consistently measures a person's height as 1 centimeter more than the person's true height, then the measurement method contains systematic error.  Systematic error is constant and predictable and thus does not impact the reliability of a measurement.  The concern with systematic error is more relevant to the validity of testing, because the measured value is not representative of the actual value.  Since systematic error is predictable, the measurement error associated with it can be addressed.  In the case of the stadiometer, one could measure the height of a person and subtract 1 centimeter from the measured score to get the true score.

A second type of measurement error is random error.  Random error is due to chance and is unpredictable.  Such errors may result in a measured value that is higher or lower than the true value.  Consider the stadiometer example for measuring height.  While height is being measured, the individual may move slightly and cause a measurement error.  Reliability of measurement focuses on the degree of random error associated with a testing methodology.  The reliability of a test is inversely related to the degree of random error:  the greater the random error, the lower the reliability.
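A small simulation (my own illustration, with invented numbers) of the stadiometer example shows the difference:  systematic error shifts every measurement by the same amount and leaves the correlation with the true values intact, while random error scatters the measurements and lowers that correlation.

```python
import numpy as np

rng = np.random.default_rng(5)

true_height = rng.normal(170, 8, 100)             # true heights in cm

systematic = true_height + 1.0                    # consistent +1 cm instrument flaw
random_err = true_height + rng.normal(0, 3, 100)  # unpredictable random error

# Systematic error: perfectly correlated with the true values, just shifted.
print(np.corrcoef(true_height, systematic)[0, 1])  # 1.0
# Random error: the correlation (and thus reliability) is reduced.
print(np.corrcoef(true_height, random_err)[0, 1])  # < 1.0
```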

Measurement error can consist of one or more of the following three components:  1) the individual performing the test/measurement, 2) the instrument being used to measure the variable, and 3) variability of the properties of the measured variable.  Sources of measurement error can be minimized by meticulous planning, creating a testing protocol, training on measurement procedures, objective and operational definitions in a testing protocol, and proper calibration of equipment.  An example of a measurement that may consist of measurement error is heart rate assessment.  Heart rate data are variable, and one could argue that heart rate is a relatively unstable measurement.  A variety of confounding variables can make the measurement of heart rate unreliable.  For example, the position of the individual (standing, sitting, or supine) could result in a different heart rate measurement.  Other variables that can cause variation in heart rate are caffeine consumption, medications, and physical activity.  A heart rate testing protocol should be followed to control for such confounding variables.  Also, one could argue that taking multiple measurements and reporting the mean score will improve the stability of the measurement.
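The benefit of averaging can be shown with a quick simulation (my own illustration, with invented numbers):  the mean of five noisy heart rate readings varies much less than a single reading, since the standard error of the mean shrinks with the square root of the number of readings.

```python
import numpy as np

rng = np.random.default_rng(6)

true_hr = 70  # beats per minute
single_readings = true_hr + rng.normal(0, 5, 10_000)           # one reading each
averaged_readings = (true_hr + rng.normal(0, 5, (10_000, 5))).mean(axis=1)

print(f"spread of single readings:    {single_readings.std():.2f} bpm")
print(f"spread of 5-reading averages: {averaged_readings.std():.2f} bpm")
# Averaging five readings cuts the spread by roughly sqrt(5).
```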

To have a better understanding of reliability, one should understand the concepts of correlation and agreement.  Correlation can be defined as the degree of association between two sets of data.  In a study that investigates the reliability of a test, researchers could perform a test on a group of study participants at one point in time (initial test), repeat the test at a different point in time (retest), and analyze the correlation between the initial test data set and the retest data set.  A reliable test would reflect that higher initial test data would be associated with higher retest data and lower initial test data would be associated with lower retest data.

Consider the following two sets of measurements.
Study Participant     Initial Test      Retest
A                              1                    2
B                              3                    4
C                              5                    6
D                              7                    8
E                              9                    10

Below is a scatterplot with a line of best fit that visually represents the correlation between the initial test data and retest data.


We assume that any variations in measurements are due to random error.  If systematic errors occur, the correlation will not be affected because initial test scores will still be associated with retest scores, in relative terms.

However, the results of a reliable test should also be reproducible; that is, scores from repeated testing should be in agreement with each other.  The data above have a high degree of correlation but are not in perfect agreement, meaning differences between initial test scores and retest scores exist.

Consider the following two sets of measurements.
Study Participant     Initial Test      Retest
A                              1                    1
B                              3                    3
C                              5                    5
D                              7                    7
E                              9                    9

The data above are perfectly correlated and in perfect agreement.  These data suggest that this test has perfect reliability.  So, the decision as to whether or not a test or measurement method is reliable should be based on the extent of correlation and agreement between repetitive testing.
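The distinction between correlation and agreement can be checked directly.  Below is a short Python sketch that computes the Pearson correlation and the mean difference for the two datasets above:

```python
import numpy as np
from scipy.stats import pearsonr

initial = np.array([1, 3, 5, 7, 9])
retest_shifted = np.array([2, 4, 6, 8, 10])  # first dataset: offset by 1
retest_equal = np.array([1, 3, 5, 7, 9])     # second dataset: identical scores

for label, retest in [("shifted", retest_shifted), ("equal", retest_equal)]:
    r, _ = pearsonr(initial, retest)
    mean_diff = np.mean(retest - initial)
    print(f"{label}: correlation r = {r:.2f}, mean difference = {mean_diff:.1f}")

# Both datasets are perfectly correlated (r = 1.00), but only the second
# shows perfect agreement (mean difference = 0).
```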

Errors in measurements may be due to one or more sources and can cause one to make incorrect inferences as to whether or not a test is reliable.  One possible source of measurement error is the interval between testing periods.  Intervals should be large enough that the effects of confounding variables like fatigue and learning are minimized or nullified.  Yet, intervals should be small enough that a true change in the measured variable does not occur.  An example of measurement error due to an inadequate interval between testing periods is measuring acute pain related to orthopaedic trauma over an extended period of time.  A true improvement in pain may result in data that are uncorrelated or lack agreement between the times at which pain is assessed.

Another issue that can make reliability challenging (if not impossible) to determine is the effect of the first test on the second test, which is referred to as carryover or testing effects.  Carryover effects can occur with repeated measurements that result in a change in the measurement with subsequent testing.  Carryover effects can occur with repeated physical performance testing.  The measurement of physical performance could change between tests due to a motor learning effect (practice effect).  One way to minimize carryover effects is to require a person to perform a series of familiarization tests, where measurements are compared after the outcome measure has stabilized.  Once the measurement has stabilized, an initial test can be performed and compared to retesting.

Human error can also result in measurement error.  Many measurements require an individual to administer a test.  The individual who performs a test or collects the measurement data can be referred to as a rater.  Because of the potential for rater measurement error, many studies investigate the intra-rater or inter-rater reliability of a testing methodology.  Intra-rater reliability is the ability of one person to conduct a test or collect measurement data in a reproducible manner.  Inter-rater reliability concerns the differences in measurements between different raters.