Thursday, November 15, 2018

Validity of Intervention Study Design


The purpose of an intervention study is to investigate a cause-and-effect relationship (or lack thereof) between an intervention or treatment (independent variable) and an observed response (dependent variable).  The validity of an intervention study depends on the degree to which researchers can control and manipulate these variables as well as control for any confounding variables.

In practice, researchers can rarely (if ever) completely control for confounding variables, especially in human studies.  Many studies have provided important data for drawing conclusions about the relationship between independent and dependent variables.  However, every study has limitations because no study can be perfectly designed.

Manipulation of variables is the intentional control of variables by a researcher.  For example, a researcher may assign some study participants to receive an experimental intervention and some participants to receive a comparison intervention.  In such an example, the researcher is controlling the intervention (independent variable) and measuring the effect of the experimental intervention (dependent variable).  The manipulation of independent and dependent variables may seem relatively simple.  But, in actuality, manipulation of variables (independent, dependent, confounding) can be challenging.  At this point, we will begin discussing methods for manipulating variables, procedures for appropriate data analyses, and ways to improve the validity of a study design.

In another post, I have discussed the importance of random sampling.  In a prospective intervention study where two or more groups are being compared, study participants sampled from the target population should be allocated to groups using random assignment to improve study validity.  Random assignment increases study validity by providing confidence that differences between study participants (inter-subject variability) do not systematically bias the measured variable (dependent variable).  In theory, random assignment should balance inter-subject variability between groups, thus minimizing its influence on the dependent variable.

For random assignment to improve study validity, participant characteristics should end up roughly equivalent between groups.  Consider a study where patients are randomly sampled from a target population and then randomly assigned to one of two groups.  By chance, some participants may have higher scores for a dependent variable while others have lower scores.  Random assignment is likely to result in a balance of high and low scores between the groups.  However, this balance does not always occur.  I will discuss some ways to address this problem later in this post.

So, how can individuals be randomly assigned to groups?  In my opinion, use of a computer software program may be the most effective and efficient method for performing random assignment.  Various software programs are available for purchase or can be downloaded at no financial cost.  Microsoft Excel is a software program that can also be used for random assignment.
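
As one illustration (a minimal sketch rather than a full randomization scheme), the Python standard library can also shuffle a list of enrolled participants and split it into groups.  The participant IDs and group sizes below are hypothetical:

    import random

    # Hypothetical IDs for enrolled participants (illustrative only).
    participants = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]

    random.seed(42)              # fixed seed so the allocation can be reproduced
    shuffled = participants[:]   # copy so the original enrollment order is kept
    random.shuffle(shuffled)

    # Split the shuffled list into two equal-sized groups.
    half = len(shuffled) // 2
    experimental_group = shuffled[:half]
    comparison_group = shuffled[half:]

    print("Experimental:", experimental_group)
    print("Comparison:  ", comparison_group)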

Possibly the most effective method for controlling the influence of confounding variables on a dependent variable is the use of a control group.  The change in the dependent variable in the experimental group can be compared to the change in the control group.  If there are no significant differences between the groups before receiving an intervention (baseline), then any difference in the change of the dependent variable between the groups can be inferred to be a treatment effect.  A control group that receives no treatment may be the optimal means of measuring the effect in the experimental group.  However, for various reasons (lack of feasibility, the ethics of withholding treatment, etc.), a comparison group is often used instead of a control group.  A comparison group may receive a "standard" treatment to determine whether the experimental treatment is better than standard care.  A comparison group may also be used when researchers want to know which of two active treatments is superior.
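
To make the comparison concrete, here is a minimal sketch (with made-up change-from-baseline scores) of comparing two groups with an independent-samples t-test in SciPy; the actual analysis a study uses would depend on its design:

    from scipy import stats

    # Hypothetical change-from-baseline scores for each group (illustrative values).
    experimental_change = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
    control_change = [1.2, 0.8, 1.5, 1.0, 0.9, 1.3]

    # Independent-samples t-test on the change scores; given comparable baselines,
    # a difference in change scores is interpreted as a treatment effect.
    t_stat, p_value = stats.ttest_ind(experimental_change, control_change)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")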

Certainly, creating and following a research study protocol are very important to the validity of an intervention study (or any other study).  A research study protocol also provides the opportunity for clinicians and practitioners to replicate the study methodology in another environment (for example, a patient care setting).

Although it is not possible to have absolute control of all variables and ensure that every study participant has the same experience, a reasonable degree of control is often possible.  Research study protocols are frequently very detailed and exhaustive.  Click on the following link for more information about the process of writing a research study protocol.

Another issue related to the validity of an intervention study is appropriate data analysis when some data are incomplete.  Incomplete data can be due to events such as participants withdrawing from a study or not adhering to the study protocol.  Incomplete data can compromise the beneficial effect of random assignment and decrease study statistical power (I discuss statistical power in another post).  One may think it is logical to analyze data for only those study participants who completed the study according to protocol (referred to as an on-protocol, on-treatment, per-protocol, or completer analysis).  In general, an on-protocol analysis will bias the study results in favor of the treatment, resulting in an inflated treatment effect.  Consider a study in which some participants experienced adverse side effects that caused them to withdraw from the study.  If an on-protocol analysis is conducted, the estimated effect of the treatment will reflect only those who experienced benefits and not the adverse side effects.

A more conservative approach is the intention-to-treat (ITT) analysis.  With the ITT analysis, all data are analyzed according to the original random assignment.  The name reflects the principle that the intention was to treat all participants, so participants' data are analyzed in the groups to which they were assigned regardless of whether they completed the intervention.  One could also argue that this approach is more reflective of clinical practice, where some patients will not complete an intervention for various reasons (such as non-adherence).
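
The difference between a per-protocol and an ITT analysis can be sketched with hypothetical data for one treatment arm in which two participants withdrew early with poor outcomes:

    from statistics import mean

    # Hypothetical outcomes for one treatment arm (illustrative only).
    # Each record: (participant, outcome score, completed the protocol?)
    records = [
        ("A", 8.0, True),
        ("B", 7.5, True),
        ("C", 2.0, False),  # withdrew early (for example, adverse effects)
        ("D", 7.8, True),
        ("E", 1.5, False),  # withdrew early
    ]

    per_protocol = mean(score for _, score, completed in records if completed)
    intention_to_treat = mean(score for _, score, _ in records)  # everyone as randomized

    print(f"Per-protocol mean outcome: {per_protocol:.2f}")
    print(f"Intention-to-treat mean:   {intention_to_treat:.2f}")

The per-protocol mean looks better only because the participants who did poorly were dropped from the analysis.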

The concept and application of the ITT analysis go deeper than this brief overview.  The Annals of Internal Medicine provides investigators with additional information about the ITT analysis and suggestions for the analysis of missing data.

Blinding is a method of preventing any potential bias by investigators, study participants, or both.  Blinding can be important for intervention and non-intervention studies.  The British Medical Journal has published a very brief, but informative, article on the topic of blinding.

Earlier in this post, I discussed the issue of inter-subject variability and how methods such as random assignment can reduce its negative impact on intervention study design.  For intervention studies that include only one group of participants, participants may serve as their own control.  One-group intervention studies investigate the response (effect) of a treatment in a single group of individuals (also referred to as a repeated-measures design).  The repeated-measures design is efficient for controlling inter-subject differences because participants are matched with themselves.  In contrast, multiple-group studies compare responses between groups of different individuals, which can result in greater inter-subject variability.  Yet, issues related to the design of repeated-measures studies exist and will be discussed in a future post.

The analysis of covariance (ANCOVA) is a statistical method for controlling for confounding variables; I will discuss it in more detail in a future post.  In short, the ANCOVA allows the researcher to select potential confounding variables (covariates) and then statistically adjusts response scores to control for the selected covariates.
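
As a rough sketch (hypothetical data, using the statsmodels formula interface), an ANCOVA can be expressed as a linear model in which the post-treatment score is explained by group while adjusting for a baseline covariate:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical baseline and post-treatment scores (illustrative only).
    df = pd.DataFrame({
        "group":    ["treat"] * 5 + ["control"] * 5,
        "baseline": [10, 12, 11, 13, 9, 10, 12, 11, 13, 9],
        "post":     [15, 18, 16, 19, 14, 11, 13, 12, 14, 10],
    })

    # Post score modeled as a function of group, controlling for the baseline covariate.
    model = smf.ols("post ~ C(group) + baseline", data=df).fit()
    print(model.summary())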

In theory, the most robust method of controlling for differences between study participants is careful planning and use of inclusion and exclusion criteria.  The purpose is to choose study participants who are homogeneous in their characteristics.  If study participants are truly homogeneous, then confounding due to inter-subject variability is eliminated.  A major disadvantage of this method is that the study findings apply only to individuals with the same characteristics as the study participants, which limits the application of the study results.

Sampling


The findings of a research study should be applicable to the population of interest.  The population of interest is the larger group to which study results are generalized.  For example, a group of researchers who intend to study the effect of a treatment in the population of patients with type 2 diabetes should include a representative sample of research participants.  A representative sample is a sub-group of individuals whose characteristics are similar to those of the population of interest.  The term "sampling" refers to the process of selecting a representative sample.  Below is a Venn diagram that illustrates that a representative sample is a subset of the population of interest (target population).


The key to appropriate sampling is selecting a sample that would respond in the study in the same manner as if the entire population had been included.  The process of sampling can be challenging for multiple reasons.  In human studies, populations are heterogeneous, meaning that individuals vary on many different variables.  As an example, previous research studies have investigated the effect of exercise training on hemoglobin A1c (glycated hemoglobin) in patients with type 2 diabetes. (https://www.ncbi.nlm.nih.gov/pubmed/21098771)  Participants in such a study should have a diagnosis of type 2 diabetes and not type 1 diabetes, since the target population is people with type 2 diabetes.

Sampling bias is bias that occurs when the individuals selected to serve as a representative sample of the study population have characteristics that differ from the study population.  One type of sampling bias occurs when a researcher unknowingly and unintentionally selects participants with different characteristics from the study population.  Another type occurs when a researcher knowingly and intentionally selects participants with different characteristics from the study population.  Why might a researcher knowingly and intentionally introduce sampling bias?  An investigator studying patients with knee osteoarthritis might consciously choose study participants who are most likely to have a positive response to the treatment.  Certainly, intentional sampling bias is a concern.

A method for selecting a representative sample is the use of inclusion and exclusion criteria.  Inclusion criteria are a set of predefined characteristics used to identify participants who will be included in a research study.  Exclusion criteria are a set of predefined standards used to identify individuals who will not be included or who will have to withdraw from a research study after being included.  Proper selection of inclusion and exclusion criteria will minimize sampling bias, improve the external and internal validity of the study, minimize ethical concerns, help to ensure the homogeneity of the sample, and reduce confounding.  Click on the following link for an example of a study that describes inclusion and exclusion criteria.

Sampling procedures can be classified as either probability or non-probability methods.  Probability sampling methods are based on random selection.  Random selection means that every person who meets the criteria for being included in a study has an equal chance (probability) of being chosen.  In statistical theory, random selection also means that every eligible person has an equal probability of contributing the characteristics of the target population to the sample.  Thus, random selection should eliminate sampling bias, and the sample should be representative of the target population.  One should keep in mind, however, that random selection is based on probability and therefore does not guarantee that the sample will be a true representation of the target population; by chance, the sample and the target population may still have different characteristics.
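
As a minimal sketch (with a hypothetical sampling frame), simple random sampling can be performed with the Python standard library, giving every eligible person the same probability of selection:

    import random

    # Hypothetical sampling frame: everyone who meets the inclusion criteria.
    sampling_frame = [f"person_{i:03d}" for i in range(1, 201)]

    random.seed(7)                              # reproducible selection
    sample = random.sample(sampling_frame, 30)  # each person has an equal chance
    print(sample[:5], "...")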

For a description of sampling techniques, click on the following link.  https://towardsdatascience.com/sampling-techniques-a4e34111d808

Validity of Tests and Measurements

Validity of a test or measurement relates to how accurately the method measures what it is intended to measure.  Because measurement error is related to the reliability of a test, one could argue that a test must be reliable in order to be valid.  Yet, an invalid test can still be reliable.  For instance, the results of a testing methodology may be reproducible but not accurately measure the variable of interest.

To illustrate the link between reliability and validity of a measurement, consider the analogy of shooting an arrow at a target.  Shooting the arrow at the bulls-eye of the target would represent performing the test.  The bulls-eye represents the characteristic of the variable being measured.

A tight shot pattern suggests that the test produces reproducible results (high reliability).


A shot pattern that is spread "wide" suggests that the test produces results that are not consistently reproducible (low reliability).


A shot that hits the bulls-eye suggests that the test produces results that reflect the true characteristic of the variable being measured (high validity).


A shot that misses the bulls-eye indicates that the test produces results that do not reflect the true characteristic of the variable being measured (low validity).


A tight shot pattern that consistently hits the bulls-eye indicates that the results of a test are reliable and are measuring the true characteristic of the variable being measured (high reliability and high validity).

A test with acceptable accuracy demonstrates an acceptable degree of both reliability and validity.  Of course, "acceptable" must be defined: what degree of reliability and validity is good enough?  This definition matters because testing methods rarely display perfect reliability and validity.  What counts as acceptable reliability and validity is sometimes debatable, but the decision should be evidence-informed.

Validity is commonly described in terms of the categories of evidence that can be used to support it.  Below are descriptions of various types of validity, with links to example studies.

Face Validity
Indicates that a testing methodology appears to measure what it is supposed to measure.

Content Validity
Indicates that the various components of the testing methodology include the content of the variable being measured. 

Criterion-related Validity
Indicates that the results of a test can be used as a substitute for the findings of a reference-standard test.

Concurrent Validity
Indicates that the results of a test are valid when compared to a reference-standard test; established when both tests are performed at relatively the same time (concurrently); a form of criterion-related validity.

Predictive Validity
Indicates that the results of a test are valid since the test results can predict a future outcome; another form of criterion-related validity.

Construct Validity
Indicates that the results of a test reflect an abstract construct.

Reliability of Tests and Measurements


Certainly, information-based decisions should be founded on the most accurate data.  One prerequisite of accurate data is that the data were collected using a reliable testing or measurement methodology.  Reliability can be defined as "the statistical reproducibility of measurements, including the testing of instrumentation or techniques to obtain reproducible results". (https://www.ncbi.nlm.nih.gov/mesh/68015203)
Another prerequisite of accurate data is validity, which is discussed in another post on this blog.

So, since reliability is part of accuracy, one should consider what makes a test accurate.  One component of accuracy (and reliability) is measurement error.  The fact is that a test or measurement is rarely (if ever) perfectly reliable.  Before deciding whether a test is accurate, one must set a threshold of acceptable measurement error.  Consider what seems like a very objective and accurate measurement: body mass measured with a calibrated, digital scale.  What can result in measurement error?  Confounding variables like the time of day of the measurement, hydration status, the amount of clothing worn, and even elevation above sea level could result in measurement error.

Measurement error can be defined as the difference between the true value (very often unknown) and the observed/measured value.  Measurement error is related to bias, which is defined as "any deviation of results or inferences from the truth, or processes leading to such deviation; bias can result from several sources: one-sided or systematic variations in measurement from the true value (systematic error), flaws in study design, deviation of inferences, interpretations, or analyses based on flawed data or data collection, etc.  There is no sense of prejudice or subjectivity implied in the assessment of bias under these conditions." (https://www.ncbi.nlm.nih.gov/mesh/?term=outcome+measurement+error)

Two kinds of measurement error exist.  Systematic measurement errors are predictable and cause measured scores to be consistently lower or higher than true scores.  For example, a stadiometer can be used to measure an individual's height.  If the stadiometer has an instrumentation flaw and consistently measures a person's height 1 centimeter more than the person's true height, then the measurement method contains systematic error.  Systematic error is constant and predictable and thus does not impact the reliability of a measurement.  Systematic error is more of a concern for the validity of testing, because the measured value is not representative of the actual value.  Since systematic error is predictable, the measurement error associated with it can be addressed.  In the case of the stadiometer, one could measure the height of a person and subtract 1 centimeter from the measured score to get the true score.

A second type of measurement error is random error.  Random error is due to chance and is unpredictable.  Such errors may result in a measured value that is higher or lower than the true value.  Consider the stadiometer example for measuring height.  While being measured, the individual may move slightly and cause measurement error.  Reliability focuses on the degree of random error associated with a testing methodology: the greater the random error, the less reliable the measurement.
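
The two error types can be illustrated with a small simulation (the numbers are hypothetical): each observed measurement is the true value plus a constant systematic offset plus a random component.

    import random

    random.seed(1)
    true_height_cm = 175.0
    systematic_bias_cm = 1.0  # the stadiometer consistently reads 1 cm too high

    # Observed value = true value + systematic error + random error.
    measurements = [
        true_height_cm + systematic_bias_cm + random.gauss(0, 0.3)  # small random wobble
        for _ in range(5)
    ]
    print([round(m, 1) for m in measurements])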

Measurement error can come from one or more of the following three sources:  1) the individual performing the test/measurement, 2) the instrument being used to measure the variable, and 3) variability of the properties of the measured variable.  Sources of measurement error can be minimized by meticulous planning, creating a testing protocol, training on measurement procedures, objective and operational definitions in the testing protocol, and proper calibration of equipment.  An example of a measurement that may contain measurement error is heart rate assessment.  Heart rate data are variable, and one could argue that heart rate is a relatively unstable measurement.  A variety of confounding variables can make the measurement of heart rate unreliable.  For example, the position of the individual (standing, sitting, or supine) could result in a different heart rate measurement.  Other variables that can cause variation in heart rate are caffeine consumption, medications, and physical activity.  A heart rate testing protocol should be followed to control for such confounding variables.  Also, one could argue that taking multiple measurements and reporting the mean score will improve the stability of the measurement.
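
For instance (a simple sketch with made-up readings), taking several readings under the same protocol and reporting the mean helps smooth out random fluctuation:

    from statistics import mean

    # Hypothetical resting heart rate readings (beats per minute), all taken
    # under the same protocol (same position, no recent caffeine or exercise).
    readings = [62, 65, 63, 64, 62]

    print(f"Reported heart rate: {mean(readings):.1f} bpm")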

To have a better understanding of reliability, one should understand the concepts of correlation and agreement.  Correlation can be defined as the degree of association between two sets of data.  In a study that investigates the reliability of a test, researchers could perform a test on a group of study participants at one point in time (initial test), repeat the test at a different point in time (retest), and analyze the correlation between the initial test data set and the retest data set.  A reliable test would reflect that higher initial test data would be associated with higher retest data and lower initial test data would be associated with lower retest data.

Consider the following two sets of measurements.
Study Participant     Initial Test      Retest
A                              1                    2
B                              3                    4
C                              5                    6
D                              7                    8
E                              9                    10

Below is a scatterplot with a line of best fit that visually represents the correlation between the initial test data and retest data.


We assume that any variations in measurements are due to random error.  If systematic errors occur, the correlation will not be affected because initial test scores will still be associated with retest scores, in relative terms.

However, the results of a reliable test should also be reproducible; that is, scores from repeated testing should be in agreement with each other.  The data above have a high degree of correlation but are not in perfect agreement, meaning that differences between initial test scores and retest scores exist.

Consider the following two sets of measurements.
Study Participant     Initial Test      Retest
A                              1                    1
B                              3                    3
C                              5                    5
D                              7                    7
E                              9                    9

The data above are perfectly correlated and in perfect agreement.  These data suggest that the test has perfect reliability.  So, the decision as to whether a test or measurement method is reliable should be based on the extent of both correlation and agreement between repeated tests.
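
These two small data sets can be checked directly (a quick sketch using SciPy): the first is perfectly correlated but consistently off by one unit, while the second is perfectly correlated and in perfect agreement.

    from statistics import mean
    from scipy.stats import pearsonr

    initial = [1, 3, 5, 7, 9]
    retest_offset = [2, 4, 6, 8, 10]    # first data set: every retest score is 1 higher
    retest_identical = [1, 3, 5, 7, 9]  # second data set: identical scores

    for label, retest in [("offset", retest_offset), ("identical", retest_identical)]:
        r, _ = pearsonr(initial, retest)
        mean_diff = mean(b - a for a, b in zip(initial, retest))
        print(f"{label}: r = {r:.2f}, mean retest-minus-initial difference = {mean_diff:.1f}")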

Errors in measurements may be due to one or more sources and can cause one to make incorrect inferences about whether a test is reliable.  One possible source of measurement error is the interval between testing periods.  Intervals should be long enough that the effects of confounding variables like fatigue and learning are minimized or nullified.  Yet, intervals should be short enough that a true change in the measured variable does not occur.  An example of measurement error due to an inadequate interval between testing periods is measuring acute pain related to orthopaedic trauma over an extended period of time.  A true improvement in pain may result in data that are uncorrelated or that lack agreement between the times at which pain is assessed.

Another issue that can make reliability challenging (if not impossible) to determine is the effect of the first test on the second test, which is referred to as carryover or testing effects.  Carryover effects can occur with repeated measurements that result in a change in the measurement with subsequent testing.  Carryover effects can occur with repeated physical performance testing.  The measurement of physical performance could change between tests due to a motor learning effect (practice effect).  One way to minimize carryover effects is to require a person to perform a series of familiarization tests, where measurements are compared after the outcome measure has stabilized.  Once the measurement has stabilized, an initial test can be performed and compared to retesting.

Human error can also result in measurement error.  Many measurements require an individual to administer a test.  The individual who performs a test or collects the measurement data can be referred to as a rater.  Because of the potential for rater measurement error, many studies investigate the intra-rater or inter-rater reliability of a testing methodology.  Intra-rater reliability is the ability of one person to conduct a test or collect measurement data in a reproducible manner.  Inter-rater reliability concerns the consistency of measurements made by different raters.

Statistical Concepts of Measurement


Measurement can be defined as the process of assigning numbers to variables to represent qualities and quantities of characteristics.  A number can be assigned to qualitative data.  For example, the number "1" can be used to identify/classify males and the number "2" can be used to identify/classify females.  Another example is diagnostic testing studies.  The number "0" can be used to classify people who had a negative test result for a diagnosis and the number "1" can be used to classify people who had a positive test result for a diagnosis.

A number can also reflect an amount or quantity of a variable.  A continuous variable can theoretically take any value on a continuum within a defined range.  Goniometry, the measurement of the range of motion of a joint, produces a continuous variable in units of degrees.  A physical therapist may use goniometry to measure a patient's knee flexion range of motion as 100 degrees.  In practical application, however, a continuous variable can never be measured exactly; the true amount of knee range of motion cannot be captured because of the limited precision of goniometric measurement and measurement error.

Another example is the hemoglobin A1c test.  The hemoglobin A1c test measures the percentage of a patient's hemoglobin that is glycated.  The hemoglobin A1c test is considered a standard test for measuring glucose control in patients with diabetes.  However, the hemoglobin A1c test has some measurement error related to reliability and validity of the test. 

To be clear, goniometry and the hemoglobin A1c test have established validity and are standard methods of measurement.  Yet, no test is perfectly accurate.  The key is that the test or measurement method should have an acceptable amount of measurement error.

Discrete variables are described in whole units of measurement.  Heart rate is measured in beats per minute and is not recorded as a decimal or fraction; therefore, heart rate is a discrete variable.  When a qualitative variable can take only two values, such as a positive or negative test result, the variable is called a dichotomous variable.

The statistical concept of measurement also involves rules of measurement.  These rules dictate how numbers can be assigned to measure a variable.  In the case of gender, "1" can be assigned to represent males and "2" can be assigned to represent females.  Rules of measurement are important because they determine which mathematical operations can be performed on a set of data.  Consider the variable of gender.  If a study included five males and 10 females, the total number of study participants would be 15.  If one instead summed the numbers that represent males and females (1 and 2, respectively), the result would be 25 "study participants".

1 = male, 2 = female
(1 x 5) + (2 x 10) = 25 study participants
Obviously, the correct answer is 15 study participants.
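
The same point can be shown in code (the codes below are labels, not quantities): count the codes, don't sum them.

    from collections import Counter

    # Gender recorded as codes: 1 = male, 2 = female (labels, not quantities).
    gender_codes = [1] * 5 + [2] * 10

    print("Participants per category:", Counter(gender_codes))  # Counter({2: 10, 1: 5})
    print("Total participants:", len(gender_codes))             # 15 -- the meaningful total
    print("Sum of the codes:", sum(gender_codes))               # 25 -- a meaningless number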

Statistical analysis of data is based on the rules that are applied to a measurement.  Data are analyzed according to different levels of measurement.  Nominal level data are also referred to as categorical data.  The gender variable is an example of nominal level data.  Diagnosis is another example of nominal level data.  Nominal level data can be expressed as counts/frequencies.

Measurement on an ordinal scale requires data that can be ranked.  One example of an ordinal scale is the measurement of pain on a 0-to-10 rating scale, where "0" is defined as "no pain" and "10" is defined as "the worst pain ever experienced".  Another example is the measurement of loss of physical function, which can be classified as minor, moderate, or severe; these classifications can be placed on an ordinal scale.  Ordinal level data can be used for descriptive analyses, such as frequencies (like nominal data).  For example, a group of researchers may report the number of study participants who fall within different categories of physical disability.  Technically, however, ordinal data cannot be analyzed using arithmetic operations.  One could argue that ordinal data can nevertheless be analyzed using arithmetic operations (such as a mean pain rating) and that such analyses can be interpreted from a practical perspective.  In fact, peer-reviewed journals have published studies where ordinal data have been analyzed in such a manner. (https://academic.oup.com/ptj/article/90/9/1239/2737986?searchresult=1)
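
As a brief illustration (hypothetical pain ratings), the median is the strictly appropriate summary for ordinal data, while the mean is what is often reported in practice:

    from statistics import mean, median

    # Hypothetical 0-10 pain ratings (ordinal data, illustrative values).
    pain_ratings = [2, 3, 3, 4, 7, 8, 8]

    print(f"Median pain rating: {median(pain_ratings)}")    # appropriate for ordinal data
    print(f"Mean pain rating:   {mean(pain_ratings):.1f}")  # common in practice, but debatable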

Data on the interval scale are rank-ordered (like ordinal data) but also consist of equal distances or intervals between units of measurement.  However, interval-level data do not consist of a true zero measurement.  Consider the measurement of temperature in degrees Celsius.  The measurement of 0 degrees Celsius is assigned arbitrarily.  Indeed, temperature can be measured in negative units (for example, -10 degrees Celsius).  Temperature is a measurement of the amount of heat.  Since 0 degrees Celsius does not represent the total absence of heat, the measurement of 0 degrees Celsius can be considered "artificial".  A strength of interval data is that arithmetic operations can be used to analyze the data since equal distances between units of measurement exist.

The highest level of measurement is using data that are on the ratio scale.  Ratio-level data are on the interval scale, but also have a true zero measurement.  A true zero measurement of the ratio scale reflects the total absence of the variable property and negative values are not possible.  The measurement of force in Newtons is an example of ratio-level data.  Because ratio data are on the interval scale and have a true zero measurement, all mathematical and statistical operations can be used for data analyses.

Click on the following link for another description of levels of measurement.

So, why is identification of the level of measurement (nominal, ordinal, interval, or ratio) important?  In the field of statistics, the most important reason may be the selection of appropriate statistical procedures based on the level of measurement of the data.  A simple example is gender.  For the purpose of recording gender data, the number "1" may be used to represent males and the number "2" may be used to represent females.  This coding of data is often necessary for using statistical analysis software.  If a study includes five males and five females, a mean of the codes would equal 1.5.  A mean of 1.5 does not represent the gender variable, and we cannot make meaningful inferences from such a statistical analysis.

Collection and analysis of ordinal-level data occur frequently in social and health sciences.  As previously mentioned, ordinal data have been analyzed using arithmetic operations.  Although applying arithmetic operations to ordinal data is fundamentally inappropriate, such procedures have made interpretation of ordinal data more practical.  I will not attempt to debate "for or against" the use of arithmetic operations in ordinal data analysis.  The purpose of my comments is to make the reader aware of this topic.


Welcome!

Hello everyone!

Welcome to Statistical Inference!!!  In this blog, we will be discussing the process of drawing conclusions about populations or scientific truths from data.

Let's get started!