Reliability and Validity

All researchers strive to deliver accurate results. Accurate results are both reliable and valid. Reliability means that the results obtained are consistent. Validity is the degree to which the researcher actually measures what he or she is trying to measure.

Reliability and validity are often compared to a marksman's target. In the illustration below, Target B represents measurement with poor validity and poor reliability. The shots are neither consistent nor accurate. Target A shows a measurement that has good reliability, but has poor validity as the shots are consistent, but they are off the center of the target. Target C shows a measure with good validity and good reliability because all of the shots are concentrated at the center of the target.


Random Errors: Random error is a term used to describe all chance or random factors than confound—undermine—the measurement of any phenomena. Random errors in measurement are inconsistent errors that happen by chance. They are inherently unpredictable and transitory. Random errors include sampling errors, unpredictable fluctuations in the measurement apparatus, or a change in a respondents mood, which may cause a person to offer an answer to a question that might differ from the one he or she would normally provide. The amount of random errors is inversely related to the reliability of a measurement instrument.[1] As the number of random errors decreases, reliability rises and vice versa.

Systematic Errors: Systematic or Non-Random Errors are a constant or systematic bias in measurement. Here are two everyday examples of systematic error: 1) Imagine that your bathroom scale always registers your weight as five pounds lighter that it actually is and 2) The thermostat in your home says that the room temperature is 72º, when it is actually 75º. The amount of systematic error is inversely related to the validity of a measurement instrument.[2] As systematic errors increase, validity falls and vice versa.


As stated above, reliability is concerned with the extent to which an experiment, test, or measurement procedure yields consistent results on repeated trials. Reliability is the degree to which a measure is free from random errors. But, due to the every present chance of random errors, we can never achieve a completely error-free, 100% reliable measure. The risk of unreliability is always present to a limited extent.

Here are the basic methods for estimating the reliability of empirical measurements: 1) Test-Retest Method, 2) Equivalent Form Method, and 3) Internal Consistency Method.[3]

Test-Retest Method: The test-retest method repeats the measurement—repeats the survey—under similar conditions. The second test is typically conducted among the same respondents as the first test after a short period of time has elapsed. The goal of the test-retest method is to uncover random errors, which will be shown by different results in the two tests. If the results of the two tests are highly consistent, we can conclude that the measurements are stable and reliability is deemed high. Reliability is equal to the correlation of the two test scores taken among the same respondents at different times.

There are some problems with the test-retest method. First, it may be difficult to get all the respondents to take the test—complete the survey or experiment—a second time. Second, the first and second tests may not be truly independent. The mere fact that the respondent participated in the first measurement might affect their responses on the second measurement. And, third, environmental or personal factors could cause the second measurement to change.

Equivalent Form Method: The equivalent form method is used to avoid the problems mentioned above with the test-retest method. The equivalent form method measures the ability of similar instruments to produce results that have a strong correlation. With this method, the researcher creates a large set of questions that address the same construct and then randomly divides the questions into two sets. Both instruments are given to the same sample of people. If there is a strong correlation between the instruments, we have high reliability.

The equivalent form method is also not without problems. First, it can be very difficult—some would say nearly impossible—to create two totally equivalent forms. Second, even when equivalency can be achieved, it may not be worth the investment of time, energy, and funds.

Internal Consistency and the Split-Half Method: These methods for establishing reliability rely on the internal consistency of an instrument to produce similar results on different samples during the same time period. Internal consistency is concerned with equivalence. It addresses the question: Is there an equal amount of random error introduced by using two different samples to measure phenomena?

The split-half method measures the reliability of an instrument by dividing the set of measurement items into two halves and then correlating the results. For example, if we are interested the in perceived practicality of electric cars and gasoline-powdered cars, we could use a split-half method and ask the "same" question two different ways.

To be reliable, the answers to these two questions should be consistent. The problem with this method is that different "splits" can result in different coefficients of reliability. To overcome this problem researchers use the Cronbach alpha (α) technique, which is named for educational psychologist Lee Cronbach. Cronbach alpha (α) calculates the average reliability for all possible ways of splitting a set of questions in half. A lack of correlation of an item with other items suggests low reliability and that this item does not belong in the scale. Cronbach's alpha technique requires that all items in the scale have equal intervals. If this condition cannot be met, other statistical analysis should be considered. Chronbach's alpha is also called the coefficient of reliability.


Validity is defined as the ability of an instrument to measure what the researcher intends to measure. There are several different types of validity in social science research. Each takes a different approach to assessing the extent to which a measure actually measures what the researcher intends to measure. Each type of validity has different meaning, uses, and limitations.[4]

Face Validity: Face validity is the degree to which subjectively is viewed as measuring what it purports to measure. It is based on the researcher's judgment or the collective judgment of a wide group of researchers. As such, it is considered the weakest form of validity. With face validity, a measure "looks like it measures what we hope to measure," but it has not been proven to do so.

Content Validity: Content validity is frequently considered equivalent to face validity. Content or logical validity is the extent to which experts agree that the measure covers all facets of the construct. To establish content validity all aspects or dimensions of a construct must be included. If we are constructing a test of arithmetic and we only focus on addition skills, we would clearly lack content validity as we have ignored subtraction, multiplication, and division. To establish content validity, we must review the literature on the construct to make certain that each dimension of the construct is being measured.

Criterion Validity: Criterion Validity measures how well a measurement predicts outcome based on information from other variables. It measures the match between the survey question and the criterion—content or subject area—it purports to measure. The SAT test, for instance, is said to have criterion validity because high scores on this test are correlated with a students' freshman grade point averages.

There are two types of criterion validity: Predictive Validity and Concurrent Validity. Predictive Validity refers to the usefulness of a measure to predict future behavior or attitudes. Concurrent Validity refers to the extent to which another variable measured at the same time as the variable of interest can be predicted by the instrument. Concurrent validity is evidenced when a measure strongly correlates with a previously validated measure.

Construct Validity: Construct validity is the degree to which an instrument represents the construct it purports to represent. It involves an understanding the theoretical foundations of the construct. A measure has construct validity when is conforms to the theory underlying the construct.

There are two types of convergent validity: Convergent Validity and Discriminant Validity. Convergent validity is the correlation among measures that claim to measure the same construct. Discriminant validity measures the lack of correlation among measures that do not measure the same construct. For there to be high levels of construct validity you need high levels of correction among measures that cover the same construct, and low levels of correlation among measures that cover different constructs.

 Toggle open/close quiz group

[1] Carmines, Edward G. and Richard A. Zeller, Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. pp. 14-15.

[2] Carmines, Edward G. and Richard A. Zeller, Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. pp. 13-14.

[3] Carmines, Edward G. and Richard A. Zeller, Richard A., Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. pp. 37-51.

[4] Carmines, Edward G. and Richard A. Zeller, Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. p. 17. 

toc | return to top | previous page | next page