After completing this session, you will be able to:
- Define measurement.
- Describe the concept of levels (scales) of measurement.
- Define reliability, including the different types and how they are assessed.
- Define validity, including the different types and how they are assessed.
- Differentiate between reliability and validity, and understand how the two are related to each other and relevant to judging the value of a measurement tool.
Measurement is the process of assigning numbers or labels to the objects under study so that they can be represented quantitatively or qualitatively. It can be understood as a means of denoting the amount of a particular attribute that a particular object possesses. Certain rules define the process of measurement; for example, the number 1 might be assigned to people from South India and the number 2 to people from North India. Measurement applies to the attributes of the units under study, not to the units themselves; for example, we measure a person's height, weight, or age, not the person.
This represents a limited use of the term measurement. In statistics, the term measurement is used more broadly and is more appropriately termed scales of measurement. Scales of measurement refer to ways in which variables/numbers are defined and categorized. Each scale of measurement has certain properties, which in turn determines the appropriateness for use of certain statistical analyses.
A researcher has to know what to measure before knowing how to measure something. The problem definition process should suggest the concepts that must be measured. A concept can be thought of as a generalized idea that represents something of meaning. Concepts like age, sex, education, and number of siblings are relatively concrete properties. They present few problems in either definition or measurement. Other concepts are more abstract. Concepts such as loyalty, personality, trust, customer satisfaction, and so on are more difficult to both define and measure. For example, loyalty has been measured as a combination of customer share and commitment. Thus, we can see that loyalty consists of two components, the first is behavioral and the second is attitudinal.
Researchers rely on variation in concepts to draw conclusions. As we have seen, variables capture the different values a concept can take. For practical purposes, once a research project is underway, there is little difference between a concept and a variable. Consider the following hypothesis:
H1: Experience is positively related to job performance.
The hypothesis implies a relationship between two variables, experience and job performance. The variables capture variance in the experience and performance concepts. One employee may have 15 years of experience and be a top performer. A second may have 10 years’ experience and be a good performer. The scale used to measure experience is quite straightforward in this case and would involve simply providing the number of years an employee has been with the company. Job performance, on the other hand, can be quite complex.
Sometimes, a single variable cannot capture a concept alone. Using multiple variables to measure one concept can often provide a more complete account of some concept than could any single variable. Even in the physical sciences, multiple measurements are often used to make sure an accurate representation is obtained. In social science, many concepts are measured with multiple measurements.
A construct is a term used for concepts that are measured with multiple variables. For instance, when a business researcher wishes to measure the customer orientation of a salesperson, several variables like these may be used, each captured on a 1–5 scale:
- I offer the product that is best suited to a customer’s problem.
- A good employee has to have the customer’s best interests in mind.
- I try to find out what kind of products will be most helpful to a customer.
Constructs are not measured directly; they are usually measured through indicator variables. For example, loyalty is a construct. If we try to measure loyalty directly, different respondents will interpret loyalty in different ways, and the responses will contain random error. However, we can measure loyalty through three indicator variables: (a) the number of times the respondent visited a particular store, (b) the amount purchased from that store, and (c) the number of people to whom the respondent recommended it.
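As an illustrative sketch (the respondent data, the indicator values, and the equal-weight averaging are all assumptions for the example, not a prescribed procedure), the three indicators could be combined into a single loyalty score by standardizing each one and averaging:

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize raw scores to mean 0 and (population) SD 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical indicators of the "loyalty" construct for five respondents
visits = [12, 3, 8, 1, 6]              # store visits
spend = [500, 90, 300, 40, 210]        # amount purchased
referrals = [4, 0, 2, 0, 1]            # people the respondent recommended

# Construct score: equal-weight average of the standardized indicators
loyalty = [mean(t) for t in zip(zscores(visits), zscores(spend), zscores(referrals))]
```

Standardizing first keeps the large-valued indicator (amount purchased) from dominating the composite.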
Levels of Scales of Measurement
Nominal scale: A nominal variable (also called a categorical variable) can be placed into categories. Nominal values have no numeric meaning and so cannot be added, subtracted, multiplied, or divided. They also have no order; if the categories appear to have an order, you probably have an ordinal variable instead. For example: eye color (black, green, aqua, hazel, etc.), gender (male/female), true/false.
Ordinal scale: The ordinal scale contains things that you can place in order, for example hottest to coldest, lightest to heaviest, richest to poorest. Basically, if you can rank data by 1st, 2nd, 3rd place (and so on), then you have data on an ordinal scale.
Ordinal scales tell us relative order but give no information about the size of the differences between categories. For example, if Ram finishes first in a race and Vinod finishes second, we do not know by how many seconds the race was decided.
Another example is a restaurant rating survey: a diner fills in a paper or online survey asking, “How satisfied are you with the dining experience?” with response options from 0 to 10, where 0 means extremely dissatisfied and 10 extremely satisfied.
Interval scale: An interval scale has ordered numbers with meaningful divisions; the magnitudes between consecutive intervals are equal. Interval scales do not have a true zero: 0 degrees Celsius does not mean the absence of heat. For example, on a Fahrenheit or Celsius thermometer, 90° is hotter than 45°, and the difference between 10° and 30° is the same as the difference between 60° and 80°.
Measurement relative to sea level is another example of an interval scale. Each of these scales is a direct, measurable quantity with equality of units. In addition, zero does not represent the absolute lowest value; rather, it is a point on the scale with numbers both above and below it (for example, −10 degrees Fahrenheit).
Ratio scale: The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units, with one major difference: zero is meaningful (no numbers exist below zero). The true zero allows us to know how many times greater one case is than another. Ratio scales have all of the characteristics of the nominal, ordinal, and interval scales. The simplest example of a ratio scale is the measurement of length. Having zero length or zero money means there is no length and no money, but zero temperature (on the Celsius or Fahrenheit scale) is not an absolute zero. The following table summarizes the properties of the four levels of measurement scales.
| Type of Measurement Scale | Types of Attitude Scales | Rules for Assigning Numbers | Typical Applications | Statistics/Statistical Tests |
|---|---|---|---|---|
| Nominal | Dichotomous (yes/no) scales | Objects are either identical or different | Classification by gender, geographic area, social class | Percentages, mode, frequency |
| Ordinal | Comparative, rank order, itemized category, paired comparison | Objects are greater or lesser in comparison | Ranking preferences, class standing | Percentile, median, ranking range, quartiles; rank-order correlation coefficient, sign test |
| Interval | Likert, Stapel, numerical, semantic differential | Intervals between adjacent ranks are equal | Temperature scales, attitude measures | Mean, standard deviation, correlation coefficient; t-test, Z-test, regression analysis, factor analysis |
| Ratio | Certain scales with special instructions | There is a meaningful zero | Sales, income, costs, age | All mathematical and statistical operations can be carried out |
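The interval-versus-ratio distinction in the table can be checked numerically. In this sketch (the temperatures and lengths are chosen arbitrarily), ratios of Celsius readings are not preserved when converted to Fahrenheit, because neither scale has a true zero, while ratios of lengths survive a change of units:

```python
def c_to_f(c):
    """Celsius to Fahrenheit: a change of origin and unit (interval scale)."""
    return c * 9 / 5 + 32

def m_to_cm(m):
    """Metres to centimetres: a change of unit only (ratio scale)."""
    return m * 100

# 20 deg C is not "twice as hot" as 10 deg C: the ratio depends on the unit.
celsius_ratio = 20 / 10                         # 2.0
fahrenheit_ratio = c_to_f(20) / c_to_f(10)      # 68 / 50 = 1.36

# Length has a true zero, so the ratio survives the unit change.
metre_ratio = 2.0 / 1.0                         # 2.0
centimetre_ratio = m_to_cm(2.0) / m_to_cm(1.0)  # 200 / 100 = 2.0
```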
Criteria for Good Measurement
The three major criteria for evaluating measurements are reliability, validity, and sensitivity.
Reliability is an indicator of a measure’s internal consistency. Consistency is the key to understanding reliability. A measure is reliable when different attempts at measuring something converge on the same result. For example, consider an exam that has three parts: 25 multiple-choice questions, 2 essay questions, and a short case. If a student gets 20 of the 25 multiple-choice questions (80 percent) correct, we would expect her to score about 80 percent on the essay and case portions of the exam as well. Further, if a professor’s tests are reliable, a student should tend toward consistent scores across all tests; a student who makes 80 percent on the first test should make scores close to 80 percent on all subsequent tests. Another way to look at this is that the student with the best score on one test should score close to the best in the class on the other tests. If it is difficult to predict students’ scores on a test from their previous test scores, the tests probably lack reliability, or the students are not preparing consistently.
So, the concept of reliability revolves around consistency. Think of a scale to measure weight. You would expect this scale to be consistent from one time to the next. If you stepped on the scale and it read 140 pounds, then got off and back on, you would expect it to again read 140. If it read 110 the second time, while you may be happier, the scale would not be reliable.
Internal consistency represents a measure’s homogeneity. An attempt to measure trustworthiness may require asking several similar, but not identical, questions, as shown in the image below. The set of items that make up a measure is referred to as a battery of scale items. The internal consistency of a multiple-item measure can be assessed by correlating scores on subsets of the items making up the scale.
Split half method
This method of checking reliability is performed by taking half of the items from a scale (for example, the odd-numbered items) and checking them against the results from the other half (the even-numbered items). The two scale halves should produce similar scores and correlate highly. The problem with the split-half method is deciding how to form the two halves. Should it be even- versus odd-numbered questions? Questions 1–3 compared with questions 4–6? Coefficient alpha, discussed next, provides a solution to this problem.
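A minimal sketch of the split-half computation, assuming hypothetical 1–5 ratings on a six-item scale and an odd/even split; the Spearman–Brown correction step is included because each half-scale is only half as long as the full instrument:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(scores):
    """Correlate odd- vs even-item totals, then apply the Spearman-Brown
    correction to estimate the reliability of the full-length scale."""
    odd = [sum(row[0::2]) for row in scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in scores]   # items 2, 4, 6, ...
    r = pearson_r(odd, even)
    return 2 * r / (1 + r)

# Hypothetical 1-5 ratings: five respondents on a six-item scale
scores = [
    [5, 4, 5, 5, 4, 5],
    [2, 2, 1, 2, 2, 1],
    [4, 3, 4, 4, 3, 4],
    [1, 2, 1, 1, 2, 2],
    [3, 3, 3, 4, 3, 3],
]
reliability = split_half_reliability(scores)
```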
Coefficient Alpha
Coefficient alpha is the most commonly applied estimate of a multiple-item scale’s reliability. It represents internal consistency by computing the average of all possible split-half reliabilities for a multiple-item scale. The coefficient demonstrates whether or not the different items converge. Although coefficient alpha does not address validity, many researchers use alpha as the sole indicator of a scale’s quality. Coefficient alpha ranges in value from 0, meaning no consistency, to 1, meaning complete consistency (all items yield corresponding values). Generally speaking, scales with a coefficient alpha between 0.80 and 0.95 are considered to have very good reliability, scales with an alpha between 0.70 and 0.80 good reliability, and an alpha between 0.60 and 0.70 fair reliability. When coefficient alpha is below 0.60, the scale is considered to have poor reliability. Most statistical software packages, such as SPSS, can easily compute coefficient alpha.
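For illustration, coefficient alpha can also be computed directly from its item-variance form, α = (k/(k−1)) · (1 − Σσᵢ²/σₜ²); the ratings below are hypothetical:

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient (Cronbach's) alpha; rows are respondents, columns are items."""
    k = len(scores[0])                                   # number of items
    item_vars = [pvariance(col) for col in zip(*scores)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 1-5 agreement ratings: five respondents, three items
scores = [
    [5, 4, 5],
    [2, 2, 1],
    [4, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
]
alpha = cronbach_alpha(scores)   # falls in the "very good" 0.80-0.95 band here
```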
Test Retest Reliability
The test-retest method of determining reliability involves administering the same scale or measure to the same respondents at two separate times to test for stability. If the measure is stable over time, the test, administered under the same conditions each time, should obtain similar results. Test-retest reliability represents a measure’s repeatability.
Suppose a researcher at one time attempts to measure buying intentions and finds that 12 percent of the population is willing to purchase a product. If the study is repeated a few weeks later under similar conditions, and the researcher again finds that 12 percent of the population is willing to purchase the product, the measure appears to be reliable. High stability correlation or consistency between two measures at time 1 and time 2 indicates high reliability.
Measures of test-retest reliability pose two problems that are common to all longitudinal studies. First, the initial measure may sensitize the respondents to their participation in a research project and subsequently influence the results of the second measure. Furthermore, if the time between measures is long, there may be an attitude change or other maturation of the subjects. Thus, a reliable measure can show a low or moderate correlation between the first and second administrations, where the low correlation is due to an attitude change over time rather than to a lack of reliability in the measure itself.
The validity of a measurement is the ability of a scale or measuring instrument to measure what it is intended to measure. Reliability represents how consistent a measure is, in that different attempts at measuring the same thing converge on the same point. Validity, by contrast, deals with how well a measure assesses the intended concept: it is the accuracy of a measure, or the extent to which a score truthfully represents a concept. In other words, are we accurately measuring what we think we are measuring? The four basic approaches to establishing validity are face validity, content validity, criterion validity, and construct validity.
Face validity refers to the subjective agreement among experts that a scale logically reflects the concept being measured. Do the test items look like they make sense given a concept’s definition? When an inspection of the test items convinces experts that the items match the definition, the scale is said to have face validity.
Clear questions like “How many children do you have?” generally are agreed to have face validity. But it becomes more challenging to assess face validity in regard to more complicated business phenomena. For instance, consider the concept of customer loyalty. Does the statement “I prefer to purchase my groceries at ABC Foods” appear to capture loyalty? How about “I am very satisfied with my purchases from ABC Fine Foods”? What about “ABC Fine Foods offers very good value”? While the first statement appears to capture loyalty, it can be argued the second question is not loyalty but rather satisfaction. What does the third statement reflect? Do you think it looks like a loyalty statement?
In scientific studies, face validity might be considered a first hurdle. In comparison to other forms of validity, face validity is relatively easy to assess. However, researchers are generally not satisfied with simply establishing face validity. Because of the elusive nature of attitudes and other business phenomena, additional forms of validity are sought.
Content validity refers to the degree to which a measure covers the domain of interest. Do the items capture the entire scope of, without going beyond, the concept we are measuring? If an exam is supposed to cover chapters 1–5, it is fair for students to expect that questions will come from all five chapters, rather than just one or two. It is also fair to assume that the questions will not come from chapter 6. Thus, when students complain about the material on an exam, they are often claiming it lacks content validity. Similarly, an evaluation of an employee’s job performance should cover all the important aspects of the job, but nothing outside the employee’s specified duties.
Criterion validity addresses the question, “How well does my measure work in practice?” Because of this, criterion validity is sometimes referred to as pragmatic validity. In other words, is my measure practical? Criterion validity may be classified as either concurrent validity or predictive validity, depending on the time sequence in which the new measurement scale and the criterion measure are correlated.
Construct validity exists when a measure reliably measures and truthfully represents a unique concept. Construct validity consists of several components: face validity, content validity, criterion validity, convergent validity, and discriminant validity.
We have already discussed face validity, content validity, and criterion validity. The two remaining components, convergent and discriminant validity, concern how a measure relates to measures of other concepts. Convergent validity requires that concepts that should be related are indeed related. For example, in business we believe customer satisfaction and customer loyalty are related; if we have measures of both, we would expect them to be positively correlated. If we found no significant correlation between our measures of satisfaction and our measures of loyalty, the convergent validity of these measures would be in question. On the other hand, our customer satisfaction measure should not correlate too highly with the loyalty measure if the two concepts are truly different: if the correlation is too high, we have to ask whether we are measuring two different things, or whether satisfaction and loyalty are actually one concept. As a rough rule of thumb, when two scales are correlated above 0.75, discriminant validity may be questioned. So, we expect related concepts to display a significant correlation (convergent validity), but not to be so highly correlated that they cannot be independent concepts (discriminant validity).
For social scientists, an attitude is a stable disposition to respond consistently to specific aspects of the world, including actions, people, or objects. One way to understand an attitude is to break it down into its components. Consider this brief statement: “Sally likes shopping at Wal-Mart. She believes the store is clean, conveniently located, and has low prices. She intends to shop there every Thursday.” This simple example demonstrates three components of attitude: affective, cognitive, and behavioral. The affective component refers to an individual’s general feelings or emotions toward an object. A person’s attitudinal feelings are driven directly by his or her beliefs or cognitions. The cognitive component represents an individual’s knowledge about attributes and their consequences. The behavioral component of an attitude reflects a predisposition to action, as expressed in an individual’s intentions.
The simplest rating scale contains only two response categories: agree/disagree. Expanding the response categories provides the respondent with more flexibility in the rating task. Even more information is provided if the categories are ordered according to a particular descriptive or evaluative dimension. Consider the following question:
This category scale is a more sensitive measure than a scale that has only two response categories. By having more choices for a respondent, the potential exists to provide more information.
Likert Scale
A Likert scale is a measure of attitudes designed to allow respondents to indicate how strongly they agree or disagree with carefully constructed statements that range from very positive to very negative toward an attitude object. No statement should be purely factual; each statement should be debatable, and the ordered response categories should discriminate between respondents, with lower and higher scores reflecting less and more favorable attitudes.
Here, SA: strongly agree, A: Agree, NAND: Neither agree nor disagree, DA: Disagree, SDA: Strongly disagree
Constant Sum Scale
In this, the respondents are asked to divide a given number of points, usually 100 among two or more attributes based on the importance they attach to each attribute. If an attribute is unimportant, the respondent can assign zero points. In the following figure, eight attributes of bathing soap are given. Respondents may be asked to allocate 100 points among the attributes so that their allocation reflects the relative importance they attach to each attribute. The more points an attribute receives, the more important the attribute is. If an attribute is not important, assign it zero points. If an attribute is twice as important as some other attribute, it should receive twice as many points.
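A minimal sketch of how a constant-sum response might be validated and interpreted; the soap attributes and the point allocation are invented for the example:

```python
# Hypothetical constant-sum allocation of 100 points across eight
# attributes of a bathing soap, for a single respondent
allocation = {
    "mildness": 25, "lather": 10, "fragrance": 20, "price": 15,
    "shrinkage": 0, "moisturizing": 20, "cleaning power": 10, "packaging": 0,
}

# A usable response must allocate exactly the 100 points provided
assert sum(allocation.values()) == 100

# Relative importance follows directly from the points:
# mildness (25) is rated 2.5x as important as lather (10)
importance_ratio = allocation["mildness"] / allocation["lather"]
```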
Graphic Rating Scale
A graphic rating scale presents respondents with a graphic continuum. The respondents are allowed to choose any point on the continuum to indicate their attitude. Following figure shows a traditional graphic scale, ranging from one extreme position to the opposite position.
Typically, a respondent’s score is determined by measuring the length (in millimeters) from one end of the graphic continuum to the point marked by the respondent. Many researchers believe that scoring in this manner strengthens the assumption that graphic rating scales of this type are interval scales. Graphic rating scales are not limited to straight lines as sources of visual communication. A variation of the graphic ratings scale is the ladder scale. Following is also a graphic rating scale (happy face scales).
Semantic Differential Scale
The semantic differential is a seven-point rating scale whose end points are associated with bipolar labels that have semantic meaning. For example:
Powerful —:—:—:—:—:—:—: Weak
Unreliable —:—:—:—:—:—:—: Reliable
Modern —:—:—:—:—:—:—: Old fashioned
The negative adjective or phrase sometimes appears at the left side of the scale and sometimes at the right. This controls the tendency of some respondents, particularly those with very positive or very negative attitudes, to mark the right- or left-hand side without reading the labels. Individual items on a semantic differential scale may be scored on either a −3 to +3 or a 1 to 7 scale.
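Because the positive pole alternates sides, items must be reverse-scored consistently before summing. A sketch, using the three example items above with made-up responses (positions coded 1 for the leftmost blank through 7 for the rightmost):

```python
# Hypothetical semantic-differential responses, recorded as the position
# marked on each item: 1 = leftmost blank, 7 = rightmost blank
responses = {
    "powerful_weak": 2,          # positive pole (Powerful) on the LEFT
    "unreliable_reliable": 6,    # positive pole (Reliable) on the RIGHT
    "modern_oldfashioned": 3,    # positive pole (Modern) on the LEFT
}
positive_on_left = {"powerful_weak", "modern_oldfashioned"}

def score(item, position, scale_max=7):
    """Code every item so that a high score is always favorable."""
    if item in positive_on_left:
        return scale_max + 1 - position   # reverse-score
    return position

total = sum(score(item, pos) for item, pos in responses.items())
```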
Stapel Scale
The Stapel scale, named after Jan Stapel, was originally developed in the 1950s to measure simultaneously the direction and intensity of an attitude. Modern versions of the scale, with a single adjective, are used as a substitute for the semantic differential when it is difficult to create pairs of bipolar adjectives. The modified Stapel scale places a single adjective in the center of an even number of numerical values (ranging, perhaps, from +3 to −3). The scale measures how close to or distant from the adjective a given stimulus is perceived to be. The following figure illustrates a Stapel scale for measuring the quality of food and quality of service of a restaurant.
Thurstone Scale
In 1927, attitude research pioneer Louis Thurstone developed the idea that attitudes vary along continua and should be measured accordingly. Constructing a Thurstone scale is a fairly complex process that requires two stages. The first stage is a ranking operation, performed by judges who assign scale values to attitudinal statements. The second stage consists of asking subjects to respond to the attitudinal statements: people indicate which statements they agree with, and the average response is computed. First, you must be very clear about exactly what it is you are trying to measure. Then, collect statements on the topic ranging from favorable to unfavorable attitudes.
The Thurstone method is time-consuming and costly. From a historical perspective, it is valuable, but its current popularity is low.
Types of Errors in Measurement
The measurement error is defined as the difference between the true or actual value and the measured value. These errors may arise from different sources and are usually classified into the following types: Systematic (or biased) errors and Random errors.
Systematic (or biased) errors are biases in measurement that lead the average of many measurements to differ significantly from the actual value of the measured attribute. A systematic error makes the measured value consistently smaller or consistently larger than the true value, but not both.
For example, consider an experimenter taking readings of the time period of a pendulum’s full swing. If the stopwatch starts with 1 second already on the clock, then every reading will be off by 1 second. If the experimenter repeats the experiment twenty times (starting at 1 second each time), the calculated average carries the same error: the final result will be larger than the true period. Types of systematic errors include personal (observational) errors and instrumental errors. Personal errors are the result of ignorance, negligence, or physical limitations of the experimenter. Instrumental errors are attributed to imperfections in the tools with which the researcher works.
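The stopwatch example can be simulated to contrast the two error types; the true period, the bias, and the noise level are arbitrary assumptions:

```python
import random

random.seed(42)                  # reproducible illustration
TRUE_PERIOD = 2.0                # assumed true pendulum period, in seconds

# Systematic error: the stopwatch always starts 1 s ahead, so every one
# of the twenty readings is biased upward by the same amount
biased = [TRUE_PERIOD + 1.0 for _ in range(20)]

# Random error: unpredictable fluctuations around the true value
noisy = [TRUE_PERIOD + random.gauss(0, 0.05) for _ in range(20)]

mean_biased = sum(biased) / len(biased)   # the bias never averages out
mean_noisy = sum(noisy) / len(noisy)      # noise largely cancels on average
```

The 1-second bias shifts the average by exactly 1 second no matter how many readings are taken, while the random noise tends to cancel in the average.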
Random error is error caused by sudden, unpredictable changes in conditions (for example, in the atmosphere or the measuring environment). These errors remain even after systematic errors have been removed; hence this type of error is also called residual error. In simple words, random error is due to factors that cannot be controlled.
- Measurement (i.e. the measurement process) gives us the language to define/describe what we are studying.
- In measurement, two types of errors can occur: systematic, which we might be able to predict, and random, which are difficult to predict but can sometimes be addressed during statistical analysis.
- There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
- Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
- Once you have used a measure, you should reevaluate its reliability and validity based on your new data. Remember that the assessment of reliability and validity is an ongoing process.
Measurement: The process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating.
Concept: Refers to a grouping of several characteristics.
Constructs: Are not observable but can be defined based on observable characteristics.
Nominal: The level of measurement that is categorical; its categories cannot be mathematically ranked, though they are exhaustive and mutually exclusive.
Nominal variable: Another name for a categorical variable (categories have names and are not numeric).
Ordinal variable: Categories of the variable have numeric relationships and can be sequenced in numeric order, but it is not a continuous interval variable (for example, ratings of frequency from never to always).
Ordinal: The level of measurement that follows the nominal level; it has mutually exclusive categories and a hierarchy (order).
Interval: A higher level of measurement, denoted by mutually exclusive categories, a hierarchy (order), and equal spacing between values; the equal spacing means that differences between values are meaningful and values may be added and subtracted.
Ratio: The highest level of measurement, denoted by mutually exclusive categories, a hierarchy (order), values that can be added, subtracted, multiplied, and divided, and the presence of an absolute zero.
Reliability: The ability of a measurement tool to measure a phenomenon the same way, time after time. Note: reliability does not imply validity.
Internal consistency: The extent to which all questions or items assess the same characteristic, skill, or quality.
Validity: The extent to which the scores from a measure represent the variable they are intended to measure.
Face validity: The extent to which a measurement method appears “on its face” to measure the construct of interest.
Content validity: The extent to which a measure “covers” the construct of interest, i.e., its comprehensiveness in measuring the construct.
Criterion validity: The extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with.
Criterion: A variable that theoretically should be correlated with the construct being measured (plural: criteria).
Concurrent validity: A type of criterion validity that examines how well a tool provides the same scores as an already existing tool.
Predictive validity: A type of criterion validity that examines how well your tool predicts a future criterion.
Construct validity: Seeks agreement between a theoretical concept and a specific measuring device, such as observation.
Attitude measurement survey: A study, on a properly drawn sample of a specified population, to find out what people in that population feel about a specified issue.
Scale: A composite measure designed in a way that accounts for the possibility that different items on an index may vary in intensity.
Systematic errors: Errors that are generally predictable.
Random errors: Errors that lack any perceptible pattern.