12.4 Scale development

A valid, reliable, and rigorous psychological instrument must comply with a set of published standards that support its quality. These standards for scale development (Standards for Educational and Psychological Testing, 2014) are published and updated periodically by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME).

As shown in Table 12.1, the construction of a test is an iterative process aimed at satisfying the psychometric principles of reliability, validity, comparability, and fairness. These 11 stages of scale development can be grouped into seven main tasks to be completed by the research team:

  1. To clearly identify the purpose of the test (e.g., clinical screening) as well as any administration restrictions (e.g., time, administration mode).
  2. To define the domain and, based on the domain, create the test specification (i.e., blueprint).
  3. To generate an initial pool of items that should be reviewed by a group of experts before beginning the pilot stage.
  4. To pre-test the reviewed pool of items using a small sample of respondents to evaluate the psychometric principles of the initial test.
  5. To modify the test to achieve higher quality standards after evaluating its psychometric properties.
  6. To field test the modified scale using a larger, normative sample (e.g., respondents diagnosed with Major Depressive Disorder when developing an instrument aimed at this population) in order to set the final form of the test.
  7. To develop guidelines for the administration of the test, the scoring system, and the interpretation of the scores.
Table 12.1: Stages of Scale Development

Stage   Description
1       Purpose identification
2       Administration restrictions
3       Defining the domain
4       Test specification (blueprint)
5       Construct an initial pool of items
6       Review the items
7       Pilot the initial test
8       Modification of the test
9       Field test
10      Review the items (final form of the test)
11      Develop guidelines for administration, scoring, and interpretation of scores
Note. Between Stages 7 and 8, the researcher will analyze the initial pool of items, guided by the four psychometric principles, to ensure that the scale is reliable, valid, comparable, and fair. Using an iterative approach, the initial pool of items will be sequentially modified until the final form of the scale is established.

12.4.1 Purpose of the test

We must clarify and specify the purpose of the test, the psychological construct that we would like to measure, and the type of decisions that will be derived from the test scores. On the one hand, these scores might be used to decide whether a respondent has sufficient ability, skill, or knowledge in a certain topic or domain (e.g., criterion-referenced tests). On the other hand, test scores might be used to compare respondents against a normative sample (i.e., norm-referenced tests) for recruitment, psychological assessment, ranking and promotion, classification of people and roles, or even clinical screening.

12.4.2 Administration restrictions

These restrictions relate to the time limit (speed versus power tests), the administration mode (e.g., individual, collective, computerized adaptive testing), and the materials to be used (e.g., calculators, lab materials, dictionaries, computers).

12.4.3 Defining the domain

To define the domain, we need to think about the set of abilities, knowledge, or traits that we want to measure. These abilities and traits must be tied to a specific domain of behaviors and situations that will elicit the latent variable being measured; that is, the behaviors need to reflect the psychological construct.

Psychological constructs are based on theoretical frameworks. Consequently, an in-depth review of the relevant literature will reveal previous attempts to measure the same (or similar) psychological construct. A theory-driven process defining the limits of the psychological construct is recommended. Once the psychological construct has been set, the researcher must specify the behaviors that will reflect the psychological construct as well as the situations and tasks devised to elicit these behaviors.

12.4.4 Test specification (blueprint)

To generate items or tasks that measure the behaviors identified when defining the domain, we need to specify all the content areas and the differential weighting of the facets. For example, a rubric designed to assess psychology students' knowledge in a module of neuroanatomy and psychophysiology will include items or tasks devised to cover the module's key topics (e.g., brain structures' names and functions, cranial nerve nuclei connections, hormonal pathways). However, not all areas or facets covered in a test have the same relevance. For this reason, most rubrics include facets with different weightings. For instance, writing a good title or abstract in a quantitative report cannot be considered as relevant as writing a good results section.
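
As an illustration, a blueprint's differential weighting might be encoded as a simple mapping from facets to weights. The facet names and weights below are hypothetical, a minimal sketch rather than a prescribed scheme:

```python
# Hypothetical blueprint for the neuroanatomy/psychophysiology module:
# each facet maps to its weight in the final score (weights sum to 1.0).
blueprint = {
    "brain structures (names and functions)": 0.40,
    "cranial nerve nuclei connections": 0.35,
    "hormonal pathways": 0.25,
}

def weighted_score(facet_scores: dict[str, float]) -> float:
    """Combine per-facet scores (each on a 0-100 scale) using the blueprint weights."""
    return sum(blueprint[facet] * score for facet, score in facet_scores.items())

# A student scoring unevenly across facets:
print(weighted_score({
    "brain structures (names and functions)": 80,
    "cranial nerve nuclei connections": 60,
    "hormonal pathways": 90,
}))  # 0.40*80 + 0.35*60 + 0.25*90 = 75.5
```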

12.4.5 Draft the initial pool of items, review them, and pilot them

The initial pool of items is usually built on the assumption that many of these items won't make it into the final test. It is important to keep the purpose of the test in mind before selecting the appropriate response format (e.g., multiple choice, open-ended, sorting, identification, correction, gap filling, constructed response, a presentation).

If we decide to use a response format in which respondents select their answer from a set of options, we need to decide how many options to include. For ability tests, the options could range from a true/false format to several alternatives (ideally no more than three response options). In contrast, personality tests can be designed with dichotomous answers (yes versus no) or with graded responses (usually 4 to 9 points on a quasi-interval scale). Popular graded formats include the pervasive Likert scale, the semantic differential, and the visual analog scale (Revelle, 2022).

Regarding the statements, situations, and behaviors evaluated in each item, some general rules apply. Clarity and brevity should always guide scale construction. We must avoid redundancy, ambiguity, and double-barreled items. We should also include some negatively worded items; these will need to be reverse-scored when analyzing the psychometric properties of the scale, as in the sketch below.
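
Reverse-scoring is a simple transformation: on a graded scale, the reversed response equals the scale minimum plus the scale maximum minus the observed response. A minimal sketch, assuming a 5-point scale and hypothetical responses:

```python
import numpy as np

# Hypothetical responses: 6 respondents x 4 items on a 1-5 Likert scale,
# where items 1 and 3 (0-indexed) are negatively worded.
responses = np.array([
    [5, 1, 4, 2],
    [4, 2, 5, 1],
    [3, 3, 3, 3],
    [5, 2, 4, 1],
    [2, 4, 2, 5],
    [4, 1, 5, 2],
])
negative_items = [1, 3]
scale_min, scale_max = 1, 5

# Reversing maps 1 -> 5, 2 -> 4, ..., 5 -> 1: new = (min + max) - old.
scored = responses.copy()
scored[:, negative_items] = (scale_min + scale_max) - scored[:, negative_items]
```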

The initial pool of items should be reviewed by a group of experts before being administered to a small sample of respondents. These experts will provide valuable feedback on methodological and theoretical grounds. Once the initial pool of items has been pretested, we will conduct analyses to assess the psychometric properties of the items and the test. When piloting the initial pool of items, it is very important to gather qualitative data (e.g., interviews when debriefing the respondents) on the comprehension of the instructions, item wording and language ambiguity, time constraints, and other factors affecting respondents' performance.

12.4.6 Psychometric principles: Reliability, validity, comparability, and fairness

The reviewed pool of items needs to be pretested on a small sample of respondents to conduct the first quality checks against the four psychometric principles: reliability, validity, comparability, and fairness.

12.4.6.1 Reliability

Reliability accounts for the errors produced during the measurement process: to what extent do the observed scores reflect the respondents' true scores on the test? Reliability might thus be related to the repetition and generalization of the measures, the internal consistency of the test, or the level of agreement across raters.
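
Internal consistency, for instance, is often summarized with Cronbach's alpha, which compares the sum of the item variances with the variance of the total score. A minimal numpy sketch with hypothetical pilot data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical pilot data: 5 respondents x 3 items.
pilot = np.array([
    [4, 5, 4],
    [3, 4, 3],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
])
print(round(cronbach_alpha(pilot), 3))
```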

12.4.6.2 Validity

To what extent is the purpose of the test justified by the nomological network? Are we really measuring what we assume we are measuring? The psychological construct must be embedded in a network of interrelated constructs. We need to examine this network of related psychological constructs and the theories that support the data to test how credible our proposed model is. To do so, we will use different sources of evidence (e.g., predictive validity, content validity, convergent validity).
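
As a concrete illustration of one such source, convergent validity can be examined by correlating totals on the new scale with an established measure of the same or a related construct. A minimal sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical totals for the same respondents on the new scale and on
# an established measure of a related construct.
new_scale = np.array([12, 18, 9, 22, 15, 20, 11, 17])
established = np.array([14, 20, 10, 25, 13, 22, 12, 19])

# A strong positive Pearson correlation is evidence of convergent validity.
r = np.corrcoef(new_scale, established)[0, 1]
print(f"convergent validity r = {r:.2f}")
```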

12.4.6.3 Comparability

Sometimes, different modes of test administration (computerized versus paper-and-pencil) or different experimenters leading the task can have an unexpected impact on test scores. We need to rule out alternative explanations that could undermine the comparability of respondents' scores. Interestingly, Generalizability Theory (Brennan, 2001), an extension of Classical Test Theory, allows us to decompose the sources of error variance using facets (i.e., conditions of measurement) such as respondents, tasks, and raters.
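
To make the decomposition concrete, here is a minimal numpy sketch of a single-facet, fully crossed persons-by-raters design. The ratings are hypothetical, and real G studies typically rely on dedicated software:

```python
import numpy as np

# Hypothetical ratings: 4 respondents (rows) rated by 3 raters (columns).
X = np.array([
    [7.0, 8.0, 7.0],
    [5.0, 5.0, 6.0],
    [9.0, 8.0, 9.0],
    [4.0, 5.0, 4.0],
])
n_p, n_r = X.shape
grand = X.mean()

# Two-way ANOVA sums of squares for a fully crossed p x r design with
# one observation per cell (interaction and error are confounded).
ss_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum()
ss_r = n_p * ((X.mean(axis=0) - grand) ** 2).sum()
ss_total = ((X - grand) ** 2).sum()
ss_pr = ss_total - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# Estimated variance components for persons, raters, and residual.
var_p = max((ms_p - ms_pr) / n_r, 0.0)
var_r = max((ms_r - ms_pr) / n_p, 0.0)
var_pr = ms_pr

# Generalizability (relative) coefficient for the mean over n_r raters.
g_coef = var_p / (var_p + var_pr / n_r)
print(f"persons: {var_p:.2f}, raters: {var_r:.2f}, "
      f"residual: {var_pr:.2f}, G = {g_coef:.2f}")
```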

12.4.6.4 Fairness

The concept of fairness relates to the assumption that the test scores of different subsamples are not biased by sociodemographic differences (e.g., education, gender), language comprehension, or even familiarity with the task.

Bias has become a major topic in psychometrics due to the impact of biased high-stakes tests on people's lives. Although there are different forms of bias (e.g., internal, external), item bias has been discussed extensively by psychometricians. There are different approaches to detecting item bias (Rust & Golombok, 2009). For example, we can inspect and compare the difficulty of each item between two or more groups of respondents (e.g., males versus females), as sketched below. Another approach relies on comparing the factor-analytic structure of the scale across subsamples.
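
Here is a minimal sketch of the first approach, comparing item difficulty (proportion correct) across two hypothetical groups. Note that this is only a rough screen, not a formal differential item functioning test such as Mantel-Haenszel:

```python
import numpy as np

# Hypothetical dichotomous responses (1 = correct), items in columns.
group_a = np.array([[1, 1, 0, 1],
                    [1, 0, 1, 1],
                    [0, 1, 0, 1],
                    [1, 1, 1, 1]])
group_b = np.array([[1, 0, 0, 1],
                    [0, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 0, 1]])

# Item difficulty as the proportion correct in each group.
p_a = group_a.mean(axis=0)
p_b = group_b.mean(axis=0)

# Flag items whose difficulty differs markedly between groups
# (an arbitrary screening threshold, not a formal test).
for item, (pa, pb) in enumerate(zip(p_a, p_b), start=1):
    flag = "  <- inspect" if abs(pa - pb) > 0.25 else ""
    print(f"item {item}: group A = {pa:.2f}, group B = {pb:.2f}{flag}")
```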

12.4.7 Field test and the development of guidelines for administration, scoring, and interpretation of test scores

After checking the psychometric principles of the initial test, an improved version will be administered to a larger, normative sample. This field test is key to generating the final form of the test. Once the final form has been established, we will write the test guidelines.

These guidelines are written for test users interested in administering the test in the future. Most potential users of the recently developed scale won't be psychometricians or experts in the psychological construct of interest. For this reason, published tests, questionnaires, and inventories need to meet certain standards of quality and clarity (Standards for Educational and Psychological Testing, 2014).

The guidelines will include the theoretical framework that supports the specification of the psychological construct. Likewise, the guidelines will cover the purposes of the test, the target population, instructions for administering the test, reliability coefficients, and details on the accuracy of the measurement. We will also provide evidence of different types of validity (e.g., criterion validity, convergent validity, construct validity) for the given purposes of the test. Finally, we will suggest potential applications of the test (e.g., clinical screening) and the norms to interpret test scores.
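
Norm-referenced interpretation is often expressed through standardized scores and percentile ranks computed against the normative sample. A minimal sketch with hypothetical normative totals:

```python
import numpy as np

# Hypothetical raw totals from the normative sample.
norm_sample = np.array([23, 31, 27, 35, 29, 26, 33, 30, 28, 32])

def interpret(raw: float) -> tuple[float, float]:
    """Return the z-score and percentile rank of a raw score against the norms."""
    z = (raw - norm_sample.mean()) / norm_sample.std(ddof=1)
    percentile = (norm_sample <= raw).mean() * 100  # simple rank-based percentile
    return z, percentile

z, pct = interpret(34)
print(f"z = {z:.2f}, percentile rank = {pct:.0f}")
```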