Consider the following ASCD paper (copied verbatim from https://www.ascd.org/portal/site/ascd/menuitem.d6eaddbe742e2120db44aa33e3108a0c/template.ascdexpressjournal?articlemoid=7b7f89b094a75010VgnVCM1000003d01a8c0RCRD&journalmoid=f36f89b094a75010VgnVCM1000003d01a8c0RCRD):
A Policymaker’s Primer on Testing and Assessment
Standardized testing plays an increasingly important role in the lives of today’s students and educators. The U.S. No Child Left Behind Act (NCLB) requires assessment in math and literacy in grades 3–8 and 10 and, as of 2007–08, in science once in grades 3–5, 6–9, and 10–12. Based on National Center for Education Statistics enrollment projections, that will be roughly 68 million tests per year, simply to meet the requirements of NCLB. Such an intense focus on assessment, with real consequences attached for students and educators, makes it imperative that policymakers understand the complexities involved in assessment and in using assessments as part of high-stakes accountability policies.
As policymakers continue to establish and revise state and national assessment and accountability systems, two overarching questions must be addressed:
Do current tests supply valid and reliable information?
What happens to such assessments when high stakes are attached to the outcomes?
The American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) have jointly released The Standards for Educational and Psychological Testing (1999), a detailed set of guidelines on assessment use. Within these guidelines, the associations note that although tests, “when used appropriately, can be valid measures of student achievement,” decisions “about a student’s continued education, such as retention, tracking, or graduation, should not be based on the results of a single test but should include other relevant and valid information” (APA, 2001, paras. 9, 14). In a position supported by its Leadership Council, ASCD takes a similar stance (see box).
ASCD Adopted Position on High-Stakes Testing, 2004
Decision makers in education—students, parents, educators, community members, and policymakers—all need timely access to information from many sources. Judgments about student learning and education program success need to be informed by multiple measures. Using a single achievement test to sanction students, educators, schools, districts, states/provinces, or countries is an inappropriate use of assessment. ASCD supports the use of multiple measures in assessment systems that are
Fair, balanced, and grounded in the art and science of learning and teaching;
Reflective of curricular and developmental goals and representative of content that students have had an opportunity to learn;
Used to inform and improve instruction;
Designed to accommodate nonnative speakers and special-needs students; and
Valid, reliable, and supported by professional, scientific, and ethical standards designed to fairly assess the unique and diverse abilities and knowledge base of all students.
Complexities in Assessment
On both the individual and system levels, assessment poses issues worthy of consideration.
Individual Assessment. Multiple forms of assessment are important because of the potential effect of human error within even well-designed systems. Researchers at the National Board on Educational Testing and Public Policy found that human error in testing programs occurs during all phases of testing (from design and administration to scoring and reporting), and that such errors can have a significant negative effect on students when high-stakes decisions are made.
In 1999, researchers found that individuals involved in assessment made numerous errors across its different phases, resulting in significant negative consequences. For example, 50 students were wrongly denied graduation; 8,668 students were needlessly required to attend summer school; and 257,000 students were misclassified as limited-English-proficient (Rhodes & Madaus, 2003). In January 2003, more than 4,000 teacher candidates were incorrectly failed on their certification tests due to an ETS scoring error (Clark, 2004).
Systemic Assessment. Using test results to evaluate educational systems is also problematic. As highlighted in a recent presentation at ETS (Raudenbush, 2004), the general concept of using tests for this purpose assumes there is a causal relationship between the system (treatment) and the test score (outcomes); however, assessment systems as currently designed are not structured to determine causation (there are no comparison or control groups). The assessment systems assume that school effects cause any differentiation in scores, but those differences could be the result of other, uncontrolled-for variables, such as the effect of previous schools or the effect of wealth or community characteristics (Popham, 2003; Turkheimer, Haley, Waldron, D’Onofrio, & Gottesman, 2003). According to Raudenbush, using school-mean proficiency results (NCLB’s basic accountability mechanism) to evaluate schools is “scientifically indefensible.” Although value-added assessment (which measures year-to-year gain) addresses some issues, it, too, presents a flawed analysis of schoolwide performance, particularly when there are transitions between schools or significant differences in earlier educational experiences.
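The difference between the two accountability measures named above can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the two schools, their scores, the cut score of 60, and the function names are invented for illustration and are not drawn from Raudenbush or from NCLB. The "mean gain" shown is only a crude stand-in for value-added models, which are considerably more elaborate.

```python
from statistics import mean

# Assumed cut score for "proficient"; invented for this illustration.
PROFICIENCY_CUT = 60

# Hypothetical (prior-year score, current-year score) pairs per student.
school_a = [(70, 72), (65, 66), (80, 81), (62, 63)]  # high intake, small gains
school_b = [(40, 52), (45, 58), (50, 61), (38, 49)]  # low intake, large gains


def proficiency_rate(students, cut=PROFICIENCY_CUT):
    """Share of students whose current-year score meets the cut score."""
    return mean(1 if current >= cut else 0 for _, current in students)


def mean_gain(students):
    """Average year-to-year score change (a crude value-added proxy)."""
    return mean(current - prior for prior, current in students)


for name, students in (("A", school_a), ("B", school_b)):
    print(f"School {name}: proficient={proficiency_rate(students):.0%}, "
          f"mean gain={mean_gain(students):.2f}")
```

On these invented numbers, School A is 100 percent proficient with a mean gain of 1.25 points, while School B is 25 percent proficient with a mean gain of 11.75 points. The two measures rank the schools in opposite order, and neither one isolates the school's own contribution from what students brought with them, which is the causal gap described above.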
The addition of high-stakes consequences to assessment systems in order to motivate change in educator behavior adds one more serious degree of complexity. High-stakes accountability mechanisms generally rely on operant theories of motivation that emphasize the use of external incentives (punishments or rewards) to force change (Ryan & Brown, in press). Other theories of motivation, however, suggest that such reliance on external incentives will result in negative and unintended consequences (Ryan & Brown, in press; Ryan & Deci, 2000). Operant approaches to motivation focus on behaviors (that is, the reward or punishment is designed to cause behavioral change), but the testing movement focuses on outcomes (the achievement of specific scores) regardless of behavior change. These conflicting goals result in a situation where the ends (higher test scores) become more important than the means (changes in educator behavior) used to achieve those ends. In other words, because the rewards and punishments stemming from the testing program are attached to conditions that educators may not have control over (including school and classroom resources, community poverty, social supports, and so on), educators are left to make changes in variables they do control (such as student enrollments, test administration, and classroom instruction).
As predicted by Ryan and Brown, the change in these variables is complex and includes consequences that policymakers could not have intended, such as narrowing the curriculum and associated training to tested subjects (Berry, Turchi, Johnson, Hare, Owens, & Clements, 2003; Moon, Callahan, & Tomlinson, 2003), increased push-out of underperforming students (Lewin & Medina, 2003), and increased manipulation of test administration (Rodriguez, 1999). A recent survey conducted by the National Board on Educational Testing and Public Policy found that 75 percent of teachers thought that state-mandated testing programs led teachers in their school to teach in ways that contradict their own ideas of good educational practice (Pedulla, 2003).
Assessment Types, Uses, and Scoring
Because much of the responsibility for the use of assessments resides with the users, it is important that policymakers understand in general what tests can and cannot do, as well as the appropriate ways in which tests might be used as part of an accountability system.
At best, tests are an incomplete measure of what a student knows and can do. A final score measures only the student’s performance relative to the sample of items included on that specific test. This is why educators argue for the use of multiple measures in evaluating students, so that a more complete picture of the student can be generated. Educators use assessments that cover a variety of purposes and measure differing levels of knowledge, skills, and abilities. For an assessment to work well, it must be used in a manner consistent with the instructions of the test maker. Using a test for a purpose for which it was not intended can result in invalid or unreliable outcomes. The same is true of using a test that has not been fully validated, or using tests where the scoring parameters have been set for political or public relations purposes rather than measurement purposes.
Thus, it is critical that the appropriate assessments and measures be used for the identified policy or educational goals. Three general areas to consider when examining assessments are test type (such as achievement tests or aptitude tests), test use (for diagnostics, placement, or formative or summative evaluation), and the scoring reference (raw scores, norm-referenced scores, or criterion-referenced scores).
Test Type. Achievement and aptitude tests, although similar, attempt to measure two different concepts. Achievement tests generally measure the specific content a student has (or has not) learned, whereas aptitude tests attempt to predict a student’s future behavior or achievement (Popham, 2003). Although student outcomes on these tests may be related, it would be inappropriate to use the tests interchangeably because they measure different constructs. The SAT is an example of an aptitude test that is frequently misused by policy activists to make content-focused judgments or comparisons of student achievement.
Test Use. Tests are used to help diagnose areas of student strength and weakness, as well as specific learning difficulties. Tests can also be used to guide school readiness and placement decisions, and to make formative or summative evaluations. Formative evaluations are structured assessments designed to gauge the progress of students as measured against specific learning objectives. Such assessments are used to help guide instruction so that teachers and students have a general idea of what learning outcomes have been achieved, and where further focus is needed. Summative assessments, on the other hand, are used to evaluate achievement at the end of specific educational programs (for example, mathematics achievement at the end of grade 10).
Scoring. The way in which tests are designed to have scores reported (as norm-referenced or criterion-referenced) also plays a key role in test usage. Norm-referenced tests are designed to result in a score spread, so that students can be compared to their peers and ranked by percentile. Scores reported from a norm-referenced test, therefore, are distributed so that, by design, half of the test takers score in the top 50 percent and half score in the bottom 50 percent. Because the goal is to differentiate between test takers, when test items are created and validated, items that are too easy or too hard are discarded because they fail to differentiate between students. Even if a norm-referenced test is created from a set of state standards, it is exceptionally difficult to use such a test as a summative assessment because important content items may have been discarded in the test building process for being deemed too easy or too hard (Popham, 2003; Linn & Gronlund, 1995).
Criterion-referenced tests, however, do try to focus specifically on student outcomes relative to a fixed body of knowledge. Criterion-referenced tests can result in the majority of students scoring above, or below, a specified cut score. And, in fact, a criterion-referenced test should be positively (or negatively) skewed, depending on the success of the students and teachers in addressing the body of content from which the test has been constructed. State assessments designed to measure the achievement of students relative to the state’s content standards should be criterion-referenced.
Test scores are also occasionally reported in raw scores, which are simply the total of correct responses. Unfortunately, the raw score is frequently misinterpreted because it is reported without interpretation. A test that is particularly difficult (or easy) may have an unusually low (or high) average score. Without knowing the context of the test or the scoring, it is impossible to make a judgment as to what the raw scores say about the performance of test takers.
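The three scoring references discussed in this section (raw, norm-referenced, and criterion-referenced) can be contrasted with a short Python sketch. The scores, the 40-item test length, the cut score of 28, and the function names are all assumptions made up for illustration; real testing programs use more refined percentile and standard-setting procedures.

```python
# Invented raw scores (number of correct responses on a 40-item test).
raw_scores = [12, 15, 18, 18, 21, 24, 27, 30]

# Hypothetical cut score tied to a fixed body of content.
CUT_SCORE = 28


def percentile_rank(score, all_scores):
    """Norm-referenced view: percent of test takers scoring below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


def meets_criterion(score, cut_score=CUT_SCORE):
    """Criterion-referenced view: does the score reach the fixed cut score?"""
    return score >= cut_score


student_raw = 24
print("raw score:", student_raw)
print("percentile rank:", percentile_rank(student_raw, raw_scores))
print("meets criterion:", meets_criterion(student_raw))
```

In this made-up group, a raw score of 24 sits at a percentile rank of 62.5, so the student looks comfortably above average relative to peers, yet the same score falls short of the assumed cut score of 28. The raw number alone supports neither conclusion until the reader knows which reference is being applied.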
Interpreting Test Scores
Linn and Gronlund (1995) offer five cautions for interpreting test scores:
Scores should be interpreted in terms of the specific tests from which they were derived. In other words, student scores on a reading test should not be taken to represent students’ general ability to read; rather, the scores should be examined only in light of the skills the assessment was intended to measure. For instance, a reading test that measures a student’s ability to sound out words would not tell us how well a student comprehends the main idea in a paragraph of text.
Scores should be interpreted in light of all the student’s relevant characteristics. A student’s score on a specific test may be influenced by many variables, including language background, education, cultural background, and motivation. A low score does not necessarily indicate that the student does not know the material or that the system has failed to engage the student.
Scores should be interpreted according to the type of decisions to be made. Test scores should not be generalized to actions beyond the original purpose of the test.
Scores should be interpreted as a band of possible scores, rather than an absolute value. Because tests are only an approximate measure of what a student actually knows and can do, the student’s true abilities may differ from the measured score. Most tests include a measure of standard error, which can be used to help determine where a student’s true score may lie. For example, the true score for a student scoring a 68 on a test with a 4-point standard error is likely to fall within the range of 64 to 72, that is, within one standard error on either side of the observed score.
Scores should be verified by supplementary evidence. This is perhaps the single most important admonition for test users. No test can ensure the accurate measure of a student’s true performance; other evidence should be examined. Allowing students to retake the same test does not provide supplementary evidence of performance. Instead, alternative measures, such as classroom performance, should be used to help make accurate determinations of student abilities.
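The fourth caution, reading a score as a band rather than a point, comes down to simple arithmetic, sketched here in Python. The function names, the `z` multiplier parameter, and the cut score of 70 are assumptions added for illustration. Under the usual assumption of normally distributed measurement error, a band of one standard error on either side covers roughly 68 percent of cases, and a multiplier of 1.96 gives roughly 95 percent.

```python
def score_band(observed, standard_error, z=1.0):
    """Return (low, high) bounds: observed score +/- z standard errors."""
    margin = z * standard_error
    return observed - margin, observed + margin


def band_straddles_cut(observed, standard_error, cut_score, z=1.0):
    """True when the cut score lies inside the band, so the pass/fail
    decision is not clearly supported by the test score alone."""
    low, high = score_band(observed, standard_error, z)
    return low <= cut_score <= high


# The example from the text: a score of 68 with a 4-point standard error.
low, high = score_band(68, 4)
print(f"one-standard-error band: {low} to {high}")

# Hypothetical cut score of 70, chosen only to illustrate the problem.
print("cut score inside band:", band_straddles_cut(68, 4, cut_score=70))
```

With the figures from the text, the band runs from 64 to 72. If a passing score happened to be set at 70, the student's observed 68 would count as a failure even though the band includes passing values, which is one reason the fifth caution calls for supplementary evidence.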
Constructing Assessment Systems
In constructing assessment systems, test makers can draw from a variety of item types and formats, depending on the type of assessment being created and its purpose. For example, although selected-response tests (such as multiple-choice tests) are easy to score and offer a reasonable measure for vocabulary, facts, or general principles and methods, they are less useful for measuring complex achievement, such as the application of principles or the ability to generate hypotheses or conduct experiments. Such complex abilities require more complex item constructs, such as those found on constructed-response tests, which may include essay questions or actual performance assessments.
Performance and portfolio assessments (authentic assessments), on the other hand, allow students to demonstrate their competence more intentionally. Although such assessments may resemble traditional constructed-response tests, their goal is to mirror tasks that people might face in real life. For example, they might require students to demonstrate writing competence through a series of polished essays, papers, or poems (depending on the type of writing being assessed), or to design, set up, run, and evaluate a science experiment. Other types of performance assessment include speeches, formal presentations, or exhibits of student work.
Portfolio assessments, although similar to performance assessments, are designed to collect data over time and can also include measures from traditional assessments. The goal of portfolios is to allow teachers, students, and evaluators to gauge student growth by examining specific artifacts that students have created. Students in British Columbia, for example, are required to present a Graduation Portfolio Assessment, which accounts for 4 of the 80 course credits required to be awarded a diploma (BC Ministry of Education, 2004). The portfolio documents student work in grades 10–12 in six domains: Arts and Design, Community Involvement and Responsibility, Education and Career Planning, Employability Skills, Information Technology, and Personal Health. Although districts have approached the requirement in different ways, Surrey School District, which has the largest enrollment in British Columbia, is helping students create electronic portfolios that will provide Web-accessible evidence of their academic performance. In Providence, Rhode Island, the Met School has gone one step further and eliminated grades and traditional tests altogether, evaluating student work completely through publicly presented portfolios (Washor & Mojkowski, 2003).
Constructed-response tests—including performance and portfolio assessments—provide a richer evaluation of students, but they are much more time-consuming for teachers, students, and evaluators; they are also more expensive and difficult to administer and score in a large-scale standardized manner. Connecticut school officials are currently in a dispute with the U.S. Department of Education regarding assessment costs, because they don’t want to “dumb down” their constructed-response tests by dropping writing components that require hand scoring (Archer, 2005). Even so, the educational richness inherent in authentic assessments suggests that policymakers take seriously the possibility of incorporating a deep evidence base in assessment and accountability models.
Assessment and Ethics
The ethical practices related to testing and assessment further complicate the picture. As highlighted by Megargee (2000), the ethical responsibilities for assessment are split between the test developer and the test user—the developer being responsible for ensuring the tests are scientifically reliable and valid measures, and the user for “the proper administration and scoring of the test, interpretation of the test scores, communication of the results, safeguarding the welfare of the test takers, and maintaining the confidentiality of their test records” (p. 52). This separation of ethical responsibility between test makers and consumers results in a loophole that allows commercial test makers to sell assessments to clients even when they know the tests will be misused. Additionally, although the education profession has taken responsibility for creating ethical standards, it currently has no mechanisms for enforcement.
Policymakers face a daunting challenge in designing school assessment and accountability systems; however, professionals in assessment have worked hard to provide the basic outline for policies that can support positive assessment systems. These systems cannot be implemented cheaply, and when cost-saving compromises are made, serious damage to both individuals and systems (school and assessment) can result. Therefore, policymakers should work to carefully understand (and adjust for) the trade-offs they make as they seek to create cost-effective accountability systems. It is not an overstatement to say that the lives of individual students will be positively or negatively affected by the decisions policymakers make.
In an effort to increase both the instructional use of assessments and public confidence in such systems, states should work to keep these systems transparent, allowing relevant stakeholders to review test content and student answer sheets. Teachers, parents, and students cannot use test data to improve instruction or focus learning if they are denied access to detailed score reports. In fact, states may be required to give such information to parents. Washington State officials recently decided to give parents access to student tests and booklets because they determined that under the Family Educational Rights and Privacy Act (FERPA), exams were defined as part of a student’s educational records and, therefore, must be made available to parents—and to students once they reach 18 years of age (Houtz, 2005).
Professional associations and psychometricians have focused on creating standards for test use (AERA, APA, & NCME, 1999), some of which have been delineated here. Due to the split between assessment creators and consumers regarding ethical responsibilities for test usage, as well as the lack of professional enforcement mechanisms, it is imperative that policymakers incorporate the recommendations of assessment professionals as they create systems that use evidence from standardized and large-scale assessment programs.
Recent Origins of Standardized Testing
Much of the theory and many of the constructs undergirding standardized assessments evolved from work done on standardized intelligence testing. British psychologist Sir Francis Galton, French psychologist Alfred Binet, and an American from Stanford University, Lewis Terman, are generally credited as the fathers of modern intelligence testing (Megargee, 2000). The work of Terman and Binet ultimately resulted in the Stanford-Binet Intelligence Scale, which is still in use today. The SAT, an aptitude test (a test that attempts to predict a student’s future achievement), came into being in 1926 to help predict a student’s likely success in college, and the Graduate Record Examinations (GRE) were introduced a decade later. In 1939, David Wechsler introduced an intelligence scale that broke intelligence into discrete pieces, in this case verbal and nonverbal subtests. The first large-scale use of standardized intelligence testing occurred in the U.S. military during World War I, when more than 1,700,000 recruits were tested to determine their role (as officers or enlisted men) or denote them as unable to serve. Standardized achievement tests, which attempt to measure the specific knowledge and skills that a student currently possesses (and not general intellectual ability or potential for future achievement), came into widespread use in the 1970s through minimum competency testing (Popham, 2001).
The evolution of intelligence testing has been turbulent, with researchers still debating whether intelligence is a single construct referred to as “g” (Gottfredson, 1998) or consists of many different intelligences, as Gardner’s theory of multiple intelligences posits: linguistic, logical-mathematical, musical, spatial, bodily-kinesthetic, interpersonal, intrapersonal, and naturalist (Checkley, 1997). In addition to debates about how to define intelligence, scientists are trying to determine how much of it, if any, is hereditary and how much is learned, that is, influenced positively or negatively by the environment in which a person exists. One recent study, for example, found that the effects of poverty on intelligence could overwhelm any genetic differences, emphasizing the complex nature of intelligence (Turkheimer, Haley, Waldron, D’Onofrio, & Gottesman, 2003).
Historically, intelligence testing has also been used in ways that many people today find offensive. The eugenics movement of the early to mid-20th century used intelligence testing to identify individuals who were “feebleminded” (or had other deficiencies) so that they could be institutionalized or placed in basic-skills tracks (Stoskopf, 1999b). Eugenic policies were created to “strengthen” the genetic makeup of Americans, and scientists who supported these policies provided the impetus for U.S. immigration restrictions in the 1920s and sterilization laws that were in effect through the 1960s, resulting in the sterilization of, at a minimum, 60,000 individuals (Reilly, 1987). As recently as last year, a candidate for U.S. Congress from Tennessee, James Hart, garnered almost 60,000 votes running on a platform of eugenics (Associated Press, 2004; Hart, 2004; McDowell, 2004).
Early IQ testing, which was greatly affected by culturally biased items, also resulted in the tracking of African American children into low-level courses and vocational schools, on the basis of the assumption that they had generally low mental abilities (Stoskopf, 1999a). In 1923, Carl Brigham, who later helped create the SAT, published A Study of American Intelligence, which alleged on the basis of U.S. Army testing that intelligence was tied to race. Brigham recanted his findings in 1930; however, his work was used extensively to provide “scientific” evidence for racist policies in the 1920s (Stoskopf, 1999a).
[Extensive bibliography omitted]
Dan Laitsch is an assistant professor in the Faculty of Education at Simon Fraser University in British Columbia, Canada, and is coeditor of the International Journal for Education Policy and Leadership.
July 2005 Number 42
Copyright © 2005 by Association for Supervision and Curriculum Development