Testing for Dummies: 5 Facts about Standardized Testing
As states release standardized test scores, the rhetoric reflects false understandings of their meaning and potential for impact on policy goals, needs, and accountability for schools.
This week the Texas Education Agency did its annual dump of standardized test scores onto the public. All states are in the process of doing the same. Once again, the rhetoric surrounding this dump (in my state and probably yours) is embarrassingly and dangerously ignorant of what a standardized test is and the limits of its interpretive reach.
We might as well release the blood pressure readings of all Americans from the lowest to the highest and then entrust all interpretations to those with no medical training or understanding of what a blood pressure reading means. It’s that ridiculous. Should the medical community speak up, to follow the logic applied to test scores, we should ignore their training and logic, treat them as apologists unwilling to stand up for patients, and act as if our patently false understandings are true.
And we should use those data we do not understand to judge the effectiveness of our hospitals and the overall medical community.
I wish that were hyperbole, but it really is that dumb. And that isn’t anti-accountability. It's a request to stop doing dumb things that don’t/can’t meet the stated policy goals and needs.
An annual tradition of mine as someone who understands standardized testing reasonably well is to try yet again to set the record straight. Please, if you’ve read this far, forward this to ten people you know. Maybe I won’t have to do this next year.
Here are some facts that everyone should know about standardized testing, facts that, if broadly understood, would have prevented us from ever getting into this mess.
1. Standardized test scores say zero about the effectiveness of a school.
Nothing. Nada. Zip. A high score is not, on its own, an indicator of anything. If a student was likely to score high regardless of the school attended, a high score is unlikely to be a signal of an effective school. In fact, it may be a signal that the school is coasting. Or, on the other hand, a high score could be the result of the school doing amazing things. Regardless, other data are required to make a valid interpretation.
A low score is not on its own an indicator of anything. If the school is filled with students who are highly likely to struggle, the scores are indicative of the socioeconomics of the neighborhood, and whether the school is effectively serving those students will need to be determined elsewhere. Or it could be that the school needs to dramatically improve. Regardless, other data are required to make a valid interpretation.
There is no such thing as a good or a bad standalone standardized test score.
Let’s try that a second way: there is no such thing as a good or a bad standalone standardized test score. There are going to be a range of scores, but what they mean for the schools from which they came cannot be ascertained until someone goes behind the scenes and does some research. That step does not occur in our current test-based accountability, and thus we are guilty of making judgments about schools with no evidence that those judgments are accurate or valid.
And now a third: we spend billions of dollars a year on a system that is supposed to help the public understand the effectiveness of our schools that by design can say nothing about the effectiveness of our schools. Let that sink in.
2. Standardized test scores are uninterpretable without lots of other information.
Standardized test scores offer researchers one tiny piece of information that is incredibly difficult to obtain—the broad patterns within a population that exist as of a moment in time. That is useful only as a part of a larger research agenda. Period.
Everyone should realize that what caused standardized testing to be invented was the desire to order students along a scale so that these patterns might be observed. That scale starts in the middle, at average, and then reaches out towards two extremes: the student furthest below and the student furthest above average. It doesn’t start at zero and move to a hundred like a test of learning does. This isn’t about what was learned in a classroom or a school. It's about patterns as of a moment in time.
Test makers must know ahead of time how the items that go into their tests will be answered. Otherwise, you might have everyone getting them all right or all wrong. Or everyone getting the exact same score. If that happened, you wouldn’t be positioning students along a scale. You wouldn’t have a standardized test instrument but something else. You wouldn’t be able to compare scores and the patterns behind them over time, which is the definition of what a standardized test does.
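To make the contrast concrete, here is a toy sketch of norm-referenced scaling. All numbers are hypothetical, and no state's actual scaling method works this simply; the point is only that a norm-referenced score is a position relative to the cohort average, not a percentage of material mastered.

```python
import statistics

def norm_reference(raw_scores):
    """Convert raw scores to z-scores: positions relative to the
    cohort average, not percentages of material mastered."""
    mean = statistics.mean(raw_scores)
    sd = statistics.pstdev(raw_scores)
    return [(score - mean) / sd for score in raw_scores]

# A toy cohort of raw scores (hypothetical numbers).
raw = [52, 61, 70, 70, 74, 80, 88, 93]
z = norm_reference(raw)

# The scale is centered on the cohort itself: the mean z-score is
# always 0, so some of the cohort sits below "average" by construction.
below_average = sum(1 for value in z if value < 0)
print(below_average, len(raw))
```

Notice that the scale has no fixed zero or ceiling; it re-centers on whatever cohort takes the test, which is exactly why the scores say nothing about classroom learning on their own.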
The tiny piece of valuable information that emerges from such an effort is in sensing what patterns exist in the results when you add in other variables. Gender would be one possibility. If, for example, boys tended to score higher than girls, educators could determine if that was acceptable or if interventions were warranted to break a negative pattern.
If you wanted to see over time if a negative pattern was dissipating, you could check for that as well.
On their own, standardized test scores are as useless as an inflatable dartboard. Alongside other information they can contribute limited observations that would otherwise be hard to come by. That’s it. Anything else risks being akin to the blood pressure example above.
3. Drawing a compliance line in a research instrument’s results and pretending it means something invalidates the research function of a test.
In other words, the test results change from being about the topic being researched to a convenient but invalid way to make a judgment.
Research instruments are to inform research. The results always require the skills of a professional interpreter. Test scores and the scales that underlie them are no different.
Some who are willfully ignorant will always want to draw compliance lines at points along the scale and pretend they mean something. The most common pretend meaning is that everyone above is effective, and everyone below fails. But that would be impossible to know without lots of additional research.
And common sense is all we need to invalidate that premise. If a school is filled with students likely to score high no matter the school, declarations of success are akin to a participation trophy and make no sense. Why reward a school for opening its doors?
If a school is filled with students likely to struggle, a declaration of failure is a manufactured judgment likely to serve as a self-fulfilling prophecy. Why doom a school for its willingness to support at-risk populations and make it harder than ever to do its job?
Compliance lines cannot identify effectiveness. It’s embarrassing that we presume otherwise, because doing so manufactures effectiveness where it may not exist, and failure where it is convenient. It's also embarrassing because compliance is so roundly rejected as an effectiveness tool by the rest of the world, making school accountability a strange anomaly.
4. Standardized test scores cannot inform instruction.
See my notes above on the underlying scale and the selection of items that behave in a narrow, predictable way. A critical step in selecting those items is to eliminate those that respond to specific instructional efforts.
In other words—and I get that this will feel contrarian—teaching to this sort of test is akin to malpractice. It’s designed to sit in the background and make broad observations about patterns in numeracy and literacy attainment, and to do that it sacrifices any capacity to inform instruction.
That means any attempt to glean instructional inferences will result in invalid decisions and is likely to hurt more than help. Please don’t do it, or encourage others to do so.
5. Complaints about not enough students being above average (or in a certain bucket) are, well, stupid.
Every year pundits find something. This year in Texas it’s that not enough kids are scoring above one of the compliance lines the state equates with deep learning.
Remember that scale I talked about? Standardized testing will always start in the middle of that scale and then with each question answered correctly or incorrectly move up or down until it finds each student’s position.
To complain that there aren’t enough kids in one part of that scale is to pretend that this isn’t a standardized test, that all kids could be above a certain point and the test would still qualify as a standardized test.
Not everyone can be above average on a standardized test. And so, by extension, not everyone can be above any point drawn along the scale. Holding schools accountable for more kids occupying a spot somewhere along the scale is akin to asking a leopard to change its spots. It represents a misunderstanding of the limits within standardized test land.
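The point above can be sketched with a toy example (hypothetical numbers, not real test data): even if every single student improves by the same amount, the share of the cohort sitting below the cohort's own average does not move, because the scale re-centers on the cohort.

```python
import statistics

def share_below_average(scores):
    """Fraction of a cohort scoring below the cohort's own mean."""
    mean = statistics.mean(scores)
    return sum(1 for s in scores if s < mean) / len(scores)

cohort = [55, 60, 65, 70, 75, 80, 85, 90]   # hypothetical raw scores
improved = [s + 20 for s in cohort]          # everyone gains 20 points

# The share below average is identical before and after the gains:
# relative position, not absolute attainment, is what the scale reports.
print(share_below_average(cohort), share_below_average(improved))
```

Real score distributions are messier than this sketch, but the underlying constraint holds: a norm-referenced scale cannot, by construction, put everyone above a point defined relative to the group.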
If someone wants to argue that deep learning isn’t happening to the degree it should? Great. Based on what I see in my work I’d agree with you, and I’d place a huge amount of the blame on the ignorance behind standardized test usage, teaching to the test, etc.
And to those who like to treat test scores like a Rorschach blot onto which any interpretation can be projected, please realize that doesn’t lead to solutions. It just leads to more misunderstandings and inappropriate uses of a research instrument, which will always make things worse, not better.
That’s the problem with invalid interpretations of all kinds—they will always do more harm than good, even when they stumble upon the right answer.
There it is. Testing for dummies. See you next year. Or maybe not.
But here's the hope: deep down I think everybody knows how bad/dumb the current system is, but what to do looks like a big black hole. My organization has been researching what to do for the last decade and a half, and the thing is, more and more schools and districts are joining a movement to do something about it. For more, please visit bravEd's website.
This article originally appeared on the EdContrarian blog on August 18, 2023.