Sensitivity Analysis of Student Proficiency
Lately I have been reading an article by Andrew Ho in the journal Educational Researcher titled “The problem with ‘proficiency’: limitations of statistics and policy under No Child Left Behind.” It is a worthwhile article and I encourage others (hint: employees of The Standards Company) to read it.
Briefly, Ho describes a statistical analysis of student proficiency on state assessments. Now, each state defines “proficiency” differently. So while one state may decide that answering 50% of the items correctly is enough to label a student “proficient,” another state may decide 70% is the better choice. Naturally, where the cut-off between “proficient” and “not proficient” is placed would dramatically change the percentage of students considered proficient in one state in comparison to another. Who would argue otherwise?
What is not so clear is whether the year-to-year change in the percentage of students scoring proficient would differ between one state and the next if the cut-off point were changed. At first glance it shouldn’t matter: where two race cars begin on a track has little to do with how rapidly one car would gain on the other.
But complications set in, because the race car analogy does not quite hold for state test proficiency. Proficiency figures do not track the gains of the same students from one year to the next; they compare the same grade level from one year to the next. In other words, we measure a batch of students designated as Grade 6 in one year, and another batch of students designated as Grade 6 the next year. In this case, the arbitrary cut-off point between “proficient” and “not proficient” can affect the growth in the number of students scoring proficient. It all comes down to a basic issue: how many students were on the edge of the cut-off point?
Ho performed a statistical analysis of data collected from an unnamed state (that is, Ho may know which state is under consideration, but he isn’t telling us) to measure the rate of change in student proficiency across two years. He then changed the cut-off point and repeated the analysis.
Guess what? The rate-of-change changed dramatically, indicating that rate-of-change is overly sensitive to the choice of cut-off point.
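To see why the cut-off can matter so much, here is a minimal sketch in Python. It is not Ho's analysis or his data: the normal score distributions, the means of 60 and 63, the standard deviation of 10, and the function names are all assumptions I made up for illustration. Two Grade 6 cohorts differ only by a 3-point shift in average score, and the sketch reports the gain in percent proficient at several cut-offs.

```python
import math

def percent_proficient(mean, sd, cut):
    """Percent of a normally distributed cohort scoring at or above `cut`."""
    z = (cut - mean) / sd
    return 100.0 * 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

# Hypothetical cohorts (assumed values, not real data):
# same spread, and the second year's mean is 3 points higher.
YEAR1_MEAN, YEAR2_MEAN, SD = 60.0, 63.0, 10.0

def gain(cut):
    """Year-over-year change in percent proficient at a given cut-off."""
    return (percent_proficient(YEAR2_MEAN, SD, cut)
            - percent_proficient(YEAR1_MEAN, SD, cut))

for cut in (40, 50, 60, 70, 80):
    print(f"cut-off {cut}: gain = {gain(cut):.1f} percentage points")
```

Under these assumptions the same 3-point improvement looks large when the cut-off sits near the middle of the distribution, where many students are on the edge, and small when the cut-off sits far out in either tail.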
There is a direct analogy to this issue in computer simulation. A common test of the worth of a computer simulation is a sensitivity analysis. For example, suppose I have a computer simulation designed to predict the weather patterns appearing ten years from today. Naturally, I have to feed initial conditions into the computer before I can start the simulation. Suppose that I input the temperature, pressure, humidity, and wind speeds measured at 5,000 locations spread across the globe. When I run the simulation, I find that a hurricane is blowing through College Station, TX, ten years from today. I can certainly publish my results and have Aggie fans begin upgrading their building codes in preparation for the onslaught of weather.
Now suppose that, for kicks, I decide to run the simulation once again. If I choose the exact same initial conditions, I should get the same result, a hurricane over College Station. (Actually, most simulations feature randomness, either by design or through random error, that would produce a different result if run again.) However, I decide to modify the initial temperature at one of the 5,000 points by one degree. When I run the simulation once again, I instead get a hurricane over China.
So of what use is my computer simulation? Very little, because there is little chance that I could supply initial conditions precise enough to produce a reliable result. Miss one temperature measurement by one degree and the hurricane appears on the other side of the globe.
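The kind of check described here can be sketched in a few lines of Python. This is only a toy: I am substituting the logistic map, a textbook chaotic system, for a weather model, and the starting value 0.2, the perturbation of one part in a million, and the function name are assumptions chosen for illustration. The sketch runs the same "simulation" twice from identical initial conditions, then once more from a slightly nudged starting point, and compares the trajectories.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), returning every state."""
    states = [x0]
    for _ in range(steps):
        states.append(r * states[-1] * (1.0 - states[-1]))
    return states

STEPS = 50
baseline = logistic_trajectory(0.2, STEPS)          # original run
rerun = logistic_trajectory(0.2, STEPS)             # same initial condition
perturbed = logistic_trajectory(0.2 + 1e-6, STEPS)  # tiny change in the input

# Step-by-step distance between the original and the perturbed run.
gaps = [abs(a - b) for a, b in zip(baseline, perturbed)]

print("identical rerun matches:", baseline == rerun)
print(f"gap after 5 steps: {gaps[5]:.2e}")
print(f"largest gap within {STEPS} steps: {max(gaps):.2f}")
```

The rerun with identical input reproduces the baseline exactly, since this toy has no randomness. The perturbed run stays close for the first few steps and then drifts until it bears no resemblance to the original, which is the behavior a sensitivity analysis is meant to expose.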
According to Ho, the same problem exists in state test score proficiency. Small changes in what is arbitrarily termed “proficient” produce marked changes in measured improvement. So one state that improves rapidly in comparison to the others could be doing so simply because of its choice of cut-off value, and little else.
As an education researcher, I find that Ho’s findings reinforce my assertion that the real gains in education are made not through test score analysis (which Ho has shown to be highly suspect) but through upgrades in curriculum and instruction. Many would then ask how, if we disregard test scores, we can tell whether gains are being made in curriculum and instruction. The answer to that question is simple and cuts directly to the core of the beliefs of The Standards Company: direct measurement of curricular materials and teaching based on objective criteria. I would rather judge the teaching taking place in Mississippi or Kansas based on the levels of Bloom’s taxonomy and depth of knowledge appearing in their curricular materials than on student testing proficiency. After all, as a teacher I can control the curriculum I deliver to my students and the strategies I use to teach them.