Most drugs are first tested on animals, often on “animal models” of the human disease at which the drug is aimed. This 2008 Nature article reveals that in at least one case, the animal model is flawed in a way no one really understands:
In the case of ALS, close to a dozen different drugs have been reported to prolong lifespan in the SOD1 mouse, yet have subsequently failed to show benefit in ALS patients. In the most recent and spectacular of these failures, the antibiotic minocycline, which had seemed modestly effective in four separate ALS mouse studies since 2002, was found last year to have worsened symptoms in a clinical trial of more than 400 patients.
I think that “close to a dozen” means about 12 in a row, rather than 12 out of 500. The article is vague about this. A defender of the mouse model said this:
As for the failed clinical trial of minocycline, Friedlander suggests that the drug may have been given to patients at too high a dose — and a lower dose might well have been effective. “In my mind, that was a flawed study,” he says.
Not much of a defense.
That realization is spreading: some researchers are coming to believe that tests in mouse models of other neurodegenerative conditions such as Alzheimer’s and Huntington’s may have been performed with less than optimal rigor. The problem could in principle apply “to any mouse model study, for any disease”, says Karen Duff of Columbia University in New York, who developed a popular Alzheimer’s mouse model.
“Less than optimal rigor”? Oh no. Many scientists seem to believe that every problem is due to failure to follow some rules they read in a book somewhere. They have no actual experience testing this belief (which I’m sure is false; the world is a lot more complicated than their textbooks describe); they just feel good criticizing someone else’s work like that. In this case, the complaints include “small sample sizes, no randomization of treatment and control groups, and [no] blinded evaluations of outcomes.” Very conventional criticisms.
Here’s a possibility no one quoted in the article seems to realize: the studies were too rigorous, in the sense that the animals in the two groups (treatment and control) were too similar before treatment. These studies always try to reduce noise. A big source of noise, for example, is genetic variability. The less variability in your study, however, the less likely your finding is to generalize, that is, to hold true in other situations. The Heisenberg Uncertainty Principle of experimental design. Not in any textbook I’ve seen.
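To make that tradeoff concrete, here is a minimal simulation sketch. It is a toy model of my own, not anything from the Nature article: I simply assume the drug’s benefit depends on genetic background, peaking at one genotype and fading for others. Under that assumption, a genetically uniform strain gives a clean, large estimate that overstates the average benefit in a genetically varied population.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumption: the drug's benefit depends on genetic background g,
# peaking at g = 0 and fading for genotypes farther away.
def true_benefit(g):
    return 3.0 * np.exp(-g**2)

def run_study(genetic_sd, n_per_group=500):
    """Simulate one treatment-vs-control lifespan study on animals whose
    genotypes are drawn from Normal(0, genetic_sd). Returns the estimated
    treatment effect (difference in mean lifespan)."""
    g_treated = rng.normal(0.0, genetic_sd, n_per_group)
    control = 100 + 5.0 * rng.standard_normal(n_per_group)
    treated = 100 + true_benefit(g_treated) + 5.0 * rng.standard_normal(n_per_group)
    return treated.mean() - control.mean()

# Genetically uniform strain: a clean, large effect, but measured at only
# one point in genotype space.
print("inbred strain:    ", round(run_study(genetic_sd=0.05), 2))

# Genetically varied population (more like human patients): the average
# benefit is much smaller, because many genotypes barely respond.
print("varied population:", round(run_study(genetic_sd=1.5), 2))
```

On this toy setup, the uniform strain recovers its own strain-specific benefit almost exactly (about 3), while the genetically varied sample shows a much smaller average effect, even though nothing about the study’s “rigor” changed.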
In the 1920s and 30s, a professor in the UC Berkeley psychology department named Robert Tryon tried to breed rats for intelligence. His measure of intelligence was how fast they learned a maze. After several generations of selective breeding, he derived two strains of rats, Maze Bright and Maze Dull, which differed considerably in how fast they learned the maze. But the maze-learning differences between these two strains didn’t generalize to other learning tasks; whatever they had been bred for appeared to be highly specific to maze learning. The measure of intelligence had too little variation in it. It was too rigorous.
When an animal model fails, self-experimentation looks better. With self-experimentation you hope to generalize from one human to other humans, rather than from one genetically narrow group of mice to humans.
Thanks to Gary Wolf.