In a recent post I said that med school professors cared about process (doing things a “correct” way) rather than result (doing things in a way that produces the best possible outcomes). Feynman called this sort of thing “cargo-cult science”. The problem is that there is little reason to think the med-school profs’ “correct” way (evidence-based medicine) works better than the “wrong” way it replaced (reliance on clinical experience) and considerable reason to think it isn’t obvious which way is better.
After I wrote the previous post, I came across an example of the thinking I criticized. On bloggingheads.tv, during a conversation between Peter Lipson (a practicing doctor) and Isis The Scientist (a “physiologist at a major research university” who blogs at ScienceBlogs), Isis said this:
I had an experience a couple days ago with a clinician that was very valuable. He said to me, “In my experience this is the phenomenon that we see after this happens.” And I said, “Really? I never thought of that as a possibility but that totally fits in the scheme of my model.” On the one hand I’ve accepted his experience as evidence. On the other hand I’ve totally written it off as bullshit because there isn’t a p value attached to it.
Isis doesn’t understand that this “p value” she wants so much comes with a sensitivity filter attached. It is not neutral. To get it you do extensive calculations. The end result (the p value) is more sensitive to some treatment effects than others, in the sense that some treatment effects will generate smaller (better) p values than other treatment effects of the same strength, just as our ears are more sensitive to some frequencies than others.
Our ears are most sensitive around the frequency of voices. They do a good job of detecting what we want to detect. What neither Isis nor any other evidence-based-medicine proponent knows is whether the particular filter they endorse is sensitive to the treatment effects that actually exist. It’s entirely possible and even plausible that the filter that they believe in is insensitive to actual treatment effects. They may be listening at the wrong frequency, in other words. The useful information may be at a different frequency.
The usual statistics (mean, etc.) are most sensitive to treatment effects that change each person in the population by the same amount. They are much less sensitive to treatment effects that change only a small fraction of the population. In contrast, the “clinical judgment” that Isis and other evidence-based-medicine advocates deride is highly sensitive to treatments that change only a small fraction of the population — what some call anecdotal evidence. Evidence-based medicine is presented as science replacing nonsense but in fact it is one filter replacing another.
I suspect that actual treatment effects have a power-law distribution (a few helped a lot, a large fraction helped little or not at all) and that a filter resembling “clinical judgment” does a better job with such distributions. But that remains to be seen. My point here is just that it is an empirical question which filter works best. An empirical question that hasn’t been answered.
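To make the filter point concrete, here is a toy simulation (hypothetical numbers; Python with numpy and scipy assumed). Two treatments raise the group mean by the same amount, but the usual t-test detects the one that changes everyone a little far more reliably than the one that changes a few people a lot:

```python
# A hypothetical simulation of the "filter" argument. Two treatments have the
# same average effect (+3): one changes everyone a little, the other changes
# a few people a lot. Numbers are made up; numpy and scipy are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims = 100, 2000

def power(effect, alpha=0.05):
    """Fraction of simulated trials in which a t-test detects the treatment."""
    hits = 0
    for _ in range(sims):
        control = rng.normal(50, 10, n)
        treated = rng.normal(50, 10, n) + effect()
        if stats.ttest_ind(control, treated).pvalue < alpha:
            hits += 1
    return hits / sims

# Homogeneous effect: every subject improves by 3.
homogeneous = lambda: 3.0

# Concentrated effect: about 2% of subjects improve by 150, the rest not at
# all (the same average effect of 3, but power-law-like in its distribution).
concentrated = lambda: np.where(rng.random(n) < 0.02, 150.0, 0.0)

print("detection rate, homogeneous effect: ", power(homogeneous))
print("detection rate, concentrated effect:", power(concentrated))
# The t-test detects the homogeneous effect far more often, even though the
# two treatments shift the group mean by the same amount.
```

The test statistic is the same in both cases; only the shape of the effect differs. That is what I mean by a filter.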
I think this post is terrible. Firstly, you try to bring Feynman in on your side because Feynman was a genius and everyone cashes in on his name (like the “Feynman’s Rainbow” book advertised on your blog). But the connection with cargo-cult science is far from clear.
Secondly, the statement about means being more sensitive when the treatment changes each person by the same amount, as opposed to a fraction of the population, etc., is hilariously wrong. Means DON’T CARE who is changed.
Example:
Pre-treatment distribution: 1, 2, 3, 4
Post-treatment (A): 101, 2, 3, 4
Post-treatment (B): 26, 27, 28, 29
In one case I added 100 to just one subject; in the second I added 25 to all 4. So the mean changes by the same amount. D’oh.
If you are talking about medians, that’s different, but treatment-effect models seldom look at that parameter. Or are you talking about heterogeneous treatment effects?
It seems to me that you are trying to provide some sort of veneer of respectability to anecdotal evidence relative to those who do things the correct, i.e. scientific, way. It’s not about using different “filters”. I really doubt Feynman would have derided those who followed proper scientific methods, as you appear to suggest.
But you’re skewing the point by pre-choosing the total added effect (100) and then dividing it up different ways. Then it’s merely tautological that the overall mean changes by the same amount. The real issue is whether the treatment helps 10% of people by X vs. helping 100% of people by X. In the former situation the overall mean rises by only 0.1X rather than X, even though it’s a real effect for those 10% of people.
Kevin, Feynman was complaining about the same thing as me — a focus on process rather than results among people who claim to be scientists. To point out other people who had similar ideas is reviewing the literature.
I’m talking about how best to detect heterogeneous treatment effects, yes. I am saying that what you call the “scientific” method is likely to be better than the method it replaced at detecting certain treatment effects and worse at detecting others. Of course it is better to base medicine on evidence than on no evidence. But there is more than one way of evaluating evidence, and the current way has a certain disadvantage (greatest sensitivity to homogeneous treatment effects) not shared by the method it replaced (reliance on clinical experience).
Seth, off topic, but would be interesting to get your views on this:
https://www.mercurynews.com/bay-area-news/ci_15480908?source=rss
Do you have a way to generate graphs to illustrate the distributions you are talking about? This is an excellent example of statistical literacy, and I want to send it around to math people (parents and educators). We just need pictures.
When I worked for a pharmaceutical company, I used to conduct interviews and focus groups with doctors. I have some reservations about the quality of their clinical judgment.
Doctors are susceptible to the same sorts of cognitive biases that afflict everyone else. (Example: if your first three patients on Prozac all did very well, you’ll be a Prozac loyalist for years.) With doctors, the cognitive biases might be even worse than average, because doctors are typically overconfident in their own abilities.
In Kevin Denny’s example, post-treatment distribution A (with 100 added to 1 person) has a much larger variance than distribution B (with 25 added to each of 4 people). So any statistical test is going to be more sensitive to B: there will be a smaller p-value and the result will be more statistically significant. Just as Seth initially claimed.
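Here is a quick check of those numbers (a minimal sketch in Python, assuming scipy is available):

```python
# A quick check of Kevin's numbers with Welch's t-test (scipy assumed).
from scipy import stats

pre    = [1, 2, 3, 4]
post_a = [101, 2, 3, 4]    # 100 added to one subject
post_b = [26, 27, 28, 29]  # 25 added to every subject

# Both treatments raise the mean by exactly 25, but:
print(stats.ttest_ind(pre, post_a, equal_var=False))  # t ≈ -1.0,  p ≈ 0.38: not significant
print(stats.ttest_ind(pre, post_b, equal_var=False))  # t ≈ -27.4, p < 0.0001: overwhelmingly significant
```

Same mean change in both cases, wildly different p-values.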
Maria,
Thanks for your interest. The statistical point I am making here is pretty simple, yes, but the people who run clinical trials hire statisticians to analyze the data. To say those statisticians are not statistically literate is a little complicated. The point I make here has not appeared in the statistical literature (or biomedical literature) as far as I know.
Seth, I did not mean to imply that clinical statisticians are illiterate, but only that, say, a beginner statistics course would do very well to include your example as a way to promote literacy in working with data. It is simple, but interesting because it’s real. A pretty good combination for an educational topic.
I will ask a couple of my math ed colleagues to look at the topic and see what comes of it.
Also, I would like to thank you for a few very simple changes I made after reading your blog, which had unexpectedly large result/effort ratios.
Seth,
The key insight here is the following, and it answers a question I posed to you some time back:
“I suspect that actual treatment effects have a power-law distribution…”
Dennis, yes, that is the key insight. And, yes, it answers the question you asked me a while ago. The broader point is that it’s possible to find out what the distribution of treatment effects has been (in similar experiments) and then tune the test to be most sensitive to that distribution. The current assumption that everyone changes the same amount is probably the least plausible assumption that could be made.
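To sketch what such tuning might look like (purely hypothetical; the statistic and numbers are my own illustration, not anything standard in clinical trials), one could run a permutation test with a statistic matched to a sparse, power-law-like effect distribution, say an upper-quantile difference instead of a mean difference:

```python
# A hypothetical sketch of "tuning the filter": a permutation test whose
# statistic is chosen for sensitivity to sparse, power-law-like effects
# (here the difference in 90th-percentile outcomes rather than the
# difference in means). All names and numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(control, treated, statistic, n_perms=10_000):
    """One-sided p-value: how often a random relabeling beats the observed statistic."""
    pooled = np.concatenate([control, treated])
    observed = statistic(control, treated)
    count = 0
    for _ in range(n_perms):
        perm = rng.permutation(pooled)
        if statistic(perm[:len(control)], perm[len(control):]) >= observed:
            count += 1
    return count / n_perms

mean_diff = lambda c, t: t.mean() - c.mean()                          # the usual filter
tail_diff = lambda c, t: np.percentile(t, 90) - np.percentile(c, 90)  # a filter tuned to sparse effects

# Sparse effect: about 10% of treated subjects improve by 40, the rest not at all.
control = rng.normal(50, 10, 100)
treated = rng.normal(50, 10, 100) + np.where(rng.random(100) < 0.1, 40.0, 0.0)

print("p, mean-based filter:", permutation_pvalue(control, treated, mean_diff))
print("p, tail-based filter:", permutation_pvalue(control, treated, tail_diff))
# The tail-based statistic is typically the more sensitive of the two here.
```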