What Should Double-Blind Placebo-Controlled Trials Be Replaced With?

For a sick person, which is worse?

1. Getting better for the wrong reason.

2. Wasting a lot of money.

It sounds like a joke — #1 isn’t even harmful, whereas the cost of health care is a very serious problem. Yet the FDA and the legislators with FDA oversight have been given this choice and have treated #1 as the greater danger: they have chosen to protect us against #1 but not #2.

If you get better from a placebo effect, that’s the wrong reason. How dare you! The requirement that drugs be better than placebo controls prevents this from happening. The requirement might have been — but isn’t — that a new drug be better than pre-existing alternatives. Many new drugs aren’t, but they are always more expensive — not to mention riskier.

Proposed Book: How to Lie with Experimental Design

From ABC News:

Angelo Tremblay [a professor at Laval University] noticed something odd every time he worked up a grant application for his research program in a Quebec university. He had a craving for chocolate chip cookies.

Professor Tremblay wondered if this meant that thinking makes you fat — which is curious, because it implies that the rest of his job involved no thinking, or at least less of it. More likely, anxiety makes you crave pleasure-producing food (such as chocolate-chip cookies) to dull the pain; there is a term for this: emotional eating. Grant writing is anxiety-producing, of course: You worry about not getting the grant. Yet — to his credit — Tremblay did experiments to test his idea. And these experiments, he believes, supported his idea that thinking alone can cause obesity, which I’m pretty sure is wrong. It makes me want to write a book: How to Lie With Experimental Design (although I’m sure Tremblay wasn’t trying to deceive anyone). Its predecessor, How to Lie with Statistics, was a big success.

Thanks to Dave Lull.

More. From a documentary about Ranjit Chandra:

In the Nestle and Mead Johnson studies, Chandra concluded that those companies’ products helped reduce the risk of allergies, while the Ross formula, which was virtually the same, did nothing.

Masor says he asked, “‘Dr. Chandra, how can you explain that we didn’t see anything with our study and you did with the Nestle study?’ And he said, ‘Well, the study really wasn’t designed right.’

“I said: ‘Dr. Chandra we designed the study with you. You designed it. That’s why we went to you, so you would be able to do it correctly.’ And he said, ’Well, you didn’t really pay me enough money to do it correctly.’”

An extreme case.

How Could Epidemiologists Write Better Papers?

Inspired by Andrew Gelman’s posting of his discussion of a paper, here is a review I recently wrote of an omega-3 epidemiology paper. The shortcomings — or opportunities for improvement — I point out are so common that I hope this will be of interest to others besides the authors and the editor.

This is an important paper that should be published when the analysis is improved. The data set analyzed was gathered at great cost. The question of the relationship between omega-3 and *** [*** = a health measure] is very important and everyone would like to know what this data set has to say about it.
That said, the data analysis has many problems [= opportunities for improvement]. Most of them, perhaps all of them, are very common in epidemiology papers, I realize. Here are the big problems:

1. No figures. The authors should illustrate their main points with figures. They should use lowess — not straight lines — to summarize scatterplots, because the relationships are unlikely to be linear. (A short sketch illustrating this and a couple of the other points appears after this list.)

2. Failure to transform their measures. Every one of their continuous variables should be transformed to be roughly normal or at least symmetrical before further analysis is done. It’s very likely that this will get rid of the outliers that led them to treat a continuous variable (omega-3 consumption) as a categorical one.

3. What was the distribution of *** scores? How did this distribution vary across subgroups? If the distribution isn’t normal — and it probably is far from normal — then a transformation might greatly improve the sensitivity of the analysis. Since the distribution is not shown the reader has no idea how much sensitivity was lost by failure to transform.

4. Pointless analyses. It is never explained why they analyze EPA and DHA separately; that is, no data are given to suggest that these two forms of omega-3 have different effects. Rather than analyzing EPA and DHA separately, they should simply analyze the sum. Nor is there any reason to think that fish consumption per se — apart from its omega-3 content — does anything. (At least I don’t know of any reason, and this paper doesn’t give one.) Doing weak tests (fish, EPA alone, DHA alone) dilutes the power of the strongest test (EPA + DHA).

5. Failure to test the claim of interaction. I don’t mind separate analyses of large subgroups but if you say an effect is present in women but not men — which naive readers will take to mean that men and women respond differently — you should at least do an interaction test and tell readers the result. (You should also provide a graph showing the difference.) Likewise if you are going to claim Caucasians and African-Americans are different, you should do an interaction test. Perhaps the results are different for men and women because *** — and if so there may not be an interaction. Finding the relationship in women but not men has several possible explanations, only one of which is a difference in the function relating omega-3 intake to ***. For example, men might have more noise in their omega-3 measurement, or a smaller range of omega-3 intake, or a smaller range of ***, and so on. The abstract states “the associations were more pronounced in Caucasian women.” The same point: When the authors state that something is “more” than something else, they should provide statistical evidence for that — i.e., that it is reliably more.

6. It is unclear if the p values are one-tailed or two-tailed. They should be one-tailed.

7. It is unclear why the data are broken down by race. Why do the authors think that race is likely to affect the results? Nowhere is this explained. Why not stratify the results by age or education or a dozen other variables?

8. The authors have collected a rich data set — measuring many variables, not just sex and race — but they inexplicably do a very simple analysis. If I were analyzing these data I would ask 2 questions: 1. Is there a relation between EPA+DHA and ***? This is the question of most interest, of course, and should be answered in a simple way. This is a confirmatory analysis. 2. Getting some measure of that relationship, such as a slope, I would ask how that slope or whatever is affected by the many other variables they measured, such as age and so on. This is an exploratory analysis. There are no indications in this paper that the authors understand the value of exploratory analyses (which is to generate new ideas). Yet this is a good data set for such analyses. To fail to do such analyses and report the results, positive or negative, is to throw away a lot of the value in this data set.

9. The single biggest flaw (or to be more positive, opportunity for improvement) is losing most of the info in the *** measurements by dichotomizing them . . . .
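
To make a few of these points concrete, here is a minimal sketch in Python on made-up data; the variable names (omega3, outcome, sex) are placeholders, and nothing below comes from the authors’ actual data set. It illustrates a lowess summary of a scatterplot (point 1), a log transform of a skewed exposure (point 2), and a single regression with an interaction term in place of separate subgroup analyses (point 5).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Made-up data with the kind of structure the review worries about:
# a right-skewed exposure and an effect that differs by sex.
rng = np.random.default_rng(0)
n = 500
sex = rng.integers(0, 2, n)                  # 0 = men, 1 = women
omega3 = rng.lognormal(0.0, 0.8, n)          # skewed, with apparent "outliers"
outcome = 2 + 0.5 * np.log(omega3) + 0.3 * sex * np.log(omega3) + rng.normal(0, 1, n)
df = pd.DataFrame({"omega3": omega3, "outcome": outcome, "sex": sex})

# Point 1: summarize the scatterplot with lowess instead of a straight line.
smooth = sm.nonparametric.lowess(df["outcome"], df["omega3"], frac=0.4)
plt.scatter(df["omega3"], df["outcome"], s=8, alpha=0.4)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")
plt.xlabel("omega-3 intake")
plt.ylabel("outcome")
plt.savefig("scatter_lowess.png")

# Point 2: transform the skewed exposure so it is roughly symmetrical;
# this usually removes the "outliers" that tempt people to dichotomize.
df["log_omega3"] = np.log(df["omega3"])

# Point 5: one model with an interaction term instead of separate analyses by
# sex; the p value on the log_omega3:C(sex)[T.1] coefficient is the actual
# test of whether men and women respond differently.
fit = smf.ols("outcome ~ log_omega3 * C(sex)", data=df).fit()
print(fit.summary())
```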

It would also be nice if epidemiologists would stop including those “limitations” comments at the end of most papers. They rarely say something that isn’t obvious.

Red State Blue State Rich State Poor State

To explain why Andrew Gelman et al.’s Red State Blue State Rich State Poor State is such an important book I have to tell two stories.

A few years ago a student did a senior thesis with me that consisted of measuring PMS symptoms day by day in several women. After she collected her data she went to the Psychology Department’s statistics consultant (a psychology grad student) to get help with the analysis. The most important thing to do with your data is graph it, I told the student. The statistics consultant didn’t know how to do this! There was little demand for it. Almost all the data analyses done in the Psychology Department were standard ANOVAs and t tests. If you look at statistics textbooks aimed at psychologists, you’ll see why: They say little or nothing about the importance of graphing your data. Gelman et al.’s book is full of informative graphs and will encourage any reader to plot their data. There are few examples of this sort of thing. That’s the obvious contribution, and because graphing data is so important and so neglected, it’s a big one.

The other contribution is even more important, but more subtle. Recently I was chatting with a statistics professor whose applied area is finance. What do you think of behavioral economics? she asked. I said I didn’t like it. “It’s too obvious.” (More precisely, it’s too confirmatory.) For example, the conclusion that people are loss-averse — fine, I’m sure they are, but it’s too clear to be a great discovery. She mentioned prospect theory. Tversky and Kahneman’s work has had a big effect on economists — which certainly indicates it wasn’t obvious. Yes, it has been very influential, I said. I’m not saying their conclusions were completely obvious — just too obvious. Tversky and Kahneman were/are very smart men who had certain ideas about how the world worked. They did experiments that showed they were right. There’s value in such stuff, of course, but I prefer research that shows what I or the researcher never thought of.

Red State Blue State is an example. Andrew and his colleagues didn’t begin the research behind the book intending to show what turned out to be the main point (that the red state/blue state difference is due to an interaction — the effect of wealth on tendency to vote Republican varies from state to state). I suspect they got the idea simply by making good graphs, which is an important way to get new ideas. (Neglect of graphics and neglect of idea generation go together.) Red State Blue State could be used in any class on scientific method to illustrate the incredibly important point that you can get new ideas from your data. There aren’t many possible examples.
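
Here is a minimal sketch, on invented numbers, of the kind of graph that makes such an interaction visible: fit the income-to-Republican-voting relationship separately within each “state” and plot the fitted lines. None of this is Gelman et al.’s data; the point is only that slopes that differ from state to state jump out of a picture long before they show up in a single pooled regression coefficient.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
states = ["poor red state", "middle state", "rich blue state"]
true_slopes = [0.9, 0.5, 0.1]   # invented: income matters much more in some states

fig, ax = plt.subplots()
xs = np.linspace(-2, 2, 50)
for state, slope in zip(states, true_slopes):
    income = rng.normal(0, 1, 200)                     # standardized income
    vote_rep = slope * income + rng.normal(0, 1, 200)  # noisy Republican-voting tendency
    b, a = np.polyfit(income, vote_rep, 1)             # within-state fit: slope, intercept
    ax.plot(xs, a + b * xs, label=f"{state} (slope {b:.2f})")

ax.set_xlabel("individual income (standardized)")
ax.set_ylabel("tendency to vote Republican (arbitrary units)")
ax.legend()
fig.savefig("state_by_income_interaction.png")
```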

If I were teaching scientific method, I’d assign a few chapters of Red State Blue State and then have a class discussion about how to explain the results. Not just the state-by-wealth interaction but also the fact (revealed by a scatterplot) that the United States is far more religious than other rich countries — an outlier. Then I’d say: The graphs in the book made you think new thoughts. Your own graphs can do that.

Science in Action: Why Did I Sleep So Well? (part 10)

Long ago, talking about scientific discovery, Pasteur said “chance favors the prepared mind.” In my case, I now realize, this generalization can be improved on. The underlying pattern can be described more precisely.

I’ve made several discoveries because two things came together, as Pasteur said, with one element a kind of chance and the other a kind of knowledge. The two elements were:

  1. I did something unusual.
  2. I knew something unusual.

Here are the discoveries and how they fit this pattern:

1. Breakfast. Discovery: Eating breakfast caused me to wake up too early more often. Did something unusual: I copied one of my students, who told me about his experiences during office hours. This eventually led me to vary my breakfast. Knew something unusual: I had detailed records of my sleep. The combination made it clear that breakfast was affecting my sleep.

2. Morning faces. Discovery: Seeing faces in the morning improves my mood the next day. Did something unusual: I watched a tape of Jay Leno soon after getting up. Knew something unusual: From teaching intro psych, I knew there was a strong connection between depression and bad sleep.

3. Standing and sleep. Discovery: Standing a lot reduces early awakening. Did something unusual: I arranged my life so that I stood a lot more than usual. Knew something unusual: I had detailed sleep records. They made the reduction in early awakening easy to see.

4. Sleep and health. Discovery: At the same time my sleep greatly improved, I stopped getting colds. Did something unusual: To improve my sleep I was standing a lot and getting a lot of morning light from a bank of lights on my treadmill. Knew something unusual: I had records of my colds going back ten years.

5. The Shangri-La Diet. Discovery: Drinking sugar water causes weight loss. Did something unusual: I went to Paris. Knew something unusual: I had developed a new theory of weight control.

6. Flaxseed oil and the brain. Discovery: Flaxseed oil improves my mental function. Did something unusual: One evening I took 6-8 flaxseed oil capsules. Knew something unusual: I had been putting on my shoes standing up for more than a year and knew how difficult it usually was. The morning after I took the flaxseed oil capsules it was a lot easier.

7. Standing on one foot and sleep. Discovery: Standing on one foot improves my sleep. Did something unusual: In order to stretch my quadriceps, I stood on one foot several times one day. Knew something unusual: I knew that if I stood a lot my sleep improved (Discovery 3).

The unusual actions ranged from things as common as foreign travel (Paris) and stretching to the extremely rare (watching a tape of Jay Leno soon after waking up). The unusual knowledge ranged from quirky and casual (knowing how hard it is to put on shoes standing up) to sets of numbers (sleep records) to generalizations based on numbers (what scientific papers are about) to the sort of stuff taught in science classes (a theory of weight control) to the sort of knowledge derived from teaching science classes (connecting depression and bad sleep). To call this stuff unusual knowledge is actually too broad because in every case it’s knowledge related to causality.

Only after Discovery 7 (more precisely, this morning) did I notice this pattern. Read the discussion section of this paper (which is about Discoveries 1-5) to see how badly I missed it earlier.

More on Discovery 6. Discovery 7.

Suppose You Write the Times to Fix an Error (part 2)

The Roberts-Schwartz correspondence continued. I replied to Schwartz:

“Dining establishments”? [His previous email stated: “Four restaurants simply cannot represent the variety of dining establishments in New York City”] I thought the survey was about sushi restaurants. Places where raw fish is available.

Quite apart from that, I am sorry to see such a fundamental error perpetuated in a science section. If you don’t believe me that the teenagers’ survey was far better than you said, you might consult a friend of mine, Andrew Gelman, a professor of statistics at Columbia.

John Tukey — the most influential statistician of the last half of the 20th century — really did say that a well-chosen sample of 3 was worthwhile when it came to learning about sexual behavior. Which varies even more widely than sushi restaurants. A sample of 4 is better than a sample of 3.

Schwartz replied:

The survey included 4 restaurants and 10 stores.

The girls would not disclose the names of any of the restaurants, and only gave me the name of one store whose samples were not mislabeled. Their restaurants and stores might have been chosen with exquisite care and scientific validity, but without proof of that I could not say it in the article.

I wrote:

I realize the NY Times has an “answer every letter” policy and I am a little sorry to subject you to it. Except that this was a huge goof and you caused your subjects damage by vastly undervaluing their work. Yes, I knew the survey included 4 restaurants and 10 stores. That was clear.

As a reader I had no need to know the names of the places; I realized the girls were trying to reach broad conclusions. They were right not to give you the names because to do so might have obscured the larger point. It was on your side that the big failing occurred, as far as I can tell. Did you ask the girls about their sampling method? That was crucial info. Apparently The Times doesn’t correct errors of omission but that was a major error in your article: That info (how they sampled) wasn’t included.

He replied:

I could have been more clear on the subject of sample size, but I did not commit an error. Neither do my editors. That is why they asked me to write a letter to you instead of writing up a correction.

I don’t feel I have been “subjected to” anything, or that this is some kind of punishment. This is an interesting collision between the precise standards of someone with deep grounding in social science and statistical proof and someone who tries to write intelligible stories about science for a daily newspaper and a general interest audience. But I am not sorry that you wrote to me, even a little sorry.

I wrote:

“I did not commit an error.” Huh? What am I missing? Your article had two big errors:

1. An error of commission. You stated the study should not be taken seriously because the sample size was too small. For most purposes, especially those of NY Times readers, the sample size was large enough.

2. An error of omission. You failed to describe the sampling protocol — how those 10 stores and 4 restaurants were chosen. This was crucial info for knowing to what population the results should be generalized.

If you could explain why these aren’t errors, that would be a learning experience.

Did you ask the girls how they sampled?

His full reply:

We’re not getting anywhere here.

Not so. After complaining he didn’t have “proof” that the teenagers used a good sampling method, he won’t say if he asked them about their sampling method. That’s revealing.

Something similar happened with a surgeon I was referred to, Dr. Eileen Consorti, in Berkeley. I have a tiny hernia that I cannot detect but one day my primary-care doctor did. He referred me to Dr. Consorti, a general surgeon. She said I should have surgery for it. Why? I asked. Because it could get worse, she said. Eventually I asked: Why do you think it’s better to have surgery than not? Surgery is dangerous. (Not to mention expensive and time-consuming.) She said there were clinical trials that showed this. Just use google, you’ll find them, she said. I tried to find them. I looked and looked but failed to find any relevant evidence. My mom, who does medical searching for a living, was unable to find any completed clinical trials. One was in progress (which implied the answer to my question wasn’t known). I spoke to Dr. Consorti again. I can’t find any studies, I said, nor can my mom. Okay, we’ll find some and copy them for you, she said, you can come by the office and pick them up. She sounded completely sure the studies existed. I waited. Nothing from Dr. Consorti’s office. After a few weeks, I phoned her office and left a message. No reply. I waited a month, phoned again, and left another message. No reply.

More. In spite of Dr. Consorti’s statement in the comments (see below) that “I will call you once I clear my desk and do my own literature search,” one year later (August 2009) I haven’t heard from her.

Suppose You Write the Times to Fix an Error (part 1)

Recently the New York Times published a fascinating article by John Schwartz in the science section about how two teenagers discovered that a lot of raw fish sold in New York is mislabeled. Unfortunately, the article contained two big mistakes: 1. The teenagers’ results were dismissed as unconvincing because the sample size (10 stores and 4 sushi restaurants) was, according to Schwartz, too small. For many purposes the sample was large enough, if their sampling method was good. 2. The sampling method wasn’t described. Without knowing how the stores and restaurants were chosen, it’s impossible to know to what population the results apply. This was like reviewing a car and not saying the price.

In an email to the Times I pointed out the first mistake:

Your article titled “Fish Tale Has DNA Hook” by John Schwartz, which appeared in your August 22, 2008 issue, has two serious errors:

1. The article states: “The sample size is too small to serve as an indictment of all New York fishmongers and restaurateurs.” To whom the results apply — whom they “indict” — depends on the sampling method used — how the teenagers decided what businesses to check. Sample size has almost nothing to do with it. This was the statistician John Tukey’s complaint about the Kinsey Report. The samples were large but the sampling method was terrible — so it didn’t matter that the samples were large.

2. The article states: “the results are unlikely to be a mere statistical fluke.” It’s unclear what this means. In particular, I have no idea what it would mean that the results are “a mere statistical fluke.” The error rate of the lab where the teenagers sent the fish to be identified is probably very low.

In retrospect the second error is “serious” only if incomprehensibility is serious. Maybe not. I should have pointed out the failure to describe the sampling protocol but didn’t.

I got the following reply from Schwartz:

Thank you for your note about my article, “Fish Tale Has DNA Hook,” which appeared in the newspaper on Friday. You state that the story misstated the importance of sampling size as “an indictment of all New York fishmongers and restaurateurs.” Although you are certainly correct in stating that poor methodology can undercut work performed using even the largest samples, it is also ill advised to try to establish broad conclusions from a very small sample. The fact that mislabeling occurred one in four pieces of seafood from 14 restaurants and shops in no way allows us to conclude that 25 percent of fish sold in New York or in the United States is mislabeled. And that is all I was trying to say with the reference to sample size was that while the girls’ experiment shows that some mislabeling has occurred, their work cannot say how much of it goes on or whether any given restaurant or shop is mislabeling its products. Similarly, when I wrote that it is unlikely the findings are a “statistical fluke,” I merely meant that while it is possible that Kate and Louisa found the only 8 restaurants and shops in New York City that mislabel their products, that is not likely, and so the possibility that the practice is widespread should not be discounted. And, of course, I hope you can forgive the pun.

Thanks again for taking the time read the article and respond to it, and I hope that you will find more to like in other stories that I write.

Uh-oh. The email was as mistaken as the article, although it did clear up what “statistical fluke” meant. I wrote again:

Thanks for your reply. I’m sorry to say that you still have things more or less completely wrong.

“Their work cannot say how much of it goes on or whether any given restaurant or shop is mislabeling its products.” Wrong. [Except for the obvious point that the survey does not supply info about particular places.] I don’t know what sampling protocol they used — how they chose the restaurants and fish sellers. (This is another big problem with your article, that you didn’t state how they sampled.) Maybe they used a really good sampling protocol, one that gave each restaurant and fish seller an equal chance of being in the sample. If so, then their work can indeed “say how much [mislabeling] goes on.” They can give an estimate and put confidence intervals around that estimate. Just like the Gallup poll does.

Somewhere you got the idea that big samples are a lot better than small ones. Sometimes you do need a big sample — if you want to predict the outcome of a close election, for example. But for many things you don’t need a big sample to answer the big questions. And this is one of those cases. There is no need to know with any precision how much mislabeling goes on. If it’s above 50%, it’s a major scandal, if it’s 10-50% it’s a minor scandal, if it’s less than 10%, it’s not a scandal at all. And the study you described in your article probably puts the estimate firmly in the minor scandal category. In contrast to your “it’s cute but doesn’t really tell us anything” conclusion quite the opposite is probably true (if their sampling procedure was good): It probably tells us most of what we want to know. You’re making the same mistake Alfred Kinsey made: He thought a big sample was wonderful. As John Tukey told him, he was completely wrong. Tukey said he’d rather have a sample of 3, well-chosen.

Thanks for explaining what you meant by “statistical fluke.” You may not realize you are breaking new ground here. Scientists wonder all the time if their results are “a statistical fluke.” What they mean by this is that they’ve done an experiment and have two groups, A (treated) and B (untreated) and wonder if the measured difference between them — there is always some difference — could be due to chance, that is, is a statistical fluke. In your example of the mislabeled fish there are not two groups — this is why your usage is mysterious. I have never seen the phrase used the way you used it. And I think that the readers of the Times already realized, without your saying so, that it is exceptionally unlikely that these were the only fish sellers in New York that mislabeled fish.

Schwartz replied:

I understand your points, and certainly see the difference between a small-but-helpful sample and a large-but-useless sample. but four restaurants simply cannot represent the variety of dining establishments in New York City. Four restaurants, ten markets.

I also realize that you must think I am thickheaded to keep at this, but I will certainly keep in mind your points in the future and will try not make facile references to small and large samples when the principles are, as you state, more complicated than that.

To be continued. My original post about this article.

The Emperor’s New Clothes: Meta-Analysis

In an editorial about the effect of vitamin-mineral supplements in the prestigious American Journal of Clinical Nutrition, the author, Donald McCormick, a professor of nutrition at Emory University, writes:

This study is a meta-analysis of randomized controlled trials that were previously reported. Of 2311 trials identified, only 16 met the inclusion criteria.

That’s throwing away a lot of data! Maybe, just maybe, something could be learned from the other 2295 randomized controlled trials?

Evidence snobs.

Citizen Science: What’s Your Sushi?

Self-experimentation is an example of the more general idea that non-experts can do valuable research. Another example is that two New York teenagers have shown that fish sold in New York City is often mislabeled. They gathered samples from 4 sushi restaurants and 10 grocery stores and sent them to a lab to be identified using a methodology and database called Barcode of Life. They found that “one-fourth of the fish samples with identifiable DNA were mislabeled . . . [and concluded] that 2 of the 4 restaurants and 6 of the 10 grocery stores had sold mislabeled fish.”

The article, by John Schwartz, appeared in the Science section, which makes the following sentence highly unfortunate:

The sample size is too small to serve as an indictment of all New York fishmongers and restaurateurs, but the results are unlikely to be a mere statistical fluke.

This is a Samantha-Power-sized blunder. It could hardly be more wrong. How much you can generalize from a sample to a population depends on how the samples were chosen. Sample size has very little to do with it. (John Tukey had the same complaint about the Kinsey Report: Stop boasting about your sample size, he said to Kinsey. Your sampling methods were terrible.) To know to what population we can reasonably generalize these results we’d need to know how the two teenagers decided what grocery stores and restaurants to sample from. (Which the article does not say.) If the 14 fish sellers were randomly sampled from the entire New York City population of grocery stores and restaurants, it would be perfectly reasonable to draw broad conclusions.
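
To see how much even 14 sellers can tell you when they are chosen well, here is a minimal sketch that assumes, purely for illustration, that the 14 sellers were a random sample of New York sushi sellers. The counts (8 of 14 sellers caught selling mislabeled fish) come from the article; the random-sampling assumption is exactly what the article never addressed.

```python
from statsmodels.stats.proportion import proportion_confint

mislabeling_sellers = 8    # 2 of 4 restaurants plus 6 of 10 grocery stores
sellers_sampled = 14

low, high = proportion_confint(mislabeling_sellers, sellers_sampled,
                               alpha=0.05, method="wilson")
print(f"Estimated share of sellers mislabeling: {mislabeling_sellers / sellers_sampled:.0%}")
print(f"95% Wilson interval: {low:.0%} to {high:.0%}")
# The interval is wide (very roughly 30% to 80%), but it comfortably rules out
# "mislabeling is rare," which is the question most readers care about.
```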

I have no idea what it could mean that the results are “a mere statistical fluke”.

The effect of these errors is that Mr. Schwartz places too low a value on this research. It’s impressive not only for its basic conclusion that there’s lots of mislabeling but also for showing what non-experts can do.

The end of the article did see the big picture:

In a way, Dr. Ausubel said, their experiment is a return to an earlier era of scientific inquiry. “Three hundred years ago, science was less professionalized,” he said, and contributions were made by interested amateurs. “Perhaps the wheel is turning again where more people can participate.”

More about Unreported Side Effects of Powerful Drugs

A few days ago I blogged about how Tim Lundeen, via careful and repeated measurement — let’s call it self-experimentation — uncovered a serious and previously-unreported side effect of a drug he was taking. Tim’s example illustrates an important use of self-experimentation: discovering unreported side effects, which I believe are common.

By coincidence today I came across a talk about the very subject of unmentioned side effects: Alison Bass speaking about her new book, Side Effects: A Prosecutor, a Whistleblower, and a Bestselling Antidepressant on Trial. Near the end, Bass said,

It’s not just the antidepressants, it’s not just the antipsychotics. This is happening with a lot of other drugs. With Vioxx, with Vytorin, an anti-cholesterol drug, with Propries [?] and Marimet [?], anti-anemia drugs. Where again and again the drug companies know that there are more severe side effects and they’re not letting the public know about that. It just keeps happening, unfortunately.

Just as it would be foolish to think the problem is limited to mental-health drugs, it would be foolish to think the problem is limited to side effects, that drug company researchers do everything right except fail to report side effects. Tim’s example shows how hard it is to learn about unreported side effects — so it is only realistic to think that there are other big problems with drug company research we don’t know about. Bass mentioned one I didn’t know about. A company did a clinical trial of Paxil. The goal was to see if the drug helped with Measures of Depression A and B. Turns out it didn’t: no effect. So the company changed the measures! They shifted to reporting different measures that the drug did seem to improve. Creating the hypothesis to be tested after the data supposedly supporting that hypothesis had already been collected. Without making this clear. (Which I presciently mentioned here, in response to an interesting comment by Andrew Gelman.) And if you think that drug companies do research like this — in ways that seriously damage people’s lives — but everyone else, such as academia, is really good, that is as realistic as thinking the problem with drug company research is restricted to side effects. Self-experimentation has all sorts of limitations, yes, but (a) you know what they are and (b) it is cheap enough so that you can gather more data to deal with the problems. Drug company research and lots of other research is too expensive to fail — or even be honest about shortcomings.
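
A small simulation, entirely invented rather than based on any real trial, shows why switching outcome measures after the data are in is such a problem: give a drug that does nothing, measure enough outcomes, and “something significant” turns up far more often than 5% of the time.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
n_per_group, n_outcomes, n_trials = 100, 8, 2000

false_wins = 0
for _ in range(n_trials):
    # Drug and placebo groups drawn from the same distribution: no true effect.
    drug = rng.normal(0, 1, (n_per_group, n_outcomes))
    placebo = rng.normal(0, 1, (n_per_group, n_outcomes))
    pvals = ttest_ind(drug, placebo, axis=0).pvalue
    if (pvals < 0.05).any():   # report whichever outcome happened to "work"
        false_wins += 1

print(f"Chance of a 'significant' result on at least one of {n_outcomes} outcomes: "
      f"{false_wins / n_trials:.0%}")   # roughly 1 - 0.95**8, about 34%
```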

This is an aspect of scientific method that scientists rarely discuss: the effect of cost on honesty. Is there an economic term (a Veblen good, perhaps?) for things whose quality goes down as their cost goes up?