Andrew Gelman is a professor of statistics at Columbia University. Years ago we co-taught a seminar about left-handedness. His blog. This interview took place via instant messaging in February, 2007 and has been edited slightly.
SR I want to ask your opinion of web trials. People go to a website where they choose or are randomly assigned a treatment. Then they come back and report the results.
AG Then the records of their choices and outcomes are made publicly available.
SR Yes. And there would probably be some summary of the results prepared by experts. It wouldn’t be just raw data.
COMPARING TO CURRENT STATE OF THE ART IN MEDICAL RESEARCH
AG We could compare to the current state of the art in medical research, which I think is to have some moderately large randomized clinical trials, each of which is published in a journal, followed by a meta-analysis of these trials. A difficulty with the current state-of-the-art is that sample sizes in clinical trials seem to be simultaneously too small and too large. Too small in that results tend to be just barely statistically significant (and often not significant for subgroups), so that you can’t really put your faith in one study, hence the need for meta-analysis. Too large in that each study is unwieldy, takes a huge amount of effort and doesn’t allow for much learning and experimentation during the study.
SR A famous epidemiologist [Richard Doll] once said that if the effect is strong, you don’t need a big study.
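[Editorial illustration, not part of the interview.] The point about single trials being "just barely" significant while a meta-analysis is convincing can be sketched with inverse-variance (fixed-effect) pooling. The four effect estimates and standard errors below are made-up numbers chosen only for illustration, and fixed-effect pooling is one common choice among several, not necessarily the method AG has in mind.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of independent trial estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical trials: each z-score sits near the conventional cutoff of 2.
estimates = [0.20, 0.18, 0.25, 0.15]
std_errors = [0.10, 0.10, 0.12, 0.09]

z_scores = [e / se for e, se in zip(estimates, std_errors)]
pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
pooled_z = pooled / pooled_se

print("per-trial z:", [round(z, 2) for z in z_scores])
print("pooled estimate:", round(pooled, 3), "pooled z:", round(pooled_z, 2))
```

Under these assumed numbers, no single trial is far from the threshold, but the pooled standard error is roughly half that of any one trial, so the combined evidence is much stronger.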
AG In some way the high cost is a good barrier in that people have to think seriously and justify what they want to do. On the other hand, within any particular research plan, it would seem to limit the possibility for innovation.
Speaking generally, a challenge is to integrate clinical judgment (including ideas of experimentation and trying different things with different patients) with scientific goals such as replicability.
Also, there are well-known cognitive illusions in clinical judgment, which is what motivates the evidence-based-medicine movement (for randomized trials, public records of data, etc.) in the first place.
SR How do web trials fit into the picture you have drawn?
AG Ideally, web trials are intermediate between controlled randomized trials on one hand, and full recording of observational data on the other. If people really volunteer to be randomized and then follow the protocol, then this is a clean randomized expt (albeit not blinded, an issue I’d like to raise with you). In practice there will be lots of selection, dropout, measurement error, etc., which moves it toward an observational study. The dispersed nature of the data collection is similar to (in fact, more dispersed than) the idea of individual clinicians recording their experiences and outcomes into a centralized database. That is, the data collection is dispersed, the database is centralized.
SR A web trial would have more regularity — less variation — across subjects than observations collected from individual doctors. Because everyone would get the same instructions. Whereas different doctors are obviously going to give different instructions (for the same nominal treatment).
AG Yes. That’s why I said the web trial is in between.
DIFFICULTIES WITH BLINDING
SR In the area of blinding I think a web trial would be better than the conventional double-blind clinical trial. If the goal is to guide practice. In practice patients are not blinded. Blinding is a tool to equate expectations. Better to equate expectations by comparing different treatments both believed to be effective.
AG One of the difficulties with your self-experimentation is that there’s no blinding at all. Similarly with these trials. Some of it is the nature of your treatments, but perhaps with some effort you could come up with blinded versions.
SR In my self-experimentation the expectations are equal in the different conditions, in many cases.
AG For example, consider the recent self-experiment that you describe on your blog, where you try different oils and measure your balance. I’d believe these results a lot more if you blinded the treatments.
SR Sure, blinding would help in that case, I agree. I plan to do something like that. But blinding is not necessary to equate expectations. For example, I tried many ways of losing weight. In every case I expected it to work. Some ways worked much better than others. It is this comparison of the effects of different treatments that is interesting. In general expectations cannot be very powerful or there would be no problems left to solve. Expectations are powerful in a few areas and seem to have no effect in many areas. I don’t mean we should ignore them; but to emphasize them as a big deal is not what the evidence suggests. In any case in web trials the participants would only be randomized to (or choose) treatments they thought might work.
AG There’s some work by Rubin and other statisticians on “broken randomized trials” which can more generally be thought of as experiments that have partial randomization.
SR I think of web trials as giving “entrants” (or subjects) a choice: to be or not to be randomized. Then when it’s all over you compare the two groups.
AG That makes sense. You’ll still have some problems: 1. People not following protocol. 2. Non-blindness of treatments. 3. Other problems, I’m sure, which I can’t think of offhand.
SR Well, these are equal for all conditions so they shouldn’t distort anything.
AG In a controlled trial you can deal with some of these things: 1. In a controlled trial you can have more interactions with the experimental subjects, thus maybe more likely they’ll follow protocol. 2. In a controlled trial you can (sometimes) ensure blindness. In general, I don’t think you can get away with assuming that biases cancel out.
ANALYZING DATA FROM WEB TRIALS
AG Your web trials should give us a big juicy source of data that can be thrown at a stat Ph.D. student as a thesis project, perhaps! My intuition as an amateur sociologist of applied statistics is that an exemplary applied analysis is a good way to kick-start the study of a statistical problem.
SR What’s an example of such a kick-start? That’s an interesting point.
AG I’m thinking of the hierarchical models that were fit by Lindley, Novick, Rubin, and others in the late 1960s thru early 1980s to educational data. These provided examples for people to follow–templates–as well as demonstrations that these methods really worked. There were various interesting discussions of these models in the stat literature, in particular I’m thinking of a paper by Rubin on law school validity studies in J. Amer. Stat. Assoc. from 1980 that had several discussants.
SR Yes, it is true that the data from web trials would be complex and interesting in new ways and accessible to everyone.
AG Yes, having available data is another plus–that’s really a new feature which should help. Now back to the warnings. A very well known example is the Nurses Health Study, an observational study that found that taking post-menopausal drugs was associated with lower heart-attack risks (and lower death rates). But when a big randomized expt was done, no association was found. Actually, taking the drugs slightly increased cancer risk, I believe. See here.
I talked with various people about this, and there are different potential explanations for the discrepancies. One story is that the women who took the drugs were otherwise healthier, more health conscious, etc.–even after controlling for whatever pre-treatment variables they controlled for. Another story is that the populations of the 2 studies were different (in particular, in their average ages), and perhaps the drugs are beneficial for some ages but not others. (Incidentally, the drugs were not originally intended to reduce heart-attack risk. This was an unexpected effect (or non-effect), I believe.)
Anyway, the people I trust on these matters (notably John Carlin) believe that the difference is because of “selection”, i.e., the drugs don’t really reduce heart attack risk. But the observational study led people to recommend the drugs. So this is a big example where the obs study was misleading.
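[Editorial illustration, not part of the interview.] The "selection" story can be sketched with a small simulation. Everything here is a toy assumption: the unobserved `health_conscious` trait, the outcome scale (higher means healthier), and the true treatment effect, which is set to zero. It is not a model of the Nurses Health Study itself.

```python
import random
import statistics

def estimated_effect(n, randomized, true_effect=0.0, seed=1):
    """Difference in mean outcome between people who took the drug and those who did not."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        health_conscious = rng.gauss(0, 1)  # unobserved trait (assumed)
        if randomized:
            takes_drug = rng.random() < 0.5  # coin flip, unrelated to the trait
        else:
            # Self-selection: health-conscious people are more likely to take the drug.
            takes_drug = health_conscious + rng.gauss(0, 1) > 0
        outcome = true_effect * takes_drug + health_conscious + rng.gauss(0, 1)
        (treated if takes_drug else untreated).append(outcome)
    return statistics.mean(treated) - statistics.mean(untreated)

observational = estimated_effect(20000, randomized=False)
experimental = estimated_effect(20000, randomized=True)
print("observational estimate:", round(observational, 2))
print("randomized estimate:", round(experimental, 2))
```

With no true effect at all, the self-selected comparison still shows a sizable apparent benefit, while the randomized comparison stays near zero. Adjusting for measured pre-treatment variables helps only to the extent that they capture the trait driving the selection.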
SR Did the randomized study conclusively rule out the effect size seen in the correlational study? Or did it simply find no effect?
AG I’m not sure. My impression is that the expt actually contradicted the obs study–a stat signif negative effect for one, and a stat signif positive effect for the other–not just that there was significance for the obs study and no signif for the expt–but I never really looked into it.
SR I’d like to return to the issue of blind vs don’t blind. You believe any experiment where subjects are not blind to the treatment has a problem?
AG Yes, if knowledge of the treatment could affect the outcome (for example, through motivation). I worry about it for your diet and depression studies.
SR Well, in much research the first question is whether there is a useful effect. Later experiments deal with mechanism. I was under the impression that what matters is to equate expectations across conditions and that blinding is just one way to do this.
AG Maybe you’re right, I’m not actually up on this literature. I know that Paul Rosenbaum has written about it.
MORE ON BLINDNESS: CONSIDERING THE SHANGRI-LA DIET
AG My knowledge of it is not particularly sophisticated. For your diet and depression studies, there are obvious stories based on motivation.
I wouldn’t go so far as some people and simply dismiss your results. But the concerns are natural, I think. It’s a little different than the problem with the Nurses study. Here I’m worried about motivation, there the issue was selection.
Although there’s a possible selection problem in your study too, in that the people (including you) doing the Shangri-La Diet might be those who are ready to try something new and lose weight.
SR There are a lot of people who are always ready to try something new and lose weight.
AG Again, this could be tested with a blinded study. For example, half the people get the oil apart from a meal, half get the oil with the meal. Not that this would solve all problems of interpretation. . . .
For example, Caroline thinks that your diet works, but that the reason why it works is that it stops people from snacking for a 2-hour period (before and after the oil) and also focuses people on their snacking.
SR If anyone thinks that — and it is a perfectly reasonable thing to think if you are just starting to learn about it — then they can replace the oil with water and see if they continue to lose.
AG To answer your comment (“there are a lot of people who are always ready to try something new and lose weight”): yes, I remember you saying this before, and this is a big reason I wouldn’t dismiss your results immediately. But, still, people willing to try this wacky new thing might be special (on average). To put it another way, I expect there were similar successes with people trying Scarsdale, Atkins, etc.
SR I’m sure that people who try my diet are unusual early adopter types. I think Atkins has some truth to it — some reasons it would actually work. I don’t know enough about Scarsdale to comment. My theory says that merely changing what you eat (to foods with unfamiliar or at least less familiar flavors) should lower your set point.
AG Sure, but you had another point which was that these were people for whom nothing worked before. I was just using these diets as examples of other things that worked when nothing worked before. It relates to the historical perspective of new diets as things that will work for a few years before burning out. Possibly because the new diets can motivate people.
SR I tend to think they burn out because the new food becomes familiar.
AG I’m not saying that this is necessarily true of your diet–yours might be different–I’m just giving a historical control to give insight as to how there could really be motivational issues.
SR That’s true, research to distinguish my explanation of the burn out and a motivational one could be done but of course hasn’t been.
AG Your story, “they burn out because the new food becomes familiar”, is plausible. It’s also plausible that it’s easier to motivate yourself with a plan that’s new and different.
SR I hope there will be studies of whether the theory behind my diet is correct. These would essentially be studies that test the prediction that familiarity matters. This is a prediction that other theories do not make.
AG Yeah, based on reading the appendix to your book, there’s still some research synthesis that needs to be done (presumably with the help of animal studies).
SR I agree.
BACK TO WEB TRIALS
SR Web trials are relatively early in the research chain and they are relatively practical. In these cases you don’t worry a lot about mechanism, you worry much more about efficacy — is there an effect?
AG Regarding the analysis of web trials, it would be interesting to look at other examples of partially randomized experiments. Rubin and Hill and others worked on a study of school choice where they looked into some of these issues. It was a study that randomized some aspects of which kids went to which schools, but parents had some choices too.
In medicine and also in economics/public-policy, there has been a lot of interest in recent years in trying to get inside this sort of study rather than just relying on the “intent to treat” or explicit randomization.
SR “get inside this sort of study”–what do you mean?
AG I mean, look at what treatments are actually chosen by the individuals in the study, not just looking at what treatments they were assigned to.
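[Editorial illustration, not part of the interview.] The contrast between analyzing by assigned treatment ("intent to treat") and by treatment actually taken can be sketched as follows. The compliance rule, the unobserved `motivation` trait, and the true effect of 1.0 are all assumptions made for this toy example. The methods AG alludes to are more careful than either of the two naive comparisons shown.

```python
import random
import statistics

def simulate_trial(n=20000, true_effect=1.0, seed=2):
    """Randomized assignment with partial compliance driven by an unobserved trait."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        motivation = rng.gauss(0, 1)      # unobserved (assumed)
        assigned = rng.random() < 0.5     # randomized assignment
        took = assigned and motivation > 0  # only motivated people follow through
        outcome = true_effect * took + motivation + rng.gauss(0, 1)
        rows.append((assigned, took, outcome))
    return rows

def mean_difference(rows, column):
    """Mean outcome where rows[column] is True minus mean outcome where it is False."""
    yes = [r[2] for r in rows if r[column]]
    no = [r[2] for r in rows if not r[column]]
    return statistics.mean(yes) - statistics.mean(no)

rows = simulate_trial()
intent_to_treat = mean_difference(rows, 0)  # compare by assignment
as_treated = mean_difference(rows, 1)       # compare by treatment actually taken
print("intent to treat:", round(intent_to_treat, 2))
print("as treated:", round(as_treated, 2))
```

In this setup the intent-to-treat estimate is diluted to about half the true effect because only about half of those assigned comply, while the as-treated comparison overstates the effect because those who take the treatment are also the more motivated. "Getting inside" the study means modeling that gap rather than picking one of these two numbers.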
SR Could you sum up why you like the idea of web trials?
AG 1. Lots of data. 2. Motivates people to randomize, to apply the treatment, and to record results. 3. More generally, gets people involved in the project as participants, not just “subjects”.
SR Those are good points, thanks.
AG Thank you for giving me the opportunity to think about these things. I’m still struggling with the question, “Are medical experiments too small or too big (in number of subjects)?”. As discussed here.