In 2004, two 14-year-old New Zealand girls found that a blackcurrant drink made by GlaxoSmithKline, the giant company, contained almost no Vitamin C — contrary to advertising that boasted of its Vitamin C content. Today the company was fined about $150,000 (US).
Category: scientific method
Science in Action: Omega-3 (old data re-analysed)
A few months ago I did a little experiment to test my belief that omega-3 was affecting my balance. I replaced fats high in omega-3 (flaxseed oil and walnut oil) with a fat low in omega-3 (sesame oil). Here is a new analysis of the data:
The raw data are the same. The new analysis differs from the earlier analysis in two ways:
1. How the number for each day is computed. The old analysis dropped the first 5 trials and took the mean of the rest. The new analysis fits a regression line to balance as a function of trial to estimate an effect of trial and subtract it, then takes a mean of all the trials.
2. Allowance for improvement. The new analysis, as the graph shows, fits a slope to all the data. The improvement over days is subtracted from each day’s score before the two conditions are compared.
The old analysis gave t = 4.1 (p = very tiny). The new one gives t = 6.3 (p = very very tiny). Big improvement!
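The "allowance for improvement" step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code behind the numbers above: the function names (`fit_line`, `remove_trend`, `welch_t`, `compare_conditions`) are made up for the example, the trend over days is assumed to be a straight line fit to all days at once, and Welch's t is assumed as the test statistic because the post does not say which t-test was used. The within-day trial adjustment is not sketched because its exact form is not given.

```python
from math import sqrt
from statistics import mean, variance


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def remove_trend(x, y):
    """Subtract the fitted linear trend while keeping the overall mean level."""
    _, slope = fit_line(x, y)
    x_bar = mean(x)
    return [yi - slope * (xi - x_bar) for xi, yi in zip(x, y)]


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def compare_conditions(days, scores, labels, first, second):
    """Detrend the daily scores over days, then compare two conditions."""
    adjusted = remove_trend(days, scores)
    a = [s for s, lab in zip(adjusted, labels) if lab == first]
    b = [s for s, lab in zip(adjusted, labels) if lab == second]
    return welch_t(a, b)
```

Here `days` is the day number of each session, `scores` is the one number computed for each day, and `labels` says which oil condition each day belonged to (all hypothetical argument names). One caution about this design choice: a slope fit to all days together can absorb part of a real condition difference when the conditions occupy different stretches of days.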
Directory of my omega-3 posts.
Science in Action: Omega-3 (measurement improvement)
I’ve learned a few things. As some of you may know, I’ve been measuring my balance by standing on a board that is balanced on a tiny platform (a pipe plug) — pictures here. Now and then the board would slip off the platform. I supposed this was a failure of balance but I wasn’t sure, especially if it happened as soon as I stood on it. So I got another board into which my brother-in-law kindly drilled the perfect-size hole so that the plug will never slip:
To see if this made a difference I did an experiment with a design I have never used before but that I really like: ABABABAB… (one day per condition). In other words, Monday I tested my balance with the old board, Tuesday with the new board, Wednesday with the old board, Thursday with the new board, etc. Simple, efficient, well-balanced. Here are the results:
The red line is fit to the red points, the blue line to the blue points. The two lines are constrained to have the same slope.
Well, that’s clear. I expected my balance to be better with the new board, actually.
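Fitting two lines that are constrained to share a slope can be sketched as below. This is a minimal illustration, not the code that produced the graph: `fit_parallel_lines` and its argument names are hypothetical, and it assumes ordinary least squares with one intercept per condition and a single common slope.

```python
from statistics import mean


def fit_parallel_lines(days, scores, labels):
    """Fit one line per condition, with all lines sharing a single slope.

    Returns (slope, {label: intercept}).
    """
    groups = {}
    for d, s, lab in zip(days, scores, labels):
        groups.setdefault(lab, []).append((d, s))

    sxx = 0.0
    sxy = 0.0
    centers = {}
    for lab, pts in groups.items():
        d_bar = mean(d for d, _ in pts)
        s_bar = mean(s for _, s in pts)
        centers[lab] = (d_bar, s_bar)
        # Pool the within-condition sums so both conditions inform the one slope.
        sxx += sum((d - d_bar) ** 2 for d, _ in pts)
        sxy += sum((d - d_bar) * (s - s_bar) for d, s in pts)

    slope = sxy / sxx
    intercepts = {lab: s_bar - slope * d_bar for lab, (d_bar, s_bar) in centers.items()}
    return slope, intercepts
```

With an ABAB design the labels simply alternate, for example `["old", "new", "old", "new", ...]`. The difference between the two intercepts is then the estimated effect of the board, with any steady change over days accounted for by the shared slope.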
Speaking of the unexpected, I made another measurement improvement that truly surprises me — the surprise is that I never did it before. When I looked at my early balance data (the first 10 or so days of data) I saw that my balance improved for the first 5 trials and was roughly constant after that. Each session was 20 trials so I dropped (excluded) the first 5 trials from my analyses — considering them “warm-up” trials. I took the mean of the last 15 trials. That seemed very reasonable and I thought nothing of it.
Recently I asked again how performance changes over a session. The answer was a bit different: I found that performance improved for the first 10 trials. Now there are 30 trials in a session, so dropping the first 10 of them seemed okay. And that’s what I did.
But then I looked at how variability changed over a session. I expected the earliest trials to be more variable than the rest but the data didn’t show that. Variability was pretty constant from the first trials to the last. Hmm. Maybe I am losing valuable information by not including those early trials in my averages. It occurred to me: why not allow for the warm-up effect by modelling it, rather than by excluding it? (Modelling it meaning estimating it and then subtracting it.) I did that, and then I looked at the size of the standard errors of the means (standard errors based on the residuals from the fit) for the most recent 40 days; this is essentially the error in measurement. Here is what I found. Median standard errors:
First 10 trials (out of 30) excluded: 0.073
First 5 trials excluded: 0.064
First trial excluded: 0.061
No trials excluded: 0.059
My eyes opened wide when I saw these numbers. Oh my god! I was throwing away so much! A reduction in error from 0.073 to 0.059 — that’s 20% better.
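The comparison between modelling the warm-up and excluding it can be sketched like this. It is a rough illustration under assumptions the post does not confirm: the warm-up effect is estimated as the average deviation from the session mean at each trial position, every session has the same number of trials, and the standard error is the plain sample standard deviation divided by the square root of the number of trials. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, median, stdev


def standard_error(values):
    """Standard error of the mean of one session's trials."""
    return stdev(values) / sqrt(len(values))


def position_effect(sessions):
    """Average deviation from the session mean at each trial position."""
    n_trials = len(sessions[0])
    deviations = [[t - mean(s) for t in s] for s in sessions]
    return [mean(dev[i] for dev in deviations) for i in range(n_trials)]


def remove_warmup(sessions):
    """Subtract the estimated trial-position effect from every session."""
    effect = position_effect(sessions)
    return [[t - e for t, e in zip(s, effect)] for s in sessions]


def median_se_modelled(sessions):
    """Median standard error when the warm-up is modelled and all trials kept."""
    return median(standard_error(s) for s in remove_warmup(sessions))


def median_se_excluding(sessions, n_dropped):
    """Median standard error when the first n_dropped trials are discarded."""
    return median(standard_error(s[n_dropped:]) for s in sessions)
```

The size of the gain is about what the extra trials alone would predict. If the spread of the trials were unchanged, keeping 30 trials instead of 20 would multiply the standard error by sqrt(20/30), about 0.82, and the observed ratio is 0.059/0.073, about 0.81.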
How To Do Experiments That Generate Ideas
A few days ago a graduate student in economics asked me what I thought of behavioral economics. On the positive side, I said, some of the phenomena are impressive. For example, the endowment effect, which is so strong I would demonstrate it in class. On the negative side, none of the researchers use experiments to generate ideas. They don’t merely not do it; they seem unaware of the possibility of doing it. The graduate student wondered how it can be done. I said there were three main ways:
1. Do something extra. Do a little more than necessary so that your experiment tells you about something that isn’t the focus of interest. For example, vary a factor that you think is not important. This is Saul Sternberg’s idea. I did this in my peak-procedure experiments: measured how long rats held down the bar. This was irrelevant to the purpose of the experiments, which was to understand how rats measured time. These measurements greatly surprised me. For years, I misunderstood them. Eventually they led to a new line of research about the control of variability.
2. Measure a function, not a point. Ask how your treatment changes a whole function, not just this or that numerical measure. This is what I did in my peak procedure experiments: The experiments generated for every condition an entire function showing response rate as a function of time. I saw how treatments changed the entire function. This talk describes some of the new ideas this led to.
3. Make your experiment easy and fast. The easier and faster it is, the more you can do it in lots of variations. Our ignorance of behavior being great, some fraction of these are likely to generate unexpected, and therefore inspiring, results. This is one reason self-experimentation is good for generating ideas: It is easy and fast.
I am not aware of any other written answers to this question, strangely enough.
Andrew Gelman on Web Trials and the Shangri-La Diet
Andrew Gelman is a professor of statistics at Columbia University. Years ago we co-taught a seminar about left-handedness. His blog. This interview took place via instant messaging in February, 2007 and has been edited slightly.
SR I want to ask your opinion of web trials. People go to a website where they choose or are randomly assigned a treatment. Then they come back and report the results.
AG Then the records of their choices and outcomes are made publicly available.
SR Yes. And there would probably be some summary of the results prepared by experts. It wouldn’t be just raw data.
COMPARING TO CURRENT STATE OF THE ART IN MEDICAL RESEARCH
AG We could compare to the current state of the art in medical research, which I think is to have some moderately large randomized clinical trials, each of which is published in a journal, followed by a meta-analysis of these trials. A difficulty with the current state-of-the-art is that sample sizes in clinical trials seem to be simultaneously too small and too large. Too small in that results tend to be just barely statistically significant (and often not significant for subgroups), so that you can’t really put your faith in one study, hence the need for meta-analysis. Too large in that each study is unwieldy, takes a huge amount of effort and doesn’t allow for much learning and experimentation during the study.
SR A famous epidemiologist [Richard Doll] once said that if the effect is strong, you don’t need a big study.
AG In some way the high cost is a good barrier in that people have to think seriously and justify what they want to do. On the other hand, within any particular research plan, it would seem to limit the possibility for innovation.
Speaking generally, a challenge is to integrate clinical judgment (including ideas of experimentation and trying different things with different patients) with scientific goals such as replicability.
Also, there are well-known cognitive illusions in clinical judgment, which is what motivates the evidence-based-medicine movement (for randomized trials, public records of data, etc.) in the first place.
SR How do web trials fit into the picture you have drawn?
AG Ideally, web trials are intermediate between controlled randomized trials on one hand, and full recording of observational data on the other. If people are really volunteering to be randomized, and then they follow the protocol, then this is a clean randomized expt (albeit not blinded, an issue I’d like to raise with you). In practice there will be lots of selection, dropout, measurement error, etc., which moves it toward an observational study. The dispersed nature of the data collection is similar to (in fact, more dispersed than) the idea of individual clinicians recording their experiences and outcomes into a centralized database. That is, the data collection is dispersed, the database is centralized.
SR A web trial would have more regularity — less variation — across subjects than observations collected from individual doctors. Because everyone would get the same instructions. Whereas different doctors are obviously going to give different instructions (for the same nominal treatment).
AG Yes. That’s why I said the web trial is in between.
DIFFICULTIES WITH BLINDING
SR In the area of blinding I think a web trial would be better than the conventional double-blind clinical trial. If the goal is to guide practice. In practice patients are not blinded. Blinding is a tool to equate expectations. Better to equate expectations by comparing different treatments both believed to be effective.
AG One of the difficulties with your self-experimentation is that there’s no blinding at all. Similarly with these trials. Some of it is the nature of your treatments, but perhaps with some effort you could come up with blinded versions.
SR In my self-experimentation the expectations are equal in the different conditions, in many cases.
AG For example, consider the recent self-experiment that you describe on your blog, where you try different oils and measure your balance. I’d believe these results a lot more if you blinded the treatments.
SR Sure, blinding would help in that case, I agree. I plan to do something like that. But blinding is not necessary to equate expectations. For example, I tried many ways of losing weight. In every case I expected it to work. Some ways worked much better than others. It is this comparison of the effects of different treatments that is interesting. In general expectations cannot be very powerful or there would be no problems left to solve. Expectations are powerful in a few areas and seem to have no effect in many areas. I don’t mean we should ignore them; but to emphasize them as a big deal is not what the evidence suggests. In any case in web trials the participants would only be randomized to (or choose) treatments they thought might work.
AG There’s some work by Rubin and other statisticians on “broken randomized trials” which can more generally be thought of as experiments that have partial randomization.
SR I think of web trials as giving “entrants” (or subjects) a choice: to be or not to be randomized. Then when it’s all over you compare the two groups.
AG That makes sense. You’ll still have some problems: 1. People not following protocol. 2. Non-blindness of treatments. 3. Other problems, I’m sure, which I can’t think of offhand.
SR Well, these are equal for all conditions so they shouldn’t distort anything.
AG In a controlled trial you can deal with some of these things: 1. In a controlled trial you can have more interactions with the experimental subjects, thus maybe more likely they’ll follow protocol. 2. In a controlled trial you can (sometimes) ensure blindness. In general, I don’t think you can get away with assuming that biases cancel out.
ANALYZING DATA FROM WEB TRIALS
AG Your web trials should give us a big juicy source of data that can be thrown at a stat Ph.D. student as a thesis project, perhaps! My intuition as an amateur sociologist of applied statistics is that an exemplary applied analysis is a good way to kick-start the study of a statistical problem.
SR What’s an example of such a kick-start? That’s an interesting point.
AG I’m thinking of the hierarchical models that were fit by Lindley, Novick, Rubin, and others in the late 1960s thru early 1980s to educational data. These provided examples for people to follow–templates–as well as demonstrations that these methods really worked. There were various interesting discussions of these models in the stat literature, in particular I’m thinking of a paper by Rubin on law school validity studies in J. Amer. Stat. Assoc. from 1980 that had several discussants.
SR Yes, it is true that the data from web trials would be complex and interesting in new ways and accessible to everyone.
AG Yes, having available data is another plus–that’s really a new feature which should help. Now back to the warnings. A very well known example is the Nurses’ Health Study, an observational study that found that taking post-menopausal drugs was associated with lower heart-attack risks (and lower death rates). But when a big randomized expt was done, no association was found. Actually, taking the drugs slightly increased cancer risk, I believe. See here.
I talked with various people about this, and there are different potential explanations for the discrepancies. One story is that the women who took the drugs were otherwise healthier, more health conscious, etc.–even after controlling for whatever pre-treatment variables they controlled for. Another story is that the populations of the 2 studies were different (in particular, in their average ages), and perhaps the drugs are beneficial for some ages but not others. (Incidentally, the drugs were not originally intended to reduce heart-attack risk. This was an unexpected effect (or non-effect), I believe.)
Anyway, the people I trust on these matters (notably John Carlin) believe that the difference is because of “selection”, i.e., the drugs don’t really reduce heart attack risk. But the observational study led people to recommend the drugs. So this is a big example where the obs study was misleading.
SR Did the randomized study conclusively rule out the effect size seen in the correlational study? Or did it simply find no effect?
AG I’m not sure. My impression is that the expt actually contradicted the obs study–a stat signif negative effect for one, and a stat signif positive effect for the other–not just that there was significance for the expt and no signif for the obs study–but I never really looked into it.
SR I’d like to return to the issue of blind vs don’t blind. You believe any experiment where subjects are not blind to the treatment has a problem?
AG Yes, if knowledge of the treatment could affect the outcome (for example, through motivation). I worry about it for your diet and depression studies.
SR Well, in much research the first question is whether there is a useful effect. Later experiments deal with mechanism. I was under the impression that what matters is to equate expectations across conditions and that blinding is just one way to do this.
AG Maybe you’re right, I’m not actually up on this literature. I know that Paul Rosenbaum has written about it.
MORE ON BLINDNESS: CONSIDERING THE SHANGRI-LA DIET
AG My knowledge of it is not particularly sophisticated. For your diet and depression studies, there are obvious stories based on motivation.
I wouldn’t go so far as some people and simply dismiss your results. But the concerns are natural, I think. It’s a little different than the problem with the Nurses study. Here I’m worried about motivation, there the issue was selection.
Although there’s a possible selection problem in your study too, in that the people (including you) doing the Shangri-La Diet might be those who are ready to try something new and lose weight.
SR There are a lot of people who are always ready to try something new and lose weight.
AG Again, this could be tested with a blinded study. For example, half the people get the oil apart from a meal, half get the oil with the meal. Not that this would solve all problems of interpretation. . . .
For example, Caroline thinks that your diet works, but that the reason why it works is that it stops people from snacking for a 2-hour period (before and after the oil) and also focuses people on their snacking.
SR If anyone thinks that — and it is a perfectly reasonable thing to think if you are just starting to learn about it — then they can replace the oil with water and see if they continue to lose.
AG To answer your comment (“there are a lot of people who are always ready to try something new and lose weight”): yes, I remember you saying this before, and this is a big reason I wouldn’t dismiss your results immediately. But, still, people willing to try this wacky new thing might be special (on average). To put it another way, I expect there were similar successes with people trying Scarsdale, Atkins, etc.
SR I’m sure that people who try my diet are unusual early adopter types. I think Atkins has some truth to it — some reasons it would actually work. I don’t know enough about Scarsdale to comment. My theory says that merely changing what you eat (to foods with unfamiliar or at least less familiar flavors) should lower your set point.
AG Sure, but you had another point which was that these were people for whom nothing worked before. I was just using these diets as examples of other things that worked when nothing worked before. It relates to the historical perspective of new diets as things that will work for a few years before burning out. Possibly because the new diets can motivate people.
SR I tend to think they burn out because the new food becomes familiar.
AG I’m not saying that this is necessarily true of your diet–yours might be different–I’m just giving a historical control to give insight as to how there could really be motivational issues.
SR That’s true, research to distinguish my explanation of the burn out and a motivational one could be done but of course hasn’t been.
AG Your story, “they burn out because the new food becomes familiar”, is plausible. It’s also plausible that it’s easier to motivate yourself with a plan that’s new and different.
SR I hope there will be studies of whether the theory behind my diet is correct. These would essentially be studies that test the prediction that familiarity matters. This is a prediction that other theories do not make.
AG Yeah, based on reading the appendix to your book, there’s still some research synthesis that needs to be done (presumably with the help of animal studies).
SR I agree.
BACK TO WEB TRIALS
SR Web trials are relatively early in the research chain and they are relatively practical. In these cases you don’t worry a lot about mechanism, you worry much more about efficacy — is there an effect?
AG Regarding the analysis of web trials, it would be interesting to look at other examples of partially randomized experiments. Rubin and Hill and others worked on a study of school choice where they looked into some of these issues. It was a study that randomized some aspects of which kids went to which schools, but parents had some choices too.
In medicine and also in economics/public-policy, there has been a lot of interest in recent years in trying to get inside this sort of study rather than just relying on the “intent to treat” or explicit randomization.
SR “get inside this sort of study”–what do you mean?
AG I mean, look at what treatments are actually chosen by the individuals in the study, not just looking at what treatments they were assigned to.
SR Could you sum up why you like the idea of web trials?
AG 1. Lots of data. 2. Motivates people to randomize, to apply the treatment, and to record results. 3. More generally, gets people involved in the project as participants, not just “subjects”.
SR Those are good points, thanks.
AG Thank you for giving me the opportunity to think about these things. I’m still struggling with the question, “Are medical experiments too small or too big (in number of subjects)?”. As discussed here.
Omega-3 and Freakonomics
Steve Levitt, co-author of Freakonomics, has done me the great favor of bringing my omega-3 self-experimentation to a wider audience in this post. He thinks my results might be due to my expectations. I posted this comment:
Thanks, Steve, for writing about this. Here’s why I think the balance improvements I’ve noticed are unlikely to be due to expectations:
1. I first noticed the effect putting on my shoes the morning after I started taking flaxseed oil. I had been putting on my shoes standing up for two years; until that morning, I had always had trouble. Every morning. (I had expected it to get much easier — practice effect — but it didn’t.) The sudden improvement was a complete surprise. I had never heard of such an effect. I had hoped that flaxseed oil would improve my sleep.
2. The sudden improvement I saw when I switched from 2 tablespoons/day to 3 tablespoons/day was also a surprise, although I realize this may be harder to believe.
3. When I switched from flaxseed oil and walnut oil to sesame oil, I expected my balance to get worse. It did, but not when I expected. (It took 2 days to see a change; I expected to see it on the first day.)
Which is not to say I’m sure. If the effects I’ve seen are repeatable, I’ll test myself not knowing what oil I’ve ingested.
And forgot to sign my name. Oops.
My reading of the data (such as this) is that placebo effects sometimes exist but are vastly overrated — like many dangers.
Addendum: Stephen Dubner, Levitt’s co-author, blogged today that
nearly everything we’ve written about, either in the book or our journalism or the blog, has some element of people worrying too much about something
Two Ways of Thinking About Self-Experimentation
Self-experimentation (at least, mine) is an example of what larger category?
My self-experimentation was very practical: I improved my sleep, mood, and health (went from average number of colds/winter to no colds/winter), and lost weight. My omega-3 self-experimentation has improved my balance. From this point of view self-experimentation looks like engineering. > 99.99% of engineering is making things better. The entirely new thing (e.g., the transistor) is very rare. The connection with Eric von Hippel’s work (who finds that product users do a lot of innovative engineering) is pretty clear. I “used” (applied) scientific research — for example, mood research.
Yesterday, however, Tyler Cowen, who knows ethnic restaurants, posted this:
Four chairs, one table, A+ decor, and the best Asian food in D.C. Nothing nearby comes close. Staff = 1, so you must call not only for reservations, but indeed hours in advance with an actual order so he can start making your food. I loved the salmon in red curry sauce, the pad thai, the larb, and some amazing chicken dish with the guy’s last name on it; the drunken noodles are recommended as well. But I am not not not saying the other dishes are worse. 515 Florida Avenue, NW.
I’ll never view the theory of the firm in the same light again. Monitoring doesn’t work, and who needs division of labor anyway? The coolest place in DC right now, by far.
This is an example of what might be called the stunning single case — in this instance, drawn from everyday life. A stunning single case is an observation that casts doubt on a well-respected theory or leads to a new theory.
Another view of self-experimentation is that it is a way to learn from — take advantage of — stunning single cases in everyday life. Which is science (more precisely, theory building), not engineering. For example, one morning I woke up and felt much better than usual. This one event launched several years of self-experimentation that led to a new theory of mood.
The Shangri-La Diet was suggested by a single event (loss of appetite in Paris) but the theory behind the diet, which helped me learn from that event, was already there. (It was inspired by rat experiments.) The Paris event had a small effect on my theory but a big effect on how well I applied the theory. If all applications of theory count as engineering, the post-Paris development of SLD was engineering.
(Incidentally, I didn’t notice the “not not not” in Tyler’s post until the third or fourth reading, an example of repetition blindness.)
The Hidden Relevance of Experimental Psychology
I used to teach introductory psychology. As I skimmed introductory psych texts, I could sense the disinterest that almost all the authors of these books had for my field — experimental psychology. Pavlov, memory — that was boring. What did that stuff have to do with everyday life? the authors seemed to be saying.
The Shangri-La Diet was built on thousands of experiments about Pavlovian learning. Empirical generalizations from that data helped me make the mental jump from experiments by Israel Ramirez to a new theory of weight control. A conceptual understanding of Pavlovian learning (what makes an association weak or strong) allowed me to use the new theory to find new ways of losing weight. Suddenly that boring stuff was relevant.
My omega-3 findings (such as this), if they hold up, would do the same thing for two other areas of experimental psychology. The experimental designs I use, such as ABA, are straight from Skinnerian psychology. Although I am now measuring my balance — not part of experimental psychology — my guess is that most of the measurements will eventually be more “mental.” I assume that omega-3s improve my whole brain, not just the balance-related part. Experimental psychologists have spent 100 years developing simple and effective measures of many mental functions; all that measurement work should help us figure out how much omega-3 and omega-6 we should consume. Too little omega-3 and too much omega-6 appear to cause a vast range of health problems, including the most serious. The problem is that it is extremely hard to measure the functioning of our immune system or our circulatory system or most other parts of our body. It is even hard to measure how well our mood-regulating system is working. (Too little omega-3 appears to increase the risk of bipolar disorder.) It is much easier to measure memory.
Experimental psychology can be divided into two parts — human (Part A) and animal (Part B). Part B can be subdivided into B1 (Skinnerian) and B2 (associative learning). Part B2 can be subdivided into B21 (Pavlovian learning) and B22 (instrumental learning). If you know the field you know these are the natural divisions. All my mainstream work has been in B22. I have managed (or hope to manage) to show the relevance of every area of experimental psychology except my own. Curious.
What Should “Correlation Does Not Imply Causation” Be Replaced With?
I shed an invisible tear whenever I hear “correlation does not imply causation” which the otherwise excellent swivel (a website about correlations) emphasizes. Of course, there’s truth to it. It saddens me because:
Jane Jacobs on Scientific Method
You try, if you can, to get people to look at the specific thing that is happening and not try to generalize it as an ideology. Ideologies, no matter what kind, are one of the greatest afflictions, because they blind us to seeing what is going on, or to what is being done.
From this interview.