Methodological Lessons from Self-Experimentation (part 1 of 4)

On Tuesday (January 9) I am giving a talk about my self-experimentation to a group of interface designers who I hope will be interested in the broad methodological conclusions to be drawn from it. An audio file of the talk and the PowerPoint will be available but I think the most interesting stuff will be clearer and more accessible if I write it down. So here it is.

Usually we learn from our mistakes. This is the rare case where I learned from success: I expected my self-experimentation (to improve my sleep, to find effective ways to lose weight) to fail and was surprised and impressed when I was wrong. The seven lessons that follow (divided into four posts) are the broad conclusions I draw from what happened.

1. Do something. I started the long-term self-experimentation that led to my paper because I didn’t want to wake up too early for the rest of my life. I expected my little self-experiments to fail, and they did fail, but I didn’t realize that I would slowly learn from failure. I learned how to record my data, for instance, and how to analyze it. The effect of that learning was that my self-experimentation got better and better and after many years of failure I got somewhere. I think American culture teaches that success is good and failure bad, but the truth for scientists is that failure is good in the sense that you learn from your mistakes.

2. Keep doing something. I learned the value of drudgery. The research took many years. After my initial failures I continued not because I could see I was learning stuff — the learning was too slow to be perceptible — but for the same reason I started: I didn’t want to wake up early for the rest of my life. One of my students had been a classical musician. She said that her job had been athletic, not aesthetic. It involved great repetition of the same movements, like manual labor. Likewise, scientists often see science as something intellectually wonderful. I came to see it differently. Perhaps a question has one answer and there are 100 plausible alternatives. To find the answer you may just need to test each of the 100 possibilities. No way around it. That was roughly the position I was in trying to improve my sleep: There were many possibilities and no alternative to simply testing them one by one. (More complex experimental designs, such as factorial designs, were impractical.) There was nothing intellectually wonderful about it. “One thing nobody tells you about being a postdoc is that stuff that used to be fun for its own sake becomes tedious when you’ve done it hundreds or thousands of times,” blogged a postdoc.

Part 2 is here.

Note: You no longer need to register in order to comment.

The Decline of Harvard

In high school, I learned a lot from Martin Gardner's Mathematical Games column in Scientific American. I read it at the Chicago Public Library on my way home from school while transferring from one bus line to another — thank heavens transfers were good for two hours. In college, it was long fact articles in The New Yorker. Now it's Marginal Revolution, where I recently learned:

Harvard has also declined as a revolutionary science university from being the top Nobel-prize-winning institution for 40 years, to currently joint sixth position.

The full paper is here.

What should we make of this? Clayton Christensen, the author of The Innovator’s Dilemma (excellent) and a professor at the Harvard Business School, has been skeptical of Harvard’s ability to maintain its position as a top business school. He believes, based on his research and the facts of the matter, that it will gradually lose its position due to down-market competitors such as Motorola University and the University of Phoenix, just as Digital Equipment Corporation, once considered one of the best-run companies in the world, lost its position. A few years ago, in a talk, he described asking 100 of his MBA students if they agreed with his analysis. Only three did.

How would we know if Harvard was losing its luster? Christensen asked a student who strongly disagreed with him. Harvard business students (except Christensen’s) are taught to base their decisions on data. So Christensen put the question like this: If you were dean of the business school, what evidence would convince you that this was happening and it was time to take corrective action?

When the percentage of Harvard graduates among CEOs of the top 1000 international companies goes down, said the student.

But by then it will be too late, said Christensen. His students agreed: By then it would be too late to reverse the decline.

Christensen’s research is related to mine, oddly enough — we both study innovation. For explicit connections, see the Discussion section of this article and the Reply to Commentators section of this one.

Why I Like Self-Experimentation

Self-experimentation, like blogs, Wikipedia, and open-source software (and before them, books), gives outsiders far more power. This took me a long time to figure out. For years, I liked self-experimentation for five reasons:

1. It worked. It reduced my acne, improved my sleep, and enabled me to lose plenty of weight. This surprised me. I am a professional scientist. My professional experiments, about animal learning, generally worked, but never had practical value.

2. It had unexpected benefits. I discovered accidentally that seeing faces in the morning improved my mood the next day. Better sleep (from self-experimentation) improved my health.

3. It was easy. What I did never involved more than small changes in my life. Even standing 8 hours per day wasn’t hard, after a few days.

4. My conclusions fit what others had found — usually, facts that didn’t fit mainstream views. For example, the fact that depression is often worst in the morning and gets better throughout the day doesn’t fit the conventional view that depression is a biochemical disorder but does fit my idea that depression is often due to a malfunctioning circadian oscillator. Self-experimentation seemed to be pointing me in correct directions.

5. My conclusions were surprising. That breakfast is bad (for sleep), the effect of faces on mood, and the Shangri-La Diet are examples.

Recently, though, the rise of blogging, Wikipedia, and open-source software showed me the power of a kind of multiplicative force: (pleasure of hobbies) multiplied by (professional skills). Blogging, for example: (people enjoy writing) multiplied by (professional expertise, which gives them something interesting and unusual to say). In other words, expertise and job skills used in a hobby-like way. My self-experimentation, I realized, was another example: I used my professional (scientific) skills to solve everyday problems. My self-experimentation was like a hobby in that I did it year after year without financial reward or recognition. It was its own reward. The hobby aspect — persistence, freedom to try anything, no need for recognition or payment — made it powerful. I could go in depth where professionals couldn't go at all.

But I was still missing something — something obvious to many others. The power of blogging isn’t

(hobby) x (job skills).

That’s just one person. The total power of blogging is

(hobby) x (job skills) x (anyone can do it)

which is very powerful. Finally I saw there was a sixth reason to like self-experimentation:

6. Anyone can do it.


Web Trials

Thanks to Rey Arbolay, at the Shangri-La Diet forums, the eternal question “will this help?” is being answered in a new way. The specific question is “will the Shangri-La Diet help me lose weight?” The new way of answering it is that people are posting their results with the diet in the Post Your Tracking Data Here section of the forums. What they post is standardized and numerical enough that ordinary statistical methods can be used to learn from them. I’ll call this sort of thing a web trial.

It’s a lot better than nothing or a series of individual cases studied separately. I learned a lot from my most recent analysis — for example, that people lose at a rate of about 1 pound per week after Week 5. I couldn’t have done a good job of predicting where any of the fitted lines on the scatterplots would be or the size of the male/female difference. Nor could I have done a good job predicting the variability — the scatter around the lines.
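To give a flavor of what such an analysis might involve, here is a minimal sketch in Python. The file name and column layout are my assumptions for illustration, not the forums' actual format.

```python
# A sketch of a web-trial analysis: one weight-loss trend line per dieter.
# Assumes a hypothetical CSV with columns "user", "week", and "weight" (lb).
import csv
from collections import defaultdict

def fit_slope(points):
    """Ordinary least-squares slope of weight against week."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

tracks = defaultdict(list)
with open("tracking_data.csv") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        tracks[row["user"]].append((float(row["week"]), float(row["weight"])))

slopes = []
for pts in tracks.values():
    late = [(w, wt) for w, wt in pts if w > 5]  # the roughly linear phase after Week 5
    if len({w for w, _ in late}) >= 2:          # need two distinct weeks to fit a line
        slopes.append(fit_slope(late))

print("mean rate after Week 5: %.2f lb/week" % (sum(slopes) / len(slopes)))
```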

It's a lot worse than perfection. It would be much better if a comparison treatment (in the case of SLD, a different way of losing weight) were being tested in the same way. Then results from the two treatments could be compared and you would be closer to answering the practical question "what should I do?" (That modern clinical trials — very difficult and expensive — still use placebo control groups although placebos are not serious treatment options is a sign of . . . something not good.)

I can imagine a future in which people with a health problem (acne, insomnia, etc.) go to a website and enroll in a web trial. They are told about several plausible treatments: A, B, C, etc., all readily available. They are given a choice of (a) choosing among them or (b) being randomly assigned. They post their results in a standardized format for a few weeks or months. Then someone with data-analysis skills analyzes the data and posts the results. As for the participants, if the problem hasn't been solved they can enroll again. This would be a way for anyone with a problem to help everyone with that problem, including themselves. The people who set up the trials and analyze the results would be like book-industry or Wikipedia insiders — people with special skills who help everyone learn from everyone.
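A minimal sketch of the enrollment step imagined above, with hypothetical treatment names and a simple choose-or-randomize rule:

```python
# Sketch of web-trial enrollment: a participant either picks a treatment
# or is randomly assigned one. Treatment names are hypothetical.
import random

TREATMENTS = ["A", "B", "C"]

def enroll(choice=None):
    """Return the participant's treatment: their own pick if they made a
    valid one, otherwise a random assignment."""
    if choice in TREATMENTS:
        return choice
    return random.choice(TREATMENTS)

print(enroll())     # deferred to randomization, e.g. "B"
print(enroll("A"))  # chose treatment A
```

Whether each participant chose or was randomized would need to be recorded, since self-selected groups can't be compared as cleanly as randomized ones.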

Science Versus Human Nature

Last weekend I saw the writer Thomas Cahill on Book TV. He mentioned his book How The Irish Saved Civilization. The real contribution of the Irish, he said, wasn’t that they saved the sacred texts, it was that they brought humor to their study. “They brought irreverence to reverence,” he said. “That was entirely new.”

This reminded me of Brian Wansink's comments about cool data. His notion that research designs should be judged on their coolness was entirely new to me. I'm not the only one; the Wikipedia entry for scientific method says nothing about it. Using "cool" and "research design" in the same sentence is quite a bit like bringing irreverence to reverence. Once somebody says it, though, it makes sense. I remember being thanked after an interview; I replied that there's no point doing the research if no one ever learns about it. Coolness obviously plays into that — it influences the chance that other people will learn about it.

I think most scientists will agree with Wansink that coolness matters. I think you don't find his idea in books and articles about scientific method not only because there is so little written about research design (at least compared to the amount written about data analysis) but also because it appears undignified. "I'm important, I shouldn't have to worry about being cool" is the (very human) unspoken attitude.

Varieties of Shangri-La Diet Experience

The theory behind the Shangri-La Diet suggests several new ways of losing weight. As far as I can tell, they all work at least some of the time. To get an overview of the new methods, I asked users to rate them on power and ease of use. (Thanks to Brian Wansink for this suggestion.) Here are the average ratings (so far):

[Chart: Power and Ease of Use of Different Ways of Doing the Shangri-La Diet]

The two scales were defined as follows:

Power
5 = very powerful
4 = quite powerful
3 = somewhat powerful
2 = slightly powerful
1 = not powerful at all

Ease of Use
5 = very easy/convenient
4 = quite easy/convenient
3 = easy enough
2 = quite difficult
1 = too difficult to ever do

The cluster in the top corner consists of “flavorless oil” and “nose-clipped oil”.

I like to think this little diagram predicts the future of SLD: lots of people drinking flavorless oil, lots of people drinking nose-clipped oil, fewer people drinking sugar water, etc. A friend of mine showed me a photo of herself at age 2. Another 2-year-old was in the picture, but I could tell which one was my friend. Early snapshots, like this diagram, hint at what's to come.

I collected the data via the Web. Maybe I should have used www.surveymonkey.com (as my students have), but even my homemade approach was incredibly easy compared to other data-collection methods.
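Computing averages like these takes only a few lines; as a sketch, in Python, with placeholder ratings rather than the actual survey data:

```python
# Sketch of how average ratings on the two 1-5 scales can be computed
# and plotted. The ratings below are placeholders, not real survey data.
import matplotlib.pyplot as plt

ratings = {  # method: list of (power, ease) ratings
    "flavorless oil": [(5, 5), (4, 5)],
    "nose-clipped oil": [(5, 4), (5, 5)],
    "sugar water": [(3, 4), (4, 3)],
}

for method, rs in ratings.items():
    power = sum(r[0] for r in rs) / len(rs)
    ease = sum(r[1] for r in rs) / len(rs)
    plt.plot(power, ease, "o")
    plt.annotate(method, (power, ease))

plt.xlabel("Power (1-5)")
plt.ylabel("Ease of use (1-5)")
plt.show()
```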

On Scientific Method

When I visited George Mason University recently, I asked Tyler Cowen, “What’s the secret of a successful blog?” Cowen and Tabarrok’s Marginal Revolution is the most successful blog I know of.

His answer: “Three elements: 1. Expertise. 2. Regularity. 3. Recurring characters, like a TV show.” By regularity he meant at least 5 times/week.

I saw I had considerable room for improvement. Since then, I've tried to post at least twice/week. With this post I am adding scientific method to the subtitle, which I hope makes me appear more expert. A Berkeley philosophy professor named Paul Feyerabend wrote a book that I thought was called On Method but that, I see, is actually called Against Method. He was at Berkeley when I arrived. I remember two things about him: 1. He gave all his students A's. 2. He ate at Chez Panisse every night.

The Wisdom of Experts: John Chambers on Research Design

John Chambers, a retired Bell Labs statistician and one of the persons most responsible for R, the free open-source data analysis package I use, told me an interesting story yesterday. AT&T used to make microchips. The “yield” of chips — the percent of chips that were defect-free — was very important. Chambers and other Bell Labs statisticians were asked to help the chip makers improve their manufacturing process by increasing the yield. At the chip factory, the people Chambers and his colleagues spoke to were chemists and engineers. They wanted to do experiments that varied voltage, temperature, and similar variables. Chambers and his colleagues had a hunch that the operator — the person running the fabrication machines — was important, and this turned out to be true.

I like this story because it has a wisdom-of-crowds-but-not-exactly twist: the supposed experts at one thing (data analysis) turned out to have useful (and unpredictable) knowledge about something else. We don't think of statisticians as experts in human behavior, but in this case they were at least more expert than the chemists and engineers. Who were the experts here? When we deal with someone, which is more likely: that we overestimate how much they can help us with our problem, or that we underestimate it (as in this story, where the chip makers underestimated the statisticians)? And if we have no idea which it is, how might we find out?

I told Chambers that statisticians were hurt by the name of their department: statistics. It puts them in too small a box. John Tukey's term data analysis (in place of statistics) was an improvement, yes, but only a bit; it would be a lot better if they were called how-to-do-research departments. Yes, Chambers said, that would be an improvement.

I am fascinated by the similarity between three things:

1. Data analysis. Much of data analysis consists of putting data together in a way that allows you to extract a little bit of information from each datum. These little pieces of information, added together, can be quite informative. A scatterplot, for example.

2. Wisdom-of-crowds phenomena. For example, many people guess the weight of a cow. The average of their guesses is remarkably accurate, even though the variation in guesses is large.

3. Self-experimentation. The new and interesting feature of my self-experimentation was that it involved my everyday life. From activities I was going to do anyway (such as eat and sleep), I managed to extract useful information.

In each case it's like extracting gold from seawater: You get something of value from what seemed useless. Are there other examples? How can we find new examples? Chambers's story suggests one direction: making some small change so that you learn from your co-workers about stuff you wouldn't think they could teach you about.
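A toy simulation shows how the averaging works in the cow-guessing example; the true weight and the spread of the guesses are made-up numbers.

```python
# Toy wisdom-of-crowds simulation: individual guesses vary widely,
# but their average lands close to the truth. All numbers are invented.
import random

random.seed(1)
TRUE_WEIGHT = 1200  # hypothetical cow weight, in pounds

guesses = [random.gauss(TRUE_WEIGHT, 300) for _ in range(500)]
average = sum(guesses) / len(guesses)

individual_error = sum(abs(g - TRUE_WEIGHT) for g in guesses) / len(guesses)
print("typical individual error: %.0f lb" % individual_error)           # roughly 240
print("error of the average:     %.0f lb" % abs(average - TRUE_WEIGHT))  # roughly 13
```

With independent errors, the error of the average shrinks roughly as 1 over the square root of the number of guesses, which is the same reason each point in a scatterplot can contribute a little to the fitted line.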

Brian Wansink on Research Design

An experiment in which people eat soup from a bottomless bowl? Classic! Or mythological: American Sisyphus. It really happened. It was done by Brian Wansink, a professor of marketing and nutritional science in the Department of Applied Economics and Management at Cornell University, and author of the superb new book Mindless Eating: Why We Eat More Than We Think (which the CBC has called “the Freakonomics of food”). The goal of the bottomless-soup-bowl experiment was to learn about what causes people to stop eating. One group got a normal bowl of tomato soup; the other group got a bowl endlessly and invisibly refilled. The group with the bottomless bowl ate two-thirds more than the group with the normal bowl. The conclusion is that the amount of food in front of us has a big effect on how much we eat.

There are many academic departments (called statistics departments) that study the question of what to do with your data after you collect it. There is not even one department anywhere that studies the question of what data to collect — which is much more important, as every scientist knows. To do my little bit to remedy this curious and unfortunate imbalance, I have decided to ask the best scientists I know about research design. My interview with Brian Wansink (below) is the first in what I hope will be a series.

SR: Tell me something you’ve learned about research design.

BW: When I was a graduate student [at the Stanford Business School], I would jog on the school track. One day on the track I met a professor who had recently gotten tenure. He had only published three articles (maybe he had 700 in the pipeline), so his getting tenure surprised me. I asked him: What’s the secret? What was so great about those three papers? His answer was two words: “Cool data.” Ever since then I’ve tried to collect cool data. Not attitude surveys, which are really common in my area. Cool data is not always the easiest data to collect but it is data that gets buzz, that people talk about.

SR: What makes data cool?

BW: It's data where people do something. Like take more M&Ms on the way out of a study. All the stuff in the press about psychology — none of it deals with attitude change. Automaticity is seldom a rating; that's why it caught on. It's how long they looked at something or how fast they walked. That's why I've been biased toward field studies. You lose some control in field studies compared to lab studies, but the loss is worth it.

The popcorn study is an example. We found that people ate more popcorn when we gave them bigger buckets. I'd originally done all that in a lab. So that's great, that's enough to get it published. But it's not enough to make people go "hey, that's cool." I found a movie theatre that would let me do it. It became expensive because we needed to buy a lot of buckets of popcorn. Once you find out it happens in real theatres, people go "cool." You can't publish it in a great journal because you can't get 300 covariates; we published it in a slightly less prestigious journal, but it had much greater impact than a little lab study would have had.

One thing we found in that study was that there was an effect of bucket size regardless of how people rated the popcorn. Even people who hated the taste ate more with the bigger bucket. We asked people what they thought of the popcorn. We took the half of the people who hated the popcorn the most — even they showed the effect. But there was range restriction — the average rating in that group was only 5.0 on a 1-9 scale — not in the “sucky” category. Then we used old popcorn. The results were really dramatic. It worked with 5-day-old popcorn. It worked with 14-day-old popcorn — that way I could say “sitting out for 2 weeks.” That study caught a lot of attention. The media found it interesting. I didn’t publish the 5-day-old popcorn study.

I'm a big believer in cool data. The design goal is: How far can we possibly push it to make a vivid point? Most academics push it just far enough to get it published. I try to push it beyond that to make it much more vivid. That's what [Stanley] Milgram did with his experiments. First, he showed obedience to authority in the lab. Then he stripped away a whole lot of things to show how extreme it was. He took away lab coats, the college campus. That's what made it so powerful.

SR: A good friend of mine, Saul Sternberg, went to graduate school with Milgram. They had a clinical psychology class together. The professor was constantly criticizing rat experiments. This was the 1950s. He said that rats were robot-like, not a good model for humans. One day Milgram and my friend brought a shoebox to class. In the box was a rat. They put the box on the seminar table and opened it, leaving the rat on the table. The rat sniffed around very cautiously. Cautious and curious, much more like a person than like a robot. It was a brilliant demonstration. My friend thinks of Milgram's obedience experiments as more like demonstrations than experiments. But you are right, they are experiments consciously altered to be like demonstrations. Those experiments were incredibly influential, of course — which supports your point.

BW: When we first did the soup bowl studies, we refilled the soup bowls so that we gave people larger and smaller portions than they thought they had. We heated the soup up for them but gave them 25% more to see if they would eat more than they thought. You could put that in an okay journal. The bottomless soup bowl would be more cool. Cool data is harder to get published and it's much more of a hassle to collect, but it creates incredible loyalty among grad students, because they think they are doing something more exciting. It's more of a military operation than if they are just collecting some little pencil-and-paper thing in the lab. It makes research more of an adventure.

Another thing: field experiments are difficult. There's a general tendency in research to be really underpowered [that is, to not have enough subjects]. Let's say you're doing the popcorn bucket study. Is the effect [of bucket size] going to come out? Rather than having too many cells and not getting significance, it's a good idea to have fewer cells — replace a manipulated variable with one or two measured variables. For example, instead of doing a two-by-two between-subjects design we might have a design in which one factor is measured rather than manipulated. If the measured factor doesn't come out you haven't lost anything; you still have all the power. With the popcorn study we knew the study would work with the big bucket [that is, we knew there would be an effect of bucket size] but we didn't know if there would be an effect of bucket size if we gave them [both good corn and] bad corn [thereby doing a two-by-two study] and only 200 people showed up [leaving only 50 people per cell]. So when we did the field study for the first time, we gave them all popcorn 5 days old. We measured their taste preference for popcorn and used it as a crossing variable. We used scores on that measure to divide the subjects into two groups.

SR: Let’s stop here. This is great stuff.
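A toy simulation makes Wansink's power argument concrete. All the numbers here (effect size, noise, subject counts) are invented for illustration: with 200 subjects, a two-by-two design leaves 50 per cell, while manipulating only bucket size and merely measuring the second factor keeps 100 per cell for the comparison that matters most.

```python
# Toy power simulation for Wansink's point: fewer cells means more
# subjects per cell and a better chance of detecting the bucket-size
# effect. Effect size and noise are invented numbers.
import random
import statistics

def detects_effect(n_per_group, effect=5.0, noise=20.0):
    """Simulate popcorn eaten (arbitrary units) for small vs. big buckets;
    return True if a crude large-sample t-test finds the difference."""
    small = [random.gauss(50.0, noise) for _ in range(n_per_group)]
    big = [random.gauss(50.0 + effect, noise) for _ in range(n_per_group)]
    se = (statistics.pvariance(small) / n_per_group +
          statistics.pvariance(big) / n_per_group) ** 0.5
    t = (statistics.mean(big) - statistics.mean(small)) / se
    return abs(t) > 1.96  # roughly p < .05, two-sided

random.seed(2)
for n in (50, 100):  # 2x2 design vs. one manipulated factor, 200 subjects total
    power = sum(detects_effect(n) for _ in range(2000)) / 2000
    print("n per cell = %3d -> power ~ %.2f" % (n, power))
```

The run should show detection rates climbing substantially when the per-cell count doubles, which is exactly the trade Wansink describes: measure one factor instead of manipulating it, and the factor you care most about keeps all the subjects.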


Ranjit Chandra Update

If you have been following the strange case of Dr. Ranjit Chandra, you may be interested to know:

1. He has sued the CBC (Canadian Broadcasting Corporation) because of a documentary they ran last year titled “The Secret Life of Dr. Chandra”. A lawyer for the CBC told me last week the lawsuit is at a very early stage.

2. A paper about Dr. Chandra's research by Saul Sternberg and me has been accepted by Nutrition Journal, an open-access journal. It is our third and final paper on the subject.

The Wikipedia entry for Chandra has a good summary of the story so far.