Seth Brown, a “data scientist” with a Ph.D. in computational genomics, has done several experiments about the best way to make coffee. In one, he compared other people’s burr grinders to his blade grinder. There was no clear difference in taste. In another, an Aeropress apparently produced better-tasting coffee than drip extraction. He hasn’t found other factors that matter. If I drank coffee, I’d be happy to know these things.
If I were teaching how to do experiments, his work would be a good case study. I’d have my students read it and suggest improvements. The contrast between his data analysis (sophisticated) and experimental design (unsophisticated) is striking, maybe because he has no background in experimentation.
Here’s what I would have done differently:
1. Study my reactions, not the reactions of guests. He had house guests rate the coffee he made. Yet he brews coffee for himself much more often than for others — at least, he gives that impression. Since his main customer is himself, it wasn’t clear why other people’s opinions are more important than his opinion. Maybe he read somewhere that blinding is good and thought it would be easier to achieve if other people did the ratings. He could have rated coffee he made himself blinded. Put stickers on the bottom of identical cups, shuffle the cups. However, since he will usually make coffee unblinded (he will know how he made it), it isn’t clear that blinding is good.
2. No “control” experiments. In a “control” experiment, he asked guests which of two identically-made cups of coffee was better. He doesn’t say what he learned from this — apparently nothing.
3. Simultaneous presentation. He gave guests two cups of coffee made differently and asked which they preferred. Apparently he gave them one cup at a time. Simultaneous presentation, allowing them to go back and forth, would have allowed much better discrimination. Maybe the two types of grinder differed but his experiment was too noisy to detect this.
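The sticker-and-shuffle procedure from point 1 is easy to do by hand, but here is a minimal sketch of it for anyone who wants the shuffle done for them. The function name and the cup labels are my own invention, not anything from Brown's write-up.

```python
import random

def blind_cups(methods, seed=None):
    """Shuffle cups so the taster cannot tell which method made which cup.

    Returns (labels, key). `labels` is the serving order as anonymous
    letters; `key` maps each letter to the brewing method and plays the
    role of the sticker on the bottom of the cup.
    """
    rng = random.Random(seed)
    shuffled = list(methods)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    key = dict(zip(labels, shuffled))
    return labels, key

# Hypothetical usage: taste cups A and B, write down a preference,
# and only then look at `key`.
labels, key = blind_cups(["blade grinder", "burr grinder"])
```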
In a footnote he wrote:
Ideally, I would have liked to use better control conditions [he appears to realize that there was something wrong with his control experiment — SR], larger sample sizes, more thorough subject randomization [I have no idea what this means; his designs are within-subject. In within-subject experiments, subjects are not randomized — SR], and a more consistent testing environment.
All of these changes would have made his experiments more difficult. Maybe he has internalized the rule “harder is better.”
The beginning of wisdom about science is roughly the opposite: do the simplest easiest thing that will tell you something. We always know less than we think, so make as few assumptions and as little investment as possible. The easier your experiment, the less you will lose if you make a wrong assumption. The smaller your sample size, the more resources (time, money, subjects, energy) you will have left over for other experiments. Bunsen’s experiments would have been easier if he had studied himself. By studying others, he made an untested assumption that they resembled him.
I’ve done dozens of tea experiments in which I compared tea brewed two different ways. The main things I’ve learned, besides best brew times and best amounts of tea to use, are: 1. Rinse tea before brewing. It eliminates a kind of dirty taste. 2. Combine chocolate tea and black tea. The combination is better than either alone. 3. A little bit of salt helps.
It’s a pity that he has not tried to rate the response of his subjects, or himself for that matter, for each coffee brewed. For example, in the case of Grinders we are not told by how much one was preferred to the other. Perhaps some people really really like coffee from one grinder over another, while others barely notice the difference.
“A little bit of salt helps.” Also with coffee – hat tip to my father (who also recommended pepper on strawberries – that works too).
Anyway: experimentation. I’m disinclined to make much of data-hammering done by someone who didn’t have the sense to attend to #s 1, 2 & 3.
Seth: You raise a good question: why don’t I try pepper in my tea? And if it doesn’t help, why not? I found #2 (meaningless control experiment) especially surprising in someone who works with data for a living and has a Ph.D. related to science. Yet I was impressed that he computed a Bayes factor for his evidence.
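For readers who haven't met a Bayes factor, here is a minimal sketch for A-versus-B preference counts. It is not necessarily the calculation Brown did. It assumes a uniform prior on the preference probability under the alternative and a point null of 0.5 (no preference), and the counts in the usage line are made up.

```python
from math import comb

def bayes_factor_preference(k, n):
    """Bayes factor (alternative over null) for k of n tasters preferring A.

    Alternative: preference probability p has a uniform prior on [0, 1].
    Null: p = 0.5 exactly.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    # Under a uniform prior, the marginal probability of any k is 1 / (n + 1).
    marginal_alternative = 1.0 / (n + 1)
    # Under the null, k follows a Binomial(n, 0.5) distribution.
    likelihood_null = comb(n, k) * 0.5 ** n
    return marginal_alternative / likelihood_null

# Hypothetical counts: 8 of 10 tasters prefer one cup.
bf = bayes_factor_preference(8, 10)   # 1024 / 495, about 2.07
```

A value near 1 means the data barely distinguish the two hypotheses; with samples this small, even a lopsided split gives only weak evidence.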
I never did rigorous experiments with coffee making, but I paid close attention to the factors which I had read made for better quality. Achieving particular effects repeatably is hard, because there are so many factors that interact.
I roast my own coffee, use a decent burr grinder, and have brewed using the Aeropress, French press, vacuum brewer, and the old faithful pour-over cone. Even with these investments into quality and repeatability, the results can vary noticeably from brew to brew.
What I have been able to determine to my own satisfaction is the primary importance of these factors, in order:
* age of roast. Younger than 2 days is inferior, older than 10 days becomes stale.
* proper water temperature. Over 205 F and you start to get bitter; under 195 and you get sour.
* quality of the beans. The green beans from sweetmarias.com, for instance, are substantially better than those my father purchases from some cheaper website.
* burr grinder. When using freshly roasted coffee of high quality, I found an immense difference between the coffee produced from a cheap grinder (whether whirly-blade or a mill) and an expensive burr grinder.
The major flaw seems to be in not controlling for other variables that could be dominating the effects of the variables he is measuring.
In particular, he mentions nothing about water temperature. If he’s using boiling water, he’ll produce acidic coffee regardless of grind size and method.
Another mistake he seems to be making, at least in the write-up, is to treat the factors as independent. A burr grinder is recommended because it produces grinds of a roughly consistent size rather than a wide distribution of sizes.
If you use a whirling grinder and a pour over filter then the coffee tends to form into a sludge and the water remains in contact with the coffee for much longer than if you’d used a burr grinder. This is a large effect and easily measured with stopwatches.
The aeropress works by forcing water through the coffee in a fixed time period using mechanical pressure, so the effect of grind consistency is much less than it would be for almost any other method.
I think it’s a good idea to test the flavor produced by varying different elements of the coffee making process, but the process is a delicate one with many factors. Ignoring that will just produce results that don’t generalize.
It would be far better to start by learning how to make an *excellent* cup of coffee, and then testing variables from there, to see which elements are unnecessary.
I was puzzled by his use of statistical analysis. In a personal experiment like this, I think a simple bar graph works.
Which brings up a related issue which I have been thinking about. How do you represent a number of steps of a tasting experiment? For example, with side-by-side tastings of tea, there are multiple variables: the tea, the amount, temperature, washed vs. unwashed.
Start by fixing the type of tea, the water temperature, and the amount of tea. Compare washed vs. unwashed and washed wins, which is easy to visualize because it is a binary decision.
The next step compares amounts. Suppose the larger amount tastes better; is that amount optimal? Answering that requires tasting amounts both smaller and larger than the first winner, and the step repeats until an optimal amount has been found for that particular tea. The next step then compares water temperatures, which also takes multiple tastings.
Have you seen a good visualization of this type of experimentation?
Seth, I was surprised that you didn’t mention the scoring. If I am reading it right he took what could have been a rich/sensitive recording (how much do you like this 0-100) and reduced it to a binary (which do you prefer A/B).
Any reason why you didn’t mention that? Am I missing something?
Seth: No you’re right. A better measure of preference would have helped. If you are comparing two cups of coffee, I’m not sure you want to ask separately for each cup “how much do you like this 0-100?” I think you want to ask about degrees of preference: slight, somewhat, etc., and assign numbers to them (e.g., slight = 10, somewhat = 20, etc.). I’ve never seen that done but it’s a good idea.
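A minimal sketch of that degree-of-preference scoring, for concreteness. Only slight = 10 and somewhat = 20 come from the comment above; the other scale points, the function names, and the sample responses are assumptions for illustration.

```python
from statistics import mean

# Hypothetical scale: "none" and "strong" are added for illustration.
DEGREES = {"none": 0, "slight": 10, "somewhat": 20, "strong": 30}

def signed_score(preferred, degree):
    """Positive score if cup A was preferred, negative if cup B."""
    if preferred not in ("A", "B"):
        raise ValueError("preferred must be 'A' or 'B'")
    value = DEGREES[degree]
    return value if preferred == "A" else -value

def mean_preference(responses):
    """Average signed score over (preferred, degree) pairs."""
    return mean(signed_score(p, d) for p, d in responses)

# Made-up responses from three tastings.
responses = [("A", "slight"), ("A", "somewhat"), ("B", "slight")]
average = mean_preference(responses)   # (10 + 20 - 10) / 3
```

The sign keeps the direction of the preference and the size keeps its strength, so one strong preference is not cancelled by one slight preference the other way.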
Kirk, I think that conjoint analysis addresses some of the issues you raise with regard to multiple variables that all influence decision-making.
https://en.wikipedia.org/wiki/Conjoint_analysis_%28marketing%29
@Brian I made a similar remark, but I see my comment must have gotten lost in the system.
Seth: Comments must be approved the first time you comment. So the first time you comment what you write will not immediately appear.
You mention a bunch of different factors that might be studied (= varied). There are dozens more, such as how much salt to add, how much sweetener (and which sweetener), hardness of water, on and on. Although you speak of “steps” there is no obvious order to how you should test each of the factors.
No I haven’t seen any visualization. It’s more basic experimentation than a type of experimentation. Fisher, the statistician, used tea comparison experiments to illustrate basic principles. “A woman says she can tell whether the cream was added before or after the tea…”
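To make the Fisher example concrete, here is a small sketch of the arithmetic. It assumes the design as it is usually told: eight cups, four prepared each way, and the taster knows there are four of each. The function name is mine.

```python
from math import comb

def chance_of_k_correct(n_cups, n_first, k):
    """Probability that a pure guesser picks exactly k of the n_first
    'added first' cups when asked to pick n_first cups out of n_cups.
    This is the hypergeometric probability."""
    ways_right = comb(n_first, k)                      # correct picks
    ways_wrong = comb(n_cups - n_first, n_first - k)   # incorrect picks
    return ways_right * ways_wrong / comb(n_cups, n_first)

# Chance of getting all eight cups right by guessing: 1 in 70.
p_all_correct = chance_of_k_correct(8, 4, 4)
```

So a perfect score with eight cups would be unlikely under guessing (about 0.014), while three out of four right (16 in 70) would not be.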
“It would be far better to start by learning how to make an *excellent* cup of coffee, and then testing variables from there, to see which elements are unnecessary.”
That’s a very good point. That’s definitely what I do if I find something. First get the effect over and over, then gradually see what is essential. For example, suppose Food X improves brain test scores. First show that beyond doubt, then slowly vary Food X: 1. try different brands, 2. try closely related types of Food X, and so on.
I’ve been drinking and comparing teas for many, many years. When I order samples of tea, I set up an NCAA-basketball-bracket for the initial taste-offs. The results usually are obvious between tea A and B, C and D, and so on. Sometimes there’s little difference so I carry both teas to the next round. Eventually I end up at the final round where I do the final taste-off. There isn’t always a clear winner. At any rate, along the way I’ve been making notes about the losers, such as ‘harsh edge’ or ‘good but very average’. These help me when I record the results in a spreadsheet. For tea, I use a ranking from 1 to 5 (where 5 is best). I list 60 entries in my spreadsheet now. (I purchased them from the same company over many years). 4 teas sit at the 5 ranking.
For recipes I use a 1 to 10 ranking. Anything 5 or below goes in the trashcan. A 6 is edible but will never be cooked again. A 7 could be eaten once a month, an 8 several times a month, a 9 once a week, and 10s are too good. (Think creme brulee).
Most people probably don’t want to see the step-by-step analysis of a taste-off, especially when a particular case is being refined, as in the example of a particular tea: how much, what temp, how long, etc. Most people want a recipe. They curate their recipe experts by choosing a particular cookbook.
However, in this case (the blog posting on how to make coffee) lots of people were interested because many people have run their own coffee experiments and found other variables to be influential.
To answer my own question, I speculate that a story-line illustrated with simple bar graphs at each step would help others run similar taste-off experiments. Or, more likely, they’ll say, “Hey, you missed THESE critical variables.”
Kirk, after your earlier comment about tea buying and comparing, I bought about 20 samples from Upton. It didn’t work: I didn’t like any of them. Part of the problem might have been that I couldn’t figure out the best way to brew them. The samples weren’t enough tea to test the possibilities, and I disagreed with the suggested brewing instructions.
I agree with your third suggestion: tasting side by side, back and forth, is much more sensitive.
However, on your first and second points, I disagree:
1- Maybe he was not trying to establish which coffee he likes better, but how to make the best coffee.
2- He seems to be trying to figure out the noise in the rating. An example might help clarify what I mean. I measure my blood pressure every morning. The daily variations can be large (10, 20, even 30 mmHg). That seems to mean my blood pressure varies wildly from one day to the next. But no! If I measure my blood pressure several times the same morning (about 2 minutes apart, keeping conditions stable) I get the same kind of variations. Thus, the daily variations are just noise. Without some control measurements (in my case, several measures the same morning), I would not have reached the right conclusion.
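The comparison in that blood pressure example can be sketched as follows. The readings are invented, the function name is hypothetical, and comparing standard deviations this way is just one simple way to make the point.

```python
from statistics import mean, stdev

def noise_check(readings_by_day):
    """Compare same-morning spread with day-to-day spread.

    `readings_by_day` is a list of lists: repeated readings taken a few
    minutes apart on each morning (at least two per morning).
    Returns (within_morning_sd, day_to_day_sd).
    """
    # Average spread among repeated readings on the same morning.
    within = mean([stdev(day) for day in readings_by_day])
    # Spread of the single reading you would normally record each day.
    day_to_day = stdev([day[0] for day in readings_by_day])
    return within, day_to_day

# Made-up systolic readings (mmHg), two per morning over three days.
within, day_to_day = noise_check([[120, 130], [125, 135], [130, 120]])
```

If the same-morning spread is as large as the day-to-day spread, the daily differences tell you little beyond measurement noise, which is the commenter's conclusion.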
He wasn’t using ratings. He was just asking people which of two coffees they like better. I still fail to see what he learned from having people compare two identical coffees. He might have learned from that comparison that people tend to prefer the first presented to the second presented, but he didn’t say anything like that. He said nothing about the possibility of order effects.
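Here is a minimal sketch of how the identical-cups comparison could have been used to check for an order effect: count how often the first-presented cup was chosen and ask how surprising that count is if order doesn't matter. The counts are hypothetical; Brown's write-up reports nothing of this kind.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: the total probability of all
    outcomes that are no more likely than the observed count k."""
    probs = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so ties are not lost to floating-point rounding.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical: 9 of 10 guests preferred whichever identical cup came first.
p_value = binomial_two_sided_p(9, 10)   # 22 / 1024, about 0.021
```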
I’m surprised he didn’t test cold brew coffee. The bitterness in coffee is from larger alkaloid molecules which diffuse less at lower temperatures. The result is a sweeter coffee.
“cold brew coffee”. Thank you for mentioning that. I recently discovered that cold brew tea is a lot better than regular brew, just like you say, much less bitter. I discovered this in Berkeley but I’m now in Beijing, where I forgot it until you mentioned this.