Researchers Fool Themselves: Water and Cognition

A recent paper about the effect of water on cognition illustrates a common way that researchers overstate the strength of the evidence, apparently fooling themselves. Psychology researchers at the University of East London and the University of Westminster did an experiment in which subjects didn’t drink or eat anything starting at 9 pm and came to the testing room the next morning. Each subject came in twice, a week apart. On both visits everyone was given something to eat; on one visit they were also given water to drink, on the other they weren’t. Half of the subjects got water on the first visit, half on the second. After that, the researchers gave subjects a battery of cognitive tests.

One result makes sense: subjects were faster on a simple reaction time test (press button when you see a light) after being given water, but only if they were thirsty. Apparently thirst slows people down. Maybe it’s distracting.

The other result emphasized by the authors doesn’t make sense: Water made subjects worse at a task called Intra-Extra Dimensional Set Shift. The task provided two measures (total trials and total errors), but the paper gives results only for total trials. The omission is not explained. (I asked the first author about this by email; she did not explain the omission.) On total trials, subjects given water did worse, p = 0.03. A surprising result: after people go without water for quite a while, giving them water makes them worse.

This p value is not corrected for the number of tests done. A table of results shows that 14 different measures were used. There was a main effect of water on two of them. One was the simple reaction time result; the other was the IED Stages Completed (IED = intra/extra dimensional) result. It is likely that the effect of water on simple reaction time was a “true positive” because the effect was influenced by thirst. In contrast, the IED Stages Completed effect wasn’t reliably influenced by thirst. Putting the simple reaction time result aside, there are 13 p values for the main effect of water; one is weakly reliable (p = 0.03). If you do 20 independent tests, at least one is likely to have p < 0.05 purely by chance, even when there are no true effects. Taken together, there is no good reason to believe that water had main effects aside from the simple reaction time test. The paper would be a good question for an elementary statistics class (“Question: If 13 tests are independent, and there are no true effects present, how likely is it that at least one will have p = 0.03 or better by chance? Answer: 1 – (0.97^13) = 0.33”).
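In R, that arithmetic, plus what a standard correction would do to the reported p value, looks like this (a back-of-the-envelope check, not a reanalysis of the paper’s data):

    1 - 0.97^13    # chance that at least one of 13 independent tests reaches
                   # p <= 0.03 when there are no true effects: about 0.33

    p.adjust(0.03, method = "bonferroni", n = 13)    # Bonferroni-corrected p value: 0.39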

I wrote to the first author (Caroline Edmonds) about this several days ago. My email asked two questions. She replied but failed to answer the question about number of tests. Her answer was written in haste; maybe she will address this question later.

A better analysis would have started by assuming that the 14 measures are unlikely to be independent. It would have done (or used) a factor analysis that condensed the 14 measures into (say) three factors. Then the researchers could ask if water affected each of the three factors. Far fewer tests, far more independent tests, far harder to fool yourself or cherry-pick.
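In R, that analysis might look like the sketch below. Everything here is hypothetical: I am assuming a data frame d with one row per subject per session, a logical column water, a column subject, the 14 cognitive measures named in a character vector measures, and every subject tested in both sessions. (The paper’s data are not available to me.)

    # Condense the 14 correlated measures into (say) 3 factors.
    fa <- factanal(d[, measures], factors = 3, scores = "regression")

    # For each factor, compare water vs. no water within subjects:
    # three tests instead of fourteen.
    for (f in 1:3) {
      sc <- fa$scores[, f]
      with.water    <- tapply(sc[d$water],  d$subject[d$water],  mean)
      without.water <- tapply(sc[!d$water], d$subject[!d$water], mean)
      print(t.test(with.water, without.water, paired = TRUE))
    }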

The problem here — many tests, failure to correct for this or do an analysis with far fewer tests — is common but the analysis I suggest is, in experimental psychology papers, very rare. (I’ve never seen it.) Factor analysis is taught as part of survey psychology (psychology research that uses surveys, such as personality research), not as part of experimental psychology. In the statistics textbooks I’ve seen, the problem of too many tests and correction for/reduction of number of tests isn’t emphasized. Perhaps it is a research methodology example of Gresham’s Law: methods that make it easier to find what you want (differences with p < 0.05) drive out better methods.

Thanks to Allan Jackson.

Assorted Links

Thanks to Bryan Castañeda.

The Growth of Personal Science: Implications For Statistics

I have just submitted a paper to Statistical Science called “The Growth of Personal Science: Implications For Statistics”. The core of the paper is examples, mostly my work (on flaxseed oil, butter, standing, and so on). There is also a section on the broad lessons of the examples — what can be learned from them in addition to the subject-matter conclusions (e.g., butter makes me faster at arithmetic). The paper grew out of a talk I gave at the Joint Statistical Meetings a few years ago, as part of a session organized by Hadley Wickham, a professor of statistics at Rice University.

I call this stuff personal science (science done to help yourself), a new term, rather than self-experimentation, the old term, partly because a large amount of self-experimentation — until recently, almost all of it — is not personal science but professional science (science done as part of a job). Now and then, professional scientists or doctors or dentists have done their job using themselves as a subject. For example, a dentist tests a new type of anesthetic on himself. That’s self-experimentation but not personal science. Moreover, plenty of personal science is not self-experimentation. An example is a mother reading the scientific literature to decide if her son should get a tonsillectomy. It is personal science, not professional self-experimentation, whose importance has been underestimated.

An old term for personal science might be amateur science. In almost all areas of human endeavor, amateur work doesn’t matter. Cars are invented, designed and built entirely by professionals. Household products are invented, designed and built entirely by professionals. The food I eat comes entirely from professionals. And so on. Adam Smith glorified this (“division of labor” — a better name is division of expertise). There are, however, two exceptions: books and science. I read a substantial number of books not by professional writers and my own personal science has had a huge effect on my life. As a culture, we understand the importance of non-professional book writers. We have yet to grasp the importance of personal scientists.

Professional science is a big enterprise. Billions of dollars in research grants, hundreds of billions of dollars of infrastructure and equipment and libraries, perhaps a few hundred thousand people with full-time jobs, working year after year for hundreds of years. Presumably they are working hard, have been working hard, to expand what we know on countless topics, including sleep, weight control, nutrition, the immune system, and so on. Given all this, the fact that one person (me) could make ten or so discoveries that make a difference (in my life) is astonishing — or, at least, hard to explain. How could an amateur (me — my personal science, e.g., about sleep is outside my professional area of expertise) possibly find something that professional scientists, with their vastly greater resources and knowledge and experience, have missed? One discovery — maybe I was lucky. Two discoveries — maybe I was very very very lucky. Three or more discoveries — how can this possibly be?

Professional scientists have several advantages over personal scientists (funding, knowledge, infrastructure, etc.). On the other hand, personal scientists have several advantages over professional scientists. They have more freedom. A personal scientist can seriously study “crazy” ideas. A professional scientist cannot. Personal scientists also have a laser-sharp focus: They care only about self-improvement. Professional scientists no doubt want to make the world a better place, but they have other goals as well: getting a raise, keeping their job, earning and keeping the respect of their colleagues, winning awards, and so on. Personal scientists also have more time: They can study a problem for as long as it takes. Professional scientists, however, must produce a steady stream of papers. To spend ten years on one paper would be to kiss their career goodbye. The broad interest of my personal science is that my success suggests the advantages of personal science may in some cases outweigh the advantages of professional science. Which most people would consider impossible.

If this sounds interesting, I invite you to read my paper and comment. I am especially interested in suggestions for improvement. There is plenty of time to improve the final product — and no doubt plenty of room for improvement.

Usual Drug Trial Analyses Insensitive to Rare Improvement

In a comment on an article in The Scientist, someone tells a story with profound implications:

I participated in 1992 NCI SWOG 9005 Phase 3 [clinical trial of] Mifepristone for recurrent meningioma. The drug put my tumor in remission when it regrew post surgery. However, other more despairing patients had already been grossly weakened by multiple brain surgeries and prior standard brain radiation therapy which had failed them before they joined the trial. They were really not as young, healthy and strong as I was when I decided to volunteer for a “state of the art” drug therapy upon my first recurrence. . . . I could not get the names of the anonymous members of the Data and Safety Monitoring committee who closed the trial as “no more effective than placebo”. I had flunked the placebo the first year and my tumor did not grow for the next three years I was allowed to take the real drug. I finally managed to get FDA approval to take the drug again in Feb 2005 and my condition has remained stable ever since according to my MRIS.

Apparently the drug did not work for most participants in the trial — leading to the conclusion “no more effective than placebo” — but it did work for him.

The statistical tests used to decide if a drug works are not sensitive to this sort of thing — most patients not helped, a few patients helped. (Existing tests, such as the t test, work best with normality of both groups, treatment and placebo, whereas this outcome produces non-normality of the treatment group, which reduces test sensitivity.) It is quite possible to construct analyses that would be more sensitive to this than existing tests, but this has not been done. It is quite possible to run a study that produces for each patient a p value for the null hypothesis of no effect (a number that helps you decide if that particular patient has been helped) but this too has not been done.
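To illustrate the point, here is a made-up simulation in R (my own sketch, not data or code from any of these trials). Five percent of treated patients improve a lot, the rest not at all; the usual t test on group means often misses this, while even a crude responder count (patients above an outcome level that untreated patients almost never reach), compared across groups with Fisher’s exact test, usually catches it.

    set.seed(1)
    n <- 200                      # patients per arm
    cutoff <- 3                   # an outcome untreated patients almost never reach
    one.trial <- function() {
      placebo   <- rnorm(n)                              # no effect
      responder <- rbinom(n, 1, 0.05)                    # 5% of treated respond
      treated   <- rnorm(n) + 4 * responder              # responders improve by 4 SD
      counts <- matrix(c(sum(treated > cutoff), n - sum(treated > cutoff),
                         sum(placebo > cutoff), n - sum(placebo > cutoff)), nrow = 2)
      c(t.test = t.test(treated, placebo)$p.value,
        responder = fisher.test(counts)$p.value)
    }
    p <- replicate(2000, one.trial())
    rowMeans(p < 0.05)    # share of simulated trials in which each test "finds" the drug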

Since these new analyses would benefit drug companies, their absence is curious.

Gene Linked to Autism?

An article in the New York Times describes research that supposedly linked a rare gene mutation to autism:

Dr. Matthew W. State, a professor of genetics and child psychiatry at Yale, led a team that looked for de novo mutations [= mutations that are not in the parents] in 200 people who had been given an autism diagnosis, as well as in parents and siblings who showed no signs of the disorder. The team found that two unrelated children with autism in the study had de novo mutations in the same gene — and nothing similar in those without a diagnosis.

“That is like throwing a dart at a dart board with 21,000 spots and hitting the same one twice,” Dr. State said. “The chances that this gene is related to autism risk is something like 99.9999 percent.”

It is like throwing 200 darts at a dart board with 21,000 spots (the number of genes) and hitting the same one twice. (Each person has about 1 de novo mutation.) What are the odds of that? If all spots are equally likely to be hit, then the probability is about 0.6. More likely than not. (Dr. State seems to think it is extremely unlikely.) This is a variation on the birthday paradox. If there are 23 people in a room, it is 50/50 that two of them will share a birthday.
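Here is the arithmetic (my calculation, assuming each of the 200 de novo mutations is equally likely to land in any of the 21,000 genes):

    genes <- 21000
    mutations <- 200
    1 - prod((genes - 0:(mutations - 1)) / genes)   # chance some gene is hit twice: about 0.61

    1 - prod((365 - 0:22) / 365)                    # the classic version: 23 birthdays, about 0.51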

When Dr. State says, “The chances that this gene is related to autism risk is something like 99.9999 percent,” he is making an elementary mistake. He is treating a very low p value (maybe 0.000001) from a statistical test as the probability that the null hypothesis (no association with autism) is true. P values indicate strength of evidence, not probability of truth.

One way to look at the evidence is that there is a group of 200 people (with an autism diagnosis) among whom two have a certain mutation and another group of about 600 people (their parents and siblings) none of whom have that mutation. If two instances of the mutation were randomly distributed among 800 people what are the odds that both instances would be in any pre-defined group of 200 of the 800 people (defined, say, by the letters in their first name)? The chance of this happening is 1/16. Not strong evidence of an association between the mutation and the actual pre-defined group (autism diagnosis).
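The arithmetic behind that 1/16 (again my calculation): two mutation carriers drawn at random from 800 people, both landing in a particular group of 200:

    (200/800) * (199/799)    # about 0.062, i.e. roughly 1/16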

Another study published at the same time found a link between autism and a mutation in the same gene identified by Dr. State’s group, but again the association was weak. It may be a more subtle example of the birthday paradox: If twenty groups of genetics researchers are looking for a gene linked to autism, what are the odds that two of them will happen upon the same gene by chance?

If the gene with the de novo mutations is actually linked to autism, then we will have insight into the cause of 1% of the 200 autism cases Dr. State’s group studied. When genetics researchers try so hard and come up with so little, it increases my belief that the main causes of autism are environmental.

Thanks to Bryan Castañeda.

“Seth, How Do You Track and Analyze Your Data?”

A reader asks:

I haven’t found much on your blog commenting on tools you use to track your data. Any recommendations? Have you tried smart phones? For example, I have tried tracking fifteen variables daily via the iPhone app Moodtracker, the only one I found that can track and graph multiple variables and also give you automated reminders to submit data. There are other variants (Data Logger, Daytum) that will graph one variable (say, miles run per day), but Moodtracker is the only app I’ve found that lets you analyze multiple variables.

I use R on a laptop to track and analyze my data. I write my own functions for this — they are not built-in. This particular reader hadn’t heard of R. It is free and the most popular software among statisticians. It has lots of built-in functions (although not for data collection — apparently statisticians rarely collect data) and provides lots of control over the graphs you make, which is very important. R also has several functions for fitting loess curves, a flexible kind of curve-fitting, to your data. There is a vast amount of R-related material, including introductory stuff, here.

To give an example, after I weigh myself each morning (I have three scales), I enter the three weights into R, which stores them and makes a graph. That’s on the simple side. At the other extreme are the various mental tests I’ve written (e.g., arithmetic) to measure how well my brain is working. The programs for running the tests are written in R, and the data are stored in R and analyzed with R.
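To give a flavor of what this looks like, here is a minimal sketch of the weight example (the file name, column names and numbers are hypothetical, not my actual code):

    # weights.csv (hypothetical): one row per reading, columns "date" and "weight"
    d <- read.csv("weights.csv")

    daily  <- tapply(d$weight, d$date, mean)    # average the three scales for each day
    days   <- as.Date(names(daily))
    weight <- as.vector(daily)

    plot(days, weight, xlab = "date", ylab = "weight")
    fit <- loess(weight ~ as.numeric(days))
    lines(days, predict(fit), lwd = 2)          # loess curve through the daily means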

The analysis possibilities (e.g., the graphs you can make, your control over those graphs) I’ve seen on smart phone apps are hopelessly primitive for what I want to do. The people who write the analysis software seem to know almost nothing about data analysis. For example, I use a website called RankTracer to track the Amazon ranking of The Shangri-La Diet. Whoever wrote the software is so clueless the rank versus time graphs don’t even show log ranks.

I don’t know what the future holds. In academic psychology, there is near-total reliance on statistical packages (e.g., SPSS) that are so limited that perhaps they can extract only half of the information in the usual data. There are many graphs you’d like to make that are impossible to make. SPSS may not even have loess, for example. Yet I see no sign of this changing. Will personal scientists want to learn more from their data than psychology professors (and therefore be motivated to go beyond pre-packaged analyses)? I don’t know.

Causal Reasoning in Science: Don’t Dismiss Correlations

In a paper (and blog post), Andrew Gelman writes:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.”

Box, Hunter, and Hunter (1978) (a book called Statistics for Experimenters) is well-regarded by statisticians. Perhaps Box, Hunter, and Hunter, and Andrew, were/are unfamiliar with another quote (modified from Beveridge): “Everyone believes an experiment except the experimenter; no one believes a theory except the theorist.”

Box, Hunter, and Hunter were/are theorists, in the sense that they don’t do experiments (or even collect data) themselves. And their book has a massive blind spot. It contains 500 pages on how to test ideas and not one page — not one sentence — on how to come up with ideas worth testing. Which is just as important. Had they considered both goals — idea generation and idea testing — they would have written a different book. It would have said much more about graphical data analysis and simple experimental designs, and, I hope, would not have contained the flat statement (“To find out what happens …”) Andrew quotes.

“To find out what happens when you change something, it is necessary to change it.” It’s not “necessary” because belief in causality, like all belief, is graded: it can take on an infinity of values, from zero (“can’t possibly be true”) to one (“I’m completely sure”). And belief changes gradually. In my experience, significant (substantially greater than zero) belief in the statement “A changes B” usually starts with the observation of a correlation between A and B. For example, I began to believe that one-legged standing would make me sleep better after I slept unusually well one night and realized that the previous day I had stood on one leg (which I almost never do). That correlation made “one-legged standing improves sleep” more plausible, taking it from near zero to some middle value of belief (“might be true, might not be true”). Experiments in which I stood on one leg various amounts pushed my belief in the statement close to one (“sure it’s true”). In other words, my journey “to find out what happens” to my sleep when I stood on one leg began with a correlation. Not an experiment. To push belief from high (say, 0.8) to really high (say, 0.99) you do need experiments. But to push belief from low (say, 0.0001) to medium (say, 0.5), you don’t need experiments. To fail to understand how beliefs begin, as Box et al. apparently do, is to miss something really important.

Science is about increasing certainty — about learning. You can learn from any observation, as distasteful as that may be to evidence snobs. By saying that experiments are “necessary” to find out something, Box et al. said the opposite of you can learn from any observation. Among shades of gray, they drew a line and said “this side white, that side black”.

The Box et al. attitude makes a big difference in practice. It has two effects:

  1. Too-complex research designs. Just as researchers undervalue correlations, they undervalue simple experiments. They overdesign. Their experiments (or data collection efforts) cost far more and take much longer than they should. The self-experimentation I’ve learned so much from, for example, is undervalued. This is one reason I learned so much from it — because it was new.
  2. Existing evidence is undervalued, even ignored, because it doesn’t meet some standard of purity.

In my experience, both tendencies (too-complex designs, undervaluation of evidence) are very common. In the last ten years, for example, almost every proposed experiment I’ve learned about has been more complicated than I think wise.

Why did Box, Hunter, and Hunter get it so wrong? I think it gets back to the job/hobby distinction. As I said, Box et al. didn’t generate data themselves. They got it from professional researchers — mostly engineers and scientists in academia or industry. Those engineers and scientists have jobs. Their job is to do research. They need regular publications. Hypothesis testing is good for that. You do an experiment to test an idea, you publish the result. Hypothesis generation, on the other hand, is too uncertain. It’s rare. It’s like tossing a coin, hoping for heads, when the chance of heads is tiny. Ten researchers might work for ten years, tossing coins many times, and generate only one new idea. Perhaps all their work, all that coin tossing, was equally good. But only one researcher came up with the idea. Should only one researcher get credit? Should the rest get fired, for wasting ten years? You see the problem, and so do the researchers themselves. So hypothesis generation is essentially ignored by professionals because they have jobs. They don’t go to statisticians asking: How can I better generate ideas? They do ask: How can I better test ideas? So statisticians get a biased view of what matters, do biased research (ignoring idea generation), and write biased books (that don’t mention idea generation).

My self-experimentation taught me that the Box et al. view of experimentation (and of science — that it was all about hypothesis testing) was seriously incomplete. It could do so because it was like a hobby. I had no need for publications or other steady output. Over thirty years, I collected a lot of data, did a lot of fast-and-dirty experiments, noticed informative correlations (“accidental observations”) many times, and came to see the great importance of correlations in learning about causality.

The Problem with Evidence-Based Medicine

In a recent post I said that med school professors cared about process (doing things a “correct” way) rather than result (doing things in a way that produces the best possible outcomes). Feynman called this sort of thing “cargo-cult science”. The problem is that there is little reason to think the med-school profs’ “correct” way (evidence-based medicine) works better than the “wrong” way it replaced (reliance on clinical experience) and considerable reason to think it isn’t obvious which way is better.

After I wrote the previous post, I came across an example of the thinking I criticized. On bloggingheads.tv, during a conversation between Peter Lipson (a practicing doctor) and Isis The Scientist (a “physiologist at a major research university” who blogs at ScienceBlogs), Isis said this:

I had an experience a couple days ago with a clinician that was very valuable. He said to me, “In my experience this is the phenomenon that we see after this happens.” And I said, “Really? I never thought of that as a possibility but that totally fits in the scheme of my model.” On the one hand I’ve accepted his experience as evidence. On the other hand I’ve totally written it off as bullshit because there isn’t a p value attached to it.

Isis doesn’t understand that this “p value” she wants so much comes with a sensitivity filter attached. It is not neutral. To get it you do extensive calculations. The end result (the p value) is more sensitive to some treatment effects than others in the sense that some treatment effects will generate smaller (better) p values than other treatment effects of the same strength, just as our ears are more sensitive to some frequencies than others.

Our ears are most sensitive around the frequency of voices. They do a good job of detecting what we want to detect. What neither Isis nor any other evidence-based-medicine proponent knows is whether the particular filter they endorse is sensitive to the treatment effects that actually exist. It’s entirely possible and even plausible that the filter that they believe in is insensitive to actual treatment effects. They may be listening at the wrong frequency, in other words. The useful information may be at a different frequency.

The usual statistics (mean, etc.) are most sensitive to treatment effects that change each person in the population by the same amount. They are much less sensitive to treatment effects that change only a small fraction of the population. In contrast, the “clinical judgment” that Isis and other evidence-based-medicine advocates deride is highly sensitive to treatments that change only a small fraction of the population — what some call anecdotal evidence. Evidence-based medicine is presented as science replacing nonsense but in fact it is one filter replacing another.

I suspect that actual treatment effects have a power-law distribution (a few helped a lot, a large fraction helped little or not at all) and that a filter resembling “clinical judgment” does a better job with such distributions. But that remains to be seen. My point here is just that it is an empirical question which filter works best. An empirical question that hasn’t been answered.

Does Lithium Slow ALS?

In 2008, an article in Proceedings of the National Academy of Sciences (PNAS) reported that lithium had slowed the progression of amyotrophic lateral sclerosis (ALS), which is always fatal. This article describes several attempts to confirm that effect of lithium. Three studies were launched by med school professors. In addition, patients at PatientsLikeMe also organized a test.

One of Nassim Taleb’s complaints about finance professors is their use of VAR (value at risk) to measure the riskiness of investments. It’s still being taught at business schools, he says. VAR assumes that fluctuations have a certain distribution. The distributions actually assumed turned out to grossly underestimate risk. VAR has helped many finance professionals take risks they shouldn’t have taken. It would have been wise for finance professors to wonder how well VAR does in practice, thereby to judge the plausibility of the assumed distribution. This might seem obvious. Likewise, the response to the PNAS paper revealed two problems that might seem obvious:

1. Unthinking focus on placebo controls. It would have been progress to find anything that slows ALS. Anything includes placebos. Placebos vary. From the standpoint of those with ALS, it would have been better to compare lithium to nothing than to some sort of placebo. As far as I can tell from the article, no med school professor realized this. No doubt someone has said that the world can be divided into people focused on process (on doing things a certain “right” way) and those focused on results (on outcomes). It should horrify all of us that med school professors appear focused on process.

2. Use of standard statistics (e.g., mean) to measure drug effects. I have not seen the ALS studies, but if they are like all other clinical trials I’ve seen, they tested for an effect by comparing means using a parametric test (e.g., a t test). However, effects of treatment are unlikely to have normal distributions, nor are they likely to be the same for each person. The usual tests are most sensitive when each member of the treatment group improves the same amount and the underlying variation is normally distributed. If 95% of the treatment group is unaffected and 5% show improvement, for example, the usual tests wouldn’t do the best job of noticing this. If medicine A helps 5% of patients, that’s an important improvement over 0%, especially with a fatal disease. And if you take it and it doesn’t help, you stop taking it and look elsewhere. So it would be a good idea to find drugs that help only a fraction of patients, perhaps a small fraction. The usual analyses may have caused drugs that help a small fraction of patients to be judged worthless when a better analysis could have detected their value.

All the tests of lithium, including the PatientsLikeMe test, turned out negative. The PatientsLikeMe trial didn’t worry about placebo effects, so my point #1 isn’t a problem. However, my point #2 probably applies to all four trials.

Thanks to JR Minkel and Melissa Francis.