A recent paper about the effect of water on cognition illustrates a common way that researchers overstate the strength of the evidence, apparently fooling themselves. Psychology researchers at the University of East London and the University of Westminster did an experiment in which subjects didn’t drink or eat anything starting at 9 pm and came to the testing room the next morning. Each subject came in twice, in different weeks. On both visits they were given something to eat, but on only one of the two visits were they also given water to drink. Half of the subjects were given water on the first week, half on the second. On each visit, the researchers then gave subjects a battery of cognitive tests.
One result makes sense: subjects were faster on a simple reaction time test (press button when you see a light) after being given water, but only if they were thirsty. Apparently thirst slows people down. Maybe it’s distracting.
The other result emphasized by the authors doesn’t make sense: Water made subjects worse at a task called Intra-Extra Dimensional Set Shift. The task provided two measures (total trials and total errors) but the paper gives results only for total trials. The omission is not explained. (I asked the first author about this by email; she did not explain the omission.) On total trials, subjects given water did worse, p = 0.03. A surprising result: after persons go without water for quite a while, giving them water makes them worse.
This p value is not corrected for number of tests done. A table of results shows that 14 different measures were used. There was a main effect of water on two of them. One was the simple reaction time result; the other was the IED Stages Completed (IED = intra/extra dimensional) result. It is likely that the effect of water on simple reaction time was a “true positive” because the effect was influenced by thirst. In contrast, the IED Stages Completed effect wasn’t reliably influenced by thirst. Putting the simple reaction time result aside, there are 13 p values for the main effect of water; one is weakly reliable (p = 0.03). If you do 20 independent tests, at least one of them is likely to have p < 0.05 purely by chance, even when there are no true effects. Taken together, there is no good reason to believe that water had main effects aside from the simple reaction time test. The paper would be a good question for an elementary statistics class (“Question: If 13 tests are independent, and there are no true effects present, how likely is it that at least one will have p = 0.03 or better by chance? Answer: 1 – (0.97^13) = 0.33”).
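The arithmetic in that classroom question can be checked with a few lines of Python. This is only a sketch under the same assumption the question makes (independent tests, no true effects); the function names are mine, not the paper's. The Bonferroni threshold is added purely for comparison, as one standard way to correct for number of tests. It is not something the paper reports.

```python
def chance_of_at_least_one(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n_tests independent tests has p <= alpha
    when there are no true effects (hypothetical helper)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, familywise_alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false-positive rate at or
    below familywise_alpha (Bonferroni correction, added for comparison)."""
    return familywise_alpha / n_tests


if __name__ == "__main__":
    # 13 main-effect tests, best p value 0.03
    print(round(chance_of_at_least_one(13, 0.03), 2))  # 0.33
    # 20 tests at the usual 0.05 cutoff
    print(round(chance_of_at_least_one(20, 0.05), 2))  # 0.64
    # per-test cutoff needed to keep 13 tests at an overall 0.05
    print(round(bonferroni_threshold(13), 4))          # 0.0038
```

With 13 tests, a Bonferroni-corrected cutoff would be about 0.004, so p = 0.03 would fall well short of it. If the measures are correlated rather than independent, these numbers are only rough guides.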
I wrote to the first author (Caroline Edmonds) about this several days ago. My email asked two questions. She replied but failed to answer the question about number of tests. Her answer was written in haste; maybe she will address this question later.
A better analysis would have started by assuming that the 14 measures are unlikely to be independent. It would have done (or used) a factor analysis that condensed the 14 measures into (say) three factors. Then the researchers could ask if water affected each of the three factors. Far fewer tests, far more independent tests, far harder to fool yourself or cherry-pick.
The problem here (many tests, with no correction for this and no analysis that uses far fewer tests) is common, but the analysis I suggest is, in experimental psychology papers, very rare. (I’ve never seen it.) Factor analysis is taught as part of survey psychology (psychology research that uses surveys, such as personality research), not as part of experimental psychology. In the statistics textbooks I’ve seen, the problem of too many tests, and how to correct for it or reduce the number of tests, isn’t emphasized. Perhaps it is a research methodology example of Gresham’s Law: methods that make it easier to find what you want (differences with p < 0.05) drive out better methods.
Thanks to Allan Jackson.