Voodoo Correlations in Social Neuroscience

Few scientific papers arouse emotion in reviewers and editors, but this one, by my friend and collaborator Hal Pashler and his colleagues, must have: they were allowed to use “voodoo” in the title instead of “spurious.” Here is part of the abstract:

The newly emerging field of Social Neuroscience has drawn much attention in recent years, with high-profile studies frequently reporting extremely high (e.g., >.8) correlations between behavioral and self-report measures of personality or emotion and measures of brain activation obtained using fMRI. We show that these correlations often exceed what is statistically possible . . . Social-neuroscience method sections rarely contain sufficient detail to ascertain how these correlations were obtained. We surveyed authors of 54 articles that reported findings of this kind to determine the details of their analyses. More than half acknowledged using a strategy that computes separate correlations for individual voxels, and reports means of just the subset of voxels exceeding chosen thresholds. We show how this non-independent analysis grossly inflates correlations, while yielding reassuring-looking scattergrams. This analysis technique was used to obtain the vast majority of the implausibly high correlations in our survey sample.
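For readers who want to see the mechanism, here is a minimal simulation of the non-independent analysis the abstract describes. Everything in it is invented for illustration (the sample size, the voxel count, the threshold); the point is only that selecting voxels with the data and then averaging the correlations of just that subset yields a large number even when the true correlation is zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_voxels, alpha = 16, 40_000, 0.001   # invented, illustrative values

# A null world: voxel data and the behavioral score are unrelated.
brain = rng.standard_normal((n_subjects, n_voxels))
behavior = rng.standard_normal(n_subjects)

# One Pearson correlation (and p-value) per voxel.
bz = (brain - brain.mean(0)) / brain.std(0)
yz = (behavior - behavior.mean()) / behavior.std()
r = bz.T @ yz / n_subjects
t = r * np.sqrt((n_subjects - 2) / (1 - r**2))
p = 2 * stats.t.sf(np.abs(t), df=n_subjects - 2)

# Non-independent step: keep only the voxels that pass the threshold,
# then report the mean correlation of just that subset.
selected = p < alpha
print(f"{selected.sum()} voxels pass p < {alpha} by chance alone")
print(f"mean |r| over the selected voxels: {np.abs(r[selected]).mean():.2f}")
print(f"mean |r| over all voxels:          {np.abs(r).mean():.2f}")
```

With these made-up numbers, a few dozen voxels pass the threshold by chance, and their average correlation comes out above .7 even though nothing real is going on.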

The papers shown to be misleading appeared in such journals as Science and Nature.

22 thoughts on “Voodoo Correlations in Social Neuroscience”

  1. Another case of don’t believe everything you read, even (especially) if it comes from a well-respected source.

    I wonder: can anyone give an example of a popular paper that has been found to contain these spurious correlations?

  2. Andrew: I totally agree with your evaluation of Science and Nature, especially when it comes to psychological research. I like to refer to them both as “The Journal of Irreproducible Results”. In their quest for sexy science, they often publish psych research that is totally outrageous and turns out to be unfounded or irreproducible. For example, light shone behind the knees was reported to influence circadian rhythms, but more stringent tests found no effect.

    https://www.genomenewsnetwork.org/articles/08_02/bright_knees.shtml

    Because of the seriously questionable results I’ve read in these journals, I no longer hold them in the high regard that many other scientists do. I’ve also vowed never to send a paper there as a first author.

    As for the voodoo reference, it’s a colloquialism that’s well-established.

    https://en.wikipedia.org/wiki/Voodoo

    https://en.wikipedia.org/wiki/Voodoo_science

    I think the idea is that “voodoo” refers to things that only appear to work by magic, which is a ridiculous notion. Andrew, do you believe in magic? If you do, then I think you can be mocked, too. “Gyp,” however, is a derogatory term for an ethnic group, and as such is a racial slur. I think the comparison is totally unfair.

  3. I loved the paper and the “voodoo” in its title. The authors did a great job in stimulating a public debate on the issue, which is sorely needed given the sensationalism with which some of the questionable findings were publicized. I wonder if Science is willing to publish the rest of the data. Were I a journalist, I would ask them at one of their next press conferences.

  4. I love this quote by James, “Another case of don’t believe everything you read, even (especially) if it comes from a well-respected source,” when he is doing exactly that by believing an abstract from a single paper. Ridiculous!

  5. I commend Vul and colleagues for raising awareness of statistical issues in fMRI analysis. However, as several strong rebuttals have begun to emerge and the merits of individual papers are being explored more closely, it is rapidly becoming clear that the Vul critique may have incorrectly targeted a good number of papers that conducted legitimate and valid analyses. The Vul paper seems to have totally ignored the a priori hypotheses and theoretical basis of the selected regions that drove the analysis of most of the papers they criticize. Further, as was well articulated in the recent response by Lieberman and colleagues (https://www.scn.ucla.edu/pdf/LiebermanBerkmanWager(invitedreply).pdf ), the Vul critique seems to have misconstrued the way most of the analyses were actually conducted. In other words, they appear to be accusing the authors of doing something they actually did not do. This is unfortunate.

  6. Bill, I found the Lieberman et al. reply too vague to be persuasive. It claims that some papers were falsely accused (they are accused of doing X but they did not do X) but no specific examples are given. Care to give an example of a falsely-accused paper?

  7. If you read p. 3 in the paragraph that the footnote comes from, you’ll see how everyone contacted runs their regressions. “When a whole-brain regression analysis is conducted, the goal is typically to identify regions of the brain whose activity shows a reliable non-zero correlation with another individual difference variable. A likelihood estimate that this correlation was produced in the absence of any true effect (e.g. a p-value) is computed for every voxel in the brain without any selection of voxels to test. This is the only inferential step in the procedure, and standard corrections for multiple tests are implemented to avoid false positive results. Subsequently, descriptive statistics (e.g. effect sizes) are reported on a subset of voxels or clusters. The descriptive statistics reported are not an additional inferential step, so there is no “second analysis.” For any particular sample size, the t and r-values are merely re-descriptions of the p-values obtained in the one inferential step and provide no additional inferential information of their own.” Each of the authors indicated they did this rather than the two inferential steps that Vul et al. describe. Vul et al. were under the mistaken impression that people were using the personality variable to choose voxels to test and then subsequently running a new inferential test on those selected voxels. Not true. We run the test on all voxels and then report those that are reliable along with descriptive statistics. Everyone knows that all the other brain regions were also tested but did not pass the threshold for significance.

    How could they have gotten this so wrong in their paper? Vul et al. asked multiple-choice questions that did not allow researchers to explain what they were doing, and because Vul et al. never told researchers why they were asking the questions, there was no way for the researchers to guess Vul et al.’s true purpose and realize that the questions could not provide the answers Vul et al. needed. They also never followed up with any of the authors to ask whether they were properly characterizing the conducted research.
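To make the quoted description concrete, here is one way to read the single-inferential-step procedure as code. This is a sketch under my own assumptions, not a reconstruction of any particular lab’s pipeline: the function name is made up, and the Bonferroni correction stands in for whatever “standard correction” a given group actually applied.

```python
import numpy as np
from scipy import stats

def whole_brain_correlation(brain, score, alpha=0.05):
    """Correlate every voxel with an individual-difference score, correct
    for multiple tests, and return descriptive r values for the voxels
    that survive.

    brain : (n_subjects, n_voxels) array of activation measures
    score : (n_subjects,) behavioral or self-report measure
    """
    n_subjects, n_voxels = brain.shape

    # The single inferential step: one test per voxel, no pre-selection.
    bz = (brain - brain.mean(0)) / brain.std(0)
    sz = (score - score.mean()) / score.std()
    r = bz.T @ sz / n_subjects
    t = r * np.sqrt((n_subjects - 2) / (1 - r**2))
    p = 2 * stats.t.sf(np.abs(t), df=n_subjects - 2)

    # Correction for multiple tests (Bonferroni, purely as an example).
    survives = p < alpha / n_voxels

    # Descriptive statistics for the surviving voxels; no new test is run.
    return {"voxel": np.flatnonzero(survives), "r": r[survives], "p": p[survives]}
```

Whether the r values returned at the end are fair summaries of effect size is exactly what the rest of this thread argues about.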

  8. Matt, you write “there is no ‘second analysis’”. I don’t follow. You use the word “subset”. It sounds like one analysis is done on all the data and then a second analysis is done on a “subset” of the data. Why this second analysis somehow doesn’t count as “second” isn’t clear. Sure, the first and second analyses aren’t the same, but that isn’t the point. The point is that the selection of the subset increases the size of the reported correlation. Since you don’t seem to dispute that, I am not entirely clear what your point is.

    Vul et al. go on and on about inflated correlations (spuriously large, impossibly large), not about two inferential tests. And the procedure you have described does exactly that: it inflates correlations.

    “Standard corrections for multiple tests are implemented”: you make it sound so easy. Yet nobody applies more than one correction, so why you write “corrections” instead of “correction” isn’t clear. Which “standard correction” did you use?

  9. A single test is run on each voxel in the brain (not on a subset). Let’s say that’s 40,000 tests. Each of those tests is computed independently. We get a p-value for each. That is the only inferential step that occurs. Two things then follow. First, we have to report something in the journal manuscript itself. The convention throughout cognitive neuroscience (and most behavioral research) is that tests that meet a significance threshold are reported, and all other tests are assumed to have been computed but to have had p-values that did not meet the threshold (i.e. readers then know which clusters have p-values below the threshold and which have p-values above it). Second, for the tests that are reported, because their p-values were below some conventional level, you have to decide what to report to describe the data. You might report means and standard deviations, you might report t or Z statistics, or you might report r or d. In each case, these are descriptive statistics rather than an additional inferential step.

    Let’s say we have 4 classrooms and you run t-tests comparing the heights of the students in every pair of classrooms. Imagine that you find that classrooms 2 & 4 show a reliable difference from each other in height. You might then say “Of the combinations tested, there was a significant difference between classrooms 2 & 4,” and readers would assume there were no other significant differences. There’s no selection bias here, just reporting of what’s significant. Now, are you allowed to tell readers what the average heights are in classrooms 2 & 4? Because Vul et al. say you can’t, but of course you can and should. If you don’t believe me, read Vul’s other chapter with Kanwisher, where he says that if you run a whole-brain contrast comparing responses to emotional faces with responses to neutral faces, you can’t graph the results from significant clusters.

    Incidentally, “corrections” was plural because there are multiple techniques that can be used for correcting. In any particular case, a research group applies a single form of correction.
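Here is the classroom example as a quick sketch, with a Bonferroni correction standing in for whichever of the several correction techniques a given group actually uses. The heights, group sizes, and the choice of Bonferroni are all invented for illustration.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Four classrooms of student heights (cm); classroom 4 is taller by design.
heights = {1: rng.normal(165, 8, 30),
           2: rng.normal(162, 8, 30),
           3: rng.normal(166, 8, 30),
           4: rng.normal(172, 8, 30)}

pairs = list(itertools.combinations(heights, 2))     # all 6 pairwise tests
pvals = [stats.ttest_ind(heights[a], heights[b]).pvalue for a, b in pairs]

alpha = 0.05 / len(pairs)                            # Bonferroni across the 6 tests

for (a, b), p in zip(pairs, pvals):
    if p < alpha:
        # Report the comparison that survives, plus the means as descriptives;
        # the means describe the data, they are not a second test.
        print(f"classrooms {a} & {b}: p = {p:.4f}, "
              f"means = {heights[a].mean():.1f} vs {heights[b].mean():.1f} cm")
```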

  10. Gee, Matt, you’re still not denying that post-hoc selection of a subset inflates the correlations. Which, correct me if I’m wrong, was the main point of Vul et al., along with the point that this inflation was not made clear in the published papers.
    I’m still curious: What correction for multiple tests did your research group use?

  11. Actually, what I am denying is that there was any post-hoc selection of a subset of the data. Running multiple comparisons leads to inflated effect sizes, but that has nothing to do with “post-hoc selection”. And actually, the main point of their article was not that there is inflation, but rather that this inflation is so great that the results should be considered worthless and likely spurious, and also that the methods used to obtain the results are invalid and therefore the results themselves are invalid. These tests are run in order to identify regions where there are reliably non-zero correlations, and they are a perfectly valid way of doing so. Reporting descriptive statistics is entirely valid as well.

    Since we seem to be talking past each other, let’s consider one last example. Let’s say I run my 40,000 independent tests on my 40,000 voxels. You would admit that at this point there has been no “selection bias” inflating these tests, correct? You might have some large effects due to sampling fluctuations, but our Figure 1 shows that with normal fMRI sample sizes and appropriate correction for multiple comparisons, this is relatively rare (Vul’s simulation assumed 10 subjects, which is not representative of fMRI studies). Let’s further assume that I submit my paper to the journal with a 200-page table that lists the p-value (along with descriptive statistics) for every voxel in the brain. Still no selection bias inflating these tests, correct? If you sorted this table by p-value, we’d still be OK, right? Now the editor comes along and says “we can’t have a 200-page table,” so cut off everything with a p-value worse than ___ and add a note to indicate that all other voxels had p-values above that threshold. The voxels that remained would be no more inflated after this editorial decision than before; it’s just a matter of convention for displaying data. This is what we all do, and there is no “non-independence error” as Vul claims.
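The 200-page-table thought experiment is easy to run as a simulation. The numbers below are invented (20 subjects, a purely null relationship), and the data frame stands in for the table. The final two lines simply print what each side of the argument is pointing at: truncation does not change the r in any surviving row, but the surviving rows do have much larger correlations than the table as a whole.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
n_subjects, n_voxels = 20, 40_000               # invented, illustrative values

brain = rng.standard_normal((n_subjects, n_voxels))
personality = rng.standard_normal(n_subjects)   # no true relationship

# One correlation and p-value per voxel (the only inferential step).
bz = (brain - brain.mean(0)) / brain.std(0)
pz = (personality - personality.mean()) / personality.std()
r = bz.T @ pz / n_subjects
t = r * np.sqrt((n_subjects - 2) / (1 - r**2))
p = 2 * stats.t.sf(np.abs(t), df=n_subjects - 2)

# The imaginary 200-page table: one row per voxel, sorted by p-value.
table = pd.DataFrame({"voxel": np.arange(n_voxels), "r": r, "p": p}).sort_values("p")

# The editor's cut: keep only rows below some threshold.
published = table[table["p"] < 0.001]

print(published.head())                         # each row's r is unchanged by the cut
print(f"mean |r| of the published rows: {published['r'].abs().mean():.2f}")
print(f"mean |r| of the full table:     {table['r'].abs().mean():.2f}")
```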

  12. Matt, could you post again the last part of your comment? It was cut off.

    What you call “selection bias” (computing a correlation using only voxels selected by looking at the data, thereby inflating the correlation) doesn’t inflate “tests”; it inflates correlations.

    Nor are “voxels” inflated (by “voxel” I guess you mean the correlation computed for just one voxel); it is the correlation computed over many voxels that is inflated. That’s where you got into trouble: by computing a number that might be grossly inflated.

    Let’s say I select a subset of free throw attempts where Michael Jordan missed. Then I compute his free throw percentage over only those free throws. It is 0%. To report that 0% as if it means something is . . . well, call it what you want. As far as I can tell, that is basically what you did.

  13. Seth,

    “Let’s say I select a subset of free throws where Michael Jordan missed. Then I compute his free throw percentage over only those free throws. It is 0%. To report that 0% as if it means something is . . . well, call it what you want. As far as I can tell, that is basically what you did.”

    I may have missed something here, but isn’t that what Matt is claiming Vul has done? Isn’t one of the strong arguments in the Lieberman paper that Ed Vul simply hand-picked results from papers that would show the effect he wanted to show and ignored the others that didn’t? In fact, when Matt puts all the data into the analysis, there is no bias in the correlation coefficient at all.

  14. Sorry for the typos (yes, it’s the correlations that are inflated, but that’s not what the test is testing; it’s testing for reliable non-zero relationships). Your Jordan analogy doesn’t quite apply. That would assume that we are making claims about the average correlation in the brain but only reporting on a subset of voxels (and pretending they are all the voxels). We aren’t making claims about how the brain as a whole, or on average, relates to personality; rather, we are looking for which regions do correlate reliably and then providing descriptive statistics for those that do.

    To get the Jordan analogy right, the question would be “Are there certain days of the week when Jordan shoots a higher percentage than others?” We’d have him shoot 100 free throws each day of the week for, say, 10 weeks, so we’d have 1,000 data points for each of the seven days. We wouldn’t care at all what his average across all days was, just how the days compare with one another. If his averages were 30% on Mondays, 90% on Fridays, and 60% on all other days, we would say something interesting is happening on Mondays and Fridays, and report those tests and the descriptives that go along with them (e.g. 90%). Now, if we reported that Jordan shoots 90% on average because we claimed that Fridays were the only days we were looking at, we’d be in trouble, but nobody does that. Our question isn’t the average but rather when something different from average is going on.
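For what it’s worth, the day-of-the-week version of the analogy is easy to simulate. The shooting rates, the chi-square test of each day against the pooled other days, and the Bonferroni correction below are one arbitrary way to set it up, not a claim about how anyone analyzes such data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
true_rate = {"Mon": 0.30, "Fri": 0.90}              # the odd days; others shoot 60%
attempts = 1000                                      # 100 shots/day for 10 weeks
made = {d: rng.binomial(attempts, true_rate.get(d, 0.60)) for d in days}

for d in days:
    others_made = sum(made[o] for o in days if o != d)
    others_attempts = attempts * (len(days) - 1)
    # 2x2 table: this day vs. all other days, made vs. missed
    table = [[made[d], attempts - made[d]],
             [others_made, others_attempts - others_made]]
    chi2, p, dof, expected = stats.chi2_contingency(table)
    if p < 0.05 / len(days):                         # Bonferroni over the 7 tests
        # Report the day that differs, with its percentage as a descriptive.
        print(f"{d}: {100 * made[d] / attempts:.0f}% (p = {p:.2g})")
```

In a typical run, only Monday and Friday are printed, along with their percentages; nobody is tempted to call 90% Jordan’s overall average.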

  15. So, we combine a significance threshold (e.g. p-values less than .005 or .001) with an extent threshold (i.e. there have to be at least 10 contiguous voxels that all have p-values less than the significance threshold). This has been a standard procedure throughout cognitive neuroscience for the past 15 years.
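For anyone unfamiliar with the jargon, here is a bare-bones version of a combined significance-plus-extent threshold. It assumes a 3-D map of per-voxel p-values; the grid size, the random p-values, and the face-connectivity definition of “contiguous” are placeholders, not a description of any particular lab’s pipeline.

```python
import numpy as np
from scipy import ndimage

def height_and_extent_threshold(p_map, p_thresh=0.005, min_cluster=10):
    """Keep only clusters of at least `min_cluster` contiguous voxels whose
    per-voxel p-values are all below `p_thresh`. Returns a boolean mask."""
    supra = p_map < p_thresh                    # height threshold
    labels, n_clusters = ndimage.label(supra)   # label contiguous clusters
    surviving = np.zeros_like(supra)
    for k in range(1, n_clusters + 1):
        cluster = labels == k
        if cluster.sum() >= min_cluster:        # extent threshold
            surviving |= cluster
    return surviving

# With purely random p-values, essentially nothing survives the combined
# threshold, which is the rationale for using it.
rng = np.random.default_rng(4)
p_map = rng.uniform(size=(30, 30, 30))
print(height_and_extent_threshold(p_map).sum(), "voxels survive")
```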
