Lack of Repeatability of Cancer Research: The Mystery

In a recent editorial in Nature (gated), the research head of a drug company complained that scientists working for him could not reproduce almost all of the “landmark” findings in cancer research that they tried to repeat. They wanted to use these findings as a basis for new drugs. An article in Reuters summarized it like this:

During a decade as head of global cancer research at Amgen, C. Glenn Begley identified 53 “landmark” publications — papers in top journals, from reputable labs — for his team to reproduce. Begley sought to double-check the findings before trying to build on them for drug development. Result: 47 of the 53 could not be replicated.

Yet these findings were cited, on average, about 200 times. The editorial goes on to make reasonable suggestions for improvement based on differences between the findings that could be repeated and those that could not. The Reuters article goes on to describe other examples of lack of reproducibility and includes a story about why this is happening:

Part way through his project to reproduce promising studies, Begley met for breakfast at a cancer conference with the lead scientist of one of the problematic studies. “We went through the paper line by line, figure by figure,” said Begley. “I explained that we re-did their experiment 50 times and never got their result. He said they’d done it six times and got this result once, but put it in the paper because it made the best story.”

Okay, cancer research is less trustworthy than someone just barely outside it (Begley) ever guessed. Apparently careerism is one reason why. What is unexplained in both the Nature editorial and the Reuters summary is how research can ever succeed if results aren’t reproducible. Science has been compared to a game of Twenty Questions. Suppose you play Twenty Questions and 25% of the answers are wrong. It’s hopeless. In experimental research, you generally build on previous experimental results. The editorial points out that the non-reproducible results had been cited 200 times, but what about how often they had been reproduced in other labs? The editorial says nothing about this.

I can think of several possibilities: (a) Current lab research is based on experimental findings of thirty years ago, when (for unknown reasons) careerism was less of a problem. Standards were higher, there was less pressure to publish, whatever. (b) There is a silent, invisible “survival of the reproducible”: Findings that can be reproduced live on because people do lab work based on them. The other findings are cited but are not the basis of new work. (c) There is lots of redundancy — different people approach the same question in different ways. Although each individual answer is not very trustworthy, their average is considerably more trustworthy.

Leaving aside the mystery (how can science make any progress if so many results are not reproducible?), the lack of reproducibility interests me because it suggests that the pressure to publish faced by professional scientists has serious (bad) consequences. In contrast, personal scientists are under zero pressure to publish.

Thanks to Bryan Castañeda.

“Seth, How Do You Track and Analyze Your Data?”

A reader asks:

I haven’t found much on your blog commenting on tools you use to track your data. Any recommendations? Have you tried smart phones? For example, I have tried tracking fifteen variables daily via the iPhone app Moodtracker, the only one I found that can track and graph multiple variables and also give you automated reminders to submit data. There are other variants (Data Logger, Daytum) that will graph one variable (say, miles run per day), but Moodtracker is the only app I’ve found that lets you analyze multiple variables.

I use R on a laptop to track and analyze my data. I write functions for doing this — they are not built-in. This particular reader hadn’t heard of R. It is free and the most popular software among statisticians. It has lots of built-in functions (although not for data collection — apparently statisticians rarely collect data) and provides lots of control over the graphs you make, which is very important. R also has several functions for fitting loess curves to your data. Loess is a kind of smoothing (local curve-fitting). There is a vast amount of R-related material, including introductory stuff, here.
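
To give a feel for what a loess fit looks like in R, here is a minimal sketch — not my actual code; the file name and column names are made up for illustration:

```r
# Fit a loess curve to a time series and plot it.
# "weight.csv" and its columns (day, weight) are hypothetical.
d <- read.csv("weight.csv")
d <- d[order(d$day), ]                # make sure days are in increasing order
fit <- loess(weight ~ day, data = d)  # span (default 0.75) controls smoothness

plot(d$day, d$weight, xlab = "Day", ylab = "Weight")
lines(d$day, predict(fit), lwd = 2)   # overlay the fitted loess curve
```

Two lines to fit, two lines to plot — and full control over everything the graph shows.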

To give an example, after I weigh myself each morning (I have three scales), I enter the three weights into R, which stores them and makes a graph. That’s on the simple side. At the other extreme are the various mental tests I’ve written (e.g., arithmetic) to measure how well my brain is working. The programs for doing the tests are written in R, and the data are stored and analyzed in R.
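
For concreteness, here is the flavor of such a function — a simplified sketch, not the one I actually use; the file name, column names, and numbers are just for illustration:

```r
# Append this morning's three weights to a csv file, then replot the series.
log.weights <- function(w1, w2, w3, file = "weights.csv") {
  row <- data.frame(date = as.character(Sys.Date()), w1 = w1, w2 = w2, w3 = w3)
  if (!file.exists(file)) {
    write.table(row, file, sep = ",", row.names = FALSE, col.names = TRUE)
  } else {
    write.table(row, file, sep = ",", row.names = FALSE, col.names = FALSE,
                append = TRUE)
  }
  d <- read.csv(file)
  d$mean <- rowMeans(d[, c("w1", "w2", "w3")])
  plot(as.Date(d$date), d$mean, type = "b",
       xlab = "Date", ylab = "Mean of three scales")
}

# Each morning, something like:
# log.weights(62.4, 62.6, 62.5)
```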

The analysis possibilities (e.g., the graphs you can make, your control over those graphs) I’ve seen on smart phone apps are hopelessly primitive for what I want to do. The people who write the analysis software seem to know almost nothing about data analysis. For example, I use a website called RankTracer to track the Amazon ranking of The Shangri-La Diet. Whoever wrote the software is so clueless that the rank-versus-time graphs don’t even show log ranks.
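
To see why log ranks matter: Amazon ranks span several orders of magnitude, so on a linear scale most of the movement is invisible. In R the fix is one argument (the data below are made up):

```r
# Toy example: plot a rank series on a log scale.
hours <- 1:48
rank  <- round(5000 * exp(rnorm(48, mean = 0, sd = 0.5)))  # fake hourly ranks

plot(hours, rank, log = "y", type = "l",
     xlab = "Hour", ylab = "Amazon rank (log scale)")
```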

I don’t know what the future holds. In academic psychology, there is near-total reliance on statistical packages (e.g., SPSS) that are so limited they can perhaps extract only half of the information in typical data. There are many graphs you would like to make that they cannot make. SPSS may not even have loess, for example. Yet I see no sign of this changing. Will personal scientists want to learn more from their data than psychology professors (and therefore be motivated to go beyond pre-packaged analyses)? I don’t know.

Assorted Links

Thanks to David Cramer, Jahed Momand and Nancy Evans.

What is a Healthy Scientific Ecosystem?

An area of science is an ecosystem in the sense that research builds on other research. In an ordinary ecosystem the animals and plants need each other. Different organisms add different things. Their contributions fit together. In a healthy scientific ecosystem, different types of research add different things and fit together.

Personal science (science done to help yourself) differs greatly from professional science (science done as a job). The big differences help personal science and professional science benefit from each other. They are likely to benefit each other because they have interlocking strengths and weaknesses. Personal science is fast (experiments can be started quickly), has great endurance (experiments can last years), is cheap, and is intensely focused on benefit. Professional science has none of these features, but it has other features that personal science lacks: it is “wealthy” (allowing expensive equipment and tests), peer-reviewed, and not intensely focused on benefit, which allows studies without obvious value. These differences suggest that a system that contains both kinds of science is going to function better than a system with only one kind. Peer review, for example, is a helpful filter but may also suppress the diversity of ideas that are tested. This implies that not all science should be peer-reviewed.

The relation between personal and professional science somewhat resembles the relation between animals (= personal science) and plants (= professional science). Animals and plants are very different, as are personal and professional science. Animals move faster than plants; personal science moves faster than professional science. Animals range more widely than plants. Likewise, a personal scientist can test a much wider range of treatments than a professional scientist. If you want to sleep better, for example, you can try almost anything. Professional scientists cannot try almost anything. For example, they cannot test treatments considered “crazy”.

Animals and plants helped each other evolve, in the sense of diversifying to exploit new habitats. Animals helped plants exploit new habitats because they increased seed dispersal. This helped plants “test” more locations, helped them survive difficult circumstances such as drought (because some places are drier than others), and reduced competition between seeds (allowing more resources to be devoted to overcoming bad features of new places). Animals are like catalysts that speed up the combination of old plant and new environment to yield new plant. Likewise, plant evolution helped animals evolve because new plants in new places provided more food, more diverse food, and more places to live.

It is likely that personal science and professional science will help each other “evolve” (e.g., solve problems). Personal science wouldn’t function well without professional science. For example, statistical packages, which help personal scientists, wouldn’t exist without professional science. In the other direction, personal science can help professional science “evolve” (e.g., solve problems, build better theories) in two ways. One is idea generation, especially discovery of new cause-effect relationships. Personal scientists can easily do large amounts of trial and error. They can easily test many “crazy” (= low-probability-of-success) treatments, one after the other, until they find something that works. Professional scientists cannot do this sort of thing, which in the world of professional science has a derogatory name: fishing expedition. The other way personal science can help professional science involves idea application. Personal science can tailor ideas from professional science to individual circumstances. Professional scientists don’t like to do this. They would rather do a big study in which all subjects are treated alike. Making better practical use of ideas from professional science is what Richard Bernstein did when he invented home blood glucose monitoring. He made better use of already-known cause-effect relationships.

I have not heard scientists talk about science as an ecosystem. If they did, it might cut down on the dismissiveness (correlation does not equal causation, the plural of anecdote is not data, etc.), evidence snobbery, and one-way skepticism.

Peter Lawrence on the Ills of Modern Science

Peter A. Lawrence is a British biologist who has written several papers about problems with the way biology and other areas of science are now done. In this interview a year ago he summarizes his complaints:

  • Scientific publication “has become a system of collecting counters for particular purposes – to get grants, to get tenure, etc. – rather than to communicate and illuminate findings to other people. The literature is, by and large, unreadable.” There is far too much counting of papers.
  • “There’s a reward system for building up a large group, if you can, and it doesn’t really matter how many of your group fail, as long as one or two succeed. You can build your career on their success.” If you do something on your own it is viewed with suspicion.
  • There is too much emphasis on counting citations. “If you work in a big crowded field, you’ll get many more citations. . . . This is independent of the quality of the work or whether you’ve contributed anything. [There is] enormous pressure on the journals to accept papers that will be cited a lot. And this is also having a corrupting effect. Journals will tend to take papers in medically-related disciplines, for example, that mention or relate to common genetic diseases. Journals from, say, the Cell group, will favor such papers when they’re submitted.”
  • Grant writing takes too much time — e.g., 30-40% of your time. "There is an enormous increase in bureaucracy – form filling, targeting, assessment, evaluations. This has gone right through society, like the Black Death!"
  • “Science is not like some kind of an army, with a large number of people who make the main steps forward together. You need to have individually creative people who are making breakthroughs – who make things different. But how do you find those people? I don’t think you want to have a situation in which only those who are competitive and tough can get to the top, and those who are reflective and retiring would be cast aside.” I’ve said something similar: Science is like single ants wandering around looking for food, not like a trail of ants to and from a food source. The trail of ants is engineering.

I agree. I would add that I think modern biology is far too invested in the idea that genes cause disease and that studying genes will help reduce human suffering. I think the historical record (the last 30 years) shows that this is not a promising line of work — but modern biologists cannot switch course.

What explains the depressing facts Lawrence points out? I think it is something deep and impossible to change: Science and jobs don’t mix well. The demands of any job and the demands of science are not very compatible. Jobs are about repetition. Science is the opposite. Jobs demand regular output. Science is unpredictable. However, jobs and science overlap in terms of training: Both benefit from specialized knowledge. They also overlap in terms of resources: More resources (e.g., better tools) will usually help you do your job better, and likewise with science. So we have two groups (insiders — professional scientists — and outsiders — everyone else). Both groups have big advantages and big disadvantages relative to the other. In the last 50 years, the insiders have been “winning” in the sense of doing better work. Their advantages of training and resources far outweighed the problems caused by the need for repetition and predictability. But now — as I try to show on this blog — outsiders are catching up and going ahead because the necessary training and tools have become much more widely available (e.g., tools have become much cheaper). And, as Lawrence emphasizes, professional science has gotten worse.

 

Justification For Self-Experimentation and My Belief that N=1 Results Will Generalize

At the Quantified Self blog, in response to a video of me talking about QS and the Ancestral Health Symposium (paleo), someone named Colin made the following comment:

Very interesting talk. I am just curious how someone can claim a study conducted with a sample size of one is “100 times better” than someone else’s study. I do not know anything about the other study mentioned, but I do know that a study based on n=1 cannot be considered scientific proof. And sure, he hears from people who have lost weight drinking the sugar water he prescribed, but it is quite possible there are 100 times as many people who didn’t email him because they didn’t see any positive results and decided to try something else. I think the QS stuff is very interesting and helpful on a personal level, but it seems like a stretch to generalize your results to others.

I responded:

I have two responses.

1. Sample size isn’t everything. Sure, a study with n=1 isn’t “scientific proof”. Nor is any other study, in my experience. “Scientific proof” has always required many studies. New scientific ideas have very often started with n = 1 experiments or observations. Later, larger experiments or observations were done. Both — the initial n=1 observation and the later n = many observations — were necessary for the new idea to be discovered and confirmed.

2. The history of biology teaches that there are few exceptions to general rules. See any biology textbook. For example, a textbook might say “lymphocytes fight infection”. This means no serious exceptions have ever been found to that rule. So, as a matter of biological history, the person who managed to figure out what one particular lymphocyte does turned out to have figured out what they all do. Biology textbooks have thousands of statements like “lymphocytes fight infection”, meaning that this sequence of events (you can generalize from one to all, or nearly all) has happened thousands of times. There is no hidden shadow history of biology that teaches otherwise.

Gelman and Fung versus Levitt and Dubner: How “Wrong” is Freakonomics?

In the latest issue of American Scientist, Andrew Gelman (an old friend) and Kaiser Fung criticize Freakonomics and Superfreakonomics by Steve Levitt and Stephen Dubner (who wrote about my work). Although the article is titled “Freakonomics: What Went Wrong?” none of the supposed errors are in Freakonomics. You can get an idea of the conclusions from the title and this sentence: “How could an experienced journalist and a widely respected researcher slip up in so many ways?”

Gelman and Fung examine a series (“so many ways”) of what they consider mistakes. I will comment on each of them.

1. The case of the missing girls. I agree with Gelman and Fung: Levitt and Dubner accepted Emily Oster’s research too uncritically.

2. The risk of driving a car. I think Gelman and Fung miss the point. Yes, the claim (driving drunk is safer than walking drunk) was not well-supported by the evidence provided because the comparison was so confounded. However, I read the whole example differently. I didn’t think that Levitt and Dubner thought drunk people should drive. I thought their point was more subtle — that comparisons are difficult (“look how we can reach a crazy conclusion”).

3. Stars are made not born. I think Gelman and Fung fail to see the big picture. The birth-month effect in professional sports, which Gelman and Fung dismiss as “very small,” is of great interest to many people, if not to Gelman and Fung. It suggests what Levitt and Dubner and Gladwell and others say: Early success matters. That’s not obvious at all. There are lots of similar associations in epidemiology. They have been the first evidence for many important conclusions, such as smoking causes lung cancer. Are professional sports important? Maybe. But epidemiology and epidemiological methods are surely important. By learning about this effect, we learn about them. Lots of smart people fail to take epidemiology seriously enough (e.g., “correlation does not equal causation”).

4. Making the majors and hitting a curve ball. Gelman and Fung point out that one sentence is misleading. One sentence. This is called praising with faint damn.

5. Predicting terrorists. Gelman and Fung say that the terrorist prediction algorithm of a man named Ian Horsley, which Levitt and Dubner seem to take seriously, is not practical. But their review fails to convince me it was presented as practical. Since there are no data about how well the algorithm works, and Levitt and Dubner are all about data….

6. The climate change dust-up. I agree with Gelman and Fung that Nathan Myhrvold’s geoengineering ideas are unimportant. (My view of Myhrvold’s patent trolling.) But in this case, I’d say both sides — Gelman and Fung and Levitt and Dubner — miss what’s really important, namely that the usual claims that humans are dangerously warming the planet are held far too strongly. The advocates of this view are far too sure of themselves. I have blogged about this many times. In a nutshell, the climate models that we are supposed to trust have never been shown to persuasively predict the climate ten or twenty years from now (or even one year from now). There is no good reason to believe them. That Levitt and Dubner seem to take that stuff seriously is the only big criticism I have of their work. At least in that geoengineering stuff Levitt and Dubner were dissenting from conventional wisdom. Gelman and Fung do not. They fail to realize that something we’ve been told thousands of times is nonsense (in the sense of being wildly overstated). It was Levitt and Dubner’s comments about this that led me to look closely at all that climate-change scare stuff. I was surprised how poor the evidence was.

The biggest problem with Gelman and Fung’s critique is that they say nothing about the great contribution of Steve Levitt to economics. They fail to grasp that he has made economics considerably more of a science, if by science you mean a data-driven enterprise as opposed to an ideologically-driven or prestige-driven one (mathematics is prestigious, the more difficult, the more prestigious). He did so by pioneering a new way to use data to learn interesting things. His method is essentially epidemiological, except his methods are considerably better (better matching, less formulaic) and his topics much more diverse (e.g., sumo wrestling) than mainstream epidemiology. A large fraction of prestige economics is math, divorced from empirical tests. This stuff wins Nobel Prizes, but, in my and many other people’s opinion, contributes very little to understanding. (Psychology has had the same too much math, too little data problem — minus the Nobel Prizes, of course.) To persuade a big chunk of an entire discipline to pay more attention to data is a huge accomplishment.

Levitt’s methodological innovation makes Freakonomics far from what Gelman and Fung call “pop statistics”. It is actually an amusing and well-written record of something close to a revolution. In the 1980s, a friend of mine at UC Berkeley took an introductory economics class. She told me a little of what the teacher said in class. All theory. What about data? I said. It’s a strange science that doesn’t care about data. My friend went to office hours. She asked the instructor (a Berkeley economics professor): What about data? Don’t worry about data, he replied. Gelman and Fung fail to appreciate what economics used to be like. The ratio of strongly-asserted ideas to persuasive data used to be very large. Now it is smaller.

Thanks to Ashish Mukharji.

Assorted Links

  • Top ten excuses for climate scientists behaving badly. For example, “the emails are old” and “the timing is suspicious”.
  • Scientific retractions are increasing. My guess is that retractions are increasing because scientific work has become easier to check. Tools are cheaper, for example.
  • More Dutch scientific misconduct. “Professor Poldermans published more than 600 scientific papers in a wide range of journals, including JAMA and the New England Journal of Medicine.”
  • The next time someone praises “evidence-based medicine”, ask them: What about Accutane? It illustrates how evidence-based medicine encourages dangerous drugs. You can’t make lots of money from cheap, time-tested things that we know to be safe (such as dietary changes) so the drug industry revolves around things that are not time-tested and therefore dangerous — far more dangerous than dietary changes. Evidence-based medicine, which says that certain tests (expensive) are much better than other tests (cheap), provides cover for this. Because the required tests are so expensive, they are allowed to be short.

Thanks to Allan Jackson.

Evidence-Based Medicine Versus Innovation

In this interview, a doctor named Randall Wolcott, who does research on biofilms, makes the same point I made about Testing Treatments — that evidence-based medicine, as now practiced, suppresses innovation:

I take it you [meaning the interviewer] are familiar with evidence-based medicine? It’s the increasingly accepted approach for making clinical decisions about how to treat a patient. Basically, doctors are trained to make a decision based on the most current evidence derived from research. But what such thinking boils down to [in practice — theory is different] is that I am supposed to do the same thing that has always been done – to treat my patient in the conventional manner – just because it’s become the most popular approach. However, when it comes to chronic wound biofilms, we are in the midst of a crisis – what has been done and is accepted as the standard treatment doesn’t work and doesn’t meet the needs of the patient.

Thus, evidence-based medicine totally regulates against innovation. Essentially doctors suffer if they step away from mainstream thinking. Sure, there are charlatans out there who are trying to sell us treatments that don’t work, but there are many good therapies that are not used because they are unconventional. It is only by considering new treatment options that we can progress.

Right on. He goes on to say that he is unwilling to do a double-blind clinical trial in which some patients do not receive his new therapy because “we know we’ve got the methods to save most of their limbs” from amputation.

Almost all scientific and intellectual history (and much serious journalism) is about how things begin. How ideas began and spread, how inventions are invented. If you write about Steve Jobs, for example, that’s your real subject. How things fail to begin — how good ideas are killed off — is at least as important, but much harder to write about. This is why Tyler Cowen’s The Great Stagnation is such an important book. It says nothing about the killing-off processes, but at least it describes the stagnation they have caused. Stagnation should scare us. As Jane Jacobs often said, if it lasts long enough, it causes collapse.

Thanks to Heidi.