“Science is the Belief in the Ignorance of Experts” — Richard Feynman

“Science is the belief in the ignorance of experts,” said the physicist Richard Feynman in a 1966 talk to high-school science teachers. I think he meant science is the belief in the fallibility of experts. In the talk, he says science education should be about data — how to gather data to test ideas and get new ideas — not about conclusions (“the earth revolves around the sun”). And it should be about pointing out that experts are often wrong. I agree with all this.

However, I think the underlying idea — what Feynman seems to be saying — is simply wrong. Did Darwin come up with his ideas because he believed experts (the Pope?) were wrong? Of course not. Did Mendel do his pea experiments because he didn’t trust experts? Again, of course not. Darwin and Mendel’s work showed that the experts were wrong, but that’s not why they did it. Nor do scientists today do their work for that reason. Scientists are themselves experts. Do they do science to reveal their own ignorance? No, that’s blatantly wrong. If science is the belief in the ignorance of experts, and X is the belief in the ignorance of scientists, what is X? Our entire economy is based on expertise. I buy my car from experts in making cars, buy my bread from bread-making experts, and so on. The success of our economy teaches us we can rely on experts. Why should high-school science teachers say otherwise? If we can rely on experts, and science rests on the assumption that we can’t, why do we need scientists? Is Feynman saying experts are wrong 1% of the time, and that’s why we need science?

I think what Feynman actually meant (but didn’t say clearly) is that science protects us against self-serving experts. If you want to talk about the protection-against-experts function of science, the heart of the matter isn’t that experts are ignorant or fallible. It is that experts, including scientists, are self-serving. The less certainty in an area, the more experts in that area slant or distort the truth to benefit themselves. They exaggerate their understanding, for instance. A drug company understates bad side effects. (Calling this “ignorance” is too kind.) This is common, non-obvious, and worth teaching high-school students. Science journalists, who are grown-ups and should know better, often completely ignore this. So do other journalists. Science (data collection) is unexpectedly powerful because experts are wrong more often than a naive person would guess. The simplest data collection is to ask for an example.

When Genius by James Gleick (a biography of Feynman) was published, I said it should have been titled Genius Manqué. This puzzled my friends. Feynman was a genius, I said, but lots of geniuses have had a bigger effect on the world. I heard Feynman himself describe how he came to invent Feynman diagrams. One day, when he was a graduate student, his advisor, John Wheeler, phoned him. “Dick,” he said, “do you know why all electrons have the same charge? Because they’re the same electron.” One electron moves forward and backward in time, creating all the electrons we observe. Feynman diagrams came from this idea. The Feynman Lectures on Physics were a big improvement over standard physics books — more emotional, more vivid, more thought-provoking — but contain far too little about data, in my opinion. Feynman failed to do what he told high school teachers to do.

The Blindness of Scientists: The Problem isn’t False Positives, It’s Undetected Positives

Suppose you have a car that can only turn right. Someone says, “Your car turns right too much.” You might wonder why they don’t see the bigger problem (it can’t turn left).

This happens in science today. People complain about how well the car turns right, failing to notice (or at least say) that it can’t turn left. Just as a car should turn both right and left, scientists should be able to (a) test ideas and (b) generate ideas worth testing. Tests are expensive. To be worth the cost of testing, an idea needs a certain plausibility. In my experience, few scientists have clear ideas about how to generate ideas plausible enough to test. The topic is not covered in any statistics text I have seen — the same books that spend many pages on how to test ideas.

Apparently not noticing the bigger problem, scientists sometimes complain that this or that finding “fails to replicate”. My former colleague Danny Kahneman is an example. He complained that priming effects were not replicating. Implicit in a complaint that Finding X fails to replicate is a complaint about testing. If you complain that X fails to replicate, you are saying that something was wrong with the tests that established X. There is a connection between replication failure and failure to generate ideas worth testing. If you cannot generate new ideas, you are forced to test old ideas. You cannot test an old idea exactly — that would be boring/repetitive. So you give an old idea a slight tweak and test the variation. For example, someone has shown that X is true in North America. You ask if X is true in South America. You hope you haven’t tweaked X too much. No idea is true everywhere, except maybe in physics, so as this process continues — it goes on for decades — the tested ideas gradually become less true and the experimental effects get weaker. This is what happened in the priming experiments that Kahneman complained about. At the core of priming — the priming effects studied 30 years ago — is a true phenomenon. After reading “doctor” it becomes easier to decide that “nurse” is a word, for example. This was followed by 30 years of drift away from word recognition. Not knowing how to generate new ideas worth testing, social psychologists have ended up studying weak effects (recent priming effects) that are random walks away from strong effects (old priming effects). The weak effects cannot bear the professional weight (people’s careers rest on them) they are asked to carry and sometimes collapse (“failure to replicate”). Sheena Iyengar, a Columbia Business School professor and social psychologist, got a major award (best dissertation) for and wrote a book about a new effect that has turned out to be very close to non-existent. Inability to generate ideas — to understand how to do so — means that what appear to be new ideas (not just variations of old ideas) are more likely to be mistakes. I have no idea whether Iyengar’s original effect was true or not. I am sure, however, that it was weak and made little sense.

Statistics textbooks ignore the problem. They say nothing about how to generate ideas worth testing. I haven’t asked statisticians about this, but they might respond in one of two ways: 1. That’s someone else’s problem. Statistics is about what to do with data after you gather it. That makes as much sense as teaching someone how to land a plane but not how to take off. 2. That’s what exploratory data analysis is for. If I said “Exploratory data analysis can only identify effects of factors that the researcher decided to vary or track. Which is expensive. What about other factors?” they’d be baffled, I believe. In my experience, exploratory data analysis = full analysis of your data. (Many people do only a small fraction, such as 10%, of all reasonable analyses of their data.) Full analysis is better than partial analysis, but calling it a way to find new ideas fails to understand that professional scientists study the same factors over and over.

I suppose many scientists feel the gap acutely. I did. I became interested in self-experimentation most of all because it generated new ideas at a much higher rate (per year) than my professional experiments with rats. At first I had no idea why, but as it kept happening — as my self-experimentation generated one new idea after another — I came to believe that by accident I was doing something “right”. I was doing something that fit a general rule of how to generate ideas, even though I didn’t know what the general rule was.

The sciences I know about (psychology and nutrition) have great trouble coming up with new ideas. The paleo movement is a response to stagnation in the field of nutrition. The Shangri-La Diet shows what a new idea looks like in the area of weight control. The failure of nutritionists to study fermented foods is ongoing. Stagnation in psychology can be seen in the fact that antidepressants remain heavily prescribed, many years after the introduction of Prozac (my work on morning faces and mood suggests a much different approach), lack of change in treatments for bipolar disorder over the last 50 years (again, my morning-faces work suggests another approach), and in the failure of social psychologists to discover any big new effects in the last ten years.


Here is the secret to idea generation: Cheaper tests. To find ideas plausible enough to be worth testing with Test X, you need a way of testing ideas that is cheaper than Test X. The cheaper your test, the larger the region of cause-effect space you can explore. Let’s say Test Y is cheaper than Test X. With Test Y, you can explore more of cause-effect space than you can explore with Test X. In the region unexplored by Test X, you can find points (cause-effect relationships) that pass Test Y. They are worth testing with Test X. My self-experimentation generated new ideas worth testing with more expensive tests because it was much cheaper than existing tests. Via self-experimentation, I could test many ideas too implausible or too expensive to be tested conventionally. Even cheaper than a self-experiment was simply monitoring myself — tracking my sleep, for example. Again and again, this generated ideas worth testing via self-experimentation. I did what all scientists should do: use cheaper tests to generate ideas worth testing with more expensive tests.
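
Here is a minimal simulation of this logic (illustrative numbers only, not a model of any particular lab): candidate ideas are screened with a cheap noisy test, and only the survivors get the expensive accurate test. With the same budget, the two-stage strategy confirms several times more true effects.

```python
import random

random.seed(1)

# Illustrative numbers: most candidate ideas are false, a few are true.
N_IDEAS = 10_000          # points in "cause-effect space"
P_TRUE = 0.01             # fraction of ideas that are actually true
CHEAP_COST, DEAR_COST = 1, 100
BUDGET = 20_000

ideas = [random.random() < P_TRUE for _ in range(N_IDEAS)]

def cheap_test(true_effect):
    """Noisy screen (Test Y): 80% power, 10% false-positive rate."""
    return random.random() < (0.8 if true_effect else 0.1)

def dear_test(true_effect):
    """Accurate confirmation (Test X): 95% power, 2% false-positive rate."""
    return random.random() < (0.95 if true_effect else 0.02)

# Strategy A: expensive tests only. The budget covers 200 of 10,000 ideas.
n_a = BUDGET // DEAR_COST
hits_a = sum(1 for t in ideas[:n_a] if t and dear_test(t))

# Strategy B: screen every idea cheaply, confirm survivors expensively.
survivors = [t for t in ideas if cheap_test(t)]
n_b = (BUDGET - N_IDEAS * CHEAP_COST) // DEAR_COST
hits_b = sum(1 for t in survivors[:n_b] if t and dear_test(t))

print(f"expensive tests only: {hits_a} true effects confirmed")
print(f"cheap screen first:   {hits_b} true effects confirmed")
```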

Showers and the Ecology of Knowledge

In a recent post, I said a well-functioning system will produce both optimality and complexity. I meant important systems like our bodies, economies, and formal education. If you look at the nutrition advice provided by the United States Department of Agriculture — the food pyramid, the food plate, the recommended daily allowances, and the associated reports — you will find nothing that increases the complexity of metabolism inside our bodies (in particular, the diversity of metabolic pathways). The advice is all optimality — for example, the best amounts of various micronutrients. The people behind the USDA advice, reflecting the thinking of the best nutrition scientists in the world, utterly fail to grasp the importance of complexity. Half of nutrition research — or more than half, since the topic has been so neglected — should be about how to increase internal complexity. In practice, almost none of it is. It’s obvious, I think, that the microbes within us are very important for health. They are mostly in our intestines and must be heavily influenced by what we eat. How did they get there? How can their number be increased? How can their diversity be increased?

The absence is especially striking because the point is so simple. To solve actual problems, you need both optimality and complexity. Showers — what we use to take a shower — provide an example. You want to adjust the water temperature. If you try to do this while taking a shower, it can be hard because of the delay between changing the hot/cold water proportions and feeling the effects. It is better to use the bathtub (lower) tap to set the temperature (measuring it with your wrist) and shift the water to the shower head only after you’ve optimized the temperature. The bathtub tap produces simple output (a single stream of water) that is easy to optimize. The shower head produces more complex output that is harder to optimize but does a better job of washing (an actual problem). You need both the bathtub tap (for optimization) and the shower head (for complexity) to do a good job solving the problem. Likewise, we need both an understanding of necessary nutrients (Vitamin A, etc.), which can be optimized, and an understanding of microbes, which cannot be optimized but can be made more complex, to make good decisions about what food to eat. Ordinary food is the hardware, you might say, and microbes are the software.
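
A toy simulation of the delay problem (made-up numbers): you correct the tap against the temperature you feel now, but what you feel is the setting from several seconds ago, so the corrections overshoot and oscillate. Set DELAY to 0 and the same rule converges smoothly, which is the point of optimizing at the bathtub tap first.

```python
# Toy model: the felt temperature lags the tap setting by DELAY steps.
# With a lag, correcting against the stale reading overshoots and
# oscillates; with DELAY = 0 the same rule converges smoothly.
TARGET = 40.0   # desired temperature, arbitrary units
DELAY = 4       # steps between changing the tap and feeling the change
GAIN = 0.6      # how aggressively we correct

history = [20.0] * (DELAY + 1)   # the pipe starts cold
setting = 20.0
for step in range(15):
    felt = history[-DELAY - 1]          # what the skin feels now
    setting += GAIN * (TARGET - felt)   # correct against a stale reading
    history.append(setting)
    print(f"step {step:2d}: tap setting {setting:7.1f}, felt {felt:7.1f}")
```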


Smoking and Cancer

In his interview with me about The Truth in Small Doses (Part 1, Part 2), Clifton Leaf praised Racing to the Beginning of the Road (1996) by Robert Weinberg. “A masterful job . . . the single best book on cancer,” wrote Leaf. In an email, he continued:

In Chapter 3 of “Racing to the Beginning of the Road,” Weinberg goes through much of the early epidemiological work linking tobacco to cancer (John Hill, Percivall Pott, Katsusaburo Yamagiwa, Richard Doll), but then focuses on the story of Ernst Wynder, who just happens to be one of Weinberg’s cousins. [As a medical student, Wynder found a strong correlation between smoking and lung cancer.] Building on his own prior epidemiological work, and that of many others, Wynder actually built an experimental “smoking machine” at the Sloan-Kettering Institute in New York in the early 1950s. The machine collected the tar from cigarette smoke (and later, the condensate from the smoke) and Wynder used those to produce skin cancers in mice and rabbits. But the amazing part of the story is what happened later…with Wynder’s bosses at Sloan-Kettering and with one of the legendary figures in cancer research, Clarence Cook Little. I don’t want to give the story away. (If you have the time, you really would love reading the book.) But it’s one of the most damning stories of scientific interference I’ve read.

Wynder met a lot of opposition. His superiors at Sloan-Kettering required that his papers be okayed by his boss, who disagreed with his conclusions. Clarence Cook Little, according to Weinberg, made the following arguments:

The greater rates of lung cancer in smokers only gave evidence of a correlation, but hardly proved a causal connection. One’s credulity had to be strained to accept the ability of a single agent [he means smoking] to cause lung cancer along with so many other diseases including bronchitis, emphysema, coronary artery disease, and a variety of cancers of the mouth, pharynx, esophagus, bladder and kidney. After all, many of these diseases existed long before people started smoking.

A little masterpiece of foolishness . . . and more reason to never ever say “correlation does not equal causation”. Little was at one point President of the University of Michigan. Later he worked for the tobacco industry. It wasn’t just Little. Weinberg says that Wynder’s colleagues complained about his “statistical analyses and experimental protocols, which they found to be less than rigorous.”

Weinberg says little about epidemiology in the rest of the book — which, to be fair, is about the laboratory study of cancer. At the very end of the book, he writes:

We learned much about how cancer begins; it is no longer a mystery. We will surely learn more . . . but the major answers already rest firmly in our hands. . . . No, we have still not found the cure. But after so long, we know where to look.

The claim that “we know where to look” is not supported by examples. And Weinberg says nothing about prevention.

Weinberg’s book reminded me of a new-music concert I attended at the Brooklyn Academy of Music. Like lots of new non-popular music, it was hard to listen to (non-melodic, etc.). I didn’t enjoy it, but surely the composer did — I found this fascinating. How did it happen? I wondered. Weinberg describes a great deal of research that has so far produced little practical benefit. Weinberg, it seems, has managed to avoid being bothered by this — if he even notices it. How did this happen?

I don’t think it’s “bad” or wrong or undesirable to do science with no practical benefit, just as I don’t complain about “unlistenable” music. Plenty of “useless” science has ultimately proved useful, but the transition from useless to useful can take hundreds of years, which is why there must be “scaffolding,” sources of support other than practicality. This is why scientists use the word elegant so much. Their enjoyment of “elegance” is scaffolding. Long before “useless” science, there was “useless” decoration (and nowadays there is “unlistenable” music). Thorstein Veblen showed no sign of understanding that the “waste” he mocked made possible exploration of the unknown, which is necessary for progress. (By supporting artisans, such as the artisans who make decorations, we support their research.) What is undesirable is to foolishly criticize someone (like Wynder) who manages to do something useful, as Little and Wynder’s colleagues did.

Academic Job Advice: Be Able to Say Why You Study What You Study

Recently I interviewed two job candidates for an assistant professor position at Tsinghua. I asked both of them: “Why did you decide to study this?” (this = their field of research). One had no answer at all. The other had an answer that didn’t make sense. I didn’t mean it as a tough question. If they had said “because that’s what they were doing where I got a postdoc” I would have been perfectly happy. If that were the answer, I might have asked “why does your advisor study it?” — to which “I don’t know” would have been perfectly acceptable. Of course, there are better answers.

When I was a graduate student, I read Adventures of a Mathematician by Stanislaw Ulam (a very good, well-written book). One of the book’s comments impressed me: that John von Neumann was able to distinguish the main lines of growth of the tree of mathematics from the branches. My research was about how rats measure time. The relevance to big questions in the psychology of learning wasn’t obvious. I wondered: Am I studying something important? Or something that will be irrelevant in twenty years? My advisor didn’t seem to have thought about this.

When I interviewed for jobs at various universities, no one asked me “Why do you study this?” But it was still a question worth answering. As a grad student I had no choice. But eventually I would have a choice: I could continue to study how rats measure time. Or I could study something else. (Eventually I did change — to studying what controls variation in behavior.)

Here’s what I would say now about how to choose a research topic.

What’s best is a new method. If you can use a new method to answer questions in your field, do that. The cheaper, easier and more available the method, the better. As a graduate student, I developed a new way to study how rats measure time, which I called the peak procedure. It made it easier to determine if an experimental treatment affected an animal’s internal clock.
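
For readers unfamiliar with it: in the peak procedure, reward is sometimes available a fixed time after a signal starts, and on occasional unrewarded probe trials the animal’s responding rises and falls around the usual reward time, so the location of the peak estimates the internal clock. A minimal sketch of that analysis with simulated data (the numbers are invented):

```python
import random
import statistics

random.seed(1)

# Simulated peak-procedure data: on unrewarded probe trials, responses
# cluster around the time reward is normally given (here 20 s).
def probe_trial(clock_time=20.0, spread=4.0, n_presses=30):
    return [random.gauss(clock_time, spread) for _ in range(n_presses)]

baseline = [t for _ in range(50) for t in probe_trial(clock_time=20.0)]
treated  = [t for _ in range(50) for t in probe_trial(clock_time=26.0)]

# If a treatment changes the internal clock, the peak time shifts.
print(f"baseline peak  ~ {statistics.mean(baseline):.1f} s")
print(f"treatment peak ~ {statistics.mean(treated):.1f} s")
print(f"shift: {statistics.mean(treated) - statistics.mean(baseline):.1f} s")
```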

What’s second best is a new experimental effect. Discovering a new way to change something of interest. The bigger, cheaper, newer, and more surprising the effect, the better. Using the peak procedure, my colleagues and I discovered a large and surprising effect (at a certain time during the peak procedure, the variability of bar-press duration — how long a rat holds down the bar when pressing it — became much larger). When I first saw the result, I assumed it was due to a software mistake. It turned out to be a window into what controls the variability of behavior — an easy way of studying that. In that sense it was also a new method.

I don’t know if the two job candidates I interviewed were doing either of these two things. Maybe not. My broader point is that if you don’t have a good understanding of how to choose a research topic you will have to retreat to studying something simply because others are studying it. Which is exactly the wrong thing to do if you want to be an innovator and a leader.


Posit Science: Does It Work? (Continued)

In an earlier post I asked 15 questions about Zelinski et al. (2011) (“Improvement in memory with plasticity-based adaptive cognitive training: results of the 3-month follow-up”), a study done to measure the efficacy of the brain training sold by Posit Science. The study asked if the effects of training were detectable three months after it stopped. Henry Mahncke, the head of Posit Science, recently sent me answers to a few of my questions.

Most of my questions he declined to answer. He didn’t answer them, he said, because they contained “innuendo”. My questions were ordinary tough (or “critical”) questions. Their negative slant was not at all hidden (in contrast to innuendo). For the questions he didn’t answer, he substituted less critical questions. I give a few examples below. Unwillingness to answer tough questions about a study raises doubts about it.

His answers raised more doubts. From his answer to Question 7, I learned that although the investigators gave their subjects the whole RBANS, (a) they failed to report the results from the visual subtests and (b) these unreported results did not support their conclusions. Mahncke says this result was not reported “due to lack of publication space.” The original paper did not say that some results were omitted due to lack of space. I assume all favorable results were reported. To report all favorable results but omit some unfavorable results is misleading.

To further explain the omission, Mahncke says

We used the auditory measures as the primary outcome measure because we hypothesized that cognitive domains [by “cognitive domains” he means the cognitive gains due to training — Seth] would be restricted to the trained sensory domain, in this case the auditory system. [emphasis added]

He doesn’t say he believed the gains would be greater with auditory stimuli, he says he believed they would be restricted to auditory stimuli. The Posit Science website says their training increases “memory”, “intelligence”, “focus” and “thinking speed”. None of these are restricted to the auditory system — far from it. Unless I am misunderstanding something, the head of Posit Science doesn’t believe the main claims of the Posit Science website.

Why Mahncke fails to see a difference between methods (Question 13) and results (Question 14), fails to see a difference between methods (Question 11) and discussion (Question 15), and gives a one-word answer (“yes”) to Question 12, I cannot say. In each case, however, he errs on the side of not answering.

My overall conclusion is that this study does not support Posit Science claims. The main measure (RBANS auditory subtests) didn’t show significant retention. A closely related set of measures (RBANS visual subtests) didn’t show significant retention. A third set of measures (“secondary composite measure”) did show retention, but the p value was not corrected for multiple tests. When the p value is corrected for multiple tests, the secondary composite measure may not show significant retention. Because of the large number of subjects (more than 500), repeated failure to find significant retention under presumably near-optimal conditions (e.g., 1 hour/day of training) suggests that the training effect, after three months without training, is small or zero.
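
To see why the correction matters: with several outcome measures and no real effect, the chance that at least one reaches p < 0.05 by luck alone is well above 5%. A minimal illustration (generic arithmetic, not the IMPACT data):

```python
import random

random.seed(1)

# With k independent outcome measures and no real effect, the chance
# that at least one shows p < 0.05 by luck is 1 - 0.95**k.
for k in (1, 3, 6):
    print(f"{k} measures: P(some p < .05 by luck) = {1 - 0.95**k:.2f}")

# Monte Carlo check with k = 6 null measures per experiment.
k, trials, hits = 6, 10_000, 0
for _ in range(trials):
    if any(random.random() < 0.05 for _ in range(k)):
        hits += 1
print(f"simulated rate: {hits / trials:.2f}")
print(f"Bonferroni threshold for k = 6: {0.05 / k:.4f}")
```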

I assume that Posit Science sponsored this study because they believed it was unrealistic for subjects to spend 1 hour/day for the rest of their life doing their training. One hour/day was realistic for a while, yes, but not forever. So subjects will stop. Will the gains last? was the question. Apparently the answer is no.

If Mahncke has any response to this, I will post it.

This is another illustration of why personal science (science done for your own benefit, rather than as a job) is important. Professional scientists are under pressure to get certain results. This study is an example. Mahncke was a co-author. Someone employed by Posit Science is under pressure to get results that benefit Posit Science. (I am not saying Mahncke was affected by this pressure.) A personal scientist is not under pressure to get certain results. For example, if I study the effect of tetracycline (an antibiotic) on my acne, I simply want to know if it helps. Both possible answers (yes and no) are equally acceptable. We may need personal scientists to get unbiased answers.
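
A minimal sketch of the sort of analysis the tetracycline example implies (the counts are hypothetical): compare daily acne counts from on-drug and off-drug periods with a randomization test, which needs no statistics package.

```python
import random

random.seed(1)

# Hypothetical daily pimple counts from alternating on/off periods.
on_drug  = [5, 4, 6, 3, 5, 4, 5, 6]
off_drug = [5, 6, 4, 5, 6, 5, 7, 5]

observed = sum(off_drug) / len(off_drug) - sum(on_drug) / len(on_drug)

# Randomization test: how often does shuffling the on/off labels
# produce a difference at least as large as the observed one?
pooled = on_drug + off_drug
n, extreme, reps = len(on_drug), 0, 10_000
for _ in range(reps):
    random.shuffle(pooled)
    diff = sum(pooled[n:]) / n - sum(pooled[:n]) / n
    if diff >= observed:
        extreme += 1

print(f"observed difference: {observed:.2f} pimples/day")
print(f"one-sided p = {extreme / reps:.3f}")
```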

Here are my original questions along with Mahncke’s answer or lack of answer.

1. Isn’t it correct that after three months there was no longer reliable improvement due to training according to the main measure that was chosen by you (the investigators) in advance? If so, shouldn’t that have been the main conclusion (e.g., in the abstract and final paragraph)?

Not answered.

[Seth: Here is Mahncke’s substitute question: “Why do you conclude that ‘Training effects were maintained but waned over the 3-month no-contact period’ given that the ‘previously significant improvements became non-significant at the 3-month follow-up for the primary outcome’?”]

2. The training is barely described. The entire description is this: “a brain plasticity-based computer program designed to improve the speed and accuracy of auditory information processing and to engage neuromodulatory systems.” To learn more, readers are referred to a paper that is not easily available — in particular, I could not find it on the Posit Science website. Because the training is so briefly described, I was unable to judge how much the outcome tests differ from the training tasks. This made it impossible for me to judge how much the training generalizes to other tasks — which is the whole point. Why wasn’t the training better described?

Not answered.

[Seth: Here is Mahncke’s substitute question: “Could you describe the training program in more depth, to help judge the similarity between the training exercises and the cognitive outcome measures?”]

3. What was the “ET [experimental treatment] processing speed exercise”?

The processing speed exercise is a time order judgment task in which two brief auditory frequency modulated sweeps are presented, either of which may sweep up or down in frequency. The subject must identify each sweep in the correct order (i.e., up/up, down/down, up/down, down/up). The inter-stimulus interval is adaptively manipulated to determine a threshold for reliable task performance. Note that this is not a reaction time task. The characteristics of the sweeps are chosen to match the frequency modulated sweeps common in stop consonant sounds (like /ba/ or /da/). Older listeners generally show strong correlations between processing speed, speech reception accuracy, and memory; which led us to the hypothesis that improving core processing speed in this way would contribute to improving memory. This approach is discussed extensively in “Brain plasticity and functional losses in the aged: scientific bases for a novel intervention” available at https://www.ncbi.nlm.nih.gov/pubmed/17046669
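
The adaptive manipulation Mahncke describes is standard psychophysics. Here is a generic sketch of how such a staircase converges on a threshold; the 2-down/1-up rule and the toy psychometric function are my assumptions, since the paper does not specify Posit Science’s actual rule.

```python
import random

random.seed(1)

def p_correct(isi_ms, threshold_ms=60.0):
    # Toy psychometric function: accuracy rises from chance toward 1.0
    # as the inter-stimulus interval (ISI) gets longer.
    return 0.5 + 0.5 / (1.0 + (threshold_ms / isi_ms) ** 3)

# Generic 2-down/1-up staircase: make the task harder (shorter ISI)
# after two consecutive correct answers, easier after any error.
# This rule converges near the ~71%-correct ISI.
isi, step = 500.0, 0.8
streak, last_direction, reversals = 0, None, []
for trial in range(200):
    if random.random() < p_correct(isi):
        streak += 1
        if streak < 2:
            continue
        isi *= step          # harder
        streak, direction = 0, "down"
    else:
        isi /= step          # easier
        streak, direction = 0, "up"
    if last_direction and direction != last_direction:
        reversals.append(isi)
    last_direction = direction

tail = reversals[-8:]   # average the last few reversal points
print(f"estimated ISI threshold ~ {sum(tail) / len(tail):.0f} ms")
```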

3 (continued). It sounds like a reaction-time task. People will get faster at any reaction-time task if given extensive practice on that task. How is such improvement relevant to daily life? If it is irrelevant, why is it given considerable attention (one of the paper’s four graphs)?

Not answered.

4. According to Table 2, the CSRQ (Cognitive Self-Report Questionnaire) questions showed no significant improvement in trainees’ perceptions of their own daily cognitive functioning, although the p value was close to 0.05. Given the large sample size (~500), this failure to find significant improvement suggests the self-report improvements were small or zero. Why wasn’t this discussed? Is the amount of improvement suggested by Posit Science’s marketing consistent with these results?

Not answered.

5. Is it possible that the improvement subjects experienced was due to the acquisition of strategies for dealing with rapidly presented auditory material, and especially for focusing on the literal words (rather than on their meaning, as may be the usual approach taken in daily life)? If so, is it possible that the skills being improved have little value in daily life, explaining the lack of effect on the CSRQ?

Not answered.

6. In the Methods section, you write “In the a priori data analysis plan for the IMPACT Study, it was hypothesized that the tests constituting the secondary outcome measure would be more sensitive than the RBANS given their larger raw score ranges and sensitivity to cognitive aging effects.” Do the initial post-training tests (measurements of the training effect soon after training ended) support this hypothesis? Why aren’t the initial post-training results described so that readers can see for themselves if this hypothesis is plausible? If you thought the “secondary outcome measure would be more sensitive than the RBANS” why wasn’t the secondary outcome measure the primary measure?

In a large-scale clinical trial such as IMPACT, it is considered best practice to pick as the primary outcome measure a measure that has been employed in earlier studies. We had used the RBANS in two previous studies (references 8 and 17 in the paper). While we had seen significant results in both studies, it was also clear from those studies that the RBANS had ceiling effects in cognitively intact populations that would limit the statistical sensitivity of the measure. For example, the RBANS list recall measure had 10 words, and a reasonable portion of participants get all 10 correct at baseline, leaving no room for improvement regardless of the efficacy of the intervention. Given that observation, we added measures to the IMPACT study that we hypothesized would be more sensitive. For example, the RAVLT has 15 words, leaving more room for improvement and fewer ceiling effects. [It is unclear that more words = more sensitivity. It depends on the words — Seth] However, since we had not used those measures in previous studies, we decided to define these new measures as secondary outcome measures in the data analysis plan. This issue is discussed in depth in the methods section of the main training effect paper (reference 6), and of course that’s where all of the initial post-training results you mention are described. This improved sensitivity of the secondary outcome measures was quite evident in the post-training data; however for reasons of publication length we did not discuss it in that paper. The comparative data would make an interesting publication, and one that might be helpful to other researchers in this field.

7. The primary outcome measure was some of the RBANS (Repeatable Battery for the Assessment of Neuropsychological Status). Did subjects take the whole RBANS or only part of it? If they took the whole RBANS, what were the results with the rest of the RBANS (the subtests not included in the primary outcome measure)?

Participants took the entire RBANS. We used the auditory measures as the primary outcome measure because we hypothesized that cognitive domains [by “domains” he means “gains” — Seth] would be restricted to the trained sensory domain, in this case the auditory system. Interestingly, there was a significant effect on the overall RBANS measure, however there was no significant effect on a composite of the RBANS visual measures. This interesting result was not included in our papers for reasons of publication length.

[Seth: As I said earlier, a surprising answer.]

8. The data analysis refers to a “secondary composite measure”. Why that particular composite and not any of the many other possible composite measures? Were other secondary composite measures considered? If so, were p values corrected for this?

The measures used were the Rey Auditory Verbal Learning Test total score (sum of trials 1–5) and word list delayed recall, Rivermead Behavioral Memory Test immediate and delayed recall, and Wechsler Memory Scale letter-number sequencing and digit span backwards tests. These measures were chosen a priori as more sensitive than their RBANS cognate measures, and a priori we conservatively chose to integrate all 6 into a single composite measure. Individual test scores are all shown in table 2. This issue is discussed in depth in the methods section of the main training effect paper (reference 6). It’s straightforward to evaluate what the effects shown on other potential composites would be simply from inspecting the individual test data in table 2. In the methods section of the main training effect paper (reference 6), we discuss our approach to multiple comparisons, where we state “A single primary outcome measure (RBANS Memory/ Attention) was predefined to conserve an overall alpha level of 0.05. No corrections for multiple comparisons were made on the secondary measures.” I can see that it would have been helpful to re-iterate that statement in the 2011 paper, and my apologies for the oversight.

[Seth: He doesn’t answer my question “were other secondary measures considered?”]

9. If Test A resembles training more closely than Test B, Test A should show more effect of training (at any retention interval) than Test B. In this case Test A = the RBANS auditory subtests and Test B = the secondary composite measure. In contrast to this prediction, you found that Test B showed a clearer training effect (in terms of p value) than Test A. Why wasn’t this anomaly discussed (beyond what was said in the Methods section)?

Not answered.

10. Were any tests given the subjects not described in this report? If there were other tests, why were their results not described?

All outcome measures performed in the study are reported in the publication.

[Seth: I have no idea how this answer is consistent with (a) the subjects took the visual subtests of the RBANS and (b) the paper fails to report the results of those tests (see answer to Question 7). The paper does not say that the subjects took the visual subtests of the RBANS.]

11. The secondary composite measure is composed of several memory tests and called “Overall Memory”. The Posit Science website says their training will not only help you “remember more” but also “think faster” and “focus better”. Why weren’t tests of thinking speed (different from the training tasks) and focus included in the assessment?

Not answered.

12. Do the results support the idea that the training causes trainees to “focus better”?

Yes.

[Seth: That’s his whole answer.]

13. The Posit Science homepage suggests that their training increases “intelligence”. Was intelligence measured in this study?

At the time we designed IMPACT, we were focused on establishing the effect of the training on memory, as the most common complaint of people with general cognitive difficulties. As IMPACT was in progress, Jaeggi et al. published their very interesting paper on the effect of N-back training on measures of intelligence, where they stated that improving working memory was likely to improve measures of intelligence. It would be quite interesting to repeat the IMPACT study with those or other measures of intelligence, given the improvements in working memory documented in IMPACT. The statement on the Posit Science web page relates to the Jaeggi et al. paper, given that the Posit training program (BrainHQ) includes N-back training.

13 (continued). If not, why not?

Not answered.

[Seth: In Question 12, Mahncke failed to explain his answer about focus (“yes”) apparently because I left out “if yes, please explain how”. In this question, he dislikes my inclusion of “if not, why not?”]

14. Do the results support the idea that the training causes trainees to become more intelligent?

This question appears to be redundant with 13.

[Seth: Question 13 asked: Was intelligence measured? (A methods question.) This question asked: What about the results? Do they support claims about intelligence? (A results question.)]

15. The only test of thinking speed included in the assessment appears to be a reaction-time task that was part of the training. Are you saying that getting faster on one reaction-time task after lots of practice with that task shows that your training causes trainees to “think faster”?

This question appears to be redundant with 11.

[Seth: Question 11 was a methods question. This is a question about what the results mean — a discussion question. I still have no idea why Posit Science says their training causes trainees to “think faster” or why I should care that their subjects get faster on a laboratory task after lots of practice.]

A Revolution in Growing Rice

Surely you have heard of Norman Borlaug, “Father of the Green Revolution”. He won a Nobel Peace Prize in 1970 for

the introduction of these high-yielding [wheat] varieties combined with modern agricultural production techniques to Mexico, Pakistan, and India. As a result, Mexico became a net exporter of wheat by 1963. Between 1965 and 1970, wheat yields nearly doubled in Pakistan and India.

He had a Ph.D. in plant pathology and genetics. He learned how to develop better strains in graduate school. He worked as an agricultural researcher in Mexico.

You have probably not heard of Henri de Laulanié, a French Jesuit priest who worked in Madagascar starting in the 1960s. He tried to help local farmers grow more rice. He had only an undergraduate degree in agriculture. In contrast to Borlaug, he tested simple variations that any farmer could afford. He found that four changes in traditional practices had a big effect:

• Instead of planting seedlings 30-60 days old, tiny seedlings less than 15 days old were planted.
• Instead of planting 3-5 or more seedlings in clumps, single seedlings were planted.
• Instead of close, dense planting, with seed [densities] of 50-100 kg/ha, plants were set out carefully and gently in a square pattern, 25 x 25 cm or wider if the soil was very good; the seed [density] was reduced by 80-90% . . .
• Instead of keeping rice paddies continuously flooded, only a minimum of water was applied daily to keep the soil moist, not always saturated; fields were allowed to dry out several times to the cracking point during the growing period, with much less total use of water.

The effect of these changes was considerably more than Borlaug’s doubling of yield:

The farmers around Ranomafana who used [these methods] in 1994-95 averaged over 8 t/ha, more than four times their previous yield, and some farmers reached 12 t/ha and one even got 14 t/ha. The next year and the following year, the average remained over 8 t/ha, and a few farmers even reached 16 t/ha.

The possibility of such enormous improvements had been overlooked by both farmers and researchers. They were achieved without damaging the environment with heavy fertilizer use, unlike Borlaug’s methods.

Henri de Laulanié was not a personal scientist but he resembled one. Like a personal scientist, he cared about only one thing (improving yield). Professional scientists have many goals (publication, promotion, respect of colleagues, grants, prizes, and so on) in addition to making the world a better place. Like a personal scientist, de Laulanié did small cheap experiments. Professional scientists rarely do small cheap experiments. (Many of them worship at the altar of large randomized trials.) Like a personal scientist, de Laulanié tested treatments available to everyone (e.g., butter). Professional scientists rarely do this. Like a personal scientist, he tried to find the optimal environment. In the area of health, professional scientists almost never do this, unless they are in a nutrition department or school of public health. Almost all research funding goes to the study of other things, such as molecular mechanisms and drugs.

Personal science matters because personal scientists can do things professional scientists can’t or won’t do. de Laulanié’s work shows what a big difference this can make.

A recent newspaper article describes these methods. The results are so good they have been questioned by mainstream researchers.

Thanks to Steve Hansen.

How to Encourage Personal Science?

I wonder how to encourage personal science (= science done to help yourself or a loved one, usually for health reasons). Please respond in the comments or by emailing me.

An obvious example of personal science is self-measurement (blood tests, acne, sleep, mood, whatever) done to improve what you’re measuring. But science is more than data collection, and the data need not come from you. You might study blogs and forums or the scientific literature to get ideas. Self-measurement and data analysis by non-professionals are much easier than ever before. Other people’s experience and the scientific literature are much more available than ever before. This makes personal science far more promising than ever before.

Personal science has great promise for reasons that aren’t obvious. It seems to be a balancing act: Personal science has strengths and weaknesses, professional science has strengths and weaknesses. I can say that personal scientists can do research much faster than professionals and are less burdened with conflicts of interest (personal scientists care only about finding a solution; professionals care about other things, including publication, grants, prizes, respect, and so on). A professional scientist might reply that professional scientists have more training and support. History overwhelmingly favors professional science — at least until you realize that Galileo, Darwin, Mendel, and Wegener (continental drift) were not professional scientists. (Galileo was a math professor.) There is very little personal science of any importance.

These arguments (balancing act, examination of history) miss something important. In a way, it isn’t a balancing act. Professional science and personal science do different things. In some ways history supports personal science. Let me give an example. I believe my most important discovery will turn out to be the effect of morning faces on mood. The basic idea that my findings support is that we have a mood control system that requires seeing faces in the morning to work properly. When the system is working properly, we have a circadian rhythm in mood (happy, eager, serene during the day; unhappy, reluctant, irritable at night). The strangest thing is that if you see faces in the morning (e.g., 7 am) they have no noticeable effect until 6 pm the same day. There is a kind of uncanny valley at work here. If you know little about mood research, this will seem unlikely but possible. If you are an average professional mood researcher, it will seem much worse: can’t possibly be true, total nonsense. If you know a lot about depression research, however, you will know that there is considerable supporting research (e.g., in many cases, depression gets better in the evening). It will still seem very unlikely, but not impossible. However, if you’re a professional scientist, it doesn’t matter what you think. You cannot study it. It is too strange to too many people, including your colleagues. You risk ridicule by studying it. If you’re a personal scientist, of course you can study it. You can study anything.

This illustrates a structural problem:

[Graph: personal and professional science in plausibility space]

This graph shows what personal and professional scientists can do. Ideas vary in plausibility from low to high; data gathering (e.g., experiments) varies in cost from low to high. Personal scientists can study ideas of any plausibility, but they have a relatively small budget. Professional scientists can spend much more — in fact, must spend much more. I suppose publishing a cheap experiment would be like wearing cheap clothes. Another limitation of professional scientists is that they can only study ideas of medium plausibility. Ideas of low plausibility (such as my morning faces idea) are “crazy”. To take them seriously risks ridicule. Even if you don’t care what your colleagues think, there is the additional problem that a test of them is unlikely to pay off. You cannot publish results showing that a low-plausibility idea is wrong. Too obvious. In addition, professional scientists cannot study ideas of high plausibility. Again, the only publishable result would be that your test shows the idea is wrong. That is unlikely to happen. You cannot publish results that show that something that everybody already believes is true.

It is a bad idea for anyone — personal or professional scientist — to spend a lot of resources testing an idea of low or high plausibility. If the idea has low plausibility, the outcome is too likely to be “it’s wrong”. There are a vast number of low-plausibility ideas. No one can afford to spend a lot of money on one of them. Likewise, it’s a bad idea to spend a lot of resources testing an idea of high plausibility because the information value (information/dollar) of the test is likely to be low. If you’re going to spend a lot of money, you should do it only when both possible outcomes (true and false) are plausible.
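
One way to make this precise (a standard information-theory gloss, my addition): treat plausibility as the prior probability p that the idea is true. A definitive test then yields, in expectation, the binary entropy H(p) bits of information, which peaks at p = 0.5 and is near zero at both extremes. Information per dollar is highest for ideas of medium plausibility.

```python
from math import log2

def binary_entropy(p):
    """Expected bits learned from a definitive test of an idea
    whose prior probability of being true is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"plausibility {p:4.2f}: {binary_entropy(p):.3f} bits per test")
```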

This graph explains why health science has so badly stagnated — every year, the Nobel Prize in Medicine is given for something relatively trivial — and why personal science can make a big difference. Health science has stagnated because it is impossible for professionals to study ideas of low plausibility. Yet every new idea begins with low plausibility. The Shangri-La Diet is an example (Drink sugar water to lose weight? Are you crazy?). We need personal science to find plausible new ideas. We also need personal science at the other extreme (high plausibility) to customize what we know. Everyone has their quirks and differences. No matter how well-established a solution, it needs to be tailored to you in particular — to what you eat, when you work, where you live, and so on. Professional scientists won’t do that. My personal science started off with customization. I tested various acne drugs that my dermatologist prescribed. It turned out that one of them didn’t work. It worked in general, just not for me. As I did more and more personal science, I started to discover that certain low-plausibility ideas were true. I’d guess that 99.99% of professional scientists never discover that a low-plausibility idea is true. Whereas I’ve made several such discoveries.

Professional scientists need personal scientists to come up with new ideas plausible enough to be worth testing. The rest of us need personal scientists for the sake of our health. We need them to find new solutions and customize existing ones.


More Trouble in Mouse Animal-Model Land

Mice — inbred to reduce genetic variation — are used as laboratory models of humans in hundreds of situations. Researchers assume there are big similarities between humans and one particular genetically-narrow strain of mouse. A new study, however, found that the correlation between human genomic changes after various sorts of damage (“trauma”, burn, endotoxins in the blood, and so on) and mouse genomic changes was close to zero.

According to a New York Times article about the study, the lack of correlation “helps explain why every one of nearly 150 drugs tested at huge expense in patients with sepsis [severe blood-borne infection] has failed. The drug tests all were based on studies in mice.”

This supports what I’ve said about the conflict between job and science. If your only goal is to find a better treatment for sepsis, after ten straight failures you’d start to question what you are doing. Is there a better way? you’d wonder. After twenty straight failures, you’d give up on mouse research and start looking for a better way. However, if your goal is to do fundable research with mice — to keep your job — failures to generalize to humans are not a problem, at least in the short run. Failure to generalize actually helps you: It means more mouse research is needed.

If I’m right about this, it explains why researchers in this area have racked up an astonishing record of about 150 failures in a row. (The worst college football team of all time only lost 80 consecutive games.) Terrible for anyone with sepsis, but good for the careers of researchers who study sepsis in mice. “Back to the drawing board,” they tell funding agencies. Who are likewise poorly motivated to react to a long string of failures. They know how to fund mouse experiments. Funding other sorts of research would be harder.

In the comments on the Times article, some readers had trouble understanding that 10 failures in a row should have suggested something was wrong. One reader said, “If one had definitive, repeatable, proof that the [mouse model] approach wouldn’t work…..well, that’s one thing.” Not grasping that 150 failures in a row is repeatable in spades.

When this ground-breaking paper was submitted to Science and Nature, the two most prestigious journals, it was rejected. According to one of the authors, the reviewers usually said, “It has to be wrong. I don’t know why it is wrong, but it has to be wrong.” 150 consecutive failed drug studies suggest it is right.

As I said four years ago about similar problems,

When an animal model fails, self-experimentation looks better. With self-experimentation you hope to generalize from one human to other humans, rather than from one genetically-narrow group of mice to humans.

Thanks to Rajiv Mehta.

Web Browsers, Black Swans and Scientific Progress

A month ago, I changed web browsers from Firefox to Chrome (which recently became the most popular browser). Firefox crashed too often (about once per day). Chrome crashes much less often (once per week?), presumably because it confines trouble caused by a bad tab to that tab. “Separate processes for each tab is EXACTLY what makes Chrome superior” to Firefox, says a user. This localization was part of Chrome’s original design (2008).

After a few weeks, I saw that crash rate was the only difference between the two browsers that mattered. After a crash, it takes a few minutes to recover. With both browsers, the “waiting time” distribution — the distribution of the time between when I try to reach a page (e.g., click on a link) and when I see it — is very long-tailed (very high kurtosis). Almost all pages load quickly (< 2 seconds). A few load slowly (2-10 seconds). A tiny fraction (0.1%?) cause a crash (minutes). The Firefox and Chrome waiting-time distributions are essentially the same except that the Chrome distribution has a thinner tail. As Nassim Taleb says about situations that produce Black Swans, very rare events (in this case, the very long waiting times caused by crashes) matter more (in this case, contribute more to total annoyance) than all other events combined.
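
A rough model of the point (the rates are my guesses, roughly consistent with one crash per day): even if crashes are only half a percent of page loads, they can contribute more total waiting time than all ordinary loads combined.

```python
import random

random.seed(1)

def wait_seconds():
    """Toy waiting-time distribution for one page load (guessed rates)."""
    r = random.random()
    if r < 0.005:                      # ~0.5%: crash, minutes to recover
        return random.uniform(120, 600)
    if r < 0.025:                      # ~2%: slow page
        return random.uniform(2, 10)
    return random.uniform(0.2, 2)      # almost all loads: fast

waits = [wait_seconds() for _ in range(100_000)]
crash_time = sum(w for w in waits if w >= 120) / 60
other_time = sum(w for w in waits if w < 120) / 60
print(f"crashes: {sum(w >= 120 for w in waits)} of {len(waits)} loads")
print(f"minutes lost to crashes:         {crash_time:6.0f}")
print(f"minutes lost to everything else: {other_time:6.0f}")
```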

Curious about Chrome/Firefox differences, I read a recent review (“Chrome 24 versus Firefox 18 — head to head”). Both browsers were updated shortly before the review. The comparison began like this:

Which browser got the biggest upgrade? Who’s the fastest? The safest? The easiest to use? We took a look at Chrome 24 and Firefox 18 to try and find out.

Not quite. The review compared the press releases about the upgrades. It said nothing about crash rate.

Was the review superficial because the reviewer wasn’t paid enough? If so, Walt Mossberg, the best-paid tech reviewer in the world, might do a good review. The latest browser review by Mossberg I could find (2011) says this about “speed”:

I found the new Firefox to be snappy. . . . The new browser didn’t noticeably slow down for me, even when many tabs were opened. But, in my comparative speed tests, which involve opening groups of tabs simultaneously, or opening single, popular sites, like Facebook, Firefox was often beaten by Chrome and Safari, and even, in some cases, by the new version 9 of IE . . . These tests, which I conducted on a Hewlett-Packard desktop PC running Windows 7, generally showed very slight differences among the browsers.

No mention of crash rate, the main determinant of how long things take. Mossberg ignores it — the one difference between Chrome and Firefox that really matters. He’s not the only one. As far as I can tell, all tech reviewers have failed to measure browser crash rate. For example, this review of the latest Firefox. “I’m still a big Firefox fan,” says the reviewer.

Browser reviews are a small example of a big rule: People with jobs handle long-tailed distributions poorly. In the case of browser reviews, the people with jobs are the reviewers; the long-tailed distribution is the distribution of waiting times/annoyance. Reviewers handle this distribution badly in the sense that they ignore tail differences, which matter enormously.

Another browser-related example of the rule is the failure of the Mozilla Foundation (people with jobs) to solve Firefox’s crashing problem. My version of Firefox (18.0.1) crashed daily. Year after year, upgrade after upgrade, people at Mozilla failed to add localization. Their design is “crashy”. They fail to fix it. Users notice, change browsers. Firefox may become irrelevant for this one reason. This isn’t Clayton Christensen’s “innovator’s dilemma”, where industry-leading companies become complacent and lose their lead. People at Mozilla have had no reason to be complacent.

Examples of the rule are all around us. Some are easy to see:

1. Taleb’s (negative) Black Swans. Tail events in long-tailed distributions often have huge consequences (making them Black Swans) because their possibility has been ignored or their probability underestimated. The system is not designed to handle them. All of Taleb’s Black Swans involve man-made systems: the financial system, hedge funds, New Orleans’s levees, and so on. These systems were built by people with jobs and react poorly to rare events (e.g., Long Term Capital Management). Taleb’s anti-fragility is what others have called hormesis. Hormesis protects against bad rare events. It increases your tolerance, the dose (e.g., the amount of poison) needed to kill you. As Taleb and others have said, many complex systems (e.g., cells) have hormesis. All of these systems were fashioned by nature, none by people with jobs. No word means anti-fragile, as Taleb has said, because there exist no products or services with such a property. (Almost all adjectives and nouns were originally created to describe products and services, I believe. They helped people trade.) No one wanted to say buy this, it’s anti-fragile. Designers didn’t (and still don’t) know how to add hormesis. They may even be unaware the possibility exists. Products are designed by people with jobs. Taleb doesn’t have a job. Grasping the possibility of anti-fragility — which includes recognizing that tail events are underestimated — does not threaten his job or make it more difficult. If a designer tells her boss about hormesis, her boss might ask her to include it, making the designer’s job harder. So designers have little incentive to bring it up.

2. The Boeing 787 (Dreamliner) has had battery problems. The danger inherent in use of a lithium battery has a long-tailed distribution: Almost all uses are safe, a very tiny fraction are dangerous. In spite of enormous amounts of money at stake, Boeing engineers (people with jobs) failed to devise adequate battery testing and management. The FAA (people with jobs) also missed the problem.

3. The designers of the Fukushima nuclear power plant (people with jobs) were perfectly aware of the possibility of a tsunami. They responded badly (did little or nothing) when their assumptions about tsunami likelihood were criticized. The power of the rule is suggested by the fact that this happened in Japan, where most things are well-made.

4. Drug companies (people with jobs) routinely hide or ignore rare side effects, judging by the steady stream of examples that come to light. An example is the tendency of SSRIs to produce violence, including suicide. The whole drug regulatory system (people with jobs) seems to do a poor job with rare side effects.

Why is the rule true? Because jobs require steady output. Tech reviewers want to write a steady stream of reviews. The Mozilla Foundation wants a steady stream of updates. Companies that build nuclear power plants want to build them at a steady rate. Boeing wants to introduce new planes at a steady rate. Harvard professors (criticized by Taleb) want to publish regularly. At Berkeley, when professors come up for promotion, they are judged by how many papers they’ve written. Long-tailed distributions interfere with steady output. To seriously deal with them you have to measure the tails. That’s hard. Adding hormesis (Nature’s protection against tail events) to your product is even harder. Testing a new feature to learn its effect on tail events is hard.

This makes it enormously tempting to ignore tail events. Pretend they don’t exist, or that your tests actually deal with them. At Standard & Poor’s, which rated all sorts of financial instruments, people in charge grasped that they were doing a bad job modelling long-tailed distributions and introduced new testing software that did a better job. S & P employees rebelled: We’ll lose business. Too many products failed the new tests. So S & P bosses watered down the test: “If the transaction failed E3.0, then use E3Low [which assumes less variance].” Which test (E3.0 or E3Low) was more realistic? The employees didn’t care. They just wanted more business.

It’s easy to rationalize ignoring tail events. Everyone ignores them. Next tsunami, I’ll be dead. The real reason they are ignored is that if your audience is other people with jobs (e.g., a regulatory agency, reviewers for a scholarly journal, doctors), it is easy to get away with ignoring them or making unrealistic assumptions about them. Tail events from long-tailed distributions make a regulator’s job much harder. They make a doctor’s job much harder. If doctors stopped ignoring the long tails, they would have to tell patients That drug I just prescribed — I don’t know how safe it is. The hot potato (unrealistic risk assumptions) is handed from one person to another within a job-to-job system (e.g., drug companies market new drugs to the FDA and to doctors), but eventually the hot potato (or ticking time bomb) must be handed outside the job-to-job system to an ordinary Person X (e.g., a doctor prescribes a drug to a patient). It is just one of many things that Person X buys. He doesn’t have the time or expertise to figure out whether what he was told about risk (the probability of very bad, very rare events) is accurate. Eventually, however, inaccurate assumptions about tail events may be exposed when people without jobs related to the risk (e.g., parents whose son killed himself after taking Prozac, everyone in Japan, airplane passengers who will die in a plane crash) are harmed. Such people, unlike people with related jobs, are perfectly free to complain, and willful ignorance may come to light. In other words, doctors cannot easily complain about poor treatment of rare side effects (and don’t), but patients and their parents can (and do).

There are positive Black Swans too. In some situations, the distribution of benefit has a very long tail: Almost all events in Category X produce little or no benefit, a tiny fraction produce great benefit. One example is scientific observations. Almost all of them have little or no benefit, a very tiny fraction are called discoveries (moderate benefit), and a very very tiny fraction are called great discoveries (great benefit). Another example is meeting people. Almost everyone you meet — little or no benefit. A tiny fraction of people you meet — great benefit. A third example is reading something. In my life, almost everything I’ve read has had little or no benefit. A very tiny fraction of what I’ve read has had great benefit.
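To see how extreme this concentration can be, here is a minimal sketch (the tail exponent is invented) using a Pareto distribution, a standard model for long tails: the top 1% of events can carry well over half the total benefit.

```python
# Minimal sketch (invented parameters): with a long-tailed benefit
# distribution, a tiny fraction of events carries most of the total benefit.
import random

random.seed(1)
alpha = 1.1  # hypothetical tail exponent; closer to 1 = heavier tail
# Pareto (minimum 1) draws via inverse transform: X = U^(-1/alpha)
benefits = sorted((random.random() ** (-1 / alpha) for _ in range(100_000)),
                  reverse=True)
top = benefits[:1_000]  # the best 1% of events
print(f"top 1% of events: {sum(top) / sum(benefits):.0%} of total benefit")
```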

I came to believe that people with jobs handle long-tailed distributions badly because I noticed that jobs and science are a poor mix. My self-experimentation was science, but it was absurdly successful compared to my professional science (animal learning research). I figured out several reasons for this, but in a sense they all came down to one: my self-experimentation was a hobby, my professional science was a job. My self-experimentation gave me total freedom, infinite time, and commitment to finding the truth and nothing else. My job, like any job, did not. And, as I said, I saw that scientific progress per observation had a power-law-like distribution: Almost all observations produce almost no progress, a tiny fraction produce great progress.

It is easy enough for scientists to recognize the shape of the distribution of progress per observation, but if they don’t actually study the distribution, they won’t understand it well. Professional scientists ignore it. Thinking about it would not help them get grants and churn out papers. (Grants are given by people with jobs, who also ignore the distribution.) Because they don’t think about it, they have no idea how to change the “slope” of the power-law distribution (such distributions are linear on log-log coordinates). In other words, they have no idea how to make rare events more likely. Because it is almost impossible to notice the absence of very rare events (the great discoveries that don’t get made), no one notices. I seem to be the only one who points out that, year after year, the Nobel Prize in Physiology or Medicine indicates a lack of progress on major diseases. When I was a young scientist, I wanted to learn how to make discoveries. I was surprised to find that everything written on the topic — which seemed pretty important — was awful. Now I know why. Everything on the topic was written by a person with a job.
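For readers who want the “slope” point spelled out: for a power law, the survival function is a straight line on log-log coordinates, and that slope is the whole story about rare events. A minimal sketch with invented numbers:

```python
# Minimal sketch (invented numbers): for a power law with survival function
# P(X > x) = x^(-alpha), log P is linear in log x with slope -alpha.
# A shallower slope (smaller alpha) makes rare big events far more common.
for alpha in (2.0, 1.2):
    x = 100.0  # hypothetical threshold for a "great discovery"
    print(f"alpha = {alpha}: P(benefit > {x:g}) = {x ** -alpha:.1e}")
```

Flattening the slope from 2.0 to 1.2 makes a size-100 event about forty times more likely; that, in concrete terms, is what it would mean to make great discoveries less rare.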

With long-tailed distributions of benefit, there is nothing like hormesis. If any organism has evolved something to improve long-tailed distributions of benefit, I don’t know what it is. Our scientific system handles the long-tailed distribution of progress poorly in two ways:

1. The people inside it, such as professional scientists, do a poor job of increasing the rate of progress, i.e., making the tails thicker. I think you can make the tails thicker via subject-matter knowledge (Pasteur’s “chance favors the prepared mind”), methodological knowledge (better measurements, better experiments, better data analysis), and novelty. Professional scientists understand the value of the first two factors, but they ignore the third. They like to do the same thing over and over because it is safer. Great for their careers, terrible for the rest of us.

2. When an unlikely observation comes along, the system is not set up to develop it. An example is Galvani’s discovery of galvanism, which led to batteries, which led to widespread electricity. This one discovery, from one observation, arguably produced more progress than all the scientific observations of the last 100 years. Galvani’s job (surgery research) left him unable to go further with his discovery. (“Galvani had certain commitments. His main one was to present at least one research paper every year at the Academy.”) It left him unable to develop one of the greatest discoveries of all time. In contrast, Darwin (no job) was able to develop the observations that led to his theory of evolution. It took him 18 years to write one book, longer than any job would have allowed. He wouldn’t have gotten tenure at Berkeley.

After a discovery has been made, the shape of the benefit distribution changes. It becomes more Gaussian, less long-tailed. As our understanding increases, science becomes engineering, which becomes design, which becomes manufacturing. Engineering and design and making things fit well with having a job. Take my chair. Every time I use it, I get a modest benefit, always about the same size. Every time I use my pencil, I get a modest benefit, always about the same size. No long-tailed distribution.

Modern science works well as a way of developing discoveries, not making them. An older system was better for encouraging discovery. Professors mainly taught. Their output was classes taught. They did a little research on the side. If they found something, fine; they had enough expertise to publish it, but nothing depended on their rate of publication. Mendel was expert enough to write up his discoveries, but his job in no way required him to do so. Just as Taleb recommends that most of your investments be low-risk, with a small fraction high-risk, this is a “job portfolio” where most of the job is low benefit with high certainty and a small fraction of the job is high benefit with low certainty (a sketch of the arithmetic follows this paragraph). In the debate over climate change (is the case that humans are dangerously warming the planet as strong as we’re told?), it is striking that everyone with any power on the mainstream side of the debate (scientists, journalists, professional activists) has a job involving the subject. Everyone on the other side with any power (Stephen McIntyre, Bishop Hill, etc.) does not. People without jobs are much more free to speak the truth as they see it.
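Here is that sketch, with all figures invented: the point of the Taleb-style split is that the worst case is capped near what you staked on the long shot, while the upside is open-ended.

```python
# Minimal sketch of the "job portfolio" / barbell idea (invented numbers):
# most effort goes to a safe, modest payoff; a small fraction goes to a
# long shot whose downside is capped at the stake.
safe_fraction, risky_fraction = 0.9, 0.1
safe_return = 1.02                 # hypothetical: the safe part grows 2%
p_jackpot, jackpot = 0.01, 100.0   # hypothetical long shot: 1% chance of 100x

worst_case = safe_fraction * safe_return  # the long shot pays nothing
expected = worst_case + risky_fraction * p_jackpot * jackpot
print(f"worst case: {worst_case:.3f}x, expected: {expected:.3f}x "
      f"(downside capped, upside open-ended)")
```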

We need personal science (using science to help yourself) to better handle long-tailed distributions, but not just for that reason. Jobs disable people in other ways, too. Personal science matters, I’ve come to believe, for three reasons.

1. Personal scientists can make discoveries that professional scientists cannot. The Shangri-La Diet is one example. Tara Grant’s discovery of the effect of changing the time of day she took Vitamin D is another. For all the reasons I’ve said.

2. Personal scientists can develop discoveries that professional scientists cannot. Will there be a clinical trial of the Shangri-La Diet (by a professional weight-control researcher) in my lifetime? Who knows. It is so different from what they now believe. (When I applied to the UC Berkeley Animal Care and Use Committee for permission to do animal tests of SLD, I was turned down. It couldn’t possibly be true, said the committee.) Long before that, the rest of us can try it for ourselves and tell others what happened.

3. By collecting data, personal scientists can help tailor any discovery, even a well-developed one, to their own situation. For example, they can make sure a drug or a diet works. (That’s how my personal science started — testing an acne medicine.) They can test home remedies. By tracking their health with sensitive tests, they can make sure a prescribed drug has no bad side effects. Individualizing treatments takes time, which gets in the way of steady output. You have all the time in the world to gather data that will help you be healthy. Your doctor doesn’t. People who have less contact with you than your doctor, such as drug companies, insurance companies, medical school professors, and regulatory agencies, are even less interested in your special case.