The Effects of Institutionalization on Children

From the latest issue of the American Journal of Psychiatry:

Young children living in institutions in Bucharest were enrolled when they were between 6 and 30 months of age. Following baseline assessment, 136 children were randomly assigned to care as usual (continued institutional care) or to removal and placement in foster care that was created as part of the study. Psychiatric disorders, symptoms, and comorbidity were examined by structured psychiatric interviews of caregivers of 52 children receiving care as usual and 59 children in foster care when the children were 54 months of age. Both groups were compared to 59 typically developing, never-institutionalized Romanian children recruited from pediatric clinics in Bucharest. Foster care was created and supported by social workers in Bucharest who received regular consultation from U.S. clinicians. Results: Children with any history of institutional rearing had more psychiatric disorders than children without such a history (53.2% versus 22.0%). Children removed from institutions and placed in foster families were less likely to have internalizing disorders than children who continued with care as usual (22.0% versus 44.2%). Boys were more symptomatic than girls regardless of their caregiving environment and, unlike girls, had no reduction in total psychiatric symptoms following foster placement.

Note the phrase “internalizing disorders” — it means that other types of disorders were not decreased by the expensive treatment. Moreover, the 22.0% “control” value is probably higher than what you’d find if all kids of that age were surveyed; I assume the kids found at pediatric clinics are less healthy than average. Although the experiment is trying to show a (negative) effect of institutionalization, it doesn’t even manage to do that very well, because of the cherry-picking aspect of the results. All in all, a horrible situation.

Micromeasures of development — something you can measure every week, for example — might help so that many little things could be tried with individual children rather than doing these difficult large-scale experiments.

The whole thing has the feel of the 1800s when to be institutionalized was to be at high risk for some sort of vitamin deficiency, such as pellagra or beriberi.

How the Truth Comes Out (continued)

In a previous post I wrote about the need for independence — safety from retaliation — to tell the truth. Here is Jane Jacobs’s brush with this fact of life, from a 2006 interview in Urban Design magazine:

I got a grant from the Rockefeller Foundation [to write her first book]; well, apparently, rumor quickly reached Harvard and MIT that I had this grant, and they had started something called the “joint urban design center,” something like that. So, I was invited up there to have lunch at Harvard by, I think it was Martin Meier at the time, and I forget who the MIT one was, but the three of us had lunch, so they had worked out what I had to spend my time on. (I had no connection with them, they just heard somebody had a grant, and they would try to recommend . . . ) What I was to do was to make out a question [a survey], and spend my time on questioning people who lived in middle income housing projects, to see what they liked about them and what they didn’t like, and that was to be my book on the city!

Well, I was so glad that I was not a graduate student there, I felt so sorry for anybody who was caught in that trap and had to do that kind of junk, and so I thanked them very much for their interest and left them. Oh my god! I was out of there, because I could hardly wait to leave this behind: disgusting, absolutely disgusting! And that’s what their interest in cities was, just junk like that….and different people trying to further their own career by roping in other people. And it was not to really find out things.

If the Harvard and MIT profs had said to Jacobs, “can we help you?” that would have been one thing. Under the guise of being helpful they said the opposite: Here’s how you can help us.

The Naysayers

Jane Jacobs called them squelchers: People in powerful positions who say no to new ideas. The effect of such people is that problems remain unsolved.

What about scientists? On the face of it, research is about discovering new ideas. If you’re a non-scientist, you might even think that is the whole point of research. Certainly that is why it is supported with tax dollars — taxpayers hope research will improve health, for example. But quite a few researchers don’t see it that way.

In London, a group called Business in the Community is creating “toolkits” to help companies improve employee health. One toolkit is about emotional resilience. An early draft of that toolkit contained this passage:

Heart attacks and other ischemic cardiovascular diseases can be created by stressful office dynamics that come from the top. Even one year of working under a manager with poor leadership skills can raise the risk of acute myocardial infarction, unstable angina, cardiac death, or ischemic heart disease death by a significant 24%, while four years under the same stressful conditions produces a 39% elevated risk of ischemic heart disease events.

The data are from Nyberg A, et al., “Managerial leadership and ischaemic heart disease among employees: the Swedish WOLF study” Occup Environ Med 2008. At a meeting to discuss this toolkit, attended by representatives of large companies and a few academics, the academics objected to this passage. The study was methodologically flawed, they said. “But what if it’s true?” the non-academics said. The passage was removed.

On The Larry King Show a few years ago I heard a prominent woman psychiatrist (Nancy Andreasen or Kay Jamison) say that it was a good time, if there ever was one, to have a mental illness because it was a golden age of psychiatric research. Researchers, she said, were making one breakthrough after another. Nobody asked her why, if that was so, bipolar disorder was still being treated with lithium. That’s 50 years old. Why hasn’t research come up with something better? Two psychiatric researchers believe it is because research proposals to test new treatments are turned down due to what the critics call inadequate methodological purity. For example, you’re supposed to do such studies with patients who have only bipolar disorder, although comorbidity is common.

This is the behavior produced by what academics (admiringly!) call “critical thinking”: ignoring what is valuable or promising in a rush to point out what is imperfect.

Something is better than nothing.

Trouble in Mouse Animal-Model Land

Most drugs are first tested on animals, often on “animal models” of the human disease at which the drug is aimed. This 2008 Nature article reveals that in at least one case, the animal model is flawed in a way no one really understands:

In the case of ALS, close to a dozen different drugs have been reported to prolong lifespan in the SOD1 mouse, yet have subsequently failed to show benefit in ALS patients. In the most recent and spectacular of these failures, the antibiotic minocycline, which had seemed modestly effective in four separate ALS mouse studies since 2002, was found last year to have worsened symptoms in a clinical trial of more than 400 patients.

I think that “close to a dozen” means about 12 in a row, rather than 12 out of 500. The article is vague about this. A defender of the mouse model said this:

As for the failed clinical trial of minocycline, Friedlander suggests that the drug may have been given to patients at too high a dose — and a lower dose might well have been effective. “In my mind, that was a flawed study,” he says.

Not much of a defense.

That realization is spreading: some researchers are coming to believe that tests in mouse models of other neurodegenerative conditions such as Alzheimer’s and Huntington’s may have been performed with less than optimal rigor. The problem could in principle apply “to any mouse model study, for any disease”, says Karen Duff of Columbia University in New York, who developed a popular Alzheimer’s mouse model.

“Less than optimal rigor”? Oh no. Many scientists seem to believe that every problem is due to failure to follow some rules they read in a book somewhere. They have no actual experience testing this belief (which I’m sure is false — the world is a lot more complicated than as described in their textbooks); they just feel good criticizing someone else’s work like that. In this case, the complaints include “small sample sizes, no randomization of treatment and control groups, and [no] blinded evaluations of outcomes.” Very conventional criticisms.

Here’s a possibility no one quoted in the article seems to realize: The studies were too rigorous, in the sense that the two groups (treatment and control) were too similar prior to getting the treatment. These studies always try to reduce noise. A big source of noise, for example, is genetic variability. The less variability in your study, however, the less likely your finding will generalize, that is, be true in other situations. The Heisenberg Uncertainty Principle of experimental design. Not in any textbook I’ve seen.

In the 1920s and 30s, a professor in the UC Berkeley psychology department named Robert Tryon tried to breed rats for intelligence. His measure of intelligence was how fast they learned a maze. After several generations of selective breeding he derived two strains of rats, Maze Bright and Maze Dull, which differed considerably in how fast they learned the maze. But the maze-learning differences between these two groups didn’t generalize to other learning tasks; whatever they were bred for appeared to be highly specific to maze learning. The measure of intelligence lacked enough variation. It was too rigorous.

When an animal model fails, self-experimentation looks better. With self-experimentation you hope to generalize from one human to other humans, rather than from one genetically narrow group of mice to humans.

Thanks to Gary Wolf.

Steve Levitt and John List Teach Experimentation to MBA Students

From the Financial Times:

“The level of experimentation [at big businesses such as United Airlines] is abysmal,” says Prof List. “These firms do not take full advantage of feedback opportunities they’re presented with. After seeing example after example, we sat down and said, ‘We have to try to do something to stop this.’ One change we could make is to teach 75 to 100 of the best MBA students in the world how to think about feedback opportunities and how to think about designing their own field experiments to learn something that can make their company better.”

The two economists decided to team up to develop a course for [University of Chicago] Booth [Business School] students on “Using Experiments in Firms” – the first time either had taught at the business school.

This is an interesting middle ground between conventional science (done by professors) and what I have done a lot of (self-experimentation to solve my own problems, such as sleeping better). In my case, (a) I’m trying to solve my own problems and (b) it’s not a job. Conventional scientists are (a) trying to solve other people’s problems and (b) doing it as a job. The MBA students will be taught experimentation that involves their own problems (well, their own company’s problems), and it is a job.

One important effect of this course, if the whole idea catches on, could be a cultural shift: A growing belief that experimentation is good and that failure to experiment is bad. Some of my first self-experiments involved acne. I was a grad student. When I told my dermatologist what I’d done — my results showed that a medicine he’d prescribed didn’t work — he looked unhappy. “Why did you do that?” he asked.

The Levitt/List course has a Martin-Luther-esque ring to it. Science: Not just for other people.

Thanks to Nadav Manham.

“Baffling” Link Between Autism and Vinyl Floors

From Scientific American:

Children who live in homes with vinyl floors, which can emit chemicals called phthalates, are more likely to have autism, according to research by Swedish and U.S. scientists published Monday.

The study of Swedish children is among the first to find an apparent connection between an environmental chemical and autism.

The scientists were surprised by their finding, calling it “far from conclusive.” Because their research was not designed to focus on autism, they recommend further study of larger numbers of children to see whether the link can be confirmed. . . .

The researchers found four environmental factors associated with autism: vinyl flooring, the mother’s smoking, family economic problems and condensation on windows, which indicates poor ventilation.

Here, in a nutshell, are several of the weaknesses with the way epidemiology is currently practiced. I doubt there is anything to this, but who knows? It deserves further investigation. Here’s what could have been better:

1. The researchers did dozens of statistical tests but did not correct for the number of tests. This means there will be a high rate of false positives. The researchers appear to not quite understand this. They don’t need “further study of larger numbers” of subjects — they simply need studies of different populations. The sample size isn’t the problem; the statistical test corrects for that. It is the researchers’ failure to correct for number of tests that makes this evidence so weak.

2. They did their dozens of tests on highly correlated variables. This is like buying two of something you only need one of. A big waste. That they measured something as specific as vinyl flooring implies they gave a long questionnaire to their subjects. Perhaps there were 100 questions. Answers to those questions are likely to be highly correlated. Expensive homes tend to be different in several ways from cheaper homes. The presence/absence of vinyl flooring is likely to be correlated with family economic conditions and condensation on windows (more expensive = better ventilation). The researchers could have used factor analysis or principal components analysis to boil down their long questionnaire into a small number of factors — like 4. So instead of doing 100 tests, they could have done 4 much stronger tests. Then, if there was an unexpected correlation, there would be a good reason to take it seriously.
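To put rough numbers on point 1, here is a minimal Python sketch of how the chance of at least one false positive grows with the number of tests. It assumes independent tests and a 0.05 threshold, neither of which comes from the study (correlated questionnaire answers would behave somewhat differently). The function names are made up for illustration, and the 100 and 4 are the guesses used above.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests
    when there is no real effect anywhere (assumed: independence)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false-positive rate
    at or below alpha (the Bonferroni correction)."""
    return alpha / num_tests


# Compare one test, a handful of factor-level tests, and a long questionnaire.
for num_tests in (1, 4, 100):
    rate = familywise_error_rate(num_tests)
    threshold = bonferroni_threshold(num_tests)
    print(f"{num_tests:>3} tests: P(any false positive) = {rate:.3f}, "
          f"corrected per-test threshold = {threshold:.4f}")
```

Under these assumptions, 100 uncorrected tests almost guarantee at least one spurious "finding", while 4 tests leave the chance under one in five.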

Someone quoted later in the article gets it completely wrong:

Dr. Philip Landrigan, a pediatrician who is director of the Children’s Environmental Health Center at Mount Sinai School of Medicine, called the results “intriguing, but in my mind preliminary because they are based on very small numbers.”

Nope. Statistical tests correct for sample size. This is like an astronomer saying the sun revolves around the earth. In this article this happens twice.

Hey, What Happened to My Brain? (part 3)

The data I posted that showed a sudden improvement in my arithmetic ability is among the most interesting data I’ve ever collected. Not because it revealed something wildly new — I was already sure flaxseed oil helped — but because it revealed something intriguing and new (the time course of the improvement is puzzling).

I collected the data in an unusual way — watchful waiting. I didn’t do an experiment, the way experimental psychology data is usually collected. I didn’t do a survey, the way epidemiological data is collected. In the emphasis on one person it resembles a case report in medical journals — but I didn’t have a problem to be solved and the data is far more numerical and systematic than the data in a case report.

And this rarely-used scientific method paid off. Hmm. I think the scientific methods currently taught have a big weakness: They focus almost entirely on idea testing, whereas idea generation is just as important. Tools that work well for idea testing work poorly for idea generation. The effect of this imbalance — a kind of nutritional deficiency in intellectual diet — is that scientists don’t do a good job of coming up with new ideas.

What should scientists be doing? I would like to find out. My watchful-waiting data collection is/was part of trying to find out. That it paid off pretty quickly is a good sign. It’s the third step in a long process. Step 1: When I was a grad student, my acne self-experimentation led me to realize that one of my prescribed medicines didn’t work, a surprising and useful new idea. Step 2: Later self-experiments had the same effect: they generated surprising and useful ideas, at a much higher rate than my conventional experiments. Why? Perhaps because self-experimentation involves cheap, frequent tests of something important. Step 3: Arrange such a situation (cheap, frequent tests of something important) and see what happens.

Will Like vs. Might Love vs. Might Hate

What to watch? Entertainment Weekly has a feature called Critical Mass: Ratings of 7 critics are averaged. Those averages are the critical response that most interests me. Rotten Tomatoes also computes averages over critics. It uses a 0-100 scale. In recent months, my favorite movie was Gran Torino, which rated 80 at Rotten Tomatoes (quite good). Slumdog Millionaire, which I also liked, got a 94 (very high).

Is an average the best way to summarize several reviews? People vary a lot in their likes and dislikes — what if I’m looking for a movie I might like a lot? Then the maximum (best) review might be a better summary measure; if the maximum is high, it means that someone liked the movie a lot. A score of 94 means that almost every critic liked Slumdog Millionaire, but the more common score of 80 is ambiguous: Were most critics a bit lukewarm or was wild enthusiasm mixed with dislike? Given that we have an enormous choice of movies — especially on Rotten Tomatoes – I might want to find five movies that someone was wildly enthusiastic about and read their reviews. Movies that everyone likes (e.g., 94 rating) are rare.

Another possibility is that I’m going to the movies with several friends and I just want to make sure no one is going to hate the chosen movie. Then I’d probably want to see the minimum ratings, not the average ratings.

So: different questions, wildly different “averages”. I have never heard a statistician or textbook make this point except trivially (if you want the “middle” number choose the median, a textbook might say). The possibility of “averages” wildly different from the mean or median is important because averaging is at the heart of how medical and other health treatments are evaluated. The standard evaluation method in this domain is to compare the mean of two groups — one treated, one untreated (or perhaps the two groups get two different treatments).
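Here is a small Python sketch of the point, using made-up ratings for two hypothetical movies on a 0-100 scale. The titles, the numbers, and the `pick` helper are all assumptions for illustration, not anything Rotten Tomatoes or Entertainment Weekly provides. Both movies have the same mean, yet the "might love" and "no one hates" questions pick different winners.

```python
from statistics import mean

# Hypothetical critic ratings (0-100), invented for this example.
RATINGS = {
    "Movie A": [78, 80, 82, 80, 80],    # everyone mildly positive
    "Movie B": [100, 100, 95, 75, 30],  # several raves, one pan
}


def pick(ratings, summary):
    """Return the title whose ratings score highest under the given summary."""
    return max(ratings, key=lambda title: summary(ratings[title]))


for title, scores in RATINGS.items():
    print(title, "mean:", mean(scores), "max:", max(scores), "min:", min(scores))

# Different questions, different summaries:
print("might love (highest maximum):", pick(RATINGS, max))
print("no one hates (highest minimum):", pick(RATINGS, min))
```

The mean cannot tell the two movies apart; the maximum and the minimum can, and they disagree about which movie is "best".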

If there is time to administer only one treatment, then we probably do want the treatment most likely to help. But if there are many treatments available and there is time to administer more than one treatment — if the first one fails, try another, and so on — then it is not nearly so obvious that we want the treatment with the best mean score. Given big differences from person to person, we might want to know what treatments worked really well with someone. Conversely, if we are studying side effects, we might want to know which of two treatments was more likely to have extremely bad outcomes. We would certainly prefer a summary like the minimum (worst) to a summary like the median or mean.
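The "try another" logic can be sketched with a little arithmetic. Assume, purely for illustration, that each treatment works really well for a fixed fraction of people and that responses to different treatments are independent (a strong assumption). Then the chance that at least one of several treatments works well is one minus the product of the individual failure chances. The 20% response rate and the function name are hypothetical.

```python
from math import prod


def chance_at_least_one_works(response_rates):
    """Probability that at least one treatment works well when they are
    tried in turn, assuming independent responses (an assumption)."""
    return 1.0 - prod(1.0 - rate for rate in response_rates)


# A treatment that works really well for only 1 person in 5 looks poor
# on average, but trying five such treatments changes the picture.
one_try = chance_at_least_one_works([0.2])
five_tries = chance_at_least_one_works([0.2] * 5)
print(f"one treatment:   {one_try:.3f}")
print(f"five treatments: {five_tries:.3f}")
```

With time to try several options, treatments that help only a minority a great deal can add up to good odds for the individual, which is exactly the information a comparison of means hides.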

Outside of emergency rooms, there is usually both a wide range of treatment choice and plenty of time to try more than one, for example when you want to lower your blood pressure. In such situations, extreme outcomes, even if rare, become far more important than averages. You want to avoid the extremely bad (even if rare) outcomes, such as antidepressants that cause suicide. And if a small fraction of people respond extremely well to a treatment that leaves most people unchanged, you want to know that, too. This is why medical experts who deride “anecdotal evidence” are like people trying to speak a language they don’t know, and don’t realize they don’t know. (Their cluelessness is enshrined in a saying: the plural of anecdote is not data.) Non-experts grasp this, I think. This is why they are legitimately interested in anecdotal evidence, which does a better job than means or medians of highlighting extremes. It is the medical experts, who have read the textbooks but fail to understand their limitations, whose understanding has considerable room for improvement.

The Twilight of Expertise (medical doctors)

Long ago the RAND Corporation ran an experiment that found that additional medical spending provided no additional health benefit (except in a few cases). People who didn’t like the implication that ordinary medical care was at least partly worthless could say that it was only at the margin that the benefits stopped. This was unlikely but possible. Now a non-experimental study has found essentially the same thing:

To that end, Orszag has become intrigued by the work of Mitchell Seltzer, a hospital consultant in central New Jersey. Seltzer has collected large amounts of data from his clients on how various doctors treat patients, and his numbers present a very similar picture to the regional data. Seltzer told me that big-spending doctors typically explain their treatment by insisting they have sicker patients than their colleagues. In response he has made charts breaking down the costs of care into thin diagnostic categories, like “respiratory-system diagnosis with ventilator support, severity: 4,” in order to compare doctors who were treating the same ailment. The charts make the point clearly. Doctors who spent more — on extra tests or high-tech treatments, for instance — didn’t get better results than their more conservative colleagues. In many cases, patients of the aggressive doctors stay sicker longer and die sooner because of the risks that come with invasive care.

Perhaps the doctors who ordered the high-tech treatments, when questioned about their efficacy, would have responded as my surgeon did to a similar question about the surgery she recommended (and would make thousands of dollars from): The studies are easy to find, just use Google. (There were no studies.)

It’s like the RAND study: Defenders of doctors will say that some of them didn’t know what they were doing but the rest did. But that’s the most doctor-friendly interpretation. A more realistic interpretation is that a large fraction of the profession doesn’t care much about evidence. In everyday life, evidence is called feedback. If you are driving and you don’t pay attention to and fix small deviations from the middle of the road, eventually you crash. You don’t need a double-blind clinical trial not to crash your car — a lesson the average doctor, the average medical school professor, and the average Evidence-Based-Medicine advocate haven’t learned.

The Power Law of Scientific Dismissiveness

In my experience, scientists are much too dismissive, in the sense that most of them have a hard time fully appreciating other people’s work. This dismissiveness follows a kind of power law: a few of them spend a large amount of time being dismissive (e.g., David Freedman); a large number spend a small amount of time being dismissive. The really common form of dismissiveness goes like this (from a JAMA abstract):

In this second article, we enumerate the major issues in judging the validity of these studies, framed as critical appraisal questions. Was the disease phenotype properly defined and accurately recorded by someone blind to the genetic information? Have any potential differences between disease and non-disease groups, particularly ethnicity, been properly addressed? . . . Was measurement of the genetic variants unbiased and accurate? [bold added]

This is the dismissiveness of dichotomization: division of studies into valid and invalid, proper and improper, unbiased and biased, accurate and inaccurate. As if it were that simple. Such dichotomization throws away a lot of information. It leads to such absurdities as a meta-analysis of 2000 studies that decided that only 16 were worth inclusion. As if the rest contained no information of value. In the case of the term “accurate” the problem is easy to see. To draw a sharp line between accurate and inaccurate makes little sense and ignores the harder and more valuable question: how accurate?

The average scientist is religious in many ways, and this is one of them. It is part of what might be called religious method: the dichotomization of persons into good and bad. An example is saying you are either going to heaven or to hell — nothing between.