What to watch? Entertainment Weekly has a feature called Critical Mass: the ratings of seven critics are averaged. Those averages are the critical response that most interests me. Rotten Tomatoes also computes averages over critics, on a 0-100 scale. In recent months, my favorite movie was Gran Torino, which got an 80 at Rotten Tomatoes (quite good). Slumdog Millionaire, which I also liked, got a 94 (very high).
Is an average the best way to summarize several reviews? People vary a lot in their likes and dislikes — what if I’m looking for a movie I might like a lot? Then the maximum (best) review might be a better summary measure; if the maximum is high, it means that someone liked the movie a lot. A score of 94 means that almost every critic liked Slumdog Millionaire, but the more common score of 80 is ambiguous: Were most critics a bit lukewarm or was wild enthusiasm mixed with dislike? Given that we have an enormous choice of movies — especially on Rotten Tomatoes — I might want to find five movies that someone was wildly enthusiastic about and read their reviews. Movies that everyone likes (e.g., 94 rating) are rare.
Another possibility is that I’m going to the movies with several friends and I just want to make sure no one is going to hate the chosen movie. Then I’d probably want to see the minimum ratings, not the average ratings.
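To make the contrast concrete, here is a small sketch in Python with made-up critic scores (not the actual Rotten Tomatoes numbers): two movies with nearly identical mean ratings can look completely different once you ask about the maximum or the minimum.

```python
# Hypothetical 0-100 critic scores for two movies (invented numbers).
# Movie A: wild enthusiasm mixed with dislike. Movie B: uniformly lukewarm.
from statistics import mean

movie_a = [100, 95, 90, 85, 60, 50, 40]
movie_b = [78, 76, 75, 74, 73, 72, 72]

for name, scores in [("A (polarizing)", movie_a), ("B (lukewarm)", movie_b)]:
    print(f"Movie {name}: mean={mean(scores):.0f}, "
          f"max={max(scores)}, min={min(scores)}")

# Both means come out around 74, but the maximum says someone loved Movie A
# (useful if I'm hunting for a movie I might love), while the minimum says
# Movie B is the safer pick for a group that just wants no one to hate it.
```

Same mean, different answers, depending on which question you are asking.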
So: different questions, wildly different “averages”. I have never heard a statistician or textbook make this point except trivially (if you want the “middle” number, choose the median, a textbook might say). The possibility of “averages” wildly different from the mean or median is important because averaging is at the heart of how medical and other health treatments are evaluated. The standard evaluation method in this domain is to compare the means of two groups — one treated, one untreated (or perhaps the two groups get two different treatments).
If there is time to administer only one treatment, then we probably do want the treatment most likely to help. But if there are many treatments available and there is time to administer more than one — if the first one fails, try another, and so on — then it is not nearly so obvious that we want the treatment with the best mean score. Given big differences from person to person, we might want to know which treatments worked really well for someone. Conversely, if we are studying side effects, we might want to know which of two treatments was more likely to produce extremely bad outcomes. There we would certainly prefer a summary like the minimum (worst) to a summary like the median or mean.
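The same point can be shown with a small simulation. The numbers below are invented, not real trial data: hypothetical treatment X helps almost everyone a little, while hypothetical treatment Y leaves most people unchanged, helps a few enormously, and rarely does serious harm. The mean picks X; the best and worst cases tell a different story.

```python
# A sketch with simulated (not real) outcome data: when responses differ a
# lot from person to person, the mean hides what a patient who can try
# treatments one at a time would most want to know.
import random
from statistics import mean

random.seed(0)

# Treatment X: a modest benefit for almost everyone (change in some outcome score).
treatment_x = [random.gauss(5, 2) for _ in range(1000)]

# Treatment Y: no effect for most, a large benefit for a few,
# and a rare, severely bad outcome.
treatment_y = []
for _ in range(1000):
    r = random.random()
    if r < 0.05:            # 5% respond extremely well
        treatment_y.append(random.gauss(40, 5))
    elif r < 0.06:          # 1% have an extremely bad outcome
        treatment_y.append(random.gauss(-30, 5))
    else:                   # everyone else is essentially unchanged
        treatment_y.append(random.gauss(0, 2))

for name, outcomes in [("X", treatment_x), ("Y", treatment_y)]:
    print(f"Treatment {name}: mean={mean(outcomes):6.1f}, "
          f"best={max(outcomes):6.1f}, worst={min(outcomes):6.1f}")

# The mean favors X, but the best case favors Y (worth trying first if a
# failed try is cheap), and the worst case is a warning about Y's rare harm.
```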
Outside of emergency rooms, there is usually both a wide range of treatment choices and plenty of time to try more than one. For example, suppose you want to lower your blood pressure. In such situations, extreme outcomes, even if rare, become far more important than averages. You want to avoid the extremely bad (even if rare) outcomes, such as antidepressants that cause suicide. And if a small fraction of people respond extremely well to a treatment that leaves most people unchanged, you want to know that, too. Non-experts grasp this, I think, which is why they are legitimately interested in anecdotal evidence: it does a better job than means or medians of highlighting extremes. Medical experts who deride “anecdotal evidence” are like people trying to speak a language they don’t know — and don’t realize they don’t know. (Their cluelessness is enshrined in a saying: the plural of anecdote is not data.) It is these experts, who have read the textbooks but fail to understand their limitations, whose understanding has considerable room for improvement.
Seth,
What you are describing for the movies is what is known as a recommender system. Recommender systems are a huge interest in the machine learning community. One of the prime examples is the current Netflix competition (https://www.netflixprize.com/ ), where the best minds are trying to improve on the original Netflix algorithm by 10%:
https://www.netflixprize.com//leaderboard
Transferring this idea to the treatment process is very interesting.
Igor.
Metacritic shows you the rating for every review, in addition to the average rating. You can see, for instance, that Gran Torino (average rating 72 out of 100) didn’t get any ratings over 91 from the 34 reviews, while Benjamin Button (average = 70) got seven 100s. They also have user ratings, which could be more informative than reviewer ratings, but it’s harder to get a sense of their distribution because there are more of them and they aren’t organized as well.