Wednesday, September 12, 2018

Simpson's Paradox

“There are three types of lies -- lies, damn lies, and statistics.” Benjamin Disraeli


Simpson's paradox is a phenomenon in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. Understanding it is important for interpreting data correctly. Edward H. Simpson first addressed this phenomenon in a technical paper in 1951, but Karl Pearson et al. in 1899 and Udny Yule in 1903 had mentioned a similar effect earlier.

For example, suppose you and a friend each work through a set of problems, and your friend answers a higher proportion correctly than you on each of the two days that you compete. Does that mean your friend has answered a higher proportion correctly overall? Not necessarily!


  • On Saturday, you solved more problems than your friend, but your friend pointed out that he was more accurate, since he answered a higher proportion of his attempted problems correctly. Fair enough.
  • On Sunday, you attempted only a few problems. Your friend again answered a higher proportion of his attempted problems correctly, and gloated once again.
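However, the competition is about who solved problems more accurately over the whole weekend, not on individual days. Once the two days are added together, the proportion of problems you solved is higher than your friend's. Thus, despite your friend solving a higher proportion of problems on each day, you actually won the challenge by solving the higher proportion for the entire weekend. The reversal is possible because the two of you attempted very different numbers of problems on each day, so each day carries a different weight in each person's weekend total.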
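To make the reversal concrete, here is a minimal Python sketch. The counts are assumptions chosen for illustration so that they fit the story above (they are not necessarily the figures of the original example), and the names `you`, `friend`, `accuracy`, and `overall` are hypothetical helpers.

```python
from fractions import Fraction

# Hypothetical (solved, attempted) counts per day; illustrative assumptions only.
you = {"Saturday": (7, 8), "Sunday": (1, 2)}
friend = {"Saturday": (2, 2), "Sunday": (5, 8)}


def accuracy(solved, attempted):
    """Exact proportion of attempted problems that were solved."""
    return Fraction(solved, attempted)


def overall(results):
    """Pool the counts over all days, then take one proportion."""
    solved = sum(s for s, _ in results.values())
    attempted = sum(a for _, a in results.values())
    return Fraction(solved, attempted)


# Day by day, the friend has the higher proportion.
for day in you:
    print(day, "you:", accuracy(*you[day]), "friend:", accuracy(*friend[day]))

# Pooled over the weekend, the ordering flips.
print("Weekend", "you:", overall(you), "friend:", overall(friend))
```

With these assumed counts, the friend is ahead on both days, yet the pooled weekend proportion favours you, because most of your attempts fall on your strong day while most of your friend's fall on his weaker one.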
Simpson's paradox can also arise in correlations, in which two variables appear to have (say) a positive correlation with one another when in fact they have a negative correlation, the reversal having been brought about by a "lurking" confounder. When two different contexts compel us to take two opposite actions based on the same data, our decision must be driven not by statistical considerations alone, but by some additional information extracted from the context. Berman et al. give an example from economics, where a dataset suggests overall demand is positively correlated with price (that is, higher prices appear to lead to more demand), in contradiction of expectation. Analysis reveals time to be the confounding variable: plotting both price and demand against time reveals the expected negative correlation within various periods, which then reverses to become positive if the influence of time is ignored by simply plotting demand against price.
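The same effect can be sketched for correlations. The data below are synthetic and purely hypothetical (they are not taken from Berman et al.): within each time period demand falls as price rises, but both drift upward from one period to the next. The helper `pearson` is a plain implementation of the Pearson correlation coefficient written out for this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical (prices, demand) observations, grouped by time period.
periods = {
    "period 1": ([1, 2, 3], [3, 2, 1]),
    "period 2": ([11, 12, 13], [23, 22, 21]),
    "period 3": ([21, 22, 23], [43, 42, 41]),
}

# Correlation inside each period (time held roughly fixed).
within = {name: pearson(p, d) for name, (p, d) in periods.items()}

# Correlation after pooling all periods (time ignored).
all_prices = [x for p, _ in periods.values() for x in p]
all_demand = [y for _, d in periods.values() for y in d]
pooled = pearson(all_prices, all_demand)

for name, r in within.items():
    print(f"{name}: r = {r:+.2f}")
print(f"pooled:   r = {pooled:+.2f}")
```

In this toy data every within-period correlation is strongly negative while the pooled correlation is strongly positive, which is the pattern described above: ignoring the period hides the negative relationship between price and demand.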
