“There are three types of lies -- lies, damn lies, and statistics.” ~ Benjamin Disraeli
Simpson's paradox is a phenomenon in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. Understanding it is important for interpreting data correctly. Edward H. Simpson first addressed this phenomenon in a technical paper in 1951, although Karl Pearson et al. in 1899 and Udny Yule in 1903 had mentioned a similar effect earlier.
For example, suppose you and a friend each do problems, and your friend answers a higher proportion correctly than you on each of the two days that you compete. Does that mean your friend has answered a higher proportion correctly overall? Not necessarily!
- On Saturday, you solved out of attempted problems, but your friend solved out of You had solved more problems, but your friend pointed out that he was more accurate, since . Fair enough.
- On Sunday, you only attempted problems and got correct. Your friend got out of problems correct. Your friend gloated once again, since .
However, the competition is about the one who solved more accurately over the weekend, not on individual days. Overall, you have solved out of problems whereas your friend has solved out of problems. Thus, despite your friend solving a higher proportion of problems on each day, you actually won the challenge by solving the higher proportion for the entire weekend.
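To see the arithmetic in action, here is a minimal Python sketch. The counts in `results` are hypothetical, chosen only to reproduce the pattern described above, and the helper names `accuracy` and `overall` are made up for this illustration.

```python
from fractions import Fraction

# Hypothetical (solved, attempted) counts, chosen only for illustration.
results = {
    "you":    {"Saturday": (7, 8), "Sunday": (1, 2)},
    "friend": {"Saturday": (2, 2), "Sunday": (5, 8)},
}

def accuracy(solved, attempted):
    """Exact proportion of attempted problems that were solved."""
    return Fraction(solved, attempted)

def overall(days):
    """Pool the counts across all days, then take the proportion."""
    solved = sum(s for s, _ in days.values())
    attempted = sum(a for _, a in days.values())
    return Fraction(solved, attempted)

# Day-by-day comparison: the friend is more accurate on each day.
for day in ("Saturday", "Sunday"):
    you = accuracy(*results["you"][day])
    friend = accuracy(*results["friend"][day])
    print(f"{day}: you {float(you):.3f}, friend {float(friend):.3f}")

# Pooled comparison: the ordering flips once the days are combined.
you_total = overall(results["you"])
friend_total = overall(results["friend"])
print(f"Weekend: you {float(you_total):.3f}, friend {float(friend_total):.3f}")
```

With these made-up counts, the reversal comes from unequal weighting: most of your attempts fall on the day when both of you scored well, while most of your friend's attempts fall on the day when both of you scored worse.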
Simpson's paradox can also arise in correlations, in which two variables appear to be (say) positively correlated with one another when in fact they are negatively correlated, the reversal having been brought about by a "lurking" confounder. When two different contexts compel us to take two opposite actions based on the same data, our decision must be driven not by statistical considerations alone, but by some additional information extracted from the context. Berman et al. give an example from economics, where a dataset suggests that overall demand is positively correlated with price (that is, higher prices appear to lead to more demand), contrary to expectation. Analysis reveals time to be the confounding variable: plotting both price and demand against time reveals the expected negative correlation within each period, which reverses to become positive if the influence of time is ignored by simply plotting demand against price.
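The same reversal can be sketched with a few lines of Python. The price and demand values below are synthetic, invented for this illustration (they are not Berman et al.'s data), and `pearson` is a small hand-written helper for the Pearson correlation coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Synthetic data: within each period, demand falls as price rises,
# but both price and demand are higher in the later period.
periods = {
    "early": {"price": [1, 2, 3], "demand": [10, 9, 8]},
    "late":  {"price": [6, 7, 8], "demand": [20, 19, 18]},
}

# Correlation within each period (time held roughly fixed).
within = {
    name: pearson(data["price"], data["demand"])
    for name, data in periods.items()
}

# Correlation after pooling both periods (time ignored).
all_prices = [p for data in periods.values() for p in data["price"]]
all_demand = [d for data in periods.values() for d in data["demand"]]
pooled = pearson(all_prices, all_demand)

for name, r in within.items():
    print(f"{name} period: r = {r:.2f}")
print(f"pooled: r = {pooled:.2f}")
```

In this toy dataset the within-period correlations are negative while the pooled correlation is positive, because the shift between periods dominates the pooled trend.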