Wednesday, January 28, 2009

Plan a coming out party for Outliers: Part 1

I picked up Malcolm Gladwell's book on outliers because it got me thinking about an issue I encounter in all my analyses. I remembered the case quoted by statisticians as a reason for not leaving out outliers. For those who don't already know the story, here it is:

In 1985 three researchers, Farman, Gardiner and Shanklin, were extremely puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal January levels. They were puzzled because the Nimbus 7 satellite, which had sophisticated instruments aboard for recording ozone levels, hadn't recorded similarly low ozone concentrations.


When they examined the data from the satellite they realised that the satellite had in fact been recording these low concentration levels and had been doing so for many years. Because the ozone concentrations recorded by the satellite were so low, they were being treated as outliers by a computer program and left out of the analysis.


The Nimbus 7 satellite had in fact been gathering evidence of low ozone levels since 1976. Because the outliers were discarded without being examined, the damage to the atmosphere caused by CFCs went undetected and untreated for up to nine years. (This account is disputed by NASA researchers, who say that they had flags in place for low values, did notice the low ozone values and subsequently presented their paper, but the Farman trio's paper on the same finding beat them to it.)

What is the moral of the story? Take a deeper look at outliers in your data, because they often tell a unique story if you are really willing to listen.

Why should you be looking for outliers, you may ask? Here's why:

  1. Outliers can cause erroneous results in reports, dashboards and executive summaries, as these consist mainly of 'mean/average' numbers. Statistical tests and analyses may also be distorted.
  2. The outlier may be the story of interest in your data, e.g. the high value accounts, the seasonal spenders, the defaulters etc.
  3. An outlier may tell you something about the data gathering process. I remember years ago being asked to cross check data when the results showed that the median age of women at the birth of their last child was mid forties for certain eastern Indian states. This result was completely off from the national average (which was lower) and the client suspected a data issue at the agency's end. On enquiry, we learnt that women in these states were losing teenage children (whom they had had when they were much younger) to terrorism and drugs, and thus were having more children in later years. In this case the explanation held; otherwise we would have had to investigate why the error occurred in the data collection.
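To make the first point concrete, here is a minimal sketch of how a single extreme value can pull a mean-based summary away from the typical case while the median barely moves. The account spend figures are made up for illustration and do not come from any real data set.

```python
from statistics import mean, median

# Hypothetical monthly spend for ten accounts (illustrative numbers only).
# The last account is the outlier.
spend = [120, 135, 110, 140, 125, 130, 115, 128, 132, 5000]


def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {"mean": mean(values), "median": median(values)}


with_outlier = summarize(spend)
without_outlier = summarize(spend[:-1])  # same accounts, outlier left out

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

With these numbers the mean jumps from roughly 126 to 613.5 once the one large account is included, while the median only shifts from 128 to 129. A dashboard reporting the mean alone would describe an account that does not exist.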

Once you find outliers in the data, what do you do? Before anything else, report them! It does not matter if they are few in number, if you understand why they occurred or if you plan to leave them out for whatever good reason. In a lot of analyses, I see a disturbing trend of suppressing or 'fixing' outliers without understanding them or reporting them.
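As one possible way to report before fixing, the sketch below flags values and returns them in a small report instead of silently dropping them. It assumes the common 1.5 x IQR fence rule as the flagging criterion, which is my choice for illustration and only one of many possible rules. The function name `outlier_report` and the data are hypothetical.

```python
from statistics import quantiles


def outlier_report(values, k=1.5):
    """Flag values outside the k * IQR fences and report them.

    Nothing is removed from `values`; the point is to surface the
    flagged observations so they can be discussed first.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return {
        "n_total": len(values),
        "n_outliers": len(flagged),
        "lower_fence": lower,
        "upper_fence": upper,
        "outliers": flagged,
    }


# Hypothetical monthly spend for ten accounts (illustrative numbers only).
spend = [120, 135, 110, 140, 125, 130, 115, 128, 132, 5000]
report = outlier_report(spend)
print(report)
```

The report travels with the analysis whether or not the flagged values end up being excluded, so the reader can see what was set aside and ask why.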

My suggestion thus is to have a discussion on outliers in your data before deciding what you will do with them. I will talk about addressing outliers in part 2 of this piece, but for now here are some things to mull over:

  • Can tracking the financial performance of companies and individuals and identifying outliers help curb scams of the Satyam and Madoff type? Markopolos and mathematician DiBartolomeo warned regulators for years that Madoff could not be consistently generating higher than market profits unless he was running a Ponzi scheme. I am sure some people out there also looked through Satyam's records and had misgivings but kept quiet.

  • Last year Republican representative Mark Souder proposed that baseball players whose on-field statistics suddenly improved should be tested more often for performance-enhancing substances. The thought is to measure actual player performance against projected performance and history based on a typical career path, and to identify outlier performance or sharp deviance. Perhaps undertaking this analysis for track athletes would give sharper results.
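The idea of comparing actual performance against a player's own history can be sketched roughly as below. This is a deliberately naive version: it assumes the "projection" is simply the mean of past seasons and ignores age curves and typical career paths, which a real analysis would need. The player names, home-run totals and the threshold of 3 standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def deviation_score(history, actual):
    """Number of standard deviations between `actual` and the mean of `history`."""
    spread = stdev(history)  # sample standard deviation of past seasons
    if spread == 0:
        raise ValueError("history has no variation; score is undefined")
    return (actual - mean(history)) / spread


# Hypothetical home-run totals: (past seasons, latest season).
seasons = {
    "player_a": ([18, 20, 19, 21, 22], 23),
    "player_b": ([18, 20, 19, 21, 22], 41),
}

THRESHOLD = 3.0  # assumed cut-off for a "sharp deviance"

flagged = [
    name
    for name, (history, actual) in seasons.items()
    if abs(deviation_score(history, actual)) > THRESHOLD
]
print(flagged)
```

A flag here says only that the season is unusual relative to the player's past, which is a reason to look closer and not a conclusion in itself.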

What does all this have to do with Gladwell's book on outliers? Nothing really, except to reiterate that you should take a fresh and deep look at outliers before tossing them away or standardising them. As for the book, it was alright (not earth shattering); the reason behind the Korean airline crashes made for the best reading.

1 comment:

Datalligence said...

I guess outliers always have an interesting story to tell :-) Looking forward to part 2.