Wednesday, January 28, 2009

Plan a coming out party for Outliers-Part 1

I picked up Malcolm Gladwell's Outliers because it got me thinking about an issue I encounter in all my analyses. It reminded me of a case often quoted by statisticians as a reason for not discarding outliers. For those who don't already know the story, here it is:

In 1985 three researchers, Farman, Gardiner and Shanklin, were extremely puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal January levels. They were puzzled because the Nimbus 7 satellite, which carried sophisticated instruments for recording ozone levels, hadn't recorded similarly low ozone concentrations.


When they examined the data from the satellite, they realised that it had in fact been recording these low concentration levels, and had been doing so for many years. Because the ozone concentrations recorded by the satellite were so low, a computer program was treating them as outliers and leaving them out of the analysis.


The Nimbus 7 satellite had in fact been gathering evidence of low ozone levels since 1976. Because the outliers were discarded without being examined, the damage to the atmosphere caused by CFCs went undetected and untreated for up to nine years. (This account is disputed by NASA researchers, who say they had flags in place for low values, did notice the low ozone readings and subsequently presented their own paper, but the Farman trio's paper on the same subject beat them to it.)

What is the moral of the story? Take a deeper look at the outliers in your data, because they usually tell a unique story if you are willing to listen.

Why should you be looking for outliers, you may ask? Here's why:

  1. Outliers produce erroneous results in reporting, dashboards and executive summaries, as these are composed mainly of 'mean/average' numbers. Statistical tests and analyses may also be skewed.
  2. The outlier may be the story of interest in your data, i.e. the high-value accounts, the seasonal spenders, the defaulters etc.
  3. An outlier may yield insight into the data-gathering process itself. I remember years ago being asked to cross-check data when the results showed that the median age of women at the birth of their last child was in the mid-forties for certain eastern Indian states. This result was completely off from the national average (which was lower), and the client suspected a data issue at the agency's end. On enquiry, we learnt that women in these states were losing teenage children (whom they had had when they were much younger) to terrorism and drugs, and so were having more children in later years. In this case the explanation held; otherwise we would have had to investigate how the error crept into the data collection.

Once you find outliers in the data, what do you do? Before anything else, report them! It does not matter if they are few in number, if you understand why they occurred, or if you plan to leave them out for whatever good reason. In a lot of analyses, I see a disturbing trend of suppressing or 'fixing' outliers without understanding or reporting them.
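To make this concrete, here is a minimal sketch of the kind of flag-and-report step I have in mind, using the common rule of thumb that a value more than two standard deviations from the mean deserves a closer look. The function name, the threshold and the sample readings are all illustrative, not a prescription:

```python
import statistics

def flag_outliers(values, k=2.0):
    """Flag values more than k standard deviations from the mean.

    Returns (outliers, kept) so the outliers can be reported and
    examined before anyone decides whether to drop them.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    outliers = [v for v in values if abs(v - mean) > k * sd]
    kept = [v for v in values if abs(v - mean) <= k * sd]
    return outliers, kept

# Hypothetical readings with one suspiciously low value,
# in the spirit of the Nimbus 7 story above
readings = [300, 305, 298, 310, 302, 299, 303, 140]
outliers, kept = flag_outliers(readings)
print(outliers)  # the low reading gets flagged for review, not silently dropped
```

Note the design choice: the function returns the outliers rather than deleting them, which is exactly the step the Nimbus 7 program skipped.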

My suggestion thus is to have a discussion on outliers in your data before deciding what you will do with them. I will talk about addressing outliers in part 2 of this piece, but for now here are some things to mull over-

  • Can tracking the financial performance of companies and individuals and identifying outliers help curb scams of the Satyam and Madoff type? Markopolos and the mathematician DiBartolomeo warned regulators for years that Madoff could not consistently be generating higher-than-market profits unless he was running a Ponzi scheme. I am sure some people out there also looked through Satyam's records and had misgivings, but kept quiet.

  • Last year Republican representative Mark Souder proposed that baseball players whose on-field statistics suddenly improve should be tested more often for performance-enhancing substances. The idea is to measure actual player performance against performance projected from history and a typical career path, and to flag outlier performances or sharp deviations. Undertaking this analysis for track athletes might give even sharper results.
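The projected-versus-actual idea above can be sketched very simply: project the next season from a rolling window of past seasons and flag any season that jumps well above that projection. Everything here is an illustrative assumption, not a validated testing policy; the projection is just the mean of the previous few seasons, and the window size and threshold are hypothetical:

```python
import statistics

def flag_suspicious_seasons(stats_by_season, window=3, k=2.0):
    """Flag seasons whose value jumps well above the recent trend.

    The projection is the mean of the previous `window` seasons;
    a season is flagged when it exceeds the projection by more than
    k times the standard deviation of that window.
    """
    flagged = []
    for i in range(window, len(stats_by_season)):
        history = stats_by_season[i - window:i]
        projected = statistics.mean(history)
        spread = statistics.stdev(history)
        if spread > 0 and stats_by_season[i] - projected > k * spread:
            flagged.append(i)
    return flagged

# Hypothetical home-run totals per season: a sudden spike in season 5
home_runs = [20, 22, 21, 23, 24, 45, 26]
print(flag_suspicious_seasons(home_runs))  # only the spike season is flagged
```

A real implementation would use a proper career-trajectory model rather than a rolling mean, but the outlier logic is the same.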

What does all this have to do with Gladwell's book on outliers? Nothing really, except as a reiteration to take a fresh and deep look at outliers before tossing them away or standardising them. As for the book, it was alright (not earth-shattering); the explanation behind the Korean Air crashes made for the best reading.

Wednesday, January 7, 2009

Book Review: Super Crunchers-The fight between experts, gut and data

I enjoyed reading Ian Ayres's book. Let me say that again, properly: I really enjoyed reading Ian Ayres's book. For those who have not already read it, the book details how data-driven number-crunching algorithms work better than expert predictions and gut feelings, and how super crunchers (read: statistically literate and number-crunching-savvy individuals) will have an edge in decision making in the future.

I liked the book because it lucidly illustrates trends that I have seen over the last decade: better adoption of predictive models among businesses, more data generation and storage, an industry-wide need for talented number crunchers, and the conflict that arises when data-driven approaches come face to face with the resident expert or the manager who swears by his gut.

The case studies are very interesting and apt: it was amusing to read about the prediction of a vintage by an algorithm (I must pick up some wine based on the prediction soon). I could empathise with the story about a fellow economist's frustration at waiting to get the final odds on the Down's syndrome screening for his unborn child, and the technicians' inability to apply Bayes' theorem (I've been there). I agree with Ayres that neural nets have a long way to go before they replace other mainstream techniques, and not just because of the overfitting problem. Randomised trials also still need to become mainstream among most marketers.


What really makes the book stand out is that data crunchers like me, along with millions of others, 'get it'. I build predictive models that are elegant, simple and able to help clients make better decisions about their businesses. We constantly face scepticism about whether predictive models can fare better than the resident expert's knowledge of his market, brand or business. We sometimes pitch to clients who tell us their business problems cannot be put in an equation (it makes me squirm, because I have a personal data project which aims at predicting market prices for Indian contemporary art). After years in statistics, it's still difficult to help people understand standard deviation or 2SD.

Do I agree with the book's central premise? Yes, I do. In a data-driven world, let the numbers do the talking; experts and intuition, stand aside.