Friday, December 5, 2008

Trends: Going the way of R and other open source software

My colleague Girish recently mailed me a New York Times business computing article about how data analysts have taken to R as the open source programming language.

The article took me back in time to 1996, when I was a graduate student in the US. Fellow statisticians were raving about R as the new generation data crunching language and something that was going to give other data packages a run for their money. We were at that point doing our statistical number crunching on student licenses of SAS. While intrigued with the whole issue, I was too busy learning applied statistics and SAS and just getting through grad school semesters.

Now, years later, it is with a feeling of déjà vu that I read the article, because today I am much closer to embracing R as 'the' crunching language for myself and our business.

We've done the testing and it has won hands down every time on:



  • Ease of use
  • Readily available code modules (learning from others is key here; we techies love to outsmart each other)
  • Wonderful graphics
  • Excellent data manipulation
  • No fees
  • Ability to customise
  • Lots more...


While competitors are quick to dismiss it, R works because it has created a democratic community of statisticians and others who like to see number crunching become easier and more visual. The fact that it is open source provides the added kick to be able to create customised modules that the community can use. It blends programming and statistical skills together more elegantly than I have ever seen. The fact that it has a fan following among my tribe is therefore not surprising.

So, are R and other open source software the way to go? Absolutely! The reasons are many, but let me rank them based on how we took the leap:

  1. Stacks up against, and beats, the competition on most data crunching modules.
  2. Easy to use.
  3. Collaborative value model: the conviction that a collective community can create better thought and tools than a competitive one.
  4. Better service: less downtime, quicker error resolution and a help desk of people dedicated to fixing issues.
  5. Excellent customisation options: The ability to create what you want for your business and put it out there.
  6. Cutting edge graphics.
  7. The geek factor: the thrill of creating, bettering and showing off to other like-minded individuals should not be underestimated.
  8. Lower technology cost: while this is great, believe me, this is not the main reason that businesses use open source.




Monday, December 1, 2008

Segmentation: making it more science than art

This post has been delayed because I've been toying with whether I should write on segmentation at all. So much has been written on this subject that I am a little hesitant about revisiting this space.


What got me to finally pen this was the title of a paper at an upcoming conference that said 'How statistics get in the way of actionable segmentation'. I don't know what the presenters have to say (must source a copy after the conference) but the title made me laugh. The two words that stuck in my head were 'statistics' and 'actionable segmentation' and whether the twain will ever meet.


I have undertaken innumerable segmentation projects in my career: some simple, others complex, and yet others that went nowhere. In the end, all of them have had the same things in common:

  1. Too many bases variables

  2. Over-reliance on cluster analysis as the primary tool for segmentation

  3. Use of subjective judgement to evaluate results of the cluster solution

  4. Lack of reliability and validity tests on the solution

  5. Recreation of the scientific solution into a more 'creative and arty' one

What the above means for managers who implement segmentation solutions is that they could be formulating strategy and targeting segments that are unstable and unreal. There exists a body of research that calls for a deeper look at the statistics and data that go into cluster analysis and segmentation (I will be happy to provide the references).

The real issue continues to be an inability of both analysts and practitioners to put together a common road map for segmentation that takes into account statistical robustness of the technique along with creation of actionable segments that can be targeted through focused marketing programs. In my experience, the science of segmentation gets lost in the art.

Here are the five key things that analysts and practitioners must do to create better, more robust and more scientific segments:

  1. Choose bases variables for segmentation that tie in with the end goal of segmentation, and keep their number to no more than 8-10. Build a set of good profiling variables that tie into the bases variables (there is no restriction on number here).
  2. Explore other tools for segmentation (sometimes simple business rules work just as well). Latent class analysis offers excellent alternatives for both survey and CRM data and is still a highly underused technique. Try two techniques if possible, and compare and contrast the results.
  3. Use a variety of statistical parameters to evaluate a solution instead of relying on one or two, or on subjective judgement. Decide which metrics you want to look at before the study. For example, dendrograms, change rate, pseudo R-squared and Hotelling's T-squared can be some of the metrics for evaluating the number of clusters in a cluster solution. The BIC, p-value, parsimony (number of parameters) and the bootstrap p-value can be the parameters used to settle the number of segments in a latent class segmentation. Reliance on 'many' statistics rather than 'few' should be the mantra.
  4. Test reliability through holdout samples, and validity by looking at how the profiling variables differentiate the solution. The holdout sample results must match those of the developmental sample in terms of the number of segments and their profiles. Most of the chosen profiling variables must adequately differentiate the final segment solution. If there is an issue with reliability or validity, the solution may have a problem. Going back and reworking it is the best way out.
  5. Don't use the art of segmentation to sidestep the science; use it, if you will, to add to the science.
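
To make points 3 and 4 a little more concrete, here is a minimal Python sketch (standard library only) of what "decide the metrics up front, then check a holdout" could look like. Everything in it is an assumption made for illustration: the data are synthetic, with three made-up segments on two segmentation variables; k-means stands in for whatever clustering tool you actually use; and R-squared and the pseudo-F statistic are just two of the many statistics one might tabulate. The holdout check shown here (scoring the holdout with the development centroids and comparing segment shares) is one simple way to test reliability, not the only one.

```python
import random

CENTERS = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]  # hypothetical "true" segments

def make_sample(n_per_segment, rng):
    """Synthetic respondents: two segmentation variables, unit-variance noise."""
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in CENTERS for _ in range(n_per_segment)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def within_ss(points, centroids, labels):
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

def total_ss(points):
    centre = mean_point(points)
    return sum(sq_dist(p, centre) for p in points)

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means from a random start; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = assign(points, centroids)
    for _ in range(iters):
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            # Keep the old centroid if a cluster happens to be empty.
            new_centroids.append(mean_point(members) if members else centroids[j])
        new_labels = assign(points, new_centroids)
        centroids = new_centroids
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels

def best_kmeans(points, k, seed=0, restarts=20):
    """Best of several random starts, judged by within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: within_ss(points, *run))

def r_squared(points, centroids, labels):
    """Share of total variation explained by the segments."""
    return 1.0 - within_ss(points, centroids, labels) / total_ss(points)

def pseudo_f(points, centroids, labels):
    """Between-cluster over within-cluster mean squares (Calinski-Harabasz)."""
    n, k = len(points), len(centroids)
    wss = within_ss(points, centroids, labels)
    bss = total_ss(points) - wss
    return (bss / (k - 1)) / (wss / (n - k))

def shares(labels, k):
    return [labels.count(j) / len(labels) for j in range(k)]

rng = random.Random(42)
development = make_sample(50, rng)   # developmental sample
holdout = make_sample(25, rng)       # holdout sample, never used for fitting

# Point 3: tabulate the statistics chosen up front for each candidate k.
metrics = {}
for k in range(2, 6):
    cents, labs = best_kmeans(development, k)
    metrics[k] = {"r2": r_squared(development, cents, labs),
                  "pseudo_f": pseudo_f(development, cents, labs)}
    print(k, round(metrics[k]["r2"], 3), round(metrics[k]["pseudo_f"], 1))

chosen_k = max(metrics, key=lambda k: metrics[k]["pseudo_f"])

# Point 4: score the holdout with the development centroids and compare.
cents, dev_labels = best_kmeans(development, chosen_k)
hold_labels = assign(holdout, cents)
dev_shares = shares(dev_labels, chosen_k)
hold_shares = shares(hold_labels, chosen_k)
holdout_r2 = r_squared(holdout, cents, hold_labels)
print("segment shares (development):", dev_shares)
print("segment shares (holdout):    ", hold_shares)
print("holdout R-squared:", round(holdout_r2, 3))
```

On real data the statistics will rarely agree as neatly as they do on clean synthetic blobs, which is exactly why it helps to fix the list of metrics before the study and to treat a holdout that disagrees with the developmental sample as a reason to rework the solution.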