Tuesday, February 24, 2009

Predicting the success of Slumdog Millionaire and other movies

The dust from the Oscar award ceremony has settled and my fellow countrymen continue to go about their business with big smiles on their faces over the Oscar wins for AR Rahman, Resul Pookutty and the film. A question that I keep thinking about is whether analytics could have predicted the success of Slumdog.

Movie box office revenue prediction is coming of age, and Hollywood is beginning to recognise that it may need the help of number crunchers to utilise its funds better by getting behind films with higher chances of success. I remember reading in detail the article by Ramesh Sharda a few years ago, in which he deployed a neural net model to predict the box office receipts of movies before their theatrical release. The risk and the money involved for investors in the movie business are very high, and a large portion of a movie's total revenue comes from the first few weeks after release. Interest in the ability of statistical and mathematical models to predict this revenue, not only for funding purposes but also for better distribution and marketing strategy, has been growing in the last few years.

There are some companies and individuals who have cracked the code using a variety of variables. While there are still many expert sceptics out there, making predictions that do significantly better than chance presents a win-win solution for the developers of these algorithms as well as for investors and studios.

  • Can it be done? YES, YES and YES! With more and more people throwing their weight behind the science of the subject, prediction algorithms in this space continue to get better and better.

  • Is it easy? NO! That's where the frustration and the challenge for data crunchers lie.

What's my recipe to get this right? In my experience (yes, I've had the pleasure of taking a shot at this exciting problem), the answer is to employ the 'layered accuracy approach'.

  1. Decide which part of the problem you want to tackle: pre-release prediction, post-release prediction (first few weeks), or both.

  2. Identify the structure of your base model (this is the model that will provide you with the benchmark predictive power and understanding of the revenue aspect of the movies). Try a model structure that is easy to execute and interpret and fits the data well. Make decisions about quantitative vs. behavioral models, point estimates vs. classification into revenue groups, and segment models vs. models for the whole movie population.

  3. Use tried and tested variables relevant to the model being built: star power, no. of screens, genre, MPAA rating, time of release, competition at time of release, critics' ratings, sequel etc. I recommend that you break down any variable that is still too dense. For example, create your own version of the traditional genre variable, as it usually does not add much in its present form.

  4. Use other not-so-mainstream variables: plot, positive buzz on internet forums and the Hollywood blacklist for starters. This is your creative space; use it to construct variables that you believe can add more punch to the model.

  5. Build the model and examine predictive accuracy and insights. Rank order the insight variables. If something does not make sense, explore it again.

  6. Validate the model to see that it stands up tall.

  7. Try another model structure and see if you get better results (it's all about accuracy, Watson; even a little more lift counts when we are talking millions of dollars).

  8. Get a movie fanatic data cruncher to do all the above for you (I promise the predictive accuracy will dramatically improve).

  9. Explore other non-conventional ways to better your prediction accuracy. A big area now is prediction markets.
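To make the recipe a little more concrete, here is a minimal Python sketch of one possible reading of the layering idea: fit a simple base model on a tried and tested variable, fit a second layer on what the base model missed, and compare the two on held-out movies. Everything in it is an assumption made for illustration: the data are invented and noise-free, the variable names (screens, sequel) and the function names are hypothetical, and fitting residuals is just one way to add a layer, not necessarily how any real system works.

```python
# Toy illustration of a layered approach. All figures below are invented.

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(predicted, actual):
    """Average absolute gap between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical training set: (screens in thousands, sequel flag, revenue in $M).
train = [(s, q, 5 + 10 * s + 8 * q) for q in (0, 1) for s in (1, 2, 3, 4)]
# Hypothetical holdout set, kept aside for validation (step 6).
holdout = [(2.5, 1, 38.0), (1.5, 0, 20.0)]

screens = [row[0] for row in train]
sequel = [row[1] for row in train]
revenue = [row[2] for row in train]

# Layer 1 (steps 2 and 3): base model on a tried and tested variable.
b1, a1 = fit_line(screens, revenue)


def base_predict(s, q):
    return a1 + b1 * s


# Layer 2 (step 4): use another variable to explain what the base model missed.
residuals = [r - base_predict(s, q) for s, q, r in train]
b2, a2 = fit_line(sequel, residuals)


def layered_predict(s, q):
    return base_predict(s, q) + a2 + b2 * q


# Steps 5 to 7: compare the two structures on movies neither model has seen.
actual = [r for _, _, r in holdout]
base_mae = mean_abs_error([base_predict(s, q) for s, q, _ in holdout], actual)
layered_mae = mean_abs_error([layered_predict(s, q) for s, q, _ in holdout], actual)

print(f"base model holdout MAE:    {base_mae:.2f}")
print(f"layered model holdout MAE: {layered_mae:.2f}")
```

On this toy data the second layer removes the holdout error entirely, but only because the invented revenue figures contain no noise. With real box office data the gain from each extra layer would be far smaller, and that is exactly why the validation step matters.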

As science makes the business of revenue prediction in movies and other entertainment areas much easier, the issue becomes less about whether we could have predicted the success of Slumdog Millionaire and more about whether we want to. Malcolm Gladwell presents this case so eloquently in his absolute must-read piece in The New Yorker.

Thursday, February 19, 2009

The Practical Statistician: A Toolkit

I have had the pleasure of working with a lot of statisticians, mathematicians, data miners and econometricians (let's call them PEMD, persons extracting meaning from data, for ease) in my career. An observation I have often made is that while all of them know the tools of their trade, only a few eventually go on to become excellent practitioners or, as I call them, 'practical statisticians' in the industry. What is it that these experts have that gets them far ahead in their trade? A toolkit that helps them survive the real world journey. Here is the list of items in that toolkit:


Item #1: Pen and notebook (a thick one): they carry this around at all times, even to bed. This helps them make copious notes when others are talking and think aloud when they are structuring their thoughts, attacking problems and analyzing outputs. They guard this notebook zealously and get visibly upset if it ever gets lost or misplaced. They recognize that in order to streamline loads of work, manage their time well, analyze the problem fully and present the output lucidly without going insane, they must structure their thoughts. Written matter is the key.

Item #2: Three books for reference and speed reading skills: one is usually about the software they are using, the other two are the best applied texts on the most-used techniques in their field and on new emerging areas (which no one else has a clue about). They read many more research articles than other people (and yes, they usually do that during their breaks or in their leisure time). If they don’t understand an article the first time round, they absolutely have to read it again and again till they do.

Item #3: Data dirty fingers: they execute projects no matter how high they rank in the corporate hierarchy. They recognize that leading from the front means being able to do the work at the back end, especially when all hell breaks loose.

Item #4: Non-technical speak: they are able to communicate their ideas and statistical methods to a wide audience without using statistical jargon.

Item #5: Graphs: they like to graph data and get a sense of numbers visually. This ability to look at both numbers and graphs helps them get a finer sense of the data and what they don’t know and must find out from it.

Item #6: A good dose of imagination, critical thinking and skepticism: they function like detectives, and for them most business problems present cases to be cracked. After the project starts, they devote all their effort to cracking the case, oblivious to everything and everyone else.

Item #7: Mentoring and training calendar: unless they pass on their wisdom and how they put the problem, method and experience together, they know they will continue to do the same work over and over again.

Item #8: A broad view of their role: they define their role rather than let clients, coworkers and organizations peg them. They like their roles to be larger and ‘more whole’, not constrained by their degree and specialization.

Item #9: Practical, adequate solutions: while striving for the best solution, they recognize that they may need to deliver less than optimal solutions based on project constraints and client readiness.


Item #10: A passion for statistics: especially its applications in different fields, and an understanding of what it can and cannot do.

Wednesday, February 18, 2009

Trends: R vs. SAS - What's really at the heart of the matter?

Okay, I promised myself that I would not jump into this debate and I bit my tongue and fingers like a thousand times last week. Go ahead and shoot me; I'm only human.

Here goes...

Methinks this R vs. SAS debate is less about the merits and demerits of the two software packages and more about the David vs. Goliath (or Hare vs. Tortoise) effect. David, in this case, also provides strong competition in a slightly monopolistic market situation.



I have worked with SAS and I don't have a strong opinion against it (except for its really bad graphics). I am new to R and I like it (yes, there will be some pet peeves as time goes by). I have also used most other competing software in this space (SPSS, Minitab, Stata, Matlab etc.).


So what's the issue, you may ask? Well, no matter what anyone says (or posts), I believe one of the main reasons that R is generating such a lot of press (and don't get me wrong, it has strong merits) is the fact that with all its merits it is also FREE! Whether we like to admit it or not, it bothers us that we have to pay for using SAS when R, which is as good (if not better in some areas), is available for zero cost. Would the same debate be as heated if R did not deliver? I doubt it.

Add to this the point that R comes in as the 'underdog' that most of us like to see win, and you get a better idea of why there is so much angst all around on this issue.

Enough said.

Tuesday, February 17, 2009

Parlez-vous Statistics?

As I sat debating whether we should have an informal case study 'test' for statisticians who want to work or intern with us, I read Andrew Gelman's blog article on a new course in statistical communication that he would like to teach sometime. It brought home to me the fact that if I went ahead with this test we would not have any new hires at all, since most would flunk out.

Why oh why do we not teach statistical communication at most universities or even at jobs? The lack of this skill has left proponents and users of the subject in the industry unable to speak the same language.

So what are some of the skills I would test for statisticians? Here is my list:
  1. Translating a business problem to an analytical and statistical problem
  2. Writing a proposal (or at least the proposed analytical solution part of the proposal)
  3. Creating a process/flow chart of the analytical solution
  4. Graphical presentation of data (raw, cleaned and analysed or modeled)
  5. Summarizing and communicating statistical results in both technical and non-technical ways (depending upon the audience). This would also include documentation of the project and an executive summary of findings
  6. Ability to write simple and elegant computer code and read the same (irrespective of software and writer differences)
  7. Collaborative work effort with other colleagues (programmers, consultants, academicians etc.)
  8. Knowledge of statistical pitfalls
  9. Other communication skills (e-mails, blogs, discussions, knowledge sharing etc.)
  10. Ability to read, understand and summarise research papers (good knowledge of work in relevant focus areas)
More organisations need to get involved with universities to encourage teaching of these skills at an early level to practitioners who plan to join the industry. Statisticians, on the other hand, need to move out of their comfort zone and ensure that they become adept at communicating their language to a wider audience. Once the above skills have been mastered along with a sound knowledge of statistics, we may finally be viewed as the geeks with the sexy job (as Hal Varian, Google's chief economist, points out in an interview with the McKinsey Quarterly).