Tuesday, February 24, 2009

Predicting the success of Slumdog Millionaire and other movies

The dust from the Oscar award ceremony has settled, and my fellow countrymen continue to go about their business with big smiles on their faces over the Oscar wins for AR Rahman, Resul Pookutty and the film itself. A question I keep thinking about is whether analytics could have predicted the success of Slumdog.

Movie box office revenue prediction is coming of age, and Hollywood is beginning to recognise that it may need the help of number crunchers to utilise its funds better by getting behind films with higher chances of success. I remember reading in detail an article by Ramesh Sharda a few years ago, where he deployed a neural net model to predict the box office receipts of movies before their theatrical release. The risk and money involved for investors in the movie business are very high, and a large portion of a movie's total revenue comes in the first few weeks after release. Interest in statistical and mathematical models that can predict this revenue, not only for funding purposes but for better distribution and marketing strategy, has been growing in the last few years.

There are some companies and individuals who have cracked the code using a variety of variables. While there are still many expert sceptics out there, predictions that do significantly better than chance present a win-win for both the developers of these algorithms and the investors and studios.

  • Can it be done? YES, YES and YES! With more and more people throwing their weight behind the science of the subject, prediction algorithms in this space continue to get better and better.

  • Is it easy? NO! That's where the frustration and challenge among data crunchers lies.

What's my recipe to get this right? In my experience (yes, I've had the pleasure of taking a shot at this exciting problem), employ the 'layered accuracy approach'.

  1. Decide which part of the problem you want to tackle: pre-release prediction, post-release prediction (first few weeks), or both.

  2. Identify the structure of your base model (this is the model that will provide your benchmark predictive power and understanding of the revenue side of movies). Try a model structure that is easy to execute and interpret and fits the data well. Make decisions about quantitative vs. behavioral models, point estimates vs. classification into revenue groups, and segment models vs. all-movie-population models.

  3. Use tried and tested variables relevant to the model being built: star power, number of screens, genre, MPAA rating, time of release, competition at time of release, critics' ratings, sequel status, etc. I recommend you break down any variable that is still too dense; for example, create your own version of the traditional genre variable, as it usually does not add much in its present form.

  4. Use other, not-so-mainstream variables: plot, positive buzz on internet forums and the Hollywood blacklist, for starters. This is your creative space; use it to construct variables that you believe can add more punch to the model.

  5. Build the model and examine predictive accuracy and insights. Rank-order the insight variables. If something does not make sense, explore it again.

  6. Validate the model to see that it stands up tall.

  7. Try another model structure and see if you get better results (it's all about accuracy, Watson; even a little extra lift counts when we are talking millions of dollars).

  8. Get a movie-fanatic data cruncher to do all the above for you (I promise the predictive accuracy will dramatically improve).

  9. Explore other non-conventional ways to better your prediction accuracy. A big area now is prediction markets.
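
The base-model-plus-new-variable loop in steps 2 to 5 can be sketched in a few lines. This is a minimal illustration on synthetic data, with made-up variables (screens, star power, a forum 'buzz' score) and a toy linear model rather than anything resembling a real box office model; it only shows the mechanics of checking whether a not-so-mainstream variable adds a layer of accuracy over the base model.

```python
# Minimal sketch of the 'layered accuracy approach': fit a base model
# on conventional variables, then test whether a less mainstream
# 'buzz' variable adds predictive lift. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(42)
n = 400
screens = rng.integers(200, 4000, n).astype(float)   # opening screens
star_power = rng.uniform(0, 10, n)                   # cast pull score
buzz = rng.uniform(0, 1, n)                          # forum sentiment score
# Synthetic opening revenue (in $M): buzz genuinely matters here.
revenue = 0.01 * screens + 2.0 * star_power + 15.0 * buzz + rng.normal(0, 2, n)

def r_squared(features, y):
    """Fit ordinary least squares (with intercept) and return in-sample R^2."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

base = r_squared(np.column_stack([screens, star_power]), revenue)
layered = r_squared(np.column_stack([screens, star_power, buzz]), revenue)
print(f"base R^2={base:.3f}, with buzz R^2={layered:.3f}")
```

In a real exercise the comparison would of course be on held-out movies, not in-sample fit, and the candidate variable would earn its place only if the lift survives validation (step 6).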

As science makes the business of revenue prediction in movies and other entertainment areas much easier, the issue becomes less about whether we could have predicted the success of Slumdog Millionaire and more about whether we want to. Malcolm Gladwell presents this case eloquently in his absolute must-read piece in The New Yorker.

3 comments:

Datalligence said...

sounds exciting but it's gonna be challenging 'coz there are a lot of inputs/variables that are very very subjective. when you talk about ratings, reviews, buzz in forums....whose rating, which reviewer, which sites/forums are you going to select? your selection of these opinions/ratings (subjective in themselves) is going to become subjective again :-)

and let's say you've selected a few popular ones (based on your opinion), how will you then summarize the information? can you pls share your thoughts?

thanks,

Anuradha said...

Hi Romakanta,
No matter how subjective the choice and construction of the final variables, the real test is whether the variable adds a layer of better predictive accuracy to your model. Narrowing down the search to those variables that give you more bang for your buck is the fun part.
All the variables I mention in my write-up have been used in some way or other in the past by researchers; the goal is to create a 'better predictor' from them. Here are some quick thoughts on the variables you mentioned (there are other ways, both simple and complex, to do the same):

1. Critics' reviews: I would look at the top-selling five or ten newspapers, pick the critics who write for these papers, and match their ratings using simple majority rules. A movie with a high final rating across critics would get a higher weight vs. others where the reviews were poor.
2. On the Hollywood blacklist, I could choose to live with the given ranking or create one of my own. I might create a new variable with 11 classes (or fewer) to accommodate the ranking; films not ranked would go in the 11th class.
3. Online ratings/opinions/chatter: I would look at the movie-related sites with the most user traffic, i.e. Yahoo! Movies, Netflix, Amazon's IMDb, Rotten Tomatoes, etc. User ratings (averages or weighted averages) across these sites would be compared and matched using cut-off rules to create the final rating for the movie. Before doing this, I would make sure the rating scales are aligned or nearly identical across all the sites. I might create other buzz variables using user ratings for breakdown of the plot, word-of-mouth recommendations, and positive/negative chatter.
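
To make the scale-alignment idea in point 3 concrete, here is a minimal sketch. The site names, scales, vote counts and ratings are all invented for illustration: each site's rating is rescaled to [0, 1], and the sites are then combined with a vote-weighted average.

```python
# Hedged sketch: align user-rating scales across sites, then combine
# them with a weight proportional to each site's vote count.
def normalize(rating, lo, hi):
    """Map a rating on scale [lo, hi] to [0, 1]."""
    return (rating - lo) / (hi - lo)

# (rating, scale_min, scale_max, vote_count) per site, for one movie.
site_ratings = {
    "site_a": (8.2, 1, 10, 12000),   # e.g. a 1-10 user-rating site
    "site_b": (4.1, 0, 5, 3000),     # e.g. a 0-5 star site
    "site_c": (86, 0, 100, 900),     # e.g. a 0-100 percent score
}

total_votes = sum(v for *_, v in site_ratings.values())
combined = sum(normalize(r, lo, hi) * votes
               for r, lo, hi, votes in site_ratings.values()) / total_votes

print(round(combined, 3))  # prints 0.807
```

A cut-off rule can then bucket the combined score into the final rating classes, and the same normalize-then-weight pattern extends to any new site added later.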

I hope this helped?

Datalligence said...

yes, it definitely helped. thanks!!!