I finally have all the data loaded and can begin messing around with some algorithms. The table of movie ratings is over 100 million records long. Even with indexes, it just too big to do any sort of grouping or joins on. This rules out algorithms like “User A has rating x number of movies very similar to User B. Therefor they will probably feel the same way about movie y.”

I can’t imagine this algorithm would be of any use to Netflix anyway.

They have 5.6 million subscribers currently and predict they’ll have 6.3 million by the end of the year. Plus they claim to have over 65,000 titles. That’s somewhere around 364 billion ratings/estimated ratings.

For this thing to be of any use to them it’s going to have to be pretty darn fast.

I threw together some simple tests just to get started. First I took the average rating for each movie and used that for all estimated ratings. This yeilded a RMSE of 1.0519. Next I took the average rating for each user and used that as the estimate. This yeilded a RMSE of 1.0426. This is a bit surprising. How the user voted on previous movies is a better predictor to how they will vote on future movies then how others voted.

More to come…

Popularity: 4% [?]