Part 2 of a multi-part series concluding 24 days from now, more or less
Part 1 | Part 2 | Part 3 | Part 4 | Part 5
Imagine this, if you will: Netflix asks you to rate 10 movies you saw in the last month. Then they give you a magic black box with a computer inside. Into that black box they feed 10 entries which include your name, the names of the movies you watched, and when you rated the movies. For five of these entries, they include your rating for the movie. Then they ask the black box to predict, as best as it can, the ratings you gave the other five movies. This is essentially what Netflix is paying $1 million for: a system whose predictions land, on average, within about ±0.8558 stars of the actual user rating (measured as root mean squared error, or RMSE). It doesn’t sound very exciting when abstracted that way, and perhaps we do it some injustice because it actually requires a sophisticated algorithm, but this is what the Netflix Prize is all about.
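To make the scoring concrete, here is a minimal sketch of how a black box could be graded on five withheld ratings using root mean squared error (RMSE), the metric the contest uses. The movie names, ratings, and predictions below are invented purely for illustration; they are not taken from the Netflix data.

```python
import math

# Hypothetical data: the five ratings you actually gave, hidden from the black box.
withheld = {
    ("you", "Movie F"): 4,
    ("you", "Movie G"): 2,
    ("you", "Movie H"): 5,
    ("you", "Movie I"): 3,
    ("you", "Movie J"): 4,
}

# Hypothetical predictions the black box returns for the same five entries.
predicted = {
    ("you", "Movie F"): 3.5,
    ("you", "Movie G"): 3.0,
    ("you", "Movie H"): 4.0,
    ("you", "Movie I"): 3.0,
    ("you", "Movie J"): 4.5,
}

def rmse(actual, guesses):
    """Root mean squared error over every (user, movie) pair with a known rating."""
    squared_errors = [(guesses[key] - rating) ** 2 for key, rating in actual.items()]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

score = rmse(withheld, predicted)
print(f"RMSE on the withheld ratings: {score:.4f}")
```

Because the errors are squared before averaging, one badly missed rating costs more than several near misses, so the score is not quite a simple "average distance" from the true rating.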
So the question remains: does increasing a computer’s predictive accuracy by 10% on this very narrow question equate to improving a recommendation system by the same amount? We think not. Recommendations encompass a lot more than a user’s ratings about movies already seen. Most importantly, the Netflix Prize neglects to ask the question of its contestants: how well can your algorithm recommend new movies to users?
This is the goal for which every recommendation system strives: the ability to surprise, astound, and delight a user by suggesting a movie yet unseen by them, and to keep them coming back for more. In contrast, reminding users how they rated something they have already seen does very little for them.
People love to discover new movies in the same way they love to hear new music; and even for the most voracious consumers, there will always be undiscovered territory. But how can a machine (in this case, our “magic black box”) which only specializes in telling you how much you liked something you’ve already seen also tell you how much you will like something you haven’t yet heard of? Although these two things are sometimes related, they aren’t always so.
The assumption underlying the Netflix data exercise is that an algorithm which can accurately predict how you will rate movies you have already seen will have the same predictive accuracy on a dataset of movies you have yet to watch. In the case where a large percentage of the audience has watched the same movies, this works well. But the assumption breaks down when we try to make predictions on movies that rarely appear together on viewers’ watched lists. From a product perspective, this hampers the ability to make recommendations to adventurous viewers and find interesting choices that lie outside the obvious list of related items.
For example, in the dialogue from Part 1 of this series, Socrates finds out that Crito really liked The Shawshank Redemption; in fact, he gave it 4 out of 5 stars. From this, Socrates (our little black box) could extrapolate that Crito really likes movies about jail or jailbreaks; or movies with Morgan Freeman; or movies based on Stephen King stories. But Socrates, as he is designed, cannot interpret anything beyond the 1-5 rating given by Crito and thus his recommendation power is severely limited. By design, he misses the point when Crito mentions his love of a good jailbreak movie and can’t do anything with that very valuable information (we’ll look in more depth at ways of eliciting preference without 1-5 ratings in a future article). Socrates will evenly weight recommendations of unseen movies like The Great Escape, Driving Miss Daisy, and Children of the Corn when, really, he should give The Great Escape and other jailbreak movies much more weight than anything else. Furthermore, Socrates could lose Crito’s trust by recommending him something entirely awry. Children of the Corn is enough to scare anyone off, generally.
In a complete recommendation system (as opposed to the narrow Netflix algorithm bake-off), questions about how a technique works on both a viewer’s discovered and undiscovered items are central. By focusing all efforts on a specific algorithm to match known ratings, the contest introduces a strong bias. Undoubtedly, as Mr. Smith mentions in the comments to our last article, advances in algorithm technique have been made. But the algorithms as presently constructed should not be directly applied to an actual live recommendation system. Instead, they are merely building blocks upon which other datasets, techniques, specific tunings, human curation and product choices need to be imposed. Given the unique biases of the specific dataset and question posed by the Netflix Prize, is there any proof that all the specific tuning and blending from BellKor’s Pragmatic Chaos to meet the 10% will yield tangible improvements to the final product? Or would any of the extremely similar SVD algorithms developed for the contest perform equally well?

Of course, Netflix intentionally set up a narrow definition of “recommendations” for their contest. They never ask the teams to test out their algorithm on movies which the user hasn’t rated. And for good reason: the accuracy of such a diagnostic would be difficult to assess and verify. But this is where the true value of such an algorithm lies. As it stands, true predictive power in the scheme of the Netflix Prize consists of accurately telling you what you already know, 3 years and $1 million later.
Stay tuned for additional articles in this series coming soon…
Tags: black box, Countdown to 10%, crito, magic lantern, netflix, SVD
Sorry, but you got it completely wrong. The contest is about predicting a user’s rating for a film he had *not* seen before. (The predicted ratings are not part of the “learning” set.)
Also the black box you describe has only access to the votes of the customer in question. In reality the black box accesses *all* available votes of *all* customers and is able to build up pretty detailed knowledge about movie preferences.
If you give a few votes to the recommender it can conclude a lot about your preferences by “comparing” you to other customers.
Also a very small improvement in RMSE can mean a much better set of proposed movies.
Or look at it the other way round: Netflix people seem very smart to me. If they pay $1M for a “small improvement”, it is pretty certain that this improvement will mean many millions of income for them.
DrKoch, please take a look at the Prize rules here. Contestants are provided with 2 datasets: (1) a “training” set which consists of over 100M known ratings from 480k Netflix users; and (2) a “qualifying” set which contains 2.8M customer/movie ID pairs with the ratings withheld, “selected from the most recent ratings from a subset of the same customers in the training data set, over a subset of the same movies.” The qualifying set is a set of movies *already* watched by the customer but with the specific rating withheld.
Since the goal of the Prize is for an algorithm to predict the withheld ratings in the qualifying set, the entire goal of the contest is to predict the ratings for movies which have already been watched by the user.
Of course, we are not disputing the fact that you can build correlations between movies by finding patterns in usage data. (MediaUnbound *is* a recommendation technology provider, after all, and uses a variety of algorithmic techniques in addition to human analysts to create interesting media recommendations.) Instead, we are questioning whether a 10% improvement in the very specific task set out in the Netflix Prize rules will actually achieve an improvement in the user recommendation experience.
We agree that the Netflix people are very smart. At the least, their $1M investment will mean many millions in marketing exposure for them.
I’m struggling a bit with a point upon which you focus numerous times but which is a little bit unclear to me: the relationship between predicting what a user will rate a watched movie and predicting what a user will rate an unwatched movie.
As I understand it, you are trying to be very clear that the criterion of Netflix’s contest is accuracy in using data set A (a set of user ratings) to predict the information in data set B (a set of different ratings from the same users). In reality, the user is at a point in time after A and B.
The conceit, I infer, is to say: at one point, the user was at a point in time BETWEEN A and B. Data set B existed as pure potentiality, a set of possible effects without causes. (The users hadn’t seen the movies!) Data set B was not merely unknowable, it did not exist. If a black box can generate data set C, a set of predictions about what users will rate those movies AFTER they have watched them, users will be delighted proportionally to the similarity of B and C.
The only point at which anyone evaluates a data set in this story is once it exists. A user evaluates the merit of the prediction, I suspect, right as her particular datum is formed: as the movie-yet-to-be-watched becomes the movie-just-watched and she has a reaction to it. She thinks: yeah, I’d give that four stars, which is what Netflix predicted!
To me, the conceit of creating an imaginary prediction scenario by hiding a data set (set B) and positing a historical moment in time (the moment when users had not seen the films in set B) is easily and obviously parallel to the real equivalent.
In fact, the only way I can imagine (at the moment) any interesting difference between the Netflix conceit and reality is if a user’s opportunity to interact with data set C (the predictions) would significantly alter the data in set B (the hypothetical, not-yet-real ratings).
Please forgive me if any of this is not clear; this is a difficult subject to write precisely about!
Or, to put it in dialogue form:
CRITO: Socrates, I just watched The Shawshank Redemption last week.
SOCRATES: I would guess that you truly enjoyed that film, Crito. A 4 out of 5?
CRITO: Yeah, I probably would give it a 4 out of 5.
SOCRATES: In that case, I think you should watch In the Name of the Father, because I predict that you will give it 5 out of 5.
CRITO: Cool, I’ll check it out.
SOCRATES: Is my prediction valuable to you?
CRITO: Umm, I don’t know, yet. I’ll have to watch it. If I like it as much as you predict, I’ll probably come to you for more recommendations.
–
In other words, Crito has no ability to assess the merit of Socrates’ recommendation until he turns the movie Socrates recommends from an unwatched movie into a watched movie, at which time he will assess its merit using the same criteria by which the Netflix competition is judged.
Hi Nate. We agree, writing precisely about math is tricky stuff (and heartily applaud your use of dialogue)!
You are correct that there was a point in time when the user had yet to watch the movies in set B. So, from a semantic perspective, the Netflix algorithms are indeed making predictions about events which have yet to be actualized.
In theory, the Netflix Prize algorithms can be asked to make a prediction about any movie in the dataset. But, in practice, the algorithms are only graded on their ability to accurately predict the ratings on those movies the user actually ends up viewing in the future. This introduces very specific mathematical biases in the algorithms. For example, they are rewarded for making accurate predictions about obvious recommendations and not penalized at all for making wildly inaccurate predictions on more adventurous or “discovery” oriented choices.
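As a rough illustration of this point, the sketch below compares two hypothetical black boxes that agree on every movie Crito eventually rates but disagree wildly about a movie he never rates. The predictions and ratings are invented for the example, and the scoring function is a simplified stand-in for the contest's RMSE, not the official scoring code.

```python
import math

# Hypothetical ratings Crito eventually gives; only these movies are ever graded.
graded = {"The Shawshank Redemption": 4, "The Great Escape": 5}

# Two hypothetical black boxes: identical on the graded movies,
# very different on a movie Crito never gets around to rating.
box_a = {"The Shawshank Redemption": 4.2, "The Great Escape": 4.6, "Children of the Corn": 1.5}
box_b = {"The Shawshank Redemption": 4.2, "The Great Escape": 4.6, "Children of the Corn": 5.0}

def rmse_on_graded(predictions, actual):
    """RMSE over movies with a known rating; all other predictions are ignored."""
    errors = [(predictions[movie] - rating) ** 2 for movie, rating in actual.items()]
    return math.sqrt(sum(errors) / len(errors))

score_a = rmse_on_graded(box_a, graded)
score_b = rmse_on_graded(box_b, graded)

print(f"Box A: {score_a:.4f}, Box B: {score_b:.4f}")
print("Scores identical:", score_a == score_b)
```

Both boxes receive exactly the same score, even though one of them would enthusiastically recommend Children of the Corn and the other would not. The metric simply has nothing to say about predictions for movies that never receive a rating.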
Your dialogue brings up the obvious reason the Netflix Prize was structured in this way. Since Crito has yet to watch In the Name of the Father, we cannot judge whether Socrates has made an accurate recommendation. Similarly, builders of recommendation systems who rely only on post-facto numerical means to judge the success of their algorithms can never determine how well their approaches will work when it comes to discovery. This is why we suggest a more holistic approach to judging recommendation systems, one which encompasses feedback from panels of real viewers and careful analysis of the underlying business goals for recommendations along with the standard blinded A/B test results.
More directly: What do we mean when we talk about “successful” recommendations? While the meaning of success might change based on context and application, we certainly want success to measure our ability to recommend to Crito a movie he will love and would never have heard about without our recommendation. The Netflix Prize scoring methodology is silent on the ability of an algorithm to accomplish this feat.
MediaUnbound wrote:
“In theory, the Netflix Prize algorithms can be asked to make a prediction about any movie in the dataset. But, in practice, the algorithms are only graded on their ability to accurately predict the ratings on those movies the user actually ends up viewing in the future.”
I guess what I don’t understand is: how is any algorithm ever graded any differently? As in my dialogue, Crito himself (the user) has no ability to evaluate Socrates’ recommendation until he watches the movie. Movies that Crito does not end up viewing in the future have no actual rating, either in a user’s soul or anywhere else. They do not exist! How could we be more or less accurate in making predictions about them? Moreover, why would we be interested in this non-existent data set?
This single point makes it very hard for me to grasp any of your critiques of the Netflix contest as a limited framework for describing the task of recommendation systems and evaluating them.
I should add that I’m very confident you have good reasons for all your claims: you’ve been doing this and thinking about this much longer than I have. I’m just trying to press some of the points that I don’t understand; perhaps it can be of slight benefit in the future if only as a reference for concepts that might merit further explanation.
I agree that the metric Netflix is using is not really the right one. Users try to watch movies they like, thus most of the ratings are positive. One would have to force people to watch and rate randomly selected movies to make this unbiased.
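A toy sketch of that selection effect, under a deliberately crude assumption: one viewer would give the ten hypothetical ratings below if forced to watch everything, but in practice only watches movies he or she would rate 3 or higher. The numbers are made up for illustration.

```python
# Hypothetical ratings one viewer would give if forced to watch all ten movies.
true_ratings = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

# Assumption: the viewer only chooses to watch (and therefore rate)
# movies he or she would rate 3 or higher.
watched = [rating for rating in true_ratings if rating >= 3]

mean_all = sum(true_ratings) / len(true_ratings)
mean_watched = sum(watched) / len(watched)

print(f"Mean rating if every movie were watched: {mean_all:.1f}")
print(f"Mean rating over self-selected movies:   {mean_watched:.1f}")
```

Under this assumption the observed ratings average a full star higher than the viewer's opinion of the catalog as a whole, which is the kind of skew that randomly assigned movies would remove.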