Bandit Algorithms: The next frontier in content recommendations for newsrooms
The reward system of the human brain motivates us to do and achieve all kinds of things. When we try something we haven’t done before, say a new sport, and find that we enjoy it, we’re likely to seek that same satisfaction from the activity again in the future.
Repeated positive associations with an activity reinforce our desire to do it again. We may not get the same level of satisfaction every time; but even if it isn’t always a great success, we’ve internalized that we like the activity in general, and we’re confident that we’ll be able to enjoy it again in the future. Activities we’ve repeatedly found unpleasant, on the other hand, we’ll try to avoid.
A similar idea underlies multi-armed bandits in machine learning. In a multi-armed bandit setting, a system has a number of actions to choose from (hence “multi-armed”). Each action has an intrinsic probability of yielding a reward, but just as a person doesn’t know whether they’ll like something before they’ve tried it a few times, these reward probabilities are initially unknown to the system.
The system is then tasked with collecting as many rewards as possible. It can do this by trying out (exploring) the different actions repeatedly and observing which actions yield a reward most often. The more often the bandit takes an action, the more accurate its estimate of that action’s actual reward probability becomes. As the system gains knowledge about the actions, it can exploit that knowledge by continuing to take the actions that yielded the most reward during exploration, and avoiding the actions that weren’t successful.
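To make the explore/exploit trade-off concrete, here is a minimal epsilon-greedy sketch in Python. The arm names, the “true” reward probabilities and the exploration rate are illustrative assumptions for the example, not values from our production system.

```python
import random

# Hypothetical reward probabilities; in practice these are unknown to the bandit.
TRUE_REWARD_PROBS = {"arm_a": 0.05, "arm_b": 0.12, "arm_c": 0.08}

EPSILON = 0.1                                      # fraction of the time we explore
counts = {arm: 0 for arm in TRUE_REWARD_PROBS}     # how often each arm was tried
rewards = {arm: 0 for arm in TRUE_REWARD_PROBS}    # total reward each arm earned

def choose_arm():
    """Explore a random arm with probability EPSILON, otherwise exploit the best estimate."""
    if random.random() < EPSILON:
        return random.choice(list(counts))
    # Estimated reward probability = observed rewards / observed tries;
    # untried arms get priority so every arm is explored at least once.
    return max(counts, key=lambda a: rewards[a] / counts[a] if counts[a] else float("inf"))

for _ in range(10_000):
    arm = choose_arm()
    reward = 1 if random.random() < TRUE_REWARD_PROBS[arm] else 0
    counts[arm] += 1
    rewards[arm] += reward

for arm in counts:
    print(arm, counts[arm], round(rewards[arm] / counts[arm], 3))
```

After enough rounds, the estimated reward rates converge towards the true probabilities, and the arm with the highest rate ends up being chosen most of the time.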
We’ve put this bandit theory into practice in our own recommender engine. Our new bandit recommender presents news articles to a reader and learns from the feedback to its own actions as it goes.
> When an article gets published, the bandit algorithm recommends the new article to people in an initial exploration phase. By observing the number of clicks the article receives, it estimates the probability that someone will click it when it is presented.
> It then exploits this knowledge by recommending articles with a high probability of being clicked.
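One common way to implement this kind of click-probability learning is Thompson sampling with a Beta distribution per article. The sketch below is our own illustration of that idea under assumed article IDs, not the exact algorithm running in production.

```python
import random

# Clicks and impressions observed so far for each article (hypothetical IDs).
stats = {"article_1": [0, 0], "article_2": [0, 0], "article_3": [0, 0]}

def recommend():
    """Sample a plausible click probability per article and show the highest sample.

    Articles with few impressions have wide posteriors, so they still get explored;
    articles with a proven high click rate win most of the time (exploitation).
    """
    samples = {
        article: random.betavariate(1 + clicks, 1 + impressions - clicks)
        for article, (clicks, impressions) in stats.items()
    }
    return max(samples, key=samples.get)

def record_feedback(article, clicked):
    """Update the article's click and impression counts with the observed outcome."""
    stats[article][0] += int(clicked)
    stats[article][1] += 1

# One round: recommend an article, observe whether the reader clicked, learn from it.
shown = recommend()
record_feedback(shown, clicked=False)
```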
What’s unique about the bandit algorithm
Traditional recommendation algorithms are often based on general user behavior, such as pageviews recording who has read which articles.
This bandit approach to content recommendation is different in that it takes into account the feedback to its own recommendations. Thanks to this, it is especially well suited to optimizing for a specific context: it figures out what users are most likely to click in a particular box, be it on the homepage or an article page, on the website or in the app.
Due to its emphasis on exploration, it can help users discover content that they wouldn’t otherwise have encountered. Below we’ll show how this is useful with a live experiment we ran.
When to use the bandit
Our bandit works best when it has enough opportunities to learn an accurate estimate of the reward probability. As a rule of thumb, we try to make sure that every article can accumulate a thousand impressions relatively quickly after publication. The more traffic a location receives, the larger the set of candidate articles can be.
If, on the other hand, a box of recommended articles receives very little traffic, then a different recommendation technique may be more appropriate.
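As a back-of-the-envelope check, you can estimate how many candidate articles a given box can support. The thousand-impression target comes from the rule of thumb above; the traffic figure and the learning window in the example are placeholder assumptions.

```python
# Rule of thumb from above: each article should be able to reach
# roughly 1,000 impressions soon after publication.
IMPRESSIONS_PER_ARTICLE = 1_000

def max_candidate_articles(box_impressions_per_day: int, learning_window_days: float) -> int:
    """Rough upper bound on the candidate pool a box can keep well explored."""
    total_impressions = box_impressions_per_day * learning_window_days
    return int(total_impressions // IMPRESSIONS_PER_ARTICLE)

# Hypothetical example: a box shown 50,000 times a day, with one day to learn.
print(max_candidate_articles(50_000, learning_window_days=1))  # -> 50
```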
Results
The benefits show clearly in the click-through rate (CTR) of a box captioned “Recommended for you” on Het Nieuwsblad, where the bandit was A/B-tested against a baseline of popular articles. Over a period of a month, the CTR of the bandit recommender was more than twice that of the baseline.
Simply showing articles that are popular in general was clearly not optimal for this box. This is where the bandit’s ability to learn from its own recommendations turned out to be a real strength. It discovered by itself which content was most appropriate to recommend in this particular box, sparking users’ interest significantly more than the baseline.
Future development
This example only scratches the surface of what a bandit recommender can do, and we’re working to make it even more powerful. In the future the bandit will be aware of contextual information: in deciding what to recommend, it will take into account the user’s interests, or what might be relevant to the article the user is currently reading.
We will make the bandit context-aware by constructing a multi-dimensional space of numerical features and representing each user as a point in that space. Each feature captures something about the user: something specific such as “how much does this user like or dislike sports articles”, or something more abstract like a parameter learned by a machine-learning model.
Once we have an adequate feature space, we can either partition users into clusters with similar interests and compute estimated click probabilities within those groups, or train a regression model that estimates the probability that a specific user will click an article.
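As a rough illustration of the regression route, the sketch below fits a logistic regression on user feature vectors to estimate click probability for one article. The feature names, the tiny training set and the scikit-learn dependency are all assumptions made for the example; the clustering route would instead group these same vectors (e.g. with k-means) and keep per-cluster click counts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical user features: [affinity_sports, affinity_politics, avg_daily_visits]
X = np.array([
    [0.9, 0.1, 2.0],
    [0.2, 0.8, 5.0],
    [0.1, 0.9, 1.0],
    [0.8, 0.3, 4.0],
])
# Whether each of these users clicked a given (say, sports) article: 1 = click.
y = np.array([1, 0, 0, 1])

model = LogisticRegression().fit(X, y)

# Estimated click probability for a new user who leans towards sports.
new_user = np.array([[0.7, 0.2, 3.0]])
print(model.predict_proba(new_user)[0, 1])
```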
Conclusion
To recap: the new bandit algorithm recently integrated into our recommender system learns from its own actions as it goes.
Whereas most other recommendation algorithms will only recommend content that is already popular or otherwise known to be a safe bet, the bandit recommender’s emphasis on exploration helps users discover content they wouldn’t otherwise have encountered.
The bandit algorithm can significantly increase CTR when recommending news articles, and there is a lot of potential to make it more advanced in the future; for example, a contextual bandit will take the user’s interests and other context into account.
All that being said, there is much more to explore within bandit algorithms, and at Froomle we look forward to seeing what the future brings and how we can put it into practice.