University of Antwerp: How much do readers care about regional news?
When newspapers want to recommend an article to a reader, they quickly run into some limitations. How do you know if that person will find the article interesting, for example? By looking at that person's reading history? Should you recommend a popular or recent article?
These are just some of the questions news managers struggle with daily.
Froomle, the leading Belgian AI company with a strong focus on journalism, recognizes that different recommendation use cases each require a specific approach. The scale-up therefore joined forces with researcher Len Feremans (University of Antwerp) to study this topic. The goal of the study? To see whether we can predict which regional articles people want to read by jointly considering the user’s article and geographical preferences, social influence and time.
On newspaper websites, such articles are often suggested based on the municipality where you live. That data is extracted from your subscription information or by looking at your IP address. Afterwards, the media rank recent and/or popular news articles from that region.
This way of working creates several kinds of bias. The first set is related to imbalance, such as item popularity and city dominance, where readers living in smaller towns get more recommendations from big cities. The second set is related to a lack of data: cold-start users, cold-start items and cold-start regions (low-population regions without recent publications). Thirdly, many biases are temporal, such as the short lifetime of news articles and concept drift, where user preferences and local news topics evolve.
Feremans researched whether this method could be improved, and whether machine learning could come into play.
200 GB of articles
The data scientist used a dataset of 200 GB of articles and web analytics sourced from Het Nieuwsblad through Froomle’s big data platform. Feremans loaded all interactions and article metadata during 40 days (from June 1 until August 11, 2021) and excluded all articles containing general news and sport. He also fetched each location's longitude and latitude coordinates using a public API that supports forward geocoding. This allowed him to compute which regions are geographically close to each other.
The study calculated offline the probability that someone would read an article, then tested that prediction against that person's online reading behaviour, using a sliding-window evaluation. The Python algorithm produced a ranking of articles, sorted by relevance. Feremans used four metrics to measure the success of his work:
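As a rough illustration of how nearby regions can be derived from coordinates, the sketch below uses the haversine (great-circle) distance. The function names, the choice of haversine and the approximate coordinates are assumptions made for this example; the article does not describe the distance measure the study actually used.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_regions(region, coords, k=2):
    """Return the k regions closest to `region` (hypothetical helper)."""
    lat, lon = coords[region]
    others = [(haversine_km(lat, lon, *coords[name]), name)
              for name in coords if name != region]
    return [name for _, name in sorted(others)[:k]]

# Approximate, illustrative coordinates (not taken from the study)
coords = {
    "Antwerpen": (51.22, 4.40),
    "Mechelen": (51.03, 4.48),
    "Gent": (51.05, 3.72),
    "Brugge": (51.21, 3.22),
}
neighbours = nearest_regions("Antwerpen", coords, k=2)
```

With these coordinates, Mechelen and Gent come out as the two closest regions to Antwerpen.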
- Recall: share of the articles a user actually read that appear among the recommendations
- Hit rate: percentage of users who actually read at least one of the recommended articles
- Kendall Tau: rank correlation between the predicted ranking and the actual ranking; the more pairs of articles are in the correct order, the higher the score
- NDCG (normalized discounted cumulative gain): ranking quality, where a relevant article counts for more the higher it appears in the list
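For readers who want to see these metrics concretely, here is a minimal sketch using their standard textbook definitions with binary relevance (an article was read or not). The function names are hypothetical, and the study's exact implementation may differ, for example in how ties or missing articles are handled.

```python
import math
from itertools import combinations

def recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant (actually read) articles found in the top k."""
    if not relevant:
        return 0.0
    return len(set(recommended[:k]) & set(relevant)) / len(set(relevant))

def hit_at_k(recommended, relevant, k=10):
    """1.0 if at least one top-k article was read; average over users for hit rate."""
    return 1.0 if set(recommended[:k]) & set(relevant) else 0.0

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG: hits near the top of the list count more."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, art in enumerate(recommended[:k]) if art in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def kendall_tau(rank_a, rank_b):
    """Kendall tau-a over the items present in both rankings (no ties)."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    common = [item for item in rank_a if item in pos_b]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 0.0
    # x precedes y in rank_a; the pair is concordant if that also holds in rank_b
    concordant = sum(1 for x, y in pairs if pos_b[x] < pos_b[y])
    return (2 * concordant - len(pairs)) / len(pairs)

recommended = ["a", "b", "c"]
read = ["a", "d"]
scores = (recall_at_k(recommended, read, k=3),
          hit_at_k(recommended, read, k=3),
          ndcg_at_k(recommended, read, k=3))
```

In this toy case one of the two read articles was recommended, so recall is 0.5 and the hit value is 1.0.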
The results always included the distinction between the data with and without popular articles because article popularity is not always related to a specific region.
The results
With 10 recommended articles, the recall (‘recall@10’) in the experiment was 33%. The hit rate at 10 recommendations was 36% higher for the research group than for the control group. In other words, 36% more users would have read an article from the list of recommended articles if this algorithm had been running in a live environment.
To achieve such success, Feremans conducted several experiments to see what works and what does not. For example, there was a noticeable difference between suggesting articles from the past two weeks and going back in time up to three months. The most decisive parameter, of course, was the location. However, there are different ways to approach that.
Feremans analyzed which regional news users had read in the past three months, excluded the 1% most popular articles and then ranked the regions each user read most to create user profiles. He kept only the top regions because the NDCG at 10 recommended articles increased from 8.0% to 13.2% (+65%) when he used the top 2 regions instead of the complete user profile.
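A minimal sketch of such a profile-building step could look as follows. The data layout (a list of user and article pairs plus an article-to-region mapping), the function name and the parameter names are assumptions made for illustration, not the study's code.

```python
from collections import Counter

def build_region_profiles(reads, article_region, top_n=2, drop_top_fraction=0.01):
    """Build per-user region profiles from reading history.

    reads: list of (user_id, article_id) pairs (assumed layout)
    article_region: dict mapping article_id to a region name
    """
    # Find the most-read articles and exclude them from the profiles
    popularity = Counter(article for _, article in reads)
    n_drop = int(len(popularity) * drop_top_fraction)
    too_popular = {article for article, _ in popularity.most_common(n_drop)}

    per_user = {}
    for user, article in reads:
        if article in too_popular or article not in article_region:
            continue
        per_user.setdefault(user, Counter())[article_region[article]] += 1

    # Keep only each user's top_n most-read regions
    return {user: [region for region, _ in counts.most_common(top_n)]
            for user, counts in per_user.items()}

article_region = {"a1": "Gent", "a2": "Gent", "a5": "Gent",
                  "a3": "Brugge", "a6": "Brugge", "a4": "Aalst"}
reads = [("u1", a) for a in ["a1", "a2", "a5", "a3", "a6", "a4"]]
profiles = build_region_profiles(reads, article_region, top_n=2)
```

Here the user read three articles from Gent, two from Brugge and one from Aalst, so the top-2 profile keeps Gent and Brugge and drops Aalst.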
He then used the OpenCage API to determine the closest regions based on longitude and latitude coordinates. The recall for 10 recommended articles was 33%, the hit rate 54.4%. This is a lot more than when working with the existing list of Het Nieuwsblad, which even decreased the hit rate by 1.5%.
To optimize the ranking, Feremans experimented with different scoring functions that combine an article's recency, its popularity and the relevance of its location. In the end, filtering jointly on recency and popularity, and then ranking on the popularity in the past 24 hours divided by the age of the article in hours, improved Kendall Tau by 38.4% for anonymous users.
After evaluating thousands of algorithms and methodologies offline, the following content-based recommendation algorithm produced the best results for this use case:
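The scoring idea described above (views in the past 24 hours divided by article age in hours, after a recency and popularity filter) can be sketched as follows. The field names, the two-week age cut-off, the minimum-views threshold and the one-hour floor on age are assumptions chosen for this example; the article does not state the values the study used.

```python
from datetime import datetime, timedelta

def rank_articles(articles, now, max_age_days=14, min_views_24h=1, k=10):
    """Filter on recency and popularity, then rank by views_24h / age in hours.

    articles: list of dicts with 'id', 'published' (datetime) and
    'views_24h' (int); this layout is hypothetical.
    """
    scored = []
    for art in articles:
        age_hours = (now - art["published"]).total_seconds() / 3600
        # Filter step: drop articles that are too old or not read recently
        if age_hours > max_age_days * 24 or art["views_24h"] < min_views_24h:
            continue
        # Floor the age at one hour so brand-new articles do not explode the score
        score = art["views_24h"] / max(age_hours, 1.0)
        scored.append((score, art["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article_id for _, article_id in scored[:k]]

now = datetime(2021, 8, 11, 12, 0)
articles = [
    {"id": "a", "published": now - timedelta(hours=2), "views_24h": 100},
    {"id": "b", "published": now - timedelta(hours=10), "views_24h": 300},
    {"id": "c", "published": now - timedelta(days=20), "views_24h": 1000},
    {"id": "d", "published": now - timedelta(hours=1), "views_24h": 0},
]
ranking = rank_articles(articles, now)
```

Article "a" scores 50 views per hour and "b" scores 30, while "c" is filtered out as too old and "d" as unread, so the ranking is a followed by b.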
- Create user profiles based on regional reading history, excluding national and popular articles
- Limit the number of regions in each user profile
- Take neighboring regions into account
- Among the articles matching a user's profile, prioritize recent articles, which perform best
Many experiments had little impact, such as ensembles with collaborative filtering. Experiments that linked articles by their content did not give a substantial improvement either. For example, Feremans investigated whether articles could be linked via word2vec based on content or title, but this did not significantly increase the accuracy.
What is next?
Further research will focus on increasing the complexity of the integrated model. Feremans wants to refine the ranking function by including a personalization component (either content-wise or using collaborative filtering) together with geographic preferences. The second part of the research will be published at the beginning of 2022.
In the meantime, Froomle’s data engineering team has already made progress on bringing this offline research into a live production environment. In November 2021 the first online tests using the regional profile started for ‘push audience selection’, soon to be followed by website recommendations.