Most news articles have very short lifespans, but a small number maintain a timeless quality and consistently draw public interest. At the Post, we analyzed the characteristics of these articles and developed a model to automatically identify them as evergreen articles.
As illustrated in Figure 1, we categorize news articles published by the Post into groups based on their traffic patterns. Our study focused on evergreen articles, whose traffic remains significant over a long period of time. In Figure 1, the examples in blue and red are the targets of this study.
Definition of Evergreen Articles
The first challenge in this study was to define evergreen articles. After talking to our domain experts at the Post, we decided to define them by their traffic patterns.
We preprocessed the traffic information for all articles in two steps. First, we ignored the first 3 months of traffic for each article, because we are looking for long-term traffic patterns. Second, we smoothed each article's traffic using the median over a 5-month time window. This smoothing step is important for removing one-off spikes in the traffic pattern.
After preprocessing, an article is considered evergreen if it satisfies two conditions: 1) its average number of page views per month is at least 500; 2) its month-to-month decrease in page views is less than 60%.
Based on this definition, we found 0.5% of Post articles published between 2015 and 2017 to be evergreen.
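The two preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration, not the Post's production code: the function name `preprocess`, the list-of-monthly-counts input, and the choice of a centered window that shrinks at the edges are all assumptions, since the text only specifies a 3-month cutoff and a 5-month median.

```python
from statistics import median


def preprocess(monthly_views, skip_months=3, window=5):
    """Drop the first `skip_months` months, then apply a rolling median.

    `monthly_views` is a list of page-view counts, one per month since
    publication (a hypothetical input format).
    """
    tail = monthly_views[skip_months:]
    half = window // 2
    smoothed = []
    for i in range(len(tail)):
        # Assumption: centered window, truncated at the series edges.
        lo = max(0, i - half)
        hi = min(len(tail), i + half + 1)
        smoothed.append(median(tail[lo:hi]))
    return smoothed


# Made-up traffic: a launch burst, then steady traffic with one spike.
views = [9000, 4000, 2000, 800, 750, 5000, 700, 720, 690, 710]
smoothed = preprocess(views)
```

In this made-up series the launch burst is discarded by the 3-month cutoff, and the isolated 5,000-view month no longer appears in `smoothed`, because a single outlier cannot be the median of its window.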
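As a rough sketch, the two conditions can be written as a single check over an article's smoothed monthly page views. The function name `is_evergreen` and its parameters are hypothetical, and the second condition is interpreted here as "no consecutive pair of months shows a relative drop of 60% or more", which is one plausible reading of the definition rather than a confirmed one.

```python
def is_evergreen(smoothed_views, min_avg_views=500, max_drop=0.60):
    """Apply the two evergreen conditions to smoothed monthly page views."""
    if len(smoothed_views) < 2:
        return False  # not enough history to judge a long-term pattern

    # Condition 1: average monthly page views of at least 500.
    average = sum(smoothed_views) / len(smoothed_views)
    if average < min_avg_views:
        return False

    # Condition 2 (assumed reading): every month-to-month relative
    # decrease must stay below 60%.
    for prev, cur in zip(smoothed_views, smoothed_views[1:]):
        if prev > 0 and (prev - cur) / prev >= max_drop:
            return False
    return True
```

For example, a series such as `[800, 775, 750, 720, 710]` passes both conditions, while `[2000, 700, 690]` fails the second one because the first month-to-month drop is 65%.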
Characteristics of Evergreen Articles
We first categorized evergreen articles at the Post. Figure 2 lists the top 10 categories of evergreen articles. Book reviews account for 6.31% of all evergreen articles. Although we define evergreen articles using only traffic patterns, the resulting categories are meaningful.
We also investigated when evergreen articles were published. Figure 3 shows the publication hour: most evergreen articles were published around the middle of the day, while other articles were published in the afternoon. Figure 4 shows the day of the week: most evergreen articles were published early in the week, while other articles were published more evenly throughout the week.
Evergreen Articles Prediction
After we understood more about evergreen articles, we built a backend prediction model to automatically decide whether a recently published article is evergreen.
At the prediction phase, we considered only articles published more than 3 months ago. From these articles, we identified potential evergreen articles.
We explored three types of signals that are indicative of whether an article is evergreen:
- Initial pageviews
  - Monthly pageviews in past 90 days
  - Month-to-month pageview differences in past 90 days
- Article Content
  - Article keywords
  - Article topics
- Article Metadata
  - Publication time
We automatically learned the weights/thresholds for all of these signals from over 200,000 training articles published by the Post. Given a new article, we apply the weights/thresholds to the above signals. If the article satisfies all of these pre-trained thresholds, it is considered to be an evergreen article.
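A rule of the form "satisfy every threshold" can be sketched as follows. Everything named here is hypothetical: the feature names, the helper functions, and the threshold values are placeholders for illustration, not the weights or thresholds learned from the Post's training data. Only traffic signals are shown; content and metadata signals are omitted.

```python
def relative_change(prev, cur):
    """Relative month-to-month change; 0.0 when the previous month had no views."""
    return (cur - prev) / prev if prev > 0 else 0.0


def traffic_features(last_three_months):
    """Build illustrative traffic features from three monthly page-view counts."""
    m1, m2, m3 = last_three_months
    return {
        "min_monthly_views": min(m1, m2, m3),
        "worst_relative_change": min(relative_change(m1, m2),
                                     relative_change(m2, m3)),
    }


def predict_evergreen(features, thresholds):
    """Flag an article only if every feature meets its lower bound."""
    return all(features[name] >= bound for name, bound in thresholds.items())


# Placeholder thresholds, NOT learned values.
example_thresholds = {
    "min_monthly_views": 500,
    "worst_relative_change": -0.60,
}

is_candidate = predict_evergreen(traffic_features([900, 850, 820]),
                                 example_thresholds)
```

Requiring all thresholds to hold keeps the rule easy to explain to editors: an article is rejected as soon as any single signal falls below its bound.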
How it is being used at the Post
Prediction results are shared regularly with a group of editors via Slack for review. If an article is approved as evergreen, it can be shared on social media or inserted as a related link in another article.
This work has been accepted at ECML/PKDD 2019, and you can find more technical details in the paper.