Simplifying Content-Based Neural News Recommendation: On User Modeling and Training Objectives

The advent of personalized news recommendation has given rise to increasingly complex recommender architectures. Most neural news recommenders rely on user click behavior and typically introduce dedicated user encoders that aggregate the content of clicked news into user embeddings (early fusion). These models are predominantly trained with standard point-wise classification objectives. The existing body of work exhibits two main shortcomings: (1) despite general design homogeneity, direct comparisons between models are hindered by varying evaluation datasets and protocols; (2) it leaves alternative model designs and training objectives vastly unexplored. In this work, we present a unified framework for news recommendation, allowing for a systematic and fair comparison of news recommenders across several crucial design dimensions: (i) candidate-awareness in user modeling, (ii) click behavior fusion, and (iii) training objectives. Our findings challenge the status quo in neural news recommendation. We show that replacing sizable user encoders with parameter-efficient dot products between candidate and clicked news embeddings (late fusion) often yields substantial performance gains. Moreover, our results render contrastive training a viable alternative to point-wise classification objectives.

Most neural news recommendation (NNR) models encode users and candidate news separately, in a candidate-agnostic manner [1,29,32]. Candidate-aware models [20,22,25,42], in contrast, acknowledge that not all clicked news are equally informative w.r.t. the relevance of the candidate (e.g., a candidate is often representative of only a subset of a user's preferences), and contextualize representations of clicked news with the embedding of the candidate in user-level aggregation with the user encoder (UE). Finally, the candidate's embedding (output of the news encoder, NE) is compared against the user embedding (output of the UE): the candidate's recommendation score is computed directly as the dot product of the two embeddings [29] or with a feed-forward scorer [26]. NNR models are predominantly trained via standard classification objectives [26,29,32,36] with negative sampling [7,30].
The existing body of work has two main shortcomings. First, despite general design homogeneity, direct comparisons between recent NNRs are hindered by a lack of transparency and the adoption of ad-hoc evaluation protocols [8,23]. In particular, a vast majority of personalized news recommenders are evaluated on proprietary datasets (e.g., MSN News [29,32], Bing News [26], NewsApp [20]). Even the few models evaluated on publicly available datasets such as Adressa [6] or MIND [38] cannot be directly compared due to different dataset splits and evaluation protocols (e.g., model selection strategy) [5,27,36,42]. Second, simpler and arguably more intuitive design alternatives have largely been left unexplored. On the one hand, existing work adopts early fusion (EF) as the default architecture, proposing increasingly complex user encoding components [1,20], often with little empirical justification for the added complexity. On the other hand, only a small fraction of NNRs leverage contrastive learning objectives [33,41], despite such training criteria having proven highly effective in closely related retrieval and recommendation tasks [14,28,39,40].
In this work, we remedy the above shortcomings of current NNRs and shed new light on user modeling and training objectives. 1) Concretely, we introduce a unified framework for neural news recommendation, facilitating systematic and fair comparison of NNR models across three crucial design dimensions: (i) candidate-awareness in user modeling, (ii) click behavior fusion, and (iii) training objectives. 2) We propose to replace user modeling with complex user encoders (i.e., early fusion, EF) with simple pooling of dot-product scores between candidate and clicked news embeddings (i.e., late fusion, LF). We show that, despite conceptual simplicity, LF brings substantial performance gains over EF-based NNR, rendering complex UEs empirically unjustified. 3) Finally, we demonstrate the benefits of supervised contrastive training as a viable alternative to point-wise classification. Our work fundamentally challenges the status quo of NNR by introducing simpler and more effective alternatives to the established paradigm based on complex user modeling.

METHODOLOGY
Figure 1 depicts our unified evaluation framework for NNR, focusing on three critical dimensions of news recommendation. Given input data, comprising news and user behaviors, we analyze (i) candidate-agnostic (C-AG) vs. candidate-aware (C-AW) user modeling under (ii) two click behavior fusion strategies, namely EF and LF, where each model can be (iii) trained by minimizing either the standard cross-entropy loss (CE) or a supervised contrastive objective (SCL). Next, we describe the models selected for evaluation and formalize the concrete design choices.

User Modeling
Candidate-Agnostic (C-AG) Models. For these models, the UE produces the user embedding from embeddings of clicked news without contextualization against the candidate. We evaluate the following C-AG models, mutually differing in their NE component (i.e., how they embed the clicked news): (1) NPA [30] uses a personalized attention module to aggregate the representations of the users' clicked news, with projected embeddings of the users' IDs as attention queries; (2) NAML [29] uses additive attention [2] to encode users' preferences; (3) NRMS [32] learns user representations with a two-layer encoder that consists of multi-head self-attention [24] and additive attention; (4) LSTUR [38] learns user representations with recurrent networks: a short-term user embedding is produced from the clicked news with a GRU [4], and combined with a long-term embedding, consisting of a randomly initialized and fine-tuned part; the final user embedding is then obtained either (i) as the final hidden state of the short-term GRU, initialized with the long-term embedding (LSTUR_ini), or (ii) by simply concatenating the short- and long-term user embeddings (LSTUR_con); (5) CenNewsRec [21] adopts a UE architecture similar to that of LSTUR, but learns long-term user vectors from clicked news using a sequence of multi-head self-attention and attentive pooling networks, as opposed to storing an explicit embedding per user; (6) MINS [27] encodes users through a combination of multi-head self-attention, a multi-channel GRU-based recurrent network, and additive attention.
Candidate-Aware (C-AW) Models. UEs in candidate-agnostic models produce the same user embedding, regardless of the content of the candidate news. In contrast, UEs of candidate-aware models, two of which we include in our empirical analysis, produce user embeddings dependent on the candidate. (7) DKN [26] computes candidate-aware representations of users as the weighted sum of their clicked news embeddings, with weights being produced by an attention network that takes as input the embeddings of the candidate and of the clicked news, as produced by the NE. More recently, (8) CAUM [20] combines (i) a candidate-aware self-attention network to model long-range dependencies between clicked news, conditioned on the candidate, and (ii) a candidate-aware convolutional network (CNN) to capture short-term user interests from adjacent clicks, again conditioned on the candidate's content; the candidate-aware user embedding is finally obtained by attending over the long-range and short-term representations.
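To make the notion of candidate-awareness concrete, the following is a minimal NumPy sketch of attention-weighted pooling in the spirit of the DKN description above: attention weights are computed from each (clicked news, candidate) pair, and the user embedding is the weighted sum of clicked news embeddings. The one-hidden-layer scorer, its random parameters (`W1`, `b1`, `w2`), and all dimensions are illustrative assumptions, not the configuration of DKN or of our framework.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def candidate_aware_user_embedding(candidate, clicked, W1, b1, w2):
    """Weighted sum of clicked-news embeddings; weights depend on the candidate.

    candidate: (d,) candidate news embedding
    clicked:   (N, d) embeddings of the N clicked news
    W1, b1, w2: parameters of a hypothetical one-hidden-layer attention scorer
    """
    n = clicked.shape[0]
    # Pair every clicked-news embedding with the candidate embedding.
    pairs = np.concatenate([clicked, np.tile(candidate, (n, 1))], axis=1)  # (N, 2d)
    hidden = np.tanh(pairs @ W1 + b1)                                      # (N, h)
    weights = softmax(hidden @ w2)                                         # (N,)
    return weights @ clicked, weights

# Toy data: random embeddings stand in for news encoder outputs.
rng = np.random.default_rng(0)
d, h, n_clicked = 8, 16, 5
clicked = rng.normal(size=(n_clicked, d))
W1, b1, w2 = rng.normal(size=(2 * d, h)), np.zeros(h), rng.normal(size=h)

cand_a, cand_b = rng.normal(size=d), rng.normal(size=d)
user_a, weights_a = candidate_aware_user_embedding(cand_a, clicked, W1, b1, w2)
user_b, weights_b = candidate_aware_user_embedding(cand_b, clicked, W1, b1, w2)
```

In this sketch the same click history yields different user embeddings for the two candidates, which is exactly what distinguishes C-AW from C-AG user encoders.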
News Encoders. The NNR models included in our evaluation primarily use news titles as input, which they typically embed via pretrained word embeddings [17]. NAML, LSTUR, MINS, and CAUM additionally leverage category information, with categories embedded with a linear layer. CAUM additionally encodes title entities and DKN exploits knowledge graph embeddings [9]. The shallow word and entity embeddings are contextualized either using a combination of multi-head self-attention (in NRMS, MINS, CAUM), or a sequence of CNN [11] and additive attention networks (in NAML, LSTUR). NPA [30] also utilizes a CNN to contextualize word embeddings, followed by a personalized attention module, analogous to the one used in its user encoder, whereas DKN employs a word-entity-aligned knowledge-aware CNN [26]. CenNewsRec [21] combines the CNN network with multi-head self-attention and additive attention modules. Models with multiple feature vectors produce final news embeddings by simply concatenating them (LSTUR, CAUM), or by attending over them (NAML, MINS).

Click Behavior Fusion
We question whether the design and computational complexity of early fusion (EF), i.e., the existence of dedicated user encoders in state-of-the-art NNR models, is justified. To this end, we propose, as a lightweight alternative, the late fusion (LF) approach that replaces the elaborate user encoders with mean-pooling of dot-product scores between the embedding of the candidate $n^c$ and the embeddings of the clicked news $n^u_i$. Given a candidate news $n^c$ and a sequence of news clicked by the user $H = n^u_1, ..., n^u_N$, we compute the relevance score of the candidate news with regards to the user $u$'s history as
$$s(\mathbf{n}^c, u) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{n}^c \cdot \mathbf{n}^u_i,$$
where $\mathbf{n}$ denotes the embedding of a news learned by the news encoder and $N$ the history length.
Although LF suggests that explicitly encoding user behavior may not be necessary for click prediction, user embeddings are still needed in collaborative-filtering models [13]. Note that the LF formulation above is equivalent to the dot product between the candidate embedding $\mathbf{n}^c$ and the mean of the embeddings of the user's clicked news $\mathbf{n}^u_i$:
$$s(\mathbf{n}^c, u) = \mathbf{n}^c \cdot \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{n}^u_i \right).$$
This means that LF can also seamlessly provide user embeddings (simply as averages of clicked news embeddings) if needed. LF can thus be seen as a parameterless user encoder, i.e., a computationally efficient alternative to complex parameterized UEs in existing EF models. Because (i) we produce embeddings of candidate and clicked news independently, and (ii) yield user embeddings as averages of clicked news embeddings, LF models are candidate-agnostic (C-AG).
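The equivalence between mean-pooled dot products and a dot product with the mean clicked-news embedding can be checked numerically. The snippet below is a minimal NumPy sketch; the function names and the random embeddings (standing in for news encoder outputs) are illustrative assumptions.

```python
import numpy as np

def late_fusion_score(candidate, clicked):
    """Mean of dot products between the candidate and each clicked news."""
    return float(np.mean(clicked @ candidate))

def mean_user_embedding(clicked):
    """Parameterless user embedding: the average of clicked-news embeddings."""
    return clicked.mean(axis=0)

# Toy data: 50 clicked news (the maximum history length used in our setup)
# with a hypothetical embedding dimension of 16.
rng = np.random.default_rng(0)
clicked = rng.normal(size=(50, 16))
candidate = rng.normal(size=16)

score_lf = late_fusion_score(candidate, clicked)
score_via_user = float(candidate @ mean_user_embedding(clicked))
```

Both scores agree up to floating-point error because the dot product is linear in its arguments, so the mean can be moved inside the product.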

Training Objectives
The vast majority of existing NNR work, regardless of the concrete user modeling architecture, tunes the parameters by minimizing the arguably most straightforward classification objective, cross-entropy loss (with negative sampling; see Figure 1), and largely fails to explore effective alternatives, foremost contrastive objectives [16,33]. This prevents understanding of the models' effectiveness under different training regimes. We address this limitation by training all models (see §2.1) not only with (1) the common cross-entropy loss (with negative sampling), but also via (2) a contrastive learning objective, in particular the supervised contrastive loss [10].
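The two families of objectives can be sketched as follows, in NumPy. The first function is cross-entropy over one clicked candidate and its sampled negatives; the second is a generic supervised contrastive loss over a batch of L2-normalized vectors, averaged over anchors. How anchors and labels are constructed in our experiments is not spelled out in this section, so the batch layout, the toy labels, and the default temperature below are assumptions made purely for illustration.

```python
import numpy as np

def cross_entropy_with_negatives(pos_score, neg_scores):
    """Softmax cross-entropy with the clicked candidate as the target class."""
    logits = np.concatenate([[pos_score], np.asarray(neg_scores, dtype=float)])
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss; samples sharing a label are positives."""
    labels = np.asarray(labels)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    sim = sim - sim.max(axis=1, keepdims=True)          # cancels out in log_prob
    not_self = ~np.eye(len(labels), dtype=bool)
    # Log-probability of each other sample, normalized over all non-self samples.
    log_prob = sim - np.log((np.exp(sim) * not_self).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_counts = positives.sum(axis=1)
    valid = pos_counts > 0                              # anchors with >= 1 positive
    per_anchor = -(log_prob * positives).sum(axis=1)[valid] / pos_counts[valid]
    return float(per_anchor.mean())

# Toy check: two tight clusters; labels aligned with clusters vs. mixed labels.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=2.0, scale=0.1, size=(4, 8)),
    rng.normal(loc=-2.0, scale=0.1, size=(4, 8)),
])
loss_aligned = supervised_contrastive_loss(embeddings, [1, 1, 1, 1, 0, 0, 0, 0])
loss_mixed = supervised_contrastive_loss(embeddings, [1, 0, 1, 0, 1, 0, 1, 0])
ce_uniform = cross_entropy_with_negatives(0.0, np.zeros(4))  # 1 positive, 4 negatives
```

The contrastive loss is lower when same-label samples are close in the representation space (`loss_aligned`) than when labels cut across clusters (`loss_mixed`), whereas the cross-entropy term only looks at the scores of one positive relative to its negatives.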

EXPERIMENTAL SETUP
Data. We conduct experiments on the MINDsmall and MINDlarge datasets, introduced by Wu et al. [38]. Table 1 summarizes their main statistics. Since Wu et al. [38] do not release test set labels, we use the respective validation portions for testing, and split the respective training sets into temporally disjoint training (first four days of data) and validation portions (the last day).
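A temporally disjoint split of this kind can be sketched in a few lines of Python. The record layout (a dict with a `time` field), the function name, and the toy dates are hypothetical and do not reflect the actual MIND file format or its collection period.

```python
from datetime import datetime, timedelta

def temporal_split(impressions, n_validation_days=1):
    """Hold out the last calendar day(s) of the log as the validation portion."""
    days = sorted({imp["time"].date() for imp in impressions})
    validation_days = set(days[-n_validation_days:])
    train = [imp for imp in impressions if imp["time"].date() not in validation_days]
    valid = [imp for imp in impressions if imp["time"].date() in validation_days]
    return train, valid

# Toy log: five consecutive days with two impressions per day.
start = datetime(2020, 1, 1)
impressions = [
    {"id": 2 * day + k, "time": start + timedelta(days=day, hours=8 + 12 * k)}
    for day in range(5)
    for k in range(2)
]
train, valid = temporal_split(impressions)
```

With five days of data, the first four days end up in `train` and the last day in `valid`, so no validation impression precedes a training impression.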
Implementation and Optimization Details. We use 300-dimensional pretrained GloVe embeddings [17] and 100-dimensional TransE embeddings [3] pretrained on Wikidata to initialize, respectively, the word and entity embeddings of the NNR models under comparison. We set the maximum history length to 50. Following Wu et al. [33], our negative sampling creates four negatives per positive example. We find the optimal temperature for SCL using the validation performance, sweeping the interval [0.08, 0.3] with a 0.02 step. We train with a batch size of 512 for all C-AG models, 256 for DKN, and only 64 for CAUM (due to computational limitations). We set all other model-specific hyperparameters to the optimal values reported in the respective papers. We train all models with mixed precision, under a fixed computational budget: for 25 epochs on MINDsmall and 10 epochs on MINDlarge. We optimize with the Adam algorithm [12], with the learning rate set to 1e-4. We repeat each experiment five times (with different random seeds) and report averages (and standard deviations) for common metrics: AUC, MRR, nDCG@5, and nDCG@10. Each model is trained on a single NVIDIA Tesla V100 GPU with 32GB memory. Our implementation is publicly available.
One confounding factor that we do not control for, however, and which warrants a mindful comparison of the results, is that models differ not just in their UE, but also in their NE components, i.e., w.r.t. how they encode news and which features they use as input. For example, NAML and MINS, with an identical NE, achieve similar performance on MINDsmall. On MINDlarge, however, the more complex UE of MINS brings substantial gains over the simpler UE of NAML (but only under standard EF fusion and CE training).
Figure 2 shows the number of trainable parameters in the original EF configurations, on MINDsmall. (For some models, e.g., LSTUR with its user embedding matrix, the number of parameters depends on the size of the training data.) While the NE accounts for the majority of parameters in most models, the plot shows that the proportion of UE parameters is non-negligible for several models, and largest by a wide margin for LSTUR. With a parameterless UE, along with performance gains, LF brings a relative reduction of model size of 14.7%, 18.1%, and a massive 82.3% for CenNewsRec, CAUM, and LSTUR_ini, respectively.
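The evaluation metrics named in the experimental setup (AUC, MRR, nDCG@k) can be sketched per impression as below, in NumPy. This is a minimal illustration under stated assumptions: binary click labels, ties in AUC counted as half-correct, and the reciprocal rank taken at the highest-ranked clicked item. Evaluation scripts differ in such details, so this should not be read as the exact protocol behind the reported numbers.

```python
import numpy as np

def auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def reciprocal_rank(labels, scores):
    """1 / rank of the highest-ranked clicked item (assumes at least one click)."""
    ranked = labels[np.argsort(-scores)]
    return 1.0 / (np.flatnonzero(ranked)[0] + 1)

def ndcg_at_k(labels, scores, k):
    """nDCG@k with binary gains and log2 position discounts."""
    top = labels[np.argsort(-scores)][:k]
    ideal = np.sort(labels)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, len(top) + 2))
    idcg = float((ideal * discounts).sum())
    return float((top * discounts).sum()) / idcg if idcg > 0 else 0.0

# Toy impression: five candidates, two of which were clicked.
labels = np.array([0, 1, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.1, 0.3, 0.2])
impression_auc = auc(labels, scores)
impression_rr = reciprocal_rank(labels, scores)
impression_ndcg5 = ndcg_at_k(labels, scores, 5)
```

Averaging the per-impression values over all test impressions gives dataset-level scores. Note that AUC rewards correct pairwise separation of clicked and non-clicked items, whereas MRR and nDCG emphasize the top of the ranking.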

CONCLUSION
Rapid development of personalized neural news recommenders hinders fair comparative model evaluations and systematic analyses of design choices. In this work we introduce a unified framework for neural news recommendation focusing on three crucial design dimensions of NNR: (i) candidate-awareness in user modeling, (ii) click behavior fusion, and (iii) training objectives. Extensive evaluation of a wide range of recent state-of-the-art models reveals that NNR can be drastically simplified: replacing complex user encoders with parameterless aggregation of clicked news embeddings brings substantial performance gains across the board, reducing at the same time model complexity. Further, we show that contrastive learning can be a viable alternative to standard classification-based (cross-entropy) loss. We hope that our findings will inspire more transparent NNR evaluation, including systematic model ablations to uncover the components that drive the performance.

Figure 1: Illustration of the unified NNR framework, focusing on three crucial design dimensions: (i) candidate-awareness in user modeling (green box), (ii) click behavior fusion (orange box), and (iii) training objectives (purple box).

Table 1: Statistics of the MINDsmall and MINDlarge datasets.

Table 2 shows the performance on MINDsmall and MINDlarge for both C-AG (NPA, NAML, NRMS, LSTUR, CenNewsRec, and MINS) and C-AW models (DKN, CAUM), under four different configurations of our comparative evaluation framework: (i) user modeling with EF vs. LF, combined with (ii) training with the CE vs. SCL objective. We next dissect the results along the three axes of our framework (§2): user modeling, click behavior fusion, and training objectives.
Candidate-Agnostic vs. Candidate-Aware NNRs. We analyze C-AG vs. C-AW models under their default EF configuration, since with LF all models become candidate-agnostic. CAUM, with the most complex and candidate-aware UE, generally outperforms all other models under both training regimes (CE and SCL) and for most evaluation metrics. The gaps are particularly prominent on the large training dataset, MINDlarge, w.r.t. the AUC metric. This result alone could misleadingly suggest that more complex, candidate-aware user modeling is necessary for better recommendation. Two observations undermine this conclusion: (1) DKN, the other C-AW model in our evaluation, generally performs much worse than the C-AG models, and (2) our LF model variants with trivial, parameterless UEs match or surpass the performance of CAUM with EF. With the exception of DKN, all models exhibit better performance when trained on the larger MINDlarge dataset. NAML and MINS are, however, competitive with CAUM (except w.r.t. the AUC metric) on MINDsmall, but fall behind it on MINDlarge, suggesting that CAUM's elaborate UE benefits the most from more training data.

Table 2: Recommendation performance of the compared models under combinations of click behavior fusion (CBF) and training objectives. We report averages and standard deviations across five different runs.
Early vs. Late Click Behavior Fusion. Replacing complex EF-based UEs with the simple parameterless LF that we propose brings substantial performance gains across the board. Averaged across all models and both training objectives, LF brings massive gains of 5.58 and 4.63 MRR points on MINDsmall and MINDlarge, respectively. Equally importantly, with LF (i.e., with the same parameterless UE), the models perform much more similarly to one another than under EF, with other models generally closing the gap to CAUM. This suggests that LF makes differences in NE architectures across models less consequential, thus not only simplifying the UE with parameterless averaging of clicked news embeddings, but also allowing for simpler news encoders.
Cross-Entropy vs. Supervised Contrastive Loss. Overall, we find SCL to be a viable alternative to the common cross-entropy based classification with negative sampling (compare columns CE and SCL across evaluation metrics in Table 2). SCL brings large gains over CE in terms of AUC (+8.26 points on MINDsmall and +12.14 points on MINDlarge, averaged across all models, in both EF and LF variants). This suggests that SCL leads to better separation of clicked and non-clicked news in the representation space. In contrast, SCL falls slightly behind CE according to the ranking measures, MRR and nDCG (-1.54 and -1.78 MRR points on MINDsmall and MINDlarge, respectively). We hypothesize that this is because of hard negatives (news not clicked by the user that resemble the user's clicked news), for which CE more directly signals irrelevance: these likely become highly ranked false positives for SCL-trained models.
Model Size. Finally, we quantify the reduction in model parameters that LF brings w.r.t. EF.