Targeted display advertising using Machine Learning

This paper delves into the intricate challenges of problem formulation and data representation in the context of a large-scale machine learning system for targeted display advertising. Unlike traditional models, this system is not just conceptual but has been operational for years across thousands of advertising campaigns. Since obtaining ideal training data is cost-prohibitive, data is sourced from related domains and tasks and then adapted for the target task. The paper outlines the architecture of this multi-stage transfer learning system, emphasizing the problem formulation aspects. Extensive experiments demonstrate the value of each transfer stage. Real-world results with diverse advertising clients from various industries showcase the system's performance. The paper concludes with valuable insights gained from over half a decade of work on this complex, widely deployed machine learning system.

INTRODUCTION
The advertising industry, which accounts for approximately 2% of U.S. GDP, places great emphasis on precise ad targeting. Online display advertising, a subfield within this industry, presents both opportunities and complexities. It is promising because of the vast data available for ad targeting, yet challenging because it involves a convoluted ecosystem with multiple stakeholders. This paper primarily addresses customer prospecting in online display advertising: targeting consumers who have not yet interacted with a brand but are potential customers.
The rise of real-time bidding exchanges (RTBs) has revolutionized display advertising, giving advertisers an efficient way to reach specific consumers through real-time auctions. Each ad view, referred to as an "impression," is auctioned off during webpage rendering. Advertisers receive bid requests containing user data and supplement them with additional consumer and website information. With billions of daily auctions, advertisers require large-scale, high-speed systems for real-time decision-making.
This complexity naturally aligns with the integration of machine learning into ad optimization. It leverages massive consumer behavior data, brand-related actions, and real-time ad delivery. The paper explores the workings of a deployed machine learning system used by M6D for finding prospective customers and running targeted display ad campaigns.
The paper's key contribution to machine learning is its practical application, revealing how data characteristics and limitations translate into a complex problem formulation. It highlights that, for pragmatic reasons, the system must draw data from various sampling distributions to create the machine learning solution.
The core challenge is to identify prospective customers for diverse ad targeting campaigns automatically. Obtaining sufficient training data is prohibitively expensive and time-consuming, given the high dimensionality of the problem and low purchase probabilities. To address this, the system employs a two-level modeling approach. The first level utilizes abundant but biased data sources to handle sparsity and high dimensionality, while the second level combines and refines the outputs from the first level using data from the target distribution.
This paper aims to shed light on the design and operational choices of a massive-scale, real-world learning system, which is often overlooked in the machine learning literature. It emphasizes the importance of addressing data availability constraints, including working with non-ideal data distributions and rare outcomes. The system incorporates transfer learning and stacked ensemble classification techniques. Overall, the paper advocates for viewing most machine learning applications as instances of transfer learning, emphasizing the practicality of these techniques in real-world applications.

LITERATURE REVIEW
1. Background on M6D Display Advertising and Related Work
M6D, a significant player in the online display targeting industry, predominantly focuses on prospecting for over 100 brands, delivering millions of ad impressions daily. The system relies on cookies to maintain unique user identifiers, allowing the association of various events with the same consumer. M6D collaborates with data partners to track partial browsing histories and installs tracking pixels on brand websites to record visits, purchases, and other meaningful interactions. This comprehensive data enables meaningful campaign evaluation, emphasizing post-view conversions as the primary metric for success.
M6D primarily delivers ad impressions through ad exchanges, evaluating the prospectiveness of consumers and submitting bids accordingly. Bid prices are determined by a separate machine learning process. While M6D's system is not the only one in the advertising ecosystem, there is common ground in the challenges it faces, such as rare event rates, high-dimensional feature vectors, and the "cold start" problem of having no campaign data before a new campaign begins.
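To address the rare event/high dimensionality problem, various solutions have been proposed. Agarwal et al. used hierarchical relationships for probability estimates. Chen et al. incorporated Laplacian smoothing into Poisson regression, while Pandey et al. and Dalessandro et al. augmented rare outcomes with correlated outcomes having higher occurrence rates. Transfer learning, specifically the use of alternative outcomes in classification models, has also been explored. Liu et al. introduced transfer learning in the context of online display advertising with a multi-task learning approach, where data from multiple tasks are pooled and parameters are estimated across a joint feature space. However, M6D does not apply cross-campaign transfer, because using one brand's data to optimize a competing brand's campaign is undesirable.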
The transfer learning approach presented in this paper extends beyond the standard campaign and utilizes source domains not typically considered. This paper is the first to describe such an application of transfer learning in advertising, particularly one that conducts transfer learning across numerous source tasks at scale. Additionally, it is the first to detail a functional display advertising system that combines multiple models via (stacked) ensemble learning.

Transfer Learning for Display Advertising
The paper's central focus is on transfer learning across different tasks, which necessitates precise definitions to discuss the concept thoroughly. Transfer learning involves learning from a task that differs from the target task in terms of sampling distribution, features, label, or functional dependence between features and the label, and then applying this knowledge to enhance learning in the target task.
A task consists of a domain and a mapping, where the domain includes an example space, a sampling distribution on that space, and a featurization for the examples. Importantly, users may be sampled and featurized differently from the target distribution to augment training data.
A target task is the ultimate goal, with its own domain and mapping. Transfer learning aims to improve the learning of the target task by leveraging knowledge from one or more source tasks. Each source task has its own domain and mapping, distinct from those of the target task.
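The definitions above can be made concrete with a small data-structure sketch. This is only an illustration, not M6D's implementation: the names `Domain`, `Task`, and `build_dataset`, the record fields, and the toy users are all hypothetical. It shows a source task and a target task that share a featurization but differ in sampling distribution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

User = Dict[str, object]      # hypothetical raw record for one internet user
Features = Dict[str, float]   # hypothetical sparse feature map


@dataclass(frozen=True)
class Domain:
    """Example space with its sampling distribution and featurization."""
    sample: Callable[[int], List[User]]    # draws up to n users according to P(E)
    featurize: Callable[[User], Features]  # X(E)


@dataclass(frozen=True)
class Task:
    """A domain plus the mapping (label) to be learned on it."""
    name: str
    domain: Domain
    label: Callable[[User], int]           # Y, 1 for a positive example


def build_dataset(task: Task, n: int) -> List[Tuple[Features, int]]:
    """Sample users from the task's domain, then featurize and label them."""
    users = task.domain.sample(n)
    return [(task.domain.featurize(u), task.label(u)) for u in users]


def featurize(user: User) -> Features:
    """Binary indicator per visited URL (toy featurization)."""
    return {url: 1.0 for url in user["urls"]}


# Toy population; every field name and value is invented for illustration.
users: List[User] = [
    {"urls": ["news.example"], "saw_ad": True, "purchased": True},
    {"urls": ["shop.example"], "saw_ad": False, "purchased": True},
    {"urls": ["blog.example"], "saw_ad": True, "purchased": False},
]

# Source task: all users, labeled by any purchase.
source = Task(
    "any purchase",
    Domain(sample=lambda n: users[:n], featurize=featurize),
    label=lambda u: int(u["purchased"]),
)

# Target task: only users who were shown an ad, labeled by purchase.
target = Task(
    "purchase after ad",
    Domain(sample=lambda n: [u for u in users if u["saw_ad"]][:n],
           featurize=featurize),
    label=lambda u: int(u["purchased"]),
)
```

Calling `build_dataset(source, 3)` and `build_dataset(target, 3)` on this toy population yields differently sized datasets with different positive rates, which is the kind of mismatch the transfer stages have to correct.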
For the M6D system, the target task is to identify internet users likely to make their first purchase shortly after seeing an advertisement. The target sampling distribution, featurization, and outcome are precisely defined.
Drawing data from the target task is expensive and impractical due to the need to purchase random impressions, the large feature space, the scarcity of positive examples, and the inefficiency of random ad targeting. Advertisers require campaigns to meet their goals rapidly, so the M6D system uses existing data collected over time, involving different sampling distributions and actions related to the target outcome. Transfer learning is essential for leveraging this alternative data effectively.

2.1 Possible Mappings/Labels for Targeted Advertising
To increase the number of positive examples for estimation and make transfer learning more effective, various liberal definitions of labels (Y) can be considered. The primary target label, "purchase after being exposed to an ad," is a rare event that requires costly impressions. Alternative labels (YS) can include:
1. Clicking on an ad (still requires showing impressions).
2. Any purchase, not necessarily the first time, after exposure to an ad.
3. Any purchase, with or without exposure to an ad.
4. Any other brand action, with or without exposure to an ad.
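The label options listed above can be illustrated with a short sketch. It assumes a hypothetical per-user record of boolean flags (`saw_ad`, `clicked`, `purchased`, `first_purchase`, `other_brand_action`), collapses the "after exposure" timing into those flags, and treats a purchase as one kind of brand action. None of these field names come from the paper.

```python
from typing import Dict, List


def candidate_labels(user: Dict[str, bool]) -> Dict[str, int]:
    """Derive the target label and the four alternative labels for one user."""
    saw_ad = user["saw_ad"]
    purchased = user["purchased"]
    return {
        # Target: first purchase after being exposed to an ad.
        "target": int(saw_ad and purchased and user["first_purchase"]),
        # 1. Click on an ad (only possible if an impression was shown).
        "click": int(saw_ad and user["clicked"]),
        # 2. Any purchase after exposure to an ad.
        "purchase_after_ad": int(saw_ad and purchased),
        # 3. Any purchase, with or without exposure.
        "any_purchase": int(purchased),
        # 4. Any brand action (assumed here to include purchases).
        "any_brand_action": int(purchased or user["other_brand_action"]),
    }


def positive_counts(users: List[Dict[str, bool]]) -> Dict[str, int]:
    """Count positives per label definition across a list of users."""
    totals: Dict[str, int] = {}
    for user in users:
        for name, value in candidate_labels(user).items():
            totals[name] = totals.get(name, 0) + value
    return totals


# Toy users, invented purely for illustration.
toy_users = [
    {"saw_ad": True, "clicked": False, "purchased": True,
     "first_purchase": True, "other_brand_action": False},
    {"saw_ad": False, "clicked": False, "purchased": True,
     "first_purchase": False, "other_brand_action": False},
    {"saw_ad": True, "clicked": True, "purchased": False,
     "first_purchase": False, "other_brand_action": True},
    {"saw_ad": False, "clicked": False, "purchased": False,
     "first_purchase": False, "other_brand_action": False},
]
counts = positive_counts(toy_users)
```

On this toy data the positive counts grow from option 2 to option 3 to option 4, reflecting the nesting of the more liberal label definitions.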
The number of positively labeled internet users is larger for the alternative actions, with option 4 being a superset of 3, and 3 being a superset of 2. For effective knowledge transfer, the estimated function fS(•) should be closely related to the function of interest, fT(•). Consequently, the outcomes YS and YT should be strongly related. In essence, this implies that the fundamental behavioral drivers for YT should also reasonably influence YS.

Domains and Features of a User's Online Activity
As defined earlier, a domain (D) comprises three key components: the example space (E), the sampling distribution (P(E)), and the featurization (X(E)). The example space generally represents internet users or online consumers, but these users are sampled in various ways, resulting in substantial heterogeneity across different source and target tasks. The sampling events during which M6D interacts with users include:
1. General internet activity: Users visiting sites/URLs with which M6D has data partnerships.
2. Bid requests from exchanges/bidding systems.
4. Clicking on ads.
5. Making purchases at a campaign's brand's site.
6. Engaging in other online brand-related actions that can be tracked, like visiting the brand's homepage or store locator page.
The main distinctions between populations collected through these sampling events lie in the differences in their sampling distributions (P(E)). In this paper, the source domain for stage-1 experiments is based on the union of all these events, although in practice, M6D builds separate source-domain models for different events.
Furthermore, sampling events can be used to label examples, and this can lead to the creation of modeling datasets by sampling one population and assigning labels from a different event. For example, users who were shown an ad might represent the population, while those who subsequently purchase from the brand's website are the positively labeled consumers.
The target featurization (XT(E)) includes a consumer's browsing history and other user information. In any domain or event sample, a user is characterized by a set of features {x1i, x2i, ..., xKi}, which capture various aspects of the event, the user, and the user's browsing history. Features can include binary indicators of visiting specific URLs or real-numbered values reflecting browsing frequency and recency. The system anonymizes URL data by hashing it to maintain user privacy.
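A minimal sketch of a hashed binary URL featurization follows. The paper states only that URLs are hashed; the choice of SHA-256, the bucket count, and the function names `hash_url` and `featurize_history` are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable

NUM_BUCKETS = 2 ** 20  # hypothetical feature-space size, not taken from the paper


def hash_url(url: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a URL to a stable bucket index without storing the URL itself."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def featurize_history(urls: Iterable[str],
                      num_buckets: int = NUM_BUCKETS) -> Dict[int, float]:
    """Sparse binary indicators: bucket index -> 1.0 if any visited URL maps there."""
    return {hash_url(url, num_buckets): 1.0 for url in urls}
```

Because the mapping is many-to-one, distinct URLs can collide in the same bucket. A larger bucket count reduces collisions at the cost of a wider feature space.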
Appendix B provides specific definitions of the target and source tasks used in the experiments in section 4. Figure 1 illustrates the relationships between user events, the target task, and the two-stage transfer learning tasks.

Two-Stage Transfer Learning
To achieve the ultimate goal of predicting which users are most likely to purchase a product after being exposed to an ad, the system employs a two-stage transfer learning approach. Instead of selecting a single source learning task, the system leverages multiple source learning tasks, each with its own domain and mapping. The first stage aims to significantly reduce the target feature set (XT) so that in the second stage, learning can occur based on the target sampling distribution (PT).
In the first stage, multiple parallel source learning tasks are considered, and each task estimates a function (fS(X)) to approximate the label (YS). In the second stage, the system learns how to transfer the set of predictions from the first stage by weighting individual inputs using a learned linear classifier. The distinctions between source and target tasks are rooted in different events, leading to varying sampling distributions and labels, as illustrated in Figure 1.
An interesting aspect of the system is that the "correct" target learning task, which is whether a consumer purchases after an ad impression, is not always used in the production system for certain campaigns. Budget constraints or issues with tracking pixels on the brand's website may make it unrealistic to serve enough impressions to observe sufficient conversions. In such cases, the system uses the next best outcome, often a visit to the brand's website following an ad impression, as the target learning task. In practice, using a site visit as the training outcome can outperform using a purchase as the training outcome when predicting purchases. Therefore, the paper combines purchases and site visits as the target label, and the focus lies primarily on sampling distributions (P(E)) and how site visits/purchases are used as labels.
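The two stages described above can be sketched end to end on toy data. This is a simplified illustration under stated assumptions, not the production system: both stages use plain unregularized logistic regression trained by stochastic gradient ascent, and every function name, feature name, and data point is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Features = Dict[str, float]
Example = Tuple[Features, int]


def sigmoid(z: float) -> float:
    # Clamp the logit to avoid overflow in math.exp.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def predict(weights: Dict[str, float], x: Features) -> float:
    return sigmoid(sum(weights.get(k, 0.0) * v for k, v in x.items()))


def train_logreg(data: List[Example], epochs: int = 200,
                 lr: float = 0.5) -> Dict[str, float]:
    """Logistic regression on sparse features via stochastic gradient ascent."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(weights, x)
            for k, v in x.items():
                weights[k] = weights.get(k, 0.0) + lr * err * v
    return weights


def stage2_features(stage1: Dict[str, Dict[str, float]], x: Features) -> Features:
    """Replace the wide feature vector with one score per stage-1 model."""
    feats = {name: predict(w, x) for name, w in stage1.items()}
    feats["bias"] = 1.0
    return feats


def train_two_stage(source_data: Dict[str, List[Example]],
                    target_data: List[Example]):
    # Stage 1: one model per source task, trained on abundant source data.
    stage1 = {name: train_logreg(data) for name, data in source_data.items()}
    # Stage 2: a linear model over stage-1 scores, trained on target data.
    stage2 = train_logreg([(stage2_features(stage1, x), y) for x, y in target_data])
    return stage1, stage2


def score(stage1, stage2, x: Features) -> float:
    return predict(stage2, stage2_features(stage1, x))


# Toy data: one source task ("site_visit") and a small target sample.
source_data = {
    "site_visit": [({"cars.example": 1.0}, 1)] * 5 + [({"cats.example": 1.0}, 0)] * 5,
}
target_data = [
    ({"cars.example": 1.0}, 1), ({"cats.example": 1.0}, 0),
    ({"cars.example": 1.0}, 1), ({"cats.example": 1.0}, 0),
]
stage1, stage2 = train_two_stage(source_data, target_data)
```

The point of the design is that stage 2 never sees the high-dimensional URL features directly. It only learns a handful of weights, which is feasible even when target-distribution data is scarce.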

METHODOLOGY
In our study, we have extensively addressed the intricate challenges in targeted display advertising through a carefully defined problem formulation. The fundamental obstacle we tackled is the cost and scarcity of training data from the target sampling distribution. To mitigate this, we introduced a two-stage transfer learning approach that harnesses models trained on surrogate domains and learning tasks and subsequently transfers this knowledge to the target task. Our empirical findings have underscored the value of the different transfer stages in enhancing system performance. From these findings, several critical insights have emerged for the broader machine learning community. These include the significance of deliberate data definition, the ability of transfer learning to combat cold-start problems, the importance of pragmatic constraints and data cost in decision-making, the efficacy of progressive dimensionality reduction, and the prevalence of transfer learning in diverse real-world applications. Overall, our study underscores the potential of explicit transfer learning considerations in solving complex real-world challenges and guiding the development of automated systems.

TRANSFER LEARNING RESULTS
In the subsequent sections, we present the results obtained from the different stages of our transfer learning system. These experiments aim to address the questions posed earlier and assess the impact of training on various source tasks in stage 1, as well as the combination and weighting of models in stage 2. For our evaluation, we employ tasks characterized by the appropriate sampling distribution PT(ET), representing the target task, which consists of random and untargeted users who can be exposed to an ad and have not previously engaged in any brand actions. These tasks utilize the same featurization as the training data. It is important to note that our stage 1 and stage 2 models, in sequence, provide a mapping for the complete feature set of the target featurization XT, which includes browsing history (Xbinary) and user characteristics (Xinfo). Furthermore, positive instances in these tasks are users who perform a brand action within seven days of encountering the ad.

The benefits of stage-1 transfer
This section examines the first transfer stage. The experiments use a convenient sampling distribution (PS(E)) and labeling scheme chosen to maximize positive examples, even though they do not perfectly reflect the actual target task. This often yields better results than training only on the target distribution (PT(E)). From a transfer learning perspective, we demonstrate that the estimate of fS(•) often serves as a better predictor of YT (the target label) than the estimate of fT(•).
To empirically confirm the significant differences between source and target tasks, we conducted tests comparing the sampling distributions PT(E) and PS(E). A classifier was built using binary URL indicators as features to distinguish users sampled from these distributions, demonstrating measurable differences between the two. The out-of-sample AUC achieved by this model further supports the disparities between the populations.
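As an illustration of how such a distribution test is scored, the sketch below computes AUC as the probability that a randomly chosen positive outranks a randomly chosen negative, counting ties as one half. Here a "positive" would be a user drawn from one distribution and a "negative" a user drawn from the other. The function name and the example scores are hypothetical, and the quadratic pairwise loop is chosen for clarity, not scale.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out scores from a classifier that tries to tell apart
# users sampled from one distribution (label 1) and the other (label 0).
separable = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
mixed = auc([0.9, 0.2, 0.8, 0.3], [1, 1, 0, 0])
```

An out-of-sample AUC near 0.5 would mean the two samples are indistinguishable on these features, while a value well above 0.5 indicates that the sampling distributions differ.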
In our analysis, we define the source population (ES) as all active internet users within our system, with the sampling distribution (PS(ES)) representing a composite of various sampling events. The source label (YS) indicates whether a user has visited the marketer's website in the past. These models are compared against models trained directly on the target task, where the target population (ET) comprises users who could potentially win ad auctions, and the target label (YT) represents a brand action following an ad.
The results indicate that the models trained on the stage-1 source task consistently outperform those trained on the target task, with a notable advantage in learning from the extensive, high-dimensional URL featurization.
In cases where we conducted an extensive parameter search for target training, models trained on the source task still proved to be more effective. This counterintuitive result suggests that, in scenarios with scarce positive examples and different training distributions, the bias introduced by the source task can be outweighed by the increased positive-class signal it provides.
These findings highlight the practicality of using biased initial sampling schemes in real-world applications, where positive-class data are limited or expensive.

Stage-2 Ensemble Model
In this section, the performance of the second stage (stage-2) in our transfer learning process is evaluated by comparing it to the constituent stage-1 models. The primary aim is to assess whether the adjustment to the target task, achieved through the stage-2 ensemble, offers improvements over solely using one of the source models without any target task adjustment.
For these experiments, we collected 30 days of randomly targeted users from PT(ET) as the basis for the target distribution. The data sets had varying numbers of positive examples ranging from 50 to 10,000, along with a large number of negative examples. The stage-2 featurization involved approximately 50 features, including stage-1 model scores specific to the campaign and user, along with various user characteristic features (Xinfo) such as browser type, cookie age, and geo-location information.
The stage-2 model is a logistic regression classifier trained using elastic net regularization, combining L1 and L2 regularization. The experimental results are presented across 29 different campaigns, representing recurring advertising tasks. The performance comparison is based on the area under the ROC curve (AUC) of the stage-2 model against the AUC of the best-performing stage-1 model. All performance evaluations were conducted on an out-of-time hold-out set, ensuring a proper assessment of both stages.
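For reference, one common way to write the objective of an elastic-net-regularized logistic regression is given below. The paper does not state which parametrization M6D uses, so the overall penalty strength \lambda, the L1/L2 mixing weight \alpha, and the symbol z_i for the stage-2 feature vector of user i are assumptions made for illustration.

```latex
\hat{w} = \arg\min_{w}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr]
  + \lambda\Bigl( \alpha \lVert w \rVert_1 + \frac{1-\alpha}{2}\lVert w \rVert_2^2 \Bigr),
\qquad
p_i = \frac{1}{1 + e^{-w^{\top} z_i}}
```

With \alpha = 1 the penalty reduces to pure L1 (lasso), which can drive the weights of uninformative stage-1 scores to exactly zero, and with \alpha = 0 it reduces to pure L2 (ridge).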
The results demonstrate the significant improvements achieved by combining source models and integrating information about the target task in the stage-2 ensemble. The median and average AUC improvements across different campaigns were 0.0375 and 0.0411, respectively. Notably, the enhancement is even more pronounced when the best stage-1 model exhibits relatively poor performance. Cases where the best stage-1 model falls in the lower 50% of campaigns showed median and average improvements of 0.056 and 0.061, respectively. Any potential "negative transfer" is effectively managed by the learning procedure, where poorly performing stage-1 models receive low or negative weights in the ensemble.
It is important to note that the variance in AUC across campaigns in both stages is due to the diverse nature of clients and brands involved. While some brands yield highly discriminative models, others, particularly mass-market brands, face more challenges in building discriminative models. Therefore, the absolute AUC values are less significant compared to the relative improvements demonstrated across methods. These results underscore the effectiveness of the stage-2 ensemble in the transfer learning process.

CONCLUSIONS AND RECOMMENDATIONS
In conclusion, this paper offers practical lessons derived from a real-world, large-scale machine learning system for targeted display advertising. The system addresses the challenges of limited data availability by employing a two-stage transfer learning approach, leveraging different source sampling distributions and training labels before transferring the knowledge to the target task.
Explicit consideration of the nuances in defining events (E), sampling distributions (P(E)), and labels (Y) can significantly enhance machine learning outcomes. Employing data from distributions and labels that differ from the target task can lead to performance improvements, although the results then need to be adjusted to the target distribution.
Transfer learning serves as a practical solution to the "cold-start" problem, especially when insufficient training data is available for the target task. The approach also offers the flexibility to add new modeling methods easily and adapt to evolving ...