Caching is a ubiquitous idea in computer science that significantly improves the performance of storage and retrieval systems by keeping a subset of popular items closer to the client, based on request patterns. An important algorithmic piece of cache management is the eviction policy used to dynamically update the set of cached items, which has been extensively optimized over several decades, resulting in several efficient and robust heuristics. Although the use of machine learning in cache policies has shown promising results in recent years (e.g., LRB, LHD, storage applications), it remains a challenge to outperform robust heuristics in a way that reliably generalizes to production settings while maintaining competitive compute and memory overheads.
In “HALP: Heuristic Aided Learned Preferences Eviction Policy for YouTube Content Delivery Network,” presented at NSDI 2023, we introduce a scalable, state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. The Heuristic Aided Learned Preference (HALP) framework is a meta-algorithm that uses randomization to combine a lightweight heuristic baseline eviction rule with a learned reward model. The reward model is a lightweight neural network that is continuously trained with automated feedback on preference comparisons designed to mimic an offline oracle. We discuss how HALP has improved infrastructure efficiency and user video playback latency for YouTube’s content delivery network.
Learning preferences for cache eviction decisions
The HALP framework computes cache eviction decisions based on two components: (1) a neural reward model trained with automatic feedback via preference learning and (2) a meta-algorithm that combines the learned reward model with fast heuristics. As the cache observes incoming requests, HALP continuously trains a small neural network that predicts a scalar reward for each item, formulating this as a preference learning method via pairwise preference feedback. This aspect of HALP is similar to Reinforcement Learning from Human Feedback (RLHF) systems, but with two important differences:
- The feedback is automated and leverages known results about the structure of optimal policies for offline cache eviction.
- The model is learned continuously, using a transient buffer of training examples constructed from the automated feedback process.
The eviction decision relies on a two-step filtering mechanism. First, a small subset of candidates is selected using a heuristic that is efficient but suboptimal in terms of performance. Then, a re-ranking step improves on the baseline candidates through sparing use of the neural network’s scoring function to “boost” the quality of the final decision.
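As a rough illustration of this two-step decision, the minimal Python sketch below samples candidates with a cheap heuristic and then re-ranks them with the learned scorer. The function and parameter names, the sample sizes, and the `reward_model.predict` interface are all assumptions made for exposition, not details from the paper.

```python
import random

def evict_one(cache_items, heuristic_cost, reward_model,
              sample_size=64, num_candidates=16):
    """Illustrative sketch of a HALP-style two-step eviction decision."""
    # Step 1: cheap heuristic filtering. Uniformly sample a subset of cached
    # items and keep the few with the lowest heuristic cost (e.g., the oldest
    # last-access time for an LRU-like baseline).
    sample = random.sample(cache_items, min(len(cache_items), sample_size))
    candidates = sorted(sample, key=heuristic_cost)[:num_candidates]

    # Step 2: re-rank only the candidates with the learned scorer. One small
    # batched inference call keeps the per-eviction overhead bounded; the
    # candidate with the highest predicted preference score is evicted.
    scores = reward_model.predict(candidates)
    victim = max(zip(candidates, scores), key=lambda pair: pair[1])[0]

    # The same candidate pool is reused to form pairwise preference queries
    # for automated feedback, as described later in this post.
    return victim, candidates
```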
As an implementation of a production-ready cache policy, HALP not only makes eviction decisions, but also incorporates a pairwise preference query selection process that is used to efficiently construct the relevant feedback and continuously update the model that drives eviction decisions.
Neural reward model
HALP uses a lightweight two-layer multilayer perceptron (MLP) as its reward model to selectively score individual items in the cache. The features are constructed and managed as a metadata-only “ghost cache” (similar to classic policies like ARC). After any given lookup request, in addition to regular cache operations, HALP performs the bookkeeping (e.g., tracking feature metadata and updating a key-value store) required to keep its dynamic internal representation up to date. This includes: (1) externally tagged features provided as input by the user along with the cache lookup request, and (2) internally constructed dynamic features (e.g., time since last access, average time between accesses) built from the lookup times observed for each item.
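The sketch below shows what such metadata-only bookkeeping and a two-layer MLP could look like in PyTorch. The specific fields, the hidden-layer width, and the class and function names are assumptions for illustration; the earlier eviction sketch’s `reward_model.predict` would simply batch `GhostEntry.features()` vectors through a model like this.

```python
import time
import torch
import torch.nn as nn

class GhostEntry:
    """Metadata-only record kept per item, in the spirit of a "ghost cache".
    The exact fields here are illustrative, not the production schema."""

    def __init__(self, tagged_features):
        self.tagged = list(tagged_features)   # (1) externally tagged features
        self.last_access = time.time()
        self.access_count = 1
        self.avg_interarrival = 0.0

    def on_access(self):
        """Bookkeeping performed after each lookup of this item."""
        now = time.time()
        gap = now - self.last_access
        self.access_count += 1
        # Running average of the time between accesses.
        num_gaps = self.access_count - 1
        self.avg_interarrival += (gap - self.avg_interarrival) / num_gaps
        self.last_access = now

    def features(self):
        # (2) internally constructed dynamic features.
        dynamic = [time.time() - self.last_access, self.avg_interarrival]
        return torch.tensor(self.tagged + dynamic, dtype=torch.float32)

def make_reward_model(num_features, hidden_units=32):
    """A lightweight two-layer MLP mapping item features to a scalar reward."""
    return nn.Sequential(
        nn.Linear(num_features, hidden_units),
        nn.ReLU(),
        nn.Linear(hidden_units, 1),
    )
```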
HALP trains its reward model fully online, starting from a random initialization of the weights. This may seem like a bad idea, especially if decisions are made exclusively to optimize the reward model. However, eviction decisions rely on both the learned reward model and a suboptimal but simple and robust heuristic such as LRU. This allows for optimal performance when the reward model has fully generalized, while remaining robust to a temporarily uninformative reward model that has yet to generalize or that is still catching up to a changing environment.
Another advantage of online training is specialization. Each cache server operates in a potentially different environment (e.g., geographic location), which affects local network conditions and what content is popular locally, among other things. Online training automatically captures this information while reducing the burden of generalization, as compared to a single offline training solution.
Scoring samples from a randomized priority queue
It can be impractical to optimize the quality of eviction decisions with an exclusively learned objective for two reasons.
- Compute performance constraints: Inference with a learned network can be significantly more expensive than the computations performed in practical cache policies operating at scale. This limits not only the expressiveness of the network and its features, but also how often they are invoked during each eviction decision.
- Robustness to out-of-distribution generalization: HALP is deployed in a setting that involves continual learning, where a rapidly changing workload can produce request patterns that are temporarily out-of-distribution with respect to previously seen data.
To address these issues, HALP first applies an inexpensive heuristic scoring rule that corresponds to an eviction priority to identify a small sample of candidates. This process is based on efficient random sampling that approximates exact priority queues. The priority function used to generate candidate samples is intended to be quick to compute using existing hand-tuned algorithms, e.g., LRU. However, it is configurable to approximate other cache replacement heuristics by editing a simple cost function. In contrast to prior work, where randomization was used to trade off approximation quality for efficiency, HALP also relies on the inherent randomization in the sampled candidates across time steps to provide the necessary exploratory diversity in the sampled candidates for both training and inference.
The final evicted item is chosen from among the supplied candidates and is equivalent to the best re-ranked sample, corresponding to maximizing the predicted preference score according to the neural reward model. The same pool of candidates used for eviction decisions is also used to construct the pairwise preference queries for automated feedback, which helps minimize the training and inference skew between samples.
An overview of the two-step process used to make each eviction decision.
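Because candidate generation depends only on a simple cost function, other classical heuristics can be approximated by swapping that function. The snippet below is a hypothetical illustration; the function names and the `GhostEntry` fields it relies on come from the earlier sketches, not from the paper.

```python
# Illustrative cost functions for the candidate-sampling heuristic. A lower
# cost means "evict sooner"; the fields match the GhostEntry sketch above.

def lru_cost(entry):
    # The oldest last access is evicted first, approximating LRU.
    return entry.last_access

def lfu_cost(entry):
    # The least-frequently-accessed item is evicted first, approximating LFU.
    return entry.access_count

# Passing either function as `heuristic_cost` in the earlier eviction sketch
# approximately recovers the corresponding classical policy in step 1, while
# step 2 still re-ranks the sampled candidates with the learned reward model.
```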
Online preference learning with automated feedback
The reward model is learned using online feedback, which is based on automatically assigned preference labels that indicate, wherever feasible, the ranked preference ordering of the time taken to receive future re-accesses, starting from a given snapshot in time, among each queried sample of items. This is similar to the oracle optimal policy, which at any given time evicts the item with the farthest future access among all items in the cache.
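To make this labeling rule concrete, here is a small hypothetical helper (not from the paper) that ranks a queried sample the way the oracle would, assuming the future re-access times were known.

```python
def oracle_priority_order(snapshot_time, next_access_times):
    """Rank queried items the way the offline oracle would (illustrative).

    `next_access_times` maps each queried item to the time of its next
    re-access after `snapshot_time` (float("inf") if never re-accessed).
    The first item in the returned order is the most preferred eviction:
    the one whose next access lies furthest in the future.
    """
    return sorted(
        next_access_times,
        key=lambda item: next_access_times[item] - snapshot_time,
        reverse=True,
    )
```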
Automatic feedback generation for reward model learning.
To ground this feedback process, HALP constructs pairwise preference queries that are most likely to be relevant to eviction decisions. In sync with regular cache operations, HALP issues a small number of pairwise preference queries at each eviction decision and appends them to a set of pending comparisons. The labels of these pending comparisons can only be resolved at a random future time. To operate online, HALP also performs some additional bookkeeping after each lookup request to process any pending comparisons that can be labeled after the current request. HALP indexes the pending comparison buffer by each item involved in a comparison, and recycles the memory consumed by stale comparisons (those whose items may never be re-accessed) to ensure that the memory overhead associated with feedback generation remains bounded over time.
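The sketch below shows one simplified way such bookkeeping could be organized; the class and method names are assumptions, and the recycling of stale comparisons is omitted for brevity.

```python
from collections import defaultdict

class PendingComparisons:
    """Illustrative bookkeeping for automated pairwise preference feedback.

    Pairs of candidates are staged at eviction time. A pair is resolved when
    one of its items is looked up again: the other item of the pair is then
    labeled as the better eviction, since its re-access (if any) lies further
    in the future."""

    def __init__(self):
        self.by_item = defaultdict(list)  # item key -> pending pair ids
        self.pairs = {}                   # pair id -> (key_a, key_b)
        self.next_id = 0

    def stage(self, key_a, key_b):
        """Called at eviction time with two items from the candidate pool."""
        pair_id = self.next_id
        self.next_id += 1
        self.pairs[pair_id] = (key_a, key_b)
        self.by_item[key_a].append(pair_id)
        self.by_item[key_b].append(pair_id)

    def on_lookup(self, key):
        """Called after each lookup; returns newly labeled training examples
        as (preferred_eviction, preferred_retention) key pairs."""
        labeled = []
        for pair_id in self.by_item.pop(key, []):
            pair = self.pairs.pop(pair_id, None)
            if pair is None:
                continue  # the pair was already resolved via the other item
            key_a, key_b = pair
            other = key_b if key == key_a else key_a
            labeled.append((other, key))
        return labeled
```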
Overview of all major components of HALP.
Results: Impact on the YouTube CDN
Through empirical analysis, we show that HALP compares favorably with state-of-the-art cache policies in terms of cache miss rates on public benchmark traces. However, while public benchmarks are a useful tool, they are rarely sufficient to capture all usage patterns around the world over time, let alone the diverse hardware configurations we’ve already deployed.
Until recently, an optimized LRU variant was used for memory cache eviction on YouTube’s CDN servers. HALP increases YouTube’s memory egress/ingress (the ratio of the total bandwidth served by the CDN to the bandwidth consumed for retrievals due to cache misses) by roughly 12% and the memory hit rate by 6%. This reduces latency for users, since memory reads are faster than disk reads, and also improves the throughput of disk-bound machines by shielding them from disk traffic.
The figure below shows a visually compelling reduction in the byte miss ratio after the final HALP rollout on the YouTube CDN, which now serves more content from within the cache, with lower latency to the end user and without having to resort to more expensive retrievals that increase operating costs.
Worldwide YouTube byte miss ratio before and after the launch (vertical dashed line).
Aggregate performance improvements can still hide important regressions. In addition to measuring the overall impact, we also conduct a machine-level analysis in the paper to understand HALP’s impact on different racks, and find it to be overwhelmingly positive.
Conclusion
We present a scalable, state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. Because of its design choices, HALP can be deployed like any other cache policy, without the operational overhead of separately managing labeled examples, a training procedure, and model versions as additional offline pipelines common to most machine learning systems. It therefore incurs only a small extra overhead compared to classical algorithms, but has the added benefit of being able to take advantage of additional features to make its eviction decisions and to continuously adapt to changing access patterns.
This is the first large-scale implementation of a learned cache policy on a widely used and highly trafficked CDN, and has significantly improved the efficiency of the CDN infrastructure while providing a better user experience.
Acknowledgments
Ramki Gummadi is now part of Google DeepMind. We would like to thank John Gillard for his help with the illustrations and Richard Schooler for his feedback on this post.