Why Offline Evaluations Are Necessary but Not Sufficient for Real-World Assessments
Evaluation remains one of the biggest challenges in deploying AI today, especially with the rise of generative models that produce open-ended text, images, and video. These outputs don't fit neatly into correctness assessments, so teams must measure success using proxies for what they want to optimize.
We can borrow from what works in disciplines like personalization and recommendation, where engineers have long operated without a loss function tied directly to user outcomes. You can't optimize engagement or satisfaction with a closed-form equation. Instead, we rely on offline metrics to test algorithms before putting them in front of real users.
In recommendation, most systems follow a two-stage approach (sketched in code after this list):
Candidate retrieval pulls a manageable set of potentially relevant items.
Ranking scores those items to generate a final list.
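To make the two stages concrete, here is a minimal sketch of a retrieve-then-rank pipeline. The function names and the random scoring are illustrative placeholders for an ANN index and a learned ranking model, not any particular library's API.

```python
import random

def retrieve_candidates(user_id: int, catalog: list[int], k: int = 500) -> list[int]:
    """Stage 1: cheaply pull a manageable set of potentially relevant items."""
    rng = random.Random(user_id)  # placeholder for an ANN index or heuristic filters
    return rng.sample(catalog, k=min(k, len(catalog)))

def rank(user_id: int, candidates: list[int], n: int = 10) -> list[int]:
    """Stage 2: score each candidate and return the top n as the final list."""
    rng = random.Random(user_id + 1)
    scores = {item: rng.random() for item in candidates}  # placeholder for a learned scorer
    return sorted(candidates, key=scores.get, reverse=True)[:n]

catalog = list(range(100_000))
final_list = rank(42, retrieve_candidates(42, catalog))
```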
Offline evaluation often uses MRR (Mean Reciprocal Rank) for retrieval and NDCG (Normalized Discounted Cumulative Gain) for ranking to filter out poor models cheaply. However, these metrics are not the objectives you ultimately care about, and they are typically weak predictors of user preference.
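As a concrete reference, here is a minimal sketch of how MRR and NDCG are computed from logged relevance judgments; the labels are made up, and the DCG uses the linear-gain variant.

```python
import math

def mean_reciprocal_rank(ranked_relevance: list[list[int]]) -> float:
    """MRR: average over queries of 1/rank of the first relevant item (0 if none)."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_k(rels: list[int], k: int) -> float:
    """NDCG@k: DCG of the model's ordering divided by the DCG of the ideal ordering."""
    dcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(rels[:k], start=1))
    ideal = sum(rel / math.log2(pos + 1)
                for pos, rel in enumerate(sorted(rels, reverse=True)[:k], start=1))
    return dcg / ideal if ideal > 0 else 0.0

# Toy relevance labels for three queries, listed in the order the model ranked the items.
judgments = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(mean_reciprocal_rank(judgments))  # ~0.61
print(ndcg_at_k([3, 2, 0, 1], k=4))     # ~0.99
```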
Suppose you're recommending streaming video, and for your next experiment you propose a change to the ranking logic that boosts NDCG by 5%. The assessment looks promising, but once deployed, users watch less. Despite the offline signal that the ranking is "more relevant," in practice it reduced total view time, a metric tied to both ad revenue and user satisfaction. This gap between offline and online evaluation is not uncommon and is part of why industry experimentation success rates typically fall between 10% and 30%.
Several sources of bias undermine the ability of offline metrics to predict outcomes for your proposed feature:
Exposure Bias: Users can only interact with the items you've already shown them
Position Bias: Prominently placed items get more clicks regardless of quality
Feedback Loop Bias: Models trained on logged interactions reinforce prior behavior
As a result, even after observing a lift in your offline metrics, you might see engagement drop once you deploy to production. That's why mature software teams measure impact on business outcomes, using metrics such as session length, conversion rate, or revenue per user.
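As a sketch of what that measurement can look like, the snippet below compares revenue per user between the control and treatment arms of an online experiment; the data is synthetic, and Welch's t-test is just one common way to assess the difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-user revenue for an A/B test (zero-inflated, as revenue data usually is).
control = rng.exponential(scale=1.0, size=50_000) * rng.binomial(1, 0.05, size=50_000)
treatment = rng.exponential(scale=1.1, size=50_000) * rng.binomial(1, 0.05, size=50_000)

lift = treatment.mean() / control.mean() - 1
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
print(f"Revenue-per-user lift: {lift:+.1%} (p = {p_value:.3f})")
```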
Consider evaluations in a more general software engineering setting. When assessing your team's impact, you'll often turn to cheap metrics like lines of code committed or features shipped. But learning which features are worth their maintenance cost to your business requires observing real usage and measuring business value more directly.
In Kohavi's "Trustworthy Online Controlled Experiments," you'll learn about a small code change implemented by an intern at Bing.com, a proposal that had sat in the experiment backlog for six months. This little tweak was responsible for a 12% lift in revenue, an additional $100 million per year. The only way to know whether a feature proposal is as valuable as it looks is to evaluate it in the real world by running an online experiment.
So why bother with offline metrics and proxies?
Online evaluation is costly. You need infrastructure to run experiments reliably; bad experiments can degrade user experience.
We use offline metrics to screen ideas cheaply and build confidence before shipping to production. They're necessary but not sufficient to make launch/no-launch decisions.
A model's performance on LM Arena may help you assess potential enhancements to your chatbot's conversational capabilities. Likewise, domain-specific and custom benchmarks can help you score the capabilities your application needs. But even the best model will make users bounce if it's too slow, so you'll also evaluate your AI on operational factors like latency.
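Here is a minimal sketch of folding latency into your evaluation alongside quality scores; `call_model` is a hypothetical stand-in for your chatbot endpoint, and the p50/p95 summary is one simple way to report it.

```python
import statistics
import time

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your chatbot or model API."""
    time.sleep(0.05)  # simulate network and inference time
    return "response"

def latency_profile(prompts: list[str]) -> dict[str, float]:
    """Time each call and summarize with median and 95th-percentile latency."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)
        latencies.append(time.perf_counter() - start)
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # 95th percentile cut point
    }

print(latency_profile(["hello"] * 40))
```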
The real advantage comes from learning which offline metrics correlate with real-world success and building your process around that knowledge. Mature ML teams treat offline metrics as guardrails, not goals. They use benchmarks to filter, not to select.
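One way to build that knowledge is to look back through past experiments and check how well offline lifts predicted online lifts, for example with a rank correlation; the numbers below are hypothetical.

```python
from scipy import stats

# Paired results from past experiments (hypothetical values, in percent):
# offline NDCG lift vs. the online lift in total view time observed after launch.
offline_ndcg_lift = [5.0, 2.1, -1.3, 3.4, 0.8, 4.2]
online_view_time_lift = [-0.8, 1.5, -2.0, 0.3, 0.9, 1.1]

rho, p_value = stats.spearmanr(offline_ndcg_lift, online_view_time_lift)
print(f"Spearman rho between offline and online lifts: {rho:.2f} (p = {p_value:.2f})")
```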
In summary,
Offline metrics are proxies. Use them to support a testable hypothesis, not to certify your picks.
Online metrics measure truth. Use them to build knowledge and continuously improve.
AI evals aren't just about model quality or ranking your options. They're about using data-driven decision-making to learn what works for your AI product. And like everything else in this empirical discipline, it takes experimentation to get right.