Recommendations that lift. Personalization that pays.

Most enterprise recommendation systems were architected before LLMs and never updated. We rebuild the stack — sequential transformers, LLM rerankers, real-time signals, and the evaluation discipline to know whether your changes are actually moving conversion, retention, and revenue.

Book a Personalization Review → How we build relevance →

Built by senior teams from the University of Edinburgh, Amazon, Stanford & the Alan Turing Institute

The gap

Why most personalization underperforms.

Product recommendations drive up to 31% of e-commerce revenue when they’re built well. The teams capturing that value have rebuilt their stacks since 2024. The teams that haven’t are watching conversion sag, cold-start fail, and personalization devolve into popularity bias.

Stale architectures

Two-tower retrieval and gradient-boosted rankers were state of the art five years ago. In 2026 they’re the floor, not the ceiling — and the gap to a sequential-transformer ranker shows up directly in your conversion numbers.

Latency blow-ups

LLM rerankers improve relevance — until they push the funnel past the latency budget and conversion drops. Engineered correctly, the reranker lives inside a 200ms budget. Engineered badly, it kills the very lift it was meant to deliver.
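One way to keep a reranker inside its budget is to give it a hard timeout and fall back to the ranker's order when it overruns. The sketch below is illustrative only: the 120 ms slice, the function names, and the stand-in rerankers are assumptions, not a description of any particular production system.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical slice of a 200 ms end-to-end budget reserved for reranking.
RERANK_BUDGET_S = 0.120


async def rerank_within_budget(
    ranked: Sequence[str],
    reranker: Callable[[Sequence[str]], Awaitable[list]],
    budget_s: float = RERANK_BUDGET_S,
) -> tuple:
    """Return (items, used_reranker); keep the ranker order on timeout."""
    try:
        reranked = await asyncio.wait_for(reranker(ranked), timeout=budget_s)
        return reranked, True
    except asyncio.TimeoutError:
        # The reranker overran its slice: serve the ranker's order instead.
        return list(ranked), False


async def fast_reranker(items):
    """Stand-in for a reranker that answers in time (just reverses the list)."""
    await asyncio.sleep(0)
    return list(reversed(items))


async def slow_reranker(items):
    """Stand-in for a reranker that blows the budget."""
    await asyncio.sleep(0.5)
    return list(reversed(items))


def demo_latency():
    fast = asyncio.run(rerank_within_budget(["a", "b", "c"], fast_reranker))
    slow = asyncio.run(
        rerank_within_budget(["a", "b", "c"], slow_reranker, budget_s=0.05)
    )
    return fast, slow
```

The returned flag makes the fallback rate measurable, so a rising share of timed-out requests shows up in monitoring rather than in conversion.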

Cold-start collapse

New items, new users, new contexts: most systems fail here. LLM-assisted cold-start is now practical and has shipped at scale, yet few teams have implemented it. Most are still padding the gap with popularity defaults.

Feedback loops you can’t see

Popularity bias, exposure bias, model drift. Without an offline + online evaluation harness, you don’t know what’s broken until revenue falls — and by then the regression has been compounding for months.

How we build

Relevance, engineered.

A four-stage funnel that combines retrieval, ranking, reranking, and policy, plus a continuous evaluation layer, with the latency budget and feedback loops to keep it sharp.

01 Multi-source candidate generation. Two-tower retrieval, popularity, and GNN-derived candidates — combined, not chosen between. Diversity engineered in at the candidate stage, not bolted on as an afterthought.
02 Sequential ranker. Transformer or HSTU-style sequential model over user action history. The 2026 floor for serious ranking quality. Trained on your data, evaluated against your baseline.
03 LLM reranker. Top-K reordering with reasoning and, where it matters, an explanation in the response payload. Mainstream at Spotify, Salesforce, Duolingo — engineered for your latency budget, not a research demo.
04 Business rules and diversity. Policy filters, exploration, fairness constraints. Recommendations that comply with your governance, not just your model. Audit trail on every decision.
05 Continuous evaluation. Offline metrics, online A/B, exposure debiasing, drift detection. We instrument before we ship — and the eval harness stays in your hands.

Talk through your personalization use case →

Candidates → ranked by a sequential model → reranked with LLM reasoning → filtered for policy and diversity — under a real latency budget, with continuous A/B evaluation feeding signal back into the loop.
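The shape of that funnel can be sketched in a few lines. Everything here is a simplified assumption for illustration: the item fields (`id`, `category`, `in_stock`), the round-robin merge, the category cap, and the identity function standing in for an LLM reranker.

```python
from itertools import zip_longest


def merge_candidates(*sources):
    """Stage 1: interleave sources round-robin and de-duplicate by id."""
    seen, merged = set(), []
    for group in zip_longest(*sources):
        for item in group:
            if item is not None and item["id"] not in seen:
                seen.add(item["id"])
                merged.append(item)
    return merged


def apply_policy(items, max_per_category):
    """Stage 4: drop out-of-stock items and cap each category for diversity."""
    counts, kept = {}, []
    for item in items:
        if not item["in_stock"]:
            continue
        seen_in_category = counts.get(item["category"], 0)
        if seen_in_category < max_per_category:
            counts[item["category"]] = seen_in_category + 1
            kept.append(item)
    return kept


def recommend(sources, rank_score, rerank, k=3, rerank_top=5, max_per_category=2):
    candidates = merge_candidates(*sources)
    ranked = sorted(candidates, key=rank_score, reverse=True)  # stage 2
    head = rerank(ranked[:rerank_top])                         # stage 3
    return apply_policy(head + ranked[rerank_top:], max_per_category)[:k]


def demo_funnel():
    a = {"id": "A", "category": "shoes", "in_stock": True}
    b = {"id": "B", "category": "shoes", "in_stock": True}
    c = {"id": "C", "category": "shoes", "in_stock": True}
    d = {"id": "D", "category": "hats", "in_stock": False}
    e = {"id": "E", "category": "hats", "in_stock": True}
    scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.95, "E": 0.1}
    result = recommend(
        sources=[[a, b, c], [b, d, e]],       # e.g. two-tower and popularity
        rank_score=lambda item: scores[item["id"]],
        rerank=lambda head: head,             # identity stand-in for a reranker
    )
    return [item["id"] for item in result]
```

In this toy run the highest-scoring item is out of stock and a third pair of shoes exceeds the category cap, so policy changes the final list even though the ranker preferred both.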

What we ship

Personalization patterns.

Three production patterns — each engineered against a measurable lift target, not a vanity metric.

Product & content recommendations

Retail, media, marketplaces. Lift conversion and engagement against a measured baseline — not against last quarter’s hand-tuned rules.

Next-best-action & offer personalization

Financial services, telco, SaaS. Treatment decisions ranked against expected value, with policy filters and exploration built into the ranker.
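As a rough illustration of ranking treatments against expected value, the sketch below scores each eligible offer as acceptance probability times value minus cost, and explores with an epsilon-greedy rule. The `Treatment` fields, the sample offers, and the epsilon-greedy choice are assumptions for the example; a production system might use a different exploration strategy.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    name: str
    p_accept: float   # model-estimated acceptance probability
    value: float      # value to the business if accepted
    cost: float       # cost of making the offer
    eligible: bool    # outcome of policy checks (hypothetical flag)


def expected_value(t: Treatment) -> float:
    return t.p_accept * t.value - t.cost


def next_best_action(treatments, epsilon: float, rng: random.Random):
    """Pick the highest expected-value eligible treatment, exploring with
    probability epsilon. Returns None when nothing passes policy."""
    allowed = [t for t in treatments if t.eligible]
    if not allowed:
        return None
    if rng.random() < epsilon:
        return rng.choice(allowed)          # exploration
    return max(allowed, key=expected_value)  # exploitation


OFFERS = [
    Treatment("loan", p_accept=0.10, value=500.0, cost=5.0, eligible=True),
    Treatment("card", p_accept=0.30, value=100.0, cost=2.0, eligible=True),
    Treatment("premium", p_accept=0.50, value=400.0, cost=10.0, eligible=False),
]
```

Filtering on eligibility before scoring means an ineligible treatment can never win, however attractive its expected value looks.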

Conversational discovery agents

LLM-driven shopping, content discovery, and recommendation explanation as a first-class UX. Grounded in your catalog, governed by your rules.

The production layer

What we engineer around the recommender.

The model is one piece. These are the layers that make personalization earn its keep in production.

Real-time feature pipeline

Kafka, Spark/Flink, sub-200ms serving. Personalization at the speed of click — because session signals beat stale features every time.

Evaluation harness

Offline metrics, online A/B, exposure debiasing, fairness checks. The harness you’ll use long after we’ve gone — to know what’s working and to catch regressions before they hit revenue.
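For a sense of what the offline side of a harness computes, here is a minimal NDCG@k in plain Python. It assumes the relevance list covers every judged item for the request, in the order the system ranked them; it is one metric among several a real harness would track.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the best achievable ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A list with the single relevant item in first place scores 1.0, and the same item pushed to third place scores lower, which is the behavior a ranking regression test would check for.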

Cold-start & drift handling

LLM-assisted bootstrapping for new items and users, automated retraining triggers on signal. The two failure modes most teams patch over — engineered properly.
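A retraining trigger can be as simple as a drift statistic compared against a threshold. The sketch below uses the population stability index (PSI) over fixed bin edges; the choice of PSI, the 0.2 threshold (a common rule of thumb), and the function names are assumptions for illustration, and both samples are assumed to be non-empty.

```python
import math
from collections import Counter


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population stability index between a reference and a live sample."""

    def shares(values):
        counts = Counter()
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        total = len(values)
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(counts[i] / total, eps) for i in range(len(bin_edges) + 1)]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(expected, actual, bin_edges, threshold=0.2):
    """Flag retraining when the feature distribution has shifted past threshold."""
    return psi(expected, actual, bin_edges) > threshold
```

The same check can run per feature on a schedule, with the reference sample taken from the data the current model was trained on.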

Where it earns its keep

Where personalization pays.

The strongest 2026 use cases share a shape: a measurable revenue or retention line tied directly to ranking quality — where a better recommendation is a better number, and a stale one is a slow leak.

Retail & e-commerce

Product recs, search ranking, category browse. Up to 31% of revenue when the stack is built for it — and a slow leak when it isn’t.

Media & streaming

Next-item, session, and content discovery. Sequential models over real watch and listen history, not yesterday’s collaborative filter.

Financial services

Next-best-action, offer ranking, retention treatments. Ranked against expected value, with policy filters and audit trail on every decision.

B2B & SaaS

In-product feature surfacing, content recommendations, expansion targeting. Personalization that earns its keep against churn and ARR, not engagement vanity.

Ad targeting & CTR prediction

Sequential ranking with calibrated probability outputs. Calibration matters as much as ranking when the downstream is an auction.
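In an auction the predicted click probability typically feeds the bid directly, so a model that ranks well but overstates probabilities overbids. One common check is expected calibration error (ECE); the equal-width binning and bin count below are assumptions for this sketch.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between predicted probability and observed rate,
    computed over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_p - observed)
    return ece
```

Two models can produce the same ordering and still differ sharply on this number, which is why it is worth tracking alongside ranking metrics.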

Conversational discovery

LLM-driven shopping and content discovery agents — grounded in your catalog, governed by your rules, explainable by design.

FAQ

Personalization, answered straight.

What’s actually new in recommendation systems in 2026?

Three things, mainly. Sequential transformer rankers (HSTU-style, up to 1.5T parameters at Meta) are now the production floor for serious quality. LLM rerankers — shipped at Spotify, Salesforce, Duolingo — are mainstream when engineered inside a latency budget. And LLM-assisted cold-start finally makes new items and new users a solvable problem rather than a popularity-default patch.

Where does HSTU or generative recommendation make sense for us?

Usually only when you have billion-scale interaction logs and a measurable revenue line tied to ranking quality. For most enterprises, a well-engineered sequential ranker plus an LLM reranker captures the lift HSTU promises, at a fraction of the training cost. We’ll tell you straight whether your data and traffic justify the heavier architecture.

How do we measure recommendation lift before we A/B test?

Offline first, online second — and never one without the other. We build the eval harness up front: counterfactual evaluation, exposure debiasing, ranking metrics calibrated to your business KPI, not just NDCG. Then we ship behind an A/B with a pre-registered hypothesis, so the result isn’t argued about after the fact.
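Counterfactual evaluation can be illustrated with a clipped inverse propensity scoring (IPS) estimator: logged rewards are reweighted by how much more (or less) likely the candidate policy is to take the logged action. The log format, the clip value, and the toy policy are assumptions for this sketch, and the estimator is only as good as the logged propensities.

```python
def ips_estimate(logs, new_policy_prob, clip=10.0):
    """Estimate a new policy's average reward from logged interactions.

    logs: iterable of (context, action, reward, logging_prob), where
    logging_prob is the probability the logging policy gave that action.
    new_policy_prob(context, action) returns the candidate policy's probability.
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logs:
        # Clipping the weight trades a little bias for lower variance.
        weight = min(new_policy_prob(context, action) / logging_prob, clip)
        total += weight * reward
        n += 1
    return total / n if n else 0.0


LOGS = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 1.0, 0.5),
    ("u4", "b", 0.0, 0.5),
]


def always_a(context, action):
    """Toy candidate policy that always recommends item 'a'."""
    return 1.0 if action == "a" else 0.0
```

In this toy log the logging policy earned an average reward of 0.5, while the estimator credits the always-`a` policy with more, which is the kind of offline signal that would justify taking a candidate to an A/B test.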

How does cold-start work with LLM-assisted recommenders?

Well, finally. LLMs can embed new items and new users from sparse metadata (descriptions, profile signals, category context) into the same space as your trained ranker — so day-one ranking is meaningfully better than popularity-default. It doesn’t replace warm collaborative signal, but it closes the gap between “we have no data on this item” and “we have enough to rank it sensibly.”
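A deliberately simple way to place a new item in the ranker's space is to borrow from its nearest warm neighbors. The sketch below assumes you already have a text embedding for each item from some embedding model (not shown), and swaps in similarity-weighted neighbor averaging as a stand-in; a learned projection from text space to ranker space is a common alternative.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cold_start_embedding(new_text_vec, warm_items, k=2):
    """Approximate a ranker-space embedding for an item with no interactions.

    warm_items: list of (text_vec, ranker_vec) pairs for items that already
    have trained ranker embeddings. Returns None if no neighbor is similar.
    """
    scored = sorted(
        ((cosine(new_text_vec, text_vec), ranker_vec) for text_vec, ranker_vec in warm_items),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = [max(sim, 0.0) for sim, _ in scored]  # ignore dissimilar items
    total = sum(weights)
    if total == 0:
        return None
    dim = len(scored[0][1])
    return [
        sum(w * ranker_vec[i] for w, (_, ranker_vec) in zip(weights, scored)) / total
        for i in range(dim)
    ]
```

Returning `None` when nothing is similar leaves room for an explicit fallback, so the popularity default becomes a measured last resort rather than the silent norm.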

Do we own the personalization stack you build with us?

Yes. The code, the models, the eval harness, the feature pipelines, the integrations — all yours. We design every engagement so your team can operate, retrain, and extend the system independently. Enablement is built into the project, not sold as an upsell afterward.

Build versus buy: do we need NVIDIA Merlin, Vertex AI, Algolia, or build from scratch?

Usually a hybrid. Managed platforms (Vertex AI, AWS Personalize, Algolia’s Recommendation Analytics from April 2026) get you to a credible baseline fast — but the lift that matters lives in the reranker, the feature pipeline, and the eval harness, and those are almost always custom. We help you draw the line and build only what platforms can’t.

Where to start.

Personalization Review · 3 weeks · fixed fee

Bring us your current recommendation or personalization system.

We instrument it, benchmark it against modern architectures (sequential rankers, LLM rerankers, real-time features), identify the 2–3 changes most likely to lift conversion or retention, and deliver a sequenced rebuild plan with effort estimates and expected lift.

What you get: a production-readiness assessment scored against twelve criteria, including architecture, latency budget, eval harness, feature freshness, cold-start handling, drift, fairness, governance, and operational ownership; a target architecture for the recommender in your environment; a staged rebuild plan with timelines and expected lift; and one workshop with your data, engineering, and product leads. Led by a senior consultant, with fixed scope and a fixed fee.

Book a Personalization Review →

Start the conversation

Ready to ship personalization that actually moves the number?

A 30-minute conversation with a senior consultant. Bring the recommender you have or the one you wish you had. We’ll tell you where the lift is, what the gaps are, and what a Personalization Review would surface.