Frontier research. Production discipline.

Miniml’s leadership includes active academics — University of Edinburgh professors, an Alan Turing Institute Fellow, and an ELLIS Scholar. We work at the boundary of machine learning research and enterprise systems, then carry the methods that hold up under scrutiny into production. Research that ships.

Book a consultation → Recent papers →

Papers

Recent research from the team.

Technical work from Miniml researchers and collaborators — new methods, evaluations, and findings from frontier model development. Each entry links to a short summary on this site, with the full paper one click away.

GRADA: Graph-based Reranker against Adversarial Documents Attack

November 18, 2025

Retrieval Augmented Generation (RAG) frameworks improve the accuracy of large language models (LLMs) by integrating external knowledge from retrieved ...
Read summary
FLARE: Faithful Logic-Aided Reasoning and Exploration

November 11, 2025

We introduce Faithful Logic-Aided Reasoning and Exploration (FLARE), a novel interpretable approach for traversing the problem space using task decomp...
Read summary
DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations

November 4, 2025

Large Language Models (LLMs) often hallucinate, producing unfaithful or factually incorrect outputs by misrepresenting the provided context. We propos...
Read summary
Neurosymbolic Diffusion Models

October 21, 2025

Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predict...
Read summary
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

October 7, 2025

We introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks to evaluate long-context vision-language mo...
Read summary
Neurosymbolic Reasoning Shortcuts under the Independence Assumption

September 23, 2025

The ubiquitous independence assumption among symbolic concepts in neurosymbolic (NeSy) predictors is a convenient simplification that speeds up probab...
Read summary
Noiser: Bounded Input Perturbations for Attributing Large Language Models

September 9, 2025

Feature attribution (FA) methods are common post-hoc approaches that explain how Large Language Models (LLMs) make predictions. Generating faithful at...
Read summary
An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

August 26, 2025

Large Language Models (LLMs) frequently produce factually inaccurate outputs—a phenomenon known as hallucination—which limits their accuracy in knowle...
Read summary
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

August 12, 2025

Autoregressive language models rely on a Key-Value (KV) Cache to avoid re-computing past hidden states during generation. As model sizes and context l...
Read summary
PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

July 29, 2025

Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like...
Read summary
Inverse Scaling in Test-Time Compute

July 19, 2025

We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse sc...
Read summary
Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs

July 15, 2025

Understanding time from visual representations is a fundamental cognitive skill, yet it remains a challenge for multimodal large language models (MLLM...
Read summary
SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

May 15, 2025

We present SynDARin, a methodology for synthesizing high-quality reasoning datasets in low-resource languages. Our approach combines template-based ge...
Read summary
Adaptive Computation Modules: Granular Conditional Computation for Efficient Inference

April 15, 2025

We introduce Adaptive Computation Modules (ACMs) that enable fine-grained conditional computation for more efficient neural network inference. Our app...
Read summary
Self-Training Large Language Models for Tool-Use Without Demonstrations

March 15, 2025

We propose a self-training approach that enables large language models to learn tool use without requiring human demonstrations. Our method uses self-...
Read summary
When Can Proxies Improve the Sample Complexity of Preference Learning?

February 15, 2025

We provide theoretical and empirical analysis of when proxy rewards can improve sample efficiency in preference learning. Our work establishes conditi...
Read summary
Is Complex Query Answering Really Complex?

January 15, 2025

We challenge conventional wisdom about complex query answering by demonstrating that many supposedly complex queries can be solved with surprisingly s...
Read summary
An Auditing Test to Detect Behavioral Shift in Language Models

December 15, 2024

We develop a comprehensive auditing framework to detect behavioral shifts in language models across different contexts and time periods. Our approach ...
Read summary
TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

November 15, 2024

We investigate the cross-lingual transferability of backdoor attacks in instruction-tuned large language models. Our findings reveal that backdoors ca...
Read summary
Steering Knowledge Selection Behaviours in LLMs

October 28, 2024

We investigate how large language models (LLMs) select and utilise knowledge when generating responses. Our analysis reveals that LLMs exhibit systematic biases in knowledge selection, often favouring certain…
Read summary
Mixtures of In-Context Learners

October 15, 2024

We introduce Mixtures of In-Context Learners (MiCL), a novel approach that combines multiple in-context learning strategies to improve few-shot perfor...
Read summary
Low-rank lottery tickets

October 4, 2024

Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. Neural networks deliver exceptional performance but can be impractical for applications with limited hardware or energy resources due to their…
Read summary
Robust low-rank training via approximate orthonormal constraints

September 8, 2024

As models and datasets grow, pruning techniques using low-rank matrix factorizations have become popular for reducing resource demands while maintaining accuracy. However, we find that…
Read summary
Are We Done with MMLU?

August 25, 2024

Our analysis uncovers significant issues with the Massive Multitask Language Understanding (MMLU) benchmark, which is widely used to assess LLMs. We found numerous ground truth errors, with 57% of…
Read summary
Enhancing AI Model Robustness with Natural Language Explanations

August 9, 2024

In this paper, we explore how natural language explanations (NLEs) can improve the robustness of large language models (LLMs) in tasks like natural language…
Read summary
Probing the Emergence of Cross-lingual Alignment during LLM Training

August 4, 2024

Multilingual LLMs excel at zero-shot cross-lingual transfer, likely by aligning languages without parallel sentence supervision. This study uses intrinsic probing to analyze neuron…
Read summary
Using Natural Language Explanations to Improve Robustness of In-context Learning

June 3, 2024

This work explores improving the robustness of LLMs against adversarial inputs by augmenting in-context learning (ICL) with natural language explanations (NLEs). Prompting…
Read summary
SPARSEFIT: Few-shot Prompting with Sparse Fine-tuning

April 19, 2024

SPARSEFIT: Few-shot Prompting with Sparse Fine-tuning for Jointly Generating Predictions and Natural Language Explanations. This work introduces SparseFit, a sparse few-shot fine-tuning strategy for generating natural language explanations (NLEs) with pre-trained language…
Read summary
Analysing The Impact of Sequence Composition on Language Model Pre-Training

March 18, 2024

Pre-training sequence composition plays a critical role in language model performance. Traditional causal masking can introduce distractions from unrelated documents, hindering effectiveness.…
Read summary
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression

March 5, 2024

Deploying large language models (LLMs) is challenging due to the high memory demands of the Key-Value (KV) cache, especially with longer…
Read summary

Research to production

How the research earns its place.

Most enterprise problems do not need a new method. They need known methods applied with discipline. Research enters our work where it genuinely moves the result — and the bar for that is high.

Evaluation first. Before a method ships, it is measured against your real data and real failure modes — not a benchmark. The same evaluation rigor we use in research is how we decide what is ready for production.
Reliability over novelty. A dependable, well-understood approach beats a clever one that drifts. We reach for frontier methods only where a simpler approach genuinely cannot meet the reliability bar.
Novel methods where they pay. When a problem sits at the edge of what current tools can do, our academic leadership can bring methods from the literature — and the judgment to know when not to.
Built to own. What we ship is documented, testable, and handed over — so your team can run, extend, and re-evaluate it without us in the room.

Talk through a problem at the edge →

From the lab → method selected → evaluated on real data → hardened for reliability → production — with production behaviour fed back into evaluation

Start the conversation

Have a problem at the edge of what’s possible?

If your hardest problem needs more than off-the-shelf tools — or you simply want a second opinion grounded in current research — talk to us. A 30-minute conversation with a senior consultant who can tell you what is genuinely solvable today, and what it would take to ship it.

Book a consultation →
