Vision in production. Not in the lab.

Detection, inspection, video understanding, document AI — most enterprise CV projects look great in the lab and stall on the factory floor. We build modular vision systems engineered for production: fast detectors plus VLM reasoning, edge-deployed, drift-monitored, and continuously evaluated.

Why vision projects die in pilot.

Vision is one of the oldest AI categories, yet most enterprises still run pilots that never reach the floor. The lab-to-production gap is the biggest hurdle in enterprise CV today, and the answer isn’t a bigger model. It’s the modular architecture, the labeling discipline, and the operational rigor that turn vision into a production system.

Lab accuracy, production failure

Models tested on curated data fall apart under real lighting, real wear, real hardware drift. The accuracy number on the pilot deck has almost no relationship to what happens on the line in week three.

Label scarcity

Most organizations have abundant raw images and few labels. Without synthetic data, active learning, and weak supervision loops, retraining stalls — and the model that worked at launch quietly stops working.

Latency or cost blow-ups

Always-on VLM inference is expensive — a single image consumes around 4,096 tokens in early-fusion models. Edge-first design isn’t optional; it’s the difference between a production system and a quarterly cloud bill that kills the project.
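
A back-of-envelope calculation makes the point. The sketch below takes the roughly 4,096 tokens per image quoted above; the camera count, frame rate, and the share of frames that reach the VLM are hypothetical inputs for illustration, not measurements from any deployment.

```python
TOKENS_PER_IMAGE = 4096  # figure quoted above for early-fusion models


def daily_image_tokens(cameras: int, fps: float, vlm_fraction: float = 1.0) -> int:
    """Image tokens sent to a VLM per day.

    vlm_fraction is the share of frames that actually reach the VLM
    (1.0 means always-on; a detector gate lowers it). All inputs are hypothetical.
    """
    if not 0.0 <= vlm_fraction <= 1.0:
        raise ValueError("vlm_fraction must be between 0 and 1")
    frames_per_day = cameras * fps * 86_400  # 86,400 seconds in a day
    return round(frames_per_day * vlm_fraction * TOKENS_PER_IMAGE)


# Hypothetical site: 10 cameras sampled at 1 frame per second.
always_on = daily_image_tokens(cameras=10, fps=1.0)
# Same site, with a detector passing only 2% of frames to the VLM.
gated = daily_image_tokens(cameras=10, fps=1.0, vlm_fraction=0.02)
```

Under these assumptions the always-on design sends about 3.5 billion image tokens a day and the gated design about 71 million. Multiply by your own per-token price to see which one survives a cost model.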

Drift you can’t see

Lighting changes, equipment ages, seasons shift, packaging gets redesigned. Without drift monitoring, accuracy degrades silently until exceptions pile up and operators stop trusting the output. By then the regression has been compounding for months.

Modular pipelines, not monolithic models.

Five steps to vision that ships and keeps shipping — engineered for the edge, the drift, and the operational reality.

  • Define the operational job. What’s the decision, the failure cost, the latency budget, and the edge constraint? Vision systems that earn their keep are scoped against the operational reality before a single model is chosen.
  • Architect modularly. Fast detector or segmenter for region proposals (YOLO26, RF-DETR with DINOv2 backbones); VLM for reasoning, OCR, or attribute extraction where it earns its keep. Two-stage, not monolithic. Each stage swappable.
  • Solve the labeling problem. Synthetic data, active learning, weak supervision — engineered loops, not annotation projects. The labeling pipeline is treated as production code, with the same rigor as the model itself.
  • Deploy to the edge or the cloud as the job demands. Triton, ONNX Runtime, TensorRT for edge; cloud inference where latency permits and economics work. On-prem and air-gapped where the regulatory environment requires it. The deployment target is a design input, not an afterthought.
  • Monitor for drift and retrain on signal. MLOps loops that auto-trigger retraining on observed drift; champion-challenger model promotion with a full audit trail. The system stays sharp after we leave.
Talk through your vision use case
[Diagram: Edge · Drift · Retrain. Camera (image, video) → Detector (YOLO26, RF-DETR) → VLM (reason, OCR, where it earns its keep) → Output (structured, act). A drift monitor runs continuous eval; synthetic data and active labeling feed retraining on signal.]
Detect with a real-time model; reason with a VLM only where it earns its keep; monitor drift continuously; retrain on signal — at the edge, on the floor, where the work happens.
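
The two-stage shape can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `run_pipeline`, `fake_detector`, `fake_vlm`, and the confidence band in `ambiguous` are all hypothetical names and values, and the stand-in functions take the place of a real detector and a real VLM so the sketch runs with no model installed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels


@dataclass
class Result:
    detection: Detection
    rationale: Optional[str]  # filled only when the VLM stage ran


def run_pipeline(
    frame: object,
    detect: Callable[[object], List[Detection]],
    reason: Callable[[object, Detection], str],
    needs_reasoning: Callable[[Detection], bool],
) -> List[Result]:
    """Stage 1 runs on every frame; stage 2 runs only on gated detections."""
    results = []
    for det in detect(frame):
        rationale = reason(frame, det) if needs_reasoning(det) else None
        results.append(Result(det, rationale))
    return results


# Hypothetical stand-in for a fast detector.
def fake_detector(frame):
    return [
        Detection("scratch", 0.95, (10, 10, 40, 40)),
        Detection("scratch", 0.55, (80, 20, 120, 60)),
    ]


vlm_calls = []  # records which detections reached the expensive stage


# Hypothetical stand-in for a VLM call.
def fake_vlm(frame, det):
    vlm_calls.append(det)
    return f"possible {det.label}, needs review"


# Gate: only mid-confidence detections go to the VLM (band is an assumption).
def ambiguous(det, low=0.4, high=0.8):
    return low <= det.confidence < high


results = run_pipeline("frame-0", fake_detector, fake_vlm, ambiguous)
```

Because each stage is passed in as a plain callable, either one can be swapped without touching the other, and the gate is the single place where the cost trade-off is tuned.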

Vision patterns.

Three production patterns — each engineered for a different shape of operational work.

01

Real-time detection & inspection

Manufacturing QA, retail loss prevention, warehouse package handling, security monitoring. Sub-frame latency on the line, with the accuracy and audit trail that operators trust.

02

Vision-language reasoning

Inspection with rationale, document AI, video understanding with structured output. VLM in the loop only where the reasoning earns its cost — not as the default for every pixel.

03

Multimodal agents

Vision integrated into operational agents that see, reason, and act inside your systems. Governed by the same constrained-loop discipline as the rest of our agent work.

What we engineer around the vision system.

The model is one piece. These are the layers that make vision trustworthy in production.

01

Edge deployment & runtime

Triton, ONNX Runtime, TensorRT, on-device inference. Quantized, profiled, and engineered to hit the latency and power budget on the hardware you actually have.

02

Labeling & data pipeline

Synthetic data generation, active learning, weak supervision. The labeling pipeline that keeps retraining cheap and fast — yours to operate after we’ve gone.

03

Drift monitoring & retraining

Continuous evaluation, automated retraining triggers, model versioning, champion-challenger promotion. The audit trail your operations team needs and your auditors expect.
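
As a rough illustration of champion-challenger promotion with an audit trail, consider the sketch below. The promotion rule (a minimum accuracy gain on a shared held-out set plus a latency budget), the field names, and the default thresholds are assumptions chosen for the example; a real gate would use whatever metrics the operational job defines.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Evaluation:
    model_version: str
    accuracy: float        # measured on the same held-out set for both models
    p95_latency_ms: float  # measured on the target hardware


def decide_promotion(
    champion: Evaluation,
    challenger: Evaluation,
    min_gain: float = 0.01,           # hypothetical threshold
    latency_budget_ms: float = 50.0,  # hypothetical budget
    audit_log: Optional[List[dict]] = None,
) -> bool:
    """Promote only if the challenger is clearly better and still within budget."""
    promote = (
        challenger.accuracy - champion.accuracy >= min_gain
        and challenger.p95_latency_ms <= latency_budget_ms
    )
    if audit_log is not None:
        # Every decision is recorded, whether or not the challenger wins.
        audit_log.append({
            "decided_at": datetime.now(timezone.utc).isoformat(),
            "champion": asdict(champion),
            "challenger": asdict(challenger),
            "promoted": promote,
        })
    return promote


log: List[dict] = []
current = Evaluation("v12", accuracy=0.91, p95_latency_ms=38.0)
candidate = Evaluation("v13", accuracy=0.94, p95_latency_ms=41.0)
promoted = decide_promotion(current, candidate, audit_log=log)
```

Recording the rejected challengers alongside the promoted ones is what lets an auditor reconstruct why a given model version was on the line on a given day.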

Where vision earns its keep.

The strongest 2026 use cases share a shape: a decision on the floor, a latency budget that rules out the cloud, and a cost of being wrong that justifies doing it right.

Manufacturing inspection

Defects down to 0.1 mm caught at line speed; 100% inspection replaces sampling. 41% of manufacturing CV revenue in 2026 sits in this one use case, for good reason.

Warehouse & logistics

Package dimensioning, label and barcode reading, damage detection, robotic guidance. Sub-frame latency, edge-deployed, integrated into the WMS.

Retail shelf analytics & loss prevention

Real-time shelf state and incident detection wired into action workflows — not a dashboard nobody opens.

Document AI

Layout-aware extraction with citation. Intersects with our Document Intelligence practice when the document is the system of record.

Medical imaging

Radiology triage and pathology screening using domain foundation models (Rad-DINO, Merlin, ELIXR). Built against the regulatory environment, not around it.

Physical security & workplace safety

PPE detection, perimeter monitoring, incident classification. Edge-deployed, governed, and built so false positives don’t erode operator trust.

Vision in production, answered straight.

Have VLMs replaced specialized detectors for production CV?

No, and they probably won’t. The 2026 production pattern is modular: a fast specialized detector (YOLO26, or RF-DETR, which passes 60.5 mAP on COCO at 25 FPS) for region proposals and real-time classification, with a VLM in the loop only where reasoning, OCR, or attribute extraction earns its cost. VLMs and detectors are complementary, not competitive.

How do we handle the labeling problem without an army of annotators?

You engineer the labeling pipeline like production code. Synthetic data generation closes the gap on rare classes and edge conditions; active learning queues the highest-information examples; weak supervision converts rules and partial labels into training signal. Annotation is the last resort, not the first move.
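
To make the active-learning step concrete, here is a minimal sketch of one common strategy, entropy-based uncertainty sampling: the images whose predicted class probabilities are most spread out go to the annotation queue first. The function names, image IDs, and probabilities are hypothetical, and entropy is only one of several reasonable scoring choices.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a class-probability vector (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(
    predictions: Dict[str, Sequence[float]], budget: int
) -> List[str]:
    """Return the `budget` image IDs the model is least certain about."""
    ranked = sorted(
        predictions, key=lambda image_id: entropy(predictions[image_id]), reverse=True
    )
    return ranked[:budget]


# Hypothetical softmax outputs from the current model on unlabeled images.
preds = {
    "img_001": [0.98, 0.01, 0.01],  # confident: little to learn from a label
    "img_002": [0.34, 0.33, 0.33],  # near-uniform: most informative
    "img_003": [0.60, 0.30, 0.10],
}
queue = select_for_labeling(preds, budget=2)
```

With a labeling budget of two, the confident prediction is skipped and annotator time goes to the two images the model is least sure about.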

Can we run vision at the edge, on-prem, or in air-gapped environments?

Yes, and for most operational use cases you should. Edge runtimes — Triton, ONNX Runtime, TensorRT, on-device inference — are mature in 2026, and the economics of always-on VLM inference rarely survive a serious cost model. Air-gapped deployment is well-trodden ground for regulated industries; we’ve shipped it.

How do we know our model isn’t drifting?

You instrument it before it ships, not after exceptions pile up. We build continuous evaluation into the pipeline: input distribution monitoring, prediction confidence tracking, periodic shadow evaluation against a held-out set, and automated retraining triggers when drift exceeds a set threshold. Champion-challenger promotion keeps the audit trail clean.
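
One way to turn prediction confidence tracking into a trigger is the Population Stability Index (PSI), which compares the live distribution of confidence scores with a reference window captured at launch. The sketch below is a simplified illustration: the bin count, the 0.2 threshold (a common rule of thumb, to be tuned per deployment), and the sample scores are all assumptions.

```python
import math
from typing import List, Sequence

DRIFT_THRESHOLD = 0.2  # common rule of thumb for PSI; an assumption, tune per site


def bin_fractions(values: Sequence[float], bins: int = 5, eps: float = 1e-6) -> List[float]:
    """Share of values per equal-width bin; values are assumed to lie in [0, 1]."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    total = len(values)
    # Floor empty bins at eps so the logarithm below stays defined.
    return [max(c / total, eps) for c in counts]


def psi(reference: Sequence[float], live: Sequence[float], bins: int = 5) -> float:
    """Population Stability Index between a reference and a live sample."""
    ref = bin_fractions(reference, bins)
    cur = bin_fractions(live, bins)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(reference: Sequence[float], live: Sequence[float]) -> bool:
    return psi(reference, live) > DRIFT_THRESHOLD


# Hypothetical confidence scores: launch week versus a later week.
launch = [0.90, 0.92, 0.95, 0.88, 0.91, 0.85, 0.97, 0.93]
later = [0.55, 0.60, 0.62, 0.50, 0.91, 0.58, 0.65, 0.52]
```

Here `should_retrain(launch, later)` fires because confidence has slid out of the top bin, while comparing the launch window with itself gives a PSI of zero. The same comparison can be run on input statistics such as brightness or sharpness to catch drift before it shows up in predictions.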

Do we own the vision system you build with us?

Yes. The code, the models, the labeling pipeline, the eval harness, the edge runtime, the integrations — all yours. We design every engagement so your team can operate, retrain, and extend the system independently. Enablement is built into the project, not sold as an upsell afterward.

Build versus buy: should we use open-source models like Qwen2.5-VL or Granite Vision, or a managed platform?

Usually a hybrid, and the line is sharper than people think. Open-source VLMs (Qwen2.5-VL, IBM Granite 4.0 3B Vision from April 2026) are credible in production for document and reasoning work, especially in regulated or air-gapped environments. Managed platforms (Azure AI Foundry, Google Cloud Vision, Snowflake Cortex, NVIDIA Metropolis) win for fast baselines and infrastructure you don’t want to operate — but the modular two-stage pipeline almost always needs to be custom.

Where to start.

Vision System Review · 3 weeks · fixed fee

Bring us a vision problem — or your plan to build one.

Inspection, detection, video understanding, document AI. We benchmark your current pipeline (or design plan) against modern modular architectures, prove out the right model class on your data, and deliver a production deployment plan with accuracy estimates and edge constraints accounted for.

What you get:

  • A production-readiness assessment scored against twelve criteria, including architecture, latency, edge runtime, labeling pipeline, drift monitoring, retraining loops, governance, and operational ownership.
  • A target architecture for the vision system in your environment.
  • A staged delivery plan with timelines, accuracy targets, and edge-constraint analysis.
  • One workshop with your engineering, operations, and (where relevant) compliance leads.

Led by a senior consultant. Fixed scope, fixed fee.

Book a Vision System Review
Start the conversation

Ready to ship vision that works on the floor?

A 30-minute conversation with a senior consultant. Bring a vision problem you’re stuck on — a pilot that won’t scale, a process you think a camera and a model could own, or a system that worked at launch and is quietly degrading. We’ll tell you whether it’s ready to ship, what the gaps are, and what a Vision System Review would surface.
