Engineering·7 min read·Apr 14, 2026

Evaluating freight LLM outputs without trusting the LLM to evaluate itself

It is easy to ship an LLM feature. It is hard to ship one you can stand behind in front of a customer. Here is the eval harness we built to keep ourselves honest.

Musa Zulfqar
Founder, FreightSurge

Why LLM evals are hard in freight

Most public LLM benchmarks measure things that do not matter to a freight broker: trivia recall, math word problems, code generation. None of them tell you whether a quote draft will hold up in front of a logistics manager who has been doing this for fifteen years.

Worse, the easy way to evaluate LLM output is to use another LLM as the grader. That is fast and looks rigorous on a slide, but it is a closed loop — you end up measuring whether your model agrees with another model, which is not the same as whether the output is correct.

The four axes we evaluate on

1. Extraction accuracy

For every quote request, we have a ground-truth set of structured fields a senior broker would extract: origin and destination ZIP codes, equipment type, commodity class, weight, dates, special handling. We score model output against the ground truth using exact match and a freight-aware fuzzy match (so "Houston, TX 77001" and "Houston 77001" both count). We track per-field F1 and watch closely for regressions.
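To make the scoring concrete, here is a minimal sketch of per-field F1 with a fuzzy match on location fields. The field names, the normalization, and the matching rule (same 5-digit ZIP, and one string's tokens contained in the other's) are assumptions for illustration, not the production matcher.

```python
import re
from collections import defaultdict

ZIP_RE = re.compile(r"\b\d{5}\b")

def normalize(value: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", value.lower()).split())

def location_match(predicted: str, truth: str) -> bool:
    """Hypothetical fuzzy rule: same ZIP codes, and one token set contains the other."""
    p, t = normalize(predicted), normalize(truth)
    if p == t:
        return True
    p_zips, t_zips = ZIP_RE.findall(p), ZIP_RE.findall(t)
    if not p_zips or p_zips != t_zips:
        return False
    p_tokens, t_tokens = set(p.split()), set(t.split())
    return p_tokens <= t_tokens or t_tokens <= p_tokens

def per_field_f1(examples, fuzzy_fields=("origin", "destination")):
    """examples: iterable of (predicted_fields, truth_fields) dicts of strings."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for predicted, truth in examples:
        for field in set(predicted) | set(truth):
            pred_val, true_val = predicted.get(field), truth.get(field)
            if pred_val is None and true_val is None:
                continue
            c = counts[field]
            if pred_val is not None and true_val is not None:
                if field in fuzzy_fields:
                    ok = location_match(pred_val, true_val)
                else:
                    ok = normalize(pred_val) == normalize(true_val)
                if ok:
                    c["tp"] += 1
                else:
                    # A wrong value is both a spurious and a missed extraction.
                    c["fp"] += 1
                    c["fn"] += 1
            elif pred_val is not None:
                c["fp"] += 1
            else:
                c["fn"] += 1
    scores = {}
    for field, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[field] = 2 * c["tp"] / denom if denom else 0.0
    return scores
```

Comparing these scores against the previous run's scores, field by field, is one simple way to surface a regression in a single field that an aggregate number would hide.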

2. Pricing reasonableness

We do not grade pricing on "right or wrong" — every brokerage has its own pricing logic. We grade on reasonableness: is the suggested rate inside a defensible window relative to the brokerage's recent comparable loads and the live spot benchmark? Outliers are flagged automatically and queued for human review.
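A reasonableness check of this kind might look like the sketch below. The window definition (the median of recent comparable loads and the spot benchmark, widened by a fixed tolerance) and the 15% default are assumptions chosen for illustration; each brokerage would tune its own.

```python
from statistics import median

def reasonable_window(comparable_rates, spot_rate, tolerance=0.15):
    """Return (low, high) bounds for a defensible rate.

    Assumed rule: anchor on the median comparable load and the spot
    benchmark, then widen the span between them by `tolerance`.
    """
    anchors = [spot_rate]
    if comparable_rates:
        anchors.append(median(comparable_rates))
    return min(anchors) * (1 - tolerance), max(anchors) * (1 + tolerance)

def review_rate(suggested, comparable_rates, spot_rate, tolerance=0.15):
    """Flag a suggested rate for human review when it falls outside the window."""
    low, high = reasonable_window(comparable_rates, spot_rate, tolerance)
    return {
        "suggested": suggested,
        "window": (low, high),
        "needs_human_review": not (low <= suggested <= high),
    }

result = review_rate(3000, comparable_rates=[2000, 2100, 2200], spot_rate=2300)
if result["needs_human_review"]:
    print("Queue for broker review:", result)
```

The check never says a rate is correct. It only says whether a human needs to look, which matches the grading goal described above.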

3. Tone match

For repeat customers, we score whether the draft matches the historical tone of the broker's previous correspondence with that customer. We use deterministic similarity metrics (greeting style, sign-off, formality) rather than asking another model to judge "did this sound right." Tone failures are quiet — they cost relationships, not loads — so we treat them seriously.
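As a rough illustration of what deterministic tone features can look like, the sketch below extracts a greeting, a sign-off, and a crude formality signal (presence of contractions), then compares a draft against the most common value of each feature in the broker's past emails to that customer. The vocabularies and the formality heuristic are hypothetical stand-ins, not the production feature set.

```python
import re
from collections import Counter

GREETINGS = ("hi", "hey", "hello", "dear", "good morning", "good afternoon")
SIGN_OFFS = ("thanks", "thank you", "best", "regards", "best regards", "sincerely", "cheers")
# Contractions such as "we're" or "can't"; possessive 's is deliberately excluded.
CONTRACTION_RE = re.compile(r"\b\w+['’](?:t|re|ve|ll|d|m)\b", re.IGNORECASE)

def first_phrase(line, vocabulary):
    """Return the longest vocabulary phrase the line starts with, else 'other'."""
    text = line.strip().lower()
    for phrase in sorted(vocabulary, key=len, reverse=True):
        if re.match(re.escape(phrase) + r"\b", text):
            return phrase
    return "other"

def tone_features(email):
    lines = [line for line in email.splitlines() if line.strip()]
    greeting = first_phrase(lines[0], GREETINGS) if lines else "other"
    sign_off = "other"
    for line in reversed(lines[-3:]):  # the sign-off usually sits above the name
        candidate = first_phrase(line, SIGN_OFFS)
        if candidate != "other":
            sign_off = candidate
            break
    formality = "casual" if CONTRACTION_RE.search(email) else "formal"
    return {"greeting": greeting, "sign_off": sign_off, "formality": formality}

def tone_match_score(draft, history):
    """Return (share of features matching the historical norm, mismatched features)."""
    if not history:
        raise ValueError("tone matching needs at least one historical email")
    draft_features = tone_features(draft)
    mismatches = []
    for name, value in draft_features.items():
        usual = Counter(tone_features(e)[name] for e in history).most_common(1)[0][0]
        if value != usual:
            mismatches.append(name)
    return 1 - len(mismatches) / len(draft_features), mismatches
```

Because every feature is a plain string comparison, a tone failure comes with a named reason (for example, "greeting" or "sign_off") that a reviewer can verify by reading the two emails side by side.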

4. Factual consistency

Every claim in a draft — pickup date, weight, equipment type, lane history reference — must trace back to either the inbound email or the retrieved customer context. Anything the model adds without a source is treated as a hallucination and blocks the draft from being marked "ready to send."
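The tracing rule can be sketched as follows, assuming the draft's claims have already been extracted into a field-to-value mapping upstream. The source names and the matching rule (normalized, token-aligned substring search) are illustrative assumptions; a real tracer would need date and unit normalization on top.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())

def trace_claims(claims, sources):
    """Map each claim field to the name of the source that supports it, or None.

    claims:  {"pickup_date": "May 4", ...}  (assumed to be extracted upstream)
    sources: {"inbound_email": "...", "customer_context": "..."}
    """
    # Pad with spaces so matches align to whole tokens ("may 4" is not found in "may 42").
    padded = {name: f" {normalize(text)} " for name, text in sources.items()}
    report = {}
    for field, value in claims.items():
        needle = f" {normalize(value)} "
        report[field] = next(
            (name for name, text in padded.items() if needle in text), None
        )
    return report

def ready_to_send(claims, sources):
    """A single unsourced claim blocks the draft."""
    return all(src is not None for src in trace_claims(claims, sources).values())
```

Returning the supporting source name, rather than a bare pass or fail, lets the audit log show where each claim in the draft came from.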

A draft that fails any one of the four axes does not get a confidence score; it gets a flag. Human-in-the-loop is not a fallback. It is the design.
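One way to express that rule in code is a gate that returns a status rather than a number. The axis names, the `AxisResult` type, and the status strings below are hypothetical; the sketch also assumes that a missing axis result should count as a failure.

```python
from dataclasses import dataclass

REQUIRED_AXES = ("extraction", "pricing", "tone", "factual_consistency")

@dataclass(frozen=True)
class AxisResult:
    axis: str
    passed: bool
    detail: str = ""

def gate_draft(results):
    """Flag the draft if any required axis failed or was never evaluated."""
    seen = {r.axis for r in results}
    failed = {r.axis for r in results if not r.passed}
    failed_axes = [a for a in REQUIRED_AXES if a in failed or a not in seen]
    if failed_axes:
        return {"status": "flagged_for_review", "failed_axes": failed_axes}
    return {"status": "ready_to_send", "failed_axes": []}
```

There is deliberately no weighted average here: a strong result on three axes cannot offset a failure on the fourth.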

How brokers contribute to evals

Every edit a broker makes to a draft is logged and tagged. When the same kind of edit shows up across multiple brokers and multiple customers, that signal feeds into our regression test set. The brokers who use the product are continuously building the eval suite that keeps the product honest.
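The promotion rule described here could be sketched as a small aggregation over the edit log. The log schema (`tag`, `broker_id`, `customer_id`) and the thresholds of two brokers and two customers are assumptions for illustration.

```python
from collections import defaultdict

def recurring_edit_tags(edit_log, min_brokers=2, min_customers=2):
    """Return edit tags seen across enough distinct brokers and customers.

    edit_log: iterable of dicts with hypothetical keys
              "tag", "broker_id", and "customer_id".
    """
    brokers = defaultdict(set)
    customers = defaultdict(set)
    for entry in edit_log:
        brokers[entry["tag"]].add(entry["broker_id"])
        customers[entry["tag"]].add(entry["customer_id"])
    return sorted(
        tag
        for tag in brokers
        if len(brokers[tag]) >= min_brokers and len(customers[tag]) >= min_customers
    )

edit_log = [
    {"tag": "removed_unsourced_transit_time", "broker_id": "b1", "customer_id": "c1"},
    {"tag": "removed_unsourced_transit_time", "broker_id": "b2", "customer_id": "c7"},
    {"tag": "changed_greeting", "broker_id": "b1", "customer_id": "c1"},
]
print(recurring_edit_tags(edit_log))
```

Requiring distinct brokers and distinct customers keeps one person's preference, or one account's quirk, from turning into a regression test for everyone.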

Why we refuse to use LLMs as the judge

LLM-as-judge is fashionable because it scales. It also reliably underestimates how often outputs fail in the real world, particularly when the judge model and the generator model share training biases. For high-stakes business correspondence, we use deterministic metrics where we can, ground-truth datasets where we can't, and human review on the long tail. It is slower. It is also defensible.

When a procurement team asks us to explain how we know the AI is performing, we hand them this eval framework, the regression dashboard, and the audit log. That is the conversation enterprise buying is built on, and it is the one we built our engineering posture for.
