Blog

2026-06-02 · Petr Kindlmann

Your agent's JSON is a contract, so test it like one

A field that goes missing, changes type, or drifts in its enum values silently breaks the consumer while the text still looks right. Schema-validate structured output in CI like any other API response.

2026-06-01 · Petr Kindlmann

Text evals miss the agent regressions that actually hurt

By 2026, asserting tool behavior is table stakes, not a moat. The real question is which layer each tool belongs to: a broad eval platform, hosted observability, or a small per-PR CI gate.

2026-05-30 · Petr Kindlmann

Generate your EU AI Act evidence from the test run, not a doc sprint

Annex IV asks you to document the testing you did for accuracy and robustness. For a well-tested agent, you already have that data in your test runs. Generate the draft from the run, tied to the commit.

2026-05-28 · Petr Kindlmann

Make agent tests deterministic by mocking the tools, not the model

Most teams never test agent behavior in CI because real tool calls are flaky. The fix is old and boring: stub the tool boundary, fix the temperature, repeat the run, and gate on the pass rate.

2026-05-26 · Petr Kindlmann

Cost and latency are regressions, so gate them in CI

A passing agent that quietly got 3x more expensive or 2x slower is still a regression. Why token cost and latency belong in your CI gate, not in a dashboard you check later.

2026-03-20 · Petr Kindlmann

Your agent can still pass and still be drifting

Why agent drift shows up before obvious failures, and how behavioral baselines catch degradation that headline metrics miss.

2026-03-20 · Petr Kindlmann

LLM-as-judge is better than people think, and worse than marketers say

Where judge models are genuinely useful, where they fail, and how to combine them with deterministic assertions.

2026-03-20 · Petr Kindlmann

PII leaks through the agent path you forgot to test

What current research says about privacy leakage in multi-turn agents, memory, and augmented context, and how to test for it.

2026-03-20 · Petr Kindlmann

Tool calls break before the text does

Why small prompt or toolkit changes can silently degrade tool-calling agents, and how to test for it before deploy.

2026-03-07 · Petr Kindlmann

LLM-as-Judge: How to Score AI Agent Responses Automatically

Traditional metrics can't evaluate AI agent quality. LLM-as-judge uses one model to score another's output against human-written criteria. Learn how to write good judge criteria, set thresholds, and combine judge assertions with deterministic checks in KindLM.

2026-03-07 · Petr Kindlmann

How to Test AI Agents in CI/CD Pipelines

A practical guide to adding AI agent behavioral tests to your CI/CD pipeline with KindLM, GitHub Actions, JUnit reporting, and pass rate gates.

2026-03-06 · Petr Kindlmann

3 AI Agent Disasters That Testing Would Have Prevented

Air Canada's hallucinated refund policy. Cursor's fabricated login explanation. Zomato's bot that refused to escalate. We recreate each failure and show how behavioral tests catch it.