2026-03-20 · Petr Kindlmann
Your agent can still pass and still be drifting
Why agent drift shows up before obvious failures, and how behavioral baselines catch degradation that headline metrics miss.
2026-03-20 · Petr Kindlmann
Where judge models are genuinely useful, where they fail, and how to combine them with deterministic assertions.
2026-03-20 · Petr Kindlmann
What current research says about privacy leakage in multi-turn agents, memory, and augmented context, and how to test for it.
2026-03-20 · Petr Kindlmann
Why small prompt or toolkit changes can silently degrade tool-calling agents, and how to test for it before deployment.
2026-03-07 · Petr Kindlmann
Traditional metrics can't evaluate AI agent quality. LLM-as-judge uses one model to score another's output against human-written criteria. Learn how to write good judge criteria, set thresholds, and combine judge assertions with deterministic checks in KindLM.
2026-03-07 · Petr Kindlmann
A practical guide to adding AI agent behavioral tests to your CI/CD pipeline with KindLM, GitHub Actions, JUnit reporting, and pass rate gates.
2026-03-06 · Petr Kindlmann
Air Canada's hallucinated refund policy. Cursor's fabricated login explanation. Zomato's bot that refused to escalate. We recreate each failure and show how behavioral tests would catch it.