AI & ML · AI infrastructure / MLOps / evals

Judgment Labs

The continuous-improvement stack for agents.

Improving agents from production data. Co-founder, CEO of @JudgmentLabs

California, USA · 1.2K followers

TLVC Rating

Great launch that got really good traction. The animations and explanation were on point.

About

Judgment Labs is launching a platform for AI engineering teams that run agents in production and need a structured way to find, diagnose, and fix misbehavior. The company builds infrastructure for agent behavior monitoring: it captures traces and behavioral signals from live agent runs, giving teams continuous visibility into how agents act, evolve, and fail in real scenarios. The aim is to move teams past the familiar loop of a customer screenshot landing in Slack with no sense of how widespread the underlying issue actually is.

The product centers on closing the loop between observation and improvement. Search agents crawl production traffic to surface recurring failure patterns, dashboards let teams scope issues to specific customers or task types, and offline jobs help validate fixes before they ship. Teams can curate golden datasets from production data and define scorers that reflect real outcomes, rather than relying on brittle LLM judges that break in production. The open-source core, Judgeval, gives developers frameworks to test and refine AI agents. It also includes production monitoring that runs custom scorers in a hosted, virtualized secure container to flag agent behaviors online, with Slack alerts for failures and custom hooks to address regressions before they reach users.
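To make the scorer idea concrete, here is a minimal sketch in plain Python of an outcome-based scorer applied to production traces and scoped by customer. It does not use the Judgeval API. The Trace fields, outcome_scorer, failure_rate_by, and flag_regressions are hypothetical names chosen for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Trace:
    """Hypothetical record of one agent run captured from production."""
    trace_id: str
    customer: str
    task_type: str
    resolved: bool  # assumed real-world outcome, e.g. ticket closed without escalation


def outcome_scorer(trace: Trace) -> float:
    """Score a trace by its observed outcome instead of an LLM judge's opinion."""
    return 1.0 if trace.resolved else 0.0


def failure_rate_by(
    traces: Iterable[Trace],
    key: Callable[[Trace], str],
    scorer: Callable[[Trace], float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Group traces by key and return the share scoring below threshold per group."""
    groups: Dict[str, List[bool]] = {}
    for trace in traces:
        groups.setdefault(key(trace), []).append(scorer(trace) < threshold)
    return {name: sum(fails) / len(fails) for name, fails in groups.items()}


def flag_regressions(rates: Dict[str, float], max_failure_rate: float) -> List[str]:
    """Return the groups whose failure rate exceeds the allowed maximum."""
    return sorted(name for name, rate in rates.items() if rate > max_failure_rate)


# Invented sample data standing in for captured production traces.
traces = [
    Trace("t1", "acme", "refund", resolved=True),
    Trace("t2", "acme", "refund", resolved=False),
    Trace("t3", "globex", "lookup", resolved=True),
    Trace("t4", "globex", "lookup", resolved=True),
]

rates = failure_rate_by(traces, key=lambda t: t.customer, scorer=outcome_scorer)
print(rates)
print(flag_regressions(rates, max_failure_rate=0.25))
```

Swapping the key function for one that returns task_type would scope the same failure rates by task instead of by customer. In a real deployment the flagged groups would feed an alerting hook rather than a print call.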

Judgment Labs was founded in 2025 by Alex Shan, Andrew Li, and Joseph Sripramong Camyre, with Shan as CEO and a background at the Stanford AI Lab. The launch coincides with a $32M funding announcement from the team. It arrives as more startups push agents into revenue-generating workflows, where the volume of production traces has outstripped what ad hoc evals and manual triage can handle. For founders and operators evaluating the agent observability stack, this launch is worth a look alongside existing eval and tracing tools, particularly if open-source tooling and continuous-improvement workflows matter to the team.

Tags

500K-1M · Explainer · Series A · B2B · Global · US · Funding announcement · Founder-led

Comments (14)

Priya Deshmukh · 5/12/2026

Production data as the training signal makes way more sense than synthetic eval sets that nobody actually trusts. Curious how you handle PII redaction at ingest.

Tomáš Kovač · 5/12/2026

Every agent infra startup says 'continuous improvement' but the loop is usually one engineer staring at logs at 2am. What's actually automated here?

Lena B. · 5/12/2026

ok wait, agents generating their own training data from prod is the actual unlock. Eval datasets have been the bottleneck for two years.

raven · 5/12/2026

The tweet thread structure is doing a lot of heavy lifting here, but the hook sentence is buried under the funding number. Lead with the agent behavior insight next time.

Jaehyun Ko · 5/12/2026

Procurement question: SOC2 Type II in place? Most of our agent pilots are stuck in legal because vendors don't have the paperwork.

Carlo Pirelli · 5/12/2026

Per-trace pricing or seat-based? Agent volume is wildly unpredictable and finance teams hate metered surprises after the first quarter.

Mira V. · 5/12/2026

Building in the agent observability lane too and genuinely glad someone is making 'evals from prod' the headline. We've been screaming about it in customer calls all year.

Nadia Zubareva · 5/12/2026

Retention is the only number that matters for infra at this stage. Devs install ten of these in a sprint and uninstall nine by Friday.

Sergio Albarrán · 5/12/2026

Interesting. What does net dollar retention look like when an agent customer churns their underlying LLM provider?

Ifeanyi Okafor · 5/12/2026

How small is the team that shipped the full stack? Asking because I want to know if I should feel bad about my weekend project.

Deepak Raghavan · 5/12/2026

We have a strong infra eng candidate currently at a foundation model lab looking at agent-adjacent roles, are you hiring beyond the founding team?

Rivka Solomon · 5/12/2026

First 10 seconds of the launch video do not tell me what Judgment actually does, I had to scroll to the pinned demo. Bury the funding, lead with the product moment.

Hoàng Minh Lê · 5/12/2026

Hot take: 'production data' is the new 'data flywheel' and in 18 months half these companies will pivot to being a logging SDK with a dashboard.

Anouk T. · 5/12/2026

Saw three eval startups launch this month and yours has the cleanest positioning by a wide margin. The 'continuous improvement' framing actually means something instead of just 'we have a dashboard'.