
Judgment Labs provides the first post-building layer for AI agents, enabling developers to unit test and monitor their agents with traces, evaluations, and tool telemetry.

Judgment Labs builds infrastructure for Agent Behavior Monitoring (ABM). While traditional observability focuses on logging exceptions and latency, our ABM surfaces behavioral anomalies such as instruction drift and context-retrieval loss in scaled production environments.
Hundreds of teams building autonomous agents rely on Judgment to understand how their systems are behaving post-deployment. Instead of reactive incident triage, they cluster patterns across conversations and workflows, correlate regressions to specific interaction types, and pinpoint where reliability breaks down in their usage context.
We’ve raised $30M+ across two rounds in the past five months. Our investors include Lightspeed, SV Angel, Valor Equity Partners, Nova Global, Chris Manning, Michael Ovitz, Michael Abbott, Cory Levy, Kevin Hartz, and others.
The Role:
Forward Deployed Engineers at Judgment Labs instrument our agent behavior monitoring (ABM) infrastructure directly into customer production systems. You act as a trusted partner in agent reliability — working inside live codebases, analyzing traces from real-world usage, and diagnosing failures in running environments while integrating monitoring and evaluation into mission-critical agent workflows. This is deep technical work: you need to move fast in unfamiliar stacks, form accurate hypotheses from incomplete data, and ship instrumentation that holds up under production load.
Most days look like this: you go on-site and instrument our SDK in a new customer's codebase in the morning, spend the afternoon analyzing trace data to surface failure clusters, and close out with a stakeholder check-in where you translate what you found into something the Head of AI can act on. You're running 2–3 of these deployments simultaneously — each at a different stage, each with a different team on the other side. You define what "quality" means for each customer's domain, and then you make it measurable.
You'll be at the forefront of Judgment, interacting daily with enterprise customers alongside our GTM, product, and research teams — reasoning about agent behavior, translating high-level goals into concrete ABM deployments, and owning outcomes end-to-end across real production environments. The customers you'll work with are AI-native startups. Their engineers have opinions, their infra teams have constraints, and their ops and product leads want to know why Judgment matters to them specifically. You figure that out fast and make it land. The scope, autonomy, and 0→1 execution this role demands make it a proving ground for people who want to build or lead a technical company.
What You'll Do:
Tracing & Deployment
Evals, Behaviors & Judges
Customer & Deal Management
What Great Looks Like:
Who You Are:
Why Judgment?