
Judgment Labs provides the first post-building layer for AI agents, enabling developers to unit test and monitor their agents with traces, evaluations, and tool telemetry.

Judgment Labs builds infrastructure for Agent Behavior Monitoring (ABM). While traditional observability focuses on logging exceptions and latency, our ABM surfaces behavioral anomalies such as instruction drift and context-retrieval loss in scaled production environments.
Hundreds of teams building autonomous agents rely on Judgment to understand how their systems are behaving post-deployment. Instead of reactive incident triage, they cluster patterns across conversations and workflows, correlate regressions to specific interaction types, and pinpoint where reliability breaks down in their usage context.
We’ve raised $30M+ across two rounds in the past five months. Our investors include Lightspeed, SV Angel, Valor Equity Partners, Nova Global, Chris Manning, Michael Ovitz, Michael Abbott, Cory Levy, Kevin Hartz, and others.
The Role:
Forward Deployed Engineers at Judgment Labs instrument our agent behavior monitoring (ABM) infrastructure directly into customer production systems. You act as a trusted partner in agent reliability — working inside live codebases, analyzing traces from real-world usage, and diagnosing failures in running environments while integrating monitoring and evaluation into mission-critical agent workflows. This is deep technical work: you need to move fast in unfamiliar stacks, form accurate hypotheses from incomplete data, and ship instrumentation that holds up under production load.
Most days look like this: you go on-site and instrument our SDK in a new customer's codebase in the morning, spend the afternoon analyzing trace data to surface failure clusters, and close out with a stakeholder check-in where you translate what you found into something the Head of AI can act on. You're running 2–3 of these deployments simultaneously — each at a different stage, each with a different team on the other side. You define what "quality" means for each customer's domain, and then you make it measurable.
You'll be at the forefront of Judgment, interacting daily with enterprise customers alongside our GTM, product, and research teams — reasoning about agent behavior, translating high-level goals into concrete ABM deployments, and owning outcomes end-to-end across real production environments. The customers you'll work with are AI-native startups. Their engineers have opinions, their infra teams have constraints, and their ops and product leads want to know why Judgment matters to them specifically. You figure that out fast and make it land. The scope, autonomy, and 0→1 execution this role demands make it a proving ground for people who want to build or lead a technical company.
What You'll Do:
Tracing & Deployment
Evals, Behaviors & Judges
Customer & Deal Management
What Great Looks Like:
Who You Are:
Why Judgment?