
Protege is an AI training data platform that connects AI developers with data holders. For AI developers, Protege offers a large collection of high-quality training data across many modalities and verticals, with a procurement process that cuts acquisition time by 90% or more. The data is thoughtfully sourced and ethically managed. For data holders, Protege provides access to AI developers ranging from startups to large tech companies, governance control over privacy and IP, and a low-friction platform for sharing data. The platform supports fast data exchange, and Protege brings expertise in determining data value and ensuring fair compensation. A network of AI tech companies already uses the platform, which emphasizes data source centricity with expert support.

What they do: AI training-data platform that connects AI developers with data holders and curates rights-protected multimodal datasets
Founded: 2024
Headquarters / Focus: New York City; initial vertical focus includes healthcare and media
Recent funding: Multiple rounds including $10M seed (Sep 2024) and a $25M Series A (Aug 2025); a later $30M Series A extension announced Jan 2026
Data infrastructure for AI training — sourcing, curating, and transacting high-quality, rights-cleared training data across verticals (notably healthcare and media).
Category: Data Infrastructure and Analytics
Seed round (Sep 2024): $10,000,000. Participants included SV Angel, Liquid 2 Ventures, Bloomberg Beta, Flex Capital, Adam D'Angelo and others
Series A (Aug 2025): $25,000,000
Series A extension (Jan 2026): $30,000,000. Described as an extension expanding the prior Series A. Includes participation from CRV, Footwork, Andreessen Horowitz (a16z), Bloomberg Beta, Flex Capital, SV Angel, Liquid 2 Ventures, Adam D'Angelo, Shaper Capital, Travis May, and others
Company Overview: We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time-intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.
Solving AI’s data problem is a generational opportunity. We’re backed by world-class investors and already powering partnerships with some of the most ambitious teams in AI. The company that succeeds will be one of the largest in AI — and in tech.
We’re a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.
We’re hiring a product manager to sit at the center of Protege’s research and innovation engine.
This role exists to translate cutting-edge AI research and experimentation into scalable product capabilities — ensuring that the tools, workflows, and systems our Data Lab uses are aligned with how modern AI models are actually trained, evaluated, and deployed.
You will work closely with research scientists, applied ML engineers, and product teams to:
This is a role for someone who understands frontier AI deeply, but chooses to apply that understanding through product judgment rather than research authorship.
What You’ll Do
Productize Frontier AI Workflows
Build Tools That Reflect How AI Is Actually Built
Lead product discovery and execution for internal tools that support modern AI development:
dataset versioning
evaluation pipelines
annotation and human-in-the-loop workflows
experiment tracking and reproducibility
Ensure tooling reflects real-world frontier practices, not academic abstractions
Be a Bridge Between Research and Product
Exercise Strong Product Judgment
Measure Impact, Not Activity
Define success metrics tied to:
experiment cycle time
researcher productivity
adoption of internal tools
downstream impact on customer data products
Use qualitative and quantitative feedback to continuously iterate
Who You Are
Deeply Fluent in Modern AI
You have hands-on or adjacent experience with how frontier AI models are built today — including large-scale training, fine-tuning, evaluation, and data iteration
You understand concepts like:
training data quality vs quantity tradeoffs
evaluation benchmarks vs real-world performance
human feedback loops
multimodal data challenges
You can have credible conversations with PhD-level researchers and senior ML engineers
A Product Thinker, Not a Researcher
Experienced Product Manager
5+ years of product management experience, ideally in:
AI/ML platforms
developer tools
data infrastructure
or internal research tooling
Strong experience working with highly technical stakeholders
Proven ability to lead ambiguous, zero-to-one initiatives
Collaborative and High-Agency
Nice to Have
Why Protege