
Protege is an AI training data platform that connects AI developers with data holders. For AI developers, Protege offers a vast collection of high-quality training data across numerous modalities and…

What they do: AI training-data platform that curates, licenses, and delivers real-world multimodal datasets for model development
Founded: 2024
Headquarters: New York (per company profiles)
Recent funding: $30M Series A extension led by Andreessen Horowitz (Jan 2026); prior Series A $25M (Aug 2025); $10M seed (Sep 2024)
Founders / leadership: Bobby Samuels (CEO & co-founder); Travis May (Chairman & co-founder)
Bridging the gap between data holders and AI developers by enabling compliant, high-quality dataset sourcing and licensing for model training and evaluation.
Industry: Data Infrastructure and Analytics
Funding rounds:
Seed (Sep 2024): $10,000,000, with participation from SV Angel, Liquid 2 Ventures, Bloomberg Beta, Flex Capital, Adam D'Angelo, Travis May, and others
Series A (Aug 2025): $25,000,000
Series A extension (Jan 2026): $30,000,000, expanding the August 2025 Series A and bringing cumulative funding to $65M since founding
Investors: includes participation from prominent investors such as CRV, Footwork, Andreessen Horowitz, Bloomberg Beta, Flex Capital, SV Angel, Liquid 2 Ventures, Adam D'Angelo, Travis May, and others
Company Overview
We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time-intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.
Solving AI’s data problem is a generational opportunity. We’re backed by world-class investors and already powering partnerships with some of the most ambitious teams in AI. The company that succeeds will be one of the largest in AI — and in tech.
We’re a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.
Role Overview
We are hiring a Solutions Applied Data Scientist to help design, construct, and validate complex healthcare data cohorts used for AI model training. This role sits within the delivery organization, working closely with Solutions Leads and delivery engineers to solve complex data challenges that arise during customer projects.
The ideal candidate enjoys solving messy real-world data problems, working directly with large healthcare datasets, writing complex SQL, and collaborating closely with cross-functional teams. Our environment has a lot going on as we grow, so we're looking for someone energized by the fast pace of the industry and our company!
What You'll Do
Technical Escalation & Delivery Collaboration
During delivery projects, Solutions Leads may encounter complex data challenges that require deeper analysis or technical problem-solving. You will act as a technical partner, helping solve things such as:
You will work collaboratively with Solutions Leads to unblock delivery challenges while keeping projects moving toward successful completion.
When solutions require infrastructure or pipeline changes, you will partner with the Solutions Engineer and internal platform engineering teams to implement the required workflows.
Cohort Definition & Dataset Construction
Work with Solutions Leads to translate customer requirements into concrete dataset logic. You will help ensure that datasets accurately represent the intended population and meet customer specifications.
Responsibilities include:
Data Quality Validation & Completeness Analysis
Before complex datasets are delivered to customers, you will work closely with Solutions Leads to validate that they meet required standards and agreed acceptance criteria. You will also review bespoke QA methodology and suggest platform improvements to Product and Engineering to decrease custom work across engagements.
Responsibilities include:
Data Feasibility
Many customer projects involve AI researchers who are defining the healthcare datasets required to train or evaluate models. You will work with these customer teams to translate research goals into practical dataset specifications.
Responsibilities include:
This role requires someone who is comfortable engaging with technically sophisticated stakeholders while grounding conversations in the realities of messy, real-world data.
Data Partner & Source Data Analysis
Many datasets originate from external healthcare data partners.
You will help analyze partner datasets to:
This work helps ensure that projects are grounded in what data actually exists.
Delivery Tooling & Workflow Improvements
As delivery patterns emerge, you will help develop tools and reusable workflows that improve efficiency.
Examples include:
This role is an important bridge between manual dataset delivery and scalable data infrastructure.
What Success Looks Like
30 days: Learn the delivery motion and source-data reality
Build working knowledge of Solutions workflows, healthcare data partners, common cohort patterns, and how complex requests get escalated. Shadow active projects, understand existing QA approaches, and start contributing to scoped feasibility and validation work.
60 days: Own scoped technical escalations and create early leverage
Independently support complex cohort-definition and dataset-construction work, write and validate SQL / Python workflows, and help Solutions Leads answer hard feasibility questions with clear tradeoffs.
90 days: Become a trusted technical partner across delivery
Handle the hardest dataset problems with limited oversight, improve QA and repeatability, and propose workflow or platform improvements that reduce bespoke work across engagements.
What You Bring
Protege Values