Physical Intelligence

Physical Intelligence is bringing general-purpose AI into the physical world. We are a group of engineers, scientists, roboticists, and company builders developing foundation models and learning…

xn--1xa.com

HQ: Remote

Team Size: 191

Open Jobs: Unknown

Total Funding: -

Latest Fundraise: Unknown

TL;DR

Sector: Embodied AI / Robotics

Headquarters: San Francisco, CA

Notable founder: Sergey Levine

Total reported funding: Just over $1B (reported)

Prominent investors: Bond, Jeff Bezos, Khosla, Lux, OpenAI, Redpoint, Sequoia, Thrive

Company Overview

Problem Domain

General-purpose embodied AI and robotics: enabling robots to understand language and vision and translate that into physical actions across novel tasks.

Founded

2024

Industry

Research Services

Funding Track Record

- 2024

$470M

2024 figure listed in a Tracxn report.

$400M

Reported $400M financing round (headline coverage) valuing the company at about $2B.

- 2026

Just over $1B (cumulative)

Reported cumulative funding; the company was also reported to be in talks to raise an additional ~$1B.

Investor Signal

“Bond, Jeff Bezos, Khosla Ventures, Lux Capital, OpenAI (Startup Fund), Redpoint Ventures, Sequoia Capital, Thrive Capital”

Join the Team

Machine Learning Infrastructure Engineer

On-Site • San Francisco Bay Area, US

Related Companies

Company	HQ	Industry	Total Funding
Maven Robotics	🇺🇸 Santa Clara, US	Data and Analytics, DeepTech, Hardware, Software	-
Archetype AI	🇺🇸 Palo Alto, US	Data and Analytics, DeepTech, Information Technology, Software	$48M
UniversalAGI	🇺🇸 San Francisco, US	Data and Analytics, DeepTech, Education	-
Flexion Robotics	🇨🇭 Zürich, CH	DeepTech	$57M
SpAItial AI	🇬🇧 London, GB	Information Technology, Software	$13M

Who you are

We’re intentionally flexible on exact background, but strong candidates usually have:
Strong software engineering fundamentals
Experience building or operating job scheduling / resource management systems at scale
Experience with large-scale compute clusters (GPU and/or TPU)
Familiarity with schedulers and orchestration systems (SLURM, Kubernetes, GKE, K3s, or internal equivalents)
Comfort reasoning about resource allocation, bin-packing, priority scheduling, and multi-tenancy
Understanding of how ML training workloads behave — long-running, multi-node, sensitive to stragglers, topology-dependent
A bias toward owning systems end-to-end, from design to operation
Enjoyment of working closely with researchers and unblocking fast-moving projects
Experience building multi-cluster or federated scheduling systems
Experience with TPU infrastructure (GCP TPU slices, Multislice, GKE)
Background in cluster resource managers (Borg, YARN, Mesos, or custom schedulers)
Linux systems engineering, networking, and infrastructure-as-code
NCCL/collective communication and topology-aware placement
Experience with capacity planning and cloud cost optimization at scale
Familiarity with JAX, PyTorch, or similar ML frameworks at the runtime/systems level

What the job involves

Physical Intelligence builds general-purpose AI for the physical world. Training our models requires orchestrating thousands of accelerators across a heterogeneous fleet of GPU and TPU clusters spanning different hardware generations, cloud providers, and cluster topologies.

Today, researchers often need to know which cluster to target, what resources are available, and how to configure their jobs accordingly. That doesn't scale. We need a scheduling and compute layer that makes the right placement decision automatically, routing jobs to the best cluster based on availability, hardware fit, cost, and priority, so researchers can focus entirely on the science.

This role owns that problem end-to-end: the scheduling systems, the placement logic, the cluster management layer, and the operational tooling that keeps it all running.

This is not cloud DevOps. It's not about standing up clusters and walking away. It's a systems role for people who care about intelligent resource allocation, utilization, fault tolerance, and making large-scale distributed training seamless.

The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast.

You will work closely with ML Infra (training systems), data platform, and research teams to ensure compute scheduling is never the bottleneck.
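
To make the automatic placement decision described above more concrete, here is a minimal illustrative sketch in Python. It is not Physical Intelligence's system: the `Cluster` and `Job` types, their field names, and the "cheapest cluster that fits, then tightest fit" rule are all assumptions chosen for illustration, and the sketch ignores priority and topology entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cluster:
    # All fields are hypothetical; a real fleet would track far more state.
    name: str
    accelerator: str          # e.g. "gpu" or "tpu"
    free_accelerators: int    # accelerators currently available
    cost_per_hour: float      # assumed cost per accelerator-hour


@dataclass(frozen=True)
class Job:
    name: str
    accelerator: str          # hardware type the job requires
    num_accelerators: int     # accelerators the job requests


def place(job: Job, clusters: List[Cluster]) -> Optional[Cluster]:
    """Pick a cluster for the job, or return None if nothing fits."""
    # Hardware fit and availability: keep only clusters of the right
    # accelerator type with enough free capacity.
    candidates = [
        c for c in clusters
        if c.accelerator == job.accelerator
        and c.free_accelerators >= job.num_accelerators
    ]
    if not candidates:
        return None
    # Cost first; break ties with the tightest fit to limit fragmentation.
    return min(
        candidates,
        key=lambda c: (c.cost_per_hour,
                       c.free_accelerators - job.num_accelerators),
    )
```

With this sketch, a researcher-facing submit call would only need to describe the job; the cluster choice stays inside `place`.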

Own Intelligent Job Scheduling and Placement: Design and build multi-tenant scheduling systems that automatically place training jobs on the best available cluster based on hardware requirements, topology, availability, cost, and priority. Support fair resource sharing across teams and projects with quota management, priority tiers, and preemption policies. Abstract away cluster differences so researchers submit jobs without needing to know where they will land.

Scale Multi-cluster Orchestration: Build the control plane that manages the job lifecycle across diverse clusters (mixed GPU/TPU, multi-generation hardware, on-prem/cloud) and enables seamless job migration, failover, and re-scheduling.

Optimize Accelerator Utilization and Efficiency: Monitor and optimize GPU/TPU utilization across the entire fleet. Implement priority, preemption, queueing, and fairness policies that balance research velocity with cost efficiency.

Ensure Scaling and Stability: Implement fault detection, automatic recovery, and resilience for long-running multi-node training jobs. Manage health checking, node management, and scaling to thousands of accelerators.

Support Inference and Robot Deployment: Extend scheduling and orchestration to inference workloads, including deploying models to edge devices on physical robots.

Enhance Observability and Developer Experience: Build the dashboards, alerting, SLOs, and debugging tools necessary for researchers to understand job status and for the team to ensure high scheduling quality and cluster reliability.
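
As a rough illustration of the priority and preemption policies mentioned in the responsibilities above, the following Python sketch picks which running jobs to preempt so that a higher-priority job can start. The `RunningJob` type, the function name, and the "lowest priority first" rule are assumptions made for this example; the posting does not describe the actual policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunningJob:
    # Hypothetical record of a job currently holding accelerators.
    name: str
    priority: int       # higher number = more important (assumed convention)
    accelerators: int   # accelerators the job currently occupies


def victims_to_preempt(
    running: List[RunningJob],
    needed: int,
    incoming_priority: int,
    free: int,
) -> Optional[List[str]]:
    """Return names of jobs to preempt, [] if none are needed,
    or None if the incoming job cannot fit even after preemption."""
    if free >= needed:
        return []
    # Only strictly lower-priority jobs are eligible. Try the lowest
    # priority first, preferring larger jobs within the same priority.
    eligible = sorted(
        (j for j in running if j.priority < incoming_priority),
        key=lambda j: (j.priority, -j.accelerators),
    )
    victims: List[str] = []
    for job in eligible:
        victims.append(job.name)
        free += job.accelerators
        if free >= needed:
            return victims
    return None
```

A production policy would also need to account for quotas, checkpoint cost, and fairness across teams, which this sketch leaves out.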