Staff Cloud Site Reliability Engineer | Wayve · Teeming.ai
Wayve
We're Wayve, a leading developer of embodied intelligence for autonomous vehicles. We use AI to pioneer a next-generation approach to self-driving: AV2.0, which enables fleet operators to unlock the…
Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements
Experience operating GPU-backed environments or large-scale ML infrastructure
Experience running model training or inference pipelines in production (MLOps)
Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments
Experience defining and running SLOs/SLIs and building reliability programs across multiple teams
Experience as an early or founding SRE hire establishing processes from scratch
Interest in helping shape and grow a Cloud SRE function, with potential to take on leadership responsibilities over time
What the job involves
Benefits
Private healthcare: Choose our optional health insurance for comprehensive coverage for you and your family.
Paid time off: Paid vacation plus public holidays and additional leave programs, ensuring you have time to unwind.
Mental health resources: Through Spill, you can access therapy and mental health support.
Community and socials: Join clubs or attend team socials to connect over hobbies, sports, or just for fun.
Competitive compensation: Our compensation package includes cash and equity, making you a true partner in our success.
Learning and development: Budgets for books, courses, and company-wide training to support your continuous growth.
As a Cloud Site Reliability Engineer at Wayve, you will build and scale the reliability foundations of our AI cloud platform. This includes our Model Development Platform (powering end-to-end model development from raw data to on-road experimentation) and our GPU Compute platform (large-scale, multi-tenant GPU fleets and scheduling systems driving model training and inference at scale).
This is a founding Cloud SRE role. You won’t inherit a mature SRE function; you’ll help create it. You will define the frameworks, automation, and operational standards that ensure our model development infrastructure, distributed systems, and large compute clusters operate predictably, efficiently, and at scale.
This role sits at the intersection of AI research, large-scale cloud infrastructure, and production operations. Your work will directly enable faster model training, reliable experimentation, and scalable AI deployment by ensuring our cloud infrastructure is resilient and performant.
Reliability & Platform Ownership
Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments
Define and operationalise SLOs, SLIs, and error budgets across platform services
Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters
Partner with ML, platform, and software teams to establish clear production readiness standards
Incident Response & On-Call
Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents
Lead incident triage, escalation, communications, and root cause analysis
Translate post-incident learning into durable architectural or automation improvements
Continuously reduce alert noise and recurring operational burden
Observability & Operational Excellence
Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery
Build dashboards that reflect real user-centric platform health (not just infrastructure metrics)
Improve deployment safety through better change management, validation, and rollback mechanisms
Automation & Tooling
Build automation for cluster operations, training workflows, remediation, and scaling tasks
Implement self-healing patterns and resilient recovery workflows
Harden CI/CD and release processes to improve deployment safety and velocity
Support infrastructure-as-code and policy-driven guardrails to ensure secure, reliable cloud environments