Wayve

We're Wayve, a leading developer of embodied intelligence for autonomous vehicles. We use AI to pioneer a next-generation approach to self-driving: AV2.0, which enables fleet operators to unlock the…

wayve.ai

HQ: GB

Team Size: 895

Open Jobs: Unknown

Total Funding: $3B

Latest Fundraise: Unknown

TL;DR

Founded: 2017

Headquarters: London, UK

Core product: End-to-end, mapless Wayve AI Driver for embodied vehicle autonomy

Recent fundraising: Multiple large rounds, including a $1.05B Series C (May 2024) and a subsequent ~$1.2B raise (Feb 2026)

Notable investors: SoftBank, Nvidia, Microsoft, Uber, automakers (e.g., Mercedes‑Benz, Nissan, Stellantis)

Company Overview

Problem Domain

Autonomous driving / embodied vehicle intelligence

Founded

2017

Industry

Software Development

Funding Track Record

Round	Date	Amount	Notes
Series B	-	$200,000,000	Reported $200M (extension) cited in company press
Series C	2024-05-06	$1,050,000,000	Reported Series C led by SoftBank
-	2026-02-24	$1,200,000,000	Reported later raise (~$1.2B) with participation from Nvidia, Microsoft, Uber and automakers; reports indicate possible additional contingent capital

Investor Signal

“Mixed strategic and financial backing including major technology firms, semiconductor partners, automakers and institutional investors (examples: SoftBank, Nvidia, Microsoft, Eclipse, Balderton, Uber, Mercedes‑Benz, Nissan, Stellantis, AMD, Arm, Qualcomm, Baillie Gifford)”

Founders

What we do

Join the Team

Staff Cloud Site Reliability Engineer

On-Site • London, GB

Related Companies

Company	HQ	Industry	Total Funding
PlusAI	🇺🇸 Santa Clara, US	Data and Analytics, DeepTech, Information Technology, Software	$520M
Boam AI	🇺🇸 US	Data and Analytics, DeepTech, Food, Information Technology, Software	-
Pittsburgh Robotics Network	🇺🇸 Pittsburgh, US	Community and Lifestyle, Government and Military	$750K
Bot Auto	🇺🇸 Houston, US	Transportation	$20M
Waabi	🇨🇦 Toronto, CA	DeepTech, Transportation	$283M

Who you are

To set you up for success as a Cloud Site Reliability Engineer at Wayve, we’re looking for the following skills and experience:
Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems
Strong Kubernetes experience, including operating production clusters
Hands-on experience running production workloads in AWS, GCP, or Azure
Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads
Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred
Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation
Deep troubleshooting skills across networking, storage, distributed systems, and performance at scale
Experience designing and operating observability stacks (e.g. Datadog, Prometheus, Grafana, OpenTelemetry)
Clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements
Experience operating GPU-backed environments or large-scale ML infrastructure
Experience running model training or inference pipelines in production (MLOps)
Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments
Experience defining and running SLOs/SLIs and building reliability programs across multiple teams
Experience as an early or founding SRE hire establishing processes from scratch
Interest in helping shape and grow a Cloud SRE function, with potential to take on leadership responsibilities over time

What the job involves

Benefits

Private healthcare: Choose our optional health insurance for comprehensive coverage for you and your family.
Paid time off: Paid vacation plus public holidays and additional leave programs, ensuring you have time to unwind.
Mental health resources: Through Spill, you can access therapy and mental health support.
Community and socials: Join clubs or attend team socials to connect over hobbies, sports, or just for fun.
Competitive compensation: Our compensation package includes cash and equity, making you a true partner in our success.
Learning and development: Budgets for books, courses, and company-wide training to support your continuous growth.

As a Cloud Site Reliability Engineer at Wayve, you will build and scale the reliability foundations of our AI cloud platform. This includes our Model Development Platform (powering end-to-end model development from raw data to on-road experimentation) and our GPU Compute platform (large-scale, multi-tenant GPU fleets and scheduling systems driving model training and inference at scale).

This is a founding Cloud SRE role. You won’t inherit a mature SRE function; you’ll help create it. You will define the frameworks, automation, and operational standards that ensure our model development infrastructure, distributed systems, and large compute clusters operate predictably, efficiently, and at scale.

This role sits at the intersection of AI research, large-scale cloud infrastructure, and production operations. Your work will directly enable faster model training, reliable experimentation, and scalable AI deployment by ensuring our cloud infrastructure is resilient and performant.

Reliability & Platform Ownership

Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments

Define and operationalise SLOs, SLIs, and error budgets across platform services

Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters

Partner with ML, platform, and software teams to establish clear production readiness standards

Incident Response & On-Call

Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents

Lead incident triage, escalation, communications, and root cause analysis

Translate post-incident learning into durable architectural or automation improvements

Continuously reduce alert noise and recurring operational burden

Observability & Operational Excellence

Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery

Build dashboards that reflect real user-centric platform health (not just infrastructure metrics)

Improve deployment safety through better change management, validation, and rollback mechanisms

Automation & Tooling

Build automation for cluster operations, training workflows, remediation, and scaling tasks

Implement self-healing patterns and resilient recovery workflows

Harden CI/CD and release processes to improve deployment safety and velocity

Support infrastructure-as-code and policy-driven guardrails to ensure secure, reliable cloud environments