Submer

We enable next generation cooling and automation for data & energy-intense environments by integrating our pristine, highly-efficient & sustainable technologies. Solving the challenges of today and…

submer.com

HQ: ES

Team Size: 157

Open Jobs: 14

Total Funding: $110M

Latest Fundraise: Unknown

TL;DR

Headquarters: Barcelona (with North America HQ in Houston and R&D in Taipei)

Founded: 2015

Core product: Single-phase immersion cooling systems (SmartPod family, SmartPod EXO) for high-density and AI workloads

Employees: 157

Total funding: USD 109,650,000

Company Overview

Problem Domain

Data center cooling and infrastructure efficiency for high-density and AI compute workloads.

Founded

2015

Industry

IT Services and IT Consulting

Funding Track Record

Growth - October 2024

USD 55,500,000

Reported implied valuation of roughly $500M

Earlier equity rounds - circa 2020

USD 12,000,000

Investor Signal

“Attracted institutional growth capital and debt financing (notable investors include M&G Investments and Santander)”

Join the Team

Senior Observability & Telemetry Engineer

Remote • United Kingdom, Europe, GB

Related Companies

Company	HQ	Industry	Total Funding
TensorWave	🇺🇸 US	Data and Analytics, DeepTech, Hardware, Information Technology, Internet Services, Software	-
EXCELTIC	🇪🇸 Madrid, ES	Consumer Products, DeepTech, Information Technology, Professional Services	-
Nscale	🇬🇧 GB	Data and Analytics, DeepTech, Hardware, Information Technology, Internet Services, Software	$155M
Fluid2Chip	🌍 Remote	DeepTech	-
Perle	🇺🇸 US	Data and Analytics, DeepTech, Information Technology, Internet Services, Software	-

Who you are

Proven experience operating large distributed infrastructure platforms
Strong background in observability systems and telemetry pipelines
Experience building metrics, logging, tracing, alerting, and dashboards at production scale
Strong programming skills in Go, Python, or Rust
Experience with large-scale time-series data platforms
Experience with large-scale GPU cloud platforms, HPC environments, or AI infrastructure
Experience monitoring AI workloads such as training or inference clusters
Deep understanding of distributed systems observability
Familiarity with cloud-native infrastructure such as Kubernetes, automation, and CI/CD
Experience operating observability systems for high-performance or large-scale environments
Experience monitoring complex networking environments
Familiarity with telemetry protocols such as gNMI, SNMP, and streaming telemetry
Experience integrating network and system telemetry into centralized monitoring platforms
Strong data analysis capabilities
Ability to interpret complex telemetry signals and translate them into actionable insights
Ability to diagnose performance issues across distributed systems

Benefits

Medical Insurance Plan
401k Employee volunteer contribution Plan
A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach
You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution

What the job involves

Type of Contract: Permanent, full-time

Mission: Design and build the observability platform that powers visibility, reliability, and performance insights for large-scale GPU cloud infrastructure as well as smaller edge deployments

This role is responsible for designing and implementing key parts of the observability architecture across the platform, enabling engineering, operations, and customers to understand system behavior in real time across distributed AI workloads, GPU clusters, networking fabrics, storage systems, and edge inference environments

You will design and operate low-latency, high-scale telemetry pipelines that collect, process, and analyze metrics, logs, and traces from infrastructure running across core datacenter clusters and smaller edge deployments

The platform you build will support internal operations, automated reliability mechanisms, and customer-facing observability experiences

As a senior engineer, you will lead delivery of major observability initiatives, contribute to the evolution of telemetry standards and SLO implementation, and work with other teams to ensure observability is effectively integrated into the platform architecture from infrastructure to application layers

You will collaborate closely with infrastructure, networking, storage, and platform engineering teams to provide clear visibility into performance bottlenecks, infrastructure degradation, and distributed workload behavior across both hyperscale GPU environments and smaller edge installations

This role contributes directly to improving platform reliability by analyzing production telemetry, identifying systemic issues, and driving improvements in performance, efficiency, and operational stability across the stack

Design and implement scalable telemetry pipelines for metrics, logs, and traces across distributed GPU infrastructure

Architect observability systems capable of ingesting high-cardinality telemetry from thousands of nodes and services

Build and operate telemetry storage systems optimized for large-scale time-series and event data

Contribute to observability standards across services, including metrics, tracing instrumentation, logging, and SLO implementation

Build visibility across compute, storage, and networking layers of the platform

Instrument GPU clusters, inference workloads, and distributed training environments

Detect infrastructure degradation such as:

Hardware degradation

Implement telemetry pipelines for GPU, CPU, network, and storage performance metrics

Build dashboards and monitoring tools that expose system health and performance to both internal teams and customers

Provide insights into workload performance including:

Distributed inference performance

Develop performance analysis tools that help customers understand system bottlenecks

Develop and maintain network observability platforms

Build telemetry collectors and exporters using Python or Go

Ingest telemetry from infrastructure components including:

NVIDIA Cumulus Linux

Citrix NetScaler / WAF

Design telemetry ingestion pipelines using protocols such as:

Design advanced alerting and anomaly detection systems

Contribute to platform SLOs, SLIs, and reliability metrics

Build automated detection of infrastructure anomalies

Integrate observability signals with operational workflows and incident management systems

Participate in on-call rotations supporting platform observability and telemetry infrastructure

Partner with platform, networking, storage, and compute teams to instrument services

Work closely with operations teams to improve monitoring and incident response

Provide guidance and mentorship to engineers on observability best practices

Promote good observability practices across teams and help engineers adopt effective instrumentation and monitoring patterns

Technical Stack: Observability and telemetry technologies used across the platform include:

Observability Framework:

Distributed logging systems

High-scale telemetry databases, such as ClickHouse or similar

Hardware and Infrastructure Telemetry:

Redfish / BMC telemetry

Linux system metrics

Hardware health monitoring and node lifecycle telemetry:

NVIDIA GPU Telemetry:

NVIDIA GPU Operator telemetry stack

NVSwitch / NVLink telemetry

AI Workload Telemetry:

Distributed training telemetry

Inference latency and throughput metrics

NCCL communication health

GPU synchronization latency

KV-cache access latency for inference workloads

Dataset loading and storage I/O performance

Networking Telemetry:

gNMI streaming telemetry

Network flow telemetry

RDMA / RoCE performance monitoring