We enable next-generation cooling and automation for data- and energy-intensive environments by integrating our pristine, highly efficient, and sustainable technologies. Solving the challenges of today and…
Familiarity with telemetry protocols such as gNMI, SNMP, and streaming telemetry
Experience integrating network and system telemetry into centralized monitoring platforms
Strong data analysis capabilities
Ability to interpret complex telemetry signals and translate them into actionable insights
Ability to diagnose performance issues across distributed systems
What the job involves
Benefits
Medical Insurance Plan
401(k) plan with voluntary employee contributions
A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach
You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution
Start: ASAP
Type of Contract: Permanent, full-time
Mission: Design and build the observability platform that powers visibility, reliability, and performance insights for large-scale GPU cloud infrastructure as well as smaller edge deployments
This role is responsible for designing and implementing key parts of the observability architecture across the platform, enabling engineering, operations, and customers to understand system behavior in real time across distributed AI workloads, GPU clusters, networking fabrics, storage systems, and edge inference environments
You will design and operate low-latency, high-scale telemetry pipelines that collect, process, and analyze metrics, logs, and traces from infrastructure running across core datacenter clusters and smaller edge deployments
The platform you build will support internal operations, automated reliability mechanisms, and customer-facing observability experiences
As a senior engineer, you will lead delivery of major observability initiatives, contribute to the evolution of telemetry standards and SLO implementation, and work with other teams to ensure observability is effectively integrated into the platform architecture from infrastructure to application layers
You will collaborate closely with infrastructure, networking, storage, and platform engineering teams to provide clear visibility into performance bottlenecks, infrastructure degradation, and distributed workload behavior across both hyperscale GPU environments and smaller edge installations
This role contributes directly to improving platform reliability by analyzing production telemetry, identifying systemic issues, and driving improvements in performance, efficiency, and operational stability across the stack
Design and implement scalable telemetry pipelines for metrics, logs, and traces across distributed GPU infrastructure
Architect observability systems capable of ingesting high-cardinality telemetry from thousands of nodes and services
Build and operate telemetry storage systems optimized for large-scale time-series and event data
Contribute to observability standards across services, including metrics, tracing instrumentation, logging, and SLO implementation
Build visibility across compute, storage, and networking layers of the platform
Instrument GPU clusters, inference workloads, and distributed training environments
Detect infrastructure degradation such as GPU throttling, network congestion, storage latency, and hardware degradation
Implement telemetry pipelines for GPU, CPU, network, and storage performance metrics
Build dashboards and monitoring tools that expose system health and performance to both internal teams and customers
Provide insights into workload performance, including GPU utilization, storage throughput, network latency, and distributed inference performance
Develop performance analysis tools that help customers understand system bottlenecks
Develop and maintain network observability platforms
Build telemetry collectors and exporters using Python or Go
Ingest telemetry from infrastructure components, including NVIDIA Cumulus Linux, VyOS routers, and Citrix NetScaler / WAF
Design telemetry ingestion pipelines using protocols such as gNMI, SNMP, and streaming telemetry
Design advanced alerting and anomaly detection systems
Contribute to platform SLOs, SLIs, and reliability metrics
Build automated detection of infrastructure anomalies
Integrate observability signals with operational workflows and incident management systems
Participate in on-call rotations supporting platform observability and telemetry infrastructure
Partner with platform, networking, storage, and compute teams to instrument services
Work closely with operations teams to improve monitoring and incident response
Provide guidance and mentorship to engineers on observability best practices
Promote good observability practices across teams and help engineers adopt effective instrumentation and monitoring patterns
Technical Stack: Observability and telemetry technologies used across the platform include:
Observability Framework:
Prometheus
OpenTelemetry
Grafana
Distributed logging systems
High-scale telemetry databases, such as ClickHouse or similar
Hardware and Infrastructure Telemetry:
Redfish / BMC telemetry
IPMI
Linux system metrics
Hardware health monitoring and node lifecycle telemetry: