Reliability Engineer Jobs in Berlin
- Site Reliability Engineer (SRE) Team Lead | 1GLOBAL | Berlin, BE, DE
- Site Reliability Engineer (SRE) Team Lead | Truphone | Berlin, Germany
- Site Reliability Engineer | GetYourGuide | Berlin, Berlin, Germany
- (Senior) Cloud Site Reliability Engineer (mfx) | Scalable GmbH | Berlin, Berlin, Germany
- Site Reliability Engineer (m / w / d) | Assecor GmbH | Berlin, Germany
- Junior Site Reliability Engineer | Shine | Berlin, Berlin, DE
- Site Reliability Engineer (w / m / d) | IONOS | Berlin
- Site Reliability Engineer (Go), Storage Platform | Wolt | Berlin, Berlin, Germany
- Site Reliability Engineer (w / m / d) | IONOS SE | Berlin, Germany
- (Senior) Site Reliability Engineer – Data and ML Platform | Trade Republic | Berlin, Berlin, Germany
- Site Reliability Engineer (m / f / d) | eGym | Berlin, Germany
- Site Reliability Engineer | Zattoo | Berlin, Germany
- Senior Site Reliability Engineer (mfd) | Redcare Pharmacy | Berlin, Berlin, Germany
- Site Reliability Engineer | Billie | Berlin
- Reliability Test Engineer | Daikin Industries | Berlin, BE, DE
- Senior / Staff Software Engineer - Platform and Reliability | Qdrant | Berlin, Germany
- Site Reliability Engineer | Third Republic | Berlin, Berlin, Germany (Sponsored)
- Senior Site Reliability Engineer - Infrastructure | N26 GmbH | Berlin, Berlin, DE
- Site Reliability Engineer SCI (fmd) | SAP | Berlin, Berlin, Germany
Site Reliability Engineer (SRE) Team Lead
1GLOBAL | Berlin, BE, DE
1GLOBAL is a technology-driven global mobile communications provider dedicated to empowering enterprises worldwide to unlock the full growth potential of mobile connectivity. With a best-in-class telecom technology platform, a comprehensive suite of globally viable regulatory licenses, and privileged access to the telecom wholesale market, 1GLOBAL is uniquely positioned to deliver seamless compliance and connectivity solutions. Serving the world’s leading banks, corporations, and digital-first businesses—including neo-banks, travel companies, and payment service providers—1GLOBAL connects over 43 million devices globally.
With 2024 full-year revenue exceeding US$100 million and on track to exceed US$200 million in FY25, 1GLOBAL is a profitable business generating significant cash flows to fund its ongoing investments in infrastructure, transformation, and growth. 2024 saw major client wins and marked 1GLOBAL’s evolution from a multi-market telecommunications provider to a global technology-driven mobile connectivity powerhouse.
Established in 2022 by experienced tech founders and entrepreneurs Hakan Koç and Pyrros Koussios, 1GLOBAL is a European technology leader driving digital transformation in the global telecommunications market. It operates as a fully regulated Mobile Virtual Network Operator (“MVNO”) in ten countries and as a regulated telecommunications operator in an additional 31 countries. Headquartered in the Netherlands, with world-class R&D hubs in Lisbon, Berlin, and São Paulo, 1GLOBAL employs over 450 experts across 15 countries.
Position Overview
We are looking for a talented Site Reliability Engineering (SRE) Team Lead to join our Technology Department.
We are open to hiring this role in Berlin, Germany.
As the SRE Team Lead, you will be responsible for ensuring the stability, scalability, and reliability of our global infrastructure and services across both cloud and on-prem environments.
You will lead a team of SREs focused on service availability, resilience, and operational excellence, driving a data-driven reliability culture based on SLIs, SLOs, and error budgets.
Your mission will be to proactively identify weaknesses across systems and improve reliability through redundancy testing, automation, and observability.
You will build tools and processes to automatically detect, prevent, and recover from incidents — ensuring our services remain reliable and performant for customers around the world.
This role collaborates closely with DevOps, Infrastructure, IP Network, and Security teams to maintain carrier-grade reliability standards across all layers of our platform.
Key Responsibilities
- Lead and mentor a team of Site Reliability Engineers, setting clear priorities, goals, and reliability metrics.
- Define, measure, and maintain SLIs and SLOs for core infrastructure and customer-facing services.
- Plan and execute redundancy and resilience testing across service, infrastructure, and networking layers — validating failover, HA configurations, and disaster recovery readiness.
- Design and implement automated recovery mechanisms, self-healing workflows, and intelligent alerting systems.
- Drive incident response, root-cause analysis, and blameless post-mortems, and ensure that the resulting corrective and preventive actions are implemented and tracked for continuous improvement.
- Develop and enhance observability (metrics, logs, traces) using Prometheus, Grafana, Loki, and OpenTelemetry.
- Collaborate with Infrastructure and DevOps teams to ensure deployment safety, rollback policies, and configuration consistency.
- Proactively identify weaknesses through fault-injection, load, and chaos testing.
- Continuously reduce operational toil through automation and reliability tooling.
- Establish on-call practices and improve alert quality, runbooks, escalation procedures, and incident management processes.
- Conduct capacity planning, performance benchmarking, and resilience audits across systems.
- Ensure compliance with security, reliability, and availability standards.
- Create and maintain internal documentation, playbooks, and operational guidelines for peers and users.
- Build and manage cloud cost-optimization frameworks, including reserved capacity planning, autoscaling design, storage tiering, workload right-sizing, and continuous anomaly detection.
Requirements
Must Have
Nice to Have
Benefits
Why 1GLOBAL?
1GLOBAL is an equal opportunity employer, and we value your character as much as your talent. Diversity drives our innovation, and we offer a collaborative, dynamic, and international work environment. We are excited for you to join our mission to revolutionise connectivity globally.