Location : Hybrid, Cologne (Rheinauhafen), 3 days in the office, 2 remote (Tue, Thu)
Team : Engineering, reports to CTO
Keep the world awake: build reliability at scale
ilert helps thousands of DevOps & IT teams detect, fix, and communicate incidents faster.
Our platform is mission-critical: customers rely on us 24/7 to keep their always-on businesses running.
As a Site Reliability Engineer at ilert, you'll own the reliability, performance, and scalability of our core platform across AWS, Kubernetes, Kafka, and more.
Tasks
Build & operate a highly available platform
- Run and evolve our AWS-based infrastructure
- Operate and optimize self-managed Kafka and ClickHouse clusters and our observability stack
- Ensure resilience, disaster recovery, and capacity planning across the stack
Improve reliability & performance
- Build and maintain SLOs, SLIs, error budgets, and observability dashboards
- Debug production issues across layers (networking, Kubernetes, application, DB)
- Improve performance of our ingestion pipeline
Automation & tooling
- Automate operations with Terraform, Helm, Kubernetes operators, and internal tooling
- Build tooling for safer deploys, blue/green rollouts, and automated verification
- Strengthen incident response workflows through deep collaboration with our AI SRE agent team
Security & compliance
- Implement best practices for workload isolation, secrets management, IAM, and auditability
- Support our ISO 27001 posture by automating controls and hardening our infrastructure
Cross-functional impact
- Partner with Backend, AI, and Product teams to design reliable services
- Participate in the on-call rotation
- Lead post-incident reviews and drive long-term reliability improvements
Requirements
- 3 years of experience as an SRE, Platform Engineer, DevOps Engineer, or Infrastructure Engineer
- Strong hands-on experience with AWS, Kubernetes, Linux internals, networking, and performance tuning
- Experience operating self-managed distributed systems, ideally Kafka or ClickHouse
- Strong understanding of observability
- Experience automating infrastructure with Terraform and CI/CD systems
- Fluent English (our working language); German optional
Benefits
- Product-centric - 100% focused on solving a mission-critical pain felt by every always-on business
- Hybrid freedom - 2 days remote by default; gorgeous Rheinauhafen roof terrace when you're in town
- Focus > meetings - We time-box syncs, favour async docs, and protect maker time
- 28 days off - plus public holidays
- Commute perks - subsidised public transport
Key Skills
Kubernetes, FMEA, Continuous Improvement, Elasticsearch, Go, Root Cause Analysis, Maximo, CMMS, Maintenance, Mechanical Engineering, Manufacturing, Troubleshooting
Employment Type : Employee
Experience : years
Vacancy : 1