Talent.com

Reliability engineer Jobs in Berlin

Reliability engineer • berlin

Last updated: 12 hours ago

AI Infrastructure & Reliability Engineer

HiBob • Berlin, DE

AI Infrastructure & Reliability Engineer. HiBob helps modern, mid-size businesses transform the way they manage people, giving HR and managers all they need to connect, engage, develop, and retain t...

 • Sponsored

Reliability & Test Engineer

Dunia Innovations GmbH • Berlin, Berlin, DE

Make sure the system never lies, and rarely fails, no matter the complexity. Dunia is building AI-native, automated laboratories for materials discovery. Our systems combine hardware, software, chem...

Senior Site Reliability Engineer (SRE)

1GLOBAL • Berlin, BE, DE

Powered by a best-in-class telecom platform – including its own owned and operated global mobile core network, fully fledged in-house developed eSIM technology, and an extensive portfolio of teleco...

Senior Cloud Engineer

Arendt & Medernach • Biesdorf, DE

Arendt is your legal, tax and business services firm in Luxembourg. At Arendt we combine the entire value chain of services dedicated to asset managers, banks, insurers, public institutions, commerc...

 • Sponsored

Site Reliability Engineer (m/w/d)

Assecor GmbH • Berlin, Germany

Are you passionate about technology, a systematic thinker, and someone who acts with foresight? Do you love learning new things, tackle challenges proactively, and keep a cool...

Automation & Process Engineer / Commissioning Engineer (m/f/d)

Paul Wurth / Paul Wurth Geprolux • Biesdorf, DE

Automation & Process Engineer / Commissioning Engineer (m/f/d). Based in Luxembourg, you will integrate and reinforce our Pulverized Coal Injection department. Are you fascinated by technology driven...

 • Sponsored

Site Reliability Engineer

deepset GmbH • Berlin, Berlin, DE

You'll work across SaaS, private cloud, and on-prem environments to make our self-hosted platform production-ready, drive CI/CD and GitOps maturity, and reduce complexity at scale. Your work will di...

Site Reliability Engineer (m/f/d)

Solactive AG • Berlin, Germany

Since its foundation in 2007, Solactive AG has evolved into one of the world’s most important and fastest-growing index providers. From our headquarters in Frankfurt, we power global investment prod...

Structural Engineer

Gradel Groupe SA • Biesdorf, DE

Join GRADEL LW and Shape the Future of Lightweight Innovation. GRADEL Group, a medium-sized company with 70 employees, delivers cutting-edge solutions for space & defence, nuclear, and glass industr...

 • Sponsored

Senior Site Reliability Engineer (m/f/d)

Redcare Pharmacy • Berlin, DE

Redcare Pharmacy is powered by passionate teams and cutting-edge innovation. We strive to create a healthy, collaborative work environment where every employee feels valued and inspired to contribut...

 • Sponsored

Site Reliability Engineer (w/m/d)

IONOS SE • Berlin, Germany

At IONOS, the leading European provider of cloud infrastructure, cloud services, and hosting services, you work in partnership with a variety of teams. We offer...

Site Reliability Engineer

Zattoo • Berlin, Germany

The ideal blend of stability and flexibility. A genuinely human employer that cares for people and the planet. True autonomy to shape what comes next, for us and you. This is the perfect platform to t...

Service Technicien Engineer

LUXSCAN TECHNOLOGIES • Biesdorf, DE

We are a worldwide leader in the design, manufacturing and installation of industrial scanners for automation in the timber industry. Our company, based in Luxembourg, is part of the WEINIG group. WEI...

 • Sponsored

Solutions Integration Engineer

JAO • Biesdorf, DE

“We partner with Europe to empower and enhance a sustainable energy market”. Located in Luxembourg, we are a service company that hosts Europe's single leading trading platform (e-CAT) for cross-borde...

 • Sponsored

Test Environment Engineer

Sogeti, part of Capgemini • Biesdorf, DE

At Sogeti, we believe the best is inside every one of us. Whether you are early in your career or at the top of your game, we’ll encourage you to fulfill your potential to be better. Through our shar...

 • Sponsored

IoT Engineer

L.E.A.SE. S.A. • Biesdorf, DE

Be one of the points of contact for users’ requests, incidents or questions (administrative users and drivers) with a focus on the OBC components: telecom, hardware, MDM. Determine proper escalation ...

 • Sponsored

Site Reliability Engineer - Observability

N26 GmbH • Berlin, Germany

We are seeking a Site Reliability Engineer to join the Observability group inside our Platform Engineering domain. Platform Engineering's goal is to provide easy-to-use, self-service platforms to en...

 • Sponsored • New!

Doctoral Researcher in Data Quality & Sensor Reliability

Université du Luxembourg • Biesdorf, DE

The Faculty of Science, Technology and Medicine (FSTM) at the University of Luxembourg contributes multidisciplinary expertise in the fields of Mathematics, Physics, Engineering, Computer Science, Life ...

 • Sponsored

Backbone Engineer

Proximus Luxembourg • Biesdorf, DE

Proximus Luxembourg is a leading historical player in the ICT & Telecoms markets. Proximus Luxembourg addresses both residential and business markets through its commercial brands Tango, Proximus NX...

 • Sponsored
AI Infrastructure & Reliability Engineer

HiBob • Berlin, DE
8 days ago
Job description

About Us

HiBob helps modern, mid-size businesses transform the way they manage people, giving HR and managers all they need to connect, engage, develop, and retain top talent. Since 2015, we’ve achieved consecutive triple-digit year-over-year growth, all backed by our amazing team of Bobbers from across the globe, making us the HRIS of choice for roughly 5,500 midsize and multinational companies and more than 1 million users.

Our HR platform is intuitive, data-driven, and built for the way people work today: globally, remotely, and collaboratively.

What this role is really about

You’ll join a 3-person platform team within our Business Technology group, owning the internal infrastructure that our AI platform and its users depend on. This isn’t a product engineering role, and it isn’t ticket work or babysitting pipelines someone else built. You’re building and operating the internal foundation that the company runs on. The work covers the full stack of platform engineering: core cloud infrastructure (AWS, Kubernetes, IaC), CI/CD pipelines, AI-driven infrastructure components, and the SRE and observability practice that keeps it all honest (metrics, alerting, incident response, and reliability standards). As our AI capabilities grow, so does the complexity underneath them, and staying ahead of that is central to the role. If you treat infrastructure as a product (reusable, automated, observable, and built to last), this is your kind of role.

Job requirements

  • 2-4 years of hands-on DevOps, SRE, or infrastructure engineering experience in production SaaS environments.
  • Strong AWS experience: multi-account architecture, cross-account IAM, serverless and event-driven services (Lambda, SQS, SNS, EventBridge), and EKS cluster management.
  • Proven Kubernetes experience in production, including cross-account migrations and stateful workload management.
  • Proficiency with Terraform - repository structure design, module architecture, and CI/CD pipeline implementation.
  • Hands-on experience building and maintaining GitHub Actions pipelines for end-to-end CI/CD workflows.
  • Working Python proficiency for scripting, internal tooling, and workflow automation.
  • Practical experience implementing observability stacks from scratch: metrics, logging, distributed tracing, and alerting.
  • Experience owning reliability practices: SLOs, incident response, and postmortem culture.

Nice to have

  • Hands-on experience operating LLM APIs in production: rate-limit and quota management, cost attribution per team/model, latency monitoring, and resilience patterns (retries, fallbacks, circuit breakers).
  • FinOps experience across cloud, AI, and observability spend.
  • Experience introducing self-healing or auto-remediation patterns in production.
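
For readers less familiar with the resilience patterns named above (retries, fallbacks, circuit breakers), here is a minimal Python sketch of how they can fit together around an unreliable call such as an LLM API request. It is an illustration only, not a description of HiBob's stack: the names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retries`, as well as the thresholds and delays, are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def call_with_retries(primary: Callable[[], str], fallback: Callable[[], str],
                      breaker: CircuitBreaker, max_attempts: int = 3,
                      base_delay_s: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Try `primary` through the breaker with exponential backoff, then use `fallback`."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(primary)
        except CircuitOpenError:
            break  # do not hammer a dependency that is known to be failing
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    return fallback()
```

In this sketch the fallback could be a cheaper model or a cached answer; production code would typically also add jitter to the backoff and retry only on errors known to be transient.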

Job responsibilities

  • DevOps & AI-Driven Infrastructure - own CI/CD, deployment processes, and release reliability. Build and operate cloud infrastructure that is automated, intelligent, and continuously self-improving, not just managed.
    • Design and build our Terraform repository and IaC pipeline from scratch, with AI-assisted generation, drift detection, and policy enforcement built in.
    • Build AI-driven GitHub Actions pipelines: automated code review, risk assessment, and intelligent deployment decisions.
    • Manage Kubernetes workloads across AWS accounts: zero downtime, fully automated, nothing left behind.
    • Embed AI into the operational layer: proactive drift detection, automated remediation, and intelligent scaling toward a self-healing runtime.
  • Reliability & SRE - improve uptime, resilience, and incident response.
    • Define and enforce SLOs/SLIs, error budgets, and on-call practices.
    • Lead incident response, postmortems, and systemic reliability improvements.
    • Own AI-specific reliability: model latency SLOs, token quota monitoring, rate limit handling, fallback and retry strategies, and cost-per-request alerting.
  • Observability & Telemetry - increase visibility, reduce noise, improve troubleshooting.
    • Establish and continuously evolve the observability stack: metrics, logs, distributed tracing, and alerting tuned for both application and AI workloads.
  • AI / LLM Operations - bring AI systems to production and operate them at scale, with a focus on reliability, performance, and trust.
    • Own the AI infrastructure layer: rate limits, quota management, latency SLOs, and fallback strategies (retries, circuit breakers).
    • Operate LLM APIs in production with resilience and cost attribution per team/model.
  • FinOps & Cost Optimization - optimize AI, infra, and logging costs at scale.
    • Build cost visibility and guardrails across AWS, LLM usage, and observability pipelines.
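
As a quick illustration of the SLO and error-budget vocabulary used in the responsibilities above, the following Python sketch shows the arithmetic behind a time-based error budget. The function names and the 30-day window are assumptions made for the example; the posting does not specify targets or windows.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

For example, a 99.9% availability target over 30 days leaves a budget of about 43.2 minutes, so 21.6 minutes of downtime would consume half of it.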