Senior Site Reliability Engineer

Remote | Full-time

We’re looking for a dedicated and talented SRE to join the team of our client, a cutting-edge company building next-generation AI infrastructure.

The company is focused on redefining how AI workloads are deployed and scaled by leveraging a distributed GPU network. Their platform enables seamless deployment across multiple environments, optimizing for cost, performance, and flexibility. The mission is to empower AI teams with a fast, scalable, and cost-efficient cloud experience, removing vendor lock-in and supporting the growing demands of modern AI systems.

Tasks and responsibilities:

  • Ensure reliability, availability, and performance of the production platform
  • Monitor and maintain infrastructure supporting AI workloads running in production
  • Set up and improve monitoring, alerting, and incident response processes
  • Participate in on-call rotations and handle production incidents
  • Work with observability tools (metrics, logs, tracing) to track system health (latency, error rates, SLA compliance)
  • Support and optimize platform stability for customers running production workloads

What you need to be successful in this position:

  • 5+ years of experience in SRE
  • Strong Linux administration skills
  • Solid understanding of networking fundamentals
  • Hands-on experience with monitoring, alerting, incident response, and on-call practices
  • Experience with observability (metrics, logs, tracing) and system reliability metrics (latency, error rates, SLA compliance)
  • Upper-intermediate English level (B2+)

Nice to have:

  • Experience with Kubernetes and cloud/GPU infrastructure
  • Familiarity with containers and CI/CD pipelines
  • Understanding of performance and cost optimization for AI/GPU workloads
  • Basic knowledge of production security and data handling
  • Experience with APIs and distributed systems reliability
  • Knowledge of autoscaling and capacity planning
  • Experience with AWS (including EKS) and tools like Grafana, Prometheus, and Loki

This is a great opportunity to join a modern, fast-growing team working on cutting-edge AI infrastructure and solving complex, real-world challenges.

Please include a short summary of your relevant experience in your cover letter, and specify your English level as well as your experience working in a fully English-speaking environment. Thank you, and I look forward to the opportunity to discuss more in person!
