Service Reliability Engineer

Prague, Czech Republic

Hybrid

Who We Are

Role Description

We are looking for a Service Reliability Engineer who can start immediately: setting up monitoring, connecting with the right people, and automating the client's deployment pipelines.

Responsibilities:

Design, build, and maintain highly available and scalable services or applications, focusing on observability.
Develop and implement monitoring, tracing, and logging solutions to provide deep insights into service behavior and performance.
Collaborate with cross-functional teams to define key performance indicators (KPIs) and service level objectives (SLOs) for the services.
Establish and maintain service dashboards and visualizations to provide real-time visibility into service health and performance.
Develop and maintain alerting systems to proactively detect and respond to service anomalies or degradation.
Analyze and troubleshoot complex issues related to service performance, reliability, and availability.
Conduct post-incident analysis using observability tools and implement improvements to prevent similar incidents in the future.
Drive continuous improvement of the observability platform, including evaluating and adopting new tools and technologies.
Participate in capacity planning exercises and provide recommendations to ensure optimal service performance.
Collaborate with development teams to design and implement monitoring, tracing, and logging instrumentation within the services.

We Expect You to Have:

Bachelor’s degree in Computer Science, Information Technology, or a related field.
Strong experience in software development or system administration, with a focus on observability.
Proficiency in programming and scripting languages such as Python, Java, Ruby, or Bash.
Solid understanding of distributed systems, microservices architectures, and observability principles.
Experience with observability tools such as Prometheus, Grafana, Jaeger, Elasticsearch, or Splunk.
Familiarity with distributed tracing and logging frameworks, such as OpenTelemetry or Fluentd.
Knowledge of cloud technologies (e.g., AWS, GCP, Azure) and containerization (e.g., Docker, Kubernetes).
Understanding of infrastructure-as-code tools such as Terraform or Ansible.
Strong problem-solving and troubleshooting skills, with the ability to analyze complex system behavior using observability data.
Excellent communication and collaboration skills to work effectively with various teams.

Apply for this position

Our team will review your application within the next 5 days.

Upload your resume (maximum file size: 10 MB).