Site Reliability Engineer (Remote)
About the job
Overview
As a Site Reliability Engineer, you will work cross-departmentally with your partners on the application, infrastructure, and quality teams to enhance the performance, reliability, resilience, and scalability of the web services that make up our websites. We are a cloud-native organization with 100% of our services running in Docker on Kubernetes in the AWS public cloud. We also leverage observability, monitoring, CI/CD automation, and custom tooling to push multiple production releases a day.
Your day-to-day focus will be applying your engineering skills to assist with build-out and monitoring, reduce developer toil, configure CI workflows, and improve our deployment pipelines. You will also be a knowledge reference for our development teams, ensuring they use consistent tooling for metrics, logging, build, and deployment. You will work closely with the development and infrastructure teams to identify the essential service-specific metrics (beyond the golden metrics) that need to be monitored, and with application development teams to create libraries that make it easy for services to instrument themselves.
**The Impact You'll Make**
- Collaborate with stakeholders to drive best practices for monitoring and CI/CD pipelines
- Troubleshoot deployment issues in our CI pipeline
- Advocate emphatically for the DevOps culture
- Identify areas for automation and embrace the codification of all things
- Evangelize best practices around collaboration, reliability, security, and performance to all partner teams
- Take ownership of the application configuration and scaling for given services to ensure that they follow the established practices of the organization
What You've Accomplished
- Minimum 2 years of development experience at startup/mid-sized companies
- Proficiency in Python, Go, Node, Ruby or Elixir
- Knowledge of containerization, particularly Docker (Kubernetes is a plus)
- Effective communication skills, a positive attitude, and ability to give and receive constructive feedback
- Professional experience with cloud-native observability standards such as OpenMetrics, OpenTracing, and OpenCensus
- Expertise using/configuring modern CI/CD workflows
- Intimate understanding of SLIs, SLOs and SLAs from the service level to the business level
- Intimate understanding of the golden metrics and how to monitor and alert on them
- Deep understanding of the GitHub Flow branching strategy
- Experience troubleshooting containerized applications
Bonus Points
- Familiarity with cloud infrastructure concepts (AWS, GCP)
- Experience with HashiCorp tools such as Terraform, Consul, and Vault
- Computer science or other engineering background
- Experience with CI tools such as CircleCI, Jenkins, Travis CI, Drone, Semaphore, etc.
- Experience with monitoring and observability tools such as Prometheus, CloudWatch, Datadog, and Grafana