Site Reliability Engineer - DevOps focus
Cox Automotive
Job Description
We are part of Cox Automotive. Our Site Reliability Engineer II will join a dynamic SRE team dedicated to improving reliability, observability, and maturity across more than 150 teams of engineers. Our mission is to create scalable processes, documentation, and tools that enhance operational efficiency and quality. We offer a competitive salary ranging from $89,400.00 to $134,000.00, flexible vacation policies, holiday pay, and paid wellness hours, among other benefits that prioritize the well-being of our employees. Our work environment is fast-paced and ever-evolving, suited to individuals who thrive in a dynamic setting.

- Proficiency in at least one programming language: Python, TypeScript, or Java.
- Bachelor's degree in a relevant field with 4 years of experience; alternatively, a master's degree with 2 years of experience; a Ph.D. with up to 1 year of experience; or 16 years of experience in a related domain.
- Current authorization to work in the United States without the need for sponsorship.
- Expertise in the design, analysis, and troubleshooting of large-scale distributed systems.
- Extensive hands-on experience with observability tools such as CloudWatch and New Relic.
- Demonstrated capacity to evaluate engineering practices and implement measurable improvements across various teams.
- Experience defining SLIs/SLOs, managing error budgets, and improving alert signal-to-noise ratios.
- Strong foundation in release engineering, CI/CD, and progressive deployment techniques.
- Proficiency in AWS, Terraform, AWS CDK, and GitHub/GitHub Actions.
- Passion for leveraging AI, LLMs, and automated solutions to tackle operational reliability challenges.
- Proven history of decreasing MTTR and improving system availability through automation and architectural refinements.
- Excellent written and verbal communication skills, suitable for audiences ranging from engineers to executives.
- Methodical problem-solving ability with a keen sense of ownership and initiative.
- Familiarity with Linux operating systems, networking basics, and performance principles.
- Competency in building trust and influencing decisions through data-informed insights.
- Experience in facilitating effective post-incident analyses and fostering systemic improvements.
- Willingness to excel in a fast-paced and dynamic work environment.

- Define and promote the adoption of SLIs, SLOs, error budgets, and exceptional alerting standards throughout the organization.
- Design comprehensive observability frameworks encompassing metrics, logs, traces, and business signals with a consistent taxonomy and discoverability.
- Develop centralized dashboards, reliability scorecards, and runbooks utilized by engineering teams and leadership.
- Establish baselines for engineering practice maturity and collaborate with teams to create measurable enhancement strategies.
- Generate standardized pipelines, infrastructure modules, and service templates to facilitate swift and consistent delivery.
- Lead the adoption of AI and automated solutions to minimize manual effort, expedite incident responses, and improve operational workflows.
- Conduct internal workshops, simulation events, and educational programs to promote operational excellence.
- Serve as a trusted advisor to product and engineering leadership, providing data-driven insights regarding reliability risks and trade-offs.
- Guide post-incident reviews towards systemic solutions (guardrails, automation, design modifications) rather than superficial remedies.
- Design and enhance self-service platforms for deployment, progressive delivery, and automated recovery functionalities.
- Minimize MTTR via enhanced telemetry, automation, AI-driven diagnostics, and resilience strategies.
- Mentor engineers across teams to cultivate local reliability advocates, amplifying SRE impact without increasing workforce.