Senior DevOps Engineer - AWS & AI Platform
CarParts.com
Job Description
We are CarParts.com, a fast-growing eCommerce platform focused on auto care and maintenance, helping drivers find quality parts at competitive prices and book appointments with trusted mechanics through our website. We operate with a company-owned national distribution network and use modern design and technology to deliver a fast, intuitive experience. Our team is performance-driven, data-focused, and fast-paced, with a culture centered on safety, customer focus, excellence, high standards, ownership, and promotion from within. We have more than 1,000 employees worldwide, are scaling rapidly following a recent strategic partnership and $35 million investment, and are expanding Axle, our award-winning AI platform.

This is an on-site role based in the Los Angeles / Long Beach area, and it offers a salary range of $170,000-$210,000. We are an equal-opportunity employer and are committed to fair employment practices.

- We want 10+ years of practical DevOps, SRE, or platform engineering experience in live AWS cloud environments.
- We need deep AWS expertise across EKS, EC2, SQS, CloudWatch, IAM, Organizations, and multi-account designs.
- We are looking for strong Kubernetes capability, including cluster operations, node group administration, workload separation, taints/tolerations, and autoscaling.
- We prefer experience with Akamai or a comparable enterprise CDN, including configuration, purge workflows, and traffic-routing controls.
- We require CI/CD ownership experience with GitHub Actions and/or Jenkins, monorepo build systems, and release gating.
- We expect production experience building or running AI agents, including LLM integration, autonomous workflow design, and prompt engineering.
- We need proficiency in Node.js and/or Python for automation, tooling, and MCP server development.
- We want observability ownership experience with Elastic/Kibana, log analysis, alerting design, and SLO/SLI instrumentation.
- We expect comfort with on-call responsibility for a production eCommerce platform with meaningful revenue impact.
- We need strong written and verbal communication skills for collaboration with engineering leadership and executive presentations.
- We require the ability to work on-site in the Los Angeles / Long Beach area, or willingness to relocate there.

- We want you to own our entire platform engineering and SRE function using AI-native tooling and autonomous agents.
- You will extend and improve our AWS multi-account infrastructure, including EKS, EC2 worker nodes, SQS pipelines, and AWS Bedrock workloads.
- You will manage and evolve our Kubernetes and container platform, including EKS and Kops clusters, node groups, and isolated environment tiers.
- You will oversee CI/CD and release management across multiple repositories, including GitHub Actions, Jenkins, Turbo, and canary or blue-green deployment controls.
- You will maintain and tune our Akamai CDN and traffic-management setup, including phased releases, security controls, throttling, monitoring, and cache invalidation.
- You will lead observability and incident response across Elastic/Kibana, CloudWatch, business monitoring, backlog alerting, and AI-assisted triage.
- You will build, deploy, and continuously improve autonomous AI agents that handle monitoring, alerting, incident triage, and routine operational work.
- You will extend our OpsWhisperer platform, contribute to Axle, and build MCP servers that expand agent capabilities.
- You will apply LLM-powered reasoning and automation to infrastructure challenges that would otherwise require multiple people.
- You will own on-call duties and act quickly to resolve incidents, deployment issues, and infrastructure changes with precision.