Senior Cloud Reliability Engineer
Covalent Solutions, LLC • SRE • Onsite • Washington, District of Columbia, Pennsylvania Avenue Northwest • $155k-165k • Posted about 20 hours ago
Job Description
We are Covalent Solutions, LLC, and we are hiring a Senior Cloud Reliability & Telemetry Engineer to serve as the reliability orchestrator for a federal government client's cloud environment. This is a remote-eligible role, though we strongly prefer candidates located in the Washington, D.C., Maryland, or Virginia area who are available to travel to Washington, D.C. as needed for critical meetings and collaboration. The position offers a salary range of $155,000 to $165,000 per year, with benefits that include a 401(k), paid time off, and health, dental, vision, and life insurance. We are seeking a mission-driven professional who can strengthen platform reliability, observability, automation, and cost efficiency while supporting a highly regulated federal enterprise environment.
- We require a bachelor's degree in Computer Science, Cloud Computing, Computer Engineering, or a related technical discipline; equivalent military IT or practical systems engineering experience may also be considered, and a master's degree is preferred.
- We look for 6+ years of progressive professional experience in site reliability engineering, cloud infrastructure operations, or systems administration.
- We need at least 2 years of direct experience supporting, scaling, and configuring automation, deployment, and monitoring tools in a production federal or heavily regulated cloud environment.
- We expect advanced, hands-on expertise with modern cloud observability platforms, centralized log management, alerting systems, and telemetry configurations used to track error budgets and incident MTTR.
- We require proven experience using code to automate routine monitoring, identify configuration drift, and implement self-healing recovery workflows.
- We need strong troubleshooting and problem-solving skills across complex hybrid federal environments that combine cloud infrastructure with multi-vendor legacy backend systems.
- We value experience working in an outcome-focused, fast-moving Agile software development setting using capacity-based team structures.
- We prefer candidates who have successfully partnered with product managers and cross-functional teams to align infrastructure delivery with user needs, product backlogs, and operational priorities.
- We require candidates to be U.S. citizens or permanent residents who are willing to undergo a Public Trust clearance process.
- We expect strong English communication skills and the ability to commute to Washington, D.C. when needed.
- We require a bachelor's degree and 6 years of cloud engineering experience, as well as the ability to work onsite when required.
- We maintain high availability, scalability, resilience, and a 99.5%+ uptime baseline for the pilot cloud platform, automation tools, and foundational environment frameworks.
- We build, refine, and sustain standardized logging, monitoring, and auditing dashboards across all cloud accounts to deliver full environment visibility and support federal compliance requirements such as NIST, FedRAMP, and FISMA.
- We continuously watch for infrastructure configuration drift, document deviations, and create automated remediation routines or playbooks to help external vendors restore approved baselines.
- We respond quickly to critical security findings by providing hands-on operational support, updated reference deployment patterns, and source-code corrections within 72 hours of validation.
- We apply tailored monitoring, logging, alerting, and metrics-gathering solutions to integrate legacy enterprise applications into a unified cloud observability view.
- We conduct regular resource-usage reviews and share quarterly recommendations for cloud cost optimization and right-sizing with vendor teams and senior leadership.
- We participate in cross-vendor governance forums and planning sessions to review standards adoption, reduce operational risk, manage on-call production demands, and keep delivery moving efficiently.
- We ensure dashboards, reporting templates, and internal interfaces meet Section 508 and WCAG 2.1 AA accessibility standards, and we remediate any gaps within 30 days.
- We provide direct technical support to external infrastructure vendors and internal development teams when issues arise, with a focus on visibility and automated incident response.
- We support critical meetings, technical exchanges, and collaborative working sessions with the client's Washington, D.C. offices as needed.