Senior HPC DevOps Engineer
Peraton
Job Description
At Peraton, we are a next-generation national security company specializing in impactful missions that range from local to intergalactic endeavors. As the premier integrator of mission capabilities and transformative enterprise IT solutions, we are committed to providing trusted technologies to safeguard our nation and allies. Our operations span traditional and nontraditional threats across land, sea, air, space, and cyberspace, partnering with essential government agencies and serving every branch of the U.S. armed forces. Our team of dedicated professionals tackles some of the most challenging issues our customers face daily. Join us at our Maryland-based location to be part of a rewarding work environment focused on innovation and excellence. We offer competitive salaries and a comprehensive benefits package, including health care, retirement plans, and paid time off.

- 12+ years of experience and a BS in computer science, IT, or a related technical field; an MS with 10 years of experience; or a Ph.D. with 8 years of experience. For candidates without a Bachelor's degree, a total of 16 years of experience is required.
- 7+ years of experience in Linux systems, SRE, or DevOps, specifically in production cluster operations within an HPC or large-scale computing environment.
- 3+ years of hands-on experience in building and managing Ansible automation at scale (roles/collections, idempotency, inventories, secrets).
- Solid understanding of Linux hardening and compliance (SELinux/AppArmor, SSH key automation, baseline configuration management).
- Proven track record in operating or automating clustered compute environments (HPC, extensive Linux farms, or similar).
- Practical experience with container tools in Linux settings, including image lifecycle and versioning.
- Knowledge of incident response and runbook-driven operations; capability to automate common resolutions.
- Proficiency in Git workflows and documentation practices.
- Must possess at least one active technical certification from categories such as Systems Engineering (e.g., INCOSE), Information Security (e.g., CISSP), Networking (e.g., CCNA), System Administration (e.g., RHCE, MCSE), Virtualization (e.g., VCP), IT Systems Management (e.g., ITIL), or Project Management (e.g., PMP, Agile).
- This role requires an active TS/SCI clearance with Polygraph.

- Take charge of automation workflows, encompassing job templates, inventories, credentials, RBAC configurations, execution environments, and inter-environment promotion.
- Enforce desired state across cluster services through code-driven configurations; implement drift detection and alert on discrepancies; reconcile runtime state with configured settings.
- Establish an automated node bootstrap process for onboarding compute nodes (bare-metal/VM), including OS installation, application of security and performance standards, node enrollment into the scheduling and shared storage ecosystem, hardware and service readiness validation, and pass/fail reporting.
- Execute rolling maintenance and patch automation to uphold defined vulnerability response SLAs; manage version-controlled container build definitions, integrating image scanning into the build/release cycle.
- Ensure automation and operational processes generate auditable logs for centralized analytics while integrating with metrics and alerting systems to enable dependable incident responses, proactive detection, and safe auto-remediation.
- Automate responses to common incidents (e.g., hung nodes, storage performance alarms, image vulnerabilities, hardware failures) using out-of-band hardware management interfaces and standardized runbooks.
- Keep runbooks and operational documentation version-controlled alongside automation efforts and publish operator guidance on the documentation platform.