Job Description

Shipping infrastructure software is only half the job. The other half is making it work in environments you don’t control—across messy reality, strict security constraints, and endless platform variations. The difference between a good product and a trusted one is how quickly you can diagnose issues and how effectively you prevent them from happening again.

At OpsMill, we're building Infrahub, a schema-driven infrastructure source of truth that helps teams unify data and scale automation reliably. Our customers deploy Infrahub on-prem, which means reliability is a product feature, not just an operational concern. When something breaks in the field, it's not just a support ticket—it's a signal about what we need to fix, test, or instrument better.

Why This Role Exists

We need someone who can operate in both worlds: diving deep on gnarly customer escalations while systematically eliminating entire classes of problems. You'll be the crucial bridge between "customer is blocked right now" and "this type of issue can't happen again." You'll build the diagnostics, tests, and automation that turn on-prem deployment chaos into predictable, debuggable, fixable reliability.

What You'll Be Doing

Partner directly with customers and with our Solution Architecture/Customer Success teams on L2/L3 escalations, communicating findings, driving root-cause analysis, and resolving complex packaging, deployment, upgrade, and runtime issues across heterogeneous Kubernetes environments
Drive issues to resolution by reproducing problems locally, isolating root causes, and coordinating fixes with engineering—then documenting learnings in crisp RCAs that become actionable improvements
Build and maintain diagnostics tooling including support bundles, health checks, environment validators, and "what changed?" helpers that make future troubleshooting 10x faster
Own the test automation infrastructure roadmap, improving CI stability, reducing flaky tests, and creating reproducible integration/e2e environments that catch issues before customers do
Establish and maintain performance baselines and regression tests that serve as actionable gates, helping teams catch scale and latency issues early
Improve installation and upgrade robustness by identifying recurring failure modes and eliminating them through product changes, automation, and guardrails
Write production-quality code in Python, Go, or Rust for internal tooling and product improvements that directly enhance reliability
Close the reliability feedback loop by systematically turning field issues into better tests, observability, documentation, and product defaults—measuring success through reduced time-to-resolution and fewer repeat incidents

What You Bring

4-7 years of experience in production engineering, SRE, platform engineering, or similar roles where you've owned reliability and customer escalations
Strong software engineering fundamentals including design, debugging, testing, code review, and a focus on maintainable, production-quality code
Practical Kubernetes expertise sufficient to debug real deployments: troubleshooting resources, networking, storage, RBAC, and platform-specific quirks across different distributions
Deep troubleshooting instincts and observability experience using logs, metrics, and traces to diagnose issues quickly in complex, distributed systems
Experience with at least one of Python, Go, or Rust for building tooling and contributing to product code (you don't need to be an expert in all three)
Excellent problem decomposition and communication skills—you can break down messy, ambiguous issues and clearly explain your findings and recommendations
Self-directed remote work capability with strong async communication skills and the ability to operate independently in a fast-moving environment where priorities shift based on customer needs
Collaborative mindset with experience partnering across product, engineering, and customer-facing teams to drive systematic improvements

Nice-to-Haves

Experience with packaging and distribution systems (containers, Helm charts, installers) and managing upgrade/migration flows
Background running CI/CD at scale including test parallelization, hermetic environments, and artifact management
Familiarity with performance tooling such as profiling, load generation, and benchmark harnesses
Previous experience in customer-facing technical roles like escalation engineering, support engineering, or solutions engineering
Contributions to open source projects, especially in infrastructure, observability, or reliability tooling

Why OpsMill?

The people: Work alongside world-class engineers who've built and scaled automation platforms in production. Daily technical challenges with smart colleagues who push you to grow.
The product: Shape Infrahub based on real customer needs. Your input directly influences features, integrations, and roadmap priorities.
The mission: We're making enterprise-grade infrastructure automation accessible to any organization. Open-source at the core, production-ready out of the box. This is a multi-year journey, not a quarterly sprint.
The impact: You'll work with teams managing some of the world's most complex infrastructure deployments, solving problems that ripple across entire organizations.

Our Commitment to Diversity and Inclusion

OpsMill is committed to building a diverse and inclusive team. We believe different perspectives make us stronger and more innovative. We encourage applications from candidates of all backgrounds and experiences, and we're committed to providing an inclusive environment where everyone can do their best work.

About OpsMill

Infrahub from OpsMill is the data management platform for powering reliable infrastructure automation at scale. Just like a car won’t get far with a faulty engine, your automation won’t succeed without reliable data. When infrastructure data is fragmented or out of date, automation becomes hard to scale and harder to believe in. Infrahub brings structure and versioning to your infrastructure data, so you can build automation that’s roadworthy from the start and resilient over time.

UNIFY INFRASTRUCTURE DATA
• Keep all data sources in sync across teams and systems
• Model your infrastructure your way with full schema control
• Connect business logic to technical data to capture and build on design intent
• Audit every change, who made it, and how it affects the rest of your infrastructure

VERSION & VALIDATE EVERYTHING
• Track every change with Git-like version control
• Work safely in branches and test before you deploy
• Collaborate through peer review to verify changes
• Validate updates automatically with native CI workflows

PACKAGE & DEPLOY AS YOU WANT
• Generate configs and artifacts in a scalable way
• Push artifacts to any deployment or orchestration tool in your stack
• Turn infrastructure designs into versioned code that can evolve over time
• Expose infrastructure as APIs to deliver X-as-a-service

Product Reliability Engineer | US
