Software Engineer, Reliability

Posted 1 hour ago • Software

Job Description

Our mission is to automate coding. The first step in our journey is to build the best tool for professional programmers, using a combination of inventive research, design, and engineering. Our organization is very flat, and our team is small and talent-dense. We particularly like people who are truth-seeking, passionate, and creative. We enjoy spirited debate, crazy ideas, and shipping code.

About the role

We’re hiring a Software Engineer, Reliability, to help Cursor scale while holding a high reliability bar.

You will work across the stack (client, backend services, model routing/integrations, and infra) to find the reliability bottlenecks that most impact users, ship durable fixes, and build the tooling and guardrails that keep us fast and stable.

This is a strongly engineering-focused role. It is not a program management role, and it is not an ops-only SRE role.

What you’ll do

Own reliability work end-to-end, from user-facing symptoms (crashes, latency, streaming failures) to root causes in services, infrastructure, or vendor dependencies.
Design and implement resilience patterns for upstream dependency failures (for example model providers): fallbacks, routing strategies, and degraded-mode designs.
Build and maintain reliability guardrails that make teams faster and safer: deployment safety, rollbacks, operational playbooks, automated checks, and standards for production readiness.
Improve observability (metrics, logs, traces, and client telemetry) so engineers can quickly answer “Is it up?” and “What changed?”
Reduce operational toil through automation and better tooling.
Partner with product and infrastructure engineering teams as a drop-in reliability multiplier: embed on the highest-impact problems and drive them to a durable technical outcome.
Participate in an on-call rotation and help improve incident response practices over time (severity definitions, runbooks, retrospectives, and clear ownership of follow-up fixes).

You will own a small set of high-leverage reliability “themes” at a time (for example client crash rate, streaming reliability, deploy safety). You will drive these end-to-end until the reliability bar measurably moves.
You will not be “responsible for everyone’s metrics” by default. You will build the system and partner with teams; service owners ultimately own their service SLOs and fixes.
You will not be the owner of all CI/CD. You will raise the production-readiness bar with guardrails and tooling, while infrastructure and product teams own their pipelines and day-to-day workflows.
On-call is part of the job, but the goal is to eliminate recurring incidents and toil, not to be a permanent triage function.

You may be a fit if you

Have a track record of improving reliability by empowering other engineers with excellent tooling, guardrails, and simple operational systems.
Own problems end-to-end, learn quickly, and enjoy working across layers (client symptoms, service behavior, infra primitives, and third-party dependencies).
Prefer pragmatic, high-leverage fixes over perfection, and can raise standards without becoming “the voice of no.”
Are comfortable leading through influence: aligning teams on the “why,” landing changes in multiple codebases, and driving clarity on ownership.

Strong experience owning reliability for production systems, including both incident response and long-term engineering fixes.
Strong software engineering instincts. You write code to automate, eliminate recurring operational work, and prevent regressions.
Expert-level experience in at least one of: Go, Node/TypeScript, or Python.
Deep practical knowledge of cloud infrastructure (AWS) and modern deployment/orchestration patterns (Kubernetes and/or ECS).
Experience with observability systems and practices (metrics, logs, traces, and alerting).
Clear communication and cross-team leadership.

Bonus points

Experience with multi-region architecture and global distribution strategies.
Experience with networking and long-lived connection workloads (for example HTTP/2 streaming).
Experience building reliability programs in high-growth, high-velocity organizations.

Applying

If there appears to be a fit, we'll reach out to schedule 2-3 short technical interviews. Afterward, we'll schedule an onsite in our office, where you'll work on a small project, discuss ideas, and meet the team.