
Principal Systems Engineer: Linux Kernel
Job Description
Introducing Cedana
The Problem
AI and HPC infrastructure is scarce and expensive, so failures are costly in both time and money. Cluster productivity directly determines research output and revenue. Achieving high utilization and throughput is increasingly challenging due to the complexity of workloads, hardware, and operations.
Cedana’s Solution
Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure. We enable transparent, fast migration of GPU workloads across instances without losing work. Workloads automatically migrate to achieve new levels of reliability and throughput while accelerating time to results. Our system is at the kernel/OS level, requiring no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo. Today, we're deploying into leading inference platforms, neoclouds, enterprise, and research clusters.
The Team
Cedana's founding team has spent over a decade making computation run fast, productively, and reliably for AI. Our research appears in NeurIPS and CVPR. We published some of the earliest formal methods for guaranteeing convergence in distributed training. At Shopify, we developed a control plane for robotics fleets used in warehouse automation. We bring repeat founder experience, having built and exited a Series B healthcare AI company.
We are backed by Y Combinator, Initialized Capital, Pebblebed (founders of OpenAI and Facebook AI Research), Keith Adams (engineer #20 at VMware, founded HHVM and FAIR at Facebook, Chief Architect at Slack), Venture Guides, Garry Tan, and Gokul Rajaram.
The Role
What you’ll own
As a Forward Deployed Engineer at Cedana, you’ll lead and own technical engagement from end to end. You’ll engage with customers to understand and deploy on their environments: from production SLURM at a university, to bare-metal Kubernetes at an inference provider, to a hybrid setup at a Fortune 100 pharma enterprise. You’ll rapidly understand their key pain points and use Cedana to solve their problems. For each customer, you own everything from the OS up: SLURM plugins, Kubernetes operators, node configuration, networking, and observability.
This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world’s leading research and commercial customers to deliver a breakthrough solution.
What You'll Do
- Validate and test automation: Our engineers contribute to and develop testing capabilities. This will be a core part of your initial work to establish your understanding of how our system works. Reliability and testing are key parts of our culture.
- Measure and optimize platform performance: Get Cedana to the theoretical maximum performance by understanding fundamental bottlenecks. Measure reliability, throughput, and performance using our internal tools.
- Design and contribute to key system components: Our solution touches all the major aspects of the OS, kernel, GPU, and CPU. Write design papers to outline your vision for a specific capability and then lead implementation.
- Educate our team: Our team is our best learning resource and we continually educate each other. We huddle and pair as needed.
- Excellent communication: You enjoy writing concise and articulate design papers on your code and experiments. You respond to Slack messages and emails within our internal SLAs.
What we are looking for
- 5-10 years of software engineering experience with the Linux kernel.
- Kernel-level depth: reads kernel and driver source, and root-causes defects at the kernel/driver/syscall boundary rather than only reproducing them at the surface. Comfortable operating below the abstraction line, at the hardware/software interface.
- Proven expert-level mastery of at least one performance- or correctness-critical Linux systems domain (networking data plane, virtualization/hypervisor, storage and block I/O, scheduling, memory management, or real-time/determinism), plus demonstrated ability to ramp into an unfamiliar low-level subsystem quickly. Depth in one hard domain is a valuable signal that you can reach depth in the next; we are hiring the descent capability, not the specific subsystem. This includes some combination of:
- OVS, DPDK, SR-IOV, RDMA, or high-performance packet processing
- CRIU + QEMU live migration + VFIO
- runc / containerd / OCI / namespaces / cgroups v2 / overlayfs
- Performance and latency engineering: characterizes throughput, jitter, and tail latency; uses tracing and profiling tooling (perf, ftrace, eBPF, or equivalent) to localize bottlenecks to a code path.
- Systems-level QE: designs and owns automated test infrastructure, performance and regression harnesses, and reproduces kernel-level races and corner cases. Not manual or UI QA.
- Enterprise Linux distribution environment (RHEL or equivalent): version matrices, backports, customer-grade triage.
- Principal scope: owns test or validation strategy for a subsystem; sets technical direction; mentors.
Bonus if you have
- Real-time Linux (PREEMPT_RT) and deterministic-path validation.
- Device virtualization beyond NICs: VFIO, vGPU / GPU passthrough.
- Virtualization internals: KVM / QEMU / libvirt.
- Container and Kubernetes networking (CNI, OVN-Kubernetes).
- Upstream kernel or open-source contribution history.
- Telco / 5G / NFV domain context.
- Credentials as proxies, not gates: RHCA / RHCE, ISTQB CTAL-TM.
Logistics
- Remote, US-based.
- Base $140,000–$180,000 + meaningful early-stage equity.
Benefits
- 100% covered medical, dental, and vision insurance for employees and families.
- Unlimited PTO policy.
- Workstation setup budget.
- 401K Plan.
Equal Opportunity Employer
Cedana is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.
Interview Process
- Initial interview for fit
- Written component to understand background and motivation. Not a coding test.
- Interviews with engineering team.
- References
Job Details
- Category: Aerospace Engineering
- Employment Type: Full Time
- Location: US / Remote (US)
- Compensation: $140,000 - $180,000 per year
About Cedana
Cedana (YC S23) brings hyperscaler and frontier-lab orchestration capabilities to AI workflows. Our core capability is live migration for CPU and GPU workloads. This delivers cost savings of up to 80%, accelerates time to first token 2-10x, and enables stateful reliability of training jobs even through catastrophic GPU failures. We've integrated our solution into K8s, and we support Kueue and Slurm for distributed training jobs and KServe for serving inference. OpenAI, Meta, and Microsoft have flavors of these capabilities internally, and we’re bringing them to everyone. Our vision is to transform cloud compute into a real-time, arbitraged commodity. https://www.cedana.ai