Job Description

We would love to meet you if this describes you:

Philosophy: You are your own worst critic. You have a high bar for quality and don't rest until the job is done right, with no settling for 90%. You ship fast, act with high agency, and don't just voice problems but jump in to fix them.
Experience: You have deep expertise in Python and PyTorch, with a strong foundation in low-level operating systems concepts including multi-threading, memory management, networking, storage, performance, and scale. You're experienced with modern inference systems like TGI, vLLM, TensorRT-LLM, and Optimum, and comfortable creating custom tooling for testing and optimization.
Approach: You combine technical expertise with practical problem-solving. You're methodical in debugging complex systems and can rapidly prototype and validate solutions.

The core work will include:

Architecting and implementing robust, scalable inference systems for serving state-of-the-art AI models
Optimizing model serving infrastructure for high throughput and low latency at scale
Developing and integrating advanced inference optimization techniques
Working closely with our research team to bring cutting-edge capabilities into production
Building developer tools and infrastructure to support rapid experimentation and deployment

Bonus points if you:

Have experience with low-level systems programming (CUDA, Triton) and compiler optimization
Are passionate about open-source contributions and staying current with ML infrastructure developments
Bring practical experience with high-performance computing and distributed systems
Have worked in early-stage environments where you helped shape technical direction
Are energized by solving complex technical challenges in a collaborative environment

This is an in-person role at our office in SF. We're an early-stage company, which means the role requires working hard and moving quickly. Please only apply if that excites you.

About Reducto

Reducto is a company that specializes in converting complex documents into AI-ready inputs, leveraging state-of-the-art vision models developed by a team from MIT. The technology enables AI teams to process unstructured data, such as medical records and financial statements, with high accuracy and reliability. These models read documents in a way that mimics human understanding, addressing a critical bottleneck in AI workflows.

LLM/ML Engineer (Inference)