
Job Description
About the Role
Responsibilities
- Track and manage all break/fix incidents across multiple data centers
- Monitor ticket queues and ensure SLA compliance for incident response and resolution
- Coordinate with on-site technicians, remote hands teams, vendors, and engineering groups
- Maintain accurate records of failed hardware, replacements, RMAs, and repair status
- Escalate critical outages and recurring infrastructure issues to leadership and engineering teams
- Schedule and oversee maintenance windows and emergency repair activities
- Provide daily/weekly operational status reports and incident summaries
- Ensure all work follows data center operational procedures and change management policies
- Identify trends in hardware failures and recommend process improvements
Requirements
- Experience working in data center operations, IT infrastructure, or hardware support
- Strong understanding of server, storage, and networking hardware
- Experience with ticketing systems such as ServiceNow, Jira, or Remedy
- Ability to manage multiple priorities across several sites simultaneously
- Excellent communication and organizational skills
- Familiarity with SLA management and incident escalation processes
- Proficiency with Excel, reporting dashboards, and inventory tracking tools
Preferred Qualifications
- Experience supporting enterprise or hyperscale data centers
- Knowledge of remote hands operations and vendor management
- Understanding of ITIL processes and change management
- CompTIA Server+, Network+, or similar certifications
About Together AI
Together AI is a research-driven AI infrastructure company on a mission to dramatically lower the cost of modern AI by co-designing software, hardware, algorithms, and models. We believe open and transparent AI systems create the best outcomes for society, and we're building the physical and computational foundation to make that real. Our team has been behind landmark advances including FlashAttention, Hyena, FlexGen, and RedPajama.
Compensation
We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $150,000-$200,000, plus equity and benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
Please see our privacy policy at https://www.together.ai/privacy
Job Details
- Category: Operations
- Employment Type: Full Time
- Location: San Francisco, CA
About Together AI
Together AI builds infrastructure to accelerate training, fine-tuning, and inference on performance-optimized GPU clusters. Their platform enables developers and researchers to train, fine-tune, and deploy generative AI models at scale.