
A 139-Person Startup Is Training Foundation Models on TikToks to Replace the Entire Camera-to-Timeline Workflow

By Sarah Mitchell

What Mirage Is Actually Building, and Why ML Engineers Are Paying Attention

Mirage, formerly known as Captions, secured $75 million in growth financing in March 2026 from General Catalyst's Customer Value Fund, bringing its total funding past $175 million. The bet wasn't on another AI video toy. It was on replacing the entire camera-to-timeline production workflow with a single system that understands what makes social video work.

The core product thesis is straightforward, even if the engineering behind it isn't. Mirage has built a proprietary foundation model trained specifically on high-performing short-form social video — TikToks, Reels, Shorts — and wrapped it in two product layers. Captions by Mirage serves individual creators: type a prompt, get a finished video with AI-generated actors, automated captions in over 100 languages, and conversational editing through text commands like "make the pacing faster" or "add high-energy transitions." Mirage Studio serves enterprise marketing teams, letting them generate hundreds of ad variants from a single concept, swap hooks and backgrounds by editing text fields, and run iterative A/B tests without reshoots.

Mirage says it doesn't rely on stock footage, voice cloning, or lip-syncing, a key distinction from competitors like D-ID, Synthesia, and Hour One. Its model generates video from scratch: custom AI avatars with natural speech and facial expressions, AI-generated backgrounds, original B-roll, music, and sound effects. Users can upload a selfie to create a digital twin, or pick from pre-built AI actors. The output targets a 60-second ceiling per continuous generation, which maps to the short-form formats where the company sees its clearest product-market fit.

CEO Gaurav Misra framed the rebrand as a signal that the real race for AI video hasn't started yet. TechCrunch reported the company now counts more than 20 million creators, small businesses, and enterprises among its users, with roughly 100,000 daily active users and over 3 million videos produced monthly. Clients include HubSpot, Comcast, Harvard, Fox, and King.

That technical ambition is what draws ML attention. Mirage isn't fine-tuning an open-source diffusion model and bolting on a chat interface. It trains its own multimodal foundation models, audio and video, on social-video data, optimizing for pacing, retention hooks, and platform-native authenticity. That's a different problem than generating a pretty clip. The model must understand narrative structure, audience behavior, and the unwritten grammar of a format where the first two seconds determine whether anyone watches the other fifty-eight.

For ML engineers, that gap between "generates video" and "generates video that performs" is where the interesting work lives, and why Mirage's open roles are filling up.

The Hiring Blitz: ML Engineer, Generative Video, and What the Role Demands

Mirage is hiring an ML Engineer, Generative Video at its Union Square HQ. The job sits at the intersection of research and production systems engineering, and the requirements signal exactly how hard the generative-video problem is proving to be.

The core mandate: build and scale the systems powering video generation models. That means training and optimizing large-scale video and multimodal models, then making them fast and cheap enough to run in production. The posting lists specific techniques — distillation, quantization, pruning — applied to diffusion and autoregressive generation pipelines. This isn't prompt engineering. It's GPU-level work.
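
To make one of those techniques concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only: the function names and sample weights are hypothetical, nothing here comes from Mirage's posting or codebase, and a production system would apply the same idea to tensors through a framework such as PyTorch rather than to Python lists.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = [0.5, -1.27, 0.0, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of four, and the rounding error per weight stays within half a quantization step. Whether that error is acceptable for a video model is an empirical question that evaluation has to answer.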

The required stack is specific: PyTorch, CUDA, Triton, and distributed training frameworks like FSDP. Candidates need at least two years of industry experience with deep learning systems and infrastructure, plus a track record of scaling models under low-latency inference constraints. Debugging and performance profiling skills are listed explicitly. Mirage expects engineers to find bottlenecks, not just ship prototypes.

The role carries an entry-level seniority tag, but the responsibilities read as mid-to-senior: build distributed training systems, optimize GPU utilization and parallelism, develop internal tooling for experimentation and evaluation, and translate research models into production-ready systems. That gap between title and scope is common right now in generative ML. The talent pool is thin enough that companies lower the title bar while keeping the scope wide.

Benefits include medical, dental, vision, 401K with match, commuter benefits, catered lunch multiple days a week, a dinner stipend for late nights, a Grubhub subscription, and multiple team offsites per year. All roles require in-person presence at the Union Square office.

The company's careers page shows 13 open positions, with five added in the past week alone, including a second ML Engineer slot for Agentic Systems at the same pay band, plus a Software Engineer, Agents role and a backend engineering role. That's a team scaling fast across both the modeling and product sides. For ML engineers weighing options, the compensation is competitive with Snap's Level 5 generative-ML role and above what most NYC media companies list, though below the top of the market at places like Spotify or Goliath Partners. The draw here is scope: early team, hard problem, and equity in a Sequoia- and a16z-backed startup.

Role or benchmark                        Compensation or figure
ML Engineer, Generative Video (Mirage)   $175,000 – $275,000 base
ML Engineer, Agentic Systems (Mirage)    $175,000 – $275,000 base
Generative ML, Level 5 (Snap)            Comparable to Mirage's range
Runway valuation (early 2026)            $5.3 billion
Mirage valuation (2025)                  $500 million
Runway funding round (early 2026)        $308 million

Why Union Square, Not SoHo or Brooklyn, Is Becoming NYC's Generative-Media Hub

Mirage's office sits in Union Square, and that placement looks less like coincidence and more like gravity. AI Atlas NYC's live map tracks 73 AI startups across the city. Flatiron, which neighbors Union Square and shares its talent corridor, holds 10 companies, making it the largest single neighborhood cluster. Bryant Park, Chelsea, DUMBO, and Midtown each have five. But the cluster forming around Union Square and Flatiron is distinct in one respect: it concentrates companies building creative and media-facing AI tools.

Glif, a "creative super agent" for generating images, videos, audio, and ads from a chat interface, operates out of Flatiron. Y Combinator-backed Stewdio gives creative teams a shared workspace for image and video generation models from its DUMBO office, close enough to draw from the same talent pool. MOTHER.Tech, also Flatiron-based, builds prompt-free creative AI for consumer content. FLORA, a Chelsea company at Series A, targets AI-assisted production workflows for creative teams. The pattern is consistent: startups building generative media tools, the exact category Mirage competes in, cluster within a roughly one-mile radius centered on Union Square and Flatiron.

The economics reinforce the geography. Seed-stage companies make up 28 of the 73 startups on the AI Atlas map, and the Union Square/Flatiron corridor offers something neither SoHo nor Brooklyn matches at that stage: walkable access to Midtown engineering talent commuting from Penn Station, proximity to the venture firms along Broadway and Fifth Avenue, and office rents that still undercut Chelsea and SoHo. Y Combinator's own New York portfolio includes multiple generative-media companies — Stewdio, Closera, Darkgrade, Melder — that have gravitated toward this same band of Manhattan.

The Union Square cluster also has a structural advantage for generative-video startups specifically: the neighborhood sits at the overlap between NYC's advertising industry, still headquartered around Madison Avenue and Bryant Park, and its growing ML engineering workforce. Generative-media products need both the creative-domain users who understand post-production workflows and the ML engineers who can build models that serve those users. Putting an office at that intersection is a recruitment strategy as much as a real estate decision.

SoHo's AI scene skews toward consumer and social apps: Nori, Ohai.ai, and Tapestry all list SoHo addresses. Brooklyn's DUMBO and Williamsburg clusters lean toward infrastructure and data tooling, with companies such as Mega, Loyalist, and Cerca. Neither has the density of creative-AI startups that the Union Square/Flatiron corridor has accumulated. For a company like Mirage that needs ML engineers who understand both temporal video models and the production pipelines those models must plug into, the neighborhood isn't incidental. It's the product.

The Technical Stack: What 'Contextual Awareness' in Video Generation Actually Means

The reason Mirage is hiring ML engineers at $175,000–$275,000 a year, and the reason those roles demand a rare overlap of skills, comes down to a single unsolved problem: getting a model to understand what it's generating well enough to keep it consistent over time.

Video generation is image generation with a constraint that changes everything. Each frame must preserve character identity, lighting, camera motion, and scene layout relative to every frame before it. Get it wrong and a character's jacket changes color mid-shot, a background wall drifts, or a face morphs between cuts. The entire field is organized around solving this coherence problem, and the approaches differ sharply.
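
A crude way to see what a coherence failure looks like in data is to compare consecutive frames and flag abrupt jumps. The sketch below does that with plain Python lists standing in for frames. The pixel values and the threshold are hypothetical, and a raw pixel difference is only a rough proxy: it cannot tell an intended cut from an unintended identity or color drift, which is part of why the problem is hard.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def find_jumps(frames, threshold):
    """Return indices i where frame i differs sharply from frame i - 1."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]


# Four tiny hypothetical "frames" of four pixel intensities each.
frames = [
    [10, 10, 10, 10],
    [11, 10, 10, 11],
    [90, 95, 10, 11],  # two pixels change abruptly, like a jacket switching color
    [90, 96, 10, 11],
]
jumps = find_jumps(frames, threshold=20)
```

Here only the transition into the third frame is flagged. Real evaluation tooling would more plausibly compare learned features, such as face or object embeddings, than raw pixels.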

Temporal modeling: the core bottleneck

Modern video models process 3D tokens that capture both spatial detail and temporal motion, rather than the 2D tokens used in image models. The DataCamp analysis of the top 10 video generation systems of 2026 describes the pipeline: a text encoder converts prompts into structured representations, a denoising network refines random noise step by step, and an encoder and decoder move between pixel space and a compressed latent space for efficiency. Decoding is memory-intensive enough that many pipelines decode frame by frame.
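
The jump from 2D to 3D tokens is easiest to appreciate as arithmetic. The sketch below counts spatiotemporal patches for a hypothetical latent; the latent dimensions and the patch size are assumptions chosen for illustration, not figures from any of the systems discussed here.

```python
def count_tokens(frames, height, width, patch=(1, 2, 2)):
    """Number of spatiotemporal patches (tokens) in a frames x height x width latent."""
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    return (frames // pt) * (height // ph) * (width // pw)


# Hypothetical latent sizes: one 60 x 104 latent frame versus 16 of them.
image_tokens = count_tokens(1, 60, 104)
video_tokens = count_tokens(16, 60, 104)

# Full self-attention compares every token with every other token,
# so the number of pairs grows with the square of the token count.
attention_pairs_ratio = (video_tokens ** 2) // (image_tokens ** 2)
```

Under these assumptions a 16-frame clip has 16 times as many tokens as a single image, and full self-attention over it touches 256 times as many token pairs. That is one reason latent compression and careful memory handling matter so much in these pipelines.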

But the architecture choices vary. Google's Veo 3 produces 8-second clips at 1080p with native synchronized audio at 24fps, excelling at dialogue-driven scenes where lip sync and ambient sound need to match the visual. ByteDance's Seedance 1.0 handles multi-shot narrative videos, maintaining subject consistency and atmosphere across shot transitions, a different technical bet focused on longer-form storytelling. Wan2.2, the open-source model from Wan-AI, uses a Mixture-of-Experts diffusion architecture that routes specialized experts across different denoising stages: a high-noise expert handles early global layout, a low-noise expert handles fine detail later. Each design reflects a different theory about where coherence breaks down.
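
The routing idea described for Wan2.2 can be sketched in a few lines: pick an expert according to how noisy the current sample still is. The linear noise schedule, the 0.5 boundary, and the function names below are all assumptions made for illustration; they are not taken from the Wan2.2 implementation.

```python
def pick_expert(noise_level, boundary=0.5):
    """Choose which expert handles a denoising step, based on remaining noise."""
    return "high_noise" if noise_level >= boundary else "low_noise"


def denoise_schedule(num_steps, boundary=0.5):
    """List (step, expert) pairs for a simple linear noise schedule."""
    schedule = []
    for step in range(num_steps):
        # Noise falls linearly from 1.0 toward 0.0 (an assumed schedule).
        noise_level = 1.0 - step / num_steps
        schedule.append((step, pick_expert(noise_level, boundary)))
    return schedule


schedule = denoise_schedule(4)
experts = [expert for _, expert in schedule]
```

With four steps the early, noisy steps go to the high-noise expert and only the last step goes to the low-noise expert. Only one expert runs per step, so the model's total parameter count can grow without a matching increase in per-step compute.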

Creative-decision orchestration: the layer that makes Mirage hard

Raw generation is only part of what Mirage's platform needs. The company's thesis, natural language orchestration of production and editing decisions, requires a system that can translate a prompt like "cut to a close-up, hold for three seconds, then pull back as the character stands" into a sequence of model calls, each conditioned on the output of the last. That's not a single inference. It's a chain of dependent generations where each step must respect the creative decisions of the previous one.
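
The shape of such a chain can be sketched as a loop in which each call receives the result of the previous one. Everything below is hypothetical: `generate_shot` is a stand-in for a real model call, the plan is split by hand rather than parsed from the prompt, and none of it reflects Mirage's actual system.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    instruction: str
    context: tuple  # instructions of every earlier shot this one was conditioned on


def generate_shot(instruction, previous):
    """Stand-in for a model call: record what the new shot was conditioned on."""
    if previous is None:
        context = ()
    else:
        context = previous.context + (previous.instruction,)
    return Shot(instruction, context)


def run_plan(plan):
    """Generate shots in order, conditioning each on the one before it."""
    shots = []
    previous = None
    for instruction in plan:
        previous = generate_shot(instruction, previous)
        shots.append(previous)
    return shots


# The example prompt, split by hand into three dependent steps.
plan = [
    "cut to a close-up",
    "hold for three seconds",
    "pull back as the character stands",
]
shots = run_plan(plan)
```

The final shot carries the two earlier instructions as context, which is the dependency that makes this a chain rather than three independent inferences. In a real system the context would be generated frames or latents rather than strings, and an error in an early step would propagate to every later one.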

This is where the ML engineering gets genuinely difficult. The system needs what researchers call contextual awareness: an understanding of what has already been generated, what the user intends next, and how to constrain the next generation so it doesn't contradict what came before. Runway's Motion Brush, where a user paints movement paths onto specific frame elements, is one commercial approach to giving models that kind of directed, localized control. Kling 2.5 Turbo's physics-aware motion, which incorporates gravity and impact dynamics, is another. Neither solves the full orchestration problem. They address slices of it.

Professional-editor parity: the gap that defines the job category

The benchmark these roles implicitly target is output that a professional editor would accept without manual correction. Current tools are close in isolated clips. They fall apart across sequences.

CrePal's 2026 filmmaking tools review tested every major platform and found that maintaining the exact same character across 20+ shots in a short film remains difficult even with Kling v3's multi-shot breakthrough and Runway's world consistency features. "You'll get close," the review concluded. "You won't get perfect — not without significant prompt engineering and reference image discipline." Audio sync has improved, largely because of Veo 3.1's native audio generation, but complex multi-speaker dialogue scenes still require manual post-production work.

That gap, between a single impressive clip and a coherent multi-shot sequence, is the technical problem Mirage's ML engineers are being hired to close. It requires fluency in diffusion architectures, temporal modeling, prompt-conditioning pipelines, and the creative-tooling product decisions that determine how a director actually interacts with the system. It's a job category that sits between research ML and product engineering, and it didn't exist at scale before 2024.

The fact that Mirage lists its ML roles at the same compensation level as its iOS and backend roles signals how central the technical challenge is to the company's product. This isn't a team bolting a generation model onto an existing editor. It's a team building the orchestration layer that makes generative video usable as a production tool, and the difficulty of that problem is exactly why the hiring bar is where it is.

Who Mirage Is Competing Against for Talent, and the Broader Hiring Surge

The salary range Mirage posts for its ML engineering roles is a signal, not just a number. It's calibrated to pull engineers out of Google's video AI teams, OpenAI's generative media groups, and Adobe's creative-cloud research division, the three organizations whose work most directly overlaps with what Mirage is building.

That's a notable bet. Google has been expanding its generative video research through DeepMind and Google Labs. OpenAI's Sora team has been hiring since 2024. Adobe, through Firefly and its Premiere Pro integrations, employs hundreds of ML engineers working on creative-tool AI. All three can offer base salaries at or above Mirage's range, plus equity packages that a venture-stage startup can't match on paper.

So Mirage trades on scope. Its job descriptions emphasize "foundational problems that remain largely unsolved across the industry" and "outsized impact on the future of creative expression," language designed to appeal to engineers who want to own a problem end-to-end rather than contribute to a subsystem inside a larger platform. At a 139-person company, an ML engineer working on generative video models is likely training, evaluating, and shipping models directly. At Google or OpenAI, that same engineer might spend a quarter waiting for compute allocation and internal review.

The talent war for generative-media engineers is broad enough that Mirage isn't just fishing in the same pond as the big three. HeyGen, Synthesia, Runway, and D-ID are all building overlapping capabilities and hiring for similar roles. Canva has been acquiring startups to bolt animation and marketing-creation tools onto its platform. The result is a market where experienced video-generation ML engineers, people who understand temporal modeling, diffusion architectures for video, and the specific failure modes of generative media in production, are scarce enough that every new entrant forces the rest to raise offers or sharpen their pitch.

Mirage's investor roster gives it some credibility in that fight. Backing from Sequoia Capital, Andreessen Horowitz, Kleiner Perkins, and Index Ventures means the company can name firms that engineers recognize as having picked past winners. General Catalyst's Pranav Singhvi told TechCrunch that Mirage's "business equation is extremely figured out," a line that, while directed at investors, also reads as a recruiting message: this isn't a research lab burning cash, it's a company with unit economics that work.

The constraint is location. Every engineering role at Mirage lists Union Square, New York City as the workplace, and the careers page states plainly that all roles require in-person presence at HQ. That rules out the remote-first engineers who have been increasingly selective since 2024, and it puts Mirage in direct competition with every other NYC-based AI startup for the same pool of engineers willing to commute to a Manhattan office.

Across LinkedIn, postings tagged "AI video generation" have climbed past 643 open roles in the United States alone, with 58 new listings added in a single recent sweep. A broader "video AI" search returns over 5,000 positions. These aren't traditional editing jobs with an AI keyword bolted on. They are hybrid roles, part ML engineer, part creative-tooling product thinker, that barely existed before 2024. The titles tell the story: Cisco is hiring "AI Video Creator & Storyteller." xAI has posted "Member of Technical Staff, Video Generation." Tesla's AI org wants an "AI Engineer, World Modeling & Video Generation." Netflix listed a "Product Manager - AI Video." OpenArt AI is searching for a "Creative Director - Video & AI Content."

Demand is showing up outside pure-play AI companies, too. Legacy media, advertising, and enterprise software firms are building internal generative-video teams rather than waiting for off-the-shelf tools to mature. Upwork reports 4,503 open "AI-generated video" contract gigs, a signal that companies without the budget for full-time ML staff are still trying to get the capability in the door.

What to Watch: Funding, Product Velocity, and the Race to Professional-Editor Parity

Mirage's $75 million growth round, closed in March 2026 from General Catalyst's Customer Value Fund, gives the company runway, but the next 18 months will determine whether the hiring blitz turns into market dominance or just expensive headcount. Four things matter most.

1. The next fundraise and what it signals about unit economics.

The $75M is growth financing, not a traditional priced round, which means Mirage is buying time to prove its numbers before facing a valuation test. The signals so far are strong: Appfigures data puts Captions at over 3.2 million downloads in the trailing 365 days and $28.4 million in in-app revenue. Misra said the platform has produced more than 200 million videos, with 75% of revenue coming from outside the U.S. General Catalyst managing director Pranav Singhvi said Mirage's unit economics are "clearly ahead of the pack." The next round, whether it happens in late 2026 or early 2027, will be the first external verdict on those claims. Watch for whether the company prices at a premium to its $500 million 2025 valuation or gets marked sideways as investors compare it to Runway, which closed a $308 million round at a $5.3 billion valuation in early 2026.

2. "Assembly intelligence," the product bet that justifies the ML hires.

Misra told TechCrunch the company's next set of models will focus on "assembly intelligence," pulling together video from different sources and components into a finished piece. This is the technical leap that separates Mirage from the current generation of AI video tools, which mostly generate clips from scratch or apply templates. If the team can ship a model that reliably assembles multi-source video with coherent pacing, framing, and narrative structure, it opens the enterprise marketing market in a way that template-based competitors like Canva or CapCut can't easily match. The five open ML and agentic-systems roles suggest the company is staffing aggressively against this milestone.

3. The accent problem and the international audio model.

Mirage's new audio model, which preserves speaker accents in generated video, is a direct response to a gap Misra noticed when his father, who speaks with an Indian accent, used the app and had his speech flattened into an American accent. It sounds like a small feature. It isn't. Roughly 75% of Mirage's revenue already comes from outside the U.S., and accent preservation is a prerequisite for selling to marketing teams in India, Southeast Asia, and the Middle East without the uncanny-valley effect that kills adoption. Shipping this well is arguably as important as the assembly-intelligence push for near-term revenue.

4. The competitive set is moving fast.

Canva acquired animation and marketing startups in early 2026. Webflow bought AI-video platform Vidoso in March. D-ID acquired Berlin-based SimpleShow in late 2025. HeyGen and Avataar keep adding features. The AI video tooling space is consolidating while simultaneously fragmenting, with new entrants targeting every niche from product demos to avatar-led ads. Mirage's bet is that vertical depth, models purpose-built for short-form video pacing, framing, and attention dynamics, beats horizontal breadth. The hiring pace suggests the company agrees, but the window to build that lead is finite.

The concrete thing to watch: whether Mirage's next product launch demonstrates assembly-intelligence output that a professional editor would accept without manual correction. That's the benchmark. Everything else, the funding, the headcount, the accent model, is in service of crossing it.

