
Job Description
About Us
Velvet is a data research company building the datasets that power the next generation of multimodal AI. Founded by Lucas Mantovani (ex-Meta FAIR) and Lucas Tucker (ex-Adobe Infra), our mission is to make AI more human by producing high-quality audiovisual training data for frontier labs.
We're hiring a Research Scientist to develop and fine-tune models for video and audio data processing and enhancement, as well as to conduct data-oriented research that pushes the boundaries of multimodal quality.
What You'll Do
- Research, develop, and fine-tune models for audio and video enhancement — including denoising, super-resolution, speech restoration, and perceptual quality improvement — ensuring outputs meet the standards required for frontier model training.
- Experiment with novel architectures, training objectives, and data augmentation strategies to improve model performance across diverse and noisy real-world audiovisual data.
- Build evaluation frameworks and benchmarks to rigorously measure enhancement quality, guiding iterative model improvement.
- Collaborate with infrastructure and data pipeline engineers to integrate trained models into large-scale processing workflows that handle wide variation in speech, visual quality, and format.
What We're Looking For
- Strong research background in deep learning, with hands-on experience training and fine-tuning models for audio processing, video processing, or related domains.
- Proficiency in PyTorch. Experience designing and running experiments at scale.
- Solid understanding of signal processing fundamentals and how they inform model design for enhancement tasks.
- A publication track record or demonstrated research output in relevant areas (audio/speech enhancement, video restoration, generative models, multimodal learning).
- Ability to work effectively in an early-stage environment where scope is broad and priorities shift fast.
Even Better
- Prior work at a frontier AI lab or data company focused on multimodal data.
- Experience fine-tuning large pretrained models (diffusion models, autoencoders, or transformer-based architectures) for perceptual quality tasks.
- Familiarity with perceptual quality metrics and human evaluation methodologies for audio and video.
- Track record working with datasets spanning tens of thousands of hours of audio or video.
You'll Thrive Here If
- You're excited by applied research with immediate, visible impact on data quality and downstream model performance.
- You move fluidly between reading papers, writing training loops, and analyzing failure cases.
- You hold yourself to a high bar for rigor — because you understand that model quality directly determines the value of the data we produce.
Interview Process
- First round of interviews (remote)
- Second round of interviews (remote)
- Work trial (on-site)
- Offer
Job Details
- Category: Research
- Employment Type: Full Time
- Location: San Francisco, CA
- Posted:
- Compensation: $250,000 - $300,000 per year