
Agentic Code-Generation Loop Research Intern

Compensation
€2,000–€3,000/month

Job Description

The intern will design, evaluate, and advance to the state of the art Windmill's internal agentic loop for generating scripts, flows, and full-stack apps, and will build the benchmarking system that measures its progress. The work tackles several open questions: how to objectively evaluate a generated workflow or app beyond "it compiles" (functional tests, end-to-end execution, UX quality, semantic correctness); how an agent should decompose a natural-language specification into coherent atomic steps; how to efficiently inject Windmill-specific context (hub, types, resource schemas) without saturating the context window; how to exploit execution feedback for self-correction; how to keep a dependency graph of scripts, flows, and apps coherent across iterative multi-file edits; and how to detect hallucinations, silent regressions, and "fake successes" where tests pass for the wrong reasons.
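
To make the execution-feedback question concrete, here is a minimal sketch of a generate-execute-repair cycle. It is illustrative only: generate_candidate, run_in_sandbox, and SandboxResult are hypothetical names, not Windmill's actual agentic-loop API.

```python
# Minimal sketch of an execution-feedback self-correction loop.
# All names are hypothetical; Windmill's real agentic loop differs.
from dataclasses import dataclass

@dataclass
class SandboxResult:
    passed: bool   # did the functional tests pass?
    stderr: str    # tracebacks / assertion failures fed back to the model

def generate_candidate(spec: str, feedback: str | None = None) -> str:
    """Ask an LLM to produce (or repair) a script for the given spec."""
    raise NotImplementedError  # model call elided

def run_in_sandbox(code: str) -> SandboxResult:
    """Execute the candidate against its tests in an isolated sandbox."""
    raise NotImplementedError  # execution harness elided

def solve(spec: str, max_iters: int = 3) -> str | None:
    feedback = None
    for _ in range(max_iters):
        code = generate_candidate(spec, feedback)
        result = run_in_sandbox(code)
        if result.passed:
            return code
        feedback = result.stderr  # execution feedback drives the repair
    return None  # unsolved after max_iters; caller decides what to do
```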

The mission runs over 5–6 months. Phase 1 maps the existing agentic loop, reviews the literature, and reproduces 2–3 reference baselines. Phase 2 builds the benchmark: a task corpus covering isolated scripts, multi-step flows, and full-stack apps (inspired by real Windmill workloads from the public hub and anonymized customer workspaces), with a sandboxed execution harness, multi-criteria scoring (correctness, quality, efficiency, readability), and continuous regression tracking re-run on every agent commit; an open-source release is envisioned. Phase 3 is the core experimental work: iterating on prompts, planning strategies, tool design, retrieval, and execution-feedback loops; comparing frontier models (Claude, GPT, Gemini) with open-weights alternatives (Llama, Qwen, DeepSeek); and exploring, where ROI is demonstrated, supervised fine-tuning on execution traces or RL approaches. Measured improvements ship progressively to production. Phase 4 consolidates the work into a thesis, internal documentation, and, depending on timing, a workshop or conference submission (NeurIPS, ICLR, ICML, COLM).

Expected deliverables: the Windmill benchmark (corpus, harness, tracking dashboard); an improved agentic loop shipped to production with documented progression metrics; a weekly lab notebook; the final thesis report; and possibly a publication or open-source release. The intern works directly with Ruben Fiszel (co-founder & CEO) and the Windmill R&D / AI team, with daily interaction, weekly reviews, and full access to the codebase, anonymized usage data, frontier-model API budgets, and GPU infrastructure for fine-tuning experiments.

State of the art

Code-generation agents:

  • Inline assistants: Copilot, Cursor, Codeium - local completion and editing, short context
  • Autonomous agents: Claude Code, Aider, SWE-agent, OpenHands, Devin - planning, execution, self-correction
  • RL / fine-tuning approaches: AgentCoder, Reflexion, Self-Refine, agent tuning on execution traces
  • Retrieval methods: RAG over documentation, code embeddings, graph-RAG

Reference benchmarks:

  • SWE-bench / SWE-bench Verified - resolving GitHub issues (Python); now saturated on frontier models
  • HumanEval, MBPP, APPS, BigCodeBench - generation of isolated functions
  • LiveCodeBench - anti-contamination, temporally controlled tasks
  • WebArena, AppWorld - agents on simulated environments
  • TAU-bench, AgentBench - agent evaluation with tool use

Limitations of these benchmarks for our use case: none covers workflow generation (step composition, branching, parallelism, state management); none tests generation of full-stack apps with interactive UI; none integrates the specifics of Windmill (type system, resources, variables, hub, multi-language runtime).

Scientific and technical locks:

  1. Evaluation: how to objectively measure the quality of a generated workflow or app, beyond mere "it compiles / it passes a unit test"?
  2. Decomposition: how should an agent break a natural-language specification into coherent atomic scripts/steps?
  3. Contextualization: how to efficiently feed the agent with Windmill context without saturating the context window? (see the sketch after this list)
  4. Iteration loop: how to optimally exploit execution feedback for self-correction?
  5. Multi-file editing: coherent management of a dependency graph between scripts, flows and apps during iterative editing.
  6. Robustness: detection of hallucinations, silent regressions, and "fake successes."
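
As an illustration of lock 3, a common baseline is greedy packing of pre-ranked context snippets (hub examples, resource schemas, type definitions) under a fixed token budget. The sketch below is a toy under stated assumptions: count_tokens stands in for the model's real tokenizer, and snippets are assumed already sorted by relevance.

```python
# Toy sketch of token-budgeted context assembly (lock 3).
# count_tokens is a crude stand-in for a real tokenizer.
def count_tokens(text: str) -> int:
    return len(text.split())

def pack_context(snippets: list[str], budget: int) -> str:
    """Greedily keep the highest-ranked snippets until the budget is spent.
    Assumes `snippets` is pre-sorted by relevance (e.g. by a retriever)."""
    kept, used = [], 0
    for snippet in snippets:
        cost = count_tokens(snippet)
        if used + cost > budget:
            continue  # skip anything that would overflow the window
        kept.append(snippet)
        used += cost
    return "\n\n".join(kept)
```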

Work plan (5–6 months)

Phase 1 - Mapping & state of the art (weeks 1–3): audit of Windmill's current agentic loop (architecture, prompts, tool-use); systematic review of existing literature and benchmarks; selection / reproduction of 2–3 reference baselines.

Phase 2 - Benchmark (weeks 3–8): design of the evaluation task corpus (isolated scripts, multi-step flows, full-stack apps); design of the evaluation harness (sandboxed execution, multi-criteria scoring); set up continuous regression tracking; open-source release of the benchmark envisioned.
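
For a feel of what the harness's scoring could look like, here is a minimal sketch of weighted aggregation over the four criteria named above; the criterion definitions and weights are placeholders, not the benchmark's actual design.

```python
# Illustrative multi-criteria scorer; criteria and weights are placeholders.
from dataclasses import dataclass

@dataclass
class TaskScores:
    correctness: float   # e.g. fraction of functional tests passed, in [0, 1]
    quality: float       # e.g. static-analysis / lint score, in [0, 1]
    efficiency: float    # e.g. normalized runtime or token cost, in [0, 1]
    readability: float   # e.g. rubric- or LLM-graded, in [0, 1]

WEIGHTS = {"correctness": 0.5, "quality": 0.2, "efficiency": 0.15, "readability": 0.15}

def aggregate(s: TaskScores) -> float:
    """Weighted score that a regression tracker can re-compute on every
    agent commit, alerting when it drops relative to the previous run."""
    return sum(getattr(s, name) * w for name, w in WEIGHTS.items())
```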

Phase 3 - Improvement of the agentic loop (weeks 8–20): iterative experimentation on prompts, planning strategies, tool design, retrieval, execution feedback; comparison of frontier models vs open-weights; targeted exploration of supervised fine-tuning and RL approaches; progressive production deployment.

Phase 4 - Consolidation & deliverables (weeks 20–24): writing of the thesis / final-year report; internal technical documentation; possible paper submission.

Who we're looking for

M2 / final-year student in computer science or applied mathematics. Solid programming foundations (Python, TypeScript; Rust a plus), strong interest in LLMs, agents, and evaluation methodology, and an empirical, rigorous approach.

Required skills: proficiency in Python and TypeScript; concrete understanding of how LLMs work (tokenization, context window, prompting, tool use, function calling); hands-on experience with at least one agentic assistant; design of controlled experiments and reproducible metrics; Git, testing, code review, CI; fluent English.

Nice-to-have: Rust; Svelte / modern frontend; fine-tuning & RL experience (SFT, DPO, RLHF, RLAIF); agent/benchmark evaluation experience; prior publication or significant open-source contribution; Docker, PostgreSQL, sandboxing, observability.

Education: Master’s student (M2) or final-year student (PFE) in computer science or applied mathematics, e.g. MPRI, École Polytechnique (X), École Normale Supérieure (ENS Ulm / Paris-Saclay / Lyon), Télécom Paris, CentraleSupélec, Mines, ENSIMAG, EPITA, 42, EPFL, or equivalent.


Interview Process

  1. Apply here or email [email protected]
  2. 30 min interview with founder
  3. 1h case study with a team member
  4. You're hired

This is a 6-month internship, with a permanent contract (CDI) offered upon successful completion.


Job Details

Category
Research
Employment Type
Internship
Location
Paris, Île-de-France, FR
Posted
May 4, 2026, 12:40 PM
Compensation
€2,000 - €3,000 per month

About Windmill

Windmill is an open-source developer platform for building internal software: scripts, multi-step flows, and full-stack apps, backed by a public hub and a multi-language runtime.

