Automating Video Editing with AI Agents
How Re-skill Built an LLM-Powered Workflow for Smarter Video Editing
Mar 18, 2025
At Re-Skill, we wanted to automate the tedious process of creating educational videos. While simple tasks like merging clips can be hardcoded, advanced editing, such as animations and text rendering, requires a smarter approach. That's why we partnered with Diffusion Studio (YC F24) to develop an AI Agent that streamlines our video editing workflows.
We launched the AI Agent at the AI Engineering Summit NYC. You can also watch the demo and the talk here:
In this post, we wanted to share what we’ve learned from building AI agents and give practical advice for developers on building effective agents for video editing.
Intro to AI agents
Any efficient system using AI needs to give LLMs some kind of access to the real world: for instance, the ability to call a search tool to get external information, or to act on certain programs in order to solve a task like video editing. In other words, LLMs should have agency. Agentic programs are the gateway to the outside world for LLMs.
AI Agents are programs where LLM outputs control the workflow.
Any system leveraging LLMs will integrate the LLM outputs into code. The influence of the LLM's output on the code workflow is the level of agency of LLMs in the system.
Agents can be used for open-ended problems like video editing, where it's difficult or impossible to predict the required number of steps and where you can't hardcode a fixed path. The LLM will potentially operate for many turns, so you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
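To make the "level of agency" concrete, here is a minimal TypeScript sketch. The `callLLM` helper and the workflows it selects are hypothetical placeholders; the point is only that the model's output, rather than hardcoded logic, decides which branch of the program runs.

```ts
// Hypothetical stand-in for a real LLM API call; returns canned output here.
async function callLLM(prompt: string): Promise<string> {
  return prompt.includes('Choose one action') ? 'trim' : 'A short summary.';
}

async function handleRequest(userRequest: string): Promise<string> {
  // Low agency: the LLM output is just a value the program uses.
  const summary = await callLLM(`Summarize this request: ${userRequest}`);

  // Higher agency: the LLM output controls which code path executes.
  const action = await callLLM(
    `Choose one action for "${userRequest}": "trim", "add_text" or "none". Reply with the word only.`
  );

  switch (action.trim()) {
    case 'trim':
      return 'running the trim workflow...';
    case 'add_text':
      return 'running the text overlay workflow...';
    default:
      return summary; // fall back to plain text output
  }
}
```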
How it started
While evaluating potential options for our AI agent's backbone, our criteria were an intuitive API that any LLM we might use could pick up easily, speed, and, ideally, client-side rendering.
To this end, after running into the limitations of ffmpeg, we started looking for more intuitive and flexible alternatives.
ffmpeg
- supports render combination only through a single command and cannot subsample videos, etc.
- does not support advanced layering

remotion
- complex architecture with many dependencies that requires extensive boilerplate code
- does not support client-side rendering

core from Diffusion Studio
- very efficient, small engine that runs directly in the browser
- minimalistic API that is AI-friendly (context-efficient)
The Architecture
In our multi-step AI agent, the LLM writes an action at each step in the form of calls to external tools.
The common format (used by Anthropic, OpenAI, and many others) for writing these actions is generally some variation of "writing actions as a JSON of tool names and arguments to use, which you then parse to know which tool to execute and with which arguments".
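For illustration, such a JSON action typically looks something like the snippet below; the tool name and arguments are made up, and the exact schema varies by provider:

```json
{
  "tool_name": "trim_video",
  "arguments": {
    "input": "lecture_01.mp4",
    "start": 12.5,
    "end": 48.0
  }
}
```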
However, it is better to have the LLM express its actions as code. The reason is simply that we crafted our programming languages specifically to be the best possible way to express actions performed by a computer. If JSON snippets were a better expression, JSON would be the top programming language, and programming would be hell on earth.
Hence our default agent is a Code Agent: it writes and executes JS/TS code snippets at each step. Since core can do complex compositions via a JS/TS-based programmatic interface, it is a perfect match for our Code Agent.
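Here is a minimal sketch of such a loop, assuming hypothetical `callLLM`, `executeInBrowser`, and `getVisualFeedback` helpers; these stubs are illustrative and are not part of core's API or our production code.

```ts
// Hypothetical stand-ins; a real system would call an LLM API, a sandboxed
// browser runtime, and a vision model here.
async function callLLM(prompt: string): Promise<string> {
  return `console.log("next editing step");`;
}
async function executeInBrowser(code: string): Promise<{ logs: string; error?: string }> {
  return { logs: `executed ${code.length} chars of JS/TS` };
}
async function getVisualFeedback(task: string): Promise<{ ok: boolean; critique: string }> {
  return { ok: true, critique: '' };
}

// The agent loop: the LLM writes JS/TS, the snippet runs against the editing
// engine, and feedback (errors or visual critique) flows into the next step.
async function runCodeAgent(task: string, maxSteps = 5): Promise<void> {
  let context = `Task: ${task}`;
  for (let step = 0; step < maxSteps; step++) {
    const code = await callLLM(
      `${context}\nWrite a JS/TS snippet that advances the composition. Return code only.`
    );
    const result = await executeInBrowser(code);
    if (result.error) {
      context += `\nExecution error: ${result.error}`;
      continue; // let the LLM repair its own code on the next step
    }
    const feedback = await getVisualFeedback(task);
    if (feedback.ok) return; // result looks right: stop and render
    context += `\nVisual feedback: ${feedback.critique}`;
  }
}
```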
Tools
It is also crucial to design toolsets and their documentation clearly and thoughtfully, since an agent is, in essence, an LLM using tools in a loop based on environmental feedback.
Here is the set of tools that we carefully crafted for our agent (a sketch of how they could be wired together follows the list):
A VideoEditingTool generates code based on user prompts and runs it in the browser
If additional context is needed, DocsSearchTool uses RAG to pull relevant info
After each execution step, the composition is sampled (currently 1 frame per second) and analyzed using VisualFeedbackTool
The feedback system decides whether to proceed with rendering or refine further
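The sketch below shows one way such a toolset could be declared, assuming a generic `Tool` interface; the names and shapes are illustrative and do not reflect our actual tool schema or any specific framework's API.

```ts
// A generic tool description that an agent framework could expose to the LLM.
interface Tool {
  name: string;
  description: string;
  run(input: string): Promise<string>;
}

// Illustrative stubs mirroring the list above; real implementations would
// call the editing engine, a RAG index over the docs, and a vision model.
const tools: Tool[] = [
  {
    name: 'video_editing',
    description: 'Generate JS/TS editing code for the user prompt and run it in the browser.',
    run: async (prompt) => `executed generated code for: ${prompt}`,
  },
  {
    name: 'docs_search',
    description: 'Retrieve relevant documentation snippets via RAG.',
    run: async (query) => `top documentation matches for: ${query}`,
  },
  {
    name: 'visual_feedback',
    description: 'Sample the composition at 1 fps and critique it against the user request.',
    run: async (request) => `critique of sampled frames for: ${request}`,
  },
];

// The agent selects a tool by name at each step; here we only show the lookup.
const toolByName = new Map(tools.map((t) => [t.name, t]));
```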
For the VisualFeedbackTool, we used this generator-evaluator blueprint:

This is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are:
- The LLM responses can be demonstrably improved when a human articulates their feedback.
- The LLM can provide such feedback.
This is analogous to the iterative writing process a human writer might go through when producing a polished document.
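As a rough sketch under these assumptions, the generator-evaluator loop behind the VisualFeedbackTool could look like the following; `generateCompositionCode`, `sampleFrames`, and `evaluateFrames` are hypothetical placeholders rather than our production implementation.

```ts
// Hypothetical stand-ins for the generator (a code-writing LLM), the frame
// sampler (1 fps snapshots of the composition), and the evaluator (a
// vision-capable LLM that critiques the frames against the user request).
async function generateCompositionCode(request: string, critique: string): Promise<string> {
  return `/* editing code for "${request}", revised with: ${critique} */`;
}
async function sampleFrames(code: string): Promise<string[]> {
  return ['frame_0s.png', 'frame_1s.png']; // placeholder frame handles
}
async function evaluateFrames(
  request: string,
  frames: string[]
): Promise<{ pass: boolean; critique: string }> {
  return { pass: true, critique: '' };
}

// Generate, sample, evaluate, and refine until the evaluator is satisfied
// or the round budget runs out; only then do we proceed to a full render.
async function refineUntilAccepted(request: string, maxRounds = 3): Promise<string> {
  let critique = '';
  let code = '';
  for (let round = 0; round < maxRounds; round++) {
    code = await generateCompositionCode(request, critique);
    const frames = await sampleFrames(code);
    const verdict = await evaluateFrames(request, frames);
    if (verdict.pass) break;
    critique = verdict.critique; // feed the critique back to the generator
  }
  return code;
}
```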
/llms.txt
We also shipped /llms.txt, a proposal to standardize a file that provides information to help LLMs use a website at inference time. It is essentially /robots.txt, but for agents.
It matters because LLMs increasingly rely on website information but face a critical limitation: context windows are too small to handle most websites in their entirety. Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.
While websites serve both human readers and LLMs, the latter benefit from more concise, expert-level information gathered in a single, accessible location. This is particularly important for use cases like development environments, where LLMs need quick access to programming documentation and APIs.
In our case, it gave our agent access to up-to-date docs for the core library.
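For illustration, an /llms.txt file is plain Markdown: an H1 with the project name, a short blockquote summary, and sections of annotated links. The entries below are invented for this example and are not the actual file we ship:

```md
# Diffusion Studio Core

> A browser-based video editing engine with a minimal, LLM-friendly API.

## Docs

- [Getting started](https://example.com/docs/getting-started.md): installation and a first composition
- [API reference](https://example.com/docs/api.md): clips, compositions, and rendering

## Examples

- [Text overlays](https://example.com/examples/text.md): adding animated captions to a clip
```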
Outro
Our AI Agent already makes our workloads faster and cheaper, and we believe that, with the advent of more AI-generated content online, agentic video editing is a natural extension.
We’d love to collaborate with researchers interested in pushing this forward and co-authoring a research paper. There is so much more to build, and we’re excited to shape the future of AI-powered video editing together!
You can check out the public version of our work in this repo: