AI Everything
MCP, a vision for human and AI working together
We are leaping at lightspeed towards connecting everything to AI. Large language models (LLMs) are no longer limited to text-focused tasks. Today, we can send AI agents off to browse the internet, reverse-engineer binaries, compose music, create 3D models and much more. This is happening right now, and the process of connecting ever more things to AI is only speeding up.

How?
How do LLMs control other programs? Do they move your mouse or what?
Anthropic's Computer Use
Moving your mouse and looking at your screen directly is one way of connecting LLMs capable of processing visual input, so-called vision LLMs (vLLMs), to your applications. This is arguably the most naïve way of doing it, but also the most universal approach. Anthropic dubs it Computer use. You can basically imagine the model taking a screenshot and then deciding to click on something (this is literally how it works, as you will see below). And it does that repeatedly until it feels (⚠️anthropomorphism!) it has achieved the goal.
Model Context Protocol (MCP)
Computer use is just one of many possible tools though, and Anthropic has also open-sourced a standardized interface for all kinds of tools, called the Model Context Protocol. It allows you to connect arbitrary endpoints to your LLM, exposing them as tools the model can decide to call. The framework that runs your LLM routes those commands to the tool endpoint instead of printing the output to the user.

This basic MCP framework means you can connect absolutely everything to LLMs, as long as you can provide an endpoint with various commands to the model. An endpoint in Python code could be something like this:
@mcp.tool()
async def get_alerts(state: str) -> str:
    """Get weather alerts for a US state.

    Args:
        state: Two-letter US state code (e.g. CA, NY)
    """
    url = f"{NWS_API_BASE}/alerts/active/area/{state}"
    data = await make_nws_request(url)

    if not data or "features" not in data:
        return "Unable to fetch alerts or no alerts found."

    if not data["features"]:
        return "No active alerts for this state."

    alerts = [format_alert(feature) for feature in data["features"]]
    return "\n---\n".join(alerts)
The model will recognize this as an available tool it can call.
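For completeness, the tool above needs to live inside a running MCP server. A minimal sketch using the FastMCP helper from the official MCP Python SDK might look as follows; NWS_API_BASE, make_nws_request and format_alert are assumed to be defined in the same file.

# Minimal server sketch around the tool above (assumes the official MCP Python SDK).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

# ... the @mcp.tool() definition of get_alerts from above goes here ...

if __name__ == "__main__":
    # Serve the registered tools over stdio so an MCP client
    # (e.g. Claude Desktop or Cursor) can launch this script and call them.
    mcp.run(transport="stdio")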

You can imagine a conversation looking something like this (generated by Gemini 2.0, tasked with coming up with a mock-chat):
User: Hey, are there any active weather alerts in California?
Thinking (hidden):
- User Query Analysis: The user is asking for weather alerts in California. This requires using a tool to fetch the information.
- Tool Selection: The get_alerts tool seems appropriate for this task.
- Tool Invocation: get_alerts(state="CA")
- Tool Execution: (The get_alerts tool is executed in the background)
- Tool Response: (Assuming the tool successfully fetches data from the National Weather Service API)

  Headline: Heat Advisory issued August 26 at 2:23PM PDT until August 29 at 8:00PM PDT by NWS
  Description: ...High temperatures expected...
  Severity: Moderate
  Urgency: Expected
  Effective: 2024-08-26T14:23:00-07:00
  Expires: 2024-08-29T20:00:00-07:00
  Area: Los Angeles County; Ventura County
  ---
  Headline: Red Flag Warning issued August 26 at 1:57PM PDT until August 27 at 8:00PM PDT by NWS
  Description: ...critical fire weather conditions...
  Severity: Severe
  Urgency: Expected
  Effective: 2024-08-26T13:57:00-07:00
  Expires: 2024-08-27T20:00:00-07:00
  Area: Santa Barbara County Mountains; ...
- Response Generation: The MCP formats the tool's response into a user-friendly message.
Response to User:
There are a couple of active weather alerts for California:
- Heat Advisory: In effect for Los Angeles and Ventura Counties until August 29, 2024. Expect high temperatures.
- Red Flag Warning: In effect for Santa Barbara County Mountains and other areas until August 27, 2024. Critical fire weather conditions are expected.
Examples
Computer Use
The example from above where an LLM can "use" your computer also just boils down to various tool calls. This is what Claude tells me (after feeding it the repository via Gitingest) about which tools Anthropic's Computer Use implementation provides:
- screenshot: Take a screenshot of the current screen
- left_click: Perform a left mouse click
- right_click: Perform a right mouse click
- middle_click: Perform a middle mouse click
- double_click: Perform a double click
- mouse_move: Move the mouse cursor to specific coordinates
- left_click_drag: Click and drag from current position to specified coordinates
- key: Press keyboard keys/shortcuts
- type: Type text
- cursor_position: Get the current mouse cursor position
- left_mouse_down: Press down the left mouse button
- left_mouse_up: Release the left mouse button
- scroll: Scroll in specified direction (up, down, left, right)
- hold_key: Hold down a key for a specified duration
- wait: Wait for a specified duration
- triple_click: Perform a triple click
- bash: Run bash commands in a persistent shell session
- str_replace_editor: Edit, view, and create files with sub-commands:
  - view: View file content
  - create: Create new files
  - str_replace: Replace text in files
  - insert: Insert text at specific line
  - undo_edit: Undo previous edits
There are many tools, but typically the model will do screenshot to see what's going on, left_click to do something, wait, and screenshot again.
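To make that pattern concrete, here is a rough client-side sketch of such an agent loop. Only the tool names come from the list above; send_to_model and execute_tool are hypothetical placeholders standing in for the real model API call and the local tool implementations.

# Rough sketch of the observe-act loop behind "computer use".
def send_to_model(history: list[dict]) -> dict:
    """Hypothetical placeholder for the LLM API call; returns text or a tool call."""
    return {"type": "text", "text": "done"}

def execute_tool(name: str, tool_input: dict) -> str:
    """Hypothetical placeholder that would dispatch to screenshot/left_click/wait/etc."""
    return f"executed {name} with {tool_input}"

def agent_loop(task: str, max_steps: int = 25) -> str:
    history: list[dict] = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = send_to_model(history)   # the model decides what to do next
        if reply["type"] == "text":      # no further tool call: the model is done
            return reply["text"]
        # Typical pattern: screenshot -> left_click -> wait -> screenshot -> ...
        result = execute_tool(reply["name"], reply.get("input", {}))
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": result})  # feed the result back
    return "step limit reached"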
Browser Use
The only MCP server I personally use is mcp-browser-use through Cursor IDE. It allows the model to use a browser directly. So, instead of operating your whole computer, taking screenshots and clicking around in the browser, it can communicate with the browser directly. This 1) uses fewer resources, 2) is more reliable and 3) allows the use of unimodal (text-only) models.

Configuring it was a breeze, a testament to how far the AI ecosystem has evolved, thanks in part to Astral's great uv, which ships with uvx (like npx for Python) and which I previously recommended in my post about Python dependency management.

That way Cursor can code a website and debug it by directly getting feedback from a real browser. A truly agentic way of programming (though errors still compound…)
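To illustrate how such a browser tool differs from pixel-level computer use, here is a minimal sketch of a text-only browser tool. It is an illustration, not mcp-browser-use's actual implementation; it assumes the official MCP Python SDK (FastMCP) and Playwright.

# Sketch: expose a headless browser as an MCP tool that returns page text.
from mcp.server.fastmcp import FastMCP
from playwright.async_api import async_playwright

mcp = FastMCP("browser")

@mcp.tool()
async def open_page(url: str) -> str:
    """Open a URL in a headless browser and return the visible text."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        text = await page.inner_text("body")
        await browser.close()
        return text[:2000]  # keep the tool response short for the model

if __name__ == "__main__":
    mcp.run(transport="stdio")

Because the model gets structured text back instead of screenshots, even a text-only model can browse, which is exactly the advantage described above.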

Reverse-Engineering Binaries
When you have a binary, it's trivial to turn it back into some sort of code. The issue is understanding what that code is supposed to do. Unless a binary ships with debug symbols (hint: it usually doesn't), nothing has sensible names after decompiling.

On the one hand, you can see exactly what is happening. But on the other hand, it's not immediately clear what FUN_0040060c (a name the decompiler just makes up) is supposed to do and what role iVar4 plays in that process, unless you thoroughly analyze the code. And that's a lot of effort. That's (presumably) why Laurie Kirk, now at Google Research and well known under her artist alias (I think YouTubers are artists as well) LaurieWired, developed an MCP server for LLMs to use the reverse-engineering tool Ghidra, appropriately named GhidraMCP.

It lets the model interactively go through the code, decompiling and interpreting functions using these tool calls:
- list_methods: Lists function names with pagination.
- list_classes: Lists namespace/class names with pagination.
- decompile_function: Decompiles a function by name and returns the C code.
- rename_function: Renames a function.
- rename_data: Renames a data label at a specified address.
- list_segments: Lists memory segments with pagination.
- list_imports: Lists imported symbols with pagination.
- list_exports: Lists exported functions/symbols with pagination.
- list_namespaces: Lists non-global namespaces with pagination.
- list_data_items: Lists defined data labels and their values with pagination.
- search_functions_by_name: Searches for functions by substring.
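One plausible way to wire a desktop application like Ghidra into MCP is a small bridge server that forwards each tool call to a plugin running inside the application. The sketch below is hypothetical; the plugin URL and route are made-up assumptions for illustration, not GhidraMCP's actual interface.

# Hypothetical sketch: an MCP server forwarding tool calls to a plugin
# running inside Ghidra over a local HTTP API.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ghidra")
GHIDRA_PLUGIN_URL = "http://127.0.0.1:8080"  # assumed local plugin address

@mcp.tool()
async def decompile_function(name: str) -> str:
    """Decompile a function by name and return the C code."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{GHIDRA_PLUGIN_URL}/decompile", params={"name": name})
        resp.raise_for_status()
        return resp.text

if __name__ == "__main__":
    mcp.run(transport="stdio")

A typical session then chains the tools above: list_methods for an overview, decompile_function on interesting entries, and rename_function once the model has figured out what a function does.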
Music Generation
You might know Suno, which uses AI to generate music. It works the way a diffusion model might generate an image or video: it just produces the final output.

I always assumed that these AI models interact with some kind of synthesizer, but they just directly generate the audio. What you can now do thanks to tools like Ableton MCP is connect an actual digital audio workstation (DAW) to your LLM, allowing the LLM to configure the DAW to produce the desired music instead of generating it directly. This has two huge advantages:
1. you get the "source code" to your music, the individual tracks constituting the song you just generated, and
2. as a result of 1), you can modify the output to your liking.
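To show what working in this "source space" could look like, the sketch below uses entirely made-up tool names and an in-memory stand-in for project state; it is not Ableton MCP's actual API, just an illustration of the idea that the model edits tracks and clips rather than emitting audio.

# Hypothetical DAW tools: the model manipulates project state, not raw audio.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("daw")
project: dict[str, list[dict]] = {}  # track name -> list of clips (stand-in for real DAW state)

@mcp.tool()
async def create_midi_clip(track: str, notes: list[int], length_bars: int = 4) -> str:
    """Add a MIDI clip (a list of note pitches) to a track. Hypothetical tool."""
    project.setdefault(track, []).append({"notes": notes, "bars": length_bars})
    return f"Added a {length_bars}-bar clip with {len(notes)} notes to '{track}'."

@mcp.tool()
async def list_tracks() -> str:
    """List the tracks and clips created so far. Hypothetical tool."""
    return "\n".join(f"{t}: {len(c)} clip(s)" for t, c in project.items()) or "Empty project."

if __name__ == "__main__":
    mcp.run(transport="stdio")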
An issue often raised with AI-generated art: where is the artistic process? How do you adjust the output? Using MCP, an artist and an AI can work together on a project.

Without MCP, how do you go about fixing a project so it fits your artistic vision? Let's say you generated an image and you think it's 80% ready. How do you polish it to get those remaining 20%? Now I can already hear you screaming "but there is inpainting and various other img2img methods!!" and yes, widely used tools like InvokeAI or AUTOMATIC1111's stable diffusion webui have such features, but it's a different layer of abstraction. With inpainting and img2img, you are working in the output space. If we imagine an MCP for drawing images, the end result would allow you to not only edit individual output pixels, but also intervene at any other point of the generation pipeline, e.g., change the stroke of a brush, change the color of a pencil, remove a layer, and so on.
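Sticking with that thought experiment, such an imagined drawing MCP might expose tools like the ones sketched below. Everything here is hypothetical, with an in-memory stand-in for editor state, purely to show what intervening in the pipeline (rather than in the pixels) could look like.

# Hypothetical tools for an imagined "drawing MCP": edit the pipeline, not the pixels.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("drawing")
layers: list[dict] = []  # stand-in for real editor state

@mcp.tool()
async def add_layer(name: str, brush: str = "pencil", color: str = "#000000") -> str:
    """Add a layer with a given brush and color. Hypothetical tool."""
    layers.append({"name": name, "brush": brush, "color": color, "strokes": []})
    return f"Layer '{name}' added ({brush}, {color})."

@mcp.tool()
async def remove_layer(name: str) -> str:
    """Remove a layer by name. Hypothetical tool."""
    before = len(layers)
    layers[:] = [layer for layer in layers if layer["name"] != name]
    return f"Removed {before - len(layers)} layer(s) named '{name}'."

if __name__ == "__main__":
    mcp.run(transport="stdio")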
3D Modelling
I think by now you've got the gist of it, so we'll quickly go over the example of Blender MCP, a project that connects the 3D modelling software Blender to your LLM. You tell the model something, and it tries to build it in Blender.


The Vision
The last result required quite some manual intervention, as you can see in the video. To me, that's exactly the vision I see in MCP: a great way for AI and humans to work together using the same tool. Before MCP, a black-box AI would generate mostly "AI slop" and that's what you got. With MCP, you can cooperate with an AI model. It can help you with repetitive or tedious tasks while you maintain full creative control. You get the best output you can humanly achieve, more efficiently.
This will change how we work with AI profoundly, and that happens on top of the landslide of changes that AI is already causing.
I am very excited about what this will enable and how it will transform our work, without leaving us with less control than before. As with all other AI systems, we must not lose track of how this will impact security. Ex-Google, now Anthropic, security researcher Nicholas Carlini argues in his recent post that AI poses very real risks that should concern anyone, whether you believe in "AI doom and human extinction through artificial superintelligence" or not. Those include mass unemployment, phishing and mass surveillance. It's not hard to imagine how MCP could exacerbate some of these issues, e.g., by giving malicious actors more means to run fully automated phishing campaigns.
So, as we hand all these controls to AI, we need to be mindful of the implications. We are not just witnesses but also stewards of the exciting times that lie ahead.