From Chatbots to Digital Workers: Large Action Models (LAM) and Computer Use

Introduction: The End of API Limitations

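Traditional automation depends on fixed integration points. Workflow tools such as n8n and Zapier rely on APIs (Application Programming Interfaces) for systems to talk to each other, while rule-based RPA bots rely on hard-coded selectors and screen positions. If a website lacks an API, or its API is too complex to work with, automation hits a wall.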

The "Computer Use" paradigm and Large Action Models (LAM), both on the rise in 2025, get around this barrier by mimicking how a person operates a computer. These models don't need an API: they look at the screen, locate the "Buy" button, move the mouse cursor to it, and click. This expands the boundary of automation from "what is codable" to "what is visible".

1. What is a Large Action Model (LAM)?

LLMs (Large Language Models) are trained to generate text tokens. LAMs are trained to generate actions: structured commands that an executor carries out on the user interface.

The output space of a LAM includes:

`CLICK(x=450, y=300)`
`TYPE("Hello World")`
`SCROLL(down)`
`DRAG_AND_DROP(source, target)`

By sending these actions to an operating system (OS) level interface, a LAM can in principle perform most tasks that a human performs with a keyboard and mouse.
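As a minimal sketch of how an executor might consume this output space, the Python snippet below parses action strings written in the notation shown above into a structured form. The `Action` class, the `parse_action` helper, and the choice to reject unknown action names are illustrative assumptions, not the interface of any particular LAM.

```python
import ast
from dataclasses import dataclass, field

# Action names taken from the list above; anything else is rejected.
ALLOWED_ACTIONS = {"CLICK", "TYPE", "SCROLL", "DRAG_AND_DROP"}


@dataclass(frozen=True)
class Action:
    """Hypothetical structured form of one model-generated action."""
    name: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


def _value(node):
    # Bare words such as `down` or `source` are kept as plain strings.
    if isinstance(node, ast.Name):
        return node.id
    # Numbers and quoted strings are evaluated as literals only,
    # so no model-generated code is ever executed.
    return ast.literal_eval(node)


def parse_action(text):
    """Parse a string like 'CLICK(x=450, y=300)' into an Action."""
    try:
        call = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not a valid action: {text!r}") from exc
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
        raise ValueError(f"not a valid action: {text!r}")
    if call.func.id not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {call.func.id}")
    if any(kw.arg is None for kw in call.keywords):
        raise ValueError("'**' arguments are not supported")
    args = tuple(_value(a) for a in call.args)
    kwargs = {kw.arg: _value(kw.value) for kw in call.keywords}
    return Action(call.func.id, args, kwargs)
```

For example, `parse_action('CLICK(x=450, y=300)')` yields an `Action` named `CLICK` with `kwargs` equal to `{"x": 450, "y": 300}`, while a string naming an action outside the allowed set raises `ValueError` instead of reaching the mouse and keyboard.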

Neuro-Symbolic Approach

A pure LLM has no way of knowing the coordinates of a button on a website. LAMs therefore typically combine a Vision Encoder (the eye that sees the screen) with a Symbolic Planner (logic that reads the DOM tree or accessibility tags).

2. Vision and Clicking: The "Grounding" Problem

For a model to click the "Submit" button on the screen, it needs to connect the instruction to the pixels where that button is drawn. This is called Visual Grounding. Two main techniques are used:

A. Set-of-Mark (SoM) Prompting

Before the screenshot is passed to the model, an intermediate layer overlays a bounding box with a numbered label on every interactive element (buttons, links) on the screen.

Model sees the screen: "Box number 3 is the 'Settings' button."
Generates action: `CLICK(box_id=3)`
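The bookkeeping behind Set-of-Mark can be sketched in a few lines of Python. The sketch assumes the interactive elements and their pixel bounding boxes have already been detected; the `Element` class and the helper names are hypothetical, and drawing the labels onto the image is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical interactive element with a pixel bounding box."""
    label: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


def assign_marks(elements):
    """Number elements from 1 in reading order (top to bottom, left to right)."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    return {box_id: e for box_id, e in enumerate(ordered, start=1)}


def describe_marks(marks):
    """Text legend that could accompany the annotated screenshot."""
    return "\n".join(f"[{box_id}] {e.label}" for box_id, e in marks.items())


def resolve_click(marks, box_id):
    """Turn CLICK(box_id=N) into the pixel at the center of box N."""
    e = marks[box_id]
    return (e.x + e.width // 2, e.y + e.height // 2)


elements = [
    Element("Settings", 400, 10, 100, 40),
    Element("Home", 10, 10, 80, 40),
    Element("Search", 200, 10, 120, 40),
]
marks = assign_marks(elements)
click_point = resolve_click(marks, 3)  # box 3 is "Settings" here
```

The design point is that the model only has to name a small integer; the intermediate layer, not the model, is responsible for turning that integer into pixel coordinates.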

B. Pixel-Based Navigation (Fovea Vision)

The model predicts coordinates by focusing on a specific region of the screen (a high-resolution crop), much as the human eye does with its fovea. This method is more precise but requires more processing power. High-end GPUs such as the RTX 5090 help keep the added latency of processing high-resolution (4K) screenshots low.
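One way to picture the crop step is as coordinate arithmetic: choose a window around a coarse guess, let the model predict a position inside that window, then map the prediction back to full-screen pixels. The sketch below assumes the model returns a position normalized to the range 0 to 1 within the crop; the function names and the 512-pixel crop size are illustrative choices.

```python
def crop_box(center_x, center_y, crop_w, crop_h, screen_w, screen_h):
    """Return (left, top, right, bottom) of a crop around a point.

    The window is shifted as needed so it stays fully on screen.
    """
    crop_w = min(crop_w, screen_w)
    crop_h = min(crop_h, screen_h)
    left = min(max(center_x - crop_w // 2, 0), screen_w - crop_w)
    top = min(max(center_y - crop_h // 2, 0), screen_h - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def crop_to_screen(box, rel_x, rel_y):
    """Map a normalized (0..1) position inside the crop to screen pixels."""
    left, top, right, bottom = box
    return (
        left + round(rel_x * (right - left)),
        top + round(rel_y * (bottom - top)),
    )


# Coarse guess near the top-right corner of a 4K (3840x2160) screen.
box = crop_box(3800, 100, 512, 512, 3840, 2160)
# Assumed model output: the target sits at the center of the crop.
target = crop_to_screen(box, 0.5, 0.5)
```

Because the window is clamped to the screen, a guess near an edge still produces a full-size crop, which keeps the input shape seen by the model constant.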

3. UI Navigation and DOM Tree

Vision processing alone is sometimes insufficient, for example when two icons look almost identical. In such cases, LAMs draw on the web browser's Accessibility Tree or a simplified HTML DOM structure.

The model works in "Dual-Modality" by mapping visual data to the code structure (HTML):

"I see a red button in the top right of the image. In the HTML code, there is an element tagged `class='delete-btn'`. So, this is the delete button."

This hybrid structure reduces the risk of hallucination and makes it less likely that the agent clicks the wrong place.
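A simple way to cross-check the two modalities is to compare the box the vision side detected with the on-screen boxes of DOM elements and accept the match only when they overlap strongly. The sketch below uses intersection over union (IoU) for that comparison; this is one plausible matching rule rather than a description of how any specific LAM does it, and the `Box` class, the node format, and the 0.5 threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


def area(box):
    return (box.right - box.left) * (box.bottom - box.top)


def iou(a, b):
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.right, b.right) - max(a.left, b.left)
    inter_h = min(a.bottom, b.bottom) - max(a.top, b.top)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    return inter / (area(a) + area(b) - inter)


def match_dom_node(visual_box, dom_nodes, min_iou=0.5):
    """Return the attributes of the best-overlapping DOM node, or None.

    dom_nodes is a list of (attributes, Box) pairs, an assumed format.
    """
    best = max(dom_nodes, key=lambda node: iou(visual_box, node[1]), default=None)
    if best is None or iou(visual_box, best[1]) < min_iou:
        return None
    return best[0]


# The "red button in the top right" as seen by the vision side.
seen = Box(900, 10, 980, 50)
dom_nodes = [
    ({"class": "save-btn"}, Box(800, 10, 880, 50)),
    ({"class": "delete-btn"}, Box(902, 12, 978, 48)),
]
match = match_dom_node(seen, dom_nodes)
```

Returning `None` when nothing overlaps enough gives the agent a natural point to stop and re-observe instead of clicking on a guess.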

4. Security and the "Sandbox" Necessity

Giving an AI full control of your computer (permission to delete files, send emails, access bank accounts) is a major security risk. Computer Use agents should therefore not be run directly on the main operating system (Host OS). Common safeguards include:

Dockerized Environments: The agent runs inside an isolated Docker container on a virtual desktop.
Human-in-the-Loop: Before critical actions (money transfer, sending emails), the agent pauses and asks for user approval ("Do you confirm this action?").
Action Whitelisting: The agent is only allowed to access specific domains (e.g., `linkedin.com`, `crm.mycompany.com`).
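The last two safeguards can be combined into a single policy gate that every action passes through before execution. The sketch below is a minimal illustration: the critical action names (`TRANSFER_MONEY`, `SEND_EMAIL`) and the `authorize` function are hypothetical, and `confirm` stands for whatever mechanism asks the user for approval.

```python
from urllib.parse import urlparse

# Domains from the example above.
ALLOWED_DOMAINS = {"linkedin.com", "crm.mycompany.com"}
# Hypothetical names for actions that need human approval.
CRITICAL_ACTIONS = {"TRANSFER_MONEY", "SEND_EMAIL"}


def domain_allowed(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def authorize(action_name, url, confirm):
    """Decide whether an action may run.

    confirm is a callable that shows a prompt to the user and returns
    True or False. It is only called for critical actions.
    """
    if not domain_allowed(url):
        return False
    if action_name in CRITICAL_ACTIONS:
        prompt = f"Do you confirm this action? {action_name} on {url}"
        return bool(confirm(prompt))
    return True
```

Matching on the parsed hostname rather than on a substring of the URL matters here: a check like `"linkedin.com" in url` would also accept `https://linkedin.com.evil.example/`.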

Conclusion: RPA 2.0 and Autonomous Workflows

Large Action Models are beginning to replace rule-based RPA (Robotic Process Automation) bots. An RPA bot breaks when a website's design changes, because it depends on fixed selectors or coordinates. A LAM, however, can see the "Login" button in its new location and continue the workflow even after the website changes.

As BRIQ Mind, we combine modern tools like n8n with these visual agents to bring even your non-API "Legacy" systems into the modern world of automation.