Browser Agents Explained: How AI Controls the Web Like a Human
Emma Liu
Tech journalist covering the AI agent ecosystem and startups.
The Rise of Computer-Using AI
Every developer has written a web scraper. Every developer has also watched that scraper break when a class name changed, a CAPTCHA appeared, or a site deployed a new JavaScript framework. The promise of browser agents is fundamentally different: instead of brittle selectors and hardcoded flows, you hand an AI model a browser and a goal. It figures out the rest.
This isn't theoretical anymore. In late 2024 and into 2025, we've seen a genuine inflection point. Anthropic shipped Computer Use in Claude, Browserbase raised significant funding to build cloud infrastructure purpose-built for AI agents, and a constellation of open-source projects has made Playwright-powered agents accessible to any developer with an API key.
But "accessible" and "production-ready" are different things. This article breaks down how each approach actually works, where they shine, where they fail, and what the real engineering tradeoffs look like.
How Browser Agents Actually Work
Before diving into specific tools, it's worth understanding the core architecture most browser agents share.
At a high level, a browser agent is a loop:
- Observe — capture the current state of the browser (screenshot, DOM snapshot, accessibility tree, or some combination)
- Reason — send that state to an LLM along with the goal, asking it to decide the next action
- Act — execute the chosen action (click, type, scroll, navigate)
- Repeat — until the goal is achieved or a failure condition is met
The critical differences between approaches lie in how they observe, what actions they can take, and where the browser runs.
┌─────────────────────────────────────────────┐
│ Agent Loop │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────┐ │
│ │ Observe │───▶│ Reason │───▶│ Act │ │
│ │(capture) │ │ (LLM) │ │(exec)│ │
│ └──────────┘ └──────────┘ └──────┘ │
│ ▲ │ │
│ └────────────────────────────────┘ │
│ │
│ Termination: goal achieved | max steps │
└─────────────────────────────────────────────┘
Anthropic Computer Use
What It Is
Anthropic's Computer Use is a beta feature that gives Claude the ability to interact with a computer desktop — including a browser — by viewing screenshots and issuing mouse/keyboard commands. It's not a browser-specific tool; it's a general-purpose computer control system that happens to be very useful for browser automation.
How It Works
Claude receives screenshots of the desktop environment and can issue three primary tool calls:
- computer — move the mouse, click, type, take screenshots, scroll
- text_editor — view and edit files (for code-related tasks)
- bash — execute shell commands
For browser tasks specifically, Claude observes screenshots at a configurable resolution (typically 1024×768 or 1280×800), identifies UI elements visually, and issues coordinate-based commands.
```python
import anthropic
import base64
import subprocess

import pyautogui

client = anthropic.Anthropic()

# Start a browser
subprocess.Popen(["chromium", "--no-sandbox", "--window-size=1280,800"])

def take_screenshot() -> str:
    screenshot = pyautogui.screenshot()
    screenshot.save("/tmp/screen.png")
    with open("/tmp/screen.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def computer_use_loop(goal: str, max_turns: int = 20):
    messages = [{"role": "user", "content": goal}]
    for turn in range(max_turns):
        screenshot_b64 = take_screenshot()
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[
                {
                    "type": "computer_20250124",
                    "name": "computer",
                    "display_width_px": 1280,
                    "display_height_px": 800,
                    "display_number": 0,
                }
            ],
            messages=messages + [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/png",
                                "data": screenshot_b64,
                            },
                        }
                    ],
                }
            ],
        )
        messages.append({"role": "assistant", "content": response.content})

        # Process tool use blocks, then report all results back in one user turn
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_action(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": "Action executed.",
                })
            elif block.type == "text":
                print(f"Claude: {block.text}")
                if "task_complete" in block.text.lower():
                    return
        if tool_results:
            messages.append({"role": "user", "content": tool_results})

def execute_action(tool_name: str, action: dict):
    if action["action"] == "click":
        pyautogui.click(action["coordinate"][0], action["coordinate"][1])
    elif action["action"] == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif action["action"] == "key":
        pyautogui.press(action["key"])
    elif action["action"] == "screenshot":
        pass  # The next loop iteration captures a fresh screenshot
```
Strengths
- Visual understanding: Claude doesn't need selectors or DOM access. It sees what a human sees. This makes it remarkably resilient to site redesigns and dynamic layouts.
- General-purpose: The same system can navigate a website, fill a PDF form in a desktop app, or operate a spreadsheet. It's not limited to browsers.
- No site-specific configuration: You don't need to map out a site's structure. Give it a URL and a goal.
Limitations
This is where honest assessment matters:
- Latency: Each turn requires a screenshot capture, image upload, LLM inference, and action execution. A single "click" operation takes 2–5 seconds. A 15-step workflow burns 30–75 seconds minimum.
- Cost: Sending screenshots as images consumes significant tokens. A typical screenshot at reasonable resolution costs roughly 1,600 tokens per image. A 20-turn interaction easily runs $0.50–$2.00.
- Coordinate precision: Clicking by pixel coordinates is inherently fragile. Small UI elements, dropdown menus, and overlapping elements cause frequent mis-clicks. Claude is good but not perfect at estimating coordinates.
- No DOM access: This is both a strength (resilience) and a weakness (no ability to inspect hidden elements, read data attributes, or interact with elements that aren't visually rendered).
- Beta status: The tool use format has already changed versions (`computer_20241022` → `computer_20250124`). API surfaces are still shifting.
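The cost point is easy to sanity-check with back-of-envelope arithmetic. The key driver is that conversation history is resent every turn, so image input grows quadratically with turn count. The token counts and per-token prices below are illustrative assumptions, not published rates:

```python
def estimate_run_cost(
    turns: int,
    tokens_per_screenshot: int = 1_600,    # rough figure for a ~1280x800 screenshot
    output_tokens_per_turn: int = 200,
    input_price_per_mtok: float = 3.00,    # assumed $ per million input tokens
    output_price_per_mtok: float = 15.00,  # assumed $ per million output tokens
) -> float:
    """Back-of-envelope cost for a screenshot-driven agent run.

    Because the conversation history is resent each turn, turn t carries
    roughly t screenshots, so image input grows quadratically with turns.
    """
    screenshots_sent = turns * (turns + 1) // 2
    input_tokens = screenshots_sent * tokens_per_screenshot
    output_tokens = turns * output_tokens_per_turn
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

print(f"${estimate_run_cost(20):.2f}")  # roughly a dollar for a 20-turn run
```

Under these assumptions a 20-turn run lands squarely in the $0.50–$2.00 range cited above, and doubling the turn count roughly quadruples the image cost. Pruning old screenshots from the history is the standard mitigation.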
When to Use It
Computer Use excels at tasks that are exploratory, involve desktop applications, or target sites you can't instrument with traditional tools. It's a poor fit for high-volume data extraction or latency-sensitive workflows.
Browserbase
What It Is
Browserbase provides cloud-hosted browser instances purpose-built for AI agents. Rather than running a browser on your own machine (or in your own Docker container), you get a managed, scalable pool of remote browsers with built-in observability.
Architecture
Browserbase solves a specific infrastructure problem: running headless browsers at scale is painful. Memory leaks, zombie processes, fingerprinting challenges, proxy rotation, and CAPTCHA handling are all operational burdens that Browserbase absorbs.
┌──────────────┐ WebSocket/CDP ┌──────────────────┐
│ Your Agent │◄──────────────────────▶│ Browserbase │
│ (Python/TS) │ │ Cloud Browser │
└──────────────┘ │ ┌────────────┐ │
│ │ Playwright │ │
│ │ instance │ │
│ └────────────┘ │
│ + Proxies │
│ + Captcha solve │
│ + Session mgmt │
└──────────────────┘
Integration with Agent Frameworks
Browserbase integrates cleanly with LangChain, CrewAI, and direct Playwright usage:
```python
from browserbase import Browserbase
from playwright.sync_api import sync_playwright

bb = Browserbase(api_key="your-api-key")

# Create a session with specific configuration
session = bb.sessions.create(
    project_id="your-project-id",
    browser_settings={
        "viewport": {"width": 1280, "height": 800},
        "fingerprint": {
            "devices": ["desktop"],
            "locales": ["en-US"],
            "operatingSystems": ["windows"],
        },
    },
)

# Connect Playwright to the remote browser
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(session.connect_url)
    context = browser.contexts[0]
    page = context.pages[0]

    page.goto("https://example.com/login")

    # Your agent logic here — combine with an LLM
    # for intelligent element interaction
    page.wait_for_selector("#email")
    page.fill("#email", "user@example.com")
    page.fill("#password", "secure-password")
    page.click("button[type='submit']")

    # Session recording is automatic — view in Browserbase dashboard
    content = page.content()
    print(f"Page title: {page.title()}")

    browser.close()
```
Key Advantages
- Session replay: Every browser session is recorded. When your agent fails (and it will), you can replay the session visually to debug what went wrong. This alone is worth the price of admission.
- Fingerprinting and stealth: Browserbase handles browser fingerprint rotation, making agents appear as distinct human users. This is critical for sites with bot detection.
- Proxy infrastructure: Built-in residential and datacenter proxy pools with automatic rotation.
- Live view: Real-time debugging of running sessions through a web dashboard.
- Scalability: Spin up hundreds of concurrent browser sessions without managing infrastructure.
Honest Assessment
Browserbase is infrastructure, not intelligence. It gives you a reliable, scalable browser — but you still need to build the agent logic on top. If you're connecting it to an LLM for decision-making, you're still responsible for the observe-reason-act loop, error handling, and goal decomposition.
Pricing is usage-based (sessions + compute time), which can add up for high-volume workloads. For a team running thousands of browser sessions daily, the cost can easily reach hundreds or thousands of dollars per month.
Playwright-Based Agents
The Foundation
Playwright has become the de facto standard for browser automation, and it's the substrate most browser agents are built on. Its CDP (Chrome DevTools Protocol) integration, cross-browser support, and robust API make it the natural choice.
The interesting evolution is combining Playwright's deterministic capabilities with LLM reasoning. The result: agents that can handle the messy, unpredictable parts of web interaction while still leveraging precise selectors when available.
Stagehand: The Playwright Agent Framework
Stagehand (from the Browserbase team) is worth examining specifically because it demonstrates the current best-practice pattern for Playwright-based agents:
```typescript
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({
  modelName: "claude-sonnet-4-20250514",
  modelClientOptions: {
    apiKey: process.env.ANTHROPIC_API_KEY,
  },
});
await stagehand.init();

const page = stagehand.page;

// Navigate to a site
await page.goto("https://news.ycombinator.com");

// Use natural language to describe what to do.
// Stagehand uses the LLM to find the right element and act on it.
const topStory = await page.extract({
  instruction: "Extract the title and URL of the top story",
  schema: z.object({
    title: z.string(),
    url: z.string(),
  }),
});

console.log(topStory);
// { title: "Show HN: ...", url: "https://..." }

// Natural language action — no selectors needed
await page.act("click on the top story link");

// Wait for the page to load
await page.waitForLoadState("domcontentloaded");

// Extract structured data from the new page
const articleContent = await page.extract({
  instruction: "Extract the main article text and author",
  schema: z.object({
    author: z.string(),
    content: z.string(),
  }),
});

await stagehand.close();
```
The Three Primitives
Stagehand (and similar frameworks) expose three core operations that map cleanly to the agent loop:
| Primitive | What It Does | How It Works Internally |
|---|---|---|
| `act()` | Perform an action on the page | LLM analyzes DOM snapshot, identifies target element, generates Playwright selector or action |
| `extract()` | Pull structured data from the page | LLM receives page content, returns data matching a provided schema |
| `observe()` | Identify available actions | LLM analyzes current page state, returns list of actionable elements |
This is more structured than pure screenshot-based control. Instead of pixel coordinates, the LLM works with the DOM, which is both more precise and more reliable.
Building a Custom Playwright Agent
If you want to build this pattern from scratch without a framework, here's a more detailed implementation:
```python
import json

import openai
from playwright.sync_api import sync_playwright

client = openai.OpenAI()

def get_page_state(page):
    """Capture the current page state for the LLM."""
    # Get the accessibility tree — more structured than raw HTML
    snapshot = page.accessibility.snapshot()
    # Also get the URL and title for context
    return {
        "url": page.url,
        "title": page.title(),
        "accessibility_tree": snapshot,
    }

def decide_action(page_state: dict, goal: str, history: list) -> dict:
    """Ask the LLM what to do next."""
    system_prompt = """You are a browser automation agent. Given the current
page state and a goal, decide the next action.

Available actions:
- navigate(url): Go to a URL
- click(target): Click an element (describe it by its text/role/label)
- type(target, text): Type text into an input field
- scroll(direction): Scroll up or down
- extract(query): Extract information from the page
- done(result): Task is complete, return the result

Respond with JSON: {"action": "...", "params": {...}, "reasoning": "..."}
"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"""Goal: {goal}

Current page state:
URL: {page_state['url']}
Title: {page_state['title']}
Accessibility tree: {json.dumps(page_state['accessibility_tree'], indent=2)[:8000]}

Previous actions taken: {json.dumps(history[-5:])}

What is the next action?"""},
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def execute_browser_action(page, action: dict):
    """Execute an action in the browser."""
    match action["action"]:
        case "navigate":
            page.goto(action["params"]["url"])
        case "click":
            # Use LLM-identified text to find the element
            page.get_by_text(action["params"]["target"], exact=False).first.click()
        case "type":
            page.get_by_label(action["params"]["target"]).fill(action["params"]["text"])
        case "scroll":
            direction = action["params"].get("direction", "down")
            page.mouse.wheel(0, 500 if direction == "down" else -500)
        case "extract":
            return page.inner_text("body")

def run_agent(goal: str, start_url: str, max_steps: int = 15):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(start_url)

        history = []
        for step in range(max_steps):
            state = get_page_state(page)
            action = decide_action(state, goal, history)
            print(f"Step {step + 1}: {action['action']} — {action.get('reasoning', '')}")

            if action["action"] == "done":
                print(f"Result: {action['params'].get('result', 'Complete')}")
                break

            execute_browser_action(page, action)
            history.append({"step": step + 1, "action": action})

            # Brief pause to let pages settle; don't fail the run on slow networks
            try:
                page.wait_for_load_state("networkidle", timeout=5000)
            except Exception:
                pass

        browser.close()
        return history

# Usage
run_agent(
    goal="Find the current price of NVIDIA stock on Google Finance",
    start_url="https://www.google.com",
)
```
Why Accessibility Trees Beat Screenshots
The key insight in Playwright-based agents is using the browser's accessibility tree instead of (or alongside) screenshots:
```python
# Accessibility tree gives you structured data like:
{
    "role": "WebArea",
    "name": "Example Domain",
    "children": [
        {"role": "heading", "name": "Example Domain", "level": 1},
        {"role": "paragraph", "name": "This domain is for use in illustrative examples..."},
        {"role": "link", "name": "More information...", "url": "https://www.iana.org/..."},
    ],
}
```
Compare this to a screenshot where the LLM must visually parse text, estimate coordinates, and hope the resolution is sufficient. The accessibility tree is deterministic, compact, and directly mappable to Playwright actions.
The tradeoff: accessibility trees can be incomplete. Many modern SPAs have poor accessibility markup. React components without proper ARIA labels, custom dropdowns, and canvas-based UIs produce sparse or empty trees. In these cases, falling back to screenshots or raw DOM parsing becomes necessary.
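A pragmatic hybrid is to check how much the tree actually captured before deciding what to send to the model. Here's a sketch of that fallback, assuming a Playwright `page` object; the node-count threshold is an arbitrary heuristic you'd tune per site:

```python
def observe_page(page, min_nodes: int = 10) -> dict:
    """Prefer the accessibility tree; fall back to a screenshot when it's sparse."""
    snapshot = page.accessibility.snapshot()

    def count_nodes(node) -> int:
        if not node:
            return 0
        return 1 + sum(count_nodes(c) for c in node.get("children", []))

    if count_nodes(snapshot) >= min_nodes:
        return {"mode": "accessibility", "tree": snapshot}

    # Sparse tree (canvas UI, missing ARIA labels): send pixels instead
    return {"mode": "screenshot", "png": page.screenshot(full_page=False)}
```

The agent's reasoning step then branches on `mode`: structured DOM prompting when the tree is rich, vision prompting when it isn't.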
Practical Applications
Application 1: Automated QA Testing
Browser agents can write and execute test cases from natural language specifications:
```python
from playwright.sync_api import sync_playwright

def run_qa_test(spec: str, base_url: str):
    """Run a QA test from a natural language specification.

    Assumes two helpers exist elsewhere: nl_to_action() (an LLM call that
    converts a natural-language step into a Playwright action) and
    execute_action() (which performs it).
    """
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(base_url)

        results = []
        steps = spec.split("\n")
        for step in steps:
            if not step.strip():
                continue
            try:
                # Use the LLM to convert natural language to Playwright actions
                action = nl_to_action(step, page)
                execute_action(page, action)
                results.append({"step": step, "status": "pass"})
            except Exception as e:
                results.append({"step": step, "status": "fail", "error": str(e)})
                # Take a screenshot for debugging
                page.screenshot(path=f"fail_{len(results)}.png")

        browser.close()
        return results

# Usage
spec = """
Navigate to the login page
Enter 'test@example.com' in the email field
Enter 'password123' in the password field
Click the sign in button
Verify that the dashboard heading is visible
Click the 'New Project' button
Verify a modal dialog appears
"""
```
This is genuinely useful for exploratory testing but unreliable for regression testing where deterministic, reproducible results are non-negotiable. Use agents for discovery, traditional Playwright scripts for CI.
Application 2: Lead Generation and Data Extraction
```python
def extract_company_data(company_url: str) -> dict:
    """Extract structured company information from their website."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(company_url)

        # Try common info pages; stop at the first one that loads
        for path in ["/about", "/team", "/contact", "/careers"]:
            try:
                page.goto(f"{company_url.rstrip('/')}{path}", timeout=5000)
                break
            except Exception:
                continue

        content = page.content()
        browser.close()

    # Use an LLM to extract structured data from the page content
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Extract the following from this webpage:
- Company name
- Description (1-2 sentences)
- Founders/leadership names
- Employee count (if available)
- Location
- Industry
- Key products/services

Page content:
{content[:15000]}

Return as JSON.""",
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Application 3: Form Filling Automation
One of the most practical applications: automating repetitive form submissions across different sites with varying layouts.
```python
def fill_form_intelligently(page, form_data: dict):
    """Use an LLM to map form data to form fields, even on unfamiliar sites."""
    # Get all form elements
    inputs = page.query_selector_all("input, select, textarea")
    form_elements = []
    for inp in inputs:
        attrs = {
            "tag": inp.evaluate("el => el.tagName"),
            "type": inp.get_attribute("type") or "",
            "name": inp.get_attribute("name") or "",
            "id": inp.get_attribute("id") or "",
            "placeholder": inp.get_attribute("placeholder") or "",
            "label": inp.evaluate("""el => {
                const label = el.labels?.[0];
                return label ? label.textContent.trim() : '';
            }"""),
        }
        form_elements.append(attrs)

    # Ask the LLM to map our data to form fields
    mapping = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Map this form data to the available form fields.

Form data: {json.dumps(form_data)}

Form fields: {json.dumps(form_elements, indent=2)}

Return JSON: {{"mappings": [{{"field_id": "...", "field_name": "...", "value": "...", "action": "fill|select|check"}}]}}
Only map fields where you have high confidence.""",
        }],
        response_format={"type": "json_object"},
    )
    mappings = json.loads(mapping.choices[0].message.content)

    for m in mappings.get("mappings", []):
        # Prefer the id selector; fall back to the name attribute
        if m.get("field_id"):
            selector = f"#{m['field_id']}"
        else:
            selector = f"[name='{m.get('field_name', '')}']"
        try:
            if m["action"] == "fill":
                page.fill(selector, m["value"])
            elif m["action"] == "select":
                page.select_option(selector, m["value"])
            elif m["action"] == "check":
                page.check(selector)
        except Exception as e:
            print(f"Failed to fill {selector}: {e}")
```
Comparison Matrix
| Capability | Anthropic Computer Use | Browserbase + Playwright | Custom Playwright Agent |
|---|---|---|---|
| Observation method | Screenshots | DOM / Accessibility tree | DOM / Accessibility tree |
| Action precision | Pixel coordinates | Selectors / locators | Selectors / locators |
| Latency per action | 2–5 seconds | 100–500ms + LLM time | 100–500ms + LLM time |
| Cost per interaction | $0.50–$2.00 | $0.01–$0.10 + infra | $0.01–$0.10 |
| Handles poor accessibility | Yes (visual) | Partially (DOM fallback) | Partially (DOM fallback) |
| Scales to thousands of sessions | No (local desktop) | Yes (managed infra) | Yes (self-managed) |
| Works with desktop apps | Yes | No | No |
| Debugging | Manual screenshots | Session replay + dashboard | Manual |
| Setup complexity | Low | Low–Medium | Medium–High |
The Honest Limitations
Every browser agent approach shares fundamental challenges:
Reliability is the bottleneck. Even the best browser agents succeed 70–85% of the time on straightforward tasks. Multi-step workflows with 10+ interactions see compounding failure rates. A 90% per-step success rate over 10 steps gives you only a 35% overall success rate.
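That compounding math also shows why per-step retries pay off so heavily. The numbers below are just the arithmetic, not a benchmark, and the independence assumption is optimistic:

```python
def workflow_success(per_step: float, steps: int, retries: int = 0) -> float:
    """Probability a multi-step workflow succeeds end to end.

    With k retries, a step fails only if all k+1 attempts fail
    (assuming independent attempts — optimistic, but directionally right).
    """
    step_success = 1 - (1 - per_step) ** (retries + 1)
    return step_success ** steps

print(f"{workflow_success(0.90, 10):.0%}")             # 35%: no retries
print(f"{workflow_success(0.90, 10, retries=1):.0%}")  # 90%: one retry per step
```

A single retry per step lifts a 10-step workflow from roughly one-in-three to nine-in-ten, which is why most production agent loops spend their budget on verification and retry logic rather than on a smarter first attempt.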
Dynamic content is hard. SPAs with complex state management, infinite scroll, shadow DOM, iframes, and WebSocket-driven updates create observation challenges that no approach has fully solved.
Anti-bot measures are escalating. Cloudflare, DataDome, PerimeterX, and similar services are actively fingerprinting and blocking automated browsers. Browserbase helps with this, but it's an arms race.
Cost scales linearly. Every LLM call costs money and time. A 20-step agent workflow costs 20× a single API call. At scale, this dwarfs traditional scraping costs.
Evaluation is unsolved. How do you know your agent actually did the right thing? Unlike deterministic scripts, LLM-based agents can "succeed" while subtly doing the wrong thing — clicking the wrong button, extracting stale data, or filling fields incorrectly.
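A common partial mitigation is post-hoc verification: after the agent declares success, a second LLM call judges whether the final page state is consistent with the goal. A minimal sketch, assuming an OpenAI-style chat client is passed in (the prompt shape and function name are illustrative):

```python
import json

def verify_outcome(client, goal: str, final_url: str, final_page_text: str) -> dict:
    """Post-hoc check: ask a second model whether the agent actually met the goal.

    The judge is itself an LLM, so treat its answer as a confidence
    signal for routing to human review, not as ground truth.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""An automated agent was given this goal: {goal}

It finished on URL: {final_url}
Final page text (truncated): {final_page_text[:6000]}

Did the agent plausibly achieve the goal? Respond with JSON:
{{"achieved": true/false, "confidence": 0.0-1.0, "reason": "..."}}""",
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```

This catches some silent failures (the wrong button, stale data), but it shifts rather than solves the evaluation problem.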
What's Coming Next
The trajectory is clear: browser agents will get faster, cheaper, and more reliable. Multi-modal models are improving at visual understanding. Specialized models (like those fine-tuned on DOM interaction) will reduce latency and cost. Frameworks will converge on standard patterns.
But the fundamental architecture — observe, reason, act — will persist. The tools will change. The loop won't.
If you're building with browser agents today, start with Playwright-based approaches using accessibility trees. Use Browserbase if you need scale or stealth. Reserve Computer Use for tasks that genuinely require visual understanding or desktop interaction. And budget for failure handling — because right now, that's where most of your engineering time will go.