Browser Agents Explained: How AI Controls the Web Like a Human

Emma Liu

Tech journalist covering the AI agent ecosystem and startups.

February 21, 2026 · 16 min read


The Rise of Computer-Using AI

Every developer has written a web scraper. Every developer has also watched that scraper break when a class name changed, a CAPTCHA appeared, or a site deployed a new JavaScript framework. The promise of browser agents is fundamentally different: instead of brittle selectors and hardcoded flows, you hand an AI model a browser and a goal. It figures out the rest.

This isn't theoretical anymore. In late 2024 and into 2025, we've seen a genuine inflection point. Anthropic shipped Computer Use in Claude, Browserbase raised significant funding to build cloud infrastructure purpose-built for AI agents, and a constellation of open-source projects has made Playwright-powered agents accessible to any developer with an API key.

But "accessible" and "production-ready" are different things. This article breaks down how each approach actually works, where they shine, where they fail, and what the real engineering tradeoffs look like.


How Browser Agents Actually Work

Before diving into specific tools, it's worth understanding the core architecture most browser agents share.

At a high level, a browser agent is a loop:

  1. Observe — capture the current state of the browser (screenshot, DOM snapshot, accessibility tree, or some combination)
  2. Reason — send that state to an LLM along with the goal, asking it to decide the next action
  3. Act — execute the chosen action (click, type, scroll, navigate)
  4. Repeat — until the goal is achieved or a failure condition is met

The critical differences between approaches lie in how they observe, what actions they can take, and where the browser runs.

┌────────────────────────────────────────────┐
│                 Agent Loop                 │
│                                            │
│   ┌──────────┐    ┌──────────┐    ┌──────┐ │
│   │ Observe  │───▶│  Reason  │───▶│ Act  │ │
│   │(capture) │    │  (LLM)   │    │(exec)│ │
│   └──────────┘    └──────────┘    └──────┘ │
│        ▲                              │    │
│        └──────────────────────────────┘    │
│                                            │
│   Termination: goal achieved | max steps   │
└────────────────────────────────────────────┘
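
In code, the skeleton is only a few lines. Here is a minimal sketch — observe, decide, and execute are placeholders that each approach below fills in differently:

def agent_loop(goal: str, max_steps: int = 20):
    history = []
    for _ in range(max_steps):
        state = observe()                      # screenshot, DOM, or a11y tree
        action = decide(goal, state, history)  # one LLM call per step
        if action["type"] == "done":
            return action["result"]
        execute(action)                        # click, type, scroll, navigate
        history.append(action)
    raise TimeoutError("max steps reached without achieving the goal")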

Anthropic Computer Use

What It Is

Anthropic's Computer Use is a beta feature that gives Claude the ability to interact with a computer desktop — including a browser — by viewing screenshots and issuing mouse/keyboard commands. It's not a browser-specific tool; it's a general-purpose computer control system that happens to be very useful for browser automation.

How It Works

Claude receives screenshots of the desktop environment and can issue three primary tool calls:

  • computer — move the mouse, click, type, take screenshots, scroll
  • text_editor — view and edit files (for code-related tasks)
  • bash — execute shell commands

For browser tasks specifically, Claude observes screenshots at a configurable resolution (typically 1024×768 or 1280×800), identifies UI elements visually, and issues coordinate-based commands.

import anthropic
import base64
import subprocess

import pyautogui

client = anthropic.Anthropic()

# Start a browser
subprocess.Popen(["chromium", "--no-sandbox", "--window-size=1280,800"])

def take_screenshot():
    screenshot = pyautogui.screenshot()
    screenshot.save("/tmp/screen.png")
    with open("/tmp/screen.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def screenshot_block():
    """Wrap the current screen as an image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": take_screenshot(),
        },
    }

def computer_use_loop(goal: str, max_turns: int = 20):
    # First turn: the goal plus an initial screenshot
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": goal}, screenshot_block()],
    }]

    for turn in range(max_turns):
        # Computer Use is a beta feature: use the beta client and flag
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            betas=["computer-use-2025-01-24"],
            tools=[
                {
                    "type": "computer_20250124",
                    "name": "computer",
                    "display_width_px": 1280,
                    "display_height_px": 800,
                    "display_number": 0,
                }
            ],
            messages=messages,
        )

        # Append the assistant turn once, then answer every tool_use
        # block with a matching tool_result carrying a fresh screenshot
        messages.append({"role": "assistant", "content": response.content})

        tool_results = []
        for block in response.content:
            if block.type == "text":
                print(f"Claude: {block.text}")
            elif block.type == "tool_use":
                execute_action(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [screenshot_block()],
                })

        if not tool_results:
            return  # Claude requested no further actions — task is done
        messages.append({"role": "user", "content": tool_results})

def execute_action(tool_name, action):
    # Action names follow the computer_20250124 tool schema
    if action["action"] == "left_click":
        pyautogui.click(action["coordinate"][0], action["coordinate"][1])
    elif action["action"] == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif action["action"] == "key":
        # "text" holds an xdotool-style key name like "Return"; a production
        # harness would also translate combos such as "ctrl+s"
        pyautogui.press(action["text"].lower())
    elif action["action"] == "screenshot":
        pass  # Every tool_result already includes a fresh screenshot

Strengths

  • Visual understanding: Claude doesn't need selectors or DOM access. It sees what a human sees. This makes it remarkably resilient to site redesigns and dynamic layouts.
  • General-purpose: The same system can navigate a website, fill a PDF form in a desktop app, or operate a spreadsheet. It's not limited to browsers.
  • No site-specific configuration: You don't need to map out a site's structure. Give it a URL and a goal.

Limitations

This is where honest assessment matters:

  • Latency: Each turn requires a screenshot capture, image upload, LLM inference, and action execution. A single "click" operation takes 2–5 seconds. A 15-step workflow burns 30–75 seconds minimum.
  • Cost: Sending screenshots as images consumes significant tokens. A typical screenshot at reasonable resolution costs roughly 1,600 tokens per image. A 20-turn interaction easily runs $0.50–$2.00 (see the back-of-envelope sketch after this list).
  • Coordinate precision: Clicking by pixel coordinates is inherently fragile. Small UI elements, dropdown menus, and overlapping elements cause frequent mis-clicks. Claude is good but not perfect at estimating coordinates.
  • No DOM access: This is both a strength (resilience) and a weakness (no ability to inspect hidden elements, read data attributes, or interact with elements that aren't visually rendered).
  • Beta status: The tool use format has already changed versions (computer_20241022 → computer_20250124). API surfaces are still shifting.
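
To make the cost bullet concrete, here is a back-of-envelope estimate. The per-token price is an assumption for a mid-tier model — check current pricing — and each turn resends the growing conversation, so input tokens scale roughly quadratically with turn count:

TOKENS_PER_SCREENSHOT = 1_600   # ~1280×800 image
PRICE_PER_MTOK_INPUT = 3.00     # USD per million input tokens (assumed)

def session_cost(turns: int, text_tokens_per_turn: int = 500) -> float:
    # Turn t carries roughly t screenshots of accumulated context
    input_tokens = sum(
        t * TOKENS_PER_SCREENSHOT + text_tokens_per_turn
        for t in range(1, turns + 1)
    )
    return input_tokens * PRICE_PER_MTOK_INPUT / 1_000_000

print(f"20-turn session: ~${session_cost(20):.2f}")   # ≈ $1.04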

When to Use It

Computer Use excels at tasks that are exploratory, involve desktop applications, or target sites you can't instrument with traditional tools. It's a poor fit for high-volume data extraction or latency-sensitive workflows.


Browserbase

What It Is

Browserbase provides cloud-hosted browser instances purpose-built for AI agents. Rather than running a browser on your own machine (or in your own Docker container), you get a managed, scalable pool of remote browsers with built-in observability.

Architecture

Browserbase solves a specific infrastructure problem: running headless browsers at scale is painful. Memory leaks, zombie processes, fingerprinting challenges, proxy rotation, and CAPTCHA handling are all operational burdens that Browserbase absorbs.

┌───────────────┐     WebSocket/CDP     ┌──────────────────┐
│  Your Agent   │◄─────────────────────▶│  Browserbase     │
│  (Python/TS)  │                       │  Cloud Browser   │
└───────────────┘                       │  ┌────────────┐  │
                                        │  │ Playwright │  │
                                        │  │ instance   │  │
                                        │  └────────────┘  │
                                        │  + Proxies       │
                                        │  + CAPTCHA solve │
                                        │  + Session mgmt  │
                                        └──────────────────┘

Integration with Agent Frameworks

Browserbase integrates cleanly with LangChain, CrewAI, and direct Playwright usage:

import os

from playwright.sync_api import sync_playwright
from browserbase import Browserbase

bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])

# Create a session with specific configuration
session = bb.sessions.create(
    project_id="your-project-id",
    browser_settings={
        "viewport": {"width": 1280, "height": 800},
        "fingerprint": {
            "devices": ["desktop"],
            "locales": ["en-US"],
            "operatingSystems": ["windows"],
        },
    },
)

# Connect Playwright to the remote browser
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(session.connect_url)
    context = browser.contexts[0]
    page = context.pages[0]

    page.goto("https://example.com/login")

    # Your agent logic here — combine with an LLM
    # for intelligent element interaction
    page.wait_for_selector("#email")
    page.fill("#email", "user@example.com")
    page.fill("#password", "secure-password")
    page.click("button[type='submit']")

    # Session recording is automatic — view in Browserbase dashboard
    content = page.content()
    print(f"Page title: {page.title()}")

    browser.close()

Key Advantages

  • Session replay: Every browser session is recorded. When your agent fails (and it will), you can replay the session visually to debug what went wrong. This alone is worth the price of admission.
  • Fingerprinting and stealth: Browserbase handles browser fingerprint rotation, making agents appear as distinct human users. This is critical for sites with bot detection.
  • Proxy infrastructure: Built-in residential and datacenter proxy pools with automatic rotation.
  • Live view: Real-time debugging of running sessions through a web dashboard.
  • Scalability: Spin up hundreds of concurrent browser sessions without managing infrastructure.

Honest Assessment

Browserbase is infrastructure, not intelligence. It gives you a reliable, scalable browser — but you still need to build the agent logic on top. If you're connecting it to an LLM for decision-making, you're still responsible for the observe-reason-act loop, error handling, and goal decomposition.

Pricing is usage-based (sessions + compute time), which can add up for high-volume workloads. For a team running thousands of browser sessions daily, the cost can easily reach hundreds or thousands of dollars per month.
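
A rough monthly cost model makes the scaling visible. The unit price here is purely an assumption for illustration — check Browserbase's current pricing:

sessions_per_day = 2_000
minutes_per_session = 2
price_per_browser_minute = 0.05   # USD, assumed for illustration

monthly_cost = sessions_per_day * minutes_per_session * price_per_browser_minute * 30
print(f"~${monthly_cost:,.0f}/month")   # ~$6,000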


Playwright-Based Agents

The Foundation

Playwright has become the de facto standard for browser automation, and it's the substrate most browser agents are built on. Its CDP (Chrome DevTools Protocol) integration, cross-browser support, and robust API make it the natural choice.

The interesting evolution is combining Playwright's deterministic capabilities with LLM reasoning. The result: agents that can handle the messy, unpredictable parts of web interaction while still leveraging precise selectors when available.

Stagehand: The Playwright Agent Framework

Stagehand (from the Browserbase team) is worth examining specifically because it demonstrates the current best-practice pattern for Playwright-based agents:

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({
  env: "LOCAL", // or "BROWSERBASE" to run on cloud browsers
  modelName: "claude-sonnet-4-20250514",
  modelClientOptions: {
    apiKey: process.env.ANTHROPIC_API_KEY,
  },
});

await stagehand.init();
const page = stagehand.page;

// Navigate to a site
await page.goto("https://news.ycombinator.com");

// Use natural language to describe what to do
// Stagehand uses the LLM to find the right element and act on it
const topStory = await page.extract({
  instruction: "Extract the title and URL of the top story",
  schema: z.object({
    title: z.string(),
    url: z.string(),
  }),
});

console.log(topStory);
// { title: "Show HN: ...", url: "https://..." }

// Natural language action — no selectors needed
await page.act("click on the top story link");

// Wait for the page to load
await page.waitForLoadState("domcontentloaded");

// Extract structured data from the new page
const articleContent = await page.extract({
  instruction: "Extract the main article text and author",
  schema: z.object({
    author: z.string(),
    content: z.string(),
  }),
});

await stagehand.close();

The Three Primitives

Stagehand (and similar frameworks) expose three core operations that map cleanly to the agent loop:

| Primitive | What It Does | How It Works Internally |
| --- | --- | --- |
| act() | Perform an action on the page | LLM analyzes DOM snapshot, identifies target element, generates Playwright selector or action |
| extract() | Pull structured data from the page | LLM receives page content, returns data matching a provided schema |
| observe() | Identify available actions | LLM analyzes current page state, returns list of actionable elements |

This is a more structured approach than pure screenshot-based agents. Instead of pixel coordinates, the LLM works with the DOM — which is more precise and more reliable.

Building a Custom Playwright Agent

If you want to build this pattern from scratch without a framework, here's a more detailed implementation:

import json
import openai
from playwright.sync_api import sync_playwright

client = openai.OpenAI()

def get_page_state(page):
    """Capture the current page state for the LLM."""
    # Get the accessibility tree — more structured than raw HTML
    snapshot = page.accessibility.snapshot()

    # Also get the URL and title for context
    return {
        "url": page.url,
        "title": page.title(),
        "accessibility_tree": snapshot,
    }

def decide_action(page_state: dict, goal: str, history: list) -> dict:
    """Ask the LLM what to do next."""
    system_prompt = """You are a browser automation agent. Given the current 
page state and a goal, decide the next action.

Available actions:
- navigate(url): Go to a URL
- click(target): Click an element (describe it by its text/role/label)
- type(target, text): Type text into an input field
- scroll(direction): Scroll up or down
- extract(query): Extract information from the page
- done(result): Task is complete, return the result

Respond with JSON: {"action": "...", "params": {...}, "reasoning": "..."}
"""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"""Goal: {goal}

Current page state:
URL: {page_state['url']}
Title: {page_state['title']}
Accessibility tree: {json.dumps(page_state['accessibility_tree'], indent=2)[:8000]}

Previous actions taken: {json.dumps(history[-5:])}

What is the next action?"""}
    ]

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format={"type": "json_object"},
        temperature=0,
    )

    return json.loads(response.choices[0].message.content)

def execute_browser_action(page, action: dict):
    """Execute an action in the browser."""
    match action["action"]:
        case "navigate":
            page.goto(action["params"]["url"])
        case "click":
            # Use LLM-identified text to find the element
            page.get_by_text(action["params"]["target"], exact=False).first.click()
        case "type":
            page.get_by_label(action["params"]["target"]).fill(action["params"]["text"])
        case "scroll":
            direction = action["params"].get("direction", "down")
            page.mouse.wheel(0, 500 if direction == "down" else -500)
        case "extract":
            return page.inner_text("body")

def run_agent(goal: str, start_url: str, max_steps: int = 15):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(start_url)

        history = []

        for step in range(max_steps):
            state = get_page_state(page)
            action = decide_action(state, goal, history)

            print(f"Step {step + 1}: {action['action']} — {action.get('reasoning', '')}")

            if action["action"] == "done":
                print(f"Result: {action['params'].get('result', 'Complete')}")
                break

            result = execute_browser_action(page, action)
            history.append({"step": step + 1, "action": action})

            # Let the page settle; ignore the timeout on chatty pages
            try:
                page.wait_for_load_state("networkidle", timeout=5000)
            except Exception:
                pass

        browser.close()
        return history

# Usage
run_agent(
    goal="Find the current price of NVIDIA stock on Google Finance",
    start_url="https://www.google.com",
)

Why Accessibility Trees Beat Screenshots

The key insight in Playwright-based agents is using the browser's accessibility tree instead of (or alongside) screenshots:

# Accessibility tree gives you structured data like:
{
    "role": "WebArea",
    "name": "Example Domain",
    "children": [
        {"role": "heading", "name": "Example Domain", "level": 1},
        {"role": "paragraph", "name": "This domain is for use in illustrative examples..."},
        {"role": "link", "name": "More information...", "url": "https://www.iana.org/..."}
    ]
}

Compare this to a screenshot where the LLM must visually parse text, estimate coordinates, and hope the resolution is sufficient. The accessibility tree is deterministic, compact, and directly mappable to Playwright actions.

The tradeoff: accessibility trees can be incomplete. Many modern SPAs have poor accessibility markup. React components without proper ARIA labels, custom dropdowns, and canvas-based UIs produce sparse or empty trees. In these cases, falling back to screenshots or raw DOM parsing becomes necessary.
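
A minimal sketch of that fallback logic — the node-count threshold is arbitrary and would need tuning per site:

from playwright.sync_api import Page

def count_nodes(node) -> int:
    if not node:
        return 0
    return 1 + sum(count_nodes(child) for child in node.get("children", []))

def observe(page: Page) -> dict:
    tree = page.accessibility.snapshot()
    if count_nodes(tree) >= 10:
        # Tree is rich enough to reason over — cheap and precise
        return {"mode": "a11y", "data": tree}
    # Sparse tree (canvas UI, missing ARIA) — fall back to pixels
    return {"mode": "screenshot", "data": page.screenshot(type="png")}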


Practical Applications

Application 1: Automated QA Testing

Browser agents can write and execute test cases from natural language specifications:

def run_qa_test(spec: str, base_url: str):
    """Run a QA test from a natural language specification."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(base_url)

        results = []
        steps = spec.split("\n")

        for step in steps:
            if not step.strip():
                continue

            try:
                # nl_to_action is a hypothetical LLM helper (analogous to
                # decide_action above) that maps one natural-language step
                # to a browser action
                action = nl_to_action(step, page)
                execute_browser_action(page, action)
                results.append({"step": step, "status": "pass"})
            except Exception as e:
                results.append({"step": step, "status": "fail", "error": str(e)})
                # Take a screenshot for debugging
                page.screenshot(path=f"fail_{len(results)}.png")

        browser.close()
        return results

# Usage
spec = """
Navigate to the login page
Enter 'test@example.com' in the email field
Enter 'password123' in the password field
Click the sign in button
Verify that the dashboard heading is visible
Click the 'New Project' button
Verify a modal dialog appears
"""

This is genuinely useful for exploratory testing but unreliable for regression testing where deterministic, reproducible results are non-negotiable. Use agents for discovery, traditional Playwright scripts for CI.
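
One way to operationalize that split: freeze a successful agent run into a deterministic script by replaying its recorded history with no LLM calls. A sketch, reusing the history format from the custom agent above:

def replay(page, history: list[dict]):
    """Re-run a recorded agent trajectory deterministically (no LLM)."""
    for entry in history:
        action = entry["action"]
        match action["action"]:
            case "navigate":
                page.goto(action["params"]["url"])
            case "click":
                page.get_by_text(action["params"]["target"], exact=False).first.click()
            case "type":
                page.get_by_label(action["params"]["target"]).fill(action["params"]["text"])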

Application 2: Lead Generation and Data Extraction

def extract_company_data(company_url: str) -> dict:
    """Extract structured company information from their website."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(company_url)

        # Navigate to common info pages
        for path in ["/about", "/team", "/contact", "/careers"]:
            try:
                page.goto(f"{company_url.rstrip('/')}{path}", timeout=5000)
                break
            except Exception:
                continue

        content = page.content()

        # Use an LLM to extract structured data from the page content
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Extract the following from this webpage:
- Company name
- Description (1-2 sentences)
- Founders/leadership names
- Employee count (if available)
- Location
- Industry
- Key products/services

Page content:
{content[:15000]}

Return as JSON."""
            }],
            response_format={"type": "json_object"},
        )

        return json.loads(response.choices[0].message.content)

Application 3: Form Filling Automation

One of the most practical applications: automating repetitive form submissions across different sites with varying layouts.

def fill_form_intelligently(page, form_data: dict):
    """Use an LLM to map form data to form fields, even on unfamiliar sites."""

    # Get all form elements
    inputs = page.query_selector_all("input, select, textarea")
    form_elements = []
    for inp in inputs:
        attrs = {
            "tag": inp.evaluate("el => el.tagName"),
            "type": inp.get_attribute("type") or "",
            "name": inp.get_attribute("name") or "",
            "id": inp.get_attribute("id") or "",
            "placeholder": inp.get_attribute("placeholder") or "",
            "label": inp.evaluate("""el => {
                const label = el.labels?.[0];
                return label ? label.textContent.trim() : '';
            }"""),
        }
        form_elements.append(attrs)

    # Ask the LLM to map our data to form fields
    mapping = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Map this form data to the available form fields.

Form data: {json.dumps(form_data)}

Form fields: {json.dumps(form_elements, indent=2)}

Return JSON: {{"mappings": [{{"field_id": "...", "field_name": "...",
"value": "...", "action": "fill|select|check"}}]}}
Only map fields where you have high confidence."""
        }],
        response_format={"type": "json_object"},
    )

    mappings = json.loads(mapping.choices[0].message.content)

    for m in mappings.get("mappings", []):
        # Prefer an id selector; fall back to the name attribute
        selector = (f"#{m['field_id']}" if m.get("field_id")
                    else f"[name='{m.get('field_name', '')}']")
        try:
            if m["action"] == "fill":
                page.fill(selector, m["value"])
            elif m["action"] == "select":
                page.select_option(selector, m["value"])
            elif m["action"] == "check":
                page.check(selector)
        except Exception as e:
            print(f"Failed to fill {selector}: {e}")

Comparison Matrix

| Capability | Anthropic Computer Use | Browserbase + Playwright | Custom Playwright Agent |
| --- | --- | --- | --- |
| Observation method | Screenshots | DOM / Accessibility tree | DOM / Accessibility tree |
| Action precision | Pixel coordinates | Selectors / locators | Selectors / locators |
| Latency per action | 2–5 seconds | 100–500ms + LLM time | 100–500ms + LLM time |
| Cost per interaction | $0.50–$2.00 | $0.01–$0.10 + infra | $0.01–$0.10 |
| Handles poor accessibility | Yes (visual) | Partially (DOM fallback) | Partially (DOM fallback) |
| Scales to thousands of sessions | No (local desktop) | Yes (managed infra) | Yes (self-managed) |
| Works with desktop apps | Yes | No | No |
| Debugging | Manual screenshots | Session replay + dashboard | Manual |
| Setup complexity | Low | Low–Medium | Medium–High |

The Honest Limitations

Every browser agent approach shares fundamental challenges:

Reliability is the bottleneck. Even the best browser agents succeed 70–85% of the time on straightforward tasks. Multi-step workflows with 10+ interactions see compounding failure rates. A 90% per-step success rate over 10 steps gives you only a 35% overall success rate.
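
The compounding arithmetic in two lines:

for steps in (5, 10, 20):
    print(f"{steps} steps @ 90%/step: {0.9 ** steps:.0%} overall")   # 59%, 35%, 12%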

Dynamic content is hard. SPAs with complex state management, infinite scroll, shadow DOM, iframes, and WebSocket-driven updates create observation challenges that no approach has fully solved.

Anti-bot measures are escalating. Cloudflare, DataDome, PerimeterX, and similar services are actively fingerprinting and blocking automated browsers. Browserbase helps with this, but it's an arms race.

Cost scales linearly. Every LLM call costs money and time. A 20-step agent workflow costs 20× a single API call. At scale, this dwarfs traditional scraping costs.

Evaluation is unsolved. How do you know your agent actually did the right thing? Unlike deterministic scripts, LLM-based agents can "succeed" while subtly doing the wrong thing — clicking the wrong button, extracting stale data, or filling fields incorrectly.
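
There is no general solution yet, but one pragmatic (and imperfect) pattern is having a second model grade the outcome. A sketch — client is the OpenAI client from earlier, and the verdict deserves the same skepticism as the agent itself:

def judge(goal: str, final_url: str, extracted: str) -> bool:
    """LLM-as-judge: ask a second model whether the run met the goal."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Goal: {goal}\nFinal URL: {final_url}\nResult: {extracted}\n\n"
            "Did the agent achieve the goal? Answer YES or NO."}],
    )
    return "YES" in response.choices[0].message.content.upper()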


What's Coming Next

The trajectory is clear: browser agents will get faster, cheaper, and more reliable. Multi-modal models are improving at visual understanding. Specialized models (like those fine-tuned on DOM interaction) will reduce latency and cost. Frameworks will converge on standard patterns.

But the fundamental architecture — observe, reason, act — will persist. The tools will change. The loop won't.

If you're building with browser agents today, start with Playwright-based approaches using accessibility trees. Use Browserbase if you need scale or stealth. Reserve Computer Use for tasks that genuinely require visual understanding or desktop interaction. And budget for failure handling — because right now, that's where most of your engineering time will go.
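
In practice, that failure-handling budget means wrapping every run with retries and explicit verification. A minimal sketch building on run_agent and the judge pattern above:

def run_with_retries(goal: str, start_url: str, attempts: int = 3):
    for attempt in range(1, attempts + 1):
        try:
            history = run_agent(goal, start_url)
            # judge() from the evaluation section; the last history entry
            # stands in for the extracted result in this sketch
            if history and judge(goal, start_url, str(history[-1])):
                return history
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
    raise RuntimeError(f"Agent failed after {attempts} attempts: {goal}")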

Keywords

AI agent, data-agents