What it does

Launches a headless Chromium browser that the assistant can control: it can navigate to URLs, click elements, fill forms, execute JavaScript, scroll pages, and read content. The tool uses accessibility snapshots to give the LLM numbered element references instead of fragile CSS selectors.

Requirements

Install Playwright and Chromium (one-time):
bunx playwright install chromium
The browser tool is automatically registered at gateway startup when Playwright is available.

How the assistant uses it

The browser exposes seven actions, each registered as a separate tool:
| Action | Tool name | What it does |
| --- | --- | --- |
| Navigate | browser_navigate | Go to a URL |
| Snapshot | browser_snapshot | Get the current page structure as numbered element refs |
| Click | browser_click | Click an element by its ref number |
| Type | browser_type | Type text into an input (append or replace with clear: true) |
| Scroll | browser_scroll | Scroll the page in a direction (up, down, left, right) |
| Wait | browser_wait | Wait for a selector, URL pattern, page state, JS condition, or fixed delay |
| Evaluate | browser_evaluate | Execute JavaScript in the page context and return the result |

Typical browsing flow

  1. Assistant calls browser_navigate to open a page
  2. Assistant calls browser_snapshot to see the page structure
  3. Assistant identifies the right element by its ref number
  4. Assistant calls browser_click or browser_type to interact
  5. Assistant takes another snapshot if the page changed
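
The flow above can be sketched as a sequence of tool calls. This is a minimal illustration, not the gateway's actual API: `callTool` is a hypothetical stand-in that only records calls, and the argument names (`url`, `ref`, `text`), the URL, and the ref numbers are assumptions. Only the tool names and `clear: true` come from this page.

```javascript
// Hypothetical stand-in for the assistant's tool dispatcher: it records
// calls instead of driving a real browser.
const calls = [];
function callTool(name, args = {}) {
  calls.push({ name, args });
  return { ok: true };
}

// 1. Open the page (the `url` argument name is an assumption).
callTool("browser_navigate", { url: "https://example.com/login" });
// 2. Read the page structure as numbered element refs.
callTool("browser_snapshot");
// 3-4. Interact with elements by ref (refs 12 and 14 are made up).
callTool("browser_type", { ref: 12, text: "user@example.com", clear: true });
callTool("browser_click", { ref: 14 });
// 5. Snapshot again, since the click probably changed the page.
callTool("browser_snapshot");

console.log(calls.map((c) => c.name).join(" -> "));
```

Refs come from the most recent snapshot, which is why the sketch snapshots again after the page changes.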

Extracting structured data

For JavaScript-heavy pages (product listings, search results, data tables), browser_evaluate is often faster than repeated snapshot/scroll cycles:
return JSON.stringify(
  [...document.querySelectorAll('.product-card')].map(el => ({
    name: el.querySelector('h2')?.textContent?.trim(),
    price: el.querySelector('.price')?.textContent?.trim(),
  }))
);
Scripts with top-level return statements are automatically wrapped in an IIFE, so you can write return ... directly without wrapping in a function.
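
To make the wrapping concrete, here is a rough sketch of what such a transformation could look like. The detection logic the tool actually uses is not documented here, so `wrapTopLevelReturn` and its line-based heuristic are assumptions, and `eval` merely stands in for evaluation in the page context.

```javascript
// Heuristic sketch: wrap the script in an arrow-function IIFE when any line
// starts with `return`. A real implementation would likely parse the script.
function wrapTopLevelReturn(source) {
  const hasReturnLine = /^\s*return\b/m.test(source);
  return hasReturnLine ? `(() => {\n${source}\n})()` : source;
}

const script = "return [1, 2, 3].map((n) => n * 2).join(',');";
const wrapped = wrapTopLevelReturn(script);

console.log(wrapped);
// eval stands in for page-context evaluation in this sketch.
console.log(eval(wrapped)); // prints "2,4,6"
```

Without the wrapper, a bare `return` at the top level of a script is a syntax error, which is why the wrapping matters.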

Scrolling

browser_scroll scrolls the viewport by a pixel amount (default 500px). Useful for loading lazy content or reaching elements below the fold:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| direction | one of "up", "down", "left", "right" | required | Scroll direction |
| amount | number | 500 | Pixels to scroll |
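
As an illustration of how the two parameters combine, the sketch below maps a direction and pixel amount to horizontal and vertical deltas. `scrollDelta` is a hypothetical helper for explanation only and is not claimed to be how browser_scroll is implemented.

```javascript
// Illustrative mapping from browser_scroll parameters to scroll deltas.
// `amount` defaults to 500, matching the documented default.
function scrollDelta(direction, amount = 500) {
  switch (direction) {
    case "up":
      return { dx: 0, dy: -amount };
    case "down":
      return { dx: 0, dy: amount };
    case "left":
      return { dx: -amount, dy: 0 };
    case "right":
      return { dx: amount, dy: 0 };
    default:
      throw new Error(`Unknown direction: ${direction}`);
  }
}

const { dx, dy } = scrollDelta("down");
// In a page context these deltas would correspond to window.scrollBy(dx, dy).
console.log(dx, dy); // prints 0 500
```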

Wait conditions

browser_wait supports multiple strategies:
| Parameter | What it does |
| --- | --- |
| timeMs | Simple delay in milliseconds; best for SPAs |
| selector | Wait for a CSS selector to appear in the DOM |
| url | Wait for the URL to match a pattern |
| state | Wait for a page load state (load, domcontentloaded) |
| jsCondition | Wait for a JavaScript expression to evaluate to truthy |
Avoid state: "networkidle" on single-page applications. SPAs often keep making background requests, so the network may never go idle and the wait will time out.
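
The strategies above might be expressed as argument objects like the following. The parameter names come from this page, but every value (the selector, the URL pattern syntax, the expression) is made up for illustration, and `strategiesIn` is a hypothetical helper. Whether the tool accepts several strategies in one call is not stated here.

```javascript
// Example argument objects for browser_wait, one per strategy.
// All values are illustrative assumptions.
const waitExamples = [
  { timeMs: 1500 },
  { selector: "#results .row" },
  { url: "**/dashboard" }, // pattern syntax is an assumption
  { state: "domcontentloaded" },
  { jsCondition: "document.querySelectorAll('.row').length >= 10" },
];

const STRATEGIES = ["timeMs", "selector", "url", "state", "jsCondition"];

// Hypothetical helper: list which documented strategies an object uses.
function strategiesIn(args) {
  return STRATEGIES.filter((key) => key in args);
}

for (const args of waitExamples) {
  console.log(strategiesIn(args)[0], "=>", JSON.stringify(args));
}
```

On an SPA, a `selector` or `jsCondition` wait tied to the content you actually need is usually a more reliable signal than waiting on network activity.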

Configuration

| Setting | Default | Description |
| --- | --- | --- |
| headless | true | Run without a visible browser window |
| maxResultChars | 50,000 | Max characters returned from snapshots and evaluate |
| defaultTimeout | 30,000 ms | Timeout for page actions |
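
To show what a character cap such as maxResultChars implies for large pages, here is a minimal sketch. `truncateResult` is a hypothetical helper; whether the tool appends a truncation marker or cuts at a different boundary is not documented here.

```javascript
// Sketch of a character cap on tool output, assuming plain truncation.
// The default mirrors the documented maxResultChars of 50,000.
function truncateResult(text, maxResultChars = 50000) {
  if (text.length <= maxResultChars) return text;
  return text.slice(0, maxResultChars);
}

const bigSnapshot = "x".repeat(60000);
console.log(truncateResult(bigSnapshot).length); // prints 50000
console.log(truncateResult("short page")); // prints "short page"
```

If a result would exceed the cap, narrowing the browser_evaluate script to return only the needed fields keeps the output small.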

Live Preview

When enabled, a real-time browser view panel appears on the right side of the chat while the assistant is browsing. You can watch navigation, clicks, form fills, and page transitions as they happen. Live Preview uses Chrome DevTools Protocol screencast to stream compressed JPEG frames directly from Chromium — the same technology that powers Chrome’s remote debugging. Frames are only pushed when the page visually changes, so bandwidth is naturally throttled.

Enabling Live Preview

Live Preview is disabled by default. To turn it on:
  1. Go to Settings → Tools → Browser
  2. Toggle Live Preview on
Or set it via the config file:
{
  tools: {
    browser: {
      enabled: true,
      livePreview: true
    }
  }
}
The setting is hot-applied — no gateway restart required.
Live Preview adds ~20–50 KB per frame over the WebSocket connection. On typical browsing sessions this is well within normal bandwidth, but you may want to disable it on very slow connections.
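
As a rough worked estimate of what the per-frame figure means, the sketch below multiplies the 20–50 KB range by a frame rate. The rate of 5 frames per second is purely hypothetical; actual rates depend on how often the page visually changes, and `estimateKBPerSecond` is an illustrative helper.

```javascript
// Back-of-the-envelope bandwidth estimate from the documented frame size.
function estimateKBPerSecond(framesPerSecond, minFrameKB = 20, maxFrameKB = 50) {
  return {
    low: framesPerSecond * minFrameKB,
    high: framesPerSecond * maxFrameKB,
  };
}

// Hypothetical: a busy page pushing 5 changed frames per second.
console.log(estimateKBPerSecond(5)); // prints { low: 100, high: 250 }
// A static page pushes no frames, so no bandwidth is used.
console.log(estimateKBPerSecond(0)); // prints { low: 0, high: 0 }
```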

When to use browser vs web fetch

| Use browser when | Use web fetch when |
| --- | --- |
| Page requires JavaScript to render | Page is server-rendered HTML |
| You need to click or fill forms | You just need the text content |
| Content is behind interactions | Content is at a direct URL |
| You need to navigate through pages | Single page read is enough |
| You need to extract structured data from JS-rendered DOM | Static HTML with clean structure |
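
The comparison above can be read as a simple decision rule. The sketch below encodes it under stated assumptions: `chooseTool` and its field names (`needsJavaScript`, `needsInteraction`, `multiPage`) are hypothetical and not part of any real API.

```javascript
// Illustrative decision rule: prefer the browser when the page needs
// JavaScript rendering, interaction, or multi-page navigation.
function chooseTool(page) {
  const needsBrowser =
    Boolean(page.needsJavaScript) ||
    Boolean(page.needsInteraction) ||
    Boolean(page.multiPage);
  return needsBrowser ? "browser" : "web fetch";
}

console.log(chooseTool({ needsJavaScript: true })); // prints "browser"
console.log(chooseTool({})); // prints "web fetch"
```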