How AI Chrome Extensions See Your Screen

AI Chrome extensions perceive your screen through two main channels: a screenshot of the visible tab and a structured extraction of the interactive DOM. Here is what each mechanism actually captures, what gets sent to the cloud, and how to audit an extension before installing it.

By Loïc Jané · 11 min read

Before installing a browser AI extension, most users ask some version of the same question: what exactly is this thing reading, and what does it send somewhere? The honest answer is narrower and more mechanical than the marketing copy suggests. This post explains the two channels that any Chrome AI extension uses to perceive a page, names the APIs involved, and lays out the audit criteria that separate a privacy-respecting implementation from a worrying one.

The two channels, in one sentence

AI Chrome extensions see your screen through two main channels: a screenshot of the currently visible tab, and a structured extraction of the page’s interactive DOM. Most modern assistants use both, feed them into a multimodal model, and discard them after the response. Both channels are gated by Chrome’s permission model; neither is magic. If you know which APIs an extension calls and when, you know what it can see.

For the broader category view of how these tools perceive, decide, and act, see our guide to agentic browser assistants. This post zooms in on perception.

Channel 1 — screenshots of the visible tab

The first mechanism is a bitmap. Chrome exposes a tab-capture API that returns a PNG- or JPEG-encoded image of the currently visible viewport. The canonical call is chrome.tabs.captureVisibleTab(), which returns a data URL encoding the visible area. A few properties worth knowing:

- It captures only what is on screen: the visible viewport of the active tab. Not the full page, not background tabs, not other windows.
- It requires either the activeTab permission, granted per user gesture, or a broad host permission such as <all_urls>.
- Chrome rate-limits the call, and it cannot capture privileged pages such as chrome:// URLs or the Web Store.
- An options object selects the format ("png" by default, or "jpeg" with an adjustable quality), which determines payload size.
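A minimal sketch of that call, assuming a Manifest V3 context. In a real extension the global `chrome` object provides the API; it is stubbed here so the sketch runs standalone, and the helper name is ours, not Chrome's:

```javascript
// Stub standing in for the browser-provided chrome.tabs API so this
// sketch runs outside an extension; the real global has the same shape.
const chrome = {
  tabs: {
    captureVisibleTab: async ({ format = "png", quality } = {}) =>
      `data:image/${format === "jpeg" ? "jpeg" : "png"};base64,...`,
  },
};

// Capture the visible viewport of the active tab as a data URL.
async function screenshotVisibleTab() {
  // JPEG at reduced quality keeps the payload small for model input.
  return chrome.tabs.captureVisibleTab({ format: "jpeg", quality: 70 });
}
```

Everything the model "sees" through this channel is inside that one data URL; there is no hidden side channel.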

What the screenshot gives an AI model: visual context. Layout, colour, imagery, chart shapes, rendered typography. What it does not give: the underlying structure. A button that looks clickable may not be; a link may be styled as plain text. Vision-only models have to guess.

Channel 2 — extracting the interactive DOM

The second mechanism is structural. The extension injects a content script — or uses chrome.scripting.executeScript on demand — to walk the document tree and collect a compact representation of the page. Exactly what is collected varies; the usual shape is a list of interactive nodes (buttons, links, inputs, menu items, form controls) with their text labels, roles, and stable selectors.

Content scripts share the DOM of the host page but run in an isolated world — a separate JavaScript context — so they cannot observe page-defined globals or closures. They can read every text node, every attribute, every ARIA role. They can also modify the page, which is how the “halo” or cursor overlays in browser AI assistants are rendered.
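A sketch of that walk. The function and its label logic are illustrative, not any particular extension's implementation; it is duck-typed over tagName, getAttribute, textContent, and children, so it runs over a plain object tree here, while a real content script would pass document.body:

```javascript
// Walk a tree and keep only interactive nodes, each with its label,
// role, and a positional selector the assistant can act on later.
const INTERACTIVE = new Set(["A", "BUTTON", "INPUT", "SELECT", "TEXTAREA"]);

function collectInteractive(node, path = "", out = []) {
  const tag = (node.tagName || "").toUpperCase();
  const role = node.getAttribute?.("role") ?? null;
  if (INTERACTIVE.has(tag) || role === "button" || role === "link") {
    out.push({
      tag,
      role,
      label: (node.getAttribute?.("aria-label") || node.textContent || "").trim(),
      selector: path, // positional path stands in for a stable selector
    });
  }
  let i = 0;
  for (const child of node.children || []) {
    i += 1;
    const seg = `${(child.tagName || "*").toLowerCase()}:nth-child(${i})`;
    collectInteractive(child, path ? `${path} > ${seg}` : seg, out);
  }
  return out;
}
```

Note what is deliberately absent: no full HTML serialisation, no hidden text nodes, no form values. The compactness is the privacy property.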

Why most serious AI extensions extract the DOM on top of the screenshot:

- The extract supplies what pixels cannot: which elements are actually interactive, what their roles are, and what text labels them.
- Stable selectors give the model something it can act on; a screenshot offers coordinates at best.
- Text arrives exactly as the page exposes it, with no OCR step and no misread characters.

The trade-off: DOM extraction can miss rendered content that lives only on canvas (Figma, Miro), in iframes from a different origin (many embedded widgets), or inside shadow DOM (some component libraries). For those surfaces, the screenshot fills the gap — at reduced reliability.
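The shadow DOM case is worth one concrete line: open shadow roots are reachable only if the walker checks for them explicitly. A sketch, duck-typed so it runs over a plain test tree:

```javascript
// Generator walk that also descends into open shadow roots. A closed
// shadow root exposes .shadowRoot as null and stays invisible, as do
// cross-origin iframes and canvas pixels: that is the gap the
// screenshot channel covers.
function* walk(node) {
  yield node;
  if (node.shadowRoot) yield* walk(node.shadowRoot); // open roots only
  for (const child of node.children || []) yield* walk(child);
}
```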

A third option: the accessibility tree

A smaller group of extensions reads the browser’s accessibility tree — the same structure screen readers walk. The tree is the browser’s own interpretation of what matters on the page: landmarks, headings, controls, labelled widgets, interactive elements. It is less flexible than raw DOM extraction (you see what the accessibility layer sees, not everything) and typically requires additional APIs, but it carries the advantage of being the web’s official “semantic summary.”

In practice, most 2026 AI extensions prefer DOM extraction for flexibility and accept the accessibility tree as a complementary signal when present. We mention it because it is the most privacy-preserving capture pattern conceptually — it contains no incidental text outside what the site already exposed to assistive technology.

What actually goes to the cloud

This is the honest part. Nearly every consumer-grade AI browser extension sends both the screenshot and the DOM extract to a hosted model for inference. Local models small enough to fit in the browser are not yet strong enough to run the full perception pipeline at useful quality for generalised page understanding.
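To make the payload concrete, here is a hypothetical shape for a single request. The field names are illustrative, not any vendor's actual schema:

```javascript
// Hypothetical per-request payload. "What is sent" is exactly this
// object; "when" is the gesture that built it; "how long it is kept"
// lives server-side and cannot be proved from client code alone.
function buildPayload({ question, screenshotDataUrl = null, domNodes = [] }) {
  return {
    question,
    screenshot: screenshotDataUrl, // null when the DOM extract suffices
    dom: domNodes,                 // interactive nodes only, not full HTML
    sentAt: new Date().toISOString(),
  };
}
```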

Three sub-questions matter:

- What is sent: the screenshot, the DOM extract, or both, plus your question.
- When it is sent: only on an explicit user gesture, or on a background schedule.
- How long it is retained: discarded after the response, or stored for training and debugging.

Red flags when auditing an extension

Open the Chrome Web Store listing for any AI extension and look for these specifically.

- Broad host permissions ("Read and change all your data on all websites") when the feature only needs the current tab.
- No named user gesture that triggers capture; "always-on" assistance implies always-on reading.
- A privacy policy with no explicit statement about keystroke capture or data retention.
- No way to verify what is sent, such as a published payload schema.
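Permissions are the fastest check, because they live in the manifest. A minimal-capture Manifest V3 sketch looks like this (the extension itself is hypothetical; the field names are real MV3 keys):

```json
{
  "manifest_version": 3,
  "name": "Hypothetical narrow-capture assistant",
  "version": "0.1.0",
  "permissions": ["activeTab", "scripting"],
  "commands": {
    "ask-assistant": {
      "suggested_key": { "default": "Ctrl+Shift+Space" },
      "description": "Capture the visible tab and ask a question"
    }
  },
  "background": { "service_worker": "worker.js" }
}
```

A worrying listing instead declares "host_permissions": ["<all_urls>"], which Chrome surfaces to the user as the all-websites warning above.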

How Clicky implements the two channels

Clicky uses both channels — screenshot plus DOM — with a strict on-demand capture policy.

None of this is novel engineering. What is unusual is that each choice is the minimum needed for the feature to work. Broader permissions are easier to ship; a smaller capture surface takes more work. We think it is worth it.

Frequently asked questions

Can a Chrome extension see my tabs in the background?

Only if it has tabs or broad host permissions. An extension using activeTab only sees the current tab, and only after an explicit user gesture invokes it. The permission model is enforced by Chrome, not by policy.

Is a screenshot more privacy-invasive than a DOM extract?

In most cases, yes. A screenshot is a rendered image of whatever happens to be on screen — including incidental content, notifications, half-visible text. A DOM extract limited to interactive nodes carries less incidental data. Extensions that use DOM-first with screenshot as a fallback are usually more privacy-respecting.

Does the extension record everything I type?

It can if it wants to — a content script can listen to input events. Most AI extensions do not; serious ones state this explicitly. Check the privacy policy for a clear negative statement about keystroke capture.

What does “only when you invoke it” actually mean?

There should be a single, observable user gesture — a hotkey, a button, a voice trigger — that precedes any capture. If the extension cannot point at the gesture, it is probably capturing on a different schedule.
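In code, that gating is a single early return. A sketch with the `chrome` object stubbed so it runs outside a browser (the real MV3 globals have the same shape; the command name "ask-assistant" is illustrative):

```javascript
// Stubs standing in for the browser-provided MV3 APIs.
const chrome = {
  commands: { onCommand: { addListener(fn) { this._fire = fn; } } },
  tabs: { captureVisibleTab: async () => "data:image/jpeg;base64,..." },
};

let captures = 0;

// Capture happens only when the declared hotkey fires; any other
// command, and any other schedule, captures nothing.
chrome.commands.onCommand.addListener(async (command) => {
  if (command !== "ask-assistant") return null; // not the declared gesture
  captures += 1;
  return chrome.tabs.captureVisibleTab();
});
```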

Can I see exactly what Clicky sends for a given question?

Chrome DevTools will show you every network request the extension makes, including the captured payload. We publish the schema on the privacy page; DevTools is the verification.

Next in our 2026 series: the difference between a browser copilot, a browser assistant, and a full browser agent — terminology that marketing conflates and that actually matters for which product fits which job.