Before installing a browser AI extension, most users ask some version of the same question: what exactly is this thing reading, and what does it send somewhere? The honest answer is narrower and more mechanical than the marketing copy suggests. This post explains the two channels that any Chrome AI extension uses to perceive a page, names the APIs involved, and lays out the audit criteria that separate a privacy-respecting implementation from a worrying one.
The two channels, in one sentence
AI Chrome extensions see your screen through two main channels: a screenshot of the currently visible tab, and a structured extraction of the page’s interactive DOM. Most modern assistants use both, feed them into a multimodal model, and discard them after the response. Both channels are gated by Chrome’s permission model; neither is magic. If you know which APIs an extension calls and when, you know what it can see.
For the broader category view of how these tools perceive, decide, and act, see our guide to agentic browser assistants. This post zooms in on perception.
Channel 1 — screenshots of the visible tab
The first mechanism is a bitmap. Chrome exposes a tab-capture API that returns a PNG- or JPEG-encoded image of the currently visible viewport. The canonical call is chrome.tabs.captureVisibleTab(), which returns a data URL encoding the visible area. A few properties worth knowing:
- Visible area only. The API captures what is on screen, not the full scrollable page. If an element is off-screen when the call fires, it is not in the screenshot.
- Requires one of two permissions. The extension must have either the broad <all_urls> host permission or the narrower activeTab permission. The difference is enormous: <all_urls> lets the extension capture any tab silently; activeTab only grants capture on the current tab after an explicit user gesture.
- Rate-limited. Chrome throttles how many times per second an extension can call this. Continuous, per-frame screen recording is not a viable pattern for a well-behaved extension — which is good news for users who want a predictable capture footprint.
- Expensive in bandwidth. A full-viewport PNG at standard retina resolution is several hundred kilobytes to a couple of megabytes. Extensions that send this to a cloud model on every question are measurably slow and measurably bandwidth-heavy.
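A minimal sketch of what a gesture-gated capture looks like, assuming a Manifest V3 service worker with the activeTab permission and a toolbar click as the gesture (the logging is illustrative):

```typescript
// background.ts (MV3 service worker)
// Fires once per explicit user gesture: clicking the toolbar icon grants
// activeTab, which in turn permits capturing the current tab.
chrome.action.onClicked.addListener(async (tab) => {
  if (tab.windowId === undefined) return;
  // Returns a data URL; JPEG at moderate quality keeps the payload far
  // smaller than a lossless PNG of the same viewport.
  const dataUrl = await chrome.tabs.captureVisibleTab(tab.windowId, {
    format: "jpeg",
    quality: 70,
  });
  console.log(`visible-tab capture: ${dataUrl.length} characters of data URL`);
});
```

Note the shape of the flow: no gesture, no call; one gesture, one bitmap.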
What the screenshot gives an AI model: visual context. Layout, colour, imagery, chart shapes, rendered typography. What it does not give: the underlying structure. A button that looks clickable may not be; a link may be styled as plain text. Vision-only models have to guess.
Channel 2 — extracting the interactive DOM
The second mechanism is structural. The extension injects a content script — or uses chrome.scripting.executeScript on demand — to walk the document tree and collect a compact representation of the page. Exactly what is collected varies; the usual shape is a list of interactive nodes (buttons, links, inputs, menu items, form controls) with their text labels, roles, and stable selectors.
Content scripts share the DOM of the host page but run in an isolated world — a separate JavaScript context — so they cannot observe page-defined globals or closures. They can read every text node, every attribute, every ARIA role. They can also modify the page, which is how the “halo” or cursor overlays in browser AI assistants are rendered.
Why most serious AI extensions extract the DOM on top of the screenshot:
- Selectors survive reflow. A button at coordinate (840, 220) moves when the page updates. A button referenced by button[data-testid="export"] does not. Action on the page has to resolve back to a selector eventually; going selector-first is more robust.
- Compact. A few kilobytes of JSON versus several megabytes of PNG. Faster to upload, cheaper to process, smaller exposure surface.
- Model-friendly. A list of labels and roles is what the language model is best at. Vision models are improving, but structured input still outperforms a rendered bitmap on most navigation tasks.
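A minimal sketch of what such an extract can look like from a content script. The InteractiveNode shape and the data-ai-node stamping attribute are illustrative conventions, not any particular vendor's schema:

```typescript
// content-script.ts
// Walks the page once and returns a compact interactive-node list.
interface InteractiveNode {
  role: string;     // ARIA role if present, else the tag name
  label: string;    // accessible name or visible text, truncated
  selector: string; // guaranteed-resolvable selector for later action
}

function extractInteractiveNodes(): InteractiveNode[] {
  const els = document.querySelectorAll<HTMLElement>(
    "a, button, input, select, textarea, [role='button'], [role='menuitem']"
  );
  return Array.from(els).map((el, i) => {
    // Stamp a synthetic attribute so every node keeps a stable selector even
    // when the site exposes no id or data-testid. "data-ai-node" is made up.
    el.setAttribute("data-ai-node", String(i));
    return {
      role: el.getAttribute("role") ?? el.tagName.toLowerCase(),
      label: (el.getAttribute("aria-label") ?? el.innerText).trim().slice(0, 80),
      selector: `[data-ai-node="${i}"]`,
    };
  });
}
```

Stamping a synthetic attribute is one way to make the selectors-survive-reflow property hold even on sites with no stable identifiers; it also means the extract doubles as an action map.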
The trade-off: DOM extraction can miss rendered content that lives only on canvas (Figma, Miro), in iframes from a different origin (many embedded widgets), or inside shadow DOM (some component libraries). For those surfaces, the screenshot fills the gap — at reduced reliability.
A third option: the accessibility tree
A smaller group of extensions reads the browser’s accessibility tree — the same structure screen readers walk. The tree is the browser’s own interpretation of what matters on the page: landmarks, headings, controls, labelled widgets, interactive elements. It is less flexible than raw DOM extraction (you see what the accessibility layer sees, not everything) and typically requires additional APIs, but it carries the advantage of being the web’s official “semantic summary.”
In practice, most 2026 AI extensions prefer DOM extraction for flexibility and accept the accessibility tree as a complementary signal when present. We mention it because it is the most privacy-preserving capture pattern conceptually — it contains no incidental text outside what the site already exposed to assistive technology.
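For the curious, one route to that tree from an extension is the Chrome DevTools Protocol via chrome.debugger. A sketch, assuming the extension holds the debugger permission (Chrome shows a visible banner on the tab while attached):

```typescript
// Reads the full accessibility tree over the Chrome DevTools Protocol.
// Requires the "debugger" permission in the manifest.
async function readAccessibilityTree(tabId: number): Promise<unknown> {
  const target = { tabId };
  await chrome.debugger.attach(target, "1.3");
  try {
    await chrome.debugger.sendCommand(target, "Accessibility.enable");
    // Returns the browser's own semantic summary of the page:
    // roles, names, landmarks, interactive controls.
    return await chrome.debugger.sendCommand(target, "Accessibility.getFullAXTree");
  } finally {
    await chrome.debugger.detach(target);
  }
}
```

The visible banner is part of why this pattern is rare in consumer extensions, despite its semantic cleanliness.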
What actually goes to the cloud
This is the honest part. Nearly every consumer-grade AI browser extension sends both the screenshot and the DOM extract to a hosted model for inference. Local models small enough to fit in the browser are not yet strong enough to run the full perception pipeline at useful quality for generalised page understanding.
Three sub-questions matter:
- When. Only when the user explicitly invokes the assistant (push-to-talk, hotkey, button click) — or continuously in the background? The latter should be a deal-breaker for most users.
- Where. Which region, which model vendor, which retention policy. A serious extension names the vendors on its privacy page; Clicky, for example, names Anthropic, Mistral, and ElevenLabs. Extensions that list only themselves are routing data through an opaque proxy, which is not necessarily bad but deserves a question.
- What remains after inference. Is the captured page stored, used for training, or discarded? Vendor-facing contracts are usually clearer than the consumer-facing version; extensions that route through enterprise API tiers (zero-retention, no-training) are meaningfully different from ones that hit consumer APIs.
Red flags when auditing an extension
Open the Chrome Web Store listing for any AI extension and look for these specifically.
- “Read and change all your data on all websites.” This is <all_urls> in plain language — the extension can read every page silently, not just when invoked. Common among AI sidebar extensions. Warrants a real justification in the privacy policy (a manifest sketch of this permission follows the list).
- No named model provider. If the privacy policy says “third-party AI services” with no names, you do not know where your page data is going or under what retention terms.
- Ambient capture described as “always-on” or “proactive.” Translation: the extension captures continuously, not only when you invoke it. Convenient, and also a persistent data stream.
- Microphone with no push-to-talk option. Means the mic is either always-listening or toggle-based. See our push-to-talk guide for why that is a privacy axis, not a cosmetic one.
- Neither open-source code nor, at minimum, a publicly published manifest. You can always inspect the manifest of an installed extension, but a vendor who publishes it up-front and documents its permissions has made the audit easier.
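To make the permission contrast concrete, here are minimal sketches of the two manifest patterns; names and versions are illustrative. The first is what produces the all-websites warning at install time:

```json
{
  "manifest_version": 3,
  "name": "broad-capture-extension",
  "version": "1.0",
  "host_permissions": ["<all_urls>"]
}
```

The gesture-gated alternative requests no host access at all:

```json
{
  "manifest_version": 3,
  "name": "gesture-gated-extension",
  "version": "1.0",
  "permissions": ["activeTab"]
}
```

The second manifest installs without the all-websites warning, and the extension can only read a tab once the user explicitly invokes it.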
How Clicky implements the two channels
Clicky uses both channels — screenshot plus DOM — with a strict on-demand capture policy.
- Permission. activeTab only. The extension has no ambient access to pages. Capture is triggered by the Alt key-down event and terminates on key-up.
- Screenshot. chrome.tabs.captureVisibleTab() fires exactly once per invocation, captures the visible viewport, and encodes to JPEG at moderate quality to keep payload size and inference latency reasonable.
- DOM extract. A content script walks the page and returns a compact interactive-node list — buttons, links, inputs, menu items — with labels, roles, and selectors. Body text is not in the extract; only the structural skeleton.
- Where it goes. The screenshot and DOM are forwarded through a Fleece AI-managed endpoint to Anthropic (answer generation), Mistral Voxtral (transcription), and ElevenLabs (text-to-speech). Named providers, enterprise-tier terms, no training on user data.
- Session scope. Conversation history lives in Chrome session storage — not sync, not local. It is cleared when the browser session ends. Nothing persists on our servers beyond the short inference window.
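A sketch of that storage pattern, with an illustrative history key and helper (this is not Clicky's actual code):

```typescript
// chrome.storage.session is held in memory for the browser session and
// cleared when the session ends, unlike storage.local or storage.sync.
async function appendTurn(turn: { role: "user" | "assistant"; text: string }) {
  const { history = [] } = await chrome.storage.session.get("history");
  await chrome.storage.session.set({ history: [...history, turn] });
}
```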
None of this is novel engineering. What is unusual is that each choice is the minimum needed for the feature to work. Broader permissions are easier to ship; a smaller capture surface takes more work. We think it is worth it.
Frequently asked questions
Can a Chrome extension see my tabs in the background?
Only if it has the tabs permission or broad host permissions. An extension using activeTab only sees the current tab, and only after an explicit user gesture invokes it. The permission model is enforced by Chrome, not by policy.
Is a screenshot more privacy-invasive than a DOM extract?
In most cases, yes. A screenshot is a rendered image of whatever happens to be on screen — including incidental content, notifications, half-visible text. A DOM extract limited to interactive nodes carries less incidental data. Extensions that use DOM-first with screenshot as a fallback are usually more privacy-respecting.
Does the extension record everything I type?
It can if it wants to — a content script can listen to input events. Most AI extensions do not; serious ones state this explicitly. Check the privacy policy for a clear negative statement about keystroke capture.
What does “only when you invoke it” actually mean?
There should be a single, observable user gesture — a hotkey, a button, a voice trigger — that precedes any capture. If the extension cannot point at the gesture, it is probably capturing on a different schedule.
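In code terms, the observable gesture is just an event listener that opens and closes the capture window. A sketch of the key-down/key-up pattern, with illustrative message names:

```typescript
// content-script.ts
// Capture can only happen between these two messages; there is no
// code path that captures outside the key-down/key-up window.
document.addEventListener("keydown", (e) => {
  if (e.key === "Alt" && !e.repeat) {
    chrome.runtime.sendMessage({ type: "capture-start" });
  }
});
document.addEventListener("keyup", (e) => {
  if (e.key === "Alt") {
    chrome.runtime.sendMessage({ type: "capture-stop" });
  }
});
```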
Can I see exactly what Clicky sends for a given question?
Chrome DevTools will show you every network request the extension makes, including the captured payload. We publish the schema on the privacy page; DevTools is the verification.
Next in our 2026 series: the difference between a browser copilot, a browser assistant, and a full browser agent — terminology that marketing conflates and that actually matters for which product fits which job.