“Hands-free browsing” is not one thing. For someone with a spinal-cord injury it might mean a sip-and-puff switch driving a scanning cursor. For someone with repetitive strain injury it might mean Talon Voice and a foot pedal. For someone with advanced ALS it might mean an eye tracker and a wake-word assistant. These are different stacks with different strengths, and any single product that claims to be “the hands-free solution” is either overreaching or under-informed. This post lays out the 2026 landscape layer by layer, names the serious tools, and describes where a push-to-talk AI assistant like Clicky legitimately fits — and where it does not.
What hands-free means on the web
At the level of the browser, there are three categories of hands-free interaction, and the industry often conflates them.
- Dictation. The user speaks; the system types. This is the oldest category — Dragon shipped dictation for Windows in the 1990s — and it solves text entry, not navigation. A user who can dictate an email still cannot, by dictation alone, click the “Send” button.
- Command. The user says a verb; the system performs a deterministic action. “Click Send.” “Scroll down.” “Switch to Firefox.” This is what modern voice-control systems like Talon, Apple Voice Control, and Windows Voice Access do. The command grammar is finite and predictable, which is exactly why it is reliable enough to live on (a sketch at the end of this section makes that concrete).
- Agent. The user states a goal; a model figures out the steps. “Find me a cheaper flight.” “Summarise what this dashboard is telling me.” This is the 2025–2026 wave of AI browser assistants. Higher ceiling, much less predictable, and — importantly — not a replacement for the command layer underneath.
A well-built motor-accessibility setup in 2026 uses all three, and chooses them for the task at hand. Dictation for long text. Commands for navigation. Agents for understanding. Anyone who sells you one layer as the whole answer is selling you something smaller than you think.
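To make the command layer’s determinism concrete, here is a minimal browser-side sketch. Every name in it is illustrative, not any vendor’s API; real grammars like Talon’s are far richer, but the property that makes the layer reliable is the same: an utterance either matches the finite grammar exactly or nothing happens.

```ts
// Minimal deterministic command layer: a finite grammar mapped to
// concrete actions. All names are illustrative; real systems compile
// far richer grammars, but share the key property that an utterance
// either matches exactly or triggers nothing at all.
type Command = { phrase: string; run: () => void };

const grammar: Command[] = [
  { phrase: "scroll down", run: () => window.scrollBy({ top: 400 }) },
  { phrase: "scroll up", run: () => window.scrollBy({ top: -400 }) },
  { phrase: "go back", run: () => history.back() },
];

function dispatch(utterance: string): boolean {
  const cmd = grammar.find((c) => c.phrase === utterance.trim().toLowerCase());
  cmd?.run();
  return cmd !== undefined; // false: unrecognised, so no action at all
}
```

There is no model in that loop, which is exactly the point: the command layer never guesses.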
The three layers of hands-free input
Underneath the software categories sits a stack of three physical and logical layers. Each one has decades of assistive-technology research behind it, and each one solves a different sub-problem.
- Layer 1 — Hardware input. The physical transducer that gets the user’s intent into the machine. Switches, sip-and-puff devices, foot pedals, eye trackers, head trackers. Some users never need this layer; some users cannot browse the web without it.
- Layer 2 — System-level voice and switch software. The layer that turns a transducer into operating-system events. On macOS this is Apple Voice Control; on Windows 11 it is Voice Access; cross-platform there is Talon Voice, plus the legacy but still-shipping Dragon Professional on Windows. Switch access is typically handled by the OS plus purpose-built software from vendors like Tobii Dynavox or AssistiveWare. This is the layer that actually navigates the page.
- Layer 3 — Application-level assistants. What sits inside the browser and helps with the content on the current page: summarising, explaining, pointing at the right control, translating. This is where Clicky, Sider, Monica, Merlin, and the rest of the 2026 AI extension crop live. This layer does not move the mouse at the OS level; it cannot replace layer two.
A user with motor impairment usually needs one tool from layer one, one from layer two, and optionally one from layer three. The application-level layer is genuinely useful — but only once the lower layers are in place.
Hardware: switches, pedals, eye and head trackers
A brief and honest tour of the physical inputs, in rough order of how much motor control they require.
- Foot pedals. Lowest-commitment addition. USB pedals such as Kinesis’s Savant Elite line or the Infinity transcription pedals are widely used by voice-coding developers to offload modifier keys — a natural fit for people with wrist or hand pain, less useful for those without lower-limb control.
- Switches. A single button, pressed any way the user can press it — by head, chin, elbow, knee. Paired with scanning software that steps through on-screen targets, one switch can drive an entire browser. Slow by design, but robust; two or three switches speed it up considerably. A sketch of the scanning loop follows this list.
- Sip-and-puff. A mouthpiece that reads air pressure; sips and puffs become discrete commands. Mature technology, often the primary input for people with high spinal-cord injuries. Pairs well with on-screen keyboard and scanning.
- Head trackers. Cameras (or IR markers) translate head motion into cursor motion. Apple’s built-in head-tracking on iPad, Quha Zono, and GlassOuse are common choices. Requires reliable neck control; browsing accuracy is good on modern hardware.
- Eye trackers. Dedicated devices like Tobii Eye Tracker 5 or medical-grade Tobii Dynavox systems map gaze to cursor position; dwell or blink triggers clicks. Primary input of last resort for people with very limited voluntary motion; expensive but transformative.
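The scanning pattern behind single-switch access is simple enough to sketch. This is a browser-side illustration, not a real AT product: a highlight steps through focusable elements on a timer, with the spacebar standing in for the physical switch.

```ts
// Illustrative single-switch scanning loop, not production AT.
// A highlight steps through focusable targets on a fixed interval;
// one switch press (spacebar here, standing in for any switch)
// activates whatever the scan is currently on.
const targets = Array.from(
  document.querySelectorAll<HTMLElement>("a[href], button, input, [tabindex]")
);
let index = -1;

if (targets.length > 0) {
  setInterval(() => {
    targets[index]?.style.removeProperty("outline");
    index = (index + 1) % targets.length;
    targets[index].style.outline = "3px solid orange"; // the scan highlight
  }, 1500); // step interval: a real product makes this configurable

  document.addEventListener("keydown", (e) => {
    if (e.code === "Space") { // the "switch"
      e.preventDefault();
      targets[index]?.click(); // select the highlighted target
    }
  });
}
```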
None of these hardware options require AI to work. That is a feature, not a flaw: users relying on assistive input need their stack to work even when the cloud does not.
Software: Talon, Dragon, Voice Control, Voice Access
The system-level voice stack is where most of the real navigation happens in 2026. Four tools dominate, and they are not interchangeable.
- Talon Voice. Cross-platform (macOS, Windows, Linux), scriptable in Python, with an active community command set maintained on GitHub. Originally built by Ryan Hileman to let people code hands-free, Talon has become the default choice for developers with RSI and a serious option for general browsing. Steep learning curve, very high ceiling.
- Dragon Professional. The historical gold standard for dictation, Windows-only in current shipping versions. Nuance, the vendor, was acquired by Microsoft in 2022; active development has slowed significantly since then, but Dragon Professional v16 is still sold and still widely used, particularly in medical and legal contexts. Strong dictation, weaker navigation than the newer OS-native tools.
- Apple Voice Control. Built into macOS and iOS. Lets users click any on-screen element by saying its label, or by showing numbered overlays on every clickable target (“Show numbers”); a sketch of that overlay mechanic follows this list. Runs on-device after a one-time model download, supports custom vocabularies and commands, and requires no subscription. Reliability on web pages is very good thanks to deep accessibility-tree integration.
- Windows Voice Access. Microsoft’s built-in replacement for the older Speech Recognition feature, shipped in Windows 11 22H2 and steadily improved since. Runs offline after setup, offers numbered overlays similar to Apple’s, and increasingly handles browser UI well. Free, which matters more than it sounds in a category where Dragon licences run into the hundreds of euros.
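Since both built-in tools lean on the same numbered-overlay mechanic, a sketch helps show why it works so well for voice: every clickable target gets a number, and a number is always pronounceable even when a label is not. This is a simplified DOM-level illustration; the real features work from the platform accessibility tree, not from a DOM query.

```ts
// Simplified numbered-overlay pass, DOM-level rather than the
// platform accessibility tree the real features use. Each clickable
// element gets a numbered badge; "click 12" then resolves through
// the returned map.
function showNumbers(): Map<number, HTMLElement> {
  const clickable = document.querySelectorAll<HTMLElement>(
    "a[href], button, input, select, [role='button']"
  );
  const byNumber = new Map<number, HTMLElement>();
  clickable.forEach((el, i) => {
    const n = i + 1;
    byNumber.set(n, el);
    const badge = document.createElement("span");
    badge.textContent = String(n);
    badge.style.cssText =
      "position:absolute;background:#ffd400;color:#000;" +
      "font:bold 12px sans-serif;padding:1px 4px;border-radius:3px;z-index:99999;";
    const r = el.getBoundingClientRect();
    badge.style.left = `${r.left + window.scrollX}px`;
    badge.style.top = `${r.top + window.scrollY}px`;
    document.body.appendChild(badge);
  });
  return byNumber; // e.g. byNumber.get(12)?.click() for "click twelve"
}
```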
For users coming from Dragon in 2026, the honest advice is: stay on Dragon if it works, but evaluate Apple Voice Control or Windows Voice Access seriously, and look at Talon if you need scripting, mode switching, or cross-platform consistency. Clicky does not appear in that list of four, and it should not: Clicky is not in that layer.
Where AI assistants fit in the stack
AI browser assistants in 2026 solve a narrower problem than the voice stack does. They do not replace Dragon, Talon, Voice Control, or Voice Access. What they add, on top of those tools, is roughly three things.
- Semantic pointing. Voice Control can click “Submit” if the user knows the button is labelled Submit. If it is labelled with an ambiguous icon or sits behind a three-dot menu, the user has to explore. An assistant that understands “the button that sends this form” can narrow the search, then let the voice stack execute the click. The sketch at the end of this section shows one way that narrowing can work.
- Page comprehension. Reading a dense dashboard aloud, summarising a contract, explaining what a chart means. This is the original strength of chat-style AI and it transfers well to motor-accessibility scenarios because it reduces the number of voice commands needed to get the information.
- Task setup. For fully autonomous agents like Comet or Atlas, a spoken goal can replace a long chain of clicks. This is powerful for some users — and actively dangerous for others with financial or medical workflows where a mis-executed multi-step task is worse than no action at all. We wrote a longer version of that trade-off in push-to-talk vs always-listening.
The right mental model: voice-control software handles what gets clicked; AI assistants help decide which thing should get clicked and explain why. Built well, the two layers compose. Built badly, they fight each other — which is why permission scopes and input-focus behaviour matter so much at this layer.
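As promised under semantic pointing, here is one way the narrowing step can work, sketched with a plain keyword score where a real assistant would put a model. All names are illustrative; the point is where the boundary sits: the assistant ranks candidates, and the voice stack performs the click.

```ts
// Keyword-score stand-in for the model a real assistant would use.
// The boundary this illustrates: the assistant only ranks candidates
// by accessible name; the voice stack performs the actual click.
function accessibleName(el: HTMLElement): string {
  return (el.getAttribute("aria-label") ?? el.textContent ?? "")
    .trim()
    .toLowerCase();
}

function narrow(description: string): HTMLElement[] {
  const words = description.toLowerCase().split(/\s+/);
  const candidates = Array.from(
    document.querySelectorAll<HTMLElement>("button, a[href], [role='button']")
  );
  return candidates
    .map((el) => ({
      el,
      score: words.filter((w) => accessibleName(el).includes(w)).length,
    }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((c) => c.el); // best match first; highlight, don't click
}
```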
Clicky’s honest fit — and honest limits
A straight accounting of where Clicky helps and where it does not, written for someone evaluating it as part of a motor-accessibility stack.
Where Clicky helps. Clicky is a push-to-talk Chrome extension: hold Alt, ask a question, get a spoken answer plus a halo drawn on the relevant DOM element on the page. For a user already running Talon, Apple Voice Control, or Windows Voice Access for navigation, Clicky adds a layer above that stack: semantic pointing (“where’s the export button on this dashboard?”), page-level explanation, and voice-out answers without needing to visually scan a chat sidebar. The halo lands on the actual element; the voice stack can then click it by its visible label. Clicky does not intercept Talon or Dragon commands — it only listens while Alt is held, and only the active tab is ever inspected (Chrome’s activeTab permission, not broad-host).
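Clicky’s internals are not the reference here; this is a generic sketch of how a hold-to-talk trigger is typically wired in a content script, with hypothetical startListening and stopListening stand-ins. The property that matters is visible in the shape of the code: the microphone opens on Alt keydown and closes on Alt keyup, with no third state.

```ts
// Generic hold-to-talk trigger sketch (our assumption, not Clicky
// internals). startListening/stopListening are hypothetical.
declare function startListening(): void;
declare function stopListening(): void;

let holding = false;

document.addEventListener("keydown", (e) => {
  if (e.key === "Alt" && !holding) { // guard: keydown auto-repeats
    holding = true;
    startListening(); // the mic opens now, and only now
  }
});

document.addEventListener("keyup", (e) => {
  if (e.key === "Alt" && holding) {
    holding = false;
    stopListening(); // releasing the key always closes the mic
  }
});
```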
Where Clicky does not help — and this is the important part. The default trigger is holding the Alt key, which requires pressing and holding a physical modifier. For users who cannot hold a modifier key — many switch users, many users with severe limb motor impairment, many users with ALS — Clicky in its current shipping form is not usable as a primary input. That is not a marketing softening; it is the design. On the roadmap we have two changes aimed at this exact gap:
- Tap-to-lock. A single Alt tap enters a listen-until-tapped-again mode, so the key does not need to be held. This removes the hold requirement entirely for users who can make one brief press (a hypothetical sketch follows this list).
- Voice activation. An opt-in wake-phrase mode for users for whom any keyboard trigger, even a tap, is not reliable. This mode necessarily has a different privacy posture than the default — continuous on-device listening — and will ship as a distinct setting rather than a replacement.
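Neither mode exists yet, so any code here can only be hypothetical. As a sketch of what tap-to-lock might look like, distinguishing a brief tap from a hold by elapsed time, with an invented threshold and the same invented helpers as above:

```ts
// Hypothetical sketch of the unshipped tap-to-lock mode: a brief Alt
// tap toggles a locked listening state; a longer hold still behaves
// as plain push-to-talk. Threshold and helpers are invented.
declare function startListening(): void; // hypothetical stand-in
declare function stopListening(): void;  // hypothetical stand-in

const TAP_MS = 250; // below this, a press counts as a tap (invented value)
let pressedAt = 0;
let locked = false;

document.addEventListener("keydown", (e) => {
  if (e.key === "Alt" && pressedAt === 0) {
    pressedAt = Date.now();
    if (!locked) startListening(); // hold path: mic opens immediately
  }
});

document.addEventListener("keyup", (e) => {
  if (e.key !== "Alt") return;
  const heldFor = Date.now() - pressedAt;
  pressedAt = 0;
  if (heldFor < TAP_MS) {
    locked = !locked;             // tap: toggle the lock
    if (!locked) stopListening(); // second tap ends the session
  } else if (!locked) {
    stopListening();              // hold released: mic closes
  }
});
```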
Neither of those has shipped as of April 2026. If you are evaluating Clicky today for a user who cannot hold a modifier, the right answer is to wait, or to use Clicky only on top of a voice-control software layer that can simulate Alt-hold as a command. That is not a pitch; it is the truth of the product in the month this post was written.
Clicky is also Chrome-only, does not replace dedicated assistive technology, and carries none of the disability-vendor accreditations that products like Dragon and Tobii have accumulated over decades. If you are deploying hands-free tooling for a protected workflow — clinical, legal, government — dedicated AT remains the right foundation, with Clicky sitting at layer three, if at all.
Real-world setups
Two concrete examples of how these layers compose in 2026. Neither is hypothetical — both are patterns we have seen or heard from users in the first six months of the year.
Developer with RSI, cross-platform. Talon Voice as the system layer, with the community command set and a personal overlay for project-specific shortcuts. A Kinesis foot pedal handles modifier keys to avoid voice fatigue. Clicky sits on top in Chrome, bound to Alt via Talon so the user can say “clicky, where’s the deploy button” to get a halo without reaching for a keyboard. Dictation stays in Talon for code; Clicky never writes text for the user, only speaks answers and points. Push-to-talk suits this user precisely because they are already using voice control for everything else and do not want two voice systems competing for the microphone. See the dedicated PTT post for the privacy and focus reasons that matter here.
Analyst with advanced ALS. Eye tracker for cursor control, Windows Voice Access as the system layer with numbered overlays for clicking, and an always-listening wake-word assistant for voice-first queries. In this setup Clicky in its current form is the wrong tool: there is no reliable way to hold Alt. The user is best served by a wake-word assistant or by an integration between the eye tracker and a voice AI that already accepts gaze as a trigger. This is the exact edge case where push-to-talk is not the right default — and where Clicky’s roadmap matters more than its shipping product.
Frequently asked questions
Does Clicky replace Dragon or Talon?
No. Dragon and Talon are system-level voice-control tools that drive the operating system, including clicking, scrolling, and typing anywhere. Clicky is a browser extension that only runs inside Chrome and only performs overlay-based pointing and voice answers. If you need hands-free navigation across all applications, you need a voice stack at the OS level; Clicky sits above it, not instead of it.
Is push-to-talk hostile to motor-impaired users?
It can be, depending on the user. Requiring a held modifier excludes anyone who cannot reliably press and hold a key. For users who can tap but not hold, the roadmap tap-to-lock mode will remove that barrier; for users who cannot press keys at all, wake-word assistants are the better fit today. We think push-to-talk is the right default because it keeps the microphone off by default — but we treat “only ever push-to-talk” as a bug, not a principle.
What about the WCAG side of this?
Two WCAG 2.2 success criteria apply most directly to motor access on the web: 2.1.1 Keyboard (every function usable from a keyboard) and 2.5.7 Dragging Movements (any dragging must have a single-pointer alternative). A page that respects those criteria works well with the voice-control layer; a page that violates them will fail for voice users regardless of which AI assistant is running on top.
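A sketch makes 2.5.7 concrete. Assuming a hypothetical sortable list, the page keeps its drag-and-drop but adds single-click move buttons, which is the single-pointer alternative the criterion asks for and which also gives voice users a visible label to speak:

```ts
// Single-pointer alternative to dragging, per WCAG 2.2 SC 2.5.7.
// The list keeps drag-and-drop; these buttons provide the
// non-drag path. Markup and names are hypothetical.
function addMoveButtons(list: HTMLUListElement): void {
  for (const item of Array.from(list.querySelectorAll("li"))) {
    const up = document.createElement("button");
    up.textContent = "Move up"; // a visible label Voice Control can click
    up.addEventListener("click", () => {
      const prev = item.previousElementSibling;
      if (prev) list.insertBefore(item, prev); // single click, no drag
    });
    const down = document.createElement("button");
    down.textContent = "Move down";
    down.addEventListener("click", () => {
      const next = item.nextElementSibling;
      if (next) list.insertBefore(next, item); // moves item down one slot
    });
    item.append(up, down);
  }
}
```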
How do AI assistants handle the accessibility tree?
Some do, some do not. Tools that read only a visual screenshot can miss ARIA labels, form roles, and keyboard-accessibility metadata entirely. Tools that walk the DOM and the accessibility tree can address elements by their semantic role, which tends to align with how screen readers and voice-control software already see the page. Clicky reads both screenshot and DOM, and anchors the halo to the DOM node rather than to pixel coordinates, which matters when pages reflow.
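The difference is easiest to see in code. A sketch, with names that are ours rather than Clicky’s internals, of an overlay that re-derives its position from the element on every reflow instead of remembering pixel coordinates:

```ts
// DOM-anchored highlight sketch: position is recomputed from the
// element itself, so the halo follows the node when the page
// reflows instead of pointing at stale pixels. Names are ours.
function attachHalo(target: HTMLElement): () => void {
  const halo = document.createElement("div");
  halo.style.cssText =
    "position:absolute;border:3px solid #7c4dff;border-radius:6px;" +
    "pointer-events:none;z-index:99999;";
  document.body.appendChild(halo);

  const reposition = () => {
    const r = target.getBoundingClientRect(); // ask the node, not a cache
    halo.style.left = `${r.left + window.scrollX - 4}px`;
    halo.style.top = `${r.top + window.scrollY - 4}px`;
    halo.style.width = `${r.width + 2}px`;
    halo.style.height = `${r.height + 2}px`;
  };

  reposition();
  const ro = new ResizeObserver(reposition); // fires again on reflow
  ro.observe(target);
  ro.observe(document.documentElement);
  window.addEventListener("scroll", reposition, { passive: true });

  return () => { ro.disconnect(); halo.remove(); }; // cleanup
}
```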
Next in the series: a side-by-side comparison of Clicky and Sider — two very different takes on what an AI Chrome extension should do, and where the differences bite in day-to-day use.