LocalMind

A private AI research agent running Gemma entirely in your browser via WebGPU. Tool calling, persistent memory, and web search — all on-device.

Only your search queries touch the network (and only when you choose to). All reasoning stays on your device.

Models are cached after first download — future visits load instantly.

Requirements: Chrome 113+, Edge 113+, or Firefox 130+ with WebGPU.

v2.0.0 · Powered by Transformers.js + Google Gemma (Apache 2.0).

Available Models

  • Ternary Bonsai 1.7B (~470 MB, default) — text + agent (tool calling). Smallest download with tool calling. 1.58-bit ternary weights, Qwen3 backbone, Apache-2.0.
  • Ternary Bonsai 4B (~1.1 GB) — same capabilities, better quality.
  • Ternary Bonsai 8B (~2.2 GB) — best Bonsai quality, 65K context.
  • Gemma 3 1B (~760 MB) — text-only, no tool calling. Fallback option.
  • Gemma 4 E2B (~1.5 GB) — multimodal (image + audio) + agent.
  • Gemma 4 E4B (~4.9 GB) — multimodal + agent, best quality.

Agent Tools (Ternary Bonsai + Gemma 4)

  • calculate — math, percentages, conversions
  • get_current_time — date/time with timezone
  • store_memory — save facts to persistent memory
  • search_memory — recall from stored memories
  • web_search — search the web (requires API key)
  • fetch_page — read a web page’s content
  • set_reminder — browser notification after N minutes
  • list_memories — show what’s stored in memory
  • delete_memory — forget specific memories
  • segment_image — segment objects in attached images (SAM)

Image Segmentation (SAM)

Gemma 4 can call SAM (Segment Anything Model) to segment objects in attached images. Choose your SAM model in Settings — loaded on first use.

  • SlimSAM 50 (~10 MB) — fastest, good enough for most tasks
  • SlimSAM 77 (~14 MB) — default, better accuracy
  • SAM ViT-Base (~350 MB) — full quality, slower download
  • SAM 3 (latest) — newest architecture

Things to try

  • “Segment the main object in this image”
  • “Outline the person on the left”
  • “Isolate the background”
  • “How many distinct objects are in this image?”

Translation works directly — Gemma 4 speaks 140+ languages natively, no tool needed.

Gemma 3 1B works as a simple chatbot — no agent tools. Prefer Ternary Bonsai 1.7B if you want tool calling at a similar size.

Document Upload

  • Text — .txt, .md, .json, .csv
  • PDF — extracted via PDF.js, auto-summarized
  • DOCX — extracted via mammoth.js, auto-summarized
  • Folder — open a local folder to ingest all .md/.txt/.pdf files at once; re-open to sync only changed files (incremental)

Documents are chunked, embedded, and stored as searchable knowledge. A summary is generated on upload.
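A rough sketch of the chunking step (the 500-character size and 50-character overlap are illustrative assumptions, not LocalMind's documented parameters):

```javascript
// Split a document into overlapping character chunks for embedding.
// Overlap keeps sentences that straddle a boundary searchable from
// either neighboring chunk.
function chunkText(text, size = 500, overlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```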

Multimodal (Gemma 4)

  • 📎 Attach — images, audio, MP4 video, or documents
  • 📷 Camera / 🎤 Mic / Paste / Drag & drop

Conversations

  • New Chat — archives to History + starts fresh
  • Clear — deletes without saving
  • History — sidebar, click to resume any past chat
  • Share — generate an encrypted or plain link to share any conversation; recipient opens the URL to load it

Memory browser

  • Category pills — filter by fact / preference / finding / document / conversation with live counts
  • Source grouping — document chunks grouped by file; bulk “Delete all” per source
  • Audit — flags stale (>60 days), near-duplicates (cosine sim ≥0.92), and outliers (low avg similarity to category); bulk or per-item delete; auto-reruns after each deletion
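The near-duplicate check above boils down to cosine similarity over embedding vectors. A generic sketch (not LocalMind's internals), using the ≥0.92 threshold from the audit:

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; // numerator: dot product
    na += a[i] * a[i];  // squared norm of a
    nb += b[i] * b[i];  // squared norm of b
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// The audit flags a memory pair as a near-duplicate at similarity >= 0.92.
const isNearDuplicate = (a, b) => cosineSimilarity(a, b) >= 0.92;
```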

Output & Export

  • Save as MD — download any response as Markdown (or write directly to open folder if one is active)
  • Code download — hover code blocks for download button
  • Export / Import — in Memory panel, full data as JSON
  • Auto-backup — toggle in Settings to download on New Chat

Batch Prompts

  • Enter one prompt per line in the Batch panel — they run sequentially through the full agent loop
  • {{previous}} — explicit placeholder substituted with the previous response text
  • Auto-inject — checkbox (on by default) appends the previous response as context even without a placeholder; disabled for any prompt that already contains {{previous}}
  • Stop — halts after the current generation finishes; progress shown live
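The substitution rules above can be sketched as a single function (the name `resolveBatchPrompt` and the context framing are illustrative, not LocalMind internals):

```javascript
// Resolve one batch prompt against the previous response, per the rules
// above: an explicit {{previous}} placeholder is substituted; otherwise,
// when auto-inject is on, the previous response is appended as context.
function resolveBatchPrompt(prompt, previous, autoInject = true) {
  if (prompt.includes("{{previous}}")) {
    // Auto-inject is disabled for prompts that carry the placeholder.
    return prompt.replaceAll("{{previous}}", previous ?? "");
  }
  if (autoInject && previous) {
    return `${prompt}\n\nContext (previous response):\n${previous}`;
  }
  return prompt;
}
```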

Other

  • Web Search — Settings → choose a provider (Tavily, Brave, or a self-hosted SearXNG instance) + API key → 🌐 button
  • Thinking Mode — see chain-of-thought (collapses when done)
  • Multi-step planning (experimental, Gemma 4 only) — Settings → tick the toggle. Each message is planned into 2–5 steps, each step executed (with tools), then synthesised. Plan + per-step outputs render as collapsible blocks below the answer. Slower (3×+ model calls) but handles research-style queries better.
  • Branch from here — right-click (or long-press) a user message → "Branch from here". Archives the current conversation, then forks a new one containing messages up to that point. Continue the new branch from that question.
  • Custom tools (agent-capable models) — Settings → Custom tools. Paste a tool definition as JSON (name, description, parameters, endpoint). On a tool call the model's args are POSTed to your endpoint as a JSON body; the response is fed back to the model. CORS must allow this origin, and tool names must match [a-zA-Z_][a-zA-Z0-9_]* without colliding with a built-in.
  • MCP servers (agent-capable models) — Settings → MCP servers. Paste a Streamable HTTP MCP endpoint URL (plus optional bearer). LocalMind opens a JSON-RPC 2.0 session, discovers tools via tools/list, and registers each with an mcp_ prefix so the agent loop can use them alongside built-ins. Connections are re-established on page load.
  • Math & diagrams — inline $\int x^2 dx$ and display $$\sum_{i=1}^n i$$ math render via KaTeX; ```mermaid blocks render as SVG via lazy-loaded Mermaid.
  • Artifact preview — ```html / ```svg / ```artifact code blocks get a live sandboxed iframe below the code (sandbox="allow-scripts", no same-origin). Safe to run model-generated UI inline.
  • Voice to text — 🗣 button left of the input records mic audio, decodes to 16 kHz mono PCM on-device, and runs Whisper-base via WebGPU to transcribe into the input. ~80 MB first-use download; nothing leaves the device.
  • Python code tool (agent-capable models) — model can call run_python to execute Python in a sandboxed Pyodide worker. numpy / pandas / matplotlib auto-install on import. ~10 MB first-use download.
  • Cache management — view/clear cached models in Settings
  • Custom models — Settings → paste a Hugging Face ONNX repo id (causal LMs only; the repo must keep its ONNX files under onnx/, and multimodal custom models are not yet supported). The validator probes the HF API, picks the best available quantisation, estimates real load size, and hard-blocks anything that exceeds the device’s WebGPU buffer limit or the 6 GB ceiling.
  • Response badges — On-device / Agent / Web-enriched
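For the Custom tools entry above, a definition might look like this. The field names (name, description, parameters, endpoint) follow the description; the `get_weather` tool and its endpoint URL are hypothetical:

```json
{
  "name": "get_weather",
  "description": "Current weather for a city",
  "parameters": {
    "type": "object",
    "properties": { "city": { "type": "string" } },
    "required": ["city"]
  },
  "endpoint": "https://example.com/weather"
}
```

When the model calls the tool, its arguments (e.g. `{"city": "Tokyo"}`) are POSTed to the endpoint, and whatever JSON comes back is handed to the model as the tool result.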
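For the MCP servers entry above, tool discovery is a plain JSON-RPC 2.0 call (shown generically; session setup details vary by server):

```json
{ "jsonrpc": "2.0", "id": 1, "method": "tools/list" }
```

The server replies with a `result.tools` array, and each discovered tool is registered under its name with the `mcp_` prefix.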

JavaScript API (experimental)

Settings → tick Expose window.localmind. Same-tab only — cross-origin scripts cannot reach it. The object is frozen and non-writable; disable the toggle to detach.

Surface (v1.0)

  • version · ready · model — live state getters
  • listModels() — full registry incl. custom models with loaded flag
  • load(idOrKey) — loads a model (short key or HF id); resolves when ready
  • chat.completions.create({ messages, max_tokens, temperature, top_p, model }) — non-streaming, returns OpenAI-shaped chat.completion
  • chat.completions.create({ …, stream: true }) — async iterator yielding chat.completion.chunk objects
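A minimal streaming sketch against the surface above. The chunk shape (`choices[0].delta.content`) is assumed from the stated OpenAI-shaped `chat.completion.chunk` objects; the helper name is illustrative:

```javascript
// Collect streamed chat.completion.chunk objects into one string.
// Works with any async iterable in the OpenAI chunk shape, including
// the iterator assumed to be returned by stream: true above.
async function collectStream(chunks) {
  let text = "";
  for await (const chunk of chunks) {
    text += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return text;
}

// Usage (in the LocalMind tab, with the toggle enabled):
// const stream = await window.localmind.chat.completions.create({
//   messages: [{ role: "user", content: "Hi" }],
//   stream: true,
// });
// console.log(await collectStream(stream));
```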

Not exposed

  • Tools / tool calling
  • Memory read/write
  • File system, web search, search API keys, user profile
  • Multimodal input

Activity log

Every API call is logged in-memory (last 50). Click the • API chip in the toolbar or Settings → View activity log. Each call shows method, prompt length, tokens generated, duration, and outcome (ok / err / busy).

Demo

Open demo.html in the same folder. It iframes LocalMind, auto-flips the toggle, waits for the model, and runs both a non-streaming and a streaming completion against iframe.contentWindow.localmind.

Experimental — the shape may change before a stable v1.1.

Math & Conversions

  • What is 15% of 2450?
  • Convert 72 Fahrenheit to Celsius
  • Compound interest: $10K at 7% for 5 years?

Time & Reminders

  • What time is it in Tokyo?
  • Remind me in 5 minutes to check the oven

Memory

  • Remember: I'm a software engineer on Dashboard Pro
  • What do you know about me?
  • Forget my preferences

Translation

  • Translate "Good morning" to Japanese, French, Hindi
  • Train station directions in Spanish & German

Writing & Analysis

  • Write a polite meeting decline email
  • Microservices vs monolith: pros and cons
  • Explain WebGPU in 3 simple sentences

Documents (attach a PDF, DOCX, or text file)

  • Summarize the uploaded document
  • Main conclusions from my document

Multimodal (attach an image first)

  • Describe this image in detail
  • Transcribe text from this image

Web Research (requires API key)

  • Top tech news today
  • Latest WebGPU browser support status
  • AI in the browser: recent articles with sources

Coding

  • Sieve of Eratosthenes in Python
  • async/await vs Promises explained

Math & Diagrams

  • Derive the quadratic formula with LaTeX
  • Pythagorean identity with display math
  • Mermaid: login + JWT flow
  • Mermaid: producer-consumer sequence

Live HTML / SVG Artifacts

  • Interactive counter (HTML artifact)
  • Pomodoro timer (HTML artifact)
  • Sunrise scene (SVG artifact)

Python (Gemma 4, run_python tool)

  • First 20 Fibonacci via Python
  • Pandas synthetic data + correlation
  • Solve a 3x3 linear system

Multi-step planning (Gemma 4, Settings toggle)

  • Compare 3 coffee brewing methods
  • Silk Road in 3 phases

MCP tools (after adding a server in Settings)

  • List all MCP tools I have
  • Use an MCP fetch tool

Voice to text (no prompt needed)

Click the 🗣 button to the left of the input, speak a sentence, click again to stop. Whisper runs on-device. Try saying: “Write a short summary of the Great Pyramid of Giza in three sentences.”

Tips
  • For artifact and Mermaid prompts, start with “Output only this…” or “Copy this exactly…” — smaller models tend to wrap code blocks in prose otherwise.
  • If a ```mermaid block renders as plain code, the model stripped the language label. Re-send with “Preserve the language tag after the backticks”.
  • Click any prompt above to paste it.
  • Web search needs a provider in Settings.
  • Multimodal needs a Gemma 4 model + an attached image.
  • Artifact, math, and Mermaid rendering work on any model; run_python, MCP tools, and multi-step planning require an agent-capable (Gemma 4) model.