Similarity index: derived projection on Cloudflare Vectorize
Status: Adopted 2026-05-10.
The recurring question
Capture-time similarity (“have you written this before?”) and weekly review (“what patterns are emerging?”) both need a vector index over entry content. Where does that index live, and how does it stay consistent with the local-first event log?
This doc encodes the architecture so we don’t re-litigate per AI feature.
Decision rule
The vector index is a derived projection of the event log, hosted on the Worker side, keyed per user_id.
Specifically:
- Source of truth is `events.jsonl`: local FS primary, R2 cold replica.
- The index (Cloudflare Vectorize) is a `Map<EntryId, Vector>` derived from `EntryWritten` payloads.
- Embeddings are never written to the event log. Vectors are 2KB each (512 dims × 4 bytes); they bloat the log without paying their rent. The log holds user-authored content; the index is a machine-derived view.
- The index is rebuildable from the log at any time. Re-embed all entries → upsert → done.
- Capture-time queries hit the Worker → Voyage embeds the query → Vectorize returns top-K. Cloud-only; offline silently hides the panel.
- Indexing happens server-side, downstream of sync push, with a Rust-orchestrated fast path (Option C, adopted in the indexing section below).
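The "derived projection" idea can be sketched as a replay over the log. This is a minimal illustration, not the Worker's code: the `LogEvent` shape, the `projectIndex` and `vectorBytes` names, and the synchronous `Embed` signature are all assumptions made for the example. Only `EntryWritten`, `EntryId`, and the 512-dimension figure come from this doc.

```typescript
// Hypothetical event shape: only the EntryWritten name comes from this doc.
interface LogEvent {
  type: string;
  entryId?: string;
  content?: string;
}

type Vector = number[];

// Assumed embedder signature. The real one would be an async Voyage call
// made by the Worker; a synchronous function keeps the sketch short.
type Embed = (content: string) => Vector;

// Replay the log into Map<EntryId, Vector>. Setting by entryId means a
// replay over the same log always produces the same map.
function projectIndex(log: LogEvent[], embed: Embed): Map<string, Vector> {
  const index = new Map<string, Vector>();
  for (const event of log) {
    if (event.type !== "EntryWritten") continue;
    if (event.entryId === undefined || event.content === undefined) continue;
    index.set(event.entryId, embed(event.content));
  }
  return index;
}

// Bytes needed for one float32 vector of the given dimensionality.
function vectorBytes(dimensions: number): number {
  return dimensions * Float32Array.BYTES_PER_ELEMENT;
}
```

Because the map is a pure function of the log and the embedder, throwing it away and calling `projectIndex` again is always safe, which is the property the rebuild story relies on. `vectorBytes(512)` evaluates to 2048, matching the 2KB figure in the list above.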
What this preserves
- Local-first invariant. Capture writes block on `events.jsonl` only. No new network on the dispatch hot path. The train-crash lesson from Stage 3a still holds.
- Augmentative, not core. The app works fully offline for everything except the AI affordances, which degrade to silent absence. Same shape as classify-on-capture today.
- Single source of truth. If R2 and Vectorize ever diverge, R2 wins — Vectorize gets rebuilt from R2.
- No on-device inference. iPhone never embeds; only queries via the Worker.
What this trades away
- Offline = no similarity hint. Acceptable per the discovery research: silently hide vs. blocking save. Mirrors how the 401 banner and classify-on-capture already degrade.
- Cloud dependency creep. The Worker becomes load-bearing for the AI surface. If the project ever drops the Worker entirely (peer-to-peer or single-device), the AI surface goes with it.
- Vectorize lock-in. Vectorize is Cloudflare-specific. Escape hatch: keep the embedding model + dimensionality stable; if migrating to workstation TEI + sqlite-vec, Turso, etc., re-embed once and switch the binding.
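One way to keep the escape hatch cheap is to have callers depend on a narrow interface rather than on the Vectorize binding directly. The sketch below is an assumption about how that seam could look, not a description of the current Worker: `VectorIndex`, `InMemoryIndex`, and `cosine` are hypothetical names, and the methods are synchronous only to keep the example runnable without any backend.

```typescript
interface Match {
  entryId: string;
  score: number;
}

// Hypothetical seam: callers depend on this, not on a specific backend.
interface VectorIndex {
  upsert(entryId: string, vector: number[]): void;
  query(vector: number[], topK: number): Match[];
}

// Cosine similarity; returns 0 if either vector has zero length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new RangeError("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Brute-force stand-in for a real backend (Vectorize, sqlite-vec, ...).
class InMemoryIndex implements VectorIndex {
  private readonly vectors = new Map<string, number[]>();

  upsert(entryId: string, vector: number[]): void {
    this.vectors.set(entryId, vector); // same id overwrites, so repeats are harmless
  }

  query(vector: number[], topK: number): Match[] {
    return Array.from(this.vectors.entries())
      .map(([entryId, stored]) => ({ entryId, score: cosine(vector, stored) }))
      .sort((left, right) => right.score - left.score)
      .slice(0, topK);
  }
}
```

With a seam like this, a migration would be a re-embed plus a different `VectorIndex` implementation, provided the embedding model and dimensionality stay fixed as the bullet above requires.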
Three operations, three local-first stories
Read — capture-time query
    Page.Capture (typing, debounced 300ms)
      ↓
    Command.SimilaritySearch { content }
      ↓
    Rust dispatch → Worker POST /llm/similarity { content }
      ↓
    Worker: Voyage embed(content) → Vectorize.query(vec, topK=3)
      ↓
    NDJSON or JSON response with [{ entryId, score }, ...]
      ↓
    Elm renders collapsed panel below capture form

- Doesn’t block save. Save path unchanged: tap Save → write `events.jsonl` → return.
- Cloud-only. Offline / 5xx / timeout → panel silently hides.
- Same shape as `ClassifyEntry`: direct Worker round-trip, fire-and-forget on failure.
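The "fail to silence" contract of the read path can be sketched as a function that never throws and returns an empty list when anything goes wrong. This is a hedged illustration rather than the actual route handler: `SimilarityDeps` and `similaritySearch` are hypothetical names, and the embed and query steps are injected so the example runs without Voyage or Vectorize. The `topK` of 3 and the `{ entryId, score }` shape come from the flow above.

```typescript
interface Match {
  entryId: string;
  score: number;
}

// Assumed dependency shapes, injected so the sketch has no network calls.
interface SimilarityDeps {
  embed(content: string): Promise<number[]>;
  query(vector: number[], topK: number): Promise<Match[]>;
}

const TOP_K = 3;

// Returns matches, or an empty list on any failure. An empty list is the
// signal for the caller to keep the panel hidden; nothing is surfaced as
// an error and the save path is never involved.
async function similaritySearch(
  content: string,
  deps: SimilarityDeps,
): Promise<Match[]> {
  if (content.trim() === "") return []; // nothing to compare yet
  try {
    const vector = await deps.embed(content);
    return await deps.query(vector, TOP_K);
  } catch {
    return []; // offline, 5xx, timeout: degrade to silent absence
  }
}
```

Collapsing every failure into "no matches" means the UI needs only one rule: render the panel when the list is non-empty.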
Write — indexing
When a new entry lands, it eventually needs to be embedded and upserted into Vectorize so future queries find it.
Adopted: Option C (downstream of sync, with Rust-orchestrated fast path).
    EntriesAppend → record_entry → write events.jsonl → return
        ↓ (spawned async, mirrors ClassifyEntry)
    IndexEntry side-effect → Worker POST /llm/index
        ↓
    embed + upsert Vectorize

    Sync tick → push batch to R2 → Worker batch handler
        embeds any unindexed EntryWritten
        + upserts Vectorize (idempotent safety net)

- Fast freshness in the common case. The IndexEntry side-effect mirrors `ClassifyEntry`: fires immediately after the local write, indexes within seconds.
- Self-healing. The Worker batch handler also indexes on push, so any IndexEntry failure (offline, 5xx) is recovered when sync next runs. Vectorize upsert is idempotent, keyed by `entryId`.
- One ground truth. The Worker batch handler is the canonical indexer; the Rust side-effect is a latency optimization. If we ever simplify, we drop the Rust side-effect, not the Worker handler.
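The interplay between the fast path and the safety net can be sketched with a stand-in index. This is an illustration under assumptions: `FakeIndex`, `indexBatch`, and the `LogEvent` shape are hypothetical, and this doc does not say how the batch handler detects an unindexed entry, so the sketch assumes a cheap membership check (`has`). The embedder is synchronous only to keep the example short.

```typescript
// Hypothetical event shape: only the EntryWritten name comes from this doc.
interface LogEvent {
  type: string;
  entryId?: string;
  content?: string;
}

// Stand-in for the vector index. Upsert by id replaces the old value,
// so indexing the same entry twice leaves one vector.
class FakeIndex {
  private readonly vectors = new Map<string, number[]>();

  upsert(entryId: string, vector: number[]): void {
    this.vectors.set(entryId, vector);
  }

  // Assumed membership check; the real mechanism is not specified here.
  has(entryId: string): boolean {
    return this.vectors.has(entryId);
  }

  get size(): number {
    return this.vectors.size;
  }
}

// Safety-net pass over a pushed batch: embed and upsert only entries the
// fast path has not already indexed. Returns how many were embedded.
function indexBatch(
  batch: LogEvent[],
  index: FakeIndex,
  embed: (content: string) => number[],
): number {
  let embedded = 0;
  for (const event of batch) {
    if (event.type !== "EntryWritten") continue;
    if (event.entryId === undefined || event.content === undefined) continue;
    if (index.has(event.entryId)) continue; // fast path got there first
    index.upsert(event.entryId, embed(event.content));
    embedded += 1;
  }
  return embedded;
}
```

Skipping already-indexed entries is a cost optimization, not a correctness requirement: since upsert is keyed by `entryId`, embedding an entry twice would still leave exactly one vector.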
Rebuild — fresh install / model swap
- Fresh install of jg. Local `events.jsonl` rebuilds from R2 (existing cloud-restore path). Vectorize is keyed per `user_id` on the Worker and is never touched by reinstall. Similarity queries continue working immediately.

- Vectorize wipe / model swap. If we change the embedding model (e.g., voyage-4-lite → voyage-4-large) or dimensionality, run a one-off Worker job: walk all R2 batches for that user, re-embed each `EntryWritten`, upsert. ~30s for 150 entries, ~5min for 10K. Within the same Voyage family (e.g., voyage-4-lite ↔ voyage-4-large at the same dimension) embeddings are compatible per Voyage’s docs and the swap is a no-op; cross-family swaps (v3 → v4) require a real re-embed.

  Implemented as `POST /llm/reindex` (sessionAuth, walks the caller’s R2 batches, batch-embeds + upserts under their namespace). Idempotent.

- Local-only loss. Local FS lost; R2 + Vectorize intact. Sync-pull repopulates `events.jsonl`. No re-embedding needed.
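The re-embed walk can be sketched as chunked batch embedding followed by upserts. This is a minimal illustration, not the `/llm/reindex` implementation: `Entry`, `chunk`, `reindex`, the `EmbedBatch` signature, and the default batch size of 64 are all assumptions chosen for the example, and a plain `Map` stands in for the user's index namespace.

```typescript
interface Entry {
  entryId: string;
  content: string;
}

// Assumed batch embedder: returns one vector per input, in input order.
type EmbedBatch = (contents: string[]) => number[][];

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const groups: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    groups.push(items.slice(start, start + size));
  }
  return groups;
}

// Re-embed every entry in batches and upsert by entryId.
// Returns the number of embed calls made.
function reindex(
  entries: Entry[],
  embedBatch: EmbedBatch,
  index: Map<string, number[]>,
  batchSize: number = 64, // illustrative value, not a documented limit
): number {
  let calls = 0;
  for (const group of chunk(entries, batchSize)) {
    const vectors = embedBatch(group.map((entry) => entry.content));
    calls += 1;
    group.forEach((entry, position) => {
      const vector = vectors[position];
      if (vector === undefined) throw new Error("embedder returned too few vectors");
      index.set(entry.entryId, vector); // keyed upsert keeps reruns idempotent
    });
  }
  return calls;
}
```

Under these assumptions, 150 entries need three embed calls, and running the job a second time leaves the index at the same size.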
What is not in scope here
- No `Event::EntryEmbedded { vector, model }` variant. Embeddings are not events. Vectors don’t go through sync.
- No on-device embedding model. Justin’s iPhone never runs inference; the Worker is the only embedding caller.
- No multi-model index. Single embedding model at a time. Currently `voyage-4-lite` MRL-truncated to 512 dimensions, cosine distance. Pinned here and in `worker/src/embeddings.ts`’s `VOYAGE_MODEL` / `VOYAGE_DIMENSION` constants. Model migration is a one-off batch re-embed via `/llm/reindex`.
- No similarity write-back. The hint is advisory only: it never auto-merges, never edits, never writes to the log on its own. User decisions (Open / Supersede / Keep both) produce normal capture / supersede events through the existing path.
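The single-model constraint can be made mechanical with a small guard around the pinned constants. The constant names and values below come from this doc; everything else is an assumption for illustration: `IndexStamp` (a record of which model built an index), `needsReindex`, and `assertDimension` are hypothetical and may not match what `worker/src/embeddings.ts` actually contains.

```typescript
// Pinned values, as stated in this doc.
const VOYAGE_MODEL = "voyage-4-lite";
const VOYAGE_DIMENSION = 512;

// Hypothetical record of which model and dimension built an index.
interface IndexStamp {
  model: string;
  dimension: number;
}

// Conservative check: any difference from the pinned pair counts as
// needing a re-embed. It does not encode same-family compatibility.
function needsReindex(stamp: IndexStamp): boolean {
  return stamp.model !== VOYAGE_MODEL || stamp.dimension !== VOYAGE_DIMENSION;
}

// Reject vectors of the wrong size before they reach the index, so a
// misconfigured embed call fails loudly instead of polluting results.
function assertDimension(vector: number[]): number[] {
  if (vector.length !== VOYAGE_DIMENSION) {
    throw new RangeError(
      "expected " + VOYAGE_DIMENSION + " dimensions, got " + vector.length,
    );
  }
  return vector;
}
```

The check is deliberately stricter than the same-family rule described in the rebuild section; relaxing it would mean comparing a model family rather than the exact model name.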
Open follow-ups
- Worker-side `/llm/similarity` and `/llm/index` routes.
- Vectorize binding in `worker/wrangler.toml`.
- `Command::SimilaritySearch` in `Command` enum + dispatch arm + Elm Command encoder.
- `DispatchSideEffect::IndexEntry` + Tauri host arm + dev-web log-and-drop.
- Embed-on-push handler in the Worker’s `pushBatch` flow.
- Backfill script for the existing 150 entries.
- Capture-time UX (`Page.Capture` panel, debounce, side-by-side compare sheet).
The implementation arc is captured in the AI similarity discovery report (/tmp/discovery-ai-similarity.md for now; promote to docs/plans/ when scoping the v0 build).
Related docs
- `docs/architecture/event-sourced-state.md`: JSONL is truth; projection is derived.
- `docs/architecture/in-memory-projection.md`: the in-memory projection sibling to this one (entries / interactions / firings, not vectors).
- `docs/architecture/side-effect-orchestration.md`: Rust if writes-to-log, Elm if session-ephemeral. IndexEntry is a Rust side-effect (cloud network, durable index).
- `docs/architecture/local-first-sync.md`: the 4-stage plan that makes events.jsonl primary; this doc extends it to a derived index.