
Similarity index: derived projection on Cloudflare Vectorize

Status: Adopted 2026-05-10.

Capture-time similarity (“have you written this before?”) and weekly review (“what patterns are emerging?”) both need a vector index over entry content. Where does that index live, and how does it stay consistent with the local-first event log?

This doc records the architecture so we don’t re-litigate it for each AI feature.

The vector index is a derived projection of the event log, hosted on the Worker side, keyed per user_id.

Specifically:

  1. Source of truth is events.jsonl — local FS primary, R2 cold replica.
  2. The index (Cloudflare Vectorize) is a Map<EntryId, Vector> derived from EntryWritten payloads.
  3. Embeddings are never written to the event log. Vectors are 2KB each (512 dims × 4 bytes); they bloat the log without paying their rent. The log holds user-authored content; the index is a machine-derived view.
  4. The index is rebuildable from the log at any time. Re-embed all entries → upsert → done.
  5. Capture-time queries hit the Worker → Voyage embeds the query → Vectorize returns top-K. Cloud-only; offline silently hides the panel.
  6. Indexing happens server-side, downstream of sync push, with a Rust-orchestrated fast path (Option C in the ADR section below).
  • Local-first invariant. Capture writes block on events.jsonl only. No new network on the dispatch hot path. The train-crash lesson from Stage 3a still holds.
  • Augmentative, not core. The app works fully offline for everything except the AI affordances, which degrade to silent absence. Same shape as classify-on-capture today.
  • Single source of truth. If R2 and Vectorize ever diverge, R2 wins — Vectorize gets rebuilt from R2.
  • No on-device inference. iPhone never embeds; only queries via the Worker.
  • Offline = no similarity hint. Acceptable per the discovery research: silently hiding the panel is preferable to blocking save. Mirrors how the 401 banner and classify-on-capture already degrade.
  • Cloud dependency creep. The Worker becomes load-bearing for the AI surface. If the project ever dropped the Worker entirely (peer-to-peer or single-device), the AI surface would go with it.
  • Vectorize lock-in. Vectorize is Cloudflare-specific. Escape hatch: keep the embedding model + dimensionality stable; if migrating to workstation TEI + sqlite-vec, Turso, etc., re-embed once and switch the binding.
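
The 2 KB figure in item 3 is plain arithmetic, and the same arithmetic shows what keeping vectors in the log would cost. A minimal sketch, assuming raw 4-byte components (a JSON-encoded vector inside events.jsonl would be larger); the function names are hypothetical, and the entry counts are the 150 and 10K figures used elsewhere in this doc:

```typescript
const BYTES_PER_COMPONENT = 4; // 4 bytes per dimension, as stated in item 3

// Raw size of one embedding vector.
function vectorBytes(dimensions: number): number {
  return dimensions * BYTES_PER_COMPONENT;
}

// Raw size of all vectors for a given number of entries.
function totalVectorBytes(entryCount: number, dimensions: number): number {
  return entryCount * vectorBytes(dimensions);
}

console.log(vectorBytes(512));             // 2048 bytes, i.e. 2 KB per entry
console.log(totalVectorBytes(150, 512));   // 307200 bytes (300 KB)
console.log(totalVectorBytes(10000, 512)); // 20480000 bytes (about 20 MB)
```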

Three operations, three local-first stories

Page.Capture (typing, debounced 300ms)
Command.SimilaritySearch { content }
Rust dispatch → Worker POST /llm/similarity { content }
Worker: Voyage embed(content) → Vectorize.query(vec, topK=3)
NDJSON or JSON response with [{ entryId, score }, ...]
Elm renders collapsed panel below capture form
  • Doesn’t block save. Save path unchanged: tap Save → write events.jsonl → return.
  • Cloud-only. Offline / 5xx / timeout → panel silently hides.
  • Same shape as ClassifyEntry — direct Worker round-trip, fire-and-forget on failure.
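
The query path above can be sketched as two small functions: the Worker-side embed-then-query step, and a caller-side wrapper that turns any failure or timeout into "no hits" so the panel simply hides. This is a sketch under assumptions, not the actual Worker or dispatch code: SimilarityDeps, similaritySearch, and similarityOrHide are hypothetical names, and the real Voyage and Vectorize calls are injected rather than shown.

```typescript
// Hypothetical shapes; the real Worker would back these with Voyage and the Vectorize binding.
interface Match { id: string; score: number }
interface Hit { entryId: string; score: number }
interface SimilarityDeps {
  embed(text: string): Promise<number[]>;
  query(vector: number[], topK: number): Promise<Match[]>;
}

const TOP_K = 3; // topK=3, as in the flow above

// Worker side: embed the draft content, then ask the index for the nearest entries.
async function similaritySearch(content: string, deps: SimilarityDeps): Promise<Hit[]> {
  const vector = await deps.embed(content);
  const matches = await deps.query(vector, TOP_K);
  return matches.map((m) => ({ entryId: m.id, score: m.score }));
}

// Caller side: offline, 5xx, and timeout all collapse to an empty list,
// which the UI treats as "hide the panel". Nothing here touches the save path.
async function similarityOrHide(
  content: string,
  deps: SimilarityDeps,
  timeoutMs: number,
): Promise<Hit[]> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Hit[]>((resolve) => {
    timer = setTimeout(() => resolve([]), timeoutMs);
  });
  try {
    return await Promise.race([similaritySearch(content, deps), timeout]);
  } catch {
    return [];
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Returning an empty list rather than an error keeps the failure invisible to the user, which matches the fire-and-forget shape described for ClassifyEntry.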

When a new entry lands, it eventually needs to be embedded and upserted into Vectorize so future queries find it.

Adopted: Option C (downstream of sync, with Rust-orchestrated fast path).

Fast path:
EntriesAppend → record_entry → write events.jsonl → return
    ↓ (spawned async, mirrors ClassifyEntry)
IndexEntry side-effect → Worker POST /llm/index
    ↓ embed + upsert
Vectorize

Safety net:
Sync tick → push batch to R2 → Worker batch handler embeds any unindexed EntryWritten + upserts Vectorize (idempotent)
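
The two paths in the diagram can be sketched with an in-memory stand-in for the index. Everything here is illustrative: the LogEvent shape, FakeIndex, and both function names are hypothetical, and the "already indexed" check is an assumption about how the batch handler might recognize unindexed entries.

```typescript
type Vector = number[];
type Embed = (text: string) => Promise<Vector>;

// Hypothetical event shape; the real EntryWritten payload is not specified here.
interface LogEvent { type: string; entryId?: string; content?: string }

// In-memory stand-in for the vector index, keyed by entryId.
class FakeIndex {
  private readonly vectors = new Map<string, Vector>();
  upsert(entryId: string, vector: Vector): void { this.vectors.set(entryId, vector); }
  has(entryId: string): boolean { return this.vectors.has(entryId); }
  get size(): number { return this.vectors.size; }
}

// Safety net: on push, embed and upsert every EntryWritten that is not indexed yet.
// Returns how many entries were embedded in this run.
async function indexPushedBatch(events: LogEvent[], index: FakeIndex, embed: Embed): Promise<number> {
  let embedded = 0;
  for (const e of events) {
    if (e.type !== "EntryWritten" || e.entryId === undefined || e.content === undefined) continue;
    if (index.has(e.entryId)) continue; // the fast path already handled this entry
    index.upsert(e.entryId, await embed(e.content));
    embedded += 1;
  }
  return embedded;
}

// Fast path: best effort right after the local write. Failures are swallowed
// on purpose, because the next sync push will pick the entry up.
async function indexEntryFastPath(event: LogEvent, index: FakeIndex, embed: Embed): Promise<void> {
  try {
    await indexPushedBatch([event], index, embed);
  } catch {
    // Offline or 5xx: nothing to do here.
  }
}
```

Because the upsert is keyed by entryId, running the batch handler again over the same events leaves the index unchanged.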
  • Fast freshness in the common case. The IndexEntry side-effect mirrors ClassifyEntry — fires immediately after the local write, indexes within seconds.
  • Self-healing. The Worker batch handler also indexes on push, so any IndexEntry failure (offline, 5xx) is recovered when sync next runs. Vectorize upsert is idempotent, keyed by entryId.
  • One ground truth. The Worker batch handler is the canonical indexer; the Rust side-effect is a latency optimization. If we ever simplify, we drop the Rust side-effect, not the Worker handler.
  • Fresh install of jg. Local events.jsonl rebuilds from R2 (existing cloud-restore path). Vectorize is keyed per user_id on the Worker — never touched by reinstall. Similarity queries continue working immediately.

  • Vectorize wipe / model swap. If the index is wiped, or we change the embedding model family (e.g., v3 → v4) or the dimensionality, run a one-off Worker job: walk all R2 batches for that user, re-embed each EntryWritten, upsert. ~30s for 150 entries, ~5min for 10K. Within the same Voyage family (e.g., voyage-4-lite ↔ voyage-4-large at the same dimension), embeddings are compatible per Voyage’s docs, so that swap needs no re-embed; cross-family swaps (v3 → v4) require a full re-embed.

    Implemented as POST /llm/reindex (sessionAuth, walks the caller’s R2 batches, batch-embeds + upserts under their namespace). Idempotent.

  • Local-only loss. Local FS lost; R2 + Vectorize intact. Sync-pull repopulates events.jsonl. No re-embedding needed.
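
The reindex job described above (walk the user's batches, batch-embed, upsert under their namespace) could look roughly like the following. This is a sketch under stated assumptions, not the /llm/reindex implementation: ReindexDeps and its three methods are hypothetical stand-ins for the R2 walk, the batched embedding call, and the index upsert, and the default batch size of 64 is an arbitrary illustration.

```typescript
// Hypothetical shapes; the real payloads and bindings are not specified here.
interface LogEvent { type: string; entryId?: string; content?: string }
interface VectorRecord { id: string; values: number[]; namespace: string }
interface ReindexDeps {
  listBatches(userId: string): Promise<LogEvent[][]>; // stand-in for walking the user's R2 batches
  embedMany(texts: string[]): Promise<number[][]>;    // stand-in for a batched embedding call
  upsert(records: VectorRecord[]): Promise<void>;     // stand-in for the index upsert
}

// Split a list into groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("chunk size must be at least 1");
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) groups.push(items.slice(i, i + size));
  return groups;
}

// Re-embed every EntryWritten for one user and upsert it under that user's namespace.
// Returns the number of entries re-embedded.
async function reindexUser(userId: string, deps: ReindexDeps, batchSize = 64): Promise<number> {
  const entries: { entryId: string; content: string }[] = [];
  for (const batch of await deps.listBatches(userId)) {
    for (const e of batch) {
      if (e.type === "EntryWritten" && e.entryId !== undefined && e.content !== undefined) {
        entries.push({ entryId: e.entryId, content: e.content });
      }
    }
  }
  for (const group of chunk(entries, batchSize)) {
    const vectors = await deps.embedMany(group.map((e) => e.content));
    if (vectors.length !== group.length) throw new Error("embedding count mismatch");
    const records = group.map((e, i) => {
      const values = vectors[i];
      if (values === undefined) throw new Error("missing vector");
      return { id: e.entryId, values, namespace: userId };
    });
    await deps.upsert(records);
  }
  return entries.length;
}
```

Since records are keyed by entry ID, running the job twice writes the same records again and leaves the same index, which is what makes the endpoint safe to retry.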

  • No Event::EntryEmbedded { vector, model } variant. Embeddings are not events. Vectors don’t go through sync.
  • No on-device embedding model. Justin’s iPhone never runs inference; the Worker is the only embedding caller.
  • No multi-model index. Single embedding model at a time. Currently voyage-4-lite MRL-truncated to 512 dimensions, cosine distance. Pinned here and in worker/src/embeddings.ts’s VOYAGE_MODEL / VOYAGE_DIMENSION constants. Model migration is a one-off batch re-embed via /llm/reindex.
  • No similarity write-back. The hint is advisory only — never auto-merges, never edits, never writes to the log on its own. User decisions (Open / Supersede / Keep both) produce normal capture / supersede events through the existing path.
  • Worker-side /llm/similarity and /llm/index routes.
  • Vectorize binding in worker/wrangler.toml.
  • Command::SimilaritySearch in Command enum + dispatch arm + Elm Command encoder.
  • DispatchSideEffect::IndexEntry + Tauri host arm + dev-web log-and-drop.
  • Embed-on-push handler in the Worker’s pushBatch flow.
  • Backfill script for the existing 150 entries.
  • Capture-time UX (Page.Capture panel, debounce, side-by-side compare sheet).
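
The single-model pin in the list above (voyage-4-lite, 512 dimensions, cosine distance) is easy to guard in code. A minimal sketch: the two constant names and values come from this doc, while assertDimension and cosineSimilarity are hypothetical helpers, written out only to show what the pin and the distance metric mean. In the deployed system the index computes the distance itself.

```typescript
// Names and values as pinned in this doc.
const VOYAGE_MODEL = "voyage-4-lite";
const VOYAGE_DIMENSION = 512;

// Reject vectors that do not match the pinned dimensionality, so a
// mismatched model or truncation setting fails loudly instead of polluting the index.
function assertDimension(vector: number[]): void {
  if (vector.length !== VOYAGE_DIMENSION) {
    throw new Error(
      `expected ${VOYAGE_DIMENSION} dimensions for ${VOYAGE_MODEL}, got ${vector.length}`,
    );
  }
}

// Cosine similarity: dot product divided by the product of the vector norms.
// Returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```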

The implementation arc is captured in the AI similarity discovery report (/tmp/discovery-ai-similarity.md for now; promote to docs/plans/ when scoping the v0 build).

  • docs/architecture/event-sourced-state.md — JSONL is truth; projection is derived.
  • docs/architecture/in-memory-projection.md — the in-memory projection sibling to this one (entries / interactions / firings, not vectors).
  • docs/architecture/side-effect-orchestration.md — Rust if writes-to-log, Elm if session-ephemeral. IndexEntry is a Rust side-effect (cloud network, durable index).
  • docs/architecture/local-first-sync.md — the 4-stage plan that makes events.jsonl primary; this doc extends it to a derived index.