Architecture

This page describes the current implementation. It follows the runtime behavior of the repository and the configuration surface documented in README.md and docs/configuration.md.

Infrastructure Shape

Pali keeps one core memory service and exposes it through multiple operator and application surfaces:

REST API for application integration
MCP server for agent hosts
dashboard handlers for operator visibility

The repository layer remains the source of truth for persisted memory rows and tenant metadata. Optional systems such as Qdrant and Neo4j extend retrieval and entity-fact behavior rather than replacing the metadata store used by the dashboard and most service operations.

Embeddings

embedding.provider: lexical (default; legacy alias: mock): zero-dependency, pure-Go lexical fallback provider for first boot, CI, and smoke runs.
embedding.provider: ollama: offline HTTP embedder via local Ollama server (/api/embed).
embedding.provider: onnx: advanced local inference path; validates model/tokenizer paths and runs ONNX Runtime inference.
embedding.provider: openrouter: remote embeddings via OpenRouter (/api/v1/embeddings).
embedding.fallback_provider: lexical (default): used automatically when primary provider initialization fails.
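
The fallback rule above can be illustrated with a small Go sketch. Everything in it is an assumption made for illustration: the Embedder interface, the newEmbedder constructor, and the ollamaUp flag that simulates an initialization failure are hypothetical and are not taken from the Pali codebase. Only the provider names and the "fall back when the primary fails to initialize" behavior come from this page.

```go
package main

import (
	"errors"
	"fmt"
)

// Embedder is a hypothetical interface used only for this sketch.
type Embedder interface {
	Name() string
}

type namedEmbedder struct{ name string }

func (e namedEmbedder) Name() string { return e.name }

// newEmbedder is a stand-in constructor. The ollamaUp flag simulates
// whether the local Ollama server can be reached during initialization.
func newEmbedder(provider string, ollamaUp bool) (Embedder, error) {
	switch provider {
	case "mock": // legacy alias for the lexical provider
		return namedEmbedder{"lexical"}, nil
	case "lexical", "onnx", "openrouter":
		return namedEmbedder{provider}, nil
	case "ollama":
		if !ollamaUp {
			return nil, errors.New("ollama server unreachable")
		}
		return namedEmbedder{provider}, nil
	default:
		return nil, fmt.Errorf("unknown embedding provider %q", provider)
	}
}

// resolveEmbedder tries the primary provider first and switches to the
// fallback provider only when primary initialization returns an error.
func resolveEmbedder(primary, fallback string, ollamaUp bool) (Embedder, error) {
	e, err := newEmbedder(primary, ollamaUp)
	if err == nil {
		return e, nil
	}
	fb, fbErr := newEmbedder(fallback, ollamaUp)
	if fbErr != nil {
		return nil, fmt.Errorf("primary %q failed (%v); fallback %q failed: %w",
			primary, err, fallback, fbErr)
	}
	return fb, nil
}

func main() {
	// The primary provider fails to initialize, so the fallback is used.
	e, err := resolveEmbedder("ollama", "lexical", false)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Name()) // prints "lexical"
}
```

Because the lexical provider has no external dependencies, using it as the fallback means the service can still start when a model server or remote API is unavailable.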

Retrieval Pipeline

Memory search uses hybrid retrieval + reranking:

Candidate generation
  lexical BM25 ranking from SQLite FTS5 (memory_fts)
  vector ranking from the configured embedder + vectorstore, when available
Fusion
  Reciprocal Rank Fusion (RRF, k=60) merges lexical + dense ranks without score-scale tuning
Configurable reranking (internal/core/memory/search.go)
  candidate window is topK * 5, clamped to the range 50..200
  retrieval.scoring.algorithm=wal: score = weighted(recency, relevance, importance)
  retrieval.scoring.algorithm=match: score = weighted(recency, relevance, importance, query_overlap, route_fit)
  recency uses Ebbinghaus-style decay: recency = 0.995 ^ hours_since_last_access
  feature signals are normalized to [0,1]
  weights are config-driven from pali.yaml
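
The fusion step and the candidate window can be sketched in a few lines of Go. This is a minimal illustration, not the code in internal/core/memory/search.go: the function names candidateWindow and fuseRRF are hypothetical, and the sketch assumes ranks start at 1 and that ties are broken by ID, neither of which is stated on this page.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfK is the RRF constant named in the pipeline description (k=60).
const rrfK = 60

// candidateWindow returns topK*5 clamped to the range 50..200.
func candidateWindow(topK int) int {
	w := topK * 5
	if w < 50 {
		return 50
	}
	if w > 200 {
		return 200
	}
	return w
}

// fuseRRF merges ranked ID lists (each ordered best-first).
// Each ID scores the sum of 1/(k + rank) over the lists it appears in,
// with rank starting at 1 (an assumption of this sketch).
func fuseRRF(lists ...[]string) []string {
	scores := make(map[string]float64)
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / float64(rrfK+i+1)
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break by ID
	})
	return ids
}

func main() {
	lexical := []string{"m1", "m2", "m3"} // BM25 order
	dense := []string{"m3", "m1", "m4"}   // vector order
	fmt.Println(candidateWindow(10), fuseRRF(lexical, dense))
	// prints: 50 [m1 m3 m2 m4]
}
```

RRF only looks at rank positions, so BM25 scores and vector similarities never need to be put on a common scale before merging.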

After a successful search, the returned memories receive a Touch update (last_accessed_at, last_recalled_at, recall_count) so that later searches see current recency and recall data.

Dashboard and Operator Surface

The dashboard is built for operators inspecting the running service:

tenant lists and counts come from the tenant and memory repositories
persisted memory rows are rendered from the core repository-backed memory model
retrieval-backed actions still use the memory service, so configured vector/entity extensions affect recall and ranking

Operationally, this means enabling Qdrant or Neo4j does not stop the dashboard from showing memories. Those backends enrich retrieval; they do not become the canonical listing store.
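
The split between canonical listing and extended retrieval can be sketched as follows. All types and method names here (Memory, Repository, VectorStore, Service, ListForDashboard, Search) are hypothetical and chosen for illustration. The sketch only mirrors the behavior described above: listing always reads from the repository, while search uses a vector backend when one is configured.

```go
package main

import (
	"fmt"
	"strings"
)

type Memory struct{ ID, TenantID, Content string }

// Repository stands in for the canonical metadata store.
type Repository interface {
	ListByTenant(tenantID string) []Memory
}

// VectorStore stands in for an optional backend such as Qdrant.
type VectorStore interface {
	Search(tenantID, query string, limit int) []string // ranked memory IDs
}

type memRepo struct{ rows []Memory }

func (r memRepo) ListByTenant(tenantID string) []Memory {
	var out []Memory
	for _, m := range r.rows {
		if m.TenantID == tenantID {
			out = append(out, m)
		}
	}
	return out
}

// fixedVectors is a fake vector backend that returns a fixed ranking.
type fixedVectors []string

func (f fixedVectors) Search(_, _ string, limit int) []string {
	if limit < len(f) {
		return f[:limit]
	}
	return f
}

type Service struct {
	repo    Repository
	vectors VectorStore // nil when no vector backend is configured
}

// ListForDashboard never consults the vector backend.
func (s Service) ListForDashboard(tenantID string) []Memory {
	return s.repo.ListByTenant(tenantID)
}

// Search ranks with the vector backend when present and otherwise uses a
// naive substring match as a stand-in for lexical search. In both cases
// the returned rows come from the repository.
func (s Service) Search(tenantID, query string, limit int) []Memory {
	rows := s.repo.ListByTenant(tenantID)
	var out []Memory
	if s.vectors == nil {
		q := strings.ToLower(query)
		for _, m := range rows {
			if len(out) < limit && strings.Contains(strings.ToLower(m.Content), q) {
				out = append(out, m)
			}
		}
		return out
	}
	byID := make(map[string]Memory, len(rows))
	for _, m := range rows {
		byID[m.ID] = m
	}
	for _, id := range s.vectors.Search(tenantID, query, limit) {
		if m, ok := byID[id]; ok {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	repo := memRepo{rows: []Memory{
		{"a", "t1", "likes tea"},
		{"b", "t1", "likes coffee"},
	}}
	plain := Service{repo: repo}
	extended := Service{repo: repo, vectors: fixedVectors{"b", "a"}}

	// Listing is identical with or without the vector backend.
	fmt.Println(len(plain.ListForDashboard("t1")), len(extended.ListForDashboard("t1")))
	// Search order changes once the backend contributes a ranking.
	fmt.Println(extended.Search("t1", "drink", 10)[0].ID)
}
```

In this arrangement, removing or swapping the vector backend changes ranking only; the rows an operator sees in a listing are unaffected.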

Extension Boundaries

Current extension points are all config-driven:

vector storage: sqlite, qdrant
entity facts: sqlite, neo4j
embeddings: ollama, onnx, lexical, openrouter
importance scoring: heuristic, ollama, openrouter
retrieval scoring: wal, match
parsing and query decomposition: heuristic or LLM-backed providers where enabled

That keeps the application contract stable while letting operators change the retrieval stack underneath it.