Aetheel/docs/memory-system.md
Tanmay Karande ec8bd80a3d first commit
2026-02-13 23:56:09 -05:00

# Aetheel Memory System
> **Date:** 2026-02-13
> **Inspired by:** OpenClaw's `src/memory/` (49 files, 2,300+ LOC manager)
> **Implementation:** ~600 lines of Python across 6 modules
---
## Table of Contents
1. [Overview](#1-overview)
2. [Architecture](#2-architecture)
3. [File Structure](#3-file-structure)
4. [Identity Files](#4-identity-files)
5. [How It Works](#5-how-it-works)
6. [Configuration](#6-configuration)
7. [API Reference](#7-api-reference)
8. [Dependencies](#8-dependencies)
9. [Testing](#9-testing)
10. [OpenClaw Mapping](#10-openclaw-mapping)
---
## 1. Overview
The memory system gives Aetheel **persistent, searchable memory** using a combination of markdown files and SQLite. It follows the same design as OpenClaw's memory architecture:
- **Markdown IS the database** — identity files (`SOUL.md`, `USER.md`, `MEMORY.md`) are human-readable and editable in any text editor or Obsidian
- **Hybrid search** — combines vector similarity (cosine, 0.7 weight) with BM25 keyword search (0.3 weight) for accurate retrieval
- **Fully local** — uses fastembed ONNX embeddings (384-dim), zero API calls
- **Incremental sync** — only re-indexes files that have changed (SHA-256 hash comparison)
- **Session logging** — conversation transcripts stored in `daily/` and indexed for search
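
The incremental sync bullet above can be illustrated with a minimal sketch. It assumes a two-column `files` table (`path`, `hash`) and a helper named `files_needing_reindex`; both are hypothetical stand-ins, and the real schema in `memory/schema.py` may carry more columns.

```python
import hashlib
import sqlite3


def hash_text(text: str) -> str:
    """Return the SHA-256 hex digest of a string (64 hex characters)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_needing_reindex(
    db: sqlite3.Connection, current: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current file contents against the stored hashes.

    `current` maps relative path -> file content.
    Returns (changed_or_new_paths, deleted_paths).
    """
    stored = dict(db.execute("SELECT path, hash FROM files"))
    changed = sorted(
        path for path, text in current.items() if stored.get(path) != hash_text(text)
    )
    deleted = sorted(path for path in stored if path not in current)
    return changed, deleted


# Demo with an in-memory database and an assumed (hypothetical) table layout.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, hash TEXT NOT NULL)")
db.execute("INSERT INTO files VALUES (?, ?)", ("SOUL.md", hash_text("be helpful")))
db.execute("INSERT INTO files VALUES (?, ?)", ("old.md", hash_text("stale")))

changed, deleted = files_needing_reindex(
    db, {"SOUL.md": "be helpful", "USER.md": "name: ..."}
)
print(changed, deleted)
```

Here `SOUL.md` is skipped because its hash matches, `USER.md` is reported as new, and `old.md` is reported as deleted so its stale rows can be removed.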
---
## 2. Architecture
```
           ┌──────────────────────────┐
           │      MemoryManager       │
           │   (memory/manager.py)    │
           ├──────────────────────────┤
           │ • sync()                 │
           │ • search()               │
           │ • log_session()          │
           │ • read/update identity   │
           │ • file watching          │
           └────────────┬─────────────┘
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
┌──────────────┐ ┌─────────────┐ ┌──────────────┐
│  Workspace   │ │   SQLite    │ │  fastembed   │
│ (.md files)  │ │  Database   │ │    (ONNX)    │
├──────────────┤ ├─────────────┤ ├──────────────┤
│ SOUL.md      │ │ files       │ │ bge-small    │
│ USER.md      │ │ chunks      │ │ 384-dim      │
│ MEMORY.md    │ │ chunks_fts  │ │ L2-normalized│
│ memory/      │ │ emb_cache   │ │ local only   │
│ daily/       │ │ session_logs│ │              │
└──────────────┘ └─────────────┘ └──────────────┘
```
### Search Flow
```
      Query: "what are my preferences?"
┌──────────────────┐     ┌──────────────────┐
│  Vector Search   │     │  Keyword Search  │
│  (cosine sim)    │     │  (FTS5 / BM25)   │
│  weight: 0.7     │     │  weight: 0.3     │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
            ┌───────────────┐
            │ Hybrid Merge  │
            │ dedupe by ID  │
            │ sort by score │
            └───────┬───────┘
           Top-N results with
            score ≥ min_score
```
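
The merge step in the diagram can be sketched as follows. This is a minimal illustration, not the code in `memory/hybrid.py`: the function name `merge_hybrid` and the dict-based inputs are assumptions, and a chunk found by only one search method is assumed to contribute a score of 0 for the other.

```python
def merge_hybrid(
    vector_hits: dict[str, float],
    keyword_hits: dict[str, float],
    vector_weight: float = 0.7,
    text_weight: float = 0.3,
    min_score: float = 0.1,
    max_results: int = 10,
) -> list[tuple[str, float]]:
    """Merge two {chunk_id: score} maps into one ranked list."""
    # Dedupe by chunk ID: each ID appears once, whichever search found it.
    chunk_ids = set(vector_hits) | set(keyword_hits)
    scored = [
        (
            chunk_id,
            vector_weight * vector_hits.get(chunk_id, 0.0)
            + text_weight * keyword_hits.get(chunk_id, 0.0),
        )
        for chunk_id in chunk_ids
    ]
    # Drop weak matches, then sort by score (ties broken by ID for stability).
    kept = [item for item in scored if item[1] >= min_score]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:max_results]


results = merge_hybrid(
    vector_hits={"a": 0.8, "b": 0.5, "c": 0.1},
    keyword_hits={"b": 1.0, "d": 0.2},
)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

In this example chunk `b` ranks first because both searches found it, while `c` and `d` fall below `min_score` and are dropped.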
---
## 3. File Structure
### Source Code
```
memory/
├── __init__.py      # Package exports (MemoryManager, MemorySearchResult, MemorySource)
├── types.py         # Data classes: MemoryConfig, MemorySearchResult, MemoryChunk, etc.
├── internal.py      # Utilities: hashing, chunking, file discovery, cosine similarity
├── hybrid.py        # Hybrid search merging (0.7 vector + 0.3 BM25)
├── schema.py        # SQLite schema (files, chunks, FTS5, embedding cache)
├── embeddings.py    # Local fastembed ONNX embeddings (384-dim)
└── manager.py       # Main MemoryManager orchestrator (~400 LOC)
```
### Workspace (Created Automatically)
```
~/.aetheel/workspace/
├── SOUL.md              # Personality & values: "who you are"
├── USER.md              # User profile: "who I am"
├── MEMORY.md            # Long-term memory: decisions, lessons, context
├── memory/              # Additional markdown memory files (optional)
│   └── *.md
└── daily/               # Session logs by date
    ├── 2026-02-13.md
    ├── 2026-02-14.md
    └── ...
```
---
## 4. Identity Files
Inspired by OpenClaw's template system (`docs/reference/templates/SOUL.md`).
### SOUL.md — Who You Are
The agent's personality, values, and behavioral guidelines. The file is created automatically with defaults covering:
- Core truths (be helpful, have opinions, be resourceful)
- Boundaries (privacy, external actions)
- Continuity rules (files ARE the memory)
### USER.md — Who I Am
The user's profile — name, role, timezone, preferences, current focus, tools. Fill this in to personalize the agent.
### MEMORY.md — Long-Term Memory
Persistent decisions, lessons learned, and context that carries across sessions. The agent appends entries with timestamps:
```markdown
### [2026-02-13 12:48]
Learned that the user prefers concise responses with code examples.
```
---
## 5. How It Works
### Sync (`await manager.sync()`)
1. **Discover files** — scans `SOUL.md`, `USER.md`, `MEMORY.md`, `memory/*.md`
2. **Check hashes** — compares SHA-256 content hash against stored hash in `files` table
3. **Skip unchanged** — files with matching hashes are skipped (incremental sync)
4. **Chunk** — splits changed files into overlapping text chunks (~512 tokens, 50 token overlap)
5. **Embed** — generates 384-dim vectors via fastembed (checks embedding cache first)
6. **Store** — inserts chunks + embeddings into SQLite, updates FTS5 index
7. **Clean** — removes stale entries for deleted files
8. **Sessions** — repeats for `daily/*.md` session log files
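
The chunking step (step 4) can be sketched as below. This is an illustrative approximation rather than the actual `chunk_markdown()` in `memory/internal.py`: it assumes about 4 characters per token (the ratio implied by the configuration comments), splits on line boundaries, and uses a hypothetical `Chunk` dataclass.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # assumed heuristic: 512 tokens is roughly 2048 characters


@dataclass
class Chunk:
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_markdown(text: str, chunk_tokens: int = 512, chunk_overlap: int = 50) -> list[Chunk]:
    """Split text into line-aligned chunks that overlap by a few trailing lines."""
    max_chars = max(1, chunk_tokens * CHARS_PER_TOKEN)
    overlap_chars = max(0, chunk_overlap * CHARS_PER_TOKEN)
    chunks: list[Chunk] = []
    current: list[tuple[int, str]] = []  # (line number, line text)
    size = 0

    def flush() -> None:
        if current:
            body = "\n".join(line_text for _, line_text in current)
            chunks.append(Chunk(current[0][0], current[-1][0], body))

    for number, line in enumerate(text.split("\n"), start=1):
        cost = len(line) + 1  # +1 for the newline
        if current and size + cost > max_chars:
            flush()
            # Carry the last few lines over so neighbouring chunks overlap.
            kept: list[tuple[int, str]] = []
            kept_size = 0
            for entry in reversed(current):
                entry_cost = len(entry[1]) + 1
                if kept_size + entry_cost > overlap_chars:
                    break
                kept.insert(0, entry)
                kept_size += entry_cost
            current, size = kept, kept_size
        current.append((number, line))
        size += cost
    flush()
    return chunks


# Tiny budget so the overlap is visible: 20-char chunks, 8-char overlap.
sample = "\n".join(f"line {i}" for i in range(1, 11))
chunks = chunk_markdown(sample, chunk_tokens=5, chunk_overlap=2)
for chunk in chunks[:3]:
    print(chunk.start_line, chunk.end_line, repr(chunk.text))
```

Each chunk records its start and end line so that search results can point back to the exact location in the source file.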
### Search (`await manager.search("query")`)
1. **Auto-sync** — triggers sync if workspace is dirty (configurable)
2. **Keyword search** — runs FTS5 `MATCH` query with BM25 ranking
3. **Vector search** — embeds query, computes cosine similarity against all chunk embeddings
4. **Hybrid merge** — combines results: `score = 0.7 × vector + 0.3 × keyword`
5. **Deduplicate** — merges chunks found by both methods (by chunk ID)
6. **Filter & rank** — removes results below `min_score`, returns top-N sorted by score
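
The vector search step (step 3) amounts to a brute-force scan. The sketch below assumes embeddings are stored as JSON text next to each chunk ID; the function names and the 3-dimensional toy vectors are illustrative stand-ins for the real 384-dimensional embeddings.

```python
import json
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 for empty or mismatched input."""
    if len(a) != len(b) or not a:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(
    query_vec: list[float], rows: list[tuple[str, str]], limit: int = 10
) -> list[tuple[str, float]]:
    """Score every (chunk_id, embedding_json) row against the query vector."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, json.loads(blob)))
        for chunk_id, blob in rows
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Toy rows standing in for chunk rows read from SQLite.
rows = [
    ("soul-1", json.dumps([1.0, 0.0, 0.0])),
    ("user-1", json.dumps([0.6, 0.8, 0.0])),
    ("mem-1", json.dumps([0.0, 0.0, 1.0])),
]
top = vector_search([1.0, 0.0, 0.0], rows, limit=2)
print(top)
```

A linear scan like this costs one similarity computation per stored chunk, which is reasonable for a personal workspace of a few hundred chunks.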
### Session Logging (`manager.log_session(content)`)
1. Creates/appends to `daily/YYYY-MM-DD.md`
2. Adds timestamped entry with channel label
3. Marks index as dirty for next sync
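
The three steps above can be sketched with a small class. The class name `SessionLogger`, the entry layout, and the `now` parameter (added so the demo is reproducible) are assumptions; the real `log_session()` may format entries differently.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


class SessionLogger:
    """Appends timestamped entries to daily/YYYY-MM-DD.md (hypothetical sketch)."""

    def __init__(self, sessions_dir: Path) -> None:
        self.sessions_dir = Path(sessions_dir)
        self.dirty = False  # set when the index needs a re-sync

    def log_session(
        self, content: str, channel: str = "cli", now: Optional[datetime] = None
    ) -> Path:
        now = now or datetime.now()
        self.sessions_dir.mkdir(parents=True, exist_ok=True)
        # 1. Create or append to the file for today's date.
        path = self.sessions_dir / f"{now:%Y-%m-%d}.md"
        # 2. Add a timestamped entry with a channel label (assumed layout).
        entry = f"### [{now:%H:%M}] ({channel})\n{content.strip()}\n\n"
        with path.open("a", encoding="utf-8") as handle:
            handle.write(entry)
        # 3. Mark the index dirty so the next sync picks up the change.
        self.dirty = True
        return path


with tempfile.TemporaryDirectory() as tmp:
    logger = SessionLogger(Path(tmp) / "daily")
    when = datetime(2026, 2, 13, 12, 48)
    path = logger.log_session("User: hi\nAssistant: hello", channel="slack", now=when)
    logger.log_session("User: bye", channel="slack", now=when)
    logged = path.read_text(encoding="utf-8")
    print(path.name)
    print(logged)
```

Both calls land in the same daily file because they share a date, which keeps one session log per day.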
---
## 6. Configuration
```python
from memory.types import MemoryConfig

config = MemoryConfig(
    # Workspace directory containing identity files
    workspace_dir="~/.aetheel/workspace",
    # SQLite database path
    db_path="~/.aetheel/memory.db",
    # Chunking parameters
    chunk_tokens=512,         # ~2048 characters per chunk
    chunk_overlap=50,         # ~200 character overlap between chunks
    # Search parameters
    max_results=10,           # maximum results per search
    min_score=0.1,            # minimum hybrid score threshold
    vector_weight=0.7,        # weight for vector similarity
    text_weight=0.3,          # weight for BM25 keyword score
    # Embedding model (local ONNX)
    embedding_model="BAAI/bge-small-en-v1.5",
    embedding_dims=384,
    # Sync behavior
    watch=True,               # enable file watching via watchdog
    watch_debounce_ms=2000,   # debounce file change events
    sync_on_search=True,      # auto-sync before search if dirty
    # Session logs directory (defaults to workspace_dir/daily/)
    sessions_dir=None,
    # Sources to index
    sources=["memory", "sessions"],
)
```
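
How the path settings might be resolved is sketched below, assuming that `~` is expanded and that `sessions_dir=None` falls back to `workspace_dir/daily/` as the comment above states. `PathSettings` and `resolve_paths` are hypothetical names covering only a subset of `MemoryConfig`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PathSettings:
    """Hypothetical subset of MemoryConfig: only the path-related fields."""

    workspace_dir: str = "~/.aetheel/workspace"
    db_path: str = "~/.aetheel/memory.db"
    sessions_dir: Optional[str] = None


def resolve_paths(cfg: PathSettings) -> dict[str, Path]:
    """Expand '~' and apply the sessions_dir default."""
    workspace = Path(cfg.workspace_dir).expanduser()
    if cfg.sessions_dir:
        sessions = Path(cfg.sessions_dir).expanduser()
    else:
        sessions = workspace / "daily"  # default: workspace_dir/daily/
    return {
        "workspace_dir": workspace,
        "db_path": Path(cfg.db_path).expanduser(),
        "sessions_dir": sessions,
    }


# Explicit demo paths keep the output independent of the current user's home.
paths = resolve_paths(
    PathSettings(
        workspace_dir="/tmp/aetheel-demo/workspace",
        db_path="/tmp/aetheel-demo/memory.db",
    )
)
custom = resolve_paths(
    PathSettings(
        workspace_dir="/tmp/aetheel-demo/workspace",
        sessions_dir="/tmp/aetheel-demo/logs",
    )
)
print(paths["sessions_dir"])
print(custom["sessions_dir"])
```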
---
## 7. API Reference
### `MemoryManager`
```python
from memory import MemoryManager
from memory.types import MemoryConfig

# NOTE: sync() and search() are coroutines, so the `await` calls below
# must run inside async code (for example a function run with asyncio.run()).

# Create with custom config (or defaults)
mgr = MemoryManager(config=MemoryConfig(...))

# Sync workspace → index
stats = await mgr.sync(force=False)
# Returns: {"files_found": 4, "files_indexed": 4, "chunks_created": 5, ...}

# Hybrid search
results = await mgr.search("what are my preferences?", max_results=5, min_score=0.1)
# Returns: list[MemorySearchResult]
#   .path        relative file path (e.g., "USER.md")
#   .start_line  chunk start line
#   .end_line    chunk end line
#   .score       hybrid score (0.0 - 1.0)
#   .snippet     text snippet (max 700 chars)
#   .source      MemorySource.MEMORY or MemorySource.SESSIONS

# Identity files
soul = mgr.read_soul()                             # Read SOUL.md
user = mgr.read_user()                             # Read USER.md
memory = mgr.read_long_term_memory()               # Read MEMORY.md
mgr.append_to_memory("learned X")                  # Append timestamped entry to MEMORY.md
mgr.update_identity_file("USER.md", new_content)   # Overwrite a file

# Session logging
path = mgr.log_session("User: hi\nAssistant: hello", channel="slack")

# File reading
data = mgr.read_file("SOUL.md", from_line=1, num_lines=10)

# Status
status = mgr.status()
# Returns: {"files": 5, "chunks": 5, "cached_embeddings": 4, ...}

# File watching
mgr.start_watching()   # auto-mark dirty on workspace changes
mgr.stop_watching()

# Cleanup
mgr.close()
```
### `MemorySearchResult`
```python
from __future__ import annotations  # defer annotation evaluation

from dataclasses import dataclass


@dataclass
class MemorySearchResult:
    path: str                    # Relative path to the markdown file
    start_line: int              # First line of the matching chunk
    end_line: int                # Last line of the matching chunk
    score: float                 # Hybrid score (0.0 - 1.0)
    snippet: str                 # Text snippet (max 700 characters)
    source: MemorySource         # "memory" or "sessions" (enum from memory/types.py)
    citation: str | None = None
```
---
## 8. Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| `fastembed` | 0.7.4 | Local ONNX embeddings (BAAI/bge-small-en-v1.5, 384-dim) |
| `watchdog` | 6.0.0 | File system watching for auto re-indexing |
| `sqlite3` | (stdlib) | Database engine with FTS5 full-text search |

Added to `pyproject.toml`:
```toml
dependencies = [
"fastembed>=0.7.4",
"watchdog>=6.0.0",
# ... existing deps
]
```
---
## 9. Testing
Run the smoke test:
```bash
uv run python test_memory.py
```
### Test Results (2026-02-13)
| Test | Result |
|------|--------|
| `hash_text()` | ✅ SHA-256 produces 64-char hex string |
| `chunk_markdown()` | ✅ Splits text into overlapping chunks with correct line numbers |
| Identity file creation | ✅ SOUL.md (793 chars), USER.md (417 chars), MEMORY.md (324 chars) |
| Append to MEMORY.md | ✅ Content grows with timestamped entry |
| Session logging | ✅ Creates `daily/2026-02-13.md` with channel + timestamp |
| Sync (first run) | ✅ 4 files found, 4 indexed, 5 chunks, 1 session |
| Search "personality values" | ✅ 5 results — top: SOUL.md (score 0.595) |
| Search "preferences" | ✅ 5 results — top: USER.md (score 0.583) |
| FTS5 keyword search | ✅ Available |
| Embedding cache | ✅ 4 entries cached (skip re-computation on next sync) |
| Status report | ✅ All fields populated correctly |
---
## 10. OpenClaw Mapping
How our Python implementation maps to OpenClaw's TypeScript source:

| OpenClaw File | Aetheel File | Description |
|---------------|-------------|-------------|
| `src/memory/types.ts` | `memory/types.py` | Core types (MemorySearchResult, MemorySource, etc.) |
| `src/memory/internal.ts` | `memory/internal.py` | hashText, chunkMarkdown, listMemoryFiles, cosineSimilarity |
| `src/memory/hybrid.ts` | `memory/hybrid.py` | buildFtsQuery, bm25RankToScore, mergeHybridResults |
| `src/memory/memory-schema.ts` | `memory/schema.py` | ensureMemoryIndexSchema → ensure_schema |
| `src/memory/embeddings.ts` | `memory/embeddings.py` | createEmbeddingProvider → embed_query/embed_batch (fastembed) |
| `src/memory/manager.ts` (2,300 LOC) | `memory/manager.py` (~400 LOC) | MemoryIndexManager → MemoryManager |
| `src/memory/sync-memory-files.ts` | Inlined in `manager.py` | syncMemoryFiles → _run_sync |
| `src/memory/session-files.ts` | Inlined in `manager.py` | buildSessionEntry → _sync_session_files |
| `docs/reference/templates/SOUL.md` | Auto-created by manager | Default identity file templates |
### Key Simplifications vs. OpenClaw
| Feature | OpenClaw | Aetheel |
|---------|----------|---------|
| **Embedding providers** | OpenAI, Voyage, Gemini, local ONNX (4 providers) | fastembed only (local ONNX, zero API calls) |
| **Vector storage** | sqlite-vec extension (C library) | JSON-serialized in chunks table (pure Python) |
| **File watching** | chokidar (Node.js) | watchdog (Python) |
| **Batch embedding** | OpenAI/Voyage batch APIs, concurrency pools | fastembed batch (single-threaded, local) |
| **Config system** | JSON5 + TypeBox + Zod schemas (100k+ LOC) | Simple Python dataclass |
| **Codebase** | 49 files, 2,300+ LOC manager alone | 6 files, ~600 LOC total |
### What We Kept
- ✅ Same identity file pattern (SOUL.md, USER.md, MEMORY.md)
- ✅ Same hybrid search algorithm (0.7 vector + 0.3 BM25)
- ✅ Same chunking approach (token-based with overlap)
- ✅ Same incremental sync (hash-based change detection)
- ✅ Same FTS5 full-text search with BM25 ranking
- ✅ Same embedding cache (avoids re-computing unchanged chunks)
- ✅ Same session log pattern (daily/ directory)
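
The embedding cache from the list above can be sketched as a lookup keyed by content hash. This version keeps the cache in an in-memory dict rather than the `emb_cache` SQLite table, and `fake_embed_batch` is a stand-in for the fastembed call; both are simplifying assumptions.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Caches embeddings by the SHA-256 hash of the chunk text (illustrative sketch)."""

    def __init__(self, embed_batch: Callable[[list[str]], list[list[float]]]) -> None:
        self._embed_batch = embed_batch
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # number of texts that actually had to be embedded

    def embed(self, texts: list[str]) -> list[list[float]]:
        keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
        # Collect uncached texts once each, preserving order.
        missing: dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._cache and key not in missing:
                missing[key] = text
        if missing:
            vectors = self._embed_batch(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._cache[key] = vector
            self.misses += len(missing)
        return [self._cache[key] for key in keys]


def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in embedder: NOT a real model, just deterministic toy vectors."""
    return [[float(len(t)), 1.0] for t in texts]


cache = EmbeddingCache(fake_embed_batch)
first = cache.embed(["alpha", "beta"])    # two misses
second = cache.embed(["alpha", "gamma"])  # one hit, one miss
print(cache.misses)
```

Only the text that was not seen before reaches the embedder on the second call, which is what lets a re-sync skip unchanged chunks.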
---
*This memory system is Phase 1 of the Aetheel build process as outlined in `openclaw-analysis.md`.*