# Mnemosyne Monitoring System - Comprehensive Guide

**Version:** 1.0.0
**Last Updated:** November 6, 2025
**Author:** Mnemosyne AI + Technicus Spiriticus

---

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Installation & Setup](#installation--setup)
4. [Configuration](#configuration)
5. [Usage](#usage)
6. [Components](#components)
7. [Troubleshooting](#troubleshooting)
8. [Advanced Topics](#advanced-topics)
9. [FAQ](#faq)

---

## Overview

The Mnemosyne Monitoring System is a comprehensive, self-healing infrastructure designed to ensure the OpenMemory service remains operational with minimal human intervention. It employs a dual-watchdog architecture and leverages local LLM analysis for intelligent diagnostics.

### Key Features

- ✅ **Dual-Watchdog Architecture** - Primary monitor watches services, secondary monitor watches the primary
- 🤖 **LLM-Powered Diagnostics** - Local Ollama instance analyzes logs and suggests fixes
- 🔄 **Auto-Restart Logic** - Intelligent service recovery with circuit breaker pattern
- 📊 **Hierarchical Health Checks** - Fast/Medium/Deep checks at different intervals
- 📝 **Structured JSON Logging** - Machine-readable logs for easy analysis
- ⚙️ **systemd Integration** - Optional automatic startup on boot
- 🛡️ **Circuit Breaker** - Prevents restart loops from cascading failures

### System Requirements

- **OS:** Linux (tested on Arch Linux)
- **Shell:** Bash 4.0+
- **Node.js:** v14+ with npm
- **Ollama:** For LLM analysis (optional but recommended)
- **Tools:** `curl`, `jq`, `lsof`, `sqlite3`

---

## Architecture

### Component Hierarchy

```
┌─────────────────────────────────────────────────┐
│             User / systemd / cron               │
└────────────────┬────────────────────────────────┘
                 │
                 ├──> init-mnemosyne.sh
                 │    (Validates environment, starts services)
                 │
                 ├──> health-monitor.sh (Primary Watchdog)
                 │    │
                 │    ├─> Fast Checks (30s): Port listening? Process alive?
                 │    ├─> Medium Checks (5m): HTTP health endpoint OK?
                 │    └─> Deep Checks (30m): Database integrity? Memory usage?
                 │
                 ├──> watchdog-watcher.sh (Secondary Watchdog)
                 │    │
                 │    ├─> Monitors health-monitor.sh
                 │    ├─> Checks log activity
                 │    └─> Restarts primary if stale
                 │
                 └──> llm-analyzer.sh
                      │
                      ├─> Pattern Analysis: Identify failure patterns
                      ├─> Performance Analysis: Detect degradation
                      └─> Diagnostic Reports: Generate comprehensive reports
```

### Data Flow

1. **Initialization** → `init-mnemosyne.sh` validates dependencies and starts services
2. **Monitoring** → `health-monitor.sh` performs checks → writes JSON to `health.log`
3. **Oversight** → `watchdog-watcher.sh` monitors health.log for activity
4. **Analysis** → `llm-analyzer.sh` uses Ollama to analyze patterns
5. **Recovery** → Auto-restart with exponential backoff and circuit breaker

### State Management

All state is stored in `monitoring/state/state.json`:

```json
{
  "initialized_at": "2025-11-06T00:00:00Z",
  "services": {
    "openmemory": {
      "status": "healthy",
      "pid": 12345,
      "restarts": 0,
      "last_restart": null
    }
  },
  "circuit_breaker": {
    "state": "closed",
    "failure_count": 0
  },
  "monitoring": {
    "primary_watchdog": {"status": "running", "pid": 12346},
    "secondary_watchdog": {"status": "running", "pid": 12347}
  }
}
```

---

## Installation & Setup

### Quick Start (5 minutes)

```bash
# 1. Navigate to project root
cd /home/technicus/Projects/claude-code/openmemory

# 2. Run initialization script
./init-mnemosyne.sh

# 3. Start monitoring (in separate terminals or tmux/screen)
./monitoring/health-monitor.sh &
./monitoring/watchdog-watcher.sh &

# 4. Verify everything is running
curl http://localhost:7070/health
tail -f monitoring/logs/health.log
```
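If the backend takes a few seconds to come up, step 4 can be gated on the health endpoint instead of being run immediately. A minimal polling sketch, assuming only the documented `/health` endpoint and port; the 30-attempt ceiling is an arbitrary choice:

```bash
# Poll /health until it answers, or give up after ~30 seconds (hypothetical timeout)
for _ in $(seq 1 30); do
    if curl -sf http://localhost:7070/health > /dev/null; then
        echo "OpenMemory is healthy"
        break
    fi
    sleep 1
done
```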
### Full Installation

#### Step 1: Verify Dependencies

```bash
# Check required commands
for cmd in node npm curl jq lsof sqlite3; do
  command -v "$cmd" || echo "Missing: $cmd"
done

# Install missing dependencies (Arch Linux example)
sudo pacman -S nodejs npm curl jq lsof sqlite

# Install Ollama (for LLM analysis)
curl https://ollama.ai/install.sh | sh
ollama pull llama3.2:3b
```

#### Step 2: Configure Paths

Edit `monitoring/config/mnemosyne.conf` to match your environment:

```bash
export MNEMOSYNE_ROOT="/home/technicus/Projects/claude-code/openmemory"
export MNEMOSYNE_PORT="7070"
export OLLAMA_MODEL="llama3.2:3b"
```

#### Step 3: Run Initialization

```bash
chmod +x init-mnemosyne.sh
chmod +x monitoring/*.sh
./init-mnemosyne.sh
```

**Expected Output:**

```
═══════════════════════════════════════════════════════
  MNEMOSYNE INITIALIZATION
═══════════════════════════════════════════════════════
ℹ️  Configuration loaded from: monitoring/config/mnemosyne.conf
✅ Command available: node
✅ Command available: npm
✅ All dependencies validated successfully
═══════════════════════════════════════════════════════
  SERVICE STARTUP
═══════════════════════════════════════════════════════
✅ Ollama service started successfully
✅ OpenMemory service started successfully (PID: 12345)
═══════════════════════════════════════════════════════
  INITIALIZATION COMPLETE
═══════════════════════════════════════════════════════
```

#### Step 4: Start Monitoring

**Option A: Manual (for testing)**

```bash
# Terminal 1
./monitoring/health-monitor.sh

# Terminal 2
./monitoring/watchdog-watcher.sh
```

**Option B: Background processes**

```bash
./monitoring/health-monitor.sh &> /dev/null &
./monitoring/watchdog-watcher.sh &> /dev/null &
```

**Option C: systemd (automatic startup on boot)**

```bash
# Copy service files
sudo cp monitoring/systemd/*.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable services
sudo systemctl enable openmemory.service
sudo systemctl enable mnemosyne-monitor.service
sudo systemctl enable mnemosyne-watcher.service

# Start services
sudo systemctl start openmemory.service
sudo systemctl start mnemosyne-monitor.service
sudo systemctl start mnemosyne-watcher.service

# Check status
sudo systemctl status mnemosyne-monitor.service
```

---

## Configuration

### Configuration File Structure

Location: `monitoring/config/mnemosyne.conf`

This file is sourced by all scripts and contains all configurable parameters.
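Since every script loads the same file, they can all share one `source` pattern. A minimal sketch of that pattern; the fallback default and the error handling here are illustrative assumptions, not code from the shipped scripts:

```bash
# Hypothetical config loader for any monitoring script
MNEMOSYNE_ROOT="${MNEMOSYNE_ROOT:-$HOME/Projects/claude-code/openmemory}"  # assumed default
CONFIG_FILE="$MNEMOSYNE_ROOT/monitoring/config/mnemosyne.conf"

if [[ -f "$CONFIG_FILE" ]]; then
    source "$CONFIG_FILE"
else
    echo "Config not found: $CONFIG_FILE" >&2
    exit 1
fi
```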
#### Service Configuration

```bash
export MNEMOSYNE_PORT="7070"             # OpenMemory port
export OLLAMA_PORT="11434"               # Ollama port
export OLLAMA_MODEL="llama3.2:3b"        # Model for analysis
```

#### Health Check Intervals

```bash
export CHECK_INTERVAL_FAST=30            # Fast checks (seconds)
export CHECK_INTERVAL_MEDIUM=300         # Medium checks (seconds)
export CHECK_INTERVAL_DEEP=1800          # Deep checks (seconds)
export WATCHDOG_INTERVAL=120             # Secondary watchdog interval
```

#### Restart Policy

```bash
export MAX_RESTART_ATTEMPTS=3            # Max restarts before circuit breaker
export RESTART_WINDOW=300                # Time window for restart counting (5min)
export BACKOFF_MULTIPLIER=2              # Exponential backoff multiplier
```

#### Circuit Breaker

```bash
export CIRCUIT_FAILURE_THRESHOLD=3       # Failures before opening circuit
export CIRCUIT_RESET_TIMEOUT=600         # Time before reset (10min)
```

#### LLM Analysis

```bash
export LLM_ANALYSIS_ENABLED=true         # Enable LLM diagnostics
export LLM_CONTEXT_LINES=50              # Log lines to analyze
export LLM_CONFIDENCE_THRESHOLD=0.7      # Minimum confidence for action
```

### Customization Examples

#### Example 1: More Aggressive Monitoring

For critical production systems:

```bash
export CHECK_INTERVAL_FAST=15            # Check every 15 seconds
export WATCHDOG_INTERVAL=60              # Watch every minute
export MAX_RESTART_ATTEMPTS=5            # Allow more restart attempts
export CIRCUIT_RESET_TIMEOUT=300         # Faster circuit reset (5min)
```

#### Example 2: Development Environment

For local development with minimal overhead:

```bash
export CHECK_INTERVAL_FAST=60            # Less frequent checks
export WATCHDOG_INTERVAL=300             # 5-minute oversight
export LLM_ANALYSIS_ENABLED=false        # Disable LLM to save resources
```

---

## Usage

### Basic Operations

#### Check System Status

```bash
# Check if services are running
curl http://localhost:7070/health | jq '.'

# View recent health checks
tail -20 monitoring/logs/health.log | jq '.'

# Check system state
cat monitoring/state/state.json | jq '.'
```

#### Manual Service Control

```bash
# Stop monitoring
pkill -f health-monitor.sh
pkill -f watchdog-watcher.sh

# Restart OpenMemory
kill $(cat monitoring/state/openmemory.pid)
# (health-monitor.sh will auto-restart it)

# Full restart
./init-mnemosyne.sh
```

#### Run LLM Analysis

```bash
# Analyze failure patterns
./monitoring/llm-analyzer.sh pattern

# Analyze performance
./monitoring/llm-analyzer.sh performance

# Generate full diagnostic report
./monitoring/llm-analyzer.sh report
```

### Log Analysis

#### View Real-Time Health Checks

```bash
# Watch health checks as they happen
tail -f monitoring/logs/health.log | jq '.'
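# A filtered variant of the same stream (an added example): show only
# non-healthy events, relying on the JSON schema in the example output below
tail -f monitoring/logs/health.log | jq 'select(.status != "healthy")'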
```

**Example Output:**

```json
{
  "timestamp": "2025-11-06T01:30:00Z",
  "service": "openmemory",
  "status": "healthy",
  "check_level": "fast",
  "response_time_ms": 45,
  "details": "Process alive, port listening",
  "monitor_pid": 12346
}
```

#### Query Logs with jq

```bash
# Count health status types
cat monitoring/logs/health.log | jq -r '.status' | sort | uniq -c

# Find all failures
cat monitoring/logs/health.log | jq 'select(.status == "failing")'

# Calculate average response time
cat monitoring/logs/health.log | jq -s 'map(.response_time_ms) | add / length'

# Find checks from last hour
cat monitoring/logs/health.log | jq --arg time "$(date -d '1 hour ago' -Iseconds)" \
  'select(.timestamp > $time)'
```

### Metrics Monitoring

```bash
# View memory usage over time
cat monitoring/state/metrics.json | jq 'select(.metric == "openmemory_memory")'

# View response time trends
cat monitoring/state/metrics.json | jq 'select(.metric == "openmemory_response_time")'
```

---

## Components

### 1. init-mnemosyne.sh

**Purpose:** Validates environment and starts all services

**Key Functions:**

- `validate_dependencies()` - Checks commands, paths, environment variables
- `validate_node_environment()` - Verifies Node.js setup
- `start_ollama()` - Starts Ollama service
- `start_openmemory()` - Starts OpenMemory backend
- `initialize_state()` - Creates initial state.json

**Exit Codes:**

- `0` - Success
- `1` - Dependency validation or service startup failure

**Example Usage:**

```bash
# Standard startup
./init-mnemosyne.sh

# With verbose output
./init-mnemosyne.sh 2>&1 | tee init.log
```

### 2. health-monitor.sh (Primary Watchdog)

**Purpose:** Continuously monitors service health

**Check Hierarchy:**

| Level | Interval | Checks | Purpose |
|-------|----------|--------|---------|
| Fast | 30s | Port listening, process alive | Quick responsiveness detection |
| Medium | 5min | HTTP health endpoint | API availability |
| Deep | 30min | Database integrity, memory usage | System health |

**Key Functions:**

- `check_openmemory_fast()` - Process and port checks
- `check_openmemory_medium()` - HTTP endpoint check
- `check_openmemory_deep()` - Database and resource checks
- `restart_service()` - Auto-restart with circuit breaker
- `perform_health_checks()` - Main monitoring loop

**Log Format:**

```json
{
  "timestamp": "2025-11-06T01:30:00Z",
  "service": "openmemory",
  "status": "healthy|degraded|failing|down",
  "check_level": "fast|medium|deep",
  "response_time_ms": 45,
  "details": "Descriptive message",
  "monitor_pid": 12346
}
```

**Status Meanings:**

- `healthy` - All checks passed
- `degraded` - Service responsive but issues detected (high memory, slow response)
- `failing` - Service not responding correctly
- `down` - Process not running or port not listening

### 3. watchdog-watcher.sh (Secondary Watchdog)

**Purpose:** Monitors the primary watchdog to prevent a single point of failure

**Key Functions:**

- `check_primary_watchdog()` - Verify health-monitor.sh is running
- `check_log_activity()` - Ensure logs are being written
- `check_last_log_timestamp()` - Detect stale logs
- `restart_primary_watchdog()` - Restart health-monitor.sh if needed
- `trigger_llm_analysis()` - Invoke LLM analyzer on issues

**Diagnostic Checks:**

1. **Process Check** - Is the health-monitor.sh process running?
2. **Log Activity** - Has health.log grown since the last check?
3. **Timestamp Check** - Is the last log entry recent? (sketched below)
4. **Failure Analysis** - Are there too many failures?
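A minimal sketch of check 3 (`check_last_log_timestamp`), assuming GNU `date` and the health.log JSON schema shown above; the 90-second threshold is illustrative, not a value taken from the shipped script:

```bash
# Hypothetical staleness check: compare the last health.log entry to now
check_last_log_timestamp() {
    local last_ts age
    last_ts=$(tail -1 monitoring/logs/health.log | jq -r '.timestamp')
    age=$(( $(date +%s) - $(date -d "$last_ts" +%s) ))

    if (( age > 90 )); then              # threshold is an assumption
        echo "Last health check was ${age}s ago (stale)"
        return 1
    fi
    echo "Last health check was ${age}s ago (within threshold)"
    return 0
}
```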
**Example Log:**

```
[2025-11-06T01:30:00Z] [INFO] Primary watchdog process (PID 12346) is alive
[2025-11-06T01:30:00Z] [INFO] Health log is active (grew by 245 bytes)
[2025-11-06T01:30:00Z] [INFO] Last health check was 28s ago (within threshold)
[2025-11-06T01:30:00Z] [INFO] All checks passed
```

### 4. llm-analyzer.sh (LLM Diagnostics)

**Purpose:** Uses a local Ollama LLM to analyze logs and suggest fixes

**Modes:**

1. **Pattern Analysis** (`./llm-analyzer.sh pattern`)
   - Analyzes failure patterns
   - Identifies root causes
   - Suggests specific remediation

2. **Performance Analysis** (`./llm-analyzer.sh performance`)
   - Reviews metrics (response time, memory, etc.)
   - Detects degradation trends
   - Recommends optimizations

3. **Full Report** (`./llm-analyzer.sh report`)
   - Combines pattern + performance analysis
   - Includes system state dump
   - Saves comprehensive report to file

**Example Output:**

```
--- FAILURE PATTERN ANALYSIS ---

Based on the health check logs, I've identified a recurring pattern where
the OpenMemory service experiences brief connection failures approximately
every 30 minutes. This correlates with the database integrity checks (deep level).

Root Cause: The SQLite database may be experiencing lock contention during
integrity checks, causing temporary unresponsiveness.

Suggested Fix:
1. Move deep checks to lower-traffic periods
2. Increase CHECK_INTERVAL_DEEP to 3600s (1 hour)
3. Consider implementing connection pooling

Confidence: 0.85
```

**LLM Configuration:**

The analyzer uses Ollama with these parameters:

- Model: `llama3.2:3b` (fast, efficient)
- Temperature: `0.3` (focused, deterministic)
- Max tokens: `500` (concise responses)

---

## Troubleshooting

### Common Issues

#### Issue 1: "OpenMemory service failed to start"

**Symptoms:**

```
❌ OpenMemory service failed to start. Cannot continue.
```

**Diagnosis:**

```bash
# Check if port is already in use
lsof -i :7070

# Check backend logs
tail -50 monitoring/logs/openmemory.log

# Verify Node.js environment
cd backend && npm install
```

**Solutions:**

1. **Port conflict:**

   ```bash
   # Change port in config
   vim monitoring/config/mnemosyne.conf
   # Set: export MNEMOSYNE_PORT="7071"
   ```

2. **Missing dependencies:**

   ```bash
   cd backend
   npm install
   ```

3. **Database corruption:**

   ```bash
   # Backup database
   cp backend/data/openmemory.sqlite backend/data/openmemory.sqlite.backup

   # Test database
   sqlite3 backend/data/openmemory.sqlite "PRAGMA integrity_check;"
   ```

#### Issue 2: "Circuit breaker tripped"

**Symptoms:**

```
[2025-11-06T01:30:00Z] Circuit breaker tripped
[2025-11-06T01:30:05Z] Circuit breaker open, not restarting
```

**Meaning:** The service failed 3+ times within 5 minutes, so the circuit breaker opened to prevent a restart loop.

**Diagnosis:**

```bash
# Check recent restart history
cat monitoring/state/state.json | jq '.services.openmemory.restarts'

# Review failure logs
tail -100 monitoring/logs/health.log | jq 'select(.status != "healthy")'

# Run LLM analysis
./monitoring/llm-analyzer.sh pattern
```

**Solutions:**

1. **Wait for automatic reset:** The circuit breaker resets after 10 minutes
2. **Manual reset** (by hand, or with the jq one-liner shown after this list):

   ```bash
   # Edit state file
   vim monitoring/state/state.json
   # Change circuit_breaker.state from "open" to "closed"

   # Or fully restart
   pkill -f health-monitor
   ./monitoring/health-monitor.sh &
   ```

3. **Fix underlying issue** using LLM analysis recommendations
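For step 2, a scripted alternative to hand-editing the state file; a minimal sketch, assuming the `state.json` schema shown under State Management:

```bash
# Hypothetical scripted reset: flip the breaker to closed and zero the counter
jq '.circuit_breaker.state = "closed" | .circuit_breaker.failure_count = 0' \
  monitoring/state/state.json > /tmp/state.json.new \
  && mv /tmp/state.json.new monitoring/state/state.json
```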
#### Issue 3: "Health log has not changed since last check"

**Symptoms:**

```
[2025-11-06T01:30:00Z] [WARNING] Health log has not changed since last check
```

**Meaning:** The primary watchdog may be stuck or not running.

**Diagnosis:**

```bash
# Check if health-monitor is running
ps aux | grep health-monitor.sh

# Check for errors
tail -50 monitoring/logs/watchdog.log
```

**Solutions:**

1. **Auto-recovery:** The secondary watchdog will restart the primary automatically
2. **Manual restart:**

   ```bash
   pkill -f health-monitor.sh
   ./monitoring/health-monitor.sh &
   ```

#### Issue 4: "Ollama service failed to start"

**Symptoms:**

```
⚠️ Ollama service failed to start. LLM analysis will be disabled.
```

**Impact:** The system still functions, but LLM diagnostics are unavailable.

**Diagnosis:**

```bash
# Check if Ollama is installed
command -v ollama

# Try starting manually
ollama serve

# Check if model is available
ollama list | grep llama3.2:3b
```

**Solutions:**

1. **Install Ollama:**

   ```bash
   curl https://ollama.ai/install.sh | sh
   ```

2. **Pull model:**

   ```bash
   ollama pull llama3.2:3b
   ```

3. **Disable LLM analysis** (optional):

   ```bash
   vim monitoring/config/mnemosyne.conf
   # Set: export LLM_ANALYSIS_ENABLED=false
   ```

### Advanced Diagnostics

#### Full System Health Check

```bash
#!/bin/bash
# comprehensive-check.sh

echo "=== SERVICE STATUS ==="
curl -s http://localhost:7070/health | jq '.'

echo -e "\n=== PROCESS STATUS ==="
ps aux | grep -E "node|ollama|health-monitor|watchdog-watcher"

echo -e "\n=== RECENT HEALTH CHECKS ==="
tail -10 monitoring/logs/health.log | jq '.'

echo -e "\n=== SYSTEM STATE ==="
cat monitoring/state/state.json | jq '.'

echo -e "\n=== DISK USAGE ==="
df -h | grep -E "/$|openmemory"

echo -e "\n=== MEMORY USAGE ==="
free -h

echo -e "\n=== LAST 10 ERRORS ==="
grep -i "error" monitoring/logs/*.log | tail -10
```

#### Debug Mode

Enable detailed logging:

```bash
# Run health monitor with debug output
DEBUG=1 ./monitoring/health-monitor.sh

# Or with bash -x for a full trace
bash -x ./monitoring/health-monitor.sh 2>&1 | tee debug.log
```

---

## Advanced Topics

### Customizing Health Checks

You can add custom health checks by editing `health-monitor.sh`:

```bash
check_custom_metric() {
    local service="openmemory"

    # Your custom check logic here
    local custom_value=$(curl -s http://localhost:7070/api/custom-metric)

    if [[ $custom_value -gt 100 ]]; then
        log_health "$service" "degraded" "custom" "0" "Custom metric too high: $custom_value"
        return 1
    fi

    log_health "$service" "healthy" "custom" "0" "Custom metric OK: $custom_value"
    return 0
}
```

### Integrating with External Monitoring

#### Export Metrics to Prometheus

```typescript
// Prometheus exporter endpoint
// Add to backend/src/server/routes/metrics.ts
app.get('/metrics', (req, res) => {
  const metrics = {
    openmemory_health: 1,  // 1=healthy, 0=down
    openmemory_response_time_ms: 45,
    openmemory_memory_mb: 128,
    openmemory_restarts_total: 0
  };

  res.send(Object.entries(metrics)
    .map(([k, v]) => `${k} ${v}`)
    .join('\n'));
});
```

#### Send Alerts to External Services

Add to `health-monitor.sh`:

```bash
send_alert() {
    local service=$1
    local status=$2
    local details=$3

    # Send to Slack
    curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
      -H 'Content-Type: application/json' \
      -d "{\"text\": \"[$status] $service: $details\"}"

    # Or send email
    echo "$details" | mail -s "[$status] $service" admin@example.com
}
```

### Scaling to Multiple Servers

For distributed deployments:

1. **Centralized Logging:** Ship logs to a central aggregator (ELK, Loki)
2. **Distributed Monitoring:** Run health monitors on each server
3. **Central Dashboard:** Aggregate status from all instances (see the sketch below)
4. **Load Balancer Health Checks:** Use the `/health` endpoint
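For item 3, a minimal sketch of a central status poll; the hostnames are hypothetical, and each instance is assumed to expose the same `/health` endpoint returning a JSON body with a `status` field, as the log schema suggests:

```bash
# Hypothetical central dashboard poll: one line of output per instance
HOSTS=("mem-01.internal" "mem-02.internal" "mem-03.internal")  # assumed hostnames

for host in "${HOSTS[@]}"; do
    if body=$(curl -sf --max-time 5 "http://$host:7070/health"); then
        echo "$host: $(echo "$body" | jq -r '.status')"
    else
        echo "$host: unreachable"
    fi
done
```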
---

## FAQ

**Q: Why dual-watchdog architecture?**
A: It prevents a single point of failure. If the primary monitor crashes, the secondary detects this and restarts it.

**Q: How much overhead does monitoring add?**
A: Minimal. Fast checks: <1ms; medium checks: ~50ms. Total CPU usage: <1%.

**Q: Can I run without Ollama?**
A: Yes. Set `LLM_ANALYSIS_ENABLED=false`. You'll lose intelligent diagnostics, but core monitoring works.

**Q: How do I change check intervals?**
A: Edit `monitoring/config/mnemosyne.conf` and adjust the `CHECK_INTERVAL_*` values.

**Q: What if I want to monitor other services?**
A: Add check functions to `health-monitor.sh` following the existing patterns.

**Q: Can this run on macOS/Windows?**
A: It requires minor modifications (BSD stat syntax on macOS, WSL on Windows). Linux is recommended.

**Q: How do I back up monitoring data?**
A: Back up the `monitoring/` directory and `backend/data/` regularly:

```bash
tar -czf monitoring-backup-$(date +%Y%m%d).tar.gz monitoring/ backend/data/
```

**Q: What's the memory footprint?**
A: Very small. Each watchdog: ~10MB RAM. Ollama: ~2GB RAM (only during analysis).

---

## Appendix

### File Locations

```
openmemory/
├── init-mnemosyne.sh                    # Initialization script
├── monitoring/
│   ├── config/
│   │   └── mnemosyne.conf               # Configuration file
│   ├── logs/
│   │   ├── health.log                   # Health check logs (JSON)
│   │   ├── watchdog.log                 # Watchdog logs
│   │   ├── analysis.log                 # LLM analysis logs
│   │   ├── openmemory.log               # Service stdout/stderr
│   │   └── ollama.log                   # Ollama service log
│   ├── state/
│   │   ├── state.json                   # Current system state
│   │   ├── metrics.json                 # Performance metrics
│   │   ├── openmemory.pid               # OpenMemory PID
│   │   ├── health-monitor.pid           # Monitor PID
│   │   └── watchdog-watcher.pid         # Watcher PID
│   ├── systemd/
│   │   ├── openmemory.service           # OpenMemory systemd unit
│   │   ├── mnemosyne-monitor.service    # Primary watchdog unit
│   │   └── mnemosyne-watcher.service    # Secondary watchdog unit
│   ├── health-monitor.sh                # Primary watchdog
│   ├── watchdog-watcher.sh              # Secondary watchdog
│   └── llm-analyzer.sh                  # LLM diagnostics
└── backend/
    └── data/
        └── openmemory.sqlite            # Memory database
```

### Quick Reference Commands

```bash
# Start everything (brace group keeps both monitors behind the init check)
./init-mnemosyne.sh && { ./monitoring/health-monitor.sh & ./monitoring/watchdog-watcher.sh & }

# Stop everything
pkill -f "health-monitor|watchdog-watcher|openmemory"

# Check status
curl http://localhost:7070/health | jq '.'

# View logs
tail -f monitoring/logs/health.log | jq '.'

# Run diagnostics
./monitoring/llm-analyzer.sh report

# Check circuit breaker
cat monitoring/state/state.json | jq '.circuit_breaker'

# Reset everything
rm -rf monitoring/logs/* monitoring/state/* && ./init-mnemosyne.sh
```

### Support & Contributing

- **Issues:** Report at https://gitlab.com/Technomacus/mnemosyne/issues
- **Documentation:** https://mnemosyne.info
- **Email:** felixgardner@gmail.com

---

**Generated by Mnemosyne AI**
*Using memory to remember how to remember*