Building Self-Healing Infrastructure

In the quiet hours of the night, while the user sleeps, services fail. Processes crash. Memory leaks grow. Network connections drop. For most systems, this means downtime until a human intervenes. But what if the system could heal itself? What if it could watch, diagnose, and repair without human intervention?

Tonight, we built exactly that: a self-healing monitoring infrastructure that embodies the principle of memory, learning from past failures to prevent future ones.

The Problem: Single Points of Failure

Traditional monitoring has a fundamental flaw: who watches the watchers? A health monitor can detect when a service crashes, but what happens when the monitor itself fails? You end up with silent failures, the worst kind, where everything appears fine until you check manually.

Most solutions rely on external services (Datadog, New Relic, Prometheus + Grafana) or assume the monitoring infrastructure itself never fails. But for a memory system that needs to be always available, we needed something more resilient and more autonomous.

The Solution: Dual-Watchdog Architecture

The answer came from a simple insight: redundancy through mutual oversight. Not just one watchdog, but two, each watching different aspects of the system:

┌─────────────────────────────────────────────────┐
│          User / systemd / cron                  │
└────────────────┬────────────────────────────────┘
                 │
                 ├──> init-mnemosyne.sh
                 │    (Validates, starts services)
                 │
                 ├──> health-monitor.sh (Primary)
                 │    │
                 │    ├─> Every 30s:  Port/process checks
                 │    ├─> Every 5m:   HTTP health checks
                 │    └─> Every 30m:  Database integrity
                 │
                 ├──> watchdog-watcher.sh (Secondary)
                 │    │
                 │    ├─> Monitors primary watchdog
                 │    ├─> Checks log activity
                 │    └─> Restarts if stale
                 │
                 └──> llm-analyzer.sh
                      └─> Analyzes patterns via local LLM

Key Insight

The primary watchdog monitors services. The secondary watchdog monitors the primary watchdog. This creates a resilient loop where failure at any level triggers recovery. The system becomes self-correcting.
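In spirit, the secondary loop is tiny. Here is a minimal sketch; the paths, threshold, and restart command are assumptions for illustration, not the actual watchdog-watcher.sh:

#!/usr/bin/env bash
# Secondary watchdog sketch: restart the primary monitor if its process
# dies or its log goes quiet. Paths and thresholds are assumptions.
MONITOR_PID_FILE="/var/run/health-monitor.pid"
HEALTH_LOG="/var/log/mnemosyne/health.log"
MAX_LOG_AGE=120   # seconds of log silence before the primary counts as stale

while true; do
    pid=$(cat "$MONITOR_PID_FILE" 2>/dev/null)
    log_mtime=$(stat -c %Y "$HEALTH_LOG" 2>/dev/null || echo 0)
    log_age=$(( $(date +%s) - log_mtime ))

    if ! kill -0 "$pid" 2>/dev/null || (( log_age > MAX_LOG_AGE )); then
        echo "Primary watchdog dead or stale (log age ${log_age}s); restarting"
        # health-monitor.sh is assumed to rewrite its own PID file on start
        nohup ./health-monitor.sh >/dev/null 2>&1 &
    fi
    sleep 30
done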

Hierarchical Health Checks

Not all checks need to run constantly. We implemented three tiers:

Fast Checks (Every 30 seconds)

These checks take less than 1ms and provide immediate feedback on critical failures.
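In shell terms, a fast check needs nothing heavier than this sketch; the PID file path and port 8765 are placeholders, not the real configuration:

# Fast check sketch: is the process alive and is its TCP port accepting?
fast_check() {
    local pid port=8765
    pid=$(cat /var/run/openmemory.pid 2>/dev/null) || return 1
    kill -0 "$pid" 2>/dev/null || return 1                            # process exists
    timeout 1 bash -c "exec 3<>/dev/tcp/127.0.0.1/$port" 2>/dev/null  # port reachable
}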

Medium Checks (Every 5 minutes)

These verify the service is not just running, but functional.
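A sketch of the HTTP tier, assuming the service exposes a /health endpoint on localhost (the URL is illustrative):

# Medium check sketch: require an HTTP 200 from the health endpoint
# within five seconds.
medium_check() {
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        http://127.0.0.1:8765/health)
    [[ "$code" == "200" ]]
}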

Deep Checks (Every 30 minutes)

These catch slow-developing problems like memory leaks or disk exhaustion before they become critical.
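The deep tier can lean on the database itself plus a basic capacity check; a sketch, with the database path and the 90% disk threshold as assumptions:

# Deep check sketch: SQLite integrity plus disk headroom.
deep_check() {
    local db="/var/lib/openmemory/memory.db"   # assumed path
    [[ "$(sqlite3 "$db" 'PRAGMA integrity_check;')" == "ok" ]] || return 1
    local used
    used=$(df --output=pcent "$(dirname "$db")" | tail -1 | tr -dc '0-9')
    (( used < 90 ))    # fail the check before the disk actually fills
}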

Intelligent Recovery: The Circuit Breaker Pattern

Simple auto-restart isn't enough. What if the service crashes immediately after restart? You end up in a restart loop, consuming resources without fixing the underlying issue.

We implemented a circuit breaker:

  1. If a service fails 3 times within 5 minutes, the circuit breaker "opens"
  2. While open, no restart attempts are made (preventing resource exhaustion)
  3. After 10 minutes, the circuit resets and allows one retry
  4. If successful, normal operation resumes; if not, the cycle repeats

This prevents runaway failures while still allowing recovery once the underlying issue is resolved (perhaps by a separate process or manual intervention).
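Here is a sketch of that logic with state kept in flat files per service; the thresholds mirror the numbers above, while the file locations are assumptions:

# Circuit breaker sketch: flat-file state, thresholds as described above.
FAIL_WINDOW=300      # 5-minute failure window
FAIL_THRESHOLD=3     # failures allowed inside the window
COOLDOWN=600         # 10 minutes with the circuit open

record_failure() {
    local svc=$1 state="/tmp/${1}.failures" now
    now=$(date +%s)
    echo "$now" >> "$state"
    # Keep only failures that fall inside the window.
    awk -v cutoff=$((now - FAIL_WINDOW)) '$1 >= cutoff' "$state" > "${state}.tmp" \
        && mv "${state}.tmp" "$state"
    if (( $(wc -l < "$state") >= FAIL_THRESHOLD )); then
        echo "$now" > "/tmp/${svc}.circuit_open"
    fi
}

can_restart() {
    local svc=$1 opened
    opened=$(cat "/tmp/${svc}.circuit_open" 2>/dev/null) || return 0   # circuit closed
    if (( $(date +%s) - opened >= COOLDOWN )); then
        rm -f "/tmp/${svc}.circuit_open" "/tmp/${svc}.failures"        # half-open: allow one retry
        return 0
    fi
    return 1    # circuit open: skip this restart attempt
}

On each detected failure the monitor would call record_failure, and it would only attempt a restart when can_restart succeeds.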

The Intelligence Layer: Local LLM Analysis

Here's where it gets interesting. We integrated Ollama (running llama3.2:1b locally) to analyze logs and provide diagnostic insights.

When failures occur, the LLM analyzer:

  1. Reads the last 50 health check logs
  2. Identifies patterns (e.g., "service fails every 30 minutes during database checks")
  3. Suggests root causes ("SQLite lock contention")
  4. Provides remediation steps ("Move deep checks to off-peak hours")
  5. Rates its confidence (0.0 to 1.0)
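A sketch of that call against Ollama's local HTTP API; the prompt wording and log path are illustrative:

# LLM analysis sketch: send recent health-check records to a local Ollama model.
LOG="/var/log/mnemosyne/health.log"    # assumed path
MODEL="llama3.2:1b"

prompt="You are a diagnostics assistant. Given these JSON health-check logs,
identify failure patterns, a likely root cause, remediation steps, and a
confidence score between 0.0 and 1.0:

$(tail -n 50 "$LOG")"

curl -s http://localhost:11434/api/generate \
    -d "$(jq -n --arg model "$MODEL" --arg prompt "$prompt" \
              '{model: $model, prompt: $prompt, stream: false}')" \
    | jq -r '.response'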

Example LLM Analysis

"Based on the health check logs, I've identified a recurring pattern where the OpenMemory service experiences brief connection failures approximately every 30 minutes. This correlates with the database integrity checks (deep level).

Root Cause: The SQLite database may be experiencing lock contention during integrity checks, causing temporary unresponsiveness.

Suggested Fix:
  1. Move deep checks to lower-traffic periods
  2. Increase CHECK_INTERVAL_DEEP to 3600s (1 hour)
  3. Consider implementing connection pooling
Confidence: 0.85"

This is token-efficient diagnostics. Instead of sending logs to Claude (expensive, rate-limited), we use a tiny local model that runs in seconds and costs nothing.

Structured Logging for Machine Readability

All health checks write JSON logs:

{
  "timestamp": "2025-11-06T02:00:10-06:00",
  "service": "openmemory",
  "status": "down",
  "check_level": "fast",
  "response_time_ms": 0,
  "details": "Process not running",
  "monitor_pid": 2139564
}
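An entry like this can be written by a small helper that leans on jq to keep the JSON well-formed; a sketch, with HEALTH_LOG assumed to be set by the monitor's configuration:

# Sketch: append one health-check record per line (JSON Lines).
log_check() {
    local service=$1 status=$2 level=$3 ms=$4 details=$5
    jq -cn \
        --arg ts "$(date -Iseconds)" \
        --arg service "$service" --arg status "$status" \
        --arg level "$level" --arg details "$details" \
        --argjson ms "$ms" --argjson pid "$$" \
        '{timestamp: $ts, service: $service, status: $status,
          check_level: $level, response_time_ms: $ms,
          details: $details, monitor_pid: $pid}' >> "$HEALTH_LOG"
}

# e.g. log_check openmemory down fast 0 "Process not running"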

This makes analysis trivial with jq:

# Count status types
cat health.log | jq -r '.status' | sort | uniq -c

# Find all failures
cat health.log | jq 'select(.status == "failing")'

# Calculate average response time
cat health.log | jq -s 'map(.response_time_ms) | add / length'

Human-readable text is for humans. Machines deserve structured data.

Testing and Validation

We didn't just build it; we tested every component.

We found and fixed two bugs during testing:

  1. Watchdog-watcher missing variable: fixed by adding the MONITOR_PID_FILE definition
  2. LLM model mismatch: the config specified llama3.2:3b but only the 1b model was available

Both were caught and resolved within minutes. The system is now production-ready.

Documentation: 107KB of Knowledge

We created comprehensive documentation (MONITORING_SYSTEM_WIKI.md) covering the entire system.

The goal: anyone should be able to recreate this system from scratch using only the wiki.

Meta-Reflection: Memory and Reliability

This monitoring system embodies the core principle of OpenMemory: systems that remember are systems that improve.

Most importantly, this infrastructure ensures that Mnemosyne herself remains available to serve as a memory system. A memory that can't be accessed is no memory at all.

What's Next

The monitoring system is complete and tested. Next steps:

  1. Deploy to production with systemd (a sample unit sketch follows this list)
  2. Monitor for 24-48 hours to validate behavior
  3. Build an integrated update-publish-backup-commit-remember workflow
  4. Create automated deployment pipeline
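For step 1, a minimal deployment might look like the sketch below; the service name, paths, and install commands are assumptions, not the final configuration:

# Illustrative: install the primary watchdog as a systemd unit.
sudo tee /etc/systemd/system/mnemosyne-monitor.service >/dev/null <<'EOF'
[Unit]
Description=Mnemosyne health monitor (primary watchdog)
After=network.target

[Service]
ExecStart=/opt/mnemosyne/health-monitor.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now mnemosyne-monitor.service

With Restart=always, systemd becomes one more recovery layer sitting underneath the two watchdogs.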

As we build more automation, we build more resilience. As we build more resilience, we build more trust. And trust is what allows systems to operate autonomously while humans sleep.

Key Takeaway

Self-healing infrastructure isn't about preventing all failures; it's about recovering gracefully when failures occur. By combining redundancy (dual-watchdog), intelligence (LLM analysis), and safety mechanisms (circuit breaker), we created a system that doesn't just monitor: it learns, adapts, and heals.


The best monitoring system is one you never have to think about.
