Running Docker containers in your lab or staging environment? You need to know when they fail—fast. In this tutorial, I’ll show you how to build a Python-based health monitoring system that tracks your Docker containers in real time and sends instant local alerts when something goes wrong.
Note: This is a proof-of-concept project meant for learning. Treat it as a starting point, not a production-ready solution.
No third-party services. No monthly fees. Just pure Python monitoring running on your own infrastructure.
The Problem: Silent Container Failures
Docker’s health checks are great for detecting problems, but they’re passive. When a container fails, Docker knows about it—but do you? Unless you’re constantly running docker ps, failed containers can go unnoticed for hours.
This is where automated monitoring comes in. We need a system that:
- Continuously monitors all Docker containers
- Detects health check failures immediately
- Alerts you when containers become unhealthy or stop
- Tracks recovery when containers return to healthy state
- Logs everything for later analysis
The Solution: A Python Monitoring Script
We’ll build a lightweight monitoring system using Python and the Docker SDK. The monitor will:
- Check container health status every few seconds
- Send colored console alerts with visual indicators
- Log all alerts to files for auditing
- Export current health states as JSON
- Track state changes to avoid alert fatigue
Best of all, it runs entirely locally—no external dependencies or cloud services required.
How Docker Health Checks Behave
Before wiring up code, it helps to understand what information Docker exposes. Every container reports one of four health states:
- starting – Docker is still running the health command inside the container
- healthy – the command returned 0 within the allotted retries
- unhealthy – the command failed repeatedly and Docker marked the container unhealthy
- none – no health check exists, so Docker cannot score it
You can inspect those details at any time:
docker inspect --format '{{json .State.Health}}' unhealthy-app | jq
That single command shows the latest probe output, timestamps, and exit codes. Our monitor simply automates this inspection cycle, adds color-coded context, and records every state transition so you always know why Docker changed its mind.
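If you prefer to stay in Python, the Docker SDK we install in the next step exposes the same data. Here is a minimal preview (a sketch; it assumes a container named unhealthy-app is running, as it will be once we build the test stack later):
import docker

client = docker.from_env()                          # connect to the local Docker daemon
container = client.containers.get("unhealthy-app")
container.reload()                                  # refresh cached state

health = container.attrs.get("State", {}).get("Health")
if health:
    print(health["Status"])                         # starting / healthy / unhealthy
    for probe in health.get("Log", [])[-1:]:        # most recent probe, if any
        print(probe["Output"])
else:
    print("No health check configured for this container")
Our monitor is essentially this loop, repeated on a schedule and wrapped with alerting and logging.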
Project Structure
Here’s what we’ll build:
docker-health-monitor/
├── monitor.py # Main monitoring script
├── requirements.txt # Python dependencies
├── docker-compose.yml # Test containers with health checks
├── test-apps/ # Sample applications for testing
│ ├── healthy-app.py # Always healthy Flask app
│ ├── unhealthy-app.py # Becomes unhealthy after 30s
│ ├── crashing-app.py # Crashes after 45s
│ ├── Dockerfile # Container image for test apps
│ └── requirements.txt # Flask dependencies
└── logs/ # Alert and health logs (created automatically)
Prerequisites
Before we begin, make sure you have:
- Python 3.7 or higher installed
- Docker Engine running
- Docker Compose (optional, for testing)
- Basic understanding of Docker containers and health checks
Step 1: Setting Up Dependencies
We will mirror the workflow used in ATA tutorials: outline prerequisites, explain why each tool matters, then execute the commands. The monitor needs only two dependencies, so start by staging a clean working folder:
mkdir docker-health-monitor
cd docker-health-monitor
Create a requirements.txt file with the exact versions the project already uses:
docker==7.1.0
colorama==0.4.6
Why these packages?
– docker==7.1.0: Gives Python first-class access to the Docker Engine API, so we can ask for health data without shelling out to docker ps.
– colorama==0.4.6: Normalizes ANSI colors on macOS, Linux, and Windows so the console alerts you see later look identical everywhere.
Installing Dependencies
For modern Python installations (Python 3.11+ via Homebrew on macOS), you’ll need to use a virtual environment:
# Create a virtual environment
python3 -m venv venv
# Activate it
source venv/bin/activate # On macOS/Linux
# OR
venv\Scripts\activate # On Windows
# Install dependencies
pip install -r requirements.txt
Why a virtual environment? Modern Python installations (especially the Homebrew build on macOS) block global pip installs. The venv folder keeps the Docker SDK isolated from the rest of your machine and makes it trivial to pin versions in CI.
For older Python or system package managers:
pip install -r requirements.txt
Verify installation:
python -c "import docker; import colorama; print('Dependencies installed successfully!')"
Step 2: Building the Monitor
Create monitor.py with the following implementation:
The DockerHealthMonitor Class
Rather than scattering helper functions everywhere, the monitor sticks to a “single action, single object” principle: one class owns the Docker client, log destinations, and state-tracking dictionary, so every method can focus on a single responsibility.
#!/usr/bin/env python3
import docker
import time
import json
from datetime import datetime
from colorama import init, Fore, Style, Back
from pathlib import Path
# Initialize colorama for cross-platform colored output
init(autoreset=True)
class DockerHealthMonitor:
"""Monitor Docker container health and send local notifications."""
def __init__(self, check_interval=10, log_dir="logs"):
"""Initialize the monitor."""
self.client = docker.from_env()
self.check_interval = check_interval
self.log_dir = Path(log_dir)
self.log_dir.mkdir(exist_ok=True)
self.container_states = {}
# Log files
self.alert_log = self.log_dir / "alerts.log"
self.health_log = self.log_dir / "health_status.json"
Key takeaways
– The constructor does all environment setup—loading the Docker client, ensuring the log directory exists, and initializing the in-memory container_states cache—so downstream methods can assume those prerequisites are in place.
– Log file paths are computed once. Every other method simply writes to self.alert_log or self.health_log, keeping side-effects predictable.
Getting Container Health Status
Before we can make decisions, we need to normalize Docker’s raw JSON into a predictable structure. get_container_health is the single ingestion point:
def get_container_health(self, container):
"""Get the health status of a container."""
container.reload() # Refresh container state from Docker
health_info = {
"name": container.name,
"id": container.short_id,
"status": container.status,
"health": "none",
"timestamp": datetime.now().isoformat()
}
# Check if container has health check configured
if container.attrs.get("State", {}).get("Health"):
health_status = container.attrs["State"]["Health"]["Status"]
health_info["health"] = health_status
# Get the last health check log
health_logs = container.attrs["State"]["Health"].get("Log", [])
if health_logs:
last_log = health_logs[-1]
health_info["last_check_output"] = last_log.get("Output", "")
health_info["last_check_exit_code"] = last_log.get("ExitCode", 0)
return health_info
Notice the method does not try to interpret the data. Its sole purpose is to collect: reload the container, capture metadata, and include the last probe output when available. That makes the downstream alerting logic dead simple—every other method receives a consistent dict.
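To see what that dict looks like in practice, you can call the method from an interactive Python session (illustrative only; the names and values will match whatever containers you happen to be running):
monitor = DockerHealthMonitor(check_interval=5)
for container in monitor.client.containers.list(all=True):
    info = monitor.get_container_health(container)
    print(f"{info['name']}: status={info['status']}, health={info['health']}")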
Sending Alerts
When something goes wrong, we need to know immediately:
def send_alert(self, container_info, alert_type="UNHEALTHY"):
"""Send a local alert about container health issue."""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
container_name = container_info["name"]
# Color-coded console output
if alert_type == "UNHEALTHY":
print(f"{Fore.RED}{Back.WHITE}{Style.BRIGHT} ⚠ ALERT {Style.RESET_ALL} "
f"{Fore.RED}{container_name}{Style.RESET_ALL} is {Fore.RED}UNHEALTHY{Style.RESET_ALL}")
elif alert_type == "STOPPED":
print(f"{Fore.YELLOW}{Back.BLACK}{Style.BRIGHT} ⚠ ALERT {Style.RESET_ALL} "
f"{Fore.YELLOW}{container_name}{Style.RESET_ALL} has {Fore.YELLOW}STOPPED{Style.RESET_ALL}")
elif alert_type == "RECOVERED":
print(f"{Fore.GREEN}{Back.WHITE}{Style.BRIGHT} ✓ RECOVERED {Style.RESET_ALL} "
f"{Fore.GREEN}{container_name}{Style.RESET_ALL} is now {Fore.GREEN}HEALTHY{Style.RESET_ALL}")
# Log to file
with open(self.alert_log, "a") as f:
f.write(f"[{timestamp}] {alert_type}: {container_name}\n")
f.write(f" Details: {json.dumps(container_info, indent=2)}\n")
f.write("-" * 80 + "\n")
# Create notification file for external monitoring
notification_file = self.log_dir / "latest_alert.txt"
with open(notification_file, "w") as f:
f.write(f"{alert_type}: {container_name}\n")
f.write(f"Timestamp: {datetime.now().isoformat()}\n")
Alert features:
– Visual indicators: Colors and symbols (⚠, ✓) for quick scanning
– Multiple outputs: Console, log file, and latest alert file
– Detailed logging: Full container info in JSON format
– External integration: The latest_alert.txt file can be monitored by other tools
Because the method handles every output channel in one spot, you can bolt on additional notifiers (email, Slack, etc.) without littering the rest of the codebase with branching logic. Teaching-oriented code should read top-to-bottom like a story, and send_alert is the chapter where we narrate exactly what happened.
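If you later want more than one channel, one low-friction pattern (a sketch, not part of the script above) is to keep a list of notifier callbacks and have send_alert fan each alert out to all of them:
# Hypothetical helper: fan one alert out to any number of extra channels.
# Each notifier is a callable taking (container_info, alert_type).
def fan_out(notifiers, container_info, alert_type):
    for notify in notifiers:
        try:
            notify(container_info, alert_type)
        except Exception as exc:
            # A broken notifier (bad webhook URL, SMTP outage) must not stop the monitor.
            print(f"Notifier failed: {exc}")
send_alert would call fan_out(self.notifiers, container_info, alert_type) as its last step, and the email and webhook examples later in this article would slot straight into that list.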
Checking All Containers
The main monitoring loop checks all containers and detects state changes:
def check_containers(self):
"""Check all containers and detect health issues."""
try:
containers = self.client.containers.list(all=True)
if not containers:
print(f"{Fore.CYAN}No containers found to monitor.{Style.RESET_ALL}")
return
current_states = {}
for container in containers:
health_info = self.get_container_health(container)
container_name = health_info["name"]
current_states[container_name] = health_info
# Get previous state
previous_state = self.container_states.get(container_name, {})
# Detect state changes and issues
if health_info["health"] == "unhealthy":
if previous_state.get("health") != "unhealthy":
self.send_alert(health_info, "UNHEALTHY")
elif health_info["status"] in ["exited", "dead", "stopped"]:
if previous_state.get("status") not in ["exited", "dead", "stopped"]:
self.send_alert(health_info, "STOPPED")
elif health_info["health"] == "healthy":
# Check if recovered from unhealthy state
if previous_state.get("health") == "unhealthy":
self.send_alert(health_info, "RECOVERED")
# Update container states
self.container_states = current_states
# Save current state to JSON file
with open(self.health_log, "w") as f:
json.dump({
"timestamp": datetime.now().isoformat(),
"containers": current_states
}, f, indent=2)
except docker.errors.DockerException as e:
print(f"{Fore.RED}Docker error: {e}{Style.RESET_ALL}")
State tracking logic:
– Only alerts on state changes to avoid spam
– Tracks three alert types: UNHEALTHY, STOPPED, and RECOVERED
– Saves a JSON snapshot of all container states
– Handles Docker API errors gracefully
Think of container_states as a running ledger. Each pass through check_containers fetches the latest facts, compares them to the last snapshot, and emits alerts only when the story changes. That mirrors how we teach in ATA tutorials: measure → compare → explain. Instead of spamming the terminal every five seconds, we wait until a container crosses a meaningful boundary, then capture the full context for posterity.
Display Status Summary
A clean summary table shows the current state:
def print_status_summary(self):
"""Print a summary of all container statuses."""
if not self.container_states:
return
print(f"\n{Fore.CYAN}{Style.BRIGHT}{'='*80}{Style.RESET_ALL}")
print(f"{Fore.CYAN}{Style.BRIGHT}Container Health Status Summary{Style.RESET_ALL}")
print(f"{Fore.CYAN}{Style.BRIGHT}{'='*80}{Style.RESET_ALL}")
for name, info in self.container_states.items():
status = info["status"]
health = info["health"]
# Color code based on health
if health == "healthy":
status_color = Fore.GREEN
symbol = "✓"
elif health == "unhealthy":
status_color = Fore.RED
symbol = "✗"
elif health == "starting":
status_color = Fore.YELLOW
symbol = "⟳"
else:
status_color = Fore.WHITE
symbol = "○"
print(f"{symbol} {status_color}{name:30}{Style.RESET_ALL} "
f"Status: {status_color}{status:10}{Style.RESET_ALL} "
f"Health: {status_color}{health}{Style.RESET_ALL}")
print(f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}\n")
Visual design:
– ✓ = Healthy (green)
– ✗ = Unhealthy (red)
– ⟳ = Starting (yellow)
– ○ = No health check (white)
These glyphs match the mental model we established earlier when reviewing Docker’s health states. When the monitor prints a table, you immediately see which containers are recovering, starting, or silently running without a health check configured. Teaching through visuals is a core ATA pattern—show the outcome, then dissect what caused it.
The Main Loop
Finally, tie it all together with the monitoring loop:
def run(self):
"""Run the monitoring loop."""
print(f"{Fore.GREEN}{Style.BRIGHT}Starting Docker Health Monitor...{Style.RESET_ALL}")
print(f"{Fore.CYAN}Check interval: {self.check_interval} seconds{Style.RESET_ALL}")
print(f"{Fore.CYAN}Log directory: {self.log_dir.absolute()}{Style.RESET_ALL}\n")
try:
while True:
self.check_containers()
self.print_status_summary()
time.sleep(self.check_interval)
except KeyboardInterrupt:
print(f"\n{Fore.YELLOW}Monitoring stopped by user.{Style.RESET_ALL}")
except Exception as e:
print(f"{Fore.RED}Unexpected error: {e}{Style.RESET_ALL}")
raise
Command-Line Interface
Add argument parsing for flexibility:
def main():
"""Main entry point."""
import argparse
parser = argparse.ArgumentParser(
description="Monitor Docker container health and send local alerts"
)
parser.add_argument(
"--interval",
type=int,
default=10,
help="Check interval in seconds (default: 10)"
)
parser.add_argument(
"--log-dir",
type=str,
default="logs",
help="Directory for log files (default: logs)"
)
args = parser.parse_args()
monitor = DockerHealthMonitor(
check_interval=args.interval,
log_dir=args.log_dir
)
monitor.run()
if __name__ == "__main__":
main()
Step 3: Creating Test Containers
To properly test our monitor, we need containers with different behaviors. Let’s create three Flask applications that demonstrate different failure scenarios.
Always Healthy App
Create test-apps/healthy-app.py:
#!/usr/bin/env python3
from flask import Flask, jsonify
import os
app = Flask(__name__)
@app.route('/')
def index():
return jsonify({
'status': 'running',
'app': 'healthy-app',
'message': 'I am healthy!'
})
@app.route('/health')
def health():
"""Health check endpoint - always returns healthy."""
return jsonify({'status': 'healthy'}), 200
if __name__ == '__main__':
port = int(os.environ.get('PORT', 5000))
app.run(host='0.0.0.0', port=port)
This helper is intentionally boring. It gives you a known-good reference so that, when the other containers begin to fail, you can confirm the monitor continues to report at least one healthy service. In the console summary you’ll always see a green ✓ healthy-app.
Becomes Unhealthy App
Create test-apps/unhealthy-app.py:
#!/usr/bin/env python3
from flask import Flask, jsonify
import os
import time
app = Flask(__name__)
START_TIME = time.time()
UNHEALTHY_AFTER = int(os.environ.get('UNHEALTHY_AFTER', 30))
@app.route('/')
def index():
uptime = int(time.time() - START_TIME)
is_healthy = uptime < UNHEALTHY_AFTER
return jsonify({
'status': 'running',
'app': 'unhealthy-app',
'uptime_seconds': uptime,
'healthy': is_healthy,
'message': f'Will become unhealthy after {UNHEALTHY_AFTER}s'
})
@app.route('/health')
def health():
"""Health check endpoint - becomes unhealthy after UNHEALTHY_AFTER seconds."""
uptime = int(time.time() - START_TIME)
if uptime < UNHEALTHY_AFTER:
return jsonify({
'status': 'healthy',
'uptime': uptime,
'message': f'Healthy (will fail in {UNHEALTHY_AFTER - uptime}s)'
}), 200
else:
return jsonify({
'status': 'unhealthy',
'uptime': uptime,
'message': 'Health check failed!'
}), 503 # Service Unavailable
if __name__ == '__main__':
port = int(os.environ.get('PORT', 5001))
app.run(host='0.0.0.0', port=port)
This application models the “slow burn” failure you observe with memory leaks or broken dependencies. Because it advertises its future failure inside the JSON payload, you can correlate the container’s own telemetry (uptime, message) with the alert the monitor sends at the 30-second mark.
Crashing App
Create test-apps/crashing-app.py:
#!/usr/bin/env python3
from flask import Flask, jsonify
import os
import time
import sys
app = Flask(__name__)
START_TIME = time.time()
CRASH_AFTER = int(os.environ.get('CRASH_AFTER', 45))
@app.route('/')
def index():
uptime = int(time.time() - START_TIME)
if uptime >= CRASH_AFTER:
print(f"Uptime {uptime}s exceeded {CRASH_AFTER}s - CRASHING!", flush=True)
sys.exit(1)
return jsonify({
'status': 'running',
'app': 'crashing-app',
'uptime_seconds': uptime,
'message': f'Will crash after {CRASH_AFTER}s'
})
@app.route('/health')
def health():
"""Health check endpoint."""
uptime = int(time.time() - START_TIME)
if uptime >= CRASH_AFTER:
print(f"Uptime {uptime}s exceeded {CRASH_AFTER}s - CRASHING!", flush=True)
sys.exit(1)
return jsonify({
'status': 'healthy',
'uptime': uptime,
'message': f'Healthy (will crash in {CRASH_AFTER - uptime}s)'
}), 200
if __name__ == '__main__':
port = int(os.environ.get('PORT', 5002))
app.run(host='0.0.0.0', port=port)
Use this container to demonstrate why the monitor inspects both .State.Health and the high-level status field. After roughly 45 seconds the Flask process stops responding to the health probe, Docker marks the container unhealthy, and curl surfaces the error as Empty reply from server. The container technically keeps running, which makes this a great example of a “gray failure” that only a health-aware monitor can expose.
Test Apps Requirements
Create test-apps/requirements.txt:
Flask==3.0.0
Werkzeug==3.0.1
What these do:
– Flask==3.0.0: Lightweight web framework for creating HTTP endpoints and health check routes
– Werkzeug==3.0.1: WSGI utility library that Flask depends on for request/response handling
These dependencies will be installed inside the Docker containers automatically during the build process, so you don’t need to install them locally.
Dockerfile for Test Apps
Create test-apps/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application files
COPY *.py .
# Install curl for health checks
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
CMD ["python", "healthy-app.py"]
Step 4: Docker Compose Configuration
Create docker-compose.yml to orchestrate our test containers:
version: '3.8'
services:
# Container 1: Always healthy
healthy-app:
build: ./test-apps
command: python healthy-app.py
ports:
- "5000:5000"
environment:
- PORT=5000
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
interval: 5s
timeout: 3s
retries: 3
start_period: 5s
container_name: healthy-app
# Container 2: Becomes unhealthy after 30 seconds
unhealthy-app:
build: ./test-apps
command: python unhealthy-app.py
ports:
- "5001:5001"
environment:
- PORT=5001
- UNHEALTHY_AFTER=30
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:5001/health"]
interval: 5s
timeout: 3s
retries: 2
start_period: 5s
container_name: unhealthy-app
# Container 3: Crashes after 45 seconds
crashing-app:
build: ./test-apps
command: python crashing-app.py
ports:
- "5002:5002"
environment:
- PORT=5002
- CRASH_AFTER=45
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:5002/health"]
interval: 5s
timeout: 3s
retries: 2
start_period: 5s
container_name: crashing-app
restart: "no" # Don't auto-restart
# Container 4: Basic nginx (always healthy)
web-server:
image: nginx:alpine
ports:
- "8080:80"
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:80"]
interval: 5s
timeout: 3s
retries: 3
start_period: 5s
container_name: web-server
Health check configuration explained:
- test: Command to run for the health check (curl or wget)
- interval: How often to run the check
- timeout: Maximum time to wait for the check to complete
- retries: Number of consecutive failures before marking the container unhealthy
- start_period: Grace period during container startup
Every service reuses the same image but overrides the command and environment variables so you can demonstrate multiple failure stories without maintaining separate Dockerfiles. When you glance at this compose file later, you’ll immediately know which container is expected to fail first and why.
Step 5: Running the Monitor
Now let’s see it all in action!
Start the Test Containers
docker-compose up -d --build
This builds the images and starts all four containers in detached mode.
Verify Containers Are Running
docker ps
You should see all four containers running.
Start the Monitor
In a separate terminal:
python monitor.py --interval 5
This starts monitoring with a 5-second check interval.
Watch the Magic Happen
Initial output (0-10 seconds):
Starting Docker Health Monitor...
Check interval: 5 seconds
Log directory: /home/user/docker-health-monitor/logs
================================================================================
Container Health Status Summary
================================================================================
⟳ healthy-app Status: running Health: starting
⟳ unhealthy-app Status: running Health: starting
⟳ crashing-app Status: running Health: starting
✓ web-server Status: running Health: healthy
================================================================================
The yellow ⟳ icons indicate Docker is still within each service’s start_period. Seeing yellow immediately after startup is normal; the monitor only escalates if a container stays yellow or slides into red.
After containers stabilize (10-30 seconds):
================================================================================
Container Health Status Summary
================================================================================
✓ healthy-app Status: running Health: healthy
✓ unhealthy-app Status: running Health: healthy
✓ crashing-app Status: running Health: healthy
✓ web-server Status: running Health: healthy
================================================================================
Everything is green once the grace period ends. Capture a quick screenshot here; it becomes the reference point you compare against when investigating later alerts.
At 30 seconds – First alert!:
⚠ ALERT unhealthy-app is UNHEALTHY
================================================================================
Container Health Status Summary
================================================================================
✓ healthy-app Status: running Health: healthy
✗ unhealthy-app Status: running Health: unhealthy
✓ crashing-app Status: running Health: healthy
✓ web-server Status: running Health: healthy
================================================================================
The monitor prints a red ⚠ alert only once, then leaves unhealthy-app marked with a red ✗ in the status table. At this point you can jump into logs/alerts.log to read the failed probe output (HTTP 503) and explain the failure just like you would during a real incident review.
At 45 seconds – Second alert!:
⚠ ALERT crashing-app is UNHEALTHY
================================================================================
Container Health Status Summary
================================================================================
✓ healthy-app Status: running Health: healthy
✗ unhealthy-app Status: running Health: unhealthy
✗ crashing-app Status: running Health: unhealthy
✓ web-server Status: running Health: healthy
================================================================================
Here the “crashing” container never fully stops; instead, the Flask process keeps running but closes every health-check connection with curl: (52) Empty reply from server. Docker marks the container unhealthy, and our monitor flags it with the red ✗ so you still get a high-signal alert even though docker ps reports the container as running.
Step 6: Understanding the Logs
The monitor creates three log files in the logs/ directory:
alerts.log – Detailed Alert History
[2025-11-10 17:15:30] UNHEALTHY: unhealthy-app
Details: {
"name": "unhealthy-app",
"id": "a1b2c3d4",
"status": "running",
"health": "unhealthy",
"timestamp": "2025-11-10T17:15:30.123456",
"last_check_output": "Health check failed: HTTP 503\n",
"last_check_exit_code": 1
}
--------------------------------------------------------------------------------
Use tail -f logs/alerts.log during demos to narrate what changed. The JSON payload preserves last_check_output and last_check_exit_code, so you can copy those values directly into your incident notes (for example, curl: (22) means the probe received an HTTP error status such as the 503 here, while curl: (52) indicates the connection was severed, which is exactly what crashing-app does).
health_status.json – Current State Snapshot
{
"timestamp": "2025-11-10T17:15:50.000000",
"containers": {
"healthy-app": {
"name": "healthy-app",
"id": "i9j0k1l2",
"status": "running",
"health": "healthy",
"timestamp": "2025-11-10T17:15:50.000000"
},
"unhealthy-app": {
"name": "unhealthy-app",
"id": "a1b2c3d4",
"status": "running",
"health": "unhealthy",
"timestamp": "2025-11-10T17:15:50.000000",
"last_check_output": "Health check failed: HTTP 503\n",
"last_check_exit_code": 1
}
}
}
Pipe this file through jq '.containers["unhealthy-app"]' to see exactly what the monitor knows about each container. Because every run overwrites the JSON snapshot, you always have a point-in-time truth source to feed dashboards or custom scripts.
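If you would rather consume the snapshot from Python than jq, a few lines of standard-library code are enough (a sketch; the relative path assumes you run it from the project root):
import json
from pathlib import Path

# List the containers the latest snapshot marks as unhealthy.
snapshot = json.loads(Path("logs/health_status.json").read_text())
unhealthy = [name for name, info in snapshot["containers"].items()
             if info["health"] == "unhealthy"]
print(f"Snapshot at {snapshot['timestamp']}: unhealthy containers: {unhealthy or 'none'}")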
latest_alert.txt – Most Recent Alert
STOPPED: crashing-app
Timestamp: 2025-11-10T17:15:45.789012
This tiny file is intentionally simple; many teams watch it with inotifywait or fswatch to trigger follow-up actions without parsing the full log.
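If neither inotifywait nor fswatch is available, a portable fallback is a small Python poller (a sketch, separate from the monitor itself):
import time
from pathlib import Path

# React whenever latest_alert.txt changes on disk.
alert_file = Path("logs/latest_alert.txt")
last_mtime = 0.0
while True:
    if alert_file.exists():
        mtime = alert_file.stat().st_mtime
        if mtime != last_mtime:
            last_mtime = mtime
            print("New alert:", alert_file.read_text().splitlines()[0])
            # trigger your follow-up action here (restart, page someone, etc.)
    time.sleep(2)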
Step 7: Cleaning Up
When you’re done testing:
# Stop the monitor with Ctrl+C
# Stop and remove containers
docker-compose down
# Remove volumes (optional)
docker-compose down -v
Extending the Monitor
Now that you have a working monitor, treat each enhancement as another story to teach. Pick a single axis—notification channel, remediation, telemetry—and walk through it end-to-end before adding the next.
1. Email Notifications
Use plain SMTP first so you can demo the behavior with a local MailHog or Postfix instance before wiring in SaaS providers.
import smtplib
from email.mime.text import MIMEText
def send_email_alert(self, container_info, alert_type):
msg = MIMEText(f"Container {container_info['name']} is {alert_type}")
msg['Subject'] = f"Docker Alert: {alert_type}"
msg['From'] = '[email protected]'
msg['To'] = '[email protected]'
with smtplib.SMTP('localhost') as server:
server.send_message(msg)
2. Webhook Notifications
Most teams already centralize alerts in Slack, Teams, or a custom webhook collector. Posting the same JSON payload you log locally keeps implementation friction low.
import requests
from datetime import datetime
def send_webhook_alert(self, container_info, alert_type):
payload = {
'alert_type': alert_type,
'container': container_info,
'timestamp': datetime.now().isoformat()
}
requests.post('https://your-webhook-url.com/alert', json=payload)
3. Metrics Export
Prometheus scrapes simple HTTP endpoints, so give it a gauge per container and reuse the state you already collected.
from prometheus_client import start_http_server, Gauge
# Create metrics
container_health = Gauge('container_health', 'Container health status', ['container_name'])
# In check_containers():
for name, info in current_states.items():
health_value = 1 if info['health'] == 'healthy' else 0
container_health.labels(container_name=name).set(health_value)
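One detail the snippet leaves implicit: the metrics endpoint has to be started exactly once, for example near the top of run(). A minimal sketch (the port number is arbitrary):
# Expose the gauges over HTTP once, before entering the monitoring loop.
start_http_server(8000)   # Prometheus can now scrape http://<host>:8000/metrics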
4. Automated Remediation
Only add remediation once you can prove the alerts are trustworthy. The snippet below restarts a container after three consecutive unhealthy checks—no guesswork, just recorded failures. (It assumes you also add self.failure_counts = {} to __init__ so the counter persists between passes.)
def check_containers(self):
# ... existing code ...
if health_info["health"] == "unhealthy":
if previous_state.get("health") != "unhealthy":
self.send_alert(health_info, "UNHEALTHY")
# Auto-restart after 3 consecutive failures
failure_count = self.failure_counts.get(container_name, 0) + 1
self.failure_counts[container_name] = failure_count
if failure_count >= 3:
print(f"Restarting {container_name}...")
container.restart()
self.failure_counts[container_name] = 0
5. Dashboard Integration
When you need to explain the system to stakeholders, a dashboard powered by the existing JSON snapshot works wonders.
import json
from flask import Flask, render_template, jsonify
app = Flask(__name__)
@app.route('/')
def dashboard():
with open('logs/health_status.json') as f:
data = json.load(f)
return render_template('dashboard.html', containers=data['containers'])
@app.route('/api/health')
def api_health():
with open('logs/health_status.json') as f:
return jsonify(json.load(f))
Hardening Experiments (Optional)
If you decide to adapt this proof-of-concept beyond a lab environment, plan to harden it first:
1. Run as a System Service
Create a systemd service file /etc/systemd/system/docker-monitor.service:
[Unit]
Description=Docker Health Monitor
After=docker.service
Requires=docker.service
[Service]
Type=simple
User=monitor
WorkingDirectory=/opt/docker-health-monitor
ExecStart=/usr/bin/python3 /opt/docker-health-monitor/monitor.py --interval 30
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl enable docker-monitor
sudo systemctl start docker-monitor
Running as a service guarantees the monitor starts after Docker itself and automatically restarts if the script crashes—no forgotten terminals.
2. Log Rotation
Configure log rotation in /etc/logrotate.d/docker-monitor:
/opt/docker-health-monitor/logs/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0644 monitor monitor
}
Rotating the alert log keeps long-running monitors from filling disks while still preserving a month of history for investigations.
3. Monitoring Multiple Docker Hosts
For multi-host deployments, modify the monitor to connect to remote Docker daemons:
# Connect to remote Docker host
client = docker.DockerClient(base_url='tcp://192.168.1.100:2376')
Pointing the client at a TCP endpoint lets one monitor observe multiple Swarm or remote hosts. Just remember to secure TLS on that socket.
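Here is a sketch of a TLS-secured connection using the SDK’s TLSConfig helper; the certificate paths are placeholders for wherever your client certificates actually live:
import docker
from docker.tls import TLSConfig

# Placeholder paths: point these at the client cert/key and CA you generated
# for the remote Docker daemon.
tls_config = TLSConfig(
    client_cert=("/certs/cert.pem", "/certs/key.pem"),
    ca_cert="/certs/ca.pem",
    verify=True,
)
client = docker.DockerClient(base_url="tcp://192.168.1.100:2376", tls=tls_config)
print(client.ping())   # True if the TLS handshake and API call both succeed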
4. Resource Limits
The monitor is lightweight, but you can limit its resource usage:
# Add to docker-compose.yml for the monitor itself
deploy:
resources:
limits:
cpus: '0.5'
memory: 128M
Even though the script is lightweight, codifying limits prevents an accidental infinite loop from starving the very containers you’re watching.
Troubleshooting
Monitor Can’t Connect to Docker
Connectivity issues almost always mean the monitor process cannot talk to /var/run/docker.sock.
Error: Cannot connect to the Docker daemon
Solution: Ensure Docker is running and your user has permission:
sudo usermod -aG docker $USER
newgrp docker
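You can also confirm connectivity from Python before launching the monitor; the SDK’s ping() call raises a DockerException when the daemon socket is unreachable:
import docker
from docker.errors import DockerException

try:
    client = docker.from_env()
    client.ping()          # raises if /var/run/docker.sock cannot be reached
    print("Docker daemon reachable")
except DockerException as exc:
    print(f"Cannot reach Docker: {exc}")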
No Alerts Showing
If you never see alerts, resist the urge to add logging everywhere. Verify the data source first—the health check definitions.
Check:
1. Do your containers have health checks configured?
2. Is the check interval too long?
3. Are containers actually failing?
Debug:
# Check container health manually
docker inspect --format='{{.State.Health.Status}}' container_name
High CPU Usage
High CPU usually means the monitor is checking too frequently for the size of your fleet.
If the monitor uses too much CPU:
- Increase the check interval: --interval 60
- Reduce the number of containers being monitored
- Check whether container.reload() is being called more often than necessary
Conclusion
You now have a complete proof-of-concept Docker health monitoring system that:
✓ Monitors all containers in real time
✓ Sends instant alerts when containers fail
✓ Logs everything for later analysis
✓ Runs entirely locally with no dependencies
✓ Can be extended with emails, webhooks, and more
The best part? It’s open source, runs entirely on your own infrastructure, and stays simple enough to understand every moving part before you consider hardening it further.
Next Steps
- Deploy the monitor to other lab environments (Docker Desktop, remote dev boxes, etc.)
- Integrate with lightweight alerting channels such as Slack webhooks or email relays
- Customize the alert logic and thresholds to match the failure modes you care about most
- Build a web dashboard on top of health_status.json for faster demos
- Experiment with automated remediation or self-healing flows in a safe sandbox
Full Code Repository
All the code from this tutorial is available on GitHub:
git clone https://github.com/Adam-the-Automator/docker-health-monitor
cd docker-health-monitor
pip install -r requirements.txt
docker-compose up -d --build
python monitor.py --interval 5
Have questions or improvements? Open an issue or submit a pull request!