Running Docker containers in your lab or staging environment? You need to know when they fail—fast. In this tutorial, I’ll show you how to build a Python-based health monitoring system that tracks your Docker containers in real time and sends instant local alerts when something goes wrong.
Note: This is a proof-of-concept project meant for learning. Treat it as a starting point, not a production-ready solution.
No third-party services. No monthly fees. Just pure Python monitoring running on your own infrastructure.
The Problem: Silent Container Failures
Docker’s health checks are great for detecting problems, but they’re passive. When a container fails, Docker knows about it—but do you? Unless you’re constantly running docker ps, failed containers can go unnoticed for hours.
This is where automated monitoring comes in. We need a system that:
- Continuously monitors all Docker containers
- Detects health check failures immediately
- Alerts you when containers become unhealthy or stop
- Tracks recovery when containers return to healthy state
- Logs everything for later analysis
The Solution: A Python Monitoring Script
We’ll build a lightweight monitoring system using Python and the Docker SDK. The monitor will:
- Check container health status every few seconds
- Send colored console alerts with visual indicators
- Log all alerts to files for auditing
- Export current health states as JSON
- Track state changes to avoid alert fatigue
Best of all, it runs entirely locally—no external dependencies or cloud services required.
How Docker Health Checks Behave
Before wiring up code, it helps to understand what information Docker exposes. Every container reports one of four health states:
- starting – Docker is still running the health command inside the container
- healthy – the command returned 0 within the allotted retries
- unhealthy – the command failed repeatedly and Docker marked the container unhealthy
- none – no health check exists, so Docker cannot score it
You can inspect those details at any time:
docker inspect --format '{{json .State.Health}}' unhealthy-app | jq
That single command shows the latest probe output, timestamps, and exit codes. Our monitor simply automates this inspection cycle, adds color-coded context, and records every state transition so you always know why Docker changed its mind.
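If you prefer to stay in Python, the Docker SDK we install in the next step exposes the same data. Here is a minimal preview (a sketch; it assumes a container named unhealthy-app is running, as it will be once we build the test stack later):
import docker

client = docker.from_env()                          # connect to the local Docker daemon
container = client.containers.get("unhealthy-app")
container.reload()                                  # refresh cached state

health = container.attrs.get("State", {}).get("Health")
if health:
    print(health["Status"])                         # starting / healthy / unhealthy
    for probe in health.get("Log", [])[-1:]:        # most recent probe, if any
        print(probe["Output"])
else:
    print("No health check configured for this container")
Our monitor is essentially this loop, repeated on a schedule and wrapped with alerting and logging.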
Project Structure
Here’s what we’ll build:
docker-health-monitor/
├── monitor.py # Main monitoring script
├── requirements.txt # Python dependencies
├── docker-compose.yml # Test containers with health checks
├── test-apps/ # Sample applications for testing
│ ├── healthy-app.py # Always healthy Flask app
│ ├── unhealthy-app.py # Becomes unhealthy after 30s
│ ├── crashing-app.py # Crashes after 45s
│ ├── Dockerfile # Container image for test apps
│ └── requirements.txt # Flask dependencies
└── logs/ # Alert and health logs (created automatically)
Prerequisites
Before we begin, make sure you have:
- Python 3.7 or higher installed
- Docker Engine running
- Docker Compose (optional, for testing)
- Basic understanding of Docker containers and health checks
Step 1: Setting Up Dependencies
We will mirror the workflow used in ATA tutorials: outline prerequisites, explain why each tool matters, then execute the commands. The monitor needs only two dependencies, so start by staging a clean working folder:
mkdir docker-health-monitor
cd docker-health-monitor
Create a requirements.txt file with the exact versions the project already uses:
docker==7.1.0
colorama==0.4.6
Why these packages?
– docker==7.1.0: Gives Python first-class access to the Docker Engine API, so we can ask for health data without shelling out to docker ps.
– colorama==0.4.6: Normalizes ANSI colors on macOS, Linux, and Windows so the console alerts you see later look identical everywhere.
Installing Dependencies
For modern Python installations (Python 3.11+ via Homebrew on macOS), you’ll need to use a virtual environment:
# Create a virtual environment
python3 -m venv venv
# Activate it
source venv/bin/activate # On macOS/Linux
# OR
venv\Scripts\activate # On Windows
# Install dependencies
pip install -r requirements.txt
Why a virtual environment? Modern Python installations (especially the Homebrew build on macOS) block global pip installs. The venv folder keeps the Docker SDK isolated from the rest of your machine and makes it trivial to pin versions in CI.
For older Python or system package managers:
pip install -r requirements.txt
Verify installation:
python -c "import docker; import colorama; print('Dependencies installed successfully!')"
Step 2: Building the Monitor
Create monitor.py with the following implementation:
The DockerHealthMonitor Class
Rather than scattering helper functions everywhere, the monitor sticks to a “single action, single object” principle: one class owns the Docker client, log destinations, and state-tracking dictionary, so every method can focus on a single responsibility.
#!/usr/bin/env python3
import docker
import time
import json
from datetime import datetime
from colorama import init, Fore, Style, Back
from pathlib import Path
# Initialize colorama for cross-platform colored output
init(autoreset=True)
class DockerHealthMonitor:
"""Monitor Docker container health and send local notifications."""
def __init__(self, check_interval=10, log_dir="logs"):
"""Initialize the monitor."""
self.client = docker.from_env()
self.check_interval = check_interval
self.log_dir = Path(log_dir)
self.log_dir.mkdir(exist_ok=True)
self.container_states = {}
# Log files
self.alert_log = self.log_dir / "alerts.log"
self.health_log = self.log_dir / "health_status.json"
Key takeaways
– The constructor does all environment setup—loading the Docker client, ensuring the log directory exists, and initializing the in-memory container_states cache—so downstream methods can assume those prerequisites are in place.
– Log file paths are computed once. Every other method simply writes to self.alert_log or self.health_log, keeping side-effects predictable.
Getting Container Health Status
Before we can make decisions, we need to normalize Docker’s raw JSON into a predictable structure. get_container_health is the single ingestion point:
def get_container_health(self, container):
"""Get the health status of a container."""
container.reload() # Refresh container state from Docker
health_info = {
"name": container.name,
"id": container.short_id,
"status": container.status,
"health": "none",
"timestamp": datetime.now().isoformat()
}
# Check if container has health check configured
if container.attrs.get("State", {}).get("Health"):
health_status = container.attrs["State"]["Health"]["Status"]
health_info["health"] = health_status
# Get the last health check log
health_logs = container.attrs["State"]["Health"].get("Log", [])
if health_logs:
last_log = health_logs[-1]
health_info["last_check_output"] = last_log.get("Output", "")
health_info["last_check_exit_code"] = last_log.get("ExitCode", 0)
return health_info
Notice the method does not try to interpret the data. Its sole purpose is to collect: reload the container, capture metadata, and include the last probe output when available. That makes the downstream alerting logic dead simple—every other method receives a consistent dict.
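To see what that dict looks like in practice, you can call the method from an interactive Python session (illustrative only; the names and values will match whatever containers you happen to be running):
monitor = DockerHealthMonitor(check_interval=5)
for container in monitor.client.containers.list(all=True):
    info = monitor.get_container_health(container)
    print(f"{info['name']}: status={info['status']}, health={info['health']}")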
Sending Alerts
When something goes wrong, we need to know immediately:
def send_alert(self, container_info, alert_type="UNHEALTHY"):
"""Send a local alert about container health issue."""
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
container_name = container_info["name"]
# Color-coded console output
if alert_type == "UNHEALTHY":
print(f"{Fore.RED}{Back.WHITE}{Style.BRIGHT} ⚠ ALERT {Style.RESET_ALL} "
f"{Fore.RED}{container_name}{Style.RESET_ALL} is {Fore.RED}UNHEALTHY{Style.RESET_ALL}")
elif alert_type == "STOPPED":
print(f"{Fore.YELLOW}{Back.BLACK}{Style.BRIGHT} ⚠ ALERT {Style.RESET_ALL} "
f"{Fore.YELLOW}{container_name}{Style.RESET_ALL} has {Fore.YELLOW}STOPPED{Style.RESET_ALL}")
elif alert_type == "RECOVERED":
print(f"{Fore.GREEN}{Back.WHITE}{Style.BRIGHT} ✓ RECOVERED {Style.RESET_ALL} "
f"{Fore.GREEN}{container_name}{Style.RESET_ALL} is now {Fore.GREEN}HEALTHY{Style.RESET_ALL}")
# Log to file
with open(self.alert_log, "a") as f:
f.write(f"[{timestamp}] {alert_type}: {container_name}\n")
f.write(f" Details: {json.dumps(container_info, indent=2)}\n")
f.write("-" * 80 + "\n")
# Create notification file for external monitoring
notification_file = self.log_dir / "latest_alert.txt"
with open(notification_file, "w") as f:
f.write(f"{alert_type}: {container_name}\n")
f.write(f"Timestamp: {datetime.now().isoformat()}\n")
Alert features:
– Visual indicators: Colors and symbols (⚠, ✓) for quick scanning
– Multiple outputs: Console, log file, and latest alert file
– Detailed logging: Full container info in JSON format
– External integration: The latest_alert.txt file can be monitored by other tools
Because the method handles every output channel in one spot, you can bolt on additional notifiers (email, Slack, etc.) without littering the rest of the codebase with branching logic. Teaching-oriented code should read top-to-bottom like a story, and send_alert is the chapter where we narrate exactly what happened.
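If you later want more than one channel, one low-friction pattern (a sketch, not part of the script above) is to keep a list of notifier callbacks and have send_alert fan each alert out to all of them:
# Hypothetical helper: fan one alert out to any number of extra channels.
# Each notifier is a callable taking (container_info, alert_type).
def fan_out(notifiers, container_info, alert_type):
    for notify in notifiers:
        try:
            notify(container_info, alert_type)
        except Exception as exc:
            # A broken notifier (bad webhook URL, SMTP outage) must not stop the monitor.
            print(f"Notifier failed: {exc}")
send_alert would call fan_out(self.notifiers, container_info, alert_type) as its last step, and the email and webhook examples later in this article would slot straight into that list.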
Checking All Containers
The main monitoring loop checks all containers and detects state changes:
def check_containers(self):
"""Check all containers and detect health issues."""
try:
containers = self.client.containers.list(all=True)
if not containers:
print(f"{Fore.CYAN}No containers found to monitor.{Style.RESET_ALL}")
return
current_states = {}
for container in containers:
health_info = self.get_container_health(container)
container_name = health_info["name"]
current_states[container_name] = health_info
# Get previous state
previous_state = self.container_states.get(container_name, {})
# Detect state changes and issues
if health_info["health"] == "unhealthy":
if previous_state.get("health") != "unhealthy":
self.send_alert(health_info, "UNHEALTHY")
elif health_info["status"] in ["exited", "dead", "stopped"]:
if previous_state.get("status") not in ["exited", "dead", "stopped"]:
self.send_alert(health_info, "STOPPED")
elif health_info["health"] == "healthy":
# Check if recovered from unhealthy state
if previous_state.get("health") == "unhealthy":
self.send_alert(health_info, "RECOVERED")
# Update container states
self.container_states = current_states
# Save current state to JSON file
with open(self.health_log, "w") as f:
json.dump({
"timestamp": datetime.now().isoformat(),
"containers": current_states
}, f, indent=2)
except docker.errors.DockerException as e:
print(f"{Fore.RED}Docker error: {e}{Style.RESET_ALL}")
State tracking logic:
– Only alerts on state changes to avoid spam
– Tracks three alert types: UNHEALTHY, STOPPED, and RECOVERED
– Saves a JSON snapshot of all container states
– Handles Docker API errors gracefully
Think of container_states as a running ledger. Each pass through check_containers fetches the latest facts, compares them to the last snapshot, and emits alerts only when the story changes. That mirrors how we teach in ATA tutorials: measure → compare → explain. Instead of spamming the terminal every five seconds, we wait until a container crosses a meaningful boundary, then capture the full context for posterity.
Display Status Summary
A clean summary table shows the current state:
def print_status_summary(self):
"""Print a summary of all container statuses."""
if not self.container_states:
return
print(f"\n{Fore.CYAN}{Style.BRIGHT}{'='*80}{Style.RESET_ALL}")
print(f"{Fore.CYAN}{Style.BRIGHT}Container Health Status Summary{Style.RESET_ALL}")
print(f"{Fore.CYAN}{Style.BRIGHT}{'='*80}{Style.RESET_ALL}")
for name, info in self.container_states.items():
status = info["status"]
health = info["health"]
# Color code based on health
if health == "healthy":
status_color = Fore.GREEN
symbol = "✓"
elif health == "unhealthy":
status_color = Fore.RED
symbol = "✗"
elif health == "starting":
status_color = Fore.YELLOW
symbol = "⟳"
else:
status_color = Fore.WHITE
symbol = "○"
print(f"{symbol} {status_color}{name:30}{Style.RESET_ALL} "
f"Status: {status_color}{status:10}{Style.RESET_ALL} "
f"Health: {status_color}{health}{Style.RESET_ALL}")
print(f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}\n")
Visual design:
– ✓ = Healthy (green)
– ✗ = Unhealthy (red)
– ⟳ = Starting (yellow)
– ○ = No health check (white)
These glyphs match the mental model we established earlier when reviewing Docker’s health states. When the monitor prints a table, you immediately see which containers are recovering, starting, or silently running without a health check configured. Teaching through visuals is a core ATA pattern—show the outcome, then dissect what caused it.
The Main Loop
Finally, tie it all together with the monitoring loop:
def run(self):
"""Run the monitoring loop."""
print(f"{Fore.GREEN}{Style.BRIGHT}Starting Docker Health Monitor...{Style.RESET_ALL}")
print(f"{Fore.CYAN}Check interval: {self.check_interval} seconds{Style.RESET_ALL}")
print(f"{Fore.CYAN}Log directory: {self.log_dir.absolute()}{Style.RESET_ALL}\n")
try:
while True:
self.check_containers()
self.print_status_summary()
time.sleep(self.check_interval)
except KeyboardInterrupt:
print(f"\n{Fore.YELLOW}Monitoring stopped by user.{Style.RESET_ALL}")
except Exception as e:
print(f"{Fore.RED}Unexpected error: {e}{Style.RESET_ALL}")
raise
Command-Line Interface
Add argument parsing for flexibility:
def main():
"""Main entry point."""
import argparse
parser = argparse.ArgumentParser(
description="Monitor Docker container health and send local alerts"
)
parser.add_argument(
"--interval",
type=int,
default=10,
help="Check interval in seconds (default: 10)"
)
parser.add_argument(
"--log-dir",
type=str,
default="logs",
help="Directory for log files (default: logs)"
)
args = parser.parse_args()
monitor = DockerHealthMonitor(
check_interval=args.interval,
log_dir=args.log_dir
)
monitor.run()
if __name__ == "__main__":
main()
Step 3: Creating Test Containers
To properly test our monitor, we need containers with different behaviors. Let’s create three Flask applications that demonstrate different failure scenarios.
Always Healthy App
Create test-apps/healthy-app.py:
#!/usr/bin/env python3
from flask import Flask, jsonify
import os
app = Flask(__name__)
@app.route('/')
def index():
return jsonify({
'status': 'running',
'app': 'healthy-app',
'message': 'I am healthy!'
})
@app.route('/health')
def health():
"""Health check endpoint - always returns healthy."""
return jsonify({'status': 'healthy'}), 200
if __name__ == '__main__':
port = int(os.environ.get('PORT', 5000))
app.run(host='0.0.0.0', port=port)
This helper is intentionally boring. It gives you a known-good reference so that, when the other containers begin to fail, you can confirm the monitor continues to report at least one healthy service. In the console summary you’ll always see a green ✓ healthy-app.
Becomes Unhealthy App
Create test-apps/unhealthy-app.py:
#!/usr/bin/env python3
from flask import Flask, jsonify
import os
import time
app = Flask(__name__)
START_TIME = time.time()
UNHEALTHY_AFTER = int(os.environ.get('UNHEALTHY_AFTER', 30))
@app.route('/')
def index():
uptime = int(time.time() - START_TIME)
is_healthy = uptime < UNHEALTHY_AFTER
return jsonify({
'status': 'running',
'app': 'unhealthy-app',
'uptime_seconds': uptime,
'healthy': is_healthy,
'message': f'Will become unhealthy after {UNHEALTHY_AFTER}s'
})
@app.route('/health')
def health():
"""Health check endpoint - becomes unhealthy after UNHEALTHY_AFTER seconds."""
uptime = int(time.time() - START_TIME)
if uptime < UNHEALTHY_AFTER:
return jsonify({
'status': 'healthy',
'uptime': uptime,
'message': f'Healthy (will fail in {UNHEALTHY_AFTER - uptime}s)'
}), 200
else:
return jsonify({
'status': 'unhealthy',
'uptime': uptime,
'message': 'Health check failed!'
}), 503 # Service Unavailable
if __name__ == '__main__':
port = int(os.environ.get('PORT', 5001))
app.run(host='0.0.0.0', port=port)
This application models the “slow burn” failure you observe with memory leaks or broken dependencies. Because it advertises its future failure inside the JSON payload, you can correlate the container’s own telemetry (uptime, message) with the alert the monitor sends at the 30-second mark.
Crashing App
Create test-apps/crashing-app.py:
#!/usr/bin/env python3
from flask import Flask, jsonify
import os
import time
import sys
app = Flask(__name__)
START_TIME = time.time()
CRASH_AFTER = int(os.environ.get('CRASH_AFTER', 45))
@app.route('/')
def index():
uptime = int(time.time() - START_TIME)
if uptime >= CRASH_AFTER:
print(f"Uptime {uptime}s exceeded {CRASH_AFTER}s - CRASHING!", flush=True)
sys.exit(1)
return jsonify({
'status': 'running',
'app': 'crashing-app',
'uptime_seconds': uptime,
'message': f'Will crash after {CRASH_AFTER}s'
})
@app.route('/health')
def health():
"""Health check endpoint."""
uptime = int(time.time() - START_TIME)
if uptime >= CRASH_AFTER:
print(f"Uptime {uptime}s exceeded {CRASH_AFTER}s - CRASHING!", flush=True)
sys.exit(1)
return jsonify({
'status': 'healthy',
'uptime': uptime,
'message': f'Healthy (will crash in {CRASH_AFTER - uptime}s)'
}), 200
if __name__ == '__main__':
port = int(os.environ.get('PORT', 5002))
app.run(host='0.0.0.0', port=port)
Use this container to demonstrate why the monitor inspects both .State.Health and the high-level status field. After roughly 45 seconds the Flask process stops responding to the health probe, Docker marks the container unhealthy, and curl surfaces the error as Empty reply from server. The container technically keeps running, which makes this a great example of a “gray failure” that only a health-aware monitor can expose.
Test Apps Requirements
Create test-apps/requirements.txt:
Flask==3.0.0
Werkzeug==3.0.1
What these do:
– Flask==3.0.0: Lightweight web framework for creating HTTP endpoints and health check routes
– Werkzeug==3.0.1: WSGI utility library that Flask depends on for request/response handling
These dependencies will be installed inside the Docker containers automatically during the build process, so you don’t need to install them locally.
Dockerfile for Test Apps
Create test-apps/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application files
COPY *.py .
# Install curl for health checks
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
CMD ["python", "healthy-app.py"]
Step 4: Docker Compose Configuration
Create docker-compose.yml to orchestrate our test containers:
version: '3.8'
services:
# Container 1: Always healthy
healthy-app:
build: ./test-apps
command: python healthy-app.py
ports:
- "5000:5000"
environment:
- PORT=5000
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
interval: 5s
timeout: 3s
retries: 3
start_period: 5s
container_name: healthy-app
# Container 2: Becomes unhealthy after 30 seconds
unhealthy-app:
build: ./test-apps
command: python unhealthy-app.py
ports:
- "5001:5001"
environment:
- PORT=5001
- UNHEALTHY_AFTER=30
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:5001/health"]
interval: 5s
timeout: 3s
retries: 2
start_period: 5s
container_name: unhealthy-app
# Container 3: Crashes after 45 seconds
crashing-app:
build: ./test-apps
command: python crashing-app.py
ports:
- "5002:5002"
environment:
- PORT=5002
- CRASH_AFTER=45
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:5002/health"]
interval: 5s
timeout: 3s
retries: 2
start_period: 5s
container_name: crashing-app
restart: "no" # Don't auto-restart
# Container 4: Basic nginx (always healthy)
web-server:
image: nginx:alpine
ports:
- "8080:80"
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:80"]
interval: 5s
timeout: 3s
retries: 3
start_period: 5s
container_name: web-server
Health check configuration explained:
- test: Command to run for the health check (curl or wget)
- interval: How often to run the check
- timeout: Maximum time to wait for the check to complete
- retries: Number of consecutive failures before marking the container unhealthy
- start_period: Grace period during container startup
Every service reuses the same image but overrides the command and environment variables so you can demonstrate multiple failure stories without maintaining separate Dockerfiles. When you glance at this compose file later, you’ll immediately know which container is expected to fail first and why.
Step 5: Running the Monitor
Now let’s see it all in action!
Start the Test Containers
docker-compose up -d --build
This builds the images and starts all four containers in detached mode.
Verify Containers Are Running
docker ps
You should see all four containers running.
Start the Monitor
In a separate terminal:
python monitor.py --interval 5
This starts monitoring with a 5-second check interval.
Watch the Magic Happen
Initial output (0-10 seconds):
Starting Docker Health Monitor...
Check interval: 5 seconds
Log directory: /home/user/docker-health-monitor/logs
================================================================================
Container Health Status Summary
================================================================================
⟳ healthy-app Status: running Health: starting
⟳ unhealthy-app Status: running Health: starting
⟳ crashing-app Status: running Health: starting
✓ web-server Status: running Health: healthy
================================================================================
The yellow ⟳ icons indicate Docker is still within each service’s start_period. Seeing yellow immediately after startup is normal; the monitor only escalates if a container stays yellow or slides into red.
After containers stabilize (10-30 seconds):
================================================================================
Container Health Status Summary
================================================================================
✓ healthy-app Status: running Health: healthy
✓ unhealthy-app Status: running Health: healthy
✓ crashing-app Status: running Health: healthy
✓ web-server Status: running Health: healthy
================================================================================
Everything is green once the grace period ends. Capture a quick screenshot here; it becomes the reference point you compare against when investigating later alerts.
At 30 seconds – First alert!:
⚠ ALERT unhealthy-app is UNHEALTHY
================================================================================
Container Health Status Summary
================================================================================
✓ healthy-app Status: running Health: healthy
✗ unhealthy-app Status: running Health: unhealthy
✓ crashing-app Status: running Health: healthy
✓ web-server Status: running Health: healthy
================================================================================
The monitor prints a red ⚠ alert only once, then leaves unhealthy-app marked with a red ✗ in the status table. At this point you can jump into logs/alerts.log to read the failed probe output (HTTP 503) and explain the failure just like you would during a real incident review.
At 45 seconds – Second alert!:
⚠ ALERT crashing-app is UNHEALTHY
================================================================================
Container Health Status Summary
================================================================================
✓ healthy-app Status: running Health: healthy
✗ unhealthy-app Status: running Health: unhealthy
✗ crashing-app Status: running Health: unhealthy
✓ web-server Status: running Health: healthy
================================================================================
Here the “crashing” container never fully stops; instead, the Flask process keeps running but closes every health-check connection with curl: (52) Empty reply from server. Docker marks the container unhealthy, and our monitor flags it with the red ✗ so you still get a high-signal alert even though docker ps reports the container as running.
Step 6: Understanding the Logs
The monitor creates three log files in the logs/ directory:
alerts.log – Detailed Alert History
[2025-11-10 17:15:30] UNHEALTHY: unhealthy-app
Details: {
"name": "unhealthy-app",
"id": "a1b2c3d4",
"status": "running",
"health": "unhealthy",
"timestamp": "2025-11-10T17:15:30.123456",
"last_check_output": "Health check failed: HTTP 503\n",
"last_check_exit_code": 1
}
--------------------------------------------------------------------------------
Use tail -f logs/alerts.log during demos to narrate what changed. The JSON payload preserves last_check_output and last_check_exit_code, so you can copy those values directly into your incident notes (for example, curl: (22) means the probe received an HTTP error status such as the 503 here, while curl: (52) indicates the connection was severed, which is exactly what crashing-app does).
health_status.json – Current State Snapshot
{
"timestamp": "2025-11-10T17:15:50.000000",
"containers": {
"healthy-app": {
"name": "healthy-app",
"id": "i9j0k1l2",
"status": "running",
"health": "healthy",
"timestamp": "2025-11-10T17:15:50.000000"
},
"unhealthy-app": {
"name": "unhealthy-app",
"id": "a1b2c3d4",
"status": "running",
"health": "unhealthy",
"timestamp": "2025-11-10T17:15:50.000000",
"last_check_output": "Health check failed: HTTP 503\n",
"last_check_exit_code": 1
}
}
}
Pipe this file through jq '.containers["unhealthy-app"]' to see exactly what the monitor knows about each container. Because every run overwrites the JSON snapshot, you always have a point-in-time truth source to feed dashboards or custom scripts.
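If you would rather consume the snapshot from Python than jq, a few lines of standard-library code are enough (a sketch; the relative path assumes you run it from the project root):
import json
from pathlib import Path

# List the containers the latest snapshot marks as unhealthy.
snapshot = json.loads(Path("logs/health_status.json").read_text())
unhealthy = [name for name, info in snapshot["containers"].items()
             if info["health"] == "unhealthy"]
print(f"Snapshot at {snapshot['timestamp']}: unhealthy containers: {unhealthy or 'none'}")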
latest_alert.txt – Most Recent Alert
STOPPED: crashing-app
Timestamp: 2025-11-10T17:15:45.789012
This tiny file is intentionally simple; many teams watch it with inotifywait or fswatch to trigger follow-up actions without parsing the full log.
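If neither inotifywait nor fswatch is available, a portable fallback is a small Python poller (a sketch, separate from the monitor itself):
import time
from pathlib import Path

# React whenever latest_alert.txt changes on disk.
alert_file = Path("logs/latest_alert.txt")
last_mtime = 0.0
while True:
    if alert_file.exists():
        mtime = alert_file.stat().st_mtime
        if mtime != last_mtime:
            last_mtime = mtime
            print("New alert:", alert_file.read_text().splitlines()[0])
            # trigger your follow-up action here (restart, page someone, etc.)
    time.sleep(2)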
Step 7: Cleaning Up
When you’re done testing:
# Stop the monitor with Ctrl+C
# Stop and remove containers
docker-compose down
# Remove volumes (optional)
docker-compose down -v
Extending the Monitor
Now that you have a working monitor, treat each enhancement as another story to teach. Pick a single axis—notification channel, remediation, telemetry—and walk through it end-to-end before adding the next.
1. Email Notifications
Use plain SMTP first so you can demo the behavior with a local MailHog or Postfix instance before wiring in SaaS providers.
import smtplib
from email.mime.text import MIMEText
def send_email_alert(self, container_info, alert_type):
msg = MIMEText(f"Container {container_info['name']} is {alert_type}")
msg['Subject'] = f"Docker Alert: {alert_type}"
msg['From'] = '[email protected]'
msg['To'] = '[email protected]'
with smtplib.SMTP('localhost') as server:
server.send_message(msg)
2. Webhook Notifications
Most teams already centralize alerts in Slack, Teams, or a custom webhook collector. Posting the same JSON payload you log locally keeps implementation friction low.
import requests
from datetime import datetime
def send_webhook_alert(self, container_info, alert_type):
payload = {
'alert_type': alert_type,
'container': container_info,
'timestamp': datetime.now().isoformat()
}
requests.post('https://your-webhook-url.com/alert', json=payload)
3. Metrics Export
Prometheus scrapes simple HTTP endpoints, so give it a gauge per container and reuse the state you already collected.
from prometheus_client import start_http_server, Gauge
# Create metrics
container_health = Gauge('container_health', 'Container health status', ['container_name'])
# In check_containers():
for name, info in current_states.items():
health_value = 1 if info['health'] == 'healthy' else 0
container_health.labels(container_name=name).set(health_value)
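One detail the snippet leaves implicit: the metrics endpoint has to be started exactly once, for example near the top of run(). A minimal sketch (the port number is arbitrary):
# Expose the gauges over HTTP once, before entering the monitoring loop.
start_http_server(8000)   # Prometheus can now scrape http://<host>:8000/metrics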
4. Automated Remediation
Only add remediation once you can prove the alerts are trustworthy. The snippet below restarts a container after three consecutive unhealthy checks—no guesswork, just recorded failures. (It assumes you also add self.failure_counts = {} to __init__ so the counter persists between passes.)
def check_containers(self):
# ... existing code ...
if health_info["health"] == "unhealthy":
if previous_state.get("health") != "unhealthy":
self.send_alert(health_info, "UNHEALTHY")
# Auto-restart after 3 consecutive failures
failure_count = self.failure_counts.get(container_name, 0) + 1
self.failure_counts[container_name] = failure_count
if failure_count >= 3:
print(f"Restarting {container_name}...")
container.restart()
self.failure_counts[container_name] = 0
5. Dashboard Integration
When you need to explain the system to stakeholders, a dashboard powered by the existing JSON snapshot works wonders.
import json
from flask import Flask, render_template, jsonify
app = Flask(__name__)
@app.route('/')
def dashboard():
with open('logs/health_status.json') as f:
data = json.load(f)
return render_template('dashboard.html', containers=data['containers'])
@app.route('/api/health')
def api_health():
with open('logs/health_status.json') as f:
return jsonify(json.load(f))
Hardening Experiments (Optional)
If you decide to adapt this proof-of-concept beyond a lab environment, plan to harden it first:
1. Run as a System Service
Create a systemd service file /etc/systemd/system/docker-monitor.service:
[Unit]
Description=Docker Health Monitor
After=docker.service
Requires=docker.service
[Service]
Type=simple
User=monitor
WorkingDirectory=/opt/docker-health-monitor
ExecStart=/usr/bin/python3 /opt/docker-health-monitor/monitor.py --interval 30
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl enable docker-monitor
sudo systemctl start docker-monitor
Running as a service guarantees the monitor starts after Docker itself and automatically restarts if the script crashes—no forgotten terminals.
2. Log Rotation
Configure log rotation in /etc/logrotate.d/docker-monitor:
/opt/docker-health-monitor/logs/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0644 monitor monitor
}
Rotating the alert log keeps long-running monitors from filling disks while still preserving a month of history for investigations.
3. Monitoring Multiple Docker Hosts
For multi-host deployments, modify the monitor to connect to remote Docker daemons:
# Connect to remote Docker host
client = docker.DockerClient(base_url='tcp://192.168.1.100:2376')
Pointing the client at a TCP endpoint lets one monitor observe multiple Swarm or remote hosts. Just remember to secure TLS on that socket.
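Here is a sketch of a TLS-secured connection using the SDK’s TLSConfig helper; the certificate paths are placeholders for wherever your client certificates actually live:
import docker
from docker.tls import TLSConfig

# Placeholder paths: point these at the client cert/key and CA you generated
# for the remote Docker daemon.
tls_config = TLSConfig(
    client_cert=("/certs/cert.pem", "/certs/key.pem"),
    ca_cert="/certs/ca.pem",
    verify=True,
)
client = docker.DockerClient(base_url="tcp://192.168.1.100:2376", tls=tls_config)
print(client.ping())   # True if the TLS handshake and API call both succeed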
4. Resource Limits
The monitor is lightweight, but you can limit its resource usage:
# Add to docker-compose.yml for the monitor itself
deploy:
resources:
limits:
cpus: '0.5'
memory: 128M
Even though the script is lightweight, codifying limits prevents an accidental infinite loop from starving the very containers you’re watching.
Troubleshooting
Monitor Can’t Connect to Docker
Connectivity issues almost always mean the monitor process cannot talk to /var/run/docker.sock.
Error: Cannot connect to the Docker daemon
Solution: Ensure Docker is running and your user has permission:
sudo usermod -aG docker $USER
newgrp docker
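You can also confirm connectivity from Python before launching the monitor; the SDK’s ping() call raises a DockerException when the daemon socket is unreachable:
import docker
from docker.errors import DockerException

try:
    client = docker.from_env()
    client.ping()          # raises if /var/run/docker.sock cannot be reached
    print("Docker daemon reachable")
except DockerException as exc:
    print(f"Cannot reach Docker: {exc}")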
No Alerts Showing
If you never see alerts, resist the urge to add logging everywhere. Verify the data source first—the health check definitions.
Check:
1. Do your containers have health checks configured?
2. Is the check interval too long?
3. Are containers actually failing?
Debug:
# Check container health manually
docker inspect --format='{{.State.Health.Status}}' container_name
High CPU Usage
High CPU usually means the monitor is checking too frequently for the size of your fleet.
If the monitor uses too much CPU:
- Increase the check interval: --interval 60
- Reduce the number of containers being monitored
- Check whether container.reload() is being called more often than necessary
Conclusion
You now have a complete proof-of-concept Docker health monitoring system that:
✓ Monitors all containers in real time
✓ Sends instant alerts when containers fail
✓ Logs everything for later analysis
✓ Runs entirely locally with no dependencies
✓ Can be extended with emails, webhooks, and more
The best part? It’s open source, runs entirely on your own infrastructure, and stays simple enough to understand every moving part before you consider hardening it further.
Next Steps
- Deploy the monitor to other lab environments (Docker Desktop, remote dev boxes, etc.)
- Integrate with lightweight alerting channels such as Slack webhooks or email relays
- Customize the alert logic and thresholds to match the failure modes you care about most
- Build a web dashboard on top of health_status.json for faster demos
- Experiment with automated remediation or self-healing flows in a safe sandbox
Full Code Repository
All the code from this tutorial is available on GitHub:
git clone https://github.com/Adam-the-Automator/docker-health-monitor
cd docker-health-monitor
pip install -r requirements.txt
docker-compose up -d --build
python monitor.py --interval 5
Have questions or improvements? Open an issue or submit a pull request!