Sensor Monitoring System
A comprehensive monitoring and alerting solution for containerized sensor applications with automated fault detection and response capabilities.
Overview
This system monitors a sensor binary that outputs three values per second:
- Sine-like value: Continuous floating-point signal
- Random integer: Random numerical data
- Counter: Incrementing sequence number
The system detects fault states (output: 0 0 0) and implements automated responses to unexpected conditions.
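The parsing step implied above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual parser: it assumes each line holds three whitespace-separated fields in the order listed (sine-like value, random integer, counter), and the names `Reading` and `parse_reading` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    sine: float    # continuous sine-like signal
    random: int    # random integer
    counter: int   # incrementing sequence number

    @property
    def is_fault(self) -> bool:
        # The fault state is the all-zero line "0 0 0".
        return self.sine == 0 and self.random == 0 and self.counter == 0


def parse_reading(line: str) -> Optional[Reading]:
    """Parse one sensor line; return None if it is not three numeric fields."""
    parts = line.split()
    if len(parts) != 3:
        return None
    try:
        # Assumed field order: sine-like value, random integer, counter.
        return Reading(float(parts[0]), int(parts[1]), int(parts[2]))
    except ValueError:
        return None
```

Returning None for malformed lines lets a caller skip log noise without treating it as a fault.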
Architecture
Core Components
- Sensor Application: Containerized binary producing telemetry data
- Data Collection: Python exporter parsing sensor output into Prometheus metrics
- Time-Series Storage: Prometheus for metrics collection and querying
- Visualization: Grafana dashboards for real-time monitoring
- Alerting: AlertManager with custom alert rules for fault detection
- Automation: Alert handler service for automated incident response
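To make the Data Collection component more concrete, the sketch below renders one reading in the Prometheus text exposition format using only the standard library. It is an illustration under assumptions: `sensor_sine_value` appears in the query example in this README, while the other metric names and the `render_metrics` function are hypothetical, and the real exporter may well use a client library instead.

```python
def render_metrics(sine: float, random_value: int, counter: int, faults: int) -> str:
    """Render the latest reading in the Prometheus text exposition format."""
    samples = [
        ("sensor_sine_value", "gauge", sine),
        ("sensor_random_value", "gauge", random_value),   # hypothetical name
        ("sensor_counter_value", "gauge", counter),       # hypothetical name
        ("sensor_faults_total", "counter", faults),       # hypothetical name
    ]
    lines = []
    for name, kind, value in samples:
        lines.append(f"# TYPE {name} {kind}")  # metric type declaration
        lines.append(f"{name} {value}")        # sample line: name, space, value
    return "\n".join(lines) + "\n"
```

Serving this text on `/metrics` is what lets Prometheus scrape the exporter at http://localhost:8080/metrics.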
Key Features
- Fault Detection: Real-time identification of 0 0 0 fault patterns
- Automated Recovery: Container restart and service healing capabilities
- Comprehensive Alerting: 4 alert rules covering realistic failure scenarios
- Dashboard Visualization: 9 panels showing metrics, trends, and system health
- Incident Logging: Automated logging of faults and recovery actions
Data Flow
- sensor-app container runs the sensor binary and writes to its own stdout
- Docker captures sensor-app's stdout and stores it as container logs
- sensor-exporter runs docker logs --follow sensor-app as a subprocess
- sensor-exporter reads the subprocess's stdout (which contains sensor-app's logs)
Visual Flow
┌─────────────┐   stdout    ┌──────────────┐   container    ┌─────────────┐
│   sensor    │────────────▶│  sensor-app  │────logs───────▶│   Docker    │
│   binary    │             │  container   │                │   daemon    │
└─────────────┘             └──────────────┘                └─────────────┘
                                                                   │
                                                                   │
┌─────────────┐   stdout    ┌──────────────┐   subprocess   ┌──────▼──────┐
│   sensor-   │◀────────────│ docker logs  │◀───────────────│   docker    │
│  exporter   │             │   --follow   │                │  logs API   │
│  (parser)   │             │  sensor-app  │                │             │
└─────────────┘             └──────────────┘                └─────────────┘
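The subprocess step in this flow could look roughly like the following. It is a sketch rather than the exporter's actual code: the command is the one named above, but the `follow_lines` helper and its parameters are hypothetical.

```python
import subprocess
from typing import Iterator, Sequence

# The command named in the data flow above.
FOLLOW_CMD = ("docker", "logs", "--follow", "sensor-app")


def follow_lines(cmd: Sequence[str] = FOLLOW_CMD) -> Iterator[str]:
    """Yield stdout lines from cmd, without the newline, until it exits."""
    with subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        text=True,   # decode bytes to str
        bufsize=1,   # line-buffered, so readings arrive as they are written
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")
```

A caller would iterate with `for line in follow_lines(): ...` and hand each line to the parser. Taking the command as a parameter keeps the function testable without a running Docker daemon.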
Quick Start
Prerequisites
- Docker and Docker Compose installed
- Ports 3000, 5001, 8080, 9090, 9093 available
Deployment
# Clone or extract the project
cd ea-test/
# Start the complete monitoring stack
docker compose up -d --build
# Verify all services are running
docker compose ps
Access Points
- Grafana Dashboard: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Sensor Metrics: http://localhost:8080/metrics
- Alert Handler: http://localhost:5001
Monitoring Features
Dashboard Panels
- Sensor Sine Value: Time-series plot of sine-wave signal
- Sensor Random Value: Random integer visualization
- Sensor Counter Value: Incremental counter tracking
- Fault Count: Total number of detected faults
- Activity Rates: Fault and reading rates per minute
- Sensor Status: Online/offline indicator
- Last Reading Time: Time since last data point
- Fault Percentage: Percentage of readings that are faults
- Total Readings: Cumulative reading count
Alert Rules
- SensorFaultDetected: Immediate alert on any 0 0 0 output
- SensorOffline: Alert when no data received for 30+ seconds
- SensorCounterStalled: Alert when counter stops incrementing
- SensorReadingStalled: Critical alert when no new readings arrive (exporter down)
Automated Responses
- Sensor Offline (sensor-app down): Automatically restart sensor-app container
- Reading Stalled (exporter down): Automatically restart the sensor-exporter service
- Fault Detection (0 0 0 readings): Log individual faults to /tmp/fault_incidents.log
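The mapping from alerts to responses can be sketched as a small dispatcher over an AlertManager webhook payload. This is an illustration under assumptions, not the contents of alert_handler.py: the `plan_responses` function and the `docker restart` commands are hypothetical, and the container name sensor-exporter is assumed to match the service name. The function only plans actions; it does not execute them.

```python
from typing import Dict, List

# Alert name -> command to run, following the list above.
# Assumption: container names match the service names used in this README.
RESPONSES: Dict[str, List[str]] = {
    "SensorOffline": ["docker", "restart", "sensor-app"],
    "SensorReadingStalled": ["docker", "restart", "sensor-exporter"],
}
FAULT_LOG = "/tmp/fault_incidents.log"


def plan_responses(payload: dict) -> List[dict]:
    """Map firing alerts in an AlertManager webhook payload to planned actions."""
    plans = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved notifications
        name = alert.get("labels", {}).get("alertname", "")
        if name in RESPONSES:
            plans.append({"alert": name, "action": "run", "command": RESPONSES[name]})
        elif name == "SensorFaultDetected":
            plans.append({"alert": name, "action": "log", "path": FAULT_LOG})
        # Other alerts (e.g. SensorCounterStalled) have no automated response listed.
    return plans
```

Separating planning from execution makes the routing logic easy to test without touching Docker.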
Management Commands
Service Management
# View service status
docker compose ps
# View logs
docker compose logs sensor-exporter
docker compose logs alert-handler
# Restart specific service
docker compose restart sensor-app
# Stop all services
docker compose down
# Update and restart
docker compose down && docker compose up -d --build
Monitoring Commands
# Check sensor-app metrics
curl http://localhost:8080/metrics | grep sensor_
# Test alert handler
curl http://localhost:5001/
# Query Prometheus
curl "http://localhost:9090/api/v1/query?query=sensor_sine_value"
# Check alert rules
curl "http://localhost:9090/api/v1/rules"
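The Prometheus query above can also be issued from Python. The sketch below uses only the standard library and assumes Prometheus is reachable at http://localhost:9090 as listed under Access Points; the helper names `extract_samples` and `instant_query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple

PROMETHEUS_URL = "http://localhost:9090"


def extract_samples(response: dict) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs from an instant-query response of type vector."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each sample is {"metric": {labels}, "value": [timestamp, "value-as-string"]}.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(expr: str, base_url: str = PROMETHEUS_URL) -> List[Tuple[Dict[str, str], float]]:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return extract_samples(json.load(resp))
```

For example, `instant_query("sensor_sine_value")` would return the current sine value with its labels once the stack is running.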
Debugging
# Sensor-app output sample
docker exec sensor-app timeout 5 /app/sensor
# View alert logs
docker exec alert-handler tail -f /tmp/alert_handler.log
# Check fault incidents
docker exec alert-handler cat /tmp/fault_incidents.log
Configuration
File Structure
├── bin/sensor                   # Sensor binary
├── docker-compose.yml           # Service orchestration
├── Dockerfile                   # Sensor container
├── Dockerfile.exporter          # Metrics exporter
├── Dockerfile.alert-handler     # Alert automation
├── sensor_exporter.py           # Metrics collection service
├── alert_handler.py             # Automated response system
├── config/
│   ├── prometheus.yml           # Metrics collection config
│   ├── alert_rules.yml          # Alerting rules
│   ├── alertmanager.yml         # Alert routing config
│   └── grafana/
│       ├── provisioning/        # Auto-provisioning
│       └── dashboards/          # Dashboard definitions
└── README.md                    # This documentation
Customization
- Alert Thresholds: Edit config/alert_rules.yml
- Dashboard Layout: Modify config/grafana/dashboards/sensor-dashboard.json
- Retention Period: Update Prometheus retention in docker-compose.yml
- Notification Channels: Configure AlertManager in config/alertmanager.yml
Troubleshooting
Common Issues
- Port Conflicts: Ensure ports 3000, 5001, 8080, 9090, 9093 are free
- Container Startup: Check docker compose logs <service> for errors
- Metrics Not Appearing: Verify that sensor-exporter is reading data (docker compose logs sensor-exporter) and that http://localhost:8080/metrics returns sensor_ metrics
- Alerts Not Firing: Check Prometheus rules at http://localhost:9090/rules
Health Checks
# Verify all endpoints
curl http://localhost:8080/health # Sensor exporter
curl http://localhost:3000/api/health # Grafana
curl http://localhost:5001/ # Alert handler
curl http://localhost:9090/-/healthy # Prometheus
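The same checks can be scripted. The following sketch probes the four endpoints listed above and reports which ones answer with HTTP 200; the function names and the injectable `fetch` parameter are hypothetical conveniences, and treating only status 200 as healthy is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Endpoints taken from the health-check commands above.
ENDPOINTS = {
    "sensor-exporter": "http://localhost:8080/health",
    "grafana": "http://localhost:3000/api/health",
    "alert-handler": "http://localhost:5001/",
    "prometheus": "http://localhost:9090/-/healthy",
}


def http_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, etc.


def check_all(endpoints: Dict[str, str] = ENDPOINTS,
              fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling `check_all()` with the stack running returns a dictionary such as service name to True or False, which is convenient for a quick smoke test after deployment.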
Recovery Procedures
- Complete System Reset: docker compose down -v && docker compose up -d --build
- Data Reset: Remove volumes to clear all stored data
- Service Recovery: Individual service restarts preserve other components
Technology Choices
- Prometheus: Industry standard for time-series metrics
- Grafana: Rich visualization and dashboarding capabilities
- Python: Robust ecosystem for data processing and automation
- Docker: Consistent deployment across environments
- AlertManager: Flexible routing and notification management