# System Monitoring: Prometheus and Grafana for Server and Application Health

## Why Monitoring Matters Before Anyone Complains
A factory sensor stops sending data. A database reaches 95% disk capacity. An application response time doubles. Without monitoring, you discover these problems when a machine operator calls or the production line halts.
Monitoring gives you visibility before failures become emergencies. It means detecting a failing sensor before it causes production defects, seeing database growth trends weeks before the disk fills up, and catching memory leaks during development rather than during the night shift.
## Prometheus: Collecting Metrics
Prometheus is a time-series database that scrapes metrics from your applications at regular intervals.
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus-data:
```
### Configuration
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s

# Load the alert rules defined in the Alertmanager section
rule_files:
  - alert-rules.yml

# Forward firing alerts to the Alertmanager service
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'factory-monitor'
    static_configs:
      - targets: ['factory-app:8080']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```
## Grafana: Beautiful Live Dashboards
Grafana turns raw metrics into visual dashboards that anyone on the factory floor can understand.
```yaml
# Add under the services: section alongside prometheus
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      # Default admin password -- change this before exposing Grafana to anyone
      - GF_SECURITY_ADMIN_PASSWORD=monitor2025
    restart: unless-stopped
```
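Grafana can auto-configure its Prometheus data source from the mounted provisioning directory, so nobody has to click through the UI after a fresh deployment. A minimal data-source file might look like this (the path `grafana/provisioning/datasources/prometheus.yml` follows Grafana's provisioning layout; the `name` is arbitrary):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Service name from the compose file, resolved on the Docker network
    url: http://prometheus:9090
    isDefault: true
```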
## Useful PromQL Queries

| Panel | Query |
|---|---|
| CPU Usage (%) | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` |
| Memory Used (bytes) | `node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes` |
| HTTP Request Rate (req/s) | `rate(http_requests_total[5m])` |
| Response Latency P95 (s) | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |
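To build intuition for `rate()`: it returns the per-second average increase of a counter over the window. A minimal sketch of that arithmetic in Rust (ignoring counter resets, which Prometheus detects and corrects automatically):

```rust
/// Per-second average increase of a counter over a time window,
/// mirroring what PromQL's rate() computes. Counter resets are
/// ignored here; real Prometheus handles them.
fn simple_rate(start_value: f64, end_value: f64, window_seconds: f64) -> f64 {
    (end_value - start_value) / window_seconds
}

fn main() {
    // A counter that grew from 1000 to 1600 over a 5-minute (300 s) window
    let r = simple_rate(1_000.0, 1_600.0, 300.0);
    println!("{r} requests per second"); // 2 requests per second
}
```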
## Automatic Alerts: Alertmanager
Dashboards are useless if nobody watches them at 2 AM. Alertmanager sends notifications when metrics cross thresholds.
```yaml
# prometheus/alert-rules.yml
groups:
  - name: factory-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 10m
        labels:
          severity: critical
      - alert: ApplicationDown
        expr: up{job="factory-monitor"} == 0
        for: 1m
        labels:
          severity: critical
```
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: 'factory-team'
receivers:
  - name: 'factory-team'
    webhook_configs:
      - url: 'http://factory-app:8080/api/alerts'
```
## Custom Application Metrics

Your Rust application can expose custom metrics using the `prometheus` crate:
```rust
use prometheus::{Histogram, IntCounter, register_histogram, register_int_counter};
use lazy_static::lazy_static;

lazy_static! {
    static ref SENSOR_READINGS: IntCounter = register_int_counter!(
        "sensor_readings_total",
        "Total number of sensor readings processed"
    ).unwrap();
    static ref PROCESSING_TIME: Histogram = register_histogram!(
        "sensor_processing_seconds",
        "Time spent processing sensor data"
    ).unwrap();
}

async fn process_sensor_data(data: SensorData) {
    // Starts a timer that records elapsed time into the histogram
    let timer = PROCESSING_TIME.start_timer();
    // Process the data...
    SENSOR_READINGS.inc();
    timer.observe_duration();
}
```
### Metrics Endpoint
```rust
use axum::{Router, routing::get};
use prometheus::TextEncoder;

// Gathers all registered metrics and renders them in the
// plain-text exposition format Prometheus expects
async fn metrics_handler() -> String {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();
    encoder.encode_to_string(&metric_families).unwrap()
}

// In your router setup:
let app = Router::new().route("/metrics", get(metrics_handler));
```
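Prometheus scrapes this endpoint and parses the plain-text exposition format. For the metrics registered above, the response body would look roughly like this (values and bucket boundaries are illustrative, and real histogram output includes many more buckets):

```text
# HELP sensor_readings_total Total number of sensor readings processed
# TYPE sensor_readings_total counter
sensor_readings_total 1523

# HELP sensor_processing_seconds Time spent processing sensor data
# TYPE sensor_processing_seconds histogram
sensor_processing_seconds_bucket{le="0.005"} 1201
sensor_processing_seconds_bucket{le="+Inf"} 1523
sensor_processing_seconds_sum 4.87
sensor_processing_seconds_count 1523
```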
## Practical Example: Monitoring a Rust + SurrealDB Application
```yaml
services:
  factory-app:
    build: .
    ports: ["8080:8080"]
    environment:
      - DATABASE_URL=ws://surrealdb:8000
    depends_on: [surrealdb]
    restart: unless-stopped

  surrealdb:
    image: surrealdb/surrealdb:v2.1.4
    command: start --user root --pass factory123 surrealkv://data/factory.db
    volumes: [surreal-data:/data]
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus-data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
    volumes: [./alertmanager:/etc/alertmanager:ro]
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports: ["9100:9100"]
    restart: unless-stopped

volumes:
  surreal-data:
  prometheus-data:
  grafana-data:
```
## Summary
Monitoring transforms reactive firefighting into proactive system management. Prometheus collects metrics, Grafana visualizes them in dashboards accessible to the entire team, and Alertmanager sends notifications before problems become outages. In the next lesson, you will learn networking fundamentals including VPNs and secure connections between factory sites and cloud servers.