# System Monitoring: Prometheus and Grafana for Server and Application Health

## Why Monitoring Matters Before Anyone Complains
A factory sensor stops sending data. A database reaches 95% disk capacity. An application response time doubles. Without monitoring, you discover these problems when a machine operator calls or the production line halts.
Monitoring gives you visibility before failures become emergencies. It means detecting a failing sensor before it causes production defects, seeing database growth trends weeks before the disk fills up, and catching memory leaks during development rather than during the night shift.
## Prometheus: Collecting Metrics
Prometheus is a time-series database that scrapes metrics from your applications at regular intervals.
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus-data:
```
### Configuration
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s

# Load the alert rules defined in the Alertmanager section
rule_files:
  - alert-rules.yml

# Forward firing alerts to the Alertmanager service
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'factory-monitor'
    static_configs:
      - targets: ['factory-app:8080']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```
## Grafana: Beautiful Live Dashboards
Grafana turns raw metrics into visual dashboards that anyone on the factory floor can understand.
```yaml
# Add under the services: section alongside prometheus
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      # Default admin password -- change this before exposing Grafana to anyone
      - GF_SECURITY_ADMIN_PASSWORD=monitor2025
    restart: unless-stopped
```
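Grafana can auto-configure its Prometheus data source from the mounted provisioning directory, so nobody has to click through the UI after a fresh deployment. A minimal data-source file might look like this (the path `grafana/provisioning/datasources/prometheus.yml` follows Grafana's provisioning layout; the `name` is arbitrary):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Service name from the compose file, resolved on the Docker network
    url: http://prometheus:9090
    isDefault: true
```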
## Useful PromQL Queries

| Panel | Query |
|---|---|
| CPU Usage (%) | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` |
| Memory Used (bytes) | `node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes` |
| HTTP Request Rate (req/s) | `rate(http_requests_total[5m])` |
| Response Latency P95 (s) | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |
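To build intuition for `rate()`: it returns the per-second average increase of a counter over the window. A minimal sketch of that arithmetic in Rust (ignoring counter resets, which Prometheus detects and corrects automatically):

```rust
/// Per-second average increase of a counter over a time window,
/// mirroring what PromQL's rate() computes. Counter resets are
/// ignored here; real Prometheus handles them.
fn simple_rate(start_value: f64, end_value: f64, window_seconds: f64) -> f64 {
    (end_value - start_value) / window_seconds
}

fn main() {
    // A counter that grew from 1000 to 1600 over a 5-minute (300 s) window
    let r = simple_rate(1_000.0, 1_600.0, 300.0);
    println!("{r} requests per second"); // 2 requests per second
}
```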
## Automatic Alerts: Alertmanager
Dashboards are useless if nobody watches them at 2 AM. Alertmanager sends notifications when metrics cross thresholds.
```yaml
# prometheus/alert-rules.yml
groups:
  - name: factory-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 10m
        labels:
          severity: critical
      - alert: ApplicationDown
        expr: up{job="factory-monitor"} == 0
        for: 1m
        labels:
          severity: critical
```
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: 'factory-team'
receivers:
  - name: 'factory-team'
    webhook_configs:
      - url: 'http://factory-app:8080/api/alerts'
```
## Custom Application Metrics

Your Rust application can expose custom metrics using the `prometheus` crate:
```rust
use prometheus::{Histogram, IntCounter, register_histogram, register_int_counter};
use lazy_static::lazy_static;

lazy_static! {
    static ref SENSOR_READINGS: IntCounter = register_int_counter!(
        "sensor_readings_total",
        "Total number of sensor readings processed"
    ).unwrap();
    static ref PROCESSING_TIME: Histogram = register_histogram!(
        "sensor_processing_seconds",
        "Time spent processing sensor data"
    ).unwrap();
}

async fn process_sensor_data(data: SensorData) {
    // Starts a timer that records elapsed time into the histogram
    let timer = PROCESSING_TIME.start_timer();
    // Process the data...
    SENSOR_READINGS.inc();
    timer.observe_duration();
}
```
### Metrics Endpoint
```rust
use axum::{Router, routing::get};
use prometheus::TextEncoder;

// Gathers all registered metrics and renders them in the
// plain-text exposition format Prometheus expects
async fn metrics_handler() -> String {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();
    encoder.encode_to_string(&metric_families).unwrap()
}

// In your router setup:
let app = Router::new().route("/metrics", get(metrics_handler));
```
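Prometheus scrapes this endpoint and parses the plain-text exposition format. For the metrics registered above, the response body would look roughly like this (values and bucket boundaries are illustrative, and real histogram output includes many more buckets):

```text
# HELP sensor_readings_total Total number of sensor readings processed
# TYPE sensor_readings_total counter
sensor_readings_total 1523

# HELP sensor_processing_seconds Time spent processing sensor data
# TYPE sensor_processing_seconds histogram
sensor_processing_seconds_bucket{le="0.005"} 1201
sensor_processing_seconds_bucket{le="+Inf"} 1523
sensor_processing_seconds_sum 4.87
sensor_processing_seconds_count 1523
```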
## Practical Example: Monitoring a Rust + SurrealDB Application
```yaml
services:
  factory-app:
    build: .
    ports: ["8080:8080"]
    environment:
      - DATABASE_URL=ws://surrealdb:8000
    depends_on: [surrealdb]
    restart: unless-stopped

  surrealdb:
    image: surrealdb/surrealdb:v2.1.4
    command: start --user root --pass factory123 surrealkv://data/factory.db
    volumes: [surreal-data:/data]
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus-data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
    volumes: [./alertmanager:/etc/alertmanager:ro]
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports: ["9100:9100"]
    restart: unless-stopped

volumes:
  surreal-data:
  prometheus-data:
  grafana-data:
```
## Summary
Monitoring transforms reactive firefighting into proactive system management. Prometheus collects metrics, Grafana visualizes them in dashboards accessible to the entire team, and Alertmanager sends notifications before problems become outages. In the next lesson, you will learn networking fundamentals including VPNs and secure connections between factory sites and cloud servers.