Prometheus Metrics
The metal-operator exposes custom Prometheus metrics to provide visibility into the state and health of managed servers. These metrics are exposed at the /metrics endpoint alongside standard controller-runtime metrics.
Accessing Metrics
Local Development
# Port-forward to the metrics endpoint
kubectl -n metal-operator-system port-forward deployment/metal-operator-controller-manager 8443:8443
# Query metrics (skip TLS verification for dev)
curl -k https://localhost:8443/metrics | grep metal_Production
The operator includes a ServiceMonitor configured for Prometheus Operator:
# config/prometheus/monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: metal-operator-controller-manager-metrics-monitor
namespace: metal-operator-system
spec:
endpoints:
- bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
path: /metrics
port: https
scheme: https
tlsConfig:
insecureSkipVerify: true
selector:
matchLabels:
control-plane: controller-managerAvailable Metrics
Server State Distribution (metal_server_state)
Type: Gauge Description: Current count of servers in each state Labels:
state: ServerState value (Initial, Discovery, Available, Reserved, Error, Maintenance)
Example values:
metal_server_state{state="Available"} 5
metal_server_state{state="Reserved"} 2
metal_server_state{state="Error"} 0
metal_server_state{state="Maintenance"} 1Use cases:
- Monitor available server capacity
- Alert on servers in error states
- Track server lifecycle distribution
Server Power State Distribution (metal_server_power_state)
Type: Gauge Description: Current count of servers in each power state Labels:
power_state: ServerPowerState value (On, Off, PoweringOn, PoweringOff, Paused)
Example values:
metal_server_power_state{power_state="On"} 7
metal_server_power_state{power_state="Off"} 1
metal_server_power_state{power_state="PoweringOn"} 0Use cases:
- Track power operations in progress
- Identify stuck power transitions
- Energy consumption estimation
Server Condition Status (metal_server_condition_status)
Type: Gauge Description: Count of servers with each condition status Labels:
condition_type: Condition type (e.g., "Ready", "PoweringOn", "Discovered")status: Condition status (True, False, Unknown)
Example values:
metal_server_condition_status{condition_type="Ready",status="True"} 1
metal_server_condition_status{condition_type="Discovered",status="True"} 1
metal_server_condition_status{condition_type="PoweringOn",status="False"} 0Use cases:
- Track server health conditions
- Alert on specific condition failures
- Monitor discovery and power operation progress
Server Reconciliation Total (metal_server_reconciliation_total)
Type: Counter Description: Total number of server reconciliations by result Labels:
result: Operation result (success, error_fetch, error_reconcile)
Example values:
metal_server_reconciliation_total{result="success"} 1523
metal_server_reconciliation_total{result="error_fetch"} 2
metal_server_reconciliation_total{result="error_reconcile"} 15Use cases:
- Monitor reconciliation error rates
- Track controller performance
- Debug reconciliation issues
Example Queries
Server Inventory
# Total servers by state
sum by (state) (metal_server_state)
# Available server capacity
metal_server_state{state="Available"}
# Servers requiring attention
metal_server_state{state="Error"} + metal_server_state{state="Maintenance"}
# Percentage of servers in error state
metal_server_state{state="Error"} / sum(metal_server_state) * 100Power Operations
# Servers currently powered on
metal_server_power_state{power_state="On"}
# Servers in transition states (possibly stuck)
metal_server_power_state{power_state="PoweringOn"} + metal_server_power_state{power_state="PoweringOff"}
# Power state distribution
sum by (power_state) (metal_server_power_state)Health and Conditions
# Count of servers with Ready=True
sum(metal_server_condition_status{condition_type="Ready",status="True"})
# Servers with failed power operations
metal_server_condition_status{condition_type="PoweringOn",status="False"}Reconciliation Performance
# Reconciliation error rate (errors per second over 5 minutes)
rate(metal_server_reconciliation_total{result=~"error_.*"}[5m])
# Success ratio
rate(metal_server_reconciliation_total{result="success"}[5m])
/ rate(metal_server_reconciliation_total[5m])
# Total reconciliation rate
sum(rate(metal_server_reconciliation_total[5m]))Alerting Rules
Example PrometheusRule resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: metal-operator-server-alerts
namespace: metal-operator-system
spec:
groups:
- name: metal_operator_servers
interval: 30s
rules:
- alert: NoAvailableServers
expr: metal_server_state{state="Available"} == 0
for: 5m
annotations:
summary: "No available servers in the fleet"
description: "All servers are either Reserved, in Maintenance, or in Error state"
labels:
severity: warning
- alert: ServersInErrorState
expr: metal_server_state{state="Error"} > 0
for: 2m
annotations:
summary: "Servers are in Error state"
description: "{{ $value }} server(s) are in Error state and require attention"
labels:
severity: critical
- alert: ServersPoweringOnTooLong
expr: metal_server_power_state{power_state="PoweringOn"} > 0
for: 10m
annotations:
summary: "Servers stuck in PoweringOn state"
description: "{{ $value }} server(s) have been in PoweringOn state for over 10 minutes"
labels:
severity: warning
- alert: HighReconciliationErrorRate
expr: rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]) > 0.1
for: 5m
annotations:
summary: "High server reconciliation error rate"
description: "Server reconciliation errors are occurring at {{ $value | humanize }} per second"
labels:
severity: warning
- alert: LowAvailableServerCapacity
expr: metal_server_state{state="Available"} < 2
for: 5m
annotations:
summary: "Low available server capacity"
description: "Only {{ $value }} server(s) are available"
labels:
severity: warningGrafana Dashboard
Example dashboard queries for visualization:
Server State Distribution Panel (Pie Chart)
sum by (state) (metal_server_state)Server Power State Timeline (Graph)
metal_server_power_stateReconciliation Error Rate (Graph)
rate(metal_server_reconciliation_total{result="success"}[5m])
rate(metal_server_reconciliation_total{result=~"error_.*"}[5m])Available Server Capacity (Gauge)
metal_server_state{state="Available"}Implementation Details
Metric Collection Strategy
The operator uses a custom Collector pattern to ensure accurate metric counts:
- On each Prometheus scrape (default: 30s interval), the collector lists all Server resources
- Counts are computed in-memory and emitted as gauge metrics
- This ensures metrics always reflect current cluster state, not accumulated values
Benefits:
- Accurate counts even if reconciliation loop misses updates
- No metric staleness from deleted servers
- Resilient to operator restarts
Performance considerations:
- ServerList operation uses watch cache (fast)
- Default scrape interval is 30s (adjustable)
- For very large clusters (>1000 servers), consider increasing scrape interval
Cardinality Control
All metrics use bounded label value sets to prevent cardinality explosion:
state: 6 possible valuespower_state: 5 possible valuescondition_type: ~10 typical valuesresult: 3 values
Never used as labels:
- Server names, UUIDs, or namespaces
- IP addresses or MAC addresses
- Timestamps
This ensures Prometheus performance remains optimal even with large server fleets.
Troubleshooting
Metrics Not Appearing
Verify ServiceMonitor is deployed:
bashkubectl -n metal-operator-system get servicemonitorCheck Prometheus targets:
bashkubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 # Open http://localhost:9090/targets # Verify "metal-operator-controller-manager-metrics-monitor" target is UPCheck manager logs for metric registration:
bashkubectl -n metal-operator-system logs deployment/metal-operator-controller-manager -c manager | grep metrics # Should see: "Registered custom server metrics collector"
Incorrect Metric Values
Verify servers are reconciling:
bashkubectl get serversCheck reconciliation metrics:
promqlrate(metal_server_reconciliation_total[5m])Query specific label combinations:
bashcurl -k https://localhost:8443/metrics | grep metal_server_state
High Cardinality Warning
If Prometheus shows cardinality warnings for metal-operator metrics:
- Verify no custom labels were added
- Check for metric label explosion (should never happen with current implementation)
- Review Prometheus storage settings if total metrics exceed capacity