Version: v4

Metrics

Dkron can send metrics to StatsD for dashboards and historical reporting, or expose them in Prometheus format through the API. It emits job processing, Go runtime, and Serf metrics.

Configuration

Statsd

Add this to your YAML config file to enable StatsD metrics:

statsd-addr: "localhost:8125"
# Or for datadog statsd
dog-statsd-addr: "localhost:8125"
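
To check that metrics are actually arriving at the configured address, a throwaway UDP listener can stand in for a real StatsD server. The following is a minimal sketch, assuming the plain StatsD line format `name:value|type`; the helper names `parse_statsd_line` and `listen` are hypothetical and not part of Dkron.

```python
import socket


def parse_statsd_line(line):
    """Split one StatsD line of the form 'name:value|type' into its parts."""
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    try:
        value = float(fields[0])
    except ValueError:
        value = fields[0]  # non-numeric values are kept as raw strings
    metric_type = fields[1] if len(fields) > 1 else ""
    return name, value, metric_type


def listen(host="localhost", port=8125, max_packets=20):
    """Print the metrics found in the first `max_packets` UDP datagrams."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        for _ in range(max_packets):
            data, _addr = sock.recvfrom(65535)
            # One datagram may carry several newline-separated metrics.
            for line in data.decode("utf-8", errors="replace").splitlines():
                if line:
                    print(parse_statsd_line(line))
```

Calling `listen()` while no other process is bound to port 8125, and then starting Dkron with `statsd-addr` set as above, should print the incoming metrics as tuples.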

Prometheus

Add this to your YAML config file to serve Prometheus metrics at the /metrics endpoint:

enable-prometheus: true

Additionally, in your Prometheus config file (prometheus.yml), add a scrape job that points at the Dkron metrics endpoint:

scrape_configs:
  # ... initial configuration

  - job_name: "dkron_metrics"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ["localhost:8080"]
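
Before wiring up Prometheus, it can help to confirm that the endpoint responds and to see which series it exposes. The sketch below fetches the page and parses the Prometheus text format with the standard library only. It assumes Dkron listens on `localhost:8080` as in the scrape config above and that Dkron's own series start with `dkron_` (an assumption based on the alert rule later on this page); `parse_metrics` and `fetch_metrics` are hypothetical helper names.

```python
from urllib.request import urlopen


def parse_metrics(text, prefix="dkron_"):
    """Return {series: value} for sample lines whose name starts with `prefix`."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Keep the label set with the name; label values may contain spaces.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        parts = rest.split()
        if series.startswith(prefix) and parts:
            samples[series] = float(parts[0])  # parts[1], if present, is a timestamp
    return samples


def fetch_metrics(url="http://localhost:8080/metrics"):
    """Download the metrics page and parse it (assumed address, adjust as needed)."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

With `enable-prometheus: true` set and Dkron running, `fetch_metrics()` returns a dictionary of the current sample values, which is a quick way to find the exact series names to use in dashboards and alert rules.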

Monitoring Dashboard Setup

Setting up a monitoring dashboard with metrics from Dkron is highly recommended for production deployments. You have several options:

  1. Grafana + Prometheus: Create dashboards with job execution success rates, execution timing trends, and system health metrics
  2. DataDog: Use built-in dashboard templates with the DogStatsD integration
  3. Custom StatsD Dashboards: Configure with any StatsD-compatible visualization tool

Example Grafana dashboard panels for monitoring Dkron might include:

  • Job execution success rate over time
  • Average execution duration by job
  • System resource utilization
  • Cluster health status

Metrics Reference

Dkron emits several categories of metrics to help you monitor both the application and your job executions:

Agent Event Metrics

These metrics track internal events within Dkron agents:

| Metric | Description |
| --- | --- |
| dkron.agent.event_received.query_execution_done | Count of completed job execution events received by the agent |
| dkron.agent.event_received.query_run_job | Count of job run requests received by the agent |

Network Communication Metrics

These metrics help monitor the health of inter-node communication:

| Metric | Description |
| --- | --- |
| dkron.memberlist.gossip | Count of gossip protocol messages exchanged |
| dkron.memberlist.probeNode | Count of node health probe checks |
| dkron.memberlist.pushPullNode | Count of anti-entropy sync operations |
| dkron.memberlist.tcp.accept | Count of accepted TCP connections |
| dkron.memberlist.tcp.connect | Count of initiated TCP connections |
| dkron.memberlist.tcp.sent | Count and bytes of TCP packets sent |
| dkron.memberlist.udp.received | Count and bytes of UDP packets received |
| dkron.memberlist.udp.sent | Count and bytes of UDP packets sent |

gRPC Service Metrics

These metrics track the internal RPC communication:

| Metric | Description |
| --- | --- |
| dkron.grpc.call_execution_done | Count and timing of execution completion RPC calls |
| dkron.grpc.call_get_job | Count and timing of job retrieval RPC calls |
| dkron.grpc.execution_done | Count of completed job executions |
| dkron.grpc.get_job | Count of job information retrievals |

Runtime Metrics

These metrics provide insights into the Go runtime health:

| Metric | Description |
| --- | --- |
| dkron.runtime.alloc_bytes | Current bytes allocated by the application |
| dkron.runtime.free_count | Count of memory free operations |
| dkron.runtime.gc_pause_ns | Duration of the last garbage collection pause in nanoseconds |
| dkron.runtime.heap_objects | Number of objects in the heap |
| dkron.runtime.malloc_count | Count of memory allocation operations |
| dkron.runtime.num_goroutines | Number of goroutines currently running |
| dkron.runtime.sys_bytes | Total bytes of memory obtained from the OS |
| dkron.runtime.total_gc_pause_ns | Total time spent in GC pauses |
| dkron.runtime.total_gc_runs | Total number of completed GC cycles |

Serf Metrics

These metrics track the cluster membership and failure detection system:

| Metric | Description |
| --- | --- |
| dkron.serf.coordinate.adjustment_ms | Time adjustment for coordinate system in milliseconds |
| dkron.serf.msgs.received | Count of messages received |
| dkron.serf.msgs.sent | Count of messages sent |
| dkron.serf.queries | Count of queries processed |
| dkron.serf.queries.execution_done | Count of execution completion queries |
| dkron.serf.queries.run_job | Count of job run queries |
| dkron.serf.query_acks | Count of query acknowledgments |
| dkron.serf.query_responses | Count of responses to queries |
| dkron.serf.queue.Event | Count of events in processing queue |
| dkron.serf.queue.Intent | Count of intent messages in queue |
| dkron.serf.queue.Query | Count of queries in processing queue |

Alerting on Metrics

To set up effective alerting based on Dkron metrics, consider these recommendations:

  1. Job Failure Alerts: Monitor dkron.serf.queries.execution_done with error status
  2. Cluster Health: Set alerts on node count changes or consistent gossip failures
  3. Performance Degradation: Watch for increases in job execution time trends
  4. Resource Constraints: Monitor dkron.runtime.gc_pause_ns and other runtime metrics for signs of resource pressure

For Prometheus users, an example alerting rule:

groups:
  - name: dkron_alerts
    rules:
      - alert: DkronJobFailures
        expr: increase(dkron_job_executions_failed_total[1h]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Multiple job failures detected"
          description: "There have been more than 3 job failures in the last hour"
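
To make the rule's condition concrete, the sketch below reproduces the core idea of `increase(...) > 3` on a list of counter samples taken within the window. It is a simplification, not Prometheus's implementation: it handles counter resets but skips the extrapolation to the window boundaries that Prometheus applies, and it ignores the `for: 5m` hold time. The function names and the sample values are hypothetical.

```python
def counter_increase(samples):
    """Approximate the increase of a counter across ordered samples.

    A drop between two samples is treated as a counter reset (for example
    after a restart), so the new value counts as growth since the reset.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def should_alert(samples, threshold=3):
    """True when the counter grew by more than `threshold` within the window."""
    return counter_increase(samples) > threshold


# Failure counter scraped four times within one hour (made-up values).
print(should_alert([10, 11, 13, 14]))  # growth of 4 exceeds the threshold of 3
```

Because the comparison is strict, exactly three failures in the window do not fire the alert; a fourth one does.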