Clustering
High Availability Clustering
Dkron is designed to operate as a cluster of multiple nodes for fault tolerance and high availability. This guide explains how to set up and manage a Dkron cluster.
Cluster Formation
Bootstrap Methods
Dkron supports several methods for nodes to discover and join each other:
- Static List: Explicitly specify other cluster members
- Auto-join: Automatically discover other nodes
- Cloud Auto-join: Discover nodes in cloud environments
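For example, both the static list and cloud auto-join methods are normally expressed through the retry-join option. The sketch below assumes the go-discover provider syntax that recent Dkron releases accept for cloud auto-join; the AWS tag key and value are placeholders to adapt:

```bash
# Static list: retry joining a fixed set of peers (addresses are examples)
dkron agent --server --bootstrap-expect=3 \
  --retry-join=10.0.0.1 --retry-join=10.0.0.2

# Cloud auto-join: discover peers by cloud metadata using a go-discover
# provider string (tag key/value below are illustrative placeholders)
dkron agent --server --bootstrap-expect=3 \
  --retry-join="provider=aws tag_key=dkron-role tag_value=server"
```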
Basic Cluster Setup
To start a basic 3-node server cluster:
- Start the first node:

      dkron agent --server --bootstrap-expect=3 --node-name=node1 --bind=10.0.0.1 --advertise=10.0.0.1

- Start additional server nodes:

      dkron agent --server --bootstrap-expect=3 --node-name=node2 --bind=10.0.0.2 --advertise=10.0.0.2 --join=10.0.0.1
      dkron agent --server --bootstrap-expect=3 --node-name=node3 --bind=10.0.0.3 --advertise=10.0.0.3 --join=10.0.0.1

- Add agent nodes:

      dkron agent --node-name=agent1 --bind=10.0.0.4 --join=10.0.0.1
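Once the servers have discovered each other, you can confirm membership and leadership through the HTTP API; the example below assumes the default API port 8080 on the first node:

```bash
# List all known cluster members and their status
curl http://10.0.0.1:8080/v1/members

# Show which server currently holds leadership
curl http://10.0.0.1:8080/v1/leader
```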
Key Configuration Parameters
Parameter | Description | Example |
---|---|---|
--server | Run in server mode | --server |
--bootstrap-expect | Expected number of servers | --bootstrap-expect=3 |
--node-name | Unique node identifier | --node-name=node1 |
--bind | Address to bind network services | --bind=10.0.0.1 |
--advertise | Address to advertise to cluster | --advertise=10.0.0.1 |
--join | Address of another node to join | --join=10.0.0.1 |
--retry-join | Auto-retry joining | --retry-join=10.0.0.1 |
--tag | Node tag for job targeting | --tag role=web |
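Tags matter for more than node metadata: a job only runs on nodes whose tags match its own. As a sketch, the job definition below (name, schedule, and command are illustrative) targets one node carrying the role=web tag:

```bash
# Start an agent advertising a role tag
dkron agent --node-name=web1 --tag role=web --join=10.0.0.1

# Register a job restricted to one matching node via its "tags" field
curl -X POST http://10.0.0.1:8080/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
        "name": "cleanup",
        "schedule": "@every 1h",
        "executor": "shell",
        "executor_config": {"command": "echo cleanup"},
        "tags": {"role": "web:1"}
      }'
```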
Manual Cluster Bootstrapping
Dkron can run in high-availability (HA) mode, avoiding single points of failure. This mode provides better scalability and reliability for users who need a high level of confidence in the cron jobs they run.
Manually bootstrapping a Dkron cluster does not rely on additional tooling, but it does require operator participation in the cluster formation process. When bootstrapping, Dkron servers and clients must be started and given the address of at least one Dkron server.
As you can tell, this creates a chicken-and-egg problem where one server must first be fully bootstrapped and configured before the remaining servers and clients can join the cluster. This requirement can add additional provisioning time as well as ordered dependencies during provisioning.
First, we bootstrap a single Dkron server and capture its IP address. After we have that node's IP address, we place this address in the configuration.
- First, bootstrap a single node with a configuration like this:

      # dkron.yml
      server: true
      bootstrap-expect: 1

- Then stop the bootstrapped server and capture its IP address.

- To form a cluster, configure each server node (including the bootstrapped one) with the addresses of its peers, as in the following example:

      # dkron.yml
      server: true
      bootstrap-expect: 3
      retry-join:
        - 10.19.3.9
        - 10.19.4.64
        - 10.19.7.215
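After restarting all three servers with this configuration, confirm that they formed a single cluster with one elected leader. Recent Dkron versions include raft subcommands for inspecting the peer set; if yours does not, the leader endpoint is a quick sanity check (default port 8080 assumed):

```bash
# Inspect the Raft peer set (subcommand availability and flags
# depend on your Dkron version)
dkron raft list-peers

# Exactly one server should be reported as leader
curl http://10.19.3.9:8080/v1/leader
```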
Deployment Table
Below is a table that shows quorum size and failure tolerance for various cluster sizes. The recommended deployment is either 3 or 5 servers. A single server deployment is highly discouraged as data loss is inevitable in a failure scenario.
Servers | Quorum Size | Failure Tolerance |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 2 | 1 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 4 | 2 |
7 | 4 | 3 |
Fault Tolerance and Recovery
Dkron's distributed architecture is designed to maintain operation even when nodes fail.
Leader Failure Recovery
When a leader node fails:
- Remaining server nodes detect the failure through the gossip protocol
- A new leader election is automatically triggered
- A new leader is elected from the available server nodes
- The new leader takes over scheduling responsibilities
- All running jobs continue to operate without interruption
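You can observe a failover by polling the leader endpoint on a surviving server while stopping the current leader; the reported member should switch to one of the remaining servers (default port 8080 assumed):

```bash
# Poll the leader endpoint on a surviving server every two seconds
while true; do
  curl -s http://10.0.0.2:8080/v1/leader
  echo
  sleep 2
done
```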
Quorum Requirements
- A cluster must maintain a quorum (majority) of server nodes to operate
- For a cluster of N servers, at least (N/2)+1 nodes (a majority, using integer division) must be available
- If quorum is lost, the cluster stops scheduling new jobs until quorum is restored
- Refer to the Deployment Table above for specific quorum sizes and failure tolerance
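The quorum figures in the table above follow directly from integer arithmetic; the snippet below reproduces them for any cluster size:

```bash
# Quorum is floor(N/2) + 1; failure tolerance is whatever remains
N=5
QUORUM=$(( N / 2 + 1 ))
echo "servers=$N quorum=$QUORUM failure_tolerance=$(( N - QUORUM ))"
```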
Node Rejoining
When a previously failed node rejoins the cluster:
- The node establishes connection with existing cluster members
- It synchronizes its state with the current leader
- If it was previously a leader, it joins as a follower
- The node becomes available for job execution
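Rejoining usually requires nothing more than restarting the node with its original identity and peer list; with --retry-join it will keep attempting to reconnect until the cluster is reachable, then catch up from the leader. The addresses below reuse the earlier examples, and the result can be confirmed with the /v1/members endpoint shown above.

```bash
# Restart the previously failed server; it rejoins as a follower
dkron agent --server --bootstrap-expect=3 \
  --node-name=node3 --bind=10.0.0.3 --advertise=10.0.0.3 \
  --retry-join=10.0.0.1 --retry-join=10.0.0.2
```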
Data Recovery
In case of catastrophic failure:
- Restore from backup if available
- Bootstrap a new cluster with at least one server node
- Import job definitions via API
- New executions will begin according to schedule
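Job definitions exported earlier can be re-imported through the jobs endpoint; the sketch below assumes one JSON job object per file in a jobs/ directory and the default API port 8080:

```bash
# Re-create each exported job on the freshly bootstrapped cluster
for f in jobs/*.json; do
  curl -sf -X POST http://10.0.0.1:8080/v1/jobs \
    -H "Content-Type: application/json" \
    -d @"$f"
done
```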
Cluster Maintenance
Proper maintenance ensures your Dkron cluster remains healthy and performant.
Adding Nodes
To add a new server node to an existing cluster:
- Install Dkron on the new server
- Configure with appropriate server settings
- Set the --join parameter to point to existing cluster members
- Start the Dkron process
- Verify the node joins successfully via the web UI or API

Adding agent nodes follows the same process but without the --server flag.
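For example, a fourth server could join the earlier cluster as follows; --bootstrap-expect only matters for the initial bootstrap, so it can normally be omitted when joining an already-formed cluster:

```bash
# New server joins the existing cluster (addresses are examples)
dkron agent --server --node-name=node4 \
  --bind=10.0.0.5 --advertise=10.0.0.5 \
  --retry-join=10.0.0.1
```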
Removing Nodes
To gracefully remove a node:
- For agents: simply stop the Dkron service
- For server nodes:
- If possible, demote the node to an agent first
- Ensure you maintain sufficient servers for quorum
- Stop the Dkron service
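A graceful stop lets the node leave the gossip pool cleanly. If a server dies and cannot be recovered, recent Dkron versions provide raft subcommands to drop the stale peer; confirm the exact command names and flags against your version's documentation:

```bash
# Graceful removal: stop the service so the node leaves the cluster cleanly
systemctl stop dkron

# If a dead server lingers in the Raft configuration, remove it explicitly
# (subcommands and flags vary by Dkron version)
dkron raft list-peers
dkron raft remove-peer --peer-id=<id-of-dead-server>
```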
Upgrading Dkron
For minimal disruption during upgrades:
- Rolling Upgrade (recommended):
  - Upgrade one node at a time
  - Start with agent nodes
  - Then upgrade server nodes, leaving the leader for last
  - Allow state synchronization between each node upgrade
- Version Compatibility:
  - Check release notes for compatibility information
  - Ensure all nodes in a cluster run compatible versions
  - Some upgrades may require special procedures
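A single node's upgrade step often looks like the sketch below; the binary path, service name, and presence check are assumptions about a typical deployment rather than Dkron requirements:

```bash
# Upgrade one node, then wait for it to show up in the member list
# before moving on to the next node
systemctl stop dkron
install /tmp/dkron-new /usr/local/bin/dkron   # replace the binary (example path)
systemctl start dkron

# Simple presence check against a surviving server's API
until curl -s http://10.0.0.1:8080/v1/members | grep -q '"node3"'; do
  sleep 5
done
```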
Monitoring Cluster Health
Regular health checks should include:
- Node Status: Check that all expected nodes are active
- Leadership: Verify leader election is stable
- Job Execution: Monitor successful job execution rates
- Resource Usage: Track CPU, memory, and disk usage
- Log Analysis: Review logs for errors or warnings
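A minimal probe for the first two checks can be built on the API endpoints used throughout this guide; job metrics, resource usage, and log analysis would come from your existing monitoring stack. The node names and address below are examples:

```bash
#!/usr/bin/env bash
# Minimal Dkron health probe (default API port 8080 assumed)
API=http://10.0.0.1:8080/v1

# A leader must be reachable
curl -sf "$API/leader" > /dev/null || echo "WARNING: leader endpoint not healthy"

# Every expected node should appear in the member list
MEMBERS=$(curl -sf "$API/members")
for node in node1 node2 node3 agent1; do
  echo "$MEMBERS" | grep -q "\"$node\"" || echo "WARNING: $node missing"
done
```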
Backup and Recovery
Implement a regular backup strategy:
- Data Store Backup:
  - Stop the Dkron service or use backup-friendly commands
  - Copy the data directory to a secure location
  - Restart the service if stopped
- Configuration Backup:
  - Maintain copies of configuration files
  - Document any custom settings
- Job Definition Export:
  - Use the API to export job definitions
  - Store in version control for tracking changes
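Putting these together, a simple scheduled backup might export jobs through the API, archive the data directory, and keep the configuration alongside; the host, port, and paths below are assumptions to adapt to your data-dir setting and layout:

```bash
#!/usr/bin/env bash
# Nightly Dkron backup sketch; adjust host, port, and paths to your setup
STAMP=$(date +%F)
API=http://10.0.0.1:8080/v1

# 1. Export all job definitions via the API
curl -sf "$API/jobs" > "/backups/dkron-jobs-$STAMP.json"

# 2. Archive the data directory (path depends on your data-dir setting)
tar czf "/backups/dkron-data-$STAMP.tar.gz" /var/lib/dkron

# 3. Keep a copy of the configuration file
cp /etc/dkron/dkron.yml "/backups/dkron-config-$STAMP.yml"
```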