Intro
Dkron - Distributed, fault tolerant job scheduling system
Welcome to the Dkron documentation! This is the reference guide for using Dkron. If you are new to Dkron, start with the getting started guide.
What is Dkron
Dkron is a distributed system for running scheduled jobs against a server or a group of servers of any size. One machine is the leader and the others are followers. If the leader fails or becomes unreachable, another server takes over and reschedules all jobs to keep the system healthy.
If the old leader comes back online, it rejoins the cluster as a follower.
Dkron is a distributed, fault-tolerant drop-in replacement for cron that is easy to set up, with a focus on being:
- Easy: Easy to use with a great UI
- Reliable: Completely fault tolerant
- Highly scalable: Able to handle high volumes of scheduled jobs and thousands of nodes
Dkron is written in Go and leverages distributed key-value stores and Serf to provide fault tolerance, reliability, and scalability while remaining simple and easy to install.
Dkron is inspired by the Google whitepaper "Reliable Cron across the Planet".
Dkron runs on Linux, macOS, and Windows. It can be used to run scheduled commands on a server cluster using any combination of servers for each job. It has no single point of failure because it relies on fault-tolerant distributed databases, and it works at large scale because nodes communicate over an efficient, lightweight gossip protocol.
Through this gossip protocol, failure notification and task handling run efficiently across a cluster of any size.
System Architecture
Dkron utilizes a server-agent architecture where multiple agents can form a cluster for high availability. Here's a high-level overview of how Dkron works:
Key Components
Dkron's architecture consists of several key components:
- Server Nodes: Nodes running in server mode (with the --server flag) that participate in leader election and can schedule jobs.
- Leader Node: One server node is elected as leader and is responsible for:
  - Scheduling jobs
  - Assigning job executions to target nodes
  - Maintaining the cluster state
- Follower Nodes: Server nodes that are not the leader. They:
  - Can execute jobs when selected as targets
  - Are ready to become the leader if the current leader fails
  - Maintain a replicated log of all operations
- Embedded Data Store: Dkron uses an embedded BoltDB database to store:
  - Job definitions
  - Execution history
  - Cluster state
- Serf Layer: Handles cluster membership, failure detection, and messaging between nodes using the gossip protocol.
- HTTP API and Web UI: Provides a RESTful API and web interface for job management.
- Executors: Plugins that handle the actual execution of job commands (shell, HTTP, etc.).
- Processors: Plugins that process the output of job executions (log, file, email, etc.).
Dkron Scheduling Flow
Here's how job scheduling works in Dkron:
- Job Definition: Users define jobs with scheduling parameters (cron expression), execution options, and target node tags.
- Leader Scheduling: The leader node tracks job schedules and triggers executions at the appropriate times.
- Target Selection: The leader selects target nodes for job execution based on tags and execution options.
- Execution: Target nodes run the job using the specified executor.
- Processing Output: Job output is processed by configured processors.
- Results Storage: Execution results are stored in the distributed data store.
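The steps above can be sketched as a small pipeline. This is an illustrative model only, not Dkron's implementation: the function name `run_scheduled_job`, the dictionary shapes, and the `select`, `execute`, and `processors` callables are all hypothetical stand-ins for the leader, executors, processors, and data store.

```python
def run_scheduled_job(job, nodes, select, execute, processors, store):
    """Walk one triggered job through selection, execution, processing, storage."""
    results = []
    for node in select(job, nodes):             # target selection
        output = execute(node, job["command"])  # execution on the target node
        for process in processors:              # output processing, in order
            output = process(output)
        record = {"job": job["name"], "node": node["name"], "output": output}
        store.append(record)                    # results storage
        results.append(record)
    return results


# Hypothetical job and cluster used only for this sketch.
job = {"name": "backup", "command": "echo done", "tags": {"role": "web"}}
nodes = [
    {"name": "node1", "tags": {"role": "web"}},
    {"name": "node2", "tags": {"role": "db"}},
]
store = []

results = run_scheduled_job(
    job,
    nodes,
    select=lambda j, ns: [n for n in ns if n["tags"].get("role") == j["tags"]["role"]],
    execute=lambda node, cmd: f"{node['name']} ran: {cmd}",
    processors=[str.upper],
    store=store,
)
print(results)
```

Passing each stage in as a callable mirrors the plugin idea: executors and processors can be swapped without touching the flow itself.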
Key Concepts
Jobs
Jobs are the core entity in Dkron. A job consists of:
- Name: Unique identifier for the job
- Schedule: When to run the job (cron expression)
- Command: What to run
- Executor: How to run the command (shell, HTTP, etc.)
- Processors: How to process the output
- Tags: Key-value pairs for node selection
- Concurrency: Options to control concurrent execution
- Dependent Jobs: Jobs that should run after this job completes
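As a rough illustration, a job with these attributes could be assembled as a JSON document. The field names and values below (`executor_config`, the `"forbid"` concurrency value, the `@every 1h` schedule, the `log` processor entry) are assumptions made for this sketch; check the API reference for the exact schema before relying on them.

```python
import json

# Hypothetical job definition; every key and value here is illustrative.
job = {
    "name": "nightly_backup",                       # unique identifier
    "schedule": "@every 1h",                        # when to run (assumed syntax)
    "executor": "shell",                            # how to run the command
    "executor_config": {"command": "backup.sh"},    # what to run (assumed key)
    "processors": {"log": {}},                      # output handling (assumed shape)
    "tags": {"role": "web"},                        # node selection
    "concurrency": "forbid",                        # assumed policy value
}

payload = json.dumps(job, indent=2)
print(payload)
```

The resulting `payload` string is the kind of document that would be submitted through the HTTP API or pasted into the web UI.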
Tags and Node Selection
Dkron uses tags to control which nodes execute specific jobs:
- Node Tags: Assigned to nodes during startup (--tag key=value)
- Job Tags: Specified in job definitions ("tags": {"role": "web"})
- Tag Matching: Jobs run on nodes where all job tags match node tags
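The matching rule can be expressed in a few lines. This is a minimal sketch of the "all job tags must match" rule as stated above; the function name `job_matches_node` and the sample cluster are hypothetical, and the sketch ignores any richer tag syntax Dkron may support.

```python
def job_matches_node(job_tags, node_tags):
    """Return True when every job tag is present on the node with the same value."""
    return all(node_tags.get(key) == value for key, value in job_tags.items())


# Hypothetical cluster: node name -> tags assigned at startup.
cluster = {
    "node1": {"role": "web", "dc": "eu"},
    "node2": {"role": "web", "dc": "us"},
    "node3": {"role": "db", "dc": "eu"},
}

job_tags = {"role": "web", "dc": "eu"}
eligible = [name for name, tags in cluster.items() if job_matches_node(job_tags, tags)]
print(eligible)
```

A node may carry extra tags that the job does not mention; only the tags listed on the job have to match.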
Concurrency Options
Dkron provides several options to control job concurrency:
- Concurrency: Allow (or disallow) concurrent executions of the same job
- Executor Concurrency: Limit concurrent executions on a single node
- Global Concurrency: Control concurrent executions across the entire cluster
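To make the first option concrete, here is a small sketch of an allow-or-disallow check for a single job. The policy names `"allow"` and `"forbid"` and the function `should_start` are assumptions for illustration; the per-node and cluster-wide limits are not modeled.

```python
def should_start(job_name, policy, running):
    """Decide whether a new execution may start given the executions in flight."""
    if policy == "allow":
        return True
    # Any other policy is treated as "disallow": skip if the job is still running.
    return job_name not in running


running = {"report"}  # hypothetical set of jobs with an execution in flight
print(should_start("report", "allow", running))
print(should_start("report", "forbid", running))
print(should_start("cleanup", "forbid", running))
```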
Status Codes and Retries
Jobs can be configured with:
- Success Status Codes: Define which exit codes indicate success
- Retries: Number of times to retry a failed execution
- Retry Interval: Time to wait between retries
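How these three settings interact can be sketched as a retry loop. This is an illustrative model, not Dkron's code: `run_with_retries` and its parameters are hypothetical names, and the `sleep` parameter exists only so the example can run without waiting.

```python
import time


def run_with_retries(run, success_codes=(0,), retries=0, retry_interval=0.0,
                     sleep=time.sleep):
    """Call run() until it returns a success code or the retries are used up.

    Returns (succeeded, attempts). The first call is not counted as a retry.
    """
    attempts = 0
    while True:
        code = run()
        attempts += 1
        if code in success_codes:
            return True, attempts
        if attempts > retries:
            return False, attempts
        sleep(retry_interval)  # wait before the next retry


# A fake command that fails twice and then succeeds.
codes = iter([1, 1, 0])
ok, attempts = run_with_retries(lambda: next(codes), retries=2,
                                retry_interval=5.0, sleep=lambda seconds: None)
print(ok, attempts)
```

With `retries=2` the command is attempted at most three times in total: the initial run plus two retries.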
Job Dependencies
Dkron supports job dependencies for complex workflows:
- Parent-Child Relationships: Jobs can depend on other jobs
- Status Checking: Child jobs run only if parent jobs succeed
- Chained Execution: Create multi-step job pipelines
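The status-checking rule can be illustrated with a toy pipeline. The sketch below assumes a simple parent-to-children mapping and jobs that return True on success; `run_chain` and the job names are hypothetical and do not reflect how Dkron stores dependencies.

```python
def run_chain(jobs, children, root):
    """Run root, then run each child only if its parent succeeded."""
    executed = []
    pending = [root]
    while pending:
        name = pending.pop(0)
        succeeded = jobs[name]()
        executed.append((name, succeeded))
        if succeeded:
            # Children are scheduled only after a successful parent run.
            pending.extend(children.get(name, []))
    return executed


# Hypothetical three-step pipeline where the middle step fails.
jobs = {
    "extract": lambda: True,
    "transform": lambda: False,
    "load": lambda: True,
}
children = {"extract": ["transform"], "transform": ["load"]}

history = run_chain(jobs, children, "extract")
print(history)
```

Because `transform` fails, `load` never runs, which is the behavior the status-checking rule describes.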
Web UI
The Dkron web UI provides an easy-to-use interface for:
- Creating and editing jobs
- Viewing execution history and logs
- Monitoring cluster status
- Running jobs manually
- Managing job dependencies
Dkron design
Dkron is designed to solve one problem well: executing commands at given intervals. It follows the Unix philosophy of doing one thing and doing it well (like the battle-tested cron), with the addition of being designed for the cloud era: it removes single points of failure in environments where scheduled jobs need to run on multiple servers.