Frequently Asked Questions (FAQ)
This FAQ addresses the most common issues and questions reported by Dkron users based on GitHub issues and community discussions.
Performance and Scalability
Why is my Dkron cluster experiencing performance degradation with high-frequency jobs?
Issue: Leader node crashes or becomes unresponsive when running many jobs with frequent schedules (e.g., every second or minute).
Solutions:
- Avoid scheduling jobs so frequently that executions overlap (e.g., many jobs firing every second). Consider batching operations or using a queue system for very high-frequency tasks.
- Offload logs to stdout/file using processors.
- Monitor memory usage and ensure adequate resources are allocated to the leader node.
- Distribute jobs across multiple agent nodes using tags to reduce load on the leader (see the job sketch after this list).
- Review the number of concurrent jobs and adjust concurrency policies accordingly.
- Consider upgrading to a more powerful server for the leader node in large-scale deployments.
- Consider upgrading to Dkron Pro, which supports a faster Raft log implementation (fastlog).
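For the tag-based distribution mentioned above, here is a minimal job sketch that pins execution to tagged agents; the job name, schedule, script path, and tag values are illustrative:

```json
{
  "name": "nightly-report",
  "schedule": "@every 1h",
  "executor": "shell",
  "executor_config": {
    "command": "/usr/local/bin/report.sh"
  },
  "tags": {
    "role": "batch:2"
  }
}
```

The `:2` suffix asks Dkron to run the job on at most two agents matching the tag, which keeps execution work off the server nodes.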
Related: Issue #1828, Issue #1842
Does Dkron provide performance benchmark data?
Dkron does not currently provide official benchmark data. Performance varies significantly based on:
- Job frequency and complexity
- Cluster size and configuration
- Hardware resources
- Network latency
- Type of executor used
For high-frequency scheduling scenarios, it's recommended to conduct your own benchmarking in an environment similar to your production setup.
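As a starting point, you can generate synthetic load by registering many small jobs through the REST API and watching scheduler latency and leader memory. A hedged sketch, assuming the default HTTP port; the job names and schedule are illustrative:

```bash
# Create 100 lightweight benchmark jobs, each running every minute
for i in $(seq 1 100); do
  curl -s -X POST http://localhost:8080/v1/jobs \
    -H "Content-Type: application/json" \
    -d "{\"name\":\"bench-$i\",\"schedule\":\"@every 1m\",\"executor\":\"shell\",\"executor_config\":{\"command\":\"true\"}}"
done
```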
Related: Issue #1839
Memory Issues
Why is Dkron consuming excessive memory or experiencing OOM kills?
Common causes:
- Memory leaks: Older versions had memory leak issues that have been addressed in recent releases.
- High execution count: Storing too many execution records (Dkron keeps up to 100 executions per job).
- Large job output: Jobs that produce large amounts of output can consume significant memory.
- Too many jobs: Running thousands of jobs simultaneously increases memory usage.
Solutions:
- Update to the latest version of Dkron (4.x series has several memory improvements).
- Use the `files` processor to write job output to disk instead of storing it in memory (see the sketch after this list).
- Limit job output size in executor configurations.
- Monitor memory usage with Prometheus metrics.
- Set appropriate memory limits in Docker/Kubernetes deployments.
- Reduce the number of concurrent jobs or distribute them across more agents.
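For the `files` processor mentioned above, a minimal sketch of a job fragment that writes output to disk instead of keeping it in the execution record; the log directory is illustrative:

```json
{
  "name": "my-job",
  "processors": {
    "files": {
      "log_dir": "/var/log/dkron",
      "forward": "false"
    }
  }
}
```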
Related: Issue #1753, Issue #1566, Issue #1553
Clustering and High Availability
Why is my agent unable to connect to the cluster?
Common causes:
- Network connectivity issues
- Firewall blocking required ports (default: 8946 for Serf, 8190 for Raft)
- Incorrect `retry-join` configuration
- DNS resolution problems
- Encryption key mismatch
Solutions:
- Verify network connectivity between nodes using `telnet` or `nc` (example below).
- Check firewall rules allow traffic on ports 8946 (Serf) and 8190 (Raft).
- Use IP addresses instead of hostnames in `retry-join` if DNS is unreliable.
- Ensure all nodes use the same encryption key if encryption is enabled.
- Check logs for specific error messages, e.g. `level=error ts=<timestamp> caller=agent.go:xxx`
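A quick connectivity check from a peer node; the hostname is illustrative:

```bash
# Verify the Serf and Raft ports are reachable from another node
nc -zv dkron-server-1 8946   # Serf gossip
nc -zv dkron-server-1 8190   # Raft
```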
Related: Issue #1727, Issue #1702
Why does retry-join fail after a server restart?
If using DNS names in `retry-join`, Dkron may cache the old IP addresses. Solutions:
- Use static IP addresses in the `retry-join` configuration (see the config sketch after this list).
- Ensure DNS updates propagate before restarting nodes.
- Set appropriate DNS TTL values for dynamic environments.
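A minimal server configuration sketch using static addresses; the IPs are illustrative:

```yaml
# dkron.yml
server: true
bootstrap-expect: 3
retry-join:
  - "10.0.1.10"
  - "10.0.1.11"
  - "10.0.1.12"
```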
Related: Issue #1253, Issue #1213
How do I recover from a "no leader" state after a restart?
In single-node mode:
```bash
# Check Raft peers
dkron raft list-peers

# If cluster state is corrupted, back up and remove the data directory
cp -r /var/lib/dkron /var/lib/dkron.backup
rm -rf /var/lib/dkron/*

# Restart dkron
```
In multi-node mode:
- Ensure at least `bootstrap-expect` servers are running.
- Check that all servers can communicate on the Raft port (8190).
- Verify there's no network partition.
- Review logs for Raft election errors.
Related: Issue #1529, Issue #1212
Why do servers have inconsistent job data?
Causes:
- Network partitions causing split-brain scenarios
- Raft replication issues
- Corrupted data on one or more nodes
Solutions:
- Ensure stable network connectivity between all server nodes.
- Verify Raft replication is working: `dkron raft list-peers`
- Check logs for Raft errors: `grep -i raft /var/log/dkron.log`
- In extreme cases, restore from backup or rebuild the cluster.
Related: Issue #1371, Issue #1337
Kubernetes and Container Deployments
Why does one Dkron pod fail with "panic: log not found" in Kubernetes?
This typically occurs during Raft initialization in StatefulSets when persistent volumes are not properly configured.
Solutions:
- Ensure persistent volumes are correctly configured and bound.
- Verify `bootstrap-expect` matches the number of server replicas.
- Check that all pods can resolve each other's DNS names.
- Use a `StatefulSet` with a headless service for stable network identities (see the manifest sketch after this list).
- Ensure proper initialization order with `podManagementPolicy: Parallel` or readiness probes.
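A heavily trimmed sketch of the headless-service/StatefulSet pattern, assuming three server replicas; the names, ports, image tag, data path, and storage size are illustrative, not a complete production manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: dkron
spec:
  clusterIP: None            # headless: gives each pod a stable DNS name
  selector:
    app: dkron
  ports:
    - name: serf
      port: 8946
    - name: raft
      port: 8190
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron
spec:
  serviceName: dkron
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: dkron
  template:
    metadata:
      labels:
        app: dkron
    spec:
      containers:
        - name: dkron
          image: distribworks/dkron:latest
          args:
            - agent
            - --server
            - --bootstrap-expect=3
            - --retry-join=dkron-0.dkron
            - --retry-join=dkron-1.dkron
            - --retry-join=dkron-2.dkron
          volumeMounts:
            - name: data
              mountPath: /dkron.data    # assumed data directory; match your data-dir setting
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```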
Related: Issue #1557
How do I configure log rotation in Docker?
Dkron writes logs to stdout/stderr by default in Docker. To manage logs:
Docker Compose:
```yaml
services:
  dkron:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```
Docker CLI:
```bash
docker run --log-driver json-file \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  distribworks/dkron:latest
```
To mount logs to the host:
```yaml
volumes:
  - ./logs:/var/log/dkron
```
Note that Dkron itself logs to stdout/stderr and the `log-level` option only controls verbosity, so to persist logs under the mounted path you must redirect the process output yourself.
Related: Issue #1556
Job Execution
Why does my shell job fail to execute?
Common causes:
- Shell not available in the container/environment
- Incorrect command syntax
- Missing environment variables
- Permission issues
Solutions:
- Verify a shell is available: `which sh` or `which bash`
- Use absolute paths for commands: `/bin/bash -c "command"`
- Set the `shell` option explicitly in the executor config:

```json
{
  "executor": "shell",
  "executor_config": {
    "command": "your-command",
    "shell": "true"
  }
}
```

- Check file permissions and ownership
- Review execution logs for specific error messages
Related: Issue #1775
Why does the "forbid" concurrency policy not prevent concurrent executions?
In distributed environments, there can be a race condition where multiple executions start before the concurrency check completes. This is a known limitation of distributed scheduling.
Workarounds:
- Implement application-level locking, e.g., using Redis, etcd, or database locks (see the sketch after this list).
- Increase job duration to reduce the likelihood of overlapping schedules.
- Use idempotent job designs that can handle concurrent executions safely.
- Add a small delay at job start to check for other running instances.
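For application-level locking, a hedged shell wrapper using a Redis lock; the key name, TTL, and job command are illustrative, and the unlock path is deliberately simplistic (it does not verify lock ownership before deleting):

```sh
#!/bin/sh
LOCK_KEY="locks:my-job"
TTL=300   # seconds; should exceed the job's worst-case runtime

# SET ... NX EX returns OK only if no other instance holds the key
if [ "$(redis-cli SET "$LOCK_KEY" "$(hostname)-$$" NX EX "$TTL")" = "OK" ]; then
  /usr/local/bin/my-job.sh
  redis-cli DEL "$LOCK_KEY" > /dev/null
else
  echo "another instance holds $LOCK_KEY, skipping" >&2
  exit 0
fi
```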
Related: Issue #1569
Why does `cmd.Process.Kill()` not clean up child processes?
On Unix systems, killing the parent process doesn't automatically kill child processes. This can leave orphaned processes running.
Solution: Use process groups to ensure all child processes are terminated:
```bash
# In your script, create a process group
set -m   # Enable job control
trap 'kill -- -$$' EXIT INT TERM
# Your commands here
```
Or use the `timeout` command:
```bash
timeout 300 your-long-running-command
```
Related: Issue #1385
Job Configuration
Why does `runoncreate` block the API call?
When `runoncreate` is set to `true`, the job creation API waits for the first execution to complete before returning. This is intentional behavior for validation purposes.
Workaround: If you need immediate API response:
- Create the job with `runoncreate: false`
- Make a separate API call to trigger execution: `POST /v1/jobs/{job_name}/run` (see the example below)
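Putting the two steps together with curl; the job body and name are illustrative, and the default HTTP port is assumed:

```bash
# 1. Create the job without triggering an immediate run
curl -X POST http://localhost:8080/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{"name":"my-job","schedule":"@every 10m","executor":"shell","executor_config":{"command":"true"}}'

# 2. Trigger the first execution explicitly; this call returns right away
curl -X POST http://localhost:8080/v1/jobs/my-job/run
```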
Related: Issue #1268
Web UI
Why does the UI show "Empty output" for running jobs?
The UI displays execution output after the job completes. For long-running jobs, output is not streamed in real-time.
Solutions:
- Wait for job completion to see output
- Use the `log` or `files` processor to write output to a log aggregation system
- Use the API to fetch execution details: `GET /v1/executions/{execution_id}`
- Check agent logs directly on the executing node
Related: Issue #1742
Why does a running job show "failed" status with strange finished_at date?
This is a UI display bug that has been addressed in recent versions. The job is still running, but the UI incorrectly shows it as failed.
Solution:
- Update to the latest version (4.x)
- Check the actual job status via the API: `GET /v1/jobs/{job_name}`
- Monitor execution status: `GET /v1/executions/{execution_id}`
Related: Issue #1817
Why is the job ID missing on small screens?
This was a responsive design issue in the Web UI that has been fixed in version 4.x.
Solution: Update to the latest version of Dkron.
Related: Issue #1583
Plugins and Executors
Why doesn't the GRPC executor close connections?
Older versions had connection leak issues. Solutions:
- Update to the latest version
- Implement connection pooling in your gRPC service
- Set appropriate timeout values in executor config
Related: Issue #1236
API
Why does the restore API not work?
The restore endpoint requires proper backup format and permissions.
Usage:
```bash
# Create backup
curl http://localhost:8080/v1/backup > backup.json

# Restore backup
curl -X POST http://localhost:8080/v1/restore \
  -H "Content-Type: application/json" \
  -d @backup.json
```
Note: Restore should be performed on a clean cluster or during maintenance windows.
Related: Issue #1376
Configuration and Setup
How do I set up Dkron in a dual data center configuration?
Dkron uses Raft consensus, which requires a majority of server nodes to be available. For multi-datacenter setups:
Option 1: Single Raft cluster across DCs (not recommended)
- High latency can cause leader election issues
- Network partitions can cause outages
Option 2: Separate clusters per DC (recommended)
- Run independent Dkron clusters in each DC
- Use external synchronization for job definitions (see the sync sketch after these options)
- Configure failover at the application level
Option 3: Use Pro version features
- Enhanced replication capabilities
- Better multi-datacenter support
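A hypothetical one-way sync of job definitions from a primary to a secondary cluster via the REST API; the hostnames are illustrative and `jq` is required:

```bash
# Export all job definitions from the primary cluster
curl -s http://dkron-dc1:8080/v1/jobs > jobs.json

# Create or update each job on the secondary cluster by name
jq -c '.[]' jobs.json | while read -r job; do
  curl -s -X POST http://dkron-dc2:8080/v1/jobs \
    -H "Content-Type: application/json" \
    -d "$job" > /dev/null
done
```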
Related: Issue #1363
What are the resource requirements for Dkron?
Requirements vary based on:
- Number of jobs
- Job execution frequency
- Cluster size
- Execution history retention
Minimum recommendations:
- Server nodes: 1-2 CPU cores, 512MB-1GB RAM
- Agent nodes: 1 CPU core, 256MB-512MB RAM
Production recommendations:
- Server nodes: 2-4 CPU cores, 2-4GB RAM
- Agent nodes: 2 CPU cores, 1-2GB RAM
- Storage: 10GB+ for execution history
Scale vertically for the leader node and horizontally for agents.
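In Kubernetes, the production numbers above translate into a pod resources fragment roughly like this (a sketch; tune to your workload):

```yaml
resources:
  requests:
    cpu: "2"
    memory: 2Gi
  limits:
    cpu: "4"
    memory: 4Gi
```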
Related: Issue #1560
Troubleshooting
How do I debug "invalid memory address or nil pointer dereference" errors?
This panic typically indicates a bug in Dkron or a plugin. Steps to debug:
- Update to the latest version - Many nil pointer issues have been fixed
- Check logs for the full stack trace
- Identify the context - What operation was being performed?
- Reproduce - Can you consistently trigger the issue?
- Report - If it persists, file a GitHub issue with:
- Dkron version
- Configuration
- Steps to reproduce
- Full logs and stack trace
Related: Issue #1388, Issue #1265
Where can I find more help?
- Documentation: https://dkron.io/docs
- GitHub Issues: https://github.com/distribworks/dkron/issues
- Discussions: https://github.com/distribworks/dkron/discussions
- Commercial Support: Available for Dkron Pro - see https://dkron.io/pro
Best Practices
General recommendations for production deployments:
- High Availability: Run at least 3 server nodes (odd number for Raft quorum)
- Monitoring: Enable Prometheus metrics and set up alerts (a scrape sketch follows this list)
- Backups: Regularly backup job configurations using the backup API
- Updates: Keep Dkron updated to get bug fixes and performance improvements
- Resource Limits: Set appropriate memory/CPU limits in container environments
- Job Design:
- Use idempotent operations
- Implement proper error handling
- Keep job execution time reasonable (< 1 hour)
- Avoid very high-frequency schedules (< 1 minute)
- Security:
- Enable TLS for API communication
- Use firewall rules to restrict access
- Consider using ACLs (Pro version)
- Logging: Use processors (log, files, syslog) to persist job output
- Testing: Test job configurations in a staging environment first
- Documentation: Document your job configurations and dependencies
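For the monitoring recommendation above, a minimal Prometheus scrape sketch, assuming metrics are enabled in your Dkron configuration and served on the HTTP port; the target address is illustrative:

```yaml
# prometheus.yml fragment
scrape_configs:
  - job_name: dkron
    static_configs:
      - targets: ["dkron-server:8080"]
```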