[Bug 57613] Distributed cron replacement

bugzilla-daemon Wed, 27 Nov 2013 15:23:19 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=57613


--- Comment #4 from Tyler Romeo <[email protected]> ---
OK, here's my proposal:

* A cluster of masters, with a maximum of 10 or so, which can double as workers

* One or more masters can manage a cluster of workers. The collection of
masters can use Raft to select which will be the leader and which will be
failovers.

* Tasks will be specified in an official crontab that will be synced across the
masters. The crontab can be changed on one server, and a CLI tool will trigger
a sync.

* The crontab will specify what task to execute, how often, any restrictions on
which machines may execute the task, whether the task accesses external
networks, and whether the task is idempotent and/or atomic.

* The masters use Paxos to claim tasks for execution and to determine the
official crontab specification. The master of a cluster will decide which
workers will execute which tasks.

* Workers will report back to the master when a task finishes or fails. Masters
will report back to the network when a task finishes or fails. How failures are
handled depends on the specification of the task.

* We can use Go (and one of its Raft implementations) for the daemon, and
ZeroMQ for message passing.
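
To make the crontab bullet more concrete, here is a rough Go sketch of what one
entry and the master's worker selection might look like. Everything in it is
hypothetical: the TaskSpec type, its field names, the CanRunOn and PickWorker
helpers, and the cron-style schedule string are assumptions for illustration,
not part of any existing tool.

```go
package main

import "fmt"

// TaskSpec is a hypothetical in-memory form of one crontab entry.
// The fields mirror the properties listed in the proposal.
type TaskSpec struct {
	Command         string   // what task to execute
	Schedule        string   // how often (classic cron syntax is assumed here)
	AllowedHosts    []string // machine restriction; empty means any machine
	ExternalNetwork bool     // whether the task accesses external networks
	Idempotent      bool     // safe to run more than once
	Atomic          bool     // either fully happens or leaves no side effects
}

// CanRunOn reports whether host satisfies the task's machine restriction.
func (t TaskSpec) CanRunOn(host string) bool {
	if len(t.AllowedHosts) == 0 {
		return true
	}
	for _, h := range t.AllowedHosts {
		if h == host {
			return true
		}
	}
	return false
}

// PickWorker returns the first worker that is allowed to run the task.
// A real master would also weigh load and health; this only checks hosts.
func PickWorker(t TaskSpec, workers []string) (string, bool) {
	for _, w := range workers {
		if t.CanRunOn(w) {
			return w, true
		}
	}
	return "", false
}

func main() {
	task := TaskSpec{
		Command:      "echo hello",
		Schedule:     "*/5 * * * *",
		AllowedHosts: []string{"worker2"},
		Idempotent:   true,
	}
	w, ok := PickWorker(task, []string{"worker1", "worker2"})
	fmt.Println(w, ok) // prints "worker2 true"
}
```

In a masters-only cluster, the workers slice would simply contain the master's
own hostname.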

The advantages of this method are: 1) workers and masters can both fail and
some node will still be able to execute the task; 2) it doesn't require a large
number of nodes, since masters can double as workers, so a really small
cluster can consist of just masters, each treating itself as its own worker
cluster; 3) it scales to larger clusters, since the hierarchy of masters allows
for better handling of network partitions.

The one problem is when a task that is not idempotent is executed, but then a
network partition occurs and the network cannot determine whether the task
finished. The task cannot be re-executed because that might cause unwanted
side effects, and the network cannot simply wait for the issue to resolve
itself, since Paxos assumes messages can take arbitrarily long to deliver.
Another issue is that this design does not handle Byzantine failures, although
with any luck that should not be an issue for this daemon.
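
One possible way to frame the failure handling is as a small decision table
keyed on the task's flags. The Go sketch below is only an assumption about how
that could look: the Outcome, Action, and TaskFlags types and the Decide
function are hypothetical, and "escalate" (hold the task and flag it for a
human) is one policy choice among several, not a solution to the partition
problem itself.

```go
package main

import "fmt"

// Outcome is what a master knows about a task run.
type Outcome int

const (
	Finished Outcome = iota // worker reported success
	Failed                  // worker reported failure
	Unknown                 // no report arrived, e.g. due to a partition
)

// Action is what the master does next.
type Action int

const (
	Done     Action = iota // nothing more to do
	Retry                  // safe to execute the task again
	Escalate               // do not re-run; hold for an operator
)

func (a Action) String() string {
	switch a {
	case Done:
		return "done"
	case Retry:
		return "retry"
	default:
		return "escalate"
	}
}

// TaskFlags carries the two properties from the crontab entry.
type TaskFlags struct {
	Idempotent bool
	Atomic     bool
}

// Decide picks an action from the outcome and the task's flags.
func Decide(o Outcome, f TaskFlags) Action {
	switch o {
	case Finished:
		return Done
	case Failed:
		// A reported failure is safe to retry if re-running is harmless
		// (idempotent) or the failed run left nothing behind (atomic).
		if f.Idempotent || f.Atomic {
			return Retry
		}
		return Escalate
	default: // Unknown
		// The run may have completed, so only idempotent tasks are re-run.
		if f.Idempotent {
			return Retry
		}
		return Escalate
	}
}

func main() {
	fmt.Println(Decide(Unknown, TaskFlags{Idempotent: true})) // prints "retry"
	fmt.Println(Decide(Unknown, TaskFlags{}))                 // prints "escalate"
}
```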

Thoughts?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
