https://bugzilla.wikimedia.org/show_bug.cgi?id=57613
--- Comment #4 from Tyler Romeo <[email protected]> ---
OK, here's my proposal:

* A cluster of masters, with a maximum of 10 or so, which can double as workers.
* One or more masters can manage a cluster of workers. The collection of masters can use Raft to select which will be the leader and which will be failovers.
* Tasks will be specified in an official crontab that will be synced across the masters. The crontab can be changed on one server, and a CLI tool will trigger a sync.
* The crontab will specify what task to execute, how often, any restrictions on which machine the task must be executed on, whether the task accesses external networks, and whether the task is idempotent and/or atomic.
* The masters use Paxos to claim tasks for execution and to determine the official crontab specification. The master of a cluster will decide which workers will execute which tasks.
* Workers will report back to the master when a task finishes or fails. Masters will report back to the network when a task finishes or fails. How failures are handled depends on the specification of the task.
* We can use Go (and its Raft implementation) for the daemon, and ZeroMQ as a means of message sending.

The advantages of this method are:

1) workers and masters can both fail and somebody will still be able to execute the task,
2) it doesn't require a large number of nodes since masters can double as workers, which means a really small cluster can have just masters, and each master treats itself as its worker cluster,
3) it scales to larger clusters since the hierarchy of masters allows for better handling of network partitions.

The one problem is if a task that is not idempotent is executed, but then a network partition occurs and the network cannot determine whether the task finished or not. The task cannot be re-executed because it might cause unwanted side effects, and the network cannot simply wait for the issue to resolve itself, since Paxos assumes messages can take arbitrarily long to deliver.
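
To make the failure-handling rule concrete, here is a rough Go sketch of how a master might decide what to do once a task's outcome is reported (or lost). Everything in it is hypothetical and only for illustration: the `TaskSpec` field names, the `Outcome` and `Action` values, and the `NextAction` function are not part of any existing code. It also makes two assumptions beyond what is stated above: that a failed atomic task left no partial effects and so can be retried, and that a task which cannot safely be retried is escalated to an operator.

```go
package main

import (
	"fmt"
	"time"
)

// TaskSpec mirrors the crontab fields listed in the proposal.
// All field names are hypothetical.
type TaskSpec struct {
	Command         string        // what task to execute
	Every           time.Duration // how often
	AllowedHosts    []string      // machine restrictions; empty means any machine
	ExternalNetwork bool          // whether the task accesses external networks
	Idempotent      bool          // safe to run more than once
	Atomic          bool          // either fully happens or leaves no effects
}

// Outcome is what the cluster knows about a single run.
type Outcome int

const (
	Succeeded Outcome = iota
	Failed
	Unknown // e.g. a partition hid the result from the masters
)

// Action is what the master does next.
type Action string

const (
	Done     Action = "done"
	Retry    Action = "retry"
	Escalate Action = "escalate" // assumption: hand the decision to an operator
)

// NextAction picks a follow-up based on the task specification.
func NextAction(t TaskSpec, o Outcome) Action {
	switch o {
	case Succeeded:
		return Done
	case Failed:
		// Assumption: a failed atomic task left nothing behind,
		// so it is as safe to re-run as an idempotent one.
		if t.Idempotent || t.Atomic {
			return Retry
		}
		return Escalate
	default:
		// Outcome unknown: the task may or may not have completed,
		// so only an idempotent task can be re-executed safely.
		if t.Idempotent {
			return Retry
		}
		return Escalate
	}
}

func main() {
	purge := TaskSpec{Command: "purge-cache", Every: time.Hour, Idempotent: true}
	digest := TaskSpec{Command: "send-digest", Every: 24 * time.Hour, ExternalNetwork: true}

	fmt.Println(purge.Command, NextAction(purge, Unknown))
	fmt.Println(digest.Command, NextAction(digest, Unknown))
}
```

The second call is the problem case described above: the task is not idempotent and its outcome is unknown, so the sketch refuses to re-run it and falls through to `Escalate` rather than guessing.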

Another issue is that it does not handle Byzantine failures, although with any luck that should not be an issue for this daemon.

Thoughts?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
