[Bug 57613] Distributed cron replacement

2014-09-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Yuvi Panda  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 CC||yuvipa...@gmail.com
 Resolution|--- |FIXED

--- Comment #31 from Yuvi Panda  ---
Closing as fixed since cron just submits to the grid on toollabs now.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-09-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Andre Klapper  changed:

   What|Removed |Added

   Keywords|tracking|

--- Comment #30 from Andre Klapper  ---
[not tracking any other tickets; removing keyword]

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #29 from Tyler Romeo  ---
(In reply to comment #27)
> Do any of the following provide a solution, or something that could be used
> as
> a base for a solution?
> 
> * Chronos - http://nerds.airbnb.com/introducing-chronos/
> 
> * Quora's distributed chron -
> http://engineering.quora.com/Quoras-Distributed-Cron-Architecture
> 
> * The Amazon approach:
> http://stackoverflow.com/questions/10061843/how-to-convert-linux-cron-jobs-
> to-the-amazon-way

The Quora approach looks half-assed considering they use MySQL anyway.

Chronos, on the other hand, looks really interesting and applicable. I pointed
out the possibility of using Apache Mesos as a framework for this project
above, but it seems Chronos has beat me to the point.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #28 from John Broughton  ---
(In reply to comment #21)

https://www.mediawiki.org/wiki/Facebook_Open_Academy/Cron

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

John Broughton  changed:

   What|Removed |Added

 CC||johnbrough...@comcast.net

--- Comment #27 from John Broughton  ---
Do any of the following provide a solution, or something that could be used as
a base for a solution?

* Chronos - http://nerds.airbnb.com/introducing-chronos/

* Quora's distributed chron -
http://engineering.quora.com/Quoras-Distributed-Cron-Architecture

* The Amazon approach:
http://stackoverflow.com/questions/10061843/how-to-convert-linux-cron-jobs-to-the-amazon-way

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-21 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #26 from Tyler Romeo  ---
(In reply to comment #25)
> cronie - as used on the Toolserver for quite some time now - offers an option
> "-c" for its crond that enables cluster behaviour.  With that, you can share
> /var/spool/cron for example via NFS.  /var/spool/cron/.cron.hostname contains
> the name of the current host that executes jobs.  On failover, you set it to
> another (remaining) crond host.

Ah, just found it. Thanks.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-21 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #25 from Tim Landscheidt  ---
(In reply to comment #24)
> [...]
> Could you possibly explain how cronie is used for cluster execution? Because
> I
> cannot find any documentation on it, nor do I see any evidence of it in the
> source code. (To the contrary, the README specifically says cronie is *not*
> redundant.)

Which README are you referring to?

cronie - as used on the Toolserver for quite some time now - offers an option
"-c" for its crond that enables cluster behaviour.  With that, you can share
/var/spool/cron for example via NFS.  /var/spool/cron/.cron.hostname contains
the name of the current host that executes jobs.  On failover, you set it to
another (remaining) crond host.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-21 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #24 from Tyler Romeo  ---
(In reply to comment #23)
> I'm with the NIH critics.  cronie already provides crontabs on a cluster (->
> redundancy), and that can be used with SGE (or anything else) for load
> distribution and resource management.  Plus, it has a userbase in the
> millions.
> 
> The only problem at the moment is that Ubuntu doesn't ship it (cf.
> https://blueprints.launchpad.net/ubuntu/+spec/security-karmic-replace-cron).
> 
> So if the scope of this bug is to package cronie for Ubuntu (or pimp systemd
> or
> upstart for clusters), hooray, but if we end up with yet-another something
> with
> new bugs and nobody maintaining it, I'd rather pass on that.

Could you possibly explain how cronie is used for cluster execution? Because I
cannot find any documentation on it, nor do I see any evidence of it in the
source code. (To the contrary, the README specifically says cronie is *not*
redundant.)

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-21 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Tim Landscheidt  changed:

   What|Removed |Added

 CC||t...@tim-landscheidt.de

--- Comment #23 from Tim Landscheidt  ---
I'm with the NIH critics.  cronie already provides crontabs on a cluster (->
redundancy), and that can be used with SGE (or anything else) for load
distribution and resource management.  Plus, it has a userbase in the millions.

The only problem at the moment is that Ubuntu doesn't ship it (cf.
https://blueprints.launchpad.net/ubuntu/+spec/security-karmic-replace-cron).

So if the scope of this bug is to package cronie for Ubuntu (or pimp systemd or
upstart for clusters), hooray, but if we end up with yet-another something with
new bugs and nobody maintaining it, I'd rather pass on that.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-15 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Pavel (pastakhov)  changed:

   What|Removed |Added

 CC||pastak...@yandex.ru

--- Comment #22 from Pavel (pastakhov)  ---
I may not understand the whole essence of the problem...
Why you can not start the task over ssh on servers with a single cron?
Do you suit a script that gets the task and the list of servers, then runs the
task on all these servers and waiting for their execution?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #21 from MZMcBride  ---
(In reply to comment #17)
> This project has been picked up by the Facebook Open Academy.

Is there a place (mailing list, wiki page, etc.) to follow the progress of
this?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #20 from Tyler Romeo  ---
Also my friend suggested using something like 
to manage distributed state and then just having the cluster machines check
that/

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2014-01-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Tyler Romeo  changed:

   What|Removed |Added

 CC||tylerro...@gmail.com

--- Comment #19 from Tyler Romeo  ---
I strongly agree with Ryan that we should not be writing something from scratch
unless we absolutely have to. I haven't used the Sun Grid Engine, but it looks
like an opportunity. There's also Apache Mesos ,
which doesn't solve the whole problem but provides some sort of architecture.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #18 from Seb35  ---
This sounds me like the job of a scheduler. Aside the old Sun Grid Engine, I
briefly used as a user OAR, which is used in research context for HPC
applications (e.g. Grid'5000 is a partner). Compared to the former it seems
more "distributed" as you want: it claims to be multi-scheduler, and obviously
the tasks/jobs are distributed accross a cluster and each unique task can be
itself distributed/parallel. As a user it was more flexible than the old SGE,
and it seems the admins have many tools to supervise the whole system.
http://oar.imag.fr

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-25 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Marc A. Pelletier  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #17 from Marc A. Pelletier  ---
This project has been picked up by the Facebook Open Academy.

Part of the first "step" the students will be tasked with is to clearly define
the scope and requirements and investigate whether there are other open source
components that can be leveraged or built upon - and it seems clear that there
may well be.

I did do an extensive investigation myself to find if something could be used
as-is (and nothing fits the bill) but there are a number of partial solutions
out there that can be built upon.  Good design also means knowing when not to
design new wheels every time you want a car; and IMO doing that exercise is, in
itself, a valuable part of the project.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #16 from Quim Gil  ---
On whether this should be a project for Facebook Open Academy program or not:
the project could be simply based on the definition of the problem to solve,
which doesn't seem to be currently covered by any piece of software that you
can install and run. 

The project could start with an investigation of the possible solutions,
including all the options mentioned here. The actual development doesn't start
until February, so there is time to discuss the solution to implement.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #15 from Antoine "hashar" Musso  ---
Another feature that would be quite nice is the ability to define batch of
tasks. 

At a previous job we used IBM Tivoli Workload Scheduler. I never used directly
myself since I was merely giving scenario to operations.  On each server you
have an agent which is controlled from a central box, it runs the scenario
tasks by tasks and can revert back if something fails.   Most batches being run
overnight, operations ends up in the morning with a nice report of tasks that
passed/failed.


Anyway, it would be nice to start out a page listing the requirements, there
are some listed in the comments above already.  Then we can got hunt for an
open source software that would support them.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #14 from Ryan Lane  ---
native c probably wouldn't be a sane language for this. Something high level
would be good. It's unlikely that raw speed is a major need here.

Anyway, I still think this is a problem that's being overkilled. There's no
problem having a SPOF for something that could simply schedule things onto the
grid.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Peter Bena  changed:

   What|Removed |Added

 CC||benap...@gmail.com

--- Comment #13 from Peter Bena  ---
This sounds like a great idea. I wanted this as well for some time. No matter
of result of facebook thing, if you decide to start developing this, ping me I
would like to contribute myself as long as its going to be written in a sane
language (like native c)

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

MZMcBride  changed:

   What|Removed |Added

 CC||b...@mzmcbride.com

--- Comment #12 from MZMcBride  ---
(In reply to comment #11)
> I expect there is a I'm failing to express the concept properly.  Let's try
> again.  There is need for something that:
> 
> * allows endusers to schedule things to run arbitrary snippets of code at
> specified times or intervals (with their privileges); (using crontabs is a
> convenience bonus, not a requirement)
> * does not rely on a single non-redundant daemon (no SPOF); and
> * does not rely on a complex infrastructure or set of dependencies (like an
> entire cluster manager system).

I'm reminded a little of [[Sun Grid Engine]], which the Toolserver used to use.

I'd suggest writing an [[mw:RFC]] with hard and soft requirements: Bugzilla
feels like the wrong forum to suss out requirements.

As Ryan L. and Faidon have suggested, it's possible something like this already
exists, though I don't know of anything off-hand. Rather than starting from
scratch, I wonder whether, for example, modifying (Vixie) cron to support
multiple hosts would make sense.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #11 from Marc A. Pelletier  ---
I expect there is a I'm failing to express the concept properly.  Let's try
again.  There is need for something that:

* allows endusers to schedule things to run arbitrary snippets of code at
specified times or intervals (with their privileges); (using crontabs is a
convenience bonus, not a requirement)
* does not rely on a single non-redundant daemon (no SPOF); and
* does not rely on a complex infrastructure or set of dependencies (like an
entire cluster manager system).

The problem with Jenkins or Gearman isn't that they don't do this perfectly --
it's that they do *none* of this at all.  They're not a solution rejected
because NIH, they're just not a solution at all (at least not to this use
case).

At best, they could be used as components of a solution (though, brittle and
overcomplicated (for the task) as it is, Jenkins doesn't seem promising even
for that).

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Faidon Liambotis  changed:

   What|Removed |Added

 CC||fai...@wikimedia.org

--- Comment #10 from Faidon Liambotis  ---
I'd like to echo Ryan here, this proposal sounds very NIH to me in nature.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #9 from Marc A. Pelletier  ---
(In reply to comment #8)
> Should this project be still listed in that page?

I'm not sure what the usual workflow is for projects of the sort.  I'm
certainly willing to mentor, and it's certainly in-scope for a session project
for a small team.

Quim?  What do you opine our next step should be?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-12-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

vladjohn2...@gmail.com changed:

   What|Removed |Added

 CC||vladjohn2...@gmail.com

--- Comment #8 from vladjohn2...@gmail.com ---
Hi, this project is still listed at 
https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Distributed_cron_replacement
 

Should this project be still listed in that page? If not, please remove it. If
it still makes sense, then it could be moved to the "Featured projects" section
if it has community support and mentors.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-11-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #7 from Marc A. Pelletier  ---
(In reply to comment #6)
> Is there really something wrong with gearman
> or jenkins?

The fact that neither of them solves the stated problem set?  (Well, I suppose
Jenkins could be coerced into doing something similar by combining its job
monitoring with some manually crafted schedules; but that's sorta like using a
car to crack nuts by driving over them -- it'll /work/ but it's hardly the best
solution).  Also, Jenkins is exactly as much a SPOF just running cron is; in
which case it provides no benefit beyond the ability to farm out the actual
run.

I'm not even sure where you see Gearman as even related.  It's a fairly nice
distributed RPC-like dispatcher, but still requires a process to schedule and
start jobs which leads us back to the initial problem.

Neither of those provide any help with the stated objective of "make sure X
happens at time T (or interval I)".

Chronos gets close to the stated objective but is (a) highly complex to use and
(b) has several boatloads of heavy external dependencies.

This is not a new problem; and it's still actively discussed.  May places have
constructed workarounds[2] and ad-hoc systems to do just that[3]; I'm hoping we
can make a solution that is applicable to the general case.

[1] http://nerds.airbnb.com/introducing-chronos/
[2] http://kvz.io/blog/2012/12/31/lock-your-cronjobs/
[3] http://act.yapc.eu/lpw2012/talk/4291

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-11-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #6 from Ryan Lane  ---
I don't understand why we need to write something from scratch for this, or why
it needs to be fully decentralized. Decentralized things are *hard*, especially
when building them from scratch. Is there really something wrong with gearman
or jenkins?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-11-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #5 from Marc A. Pelletier  ---
I think that's a sound approach; my own initial brain experiments were mostly
centered about a network of peers (similar to your master-only scenario); but I
agree that separating master vs worker conceptually is the best approach (even
if, in practice, masters will also be workers in small installations).

The problem with non-idempotents jobs being potentially executed more than one
time, I think, is manageable:

1) have a means to mark a job as having side effects so that
2) have a deterministic map for jobs to exactly one (random) worker

such that a job marked as having side effects and absolutely needing exactly
one run can only be scheduled on the mapped node and will simply fail
otherwise.  This way, in case of network partition, only the masters that can
still speak to the selected node would even attempt the run.

This trades the risk of the run being executed more than once with the risk
that it isn't executed at all -- but since it's selectable the user can make
the appropriate tradeoff depending on what is best for the specific situation.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-11-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #4 from Tyler Romeo  ---
OK, here's my proposal:

* A cluster of masters, with a maximum of 10 or so, which can double as workers

* One or more masters can manage a cluster of workers. The collection of
masters can use Raft to select which will be the leader and which will be
fallovers.

* Tasks will be specified in an official crontab that will be synced across the
masters. The crontab can be changed on one server, and a CLI tool will trigger
a sync.

* The crontab will specify what task to execute, how often, any restrictions on
which machine the task must be executed, whether the task access external
networks, and whether the task is idempotent and/or atomic.

* The masters use Paxos to claim tasks for execution and to determine the
official crontab specification. The master of a cluster will decide which
workers will execute which tasks.

* Workers will report back to the master when a task finishes or fails. Masters
will report back to the network when a task finishes or fails. How failures are
handled depends on the specification of the task.

* We can use Go (and its Raft implementation) for the daemon, and ZeroMQ as a
means of message sending.

The advantages of this method are: 1) workers and masters can both fail and
somebody will still be able to execute the task, 2) it doesn't require a large
number of nodes since masters can double as workers, which means a really small
cluster can have just masters, and each master treats itself as its worker
cluster, 3) it scales to larger clusters since the hierarchy of masters allows
for better handling of network partitions.

The one problem is if a task that is not idempotent is executed, but then a
network partition occurs and the network cannot determine if the task finished
or not. The task cannot be re-executed because it might cause unwanted side
effects, and the network cannot wait for the issue to resolve itself, since
Paxos assumes messages take arbitrarily long to deliver. Another issue is that
it does not handle byzantine failures, although with any luck that should not
be an issue for this daemon.

Thoughts?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-11-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #3 from Tyler Romeo  ---
The specification for this needs a bit more. Some questions:

* How distributed? GitHub is "distributed", but at the same time centralized.
Do we just want the task execution to be distributed, or will the entire system
be distributed? IMO the best solution will most likely be something that's
completely distributed.

* Where are tasks consumed from? The description says it needs to be
configurable by traditional crontabs. Does this mean a single crontab on one
computer is enough to schedule a task for the system? Or should it be entered
on every crontab on every computer? Neither of these are the best options, and
I don't think crontab configuration in such an environment is the most
practical idea.

* Are the tasks themselves distributed? Or will a task be isolated to a single
machine? With most tasks I'd imagine the latter.

If we assume the sort of answers I gave to the questions are correct, then this
will be a fun task to undertake. The network would have to be configured as a
full mesh, and we'd likely use something like Raft to determine consensus
(Paxos works as well, but since Raft has leader election it's probably a better
candidate, not to mention it has a Go implementation).

However, consensus protocols generally do not work well with large clusters,
which mean we'd need to implement some sort of task sharding so that smaller
clusters can operate independently. This brings to mind a sort of worker
approach, where a consensus algorithm is performed on the five or six masters,
who then hand off tasks to a worker cluster they own. This would still allow
for failover and network partitions while still being performant. But that also
brings up the problem of how the master will handle failures in its own worker
cluster.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-11-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

--- Comment #2 from Quim Gil  ---
Facebook Open Academy program [1] is interested in this project. We need to
decide whether we commit to mentor it, and how many students with which skills
are we looking for.

Note that this isn't a full-time program like Google Summer of Code. Each
student will be working on this project about 5-8 a week between February and
June approximately, as if it would a university course.

[1] http://lists.wikimedia.org/pipermail/wikitech-l/2013-November/073226.html

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 57613] Distributed cron replacement

2013-11-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=57613

Ryan Lane  changed:

   What|Removed |Added

 CC||rlan...@gmail.com

--- Comment #1 from Ryan Lane  ---
Jenkins? Gearman?

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l