Narayan Desai wrote:
> On Thu, 16 Jul 2009 12:16:14 -0400 Doug Hughes wrote:
>
> Doug> Narayan Desai wrote:
> Doug> > On Thu, 16 Jul 2009 11:15:48 -0400 Edward Ned Harvey wrote:
> Doug> >
> Doug> > Ned> > I am interested in soliciting experiences deploying, using and
> Doug> > Ned> > maintaining the Condor batch processing system, especially under
> Doug> > Ned> > Linux / Debian.
> Doug> > Ned> >
> Doug> > Ned> > Our use would predominantly be many small jobs, rather than a few
> Doug> > Ned> > large jobs, with runtimes measured in a few hours. Probably only a
> Doug> > Ned> > handful of nodes, on the order of half a dozen, in total.[1]
> Doug> >
> Doug> > Ned> I don't know anything about condor, or torque. The obvious
> Doug> > Ned> choice to me would be SGE. I wonder what advantage there is to
> Doug> > Ned> using something other than SGE?
> Doug> >
> Doug> > Well, the area where condor is pretty much the undisputed king is in the
> Doug> > scavenger arena. The basic idea is that you could deploy condor on top
> Doug> > of your regular desktops and jobs would be deployed to use wasted
> Doug> > cycles (during idle periods or on a set schedule, etc). -nld
> Doug> >
> Doug> Doesn't it also excel at the whole state/migration thing? E.G. you can
> Doug> take a node out for maintenance and migrate a running job off to
> Doug> another node by saving the memory state and performing the migration
> Doug> and then resuming the job. (May only work for some job configurations)
>
> So I hear. I don't have any direct experience with the
> checkpointing/migration stuff. I gather they are starting to use VMs for
> this sort of thing as well as library-based checkpointing.
> -nld
>

This depends on the purpose of the batch jobs. If you're looking for simple load sharing/cloud computing, we've used LSF in our engineering environment for a long time.
It has the option of consuming unused desktop cycles, but we found this to be unreliable and problematic - not because LSF was bad, but because individuals had messed around with their desktops in such a way as to mangle any jobs distributed to them. If all you need is distributed compiles, even distcc is an excellent way to spread them across a bunch of machines (I even use it at home for this).
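To make the "unused desktop cycles" idea concrete, here is a minimal Python sketch of the kind of eligibility check a scavenging setup might perform: accept jobs inside a set schedule, or whenever the desktop looks idle. This is not how LSF or Condor implement it; `DesktopState`, `may_scavenge`, the thresholds and the 19:00-07:00 window are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class DesktopState:
    """Snapshot of one desktop. All fields are hypothetical, for illustration."""
    hostname: str
    idle_minutes: float   # minutes since the last keyboard/mouse activity
    load_average: float   # 1-minute load average


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` is in [start, end), allowing windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def may_scavenge(
    state: DesktopState,
    now: datetime,
    min_idle_minutes: float = 15.0,           # assumed threshold
    max_load: float = 0.3,                    # assumed threshold
    window: tuple = (time(19, 0), time(7, 0)),  # assumed overnight schedule
) -> bool:
    """Accept a job inside the scheduled window, or when the desktop is idle."""
    if in_window(now.time(), *window):
        return True
    return (state.idle_minutes >= min_idle_minutes
            and state.load_average <= max_load)
```

For example, `may_scavenge(DesktopState("ws01", 30.0, 0.1), datetime(2009, 7, 16, 14, 0))` accepts the job because the desktop has been idle long enough, even though it is mid-afternoon. A check like this says nothing about whether the desktop is still configured sanely, which is the part that bit us.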
If the batch jobs are for the purpose of performing functions on particular machines, then you're not looking for a load distribution facility, you're looking for more traditional batch execution. The commercial players in this field are companies like Autosys, BMC, Orsyp, Tidal(*), and such. These products schedule (with very complex calendars and conditions, when necessary) jobs on particular machines (and some of them can load balance as well).

I work in a group whose main purpose is to provide automation, especially for the batch processing environment at $WORK. You're welcome to ping me - here on the list or privately - if you would like more help.

- Richard

[ In the interests of full disclosure, $WORK recently acquired (*) - but I'm not a sales person - I don't even play one on TV! ]

_______________________________________________
Tech mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
