Lane Schwartz <[email protected]> writes:

> On Fri, Mar 11, 2011 at 6:09 PM, Dave Love <[email protected]> wrote:
>> What's the attraction of the Condor stuff over BLCR or DMTCP?
>
> I don't have any prior experience with checkpointing. I simply
> searched around, and Condor seemed to be the easiest and most
> straightforward to setup.

Don't you need to rebuild the application with it, as the doc suggests,
or can you do it with LD_PRELOAD? DMTCP, on the other hand, works in
user space with unaltered applications (with certain restrictions, like
only socket-based communication, as far as I remember). I haven't yet
tried to set it up under SGE, though, so I have no experience to offer.
BLCR works reasonably, especially with some MPIs, but you have to
maintain the kernel module.

> If you have any experience or recommendations wrt other checkpointing
> methods, I would be very interested in your perspective.

As you're using the proprietary version, presumably you can get advice
from Oracle.

> My approach so far has been to try using Condor and assign priorities
> by changing -js job share. If there is a better way, I'd love to hear
> about it.

Job share is probably appropriate, but you can also fiddle with the
priority -- which is why a default negative priority (e.g. -p -100) is
recommended: ordinary users can't raise a job's priority above 0, so a
negative default leaves them room to move jobs up as well as down.
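
For illustration, here is a rough sketch of how the two knobs might look on the command line. It only prints the qsub invocations rather than running them, since qsub may not be on the PATH where you try this. The script name job.sh, the share value 10 and the helper submit_cmd are made up for the example; the -100 default is assumed to come from a "-p -100" line in the cluster's sge_request file.

```shell
#!/bin/sh
# Sketch only: echoes qsub command lines instead of submitting anything.
# job.sh, the share value 10 and submit_cmd are hypothetical.

# Assumed site default, e.g. "-p -100" in the cluster-wide sge_request file.
DEFAULT_PRIORITY=-100

# Build a qsub command line from a priority (-p), a job share (-js)
# and a job script.
submit_cmd() {
    prio=$1
    share=$2
    script=$3
    echo "qsub -p $prio -js $share $script"
}

# An ordinary job at the assumed default priority and the default share of 0.
submit_cmd "$DEFAULT_PRIORITY" 0 job.sh

# A job the user wants favoured: priority raised to 0, larger job share.
submit_cmd 0 10 job.sh
```

With a default of 0 the second job could only be distinguished by pushing everything else down; the negative default is what makes "raise this one" possible for a non-manager.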
