Lane Schwartz <[email protected]> writes:

> On Fri, Mar 11, 2011 at 6:09 PM, Dave Love <[email protected]> wrote:
>> What's the attraction of the Condor stuff over BLCR or DMTCP?
>
> I don't have any prior experience with checkpointing. I simply
> searched around, and Condor seemed to be the easiest and most
> straightforward to set up.

Don't you need to rebuild the application with it, as the doc suggests,
or can you do it with LD_PRELOAD?

On the other hand, DMTCP works in user space with unaltered applications
(with certain restrictions, like only socket-based communication as far
as I remember).  I haven't yet tried to set it up under SGE, though, so
have no experience to offer.  BLCR works reasonably, especially with
some MPIs, but you have to maintain the kernel module.

> If you have any experience or recommendations wrt other checkpointing
> methods, I would be very interested in your perspective.

As you're using the proprietary version, presumably you can get advice
from Oracle.

> My approach so far has been to try using Condor and assign priorities
> by changing -js job share. If there is a better way, I'd love to hear
> about it.

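Job share is probably appropriate, but you can also fiddle with the
priority -- which is why a default negative priority (e.g. -p -100) is
recommended: ordinary users can only lower a job's priority, so a
negative default leaves them room to raise some jobs relative to others.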
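
As a rough sketch of how that might look, assuming the cluster-wide
default request file is at $SGE_ROOT/$SGE_CELL/common/sge_request (the
usual location, but check your installation). The script names and the
particular priority and share values are only placeholders:

```
# In $SGE_ROOT/$SGE_CELL/common/sge_request (assumed path):
# give every job a negative default priority.
-p -100

# At submission time, a job that should rank below the default:
qsub -p -200 -js 1 low.sh

# A job that should rank above the default. 0 is as high as a
# non-admin user can go.
qsub -p 0 -js 10 high.sh
```

How much weight -p and -js actually carry depends on the scheduler
configuration, so treat the numbers as illustrative.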
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
