Well, I'm not sure I understand this...

> When these jobs have some kind of checkpointing built in, it can be set up
> in SGE to reschedule the job.

They certainly don't have checkpointing built in - these are proprietary binaries, and I can't change their operation. I haven't finished reading up on checkpointing yet, so I don't know whether anything else is appropriate (kernel-level checkpointing seems like a good fix, but I've yet to find enough detail on whether it will work for me).

> For the queue setup: one queue per group, inside the own machine's
> hostgroup should get a lower sequence number than the other group's
> machines (or a soft request for the own machines).

This is the bit I don't understand; aren't the sequence numbers set on a per-queue basis? If so, we're still left with one queue subordinate to the other. That's not going to fly; we need the subordination relationship to be one way round for one hostgroup, and the other way round for the other hostgroup.

I'm beginning to wonder whether this is possible...

Vic.
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
