Hi all.

A couple of quick questions this morning about some ROCKS/SGE scheduler
semantics.


1. I've got some new users who want to drive the cluster we have set up at
the maximum possible efficiency, i.e. a user can use as much of the cluster
as possible when they submit a job. With over 1000 cores but many users, one
of the things we did was limit a user's ability to take up more than about
300 or 400 slots, so that any one user could only ever utilise maybe 20 to
30% of the cluster at a given time. My new users don't like this – they want
to be able to use 100% of the system if it's free and no other jobs are
running. Now, my understanding is that we could certainly remove that limit
of 300 or 400 slots/jobs, but it would have a detrimental impact:

Primarily, it would preclude any other user from starting jobs while their
jobs are running, as there would be no free slots.
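For reference, the kind of per-user cap I mean is a resource quota set. A
sketch of what ours roughly looks like (the rule name and the 400-slot figure
are just placeholders for our actual values), created with `qconf -arqs`:

```
{
   name         max_slots_per_user
   description  "Cap any single user at 400 slots"
   enabled      TRUE
   limit        users {*} to slots=400
}
```

Removing the cap would just be `qconf -drqs max_slots_per_user`, as far as I
can tell.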

2. My users told me: "no, no – you can simply put our jobs 'to sleep' when
others in the queue log in to run their jobs."

Now, my understanding is that, yes, this is possible (though I don't know how
it's implemented – a fairshare policy / queue weights, perhaps?) BUT it has
the big drawback that while a user's job is "asleep" it still keeps hold of
its memory allocation on the node; so if another big-memory job comes along
and the node becomes memory over-subscribed, crashing scenarios will ensue!
Can somebody confirm that kind of functionality/concern for me?
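From my reading of the docs, the suspension mechanism here would be
subordinate queues rather than fairshare: a high-priority queue SIGSTOPs jobs
in a subordinate queue on the same host when its own slots fill. A sketch,
assuming hypothetical queue names high.q and low.q (edited via
`qconf -mq high.q`):

```
qname            high.q
...
subordinate_list low.q=1
```

i.e. as soon as 1 slot of high.q is in use on a host, low.q jobs on that host
are suspended. Since suspension is just SIGSTOP, my understanding is the
stopped processes do keep their memory allocated (resident or swapped), which
is exactly the over-subscription worry above – but I'd like confirmation.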

3. My users want jobs to "persist" across a cluster head-node crash. Would I
be right in saying that it's only possible to persist across crashes if the
users use CHECKPOINTING in their jobs? I've heard of it before – I've just
never implemented it and don't know where to start.
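From skimming the checkpoint man page, SGE checkpointing is configured as a
checkpoint environment via `qconf -ackpt`. A sketch of what I think one looks
like for an application that writes its own restart files – the name, the
ckpt_dir, and the "when" letters are all assumptions on my part:

```
ckpt_name          app_ckpt
interface          APPLICATION-LEVEL
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /scratch/checkpoints
signal             NONE
when               xs
```

Users would then submit with something like `qsub -ckpt app_ckpt -r y job.sh`
so the job is marked rerunnable and restarted from its last checkpoint. Is
that roughly the right starting point?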

Thank you for your time, all.

JC
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users