On 16/08/13 13:55, Reuti wrote:
> Hi,
>
> On 16.08.2013 at 01:57, Alex Chekholko wrote:
>
>> We're bringing up a new cluster and have some grid engine design
>> questions. We already run several GE clusters with close to
>> default config, but for this one the users want some unusual
>> settings.
>>
>> Here's my best understanding of what they want.
>>
>> Suppose there's 1000 cores in the system, and we map cores to
>> slots. Suppose there's 10 users and they each get "allocated" 100
>> slots. But any user should be able to run jobs to use up any
>> free/idle capacity.
>>
>> E.g. if the cluster is empty, and a user submits 5000 jobs, he
>> gets the full 1000 running and 4000 in the queue. Then the
>> second user comes along and submits 5 jobs. They want this
>> second user's jobs to start as close to "immediately" as
>> possible.
>
> There is nothing in SGE doing this automatically. Once a job is
> granted to run, it is allowed to run until it finishes.
>
>
>> Assuming the jobs are all large-memory jobs, so we can't just
>> suspend them, and assuming none of the jobs are checkpointable.
>> But we can kill the jobs and resubmit them.
>
> You could use a checkpointing environment, so that suspending a
> job reschedules it.
>
> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
>
> `man checkpoint` `man sge_ckpt`
>
> But since a suspension is normally the result of another job being
> started, as with subordinated queues or the like, you will need some
> kind of co-scheduler outside of SGE that looks for jobs which should
> be removed and suspends them with `qmod -sj ...`. As long as these
> jobs were submitted with an appropriate checkpointing environment,
> they will be rescheduled.
>

If checkpointing jobs is acceptable, then something like the following
might work. Limit each user (via rqs) to 100 slots in the underquota
cluster queue.
Use queue_sort_method seqno to ensure the underquota queue is preferred
to the overquota queue. Slotwise subordinate the overquota queue to the
underquota queue. Jobs in excess of the 100 per user should go into the
overquota queue.
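
To make the intended outcome concrete, here is a rough Python sketch of
the steady state this two-queue arrangement is aiming for. It is only a
toy model, not Grid Engine configuration: the function name `allocate`,
the user names, and the rule that leftover slots are handed out in
submission order are all assumptions made for illustration. It also
treats the cluster as one pool of single-slot jobs and ignores host
boundaries and scheduling intervals.

```python
def allocate(total_slots, quota, demand):
    """Steady-state slot split for the two-queue scheme (toy model).

    demand maps user -> number of single-slot jobs the user wants to run,
    listed in submission order. Returns user -> dict with the jobs running
    in the underquota queue, running in the overquota queue, and pending.
    """
    if quota * len(demand) > total_slots:
        raise ValueError("per-user quotas exceed the cluster size")

    # Pass 1: the resource quota caps each user at `quota` slots in the
    # underquota queue, which is the preferred queue and fills first.
    under = {user: min(jobs, quota) for user, jobs in demand.items()}
    free = total_slots - sum(under.values())

    # Pass 2: whatever is still idle goes to overquota jobs.
    # Assumption: leftover slots are handed out in submission order.
    result = {}
    for user, jobs in demand.items():
        over = min(jobs - under[user], free)
        free -= over
        result[user] = {"under": under[user], "over": over,
                        "pending": jobs - under[user] - over}
    return result


# The scenario from the original question (user names are made up).
alone = allocate(1000, 100, {"alice": 5000})
shared = allocate(1000, 100, {"alice": 5000, "bob": 5})

# Overquota jobs that have to be suspended and rescheduled to make room.
displaced = alone["alice"]["over"] - shared["alice"]["over"]
```

With these inputs the first user alone gets 100 underquota and 900
overquota slots with 4000 jobs pending. Once the second user submits 5
jobs, `displaced` is 5: five of the first user's overquota jobs have to
give way, and the second user's jobs all run in the underquota queue.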

One oddity of this arrangement is that jobs from a user who is
currently under quota may get checkpointed and moved if the user was
over quota when the job started. Given your policy, such a job should
always be able to restart on the next scheduling cycle, though, and
should then be safely in the underquota queue.

William

>
>> So in the scenario above, maybe the first user's 5 newest (most
>> recently started) jobs get killed and resubmitted, and the second
>> user's jobs start running.
>>
>> And then suppose more users come along and submit a few hundred
>> jobs each, they should each end up with 100 jobs running, and the
>> rest in the queue.
>>
>> What sort of grid engine configuration might provide this sort of
>> setup?
>>
>> They don't want any runtime limit either. And they don't care
>> about "fair" sharing over time, just the initial wait time.
>> Which I think rules out sharetree approaches.
>
> You can also implement fair scheduling that looks only at the actual
> usage (but as said: this won't remove jobs, this has to be done by
> a co-scheduler):
>
> http://www.gridengine.info/2006/01/17/easy-setup-of-equal-user-fairshare-policy/
>
> -- Reuti
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
