Hi Ryan,

Thanks. We had considered this approach but went in a different
direction for a couple of reasons:

We have a good number of users who script job submissions and may blast
out several hundred jobs at once. A user might not realize their jobs
are getting cut off until many of them have run, which is a waste of
resources.

Also, we have many users who are relatively new to HPC/Slurm and work
from guides or tutorials that don't explain things very well. A
distinct error message at job submission, rather than a related error
after a "failure" (from the user's perspective), keeps a lot of support
emails out of my inbox. Of course I'd like them to learn to use Slurm
better, but they usually want to focus on their own research first.
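
For illustration only, the submit-time check discussed in this thread
can be modeled in a few lines of Python. This is a sketch and not the
attached patch: the real plugin is written against Slurm's job submit
plugin interface, and the names JobRequest, check_submission, SUBMIT_OK
and ERR_MISSING_TIME_LIMIT are hypothetical stand-ins invented for this
example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical return codes for this sketch; they are not Slurm's values.
SUBMIT_OK = 0
ERR_MISSING_TIME_LIMIT = 1


@dataclass
class JobRequest:
    """Stand-in for a job submission; not Slurm's job descriptor."""
    name: str
    time_limit_minutes: Optional[int] = None  # None: user gave no limit


def check_submission(job: JobRequest) -> Tuple[int, str]:
    """Reject a job at submit time when it carries no time limit."""
    if job.time_limit_minutes is None:
        return (ERR_MISSING_TIME_LIMIT,
                f"job '{job.name}' rejected: no time limit specified")
    return SUBMIT_OK, ""


if __name__ == "__main__":
    # One job with a limit and one without, to show both outcomes.
    for job in (JobRequest("short", 30), JobRequest("unbounded")):
        code, message = check_submission(job)
        print(job.name, "accepted" if code == SUBMIT_OK else message)
```

The point of the model is that the rejection happens before anything is
queued, so a script that submits several hundred jobs fails on the first
one with a specific message instead of having each job killed later at a
default limit.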

- Dan

On 06/28/2013 11:00 AM, Ryan Cox wrote:
> An alternative that we do is choose very low defaults for people:
> PartitionName=Default DefaultTime=30:00 #plus other options ........
> DefMemPerCPU=512
> 
> The disadvantage to this approach is that it doesn't give an obvious
> error message at submit time.  However, it's not hard to figure out what
> happened when they hit the time limit or the error output says they went
> over their memory limit.
> 
> Ryan
> 
> On 06/28/2013 08:29 AM, Daniel M. Weeks wrote:
>> At CCNI, we use backfill scheduling on all our systems. However, we have
>> found that users typically do not specify a time limit for their job so
>> the scheduler assumes the maximum from QoS/user limits/partition
>> limits/etc. This really hurts backfilling since the scheduler remains
>> ignorant of short jobs.
>>
>> Attached is a small patch I wrote containing a job submit plugin and a
>> new error message. The plugin rejects a job submission when it is
>> missing a time limit and will provide the user with a clear and distinct
>> error.
>>
>> I've just re-tested and the patch applies and builds cleanly on the
>> slurm-2.5, slurm-2.6, and master branches.
>>
>> Please let me know if you find this useful, run across problems, or have
>> suggestions/improvements. Thanks.
>>
> 
> -- 
> Ryan Cox
> Operations Director
> Fulton Supercomputing Lab
> Brigham Young University
> 


-- 
Daniel M. Weeks
Systems Programmer
Computational Center for Nanotechnology Innovations
Rensselaer Polytechnic Institute
Troy, NY 12180
518-276-4458
