Hi Ryan,

Thanks. We had considered this approach but went in a different direction for a couple of reasons:

We have a good number of users who script job submissions and may blast out up to several hundred jobs. A user might not realize their jobs are getting cut off until many of them have run, which is a waste of resources.

Also, we have many users who are relatively new to HPC/Slurm and work from guides or tutorials that don't explain things very well. A distinct error message at job submission, rather than a related error after a "failure" (from the user's perspective), keeps a lot of support emails out of my inbox. Of course I'd like them to learn to use Slurm better, but they usually want to focus on their own research first.

- Dan

On 06/28/2013 11:00 AM, Ryan Cox wrote:
> An alternative that we do is choose very low defaults for people:
> PartitionName=Default DefaultTime=30:00 #plus other options ........
> DefMemPerCPU=512
>
> The disadvantage to this approach is that it doesn't give an obvious
> error message at submit time. However, it's not hard to figure out what
> happened when they hit the time limit or the error output says they went
> over their memory limit.
>
> Ryan
>
> On 06/28/2013 08:29 AM, Daniel M. Weeks wrote:
>> At CCNI, we use backfill scheduling on all our systems. However, we have
>> found that users typically do not specify a time limit for their job so
>> the scheduler assumes the maximum from QoS/user limits/partition
>> limits/etc. This really hurts backfilling since the scheduler remains
>> ignorant of short jobs.
>>
>> Attached is a small patch I wrote containing a job submit plugin and a
>> new error message. The plugin rejects a job submission when it is
>> missing a time limit and will provide the user with a clear and distinct
>> error.
>>
>> I've just re-tested and the patch applies and builds cleanly on the
>> slurm-2.5, slurm-2.6, and master branches.
>>
>> Please let me know if you find this useful, run across problems, or have
>> suggestions/improvements. Thanks.
>>
>
> --
> Ryan Cox
> Operations Director
> Fulton Supercomputing Lab
> Brigham Young University
>

--
Daniel M. Weeks
Systems Programmer
Computational Center for Nanotechnology Innovations
Rensselaer Polytechnic Institute
Troy, NY 12180
518-276-4458
