[slurm-dev] Insane message length
Hi all,

Today a single user submitted 7000 jobs, and squeue and scancel now return the error message "Insane message length". I read in a previous slurm-devel thread (https://groups.google.com/forum/#!searchin/slurm-devel/Insane$20message$20length|sort:relevance/slurm-devel/7gyGUEg3zWg/4cxCPzRMMc8J) that this happens because MAX_MSG_SIZE defines a total message size of 16 MB (our SLURM version is 2.2.7), which these 7000 jobs exceed. I was not able to cancel a single job with scancel, but with sacct I was able to retrieve the JobIDs of all the jobs in the queue.

My questions are: If I stop the slurm control daemon and then start it with the startclean option, will I lose all the jobs, or only the pending ones? Is there a way to cancel all the pending jobs without also cancelling the running ones? I have 1000 jobs running at the moment and I would like to preserve them. Would it be possible to stop slurmctld and then manually delete them from /var/slurm-clustername?

Thanks in advance.

Juan Pancorbo
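One way to attack the cancellation question, since sacct still responds: cancel the pending jobs one at a time by job ID, so no single request ever has to carry the whole queue. A rough sketch only; the user name is illustrative, and the flag spellings are from recent sacct man pages, so they should be checked against 2.2.7:

    # List the IDs of PENDING jobs for the offending user from the
    # accounting database, then cancel each one individually so no
    # single RPC has to carry the full 7000-job list.
    for jobid in $(sacct -n -X -u heavyuser -s PENDING -o JobID); do
        scancel "$jobid"
    done

scancel --state=PENDING --user=heavyuser would be the obvious one-liner, but scancel fetches the job list and filters it client-side, so it will likely hit the same message-size limit.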
[slurm-dev] Job Groups
I have a group here that wants to submit a ton of jobs to the queue, but they want to restrict how many they have running at any given time so that they don't torch their fileserver. They were using bgmod -L in LSF to do this, and they were wondering if there is a similar way to do so in SLURM. I know you can do this via the accounting interface, but it would be good if I didn't have to apply it as a blanket to all their jobs, and if they could manage it themselves.

If nothing exists in SLURM to do this, that's fine; one can always engineer around it. I figured I would ping the dev list before putting a nail in it. From my look at the documentation, I don't see any way to do this other than what I stated above.

-Paul Edmon-
[slurm-dev] Re: Job Groups
2013/6/19 Paul Edmon ped...@cfa.harvard.edu: I have a group here that wants to submit a ton of jobs to the queue, but they want to restrict how many they have running at any given time so that they don't torch their fileserver. [...]

I'm not familiar with LSF, but if you are using accounts (which requires the database accounting backend) you can simply create an account for them and limit the number of running jobs with:

GrpJobs= The total number of jobs able to run at any given time from this association and its children. If this limit is reached, new jobs will be queued but only allowed to run after previous jobs from this group complete.

Another possibility, if the users want to set the limit themselves, is to create an allocation and then submit jobs to that allocation.

cheers, marcin
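For concreteness, a minimal sketch of the account route (the account name, user name, and limit are illustrative, and it assumes slurmdbd accounting is already running):

    # Create an account for the group and cap how many of its jobs
    # may run at once; queued jobs beyond the cap simply wait.
    sacctmgr add account fsgroup Description="fileserver-heavy group"
    sacctmgr modify account where name=fsgroup set GrpJobs=50
    # Add the group's users to the account so the limit applies to them.
    sacctmgr add user pedmon Account=fsgroup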
[slurm-dev] Re: Job Groups
Could you just create a dedicated queue for those jobs, and then configure its priority and max simultaneous settings? Then all they would have to do is ensure they submit those jobs to that queue.

On Jun 19, 2013, at 8:36 AM, Paul Edmon ped...@cfa.harvard.edu wrote: I have a group here that wants to submit a ton of jobs to the queue, but they want to restrict how many they have running at any given time so that they don't torch their fileserver. [...]
[slurm-dev] Re: Job Groups
Sounds like something you would use a QOS for. That way you get all the limits from accounting, but they apply only to certain jobs.

On 06/19/13 09:03, Ralph Castain wrote: Could you just create a dedicated queue for those jobs, and then configure its priority and max simultaneous settings? [...]

On Jun 19, 2013, at 8:36 AM, Paul Edmon ped...@cfa.harvard.edu wrote: I have a group here that wants to submit a ton of jobs to the queue, but they want to restrict how many they have running at any given time so that they don't torch their fileserver. [...]
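A sketch of the QOS variant (names are illustrative, and AccountingStorageEnforce in slurm.conf must include qos for the limit to be enforced):

    # Create a QOS whose GrpJobs caps concurrently running jobs
    # across everything submitted under it.
    sacctmgr add qos iothrottle
    sacctmgr modify qos where name=iothrottle set GrpJobs=50
    # Allow the group's users to select it.
    sacctmgr modify user where name=pedmon set qos+=iothrottle

    # The group's jobs then opt in at submit time:
    sbatch --qos=iothrottle job.sh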
[slurm-dev] Re: Job Groups
On 06/19/2013 10:36 AM, Paul Edmon wrote: I have a group here that wants to submit a ton of jobs to the queue, but they want to restrict how many they have running at any given time so that they don't torch their fileserver. [...]

The licenses feature might work OK for this. Create a license for the fileserver with as many seats as the maximum number of jobs, and have jobs hitting the fileserver request one (or more) licenses. Regards, John
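A sketch of the licenses idea (the license name and seat count are illustrative): declare the fileserver as a consumable license in slurm.conf, and have each relevant job claim a seat, so at most that many run at once:

    # slurm.conf: 50 "seats" on the fileserver
    Licenses=fileserver:50

    # in each job script that touches the fileserver:
    #SBATCH --licenses=fileserver:1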
[slurm-dev] Re: Job Groups
Paul,

We were discussing this yesterday because of a user who wasn't limiting the number of jobs hammering our storage. A QOS with a GrpJobs limit sounds like the best approach for both of us.

Ryan

On 06/19/2013 09:36 AM, Paul Edmon wrote: I have a group here that wants to submit a ton of jobs to the queue, but they want to restrict how many they have running at any given time so that they don't torch their fileserver. [...]

-- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
[slurm-dev] Resubmit on failure
Hi,

I've tried to look for this, but is there any way to have jobs automatically resubmitted when they fail? We occasionally have hiccups on random nodes where a job might fail due to a temporary network loss, a lost storage mount, or the like. When users send thousands of jobs and, say, 0.1% fail, they have to track down and resubmit the individual jobs, even though they may have used a tool that sent those 5000 jobs in sequence. It would be really nice if they could simply declare that they accept, say, one automatic resubmission with the same initial conditions the job was originally submitted with. The user would know whether the filesystems etc. are fine with that, and in our case they mostly are.

Is such a feature already in SLURM or not? If yes, can you point me to the documentation?

Thanks,

Mario Kadastik, PhD Researcher

--- Physics is like sex, sure it may have practical reasons, but that's not why we do it -- Richard P. Feynman
[slurm-dev] Re: Job Groups
Thanks for the input. Can GrpJobs be modified from the user side?

-Paul Edmon-

On 06/19/2013 12:15 PM, Ryan Cox wrote: Paul, We were discussing this yesterday because of a user who wasn't limiting the number of jobs hammering our storage. A QOS with a GrpJobs limit sounds like the best approach for both of us. [...]
[slurm-dev] Re: jobacct_gather plugins
I second that! Sounds like the correct approach for data-intensive computing.

Thanks, Eva

-- University of California, San Diego SDSC, MC 0505 9500 Gilman Drive La Jolla, CA 92093-0505 Web: http://www.sdsc.edu/~hocks (858) 822-0954 email: ho...@sdsc.edu

On Tue, 18 Jun 2013, Riccardo Murri wrote: Hello, On 18 June 2013 22:16, Chris Read chris.r...@gmail.com wrote: I've attached a functional patch that implements the fix I've made to ignore the shared pages of a process. This means that if a job allocates more RAM than the limit it still gets terminated, but if it is just mmapping very large files it is not disturbed. [...] Questions are: - Are there enough people out there interested in the functionality described here to warrant making this a config option for jobacct_gather/linux? I am certainly interested in the functionality provided by your patch. Thanks, Riccardo

-- Riccardo Murri http://www.gc3.uzh.ch/people/rm Grid Computing Competence Centre, University of Zurich Winterthurerstrasse 190, CH-8057 Zürich (Switzerland) Tel: +41 44 635 4222 Fax: +41 44 635 6888
[slurm-dev] Re: Resubmit on failure
One note: Only batch jobs will be requeued. We can't do much for jobs initiated by salloc or srun.

Quoting Aaron Knister aaron.knis...@gmail.com: Hi Mario, SLURM can and will, I believe by default, resubmit jobs that fail due to node failures recognized by slurmctld that put the node in an offline state. This doesn't help you, however, as SLURM doesn't appear to notice these failures. I wonder if a SPANK plugin could do the job here. Sent from my iPad

On Jun 19, 2013, at 12:36 PM, Mario Kadastik mario.kadas...@cern.ch wrote: Hi, I've tried to look for this, but is there any way to have jobs automatically resubmitted when they fail? [...]
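For reference, the batch-job requeue knobs being referred to look roughly like this (option names are from the slurm.conf and sbatch man pages; whether the failures in Mario's scenario are actually detected as node failures is a separate question):

    # slurm.conf: requeue batch jobs after node failure (this is the default):
    JobRequeue=1

    # in the batch script, request requeueing explicitly:
    #SBATCH --requeue

    # a job can also be requeued by hand:
    scontrol requeue <jobid>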
[slurm-dev] Re: Job Groups
Okay, thanks.

-Paul Edmon-

On 06/19/2013 04:32 PM, Ryan Cox wrote: Not that I'm aware of. I don't know of a way to give users control over a QOS like you can with account coordinators for accounts. [...]
[slurm-dev] Re: Job Groups
Not that I'm aware of. I don't know of a way to give users control over a QOS like you can with account coordinators for accounts.

Ryan

On 06/19/2013 10:55 AM, Paul Edmon wrote: Thanks for the input. Can GrpJobs be modified from the user side? [...]

-- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University