[slurm-dev] Insane message length

2013-06-19 Thread Pancorbo, Juan
Hi all,
Today a single user submitted 7000 jobs, and now squeue and scancel return the 
error message: Insane Message Length.
I read in a previous thread on the slurm-devel list 
(https://groups.google.com/forum/#!searchin/slurm-devel/Insane$20message$20length|sort:relevance/slurm-devel/7gyGUEg3zWg/4cxCPzRMMc8J)
that this happens because MAX_MSG_SIZE limits the total message size to 16 MB 
(our SLURM version is 2.2.7), and these 7000 jobs exceed that limit. I was not 
able to cancel a single job with scancel.
With sacct I was able to retrieve the JobID of all the jobs in the queue.
My questions are:
If I stop the SLURM control daemon and then start it with the startclean 
option, will I lose all the jobs, or only the pending ones?
Is there a way to cancel all the pending jobs without also cancelling the 
running ones? I have 1000 jobs running at the moment and I would like to 
preserve them.
Would it be possible to stop slurmctld and then manually delete the job state 
files from /var/slurm-clustername?
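
For the second question, this is the kind of per-job cancellation loop I 
have in mind, just a sketch assuming sacct can still list the pending job 
IDs and that scancel works when given an explicit job ID (the error seems 
to occur only when the whole queue has to be fetched in one message):

    # cancel only jobs that are still pending; running jobs are untouched
    for jobid in $(sacct --allusers --noheader --state=PENDING --format=JobID); do
        scancel "$jobid"
    done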

Thanks in advance.

Juan Pancorbo.


[slurm-dev] Job Groups

2013-06-19 Thread Paul Edmon

I have a group here that wants to submit a ton of jobs to the queue, but 
want to restrict how many they have running at any given time so that 
they don't torch their fileserver.  They were using bgmod -L in LSF to 
do this, but they were wondering if there was a similar way in SLURM to 
do so.  I know you can do this via the accounting interface but it would 
be good if I didn't have to apply it as a blanket to all their jobs and 
if they could manage it themselves.

If nothing exists in SLURM to do this that's fine.  One can always 
engineer around it.  I figured I would ping the dev list first before 
putting a nail in it.  From my look at the documentation I don't see 
any way to do this other than what I stated above.

-Paul Edmon-


[slurm-dev] Re: Job Groups

2013-06-19 Thread Marcin Stolarek

2013/6/19 Paul Edmon ped...@cfa.harvard.edu:

 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 any way to do this other than what I stated above.

I'm not familiar with LSF, but if you are using accounts (this needs the 
database accounting backend) you can simply create an account for them and 
limit the number of running jobs with:

GrpJobs= The total number of jobs able to run at any given time from
this association and its children. If this limit is reached new jobs
will be queued but only allowed to run after previous jobs complete
from this group. 
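
A minimal sketch of the commands (the account name fsgroup, the user alice, 
and the limit of 100 are all hypothetical):

    # create an account and cap its concurrently running jobs
    sacctmgr add account fsgroup
    sacctmgr modify account name=fsgroup set GrpJobs=100
    # put the users into that account
    sacctmgr add user alice account=fsgroup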

Another possibility, if the users want to set the limit themselves, is to 
create an allocation and then submit jobs to that allocation.

cheers,
marcin

[slurm-dev] Re: Job Groups

2013-06-19 Thread Ralph Castain

Could you just create a dedicated queue for those jobs, and then configure its 
priority and max simultaneous settings? Then all they would have to do is 
ensure they submit those jobs to that queue.
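
In SLURM terms that queue would be a partition. As far as I know there is 
no direct max-running-jobs setting on a partition, so the cap would come 
from restricting the partition to a small set of nodes. A sketch for 
slurm.conf (partition and node names are placeholders):

    # dedicated partition; concurrency is bounded by the size of the node set
    PartitionName=iobound Nodes=node[01-04] Default=NO MaxTime=INFINITE State=UP

They would then submit with sbatch -p iobound.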

On Jun 19, 2013, at 8:36 AM, Paul Edmon ped...@cfa.harvard.edu wrote:

 
 I have a group here that wants to submit a ton of jobs to the queue, but 
 want to restrict how many they have running at any given time so that 
 they don't torch their fileserver.  They were using bgmod -L in LSF to 
 do this, but they were wondering if there was a similar way in SLURM to 
 do so.  I know you can do this via the accounting interface but it would 
 be good if I didn't have to apply it as a blanket to all their jobs and 
 if they could manage it themselves.
 
 If nothing exists in SLURM to do this that's fine.  One can always 
 engineer around it.  I figured I would ping the dev list first before 
 putting a nail in it.  From my look at the documentation I don't see 
 any way to do this other than what I stated above.
 
 -Paul Edmon-


[slurm-dev] Re: Job Groups

2013-06-19 Thread Danny Auble

Sounds like something you would use a QOS for.  That way you get all the 
limits from accounting, but they only apply to certain jobs.
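
A sketch of the setup (the QOS name fslimit, the user alice, and the limit 
of 100 are hypothetical):

    # create a QOS that caps concurrently running jobs
    sacctmgr add qos fslimit
    sacctmgr modify qos name=fslimit set GrpJobs=100
    # let the users' associations use it
    sacctmgr modify user name=alice set qos+=fslimit
    # jobs opt in at submit time
    sbatch --qos=fslimit job.sh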

On 06/19/13 09:03, Ralph Castain wrote:
 Could you just create a dedicated queue for those jobs, and then configure 
 its priority and max simultaneous settings? Then all they would have to do is 
 ensure they submit those jobs to that queue.

 On Jun 19, 2013, at 8:36 AM, Paul Edmon ped...@cfa.harvard.edu wrote:

 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 any way to do this other than what I stated above.

 -Paul Edmon-


[slurm-dev] Re: Job Groups

2013-06-19 Thread John Thiltges

On 06/19/2013 10:36 AM, Paul Edmon wrote:
 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.

The licenses feature might work OK for this. Create a license for the 
fileserver with as many seats as the maximum number of jobs, and jobs hitting 
the fileserver would request one (or more) licenses.
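
A sketch, with the license name and seat count as placeholders:

    # slurm.conf: 100 seats guarding the fileserver
    Licenses=fileserver:100

    # each fileserver-heavy job consumes one seat at submit time
    sbatch -L fileserver:1 job.sh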

Regards,
John


[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox

Paul,

We were discussing this yesterday due to a user not limiting the number 
of jobs hammering our storage.  A QOS with a GrpJobs limit sounds like 
the best approach for both your site and ours.

Ryan

On 06/19/2013 09:36 AM, Paul Edmon wrote:
 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 any way to do this other than what I stated above.

 -Paul Edmon-

-- 
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Resubmit on failure

2013-06-19 Thread Mario Kadastik

Hi,

I've tried to look for this, but is there any way to have a job automatically 
resubmitted if it fails? We occasionally have hiccups on random nodes where a 
job fails due to a temporary network loss, a lost storage mount, or the like. 
When users send thousands of jobs and, say, 0.1% fail, they have to track down 
the individual failures and resubmit them by hand, even if they used a tool 
that sent those 5000 jobs in sequence. It would be really nice if they could 
simply declare that they accept, say, one automatic resubmission with the same 
initial conditions the job was originally submitted with. The user would know 
whether the filesystems etc. are fine with that, and in our case they mostly 
are.

Is such a feature already in slurm or not? If so, can you point me to the 
documentation?

Thanks,

Mario Kadastik, PhD
Researcher

---
  Physics is like sex, sure it may have practical reasons, but that's not why 
we do it 
 -- Richard P. Feynman


[slurm-dev] Re: Job Groups

2013-06-19 Thread Paul Edmon

Thanks for the input.  Can GrpJobs be modified from the user side?

-Paul Edmon-


On 06/19/2013 12:15 PM, Ryan Cox wrote:
 Paul,

 We were discussing this yesterday due to a user not limiting the number
 of jobs hammering our storage.  A QOS with a GrpJobs limit sounds like
 the best approach for both your site and ours.

 Ryan

 On 06/19/2013 09:36 AM, Paul Edmon wrote:
 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 any way to do this other than what I stated above.

 -Paul Edmon-


[slurm-dev] Re: jobacct_gather plugins

2013-06-19 Thread Eva Hocks



I second that! Sounds like the correct approach for data-intensive
computing.
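
For anyone curious what ignoring shared pages amounts to in practice, a 
rough sketch against /proc (field layout per proc(5); the pid is 
hypothetical):

    # private RSS of a process: resident pages minus shared pages
    pid=12345
    read size resident shared rest < /proc/$pid/statm
    echo $(( (resident - shared) * $(getconf PAGESIZE) )) bytes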


Thanks
Eva
--
University of California, San Diego
SDSC, MC 0505
9500 Gilman Drive
La Jolla, Ca 92093-0505   Web  : http://www.sdsc.edu/~hocks
(858) 822-0954            email: ho...@sdsc.edu



On Tue, 18 Jun 2013, Riccardo Murri wrote:


 Hello,

 On 18 June 2013 22:16, Chris Read chris.r...@gmail.com wrote:
  I've attached a functional patch that implements the fix I've made to 
  ignore shared pages of a process. This means that if a job allocates more 
  RAM than the limit it still gets terminated, but if it's just mmaping very 
  large files it does not get disturbed.
 
  [...]
 
  Questions are:
 
  - Are there enough people out there interested in the functionality 
  described here to warrant making this a config option for 
  jobacct_gather/linux?

 I am certainly interested in the functionality provided by your patch.

 Thanks,
 Riccardo

 --
 Riccardo Murri
 http://www.gc3.uzh.ch/people/rm

 Grid Computing Competence Centre
 University of Zurich
 Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
 Tel: +41 44 635 4222
 Fax: +41 44 635 6888


[slurm-dev] Re: Resubmit on failure

2013-06-19 Thread Moe Jette

One note: Only batch jobs will be requeued. We can't do much for jobs  
initiated by salloc or srun.
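
For batch jobs, a minimal self-requeue sketch (my_app and the retry cap of 
one are placeholders; note that the requeue kills the running script, and 
the SLURM_RESTART_COUNT variable may not be available in older versions):

    #!/bin/bash
    #SBATCH --requeue
    ./my_app
    if [ $? -ne 0 ] && [ ${SLURM_RESTART_COUNT:-0} -lt 1 ]; then
        scontrol requeue $SLURM_JOB_ID
    fi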


Quoting Aaron Knister aaron.knis...@gmail.com:


 Hi Mario,

 SLURM can and will, I believe by default, resubmit jobs that fail
 due to node failures recognized by slurmctld that put the node in an
 offline state. This doesn't help you, however, as SLURM doesn't appear
 to notice these failures.

 I wonder if a SPANK plugin could do the job here.

 Sent from my iPad

 On Jun 19, 2013, at 12:36 PM, Mario Kadastik mario.kadas...@cern.ch wrote:


 Hi,

 I've tried to look for this, but is there any way to have a job
 automatically resubmitted if it fails? We occasionally have hiccups
 on random nodes where a job fails due to a temporary network loss,
 a lost storage mount, or the like. When users send thousands of jobs
 and, say, 0.1% fail, they have to track down the individual failures
 and resubmit them by hand, even if they used a tool that sent those
 5000 jobs in sequence. It would be really nice if they could simply
 declare that they accept, say, one automatic resubmission with the
 same initial conditions the job was originally submitted with. The
 user would know whether the filesystems etc. are fine with that, and
 in our case they mostly are.

 Is such a feature already in slurm or not? If so, can you point me
 to the documentation?

 Thanks,

 Mario Kadastik, PhD
 Researcher

 ---
  Physics is like sex, sure it may have practical reasons, but  
 that's not why we do it
 -- Richard P. Feynman



[slurm-dev] Re: Job Groups

2013-06-19 Thread Paul Edmon

Okay, thanks.

-Paul Edmon-

On 06/19/2013 04:32 PM, Ryan Cox wrote:
 Not that I'm aware of.  I don't know of a way to give users control over
 a QOS like you can do with account coordinators for accounts.

 Ryan

 On 06/19/2013 10:55 AM, Paul Edmon wrote:
 Thanks for the input.  Can GrpJobs be modified from the user side?

 -Paul Edmon-


 On 06/19/2013 12:15 PM, Ryan Cox wrote:
 Paul,

 We were discussing this yesterday due to a user not limiting the number
 of jobs hammering our storage.  A QOS with a GrpJobs limit sounds like
 the best approach for both your site and ours.

 Ryan

 On 06/19/2013 09:36 AM, Paul Edmon wrote:
 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 any way to do this other than what I stated above.

 -Paul Edmon-


[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox

Not that I'm aware of.  I don't know of a way to give users control over 
a QOS like you can do with account coordinators for accounts.

Ryan

On 06/19/2013 10:55 AM, Paul Edmon wrote:
 Thanks for the input.  Can GrpJobs be modified from the user side?

 -Paul Edmon-


 On 06/19/2013 12:15 PM, Ryan Cox wrote:
 Paul,

 We were discussing this yesterday due to a user not limiting the number
 of jobs hammering our storage.  A QOS with a GrpJobs limit sounds like
 the best approach for both your site and ours.

 Ryan

 On 06/19/2013 09:36 AM, Paul Edmon wrote:
 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 anyway to do this other than what I stated above.

 -Paul Edmon-

-- 
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University