[slurm-dev] Jobs at Harvard Research Computing

2017-09-11 Thread Paul Edmon


We have a number of openings here at Harvard FAS RC.  If you are 
interested please check out our employment page for details:


https://www.rc.fas.harvard.edu/about/employment/

-Paul Edmon-


[slurm-dev] Re: multiple MPI versions with slurm

2017-07-21 Thread Paul Edmon


I would build each MPI against the PMI libraries and build Slurm with PMI
support.  Then you can launch any version of MPI with the same srun command,
leveraging PMI.
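
A minimal sketch of what that can look like in practice (the configure
flags are illustrative; check each MPI's own documentation for its exact
PMI options):

    # build each MPI against Slurm's PMI2 support, e.g. for OpenMPI:
    ./configure --with-slurm --with-pmi=/usr
    # users then load whichever MPI they want and launch the same way:
    module load mvapich2        # or openmpi
    srun --mpi=pmi2 -n 16 ./my_parallel_app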


-Paul Edmon-


On 07/21/2017 10:03 AM, Manuel Rodríguez Pascual wrote:

Hi all,

I'm trying to provide support to users demanding different versions of
MPI, namely mvapich2 and OpenMPI. In particular, I'd like to support
"srun -n X ./my_parallel_app", maybe with "--with-mpi=X" flag.


Has anybody got any suggestion on how to do that? I am kind of new
to cluster administration, so probably there is some obvious solution
that I am not aware of.

Cheers,


Manuel


[slurm-dev] Re: SLURM automatically use idle nodes in reserved partitions?

2017-07-11 Thread Paul Edmon


You can use the REQUEUE partition functionality in slurm to accomplish 
this.  We do that here.  Basically we have a high priority partition 
that the hardware owners use and then a lower priority partition 
that backfills it with the requeue option.  If the high priority 
partition has idle resources, the lower priority partition uses them.  
If the high priority partition needs the resources, it will requeue the 
jobs from the lower priority partition.
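
A sketch of what that two-partition setup can look like in slurm.conf
(partition names, node list, and priorities here are made up; see the
PreemptType/PreemptMode and PriorityTier documentation for details -
older releases use Priority= instead of PriorityTier=):

    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    PartitionName=owner    Nodes=node[01-16] PriorityTier=10 State=UP
    PartitionName=backfill Nodes=node[01-16] PriorityTier=1  PreemptMode=REQUEUE State=UP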


-Paul Edmon-


On 7/11/2017 4:29 PM, David Perel wrote:

Hello --

  Say on a cluster researcher X has his own reserved partition, XP,
  where normally only X can run jobs.

  Can SLURM be configured so that if some resource metric(s) (e.g.,
  load average, memory usage) on XP nodes is below a given level
  for a given time period (such nodes are labelled "XP-L"), then:

   - if researcher Y is authorized to use XP, Y's jobs can automatically
     use the XP-L nodes; and

   - as soon as X submits jobs to XP for which XP-L are needed, Y's
     jobs on the XP-L nodes are moved off of those nodes to wherever they
     can run, or queued

  Thanks.

___
David Perel
IST Academic and Research Computing Systems (ARCS)
New Jersey Institute of Technology
dav...@njit.edu


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Paul Edmon


Yeah, we keep around a test cluster environment for that purpose to vet 
slurm upgrades before we roll them on the production cluster.


Thus far no problems.  However, paranoia is usually a good thing for 
cases like this.


-Paul Edmon-


On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote:


On 06/26/2017 01:24 PM, Loris Bennett wrote:

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?


I want to be 99.9% sure that upgrading (my first one) will actually 
work.  I also want to know roughly how long the slurmdbd will be down 
so that the cluster doesn't kill all jobs due to timeouts.  Better to 
be safe than sorry.


I don't expect to inform the users, since the operation is expected to 
run smoothly without troubles for user jobs.


Thanks,
Ole


Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database

We did it in place; it worked as it says on the tin. It was less painful
than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.

cheers
L.


On 26 June 2017 at 20:04, Ole Holm Nielsen 
<ole.h.niel...@fysik.dtu.dk> wrote:


  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most 
critical step seems to me to be the upgrade of the slurmdbd 
database, which may also take tens of minutes.


  I thought it would be a good idea to test the slurmdbd database upgrade 
locally on a drained compute node in order to verify both 
correctness and the time required.


  I've developed the dry run upgrade procedure documented in the 
Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm


  Question 1: Would people who have real-world Slurm upgrade 
experience kindly offer comments on this procedure?


  My testing was actually successful, and the database conversion 
took less than 5 minutes in our case.


  A crucial step is starting the slurmdbd manually after the 
upgrade. But how can we be sure that the database conversion has 
been 100% completed?
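
(A rough sketch of that step, assuming the dumped production database has
been loaded into a local MySQL/MariaDB on the test node: run the new
slurmdbd in the foreground with extra verbosity and time how long the
table conversions take, e.g.

    time slurmdbd -D -vvv

and watch the log output until it goes quiet.)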


  Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


  Thanks,
  Ole








[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon


We actually use:

scontrol suspend

I will note, though, that with the suspend behavior slurm has, it is 
best to do this while the partitions are set to DOWN so new jobs don't 
try to schedule; otherwise you can end up with oversubscribed nodes.  
Alternatively, SIGSTOP is also a reasonable route to go.
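
A rough sketch of that sequence (partition names are site-specific and
the job selection here is illustrative):

    # keep new jobs from starting
    scontrol update partitionname=<partition> state=down   # repeat per partition
    # suspend everything that is currently running
    squeue -h -t R -o %A | xargs -n1 scontrol suspend
    # ... do the upgrade ...
    squeue -h -t S -o %A | xargs -n1 scontrol resume
    scontrol update partitionname=<partition> state=up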


-Paul Edmon-


On 06/20/2017 10:32 AM, Loris Bennett wrote:

Hi Nick,

We do our upgrades while full production is up and running.  We just stop
the Slurm daemons, dump the database and copy the statesave directory
just in case.  We then do the update, and finally restart the Slurm
daemons.  We only lost jobs once during an upgrade back around 2.2.6 or
so, but that was due to a rather brittle configuration provided by our
vendor (the statesave path contained the Slurm version), rather than
Slurm itself and was before we had acquired any Slurm expertise
ourselves.

Paul: How do you pause the jobs?  SIGSTOP all the user processes on the
cluster?

Cheers,

Loris


Paul Edmon <ped...@cfa.harvard.edu> writes:


If you follow the guide on the Slurm website you shouldn't have many problems.
We've made it standard practice here to set all partitions to DOWN and suspend
all the jobs when we do upgrades. This has led to far greater stability, and we
haven't lost any jobs in an upgrade. The only weirdness we have seen is that if
jobs exit while the DB upgrade is running, it can sometimes leave residual jobs
in the DB that were not properly closed out. This is why we pause all the jobs:
it ensures we don't end up with jobs exiting before the DB is back. In 16.05+
you have:

sacctmgr show runawayjobs

which can clean up all those orphan jobs, so it's not as much of a concern
anymore.

Beyond that we follow the guide at the bottom of this page:

https://slurm.schedmd.com/quickstart_admin.html

I haven't tried going two major versions at once though. The docs indicate that
it should work fine. We generally try to keep pace with current stable.

Given that you only have 100,000 jobs your upgrade should probably go fairly
quickly - I could imagine around 10-15 minutes. Our DB has several million jobs
and it takes about 30 minutes to an hour depending on what operations are being
done.

-Paul Edmon-

On 06/20/2017 09:37 AM, Nicholas McCollum wrote:

  I'm about to update 15.08 to the latest SLURM in August and would appreciate 
any notes you have on the process.

  I'm especially interested in maintaining the DB as well as associations. I'd 
also like to keep the pending job list if possible.

  I've only got around 100,000 jobs in the DB so far, since January.

  Thanks

  Nick McCollum
  Alabama Supercomputer Authority

  On Jun 20, 2017 8:07 AM, Paul Edmon <ped...@cfa.harvard.edu> wrote:

  Yeah, that sounds about right. Changes between major versions can take
  quite a bit of time. In the past I've seen upgrades take 2-3 hours for
  the DB.

  As for ways to speed it up: putting the DB on newer hardware if you
  haven't already helps quite a bit (depends on architecture as to how
  much gain you will get, we went from AMD Abu Dhabi to Intel Broadwell
  and saw a factor of 3-4 speed improvement). Upgrading to the latest
  version of MariaDB if you are on an old version of MySQL can get you
  about 30-40%.

  Doing all of these whittled our DB upgrade times for major upgrades to
  about 30 min or so.

  Beyond that I imagine some more specific DB optimization tricks could be
  done, but I'm not a DB admin so I won't venture to say.

  -Paul Edmon-

  On 06/20/2017 08:42 AM, Tim Fora wrote:
  > Hi,
  >
  > Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
  > start. Logs show most of the time was spent on this step and other table
  > changes:
  >
  > adding column admin_comment after account in table
  >
  > Does this sound right? Any ideas to help things speed up.
  >
  > Thanks,
  > Tim
  >
  >




[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon
If you follow the guide on the Slurm website you shouldn't have many 
problems.  We've made it standard practice here to set all partitions to 
DOWN and suspend all the jobs when we do upgrades.  This has led to far 
greater stability, and we haven't lost any jobs in an upgrade.  The only 
weirdness we have seen is that if jobs exit while the DB upgrade is 
running, it can sometimes leave residual jobs in the DB that were not 
properly closed out.  This is why we pause all the jobs: it ensures we 
don't end up with jobs exiting before the DB is back.  In 16.05+ you 
have:


sacctmgr show runawayjobs

which can clean up all those orphan jobs, so it's not as much of a 
concern anymore.


Beyond that we follow the guide at the bottom of this page:

https://slurm.schedmd.com/quickstart_admin.html

I haven't tried going two major versions at once though. The docs 
indicate that it should work fine.  We generally try to keep pace with 
current stable.


Given that you only have 100,000 jobs your upgrade should probably go 
fairly quickly.  I could imagine around 10-15 minutes.  Our DB has several 
million jobs and it takes about 30 minutes to an hour depending on what 
operations are being done.


-Paul Edmon-


On 06/20/2017 09:37 AM, Nicholas McCollum wrote:
I'm about to update 15.08 to the latest SLURM in August and would 
appreciate any notes you have on the process.


I'm especially interested in maintaining the DB as well as 
associations.  I'd also like to keep the pending job list if possible.


I've only got around 100,000 jobs in the DB so far, since January.

Thanks

Nick McCollum
Alabama Supercomputer Authority


On Jun 20, 2017 8:07 AM, Paul Edmon <ped...@cfa.harvard.edu> wrote:


Yeah, that sounds about right.  Changes between major versions can
take
quite a bit of time.  In the past I've seen upgrades take 2-3
hours for
the DB.

As for ways to speed it up: putting the DB on newer hardware if you
haven't already helps quite a bit (depends on architecture as to how
much gain you will get, we went from AMD Abu Dhabi to Intel Broadwell
and saw a factor of 3-4 speed improvement). Upgrading to the latest
version of MariaDB if you are on an old version of MySQL can get you
about 30-40%.

Doing all of these whittled our DB upgrade times for major
upgrades to
about 30 min or so.

Beyond that I imagine some more specific DB optimization tricks
could be
done, but I'm not a DB admin so I won't venture to say.

    -Paul Edmon-


On 06/20/2017 08:42 AM, Tim Fora wrote:
> Hi,
>
> Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
> start. Logs show most of the time was spent on this step and
other table
> changes:
>
> adding column admin_comment after account in table
>
> Does this sound right? Any ideas to help things speed up.
>
> Thanks,
> Tim
>
>






[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon


That's a really good point.  Pruning your DB can also help with this.

-Paul Edmon-


On 06/20/2017 09:19 AM, Loris Bennett wrote:

Hi Tim,

Tim Fora <tf...@riseup.net> writes:


Hi,

Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
start. Logs show most of the time was spent on this step and other table
changes:

adding column admin_comment after account in table

Does this sound right? Any ideas to help things speed up.

It probably depends a great deal on how many entries you have in your
database and what sort of hardware you have.  We are up to around 1.6
million jobs and have never purged anything.  I seem to remember the
last update between major releases taking long enough to allow me to get
slightly uneasy, but not long enough for me to really worry, so I guess
it was probably around 10-15 minutes.  Our CPUs are around 6 years old,
but the DB is on an SSD.

HTH

Loris



[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon


Yeah, that sounds about right.  Changes between major versions can take 
quite a bit of time.  In the past I've seen upgrades take 2-3 hours for 
the DB.


As for ways to speed it up: putting the DB on newer hardware if you 
haven't already helps quite a bit (depends on architecture as to how 
much gain you will get, we went from AMD Abu Dhabi to Intel Broadwell 
and saw a factor of 3-4 speed improvement).  Upgrading to the latest 
version of MariaDB if you are on an old version of MySQL can get you 
about 30-40%.


Doing all of these whittled our DB upgrade times for major upgrades to 
about 30 min or so.


Beyond that I imagine some more specific DB optimization tricks could be 
done, but I'm not a DB admin so I won't venture to say.
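
The settings most often mentioned in this context are the InnoDB ones in
the DB server's my.cnf; the values below are purely illustrative and
should be sized to the host rather than copied:

    [mysqld]
    innodb_buffer_pool_size = 4G
    innodb_log_file_size = 64M
    innodb_lock_wait_timeout = 900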


-Paul Edmon-


On 06/20/2017 08:42 AM, Tim Fora wrote:

Hi,

Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
start. Logs show most of the time was spent on this step and other table
changes:

adding column admin_comment after account in table

Does this sound right? Any ideas to help things speed up.

Thanks,
Tim




[slurm-dev] Re: understanding of Purge in Slurmdb.conf

2017-06-08 Thread Paul Edmon
We use purge here without much of a problem.  Though I will caution 
that if you have a large database that you have never purged before 
(when we initially did it we had 3 years of data accrued and we purged 
down to 6 months), you should do it in stages.  So do maybe a few months 
at a time to walk it up to the level you need.  You can also have the 
purge archive to disk, which can be handy if you want to maintain 
historical info.  The purge itself runs monthly at midnight on the 1st 
of the month.
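
A sketch of the relevant slurmdbd.conf settings (values are examples only;
walk the purge window up gradually as described above):

    ArchiveJobs=yes
    ArchiveSteps=yes
    ArchiveDir=/var/spool/slurm/archive
    PurgeJobAfter=6months
    PurgeStepAfter=6months
    PurgeEventAfter=12months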


-Paul Edmon-


On 06/08/2017 11:36 AM, Rohan Gadalkar wrote:

Re: [slurm-dev] Re: understanding of Purge in Slurmdb.conf
Hello Dr. Loris,

I will go through the document and will try to work on it. I'll reply 
you on the same as soon as I work on this.



Thanks for the example.


Regards,
Rohan

On Thu, Jun 8, 2017 at 3:17 PM, Loris Bennett 
<loris.benn...@fu-berlin.de <mailto:loris.benn...@fu-berlin.de>> wrote:



Rohan Gadalkar <rohangadal...@gmail.com
<mailto:rohangadal...@gmail.com>> writes:

> Re: [slurm-dev] Re: understanding of Purge in Slurmdb.conf
>
> Hello Dr.Lorris,
>
> I want to understand how the purge works.
>
> I mentioned the points like PurgeJobAfter etc; which I did not
understand.
>
> How do I use these parameters in slurmdb.conf ??
>
> Is there any way that you can make me understand this ??

I'm not sure about that, but if, for example, you write

PurgeJobAfter=6

in your slurmdb.conf, all database entries referring to jobs which are
older than 6 months will be deleted at the beginning of each month.

Disclaimer: I haven't used these settings - I am repeating what it
says in the documentation.

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email
loris.benn...@fu-berlin.de <mailto:loris.benn...@fu-berlin.de>






[slurm-dev] Re: Partition default job time limit

2017-05-09 Thread Paul Edmon

Yes.  Use the DefaultTime option.


*DefaultTime*
   Run time limit used for jobs that don't specify a value. If not set
   then MaxTime will be used. Format is the same as for MaxTime. 



https://slurm.schedmd.com/slurm.conf.html
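
For example, something along these lines in slurm.conf (partition name and
times are illustrative):

    PartitionName=normal Nodes=node[01-10] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP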

-Paul Edmon-

On 05/09/2017 05:35 AM, Georg Hildebrand wrote:

Partition default job time limit
Hi @here,

Is it possible to have a default job time limit for a slurm partition 
that is lower than the MaxTime?


Viele Grüße / kind regards
Georg






[slurm-dev] Re: Specify node name for a job

2017-05-04 Thread Paul Edmon

*-w*, *--nodelist*=
   Request a specific list of hosts. The job will contain /all/ of
   these hosts and possibly additional hosts as needed to satisfy
   resource requirements. The list may be specified as a
   comma-separated list of hosts, a range of hosts (host[1-5,7,...] for
   example), or a filename. The host list will be assumed to be a
   filename if it contains a "/" character. If you specify a minimum
   node or processor count larger than can be satisfied by the supplied
   host list, additional resources will be allocated on other nodes as
   needed. Duplicate node names in the list will be ignored. The order
   of the node names in the list is not important; the node names will
   be sorted by Slurm. 
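
So the sketch for the case in question is to keep --nodes as a count and
name the host separately (the node name is whatever your site uses):

    #SBATCH --nodes=1
    #SBATCH --nodelist=compute-0-1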



On 5/4/2017 12:49 PM, Mahmood Naderan wrote:

Specify node name for a job
Hi,
Sorry for this simple question but I didn't find it in the verbose 
documentation pages. I want to specify a node name for a job. If I change

#SBATCH --nodes=1
to
#SBATCH --nodes=compute-0-1

Then the sbatch command returns an error where the node count is 
wrong. How can I specify node name in the sbatch script?



Regards,
Mahmood






[slurm-dev] Re: User accounting

2017-05-04 Thread Paul Edmon

sacct is where you want to look:

https://slurm.schedmd.com/sacct.html
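
A hedged starting point (the field names come from the sacct man page;
per-user aggregate totals still need a little scripting, or sreport, on
top of this):

    sacct -u <user> -S 2017-01-01 \
          --format=User,JobID,State,Elapsed,AllocCPUS,ReqMem,MaxRSS
    # sreport can also give per-user aggregates, e.g.:
    sreport user topusage start=2017-01-01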

-Paul Edmon-


On 5/4/17 9:09 AM, Mahmood Naderan wrote:

User accounting
Hi,
I read the accounting page https://slurm.schedmd.com/accounting.html 
however since it is quite large, I didn't get my answer!
I want to know the user stats for their jobs. For example, something 
like this


   user name, number of jobs, number of failed jobs, total clock time for 
   all jobs (successful and not), total cores used, total memory, total disk, ...


Assume a user has submitted 2 jobs with the following specs:
job1: 10 minutes, 2.4GB memory, 4 cores, 1GB disk, success
job2: 15 minutes, 6GB memory, 2 cores, 2GB disk, failed (due to his 
code and not a system error)


So the report looks like

user, 2, 1, 25 min, 6, 8.4GB, 3GB, ...

How can I get that?

Regards,
Mahmood






[slurm-dev] Re: GUI Job submission portal

2017-04-20 Thread Paul Edmon
There is no web portal that SchedMD provides that I am aware of.  That 
said, the API exists such that one could write one if they were so 
inclined.  Slurm does support the rest of the features you list, though 
I'm not sure how much of that is exposed via the API.  One could write 
wrappers around the normal slurm commands to effectively do the same thing.


-Paul Edmon-


On 04/20/2017 06:23 AM, Parag Khuraswar wrote:


Hi All,

Does SLURM support the features below?

Job Schedulers

• Workload cum resource manager with policy-aware, resource-aware and 
topology-aware scheduling
• Advance reservation support
• Support of job submission through CLI, Web-services and APIs
• Heterogeneous and multi-cluster support
• Preemptive and backfill scheduling support
• Application integration support
• Live reconfiguration capability
• SLA/Equivalent
• GPU-aware scheduling
• Intuitive web interface to submit and monitor jobs

Regards,

Parag





[slurm-dev] Re: Deleting jobs in Completing state on hung nodes

2017-04-10 Thread Paul Edmon


Sometimes restarting slurm on the node and the master can purge the jobs 
as well.


-Paul Edmon-


On 04/10/2017 03:59 PM, Douglas Meyer wrote:

Set node to drain if other jobs running.  Then down and then resume.  Down will 
kill and clear any jobs.

scontrol update nodename= state=drain reason=job_sux

scontrol update nodename= state=down reason=job_sux
scontrol update nodename= state=resume

If it happens again either reboot or stop and restart slurm.  Make sure you 
verify it has stopped.

Doug

-Original Message-
From: Gene Soudlenkov [mailto:g.soudlen...@auckland.ac.nz]
Sent: Monday, April 10, 2017 12:56 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Deleting jobs in Completing state on hung nodes


It happens sometimes - in our case the epilogue code got stuck. Either check the 
processes and kill whichever ones belong to the user or simply reboot the nodes.

Cheers,
Gene

--
New Zealand eScience Infrastructure
Centre for eResearch
The University of Auckland
e: g.soudlen...@auckland.ac.nz
p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453
w: www.nesi.org.nz

On 11/04/17 07:52, Tus wrote:

I have 2 nodes that have hardware issues and died with jobs running on
them. I am not able to fix the nodes at the moment but want to delete
the jobs that are stuck in completing state from slurm. I have set the
nodes to DRAIN and tried scancel which did not work.

How do I remove these jobs?




[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-16 Thread Paul Edmon

You should look at LLN (least loaded nodes):

https://slurm.schedmd.com/slurm.conf.html

That should do what you want.
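
In this setup that would look something like the following (a sketch;
CR_LLN can also be set per-partition with LLN=YES on the partition line):

    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU,CR_LLN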

-Paul Edmon-

On 03/16/2017 12:54 PM, kesim wrote:

Fwd: Scheduling jobs according to the CPU load

-- Forwarded message --
From: *kesim* <ketiw...@gmail.com <mailto:ketiw...@gmail.com>>
Date: Thu, Mar 16, 2017 at 5:50 PM
Subject: Scheduling jobs according to the CPU load
To: slurm-dev@schedmd.com <mailto:slurm-dev@schedmd.com>


Hi all,

I am a new user and I created a small network of 11 nodes with 7 CPUs per 
node out of users' desktops.

I configured slurm as:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
When I submit a task with srun -n70 task
it will fill 10 nodes with 7 tasks/node. However, I have no clue what 
the algorithm for choosing the nodes is. Users run programs on the 
nodes and some nodes are busier than others. It seems logical that 
the scheduler should submit the tasks to the less busy nodes, but that is 
not the case.
In the output of sinfo -N -o '%N %O %C' I can see that jobs are allocated to 
node11, which has a load of 2.06, leaving node4, which is totally 
idle. That somehow makes no sense to me.

node1   0.00  7/0/0/7
node2   0.26  7/0/0/7
node3   0.54  7/0/0/7
node4   0.07  0/7/0/7
node5   0.00  7/0/0/7
node6   0.01  7/0/0/7
node7   0.00  7/0/0/7
node8   0.01  7/0/0/7
node9   0.06  7/0/0/7
node10  0.11  7/0/0/7
node11  2.06  7/0/0/7
How can I configure slurm to fill the nodes with the minimum 
load first?







[slurm-dev] Re: Federated Cluster scheduling

2017-02-12 Thread Paul Edmon

I'm fairly certain that is coming in the 17.02 release.

According to this presentation it was supposed to be in 16.05:

https://slurm.schedmd.com/SLUG15/MultiCluster.pdf

but from what I recall of the SC BoF, full functionality was not going to 
be available until 17.02.


-Paul Edmon-

On 2/12/2017 11:05 AM, Anastasia Angelopoulou wrote:

Federated Cluster scheduling
Hi,

I would like to ask if federated cluster scheduling is available in 
the currently released Slurm versions. If yes, is there any 
documentation about it? Has anyone tried it?


I would appreciate any information.

Thank you.

Regards,
Anastasia




[slurm-dev] Re: distributing fair share allocation

2017-02-07 Thread Paul Edmon

I would probably look at qos's for this:

https://slurm.schedmd.com/qos.html

https://slurm.schedmd.com/resource_limits.html

You can attach them to partitions as well which can be handy.

You probably want to use things like MaxJobs, MaxWallDurationPerJob, 
MaxTRESperJob.
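
A sketch of how that might look for a course (names and limits are made up):

    sacctmgr add qos course101
    sacctmgr modify qos course101 set MaxWall=02:00:00 MaxTRESPerJob=cpu=2
    # then attach it to the course's partition in slurm.conf:
    #   PartitionName=course ... QOS=course101

A per-student usage cap could then be handled with limits such as
GrpTRESMins on each student's association rather than on the QOS itself.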


-Paul Edmon-

On 2/7/2017 10:11 AM, Hossein Pourreza wrote:


Greetings,

We have a cluster with SLURM 16.05.4 as its workload manager with fair 
share policy. From time to time, we need to create accounts for 
courses where students' usage gets charged against that account. I am 
wondering what the best approach would be to limit the number of cores 
and/or walltime of each job asked by each student without creating a 
dedicated allocation for a course. For example, an instructor may want 
to limit the walltime of student jobs to 2 hours and not more than 2 
cores. She/he may also want to put a cap on the number of SUs that a 
student can use so a single student cannot burn all the course allocation.


Thanks

Hossein





[slurm-dev] Re: Priority issue

2017-01-25 Thread Paul Edmon


Are there other jobs in the queue?  If so then those higher priority 
jobs are likely waiting for resources before they run. Also the lower 
priority jobs probably can't fit in before the higher priority jobs run, 
thus they won't get backfilled.  That's at least my guess.


-Paul Edmon-

On 01/25/2017 12:14 AM, Roe Zohar wrote:

Priority issue
Hello,
I have 1000 jobs in the queue. The queue has 4 servers, yet only two 
servers have running jobs and all the other jobs are pending with 
reason "Priority". I think the other jobs will only start running on 
the other servers once the currently running jobs finish.


Any idea why they are waiting on priority now instead of starting to 
run on the idle servers?


thanks


[slurm-dev] Re: preemptive fair share scheduling

2017-01-06 Thread Paul Edmon
I might email them directly then or hit up their bugzilla page. They 
noted late last year around SC that they no longer have the personnel 
necessary to monitor the dev list.  They are hiring though, so hopefully 
they will get sufficient people to watch this again as it was good to 
have them chiming in occasionally to answer general questions.


As to the specific slurm question, I don't think it can do that 
specifically.  Or at least I can't see an obvious solution in slurm to 
do what you want.  That said I haven't put a whole ton of thought into 
how I might pull this off in slurm.  You would probably have to do 
something crafty to do it with settings and environment in slurm.  
Regardless it would be nontrivial to do. You likely would have to write 
some external scripts or perhaps modify the slurm code to pull it off.


As to your second question, as it stands the code does not do nicing for 
suspends.  You could theoretically modify the slurm code to do that as 
the whole thing is open source and contribute it back to the community, 
but the feature doesn't exist currently.


-Paul Edmon-

On 01/06/2017 07:27 AM, Sophana Kok wrote:

Re: [slurm-dev] Re: preemptive fair share scheduling

On Fri, Jan 6, 2017 at 1:15 PM, Ole Holm Nielsen 
<ole.h.niel...@fysik.dtu.dk <mailto:ole.h.niel...@fysik.dtu.dk>> wrote:



On 01/06/2017 01:07 PM, Sophana Kok wrote:

Hi again, is there someone from schedmd to respond? On the
contact form,
it is written to contact the mailing list for technical questions.


If the Slurm community doesn't get you going, you need to buy
commercial support, see https://www.schedmd.com/support.php
<https://www.schedmd.com/support.php>, if you want SchedMD to
support you.

/Ole


My question is about slurm features. We already use SGE, and are 
wondering if slurm fits our needs. We can't buy support if we don't 
know whether slurm is what we are looking for...






[slurm-dev] Re: where to find completed job execution command

2017-01-06 Thread Paul Edmon


We do the same thing except with a prolog script which dumps the job 
info to a flat file so we can look up historical jobs.  Sadly the 
slurmdbd does not store this info, so you have to do it yourself.
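
A bare-bones sketch of that kind of prolog (ours is more involved, and the
paths here are made up). With PrologSlurmctld=/usr/local/sbin/save_job_info.sh
in slurm.conf, the script itself can be as simple as:

    #!/bin/bash
    # runs on the slurmctld host; SLURM_JOB_ID is set in its environment
    scontrol -o show job "$SLURM_JOB_ID" >> /some/shared/fs/job_records/"$SLURM_JOB_ID".record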


-Paul Edmon-


On 01/06/2017 04:31 AM, Loris Bennett wrote:

Sean McGrath <smcg...@tchpc.tcd.ie> writes:


Hi,

On Thu, Jan 05, 2017 at 02:29:11PM -0800, Prasad, Bhanu wrote:


Hi,


Is there a convenient command like `scontrol show job id` to check more info of 
jobs that are completed

Not to my knowledge.


or any command to check the sbatch command run in that particular job

How we do this is with the slurmctld epilog script:

EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld

Which does the following:

/usr/bin/scontrol show job=$SLURM_JOB_ID > 
$recordsdir/$SLURM_JOBID.record

The `scontrol show jobid=` record is saved to the file system for future
reference if it is needed.

It might be worth using the option '--oneliner' to print out the record
in a single line.  You could then parse it more easily for, say, then
inserting the data into a table in a database.

Cheers,

Loris



[slurm-dev] Re: Proper way to clean up ghost jobs (slurm 15.08.12-1)

2017-01-04 Thread Paul Edmon


I know that a query like this was added to 16.x series.  Namely

sacctmgr show runawayjobs

I don't think it is in 15.x series though.

-Paul Edmon-

On 01/04/2017 10:00 AM, Chris Rutledge wrote:


Hello all,

We recently discovered ghost jobs in the 
slurm_acct_db._job_table that were the result of a process crash 
back in July that was not corrected soon enough. Should this be 
corrected simply by updating the state and time_end fields for each of 
these records, or is there something other than raw SQL we should be 
using in this case?


I ran the following query to find potential ghost jobs and validated 
with squeue that each id_job is currently invalid, so I'm almost 
100% sure I have a valid list of job_db_inx to update.



Query

--

select job_db_inx,id_job,time_start,time_end,state,partition,id_user 
from hatteras_job_table where time_start <= 1480339901 and state = 1;



Thanks in advance,

Chris Rutledge


[slurm-dev] Re: Best practice for adding & maintaining lots of user accounts?

2016-12-09 Thread Paul Edmon
We manage about 800 accounts in ours, so an order of magnitude smaller.  
We use the lua plugin at submission time.  In it the lua script checks 
whether the user has an account in slurm (the script itself keeps a cache so 
as not to pound the database with this query).  If not, it will make one 
with the proper associations.  We key off the user's primary group id for 
their default association.  Thus the user's account is created when they 
submit their first job.


This should spread out the load, unless of course you have all your 
users submitting simultaneously.



-Paul Edmon-

On 12/08/2016 09:05 PM, Tuo Chen Peng wrote:


Hello,

Is there a guideline / best practice for

(1) maintain huge number of user accounts in slurm?

(2) adding thousands of user in batch?

We are maintaining a slurm cluster with 10k+ user accounts.

But recently we have needed to add thousands more users, and noticed that 
adding new users could sometimes cause slurmctld to hang or ignore 
user requests for several minutes.


(users are currently added one-by-one by using sacctmgr add user)

By enabling verbose logging in slurmctld (‘debug5’) I can see 
slurmctld does 2 things when a new user is added


(1) walk the account/user tree

(2) recalculate normalized shares for every other user

I’m guessing these steps takes time, and with usual scheduling work 
they hanged slurmctld in our case.


This might be ok if we just needed to add a handful of users, but we do 
need to add thousands of users to slurm.


So a few minutes of hang in slurm for each user added means we can only 
add 1 user every 10 minutes to avoid blocking users from accessing slurm.


And therefore will take 1 day to add 120 users, and 8+ days to add 
1000 users for example


Which works, but seems very inefficient

And it doesn’t seem right, because with more users added, the 
normalized shares calculation will only take more time.


So, does anyone know what might be an efficient way to add a huge 
number of users to slurm and maintain them?


TuoChen Peng








[slurm-dev] Re: Are Backfill parameters supported per partition?

2016-11-29 Thread Paul Edmon
Sadly all backfill parameters are global.  You can tell slurm to 
schedule per partition using the bf_max_job_part flag.  That's about it 
though.
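
i.e. something like this in slurm.conf (the value is illustrative):

    SchedulerParameters=bf_max_job_part=100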


-Paul Edmon-


On 11/29/2016 01:09 PM, Kumar, Amit wrote:


Dear SLURM,


Can I specify partition specific backfill parameters?

Thank you,

Amit





[slurm-dev] Re: Restrict users to see only jobs of their groups

2016-11-01 Thread Paul Edmon

For slurmdbd you can set this:

*PrivateData*
   This controls what type of information is hidden from regular users.
   By default, all information is visible to all users. User
   *SlurmUser*, *root*, and users with AdminLevel=Admin can always view
   all information. Multiple values may be specified with a comma
   separator. Acceptable values include:


   *accounts*
   prevents users from viewing any account definitions unless
   they are coordinators of them. 
   *jobs*

   prevents users from viewing job records belonging to other
   users unless they are coordinators of the association
   running the job when using sacct. 
   *reservations*

   restricts getting reservation information to users with
   operator status and above. 
   *usage*

   prevents users from viewing usage of any other user. This
   applies to sreport. 
   *users*

   prevents users from viewing information of any user other
   than themselves, this also makes it so users can only see
   associations they deal with. Coordinators can see
   associations of all users they are coordinator of, but can
   only see themselves when listing users

http://slurm.schedmd.com/slurmdbd.conf.html

For the slurm.conf you can do the same:

*PrivateData*
   This controls what type of information is hidden from regular users.
   By default, all information is visible to all users. User
   *SlurmUser* and *root* can always view all information. Multiple
   values may be specified with a comma separator. Acceptable values
   include:


   *accounts*
   (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing
   any account definitions unless they are coordinators of them. 
   *cloud*
   Powered down nodes in the cloud are visible. 
   *jobs*

   Prevents users from viewing jobs or job steps belonging to
   other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents users
   from viewing job records belonging to other users unless
   they are coordinators of the association running the job
   when using sacct. 
   *nodes*
   Prevents users from viewing node state information. 
   *partitions*
   Prevents users from viewing partition state information. 
   *reservations*

   Prevents regular users from viewing reservations which they
   can not use. 
   *usage*

   Prevents users from viewing usage of any other user, this
   applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Prevents
   users from viewing usage of any other user, this applies to
   sreport. 
   *users*

   (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing
   information of any user other than themselves, this also
   makes it so users can only see associations they deal with.
   Coordinators can see associations of all users they are
   coordinator of, but can only see themselves when listing users. 



http://slurm.schedmd.com/slurm.conf.html

That should do what you want.
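
A sketch for this case would be along the lines of the following, in both
slurm.conf and slurmdbd.conf (pick whichever values you actually need):

    PrivateData=jobs,usage,users,accounts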

-Paul Edmon-

On 11/01/2016 07:53 AM, Nathan Harper wrote:

Re: [slurm-dev] Restrict users to see only jobs of their groups
Hi,

No solution, but a 'me too'.


--
*Nathan Harper*// IT Systems Lead

*e: * nathan.har...@cfms.org.uk <mailto:nathan.har...@cfms.org.uk> // 
*t: * 0117 906 1104 // *m: * 07875 510891 // *w: * www.cfms.org.uk 
<http://www.cfms.org.uk%22> // Linkedin grey icon scaled 
<http://uk.linkedin.com/pub/nathan-harper/21/696/b81>
CFMS Services Ltd// Bristol & Bath Science Park // Dirac Crescent // 
Emersons Green // Bristol // BS16 7FR
CFMS Services Ltd is registered in England and Wales No 05742022 - a 
subsidiary of CFMS Ltd
CFMS Services Ltd registered office // Victoria House // 51 Victoria 
Street // Bristol // BS1 6AD


On 1 November 2016 at 11:49, Taras Shapovalov 
<taras.shapova...@brightcomputing.com 
<mailto:taras.shapova...@brightcomputing.com>> wrote:


Hey guys,

Could you tell me how I can configure slurm that users will only
be able to see the jobs submitted by members of their own user
group? Similar restriction by partition or by account is also
would work for me. I cannot find a solution anywhere.

Best regards,

Taras







[slurm-dev] Re: How to set the maximum priority to a Slurm job? (from StackOverflow.com)

2016-09-30 Thread Paul Edmon
Yeah, that's usually what I do.  Priority is a 32-bit int, so I would 
just give it a very large value near the top of that range.  That should 
put it above most other priorities, unless of course you are already using 
values that large, in which case you may have to go higher still.


It all really depends on how you set the priority scaling in your 
slurm.conf.  You just need to give the job a higher priority than any 
other job can possibly have in the queue, which is determined by your 
slurm.conf settings.  You can use sprio to see what the priorities are 
for jobs in the queue to get an idea of what the max is.


-Paul Edmon-

On 09/30/2016 10:49 AM, Sergio Iserte wrote:
Re: [slurm-dev] Re: How to set the maximum priority to a Slurm job? 
(from StackOverflow.com)

Thanks,
however, I would like to give the maximum priority.

Should I give a large number to the parameter?

Thank you.

2016-09-30 16:44 GMT+02:00 Paul Edmon <ped...@cfa.harvard.edu 
<mailto:ped...@cfa.harvard.edu>>:


scontrol update jobid=jobid priority=blah

That works at least on a per job basis.

    -Paul Edmon-

On 09/30/2016 10:34 AM, Sergio Iserte wrote:

Hello,
this is a copy of my own StackOverflow post:
http://stackoverflow.com/q/39787477/4286662
<http://stackoverflow.com/q/39787477/4286662>

Maybe, here, it can be solved faster:

I am using /sched/backfill/ policy and the default job
priority policy (FIFO) and as administrator I need to give
the maximum priority to a given job.

I have found that submission options like:
|--priority=| or |--nice[=adjustment]| could be
useful, but I do not know which values I should assign them
in order to provide the job with the highest priority.

Another approach could be to set a low priority by default to
all the jobs and to the special ones increase it.

Any idea of how I could carry it out?


Best regards,
Sergio.

-- 
Sergio Iserte

High Performance Computing & Architectures (HPCA)
Department of Computer Science and Engineering (DICC)
Universitat Jaume I (UJI)/
/






--
Sergio Iserte
High Performance Computing & Architectures (HPCA)
Department of Computer Science and Engineering (DICC)
Universitat Jaume I (UJI)/
/





[slurm-dev] Re: How to set the maximum priority to a Slurm job? (from StackOverflow.com)

2016-09-30 Thread Paul Edmon

scontrol update jobid=jobid priority=blah

That works at least on a per job basis.

-Paul Edmon-

On 09/30/2016 10:34 AM, Sergio Iserte wrote:

How to set the maximum priority to a Slurm job? (from StackOverflow.com)
Hello,
this is a copy of my own StackOverflow post: 
http://stackoverflow.com/q/39787477/4286662


Maybe, here, it can be solved faster:

I am using /sched/backfill/ policy and the default job priority
policy (FIFO) and as administrator I need to give the maximum
priority to a given job.

I have found that submission options like: |--priority=| or
|--nice[=adjustment]| could be useful, but I do not know which
values I should assign them in order to provide the job with the
highest priority.

Another approach could be to set a low priority by default to all
the jobs and to the special ones increase it.

Any idea of how I could carry it out?


Best regards,
Sergio.

--
Sergio Iserte
High Performance Computing & Architectures (HPCA)
Department of Computer Science and Engineering (DICC)
Universitat Jaume I (UJI)/
/





[slurm-dev] Re: Slurm web dashboards

2016-09-27 Thread Paul Edmon

Nice.  I might recommend grafana.  There is a nice dashboard for that here:

http://giovannitorres.me/graphing-sdiag-with-graphite.html

Then there are other diamond collectors you can use to gather other 
statistics:


https://github.com/fasrc/slurm-diamond-collector

Grafana is pretty flexible for presenting data; you can also just take 
the graphs it generates and embed them elsewhere.


-Paul Edmon-

On 09/27/2016 08:21 AM, John Hearns wrote:


Hello all.  What are the thoughts on a Slurm ‘dashboard’?  The purpose 
being to display cluster status on a large screen monitor.


I rather liked the look of this, based on dashing.io: 
https://github.com/julcollas/dashing-slurm/blob/master/README.md


Sadly dashing.io is not being supported, and this looks two years old now.

There is the slurm-web from EDF.

Also a containerised dashboard by Christian Kniep (hello Christian).

Any other packages out there please?





[slurm-dev] Re: Passing MCA parameters via srun

2016-09-26 Thread Paul Edmon

Excellent.  Thanks for the info.

-Paul Edmon-


On 09/26/2016 04:03 PM, Paul Hargrove wrote:

Re: [slurm-dev] Passing MCA parameters via srun
Paul,

If the user always wants a specific set of MCA options, then they 
should be placed in $HOME/.openmpi/mca-params.conf

Otherwise, one should use environment variables with the prefix OMPI_MCA_

A far more complete answer is here: 
https://www.open-mpi.org/faq/?category=tuning#setting-mca-params
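
So, as a sketch, the srun equivalent of that mpirun line (the variable
name follows the OMPI_MCA_<param> convention):

    export OMPI_MCA_btl=self,openib
    srun --mpi=pmi2 -n 16 ./my_parallel_app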


-[another] Paul

On Mon, Sep 26, 2016 at 12:52 PM, Paul Edmon <ped...@cfa.harvard.edu 
<mailto:ped...@cfa.harvard.edu>> wrote:



Typically on our cluster we use:

srun --mpi=pmi2

To launch MPI applications as we have task affinity turned on and
we want to ensure that the binding works properly (if we just
invoke mpirun it doesn't bind properly).  One of our users wishes
to pass in more options to srun to tune MPI (specifically OpenMPI
1.10.2), such as the standard MPI options:

mpirun -mca btl self,openib

My question is how do I get slurm to pass these options through
when it invokes MPI.

    -Paul Edmon-




--
Paul H. Hargrove phhargr...@lbl.gov <mailto:phhargr...@lbl.gov>
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900




[slurm-dev] Passing MCA parameters via srun

2016-09-26 Thread Paul Edmon


Typically on our cluster we use:

srun --mpi=pmi2

To launch MPI applications as we have task affinity turned on and we 
want to ensure that the binding works properly (if we just invoke mpirun 
it doesn't bind properly).  One of our users wishes to pass in more 
options to srun to tune MPI (specifically OpenMPI 1.10.2) such as to 
standard MPI options:


mpirun -mca btl self,openib

My question is how do I get slurm to pass these options through when it 
invokes MPI.


-Paul Edmon-


[slurm-dev] Slurm Diamond Collectors

2016-07-07 Thread Paul Edmon


For those using graphite and diamond, check this out as they may be useful.

https://github.com/fasrc/slurm-diamond-collector

-Paul Edmon-


[slurm-dev] Per Job Usage

2016-06-14 Thread Paul Edmon


So I don't thread hijack, here is my post in a new thread.

Is there a way in sacct (or some other command) to get the RawUsage of a 
job, rather than the aggregate calculated by sshare?  I want to see the 
breakdown of which job charged how much against a user's total usage and 
hence fairshare score.  Essentially I want to reconstruct the math it 
used for a specific user to see if there was a job that it charged 
inordinately more for, and why.


My first logical step was to look at sacct but I didn't see an entry 
that simply listed RawUsage for the job in terms of TRES. Even better 
would be a break down of memory and cpu charges.


-Paul Edmon-


[slurm-dev] Per Job Usage

2016-06-14 Thread Paul Edmon


Is there a way in sacct (or some other command) to get the RawUsage of a 
job, rather than the aggregate calculated by sshare?  I want to see the 
breakdown of which job charged how much against a user's total usage and 
hence fairshare score.  Essentially I want to reconstruct the math it 
used for a specific user to see if there was a job that it charged 
inordinately more for, and why.


My first logical step was to look at sacct but I didn't see an entry 
that simply listed RawUsage for the job in terms of TRES.  Even better 
would be a break down of memory and cpu charges.


-Paul Edmon-


[slurm-dev] Re: Preemption order?

2016-06-10 Thread Paul Edmon


I'm interested in this as well.  It would be good to have a few knobs to 
fine tune which jobs preferentially get requeued.  I know it adds 
overhead to the scheduling but the benefit to us outweighs that.  
Essentially it would be good to preempt jobs that are small and 
short before the big ones or those that have been running for a long time.


-Paul Edmon-

On 06/10/2016 02:36 AM, Steffen Grunewald wrote:

Good morning everyone,

is there a way to control the order in which jobs get preempted?
That is, for a queue with PreemptMode=REQUEUE, it would make sense to
preempt jobs first that haven't run for long; for CANCEL with some
GraceTime, perhaps addressing the longest-running jobs first would
be better - can Slurm's strategy be hinted twoards the "right" way?

Thanks,

- S



[slurm-dev] Re: Architectural constraints of Slurm

2016-05-24 Thread Paul Edmon
This email actually reminded me of what we run here.  We've been 
operating with slurm for the past 3 years in an environment similar to 
your own.  We schedule about 50k cores, 90 partitions, heterogeneous 
architecture of AMD, Intel, high memory versus low memory and GPU.  
Slurm has handled it all and a lot of the lessons we learned have been 
pushed back to the community.  Thus the current version of slurm should 
be able to handle all that you have defined there.


The only tricky part that I am aware of may be the Kerberos part. We use 
ldap for our auth on the boxes, and slurm does have a pam module for 
handling who gets access.  That plus cgroups should take care of your 
security problems.


Anyways, suffice it to say slurm can work for your environment as your 
environment is fairly similar to ours.


-Paul Edmon-

On 05/24/2016 07:33 AM, Šimon Tóth wrote:

Architectural constraints of Slurm
Hello,

I'm looking for information on Slurm architectural constraints as we 
are considering a switch to Slurm.


We are currently running a heavily modified version of Torque with a 
custom scheduler.


Our system (~13k CPU cores) is heavily heterogeneous (~40 clusters) 
with complex operational constraints. The system generally handles 
10k-50k enqueued jobs.


We are currently scheduling CPU cores, Memory, GPU cards, Scratch 
space (local, SSD and NFS with different machines having access to 
different combination of these) and software licenses.


Machines are described by a set of physical and software properties 
(can be requested by users) and their speed (users can request ranges 
of machine performance).


Jobs carry complex requests. Each job can request sets of machines, 
where each set carries a different specification. Each set is described 
by the amount of resources requested, machine properties (negative 
specification is supported for specifying nodes that do not have 
specific property) and the number of nodes with such specification. 
Nodes can be allocated exclusively (in which case, the specification 
describes the minimum amount) or shared, with each resource still 
being allocated exclusively (jobs cannot overlap in cores, memory, gpu 
cards).


We are relying on Kerberos which is used for both identification and 
authentication. Each running task inside of a job has a nanny process 
that periodically refreshes the kerberos ticket for that particular 
process.


For scalability reasons, our scheduler relies on the server to keep an 
up-to-date and full state of the system. The server therefore counts 
the current resource allocation state for each of the resources on 
each of the nodes.


Please let me know if this sounds like something Slurm could handle, 
or if there are any limitations in Slurm that would make this 
impossible to support.


Sincerely,
Simon Toth




[slurm-dev] Print Out Slurm Network Hierarchy

2016-05-09 Thread Paul Edmon


We've been having some node communication timeouts on our network.  I 
know that slurm communicates in a hierarchical fashion.  Is it possible 
to have slurm spit out its communications tree?  This would be helpful 
for isolating nodes that could be causing problems, especially since we 
have nodes on various different networks and in various locations.


In addition it would be great if we could specify our own topology or at 
least give hints to slurm about what would be a good layout for its 
communications.  From what I know the topology plugin is really more for 
compute topology and not so much internal slurm communications.


Thanks.

-Paul Edmon-


[slurm-dev] Re: Slurm Upgrade Instructions needed

2016-05-04 Thread Paul Edmon
Oh and the longest part of the upgrade is the database upgrade. It 
varies depending on how big it is.  We have about 20 million jobs in 
ours and it takes about 30-40 minutes.  The rest of the upgrade is 
pretty quick, as it just involves updating the rpms and restarting the 
scheduler and slurmds.


-Paul Edmon-


On 5/4/2016 10:10 PM, Paul Edmon wrote:


Specifically the upgrade instructions are here:

http://slurm.schedmd.com/quickstart_admin.html

Look at the bottom of the page.  If you follow the instructions you 
should be fine. Though I would recommend pausing the scheduler by 
suspending all the jobs and setting all the partitions to down.  This 
helps with keeping the database in lock step as jobs won't be exiting 
or launching while you are upgrading the database.


You can do it live though, but you may end up with a few jobs in the 
database that are inconsistent.  The rest is very smooth and works 
well.  Just note that once you do a major upgrade like this there is 
no rolling back due to the database and job structure changes.


-Paul Edmon-

On 5/4/2016 6:23 PM, Lachlan Musicman wrote:

Re: [slurm-dev] Slurm Upgrade Instructions needed
I would backup /etc/slurm.

That's about it.

Cheers
L.

--
The most dangerous phrase in the language is, "We've always done it 
this way."


- Grace Hopper

On 5 May 2016 at 07:36, Balaji Deivam <balaji.dei...@seagate.com 
<mailto:balaji.dei...@seagate.com>> wrote:


Hello,

Right now we are using Slurm 14.11.3 and planning to upgrade to
the latest version 15.8.7. Could you please share the latest
upgrade steps document link?

Also please let me know what I need to back up
in the existing slurm version before installing the new one.


Thanks & Regards,
Balaji Deivam
Staff Analyst - Business Data Center
Seagate Technology - 389 Disc Drive, Longmont, CO 80503 |
720-684- _3395_








[slurm-dev] Re: Slurm Upgrade Instructions needed

2016-05-04 Thread Paul Edmon

Specifically the upgrade instructions are here:

http://slurm.schedmd.com/quickstart_admin.html

Look at the bottom of the page.  If you follow the instructions you 
should be fine. Though I would recommend pausing the scheduler by 
suspending all the jobs and setting all the partitions to down. This 
helps with keeping the database in lock step as jobs won't be exiting or 
launching while you are upgrading the database.


You can do it live though, but you may end up with a few jobs in the 
database that are inconsistent.  The rest is very smooth and works 
well.  Just note that once you do a major upgrade like this there is no 
rolling back due to the database and job structure changes.


-Paul Edmon-

On 5/4/2016 6:23 PM, Lachlan Musicman wrote:

Re: [slurm-dev] Slurm Upgrade Instructions needed
I would backup /etc/slurm.

That's about it.

Cheers
L.

--
The most dangerous phrase in the language is, "We've always done it 
this way."


- Grace Hopper

On 5 May 2016 at 07:36, Balaji Deivam <balaji.dei...@seagate.com 
<mailto:balaji.dei...@seagate.com>> wrote:


Hello,

Right now we are using Slurm 14.11.3 and planning to upgrade to
the latest version 15.8.7. Could you please share the latest
upgrade steps document link?

Also please let me know what I need to back up
in the existing slurm version before installing the new one.


Thanks & Regards,
Balaji Deivam
Staff Analyst - Business Data Center
Seagate Technology - 389 Disc Drive, Longmont, CO 80503 | 720-684-
_3395_






[slurm-dev] Zero Bytes Received

2016-05-02 Thread Paul Edmon


I'm seeing quite a few of these errors:

May  2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg: 
Zero Bytes were transmitted or received
May  2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg: 
Zero Bytes were transmitted or received


I know that this can be caused by a node or client that is in a bad 
state, but I can't figure out how to trace it back to which one. Does 
anyone have any tricks for tracing this sort of error back?  I turned on 
the Protocol Debug Flag but none of the additional debug statements led 
to the culprit.


-Paul Edmon-


[slurm-dev] Re: Saving job submissions

2016-04-27 Thread Paul Edmon
We use this python script as a slurmctld prolog to save ours.  Basically 
it pulls all the info from the slurm hash files and copies it to a separate 
filesystem.  We used to do it via mysql but the database got too large.


We then use the get_jobscript script to actually query the job scripts.

-Paul Edmon-

On 04/27/2016 03:41 AM, Lennart Karlsson wrote:


On 04/27/2016 08:46 AM, Miguel Gila wrote:
Another option is to run (as SlurmUser, or root) when the job is 
still in the system (R or PD):


# scontrol show jobid= -dd

and dump its output somewhere.

Miguel


Hi,

We are using this one, run as slurmctld.prolog,
saving the output for 30 days.

Cheers,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
   http://www.uppmax.uu.se/


#!/usr/bin/python -tt

import logging
import os
import sys
import syslog
import traceback
from glob import glob


def get_env_vars(job_id):
    #if not os.path.exists('/slurm/spool/job.%s/environment' % job_id):
    #    return []
    paths = glob('/slurm/spool/hash.*/job.%s/environment' % job_id)
    if len(paths) == 0:
        return []

    # Not sure how this would happen, but it would be weird
    if len(paths) > 1:
        return []

    # Skip the first four bytes, which are a uint32 indicating the
    # length of the following data.
    env_raw = open(paths[0]).read()[4:]

    env_vars = []
    for env in env_raw.split('\0'):
        if env in ['', ';']:
            continue

        name, value = env.split('=', 1)
        env_vars.append("%s='%s'" % (name, value.replace("'", r"'\''")))

    return env_vars

def get_job_script(job_id):
    #if not os.path.exists('/slurm/spool/job.%s/script' % job_id):
    #    return ''
    paths = glob('/slurm/spool/hash.*/job.%s/script' % job_id)
    if len(paths) == 0:
        return ''

    # Not sure how this would happen, but it would be weird
    if len(paths) > 1:
        return ''

    # The job script has a trailing NULL. o_O
    script_raw = open(paths[0]).read().rstrip('\0')
    shebang = rest = ''
    try:
        shebang, rest = script_raw.split('\n', 1)
    except ValueError, e:
        raise Exception('No lines in script file /slurm/spool/hash.*/job.%s/script' % job_id)

    sbatch_lines = []
    rest_lines = []
    in_sbatch = True
    for line in rest.split('\n'):
        if in_sbatch:
            if line.startswith('#SBATCH') or line.strip() == '':
                sbatch_lines.append(line)
            else:
                in_sbatch = False
                rest_lines.append(line)
        else:
            rest_lines.append(line)

    return '%s\n%s\n\n%s\n BEGIN RUNTIME ENV \n%s' % (
        shebang, '\n'.join(sbatch_lines).strip(),
        '\n'.join(rest_lines), '\n'.join(get_env_vars(job_id)),
    )

def save_job_script(job_id):

    # The number of subdirectories the scripts will be distributed into
    # job_id modulo SUBDIRCOUNT determines parent directory for the job id
    SLURM_JOBSCRIPT_SUBDIRCOUNT = 1000;
    SLURM_JOBSCRIPT_SUBDIRCOUNT = int(os.environ.get("SLURM_JOBSCRIPT_SUBDIRCOUNT","1000"))

    # The root path of the jobscripts
    SLURM_JOBSCRIPT_HOME = os.environ.get("SLURM_JOBSCRIPT_HOME","/jobscripts")

    subdir = str(int(job_id) % SLURM_JOBSCRIPT_SUBDIRCOUNT)
    scriptdir = os.path.join(SLURM_JOBSCRIPT_HOME, subdir)
    if not os.path.exists(scriptdir):
        os.mkdir(scriptdir)

    scriptfile = os.path.join(scriptdir, job_id)
    with open(scriptfile, 'w') as f:
        f.write(get_job_script(job_id))


if __name__ == '__main__':
    try:
        save_job_script(os.environ['SLURM_JOB_ID'])
    except Exception as e:
        syslog.openlog(os.path.basename(sys.argv[0]), syslog.LOG_PID)
        for line in traceback.format_exc().split('\n'):
            syslog.syslog(syslog.LOG_ERR, line)

get_jobscript.sh
Description: Bourne shell script
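
For anyone wanting to wire a script like the above into their own setup, the 
hook is a single slurm.conf setting; the path below is just a placeholder for 
wherever you install the script:

   PrologSlurmctld=/usr/local/sbin/save_job_script.py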


[slurm-dev] TRES in QoS

2016-04-11 Thread Paul Edmon


So when you list a TRES in a QoS limit, do you use the billing-weight 
modified TRES or the actual raw number?  Say, for instance, I want to 
set a memory limit of 256 GB and I am billing at 0.25 per GB.  Do I then do:


GrpTRES=mem=256

or

GrpTRES=mem=64

The docs seem a bit ambiguous on this point.
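
(Either way, for anyone searching the archive later, a limit like this gets 
attached with something along these lines, with the QoS name and value as 
placeholders:

   sacctmgr modify qos normal set GrpTRES=mem=VALUE
)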

-Paul Edmon-


[slurm-dev] Re: User education tools for fair share

2016-03-01 Thread Paul Edmon


We use the showq tool.  We are looking into fairshare plotting over 
time.  I've been doing it on our test cluster, and seeing the evolution 
of fairshare over time really gives you a good feel for what your usage 
is and what is going on.  We haven't implemented it yet though, as we 
would need to set up a data repository for historic fairshare data.  
Still, it is on our docket to do.


sshare is a great way of seeing things too, though it can be a bit much 
for your average user.


-Paul Edmon-

On 03/01/2016 04:52 AM, Chris Samuel wrote:

Hi Loris,

On Tue, 1 Mar 2016 12:29:12 AM Loris Bennett wrote:


To help the user understand their current fairshare/priority status, I
usually point them to 'sprio', generally in the following incantation:

sprio -l | sort -nk3

to get the jobs sorted by priority.

Thanks for that, our local version of "showq" already orders waiting jobs by
priority, so I'll direct users to that first, and then to sprio to get a
breakdown of priority.

All the best,
Chris


[slurm-dev] Re: job_submit.lua slurm.log_user

2016-01-29 Thread Paul Edmon


Ah, okay.  In this case I wanted to print out something to the user and 
still succeed, as it was just a warning to the user that their script 
had been modified.


Thanks.

-Paul Edmon-

On 01/29/2016 12:36 PM, je...@schedmd.com wrote:


It works for me, but only for the job_submit function (not job_modify) 
and requires that the function return an error code (not "SUCCESS"). 
Here's a trivial example


function slurm_job_submit(job_desc, part_list, submit_uid)
  slurm.log_user("TEST");
  return slurm.ERROR
end

$ srun hostname
srun: error: TEST
srun: error: Unable to allocate resources: Unspecified error

Quoting Paul Edmon <ped...@cfa.harvard.edu>:
So I've been working with the job_submit.lua part of our config, we 
are running 15.08.7 and while slurm.log_info works, slurm.log_user 
does not.  My understanding is that slurm.log_user will print to the 
users console.  My code looks like the following:


slurm.log_user("Your current fairshare is not high enough to use the 
priority partition. Your job has been automatically reset to the 
following: %s",job_desc.partition)


Does any one see a problem with this?  Does any one have a version of 
job_submit.lua that uses slurm.log_user and it works?  If so what is 
the trick?  Or am I misunderstanding what slurm.log_user does?


-Paul Edmon-




[slurm-dev] Re: Pause all new submissions

2016-01-20 Thread Paul Edmon
You can also set all the partitions to DOWN.  All pending jobs will 
pend, new jobs can still be submitted, and existing jobs will finish.  However, 
no new jobs will be scheduled.
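
For example, something like the following, with the partition name as a 
placeholder:

   scontrol update PartitionName=general State=DOWN
   (do your maintenance)
   scontrol update PartitionName=general State=UP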


-Paul Edmon-

On 01/20/2016 03:41 PM, Trey Dockendorf wrote:

Re: [slurm-dev] Pause all new submissions
Can likely use a reservation: http://slurm.schedmd.com/reservations.html.

I use something like this to put entire cluster into maintenance:

scontrol create reservation starttime=now duration=infinite 
accounts=mgmt flags=maint,ignore_jobs nodes=ALL


The above will ignore running jobs and prevent new jobs from 
starting.  It also allows people in mgmt account to run jobs by using 
--reservation= with sbatch.


- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu <mailto:treyd...@tamu.edu>
Jabber: treyd...@tamu.edu <mailto:treyd...@tamu.edu>

On Wed, Jan 20, 2016 at 2:06 PM, Jordan Willis <jwillis0...@gmail.com 
<mailto:jwillis0...@gmail.com>> wrote:



Hi,

How can I suspend new submissions to slurm without killing or
suspending everything currently scheduled? Also, can I do this on
the cluster level or does it have to be by partition?

Jordan






[slurm-dev] Re: slurmdbd upgrade

2016-01-12 Thread Paul Edmon


That sounds about right.  We have about the same order of magnitude and 
our last major upgrade took about an hour for the DB to  update itself.
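
If you want a safety net first, a minimal pre-upgrade backup sketch, assuming 
the default database name slurm_acct_db and suitable MySQL credentials:

   mysqldump slurm_acct_db > slurm_acct_db.pre-upgrade.sql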


-Paul Edmon-

On 1/12/2016 5:04 PM, Andrew E. Bruno wrote:

We're planning an upgrade from 14.11.10 to 15.08.6 and in the past the
slurmdbd upgrades can take a while (~20-30 minutes). Curious if there's
any way to gauge what we can expect this time? To give an idea, we have
upwards of ~4.5 million records in our job tables. We frequently get
these warnings in our logs:

  Warning: Note very large processing time from hourly_rollup
  ...

Also, any "best practices" for upgrading the slurmdbd other than:

   yum update; systemctl restart slurmdbd

(this usually hangs until the database migrations are complete)

Thanks in advance for any pointers.

Cheers,

--Andrew


[slurm-dev] Grp Limits for Partition QoS

2016-01-06 Thread Paul Edmon


So I want to set up a partition so that each account gets up to a 
certain TRES limit and no more.  I don't want to do it per user, as a 
user could be part of multiple accounts, and I don't want to do it 
globally per account as we have backfill partitions that I still want 
the account to have access to.  I just don't want any given account to 
dominate the partition, especially if that account has perfect fairshare 
and submits a ton of jobs at once.


I was hoping that the Partition QoS could help me do this.  However, I'm 
not sure how Grp limits apply here.  I would use something like GrpTRES 
but from my reading of the description it would apply to the whole QoS 
and hence the whole partition rather than on a per Account basis.  There 
is MaxTRESPerUser,  but I want something more like MaxTRESPerAccount (I 
don't know if this exists and I'm just missing it).


Thanks for the info.

-Paul Edmon-


[slurm-dev] MsgAggregation Parameters

2015-12-08 Thread Paul Edmon


We recently upgraded to 15.08.4 from 14.11 and we wanted to try out the 
MsgAggregation to see if that would improve cluster throughput and 
responsiveness.  However when we turned it on with the settings of 
WindowMsgs=10 and WindowTime=100 everything slowed to a crawl and it 
looked like the slurmctld was threading like crazy.  When we turned it 
off everything returned to normal.  Does any one have any suggestions or 
guidelines for what to set the MsgAggregationParam to?  I'm guessing it 
depends on the size of the cluster as we have the same settings on our 
test cluster but it is about 10 times smaller in terms of number of 
nodes than our main one.  I'm guessing this is a scaling problem.


Thoughts?  Anyone else using MsgAggregation?

-Paul Edmon-


[slurm-dev] Re: MsgAggregation Parameters

2015-12-08 Thread Paul Edmon


I will have to try that out.  Thanks for the info.

-Paul Edmon-

On 12/08/2015 01:54 PM, Danny Auble wrote:


Hey Paul,  Unless you have a very busy cluster (100s of jobs a second) 
or are running very large jobs (>2000 nodes) I don't think this will 
be very useful.  But I would expect 
MsgAggregationParams=WindowMsgs=10,WindowTime=10 to be more what you 
would want.  WindowTime=100 may be too long of a wait.  I am surprised 
at the threading of your slurmctld though, I would expect it to have 
much less threading.  Be sure to restart all your slurmd's as well as 
the slurmctld when you change the parameter. Try again and see if 
lowering the WindowTime down improves your situation.


Danny

On 12/08/15 10:25, Paul Edmon wrote:


We recently upgraded to 15.08.4 from 14.11 and we wanted to try out 
the MsgAggregation to see if that would improve cluster throughput 
and responsiveness.  However when we turned it on with the settings 
of WindowMsgs=10 and WindowTime=100 everything slowed to a crawl and 
it looked like the slurmctld was threading like crazy.  When we 
turned it off everything returned to normal.  Does any one have any 
suggestions or guidelines for what to set the MsgAggregationParam 
to?  I'm guessing it depends on the size of the cluster as we have 
the same settings on our test cluster but it is about 10 times 
smaller in terms of number of nodes than our main one.  I'm guessing 
this is a scaling problem.


Thoughts?  Anyone else using MsgAggregation?

-Paul Edmon-


[slurm-dev] Re: A floating exclusive partition

2015-11-19 Thread Paul Edmon


Yeah, I guess QoS won't really work for overflow.  I was more thinking 
of the QoS as a way to create a floating partition of 5 nodes with the 
rest being in the public queue.  They would send jobs to the QoS to hit 
that and then when it is full they would submit to public as normal.  
That's at least my thinking, but it's less seamless to the users as they 
will have to consciously monitor what is going on.
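
A very rough sketch of that shape, with the QoS name, priority and the 5-node 
cap all as placeholders to adapt:

   sacctmgr add qos floating5 GrpTRES=node=5 Priority=1000
   sacctmgr modify user name=someuser set qos+=floating5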


-Paul Edmon-

On 11/19/2015 10:50 AM, Daniel Letai wrote:


Can you elaborate a little? I'm not sure what kind of QoS will help, 
nor how to implement one that will satisfy the requirements.


On 11/19/2015 04:52 PM, Paul Edmon wrote:


You might consider a QoS for this.  It may not do everything you want 
but it will give you the flexibility.


-Paul Edmon-

On 11/19/2015 04:49 AM, Daniel Letai wrote:


Hi,

Suppose I have a 100 node cluster with ~5% nodes down at any given 
time (maintanence/hw failure/...).


One of the projects requires exclusive use of 5 nodes, and be able 
to use entire cluster when available (when other projects aren't 
running).


I can do this easily if I maintain a static list of the exclusive 
nodes in slurm.conf:


PartitionName=public Nodes=tux0[01-95] Default=YES
PartitionName=special Nodes=tux[001-100] Default=NO

And allowing only that project to use partition special.

However, due to the downtime of 5%, I'd like to maintain a dynamic 
exclusive 5 nodes.

Any suggestions?

The project is serial and deployed as array of single node jobs, so 
I can run it even when the other 95 nodes are full.


Thanks,
--Dani_L.


[slurm-dev] Re: Partition QoS

2015-11-12 Thread Paul Edmon
Okay, its working now.  I just had to be more patient for slurmctld to 
pick up the DB change of adding the QOS.  Thanks for the help.


-Paul Edmon-

On 11/11/2015 11:54 AM, Bruce Roberts wrote:
I believe the method James describes only allows that one qos to be 
used by jobs on the partition, it does not set up a Partition QOS as 
Paul is looking for.


Create the QOS as James describes, but instead of AllowQOS use the QOS 
option as Danny eluded to, and as documented here 
http://slurm.schedmd.com/slurm.conf.html#OPT_QOS.


PartitionName=batch Nodes=compute[01-05] *QOS*=vip Default=NO 
MaxTime=UNLIMITED State=UP


This will allow for a partition QOS that isn't associated with the job 
as talked about in the SLUG15 presentation 
http://slurm.schedmd.com/SLUG15/Partition_QOS.pdf


On 11/11/15 08:09, Paul Edmon wrote:

Thanks for the insight.  I will try it out on my end.

-Paul Edmon-

On 11/11/2015 3:31 AM, Dennis Mungai wrote:

Re: [slurm-dev] Re: Partition QoS

Thanks a lot, James. Helped a bunch ;-)

*From:*James Oguya [mailto:oguyaja...@gmail.com]
*Sent:* Wednesday, November 11, 2015 11:21 AM
*To:* slurm-dev <slurm-dev@schedmd.com>
*Subject:* [slurm-dev] Re: Partition QoS

I was able to do this on slurm 15.08.3 on a test environment. Here 
are my notes:


- create qos

sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 
MaxTRESPerUser=cpu=128


- create partition in slurm.conf

PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO 
MaxTime=UNLIMITED State=UP


I believe the keyword here is to specify the QoS name when creating 
the partition, for my case, I used AllowQOS=vip


James

On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble <d...@schedmd.com 
<mailto:d...@schedmd.com>> wrote:


I'm guessing you also added the qos to your partition line in
your slurm.conf as well.

On November 10, 2015 5:08:24 PM PST, Paul Edmon
<ped...@cfa.harvard.edu> wrote:

I did that but it didn't pick it up.  I must need to
reconfigure again after I made the qos.  I will have to try
it again.  Let you know how it goes.

-Paul Edmon-

On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:

Hi Paul,

I did this by creating the qos, e.g. sacctmgr create qos
part_whatever

Then in slurm.conf setting qos=part_whatever in the
"whatever" partition definition.

scontrol reconfigure

finally, set the limits on the qos:

sacctmgr modify qos set MaxJobsPerUser=5 where
name=part_whatever

...

-Doug




Doug Jacobsen, Ph.D.

NERSC Computer Systems Engineer

National Energy Research Scientific Computing Center
<http://www.nersc.gov>

- __o
-- _ '\<,_
--(_)/  (_)__

    On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon
<ped...@cfa.harvard.edu> wrote:


In 15.08 you are able to set QoS limits directly on
a partition. So how do you actually accomplish
this?  I've tried a couple of ways, but no luck.  I
haven't seen a demo of how to do this anywhere
either.  My goal is to set up a partition with the
following QoS parameters:

MaxJobsPerUser=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128

    Thanks for the info.

-Paul Edmon-



--

/James Oguya/


__

This e-mail contains information which is confidential. It is 
intended only for the use of the named recipient. If you have 
received this e-mail in error, please let us know by replying to the 
sender, and immediately delete it from your system. Please note, 
that in these circumstances, the use, disclosure, distribution or 
copying of this information is strictly prohibited. KEMRI-Wellcome 
Trust Programme cannot accept any responsibility for the accuracy or 
completeness of this message as it has been transmitted over a 
public network. Although the Programme has taken reasonable 
precautions to ensure no viruses are present in emails, it cannot 
accept responsibility for any loss or damage arising from the use of 
the email or attachments. Any views expressed in this message are 
those of the individual sender, except where the sender specifically 
states them to be the views of KEMRI-Wellcome Trust Programme.

__








[slurm-dev] Re: Partition QoS

2015-11-11 Thread Paul Edmon

Thanks for the insight.  I will try it out on my end.

-Paul Edmon-

On 11/11/2015 3:31 AM, Dennis Mungai wrote:

Re: [slurm-dev] Re: Partition QoS

Thanks a lot, James. Helped a bunch ;-)

*From:*James Oguya [mailto:oguyaja...@gmail.com]
*Sent:* Wednesday, November 11, 2015 11:21 AM
*To:* slurm-dev <slurm-dev@schedmd.com>
*Subject:* [slurm-dev] Re: Partition QoS

I was able to do this on slurm 15.08.3 on a test environment. Here are 
my notes:


- create qos

sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 
MaxTRESPerUser=cpu=128


- create partition in slurm.conf

PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO 
MaxTime=UNLIMITED State=UP


I believe the keyword here is to specify the QoS name when creating 
the partition, for my case, I used AllowQOS=vip


James

On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble <d...@schedmd.com 
<mailto:d...@schedmd.com>> wrote:


I'm guessing you also added the qos to your partition line in your
slurm.conf as well.

On November 10, 2015 5:08:24 PM PST, Paul Edmon
<ped...@cfa.harvard.edu <mailto:ped...@cfa.harvard.edu>> wrote:

I did that but it didn't pick it up.  I must need to
reconfigure again after I made the qos.  I will have to try it
again.  Let you know how it goes.

-Paul Edmon-

On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:

Hi Paul,

I did this by creating the qos, e.g. sacctmgr create qos
part_whatever

Then in slurm.conf setting qos=part_whatever in the
"whatever" partition definition.

scontrol reconfigure

finally, set the limits on the qos:

sacctmgr modify qos set MaxJobsPerUser=5 where
name=part_whatever

...

-Doug




Doug Jacobsen, Ph.D.

NERSC Computer Systems Engineer

National Energy Research Scientific Computing Center
<http://www.nersc.gov>

- __o
-- _ '\<,_
--(_)/  (_)__

    On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon
<ped...@cfa.harvard.edu <mailto:ped...@cfa.harvard.edu>>
wrote:


In 15.08 you are able to set QoS limits directly on a
partition.  So how do you actually accomplish this? 
I've tried a couple of ways, but no luck.  I haven't

seen a demo of how to do this anywhere either. My goal
is to set up a partition with the following QoS
parameters:

MaxJobsPerUser=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128

Thanks for the info.

-Paul Edmon-



--

/James Oguya/


__

This e-mail contains information which is confidential. It is intended 
only for the use of the named recipient. If you have received this 
e-mail in error, please let us know by replying to the sender, and 
immediately delete it from your system. Please note, that in these 
circumstances, the use, disclosure, distribution or copying of this 
information is strictly prohibited. KEMRI-Wellcome Trust Programme 
cannot accept any responsibility for the accuracy or completeness of 
this message as it has been transmitted over a public network. 
Although the Programme has taken reasonable precautions to ensure no 
viruses are present in emails, it cannot accept responsibility for any 
loss or damage arising from the use of the email or attachments. Any 
views expressed in this message are those of the individual sender, 
except where the sender specifically states them to be the views of 
KEMRI-Wellcome Trust Programme.

__




[slurm-dev] Partition QoS

2015-11-10 Thread Paul Edmon


In 15.08 you are able to set QoS limits directly on a partition.  So how 
do you actually accomplish this?  I've tried a couple of ways, but no 
luck.  I haven't seen a demo of how to do this anywhere either.  My goal 
is to set up a partition with the following QoS parameters:


MaxJobsPerUser=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128

Thanks for the info.

-Paul Edmon-


[slurm-dev] Re: Partition QoS

2015-11-10 Thread Paul Edmon
I did that but it didn't pick it up.  I must need to reconfigure again 
after I made the qos.  I will have to try it again.  Let you know how it 
goes.


-Paul Edmon-

On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:

Re: [slurm-dev] Partition QoS
Hi Paul,

I did this by creating the qos, e.g. sacctmgr create qos part_whatever
Then in slurm.conf setting qos=part_whatever in the "whatever" 
partition definition.


scontrol reconfigure

finally, set the limits on the qos:
sacctmgr modify qos set MaxJobsPerUser=5 where name=part_whatever
...

-Doug


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
<http://www.nersc.gov>


- __o
-- _ '\<,_
--(_)/  (_)__


On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon <ped...@cfa.harvard.edu 
<mailto:ped...@cfa.harvard.edu>> wrote:



In 15.08 you are able to set QoS limits directly on a partition. 
So how do you actually accomplish this?  I've tried a couple of

ways, but no luck.  I haven't seen a demo of how to do this
anywhere either.  My goal is to set up a partition with the
following QoS parameters:

MaxJobsPerUser=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128

Thanks for the info.

-Paul Edmon-






[slurm-dev] Re: Interpreting

2015-10-27 Thread Paul Edmon


MIXED just means the node is partially full.  It has nothing to do with 
the health of a node.  DOWN, DRAINED, * (where the node name is followed by 
a *) and unavailable are all bad states.  You will see compound states as 
well, such as DOWN+DRAIN, IDLE+DRAIN or ALLOCATED+DRAIN.  They indicate 
exactly what they say: IDLE+DRAIN means that the node has been drained 
and is idle, ALLOCATED+DRAIN means the node is set to drain but is 
currently fully allocated, and MIXED+DRAIN means the node is draining 
but is currently partially allocated.


Hope this helps.  Nodes in a problematic state will generally have their 
Reason field filled in with something.


-Paul Edmon-

On 10/27/2015 05:03 AM, Всеволод Никоноров wrote:

Hello,

if a node is in a MIXED state, is it possible that there are "good" and "bad" 
states mixed?  I mean, if a node is MIXED when it is partly allocated, this is 
OK.  But is it possible that a node is, for example, partly DOWN or partly 
DRAINED, or partly in some other state that is not normal?  I maintain a 
script in a monitoring system which checks whether a node is available to 
SLURM, and it considers the IDLE and ALLOCATED states as "good".  I am going 
to add the MIXED state as well, but I don't want to miss a problem in the way 
I described.

Thanks in advance!

--
Vsevolod Nikonorov, JSC NIKIET


[slurm-dev] Re: login node configuration?

2015-10-26 Thread Paul Edmon


For clarity, they should not need to talk to the compute nodes unless 
you intend to do interactive work.  You should only need to talk to the 
master to submit jobs.


-Paul Edmon-

On 10/26/2015 9:45 PM, Paul Edmon wrote:


What we did was that we just opened up port 6817 between the two 
VLAN's.  So long as the traffic is routeable and they see the same 
slurm.conf that should work.  All the login node needs is slurm.conf, 
and the slurm, slurm-munge, and slurm-plugin rpms. You don't need to 
run the slurm service to submit as all you need is the ability of the 
login node to talk to the master.


-Paul Edmon-

On 10/26/2015 8:02 PM, Liam Forbes wrote:

Hello,

I’m in the process of setting up SLURM 15.08.X for the first time. 
I’ve got a head node and ten compute nodes working fine for serial 
and parallel jobs. I’m struggling with figuring out how to configure 
a pair of login nodes to be able to get status and submit jobs. I’ve 
been searching through the documentation and googling, but I’ve not 
found a reference other than brief mentions of login nodes or submit 
hosts.


Our pair of login nodes share an ethernet network with the head node. 
They do not, currently, share a network, ethernet or IB, with the 
compute nodes. When I execute SLURM commands like sinfo or squeue, 
using strace I can see them attempting to communicate with port 6817 
on the head node. However, the head node doesn’t respond. lsof on the 
head node seems to indicate slurmctld is only listening on the 
private network shared with the compute nodes.


Is there a reference for configuring login nodes to be able to query 
slurmctld and submit jobs? If not, would somebody with a similar 
configuration be willing to share some guidance?


Do the login nodes need to be on the same network as the compute 
nodes, or at least the same network as slurmctld is listening on?


Apologies if these questions have been answered somewhere else, I 
just haven’t found that documentation.


Regards,
-liam

-There are uncountably more irrational fears than rational ones. -P. 
Dolan
Liam Forbes lofor...@alaska.edu ph: 907-450-8618 fax: 
907-450-8601

UAF Research Computing Systems Senior HPC Engineer LPIC1, CISSP


[slurm-dev] Re: login node configuration?

2015-10-26 Thread Paul Edmon


What we did was just open up port 6817 between the two 
VLANs.  So long as the traffic is routable and they see the same 
slurm.conf, that should work.  All the login node needs is slurm.conf 
and the slurm, slurm-munge, and slurm-plugins rpms.  You don't need to 
run the slurm service to submit, as all you need is the ability of the 
login node to talk to the master.
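
As a rough sketch of that install on an RPM-based login node (package names 
vary a bit between builds, and the munge key has to match the cluster's):

   yum install munge slurm slurm-munge slurm-plugins
   scp master:/etc/slurm/slurm.conf /etc/slurm/slurm.conf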


-Paul Edmon-

On 10/26/2015 8:02 PM, Liam Forbes wrote:

Hello,

I’m in the process of setting up SLURM 15.08.X for the first time. I’ve got a 
head node and ten compute nodes working fine for serial and parallel jobs. I’m 
struggling with figuring out how to configure a pair of login nodes to be able 
to get status and submit jobs. I’ve been searching through the documentation 
and googling, but I’ve not found a reference other than brief mentions of login 
nodes or submit hosts.

Our pair of login nodes share an ethernet network with the head node. They do 
not, currently, share a network, ethernet or IB, with the compute nodes. When I 
execute SLURM commands like sinfo or squeue, using strace I can see them 
attempting to communicate with port 6817 on the head node. However, the head 
node doesn’t respond. lsof on the head node seems to indicate slurmctld is only 
listening on the private network shared with the compute nodes.

Is there a reference for configuring login nodes to be able to query slurmctld 
and submit jobs? If not, would somebody with a similar configuration be willing 
to share some guidance?

Do the login nodes need to be on the same network as the compute nodes, or at 
least the same network as slurmctld is listening on?

Apologies if these questions have been answered somewhere else, I just haven’t 
found that documentation.

Regards,
-liam

-There are uncountably more irrational fears than rational ones. -P. Dolan
Liam Forbes lofor...@alaska.edu ph: 907-450-8618 fax: 907-450-8601
UAF Research Computing Systems Senior HPC EngineerLPIC1, CISSP


[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-12 Thread Paul Edmon


You should be able to do this without losing any jobs (at least I've 
never lost any on any version of Slurm I have run).  I do it all the 
time in our environment (about once a day), as our slurm.conf is in flux 
quite a bit.  It should always preserve the running and pending state.  
The only issue I have ever seen where this becomes a problem is in 
fringe cases during major version upgrades, and even then it is rare.
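
A rough sketch of the usual sequence, assuming the new slurm.conf has already 
been pushed to every node, the stock init script is in use, and pdsh knows 
your node list:

   scontrol reconfigure                # enough for many slurm.conf changes
   pdsh -a 'service slurm restart'     # when a real slurmd restart is needed
   service slurm restart               # on the controller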


-Paul Edmon-

On 10/12/2015 3:57 AM, Robbert Eggermont wrote:


Hello,

Some modifications to the slurm.conf require me to restart the slurmd 
daemons on all nodes. Is there a way to do this without loosing any 
running jobs (and not having to drain the cluster)?


Thanks,

Robbert



[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-12 Thread Paul Edmon


I've had this happen several times, but have never lost jobs due to it.  
Still one should always watch the logs on the master when restarting so 
you can catch typos immediately.


We run a sanity check on our conf's before we push them (we use puppet 
for configuration control).  Our post commit hook uses scontrol to run a 
test on the conf before pushing.  This typically catches most errors.  
Not all though.


-Paul Edmon-

On 10/12/2015 12:41 PM, Antony Cleave wrote:


While this is true be very, very careful when restarting the slurmd on 
the controller node.
it's quite easy to miss a typo in one of the config files, e.g. an 
unexpected comma in topology.conf which can cause slurm to segfault or 
otherwise shut-down uncleanly. If this happens then the state of 
queued jobs is not guaranteed and if not spotted can end up draining 
the system.


Antony

On 12/10/2015 16:12, Paul Edmon wrote:


You should be able to do this with out losing any jobs (at least I've 
never lost any on any version of Slurm I have run).  I do it all the 
time in our environment (about once a day) as our slurm.conf is in 
flux quite a bit.  It should always preserve the running and pending 
state.  The only issue I have ever seen where this becomes a problem 
is in fringe cases during major version upgrades, even then it is rare.


-Paul Edmon-

On 10/12/2015 3:57 AM, Robbert Eggermont wrote:


Hello,

Some modifications to the slurm.conf require me to restart the 
slurmd daemons on all nodes. Is there a way to do this without 
loosing any running jobs (and not having to drain the cluster)?


Thanks,

Robbert



[slurm-dev] Re: the nodes in state down*

2015-10-08 Thread Paul Edmon
Typically that means that the master is having problems communicating 
with the nodes.  I would check your networking, especially your ACLs.


-Paul Edmon-

On 10/08/2015 10:15 AM, Fany Pagés Díaz wrote:


I have a cluster with 3 nodes, and yesterday it was incorrectly turned off 
by electrical problems.  When I started the cluster, Slurm doesn't work 
correctly; the state of the nodes appears as down*.  I set them to the 
idle state but they go back to down*.  Can anyone help me?


Thanks,

Ing. Fany Pages Diaz





[slurm-dev] Re: the nodes in state down*

2015-10-08 Thread Paul Edmon
The only other thing I can think of is to check that the node daemons are 
up and okay.
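
For what it's worth, the usual checks in that situation look something like 
this (node names and the log path are placeholders for your site):

   service slurm status                       # on each compute node
   tail /var/log/slurmd.log                   # or wherever SlurmdLogFile points
   scontrol update NodeName=node[01-03] State=RESUME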


-Paul Edmon-

On 10/08/2015 11:04 AM, Fany Pagés Díaz wrote:


My network looks fine, and the communication with the nodes too; the 
problem is with Slurm, the slurmd daemon has failed.  I never made any 
configuration with ACLs but I am going to check.  If you have another 
idea, please let me know.


thanks

*De:*Paul Edmon [mailto:ped...@cfa.harvard.edu]
*Enviado el:* jueves, 08 de octubre de 2015 10:20
*Para:* slurm-dev
*Asunto:* [slurm-dev] Re: the nodes in state down*

Typically that means that the master is having problems communicating 
with the nodes.  I would check your networking, especially your ACLs.


-Paul Edmon-

On 10/08/2015 10:15 AM, Fany Pagés Díaz wrote:

I have a cluster with 3 nodes, and yesterday isincorrectly turned
off by electrical problems. When I started the cluster, the slurm
doesn´t work correctly, the state of the nodes appear in down *. I
put it to the idle state but put back in state down *, Any can
help me?

Thanks,

Ing. Fany Pages Diaz





[slurm-dev] RE: slurm job priorities depending on the number of each user's jobs

2015-10-06 Thread Paul Edmon


There was some discussion of an analogous feature at the Slurm User 
Group meeting.


http://bugs.schedmd.com/show_bug.cgi?id=1951

Basically jobs would have their fairshare penalized immediately at launch 
for the full requested runtime, rather than the actual time used, which 
would have a similar effect to what you are asking for.


That said they aren't exactly analogous and I can see situations where 
one would want to do this sort of thing so that no one person can 
monopolize the queue.


-Paul Edmon-

On 10/6/2015 6:30 PM, Kumar, Amit wrote:

Just wanted to jump the wagon, this feature would help us as well...!!

Thank you,
Amit



-Original Message-
From: gareth.willi...@csiro.au [mailto:gareth.willi...@csiro.au]
Sent: Tuesday, October 06, 2015 5:23 PM
To: slurm-dev
Subject: [slurm-dev] RE: slurm job priorities depending on the number of each
user's jobs

Hi Markus,

We would also like such a feature.  We have a clunky but practical alternative
where a cron job repeatedly runs to reset the 'nice' factor for the top few
pending jobs for each user to spread priority, encourage interleaving of jobs
starting and help each user to have at least one job running quickly.  I can
provide our script if you are interested.  It requires allowing negative nice.

Gareth


-Original Message-
From: Dr. Markus Stöhr [mailto:markus.sto...@tuwien.ac.at]
Sent: Tuesday, 6 October 2015 12:18 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] slurm job priorities depending on the number of
each user's jobs


Dear all,

we are currently looking at the details of the slurm priority multifactor

plugin.

Basically it is doing what it should, but there are, at least from our
point of view, some limitations. The five contributions to priority
(fairshare, qos, partition, job size and job
age) are fixed at a time. So if a user has submitted a huge number of
equal sized jobs, all jobs have nearly the same priority. If such a
bunch of jobs has highest priority, nearly all of them might start
simultaneously, not allowing the start of jobs of other users.
The situation described above looks like this:

prio / user / job_nr_of_user
500 X 1
500 X 2
500 X 3
500 X 4
500 X 5
500 X 6

400 Z 1
400 Z 2
400 Z 3

300 A 1
300 A 2
300 A 3

What we want to achieve is something like the following list, ie. only
a few jobs of each user are getting the highest possible priority, all
others are reduced in priority, relative to the other users' jobs:

prio / user / job_nr_of_user
500 X 1
450 X 2

400 Z 1
300 A 1

250 X 3
250 X 4
250 X 5
250 X 6

220 Z 2
220 Z 3

200 A 2
200 A 3

Basically this would use the number of jobs per user and weight the
job index (of every user) with a damping function. The questions are:

1) Can this be done with existing parameters (e.g. limits of QOS/Partitions)?
2) If a custom multifactor priority plugin is the method of choice, has
anyone perhaps programmed one themselves?

best regards,
Markus




--
=
Dr. Markus Stöhr
Zentraler Informatikdienst BOKU Wien / TU Wien Wiedner Hauptstraße
8-10
1040 Wien

Tel. +43-1-58801-420754
Fax  +43-1-58801-9420754

Email: markus.sto...@tuwien.ac.at
=


[slurm-dev] cgroups, julia, and threads

2015-09-24 Thread Paul Edmon


We recently made the change in our slurm conf to enabling:

TaskPlugin=task/cgroup

Previously we were using task/none.  We are using this in our cgroup.conf:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=no

The constraining of cores has gone pretty well but I got stumped by this 
one from one of our users.  Any thoughts?


"I am running a parallel optimization program that optimizes a function with
several parameters. I am using Julia to parallelize the evaluation of this
function using parallel map, although I believe as concerns my problem, the
implementation details would be near similar to Python. Essentially, Julia
is started with multiple other worker processes that can be communicated
with via a single "brain" process.

Typically, I run this script with 23 processes on 23 cores, with one
"brain" process and 22 worker processes. The brain uses parallel map to
distribute the function and parameter set to each worker, the worker
evaluates the function and returns the result to the brain. The brain must
wait for the return of all 22 workers before it can propose new parameter
combinations. Within the function itself, I run a command-line Fortran
program (dispatched using Julia's `run` command, which essentially uses the
operating system `fork` and `exec` to spawn a child process) that takes in
some of the parameters and then writes the result to file. This Fortran
program is run in a blocking manner (synchronously), so each Julia worker
idles while the Fortran program runs. When the Fortran program is finished
the Julia script reads back in the result and then returns it to the brain
process. This framework works very well on my local 24 core machine and up
until last week when the cpu permissions were changed on Odyssey, this
worked very well there as well. Prior to this change, the 22 Fortran
programs would naturally spread evenly across 22 cores, utilizing 100% of
each core.

However, after the cpu permissions change on Odyssey, the 22 Fortran
programs sometimes tend to clump onto the same core (examined using htop).
For example, it is now common for at least one core to host up to 3 Fortran
processes, each running at 33%, while several of my otherwise allocated
cores idle at 0%. Most of the other ~15 Fortran processes are assigned one
per core and have 100% utilization. When it comes time to run the Julia
portion of the script, the 0% cores do return to full 100% utilization.
However, because the Fortran program is my dominant time cost, and the
brain must wait for all workers to finish, the 1/3 efficiency for a few
workers means that my optimizer takes three times as long for the full
ensemble. I am unclear why the OS will not reallocate these clumped
processes to run on my otherwise available open cores. Paul was thinking
that this might be related to the way SLURM is allocating threads to cores,
and somehow my forked processes are being prevented from moving to the open
cores. A similar example of what I think is happening is here:
http://stackoverflow.com/questions/10019096/multiple-processes-stuck-on-same-cpu-after-multiple-forks-linux-c

My run script is typically the following:

#!/bin/bash

#SBATCH -p general #partition

#SBATCH -t 78:00:00 #running time

#SBATCH --mem 2 #memory request per node

#SBATCH -N 1 #ensure all jobs are on the same node

#SBATCH -n 23

venus.jl -p 22 -r $SLURM_ARRAY_TASK_ID


Where the venus.jl script is part of my code package here:
https://github.com/iancze/JudithExcalibur/blob/master/scripts/venus.jl

I have tried alternate sbatch submission options,

1. running with way more cores than I require (e.g. -n 30), so that some
are definitely left idle.
2. replacing -n 23 with -c 23
3. using srun and --cpu_bind=cores
4. using srun and --cpu_bind=none

although none have these have prevented the Fortran programs from clumping
on at least a few cores. I would greatly appreciate your help in trying to
figure out how to prevent this behavior under the newly changed core
permissions."
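
(One quick way to see what the kernel thinks each Fortran process is allowed 
to run on, with PID as a placeholder:

   taskset -cp PID
   grep Cpus_allowed_list /proc/PID/status

Comparing that against the cores the job's cgroup actually owns should show 
whether this is an affinity-mask problem or just scheduler placement.)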

Thanks for the help.

-Paul Edmon-


[slurm-dev] Slurmctld Thread Count

2015-07-09 Thread Paul Edmon


So by default the slurmctld thread count is 256.  We were wondering how one 
would go about increasing this (i.e. what variables to change), what 
issues one might expect if we did, and also what the rationale was for 
256 being the thread limit.  We are hoping to improve the responsiveness 
of the scheduler by bumping this up to 1024 if we can, or at least to test 
and see what happens when we do.
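
(If anyone knows off-hand: my recollection, which should be verified against 
the source tree since both the name and location are assumptions on my part, 
is that the cap is a compile-time constant, which you can hunt for with 
something like:

   grep -rn MAX_SERVER_THREADS src/slurmctld/
)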


Thanks.

-Paul Edmon-


[slurm-dev] Re: Slurm integration scripts with Matlab

2015-06-16 Thread Paul Edmon


So the new scripts are in the R2015b prerelease

http://www.mathworks.com/downloads/web_downloads

The ones we have are customized for our site.  Mathworks recommends 
contacting them for assistance with customizing them for your own.


-Paul Edmon-


On 06/10/2015 09:53 AM, Sean McGrath wrote:

Hi,

+1 for running MATLAB on our cluster. Our notes on it are here:
http://www.tchpc.tcd.ie/node/132 in the unlikely event they help.

The module file just sets the following environment variables:
root, MATLAB, MLM_LICENSE_FILE

Regards

Sean McGrath

Trinity Centre for High Performance and Research Computing
https://www.tchpc.tcd.ie/

On Wed, Jun 10, 2015 at 06:36:53AM -0700, John Desantis wrote:


Hadrian,

This reply will most likely not be of any help since it seems as if
you're referring to the MATLAB distributed computing server, but I
wanted to chime in and state that we are running MATLAB on our
cluster.

Users can either submit serial MATLAB jobs (with HPC using array
tasks) or they can use a matlab pool up to either 12 to 16 cores on 1
node for SMP jobs.

Integration scripts aren't needed in this set-up.  All that is
required is a normal submission script.
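
For example, a bare-bones sketch of such a script, with the function name, 
walltime and array range as placeholders:

   #!/bin/bash
   #SBATCH -n 1
   #SBATCH -t 04:00:00
   #SBATCH --array=1-10
   matlab -nodisplay -nosplash -r "my_function($SLURM_ARRAY_TASK_ID); exit"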

John DeSantis


2015-06-09 17:24 GMT-04:00 Hadrian Djohari hx...@case.edu:

Hi,

We are in the process of adapting Slurm as our scheduler and we only found
Matlab-Slurm integration scripts back from 2011 that seem to need some
additional tweaking.

Can someone with a running Matlab on their cluster help provide us with the
information about this or also share the set of m-scripts that would work
with Slurm both on the cluster and remotely?

Thank you,

--
Hadrian Djohari
HPCC Manager
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490


[slurm-dev] Re: Slurm integration scripts with Matlab

2015-06-10 Thread Paul Edmon
Huh.  Let me ask around here and see if we can share what they gave us 
with the community.


-Paul Edmon-

On 6/10/2015 9:04 AM, Hadrian Djohari wrote:

Re: [slurm-dev] Re: Slurm integration scripts with Matlab
Hi Paul and others,

We have contacted Mathworks and they only came back with the 2011 
integration scripts.

It seems they have a newer set now that they gave to Harvard?

Does anyone else currently running Slurm not use Matlab at all?

Hadrian

On Tue, Jun 9, 2015 at 6:45 PM, Paul Edmon ped...@cfa.harvard.edu 
mailto:ped...@cfa.harvard.edu wrote:


We've been working with Mathworks to get this sorted so that DCS
and Matlab can work with Slurm.  I think if you contact them you
should be able to get the scripts they gave us.

-Paul Edmon-

On 6/9/2015 5:24 PM, Hadrian Djohari wrote:

Hi,

We are in the process of adapting Slurm as our scheduler and we
only found Matlab-Slurm integration scripts back from 2011 that
seem to need some additional tweaking.

Can someone with a running Matlab on their cluster help provide
us with the information about this or also share the set of
m-scripts that would work with Slurm both on the cluster and
remotely?

Thank you,

-- 
Hadrian Djohari

HPCC Manager
Case Western Reserve University
(W): 216-368-0395 tel:216-368-0395
(M): 216-798-7490 tel:216-798-7490





--
Hadrian Djohari
HPCC Manager
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490




[slurm-dev] Re: Slurm integration scripts with Matlab

2015-06-09 Thread Paul Edmon
We've been working with Mathworks to get this sorted so that DCS and 
Matlab can work with Slurm.  I think if you contact them you should be 
able to get the scripts they gave us.


-Paul Edmon-

On 6/9/2015 5:24 PM, Hadrian Djohari wrote:

Slurm integration scripts with Matlab
Hi,

We are in the process of adapting Slurm as our scheduler and we only 
found Matlab-Slurm integration scripts back from 2011 that seem to 
need some additional tweaking.


Can someone with a running Matlab on their cluster help provide us 
with the information about this or also share the set of m-scripts 
that would work with Slurm both on the cluster and remotely?


Thank you,

--
Hadrian Djohari
HPCC Manager
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490




[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread Paul Edmon
As to performance concerns: we are running our production cluster, which 
handles 1.5 million jobs per day, with -O0 and gdb flags, and have seen no 
degraded performance.  The reason for the -O0 is that optimization can make 
debugging harder, as it optimizes out some values; at least that's what 
has happened to us a couple of times when trying to debug the scheduler.  
So in reality the optimization did nothing in the first place, and any 
concerns about speed should be alleviated.
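
(And for anyone who actually wants the optimized build back, the changelog's 
own escape hatch applies at build time:

   ./configure --disable-debug
)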


-Paul Edmon-

On 04/24/2015 03:31 PM, Chris Read wrote:

Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available


On Fri, Apr 24, 2015 at 4:41 AM, Janne Blomqvist 
janne.blomqv...@aalto.fi mailto:janne.blomqv...@aalto.fi wrote:


On 2015-04-24 02:03, Moe Jette wrote:


Slurm version 14.11.6 is now available with quite a few bug
fixes as
listed below.

Slurm downloads are available from
http://slurm.schedmd.com/download.html

* Changes in Slurm 14.11.6
==


[snip]

  -- Enable compiling without optimizations and with debugging
symbols by
 default. Disable this by configuring with --disable-debug.


Always including debug symbols is good (the only cost is a little
bit of disk space, should never really be a problem), but
disabling optimization by default?? In our environment, slurmctld
consumes a decent chunk of cpu time, I would loathe to see it
getting a lot (?) slower. 



Typically, problems which are fixed by disabling optimization
are due to violations of the C standard or such which for some
reason just doesn't happen to trigger with -O0. Perhaps I'm being
needlessly harsh here, but I'd prefer if the bugs were fixed
properly rather than being papered over like this.


+1 from me...



  -- Use standard statvfs(2) syscall if available, in
preference to
 non-standard statfs.


This is not actually such a good idea. Prior to Linux kernel
2.6.36 and glibc 2.13, the implementation of statvfs required
checking all entries in /proc/mounts. If any of those other
filesystems are not available (e.g. a hung NFS mount), the statvfs
call would thus hang. See e.g.

http://man7.org/linux/man-pages/man2/statvfs.2.html

Not directly related to this change, there is also a bit of
silliness in the statfs() code for get_tmp_disk(), namely that it
assumes that the fs record size is the same as the memory page
size. As of Linux 2.6 the struct statfs contains a field f_frsize
which contains the correct record size. I suggest the attached
patch which should fix both of these issues.



-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist

Aalto University School of Science, PHYS  NBE
+358503841576 tel:%2B358503841576 || janne.blomqv...@aalto.fi
mailto:janne.blomqv...@aalto.fi






[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread Paul Edmon
Correction: 50,000 jobs per day; the 1.5 million is per month.  My 
bad.  Still, no degradation in performance seen in our environment.


-Paul Edmon-

On 04/24/2015 03:31 PM, Chris Read wrote:

Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available


On Fri, Apr 24, 2015 at 4:41 AM, Janne Blomqvist 
janne.blomqv...@aalto.fi mailto:janne.blomqv...@aalto.fi wrote:


On 2015-04-24 02:03, Moe Jette wrote:


Slurm version 14.11.6 is now available with quite a few bug
fixes as
listed below.

Slurm downloads are available from
http://slurm.schedmd.com/download.html

* Changes in Slurm 14.11.6
==


[snip]

  -- Enable compiling without optimizations and with debugging
symbols by
 default. Disable this by configuring with --disable-debug.


Always including debug symbols is good (the only cost is a little
bit of disk space, should never really be a problem), but
disabling optimization by default?? In our environment, slurmctld
consumes a decent chunk of cpu time, I would loathe to see it
getting a lot (?) slower. 



Typically, problems which are fixed by disabling optimization
are due to violations of the C standard or such which for some
reason just doesn't happen to trigger with -O0. Perhaps I'm being
needlessly harsh here, but I'd prefer if the bugs were fixed
properly rather than being papered over like this.


+1 from me...



  -- Use standard statvfs(2) syscall if available, in
preference to
 non-standard statfs.


This is not actually such a good idea. Prior to Linux kernel
2.6.36 and glibc 2.13, the implementation of statvfs required
checking all entries in /proc/mounts. If any of those other
filesystems are not available (e.g. a hung NFS mount), the statvfs
call would thus hang. See e.g.

http://man7.org/linux/man-pages/man2/statvfs.2.html

Not directly related to this change, there is also a bit of
silliness in the statfs() code for get_tmp_disk(), namely that it
assumes that the fs record size is the same as the memory page
size. As of Linux 2.6 the struct statfs contains a field f_frsize
which contains the correct record size. I suggest the attached
patch which should fix both of these issues.



-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist

Aalto University School of Science, PHYS  NBE
+358503841576 tel:%2B358503841576 || janne.blomqv...@aalto.fi
mailto:janne.blomqv...@aalto.fi






[slurm-dev] Upgrade Rollbacks

2015-04-02 Thread Paul Edmon


I've been curious about this for a bit.  What is the procedure for 
rolling back a minor or major release of Slurm in case something goes 
wrong?


-Paul Edmon-


[slurm-dev] Re: Problems running job

2015-03-31 Thread Paul Edmon


Do you have all the ports open between all the compute nodes as well?  
Since Slurm builds a tree to communicate, all the nodes need to talk to 
every other node on those ports, and to do so without a huge amount of 
latency.  You might also want to try upping your timeouts.
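
The relevant slurm.conf knobs look something like this; the values are 
placeholders to tune for your network:

   MessageTimeout=30
   SlurmdTimeout=600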


-Paul Edmon-

On 03/31/2015 10:28 AM, Jeff Layton wrote:


Chris and David,

Thanks for the help! I'm still trying to find out why the
compute nodes are down or not responding. Any tips
on where to start?

How about open ports? Right now I have 6817 and
6818 open as per my slurm.conf. I also have 22 and 80
open as well as 111, 2049, and 32806. I'm using NFSv4
but don't know if that is causing the problem or not
(I REALLY want to stick to NFSv4).

Thanks!

Jeff


On 31/03/15 07:31, Jeff Layton wrote:


Good afternoon!

Hiya Jeff,

[...]

But it doesn't seem to run. Here is the output of sinfo
and squeue:

[...]

Actually it does appear to get started (at least), but..


[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
      2     debug slurmtes ec2-user CG  0:00     1 ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.

The system logs on the master node (contoller node) don't show too 
much:


Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
_slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.


Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the
transition to the CG state?


Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes 
ip-10-0-2-102

not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes 
ip-10-0-2-101

not responding, setting DOW

Now the nodes stop responding (not before).


 From these logs, it looks like the compute nodes are not
responding to the control node (master node).

Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and check to see what state the processes
are in - especially if they're stuck in a 'D' state waiting on some form
of device I/O.

I know some people have reported strange interactions between Slurm
being on an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris


[slurm-dev] --reboot

2015-03-30 Thread Paul Edmon


So, point of information: my impression is that when you use this flag, 
the node(s) are rebooted (or the defined script is run) and then the job 
runs as normal.  Is this correct?  Because I think in our case it is 
rebooting the nodes, but then the node is marked as closed due to an 
unexpected reboot.  Is there a way to suppress this when the node is 
rebooted by this flag?  Obviously the reboot wasn't unexpected, as Slurm 
was aware of it due to the flag.


-Paul Edmon-


[slurm-dev] Re: slurmctld thread number blowups leading to deadlock in 14.11.4

2015-03-27 Thread Paul Edmon


So we have had the same problem, usually due to the scheduler receiving 
tons of requests.  Usually this is fixed by having the scheduler slow 
itself down using the defer or max_rpc_cnt options.  We in particular 
use max_rpc_cnt=16.  I actually did a test yesterday where I removed 
this and let it go without defer, as I was hoping that the newer version 
of Slurm would be able to run without it.  In our environment the thread 
count saturated and Slurm started to drop requests all over the place.  
This even caused some node_fail messages, as nodes themselves couldn't 
talk to the master because it was so busy.


It sounds like your problem is similar to ours in that we just have too 
much traffic and demand hitting the master (we run with about 1000 cores 
and 800 users hitting the scheduler).  So I advise using defer or 
max_rpc_cnt.  Max_rpc_cnt is good because when things are less busy the 
scheduler automatically switches out of defer state and starts to 
schedule things more quickly, so it's adaptive.


However, an even better thing would be to have Slurm more intelligently 
auto-throttle itself (of which max_rpc_cnt is a somewhat simple form) 
when it receives a ton of traffic due to completions, user queries 
or just general busyness.  The threads tend to stack up when you have a 
bunch of blocking requests.  sdiag can help you track some of that back, 
but some of it can't be prevented.  So it would be good, at least from the 
developer end, if the number of blocking requests could be reduced.  In 
particular, in our case it would be good if user requests for information 
were answered with an intelligent busy signal and an estimate of 
return time, as our users get a bit punchy if requests for information 
don't return immediately.


Regardless, I would look at defer or max_rpc_cnt.
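
In slurm.conf terms that is roughly one of:

   SchedulerParameters=defer
   SchedulerParameters=max_rpc_cnt=16

(keeping whatever other SchedulerParameters options you already have 
alongside).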

-Paul Edmon-

On 03/27/2015 07:32 AM, Mehdi Denou wrote:

Hi,

Using gdb you can retrieve which thread own the locks on the slurmctld
internal structures (and block all the others).
Then it will be easier to understand what is happening.

Le 27/03/2015 12:24, Stuart Rankin a écrit :

Hi,

I am running slurm 14.11.4 on a 800 node RHEL6.6 general-purpose University 
cluster. Since upgrading
from 14.03.3 we have been seeing the following problem and I'd appreciate any 
advice (maybe it's a
bug but maybe I'm missing something obvious).

Occasionally the number of slurmctld threads starts to rise rapidly until it 
hits the hard coded 256
limit and stays there. The threads are in the futex_ state according to ps and 
logging stops
(nothing out of the ordinary leaps out in the log before this happens). 
Naturally slurm clients then
start failing with timeout messages (which isn't trivial since it is causing 
some not very resilient
user pipelines to fail). This condition has persisted for several hours during 
the night without
being detected. However there is a simple workaround, which is to send a STOP 
signal to slurmctld
process, wait a few seconds, then resume it - this clears the logjam. Merely 
attaching a debugger
has the same effect!

I feel this must be a clue as to the root cause. I have already tried setting 
CommitDelay in
slurmdbd.conf, increasing MessageTimeout, setting HealthCheckNodeState=CYCLE and
decreasing/increasing bf_yield_interval/bf_yield_sleep without any apparent 
impact (please see
slurm.conf attached).

Any advice would be gratefully received.

Many thanks -

Stuart




[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-26 Thread Paul Edmon


Yeah, we use puppet and yum to manage our stack.  Works pretty well and 
scales nicely.


-Paul Edmon-

On 03/26/2015 11:46 AM, Jason Bacon wrote:



+1 for using package managers in general.

On our CentOS clusters, I do the munge and slurm installs using pkgsrc 
(+ pkgsrc-wip).


http://acadix.biz/pkgsrc.php

I use Yum for most system services and for libraries required by 
commercial software, and pkgsrc for all the latest open source software.


I rsync a small pkgsrc tree to the local drive of each node for munge, 
slurm, and a few basic tools, and keep separate, more extensive tree 
on the NFS share for scientific software.


The current slurm package is pretty outdated, but we'll bring it 
up-to-date soon.


Regards,

Jason

On 3/26/15 4:23 AM, Paddy Doyle wrote:

+1 for local installs.

We build the RPMs and put them in a local repo (Scientific Linux 6), 
and so

installing/upgrading via Salt/Puppet/Ansible etc is quite scalable.

It works for us, but of course YMMV.

Paddy

On Tue, Mar 24, 2015 at 06:46:46PM -0700, Paul Edmon wrote:


Yeah, we've been running CentOS 6 and slurm in this fashion for
about a year and a half on about a thousand machines and haven't
really had a problem with this.  Though I don't know if this method
scales indefinitely.  We just have a symlink back to our conf from
/etc/slurm/slurm.conf.  We then control the version via RPM
installs.

-Paul Edmon-

On 3/24/2015 4:22 PM, Jason Bacon wrote:

I ran one of our CentOS clusters this way for about a year and
found it to be more trouble than it was worth.

I recently reconfigured it to run all system services from local
disks so that nodes are as independent of each other as possible.
Assuming you have ssh keys on all the nodes, syncing slurm.conf
and other files is a snap using a simple shell script.  We only
use NFS for data files and user applications at this point.

Of course, if your compute nodes don't have local disks, that's
another story.

Jason

On 03/24/15 14:42, Jeff Layton wrote:

Good afternoon,

I apologies for the newb question but I'm setting up slurm
for the first time in a very long time. I've got a small cluster
of a master node and 4 compute nodes. I'd like to install
slurm on an NFS file system that is exported from the master
node and mounted on the compute nodes. I've been reading
a bit about this but does anyone have recommendations on
what to watch out for?

Thanks!

Jeff





[slurm-dev] Re: slurm on NFS for a cluster - Part II

2015-03-25 Thread Paul Edmon
All of them should be owned by munge.  Furthermore, for security's sake 
I would make them all accessible only to munge, at least the etc one.
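
As a minimal sketch, assuming the stock locations and a munge user and 
group, the ownership and modes usually end up as:

# munged wants its directories and key private to the munge user
chown -R munge:munge /etc/munge /var/lib/munge /var/log/munge /var/run/munge
chmod 700 /etc/munge /var/lib/munge /var/log/munge
chmod 755 /var/run/munge
chmod 400 /etc/munge/munge.key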


-Paul Edmon-

On 03/25/2015 10:29 AM, Jeff Layton wrote:

I assume the same is true for /var/log/munge and /var/run/munge?

How about /etc/munge?

Thanks!

Jeff



Yeah, that folder and the files inside need to be owned by munge.

-Paul Edmon-

On 03/25/2015 09:54 AM, Jeff Layton wrote:

Good morning,

Thanks for all of the advice in regard to slurm on NFS. I've
started on my slurm quest  by installing munge but I'm
having some trouble. I'm not sure  this is the right place
to ask about munge but here goes.

I'm building and installing munge with the following options:

./configure --exec-prefix=/share/ec2-user --prefix= \
--sysconfdir=/etc/ --sharedstatedir=/var/


This allows me to store the local state information in /var
and the configuration information in /etc, but everything
else is stored in /share/ec2-user/ (NFS shared directory).

The build goes fine but when I try to start the munge
service I get the following:

[ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo service munge start
Starting MUNGE: munged (failed).
munged: Error: Failed to check logfile /var/log/munge/munged.log: 
Permission denied



The permissions on /var/log/munge are:

[ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo ls -lstar /var/log/munge
total 8
4 drwxr-xr-x 6 root root 4096 Mar 25 13:39 ..
4 drwx------ 2 root root 4096 Mar 25 13:39 .


I'm not sure but should this directory be owned by user munge?

TIA!

Jeff 








[slurm-dev] Re: slurm on NFS for a cluster - Part II

2015-03-25 Thread Paul Edmon

Yeah, that folder and the files inside need to be owned by munge.

-Paul Edmon-

On 03/25/2015 09:54 AM, Jeff Layton wrote:

Good morning,

Thanks for all of the advice in regard to slurm on NFS. I've
started on my slurm quest  by installing munge but I'm
having some trouble. I'm not sure  this is the right place
to ask about munge but here goes.

I'm building and installing munge with the following options:

./configure --exec-prefix=/share/ec2-user --prefix= \
--sysconfdir=/etc/ --sharedstatedir=/var/


This allows me to store the local state information in /var
and the configuration information in /etc, but everything
else is stored in /share/ec2-user/ (NFS shared directory).

The build goes fine but when I try to start the munge
service I get the following:

[ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo service munge start
Starting MUNGE: munged (failed).
munged: Error: Failed to check logfile /var/log/munge/munged.log: 
Permission denied



The permissions on /var/log/munge are:

[ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo ls -lstar /var/log/munge
total 8
4 drwxr-xr-x 6 root root 4096 Mar 25 13:39 ..
4 drwx------ 2 root root 4096 Mar 25 13:39 .


I'm not sure but should this directory be owned by user munge?

TIA!

Jeff 




[slurm-dev] Re: successful systemd service start on RHEL7?

2015-03-24 Thread Paul Edmon


I have tried building slurm 14.11.4 on CentOS7 but it never quite worked 
right.  I'm not sure if it has been vetted for RHEL7 yet.  I didn't dig 
too deeply though when I did build it as I just figured it wasn't ready 
for RHEL7.


-Paul Edmon-

On 03/24/2015 10:32 AM, Fred Liu wrote:

Hi,


Anyone successfully started systemd service on RHEL7?
I failed like following:

[root@cnlnx03 system]# systemctl start slurmctld
Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 
'journalctl -xn' for details.
[root@cnlnx03 system]# systemctl status slurmctld.service
slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled)
Active: failed (Result: timeout) since Tue 2015-03-24 22:22:46 CST; 4min 
32s ago

Mar 24 22:21:05 cnlnx03 slurmctld[20561]: init_requeue_policy: 
kill_invalid_depend is set to 0
Mar 24 22:21:05 cnlnx03 slurmctld[20561]: Recovered state of 0 reservations
Mar 24 22:21:05 cnlnx03 slurmctld[20561]: read_slurm_conf: backup_controller 
not specified.
Mar 24 22:21:05 cnlnx03 slurmctld[20561]: Running as primary controller
Mar 24 22:22:05 cnlnx03 slurmctld[20561]: 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_tim...pth=0
Mar 24 22:22:45 cnlnx03 systemd[1]: slurmctld.service operation timed out. 
Terminating.
Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Terminate signal (SIGINT or SIGTERM) 
received
Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Saving all slurm state
Mar 24 22:22:46 cnlnx03 systemd[1]: Failed to start Slurm controller daemon.
Mar 24 22:22:46 cnlnx03 systemd[1]: Unit slurmctld.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.
[root@cnlnx03 system]# journalctl -xn
-- Logs begin at Wed 2015-03-11 17:23:37 CST, end at Tue 2015-03-24 22:25:02 
CST. --
Mar 24 22:22:45 cnlnx03 systemd[1]: slurmctld.service operation timed out. 
Terminating.
Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Terminate signal (SIGINT or SIGTERM) 
received
Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Saving all slurm state
Mar 24 22:22:46 cnlnx03 slurmctld[20561]: layouts: all layouts are now unloaded.
Mar 24 22:22:46 cnlnx03 systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel


[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Paul Edmon


Yeah, we've been running CentOS 6 and slurm in this fashion for about a 
year and a half on about a thousand machines and haven't really had a 
problem with this.  Though I don't know if this method scales 
indefinitely.  We just have a symlink back to our conf from 
/etc/slurm/slurm.conf.  We then control the version via RPM installs.
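
A minimal sketch of that symlink arrangement, assuming the shared copy 
lives somewhere like /nfs/slurm:

# run on every node so the packaged location points at the shared config
ln -sf /nfs/slurm/etc/slurm.conf /etc/slurm/slurm.conf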


-Paul Edmon-

On 3/24/2015 4:22 PM, Jason Bacon wrote:



I ran one of our CentOS clusters this way for about a year and found 
it to be more trouble than it was worth.


I recently reconfigured it to run all system services from local disks 
so that nodes are as independent of each other as possible. Assuming 
you have ssh keys on all the nodes, syncing slurm.conf and other files 
is a snap using a simple shell script.  We only use NFS for data files 
and user applications at this point.


Of course, if your compute nodes don't have local disks, that's 
another story.


Jason

On 03/24/15 14:42, Jeff Layton wrote:


Good afternoon,

I apologize for the newb question, but I'm setting up slurm
for the first time in a very long time. I've got a small cluster
of a master node and 4 compute nodes. I'd like to install
slurm on an NFS file system that is exported from the master
node and mounted on the compute nodes. I've been reading
a bit about this but does anyone have recommendations on
what to watch out for?

Thanks!

Jeff


[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Paul Edmon


Yup, that's exactly what we do.  We make sure to export it read only and 
make sure that it is synced and hard mounted.  Not much else to it.
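
As a sketch, with purely illustrative paths and client pattern, the 
relevant pieces look roughly like:

# /etc/exports on the master
/opt/slurm   node*(ro,sync)

# /etc/fstab entry on each compute node
master:/opt/slurm   /opt/slurm   nfs   ro,hard   0 0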


-Paul Edmon-

On 03/24/2015 03:43 PM, Jeff Layton wrote:


Good afternoon,

I apologize for the newb question, but I'm setting up slurm
for the first time in a very long time. I've got a small cluster
of a master node and 4 compute nodes. I'd like to install
slurm on an NFS file system that is exported from the master
node and mounted on the compute nodes. I've been reading
a bit about this but does anyone have recommendations on
what to watch out for?

Thanks!

Jeff


[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Paul Edmon


Interesting.  Yeah we use v3 here.  Hadn't tried out v4, and good thing 
we didn't then.


-Paul Edmon-

On 03/24/2015 04:05 PM, Uwe Sauter wrote:

And if you are planning on using cgroups, don't use NFSv4. There are problems 
that cause the NFS client process to freeze (and
with that freeze the node) when the cgroup removal script is called.

Regards,

Uwe Sauter

Am 24.03.2015 um 20:50 schrieb Paul Edmon:

Yup, that's exactly what we do.  We make sure to export it read only and make 
sure that it is synced and hard mounted.  Not much
else to it.

-Paul Edmon-

On 03/24/2015 03:43 PM, Jeff Layton wrote:

Good afternoon,

I apologize for the newb question, but I'm setting up slurm
for the first time in a very long time. I've got a small cluster
of a master node and 4 compute nodes. I'd like to install
slurm on an NFS file system that is exported from the master
node and mounted on the compute nodes. I've been reading
a bit about this but does anyone have recommendations on
what to watch out for?

Thanks!

Jeff


[slurm-dev] Re: SlurmDBD Archiving

2015-03-10 Thread Paul Edmon


Oh and for the record we are running 14.11.4

-Paul Edmon-

On 03/10/2015 09:26 AM, Paul Edmon wrote:


So when I tried to do an archive dump I got the following error. What 
does this mean?


[root@holy-slurm01 slurm]# sacctmgr -i archive dump
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

It also caused the slurmdbd to crash so I had to restart it.  Here is 
the log:


Mar 10 08:18:21 holy-slurm01 slurmctld[20144]: slurmdbd: agent queue 
size 50
Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: error: mysql_query 
failed: 1205 Lock wait timeout exceeded; try restarting transaction
Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: fatal: mysql gave 
ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart 
the calling program
Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: error: slurmdbd: 
Getting response to message type 1407
Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: slurmdbd: reopening 
connection
Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: 
Sending message type 1472: 11: Connection refused
Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: 
DBD_SEND_MULT_JOB_START failure: Connection refused


It did manage to dump:

[root@holy-slurm01 archive]# ls -ltr
total 160260
-rw------- 1 slurm slurm_users 163847476 Mar 10 09:16 
odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59
-rw------- 1 slurm slurm_users    253639 Mar 10 09:17 
odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59


Is it safe to try again?

-Paul Edmon-

On 03/06/2015 03:07 PM, Paul Edmon wrote:


Ah, okay, that was the command I was looking for.  I wasn't sure how 
to force it.  Thanks.


-Paul Edmon-

On 03/06/2015 01:43 PM, Danny Auble wrote:


It looks like I might stand corrected though.  It looks like you 
will have to wait for the month to go by before the purge starts.


With a lot of jobs it may take a while depending on the speed of 
your disk and such.  If you have the debugflag=DB_USAGE you should 
see the sql statements go by.  This will be quite verbose, so it 
might not be that great of an idea.  You can edit the slurmdbd.conf 
file with the flag and run sacctmgr reconfig to add or remove the 
flag.


You can force a purge which in turn will force an archive with

sacctmgr archive dump

see http://slurm.schedmd.com/sacctmgr.html

Danny

On 03/06/2015 10:12 AM, Paul Edmon wrote:


How long does that typically take?  Because I have done it on our 
large job database and I have seen nothing yet.


-Paul Edmon

On 03/06/2015 01:07 PM, Danny Auble wrote:


If you had older than 6 month data I would expect it to purge on 
restart of the slurmdbd.  You will see a message in the log when 
the archive file is created.


On 03/06/2015 09:58 AM, Paul Edmon wrote:


Okay, that's what I suspected.  We set it to 6 months. So I guess 
then the purge will happen on April 1st.


-Paul Edmon-

On 03/06/2015 12:33 PM, Danny Auble wrote:


Paul, do you have Purge* set up in the slurmdbd.conf? Archiving 
takes place during the Purge process.  If no Purge values are 
set archiving will never take place since nothing is ever 
purged. When Purge values are set the related archivings take 
place on an hourly, daily, and monthly basis depending on the 
units your purge values are set to.


If PurgeJobs=2months the archive would take place at the 
beginning of each month.  If it were set to 2hours it would 
happen each hour. The purge will also happen on slurmdbd startup 
as well if running things for the first time.
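
For anyone setting this up from scratch, a minimal slurmdbd.conf sketch 
along these lines (the retention values and archive path are just examples):

PurgeJobAfter=6months
PurgeStepAfter=6months
PurgeEventAfter=6months
PurgeSuspendAfter=6months
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
ArchiveSuspend=yes
ArchiveDir=/var/spool/slurm/archive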


Danny


On 03/06/2015 08:20 AM, Paul Edmon wrote:


So we recently turned this on to archive jobs older than 6 
months. However when we restarted slurmdbd nothing happened, at 
least no file was deposited at the specified archive location. 
Is there a way to force it to purge? When is the archiving 
scheduled to be done? We definitely have jobs older than 6 
months in the database, I'm just curious about the schedule of 
when the archiving is done.


-Paul Edmon-


[slurm-dev] Re: SlurmDBD Archiving

2015-03-10 Thread Paul Edmon


So when I tried to do an archive dump I got the following error. What 
does this mean?


[root@holy-slurm01 slurm]# sacctmgr -i archive dump
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

It also caused the slurmdbd to crash so I had to restart it.  Here is 
the log:


Mar 10 08:18:21 holy-slurm01 slurmctld[20144]: slurmdbd: agent queue size 50
Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: error: mysql_query failed: 
1205 Lock wait timeout exceeded; try restarting transaction
Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: fatal: mysql gave 
ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart 
the calling program
Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: error: slurmdbd: Getting 
response to message type 1407
Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: slurmdbd: reopening 
connection
Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: Sending 
message type 1472: 11: Connection refused
Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: 
DBD_SEND_MULT_JOB_START failure: Connection refused


It did manage to dump:

[root@holy-slurm01 archive]# ls -ltr
total 160260
-rw------- 1 slurm slurm_users 163847476 Mar 10 09:16 
odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59
-rw------- 1 slurm slurm_users    253639 Mar 10 09:17 
odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59


Is it safe to try again?

-Paul Edmon-

On 03/06/2015 03:07 PM, Paul Edmon wrote:


Ah, okay, that was the command I was looking for.  I wasn't sure how 
to force it.  Thanks.


-Paul Edmon-

On 03/06/2015 01:43 PM, Danny Auble wrote:


It looks like I might stand corrected though.  It looks like you will 
have to wait for the month to go by before the purge starts.


With a lot of jobs it may take a while depending on the speed of your 
disk and such.  If you have the debugflag=DB_USAGE you should see the 
sql statements go by.  This will be quite verbose, so it might not be 
that great of an idea.  You can edit the slurmdbd.conf file with the 
flag and run sacctmgr reconfig to add or remove the flag.


You can force a purge which in turn will force an archive with

sacctmgr archive dump

see http://slurm.schedmd.com/sacctmgr.html

Danny

On 03/06/2015 10:12 AM, Paul Edmon wrote:


How long does that typically take?  Because I have done it on our 
large job database and I have seen nothing yet.


-Paul Edmon

On 03/06/2015 01:07 PM, Danny Auble wrote:


If you had older than 6 month data I would expect it to purge on 
restart of the slurmdbd.  You will see a message in the log when 
the archive file is created.


On 03/06/2015 09:58 AM, Paul Edmon wrote:


Okay, that's what I suspected.  We set it to 6 months.  So I guess 
then the purge will happen on April 1st.


-Paul Edmon-

On 03/06/2015 12:33 PM, Danny Auble wrote:


Paul, do you have Purge* set up in the slurmdbd.conf? Archiving 
takes place during the Purge process.  If no Purge values are set 
archiving will never take place since nothing is ever purged. 
When Purge values are set the related archivings take place on an 
hourly, daily, and monthly basis depending on the units your 
purge values are set to.


If PurgeJobs=2months the archive would take place at the 
beginning of each month.  If it were set to 2hours it would 
happen each hour. The purge will also happen on slurmdbd startup 
as well if running things for the first time.


Danny


On 03/06/2015 08:20 AM, Paul Edmon wrote:


So we recently turned this on to archive jobs older than 6 
months. However when we restarted slurmdbd nothing happened, at 
least no file was deposited at the specified archive location. 
Is there a way to force it to purge? When is the archiving 
scheduled to be done? We definitely have jobs older than 6 
months in the database, I'm just curious about the schedule of 
when the archiving is done.


-Paul Edmon-


[slurm-dev] Re: SlurmDBD Archiving

2015-03-10 Thread Paul Edmon
Ok, good to know.  We've never purged this database and it has well over 
34 million jobs in it.  I was hoping to do so in a controlled way, but I 
guess I will have to wait for the 1st.


Out of curiosity, is there a way to change which day of the month it 
does the archive on?  Could that be added as a feature in a future release?


-Paul Edmon-

On 03/10/2015 11:18 AM, Danny Auble wrote:
The fatal error you received means your query lasted more than 15 minutes; 
mysql deemed it hung and aborted. You can increase the timeout for 
innodb_lock_wait_timeout in your my.cnf and try again, but that 
generally isn't a good idea. You can safely try again as many times as 
you would like and perhaps it will finish or just wait for the normal 
purge.
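
(If you do go that route, it is a my.cnf setting on the database host, e.g.

[mysqld]
innodb_lock_wait_timeout = 900

followed by a mysqld restart; 900 seconds here is just an example value.)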


This timeout should only occur when running the archive dump with 
sacctmgr, under normal purge with the dbd this wouldn't happen.


On March 10, 2015 6:26:19 AM PDT, Paul Edmon ped...@cfa.harvard.edu 
wrote:


So when I tried to do an archive dump I got the following error. What
does this mean?

[root@holy-slurm01 slurm]# sacctmgr -i archive dump
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
   Problem dumping archive: Unspecified error

It also caused the slurmdbd to crash so I had to restart it.  Here is
the log:

Mar 10 08:18:21 holy-slurm01 slurmctld[20144]: slurmdbd: agent queue size 50
Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: error: mysql_query failed:
1205 Lock wait timeout exceeded; try restarting transaction
Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: fatal: mysql gave
ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart
the calling program
Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: error: slurmdbd: Getting
response to message type 1407
Mar 10
09:20:57 holy-slurm01 slurmctld[20144]: slurmdbd: reopening
connection
Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: Sending
message type 1472: 11: Connection refused
Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd:
DBD_SEND_MULT_JOB_START failure: Connection refused

It did manage to dump:

[root@holy-slurm01 archive]# ls -ltr
total 160260
-rw------- 1 slurm slurm_users 163847476 Mar 10 09:16
odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59
-rw------- 1 slurm slurm_users    253639 Mar 10 09:17
odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59

Is it safe to try again?

-Paul Edmon-

On 03/06/2015 03:07 PM, Paul Edmon wrote:

Ah, okay, that was the command I was looking for. I wasn't
sure how to force it. Thanks. -Paul Edmon- On 03/06/2015 01:43
PM, Danny Auble wrote:

It looks like I might stand corrected though. It looks
like you will have to wait for the month to go by before
the purge starts. With a lot of jobs it may take a while
depending on the speed of your disk and such. If you have
the debugflag=DB_USAGE you should see the sql statements
go by. This will be quite verbose, so it might not be that
great of an idea. You can edit the slurmdbd.conf file with
the flag and run sacctmgr reconfig to add or remove the
flag. You can force a purge which in turn will force an
archive with sacctmgr archive dump see
http://slurm.schedmd.com/sacctmgr.html Danny On 03/06/2015
10:12 AM, Paul Edmon wrote:

How long does that typically take? Because I have done
it on our large job database and I have seen nothing
yet. -Paul Edmon On 03/06/2015 01:07 PM, Danny Auble
wrote:

If you had older than 6 month data I would expect
it to purge on restart of the slurmdbd. You will
see a message in the log when the archive file is
created. On 03/06/2015 09:58 AM, Paul Edmon wrote:

Okay, that's what I suspected. We set it to 6
months. So I guess then the purge will happen
on April 1st. -Paul Edmon- On 03/06/2015 12:33
PM, Danny Auble wrote:

Paul, do you have Purge* set up in the
slurmdbd.conf? Archiving takes place
during the Purge process. If no Purge
values are set archiving will never take
place since nothing is ever purged. When
Purge values are set the related
archivings take place on an hourly, daily,
and monthly basis depending on the units
your purge values are set

[slurm-dev] Re: SlurmDBD Archiving

2015-03-06 Thread Paul Edmon


Okay, that's what I suspected.  We set it to 6 months.  So I guess then 
the purge will happen on April 1st.


-Paul Edmon-

On 03/06/2015 12:33 PM, Danny Auble wrote:


Paul, do you have Purge* set up in the slurmdbd.conf?  Archiving takes 
place during the Purge process.  If no Purge values are set archiving 
will never take place since nothing is ever purged. When Purge values 
are set the related archivings take place on an hourly, daily, and 
monthly basis depending on the units your purge values are set to.


If PurgeJobs=2months the archive would take place at the beginning of 
each month.  If it were set to 2hours it would happen each hour. The 
purge will also happen on slurmdbd startup as well if running things 
for the first time.


Danny


On 03/06/2015 08:20 AM, Paul Edmon wrote:


So we recently turned this on to archive jobs older than 6 months. 
However when we restarted slurmdbd nothing happened, at least no file 
was deposited at the specified archive location. Is there a way to 
force it to purge? When is the archiving scheduled to be done?  We 
definitely have jobs older than 6 months in the database, I'm just 
curious about the schedule of when the archiving is done.


-Paul Edmon-


[slurm-dev] Re: SlurmDBD Archiving

2015-03-06 Thread Paul Edmon


Ah, okay, that was the command I was looking for.  I wasn't sure how to 
force it.  Thanks.


-Paul Edmon-

On 03/06/2015 01:43 PM, Danny Auble wrote:


It looks like I might stand corrected though.  It looks like you will 
have to wait for the month to go by before the purge starts.


With a lot of jobs it may take a while depending on the speed of your 
disk and such.  If you have the debugflag=DB_USAGE you should see the 
sql statements go by.  This will be quite verbose, so it might not be 
that great of an idea.  You can edit the slurmdbd.conf file with the 
flag and run sacctmgr reconfig to add or remove the flag.


You can force a purge which in turn will force an archive with

sacctmgr archive dump

see http://slurm.schedmd.com/sacctmgr.html

Danny

On 03/06/2015 10:12 AM, Paul Edmon wrote:


How long does that typically take?  Because I have done it on our 
large job database and I have seen nothing yet.


-Paul Edmon

On 03/06/2015 01:07 PM, Danny Auble wrote:


If you had older than 6 month data I would expect it to purge on 
restart of the slurmdbd.  You will see a message in the log when the 
archive file is created.


On 03/06/2015 09:58 AM, Paul Edmon wrote:


Okay, that's what I suspected.  We set it to 6 months.  So I guess 
then the purge will happen on April 1st.


-Paul Edmon-

On 03/06/2015 12:33 PM, Danny Auble wrote:


Paul, do you have Purge* set up in the slurmdbd.conf? Archiving 
takes place during the Purge process.  If no Purge values are set 
archiving will never take place since nothing is ever purged. When 
Purge values are set the related archivings take place on an 
hourly, daily, and monthly basis depending on the units your purge 
values are set to.


If PurgeJobs=2months the archive would take place at the beginning 
of each month.  If it were set to 2hours it would happen each 
hour. The purge will also happen on slurmdbd startup as well if 
running things for the first time.


Danny


On 03/06/2015 08:20 AM, Paul Edmon wrote:


So we recently turned this on to archive jobs older than 6 
months. However when we restarted slurmdbd nothing happened, at 
least no file was deposited at the specified archive location. Is 
there a way to force it to purge? When is the archiving scheduled 
to be done? We definitely have jobs older than 6 months in the 
database, I'm just curious about the schedule of when the 
archiving is done.


-Paul Edmon-


[slurm-dev] Requeue Exit

2015-03-03 Thread Paul Edmon
So what are the default values for these two options?  We recently 
updated to 14.11 and jobs that previously would have just requeued due 
to node failure are now going into a held state.


*RequeueExit*
   Enables automatic job requeue for jobs which exit with the specified
   values. Separate multiple exit code by a comma. Jobs will be put
   back in to pending state and later scheduled again. Restarted jobs
   will have the environment variable *SLURM_RESTART_COUNT* set to the
   number of times the job has been restarted.

*RequeueExitHold*
   Enables automatic requeue of jobs into pending state in hold,
   meaning their priority is zero. Separate multiple exit code by a
   comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state.
   Restarted jobs will have the environment variable
   *SLURM_RESTART_COUNT* set to the number of times the job has been
   restarted. 


-Paul Edmon-



[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


Basically the node cuts out due to hardware issues and the job is 
requeued.  I'm just trying to figure out why it sent them into a held 
state as opposed to just simply requeueing as normal. Thoughts?


-Paul Edmon-

On 03/03/2015 12:11 PM, David Bigagli wrote:


There are no default values for these parameters; you have to 
configure your own. In your case, does the prolog fail or does the node 
change state while the jobs are running?


On 03/03/2015 08:30 AM, Paul Edmon wrote:

So what are the default values for these two options?  We recently
updated to 14.11 and jobs that previously would have just requeued due
to node failure are now going into a held state.

*RequeueExit*
Enables automatic job requeue for jobs which exit with the specified
values. Separate multiple exit code by a comma. Jobs will be put
back in to pending state and later scheduled again. Restarted jobs
will have the environment variable *SLURM_RESTART_COUNT* set to the
number of times the job has been restarted.

*RequeueExitHold*
Enables automatic requeue of jobs into pending state in hold,
meaning their priority is zero. Separate multiple exit code by a
comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state.
Restarted jobs will have the environment variable
*SLURM_RESTART_COUNT* set to the number of times the job has been
restarted.

-Paul Edmon-





[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


We are definitely using the default for that one.  So it should be 
requeueing just fine.


-Paul Edmon-

On 03/03/2015 01:05 PM, Lipari, Don wrote:

It looks like the governing config parameter would be:

JobRequeue
 This option controls what to do by default after a node failure.  If 
JobRequeue is
 set to a value of 1, then any batch job running on the failed node will be 
requeued
 for execution on different nodes.  If JobRequeue is set to a value of 0, 
then any
 job running on the failed node will be terminated.  Use the sbatch 
--no-requeue or
 --requeue option to change the default behavior for individual jobs.  The 
default
 value is 1.

According to this, the job should be requeued and not held.


-Original Message-
From: Paul Edmon [mailto:ped...@cfa.harvard.edu]
Sent: Tuesday, March 03, 2015 9:14 AM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue Exit


Basically the node cuts out due to hardware issues and the job is
requeued.  I'm just trying to figure out why it sent them into a held
state as opposed to just simply requeueing as normal. Thoughts?

-Paul Edmon-

On 03/03/2015 12:11 PM, David Bigagli wrote:

There are no default values for these parameters; you have to
configure your own. In your case, does the prolog fail or does the node
change state while the jobs are running?

On 03/03/2015 08:30 AM, Paul Edmon wrote:

So what are the default values for these two options?  We recently
updated to 14.11 and jobs that previously would have just requeued

due

to node failure are now going into a held state.

*RequeueExit*
 Enables automatic job requeue for jobs which exit with the

specified

 values. Separate multiple exit code by a comma. Jobs will be

put

 back in to pending state and later scheduled again. Restarted

jobs

 will have the environment variable *SLURM_RESTART_COUNT* set to

the

 number of times the job has been restarted.

*RequeueExitHold*
 Enables automatic requeue of jobs into pending state in hold,
 meaning their priority is zero. Separate multiple exit code by

a

 comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state.
 Restarted jobs will have the environment variable
 *SLURM_RESTART_COUNT* set to the number of times the job has

been

 restarted.

-Paul Edmon-



[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


In this case the node was in a funny state where it couldn't resolve 
user IDs.  So right after the job tried to launch, it failed and was 
requeued.  We just let the scheduler do what it will when it reports 
NODE_FAIL.


-Paul Edmon-


On 03/03/2015 01:20 PM, David Bigagli wrote:


How do you set your node down? If I run a job and then issue

'scontrol update node=prometeo state=down'

the job is requeued in pend. Do you have an epilog?

On 03/03/2015 10:12 AM, Paul Edmon wrote:


We are definitely using the default for that one.  So it should be
requeueing just fine.

-Paul Edmon-

On 03/03/2015 01:05 PM, Lipari, Don wrote:

It looks like the governing config parameter would be:

JobRequeue
 This option controls what to do by default after a node failure.
If JobRequeue is
 set to a value of 1, then any batch job running on the failed
node will be requeued
 for execution on different nodes.  If JobRequeue is set to a
value of 0, then any
 job running on the failed node will be terminated.  Use the
sbatch --no-requeue or
 --requeue option to change the default behavior for individual
jobs.  The default
 value is 1.

According to this, the job should be requeued and not held.


-Original Message-
From: Paul Edmon [mailto:ped...@cfa.harvard.edu]
Sent: Tuesday, March 03, 2015 9:14 AM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue Exit


Basically the node cuts out due to hardware issues and the job is
requeued.  I'm just trying to figure out why it sent them into a held
state as opposed to just simply requeueing as normal. Thoughts?

-Paul Edmon-

On 03/03/2015 12:11 PM, David Bigagli wrote:

There are no default values for these parameters; you have to
configure your own. In your case, does the prolog fail or does the node
change state while the jobs are running?

On 03/03/2015 08:30 AM, Paul Edmon wrote:

So what are the default values for these two options?  We recently
updated to 14.11 and jobs that previously would have just requeued

due

to node failure are now going into a held state.

*RequeueExit*
 Enables automatic job requeue for jobs which exit with the

specified

 values. Separate multiple exit code by a comma. Jobs will be

put

 back in to pending state and later scheduled again. Restarted

jobs

 will have the environment variable *SLURM_RESTART_COUNT* set to

the

 number of times the job has been restarted.

*RequeueExitHold*
 Enables automatic requeue of jobs into pending state in hold,
 meaning their priority is zero. Separate multiple exit code by

a

 comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state.
 Restarted jobs will have the environment variable
 *SLURM_RESTART_COUNT* set to the number of times the job has

been

 restarted.

-Paul Edmon-





[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


Ah, good to know.  I do prefer that behavior, just didn't expect it.  
Thanks.


-Paul Edmon-

On 03/03/2015 02:00 PM, David Bigagli wrote:



Ah, ok. When the job fails to launch, as in this case, Slurm requeues the 
job in the held state; the previous behaviour was to terminate the job.

The reason for this is to avoid the job dispatch failing over and over.
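
For anyone who hits this, jobs parked in that special-exit/held state can 
be listed and released once the underlying node problem is fixed; a sketch 
(the job id is made up):

squeue --states=SPECIAL_EXIT
scontrol release 12345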

On 03/03/2015 10:53 AM, Paul Edmon wrote:


In this case the node was in a funny state where it couldn't resolve
user IDs.  So right after the job tried to launch, it failed and was
requeued.  We just let the scheduler do what it will when it reports
NODE_FAIL.

-Paul Edmon-


On 03/03/2015 01:20 PM, David Bigagli wrote:


How do you set your node down? If I run a job and then issue

'scontrol update node=prometeo state=down'

the job is requeued in pend. Do you have an epilog?

On 03/03/2015 10:12 AM, Paul Edmon wrote:


We are definitely using the default for that one.  So it should be
requeueing just fine.

-Paul Edmon-

On 03/03/2015 01:05 PM, Lipari, Don wrote:

It looks like the governing config parameter would be:

JobRequeue
 This option controls what to do by default after a node failure.
If JobRequeue is
 set to a value of 1, then any batch job running on the failed
node will be requeued
 for execution on different nodes.  If JobRequeue is set to a
value of 0, then any
 job running on the failed node will be terminated. Use the
sbatch --no-requeue or
 --requeue option to change the default behavior for individual
jobs.  The default
 value is 1.

According to this, the job should be requeued and not held.


-Original Message-
From: Paul Edmon [mailto:ped...@cfa.harvard.edu]
Sent: Tuesday, March 03, 2015 9:14 AM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue Exit


Basically the node cuts out due to hardware issues and the job is
requeued.  I'm just trying to figure out why it sent them into a 
held

state as opposed to just simply requeueing as normal. Thoughts?

-Paul Edmon-

On 03/03/2015 12:11 PM, David Bigagli wrote:

There are no default values for these parameters; you have to
configure your own. In your case, does the prolog fail or does the node
change state while the jobs are running?

On 03/03/2015 08:30 AM, Paul Edmon wrote:

So what are the default values for these two options?  We recently
updated to 14.11 and jobs that previously would have just requeued

due

to node failure are now going into a held state.

*RequeueExit*
 Enables automatic job requeue for jobs which exit with the

specified

 values. Separate multiple exit code by a comma. Jobs will be

put

 back in to pending state and later scheduled again. Restarted

jobs
 will have the environment variable *SLURM_RESTART_COUNT* 
set to

the

 number of times the job has been restarted.

*RequeueExitHold*
 Enables automatic requeue of jobs into pending state in hold,
 meaning their priority is zero. Separate multiple exit 
code by

a
 comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit 
state.

 Restarted jobs will have the environment variable
 *SLURM_RESTART_COUNT* set to the number of times the job has

been

 restarted.

-Paul Edmon-







[slurm-dev] Re: slurm.conf consistent across all nodes

2015-02-02 Thread Paul Edmon

You define blocks of nodes that have different configurations.  Such as:

# Node1
NodeName=nodes[1-9],nodes[0-6] \
CPUs=64 RealMemory=258302 Sockets=4 CoresPerSocket=16 \
ThreadsPerCore=1 Feature=amd

# Node2
NodeName=gpu0[1-8] \
CPUs=32 RealMemory=129013 Sockets=2 CoresPerSocket=16 \
ThreadsPerCore=1 Feature=intel Gres=gpu:2

-Paul Edmon-

On 2/2/2015 1:09 PM, Bruce Roberts wrote:
Yes.  All nodes and their resources need to be defined in the 
slurm.conf on each node, not a different .conf on each node.


On 02/02/2015 10:04 AM, Slurm User wrote:

slurm.conf consistent across all nodes
Hello

The documentation says that slurm.conf must be consistent across 
all nodes.


How do we handle parameters that are different on a per node basis?

For example, RealMemory may be X on node 1, but Y on node 2.

Am I missing something?






[slurm-dev] Re: slurm.conf consistent across all nodes

2015-02-02 Thread Paul Edmon
Yeah, that's a good way to get started on a conf, but following the man 
page is the next step.


-Paul Edmon-

On 2/2/2015 1:29 PM, Slurm User wrote:

Re: [slurm-dev] Re: slurm.conf consistent across all nodes
Ian, Paul

Thanks for your replies, that makes sense!!!

I was using the configurator.html and it didn't allow multiple 
definitions.




On Mon, Feb 2, 2015 at 10:23 AM, Paul Edmon ped...@cfa.harvard.edu 
mailto:ped...@cfa.harvard.edu wrote:


You define blocks of nodes that have different configurations.  Such as:

# Node1
NodeName=nodes[1-9],nodes[0-6] \
CPUs=64 RealMemory=258302 Sockets=4 CoresPerSocket=16 \
ThreadsPerCore=1 Feature=amd

# Node2
NodeName=gpu0[1-8] \
CPUs=32 RealMemory=129013 Sockets=2 CoresPerSocket=16 \
ThreadsPerCore=1 Feature=intel Gres=gpu:2

-Paul Edmon-


On 2/2/2015 1:09 PM, Bruce Roberts wrote:

Yes.  All nodes and their resources need to be defined in the
slurm.conf on each node, not a different .conf on each node.

On 02/02/2015 10:04 AM, Slurm User wrote:

Hello

The documentation says that slurm.conf must be consistent
across all nodes.

How do we handle parameters that are different on a per node basis?

For example, RealMemory may be X on node 1, but Y on node 2.

Am I missing something?









[slurm-dev] Re: Partition for unused resources until needed by any other partition

2014-10-20 Thread Paul Edmon


Yes, we have this set up here.  Here is an example:

# Serial Requeue
PartitionName=serial_requeue Priority=1 \
PreemptMode=REQUEUE MaxTime=7-0 Default=YES MaxNodes=1 \
AllowGroups=cluster_users \
Nodes=blah


# Priority
PartitionName=priority Priority=10 \
AllowGroups=important_people \
Nodes=blah


# JOB PREEMPTION
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

Since serial_requeue is the lowest priority it gets scheduled last and 
if any jobs come in from the higher priority queue it requeues the lower 
priority jobs.


-Paul Edmon-


On 10/20/2014 02:03 PM, je...@schedmd.com wrote:


This should help:
http://slurm.schedmd.com/preempt.html


Quoting Mikael Johansson mikael.johans...@iki.fi:


Hello All,

I've been scratching my head for a while now trying to figure this 
one out, which I would think would be a rather common setup.


I would need to set up a partition (or whatever, maybe a partition is 
actually not the way to go) with the following properties:


1. If there are any unused cores on the cluster, jobs submitted to this
   one would use them, and immediately have access to them.

2. The jobs should only use these resources until _any_ other job in
   another partition needs them. In this case, the jobs should be
   preempted and requeued.

So this should be some sort of shadow queue/partition, that 
shouldn't affect the scheduling of other jobs on the cluster, but 
just use up any free resources that momentarily happen to be 
available. So SLURM should just continue scheduling everything else 
normally, and treat the cores used by this shadow queue as free 
resources, and then just immediately cancel and requeue any jobs 
there, when a real job starts.


If anyone has something like this set up, example configs would be 
very welcome, as of course all other suggestions and ideas.



Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/





[slurm-dev] Re: Partition for unused resources until needed by any other partition

2014-10-20 Thread Paul Edmon


I advise setting the following SchedulerParameters: partition_job_depth 
and bf_max_job_part.  These force the scheduler to schedule jobs for 
each partition; otherwise it takes a strictly top-down approach.



This is what we run:

# default_queue_depth should be some multiple of the partition_job_depth,
# ideally number_of_partitions * partition_job_depth.
SchedulerParameters=default_queue_depth=5700,partition_job_depth=100,bf_interval=1,bf_continue,bf_window=2880,bf_resolution=3600,bf_max_job_test=5,bf_max_job_part=5,bf_max_job_user=1,bf_max_job_start=100,max_rpc_cnt=8

These parameters work well for a cluster of 50,000 cores, 57 queues, and 
about 40,000 jobs per day.  We are running 14.03.8


-Paul Edmon-

On 10/20/2014 02:19 PM, Mikael Johansson wrote:



Hello,

Yeah, I looked at that, and have now four partitions defined like this:

PartitionName=short    Nodes=node[005-026] Default=YES MaxNodes=6 
MaxTime=02:00:00  AllowGroups=ALL Priority=2 DisableRootJobs=NO 
RootOnly=NO Hidden=NO Shared=no PreemptMode=off
PartitionName=medium   Nodes=node[009-026] Default=NO  MaxNodes=4 
MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO 
RootOnly=NO Hidden=NO Shared=no PreemptMode=off
PartitionName=long Nodes=node[001-004] Default=NO  MaxNodes=4 
MaxTime=744:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO 
RootOnly=NO Hidden=NO Shared=no PreemptMode=off
PartitionName=backfill Nodes=node[001-026] Default=NO MaxNodes=10 
MaxTime=168:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO 
RootOnly=NO Hidden=NO Shared=no PreemptMode=requeue



And I've set up:

PreemptType=preempt/partition_prio
PreemptMode=requeue
PriorityType=priority/multifactor
PriorityWeightFairshare=1
PriorityWeightAge=2000
SelectType=select/cons_res


It works insofar as when a job in backfill gets running, it will be 
requeued when a job in one of the other partitions starts.


The problem is that there are plenty of free cores on the cluster that 
don't get assigned jobs from backfill. If I understand things 
correctly, this is because there are jobs with a higher priority 
queueing in the other partitions.


So I would maybe need a mechanism that increases the priority of the 
backfill jobs while queueing, but then immediately decreases it when 
the jobs start?



Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/


On Mon, 20 Oct 2014, je...@schedmd.com wrote:



This should help:
http://slurm.schedmd.com/preempt.html


Quoting Mikael Johansson mikael.johans...@iki.fi:


Hello All,

I've been scratching my head for a while now trying to figure this 
one out, which I would think would be a rather common setup.


I would need to set up a partition (or whatever, maybe a partition 
is actually not the way to go) with the following properties:


1. If there are any unused cores on the cluster, jobs submitted to this
   one would use them, and immediately have access to them.

2. The jobs should only use these resources until _any_ other job in
   another partition needs them. In this case, the jobs should be
   preempted and requeued.

So this should be some sort of shadow queue/partition, that 
shouldn't affect the scheduling of other jobs on the cluster, but 
just use up any free resources that momentarily happen to be 
available. So SLURM should just continue scheduling everything else 
normally, and treat the cores used by this shadow queue as free 
resources, and then just immediately cancel and requeue any jobs 
there, when a real job starts.


If anyone has something like this set up, example configs would be 
very welcome, as of course all other suggestions and ideas.



Cheers,
Mikael J.
http://www.iki.fi/~mpjohans/



--
Morris Moe Jette
CTO, SchedMD LLC



[slurm-dev] 14.03 FlexLM

2014-07-02 Thread Paul Edmon


If memory serves I thought that 14.03 was supposed to support hooking 
into FlexLM licensing.  However, I can't find any documentation on 
that.  Was that pushed off to a future release?


-Paul Edmon-


[slurm-dev] Job Array sacct feature request

2014-06-05 Thread Paul Edmon


I don't know if this has been done in the newer versions of slurm but it 
would be good to have sacct be able to list both the JobID and the index 
of the Job Array if it is a job array.


Thanks.

-Paul Edmon-


[slurm-dev] Re: QoS Feature Requests

2014-05-21 Thread Paul Edmon


Thanks for the info.  The spillover stuff would be handy, but I can 
definitely see the difficulties with coding it.  Though a similar 
mechanism exists for partitions, where you can list multiple partitions 
and the job will execute on the one that will start first.  Could 
this be imported into QoS?
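
(For reference, the partition-level mechanism mentioned above is just the 
comma-separated partition list at submission time, with made-up partition 
names:

sbatch --partition=general,backfill job.sh

and the job runs in whichever listed partition can start it first.)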


-Paul Edmon-

On 5/21/2014 6:52 PM, je...@schedmd.com wrote:


Quoting Paul Edmon ped...@cfa.harvard.edu:

We have just started using QoS here and I was curious about a few 
features which would make our lives easier.


1. Spillover/overflow: Essentially if you use up one QoS you would 
spill over into your next lower priority QoS.  For instance, if you 
used up your group's QoS but still had jobs and there were idle cycles, 
your jobs that were pending for your high priority QoS would go to 
the low priority normal QoS.


There isn't a great way to do this today. Each job is associated with 
a single QOS.


One possibility would be to submit one job to each QOS and then 
whichever job started first would kill the others. A job submit plugin 
could probably handle the multiple submissions (e.g. if the --qos 
option has multiple comma-separated names, then submit one job for 
each QOS). Offhand I'm not sure what would be a good way to identify 
and purge the extra jobs. Some variation of the --depend=singleton 
logic would probably do the trick.
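
A rough sketch of the two-submission idea, with made-up QOS and job names, 
and with the cleanup done by the job itself rather than a job_submit plugin:

# submit the same script under both QOSes with a common name
sbatch --qos=high   --job-name=myrun job.sh
sbatch --qos=normal --job-name=myrun job.sh

# first thing inside job.sh: cancel any still-pending sibling with the same name
scancel --user=$USER --name=myrun --state=PENDING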



2. Gres: Adding number of GPU's or other Gres quantities to the QoS 
that can be used.


This has been discussed, but not implemented yet.


3. Requeue/No Requeue: There are some partitions we want to allow QoS 
to requeue, others we don't. For instance we have a general queue 
which we don't want requeue on, but we also have a backfill queue 
that we do permit it on. If the QoS could kill the backfill jobs 
first to find space, and just wait on the general queue that would be 
great.  We haven't experimented with QoS Requeue but we may in the 
future so this is just looking forward.


You can configure different preemption mechanisms and preempt by 
either QoS or partition. Take a look at:

http://slurm.schedmd.com/preempt.html
For example, you might enable QoS high to requeue jobs in QoS low, 
but wait for jobs in QoS medium.
There is no mechanism to configure QoS high to preempt jobs by 
partition.


We were also wondering if jobs asking for Gres could get higher 
priority on those nodes, such that they can grab the GPU's and leave 
the CPU's for everyone else.  After all the Gres resources are 
usually scarcer than the CPU resources and we would hate for a Gres 
resource to idle just because all the CPU jobs took up the slots.


-Paul Edmon-


This has also been discussed, but not implemented yet. One option 
might be to use a job_submit plugin to adjust a job's nice option 
based upon GRES.
There is a partition parameter MaxCPUsPerNode that might be useful to 
limit the number of CPUs that are consumed on each node by each 
partition. You would probably require a separate partition/queue for 
GPU jobs for that to work well, so that would probably not work for you.


Let me know if you need help pursuing these options.

Moe Jette


[slurm-dev] Re: Removing Job from Slurm Database

2014-04-22 Thread Paul Edmon

Well, more like the naive ones, namely:

sacctmgr delete job JobID

How do you set the endtime?  Do you do that via scontrol?

-Paul Edmon-


On 04/21/2014 10:14 PM, Danny Auble wrote:

What are the obvious ones?

I would expect setting the end time to the start time and state to 4 
(I think that is a completed state) should do it.





On April 21, 2014 6:54:22 PM PDT, Paul Edmon ped...@cfa.harvard.edu 
wrote:


Sure I can hunt that info down.  So what would be the command to remove
the job from the DB?  I tried the obvious ones I could think of but with
no effect.

-Paul Edmon-

On 4/21/2014 4:31 PM, Danny Auble wrote:

Paul, you should be able to remove the job with no issue. The
real question is why is it still running in the database
instead of completed. If you happen to have any logs on the
job and the information from the database it would be nice to
look at since what you are describing shouldn't be possible. I
know others have seen this before but no one has found a
reproducer yet or any evidence on how the state was achieved.
Let me know if you have anything like this. Thanks, Danny On
04/21/14 13:05, Paul Edmon wrote:

Is there a way to delete a JobID and its relevant data
from the slurm database? I have a user that I want to
remove but there is a job which slurm thinks is not
complete that is preventing me. I want slurm to just
remove that job data as it shouldn't impact anything.
-Paul Edmon- 





[slurm-dev] Re: Removing Job from Slurm Database

2014-04-22 Thread Paul Edmon

Thanks.  Sorry forgot about that thread.

I'm wagering that the jobs got orphaned due to timing out.  Essentially 
they actually launched but they didn't successfully update the database 
because it was busy.


-Paul Edmon-

On 04/22/2014 12:15 PM, Danny Auble wrote:
Paul I think this was covered in this thread 
https://groups.google.com/forum/#!searchin/slurm-devel/time_start/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ


The gist of it is that you have to go into the database and manually 
update the record.


If you know the jobid or the db_inx you can do something like this

update $CLUSTER_job_table set state=3, time_end=time_start where 
time_end=0 and id_job=$JOBID;


That should make it go away from the check.
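
To see which rows would be touched first, something along these lines works 
(the database name and cluster prefix are site-specific; slurm_acct_db is 
only the common default):

mysql slurm_acct_db -e "select id_job, state, time_start, time_end from ${CLUSTER}_job_table where time_end=0;"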

It would be very good to know why the job didn't finish in the database, 
though, as this shouldn't happen.


Danny

On 04/22/14 06:48, Paul Edmon wrote:

Well, more like the naive ones, namely:

sacctmgr delete job JobID

How do you set the endtime?  Do you do that via scontrol?

-Paul Edmon-


On 04/21/2014 10:14 PM, Danny Auble wrote:

What are the obvious ones?

I would expect setting the end time to the start time and state to 4 
(I think that is a completed state) should do it.





On April 21, 2014 6:54:22 PM PDT, Paul Edmon 
ped...@cfa.harvard.edu wrote:


Sure I can hunt that info down.  So what would be the command to remove
the job from the DB?  I tried the obvious ones I could think of but with
no effect.

-Paul Edmon-

On 4/21/2014 4:31 PM, Danny Auble wrote:

Paul, you should be able to remove the job with no issue.
The real question is why is it still running in the database
instead of completed. If you happen to have any logs on the
job and the information from the database it would be nice
to look at since what you are describing shouldn't be
possible. I know others have seen this before but no one has
found a reproducer yet or any evidence on how the state was
achieved. Let me know if you have anything like this.
Thanks, Danny On 04/21/14 13:05, Paul Edmon wrote:

Is there a way to delete a JobID and its relevant data
from the slurm database? I have a user that I want to
remove but there is a job which slurm thinks is not
complete that is preventing me. I want slurm to just
remove that job data as it shouldn't impact anything.
-Paul Edmon- 









[slurm-dev] srun and node unknown state

2014-04-21 Thread Paul Edmon


Occasionally when we reset the master, some of our nodes go into an 
unknown state or take a bit to get back in contact with the master.  If 
srun is launched on the nodes at that time it tends to hang, which 
causes the mpirun that depends on the srun to fail.  Even stranger, the 
sbatch that originally launched the srun keeps running and doesn't fail 
outright.


Is there a way to prevent srun from failing but rather just have it wait 
until the master comes back?  Or is the timeout the only way to set 
this?  Or if this isn't possible can we have the parent sbatch die with 
an error rather than have srun just hang up?


Thanks for any insight.

-Paul Edmon-


[slurm-dev] Removing Job from Slurm Database

2014-04-21 Thread Paul Edmon


Is there a way to delete a JobID and its relevant data from the slurm 
database?  I have a user that I want to remove but there is a job which 
slurm thinks is not complete that is preventing me.  I want slurm to 
just remove that job data as it shouldn't impact anything.


-Paul Edmon-


[slurm-dev] Re: srun and node unknown state

2014-04-21 Thread Paul Edmon


For a relevant error:

Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request 
from 56441.33234@10.242.58.34 (port 704)

Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin 
loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE 
plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE 
plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX 
plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded: 
checkpoint/none

Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started 
2014-04-20T17:45:19
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge 
(http://code.google.com/p/munge/) loaded
Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with 
exit code 0.
Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg: 
Socket timed out on send/recv operation
Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed 
sending step completion message directly to slurmctld (0.0.0.0:0), retrying
Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60 
secs after job shutdown initiated
Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion 
message directly to slurmctld (0.0.0.0:0)

Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job

Is there any way to prevent this?  When this fails it creates a zombie 
task that holds the job open.  I think part of the reason is that the 
user is looping over mpirun invocations like this:


for i in $(seq 1 1000); do
    mpirun -np 64 ./executable
done

Each run lasts about 5 minutes.  If one of the mpirun invocations fails 
to launch, the entire thing hangs.  It would be better if srun kept 
trying instead of just failing.
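
One stop-gap, rather than a fix for srun itself, is to have the batch 
script wait until the controller answers before each launch; a sketch, 
assuming the usual scontrol ping output:

# block until the primary controller reports UP, then launch the next run
until scontrol ping 2>/dev/null | grep -q UP; do sleep 30; done
mpirun -np 64 ./executable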


-Paul Edmon-

On 4/16/2014 11:16 PM, Paul Edmon wrote:
Occasionally when we reset the master some of our nodes go into an 
unknown state or take a bit to get back in contact with the master.   
If srun is being launched on the nodes at that time it tends to make 
it hang which causes the mpirun dependent on the srun being launched 
to fail.  Even stranger the sbatch that originally launched the srun 
keeps running and not failing out right.


Is there a way to prevent srun from failing but rather just have it 
wait until the master comes back?  Or is the timeout the only way to 
set this?  Or if this isn't possible can we have the parent sbatch die 
with an error rather than have srun just hang up?


Thanks for any insight.

-Paul Edmon-


  1   2   >