[slurm-dev] Re: CPU/GPU Affinity Not Working

2017-10-27 Thread Michael Di Domenico

On Thu, Oct 26, 2017 at 1:39 PM, Kilian Cavalotti
 wrote:
> and for a 4-GPU node which has a gres.conf like this (don't ask, some
> vendors like their CPU ids alternating between sockets):
>
> NodeName=sh-114-03 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2,4,6,8,10,12,14,16,18
> NodeName=sh-114-03 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3,5,7,9,11,13,15,17,19

as an aside, is there some tool which provides the optimal mapping of
CPU IDs to GPU cards?
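
for what it's worth, the closest i've found is asking the driver itself
(assuming a reasonably recent nvidia-smi):

    nvidia-smi topo -m

which prints a matrix with a "CPU Affinity" column per GPU, and hwloc's
lstopo will show which socket each card's PCIe bridge hangs off of.
neither spits out gres.conf lines directly though.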


[slurm-dev] Re: node selection

2017-10-20 Thread Michael Di Domenico

On Thu, Oct 19, 2017 at 3:14 AM, Steffen Grunewald
 wrote:
>> for some reason on an empty cluster when i spin up a large job it's
>> staggering the allocation across a seemingly random allocation of
>> nodes
>
> Have you looked into topology? With topology.conf, you may group nodes
> by (virtually or really, Slurm doesn't check nor care) connecting them
> to network switches... adding some "locality" to your cluster setup

yes, i have a topology file defined based on output from
ibslurmtopology.sh linked from the schedmd website

>> we're using backfill/cons_res + gres, and all the nodes are identical.
>
> Why do you care about the randomness then?

because I do.  and furthermore because slurm is skipping nodes for
some reason despite my topology file, and i'd like to understand why


[slurm-dev] node selection

2017-10-18 Thread Michael Di Domenico

is there any way, after a job starts, to determine why the scheduler
chose the series of nodes it did?

for some reason on an empty cluster, when i spin up a large job it's
staggering the allocation across a seemingly random set of nodes

we're using backfill/cons_res + gres, and all the nodes are identical.

in the past it used to select the next node past a down node and then
allocate sequentially from there.

i haven't made (or am not aware of) any changes to the system, but
now it's skipping nodes that presumably should have been in the
allocation
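
i'm guessing the answer involves bumping the controller logging while
the job starts, something roughly like

    scontrol setdebug debug2
    scontrol setdebugflags +SelectType
    # submit the job, watch slurmctld.log, then put things back
    scontrol setdebugflags -SelectType
    scontrol setdebug info

but i'd prefer something i can query after the fact, if it exists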


[slurm-dev] Re: detecting gpu's in use

2017-07-31 Thread Michael Di Domenico

On Mon, Jul 31, 2017 at 11:58 AM, Sean McGrath  wrote:
> We do check that the GPU drivers are working on nodes before launching a job on them.
>
> Our prolog calls another in-house script (we should move to NHC, to be honest)
> that does the following:
>
> if [[ -n $(lspci | grep GK110BGL) ]]; then  # only applicable to boole nodes with gpu's installed
>     if [[ -z $(/home/support/apps/apps/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery | grep 'Result = PASS') ]]; then  # deviceQuery is slow
>         print_problem "GPU Drivers"
>     else
>         print_ok "GPU Drivers"
>     fi
> fi

i'm doing a slightly different test, but along the same lines.  since
we run the GPUs exclusive, if the test fails I can presume the card is
either dead or in use.  the process seems pretty fragile though; i've
had instances where the driver would hang and cause the command to
hang as well, which would then cause the prolog to hang, which just
left a mess behind
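
if i were to redo it, i'd at least wrap the check in a timeout so a
wedged driver can't wedge the prolog along with it, something like this
sketch (the gpu_check path is hypothetical):

    if ! timeout 30 /usr/local/bin/gpu_check; then
        scontrol update nodename=$(hostname -s) state=drain reason="gpu check failed"
    fi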


[slurm-dev] detecting gpu's in use

2017-07-31 Thread Michael Di Domenico

do people here running slurm with gres-based gpu's check that the gpu
is actually usable before launching the job?  if so, can you detail
how you're doing it?

my cluster is currently using slurm, but we run htcondor on the nodes
in the background.  when a node isn't currently allocated through
slurm it's made available to htcondor for use.  in general this works
pretty well.

however, the issue that arises is that condor can't detect a
slurm-allocated node fast enough or halt the job it's running quickly
enough.  when a user srun's a job, it usually errors out with some
irrelevant error about not being able to use the gpu.  generally the
user can't decipher it and tell what actually happened.

i've tried setting up a prolog on the nodes to kick the condor jobs
off, but I've seen issues in the past where users quickly issuing srun
commands will hork up the nodes.  and if the srun takes too long
they'll just kill it and try again, compounding the problem.  whether
it's the node, slurm, condor, or the combination of all three, i have
not nailed down yet.

it might come down to this: i'm doing it correctly, but my script is
just too clunky.  before i spend a bunch of hours tuning, i'd like to
double-check that i'm going down the right path and/or incorporate
some other ideas
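
for context, the check i'm leaning toward is roughly this (a sketch,
not the production script):

    # non-empty output means something already has a compute context open
    nvidia-smi --query-compute-apps=pid --format=csv,noheader

wrapped in a timeout so a wedged driver doesn't hang the prolog.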

thanks


[slurm-dev] Re: Potential CUDA Memory Allocation Issues

2017-04-27 Thread Michael Di Domenico

You can stop tracking memory by changing
SelectTypeParameters=CR_Core_Memory to SelectTypeParameters=CR_Core.
doing this means slurm no longer tracks memory at all, and jobs could
in theory stomp on one another if they allocate too much physical
memory.

we haven't started tracking memory on our slurm clusters and we have
gpu's, so i'm very curious about the answer to this one as well.  when
our jobs run we routinely see the virtual memory on the process spiral
up to over 100GB, but the physical memory is much less
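
as a sketch (untested against your config), the relevant slurm.conf
knobs would be something like

    SelectTypeParameters=CR_Core   # stop tracking memory as a schedulable resource
    VSizeFactor=0                  # 0 disables the virtual memory limit entirely

the VSizeFactor=110 in your config is what turns the requested real
memory into a virtual memory cap, which is likely what's killing the
cuda jobs given their huge address-space reservations.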



On Wed, Apr 26, 2017 at 3:41 PM, Samuel Bowerman  wrote:
> Hello Slurm community,
>
> Our lab has recently begun transitioning from maui/torque to slurm, but we
> are having some difficulties getting our configuration correct.  In short,
> our CUDA tests routinely violate the virtual memory limits assigned by the
> scheduler even though the physical memory space is orders of magnitude
> smaller than the amount of memory requested through slurm.  Since our
> CPU-only versions of the same program (pmemd of the Amber suite - v16) run
> without problems, I am highly suspicious that our scheduler is not handling
> GPU memory appropriately.
>
> Here is some relevant information/specs:
> - I am a user, not the sys-admin.  Both the admin and myself are new to
> managing slurm installations, but I do have experience on the user end
> through XSEDE HPC resources.
> - Cluster runs on Rocks 6.2
> - One head node, 12 compute nodes
> - Each compute node has 12 CPU cores, 64 GB of RAM, and 4 NVIDIA GeForce GTX
> 1080 GPU's
> - GPU version of program uses less than 1 GB of physical memory (~0.6-0.8 GB)
>
> Here are some symptoms of the issue:
> - CPU version of program runs fine using default SLURM memory allocation
> - SBATCH option "--mem=2GB" returns an error: "cudaGetDeviceCount failed out
> of memory".  This problem is only resolved when I declare "--mem=10GB".  I
> find it very hard to believe that this routine requires that much memory.
> - SLURM will kill the job unless I declare "--mem=20GB" or above with the
> error message: "slurmstepd: error: Job exceeded virtual memory limit".  As a
> reminder that the program in question uses less than 1GB of physical memory;
> however, the virtual memory footprint is on the order of 20g.
> - Requesting more than "--mem=62GB" results in a job never running due to
> lack of resources ("scontrol show node" returns
> "RealMemory=64527,FreeMemory=62536"), so if I understand the SLURM scheduler
> appropriately, the virtual memory limit is set according to maximum physical
> memory available (even though virtual memory usage =/= physical memory
> usage).
> - According to the previous bullet point, SLURM believes we should only be
> able to run up to 3 copies of this program at once, but our maui/torque
> nodes are currently running 4 versions of the program + other jobs
> simultaneously without any performance losses.
>
> In summary, we are looking for a way to either shut off the SLURM memory
> allocation and let the programs "self-regulate" how much memory they are
> using or to have SLURM enforce a physical memory limit instead of a virtual
> one.  We have tried several things to no avail since we are both new to
> SLURM, so we are hoping for some help on the matter.  Our current slurm.conf
> file is printed at the end of this email, and any assistance on the matter
> would be greatly appreciated.  Thanks!
>
> Take care,
> Samuel Bowerman
> Ph.D. Candidate
> Department of Physics
> Illinois Institute of Technology
>
> Begin slurm.conf:
>
> SlurmUser=root
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> CryptoType=crypto/munge
> StateSaveLocation=/var/spool/slurm.state
> SlurmdSpoolDir=/var/spool/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/linuxproc
> PluginDir=/usr/lib64/slurm
> CacheGroups=0
> JobCheckpointDir=/var/spool/slurm.checkpoint
> #SallocDefaultCommand = "xterm"
> GresTypes=gpu
> #FirstJobId=
> ReturnToService=2
> #MaxJobCount=
> #PlugStackConfig=
> #PropagatePrioProcess=
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
>
> PropagateResourceLimits=NONE
>
> #Prolog=
> #Epilog=
> #SrunProlog=
> #SrunEpilog=
> #TaskProlog=
> #TaskEpilog=
> TaskPlugin=task/affinity
> TrackWCKey=yes
> TopologyPlugin=topology/none
> #TreeWidth=50
> TmpFs=/state/partition1
> #UsePAM=
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=30
> MinJobAge=300
> KillWait=60
> WaitTime=60
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> #DefMemPerCPU=220
> #MaxMemPerCPU=300
> VSizeFactor=110
> FastSchedule=0
> 
> #Set MemLimitEnforce
> MemLimitEnforce=No
> ##
>
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> ##JobAcctGatherParams=UsePss
>
> ### Priority Begin ##
> PriorityType=priority/multifactor
> 

[slurm-dev] Re: Unable to allocate Gres by type

2017-02-08 Thread Michael Di Domenico

On Mon, Feb 6, 2017 at 1:55 PM, Hans-Nikolai Viessmann  wrote:
> Hi Michael,
>
> Yes, on all the compute nodes there is a gres.conf, and all the GPU nodes
> except gpu08 have the following defined:
>
> Name=gpu Count=1
> Name=mic Count=0
>
> The head node has this defined:
>
> Name=gpu Count=0
> Name=mic Count=0
>
> Is it possible that Gres Type needs to be specified for all nodes (of a
> particular
> grestype, e.g. gpu) in order to use type based allocation?
>
> So should I perhaps update the gres.conf file on the gpu nodes to something
> like this:
>
> Name=gpu Type=tesla Count=1
> Name=mic Count=0
>
> Would that make a difference?

Not sure, this is starting to get beyond my troubleshooting ability.
Here's what i have defined:

slurm.conf
nodename=host001 gres=gpu:k10:8

gres.conf
name=gpu file=/dev/nvidia0 type=k10
name=gpu file=/dev/nvidia1 type=k10
name=gpu file=/dev/nvidia2 type=k10
name=gpu file=/dev/nvidia3 type=k10
name=gpu file=/dev/nvidia4 type=k10
name=gpu file=/dev/nvidia5 type=k10
name=gpu file=/dev/nvidia6 type=k10
name=gpu file=/dev/nvidia7 type=k10

if that isn't working for you, i would take the "type" definitions out
of both the slurm.conf and the gres.conf and see if it works then.
there was a bug a couple revs ago with the gres types, which is
resolved, but maybe it regressed.
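
one quick sanity check is to confirm what the controller actually
registered for the node, e.g.

    scontrol show node host001 | grep -i gres

if the Gres= line there doesn't show the type, the gres.conf on that
node probably isn't being read the way you expect.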


[slurm-dev] Re: Unable to allocate Gres by type

2017-02-06 Thread Michael Di Domenico

On Mon, Feb 6, 2017 at 11:50 AM, Hans-Nikolai Viessmann  wrote:
> Hi Michael,
>
> That's an interesting suggestion, and this works for you? I'm
> a bit confused then because the man-page for gres.conf
> states otherwise: https://slurm.schedmd.com/gres.conf.html,
> indicating that it must match one of the GresTypes (gpu, mic, or
> nic).

sorry you're correct, i was thinking NodeName

do you have a gres.conf defined on all the compute nodes or just
node08 with the different cards?


[slurm-dev] Re: Unable to allocate Gres by type

2017-02-06 Thread Michael Di Domenico

On Mon, Feb 6, 2017 at 10:17 AM, Hans-Nikolai Viessmann  wrote:
>
> I had just added the DebugFlags setting to slurm.conf on the head node
> and did not synchronise it with the nodes. I doubt that this could cause the
> problem I described as it was occurring before I made the change to
> slurm.conf.
>
> One thing I did notice is this error occurring every once in a while:
>
> [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu
> lacks File parameter for node gpu07
> [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu
> lacks File parameter for node gpu04
> [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu
> lacks File parameter for node gpu01
> [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu
> lacks File parameter for node gpu05
> [2016-12-30T17:36:50.963] error: gres_plugin_node_config_unpack: gres/gpu
> lacks File parameter for node gpu02
> [2016-12-30T17:36:50.964] error: gres_plugin_node_config_unpack: gres/gpu
> lacks File parameter for node gpu06
> [2016-12-30T17:36:50.966] error: gres_plugin_node_config_unpack: gres/gpu
> lacks File parameter for node gpu03
>
> Is it possible that I need to specify the Gres Type for the other nodes as
> well, even though they have only one GPU each?


i'm not an expert, but i believe your gres.conf is incorrect.  Ours
looks like this

name=hostname file=/dev/nvidia0 type=k10

i think the issue is that slurm is trying to match your hostname to
the gres file to see what matches and can't


[slurm-dev] RE: Slurm for render farm

2017-01-20 Thread Michael Di Domenico

On Fri, Jan 20, 2017 at 11:09 AM, John Hearns  wrote:
> I plan to have this wrapper script run the actual render through slurm.
> The script will have to block until the job completes I think - else the
> RenderPal server will report it has finished.
> Is it possible to block and wait till an sbatch has finished?
> Or should I be thinking of using srun here?

i think you want salloc for an interactive sbatch, but since renderpal
likely isn't going to run an mpi program, srun should work just the
same
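
a couple of hedged options, assuming a reasonably recent slurm
(render.sh is just a placeholder for your wrapper):

    sbatch --wait render.sh     # blocks until the batch job completes
    srun -n1 render.sh          # srun itself blocks until the task exits

i'd lean toward plain srun here since there's no mpi involved.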


[slurm-dev] Re: Job temporary directory

2017-01-20 Thread Michael Di Domenico

On Fri, Jan 20, 2017 at 11:16 AM, John Hearns  wrote:
> As I remember, in SGE and in PbsPro a job has a directory created for it on
> the execution host which is a temporary directory, named with the jobid.
> you can define in the batch system configuration where the root of these
> directories is.
>
> On running srun env, the only TMPDIR I see is /tmp
> I know - RTFM.  I bet I haven't realised that this is easy to set up...
>
> Specifically I would like a temporary job directory which is
> /local/$SLURM_JOBID
>
> I guess I can create this in the job then delete it, but it would be cleaner
> if the batch system deleted it and didn't allow for failed jobs or bad
> scripts leaving it on disk.

this has come up on the list a few times over the years that i can
recall, but i don't have specific pointers.  there were some pretty
fancy scripts that slurm could run to create a scratch space
allocation on the local node.  search back through the history
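
the bare-bones shape of it is something like this (sketch only; error
handling left out, and check the prolog/epilog docs for the exact
environment variables your version exports):

    # prolog (Prolog= in slurm.conf, runs on the compute node)
    mkdir -p /local/$SLURM_JOB_ID
    chown $SLURM_JOB_USER /local/$SLURM_JOB_ID

    # epilog (Epilog= in slurm.conf)
    rm -rf /local/$SLURM_JOB_ID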


[slurm-dev] Re: Question about -m cyclic and --exclusive options to slurm

2017-01-03 Thread Michael Di Domenico

what behaviour do you get if you leave off the exclusive and cyclic
options?  which selecttype are you using?
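
also, for what it's worth, -m cyclic is only meaningful when a single
job step launches all the tasks at once, i.e. something like

    srun -n 32 --distribution=cyclic a.out

rather than 32 separate one-task steps.  but i'd still like to know
what the plain behaviour looks like first.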


On Tue, Jan 3, 2017 at 12:19 PM, Koziol, Lucas
 wrote:
> Dear Vendor,
>
>
>
>
>
> What I want to do is run a  large number of single-CPU tasks, and have them
> distributed evenly over all allocated nodes, and to oversubscribe CPUs to
> tasks (each task is very light on CPU resources).
>
>
>
> Here is a small test script that allocates 2 nodes (16 CPUs per Node on our
> machines) and tries to distribute 32 tasks over these 32 CPUs:
>
>
>
> #SBATCH -n 32 -p short
>
>
>
> set Vec = ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 )
>
>
>
> foreach frame ($Vec)
>
>
>
> cd $frame
>
> srun -n 1 -m cyclic a.out > output.txt &
>
> cd ..
>
> end
>
> wait
>
>
>
> The hope was that all 16 tasks would run on Node 1, and 16 tasks would run
> on Node 2. Unfortunately what happens is that all 32 jobs get assigned to
> Node 1. I thought -m cyclic was supposed to avoid this.
>
>
>
> A note from the vendor suggested using the --exclusive flag. In that case I
> modified my srun command to
>
>
>
> srun --exclusive -N 1 -n 1 a.out > output.txt &
>
>
>
>
>
> The problem with this is that it still assigns the tasks to Node 1, but
> waits until there is an available CPU before assigning the last 16. It still
> doesn’t accomplish the task of distributing all 32 jobs to the 32 CPUs
> across 2 nodes. And, in the next step I want to oversubscribe tasks to
> nodes, and --exclusive specifically waits until CPUs are open before submitting
> all the jobs. This sinks a whole lot of time.
>
>
>
> I have also played around with the --overcommit option, however that has not
> produced any difference. Note that MAX_TASKS_PER_NODE set in slurm.h is
> adequate.
>
>
>
> The -m cyclic option only applies to multiple tasks launched within a single
> step. Is there a mechanism for submitting 32 tasks using 1 srun command, at
> which point -m cyclic should hopefully fix everything?
>
>
>
> Thank you for your time and any help or suggestions.
>
>
>
> Best regards,
>
> Lucas Koziol
>
>
>
>
>
> Corporate Strategic Research
>
> ExxonMobil Research and Engineering Co.
>
> 1545 US Route 22 East
>
> Annandale, NJ, 08801
>
> Tel: (908) 335-3411
>
>

[slurm-dev] Re: Node selection for serial tasks

2016-12-08 Thread Michael Di Domenico

On Thu, Dec 8, 2016 at 5:48 AM, Nigella Sanders
 wrote:
>
> All 30 tasks run always in the first two allocated nodes (torus6001 and
> torus6002).
>
> However, I would like to get these tasks using only the second and then
> third nodes (torus6002 and torus6003).
> Does anyone have an idea about how to do this?

I've not tested this, but i believe you can add the -w option to the
srun inside your sbatch script
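
something along these lines (untested; node names taken from your
example):

    srun -w torus6002,torus6003 ...

inside the sbatch script should pin the step to just those two nodes.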


[slurm-dev] limiting scontrol

2016-11-17 Thread Michael Di Domenico

i'm a little hazy on account security controls, so i might need some
correction on this

as i understand it

users have accounts inside /etc/passwd

users can also have accounts inside slurm

and then there's the root account

if i don't add anyone to the slurm accounts, everyone is basically at
the lowest level, meaning they can submit jobs but not much else

if i add a user to slurm, I can then grant them further permissions
over groups, partitions etc all the way up to admin level
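
the kind of grant i mean, as best i understand the sacctmgr syntax
("someuser" being just a placeholder), is something like

    sacctmgr modify user name=someuser set adminlevel=operator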

but irrespective of that if someone is root then they can basically do anything

my question is, is there anything in slurm that would allow me to say
only accept scontrol requests from these users and only on these
specific hosts?


[slurm-dev] Re: Gres issue

2016-11-16 Thread Michael Di Domenico

this might be nothing, but i usually call --gres with an equals

srun --gres=gpu:k10:8

i'm not sure if the equals is optional or not
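
e.g. for your case i'd try something like the below (hostname is just a
stand-in command, and the trailing :1 count is an assumption):

    srun -N 1 --gres=gpu:c2050:1 hostname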



On Wed, Nov 16, 2016 at 4:34 AM, Dmitrij S. Kryzhevich  wrote:
>
> Hi,
>
> I have some issues with gres usage. I'm running slurm of 16.05.4 version and
> I have a small stand with 4 nodes+master. The best description of it would
> be to paste confs:
> slurm.conf: http://paste.org.ru/?m8v7ca
> gres.conf: http://paste.org.ru/?ouspnz
> They are populated on each node.
>
> And the problem is following:
>
> [dkryzhevich@gpu ~]$ srun -N 1 --gres gpu:c2050 
> srun: error: Unable to allocate resources: Requested node configuration is
> not available
> [dkryzhevich@gpu ~]$
>
> Relevant logs: http://paste.org.ru/?mj4dfs
> Whatever I did with --gres flag it just does not start. What am I missing
> here?
>
> I tried to remove Type column from gres.conf and all nodes have gone into
> "drain" state. I tried to remove all details from Gres column in slurm.conf
> in addition (i.e. "NodeName=node2 Gres=gpu:1 CoresPerSocket=2
> ThreadsPerCore=2 State=UNKNOWN") and task was submitted but I want the
> ability to specify type of card in case I really need it.
>
> And two small unrelated questions.
> 1. Is it possible to submit a job from any node, or is it master only? Start a
> secondary slurmctld daemon on each node maybe, I don't know.
> 2. Is it possible to start a job on two separate nodes with nvidia cards in
> a way something like
> $ srun --gres gpu:2
> ? The point is to use 2-3-4 cards installed on different nodes with some MPI
> connection between threads.
>
> BR,
> Dmitrij


[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-25 Thread Michael Di Domenico

although i see this with and without slurm, so there may very well be
something wrong with my ompi compile

On Thu, Aug 25, 2016 at 2:04 PM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
>
> I'm seeing this presently on our new cluster.  I'm not sure what's
> going on.  Did this ever get resolved?
>
> I can confirm that we have compiled openmpi with the slurm options.
> we have other clusters which work fine, albeit this is our first
> mellanox based IB cluster, so i'm not sure if that has anything to do
> with it.  i am using the same openmpi install between clusters.
>
>>> >> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <yoshi...@ohsu.edu> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I have a strange situation that I could use assistance with.  We
>>> >> recently rebooted some nodes in our Slurm cluster and after the reboot,
>>> >> running MPI programs on these nodes results in complaints from OpenMPI 
>>> >> about
>>> >> the Infiniband ports:
>>> >>
>>> >>
>>> >> --
>>> >> No OpenFabrics connection schemes reported that they were able to be
>>> >> used on a specific port.  As such, the openib BTL (OpenFabrics
>>> >> support) will be disabled for this port.
>>> >>
>>> >> Local host:   XX
>>> >> Local device: mlx4_0
>>> >> Local port:   1
>>> >> CPCs attempted:   udcm
>>> >> —
>>> >>
>>> >> [XXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>>> >>
>>> >> These nodes did receive some updates, but are otherwise all running the
>>> >> same version of Slurm (15.08.7) and OpenMPI (1.10.2).  The weird thing is
>>> >> that if I ssh into the affected nodes and use mpirun directly Infiniband
>>> >> works correctly.  So the problem definitely involves an interaction 
>>> >> between
>>> >> Slurm (maybe via PMI?) and OpenMPI.

[slurm-dev] Re: strange going-ons with OpenMPI and Infiniband

2016-08-25 Thread Michael Di Domenico

I'm seeing this presently on our new cluster.  I'm not sure what's
going on.  Did this ever get resolved?

I can confirm that we have compiled openmpi with the slurm options.
we have other clusters which work fine, albeit this is our first
mellanox based IB cluster, so i'm not sure if that has anything to do
with it.  i am using the same openmpi install between clusters.
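
one other thing i still need to try on the new cluster is checking
which pmi types this srun was actually built with:

    srun --mpi=list

just to rule out a mismatch between how slurm and this openmpi build
were compiled.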



On Wed, Apr 13, 2016 at 2:51 AM, Lachlan Musicman  wrote:
> I was reading about this today. Isn't OpenMPI compiled --with-slurm by
> default when installing with one of the pkg managers?
>
> https://www.open-mpi.org/faq/?category=building#build-rte
>
> Cheers
> L.
>
> --
> The most dangerous phrase in the language is, "We've always done it this
> way."
>
> - Grace Hopper
>
> On 13 April 2016 at 16:30, Craig Yoshioka  wrote:
>>
>>
>> Thanks, I'll add that to my list of things to try. I did use --with-pmi
>> but not --with-slurm.
>>
>> Sent from my iPhone
>>
>> > On Apr 12, 2016, at 11:19 PM, Jordan Willis 
>> > wrote:
>> >
>> >
>> > Have you tried recompiling openmpi with the —with-slurm option? That did
>> > the trick for me
>> >
>> >
>> >> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka  wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a strange situation that I could use assistance with.  We
>> >> recently rebooted some nodes in our Slurm cluster and after the reboot,
>> >> running MPI programs on these nodes results in complaints from OpenMPI 
>> >> about
>> >> the Infiniband ports:
>> >>
>> >>
>> >> --
>> >> No OpenFabrics connection schemes reported that they were able to be
>> >> used on a specific port.  As such, the openib BTL (OpenFabrics
>> >> support) will be disabled for this port.
>> >>
>> >> Local host:   XX
>> >> Local device: mlx4_0
>> >> Local port:   1
>> >> CPCs attempted:   udcm
>> >> —
>> >>
>> >> [XXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>> >>
>> >> These nodes did receive some updates, but are otherwise all running the
>> >> same version of Slurm (15.08.7) and OpenMPI (1.10.2).  The weird thing is
>> >> that if I ssh into the affected nodes and use mpirun directly Infiniband
>> >> works correctly.  So the problem definitely involves an interaction 
>> >> between
>> >> Slurm (maybe via PMI?) and OpenMPI.
>> >>
>> >> Any thoughts?
>> >>
>> >> Thanks!,
>> >> -Craig
>> >>
>
>

[slurm-dev] Re: gmail spam filters?

2016-07-08 Thread Michael Di Domenico

On Fri, Jul 8, 2016 at 1:22 PM, Tim Wickberg  wrote:
>
> I've made a few minor changes to our SPF records, and fixed the reverse IP
> record for the mailing list server.
>
> I highly recommend filtering based on the list ID (slurmdev.schedmd.com),
> which has remained unchanged for a long time, and should let you bypass any
> local spam filter rules.

it's not me, it's google's spam filters.  the message is something
about the list coming through as unauthenticated


[slurm-dev] Re: Regards Postgres Plugin for SLURM

2016-03-21 Thread Michael Di Domenico

just to jump on the bandwagon, we would prefer a postgres option as well.
i can't "Create, debug and contribute back" a plugin, but i could help
in some fashion

On Sun, Mar 20, 2016 at 5:41 PM, Simpson Lachlan
 wrote:
> I think we would like a PostgreSQL plugin too - if you start building one, 
> please let me know so I can contribute.
>
> Cheers
> L.
>
>> -Original Message-
>> From: Chris Samuel [mailto:sam...@unimelb.edu.au]
>> Sent: Saturday, 19 March 2016 2:00 PM
>> To: slurm-dev
>> Subject: [slurm-dev] Re: Regards Postgres Plugin for SLURM
>>
>>
>> On Fri, 18 Mar 2016 06:53:38 AM Doguparthi, Subramanyam wrote:
>>
>> > We are from Hewlett Packard Enterprise and evaluating
>> > SLURM for one of our requirements. Database our application uses is
>> > Postgres and we don’t see any working plugin available. Is it possible
>> > to help us with Postgres DB plug in? We are ok to add any missing
>> > functionality and submit back the changes.
>>
>> PostgreSQL support was removed in the 14.03 release two years ago.
>>
>> The release notes said:
>>
>>  -- Support for Postgres database has long since been out of date and
>> problematic, so it has been removed entirely.  If you would like to
>> use it the code still exists in <= 2.6, but will not be included in
>> this and future versions of the code.
>>
>> The RPM you've found is from Slurm 2.3.x, which is ancient.
>>
>> So I suspect there are 3 options:
>>
>> 1) Use Slurm 2.6.x (which is no longer maintained)
>> 2) Use MySQL instead of PostgreSQL for Slurm
>> 3) Create, debug and contribute back an out of tree, fully functional 
>> PostgreSQL
>> plugin & see what folks think.
>>
>> All the best,
>> Chris
>> --
>>  Christopher SamuelSenior Systems Administrator
>>  VLSCI - Victorian Life Sciences Computation Initiative
>>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>>  http://www.vlsci.org.au/  http://twitter.com/vlsci

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico

I see in the output

mca:base:select(ess) selected component [pmi]

the last line in the output reads

MCW rank 0 is not bound (or bound to all available processors)

(sorry i can't cut and paste from the cluster to here)

i recompiled slurm/openmpi with enable-debug, so i might be able to
track down something with that if it helps



On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>
> Yes, mpirun adds that to the environment to ensure we don’t pickup the wrong
> ess component. Try adding “OMPI_MCA_ess_base_verbose=10” to your environment
> and just srun one copy of hello_world - let’s ensure it picked up the right
> ess component.
>
> I can try to replicate here, but it will take me a little while to get to
> it
>
>> On Dec 16, 2015, at 8:50 AM, Michael Di Domenico <mdidomeni...@gmail.com>
>> wrote:
>>
>>
>> Yes, i have PMI support included into openmpi
>>
>> --with-slurm --with-psm --with-pmi=/opt/slurm
>>
>> checking through the config.log it does appear the PMI tests build
>> successfully.
>>
>> though checking with ompi_info i'm not sure i can say with 100%
>> certainty it's in there.
>>
>> ompi_info --parseable | grep pmi does return
>>
>> mca:db:pmi
>> mca:ess:pmi
>> mca:grpcomm:pmi
>> mca:pubsub:pmi
>>
>> interestingly enough when i run 'mpirun env' (no slurm), i see ^pmi in
>> the OMPI environment variables, but i'm not sure if that's supposed to
>> do that or not
>>
>>
>>
>> On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hey Michael
>>>
>>> Check ompi_info and ensure that the PMI support built - you have to
>>> explicitly ask for it and provide the path to pmi.h
>>>
>>>
>>>> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico
>>>> <mdidomeni...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> i just compiled and installed Slurm 14.11.4 and Openmpi 1.10.0.  but i
>>>> seem to have an srun oddity i've not seen before and i'm not exactly
>>>> sure how to debug it
>>>>
>>>> srun -n 4 hello_world
>>>> - does not run, hangs in MPI_INIT
>>>>
>>>> srun -n 4 -N1 hello_world
>>>> - does not run, hangs in MPI_INIT
>>>>
>>>> srun -n 4 -N 4
>>>> - runs one task per node
>>>>
>>>> sbatch and salloc seem to work okay launching using mpirun inside, and
>>>> mpirun works without issue outside of slurm
>>>>
>>>> i disabled all the gres and cgroup controls and all that
>>>>
>>>> has anyone seen this before?
>>>
>

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico

to add some additional info, i let it sit for a long time and finally got

PSM returned unhandled/unknown connect error: Operation timed out
PSM EP connect error (unknown connect error)

so perhaps my old friend psm and srun aren't getting along again...



On 12/16/15, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> I see in the output
>
> mca:base:select(ess) selected component [pmi]
>
> the last line in the output reads
>
> MCW rank 0 is not bound (or bound to all available processors)
>
> (sorry i can't cut and paste from the cluster to here)
>
> i recompiled slurm/openmpi with enable-debug, so i might be able to
> track down something with that if it helps
>
>
>
> On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Yes, mpirun adds that to the environment to ensure we don’t pickup the
>> wrong
>> ess component. Try adding “OMPI_MCA_ess_base_verbose=10” to your
>> environment
>> and just srun one copy of hello_world - let’s ensure it picked up the
>> right
>> ess component.
>>
>> I can try to replicate here, but it will take me a little while to get to
>> it
>>
>>> On Dec 16, 2015, at 8:50 AM, Michael Di Domenico
>>> <mdidomeni...@gmail.com>
>>> wrote:
>>>
>>>
>>> Yes, i have PMI support included into openmpi
>>>
>>> --with-slurm --with-psm --with-pmi=/opt/slurm
>>>
>>> checking through the config.log it does appear the PMI tests build
>>> successfully.
>>>
>>> though checking with ompi_info i'm not sure i can say with 100%
>>> certainty it's in there.
>>>
>>> ompi_info --parseable | grep pmi does return
>>>
>>> mca:db:pmi
>>> mca:ess:pmi
>>> mca:grpcomm:pmi
>>> mca:pubsub:pmi
>>>
>>> interestingly enough when i run 'mpirun env' (no slurm), i see ^pmi in
>>> the OMPI environment variables, but i'm not sure if that's supposed to
>>> do that or not
>>>
>>>
>>>
>>> On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hey Michael
>>>>
>>>> Check ompi_info and ensure that the PMI support built - you have to
>>>> explicitly ask for it and provide the path to pmi.h
>>>>
>>>>
>>>>> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico
>>>>> <mdidomeni...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> i just compiled and installed Slurm 14.11.4 and Openmpi 1.10.0.  but i
>>>>> seem to have an srun oddity i've not seen before and i'm not exactly
>>>>> sure how to debug it
>>>>>
>>>>> srun -n 4 hello_world
>>>>> - does not run, hangs in MPI_INIT
>>>>>
>>>>> srun -n 4 -N1 hello_world
>>>>> - does not run, hangs in MPI_INIT
>>>>>
>>>>> srun -n 4 -N 4
>>>>> - runs one task per node
>>>>>
>>>>> sbatch and salloc seem to work okay launching using mpirun inside, and
>>>>> mpirun works without issue outside of slurm
>>>>>
>>>>> i disabled all the gres and cgroup controls and all that
>>>>>
>>>>> has anyone seen this before?
>>>>
>>
>

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico

Adding

OMPI_MCA_mtl=^psm

to my environment and re-running the 'srun -n4 hello_world' seems to
fix the issue

so i guess we've isolated the problem to slurm/srun and psm, but now
the question is what's broken
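
for anyone searching later, the workaround in full is just (assuming a
bash-ish shell):

    export OMPI_MCA_mtl=^psm
    srun -n 4 hello_world

which tells open mpi to skip the psm mtl and fall back to the regular
btl path for now.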

On 12/16/15, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
>
> to add some additional info, i let it sit for a long time and finally got
>
> PSM returned unhandled/unknown connect error: Operation timed out
> PSM EP connect error (uknown connect error)
>
> so perhaps my old friend psm and srun aren't getting along again...
>
>
>
> On 12/16/15, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
>> I see in the output
>>
>> mca:base:select(ess) selected component [pmi]
>>
>> the last line in the output reads
>>
>> MCW rank 0 is not bound (or bound to all available processors)
>>
>> (sorry i can't cut and paste from the cluster to here)
>>
>> i recompiled slurm/openmpi with enable-debug, so i might be able to
>> track down something with that if it helps
>>
>>
>>
>> On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Yes, mpirun adds that to the environment to ensure we don’t pickup the
>>> wrong
>>> ess component. Try adding “OMPI_MCA_ess_base_verbose=10” to your
>>> environment
>>> and just srun one copy of hello_world - let’s ensure it picked up the
>>> right
>>> ess component.
>>>
>>> I can try to replicate here, but it will take me a little while to get
>>> to
>>> it
>>>
>>>> On Dec 16, 2015, at 8:50 AM, Michael Di Domenico
>>>> <mdidomeni...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>> Yes, i have PMI support included into openmpi
>>>>
>>>> --with-slurm --with-psm --with-pmi=/opt/slurm
>>>>
>>>> checking through the config.log it does appear the PMI tests build
>>>> successfully.
>>>>
>>>> though checking with ompi_info i'm not sure i can say with 100%
>>>> certainty it's in there.
>>>>
>>>> ompi_info --parseable | grep pmi does return
>>>>
>>>> mca:db:pmi
>>>> mca:ess:pmi
>>>> mca:grpcomm:pmi
>>>> mca:pubsub:pmi
>>>>
>>>> interestingly enough when i run 'mpirun env' (no slurm), i see ^pmi in
>>>> the OMPI environment variables, but i'm not sure if that's supposed to
>>>> do that or not
>>>>
>>>>
>>>>
>>>> On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Hey Michael
>>>>>
>>>>> Check ompi_info and ensure that the PMI support built - you have to
>>>>> explicitly ask for it and provide the path to pmi.h
>>>>>
>>>>>
>>>>>> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico
>>>>>> <mdidomeni...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> i just compiled and installed Slurm 14.11.4 and Openmpi 1.10.0.  but
>>>>>> i
>>>>>> seem to have an srun oddity i've not seen before and i'm not exactly
>>>>>> sure how to debug it
>>>>>>
>>>>>> srun -n 4 hello_world
>>>>>> - does not run, hangs in MPI_INIT
>>>>>>
>>>>>> srun -n 4 -N1 hello_world
>>>>>> - does not run, hangs in MPI_INIT
>>>>>>
>>>>>> srun -n 4 -N 4
>>>>>> - runs one task per node
>>>>>>
>>>>>> sbatch and salloc seem to work okay launching using mpirun inside,
>>>>>> and
>>>>>> mpirun works without issue outside of slurm
>>>>>>
>>>>>> i disabled all the gres and cgroup controls and all that
>>>>>>
>>>>>> has anyone seen this before?
>>>>>
>>>
>>

[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico

Yes, i have the required pmi/openmpi attributes set in the slurm.conf
and i've tried alternating them from the command line.  same behavior

On 12/16/15, Ian Logan <i...@nmsu.edu> wrote:
> Hi Michael,
> I'm not sure if this is right or not, I don't have much experience with
> OpenMPI. On most slurm installs I believe the MPI type defaults to none,
> have you tried adding --mpi=pmi2 or --mpi=openmpi to your srun command?
> Ian
>
> On Wed, Dec 16, 2015 at 9:50 AM, Michael Di Domenico
> <mdidomeni...@gmail.com
>> wrote:
>
>>
>> Yes, i have PMI support included into openmpi
>>
>> --with-slurm --with-psm --with-pmi=/opt/slurm
>>
>> checking through the config.log it does appear the PMI tests build
>> successfully.
>>
>> though checking with ompi_info i'm not sure i can say with 100%
>> certainty it's in there.
>>
>> ompi_info --parseable | grep pmi does return
>>
>> mca:db:pmi
>> mca:ess:pmi
>> mca:grpcomm:pmi
>> mca:pubsub:pmi
>>
>> interestingly enough when i run 'mpirun env' (no slurm), i see ^pmi in
>> the OMPI environment variables, but i'm not sure if that's supposed to
>> do that or not
>>
>>
>>
>> On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>> >
>> > Hey Michael
>> >
>> > Check ompi_info and ensure that the PMI support built - you have to
>> > explicitly ask for it and provide the path to pmi.h
>> >
>> >
>> >> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico <
>> mdidomeni...@gmail.com>
>> >> wrote:
>> >>
>> >>
>> >> i just compiled and installed Slurm 14.11.4 and Openmpi 1.10.0.  but i
>> >> seem to have an srun oddity i've not seen before and i'm not exactly
>> >> sure how to debug it
>> >>
>> >> srun -n 4 hello_world
>> >> - does not run, hangs in MPI_INIT
>> >>
>> >> srun -n 4 -N1 hello_world
>> >> - does not run, hangs in MPI_INIT
>> >>
>> >> srun -n 4 -N 4
>> >> - runs one task per node
>> >>
>> >> sbatch and salloc seem to work okay launching using mpirun inside, and
>> >> mpirun works without issue outside of slurm
>> >>
>> >> i disabled all the gres and cgroup controls and all that
>> >>
>> >> has anyone seen this before?
>> >
>>
>
>
>
> --
> Ian Logan
> Virtualization and Unix Systems Administrator
> Information and Communication Technologies - New Mexico State University
> Phone: 575-646-3054 Email: i...@nmsu.edu
>


[slurm-dev] Re: srun and openmpi

2015-12-16 Thread Michael Di Domenico

Yes, i have PMI support included into openmpi

--with-slurm --with-psm --with-pmi=/opt/slurm

checking through the config.log it does appear the PMI tests build successfully.

though checking with ompi_info i'm not sure i can say with 100%
certainty it's in there.

ompi_info --parseable | grep pmi does return

mca:db:pmi
mca:ess:pmi
mca:grpcomm:pmi
mca:pubsub:pmi

interestingly enough when i run 'mpirun env' (no slurm), i see ^pmi in
the OMPI environment variables, but i'm not sure if that's supposed to
do that or not



On 12/16/15, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hey Michael
>
> Check ompi_info and ensure that the PMI support built - you have to
> explicitly ask for it and provide the path to pmi.h
>
>
>> On Dec 16, 2015, at 6:48 AM, Michael Di Domenico <mdidomeni...@gmail.com>
>> wrote:
>>
>>
>> i just compiled and installed Slurm 14.11.4 and Openmpi 1.10.0.  but i
>> seem to have an srun oddity i've not seen before and i'm not exactly
>> sure how to debug it
>>
>> srun -n 4 hello_world
>> - does not run, hangs in MPI_INIT
>>
>> srun -n 4 -N1 hello_world
>> - does not run, hangs in MPI_INIT
>>
>> srun -n 4 -N 4
>> - runs one task per node
>>
>> sbatch and salloc seem to work okay launching using mpirun inside, and
>> mpirun works without issue outside of slurm
>>
>> i disabled all the gres and cgroup controls and all that
>>
>> has anyone seen this before?
>


[slurm-dev] Re: add/remove node from partition

2015-11-25 Thread Michael Di Domenico

sorry for the double post.  i'm not sure why gmail decided to send the
message again, i didn't send it twice...

On Wed, Nov 25, 2015 at 9:33 AM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
>
> is it possible to add or remove just a single node from a partition
> without having to re-establish the whole list of nodes?
>
> for example
>
> if i have nodes[001-100] and i want to remove only node 049.  is there
> some incantation that will allow me to do that without having to say
> nodes[001-048,050-100]
>
> the motivation is that we have a mixed pool of nodes some with gpu's
> and some without.  as our cluster ages, the gpus are getting flaky.
> often the gpu flakes out or dies, but the rest of the node is
> perfectly fine.
>
> i'd like to dynamically move a node out of the gpu partition and into
> a non-gpu partition using a node-health script
>
> yes, gres would probably handle this better than split partitions, but
> we haven't rolled to gres allocations on the gpu's yet


[slurm-dev] add/remove node from partition

2015-11-25 Thread Michael Di Domenico

is it possible to add or remove just a single node from a partition
without having to re-establish the whole list of nodes?

for example

if i have nodes[001-100] and i want to remove only node 049.  is there
some incantation that will allow me to do that without having to say
nodes[001-048,050-100]

the motivation is that we have a mixed pool of nodes some with gpu's
and some without.  as our cluster ages, the gpus are getting flaky.
often the gpu flakes out or dies, but the rest of the node is
perfectly fine.

i'd like to dynamically move a node out of the gpu partition and into
a non-gpu partition using a node-health script

yes, gres would probably handle this better than split partitions, but
we haven't rolled to gres allocations on the gpu's yet
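
for reference, the only incantation i know of today still needs the
full list, i.e. something like (partition name hypothetical)

    scontrol update partitionname=gpu nodes=nodes[001-048,050-100]

which is exactly the bookkeeping i'm hoping to avoid.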


[slurm-dev] gmail spam filters?

2015-07-30 Thread Michael Di Domenico

is anyone else having an issue using a gmail address for the slurm
mailing lists?  Gmail keeps blocking all the slurm mail for my
account and marking it as spam.  A little yellow box pops up and says
the message is in violation of gmail's bulk-sender something or other