[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jeffrey T Frey via slurm-users
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.

The ulimit command is a frontend to the kernel's resource limits (rlimits), which 
are per-process restrictions (not per-user).
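
A quick way to see what limits a running process actually got (as opposed to 
what limits.conf claims) is its /proc limits file; e.g., assuming munged is the 
process you care about:

$ grep 'open files' /proc/$(pidof munged)/limits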

fs.file-max is the kernel's limit on how many file handles can be open in 
aggregate across the entire system.  You'd have to adjust that with sysctl:


$ sysctl fs.file-max
fs.file-max = 26161449


Check e.g. /etc/sysctl.conf or /etc/sysctl.d to see whether an alternative to 
the default limit has already been set.
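
If you do need to raise it persistently, a drop-in file is tidier than editing 
/etc/sysctl.conf itself; a minimal sketch (the file name is arbitrary and 131072 
is just an example value):

# cat > /etc/sysctl.d/90-file-max.conf <<'EOF'
fs.file-max = 131072
EOF
# sysctl --system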




> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
> 96 users each trying to open 1024 files would do it, though.)

Naturally, since the ulimit is per-process, using the core count as the 
multiplier isn't valid.  It also assumes Slurm isn't set up to oversubscribe CPU 
resources :-)



>> I'm not sure how the number 3092846 got set, since it's not defined in
>> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
>> our compute nodes, so which dynamic service might affect the limits?

If the 1024 is a soft limit, you may have users who are raising it to arbitrary 
values themselves; 1024 is somewhat low for the more naively-written data 
science Python code I see on our systems, for example.  If Slurm is configured 
to propagate submission shell ulimits to the runtime environment and you allow 
submission from a variety of nodes/systems, you could be seeing myriad limits 
reconstituted on the compute nodes despite the /etc/security/limits.conf 
settings.
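
If that turns out to be the cause, one knob worth knowing about is the 
limit-propagation setting in slurm.conf; a hedged sketch (check the slurm.conf 
man page for your version):

# slurm.conf:  keep RLIMIT_NOFILE from being copied from the submission shell
PropagateResourceLimitsExcept=NOFILE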


The main question needing an answer is _what_ process(es) are opening all the 
files on your systems that are faltering.  It's very likely user jobs opening 
them; I was just hoping to also rule out any bug in munged.  Since you're 
upgrading munged, you'll now get the errno associated with the backlog and can 
confirm EMFILE vs. ENFILE vs. ENOMEM.
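
For the system-wide side of that, the kernel also reports how many file handles 
are currently allocated versus the maximum, which makes it easy to watch a node 
approach fs.file-max; for example:

$ cat /proc/sys/fs/file-nr    # fields: allocated, free, maximum (= fs.file-max)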
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-15 Thread Jeffrey T Frey via slurm-users
https://github.com/dun/munge/issues/94


The NEWS file claims this was fixed in 0.5.15.  Since your log doesn't show the 
additional strerror() output you're definitely running an older version, 
correct?


If you go on one of the affected nodes and do an `lsof -p <munged PID>` I'm 
betting you'll find a long list of open file descriptors — that would explain 
the "Too many open files" situation _and_ indicate that this is something other 
than external memory pressure or open file limits on the process.
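
Something along these lines gives a quick count (assuming lsof is installed and 
munged is found by pidof):

$ lsof -p $(pidof munged) | wc -l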




> On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users 
>  wrote:
> 
> We have some new AMD EPYC compute nodes with 96 cores/node running RockyLinux 
> 8.9.  We've had a number of incidents where the Munge log-file 
> /var/log/munge/munged.log suddenly fills up the root file system, after a 
> while to 100% (tens of GBs), and the node eventually comes to a grinding 
> halt!  Wiping munged.log and restarting the node works around the issue.
> 
> I've tried to track down the symptoms and this is what I found:
> 
> 1. In munged.log there are infinitely many lines filling up the disk:
> 
>   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while 
> processing backlog
> 
> 2. The slurmd is not getting any responses from munged, even though we run
>   "munged --num-threads 10".  The slurmd.log displays errors like:
> 
>   [2024-04-12T02:05:45.001] error: If munged is up, restart with 
> --num-threads=10
>   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to 
> "/var/run/munge/munge.socket.2": Resource temporarily unavailable
>   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: 
> RESPONSE_ACCT_GATHER_UPDATE has authentication error
> 
> 3. The /var/log/messages displays the errors from slurmd as well as
>   NetworkManager saying "Too many open files in system".
>   The telltale syslog entry seems to be:
> 
>   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
> 
>   where the limit is confirmed in /proc/sys/fs/file-max.
> 
> We have never before seen any such errors from Munge.  The error may perhaps 
> be triggered by certain user codes (possibly star-ccm+) that might be opening 
> a lot more files on the 96-core nodes than on nodes with a lower core count.
> 
> My workaround has been to edit the line in /etc/sysctl.conf:
> 
> fs.file-max = 131072
> 
> and update settings by "sysctl -p".  We haven't seen any of the Munge errors 
> since!
> 
> The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer 
> version in https://github.com/dun/munge/releases/tag/munge-0.5.16
> I can't figure out if 0.5.16 has a fix for the issue seen here?
> 
> Questions: Have other sites seen the present Munge issue as well?  Are there 
> any good recommendations for setting the fs.file-max parameter on Slurm 
> compute nodes?
> 
> Thanks for sharing your insights,
> Ole
> 
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
> 
> -- 
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Restricting local disk storage of jobs

2024-02-07 Thread Jeffrey T Frey via slurm-users
The native job_container/tmpfs would certainly have access to the job record, 
so modification to it (or a forked variant) would be possible.  A SPANK plugin 
should be able to fetch the full job record [1] and is then able to inspect the 
"gres" list (as a C string), which means I could modify UD's auto_tmpdir 
accordingly.  Having a compiled plugin executing xfs_quota to effect the 
commands illustrated wouldn't be a great idea -- luckily Linux XFS has an API.  
Seemingly not the simplest one, but xfsprogs is a working example.




[1] https://gitlab.hpc.cineca.it/dcesari1/slurm-msrsafe



> On Feb 7, 2024, at 05:25, Tim Schneider via slurm-users 
>  wrote:
> 
> Hey Jeffrey,
> thanks for this suggestion! This is probably the way to go if one can find a 
> way to access GRES in the prolog. I read somewhere that people were calling 
> scontrol to get this information, but this seems a bit unclean. Anyway, if I 
> find some time I will try it out.
> Best,
> Tim
> On 2/6/24 16:30, Jeffrey T Frey wrote:
>> Most of my ideas have revolved around creating file systems on-the-fly as 
>> part of the job prolog and destroying them in the epilog.  The issue with 
>> that mechanism is that formatting a file system (e.g. mkfs.<fstype>) can be 
>> time-consuming.  E.g. formatting your local scratch SSD as an LVM PV+VG and 
>> allocating per-job volumes, you'd still need to run e.g. mkfs.xfs and 
>> mount the new file system. 
>> 
>> 
>> ZFS file system creation is much quicker (basically combines the LVM + mkfs 
>> steps above) but I don't know of any clusters using ZFS to manage local file 
>> systems on the compute nodes :-)
>> 
>> 
>> One could leverage XFS project quotas.  E.g. for Slurm job 2147483647:
>> 
>> 
>> [root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
>> [root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 
>> 2147483647' /tmp-alloc
>> Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
>> Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with 
>> recursion depth infinite (-1).
>> [root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
>> [root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
>> [root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
>> dd: error writing ‘zeroes’: No space left on device
>> 205+0 records in
>> 204+0 records out
>> 1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s
>> 
>>:
>> 
>> [root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
>> [root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc
>> 
>> 
>> Since Slurm jobids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF) we 
>> have an easy on-demand project id to use on the file system.  Slurm tmpfs 
>> plugins have to do a mkdir to create the per-job directory, adding two 
>> xfs_quota commands (which run in more or less O(1) time) won't extend the 
>> prolog by much. Likewise, Slurm tmpfs plugins have to scrub the directory at 
>> job cleanup, so adding another xfs_quota command will not do much to change 
>> their epilog execution times.  The main question is "where does the tmpfs 
>> plugin find the quota limit for the job?"
>> 
>> 
>> 
>> 
>> 
>>> On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users 
>>>  wrote:
>>> 
>>> Hi,
>>> 
>>> In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure 
>>> that each user can use /tmp and it gets cleaned up after them. Currently, 
>>> we are mapping /tmp into the nodes RAM, which means that the cgroups make 
>>> sure that users can only use a certain amount of storage inside /tmp.
>>> 
>>> Now we would like to use of the node's local SSD instead of its RAM to hold 
>>> the files in /tmp. I have seen people define local storage as GRES, but I 
>>> am wondering how to make sure that users do not exceed the storage space 
>>> they requested in a job. Does anyone have an idea how to configure local 
>>> storage as a proper tracked resource?
>>> 
>>> Thanks a lot in advance!
>>> 
>>> Best,
>>> 
>>> Tim
>>> 
>>> 
>>> -- 
>>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>> 
> 
> -- 
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Restricting local disk storage of jobs

2024-02-06 Thread Jeffrey T Frey via slurm-users
Most of my ideas have revolved around creating file systems on-the-fly as part 
of the job prolog and destroying them in the epilog.  The issue with that 
mechanism is that formatting a file system (e.g. mkfs.<fstype>) can be 
time-consuming.  E.g. formatting your local scratch SSD as an LVM PV+VG and 
allocating per-job volumes, you'd still need to run e.g. mkfs.xfs and mount 
the new file system.


ZFS file system creation is much quicker (basically combines the LVM + mkfs 
steps above) but I don't know of any clusters using ZFS to manage local file 
systems on the compute nodes :-)


One could leverage XFS project quotas.  E.g. for Slurm job 2147483647:


[root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 
2147483647' /tmp-alloc
Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with 
recursion depth infinite (-1).
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
[root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
[root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
dd: error writing ‘zeroes’: No space left on device
205+0 records in
204+0 records out
1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s

   :

[root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc


Since Slurm jobids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF) we have 
an easy on-demand project id to use on the file system.  Slurm tmpfs plugins 
have to do a mkdir to create the per-job directory, adding two xfs_quota 
commands (which run in more or less O(1) time) won't extend the prolog by much. 
Likewise, Slurm tmpfs plugins have to scrub the directory at job cleanup, so 
adding another xfs_quota command will not do much to change their epilog 
execution times.  The main question is "where does the tmpfs plugin find the 
quota limit for the job?"
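
For anyone who wants to prototype the idea before touching a tmpfs plugin, the 
same sequence can be exercised from a plain Prolog/Epilog pair; a rough sketch 
only, assuming /tmp-alloc is the XFS file system, the job id doubles as the 
project id, and the per-job limit arrives via some site-specific mechanism (that 
last part being the open question above):

#!/bin/bash
# prolog sketch:  create the per-job directory and attach an XFS project quota
JOBDIR="/tmp-alloc/slurm-${SLURM_JOB_ID}"
LIMIT="${TMP_QUOTA:-1g}"    # placeholder -- where the limit comes from is site-specific
mkdir -p "$JOBDIR"
xfs_quota -x -c "project -s -p $JOBDIR $SLURM_JOB_ID" /tmp-alloc
xfs_quota -x -c "limit -p bhard=$LIMIT $SLURM_JOB_ID" /tmp-alloc

#!/bin/bash
# epilog sketch:  drop the quota and scrub the per-job directory
xfs_quota -x -c "limit -p bhard=0 $SLURM_JOB_ID" /tmp-alloc
rm -rf "/tmp-alloc/slurm-${SLURM_JOB_ID}"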





> On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users 
>  wrote:
> 
> Hi,
> 
> In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure 
> that each user can use /tmp and it gets cleaned up after them. Currently, we 
> are mapping /tmp into the nodes RAM, which means that the cgroups make sure 
> that users can only use a certain amount of storage inside /tmp.
> 
> Now we would like to use of the node's local SSD instead of its RAM to hold 
> the files in /tmp. I have seen people define local storage as GRES, but I am 
> wondering how to make sure that users do not exceed the storage space they 
> requested in a job. Does anyone have an idea how to configure local storage 
> as a proper tracked resource?
> 
> Thanks a lot in advance!
> 
> Best,
> 
> Tim
> 
> 
> -- 
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] Fairshare: Penalising unused memory rather than used memory?

2023-10-11 Thread Jeffrey T Frey
> On the automation part, it would be pretty easy to do regularly(daily?) stats 
> of jobs for that period of time and dump them into an sql database. 
> Then a select statement where cpu_efficiency is less than desired value and 
> get the list of not so nice users on which you can apply whatever 
> warnings/limits you want to do.

Assuming the existence of collection and collation of the penalty data (e.g. in 
an SQL database) you could consider using a site factor plugin to deprioritize 
scheduling priority w/o the need to alter an association in the accounting 
database.


https://slurm.schedmd.com/site_factor.html





Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Jeffrey T Frey
In case you're developing the plugin in C and not LUA, behind the scenes the 
LUA mechanism is concatenating all log_user() strings into a single variable 
(user_msg).  When the LUA code completes, the C code sets the *err_msg argument 
to the job_submit()/job_modify() function to that string, then NULLs-out 
user_msg.  (There's a mutex around all of that code so slurmctld never executes 
LUA job submit/modify scripts concurrently.)  The slurmctld then communicates 
that returned string back to sbatch/salloc/srun for display to the user.


Your C plugin would do likewise — set *err_msg before returning from 
job_submit()/job_modify() — and needn't be mutex'ed if the code is reentrant.






> On Jul 19, 2023, at 08:37, Angel de Vicente  wrote:
> 
> Hello Lorenzo,
> 
> Lorenzo Bosio  writes:
> 
>> I'm developing a job submit plugin to check if some conditions are met 
>> before a job runs.
>> I'd need a way to notify the user about the plugin actions (i.e. why its 
>> jobs was killed and what to do), but after a lot of research I could only 
>> write to logs and not the user shell.
>> The user gets the output of slurm_kill_job but I can't find a way to add a 
>> custom note.
>> 
>> Can anyone point me to the right api/function in the code?
> 
> In our "job_submit.lua" script we have the following for that purpose:
> 
> ,
> |   slurm.log_user("%s: WARNING: [...]", log_prefix)
> `
> 
> -- 
> Ángel de Vicente
> Research Software Engineer (Supercomputing and BigData)
> Tel.: +34 922-605-747
> Web.: http://research.iac.es/proyecto/polmag/
> 
> GPG: 0x8BDC390B69033F52




Re: [slurm-users] How do I set SBATCH_EXCLUSIVE to its default value?

2023-05-19 Thread Jeffrey T Frey
> I get that these correspond
> 
> --exclusive=user    export SBATCH_EXCLUSIVE=user
> --exclusive=mcs     export SBATCH_EXCLUSIVE=mcs
> But --exclusive has a default behavior if I don't assign it a value. What do 
> I set SBATCH_EXCLUSIVE to, to get the same default behavior?


Try setting the env var to an empty string:


export SBATCH_EXCLUSIVE=""


Re: [slurm-users] slurm and singularity

2023-02-08 Thread Jeffrey T Frey
You may need srun to allocate a pty for the command.  The 
InteractiveStepOptions we use (that are handed to srun when no explicit command 
is given to salloc) are:


--interactive --pty --export=TERM
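
For reference, that is configured cluster-wide in slurm.conf; a hedged sketch 
(option names per recent Slurm releases -- check the slurm.conf man page for 
your version):

# slurm.conf
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --pty --export=TERM"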


E.g. without those flags a bare srun gives a promptless session:


[(it_nss:frey)@login00.darwin ~]$ salloc -p idle srun 
/opt/shared/singularity/3.10.0/bin/singularity shell 
/opt/shared/singularity/prebuilt/postgresql/13.2.simg
salloc: Granted job allocation 3953722
salloc: Waiting for resource configuration
salloc: Nodes r1n00 are ready for job
ls -l
total 437343
-rw-r--r--  1 frey it_nss  180419 Oct 26 16:56 amd.cache
-rw-r--r--  1 frey it_nss  72 Oct 26 16:52 amd.conf
-rw-r--r--  1 frey everyone   715 Nov 12 23:39 anaconda-activate.sh
drwxr-xr-x  2 frey everyone 4 Apr 11  2022 bin
   :


With the --pty flag added:


[(it_nss:frey)@login00.darwin ~]$ salloc -p idle srun --pty 
/opt/shared/singularity/3.10.0/bin/singularity shell 
/opt/shared/singularity/prebuilt/postgresql/13.2.simg
salloc: Granted job allocation 3953723
salloc: Waiting for resource configuration
salloc: Nodes r1n00 are ready for job
Singularity>



> On Feb 8, 2023, at 09:47 , Groner, Rob  wrote:
> 
> I tried that, and it says the nodes have been allocated, but it never comes 
> to an apptainer prompt.
> 
> I then tried doing them in separate steps.  Doing salloc works, I get a 
> prompt on the node that was allocated.  I can then run "singularity shell 
> " and get the apptainer prompt.  If I prefix that command with "srun", 
> then it just hangs and I never get the prompt.  So that seems to be the 
> sticking point.  I'll have to do some experiments running singularity with 
> srun.
> 
> From: slurm-users  on behalf of 
> Jeffrey T Frey 
> Sent: Tuesday, February 7, 2023 6:16 PM
> To: Slurm User Community List 
> Subject: Re: [slurm-users] slurm and singularity
>  
>> The remaining issue then is how to put them into an allocation that is 
>> actually running a singularity container.  I don't get how what I'm doing 
>> now is resulting in an allocation where I'm in a container on the submit 
>> node still!
> 
> Try prefixing the singularity command with "srun" e.g.
> 
> 
> salloc  srun  /usr/bin/singularity shell 
> 



Re: [slurm-users] slurm and singularity

2023-02-07 Thread Jeffrey T Frey
> The remaining issue then is how to put them into an allocation that is 
> actually running a singularity container.  I don't get how what I'm doing now 
> is resulting in an allocation where I'm in a container on the submit node 
> still!

Try prefixing the singularity command with "srun" e.g.


salloc  srun  /usr/bin/singularity shell 


Re: [slurm-users] Why every job will sleep 100000000

2022-11-04 Thread Jeffrey T Frey
If you examine the process hierarchy, that "sleep 100000000" process is 
probably the child of a "slurmstepd: [<jobid>.extern]" process.  This is a 
housekeeping step launched for the job by slurmd -- in older Slurm releases it 
would handle the X11 forwarding, for example.  It should have no impact on the 
other steps of the job.
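
One quick way to confirm the parentage (a sketch; the second command takes 
whatever PPID the first one reports):

$ ps -o pid,ppid,cmd -C sleep     # note the PPID of the long sleep
$ ps -o pid,cmd -p <that PPID>    # should show something like "slurmstepd: [<jobid>.extern]"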




> On Nov 4, 2022, at 05:26 , GHui  wrote:
> 
> I found a sleep process running by root, when I submit a job. And it sleeps 
> 100000000 seconds.
> Sometimes, my job is hung up. The job state is "R". Though it runs nothing, 
> the jobscript like the following,
> --
> #!/bin/bash
> #SBATCH -J sub
> #SBATCH -N 1
> #SBATCH -n 1
> #SBATCH -p vpartition
> 
> --
> 
> Is it because of "sleep 100000000" process? Or how could I debug it?
> 
> Any help will be appreciated.
> --GHui




Re: [slurm-users] sacct output in tabular form

2021-08-25 Thread Jeffrey T Frey
You've confirmed my suspicion — no one seems to care for Slurm's standard 
output formats :-)  At UD we did a Python curses wrapper around the parseable 
output to turn the terminal window into a navigable spreadsheet of output:


https://gitlab.com/udel-itrci/slurm-output-wrappers
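
If you just want something quick without installing anything, the parseable 
output piped through column gets you most of the way there; a sketch, assuming 
util-linux's column is available and <jobid> is whatever job you're after:

$ sacct -j <jobid> --long -P | column -t -s'|'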




> On Aug 25, 2021, at 01:41 , Sternberger, Sven  
> wrote:
> 
> Hello!
> 
> this is a simple wrapper for sacct which prints the
> output from sacct as table. So you can make a 
> "sacctml -j foo --long" even without two 8k displays ;-)
> 
> cheers




Re: [slurm-users] Bug: incorrect output directory fails silently

2021-07-08 Thread Jeffrey T Frey
> I understand that there is no output file to write an error message to, but 
> it might be good to check the `--output` path during the scheduling, just 
> like `--account` is checked.
> 
> Does anybody know a workaround to be warned about the error?

I would make a feature request of SchedMD to fix the issue, then I would write 
a cli_filter plugin to validate the --output/--error/--input paths as desired 
until Slurm itself handles it.




Re: [slurm-users] squeue: compact pending job-array in one partition, but not in other

2021-02-23 Thread Jeffrey T Frey
Did those four jobs


   6577272_21 scavenger PD   0:00  1 (Priority)
   6577272_22 scavenger PD   0:00  1 (Priority)
   6577272_23 scavenger PD   0:00  1 (Priority)
   6577272_28 scavenger PD   0:00  1 (Priority)


run before and get requeued?  Seems likely with a partition named "scavenger."





> On Feb 23, 2021, at 13:59 , Loris Bennett  wrote:
> 
> Hi,
> 
> Does anyone have an idea why pending elements of an array job in one
> partition should be displayed compactly by 'squeue' but those of another
> in a different partition are displayed one element per line?  Please see below
> (compact display in 'main', one element per line in 'scavenger').
> This is with version 20.02.6
> 
> Cheers,
> 
> Loris
> 
> JOBID PARTITION ST   TIME  NODES NODELIST(REASON)
>   ...
>   6755576  main PD   0:00  1 (Priority)
> 6749327_[754-1000]  main PD   0:00  1 (Priority)
>   6748246  main PD   0:00  1 (Priority)
>   6749213  main PD   0:00  1 (Priority)
>   6749309  main PD   0:00  1 (Priority)
>   6750124  main PD   0:00  1 (Priority)
>   6752967  main PD   0:00  1 (Priority)
>   6746767  main PD   0:00  1 (Priority)
>   6755188  main PD   0:00  1 (Priority)
>  6702557_[13]  main PD   0:00  4 (Priority)
>6702858_[1-10]  main PD   0:00  4 (Priority)
> 6703700_[1-4,6-8]  main PD   0:00  4 (Priority)
>   6703764_[1]  main PD   0:00  4 (Priority)
> 6705324_[1,3,5,9]  main PD   0:00  4 (Priority)
>   6748962  main PD   0:00  4 (Priority)
>   6709963  main PD   0:00  1 (Priority)
>   6709964  main PD   0:00  1 (Priority)
>   6709976  main PD   0:00  1 (Priority)
> 6462709_[1-77,79-8  main PD   0:00  1 (QOSMaxCpuPerUserLimit)
> 6463366_[1-2,28-72  main PD   0:00  1 (QOSMaxCpuPerUserLimit)
>6577272_21 scavenger PD   0:00  1 (Priority)
>6577272_22 scavenger PD   0:00  1 (Priority)
>6577272_23 scavenger PD   0:00  1 (Priority)
>6577272_28 scavenger PD   0:00  1 (Priority)
> -- 
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
> 




Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Jeffrey T Frey
> ...I would say having SLURM rpms in EPEL could be very helpful for a lot of 
> people.
> 
> I get that this took you by surprise, but that's not a reason to not have 
> them in the repository. I, for one, will happily test if they work for me, 
> and if they do, that means that I can stop having to build them. I agree it's 
> not hard to do, but if I don't have to do it I'll be very happy about that.

There have been plenty of arguments for why having them in EPEL isn't 
necessarily the best option.  Many open source products (e.g. Postgres, Docker) 
maintain their own YUM repository online -- probably to exercise greater 
control over what's published, but also to avoid overlap with mainstream 
package repositories.  If there is value perceived in having pre-built packages 
available, then perhaps the best solution for all parties is to publish the 
packages to a unique repository:  those who want the pre-built packages 
explicitly configure their YUM to pull from that repository, those who have 
EPEL configured (which is a LOT of us) don't get overlapping Slurm packages 
interfering with their local builds.


::::::::::
 Jeffrey T. Frey, Ph.D.
 Systems Programmer V & Cluster Management
 IT Research Cyberinfrastructure
  & College of Engineering
 University of Delaware, Newark DE  19716
 Office: (302) 831-6034  Mobile: (302) 419-4976
::




[slurm-users] Constraint multiple counts not working

2020-12-16 Thread Jeffrey T Frey
On a cluster running Slurm 17.11.8 (cons_res) I can submit a job that requests 
e.g. 2 nodes with unique features on each:


$ sbatch --nodes=2 --ntasks-per-node=1 --constraint="[256GB*1&192GB*1]" …


The job is submitted and runs as expected:  on 1 node with feature "256GB" and 
1 node with feature "192GB."  A similar job on a cluster running 20.11.1 
(cons_res OR cons_tres, tested with both) fails to submit:


sbatch: error: Batch job submission failed: Requested node configuration is not 
available


I enabled debug5 output with NodeFeatures:


[2020-12-16T08:53:19.024] debug:  JobId=118 feature list: [512GB*1&768GB*1]
[2020-12-16T08:53:19.025] NODE_FEATURES: _log_feature_nodes: FEAT:512GB COUNT:1 
PAREN:0 OP:XAND ACTIVE:r1n[00-47] AVAIL:r1n[00-47]
[2020-12-16T08:53:19.025] NODE_FEATURES: _log_feature_nodes: FEAT:768GB COUNT:1 
PAREN:0 OP:END ACTIVE:r2l[00-31] AVAIL:r2l[00-31]
[2020-12-16T08:53:19.025] NODE_FEATURES: valid_feature_counts: feature:512GB 
feature_bitmap:r1n[00-47],r2l[00-31],r2x[00-10] 
work_bitmap:r1n[00-47],r2l[00-31],r2x[00-10] tmp_bitmap:r1n[00-47] count:1
[2020-12-16T08:53:19.025] NODE_FEATURES: valid_feature_counts: feature:768GB 
feature_bitmap:r1n[00-47],r2l[00-31],r2x[00-10] 
work_bitmap:r1n[00-47],r2l[00-31],r2x[00-10] tmp_bitmap:r2l[00-31] count:1
[2020-12-16T08:53:19.025] NODE_FEATURES: valid_feature_counts: 
NODES:r1n[00-47],r2l[00-31],r2x[00-10] HAS_XOR:T status:No error
[2020-12-16T08:53:19.025] select/cons_tres: _job_test: SELECT_TYPE: test 0 
pass: test_only
[2020-12-16T08:53:19.026] debug2: job_allocate: setting JobId=118_* to 
"BadConstraints" due to a flaw in the job request (Requested node configuration 
is not available)
[2020-12-16T08:53:19.026] _slurm_rpc_submit_batch_job: Requested node 
configuration is not available


My syntax agrees with the 20.11.1 documentation (online and man pages) so it 
seems correct — and it works fine in 17.11.8.  Any ideas?



::::::::::
 Jeffrey T. Frey, Ph.D.
 Systems Programmer V / Cluster Management
 Network & Systems Services / College of Engineering
 University of Delaware, Newark DE  19716
 Office: (302) 831-6034  Mobile: (302) 419-4976
::



Re: [slurm-users] Slurm versions 20.11.1 is now available

2020-12-11 Thread Jeffrey T Frey
It's in the github commits:


https://github.com/SchedMD/slurm/commit/8e84db0f01ecd4c977c12581615d74d59b3ff995


The primary issue is that any state the client program established on the 
connection after first making it (e.g. opening a transaction, creating temp 
tables) won't be present if MySQL automatically reconnects to the server.  So 
the reconnected state won't match the state expected by the client.  Better for 
the client to know the connection failed and reconnect on its own to 
reestablish state.



> On Dec 11, 2020, at 10:31 , Malte Thoma  wrote:
> 
> 
> 
> Am 11.12.20 um 14:11 schrieb Michael Di Domenico:
>>>  -- Disable MySQL automatic reconnection.
>> can you expand on this?  seems an 'odd' thing to disable.
> 
> same thoughts here :-)
> 
> 
> 
> 
> 
> 
> 
> 
>> On Thu, Dec 10, 2020 at 4:44 PM Tim Wickberg  wrote:
>>> 
>>> We are pleased to announce the availability of Slurm version 20.11.1.
>>> 
>>> This includes a number of fixes made in the month since 20.11 was
>>> initially released, including critical fixes to nss_slurm and the Perl
>>> API when used with the newer configless mode of operation.
>>> 
>>> Slurm can be downloaded from https://www.schedmd.com/downloads.php .
>>> 
>>> - Tim
>>> 
>>> --
>>> Tim Wickberg
>>> Chief Technology Officer, SchedMD LLC
>>> Commercial Slurm Development and Support
>>> 
 * Changes in Slurm 20.11.1
 ==
  -- Fix spelling of "overcomited" to "overcomitted" in sreport's cluster
 utilization report.
  -- Silence debug message about shutting down backup controllers if none 
 are
 configured.
  -- Don't create interactive srun until PrologSlurmctld is done.
  -- Fix fd symlink path resolution.
  -- Fix slurmctld segfault on subnode reservation restore after node
 configuration change.
  -- Fix resource allocation response message environment allocation size.
  -- Ensure that details->env_sup is NULL terminated.
  -- select/cray_aries - Correctly remove jobs/steps from blades using NPC.
  -- cons_tres - Avoid max_node_gres when entire node is allocated with
 --ntasks-per-gpu.
  -- Allow NULL arg to data_get_type().
  -- In sreport have usage for a reservation contain all jobs that ran in 
 the
 reservation instead of just the ones that ran in the time specified. 
 This
 matches the report for the reservation is not truncated for a time 
 period.
  -- Fix issue with sending wrong batch step id to a < 20.11 slurmd.
  -- Add a job's alloc_node to lua for job modification and completion.
  -- Fix regression getting a slurmdbd connection through the perl API.
  -- Stop the extern step terminate monitor right after proctrack_g_wait().
  -- Fix removing the normalized priority of assocs.
  -- slurmrestd/v0.0.36 - Use correct name for partition field:
 "min nodes per job" -> "min_nodes_per_job".
  -- slurmrestd/v0.0.36 - Add node comment field.
  -- Fix regression marking cloud nodes as "unexpectedly rebooted" after
 multiple boots.
  -- Fix slurmctld segfault in _slurm_rpc_job_step_create().
  -- slurmrestd/v0.0.36 - Filter node states against NODE_STATE_BASE to 
 avoid
 the extended states all being reported as "invalid".
  -- Fix race that can prevent the prolog for a requeued job from running.
  -- cli_filter - add "type" to readily distinguish between the CLI command 
 in
 use.
  -- smail - reduce sleep before seff to 5 seconds.
  -- Ensure SPANK prolog and epilog run without an explicit PlugStackConfig.
  -- Disable MySQL automatic reconnection.
  -- Fix allowing "b" after memory unit suffixes.
  -- Fix slurmctld segfault with reservations without licenses.
  -- Due to internal restructuring ahead of the 20.11 release, applications
 calling libslurm MUST call slurm_init(NULL) before any API calls.
 Otherwise the API call is likely to fail due to libslurm's internal
 configuration not being available.
  -- slurm.spec - allow custom paths for PMIx and UCX install locations.
  -- Use rpath if enabled when testing for Mellanox's UCX libraries.
  -- slurmrestd/dbv0.0.36 - Change user query for associations to optional.
  -- slurmrestd/dbv0.0.36 - Change account query for associations to 
 optional.
  -- mpi/pmix - change the error handler error message to be more useful.
  -- Add missing connection in acct_storage_p_{clear_stats, reconfig, 
 shutdown}.
  -- Perl API - fix issue when running in configless mode.
  -- nss_slurm - avoid deadlock when stray sockets are found.
  -- Display correct value for ScronParameters in 'scontrol show config'.
>>> 
> 
> -- 
> Malte ThomaTel. +49-471-4831-1828
> HSM Documentation: https://spaces.awi.de/x/YF3-Eg (User)
>   https://spaces.awi.de/x/oYD8B  (Admin)
> HPC Documentation: 

Re: [slurm-users] Heterogeneous GPU Node MPS

2020-11-13 Thread Jeffrey T Frey
From the NVIDIA docs re: MPS:


On systems with a mix of Volta / pre-Volta GPUs, if the MPS server is set to 
enumerate any Volta GPU, it will discard all pre-Volta GPUs. In other words, 
the MPS server will either operate only on the Volta GPUs and expose Volta 
capabilities, or operate only on pre-Volta GPUs.


I'd be curious what happens if you change the ordering (RTX then V100) in the 
gres.conf -- would the RTX work with MPS and the V100 would not?
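
To visualize the suggestion, the reordering would look something like this in 
gres.conf (purely a sketch based on the config you posted; I haven't verified 
that MPS device selection follows gres.conf ordering):

Name=gpu Type=rtx   File=/dev/nvidia1
Name=gpu Type=rtx   File=/dev/nvidia2
Name=gpu Type=v100  File=/dev/nvidia0
Name=mps Count=200  File=/dev/nvidia1
Name=mps Count=200  File=/dev/nvidia2
Name=mps Count=200  File=/dev/nvidia0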


> On Nov 13, 2020, at 07:23 , Holger Badorreck  wrote:
> 
> Hello,
>  
> I have a heterogeneous GPU Node with one V100 and two RTX cards. When I 
> request resources with --gres=mps:100, always the V100 is chosen, and jobs 
> are waiting if the V100 is completely allocated, while RTX cards are free. If 
> I use --gres=gpu:1, also the RTX cards are used. Is something wrong with the 
> configuration or is it another problem?
>  
> The node configuration  in slurm.conf:
> NodeName=node1 CPUs=48 RealMemory=128530 Sockets=1 CoresPerSocket=24 
> ThreadsPerCore=2 Gres=gpu:v100:1,gpu:rtx:2,mps:600 State=UNKNOWN
>  
> gres.conf:
> Name=gpu Type=v100  File=/dev/nvidia0
> Name=gpu Type=rtx  File=/dev/nvidia1
> Name=gpu Type=rtx  File=/dev/nvidia2
> Name=mps Count=200  File=/dev/nvidia0
> Name=mps Count=200  File=/dev/nvidia1
> Name=mps Count=200  File=/dev/nvidia2
>  
> Best regards,
> Holger



Re: [slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Jeffrey T Frey
Making the certificate globally-available on the host may not always be 
permissible.  If I were you, I'd write/suggest a modification to the plugin to 
make the CA path (CURLOPT_CAPATH) and verification itself 
(CURLOPT_SSL_VERIFYPEER) configurable in Slurm.  They are both straightforward 
options in the CURL API (a char* and an int, respectively) that could be set 
directly from parsed Slurm config options.  Many other SSL CURL options would 
be just as easy (revocation path, etc.).



> On Aug 14, 2020, at 08:55 , Stefan Staeglich 
>  wrote:
> 
> Hi,
> 
> all except of /etc/ssl/certs/ca-certificates.crt is ignored. So I've copied 
> it 
> to /usr/local/share/ca-certificates/ and run update-ca-certificates.
> 
> Now it's working :)
> 
> Best,
> Stefan
> 
> Am Freitag, 14. August 2020, 11:42:04 CEST schrieb Stefan Staeglich:
>> Hi,
>> 
>> I try to setup the acct_gather plugin ProfileInfluxDB. Unfortunately our
>> influxdb server has a self-signed certificate only:
>> [2020-08-14T09:54:30.007] [46.0] error: acct_gather_profile/influxdb
>> _send_data: curl_easy_perform failed to send data (discarded). Reason: SSL
>> peer certificate or SSH remote key was not OK
>> 
>> I've copied the certificate to /etc/ssl/certs/ but this doesn't help. But
>> his command is working:
>> curl 'https://influxdb-server.privat:8086' --cacert /etc/ssl/certs/
>> influxdb.crt
>> 
>> Has someone a solution for this issue?
>> 
>> Best,
>> Stefan
> 
> 
> -- 
> Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
> Georges-Köhler-Allee,  Geb.74,   79110 Freiburg,Germany
> 
> E-Mail : staeg...@informatik.uni-freiburg.de
> WWW: ml.informatik.uni-freiburg.de
> Telefon: +49 761 203-54216
> Fax: +49 761 203-74217
> 
> 
> 
> 




Re: [slurm-users] slurm array with non-numeric index values

2020-07-15 Thread Jeffrey T Frey
On our HPC systems we have a lot of users attempting to organize job arrays for 
varying purposes -- parameter scans, SSMD (Single-Script, Multiple Datasets).  
We eventually wrote an abstract utility to try to help them with the process:


https://github.com/jtfrey/job-templating-tool


May be of some use to you.
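
If you'd rather not pull in a tool, the usual low-tech approach is to keep the 
array index numeric and map it onto your string parameters inside the batch 
script; a minimal sketch:

#!/bin/bash
#SBATCH --array=0-2
# map the numeric task id onto arbitrary string parameters
PARAMS=(foo bar baz)
script.py "${PARAMS[$SLURM_ARRAY_TASK_ID]}"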




> On Jul 15, 2020, at 16:13 , c b  wrote:
> 
> I'm trying to run an embarrassingly parallel experiment, with 500+ tasks that 
> all differ in one parameter.  e.g.:
> 
> job 1 - script.py foo
> job 2 - script.py bar
> job 3 - script.py baz
> and so on.
> 
> This seems like a case where having a slurm array hold all of these jobs 
> would help, so I could just submit one job to my cluster instead of 500 
> individual jobs.  It seems like sarray is only set up for varying an integer 
> index parameter.  How would i do this for non-numeric values (say, if the 
> parameter I'm varying is a string in a given list) ?
> 
> 



Re: [slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread Jeffrey T Frey
If you check the source up on Github, that's more of a warning produced when 
you didn't specify a CPU count and it's going to calculate from the 
socket-core-thread numbers (src/common/read_config.c):



/* Node boards are factored into sockets */
if ((n->cpus != n->sockets) &&
    (n->cpus != n->sockets * n->cores) &&
    (n->cpus != n->sockets * n->cores * n->threads)) {
        error("NodeNames=%s CPUs=%d match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.",
              n->nodenames, n->cpus);
        n->cpus = n->sockets * n->cores * n->threads;
}


This behavior is present beginning with the 18.x releases; in 17.x and earlier 
the n->cpus value was inferred quietly.


> On Jun 16, 2020, at 04:12 , Ole Holm Nielsen  
> wrote:
> 
> Today we upgraded the controller node from 19.05 to 20.02.3, and immediately 
> all Slurm commands (on the controller node) give error messages for all 
> partitions:
> 
> # sinfo --version
> sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets, 
> Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting 
> CPUs.
> (lines deleted)
> slurm 20.02.3
> 
> In slurm.conf we have defined NodeName like:
> 
> NodeName=a[001-140] Weight=10001 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 
> ThreadsPerCore=1 ...
> 
> According to the slurm.conf manual the CPUs should then be calculated 
> automatically:
> 
> "If CPUs is omitted, its default will be set equal to the product of Boards, 
> Sockets, CoresPerSocket, and ThreadsPerCore."
> 
> Has anyone else seen this error with Slurm 20.02?
> 
> I wonder if there is a problem with specifying SocketsPerBoard in stead of 
> Sockets?  The slurm.conf manual doesn't seem to prefer one over the other.
> 
> I've opened a bug https://bugs.schedmd.com/show_bug.cgi?id=9241
> 
> Thanks,
> Ole
> 
> 



Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Jeffrey T Frey
Is the time on that node too far out-of-sync w.r.t. the slurmctld server?



> On Jun 11, 2020, at 09:01 , navin srivastava  wrote:
> 
> I tried by executing the debug mode but there also it is not writing anything.
> 
> i waited for about 5-10 minutes
> 
> deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v
> 
> No output on terminal. 
> 
> The OS is SLES12-SP4 . All firewall services are disabled.
> 
> The recent change is the local hostname earlier it was with local hostname 
> node1,node2,etc but we have moved to dns based hostname which is deda
> 
> NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] 
> Sockets=2 CoresPerSocket=10 State=UNKNOWN
> other than this it is fine but after that i have done several time slurmd 
> process started on the node and it works fine but now i am seeing this issue 
> today.
> 
> Regards
> Navin.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy  wrote:
> Navin,
> 
>  
> 
> As you can see, systemd provides very little service-specific information. 
> For slurm, you really need to go to the slurm logs to find out what happened.
> 
>  
> 
> Hint: A quick way to identify problems like this with slurmd and slurmctld is 
> to run them with the “-Dvvv” option, causing them to log to your window, and 
> usually causing the problem to become immediately obvious.
> 
>  
> 
> For example,
> 
>  
> 
> # /usr/local/slurm/sbin/slurmd -D
> 
>  
> 
> Just it ^C when you’re done, if necessary. Of course, if it doesn’t fail when 
> you run it this way, it’s time to look elsewhere.
> 
>  
> 
> Andy
> 
>  
> 
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
> navin srivastava
> Sent: Thursday, June 11, 2020 8:25 AM
> To: Slurm User Community List 
> Subject: [slurm-users] unable to start slurmd process.
> 
>  
> 
> Hi Team,
> 
>  
> 
> when i am trying to start the slurmd process i am getting the below error.
> 
>  
> 
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node 
> daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
> operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
> daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit 
> entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed 
> with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
> session opened for user root by (uid=0)
> 
>  
> 
> Slurm version is 17.11.8
> 
>  
> 
> The server and slurm is running from long time and we have not made any 
> changes but today when i am starting it is giving this error message. 
> 
> Any idea what could be wrong here.
> 
>  
> 
> Regards
> 
> Navin.
> 
>  
> 
>  
> 
>  
> 
>  
> 




Re: [slurm-users] ssh-keys on compute nodes?

2020-06-08 Thread Jeffrey T Frey
An MPI library with tight integration with Slurm (e.g. Intel MPI, Open MPI) can 
use "srun" to start the remote workers.  In some cases "srun" can be used 
directly for MPI startup (e.g. "srun" instead of "mpirun").


Other/older MPI libraries that start remote processes using "ssh" would, 
naturally, require keyless ssh logins to work across all compute nodes in the 
cluster.


When we provision user accounts on our Slurm cluster we still create ~/.ssh, 
generate a ~/.ssh/id_rsa key pair (needed for older X11 tunneling via libssh2), 
and add the public key to ~/.ssh/authorized_keys.  All officially-supported MPIs 
on the cluster are tightly integrated with Slurm, but there are commercial 
products and older software our clients use that are not, so having keyless 
access ready helps those users get their workflows running more quickly.
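
The provisioning itself is nothing fancy; roughly (a sketch, run per-user by 
whatever account-creation tooling you have):

$ ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys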





> On Jun 8, 2020, at 11:16 , Durai Arasan  wrote:
> 
> Hi,
> 
> we are setting up a slurm cluster and are at the stage of adding ssh keys of 
> the users to the nodes.
> 
> I thought it would be sufficient to add the ssh keys of the users to only the 
> designated login nodes. But I heard that it is also necessary to add them to 
> the compute nodes as well for slurm to be able to submit jobs of the users 
> successfully. Apparently this is true especially for MPI jobs.
> 
> So is it true that ssh keys of the users must be added to the 
> ~/.ssh/authorized_keys of *all* nodes and not just the login nodes?
> 
> Thanks,
> Durai
> 




Re: [slurm-users] IPv6 for slurmd and slurmctld

2020-05-01 Thread Jeffrey T Frey
Use netstat to list listening ports on the box (netstat -ln) and see if it 
shows up as tcp6 or tcp.  On our (older) 17.11.8 server:


$ netstat -ln | grep :6817
tcp        0      0 0.0.0.0:6817            0.0.0.0:*               LISTEN

$ nc -6 :: 6817
Ncat: Connection refused.

$ nc -4 localhost 6817
^C





> On May 1, 2020, at 12:44 , William Brown  wrote:
> 
> For some services that display of 0.0.0.0 does include IPv6, although it is 
> counter-intuitive.   Try to see if you can connect to it using the IPv6 
> address.
> 
> William
> 
> On Fri, 1 May 2020 at 16:35, Thomas Schäfer  
> wrote:
> Hi,
> 
> is there an switch, option, environment variable, configurable key word to 
> enable IP6 for the slurmd and slurmctld daemons?
> 
> tcp   LISTEN   0.0.0.0:6818
> 
> isn't a good choice, were everything else (nfs, ssh, ntp, dns) runs over IPv6.
> 
> Regards,
> Thomas
> 
> 
> 
> 
> 
> 



Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?

2020-04-21 Thread Jeffrey T Frey
You could also choose to propagate the signal to the child process of 
test.slurm yourself:


#!/bin/bash
#SBATCH --job-name=test
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --time=00:03:00
#SBATCH --signal=B:SIGINT@30

# This example works, but I need it to work without "B:" in --signal options, 
so I want test.sh receives the SIGINT signal and not test.slurm

sig_handler()
{
 echo "BATCH interrupted"
 if [ -n "$child_pid" ]; then
 kill -INT $child_pid
 fi
}

trap 'sig_handler' SIGINT

/home/user/test.sh &
child_pid=$!
wait $child_pid
# if wait was interrupted by the trapped SIGINT, wait again so the child can
# finish its cleanup and we can reap (and return) its real exit code
wait $child_pid
exit $?


and


#!/bin/bash

function sig_handler()
{
 echo "Executable interrupted"
 exit 2
}

trap 'sig_handler' SIGINT

echo "BEGIN"
sleep 200 &
wait
echo "END"


Having your signal handler in test.slurm call "exit 2" would signal the end of 
the job, so the child processes would be terminated whether they've hit their 
own signal handlers yet or not.  Signaling the child, then returning to the wait 
to reap the child's exit code and finishing with "exit $?", gives the child time 
to do its cleanup and lets it influence the final exit code of the job.




> On Apr 21, 2020, at 06:13 , Bjørn-Helge Mevik  wrote:
> 
> Jean-mathieu CHANTREIN  writes:
> 
>> But that is not enough, it is also necessary to use srun in
>> test.slurm, because the signals are sent to the child processes only
>> if they are also children in the JOB sense.
> 
> Good to know!
> 
> -- 
> Cheers,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo



Re: [slurm-users] Problems calling mpirun in OpenMPI-3.1.6 + slurm and OpenMPI-4.0.3+slurm environments

2020-04-10 Thread Jeffrey T Frey
I just reread your post -- you installed Open MPI 4.0.3 to 
/home/manumachu/openmpi-4.0.3/OPENMPI_INSTALL then set what's probably a 
different directory -- /scratch/manumachu/openmpi-4.0.3/OPENMPI_INSTALL/bin -- 
on your path.  So I bet "which mpirun" won't show you what you're expecting :-)




> On Apr 10, 2020, at 12:59 , Jeffrey T Frey  wrote:
> 
> Are you certain your PATH addition is correct?  The "-np" flag is still 
> present in a build of Open MPI 4.0.3 I just made, in fact:
> 
> 
> $ 4.0.3/bin/mpirun 
> --
> mpirun could not find anything to do.
> 
> It is possible that you forgot to specify how many processes to run
> via the "-np" argument.
> --
> 
> 
> Note that with the Slurm plugins present in your Open MPI build, there should 
> be no need to use "-np" on the command line; the Slurm RAS plugin should pull 
> such information from the Slurm runtime environment variables.  If you do use 
> "-np" to request more CPUs that the job was allocated, you'll receive 
> oversubscription errors (you know, unless you include mpirun flags to allow 
> that to happen).
> 
> 
> What if you add "which mpirun" to your job script ahead of the "mpirun" 
> command -- does it show you 
> /scratch/manumachu/openmpi-4.0.3/OPENMPI_INSTALL/bin/mpirun?
> 
> 
> 
> 
>> On Apr 10, 2020, at 12:12 , Ravi Reddy Manumachu  
>> wrote:
>> 
>> 
>> Dear Slurm Users,
>> 
>> I am facing issues with the following combinations of OpenMPI and SLURM. I 
>> was wondering if you have faced something similar and can help me.
>> 
>> OpenMPI-3.1.6 and slurm 19.05.5
>> OpenMPI-4.0.3 and slurm 19.05.5
>> 
>> I have the OpenMPI packages configured with "--with-slurm" option and 
>> installed. 
>> 
>>   Configure command line: 
>> '--prefix=/home/manumachu/openmpi-4.0.3/OPENMPI_INSTALL' '--with-slurm'
>>  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>> 
>> I am executing the sbatch script shown below:
>> 
>> #!/bin/bash
>> #SBATCH --account=x
>> #SBATCH --job-name=ompi4
>> #SBATCH --output=ompi4.out
>> #SBATCH --error=ompi4.err
>> #SBATCH --ntasks-per-node=1
>> #SBATCH --time=00:30:00
>> export PATH=/scratch/manumachu/openmpi-4.0.3/OPENMPI_INSTALL/bin:$PATH
>> export 
>> LD_LIBRARY_PATH=/scratch/manumachu/openmpi-4.0.3/OPENMPI_INSTALL/lib:$LD_LIBRARY_PATH
>> mpirun -np 4 ./bcast_timing -t 1
>> 
>> No matter what option I give to mpirun, I get the following error:
>> mpirun: Error: unknown option "-np"
>> 
>> I have used mpiexec also but received the same errors.
>> 
>> To summarize, I am not able to call mpirun from a SLURM script. I can use 
>> srun but I have no idea how to pass MCA parameters I usually give to mpirun 
>> such as, "--map-by ppr:1:socket -mca pml ob1 -mca btl tcp,self -mca 
>> coll_tuned_use_dynamic_rules 1".
>> 
>> Thank you for your help.
>> 
>> -- 
>> Kind Regards
>> Dr. Ravi Reddy Manumachu
>> Research Fellow, School of Computer Science, University College Dublin
>> Ravi Manumachu on Google Scholar, ResearchGate
> 




Re: [slurm-users] Problems calling mpirun in OpenMPI-3.1.6 + slurm and OpenMPI-4.0.3+slurm environments

2020-04-10 Thread Jeffrey T Frey
Are you certain your PATH addition is correct?  The "-np" flag is still 
present in a build of Open MPI 4.0.3 I just made, in fact:


$ 4.0.3/bin/mpirun 
--
mpirun could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--


Note that with the Slurm plugins present in your Open MPI build, there should 
be no need to use "-np" on the command line; the Slurm RAS plugin should pull 
such information from the Slurm runtime environment variables.  If you do use 
"-np" to request more CPUs that the job was allocated, you'll receive 
oversubscription errors (you know, unless you include mpirun flags to allow 
that to happen).


What if you add "which mpirun" to your job script ahead of the "mpirun" command 
-- does it show you /scratch/manumachu/openmpi-4.0.3/OPENMPI_INSTALL/bin/mpirun?




> On Apr 10, 2020, at 12:12 , Ravi Reddy Manumachu  
> wrote:
> 
> 
> Dear Slurm Users,
> 
> I am facing issues with the following combinations of OpenMPI and SLURM. I 
> was wondering if you have faced something similar and can help me.
> 
> OpenMPI-3.1.6 and slurm 19.05.5
> OpenMPI-4.0.3 and slurm 19.05.5
> 
> I have the OpenMPI packages configured with "--with-slurm" option and 
> installed. 
> 
>   Configure command line: 
> '--prefix=/home/manumachu/openmpi-4.0.3/OPENMPI_INSTALL' '--with-slurm'
>  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
> 
> I am executing the sbatch script shown below:
> 
> #!/bin/bash
> #SBATCH --account=x
> #SBATCH --job-name=ompi4
> #SBATCH --output=ompi4.out
> #SBATCH --error=ompi4.err
> #SBATCH --ntasks-per-node=1
> #SBATCH --time=00:30:00
> export PATH=/scratch/manumachu/openmpi-4.0.3/OPENMPI_INSTALL/bin:$PATH
> export 
> LD_LIBRARY_PATH=/scratch/manumachu/openmpi-4.0.3/OPENMPI_INSTALL/lib:$LD_LIBRARY_PATH
> mpirun -np 4 ./bcast_timing -t 1
> 
> No matter what option I give to mpirun, I get the following error:
> mpirun: Error: unknown option "-np"
> 
> I have used mpiexec also but received the same errors.
> 
> To summarize, I am not able to call mpirun from a SLURM script. I can use 
> srun but I have no idea how to pass MCA parameters I usually give to mpirun 
> such as, "--map-by ppr:1:socket -mca pml ob1 -mca btl tcp,self -mca 
> coll_tuned_use_dynamic_rules 1".
> 
> Thank you for your help.
> 
> -- 
> Kind Regards
> Dr. Ravi Reddy Manumachu
> Research Fellow, School of Computer Science, University College Dublin
> Ravi Manumachu on Google Scholar, ResearchGate



Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-26 Thread Jeffrey T Frey
Did you reuse the 20.02 select/cons_res/Makefile.{in,am} in your plugin's 
source?  You probably will have to re-model your plugin after the 
select/cray_aries plugin if you need to override those two functions (it also 
defines its own select_p_job_begin() and doesn't link against 
libcons_common.la).  Naturally, omitting libcons_common.a from your plugin 
doesn't help if you use other functions defined in select/common.





> On Feb 26, 2020, at 00:48 , Dean Schulze  wrote:
> 
> There was a major refactoring between the 19.05 and 20.02 code.  Most of the 
> callbacks for select plugins were moved to cons_common.  I have a plugin for 
> 19.05 that depends on two of those callbacks:  select_p_job_begin() and 
> select_p_job_fini().  My plugin is a copy of the select/cons_res plugin, but 
> when I implement those functions in my plugin I get this error because those 
> functions already exist in cons_common:
> 
> /home/dean/src/slurm.versions/slurm-20.02/slurm-20.02.0/src/plugins/select/cons_common/cons_common.c:1134:
>  multiple definition of `select_p_job_begin'; 
> .libs/select_liqid_cons_res.o:/home/dean/src/slurm.versions/slurm-20.02/slurm-20.02.0/src/plugins/select/liqid_cons_res/select_liqid_cons_res.c:559:
>  first defined here
> /usr/bin/ld: ../cons_common/.libs/libcons_common.a(cons_common.o): in 
> function `select_p_job_fini':
> /home/dean/src/slurm.versions/slurm-20.02/slurm-20.02.0/src/plugins/select/cons_common/cons_common.c:1561:
>  multiple definition of `select_p_job_fini'; 
> .libs/select_liqid_cons_res.o:/home/dean/src/slurm.versions/slurm-20.02/slurm-20.02.0/src/plugins/select/liqid_cons_res/select_liqid_cons_res.c:607:
>  first defined here
> collect2: error: ld returned 1 exit status
> 
> Since only one select plugin can be used at a time (determined in slurm.conf) 
> I could put my code in the cons_common implementation of those functions, but 
> if I ever switch plugins then my plugin code will get executed when it 
> shouldn't be.
> 
> How can I "override" those callbacks in my own plugin?  This isn't Java (but 
> it sure looks like the slurm code tries to do Java in C).
> 
> 
> On Tue, Feb 25, 2020 at 11:57 AM Tim Wickberg  wrote:
> After 9 months of development and testing we are pleased to announce the 
> availability of Slurm version 20.02.0!
> 
> Downloads are available from https://www.schedmd.com/downloads.php.
> 
> Highlights of the 20.02 release include:
> 
> - A "configless" method of deploying Slurm within the cluster, in which 
> the slurmd and user commands can use DNS SRV records to locate the 
> slurmctld host and automatically download the relevant configuration files.
> 
> - A new "auth/jwt" authentication mechanism using JWT, which can help 
> integrate untrusted external systems into the cluster.
> 
> - A new "slurmrestd" command/daemon which translates a new Slurm REST 
> API into the underlying libslurm calls.
> 
> - Packaging fixes for RHEL8 distributions.
> 
> - Significant performance improvements to the backfill scheduler, as 
> well as to string construction and processing.
> 
> Thank you to all customers, partners, and community members who 
> contributed to this release.
> 
> As with past releases, the documentation available at 
> https://slurm.schedmd.com has been updated to the 20.02 release. Past 
> versions are available in the archive. This release also marks the end 
> of support for the 18.08 release. The 19.05 release will remain 
> supported up until the 20.11 release in November, but will not see as 
> frequent updates, and bug-fixes will be targeted for the 20.02 
> maintenance releases going forward.
> 
> -- 
> Tim Wickberg
> Chief Technology Officer, SchedMD
> Commercial Slurm Development and Support
> 




Re: [slurm-users] Srun not setting DISPLAY with --x11 for one account

2020-01-27 Thread Jeffrey T Frey
> So the answer then is to either kludge the keys by making symlinks to the 
> cluster and cluster.pub files warewulf makes (I tried this already and I know 
> it works), or to update to the v19.x release and the new style x11 forwarding.

Our answer was to create RSA keys for all users in their ~/.ssh directory if 
they didn't have that pair already.  If Warewulf were to change the key type in 
cluster{,.pub} to one that libssh2 doesn't support you'll have a different 
problem to debug :-)


> Is the update to v19 fairly straightforward?  Is it stable enough at this 
> point to just give it a try?

The way they've implemented X11 forwarding in v19 is to use Slurm's own 
messaging infrastructure (between salloc and slurmd/slurmstepd) to move the 
data rather than using a third-party library (libssh2).  I'm not clear on 
whether or not that data is encrypted in transit (it must be...).  The release 
notes do make it clear that requiring salloc on the client side means X11 
forwarding no longer works with batch jobs, but I never used it in batch jobs 
anyway.


I can't comment on stability of v19 releases...I'm interested in others' input 
on that point myself!





Re: [slurm-users] Srun not setting DISPLAY with --x11 for one account

2020-01-27 Thread Jeffrey T Frey
The Slurm-native X11 plugin demands you use ~/.ssh/id_rsa{,.pub} keys.  It's 
hard-coded into the plugin:


/*
 * Ideally these would be selected at run time. Unfortunately,
 * only ssh-rsa and ssh-dss are supported by libssh2 at this time,
 * and ssh-dss is deprecated.
 */
static char *hostkey_priv = "/etc/ssh/ssh_host_rsa_key";
static char *hostkey_pub = "/etc/ssh/ssh_host_rsa_key.pub";
static char *priv_format = "%s/.ssh/id_rsa";
static char *pub_format = "%s/.ssh/id_rsa.pub";





> On Jan 27, 2020, at 09:34 , Simon Andrews  
> wrote:
> 
> I’ve managed to track down the difference between the accounts which work and 
> those which don’t – but I still don’t understand the mechanism.
>  
> The accounts which work all had their home directories used on an older 
> system.  The ones which fail were only ever used on the new system.  The 
> relevant difference seems to be the way their ssh keys are set up.  On the 
> old system a standard ssh-keygen was run, creating ~/.ssh/id_rsa and 
> ~/.ssh/id_rsa.pub files and putting the pub file into authorized_keys.
>  
> On the new Warewulf-based system ssh-keygen was again run, but the default 
> key file names were changed.  We now have ~/.ssh/cluster and 
> ~/.ssh/cluster.pub and there is a ~/.ssh/config file which contains:
>  
> # Added by Warewulf  2019-12-10
> Host pebble*
>IdentityFile ~/.ssh/cluster
>StrictHostKeyChecking=no
>  
> This all works fine, and I can ssh from the head node to the ‘pebble’ compute 
> nodes just fine, however something in the code for the slurm x11 forwarder is 
> specifically looking for id_rsa files (or is ignoring the config file), since 
> the forwarding fails if I don’t have these, and works as soon as I do.
>  
> Any ideas where this might be happening, so I can either file a bug or change 
> whatever setting this needs?
>  
> Simon.
>  
> From: slurm-users  On Behalf Of 
> William Brown
> Sent: 24 January 2020 17:21
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Srun not setting DISPLAY with --x11 for one account
>  
> There are differences for X11 between Slurm versions so it may help to know 
> which version you have.
>  
> I tried some of your commands on our slurm 19.05.3-2 cluster, and 
> interestingly on the session on the compute node I don't see the cookie for 
> the login node:  This was with MobaXterm:
>  
> [user@prdubrvm005 ~]$ xauth list
> prdubrvm005.research.rcsi.com/unix:10  MIT-MAGIC-COOKIE-1  
> 2efc5dd851736e3848193f65d038eca8
> [user@prdubrvm005 ~]$ srun --pty  --x11  --preserve-env /bin/bash
> [user@prdubrhpc1-02 ~]$ xauth list
> prdubrhpc1-02.research.rcsi.com/unix:95  MIT-MAGIC-COOKIE-1  
> 2efc5dd851736e3848193f65d038eca8
> [user@prdubrhpc1-02 ~]$ echo $DISPLAY
> localhost:95.0
>  
> Any per-user problem would make me suspect the user having a different shell, 
> or something in their login script.  Can you make their .bashrc and 
> .bash_profile just exit?  Or look for hidden configuration files for 
>  in their home directory?
>  
> William
>  
>  
>  
> On Fri, 24 Jan 2020 at 16:05, Simon Andrews  
> wrote:
> I have a weird problem which I can’t get to the bottom of. 
>  
> We have a cluster which allows users to start interactive sessions which 
> forward any X11 sessions they generated on the head node.  This generally 
> works fine, but on the account of one user it doesn’t work.  The X11 
> connection to the head node is fine, but it won’t transfer to the compute 
> node.
>  
> The symptoms are shown below:
>  
> A good user gets this:
>  
> [good@headnode ~]$ xauth list
> headnode.babraham.ac.uk/unix:12  MIT-MAGIC-COOKIE-1  
> f04a2bf9a921a3357e44373655add14a
>  
> [good@headnode ~]$ echo $DISPLAY
> localhost:12.0
>  
> [good@headnode ~]$ srun --pty -p interactive --x11  --preserve-env /bin/bash
>  
> [good@compute ~]$ xauth list
> headnode.babraham.ac.uk/unix:12  MIT-MAGIC-COOKIE-1  
> f04a2bf9a921a3357e44373655add14a
> compute/unix:25  MIT-MAGIC-COOKIE-1  f04a2bf9a921a3357e44373655add14a
>  
> [good@compute ~]$ echo $DISPLAY
> localhost:25.0
>  
> So the cookie is copied from the head node and forwarded and the DISPLAY 
> variable is updated.
>  
> The bad user gets this:
>  
> [bad@headnode ~]$ xauth list
> headnode.babraham.ac.uk/unix:10  MIT-MAGIC-COOKIE-1  
> c39a493a37132d308b37469d363d8692
>  
> [bad@headnode ~]$ echo $DISPLAY
> localhost:10.0
>  
> [bad@headnode ~]$ srun --pty -p interactive --x11  --preserve-env /bin/bash
>  
> [bad@compute ~]$ xauth list
> headnode.babraham.ac.uk/unix:10  MIT-MAGIC-COOKIE-1  
> c39a493a37132d308b37469d363d8692
>  
> [bad@compute ~]$ echo $DISPLAY
> localhost:10.0
>  
> So the cookie isn’t copied and the DISPLAY isn’t updated.  I can’t see any 
> errors in the logs and I can’t see anything different about this account.
>  
> If I do a straight forward ssh -Y from the head node to a compute node from 
> the bad account then that works fine – it’s only whatever is specific about 
> the way that srun forwards X which fails.
>  
> 

Re: [slurm-users] blastx fails with "Error memory mapping"

2020-01-24 Thread Jeffrey T Frey
Does your Slurm cgroup or node OS cgroup configuration limit the virtual 
address space of processes?  The "Error memory mapping" is thrown by blast when 
it tries to map a file on disk into its virtual address space (see "man mmap") 
so the file can be accessed via pointers, with the OS paging data in and out of 
the file, rather than through standard file i/o calls (e.g. fread(), fscanf(), 
read()).  It sounds like either you don't have enough system RAM, period, or 
the cgroup "memory.memsw.limit_in_bytes" is set too low for the amount of file 
content you're attempting to mmap() into the virtual address space (e.g. BIG 
files).
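
If you want to check what a job actually lands under, something along these 
lines works -- cgroup v1 paths as Slurm's task/cgroup plugin usually lays them 
out, with the UID and job id below being placeholders:


# per-process address-space rusage limit, as seen inside an allocation:
srun /bin/bash -c 'ulimit -v'

# cgroup memory and memory+swap ceilings for a hypothetical job 284, UID 1234:
cat /sys/fs/cgroup/memory/slurm/uid_1234/job_284/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/slurm/uid_1234/job_284/memory.memsw.limit_in_bytes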




> On Jan 24, 2020, at 07:03 , Mahmood Naderan  wrote:
> 
> Hi,
> Although I can run the blastx command in a terminal on all nodes, I cannot 
> use slurm for it due to a so-called "memory map error".
> Please see below: I pressed ^C after a few seconds when running it from the 
> terminal.
> 
> Fri Jan 24 15:29:57 +0330 2020
> [shams@hpc ~]$ blastx -db ~/ncbi-blast-2.9.0+/bin/nr -query 
> ~/khTrinityfilterless1.fasta -max_target_seqs 5 -outfmt 6 -evalue 1e-5 
> -num_threads 2
> ^C
> [shams@hpc ~]$ date
> Fri Jan 24 15:30:09 +0330 2020
> 
> 
> However, the following script fails
> 
> [shams@hpc ~]$ cat slurm_blast.sh
> #!/bin/bash
> #SBATCH --job-name=blast1
> #SBATCH --output=my_blast.log
> #SBATCH --partition=SEA
> #SBATCH --account=fish
> #SBATCH --mem=38GB
> #SBATCH --nodelist=hpc
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=2
> 
> export PATH=~/ncbi-blast-2.9.0+/bin:$PATH
> blastx -db ~/ncbi-blast-2.9.0+/bin/nr -query ~/khTrinityfilterless1.fasta 
> -max_target_seqs 5 -outfmt 6 -evalue 1e-5 -num_threads 2
> [shams@hpc ~]$ sbatch slurm_blast.sh
> Submitted batch job 284
> [shams@hpc ~]$ cat my_blast.log
> Error memory mapping:/home/shams/ncbi-blast-2.9.0+/bin/nr.52.psq 
> openedFilesCount=151 threadID=0
> Error: NCBI C++ Exception:
> T0 
> "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_560232_130.14.18.6_9008__PrepareRelease_Linux64-Centos_1552331742/c++/compilers/unix/../../src/corelib/ncbiobj.cpp",
>  line 981: Critical: ncbi::CObject::ThrowNullPointerException() - Attempt to 
> access NULL pointer.
>  Stack trace:
>   blastx ???:0 ncbi::CStackTraceImpl::CStackTraceImpl() offset=0x77 
> addr=0x1d95da7
>   blastx ???:0 ncbi::CStackTrace::CStackTrace(std::string const&) 
> offset=0x25 addr=0x1d98465
>   blastx ???:0 ncbi::CException::x_GetStackTrace() offset=0xA0 
> addr=0x1ec7330
>   blastx ???:0 ncbi::CException::SetSeverity(ncbi::EDiagSev) offset=0x49 
> addr=0x1ec2169
>   blastx ???:0 ncbi::CObject::ThrowNullPointerException() offset=0x2D2 
> addr=0x1f42582
>   blastx ???:0 ncbi::blast::CBlastTracebackSearch::Run() offset=0x61C 
> addr=0xf2929c
>   blastx ???:0 ncbi::blast::CLocalBlast::Run() offset=0x404 addr=0xed4684
>   blastx ???:0 CBlastxApp::Run() offset=0xC9C addr=0x9cbf7c
>   blastx ???:0 ncbi::CNcbiApplication::x_TryMain(ncbi::EAppDiagStream, 
> char const*, int*, bool*) offset=0x8E3 addr=0x1da0e13
>   blastx ???:0 ncbi::CNcbiApplication::AppMain(int, char const* const*, 
> char const* const*, ncbi::EAppDiagStream, char const*, std::string const&) 
> offset=0x782 addr=0x1d9f6b2
>   blastx ???:0 main offset=0x5E5 addr=0x9caa05
>   /lib64/libc.so.6 ???:0 __libc_start_main offset=0xF5 addr=0x7f9a0fb3e505
>   blastx ???:0 blastx() [0x9ca345] offset=0x0 addr=0x9ca345
> 
> 
> 
> Any idea about that?
> 
> 
> Regards,
> Mahmood
> 
> 




Re: [slurm-users] Question about networks and connectivity

2019-12-09 Thread Jeffrey T Frey
Open MPI matches available hardware in node(s) against its compiled-in 
capabilities.  Those capabilities are expressed as modular shared libraries 
(see e.g. $PREFIX/lib64/openmpi).  You can use environment variables or 
command-line flags to influence which modules get used for specific purposes.  
For example, the Byte Transfer Layer (BTL) framework has openib, tcp, self, 
shared-memory (sm), and vader implementations.  So long as your build of Open 
MPI knew about Infiniband and the runtime can see the hardware, Open MPI should 
rank that interface as the highest-performance option and use it.
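
If you want to force the choice (or just see what gets picked), the BTL list 
can be pinned or traced on the mpirun command line -- a sketch, with the 
executable name being a placeholder:


# pin the InfiniBand BTL (plus self/vader for on-node traffic):
mpirun --mca btl openib,self,vader ./my_mpi_program

# or leave selection alone and just log which BTLs are chosen at startup:
mpirun --mca btl_base_verbose 100 ./my_mpi_program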



> On Dec 9, 2019, at 08:54 , Sysadmin CAOS  wrote:
> 
> Hi mercan,
> 
> OK, I forgot to compile OpenMPI with Infiniband support... But I still have a 
> doubt: the SLURM scheduler assigns (offers) some nodes called "node0x" to my 
> sbatch job because in my SLURM cluster the nodes were added under the "node0x" 
> name. My OpenMPI application has now been compiled with ibverbs support, 
> but how do I tell my application, or my SLURM sbatch submit script, that my 
> MPI program MUST use the Infiniband network? If SLURM has assigned me node01 
> and node02 (with IP addresses 192.168.11.1 and 192.168.11.2 on a gigabit 
> network) and Infiniband is 192.168.13.x, what transforms "clus01" 
> (192.168.12.1) and "clus02" (192.168.12.2) into "infi01" (192.168.13.1) and 
> "infi02" (192.168.13.2)?
> 
> This step still baffles me...
> 
> Sorry if my question is easy for you... but I have now entered a sea of 
> doubts.
> 
> Thanks.
> 
> El 05/12/2019 a las 14:27, mercan escribió:
>> Hi;
>> 
>> Your MPI and NAMD use your second network because your applications were 
>> not compiled for Infiniband. There are many precompiled NAMD versions; the 
>> verbs and ibverbs versions are the ones that use Infiniband. Also, when you 
>> compile the MPI source, you should check that the configure script detects 
>> the Infiniband network so it will be used -- and likewise when compiling 
>> Slurm.
>> 
>> Regards;
>> 
>>  Ahmet M.
>> 
>> 
>> On 5.12.2019 15:07, sysadmin.caos wrote:
>>> Hello,
>>> 
>>> Really, I don't know if my question is for this mailing list... but I will 
>>> explain my problem and, then, you could answer me whatever you think ;)
>>> 
>>> I manage a SLURM cluster composed of 3 networks:
>>> 
>>>   * a gigabit network used for NFS shares (192.168.11.X). In this
>>> network, my nodes are "node01, node02..." in /etc/hosts.
>>>   * a gigabit network used by SLURM (all my nodes are added to SLURM
>>> cluster using this network and the hostname assigned via /etc/host
>>> to this second network). (192.168.12.X). In this network, my nodes
>>> are "clus01, clus02..." in /etc/hosts.
>>>   * a Infiniband network (192.168.13.X). In this network, my nodes are
>>> "infi01, infi02..." in /etc/hosts.
>>> 
>>> When I submit an MPI job, the SLURM scheduler offers me "n" nodes called, for 
>>> example, clus01 and clus02, and there my application runs perfectly, using the 
>>> second network for SLURM connectivity and the first network for NFS (and NIS) 
>>> shares. By default, as SLURM connectivity is on the second network, my nodelist 
>>> contains nodes called "clus0x".
>>> 
>>> However, now I'm hitting a "new" problem. I want to use the third network 
>>> (Infiniband), but as SLURM offers me "clus0x" (the second network), my MPI 
>>> application runs OK but over the second network. The same problem also occurs, 
>>> for example, with the NAMD (Charmrun) application.
>>> 
>>> So, my questions are:
>>> 
>>>  1. Is this SLURM configuration correct for using both networks?
>>> 1a. If the answer is "no", how do I configure SLURM for my purpose?
>>> 1b. But if the answer is "yes", how can I ensure connections in my
>>>     SLURM job are going over Infiniband?
>>> 
>>> Thanks a lot!!
>>> 
> 
> 




Re: [slurm-users] $TMPDIR does not honor "TmpFS"

2018-11-21 Thread Jeffrey T Frey
If you check the applicable code in src/slurmd/slurmstepd/task.c, TMPDIR is set 
to "/tmp" if it's not already set in the job environment and then TMPDIR is 
created if permissible.  It's your responsibility to set TMPDIR -- e.g. we have 
a plugin we wrote (autotmp) to set TMPDIR to per-job and per-step paths 
according to the job id.
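
Not the autotmp plugin itself, but a minimal TaskProlog sketch in the same 
spirit (the base path and the "batch" fallback for SLURM_STEP_ID are 
assumptions; the prolog examples quoted below do much the same per-job):


#!/bin/bash
# TaskProlog: anything printed as "export NAME=value" is injected into the
# task environment by slurmstepd.
base="/tmp/slurm"
dir="${base}/job_${SLURM_JOB_ID}/step_${SLURM_STEP_ID:-batch}"
mkdir -p "$dir"
echo "export TMPDIR=$dir"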



> On Nov 21, 2018, at 10:33 , Michael Gutteridge  
> wrote:
> 
> 
> I don't think that's a bug.  As far as I've ever known, TmpFS is only used to 
> tell slurmd where to look for available space (reported as TmpDisk for the 
> node).  The manpage only indicates that, not any additional functionality.  
> We set TMPDIR in a task prolog:
> 
> #!/bin/bash
> echo "export TMPDIR=/loc/scratch/${SLURM_JOB_ID}"
> echo "export SCRATCH_LOCAL=/loc/scratch/${SLURM_JOB_ID}"
> echo "export SCRATCH=/net/scratch/${SLURM_JOB_ID}"
> 
> - Michael
> 
> 
> On Wed, Nov 21, 2018 at 6:52 AM Shenglong Wang  wrote:
> We have TMPDIR set up inside the prolog file. Hopefully users do not have the 
> absolute path /tmp inside their scripts.
> 
> #!/bin/bash
> 
> SLURM_BIN="/opt/slurm/bin"
> 
> SLURM_job_tmp=/state/partition1/job-${SLURM_JOB_ID}
> 
> mkdir -m 700 -p $SLURM_job_tmp
> chown $SLURM_JOB_USER $SLURM_job_tmp
> 
> echo "export SLURM_JOBTMP=$SLURM_job_tmp"
> echo "export SLURM_JOB_TMP=$SLURM_job_tmp"
> echo "export SLURM_JOB_TMPDIR=$SLURM_job_tmp"
> echo "export TMPDIR=$SLURM_job_tmp”
> 
> Best.
> Shenglong
> 
>> On Nov 21, 2018, at 9:44 AM, Roger Moye  wrote:
>> 
>> We are having the exact same problem with $TMPDIR.  I wonder if a bug has 
>> crept in?  I spoke to the SchedMD guys at SC18 last week and they were not 
>> aware of one, but since more than one person is having this difficulty, 
>> something must be wrong somewhere.
>>  
>> -Roger
>>  
>> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf 
>> Of Douglas Duckworth
>> Sent: Wednesday, November 21, 2018 7:38 AM
>> To: slurm-users@lists.schedmd.com
>> Subject: [slurm-users] $TMPDIR does not honor "TmpFS"
>>  
>> Hi 
>>  
>> We are setting TmpFS=/scratchLocal in /etc/slurm/slurm.conf on nodes and 
>> controller.  However $TMPDIR value seems to be /tmp not /scratchLocal.  As a 
>> result users are writing to /tmp which we do not want.
>> 
>> We are not setting $TMPDIR anywhere else such as /etc/profile.d nor do users 
>> have it defined in their ~/.bashrc or ~/.bash_profile.  
>>  
>> We do not see any error messages anywhere which could indicate why the 
>> default value of /tmp overrides our value of of TmpFS.  
>>  
>> As I understand it, prolog scripts can change this value; if that's the 
>> case, then what's the purpose of setting TmpFS in /etc/slurm/slurm.conf?
>>  
>> 
>> Thanks,
>> 
>> Douglas Duckworth, MSc, LFCS
>> HPC System Administrator
>> Scientific Computing Unit
>> Weill Cornell Medicine
>> 1300 York Avenue
>> New York, NY 10065
>> E: d...@med.cornell.edu
>> O: 212-746-6305
>> F: 212-746-8690
>>  
> 




Re: [slurm-users] new user simple question re sacct output line2

2018-11-14 Thread Jeffrey T Frey
The identifier after the base numeric job id -- e.g. "batch" -- is the job 
step.  The "batch" step is where your job script executes.  Each time your job 
script calls "srun" a new numerical step is created, e.g. "82.0," "82.1," et 
al.  Job accounting captures information for the entire job (JobID = "82") and 
the individual steps.  Not every data point visible in sacct pertains to, or is 
captured for, each individual step.
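
Selecting JobID and JobName explicitly makes the relationship obvious -- e.g. 
(the job id is from your example; the format list is just an illustration):


sacct -j 82 --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode,ReqMem

# expected layout: one row for the whole job (82), one for the batch step
# (82.batch), and one per srun invocation (82.0, 82.1, ...)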



> On Nov 14, 2018, at 08:38 , Matthew Goulden  
> wrote:
> 
> Hi,
> 
> New to slurm; currently working up to move our system from uge/sge
> 
> sacct output, including the default headers, is three lines. What is line 2 
> documenting? Most fields are blank.
> 
> For most fields with values these are the same as for line 3:
> AllocCPUS,
> Elapsed,
> State,
> ExitCode,
> ReqMem,
> 
> For some fields with values these are clearly related to that in line 3 
> (represented here as line1:line2:line3)
> JobID: 82:  82.batch
> JobIDRaw: 82:  82.batch
> 
> For others the values are unique to line 2:
> JobName:  : batch
> Partition: all_slt_limit:
> ReqCPUFreqMin: Unknown:  0
> ReqCPUFreqMax: Unknown:  0
> ReqCPUFreqGov: Unknown:  0
> ReqTRES: billing=1,cpu=1,node=1:  
> AllocTRES: billing=1,cpu=1,mem=125000M,node=1:  
> cpu=1,mem=125000M,node=1
> 
> I'm sure the documentation - which is excellent - details this but I've not 
> found where; can someone give me the pointer I need?
> 
> Many thanks
> 
> Matt
> 




Re: [slurm-users] Slurmstepd sleep processes

2018-08-03 Thread Jeffrey T Frey
See:

https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/mgr.c


Circa line 1072 the comment explains:


/*
 * Need to exec() something for proctrack/linuxproc to
 * work, it will not keep a process named "slurmstepd"
 */

execl(SLEEP_CMD, "sleep", "100000000", NULL);


Basically, proctrack/linuxproc will produce an error if a slurmstepd is running 
zero subprocesses.  So a very long sleep command is spawned to satisfy that 
condition (no matter what proctrack plugin is actually being used).
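
The extern step those sleeps sit under is the per-job container created when 
PrologFlags=contain (or x11, which implies it) is configured.  To confirm a 
given sleep is just that placeholder, walking the process tree is enough -- a 
sketch, reusing the job id from your ps output:


# children of the extern slurmstepd for job 13220317:
pid=$(pgrep -f 'slurmstepd: \[13220317.extern\]')
ps -o pid,ppid,etime,cmd --ppid "$pid"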




> On Aug 3, 2018, at 17:42 , Christopher Benjamin Coffey  
> wrote:
> 
> Hello,
> 
> Has anyone observed "sleep 1" processes on their compute nodes? They 
> seem to be tied to the slurmstepd extern process in slurm:
> 
> 4 S root 136777  1  0  80   0 - 73218 do_wai 05:48 ?00:00:01 
> slurmstepd: [13220317.extern]
> 0 S root 136782 136777  0  80   0 - 25229 hrtime 05:48 ?00:00:00  
> \_ sleep 1
> 4 S root 136784  1  0  80   0 - 73280 do_wai 05:48 ?00:00:02 
> slurmstepd: [13220317.batch]
> 4 S tes87136789 136784  0  80   0 - 26520 do_wai 05:48 ?00:00:00  
> \_ /bin/bash /var/spool/slurm/slurmd/job13220317/slurm_script
> 4 S root 136807  1  0  80   0 - 107157 do_wai 05:48 ?   00:00:01 
> slurmstepd: [13220317.1]
> 
> I'm not exactly sure what the extern piece is for. Anyone know what this is 
> all about? Is this normal? We just saw this the other day while investigating 
> some issues. Sleeping for 3.17 years seems strange. Any help would be 
> appreciated, thanks!
> 
> Best,
> Chris
> 
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
>