Re: [slurm-users] Slurmstepd sleep processes

2018-08-03 Thread Jeffrey T Frey
See: https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/mgr.c Circa line 1072 the comment explains: /* * Need to exec() something for proctrack/linuxproc to * work, it will not keep a process

Re: [slurm-users] new user simple question re sacct output line2

2018-11-14 Thread Jeffrey T Frey
The identifier after the base numeric job id -- e.g. "batch" -- is the job step. The "batch" step is where your job script executes. Each time your job script calls "srun" a new numerical step is created, e.g. "82.0," "82.1," et al. Job accounting captures information for the entire job
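For illustration, a minimal sketch of listing the per-step records for a hypothetical job 82 (job ID and field list are only examples):

    # List the job plus its steps (82.batch, 82.0, 82.1, ...) in one report
    sacct -j 82 --format=JobID,JobName,Elapsed,State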

Re: [slurm-users] $TMPDIR does not honor "TmpFS"

2018-11-21 Thread Jeffrey T Frey
If you check the applicable code in src/slurmd/slurmstepd/task.c, TMPDIR is set to "/tmp" if it's not already set in the job environment and then TMPDIR is created if permissible. It's your responsibility to set TMPDIR -- e.g. we have a plugin we wrote (autotmp) to set TMPDIR to per-job and
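Absent such a plugin, a hedged sketch of doing the same thing at the top of a job script (the path layout is just an assumption):

    # Create a per-job scratch directory and clean it up when the script exits
    export TMPDIR="/tmp/job_${SLURM_JOB_ID}"
    mkdir -p "$TMPDIR"
    trap 'rm -rf "$TMPDIR"' EXIT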

Re: [slurm-users] Question about networks and connectivity

2019-12-09 Thread Jeffrey T Frey
Open MPI matches available hardware in node(s) against its compiled-in capabilities. Those capabilities are expressed as modular shared libraries (see e.g. $PREFIX/lib64/openmpi). You can use environment variables or command-line flags to influence which modules get used for specific
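As an illustrative (not authoritative) example, MCA parameters can be set either on the command line or in the environment to steer module selection; the application name is a placeholder:

    # Exclude the openib BTL on the command line...
    mpirun --mca btl ^openib ./my_app
    # ...or equivalently via the environment
    export OMPI_MCA_btl=^openib
    srun ./my_app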

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-26 Thread Jeffrey T Frey
Did you reuse the 20.02 select/cons_res/Makefile.{in,am} in your plugin's source? You probably will have to re-model your plugin after the select/cray_aries plugin if you need to override those two functions (it also defines its own select_p_job_begin() and doesn't link against

Re: [slurm-users] blastx fails with "Error memory mapping"

2020-01-24 Thread Jeffrey T Frey
Does your Slurm cgroup or node OS cgroup configuration limit the virtual address space of processes? The "Error memory mapping" is thrown by blast when trying to create a virtual address space that exposes the contents of a file on disk (see "man mmap") so the file can be accessed via pointers
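A quick way to check what a job actually sees (a hedged sketch; no partition or node names assumed):

    # Inside an interactive job, inspect the address-space limit (RLIMIT_AS);
    # "unlimited" means mmap() should not fail for this particular reason.
    srun --pty /bin/bash -c 'ulimit -v; ulimit -a'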

Re: [slurm-users] Srun not setting DISPLAY with --x11 for one account

2020-01-27 Thread Jeffrey T Frey
> So the answer then is to either kludge the keys by making symlinks to the > cluster and cluster.pub files warewulf makes (I tried this already and I know > it works), or to update to the v19.x release and the new style x11 forwarding. Our answer was to create RSA keys for all users in their
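A hedged sketch of that per-user key setup (the file names follow the plugin's id_rsa expectation; everything else is an assumption):

    # Generate a passphrase-less RSA key pair and authorize it for in-cluster SSH
    ssh-keygen -t rsa -b 4096 -N "" -f "$HOME/.ssh/id_rsa"
    cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
    chmod 600 "$HOME/.ssh/authorized_keys"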

Re: [slurm-users] Srun not setting DISPLAY with --x11 for one account

2020-01-27 Thread Jeffrey T Frey
The Slurm-native X11 plugin demands you use ~/.ssh/id_rsa{,.pub} keys. It's hard-coded into the plugin: /* * Ideally these would be selected at run time. Unfortunately, * only ssh-rsa and ssh-dss are supported by libssh2 at this time, * and ssh-dss is deprecated. */ static char

Re: [slurm-users] Problems calling mpirun in OpenMPI-3.1.6 + slurm and OpenMPI-4.0.3+slurm environments

2020-04-10 Thread Jeffrey T Frey
expecting :-) > On Apr 10, 2020, at 12:59 , Jeffrey T Frey wrote: > > Are you certain your PATH addition is correct? The "-np" flag is still > present in a build of Open MPI 4.0.3 I just made, in

Re: [slurm-users] Problems calling mpirun in OpenMPI-3.1.6 + slurm and OpenMPI-4.0.3+slurm environments

2020-04-10 Thread Jeffrey T Frey
Are you certain your PATH addition is correct? The "-np" flag is still present in a build of Open MPI 4.0.3 I just made, in fact: $ 4.0.3/bin/mpirun -- mpirun could not find anything to do. It is possible that you

Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?

2020-04-21 Thread Jeffrey T Frey
You could also choose to propagate the signal to the child process of test.slurm yourself: #!/bin/bash #SBATCH --job-name=test #SBATCH --ntasks-per-node=1 #SBATCH --nodes=1 #SBATCH --time=00:03:00 #SBATCH --signal=B:SIGINT@30 # This example works, but I need it to work without "B:" in --signal
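A minimal hedged sketch of that propagation inside the batch script (the program name is a placeholder):

    # Launch the real work in the background, forward SIGINT to it, then wait
    ./my_program &
    child=$!
    trap 'kill -INT "$child"' INT
    wait "$child"
    # wait again in case the first wait was interrupted by the trapped signal
    wait "$child"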

Re: [slurm-users] IPv6 for slurmd and slurmctld

2020-05-01 Thread Jeffrey T Frey
Use netstat to list listening ports on the box (netstat -ln) and see if it shows up as tcp6 or tcp. On our (older) 17.11.8 server: $ netstat -ln | grep :6817 tcp 0 0 0.0.0.0:6817 0.0.0.0:* LISTEN $ nc -6 :: 6817 Ncat: Connection refused. $ nc -4

Re: [slurm-users] slurm array with non-numeric index values

2020-07-15 Thread Jeffrey T Frey
On our HPC systems we have a lot of users attempting to organize job arrays for varying purposes -- parameter scans, SSMD (Single-Script, Multiple Datasets). We eventually wrote an abstract utility to try to help them with the process: https://github.com/jtfrey/job-templating-tool May be of
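Independent of that tool, a common hedged workaround is to map the numeric task ID onto arbitrary labels inside the script (the labels and program name are placeholders):

    #!/bin/bash
    #SBATCH --array=0-2
    datasets=(alpha beta gamma)                     # the non-numeric "indices"
    dataset="${datasets[$SLURM_ARRAY_TASK_ID]}"
    ./process_dataset "$dataset"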

Re: [slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread Jeffrey T Frey
If you check the source up on Github, that's more of a warning produced when you didn't specify a CPU count and it's going to calculate from the socket-core-thread numbers (src/common/read_config.c): /* Node boards are factored into sockets */ if ((n->cpus !=
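The usual fix is to make the node definition self-consistent, or to omit CPUs= and let it be derived; a hedged slurm.conf sketch (hostname, counts, and memory are placeholders):

    # Either omit CPUs= so Slurm computes Sockets*CoresPerSocket*ThreadsPerCore,
    # or make sure that product matches the CPUs= value you give.
    NodeName=node001 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=192000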

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-08 Thread Jeffrey T Frey
An MPI library with tight integration with Slurm (e.g. Intel MPI, Open MPI) can use "srun" to start the remote workers. In some cases "srun" can be used directly for MPI startup (e.g. "srun" instead of "mpirun"). Other/older MPI libraries that start remote processes using "ssh" would,
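For example (hedged; the PMI flavor depends on how your MPI and Slurm were built, and the binary name is a placeholder):

    # Tightly integrated launch: no ssh, no mpirun, task count taken from the allocation
    srun --mpi=pmi2 ./mpi_app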

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Jeffrey T Frey
Is the time on that node too far out-of-sync w.r.t. the slurmctld server? > On Jun 11, 2020, at 09:01 , navin srivastava wrote: > > I tried by executing the debug mode but there also it is not writing anything. > > i waited for about 5-10 minutes > > deda1x1452:/etc/sysconfig #

Re: [slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Jeffrey T Frey
Making the certificate globally-available on the host may not always be permissible. If I were you, I'd write/suggest a modification to the plugin to make the CA path (CURLOPT_CAPATH) and verification itself (CURLOPT_SSL_VERIFYPEER) configurable in Slurm. They are both straightforward
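Until such a patch exists, you can at least verify the TLS setup the plugin would need using the curl CLI equivalents of those options (the CA path and hostname are assumptions):

    # --cacert/--capath correspond to CURLOPT_CAINFO/CURLOPT_CAPATH;
    # -k (--insecure) corresponds to disabling CURLOPT_SSL_VERIFYPEER.
    curl --cacert /etc/pki/tls/certs/influxdb-ca.pem https://influxdb.example.org:8086/ping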

Re: [slurm-users] Slurm versions 20.11.1 is now available

2020-12-11 Thread Jeffrey T Frey
It's in the github commits: https://github.com/SchedMD/slurm/commit/8e84db0f01ecd4c977c12581615d74d59b3ff995 The primary issue is that any state the client program established on the connection after first making it (e.g. opening a transaction, creating temp tables) won't be present if MySQL

[slurm-users] Constraint multiple counts not working

2020-12-16 Thread Jeffrey T Frey
Requested node configuration is not available. My syntax agrees with the 20.11.1 documentation (online and man pages) so it seems correct — and it works fine in 17.11.8. Any ideas?

Re: [slurm-users] Heterogeneous GPU Node MPS

2020-11-13 Thread Jeffrey T Frey
From the NVIDIA docs re: MPS: On systems with a mix of Volta / pre-Volta GPUs, if the MPS server is set to enumerate any Volta GPU, it will discard all pre-Volta GPUs. In other words, the MPS server will either operate only on the Volta GPUs and expose Volta capabilities, or operate only on
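In practice that usually means restricting which GPUs the MPS control daemon can see before starting it; a hedged sketch (the device indices are placeholders):

    # Expose only the Volta devices to MPS so the pre-Volta GPUs aren't silently discarded
    export CUDA_VISIBLE_DEVICES=0,1
    nvidia-cuda-mps-control -d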

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Jeffrey T Frey
o a unique repository: those who want the pre-built packages explicitly configure their YUM to pull from that repository, those who have EPEL configured (which is a LOT of us) don't get overlapping Slurm packages interfering with their local builds.

Re: [slurm-users] Bug: incorrect output directory fails silently

2021-07-08 Thread Jeffrey T Frey
> I understand that there is no output file to write an error message to, but > it might be good to check the `--output` path during the scheduling, just > like `--account` is checked. > > Does anybody know a workaround to be warned about the error? I would make a feature request of SchedMD to
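As a stop-gap until SchedMD adds such a check, a hedged sketch of a submit-side wrapper that validates the directory first (file names and variables are illustrative):

    # Refuse to submit if the parent directory of the requested --output file is missing
    out="results/%x-%j.out"
    dir=$(dirname "$out")
    if [ ! -d "$dir" ]; then
        echo "ERROR: output directory '$dir' does not exist" >&2
        exit 1
    fi
    sbatch --output="$out" job.slurm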

Re: [slurm-users] squeue: compact pending job-array in one partition, but not in other

2021-02-23 Thread Jeffrey T Frey
Did those four jobs 6577272_21 scavenger PD 0:00 1 (Priority) 6577272_22 scavenger PD 0:00 1 (Priority) 6577272_23 scavenger PD 0:00 1 (Priority) 6577272_28 scavenger PD 0:00 1 (Priority) run before and get requeued?

Re: [slurm-users] sacct output in tabular form

2021-08-25 Thread Jeffrey T Frey
You've confirmed my suspicion — no one seems to care for Slurm's standard output formats :-) At UD we did a Python curses wrapper around the parseable output to turn the terminal window into a navigable spreadsheet of output: https://gitlab.com/udel-itrci/slurm-output-wrappers > On Aug
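A lighter-weight trick along the same lines (hedged; the job ID and field list are only examples) is to pipe the parseable output through column:

    # --parsable2 emits pipe-delimited records without trailing separators
    sacct -j 12345 --parsable2 --format=JobID,JobName,State,Elapsed,MaxRSS | column -t -s '|'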

Re: [slurm-users] Fairshare: Penalising unused memory rather than used memory?

2023-10-11 Thread Jeffrey T Frey
> On the automation part, it would be pretty easy to do regularly(daily?) stats > of jobs for that period of time and dump them into an sql database. > Then a select statement where cpu_efficiency is less than desired value and > get the list of not so nice users on which you can apply whatever

Re: [slurm-users] Why every job will sleep 100000000

2022-11-04 Thread Jeffrey T Frey
If you examine the process hierarchy, that "sleep 100000000" process is probably the child of a "slurmstepd: [.extern]" process. This is a housekeeping step launched for the job by slurmd -- in older Slurm releases it would handle the X11 forwarding, for example. It should have no impact on
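You can confirm the parentage with a process-tree listing, e.g. (hedged; the exact output varies by release):

    # Show slurmstepd processes and their children; the sleep should hang off "[<jobid>.extern]"
    ps -e --forest -o pid,ppid,user,cmd | grep -B1 -A1 slurmstepd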

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Jeffrey T Frey
> The remaining issue then is how to put them into an allocation that is > actually running a singularity container. I don't get how what I'm doing now > is resulting in an allocation where I'm in a container on the submit node > still! Try prefixing the singularity command with "srun" e.g.
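i.e., something along these lines (hedged; the image name and resource options are placeholders):

    # Run the container inside the allocation's compute node, not on the submit host
    srun --nodes=1 --ntasks=1 --pty singularity shell my_image.sif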

Re: [slurm-users] slurm and singularity

2023-02-08 Thread Jeffrey T Frey
ll > " and get the apptainer prompt. If I prefix that command with "srun", > then it just hangs and I never get the prompt. So that seems to be the > sticking point. I'll have to do some experiments running singularity with > srun. > > From: slurm-users on

Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Jeffrey T Frey
In case you're developing the plugin in C and not LUA, behind the scenes the LUA mechanism is concatenating all log_user() strings into a single variable (user_msg). When the LUA code completes, the C code sets the *err_msg argument to the job_submit()/job_modify() function to that string,

Re: [slurm-users] How do I set SBATCH_EXCLUSIVE to its default value?

2023-05-19 Thread Jeffrey T Frey
> I get that these correspond > > --exclusive=user export SBATCH_EXCLUSIVE=user > --exclusive=mcs export SBATCH_EXCLUSIVE=mcs > But --exclusive has a default behavior if I don't assign it a value. What do > I set SBATCH_EXCLUSIVE to, to get the same default behavior? Try setting

[slurm-users] Re: Restricting local disk storage of jobs

2024-02-07 Thread Jeffrey T Frey via slurm-users
his information, but this seems a bit unclean. Anyway, if I > find some time I will try it out. > Best, > Tim > On 2/6/24 16:30, Jeffrey T Frey wrote: >> Most of my ideas have revolved around creating file systems on-the-fly as >> part of the job prolog and destroying

[slurm-users] Re: Restricting local disk storage of jobs

2024-02-06 Thread Jeffrey T Frey via slurm-users
Most of my ideas have revolved around creating file systems on-the-fly as part of the job prolog and destroying them in the epilog. The issue with that mechanism is that formatting a file system (e.g. mkfs.) can be time-consuming. E.g. formatting your local scratch SSD as an LVM PV+VG and
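For concreteness, a hedged prolog/epilog sketch of that LVM approach (the volume group, size, and mount point are assumptions, and the mkfs cost is exactly the concern raised above):

    # Prolog: carve a per-job logical volume, format it, mount it
    lvcreate -L 100G -n "job_${SLURM_JOB_ID}" scratch_vg
    mkfs.xfs -q "/dev/scratch_vg/job_${SLURM_JOB_ID}"
    mkdir -p "/scratch/${SLURM_JOB_ID}"
    mount "/dev/scratch_vg/job_${SLURM_JOB_ID}" "/scratch/${SLURM_JOB_ID}"

    # Epilog: tear it down
    umount "/scratch/${SLURM_JOB_ID}"
    lvremove -f "scratch_vg/job_${SLURM_JOB_ID}"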

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-15 Thread Jeffrey T Frey via slurm-users
https://github.com/dun/munge/issues/94 The NEWS file claims this was fixed in 0.5.15. Since your log doesn't show the additional strerror() output you're definitely running an older version, correct? If you go on one of the affected nodes and do an `lsof -p ` I'm betting you'll find a long

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jeffrey T Frey via slurm-users
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n" > is per user. The ulimit command is a frontend to the kernel's resource limits (rlimits), which are per-process restrictions (not per-user). The fs.file-max is the kernel's limit on how many file descriptors can be open in aggregate. You'd have to edit
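To see the two limits side by side (a minimal hedged sketch):

    ulimit -n                    # per-process open-file limit (RLIMIT_NOFILE)
    sysctl fs.file-max           # kernel-wide ceiling on open file handles
    cat /proc/sys/fs/file-nr     # allocated, unused, and maximum file handles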