Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-02 Thread Michael Di Domenico
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: > The problem is to identify the cards physically from the information we > have, like what's reported with nvidia-smi or available in > /proc/driver/nvidia/gpus/*/information > The serial number isn't shown for every type of GPU and I'm not sure
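A minimal sketch of collecting what the driver exposes per card, assuming each `/proc/driver/nvidia/gpus/*/information` file is a list of `Key: value` lines (fields such as `Model` or `Bus Location` are commonly present, but the set varies by driver and GPU type). The helper names `parse_gpu_information` and `read_all_gpus` are hypothetical.

```python
from pathlib import Path


def parse_gpu_information(text: str) -> dict:
    """Parse 'Key: value' lines into a dict (assumed file layout)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as a PCI
        # address that contain colons themselves stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def read_all_gpus(root: str = "/proc/driver/nvidia/gpus") -> dict:
    """Map each GPU directory name to its parsed information file."""
    result = {}
    for info in sorted(Path(root).glob("*/information")):
        result[info.parent.name] = parse_gpu_information(info.read_text())
    return result


if __name__ == "__main__":
    for gpu, fields in read_all_gpus().items():
        print(gpu, fields)
```

On a node without the NVIDIA driver loaded, `read_all_gpus()` simply returns an empty dict.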

Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-15 Thread Michael Di Domenico
the error message sounds like when you built the slurm source it wasn't able to find the nvml devel packages. if you look in where you installed slurm, in lib/slurm you should have a gpu_nvml.so. do you? On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro wrote: > > typing error, should be --> **
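The check suggested above can be sketched as follows. The install prefix is an assumption (pass whatever prefix slurm was actually installed under), and `find_nvml_plugin` is a hypothetical helper name. Only the `lib/slurm/gpu_nvml.so` location comes from the message itself.

```python
from pathlib import Path
from typing import Optional


def find_nvml_plugin(prefix: str) -> Optional[Path]:
    """Return the path to gpu_nvml.so under <prefix>/lib/slurm, or None."""
    candidate = Path(prefix) / "lib" / "slurm" / "gpu_nvml.so"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    import sys

    # The prefix is site-specific; /usr/local is only a placeholder default.
    prefix = sys.argv[1] if len(sys.argv) > 1 else "/usr/local"
    plugin = find_nvml_plugin(prefix)
    if plugin:
        print(f"found {plugin}")
    else:
        print("gpu_nvml.so not found; slurm may have been built without NVML")
```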

Re: [slurm-users] GPU process accounting information

2021-01-15 Thread Michael Di Domenico
i would imagine that slurm should be able to pull that data through nvml. but i'd bet the hooks aren't in place. On Fri, Jan 15, 2021 at 7:44 AM Ole Holm Nielsen wrote: > > Hi, > > We have installed some new GPU nodes, and now users are asking for some > sort of monitoring of GPU utilisation and

Re: [slurm-users] gres names

2020-12-15 Thread Michael Di Domenico
you can either make them up on your own or they get spit out by NVML in the slurmd.log file On Tue, Dec 15, 2020 at 12:55 PM Erik Bryer wrote: > > Hi, > > Where do I get the gres names, e.g. "rtx2080ti", to use for my gpus in my > node definitions in slurm.conf? > > Thanks, > Erik

Re: [slurm-users] Slurm versions 20.11.1 is now available

2020-12-11 Thread Michael Di Domenico
't be present if MySQL automatically reconnects to the server. So > the reconnected state won't match the state expected by the client. Better > for the client to know the connection failed and reconnect on its own to > reestablish state. > > > > > On Dec 11, 2020, at

Re: [slurm-users] Slurm versions 20.11.1 is now available

2020-12-11 Thread Michael Di Domenico
> -- Disable MySQL automatic reconnection. can you expand on this? seems an 'odd' thing to disable. On Thu, Dec 10, 2020 at 4:44 PM Tim Wickberg wrote: > > We are pleased to announce the availability of Slurm version 20.11.1. > > This includes a number of fixes made in the month since 20.11 w

Re: [slurm-users] [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error

2020-10-22 Thread Michael Di Domenico
was there ever a result to this? i'm seeing the same error message, but i'm not adding in all the environ flags like the original poster. On Wed, Jul 10, 2019 at 9:18 AM Daniel Letai wrote: > > Thank you Artem, > > > I've made a mistake while typing the mail, in all cases it was > 'OMPI_MCA_pml

Re: [slurm-users] Memory per CPU

2020-09-29 Thread Michael Di Domenico
what leads you to believe that you're getting 2 CPU's instead of 1? 'scontrol show job ' would be a helpful first start. On Tue, Sep 29, 2020 at 9:56 AM Luecht, Jeff A wrote: > > I am working on my first ever SLURM cluster build for use as a resource > manager in a JupyterHub Development environ

Re: [slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
and it looks like i'll have to wait till 20.11 for a fix https://bugs.schedmd.com/show_bug.cgi?id=9035 On Wed, Aug 26, 2020 at 11:20 AM Michael Di Domenico wrote: > > looks like a similar issue is being tracked by: > https://bugs.schedmd.com/show_bug.cgi?id=9441 > > On Wed, A

Re: [slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
looks like a similar issue is being tracked by: https://bugs.schedmd.com/show_bug.cgi?id=9441 On Wed, Aug 26, 2020 at 11:04 AM Michael Di Domenico wrote: > > sorry i meant to say, our slurm nodehealth script pushed the node to > failed state. slurm itself wasn't doing this >

Re: [slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
sorry i meant to say, our slurm nodehealth script pushed the node to failed state. slurm itself wasn't doing this On Wed, Aug 26, 2020 at 11:02 AM Michael Di Domenico wrote: > > i just upgraded from v18 to v20. Did something change in the node > config validation? it used t

[slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
i just upgraded from v18 to v20. Did something change in the node config validation? it used to be that if i started slurm on a compute node that had lower than expected memory or was missing gpu's, slurm would push a node into a failed state that i could see in sinfo -R. now it seems to be loggi

Re: [slurm-users] How to view GPU indices of the completed jobs?

2020-06-10 Thread Michael Di Domenico
I don't know the answer, but have you checked the SQL tables in the database to see if the data you want is even being kept? it's possible slurm is just throwing that value away. (i agree it would be nice if it was retrievable) On Wed, Jun 10, 2020 at 2:59 AM Kota Tsuyuzaki wrote: > > > -j -l`

Re: [slurm-users] Dependencies with singleton and after

2019-08-28 Thread Michael Di Domenico
just curious. if you leave out the singleton, do you get the behavior as expected? On Tue, Aug 27, 2019 at 9:42 AM Jarno van der Kolk wrote: > > Hi all, > > I'm still puzzled by the expected behaviour of the following: > $ sbatch --hold fakejob.sh > Submitted batch job 25909273 > $ sbatch --hold

Re: [slurm-users] Slurm version 19.05.2 is now available

2019-08-26 Thread Michael Di Domenico
On Fri, Aug 23, 2019 at 11:08 AM Stuart Barkley wrote: > > Is it possible for these email announcements to include the MD5 and > SHA1 information that is contained on the download page. I like to > verify the checksums using a different channel than that used to > retrieve the software. not that
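For the verification step itself, a minimal sketch using Python's standard `hashlib`. The tarball name in the usage comment is a placeholder, and the expected digests would have to be obtained from the download page or another trusted channel. `file_digests` is a hypothetical helper name.

```python
import hashlib


def file_digests(path, algorithms=("md5", "sha1"), chunk_size=1 << 20):
    """Compute several digests of one file in a single pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        # Read in chunks so large tarballs are not loaded into memory at once.
        while chunk := fh.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Usage (placeholder file name):
#   digests = file_digests("slurm-19.05.2.tar.bz2")
#   compare digests["md5"] and digests["sha1"] with the published values
```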

Re: [slurm-users] Feature request: create a job id before job submission

2019-05-07 Thread Michael Di Domenico
On Tue, May 7, 2019 at 10:03 AM Mark Hahn wrote: > > > Some cluster sites need the creation of a workspace for the job in a > >scratch area before the actual job submission, and on the other hand they > >don't accept all characters as name of the workspace. Hence the plain job > >name often can't

Re: [slurm-users] Anyone built PMIX 3.1.1 against Slurm 18.08.4?

2019-01-22 Thread Michael Di Domenico
i've seen the same error, i don't think it's you. but i don't know what the cause is either, i didn't have time to look into it so i backed up to pmix 2.2.1 which seems to work fine On Tue, Jan 22, 2019 at 12:56 AM Greg Wickham wrote: > > > Hi All, > > I’m trying to build pmix 3.1.1 against slur

Re: [slurm-users] GRES GPU issues

2018-12-04 Thread Michael Di Domenico
unfortunately, someone smarter than me will have to help further. I'm not sure i see anything specifically wrong. The one thing i might try is backing the software down to a 17.x release series. I recently tried 18.x and had some issues. I can't say whether it'll be any different, but you might

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Michael Di Domenico
> > So, we use LDAP for authentication and my UID is 1498, but I created a user > in slurm using my login name. The default account for all users is "slt" Is > this the cause of my problems? > root@panther02 slurm# getent passwd lnicotra > lnicotra:*:1498:1152:Lou Nicotra:

Re: [slurm-users] GRES GPU issues

2018-12-03 Thread Michael Di Domenico
do you get anything additional in the slurm logs? have you tried adding gres to the debugflags? what version of slurm are you running? On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra wrote: > > Hi All, I have recently set up a slurm cluster with my servers and I'm > running into an issue while submi

[slurm-users] maintenance partitions?

2018-10-05 Thread Michael Di Domenico
Is anyone on the list using maintenance partitions for broken nodes? If so, how are you moving nodes between partitions? The situation with my machines at the moment, is that we have a steady stream of new jobs coming into the queues, but broken nodes as well. I'd like to fix those broken nodes an

Re: [slurm-users] slurm does not pass mca params to openmpi?

2018-07-20 Thread Michael Di Domenico
On Thu, Jul 19, 2018 at 3:50 PM, Roger Mason wrote: > Michael Di Domenico writes: > >> did you copy the mca parameters file to all the compute nodes as well? >> > > No need: my home directory is shared between the submit machine & the > nodes. my fault you'

Re: [slurm-users] slurm does not pass mca params to openmpi?

2018-07-19 Thread Michael Di Domenico
did you copy the mca parameters file to all the compute nodes as well? On Thu, Jul 19, 2018 at 11:37 AM, Roger Mason wrote: > Hello Gilles, > > gil...@rist.or.jp writes: > >> is the home directory mounted at the same place regardless this is a >> frontend or a compute node ? > > One host serves as

Re: [slurm-users] Job Resource Utilization Summary Email

2018-06-13 Thread Michael Di Domenico
when i run that i get perl: error: plugin_load_from_file: dlopen(accounting_storage_filetxt.so): accounting_storage_filetxt.so: undefined symbol: slurmdbd_conf the only reference i can find to slurmdbd_conf is in libslurmfull.so where it's marked with B so something is amiss with my environment

Re: [slurm-users] Job Resource Utilization Summary Email

2018-06-12 Thread Michael Di Domenico
how do you plan to collect all of the performance data? On Tue, Jun 12, 2018 at 12:06 PM, Hanby, Mike wrote: > Howdy, > > > > Is anyone aware of any existing job completion email scripts that provide a > summary of the jobs resource utilization? For example, something like: > > > > Job ID: 123456

Re: [slurm-users] slurm jobs are pending but resources are available

2018-04-16 Thread Michael Di Domenico
On Mon, Apr 16, 2018 at 6:35 AM, wrote: > > According to the above I have the backfill scheduler enabled with CPUs and > Memory configured as > resources. I have 56 CPUs and 256GB of RAM in my resource pool. I would > expect that the backfill > scheduler attempts to allocate the resources in orde

Re: [slurm-users] slurm and dates?

2018-02-26 Thread Michael Di Domenico
e what the user has set > > > Alternatively, you can set your preferred timezone with the TZ environment > variable when you issue your Slurm commands. > > > On 02/26/2018 08:31 AM, Michael Di Domenico wrote: >> >> On Sat, Feb 24, 2018 at 7:20 AM, Jessica Nettelbl

Re: [slurm-users] slurm and dates?

2018-02-26 Thread Michael Di Domenico
On Sat, Feb 24, 2018 at 7:20 AM, Jessica Nettelblad wrote: > So it seems to me, unix time is used for dates, which is then converted with > localtime for certain output to be readable for humans. Since Slurm is a C > program run in a Unix environment, that is also what I would expect. Thanks, the
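The behaviour described above (one stored unix timestamp, rendered differently depending on the timezone in effect) can be illustrated with a short sketch. This is not Slurm's code: the timestamp and the fixed UTC-5 offset are arbitrary values chosen for illustration, and `render` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def render(epoch_seconds: int, tz: timezone) -> str:
    """Format a unix timestamp in the given zone, without a zone suffix."""
    return datetime.fromtimestamp(epoch_seconds, tz=tz).strftime("%Y-%m-%dT%H:%M:%S")


if __name__ == "__main__":
    stamp = 1519400000  # arbitrary example timestamp
    utc = timezone.utc
    minus_five = timezone(timedelta(hours=-5))  # fixed offset, illustration only
    # The same instant prints as two different wall-clock strings.
    print("UTC  :", render(stamp, utc))
    print("UTC-5:", render(stamp, minus_five))
```

Because the printed string carries no timezone field, the reader has to know which zone was in effect when the command ran.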

Re: [slurm-users] slurm and dates?

2018-02-23 Thread Michael Di Domenico
when i run 'scontrol -o -d show job jobid=' i get a long list of variables some of those variables are spit out as dates. since the dates do not include a timezone field how should that date field be assumed to work? from the value i conclude that it's my localtime, but is the date being stored