On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote:
> The problem is to identify the cards physically from the information we
> have, like what's reported with nvidia-smi or available in
> /proc/driver/nvidia/gpus/*/information
> The serial number isn't shown for every type of GPU and I'm not sure
the error message suggests that when you built the slurm source it
wasn't able to find the nvml devel packages. if you look where you
installed slurm, in lib/slurm you should have a gpu_nvml.so. do you?
On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
wrote:
>
> typing error, should be --> **
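the lib/slurm check mentioned above can be scripted; a minimal sketch, assuming a hypothetical install prefix of /usr/local/slurm (adjust to whatever --prefix you configured):

```shell
# Check whether the NVML GPU plugin got built; if it is missing, slurm
# was most likely configured without the NVML development headers.
PREFIX="${SLURM_PREFIX:-/usr/local/slurm}"
if [ -e "$PREFIX/lib/slurm/gpu_nvml.so" ]; then
    echo "gpu_nvml.so present"
else
    echo "gpu_nvml.so missing"
fi
```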
i would imagine that slurm should be able to pull that data through
nvml, but i'd bet the hooks aren't in place.
On Fri, Jan 15, 2021 at 7:44 AM Ole Holm Nielsen
wrote:
>
> Hi,
>
> We have installed some new GPU nodes, and now users are asking for some
> sort of monitoring of GPU utilisation and
you can either make them up on your own, or use the names NVML spits
out in the slurmd.log file
On Tue, Dec 15, 2020 at 12:55 PM Erik Bryer wrote:
>
> Hi,
>
> Where do I get the gres names, e.g. "rtx2080ti", to use for my gpus in my
> node definitions in slurm.conf?
>
> Thanks,
> Erik
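the type names are arbitrary labels you pick; a minimal sketch of matching slurm.conf and gres.conf entries (the node name gpu01 and device paths are hypothetical):

```
# slurm.conf -- the Type string ("rtx2080ti") is free-form, but it
# must match what gres.conf declares on the node
NodeName=gpu01 Gres=gpu:rtx2080ti:4

# gres.conf on gpu01
Name=gpu Type=rtx2080ti File=/dev/nvidia[0-3]
```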
> 't be present if MySQL automatically reconnects to the server. So
> the reconnected state won't match the state expected by the client. Better
> for the client to know the connection failed and reconnect on its own to
> reestablish state.
>
>
>
> > On Dec 11, 2020, at
> -- Disable MySQL automatic reconnection.
can you expand on this? seems an 'odd' thing to disable.
On Thu, Dec 10, 2020 at 4:44 PM Tim Wickberg wrote:
>
> We are pleased to announce the availability of Slurm version 20.11.1.
>
> This includes a number of fixes made in the month since 20.11 w
was there ever a resolution to this? i'm seeing the same error message,
but i'm not adding in all the environment flags like the original poster.
On Wed, Jul 10, 2019 at 9:18 AM Daniel Letai wrote:
>
> Thank you Artem,
>
>
> I've made a mistake while typing the mail, in all cases it was
> 'OMPI_MCA_pml
what leads you to believe that you're getting 2 CPUs instead of 1?
'scontrol show job ' would be a helpful first step.
On Tue, Sep 29, 2020 at 9:56 AM Luecht, Jeff A wrote:
>
> I am working on my first ever SLURM cluster build for use as a resource
> manager in a JupyterHub Development environ
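to make the suggestion concrete: a minimal sketch of pulling the allocated CPU count out of the scontrol output (the JobId and field values here are made up, not from the poster's cluster):

```shell
# 'scontrol show job <id>' prints key=value pairs; NumCPUs shows what
# slurm actually allocated.  Parsing it out of a sample line:
line='JobId=12345 JobName=test NumNodes=1 NumCPUs=2 NumTasks=1'
echo "$line" | grep -Eo 'NumCPUs=[0-9]+'    # -> NumCPUs=2
```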
and it looks like i'll have to wait till 20.11 for a fix
https://bugs.schedmd.com/show_bug.cgi?id=9035
On Wed, Aug 26, 2020 at 11:20 AM Michael Di Domenico
wrote:
>
> looks like a similar issue is being tracked by:
> https://bugs.schedmd.com/show_bug.cgi?id=9441
>
> On Wed, A
looks like a similar issue is being tracked by:
https://bugs.schedmd.com/show_bug.cgi?id=9441
On Wed, Aug 26, 2020 at 11:04 AM Michael Di Domenico
wrote:
>
> sorry i meant to say, our slurm nodehealth script pushed the node to
> failed state. slurm itself wasn't doing this
>
sorry i meant to say, our slurm nodehealth script pushed the node to
failed state. slurm itself wasn't doing this
On Wed, Aug 26, 2020 at 11:02 AM Michael Di Domenico
wrote:
>
> i just upgraded from v18 to v20. Did something change in the node
> config validation? it used t
i just upgraded from v18 to v20. Did something change in the node
config validation? it used to be that if i started slurm on a compute
node that had lower than expected memory or was missing gpu's, slurm
would push a node into a failed state that i could see in sinfo -R.
now it seems to be loggi
I don't know the answer, but have you checked the SQL tables in the
database to see if the data you want is even being kept? it's possible
slurm is just throwing that value away. (i agree it would be nice if
it was retrievable)
On Wed, Jun 10, 2020 at 2:59 AM Kota Tsuyuzaki
wrote:
>
> > -j -l`
just curious: if you leave out the singleton, do you get the expected
behavior?
On Tue, Aug 27, 2019 at 9:42 AM Jarno van der Kolk wrote:
>
> Hi all,
>
> I'm still puzzled by the expected behaviour of the following:
> $ sbatch --hold fakejob.sh
> Submitted batch job 25909273
> $ sbatch --hold
On Fri, Aug 23, 2019 at 11:08 AM Stuart Barkley wrote:
>
> Is it possible for these email announcements to include the MD5 and
> SHA1 information that is contained on the download page. I like to
> verify the checksums using a different channel than that used to
> retrieve the software.
not that
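until the announcements carry the hashes, the ones on the download page can still be checked by hand with coreutils; a sketch using a stand-in file (the hash is computed on the spot, not a real release checksum):

```shell
# Simulate verifying a download: compute a SHA-1, write it in the
# "checksum  filename" format that sha1sum -c expects, then verify.
echo 'pretend tarball contents' > slurm-release.tar.bz2
sha1sum slurm-release.tar.bz2 > SHA1SUM
sha1sum -c SHA1SUM    # prints "slurm-release.tar.bz2: OK"
```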
On Tue, May 7, 2019 at 10:03 AM Mark Hahn wrote:
>
> > Some cluster sites need the creation of a workspace for the job in a
> >scratch area before the actual job submission, and on the other hand they
> >don't accept all characters as name of the workspace. Hence the plain job
> >name often can't
i've seen the same error, i don't think it's you. but i don't know
what the cause is either, i didn't have time to look into it so i
backed up to pmix 2.2.1 which seems to work fine
On Tue, Jan 22, 2019 at 12:56 AM Greg Wickham wrote:
>
>
> Hi All,
>
> I’m trying to build pmix 3.1.1 against slur
unfortunately, someone smarter than me will have to help further. I'm
not sure i see anything specifically wrong. The one thing i might try
is backing the software down to a 17.x release series. I recently
tried 18.x and had some issues. I can't say whether it'll be any
different, but you might
>
> So, we use LDAP for authentication and my UID is 1498, but I created a user
> in slurm using my login name. The default account for all users is "slt" Is
> this the cause of my problems?
> root@panther02 slurm# getent passwd lnicotra
> lnicotra:*:1498:1152:Lou Nicotra:
do you get anything additional in the slurm logs? have you tried
adding gres to the debugflags? what version of slurm are you running?
On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra wrote:
>
> Hi All, I have recently set up a slurm cluster with my servers and I'm
> running into an issue while submi
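the gres debug flag mentioned above is a slurm.conf setting; a minimal sketch (it can also be flipped at runtime with 'scontrol setdebugflags +gres'):

```
# slurm.conf -- verbose GRES detection logging in slurmd.log
DebugFlags=Gres
SlurmdDebug=debug2
```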
Is anyone on the list using maintenance partitions for broken nodes?
If so, how are you moving nodes between partitions?
The situation with my machines at the moment is that we have a steady
stream of new jobs coming into the queues, but broken nodes as well.
I'd like to fix those broken nodes an
On Thu, Jul 19, 2018 at 3:50 PM, Roger Mason wrote:
> Michael Di Domenico writes:
>
>> did you copy the mca parameters file to all the compute nodes as well?
>>
>
> No need: my home directory is shared between the submit machine & the
> nodes.
my fault you'
did you copy the mca parameters file to all the compute nodes as well?
On Thu, Jul 19, 2018 at 11:37 AM, Roger Mason wrote:
> Hello Gilles,
>
> gil...@rist.or.jp writes:
>
>> is the home directory mounted at the same place regardless this is a
>> frontend or a compute node ?
>
> One host serves as
when i run that i get
perl: error: plugin_load_from_file:
dlopen(accounting_storage_filetxt.so): accounting_storage_filetxt.so:
undefined symbol: slurmdbd_conf
the only reference i can find to slurmdbd_conf is in libslurmfull.so
where it's marked with B
so something is amiss with my environment
how do you plan to collect all of the performance data?
On Tue, Jun 12, 2018 at 12:06 PM, Hanby, Mike wrote:
> Howdy,
>
>
>
> Is anyone aware of any existing job completion email scripts that provide a
> summary of the jobs resource utilization? For example, something like:
>
>
>
> Job ID: 123456
On Mon, Apr 16, 2018 at 6:35 AM, wrote:
>
> According to the above I have the backfill scheduler enabled with CPUs and
> Memory configured as
> resources. I have 56 CPUs and 256GB of RAM in my resource pool. I would
> expect that the backfill
>scheduler attempts to allocate the resources in orde
e what the user has set
>
>
> Alternatively, you can set your preferred timezone with the TZ environment
> variable when you issue your Slurm commands.
>
>
> On 02/26/2018 08:31 AM, Michael Di Domenico wrote:
>>
>> On Sat, Feb 24, 2018 at 7:20 AM, Jessica Nettelbl
On Sat, Feb 24, 2018 at 7:20 AM, Jessica Nettelblad
wrote:
> So it seems to me, unix time is used for dates, which is then converted with
> localtime for certain output to be readable for humans. Since Slurm is a C
> program run in a Unix environment, that is also what I would expect.
Thanks, the
when i run 'scontrol -o -d show job jobid=' i get a long list of
variables. some of those variables are spit out as dates. since the
dates do not include a timezone field, how should that date field be
interpreted? from the value i conclude that it's my localtime, but is
the date being stored
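that matches how it behaves in practice: the timestamp is stored as Unix epoch seconds and rendered in the local timezone at display time. A quick illustration with GNU date (the epoch value is arbitrary):

```shell
# The same stored instant prints differently depending on TZ.
TZ=UTC              date -d @1519650000 '+%Y-%m-%d %H:%M %Z'   # 2018-02-26 13:00 UTC
TZ=America/New_York date -d @1519650000 '+%Y-%m-%d %H:%M %Z'   # 2018-02-26 08:00 EST
```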