Re: [slurm-users] How to view GPU indices of the completed jobs?

2020-06-11 Thread Kota Tsuyuzaki
Thank you David! Let me try it. Thinking about our case, I'll try to dump the debug info somewhere like syslog. Anyway, the idea should be useful for improving our system monitoring. Much appreciated. Best, Kota 露崎 浩太 (Kota Tsuyuzaki)
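
A minimal epilog sketch for the syslog idea, assuming SLURM_JOB_GPUS is present in the epilog environment; the script name and log tag are illustrative, not from the thread:

```bash
#!/bin/bash
# Hypothetical Epilog script (wired up via Epilog= in slurm.conf) that
# records which GPU indices a job used, so they can be looked up after
# the job completes. SLURM_JOB_GPUS holds the GPU indices allocated on
# this node, e.g. "0,1".
logger -t slurm-epilog \
  "job=${SLURM_JOB_ID} user=${SLURM_JOB_USER} gpus=${SLURM_JOB_GPUS:-none}"
exit 0
```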

[slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-06-11 Thread Rhian Resnick
We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but we instead find that only 2 of the 4 GPUs in each node get allocated. If we request 2 GPUs in the job and start two jobs, both jobs will
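
A common culprit with symptoms like this is cores or memory running out before the GPUs do (e.g. a DefMemPerNode or per-job memory default that hands half the node to each job), so the node and gres definitions are worth double-checking against the per-job defaults. A sketch with hypothetical names and sizes:

```
# slurm.conf (hypothetical values)
NodeName=gpu[01-04] Gres=gpu:4 CPUs=32 RealMemory=192000
# gres.conf on each node
NodeName=gpu[01-04] Name=gpu File=/dev/nvidia[0-3]
```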

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Short of getting on the system and kicking the tires myself, I’m fresh out of ideas. Does “sinfo -R” offer any hints?
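
For reference, what that looks like; the sample output line is illustrative:

```bash
# List the reason slurmctld recorded for each down/drained node:
sinfo -R
# REASON           USER   TIMESTAMP            NODELIST
# Not responding   slurm  2020-06-11T13:20:00  oled3
```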

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
I am able to get the output of "scontrol show node oled3"; oled3 is pinging fine, and the "scontrol ping" output shows "Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN", so all looks OK to me. Regards, Navin. On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy wrote: > So there seems to

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
I collected the log from slurmctld, and it says the below: [2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3 [2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 Nodelist=oled3 [2020-06-10T20:18:38.255] Resending TERMINATE_JOB request
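
Those repeated resends mean slurmctld never gets confirmation from oled3 that JobId=1252284 was cleaned up, which often points at a hung job step on the node. A quick check, using the hostname from the thread:

```bash
# slurmstepd process titles carry the job id, e.g. "slurmstepd: [1252284.batch]",
# so leftover steps are easy to spot:
ssh oled3 'ps -ef | grep -i slurm'
```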

[slurm-users] Oversubscribe until 100% load?

2020-06-11 Thread Holtgrewe, Manuel
Hi, I have some trouble understanding the "Oversubscribe" setting completely. What I would like is to oversubscribe nodes to increase overall throughput. - Is there a way to oversubscribe by a certain fraction, e.g. +20% or +50%? - Is there a way to stop if a node reaches 100% "Load"? Is there
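
On the first question: OverSubscribe takes a whole-number jobs-per-resource factor (FORCE:2, FORCE:3, ...), not a percentage, and nothing built-in halts allocation at a load threshold. A minimal sketch with hypothetical node names:

```
# slurm.conf: let up to 2 jobs share each resource in this partition,
# i.e. roughly +100% oversubscription
PartitionName=throughput Nodes=node[01-10] OverSubscribe=FORCE:2 State=UP
```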

Re: [slurm-users] [ext] Re: Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Holtgrewe, Manuel
Thanks, all, for your replies. I think I can figure out something that makes sense from here... -- Dr. Manuel Holtgrewe, Dipl.-Inform. Bioinformatician Core Unit Bioinformatics – CUBI Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the Helmholtz Association / Charité –

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to interpret it not reporting anything but the “log file” and “munge” messages. When you have it running attached to your window, is there any chance that sinfo or scontrol suggest that the node is actually all right?

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
Spare capacity is critical. At our scale, the few dozen cores typically left idle in our GPU nodes handle the vast majority of interactive work. > On Jun 11, 2020, at 8:38 AM, Paul Edmon wrote:

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Paul Edmon
That's pretty slick.  We just have a test, gpu_test, and remotedesktop partition set up for those purposes. What the real trick is making sure you have sufficient spare capacity that you can deliberately idle for these purposes.  If we were a smaller shop with less hardware I wouldn't be able

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
That’s close to what we’re doing, but without dedicated nodes. We have three back-end partitions (interactive, any-interactive, and gpu-interactive), but the users typically don’t have to consider that, due to our job_submit.lua plugin. All three partitions have a default of 2 hours, 1 core, 2
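
A sketch of that layout in slurm.conf terms; the partition names and the 2-hour default come from the message, while the node lists are assumptions:

```
PartitionName=interactive     Nodes=node[001-040]  DefaultTime=02:00:00
PartitionName=any-interactive Nodes=node[001-044]  DefaultTime=02:00:00
PartitionName=gpu-interactive Nodes=gpunode[01-04] DefaultTime=02:00:00
# a job_submit.lua plugin then routes interactive jobs into these
# partitions so users never have to name them explicitly
```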

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Jeffrey T Frey
Is the time on that node too far out-of-sync w.r.t. the slurmctld server? > On Jun 11, 2020, at 09:01 , navin srivastava wrote: > > I tried by executing the debug mode but there also it is not writing anything. > > i waited for about 5-10 minutes > > deda1x1452:/etc/sysconfig #
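
Worth checking, since munge rejects credentials when clocks drift too far apart. Using the hostnames from this thread:

```bash
# Compare the slurmctld host's clock with the node's:
date; ssh oled3 date
```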

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
If you omitted the “-D” that I suggested, then the daemon would have detached and logged nothing on the screen. In this case, you can still go to the slurmd log (use “scontrol show config | grep -i log” if you’re not sure where the logs are stored).
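
Spelled out, with typical default paths (a given site's may differ):

```bash
# Find where slurmd/slurmctld write their logs:
scontrol show config | grep -i log
# e.g. SlurmdLogFile = /var/log/slurm/slurmd.log   (path varies by site)
tail -f /var/log/slurm/slurmd.log
```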

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
I tried executing it in debug mode, but there also it is not writing anything. I waited for about 5-10 minutes. deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v No output on the terminal. The OS is SLES12-SP4. All firewall services are disabled. The recent change is the local hostname earlier

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Loris Bennett
Hi Manuel, "Holtgrewe, Manuel" writes: > Hi, > > is there a way to make interactive logins where users will use almost no > resources "always succeed"? > > In most of these interactive sessions, users will have mostly idle shells > running and do some batch job submissions. Is there a way to

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Sudeep Narayan Banerjee
Hi: please send the output of "cat /etc/redhat-release" or "cat /etc/lsb_release". Also, please share the detailed logs, probably available at /var/log/slurm/slurmctld.log, and the status of "ps -ef | grep slurmctld". Thanks & Regards, Sudeep Narayan Banerjee System Analyst |

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Marcus Boden
Hi Navin, try running slurmd in the foreground with increased verbosity: slurmd -D -v (add as many v's as you deem necessary). Hopefully it'll tell you more about why it times out. Best, Marcus On 6/11/20 2:24 PM, navin srivastava wrote: > Hi Team, > > when I am trying to start the slurmd
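
That is, something along these lines, run as root on the affected node:

```bash
# -D keeps slurmd in the foreground; each extra v raises verbosity
/usr/sbin/slurmd -D -vvv
```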

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Ole Holm Nielsen
On 11-06-2020 14:24, navin srivastava wrote: Hi Team, when I am trying to start the slurmd process I am getting the below error. 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon... 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Navin, As you can see, systemd provides very little service-specific information. For Slurm, you really need to go to the Slurm logs to find out what happened. Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the “-Dvvv” option, causing them to log

[slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
Hi Team, when I am trying to start the slurmd process I am getting the below error. 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon... 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
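
When systemd reports only a start timeout, its journal occasionally holds a bit more detail before the Slurm logs take over:

```bash
journalctl -u slurmd --since "30 min ago"   # systemd's view of the failed start
systemctl status slurmd -l                  # last log lines, unabridged
```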

[slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Holtgrewe, Manuel
Hi, is there a way to make interactive logins where users will use almost no resources "always succeed"? In most of these interactive sessions, users will have mostly idle shells running and do some batch job submissions. Is there a way to allocate "infinite virtual CPUs" on each node that