Re: [slurm-users] How to view GPU indices of the completed jobs?

2020-06-11 Thread Kota Tsuyuzaki
Thank you David! Let me try it.
Thinking about our case, I'll try to dump the debug info somewhere like 
syslog. In any case, the idea should be useful for improving our system monitoring. 
Much appreciated.

Best,
Kota 


露崎 浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki...@hco.ntt.co.jp
NTTソフトウェアイノベーションセンタ
分散処理基盤技術プロジェクト
0422-59-2837
-

> -Original Message-
> From: slurm-users  On Behalf Of David 
> Braun
> Sent: Thursday, June 11, 2020 3:50 AM
> To: Slurm User Community List 
> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
> 
> Hi Kota,
> 
> This is from the job template that I give to my users:
> 
> # Collect some information about the execution environment that may
> # be useful should we need to do some debugging.
> 
> echo "CREATING DEBUG DIRECTORY"
> echo
> 
> mkdir .debug_info
> module list > .debug_info/environ_modules 2>&1
> ulimit -a > .debug_info/limits 2>&1
> hostname > .debug_info/environ_hostname 2>&1
> env | grep SLURM > .debug_info/environ_slurm 2>&1
> env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
> env | grep OMPI > .debug_info/environ_openmpi 2>&1
> env > .debug_info/environ 2>&1
> 
> if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
>     echo "SAVING CUDA ENVIRONMENT"
>     echo
>     env | grep CUDA > .debug_info/environ_cuda 2>&1
> fi
> 
> You could add something like this to one of the SLURM prologs to save the GPU 
> list of jobs.
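> 
> For illustration only, a minimal Prolog sketch along those lines (it assumes
> SLURM_JOB_GPUS is exported to the Prolog environment on your Slurm version;
> check the Prolog/Epilog section of the slurm.conf man page) might be:
> 
> #!/bin/bash
> # Hypothetical Prolog snippet: record which GPU indices a job was given.
> gpus="${SLURM_JOB_GPUS:-none}"
> logger -t slurm-prolog "jobid=${SLURM_JOB_ID} node=$(hostname -s) gpu_idx=${gpus}"
> 
> You could then grep syslog later for the GPU indices of any completed job.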
> 
> Best,
> 
> David
> 
> 
> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki   > wrote:
> 
> 
>   Hello Guys,
> 
>   We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series),
>   and some of the GPUs seem to run into trouble with the jobs attached to them.
>   To investigate whether the trouble happens on the same GPUs, I'd like to get
>   the GPU indices of the completed jobs.
> 
>   In my understanding, `scontrol show job` can show the indices (as IDX in the
>   gres info) but cannot be used for completed jobs. `sacct -j` works for
>   completed jobs but won't print the indices.
> 
>   Is there any way (commands, configurations, etc...) to see the allocated GPU
>   indices for completed jobs?
> 
>   Best regards,
> 
>   
>   露崎 浩太 (Kota Tsuyuzaki)
>   kota.tsuyuzaki...@hco.ntt.co.jp 
>   NTTソフトウェアイノベーションセンタ
>   分散処理基盤技術プロジェクト
>   0422-59-2837
>   -
> 
> 
> 
> 
> 
> 






[slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-06-11 Thread Rhian Resnick
We have several users submitting single-GPU jobs to our cluster. We expected 
the jobs to fill each node and fully utilize the available GPUs, but instead we 
find that only 2 of the 4 GPUs in each node get allocated.

If we request 2 GPUs per job and start two jobs, both jobs will start on the 
same node, fully allocating it. We are puzzled about what is going on, and any 
hints are welcome.
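
For reference, this is the kind of quick check we have been running to see what 
Slurm thinks is allocated (the node name is just one of ours, the job ID is a 
placeholder):

# configured vs. allocated GRES/TRES on a GPU node
scontrol -d show node nodenviv100001 | grep -Ei 'gres|tres'
# GPU indices assigned to a running job
scontrol -d show job <jobid> | grep -i gres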

Thanks for your help,

Rhian



Example SBATCH Script
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=longq7-mri
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL
hostname
echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES

set | grep SLURM
nvidia-smi
sleep 500




gres.conf
#AutoDetect=nvml
Name=gpu Type=v100  File=/dev/nvidia0 Cores=0
Name=gpu Type=v100  File=/dev/nvidia1 Cores=1
Name=gpu Type=v100  File=/dev/nvidia2 Cores=2
Name=gpu Type=v100  File=/dev/nvidia3 Cores=3


slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=cluster
ControlMachine=cluster-slurm1.example.com
ControlAddr=10.116.0.11
BackupController=cluster-slurm2.example.com
BackupAddr=10.116.0.17
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
SchedulerPort=7321

RebootProgram="/usr/sbin/reboot"


AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid

GresTypes=gpu,mps,bandwidth

PrologFlags=x11
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=/etc/slurm/slurm.epilog.clean
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#bf_interval=10
#SchedulerAuth=
#SelectType=select/linear
# Cores and memory are consumable
#SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SchedulerParameters=bf_interval=10
SelectType=select/cons_res
SelectTypeParameters=CR_Core

FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=10
#PriorityWeightAge=1000
#PriorityWeightPartition=1
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
#
#
# Default values
# DefMemPerNode=64000
# DefCpuPerGPU=4
# DefMemPerCPU=4000
# DefMemPerGPU=16000



# OpenHPC default configuration
#TaskPlugin=task/affinity
TaskPlugin=task/affinity,task/cgroup
PropagateResourceLimitsExcept=MEMLOCK
TaskPluginParam=autobind=cores
#AccountingStorageType=accounting_storage/mysql
#StorageLoc=slurm_acct_db

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=cluster-slurmdbd1.example.com
#AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean


#PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 
DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF 
ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO  
Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]


# Partitions

# Group Limited Queues

# OIT DEBUG QUEUE
PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP 
AllowGroups=oit-hpc-admin

# RNA CHEM
PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 
MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] 
AllowGroups=gpu-rnachem

# V100's
PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 
MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri

# BIGDATA GRANT
PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 
MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 
AllowGroups=fau-bigdata,nsf-bigdata

PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10  AllowAccounts=ALL 
 Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata

# CogNeuroLab
PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 
AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]


# Standard queues

# OPEN TO ALL

#Short 

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Short of getting on the system and kicking the tires myself, I’m fresh out of 
ideas. Does “sinfo -R” offer any hints?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 11:31 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] unable to start slurmd process.

I am able to get the output of "scontrol show node oled3",
and oled3 is pinging fine.

The "scontrol ping" output shows:

Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN

so all looks ok to me.

Regards
Navin.



On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy 
mailto:andy.ri...@hpe.com>> wrote:
So there seems to be a failure to communicate between slurmctld and the oled3 
slurmd.

From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon.

From the head node, try “scontrol show node oled3”, and then ping the address 
that is shown for “NodeAddr=”

From: slurm-users 
[mailto:slurm-users-boun...@lists.schedmd.com]
 On Behalf Of navin srivastava
Sent: Thursday, June 11, 2020 10:40 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>
Subject: Re: [slurm-users] unable to start slurmd process.

i collected the log from slurmctld and it says below

[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284 
Nodelist=oled3
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3 
RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to: reboot-required
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
[2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
[2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
[2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN

sinfo says

OLED*   up   infinite  1 drain* oled3

while checking the node i feel node is healthy.

Regards
Navin

On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy 
mailto:andy.ri...@hpe.com>> wrote:
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to 
interpret it not reporting anything but the “log file” and “munge” messages. 
When you have it running attached to your window, is there any chance that 
sinfo or scontrol suggest that the node is actually all right? Perhaps 
something in /etc/sysconfig/slurm or the like is messed up?

If that’s not the case, I think my next step would be to follow up on someone 
else’s suggestion, 

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
I am able to get the output of "scontrol show node oled3",
and oled3 is pinging fine.

The "scontrol ping" output shows:

Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN

so all looks ok to me.

Regards
Navin.



On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy  wrote:

> So there seems to be a failure to communicate between slurmctld and the
> oled3 slurmd.
>
>
>
> From oled3, try “scontrol ping” to confirm that it can see the slurmctld
> daemon.
>
>
>
> From the head node, try “scontrol show node oled3”, and then ping the
> address that is shown for “NodeAddr=”
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 10:40 AM
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
>
>
> i collected the log from slurmctld and it says below
>
>
>
> [2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3
> RPC:REQUEST_TERMINATE_JOB : Communication connection failure
> [2020-06-11T07:14:50.210] error: Nodes oled3 not responding
> [2020-06-11T07:15:54.313] error: Nodes oled3 not responding
> [2020-06-11T07:17:34.407] error: Nodes oled3 not responding
> [2020-06-11T07:19:14.637] error: Nodes oled3 not responding
> [2020-06-11T07:19:54.313] update_node: node oled3 reason set to:
> reboot-required
> [2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
> [2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
> [2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
> [2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN
>
>
>
> sinfo says
>
>
>
> OLED*   up   infinite  1 drain* oled3
>
>
>
> while checking the node i feel node is healthy.
>
>
>
> Regards
>
> Navin
>
>
>
> On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy  wrote:
>
> Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess
> how to interpret it not reporting anything but the “log file” and “munge”
> messages. When you have it running attached to your window, is there any
> chance that sinfo or scontrol suggest that the node is actually all right?
> Perhaps something in /etc/sysconfig/slurm or the like is messed up?
>
>
>
> If that’s not the case, I think my next step would be to follow up on
> someone else’s suggestion, and scan the slurmctld.log file for the problem
> node name.
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 9:26 AM
> *To:* Slurm User Community List 
> *Subject:* Re: 

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
i collected the log from slurmctld and it says below

[2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284
Nodelist=oled3
[2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3
RPC:REQUEST_TERMINATE_JOB : Communication connection failure
[2020-06-11T07:14:50.210] error: Nodes oled3 not responding
[2020-06-11T07:15:54.313] error: Nodes oled3 not responding
[2020-06-11T07:17:34.407] error: Nodes oled3 not responding
[2020-06-11T07:19:14.637] error: Nodes oled3 not responding
[2020-06-11T07:19:54.313] update_node: node oled3 reason set to:
reboot-required
[2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
[2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
[2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
[2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN

sinfo says

OLED*   up   infinite  1 drain* oled3

While checking the node, I feel the node is healthy.

Regards
Navin

On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy  wrote:

> Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess
> how to interpret it not reporting anything but the “log file” and “munge”
> messages. When you have it running attached to your window, is there any
> chance that sinfo or scontrol suggest that the node is actually all right?
> Perhaps something in /etc/sysconfig/slurm or the like is messed up?
>
>
>
> If that’s not the case, I think my next step would be to follow up on
> someone else’s suggestion, and scan the slurmctld.log file for the problem
> node name.
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 9:26 AM
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
>
>
> Sorry Andy I missed to add.
>
> 1st i tried the  slurmd -Dvvv and it is not written anything
>
> slurmd: debug:  Log file re-opened
> slurmd: debug:  Munge authentication plugin loaded
>
>
>
> After that I waited for 10-20 minutes but no output and finally i pressed
> Ctrl^c.
>
>
>
> My doubt is in slurm.conf file:
>
>
>
> ControlMachine=deda1x1466
> ControlAddr=192.168.150.253
>
>
>
> The deda1x1466 is having a different interface with different IP which
> compute node is unable to ping but IP is pingable.
>
> could be one of the reason?
>
>
>
> but other nodes having the same config and there i am able to start the
> slurmd. so bit of confusion.
>
>
>
> Regards
>
> Navin.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Regards
>
> Navin.
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy  wrote:
>
> If you omitted the “-D” that I suggested, then the daemon would have
> detached and logged nothing on the screen. In this case, you can still go

[slurm-users] Oversubscribe until 100% load?

2020-06-11 Thread Holtgrewe, Manuel
Hi,

I have some trouble understanding the "Oversubscribe" setting completely. What 
I would like is to oversubscribe nodes to increase overall throughput.

- Is there a way to oversubscribe by a certain fraction, e.g. +20% or +50%?
- Is there a way to stop if a node reaches 100% "Load"?

Is there any good documentation available online that describes how to 
"carefully oversubscribe" a cluster?

Our users have pretty mixed workloads, e.g., with high parallelism in the first 
50% of the wall-clock time, then parts with mixed parallelism. Of course, we 
should educate our users better, but in some cases, it's very hard to improve 
because of the software used or workloads that cycle between being I/O and 
compute bound.

Thank you,
Manuel

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de


Re: [slurm-users] [ext] Re: Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Holtgrewe, Manuel
Thanks, all for your replies. I think I can figure out something that makes 
sense from here...

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de


From: slurm-users [slurm-users-boun...@lists.schedmd.com] on behalf of Renfro, 
Michael [ren...@tntech.edu]
Sent: Thursday, June 11, 2020 15:47
To: Slurm User Community List
Subject: [ext] Re: [slurm-users] Make "srun --pty bash -i" always schedule 
immediately

Spare capacity is critical. At our scale, the few dozen cores that were 
typically left idle in our GPU nodes handle the vast majority of interactive 
work.

> On Jun 11, 2020, at 8:38 AM, Paul Edmon  wrote:
>
>
> That's pretty slick.  We just have a test, gpu_test, and remotedesktop
> partition set up for those purposes.
>
> What the real trick is making sure you have sufficient spare capacity
> that you can deliberately idle for these purposes.  If we were a smaller
> shop with less hardware I wouldn't be able to set aside as much hardware
> for this.  If that was the case I would likely go the route of a single
> server with oversubscribe.
>
> You could try to do it with an active partition with no deliberately
> idle resources, but then you will want to make sure that your small jobs
> are really small and won't impact larger work.  I don't necessarily
> recommend that.  A single node with oversubscribe should be sufficient.
> If you can't spare a single node then a VM would do the job.
>
> -Paul Edmon-
>
> On 6/11/2020 9:28 AM, Renfro, Michael wrote:
>> That’s close to what we’re doing, but without dedicated nodes. We have three 
>> back-end partitions (interactive, any-interactive, and gpu-interactive), but 
>> the users typically don’t have to consider that, due to our job_submit.lua 
>> plugin.
>>
>> All three partitions have a default of 2 hours, 1 core, 2 GB RAM, but users 
>> could request more cores and RAM (but not as much as a batch job — we used 
>> https://hpcbios.readthedocs.io/en/latest/HPCBIOS_05-05.html as a starting 
>> point).
>>
>> If a GPU is requested, the job goes into the gpu-interactive partition and 
>> is limited to 16 cores per node (we have 28 cores per GPU node, but GPU jobs 
>> can’t keep them all busy)
>>
>> If less than 12 cores per node is requested, the job goes into the 
>> any-interactive partition and could be handled on any of our GPU or non-GPU 
>> nodes.
>>
>> If more than 12 cores per node is requested, the job goes into the 
>> interactive partition and is handled by only a non-GPU node.
>>
>> I haven’t needed to QOS the interactive partitions, but that’s not a bad 
>> idea.
>>
>>> On Jun 11, 2020, at 8:19 AM, Paul Edmon  wrote:
>>>
>>> Generally the way we've solved this is to set aside a specific set of
>>> nodes in a partition for interactive sessions.  We deliberately scale
>>> the size of the resources so that users will always run immediately and
>>> we also set a QoS on the partition to make it so that no one user can
>>> dominate the partition.
>>>
>>> -Paul Edmon-
>>>
>>> On 6/11/2020 8:49 AM, Loris Bennett wrote:
 Hi Manual,

 "Holtgrewe, Manuel"  writes:

> Hi,
>
> is there a way to make interactive logins where users will use almost no 
> resources "always succeed"?
>
> In most of these interactive sessions, users will have mostly idle shells 
> running and do some batch job submissions. Is there a way to allocate 
> "infinite virtual cpus" on each node that can only be allocated to
> interactive jobs?
 I have never done this but setting "OverSubscribe" in the appropriate
 place might be what you are looking for.

   https://slurm.schedmd.com/cons_res_share.html

 Personally, however, I would be a bit wary of doing this.  What if
 someone does start a multithreaded process on purpose or by accident?

 Wouldn't just using cgroups on your login node achieve what you want?

 Cheers,

 Loris

>




Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to 
interpret it not reporting anything but the “log file” and “munge” messages. 
When you have it running attached to your window, is there any chance that 
sinfo or scontrol suggest that the node is actually all right? Perhaps 
something in /etc/sysconfig/slurm or the like is messed up?

If that’s not the case, I think my next step would be to follow up on someone 
else’s suggestion, and scan the slurmctld.log file for the problem node name.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 9:26 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] unable to start slurmd process.

Sorry Andy, I missed adding this.
First I tried "slurmd -Dvvv" and it did not write anything beyond:
slurmd: debug:  Log file re-opened
slurmd: debug:  Munge authentication plugin loaded

After that I waited for 10-20 minutes but no output and finally i pressed 
Ctrl^c.

My doubt is in slurm.conf file:

ControlMachine=deda1x1466
ControlAddr=192.168.150.253

The deda1x1466 host has a different interface with a different IP, which the 
compute node is unable to ping, but the IP itself is pingable.
Could this be one of the reasons?

But other nodes have the same config and there I am able to start slurmd, 
so I am a bit confused.

Regards
Navin.








Regards
Navin.






On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy 
mailto:andy.ri...@hpe.com>> wrote:
If you omitted the “-D” that I suggested, then the daemon would have detached 
and logged nothing on the screen. In this case, you can still go to the slurmd 
log (use “scontrol show config | grep -i log” if you’re not sure where the logs 
are stored).

From: slurm-users 
[mailto:slurm-users-boun...@lists.schedmd.com]
 On Behalf Of navin srivastava
Sent: Thursday, June 11, 2020 9:01 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>
Subject: Re: [slurm-users] unable to start slurmd process.

I tried by executing the debug mode but there also it is not writing anything.

i waited for about 5-10 minutes

deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v

No output on terminal.

The OS is SLES12-SP4 . All firewall services are disabled.

The recent change is the local hostname earlier it was with local hostname 
node1,node2,etc but we have moved to dns based hostname which is deda

NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] 
Sockets=2 CoresPerSocket=10 State=UNKNOWN
other than this it is fine but after that i have done several time slurmd 
process started on the node and it works fine but now i am seeing this issue 
today.

Regards
Navin.









On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy 
mailto:andy.ri...@hpe.com>> wrote:
Navin,

As you can see, systemd provides very little service-specific information. For 
slurm, you really need to go to the slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is 
to run them with the “-Dvvv” option, causing them to log to your window, and 
usually causing the problem to become immediately obvious.

For example,

# /usr/local/slurm/sbin/slurmd -D

Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail when 
you run it this way, it’s time to look elsewhere.

Andy

From: slurm-users 
[mailto:slurm-users-boun...@lists.schedmd.com]
 On Behalf Of navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered 
failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with 
result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
session opened for user root by (uid=0)

Slurm version is 17.11.8

The server and slurm is running from long time and we have not made any changes 
but today when i am starting it is giving this error message.
Any idea what could be wrong here.

Regards
Navin.






Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
Spare capacity is critical. At our scale, the few dozen cores that were 
typically left idle in our GPU nodes handle the vast majority of interactive 
work.

> On Jun 11, 2020, at 8:38 AM, Paul Edmon  wrote:
> 
> 
> That's pretty slick.  We just have a test, gpu_test, and remotedesktop
> partition set up for those purposes.
> 
> What the real trick is making sure you have sufficient spare capacity
> that you can deliberately idle for these purposes.  If we were a smaller
> shop with less hardware I wouldn't be able to set aside as much hardware
> for this.  If that was the case I would likely go the route of a single
> server with oversubscribe.
> 
> You could try to do it with an active partition with no deliberately
> idle resources, but then you will want to make sure that your small jobs
> are really small and won't impact larger work.  I don't necessarily
> recommend that.  A single node with oversubscribe should be sufficient.
> If you can't spare a single node then a VM would do the job.
> 
> -Paul Edmon-
> 
> On 6/11/2020 9:28 AM, Renfro, Michael wrote:
>> That’s close to what we’re doing, but without dedicated nodes. We have three 
>> back-end partitions (interactive, any-interactive, and gpu-interactive), but 
>> the users typically don’t have to consider that, due to our job_submit.lua 
>> plugin.
>> 
>> All three partitions have a default of 2 hours, 1 core, 2 GB RAM, but users 
>> could request more cores and RAM (but not as much as a batch job — we used 
>> https://hpcbios.readthedocs.io/en/latest/HPCBIOS_05-05.html as a starting 
>> point).
>> 
>> If a GPU is requested, the job goes into the gpu-interactive partition and 
>> is limited to 16 cores per node (we have 28 cores per GPU node, but GPU jobs 
>> can’t keep them all busy)
>> 
>> If less than 12 cores per node is requested, the job goes into the 
>> any-interactive partition and could be handled on any of our GPU or non-GPU 
>> nodes.
>> 
>> If more than 12 cores per node is requested, the job goes into the 
>> interactive partition and is handled by only a non-GPU node.
>> 
>> I haven’t needed to QOS the interactive partitions, but that’s not a bad 
>> idea.
>> 
>>> On Jun 11, 2020, at 8:19 AM, Paul Edmon  wrote:
>>> 
>>> Generally the way we've solved this is to set aside a specific set of
>>> nodes in a partition for interactive sessions.  We deliberately scale
>>> the size of the resources so that users will always run immediately and
>>> we also set a QoS on the partition to make it so that no one user can
>>> dominate the partition.
>>> 
>>> -Paul Edmon-
>>> 
>>> On 6/11/2020 8:49 AM, Loris Bennett wrote:
 Hi Manual,
 
 "Holtgrewe, Manuel"  writes:
 
> Hi,
> 
> is there a way to make interactive logins where users will use almost no 
> resources "always succeed"?
> 
> In most of these interactive sessions, users will have mostly idle shells 
> running and do some batch job submissions. Is there a way to allocate 
> "infinite virtual cpus" on each node that can only be allocated to
> interactive jobs?
 I have never done this but setting "OverSubscribe" in the appropriate
 place might be what you are looking for.
 
   https://slurm.schedmd.com/cons_res_share.html
 
 Personally, however, I would be a bit wary of doing this.  What if
 someone does start a multithreaded process on purpose or by accident?
 
 Wouldn't just using cgroups on your login node achieve what you want?
 
 Cheers,
 
 Loris
 
> 



Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Paul Edmon
That's pretty slick.  We just have a test, gpu_test, and remotedesktop 
partition set up for those purposes.


The real trick is making sure you have sufficient spare capacity 
that you can deliberately idle for these purposes.  If we were a smaller 
shop with less hardware I wouldn't be able to set aside as much hardware 
for this.  If that was the case I would likely go the route of a single 
server with oversubscribe.


You could try to do it with an active partition with no deliberately 
idle resources, but then you will want to make sure that your small jobs 
are really small and won't impact larger work.  I don't necessarily 
recommend that.  A single node with oversubscribe should be sufficient.  
If you can't spare a single node then a VM would do the job.


-Paul Edmon-

On 6/11/2020 9:28 AM, Renfro, Michael wrote:

That’s close to what we’re doing, but without dedicated nodes. We have three 
back-end partitions (interactive, any-interactive, and gpu-interactive), but 
the users typically don’t have to consider that, due to our job_submit.lua 
plugin.

All three partitions have a default of 2 hours, 1 core, 2 GB RAM, but users 
could request more cores and RAM (but not as much as a batch job — we used 
https://hpcbios.readthedocs.io/en/latest/HPCBIOS_05-05.html as a starting 
point).

If a GPU is requested, the job goes into the gpu-interactive partition and is 
limited to 16 cores per node (we have 28 cores per GPU node, but GPU jobs can’t 
keep them all busy)

If less than 12 cores per node is requested, the job goes into the 
any-interactive partition and could be handled on any of our GPU or non-GPU 
nodes.

If more than 12 cores per node is requested, the job goes into the interactive 
partition and is handled by only a non-GPU node.

I haven’t needed to QOS the interactive partitions, but that’s not a bad idea.


On Jun 11, 2020, at 8:19 AM, Paul Edmon  wrote:

Generally the way we've solved this is to set aside a specific set of
nodes in a partition for interactive sessions.  We deliberately scale
the size of the resources so that users will always run immediately and
we also set a QoS on the partition to make it so that no one user can
dominate the partition.

-Paul Edmon-

On 6/11/2020 8:49 AM, Loris Bennett wrote:

Hi Manual,

"Holtgrewe, Manuel"  writes:


Hi,

is there a way to make interactive logins where users will use almost no resources 
"always succeed"?

In most of these interactive sessions, users will have mostly idle shells running and do 
some batch job submissions. Is there a way to allocate "infinite virtual cpus" 
on each node that can only be allocated to
interactive jobs?

I have never done this but setting "OverSubscribe" in the appropriate
place might be what you are looking for.

   https://slurm.schedmd.com/cons_res_share.html

Personally, however, I would be a bit wary of doing this.  What if
someone does start a multithreaded process on purpose or by accident?

Wouldn't just using cgroups on your login node achieve what you want?

Cheers,

Loris





Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Renfro, Michael
That’s close to what we’re doing, but without dedicated nodes. We have three 
back-end partitions (interactive, any-interactive, and gpu-interactive), but 
the users typically don’t have to consider that, due to our job_submit.lua 
plugin.

All three partitions have a default of 2 hours, 1 core, 2 GB RAM, but users 
could request more cores and RAM (but not as much as a batch job — we used 
https://hpcbios.readthedocs.io/en/latest/HPCBIOS_05-05.html as a starting 
point).

If a GPU is requested, the job goes into the gpu-interactive partition and is 
limited to 16 cores per node (we have 28 cores per GPU node, but GPU jobs can’t 
keep them all busy)

If less than 12 cores per node is requested, the job goes into the 
any-interactive partition and could be handled on any of our GPU or non-GPU 
nodes.

If more than 12 cores per node is requested, the job goes into the interactive 
partition and is handled by only a non-GPU node.

I haven’t needed to QOS the interactive partitions, but that’s not a bad idea.
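
As a rough sketch, the partition side of that setup looks something like the 
following (node lists and exact values here are illustrative, not our real config):

PartitionName=interactive     Nodes=node[001-040]                DefaultTime=02:00:00 DefMemPerCPU=2048 State=UP
PartitionName=any-interactive Nodes=node[001-040],gpunode[01-10] DefaultTime=02:00:00 DefMemPerCPU=2048 State=UP
PartitionName=gpu-interactive Nodes=gpunode[01-10]               DefaultTime=02:00:00 DefMemPerCPU=2048 MaxCPUsPerNode=16 State=UP

The routing between them (GPU requested, fewer or more than 12 cores) happens in 
job_submit.lua, not in slurm.conf.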

> On Jun 11, 2020, at 8:19 AM, Paul Edmon  wrote:
> 
> Generally the way we've solved this is to set aside a specific set of
> nodes in a partition for interactive sessions.  We deliberately scale
> the size of the resources so that users will always run immediately and
> we also set a QoS on the partition to make it so that no one user can
> dominate the partition.
> 
> -Paul Edmon-
> 
> On 6/11/2020 8:49 AM, Loris Bennett wrote:
>> Hi Manual,
>> 
>> "Holtgrewe, Manuel"  writes:
>> 
>>> Hi,
>>> 
>>> is there a way to make interactive logins where users will use almost no 
>>> resources "always succeed"?
>>> 
>>> In most of these interactive sessions, users will have mostly idle shells 
>>> running and do some batch job submissions. Is there a way to allocate 
>>> "infinite virtual cpus" on each node that can only be allocated to
>>> interactive jobs?
>> I have never done this but setting "OverSubscribe" in the appropriate
>> place might be what you are looking for.
>> 
>>   https://slurm.schedmd.com/cons_res_share.html
>> 
>> Personally, however, I would be a bit wary of doing this.  What if
>> someone does start a multithreaded process on purpose or by accident?
>> 
>> Wouldn't just using cgroups on your login node achieve what you want?
>> 
>> Cheers,
>> 
>> Loris
>> 
> 



Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Jeffrey T Frey
Is the time on that node too far out-of-sync w.r.t. the slurmctld server?



> On Jun 11, 2020, at 09:01 , navin srivastava  wrote:
> 
> I tried by executing the debug mode but there also it is not writing anything.
> 
> i waited for about 5-10 minutes
> 
> deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v
> 
> No output on terminal. 
> 
> The OS is SLES12-SP4 . All firewall services are disabled.
> 
> The recent change is the local hostname earlier it was with local hostname 
> node1,node2,etc but we have moved to dns based hostname which is deda
> 
> NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] 
> Sockets=2 CoresPerSocket=10 State=UNKNOWN
> other than this it is fine but after that i have done several time slurmd 
> process started on the node and it works fine but now i am seeing this issue 
> today.
> 
> Regards
> Navin.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy  wrote:
> Navin,
> 
>  
> 
> As you can see, systemd provides very little service-specific information. 
> For slurm, you really need to go to the slurm logs to find out what happened.
> 
>  
> 
> Hint: A quick way to identify problems like this with slurmd and slurmctld is 
> to run them with the “-Dvvv” option, causing them to log to your window, and 
> usually causing the problem to become immediately obvious.
> 
>  
> 
> For example,
> 
>  
> 
> # /usr/local/slurm/sbin/slurmd -D
> 
>  
> 
> Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail when 
> you run it this way, it’s time to look elsewhere.
> 
>  
> 
> Andy
> 
>  
> 
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
> navin srivastava
> Sent: Thursday, June 11, 2020 8:25 AM
> To: Slurm User Community List 
> Subject: [slurm-users] unable to start slurmd process.
> 
>  
> 
> Hi Team,
> 
>  
> 
> when i am trying to start the slurmd process i am getting the below error.
> 
>  
> 
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node 
> daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
> operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
> daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit 
> entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed 
> with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
> session opened for user root by (uid=0)
> 
>  
> 
> Slurm version is 17.11.8
> 
>  
> 
> The server and slurm is running from long time and we have not made any 
> changes but today when i am starting it is giving this error message. 
> 
> Any idea what could be wrong here.
> 
>  
> 
> Regards
> 
> Navin.
> 
>  
> 
>  
> 
>  
> 
>  
> 




Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
If you omitted the “-D” that I suggested, then the daemon would have detached 
and logged nothing on the screen. In this case, you can still go to the slurmd 
log (use “scontrol show config | grep -i log” if you’re not sure where the logs 
are stored).

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 9:01 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] unable to start slurmd process.

I tried by executing the debug mode but there also it is not writing anything.

i waited for about 5-10 minutes

deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v

No output on terminal.

The OS is SLES12-SP4 . All firewall services are disabled.

The recent change is the local hostname earlier it was with local hostname 
node1,node2,etc but we have moved to dns based hostname which is deda

NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] 
Sockets=2 CoresPerSocket=10 State=UNKNOWN
other than this it is fine but after that i have done several time slurmd 
process started on the node and it works fine but now i am seeing this issue 
today.

Regards
Navin.









On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy 
mailto:andy.ri...@hpe.com>> wrote:
Navin,

As you can see, systemd provides very little service-specific information. For 
slurm, you really need to go to the slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is 
to run them with the “-Dvvv” option, causing them to log to your window, and 
usually causing the problem to become immediately obvious.

For example,

# /usr/local/slurm/sbin/slurmd -D

Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail when 
you run it this way, it’s time to look elsewhere.

Andy

From: slurm-users 
[mailto:slurm-users-boun...@lists.schedmd.com]
 On Behalf Of navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered 
failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with 
result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
session opened for user root by (uid=0)

Slurm version is 17.11.8

The server and slurm is running from long time and we have not made any changes 
but today when i am starting it is giving this error message.
Any idea what could be wrong here.

Regards
Navin.






Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
I tried by executing the debug mode but there also it is not writing
anything.

i waited for about 5-10 minutes

deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v

No output on terminal.

The OS is SLES12-SP4 . All firewall services are disabled.

The recent change is the hostname: earlier we used the local hostnames
(node1, node2, etc.) but we have moved to DNS-based hostnames, which start with deda

NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12]
Sockets=2 CoresPerSocket=10 State=UNKNOWN
Other than this it is fine; after that change I have started the slurmd
process on the node several times and it worked fine, but now I am seeing this
issue today.

Regards
Navin.









On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy  wrote:

> Navin,
>
>
>
> As you can see, systemd provides very little service-specific information.
> For slurm, you really need to go to the slurm logs to find out what
> happened.
>
>
>
> Hint: A quick way to identify problems like this with slurmd and slurmctld
> is to run them with the “-Dvvv” option, causing them to log to your window,
> and usually causing the problem to become immediately obvious.
>
>
>
> For example,
>
>
>
> # /usr/local/slurm/sbin/slurmd -D
>
>
>
> Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail
> when you run it this way, it’s time to look elsewhere.
>
>
>
> Andy
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 8:25 AM
> *To:* Slurm User Community List 
> *Subject:* [slurm-users] unable to start slurmd process.
>
>
>
> Hi Team,
>
>
>
> when i am trying to start the slurmd process i am getting the below error.
>
>
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
> daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
> operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm
> node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit
> entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed
> with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]:
> pam_unix(crond:session): session opened for user root by (uid=0)
>
>
>
> Slurm version is 17.11.8
>
>
>
> The server and slurm is running from long time and we have not made any
> changes but today when i am starting it is giving this error message.
>
> Any idea what could be wrong here.
>
>
>
> Regards
>
> Navin.
>
>
>
>
>
>
>
>
>


Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Loris Bennett
Hi Manuel,

"Holtgrewe, Manuel"  writes:

> Hi,
>
> is there a way to make interactive logins where users will use almost no 
> resources "always succeed"?
>
> In most of these interactive sessions, users will have mostly idle shells 
> running and do some batch job submissions. Is there a way to allocate 
> "infinite virtual cpus" on each node that can only be allocated to
> interactive jobs?

I have never done this but setting "OverSubscribe" in the appropriate
place might be what you are looking for.

  https://slurm.schedmd.com/cons_res_share.html
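
A minimal sketch of what that could look like in slurm.conf (the partition
name, node list, and share factor here are made up, not a recommendation):

  # let up to 4 jobs share each core on an interactive partition
  PartitionName=interactive Nodes=login[01-02] OverSubscribe=FORCE:4 MaxTime=08:00:00 State=UP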

Personally, however, I would be a bit wary of doing this.  What if
someone does start a multithreaded process on purpose or by accident?

Wouldn't just using cgroups on your login node achieve what you want?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Sudeep Narayan Banerjee

Hi, please share the output of the following:

cat /etc/redhat-release

OR

cat /etc/lsb-release

Also, please let us know the detailed log reports, probably available at 
/var/log/slurm/slurmctld.log

and the status of:
ps -ef | grep slurmctld

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System Technology Facility
Academic Block 5 | Room 110
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355 INDIA

On 11/06/20 5:54 pm, navin srivastava wrote:

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node 
daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: 
Start operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start 
Slurm node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: 
Unit entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: 
Failed with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: 
pam_unix(crond:session): session opened for user root by (uid=0)


Slurm version is 17.11.8

The server and slurm is running from long time and we have not made 
any changes but today when i am starting it is giving this error message.

Any idea what could be wrong here.

Regards
Navin.






Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Marcus Boden
Hi Navin,

try running slurmd in the foreground with increased verbosity:
slurmd -D -v (add as many v as you deem necessary)

Hopefully it'll tell you more about why it times out.

Best,
Marcus


On 6/11/20 2:24 PM, navin srivastava wrote:
> Hi Team,
> 
> when i am trying to start the slurmd process i am getting the below error.
> 
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
> daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
> operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm
> node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit
> entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed
> with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session):
> session opened for user root by (uid=0)
> 
> Slurm version is 17.11.8
> 
> The server and slurm is running from long time and we have not made any
> changes but today when i am starting it is giving this error message.
> Any idea what could be wrong here.
> 
> Regards
> Navin.
> 

-- 
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Tel.:   +49 (0)551 201-2191
E-Mail: mbo...@gwdg.de
---
Gesellschaft fuer wissenschaftliche
Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL:http://www.gwdg.de
E-Mail: g...@gwdg.de
Tel.:   +49 (0)551 201-1510
Fax:+49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender:
Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Goettingen
Registergericht: Goettingen
Handelsregister-Nr. B 598
---





Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Ole Holm Nielsen

On 11-06-2020 14:24, navin srivastava wrote:

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node 
daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm 
node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit 
entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: 
Failed with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: 
pam_unix(crond:session): session opened for user root by (uid=0)


Slurm version is 17.11.8

The server and slurm is running from long time and we have not made any 
changes but today when i am starting it is giving this error message.

Any idea what could be wrong here.


Which OS do you run this ancient Slurm version on?  There could be many 
reasons why slurmd refuses to start, such as network, DNS, firewall, etc.


You should check the log file in /var/log/slurm/

You could start the slurmd from the command line, adding one or more -v 
for verbose logging:


$ slurmd -v -v

/Ole



Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Navin,

As you can see, systemd provides very little service-specific information. For 
slurm, you really need to go to the slurm logs to find out what happened.

Hint: A quick way to identify problems like this with slurmd and slurmctld is 
to run them with the “-Dvvv” option, causing them to log to your window, and 
usually causing the problem to become immediately obvious.

For example,

# /usr/local/slurm/sbin/slurmd -D

Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail when 
you run it this way, it’s time to look elsewhere.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
navin srivastava
Sent: Thursday, June 11, 2020 8:25 AM
To: Slurm User Community List 
Subject: [slurm-users] unable to start slurmd process.

Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start 
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node 
daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered 
failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with 
result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): 
session opened for user root by (uid=0)

Slurm version is 17.11.8

The server and slurm is running from long time and we have not made any changes 
but today when i am starting it is giving this error message.
Any idea what could be wrong here.

Regards
Navin.






[slurm-users] unable to start slurmd process.

2020-06-11 Thread navin srivastava
Hi Team,

when i am trying to start the slurmd process i am getting the below error.

2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
daemon...
2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
operation timed out. Terminating.
2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm
node daemon.
2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit
entered failed state.
2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed
with result 'timeout'.
2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session):
session opened for user root by (uid=0)

Slurm version is 17.11.8

The server and Slurm have been running for a long time and we have not made any
changes, but today when I start it I get this error message.
Any idea what could be wrong here.

Regards
Navin.


[slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Holtgrewe, Manuel
Hi,

is there a way to make interactive logins where users will use almost no 
resources "always succeed"?

In most of these interactive sessions, users will have mostly idle shells 
running and do some batch job submissions. Is there a way to allocate "infinite 
virtual cpus" on each node that can only be allocated to interactive jobs?

Thanks,
Manuel

--
Dr. Manuel Holtgrewe, Dipl.-Inform.
Bioinformatician
Core Unit Bioinformatics – CUBI
Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in the 
Helmholtz Association / Charité – Universitätsmedizin Berlin

Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
Postal Address: Chariteplatz 1, 10117 Berlin

E-Mail: manuel.holtgr...@bihealth.de
Phone: +49 30 450 543 607
Fax: +49 30 450 7 543 901
Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de  www.charite.de