Hi Andy,
Thank you for your response.
Slurm config:
ControlMachine=masternode**
AuthType=auth/munge
CheckpointType=checkpoint/blcr
CryptoType=crypto/munge
JobCheckpointDir=/pathtocheckpoints**
MpiDefault=openmpi
MpiParams=ports=00000-00000**
ReturnToService=2
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=0000**
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=0000**
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurmuser**
SlurmdUser=slurmduser**
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
JobFileAppend=1
SallocDefaultCommand="$SHELL"
#
#
# SCHEDULING
DefMemPerNode=1
#######################
FastSchedule=0
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=00000**
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
#
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
PriorityDecayHalfLife=60-0
PriorityWeightFairshare=1000
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=nameofthecluster
AccountingStorageHost=masternode
#DebugFlags=
###DebugFlags=NO_CONF_HASH
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug5
#SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug5
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# Job Preemption
PreemptType=preempt/partition_prio
##PreemptMode=SUSPEND,GANG
PreemptMode=REQUEUE
#### GroupUpdate
CacheGroups=0
###GroupUpdateForce=1
###GroupUpdateTime=900
#
# COMPUTE NODES
NodeName=master** State=DOWN
NodeName=server[001-028]** State=UNKNOWN Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=128000
NodeName=bigserver01** State=UNKNOWN Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=775000
# Partitions
PartitionName=DEFAULT Nodes=server[001-028]** Shared=YES
#
PartitionName=name1 MaxTime=INFINITE State=UP Priority=1 PreemptMode=REQUEUE
Lines marked with "**" are the ones where I replaced private data.
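For anyone who wants to sanitize a slurm.conf the same way before posting it, here is a minimal sketch of the idea. It assumes simple top-level Key=Value lines; the key list, the placeholder text, and the function names are hypothetical choices for illustration, not something Slurm provides.

```python
# Hypothetical set of keys whose values should be hidden; adjust for your site.
PRIVATE_KEYS = {"ControlMachine", "SlurmUser", "SlurmdUser",
                "SlurmctldPort", "SlurmdPort", "JobCheckpointDir"}


def mask_line(line, private_keys=PRIVATE_KEYS, placeholder="REDACTED"):
    """Replace the value of a private Key=Value line and flag it with '**'."""
    stripped = line.strip()
    # Leave blank lines and comments untouched.
    if not stripped or stripped.startswith("#"):
        return line
    key, sep, _value = stripped.partition("=")
    if sep and key in private_keys:
        return f"{key}={placeholder}**"
    return line


def mask_config(text, private_keys=PRIVATE_KEYS):
    """Apply mask_line to every line of a config given as one string."""
    return "\n".join(mask_line(l, private_keys) for l in text.splitlines())
```

Lines such as NodeName=... carry several values on one line, so this sketch leaves them alone and they still need to be edited by hand.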
Slurm was installed from the RPMs.
mpich and openmpi were each built with the default ./configure options, adding only
--prefix=/sotware/storage/path
If any additional information is needed, please ask.
Best Wishes,
Igor
On 11/08/15 15:42, Andy Riebs wrote:
To help figure out what is going on, please send the following (to the list,
not to me!):
* Your Slurm configuration file (with private data like IP addresses and
node names removed)
* Your ./configure command lines for
  * Slurm
  * Mpich
  * OpenMPI
* The command(s) that you use to submit the job
Andy
On 08/11/2015 01:57 AM, Igor Chebotar wrote:
Hi all,
We are experiencing a problem with a few applications that use MPI.
When we run a job, the processes appear to run well without any
errors, but when the job ends we always get errors like the following
in the Slurm output:
Slurm output file:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 25147 RUNNING AT bee002
= EXIT CODE: 2
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
srun: error: bee001: task 0: Exited with exit code 7
srun: error: _server_read: fd 19 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 2
[mpiexec@bee001] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@bee001] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@bee001] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
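To see at a glance which process the banner blames and which tasks srun reports as failed, the relevant lines can be pulled out of the output file with a short script. This is only a sketch that assumes the exact line formats shown in the output above; the function names are made up for illustration, and srun lines that report several tasks at once are not handled.

```python
import re

# Banner lines such as "= PID 25147 RUNNING AT bee002" and "= EXIT CODE: 2".
HYDRA_PID = re.compile(r"^=\s+PID (\d+) RUNNING AT (\S+)")
HYDRA_EXIT = re.compile(r"^=\s+EXIT CODE: (\d+)")
# Lines such as "srun: error: bee001: task 0: Exited with exit code 7".
SRUN_EXIT = re.compile(
    r"^srun: error: (\S+): task (\d+): Exited with exit code (\d+)"
)


def first_bad_process(log_text):
    """Return pid, node and exit code from the first termination banner found."""
    info = {}
    for line in log_text.splitlines():
        line = line.strip()
        m = HYDRA_PID.match(line)
        if m and "pid" not in info:
            info["pid"] = int(m.group(1))
            info["node"] = m.group(2)
            continue
        m = HYDRA_EXIT.match(line)
        if m and "exit_code" not in info:
            info["exit_code"] = int(m.group(1))
    return info


def failed_tasks(log_text):
    """Return (node, task, exit_code) for each single-task srun error line."""
    results = []
    for line in log_text.splitlines():
        m = SRUN_EXIT.match(line.strip())
        if m:
            results.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return results
```

Calling both functions on the contents of the job's output file shows whether the process named in the banner and the tasks reported by srun are on the same node, as a starting point for comparing with the slurmd logs on those nodes.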
Slurm's Email Title upon failure:
SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED, ExitCode 255
The output of the job itself looks fine, which suggests that the computation
runs correctly and the problem occurs only while the job is ending.
Information about our system:
Applications that have this problem: FLASH, CLOUDY
MPI versions we ran those applications with: openmpi 1.8.4, openmpi
1.8.5, mpich 3.1.3
Slurm version: we tried running the applications on both slurm 14.11.8 and 14.11.3
OS: CentOS 6.5
Is anyone familiar with this kind of problem? How can we solve it?
Any information about the issue would help us a lot.
Thanks!