I would do the usual things. Log into those nodes, run dmesg, and look at
/var/log/messages.
Look at the Slurm log on the node and find where the job ended.
Also look at the sysstat files and see whether a lot of memory was being
used: http://sebastien.godard.pagesperso-orange.fr/
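To make the dmesg check concrete, here is a minimal sketch in Python that filters kernel log text for lines that usually accompany the OOM killer. The function name find_oom_lines is made up for illustration, and the patterns are an assumption about common Linux kernel message wording, which varies between kernel versions.

```python
import re

# Phrases that commonly show up in Linux kernel logs when the OOM killer acts.
# Assumption: the exact wording differs between kernel versions.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

You could pass it the lines of saved dmesg output or of /var/log/messages from the affected node. An empty result does not rule out a memory problem; it only means none of these phrases were logged.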
On Wed, 17 Apr
Hi,
A multinode, multiprocess QuantumEspresso MPI job was terminated
with the following messages in the log file:
total cpu time spent up to now is 63540.4 secs
total energy = -14004.61932175 Ry
Harris-Foulkes estimate = -14004.73511665 Ry
On Apr 30, 2013, at 1:54 PM, Vladimir Yamshchikov wrote:
> This is the question I am trying to answer - how many threads I can use with
> blastx on a grid? If I could request resources by_node, use -pernode option
> to have one process per node, and then specify the correct
This is the question I am trying to answer: how many threads can I use
with blastx on a grid? Ideally I would request resources by_node, use the
-pernode option to have one process per node, and then specify the correct
number of threads for each node. But I cannot; resources (slots) are requested
per-core
On Apr 30, 2013, at 1:34 PM, Vladimir Yamshchikov wrote:
> I asked grid IT and they said they had to kill it as the job was overloading
> nodes. They saw loads up to 180 instead of close to 12 on 12-core nodes. They
> think that blastx is not an openmpi application, so
I asked grid IT and they said they had to kill it as the job was
overloading nodes. They saw loads up to 180 instead of close to 12 on
12-core nodes. They think that blastx is not an openmpi application, so openMPI
is spawning between 64-96 blastx processes, each of which is then starting
up 96
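As a rough illustration of the overload described above, the arithmetic can be sketched in Python. The function names are hypothetical, the 12 cores per node come from the message, and the way processes were actually placed on nodes is not stated in the thread, so the numbers in the note below are only examples.

```python
def threads_per_process(cores_per_node, processes_per_node):
    """Largest per-process thread count that keeps a node at one thread per core."""
    if processes_per_node < 1:
        raise ValueError("need at least one process per node")
    # Never go below one thread, even when processes outnumber cores.
    return max(1, cores_per_node // processes_per_node)


def runnable_threads(processes_per_node, threads_each):
    """Total number of threads competing for the cores of one node."""
    return processes_per_node * threads_each
```

For example, runnable_threads(2, 96) is 192, so just two processes with 96 threads each would already far exceed a 12-core node, while threads_per_process(12, 1) gives 12 threads for a single process per node.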
Hi,
Am 30.04.2013 um 21:26 schrieb Vladimir Yamshchikov:
> My recent job started normally but after a few hours of running died with the
> following message:
>
> --
> A daemon (pid 19390) died unexpectedly with status 137
Hello,
My recent job started normally but after a few hours of running died with
the following message:
--
A daemon (pid 19390) died unexpectedly with status 137 while attempting
to launch so we are aborting.
There
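A status of 137 is commonly read as 128 + 9, meaning the process was killed with SIGKILL. The sketch below decodes a status under that shell-style convention; whether the launcher reports status this way here is an assumption, and the function name describe_exit_status is made up for illustration.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status, where values above 128 mean 128 + signal number."""
    if status > 128:
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            return "killed by unknown signal %d" % signum
    return "exited with code %d" % status
```

On Linux, describe_exit_status(137) reports SIGKILL, which is what both the kernel OOM killer and an administrator's kill -9 would send.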