As Jeff indicated, the degree of capability has improved over time - I'm not
sure which version this represents.
The type of failure also plays a major role in our ability to respond. If a
process actually segfaults or dies, we usually pick that up pretty well and
abort the rest of the job (certai
Support for failure scenarios is something that is getting better over
time in Open MPI.
It looks like the version you are using either didn't properly detect
that there was a failure, or didn't then cleanly shut down all of the
MPI processes (or both).
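To illustrate the behavior described above (notice that one process has died, then shut down the rest of the job), here is a minimal sketch in plain Python. It is not how Open MPI's runtime is implemented; it only shows the general pattern with ordinary child processes. The function name run_and_abort_on_failure and the polling interval are assumptions made for this example.

```python
import subprocess
import sys
import time


def run_and_abort_on_failure(commands, poll_interval=0.05):
    """Start every command; if one exits non-zero, terminate the rest.

    Returns (index, returncode) for the first failed command, or None
    if every command exited with status 0.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            still_running = False
            for index, proc in enumerate(procs):
                code = proc.poll()
                if code is None:
                    still_running = True
                elif code != 0:
                    # A negative code means the child was killed by a
                    # signal (for example a segfault); treat it as failure.
                    return index, code
            if not still_running:
                return None
            time.sleep(poll_interval)
    finally:
        # Whatever the outcome, do not leave stragglers behind.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()
        for proc in procs:
            proc.wait()
```

This only covers the easy case, where a process actually exits or is killed by a signal. A node that hangs or silently drops off the network never produces an exit status, so a real launcher would need some form of timeout or heartbeat on top of this.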
On Nov 6, 2007, at 9:01 PM, Teng Lin wrote:
Hi,
I just realized that I have a job that has been running for a long time,
while some of the nodes have already died. Is there any way to ask the
other nodes to quit?
[kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with errno=104
[kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with errno=104
T
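For reading the log lines above: on Linux, errno 104 is ECONNRESET ("Connection reset by peer"), which is consistent with the remote end of the TCP connection having gone away. The numeric value is platform-specific, so this assumes the messages came from a Linux host. A small sketch for decoding an errno value on whatever machine you run it on (the helper name describe_errno is made up for this example):

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numeric values to symbolic names such as
    # 'ECONNRESET'; os.strerror gives the C library's message text.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"
```

Calling describe_errno(104) on the machine that produced the log is the reliable way to check the meaning, since the same number can map to a different error on another operating system.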