On Feb 2, 2009, at 4:48 PM, Prentice Bisbal wrote:

No. I was running just a simple "Hello, world" program to test v1.3 when
these errors occurred. As soon as I reverted to 1.2.8, the errors
disappeared.

FWIW, OMPI allocates shared memory based on the number of peers on the host. This allocation happens during MPI_INIT, not during the first MPI_SEND/MPI_RECV call. So even if you're running "hello world", you could still be running out of shared memory space.

Interestingly enough, I just upgraded my cluster to PU_IAS 5.3, and now I can't reproduce the problem, but HPL fails with a segfault, which I'll
report in a separate e-mail to start a new thread for that problem.

--
Prentice

Jeff Squyres wrote:
Could the nodes be running out of shared memory and/or temp filesystem
space?


On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:


I have not seen this before.  I assume that for some reason, the
shared memory transport layer cannot create the file it uses for
communicating within a node.  Open MPI then selects some other
transport (TCP, openib) to communicate within the node so the program
runs fine.

The code has not changed that much from 1.2 to 1.3, but it is a little
different.  Let me see if I can reproduce the problem.
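One way to confirm this theory from the command line (a sketch; adjust process counts and the session-directory parameter name to your setup) is to restrict the BTL selection so that a shared-memory failure becomes a hard error instead of a silent fallback:

```shell
# Restrict the job to the shared-memory and self BTLs; if the sm backing
# file cannot be created, the job should now fail loudly rather than
# silently falling back to TCP/openib:
mpirun --mca btl self,sm -np 8 ./hello_world

# Point the session directory at a known-writable location to rule out
# a /tmp problem (orte_tmpdir_base, if I recall the parameter correctly):
mpirun --mca orte_tmpdir_base /scratch/$USER -np 8 ./hello_world
```

If the first command fails on exactly the nodes that emitted the warnings, that would confirm the sm file-creation theory.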

Rolf

Mostyn Lewis wrote:
Sort of ditto, but with SVN revision r20123 (and earlier):

e.g.

[r2250_46:30018] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_46_0/25682/1/shared_mem_pool.r2250_46 failed with errno=2
[r2250_63:05292] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_63_0/25682/1/shared_mem_pool.r2250_63 failed with errno=2
[r2250_57:17527] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_57_0/25682/1/shared_mem_pool.r2250_57 failed with errno=2
[r2250_68:13553] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_68_0/25682/1/shared_mem_pool.r2250_68 failed with errno=2
[r2250_50:06541] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_50_0/25682/1/shared_mem_pool.r2250_50 failed with errno=2
[r2250_49:29237] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_49_0/25682/1/shared_mem_pool.r2250_49 failed with errno=2
[r2250_66:19066] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_66_0/25682/1/shared_mem_pool.r2250_66 failed with errno=2
[r2250_58:24902] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_58_0/25682/1/shared_mem_pool.r2250_58 failed with errno=2
[r2250_69:27426] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_69_0/25682/1/shared_mem_pool.r2250_69 failed with errno=2
[r2250_60:30560] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_60_0/25682/1/shared_mem_pool.r2250_60 failed with errno=2

File not found (errno=2 is ENOENT) in sm.

10 of them across 32 nodes (8 cores per node: 2 sockets x quad-core).
"Apparently harmless"?

DM

On Tue, 27 Jan 2009, Prentice Bisbal wrote:

I just installed OpenMPI 1.3 with tight integration for SGE. Version 1.2.8 was working just fine for several months in the same arrangement.

Now that I've upgraded to 1.3, I get the following errors in my
standard error file:

mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node09.aurora_0/21400/1/shared_mem_pool.node09.aurora failed with errno=2
[node23.aurora:20601] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node23.aurora_0/21400/1/shared_mem_pool.node23.aurora failed with errno=2
[node46.aurora:12118] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node46.aurora_0/21400/1/shared_mem_pool.node46.aurora failed with errno=2
[node15.aurora:12421] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node15.aurora_0/21400/1/shared_mem_pool.node15.aurora failed with errno=2
[node20.aurora:12534] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node20.aurora_0/21400/1/shared_mem_pool.node20.aurora failed with errno=2
[node16.aurora:12573] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node16.aurora_0/21400/1/shared_mem_pool.node16.aurora failed with errno=2

I've tested 3-4 different times; the number of hosts that produce this
error varies, as does which hosts produce it. My program seems to run
fine, but it's just a simple "Hello, World!" program. Any ideas? Is
this a bug in 1.3?


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
