No. I was running just a simple "Hello, world" program to test v1.3 when these errors occurred. And as soon as I reverted to 1.2.8, the errors disappeared.
Interestingly enough, I just upgraded my cluster to PU_IAS 5.3, and now I
can't reproduce the problem, but HPL fails with a segfault, which I'll
report in a separate e-mail to start a new thread for that problem.

--
Prentice

Jeff Squyres wrote:
> Could the nodes be running out of shared memory and/or temp filesystem
> space?
>
> On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:
>
>> I have not seen this before. I assume that for some reason, the
>> shared memory transport layer cannot create the file it uses for
>> communicating within a node. Open MPI then selects some other
>> transport (TCP, openib) to communicate within the node, so the program
>> runs fine.
>>
>> The code has not changed that much from 1.2 to 1.3, but it is a little
>> different. Let me see if I can reproduce the problem.
>>
>> Rolf
>>
>> Mostyn Lewis wrote:
>>> Sort of ditto, but with SVN release at 20123 (and earlier), e.g.:
>>>
>>> [r2250_46:30018] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_46_0/25682/1/shared_mem_pool.r2250_46
>>> failed with errno=2
>>> [r2250_63:05292] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_63_0/25682/1/shared_mem_pool.r2250_63
>>> failed with errno=2
>>> [r2250_57:17527] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_57_0/25682/1/shared_mem_pool.r2250_57
>>> failed with errno=2
>>> [r2250_68:13553] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_68_0/25682/1/shared_mem_pool.r2250_68
>>> failed with errno=2
>>> [r2250_50:06541] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_50_0/25682/1/shared_mem_pool.r2250_50
>>> failed with errno=2
>>> [r2250_49:29237] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_49_0/25682/1/shared_mem_pool.r2250_49
>>> failed with errno=2
>>> [r2250_66:19066] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_66_0/25682/1/shared_mem_pool.r2250_66
>>> failed with errno=2
>>> [r2250_58:24902] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_58_0/25682/1/shared_mem_pool.r2250_58
>>> failed with errno=2
>>> [r2250_69:27426] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_69_0/25682/1/shared_mem_pool.r2250_69
>>> failed with errno=2
>>> [r2250_60:30560] mca_common_sm_mmap_init: open
>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_60_0/25682/1/shared_mem_pool.r2250_60
>>> failed with errno=2
>>>
>>> File not found in sm.
>>>
>>> 10 of them across 32 nodes (8 cores per node (2 sockets x quad-core)).
>>> "Apparently harmless"?
>>>
>>> DM
>>>
>>> On Tue, 27 Jan 2009, Prentice Bisbal wrote:
>>>
>>>> I just installed OpenMPI 1.3 with tight integration for SGE. Version
>>>> 1.2.8 was working just fine for several months in the same
>>>> arrangement.
>>>>
>>>> Now that I've upgraded to 1.3, I get the following errors in my
>>>> standard error file:
>>>>
>>>> mca_common_sm_mmap_init: open
>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node09.aurora_0/21400/1/shared_mem_pool.node09.aurora
>>>> failed with errno=2
>>>> [node23.aurora:20601] mca_common_sm_mmap_init: open
>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node23.aurora_0/21400/1/shared_mem_pool.node23.aurora
>>>> failed with errno=2
>>>> [node46.aurora:12118] mca_common_sm_mmap_init: open
>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node46.aurora_0/21400/1/shared_mem_pool.node46.aurora
>>>> failed with errno=2
>>>> [node15.aurora:12421] mca_common_sm_mmap_init: open
>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node15.aurora_0/21400/1/shared_mem_pool.node15.aurora
>>>> failed with errno=2
>>>> [node20.aurora:12534] mca_common_sm_mmap_init: open
>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node20.aurora_0/21400/1/shared_mem_pool.node20.aurora
>>>> failed with errno=2
>>>> [node16.aurora:12573] mca_common_sm_mmap_init: open
>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node16.aurora_0/21400/1/shared_mem_pool.node16.aurora
>>>> failed with errno=2
>>>>
>>>> I've tested 3-4 different times, and the number of hosts that produces
>>>> this error varies, as well as which hosts produce this error. My
>>>> program seems to run fine, but it's just a simple "Hello, World!"
>>>> program. Any ideas? Is this a bug in 1.3?