Has anyone else who hit this problem on a RHEL-based distro been able to
upgrade to 5.3 and confirm my experience?
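For reference, the "Hello, world" test I mention below is essentially the
textbook MPI example. A minimal sketch along these lines (my actual source
may differ slightly, but it does nothing more than this):

/* hello.c -- minimal MPI test program. Build: mpicc hello.c -o hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello, world from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}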
--
Prentice

Prentice Bisbal wrote:
> No. I was running just a simple "Hello, world" program to test v1.3 when
> these errors occurred. And as soon as I reverted to 1.2.8, the errors
> disappeared.
>
> Interestingly enough, I just upgraded my cluster to PU_IAS 5.3, and now
> I can't reproduce the problem, but HPL fails with a segfault, which I'll
> report in a separate e-mail to start a new thread for that problem.
>
> --
> Prentice
>
> Jeff Squyres wrote:
>> Could the nodes be running out of shared memory and/or temp filesystem
>> space?
>>
>> On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:
>>
>>> I have not seen this before. I assume that for some reason, the
>>> shared memory transport layer cannot create the file it uses for
>>> communicating within a node. Open MPI then selects some other
>>> transport (TCP, openib) to communicate within the node, so the
>>> program runs fine.
>>>
>>> The code has not changed that much from 1.2 to 1.3, but it is a
>>> little different. Let me see if I can reproduce the problem.
>>>
>>> Rolf
>>>
>>> Mostyn Lewis wrote:
>>>> Sort of ditto, but with SVN release r20123 (and earlier), e.g.:
>>>>
>>>> [r2250_46:30018] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_46_0/25682/1/shared_mem_pool.r2250_46
>>>> failed with errno=2
>>>> [r2250_63:05292] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_63_0/25682/1/shared_mem_pool.r2250_63
>>>> failed with errno=2
>>>> [r2250_57:17527] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_57_0/25682/1/shared_mem_pool.r2250_57
>>>> failed with errno=2
>>>> [r2250_68:13553] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_68_0/25682/1/shared_mem_pool.r2250_68
>>>> failed with errno=2
>>>> [r2250_50:06541] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_50_0/25682/1/shared_mem_pool.r2250_50
>>>> failed with errno=2
>>>> [r2250_49:29237] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_49_0/25682/1/shared_mem_pool.r2250_49
>>>> failed with errno=2
>>>> [r2250_66:19066] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_66_0/25682/1/shared_mem_pool.r2250_66
>>>> failed with errno=2
>>>> [r2250_58:24902] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_58_0/25682/1/shared_mem_pool.r2250_58
>>>> failed with errno=2
>>>> [r2250_69:27426] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_69_0/25682/1/shared_mem_pool.r2250_69
>>>> failed with errno=2
>>>> [r2250_60:30560] mca_common_sm_mmap_init: open
>>>> /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_60_0/25682/1/shared_mem_pool.r2250_60
>>>> failed with errno=2
>>>>
>>>> File not found in sm.
>>>>
>>>> 10 of them across 32 nodes (8 cores per node: 2 sockets x quad-core).
>>>> "Apparently harmless"?
>>>>
>>>> DM
>>>>
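(An aside on the errno: 2 is ENOENT, "No such file or directory", so the
open() is failing because some component of the session directory path does
not exist at that point. A minimal standalone sketch of that failure mode;
this is my illustration, not Open MPI's actual code, and the path is made
up:)

/* enoent_demo.c -- illustration only, not Open MPI source. open(2) with
 * O_CREAT creates the file, but still fails with ENOENT if any parent
 * directory of the path is missing. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path; the session directory does not exist. */
    const char *path = "/tmp/no-such-session-dir/shared_mem_pool.demo";
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        printf("open %s failed with errno=%d (%s)\n",
               path, errno, strerror(errno)); /* errno=2 (ENOENT) */
        return 1;
    }
    close(fd);
    return 0;
}

(So the question may be why the session directory is missing, or already
cleaned up, on some nodes when mca_common_sm_mmap_init runs.)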
>>>> On Tue, 27 Jan 2009, Prentice Bisbal wrote:
>>>>
>>>>> I just installed Open MPI 1.3 with tight integration for SGE.
>>>>> Version 1.2.8 was working just fine for several months in the same
>>>>> arrangement.
>>>>>
>>>>> Now that I've upgraded to 1.3, I get the following errors in my
>>>>> standard error file:
>>>>>
>>>>> mca_common_sm_mmap_init: open
>>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node09.aurora_0/21400/1/shared_mem_pool.node09.aurora
>>>>> failed with errno=2
>>>>> [node23.aurora:20601] mca_common_sm_mmap_init: open
>>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node23.aurora_0/21400/1/shared_mem_pool.node23.aurora
>>>>> failed with errno=2
>>>>> [node46.aurora:12118] mca_common_sm_mmap_init: open
>>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node46.aurora_0/21400/1/shared_mem_pool.node46.aurora
>>>>> failed with errno=2
>>>>> [node15.aurora:12421] mca_common_sm_mmap_init: open
>>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node15.aurora_0/21400/1/shared_mem_pool.node15.aurora
>>>>> failed with errno=2
>>>>> [node20.aurora:12534] mca_common_sm_mmap_init: open
>>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node20.aurora_0/21400/1/shared_mem_pool.node20.aurora
>>>>> failed with errno=2
>>>>> [node16.aurora:12573] mca_common_sm_mmap_init: open
>>>>> /tmp/968.1.all.q/openmpi-sessions-prentice@node16.aurora_0/21400/1/shared_mem_pool.node16.aurora
>>>>> failed with errno=2
>>>>>
>>>>> I've tested 3-4 different times; the number of hosts that produce
>>>>> this error varies, as does which hosts produce it. My program seems
>>>>> to run fine, but it's just a simple "Hello, World!" program. Any
>>>>> ideas? Is this a bug in 1.3?
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Prentice
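P.S. Regarding Jeff's question about running out of temp filesystem space:
one quick way to check per-node free space under /tmp is statvfs(3). A
sketch below; this is my own check program, nothing Open MPI-specific, and
the file name tmpfree.c is made up:

/* tmpfree.c -- report free space on the filesystem holding a directory
 * (default /tmp, where the Open MPI session directories are created). */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : "/tmp";
    struct statvfs vfs;

    if (statvfs(dir, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    /* f_bavail = blocks available to unprivileged users */
    printf("%s: %llu MB free\n", dir,
           (unsigned long long)vfs.f_bavail * vfs.f_frsize
               / (1024ULL * 1024));
    return 0;
}

Launched under the same SGE allocation (something like
mpirun -np 32 -bynode ./tmpfree), it should print one line per node, which
would confirm or rule out a full /tmp.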