No. I was running just a simple "Hello, world" program to test v1.3 when
these errors occurred. As soon as I reverted to 1.2.8, the errors
disappeared.
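
(In case it helps anyone else check Jeff's space question below: here is
a minimal C sketch that reports free space on the filesystems involved.
/tmp and /dev/shm are assumptions for a typical Linux node; adjust the
paths to wherever your session directories actually live.)

    #include <stdio.h>
    #include <sys/statvfs.h>

    /* Report space available to unprivileged users on one filesystem. */
    static void report(const char *path)
    {
        struct statvfs vfs;
        if (statvfs(path, &vfs) != 0) {
            perror(path);
            return;
        }
        printf("%s: %llu MB free\n", path,
               (unsigned long long)vfs.f_bavail * vfs.f_frsize
                   / (1024 * 1024));
    }

    int main(void)
    {
        report("/tmp");      /* session directories default to here */
        report("/dev/shm");  /* typical tmpfs mount for shared memory */
        return 0;
    }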

Interestingly, I just upgraded my cluster to PU_IAS 5.3, and now I can't
reproduce this problem. Instead, HPL fails with a segfault, which I'll
report in a separate e-mail to start a new thread for that problem.

--
Prentice

Jeff Squyres wrote:
> Could the nodes be running out of shared memory and/or temp filesystem
> space?
> 
> 
> On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:
> 
>>
>> I have not seen this before.  I assume that for some reason, the
>> shared memory transport layer cannot create the file it uses for
>> communicating within a node.  Open MPI then selects some other
>> transport (TCP, openib) to communicate within the node, so the program
>> runs fine.
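>>
>> For context, the pattern in play is roughly the following (a minimal
>> sketch of an open+mmap shared-memory backing file, not the actual
>> mca_common_sm_mmap_init code; the name create_sm_pool is made up for
>> illustration). An errno=2 (ENOENT) from open() here means a component
>> of the path, i.e. the session directory, is missing:
>>
>>   #include <errno.h>
>>   #include <fcntl.h>
>>   #include <stdio.h>
>>   #include <string.h>
>>   #include <sys/mman.h>
>>   #include <unistd.h>
>>
>>   /* Sketch only: create and map the file backing a shared-memory pool. */
>>   void *create_sm_pool(const char *file, size_t size)
>>   {
>>       int fd = open(file, O_CREAT | O_RDWR, 0600);
>>       if (fd < 0) {
>>           /* The failure mode in the logs: the session directory under
>>              /tmp does not exist, so open() sets errno to ENOENT (2). */
>>           fprintf(stderr, "open %s failed with errno=%d (%s)\n",
>>                   file, errno, strerror(errno));
>>           return NULL;
>>       }
>>       if (ftruncate(fd, (off_t)size) != 0) {  /* size the backing file */
>>           close(fd);
>>           return NULL;
>>       }
>>       void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
>>                         fd, 0);
>>       close(fd);  /* the mapping remains valid after close */
>>       return base == MAP_FAILED ? NULL : base;
>>   }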
>>
>> The code has not changed that much from 1.2 to 1.3, but it is a little
>> different.  Let me see if I can reproduce the problem.
>>
>> Rolf
>>
>> Mostyn Lewis wrote:
>>> Sort of ditto, but with SVN revision 20123 (and earlier):
>>>
>>> e.g.
>>>
>>> [r2250_46:30018] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_46_0/25682/1/shared_mem_pool.r2250_46 failed with errno=2
>>> [r2250_63:05292] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_63_0/25682/1/shared_mem_pool.r2250_63 failed with errno=2
>>> [r2250_57:17527] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_57_0/25682/1/shared_mem_pool.r2250_57 failed with errno=2
>>> [r2250_68:13553] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_68_0/25682/1/shared_mem_pool.r2250_68 failed with errno=2
>>> [r2250_50:06541] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_50_0/25682/1/shared_mem_pool.r2250_50 failed with errno=2
>>> [r2250_49:29237] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_49_0/25682/1/shared_mem_pool.r2250_49 failed with errno=2
>>> [r2250_66:19066] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_66_0/25682/1/shared_mem_pool.r2250_66 failed with errno=2
>>> [r2250_58:24902] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_58_0/25682/1/shared_mem_pool.r2250_58 failed with errno=2
>>> [r2250_69:27426] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_69_0/25682/1/shared_mem_pool.r2250_69 failed with errno=2
>>> [r2250_60:30560] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_60_0/25682/1/shared_mem_pool.r2250_60 failed with errno=2
>>>
>>> File not found in sm: errno=2 is ENOENT.
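>>>
>>> (A quick one-liner to decode such errno values, assuming a Linux box:
>>>
>>>   #include <stdio.h>
>>>   #include <string.h>
>>>   int main(void) { printf("%s\n", strerror(2)); return 0; }
>>>
>>> prints "No such file or directory".)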
>>>
>>> 10 of them across 32 nodes (8 cores per node: 2 sockets x quad-core).
>>> "Apparently harmless"?
>>>
>>> DM
>>>
>>> On Tue, 27 Jan 2009, Prentice Bisbal wrote:
>>>
>>>> I just installed Open MPI 1.3 with tight integration for SGE. Version
>>>> 1.2.8 had been working fine for several months in the same setup.
>>>>
>>>> Now that I've upgraded to 1.3, I get the following errors in my
>>>> standard
>>>> error file:
>>>>
>>>> mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node09.aurora_0/21400/1/shared_mem_pool.node09.aurora failed with errno=2
>>>> [node23.aurora:20601] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node23.aurora_0/21400/1/shared_mem_pool.node23.aurora failed with errno=2
>>>> [node46.aurora:12118] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node46.aurora_0/21400/1/shared_mem_pool.node46.aurora failed with errno=2
>>>> [node15.aurora:12421] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node15.aurora_0/21400/1/shared_mem_pool.node15.aurora failed with errno=2
>>>> [node20.aurora:12534] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node20.aurora_0/21400/1/shared_mem_pool.node20.aurora failed with errno=2
>>>> [node16.aurora:12573] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node16.aurora_0/21400/1/shared_mem_pool.node16.aurora failed with errno=2
>>>>
>>>> I've tested 3-4 times, and both the number of hosts that produce this
>>>> error and which hosts produce it vary from run to run. My program
>>>> seems to run fine, but it's just a simple "Hello, World!" program.
>>>> Any ideas? Is this a bug in 1.3?
>>>>
