Is there anyone else who has experienced this problem with a RHEL-based
distro who can upgrade to 5.3 and confirm my experience?

--
Prentice


Prentice Bisbal wrote:
> No. I was running just a simple "Hello, world" program to test v1.3 when
> these errors occurred. And as soon as I reverted to 1.2.8, the errors
> disappeared.
> 
> Interestingly enough, I just upgraded my cluster to PU_IAS 5.3, and now
> I can't reproduce the problem, but HPL fails with a segfault. I'll
> report that in a separate e-mail to start a new thread for that problem.
> 
> --
> Prentice
> 
> Jeff Squyres wrote:
>> Could the nodes be running out of shared memory and/or temp filesystem
>> space?
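
A quick way to rule that out is to check free space where the session
files live. Below is a minimal C sketch using statvfs(3); the paths are
assumptions (the default /tmp session directory, and /dev/shm as Linux's
POSIX shared-memory backing store), not anything mandated by Open MPI:

    #include <stdio.h>
    #include <sys/statvfs.h>

    /* Print the space available to unprivileged users on a filesystem. */
    static void report(const char *path)
    {
        struct statvfs vfs;
        if (statvfs(path, &vfs) != 0) {
            perror(path);
            return;
        }
        double free_mb = (double)vfs.f_bavail * vfs.f_frsize / (1024.0 * 1024.0);
        printf("%s: %.1f MB free\n", path, free_mb);
    }

    int main(void)
    {
        report("/tmp");     /* assumed location of the per-job session directory */
        report("/dev/shm"); /* POSIX shared-memory backing store on Linux */
        return 0;
    }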
>>
>>
>> On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:
>>
>>> I have not seen this before.  I assume that for some reason, the
>>> shared memory transport layer cannot create the file it uses for
>>> communicating within a node.  Open MPI then selects some other
>>> transport (TCP, openib) to communicate within the node so the program
>>> runs fine.
>>>
>>> The code has not changed that much from 1.2 to 1.3, but it is a little
>>> different.  Let me see if I can reproduce the problem.
>>>
>>> Rolf
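
For reference, errno=2 is ENOENT ("No such file or directory"). Here is a
minimal sketch of how an mmap-backed shared-memory file is typically
created; this is not Open MPI's actual code, and the path is hypothetical.
The point it illustrates: open(2) with O_CREAT only creates the final
file, so a missing parent directory in the session path still yields
ENOENT, matching the messages in this thread:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical stand-in for the per-job session path. If the
         * parent directory does not exist, open() fails with ENOENT
         * (errno=2) even though O_CREAT is given. */
        const char *path = "/tmp/missing-session-dir/shared_mem_pool";
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            fprintf(stderr, "open %s failed with errno=%d (%s)\n",
                    path, errno, strerror(errno));
            return 1;
        }
        /* Size the file and map it as the shared-memory segment. */
        if (ftruncate(fd, 4096) == 0) {
            void *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
            if (seg != MAP_FAILED)
                munmap(seg, 4096);
        }
        close(fd);
        unlink(path);
        return 0;
    }

Since Open MPI then falls back to another transport, as described above,
the job still runs; running with something like "mpirun --mca
btl_base_verbose 30" should show which BTL was actually selected.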
>>>
>>> Mostyn Lewis wrote:
>>>> Sort of ditto but with SVN release at 20123 (and earlier):
>>>>
>>>> e.g.
>>>>
>>>> [r2250_46:30018] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_46_0/25682/1/shared_mem_pool.r2250_46 failed with errno=2
>>>> [r2250_63:05292] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_63_0/25682/1/shared_mem_pool.r2250_63 failed with errno=2
>>>> [r2250_57:17527] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_57_0/25682/1/shared_mem_pool.r2250_57 failed with errno=2
>>>> [r2250_68:13553] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_68_0/25682/1/shared_mem_pool.r2250_68 failed with errno=2
>>>> [r2250_50:06541] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_50_0/25682/1/shared_mem_pool.r2250_50 failed with errno=2
>>>> [r2250_49:29237] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_49_0/25682/1/shared_mem_pool.r2250_49 failed with errno=2
>>>> [r2250_66:19066] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_66_0/25682/1/shared_mem_pool.r2250_66 failed with errno=2
>>>> [r2250_58:24902] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_58_0/25682/1/shared_mem_pool.r2250_58 failed with errno=2
>>>> [r2250_69:27426] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_69_0/25682/1/shared_mem_pool.r2250_69 failed with errno=2
>>>> [r2250_60:30560] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_60_0/25682/1/shared_mem_pool.r2250_60 failed with errno=2
>>>>
>>>> errno=2 is ENOENT: the sm file was not found.
>>>>
>>>> 10 of them across 32 nodes (8 cores per node: 2 sockets x quad-core).
>>>> "Apparently harmless"?
>>>>
>>>> DM
>>>>
>>>> On Tue, 27 Jan 2009, Prentice Bisbal wrote:
>>>>
>>>>> I just installed Open MPI 1.3 with tight integration for SGE. Version
>>>>> 1.2.8 had been working just fine for several months in the same setup.
>>>>>
>>>>> Now that I've upgraded to 1.3, I get the following errors in my
>>>>> standard error file:
>>>>>
>>>>> mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node09.aurora_0/21400/1/shared_mem_pool.node09.aurora failed with errno=2
>>>>> [node23.aurora:20601] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node23.aurora_0/21400/1/shared_mem_pool.node23.aurora failed with errno=2
>>>>> [node46.aurora:12118] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node46.aurora_0/21400/1/shared_mem_pool.node46.aurora failed with errno=2
>>>>> [node15.aurora:12421] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node15.aurora_0/21400/1/shared_mem_pool.node15.aurora failed with errno=2
>>>>> [node20.aurora:12534] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node20.aurora_0/21400/1/shared_mem_pool.node20.aurora failed with errno=2
>>>>> [node16.aurora:12573] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node16.aurora_0/21400/1/shared_mem_pool.node16.aurora failed with errno=2
>>>>>
>>>>> I've tested 3-4 different times, and both the number of hosts that
>>>>> produce this error and which hosts produce it vary from run to run. My
>>>>> program seems to run fine, but it's just a simple "Hello, World!"
>>>>> program. Any ideas? Is this a bug in 1.3?
>>>>>
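
For reference, a minimal MPI "Hello, World!" of the kind described above;
this is a generic sketch, not necessarily the exact program that was run:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("Hello, world from rank %d of %d on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }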
> 

