On Apr 11, 2012, at 6:20 AM, Reuti wrote:

> Am 11.04.2012 um 04:26 schrieb Ralph Castain:
> 
>> Hi Reuti
>> 
>> Can you replicate this problem on your machine? Can you try it with 1.5?
> 
> No, I can't replicate it. It also works fine in 1.5.5 in the tests I ran. I 
> even forced an uneven distribution by limiting the slots setting for some 
> machines in the queue configuration.
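> 
> For reference, a sketch of that kind of override (queue and host names are 
> placeholders):
> 
>   qconf -mattr queue slots 2 all.q@node01
> 
> which caps the queue instance on node01 at 2 slots, so SGE has to spread 
> the remaining ranks unevenly across the other hosts.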

Thanks - that confirms what I've been able to test. It sounds like something 
in Eloi's setup, but I can't fathom what it would be - the allocations all 
look acceptable.

I'm stumped. :-(


> 
> -- Reuti
> 
> 
>> Afraid I don't have a way to replicate it, and as I said, it wouldn't be 
>> fixed in the 1.4 series anyway. I'm not seeing this problem elsewhere, but 
>> then I don't generally get an allocation that varies across nodes.
>> 
>> Ralph
>> 
>> On Apr 10, 2012, at 11:57 AM, Reuti wrote:
>> 
>>> Am 10.04.2012 um 16:55 schrieb Eloi Gaudry:
>>> 
>>>> Hi Ralph,
>>>> 
>>>> I haven't tried any of the 1.5 series yet (we have chosen not to use the 
>>>> feature releases), but if that is necessary for you to work on this 
>>>> topic, I will.
>>>> 
>>>> This might be of interest to Reuti and you: it seems that we cannot 
>>>> reproduce the problem anymore if we don't provide the "-np N" option on 
>>>> the orterun command line. Of course, we need to launch a few more runs to 
>>>> be really sure, because the allocation error was not always observable. 
>>>> Actually, I recently understood (from Reuti) that the tight integration 
>>>> mode supplies all the necessary bits to the launcher, so I removed the 
>>>> '-np N' that was around... Could it be that using '-np N' together with 
>>>> the SGE tight integration mode is pathological?
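>>>> 
>>>> For illustration, a sketch of the submission we use now (the PE name and 
>>>> application arguments are placeholders for our real ones); with tight 
>>>> integration, orterun takes the rank count and host list from SGE itself:
>>>> 
>>>>   #!/bin/sh
>>>>   #$ -pe openmpi 12
>>>>   #$ -cwd
>>>>   # no '-np N' here: orterun starts $NSLOTS ranks on the granted hosts
>>>>   /opt/openmpi-1.4.4/bin/orterun ./actranpy_mp ...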
>>> 
>>> Yes, specifying -np should work without problems. As it didn't hit me in 
>>> my tests (normally I don't specify -np), I would really be interested in 
>>> the underlying cause.
>>> 
>>> Especially as the example in Open MPI's FAQ uses -np to start with 
>>> GridEngine integration, it should have hit other users too.
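>>> 
>>> From memory, the FAQ example is roughly of this shape (exact PE name 
>>> aside):
>>> 
>>>   #$ -pe orte 12
>>>   mpirun -np $NSLOTS ./your_mpi_app
>>> 
>>> where $NSLOTS is set by SGE, so -np merely repeats what the tight 
>>> integration already knows.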
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> Regards,
>>>> Eloi
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>>> Behalf Of Ralph Castain
>>>> Sent: mardi 10 avril 2012 16:43
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] sge tight integration leads to bad allocation
>>>> 
>>>> Could well be a bug in OMPI - I can take a look, though it may be a 
>>>> while before I get to it. Have you tried one of the 1.5 series releases?
>>>> 
>>>> On Apr 10, 2012, at 3:42 AM, Eloi Gaudry wrote:
>>>> 
>>>>> Thx. This is the allocation which is also confirmed by the Open MPI 
>>>>> output.
>>>>> [eg: ] Exactly, but not the one used afterwards by Open MPI.
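>>>>> 
>>>>> [eg: ] If it helps, we can print both views in a single run; if I read 
>>>>> the orterun man page correctly, something like
>>>>> 
>>>>>   orterun --display-allocation --display-map ./actranpy_mp ...
>>>>> 
>>>>> should show the allocation orterun read from SGE and the map of where 
>>>>> it then placed the ranks.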
>>>>> 
>>>>> - The application was compiled with the same version of Open MPI?
>>>>> [eg: ] yes, version 1.4.4 for all
>>>>> 
>>>>> - Does the application start something on its own besides the tasks 
>>>>> granted by mpiexec/orterun?
>>>>> [eg: ] no
>>>>> 
>>>>> You want 12 ranks in total, and for barney.fft and carl.fft there is 
>>>>> also "-mca orte_ess_num_procs 3" given to the qrsh_starter. In total I 
>>>>> count only 10 ranks in the example given (4+4+2); do you observe the 
>>>>> same?
>>>>> [eg: ] I don't know why the -mca orte_ess_num_procs 3 is added here...
>>>>> In the "Map generated by mapping policy" output in my last email, I see 
>>>>> that 4 processes were started on each node (barney, carl and charlie), 
>>>>> but yes, in the ps -elf output, two of them are missing for one node 
>>>>> (barney)... sorry about that, a bad copy/paste. Here is the actual output 
>>>>> for this node:
>>>>> 2048 ?        Sl     3:33 /opt/sge/bin/lx-amd64/sge_execd
>>>>> 27502 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
>>>>> 27503 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter 
>>>>> /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
>>>>> 27510 ?        S      0:00          \_ bash -c  
>>>>> PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; 
>>>>> LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export 
>>>>> LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca 
>>>>> orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 
>>>>> --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca 
>>>>> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
>>>>> ras_gridengine_verbose 1
>>>>> 27511 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted 
>>>>> -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca 
>>>>> orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 
>>>>> --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
>>>>> ras_gridengine_verbose 1
>>>>> 27512 ?        Rl    12:54                  \_ 
>>>>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>>>>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>>>>> --parallel=frequency --scratch=/scratch/cluster/1416 
>>>>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>>>>> 27513 ?        Rl    12:54                  \_ 
>>>>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>>>>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>>>>> --parallel=frequency --scratch=/scratch/cluster/1416 
>>>>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>>>>> 27514 ?        Rl    12:54                  \_ 
>>>>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>>>>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>>>>> --parallel=frequency --scratch=/scratch/cluster/1416 
>>>>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>>>>> 27515 ?        Rl    12:53                  \_ 
>>>>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>>>>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>>>>> --parallel=frequency --scratch=/scratch/cluster/1416 
>>>>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>>>>> 
>>>>> It looks like Open MPI is doing the right thing, but the applications 
>>>>> decided to start in a different allocation.
>>>>> [eg: ] If the "Map generated by mapping policy" is different from the 
>>>>> SGE allocation, then Open MPI is not doing the right thing, don't you 
>>>>> think?
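>>>>> 
>>>>> [eg: ] As a cross-check, we could dump the SGE grant at the top of the 
>>>>> job script (PE_HOSTFILE is set by SGE under tight integration):
>>>>> 
>>>>>   echo "SGE allocation:"; cat "$PE_HOSTFILE"
>>>>> 
>>>>> and diff that (host, slot count, queue and processor range per line) 
>>>>> against the "Map generated by mapping policy" output.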
>>>>> 
>>>>> Does the application use OpenMP or other kinds of threads in addition? 
>>>>> The suffix "_mp" in the name "actranpy_mp" makes me suspicious.
>>>>> [eg: ] No, the suffix _mp stands for "parallel".