Sorry for the delay - I understood, and was just occupied with something
else for a while. Thanks for the follow-up. I'm looking at the issue and
trying to decipher the right solution.
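
To make sure I have the sequence in the summary below straight, here is a
minimal standalone sketch of it. This is not the ORTE code: node_t,
count_hostfile_slots and reset_unspecified_slots are hypothetical stand-ins,
and the only assumption carried over from the thread is that listing a node
multiple times counts one slot per line while slots_given stays false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 16
#define MAX_NAME  64

/* Hypothetical stand-in for the slot-related fields of a node entry. */
typedef struct {
    char name[MAX_NAME];
    int  slots;        /* slots currently assigned to the node */
    bool slots_given;  /* true only for an explicit slot count */
} node_t;

/* Count one slot per occurrence of a host name (assumed model of the
 * hostfile parsing described in the thread). Returns the number of
 * distinct nodes written to nodes[]. */
size_t count_hostfile_slots(const char *const *hosts, size_t nhosts,
                            node_t *nodes, size_t max_nodes)
{
    size_t nnodes = 0;

    assert(hosts != NULL && nodes != NULL);
    for (size_t i = 0; i < nhosts; i++) {
        size_t j;
        for (j = 0; j < nnodes; j++) {
            if (strcmp(nodes[j].name, hosts[i]) == 0) {
                nodes[j].slots++;   /* repeated line: one more slot */
                break;
            }
        }
        if (j == nnodes && nnodes < max_nodes) {
            /* first time this host is seen */
            strncpy(nodes[nnodes].name, hosts[i], MAX_NAME - 1);
            nodes[nnodes].name[MAX_NAME - 1] = '\0';
            nodes[nnodes].slots = 1;
            nodes[nnodes].slots_given = false;  /* no explicit count */
            nnodes++;
        }
    }
    return nnodes;
}

/* Same shape as the quoted loop: every node without an explicit slot
 * count is forced back to a single slot. */
void reset_unspecified_slots(node_t *nodes, size_t nnodes)
{
    for (size_t i = 0; i < nnodes; i++) {
        if (!nodes[i].slots_given) {
            nodes[i].slots = 1;
        }
    }
}
```

Fed the 16-line pbs_hosts file quoted below, count_hostfile_slots in this
sketch gives slots=8 for both node05 and node06, and a later call to
reset_unspecified_slots drops both back to slots=1, which matches steps 3
and 4 of the summary.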


On Jan 17, 2014, at 2:00 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> I'm sorry that my explanation was not clear enough.
> This is the summary of my situation:
> 
> 1. I create a hostfile as shown below manually.
> 
> 2. I use mpirun to start the job without Torque, which means I'm running in
> an unmanaged environment.
> 
> 3. First, ORTE detects 8 slots on each host (maybe in
> "orte_ras_base_allocate").
>    node05: slots=8 max_slots=0 slots_inuse=0
>    node06: slots=8 max_slots=0 slots_inuse=0
> 
> 4. Then, the code I identified is resetting the slot counts.
>    node05: slots=1 max_slots=0 slots_inuse=0
>    node06: slots=1 max_slots=0 slots_inuse=0
> 
> 5. Therefore, ORTE believes that there is only one slot on each host.
> 
> Regards,
> Tetsuya Mishima
> 
>> No, I didn't use Torque this time.
>> 
>> This issue occurs only outside a managed environment - namely,
>> when orte_managed_allocation is false (and orte_set_slots is NULL).
>> 
>> Under Torque management, it works fine.
>> 
>> I hope you can understand the situation.
>> 
>> Tetsuya Mishima
>> 
>>> I'm sorry, but I'm really confused, so let me try to understand the
>>> situation.
>>> 
>>> You use Torque to get an allocation, so you are running in a managed
>>> environment.
>>> 
>>> You then use mpirun to start the job, but pass it a hostfile as shown
>>> below.
>>> 
>>> Somehow, ORTE believes that there is only one slot on each host, and you
>>> believe the code you've identified is resetting the slot counts.
>>> 
>>> Is that a correct summary of the situation?
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> On Jan 16, 2014, at 4:00 PM, tmish...@jcity.maeda.co.jp wrote:
>>> 
>>>> 
>>>> Hi Ralph,
>>>> 
>>>> I encountered the hostfile issue again where slots are counted by
>>>> listing the node multiple times. This should be fixed by r29765
>>>> - Fix hostfile parsing for the case where RMs count slots ....
>>>> 
>>>> The difference is whether an RM is used or not. At that time, I executed
>>>> mpirun through the Torque manager. This time I executed it directly from
>>>> the command line as shown at the bottom, where node05 and node06 each
>>>> have 8 cores.
>>>> 
>>>> Then, I checked the source files around it and found that lines 151-160
>>>> in plm_base_launch_support.c caused this issue. As node->slots is already
>>>> counted in hostfile.c @ r29765 even when node->slots_given is false,
>>>> I think this part of plm_base_launch_support.c is unnecessary.
>>>> 
>>>> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
>>>> 151             } else {
>>>> 152                 /* set any non-specified slot counts to 1 */
>>>> 153                 for (i=0; i < orte_node_pool->size; i++) {
>>>> 154                     if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>>>> 155                         continue;
>>>> 156                     }
>>>> 157                     if (!node->slots_given) {
>>>> 158                         node->slots = 1;
>>>> 159                     }
>>>> 160                 }
>>>> 161             }
>>>> 
>>>> With this part removed, it works very well, and the
>>>> orte_set_default_slots functionality is still intact. I think this would
>>>> be better as a compatible extension of openmpi-1.7.3.
>>>> 
>>>> Regards,
>>>> Tetsuya Mishima
>>>> 
>>>> [mishima@manage work]$ cat pbs_hosts
>>>> node05
>>>> node05
>>>> node05
>>>> node05
>>>> node05
>>>> node05
>>>> node05
>>>> node05
>>>> node06
>>>> node06
>>>> node06
>>>> node06
>>>> node06
>>>> node06
>>>> node06
>>>> node06
>>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
>>>> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
>>>> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
>>>> Hello world from process 0 of 4
>>>> Hello world from process 1 of 4
>>>> Hello world from process 3 of 4
>>>> Hello world from process 2 of 4
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
