Hi,

yesterday I installed openmpi-1.8.2rc3 on my machines
(Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE
Linux 12.1 x86_64) with Sun C 5.12. I get an error
if I use a rankfile that spans all three architectures.
The error message depends on the local machine on which
I run "mpiexec". I get a different error if I use two
"Sparc64 VII" machines (see below).

tyr openmpi_1.7.x_or_newer 109 cat rf_linpc_sunpc_tyr
rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=tyr slot=1:0
tyr openmpi_1.7.x_or_newer 110 
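
To make explicit which placement I request with this rankfile, here is
a minimal Python sketch that splits each line into rank, host, and slot
specification. It is not part of Open MPI: "parse_rankfile" is a
hypothetical helper, and it only assumes the "rank N=host slot=spec"
form used in my files, not the full rankfile syntax.

```python
import re

# Hypothetical helper, not part of Open MPI. It only recognizes lines of
# the form "rank N=host slot=spec" as they appear in the rankfile above.
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def parse_rankfile(text):
    """Return a list of (rank, host, slot_spec) tuples, one per line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError("unrecognized rankfile line: " + line)
        rank, host, slot = match.groups()
        entries.append((int(rank), host, slot))
    return entries


# Contents of rf_linpc_sunpc_tyr, copied from the listing above.
RANKFILE = """\
rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=tyr slot=1:0
"""

if __name__ == "__main__":
    for rank, host, slot in parse_rankfile(RANKFILE):
        print(rank, host, slot)
```

The sketch does not interpret the slot specification itself; it only
shows that every rank names exactly one host and one slot expression.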


I get the following message if I run "mpiexec" on
Solaris 10 Sparc.

tyr openmpi_1.7.x_or_newer 110 mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
--------------------------------------------------------------------------
An invalid value was supplied for an enum variable.

  Variable     : hwloc_base_report_bindings
  Value        : 1,1
  Valid values : 0: f|false|disabled, 1: t|true|enabled
--------------------------------------------------------------------------
[tyr.informatik.hs-fulda.de:26960] MCW rank 3 bound to socket 1[core 1[hwt 0]]: [.][B]
tyr.informatik.hs-fulda.de
[linpc1:12109] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[linpc0:26642] MCW rank 0 is not bound (or bound to all available processors)
linpc1
linpc0
sunpc1
tyr openmpi_1.7.x_or_newer 111 



I get the following message if I run "mpiexec" on
Solaris 10 x86_64 or Linux x86_64.

sunpc1 openmpi_1.7.x_or_newer 109 mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
--------------------------------------------------------------------------
An invalid value was supplied for an enum variable.

  Variable     : hwloc_base_report_bindings
  Value        : 1,1
  Valid values : 0: f|false|disabled, 1: t|true|enabled
--------------------------------------------------------------------------
[sunpc1:02931] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc1
[linpc0:26850] MCW rank 0 is not bound (or bound to all available processors)
[linpc1:12386] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
linpc0
linpc1
--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        tyr
  Application name:  /usr/local/bin/hostname
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "2"
  Location:          ../../../../../openmpi-1.8.2rc3/orte/mca/odls/default/odls_default_module.c:551
--------------------------------------------------------------------------
sunpc1 openmpi_1.7.x_or_newer 110 




The rankfile worked for older versions of Open MPI.

tyr openmpi_1.7.x_or_newer 139 ompi_info | grep MPI:
                Open MPI: 1.8.2a1r31804
tyr openmpi_1.7.x_or_newer 140 mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
[tyr.informatik.hs-fulda.de:27171] MCW rank 3 bound to socket 1[core 1[hwt 0]]: [.][B]
tyr.informatik.hs-fulda.de
[linpc1:12790] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[linpc0:27221] MCW rank 0 is not bound (or bound to all available processors)
linpc1
linpc0
[sunpc1:03046] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc1
tyr openmpi_1.7.x_or_newer 141 




I get the following error if I use two Sparc machines
(Sun M4000 servers with two quad-core Sparc64 VII processors
and two hardware threads per core). I'm not sure whether this
worked before or whether I have to use different options to
make it work.

tyr openmpi_1.7.x_or_newer 151 cat rf_rs0_rs1
rank 0=rs0 slot=0:0-7
rank 1=rs0 slot=1
rank 2=rs1 slot=0
rank 3=rs1 slot=1
tyr openmpi_1.7.x_or_newer 152 

rs0 openmpi_1.7.x_or_newer 104 mpiexec --report-bindings --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
[rs0.informatik.hs-fulda.de:26085] [[28578,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.8.2rc3/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 279
[rs0.informatik.hs-fulda.de:26085] [[28578,0],0] ORTE_ERROR_LOG: Not found in file ../../../../openmpi-1.8.2rc3/orte/mca/rmaps/base/rmaps_base_map_job.c at line 285
rs0 openmpi_1.7.x_or_newer 105 


It works with the following command, which does not use a rankfile.

rs0 openmpi_1.7.x_or_newer 107 mpiexec --report-bindings -np 4 --host rs0,rs1 --bind-to hwthread hostname
[rs0.informatik.hs-fulda.de:26102] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
[rs0.informatik.hs-fulda.de:26102] MCW rank 1 bound to socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs1.informatik.hs-fulda.de
[rs1.informatik.hs-fulda.de:28740] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
[rs1.informatik.hs-fulda.de:28740] MCW rank 3 bound to socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
rs1.informatik.hs-fulda.de
rs0 openmpi_1.7.x_or_newer 108 


I would be grateful if somebody could fix the problem. Please let
me know if I can provide anything else. Thank you very much in
advance for any help.


Kind regards

Siegmar
