Hi Ralph

> > some weeks ago (mainly at the beginning of October) I reported
> > several problems, and I would be grateful if you could tell me
> > whether and, if possible, when somebody will try to solve them.
> > 
> > 1) I don't get the expected results when I try to send or scatter
> >   the columns of a matrix in Java. In a homogeneous environment
> >   the received column values have nothing to do with the original
> >   values, and in a heterogeneous environment the program breaks
> >   with "An error occurred in MPI_Comm_dup" and "MPI_ERR_INTERN:
> >   internal error". I would like to use the Java API.
> > 
> > 2) I don't get the expected result when I try to scatter an object
> >   in Java.
> >   https://svn.open-mpi.org/trac/ompi/ticket/3351
> 
> Nothing has happened on these yet

Do you have an idea when somebody will have time to fix these problems?
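
In case it helps, here is a minimal sketch of the kind of column
scatter I mean in item 1. It is not my original test program: it
simply packs the columns into a one-dimensional buffer on the root
process instead of using a derived datatype, and the method names
follow the mpiJava-style interface of the Java bindings in the trunk
I'm using, so they may differ in newer bindings.

import mpi.*;

public class ColumnScatter {
  public static void main(String[] args) throws MPIException {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();
    int p    = MPI.COMM_WORLD.Size();
    int n    = p;                          /* one column per process */

    /* Pack the n x n matrix column by column into a 1-D buffer on
     * the root, so that every process can receive one contiguous
     * column of n doubles.
     */
    double[] sendbuf = new double[n * n];
    if (rank == 0) {
      double[][] matrix = new double[n][n];
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
          matrix[i][j] = i * n + j;
      for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
          sendbuf[j * n + i] = matrix[i][j];
    }

    double[] column = new double[n];
    MPI.COMM_WORLD.Scatter(sendbuf, 0, n, MPI.DOUBLE,
                           column,  0, n, MPI.DOUBLE, 0);

    System.out.println("rank " + rank + ": column "
                       + java.util.Arrays.toString(column));
    MPI.Finalize();
  }
}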


> > 3) When I use a "rankfile", I still only get a message that all
> >   nodes are already filled up, and nothing else happens. I would
> >   like to use a rankfile. You filed a bug fix for it.
> > 
> 
> I believe rankfile was fixed, at least on the trunk - not sure if it
> was moved to 1.7. I assume that's the release you are talking about?

I'm using the trunk for my tests. It didn't work for me because I used
the rankfile without a hostfile or a hostlist (it is not enough to
specify the hosts in the rankfile). Everything works fine when I provide
a "correct" hostfile or hostlist and the binding isn't too compilicated
(see my last example below).

My rankfile:

rank 0=sunpc0 slot=0:0
rank 1=sunpc1 slot=0:0
rank 2=sunpc0 slot=1:0
rank 3=sunpc1 slot=1:0


My hostfile:

sunpc0 slots=4
sunpc1 slots=4


It will not work without a hostfile or hostlist.

sunpc0 mpi-probleme 128 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 hostname
------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc1
------------------------------------------------------------------------
sunpc0 mpi-probleme 129 


I get the expected output if I add "-hostfile host_sunpc" or
"-host sunpc0,sunpc1" on the command line.

sunpc0 mpi-probleme 129 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 -hostfile host_sunpc hostname
[sunpc0:06954] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
[sunpc0:06954] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc0
sunpc0
[sunpc1:12583] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
[sunpc1:12583] MCW rank 3 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
sunpc1
sunpc1
sunpc0 mpi-probleme 130 


Furthermore, the rankfile and the hostfile must both use hostnames in
the same form, either fully qualified or unqualified. Otherwise it
will not work, as you can see in the following output, where my
hostfile contains fully qualified hostnames while my rankfile contains
only the hostnames without the domain name.
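
For completeness, my hostfile host_sunpc_full looks roughly like this
(the real domain name is replaced by a placeholder here):

sunpc0.example.com slots=4
sunpc1.example.com slots=4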

sunpc0 mpi-probleme 131 mpiexec -report-bindings -rf rankfile_1.openmpi \
  -np 4 -hostfile host_sunpc_full hostname
------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc1
------------------------------------------------------------------------
sunpc0 mpi-probleme 132 


Unfortunately my complicated rankfile still doesn't work, although
you told me some weeks ago that it was correct.

rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

sunpc1 mpi-probleme 103 mpiexec -report-bindings -rf rankfile -np 4 \
  -hostfile host_sunpc hostname
sunpc1
sunpc1
sunpc1
[sunpc1:12741] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:12741] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
[sunpc1:12741] MCW rank 1 bound to socket 0[core 0[hwt 0]],
   socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc0:07075] MCW rank 0 bound to socket 0[core 0[hwt 0]],
   socket 0[core 1[hwt 0]]: [B/B][./.]
sunpc0
sunpc1 mpi-probleme 104 

The bindings for ranks 1 to 3 are correct, but rank 0 didn't get the
cores from the second socket: it is bound to [B/B][./.] instead of the
[B/B][B/B] that "slot=0:0-1,1:0-1" should produce.



> > 4) I would like "-cpus-per-proc", "-npersocket", etc. to apply to
> >   every set of machines/applications individually and not globally
> >   to all machines/applications if I specify several colon-separated
> >   sets of machines or applications on the command line. You told me
> >   that it could be done.
> > 
> > 5) By the way, it seems that the option "-cpus-per-proc" is no
> >   longer supported in openmpi-1.7 and openmpi-1.9. How can I bind a
> >   multi-threaded process to more than one core in these versions?
> 
> I'm afraid I haven't gotten around to working on cpus-per-proc, though
> I believe npersocket was fixed.

Will you also support "-cpus-per-proc" in openmpi-1.7 and openmpi-1.9?
At the moment it isn't available.

sunpc1 mpi-probleme 106 mpiexec -report-bindings -np 4 \
  -host linpc0,linpc1,sunpc0,sunpc1 -cpus-per-proc 4 -map-by core \
  -bind-to core hostname
mpiexec: Error: unknown option "-p"
Type 'mpiexec --help' for usage.


sunpc1 mpi-probleme 110 mpiexec --help | grep cpus
                         cpus allocated to this job [default: none]
   -use-hwthread-cpus|--use-hwthread-cpus 
                         Use hardware threads as independent cpus
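
Just to illustrate what I mean in item 4 above: I would like to be
able to write something like the following, where every colon-separated
application context gets its own value for options like
"-cpus-per-proc" ("prog_a" and "prog_b" are only placeholders, and
this is the desired behaviour, not something that works today):

mpiexec -np 2 -cpus-per-proc 4 -host linpc0,linpc1 prog_a : \
  -np 2 -cpus-per-proc 2 -host sunpc0,sunpc1 prog_b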



> > I can provide my small programs once more if you need them. Thank
> > you very much for any answer in advance.

Thank you very much for all your help and time

Siegmar
