On Jun 29, 2006, at 5:23 PM, Graham E Fagg wrote:
Hi Doug
Wow, it looks like some messages are getting lost (or even delivered
to the wrong peer on the same node). Could you also try with:
-mca coll_base_verbose 1 -mca coll_tuned_use_dynamic_rules 1 -mca
coll_tuned_bcast_algorithm <1,2,3,4,5,6>
The values 1-6 control which topology/algorithm is used internally.
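(As a concrete sketch, assuming a 4-process run of a test program called bcast_test, the suggested invocation would look roughly like:

    mpirun -np 4 -mca coll_base_verbose 1 \
        -mca coll_tuned_use_dynamic_rules 1 \
        -mca coll_tuned_bcast_algorithm 6 ./bcast_test

where the process count and executable name are placeholders; only the -mca flags come from the message above.)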
The results are... very odd. With algorithms 1-5, everything seems
to be okay: I ran a couple of trials of each and never had it hang.
When I use algorithm 6, I get:
[odin003.cs.indiana.edu:14174] *** An error occurred in MPI_Bcast
[odin005.cs.indiana.edu:10510] *** An error occurred in MPI_Bcast
Broadcasting integers from root 0...[odin004.cs.indiana.edu:11752]
*** An error occurred in MPI_Bcast
[odin003.cs.indiana.edu:14174] *** on communicator MPI_COMM_WORLD
[odin005.cs.indiana.edu:10510] *** on communicator MPI_COMM_WORLD
[odin005.cs.indiana.edu:10510] *** MPI_ERR_ARG: invalid argument of
some other kind
[odin005.cs.indiana.edu:10510] *** MPI_ERRORS_ARE_FATAL (goodbye)
[odin002.cs.indiana.edu:05866] *** An error occurred in MPI_Bcast
[odin004.cs.indiana.edu:11752] *** on communicator MPI_COMM_WORLD
[odin003.cs.indiana.edu:14174] *** MPI_ERR_ARG: invalid argument of
some other kind
[message repeated many times for the different processes]
Are there other settings I can tweak to find out which algorithm
it's deciding to use at run time?
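For reference, the broadcast test is essentially the following; this is just a minimal sketch with an assumed buffer size and element count, not the actual test source:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        int buf[1024];   /* integer payload broadcast from root 0 */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            printf("Broadcasting integers from root 0...\n");
            for (i = 0; i < 1024; ++i) {
                buf[i] = i;   /* root fills the buffer */
            }
        }

        /* broadcast under test; the algorithm-6 runs abort in a call like this */
        MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }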
Cheers,
Doug