On Jun 29, 2006, at 5:23 PM, Graham E Fagg wrote:

Hi Doug
Wow, it looks like some messages are getting lost (or even delivered to the wrong peer on the same node...). Could you also try with:

-mca coll_base_verbose 1 -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_bcast_algorithm <1,2,3,4,5,6>

The values 1-6 control which topology/algorithm is used internally.
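
For example, a full invocation might look something like this (the process count and the "./bcast_test" executable name are just placeholders, not from the original mail):

  mpirun -np 4 -mca coll_base_verbose 1 -mca coll_tuned_use_dynamic_rules 1 \
         -mca coll_tuned_bcast_algorithm 6 ./bcast_test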

The results are... very odd. With algorithms 1-5, everything seems to be okay: I ran a couple of trials of each and never had it hang.

When I use algorithm 6, I get:

[odin003.cs.indiana.edu:14174] *** An error occurred in MPI_Bcast
[odin005.cs.indiana.edu:10510] *** An error occurred in MPI_Bcast
Broadcasting integers from root 0...[odin004.cs.indiana.edu:11752] *** An error occurred in MPI_Bcast
[odin003.cs.indiana.edu:14174] *** on communicator MPI_COMM_WORLD
[odin005.cs.indiana.edu:10510] *** on communicator MPI_COMM_WORLD
[odin005.cs.indiana.edu:10510] *** MPI_ERR_ARG: invalid argument of some other kind
[odin005.cs.indiana.edu:10510] *** MPI_ERRORS_ARE_FATAL (goodbye)
[odin002.cs.indiana.edu:05866] *** An error occurred in MPI_Bcast
[odin004.cs.indiana.edu:11752] *** on communicator MPI_COMM_WORLD
[odin003.cs.indiana.edu:14174] *** MPI_ERR_ARG: invalid argument of some other kind
[message repeated many times for the different processes]
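
For context, the test here is essentially just an integer broadcast from rank 0; a minimal sketch of that kind of program (the buffer size and names are my own placeholders, not the actual test source) would be:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, buf[1024];
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          printf("Broadcasting integers from root 0...");
          fflush(stdout);
          for (int i = 0; i < 1024; ++i) buf[i] = i;
      }
      /* All ranks call MPI_Bcast with the same count, datatype, and root. */
      MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }

(compiled with mpicc and launched with an mpirun line like the one above)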

Are there other settings I can tweak to find out which algorithm it's deciding to use at run-time?

        Cheers,
        Doug
