Hello, Andy.

I am not a SLURM expert, so please treat the following advice as optional.

According to the sources, PMI2 fails when trying to broadcast the
cumulative database to the nodes. It tries 5 times, with delays that
increase as powers of two:
1, 2, 4, 8, 16 seconds.
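
To make the retry pattern concrete, here is a minimal sketch of such a
loop. It is not the actual Slurm code: send_fn, sleep_fn,
send_with_backoff and count_and_fail are hypothetical names made up for
illustration, and only the "5 tries, delays 1, 2, 4, 8, 16" behaviour is
taken from the description above.

```c
#include <stddef.h>

/* Hypothetical callback standing in for the real broadcast; 0 means success. */
typedef int (*send_fn)(void *ctx);
/* Sleep is injected so the loop can be exercised without really waiting. */
typedef void (*sleep_fn)(unsigned seconds);

/* Delay in seconds after failed attempt number `attempt` (0-based): 1, 2, 4, 8, 16. */
unsigned retry_delay_sec(unsigned attempt)
{
    return 1u << attempt;
}

/* Sum of all delays if every one of max_tries attempts fails. */
unsigned total_backoff_sec(unsigned max_tries)
{
    unsigned total = 0;
    for (unsigned attempt = 0; attempt < max_tries; attempt++)
        total += retry_delay_sec(attempt);
    return total;
}

/* Call fn up to max_tries times; return 0 on first success, -1 if all fail. */
int send_with_backoff(send_fn fn, void *ctx, unsigned max_tries, sleep_fn do_sleep)
{
    for (unsigned attempt = 0; attempt < max_tries; attempt++) {
        if (fn(ctx) == 0)
            return 0;
        if (do_sleep != NULL)
            do_sleep(retry_delay_sec(attempt));
    }
    return -1;
}

/* Illustrative stub: counts how often it was called and always fails. */
int count_and_fail(void *ctx)
{
    ++*(unsigned *)ctx;
    return -1;
}
```

With max_tries = 5 the delays in this sketch add up to 31 seconds, so a
failing launch would spend roughly half a minute retrying before the
error is reported.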

The cumulative size of the database should be proportional to
1534*24*437 = 16088592 bytes, about 16 MB, which is not that much. I
attached a small patch that outputs the exact DB size at the point where
it is broadcast to the compute nodes (I assume that you can change the
sources).

One thing that comes to mind: what if you try half as many tasks with
messages twice as big? 2x437 is still less than 1024 (PMI2_MAX_VALLEN).
Since the cumulative DB size stays the same, this would tell us whether
that size is the problem.
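
The arithmetic behind that experiment can be checked with a few lines of
C. This is only a rough sketch: it counts payload bytes and ignores keys
and any framing overhead, the function names are made up for
illustration, and VALLEN_LIMIT is simply the 1024 quoted above.

```c
#include <stdint.h>

/* The 1024-byte limit quoted above for PMI2_MAX_VALLEN. */
#define VALLEN_LIMIT 1024u

/* Payload-only estimate: one value of msg_bytes per task (assumption). */
uint64_t kvs_payload_bytes(uint64_t nodes, uint64_t ppn, uint64_t msg_bytes)
{
    return nodes * ppn * msg_bytes;
}

/* 1 if a single value of msg_bytes stays below the limit, else 0. */
int fits_vallen(uint64_t msg_bytes)
{
    return msg_bytes < VALLEN_LIMIT;
}
```

For example, kvs_payload_bytes(1534, 24, 437) and
kvs_payload_bytes(767, 24, 874) give the same total of about 16 MB, and
fits_vallen(874) still holds, so the halved run keeps the cumulative
size fixed while changing the task count.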

Most probably the problem is in slurm_forward_data. I would do the
following two things:
1. Enable the 3rd level of debug on the slurmd's (SlurmdDebug=3
configuration option) to see what is failing there. You have probably
done that already but didn't mention it in the mail.
2. Optionally, you can try to play with the Slurm fanout tree width
(TreeWidth=10/50/100/whatever... configuration option).

2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:

> Hi Slurm gurus!   We are seeing an issue when launching large rank count
> jobs in our IB cluster using PMI2 and could use your help. When the jobs
> fail, the first line of output seems to have the most useful information:
>
> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
>
> One of our Mellanox friends stripped down the test case to just include
> the PMI2 startup code from MPI to help try to isolate the issue further.
> This code just takes one argument which is how many bytes to put in the
> message.  When we start up this test code on 1534 nodes with PPN=24, 436
> byte messages will pass, but 437 byte messages will fail.  (Maybe those
> numbers will help someone figure this out!)
>
> The only interesting slurm configuration option that we have updated is:
> MessageTimeout=60, but it did not impact this issue.   We are looking for
> advice on how to proceed in debugging/troubleshooting this issue further.
>
> I have attached the test program to this message.
>
> We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
>
> Andy
>
> --
> Andy Riebs
> Hewlett-Packard Company
> High Performance Computing
> +1 404 648 9024
> My opinions are not necessarily those of HP
>
>


-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
diff --git a/src/plugins/mpi/pmi2/kvs.c b/src/plugins/mpi/pmi2/kvs.c
index c7d4e42..16c5680 100644
--- a/src/plugins/mpi/pmi2/kvs.c
+++ b/src/plugins/mpi/pmi2/kvs.c
@@ -202,6 +202,7 @@ temp_kvs_send(void)
 			rc = tree_msg_to_stepds(job_info.step_nodelist,
 						temp_kvs_cnt,
 						temp_kvs_buf);
+			debug3("TEMP DEBUG: Broadcast %u bytes", temp_kvs_cnt);
 		} else if (tree_info.parent_node != NULL) {
 			/* non-first-level stepds */
 			rc = tree_msg_to_stepds(tree_info.parent_node,
