2015-01-05 23:03 GMT+06:00 Andy Riebs <andy.ri...@hp.com>:
> Will this fix be in 14.11.3?
>
> (The system is in the customer's hands, so my ability to test it is
> limited.)
Currently it is not in the 14.11 branch.

> Andy
>
> On 12/21/2014 10:59 AM, Artem Polyakov wrote:
>
> Hello, Andy!
>
> I think this is the race condition that I found and fixed a few weeks ago:
> https://github.com/SchedMD/slurm/commit/bc190aee6f517eefe11c01f455fe4fde73ba8323
>
> There was a detailed discussion of it in Bugzilla:
> http://bugs.schedmd.com/show_bug.cgi?id=1302
> Here is an image that illustrates the case:
> http://bugs.schedmd.com/attachment.cgi?id=1490
>
> Could you port/try this patch?
>
> 2014-12-21 21:45 GMT+06:00 Andy Riebs <andy.ri...@hp.com>:
>>
>> We are sporadically seeing messages such as this when running on more
>> than 1000 nodes:
>>
>> slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3 got 2
>>
>> We have seen this with both SHMEM and BUPC jobs, but only in perhaps
>> 3 or 4 runs out of a hundred. A typical job might look like
>>
>> salloc -N1280
>> srun foo
>> srun foo
>> srun foo
>>
>> On those rare occasions where one of the first two steps fails, the
>> remaining steps run just fine, so we know it's not a question of which
>> nodes are used at each step.
>>
>> The environment:
>> * RHEL 6.5
>> * Slurm 14.11.1
>> * SHMEM provided by OpenMPI 1.8.4rc1
>> * Berkeley UPC 2.18.0, built on OpenMPI
>>
>> The only thing unusual in slurm.conf is MpiDefault=pmi2 (which is
>> probably obvious from the messages).
>>
>> Any ideas?
>> Andy
>>
>> --
>> Andy Riebs
>> Hewlett-Packard Company
>> High Performance Computing
>> +1 404 648 9024
>> My opinions are not necessarily those of HP
>
> --
> Best regards, Artem Y. Polyakov

--
Best regards, Artem Y. Polyakov
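
For porting the referenced fix onto a local 14.11 installation, a minimal
sketch is below. The branch name to check out is an assumption and should
be replaced with whatever matches the locally installed 14.11.1 sources;
rebuilding and reinstalling the mpi/pmi2 plugin afterwards follows the
site's usual build procedure. The commit hash and repository are the ones
linked above.

    # Sketch only: apply the referenced commit to a local 14.11 source tree.
    git clone https://github.com/SchedMD/slurm.git
    cd slurm
    git checkout slurm-14.11    # assumed branch/tag name; match your installed release
    git cherry-pick bc190aee6f517eefe11c01f455fe4fde73ba8323
    # Then rebuild and reinstall (at minimum the mpi/pmi2 plugin under
    # src/plugins/mpi/pmi2) using the normal configure/make steps for the site.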
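
Since the failure shows up in only 3 or 4 runs out of a hundred, one way to
exercise it is to wrap the reported job pattern in a loop. The sketch below
is only illustrative: "foo", the loop counts, and the log file name are
placeholders, and --mpi=pmi2 is redundant when MpiDefault=pmi2 is already
set in slurm.conf; it is spelled out only to make the PMI2 dependence explicit.

    # Run the three-step job repeatedly inside one allocation and collect
    # any "invalid kvs seq" messages; every name here is a placeholder.
    salloc -N1280 bash -c '
      for run in $(seq 1 100); do
        for step in 1 2 3; do
          srun --mpi=pmi2 ./foo 2>&1 | tee -a repro.log
        done
      done
    '
    grep "invalid kvs seq" repro.log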