2015-01-05 23:03 GMT+06:00 Andy Riebs <andy.ri...@hp.com>:

>  Will this fix be in 14.11.3?
>
> (The system is in the customer's hands, so my ability to test it is
> limited.)
>

Currently it is not in the 14.11 branch.
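
For anyone who wants to understand the symptom while waiting for a backport: the
message comes from a sequence-number check on the KVS updates that srun sends to
each slurmstepd. Below is a minimal, illustrative C sketch of how such a check
rejects an update that still carries the previous number. It is not the actual
Slurm pmi2 code; the function and variable names are invented for illustration
only.

/*
 * Illustrative sketch only -- not Slurm's pmi2 implementation. It shows the
 * general shape of the check behind "invalid kvs seq from srun, expect N got
 * M": the stepd tracks the sequence number it expects next and rejects an
 * update carrying any other number. If the two sides race on when the counter
 * is bumped, an update can arrive with the previous number and be rejected,
 * which is consistent with the "expect 3 got 2" symptom reported below.
 */
#include <stdio.h>

static int expected_kvs_seq = 1;    /* what the stepd thinks comes next */

/* Return 0 if the update is accepted, -1 if it is rejected. */
static int handle_kvs_update(int kvs_seq_from_srun)
{
    if (kvs_seq_from_srun != expected_kvs_seq) {
        fprintf(stderr,
                "mpi/pmi2: invalid kvs seq from srun, expect %d got %d\n",
                expected_kvs_seq, kvs_seq_from_srun);
        return -1;
    }
    expected_kvs_seq++;             /* accept and advance the window */
    return 0;
}

int main(void)
{
    /* Normal case: updates arrive in the order the stepd expects. */
    handle_kvs_update(1);
    handle_kvs_update(2);

    /*
     * Raced case: the stepd has already moved on to expecting 3, but the
     * update that arrives still carries 2, so it is rejected even though
     * nothing is actually wrong with it.
     */
    handle_kvs_update(2);
    return 0;
}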


>
> Andy
>
>
> On 12/21/2014 10:59 AM, Artem Polyakov wrote:
>
> Hello, Andy!
>
>  I think this is the race condition that I found and fixed a few weeks ago:
>
> https://github.com/SchedMD/slurm/commit/bc190aee6f517eefe11c01f455fe4fde73ba8323
>
>  There was a detailed discussion of it in Bugzilla:
> http://bugs.schedmd.com/show_bug.cgi?id=1302
>  Here is the image that demonstrates the case:
> http://bugs.schedmd.com/attachment.cgi?id=1490
>
>  Could you port/try this patch?
>
> 2014-12-21 21:45 GMT+06:00 Andy Riebs <andy.ri...@hp.com>:
>
>>
>> We are sporadically seeing messages such as these when running on more
>> than 1000 nodes:
>>
>> slurmstepd: mpi/pmi2: invalid kvs seq from srun, expect 3 got 2
>>
>> We have seen this with both SHMEM and BUPC jobs, though only in perhaps
>> 3 or 4 runs out of a hundred. A typical job might look like
>>
>> salloc -N1280
>>     srun foo
>>     srun foo
>>     srun foo
>>
>> On those rare occasions when one of the first two steps fails, the
>> remaining steps run just fine, so we know it's not a question of which
>> nodes are used at each step.
>>
>> The environment:
>> * RHEL 6.5
>> * Slurm 14.11.1
>> * SHMEM provided by OpenMPI 1.8.4rc1
>> * Berkeley UPC 2.18.0, built on OpenMPI
>>
>> The only thing unusual in slurm.conf is MpiDefault=pmi2 (which is
>> probably obvious from the messages).
>>
>>
>> Any ideas?
>> Andy
>>
>> --
>> Andy Riebs
>> Hewlett-Packard Company
>> High Performance Computing
>> +1 404 648 9024
>> My opinions are not necessarily those of HP
>>
>
>
>
>  --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov


-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
