Hi Rémi!

Thanks very much for your reply. Switching on PMI_DEBUG shows that most of the 
time is spent after the last call to

In: PMI_KVS_Get(hostname[95])

There are two calls that take a few seconds right afterwards,

In: PMI_KVS_Get(hostname[95])
In: PMI_KVS_Get_key_length_max
In: PMI_KVS_Get_value_length_max

- altogether maybe 5-10s to get here. These are followed by a large number of

In: PMI_Get_rank

and

In: PMI_Get_size

until the process is killed after about 30s.

Decreasing PMI_TIME from 500 to smaller values (all the way down to 50) changes 
the number of PMI_Get_size calls showing up in the logs (i.e. it gets slightly 
faster, so PMI can finish more of the PMI_Get_rank calls and proceed with the 
PMI_Get_size calls, but it never finishes the initialisation before the timeout).
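
To compare runs with different PMI_TIME values, the calls in a captured 
PMI_DEBUG log can be tallied with something like the minimal Python sketch 
below. It assumes every log entry has the form "In: PMI_<name>", optionally 
followed by arguments in parentheses, as in the excerpts above. The name 
count_pmi_calls is purely illustrative and not part of PMI or Slurm.

```python
import re
from collections import Counter

# Assumption: each PMI_DEBUG entry looks like "In: PMI_<name>", optionally
# followed by "(args)", as in the log excerpts quoted in this message.
CALL_RE = re.compile(r"^\s*In:\s+(PMI_\w+)")


def count_pmi_calls(lines):
    """Return a Counter mapping each PMI function name to its number of entries."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.match(line)
        if match:
            # group(1) is the bare function name, without any "(args)" suffix
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_pmi_calls (a file object iterates over its 
lines) and printing the result of .most_common() gives one count per PMI 
function, which makes it easier to see how far the initialisation got before 
the timeout.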

Out of curiosity, how can I choose to use pmi2? I first compiled mvapich2-2.2b 
with "--with-pm=none --with-pmi=slurm". Will "--with-pm=none --with-pmi=pmi2" 
work?

Thanks again,

Dom

> On 23/03/2016, at 12:58 PM, Rémi Palancher <[email protected]> wrote:
> 
> 
> Le 23/03/2016 08:54, Dominikus Heinzeller a écrit :
>> 
>> Hi all,
>>      
>>      I am having a problem with spawning a large number of threads on a
>>      node. My server consists of 4 sockets x 12 cores per socket x
>>      2 threads per core = 96 procs
>>      
>>       [...]
>>      Any help or suggestion what I could do?
> 
> It looks like you're using PMI1, so there must be something wrong in PMI 
> initialization. 96 tasks on one node is not something large, and there's no 
> reason to spend more than 5 seconds on this... You could profile PMI 
> calls by setting the PMI_DEBUG environment variable to 1, to find out where it 
> takes time.
> 
> You could also set the PMI_TIME environment variable to a value 
> <500, and see if there's any difference.
> 
> Best,
> Rémi
