Folks,
I committed 248acbbc3ba06c2bef04f840e07816f71f864959 to fix a
hang in coll/ml
when using srun (both PMI1 and PMI2).
Could you please give it a try?
Cheers,
Gilles
On 2014/10/22 23:03, Joshua Ladd wrote:
Privet, Artem
ML is the collective component that is invoking the calls into BCOL. The
triplet basesmuma,basesmuma,ptpcoll, for example, means I want three levels
of hierarchy - socket level, UMA level, and then network level. I am
guessing (only a guess after a quick glance) that maybe srun is
Hey, Lena :).
2014-10-17 22:07 GMT+07:00 Elena Elkina:
Hi Artem,
Actually some time ago there was a known issue with coll ml. I used to run
my command lines with -mca coll ^ml to avoid these problems, so I don't
know if it was fixed or not. It looks like you have the same problem.
Best regards,
Elena
On Fri, Oct 17, 2014 at 7:01 PM, Artem Polyakov
Gilles,
I checked your patch and it doesn't solve the problem I observe. I think
the reason is somewhere else.
2014-10-17 19:13 GMT+07:00 Gilles Gouaillardet <
gilles.gouaillar...@gmail.com>:
Artem,
There is a known issue #235 with modex, and I made PR #238 with a tentative fix.
Could you please give it a try and report whether it solves your problem?
Cheers
Gilles
Artem Polyakov wrote:
Hello, I am having trouble with the latest trunk when I use PMI1.
For example, if I use 2 nodes, the application hangs. See the backtraces from
both nodes below. From them I can see that the second (non-launching) node
hangs in bcol component selection. Here is the default setting of the
bcol_base_string parameter: