Rainer,

what if you explicitly bind tasks to cores?

mpirun -bind-to core ...

note this is the v1.8 syntax (v1.6 spells the same option --bind-to-core).
v1.6 is now obsolete (the Debian folks are working on upgrading it...)

out of curiosity, did you try another distro (Red Hat and the like, SUSE, ...),
and do you observe the same behavior there?

and btw, what does /proc/self/status say?
bound to cores? to a socket? no binding at all?
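
To make that check concrete, here is a minimal sketch (Python, Linux only) that reads the Cpus_allowed_list field from /proc/<pid>/status and expands it into individual CPU numbers. The helper names parse_cpu_list and allowed_cpus are made up for illustration and are not part of Open MPI or hwloc.

```python
import sys


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty field, e.g. an empty list
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def allowed_cpus(pid="self"):
    """Return the CPUs named in Cpus_allowed_list of /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return []  # field not present


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 show_allowed.py <pid> [<pid> ...]
    for pid in sys.argv[1:] or ["self"]:
        cpus = allowed_cpus(pid)
        print(f"{pid}: {len(cpus)} allowed CPU(s): {cpus}")
```

Typically a task bound to a core lists only that core's hardware thread(s), a task bound to a socket lists that socket's CPUs, and an unbound task lists every CPU available to it.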

Cheers,

Gilles

On Wednesday, March 23, 2016, Rainer Koenig <rainer.koe...@ts.fujitsu.com>
wrote:

> Gilles,
>
> I managed to get snapshots of all the /proc/<pid>/status entries for all
> liggghts jobs, but the Cpus_allowed value is similar regardless of whether
> the system was cold or warm booted.
>
> Then I looked around in /proc/ and found sched_debug.
>
> This at least shows that the liggghts processes are not spread over all
> cores. Some cores have just one of them, some have none, and some have many.
>
> I agree that the processes not being spread over all cores is a
> consequence and not the root cause. This means I now need to find out
> how the kernel scheduler decides which core a process should run on,
> and why it can spread 48 tasks over 48 cores when I cold boot the
> machine but can't when I warm boot it.
>
> So I guess I have to take this issue to the Linux kernel mailing list.
> Another thing that points towards the kernel is that yesterday I
> installed a newer 4.4.0 kernel on the machine and the problem is still
> there, but not as bad as on the 4.2 kernel.
>
> I also tried mpirun -mca... but that didn't change anything.
>
> Thanks for your input anyway. At least I now have a sched_debug
> snapshot; maybe that will help with further investigation.
>
> Regards
> Rainer
>
> On 22.03.2016 at 14:38, Gilles Gouaillardet wrote:
> > Rainer,
> >
> > a first step could be to gather /proc/pid/status for your 48 tasks.
> > then you can
> > grep Cpus_allowed_list
> > and see if you find something suspicious.
> >
> > if your processes are idling, then the scheduler might assign them to
> > the same core.
> > in this case, your processes not being spread is a consequence and not a
> > root cause.
> >
> > just to make sure there are no strange side effects, could you
> > mpirun --mca btl sm,self ...
> >
> > Cheers,
> >
> > Gilles
> >
> >
> > On Tuesday, March 22, 2016, Rainer Koenig <rainer.koe...@ts.fujitsu.com>
> > wrote:
> >
> >     On 17.03.2016 at 10:40, Ralph Castain wrote:
> >     > Just some thoughts offhand:
> >     >
> >     > * what version of OMPI are you using?
> >
> >     dpkg -l openmpi-bin says 1.6.5-8 from Ubuntu 14.04.
> >     >
> >     > * are you saying that after the warm reboot, all 48 procs are
> >     > running on a subset of cores?
> >
> >     Yes. After a cold boot all 48 processes are spread over all 48 cores
> >     and all cores show up as almost 100% in the htop CPU meter.
> >
> >     After a warm boot, the 48 processes are spread over just a few cores
> >     and the rest of the system is idling.
> >
> >     > * it sounds like some of the cores have been marked as “offline”
> >     > for some reason. Make sure you have hwloc installed on the machine,
> >     > and run “lstopo” and see if that is the case
> >
> >     I tried lstopo, but the graphics I got look almost identical.
> >     The visible difference is in the topology shown for the graphics
> >     adapter and the LAN cards. The path to the graphics adapter shows the
> >     numbers 4,0 twice above the lines and the path to eth0 shows the
> >     numbers 0,2 twice above the lines. lstopo for the warm boot looks
> >     identical, but those small numbers are missing.
> >
> >     I also tried hwloc-gather-topology and diff'd the two results.
> >     There is nothing unusual to see: there are differences in
> >     /proc/stats/ and /proc/cpuinfo, but only in the values.
> >
> >     Something is obviously wrong at a low level, but I'm still struggling
> >     to find it. :-/
> >
> >     Rainer
> >     --
> >     Dipl.-Inf. (FH) Rainer Koenig
> >     Project Manager Linux Clients
> >     Dept. PDG WPS R&D SW OSE
> >
> >     Fujitsu Technology Solutions
> >     Bürgermeister-Ullrich-Str. 100
> >     86199 Augsburg
> >     Germany
> >
> >     Telephone: +49-821-804-3321
> >     Telefax:   +49-821-804-2131
> >     Mail:      mailto:rainer.koe...@ts.fujitsu.com
> >
> >     Internet         ts.fujtsu.com
> >     Company Details  ts.fujitsu.com/imprint.html
> >     _______________________________________________
> >     users mailing list
> >     us...@open-mpi.org
> >     Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >     Link to this post:
> >     http://www.open-mpi.org/community/lists/users/2016/03/28787.php
> >
