Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-23 Thread Rainer Koenig
Gilles,

I managed to get snapshots of the /proc/<pid>/status entries for all
liggghts jobs, but Cpus_allowed is similar regardless of whether the
system was cold or warm booted.
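
A minimal Python sketch of how such a snapshot could be gathered, for anyone who wants to repeat the comparison. It is not the procedure used in this thread. It assumes that the processes appear under the name "liggghts" in the Name field of /proc/<pid>/status; the function names are hypothetical.

```python
import os

def parse_status(text):
    """Return the fields of a /proc/<pid>/status dump as a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def allowed_cpus_by_pid(name, proc_root="/proc"):
    """Map PID -> Cpus_allowed_list for every process whose Name matches."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                fields = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if fields.get("Name") == name:
            result[int(entry)] = fields.get("Cpus_allowed_list")
    return result
```

Calling allowed_cpus_by_pid("liggghts") once after a cold boot and once after a warm boot gives two dictionaries whose values can be compared directly. If every task reports the full CPU range in both cases, the affinity masks are not what restricts the placement.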

Then I looked around in /proc/ and found sched_debug.

This at least shows that the liggghts processes are not spread over all
cores. Some cores have just one of them, some have none and some have many.
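
The same per-core distribution can also be estimated without parsing sched_debug, whose layout depends on the kernel version. The sketch below is an alternative approach, not what was done above: it reads field 39 ("processor", the CPU a task last ran on) from /proc/<pid>/stat and counts tasks per CPU. The process name "liggghts" and all function names are assumptions.

```python
import os
from collections import Counter

def comm(stat_line):
    """Return the command name, i.e. the text between the outer parentheses."""
    return stat_line[stat_line.index("(") + 1 : stat_line.rindex(")")]

def last_cpu(stat_line):
    """Return the CPU a task last ran on (field 39 of /proc/<pid>/stat)."""
    # The command name may contain spaces or ')' itself, so split after the
    # last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[36])  # rest[0] is field 3, so field 39 is rest[36]

def tasks_per_cpu(name, proc_root="/proc"):
    """Count how many processes called `name` last ran on each CPU."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                line = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        if comm(line) == name:
            counts[last_cpu(line)] += 1
    return counts
```

With 48 busy tasks on 48 cores, tasks_per_cpu("liggghts") should return a count of 1 for most CPUs; counts well above 1 on a few CPUs and missing entries for the others would match the warm-boot picture described here. The value is only a snapshot, since the scheduler may migrate a task at any time.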

I agree that the problem that the processes are not spread over all
cores is a consequence but not the root cause. This means I now need to
find out how the kernel scheduler decides on which core a process should
run, and why it can spread 48 tasks over 48 cores when I cold boot the
machine but can't when I warm boot it.

So I guess I have to take this issue to the Linux kernel mailing list.
Another thing that points towards the kernel is that yesterday I
installed a newer 4.4.0 kernel on the machine and the problem is still
there, but not as bad as on the 4.2 kernel.

I also tried mpirun -mca... but that didn't change anything.

Thanks for your input anyway. At least I now have a sched_debug
snapshot, which may help in the further investigation.

Regards
Rainer

On 22.03.2016 at 14:38, Gilles Gouaillardet wrote:
> Rainer,
> 
> a first step could be to gather /proc/pid/status for your 48 tasks.
> then you can
> grep Cpus_allowed_list
> and see if you find something suspicious.
> 
> if your processes are idling, then the scheduler might assign them to
> the same core.
> in this case, your processes not being spread is a consequence and not a
> root cause.
> 
> just to make sure there are no strange side effects, could you
> mpirun --mca btl sm,self ...
> 
> Cheers,
> 
> Gilles
> 
> 
> On Tuesday, March 22, 2016, Rainer Koenig <rainer.koe...@ts.fujitsu.com
> <mailto:rainer.koe...@ts.fujitsu.com>> wrote:
> 
> On 17.03.2016 at 10:40, Ralph Castain wrote:
> > Just some thoughts offhand:
> >
> > * what version of OMPI are you using?
> 
> dpkg -l openmpi-bin says 1.6.5-8 from Ubuntu 14.04.
> >
> > * are you saying that after the warm reboot, all 48 procs are
> running on a subset of cores?
> 
> Yes. After a cold boot all 48 processes are spread over all 48 cores
> and all cores show up as almost 100% in the htop cpu meter.
> 
> After a warm boot, the 48 processes are just spread over a few cores and
> the rest of the system is idling.
> 
> > * it sounds like some of the cores have been marked as “offline”
> for some reason. Make sure you have hwloc installed on the machine,
> and run “lstopo” and see if that is the case
> 
> I tried with lstopo, but the graphics that I got look almost identical.
> The visible difference is in the sort of topology for the graphics
> adapter and the LAN cards. The path to the graphics shows 2 times the
> numbers 4,0 above the lines and the path to the eth0 shows 2 times the
> numbers 0,2 above the lines. lstopo for the warm boot looks identical,
> but those small numbers are missing now.
> 
> I also tried with hwloc-gather-topology and diff'd the 2 results. There
> is nothing special to see. Differences in /proc/stats/ and
> /proc/cpuinfo, but nothing special, just other values.
> 
> Something is obviously wrong on a low level, but I'm still struggling to
> find it. :-/
> 
> Rainer
> --
> Dipl.-Inf. (FH) Rainer Koenig
> Project Manager Linux Clients
> Dept. PDG WPS R SW OSE
> 
> Fujitsu Technology Solutions
> Bürgermeister-Ullrich-Str. 100
> 86199 Augsburg
> Germany
> 
> Telephone: +49-821-804-3321
> Telefax:   +49-821-804-2131
> Mail:  mailto:rainer.koe...@ts.fujitsu.com
> 
> Internet ts.fujtsu.com <http://ts.fujtsu.com>
> Company Details  ts.fujitsu.com/imprint.html
> <http://ts.fujitsu.com/imprint.html>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/03/28787.php




Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-22 Thread Rainer Koenig
On 17.03.2016 at 10:40, Ralph Castain wrote:
> Just some thoughts offhand:
> 
> * what version of OMPI are you using?

dpkg -l openmpi-bin says 1.6.5-8 from Ubuntu 14.04.
> 
> * are you saying that after the warm reboot, all 48 procs are running on a 
> subset of cores?

Yes. After a cold boot all 48 processes are spread over all 48 cores
and all cores show up as almost 100% in the htop cpu meter.

After a warm boot, the 48 processes are just spread over a few cores and
the rest of the system is idling.

> * it sounds like some of the cores have been marked as “offline” for some 
> reason. Make sure you have hwloc installed on the machine, and run “lstopo” 
> and see if that is the case

I tried with lstopo, but the graphics that I got look almost identical.
The visible difference is in the sort of topology for the graphics
adapter and the LAN cards. The path to the graphics shows 2 times the
numbers 4,0 above the lines and the path to the eth0 shows 2 times the
numbers 0,2 above the lines. lstopo for the warm boot looks identical,
but those small numbers are missing now.
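
Independently of lstopo, the kernel's own view of which cores are online can be read from /sys/devices/system/cpu/online (and its counterpart .../offline). The sketch below parses that CPU-list format; it is only an illustration with hypothetical function names, not something that was run in this thread.

```python
def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means an empty list
        lo, sep, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return cpus

def read_cpulist(path):
    """Read and parse a CPU-list file, e.g. /sys/devices/system/cpu/online."""
    with open(path) as fh:
        return parse_cpulist(fh.read())
```

On the 48-core machine, read_cpulist("/sys/devices/system/cpu/online") would be expected to equal set(range(48)) after both a cold and a warm boot if no core has been taken offline. The same parser also works on the Cpus_allowed_list value from /proc/<pid>/status.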

I also tried with hwloc-gather-topology and diff'd the 2 results. There
is nothing special to see. Differences in /proc/stats/ and
/proc/cpuinfo, but nothing special, just other values.

Something is obviously wrong on a low level, but I'm still struggling to
find it. :-/

Rainer


[OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-17 Thread Rainer Koenig
Hi,

I'm experiencing a strange problem when running LIGGGHTS on a 48-core
workstation running Ubuntu 14.04.4 LTS.

If I cold boot the workstation and start one of the examples from
LIGGGHTS then everything looks fine:

$ mpirun -np 48 liggghts < in.chute_wear

launches the example on all 48 cores, htop in a second window shows that
all cores are occupied and run at nearly 100% workload.

So far so good. Now I just reboot the workstation and do the exact same
steps as above.

This time the job just runs on a few cores (16 to 20) and the cores
don't even run at 100% load.
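
To put numbers on what htop shows, the per-core load can be computed from two samples of /proc/stat. The following Python sketch is an assumption-laden illustration rather than the measurement used here: it counts idle and iowait ticks as "not busy", sums only the user through steal columns (guest time is already included in user), and uses hypothetical function names.

```python
def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_ticks, total_ticks) from the text of /proc/stat."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only the per-core lines (cpu0, cpu1, ...), not the 'cpu' total.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        ticks = [int(x) for x in parts[1:9]]  # user .. steal
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        out[parts[0]] = (total - idle, total)
    return out

def utilization(before, after):
    """Return the busy fraction (0.0 to 1.0) per core between two samples."""
    util = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        util[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return util
```

Reading /proc/stat twice, a second or so apart, and passing both parsed samples to utilization() gives one value per core. After a cold boot all 48 values should be close to 1.0 while the job runs; after a warm boot the idle cores would show up with values near 0.0.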

So now I'm trying to find out what is wrong. Bad luck is that I can't
just ask the vendor of the workstation since I'm working for that vendor
and trying to solve this issue. :-)

I guess that something OpenMPI needs is initialized differently depending
on whether I do a cold boot or a warm boot. But how can I find out what
is wrong?

Already tried to look for differences in the Ubuntu boot logs, but there
is nothing different.

ompi_info --all, even in the parsable format, doesn't show any difference
between cold boot and warm boot.

Any ideas what could be wrong after the reboot that causes such a behaviour?

Thanks,
Rainer