Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-23 Thread Gilles Gouaillardet
Rainer,

what if you explicitly bind tasks to cores?

mpirun -bind-to core ...

note this is v1.8 syntax ...
v1.6 is now obsolete (Debian folks are working on upgrading it...)
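
Since the installation discussed in this thread is a 1.6 series build while the command above uses v1.8 syntax, here is a small sketch of how the two spellings differ. The helper name mpirun_bind_args and the (major, minor) version tuple are made up for illustration; the flags used are --bind-to core (v1.8 and later), --bind-to-core (v1.6 series) and --report-bindings.

```python
def mpirun_bind_args(ompi_version, nprocs, program):
    """Build an mpirun argv that binds each rank to one core.

    ompi_version is a (major, minor) tuple such as (1, 6) or (1, 8).
    This helper is hypothetical; it only assembles the command line.
    """
    argv = ["mpirun", "-np", str(nprocs)]
    if ompi_version >= (1, 8):
        argv += ["--bind-to", "core"]   # v1.8 and later syntax
    else:
        argv += ["--bind-to-core"]      # v1.6 series syntax
    argv += ["--report-bindings"]       # ask mpirun to print each rank's binding
    argv.append(program)
    return argv
```

The resulting list could be passed to subprocess.run(), with the input deck opened and handed over as stdin, since liggghts reads its input from standard input in the commands shown in this thread.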

out of curiosity, did you try another distro such as Red Hat, SUSE and
the like, and do you observe the same behavior there?

and btw, what does /proc/self/status say?
bound to cores? socket? no binding at all?
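
As a rough sketch of how that question could be answered programmatically: the Cpus_allowed_list line of a status file can be expanded into a set of CPU numbers and compared with the number of CPUs in the machine. The function names are illustrative only, and telling "core" from "socket" binding would additionally need topology information, which this sketch does not attempt.

```python
def expand_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(status_text):
    """Return the CPUs named on the Cpus_allowed_list line of a status file."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return expand_cpu_list(line.split(":", 1)[1])
    raise ValueError("no Cpus_allowed_list line found")


def describe_binding(status_text, total_cpus):
    """Give a coarse description of how restricted the process is."""
    cpus = allowed_cpus(status_text)
    if len(cpus) >= total_cpus:
        return "no binding"
    if len(cpus) == 1:
        return "bound to a single CPU"
    return "restricted to %d of %d CPUs" % (len(cpus), total_cpus)
```

For a live check, the text would come from reading /proc/self/status (or the status file of a running rank) and total_cpus from os.cpu_count().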

Cheers,

Gilles


Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-23 Thread Rainer Koenig
Gilles,

I managed to get snapshots of all the /proc//status entries for all
liggghts jobs, but Cpus_allowed is similar no matter whether the system
was cold or warm booted.

Then I looked around in /proc/ and found sched_debug.

This at least shows that the liggghts processes are not spread over all
cores. Some cores have just one of them, some have none and some have many.
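
A distribution like this can also be tallied without reading sched_debug by eye. The sketch below is an alternative approach, not what was used in this thread: it reads field 39 ("processor", the CPU a task last ran on) from each /proc/<pid>/stat and counts processes per CPU. The function names and the proc_root parameter are assumptions for illustration, and only the main thread of each process is looked at.

```python
import collections
import glob


def parse_stat(stat_text):
    """Return (comm, last_cpu) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so cut at the first '(' and the last ')'.
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    rest = stat_text[end + 1:].split()
    # rest[0] is field 3 (state), so field 39 (processor) is rest[36].
    return comm, int(rest[36])


def tasks_per_cpu(name, proc_root="/proc"):
    """Count processes called `name` by the CPU they last ran on."""
    counts = collections.Counter()
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path) as f:
                comm, cpu = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or unexpected format
        if comm == name:
            counts[cpu] += 1
    return counts
```

Calling tasks_per_cpu("liggghts") a few times while the job runs would show whether some CPUs carry several ranks while others carry none.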

I agree that the processes not being spread over all cores is a
consequence and not the root cause. This means I now need to find out how
the kernel scheduler decides which core a process should run on, and why
it can spread 48 tasks over 48 cores when I cold boot the machine but
can't when I warm boot it.

So I guess I have to take this issue to the Linux kernel mailing list.
Another thing that points towards the kernel is that yesterday I
installed a newer 4.4.0 kernel on the machine and the problem is still
there, but not as bad as on the 4.2 kernel.

I also tried mpirun -mca... but that didn't change anything.

Thanks for your input anyway. At least I now have a sched_debug
snapshot; maybe that will be helpful in the further investigation.

Regards
Rainer



-- 
Dipl.-Inf. (FH) Rainer Koenig
Project Manager Linux Clients
Dept. PDG WPS R SW OSE

Fujitsu Technology Solutions
Bürgermeister-Ullrich-Str. 100
86199 Augsburg
Germany

Telephone: +49-821-804-3321
Telefax:   +49-821-804-2131
Mail:  mailto:rainer.koe...@ts.fujitsu.com

Internet ts.fujtsu.com
Company Details  ts.fujitsu.com/imprint.html


Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-22 Thread Gilles Gouaillardet
Rainer,

a first step could be to gather /proc/pid/status for your 48 tasks.
then you can
grep Cpus_allowed_list
and see if you find something suspicious.
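
The gather-and-grep step above could be sketched as follows. The function names and the proc_root parameter are illustrative assumptions; the idea is simply to group PIDs by their Cpus_allowed_list value so that any rank with a different mask stands out.

```python
import collections
import glob
import os


def status_fields(status_text):
    """Parse the 'Key:<tab>value' lines of a status file into a dict."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


def allowed_lists(name, proc_root="/proc"):
    """Map each distinct Cpus_allowed_list value to the PIDs that have it."""
    groups = collections.defaultdict(list)
    for path in glob.glob(proc_root + "/[0-9]*/status"):
        try:
            with open(path) as f:
                fields = status_fields(f.read())
        except OSError:
            continue  # the process went away between listing and reading
        if fields.get("Name") == name:
            pid = int(os.path.basename(os.path.dirname(path)))
            groups[fields.get("Cpus_allowed_list", "?")].append(pid)
    return {key: sorted(pids) for key, pids in groups.items()}
```

If allowed_lists("liggghts") returns a single key covering all CPUs, affinity masks are not what restricts the tasks.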

if your processes are idling, then the scheduler might assign them to the
same core.
in this case, your processes not being spread is a consequence and not a
root cause.

just to make sure there are no strange side effects, could you
mpirun --mca btl sm,self ...

Cheers,

Gilles




Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-22 Thread Rainer Koenig
On 17.03.2016 at 10:40, Ralph Castain wrote:
> Just some thoughts offhand:
> 
> * what version of OMPI are you using?

dpkg -l openmpi-bin says 1.6.5-8 from Ubuntu 14.04.
> 
> * are you saying that after the warm reboot, all 48 procs are running on a 
> subset of cores?

Yes. After a cold boot all 48 processes are spread over all 48 cores
and all cores show up as almost 100% in the htop CPU meter.

After a warm boot, the 48 processes are just spread over a few cores and
the rest of the system is idling.

> * it sounds like some of the cores have been marked as “offline” for some 
> reason. Make sure you have hwloc installed on the machine, and run “lstopo” 
> and see if that is the case

I tried with lstopo, but the graphics that I got look almost identical.
The only visible difference is in the part of the topology for the
graphics adapter and the LAN cards. In the cold-boot output, the path to
the graphics adapter shows the numbers 4,0 twice above the lines and the
path to eth0 shows the numbers 0,2 twice above the lines. lstopo for the
warm boot looks identical, except that those small numbers are missing.

I also tried with hwloc-gather-topology and diff'd the 2 results. There
is nothing special to see. There are differences in /proc/stats/ and
/proc/cpuinfo, but nothing special, just other values.

Something is obviously wrong at a low level, but I'm still struggling to
find it. :-/

Rainer


Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-18 Thread Thomas Jahns

Hi,

On 03/17/2016 10:00 AM, Rainer Koenig wrote:

> I'm experiencing a strange problem with running LIGGGHTS on a 48-core
> workstation running Ubuntu 14.04.4 LTS.
>
> If I cold boot the workstation and start one of the examples from
> LIGGGHTS then everything looks fine:
>
> $ mpirun -np 48 liggghts < in.chute_wear
>
> launches the example on all 48 cores, htop in a second window shows that
> all cores are occupied and run at nearly 100% workload.


Does that machine really have 48 cores or 48 CPUs, i.e., assuming it's an
Intel machine, is Hyper-Threading active or not?
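
One way to check this is to compare the number of "processor" entries in /proc/cpuinfo with the number of distinct (physical id, core id) pairs. The sketch below assumes the x86 layout of /proc/cpuinfo, where those two fields are present; the function name is made up for illustration.

```python
def count_cores(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo contents."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    # The extra "" makes sure the last entry is flushed even without a
    # trailing blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
        elif not line.strip():
            # A blank line ends one processor entry.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
    return logical, len(cores)
```

If the first number is twice the second, each core exposes two hardware threads, so "48 CPUs" in htop would correspond to 24 physical cores.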



> So far so good. Now I just reboot the workstation and do the exact same
> steps as above.
>
> This time the job just runs on a few cores (16 to 20) and the cores
> don't even run at 100% load.
>
> So now I'm trying to find out what is wrong. Bad luck is that I can't
> just ask the vendor of the workstation since I'm working for that vendor
> and trying to solve this issue. :-)
>
> I guess that something that OpenMPI needs is initialized differently when
> I do a cold boot or a warm boot. But how can I find out what is wrong?


I might be wrong, but your mpirun command does not specify affinity, so
it's probably not something in OpenMPI but rather an effect of the way your
Linux scheduler works.



> I already tried to look for differences in the Ubuntu boot logs, but there
> is nothing different.


Did you look into /proc/cpuinfo?

Regards, Thomas
--
Thomas Jahns
HD(CP)^2
Abteilung Anwendungssoftware

Deutsches Klimarechenzentrum GmbH
Bundesstraße 45a • D-20146 Hamburg • Germany

Phone:  +49 40 460094-151
Fax:    +49 40 460094-270
Email:  Thomas Jahns
URL:    www.dkrz.de

Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784





Re: [OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-17 Thread Ralph Castain
Just some thoughts offhand:

* what version of OMPI are you using?

* are you saying that after the warm reboot, all 48 procs are running on a 
subset of cores?

* it sounds like some of the cores have been marked as “offline” for some 
reason. Make sure you have hwloc installed on the machine, and run “lstopo” and 
see if that is the case
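
Besides lstopo, the kernel's own view of online and offline CPUs can be read from sysfs. The following is a minimal sketch, assuming the usual /sys/devices/system/cpu/online and /sys/devices/system/cpu/offline files; the function names and the sysfs_cpu_dir parameter are illustrative. A non-empty offline list is not necessarily a fault on its own, so the interesting question is whether the report differs between a cold and a warm boot.

```python
import os


def read_cpu_file(kind, sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return the CPU list stored in <sysfs_cpu_dir>/<kind>, e.g. '0-47' or ''."""
    with open(os.path.join(sysfs_cpu_dir, kind)) as f:
        return f.read().strip()


def offline_report(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Summarise which CPUs the kernel reports as online and offline."""
    online = read_cpu_file("online", sysfs_cpu_dir)
    offline = read_cpu_file("offline", sysfs_cpu_dir)
    if offline:
        return "online: %s, OFFLINE: %s" % (online, offline)
    return "online: %s, none offline" % online
```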

Ralph




[OMPI users] Strange problem with mpirun and LIGGGHTS after reboot of machine

2016-03-17 Thread Rainer Koenig
Hi,

I'm experiencing a strange problem with running LIGGGHTS on a 48-core
workstation running Ubuntu 14.04.4 LTS.

If I cold boot the workstation and start one of the examples from
LIGGGHTS then everything looks fine:

$ mpirun -np 48 liggghts < in.chute_wear

launches the example on all 48 cores, htop in a second window shows that
all cores are occupied and run at nearly 100% workload.

So far so good. Now I just reboot the workstation and do the exact same
steps as above.

This time the job just runs on a few cores (16 to 20) and the cores
don't even run at 100% load.

So now I'm trying to find out what is wrong. Bad luck is that I can't
just ask the vendor of the workstation since I'm working for that vendor
and trying to solve this issue. :-)

I guess that something that OpenMPI needs is initialized differently when
I do a cold boot or a warm boot. But how can I find out what is wrong?

I already tried to look for differences in the Ubuntu boot logs, but there
is nothing different.

ompi_info --all, even in the parsable format, doesn't show any difference
between cold boot and warm boot.
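
A comparison of this kind can be scripted once the two outputs have been saved to files. The sketch below only assumes that the cold-boot and warm-boot outputs are available as text; the function name and the "cold-boot"/"warm-boot" labels are illustrative.

```python
import difflib


def snapshot_diff(cold_text, warm_text):
    """Return unified-diff lines between two captured text snapshots."""
    return list(difflib.unified_diff(
        cold_text.splitlines(),
        warm_text.splitlines(),
        fromfile="cold-boot",
        tofile="warm-boot",
        lineterm="",
    ))
```

An empty result means the two snapshots are identical; otherwise the returned lines can be printed one per line to see what changed.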

Any ideas what could be wrong after the reboot that causes such a behaviour?

Thanks,
Rainer