Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Smith, Barry F. via petsc-users



> On Nov 15, 2018, at 1:02 PM, Mark Adams  wrote:
> 
> There is a lot of load imbalance in VecMAXPY also. The partitioning could be 
> bad and, if not, it's the machine.


> 
> On Thu, Nov 15, 2018 at 1:56 PM Smith, Barry F. via petsc-users 
>  wrote:
> 
> Something is odd about your configuration. Just consider the time for 
> VecMAXPY which is an embarrassingly parallel operation. On 1000 MPI processes 
> it produces
> 
>                              Time                                                      flop rate
>  VecMAXPY 575 1.0 8.4132e-01 1.5 1.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,600,021
> 
> on 1500 processes it produces
> 
>  VecMAXPY 583 1.0 1.0786e+00 3.4 9.38e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,289,187
> 
> That is, it actually takes longer (the time goes from 0.84 seconds to 1.08 
> seconds and the flop rate drops from 1,600,021 to 1,289,187). You would never 
> expect this kind of behavior.
> 
> and on 2000 processes it produces
> 
> VecMAXPY 583 1.0 7.1103e-01 2.7 7.03e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,955,563
> 
> so it speeds up again but not by very much. This is very mysterious and not 
> what you would expect.
> 
>    I'm inclined to believe something is out of whack on your computer. Are 
> you sure all nodes on the computer are equivalent? Same processors, same 
> clock speeds? If you run the 1000 process case several times, do 
> you get very similar numbers for VecMAXPY()? You should, but I am guessing you 
> may not.
> 
> Barry
> 
>   Note that this performance issue doesn't really have anything to do with 
> the preconditioner you are using.
> 
> 
> 
> 
> 
> > On Nov 15, 2018, at 10:50 AM, Karin via petsc-users 
> >  wrote:
> > 
> > Dear PETSc team,
> > 
> > I am solving a linear transient dynamic problem, based on a discretization 
> > with finite elements. To do that, I am using FGMRES with GAMG as a 
> > preconditioner. I consider here 10 time steps. 
> > The problem has around 118e6 dof and I am running on 1000, 1500 and 2000 
> > procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> > I notice that the performance deteriorates when I increase the number of 
> > processes. 
> > You can find attached the log_view output of the execution and the 
> > detailed definition of the KSP.
> > 
> > Is the problem too small to run on that number of processes or is there 
> > something wrong with my use of GAMG?
> > 
> > I thank you in advance for your help,
> > Nicolas
> > 
> 



Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Mark Adams via petsc-users
There is a lot of load imbalance in VecMAXPY also. The partitioning could
be bad and, if not, it's the machine.

On Thu, Nov 15, 2018 at 1:56 PM Smith, Barry F. via petsc-users <
petsc-users@mcs.anl.gov> wrote:

>
> Something is odd about your configuration. Just consider the time for
> VecMAXPY which is an embarrassingly parallel operation. On 1000 MPI
> processes it produces
>
>                              Time                                                      flop rate
>  VecMAXPY 575 1.0 8.4132e-01 1.5 1.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,600,021
>
> on 1500 processes it produces
>
>  VecMAXPY 583 1.0 1.0786e+00 3.4 9.38e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,289,187
>
> That is, it actually takes longer (the time goes from 0.84 seconds to 1.08
> seconds and the flop rate drops from 1,600,021 to 1,289,187). You would never
> expect this kind of behavior.
>
> and on 2000 processes it produces
>
> VecMAXPY 583 1.0 7.1103e-01 2.7 7.03e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,955,563
>
> so it speeds up again but not by very much. This is very mysterious and
> not what you would expect.
>
>    I'm inclined to believe something is out of whack on your computer. Are
> you sure all nodes on the computer are equivalent? Same processors, same
> clock speeds? If you run the 1000 process case several times,
> do you get very similar numbers for VecMAXPY()? You should, but I am
> guessing you may not.
>
> Barry
>
>   Note that this performance issue doesn't really have anything to do with
> the preconditioner you are using.
>
>
>
>
>
> > On Nov 15, 2018, at 10:50 AM, Karin via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
> >
> > Dear PETSc team,
> >
> > I am solving a linear transient dynamic problem, based on a
> discretization with finite elements. To do that, I am using FGMRES with
> GAMG as a preconditioner. I consider here 10 time steps.
> > The problem has around 118e6 dof and I am running on 1000, 1500 and
> 2000 procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> > I notice that the performance deteriorates when I increase the number of
> processes.
> > You can find attached the log_view output of the execution and the
> detailed definition of the KSP.
> >
> > Is the problem too small to run on that number of processes or is there
> something wrong with my use of GAMG?
> >
> > I thank you in advance for your help,
> > Nicolas
> >
> 
>
>


Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Smith, Barry F. via petsc-users


Something is odd about your configuration. Just consider the time for 
VecMAXPY which is an embarrassingly parallel operation. On 1000 MPI processes 
it produces

                             Time                                                      flop rate
 VecMAXPY 575 1.0 8.4132e-01 1.5 1.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,600,021

on 1500 processes it produces

 VecMAXPY 583 1.0 1.0786e+00 3.4 9.38e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,289,187

That is, it actually takes longer (the time goes from 0.84 seconds to 1.08 
seconds and the flop rate drops from 1,600,021 to 1,289,187). You would never 
expect this kind of behavior.

and on 2000 processes it produces

VecMAXPY 583 1.0 7.1103e-01 2.7 7.03e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,955,563

so it speeds up again but not by very much. This is very mysterious and not 
what you would expect.

   I'm inclined to believe something is out of whack on your computer. Are you 
sure all nodes on the computer are equivalent? Same processors, same clock 
speeds? If you run the 1000 process case several times, do you get 
very similar numbers for VecMAXPY()? You should, but I am guessing you may not.

Barry

  Note that this performance issue doesn't really have anything to do with the 
preconditioner you are using.
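
A minimal sketch of such a repeated-timing check is below. It assumes a PETSc 3.8-style build; the global size N roughly matches the 118e6 dof problem discussed in this thread (reduce it for a smaller test machine), and the number of vectors and repetitions are arbitrary choices. Running it several times with -log_view on the same ranks and comparing the VecMAXPY time and max/min ratio is the kind of reproducibility test suggested above.

/* vmaxpy_check.c: repeatable VecMAXPY timing check (sketch only).
   Build against PETSc, then run several times, e.g.
     mpiexec -n 1000 ./vmaxpy_check -log_view
   and compare the VecMAXPY lines across runs. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            y, *x;
  PetscScalar    alpha[30];
  PetscInt       i, nv = 30, N = 118000000;  /* roughly the reported problem size */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, N, &y);CHKERRQ(ierr);
  ierr = VecSet(y, 1.0);CHKERRQ(ierr);
  ierr = VecDuplicateVecs(y, nv, &x);CHKERRQ(ierr);
  for (i = 0; i < nv; i++) {
    alpha[i] = 1.0/(i + 1);
    ierr = VecSet(x[i], (PetscScalar)(i + 1));CHKERRQ(ierr);
  }
  /* y <- y + sum_i alpha[i]*x[i]: purely local, no communication, so repeated
     runs on uniform, otherwise idle nodes should give nearly identical times
     and a max/min ratio close to 1. */
  for (i = 0; i < 20; i++) {
    ierr = VecMAXPY(y, nv, alpha, x);CHKERRQ(ierr);
  }
  ierr = VecDestroyVecs(nv, &x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}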





> On Nov 15, 2018, at 10:50 AM, Karin via petsc-users 
>  wrote:
> 
> Dear PETSc team,
> 
> I am solving a linear transient dynamic problem, based on a discretization 
> with finite elements. To do that, I am using FGMRES with GAMG as a 
> preconditioner. I consider here 10 time steps. 
> The problem has around 118e6 dof and I am running on 1000, 1500 and 2000 
> procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> I notice that the performance deteriorates when I increase the number of 
> processes. 
> You can find attached the log_view output of the execution and the detailed 
> definition of the KSP.
> 
> Is the problem too small to run on that number of processes or is there 
> something wrong with my use of GAMG?
> 
> I thank you in advance for your help,
> Nicolas
> 



Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Matthew Knepley via petsc-users
On Thu, Nov 15, 2018 at 11:52 AM Karin via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Dear PETSc team,
>
> I am solving a linear transient dynamic problem, based on a discretization
> with finite elements. To do that, I am using FGMRES with GAMG as a
> preconditioner. I consider here 10 time steps.
> The problem has around 118e6 dof and I am running on 1000, 1500 and 2000
> procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> I notice that the performance deteriorates when I increase the number of
> processes.
> You can find attached the log_view output of the execution and the
> detailed definition of the KSP.
>
> Is the problem too small to run on that number of processes or is there
> something wrong with my use of GAMG?
>

I am having a hard time understanding the data. Just to be clear, I
understand you to be running the exact same problem on 1000, 1500, and 2000
processes, so you are looking for strong speedup. The PCSetUp time actually
sped up a little, which is great, and it's still a small percentage (notice
that your whole solve is only half the runtime). Let's just look at a big
time component, MatMult:

P = 1000

MatMult 7342 1.0 4.4956e+01 1.4 4.09e+10 1.2 9.6e+07 4.3e+03 0.0e+00 23 53 81 86  0  23 53 81 86  0 859939


P = 2000

MatMult 7470 1.0 4.7611e+01 1.9 2.11e+10 1.2 2.0e+08 2.9e+03 0.0e+00 11 53 81 86  0  11 53 81 86  0 827107


So there was no speedup at all. It is doing half the flops per process, but
taking almost exactly the same time. This looks like your 2000 process run
is on exactly the same number of nodes as your 1000 process run, but you
just use more processes per node. Your 1000 process run was already maxing
out the memory bandwidth of those nodes, and thus the 2000 process run is no
faster. Is this true? Otherwise, I am misunderstanding the run.
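
A rough way to test that hypothesis is a STREAM-triad style bandwidth check run with the same ranks-per-node layout as the application. The sketch below is only illustrative (it is not PETSc's bundled streams benchmark, which is typically run with "make streams" from the PETSc source tree), and the array length is an arbitrary choice: if doubling the ranks per node barely increases the reported aggregate GB/s, the nodes are already bandwidth saturated, which would explain the flat MatMult times.

/* triad_check.c: rough aggregate memory-bandwidth check (sketch only). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  const long n = 20000000;                 /* three 160 MB arrays per rank */
  double *a, *b, *c, scalar = 3.0, t0, t1, gbps, sum_gbps, chk, chk_sum;
  int rank, size;
  long i;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  a = (double*)malloc(n*sizeof(double));
  b = (double*)malloc(n*sizeof(double));
  c = (double*)malloc(n*sizeof(double));
  for (i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < n; i++) a[i] = b[i] + scalar*c[i];   /* STREAM triad kernel */
  t1 = MPI_Wtime();

  gbps = 3.0*n*sizeof(double)/(t1 - t0)/1.0e9;         /* read b, read c, write a */
  chk  = a[0] + a[n-1];                                /* keep the stores live */
  MPI_Reduce(&gbps, &sum_gbps, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  MPI_Reduce(&chk,  &chk_sum,  1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (!rank) printf("ranks %d  approx aggregate triad bandwidth %.1f GB/s (chk %g)\n",
                    size, sum_gbps, chk_sum);

  free(a); free(b); free(c);
  MPI_Finalize();
  return 0;
}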

  Thanks,

Matt


> I thank you in advance for your help,
> Nicolas
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


[petsc-users] GAMG Parallel Performance

2018-11-15 Thread Karin via petsc-users
Dear PETSc team,

I am solving a linear transient dynamic problem, based on a discretization
with finite elements. To do that, I am using FGMRES with GAMG as a
preconditioner. I consider here 10 time steps.
The problem has around 118e6 dof and I am running on 1000, 1500 and 2000
procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
I notice that the performance deteriorates when I increase the number of
processes.
You can find attached the log_view output of the execution and the
detailed definition of the KSP.

Is the problem too small to run on that number of processes or is there
something wrong with my use of GAMG?

I thank you in advance for your help,
Nicolas
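
For reference, the sketch below shows a minimal FGMRES + GAMG setup of the kind described in this message. The 1D Laplacian stand-in operator, the small problem size n, and the absence of a time-stepping loop are simplifications; it does not reproduce the actual finite element problem or the KSP options recorded in the attached log.

/* gamg_sketch.c: minimal FGMRES + GAMG solve (illustrative sketch only). */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PC             pc;
  PetscInt       i, istart, iend, n = 1000;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

  /* Stand-in operator: distributed tridiagonal (1D Laplacian) matrix */
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 3, NULL, 1, NULL, &A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &istart, &iend);CHKERRQ(ierr);
  for (i = istart; i < iend; i++) {
    if (i > 0)   { ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    if (i < n-1) { ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  /* FGMRES outer Krylov method with GAMG (algebraic multigrid) preconditioner */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPFGMRES);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCGAMG);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);   /* honors -ksp_view, -log_view, -pc_gamg_* options */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}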
 -- PETSc Performance Summary: --
 
 Unknown Name on a arch-linux2-c-opt-mpi-ml-hypre named eocn0117 with 1000 processors, by B07947 Thu Nov 15 16:14:46 2018
 Using Petsc Release Version 3.8.2, Nov, 09, 2017 
 
                          Max       Max/Min     Avg       Total
 Time (sec):           1.661e+02   1.00034   1.661e+02
 Objects:              1.401e+03   1.00143   1.399e+03
 Flop:                 7.695e+10   1.13672   7.354e+10  7.354e+13
 Flop/sec:             4.633e+08   1.13672   4.428e+08  4.428e+11
 MPI Messages:         3.697e+05  12.46258   1.179e+05  1.179e+08
 MPI Message Lengths:  8.786e+08   3.98485   4.086e+03  4.817e+11
 MPI Reductions:       2.635e+03   1.0
 
 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                           e.g., VecAXPY() for real vectors of length N --> 2N flop
                           and VecAXPY() for complex vectors of length N --> 8N flop
 
 Summary of Stages:   - Time --  - Flop -  --- Messages ---  -- Message Lengths --  -- Reductions --
                         Avg  %Total      Avg  %Total    counts   %Total      Avg      %Total    counts   %Total
  0:  Main Stage: 1.6608e+02 100.0%  7.3541e+13 100.0%  1.178e+08  99.9%  4.081e+03   99.9%  2.603e+03  98.8%
 
 

 See the 'Profiling' chapter of the users' manual for details on interpreting output.
 Phase summary info:
    Count: number of times phase was executed
    Time and Flop: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
    Mess: number of messages sent
    Avg. len: average message length (bytes)
    Reduct: number of global reductions
    Global: entire computation
    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
       %T - percent time in this phase          %F - percent flop in this phase
       %M - percent messages in this phase      %L - percent message lengths in this phase
       %R - percent reductions in this phase
    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
 

 Event                Count      Time (sec)      Flop                              --- Global ---  --- Stage ---   Total
                        Max Ratio  Max    Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
 

 
 --- Event Stage 0: Main Stage
 
 MatMult             7342 1.0 4.4956e+01 1.4 4.09e+10 1.2 9.6e+07 4.3e+03 0.0e+00 23 53 81 86  0  23 53 81 86  0   859939
 MatMultAdd          1130 1.0 3.4048e+00 2.3 1.55e+09 1.1 8.4e+06 8.2e+02 0.0e+00  2  2  7  1  0   2  2  7  1  0   434274
 MatMultTranspose    1130 1.0 4.7555e+00 3.8 1.55e+09 1.1 8.4e+06 8.2e+02 0.0e+00  1  2  7  1  0   1  2  7  1  0   310924
 MatSolve             226 0.0 6.8927e-04 0.0 6.24e+04 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0       90
 MatSOR              6835 1.0 3.6061e+01 1.4 2.85e+10 1.1 0.0e+00 0.0e+00 0.0e+00 20 37  0  0  0  20 37  0  0  0   760198
 MatLUFactorSym         1 1.0 1.0800e-03 90.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0        0
 MatLUFactorNum         1 1.0 8.0395e-04 421.5 1.09e+03 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0        1
 MatScale              15 1.0 1.7925e-02 1.8 9.12e+06 1.1 6.6e+04 1.1e+03 0.0e+00  0  0  0  0  0   0  0  0  0  0   485856
 MatResidual         1130 1.0 6.3576e+00 1.5 5.31e+09 1.2 1.5e+07 3.7e+03 0.0e+00  3  7 13 11  0   3  7 13 11  0   781728
 MatAssemblyBegin     112 1.0 9.9765e-01 3.0 0.00e+00 0.0 2.1e+05 7.8e+04 7.4e+01  0  0  0  3  3   0  0  0  3  3        0
 MatAssemblyEnd       112 1.0 6.8845e-01 1.1 0.00e+00 0.0 8.3e+05 3.4e+02 2.6e+02  0  0  1  0 10   0  0  1  0 10        0
 MatGetRow 582170 1.0 8.5022e-02 1.3 0.00e+00