Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Smith, Barry F. via petsc-users



> On Nov 15, 2018, at 1:02 PM, Mark Adams  wrote:
> 
> There is a lot of load imbalance in VecMAXPY also. The partitioning could be 
> bad and, if not, it's the machine.


> 
> On Thu, Nov 15, 2018 at 1:56 PM Smith, Barry F. via petsc-users 
>  wrote:
> 
> Something is odd about your configuration. Just consider the time for 
> VecMAXPY which is an embarrassingly parallel operation. On 1000 MPI processes 
> it produces
> 
>                        Time                                                  flop rate
>  VecMAXPY             575 1.0 8.4132e-01 1.5 1.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,600,021
> 
> on 1500 processes it produces
> 
>  VecMAXPY             583 1.0 1.0786e+00 3.4 9.38e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,289,187
> 
> that is, it actually takes longer (the time goes from 0.84 seconds to 1.08 
> seconds and the flop rate drops from 1,600,021 down to 1,289,187). You would 
> never expect this kind of behavior,
> 
> and on 2000 processes it produces
> 
> VecMAXPY              583 1.0 7.1103e-01 2.7 7.03e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,955,563
> 
> so it speeds up again but not by very much. This is very mysterious and not 
> what you would expect.
> 
>    I'm inclined to believe something is out of whack on your computer. Are 
> you sure all nodes on the computer are equivalent? Same processors, same 
> clock speeds? What happens if you run the 1000-process case several times; do 
> you get very similar numbers for VecMAXPY()? You should, but I am guessing you 
> may not.
> 
> Barry
> 
>   Note that this performance issue doesn't really have anything to do with 
> the preconditioner you are using.
> 
> 
> 
> 
> 
> > On Nov 15, 2018, at 10:50 AM, Karin via petsc-users 
> >  wrote:
> > 
> > Dear PETSc team,
> > 
> > I am solving a linear transient dynamic problem, based on a discretization 
> > with finite elements. To do that, I am using FGMRES with GAMG as a 
> > preconditioner. I consider here 10 time steps. 
> > The problem has around 118e6 dof and I am running on 1000, 1500 and 2000 
> > procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> > I notice that the performance deteriorates when I increase the number of 
> > processes. 
> > You can find attached the log_view output of the execution and the 
> > detailed definition of the KSP.
> > 
> > Is the problem too small to run on that number of processes or is there 
> > something wrong with my use of GAMG?
> > 
> > I thank you in advance for your help,
> > Nicolas
> > 
> 



Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Mark Adams via petsc-users
There is a lot of load imbalance in VecMAXPY also. The partitioning could
be bad and, if not, it's the machine.

On Thu, Nov 15, 2018 at 1:56 PM Smith, Barry F. via petsc-users <
petsc-users@mcs.anl.gov> wrote:

>
> Something is odd about your configuration. Just consider the time for
> VecMAXPY which is an embarrassingly parallel operation. On 1000 MPI
> processes it produces
>
>                        Time                                                  flop rate
>  VecMAXPY             575 1.0 8.4132e-01 1.5 1.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,600,021
>
> on 1500 processes it produces
>
>  VecMAXPY             583 1.0 1.0786e+00 3.4 9.38e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,289,187
>
> that is, it actually takes longer (the time goes from 0.84 seconds to 1.08
> seconds and the flop rate drops from 1,600,021 down to 1,289,187). You would
> never expect this kind of behavior,
>
> and on 2000 processes it produces
>
> VecMAXPY              583 1.0 7.1103e-01 2.7 7.03e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,955,563
>
> so it speeds up again but not by very much. This is very mysterious and
> not what you would expect.
>
>    I'm inclined to believe something is out of whack on your computer. Are
> you sure all nodes on the computer are equivalent? Same processors, same
> clock speeds? What happens if you run the 1000-process case several times;
> do you get very similar numbers for VecMAXPY()? You should, but I am
> guessing you may not.
>
> Barry
>
>   Note that this performance issue doesn't really have anything to do with
> the preconditioner you are using.
>
>
>
>
>
> > On Nov 15, 2018, at 10:50 AM, Karin via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
> >
> > Dear PETSc team,
> >
> > I am solving a linear transient dynamic problem, based on a
> discretization with finite elements. To do that, I am using FGMRES with
> GAMG as a preconditioner. I consider here 10 time steps.
> > The problem has around 118e6 dof and I am running on 1000, 1500 and
> 2000 procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> > I notice that the performance deteriorates when I increase the number of
> processes.
> > You can find attached the log_view output of the execution and the
> detailed definition of the KSP.
> >
> > Is the problem too small to run on that number of processes or is there
> something wrong with my use of GAMG?
> >
> > I thank you in advance for your help,
> > Nicolas
> >
> 
>
>


Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Smith, Barry F. via petsc-users


Something is odd about your configuration. Just consider the time for 
VecMAXPY which is an embarrassingly parallel operation. On 1000 MPI processes 
it produces

                       Time                                                  flop rate
 VecMAXPY              575 1.0 8.4132e-01 1.5 1.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,600,021

on 1500 processes it produces

 VecMAXPY              583 1.0 1.0786e+00 3.4 9.38e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,289,187

that is, it actually takes longer (the time goes from 0.84 seconds to 1.08 
seconds and the flop rate drops from 1,600,021 down to 1,289,187). You would 
never expect this kind of behavior,

and on 2000 processes it produces

VecMAXPY               583 1.0 7.1103e-01 2.7 7.03e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0 1,955,563

so it speeds up again but not by very much. This is very mysterious and not 
what you would expect.

   I'm inclined to believe something is out of whack on your computer. Are you 
sure all nodes on the computer are equivalent? Same processors, same clock 
speeds? What happens if you run the 1000-process case several times; do you get 
very similar numbers for VecMAXPY()? You should, but I am guessing you may not.
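
For reference, a minimal petsc4py sketch of such a repeatability check (the vector size, the number of VecMAXPY operands, and the repetition count are all assumptions made here for illustration, not taken from the run above):

from petsc4py import PETSc
import time

comm = PETSc.COMM_WORLD
n_global = 10**7                          # assumed global vector size
y = PETSc.Vec().createMPI(n_global, comm=comm)
xs = [y.duplicate() for _ in range(4)]    # operands for VecMAXPY
for x in xs:
    x.set(1.0)
alphas = [0.5, 0.25, 0.125, 0.0625]

comm.barrier()                            # synchronize before timing
t0 = time.time()
for _ in range(100):                      # repeat the embarrassingly parallel kernel
    y.set(0.0)
    y.maxpy(alphas, xs)                   # y <- y + sum_i alphas[i]*xs[i]
comm.barrier()
if comm.getRank() == 0:
    print("100 VecMAXPY calls:", time.time() - t0, "seconds")

Running this several times on the same node allocation (e.g. mpiexec -n 1000 python3 maxpy_check.py) and comparing the timings is one way to see whether the hardware behaves consistently from run to run.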

Barry

  Note that this performance issue doesn't really have anything to do with the 
preconditioner you are using.





> On Nov 15, 2018, at 10:50 AM, Karin via petsc-users 
>  wrote:
> 
> Dear PETSc team,
> 
> I am solving a linear transient dynamic problem, based on a discretization 
> with finite elements. To do that, I am using FGMRES with GAMG as a 
> preconditioner. I consider here 10 time steps. 
> The problem has around 118e6 dof and I am running on 1000, 1500 and 2000 
> procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> I notice that the performance deteriorates when I increase the number of 
> processes. 
> You can find attached the log_view output of the execution and the detailed 
> definition of the KSP.
> 
> Is the problem too small to run on that number of processes or is there 
> something wrong with my use of GAMG?
> 
> I thank you in advance for your help,
> Nicolas
> 



Re: [petsc-users] On unknown ordering

2018-11-15 Thread Smith, Barry F. via petsc-users


> On Nov 15, 2018, at 4:48 AM, Appel, Thibaut via petsc-users 
>  wrote:
> 
> Good morning,
> 
> I would like to ask about the importance of the initial choice of ordering 
> the unknowns when feeding a matrix to PETSc. 
> 
> I have a regular grid, using high-order finite differences, and I simply 
> divide the rows of the matrix with PetscSplitOwnership using vertex-major, 
> natural ordering for the parallelism (not using DMDA).

So each process is getting a slice of the domain? To minimize communication 
it is best to use "square-ish" subdomains instead of slices; this is why the 
DMDA tries to use "square-ish" subdomains. I don't know the relationship 
between convergence rate and the shapes of the subdomains; it will depend on 
the operator and possibly "flow direction" etc. 
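
For comparison, a minimal petsc4py sketch of how a DMDA picks its own "square-ish" decomposition (the grid here is a made-up 2D 512x512 example, not the actual problem):

from petsc4py import PETSc

# Assumed 2D structured grid; the DMDA chooses the process grid automatically.
da = PETSc.DMDA().create(dim=2, sizes=(512, 512), stencil_width=1,
                         comm=PETSc.COMM_WORLD)
A = da.createMatrix()                        # matrix with the DMDA's parallel layout
(xs, xe), (ys, ye) = da.getRanges()          # locally owned index range per direction
PETSc.Sys.Print("process grid:", da.getProcSizes())
print("rank", PETSc.COMM_WORLD.getRank(), "owns x", (xs, xe), "y", (ys, ye))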
> 
> My understanding is that when using LU-MUMPS, this does not matter because 
> either serial or parallel analysis is performed and all the rows are 
> reordered ‘optimally’ before the LU factorization. Quality of reordering 
> might suffer from parallel analysis.
> 
> But if I use the default block Jacobi with ILU with one block per processor, 
> the initial ordering seems to have an influence because some tightly coupled 
> degrees of freedom might lie on different processes and the ILU becomes less 
> powerful. You can change the ordering on each block but this won’t 
> necessarily make things better.
> 
> Are my observations accurate? Is there a recommended ordering type for a 
> block Jacobi approach in my case? Could I expect natural improvements in 
> fill-in or better GMRES robustness opting for parallelism offered by DMDA?

 You might consider using -pc_type asm (additive Schwarz method) instead of 
block Jacobi. This "reintroduces" some of the tight coupling that is discarded 
when slicing up the domain for block Jacobi.
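
For example, a minimal sketch of that switch in petsc4py (the overlap value is illustrative; the same effect is obtained from the command line with -pc_type asm -sub_pc_type ilu):

from petsc4py import PETSc

ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
ksp.setType('gmres')
pc = ksp.getPC()
pc.setType('asm')            # additive Schwarz instead of the default block Jacobi
pc.setASMOverlap(1)          # illustrative amount of subdomain overlap
ksp.setFromOptions()         # still honor -pc_asm_overlap, -sub_pc_type, etc.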

   Barry

> 
> Thank you,
> 
> Thibaut



Re: [petsc-users] petsc4py help with parallel execution

2018-11-15 Thread Ivan via petsc-users

Matthew,

> As I wrote before, it's not impossible. You could be directly calling
> PMI, but I do not think you are doing that.

Could you clarify what PMI is and how we could be calling it directly? It might
be the key to this mystery!

> Why do you think it's running on 8 processes?

Well, we base our opinion on 3 points:
1) htop shows a total load on 8 processors
2) the system monitor shows the same behavior
3) time: 8 seconds vs 70 seconds, although we have very similar PC configs

> I think it's much more likely that there are differences in the solver
> (use -ksp_view to see exactly what solver was used) than to think it is
> parallelism.

We actually use identical code. Or do you think that, independently of this
fact, and despite the fact that we specify in the code
"ksp.getPC().setFactorSolverType('mumps')", KSP may solve the system of
equations using a different solver?

> Moreover, you would never ever ever see that much speedup on a laptop
> since all these computations are bandwidth limited.

I agree with this point. But I would think that, taking into account that his
computer is a bit more powerful and his code is executed in parallel, we might
see an acceleration. We have, for example, tested other, more physics-oriented
codes, and noted accelerations of x4 - x6.

Thank you for your contribution,

Ivan

On 15/11/2018 18:07, Matthew Knepley wrote:
On Thu, Nov 15, 2018 at 11:59 AM Ivan Voznyuk
<ivan.voznyuk.w...@gmail.com> wrote:


Hi Matthew,

Does it mean that by using just the command python3 simple_code.py
(without mpiexec) you _cannot_ obtain a parallel execution?


As I wrote before, it's not impossible. You could be directly calling 
PMI, but I do not think you are doing that.


It has been 5 days that my colleague and I have been trying to understand
how he managed to do so.
It means that by simply using python3 simple_code.py he gets 8
processors working.
By the way, we wrote a few lines in his code:
rank = PETSc.COMM_WORLD.Get_rank()
size = PETSc.COMM_WORLD.Get_size()
and we got rank = 0, size = 1


This is MPI telling you that you are only running on one process.

However, when the execution arrives at KSP.solve(), somehow it turns on
8 processors.


Why do you think it's running on 8 processes?

This problem is solved on his PC in 5-8 sec (in parallel, using
_python3 simple_code.py_); on mine it takes 70-90 secs (sequentially,
but with the same command _python3 simple_code.py_)


I think it's much more likely that there are differences in the solver 
(use -ksp_view to see exactly what solver was used) than to think it is 
parallelism. Moreover, you would never ever ever see that much speedup on a 
laptop since all these computations are bandwidth limited.

  Thanks,

     Matt

So, the conclusion is that on his computer this code works in the same
way as scipy: all the code is executed in sequential mode, but
when it comes to the solution of the system of linear equations, it runs
on all available processors. All this with just running python3
my_code.py (without any mpi-smth)

Is it an exception / abnormal behavior? I mean, is it something
irregular that you, developers, have never seen?

Thanks and have a good evening!
Ivan

P.S. I don't think I know the answer regarding Scipy...


On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk
<ivan.voznyuk.w...@gmail.com> wrote:

Hi Matthew,
Thanks for your reply!

Let me clarify what I mean by asking a few questions:

1. In order to obtain a parallel execution of
simple_code.py, do I need to go with mpiexec python3
simple_code.py, or can I just launch python3 simple_code.py?


mpiexec -n 2 python3 simple_code.py

2. This simple_code.py consists of 2 parts: a) preparation
of matrix b) solving the system of linear equations with
PETSc. If I launch mpirun (or mpiexec) -np 8 python3
simple_code.py, I suppose that I will basically obtain 8
matrices and 8 systems to solve. However, I need to
prepare only one matrix, but launch this code in parallel
on 8 processors.


When you create the Mat object, you give it a communicator
(here PETSC_COMM_WORLD). That allows us to distribute the
data. This is all covered extensively in the manual and the
online tutorials, as well as the example code.
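
For illustration, a minimal sketch of that idea (the 1D Laplacian here is made up; the point is only that a single Mat lives on PETSC_COMM_WORLD and each rank fills just the rows it owns):

from petsc4py import PETSc

n = 1000                                       # assumed global size
A = PETSc.Mat().createAIJ([n, n], nnz=3, comm=PETSc.COMM_WORLD)
rstart, rend = A.getOwnershipRange()           # contiguous block of rows owned locally
for i in range(rstart, rend):
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    A.setValue(i, i, 2.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()
print("rank", PETSc.COMM_WORLD.getRank(), "owns rows", rstart, "to", rend - 1)

Under mpiexec -n 8 this builds one distributed matrix, not 8 separate ones.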

In fact, here attached you will find a similar code
(scipy_code.py) with only one difference: the system of
linear equations is solved with scipy. So when I solve it,
I can clearly see that the solution is obtained in a
parallel way. However, I do not use the command mpirun (or
mpiexec). I just go with python3 scipy_code.py.

Re: [petsc-users] GAMG Parallel Performance

2018-11-15 Thread Matthew Knepley via petsc-users
On Thu, Nov 15, 2018 at 11:52 AM Karin via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Dear PETSc team,
>
> I am solving a linear transient dynamic problem, based on a discretization
> with finite elements. To do that, I am using FGMRES with GAMG as a
> preconditioner. I consider here 10 time steps.
> The problem has around 118e6 dof and I am running on 1000, 1500 and 2000
> procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
> I notice that the performance deteriorates when I increase the number of
> processes.
> You can find attached the log_view output of the execution and the
> detailed definition of the KSP.
>
> Is the problem too small to run on that number of processes or is there
> something wrong with my use of GAMG?
>

I am having a hard time understanding the data. Just to be clear, I
understand you to be running the exact same problem on 1000, 1500, and 2000
processes, so we are looking for strong speedup. The PCSetUp time actually sped up
a little, which is great, and it's still a small percentage (notice that
your whole solve is only half the runtime). Let's just look at a big time
component, MatMult:

P = 1000

MatMult 7342 1.0 4.4956e+01 1.4 4.09e+10 1.2 9.6e+07 4.3e+03 0.0e+00 23 53 81 86  0  23 53 81 86  0 859939


P = 2000

MatMult 7470 1.0 4.7611e+01 1.9 2.11e+10 1.2 2.0e+08 2.9e+03 0.0e+00 11 53 81 86  0  11 53 81 86  0 827107


So there was no speedup at all. It is doing 1/2 the flops per process, but
taking almost exactly the same time. This looks like your 2000 process run
is on exactly the same number of nodes as your 1000 process run, but you
just use more processes. Your 1000 process run was maxing out the bandwidth
of those nodes, and thus the 2000-process run is no faster. Is this true? Otherwise, I am
misunderstanding the run.
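
In numbers (a small check using the two MatMult lines quoted above):

t_1000, t_2000 = 4.4956e+01, 4.7611e+01    # max MatMult time (s) on 1000 / 2000 procs
f_1000, f_2000 = 4.09e+10, 2.11e+10        # max MatMult flops per process
print("time ratio 1000/2000:", t_1000 / t_2000)     # ~0.94: no speedup
print("flop per process ratio:", f_2000 / f_1000)   # ~0.52: half the work each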

  Thanks,

Matt


> I thank you in advance for your help,
> Nicolas
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-users] petsc4py help with parallel execution

2018-11-15 Thread Matthew Knepley via petsc-users
On Thu, Nov 15, 2018 at 11:59 AM Ivan Voznyuk 
wrote:

> Hi Matthew,
>
> Does it mean that by using just the command python3 simple_code.py (without
> mpiexec) you *cannot* obtain a parallel execution?
>

As I wrote before, it's not impossible. You could be directly calling PMI,
but I do not think you are doing that.


> It has been 5 days that my colleague and I have been trying to understand how
> he managed to do so.
> It means that by simply using python3 simple_code.py he gets 8 processors
> working.
> By the way, we wrote a few lines in his code:
> rank = PETSc.COMM_WORLD.Get_rank()
> size = PETSc.COMM_WORLD.Get_size()
> and we got rank = 0, size = 1
>

This is MPI telling you that you are only running on one process.
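
A self-contained way to check this is a tiny script (call it check_mpi.py; the name is arbitrary) that only prints the communicator size:

# check_mpi.py
from petsc4py import PETSc

comm = PETSc.COMM_WORLD
print("rank", comm.getRank(), "of", comm.getSize())
# python3 check_mpi.py              -> prints "rank 0 of 1" once
# mpiexec -n 8 python3 check_mpi.py -> prints "rank k of 8" once per rank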


> However, when the execution arrives at KSP.solve(), somehow it turns on 8
> processors.
>

Why do you think it's running on 8 processes?


> This problem is solved on his PC in 5-8 sec (in parallel, using *python3
> simple_code.py*); on mine it takes 70-90 secs (sequentially, but with
> the same command *python3 simple_code.py*)
>

I think it's much more likely that there are differences in the solver (use
-ksp_view to see exactly what solver was used) than to think it is
parallelism. Moreover, you would never ever ever see that
much speedup on a laptop since all these computations
are bandwidth limited.
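
For reference, a minimal self-contained sketch of that check (the tiny diagonal system is only a placeholder so the script runs anywhere; what matters is ksp.view(), or equivalently the -ksp_view option):

from petsc4py import PETSc

A = PETSc.Mat().createAIJ([8, 8], nnz=1, comm=PETSc.COMM_WORLD)
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    A.setValue(i, i, 2.0)                  # placeholder diagonal system
A.assemble()
x, b = A.getVecs()
b.set(1.0)

ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
ksp.setOperators(A)
ksp.setFromOptions()                       # honors -ksp_type, -pc_type, -ksp_view
ksp.solve(b, x)
ksp.view()                                 # prints the KSP/PC actually used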

  Thanks,

 Matt


> So, the conclusion is that on his computer this code works in the same way as
> scipy: all the code is executed in sequential mode, but when it comes to
> the solution of the system of linear equations, it runs on all available
> processors. All this with just running python3 my_code.py (without any
> mpi-smth)
>
> Is it an exception / abnormal behavior? I mean, is it something irregular
> that you, developers, have never seen?
>
> Thanks and have a good evening!
> Ivan
>
> P.S. I don't think I know the answer regarding Scipy...
>
>
> On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley  wrote:
>
>> On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk 
>> wrote:
>>
>>> Hi Matthew,
>>> Thanks for your reply!
>>>
>>> Let me clarify what I mean by asking a few questions:
>>>
>>> 1. In order to obtain a parallel execution of simple_code.py, do I need
>>> to go with mpiexec python3 simple_code.py, or can I just launch python3
>>> simple_code.py?
>>>
>>
>> mpiexec -n 2 python3 simple_code.py
>>
>>
>>> 2. This simple_code.py consists of 2 parts: a) preparation of matrix b)
>>> solving the system of linear equations with PETSc. If I launch mpirun (or
>>> mpiexec) -np 8 python3 simple_code.py, I suppose that I will basically
>>> obtain 8 matrices and 8 systems to solve. However, I need to prepare only
>>> one matrix, but launch this code in parallel on 8 processors.
>>>
>>
>> When you create the Mat object, you give it a communicator (here
>> PETSC_COMM_WORLD). That allows us to distribute the data. This is all
>> covered extensively in the manual and the online tutorials, as well as the
>> example code.
>>
>>
>>> In fact, here attached you will find a similar code (scipy_code.py) with
>>> only one difference: the system of linear equations is solved with scipy.
>>> So when I solve it, I can clearly see that the solution is obtained in a
>>> parallel way. However, I do not use the command mpirun (or mpiexec). I just
>>> go with python3 scipy_code.py.
>>>
>>
>> Why do you think it's running in parallel?
>>
>>   Thanks,
>>
>>  Matt
>>
>>
>>> In this case, the first part (creation of the sparse matrix) is not
>>> parallel, whereas the solution of the system is found in a parallel way.
>>> So my question is, do you think that it is possible to have the same
>>> behavior with PETSc? And what do I need for this?
>>>
>>> I am asking this because for my colleague it worked! It means that he
>>> launches the simple_code.py on his computer using the command python3
>>> simple_code.py (and not mpi-smth python3 simple_code.py) and he obtains a
>>> parallel execution of the same code.
>>>
>>> Thanks for your help!
>>> Ivan
>>>
>>>
>>> On Thu, Nov 15, 2018 at 11:54 AM Matthew Knepley 
>>> wrote:
>>>
 On Thu, Nov 15, 2018 at 4:53 AM Ivan Voznyuk via petsc-users <
 petsc-users@mcs.anl.gov> wrote:

> Dear PETSC community,
>
> I have a question regarding the parallel execution of petsc4py.
>
> I have a simple code (here attached simple_code.py) which solves a
> system of linear equations Ax=b using petsc4py. To execute it, I use the
> command python3 simple_code.py, which yields a sequential performance. With
> a colleague of mine, we launched this code on his computer, and this time the
> execution was in parallel. However, he used the same command python3
> simple_code.py (without mpirun or mpiexec).
>
> I am not sure what you mean. To run MPI programs in parallel, you need
 a launcher like mpiexec or mpirun. There are Python programs (like nemesis)
 that use the launcher API directly (called PMI), but that is not part of
 petsc4py.

   

Re: [petsc-users] petsc4py help with parallel execution

2018-11-15 Thread Ivan Voznyuk via petsc-users
Hi Matthew,

Does it mean that by using just the command python3 simple_code.py (without
mpiexec) you *cannot* obtain a parallel execution?
It has been 5 days that my colleague and I have been trying to understand how
he managed to do so.
It means that by simply using python3 simple_code.py he gets 8 processors
working.
By the way, we wrote a few lines in his code:
rank = PETSc.COMM_WORLD.Get_rank()
size = PETSc.COMM_WORLD.Get_size()
and we got rank = 0, size = 1
However, when the execution arrives at KSP.solve(), somehow it turns on 8
processors.
This problem is solved on his PC in 5-8 sec (in parallel, using *python3
simple_code.py*); on mine it takes 70-90 secs (sequentially, but with the
same command *python3 simple_code.py*)

So, the conclusion is that on his computer this code works in the same way as
scipy: all the code is executed in sequential mode, but when it comes to
the solution of the system of linear equations, it runs on all available
processors. All this with just running python3 my_code.py (without any
mpi-smth)

Is it an exception / abnormal behavior? I mean, is it something irregular
that you, developers, have never seen?

Thanks and have a good evening!
Ivan

P.S. I don't think I know the answer regarding Scipy...


On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley  wrote:

> On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk 
> wrote:
>
>> Hi Matthew,
>> Thanks for your reply!
>>
>> Let me clarify what I mean by asking a few questions:
>>
>> 1. In order to obtain a parallel execution of simple_code.py, do I need
>> to go with mpiexec python3 simple_code.py, or can I just launch python3
>> simple_code.py?
>>
>
> mpiexec -n 2 python3 simple_code.py
>
>
>> 2. This simple_code.py consists of 2 parts: a) preparation of matrix b)
>> solving the system of linear equations with PETSc. If I launch mpirun (or
>> mpiexec) -np 8 python3 simple_code.py, I suppose that I will basically
>> obtain 8 matrices and 8 systems to solve. However, I need to prepare only
>> one matrix, but launch this code in parallel on 8 processors.
>>
>
> When you create the Mat object, you give it a communicator (here
> PETSC_COMM_WORLD). That allows us to distribute the data. This is all
> covered extensively in the manual and the online tutorials, as well as the
> example code.
>
>
>> In fact, here attached you will find a similar code (scipy_code.py) with
>> only one difference: the system of linear equations is solved with scipy.
>> So when I solve it, I can clearly see that the solution is obtained in a
>> parallel way. However, I do not use the command mpirun (or mpiexec). I just
>> go with python3 scipy_code.py.
>>
>
> Why do you think it's running in parallel?
>
>   Thanks,
>
>  Matt
>
>
>> In this case, the first part (creation of the sparse matrix) is not
>> parallel, whereas the solution of the system is found in a parallel way.
>> So my question is, do you think that it is possible to have the same
>> behavior with PETSc? And what do I need for this?
>>
>> I am asking this because for my colleague it worked! It means that he
>> launches the simple_code.py on his computer using the command python3
>> simple_code.py (and not mpi-smth python3 simple_code.py) and he obtains a
>> parallel execution of the same code.
>>
>> Thanks for your help!
>> Ivan
>>
>>
>> On Thu, Nov 15, 2018 at 11:54 AM Matthew Knepley 
>> wrote:
>>
>>> On Thu, Nov 15, 2018 at 4:53 AM Ivan Voznyuk via petsc-users <
>>> petsc-users@mcs.anl.gov> wrote:
>>>
 Dear PETSC community,

 I have a question regarding the parallel execution of petsc4py.

 I have a simple code (here attached simple_code.py) which solves a
 system of linear equations Ax=b using petsc4py. To execute it, I use the
 command python3 simple_code.py, which yields a sequential performance. With
 a colleague of mine, we launched this code on his computer, and this time the
 execution was in parallel. However, he used the same command python3
 simple_code.py (without mpirun or mpiexec).

 I am not sure what you mean. To run MPI programs in parallel, you need
>>> a launcher like mpiexec or mpirun. There are Python programs (like nemesis)
>>> that use the launcher API directly (called PMI), but that is not part of
>>> petsc4py.
>>>
>>>   Thanks,
>>>
>>>  Matt
>>>
 My configuration: Ubuntu 16.04 x86_64, Intel Core i7, PETSc
 3.10.2, PETSC_ARCH=arch-linux2-c-debug, petsc4py 3.10.0 in virtualenv

 In order to parallelize it, I have already tried:
 - use 2 different PCs
 - use Ubuntu 16.04, 18.04
 - use different architectures (arch-linux2-c-debug, linux-gnu-c-debug,
 etc)
 - ofc use different configurations (my present config can be found in
 make.log that I attached here)
 - mpi from mpich, openmpi

 Nothing worked.

 Do you have any ideas?

 Thanks and have a good day,
 Ivan

 --
 Ivan VOZNYUK
 PhD in Computational Electromagnetics

>>>
>>>
>>> --
>>> What most 

[petsc-users] GAMG Parallel Performance

2018-11-15 Thread Karin via petsc-users
Dear PETSc team,

I am solving a linear transient dynamic problem, based on a discretization
with finite elements. To do that, I am using FGMRES with GAMG as a
preconditioner. I consider here 10 time steps.
The problem has around 118e6 dof and I am running on 1000, 1500 and 2000
procs. So I have something like 100e3, 78e3 and 50e3 dof/proc.
I notice that the performance deteriorates when I increase the number of
processes.
You can find attached the log_view output of the execution and the
detailed definition of the KSP.

Is the problem too small to run on that number of processes or is there
something wrong with my use of GAMG?

I thank you in advance for your help,
Nicolas
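
For reference, a minimal petsc4py sketch of the solver configuration described above (FGMRES with GAMG); the matrix A stands in for the application's assembled operator, which is not shown here:

from petsc4py import PETSc

def make_solver(A):
    # FGMRES outer Krylov method preconditioned with GAMG
    ksp = PETSc.KSP().create(A.getComm())
    ksp.setOperators(A)
    ksp.setType('fgmres')
    ksp.getPC().setType('gamg')
    ksp.setFromOptions()        # pick up -ksp_view, -log_view, -pc_gamg_* options
    return ksp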
 -- PETSc Performance Summary: --

 Unknown Name on a arch-linux2-c-opt-mpi-ml-hypre named eocn0117 with 1000 processors, by B07947 Thu Nov 15 16:14:46 2018
 Using Petsc Release Version 3.8.2, Nov, 09, 2017

                          Max        Max/Min    Avg        Total
 Time (sec):           1.661e+02   1.00034    1.661e+02
 Objects:              1.401e+03   1.00143    1.399e+03
 Flop:                 7.695e+10   1.13672    7.354e+10  7.354e+13
 Flop/sec:             4.633e+08   1.13672    4.428e+08  4.428e+11
 MPI Messages:         3.697e+05  12.46258    1.179e+05  1.179e+08
 MPI Message Lengths:  8.786e+08   3.98485    4.086e+03  4.817e+11
 MPI Reductions:       2.635e+03   1.0

 Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                           e.g., VecAXPY() for real vectors of length N --> 2N flop
                           and VecAXPY() for complex vectors of length N --> 8N flop

 Summary of Stages:  --- Time ---     --- Flop ---      --- Messages ---   -- Message Lengths --  -- Reductions --
                     Avg     %Total   Avg      %Total   counts   %Total    Avg        %Total      counts   %Total
  0:  Main Stage: 1.6608e+02 100.0%  7.3541e+13 100.0%  1.178e+08  99.9%  4.081e+03   99.9%  2.603e+03  98.8%


 See the 'Profiling' chapter of the users' manual for details on interpreting output.
 Phase summary info:
    Count: number of times phase was executed
    Time and Flop: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
    Mess: number of messages sent
    Avg. len: average message length (bytes)
    Reduct: number of global reductions
    Global: entire computation
    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
       %T - percent time in this phase         %F - percent flop in this phase
       %M - percent messages in this phase     %L - percent message lengths in this phase
       %R - percent reductions in this phase
    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)

 Event                Count      Time (sec)      Flop                              --- Global ---  --- Stage ---   Total
                     Max Ratio   Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s

 --- Event Stage 0: Main Stage

 MatMult             7342 1.0 4.4956e+01   1.4 4.09e+10 1.2 9.6e+07 4.3e+03 0.0e+00 23 53 81 86  0  23 53 81 86  0 859939
 MatMultAdd          1130 1.0 3.4048e+00   2.3 1.55e+09 1.1 8.4e+06 8.2e+02 0.0e+00  2  2  7  1  0   2  2  7  1  0 434274
 MatMultTranspose    1130 1.0 4.7555e+00   3.8 1.55e+09 1.1 8.4e+06 8.2e+02 0.0e+00  1  2  7  1  0   1  2  7  1  0 310924
 MatSolve             226 0.0 6.8927e-04   0.0 6.24e+04 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     90
 MatSOR              6835 1.0 3.6061e+01   1.4 2.85e+10 1.1 0.0e+00 0.0e+00 0.0e+00 20 37  0  0  0  20 37  0  0  0 760198
 MatLUFactorSym         1 1.0 1.0800e-03  90.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
 MatLUFactorNum         1 1.0 8.0395e-04 421.5 1.09e+03 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      1
 MatScale              15 1.0 1.7925e-02   1.8 9.12e+06 1.1 6.6e+04 1.1e+03 0.0e+00  0  0  0  0  0   0  0  0  0  0 485856
 MatResidual         1130 1.0 6.3576e+00   1.5 5.31e+09 1.2 1.5e+07 3.7e+03 0.0e+00  3  7 13 11  0   3  7 13 11  0 781728
 MatAssemblyBegin     112 1.0 9.9765e-01   3.0 0.00e+00 0.0 2.1e+05 7.8e+04 7.4e+01  0  0  0  3  3   0  0  0  3  3      0
 MatAssemblyEnd       112 1.0 6.8845e-01   1.1 0.00e+00 0.0 8.3e+05 3.4e+02 2.6e+02  0  0  1  0 10   0  0  1  0 10      0
 MatGetRow         582170 1.0 8.5022e-02   1.3 0.00e+00 
[petsc-users] On unknown ordering

2018-11-15 Thread Appel, Thibaut via petsc-users
Good morning,

I would like to ask about the importance of the initial choice of ordering the 
unknowns when feeding a matrix to PETSc. 

I have a regular grid, using high-order finite differences, and I simply divide 
the rows of the matrix with PetscSplitOwnership using vertex-major, natural 
ordering for the parallelism (not using DMDA).

My understanding is that when using LU-MUMPS, this does not matter because 
either serial or parallel analysis is performed and all the rows are reordered 
‘optimally’ before the LU factorization. Quality of reordering might suffer 
from parallel analysis.

But if I use the default block Jacobi with ILU with one block per processor, 
the initial ordering seems to have an influence because some tightly coupled 
degrees of freedom might lie on different processes and the ILU becomes less 
powerful. You can change the ordering on each block but this won’t necessarily 
make things better.

Are my observations accurate? Is there a recommended ordering type for a block 
Jacobi approach in my case? Could I expect natural improvements in fill-in or 
better GMRES robustness opting for parallelism offered by DMDA?

Thank you,

Thibaut


[petsc-users] petsc4py help with parallel execution

2018-11-15 Thread Ivan Voznyuk via petsc-users
Dear PETSC community,

I have a question regarding the parallel execution of petsc4py.

I have a simple code (here attached simple_code.py) which solves a system
of linear equations Ax=b using petsc4py. To execute it, I use the command
python3 simple_code.py, which yields a sequential performance. With a
colleague of mine, we launched this code on his computer, and this time the
execution was in parallel. However, he used the same command python3
simple_code.py (without mpirun or mpiexec).

My configuration: Ubuntu 16.04 x86_64, Intel Core i7, PETSc 3.10.2,
PETSC_ARCH=arch-linux2-c-debug, petsc4py 3.10.0 in virtualenv

In order to parallelize it, I have already tried:
- use 2 different PCs
- use Ubuntu 16.04, 18.04
- use different architectures (arch-linux2-c-debug, linux-gnu-c-debug, etc)
- ofc use different configurations (my present config can be found in
make.log that I attached here)
- mpi from mpich, openmpi

Nothing worked.

Do you have any ideas?

Thanks and have a good day,
Ivan

-- 
Ivan VOZNYUK
PhD in Computational Electromagnetics
from petsc4py import PETSc

import scipy.sparse as sps
import scipy.sparse.linalg as spslin

import numpy as np

import time

n = 4000


matrix = sps.random(n, n, format='csr') + 10 * sps.eye(n) + \
0.1j * sps.random(n, n, format='csr')

exact_solution = 0.2 * np.arange(n) + 0.01j * np.arange(n)

rhs = matrix * exact_solution

print("")
print(" Start PETSC so hope to be parallel")
t0 = time.time()
n_dofs = rhs.shape[0]

# Create the PETSc environment & fill the sparse matrix
pA = PETSc.Mat().createAIJ(size=matrix.shape, csr=(matrix.indptr,
		   matrix.indices,
		   matrix.data))
pA.assemblyBegin()
pA.assemblyEnd()

# create linear solver
ksp = PETSc.KSP()
ksp.create(PETSc.COMM_WORLD)

# use direct method
ksp.setType('preonly')
ksp.getPC().setType('lu')
ksp.getPC().setFactorSolverType('mumps')

x, b = pA.getVecs()
b.setValues(range(n_dofs), rhs)
b.assemble()  # assemble the vector after setValues, before using it in the solve

# and next solve
ksp.setOperators(pA)
ksp.setFromOptions()
ksp.solve(b, x)

x_sol = x.getArray()
print(" DONE for", time.time()-t0)

print("")
#print(" The error is", np.linalg.norm(res_scipy - res_petsc))

make[1]: Entering directory '/home/ivan/Work/01.Programms/petsc-3.10.0'
==
 
See documentation/faq.html and documentation/bugreporting.html
for help with installation problems.  Please send EVERYTHING
printed out below when reporting problems.  Please check the
mailing list archives and consider subscribing.
 
  http://www.mcs.anl.gov/petsc/miscellaneous/mailing-lists.html
 
==
Starting make run on mw1 at mer., 14 nov. 2018 18:45:52 +0100
Machine characteristics: Linux mw1 4.15.0-39-generic #42~16.04.1-Ubuntu SMP Wed Oct 24 17:09:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
-
Using PETSc directory: /home/ivan/Work/01.Programms/petsc-3.10.0
Using PETSc arch: linux-gnu-c-debug
-
PETSC_VERSION_RELEASE1
PETSC_VERSION_MAJOR  3
PETSC_VERSION_MINOR  10
PETSC_VERSION_SUBMINOR   0
PETSC_VERSION_PATCH  0
PETSC_VERSION_DATE   "Sep, 12, 2018"
PETSC_VERSION_GIT"v3.10"
PETSC_VERSION_DATE_GIT   "2018-09-12 08:05:28 -0500"
PETSC_VERSION_EQ(MAJOR,MINOR,SUBMINOR) \
PETSC_VERSION_ PETSC_VERSION_EQ
PETSC_VERSION_LT(MAJOR,MINOR,SUBMINOR)  \
PETSC_VERSION_LE(MAJOR,MINOR,SUBMINOR) \
PETSC_VERSION_GT(MAJOR,MINOR,SUBMINOR) \
PETSC_VERSION_GE(MAJOR,MINOR,SUBMINOR) \
-
Using configure Options: --with-cc=mpicc --with-fc=mpif90 --with-cxx=mpicxx --download-openmpi --download-scalapack --download-mumps --with-scalar-type=complex
Using configuration flags:
#define INCLUDED_PETSCCONF_H
#define HAVE_MATH_INFINITY 1
#define IS_COLORING_MAX 65535
#define STDC_HEADERS 1
#define MPIU_COLORING_VALUE MPI_UNSIGNED_SHORT
#define PETSC_HAVE_CXX 1
#define PETSC_HAVE_MATHLIB 1
#define PETSC_UINTPTR_T uintptr_t
#define PETSC_HAVE_PTHREAD 1
#define PETSC_DEPRECATED(why) __attribute((deprecated))
#define PETSC_REPLACE_DIR_SEPARATOR '\\'
#define PETSC_HAVE_SO_REUSEADDR 1
#define PETSC_HAVE_MPI 1
#define PETSC_PREFETCH_HINT_T2 _MM_HINT_T2
#define PETSC_PREFETCH_HINT_T0 _MM_HINT_T0
#define PETSC_PREFETCH_HINT_T1 _MM_HINT_T1
#define PETSC_ARCH "linux-gnu-c-debug"
#define PETSC_HAVE_FORTRAN 1
#define PETSC_DIR "/home/ivan/Work/01.Programms/petsc-3.10.0"
#define PETSC_LIB_DIR "/home/ivan/Work/01.Programms/petsc-3.10.0/linux-gnu-c-debug/lib"
#define PETSC_USE_SOCKET_VIEWER 1
#define PETSC_USE_ISATTY 1
#define PETSC_SLSUFFIX "so"
#define PETSC_FUNCTION_NAME_CXX __func__
#define PETSC_HAVE_MUMPS 1
#define PETSC_HAVE_ATOLL 1
#define PETSC_HAVE_ATTRIBUTEALIGNED 1
#define PETSC_HAVE_DOUBLE_ALIGN_MALLOC 1
#define PETSC_UNUSED __attribute((unused))
#define PETSC_ATTRIBUTEALIGNED(size) __attribute((aligned (size)))
#define PETSC_MPICC_SHOW "mpicc -I/home/ivan/Work/01.Programms/petsc-3.10.0/linux-gnu-c-debug/include -Wl,-rpath