Re: [petsc-users] Performance of the Telescope Multigrid Preconditioner

2016-10-07 Thread Dave May
On Friday, 7 October 2016, frank  wrote:

> Dear all,
>
> Thank you so much for the advice.
>
> All setup is done in the first solve.
>
>
>> ** The time for 1st solve does not scale.
>> In practice, I am solving a variable coefficient  Poisson equation. I
>> need to build the matrix every time step. Therefore, each step is similar
>> to the 1st solve which does not scale. Is there a way I can improve the
>> performance?
>>
>
>> You could use rediscretization instead of Galerkin to produce the coarse
>> operators.
>>
>
> Yes I can think of one option for improved performance, but I cannot tell
> whether it will be beneficial because the logging isn't sufficiently fine
> grained (and there is no easy way to get the info out of petsc).
>
> I use PtAP to repartition the matrix, this could be consuming most of the
> setup time in Telescope with your run. Such a repartitioning could be avoided
> if you provided a method to create the operator on the coarse levels (what
> Matt is suggesting). However, this requires you to be able to define your
> coefficients on the coarse grid. This will most likely reduce setup time,
> but your coarse grid operators (now re-discretized) are likely to be less
> effective than those generated via Galerkin coarsening.
>
>
> Please correct me if I understand this incorrectly:   I can define my own
> restriction function and pass it to petsc instead of using PtAP.
> If so, what's the interface to do that?
>

You need to provide a method to KSPSetComputeOperators() for your
outer KSP:

http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPSetComputeOperators.html

This method will get propagated through telescope to the KSP running in the
sub-comm.

Note that this functionality is currently not supported for Fortran. I need
to make a small modification to telescope to enable Fortran support.
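
For reference, here is a minimal C sketch of that pattern (C rather than
Fortran, given the note above; the ComputeMatrix name and the
constant-coefficient 7-point stencil are placeholders for the real
variable-coefficient operator, not Frank's actual code). The callback receives
the KSP, and hence the DMDA, of whichever level/communicator invokes it, so the
same routine rediscretizes the operator on the coarse grids instead of forming
them with Galerkin PtAP:

  #include <petscdm.h>
  #include <petscdmda.h>
  #include <petscksp.h>

  /* Assemble the operator on whatever grid the given KSP is attached to. */
  static PetscErrorCode ComputeMatrix(KSP ksp, Mat J, Mat B, void *ctx)
  {
    DM            da;
    DMDALocalInfo info;
    PetscInt      i, j, k;

    PetscFunctionBeginUser;
    KSPGetDM(ksp, &da);              /* the DMDA describing THIS level's grid */
    DMDAGetLocalInfo(da, &info);
    for (k = info.zs; k < info.zs + info.zm; k++) {
      for (j = info.ys; j < info.ys + info.ym; j++) {
        for (i = info.xs; i < info.xs + info.xm; i++) {
          MatStencil  row = {0}, col[7];
          PetscScalar v[7];
          PetscInt    n = 0;
          row.i = i; row.j = j; row.k = k;
          if (i == 0 || j == 0 || k == 0 ||
              i == info.mx - 1 || j == info.my - 1 || k == info.mz - 1) {
            v[n] = 1.0; col[n] = row; n++;          /* Dirichlet boundary row */
          } else {
            /* evaluate the coefficient here at this level's resolution */
            v[n] =  6.0; col[n] = row;                   n++;
            v[n] = -1.0; col[n] = row; col[n].i = i - 1; n++;
            v[n] = -1.0; col[n] = row; col[n].i = i + 1; n++;
            v[n] = -1.0; col[n] = row; col[n].j = j - 1; n++;
            v[n] = -1.0; col[n] = row; col[n].j = j + 1; n++;
            v[n] = -1.0; col[n] = row; col[n].k = k - 1; n++;
            v[n] = -1.0; col[n] = row; col[n].k = k + 1; n++;
          }
          MatSetValuesStencil(B, 1, &row, n, col, v, INSERT_VALUES);
        }
      }
    }
    MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);
    PetscFunctionReturn(0);
  }

  /* In main(), with the 3D DMDA already created (cf. ksp tutorial ex45):
       KSPSetDM(ksp, da);
       KSPSetComputeOperators(ksp, ComputeMatrix, NULL);                      */
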

Thanks
  Dave


>
>
>
> Also, you use CG/MG when FMG by itself would probably be faster. Your
>> smoother is likely not strong enough, and you
>> should use something like V(2,2). There is a lot of tuning that is
>> possible, but difficult to automate.
>>
>
> Matt's completely correct.
> If we could automate this in a meaningful manner, we would have done so.
>
>
> I am not as familiar with multigrid as you guys. It would be very kind if
> you could be more specific.
> What does V(2,2) stand for? Is there some strong smoother built into petsc
> that I can try?
>
>
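
For reference, V(2,2) denotes a multigrid V-cycle with two pre- and two
post-smoothing sweeps. A hedged sketch of petsc options corresponding roughly
to Matt's suggestion above, assuming -pc_type mg is the outer preconditioner
(a starting point for experimentation, not a tested configuration):

  -ksp_type richardson
  -pc_type mg -pc_mg_type full
  -mg_levels_ksp_type chebyshev -mg_levels_pc_type sor
  -mg_levels_ksp_max_it 2

Here -pc_mg_type full requests FMG rather than CG-accelerated V-cycles, and
-mg_levels_ksp_max_it 2 gives two smoothing sweeps on each level.
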
> Another thing, the vector assembly and scatter take more time as I
> increased the cores#:
>
>  cores#                          4096       8192       16384      32768      65536
> VecAssemblyBegin (298 calls)     2.91E+00   2.87E+00   8.59E+00   2.75E+01   2.21E+03
> VecAssemblyEnd   (298 calls)     3.37E-03   1.78E-03   1.78E-03   5.13E-03   1.99E-03
> VecScatterBegin  (76303 calls)   3.82E+00   3.01E+00   2.54E+00   4.40E+00   1.32E+00
> VecScatterEnd    (76303 calls)   3.09E+01   1.47E+01   2.23E+01   2.96E+01   2.10E+01
>
> The above data is produced by solving a constant-coefficient Poisson
> equation with a different rhs for 100 steps.
> As you can see, the time of VecAssemblyBegin increases dramatically from
> 32K cores to 65K.
> With 65K cores, it took more time to assemble the rhs than to solve the
> equation.   Is there a way to improve this?
>
>
> Thank you.
>
> Regards,
> Frank
>
>
>>>
>>>
>>>
>>>
>>> On 10/04/2016 12:56 PM, Dave May wrote:
>>>
>>>
>>>
>>> On Tuesday, 4 October 2016, frank wrote:
>>>
 Hi,
 This question is follow-up of the thread "Question about memory usage
 in Multigrid preconditioner".
 I used to have the "Out of Memory(OOM)" problem when using the
 CG+Telescope MG solver with 32768 cores. Adding the "-matrap 0;
 -matptap_scalable" option did solve that problem.

 Then I test the scalability by solving a 3d poisson eqn for 1 step. I
 used one sub-communicator in all the tests. The difference between the
 petsc options in those tests are: 1 the pc_telescope_reduction_factor; 2
 the number of multigrid levels in the up/down solver. The function
 "ksp_solve" is timed. It is kind of slow and doesn't scale at all.

 Test1: 512^3 grid points
 Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
 512      8                             4 / 3                            6.2466
 4096     64                            5 / 3                            0.9361
 32768    64                            4 / 3                            4.8914

 Test2: 1024^3 grid points
 Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for 

Re: [petsc-users] How to Get the last absolute residual that has been computed

2016-10-07 Thread Jed Brown
丁老师  writes:

> Dear professor:
> How to Get the last absolute residual that has been computed

SNESGetFunctionNorm?




Re: [petsc-users] Time cost by Vec Assembly

2016-10-07 Thread Jed Brown
Barry Smith  writes:
> There is still something wonky here, whether it is the MPI implementation 
> or how PETSc handles the assembly. Without any values that need to be 
> communicated, it is unacceptable that these calls take so long. If we 
> understood __exactly__ why the performance suddenly drops so dramatically we 
> could perhaps fix it. I do not understand why.

I guess it's worth timing.  If they don't have MPI_Reduce_scatter_block
then it falls back to a big MPI_Allreduce.  After that, it's all
point-to-point messaging that shouldn't suck and there actually
shouldn't be anything to send or receive anyway.  The BTS implementation
should be much smarter and literally reduces to a barrier in this case.
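
As a generic illustration of that counting step (a sketch, not PETSc's actual
source; the HAVE_REDUCE_SCATTER_BLOCK guard stands in for whatever the build
system detects): each rank flags the ranks it intends to send to, and the
reduction tells every rank how many messages to expect.

  #include <mpi.h>
  #include <stdlib.h>

  /* Given a 0/1 flag per destination rank, return how many messages this
     rank should expect to receive. */
  static int CountIncoming(MPI_Comm comm, const int *sendto /* length = size */)
  {
    int size, rank, nrecv = 0;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
  #if defined(HAVE_REDUCE_SCATTER_BLOCK)   /* hypothetical configure guard */
    /* each rank receives the sum of "its" column of the distributed flags */
    MPI_Reduce_scatter_block((void *)sendto, &nrecv, 1, MPI_INT, MPI_SUM, comm);
  #else
    /* fallback: a big allreduce of the whole length-size array, then pick out
       our own entry -- the path that can become expensive at 65536 ranks even
       when there is nothing to send */
    int *counts = (int *)malloc((size_t)size * sizeof(int));
    MPI_Allreduce((void *)sendto, counts, size, MPI_INT, MPI_SUM, comm);
    nrecv = counts[rank];
    free(counts);
  #endif
    return nrecv;
  }
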




Re: [petsc-users] How to Get the last absolute residual that has been computed

2016-10-07 Thread Barry Smith

  KSPGetResidualNorm(). If you wish the true (and not preconditioned) residual 
norm, you must call KSPSetNormType() before the KSPSolve().

  SNESGetFunctionNorm()

> On Oct 7, 2016, at 10:41 PM, 丁老师  wrote:
> 
> Dear professor:
> How to Get the last absolute residual that has been computed



Re: [petsc-users] Time cost by Vec Assembly

2016-10-07 Thread Barry Smith

> On Oct 7, 2016, at 10:44 PM, Jed Brown  wrote:
> 
> Barry Smith  writes:
>>VecAssemblyBegin/End() does a couple of allreduces and then message 
>> passing (if values need to be moved) to get the values onto the correct 
>> processes. So these calls should take very little time. Something is wonky 
>> on your system, with that many MPI processes, in these calls. I don't know 
>> why; if you look at the code you'll see it is pretty straightforward.
> 
> Those MPI calls can be pretty sucky on some networks.  Dave encountered
> this years ago when they were using VecSetValues/VecAssembly rather
> heavily.  I think that most performance-aware PETSc applications
> typically never tried to use VecSetValues/VecAssembly or they did not
> need to do it very often (e.g., as part of a matrix-free solver).  The
> BTS implementation fixes the performance issue, but I'm still working on
> solving the corner case that has been reported.  Fortunately, the
> VecAssembly is totally superfluous to this user.

   Jed,

There is still something wonky here, whether it is the MPI implementation 
or how PETSc handles the assembly. Without any values that need to be 
communicated, it is unacceptable that these calls take so long. If we understood 
__exactly__ why the performance suddenly drops so dramatically we could perhaps 
fix it. I do not understand why.


  Barry



[petsc-users] How to Get the last absolute residual that has been computed

2016-10-07 Thread 丁老师
Dear professor:
How to Get the last absolute residual that has been computed

Re: [petsc-users] Time cost by Vec Assembly

2016-10-07 Thread Barry Smith

> On Oct 7, 2016, at 6:41 PM, frank  wrote:
> 
> Hello,
>   
>>> Another thing, the vector assembly and scatter take more time as I 
>>> increased the cores#:
>>> 
>>>  cores#                          4096       8192       16384      32768      65536
>>> VecAssemblyBegin (298 calls)     2.91E+00   2.87E+00   8.59E+00   2.75E+01   2.21E+03
>>> VecAssemblyEnd   (298 calls)     3.37E-03   1.78E-03   1.78E-03   5.13E-03   1.99E-03
>>> VecScatterBegin  (76303 calls)   3.82E+00   3.01E+00   2.54E+00   4.40E+00   1.32E+00
>>> VecScatterEnd    (76303 calls)   3.09E+01   1.47E+01   2.23E+01   2.96E+01   2.10E+01
>>> 
>>> The above data is produced by solving a constant-coefficient Poisson 
>>> equation with a different rhs for 100 steps. 
>>> As you can see, the time of VecAssemblyBegin increases dramatically from 32K 
>>> cores to 65K.
>>> 
>>Something is very very wrong here. It is likely not the 
>> VecAssemblyBegin() itself that is taking the huge amount of time. 
>> VecAssemblyBegin() is a barrier; that is, all processes have to reach it 
>> before any process can continue beyond it. Something in the code on some 
>> processes is taking a huge amount of time before reaching that point. 
>> Perhaps it is in starting up all the processes? Or are you generating the 
>> entire rhs on one process? You can't do that.
>> 
>>Barry
>> 
> (I create a new subject since this is a separate problem from my previous  
> question.)
> 
> Each process computes its part of the rhs. 
> The above results are from 100 steps' computation. It is not a starting-up 
> issue.
> 
> I also have the results  from a simple code to show this problem:
> 
> cores#                        4096       8192       16384      32768      65536
> VecAssemblyBegin (1 call)     4.56E-02   3.27E-02   3.63E-02   6.26E-02   2.80E+02
> VecAssemblyEnd   (1 call)     3.54E-04   3.43E-04   3.47E-04   3.44E-04   4.53E-04
> 
> Again, the time cost increases dramatically after 30K cores. 
> The max/min ratio of VecAssemblyBegin is 1.2 for both 30K and 65K cases. If 
> there is a huge delay on some process, should this value be large? 

   Yes, one would expect that. You are right it is something inside those calls.


> 
> The part of code that calls the assembly subroutines looks like:
>  
>   CALL DMCreateGlobalVector( ... ) 
>   CALL DMDAVecGetArrayF90( ... )  
>  ... each process computes its part of rhs...
>   CALL DMDAVecRestoreArrayF90(...)
>   
There is absolutely no reason for you to be calling the 
VecAssemblyBegin/End() below; take it out! You only need that if you use 
VecSetValues(). If you use XXXGetArrayYYY() and put values into the vector that 
way, VecAssemblyBegin/End() serves no purpose.

>   CALL VecAssemblyBegin( ... ) 
>   CALL VecAssemblyEnd( ... )


VecAssemblyBegin/End() does a couple of allreduces and then message 
passing (if values need to be moved) to get the values onto the correct 
processes. So these calls should take very little time. Something is wonky on 
your system, with that many MPI processes, in these calls. I don't know why; 
if you look at the code you'll see it is pretty straightforward.
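
To illustrate the distinction in C (a hedged sketch; the user's actual code is
Fortran and FillRHS is an illustrative name): values written through a DMDA
array view land directly in the locally owned storage, so there is nothing to
assemble, whereas VecSetValues() may buffer off-process entries and therefore
needs the assembly pair.

  #include <petscdm.h>
  #include <petscdmda.h>

  static PetscErrorCode FillRHS(DM da, Vec b)
  {
    DMDALocalInfo  info;
    PetscScalar ***f;
    PetscInt       i, j, k;

    PetscFunctionBeginUser;
    DMDAGetLocalInfo(da, &info);
    DMDAVecGetArray(da, b, &f);        /* direct view of locally owned entries */
    for (k = info.zs; k < info.zs + info.zm; k++)
      for (j = info.ys; j < info.ys + info.ym; j++)
        for (i = info.xs; i < info.xs + info.xm; i++)
          f[k][j][i] = 1.0;            /* each process fills only its own part */
    DMDAVecRestoreArray(da, b, &f);    /* done: no VecAssemblyBegin/End needed  */

    /* By contrast, entries set with VecSetValues() may belong to other
       processes, so that path does require:
         VecSetValues(b, n, idx, vals, INSERT_VALUES);
         VecAssemblyBegin(b);  VecAssemblyEnd(b);                              */
    PetscFunctionReturn(0);
  }
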

  Barry

> 
> Thank you
> 
> Regards,
> Frank
> 
> 
> On 10/04/2016 12:56 PM, Dave May wrote:
> On Tuesday, 4 October 2016, frank wrote:
> Hi,
> 
> This question is follow-up of the thread "Question about memory usage in 
> Multigrid preconditioner".
> I used to have the "Out of Memory(OOM)" problem when using the 
> CG+Telescope MG solver with 32768 cores. Adding the "-matrap 0; 
> -matptap_scalable" option did solve that problem. 
> 
> Then I test the scalability by solving a 3d poisson eqn for 1 step. I 
> used one sub-communicator in all the tests. The difference between the 
> petsc options in those tests are: 1 the pc_telescope_reduction_factor; 2 
> the number of multigrid levels in the up/down solver. The function 
> "ksp_solve" is timed. It is kind of slow and doesn't scale at all. 
> 
> Test1: 512^3 grid points
> Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
> 512      8                             4 / 3                            6.2466
> 4096     64                            5 / 3                            0.9361
> 32768    64                            4 / 3                            4.8914
> 
> Test2: 1024^3 grid points
> Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
> 4096     64                            5 / 4                            

[petsc-users] Time cost by Vec Assembly

2016-10-07 Thread frank

Hello,


Another thing, the vector assembly and scatter take more time as I increased 
the cores#:

  cores#                          4096       8192       16384      32768      65536
VecAssemblyBegin (298 calls)      2.91E+00   2.87E+00   8.59E+00   2.75E+01   2.21E+03
VecAssemblyEnd   (298 calls)      3.37E-03   1.78E-03   1.78E-03   5.13E-03   1.99E-03
VecScatterBegin  (76303 calls)    3.82E+00   3.01E+00   2.54E+00   4.40E+00   1.32E+00
VecScatterEnd    (76303 calls)    3.09E+01   1.47E+01   2.23E+01   2.96E+01   2.10E+01

The above data is produced by solving a constant-coefficient Poisson equation 
with a different rhs for 100 steps.
As you can see, the time of VecAssemblyBegin increases dramatically from 32K 
cores to 65K.

Something is very very wrong here. It is likely not the VecAssemblyBegin() 
itself that is taking the huge amount of time. VecAssemblyBegin() is a barrier; 
that is, all processes have to reach it before any process can continue beyond 
it. Something in the code on some processes is taking a huge amount of time 
before reaching that point. Perhaps it is in starting up all the processes? 
Or are you generating the entire rhs on one process? You can't do that.

Barry
(I create a new subject since this is a separate problem from my 
previous  question.)


Each process computes its part of the rhs.
The above results are from 100 steps' computation. It is not a 
starting-up issue.


I also have the results  from a simple code to show this problem:

cores#                        4096       8192       16384      32768      65536
VecAssemblyBegin (1 call)     4.56E-02   3.27E-02   3.63E-02   6.26E-02   2.80E+02
VecAssemblyEnd   (1 call)     3.54E-04   3.43E-04   3.47E-04   3.44E-04   4.53E-04


Again, the time cost increases dramatically after 30K cores.
The max/min ratio of VecAssemblyBegin is 1.2 for both 30K and 65K cases. 
If there is a huge delay on some process, should this value be large?


The part of code that calls the assembly subroutines looks like:

  CALL DMCreateGlobalVector( ... )
  CALL DMDAVecGetArrayF90( ... )
 ... each process computes its part of rhs...
  CALL DMDAVecRestoreArrayF90(...)

  CALL VecAssemblyBegin( ... )
  CALL VecAssemblyEnd( ... )

Thank you

Regards,
Frank


On 10/04/2016 12:56 PM, Dave May wrote:


On Tuesday, 4 October 2016, frank  wrote:
Hi,

This question is follow-up of the thread "Question about memory usage in Multigrid 
preconditioner".
I used to have the "Out of Memory(OOM)" problem when using the CG+Telescope MG solver 
with 32768 cores. Adding the "-matrap 0; -matptap_scalable" option did solve that problem.

Then I test the scalability by solving a 3d poisson eqn for 1 step. I used one 
sub-communicator in all the tests. The difference between the petsc options in those 
tests are: 1 the pc_telescope_reduction_factor; 2 the number of multigrid levels in the 
up/down solver. The function "ksp_solve" is timed. It is kind of slow and 
doesn't scale at all.

Test1: 512^3 grid points
Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
512      8                             4 / 3                            6.2466
4096     64                            5 / 3                            0.9361
32768    64                            4 / 3                            4.8914

Test2: 1024^3 grid points
Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
4096     64                            5 / 4                            3.4139
8192     128                           5 / 4                            2.4196
16384    32                            5 / 3                            5.4150
32768    64                            5 / 3                            5.6067
65536    128                           5 / 3                            6.5219

You have to be very careful how you interpret these numbers. Your solver 
contains nested calls to KSPSolve, and unfortunately as a result the numbers 
you report include setup time. This will remain true even if you call KSPSetUp 
on the outermost KSP.

Your email concerns the scalability of the solver application, so let's focus on 
that issue.

The only way to clearly separate setup from solve time is to perform two 
identical solves. The second solve will not require any setup. You should 
monitor the second solve via a new PetscStage.

This was what I did in the telescope paper. It was the only way to understand 
the setup cost (and scaling) cf the solve time (and scaling).
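
A minimal sketch of that two-solve timing (assuming the KSP and vectors are
already created and configured; the stage name is arbitrary), so that
-log_view reports the setup-free second solve under its own stage:

  #include <petscksp.h>

  static PetscErrorCode TimeSecondSolve(KSP ksp, Vec b, Vec x)
  {
    PetscLogStage stage;

    PetscFunctionBeginUser;
    PetscLogStageRegister("Solve 2 (no setup)", &stage);
    KSPSolve(ksp, b, x);          /* 1st solve: absorbs all setup cost    */
    PetscLogStagePush(stage);
    KSPSolve(ksp, b, x);          /* identical 2nd solve: pure solve time */
    PetscLogStagePop();
    PetscFunctionReturn(0);
  }
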

Thanks
   

Re: [petsc-users] Performance of the Telescope Multigrid Preconditioner

2016-10-07 Thread Barry Smith

> On Oct 7, 2016, at 4:49 PM, frank  wrote:
> 
> Dear all,
> 
> Thank you so much for the advice. 
>> All setup is done in the first solve.
>>  
>> ** The time for 1st solve does not scale. 
>> In practice, I am solving a variable coefficient  Poisson equation. I 
>> need to build the matrix every time step. Therefore, each step is similar to 
>> the 1st solve which does not scale. Is there a way I can improve the 
>> performance? 
>> 
>> You could use rediscretization instead of Galerkin to produce the coarse 
>> operators.
>> 
>> Yes I can think of one option for improved performance, but I cannot tell 
>> whether it will be beneficial because the logging isn't sufficiently fine 
>> grained (and there is no easy way to get the info out of petsc). 
>> 
>> I use PtAP to repartition the matrix, this could be consuming most of the 
>> setup time in Telescope with your run. Such a repartitioning could be avoided 
>> if you provided a method to create the operator on the coarse levels (what 
>> Matt is suggesting). However, this requires you to be able to define your 
>> coefficients on the coarse grid. This will most likely reduce setup time, 
>> but your coarse grid operators (now re-discretized) are likely to be less 
>> effective than those generated via Galerkin coarsening.
> 
> Please correct me if I understand this incorrectly:   I can define my own 
> restriction function and pass it to petsc instead of using PtAP.
> If so, what's the interface to do that?
>  
>> Also, you use CG/MG when FMG by itself would probably be faster. Your 
>> smoother is likely not strong enough, and you
>> should use something like V(2,2). There is a lot of tuning that is possible, 
>> but difficult to automate.
>> 
>> Matt's completely correct. 
>> If we could automate this in a meaningful manner, we would have done so.
> 
> I am not as familiar with multigrid as you guys. It would be very kind if you 
> could be more specific.
> What does V(2,2) stand for? Is there some strong smoother built into petsc that 
> I can try?
> 
> 
> Another thing, the vector assembly and scatter take more time as I increased 
> the cores#:
> 
>  cores#                          4096       8192       16384      32768      65536
> VecAssemblyBegin (298 calls)     2.91E+00   2.87E+00   8.59E+00   2.75E+01   2.21E+03
> VecAssemblyEnd   (298 calls)     3.37E-03   1.78E-03   1.78E-03   5.13E-03   1.99E-03
> VecScatterBegin  (76303 calls)   3.82E+00   3.01E+00   2.54E+00   4.40E+00   1.32E+00
> VecScatterEnd    (76303 calls)   3.09E+01   1.47E+01   2.23E+01   2.96E+01   2.10E+01
> 
> The above data is produced by solving a constant-coefficient Poisson 
> equation with a different rhs for 100 steps. 
> As you can see, the time of VecAssemblyBegin increases dramatically from 32K 
> cores to 65K.

   Something is very very wrong here. It is likely not the VecAssemblyBegin() 
itself that is taking the huge amount of time. VecAssemblyBegin() is a barrier; 
that is, all processes have to reach it before any process can continue beyond 
it. Something in the code on some processes is taking a huge amount of time 
before reaching that point. Perhaps it is in starting up all the processes? 
Or are you generating the entire rhs on one process? You can't do that.

   Barry


>  
> With 65K cores, it took more time to assemble the rhs than to solve the 
> equation.   Is there a way to improve this?
> 
> 
> Thank you.
> 
> Regards,
> Frank  
> 
>> 
>> On 10/04/2016 12:56 PM, Dave May wrote:
>>> 
>>> 
>>> On Tuesday, 4 October 2016, frank  wrote:
>>> Hi,
>>> 
>>> This question is follow-up of the thread "Question about memory usage in 
>>> Multigrid preconditioner".
>>> I used to have the "Out of Memory(OOM)" problem when using the CG+Telescope 
>>> MG solver with 32768 cores. Adding the "-matrap 0; -matptap_scalable" 
>>> option did solve that problem. 
>>> 
>>> Then I test the scalability by solving a 3d poisson eqn for 1 step. I used 
>>> one sub-communicator in all the tests. The difference between the petsc 
>>> options in those tests are: 1 the pc_telescope_reduction_factor; 2 the 
>>> number of multigrid levels in the up/down solver. The function "ksp_solve" 
>>> is timed. It is kind of slow and doesn't scale at all. 
>>> 
>>> Test1: 512^3 grid points
>>> Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
>>> 512      8                             4 / 3                            6.2466
>>> 4096     64                            5 / 3                            0.9361
>>> 32768    64                            4 / 3                            4.8914
>>> 
>>> Test2: 1024^3 grid points
>>> Core#   

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Barry Smith

   Fande,

If you can reproduce the problem with PETSc 3.7.4 please send us sample 
code that produces it so we can work with Sherry to get it fixed ASAP.

   Barry

> On Oct 7, 2016, at 10:23 AM, Satish Balay  wrote:
> 
> On Fri, 7 Oct 2016, Kong, Fande wrote:
> 
>> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
>> 
>>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>> 
 Hi guys,
 
 are there any news about fixing buggy behavior of SuperLU_DIST, exactly
>>> what
 is described here:
 
 http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
 
 I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
>>> fine
 with 3.5.4.
 
 Do I still have to stick to maint branch, and what are the chances for
>>> these
 fixes to be included in 3.7.5?
>>> 
>>> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
>>> issues with it - its best to debug and figure out the cause.
>>> 
>> 
>> This bug is indeed inside of superlu_dist, and we started having this issue
>> from PETSc-3.6.x. I think superlu_dist developers should have fixed this
>> bug. We forgot to update superlu_dist??  This is not a thing users could
>> debug and fix.
>> 
>> I have many people in INL suffering from this issue, and they have to stay
>> with PETSc-3.5.4 to use superlu_dist.
> 
> To verify if the bug is fixed in latest superlu_dist - you can try
> [assuming you have git - either from petsc-3.7/maint/master]:
> 
> --download-superlu_dist --download-superlu_dist-commit=origin/maint
> 
> 
> Satish
> 



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Satish Balay
On Fri, 7 Oct 2016, Kong, Fande wrote:

> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
> 
> > On Fri, 7 Oct 2016, Anton Popov wrote:
> >
> > > Hi guys,
> > >
> > > are there any news about fixing buggy behavior of SuperLU_DIST, exactly
> > what
> > > is described here:
> > >
> > > http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
> > >
> > > I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
> > fine
> > > with 3.5.4.
> > >
> > > Do I still have to stick to maint branch, and what are the chances for
> > these
> > > fixes to be included in 3.7.5?
> >
> > 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
> > issues with it - its best to debug and figure out the cause.
> >
> 
> This bug is indeed inside of superlu_dist, and we started having this issue
> from PETSc-3.6.x. I think superlu_dist developers should have fixed this
> bug. We forgot to update superlu_dist??  This is not a thing users could
> debug and fix.
> 
> I have many people in INL suffering from this issue, and they have to stay
> with PETSc-3.5.4 to use superlu_dist.

To verify if the bug is fixed in latest superlu_dist - you can try
[assuming you have git - either from petsc-3.7/maint/master]:

--download-superlu_dist --download-superlu_dist-commit=origin/maint
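
For example, a hedged sketch of the full configure line (keep whatever other
options you already use; the PETSC_ARCH name is arbitrary):

  ./configure PETSC_ARCH=arch-superlu-maint \
      --download-superlu_dist \
      --download-superlu_dist-commit=origin/maint
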


Satish



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Matthew Knepley
On Fri, Oct 7, 2016 at 10:16 AM, Kong, Fande  wrote:

> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
>
>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>
>> > Hi guys,
>> >
>> > are there any news about fixing buggy behavior of SuperLU_DIST, exactly
>> what
>> > is described here:
>> >
>> > http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
>> >
>> > I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
>> fine
>> > with 3.5.4.
>> >
>> > Do I still have to stick to maint branch, and what are the chances for
>> these
>> > fixes to be included in 3.7.5?
>>
>> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
>> issues with it - its best to debug and figure out the cause.
>>
>
> This bug is indeed inside of superlu_dist, and we started having this
> issue from PETSc-3.6.x. I think superlu_dist developers should have fixed
> this bug. We forgot to update superlu_dist??  This is not a thing users
> could debug and fix.
>
> I have many people in INL suffering from this issue, and they have to stay
> with PETSc-3.5.4 to use superlu_dist.
>

Do you have this bug with the latest maint?

  Matt


> Fande
>
>
>
>>
>> Satish
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Kong, Fande
On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:

> On Fri, 7 Oct 2016, Anton Popov wrote:
>
> > Hi guys,
> >
> > are there any news about fixing buggy behavior of SuperLU_DIST, exactly
> what
> > is described here:
> >
> > http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
> >
> > I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
> fine
> > with 3.5.4.
> >
> > Do I still have to stick to maint branch, and what are the chances for
> these
> > fixes to be included in 3.7.5?
>
> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
> issues with it - its best to debug and figure out the cause.
>

This bug is indeed inside of superlu_dist, and we started having this issue
from PETSc-3.6.x. I think superlu_dist developers should have fixed this
bug. We forgot to update superlu_dist??  This is not a thing users could
debug and fix.

I have many people in INL suffering from this issue, and they have to stay
with PETSc-3.5.4 to use superlu_dist.

Fande



>
> Satish
>


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Satish Balay
On Fri, 7 Oct 2016, Anton Popov wrote:

> Hi guys,
> 
> are there any news about fixing buggy behavior of SuperLU_DIST, exactly what
> is described here:
> 
> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
> 
> I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works fine
> with 3.5.4.
> 
> Do I still have to stick to maint branch, and what are the chances for these
> fixes to be included in 3.7.5?

3.7.4. is off maint branch [as of a week ago]. So if you are seeing
issues with it - its best to debug and figure out the cause.

Satish


[petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Anton Popov

Hi guys,

are there any news about fixing buggy behavior of SuperLU_DIST, exactly 
what is described here:


http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?

I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works 
fine with 3.5.4.


Do I still have to stick to maint branch, and what are the chances for 
these fixes to be included in 3.7.5?


Thanks,

Anton



Re: [petsc-users] Performance of the Telescope Multigrid Preconditioner

2016-10-07 Thread Dave May
On 7 October 2016 at 02:05, Matthew Knepley  wrote:

> On Thu, Oct 6, 2016 at 7:33 PM, frank  wrote:
>
>> Dear Dave,
>> Follow your advice, I solve the identical equation twice and time two
>> steps separately. The result is below:
>>
>> Test: 1024^3 grid points
>> Cores#    reduction factor    MG levels#    time of 1st solve    2nd time
>> 4096      64                  6 + 3         3.85                 1.75
>> 8192      128                 5 + 3         5.52                 0.91
>> 16384     256                 5 + 3         5.37                 0.52
>> 32768     512                 5 + 4         3.03                 0.36
>> 32768     64 | 8              4 | 3 | 3     2.80                 0.43
>> 65536     1024                5 + 4         3.38                 0.59
>> 65536     32 | 32             4 | 4 | 3     2.14                 0.22
>>
>> I also attached the log_view info from all the runs.  The files are named by
>> the cores# + reduction factor.
>> The ksp_view and petsc_options for  the 1st run are also included. Others
>> are similar. The only differences are the reduction factor and mg levels.
>>
>> ** The time for the 1st solve is generally much larger. Is this because
>> the ksp solver on the sub-communicator is set up during the 1st solve?
>>
>
Yes, but it's not just the setup for the KSP on the sub-comm.
There is additional setup required:
[1] creating the sub-comm 
[2] creating the DM on the sub-comm 
[3] creating the scatter objects and nullspaces 
[4] repartitioning the matrix 


>
> All setup is done in the first solve.
>
>
>> ** The time for 1st solve does not scale.
>> In practice, I am solving a variable coefficient  Poisson equation. I
>> need to build the matrix every time step. Therefore, each step is similar
>> to the 1st solve which does not scale. Is there a way I can improve the
>> performance?
>>
>
> You could use rediscretization instead of Galerkin to produce the coarse
> operators.
>

Yes I can think of one option for improved performance, but I cannot tell
whether it will be beneficial because the logging isn't sufficiently fine
grained (and there is no easy way to get the info out of petsc).

I use PtAP to repartition the matrix, this could be consuming most of the
setup time in Telescope with your run. Such a repartitioning could be avoided
if you provided a method to create the operator on the coarse levels (what
Matt is suggesting). However, this requires you to be able to define your
coefficients on the coarse grid. This will most likely reduce setup time,
but your coarse grid operators (now re-discretized) are likely to be less
effective than those generated via Galerkin coarsening.




>
>
>> ** The 2nd solve scales but not quite well for more than 16384 cores.
>>
>
> How well were you looking for? This is strong scaling, which has an
> Amdahl's Law limit.
>

Is 1024^3 points your target (production run) resolution?
If it is not, then start doing the tests with your target resolution.
Setup time cf. the solve time will always be smaller, and will impact the scaling
less, when you consider higher-resolution problems.


>
>
>> It seems to me that the performance depends on the tuning of MG
>> levels on the sub-communicator(s).
>>
>
Yes - absolutely.


> Is there some general strategies regarding how to distribute the
>> levels? or when to use multiple sub-communicators ?
>>
>
Yes, but there is nothing definite.
We don't have a performance model to guide these choices.
The optimal choice is dependent on the characteristics of your compute
nodes, the network, the form of the discrete operator, and the mesh
refinement factor used when creating the MG hierarchy.
It's a bit complicated.

I have found when using meshes with a refinement factor of 2, using a
reduction factor of 64 within telescope is effective.

I would suggest experimenting with the refinement factor. If your
coefficients are smooth, you can probably refine your mesh for MG by a
factor of 4 (rather than the default of 2). Galerkin will still provide
meaningful coarse grid operators.

Always coarsen the problem until you have ~1 DOF per core before
repartitioning the operator via Telescope. Don't use a reduction factor which
will only allow 1 additional MG level to be defined on the sub-comm;
e.g. if you use meshes refined by 2x, on the coarse level use a reduction
factor of 64.
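
In options terms, a hedged sketch of the kind of nesting described here
(option prefixes as I understand them for petsc 3.7, with level counts and
factors that depend entirely on your grid; not a tested configuration): a
5-level MG on the full communicator whose coarse level is repartitioned by a
factor of 64 and then solved with MG again on the sub-communicator.

  -ksp_type cg
  -pc_type mg -pc_mg_levels 5
  -mg_coarse_pc_type telescope
  -mg_coarse_pc_telescope_reduction_factor 64
  -mg_coarse_telescope_pc_type mg
  -mg_coarse_telescope_pc_mg_levels 3
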

Without a performance model, the optimal level at which to invoke repartitioning,
and how aggressively the communicator size should be reduced, cannot be determined
a priori. Experimentation is the only way.


>
>
> Also, you use CG/MG when FMG by itself would probably be faster. Your
> smoother is likely not strong enough, and you
> should use something like V(2,2). There is a lot of tuning that is
> possible, but difficult to