Re: [petsc-users] GAMG scaling

2018-12-24 Thread Mark Adams via petsc-users
On Tue, Dec 25, 2018 at 12:10 AM Jed Brown  wrote:

> Mark Adams  writes:
>
> > On Mon, Dec 24, 2018 at 4:56 PM Jed Brown  wrote:
> >
> >> Mark Adams via petsc-users  writes:
> >>
> >> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private
> in
> >> > attached, NB, this is code that I wrote in grad school). It is memory
> >> > efficient and simple, just four nested loops i,j,I,J: C(I,J) =
> >> > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data
> that I
> >> am
> >> > getting from my bone modeling colleagues, that use this old code on
> >> > Stampede now, the times look reasonable compared to GAMG. This is
> >> optimized
> >> > for elasticity, where I unroll loops (so it is really six nested
> loops).
> >>
> >> Is the A above meant to include some ghosted rows?
> >>
> >
> > You could but I was thinking of having i in the outer loop. In C(I,J) =
> > P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only be the local rows
> of
> > A (and the left term P).
>
> Okay, so you need to gather those rows of P referenced by the
> off-diagonal parts of A.


yes, and this looks correct ..


> Once you have them, do
>
>   for i:
>     v[:] = 0 # sparse vector
>     for j:
>       v[:] += A[i,j] * P[j,:]
>     for I:
>       C[I,:] += P[i,I] * v[:]
>
> One inefficiency is that you don't actually get "hits" on all the
> entries of C[I,:], but that much remains no matter how you reorder loops
> (unless you make I the outermost).
>

> >> > In thinking about this now, I think you want to make a local copy of P
> >> with
> >> > rows (j) for every column in A that you have locally, then transpose
> this
> >> > local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a
> >> > special tree data structure but a matrix is simpler.)
> >>
> >> Why transpose for P(j,J)?
> >>
> >
> > (premature) optimization. I was thinking 'j' being in the inner loop and
> > doing sparse inner product, but now that I think about it there are other
> > options.
>
> Sparse inner products tend to be quite inefficient.  Explicit blocking
> helps some, but I would try to avoid it.
>

Yea, the design space here is non-trivial.

BTW, I have a Cal ME grad student who I've been working with on getting my
old parallel FE / Prometheus code running on Stampede for his bone modeling
problems. He started from zero in HPC, but he is eager and has been picking
it up. If there is interest, we could get performance data with the existing
code as a benchmark, and we could generate matrices if anyone wants to look
into this.


Re: [petsc-users] GAMG scaling

2018-12-24 Thread Jed Brown via petsc-users
Mark Adams  writes:

> On Mon, Dec 24, 2018 at 4:56 PM Jed Brown  wrote:
>
>> Mark Adams via petsc-users  writes:
>>
>> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in
>> > attached, NB, this is code that I wrote in grad school). It is memory
>> > efficient and simple, just four nested loops i,j,I,J: C(I,J) =
>> > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data that I
>> am
>> > getting from my bone modeling colleagues, that use this old code on
>> > Stampede now, the times look reasonable compared to GAMG. This is
>> optimized
>> > for elasticity, where I unroll loops (so it is really six nested loops).
>>
>> Is the A above meant to include some ghosted rows?
>>
>
> You could but I was thinking of having i in the outer loop. In C(I,J) =
> P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only be the local rows of
> A (and the left term P).

Okay, so you need to gather those rows of P referenced by the
off-diagonal parts of A.  Once you have them, do

  for i:
    v[:] = 0 # sparse vector
    for j:
      v[:] += A[i,j] * P[j,:]
    for I:
      C[I,:] += P[i,I] * v[:]

One inefficiency is that you don't actually get "hits" on all the
entries of C[I,:], but that much remains no matter how you reorder loops
(unless you make I the outermost).

>> > In thinking about this now, I think you want to make a local copy of P
>> with
>> > rows (j) for every column in A that you have locally, then transpose this
>> > local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a
>> > special tree data structure but a matrix is simpler.)
>>
>> Why transpose for P(j,J)?
>>
>
> (premature) optimization. I was thinking 'j' being in the inner loop and
> doing sparse inner product, but now that I think about it there are other
> options.

Sparse inner products tend to be quite inefficient.  Explicit blocking
helps some, but I would try to avoid it.
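
For reference, a minimal illustration in Python of the loop structure sketched
above, with plain dictionaries standing in for sparse rows. This is only a
sketch of the C(I,J) = sum_{i,j} P(i,I)*A(i,j)*P(j,J) accumulation, not the
Prometheus or PETSc code; it assumes the needed off-process rows of P have
already been gathered, and the dictionary v plays the role of a sparse
accumulator (rather than a dense array of length PN, the global number of
columns of P, as in the "nonscalable" variant discussed later in the thread):

  # C = P^T * A * P, accumulating one sparse row v of A*P per local row i of A.
  # A and P are dicts of sparse rows: {row index: {column index: value}}.
  def ptap(A, P):
      C = {}
      for i, Arow in A.items():
          v = {}                                  # sparse row i of A*P
          for j, a_ij in Arow.items():
              for J, p_jJ in P[j].items():
                  v[J] = v.get(J, 0.0) + a_ij * p_jJ
          for I, p_iI in P[i].items():            # scatter v into rows of C
              Crow = C.setdefault(I, {})
              for J, v_J in v.items():
                  Crow[J] = Crow.get(J, 0.0) + p_iI * v_J
      return C

  # Tiny check: A = [[2, -1], [-1, 2]], P = [[1], [1]]  =>  P^T A P = [[2]]
  A = {0: {0: 2.0, 1: -1.0}, 1: {0: -1.0, 1: 2.0}}
  P = {0: {0: 1.0}, 1: {0: 1.0}}
  print(ptap(A, P))                               # {0: {0: 2.0}}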


Re: [petsc-users] GAMG scaling

2018-12-24 Thread Mark Adams via petsc-users
On Mon, Dec 24, 2018 at 4:56 PM Jed Brown  wrote:

> Mark Adams via petsc-users  writes:
>
> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in
> > attached, NB, this is code that I wrote in grad school). It is memory
> > efficient and simple, just four nested loops i,j,I,J: C(I,J) =
> > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data that I
> am
> > getting from my bone modeling colleagues, that use this old code on
> > Stampede now, the times look reasonable compared to GAMG. This is
> optimized
> > for elasticity, where I unroll loops (so it is really six nested loops).
>
> Is the A above meant to include some ghosted rows?
>

You could but I was thinking of having i in the outer loop. In C(I,J) =
P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only be the local rows of
A (and the left term P).


>
> > In thinking about this now, I think you want to make a local copy of P
> with
> > rows (j) for every column in A that you have locally, then transpose this
> > local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a
> > special tree data structure but a matrix is simpler.)
>
> Why transpose for P(j,J)?
>

(premature) optimization. I was thinking 'j' being in the inner loop and
doing sparse inner product, but now that I think about it there are other
options.


Re: [petsc-users] GAMG scaling

2018-12-24 Thread Jed Brown via petsc-users
Mark Adams via petsc-users  writes:

> Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in
> attached, NB, this is code that I wrote in grad school). It is memory
> efficient and simple, just four nested loops i,j,I,J: C(I,J) =
> P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data that I am
> getting from my bone modeling colleagues, that use this old code on
> Stampede now, the times look reasonable compared to GAMG. This is optimized
> for elasticity, where I unroll loops (so it is really six nested loops).

Is the A above meant to include some ghosted rows?

> In thinking about this now, I think you want to make a local copy of P with
> rows (j) for every column in A that you have locally, then transpose this
> local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a
> special tree data structure but a matrix is simpler.)

Why transpose for P(j,J)?


Re: [petsc-users] GAMG scaling

2018-12-22 Thread Mark Adams via petsc-users
Wow, this is an old thread.

Sorry if I sound like an old fart talking about the good old days, but I
originally did RAP in Prometheus in a non-work-optimal way that might be of
interest. It is not hard to implement. I bring this up because we continue to
struggle with this damn thing. I think this approach is perfectly scalable,
pretty low overhead, and simple.

Note, I talked to the hypre people about this around '97 when they were
implementing RAP, and perhaps they are doing it this way ... the 4x slower
way.

Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in
attached, NB, this is code that I wrote in grad school). It is memory
efficient and simple, just four nested loops i,j,I,J: C(I,J) =
P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data that I am
getting from my bone modeling colleagues, that use this old code on
Stampede now, the times look reasonable compared to GAMG. This is optimized
for elasticity, where I unroll loops (so it is really six nested loops).

In thinking about this now, I think you want to make a local copy of P with
rows (j) for every column in A that you have locally, then transpose this
local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a
special tree data structure but a matrix is simpler.)


On Sat, Dec 22, 2018 at 3:39 AM Mark Adams  wrote:

> OK, so this thread has drifted, see title :)
>
> On Fri, Dec 21, 2018 at 10:01 PM Fande Kong  wrote:
>
>> Sorry, hit the wrong button.
>>
>>
>>
>> On Fri, Dec 21, 2018 at 7:56 PM Fande Kong  wrote:
>>
>>>
>>>
>>> On Fri, Dec 21, 2018 at 9:44 AM Mark Adams  wrote:
>>>
 Also, you mentioned that you are using 10 levels. This is very strange
 with GAMG. You can run with -info and grep on GAMG to see the sizes and the
 number of non-zeros per level. You should coarsen at a rate of about 2^D to
 3^D with GAMG (with 10 levels this would imply a very large fine grid
 problem so I suspect there is something strange going on with coarsening).
 Mark

>>>
>>> Hi Mark,
>>>
>>>
>> Thanks for your email. We did not try GAMG much for our problems since we
>> still have trouble figuring out how to use GAMG effectively. Instead, we are
>> building our own customized AMG that needs to use PtAP to construct coarse
>> matrices. The customized AMG works pretty well for our specific simulations.
>> The bottleneck right now is that PtAP might take too much memory, and the
>> code crashes within the function "PtAP". I definitely need a memory profiler
>> to confirm my statement here.
>>
>> Thanks,
>>
>> Fande Kong,
>>
>>
>>
>>>
>>>
>>>

 On Fri, Dec 21, 2018 at 11:36 AM Zhang, Hong via petsc-users <
 petsc-users@mcs.anl.gov> wrote:

> Fande:
> I will explore it and get back to you.
> Does anyone know how to profile memory usage?
> Hong
>
> Thanks, Hong,
>>
>> I just briefly went through the code. I was wondering if it is
>> possible to destroy "c->ptap" (that caches a lot of intermediate data) to
>> release the memory after the coarse matrix is assembled. I understand you
>> may still want to reuse these data structures by default but for my
>> simulation, the preconditioner is fixed and there is no reason to keep 
>> the
>> "c->ptap".
>>
>
>> It would be great, if we could have this optional functionality.
>>
>> Fande Kong,
>>
>> On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong 
>> wrote:
>>
>>> We use nonscalable implementation as default, and switch to scalable
>>> for matrices over finer grids. You may use option '-matptap_via 
>>> scalable'
>>> to force scalable PtAP  implementation for all PtAP. Let me know if it
>>> works.
>>> Hong
>>>
>>> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
>>> wrote:
>>>

   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable
 automatically for "large" problems, which is determined by some 
 heuristic.

Barry


 > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
 petsc-users@mcs.anl.gov> wrote:
 >
 >
 >
 > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
 wrote:
 > Fande:
 > Hong,
 > Thanks for your improvements on PtAP that is critical for MG-type
 algorithms.
 >
 > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
 > Mark,
 > Below is the copy of my email sent to you on Feb 27:
 >
 > I implemented scalable MatPtAP and did comparisons of three
 implementations using ex56.c on alcf cetus machine (this machine has 
 small
 memory, 1GB/core):
 > - nonscalable PtAP: use an array of length PN to do dense axpy
 > - scalable PtAP:   do sparse axpy without use of PN array
 >
 > What PN means here?
 > Global number of columns of P.

Re: [petsc-users] GAMG scaling

2018-12-22 Thread Mark Adams via petsc-users
OK, so this thread has drifted, see title :)

On Fri, Dec 21, 2018 at 10:01 PM Fande Kong  wrote:

> Sorry, hit the wrong button.
>
>
>
> On Fri, Dec 21, 2018 at 7:56 PM Fande Kong  wrote:
>
>>
>>
>> On Fri, Dec 21, 2018 at 9:44 AM Mark Adams  wrote:
>>
>>> Also, you mentioned that you are using 10 levels. This is very strange
>>> with GAMG. You can run with -info and grep on GAMG to see the sizes and the
>>> number of non-zeros per level. You should coarsen at a rate of about 2^D to
>>> 3^D with GAMG (with 10 levels this would imply a very large fine grid
>>> problem so I suspect there is something strange going on with coarsening).
>>> Mark
>>>
>>
>> Hi Mark,
>>
>>
> Thanks for your email. We did not try GAMG much for our problems since we
> still have trouble figuring out how to use GAMG effectively. Instead, we are
> building our own customized AMG that needs to use PtAP to construct coarse
> matrices. The customized AMG works pretty well for our specific simulations.
> The bottleneck right now is that PtAP might take too much memory, and the
> code crashes within the function "PtAP". I definitely need a memory profiler
> to confirm my statement here.
>
> Thanks,
>
> Fande Kong,
>
>
>
>>
>>
>>
>>>
>>> On Fri, Dec 21, 2018 at 11:36 AM Zhang, Hong via petsc-users <
>>> petsc-users@mcs.anl.gov> wrote:
>>>
 Fande:
 I will explore it and get back to you.
 Does anyone know how to profile memory usage?
 Hong

 Thanks, Hong,
>
> I just briefly went through the code. I was wondering if it is
> possible to destroy "c->ptap" (that caches a lot of intermediate data) to
> release the memory after the coarse matrix is assembled. I understand you
> may still want to reuse these data structures by default but for my
> simulation, the preconditioner is fixed and there is no reason to keep the
> "c->ptap".
>

> It would be great, if we could have this optional functionality.
>
> Fande Kong,
>
> On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong 
> wrote:
>
>> We use nonscalable implementation as default, and switch to scalable
>> for matrices over finer grids. You may use option '-matptap_via scalable'
>> to force scalable PtAP  implementation for all PtAP. Let me know if it
>> works.
>> Hong
>>
>> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
>> wrote:
>>
>>>
>>>   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically
>>> for "large" problems, which is determined by some heuristic.
>>>
>>>Barry
>>>
>>>
>>> > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
>>> petsc-users@mcs.anl.gov> wrote:
>>> >
>>> >
>>> >
>>> > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
>>> wrote:
>>> > Fande:
>>> > Hong,
>>> > Thanks for your improvements on PtAP that is critical for MG-type
>>> algorithms.
>>> >
>>> > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
>>> > Mark,
>>> > Below is the copy of my email sent to you on Feb 27:
>>> >
>>> > I implemented scalable MatPtAP and did comparisons of three
>>> implementations using ex56.c on alcf cetus machine (this machine has 
>>> small
>>> memory, 1GB/core):
>>> > - nonscalable PtAP: use an array of length PN to do dense axpy
>>> > - scalable PtAP:   do sparse axpy without use of PN array
>>> >
>>> > What PN means here?
>>> > Global number of columns of P.
>>> >
>>> > - hypre PtAP.
>>> >
>>> > The results are attached. Summary:
>>> > - nonscalable PtAP is 2x faster than scalable, 8x faster than
>>> hypre PtAP
>>> > - scalable PtAP is 4x faster than hypre PtAP
>>> > - hypre uses less memory (see job.ne399.n63.np1000.sh)
>>> >
>>> > I was wondering how much more memory PETSc PtAP uses than hypre? I
>>> am implementing an AMG algorithm based on PETSc right now, and it is
>>> working well. But we have found a bottleneck with PtAP. For the same P and
>>> A, PETSc PtAP fails to generate a coarse matrix due to out of memory,
>>> while hypre can still generate the coarse matrix.
>>> >
>>> > I do not want to just use the HYPRE one because we had to
>>> duplicate matrices if I used HYPRE PtAP.
>>> >
>>> > It would be nice if you guys have already done some comparisons of
>>> the memory usage of these implementations.
>>> > Do you encounter memory issue with  scalable PtAP?
>>> >
>>> > By default do we use the scalable PtAP?? Do we have to specify
>>> some options to use the scalable version of PtAP?  If so, it would be 
>>> nice
>>> to use the scalable version by default.  I am totally missing something
>>> here.
>>> >
>>> > Thanks,
>>> >
>>> > Fande
>>> >
>>> >
>>> > Karl had a student in the summer who improved MatPtAP(). Do you
>>> use the latest version of 

Re: [petsc-users] GAMG scaling

2018-12-21 Thread Fande Kong via petsc-users
Sorry, hit the wrong button.



On Fri, Dec 21, 2018 at 7:56 PM Fande Kong  wrote:

>
>
> On Fri, Dec 21, 2018 at 9:44 AM Mark Adams  wrote:
>
>> Also, you mentioned that you are using 10 levels. This is very strange
>> with GAMG. You can run with -info and grep on GAMG to see the sizes and the
>> number of non-zeros per level. You should coarsen at a rate of about 2^D to
>> 3^D with GAMG (with 10 levels this would imply a very large fine grid
>> problem so I suspect there is something strange going on with coarsening).
>> Mark
>>
>
> Hi Mark,
>
>
Thanks for your email. We did not try GAMG much for our problems since we
still have trouble figuring out how to use GAMG effectively. Instead, we are
building our own customized AMG that needs to use PtAP to construct coarse
matrices. The customized AMG works pretty well for our specific simulations.
The bottleneck right now is that PtAP might take too much memory, and the
code crashes within the function "PtAP". I definitely need a memory profiler
to confirm my statement here.

Thanks,

Fande Kong,



>
>
>
>>
>> On Fri, Dec 21, 2018 at 11:36 AM Zhang, Hong via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>>
>>> Fande:
>>> I will explore it and get back to you.
>>> Does anyone know how to profile memory usage?
>>> Hong
>>>
>>> Thanks, Hong,

 I just briefly went through the code. I was wondering if it is possible
 to destroy "c->ptap" (that caches a lot of intermediate data) to release
 the memory after the coarse matrix is assembled. I understand you may still
 want to reuse these data structures by default but for my simulation, the
 preconditioner is fixed and there is no reason to keep the "c->ptap".

>>>
 It would be great, if we could have this optional functionality.

 Fande Kong,

 On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong  wrote:

> We use nonscalable implementation as default, and switch to scalable
> for matrices over finer grids. You may use option '-matptap_via scalable'
> to force scalable PtAP  implementation for all PtAP. Let me know if it
> works.
> Hong
>
> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
> wrote:
>
>>
>>   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically
>> for "large" problems, which is determined by some heuristic.
>>
>>Barry
>>
>>
>> > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>> >
>> >
>> >
>> > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
>> wrote:
>> > Fande:
>> > Hong,
>> > Thanks for your improvements on PtAP that is critical for MG-type
>> algorithms.
>> >
>> > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
>> > Mark,
>> > Below is the copy of my email sent to you on Feb 27:
>> >
>> > I implemented scalable MatPtAP and did comparisons of three
>> implementations using ex56.c on alcf cetus machine (this machine has 
>> small
>> memory, 1GB/core):
>> > - nonscalable PtAP: use an array of length PN to do dense axpy
>> > - scalable PtAP:   do sparse axpy without use of PN array
>> >
>> > What PN means here?
>> > Global number of columns of P.
>> >
>> > - hypre PtAP.
>> >
>> > The results are attached. Summary:
>> > - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre
>> PtAP
>> > - scalable PtAP is 4x faster than hypre PtAP
>> > - hypre uses less memory (see job.ne399.n63.np1000.sh)
>> >
>> > I was wondering how much more memory PETSc PtAP uses than hypre? I
>> am implementing an AMG algorithm based on PETSc right now, and it is
>> working well. But we have found a bottleneck with PtAP. For the same P and
>> A, PETSc PtAP fails to generate a coarse matrix due to out of memory,
>> while hypre can still generate the coarse matrix.
>> >
>> > I do not want to just use the HYPRE one because we had to duplicate
>> matrices if I used HYPRE PtAP.
>> >
>> > It would be nice if you guys have already done some comparisons of
>> the memory usage of these implementations.
>> > Do you encounter memory issue with  scalable PtAP?
>> >
>> > By default do we use the scalable PtAP?? Do we have to specify some
>> options to use the scalable version of PtAP?  If so, it would be nice to
>> use the scalable version by default.  I am totally missing something 
>> here.
>> >
>> > Thanks,
>> >
>> > Fande
>> >
>> >
>> > Karl had a student in the summer who improved MatPtAP(). Do you use
>> the latest version of petsc?
>> > HYPRE may use less memory than PETSc because it does not save and
>> reuse the matrices.
>> >
>> > I do not understand why generating coarse matrix fails due to out
>> of memory. Do you use direct solver at coarse grid?
>> > Hong
>> >
>> 

Re: [petsc-users] GAMG scaling

2018-12-21 Thread Fande Kong via petsc-users
Thanks so much, Hong,

If there are any new findings, please let me know.


On Fri, Dec 21, 2018 at 9:36 AM Zhang, Hong  wrote:

> Fande:
> I will explore it and get back to you.
> Does anyone know how to profile memory usage?
>

We are using gperftools
https://gperftools.github.io/gperftools/heapprofile.html

Fande,
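
For anyone else trying this, a sketch of a typical gperftools heap-profiler
invocation; the library path, output prefix, and executable name below are
placeholders, and the page above is the authoritative reference:

  LD_PRELOAD=/usr/lib/libtcmalloc.so HEAPPROFILE=/tmp/myapp.hprof ./myapp
  # profiles are written as /tmp/myapp.hprof.0001.heap, ...; inspect one with:
  pprof --text ./myapp /tmp/myapp.hprof.0001.heap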



> Hong
>
> Thanks, Hong,
>>
>> I just briefly went through the code. I was wondering if it is possible
>> to destroy "c->ptap" (that caches a lot of intermediate data) to release
>> the memory after the coarse matrix is assembled. I understand you may still
>> want to reuse these data structures by default but for my simulation, the
>> preconditioner is fixed and there is no reason to keep the "c->ptap".
>>
>
>> It would be great, if we could have this optional functionality.
>>
>> Fande Kong,
>>
>> On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong  wrote:
>>
>>> We use nonscalable implementation as default, and switch to scalable for
>>> matrices over finer grids. You may use option '-matptap_via scalable' to
>>> force scalable PtAP  implementation for all PtAP. Let me know if it works.
>>> Hong
>>>
>>> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
>>> wrote:
>>>

   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically
 for "large" problems, which is determined by some heuristic.

Barry


 > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
 petsc-users@mcs.anl.gov> wrote:
 >
 >
 >
 > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
 wrote:
 > Fande:
 > Hong,
 > Thanks for your improvements on PtAP that is critical for MG-type
 algorithms.
 >
 > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
 > Mark,
 > Below is the copy of my email sent to you on Feb 27:
 >
 > I implemented scalable MatPtAP and did comparisons of three
 implementations using ex56.c on alcf cetus machine (this machine has small
 memory, 1GB/core):
 > - nonscalable PtAP: use an array of length PN to do dense axpy
 > - scalable PtAP:   do sparse axpy without use of PN array
 >
 > What PN means here?
 > Global number of columns of P.
 >
 > - hypre PtAP.
 >
 > The results are attached. Summary:
 > - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre
 PtAP
 > - scalable PtAP is 4x faster than hypre PtAP
 > - hypre uses less memory (see job.ne399.n63.np1000.sh)
 >
 > I was wondering how much more memory PETSc PtAP uses than hypre? I am
 implementing an AMG algorithm based on PETSc right now, and it is working
 well. But we have found a bottleneck with PtAP. For the same P and A, PETSc
 PtAP fails to generate a coarse matrix due to out of memory, while hypre
 can still generate the coarse matrix.
 >
 > I do not want to just use the HYPRE one because we had to duplicate
 matrices if I used HYPRE PtAP.
 >
 > It would be nice if you guys have already done some comparisons of
 the memory usage of these implementations.
 > Do you encounter memory issue with  scalable PtAP?
 >
 > By default do we use the scalable PtAP?? Do we have to specify some
 options to use the scalable version of PtAP?  If so, it would be nice to
 use the scalable version by default.  I am totally missing something here.
 >
 > Thanks,
 >
 > Fande
 >
 >
 > Karl had a student in the summer who improved MatPtAP(). Do you use
 the latest version of petsc?
 > HYPRE may use less memory than PETSc because it does not save and
 reuse the matrices.
 >
 > I do not understand why generating coarse matrix fails due to out of
 memory. Do you use direct solver at coarse grid?
 > Hong
 >
 > Based on above observation, I set the default PtAP algorithm as
 'nonscalable'.
 > When PN > local estimated nonzero of C=PtAP, then switch default to
 'scalable'.
 > User can overwrite default.
 >
 > For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
 > MatPtAP   3.6224e+01 (nonscalable for small mats,
 scalable for larger ones)
 > scalable MatPtAP 4.6129e+01
 > hypre 1.9389e+02
 >
 > This work is on petsc-master. Give it a try. If you encounter any
 problem, let me know.
 >
 > Hong
 >
 > On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
 > (Hong), what is the current state of optimizing RAP for scaling?
 >
 > Nate is driving 3D elasticity problems at scale with GAMG and we
 are working out performance problems. They are hitting problems at ~1.5B
 dof problems on a basic Cray (XC30 I think).
 >
 > Thanks,
 > Mark
 >




Re: [petsc-users] GAMG scaling

2018-12-21 Thread Matthew Knepley via petsc-users
On Fri, Dec 21, 2018 at 12:55 PM Zhang, Hong  wrote:

> Matt:
>
>> Does anyone know how to profile memory usage?
>>>
>>
>> The best serial way is to use Massif, which is part of valgrind. I think
>> it might work in parallel if you
>> only look at one process at a time.
>>
>
> Can you give an example of using  Massif?
> For example, how to use it on petsc/src/ksp/ksp/examples/tutorials/ex56.c
> with np=8?
>

I have not used it in a while, so I have nothing laying around. However,
the manual is very good:

http://valgrind.org/docs/manual/ms-manual.html

  Thanks,

Matt
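
As a concrete starting point for the question above (a sketch only; the ex56
problem options are omitted, and the process count is taken from Hong's
question), Massif can be run under mpiexec with one output file per rank:

  # %p expands to each process id, so every rank gets its own profile.
  mpiexec -n 8 valgrind --tool=massif --massif-out-file=massif.out.%p \
      ./ex56 <ex56 options>
  ms_print massif.out.<pid>   # inspect a single rank's profile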


> Hong
>
>>
>>
>>> Hong
>>>
>>> Thanks, Hong,

 I just briefly went through the code. I was wondering if it is possible
 to destroy "c->ptap" (that caches a lot of intermediate data) to release
 the memory after the coarse matrix is assembled. I understand you may still
 want to reuse these data structures by default but for my simulation, the
 preconditioner is fixed and there is no reason to keep the "c->ptap".

>>>
 It would be great, if we could have this optional functionality.

 Fande Kong,

 On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong  wrote:

> We use nonscalable implementation as default, and switch to scalable
> for matrices over finer grids. You may use option '-matptap_via scalable'
> to force scalable PtAP  implementation for all PtAP. Let me know if it
> works.
> Hong
>
> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
> wrote:
>
>>
>>   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically
>> for "large" problems, which is determined by some heuristic.
>>
>>Barry
>>
>>
>> > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>> >
>> >
>> >
>> > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
>> wrote:
>> > Fande:
>> > Hong,
>> > Thanks for your improvements on PtAP that is critical for MG-type
>> algorithms.
>> >
>> > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
>> > Mark,
>> > Below is the copy of my email sent to you on Feb 27:
>> >
>> > I implemented scalable MatPtAP and did comparisons of three
>> implementations using ex56.c on alcf cetus machine (this machine has 
>> small
>> memory, 1GB/core):
>> > - nonscalable PtAP: use an array of length PN to do dense axpy
>> > - scalable PtAP:   do sparse axpy without use of PN array
>> >
>> > What PN means here?
>> > Global number of columns of P.
>> >
>> > - hypre PtAP.
>> >
>> > The results are attached. Summary:
>> > - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre
>> PtAP
>> > - scalable PtAP is 4x faster than hypre PtAP
>> > - hypre uses less memory (see job.ne399.n63.np1000.sh)
>> >
>> > I was wondering how much more memory PETSc PtAP uses than hypre? I
>> am implementing an AMG algorithm based on PETSc right now, and it is
>> working well. But we have found a bottleneck with PtAP. For the same P and
>> A, PETSc PtAP fails to generate a coarse matrix due to out of memory,
>> while hypre can still generate the coarse matrix.
>> >
>> > I do not want to just use the HYPRE one because we had to duplicate
>> matrices if I used HYPRE PtAP.
>> >
>> > It would be nice if you guys have already done some comparisons of
>> the memory usage of these implementations.
>> > Do you encounter memory issue with  scalable PtAP?
>> >
>> > By default do we use the scalable PtAP?? Do we have to specify some
>> options to use the scalable version of PtAP?  If so, it would be nice to
>> use the scalable version by default.  I am totally missing something 
>> here.
>> >
>> > Thanks,
>> >
>> > Fande
>> >
>> >
>> > Karl had a student in the summer who improved MatPtAP(). Do you use
>> the latest version of petsc?
>> > HYPRE may use less memory than PETSc because it does not save and
>> reuse the matrices.
>> >
>> > I do not understand why generating coarse matrix fails due to out
>> of memory. Do you use direct solver at coarse grid?
>> > Hong
>> >
>> > Based on above observation, I set the default PtAP algorithm as
>> 'nonscalable'.
>> > When PN > local estimated nonzero of C=PtAP, then switch default to
>> 'scalable'.
>> > User can overwrite default.
>> >
>> > For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I
>> get
>> > MatPtAP   3.6224e+01 (nonscalable for small mats,
>> scalable for larger ones)
>> > scalable MatPtAP 4.6129e+01
>> > hypre 1.9389e+02
>> >
>> > This work is on petsc-master. Give it a try. If you encounter any
>> problem, let me know.
>> >
>> > Hong
>> >
>> > On Wed, May 3, 2017 at 

Re: [petsc-users] GAMG scaling

2018-12-21 Thread Zhang, Hong via petsc-users
Matt:
Does anyone know how to profile memory usage?

The best serial way is to use Massif, which is part of valgrind. I think it 
might work in parallel if you
only look at one process at a time.

Can you give an example of using  Massif?
For example, how to use it on petsc/src/ksp/ksp/examples/tutorials/ex56.c with 
np=8?
Hong

Hong

Thanks, Hong,

I just briefly went through the code. I was wondering if it is possible to 
destroy "c->ptap" (that caches a lot of intermediate data) to release the 
memory after the coarse matrix is assembled. I understand you may still want to 
reuse these data structures by default but for my simulation, the 
preconditioner is fixed and there is no reason to keep the "c->ptap".

It would be great, if we could have this optional functionality.

Fande Kong,

On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:
We use nonscalable implementation as default, and switch to scalable for 
matrices over finer grids. You may use option '-matptap_via scalable' to force 
scalable PtAP  implementation for all PtAP. Let me know if it works.
Hong

On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically for 
"large" problems, which is determined by some heuristic.

   Barry


> On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> wrote:
>
>
>
> On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
> mailto:hzh...@mcs.anl.gov>> wrote:
> Fande:
> Hong,
> Thanks for your improvements on PtAP that is critical for MG-type algorithms.
>
> On Wed, May 3, 2017 at 10:17 AM Hong 
> mailto:hzh...@mcs.anl.gov>> wrote:
> Mark,
> Below is the copy of my email sent to you on Feb 27:
>
> I implemented scalable MatPtAP and did comparisons of three implementations 
> using ex56.c on alcf cetus machine (this machine has small memory, 1GB/core):
> - nonscalable PtAP: use an array of length PN to do dense axpy
> - scalable PtAP:   do sparse axpy without use of PN array
>
> What PN means here?
> Global number of columns of P.
>
> - hypre PtAP.
>
> The results are attached. Summary:
> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
> - scalable PtAP is 4x faster than hypre PtAP
> - hypre uses less memory (see 
> job.ne399.n63.np1000.sh)
>
> I was wondering how much more memory PETSc PtAP uses than hypre? I am 
> implementing an AMG algorithm based on PETSc right now, and it is working 
> well. But we have found a bottleneck with PtAP. For the same P and A, PETSc
> PtAP fails to generate a coarse matrix due to out of memory, while hypre
> can still generate the coarse matrix.
>
> I do not want to just use the HYPRE one because we had to duplicate matrices 
> if I used HYPRE PtAP.
>
> It would be nice if you guys have already done some comparisons of the
> memory usage of these implementations.
> Do you encounter memory issue with  scalable PtAP?
>
> By default do we use the scalable PtAP?? Do we have to specify some options 
> to use the scalable version of PtAP?  If so, it would be nice to use the 
> scalable version by default.  I am totally missing something here.
>
> Thanks,
>
> Fande
>
>
> Karl had a student in the summer who improved MatPtAP(). Do you use the 
> latest version of petsc?
> HYPRE may use less memory than PETSc because it does not save and reuse the 
> matrices.
>
> I do not understand why generating coarse matrix fails due to out of memory. 
> Do you use direct solver at coarse grid?
> Hong
>
> Based on above observation, I set the default PtAP algorithm as 'nonscalable'.
> When PN > local estimated nonzero of C=PtAP, then switch default to 
> 'scalable'.
> User can overwrite default.
>
> For the case of np=8000, ne=599 (see 
> job.ne599.n500.np8000.sh), I get
> MatPtAP   3.6224e+01 (nonscalable for small mats, scalable 
> for larger ones)
> scalable MatPtAP 4.6129e+01
> hypre 1.9389e+02
>
> This work is on petsc-master. Give it a try. If you encounter any problem,
> let me know.
>
> Hong
>
> On Wed, May 3, 2017 at 10:01 AM, Mark Adams 
> mailto:mfad...@lbl.gov>> wrote:
> (Hong), what is the current state of optimizing RAP for scaling?
>
> Nate, is driving 3D elasticity problems at scaling with GAMG and we are 
> working out performance problems. They are hitting problems at ~1.5B dof 
> problems on a basic Cray (XC30 I think).
>
> Thanks,
> Mark
>



--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-users] GAMG scaling

2018-12-21 Thread Matthew Knepley via petsc-users
On Fri, Dec 21, 2018 at 11:36 AM Zhang, Hong via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Fande:
> I will explore it and get back to you.
> Does anyone know how to profile memory usage?
>

The best serial way is to use Massif, which is part of valgrind. I think it
might work in parallel if you
only look at one process at a time.

  Matt


> Hong
>
> Thanks, Hong,
>>
>> I just briefly went through the code. I was wondering if it is possible
>> to destroy "c->ptap" (that caches a lot of intermediate data) to release
>> the memory after the coarse matrix is assembled. I understand you may still
>> want to reuse these data structures by default but for my simulation, the
>> preconditioner is fixed and there is no reason to keep the "c->ptap".
>>
>
>> It would be great, if we could have this optional functionality.
>>
>> Fande Kong,
>>
>> On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong  wrote:
>>
>>> We use nonscalable implementation as default, and switch to scalable for
>>> matrices over finer grids. You may use option '-matptap_via scalable' to
>>> force scalable PtAP  implementation for all PtAP. Let me know if it works.
>>> Hong
>>>
>>> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
>>> wrote:
>>>

   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically
 for "large" problems, which is determined by some heuristic.

Barry


 > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
 petsc-users@mcs.anl.gov> wrote:
 >
 >
 >
 > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
 wrote:
 > Fande:
 > Hong,
 > Thanks for your improvements on PtAP that is critical for MG-type
 algorithms.
 >
 > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
 > Mark,
 > Below is the copy of my email sent to you on Feb 27:
 >
 > I implemented scalable MatPtAP and did comparisons of three
 implementations using ex56.c on alcf cetus machine (this machine has small
 memory, 1GB/core):
 > - nonscalable PtAP: use an array of length PN to do dense axpy
 > - scalable PtAP:   do sparse axpy without use of PN array
 >
 > What PN means here?
 > Global number of columns of P.
 >
 > - hypre PtAP.
 >
 > The results are attached. Summary:
 > - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre
 PtAP
 > - scalable PtAP is 4x faster than hypre PtAP
 > - hypre uses less memory (see job.ne399.n63.np1000.sh)
 >
 > I was wondering how much more memory PETSc PtAP uses than hypre? I am
 implementing an AMG algorithm based on PETSc right now, and it is working
 well. But we have found a bottleneck with PtAP. For the same P and A, PETSc
 PtAP fails to generate a coarse matrix due to out of memory, while hypre
 can still generate the coarse matrix.
 >
 > I do not want to just use the HYPRE one because we had to duplicate
 matrices if I used HYPRE PtAP.
 >
 > It would be nice if you guys have already done some comparisons of
 the memory usage of these implementations.
 > Do you encounter memory issue with  scalable PtAP?
 >
 > By default do we use the scalable PtAP?? Do we have to specify some
 options to use the scalable version of PtAP?  If so, it would be nice to
 use the scalable version by default.  I am totally missing something here.
 >
 > Thanks,
 >
 > Fande
 >
 >
 > Karl had a student in the summer who improved MatPtAP(). Do you use
 the latest version of petsc?
 > HYPRE may use less memory than PETSc because it does not save and
 reuse the matrices.
 >
 > I do not understand why generating coarse matrix fails due to out of
 memory. Do you use direct solver at coarse grid?
 > Hong
 >
 > Based on above observation, I set the default PtAP algorithm as
 'nonscalable'.
 > When PN > local estimated nonzero of C=PtAP, then switch default to
 'scalable'.
 > User can overwrite default.
 >
 > For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
 > MatPtAP   3.6224e+01 (nonscalable for small mats,
 scalable for larger ones)
 > scalable MatPtAP 4.6129e+01
 > hypre 1.9389e+02
 >
 > This work is on petsc-master. Give it a try. If you encounter any
 problem, let me know.
 >
 > Hong
 >
 > On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
 > (Hong), what is the current state of optimizing RAP for scaling?
 >
 > Nate is driving 3D elasticity problems at scale with GAMG and we
 are working out performance problems. They are hitting problems at ~1.5B
 dof problems on a basic Cray (XC30 I think).
 >
 > Thanks,
 > Mark
 >



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their

Re: [petsc-users] GAMG scaling

2018-12-21 Thread Mark Adams via petsc-users
Also, you mentioned that you are using 10 levels. This is very strange with
GAMG. You can run with -info and grep on GAMG to see the sizes and the
number of non-zeros per level. You should coarsen at a rate of about 2^D to
3^D with GAMG (with 10 levels this would imply a very large fine grid
problem so I suspect there is something strange going on with coarsening).
Mark
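
To make the arithmetic explicit: in 3D, coarsening by 2^3 = 8x to 3^3 = 27x per
level means that even at the slow end, 10 levels (9 coarsening steps) span a
factor of roughly 8^9, about 1.3e8, between the coarsest and finest grids, so a
10-level hierarchy should only appear for an enormous fine-grid problem. A
sketch of the check suggested above (the executable name and process count are
placeholders):

  mpiexec -n 8 ./my_app -pc_type gamg -info | grep GAMG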

On Fri, Dec 21, 2018 at 11:36 AM Zhang, Hong via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Fande:
> I will explore it and get back to you.
> Does anyone know how to profile memory usage?
> Hong
>
> Thanks, Hong,
>>
>> I just briefly went through the code. I was wondering if it is possible
>> to destroy "c->ptap" (that caches a lot of intermediate data) to release
>> the memory after the coarse matrix is assembled. I understand you may still
>> want to reuse these data structures by default but for my simulation, the
>> preconditioner is fixed and there is no reason to keep the "c->ptap".
>>
>
>> It would be great, if we could have this optional functionality.
>>
>> Fande Kong,
>>
>> On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong  wrote:
>>
>>> We use nonscalable implementation as default, and switch to scalable for
>>> matrices over finer grids. You may use option '-matptap_via scalable' to
>>> force scalable PtAP  implementation for all PtAP. Let me know if it works.
>>> Hong
>>>
>>> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
>>> wrote:
>>>

   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically
 for "large" problems, which is determined by some heuristic.

Barry


 > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
 petsc-users@mcs.anl.gov> wrote:
 >
 >
 >
 > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
 wrote:
 > Fande:
 > Hong,
 > Thanks for your improvements on PtAP that is critical for MG-type
 algorithms.
 >
 > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
 > Mark,
 > Below is the copy of my email sent to you on Feb 27:
 >
 > I implemented scalable MatPtAP and did comparisons of three
 implementations using ex56.c on alcf cetus machine (this machine has small
 memory, 1GB/core):
 > - nonscalable PtAP: use an array of length PN to do dense axpy
 > - scalable PtAP:   do sparse axpy without use of PN array
 >
 > What PN means here?
 > Global number of columns of P.
 >
 > - hypre PtAP.
 >
 > The results are attached. Summary:
 > - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre
 PtAP
 > - scalable PtAP is 4x faster than hypre PtAP
 > - hypre uses less memory (see job.ne399.n63.np1000.sh)
 >
 > I was wondering how much more memory PETSc PtAP uses than hypre? I am
 implementing an AMG algorithm based on PETSc right now, and it is working
 well. But we have found a bottleneck with PtAP. For the same P and A, PETSc
 PtAP fails to generate a coarse matrix due to out of memory, while hypre
 can still generate the coarse matrix.
 >
 > I do not want to just use the HYPRE one because we had to duplicate
 matrices if I used HYPRE PtAP.
 >
 > It would be nice if you guys have already done some comparisons of
 the memory usage of these implementations.
 > Do you encounter memory issue with  scalable PtAP?
 >
 > By default do we use the scalable PtAP?? Do we have to specify some
 options to use the scalable version of PtAP?  If so, it would be nice to
 use the scalable version by default.  I am totally missing something here.
 >
 > Thanks,
 >
 > Fande
 >
 >
 > Karl had a student in the summer who improved MatPtAP(). Do you use
 the latest version of petsc?
 > HYPRE may use less memory than PETSc because it does not save and
 reuse the matrices.
 >
 > I do not understand why generating coarse matrix fails due to out of
 memory. Do you use direct solver at coarse grid?
 > Hong
 >
 > Based on above observation, I set the default PtAP algorithm as
 'nonscalable'.
 > When PN > local estimated nonzero of C=PtAP, then switch default to
 'scalable'.
 > User can overwrite default.
 >
 > For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
 > MatPtAP   3.6224e+01 (nonscalable for small mats,
 scalable for larger ones)
 > scalable MatPtAP 4.6129e+01
 > hypre 1.9389e+02
 >
 > This work is on petsc-master. Give it a try. If you encounter any
 problem, let me know.
 >
 > Hong
 >
 > On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
 > (Hong), what is the current state of optimizing RAP for scaling?
 >
 > Nate is driving 3D elasticity problems at scale with GAMG and we
 are working out performance problems. They are hitting problems at ~1.5B
 dof problems on a basic Cray 

Re: [petsc-users] GAMG scaling

2018-12-21 Thread Zhang, Hong via petsc-users
Fande:
I will explore it and get back to you.
Does anyone know how to profile memory usage?
Hong

Thanks, Hong,

I just briefly went through the code. I was wondering if it is possible to 
destroy "c->ptap" (that caches a lot of intermediate data) to release the 
memory after the coarse matrix is assembled. I understand you may still want to 
reuse these data structures by default but for my simulation, the 
preconditioner is fixed and there is no reason to keep the "c->ptap".

It would be great, if we could have this optional functionality.

Fande Kong,

On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:
We use nonscalable implementation as default, and switch to scalable for 
matrices over finer grids. You may use option '-matptap_via scalable' to force 
scalable PtAP  implementation for all PtAP. Let me know if it works.
Hong

On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically for 
"large" problems, which is determined by some heuristic.

   Barry


> On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> wrote:
>
>
>
> On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
> mailto:hzh...@mcs.anl.gov>> wrote:
> Fande:
> Hong,
> Thanks for your improvements on PtAP that is critical for MG-type algorithms.
>
> On Wed, May 3, 2017 at 10:17 AM Hong 
> mailto:hzh...@mcs.anl.gov>> wrote:
> Mark,
> Below is the copy of my email sent to you on Feb 27:
>
> I implemented scalable MatPtAP and did comparisons of three implementations 
> using ex56.c on alcf cetus machine (this machine has small memory, 1GB/core):
> - nonscalable PtAP: use an array of length PN to do dense axpy
> - scalable PtAP:   do sparse axpy without use of PN array
>
> What PN means here?
> Global number of columns of P.
>
> - hypre PtAP.
>
> The results are attached. Summary:
> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
> - scalable PtAP is 4x faster than hypre PtAP
> - hypre uses less memory (see 
> job.ne399.n63.np1000.sh)
>
> I was wondering how much more memory PETSc PtAP uses than hypre? I am 
> implementing an AMG algorithm based on PETSc right now, and it is working 
> well. But we have found a bottleneck with PtAP. For the same P and A, PETSc
> PtAP fails to generate a coarse matrix due to out of memory, while hypre
> can still generate the coarse matrix.
>
> I do not want to just use the HYPRE one because we had to duplicate matrices 
> if I used HYPRE PtAP.
>
> It would be nice if you guys have already done some comparisons of the
> memory usage of these implementations.
> Do you encounter memory issue with  scalable PtAP?
>
> By default do we use the scalable PtAP?? Do we have to specify some options 
> to use the scalable version of PtAP?  If so, it would be nice to use the 
> scalable version by default.  I am totally missing something here.
>
> Thanks,
>
> Fande
>
>
> Karl had a student in the summer who improved MatPtAP(). Do you use the 
> latest version of petsc?
> HYPRE may use less memory than PETSc because it does not save and reuse the 
> matrices.
>
> I do not understand why generating coarse matrix fails due to out of memory. 
> Do you use direct solver at coarse grid?
> Hong
>
> Based on above observation, I set the default PtAP algorithm as 'nonscalable'.
> When PN > local estimated nonzero of C=PtAP, then switch default to 
> 'scalable'.
> User can overwrite default.
>
> For the case of np=8000, ne=599 (see 
> job.ne599.n500.np8000.sh), I get
> MatPtAP   3.6224e+01 (nonscalable for small mats, scalable 
> for larger ones)
> scalable MatPtAP 4.6129e+01
> hypre 1.9389e+02
>
> This work is on petsc-master. Give it a try. If you encounter any problem,
> let me know.
>
> Hong
>
> On Wed, May 3, 2017 at 10:01 AM, Mark Adams 
> mailto:mfad...@lbl.gov>> wrote:
> (Hong), what is the current state of optimizing RAP for scaling?
>
> Nate is driving 3D elasticity problems at scale with GAMG and we are
> working out performance problems. They are hitting problems at ~1.5B dof 
> problems on a basic Cray (XC30 I think).
>
> Thanks,
> Mark
>



Re: [petsc-users] GAMG scaling

2018-12-20 Thread Fande Kong via petsc-users
Thanks, Hong,

I just briefly went through the code. I was wondering if it is possible to
destroy "c->ptap" (that caches a lot of intermediate data) to release the
memory after the coarse matrix is assembled. I understand you may still
want to reuse these data structures by default but for my simulation, the
preconditioner is fixed and there is no reason to keep the "c->ptap".

It would be great, if we could have this optional functionality.

Fande Kong,

On Thu, Dec 20, 2018 at 9:45 PM Zhang, Hong  wrote:

> We use nonscalable implementation as default, and switch to scalable for
> matrices over finer grids. You may use option '-matptap_via scalable' to
> force scalable PtAP  implementation for all PtAP. Let me know if it works.
> Hong
>
> On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
> wrote:
>
>>
>>   See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically for
>> "large" problems, which is determined by some heuristic.
>>
>>Barry
>>
>>
>> > On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>> >
>> >
>> >
>> > On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong  wrote:
>> > Fande:
>> > Hong,
>> > Thanks for your improvements on PtAP that is critical for MG-type
>> algorithms.
>> >
>> > On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
>> > Mark,
>> > Below is the copy of my email sent to you on Feb 27:
>> >
>> > I implemented scalable MatPtAP and did comparisons of three
>> implementations using ex56.c on alcf cetus machine (this machine has small
>> memory, 1GB/core):
>> > - nonscalable PtAP: use an array of length PN to do dense axpy
>> > - scalable PtAP:   do sparse axpy without use of PN array
>> >
>> > What PN means here?
>> > Global number of columns of P.
>> >
>> > - hypre PtAP.
>> >
>> > The results are attached. Summary:
>> > - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
>> > - scalable PtAP is 4x faster than hypre PtAP
>> > - hypre uses less memory (see job.ne399.n63.np1000.sh)
>> >
>> > I was wondering how much more memory PETSc PtAP uses than hypre? I am
>> implementing an AMG algorithm based on PETSc right now, and it is working
>> well. But we have found a bottleneck with PtAP. For the same P and A, PETSc
>> PtAP fails to generate a coarse matrix due to out of memory, while hypre
>> can still generate the coarse matrix.
>> >
>> > I do not want to just use the HYPRE one because we had to duplicate
>> matrices if I used HYPRE PtAP.
>> >
>> > It would be nice if you guys have already done some comparisons of
>> the memory usage of these implementations.
>> > Do you encounter memory issue with  scalable PtAP?
>> >
>> > By default do we use the scalable PtAP?? Do we have to specify some
>> options to use the scalable version of PtAP?  If so, it would be nice to
>> use the scalable version by default.  I am totally missing something here.
>> >
>> > Thanks,
>> >
>> > Fande
>> >
>> >
>> > Karl had a student in the summer who improved MatPtAP(). Do you use the
>> latest version of petsc?
>> > HYPRE may use less memory than PETSc because it does not save and reuse
>> the matrices.
>> >
>> > I do not understand why generating coarse matrix fails due to out of
>> memory. Do you use direct solver at coarse grid?
>> > Hong
>> >
>> > Based on above observation, I set the default PtAP algorithm as
>> 'nonscalable'.
>> > When PN > local estimated nonzero of C=PtAP, then switch default to
>> 'scalable'.
>> > User can overwrite default.
>> >
>> > For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
>> > MatPtAP   3.6224e+01 (nonscalable for small mats,
>> scalable for larger ones)
>> > scalable MatPtAP 4.6129e+01
>> > hypre 1.9389e+02
>> >
>> > This work is on petsc-master. Give it a try. If you encounter any
>> problem, let me know.
>> >
>> > Hong
>> >
>> > On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
>> > (Hong), what is the current state of optimizing RAP for scaling?
>> >
>> > Nate is driving 3D elasticity problems at scale with GAMG and we are
>> working out performance problems. They are hitting problems at ~1.5B dof
>> problems on a basic Cray (XC30 I think).
>> >
>> > Thanks,
>> > Mark
>> >
>>
>>


Re: [petsc-users] GAMG scaling

2018-12-20 Thread Zhang, Hong via petsc-users
We use nonscalable implementation as default, and switch to scalable for 
matrices over finer grids. You may use option '-matptap_via scalable' to force 
scalable PtAP  implementation for all PtAP. Let me know if it works.
Hong
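
For example (ex56 and the process count are placeholders taken from this
thread; '-matptap_via scalable' is the option described above, and '-log_view'
reports the MatPtAP timings):

  mpiexec -n 1000 ./ex56 <problem options> -matptap_via scalable -log_view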

On Thu, Dec 20, 2018 at 8:16 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically for 
"large" problems, which is determined by some heuristic.

   Barry


> On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> wrote:
>
>
>
> On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong 
> mailto:hzh...@mcs.anl.gov>> wrote:
> Fande:
> Hong,
> Thanks for your improvements on PtAP that is critical for MG-type algorithms.
>
> On Wed, May 3, 2017 at 10:17 AM Hong 
> mailto:hzh...@mcs.anl.gov>> wrote:
> Mark,
> Below is the copy of my email sent to you on Feb 27:
>
> I implemented scalable MatPtAP and did comparisons of three implementations 
> using ex56.c on alcf cetus machine (this machine has small memory, 1GB/core):
> - nonscalable PtAP: use an array of length PN to do dense axpy
> - scalable PtAP:   do sparse axpy without use of PN array
>
> What PN means here?
> Global number of columns of P.
>
> - hypre PtAP.
>
> The results are attached. Summary:
> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
> - scalable PtAP is 4x faster than hypre PtAP
> - hypre uses less memory (see 
> job.ne399.n63.np1000.sh)
>
> I was wondering how much more memory PETSc PtAP uses than hypre? I am 
> implementing an AMG algorithm based on PETSc right now, and it is working 
> well. But we have found a bottleneck with PtAP. For the same P and A, PETSc
> PtAP fails to generate a coarse matrix due to out of memory, while hypre
> can still generate the coarse matrix.
>
> I do not want to just use the HYPRE one because we had to duplicate matrices 
> if I used HYPRE PtAP.
>
> It would be nice if you guys have already done some comparisons of the
> memory usage of these implementations.
> Do you encounter memory issue with  scalable PtAP?
>
> By default do we use the scalable PtAP?? Do we have to specify some options 
> to use the scalable version of PtAP?  If so, it would be nice to use the 
> scalable version by default.  I am totally missing something here.
>
> Thanks,
>
> Fande
>
>
> Karl had a student in the summer who improved MatPtAP(). Do you use the 
> latest version of petsc?
> HYPRE may use less memory than PETSc because it does not save and reuse the 
> matrices.
>
> I do not understand why generating coarse matrix fails due to out of memory. 
> Do you use direct solver at coarse grid?
> Hong
>
> Based on above observation, I set the default PtAP algorithm as 'nonscalable'.
> When PN > local estimated nonzero of C=PtAP, then switch default to 
> 'scalable'.
> User can overwrite default.
>
> For the case of np=8000, ne=599 (see 
> job.ne599.n500.np8000.sh), I get
> MatPtAP   3.6224e+01 (nonscalable for small mats, scalable 
> for larger ones)
> scalable MatPtAP 4.6129e+01
> hypre 1.9389e+02
>
> This work is on petsc-master. Give it a try. If you encounter any problem,
> let me know.
>
> Hong
>
> On Wed, May 3, 2017 at 10:01 AM, Mark Adams 
> mailto:mfad...@lbl.gov>> wrote:
> (Hong), what is the current state of optimizing RAP for scaling?
>
> Nate is driving 3D elasticity problems at scale with GAMG and we are
> working out performance problems. They are hitting problems at ~1.5B dof 
> problems on a basic Cray (XC30 I think).
>
> Thanks,
> Mark
>



Re: [petsc-users] GAMG scaling

2018-12-20 Thread Smith, Barry F. via petsc-users


  See MatPtAP_MPIAIJ_MPIAIJ(). It switches to scalable automatically for 
"large" problems, which is determined by some heuristic.

   Barry


> On Dec 20, 2018, at 6:46 PM, Fande Kong via petsc-users 
>  wrote:
> 
> 
> 
> On Thu, Dec 20, 2018 at 4:43 PM Zhang, Hong  wrote:
> Fande:
> Hong,
> Thanks for your improvements on PtAP that is critical for MG-type algorithms. 
> 
> On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
> Mark,
> Below is the copy of my email sent to you on Feb 27:
> 
> I implemented scalable MatPtAP and did comparisons of three implementations 
> using ex56.c on alcf cetus machine (this machine has small memory, 1GB/core):
> - nonscalable PtAP: use an array of length PN to do dense axpy
> - scalable PtAP:   do sparse axpy without use of PN array
> 
> What does PN mean here?
> Global number of columns of P. 
> 
> - hypre PtAP.
> 
> The results are attached. Summary:
> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
> - scalable PtAP is 4x faster than hypre PtAP
> - hypre uses less memory (see job.ne399.n63.np1000.sh)
> 
> I was wondering how much more memory PETSc PtAP uses than hypre. I am
> implementing an AMG algorithm based on PETSc right now, and it is working
> well, but we found a bottleneck with PtAP. For the same P and A, PETSc
> PtAP fails to generate a coarse matrix because it runs out of memory, while
> hypre can still generate the coarse matrix.
>
> I do not want to just use the HYPRE one, because we would have to duplicate
> matrices if we used the HYPRE PtAP.
>
> It would be nice if you have already done some comparisons of the memory
> usage of these implementations.
> Do you encounter memory issues with scalable PtAP?
>
> By default, do we use the scalable PtAP? Do we have to specify some options
> to use the scalable version of PtAP? If so, it would be nice to use the
> scalable version by default. I am totally missing something here.
> 
> Thanks,
> 
> Fande
> 
>  
> Karl had a student in the summer who improved MatPtAP(). Do you use the 
> latest version of petsc?
> HYPRE may use less memory than PETSc because it does not save and reuse the 
> matrices.
> 
> I do not understand why generating the coarse matrix fails due to running
> out of memory. Do you use a direct solver on the coarse grid?
> Hong
> 
> Based on the above observation, I set the default PtAP algorithm to
> 'nonscalable'.
> When PN > the local estimated number of nonzeros of C=PtAP, the default
> switches to 'scalable'.
> The user can override the default.
> 
> For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
> MatPtAP   3.6224e+01 (nonscalable for small mats, scalable 
> for larger ones)
> scalable MatPtAP 4.6129e+01
> hypre            1.9389e+02
>
> This work is in petsc-master. Give it a try. If you encounter any problem,
> let me know.
> 
> Hong
> 
> On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
> (Hong), what is the current state of optimizing RAP for scaling?
> 
> Nate is driving 3D elasticity problems at scale with GAMG, and we are
> working out performance problems. They are hitting problems at ~1.5B dof
> on a basic Cray (XC30, I think).
> 
> Thanks,
> Mark
> 



Re: [petsc-users] GAMG scaling

2018-12-20 Thread Smith, Barry F. via petsc-users



> On Dec 20, 2018, at 5:51 PM, Zhang, Hong via petsc-users 
>  wrote:
> 
> Fande:
> Hong,
> Thanks for your improvements on PtAP, which is critical for MG-type algorithms.
> 
> On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
> Mark,
> Below is the copy of my email sent to you on Feb 27:
> 
> I implemented scalable MatPtAP and did comparisons of three implementations 
> using ex56.c on alcf cetus machine (this machine has small memory, 1GB/core):
> - nonscalable PtAP: use an array of length PN to do dense axpy
> - scalable PtAP:   do sparse axpy without use of PN array
> 
> What does PN mean here?
> Global number of columns of P. 
> 
> - hypre PtAP.
> 
> The results are attached. Summary:
> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
> - scalable PtAP is 4x faster than hypre PtAP
> - hypre uses less memory (see job.ne399.n63.np1000.sh)
> 
> I was wondering how much more memory PETSc PtAP uses than hypre. I am
> implementing an AMG algorithm based on PETSc right now, and it is working
> well, but we found a bottleneck with PtAP. For the same P and A, PETSc
> PtAP fails to generate a coarse matrix because it runs out of memory, while
> hypre can still generate the coarse matrix.
>
> I do not want to just use the HYPRE one, because we would have to duplicate
> matrices if we used the HYPRE PtAP.
>
> It would be nice if you have already done some comparisons of the memory
> usage of these implementations.
> Do you encounter memory issues with scalable PtAP? Karl had a student in the
> summer who improved MatPtAP(). Do you use the latest version of petsc?
> HYPRE may use less memory than PETSc because it does not save and reuse the 
> matrices.

   Could PETSc have an option where it does not save and reuse the matrices, and
thus require less memory at the cost of more compute time for repeated setups?
How much memory would it save: 20%, 50%?

   Barry
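
For reference, a hedged sketch of that tradeoff at the level of the public
MatPtAP() interface (the GAMG-internal code path is more involved; the function
below is illustrative only and assumes the coarse operator C is rebuilt on
every setup):

  #include <petscmat.h>

  /* Keeping C and passing MAT_REUSE_MATRIX preserves the stored symbolic and
     numeric product (more memory, faster repeated setups).  Destroying C and
     recomputing with MAT_INITIAL_MATRIX is the low-memory, slower alternative
     asked about above. */
  PetscErrorCode CoarsenEachSetup(Mat A, Mat P, PetscBool keep_product, Mat *C)
  {
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    if (keep_product && *C) {
      ierr = MatPtAP(A, P, MAT_REUSE_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
    } else {
      if (*C) {ierr = MatDestroy(C);CHKERRQ(ierr);}   /* drop the old product */
      ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);
    }
    PetscFunctionReturn(0);
  }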

> 
> I do not understand why generating the coarse matrix fails due to running
> out of memory. Do you use a direct solver on the coarse grid?
> Hong
> 
> Based on the above observation, I set the default PtAP algorithm to
> 'nonscalable'.
> When PN > the local estimated number of nonzeros of C=PtAP, the default
> switches to 'scalable'.
> The user can override the default.
> 
> For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
> MatPtAP   3.6224e+01 (nonscalable for small mats, scalable 
> for larger ones)
> scalable MatPtAP 4.6129e+01
> hypre            1.9389e+02
>
> This work is in petsc-master. Give it a try. If you encounter any problem,
> let me know.
> 
> Hong
> 
> On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
> (Hong), what is the current state of optimizing RAP for scaling?
> 
> Nate is driving 3D elasticity problems at scale with GAMG, and we are
> working out performance problems. They are hitting problems at ~1.5B dof
> on a basic Cray (XC30, I think).
> 
> Thanks,
> Mark
> 



Re: [petsc-users] GAMG scaling

2018-12-20 Thread Zhang, Hong via petsc-users
Fande:
Hong,
Thanks for your improvements on PtAP, which is critical for MG-type algorithms.

On Wed, May 3, 2017 at 10:17 AM Hong  wrote:
Mark,
Below is the copy of my email sent to you on Feb 27:

I implemented scalable MatPtAP and did comparisons of three implementations 
using ex56.c on alcf cetus machine (this machine has small memory, 1GB/core):
- nonscalable PtAP: use an array of length PN to do dense axpy
- scalable PtAP:   do sparse axpy without use of PN array

What does PN mean here?
Global number of columns of P.

- hypre PtAP.

The results are attached. Summary:
- nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
- scalable PtAP is 4x faster than hypre PtAP
- hypre uses less memory (see 
job.ne399.n63.np1000.sh)
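
A minimal, self-contained C sketch of the difference between the two
accumulation strategies above (illustrative toy code only, not the PETSc
implementation; PN and every name here are made up for the example):

  #include <stdio.h>
  #include <stdlib.h>

  #define PN 1000000   /* stands in for the global number of columns of P */

  int main(void)
  {
    int    cols[3] = {7, 512, 999999};   /* columns of one sparse row */
    double vals[3] = {1.0, 2.0, 3.0};
    double alpha   = 0.5;
    int    ccols[16];
    double cvals[16];
    int    k, j, cnz = 0, found;

    /* Nonscalable flavor: dense axpy into a work array of length PN.
       Simple indexing, but the memory per process grows with the coarse size. */
    double *dense = calloc(PN, sizeof(double));
    for (k = 0; k < 3; k++) dense[cols[k]] += alpha * vals[k];

    /* Scalable flavor: sparse axpy into a short (column, value) list.
       More work per entry, but memory scales with the local nonzeros. */
    for (k = 0; k < 3; k++) {
      for (j = 0, found = 0; j < cnz; j++)
        if (ccols[j] == cols[k]) { cvals[j] += alpha * vals[k]; found = 1; break; }
      if (!found) { ccols[cnz] = cols[k]; cvals[cnz] = alpha * vals[k]; cnz++; }
    }

    printf("dense[999999] = %g, sparse list holds %d entries\n", dense[999999], cnz);
    free(dense);
    return 0;
  }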

I was wondering how much more memory PETSc PtAP uses than hypre. I am
implementing an AMG algorithm based on PETSc right now, and it is working well,
but we found a bottleneck with PtAP. For the same P and A, PETSc PtAP fails
to generate a coarse matrix because it runs out of memory, while hypre can still
generate the coarse matrix.

I do not want to just use the HYPRE one, because we would have to duplicate
matrices if we used the HYPRE PtAP.

It would be nice if you have already done some comparisons of the memory usage
of these implementations.
Do you encounter memory issues with scalable PtAP? Karl had a student in the
summer who improved MatPtAP(). Do you use the latest version of petsc?
HYPRE may use less memory than PETSc because it does not save and reuse the 
matrices.

I do not understand why generating the coarse matrix fails due to running out
of memory. Do you use a direct solver on the coarse grid?
Hong

Based on the above observation, I set the default PtAP algorithm to 'nonscalable'.
When PN > the local estimated number of nonzeros of C=PtAP, the default switches
to 'scalable'.
The user can override the default.
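
A hedged sketch of that default selection (the real logic lives inside PETSc's
MatPtAP_MPIAIJ_MPIAIJ() and may differ in detail; the function and numbers
below are illustrative only):

  #include <stdio.h>

  /* Pick a PtAP algorithm name from PN (global columns of P) and an estimate
     of the local number of nonzeros of C = PtAP. */
  static const char *choose_ptap_algorithm(long long PN, long long local_nnz_C_est)
  {
    /* A dense work array of length PN is affordable while PN stays comparable
       to the local work; beyond that, switch to the sparse-axpy variant. */
    return (PN > local_nnz_C_est) ? "scalable" : "nonscalable";
  }

  int main(void)
  {
    printf("%s\n", choose_ptap_algorithm(1000000, 50000)); /* -> scalable    */
    printf("%s\n", choose_ptap_algorithm(10000, 50000));   /* -> nonscalable */
    return 0;
  }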

For the case of np=8000, ne=599 (see 
job.ne599.n500.np8000.sh), I get
MatPtAP   3.6224e+01 (nonscalable for small mats, scalable for 
larger ones)
scalable MatPtAP 4.6129e+01
hypre            1.9389e+02

This work is in petsc-master. Give it a try. If you encounter any problem, let
me know.

Hong

On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
(Hong), what is the current state of optimizing RAP for scaling?

Nate is driving 3D elasticity problems at scale with GAMG, and we are working
out performance problems. They are hitting problems at ~1.5B dof on a basic
Cray (XC30, I think).

Thanks,
Mark



Re: [petsc-users] GAMG scaling

2017-05-04 Thread Hong
Mark,
Fixed
https://bitbucket.org/petsc/petsc/commits/68eacb73b84ae7f3fd7363217d47f23a8f967155

Run ex56 gives
mpiexec -n 8 ./ex56 -ne 13 ... -h |grep via
  -mattransposematmult_via  Algorithmic approach (choose one of)
      scalable nonscalable matmatmult (MatTransposeMatMult)
  -matmatmult_via  Algorithmic approach (choose one of)
      scalable nonscalable hypre (MatMatMult)
  -matptap_via  Algorithmic approach (choose one of)
      scalable nonscalable hypre (MatPtAP)
...

I'll merge it to master after regression tests.

Hong
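
For anyone who prefers to set these from code rather than on the command line,
a hedged sketch (assumes a PETSc version with the options-object argument,
i.e. 3.7 or later; the option name is the one shown in the -help output above):

  #include <petscsys.h>

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* Same effect as passing -matptap_via scalable on the command line; the
       choice is read from the options database when the PtAP product is formed. */
    ierr = PetscOptionsSetValue(NULL, "-matptap_via", "scalable");CHKERRQ(ierr);
    /* ... set up the GAMG/KSP solve as usual (e.g. as in ex56.c) ... */
    ierr = PetscFinalize();
    return ierr;
  }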

On Thu, May 4, 2017 at 10:33 AM, Hong  wrote:

> Mark:
>>
>> I am not seeing these options with -help ...
>>
> Hmm, this might be a bug - I'll check it.
> Hong
>
>
>>
>> On Wed, May 3, 2017 at 10:05 PM, Hong  wrote:
>>
>>> I basically used 'runex56' and set '-ne' to be compatible with np.
>>> Then I used option
>>> '-matptap_via scalable'
>>> '-matptap_via hypre'
>>> '-matptap_via nonscalable'
>>>
>>> I attached a job script below.
>>>
>>> In the master branch, I set the default to 'nonscalable' for small to
>>> medium size matrices, and automatically switch to 'scalable' when the
>>> matrix size gets larger.
>>>
>>> The PETSc solver uses MatPtAP, which does local RAP to reduce communication
>>> and accelerate computation.
>>> I suggest you simply use default setting. Let me know if you encounter
>>> trouble.
>>>
>>> Hong
>>>
>>> job.ne174.n8.np125.sh:
>>> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56
>>> -ne 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
>>> -pc_gamg_reuse_interpolation true -ksp_converged_reason
>>> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
>>> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
>>> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
>>> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
>>> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
>>> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
>>> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
>>> -pc_gamg_repartition false -pc_mg_cycle_type v
>>> -pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
>>> -mg_coarse_ksp_type cg -ksp_monitor -log_view -matptap_via scalable >
>>> log.ne174.n8.np125.scalable
>>>
>>> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56
>>> -ne 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
>>> -pc_gamg_reuse_interpolation true -ksp_converged_reason
>>> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
>>> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
>>> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
>>> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
>>> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
>>> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
>>> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
>>> -pc_gamg_repartition false -pc_mg_cycle_type v
>>> -pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
>>> -mg_coarse_ksp_type cg -ksp_monitor -log_view -matptap_via hypre >
>>> log.ne174.n8.np125.hypre
>>>
>>> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56
>>> -ne 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
>>> -pc_gamg_reuse_interpolation true -ksp_converged_reason
>>> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
>>> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
>>> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
>>> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
>>> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
>>> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
>>> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
>>> -pc_gamg_repartition false -pc_mg_cycle_type v
>>> -pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
>>> -mg_coarse_ksp_type cg -ksp_monitor -log_view -matptap_via nonscalable >
>>> log.ne174.n8.np125.nonscalable
>>>
>>> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56
>>> -ne 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
>>> -pc_gamg_reuse_interpolation true -ksp_converged_reason
>>> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
>>> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
>>> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
>>> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
>>> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
>>> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
>>> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
>>> -pc_gamg_repartition false -pc_mg_cycle_type v
>>> -pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
>>> -mg_coarse_ksp_type cg -ksp_monitor -log_view > log.ne174.n8.np125
>>>
>>> On Wed, May 3, 2017 at 2:08 PM, Mark Adams  wrote:
>>>

Re: [petsc-users] GAMG scaling

2017-05-04 Thread Mark Adams
Thanks Hong,

I am not seeing these options with -help ...

On Wed, May 3, 2017 at 10:05 PM, Hong  wrote:

> I basically used 'runex56' and set '-ne' to be compatible with np.
> Then I used option
> '-matptap_via scalable'
> '-matptap_via hypre'
> '-matptap_via nonscalable'
>
> I attached a job script below.
>
> In the master branch, I set the default to 'nonscalable' for small to medium
> size matrices, and automatically switch to 'scalable' when the matrix size
> gets larger.
>
> The PETSc solver uses MatPtAP, which does local RAP to reduce communication
> and accelerate computation.
> I suggest you simply use default setting. Let me know if you encounter
> trouble.
>
> Hong
>
> job.ne174.n8.np125.sh:
> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
> 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
> -pc_gamg_reuse_interpolation true -ksp_converged_reason
> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
> -pc_gamg_repartition false -pc_mg_cycle_type v 
> -pc_gamg_use_parallel_coarse_grid_solver
> -mg_coarse_pc_type jacobi -mg_coarse_ksp_type cg -ksp_monitor -log_view
> -matptap_via scalable > log.ne174.n8.np125.scalable
>
> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
> 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
> -pc_gamg_reuse_interpolation true -ksp_converged_reason
> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
> -pc_gamg_repartition false -pc_mg_cycle_type v 
> -pc_gamg_use_parallel_coarse_grid_solver
> -mg_coarse_pc_type jacobi -mg_coarse_ksp_type cg -ksp_monitor -log_view
> -matptap_via hypre > log.ne174.n8.np125.hypre
>
> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
> 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
> -pc_gamg_reuse_interpolation true -ksp_converged_reason
> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
> -pc_gamg_repartition false -pc_mg_cycle_type v 
> -pc_gamg_use_parallel_coarse_grid_solver
> -mg_coarse_pc_type jacobi -mg_coarse_ksp_type cg -ksp_monitor -log_view
> -matptap_via nonscalable > log.ne174.n8.np125.nonscalable
>
> runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
> 174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
> -pc_gamg_reuse_interpolation true -ksp_converged_reason
> -use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
> -mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
> -mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
> -mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
> -gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
> -mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
> -pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
> -pc_gamg_repartition false -pc_mg_cycle_type v 
> -pc_gamg_use_parallel_coarse_grid_solver
> -mg_coarse_pc_type jacobi -mg_coarse_ksp_type cg -ksp_monitor -log_view >
> log.ne174.n8.np125
>
> On Wed, May 3, 2017 at 2:08 PM, Mark Adams  wrote:
>
>> Hong, the input files do not seem to be accessible. What are the command
>> line options? (I don't see a "rap" or "scale" in the source.)
>>
>>
>>
>> On Wed, May 3, 2017 at 12:17 PM, Hong  wrote:
>>
>>> Mark,
>>> Below is the copy of my email sent to you on Feb 27:
>>>
>>> I implemented scalable MatPtAP and did comparisons of three
>>> implementations using ex56.c on alcf cetus machine (this machine has
>>> small memory, 1GB/core):
>>> - nonscalable PtAP: use an array of length PN to do dense axpy
>>> - scalable PtAP:   do sparse axpy without use of PN array
>>> - hypre PtAP.
>>>
>>> The results are attached. Summary:
>>> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
>>> - scalable PtAP is 4x faster than hypre PtAP
>>> - hypre uses less memory 

Re: [petsc-users] GAMG scaling

2017-05-03 Thread Hong
I basically used 'runex56' and set '-ne' to be compatible with np.
Then I used option
'-matptap_via scalable'
'-matptap_via hypre'
'-matptap_via nonscalable'

I attached a job script below.

In the master branch, I set the default to 'nonscalable' for small to medium
size matrices, and automatically switch to 'scalable' when the matrix size gets
larger.

The PETSc solver uses MatPtAP, which does local RAP to reduce communication
and accelerate computation.
I suggest you simply use default setting. Let me know if you encounter
trouble.

Hong

job.ne174.n8.np125.sh:
runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
-pc_gamg_reuse_interpolation true -ksp_converged_reason
-use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
-mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
-mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
-mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
-gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
-mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
-pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
-pc_gamg_repartition false -pc_mg_cycle_type v
-pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
-mg_coarse_ksp_type cg -ksp_monitor -log_view -matptap_via scalable >
log.ne174.n8.np125.scalable

runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
-pc_gamg_reuse_interpolation true -ksp_converged_reason
-use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
-mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
-mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
-mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
-gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
-mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
-pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
-pc_gamg_repartition false -pc_mg_cycle_type v
-pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
-mg_coarse_ksp_type cg -ksp_monitor -log_view -matptap_via hypre >
log.ne174.n8.np125.hypre

runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
-pc_gamg_reuse_interpolation true -ksp_converged_reason
-use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
-mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
-mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
-mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
-gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
-mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
-pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
-pc_gamg_repartition false -pc_mg_cycle_type v
-pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
-mg_coarse_ksp_type cg -ksp_monitor -log_view -matptap_via nonscalable >
log.ne174.n8.np125.nonscalable

runjob --np 125 -p 16 --block $COBALT_PARTNAME --verbose=INFO : ./ex56 -ne
174 -alpha 1.e-3 -ksp_type cg -pc_type gamg -pc_gamg_agg_nsmooths 1
-pc_gamg_reuse_interpolation true -ksp_converged_reason
-use_mat_nearnullspace -mg_levels_esteig_ksp_type cg
-mg_levels_esteig_ksp_max_it 10 -pc_gamg_square_graph 1
-mg_levels_ksp_max_it 1 -mg_levels_ksp_type chebyshev
-mg_levels_ksp_chebyshev_esteig 0,0.2,0,1.05 -gamg_est_ksp_type cg
-gamg_est_ksp_max_it 10 -pc_gamg_asm_use_agg true -mg_levels_sub_pc_type lu
-mg_levels_pc_asm_overlap 0 -pc_gamg_threshold -0.01
-pc_gamg_coarse_eq_limit 200 -pc_gamg_process_eq_limit 30
-pc_gamg_repartition false -pc_mg_cycle_type v
-pc_gamg_use_parallel_coarse_grid_solver -mg_coarse_pc_type jacobi
-mg_coarse_ksp_type cg -ksp_monitor -log_view > log.ne174.n8.np125

On Wed, May 3, 2017 at 2:08 PM, Mark Adams  wrote:

> Hong, the input files do not seem to be accessible. What are the command
> line options? (I don't see a "rap" or "scale" in the source.)
>
>
>
> On Wed, May 3, 2017 at 12:17 PM, Hong  wrote:
>
>> Mark,
>> Below is the copy of my email sent to you on Feb 27:
>>
>> I implemented scalable MatPtAP and did comparisons of three
>> implementations using ex56.c on alcf cetus machine (this machine has
>> small memory, 1GB/core):
>> - nonscalable PtAP: use an array of length PN to do dense axpy
>> - scalable PtAP:   do sparse axpy without use of PN array
>> - hypre PtAP.
>>
>> The results are attached. Summary:
>> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
>> - scalable PtAP is 4x faster than hypre PtAP
>> - hypre uses less memory (see job.ne399.n63.np1000.sh)
>>
>> Based on the above observation, I set the default PtAP algorithm to
>> 'nonscalable'.
>> When PN > the local estimated number of nonzeros of C=PtAP, the default
>> switches to 'scalable'.
>> The user can override the default.
>>
>> For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I 

Re: [petsc-users] GAMG scaling

2017-05-03 Thread Mark Adams
Hong, the input files do not seem to be accessible. What are the command
line options? (I don't see a "rap" or "scale" in the source.)



On Wed, May 3, 2017 at 12:17 PM, Hong  wrote:

> Mark,
> Below is the copy of my email sent to you on Feb 27:
>
> I implemented scalable MatPtAP and did comparisons of three
> implementations using ex56.c on alcf cetus machine (this machine has
> small memory, 1GB/core):
> - nonscalable PtAP: use an array of length PN to do dense axpy
> - scalable PtAP:   do sparse axpy without use of PN array
> - hypre PtAP.
>
> The results are attached. Summary:
> - nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
> - scalable PtAP is 4x faster than hypre PtAP
> - hypre uses less memory (see job.ne399.n63.np1000.sh)
>
> Based on the above observation, I set the default PtAP algorithm to
> 'nonscalable'.
> When PN > the local estimated number of nonzeros of C=PtAP, the default
> switches to 'scalable'.
> The user can override the default.
>
> For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
> MatPtAP   3.6224e+01 (nonscalable for small mats, scalable
> for larger ones)
> scalable MatPtAP 4.6129e+01
> hypre            1.9389e+02
>
> This work is in petsc-master. Give it a try. If you encounter any problem,
> let me know.
>
> Hong
>
> On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:
>
>> (Hong), what is the current state of optimizing RAP for scaling?
>>
>> Nate is driving 3D elasticity problems at scale with GAMG, and we are
>> working out performance problems. They are hitting problems at ~1.5B dof
>> on a basic Cray (XC30, I think).
>>
>> Thanks,
>> Mark
>>
>
>


Re: [petsc-users] GAMG scaling

2017-05-03 Thread Hong
Mark,
Below is the copy of my email sent to you on Feb 27:

I implemented scalable MatPtAP and did comparisons of three implementations
using ex56.c on alcf cetus machine (this machine has small memory,
1GB/core):
- nonscalable PtAP: use an array of length PN to do dense axpy
- scalable PtAP:   do sparse axpy without use of PN array
- hypre PtAP.

The results are attached. Summary:
- nonscalable PtAP is 2x faster than scalable, 8x faster than hypre PtAP
- scalable PtAP is 4x faster than hypre PtAP
- hypre uses less memory (see job.ne399.n63.np1000.sh)

Based on the above observation, I set the default PtAP algorithm to
'nonscalable'.
When PN > the local estimated number of nonzeros of C=PtAP, the default
switches to 'scalable'.
The user can override the default.

For the case of np=8000, ne=599 (see job.ne599.n500.np8000.sh), I get
MatPtAP   3.6224e+01 (nonscalable for small mats, scalable
for larger ones)
scalable MatPtAP 4.6129e+01
hypre            1.9389e+02

This work is in petsc-master. Give it a try. If you encounter any problem,
let me know.

Hong

On Wed, May 3, 2017 at 10:01 AM, Mark Adams  wrote:

> (Hong), what is the current state of optimizing RAP for scaling?
>
> Nate is driving 3D elasticity problems at scale with GAMG, and we are
> working out performance problems. They are hitting problems at ~1.5B dof
> on a basic Cray (XC30, I think).
>
> Thanks,
> Mark
>


out_ex56_cetus_short
Description: Binary data


[petsc-users] GAMG scaling

2017-05-03 Thread Mark Adams
(Hong), what is the current state of optimizing RAP for scaling?

Nate is driving 3D elasticity problems at scale with GAMG, and we are
working out performance problems. They are hitting problems at ~1.5B dof
on a basic Cray (XC30, I think).

Thanks,
Mark