Re: [petsc-users] GAMG scaling

2018-12-24 Thread Mark Adams via petsc-users
On Tue, Dec 25, 2018 at 12:10 AM Jed Brown  wrote:

> Mark Adams  writes:
>
> > On Mon, Dec 24, 2018 at 4:56 PM Jed Brown  wrote:
> >
> >> Mark Adams via petsc-users  writes:
> >>
> >> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private
> >> > in attached, NB, this is code that I wrote in grad school). It is
> >> > memory efficient and simple, just four nested loops i,j,I,J: C(I,J) =
> >> > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data
> >> > that I am getting from my bone modeling colleagues, who use this old
> >> > code on Stampede now, the times look reasonable compared to GAMG.
> >> > This is optimized for elasticity, where I unroll loops (so it is
> >> > really six nested loops).
> >>
> >> Is the A above meant to include some ghosted rows?
> >>
> >
> > You could, but I was thinking of having i in the outer loop. In C(I,J) =
> > P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only run over the
> > local rows of A (and the left term P).
>
> Okay, so you need to gather those rows of P referenced by the
> off-diagonal parts of A.


Yes, and this looks correct.
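
For concreteness, a minimal serial sketch of that gather step (illustrative,
not from the thread; scipy.sparse and every name below are assumptions) that
finds the rows of P referenced by the off-diagonal part of A, i.e. the rows
a process would have to fetch from its neighbors:

    import numpy as np
    import scipy.sparse as sp

    def ghost_rows_of_P(A_csr: sp.csr_matrix, row_start: int, row_end: int):
        """Row indices of P referenced by columns of A that fall outside
        the local row range [row_start, row_end).

        In C(I,J) = P(i,I)*A(i,j)*P(j,J), each nonzero column j of A
        selects row j of P; columns outside the local range mark the
        "ghost" rows of P that must be gathered first."""
        cols = np.unique(A_csr.indices)  # all referenced column indices j
        return cols[(cols < row_start) | (cols >= row_end)]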


> Once you have them, do
>
>   for i:
>     v[:] = 0  # sparse vector
>     for j:
>       v[:] += A[i,j] * P[j,:]
>     for I:
>       C[I,:] += P[i,I] * v[:]
>
> One inefficiency is that you don't actually get "hits" on all the
> entries of C[I,:], but that much remains no matter how you reorder loops
> (unless you make I the outermost).
>
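
A runnable serial version of this loop ordering (a sketch under the
assumption that any needed ghost rows of P have already been gathered;
scipy.sparse, the dict-based sparse vectors, and the function name are all
illustrative):

    import scipy.sparse as sp

    def rap_rowwise(A: sp.csr_matrix, P: sp.csr_matrix) -> dict:
        """Form C = P^T * A * P with i outermost:
        C(I,J) += P(i,I) * A(i,j) * P(j,J)."""
        C = {}  # (I, J) -> value, accumulated incrementally
        for i in range(A.shape[0]):
            # v[:] += A[i,j] * P[j,:]  (a sparse AXPY per nonzero A[i,j])
            v = {}  # sparse row vector, J -> value
            for jj in range(A.indptr[i], A.indptr[i + 1]):
                j, a_ij = A.indices[jj], A.data[jj]
                for kk in range(P.indptr[j], P.indptr[j + 1]):
                    J = P.indices[kk]
                    v[J] = v.get(J, 0.0) + a_ij * P.data[kk]
            # C[I,:] += P(i,I) * v[:] for each nonzero P(i,I)
            for kk in range(P.indptr[i], P.indptr[i + 1]):
                I, p_iI = P.indices[kk], P.data[kk]
                for J, vJ in v.items():
                    C[(I, J)] = C.get((I, J), 0.0) + p_iI * vJ
        return C

As a sanity check, the accumulated entries should match
(P.T @ A @ P).todok() up to roundoff.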

> >> > In thinking about this now, I think you want to make a local copy
> >> > of P with rows (j) for every column in A that you have locally, then
> >> > transpose this local thing for the P(j,J) term. A sparse AXPY on j.
> >> > (My code uses a special tree data structure, but a matrix is simpler.)
> >>
> >> Why transpose for P(j,J)?
> >>
> >
> > (Premature) optimization. I was thinking of 'j' being in the inner loop
> > and doing a sparse inner product, but now that I think about it there
> > are other options.
>
> Sparse inner products tend to be quite inefficient.  Explicit blocking
> helps some, but I would try to avoid it.
>

Yea, the design space here is non-trivial.
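
To make the trade-off concrete, an illustrative sketch (assumed, not from
the thread) of the two kernels: a sparse AXPY streams once through one
operand's nonzeros and merges them into an accumulator, while a sparse
inner product has to intersect two index sets, which is why it tends to be
inefficient:

    def sparse_axpy(alpha: float, x: dict, y: dict) -> None:
        """y += alpha * x, with x and y as index -> value dicts.
        Each stored entry of x is touched exactly once; no searching."""
        for k, xv in x.items():
            y[k] = y.get(k, 0.0) + alpha * xv

    def sparse_dot(x: dict, y: dict) -> float:
        """Inner product of two index -> value dicts. Every term needs a
        membership test, and most miss when the patterns barely overlap."""
        if len(y) < len(x):
            x, y = y, x  # probe the larger dict with the smaller one
        return sum(xv * y[k] for k, xv in x.items() if k in y)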

BTW, I have a Cal ME grad student I've been working with on getting my old
parallel FE / Prometheus code running on Stampede for his bone modeling
problems. He started from zero in HPC, but he is eager and has been picking
it up. If there is interest, we could get performance data with the
existing code as a benchmark, and we could generate matrices if anyone
wants to look into this.

