Re: [petsc-users] GAMG scaling
On Tue, Dec 25, 2018 at 12:10 AM Jed Brown wrote: > Mark Adams writes: > > > On Mon, Dec 24, 2018 at 4:56 PM Jed Brown wrote: > > > >> Mark Adams via petsc-users writes: > >> > >> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private > in > >> > attached, NB, this is code that I wrote in grad school). It is memory > >> > efficient and simple, just four nested loops i,j,I,J: C(I,J) = > >> > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data > that I > >> am > >> > getting from my bone modeling colleagues, that use this old code on > >> > Stampede now, the times look reasonable compared to GAMG. This is > >> optimized > >> > for elasticity, where I unroll loops (so it is really six nested > loops). > >> > >> Is the A above meant to include some ghosted rows? > >> > > > > You could but I was thinking of having i in the outer loop. In C(I,J) = > > P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only be the local rows > of > > A (and the left term P). > > Okay, so you need to gather those rows of P referenced by the > off-diagonal parts of A. yes, and this looks correct .. > Once you have them, do > > for i: > v[:] = 0 # sparse vector > for j: > v[:] += A[i,j] * P[j,:] > for I: > C[I,:] += P[i,I] * v[:] > > One inefficiency is that you don't actually get "hits" on all the > entries of C[I,:], but that much remains no matter how you reorder loops > (unless you make I the outermost). > > >> > In thinking about this now, I think you want to make a local copy of P > >> with > >> > rows (j) for every column in A that you have locally, then transpose > this > >> > local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a > >> > special tree data structure but a matrix is simpler.) > >> > >> Why transpose for P(j,J)? > >> > > > > (premature) optimization. I was thinking 'j' being in the inner loop and > > doing sparse inner product, but now that I think about it there are other > > options. > > Sparse inner products tend to be quite inefficient. Explicit blocking > helps some, but I would try to avoid it. > Yea, the design space here is non-trivial. BTW, I have a Cal ME grad student that I've been working with on getting my old parallel FE / Prometheus code running on Stampede for his bone modeling problems. He started from zero in HPC but he is eager and has been picking it up. If there is interest we could get performance data with the existing code, as a benchmark, and we could generate matrices, if anyone wants to look into this.
Re: [petsc-users] GAMG scaling
Mark Adams writes: > On Mon, Dec 24, 2018 at 4:56 PM Jed Brown wrote: > >> Mark Adams via petsc-users writes: >> >> > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in >> > attached, NB, this is code that I wrote in grad school). It is memory >> > efficient and simple, just four nested loops i,j,I,J: C(I,J) = >> > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data that I >> am >> > getting from my bone modeling colleagues, that use this old code on >> > Stampede now, the times look reasonable compared to GAMG. This is >> optimized >> > for elasticity, where I unroll loops (so it is really six nested loops). >> >> Is the A above meant to include some ghosted rows? >> > > You could but I was thinking of having i in the outer loop. In C(I,J) = > P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only be the local rows of > A (and the left term P). Okay, so you need to gather those rows of P referenced by the off-diagonal parts of A. Once you have them, do for i: v[:] = 0 # sparse vector for j: v[:] += A[i,j] * P[j,:] for I: C[I,:] += P[i,I] * v[:] One inefficiency is that you don't actually get "hits" on all the entries of C[I,:], but that much remains no matter how you reorder loops (unless you make I the outermost). >> > In thinking about this now, I think you want to make a local copy of P >> with >> > rows (j) for every column in A that you have locally, then transpose this >> > local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a >> > special tree data structure but a matrix is simpler.) >> >> Why transpose for P(j,J)? >> > > (premature) optimization. I was thinking 'j' being in the inner loop and > doing sparse inner product, but now that I think about it there are other > options. Sparse inner products tend to be quite inefficient. Explicit blocking helps some, but I would try to avoid it.
Re: [petsc-users] GAMG scaling
On Mon, Dec 24, 2018 at 4:56 PM Jed Brown wrote: > Mark Adams via petsc-users writes: > > > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in > > attached, NB, this is code that I wrote in grad school). It is memory > > efficient and simple, just four nested loops i,j,I,J: C(I,J) = > > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data that I > am > > getting from my bone modeling colleagues, that use this old code on > > Stampede now, the times look reasonable compared to GAMG. This is > optimized > > for elasticity, where I unroll loops (so it is really six nested loops). > > Is the A above meant to include some ghosted rows? > You could but I was thinking of having i in the outer loop. In C(I,J) = P(i,I)*A(i,j)*P(j,J), the iteration over 'i' need only be the local rows of A (and the left term P). > > > In thinking about this now, I think you want to make a local copy of P > with > > rows (j) for every column in A that you have locally, then transpose this > > local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a > > special tree data structure but a matrix is simpler.) > > Why transpose for P(j,J)? > (premature) optimization. I was thinking 'j' being in the inner loop and doing sparse inner product, but now that I think about it there are other options.
Re: [petsc-users] GAMG scaling
Mark Adams via petsc-users writes: > Anyway, my data for this is in my SC 2004 paper (MakeNextMat_private in > attached, NB, this is code that I wrote in grad school). It is memory > efficient and simple, just four nested loops i,j,I,J: C(I,J) = > P(i,I)*A(i,j)*P(j,J). In eyeballing the numbers and from new data that I am > getting from my bone modeling colleagues, that use this old code on > Stampede now, the times look reasonable compared to GAMG. This is optimized > for elasticity, where I unroll loops (so it is really six nested loops). Is the A above meant to include some ghosted rows? > In thinking about this now, I think you want to make a local copy of P with > rows (j) for every column in A that you have locally, then transpose this > local thing for the P(j,J) term. A sparse AXPY on j. (My code uses a > special tree data structure but a matrix is simpler.) Why transpose for P(j,J)?