Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions

2023-03-01 Thread Paul Grosse-Bley

I previously thought that a custom Red-Black GS solution would be too specific 
for PETSc as it will only work for star-shaped stencils of width 1.

For that specific operation, a custom kernel would probably be able to achieve 
significantly better performance as it would not have to load a mask from 
memory and one can make use of the very regular structure of the problem. We 
still wanted to use cuSPARSE here because it seemed to be more in line with the 
present GPU support of PETSc and using custom kernels in one library might be 
seen as unfair when comparing it to others.

But if you are interested anyway, I will share my implementation if/when I get 
to implementing that.

Best,
Paul Grosse-Bley

On Wednesday, March 01, 2023 16:51 CET, Barry Smith  wrote:
   On Mar 1, 2023, at 10:30 AM, Paul Grosse-Bley wrote: Thank you for the detailed 
answer, Barry. I had hit a dead end on my side.
If you wish to compare, for example, ex45.c with a code that does not 
incorporate the Dirichlet boundary nodes in the linear system you can just use 
0 boundary conditions for both codes. Do you mean to implement the boundary 
conditions explicitly in e.g. hpgmg-cuda instead of using the ghosted cells for 
them?    I don't know anything about hpgmg-cuda and what it means by "ghosted 
cells".  I am just saying I think it is reasonable to use the style of ex45.c 
that retains Dirichlet unknowns in the global matrix (with zero on those 
points) in PETSc to compare with other codes that may or may not do something 
different. But you need to use zero for Dirichlet points to ensure that the 
"funny" convergence rates do not happen at the beginning, making the comparison 
unbalanced between the two codes. 
Am I right in assuming that the PCMG coarsening (using the DMDA's geometric 
information) will cause the boundary condition on the coarser grids to be 
finite (>0)?   In general, yes: even if the Dirichlet points on the fine grid 
are zero, during PCMG with DMDA as in ex45.c those "boundary" values on the 
coarser grids may end up non-zero during the iterative process.  But this is 
"harmless", just part of the algorithm; it does mean the convergence with a 
code that does not include those points will be different (not necessarily 
better or worse, just different). 
Ideally I would like to just use some kind of GPU-parallel (colored) 
SOR/Gauss-Seidel instead of Jacobi. One can relatively easily implement 
Red-Black GS using cuSPARSE's masked matrix-vector products, but I have not 
found any information on implementing a custom preconditioner in PETSc.   You 
can use https://petsc.org/release/docs/manualpages/PC/PCSHELL/. You can look at 
src/ksp/pc/impls/jacobi.c for detailed comments on what goes into a 
preconditioner object.  If you write such a GPU-parallel (colored) 
SOR/Gauss-Seidel we would love to include it in PETSc. Note also 
https://petsc.org/release/docs/manualpages/KSP/KSPCHEBYSHEV/ and potentially 
other "polynomial preconditioners" are an alternative approach for having more 
powerful parallel smoothers.
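
For reference, a shell preconditioner can be wired up roughly as in the 
following sketch; RBGSCtx, RBGSApply and UseRBGSSmoother are hypothetical 
placeholders for a cuSPARSE-based red-black sweep (not existing PETSc 
functions), only the PCSHELL calls themselves are PETSc API:

#include <petscksp.h>

/* Hypothetical context for a cuSPARSE-based red-black Gauss-Seidel sweep */
typedef struct {
  Mat A;    /* operator on this level */
  Vec work; /* scratch vector; cuSPARSE handles, color masks, etc. would live here too */
} RBGSCtx;

/* Apply the smoother: y should approximate A^{-1} x */
static PetscErrorCode RBGSApply(PC pc, Vec x, Vec y)
{
  RBGSCtx *ctx;
  PetscFunctionBeginUser;
  PetscCall(PCShellGetContext(pc, &ctx));
  /* launch the custom red/black sweeps here (masked SpMVs, etc.); as a
     placeholder this just copies x so the sketch at least runs */
  PetscCall(VecCopy(x, y));
  PetscFunctionReturn(0);
}

/* Attach the shell smoother to the KSP of one MG level,
   e.g. obtained with PCMGGetSmoother(pcmg, level, &smoother) */
static PetscErrorCode UseRBGSSmoother(KSP smoother, Mat A)
{
  PC       pc;
  RBGSCtx *ctx;
  PetscFunctionBeginUser;
  PetscCall(PetscNew(&ctx));
  ctx->A = A;
  PetscCall(MatCreateVecs(A, &ctx->work, NULL));
  PetscCall(KSPGetPC(smoother, &pc));
  PetscCall(PCSetType(pc, PCSHELL));
  PetscCall(PCShellSetContext(pc, ctx));
  PetscCall(PCShellSetApply(pc, RBGSApply));
  /* a full implementation would also use PCShellSetDestroy() to free ctx */
  PetscFunctionReturn(0);
}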
Best,
Paul Grosse-Bley

On Wednesday, March 01, 2023 05:38 CET, Barry Smith  wrote:
  Ok, here is the situation. The command line options as given do not 
result in multigrid quality convergence in any of the runs; the error 
contraction factor is around .94 (meaning that for the modes that the multigrid 
algorithm does the worst on it only removes about 6 percent of them per 
iteration).     But this is hidden by the initial right hand side for the 
linear system as written in ex45.c which has O(h) values on the boundary nodes 
and O(h^3) values on the interior nodes. The first iterations are largely 
working on the boundary residual and making great progress attacking that so 
that it looks like one has a good error contraction factor. One then sees 
the error contraction factor start to get worse and worse for the later 
iterations. With the 0 on the boundary the iterations quickly get to the bad 
regime where the error contraction factor is near one. One can see this by 
using a -ksp_rtol 1.e-12 and having the MG code print the residual decrease for 
each iteration. Though it appears the 0 boundary condition case converges much 
slower (since it requires many more iterations), if you factor out the huge 
advantage of the nonzero boundary condition case at the beginning (in terms of 
decreasing the residual) you see they both have an asymptotic error contraction 
factor of around .94 (which is horrible for multigrid).    I now add 
-mg_levels_ksp_richardson_scale .9 -mg_coarse_ksp_richardson_scale .9 and rerun 
the two cases (nonzero and zero boundary right hand side); they take 35 and 41 
iterations (much better):
initial residual norm 14.6993
next residual norm 0.84167 0.0572591
next residual norm 0.0665392 0.00452668
next residual norm 0.0307273 0.00209039
next residual norm 0.0158949 0.00108134

Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions

2023-03-01 Thread Paul Grosse-Bley

Thank you for the detailed answer, Barry. I had hit a dead end on my side.
If you wish to compare, for example, ex45.c with a code that does not 
incorporate the Dirichlet boundary nodes in the linear system you can just use 
0 boundary conditions for both codes. Do you mean to implement the boundary 
conditions explicitly in e.g. hpgmg-cuda instead of using the ghosted cells for 
them?

Am I right in assuming that the PCMG coarsening (using the DMDA's geometric 
information) will cause the boundary condition on the coarser grids to be 
finite (>0)?

Ideally I would like to just use some kind of GPU-parallel (colored) 
SOR/Gauss-Seidel instead of Jacobi. One can relatively easily implement 
Red-Black GS using cuSPARSE's masked matrix vector products, but I have not 
found any information on implementing a custom preconditioner in PETSc.

Best,
Paul Grosse-Bley

On Wednesday, March 01, 2023 05:38 CET, Barry Smith  wrote:
  Ok, here is the situation. The command line options as given do not 
result in multigrid quality convergence in any of the runs; the error 
contraction factor is around .94 (meaning that for the modes that the multigrid 
algorithm does the worst on it only removes about 6 percent of them per 
iteration).     But this is hidden by the initial right hand side for the 
linear system as written in ex45.c which has O(h) values on the boundary nodes 
and O(h^3) values on the interior nodes. The first iterations are largely 
working on the boundary residual and making great progress attacking that so 
that it looks like one has a good error contraction factor. One then sees 
the error contraction factor start to get worse and worse for the later 
iterations. With the 0 on the boundary the iterations quickly get to the bad 
regime where the error contraction factor is near one. One can see this by 
using a -ksp_rtol 1.e-12 and having the MG code print the residual decrease for 
each iteration. Though it appears the 0 boundary condition case converges much 
slower (since it requires many more iterations), if you factor out the huge 
advantage of the nonzero boundary condition case at the beginning (in terms of 
decreasing the residual) you see they both have an asymptotic error contraction 
factor of around .94 (which is horrible for multigrid).    I now add 
-mg_levels_ksp_richardson_scale .9 -mg_coarse_ksp_richardson_scale .9 and rerun 
the two cases (nonzero and zero boundary right hand side); they take 35 and 41 
iterations (much better): 
initial residual norm 14.6993
next residual norm 0.84167 0.0572591
next residual norm 0.0665392 0.00452668
next residual norm 0.0307273 0.00209039
next residual norm 0.0158949 0.00108134
next residual norm 0.00825189 0.000561378
next residual norm 0.00428474 0.000291492
next residual norm 0.00222482 0.000151355
next residual norm 0.00115522 7.85898e-05
next residual norm 0.000599836 4.0807e-05
next residual norm 0.000311459 2.11887e-05
next residual norm 0.000161722 1.1002e-05
next residual norm 8.39727e-05 5.71269e-06
next residual norm 4.3602e-05 2.96626e-06
next residual norm 2.26399e-05 1.5402e-06
next residual norm 1.17556e-05 7.99735e-07
next residual norm 6.10397e-06 4.15255e-07
next residual norm 3.16943e-06 2.15617e-07
next residual norm 1.64569e-06 1.11957e-07
next residual norm 8.54511e-07 5.81326e-08
next residual norm 4.43697e-07 3.01848e-08
next residual norm 2.30385e-07 1.56732e-08
next residual norm 1.19625e-07 8.13815e-09
next residual norm 6.21143e-08 4.22566e-09
next residual norm 3.22523e-08 2.19413e-09
next residual norm 1.67467e-08 1.13928e-09
next residual norm 8.69555e-09 5.91561e-10
next residual norm 4.51508e-09 3.07162e-10
next residual norm 2.34441e-09 1.59491e-10
next residual norm 1.21731e-09 8.28143e-11
next residual norm 6.32079e-10 4.30005e-11
next residual norm 3.28201e-10 2.23276e-11
next residual norm 1.70415e-10 1.15934e-11
next residual norm 8.84865e-11 6.01976e-12
next residual norm 4.59457e-11 3.1257e-12
next residual norm 2.38569e-11 1.62299e-12
next residual norm 1.23875e-11 8.42724e-13
Linear solve converged due to CONVERGED_RTOL iterations 35
Residual norm 1.23875e-11 
initial residual norm 172.601
next residual norm 154.803 0.896887
next residual norm 66.9409 0.387837
next residual norm 34.4572 0.199636
next residual norm 17.8836 0.103612
next residual norm 9.28582 0.0537995
next residual norm 4.82161 0.027935
next residual norm 2.50358 0.014505
next residual norm 1.29996 0.0075316
next residual norm 0.674992 0.00391071
next residual norm 0.350483 0.0020306
next residual norm 0.181985 0.00105437
next residual norm 0.094494 0.000547472
next residual norm 0.0490651 0.000284269
next residual norm 0.0254766 0.000147604
next residual norm 0.0132285 7.6642e-05
next residual norm 0.00686876 3.97956e-05
next residual norm 0.00356654 2.06635e-05
next residual norm 0.00185189 1.07293e-05
next residual norm 0.000961576 5.5711e-06
next residual norm 0.000499289 2.89274e-06
next residual norm 0.000259251 1.50203e

Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions

2023-02-27 Thread Paul Grosse-Bley

The scaling might be the problem, especially since I don't know what you mean 
by scaling it according to FE.

For reproducing the issue with a smaller problem:
Change the ComputeRHS function in ex45.c so that the right-hand side is zero on 
the boundary and one in the interior:

if (i == 0 || j == 0 || k == 0 || i == mx - 1 || j == my - 1 || k == mz - 1) {
  barray[k][j][i] = 0.0; /* Dirichlet boundary nodes */
} else {
  barray[k][j][i] = 1.0; /* interior nodes */
}

Change the dimensions to e.g. 33 instead of 7 (I scaled it down so it runs 
quickly without a GPU) and then run with:

-ksp_converged_reason -ksp_type richardson -ksp_rtol 1e-09 -pc_type mg 
-pc_mg_levels 3 -mg_levels_ksp_type richardson -mg_levels_ksp_max_it 6 
-mg_levels_ksp_converged_maxits -mg_levels_pc_type jacobi -mg_coarse_ksp_type 
richardson -mg_coarse_ksp_max_it 6 -mg_coarse_ksp_converged_maxits 
-mg_coarse_pc_type jacobi

You will find that it takes 145 iterations instead of 25 for the original ex45 
RHS. My hpgmg-cuda implementation (using 32^3) takes 41 iterations.
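
For reference, assuming ex45.c calls DMSetFromOptions() so the DMDA grid size 
can also be set from the command line (instead of editing the hard-coded 7s), 
the full invocation would look roughly like:

mpiexec -n 1 ./ex45 -da_grid_x 33 -da_grid_y 33 -da_grid_z 33 \
  -ksp_converged_reason -ksp_type richardson -ksp_rtol 1e-09 -pc_type mg -pc_mg_levels 3 \
  -mg_levels_ksp_type richardson -mg_levels_ksp_max_it 6 -mg_levels_ksp_converged_maxits \
  -mg_levels_pc_type jacobi -mg_coarse_ksp_type richardson -mg_coarse_ksp_max_it 6 \
  -mg_coarse_ksp_converged_maxits -mg_coarse_pc_type jacobi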

To what value do I have to change the diagonal entries of the matrix on the 
boundary rows according to FE? Right now the diagonal is completely constant.

Paul

On Tuesday, February 28, 2023 00:23 CET, Barry Smith  wrote:
 
I have not seen explicitly including, or excluding, the Dirichlet boundary 
values in the system having a significant effect on the convergence so long as 
you SCALE the diagonal rows (of those Dirichlet points) by a value similar to 
the other entries along the diagonal. If they are scaled completely 
differently, that can screw up the convergence. For src/ksp/ksp/ex45.c I see 
that the appropriate scaling is used (note the scaling should come from a 
finite element view of the discretization even if the discretization is finite 
differences as is done in ex45.c)
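
To make the scaling concrete, the relevant branch of an ex45.c-style 
ComputeMatrix would look roughly like the fragment below (a sketch; variable 
names are assumed to follow ex45.c, with Hx = 1/(mx-1) etc. and row the 
MatStencil of the current node). The point is that the Dirichlet rows get a 
diagonal of the same magnitude as the interior diagonal, not e.g. 1.0:

if (i == 0 || j == 0 || k == 0 || i == mx - 1 || j == my - 1 || k == mz - 1) {
  /* Dirichlet row: keep only the diagonal, scaled like an interior row */
  v[0] = 2.0 * (Hx*Hy/Hz + Hx*Hz/Hy + Hy*Hz/Hx);
  PetscCall(MatSetValuesStencil(B, 1, &row, 1, &row, v, INSERT_VALUES));
} else {
  /* interior node: the usual 7-point stencil entries go here; its diagonal
     is the same 2.0 * (Hx*Hy/Hz + Hx*Hz/Hy + Hy*Hz/Hx) */
}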

Are you willing to share the two codes so we can take a look with experienced 
eyes to try to figure out the difference?

Barry




> On Feb 27, 2023, at 5:48 PM, Paul Grosse-Bley 
>  wrote:
>
> Hi Barry,
>
> the reason why I wanted to change to ghost boundaries is that I was worrying 
> about the effect of PCMG's coarsening on these boundary values.
>
> As mentioned before, I am trying to reproduce results from the hpgmg-cuda 
> benchmark (a modified version of it, e.g. using 2nd order instead of 4th 
> etc.).
> I am trying to solve the Poisson equation -\nabla^2 u = 1 with u = 0 on the 
> boundary with rtol=1e-9. My MG solver implemented in hpgmg solves this 
> in 40 V-cycles (I weakened it a lot by only doing smooths at the coarse level 
> instead of CG). When I run the "same" MG solver built in PETSc on this 
> problem, it starts out reducing the residual norm as fast or even faster for 
> the first 20-30 iterations. But for the last order of magnitude in the 
> residual norm it needs more than 300 V-cycles, i.e. it gets very slow. At 
> this point I am pretty much out of ideas about what is the cause, especially 
> since e.g. adding back cg at the coarsest level doesn't seem to change the 
> number of iterations at all. Therefore I am suspecting the discretization to 
> be the problem. HPGMG uses an even number of points per dimension (e.g. 256), 
> while PCMG wants an odd number (e.g. 257). So I also tried adding another 
> layer of boundary values for the discretization to effectively use only 254 
> points per dimension. This caused the solver to get even slightly worse.
>
> So can the explicit boundary values screw with the coarsening, especially 
> when they are not finite? Because with the problem as stated in ex45 with 
> finite (i.e. non-zero) boundary values, the MG solver takes only 18 V-cycles.
>
> Best,
> Paul
>
>
>
> On Monday, February 27, 2023 18:17 CET, Barry Smith  wrote:
>
>>
>> Paul,
>>
>> DM_BOUNDARY_GHOSTED would result in the extra ghost locations in the local 
>> vectors (obtained with DMCreateLocalVector()) but they will not appear in the 
>> global vectors obtained with DMCreateGlobalVector(); perhaps this is the 
>> issue? Since they do not appear in the global vector they will not appear in 
>> the linear system so there will be no diagonal entries for you to set since 
>> those rows/columns do not exist in the linear system. In other words, using 
>> DM_BOUNDARY_GHOSTED is a way to avoid needing to put the Dirichlet values 
>> explicitly into the system being solved; DM_BOUNDARY_GHOSTED is generally 
>> more helpful for nonlinear systems than linear systems.
>>
>> Barry
>>
>> > On Feb 27, 2023, at 12:08 PM, Paul Grosse-Bley 
>> >  wrote:
>> >
>> > Hi,
>> >
>> > I would like to modify src/ksp/ksp/tutorials/ex45.c to implement Dirichlet 
>> > boundary conditions using DM_BOUNDARY_GHOSTED instead of using 
>> > DM_BOUNDARY_NONE and explicitly implementing the boundary by adding 
>> > diagonal-only rows

Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions

2023-02-27 Thread Paul Grosse-Bley

Hi Barry,

the reason why I wanted to change to ghost boundaries is that I was worrying 
about the effect of PCMG's coarsening on these boundary values.

As mentioned before, I am trying to reproduce results from the hpgmg-cuda 
benchmark (a modified version of it, e.g. using 2nd order instead of 4th etc.).
I am trying to solve the Poisson equation -\nabla^2 u = 1 with u = 0 on the 
boundary with rtol=1e-9. My MG solver implemented in hpgmg solves this in 
40 V-cycles (I weakened it a lot by only doing smooths at the coarse level 
instead of CG). When I run the "same" MG solver built in PETSc on this problem, 
it starts out reducing the residual norm as fast or even faster for the first 
20-30 iterations. But for the last order of magnitude in the residual norm it 
needs more than 300 V-cycles, i.e. it gets very slow. At this point I am pretty 
much out of ideas about what is the cause, especially since e.g. adding back cg 
at the coarsest level doesn't seem to change the number of iterations at all. 
Therefore I am suspecting the discretization to be the problem. HPGMG uses an 
even number of points per dimension (e.g. 256), while PCMG wants an odd number 
(e.g. 257). So I also tried adding another layer of boundary values for the 
discretization to effectively use only 254 points per dimension. This caused 
the solver to get even slightly worse.

So can the explicit boundary values screw with the coarsening, especially when 
they are not finite? Because with the problem as stated in ex45 with finite 
(i.e. non-zero) boundary values, the MG solver takes only 18 V-cycles.

Best,
Paul



On Monday, February 27, 2023 18:17 CET, Barry Smith  wrote:
 Paul,

DM_BOUNDARY_GHOSTED would result in the extra ghost locations in the local 
vectors (obtained with DMCreateLocalVector()) but they will not appear in the 
global vectors obtained with DMCreateGlobalVector(); perhaps this is the issue? 
Since they do not appear in the global vector they will not appear in the 
linear system so there will be no diagonal entries for you to set since those 
rows/columns do not exist in the linear system. In other words, using 
DM_BOUNDARY_GHOSTED is a way to avoid needing to put the Dirichlet values 
explicitly into the system being solved; DM_BOUNDARY_GHOSTED is generally more 
helpful for nonlinear systems than linear systems.

Barry

> On Feb 27, 2023, at 12:08 PM, Paul Grosse-Bley 
>  wrote:
>
> Hi,
>
> I would like to modify src/ksp/ksp/tutorials/ex45.c to implement Dirichlet 
> boundary conditions using DM_BOUNDARY_GHOSTED instead of using 
> DM_BOUNDARY_NONE and explicitly implementing the boundary by adding 
> diagonal-only rows.
>
> My assumption was that with DM_BOUNDARY_GHOSTED all vectors from that DM have 
> the extra memory for the ghost entries and that I can basically use 
> DMDAGetGhostCorners instead of DMDAGetCorners to access the array gotten via 
> DMDAVecGetArray. But when I access (gxs, gys, gzs) = (-1,-1,-1) I get a 
> segmentation fault. When looking at the implementation of DMDAVecGetArray it 
> looked to me as if accessing (-1, -1, -1) should work as DMDAVecGetArray 
> passes the ghost corners to VecGetArray3d which then adds the right offsets.
>
> I could not find any example using DM_BOUNDARY_GHOSTED and then actually 
> accessing the ghost/boundary elements. Can I assume that they are set to zero 
> for the solution vector, i.e. u = 0 on \partial\Omega, and I do not need to 
> access them at all?
>
> Best,
> Paul Große-Bley
 


[petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions

2023-02-27 Thread Paul Grosse-Bley

Hi,

I would like to modify src/ksp/ksp/tutorials/ex45.c to implement Dirichlet 
boundary conditions using DM_BOUNDARY_GHOSTED instead of using DM_BOUNDARY_NONE 
and explicitly implementing the boundary by adding diagonal-only rows.

My assumption was that with DM_BOUNDARY_GHOSTED all vectors from that DM have 
the extra memory for the ghost entries and that I can basically use 
DMDAGetGhostCorners instead of DMDAGetCorners to access the array gotten via 
DMDAVecGetArray. But when I access (gxs, gys, gzs) = (-1,-1,-1) I get a 
segmentation fault. When looking at the implementation of DMDAVecGetArray it 
looked to me as if accessing (-1, -1, -1) should work as DMDAVecGetArray passes 
the ghost corners to VecGetArray3d which then adds the right offsets.
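
For the record, what I was attempting looks roughly like the sketch below 
(assuming a DM_BOUNDARY_GHOSTED DMDA da with stencil width 1, a *local* vector 
as Barry points out in his reply, and mx, my, mz being the global sizes from 
DMDAGetInfo):

Vec            lv;
PetscScalar ***a;
PetscInt       gxs, gys, gzs, gxm, gym, gzm, i, j, k;

PetscCall(DMGetLocalVector(da, &lv));
PetscCall(DMDAVecGetArray(da, lv, &a));
PetscCall(DMDAGetGhostCorners(da, &gxs, &gys, &gzs, &gxm, &gym, &gzm));
/* ghost indices start at gxs, gys, gzs, i.e. at -1 on the physical boundary */
for (k = gzs; k < gzs + gzm; k++)
  for (j = gys; j < gys + gym; j++)
    for (i = gxs; i < gxs + gxm; i++)
      if (i < 0 || j < 0 || k < 0 || i >= mx || j >= my || k >= mz)
        a[k][j][i] = 0.0; /* Dirichlet value u = 0 in the ghost layer */
PetscCall(DMDAVecRestoreArray(da, lv, &a));
PetscCall(DMRestoreLocalVector(da, &lv));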

I could not find any example using DM_BOUNDARY_GHOSTED and then actually 
accessing the ghost/boundary elements. Can I assume that they are set to zero 
for the solution vector, i.e. u = 0 on \partial\Omega, and I do not need to access 
them at all?

Best,
Paul Große-Bley


Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

I thought I had observed the number of cycles with 
-pc_mg_multiplicative_cycles depending on rtol. But I might have seen 
this with maxits=0, which would explain my misunderstanding of richardson.

I guess PCAMGX does not use this PCApplyRichardson_MG (yet?), because I still 
see the multiple PCApplys there with maxits=1 and richardson, while they are 
gone for PCMG (due to getting rid of KSPSetComputeInitialGuess?).

Best,
Paul Grosse-Bley


On Wednesday, February 22, 2023 23:20 CET, Barry Smith  wrote:
  Preonly means exactly one application of the PC so it will never converge 
by itself unless the PC is a full solver.    Note there is a 
PCApplyRichardson_MG() that gets used automatically with KSPRICHARDSON. This 
does not have an "extra" application of the preconditioner, so 2 iterations of 
Richardson with MG will use 2 applications of the V-cycle. So it is exactly 
"multigrid as a solver, without a Krylov method", no extra work. So I don't 
think you need to make any "compromises".    Barry   On Feb 22, 2023, at 4:57 
PM, Paul Grosse-Bley  wrote: Hi again,

I now found out that

1. preonly ignores -ksp_pc_side right (makes sense, I guess).
2. richardson is incompatible with -ksp_pc_side right.
3. preonly gives less output for -log_view -pc_mg_log than richardson.
4. preonly also ignores -ksp_rtol etc.
5. preonly causes -log_view to measure incorrect timings for custom stages, 
i.e. the time for the stage (219us) is significantly shorter than the time for 
the KSPSolve inside the stage (~40ms).

Number 4 will be problematic as I want to benchmark number of V-cycles and 
runtime for a given rtol. At the same time I want to avoid richardson now 
because of number 2 and the additional work of scaling the RHS.

Is there any good way of just using MG V-cycles as a solver, i.e. without 
interference from an outer Krylov solver and still iterate until convergence?
Or will I just have to accept the additional V-cycle due to the left 
application of the PC with richardson?

I guess I could also manually change -pc_mg_multiplicative_cycles until the 
residual gets low enough (using preonly), but that seems very inefficient.

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" 
 wrote:
 I was using the Richardson KSP type which I guess has the same behavior as 
GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use 
preonly from now on, where maxits=1 does what I want it to do.

Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think 
all my problems are resolved for now. Thank you very much, Barry and 
Mark!

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:03 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.    I do not understand what you are talking about 
with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one 
iteration, 0 is kind of meaningless.    The reason that there is a PCApply at 
the start of the solve is because by default the KSPType is KSPGMRES which by 
default uses left preconditioner which means the right hand side needs to be 
scaled by the preconditioner before the KSP process starts. So in this 
configuration one KSP iteration results in 2 PCApply.  You can use -ksp_pc_side 
right to use right preconditioning and then the number of PCApply will match 
the number of KSP iterations.
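
For reference, the same thing can be requested in code (a sketch; this should 
be equivalent to -ksp_pc_side right as far as I understand):

PetscCall(KSPSetType(ksp, KSPGMRES));
PetscCall(KSPSetPCSide(ksp, PC_RIGHT)); /* PCApply count then matches the KSP iteration count */
PetscCall(KSPSetFromOptions(ksp));      /* command line options can still override */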
Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
 wrote: Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occ

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi again,

I now found out that

1. preonly ignores -ksp_pc_side right (makes sense, I guess).
2. richardson is incompatible with -ksp_pc_side right.
3. preonly gives less output for -log_view -pc_mg_log than richardson.
4. preonly also ignores -ksp_rtol etc.
5. preonly causes -log_view to measure incorrect timings for custom stages, 
i.e. the time for the stage (219us) is significantly shorter than the time for 
the KSPSolve inside the stage (~40ms).

Number 4 will be problematic as I want to benchmark number of V-cycles and 
runtime for a given rtol. At the same time I want to avoid richardson now 
because of number 2 and the additional work of scaling the RHS.

Is there any good way of just using MG V-cycles as a solver, i.e. without 
interference from an outer Krylov solver and still iterate until convergence?
Or will I just have to accept the additional V-cycle due to the left 
application of the PC with richardson?

I guess I could also manually change -pc_mg_multiplicative_cycles until the 
residual gets low enough (using preonly), but that seems very inefficient.

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" 
 wrote:
 I was using the Richardson KSP type which I guess has the same behavior as 
GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use 
preonly from now on, where maxits=1 does what I want it to do.

Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think 
all my problems are resolved for now. Thank you very much, Barry and 
Mark!

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:03 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.    I do not understand what you are talking about 
with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one 
iteration, 0 is kind of meaningless.    The reason that there is a PCApply at 
the start of the solve is because by default the KSPType is KSPGMRES which by 
default uses left preconditioner which means the right hand side needs to be 
scaled by the preconditioner before the KSP process starts. So in this 
configuration one KSP iteration results in 2 PCApply.  You can use -ksp_pc_side 
right to use right preconditioning and then the number of PCApply will match 
the number of KSP iterations.
Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
 wrote: Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.   Hmm, I run an example with 
-pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not 
understanding why it would not work for you. If you use and don't use the 
option, are the exact same counts listed for all the events in the -log_view? 
I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

I was using the Richardson KSP type which I guess has the same behavior as 
GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use 
preonly from now on, where maxits=1 does what I want it to do.

Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think 
all my problems are resolved for now. Thank you very much, Barry and 
Mark!

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 21:03 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.    I do not understand what you are talking about 
with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one 
iteration, 0 is kind of meaningless.    The reason that there is a PCApply at 
the start of the solve is because by default the KSPType is KSPGMRES which by 
default uses left preconditioner which means the right hand side needs to be 
scaled by the preconditioner before the KSP process starts. So in this 
configuration one KSP iteration results in 2 PCApply.  You can use -ksp_pc_side 
right to use right preconditioning and then the number of PCApply will match 
the number of KSP iterations.
Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
 wrote: Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.   Hmm, I run an example with 
-pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not 
understanding why it would not work for you. If you use and don't use the 
option, are the exact same counts listed for all the events in the -log_view? 
I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Barry,

the picture keeps getting clearer. I did not use KSPSetInitialGuessNonzero or 
the corresponding option, but using KSPSetComputeInitialGuess probably sets it 
automatically (without telling one in the output of -help).

I was also confused by the preonly KSP type not working which is also caused by 
this. I think ex45 should be changed to use a non-zero initial guess (plus 
maybe a comment mentioning that one should avoid KSPSetComputeInitialGuess when 
using the zero initial guess).

Thank you for asking the right questions,
Paul Große-Bley



On Wednesday, February 22, 2023 20:46 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 2:19 PM, Paul Grosse-Bley 
 wrote: Hi again,

after checking with -ksp_monitor for PCMG, it seems my assumption that I could 
reset the solution by calling KSPSetComputeInitialGuess and then KSPSetup was 
generally wrong and BoomerAMG was just the only preconditioner that cleverly 
stops doing work when the residual is already converged (which then caused me 
to find the shrinking residual and to think that it was doing something 
different to the other methods).    Depending on your example the 
KSPSetComputeInitialGuess usage might not work. I suggest not using it for your 
benchmarking.    But if you just call KSPSolve() multiple times it will zero 
the solution at the beginning of each KSPSolve() (unless you use 
KSPSetInitialGuessNonzero()). So if you want to run multiple solves for testing 
you should not need to do anything. This will be true for any use of KSPSolve() 
whether the PC is from PETSc, hypre or NVIDIA.   
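
So a benchmarking loop can be as simple as the following sketch (nbench is an 
assumed repetition count; the VecZeroEntries is only needed if 
KSPSetInitialGuessNonzero was set):

for (PetscInt it = 0; it < nbench; it++) {
  PetscCall(VecZeroEntries(x)); /* reset the solution; KSPSolve does this anyway for a zero initial guess */
  PetscCall(KSPSolve(ksp, b, x));
}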
So, how can I get KSP to use the function given through 
KSPSetComputeInitialGuess to reset the solution vector (without calling 
KSPReset which would add a lot of overhead, I assume)?

Best,
Paul Große-Bley


On Wednesday, February 22, 2023 19:46 CET, Mark Adams  wrote:
 OK, Nsight Systems is a good way to see what is going on. So all three of your 
solvers are not traversing the MG hierarchy with the correct logic. I don't know 
about hypre but PCMG and AMGx are pretty simple and AMGx dives into the AMGx 
library directly from our interface. Some things to try:
* Use -options_left to make sure your options are being used (e.g., spelling mistakes)
* Use -ksp_view to see a human readable list of your solver parameters.
* Use -log_trace to see if the correct methods are called.
  - PCMG calls PCMGMCycle_Private for each of the cycles in code like:
    for (i = 0; i < mg->cyclesperpcapply; i++) PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp, NULL));
  - AMGx is called via PCApply_AMGX and then it dives into the library. See 
    where these three calls to AMGx are called from.
Mark

On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley wrote: Hi 

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.

I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile fo

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Barry,

I think most of my "weird" observations came from the fact that I looked at 
iterations of KSPSolve where the residual was already converged. PCMG and 
PCGAMG do one V-cycle before even taking a look at the residual and then 
independent of pc_mg_multiplicative_cycles stop if it is converged.

Looking at iterations that are not converged with PCMG, 
pc_mg_multiplicative_cycles works fine.

At these iterations I also see the multiple calls to PCApply in a single 
KSPSolve iteration which were throwing me off with PCAMGX before.

The reason for these multiple applications of the preconditioner (tested for 
both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be 
better documented, I think.

Best,
Paul Große-Bley



On Wednesday, February 22, 2023 20:15 CET, Barry Smith  wrote:
   On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley 
 wrote: Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.   Hmm, I run an example with 
-pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not 
understanding why it would not work for you. If you use and don't use the 
option, are the exact same counts listed for all the events in the -log_view? 
I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?
  Again, not sure what "see" is, but PCAMGX is pretty new and has not been used 
much. Note some KSP methods apply the PC before the iteration. Mark  Thank 
you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct; if the code, unfortunately, does use 
DMCreateMatrix(), it will not work out of the box for geometric multigrid. So it 
might be the wrong example for you.   I don't know what you mean about clever. 
If you simply set the solution to zero at the beginning of the loop then it 
will just do the same solve multiple times. The setup should not do much of 
anything after the first solve.  Though usually solves are big enough that 
one need not run solves multiple times to get a good understanding of thei

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi again,

after checking with -ksp_monitor for PCMG, it seems my assumption that I could 
reset the solution by calling KSPSetComputeInitialGuess and then KSPSetup was 
generally wrong and BoomerAMG was just the only preconditioner that cleverly 
stops doing work when the residual is already converged (which then caused me 
to find the shrinking residual and to think that it was doing something 
different to the other methods).

So, how can I get KSP to use the function given through 
KSPSetComputeInitialGuess to reset the solution vector (without calling 
KSPReset which would add a lot of overhead, I assume)?

Best,
Paul Große-Bley


On Wednesday, February 22, 2023 19:46 CET, Mark Adams  wrote:
 OK, Nsight Systems is a good way to see what is going on. So all three of your 
solvers are not traversing the MG hierarchy with the correct logic. I don't know 
about hypre but PCMG and AMGx are pretty simple and AMGx dives into the AMGx 
library directly from our interface. Some things to try:
* Use -options_left to make sure your options are being used (e.g., spelling mistakes)
* Use -ksp_view to see a human readable list of your solver parameters.
* Use -log_trace to see if the correct methods are called.
  - PCMG calls PCMGMCycle_Private for each of the cycles in code like:
    for (i = 0; i < mg->cyclesperpcapply; i++) PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp, NULL));
  - AMGx is called via PCApply_AMGX and then it dives into the library. See 
    where these three calls to AMGx are called from.
Mark

On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley wrote: Hi 
Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.

I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?
  Again, not sure what "see" is, but PCAMGX is pretty new and has not been used 
much. Note some KSP methods apply the PC before the iteration. Mark  Thank 
you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct if the code, unfortunately, does use 
DMCreateMat

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Mark,

I use Nvidia Nsight Systems with --trace 
cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers 
that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE 
calls (in addition to the actual kernels which are not always easy to 
attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed 
NVTX markers.

The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because 
kernel runtimes on coarser levels are much shorter. At the coarsest level, 
there normally isn't even enough work for the GPU (Nvidia A100) to be fully 
occupied which is also visible in Nsight Systems.

I run only a single MPI rank with a single GPU, so profiling is straightforward.

Best,
Paul Große-Bley

On Wednesday, February 22, 2023 18:24 CET, Mark Adams  wrote:
   On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley 
 wrote:Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG). How are you seeing this? You might try 
-log_trace to see if you get two V cycles. 
When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?
  Looking at the doc 
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you 
use this with  KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the 
solution vector.  The profile for BoomerAMG also doesn't really show the 
V-cycle behavior of the other implementations. Most of the runtime seems to go 
into calls to cusparseDcsrsv which might happen at the different levels, but 
the runtime of these kernels doesn't show the V-cycle pattern. According to the 
output with -pc_hypre_boomeramg_print_statistics it is doing the right thing 
though, so I guess it is alright (and if not, this is probably the wrong place 
to discuss it).
When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) 
calls in KSPSolve (three for the very first KSPSolve) while expecting just one. 
Each KSPSolve should do a single preconditioned Richardson iteration. Why is 
the preconditioner applied multiple times here?
  Again, not sure what "see" is, but PCAMGX is pretty new and has not been used 
much. Note some KSP methods apply the PC before the iteration. Mark  Thank 
you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash, take a look at the test cases at the bottom of the 
file. You are likely correct; if the code, unfortunately, does use 
DMCreateMatrix(), it will not work out of the box for geometric multigrid. So it 
might be the wrong example for you.   I don't know what you mean about clever. 
If you simply set the solution to zero at the beginning of the loop then it 
will just do the same solve multiple times. The setup should not do much of 
anything after the first solve.  Though usually solves are big enough that 
one need not run solves multiple times to get a good understanding of their 
performance.   On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley 
 wrote: Hi Barry,

src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, 
thank you! Sadly I get a segfault when executing that example with PCMG and 
more than one level, i.e. with the minimal args:

$ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
===
Test: KSP performance - Poisson
    Input matrix: 27-pt finite difference stencil
    -n 100
    DoFs = 100
    Number of nonzeros = 26463592

Step1  - creating Vecs and Mat...
Step2a - running PCSetUp()...
[0]PETSC ERROR: 

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: configure us

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-22 Thread Paul Grosse-Bley

Hi Barry,

after using VecCUDAGetArray to initialize the RHS, that kernel still gets 
called as part of KSPSolve instead of KSPSetup, but its runtime is way less 
significant than the cudaMemcpy before, so I guess I will leave it like this. 
Other than that I kept the code like in my first message in this thread (as you 
wrote, benchmark_ksp.c is not well suited for PCMG).

The profiling results for PCMG and PCAMG look as I would expect them to look, 
i.e. one can nicely see the GPU load/kernel runtimes going down and up again 
for each V-cycle.

I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make 
any difference. I would have expected to be able to increase the number of 
V-cycles per KSP iteration, but I keep seeing just a single V-cycle when 
changing the option (using PCMG).

When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between 
bench iterations to reset the solution vector does not seem to work as the 
residual keeps shrinking. Is this a bug? Any advice for working around this?

The profile for BoomerAMG also doesn't really show the V-cycle behavior of the 
other implementations. Most of the runtime seems to go into calls to 
cusparseDcsrsv which might happen at the different levels, but the runtime of 
these kernels doesn't show the V-cycle pattern. According to the output with 
-pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I 
guess it is alright (and if not, this is probably the wrong place to discuss 
it).

When using PCAMGX, I see two PCApply calls in KSPSolve (three for the very
first KSPSolve), each showing nice V-cycle behavior, while I expect just one.
Each KSPSolve should do a single preconditioned Richardson iteration. Why is
the preconditioner applied multiple times here?
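
For reference, this is roughly how I configure the outer solver (sketch); one
untested guess is that the residual norm of the default convergence test
accounts for an extra PCApply, in which case switching the norm off might be
worth trying:

  PetscCall(KSPSetType(ksp, KSPRICHARDSON));
  PetscCall(KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT, 1)); /* one iteration */
  /* untested idea: skip residual norms (and any PCApply they may trigger) */
  /* PetscCall(KSPSetNormType(ksp, KSP_NORM_NONE)); */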

Thank you,
Paul Große-Bley


On Monday, February 06, 2023 20:05 CET, Barry Smith  wrote:
It should not crash; take a look at the test cases at the bottom of the
file. You are likely correct: if the code, unfortunately, does use
DMCreateMatrix(), it will not work out of the box for geometric multigrid, so it
might be the wrong example for you.   I don't know what you mean about clever.
If you simply set the solution to zero at the beginning of the loop, then it
will just do the same solve multiple times. The setup should not do much of
anything after the first solve.  Though usually solves are big enough that
one need not run solves multiple times to get a good understanding of their
performance.

On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley  wrote:

Hi Barry,

src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, 
thank you! Sadly I get a segfault when executing that example with PCMG and 
more than one level, i.e. with the minimal args:

$ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
===
Test: KSP performance - Poisson
    Input matrix: 27-pt finite difference stencil
    -n 100
    DoFs = 100
    Number of nonzeros = 26463592

Step1  - creating Vecs and Mat...
Step2a - running PCSetUp()...
[0]PETSC ERROR: 

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing 
the crash.
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--

Since the matrix is not created using DMDACreate3d, I expected it to fail due to
the missing geometric information, but I expected a more graceful failure than a
segfault.
I will try to combine bench_kspsolve.c with ex45.c to get easy MG
preconditioning, especially since I am interested in the 7-pt stencil for now.

Concerning my benchmarking loop from before: Is it generally discouraged to
benchmark KSPSolve this way because PETSc cleverly/lazily skips some of the work
when doing the same solve multiple times, or are the solves simply not iterated
in bench_kspsolve.c (while the MatMults are, with -matmult) just to keep the
runtime short?

Thanks,
Paul

On Monday, February 06, 2023 17:04 CET, Barry Smith  wrote:
 Paul,    I think

Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-06 Thread Paul Grosse-Bley

Hi Barry,

src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, 
thank you! Sadly I get a segfault when executing that example with PCMG and 
more than one level, i.e. with the minimal args:

$ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
===
Test: KSP performance - Poisson
    Input matrix: 27-pt finite difference stencil
    -n 100
    DoFs = 100
    Number of nonzeros = 26463592

Step1  - creating Vecs and Mat...
Step2a - running PCSetUp()...
[0]PETSC ERROR: 

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing 
the crash.
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--

Since the matrix is not created using DMDACreate3d, I expected it to fail due to
the missing geometric information, but I expected a more graceful failure than a
segfault.
I will try to combine bench_kspsolve.c with ex45.c to get easy MG
preconditioning, especially since I am interested in the 7-pt stencil for now.
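
Roughly what I have in mind for that combination (a sketch along the lines of
ex45; ComputeMatrix and ComputeRHS would be taken from there):

  DM  da;
  KSP ksp;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                         DMDA_STENCIL_STAR, 129, 129, 129, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                         1, 1, NULL, NULL, NULL, &da));
  PetscCall(DMSetFromOptions(da));
  PetscCall(DMSetUp(da));
  PetscCall(KSPSetDM(ksp, da));
  PetscCall(KSPSetComputeOperators(ksp, ComputeMatrix, NULL)); /* 7-pt stencil, as in ex45 */
  PetscCall(KSPSetComputeRHS(ksp, ComputeRHS, NULL));
  PetscCall(KSPSetFromOptions(ksp)); /* so -pc_type mg picks up the DMDA hierarchy */
  PetscCall(KSPSetUp(ksp));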

Concerning my benchmarking loop from before: Is it generally discouraged to
benchmark KSPSolve this way because PETSc cleverly/lazily skips some of the work
when doing the same solve multiple times, or are the solves simply not iterated
in bench_kspsolve.c (while the MatMults are, with -matmult) just to keep the
runtime short?

Thanks,
Paul

On Monday, February 06, 2023 17:04 CET, Barry Smith  wrote:
Paul,

   I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be used
for simple benchmarking. You can use VecCUDAGetArray() to access the GPU memory
of the vector and then call a CUDA kernel to compute the right-hand-side vector
directly on the GPU.

   Barry

On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley  wrote:

Hi,

I want to compare different implementations of multigrid solvers for Nvidia 
GPUs using the poisson problem (starting from ksp tutorial example ex45.c).
Therefore I am trying to get runtime results comparable to hpgmg-cuda 
(finite-volume), i.e. using multiple warmup and measurement solves and avoiding 
measuring setup time.
For now I am using -log_view with added stages:

  PetscLogStageRegister("Solve Bench", &solve_bench_stage);
  for (int i = 0; i < BENCH_SOLVES; i++) {
    PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
    PetscCall(KSPSetUp(ksp));                       // try to avoid setup overhead during solve
    PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done

    PetscLogStagePush(solve_bench_stage);
    PetscCall(KSPSolve(ksp, NULL, NULL));
    PetscLogStagePop();
  }

This snippet is preceded by a similar loop for warmup.

When profiling this using Nsight Systems, I see that the very first solve is
much slower, which mostly corresponds to H2D (host to device) copies and e.g.
cuBLAS setup (maybe also paging overheads as mentioned in the docs, but
probably insignificant in this case). The following solves have some overhead
at the start from a H2D copy of a vector (the RHS I guess, as the copy is
preceded by a matrix-vector product) in the first MatResidual call (call chain:
KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My
interpretation of the profiling results (i.e. the cuBLAS calls) is that that
vector is overwritten with the residual in Daxpy and therefore has to be copied
again for the next iteration.

Is there an elegant way of avoiding that H2D copy? I have seen some examples on
constructing matrices directly on the GPU, but nothing about vectors. Any
further tips for benchmarking (vs profiling) PETSc solvers? At the moment I am
using jacobi as smoother, but I would like to have a CUDA implementation of SOR
instead. Is there a good way of achieving that, e.g. using PCHYPRE's boomeramg
with a single level and the "SOR/Jacobi" smoother in PCMG? Or is the overhead
from constantly switching between PETSc and hypre too big?

Thanks,
Paul


[petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy

2023-02-06 Thread Paul Grosse-Bley

Hi,

I want to compare different implementations of multigrid solvers for Nvidia 
GPUs using the poisson problem (starting from ksp tutorial example ex45.c).
Therefore I am trying to get runtime results comparable to hpgmg-cuda 
(finite-volume), i.e. using multiple warmup and measurement solves and avoiding 
measuring setup time.
For now I am using -log_view with added stages:

  PetscLogStageRegister("Solve Bench", &solve_bench_stage);
  for (int i = 0; i < BENCH_SOLVES; i++) {
    PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
    PetscCall(KSPSetUp(ksp));                       // try to avoid setup overhead during solve
    PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done

    PetscLogStagePush(solve_bench_stage);
    PetscCall(KSPSolve(ksp, NULL, NULL));
    PetscLogStagePop();
  }

This snippet is preceded by a similar loop for warmup.

When profiling this using Nsight Systems, I see that the very first solve is
much slower, which mostly corresponds to H2D (host to device) copies and e.g.
cuBLAS setup (maybe also paging overheads as mentioned in the docs, but
probably insignificant in this case). The following solves have some overhead
at the start from a H2D copy of a vector (the RHS I guess, as the copy is
preceded by a matrix-vector product) in the first MatResidual call (call chain:
KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My
interpretation of the profiling results (i.e. the cuBLAS calls) is that that
vector is overwritten with the residual in Daxpy and therefore has to be copied
again for the next iteration.

Is there an elegant way of avoiding that H2D copy? I have seen some examples on
constructing matrices directly on the GPU, but nothing about vectors. Any
further tips for benchmarking (vs profiling) PETSc solvers? At the moment I am
using jacobi as smoother, but I would like to have a CUDA implementation of SOR
instead. Is there a good way of achieving that, e.g. using PCHYPRE's boomeramg
with a single level and the "SOR/Jacobi" smoother in PCMG? Or is the overhead
from constantly switching between PETSc and hypre too big?
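
The kind of options I was thinking of for that (only a guess at the relevant
option names, untested; in particular I am not sure whether boomeramg can be
restricted to a single level this way):

  -pc_type mg
  -mg_levels_ksp_type richardson
  -mg_levels_pc_type hypre
  -mg_levels_pc_hypre_type boomeramg
  -mg_levels_pc_hypre_boomeramg_relax_type_all SOR/Jacobi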

Thanks,
Paul


Re: [petsc-users] PCMG Interpolation/Restriction Defaults

2023-02-02 Thread Paul Grosse-Bley

That is great to know! When I read that part of the docs, I thought it would be 
a big hassle to use geometric multigrid, having to manually define all those 
matrices.
Thanks again!
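
(For my own understanding, the per-level setup Barry describes below seems to
boil down to something like the following sketch, using the DMDA attached to
the KSP; heavily simplified, not the actual PCSetUp_MG code.)

  DM  dmf, dmc;
  Mat P;
  PetscCall(KSPGetDM(ksp, &dmf));                       /* fine-grid DMDA */
  PetscCall(DMCoarsen(dmf, MPI_COMM_NULL, &dmc));       /* geometric coarsening */
  PetscCall(DMCreateInterpolation(dmc, dmf, &P, NULL)); /* coarse-to-fine interpolation */
  PetscCall(PCMGSetInterpolation(pc, level, P));        /* restriction defaults to the transpose */
  PetscCall(MatDestroy(&P));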



On Thursday, February 02, 2023 16:38 CET, Barry Smith  wrote:
Yes, it pulls the information from the DM. See src/ksp/pc/impls/mg/mg.c,
the function PCSetUp_MG. Note this function has become horribly complex with
people adding all kinds of crazy ways to set things up, and the code should be
refactored to remove the complexity, but anyway you'll see things like
DMCoarsen(), DMCreateInterpolation() etc. used to create what the MG algorithm
needs.

   Barry

On Feb 2, 2023, at 9:26 AM, Grosse-Bley, Paul  wrote:

The documentation says that

> The multigrid routines, which determine the solvers and 
> interpolation/restriction operators that are used, are mandatory.

But using code that doesn't contain any multigrid-specific routines, e.g. ex45,
and running it with

-da_grid_x 129 -da_grid_y 129 -da_grid_z 129
-pc_type mg
-pc_mg_type multiplicative
-pc_mg_cycle_type v
-pc_mg_levels 5
...

seems to work fine and also not fall back to algebraic multigrid/PCGAMG, from
what I see with -log_view.

What interpolation/restriction operators are used by default with PCMG?
Is it cleverly using the geometric information from DA?

Thanks,
Paul