Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions
I previously thought that a custom Red-Black GS solution would be to specific for PETSc as it will only work for star-shaped stencils of width 1. For that specific operation, a custom kernel would probably be able to achieve significantly better performance as it would not have to load a mask from memory and one can make use of the very regular structure of the problem. We still wanted to use cuSPARSE here because it seemed to be more in line with the present GPU support of PETSc and using custom kernels in one library might be seen as unfair when comparing it to others. But if you are interested anyway, I will share my implementation if/when I get to implementing that. Best, Paul Grosse-Bley On Wednesday, March 01, 2023 16:51 CET, Barry Smith wrote: On Mar 1, 2023, at 10:30 AM, Paul Grosse-Bley wrote: Thank you for the detailed answer, Barry. I had hit a deadend on my side. If you wish to compare, for example, ex45.c with a code that does not incorporate the Dirichlet boundary nodes in the linear system you can just use 0 boundary conditions for both codes.Do you mean to implement the boundary conditions explicitly in e.g. hpgmg-cuda instead of using the ghosted cells for them? I don't know anything about hpgmg-cuda and what it means by "ghosted cells". I am just saying I think it is reasonable to use the style of ex45.c that retains Dirichlet unknowns in the global matrix (with zero on the those points) in PETSc to compare with other codes that may or may not do something different. But you need to use zero for Dirichlet points to ensure that the "funny" convergence rates do not happen at the beginning making the comparison unbalanced between the two codes. Do I go right in the assumption that the PCMG coarsening (using DMDAs geometric information) will cause the boundary condition on the coarser grids to be finite (>0)? In general even if the Dirichlet points on the fine grid are zero, during PCMG with DMDA as in ex45.c yes those "boundary" values on the coarser grid may end up during the iterative process as non-zero. But this is "harmless", just part of the algorithm but it does mean the convergence with a code that does not include those points will be different (not necessarily better or worse, just different). Ideally I would like to just use some kind of GPU-parallel (colored) SOR/Gauss-Seidel instead of Jacobi. One can relatively easily implement Red-Black GS using cuSPARSE's masked matrix vector products, but I have not found any information on implementing a custom preconditioner in PETSc. You can use https://petsc.org/release/docs/manualpages/PC/PCSHELL/ You can look at src/ksp/pc/impls/jacobi.c for detailed comments on what goes into a preconditioner object. If you write such a GPU-parallel (colored) SOR/Gauss-Seidel we would love to include it in PETSc. Note also https://petsc.org/release/docs/manualpages/KSP/KSPCHEBYSHEV/ and potentially other "polynomial preconditioners" are an alternative approach for having more powerful parallel smoothers. Best, Paul Grosse-Bley On Wednesday, March 01, 2023 05:38 CET, Barry Smith wrote: Ok, here is the situation. The command line options as given do not result in multigrid quality convergence in any of the runs; the error contraction factor is around .94 (meaning that for the modes that the multigrid algorithm does the worst on it only removes about 6 percent of them per iteration). But this is hidden by the initial right hand side for the linear system as written in ex45.c which has O(h) values on the boundary nodes and O(h^3) values on the interior nodes. The first iterations are largely working on the boundary residual and making great progress attacking that so that it looks like the one has a good error contraction factor. One then sees the error contraction factor start to get worse and worse for the later iterations. With the 0 on the boundary the iterations quickly get to the bad regime where the error contraction factor is near one. One can see this by using a -ksp_rtol 1.e-12 and having the MG code print the residual decrease for each iteration. Thought it appears the 0 boundary condition one converges much slower (since it requires many more iterations) if you factor out the huge advantage of the nonzero boundary condition case at the beginning (in terms of decreasing the residual) you see they both have an asymptotic error contraction factor of around .94 (which is horrible for multigrid). I now add -mg_levels_ksp_richardson_scale .9 -mg_coarse_ksp_richardson_scale .9 and rerun the two cases (nonzero and zero boundary right hand side) they take 35 and 41 iterations (much better) initial residual norm 14.6993next residual norm 0.84167 0.0572591next residual norm 0.0665392 0.00452668next residual norm 0.0307273 0.00209039next residual norm 0.0158949 0.00108134next residual norm
Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions
Thank you for the detailed answer, Barry. I had hit a deadend on my side. If you wish to compare, for example, ex45.c with a code that does not incorporate the Dirichlet boundary nodes in the linear system you can just use 0 boundary conditions for both codes.Do you mean to implement the boundary conditions explicitly in e.g. hpgmg-cuda instead of using the ghosted cells for them? Do I go right in the assumption that the PCMG coarsening (using DMDAs geometric information) will cause the boundary condition on the coarser grids to be finite (>0)? Ideally I would like to just use some kind of GPU-parallel (colored) SOR/Gauss-Seidel instead of Jacobi. One can relatively easily implement Red-Black GS using cuSPARSE's masked matrix vector products, but I have not found any information on implementing a custom preconditioner in PETSc. Best, Paul Grosse-Bley On Wednesday, March 01, 2023 05:38 CET, Barry Smith wrote: Ok, here is the situation. The command line options as given do not result in multigrid quality convergence in any of the runs; the error contraction factor is around .94 (meaning that for the modes that the multigrid algorithm does the worst on it only removes about 6 percent of them per iteration). But this is hidden by the initial right hand side for the linear system as written in ex45.c which has O(h) values on the boundary nodes and O(h^3) values on the interior nodes. The first iterations are largely working on the boundary residual and making great progress attacking that so that it looks like the one has a good error contraction factor. One then sees the error contraction factor start to get worse and worse for the later iterations. With the 0 on the boundary the iterations quickly get to the bad regime where the error contraction factor is near one. One can see this by using a -ksp_rtol 1.e-12 and having the MG code print the residual decrease for each iteration. Thought it appears the 0 boundary condition one converges much slower (since it requires many more iterations) if you factor out the huge advantage of the nonzero boundary condition case at the beginning (in terms of decreasing the residual) you see they both have an asymptotic error contraction factor of around .94 (which is horrible for multigrid). I now add -mg_levels_ksp_richardson_scale .9 -mg_coarse_ksp_richardson_scale .9 and rerun the two cases (nonzero and zero boundary right hand side) they take 35 and 41 iterations (much better) initial residual norm 14.6993 next residual norm 0.84167 0.0572591 next residual norm 0.0665392 0.00452668 next residual norm 0.0307273 0.00209039 next residual norm 0.0158949 0.00108134 next residual norm 0.00825189 0.000561378 next residual norm 0.00428474 0.000291492 next residual norm 0.00222482 0.000151355 next residual norm 0.00115522 7.85898e-05 next residual norm 0.000599836 4.0807e-05 next residual norm 0.000311459 2.11887e-05 next residual norm 0.000161722 1.1002e-05 next residual norm 8.39727e-05 5.71269e-06 next residual norm 4.3602e-05 2.96626e-06 next residual norm 2.26399e-05 1.5402e-06 next residual norm 1.17556e-05 7.99735e-07 next residual norm 6.10397e-06 4.15255e-07 next residual norm 3.16943e-06 2.15617e-07 next residual norm 1.64569e-06 1.11957e-07 next residual norm 8.54511e-07 5.81326e-08 next residual norm 4.43697e-07 3.01848e-08 next residual norm 2.30385e-07 1.56732e-08 next residual norm 1.19625e-07 8.13815e-09 next residual norm 6.21143e-08 4.22566e-09 next residual norm 3.22523e-08 2.19413e-09 next residual norm 1.67467e-08 1.13928e-09 next residual norm 8.69555e-09 5.91561e-10 next residual norm 4.51508e-09 3.07162e-10 next residual norm 2.34441e-09 1.59491e-10 next residual norm 1.21731e-09 8.28143e-11 next residual norm 6.32079e-10 4.30005e-11 next residual norm 3.28201e-10 2.23276e-11 next residual norm 1.70415e-10 1.15934e-11 next residual norm 8.84865e-11 6.01976e-12 next residual norm 4.59457e-11 3.1257e-12 next residual norm 2.38569e-11 1.62299e-12 next residual norm 1.23875e-11 8.42724e-13 Linear solve converged due to CONVERGED_RTOL iterations 35 Residual norm 1.23875e-11 initial residual norm 172.601 next residual norm 154.803 0.896887 next residual norm 66.9409 0.387837 next residual norm 34.4572 0.199636 next residual norm 17.8836 0.103612 next residual norm 9.28582 0.0537995 next residual norm 4.82161 0.027935 next residual norm 2.50358 0.014505 next residual norm 1.29996 0.0075316 next residual norm 0.674992 0.00391071 next residual norm 0.350483 0.0020306 next residual norm 0.181985 0.00105437 next residual norm 0.094494 0.000547472 next residual norm 0.0490651 0.000284269 next residual norm 0.0254766 0.000147604 next residual norm 0.0132285 7.6642e-05 next residual norm 0.00686876 3.97956e-05 next residual norm 0.00356654 2.06635e-05 next residual norm 0.00185189 1.07293e-05 next residual norm 0.000961576 5.5711e-06 next residual norm 0.000499289 2.89274e-06 next residual norm 0.000259251 1.50203e
Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions
The scaling might be the problem, especially since I don't know what you mean by scaling it according to FE. For reproducing the issue with a smaller problem: Change the ComputeRHS function in ex45.c if (i == 0 || j == 0 || k == 0 || i == mx - 1 || j == my - 1 || k == mz - 1) { barray[k][j][i] = 0.0; } else { barray[k][j][i] = 1.0; } Change the dimensions to e.g. 33 (I scaled it down, so it goes quick without a GPU) instead of 7 and then run with -ksp_converged_reason -ksp_type richardson -ksp_rtol 1e-09 -pc_type mg -pc_mg_levels 3 -mg_levels_ksp_type richardson -mg_levels_ksp_max_it 6 -mg_levels_ksp_converged_maxits -mg_levels_pc_type jacobi -mg_coarse_ksp_type richardson -mg_coarse_ksp_max_it 6 -mg_coarse_ksp_converged_maxits -mg_coarse_pc_type jacobi You will find that it takes 145 iterations instead of 25 for the original ex45 RHS. My hpgmg-cuda implementation (using 32^3) takes 41 iterations. To what do I have to change the diagonal entries of the matrix for the boundary according to FE? Right now the diagonal is completely constant. Paul On Tuesday, February 28, 2023 00:23 CET, Barry Smith wrote: I have not seen explicitly including, or excluding, the Dirichlet boundary values in the system having a significant affect on the convergence so long as you SCALE the diagonal rows (of those Dirichlet points) by a value similar to the other entries along the diagonal. If they are scaled completely differently, that can screw up the convergence. For src/ksp/ksp/ex45.c I see that the appropriate scaling is used (note the scaling should come from a finite element view of the discretization even if the discretization is finite differences as is done in ex45.c) Are you willing to share the two codes so we can take a look with experienced eyes to try to figure out the difference? Barry > On Feb 27, 2023, at 5:48 PM, Paul Grosse-Bley > wrote: > > Hi Barry, > > the reason why I wanted to change to ghost boundaries is that I was worrying > about the effect of PCMGs coarsening on these boundary values. > > As mentioned before, I am trying to reproduce results from the hpgmg-cuda > benchmark (a modified version of it, e.g. using 2nd order instead of 4th > etc.). > I am trying to solve the Poisson equation -\nabla^2 u = 1 with u = 0 on the > boundary with rtol=1e-9. While my MG solver implemented in hpgmg solves this > in 40 V-cycles (I weakened it a lot by only doing smooths at the coarse level > instead of CG). When I run the "same" MG solver built in PETSc on this > problem, it starts out reducing the residual norm as fast or even faster for > the first 20-30 iterations. But for the last order of magnitude in the > residual norm it needs more than 300 V-cycles, i.e. it gets very slow. At > this point I am pretty much out of ideas about what is the cause, especially > since e.g. adding back cg at the coarsest level doesn't seem to change the > number of iterations at all. Therefore I am suspecting the discretization to > be the problem. HPGMG uses an even number of points per dimension (e.g. 256), > while PCMG wants an odd number (e.g. 257). So I also tried adding another > layer of boundary values for the discretization to effectively use only 254 > points per dimension. This caused the solver to get even slightly worse. > > So can the explicit boundary values screw with the coarsening, especially > when they are not finite? Because with the problem as stated in ex45 with > finite (i.e. non-zero) boundary values, the MG solver takes only 18 V-cycles. > > Best, > Paul > > > > On Monday, February 27, 2023 18:17 CET, Barry Smith wrote: > >> >> Paul, >> >> DM_BOUNDARY_GHOSTED would result in the extra ghost locations in the local >> vectors (obtained with DMCreateLocalVector() but they will not appear in the >> global vectors obtained with DMCreateGlobalVector(); perhaps this is the >> issue? Since they do not appear in the global vector they will not appear in >> the linear system so there will be no diagonal entries for you to set since >> those rows/columns do not exist in the linear system. In other words, using >> DM_BOUNDARY_GHOSTED is a way to avoid needing to put the Dirichlet values >> explicitly into the system being solved; DM_BOUNDARY_GHOSTED is generally >> more helpful for nonlinear systems than linear systems. >> >> Barry >> >> > On Feb 27, 2023, at 12:08 PM, Paul Grosse-Bley >> > wrote: >> > >> > Hi, >> > >> > I would like to modify src/ksp/ksp/tutorials/ex45.c to implement Dirichlet >> > boundary conditions using DM_BOUNDARY_GHOSTED instead of using >> > DM_BOUNDARY_NONE and explicitly implementing the boundary by adding >> > diagnonal-only rows
Re: [petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions
Hi Barry, the reason why I wanted to change to ghost boundaries is that I was worrying about the effect of PCMGs coarsening on these boundary values. As mentioned before, I am trying to reproduce results from the hpgmg-cuda benchmark (a modified version of it, e.g. using 2nd order instead of 4th etc.). I am trying to solve the Poisson equation -\nabla^2 u = 1 with u = 0 on the boundary with rtol=1e-9. While my MG solver implemented in hpgmg solves this in 40 V-cycles (I weakened it a lot by only doing smooths at the coarse level instead of CG). When I run the "same" MG solver built in PETSc on this problem, it starts out reducing the residual norm as fast or even faster for the first 20-30 iterations. But for the last order of magnitude in the residual norm it needs more than 300 V-cycles, i.e. it gets very slow. At this point I am pretty much out of ideas about what is the cause, especially since e.g. adding back cg at the coarsest level doesn't seem to change the number of iterations at all. Therefore I am suspecting the discretization to be the problem. HPGMG uses an even number of points per dimension (e.g. 256), while PCMG wants an odd number (e.g. 257). So I also tried adding another layer of boundary values for the discretization to effectively use only 254 points per dimension. This caused the solver to get even slightly worse. So can the explicit boundary values screw with the coarsening, especially when they are not finite? Because with the problem as stated in ex45 with finite (i.e. non-zero) boundary values, the MG solver takes only 18 V-cycles. Best, Paul On Monday, February 27, 2023 18:17 CET, Barry Smith wrote: Paul, DM_BOUNDARY_GHOSTED would result in the extra ghost locations in the local vectors (obtained with DMCreateLocalVector() but they will not appear in the global vectors obtained with DMCreateGlobalVector(); perhaps this is the issue? Since they do not appear in the global vector they will not appear in the linear system so there will be no diagonal entries for you to set since those rows/columns do not exist in the linear system. In other words, using DM_BOUNDARY_GHOSTED is a way to avoid needing to put the Dirichlet values explicitly into the system being solved; DM_BOUNDARY_GHOSTED is generally more helpful for nonlinear systems than linear systems. Barry > On Feb 27, 2023, at 12:08 PM, Paul Grosse-Bley > wrote: > > Hi, > > I would like to modify src/ksp/ksp/tutorials/ex45.c to implement Dirichlet > boundary conditions using DM_BOUNDARY_GHOSTED instead of using > DM_BOUNDARY_NONE and explicitly implementing the boundary by adding > diagnonal-only rows. > > My assumption was that with DM_BOUNDARY_GHOSTED all vectors from that DM have > the extra memory for the ghost entries and that I can basically use > DMDAGetGhostCorners instead of DMDAGetCorners to access the array gotten via > DMDAVecGetArray. But when I access (gxs, gys, gzs) = (-1,-1,-1) I get a > segmentation fault. When looking at the implementation of DMDAVecGetArray it > looked to me as if accessing (-1, -1, -1) should work as DMDAVecGetArray > passes the ghost corners to VecGetArray3d which then adds the right offsets. > > I could not find any example using DM_BOUNDARY_GHOSTED and then actually > accessing the ghost/boundary elements. Can I assume that they are set to zero > for the solution vector, i.e. the u=0 on \del\Omega and I do not need to > access them at all? > > Best, > Paul Große-Bley
[petsc-users] How to use DM_BOUNDARY_GHOSTED for Dirichlet boundary conditions
Hi, I would like to modify src/ksp/ksp/tutorials/ex45.c to implement Dirichlet boundary conditions using DM_BOUNDARY_GHOSTED instead of using DM_BOUNDARY_NONE and explicitly implementing the boundary by adding diagnonal-only rows. My assumption was that with DM_BOUNDARY_GHOSTED all vectors from that DM have the extra memory for the ghost entries and that I can basically use DMDAGetGhostCorners instead of DMDAGetCorners to access the array gotten via DMDAVecGetArray. But when I access (gxs, gys, gzs) = (-1,-1,-1) I get a segmentation fault. When looking at the implementation of DMDAVecGetArray it looked to me as if accessing (-1, -1, -1) should work as DMDAVecGetArray passes the ghost corners to VecGetArray3d which then adds the right offsets. I could not find any example using DM_BOUNDARY_GHOSTED and then actually accessing the ghost/boundary elements. Can I assume that they are set to zero for the solution vector, i.e. the u=0 on \del\Omega and I do not need to access them at all? Best, Paul Große-Bley
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
I thought to have observed the number of cycles with -pc_mg_multiplicative_cycles to be dependent on rtol. But I might have seen this with maxits=0 which would explain my missunderstanding of richardson. I guess PCAMGX does not use this PCApplyRichardson_MG (yet?). Because I still see the multiple PCApplys there with maxits=1 and richardson, while they are gone for PCMG (due to getting rid of KSPSetComputeInitialGuess?). Best, Paul Grosse-Bley On Wednesday, February 22, 2023 23:20 CET, Barry Smith wrote: Preonly means exactly one application of the PC so it will never converge by itself unless the PC is a full solver. Note there is a PCApplyRichardson_MG() that gets used automatically with KSPRICHARSON. This does not have an"extra" application of the preconditioner so 2 iterations of Richardson with MG will use 2 applications of the V-cycle. So it is exactly "multigrid as a solver, without a Krylov method", no extra work. So I don't think you need to make any "compromises". Barry On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley wrote: Hi again, I now found out that 1. preonly ignores -ksp_pc_side right (makes sense, I guess). 2. richardson is incompatible with -ksp_pc_side right. 3. preonly gives less output for -log_view -pc_mg_log than richardson. 4. preonly also ignores -ksp_rtol etc.. 5. preonly causes -log_view to measure incorrect timings for custom stages, i.e. the time for the stage (219us) is significantly shorter than the time for the KSPSolve inside the stage (~40ms). Number 4 will be problematic as I want to benchmark number of V-cycles and runtime for a given rtol. At the same time I want to avoid richardson now because of number 2 and the additional work of scaling the RHS. Is there any good way of just using MG V-cycles as a solver, i.e. without interference from an outer Krylov solver and still iterate until convergence? Or will I just have to accept the additional V-cycle due to the left application of th PC with richardson? I guess I could also manually change -pc_mg_multiplicative_cycles until the residual gets low enough (using preonly), but that seems very inefficient. Best, Paul Große-Bley On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" wrote: I was using the Richardson KSP type which I guess has the same behavior as GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use preonly from now on, where maxits=1 does what I want it to do. Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think for now all my problems are resolved for now. Thank you very much, Barry and Mark! Best, Paul Große-Bley On Wednesday, February 22, 2023 21:03 CET, Barry Smith wrote: On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley wrote: Hi Barry, I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual was already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then independent of pc_mg_multiplicative_cycles stop if it is converged. Looking at iterations that are not converged with PCMG, pc_mg_multiplicative_cycles works fine. At these iterations I also see the multiple calls to PCApply in a single KSPSolve iteration which were throwing me off with PCAMGX before. The reason for these multiple applications of the preconditioner (tested for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be better documented, I think. I do not understand what you are talking about with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one iteration, 0 is kind of meaningless. The reason that there is a PCApply at the start of the solve is because by default the KSPType is KSPGMRES which by default uses left preconditioner which means the right hand side needs to be scaled by the preconditioner before the KSP process starts. So in this configuration one KSP iteration results in 2 PCApply. You can use -ksp_pc_side right to use right preconditioning and then the number of PCApply will match the number of KSP iterations. Best, Paul Große-Bley On Wednesday, February 22, 2023 20:15 CET, Barry Smith wrote: On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley wrote: Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers. The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occ
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi again, I now found out that 1. preonly ignores -ksp_pc_side right (makes sense, I guess). 2. richardson is incompatible with -ksp_pc_side right. 3. preonly gives less output for -log_view -pc_mg_log than richardson. 4. preonly also ignores -ksp_rtol etc.. 5. preonly causes -log_view to measure incorrect timings for custom stages, i.e. the time for the stage (219us) is significantly shorter than the time for the KSPSolve inside the stage (~40ms). Number 4 will be problematic as I want to benchmark number of V-cycles and runtime for a given rtol. At the same time I want to avoid richardson now because of number 2 and the additional work of scaling the RHS. Is there any good way of just using MG V-cycles as a solver, i.e. without interference from an outer Krylov solver and still iterate until convergence? Or will I just have to accept the additional V-cycle due to the left application of th PC with richardson? I guess I could also manually change -pc_mg_multiplicative_cycles until the residual gets low enough (using preonly), but that seems very inefficient. Best, Paul Große-Bley On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" wrote: I was using the Richardson KSP type which I guess has the same behavior as GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use preonly from now on, where maxits=1 does what I want it to do. Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think for now all my problems are resolved for now. Thank you very much, Barry and Mark! Best, Paul Große-Bley On Wednesday, February 22, 2023 21:03 CET, Barry Smith wrote: On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley wrote: Hi Barry, I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual was already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then independent of pc_mg_multiplicative_cycles stop if it is converged. Looking at iterations that are not converged with PCMG, pc_mg_multiplicative_cycles works fine. At these iterations I also see the multiple calls to PCApply in a single KSPSolve iteration which were throwing me off with PCAMGX before. The reason for these multiple applications of the preconditioner (tested for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be better documented, I think. I do not understand what you are talking about with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one iteration, 0 is kind of meaningless. The reason that there is a PCApply at the start of the solve is because by default the KSPType is KSPGMRES which by default uses left preconditioner which means the right hand side needs to be scaled by the preconditioner before the KSP process starts. So in this configuration one KSP iteration results in 2 PCApply. You can use -ksp_pc_side right to use right preconditioning and then the number of PCApply will match the number of KSP iterations. Best, Paul Große-Bley On Wednesday, February 22, 2023 20:15 CET, Barry Smith wrote: On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley wrote: Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers. The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied which is also visible in Nsight Systems. Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not understanding why it would not work for you. If you use and don't use the option are the exact same counts listed for all the events in the -log_view ? I run only a single MPI rank with a single GPU, so profiling is straighforward. Best, Paul Große-Bley On Wednesday, February 22, 2023 18:24 CET, Mark Adams wrote: On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley wrote:Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetup, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG). The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle. I was wondering about -pc_mg_multiplicative
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
I was using the Richardson KSP type which I guess has the same behavior as GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use preonly from now on, where maxits=1 does what I want it to do. Even BoomerAMG now shows the "v-cycle signature" I was looking for, so I think for now all my problems are resolved for now. Thank you very much, Barry and Mark! Best, Paul Große-Bley On Wednesday, February 22, 2023 21:03 CET, Barry Smith wrote: On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley wrote: Hi Barry, I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual was already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then independent of pc_mg_multiplicative_cycles stop if it is converged. Looking at iterations that are not converged with PCMG, pc_mg_multiplicative_cycles works fine. At these iterations I also see the multiple calls to PCApply in a single KSPSolve iteration which were throwing me off with PCAMGX before. The reason for these multiple applications of the preconditioner (tested for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be better documented, I think. I do not understand what you are talking about with regard to maxits of 1 instead of 0. For KSP maxits of 1 means one iteration, 0 is kind of meaningless. The reason that there is a PCApply at the start of the solve is because by default the KSPType is KSPGMRES which by default uses left preconditioner which means the right hand side needs to be scaled by the preconditioner before the KSP process starts. So in this configuration one KSP iteration results in 2 PCApply. You can use -ksp_pc_side right to use right preconditioning and then the number of PCApply will match the number of KSP iterations. Best, Paul Große-Bley On Wednesday, February 22, 2023 20:15 CET, Barry Smith wrote: On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley wrote: Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers. The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied which is also visible in Nsight Systems. Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not understanding why it would not work for you. If you use and don't use the option are the exact same counts listed for all the events in the -log_view ? I run only a single MPI rank with a single GPU, so profiling is straighforward. Best, Paul Große-Bley On Wednesday, February 22, 2023 18:24 CET, Mark Adams wrote: On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley wrote:Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetup, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG). The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle. I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG). How are you seeing this? You might try -log_trace to see if you get two V cycles. When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work as the residual keeps shrinking. Is this a bug? Any advice for working around this? Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the solution vector. The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi Barry, the picture keeps getting clearer. I did not use KSPSetInitialGuessNonzero or the corresponding option, but using KSPSetComputeInitialGuess probably sets it automatically (without telling one in the output of -help). I was also confused by the preonly KSP type not working which is also caused by this. I think ex45 should be changed to use a non-zero initial guess (plus maybe a comment mentioning that one should avoid KSPSetComputeInitialGuess when using the zero initial guess). Thank you for asking the right questions, Paul Große-Bley On Wednesday, February 22, 2023 20:46 CET, Barry Smith wrote: On Feb 22, 2023, at 2:19 PM, Paul Grosse-Bley wrote: Hi again, after checking with -ksp_monitor for PCMG, it seems my assumption that I could reset the solution by calling KSPSetComputeInitialGuess and then KSPSetupwas generally wrong and BoomerAMG was just the only preconditioner that cleverly stops doing work when the residual is already converged (which then caused me to find the shrinking residual and to think that it was doing something different to the other methods). Depending on your example the KSPSetComputeInitialGuess usage might not work. I suggest not using it for your benchmarking. But if you just call KSPSolve() multiple times it will zero the solution at the beginning of each KSPSolve() (unless you use KSPSetInitialGuessNonzero()). So if you want to run multiple solves for testing you should not need to do anything. This will be true for any use of KSPSolve() whether the PC is from PETSc, hypre or NVIDIA. So, how can I get KSP to use the function given through KSPSetComputeInitialGuess to reset the solution vector (without calling KSPReset which would add a lot of overhead, I assume)? Best, Paul Große-Bley On Wednesday, February 22, 2023 19:46 CET, Mark Adams wrote: OK, Nsight Systems is a good way to see what is going on. So all three of your solvers are not traversing the MG hierching with the correct logic.I don't know about hypre but PCMG and AMGx are pretty simple and AMGx dives into the AMGx library directly from out interface.Some things to try:* Use -options_left to make sure your options are being used (eg, spelling mistakes)* Use -ksp_view to see a human readable list of your solver parameters.* Use -log_trace to see if the correct methods are called. - PCMG calls PCMGMCycle_Private for each of the cycle in code like: for (i = 0; i < mg->cyclesperpcapply; i++) PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp, NULL));- AMGx is called PCApply_AMGX and then it dives into the library. See where these three calls to AMGx are called from. Mark On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley wrote:Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers. The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied which is also visible in Nsight Systems. I run only a single MPI rank with a single GPU, so profiling is straighforward. Best, Paul Große-Bley On Wednesday, February 22, 2023 18:24 CET, Mark Adams wrote: On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley wrote:Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetup, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG). The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle. I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG). How are you seeing this? You might try -log_trace to see if you get two V cycles. When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work as the residual keeps shrinking. Is this a bug? Any advice for working around this? Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the solution vector. The profile fo
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi Barry, I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual was already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then independent of pc_mg_multiplicative_cycles stop if it is converged. Looking at iterations that are not converged with PCMG, pc_mg_multiplicative_cycles works fine. At these iterations I also see the multiple calls to PCApply in a single KSPSolve iteration which were throwing me off with PCAMGX before. The reason for these multiple applications of the preconditioner (tested for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be better documented, I think. Best, Paul Große-Bley On Wednesday, February 22, 2023 20:15 CET, Barry Smith wrote: On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley wrote: Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers. The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied which is also visible in Nsight Systems. Hmm, I run an example with -pc_mg_multiplicative_cycles 2 and most definitely it changes the run. I am not understanding why it would not work for you. If you use and don't use the option are the exact same counts listed for all the events in the -log_view ? I run only a single MPI rank with a single GPU, so profiling is straighforward. Best, Paul Große-Bley On Wednesday, February 22, 2023 18:24 CET, Mark Adams wrote: On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley wrote:Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetup, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG). The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle. I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG). How are you seeing this? You might try -log_trace to see if you get two V cycles. When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work as the residual keeps shrinking. Is this a bug? Any advice for working around this? Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the solution vector. The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it). When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) calls in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here? Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much.Note some KSP methods apply to the PC before the iteration. Mark Thank you, Paul Große-Bley On Monday, February 06, 2023 20:05 CET, Barry Smith wrote: It should not crash, take a look at the test cases at the bottom of the file. You are likely correct if the code, unfortunately, does use DMCreateMatrix() it will not work out of the box for geometric multigrid. So it might be the wrong example for you. I don't know what you mean about clever. If you simply set the solution to zero at the beginning of the loop then it will just do the same solve multiple times. The setup should not do much of anything after the first solver. Thought usually solves are big enough that one need not run solves multiple times to get a good understanding of thei
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi again, after checking with -ksp_monitor for PCMG, it seems my assumption that I could reset the solution by calling KSPSetComputeInitialGuess and then KSPSetupwas generally wrong and BoomerAMG was just the only preconditioner that cleverly stops doing work when the residual is already converged (which then caused me to find the shrinking residual and to think that it was doing something different to the other methods). So, how can I get KSP to use the function given through KSPSetComputeInitialGuess to reset the solution vector (without calling KSPReset which would add a lot of overhead, I assume)? Best, Paul Große-Bley On Wednesday, February 22, 2023 19:46 CET, Mark Adams wrote: OK, Nsight Systems is a good way to see what is going on. So all three of your solvers are not traversing the MG hierching with the correct logic.I don't know about hypre but PCMG and AMGx are pretty simple and AMGx dives into the AMGx library directly from out interface.Some things to try:* Use -options_left to make sure your options are being used (eg, spelling mistakes)* Use -ksp_view to see a human readable list of your solver parameters.* Use -log_trace to see if the correct methods are called. - PCMG calls PCMGMCycle_Private for each of the cycle in code like: for (i = 0; i < mg->cyclesperpcapply; i++) PetscCall(PCMGMCycle_Private(pc, mglevels + levels - 1, transpose, matapp, NULL));- AMGx is called PCApply_AMGX and then it dives into the library. See where these three calls to AMGx are called from. Mark On Wed, Feb 22, 2023 at 1:10 PM Paul Grosse-Bley wrote:Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers. The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied which is also visible in Nsight Systems. I run only a single MPI rank with a single GPU, so profiling is straighforward. Best, Paul Große-Bley On Wednesday, February 22, 2023 18:24 CET, Mark Adams wrote: On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley wrote:Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetup, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG). The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle. I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG). How are you seeing this? You might try -log_trace to see if you get two V cycles. When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work as the residual keeps shrinking. Is this a bug? Any advice for working around this? Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the solution vector. The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it). When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) calls in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here? Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much.Note some KSP methods apply to the PC before the iteration. Mark Thank you, Paul Große-Bley On Monday, February 06, 2023 20:05 CET, Barry Smith wrote: It should not crash, take a look at the test cases at the bottom of the file. You are likely correct if the code, unfortunately, does use DMCreateMat
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi Mark, I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers. The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied which is also visible in Nsight Systems. I run only a single MPI rank with a single GPU, so profiling is straighforward. Best, Paul Große-Bley On Wednesday, February 22, 2023 18:24 CET, Mark Adams wrote: On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley wrote:Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetup, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG). The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle. I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG). How are you seeing this? You might try -log_trace to see if you get two V cycles. When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work as the residual keeps shrinking. Is this a bug? Any advice for working around this? Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out the solution vector. The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it). When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) calls in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here? Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much.Note some KSP methods apply to the PC before the iteration. Mark Thank you, Paul Große-Bley On Monday, February 06, 2023 20:05 CET, Barry Smith wrote: It should not crash, take a look at the test cases at the bottom of the file. You are likely correct if the code, unfortunately, does use DMCreateMatrix() it will not work out of the box for geometric multigrid. So it might be the wrong example for you. I don't know what you mean about clever. If you simply set the solution to zero at the beginning of the loop then it will just do the same solve multiple times. The setup should not do much of anything after the first solver. Thought usually solves are big enough that one need not run solves multiple times to get a good understanding of their performance. On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley wrote: Hi Barry, src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args: $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2 === Test: KSP performance - Poisson Input matrix: 27-pt finite difference stencil -n 100 DoFs = 100 Number of nonzeros = 26463592 Step1 - creating Vecs and Mat... Step2a - running PCSetUp()... [0]PETSC ERROR: [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/ [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors [0]PETSC ERROR: configure us
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi Barry, after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetup, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG). The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle. I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG). When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work as the residual keeps shrinking. Is this a bug? Any advice for working around this? The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations. Most of the runtime seems to go into calls to cusparseDcsrsv which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it). When using PCAMGX, I see two PCApply (each showing a nice V-cycle behavior) calls in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here? Thank you, Paul Große-Bley On Monday, February 06, 2023 20:05 CET, Barry Smith wrote: It should not crash, take a look at the test cases at the bottom of the file. You are likely correct if the code, unfortunately, does use DMCreateMatrix() it will not work out of the box for geometric multigrid. So it might be the wrong example for you. I don't know what you mean about clever. If you simply set the solution to zero at the beginning of the loop then it will just do the same solve multiple times. The setup should not do much of anything after the first solver. Thought usually solves are big enough that one need not run solves multiple times to get a good understanding of their performance. On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley wrote: Hi Barry, src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args: $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2 === Test: KSP performance - Poisson Input matrix: 27-pt finite difference stencil -n 100 DoFs = 100 Number of nonzeros = 26463592 Step1 - creating Vecs and Mat... Step2a - running PCSetUp()... [0]PETSC ERROR: [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/ [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run [0]PETSC ERROR: to get more information on the crash. [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash. -- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 59. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -- As the matrix is not created using DMDACreate3d I expected it to fail due to the missing geometric information, but I expected it to fail more gracefully than with a segfault. I will try to combine bench_kspsolve.c with ex45.c to get easy MG preconditioning, especially since I am interested in the 7pt stencil for now. Concerning my benchmarking loop from before: Is it generally discouraged to do this for KSPSolve due to PETSc cleverly/lazily skipping some of the work when doing the same solve multiple times or are the solves not iterated in bench_kspsolve.c (while the MatMuls are with -matmult) just to keep the runtime short? Thanks, Paul On Monday, February 06, 2023 17:04 CET, Barry Smith wrote: Paul, I think
Re: [petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi Barry, src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args: $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2 === Test: KSP performance - Poisson Input matrix: 27-pt finite difference stencil -n 100 DoFs = 100 Number of nonzeros = 26463592 Step1 - creating Vecs and Mat... Step2a - running PCSetUp()... [0]PETSC ERROR: [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/ [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run [0]PETSC ERROR: to get more information on the crash. [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash. -- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 59. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -- As the matrix is not created using DMDACreate3d I expected it to fail due to the missing geometric information, but I expected it to fail more gracefully than with a segfault. I will try to combine bench_kspsolve.c with ex45.c to get easy MG preconditioning, especially since I am interested in the 7pt stencil for now. Concerning my benchmarking loop from before: Is it generally discouraged to do this for KSPSolve due to PETSc cleverly/lazily skipping some of the work when doing the same solve multiple times or are the solves not iterated in bench_kspsolve.c (while the MatMuls are with -matmult) just to keep the runtime short? Thanks, Paul On Monday, February 06, 2023 17:04 CET, Barry Smith wrote: Paul, I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be used for simple benchmarking. You can use VecCudaGetArray() to access the GPU memory of the vector and then call a CUDA kernel to compute the right hand side vector directly on the GPU. Barry On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley wrote: Hi, I want to compare different implementations of multigrid solvers for Nvidia GPUs using the poisson problem (starting from ksp tutorial example ex45.c). Therefore I am trying to get runtime results comparable to hpgmg-cuda (finite-volume), i.e. using multiple warmup and measurement solves and avoiding measuring setup time. For now I am using -log_view with added stages: PetscLogStageRegister("Solve Bench", _bench_stage); for (int i = 0; i < BENCH_SOLVES; i++) { PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done PetscLogStagePush(solve_bench_stage); PetscCall(KSPSolve(ksp, NULL, NULL)); PetscLogStagePop(); } This snippet is preceded by a similar loop for warmup. When profiling this using Nsight Systems, I see that the very first solve is much slower which mostly correspods to H2D (host to device) copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs, but probably insignificant in this case). The following solves have some overhead at the start from a H2D copy of a vector (the RHS I guess, as the copy is preceeded by a matrix-vector product) in the first MatResidual call (callchain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. cuBLAS calls) is that that vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration. Is there an elegant way of avoiding that H2D copy? I have seen some examples on constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (vs profiling) PETSc solvers? At the moment I am using jacobi as smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. using PCHYPREs boomeramg with a single level and "SOR/Jacobi"-smoother as smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big? Thanks, Paul
[petsc-users] MG on GPU: Benchmarking and avoiding vector host->device copy
Hi, I want to compare different implementations of multigrid solvers for Nvidia GPUs using the poisson problem (starting from ksp tutorial example ex45.c). Therefore I am trying to get runtime results comparable to hpgmg-cuda (finite-volume), i.e. using multiple warmup and measurement solves and avoiding measuring setup time. For now I am using -log_view with added stages: PetscLogStageRegister("Solve Bench", _bench_stage); for (int i = 0; i < BENCH_SOLVES; i++) { PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done PetscLogStagePush(solve_bench_stage); PetscCall(KSPSolve(ksp, NULL, NULL)); PetscLogStagePop(); } This snippet is preceded by a similar loop for warmup. When profiling this using Nsight Systems, I see that the very first solve is much slower which mostly correspods to H2D (host to device) copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs, but probably insignificant in this case). The following solves have some overhead at the start from a H2D copy of a vector (the RHS I guess, as the copy is preceeded by a matrix-vector product) in the first MatResidual call (callchain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. cuBLAS calls) is that that vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration. Is there an elegant way of avoiding that H2D copy? I have seen some examples on constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (vs profiling) PETSc solvers? At the moment I am using jacobi as smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. using PCHYPREs boomeramg with a single level and "SOR/Jacobi"-smoother as smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big? Thanks, Paul
Re: [petsc-users] PCMG Interpolation/Restriction Defaults
That is great to know! When I read that part of the docs, I thought it would be a big hassle to use geometric multigrid, having to manually define all those matrices. Thanks again! On Thursday, February 02, 2023 16:38 CET, Barry Smith wrote: Yes, it pulls the information from the DM. See src/ksp/pc/impls/mg/mg.c the function PCSetUp_MG. Note this function has become horribly complex with people adding all kinds of crazy ways to set things up, and the code should be refactored to remove the complexity but anyways you'll see things like DMCoarsen(), DMCreateInterpolation() etc used to create what the MG algorithm needs. Barry On Feb 2, 2023, at 9:26 AM, Grosse-Bley, Paul wrote: The documentation says that > The multigrid routines, which determine the solvers and > interpolation/restriction operators that are used, are mandatory. But using code that doesn't contain any multigrid-specific routines e.g. ex45 and running it with -da_grid_x 129 -da_grid_y 129 -da_grid_z 129 -pc_type mg -pc_mg_type multiplicative -pc_mg_cycle_type v -pc_mg_levels 5 ... seems to work fine and also not fall-back to algebraic multigrid/PCGAMG from what I see with -log_view. What interpolation/restriction operators are used by default with PCMG? Is it cleverly using the geometric information from DA? Thanks, Paul