[petsc-users] ML and -pc_factor_shift_nonzero
Hi Jed and Matt,

thanks a lot for your help and the interesting discussion.

Kathrin

Quoting Jed Brown <jed at 59a2.org>:

> On Mon, 19 Apr 2010 07:23:01 -0500, Matthew Knepley <knepley at gmail.com> wrote:
>> So, to see if I understand correctly: you are saying that you can get
>> away with more approximate solves if you do not do full reduction? I
>> know the theory for the case of Stokes, but can you prove this in a
>> general sense?
>
> The theory is relatively general (as much as preconditioned GMRES is)
> if you iterate in the full space with either block-diagonal or
> block-triangular preconditioners. Note that this formulation *never*
> involves explicit application of a Schur complement. Sometimes I get
> better convergence with one subcycle on the Schur complement with a
> very approximate inner solve (FGMRES outer). I'm not sure if Dave sees
> this; he seems to like doing a couple of subcycles in multigrid
> smoothers.
>
> The folks doing Q1-Q1 with ML are not doing *anything* with a Schur
> complement (approximate or otherwise). They just coarsen the full
> indefinite system and use ASM (overlap 0 or 1) with ILU to precondition
> the coupled system. This makes a certain amount of sense because, for
> those stabilized formulations, it is similar in spirit to a Vanka
> smoother (block SOR is a more precise analogue).
>
>> This sounds like the black magic I expect :)
>
> Yeah, this involves some sort of very local solve to produce aggregates
> and interpolations that are not transposes of each other (if I
> understood Ray and Eric correctly).
>
>> I still maintain that aggregation is a really crappy way to generate
>> coarse systems, especially for mixed elements. We should be generating
>> coarse systems geometrically, and then using a nice (maybe Black-Box)
>> framework for calculating good projectors.
>
> This whole framework doesn't work for mixed discretizations.
>
> Jed
[petsc-users] ML and -pc_factor_shift_nonzero
Hi Jed,

>> ML works now using, e.g.,
>> -mg_coarse_redundant_pc_factor_shift_type POSITIVE_DEFINITE. However,
>> it converges very slowly using the default REDUNDANT for the coarse
>> solve.
>
> Converges slowly or the coarse-level solve is expensive?

Hm, rather converges slowly. Using ML inside a preconditioner for the
Schur complement system, the overall outer system, preconditioned with
the approximate Schur complement preconditioner, converges slowly, if
you understand what I mean.

My particular problem is that the convergence rate depends strongly on
the number of processors. With one processor, using ML to precondition
the deeply inner system, the outer system converges in, e.g., 39
iterations. With np=10, however, it needs 69 iterations. Using HYPRE,
this number of iterations is independent of the number of processes (at
least for np up to 80), but the latter is (applied to this inner system,
not in general) slower and scales very badly. That's why I would like to
use ML.

Thinking about it, all this shouldn't have anything to do with the
choice of the direct solver for the coarse system inside ML (MUMPS or
PETSc's own), should it? The direct solver solves completely,
independently of the number of processes, and shouldn't influence the
effectiveness of ML, or am I wrong?

> I suggest starting with
>
>   -mg_coarse_pc_type lu -mg_coarse_pc_factor_mat_solver_package mumps
>
> or varying parameters in ML to see if you can make the coarse-level
> problem smaller without hurting the convergence rate. You can do
> semi-redundant solves if you scale processor counts beyond what MUMPS
> works well with.

Thanks. So MUMPS is supposed to be usually the fastest parallel direct
solver?

> Depending on what problem you are solving, ML could be producing a
> (nearly) singular coarse-level operator, in which case you can expect
> very confusing and inconsistent behavior.

Could this also be the reason for the degraded convergence rate when
going from 1 to 10 processors, even though the equation system remains
the same?

Thanks a lot,
Kathrin
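[A minimal way to see where the parallel runs diverge from the serial
one; a sketch, in which "./app" and the process count are placeholders
for the actual application:

  mpiexec -n 10 ./app -pc_type ml \
      -mg_coarse_redundant_pc_factor_shift_type POSITIVE_DEFINITE \
      -ksp_monitor_true_residual -ksp_converged_reason -ksp_view

Comparing the -ksp_view output for np=1 and np=10 shows whether the
assembled solver is actually the same apart from the partitioning.]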
[petsc-users] ML and -pc_factor_shift_nonzero
On Mon, Apr 19, 2010 at 6:29 AM, <tribur at vision.ee.ethz.ch> wrote:

> Hi Jed,
>
>>> ML works now using, e.g.,
>>> -mg_coarse_redundant_pc_factor_shift_type POSITIVE_DEFINITE.
>>> However, it converges very slowly using the default REDUNDANT for
>>> the coarse solve.
>>
>> Converges slowly or the coarse-level solve is expensive?
>
> Hm, rather converges slowly. Using ML inside a preconditioner for the
> Schur complement system, the overall outer system, preconditioned with
> the approximate Schur complement preconditioner, converges slowly, if
> you understand what I mean.
>
> My particular problem is that the convergence rate depends strongly on
> the number of processors. With one processor, using ML to precondition
> the deeply inner system, the outer system converges in, e.g., 39
> iterations. With np=10, however, it needs 69 iterations.

For Schur complement methods, the inner system usually has to be solved
very accurately. Are you accelerating a Krylov method for A^{-1}, or
just using ML itself? I would expect that, for the same linear system
tolerance, you get identical convergence for the same system,
independent of the number of processors.

   Matt

> Using HYPRE, this number of iterations is independent of the number of
> processes (at least for np up to 80), but the latter is (applied to
> this inner system, not in general) slower and scales very badly.
> That's why I would like to use ML.
>
> Thinking about it, all this shouldn't have anything to do with the
> choice of the direct solver for the coarse system inside ML (MUMPS or
> PETSc's own), should it? The direct solver solves completely,
> independently of the number of processes, and shouldn't influence the
> effectiveness of ML, or am I wrong?
>
>> I suggest starting with
>>
>>   -mg_coarse_pc_type lu -mg_coarse_pc_factor_mat_solver_package mumps
>>
>> or varying parameters in ML to see if you can make the coarse-level
>> problem smaller without hurting the convergence rate. You can do
>> semi-redundant solves if you scale processor counts beyond what MUMPS
>> works well with.
>
> Thanks. So MUMPS is supposed to be usually the fastest parallel direct
> solver?
>
>> Depending on what problem you are solving, ML could be producing a
>> (nearly) singular coarse-level operator, in which case you can expect
>> very confusing and inconsistent behavior.
>
> Could this also be the reason for the degraded convergence rate when
> going from 1 to 10 processors, even though the equation system remains
> the same?
>
> Thanks a lot,
> Kathrin

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead. -- Norbert Wiener
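[To make the distinction Matt is drawing concrete; a sketch, assuming
the inner A^{-1} solve has been given the hypothetical options prefix
"inner_". Accelerating ML with a Krylov method converged to a tight
tolerance would be

  -inner_ksp_type gmres -inner_ksp_rtol 1e-10 -inner_pc_type ml

whereas using ML by itself, i.e. one V-cycle per application, would be

  -inner_ksp_type preonly -inner_pc_type ml]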
[petsc-users] ML and -pc_factor_shift_nonzero
On Mon, 19 Apr 2010 13:29:40 +0200, tribur at vision.ee.ethz.ch wrote:

>>> ML works now using, e.g.,
>>> -mg_coarse_redundant_pc_factor_shift_type POSITIVE_DEFINITE.
>>> However, it converges very slowly using the default REDUNDANT for
>>> the coarse solve.
>>
>> Converges slowly or the coarse-level solve is expensive?
>
> Hm, rather converges slowly. Using ML inside a preconditioner for the
> Schur complement system, the overall outer system, preconditioned with
> the approximate Schur complement preconditioner, converges slowly, if
> you understand what I mean.

Sure, but the redundant coarse solve is a direct solve. It may be that
the shift (to make it nonsingular) makes it ineffective (and thus the
outer system converges slowly), but this is the same behavior you would
get with a non-redundant solve. I.e., it is the shift that causes the
problem, not REDUNDANT.

I don't know which flavor of Schur complement iteration you are
currently using. It is true that pure Schur complement reduction
requires high-accuracy inner solves; you may of course get away with
inexact inner solves if it is part of a full-space iteration. It's worth
comparing the number of iterations required to solve the inner
(advection-diffusion) block to a given tolerance in parallel and serial.

> My particular problem is that the convergence rate depends strongly on
> the number of processors. With one processor, using ML to precondition
> the deeply inner system, the outer system converges in, e.g., 39
> iterations. With np=10, however, it needs 69 iterations.

ML with defaults has a significant difference between serial and
parallel. Usually the scalability is acceptable from 2 processors up,
but the difference between one and two can be quite significant. You can
make it stronger, e.g. with

  -mg_levels_ksp_type gmres -mg_levels_ksp_max_it 1
  -mg_levels_pc_type asm -mg_levels_sub_pc_type ilu

> Using HYPRE, this number of iterations is independent of the number of
> processes (at least for np up to 80), but the latter is (applied to
> this inner system, not in general) slower and scales very badly.
> That's why I would like to use ML.
>
> Thinking about it, all this shouldn't have anything to do with the
> choice of the direct solver for the coarse system inside ML (MUMPS or
> PETSc's own), should it? The direct solver solves completely,
> independently of the number of processes, and shouldn't influence the
> effectiveness of ML, or am I wrong?

A shift makes it solve a somewhat different system. How different that
perturbed system is depends on the problem and the size of the shift.
MUMPS has more sophisticated ordering/pivoting schemes, so you should
use it if the coarse system demands it (you can also try different
ordering schemes in PETSc,
-mg_coarse_redundant_pc_factor_mat_ordering_type).

> Thanks. So MUMPS is supposed to be usually the fastest parallel direct
> solver?

Usually.

> Could this also be the reason for the degraded convergence rate when
> going from 1 to 10 processors, even though the equation system remains
> the same?

ML's aggregates change somewhat in parallel (I don't know how much; I
haven't investigated precisely what is different) and the smoothers are
all different. With a normal discretization of an elliptic system, it
would seem surprising for ML to produce nearly singular coarse-level
operators, in parallel or otherwise. But src/snes/examples/tutorials/ex48
exhibits pretty bad ML behavior: the coarse level isn't singular, but
the parallel aggregates with default smoothers don't converge despite
the system being SPD. ML is informed of translations but not rigid body
modes; I haven't investigated ML's troublesome modes for this problem,
so I don't know whether they are rigid body modes or something else.

Jed
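[Put together, the stronger-smoother suggestion above amounts to an
invocation along these lines; a sketch, in which the executable and
process count are placeholders and "nd" is just one of the ordering
types PETSc accepts:

  mpiexec -n 10 ./app -pc_type ml \
      -mg_levels_ksp_type gmres -mg_levels_ksp_max_it 1 \
      -mg_levels_pc_type asm -mg_levels_sub_pc_type ilu \
      -mg_coarse_redundant_pc_factor_mat_ordering_type nd]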
[petsc-users] ML and -pc_factor_shift_nonzero
On Mon, 19 Apr 2010 06:34:08 -0500, Matthew Knepley <knepley at gmail.com> wrote:
> For Schur complement methods, the inner system usually has to be
> solved very accurately. Are you accelerating a Krylov method for
> A^{-1}, or just using ML itself? I would expect that, for the same
> linear system tolerance, you get identical convergence for the same
> system, independent of the number of processors.

Matt, run ex48 with ML in parallel and serial: the aggregates are quite
different, and the parallel case doesn't converge with SOR.

Also, from talking with Ray, Eric Cyr, and John Shadid two weeks ago,
they are currently using ML on coupled Navier-Stokes systems and usually
beating block factorization (i.e., full-space iterations with
approximate-commutator Schur-complement preconditioners (PCD or LSC
variants), which in turn beat full Schur-complement reduction). They are
using Q1-Q1 with PSPG or Bochev stabilization, and SUPG for advection.

The trouble is that this method occasionally runs into problems where
convergence completely falls apart, despite not having extreme parameter
choices. ML has an energy-minimization option which they are using
(PETSc's interface doesn't currently support it; I'll add it if someone
doesn't beat me to it) and which is apparently crucial for generating
reasonable coarse levels for these systems. They always coarsen all the
degrees of freedom together, which is not possible with mixed finite
element spaces. So you have to trade the quality answers produced by a
stable approximation (along with the necessity of making subdomain and
coarse-level problems compatible with inf-sup) against the wiggle room
you get with stabilized non-mixed discretizations, with their possible
artifacts and significant divergence error.

Jed
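[For reference, one common form of the PCD (pressure
convection-diffusion) commutator approximation mentioned above,
following Elman, Silvester, and Wathen; conventions differ between
variants, so take this as a sketch. With momentum block $F$ and
divergence block $B$, the Schur complement $S = -B F^{-1} B^T$ is
approximated by

  \hat{S}^{-1} \approx -M_p^{-1} F_p A_p^{-1}

where $M_p$ is the pressure mass matrix, $A_p$ a pressure Laplacian, and
$F_p$ a convection-diffusion operator assembled in the pressure space.
Applying $\hat{S}^{-1}$ then costs one mass-matrix solve, one sparse
multiply, and one Poisson solve, all of which can themselves be
inexact.]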
[petsc-users] ML and -pc_factor_shift_nonzero
On Mon, Apr 19, 2010 at 7:12 AM, Jed Brown <jed at 59a2.org> wrote:

> On Mon, 19 Apr 2010 06:34:08 -0500, Matthew Knepley <knepley at gmail.com> wrote:
>> For Schur complement methods, the inner system usually has to be
>> solved very accurately. Are you accelerating a Krylov method for
>> A^{-1}, or just using ML itself? I would expect that, for the same
>> linear system tolerance, you get identical convergence for the same
>> system, independent of the number of processors.
>
> Matt, run ex48 with ML in parallel and serial: the aggregates are
> quite different, and the parallel case doesn't converge with SOR.
>
> Also, from talking with Ray, Eric Cyr, and John Shadid two weeks ago,
> they are currently using ML on coupled Navier-Stokes systems and
> usually beating block factorization (i.e., full-space iterations with
> approximate-commutator Schur-complement preconditioners (PCD or LSC
> variants), which in turn beat full Schur-complement reduction). They
> are using Q1-Q1 with PSPG or Bochev stabilization, and SUPG for
> advection.

So, to see if I understand correctly: you are saying that you can get
away with more approximate solves if you do not do full reduction? I
know the theory for the case of Stokes, but can you prove this in a
general sense?

> The trouble is that this method occasionally runs into problems where
> convergence completely falls apart, despite not having extreme
> parameter choices. ML has an energy-minimization option which they are
> using (PETSc's interface doesn't currently support it; I'll add it if
> someone doesn't beat me to it) and which is apparently crucial for
> generating reasonable coarse levels for these systems.

This sounds like the black magic I expect :)

> They always coarsen all the degrees of freedom together, which is not
> possible with mixed finite element spaces. So you have to trade the
> quality answers produced by a stable approximation (along with the
> necessity of making subdomain and coarse-level problems compatible
> with inf-sup) against the wiggle room you get with stabilized
> non-mixed discretizations, with their possible artifacts and
> significant divergence error.

I still maintain that aggregation is a really crappy way to generate
coarse systems, especially for mixed elements. We should be generating
coarse systems geometrically, and then using a nice (maybe Black-Box)
framework for calculating good projectors.

   Matt

> Jed

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead. -- Norbert Wiener
[petsc-users] ML and -pc_factor_shift_nonzero
On Mon, 19 Apr 2010 07:23:01 -0500, Matthew Knepley <knepley at gmail.com> wrote:
> So, to see if I understand correctly: you are saying that you can get
> away with more approximate solves if you do not do full reduction? I
> know the theory for the case of Stokes, but can you prove this in a
> general sense?

The theory is relatively general (as much as preconditioned GMRES is) if
you iterate in the full space with either block-diagonal or
block-triangular preconditioners. Note that this formulation *never*
involves explicit application of a Schur complement. Sometimes I get
better convergence with one subcycle on the Schur complement with a very
approximate inner solve (FGMRES outer). I'm not sure if Dave sees this;
he seems to like doing a couple of subcycles in multigrid smoothers.

The folks doing Q1-Q1 with ML are not doing *anything* with a Schur
complement (approximate or otherwise). They just coarsen the full
indefinite system and use ASM (overlap 0 or 1) with ILU to precondition
the coupled system. This makes a certain amount of sense because, for
those stabilized formulations, it is similar in spirit to a Vanka
smoother (block SOR is a more precise analogue).

> This sounds like the black magic I expect :)

Yeah, this involves some sort of very local solve to produce aggregates
and interpolations that are not transposes of each other (if I
understood Ray and Eric correctly).

> I still maintain that aggregation is a really crappy way to generate
> coarse systems, especially for mixed elements. We should be generating
> coarse systems geometrically, and then using a nice (maybe Black-Box)
> framework for calculating good projectors.

This whole framework doesn't work for mixed discretizations.

Jed
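[A sketch of the full-space formulation above, in standard saddle-point
notation and not specific to any code: for

  J = \begin{pmatrix} A & B^T \\ B & 0 \end{pmatrix}, \qquad
  P = \begin{pmatrix} A & B^T \\ 0 & \hat{S} \end{pmatrix}, \qquad
  \hat{S} \approx -B A^{-1} B^T,

applying $P^{-1}$ inside GMRES on the full space requires only
(approximate) solves with $A$ and $\hat{S}$; the exact Schur complement
is never formed or applied. That is why inexact inner solves are
admissible here, while pure Schur complement reduction needs accurate
ones.]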
[petsc-users] ML and -pc_factor_shift_nonzero
Dear Barry and Matt,

thanks for your helpful response.

ML works now using, e.g.,
-mg_coarse_redundant_pc_factor_shift_type POSITIVE_DEFINITE. However, it
converges very slowly using the default REDUNDANT for the coarse solve.
On 10 processors, e.g., even bjacobi plus -mg_coarse_ksp_max_it 10 works
better. What solver do you recommend for the coarse solve? Maybe
superlu?

Best regards,
Kathrin

Quoting Barry Smith <bsmith at mcs.anl.gov>:

> -mg_coarse_pc_factor_shift_nonzero, since it is the coarse level of
> the multigrid that is producing the zero pivot.
>
>    Barry
>
> On Apr 13, 2010, at 8:51 AM, Matthew Knepley wrote:
>
>> On Tue, Apr 13, 2010 at 2:49 PM, <tribur at vision.ee.ethz.ch> wrote:
>>
>>> Hi,
>>>
>>> using ML I got the error
>>>
>>>   [0]PETSC ERROR: Detected zero pivot in LU factorization
>>>
>>> As recommended at
>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html,
>>> I tried -pc_factor_shift_nonzero, but it doesn't have the desired
>>> effect using ML. How do I have to formulate the command-line option?
>>> What does -[level]_pc_factor_shift_nonzero mean? What other parallel
>>> preconditioner could I try besides Hypre/BoomerAMG or ML?
>>
>> This means the MG level, like 2. You can see all available options
>> using -help.
>>
>>    Matt
>>
>>> Thanks in advance for your precious help,
>>> Kathrin
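[For what it's worth, the bjacobi variant described above corresponds to
options along these lines; a sketch, in which the coarse KSP type gmres
is an assumption, since the default preonly would ignore
-mg_coarse_ksp_max_it:

  -pc_type ml -mg_coarse_ksp_type gmres -mg_coarse_ksp_max_it 10 \
      -mg_coarse_pc_type bjacobi -mg_coarse_sub_pc_type ilu]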
[petsc-users] ML and -pc_factor_shift_nonzero
On Fri, 16 Apr 2010 13:51:13 +0200, tribur at vision.ee.ethz.ch wrote:

> ML works now using, e.g.,
> -mg_coarse_redundant_pc_factor_shift_type POSITIVE_DEFINITE. However,
> it converges very slowly using the default REDUNDANT for the coarse
> solve.

Converges slowly or the coarse-level solve is expensive?

I suggest starting with

  -mg_coarse_pc_type lu -mg_coarse_pc_factor_mat_solver_package mumps

or varying parameters in ML to see if you can make the coarse-level
problem smaller without hurting the convergence rate. You can do
semi-redundant solves if you scale processor counts beyond what MUMPS
works well with.

Depending on what problem you are solving, ML could be producing a
(nearly) singular coarse-level operator, in which case you can expect
very confusing and inconsistent behavior.

Jed
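[Concretely, the first suggestion above is; a sketch, where "./app"
stands in for the application:

  mpiexec -n 10 ./app -pc_type ml \
      -mg_coarse_pc_type lu -mg_coarse_pc_factor_mat_solver_package mumps]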
[petsc-users] ML and -pc_factor_shift_nonzero
Hi,

using ML I got the error

  [0]PETSC ERROR: Detected zero pivot in LU factorization

As recommended at
http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html,
I tried -pc_factor_shift_nonzero, but it doesn't have the desired effect
using ML. How do I have to formulate the command-line option? What does
-[level]_pc_factor_shift_nonzero mean? What other parallel
preconditioner could I try besides Hypre/BoomerAMG or ML?

Thanks in advance for your precious help,
Kathrin
[petsc-users] ML and -pc_factor_shift_nonzero
On Tue, Apr 13, 2010 at 2:49 PM, <tribur at vision.ee.ethz.ch> wrote:

> Hi,
>
> using ML I got the error
>
>   [0]PETSC ERROR: Detected zero pivot in LU factorization
>
> As recommended at
> http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html,
> I tried -pc_factor_shift_nonzero, but it doesn't have the desired
> effect using ML. How do I have to formulate the command-line option?
> What does -[level]_pc_factor_shift_nonzero mean? What other parallel
> preconditioner could I try besides Hypre/BoomerAMG or ML?

This means the MG level, like 2. You can see all available options
using -help.

   Matt

> Thanks in advance for your precious help,
> Kathrin

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead. -- Norbert Wiener
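[A quick way to find the correctly prefixed form of the option; a
sketch, with the executable name a placeholder:

  ./app -pc_type ml -help | grep factor_shift]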
[petsc-users] ML and -pc_factor_shift_nonzero
-mg_coarse_pc_factor_shift_nonzero, since it is the coarse level of the
multigrid that is producing the zero pivot.

   Barry

On Apr 13, 2010, at 8:51 AM, Matthew Knepley wrote:

> On Tue, Apr 13, 2010 at 2:49 PM, <tribur at vision.ee.ethz.ch> wrote:
>
>> Hi,
>>
>> using ML I got the error
>>
>>   [0]PETSC ERROR: Detected zero pivot in LU factorization
>>
>> As recommended at
>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html,
>> I tried -pc_factor_shift_nonzero, but it doesn't have the desired
>> effect using ML. How do I have to formulate the command-line option?
>> What does -[level]_pc_factor_shift_nonzero mean? What other parallel
>> preconditioner could I try besides Hypre/BoomerAMG or ML?
>
> This means the MG level, like 2. You can see all available options
> using -help.
>
>    Matt
>
>> Thanks in advance for your precious help,
>> Kathrin
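[Putting the thread's resolution in one place; a sketch, with "./app" a
placeholder. The shift has to be attached to the coarse-level
factorization, and since the default coarse solve is REDUNDANT, the
inner factorization can also be addressed through the redundant prefix,
as used later in the thread:

  ./app -pc_type ml -mg_coarse_pc_factor_shift_nonzero
  ./app -pc_type ml -mg_coarse_redundant_pc_factor_shift_type POSITIVE_DEFINITE]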