most efficient way to use local CSR matrix?

2009-10-20 Thread Chris Kees
Hi,

Our unstructured finite element code does a fairly standard  
overlapping decomposition that allows it to calculate all the non-zero  
column entries for the rows owned by the processor (rows 0...ldim-1 in  
the local numbering).  We assemble them into a local CSR matrix and  
then copy them into a PETSc matrix like this:

for (int i=0; i<ldim; i++)
{
   irow[0] = i;
   MatSetValuesLocal(PETSCMAT(par_L),1,irow,rowptr[i+1]-rowptr[i],
                     &colind[rowptr[i]],&a[rowptr[i]],INSERT_VALUES);
}

where rowptr and colind are the CSR indexing arrays in the local  
numbering.

I would like to eliminate the duplicate memory (and the copy above).  
Is there a recommended way to let PETSc reference array 'a' directly  
or a way to get rid of 'a' and translate our CSR assembly routines to  
use low level PETSc data structures?

So far I've added MatMPIAIJSetPreallocationCSR to try to speed up the  
first solve, but the documentation is clear that 'a' is not a pointer  
to the AIJ storage so I still need the copy. I tried using  
MatCreateMPIAIJWithSplitArrays with the 'o' arrays set to zero but I  
got indexing errors, and the documentation really seems to imply that  
the local matrix is really not a standard CSR anyway. If that's true  
then it seems like the options I have are to write a shell matrix for  
parallel CSR storage or write some new assembly routines that use  
MatSetValuesLocal. I'm more concerned about the memory than the  
overhead of MatSetValuesLocal, but it would certainly be easier on me  
to use the CSR-based assembly routines we already have.
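For concreteness, the shell-matrix route would look roughly like the sketch below. MyCSRContext and MyCSRMult are placeholder names, and the parallel matrix-vector product (including the ghost-value communication) would be ours to implement; a shell matrix also only provides the action of the operator, so most of the built-in preconditioners would no longer be available:

typedef struct {
  PetscInt    *rowptr,*colind;
  PetscScalar *a;
  /* plus whatever ghost/communication data the parallel mat-vec needs */
} MyCSRContext;

PetscErrorCode MyCSRMult(Mat A,Vec x,Vec y)
{
  MyCSRContext *ctx;
  MatShellGetContext(A,(void**)&ctx);
  /* compute y = A*x directly from rowptr/colind/a in the local numbering */
  return 0;
}

/* setup */
Mat          A;
MyCSRContext ctx = {rowptr,colind,a};
MatCreateShell(PETSC_COMM_WORLD,n,n,N,N,&ctx,&A);
MatShellSetOperation(A,MATOP_MULT,(void(*)(void))MyCSRMult);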

Thanks,
Chris





most efficient way to use local CSR matrix?

2009-10-29 Thread Chris Kees
Mark, Barry,

Thanks for the help.  For now I'm sticking with the approach of  
copying the csr matrix and using the csr data structures to do the  
preallocations. I'll eventually get around to writing the code for  
assembling directly into the petsc matrix.  I have two more questions.

1) On 1 processor, the linear solver is not converging in 1 iteration  
using -pc_type lu -pc_factor_mat_solver_package spooles. It converges  
in one iteration in parallel or on a single processor if I set
-mat_type seqaij. Also, if I go back to the old way of constructing the
matrix it DOES converge in 1 iteration on 1 processor. That is, if I  
replace the MatCreate/MatSetSizes/...  calls from the earlier email,  
with what I had originally:

MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m);

then the solver converges in 1 iteration as expected.  This wouldn't  
be a big deal except that we're trying to verify that our quadratic  
finite elements are working, and the behavior for p1 and p2 elements  
is different. The above observations are for p2, while all the cases  
work as expected for p1. Should we just not be using mpiaij at all on  
1 processor?

2) In the code snippet I took from the petsc example there are two  
preallocation calls (see below). Is that correct, or should I be  
checking the matrix type and calling only the preallocation function  
for the actual matrix type? i.e. something like

MatCreate(...
MatSetSizes(...
MatSetFromOptions(...
MatGetType(...
if type==mpiaij: MatMPIAIJSetPreallocationCSR(...
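In C that check might look something like the sketch below (rowptr, colind, and a stand for the CSR arrays from the first message; strcmp is from <string.h>):

MatType type;
MatGetType(self->m,&type);
if (!strcmp(type,MATMPIAIJ)) {
  MatMPIAIJSetPreallocationCSR(self->m,rowptr,colind,a);
} else if (!strcmp(type,MATSEQAIJ)) {
  MatSeqAIJSetPreallocationCSR(self->m,rowptr,colind,a);
}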

Chris

On Oct 21, 2009, at 11:50 AM, Mark F. Adams wrote:

 Chris, just a note,

 Perhaps I missed something here, but I do something similar to you, e.g.
 overlapping partitions, and PETSc is set up very well to be intrusive
 in your code (I sometimes write a little 'addvalue' wrapper) and
 assemble your FE matrices directly into a global matrix.  You use
 the (global) MatSetValues(...) and set the matrix option
 MAT_IGNORE_OFF_PROC_ENTRIES (or something like that) so that these
 off-processor matrix entries, which are computed redundantly, are
 thrown away and you get the correct matrix.

 As Barry said, you don't need exact non-zero counts to allocate
 storage in PETSc matrices; just don't underestimate.

 Mark
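 A rough sketch of that pattern (the option Mark refers to is MAT_IGNORE_OFF_PROC_ENTRIES; elementDofs and elementMat are illustrative names for the global dof list and the dense element matrix):

 MatSetOption(A,MAT_IGNORE_OFF_PROC_ENTRIES,PETSC_TRUE);
 for (e = 0; e < nElements; e++) {
   /* each process assembles the elements it owns (or overlaps);
      redundantly computed off-process rows are simply dropped */
   MatSetValues(A,nDofPerElement,elementDofs,nDofPerElement,elementDofs,
                elementMat,ADD_VALUES);
 }
 MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
 MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);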

 On Oct 21, 2009, at 9:16 AM, Barry Smith wrote:


 On Oct 20, 2009, at 10:13 PM, Chris Kees wrote:

 Thanks. Just to make sure I'm following, when I create the matrix  
 I do:

MatCreate(PETSC_COMM_WORLD,&self->m);
MatSetSizes(self->m,n,n,N,N);
MatSetType(self->m,MATMPIAIJ);
MatSetFromOptions(self->m);
MatSeqAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));
MatMPIAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));

 and then I copy the matrix at each Newton iteration using the code  
 in the first message below.  You said ...PreallocationCSR() does  
 the copy very efficiently, but I don't think I'm supposed to  
 repeatedly call it, right?  Looking at the output some more it  
 does seem like building the preconditioner initially is taking  
 much more time than the initial pass through MatSetValues.

  If the matrix nonzero structure stays the same, then yes, you need to
 call MatMPIAIJSetPreallocationCSR() only once and use the
 MatSetValues() calls each Newton step to update the nonzeros.

  Barry
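 In other words, something like the sketch below: pay for the preallocation (and copy) once, then only refill the values each Newton step. Here firstOwnedRow and globalColind are illustrative stand-ins for the global numbering of the local rows and columns (as I read the man page, ...PreallocationCSR() wants global column indices for the owned rows); the MatSetValuesLocal() loop from the first message works the same way if the matrix has a local-to-global mapping attached.

 /* once, at setup: preallocates and copies the initial values */
 MatMPIAIJSetPreallocationCSR(PETSCMAT(par_L),rowptr,globalColind,a);

 /* every Newton step, after refilling a[]: only the values change */
 for (i = 0; i < ldim; i++) {
   globalRow[0] = firstOwnedRow + i;
   MatSetValues(PETSCMAT(par_L),1,globalRow,rowptr[i+1]-rowptr[i],
                &globalColind[rowptr[i]],&a[rowptr[i]],INSERT_VALUES);
 }
 MatAssemblyBegin(PETSCMAT(par_L),MAT_FINAL_ASSEMBLY);
 MatAssemblyEnd(PETSCMAT(par_L),MAT_FINAL_ASSEMBLY);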


 Chris

 On Oct 20, 2009, at 8:32 PM, Barry Smith wrote:


 Here's the deal. In parallel PETSc does not use a single CSR
 matrix on each process to hold the MPIAIJ matrix. Hence if you
 store the matrix on a process as a single CSR matrix then it has
 to be copied into the PETSc data structure.
 MatMPIAIJSetPreallocationCSR() does the copy very efficiently.
 But you are right, until you free YOUR rowptr[], colind[] and a[]
 there are two copies of the matrix.
 One could argue that this is OK because any
 preconditioner worth anything will take up at least that much
 memory later anyway.  For example, if you use the PETSc default
 preconditioner, block Jacobi with ILU(0) on each block, it will
 take pretty much the same amount of memory.

 For people who use SOR or some other preconditioner that requires  
 very little memory this is not satisfactory because they feel  
 they could run larger problems without the copy.

 I strongly recommend you use MatMPIAIJSetPreallocationCSR() and  
 live with any memory limitation. The reason is that doing more  
 than that has orders of magnitude more pain with generally little  
 payoff.
 If you like pain then you can
 1) change your code to not build directly into CSR, instead call  
 MatSetValuesLocal() as you compute the entries. With this you  
 need to figure out the preallocation which can be painful code.
 2) write an entire matrix class that stores the matrix like you
 store it, not like PETSc stores it. This is a huge project, since
 you have to write your own
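 For reference, option (1) ends up looking roughly like the sketch below; the painful part is computing the per-row nonzero counts (or a CSR pattern) for the preallocation before any values are set, and the matrix needs MatSetLocalToGlobalMapping() for MatSetValuesLocal() to work. elementDofs, elementMat, and computeElementJacobian are illustrative names only:

 for (e = 0; e < nElements; e++) {
   computeElementJacobian(e,elementMat);   /* the FE kernel */
   MatSetValuesLocal(A,nDofPerElement,elementDofs,
                     nDofPerElement,elementDofs,elementMat,ADD_VALUES);
 }
 MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
 MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);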

most efficient way to use local CSR matrix?

2009-10-29 Thread Chris Kees
I pulled the most recent version of petsc-dev and added superlu_dist.   
The only case that fails is spooles on 1 processor with an mpiaij  
matrix created with the Create/SetSizes/... paradigm.  I would suspect
something in the spooles wrapper code.

code allocating mat:
MatCreate(Py_PETSC_COMM_WORLD,&self->m);
MatSetSizes(self->m,n,n,N,N);
MatSetType(self->m,MATMPIAIJ);

results:
-pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
  0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
  1 KSP preconditioned resid norm 3.936838941525e-01 true resid norm  
1.050257318153e+01 ||Ae||/||Ax|| 9.881970495395e-01
  2 KSP preconditioned resid norm 7.332823321496e-14 true resid norm  
2.408053447453e-13 ||Ae||/||Ax|| 2.265760276816e-14

-pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 8 procs)
0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
  1 KSP preconditioned resid norm 2.898588418444e-14 true resid norm  
1.045214942604e-13 ||Ae||/||Ax|| 9.834526306678e-15

-pc_type lu -pc_factor_mat_solver_package superlu_dist -ksp_monitor_true_residual
0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
  1 KSP preconditioned resid norm 2.917052183641e-14 true resid norm  
1.299023464819e-13 ||Ae||/||Ax|| 1.63471084e-14

code allocating mat:

MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m);

results:
-pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
  0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
  1 KSP preconditioned resid norm 3.093811333624e-14 true resid norm  
1.046870732638e-13 ||Ae||/||Ax|| 9.850105791801e-15

On Oct 29, 2009, at 1:50 PM, Matthew Knepley wrote:

 On Thu, Oct 29, 2009 at 1:46 PM, Chris Kees christopher.e.kees at usace.army.mil wrote:
 Mark, Barry,

 Thanks for the help.  For now I'm sticking with the approach of  
 copying the csr matrix and using the csr data structures to do the  
 preallocations. I'll eventually get around to writing the code for  
 assembling directly into the petsc matrix.  I have two more questions.

 1) On 1 processor, the linear solver is not converging in 1  
 iteration using -pc_type lu -pc_factor_mat_solver_package spooles.  
 It converges in one iteration in parallel or on a single processor  
 if I set -mat_type seqaij. Also, if I go back to the old way of  
 constructing the matrix it DOES converge in 1 iteration on 1  
 processor. That is, if I replace the MatCreate/MatSetSizes/...   
 calls from the earlier email, with what I had originally:

 MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                 1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m);

 then the solver converges in 1 iteration as expected.  This wouldn't  
 be a big deal except that we're trying to verify that our quadratic  
 finite elements are working, and the behavior for p1 and p2 elements  
 is different. The above observations are for p2, while all the cases  
 work as expected for p1. Should we just not be using mpiaij at all  
 on 1 processor?

 No, you have a problem.

 2) In the code snippet I took from the petsc example there are two  
 preallocation calls (see below). Is that correct, or should I be  
 checking the matrix type and calling only the preallocation function  
 for the actual matrix type? i.e. something like

 MatCreate(...
 MatSetSizes(...
 MatSetFromOptions(...
 MatGetType(...
 if type==mpiaij: MatMPIAIJSetPreallocationCSR(...

 It does not hurt to call them all.

   Matt

 Chris

 On Oct 21, 2009, at 11:50 AM, Mark F. Adams wrote:

 Chris, just a note,

 Perhaps I missed something here, but I do something similar to you, e.g.
 overlapping partitions, and PETSc is set up very well to be intrusive
 in your code (I sometimes write a little 'addvalue' wrapper) and
 assemble your FE matrices directly into a global matrix.  You use
 the (global) MatSetValues(...) and set the matrix option
 MAT_IGNORE_OFF_PROC_ENTRIES (or something like that) so that these
 off-processor matrix entries, which are computed redundantly, are
 thrown away and you get the correct matrix.

 As Barry said, you don't need exact non-zero counts to allocate
 storage in PETSc matrices; just don't underestimate.

 Mark

 On Oct 21, 2009, at 9:16 AM, Barry Smith wrote:


 On Oct 20, 2009, at 10:13 PM, Chris Kees wrote:

 Thanks. Just to make sure I'm following, when I create the matrix I  
 do:

   MatCreate(PETSC_COMM_WORLD,&self->m);
   MatSetSizes(self->m,n,n,N,N);
   MatSetType(self->m,MATMPIAIJ);
   MatSetFromOptions(self->m);
   MatSeqAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval

most efficient way to use local CSR matrix?

2009-10-29 Thread Chris Kees
I think I added the correct test code and got a difference of 0, but  
spooles is still not returning the correct result, at least not until  
the 2nd iteration.  I'm attaching some of the code and output.

Chris

code for mat creates
---
MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m2);

MatCreate(Py_PETSC_COMM_WORLD,&self->m);
MatSetSizes(self->m,n,n,N,N);
MatSetFromOptions(self->m);
MatSeqAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));
MatMPIAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));


code for difference in mats
---

   
MatAXPY(PETSCMAT2(par_L),-1.0,PETSCMAT(par_L),DIFFERENT_NONZERO_PATTERN);
PetscReal norm=-1.0;
MatNorm(PETSCMAT2(par_L),NORM_INFINITY,&norm);
std::cout<<"inf norm of L - L2 = "<<norm<<std::endl;

results for -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual -mat_type mpiaij
---

inf norm of L - L2 = 0
  0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
  1 KSP preconditioned resid norm 3.936838941525e-01 true resid norm  
1.050257318153e+01 ||Ae||/||Ax|| 9.881970495395e-01
  2 KSP preconditioned resid norm 7.332823321496e-14 true resid norm  
2.408053447453e-13 ||Ae||/||Ax|| 2.265760276816e-14
On Oct 29, 2009, at 3:56 PM, Barry Smith wrote:


   Hmm, this should not happen. The matrix should be identical in  
 both cases (note the initial residual is the same in both cases so  
 they may be identical).

   Here's one more thing you can try. In the same routine create TWO  
 matrices; one each way, then use MatAXPY() to take the difference  
 between them.
 Is it zero? If not, that will help lead to where the problem might be.

   If that does not help the situation, can you send the code that
 demonstrates this problem to petsc-maint at mcs.anl.gov


Barry

 On Oct 29, 2009, at 3:34 PM, Chris Kees wrote:

 I pulled the most recent version of petsc-dev and added  
 superlu_dist.  The only case that fails is spooles on 1 processor  
 with an mpiaij matrix created with the Create/SetSizes/...  
 paradigm.  I would suspect something in spooles wrapper code.

 code allocating mat:
 MatCreate(Py_PETSC_COMM_WORLD,&self->m);
 MatSetSizes(self->m,n,n,N,N);
 MatSetType(self->m,MATMPIAIJ);

 results:
 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 3.936838941525e-01 true resid norm  
 1.050257318153e+01 ||Ae||/||Ax|| 9.881970495395e-01
 2 KSP preconditioned resid norm 7.332823321496e-14 true resid norm  
 2.408053447453e-13 ||Ae||/||Ax|| 2.265760276816e-14

 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 8 procs)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 2.898588418444e-14 true resid norm  
 1.045214942604e-13 ||Ae||/||Ax|| 9.834526306678e-15

 -pc_type lu -pc_factor_mat_solver_package superlu_dist -ksp_monitor_true_residual
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 2.917052183641e-14 true resid norm  
 1.299023464819e-13 ||Ae||/||Ax|| 1.63471084e-14

 code allocating mat:

 MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                 1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m);

 results:
 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 3.093811333624e-14 true resid norm  
 1.046870732638e-13 ||Ae||/||Ax|| 9.850105791801e-15

 On Oct 29, 2009, at 1:50 PM, Matthew Knepley wrote:

 On Thu, Oct 29, 2009 at 1:46 PM, Chris Kees christopher.e.kees at usace.army.mil wrote:
 Mark, Barry,

 Thanks for the help.  For now I'm sticking with the approach of  
 copying the csr matrix and using the csr data structures to do the  
 preallocations. I'll eventually get around to writing the code for  
 assembling directly into the petsc matrix.  I have two more  
 questions.

 1) On 1 processor, the linear solver is not converging in 1  
 iteration using -pc_type lu -pc_factor_mat_solver_package spooles.  
 It converges in one iteration in parallel or on a single processor  
 if I set -mat_type seqaij. Also, if I go back to the old way of  
 constructing the matrix it DOES converge in 1 iteration on 1  
 processor. That is, if I replace the MatCreate/MatSetSizes

most efficient way to use local CSR matrix?

2009-10-29 Thread Chris Kees

On Oct 29, 2009, at 8:43 PM, Hong Zhang wrote:


 On Oct 29, 2009, at 5:02 PM, Chris Kees wrote:

 I think I added the correct test code and got a difference of 0,  
 but spooles is still not returning the correct result, at least not  
 until the 2nd iteration.  I'm attaching some of the code and output.

 We have not been actively updating the spooles interface since
 Spooles has been unsupported by its developers for 10+ years.
 For direct parallel solvers, mumps and superlu_dist have been
 consistently supported and outperform spooles in general.

 Why do you use spooles?


Inertia mainly.  It doesn't require a Fortran compiler, which comes in
handy on some systems, and it has been a reliable way to make sure
everything else in the code is working before trying to loosen the
linear solver tolerances on larger parallel runs. There were also some
problems about 10 years ago when superlu_dist was not giving reliable
results on Jacobians from degenerate nonlinear PDEs.  If you guys
aren't going to support the spooles wrapper, it would be nice to
provide some kind of robust C implementation of a sparse direct solver in
its place for building a minimal PETSc for debugging/development/porting.

Chris

 Hong

 Chris

 code for mat creates
 ---
  MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                  1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m2);

  MatCreate(Py_PETSC_COMM_WORLD,&self->m);
  MatSetSizes(self->m,n,n,N,N);
  MatSetFromOptions(self->m);
  MatSeqAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));
  MatMPIAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));


 code for difference in mats
 ---

 MatAXPY(PETSCMAT2(par_L),-1.0,PETSCMAT(par_L),DIFFERENT_NONZERO_PATTERN);
 PetscReal norm=-1.0;
 MatNorm(PETSCMAT2(par_L),NORM_INFINITY,&norm);
 std::cout<<"inf norm of L - L2 = "<<norm<<std::endl;

 results for -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual -mat_type mpiaij
 ---

 inf norm of L - L2 = 0
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 3.936838941525e-01 true resid norm  
 1.050257318153e+01 ||Ae||/||Ax|| 9.881970495395e-01
 2 KSP preconditioned resid norm 7.332823321496e-14 true resid norm  
 2.408053447453e-13 ||Ae||/||Ax|| 2.265760276816e-14
 On Oct 29, 2009, at 3:56 PM, Barry Smith wrote:


 Hmm, this should not happen. The matrix should be identical in  
 both cases (note the initial residual is the same in both cases so  
 they may be identical).

 Here's one more thing you can try. In the same routine create TWO  
 matrices; one each way, then use MatAXPY() to take the difference  
 between them.
 Is it zero? If not, that will help lead to where the problem might  
 be.

 If that does not help the situation, can you send the code that
 demonstrates this problem to petsc-maint at mcs.anl.gov


  Barry

 On Oct 29, 2009, at 3:34 PM, Chris Kees wrote:

 I pulled the most recent version of petsc-dev and added  
 superlu_dist.  The only case that fails is spooles on 1 processor  
 with an mpiaij matrix created with the Create/SetSizes/...  
 paradigm.  I would suspect something in spooles wrapper code.

 code allocating mat:
 MatCreate(Py_PETSC_COMM_WORLD,&self->m);
 MatSetSizes(self->m,n,n,N,N);
 MatSetType(self->m,MATMPIAIJ);

 results:
 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 3.936838941525e-01 true resid  
 norm 1.050257318153e+01 ||Ae||/||Ax|| 9.881970495395e-01
 2 KSP preconditioned resid norm 7.332823321496e-14 true resid  
 norm 2.408053447453e-13 ||Ae||/||Ax|| 2.265760276816e-14

 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 8 procs)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 2.898588418444e-14 true resid  
 norm 1.045214942604e-13 ||Ae||/||Ax|| 9.834526306678e-15

 -pc_type lu -pc_factor_mat_solver_package superlu_dist -ksp_monitor_true_residual
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 2.917052183641e-14 true resid  
 norm 1.299023464819e-13 ||Ae||/||Ax|| 1.63471084e-14

 code allocating mat:

 MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                 1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m);

 results:
 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm

most efficient way to use local CSR matrix?

2009-10-29 Thread Chris Kees
I'm just using a simple configuration:
./config/configure.py --with-clanguage=C --with-cc='/usr/bin/mpicc -arch x86_64' --with-cxx='/usr/bin/mpicxx -arch x86_64' --without-fortran --with-mpi-compilers --without-shared --without-dynamic --download-parmetis=ifneeded --download-spooles=ifneeded

At this point we're not really hung up on this. I was mainly  
continuing to provide information in case it was helpful to you guys.  
The way the matrix is assembled, there is no reason to explicitly set
the type, and so the PETSc default on one processor behaves correctly.
You have to force it to use mpiaij and spooles on one processor to  
produce the error.

Chris

On Oct 29, 2009, at 9:00 PM, Barry Smith wrote:


  Have you run this with the debug version of PETSc? It does  
 additional argument testing that might catch a funny value passed in.


   Barry

 On Oct 29, 2009, at 5:02 PM, Chris Kees wrote:

 I think I added the correct test code and got a difference of 0,  
 but spooles is still not returning the correct result, at least not  
 until the 2nd iteration.  I'm attaching some of the code and output.

 Chris

 code for mat creates
 ---
  MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                  1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m2);

  MatCreate(Py_PETSC_COMM_WORLD,&self->m);
  MatSetSizes(self->m,n,n,N,N);
  MatSetFromOptions(self->m);
  MatSeqAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));
  MatMPIAIJSetPreallocationCSR(self->m,SMP(L)->A.rowptr,SMP(L)->A.colind,(double*)(SMP(L)->A.nzval));


 code for difference in mats
 ---

 MatAXPY(PETSCMAT2(par_L),-1.0,PETSCMAT(par_L),DIFFERENT_NONZERO_PATTERN);
 PetscReal norm=-1.0;
 MatNorm(PETSCMAT2(par_L),NORM_INFINITY,&norm);
 std::cout<<"inf norm of L - L2 = "<<norm<<std::endl;

 results for -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual -mat_type mpiaij
 ---

 inf norm of L - L2 = 0
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid norm  
 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 3.936838941525e-01 true resid norm  
 1.050257318153e+01 ||Ae||/||Ax|| 9.881970495395e-01
 2 KSP preconditioned resid norm 7.332823321496e-14 true resid norm  
 2.408053447453e-13 ||Ae||/||Ax|| 2.265760276816e-14
 On Oct 29, 2009, at 3:56 PM, Barry Smith wrote:


 Hmm, this should not happen. The matrix should be identical in  
 both cases (note the initial residual is the same in both cases so  
 they may be identical).

 Here's one more thing you can try. In the same routine create TWO  
 matrices; one each way, then use MatAXPY() to take the difference  
 between them.
 Is it zero? If not, that will help lead to where the problem might  
 be.

 If that does not help the situation, can you send the code that
 demonstrates this problem to petsc-maint at mcs.anl.gov


  Barry

 On Oct 29, 2009, at 3:34 PM, Chris Kees wrote:

 I pulled the most recent version of petsc-dev and added  
 superlu_dist.  The only case that fails is spooles on 1 processor  
 with an mpiaij matrix created with the Create/SetSizes/...  
 paradigm.  I would suspect something in spooles wrapper code.

 code allocating mat:
 MatCreate(Py_PETSC_COMM_WORLD,&self->m);
 MatSetSizes(self->m,n,n,N,N);
 MatSetType(self->m,MATMPIAIJ);

 results:
 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 3.936838941525e-01 true resid  
 norm 1.050257318153e+01 ||Ae||/||Ax|| 9.881970495395e-01
 2 KSP preconditioned resid norm 7.332823321496e-14 true resid  
 norm 2.408053447453e-13 ||Ae||/||Ax|| 2.265760276816e-14

 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 8 procs)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 2.898588418444e-14 true resid  
 norm 1.045214942604e-13 ||Ae||/||Ax|| 9.834526306678e-15

 -pc_type lu -pc_factor_mat_solver_package superlu_dist -ksp_monitor_true_residual
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 2.917052183641e-14 true resid  
 norm 1.299023464819e-13 ||Ae||/||Ax|| 1.63471084e-14

 code allocating mat:

 MatCreateMPIAIJ(Py_PETSC_COMM_WORLD,n,n,N,N,
                 1,PETSC_NULL,max_nNeighbors,PETSC_NULL,&self->m);

 results:
 -pc_type lu -pc_factor_mat_solver_package spooles -ksp_monitor_true_residual (spooles on 1 proc)
 0 KSP preconditioned resid norm 2.284405915285e+01 true resid  
 norm 1.062801511746e+01 ||Ae||/||Ax|| 1.e+00
 1 KSP preconditioned resid norm 3.093811333624e-14

[petsc-dev] setup.py for building Python extensions involving PETSc

2012-03-01 Thread Chris Kees
Hi,

This may be of interest to people developing in Python and C/C++.  I worked
out a simple hack to use the petsc4py configuration to build Python
extension modules with the correct petsc compilers and options.  We had
been doing this using some makefile tricks, but our approach wasn't as
reliable as Lisandro's configuration of petsc4py.  I had to hack two of the
build_ext functions. You still have to give setup.py the
--petsc-dir/--petsc-arch arguments.  It might be a nice feature to
install something similar in petsc4py so one could just do 'from petsc4py
import PetscExtension'.

Chris

setup.py
---
# os, sys, and numpy are used below; the PROTEUS_* names come from our own
# configuration and are assumed to be defined/imported before this point.
import os
import sys

import numpy

from distutils.core import setup

sys.path.append('/path_to_petsc4py')
from conf.petscconf import Extension as PetscExtension
from conf.petscconf import build_ext as petsc_build_ext
from conf.petscconf import config, build, build_src
from conf.petscconf import test, sdist

class my_build_ext(petsc_build_ext):
    def build_configuration(self, arch_list):
        from distutils.util import split_quoted, execute
        #
        template, variables = self.get_config_data(arch_list)
        config_data = template % variables
        #
        build_lib   = self.build_lib
        dist_name   = self.distribution.get_name()
        config_file = os.path.join(build_lib, dist_name, 'petsc.cfg') #cek hack
        #
        def write_file(filename, data):
            fh = open(filename, 'w')
            try: fh.write(config_data)
            finally: fh.close()
        execute(write_file, (config_file, config_data),
                msg='writing %s' % config_file,
                verbose=self.verbose, dry_run=self.dry_run)
    def _copy_ext(self, ext):
        from copy import deepcopy
        extclass = ext.__class__
        fullname = self.get_ext_fullname(ext.name)
        modpath = str.split(fullname, '.')
        pkgpath = os.path.join('', *modpath[0:-1])
        name = modpath[-1]
        sources = list(ext.sources)
        newext = extclass(name, sources)
        newext.__dict__.update(deepcopy(ext.__dict__))
        newext.name = name
        pkgpath = '' #cek hack
        return pkgpath, newext

setup(name='proteus',
      ext_package='proteus',
      package_data={'proteus': ['petsc.cfg']},
      cmdclass={'config'    : config,
                'build'     : build,
                'build_src' : build_src,
                'build_ext' : my_build_ext,
                'test'      : test,
                'sdist'     : sdist,
                },
      ext_modules=[PetscExtension('flcbdfWrappers',
                                  ['proteus/flcbdfWrappersModule.cpp',
                                   'proteus/mesh.cpp',
                                   'proteus/meshio.cpp'],
                                  define_macros=[('PROTEUS_TRIANGLE_H', PROTEUS_TRIANGLE_H),
                                                 ('PROTEUS_SUPERLU_H', PROTEUS_SUPERLU_H),
                                                 ('CMRVEC_BOUNDS_CHECK', 1),
                                                 ('MV_VECTOR_BOUNDS_CHECK', 1),
                                                 ('PETSCVEC_BOUNDS_CHECK', 1),
                                                 ('F77_POST_UNDERSCORE', 1),
                                                 ('USE_BLAS', 1)],
                                  include_dirs=['include',
                                                numpy.get_include(),
                                                PROTEUS_SUPERLU_INCLUDE_DIR,
                                                PROTEUS_TRIANGLE_INCLUDE_DIR]
                                               + PROTEUS_DAETK_INCLUDE_DIR
                                               + PROTEUS_PETSC_INCLUDE_DIRS
                                               + [PROTEUS_MPI_INCLUDE_DIR],
                                  library_dirs=[PROTEUS_DAETK_LIB_DIR]
                                               + PROTEUS_PETSC_LIB_DIRS
                                               + [PROTEUS_MPI_LIB_DIR],
                                  libraries=['m', PROTEUS_DAETK_LIB]
                                            + PROTEUS_PETSC_LIBS
                                            + PROTEUS_MPI_LIBS,
                                  extra_link_args=PROTEUS_EXTRA_LINK_ARGS,
                                  extra_compile_args=PROTEUS_EXTRA_COMPILE_ARGS)])


[petsc-dev] field split/schur complement, incompressible flow

2010-05-12 Thread Chris Kees

On May 12, 2010, at 12:19 PM, Jed Brown wrote:

 On Wed, 12 May 2010 13:27:18 -0300, Lisandro Dalcin  
 dalcinl at gmail.com wrote:
 Chris is going to use this from Python, and I have to add support in
 petsc4py. I think I'll have to provide a better API, something like:

 pc.fieldSplitSetFields(0,[1,2,3],4) # (rho,[Ux,Uy,Uz],p)

 or

 pc.fieldSplitSetIS(is_rho, is_U, is_p)

 and implement then by calling PCFieldSplitSet{Fields|IS} multiple  
 times.

 Does this API make sense?

 Let's think about improving the C interface.  I would really like to
 name the splits since otherwise options sitting in petscrc files don't
 compose.  So what about

  PCFieldSplitSetFields(PC,const char name[],PetscInt nfields,PetscInt fields[]);
  PCFieldSplitSetFieldsByName(PC,const char name[],PetscInt nfields,const char *fieldnames[]);
  PCFieldSplitSetIS(PC,const char name[],IS);

 and on the Python end, the direct wrappers could be called multiple
 times to set individual splits (if you provided direct wrappers) or  
 have
 the higher level interface using a dict

  pc.FieldSplitSetSplits(splits)

 with any of the definitions

  splits = dict(velocity=isvelocity, pressure=ispressure)
  splits = dict(velocity=(0,1,2), pressure=(3,))
  splits = dict(velocity=('u','v','w'), pressure='p')

 The latter would only work if the user used DMSetFieldName to identify
 fields u,v,w,p (recall that PC has access to the DM in -dev).  With  
 any
 of the definitions above, the runtime prefixes would be
 -fieldsplit_velocity_* and -fieldsplit_pressure_* instead of just
 numbers.

 Does this seem sensible?

 Jed

That makes sense to me.  I'm all for naming the fields, and the  
dictionary representation in python is very intuitive.

Chris



[petsc-dev] setting field split pc correctly

2010-05-15 Thread Chris Kees
Hi,

I appear to be setting the index sets properly now through the  
petsc4py wrappers (using the existing interface) that Lisandro and I  
added. I can't seem to set the options on the KSP/PC for the splits  
though. My options look like this:

-navierStokes_fieldsplit_type MULTIPLICATIVE -navierStokes_ksp_type bcgsl
-navierStokes_ksp_rtol 0.0 -navierStokes_ksp_atol 1.0e-10
-navierStokes_ksp_monitor_true_residual -ksp_type bcgsl -ksp_rtol 0.0
-ksp_atol 1.0e-10-navierStokes_fieldsplit_0_ksp_type preonly
-navierStokes_fieldsplit_0_pc_type boomeramg
-navierStokes_fieldsplit_1_ksp_type fgmres
-navierStokes_fieldsplit_1_pc_type none
-navierStokes_fieldsplit_1_ksp_norm_type NO
-navierStokes_fieldsplit_1_ksp_max_it 3

In the code I call
 self.pc.setOptionsPrefix(prefix)
 self.ksp.setOptionsPrefix(prefix)
 self.ksp.setFromOptions()
 self.pc.setFromOptions()

The output of KSPView before each KSP solve is below. The splits use
none,bjacobi. I'm guessing I'm making some kind of rookie mistake... -
Chris

KSP Object:(navierStokes_)
   type: bcgsl
 BCGSL: Ell = 2
 BCGSL: Delta = 0
   maximum iterations=1000, initial guess is zero
   tolerances:  relative=0, absolute=1e-10, divergence=1
   left preconditioning
   using PRECONDITIONED norm type for convergence test
PC Object:(navierStokes_)
   type: fieldsplit
 FieldSplit with MULTIPLICATIVE composition: total splits = 2,  
blocksize = -1
 Solver info for each split is in the following KSP objects:
 Split number 0 Defined by IS
 KSP Object:(fieldsplit_0_)
   type: preonly
   maximum iterations=1, initial guess is zero
   tolerances:  relative=1e-05, absolute=1e-50, divergence=1
   left preconditioning
   using PRECONDITIONED norm type for convergence test
 PC Object:(fieldsplit_0_)
   type: bjacobi
 block Jacobi: number of blocks = 4
 Local solve is same for all blocks, in the following KSP and  
PC objects:
   KSP Object:(fieldsplit_0_sub_)
 type: preonly
 maximum iterations=1, initial guess is zero
 tolerances:  relative=1e-05, absolute=1e-50, divergence=1
 left preconditioning
 using PRECONDITIONED norm type for convergence test
   PC Object:(fieldsplit_0_sub_)
 type: ilu
   ILU: out-of-place factorization
   0 levels of fill
   tolerance for zero pivot 1e-12
   using diagonal shift to prevent zero pivot
   matrix ordering: natural
   factor fill ratio given 1, needed 1
 Factored matrix follows:
   Matrix Object:
 type=seqaij, rows=493, cols=493
 package used to perform factorization: petsc
 total: nonzeros=11409, allocated nonzeros=11409
   using I-node routines: found 124 nodes, limit used  
is 5
 linear system matrix = precond matrix:
 Matrix Object:
   type=seqaij, rows=493, cols=493
   total: nonzeros=11409, allocated nonzeros=11409
 using I-node routines: found 124 nodes, limit used is 5
   linear system matrix = precond matrix:
   Matrix Object:
 type=mpiaij, rows=1973, cols=1973
 total: nonzeros=35591, allocated nonzeros=35591
   using I-node (on process 0) routines: found 124 nodes,  
limit used is 5
 Split number 1 Defined by IS
 KSP Object:(fieldsplit_1_)
   type: preonly
   maximum iterations=1, initial guess is zero
   tolerances:  relative=1e-05, absolute=1e-50, divergence=1
   left preconditioning
   using PRECONDITIONED norm type for convergence test
 PC Object:(fieldsplit_1_)
   type: bjacobi
 block Jacobi: number of blocks = 4
 Local solve is same for all blocks, in the following KSP and  
PC objects:
   KSP Object:(fieldsplit_1_sub_)
 type: preonly
 maximum iterations=1, initial guess is zero
 tolerances:  relative=1e-05, absolute=1e-50, divergence=1
 left preconditioning
 using PRECONDITIONED norm type for convergence test
   PC Object:(fieldsplit_1_sub_)
 type: ilu
   ILU: out-of-place factorization
   0 levels of fill
   tolerance for zero pivot 1e-12
   using diagonal shift to prevent zero pivot
   matrix ordering: natural
   factor fill ratio given 1, needed 1
 Factored matrix follows:
   Matrix Object:
 type=seqaij, rows=1479, cols=1479
 package used to perform factorization: petsc
 total: nonzeros=61433, allocated nonzeros=61433
   using I-node routines: found 370 nodes, limit used  
is 5
 linear system matrix = precond matrix:
 Matrix Object:
   type=seqaij, rows=1479, cols=1479
   total: nonzeros=61433, allocated nonzeros=61433
 using 

since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-07 Thread Chris Kees
This whole thread has been really interesting. I'm interested in   
Barry's suggestion, and I know a handful of people not on the list who  
would also be interested.   I've got a couple of questions/responses  
to the MatSetValues, point-wise physics, and support tools issues  
that were raised.

Did you guys settle the issue that Jed brought up about the
Python Global Interpreter Lock (GIL)?  There was some discussion of it
at SciPy this summer, but I didn't come away feeling like anybody had  
an immediate solution or that Guido was in favor of eliminating the  
GIL anytime soon.  Are you guys assuming that the worker threads   
for the multiple cores/GPUs are going to be set up by C code (hand-
or machine-generated)?

It seems to me that the MatSetValues discussion moved on from the  
threading issue (since you can't call MatSetValues efficiently in  
python anyway) to the more general issue of how PETSc is going to  
explicitly support shared memory parallelism. I interpret the last  
interchange as implying that the solution is another level of domain  
decomposition that would be hidden from the mat assembly loop level  
(i.e. users aren't going to have to write their code like the thread  
part is their parallel subdomain). Is that a correct interpretation?  
Something along the lines of additional args to MatCreate that
indicate how many threads to use per subdomain and some OpenMP-style
instructions to parallelize the assembly loop?

On the issue of the point-wise physics (e.g. quadrature point  
routines, Riemann solvers, ...). It seems like this is a long-standing  
bottleneck for object-oriented approaches to PDEs going back to the OO
FEM research in the 80s and 90s.  The SciPy group that met to talk
about PDE's spent some time on it, partly related to femhub project.   
I agree that a code generation approach seems to be the only way  
forward.  Ondrej Certik from the femhub project has written a symbolic  
math module for python, and we talked some about using this to  
represent the physics in a way that can be used for code generation.   
I'm not sure what progress he's made. At the end of the day you've got  
to allow people to write code for physics without knowledge of the  
assembly processes while still injecting that code into the inner  
loop, otherwise you'll never approach the speed of monolithic fortran codes.

One issue that you might also want to consider with multi-language  
approaches is the configuration and build tools.   I agree that the
debugging support is a major issue, but what has taken far more of my
time is the lack of cross-platform configuration and build tools for
multi-language software.  We have looked at CMake and Scons but have
had to stick with a pretty ugly mixture of python scripts
and make. I feel like the other scientific computing packages like
Sage and EPD are successfully growing because they distribute binaries  
that include all dependencies, but I don't see that being feasible for  
HPC software anytime soon.  It seems like to really be successful you  
may have to rely on a full featured and fully documented tool like  
Scons is aiming at being, otherwise you'll end up with an approach  
that works for you (because you wrote it) but not very well for others  
trying to use the new PETSc.

We have a small group at the US Army ERDC that has been developing a
python FEM framework for multiscale/multiphysics modeling that tries  
to break coupling between the physics and the FEM methods (i.e. we try  
to support CG,DG, and non-conforming elements with variable order and  
element topologies running from the same description of the physics).  
The code uses our own wrappers to petsc and mpi, but only because we  
started before mpi4py and petsc4py existed.  We have a memory  
intensive  physics API based on computing and storing quadrature point  
values of PDE coefficients and numerical fluxes and a residual and  
jacobian assembly processes along the lines of Matt's first approach  
of storing element residuals and jacobians.  We also have handwritten  
optimized codes that collapse all the loops and eliminate storage  
for direct residual and jacobian assembly. We did that last step out  
of necessity and as a precursor to working  on a code generation  
approach (it's what we want the code generator to write).   It also  
has the side effect of providing a low level API for people who want  
to write traditional monolithic fortran codes for residual and  
jacobian assembly.

Our approach for the user, or more correctly the physics developer,
is 1) prototype physics in python, 2) vectorize the physics by
rewriting in C/C++/Fortran using swig, f2py, or hand written wrappers  
and 3) take the inner loop of 2 and build monolithic residual and  
jacobian routines.  This is obviously not satisfactory and not  
maintainable in the long-run but we are solving real nonlinear multi- 
physics problems in 3D with it 

FieldSplit Schur preconditioner

2008-09-06 Thread Chris Kees
OK. Thanks for the suggestions. I'll pull petsc-dev next week and see if
mumps gives me the same behavior as superlu_dist before trying to set up a
Schur complement preconditioner for stokes.

Chris


On 9/6/08 11:31 AM, Jed Brown jed at 59A2.org wrote:

 On Sat 2008-09-06 10:57, Kees, Christopher E wrote:
 I'm interested in these types of Shur complement preconditioners as well. I
 had stopped working with petsc-dev some time back because I wasn't using new
 functionality. Would this be a good time to start working with petsc-dev in
 order to start testing these? I'm primarily interested in stokes and
 navier-stokes type systems but also have some other coupled systems that an
 operator-split preconditioner should work well on.
 
 My opinion is YES, now is a good time to start using petsc-dev.
 
 With regard to testing, I am also looking for a solid parallel direct solver
 to verify everything from the nonlinear solver up. I have been using superlu
 in serial and superlu_dist in parallel (through PETSc) but have noticed some
 reduction in accuracy as I go to more processors with superlu_dist. This may
 just be a bug in our code, but I thought if I get petsc-dev the new approach
 to external direct solvers that Barry mentioned earlier might make it easier
 to look at some other direct solvers. Several years ago I got better results
 with spooles, but I'd be interested in any current recommendations.
 
 From what I've seen this is more of a code refactoring and interface
 change than a change in functionality.  The old way would be
 
   ./ex2 -pc_type lu -mat_type {aijspooles,superlu_dist,aijmumps,umfpack}
 
 but now it looks like
 
   ./ex2 -pc_type lu -pc_factor_mat_solver_package
 {spooles,superlu_dist,mumps,...}
 
 Jed




[petsc-dev] examples/benchmarks for weak and strong scaling exercise

2013-04-11 Thread Chris Kees
Thanks a lot. I did a little example with the Bratu problem and posted it here:

https://proteus.usace.army.mil/home/pub/17/

I used boomeramg instead of geometric multigrid because I was getting
an error with the options above:

%mpiexec -np 4 ./ex5 -mx 129 -my 129 -Nx 2 -Ny 2 -pc_type mg -pc_mg_levels 2
[0]PETSC ERROR: - Error Message

[0]PETSC ERROR: Argument out of range!
[0]PETSC ERROR: New nonzero at (66,1) caused a malloc!
[0]PETSC ERROR:


I like the ice paper and will try to get the contractor started on
reproducing those results.

-Chris

On Wed, Apr 10, 2013 at 1:13 PM, Nystrom, William D wdn at lanl.gov wrote:
 Sorry.  I overlooked that the URL was using git protocol.  My bad.

 Dave

 
 From: Jed Brown [five9a2 at gmail.com] on behalf of Jed Brown [jedbrown at 
 mcs.anl.gov]
 Sent: Wednesday, April 10, 2013 12:10 PM
 To: Nystrom, William D; For users of the development version of PETSc; Chris 
 Kees
 Subject: Re: [petsc-dev] examples/benchmarks for weak and strong scaling 
 exercise

 Nystrom, William D wdn at lanl.gov writes:

 Jed,

 I tried cloning your tme-ice git repo as follows and it failed:

 % git clone --recursive git://github.com/jedbrown/tme-ice.git tme_ice
 Cloning into 'tme_ice'...
 fatal: unable to connect to github.com:
 github.com[0: 204.232.175.90]: errno=Connection timed out

 I'm doing this from an xterm that allows me to clone petsc just fine.

 You're using https or ssh to clone PETSc, but the git:// protocol to clone
 tme-ice.  The LANL network is blocking that port, so just use the https
 or ssh protocol.


[petsc-dev] HPC benchmarking

2016-09-16 Thread Chris Kees
Hi,

Is there a set of HPC  benchmarks that the core PETSc developers recommend
for evaluating architectures? Meaning for evaluating claims about
performance to support rational acquisition decisions on new systems. I
recall speaking briefly with Jed at a conference about a benchmark that
was  more  representative of  PDE  calculations, maybe representing a
Krylov  iteration arising from an unstructured mesh or  something like
that. I'm talking about long-term acquisition of real workhorse petascale
systems, not the cutting edge.

Thanks,
Chris