since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-07 Thread Jed Brown
On Mon, 7 Dec 2009 10:53:10 -0500, Alex Peyser peyser.alex at gmail.com wrote:
 I've had endless trouble integrating petsc into my system because the object-
 oriented C is only partially object oriented while simultaneously 
 making access to the underlying functionality obscure.

Okay, I'm curious what you find to be only partially OO, and what
functionality you find difficult to use.  PETSc's approach to object
creation might look a bit unconventional (depending on what languages
you're familiar with) but it's very much a modern OO pattern, see

  http://martinfowler.com/articles/injection.html
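
For concreteness, here is roughly what that creation pattern looks like
through petsc4py (which comes up later in this thread); a minimal sketch,
with 'gmres' just an illustrative choice:

    # Sketch of PETSc's delayed-commitment creation pattern, via petsc4py.
    # The concrete solver type is injected at runtime, possibly from the
    # options database -- the dependency-injection idea referenced above.
    from petsc4py import PETSc

    ksp = PETSc.KSP().create()   # create an untyped solver object
    ksp.setType('gmres')         # inject a default implementation
    ksp.setFromOptions()         # -ksp_type on the command line overrides it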


In my opinion, the main downside of C is that it requires a significant
amount of boilerplate to implement these patterns (though not a lot more
than designs with comparable properties in C++ or Java).  I'm not in
love with Python, but there is a large amount of PETSc that is in no way
performance sensitive, and it would be very nice to do this in a more
expressive language than C.

Jed



since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-07 Thread Chris Kees
This whole thread has been really interesting. I'm interested in   
Barry's suggestion, and I know a handful of people not on the list who  
would also be interested.   I've got a couple of questions/responses  
to the MatSetValues, point-wise physics, and support tools issues  
that were raised.

Did you guys settle the issue that Jed brought up about the  
Python Global Interpreter Lock (GIL)?  There was some discussion of it  
at SciPy this summer, but I didn't come away feeling like anybody had  
an immediate solution or that Guido was in favor of eliminating the  
GIL anytime soon.  Are you guys assuming that the worker threads   
for the multiple cores/GPUs are going to be set up by C code (hand-  
or machine-generated)?

It seems to me that the MatSetValues discussion moved on from the  
threading issue (since you can't call MatSetValues efficiently in  
python anyway) to the more general issue of how PETSc is going to  
explicitly support shared memory parallelism. I interpret the last  
interchange as implying that the solution is another level of domain  
decomposition that would be hidden from the mat assembly loop level  
(i.e. users aren't going to have to write their code like the thread  
part is their parallel subdomain). Is that a correct interpretation?  
Something along the lines of additional args to MatCreate that  
indicate how many threads to use per subdomain and some OpenMP-style  
instructions to parallelize the assembly loop?

On the issue of the point-wise physics (e.g. quadrature point  
routines, Riemann solvers, ...). It seems like this is a long-standing  
bottleneck for object-oriented approaches to PDEs going back to the OO  
FEM research in the 80s and 90s.  The SciPy group that met to talk  
about PDEs spent some time on it, partly related to the femhub project.   
I agree that a code generation approach seems to be the only way  
forward.  Ondrej Certik from the femhub project has written a symbolic  
math module for python, and we talked some about using this to  
represent the physics in a way that can be used for code generation.   
I'm not sure what progress he's made. At the end of the day you've got  
to allow people to write code for physics without knowledge of the  
assembly processes while still injecting that code into the inner  
loop; otherwise you'll never approach the speed of monolithic fortran codes.
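
A sketch of that code-generation idea with sympy (Ondrej's symbolic
module): the pointwise physics is written symbolically and the C for the
inner loop is emitted mechanically. The flux here is invented for the
example:

    # Illustrative only: emit C source for a made-up pointwise flux.
    import sympy as sp

    u, v = sp.symbols('u v')
    flux = u**2 / 2 + sp.sin(v)           # hypothetical physics
    print(sp.ccode(flux, assign_to='f'))  # roughly: f = (1.0/2.0)*pow(u, 2) + sin(v);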

One issue that you might also want to consider with multi-language  
approaches is the configuration and build tools.   I agree that the  
debugging support is a major issue, but what has taken far more of my  
time is the lack of cross-platform configuration and build tools for  
multi-language software.  We have looked at CMake and Scons but have  
had to stick with a pretty ugly mixture of python scripts  
and make. I feel like the other scientific computing packages like  
Sage and EPD are successfully growing because they distribute binaries  
that include all dependencies, but I don't see that being feasible for  
HPC software anytime soon.  It seems like to really be successful you  
may have to rely on a full-featured and fully documented tool like the  
one Scons is aiming to be; otherwise you'll end up with an approach  
that works for you (because you wrote it) but not very well for others  
trying to use the new PETSc.

We have a small group at the US Army ERDC that has been developing a  
python FEM framework for multiscale/multiphysics modeling that tries  
to break coupling between the physics and the FEM methods (i.e. we try  
to support CG, DG, and non-conforming elements with variable order and  
element topologies running from the same description of the physics).  
The code uses our own wrappers to petsc and mpi, but only because we  
started before mpi4py and petsc4py existed.  We have a memory-  
intensive physics API based on computing and storing quadrature-point  
values of PDE coefficients and numerical fluxes, and residual and  
jacobian assembly processes along the lines of Matt's first approach  
of storing element residuals and jacobians.  We also have handwritten  
optimized codes that collapse all the loops and eliminate storage  
for direct residual and jacobian assembly. We did that last step out  
of necessity and as a precursor to working  on a code generation  
approach (it's what we want the code generator to write).   It also  
has the side effect of providing a low level API for people who want  
to write traditional monolithic fortran codes for residual and  
jacobian assembly.

Our approach for the user, or more correctly the physics developer,  
is 1) prototype physics in python, 2) vectorize the physics by  
rewriting in C/C++/Fortran using swig, f2py, or hand-written wrappers,  
and 3) take the inner loop of 2 and build monolithic residual and  
jacobian routines.  This is obviously not satisfactory and not  
maintainable in the long run, but we are solving real nonlinear  
multiphysics problems in 3D with it.
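
A toy illustration of steps 1) and 2), with an invented coefficient: the
same pointwise physics as a per-point Python prototype and as a
numpy-vectorized version.

    import numpy as np

    def sigma_point(u):           # step 1: prototype, one quadrature point
        return (1.0 + u*u) * u    # made-up nonlinear coefficient

    def sigma_vectorized(u):      # step 2: all quadrature points at once;
        return (1.0 + u*u) * u    # numpy broadcasting runs the loop in C

    u = np.linspace(0.0, 1.0, 5)
    assert np.allclose([sigma_point(ui) for ui in u], sigma_vectorized(u))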

since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-07 Thread Matthew Knepley
On Mon, Dec 7, 2009 at 3:05 PM, Chris Kees 
christopher.e.kees at usace.army.mil wrote:

 This whole thread has been really interesting. I'm interested in  Barry's
 suggestion, and I know a handful of people not on the list who would also be
 interested.   I've got a couple of questions/responses to the MatSetValues,
 point-wise physics, and support tools issues that were raised.

 Did you guys settle the issue that Jed brought up about the Python
 Global Interpreter Lock (GIL)?  There was some discussion of it at SciPy
 this summer, but I didn't come away feeling like anybody had an immediate
 solution or that Guido was in favor of eliminating the GIL anytime soon.
  Are you guys assuming that the worker threads  for the multiple
 cores/GPUs are going to be set up by C code (hand- or machine-generated)?


I think this is going to be solved by 1) rearranging the code to
segregate reductions or conflicts, 2) using atomics, and then 3) intelligent
reordering to minimize the
conflicts. I have never seen this be a big problem, but some code is always
going to have the bad tradeoff I guess.


 It seems to me that the MatSetValues discussion moved on from the threading
 issue (since you can't call MatSetValues efficiently in python anyway) to
 the more general issue of how PETSc is going to explicitly support shared
 memory parallelism. I interpret the last interchange as implying that the
 solution is another level of domain decomposition that would be hidden from
 the mat assembly loop level (i.e. users aren't going to have to write their
 code like the thread part is their parallel subdomain). Is that a correct
 interpretation? Something along the lines of additional args to MatCreate
 that indicate how many threads to use per subdomain and some OpenMP-style
 instructions to parallelize the assembly loop?


I am not advocating that. I want a CUDA-like solution, but of course you can
probably convert any one to the other with a
tremendous amount of pain. I wish for all applications to be matrix-free with
only vector assemblies, because then I think the problem
is very simple. I have not seen any compelling argument against this
approach, and all the best codes I know use it.


 On the issue of the point-wise physics (e.g. quadrature point routines,
 Riemann solvers, ...). It seems like this is a long-standing bottleneck for
 object-oriented approaches to PDEs going back to the OO FEM research in the
 80s and 90s.  The SciPy group that met to talk about PDEs spent some time
 on it, partly related to the femhub project.  I agree that a code generation
 approach seems to be the only way forward.  Ondrej Certik from the femhub
 project has written a symbolic math module for python, and we talked some
 about using this to represent the physics in a way that can be used for code
 generation.  I'm not sure what progress he's made. At the end of the day
 you've got to allow people to write code for physics without knowledge of
 the assembly processes while still injecting that code into the inner loop;
 otherwise you'll never approach the speed of monolithic fortran codes.


This problem has been obvious for some time, but needs a solution even a
Fortran programmer could love. Or at least a bad C programmer.


 One issue that you might also want to consider with multi-language
 approaches is the configuration and build tools.   I agree that the debugging
 support is a major issue, but what has taken far more of my time is the lack
 of cross-platform configuration and build tools for multi-language software.
  We have looked at CMake and Scons but have had to stick with a pretty ugly
 mixture of python scripts and make. I feel like the other
 scientific computing packages like Sage and EPD are successfully growing
 because they distribute binaries that include all dependencies, but I don't
 see that being feasible for HPC software anytime soon.  It seems like to
 really be successful you may have to rely on a full-featured and fully
 documented tool like the one Scons is aiming to be; otherwise you'll end up with
 an approach that works for you (because you wrote it) but not very well for
 others trying to use the new PETSc.


I think we do a better job than anyone with PETSc right now, but we will
have to work a little harder to get everything right.


 We have a small group at the US Army ERDC that has been developing a
 python FEM framework for multiscale/multiphysics modeling that tries to
 break coupling between the physics and the FEM methods (i.e. we try to
 support CG, DG, and non-conforming elements with variable order and element
 topologies running from the same description of the physics). The code uses
 our own wrappers to petsc and mpi, but only because we started before mpi4py
 and petsc4py existed.  We have a memory-intensive physics API based on
 computing and storing quadrature-point values of PDE coefficients and
 numerical fluxes, and residual and jacobian assembly processes along the
 lines of 

since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-07 Thread Barry Smith

On Dec 7, 2009, at 9:53 AM, Alex Peyser wrote:

 Have y'all considered a different approach?

 I've had endless trouble integrating petsc into my system because  
 the object-
 oriented C is only partially object oriented while  
 simultaneously
 making access to the underlying functionality obscure.

Could you please give us cases where access to the underlying  
functionality is obscure? Everything is supposed to be accessible  
through an object oriented approach.  (Sorry about being defensive)  
We'd like to fix this.

 Some objects have
 some methods, and others are missing;

 This is intentional. To require each object to fully support all  
its interface methods would require having many different classes (for  
example matrix-free methods cannot support the factorization and  
setting/getting values interfaces so they would have to be separated  
into subclasses). I (and this is me, perhaps most others don't have  
this difficulty) find that the complexity of using a class library grows  
RAPIDLY with the number of classes and complexity of class  
hierarchies. Thus we kept a very small number of classes in PETSc; we  
could only do this by cheating and not requiring all objects to  
support all methods.

 some have the same functionality through
 higher level interfaces or lower level interfaces..

 I've tried very hard to prevent having the same functionality at  
two levels; the one place where we do have it is in the matrix  
factorizations and triangular solves where you can do it directly on  
matrix objects or through KSP and PC, but we do note in the matrix  
documentation that the preferred approach is with KSP/PC and list them as  
developer level usage. We could hide the direct matrix solver  
methods even more by not providing public interfaces for them  
completely; perhaps we should have done that. Could you please list  
other things that have the same functionality at a higher level interface  
and lower level so we can perhaps improve that experience. It is our  
intention (except with the factorization and solves) to NOT have the  
same functionality accessible from two levels.

 . and so on.

 Maybe you need to split this into two projects -- one a set of  
 normal C
 interfaces to the kernels, communications, etc -- the underlying  
 functionality

We ended up with the hybrid model because we could not see or  
even conceive of how to write a normal C interface for sparse  
matrices and preconditioners. The closest numerical package to PETSc  
that I see that uses a normal C interface and is widely used is  
LAPACK, and it only supports dense and banded matrices sequentially;  
its interface is already overwhelming, difficult and not extensible.  
Bill and I selected an object oriented approach in 1995 not because  
we thought it was cool or modern but because we viewed making a usable  
general purpose package for sparse solvers with a normal C interface  
as impossible. We could have been wrong, and could still be wrong  
about this, but I've yet to see any usable general purpose sparse  
solver package that works with a normal C interface. I think it is  
impossible. ScaLAPACK tries to use a normal C interface (but it  
still has some things resembling objects) and is a totally unusable  
piece of garbage, and again is only for dense matrices.

   I appreciate your comments, again sorry for coming across as  
defensive.

Barry


 -- and a separate object oriented layer on top of it written in  
 whatever 
 language you prefer (I find python to be a very, very evil language,  
 but y'all
 seem to have a taste for it).

 This would both allow the functionality to be adapted to multiple  
 overlaying
 interfaces, and keep the low overhead to initial adoption. It seems  
 that the
 current approach is a shoe horn of both very low level and high  
 level behavior
 into the same programming paradigm -- which always leads to a coding  
 mess.

 It seems particularly ugly to bind low level languages such as C++  
 and Fortran
 onto a higher level, interpreted language like Python -- binding C  
 onto python
 which then binds onto C! And then binding any other language would  
 be a
 sequence of binding X onto C++ to bind onto python to bind onto C!

 Alex Peyser

 On Friday 04 December 2009 11:42:35 pm Barry Smith wrote:
Suggestion:

 1) Discard PETSc
 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches
 tasks onto GPUs and multi-core systems (generally we would have one
 python process per compute node and local parallelism would be done
 via the low-level kernels to the cores and/or GPUs.)
 3) Write a new PETSc using MPI4py and 2) written purely in Python  
 3000
 using all its cool class etc features
 4) Use other packages (like f2py) to generate bindings so that 3)
 may be called from Fortran90/2003 and C++ (these probably suck since
 people usually think the other way around; calling Fortran/C++ from
 Python, but maybe we shouldn't care; we and our 

since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-06 Thread Jed Brown
On Sat, 5 Dec 2009 16:50:38 -0600, Matthew Knepley knepley at gmail.com wrote:
 You assign a few threads per element to calculate the FEM
 integral. You could maintain this unassembled if you only need
 actions.

You can also store it with much less memory as just values at quadrature
points.
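
A sketch of that storage trade with invented sizes: keep only the
quadrature-point data D (nq values) instead of the assembled element
matrix B^T diag(D) B (ndof x ndof), and apply the action matrix-free.

    import numpy as np

    nq, ndof = 8, 27               # quadrature points, element dofs
    B = np.random.rand(nq, ndof)   # basis values at quadrature points
    D = np.random.rand(nq)         # coefficient times quadrature weight

    def apply_element(u):          # action of the unassembled element matrix
        return np.dot(B.T, D * np.dot(B, u))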

 However, if you want an actual sparse matrix, there are a couple of
 options

 
   1) Store the unassembled matrix, and run assembly after integration
 is complete. This needs more memory, but should perform well.

Fine, but how is this assembly done?  If it's serial then it would be a
bottleneck, so you still need the concurrent thing below.

   2) Use atomic operations to update. I have not seen this yet, so I am
 unsure how it will perform.

Atomic operations could be used per-entry but this costs on the order of
100 cycles on CPUs.  I think newer GPUs have atomics, but I don't know
the cost.  Presumably it's at least as much as the latency of a read
from memory.

When inserting in decent sized chunks, it would probably be worth taking
per-row or larger locks to amortize the cost of the atomics.
Additionally, you could statically partition the workload and only use
atomics for rows/entries that were shared.
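
A conceptual sketch of the per-row-lock idea (in Python, ignoring the GIL;
the dict standing in for the matrix is invented): each lock is taken once
per inserted chunk, amortizing its cost over many entries.

    import threading
    from collections import defaultdict

    nrows = 100
    row_locks = [threading.Lock() for _ in range(nrows)]
    entries = defaultdict(float)         # stand-in for the sparse matrix

    def add_values(rows, cols, vals):
        touched = sorted(set(rows))      # fixed locking order avoids deadlock
        for r in touched:
            row_locks[r].acquire()
        try:
            for r, c, v in zip(rows, cols, vals):
                entries[(r, c)] += v     # ADD_VALUES-style accumulation
        finally:
            for r in touched:
                row_locks[r].release()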

Jed



since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Matthew Knepley
On Fri, Dec 4, 2009 at 10:42 PM, Barry Smith bsmith at mcs.anl.gov wrote:


   Suggestion:

 1) Discard PETSc
 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches tasks
 onto GPUs and multi-core systems (generally we would have one python process
 per compute node and local parallelism would be done via the low-level
 kernels to the cores and/or GPUs.)
 3) Write a new PETSc using MPI4py and 2) written purely in Python 3000
 using all its cool class etc features
 4) Use other packages (like f2py) to generate bindings so that 3) may
 be called from Fortran90/2003 and C++ (these probably suck since people usually
 think the other way around; calling Fortran/C++ from Python, but maybe we
 shouldn't care; we and our friends can just be 10 times more efficient
 developing apps in Python).

   enjoy coding much better than today.

  What is wrong with Python 3000 that would make this approach not be great?


I am a very big fan of this approach. Let me restate it:

  a) Write the initial code in Python for correctness checking, however
develop a performance model which will allow transition to an accelerator

  b) Move key pieces to a faster platform using

  i) Cython

  ii) PyCUDA

  c) Coordinate loose collection of processes with MPI for large problems

A few comments. Notice that for many people c) is unnecessary if you can
coordinate several GPUs from one CPU. The
key piece here is a dispatch system. Felipe, Rio, and I are getting this
done now. Second, we can leverage all of petsc4py
in step b.

My past attempts at this development model have always floundered
on inner loops or iterations. These cannot be
done in Python (too slow) and cannot be wrapped (too much overhead).
However, now we have a way to do this, namely
RunTime Code Generation (like PyCUDA). I think this will get us over the
hump, but we have to rethink how we code things,
especially traversals, which now become lists of scheduled tasks as in FLASH
(from van de Geijn).
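
For reference, a minimal sketch of run-time code generation with PyCUDA
(a toy axpy kernel; assumes a CUDA-capable machine): the kernel source is
an ordinary Python string, so it can be generated and specialized at run
time before compilation.

    import numpy as np
    import pycuda.autoinit                 # creates a CUDA context
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void axpy(float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        y[i] = a * x[i] + y[i];
    }
    """)
    axpy = mod.get_function("axpy")

    x = gpuarray.to_gpu(np.ones(256, dtype=np.float32))
    y = gpuarray.to_gpu(np.ones(256, dtype=np.float32))
    axpy(np.float32(2.0), x, y, block=(256, 1, 1), grid=(1, 1))
    assert np.allclose(y.get(), 3.0)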

  Matt



   Barry

 When coding a new numerical algorithm for PETSc we would just code in
 Python, then when tested and happy with it reimplement in Py{CL, CUDA,
 OpenMP-C}

 The other choice is designing and implementing our own cool/great OO
 language with the flexibility and power we want, but I fear that is way too
 hard  and why not instead leverage Python.






-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener


since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Jed Brown
This is an interesting proposal.  Two thoughts:

Residual and Jacobian evaluation cannot be written in Python (though it
can be prototyped there).  After a discretization is chosen, the physics
is usually representable as a tiny kernel (Riemann solvers/pointwise
operation at quadrature points), but if the user is still in charge of
the discretization, it will need services from this new PETSc.  So we
need to be able to provide good debugging support across languages and
concurrency models, i.e. Python/MPI-C/OpenMP-Python-CUDA.

Python is not statically typed.  This is both a strength and a weakness,
but dynamic typing makes debugger support more important because the
compiler can't catch type errors; instead we'll find them when an
exception is thrown in the multi-language hybrid environment.

Jed



since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Matthew Knepley
On Sat, Dec 5, 2009 at 1:06 PM, Jed Brown jed at 59a2.org wrote:

 This is an interesting proposal.  Two thoughts:

 Residual and Jacobian evaluation cannot be written in Python (though it
 can be prototyped there).  After a discretization is chosen, the physics
 is usually representable as a tiny kernel (Riemann solvers/pointwise
 operation at quadrature points), but if the user is still in charge of
 the discretization, it will need services from this new PETSc.  So we
 need to be able to provide good debugging support across languages and
 concurrency models, i.e. Python/MPI-C/OpenMP-Python-CUDA.

 Python is not statically typed.  This is both a strength and a weakness,
 but dynamic typing makes debugger support more important because the
 compiler can't catch type errors; instead we'll find them when an
 exception is thrown in the multi-language hybrid environment.


This is a great point. Debugging is the Achilles Heel of multi-language
programming. That is
why I propose a dual program. All code should be in a single language, which
allows efficient
debugging. You are correct to point out that Python debugging is all at
runtime.

Then kernels are moved to an accelerator. The nice part here is that
debugging becomes
regression testing, since we have a working model to test against. The worst
scenario is one in which
a bug only appears in the optimized kernels at a scale unattainable in the
pure Python.

  Matt


 Jed

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener


since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Jed Brown
On Sat, 5 Dec 2009 13:09:33 -0600, Matthew Knepley knepley at gmail.com wrote:
 Then kernels are moved to an accelerator.

These kernels necessarily involve user code (physics).  It's a lot to
ask users to maintain two versions of their physics, one which is
debuggable and another which is fast and runs in a very different
context (fine granularity parallel, either with threads or in a GPU
kernel).

Jed



since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Brad Aagaard
As someone who has a finite-element code built upon PETSc/Sieve with the 
top-level code in Python, I am in favor of Barry's approach.

As Matt mentions, debugging multi-language code is more complex. Unit testing 
helps solve some of this because tests associated with the low-level 
code involve only one language and find most of the bugs.

We started with manual C++/Python interfaces, then moved to Pyrex, and 
now use SWIG. Because we use C++, the OO support in SWIG results in a 
much simpler, cleaner interface between Python and C++ than what 
is possible with Pyrex or Cython. SWIG has eliminated 95% of the effort 
to interface Python and C++ compared to Pyrex.

Brad

Matthew Knepley wrote:
 On Fri, Dec 4, 2009 at 10:42 PM, Barry Smith bsmith at mcs.anl.gov 
 mailto:bsmith at mcs.anl.gov wrote:
 
 
   Suggestion:
 
 1) Discard PETSc
 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches
 tasks onto GPUs and multi-core systems (generally we would have
 one python process per compute node and local parallelism would be
 done via the low-level kernels to the cores and/or GPUs.)
 3) Write a new PETSc using MPI4py and 2) written purely in Python
 3000 using all its cool class etc features
 4) Use other packages (like f2py) to generate bindings so that 3)
  may be called from Fortran90/2003 and C++ (these probably suck since
 people usually think the other way around; calling Fortran/C++ from
 Python, but maybe we shouldn't care; we and our friends can just be
 10 times more efficient developing apps in Python).
 
   enjoy coding much better than today.
 
  What is wrong with Python 3000 that would make this approach not be
 great?
 
 
  I am a very big fan of this approach. Let me restate it:
 
   a) Write the initial code in Python for correctness checking, however 
 develop a performance model which will allow transition to an accelerator
 
   b) Move key pieces to a faster platform using
 
   i) Cython
 
   ii) PyCUDA
 
   c) Coordinate loose collection of processes with MPI for large problems
 
 A few comments. Notice that for many people c) is unnecessary if you can 
 coordinate several GPUs from one CPU. The
 key piece here is a dispatch system. Felipe, Rio, and I are getting this 
  done now. Second, we can leverage all of petsc4py
 in step b.
 
  My past attempts at this development model have always 
 floundered on inner loops or iterations. These cannot be
 done in Python (too slow) and cannot be wrapped (too much overhead). 
 However, now we have a way to do this, namely
 RunTime Code Generation (like PyCUDA). I think this will get us over the 
 hump, but we have to rethink how we code things,
 especially traversals, which now become lists of scheduled tasks as in 
 FLASH (from van de Geijn).
 
   Matt
  
 
 
   Barry
 
 When coding a new numerical algorithm for PETSc we would just code
  in Python, then when tested and happy with it reimplement in Py{CL,
 CUDA, OpenMP-C}
 
 The other choice is designing and implementing our own cool/great OO
  language with the flexibility and power we want, but I fear that is
  way too hard  and why not instead leverage Python.
 
 
 
 
 
 
 -- 
 What most experimenters take for granted before they begin their 
 experiments is infinitely more interesting than any results to which 
 their experiments lead.
 -- Norbert Wiener




since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Jed Brown
Somehow this drifted off the list, hopefully the deep citations provide
sufficient context.

On Sat, 5 Dec 2009 13:42:31 -0600, Matthew Knepley knepley at gmail.com wrote:
 On Sat, Dec 5, 2009 at 1:32 PM, Jed Brown jed at 59a2.org wrote:
 
  On Sat, 5 Dec 2009 13:20:20 -0600, Matthew Knepley knepley at gmail.com
  wrote:
   I think this involves a change in mindset. Currently, people merge
   the tasks of writing correct and optimized code. The movement of a
   kernel from Python to C is just an optimization step, and in some
   cases (witness Psyco) can be automated. I suggest that the Python is
   written in exactly the same style as the CUDA, so it's more a matter of
   translation than wholesale rewriting.
 
  I agree, the hard optimization is high-level data flow stuff that can
  be done in any language, the kernel-level stuff is easy in that at
  least it's local.  But in the real world, people have some large
  model and need to add a new constitutive relation, or another tracer,
  and it looks like a waste of time to write it twice, so they stuff it
  into the optimized kernel, make a typo, and have to figure out why it's
  behaving badly.  Or they do it correctly, but now the next person
  doesn't want to port all the legacy stuff back to the Python reference
  before adding a new wave.
 
 
 Yes, that comes down to coding discipline. A big reason that PETSc has
 controlled complexity is that coding discipline has been enforced, and
 periodically the code is cleaned up and fixed where we lapsed. I see
 this as a continuum of discipline rather than a new requirement.

It's easy to enforce this *within* PETSc, and I don't have a problem
with maintaining reference implementations of those kernels.  But we
can't require users to follow the same guidelines.  Many/most users are
not going to know in advance how to write Python implementations that
are easily translated to CL/CUDA/whatever.


  I just think it's a lot to ask, not that it's wrong to ask.
 
 
 It's good to be very explicit about the benefits and the costs of
 changing the development model. So far, I think the benefits of this
 approach appear to outweigh the costs. However, the acid test is
 making it easier for some certain researcher to write a certain code,
 so we need a guinea pig.

It might be jumping the gun a little to go looking for a guinea pig
without a prototype (or even a concrete design for one).
Reimplementation is a nontrivial task, even if most of it would be
pretty mechanical.  I think PETSc's object model is really good but it
requires significant boilerplate to implement in C.  In what ways would
a new PETSc, implemented in native Python, be different for the user
than using petsc4py today?  Would it offer a fundamentally different
interface?

Jed

 
 
  And debugging optimized code is sometimes necessary, e.g. the invalid
  frees due to improper Malloc2 that I fixed the other day only showed up
  in optimized builds.
 
 
 Undeniable. We just need to work to minimize this. It should never become
 routine.
 
   Matt
 
 
  Jed
 
 -- 
 What most experimenters take for granted before they begin their experiments
 is infinitely more interesting than any results to which their experiments
 lead.
 -- Norbert Wiener



since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Dima Karpeyev
Cython can accelerate almost any Python code nearly immediately
(although it supports a somewhat restricted subset of Python).
This is simply due to converting it to equivalent C code that is compiled
and runs within CPython.

Then, chunks of the Python code can be explicitly typed, which can result
in a further substantial speedup.  In my experience, however, the code becomes
much less readable (Cython ain't no Python).  Also, the compilation process
becomes somewhat cumbersome: Cython is followed by cc, which slows the
development process.  In addition, debugging can be tricky.  It is
helped a little
bit by the fact that the Cython-generated C includes the original Cython code
as comments.

Using PyCUDA after Cython as a further accelerator is definitely a
promising route:
that's what I'm currently doing with two codes I had to develop from
scratch (and I think Jay
Bardhan will take that route on a brand-new code too).

Dmitry.


On Sat, Dec 5, 2009 at 1:06 PM, Jed Brown jed at 59a2.org wrote:
 This is an interesting proposal.  Two thoughts:

 Residual and Jacobian evaluation cannot be written in Python (though it
 can be prototyped there).  After a discretization is chosen, the physics
 is usually representable as a tiny kernel (Riemann solvers/pointwise
 operation at quadrature points), but if the user is still in charge of
 the discretization, it will need services from this new PETSc.  So we
 need to be able to provide good debugging support across languages and
 concurrency models, i.e. Python/MPI-C/OpenMP-Python-CUDA.

 Python is not statically typed.  This is both a strength and a weakness,
 but dynamic typing makes debugger support more important because the
 compiler can't catch type errors; instead we'll find them when an
 exception is thrown in the multi-language hybrid environment.

 Jed




since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Matthew Knepley
On Sat, Dec 5, 2009 at 2:29 PM, Ethan Coon etc2103 at columbia.edu wrote:

 I'm a big fan of Barry's approach as well.

 However, the current state of debugging tools is not up to snuff for
 this type of model.  In using petsc4py regularly, debugging cython and
 python (user-defined) functions after they've been passed into PETSc
 just plain sucks -- it basically means reverting to print statements.
 And while unit testing can help for a lot of it, expecting users to
 write unit tests for even the most basic of their physics functions is a
 little unreasonable.  Even basic exception propagation would be a huge
 step forward.


This is a very interesting issue. Suppose you write the RHSFunction in
Python
and pass it to SNES. Are you saying that pdb cannot stop in that method when
you step over SNESSolve() in Python? That would suck. If on the other hand,
you passed it in C, I can see how you are relegated to obscure gdb. This
happens
to me in PyLith.
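
For reference, the situation in question looks roughly like this with
petsc4py (a minimal sketch; the residual and Jacobian are toys for
solving x^2 = 2). Since residual() runs as a plain Python frame, pdb can
break inside it even when it is invoked from within snes.solve():

    from petsc4py import PETSc

    def residual(snes, x, f):
        # import pdb; pdb.set_trace()    # this frame is pure Python
        f[0] = x[0]*x[0] - 2.0
        f.assemble()

    def jacobian(snes, x, J, P):
        P[0, 0] = 2.0*x[0]
        P.assemble()

    snes = PETSc.SNES().create()
    f = PETSc.Vec().createSeq(1)
    snes.setFunction(residual, f)
    J = PETSc.Mat().createAIJ([1, 1], nnz=1)
    snes.setJacobian(jacobian, J)
    snes.setFromOptions()

    x = PETSc.Vec().createSeq(1)
    x[0] = 1.0
    snes.solve(None, x)
    print(x[0])                          # ~1.41421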

  Matt


 I'm all in favor of a hybrid model, but the debugging support/issues
 would need to be addressed.

 Ethan


 On Sat, 2009-12-05 at 12:03 -0800, Brad Aagaard wrote:
  As someone who has a finite-element code built upon PETSc/Sieve with the
  top-level code in Python, I am in favor of Barry's approach.
 
  As Matt mentions, debugging multi-language code is more complex. Unit testing
  helps solve some of this because tests associated with the low-level
  code involve only one language and find most of the bugs.
 
  We started with manual C++/Python interfaces, then moved to Pyrex, and
  now use SWIG. Because we use C++, the OO support in SWIG results in a
   much simpler, cleaner interface between Python and C++ than what
  is possible with Pyrex or Cython. SWIG has eliminated 95% of the effort
  to interface Python and C++ compared to Pyrex.
 
  Brad
 
  Matthew Knepley wrote:
   On Fri, Dec 4, 2009 at 10:42 PM, Barry Smith bsmith at mcs.anl.gov
   mailto:bsmith at mcs.anl.gov wrote:
  
  
 Suggestion:
  
   1) Discard PETSc
   2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches
   tasks onto GPUs and multi-core systems (generally we would have
   one python process per compute node and local parallelism would be
   done via the low-level kernels to the cores and/or GPUs.)
   3) Write a new PETSc using MPI4py and 2) written purely in Python
   3000 using all its cool class etc features
   4) Use other packages (like f2py) to generate bindings so that 3)
    may be called from Fortran90/2003 and C++ (these probably suck since
   people usually think the other way around; calling Fortran/C++ from
   Python, but maybe we shouldn't care; we and our friends can just be
   10 times more efficient developing apps in Python).
  
 enjoy coding much better than today.
  
What is wrong with Python 3000 that would make this approach not
 be
   great?
  
  
    I am a very big fan of this approach. Let me restate it:
  
 a) Write the initial code in Python for correctness checking, however
   develop a performance model which will allow transition to an
 accelerator
  
 b) Move key pieces to a faster platform using
  
 i) Cython
  
 ii) PyCUDA
  
 c) Coordinate loose collection of processes with MPI for large
 problems
  
   A few comments. Notice that for many people c) is unnecessary if you
 can
   coordinate several GPUs from one CPU. The
   key piece here is a dispatch system. Felipe, Rio, and I are getting
 this
    done now. Second, we can leverage all of petsc4py
   in step b.
  
    My past attempts at this development model have always
   floundered on inner loops or iterations. These cannot be
   done in Python (too slow) and cannot be wrapped (too much overhead).
   However, now we have a way to do this, namely
   RunTime Code Generation (like PyCUDA). I think this will get us over
 the
   hump, but we have to rethink how we code things,
   especially traversals, which now become lists of scheduled tasks as in
   FLASH (from van de Geijn).
  
 Matt
  
  
  
 Barry
  
   When coding a new numerical algorithm for PETSc we would just code
    in Python, then when tested and happy with it reimplement in Py{CL,
   CUDA, OpenMP-C}
  
   The other choice is designing and implementing our own cool/great
 OO
    language with the flexibility and power we want, but I fear that is
    way too hard  and why not instead leverage Python.
  
  
  
  
  
  
   --
   What most experimenters take for granted before they begin their
   experiments is infinitely more interesting than any results to which
   their experiments lead.
   -- Norbert Wiener
 
 
 --
 ---
 Ethan Coon
 DOE CSGF - Graduate Student
  Dept. Applied Physics & Applied Mathematics
 Columbia University
 212-854-0415

 http://www.ldeo.columbia.edu/~ecoon/
 

since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Ethan Coon

 
 This is a very interesting issue. Suppose you write the RHSFunction in
 Python
 and pass to SNES. Are you saying that pdb cannot stop in that method
 when
 you step over SNESSolve() in Python? That would suck. If on the other
 hand,
 you passed in C, I can see how you are relegated to obscure gdb. This
 happens
 to me in PyLith.
 

No, I believe pure Python works fine.  Cython does weird things however.
Haven't tried pure C.  If you have a way of getting gdb to step between
the python parts, the petsc parts, and the passed-in functions in
C/Cython, I'd love to hear it (offlist would probably be better).

Ethan




   Matt
  
 I'm all in favor of a hybrid model, but the debugging
 support/issues
 would need to be addressed.
 
 Ethan
 
 
 
 On Sat, 2009-12-05 at 12:03 -0800, Brad Aagaard wrote:
  As someone who has a finite-element code built upon
 PETSc/Sieve with the
  top-level code in Python, I am in favor of Barry's approach.
 
   As Matt mentions, debugging multi-language code is more complex.
 Unit testing
  helps solve some of this because tests associated with the
 low-level
  code involve only one language and find most of the bugs.
 
  We started with manual C++/Python interfaces, then moved to
 Pyrex, and
  now use SWIG. Because we use C++, the OO support in SWIG
 results in a
   much simpler, cleaner interface between Python and
  C++ than what
  is possible with Pyrex or Cython. SWIG has eliminated 95% of
 the effort
  to interface Python and C++ compared to Pyrex.
 
  Brad
 
  Matthew Knepley wrote:
   On Fri, Dec 4, 2009 at 10:42 PM, Barry Smith
 bsmith at mcs.anl.gov
   mailto:bsmith at mcs.anl.gov wrote:
  
  
 Suggestion:
  
   1) Discard PETSc
   2) Develop a general Py{CL, CUDA, OpenMP-C} system
 that dispatches
   tasks onto GPUs and multi-core systems (generally we
 would have
   one python process per compute node and local
 parallelism would be
   done via the low-level kernels to the cores and/or
 GPUs.)
   3) Write a new PETSc using MPI4py and 2) written
 purely in Python
   3000 using all its cool class etc features
   4) Use other packages (like f2py) to generate bindings
 so that 3)
    may be called from Fortran90/2003 and C++ (these
 probably suck since
   people usually think the other way around; calling
 Fortran/C++ from
   Python, but maybe we shouldn't care; we and our
 friends can just be
   10 times more efficient developing apps in Python).
  
 enjoy coding much better than today.
  
What is wrong with Python 3000 that would make this
 approach not be
   great?
  
  
    I am a very big fan of this approach. Let me restate it:
  
 a) Write the initial code in Python for correctness
 checking, however
   develop a performance model which will allow transition to
 an accelerator
  
 b) Move key pieces to a faster platform using
  
 i) Cython
  
 ii) PyCUDA
  
 c) Coordinate loose collection of processes with MPI for
 large problems
  
   A few comments. Notice that for many people c) is
 unnecessary if you can
   coordinate several GPUs from one CPU. The
   key piece here is a dispatch system. Felipe, Rio, and I
 are getting this
    done now. Second, we can leverage all of petsc4py
   in step b.
  
    My past attempts at this development model have
 always
   floundered on inner loops or iterations. These cannot be
   done in Python (too slow) and cannot be wrapped (too much
 overhead).
   However, now we have a way to do this, namely
   RunTime Code Generation (like PyCUDA). I think this will
 get us over the
   hump, but we have to rethink how we code things,
   especially traversals, which now become lists of scheduled
 tasks as in
   FLASH (from van de Geijn).
  
 Matt
  
  
  
 Barry
  
   When coding a new numerical algorithm for PETSc we
 would just code
    in Python, then when tested and happy with it reimplement
  in Py{CL,
   CUDA, OpenMP-C}
  
   The other choice is designing and implementing our 

since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Jed Brown
On Fri, 4 Dec 2009 22:42:35 -0600, Barry Smith bsmith at mcs.anl.gov wrote:
 generally we would have one python process per compute node and local
 parallelism would be done via the low-level kernels to the cores
 and/or GPUs.

I think one MPI process per node is fine for MPI performance on good
hardware because the HCA reads and writes from registered memory without
involving the CPU, but I'm not sure it's actually a better model.

How do you envision implementing MatSetValues()?  If there is only one
MPI process per node, would there be another full level of domain
decomposition based on threads?  Otherwise you need a concurrent
MatSetValues which would make proper preallocation essential and make
cache coherence a very sensitive matter.
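
For reference, a sketch of the (serial) insertion path in question via
petsc4py, with a toy tridiagonal fill: preallocate, insert chunks with
ADD_VALUES, then assemble. A concurrent variant of setValues is the part
that does not exist:

    from petsc4py import PETSc

    n = 10
    A = PETSc.Mat().createAIJ([n, n], nnz=3)   # preallocate 3 nonzeros/row
    for i in range(n):
        cols = [j for j in (i-1, i, i+1) if 0 <= j < n]
        A.setValues([i], cols, [1.0]*len(cols),
                    addv=PETSc.InsertMode.ADD_VALUES)
    A.assemble()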

And the huge algorithmic issue: triangular kernels are the backbone of
almost every preconditioner and are inherently sequential.  If only one
process per node does MPI, then all these algorithms would need
three-level implementations (decompose the per-node subdomains into
per-core subdomains and use a different concurrency scheme at this
smaller granularity).  The use of threads on the many cores per node
potentially offers more performance through the use of lock-free shared
data structures with NUMA-aware work distribution.  But separate memory
space is much more deterministic, thus easier to work with.
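
The sequential nature is easy to see in forward substitution, where every
unknown depends on all previously computed ones:

    import numpy as np

    def forward_solve(L, b):           # L lower triangular, b float array
        x = np.zeros_like(b)
        for i in range(len(b)):        # x[i] needs every x[j] with j < i
            x[i] = (b[i] - np.dot(L[i, :i], x[:i])) / L[i, i]
        return x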

I have trouble finding decent preconditioning algorithms suitable for
the fine granularity of GPUs.  Matt thinks we can get rid of all the
crappy sparse matrix kernels and precondition everything with FMM.

Note that all Python implementations have a global interpreter lock
which could also make a single Python process the bottleneck.

Jed



since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Matthew Knepley
On Sat, Dec 5, 2009 at 3:50 PM, Jed Brown jed at 59a2.org wrote:

 On Fri, 4 Dec 2009 22:42:35 -0600, Barry Smith bsmith at mcs.anl.gov wrote:
  generally we would have one python process per compute node and local
  parallelism would be done via the low-level kernels to the cores
  and/or GPUs.

 I think one MPI process per node is fine for MPI performance on good
 hardware because the HCA reads and writes from registered memory without
 involving the CPU, but I'm not sure it's actually a better model.

 How do you envision implementing MatSetValues()?  If there is only one
 MPI process per node, would there be another full level of domain
 decomposition based on threads?  Otherwise you need a concurrent
 MatSetValues which would make proper preallocation essential and make
 cache coherence a very sensitive matter.


I need to understand better. You are asking about the case where we have
many GPUs and one CPU? If it's always one or two GPUs per CPU I do not
see the problem.

And the huge algorithmic issue: triangular kernels are the backbone of
 almost every preconditioner and are inherently sequential.  If only one
 process per node does MPI, then all these algorithms would need
 three-level implementations (decompose the per-node subdomains into
 per-core subdomains and use a different concurrency scheme at this
 smaller granularity).  The use of threads on the many cores per node
 potentially offers more performance through the use of lock-free shared
 data structures with NUMA-aware work distribution.  But separate memory
 space is much more deterministic, thus easier to work with.


Hmm, still not quite getting this problem. We need concurrency on the GPU,
but why would we need it on the CPU? On the GPU, triangular solve will be
just as crappy as it currently is, but will look even worse due to the large
number
of cores. It just has very little concurrency. We need a better option. It
is not
the only smoother. For instance, polynomial smoothers would be more
concurrent.


 I have trouble finding decent preconditioning algorithms suitable for
 the fine granularity of GPUs.  Matt thinks we can get rid of all the
 crappy sparse matrix kernels and precondition everything with FMM.


That is definitely my view, or at least my goal. And I would say this: if we
are just
starting out on these things, I think it makes sense to do the home runs
first. If we
just try and reproduce things, people might say "That is nice, but I can
already do that
pretty well."

   Matt


 Note that all Python implementations have a global interpreter lock
 which could also make a single Python process the bottleneck.

 Jed

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener


since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Jed Brown
On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley knepley at gmail.com wrote:
 I need to understand better. You are asking about the case where we have
 many GPUs and one CPU? If it's always one or two GPUs per CPU I do not
 see the problem.

Barry initially proposed one Python thread per node, then distributing
the kernels over many CPU cores on that node, or to one-or-more GPUs.
With some abuse of terminology, let's call them all worker threads,
perhaps dozens if running on multicore CPUs, or hundreds/thousands when
on a GPU.  The physics, such as FEM integration, has to be done by those
worker threads.  But unless every thread is its own subdomain
(i.e. Block Jacobi/ASM with very small subdomains), we still need to
assemble a small number of matrices per node.  So we would need a
lock-free concurrent MatSetValues, otherwise we'll only scale to a few
worker threads before everything is blocked on MatSetValues.

 Hmm, still not quite getting this problem. We need concurrency on the
 GPU, but why would we need it on the CPU?

Only if we were doing real work on the many CPU cores per node.

 On the GPU, triangular solve will be just as crappy as it currently
 is, but will look even worse due to the large number of cores.

It could be worse because a single GPU thread is likely slower than a
CPU core.

 It is not the only smoother. For instance, polynomial smoothers would
 be more concurrent.

Yup.

  I have trouble finding decent preconditioning algorithms suitable for
  the fine granularity of GPUs.  Matt thinks we can get rid of all the
  crappy sparse matrix kernels and precondition everything with FMM.
 
 
 That is definitely my view, or at least my goal. And I would say this:
 if we are just starting out on these things, I think it makes sense to
 do the home runs first. If we just try and reproduce things, people
 might say "That is nice, but I can already do that pretty well."

Agreed, but it's also important to have something good to offer people
who aren't ready to throw out everything they know and design a new
algorithm based on a radically different approach that may or may not be
any good for their physics.

Jed



since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Matthew Knepley
On Sat, Dec 5, 2009 at 4:25 PM, Jed Brown jed at 59a2.org wrote:

 On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley knepley at gmail.com
 wrote:
  I need to understand better. You are asking about the case where we have
  many GPUs and one CPU? If it's always one or two GPUs per CPU I do not
  see the problem.

 Barry initially proposed one Python thread per node, then distributing
 the kernels over many CPU cores on that node, or to one-or-more GPUs.
 With some abuse of terminology, let's call them all worker threads,
 perhaps dozens if running on multicore CPUs, or hundreds/thousands when
 on a GPU.  The physics, such as FEM integration, has to be done by those
 worker threads.  But unless every thread is its own subdomain
 (i.e. Block Jacobi/ASM with very small subdomains), we still need to
 assemble a small number of matrices per node.  So we would need a
 lock-free concurrent MatSetValues, otherwise we'll only scale to a few
 worker threads before everything is blocked on MatSetValues.


I imagined that this kind of assembly will be handled similarly to what we
do in the FMM.
You assign a few threads per element to calculate the FEM integral. You
could maintain
this unassembled if you only need actions. However, if you want an actual
sparse matrix,
there are a couple of options

  1) Store the unassembled matrix, and run assembly after integration is
complete. This
   needs more memory, but should perform well.

  2) Use atomic operations to update. I have not seen this yet, so I am
unsure how it will perform.

  3) Use some memory scheme (monitor) to update. This will have terrible
performance.

Can you think of other options?

   Matt


  Hmm, still not quite getting this problem. We need concurrency on the
  GPU, but why would we need it on the CPU?

 Only if we were doing real work on the many CPU cores per node.

  On the GPU, triangular solve will be just as crappy as it currently
  is, but will look even worse due to the large number of cores.

 It could be worse because a single GPU thread is likely slower than a
 CPU core.

  It is not the only smoother. For instance, polynomial smoothers would
  be more concurrent.

 Yup.

   I have trouble finding decent preconditioning algorithms suitable for
   the fine granularity of GPUs.  Matt thinks we can get rid of all the
   crappy sparse matrix kernels and precondition everything with FMM.
  
 
  That is definitely my view, or at least my goal. And I would say this:
  if we are just starting out on these things, I think it makes sense to
  do the home runs first. If we just try and reproduce things, people
  might say "That is nice, but I can already do that pretty well."

 Agreed, but it's also important to have something good to offer people
 who aren't ready to throw out everything they know and design a new
 algorithm based on a radically different approach that may or may not be
 any good for their physics.

 Jed




-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener


since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-05 Thread Matthew Knepley
On Sat, Dec 5, 2009 at 6:01 PM, Jed Brown jed at 59a2.org wrote:

 On Sat, 5 Dec 2009 16:50:38 -0600, Matthew Knepley knepley at gmail.com
 wrote:
  You assign a few threads per element to calculate the FEM
  integral. You could maintain this unassembled if you only need
  actions.

 You can also store it with much less memory as just values at quadrature
 points.


Depends on the quadrature, but yes there are sometimes better storage
schemes
(especially if you have other properties like decay).


  However, if you want an actual sparse matrix, there are a couple of
  options
 
 
1) Store the unassembled matrix, and run assembly after integration
  is complete. This needs more memory, but should perform well.

 Fine, but how is this assembly done?  If it's serial then it would be a
 bottleneck, so you still need the concurrent thing below.


Vec assembly can be done on the CPU since it's so small, I think.


    2) Use atomic operations to update. I have not seen this yet, so I am
  unsure how it will perform.

 Atomic operations could be used per-entry but this costs on the order of
 100 cycles on CPUs.  I think newer GPUs have atomics, but I don't know
 the cost.  Presumably it's at least as much as the latency of a read
 from memory.


Not sure. Needs to be explored. Felipe is working on it.


 When inserting in decent sized chunks, it would probably be worth taking
 per-row or larger locks to amortize the cost of the atomics.
 Additionally, you could statically partition the workload and only use
 atomics for rows/entries that were shared.


You can use partitioning/coloring techniques to increase the lockless
concurrency.
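
A sketch of that idea in plain Python (the helper is invented): color
elements greedily so that two elements touching the same matrix row never
share a color; all elements of one color can then be assembled
concurrently without locks, one color at a time.

    def color_elements(element_rows):
        colors, row_colors = [], {}
        for rows in element_rows:          # rows touched by each element
            used = set()
            for r in rows:
                used |= row_colors.get(r, set())
            c = 0
            while c in used:               # smallest color free on all rows
                c += 1
            colors.append(c)
            for r in rows:
                row_colors.setdefault(r, set()).add(c)
        return colors

    # neighboring 1D elements share a row, so they alternate colors:
    print(color_elements([[0, 1], [1, 2], [2, 3]]))   # -> [0, 1, 0]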

  Matt


 Jed

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener


since developing object oriented software is so cumbersome in C and we are all resistant to doing it in C++

2009-12-04 Thread Barry Smith

Suggestion:

1) Discard PETSc
2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches  
tasks onto GPUs and multi-core systems (generally we would have one  
python process per compute node and local parallelism would be done  
via the low-level kernels to the cores and/or GPUs.)
3) Write a new PETSc using MPI4py and 2) written purely in Python 3000  
using all its cool class etc features
4) Use other packages (like f2py) to generate bindings so that 3)  
may be called from Fortran90/2003 and C++ (these probably suck since  
people usually think the other way around; calling Fortran/C++ from  
Python, but maybe we shouldn't care; we and our friends can just be 10  
times more efficient developing apps in Python).

enjoy coding much better than today.

  What is wrong with Python 3000 that would make this approach not be  
great?


Barry

When coding a new numerical algorithm for PETSc we would just code in  
Python, then when tested and happy with it reimplement in Py{CL, CUDA,  
OpenMP-C}

The other choice is designing and implementing our own cool/great OO  
language with the flexibility and power we want, but I fear that is way  
too hard  and why not instead leverage Python.