since developing object-oriented software is so cumbersome in C and we are all resistant to doing it in C++
On Mon, 7 Dec 2009 10:53:10 -0500, Alex Peyser peyser.alex at gmail.com wrote:

  I've had endless trouble integrating petsc into my system because the object-oriented C is both only partially object oriented while simultaneously making access to the underlying functionality obscure.

Okay, I'm curious what you find to be only partially OO, and what functionality you find difficult to use. PETSc's approach to object creation might look a bit unconventional (depending on what languages you're familiar with), but it's very much a modern OO pattern; see http://martinfowler.com/articles/injection.html

In my opinion, the main downside of C is that it requires a significant amount of boilerplate to implement these patterns (though not a lot more than designs with comparable properties in C++ or Java). I'm not in love with Python, but there is a large amount of PETSc that is in no way performance sensitive, and it would be very nice to do this in a more expressive language than C.

Jed
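To make the "injection" point concrete, here is a minimal Python sketch of the delegation-based creation pattern PETSc uses (create an empty front-end object, then inject a concrete implementation at runtime, e.g. from command-line options). All names here (Mat, AIJMat, DenseMat) are illustrative stand-ins, not the real PETSc or petsc4py API.

```python
class Mat:
    """Front-end object; behavior is injected by set_type()."""
    _registry = {}

    @classmethod
    def register(cls, name, impl):
        cls._registry[name] = impl

    def __init__(self):
        self._impl = None  # no behavior until a type is injected

    def set_type(self, name):
        # Dependency injection: the concrete implementation is chosen
        # at runtime, e.g. from a -mat_type command-line option.
        self._impl = self._registry[name]()

    def mult(self, x):
        return self._impl.mult(x)

class AIJMat:
    def mult(self, x):
        return [2.0 * xi for xi in x]   # stand-in for a sparse mat-vec

class DenseMat:
    def mult(self, x):
        return [3.0 * xi for xi in x]   # stand-in for a dense mat-vec

Mat.register("aij", AIJMat)
Mat.register("dense", DenseMat)

A = Mat()
A.set_type("aij")           # could equally come from runtime options
y = A.mult([1.0, 2.0])
```

The C boilerplate Jed mentions is essentially this registry and dispatch machinery written out by hand with function pointers.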
This whole thread has been really interesting. I'm interested in Barry's suggestion, and I know a handful of people not on the list who would also be interested. I've got a couple of questions/responses to the MatSetValues, point-wise physics, and support tools issues that were raised.

Did you guys settle the issue that Jed brought up about the Python Global Interpreter Lock (GIL)? There was some discussion of it at SciPy this summer, but I didn't come away feeling like anybody had an immediate solution or that Guido was in favor of eliminating the GIL anytime soon. Are you guys assuming that the worker threads for the multiple cores/GPUs are going to be set up by C code (hand- or machine-generated)?

It seems to me that the MatSetValues discussion moved on from the threading issue (since you can't call MatSetValues efficiently in python anyway) to the more general issue of how PETSc is going to explicitly support shared memory parallelism. I interpret the last interchange as implying that the solution is another level of domain decomposition that would be hidden from the mat assembly loop level (i.e. users aren't going to have to write their code like the thread part is their parallel subdomain). Is that a correct interpretation? Something along the lines of additional args to MatCreate that indicate how many threads to use per subdomain and some OpenMP-style instructions to parallelize the assembly loop?

On the issue of the point-wise physics (e.g. quadrature point routines, Riemann solvers, ...): it seems like this is a long-standing bottleneck for object-oriented approaches to PDEs, going back to the OO FEM research in the 80s and 90s. The SciPy group that met to talk about PDEs spent some time on it, partly related to the femhub project. I agree that a code generation approach seems to be the only way forward.
Ondrej Certik from the femhub project has written a symbolic math module for python, and we talked some about using this to represent the physics in a way that can be used for code generation. I'm not sure what progress he's made. At the end of the day you've got to allow people to write code for the physics without knowledge of the assembly processes while still injecting that code into the inner loop; otherwise you'll never approach the speed of monolithic fortran codes.

One issue that you might also want to consider with multi-language approaches is the configuration and build tools. I agree that the debugging support is a major issue, but what has taken far more of my time is the lack of cross-platform configuration and build tools for multi-language software. We have looked at CMake and SCons but have had to stick with a pretty ugly mixture of python scripts and make. I feel like other scientific computing packages like Sage and EPD are successfully growing because they distribute binaries that include all dependencies, but I don't see that being feasible for HPC software anytime soon. It seems like to really be successful you may have to rely on a full-featured and fully documented tool like SCons aims to be; otherwise you'll end up with an approach that works for you (because you wrote it) but not very well for others trying to use the new PETSc.

We have a small group at the US Army ERDC that has been developing a python FEM framework for multiscale/multiphysics modeling that tries to break the coupling between the physics and the FEM methods (i.e. we try to support CG, DG, and non-conforming elements with variable order and element topologies running from the same description of the physics). The code uses our own wrappers to petsc and mpi, but only because we started before mpi4py and petsc4py existed.
We have a memory-intensive physics API based on computing and storing quadrature point values of PDE coefficients and numerical fluxes, and residual and jacobian assembly processes along the lines of Matt's first approach of storing element residuals and jacobians. We also have hand-written optimized codes that collapse all the loops and eliminate storage for direct residual and jacobian assembly. We did that last step out of necessity and as a precursor to working on a code generation approach (it's what we want the code generator to write). It also has the side effect of providing a low-level API for people who want to write traditional monolithic fortran codes for residual and jacobian assembly.

Our approach for the user, or more correctly the physics developer, is 1) prototype the physics in python, 2) vectorize the physics by rewriting it in C/C++/Fortran using swig, f2py, or hand-written wrappers, and 3) take the inner loop of 2) and build monolithic residual and jacobian routines. This is obviously not satisfactory and not maintainable in the long run, but we are solving real nonlinear multi-physics problems in 3D with it.
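Steps 1) and 2) of the workflow above can be sketched in a few lines. The physics here (a made-up nonlinear coefficient) is purely illustrative; the point is that the per-quadrature-point prototype and the NumPy-vectorized version express identical physics, and the vectorized form is what a code generator would eventually emit in C or Fortran.

```python
import numpy as np

def coefficient_prototype(u, grad_u):
    # step 1: per-quadrature-point Python prototype
    # (illustrative nonlinearity: u^2 + |grad u|^2 / 2)
    return u * u + 0.5 * sum(g * g for g in grad_u)

def coefficient_vectorized(u, grad_u):
    # step 2: the same physics over arrays of shape (npoints,) and
    # (npoints, dim), evaluated for all quadrature points at once
    return u * u + 0.5 * np.sum(grad_u * grad_u, axis=1)

u = np.array([1.0, 2.0])
grad_u = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
pointwise = [coefficient_prototype(ui, gi) for ui, gi in zip(u, grad_u)]
vectorized = coefficient_vectorized(u, grad_u)
```

The regression check (prototype agrees with the vectorized form) is exactly the kind of test that makes step 3), the hand-optimized monolithic routine, verifiable.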
On Mon, Dec 7, 2009 at 3:05 PM, Chris Kees christopher.e.kees at usace.army.mil wrote:

  This whole thread has been really interesting. I'm interested in Barry's suggestion, and I know a handful of people not on the list who would also be interested. I've got a couple of questions/responses to the MatSetValues, point-wise physics, and support tools issues that were raised. Did you guys settle the issue that Jed brought up about the Python Global Interpreter Lock (GIL)? There was some discussion of it at SciPy this summer, but I didn't come away feeling like anybody had an immediate solution or that Guido was in favor of eliminating the GIL anytime soon. Are you guys assuming that the worker threads for the multiple cores/GPUs are going to be set up by C code (hand- or machine-generated)?

I think this is going to be solved by 1) rearranging the code to segregate reductions or conflicts, 2) using atomics, and then 3) intelligent reordering to minimize the conflicts. I have never seen this be a big problem, but some code is always going to have the bad tradeoff I guess.

  It seems to me that the MatSetValues discussion moved on from the threading issue (since you can't call MatSetValues efficiently in python anyway) to the more general issue of how PETSc is going to explicitly support shared memory parallelism. I interpret the last interchange as implying that the solution is another level of domain decomposition that would be hidden from the mat assembly loop level (i.e. users aren't going to have to write their code like the thread part is their parallel subdomain). Is that a correct interpretation? Something along the lines of additional args to MatCreate that indicate how many threads to use per subdomain and some OpenMP-style instructions to parallelize the assembly loop?

I am not advocating that. I want a CUDA-like solution, but of course you can probably convert any one to the other with a tremendous amount of pain.
I wish for all matrix-free applications and vector assemblies, because then I think the problem is very simple. I have not seen any compelling argument against this approach, and all the best codes I know use it.

  On the issue of the point-wise physics (e.g. quadrature point routines, Riemann solvers, ...): it seems like this is a long-standing bottleneck for object-oriented approaches to PDEs, going back to the OO FEM research in the 80s and 90s. The SciPy group that met to talk about PDEs spent some time on it, partly related to the femhub project. I agree that a code generation approach seems to be the only way forward. Ondrej Certik from the femhub project has written a symbolic math module for python, and we talked some about using this to represent the physics in a way that can be used for code generation. I'm not sure what progress he's made. At the end of the day you've got to allow people to write code for the physics without knowledge of the assembly processes while still injecting that code into the inner loop; otherwise you'll never approach the speed of monolithic fortran codes.

This solution has been obvious for some time, but needs a form even a Fortran programmer could love. Or at least a bad C programmer.

  One issue that you might also want to consider with multi-language approaches is the configuration and build tools. I agree that the debugging support is a major issue, but what has taken far more of my time is the lack of cross-platform configuration and build tools for multi-language software. We have looked at CMake and SCons but have had to stick with a pretty ugly mixture of python scripts and make. I feel like other scientific computing packages like Sage and EPD are successfully growing because they distribute binaries that include all dependencies, but I don't see that being feasible for HPC software anytime soon.
  It seems like to really be successful you may have to rely on a full-featured and fully documented tool like SCons aims to be; otherwise you'll end up with an approach that works for you (because you wrote it) but not very well for others trying to use the new PETSc.

I think we do a better job than anyone with PETSc right now, but we will have to work a little harder to get everything right.

  We have a small group at the US Army ERDC that has been developing a python FEM framework for multiscale/multiphysics modeling that tries to break the coupling between the physics and the FEM methods (i.e. we try to support CG, DG, and non-conforming elements with variable order and element topologies running from the same description of the physics). The code uses our own wrappers to petsc and mpi, but only because we started before mpi4py and petsc4py existed. We have a memory-intensive physics API based on computing and storing quadrature point values of PDE coefficients and numerical fluxes, and residual and jacobian assembly processes along the lines of
On Dec 7, 2009, at 9:53 AM, Alex Peyser wrote:

  Have y'all considered a different approach? I've had endless trouble integrating petsc into my system because the object-oriented C is both only partially object oriented while simultaneously making access to the underlying functionality obscure.

Could you please give us cases where access to the underlying functionality is obscure? Everything is supposed to be accessible through an object-oriented approach. (Sorry about being defensive.) We'd like to fix this.

  Some objects have some methods, and others are missing;

This is intentional. To require each object to fully support all its interface methods would require having many different classes (for example, matrix-free methods cannot support the factorization and setting/getting values interfaces, so they would have to be separated into subclasses). I (and this is me, perhaps most others don't have this difficulty) find that the complexity of using a class library grows RAPIDLY with the number of classes and the complexity of the class hierarchies. Thus we kept a very small number of classes in PETSc; we could only do this by cheating and not requiring all objects to support all methods.

  some have the same functionality through higher level interfaces or lower level interfaces.

I've tried very hard to prevent having the same functionality at two levels; the one place where we do have it is in the matrix factorizations and triangular solves, where you can do it directly on matrix objects or through KSP and PC, but we do note in the matrix documentation that the preferred approach is with KSP/PC, and we list the direct methods as developer level usage. We could have hidden the direct matrix solver methods even more by not providing public interfaces for them at all; perhaps we should have done that. Could you please list other things that have the same functionality at a higher level interface and a lower level, so we can perhaps improve that experience?
It is our intention (except with the factorization and solves) to NOT have the same functionality accessible from two levels.

  and so on. Maybe you need to split this into two projects -- one a set of normal C interfaces to the kernels, communications, etc -- the underlying functionality

We ended up with the hybrid model because we could not see or even conceive of how to write a normal C interface for sparse matrices and preconditioners. The closest numerical package to PETSc that I see that uses a normal C interface and is widely used is LAPACK, and it only supports dense and banded matrices sequentially; its interface is already overwhelming, difficult, and not extensible. Bill and I selected an object-oriented approach in 1995 not because we thought it was cool or modern but because we viewed making a usable general purpose package for sparse solvers with a normal C interface as impossible. We could have been wrong, and could still be wrong about this, but I've yet to see any usable general purpose sparse solver package that works with a normal C interface. I think it is impossible. ScaLAPACK tries to use a normal C interface (but it still has some things resembling objects) and is a totally unusable piece of garbage, and again is only for dense matrices. I appreciate your comments; again, sorry for coming across as defensive.

Barry

  -- and a separate object oriented layer on top of it written in whatever language you prefer (I find python to be a very, very evil language, but y'all seem to have a taste for it). This would both allow the functionality to be adapted to multiple overlaying interfaces, and keep the overhead to initial adoption low. It seems that the current approach is a shoehorn of both very low level and high level behavior into the same programming paradigm -- which always leads to a coding mess.
  It seems particularly ugly to bind low level languages such as C++ and Fortran onto a higher level, interpreted language like Python -- binding C onto python, which then binds onto C! And then binding any other language would be a sequence of binding X onto C++ to bind onto python to bind onto C!

  Alex Peyser

  On Friday 04 December 2009 11:42:35 pm Barry Smith wrote: Suggestion: 1) Discard PETSc 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches tasks onto GPUs and multi-core systems (generally we would have one python process per compute node and local parallelism would be done via the low-level kernels to the cores and/or GPUs.) 3) Write a new PETSc using MPI4py and 2) written purely in Python 3000 using all its cool class etc features 4) Use other packages (like f2py) to generate bindings so that 3) may be called from Fortran90/2003 and C++ (these probably suck since people usually think the other way around; calling Fortran/C++ from Python, but maybe we shouldn't care; we and our
On Sat, 5 Dec 2009 16:50:38 -0600, Matthew Knepley knepley at gmail.com wrote:

  You assign a few threads per element to calculate the FEM integral. You could maintain this unassembled if you only need actions. You can also store it with much less memory as just values at quadrature points. However, if you want an actual sparse matrix, there are a couple of options: 1) Store the unassembled matrix, and run assembly after integration is complete. This needs more memory, but should perform well.

Fine, but how is this assembly done? If it's serial then it would be a bottleneck, so you still need the concurrent thing below.

  2) Use atomic operations to update. I have not seen this yet, so I am unsure how it will perform.

Atomic operations could be used per-entry, but this costs on the order of 100 cycles on CPUs. I think newer GPUs have atomics, but I don't know the cost. Presumably it's at least as much as the latency of a read from memory. When inserting in decent sized chunks, it would probably be worth taking per-row or larger locks to amortize the cost of the atomics. Additionally, you could statically partition the workload and only use atomics for rows/entries that were shared.

Jed
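The per-row-locking idea above can be sketched with toy data structures: instead of one atomic operation per matrix entry, each thread takes a lock per row and inserts a whole chunk of values under it, amortizing the synchronization cost. This is only an illustration of the locking granularity, not PETSc's actual assembly code.

```python
import threading
from collections import defaultdict

nrows = 4
row_locks = [threading.Lock() for _ in range(nrows)]
# toy sparse storage: one {col: value} dict per row
matrix = [defaultdict(float) for _ in range(nrows)]

def add_values(row, cols_vals):
    # one lock acquisition amortized over a whole chunk of insertions
    with row_locks[row]:
        for col, val in cols_vals:
            matrix[row][col] += val

def worker(elements):
    # stand-in for per-element integration producing row contributions
    for row, cols_vals in elements:
        add_values(row, cols_vals)

# two "element integration" threads contributing to a shared row 0
e1 = [(0, [(0, 1.0), (1, 2.0)]), (1, [(1, 3.0)])]
e2 = [(0, [(0, 4.0)]), (2, [(2, 5.0)])]
t1 = threading.Thread(target=worker, args=(e1,))
t2 = threading.Thread(target=worker, args=(e2,))
t1.start(); t2.start(); t1.join(); t2.join()
```

Jed's static-partitioning refinement would drop the lock entirely for rows owned by a single thread, keeping it (or an atomic) only for the shared ones.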
On Fri, Dec 4, 2009 at 10:42 PM, Barry Smith bsmith at mcs.anl.gov wrote:

  Suggestion: 1) Discard PETSc 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches tasks onto GPUs and multi-core systems (generally we would have one python process per compute node and local parallelism would be done via the low-level kernels to the cores and/or GPUs.) 3) Write a new PETSc using MPI4py and 2) written purely in Python 3000 using all its cool class etc features 4) Use other packages (like f2py) to generate bindings so that 3) may be called from Fortran90/2003 and C++ (these probably suck since people usually think the other way around; calling Fortran/C++ from Python, but maybe we shouldn't care; we and our friends can just be 10 times more efficient developing apps in Python). Enjoy coding much better than today. What is wrong with Python 3000 that would make this approach not be great?

I am a very big fan of this approach. Let me restate it: a) Write the initial code in Python for correctness checking, however develop a performance model which will allow transition to an accelerator b) Move key pieces to a faster platform using i) Cython ii) PyCUDA c) Coordinate a loose collection of processes with MPI for large problems.

A few comments. Notice that for many people c) is unnecessary if you can coordinate several GPUs from one CPU. The key piece here is a dispatch system. Felipe, Rio, and I are getting this done now. Second, we can leverage all of petsc4py in step b). In my past attempts at this development model, they have always floundered on inner loops or iterations. These cannot be done in Python (too slow) and cannot be wrapped (too much overhead). However, now we have a way to do this, namely RunTime Code Generation (like PyCUDA). I think this will get us over the hump, but we have to rethink how we code things, especially traversals, which now become lists of scheduled tasks as in FLASH (from van de Geijn).
Matt

  Barry

  When coding a new numerical algorithm for PETSc we would just code in Python, then when tested and happy with it reimplement in Py{CL, CUDA, OpenMP-C}. The other choice is designing and implementing our own cool/great OO language with the flexibility and power we want, but I fear that is way too hard, and why not instead leverage Python.

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
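The "RunTime Code Generation" idea Matt mentions can be sketched in pure Python: build a specialized kernel as a source string and compile it at runtime. PyCUDA does the analogous thing with CUDA C source; everything below (the names and the axpy example) is illustrative, not any library's real API.

```python
def generate_axpy(n):
    # Specialize (fully unroll) the loop for a fixed, known n.
    # A real generator could instead emit CUDA/OpenCL source here.
    src = "def axpy(alpha, x, y):\n"
    src += "    return [" + ", ".join(
        f"alpha * x[{i}] + y[{i}]" for i in range(n)) + "]\n"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["axpy"]

# generate a kernel specialized for vectors of length 3
axpy3 = generate_axpy(3)
result = axpy3(2.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The appeal is that problem parameters known only at run time (sizes, discretization choices, the user's pointwise physics) can be baked into the generated source before compilation, which is exactly what is hard to do with ahead-of-time wrappers.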
This is an interesting proposal. Two thoughts: Residual and Jacobian evaluation cannot be written in Python (though it can be prototyped there). After a discretization is chosen, the physics is usually representable as a tiny kernel (Riemann solvers/pointwise operation at quadrature points), but if the user is still in charge of the discretization, it will need services from this new PETSc. So we need to be able to provide good debugging support across languages and concurrency models, i.e. Python/MPI-C/OpenMP-Python-CUDA. Python is not statically typed. This is both a strength and a weakness, but dynamic typing makes debugger support more important because the compiler can't catch type errors, instead we'll find them when an exception is thrown in the multi-language hybrid environment. Jed
On Sat, Dec 5, 2009 at 1:06 PM, Jed Brown jed at 59a2.org wrote:

  This is an interesting proposal. Two thoughts: Residual and Jacobian evaluation cannot be written in Python (though it can be prototyped there). After a discretization is chosen, the physics is usually representable as a tiny kernel (Riemann solvers/pointwise operation at quadrature points), but if the user is still in charge of the discretization, it will need services from this new PETSc. So we need to be able to provide good debugging support across languages and concurrency models, i.e. Python/MPI-C/OpenMP-Python-CUDA. Python is not statically typed. This is both a strength and a weakness, but dynamic typing makes debugger support more important because the compiler can't catch type errors; instead we'll find them when an exception is thrown in the multi-language hybrid environment.

This is a great point. Debugging is the Achilles' heel of multi-language programming. That is why I propose a dual program. All code should be in a single language, which allows efficient debugging. You are correct to point out that Python debugging is all at runtime. Then kernels are moved to an accelerator. The nice part here is that debugging becomes regression, since we have a working model to test against. The worst scenario is one in which a bug only appears in the optimized kernels at a scale unattainable in the pure Python.

Matt

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
On Sat, 5 Dec 2009 13:09:33 -0600, Matthew Knepley knepley at gmail.com wrote:

  Then kernels are moved to an accelerator.

These kernels necessarily involve user code (physics). It's a lot to ask users to maintain two versions of their physics, one which is debuggable and another which is fast and runs in a very different context (fine-granularity parallel, either with threads or in a GPU kernel).

Jed
As someone who has a finite-element code built upon PETSc/Sieve with the top-level code in Python, I am in favor of Barry's approach. As Matt mentions, debugging multi-language code is more complex. Unit testing helps solve some of this, because tests associated with the low-level code involve only one language and find most of the bugs. We started with manual C++/Python interfaces, then moved to Pyrex, and now use SWIG. Because we use C++, the OO support in SWIG results in a much simpler, cleaner interface between Python and C++ than what is possible with Pyrex or Cython. SWIG has eliminated 95% of the effort to interface Python and C++ compared to Pyrex.

Brad

Matthew Knepley wrote:

  On Fri, Dec 4, 2009 at 10:42 PM, Barry Smith bsmith at mcs.anl.gov wrote: Suggestion: 1) Discard PETSc 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches tasks onto GPUs and multi-core systems (generally we would have one python process per compute node and local parallelism would be done via the low-level kernels to the cores and/or GPUs.) 3) Write a new PETSc using MPI4py and 2) written purely in Python 3000 using all its cool class etc features 4) Use other packages (like f2py) to generate bindings so that 3) may be called from Fortran90/2003 and C++ (these probably suck since people usually think the other way around; calling Fortran/C++ from Python, but maybe we shouldn't care; we and our friends can just be 10 times more efficient developing apps in Python). Enjoy coding much better than today. What is wrong with Python 3000 that would make this approach not be great? I am a very big fan of this approach. Let me restate it: a) Write the initial code in Python for correctness checking, however develop a performance model which will allow transition to an accelerator b) Move key pieces to a faster platform using i) Cython ii) PyCUDA c) Coordinate a loose collection of processes with MPI for large problems. A few comments.
  Notice that for many people c) is unnecessary if you can coordinate several GPUs from one CPU. The key piece here is a dispatch system. Felipe, Rio, and I are getting this done now. Second, we can leverage all of petsc4py in step b). In my past attempts at this development model, they have always floundered on inner loops or iterations. These cannot be done in Python (too slow) and cannot be wrapped (too much overhead). However, now we have a way to do this, namely RunTime Code Generation (like PyCUDA). I think this will get us over the hump, but we have to rethink how we code things, especially traversals, which now become lists of scheduled tasks as in FLASH (from van de Geijn). Matt

  Barry

  When coding a new numerical algorithm for PETSc we would just code in Python, then when tested and happy with it reimplement in Py{CL, CUDA, OpenMP-C}. The other choice is designing and implementing our own cool/great OO language with the flexibility and power we want, but I fear that is way too hard, and why not instead leverage Python.

  -- 
  What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
Somehow this drifted off the list; hopefully the deep citations provide sufficient context.

On Sat, 5 Dec 2009 13:42:31 -0600, Matthew Knepley knepley at gmail.com wrote:

  On Sat, Dec 5, 2009 at 1:32 PM, Jed Brown jed at 59a2.org wrote:

    On Sat, 5 Dec 2009 13:20:20 -0600, Matthew Knepley knepley at gmail.com wrote:

      I think this involves a change in mindset. Currently, people merge the tasks of writing correct and optimized code. The movement of a kernel from Python to C is just an optimization step, and in some cases (witness Psyco) can be automated. I suggest that the Python is written in exactly the same style as the CUDA, so it's more a matter of translation than wholesale rewriting.

    I agree; the hard optimization is the high-level data flow stuff that can be done in any language, and the kernel-level stuff is easy in that at least it's local. But in the real world, people have some large model and need to add a new constitutive relation, or another tracer, and it looks like a waste of time to write it twice, so they stuff it into the optimized kernel, make a typo, and have to figure out why it's behaving badly. Or they do it correctly, but now the next person doesn't want to port all the legacy stuff back to the Python reference before adding a new wave.

  Yes, that comes down to coding discipline. A big reason that PETSc has controlled complexity is that coding discipline has been enforced, and periodically the code is cleaned up and fixed where we lapsed. I see this as a continuum of discipline rather than a new requirement.

It's easy to enforce this *within* PETSc, and I don't have a problem with maintaining reference implementations of those kernels. But we can't require users to follow the same guidelines. Many/most users are not going to know in advance how to write Python implementations that are easily translated to CL/CUDA/whatever. I just think it's a lot to ask, not that it's wrong to ask.
  It's good to be very explicit about the benefits and the costs of changing the development model. So far, I think the benefits of this approach appear to outweigh the costs. However, the acid test is making it easier for a certain researcher to write a certain code, so we need a guinea pig.

It might be jumping the gun a little to go looking for a guinea pig without a prototype (or even a concrete design for one). Reimplementation is a nontrivial task, even if most of it would be pretty mechanical. I think PETSc's object model is really good, but it requires significant boilerplate to implement in C. In what ways would a new PETSc, implemented in native Python, be different for the user than using petsc4py today? Would it offer a fundamentally different interface?

Jed

  And debugging optimized code is sometimes necessary, e.g. the invalid frees due to improper Malloc2 that I fixed the other day only showed up in optimized builds.

Undeniable. We just need to work to minimize this. It should never become routine.

Matt

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
Cython can accelerate almost any Python code nearly immediately (although it supports a somewhat restricted subset of Python). This is simply due to converting it to equivalent C code that is compiled and runs within CPython. Then, chunks of the Python code can be explicitly typed, which can result in a further substantial speedup. In my experience, however, the code becomes much less readable (Cython ain't no Python). Also, the compilation process becomes somewhat cumbersome: Cython is followed by cc, which slows the development process. In addition, debugging can be tricky. It is helped a little bit by the fact that the Cython-generated C includes the original Cython code as comments. Using PyCUDA after Cython as a further accelerator is definitely a promising route: that's what I'm currently doing with two codes I had to develop from scratch (and I think Jay Bardhan will take that route on a brand-new code too).

Dmitry.

On Sat, Dec 5, 2009 at 1:06 PM, Jed Brown jed at 59a2.org wrote:

  This is an interesting proposal. Two thoughts: Residual and Jacobian evaluation cannot be written in Python (though it can be prototyped there). After a discretization is chosen, the physics is usually representable as a tiny kernel (Riemann solvers/pointwise operation at quadrature points), but if the user is still in charge of the discretization, it will need services from this new PETSc. So we need to be able to provide good debugging support across languages and concurrency models, i.e. Python/MPI-C/OpenMP-Python-CUDA. Python is not statically typed. This is both a strength and a weakness, but dynamic typing makes debugger support more important because the compiler can't catch type errors; instead we'll find them when an exception is thrown in the multi-language hybrid environment.

  Jed
On Sat, Dec 5, 2009 at 2:29 PM, Ethan Coon etc2103 at columbia.edu wrote:

  I'm a big fan of Barry's approach as well. However, the current state of debugging tools is not up to snuff for this type of model. In using petsc4py regularly, debugging cython and python (user-defined) functions after they've been passed into PETSc just plain sucks -- it basically means reverting to print statements. And while unit testing can help for a lot of it, expecting users to write unit tests for even the most basic of their physics functions is a little unreasonable. Even basic exception propagation would be a huge step forward.

This is a very interesting issue. Suppose you write the RHSFunction in Python and pass it to SNES. Are you saying that pdb cannot stop in that method when you step over SNESSolve() in Python? That would suck. If, on the other hand, you passed it in C, I can see how you are relegated to obscure gdb. This happens to me in PyLith.

Matt

  I'm all in favor of a hybrid model, but the debugging support/issues would need to be addressed.

  Ethan

  On Sat, 2009-12-05 at 12:03 -0800, Brad Aagaard wrote: As someone who has a finite-element code built upon PETSc/Sieve with the top-level code in Python, I am in favor of Barry's approach. As Matt mentions, debugging multi-language code is more complex. Unit testing helps solve some of this, because tests associated with the low-level code involve only one language and find most of the bugs. We started with manual C++/Python interfaces, then moved to Pyrex, and now use SWIG. Because we use C++, the OO support in SWIG results in a much simpler, cleaner interface between Python and C++ than what is possible with Pyrex or Cython. SWIG has eliminated 95% of the effort to interface Python and C++ compared to Pyrex.
Brad Matthew Knepley wrote: On Fri, Dec 4, 2009 at 10:42 PM, Barry Smith bsmith at mcs.anl.gov mailto:bsmith at mcs.anl.gov wrote: Suggestion: 1) Discard PETSc 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches tasks onto GPUs and multi-core systems (generally we would have one python process per compute node and local parallelism would be done via the low-level kernels to the cores and/or GPUs.) 3) Write a new PETSc using MPI4py and 2) written purely in Python 3000 using all its cool class etc features 4) Use other packages (like f2py) to generate bindings so that 3) may be called from Fortran90/2003 and C++ (these probably suck since people usually think the other way around; calling Fortran/C++ from Python, but maybe we shouldn't care; we and our friends can just be 10 times more efficient developing apps in Python). enjoy coding much better than today. What is wrong with Python 3000 that would make this approach not be great? I am a very big fan of this approach. Let me restate it: a) Write the initial code in Python for correctness checking, however develop a performance model which will allow transition to an accelerator b) Move key pieces to a faster platform using i) Cython ii) PyCUDA c) Coordinate loose collection of processes with MPI for large problems A few comments. Notice that for many people c) is unnecessary if you can coordinate several GPUs from one CPU. The key piece here is a dispatch system. Felipe, Rio, and I are getting this done now. Second, we can leverage all of petsc4py in step b. In my past attempts at this development model, they have always floundered on inner loops or iterations. These cannot be done in Python (too slow) and cannot be wrapped (too much overhead). However, now we have a way to do this, namely RunTime Code Generation (like PyCUDA). 
I think this will get us over the hump, but we have to rethink how we code things, especially traversals, which now become lists of scheduled tasks as in FLASH (from van de Geijn). Matt Barry When coding a new numerical algorithm for PETSc we would just code in Python, then when tested and happy with it reimplement in Py{CL, CUDA, OpenMP-C} The other choice is designing and implementing our own cool/great OO language with the flexibility and power we want, but I fear that is way too hard and why not instead leverage Python. -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener -- --- Ethan Coon DOE CSGF - Graduate Student Dept. Applied Physics Applied Mathematics Columbia University 212-854-0415 http://www.ldeo.columbia.edu/~ecoon/
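Matt's "RunTime Code Generation" point can be illustrated without PyCUDA: the essence is building source code specialized to runtime parameters and compiling it on the spot. A toy sketch in pure Python, with exec standing in for nvcc (function names are invented for illustration):

```python
def make_unrolled_dot(n):
    """Generate and 'compile' a dot product fully unrolled for length n.

    PyCUDA-style runtime code generation in miniature: the kernel
    source is a string specialized to the problem size, then compiled.
    Here compilation is Python's exec; with PyCUDA it would be nvcc
    invoked on generated CUDA C.
    """
    body = " + ".join(f"x[{i}] * y[{i}]" for i in range(n))
    src = f"def dot(x, y):\n    return {body}\n"
    namespace = {}
    exec(src, namespace)          # runtime "compile" of the generated source
    return namespace["dot"]

# specialize once, reuse many times -- the generation cost amortizes
dot3 = make_unrolled_dot(3)
```

The design point is that the inner loop is never interpreted: Python only orchestrates generation and dispatch, which is why this model escapes the usual "inner loops are too slow in Python" trap.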
This is a very interesting issue. Suppose you write the RHSFunction in Python and pass it to SNES. Are you saying that pdb cannot stop in that method when you step over SNESSolve() in Python? That would suck. If, on the other hand, you passed it in C, I can see how you are relegated to obscure gdb. This happens to me in PyLith. No, I believe pure Python works fine. Cython does weird things, however. Haven't tried pure C. If you have a way of getting gdb to step between the python parts, the petsc parts, and the passed-in functions in C/Cython, I'd love to hear it (offlist would probably be better). Ethan
On Fri, 4 Dec 2009 22:42:35 -0600, Barry Smith bsmith at mcs.anl.gov wrote: generally we would have one python process per compute node and local parallelism would be done via the low-level kernels to the cores and/or GPUs. I think one MPI process per node is fine for MPI performance on good hardware because the HCA reads and writes from registered memory without involving the CPU, but I'm not sure it's actually a better model. How do you envision implementing MatSetValues()? If there is only one MPI process per node, would there be another full level of domain decomposition based on threads? Otherwise you need a concurrent MatSetValues, which would make proper preallocation essential and make cache coherence a very sensitive matter. And the huge algorithmic issue: triangular kernels are the backbone of almost every preconditioner and are inherently sequential. If only one process per node does MPI, then all these algorithms would need three-level implementations (decompose the per-node subdomains into per-core subdomains and use a different concurrency scheme at this smaller granularity). The use of threads on the many cores per node potentially offers more performance through the use of lock-free shared data structures with NUMA-aware work distribution. But separate memory space is much more deterministic, thus easier to work with. I have trouble finding decent preconditioning algorithms suitable for the fine granularity of GPUs. Matt thinks we can get rid of all the crappy sparse matrix kernels and precondition everything with FMM. Note that all Python implementations have a global interpreter lock which could also make a single Python process the bottleneck. Jed
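The "inherently sequential" nature of the triangular kernels Jed mentions is visible in the loop-carried dependence of forward substitution: each unknown needs every earlier one. A minimal sketch:

```python
def forward_substitution(L, b):
    """Solve L x = b for lower-triangular L (dense, for clarity).

    The dependence x[i] <- x[0..i-1] is why triangular solves
    parallelize so poorly: iteration i cannot start until every
    previous iteration has finished, regardless of how many
    cores or GPU threads are available.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))   # needs all earlier x[j]
        x[i] = (b[i] - s) / L[i][i]
    return x
```

Sparse variants expose some parallelism via level scheduling of the dependency graph, but the critical path remains, which is the granularity problem for GPUs raised here.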
On Sat, Dec 5, 2009 at 3:50 PM, Jed Brown jed at 59a2.org wrote: On Fri, 4 Dec 2009 22:42:35 -0600, Barry Smith bsmith at mcs.anl.gov wrote: generally we would have one python process per compute node and local parallelism would be done via the low-level kernels to the cores and/or GPUs. I think one MPI process per node is fine for MPI performance on good hardware because the HCA reads and writes from registered memory without involving the CPU, but I'm not sure it's actually a better model. How do you envision implementing MatSetValues()? If there is only one MPI process per node, would there be another full level of domain decomposition based on threads? Otherwise you need a concurrent MatSetValues, which would make proper preallocation essential and make cache coherence a very sensitive matter. I need to understand better. You are asking about the case where we have many GPUs and one CPU? If it's always one or two GPUs per CPU I do not see the problem. And the huge algorithmic issue: triangular kernels are the backbone of almost every preconditioner and are inherently sequential. If only one process per node does MPI, then all these algorithms would need three-level implementations (decompose the per-node subdomains into per-core subdomains and use a different concurrency scheme at this smaller granularity). The use of threads on the many cores per node potentially offers more performance through the use of lock-free shared data structures with NUMA-aware work distribution. But separate memory space is much more deterministic, thus easier to work with. Hmm, still not quite getting this problem. We need concurrency on the GPU, but why would we need it on the CPU? On the GPU, triangular solve will be just as crappy as it currently is, but will look even worse due to the large number of cores. It just has very little concurrency. We need a better option. It is not the only smoother. For instance, polynomial smoothers would be more concurrent. 
I have trouble finding decent preconditioning algorithms suitable for the fine granularity of GPUs. Matt thinks we can get rid of all the crappy sparse matrix kernels and precondition everything with FMM. That is definitely my view, or at least my goal. And I would say this: if we are just starting out on these things, I think it makes sense to do the home runs first. If we just try to reproduce things, people might say "That is nice, but I can already do that pretty well." Matt Note that all Python implementations have a global interpreter lock which could also make a single Python process the bottleneck. Jed
On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley knepley at gmail.com wrote: I need to understand better. You are asking about the case where we have many GPUs and one CPU? If it's always one or two GPUs per CPU I do not see the problem. Barry initially proposed one Python thread per node, then distributing the kernels over many CPU cores on that node, or to one-or-more GPUs. With some abuse of terminology, let's call them all worker threads, perhaps dozens if running on multicore CPUs, or hundreds/thousands when on a GPU. The physics, such as FEM integration, has to be done by those worker threads. But unless every thread is its own subdomain (i.e. Block Jacobi/ASM with very small subdomains), we still need to assemble a small number of matrices per node. So we would need a lock-free concurrent MatSetValues, otherwise we'll only scale to a few worker threads before everything is blocked on MatSetValues. Hmm, still not quite getting this problem. We need concurrency on the GPU, but why would we need it on the CPU? Only if we were doing real work on the many CPU cores per node. On the GPU, triangular solve will be just as crappy as it currently is, but will look even worse due to the large number of cores. It could be worse because a single GPU thread is likely slower than a CPU core. It is not the only smoother. For instance, polynomial smoothers would be more concurrent. Yup. I have trouble finding decent preconditioning algorithms suitable for the fine granularity of GPUs. Matt thinks we can get rid of all the crappy sparse matrix kernels and precondition everything with FMM. That is definitely my view, or at least my goal. And I would say this: if we are just starting out on these things, I think it makes sense to do the home runs first. If we just try to reproduce things, people might say "That is nice, but I can already do that pretty well." 
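The concurrent MatSetValues Jed asks for could, at its simplest, take per-row locks. A toy shared-memory sketch in Python (class and method names invented; note that in CPython the GIL already serializes the bytecode, so this only illustrates the locking pattern, not a real speedup):

```python
import threading
from collections import defaultdict

class ToyConcurrentMat:
    """Sparse matrix stored as dict-of-rows with one lock per row:
    threads assembling disjoint rows never contend, and threads
    hitting a shared row are serialized only for that row."""
    def __init__(self, nrows):
        self.rows = [defaultdict(float) for _ in range(nrows)]
        self.locks = [threading.Lock() for _ in range(nrows)]

    def add_values(self, entries):
        """entries: iterable of (row, col, value); values are summed,
        matching ADD_VALUES-style insertion."""
        for r, c, v in entries:
            with self.locks[r]:
                self.rows[r][c] += v

# two workers deliberately hammer the same entry
mat = ToyConcurrentMat(4)
workers = [threading.Thread(target=mat.add_values,
                            args=([(1, 1, 1.0)] * 1000,)) for _ in range(2)]
for w in workers: w.start()
for w in workers: w.join()
```

Per-row locking is the cheap middle ground between the two extremes discussed in this thread: global locks (no concurrency) and per-entry atomics (roughly 100-cycle cost per update on CPUs).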
Agreed, but it's also important to have something good to offer people who aren't ready to throw out everything they know and design a new algorithm based on a radically different approach that may or may not be any good for their physics. Jed
On Sat, Dec 5, 2009 at 4:25 PM, Jed Brown jed at 59a2.org wrote: On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley knepley at gmail.com wrote: I need to understand better. You are asking about the case where we have many GPUs and one CPU? If it's always one or two GPUs per CPU I do not see the problem. Barry initially proposed one Python thread per node, then distributing the kernels over many CPU cores on that node, or to one-or-more GPUs. With some abuse of terminology, let's call them all worker threads, perhaps dozens if running on multicore CPUs, or hundreds/thousands when on a GPU. The physics, such as FEM integration, has to be done by those worker threads. But unless every thread is its own subdomain (i.e. Block Jacobi/ASM with very small subdomains), we still need to assemble a small number of matrices per node. So we would need a lock-free concurrent MatSetValues, otherwise we'll only scale to a few worker threads before everything is blocked on MatSetValues. I imagined that this kind of assembly will be handled similarly to what we do in the FMM. You assign a few threads per element to calculate the FEM integral. You could maintain this unassembled if you only need actions. However, if you want an actual sparse matrix, there are a couple of options: 1) Store the unassembled matrix, and run assembly after integration is complete. This needs more memory, but should perform well. 2) Use atomic operations to update. I have not seen this yet, so I am unsure how it will perform. 3) Use some memory scheme (monitor) to update. This will have terrible performance. Can you think of other options? Matt Hmm, still not quite getting this problem. We need concurrency on the GPU, but why would we need it on the CPU? Only if we were doing real work on the many CPU cores per node. On the GPU, triangular solve will be just as crappy as it currently is, but will look even worse due to the large number of cores. 
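Matt's option 1 (store element matrices unassembled, assemble after integration) is easy to sketch. A 1D toy with two linear elements sharing a node; the function name and dict-of-keys storage are invented for illustration:

```python
def assemble(element_matrices, element_nodes):
    """Serial post-integration assembly: element matrices (which the
    worker threads could have computed fully in parallel) are summed
    into a global dict-of-keys matrix by a single thread, so no
    atomics or locks are needed -- at the cost of storing every
    element matrix until assembly runs."""
    A = {}
    for Ke, nodes in zip(element_matrices, element_nodes):
        for i, gi in enumerate(nodes):
            for j, gj in enumerate(nodes):
                A[gi, gj] = A.get((gi, gj), 0.0) + Ke[i][j]
    return A

# two 1D linear stiffness elements on nodes (0,1) and (1,2);
# the shared node 1 receives contributions from both elements
Ke = [[1.0, -1.0], [-1.0, 1.0]]
A = assemble([Ke, Ke], [(0, 1), (1, 2)])
```

The extra memory Matt mentions is visible here: both copies of Ke exist until the serial pass runs, whereas concurrent insertion would sum them in place.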
On Sat, Dec 5, 2009 at 6:01 PM, Jed Brown jed at 59a2.org wrote: On Sat, 5 Dec 2009 16:50:38 -0600, Matthew Knepley knepley at gmail.com wrote: You assign a few threads per element to calculate the FEM integral. You could maintain this unassembled if you only need actions. You can also store it with much less memory as just values at quadrature points. Depends on the quadrature, but yes there are sometimes better storage schemes (especially if you have other properties like decay). However, if you want an actual sparse matrix, there are a couple of options: 1) Store the unassembled matrix, and run assembly after integration is complete. This needs more memory, but should perform well. Fine, but how is this assembly done? If it's serial then it would be a bottleneck, so you still need the concurrent thing below. Vec assembly can be done on the CPU since it's so small, I think. 2) Use atomic operations to update. I have not seen this yet, so I am unsure how it will perform. Atomic operations could be used per-entry but this costs on the order of 100 cycles on CPUs. I think newer GPUs have atomics, but I don't know the cost. Presumably it's at least as much as the latency of a read from memory. Not sure. Needs to be explored. Felipe is working on it. When inserting in decent sized chunks, it would probably be worth taking per-row or larger locks to amortize the cost of the atomics. Additionally, you could statically partition the workload and only use atomics for rows/entries that were shared. You can use partitioning/coloring techniques to increase the lockless concurrency. Matt
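Matt's closing remark about partitioning/coloring can be sketched: color the elements so that no two elements of the same color touch the same rows; then all elements of one color can be assembled concurrently with no locks or atomics, and colors are processed one after another. A greedy illustration (function name invented):

```python
def color_elements(element_nodes):
    """Greedy coloring: two elements that share any node (hence any
    matrix row) conflict and must get different colors. Within one
    color, every element writes to disjoint rows, so that batch can
    be assembled lock-free in parallel."""
    colors = []
    for i, nodes in enumerate(element_nodes):
        taken = {colors[j] for j in range(i)
                 if set(element_nodes[j]) & set(nodes)}
        c = 0
        while c in taken:          # smallest color unused by conflicting elements
            c += 1
        colors.append(c)
    return colors

# a 4-element ring: each element shares a node with its two neighbors,
# so adjacent elements must land in different colors
colors = color_elements([(0, 1), (1, 2), (2, 3), (3, 0)])
```

The number of colors bounds the serialization: with k colors, assembly runs as k fully parallel sweeps instead of one contended one.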
Suggestion: 1) Discard PETSc 2) Develop a general Py{CL, CUDA, OpenMP-C} system that dispatches tasks onto GPUs and multi-core systems (generally we would have one python process per compute node and local parallelism would be done via the low-level kernels to the cores and/or GPUs.) 3) Write a new PETSc using MPI4py and 2) written purely in Python 3000 using all its cool class etc features 4) Use other packages (like f2py) to generate bindings so that 3) may be called from Fortran90/2003 and C++ (these probably suck since people usually think the other way around; calling Fortran/C++ from Python, but maybe we shouldn't care; we and our friends can just be 10 times more efficient developing apps in Python). enjoy coding much better than today. What is wrong with Python 3000 that would make this approach not be great? Barry When coding a new numerical algorithm for PETSc we would just code in Python, then when tested and happy with it reimplement in Py{CL, CUDA, OpenMP-C} The other choice is designing and implementing our own cool/great OO language with the flexibility and power we want, but I fear that is way too hard and why not instead leverage Python.