Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
Karl, that would be fantastic. Much obliged! --Richard On 9/23/19 8:09 PM, Karl Rupp wrote: Hi, `git grep cudaStreamCreate` reports that vectors, matrices and scatters create their own streams. This will almost inevitably create races (there is no synchronization mechanism implemented),

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
No objection. Thanks. --Junchao Zhang On Mon, Sep 23, 2019 at 10:09 PM Karl Rupp wrote: Hi, `git grep cudaStreamCreate` reports that vectors, matrices and scatters create their own streams. This will almost inevitably create races (there is no synchronization

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Karl Rupp via petsc-dev
Hi, `git grep cudaStreamCreate` reports that vectors, matrices and scatters create their own streams. This will almost inevitably create races (there is no synchronization mechanism implemented), unless one calls WaitForGPU() after each operation. Some of the non-deterministic tests can
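
Not from the thread, just a minimal CUDA sketch of the kind of race Karl describes: work issued on two different streams is unordered unless something synchronizes them (cudaStreamSynchronize() here, or cudaDeviceSynchronize(), which is essentially what WaitForGPU() amounts to). All names are illustrative.

    #include <cuda_runtime.h>

    __global__ void scale(double *x, double a, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;
    }

    __global__ void axpy(double *y, const double *x, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] += x[i];
    }

    int main(void)
    {
      const int    n = 1 << 20;
      double      *x, *y;
      cudaStream_t s1, s2;

      cudaMalloc(&x, n * sizeof(double));
      cudaMalloc(&y, n * sizeof(double));
      cudaMemset(x, 0, n * sizeof(double));
      cudaMemset(y, 0, n * sizeof(double));
      cudaStreamCreate(&s1);
      cudaStreamCreate(&s2);

      /* Writes x on stream s1 ... */
      scale<<<(n + 255) / 256, 256, 0, s1>>>(x, 2.0, n);

      /* ... and reads x on stream s2: a race unless s2 is made to wait for s1,
         e.g. via the stream sync below (or a full cudaDeviceSynchronize()). */
      cudaStreamSynchronize(s1);
      axpy<<<(n + 255) / 256, 256, 0, s2>>>(y, x, n);

      cudaDeviceSynchronize();
      cudaStreamDestroy(s1); cudaStreamDestroy(s2);
      cudaFree(x); cudaFree(y);
      return 0;
    }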

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default stream (stream 0) is (mostly) synchronous to host and device, so WaitForGPU() is not needed in that case. I don't know if there is any performance penalty in explicitly calling it in that case, anyway. In any case, it
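
A small self-contained illustration (mine, not from the thread) of why an explicit wait tends to be redundant on the legacy default stream: a blocking cudaMemcpy() issued after a kernel on stream 0 is ordered behind that kernel and does not return until the data is on the host.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void fill(double *x, double a, int n)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] = a;
    }

    int main(void)
    {
      const int n = 256;
      double   *d_x, h_x[256];

      cudaMalloc(&d_x, n * sizeof(double));

      /* Kernel launch on the (legacy) default stream: asynchronous w.r.t. the host ... */
      fill<<<1, n>>>(d_x, 42.0, n);

      /* ... but this blocking copy is ordered after the kernel and does not return
         until it completes, so an extra cudaDeviceSynchronize() (i.e. WaitForGPU())
         before it would be redundant here. */
      cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);

      printf("h_x[0] = %g\n", h_x[0]);  /* prints 42 */
      cudaFree(d_x);
      return 0;
    }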

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mark Adams via petsc-dev
Note, the numerical problems that we have look a lot like a race condition of some sort. It happens with empty processors and goes away under cuda-memcheck (a valgrind-like tool). I did try adding WaitForGPU(), but maybe I didn't do it right or there are other synchronization mechanisms. On Mon, Sep

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
It looks like cusparsestruct->stream is always created (not NULL). I don't know the logic of the "if (!cusparsestruct->stream)". --Junchao Zhang On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev wrote: In MatMultAdd_SeqAIJCUSPARSE, before Junchao's

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the function it had if (!yy) { /* MatMult */ if (!cusparsestruct->stream) { ierr = WaitForGPU();CHKERRCUDA(ierr); } } I assume we don't need the logic to do this only in the MatMult() with no add case
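
The fragment quoted above, reindented for readability (same code, nothing changed):

    if (!yy) { /* MatMult */
      if (!cusparsestruct->stream) {
        ierr = WaitForGPU();CHKERRCUDA(ierr);
      }
    }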

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
OK, I wrote to the OLCF Consultants and they told me that * Yes, the jsrun Visualizer numberings correspond to the 'lstopo' ones. And from this I can conclude that * If I ask for 6 resource sets, each with 1 core and 1 GPU, some of the cores in different resource sets will share

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Done. See https://gitlab.com/petsc/petsc/commit/85ec510f49531057ebfe1fb641fe93a36371878e Hong On Mon, Sep 23, 2019 at 11:32 AM Pierre Jolivet wrote: Hong, You should probably cherry pick

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
To further muddy the waters, the OLCF Summit User Guide (https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide) states that "The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The processor provides 22 SMCs with separate 32kB L1 data and instruction

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
The figure did not clearly say all cores share L3. Instead, we should look at p. 16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf: "The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, and an interconnection system that connects all components within the chip

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF output from a Summit compute node for an illustration of the node layout. --Richard On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote: I also did an OpenMP stream test and found a mismatch between the OpenMP and

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Pierre Jolivet via petsc-dev
Hong, You should probably cherry pick https://gitlab.com/petsc/petsc/commit/93d7d1d6d29b0d66b5629a261178b832a925de80?merge_request_iid=2069 (and remove the MatNest part). This fixes a

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Smith, Barry F. via petsc-dev
Junchao, Great, thanks. Barry Eventually I think this should all go into an MR that includes these tests and the PetscSF ping-pongs, so later someone can reproduce these numbers on Summit and on the new machines that come out. > On Sep 23, 2019, at 11:01 AM, Zhang, Junchao

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Barry: As a hack for this release, could you have the Numeric portion of the multiply routines check if the symbolic data is there and, if not, just call the symbolic and attach the needed data? You might need to have a utility function that does all the symbolic part except the allocation of

Re: [petsc-dev] It would be really nice if you could run a single job on the pipeline with a branch

2019-09-23 Thread Smith, Barry F. via petsc-dev
> On Sep 23, 2019, at 10:43 AM, Jed Brown wrote: > > "Smith, Barry F. via petsc-dev" writes: >>> On Sep 22, 2019, at 11:26 PM, Balay, Satish wrote: >>> >>> Even though a fix addresses a breakage in a single build - that change >>> could break other things, so it's generally best to run a

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
I also did an OpenMP stream test and found a mismatch between the OpenMP and MPI results. That reminded me of a subtle issue on Summit: each pair of cores shares an L2 cache. One has to place MPI ranks on different pairs to get the best bandwidth. See the different bindings

Re: [petsc-dev] It would be really nice if you could run a single job on the pipeline with a branch

2019-09-23 Thread Jed Brown via petsc-dev
"Smith, Barry F. via petsc-dev" writes: >> On Sep 22, 2019, at 11:26 PM, Balay, Satish wrote: >> >> Even-though a fix addresses a breakage in a single build - that change >> could break other things so its generally best to run a full test. > > Sure before a merge we want everything tested

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Barry: We would like to avoid allocating a huge array for the matrix and then having the user place on top of it. In the new paradigm there could be options called on the resulting C of MatMatGetProduct() that would take effect before the C is fully formed, to prevent the allocating and

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Smith, Barry F. via petsc-dev
Hong, As a hack for this release, could you have the Numeric portion of the multiply routines check if the symbolic data is there and, if not, just call the symbolic and attach the needed data? You might need to have a utility function that does all the symbolic part except the allocation
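
A rough, self-contained sketch of the guard Barry is suggesting, with stand-in types and function names (ProductMat, product_symbolic, product_numeric are hypothetical, not PETSc API); the point is only the control flow: the numeric routine checks for the symbolic data and runs the symbolic phase itself when it is missing.

    #include <stdlib.h>

    /* Stand-ins for the product matrix and its symbolic data (hypothetical). */
    typedef struct {
      void *symbolic;                 /* data produced by the symbolic phase */
    } ProductMat;

    /* Hypothetical symbolic phase: allocates/attaches whatever numeric needs. */
    static int product_symbolic(ProductMat *C)
    {
      C->symbolic = malloc(64);
      return C->symbolic ? 0 : 1;
    }

    /* The guard being discussed: the numeric routine checks whether the
       symbolic data is present and, if not, runs the symbolic phase itself. */
    static int product_numeric(ProductMat *C)
    {
      if (!C->symbolic) {
        int err = product_symbolic(C);
        if (err) return err;
      }
      /* ... perform the actual numeric multiply using C->symbolic ... */
      return 0;
    }

    int main(void)
    {
      ProductMat C = { NULL };
      int err = product_numeric(&C);  /* works even if symbolic was never called */
      free(C.symbolic);
      return err;
    }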

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Smith, Barry F. via petsc-dev
We would like to avoid allocating a huge array for the matrix and then having the user place on top of it. In the new paradigm there could be options called on the resulting C of MatMatGetProduct() that would take effect before the C is fully formed, to prevent the allocating and freeing

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Yes, we should allow users to provide their own matrix array. We use MatDensePlaceArray() to plug an array into matrix C before MatMatMult(). If we cannot do this, we will have to copy from the internal array of the result C to our array. Would the following sequence work? MatMatMultSymbolic()
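
A sketch of the sequence Hong describes, as a fragment rather than a complete program: A, B and a previously formed dense C are assumed to exist, myarray is the user-owned buffer, and error handling is abbreviated. MatDensePlaceArray()/MatDenseResetArray() and MatMatMult() with MAT_REUSE_MATRIX are the existing PETSc calls involved.

    /* Make the dense result C write into the user's own buffer. */
    ierr = MatDensePlaceArray(C, myarray);CHKERRQ(ierr);

    /* Reuse the structure of C from the earlier symbolic/first multiply; the
       numeric product now lands directly in myarray, so no copy out of C is needed. */
    ierr = MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);

    /* Restore C's original internal array once the product has been consumed. */
    ierr = MatDenseResetArray(C);CHKERRQ(ierr);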

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Jose E. Roman via petsc-dev
> On 22 Sept 2019, at 19:11, Smith, Barry F. wrote: > > Jose, > > Thanks for the pointer. > > Will this change dramatically affect the organization of SLEPc? As noted > in my previous email, eventually we need to switch to a new API where the > REUSE with a different