Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Smith, Barry F. via petsc-dev
git branch --contains barry/2019-09-01/robustify-version-check
  balay/jed-gitlab-ci
  master

Make a new branch from your current branch, add something like -feature-sf-on-gpu to the end of the name, and merge in jczhang/feature-sf-on-gpu; configure and test with that. Barry > On Sep

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Mark Adams via petsc-dev
Junchao and Barry, I am using mark/fix-cuda-with-gamg-pintocpu, which is built on Barry's robustify branch. Is this in master yet? If so, I'd like to get my branch merged to master, then merge Junchao's branch. Then use it. I think we were waiting for some refactoring from Karl to proceed.

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Zhang, Junchao via petsc-dev
On Sat, Aug 31, 2019 at 8:04 PM Mark Adams <mfad...@lbl.gov> wrote: On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote: Any explanation for why the scaling is much better for CPUs than GPUs? Is it the "extra" time needed for communication from

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-31 Thread Mark Adams via petsc-dev
On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. wrote: > > Any explanation for why the scaling is much better for CPUs than > GPUs? Is it the "extra" time needed for communication from the GPUs? > The GPU work is well load balanced so it weak scales perfectly. When you put that work in

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-31 Thread Smith, Barry F. via petsc-dev
Any explanation for why the scaling is much better for CPUs than GPUs? Is it the "extra" time needed for communication from the GPUs? Perhaps you could try the GPU version with Junchao's new CUDA-aware MPI branch (in the gitlab merge requests) that can speed up the communication

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
Ahh, PGI compiler, that explains it :-) Ok, thanks. Don't worry about the runs right now. We'll figure out the fix. The code is just *a = (PetscReal)strtod(name,endptr); could be a compiler bug. > On Aug 14, 2019, at 9:23 PM, Mark Adams wrote: > > I am getting this error with
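A minimal standalone sketch of that conversion (assumptions: PetscReal is typedef'd to float in a --with-precision=single build, and the surrounding PETSc machinery is omitted); strtod() always parses in double, and the narrowing cast is the spot where a miscompilation or an out-of-range value could raise the trap seen under -fp_trap:

  #include <stdio.h>
  #include <stdlib.h>

  typedef float PetscReal;                    /* stand-in for a single-precision PETSc build */

  static PetscReal parse_real(const char *name, char **endptr)
  {
    return (PetscReal)strtod(name, endptr);   /* double -> float narrowing cast */
  }

  int main(void)
  {
    char      *end;
    PetscReal  v = parse_real("81.0", &end);
    printf("parsed %g\n", (double)v);
    return 0;
  }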

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
I am getting this error with single:

22:21 /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1 ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse -fp_trap
[0] 81 global equations, 27 vertices
[0]PETSC ERROR: *** unknown floating point error occurred ***

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
Oh, doesn't even have to be that large. We just need to be able to look at the flop rates (as a surrogate for run times) and compare with the previous runs. So long as the size per process is pretty much the same that is good enough. Barry > On Aug 14, 2019, at 8:45 PM, Mark Adams

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
I can run single, I just can't scale up. But I can use like 1500 processors. On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. wrote: > > Oh, are all your integers 8 bytes? Even on one node? > > Once Karl's new middleware is in place we should see about reducing to 4 > bytes on the GPU. > >

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
Oh, are all your integers 8 bytes? Even on one node? Once Karl's new middleware is in place we should see about reducing to 4 bytes on the GPU. Barry > On Aug 14, 2019, at 7:44 PM, Mark Adams wrote: > > OK, I'll run single. It a bit perverse to run with 4 byte floats and 8 byte

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
OK, I'll run single. It's a bit perverse to run with 4-byte floats and 8-byte integers ... I could use 32-bit ints and just not scale out. On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. wrote: > > Mark, > >Oh, I don't even care if it converges, just put in a fixed number of > iterations.
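A small sketch (not from the thread) to confirm the mixed layout being discussed, i.e. 4-byte reals with 8-byte indices from a --with-precision=single --with-64-bit-indices build:

  #include <petscsys.h>

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* expect 4 and 8 respectively in the build Mark describes */
    ierr = PetscPrintf(PETSC_COMM_WORLD, "sizeof(PetscReal)=%d sizeof(PetscInt)=%d\n",
                       (int)sizeof(PetscReal), (int)sizeof(PetscInt));CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }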

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
FYI, this test has a smooth (polynomial) body force and it runs a convergence study. On Wed, Aug 14, 2019 at 6:15 PM Brad Aagaard via petsc-dev < petsc-dev@mcs.anl.gov> wrote: > Q2 is often useful in problems with body forces (such as gravitational > body forces), which tend to have linear

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
"Smith, Barry F." writes: >> On Aug 14, 2019, at 5:58 PM, Jed Brown wrote: >> >> "Smith, Barry F." writes: >> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: Mark Adams via petsc-dev writes: > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. > wrote: >

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
> On Aug 14, 2019, at 5:58 PM, Jed Brown wrote: > > "Smith, Barry F." writes: > >>> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: >>> >>> Mark Adams via petsc-dev writes: >>> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > Mark, > > Would you be

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
> On Aug 14, 2019, at 3:36 PM, Mark Adams wrote: > > > > On Wed, Aug 14, 2019 at 3:37 PM Jed Brown wrote: > Mark Adams via petsc-dev writes: > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > > >> > >> Mark, > >> > >>Would you be able to make one run using single

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
"Smith, Barry F." writes: >> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: >> >> Mark Adams via petsc-dev writes: >> >>> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: >>> Mark, Would you be able to make one run using single precision? Just single

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
> On Aug 14, 2019, at 2:37 PM, Jed Brown wrote: > > Mark Adams via petsc-dev writes: > >> On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: >> >>> >>> Mark, >>> >>> Would you be able to make one run using single precision? Just single >>> everywhere since that is all we support

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
Mark, Oh, I don't even care if it converges, just put in a fixed number of iterations. The idea is to just get a baseline of the possible improvement. ECP is literally dropping millions into research on "multi precision" computations on GPUs; we need to have some actual numbers for
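One way to set up that fixed-iteration baseline, a sketch assuming a KSP named ksp and vectors b and x already exist; KSPConvergedSkip disables the convergence test so the single- and double-precision runs do identical work:

  ierr = KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT, 100);CHKERRQ(ierr); /* run exactly 100 iterations */
  ierr = KSPSetConvergenceTest(ksp, KSPConvergedSkip, NULL, NULL);CHKERRQ(ierr);                /* never declare convergence early */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

The same thing should be reachable from the command line with -ksp_max_it 100 -ksp_convergence_test skip.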

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
Here are the times for KSPSolve on one node with 2,280,285 equations. These nodes seem to have 42 cores. There are 6 "devices" (GPUs) and 7 cores attached to each device. The anomalous 28-core result could be from only using 4 "devices". I figure I will use 36 cores for now. I should really do this

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
Brad Aagaard via petsc-dev writes: > Q2 is often useful in problems with body forces (such as gravitational > body forces), which tend to have linear variations in stress. It's similar on the free-surface Stokes side, where pressure has a linear gradient and must be paired with a stable

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Brad Aagaard via petsc-dev
Q2 is often useful in problems with body forces (such as gravitational body forces), which tend to have linear variations in stress. On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote: Do you have any applications that specifically want Q2 (versus Q1) elasticity or have some test
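A sketch of the reasoning (not spelled out in the thread): with a constant gravitational body force, static equilibrium

  \nabla \cdot \sigma + \rho g = 0, \qquad \sigma = \mathbb{C} : \varepsilon(u), \qquad \varepsilon(u) = \tfrac{1}{2}\bigl(\nabla u + \nabla u^{\mathsf{T}}\bigr)

is consistent with a stress (and strain) field that varies linearly in the coordinates, so the displacement u needs quadratic terms to capture it; Q1 elements give piecewise-linear u and hence piecewise-constant strain per element, while Q2 can represent the linear stress exactly.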

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
> Do you have any applications that specifically want Q2 (versus Q1) elasticity or have some test problems that would benefit?

No, I'm just trying to push things.

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
Mark Adams writes: > On Wed, Aug 14, 2019 at 3:37 PM Jed Brown wrote: > >> Mark Adams via petsc-dev writes: >> >> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. >> wrote: >> > >> >> >> >> Mark, >> >> >> >>Would you be able to make one run using single precision? Just single >> >>

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 2:19 PM Smith, Barry F. wrote: > > Mark, > > This is great, we can study these for months. > > 1) At the top of the plots you say SNES but that can't be right, there is > no way it is getting such speed ups for the entire SNES solve since the > Jacobians are computed on the CPUs

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 3:37 PM Jed Brown wrote: > Mark Adams via petsc-dev writes: > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. > wrote: > > > >> > >> Mark, > >> > >>Would you be able to make one run using single precision? Just single > >> everywhere since that is all we

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Jed Brown via petsc-dev
Mark Adams via petsc-dev writes: > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > >> >> Mark, >> >>Would you be able to make one run using single precision? Just single >> everywhere since that is all we support currently? >> >> > Experience in engineering at least is single

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. wrote: > > Mark, > >Would you be able to make one run using single precision? Just single > everywhere since that is all we support currently? > > Experience in engineering, at least, is that single does not work for FE elasticity. I have tried it

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
Mark, Would you be able to make one run using single precision? Just single everywhere since that is all we support currently? The results will give us motivation (or anti-motivation) to have support for running KSP (or PC or Mat) in single precision while the simulation is

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Smith, Barry F. via petsc-dev
Mark, This is great, we can study these for months. 1) At the top of the plots you say SNES but that can't be right, there is no way it is getting such speed ups for the entire SNES solve since the Jacobians are computed on the CPUs and take much of the time. Do you mean the KSP part of the SNES
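A sketch (not from the thread) of one way to get the number Barry is after: wrap the linear solve in its own logging stage so -log_view reports KSPSolve separately from the CPU-side Jacobian assembly; assumes ksp, b, and x already exist:

  PetscLogStage stage;

  ierr = PetscLogStageRegister("KSP solve only", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* the part that actually runs on the GPU */
  ierr = PetscLogStagePop();CHKERRQ(ierr);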

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
> > > 3) Is comparison between pointers appropriate? For example, if (dptr != > zarray) { is scary; if some arrays are zero length, how do we know what the > pointer value will be? > > Yes, you need to consider these cases, which is kind of error prone. Also, I think merging transpose, and not, is a

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Smith, Barry F. via petsc-dev
My concern is 1) is it actually optimally efficient for all cases? This kind of stuff, IMHO:

  if (yy) {
    if (dptr != zarray) {
      ierr = VecCopy_SeqCUDA(yy,zz);CHKERRQ(ierr);
    } else if (zz != yy) {
      ierr = VecAXPY_SeqCUDA(zz,1.0,yy);CHKERRQ(ierr);
    }
  } else

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
Yea, I agree. Once this is working, I'll go back and split MatMultAdd, etc. On Wed, Jul 10, 2019 at 11:16 AM Smith, Barry F. wrote: > >In the long run I would like to see smaller specialized chunks of code > (with a bit of duplication between them) instead of highly overloaded > routines

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Smith, Barry F. via petsc-dev
In the long run I would like to see smaller specialized chunks of code (with a bit of duplication between them) instead of highly overloaded routines like MatMultAdd_AIJCUSPARSE. Better three routines: one for multiply alone, one for multiply-add alone, and one for multiply-add with the compressed sparse format. Trying

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
Thanks, you made several changes here, including switches on the work-vector size. I guess I should import this logic into the transpose method(s), except for the yy==NULL branches ... MatMult_ calls MatMultAdd with yy=0, but the transpose versions have their own code.
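A sketch of the structure Mark describes, with names following the CUSPARSE routines discussed in the thread (the body of the add routine is assumed to live elsewhere in aijcusparse.cu): the plain multiply is a thin wrapper that passes yy = NULL, while the transpose path keeps its own code.

  static PetscErrorCode MatMultAdd_SeqAIJCUSPARSE(Mat A, Vec xx, Vec yy, Vec zz); /* defined elsewhere */

  static PetscErrorCode MatMult_SeqAIJCUSPARSE(Mat A, Vec xx, Vec zz)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatMultAdd_SeqAIJCUSPARSE(A, xx, NULL, zz);CHKERRQ(ierr); /* yy == NULL: zz = A*xx, no add */
    PetscFunctionReturn(0);
  }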

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. wrote: > > ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr); > if (nt != A->rmap->n) > SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A > (%D) and xx (%D)",A->rmap->n,nt); > ierr =

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-09 Thread Smith, Barry F. via petsc-dev
ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr);
if (nt != A->rmap->n) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A (%D) and xx (%D)",A->rmap->n,nt);
ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
ierr =

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-09 Thread Mark Adams via petsc-dev
I am stumped by this GPU bug (or bugs). Maybe someone has an idea. I did find a bug in the CUDA transpose mat-vec that cuda-memcheck detected, but I still have differences between the GPU and CPU transpose mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh with two processors.
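A debugging sketch for the discrepancy described here (assumptions: A is the MPIAIJCUSPARSE matrix and x a compatible input vector): convert a copy of A to plain AIJ, apply the transpose product through both paths, and print the largest difference.

  Mat       Acpu;
  Vec       ygpu, ycpu;
  PetscReal err;

  ierr = MatConvert(A, MATMPIAIJ, MAT_INITIAL_MATRIX, &Acpu);CHKERRQ(ierr);  /* CPU reference copy */
  ierr = MatCreateVecs(A, &ygpu, NULL);CHKERRQ(ierr);                        /* layout of A^T x    */
  ierr = VecDuplicate(ygpu, &ycpu);CHKERRQ(ierr);
  ierr = MatMultTranspose(A,    x, ygpu);CHKERRQ(ierr);                      /* cusparse path      */
  ierr = MatMultTranspose(Acpu, x, ycpu);CHKERRQ(ierr);                      /* plain AIJ path     */
  ierr = VecAXPY(ycpu, -1.0, ygpu);CHKERRQ(ierr);
  ierr = VecNorm(ycpu, NORM_INFINITY, &err);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "max |CPU - GPU| = %g\n", (double)err);CHKERRQ(ierr);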