Re: [petsc-dev] Plex - Metis warnings

2018-10-29 Thread Mark Adams via petsc-dev
On Mon, Oct 29, 2018 at 5:01 PM Matthew Knepley  wrote:

> On Mon, Oct 29, 2018 at 4:56 PM Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> I am building a fresh PETSc with GNU on Titan and I get these warnings
>> about incompatible pointers in calls from PlexPartition to ParMetis.
>>
>
> Looks like PETSc has 64-bit ints and ParMetis has 32-bit ints. Just have
> PETSc build ParMetis.
>

PETSc is building ParMetis. Could it be that GNU is being too picky?
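
(For what it's worth, a tiny standalone illustration, not PETSc source, of why
GCC still warns even though both integer types are 64-bit: 'long', standing in
for ParMetis's idx_t here, and 'long long', standing in for PetscInt with
--with-64-bit-indices, are distinct C types, so pointers to them are
incompatible no matter the width.)

/* standalone illustration, not PETSc code: 'long' and 'long long' are
   distinct C types even when both are 64 bits, so pointers to them are
   incompatible and GCC warns with -Wincompatible-pointer-types */
#include <stdio.h>

static void takes_long(long *p) { printf("%ld\n", *p); }  /* stands in for the idx_t* parameter */

int main(void)
{
  long long x = 42;  /* stands in for a PetscInt under --with-64-bit-indices */
  takes_long(&x);    /* produces the same warning as the plexpartition.c calls */
  return 0;
}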


>
>   Thanks,
>
> Matt
>
>
>> Mark
>>
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:
>> In function 'PetscPartitionerPartition_ParMetis':
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1461:40:
>> warning: passing argument 1 of 'METIS_SetDefaultOptions' from incompatible
>> pointer type [-Wincompatible-pointer-types]
>>  ierr = METIS_SetDefaultOptions(options); /* initialize all
>> defaults */
>> ^~~
>> In file included from
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
>>  from
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:229:16:
>> note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
>> * {aka long long int *}'
>>  METIS_API(int) METIS_SetDefaultOptions(idx_t *options);
>> ^~~
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:43:
>> warning: passing argument 1 of 'METIS_PartGraphRecursive' from incompatible
>> pointer type [-Wincompatible-pointer-types]
>>ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
>> vwgt, NULL, adjwgt, , tpwgts, ubvec, options, , assignment);
>>^
>> In file included from
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
>>  from
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
>> note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
>> * {aka long long int *}'
>>  METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon, idx_t
>> *xadj,
>> ^~~~
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:51:
>> warning: passing argument 2 of 'METIS_PartGraphRecursive' from incompatible
>> pointer type [-Wincompatible-pointer-types]
>>ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
>> vwgt, NULL, adjwgt, , tpwgts, ubvec, options, , assignment);
>>^
>> In file included from
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
>>  from
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
>> note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
>> * {aka long long int *}'
>>  METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon, idx_t
>> *xadj,
>> ^~~~
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:58:
>> warning: passing argument 3 of 'METIS_PartGraphRecursive' from incompatible
>> pointer type [-Wincompatible-pointer-types]
>>ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
>> vwgt, NULL, adjwgt, , tpwgts, ubvec, options, , assignment);
>>   ^~~~
>> In file included from
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
>>  from
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
>> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
>> note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
>> * {aka long long int *}'
>>  METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon, idx_t
>> *xadj,
>> ^~~~
>> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:64:
>> warning: passing argument 4 of 'METIS_PartGraphRecursive' from in

Re: [petsc-dev] [petsc-users] Convergence of AMG

2018-10-29 Thread Mark Adams via petsc-dev
On Mon, Oct 29, 2018 at 2:35 PM Smith, Barry F.  wrote:

>
>Why not just stop it once it is equal to or less than the minimum
> values set by the person.


That is what it does now. It stops when it is below the value given.


> Thus you need not "backtrack" by removing levels but the user still has
> some control over preventing a "tiny" coarse problem. For example in this
> case if the user set a minimum of 1000 it would end up with 642 unknowns on
> the coarse level


Yes, that is what it would do now. I thought you wanted something different.
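
For reference, a minimal sketch of capping the coarsest-grid size from user
code (my own example, assuming the usual KSP/PC setup; the command-line
equivalent is -pc_gamg_coarse_eq_limit 1000):

#include <petscksp.h>

/* hypothetical helper: ask GAMG to stop coarsening once a level would have
   no more than 1000 equations */
PetscErrorCode SetGAMGCoarseLimit(KSP ksp)
{
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCGAMG);CHKERRQ(ierr);
  ierr = PCGAMGSetCoarseEqLim(pc,1000);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}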


> which is likely better than 6 or 54.
>



>
> Barry
>
>
> > On Oct 29, 2018, at 8:27 AM, Mark Adams  wrote:
> >
> >
> >
> > On Sun, Oct 28, 2018 at 4:54 PM Smith, Barry F. 
> wrote:
> >
> >Moved a question not needed in the public discussions to petsc-dev to
> ask Mark.
> >
> >
> >Mark,
> >
> > PCGAMGSetCoarseEqLim - Set maximum number of equations on coarsest
> grid
> >
> >Is there a way to set the minimum number of equations on the coarse
> grid also? This particular case goes down to 6, 54 and 642 unknowns on the
> coarsest grids when I'm guessing it would be better to stop at 642 unknowns
> for the coarsest level.
> >
> > No, because I don't know how it is going to coarsen I did not want to
> bother with backtracking (I do when there is an error on the coarse grid so
> it would be easy to add this but I don't think it is worth the clutter).
> >
>
>


[petsc-dev] Plex - Metis warnings

2018-10-29 Thread Mark Adams via petsc-dev
I am building a fresh PETSc with GNU on Titan and I get these warnings
about incompatible pointers in calls from PlexPartition to ParMetis.

Mark

/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:
In function 'PetscPartitionerPartition_ParMetis':
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1461:40:
warning: passing argument 1 of 'METIS_SetDefaultOptions' from incompatible
pointer type [-Wincompatible-pointer-types]
 ierr = METIS_SetDefaultOptions(options); /* initialize all
defaults */
^~~
In file included from
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
 from
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:229:16:
note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
* {aka long long int *}'
 METIS_API(int) METIS_SetDefaultOptions(idx_t *options);
^~~
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:43:
warning: passing argument 1 of 'METIS_PartGraphRecursive' from incompatible
pointer type [-Wincompatible-pointer-types]
   ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
vwgt, NULL, adjwgt, , tpwgts, ubvec, options, , assignment);
   ^
In file included from
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
 from
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
* {aka long long int *}'
 METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon, idx_t
*xadj,
^~~~
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:51:
warning: passing argument 2 of 'METIS_PartGraphRecursive' from incompatible
pointer type [-Wincompatible-pointer-types]
   ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
vwgt, NULL, adjwgt, , tpwgts, ubvec, options, , assignment);
   ^
In file included from
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
 from
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
* {aka long long int *}'
 METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon, idx_t
*xadj,
^~~~
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:58:
warning: passing argument 3 of 'METIS_PartGraphRecursive' from incompatible
pointer type [-Wincompatible-pointer-types]
   ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
vwgt, NULL, adjwgt, , tpwgts, ubvec, options, , assignment);
  ^~~~
In file included from
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
 from
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
* {aka long long int *}'
 METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon, idx_t
*xadj,
^~~~
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:64:
warning: passing argument 4 of 'METIS_PartGraphRecursive' from incompatible
pointer type [-Wincompatible-pointer-types]
   ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
vwgt, NULL, adjwgt, , tpwgts, ubvec, options, , assignment);
^~
In file included from
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
 from
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
note: expected 'idx_t * {aka long int *}' but argument is of type 'PetscInt
* {aka long long int *}'
 METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon, idx_t
*xadj,
^~~~
/lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:72:
warning: passing argument 5 of 'METIS_PartGraphRecursive' from incompatible
pointer type [-Wincompatible-pointer-types]
   ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
vwgt, NULL, 

Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-10-29 Thread Mark Adams via petsc-dev
On Mon, Oct 29, 2018 at 5:07 PM Matthew Knepley  wrote:

> On Mon, Oct 29, 2018 at 5:01 PM Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> I get this error running the tests using GPUs. An error in an LAPACK
>> routine.
>>
>
> From the command line, it does not look like GPUs are being used.
>
> It looks like the LAPACK eigensolver is failing. Maybe there is a variant
> signature on this machine?
>

I am not doing anything with LAPACK. I have no idea which LAPACK it is
picking up. I do notice in the configure log that hypre has LAPACK stuff
embedded in it. (PETSc is not using hypre's LAPACK, is it?)

I can try downloading blaslapack 


[petsc-dev] Error running on Titan with GPUs & GNU

2018-10-29 Thread Mark Adams via petsc-dev
I get this error running the tests using GPUs. An error in an LAPACK
routine.

16:50 master= /lustre/atlas/proj-shared/geo127/petsc$ make
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
PETSC_ARCH="" test
Running test examples to verify correct installation
Using
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
and PETSC_ARCH=
***Error detected during compile or link!***
See http://www.mcs.anl.gov/petsc/documentation/faq.html
/lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials ex19
*
cc -o ex19.o -c -O
 -I/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include
  `pwd`/ex19.c
cc -O  -o ex19 ex19.o
-L/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
-Wl,-rpath,/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
-L/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
-lpetsc -lHYPRE -lparmetis -lmetis -ldl
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib/libpetsc.a(dlimpl.o):
In function `PetscDLOpen':
dlimpl.c:(.text+0x3b): warning: Using 'dlopen' in statically linked
applications requires at runtime the shared libraries from the glibc
version used for linking
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib/libpetsc.a(send.o):
In function `PetscOpenSocket':
send.c:(.text+0x3be): warning: Using 'gethostbyname' in statically linked
applications requires at runtime the shared libraries from the glibc
version used for linking
true ex19
rm ex19.o
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 1 MPI
process
See http://www.mcs.anl.gov/petsc/documentation/faq.html
lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
Number of SNES iterations = 2
Application 19079964 resources: utime ~1s, stime ~1s, Rss ~29412, inblocks
~37563, outblocks ~131654
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 2 MPI
processes
See http://www.mcs.anl.gov/petsc/documentation/faq.html
lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
[1]PETSC ERROR: [0]PETSC ERROR: - Error Message
--
- Error Message
--
[1]PETSC ERROR: [0]PETSC ERROR: Error in external library
Error in external library
[1]PETSC ERROR: [0]PETSC ERROR: Error in LAPACK routine 0
Error in LAPACK routine 0
[1]PETSC ERROR: [0]PETSC ERROR: See
http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
shooting.
[1]PETSC ERROR: [0]PETSC ERROR: Petsc Development GIT revision:
v3.10.2-461-g0ed19bb123  GIT Date: 2018-10-29 13:43:53 +0100
Petsc Development GIT revision: v3.10.2-461-g0ed19bb123  GIT Date:
2018-10-29 13:43:53 +0100
[1]PETSC ERROR: [0]PETSC ERROR: ./ex19 on a  named nid16438 by adams Mon
Oct 29 16:52:05 2018
./ex19 on a  named nid16438 by adams Mon Oct 29 16:52:05 2018
[1]PETSC ERROR: [0]PETSC ERROR: Configure options --with-cudac=1
--with-batch=0
--prefix=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
--download-hypre --download-metis --download-parmetis --with-cc=cc
--with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0
--with-fc=ftn --with-fortranlib-autodetect=0 --with-shared-libraries=0
--known-mpi-shared-libraries=1 --with-mpiexec=aprun --with-x=0
--with-64-bit-indices --with-debugging=0
PETSC_ARCH=arch-titan-opt64idx-gnu-cuda
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc
Configure options --with-cudac=1 --with-batch=0
--prefix=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
--download-hypre --download-metis --download-parmetis --with-cc=cc
--with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0
--with-fc=ftn --with-fortranlib-autodetect=0 --with-shared-libraries=0
--known-mpi-shared-libraries=1 --with-mpiexec=aprun --with-x=0
--with-64-bit-indices --with-debugging=0
PETSC_ARCH=arch-titan-opt64idx-gnu-cuda
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc
[1]PETSC ERROR: [0]PETSC ERROR: #1 KSPComputeEigenvalues_GMRES() line 144
in /lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/impls/gmres/gmreig.c
#1 KSPComputeEigenvalues_GMRES() line 144 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/impls/gmres/gmreig.c
[1]PETSC ERROR: [0]PETSC ERROR: #2 KSPComputeEigenvalues() line 132 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/interface/itfunc.c
#2 KSPComputeEigenvalues() line 132 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/interface/itfunc.c
[1]PETSC ERROR: [0]PETSC ERROR: #3
KSPChebyshevComputeExtremeEigenvalues_Private() line 288 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/impls/cheby/cheby.c
#3 KSPChebyshevComputeExtremeEigenvalues_Private() line 288 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/impls/cheby/cheby.c

Re: [petsc-dev] Plex - Metis warnings

2018-10-29 Thread Mark Adams via petsc-dev
On Mon, Oct 29, 2018 at 5:19 PM Balay, Satish  wrote:

> both 'long' and 'long long' should be 64bit.
>
> Did this work before - and change today? [i.e due to one of the PR merges?]
>

I am not sure, but probably not. I don't always check the make logs, and I am
just getting this machine/compiler working.


>
> Satish
>
> On Mon, 29 Oct 2018, Matthew Knepley via petsc-dev wrote:
>
> > On Mon, Oct 29, 2018 at 4:56 PM Mark Adams via petsc-dev <
> > petsc-dev@mcs.anl.gov> wrote:
> >
> > > I am building a fresh PETSc with GNU on Titan and I get these warnings
> > > about incompatible pointers in calls from PlexPartition to ParMetis.
> > >
> >
> > Looks like PETSc has 64-bit ints and ParMetis has 32-bit ints. Just have
> > PETSc build ParMetis.
> >
> >   Thanks,
> >
> > Matt
> >
> >
> > > Mark
> > >
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:
> > > In function 'PetscPartitionerPartition_ParMetis':
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1461:40:
> > > warning: passing argument 1 of 'METIS_SetDefaultOptions' from
> incompatible
> > > pointer type [-Wincompatible-pointer-types]
> > >  ierr = METIS_SetDefaultOptions(options); /* initialize all
> > > defaults */
> > > ^~~
> > > In file included from
> > >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
> > >  from
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
> > >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:229:16:
> > > note: expected 'idx_t * {aka long int *}' but argument is of type
> 'PetscInt
> > > * {aka long long int *}'
> > >  METIS_API(int) METIS_SetDefaultOptions(idx_t *options);
> > > ^~~
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:43:
> > > warning: passing argument 1 of 'METIS_PartGraphRecursive' from
> incompatible
> > > pointer type [-Wincompatible-pointer-types]
> > >ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
> > > vwgt, NULL, adjwgt, , tpwgts, ubvec, options, ,
> assignment);
> > >^
> > > In file included from
> > >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
> > >  from
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
> > >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
> > > note: expected 'idx_t * {aka long int *}' but argument is of type
> 'PetscInt
> > > * {aka long long int *}'
> > >  METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon,
> idx_t
> > > *xadj,
> > > ^~~~
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:51:
> > > warning: passing argument 2 of 'METIS_PartGraphRecursive' from
> incompatible
> > > pointer type [-Wincompatible-pointer-types]
> > >ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
> > > vwgt, NULL, adjwgt, , tpwgts, ubvec, options, ,
> assignment);
> > >^
> > > In file included from
> > >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/parmetis.h:18:0,
> > >  from
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1395:
> > >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include/metis.h:199:16:
> > > note: expected 'idx_t * {aka long int *}' but argument is of type
> 'PetscInt
> > > * {aka long long int *}'
> > >  METIS_API(int) METIS_PartGraphRecursive(idx_t *nvtxs, idx_t *ncon,
> idx_t
> > > *xadj,
> > > ^~~~
> > >
> /lustre/atlas1/geo127/proj-shared/petsc/src/dm/impls/plex/plexpartition.c:1465:58:
> > > warning: passing argument 3 of 'METIS_PartGraphRecursive' from
> incompatible
> > > pointer type [-Wincompatible-pointer-types]
> > >ierr = METIS_PartGraphRecursive(, , xadj, adjncy,
> > > vwgt, NULL, adjwgt, , tpwgts, ubvec, options, ,
> assignment);
> > >   

Re: [petsc-dev] Plex - Metis warnings

2018-10-29 Thread Mark Adams via petsc-dev
>
>
> Pushing language C
> Popping language C
> Executing: cc  -o /tmp/petsc-yiGfSd/config.packages.MPI/conftest -O
> /tmp/petsc-yiGfSd/config.packages.MPI/conftest.o  -ldl
> Testing executable /tmp/petsc-yiGfSd/config.packages.MPI/conftest to see
> if it can be run
> Executing: /tmp/petsc-yiGfSd/config.packages.MPI/conftest
> Executing: /tmp/petsc-yiGfSd/config.packages.MPI/conftest
> ERROR while running executable: Could not execute
> "['/tmp/petsc-yiGfSd/config.packages.MPI/conftest']":
> [Mon Oct 29 17:44:45 2018] [unknown] Fatal error in MPI_Init: Other MPI
> error, error stack:
> MPIR_Init_thread(537):
> MPID_Init(249)...: channel initialization failed
> MPID_Init(638)...:  PMI2 init failed: 1
>
> <<<
>
> So all MPI tests fail on frontend? And you need to use --with-batch?
>

I ran the tests from an interactive shell. I guess I should run configure
from an interactive shell, but configure and make seemed to work.


>
> Satish
>


Re: [petsc-dev] Plex - Metis warnings

2018-10-29 Thread Mark Adams via petsc-dev
I was able to run ksp ex56 manually. This machine requires an inscrutable
workflow. I built the code in my home directory, copied the executable to a
working directory, and ran it there. I have seen my colleagues do this copy
business, which is clearly not in the PETSc tests. I don't understand it, but
I can hand it off to my users with some assurance that it _can_ work!

Thanks,
Mark

On Mon, Oct 29, 2018 at 7:23 PM Balay, Satish  wrote:

> On Mon, 29 Oct 2018, Mark Adams via petsc-dev wrote:
>
> > >
> > >
> > > Pushing language C
> > > Popping language C
> > > Executing: cc  -o /tmp/petsc-yiGfSd/config.packages.MPI/conftest -O
> > > /tmp/petsc-yiGfSd/config.packages.MPI/conftest.o  -ldl
> > > Testing executable /tmp/petsc-yiGfSd/config.packages.MPI/conftest to
> see
> > > if it can be run
> > > Executing: /tmp/petsc-yiGfSd/config.packages.MPI/conftest
> > > Executing: /tmp/petsc-yiGfSd/config.packages.MPI/conftest
> > > ERROR while running executable: Could not execute
> > > "['/tmp/petsc-yiGfSd/config.packages.MPI/conftest']":
> > > [Mon Oct 29 17:44:45 2018] [unknown] Fatal error in MPI_Init: Other MPI
> > > error, error stack:
> > > MPIR_Init_thread(537):
> > > MPID_Init(249)...: channel initialization failed
> > > MPID_Init(638)...:  PMI2 init failed: 1
> > >
> > > <<<<<<<
> > >
> > > So all MPI tests fail on frontend? And you need to use --with-batch?
> > >
> >
> > I ran the tests from an interactive shell. I guess I should run configure
> > from an interactive shell. but configure and make seemed to work.
>
> >>>>
> Executing: aprun /tmp/petsc-yiGfSd/config.libraries/conftest
> Executing: aprun /tmp/petsc-yiGfSd/config.libraries/conftest
> stdout: XALT Error: unable to find aprun
> ERROR while running executable: Could not execute "['aprun
> /tmp/petsc-yiGfSd/config.libraries/conftest']":
> XALT Error: unable to find aprun
>
> Are you able to use 'aprun' via interactive nodes to run MPI jobs? Somehow
> configure is unable to use aprun.
>
> Satish
>


Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-10-29 Thread Mark Adams via petsc-dev
Still getting this error with the downloaded LAPACK. I sent the logs on the
other thread.


18:02 master= /lustre/atlas/proj-shared/geo127/petsc$ make
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
PETSC_ARCH="" test
Running test examples to verify correct installation
Using
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
and PETSC_ARCH=
***Error detected during compile or link!***
See http://www.mcs.anl.gov/petsc/documentation/faq.html
/lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials ex19
*
cc -o ex19.o -c -O
 -I/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include
  `pwd`/ex19.c
cc -O  -o ex19 ex19.o
-L/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
-Wl,-rpath,/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
-L/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
-lpetsc -lHYPRE -lflapack -lfblas -lparmetis -lmetis -ldl
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib/libpetsc.a(dlimpl.o):
In function `PetscDLOpen':
dlimpl.c:(.text+0x3b): warning: Using 'dlopen' in statically linked
applications requires at runtime the shared libraries from the glibc
version used for linking
/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib/libpetsc.a(send.o):
In function `PetscOpenSocket':
send.c:(.text+0x3be): warning: Using 'gethostbyname' in statically linked
applications requires at runtime the shared libraries from the glibc
version used for linking
true ex19
rm ex19.o
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 1 MPI
process
See http://www.mcs.anl.gov/petsc/documentation/faq.html
lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
Number of SNES iterations = 2
Application 19080270 resources: utime ~0s, stime ~1s, Rss ~72056, inblocks
~19397, outblocks ~51049
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 2 MPI
processes
See http://www.mcs.anl.gov/petsc/documentation/faq.html
lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
[1]PETSC ERROR: [0]PETSC ERROR: - Error Message
--
- Error Message
--
[1]PETSC ERROR: [0]PETSC ERROR: Error in external library
Error in external library
[1]PETSC ERROR: [0]PETSC ERROR: Error in LAPACK routine 0
Error in LAPACK routine 0
[1]PETSC ERROR: [0]PETSC ERROR: See
http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
shooting.
[1]PETSC ERROR: [0]PETSC ERROR: Petsc Development GIT revision:
v3.10.2-461-g0ed19bb123  GIT Date: 2018-10-29 13:43:53 +0100
Petsc Development GIT revision: v3.10.2-461-g0ed19bb123  GIT Date:
2018-10-29 13:43:53 +0100
[1]PETSC ERROR: [0]PETSC ERROR: ./ex19 on a  named nid08331 by adams Mon
Oct 29 18:07:59 2018
./ex19 on a  named nid08331 by adams Mon Oct 29 18:07:59 2018
[1]PETSC ERROR: [0]PETSC ERROR: Configure options --with-cudac=1
--with-batch=0
--prefix=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
--download-hypre --download-metis --download-parmetis
--download-fblaslapack --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
--with-cxxlib-autodetect=0 --with-fc=ftn --with-fortranlib-autodetect=0
--with-shared-libraries=0 --known-mpi-shared-libraries=1
--with-mpiexec=aprun --with-x=0 --with-64-bit-indices --with-debugging=0
PETSC_ARCH=arch-titan-opt64idx-gnu-cuda
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc
Configure options --with-cudac=1 --with-batch=0
--prefix=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
--download-hypre --download-metis --download-parmetis
--download-fblaslapack --with-cc=cc --with-clib-autodetect=0 --with-cxx=CC
--with-cxxlib-autodetect=0 --with-fc=ftn --with-fortranlib-autodetect=0
--with-shared-libraries=0 --known-mpi-shared-libraries=1
--with-mpiexec=aprun --with-x=0 --with-64-bit-indices --with-debugging=0
PETSC_ARCH=arch-titan-opt64idx-gnu-cuda
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc
[1]PETSC ERROR: [0]PETSC ERROR: #1 KSPComputeEigenvalues_GMRES() line 144
in /lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/impls/gmres/gmreig.c
#1 KSPComputeEigenvalues_GMRES() line 144 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/impls/gmres/gmreig.c
[1]PETSC ERROR: #2 KSPComputeEigenvalues() line 132 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: [1]PETSC ERROR: #2 KSPComputeEigenvalues() line 132 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/interface/itfunc.c
#3 KSPChebyshevComputeExtremeEigenvalues_Private() line 288 in
/lustre/atlas1/geo127/proj-shared/petsc/src/ksp/ksp/impls/cheby/cheby.c
[0]PETSC ERROR: [1]PETSC ERROR: #3

Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-10-29 Thread Mark Adams via petsc-dev
On Mon, Oct 29, 2018 at 6:55 PM Smith, Barry F.  wrote:

>
>Here is the code
>
>
> PetscStackCallBLAS("LAPACKgeev",LAPACKgeev_("N","N",,R,,realpart,imagpart,work,,));
>   if (lierr) SETERRQ1(PETSC_COMM_SELF,PETSC_ERR_LIB,"Error in LAPACK
> routine %d",(int)lierr);
>
>What is unfathomable is that it prints (int) lierr of 0 but then the if
> () test should not be satisfied.
>
>Do a ./configure with debugging turned on, could be an optimizing
> compiler error.
>

Configuring debug now.

Note that I was able to run ex56 (KSP), which does not use GMRES. This error
came from a GMRES method, so maybe this is an isolated problem.
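
In case it helps isolate this, here is a rough standalone sanity check I can
run, not PETSc code; it assumes the usual Fortran dgeev_ symbol with 32-bit
integers and links against the same -lflapack/-lfblas. If it prints a nonzero
info or bogus eigenvalues at -O, the external LAPACK (or the optimizer) is
suspect on its own.

/* sketch: call LAPACK dgeev on a tiny 2x2 symmetric matrix (eigenvalues 1 and 3)
   and print info; assumes 32-bit Fortran integers and no hidden string-length
   arguments, which is how C callers commonly invoke the reference LAPACK */
#include <stdio.h>

extern void dgeev_(const char *jobvl,const char *jobvr,int *n,double *a,int *lda,
                   double *wr,double *wi,double *vl,int *ldvl,double *vr,int *ldvr,
                   double *work,int *lwork,int *info);

int main(void)
{
  int    n = 2, lda = 2, ldv = 1, lwork = 8, info = -1;
  double a[4] = {2.0, 1.0, 1.0, 2.0};  /* column-major 2x2 */
  double wr[2], wi[2], work[8];

  dgeev_("N","N",&n,a,&lda,wr,wi,NULL,&ldv,NULL,&ldv,work,&lwork,&info);
  printf("info = %d  eigenvalues = %g %g\n", info, wr[0], wr[1]);
  return 0;
}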


>
>Barry
>
>
> > On Oct 29, 2018, at 3:56 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > I get this error running the tests using GPUs. An error in an LAPACK
> routine.
> >
> > 16:50 master= /lustre/atlas/proj-shared/geo127/petsc$ make
> PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
> PETSC_ARCH="" test
> > Running test examples to verify correct installation
> > Using
> PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
> and PETSC_ARCH=
> > ***Error detected during compile or
> link!***
> > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> > /lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials ex19
> >
> *
> > cc -o ex19.o -c -O
>  -I/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/include
>   `pwd`/ex19.c
> > cc -O  -o ex19 ex19.o
> -L/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
> -Wl,-rpath,/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
> -L/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib
> -lpetsc -lHYPRE -lparmetis -lmetis -ldl
> >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib/libpetsc.a(dlimpl.o):
> In function `PetscDLOpen':
> > dlimpl.c:(.text+0x3b): warning: Using 'dlopen' in statically linked
> applications requires at runtime the shared libraries from the glibc
> version used for linking
> >
> /lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda/lib/libpetsc.a(send.o):
> In function `PetscOpenSocket':
> > send.c:(.text+0x3be): warning: Using 'gethostbyname' in statically
> linked applications requires at runtime the shared libraries from the glibc
> version used for linking
> > true ex19
> > rm ex19.o
> > Possible error running C/C++ src/snes/examples/tutorials/ex19 with 1 MPI
> process
> > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> > lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
> > Number of SNES iterations = 2
> > Application 19079964 resources: utime ~1s, stime ~1s, Rss ~29412,
> inblocks ~37563, outblocks ~131654
> > Possible error running C/C++ src/snes/examples/tutorials/ex19 with 2 MPI
> processes
> > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> > lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
> > [1]PETSC ERROR: [0]PETSC ERROR: - Error Message
> --
> > - Error Message
> --
> > [1]PETSC ERROR: [0]PETSC ERROR: Error in external library
> > Error in external library
> > [1]PETSC ERROR: [0]PETSC ERROR: Error in LAPACK routine 0
> > Error in LAPACK routine 0
> > [1]PETSC ERROR: [0]PETSC ERROR: See
> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
> shooting.
> > [1]PETSC ERROR: [0]PETSC ERROR: Petsc Development GIT revision:
> v3.10.2-461-g0ed19bb123  GIT Date: 2018-10-29 13:43:53 +0100
> > Petsc Development GIT revision: v3.10.2-461-g0ed19bb123  GIT Date:
> 2018-10-29 13:43:53 +0100
> > [1]PETSC ERROR: [0]PETSC ERROR: ./ex19 on a  named nid16438 by adams Mon
> Oct 29 16:52:05 2018
> > ./ex19 on a  named nid16438 by adams Mon Oct 29 16:52:05 2018
> > [1]PETSC ERROR: [0]PETSC ERROR: Configure options --with-cudac=1
> --with-batch=0
> --prefix=/lustre/atlas/proj-shared/geo127/petsc_titan_opt64idx_gnu_cuda
> --download-hypre --download-metis --download-parmetis --with-cc=cc
> --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0
> --with-fc=ftn --with-fortranlib-autodetect=0 --with-shared-libraries=0
> --known-mpi-shared-libraries=1 --with-mpiexec=aprun --with-x=0
> --with-64-bit-in

Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-10-29 Thread Mark Adams via petsc-dev
And a debug build seems to work:

21:04 1 master= /lustre/atlas/proj-shared/geo127/petsc$ make
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda
PETSC_ARCH="" test
Running test examples to verify correct installation
Using
PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda
and PETSC_ARCH=
***Error detected during compile or link!***
See http://www.mcs.anl.gov/petsc/documentation/faq.html
/lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials ex19
*
cc -o ex19.o -c -g
 -I/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/include
  `pwd`/ex19.c
cc -g  -o ex19 ex19.o
-L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
-Wl,-rpath,/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
-L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
-lpetsc -lHYPRE -lflapack -lfblas -lparmetis -lmetis -ldl
/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib/libpetsc.a(dlimpl.o):
In function `PetscDLOpen':
/lustre/atlas1/geo127/proj-shared/petsc/src/sys/dll/dlimpl.c:108: warning:
Using 'dlopen' in statically linked applications requires at runtime the
shared libraries from the glibc version used for linking
/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib/libpetsc.a(send.o):
In function `PetscOpenSocket':
/lustre/atlas1/geo127/proj-shared/petsc/src/sys/classes/viewer/impls/socket/send.c:108:
warning: Using 'gethostbyname' in statically linked applications requires
at runtime the shared libraries from the glibc version used for linking
true ex19
rm ex19.o
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 1 MPI
process
See http://www.mcs.anl.gov/petsc/documentation/faq.html
lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
Number of SNES iterations = 2
Application 19081049 resources: utime ~1s, stime ~1s, Rss ~17112, inblocks
~36504, outblocks ~111043
Possible error running C/C++ src/snes/examples/tutorials/ex19 with 2 MPI
processes
See http://www.mcs.anl.gov/petsc/documentation/faq.html
lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
Number of SNES iterations = 2
Application 19081050 resources: utime ~1s, stime ~1s, Rss ~19816, inblocks
~36527, outblocks ~111043
5a6
> Application 19081051 resources: utime ~1s, stime ~0s, Rss ~13864,
inblocks ~36527, outblocks ~111043
/lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials
Possible problem with ex19_hypre, diffs above
=
***Error detected during compile or link!***
See http://www.mcs.anl.gov/petsc/documentation/faq.html
/lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials ex5f
*
ftn -c -g
-I/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/include
-o ex5f.o ex5f.F90
ftn -g   -o ex5f ex5f.o
-L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
-Wl,-rpath,/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
-L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
-lpetsc -lHYPRE -lflapack -lfblas -lparmetis -lmetis -ldl
/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib/libpetsc.a(dlimpl.o):
In function `PetscDLOpen':
/lustre/atlas1/geo127/proj-shared/petsc/src/sys/dll/dlimpl.c:108: warning:
Using 'dlopen' in statically linked applications requires at runtime the
shared libraries from the glibc version used for linking
/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib/libpetsc.a(send.o):
In function `PetscOpenSocket':
/lustre/atlas1/geo127/proj-shared/petsc/src/sys/classes/viewer/impls/socket/send.c:108:
warning: Using 'gethostbyname' in statically linked applications requires
at runtime the shared libraries from the glibc version used for linking
rm ex5f.o
Possible error running Fortran example src/snes/examples/tutorials/ex5f
with 1 MPI process
See http://www.mcs.anl.gov/petsc/documentation/faq.html
Number of SNES iterations = 4
Application 19081055 resources: utime ~1s, stime ~0s, Rss ~12760, inblocks
~36800, outblocks ~111983
Completed test examples
21:06 master= /lustre/atlas/proj-shared/geo127/petsc$


Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-11-01 Thread Mark Adams via petsc-dev
On Wed, Oct 31, 2018 at 12:30 PM Mark Adams  wrote:

>
>
> On Wed, Oct 31, 2018 at 6:59 AM Karl Rupp  wrote:
>
>> Hi Mark,
>>
>> ah, I was confused by the Python information at the beginning of
>> configure.log. So it is picking up the correct compiler.
>>
>> Have you tried uncommenting the check for GNU?
>>
>
Yes, but I am getting an error that the cuda files do not find mpi.h.


>
> I'm getting a make error.
>
> Thanks,
>


Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-10-30 Thread Mark Adams via petsc-dev
>
>
>
> Are there newer versions of the Gnu compiler for this system?


Yes:

--
/opt/modulefiles
--
gcc/4.8.1  gcc/4.8.2  gcc/4.9.3  gcc/5.3.0
gcc/6.1.0  gcc/6.2.0  gcc/6.3.0(default) gcc/7.1.0
gcc/7.2.0  gcc/7.3.0



> Are there any other compilers on the system that would likely be less
> buggy? IBM compilers? If this simple code generates a gross error with
> optimization who's to say how many more subtle bugs may be induced in the
> library by the buggy optimizer (there may be none but IMHO probability says
> there will be others).
>

Let me ask them what they recommend for use with CUDA codes.


>
> Is there any chance that valgrind runs on this machine; you could run
> the optimized version through it and see what it says.
>
>
Valgrind works, but it produces tons of output and I could not see anything
interesting in there.

And this test does work with 1 processor!

I think this is only a problem when GMRES is used as the eigenvalue estimator
in Chebyshev. GMRES solvers work, and Chebyshev works with
-mg_levels_esteig_ksp_type cg.
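
For completeness, the workaround can also be hardwired in code; a small
sketch, equivalent to passing -mg_levels_esteig_ksp_type cg on the command
line:

#include <petscsys.h>

/* hypothetical helper: force the Chebyshev smoother's eigenvalue estimate
   to use CG instead of GMRES by setting the option in the global database */
PetscErrorCode UseCGForEigEstimate(void)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscOptionsSetValue(NULL,"-mg_levels_esteig_ksp_type","cg");CHKERRQ(ierr);
  PetscFunctionReturn(0);
}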


Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-10-31 Thread Mark Adams via petsc-dev
On Wed, Oct 31, 2018 at 5:05 AM Karl Rupp  wrote:

> Hi Mark,
>
> please comment or remove lines 83 and 84 in
>   config/BuildSystem/config/packages/cuda.py
>
> Is there a compiler newer than GCC 4.3 available?
>

You mean 6.3?

06:33  ~$ module avail gcc

- /opt/modulefiles
-
gcc/4.8.1  gcc/4.9.3  gcc/6.1.0  gcc/6.3.0(default)
gcc/7.2.0
gcc/4.8.2  gcc/5.3.0  gcc/6.2.0  gcc/7.1.0
gcc/7.3.0



>
> Best regards,
> Karli
>
>
>
> On 10/31/18 8:15 AM, Mark Adams via petsc-dev wrote:
> > After loading a cuda module ...
> >
> > On Wed, Oct 31, 2018 at 2:58 AM Mark Adams  > <mailto:mfad...@lbl.gov>> wrote:
> >
> > I get an error with --with-cuda=1
> >
> > On Tue, Oct 30, 2018 at 4:44 PM Smith, Barry F.  > <mailto:bsm...@mcs.anl.gov>> wrote:
> >
> > --with-cudac=1 should be --with-cuda=1
> >
> >
> >
> >  > On Oct 30, 2018, at 12:35 PM, Smith, Barry F. via petsc-dev
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> >  >
> >  >
> >  >
> >  >> On Oct 29, 2018, at 8:09 PM, Mark Adams  > <mailto:mfad...@lbl.gov>> wrote:
> >  >>
> >  >> And a debug build seems to work:
> >  >
> >  >Well ok.
> >  >
> >  >Are there newer versions of the Gnu compiler for this
> > system? Are there any other compilers on the system that would
> > likely be less buggy? IBM compilers? If this simple code
> > generates a gross error with optimization who's to say how many
> > more subtle bugs may be induced in the library by the buggy
> > optimizer (there may be none but IMHO probability says there
> > will be others).
> >  >
> >  >Is there any chance that valgrind runs on this machine;
> > you could run the optimized version through it and see what it
> says.
> >  >
> >  >   Barry
> >  >
> >  >>
> >  >> 21:04 1 master= /lustre/atlas/proj-shared/geo127/petsc$ make
> >
>  PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda
> > PETSC_ARCH="" test
> >  >> Running test examples to verify correct installation
> >  >> Using
> >
>  PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda
> > and PETSC_ARCH=
> >  >> ***Error detected during compile or
> > link!***
> >  >> See http://www.mcs.anl.gov/petsc/documentation/faq.html
> >  >>
> >
>  /lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials
> > ex19
> >  >>
> >
>  
> *
> >  >> cc -o ex19.o -c -g
> >
>  -I/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/include
>   `pwd`/ex19.c
> >  >> cc -g  -o ex19 ex19.o
> >
>  -L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
> >
>  -Wl,-rpath,/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
> >
>  -L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
> > -lpetsc -lHYPRE -lflapack -lfblas -lparmetis -lmetis -ldl
> >  >>
> >
>  
> /lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib/libpetsc.a(dlimpl.o):
> > In function `PetscDLOpen':
> >  >>
> >
>  /lustre/atlas1/geo127/proj-shared/petsc/src/sys/dll/dlimpl.c:108: warning:
> > Using 'dlopen' in statically linked applications requires at
> > runtime the shared libraries from the glibc version used for
> linking
> >  >>
> >
>  
> /lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib/libpetsc.a(send.o):
> > In function `PetscOpenSocket':
> >  >>
> >
>  
> /lustre/atlas1/geo127/proj-shared/petsc/src/sys/classes/viewer/impls/socket/send.c:108:
> > warning: Using 'gethostbyname' in statically linked applications
> > requires at runtime the shared libraries from the glibc version
> > used for linking
> >  >> true ex19
> >  >> rm

Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-10-31 Thread Mark Adams via petsc-dev
It looks like configure is not finding the correct cc. It does not seem
hard to find.

06:37 master= /lustre/atlas/proj-shared/geo127/petsc$ cc --version
gcc (GCC) 6.3.0 20161221 (Cray Inc.)
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

06:37 master= /lustre/atlas/proj-shared/geo127/petsc$ which cc
/opt/cray/craype/2.5.13/bin/cc
06:38 master= /lustre/atlas/proj-shared/geo127/petsc$ which gcc
/opt/gcc/6.3.0/bin/gcc


On Wed, Oct 31, 2018 at 6:34 AM Mark Adams  wrote:

>
>
> On Wed, Oct 31, 2018 at 5:05 AM Karl Rupp  wrote:
>
>> Hi Mark,
>>
>> please comment or remove lines 83 and 84 in
>>   config/BuildSystem/config/packages/cuda.py
>>
>> Is there a compiler newer than GCC 4.3 available?
>>
>
> You mean 6.3?
>
> 06:33  ~$ module avail gcc
>
> - /opt/modulefiles
> -
> gcc/4.8.1  gcc/4.9.3  gcc/6.1.0
> gcc/6.3.0(default) gcc/7.2.0
> gcc/4.8.2  gcc/5.3.0  gcc/6.2.0  gcc/7.1.0
>   gcc/7.3.0
>
>
>
>>
>> Best regards,
>> Karli
>>
>>
>>
>> On 10/31/18 8:15 AM, Mark Adams via petsc-dev wrote:
>> > After loading a cuda module ...
>> >
>> > On Wed, Oct 31, 2018 at 2:58 AM Mark Adams > > <mailto:mfad...@lbl.gov>> wrote:
>> >
>> > I get an error with --with-cuda=1
>> >
>> > On Tue, Oct 30, 2018 at 4:44 PM Smith, Barry F. > > <mailto:bsm...@mcs.anl.gov>> wrote:
>> >
>> > --with-cudac=1 should be --with-cuda=1
>> >
>> >
>> >
>> >  > On Oct 30, 2018, at 12:35 PM, Smith, Barry F. via petsc-dev
>> > mailto:petsc-dev@mcs.anl.gov>> wrote:
>> >  >
>> >  >
>> >  >
>> >  >> On Oct 29, 2018, at 8:09 PM, Mark Adams > > <mailto:mfad...@lbl.gov>> wrote:
>> >  >>
>> >  >> And a debug build seems to work:
>> >  >
>> >  >Well ok.
>> >  >
>> >  >Are there newer versions of the Gnu compiler for this
>> > system? Are there any other compilers on the system that would
>> > likely be less buggy? IBM compilers? If this simple code
>> > generates a gross error with optimization who's to say how many
>> > more subtle bugs may be induced in the library by the buggy
>> > optimizer (there may be none but IMHO probability says there
>> > will be others).
>> >  >
>> >  >Is there any chance that valgrind runs on this machine;
>> > you could run the optimized version through it and see what it
>> says.
>> >  >
>> >  >   Barry
>> >  >
>> >  >>
>> >  >> 21:04 1 master= /lustre/atlas/proj-shared/geo127/petsc$ make
>> >
>>  PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda
>> > PETSC_ARCH="" test
>> >  >> Running test examples to verify correct installation
>> >  >> Using
>> >
>>  PETSC_DIR=/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda
>> > and PETSC_ARCH=
>> >  >> ***Error detected during compile or
>> > link!***
>> >  >> See http://www.mcs.anl.gov/petsc/documentation/faq.html
>> >  >>
>> >
>>  /lustre/atlas/proj-shared/geo127/petsc/src/snes/examples/tutorials
>> > ex19
>> >  >>
>> >
>>  
>> *
>> >  >> cc -o ex19.o -c -g
>> >
>>  -I/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/include
>>   `pwd`/ex19.c
>> >  >> cc -g  -o ex19 ex19.o
>> >
>>  -L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
>> >
>>  
>> -Wl,-rpath,/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
>> >
>>  -L/lustre/atlas/proj-shared/geo127/petsc_titan_dbg64idx_gnu_cuda/lib
>> > -lpetsc -lHYPRE -lflapack -lfblas -lparmetis -lm

Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-11-02 Thread Mark Adams via petsc-dev
I did not configure hypre manually, so I guess it is not using GPUs.

On Fri, Nov 2, 2018 at 2:40 PM Smith, Barry F.  wrote:

>
>
> > On Nov 2, 2018, at 1:25 PM, Mark Adams  wrote:
> >
> > And I just tested it with GAMG and it seems fine.  And hypre ran, but it
> is not clear that it used GPUs
>
> Presumably hyper must be configured to use GPUs. Currently the PETSc
> hyper download installer hypre.py doesn't have any options for getting
> hypre built for GPUs.
>
> Barry
>
> >
> > 14:13 master= ~/petsc/src/snes/examples/tutorials$ jsrun -n 1 ./ex19
> -dm_vec_type cuda -dm_mat_type aijcusparse -pc_type hypre -ksp_type fgmres
> -snes_monitor_short -snes_rtol 1.e-5 -ksp_view
> > lid velocity = 0.0625, prandtl # = 1., grashof # = 1.
> >   0 SNES Function norm 0.239155
> > KSP Object: 1 MPI processes
> >   type: fgmres
> > restart=30, using Classical (unmodified) Gram-Schmidt
> Orthogonalization with no iterative refinement
> > happy breakdown tolerance 1e-30
> >   maximum iterations=1, initial guess is zero
> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
> >   right preconditioning
> >   using UNPRECONDITIONED norm type for convergence test
> > PC Object: 1 MPI processes
> >   type: hypre
> > HYPRE BoomerAMG preconditioning
> >   Cycle type V
> >   Maximum number of levels 25
> >   Maximum number of iterations PER hypre call 1
> >   Convergence tolerance PER hypre call 0.
> >   Threshold for strong coupling 0.25
> >   Interpolation truncation factor 0.
> >   Interpolation: max elements per row 0
> >   Number of levels of aggressive coarsening 0
> >   Number of paths for aggressive coarsening 1
> >   Maximum row sums 0.9
> >   Sweeps down 1
> >   Sweeps up   1
> >   Sweeps on coarse1
> >   Relax down  symmetric-SOR/Jacobi
> >   Relax upsymmetric-SOR/Jacobi
> >   Relax on coarse Gaussian-elimination
> >   Relax weight  (all)  1.
> >   Outer relax weight (all) 1.
> >   Using CF-relaxation
> >   Not using more complex smoothers.
> >   Measure typelocal
> >   Coarsen typeFalgout
> >   Interpolation type  classical
> >   linear system matrix = precond matrix:
> >   Mat Object: 1 MPI processes
> > type: seqaijcusparse
> > rows=64, cols=64, bs=4
> > total: nonzeros=1024, allocated nonzeros=1024
> > total number of mallocs used during MatSetValues calls =0
> >   using I-node routines: found 16 nodes, limit used is 5
> >   1 SNES Function norm 6.80716e-05
> > KSP Object: 1 MPI processes
> >   type: fgmres
> > restart=30, using Classical (unmodified) Gram-Schmidt
> Orthogonalization with no iterative refinement
> > happy breakdown tolerance 1e-30
> >   maximum iterations=1, initial guess is zero
> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
> >   right preconditioning
> >   using UNPRECONDITIONED norm type for convergence test
> > PC Object: 1 MPI processes
> >   type: hypre
> > HYPRE BoomerAMG preconditioning
> >   Cycle type V
> >   Maximum number of levels 25
> >   Maximum number of iterations PER hypre call 1
> >   Convergence tolerance PER hypre call 0.
> >   Threshold for strong coupling 0.25
> >   Interpolation truncation factor 0.
> >   Interpolation: max elements per row 0
> >   Number of levels of aggressive coarsening 0
> >   Number of paths for aggressive coarsening 1
> >   Maximum row sums 0.9
> >   Sweeps down 1
> >   Sweeps up   1
> >   Sweeps on coarse1
> >   Relax down  symmetric-SOR/Jacobi
> >   Relax upsymmetric-SOR/Jacobi
> >   Relax on coarse Gaussian-elimination
> >   Relax weight  (all)  1.
> >   Outer relax weight (all) 1.
> >   Using CF-relaxation
> >   Not using more complex smoothers.
> >   Measure typelocal
> >   Coarsen typeFalgout
> >   Interpolation type  classical
> >   linear system matrix = precond matrix:
> >   Mat Object: 1 MPI processes
> > type: seqaijcusparse
> > rows=64, cols=64, bs=4
> > total: nonzeros=1024, allocated nonzeros=1024
> > total number of mallocs used during MatSetValues calls =0
> >   using I-node routines: found 16 nodes, limit used is 5
> >   2 SNES Function norm 4.093e-11
> > Number of SNES iterations = 2
> >
> >
> > On Fri, Nov 2, 2018 at 2:10 PM Smith, Barry F. 
> wrote:
> >
> >
> > > On Nov 2, 2018, at 1:03 PM, Mark Adams  wrote:
> > >
> > > FYI, I seem to have the new GPU machine at ORNL (summitdev) working
> with GPUs. That is good enough for now.
> > > Thanks,
> >
> >Excellant!
> >
> > >
> > > 14:00 master= ~/petsc/src/snes/examples/tutorials$ jsrun -n 1 ./ex19
> -dm_vec_type cuda -dm_mat_type aijcusparse -pc_type none -ksp_type fgmres
> -snes_monitor_short -snes_rtol 1.e-5 -ksp_view
> > > lid velocity = 0.0625, prandtl # 

Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-11-02 Thread Mark Adams via petsc-dev
FYI, I seem to have the new GPU machine at ORNL (summitdev) working with
GPUs. That is good enough for now.
Thanks,

14:00 master= ~/petsc/src/snes/examples/tutorials$ jsrun -n 1 ./ex19
-dm_vec_type cuda -dm_mat_type aijcusparse -pc_type none -ksp_type fgmres
-snes_monitor_short -snes_rtol 1.e-5 -ksp_view
lid velocity = 0.0625, prandtl # = 1., grashof # = 1.
  0 SNES Function norm 0.239155
KSP Object: 1 MPI processes
  type: fgmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization
with no iterative refinement
happy breakdown tolerance 1e-30
  maximum iterations=1, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijcusparse
rows=64, cols=64, bs=4
total: nonzeros=1024, allocated nonzeros=1024
total number of mallocs used during MatSetValues calls =0
  using I-node routines: found 16 nodes, limit used is 5
  1 SNES Function norm 6.82338e-05
KSP Object: 1 MPI processes
  type: fgmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization
with no iterative refinement
happy breakdown tolerance 1e-30
  maximum iterations=1, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: none
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijcusparse
rows=64, cols=64, bs=4
total: nonzeros=1024, allocated nonzeros=1024
total number of mallocs used during MatSetValues calls =0
  using I-node routines: found 16 nodes, limit used is 5
  2 SNES Function norm 3.346e-10
Number of SNES iterations = 2
14:01 master= ~/petsc/src/snes/examples/tutorials$



On Thu, Nov 1, 2018 at 9:33 AM Mark Adams  wrote:

>
>
> On Wed, Oct 31, 2018 at 12:30 PM Mark Adams  wrote:
>
>>
>>
>> On Wed, Oct 31, 2018 at 6:59 AM Karl Rupp  wrote:
>>
>>> Hi Mark,
>>>
>>> ah, I was confused by the Python information at the beginning of
>>> configure.log. So it is picking up the correct compiler.
>>>
>>> Have you tried uncommenting the check for GNU?
>>>
>>
> Yes, but I am getting an error that the cuda files do not find mpi.h.
>
>
>>
>> I'm getting a make error.
>>
>> Thanks,
>>
>


Re: [petsc-dev] Error running on Titan with GPUs & GNU

2018-11-02 Thread Mark Adams via petsc-dev
And I just tested it with GAMG and it seems fine. And hypre ran, but it is
not clear that it used GPUs.

14:13 master= ~/petsc/src/snes/examples/tutorials$ jsrun -n 1 ./ex19
-dm_vec_type cuda -dm_mat_type aijcusparse -pc_type hypre -ksp_type fgmres
-snes_monitor_short -snes_rtol 1.e-5 -ksp_view
lid velocity = 0.0625, prandtl # = 1., grashof # = 1.
  0 SNES Function norm 0.239155
KSP Object: 1 MPI processes
  type: fgmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization
with no iterative refinement
happy breakdown tolerance 1e-30
  maximum iterations=1, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: hypre
HYPRE BoomerAMG preconditioning
  Cycle type V
  Maximum number of levels 25
  Maximum number of iterations PER hypre call 1
  Convergence tolerance PER hypre call 0.
  Threshold for strong coupling 0.25
  Interpolation truncation factor 0.
  Interpolation: max elements per row 0
  Number of levels of aggressive coarsening 0
  Number of paths for aggressive coarsening 1
  Maximum row sums 0.9
  Sweeps down 1
  Sweeps up   1
  Sweeps on coarse1
  Relax down  symmetric-SOR/Jacobi
  Relax upsymmetric-SOR/Jacobi
  Relax on coarse Gaussian-elimination
  Relax weight  (all)  1.
  Outer relax weight (all) 1.
  Using CF-relaxation
  Not using more complex smoothers.
  Measure typelocal
  Coarsen typeFalgout
  Interpolation type  classical
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijcusparse
rows=64, cols=64, bs=4
total: nonzeros=1024, allocated nonzeros=1024
total number of mallocs used during MatSetValues calls =0
  using I-node routines: found 16 nodes, limit used is 5
  1 SNES Function norm 6.80716e-05
KSP Object: 1 MPI processes
  type: fgmres
restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization
with no iterative refinement
happy breakdown tolerance 1e-30
  maximum iterations=1, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: hypre
HYPRE BoomerAMG preconditioning
  Cycle type V
  Maximum number of levels 25
  Maximum number of iterations PER hypre call 1
  Convergence tolerance PER hypre call 0.
  Threshold for strong coupling 0.25
  Interpolation truncation factor 0.
  Interpolation: max elements per row 0
  Number of levels of aggressive coarsening 0
  Number of paths for aggressive coarsening 1
  Maximum row sums 0.9
  Sweeps down 1
  Sweeps up   1
  Sweeps on coarse1
  Relax down  symmetric-SOR/Jacobi
  Relax upsymmetric-SOR/Jacobi
  Relax on coarse Gaussian-elimination
  Relax weight  (all)  1.
  Outer relax weight (all) 1.
  Using CF-relaxation
  Not using more complex smoothers.
  Measure typelocal
  Coarsen typeFalgout
  Interpolation type  classical
  linear system matrix = precond matrix:
  Mat Object: 1 MPI processes
type: seqaijcusparse
rows=64, cols=64, bs=4
total: nonzeros=1024, allocated nonzeros=1024
total number of mallocs used during MatSetValues calls =0
  using I-node routines: found 16 nodes, limit used is 5
  2 SNES Function norm 4.093e-11
Number of SNES iterations = 2


On Fri, Nov 2, 2018 at 2:10 PM Smith, Barry F.  wrote:

>
>
> > On Nov 2, 2018, at 1:03 PM, Mark Adams  wrote:
> >
> > FYI, I seem to have the new GPU machine at ORNL (summitdev) working with
> GPUs. That is good enough for now.
> > Thanks,
>
>Excellant!
>
> >
> > 14:00 master= ~/petsc/src/snes/examples/tutorials$ jsrun -n 1 ./ex19
> -dm_vec_type cuda -dm_mat_type aijcusparse -pc_type none -ksp_type fgmres
> -snes_monitor_short -snes_rtol 1.e-5 -ksp_view
> > lid velocity = 0.0625, prandtl # = 1., grashof # = 1.
> >   0 SNES Function norm 0.239155
> > KSP Object: 1 MPI processes
> >   type: fgmres
> > restart=30, using Classical (unmodified) Gram-Schmidt
> Orthogonalization with no iterative refinement
> > happy breakdown tolerance 1e-30
> >   maximum iterations=1, initial guess is zero
> >   tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
> >   right preconditioning
> >   using UNPRECONDITIONED norm type for convergence test
> > PC Object: 1 MPI processes
> >   type: none
> >   linear system matrix = precond matrix:
> >   Mat Object: 1 MPI processes
> > type: seqaijcusparse
> > rows=64, cols=64, bs=4
> > total: nonzeros=1024, allocated nonzeros=1024
> > total number of mallocs used during MatSetValues calls =0
> 

[petsc-dev] GPU web page out of date

2018-12-17 Thread Mark Adams via petsc-dev
The GPU web page looks like it is 8 years old ... this link is dead (but
ex47cu seems to be in the repo):


   - Example that uses CUDA directly in the user function evaluation
   


Thanks,
Mark


Re: [petsc-dev] FW: Re[2]: Implementing of a variable block size BILU preconditioner

2018-12-05 Thread Mark Adams via petsc-dev
If you zero a row out then put something on the diagonal.

And your matrix data file (it does not look like it has any sparsity
meta-data) spans about 18 orders of magnitude. When you diagonally scale, which
most solvers implicitly do, it looks like some of these numbers will just
get lost in roundoff and you will not get correct results if they are relevant.
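
A minimal sketch of the first point, assuming a matrix A and a hypothetical
list of rows to zero (MatZeroRows() lets you place a value on the diagonal of
each zeroed row so the system stays nonsingular):

  PetscErrorCode ierr;
  PetscInt       rows[] = {7, 11};  /* hypothetical row indices */
  /* zero the rows but put 1.0 on their diagonal entries */
  ierr = MatZeroRows(A, 2, rows, 1.0, NULL, NULL);CHKERRQ(ierr);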

On Tue, Dec 4, 2018 at 11:57 PM Ali Reza Khaz'ali via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Dear Jed,
>
> It is ILU(0) since I did not set the levels. Also, no shift was applied. I
> have attached a Jacobian matrix for the smallest case that I can simulate
> with BILU. I've noticed that it has zeros on its diagonal since I had to
> remove the phase equilibria equations to make the block sizes equal.
> Additionally, after a discussion with one of my students, I am now
> convinced that zeros may temporarily appear on the diagonal of the Jacobian
> of the original code (with phase equilibrium) during SNES iterations.
> I do not know if the attached Jacobian can be used for comparison
> purposes. Changing preconditioner or the linear solver will change the
> convergence properties of SNES, and hence, the simulator will try to adjust
> some of its other parameters (e.g., time step size) to get the most
> accurate and fastest results. In other words, we won't have the same
> Jacobian as attached if scalar ILU is used.
>
> Many thanks,
> Ali
>
> -Original Message-
> From: Jed Brown 
> Sent: Wednesday, December 05, 2018 12:00 AM
> To: Ali Reza Khaz'ali ; 'Smith, Barry F.' <
> bsm...@mcs.anl.gov>
> Cc: petsc-dev@mcs.anl.gov
> Subject: RE: Re[2]: [petsc-dev] Implementing of a variable block size BILU
> preconditioner
>
> Ali Reza Khaz'ali  writes:
>
> > Dear Jed,
> >
> > ILU with BAIJ works, and its performance in reducing the condition
> number is slightly better than PCVPBJACOBI. Thanks for your guidance.
>
> Is it ILU(0)?  Did you need to turn enable shifts?  Can you write out a
> small matrix that succeeds with BAIJ/ILU, but not with AIJ/ILU so we can
> compare?
>
> > Best wishes,
> > Ali
> >
> > -Original Message-
> > From: Jed Brown 
> > Sent: Tuesday, December 04, 2018 9:40 PM
> > To: Ali Reza Khaz'ali ; 'Smith, Barry F.'
> > 
> > Cc: petsc-dev@mcs.anl.gov
> > Subject: RE: Re[2]: [petsc-dev] Implementing of a variable block size
> > BILU preconditioner
> >
> > Ali Reza Khaz'ali  writes:
> >
> >> Dear Jed,
> >>
> >> Thanks for your kind answer. I thought Scalar BJACOBI does not need
> >> data from the other domains, but ILU does.
> >
> > There is no parallel ILU in PETSc.
> >
> > $ mpiexec -n 2 mpich-clang/tests/ksp/ksp/examples/tutorials/ex2
> > -pc_type ilu [0]PETSC ERROR: - Error Message
> > --
> > [0]PETSC ERROR: See
> http://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html for
> possible LU and Cholesky solvers [0]PETSC ERROR: Could not locate a solver
> package. Perhaps you must ./configure with --download- [0]PETSC
> ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
> trouble shooting.
> > [0]PETSC ERROR: Petsc Development GIT revision: v3.10.2-19-g217b8b62e2
> > GIT Date: 2018-10-17 10:34:59 +0200 [0]PETSC ERROR:
> > mpich-clang/tests/ksp/ksp/examples/tutorials/ex2 on a mpich-clang
> > named joule by jed Tue Dec  4 11:02:53 2018 [0]PETSC ERROR: Configure
> > options --download-chaco --download-p4est --download-sundials
> > --download-triangle --with-fc=0
> > --with-mpi-dir=/home/jed/usr/ccache/mpich-clang --with-visibility
> > --with-x --with-yaml PETSC_ARCH=mpich-clang [0]PETSC ERROR: #1
> > MatGetFactor() line 4485 in /home/jed/petsc/src/mat/interface/matrix.c
> > [0]PETSC ERROR: #2 PCSetUp_ILU() line 142 in
> > /home/jed/petsc/src/ksp/pc/impls/factor/ilu/ilu.c
> > [0]PETSC ERROR: #3 PCSetUp() line 932 in
> > /home/jed/petsc/src/ksp/pc/interface/precon.c
> > [0]PETSC ERROR: #4 KSPSetUp() line 391 in
> > /home/jed/petsc/src/ksp/ksp/interface/itfunc.c
> > [0]PETSC ERROR: #5 KSPSolve() line 723 in
> > /home/jed/petsc/src/ksp/ksp/interface/itfunc.c
> > [0]PETSC ERROR: #6 main() line 201 in
> > /home/jed/petsc/src/ksp/ksp/examples/tutorials/ex2.c
> > [0]PETSC ERROR: PETSc Option Table entries:
> > [0]PETSC ERROR: -malloc_test
> > [0]PETSC ERROR: -pc_type ilu
> > [0]PETSC ERROR: End of Error Message ---send
> > entire error message to petsc-ma...@mcs.anl.gov-- application
> > called MPI_Abort(MPI_COMM_WORLD, 92) - process 0
> >
> >> I have tested my code with scalar ILU. However, no KSP could converge.
> >
> > There are no guarantees.  See src/ksp/pc/examples/tutorials/ex1.c which
> tests with Kershaw's matrix, a 4x4 sparse SPD matrix where incomplete
> factorization yields an indefinite preconditioner.
> >
> >> Also, there are no zeros on the diagonal, at least in the current
> >> cases that I am simulating them. However, I will recheck it.
> >> Additionally, I am going to do a limited test with the available 

Re: [petsc-dev] [petsc-users] Bad memory scaling with PETSc 3.10

2019-03-27 Thread Mark Adams via petsc-dev
So are these the instructions that I should give him? This grad student is a
quick study but he has no computing background. So we don't care what we
use, we just want it to work (easily).

Thanks

Do not use "--download-fblaslapack=1". Set it to 0. Same for
"--download-mpich=1".

Now do:

> module load mkl

> export BLAS_LAPACK_LOAD=--with-blas-lapack-dir=${MKLROOT}

>  export PETSC_MPICH_HOME="${MPICH_HOME}"

And use

--with-cc=${MPICH_HOME}/mpicc --with-cxx=${MPICH_HOME}/mpicxx
--with-fc=${MPICH_HOME}/mpif90

instead of clang++
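
So the configure line would look something like this (untested, just the
pieces above put together; keep whatever --download-* options he already had):

  ./configure --with-cc=${MPICH_HOME}/mpicc --with-cxx=${MPICH_HOME}/mpicxx \
    --with-fc=${MPICH_HOME}/mpif90 --with-blas-lapack-dir=${MKLROOT} \
    --with-debugging=0 PETSC_ARCH=skx-cxx-O ...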

On Wed, Mar 27, 2019 at 9:30 AM Matthew Knepley  wrote:

> On Wed, Mar 27, 2019 at 8:55 AM Victor Eijkhout via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> On Mar 27, 2019, at 7:29 AM, Mark Adams  wrote:
>>
>> How should he configure for this? Remove "--download-fblaslapack=1" and
>> add 
>>
>>
>> 1. If using gcc
>>
>> module load mkl
>>
>> with either compiler:
>>
>> export BLAS_LAPACK_LOAD=--with-blas-lapack-dir=${MKLROOT}
>>
>> 2.  We define MPICH_HOME for you.
>>
>> With Intel MPI:
>>
>>   export PETSC_MPICH_HOME="${MPICH_HOME}/intel64"
>>   export mpi="--with-mpi-compilers=1 --with-mpi-include=${TACC_IMPI_INC}
>> --with-mpi-lib=${TACC_IMPI_LIB}/release_mt/libmpi.so”
>>
>> with mvapich:
>>
>>   export PETSC_MPICH_HOME="${MPICH_HOME}"
>>   export mpi="--with-mpi-compilers=1 --with-mpi-dir=${PETSC_MPICH_HOME}”
>>
>> (looks like a little redundancy in my script)
>>
>
> I think Satish now prefers
>
>   --with-cc=${MPICH_HOME}/mpicc --with-cxx=${MPICH_HOME}/mpicxx
> --with-fc=${MPICH_HOME}/mpif90
>
>   Thanks,
>
> Matt
>
>
>> Victor.
>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>


Re: [petsc-dev] [petsc-users] Bad memory scaling with PETSc 3.10

2019-03-21 Thread Mark Adams via petsc-dev
I'm probably screwing up some sort of history by jumping into dev, but this
is a dev comment ...

(1) -matptap_via hypre: This calls the hypre package to do the PtAP through
> an all-at-once triple product. In our experience, it is the most memory
> efficient, but could be slow.
>

FYI,

I visited LLNL in about 1997 and told them how I did RAP. Simple 4 nested
loops. They were very interested. Clearly they did it this way after I
talked to them. This approach came up here a while back (eg, we should
offer this as an option).

Anecdotally, I don't see a noticeable difference in performance on my 3D
elasticity problems between my old code (still used by the bone modeling
people) and ex56 ...

My kernel is an unrolled dense matrix triple product. I doubt Hypre did
this. It ran at about 2x+ the flop rate of the mat-vec at scale on the SP3
in 2004.

Mark


Re: [petsc-dev] MatNest and FieldSplit

2019-03-24 Thread Mark Adams via petsc-dev
I think he is saying that this line seems to have no effect (and the
comment is hence wrong):

KSPSetOperators(subksp[nsplits - 1], S, S);

// J2 = [[4, 0] ; [0, 0.1]]


J2 is a 2x2 but this block has been changed into two single-equation
fields. Is this KSPSetOperators supposed to copy this 1x1 S matrix into
the (1,1) block of "J2", or do some sort of correct mixing internally,
to get what he wants?


BTW, this line does not seem necessary to me so maybe I'm missing something.


KSPSetOperators(sub, J2, J2);
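
For reference, a rough sketch of how one might reach that innermost solver and
reset its operators after the outer PC is set up (ksp and S as in the example;
the other names here are made up; note PCFieldSplitGetSubKSP() returns an array
the caller must PetscFree()):

  KSP           *outer, *inner;
  PC             pc, pc2;
  PetscInt       nouter, ninner;
  PetscErrorCode ierr;

  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCFieldSplitGetSubKSP(pc, &nouter, &outer);CHKERRQ(ierr);    /* the two fields */
  ierr = KSPGetPC(outer[1], &pc2);CHKERRQ(ierr);                      /* second field's PC */
  ierr = PCFieldSplitGetSubKSP(pc2, &ninner, &inner);CHKERRQ(ierr);   /* its two subfields */
  ierr = KSPSetOperators(inner[ninner - 1], S, S);CHKERRQ(ierr);      /* last subfield gets S */
  ierr = PetscFree(outer);CHKERRQ(ierr);
  ierr = PetscFree(inner);CHKERRQ(ierr);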



On Sun, Mar 24, 2019 at 4:33 PM Matthew Knepley via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> On Sun, Mar 24, 2019 at 10:21 AM Pierre Jolivet <
> pierre.joli...@enseeiht.fr> wrote:
>
>> It’s a 4x4 matrix.
>> The first 2x2 diagonal matrix is a field.
>> The second 2x2 diagonal matrix is another field.
>> In the second field, the first diagonal coefficient is a subfield.
>> In the second field, the second diagonal coefficient is another subfield.
>> I’m changing the operators from the second subfield (last diagonal
>> coefficient of the matrix).
>> When I solve a system with the complete matrix (2 fields), I get a
>> different “partial solution" than when I solve the “partial system” on just
>> the second field (with the two subfields in which I modified the operators
>> from the second one).
>>
>
> I may understand what you are doing.
> Fieldsplit calls MatGetSubMatrix() which can copy values, depending on the
> implementation,
> so changing values in the original matrix may or may not change it in the
> PC.
>
>Matt
>
> I don’t know if this makes more or less sense… sorry :\
>> Thanks,
>> Pierre
>>
>> On 24 Mar 2019, at 8:42 PM, Matthew Knepley  wrote:
>>
>> On Sat, Mar 23, 2019 at 9:12 PM Pierre Jolivet via petsc-dev <
>> petsc-dev@mcs.anl.gov> wrote:
>>
>>> I’m trying to figure out why both solutions are not consistent in the
>>> following example.
>>> Is what I’m doing complete nonsense?
>>>
>>
>> The code does not make clear what you are asking. I can see its a nested
>> fieldsplit.
>>
>>   Thanks,
>>
>>  Matt
>>
>>
>>> Thanks in advance for your help,
>>> Pierre
>>>
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> 
>>
>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>


Re: [petsc-dev] SNESSolve and changing dimensions

2019-04-03 Thread Mark Adams via petsc-dev
I agree that you want to adapt around a converged solution. I have code
that runs time step(s), adapts, transfers solutions and state, and creates a
new TS & SNES, if you want to clone that. It works with PForest, but
Toby and Matt are working on these abstractions so it might not be the
most up to date. If there are more up-to-date examples I would like to know
about them also.
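
For the plain outer-loop version (Jed's suggestion below), the skeleton is
roughly this; AdaptMesh() and TransferSolution() are stand-ins for whatever the
application provides:

  for (it = 0; it < maxAdapt; it++) {
    ierr = SNESSolve(snes, NULL, u);CHKERRQ(ierr);
    ierr = AdaptMesh(dm, u, &dmNew);CHKERRQ(ierr);              /* application-provided */
    ierr = TransferSolution(dm, dmNew, u, &uNew);CHKERRQ(ierr); /* application-provided */
    ierr = SNESDestroy(&snes);CHKERRQ(ierr);
    ierr = VecDestroy(&u);CHKERRQ(ierr);
    ierr = DMDestroy(&dm);CHKERRQ(ierr);
    dm = dmNew; u = uNew;
    ierr = SNESCreate(PETSC_COMM_WORLD, &snes);CHKERRQ(ierr);
    ierr = SNESSetDM(snes, dm);CHKERRQ(ierr);
    /* ... re-register residual/Jacobian callbacks, SNESSetFromOptions(), etc. */
  }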

On Wed, Apr 3, 2019 at 3:47 PM Jed Brown via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Pierre Jolivet via petsc-dev  writes:
>
> > I am just adapting the mesh depending on the solution from the previous
> SNESSolve.
> > At first, I wanted to avoid writing an outer loop around the SNESSolve,
> so I thought, let’s put the adaptation in the SNESSetJacobian.
> > It would have been preferable because it would have required fewer lines
> of code (as I had imagined this to work), that’s the main reason. I
> understand this is too much to ask of PETSc to continue working without any
> further information from the application.
>
> SNES wants to be able to connect norms and differences between vectors
> at different iterations (e.g., rtol and stol).  I would just loop around
> SNESSolve for what you want.  Note that it may be fragile to adapt in
> early Newton iterations if globalization is a challenge for your
> problem.
>


Re: [petsc-dev] [petsc-users] Bad memory scaling with PETSc 3.10

2019-03-27 Thread Mark Adams via petsc-dev
On Wed, Mar 27, 2019 at 12:06 AM Victor Eijkhout 
wrote:

>
>
> On Mar 26, 2019, at 6:25 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
> /home1/04906/bonnheim/olympus-keaveny/Olympus/olympus.petsc-3.9.3.skx-cxx-O
> on a skx-cxx-O named c478-062.stampede2.tacc.utexas.edu with 4800
> processors, by bonnheim Fri Mar 15 04:48:27 2019
>
>
> I see you’re still using a petsc that uses the reference blas/lapack and
> ethernet instead of Intel OPA:
>
> Configure Options: --configModules=PETSc.Configure
> --optionsModule=config.compilerOptions --with-cc++=clang++ COPTFLAGS="-g
> -mavx2" CXXOPTFLAGS="-g -mavx2" FOPTFLAGS="-g -mavx2" --download-mpich=1
> --download-hypre=1 --download-metis=1 --download-parmetis=1
> --download-c2html=1 --download-ctetgen --download-p4est=1
> --download-superlu_dist --download-superlu --download-triangle=1
> --download-hdf5=1 --download-fblaslapack=1 --download-zlib --with-x=0
> --with-debugging=0 PETSC_ARCH=skx-cxx-O --download-chaco
> --with-viewfromoptions=1
> Working directory: /home1/04906/bonnheim/petsc-3.9.3
>
> I’ve alerted you guys about this months ago.
>

Yea, let me try to get him to do this. How should he configure for this?
Remove "--download-fblaslapack=1" and add ...

Thanks,
Mark


>
> Victor.
>
>


Re: [petsc-dev] [petsc-users] Bad memory scaling with PETSc 3.10

2019-03-26 Thread Mark Adams via petsc-dev
>
>
> The way to reduce the memory is to have the all-at-once algorithm (Mark is
> an expert on this). But I am not sure how efficiently it could be
> implemented.
>

I have some data  from a 3D elasticity problem with 1.4B equations on:

/home1/04906/bonnheim/olympus-keaveny/Olympus/olympus.petsc-3.9.3.skx-cxx-O
on a skx-cxx-O named c478-062.stampede2.tacc.utexas.edu with 4800
processors, by bonnheim Fri Mar 15 04:48:27 2019
Using Petsc Release Version 3.9.3, unknown

I assume this is on 100 Skylake nodes, but not sure.

This used the all-at-once algorithm in my old solver Prometheus. There are six
levels and thus 5 RAPs. The time for these RAPs is about that of 150 Mat-vecs on
the fine grid.

The total flop rate for these 5 RAPs was about 4x the flop rate for these
Mat-vecs on the fine grid. This is to be expected, as the all-at-once algorithm
is simple and not flop optimal and has high arithmetic intensity.

There is a fair amount of load imbalance in this RAP, but the three
coarsest grids have idle processes. The max/min was 2.6 and about 25% of
the time was in the communication layer (a hand-written communication layer
that I wrote in grad school).

The fine grid Mat-Vecs had a max/min of 3.1.

Anyway, I know this is not very precise data, but maybe it would help to
(de)motivate its implementation in PETSc.

Mark


Re: [petsc-dev] [petsc-users] Bad memory scaling with PETSc 3.10

2019-03-21 Thread Mark Adams via petsc-dev
>
>
> Could you explain this more by adding some small examples?
>
>
Since you are considering implementing all-at-once (four nested loops,
right?) I'll give you my old code.

This code is hardwired for two AMG methods and for a geometric AMG, where the
blocks of the R (and hence P) matrices are scaled identities and I only
store the scale. So you can ignore those branches. This code also does
equivalent-real-form complex, so there are more branches to ignore.
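
Stripped of all that, the generic all-at-once idea is just this (a sketch, not
the attached code; ai/aj/aa and pi/pj/pa are hypothetical CSR arrays for A and
P, n is the number of local rows of A, and C is assembled with ADD_VALUES):

  PetscErrorCode ierr;
  PetscInt       i, j, k, s, t, ii, jj;
  PetscScalar    v;
  for (i = 0; i < n; i++) {                  /* loop 1: rows of A */
    for (k = ai[i]; k < ai[i+1]; k++) {      /* loop 2: nonzeros A(i,j) */
      j = aj[k];
      for (s = pi[i]; s < pi[i+1]; s++) {    /* loop 3: nonzeros P(i,ii) */
        for (t = pi[j]; t < pi[j+1]; t++) {  /* loop 4: nonzeros P(j,jj) */
          ii = pj[s]; jj = pj[t];
          v  = pa[s]*aa[k]*pa[t];            /* P(i,ii)*A(i,j)*P(j,jj) */
          ierr = MatSetValues(C, 1, &ii, 1, &jj, &v, ADD_VALUES);CHKERRQ(ierr);
        }
      }
    }
  }

No intermediate A*P or R*A product is ever stored, which is where the memory
savings come from.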


prom_mat_prod.C
Description: Binary data


Re: [petsc-dev] HYPRE_LinSysCore.h

2019-01-29 Thread Mark Adams via petsc-dev
On Tue, Jan 29, 2019 at 5:20 PM Victor Eijkhout via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

>
>
> On Jan 29, 2019, at 3:58 PM, Balay, Satish  wrote:
>
> -args.append('--without-fei')
>
>
> The late-1990s Finite Element Interface?
>

I would guess FEI is still actively used and interfaces to 1960's FE codes
at the labs.


>
> I’ll enable it and see if anyone complains about it breaking whatever.
>
> Victor.
>


[petsc-dev] is DMSetDS not in master?

2019-02-01 Thread Mark Adams via petsc-dev
10:37 master= ~/Codes/petsc$ git grep DMSetDS
src/dm/interface/dm.c:.seealso: DMGetDS(), DMSetDS()
10:37 master= ~/Codes/petsc$


Re: [petsc-dev] is DMSetDS not in master?

2019-02-01 Thread Mark Adams via petsc-dev
OK, it's not in the Changes doc and there is only one comment referring to it.

On Fri, Feb 1, 2019 at 10:50 AM Matthew Knepley  wrote:

> I removed it, since no one should use it anymore. You use
> DMSetField()+DMCreateDS() instead.
>
>   THanks,
>
> Matt
>
> On Fri, Feb 1, 2019 at 10:38 AM Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> 10:37 master= ~/Codes/petsc$ git grep DMSetDS
>> src/dm/interface/dm.c:.seealso: DMGetDS(), DMSetDS()
>> 10:37 master= ~/Codes/petsc$
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-12 Thread Mark Adams via petsc-dev
On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F.  wrote:

>
>
> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > Interesting, nice work.
> >
> > It would be interesting to get the flop counters working.
> >
> > This looks like GMG, I assume 3D.
> >
> > The degree of parallelism is not very realistic. You should probably run
> a 10x smaller problem, at least, or use 10x more processes.
>
>Why do you say that? He's got his machine with a certain amount of
> physical memory per node, are you saying he should ignore/not use 90% of
> that physical memory for his simulation?


In my experience 1.5M equations/process is about 50x more than applications
run with, but this is just anecdotal. Some apps are dominated by the linear
solver in terms of memory but some apps use a lot of memory in the physics
parts of the code.

The one app that I can think of where the memory usage is dominated by the
solver does like 10 (pseudo) time steps with pretty hard nonlinear solves,
so in the end they are not bound by turnaround time. But they are kind of an
odd (academic) application and not very representative of what I see in the
broader comp sci community. And these guys do have a scalable code, so
instead of waiting a week on the queue to run a 10-hour job that uses 10%
of the machine, they wait a day to run a 2-hour job that takes 50% of the
machine, because centers' scheduling policies work that way.

He should buy a machine 10x bigger just because it means having fewer
> degrees of freedom per node (who's footing the bill for this purchase?). At
> INL they run simulations for a purpose, not just for scalability studies,
> and there are no dang GPUs or barely used over-sized monstrosities sitting
> around to brag about twice a year at SC.
>

I guess they are the nuke guys. I've never worked with them or seen this
kind of complexity analysis in their talks, but OK, if they fill up memory
with the solver then this is representative of a significant (DOE) app.


>
>Barry
>
>
>
> > I guess it does not matter. This is basically like a one-node run because
> the subdomains are so large.
> >
> > And are you sure the numerics are the same with and without hypre? Hypre
> is 15x slower. Any ideas what is going on?
> >
> > It might be interesting to scale this test down to a node to see if this
> is from communication.
> >
> > Again, nice work,
> > Mark
> >
> >
> > On Thu, Apr 11, 2019 at 7:08 PM Fande Kong  wrote:
> > Hi Developers,
> >
> > I just want to share a good news.  It is known PETSc-ptap-scalable is
> taking too much memory for some applications because it needs to build
> intermediate data structures.  According to Mark's suggestions, I
> implemented the  all-at-once algorithm that does not cache any intermediate
> data.
> >
> > I did some comparison,  the new implementation is actually scalable in
> terms of the memory usage and the compute time even though it is still
> slower than "ptap-scalable".   There are some memory profiling results (see
> the attachments). The new all-at-once implementation uses a similar amount
> of memory as hypre, but it is way faster than hypre.
> >
> > For example, for a problem with 14,893,346,880 unknowns using 10,000
> processor cores,  There are timing results:
> >
> > Hypre algorithm:
> >
> > MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> > MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> > MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> >
> > PETSc scalable PtAP:
> >
> > MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> > MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> > MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
> >
> > New implementation of the all-at-once algorithm:
> >
> > MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> > MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> > MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
> >
> >
> > You can see here the all-at-once is a bit slower than ptap-scalable, but
> it uses much less memory.
> >
> >
> > Fande
> >
>
>


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
>
>
> I guess you are interested in the performance of the new algorithms on
>  small problems. I will try to test a petsc example such as
> mat/examples/tests/ex96.c.
>

It's not a big deal. And the fact that they are similar on one node tells
us the kernels are similar.


>
>
>>
>> And are you sure the numerics are the same with and without hypre? Hypre
>> is 15x slower. Any ideas what is going on?
>>
>
> Hypre performs pretty well when the number of processor cores is small (a
> couple of hundred).  I guess the issue is related to how they handle the
> communications.
>
>
>>
>> It might be interesting to scale this test down to a node to see if this
>> is from communication.
>>
>
I wonder if their symbolic setup is getting called every time. It looks like
you do 50 solves, and that should be enough to amortize a one-time
setup cost.

Does PETSc do any clever scalability tricks? You just pack and send
point-to-point messages, I would think, but maybe Hypre is doing something bad.
I have seen Hypre scale out to large machines, but on synthetic problems.

So this is a realistic problem. Can you run with -info and grep on GAMG and
send me the (~20 lines of) output? You will be able to see info about each
level, like the number of equations and average nnz/row.
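(Concretely, I just mean something like: mpiexec -n <np> ./your_app <your usual
args> -info | grep GAMG, with your launcher and executable names of course.)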


>
> Hypre performs similarly to PETSc on a single compute node.
>
>
> Fande,
>
>
>>
>> Again, nice work,
>> Mark
>>
>>
>> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong  wrote:
>>
>>> Hi Developers,
>>>
>>> I just want to share a good news.  It is known PETSc-ptap-scalable is
>>> taking too much memory for some applications because it needs to build
>>> intermediate data structures.  According to Mark's suggestions, I
>>> implemented the  all-at-once algorithm that does not cache any intermediate
>>> data.
>>>
>>> I did some comparison,  the new implementation is actually scalable in
>>> terms of the memory usage and the compute time even though it is still
>>> slower than "ptap-scalable".   There are some memory profiling results (see
>>> the attachments). The new all-at-once implementation uses a similar amount
>>> of memory as hypre, but it is way faster than hypre.
>>>
>>> For example, for a problem with 14,893,346,880 unknowns using 10,000
>>> processor cores,  There are timing results:
>>>
>>> Hypre algorithm:
>>>
>>> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
>>> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>>> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>>> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
>>> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>>>
>>> PETSc scalable PtAP:
>>>
>>> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05
>>> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05
>>> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
>>> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05
>>> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>>
>>> New implementation of the all-at-once algorithm:
>>>
>>> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05
>>> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
>>> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05
>>> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
>>> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05
>>> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
>>>
>>>
>>> You can see here the all-at-once is a bit slower than ptap-scalable, but
>>> it uses much less memory.
>>>
>>>
>>> Fande
>>>
>>>
>>


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
On Mon, Apr 15, 2019 at 2:56 PM Fande Kong  wrote:

>
>
> On Mon, Apr 15, 2019 at 6:49 AM Matthew Knepley  wrote:
>
>> On Mon, Apr 15, 2019 at 12:41 AM Fande Kong via petsc-dev <
>> petsc-dev@mcs.anl.gov> wrote:
>>
>>> On Fri, Apr 12, 2019 at 7:27 AM Mark Adams  wrote:
>>>
>>>>
>>>>
>>>> On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. 
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
>>>>> petsc-dev@mcs.anl.gov> wrote:
>>>>> >
>>>>> > Interesting, nice work.
>>>>> >
>>>>> > It would be interesting to get the flop counters working.
>>>>> >
>>>>> > This looks like GMG, I assume 3D.
>>>>> >
>>>>> > The degree of parallelism is not very realistic. You should probably
>>>>> run a 10x smaller problem, at least, or use 10x more processes.
>>>>>
>>>>>Why do you say that? He's got his machine with a certain amount of
>>>>> physical memory per node, are you saying he should ignore/not use 90% of
>>>>> that physical memory for his simulation?
>>>>
>>>>
>>>> In my experience 1.5M equations/process is about 50x more than
>>>> applications run with, but this is just anecdotal. Some apps are dominated by
>>>> the linear solver in terms of memory but some apps use a lot of memory in
>>>> the physics parts of the code.
>>>>
>>>
>>> The test case is solving the multigroup neutron transport equations
>>> where each mesh vertex could be associated with a hundred or a thousand
>>> variables. The mesh is actually small so that it can be handled efficiently
>>> in the physics part of the code. 90% of the memory is consumed by the
>>> solver (SNES, KSP, PC). This is the reason I was trying to implement a
>>> memory friendly PtAP.
>>>
>>>
>>>> The one app that I can think of where the memory usage is dominated by
>>>> the solver does like 10 (pseudo) time steps with pretty hard nonlinear
>>>> solves, so in the end they are not bound by turnaround time. But they are
>>>> kind of an odd (academic) application and not very representative of what I
>>>> see in the broader comp sci community. And these guys do have a scalable
>>>> code so instead of waiting a week on the queue to run a 10 hour job that
>>>> uses 10% of the machine, they wait a day to run a 2 hour job that takes 50%
>>>> of the machine because centers' scheduling policies work that way.
>>>>
>>>
>>> Our code is scalable but we do not have a huge machine unfortunately.
>>>
>>>
>>>>
>>>>> He should buy a machine 10x bigger just because it means having fewer
>>>>> degrees of freedom per node (who's footing the bill for this purchase?). 
>>>>> At
>>>>> INL they run simulations for a purpose, not just for scalability studies
>>>>> and there are no dang GPUs or barely used over-sized monstrosities sitting
>>>>> around to brag about twice a year at SC.
>>>>>
>>>>
>>>> I guess they are the nuke guys. I've never worked with them or seen this
>>>> kind of complexity analysis in their talks, but OK, if they fill up memory
>>>> with the solver then this is representative of a significant (DOE) app.
>>>>
>>>
>>> You do not see the complexity analysis  in the talks because most of the
>>> people at INL live in a different community.  I will convince more people to
>>> give talks in our community in the future.
>>>
>>> We focus on the nuclear energy simulations that involve multiphysics
>>> (neutron transport, mechanics contact, computational materials,
>>> compressible/incompressible flows, two-phase flows, etc.). We are
>>> developing a flexible platform (open source) that allows different physics
>>> guys couple their code together efficiently.
>>> https://mooseframework.inl.gov/old
>>>
>>
>> Fande, this is very interesting. Can you tell me:
>>
>>   1) A rough estimate of dofs/vertex (or cell or face) depending on where
>> you put unknowns
>>
>
> The big run (Neutron transport equations) posted earlier has 576 variables
> on each mesh vertex. Physics guys think at the current stage 100-1000
> variables (the number of ener

Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
>
>> I wonder if the their symbolic setup is getting called every time. You do
>> 50 solves it looks like and that should be enough to amortize a one time
>> setup cost.
>>
>
> Hypre does not have a concept called symbolic. They do everything from
> scratch, and won't reuse any data.
>

Really, Hypre does not cache the maps and non-zero structure, etc., that are
generated in RAP?

I suspect that that is contributing to Hypre's poor performance, but it is
not the whole story as you are only doing 5 solves.


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
>
> So you could reorder your equations and see a block diagonal matrix with
>> 576 blocks. right?
>>
>
> I am not sure I understand the question correctly. For each mesh vertex, we
> have a 576x576 diagonal matrix.   The unknowns are ordered in this way:
> v0, v1, ..., v575 for vertex 1, and another 576 variables for mesh vertex 2,
> and so on.
>

My question is, mathematically, or algebraically, is this preconditioner
equivalent to 576 Laplacian PCs? I see that it is not, because you coarsen
the number of variables per node. So your interpolation operators couple
your equations. I think that, other than the coupling from eigen estimates
and Krylov methods, and the coupling from your variable coarsening, you
have independent scalar Laplacian PCs.

10 levels is a lot. I am guessing you do like 5 levels of variable
coarsening and 5 levels of (normal) vertex coarsening with some sort of AMG
method.

This is a very different regime than the problems that I am used to.

And it would still be interesting to see the flop counters to get a sense
of the underlying performance differences between the normal and the
all-at-once PtAP.


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F.  wrote:

>
>   ierr = VecGetLocalSize(xx,);CHKERRQ(ierr);
>   if (nt != A->rmap->n)
> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
> (%D) and xx (%D)",A->rmap->n,nt);
>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
>
> So the xx on the GPU appears ok?


The norm is correct and ...


> The a->B appears ok?


yes


> But on process 1 the result a->lvec is wrong?
>

yes


> How do you look at the a->lvec? Do you copy it to the CPU and print it?
>

I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
should copy it. Maybe I should make a CUDA version of these methods?
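
In the meantime, a quick way to eyeball a (possibly VECCUDA) vector on the host,
assuming VecGetArrayRead() pulls the data back from the device as I expect (v
here stands for the vector in question, e.g. a->lvec):

  const PetscScalar *va;
  PetscInt           i, nloc;
  PetscErrorCode     ierr;
  ierr = VecGetLocalSize(v, &nloc);CHKERRQ(ierr);
  ierr = VecGetArrayRead(v, &va);CHKERRQ(ierr);   /* device-to-host sync for VECCUDA */
  for (i = 0; i < nloc; i++) {
    ierr = PetscPrintf(PETSC_COMM_SELF, "%D %g\n", i, (double)PetscRealPart(va[i]));CHKERRQ(ierr);
  }
  ierr = VecRestoreArrayRead(v, &va);CHKERRQ(ierr);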


>
>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>   ierr =
> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr =
> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
>
> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?


This is where I have been digging around and printing stuff.


>
> Are you sure the problem isn't related to the "stream business"?
>

I don't know what that is but I have played around with adding
cudaDeviceSynchronize


>
> /* This multiplication sequence is different sequence
>  than the CPU version. In particular, the diagonal block
>  multiplication kernel is launched in one stream. Then,
>  in a separate stream, the data transfers from DeviceToHost
>  (with MPI messaging in between), then HostToDevice are
>  launched. Once the data transfer stream is synchronized,
>  to ensure messaging is complete, the MatMultAdd kernel
>  is launched in the original (MatMult) stream to protect
>  against race conditions.
>
>  This sequence should only be called for GPU computation. */
>
> Note this comment isn't right and appears to be cut and paste from
> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
>

Yes, I noticed this. Same as MatMult and not correct here.


>
> Anyway to "turn off the stream business" and see if the result is then
> correct?


How do you do that? I'm looking at docs on streams but not sure how it's
used here.


> Perhaps the stream business was done correctly for MatMult() but was never
> right for MatMultTranspose()?
>
> Barry
>
> BTW: Unrelated comment, the code
>
>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>   ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);
>
> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
> them all yourself so setting them to zero before calling
> VecCUDAGetArrayWrite() does nothing except waste time.
>
>
OK, I'll get rid of it.


>
> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > I am stumped with this GPU bug(s). Maybe someone has an idea.
> >
> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck
> detected, but I still have differences between the GPU and CPU transpose
> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh
> with two processors. It works on one processor or with cg/none. So it is
> the transpose mat-vec.
> >
> > I see that the result of the off-diagonal  (a->lvec) is different only
> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
> and vec and printed out matlab vectors. Below is the CPU output and then
> the GPU with a view of the scatter object, which is identical as you can
> see.
> >
> > The matlab B matrix and xx vector are identical. Maybe the GPU copy is
> wrong ...
> >
> > The only/first difference between CPU and GPU is a->lvec (the off
> diagonal contribution)on processor 1. (you can see the norms are
> different). Here is the diff on the process 1 a->lvec vector (all values
> are off).
> >
> > Any thoughts would be appreciated,
> > Mark
> >
> > 15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
> > 2,12c2,12
> > < %  type: seqcuda
> > < Vec_0x53738630_0 = [
> > < 9.5702137431412879e+00
> > < 2.1970298791152253e+01
> > < 4.5422290209190646e+00
> > < 2.0185031807270226e+00
> > < 4.2627312508573375e+01
> > < 1.0889191983882025e+01
> > < 1.6038202417695462e+01
> > < 2.7155672033607665e+01
> > < 6.2540357853223556e+00
> >

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
Yea, I agree. Once this is working, I'll go back and split MatMultAdd, etc.

On Wed, Jul 10, 2019 at 11:16 AM Smith, Barry F.  wrote:

>
>In the long run I would like to see smaller specialized chunks of code
> (with a bit of duplication between them) instead of highly overloaded
> routines like MatMultAdd_AIJCUSPARSE. Better 3 routines, for multiple
> alone, for multiple add alone and for multiple add with sparse format.
> Trying to get all the cases right (performance and correctness for the
> everything at once is unnecessary and risky). Having possible zero size
> objects  (and hence null pointers) doesn't help the complex logic
>
>
>Barry
>
>
> > On Jul 10, 2019, at 10:06 AM, Mark Adams  wrote:
> >
> > Thanks, you made several changes here, including switches with the
> workvector size. I guess I should import this logic to the transpose
> method(s), except for the yy==NULL branches ...
> >
> > MatMult_ calls MatMultAdd with yy=0, but the transpose versions have
> their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.
> >
> > Thanks again,
> > Mark
> >
> > On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <
> stefano.zamp...@gmail.com> wrote:
> > Mark,
> >
> > if the difference is on lvec, I suspect the bug has to do with
> compressed row storage. I have fixed a similar bug in MatMult.
> > you want to check cusparsestruct->workVector->size() against A->cmap->n.
> >
> > Stefano
> >
> > Il giorno mer 10 lug 2019 alle ore 15:54 Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> ha scritto:
> >
> >
> > On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. 
> wrote:
> >
> >   ierr = VecGetLocalSize(xx,);CHKERRQ(ierr);
> >   if (nt != A->rmap->n)
> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
> (%D) and xx (%D)",A->rmap->n,nt);
> >   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
> >   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
> >
> > So the xx on the GPU appears ok?
> >
> > The norm is correct and ...
> >
> > The a->B appears ok?
> >
> > yes
> >
> > But on process 1 the result a->lvec is wrong?
> >
> > yes
> >
> >
> > How do you look at the a->lvec? Do you copy it to the CPU and print it?
> >
> > I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
> should copy it. Maybe I should make a CUDA version of these methods?
> >
> >
> >   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
> >   ierr =
> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr =
> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
> >   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
> >
> > Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
> >
> > This is where I have been digging around and printing stuff.
> >
> >
> > Are you sure the problem isn't related to the "stream business"?
> >
> > I don't know what that is but I have played around with adding
> cudaDeviceSynchronize
> >
> >
> > /* This multiplication sequence is different sequence
> >  than the CPU version. In particular, the diagonal block
> >  multiplication kernel is launched in one stream. Then,
> >  in a separate stream, the data transfers from DeviceToHost
> >  (with MPI messaging in between), then HostToDevice are
> >  launched. Once the data transfer stream is synchronized,
> >  to ensure messaging is complete, the MatMultAdd kernel
> >  is launched in the original (MatMult) stream to protect
> >  against race conditions.
> >
> >  This sequence should only be called for GPU computation. */
> >
> > Note this comment isn't right and appears to be cut and paste from
> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
> >
> > Yes, I noticed this. Same as MatMult and not correct here.
> >
> >
> > Anyway to "turn off the stream business" and see if the result is then
> correct?
> >
> > How do you do that? I'm looking at docs on streams but not sure how it's
> used here.
> >
> > Perhaps the stream business was done correctly for MatMult() but was
> never right for MatMultTranspose()?
> >
> > Barry
> >
> > BTW: Unrelated comment, the code
> >
> >   ierr = VecSet(yy,0);CHKERRQ(ierr);
> >  

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
Thanks, you made several changes here, including switches with the
workvector size. I guess I should import this logic to the transpose
method(s), except for the yy==NULL branches ...

MatMult_ calls MatMultAdd with yy=0, but the transpose versions have their
own code. MatMultTranspose_SeqAIJCUSPARSE is very simple.

Thanks again,
Mark

On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini 
wrote:

> Mark,
>
> if the difference is on lvec, I suspect the bug has to do with compressed
> row storage. I have fixed a similar bug in MatMult.
> you want to check cusparsestruct->workVector->size() against A->cmap->n.
>
> Stefano
>
> Il giorno mer 10 lug 2019 alle ore 15:54 Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> ha scritto:
>
>>
>>
>> On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. 
>> wrote:
>>
>>>
>>>   ierr = VecGetLocalSize(xx,);CHKERRQ(ierr);
>>>   if (nt != A->rmap->n)
>>> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A
>>> (%D) and xx (%D)",A->rmap->n,nt);
>>>   ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr);
>>>   ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr);
>>>
>>> So the xx on the GPU appears ok?
>>
>>
>> The norm is correct and ...
>>
>>
>>> The a->B appears ok?
>>
>>
>> yes
>>
>>
>>> But on process 1 the result a->lvec is wrong?
>>>
>>
>> yes
>>
>>
>>> How do you look at the a->lvec? Do you copy it to the CPU and print it?
>>>
>>
>> I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I
>> should copy it. Maybe I should make a CUDA version of these methods?
>>
>>
>>>
>>>   ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr);
>>>   ierr =
>>> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>>   ierr =
>>> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr);
>>>   ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr);
>>>
>>> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help?
>>
>>
>> This is where I have been digging around and printing stuff.
>>
>>
>>>
>>> Are you sure the problem isn't related to the "stream business"?
>>>
>>
>> I don't know what that is but I have played around with adding
>> cudaDeviceSynchronize
>>
>>
>>>
>>> /* This multiplication sequence is different sequence
>>>  than the CPU version. In particular, the diagonal block
>>>  multiplication kernel is launched in one stream. Then,
>>>  in a separate stream, the data transfers from DeviceToHost
>>>  (with MPI messaging in between), then HostToDevice are
>>>  launched. Once the data transfer stream is synchronized,
>>>  to ensure messaging is complete, the MatMultAdd kernel
>>>  is launched in the original (MatMult) stream to protect
>>>  against race conditions.
>>>
>>>  This sequence should only be called for GPU computation. */
>>>
>>> Note this comment isn't right and appears to be cut and paste from
>>> somewhere else, since there is no MatMult() nor MatMultAdd kernel here?
>>>
>>
>> Yes, I noticed this. Same as MatMult and not correct here.
>>
>>
>>>
>>> Anyway to "turn off the stream business" and see if the result is then
>>> correct?
>>
>>
>> How do you do that? I'm looking at docs on streams but not sure how it's
>> used here.
>>
>>
>>> Perhaps the stream business was done correctly for MatMult() but was
>>> never right for MatMultTranspose()?
>>>
>>> Barry
>>>
>>> BTW: Unrelated comment, the code
>>>
>>>   ierr = VecSet(yy,0);CHKERRQ(ierr);
>>>   ierr = VecCUDAGetArrayWrite(yy,);CHKERRQ(ierr);
>>>
>>> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here.
>>> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set
>>> them all yourself so setting them to zero before calling
>>> VecCUDAGetArrayWrite() does nothing except waste time.
>>>
>>>
>> OK, I'll get rid of it.
>>
>>
>>>
>>> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev <
>>> petsc-dev@mcs.anl.gov> wrote:
>>> >
>>> > I am s

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-10 Thread Mark Adams via petsc-dev
>
>
> 3) Is comparison between pointers appropriate? For example if (dptr !=
> zarray) { is scary if some arrays are zero length how do we know what the
> pointer value will be?
>
>
Yes, you need to consider these cases, which is kind of error prone.

Also, I think merging the transpose and non-transpose versions is a good idea
because the way the code is set up makes it easy. You just grab a different
cached object and keep your rmaps and cmaps straight, I think.


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-11 Thread Mark Adams via petsc-dev
Interesting, nice work.

It would be interesting to get the flop counters working.

This looks like GMG, I assume 3D.

The degree of parallelism is not very realistic. You should probably run a
10x smaller problem, at least, or use 10x more processes. I guess it does
not matter. This is basically like a one-node run because the subdomains are
so large.

And are you sure the numerics are the same with and without hypre? Hypre is
15x slower. Any ideas what is going on?

It might be interesting to scale this test down to a node to see if this is
from communication.

Again, nice work,
Mark


On Thu, Apr 11, 2019 at 7:08 PM Fande Kong  wrote:

> Hi Developers,
>
> I just want to share a good news.  It is known PETSc-ptap-scalable is
> taking too much memory for some applications because it needs to build
> intermediate data structures.  According to Mark's suggestions, I
> implemented the  all-at-once algorithm that does not cache any intermediate
> data.
>
> I did some comparison,  the new implementation is actually scalable in
> terms of the memory usage and the compute time even though it is still
> slower than "ptap-scalable".   There are some memory profiling results (see
> the attachments). The new all-at-once implementation uses a similar amount
> of memory as hypre, but it is way faster than hypre.
>
> For example, for a problem with 14,893,346,880 unknowns using 10,000
> processor cores,  There are timing results:
>
> Hypre algorithm:
>
> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>
> PETSc scalable PtAP:
>
> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>
> New implementation of the all-at-once algorithm:
>
> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
>
>
> You can see here the all-at-once is a bit slower than ptap-scalable, but
> it uses much less memory.
>
>
> Fande
>
>


[petsc-dev] running test

2019-08-13 Thread Mark Adams via petsc-dev
I want to run a test exactly as the harness will execute it. Can someone
please tell me how to run the "cuda" test for SNES ex56 exactly?

This is the test in ex56.c


  test:
suffix: cuda
nsize: 2
requires: cuda
args: -cel 

Thanks,
Mark


Re: [petsc-dev] Is master broken?

2019-08-12 Thread Mark Adams via petsc-dev
On Mon, Aug 12, 2019 at 9:49 AM Karl Rupp  wrote:

> Hi Mark,
>
> most of the CUDA-related fixes from your PR are now in master. Thank you!
>
> The pinning of GPU-matrices to CPUs is not in master because it had
> several issues:
>
>
> https://bitbucket.org/petsc/petsc/pull-requests/1954/cuda-fixes-to-pinning-onto-cpu/diff
>
>
These links are dead.



> The ViennaCL-related changes in mark/gamg-fix-viennacl-rebased can be
> safely discarded as the new GPU wrapper will come in place over the next
> days. ex56 has not been pulled over as it's not running properly on GPUs
> yet (the pinning in your branch effectively turned GPU matrices into
> normal PETSc matrices, effectively running (almost) everything on the
> CPU again)
>
> So at this point I recommend to start a new branch off master and
> manually transfer over any bits from the pinning that you want to keep.
>

FYI, Satish worked on cleaning this branch up a week or two ago.


>
> Best regards,
> Karli
>
>
> On 8/3/19 8:47 PM, Mark Adams wrote:
> > Karl,
> > Did you want me to do anything at this point? (on vacation this week) I
> > will verify that master is all fixed if you get all my stuff integrated
> > when I get back to work in a week.
> > Thanks,
> > Mark
> >
> > On Sat, Aug 3, 2019 at 10:50 AM Karl Rupp  > <mailto:r...@iue.tuwien.ac.at>> wrote:
> >
> > If you ignore the initial ViennaCL-related commits and check against
> > current master (that just received cherry-picked updates from your
> PR),
> > then there are really only a few commits left that are not yet
> > integrated.
> >
> > (I'll extract two more PRs on Monday, so master will soon have your
> > fixes in.)
> >
> > Best regards,
> > Karli
> >
> >
> > On 8/3/19 5:21 AM, Balay, Satish wrote:
> >  > I've attempted to rebase this branch over latest master - and
> pushed
> >  > my changes to branch mark/gamg-fix-viennacl-rebased-v2
> >  >
> >  > You might want to check each of your commits in this branch to
> see if
> >  > they are ok. I had to add one extra commit - to make it match
> 'merge
> >  > of mark/gamg-fix-viennacl-rebased and master'.
> >  >
> >  > This branch has 21 commits. I think its best if you can collapse
> them
> >  > into reasonable chunks of changes. [presumably a single commit
> > for all
> >  > the changes is not the correct thing here. But the current set of
> 21
> >  > commits are all over the place]
> >  >
> >  > If you are able to migrate to this branch - its best to delete
> > the old
> >  > one [i.e origin/mark/gamg-fix-viennacl-rebased]
> >  >
> >  > Satish
> >  >
> >  > On Fri, 2 Aug 2019, Mark Adams via petsc-dev wrote:
> >  >
> >  >> I have been cherry-picking, etc, branch
> > mark/gamg-fix-viennacl-rebased and
> >  >> it is very messed up. Can someone please update this branch when
> > all the
> >  >> fixes are settled down? eg, I am seeing dozens of modified files
> > that I
> >  >> don't know anything about and I certainly don't want to put in a
> > PR for
> >  >> them.
> >  >>
> >  >> I also seem to lose my pinToCPU method for cuda matrices. I don't
> >  >> understand how that conflicted with anyone else but it did.
> >  >>
> >  >> Thanks,
> >  >> Mark
> >  >>
> >  >
> >
>


Re: [petsc-dev] Is master broken?

2019-08-12 Thread Mark Adams via petsc-dev
Satish, I think I can do this now.
Mark

On Mon, Aug 12, 2019 at 6:26 AM Mark Adams  wrote:

> Satish,
>
> Your new branch mark/gamg-fix-viennacl-rebased-v2 does not seem to have
> Barry's fixes (the source of this thread):
>
>  ...
>   , line 243: error: function call is not allowed in a constant
>   expression
>   #if PETSC_PKG_CUDA_VERSION_GE(10,1,0)
>
> Here is the reflog of the cherry picking that I did to fix my last
> version. I forget exactly what I did to get these changes so I'd like to
> not mess it up. Can you add these to your new branch?
>
> Thanks,
> Mark
>
> 06:10 2 mark/gamg-fix-viennacl-rebased-v2= ~/petsc$ git reflog
> 3baf678 HEAD@{0}: checkout: moving from mark/gamg-fix-viennacl-rebased to
> mark/gamg-fix-viennacl-rebased-v2
> e50f779 HEAD@{1}: cherry-pick: 1) When detecting version info handle
> blanks introducted by preprocessor, error if needed version cannot be det
> 87beba0 HEAD@{2}: cherry-pick: Use outputPreprocess instead of preprocess
> since it prints source to log
> 9163512 HEAD@{3}: cherry-pick: Fix manual pages for
> MatXXXYBAIJSetPreallocationCSR() routines
> b4500af HEAD@{4}: cherry-pick: fix compile warnings on
> arch-linux-pkgs-64idx triggered by c73702f59ac80eb68c197b7eea6d8474b9e9853c
> 9f418a9 HEAD@{5}: cherry-pick: Fix compile warnings --with-cuda
> b541e45 HEAD@{6}: commit (cherry-pick): added a file that seemed to get
> deleted
> 5d2c71f HEAD@{7}: checkout: moving from
> barry/2019-09-01/robustify-version-check to mark/gamg-fix-viennacl-rebased
> 7c2e96e HEAD@{8}: cherry-pick: Fix compile warnings --with-cuda
> d70ea55 HEAD@{9}: cherry-pick: MATSEQDENSECUDA: Fix warnings
> c85f03d HEAD@{10}: pull origin: Fast-forward
> e24ebd8 HEAD@{11}: checkout: moving from master to
> barry/2019-09-01/robustify-version-check
>
>>
>>


Re: [petsc-dev] Is master broken?

2019-08-12 Thread Mark Adams via petsc-dev
Satish,

Your new branch mark/gamg-fix-viennacl-rebased-v2 does not seem to have
Barry's fixes (the source of this thread):

 ...
  , line 243: error: function call is not allowed in a constant
  expression
  #if PETSC_PKG_CUDA_VERSION_GE(10,1,0)

Here is the reflog of the cherry picking that I did to fix my last version.
I forget exactly what I did to get these changes so I'd like to not mess it
up. Can you add these to your new branch?

Thanks,
Mark

06:10 2 mark/gamg-fix-viennacl-rebased-v2= ~/petsc$ git reflog
3baf678 HEAD@{0}: checkout: moving from mark/gamg-fix-viennacl-rebased to
mark/gamg-fix-viennacl-rebased-v2
e50f779 HEAD@{1}: cherry-pick: 1) When detecting version info handle blanks
introducted by preprocessor, error if needed version cannot be det
87beba0 HEAD@{2}: cherry-pick: Use outputPreprocess instead of preprocess
since it prints source to log
9163512 HEAD@{3}: cherry-pick: Fix manual pages for
MatXXXYBAIJSetPreallocationCSR() routines
b4500af HEAD@{4}: cherry-pick: fix compile warnings on
arch-linux-pkgs-64idx triggered by c73702f59ac80eb68c197b7eea6d8474b9e9853c
9f418a9 HEAD@{5}: cherry-pick: Fix compile warnings --with-cuda
b541e45 HEAD@{6}: commit (cherry-pick): added a file that seemed to get
deleted
5d2c71f HEAD@{7}: checkout: moving from
barry/2019-09-01/robustify-version-check to mark/gamg-fix-viennacl-rebased
7c2e96e HEAD@{8}: cherry-pick: Fix compile warnings --with-cuda
d70ea55 HEAD@{9}: cherry-pick: MATSEQDENSECUDA: Fix warnings
c85f03d HEAD@{10}: pull origin: Fast-forward
e24ebd8 HEAD@{11}: checkout: moving from master to
barry/2019-09-01/robustify-version-check

>
>


Re: [petsc-dev] Is master broken?

2019-08-12 Thread Mark Adams via petsc-dev
>
>
>> several issues:
>>
>>
>> https://bitbucket.org/petsc/petsc/pull-requests/1954/cuda-fixes-to-pinning-onto-cpu/diff
>>
>>
> These links are dead.
>

I found one issue with not protecting the pinnedtocpu member variable in
Mat and Vec. Will fix asap.


Re: [petsc-dev] Is master broken?

2019-08-02 Thread Mark Adams via petsc-dev
I cherry-picked these two into Barry's branch and it built.

I would like to get them into my CUDA branch. Should I just cherry-pick them
and not worry about Barry's branch, or will that not work?

On Fri, Aug 2, 2019 at 12:03 PM Karl Rupp  wrote:

> FYI: The two branches are currently testing in `next-tmp` and are likely
> to be merged to master in ~5 hours.
>
> Best regards,
> Karli
>
>
> On 8/2/19 4:53 PM, Smith, Barry F. via petsc-dev wrote:
> >
> >Yes, these are bugs in Stefano's work that got into master because we
> didn't have comprehensive testing. There are two branches in the PR list
> you can cherry pick that will fix this problem. Sorry about this. We're
> trying to get them into master as quickly as possible but 
> >
> > Barry
> >
> >
> >> On Aug 2, 2019, at 8:39 AM, Mark Adams  wrote:
> >>
> >> closer,
> >>
> >> On Fri, Aug 2, 2019 at 9:13 AM Smith, Barry F. 
> wrote:
> >>
> >>Mark,
> >>
> >>  Thanks, that was not expected to work, I was just verifying the
> exact cause of the problem and it was what I was guessing.
> >>
> >>  I believe I have fixed it. Please pull that branch again and let
> me know if it works. If it does we'll do rush testing and get it into
> master.
> >>
> >>   Thanks
> >>
> >>   Barry
> >>
> >>
> >>> On Aug 1, 2019, at 11:08 AM, Mark Adams  wrote:
> >>>
> >>>
> >>>
> >>> On Thu, Aug 1, 2019 at 10:30 AM Smith, Barry F. 
> wrote:
> >>>
> >>>Send
> >>>
> >>> ls arch-linux2-c-debug/include/
> >>>
> >>> That is not my arch name. It is something like
> arch-summit-dbg64-pgi-cuda
> >>>
> >>>   arch-linux2-c-debug/include/petscpkg_version.h
> >>>
> >>> and configure.log
> >>>
> >>>
> >>>
> >>>> On Aug 1, 2019, at 5:23 AM, Mark Adams  wrote:
> >>>>
> >>>> I get the same error with a fresh clone of master.
> >>>>
> >>>> On Thu, Aug 1, 2019 at 6:03 AM Mark Adams  wrote:
> >>>> Tried again after deleting the arch dirs and still have it.
> >>>> This is my branch that just merged master. I will try with just
> master.
> >>>> Thanks,
> >>>>
> >>>> On Thu, Aug 1, 2019 at 1:36 AM Smith, Barry F. 
> wrote:
> >>>>
> >>>>It is generated automatically and put in
> arch-linux2-c-debug/include/petscpkg_version.h  this include file is
> included at top of the "bad" source  file crashes so in theory everything
> is in order check that arch-linux2-c-debug/include/petscpkg_version.h
> contains PETSC_PKG_CUDA_VERSION_GE and similar macros. If not send
> configure.lo
> >>>>
> >>>> check what is in arch-linux2-c-debug/include/petscpkg_version.h it
> nothing or broken send configure.lo
> >>>>
> >>>>
> >>>>Barry
> >>>>
> >>>>
> >>>>
> >>>>> On Jul 31, 2019, at 9:28 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >>>>>
> >>>>> I am seeing this when I pull master into my branch:
> >>>>>
> >>>>> "/autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/dense/seq/cuda/
> densecuda.cu"
> >>>>>, line 243: error: function call is not allowed in a
> constant
> >>>>>expression
> >>>>>#if PETSC_PKG_CUDA_VERSION_GE(10,1,0)
> >>>>>
> >>>>> and I see that this macro does not seem to be defined:
> >>>>>
> >>>>> 22:24 master= ~/Codes/petsc$ git grep PETSC_PKG_CUDA_VERSION_GE
> >>>>> src/mat/impls/dense/seq/cuda/densecuda.cu:#if
> PETSC_PKG_CUDA_VERSION_GE(10,1,0)
> >>>>
> >>>
> >>
> >> 
> >
>


Re: [petsc-dev] Is master broken?

2019-08-02 Thread Mark Adams via petsc-dev
I have been cherry-picking, etc., into branch mark/gamg-fix-viennacl-rebased
and it is very messed up. Can someone please update this branch when all the
fixes have settled down? E.g., I am seeing dozens of modified files that I
don't know anything about, and I certainly don't want to put them in a PR.

I also seem to have lost my pinToCPU method for CUDA matrices. I don't
understand how that conflicted with anyone else, but it did.

Thanks,
Mark


Re: [petsc-dev] Is master broken?

2019-08-03 Thread Mark Adams via petsc-dev
Karl,
Did you want me to do anything at this point? (on vacation this week) I
will verify that master is all fixed if you get all my stuff integrated
when I get back to work in a week.
Thanks,
Mark

On Sat, Aug 3, 2019 at 10:50 AM Karl Rupp  wrote:

> If you ignore the initial ViennaCL-related commits and check against
> current master (that just received cherry-picked updates from your PR),
> then there are really only a few commits left that are not yet integrated.
>
> (I'll extract two more PRs on Monday, so master will soon have your
> fixes in.)
>
> Best regards,
> Karli
>
>
> On 8/3/19 5:21 AM, Balay, Satish wrote:
> > I've attempted to rebase this branch over latest master - and pushed
> > my changes to branch mark/gamg-fix-viennacl-rebased-v2
> >
> > You might want to check each of your commits in this branch to see if
> > they are ok. I had to add one extra commit - to make it match 'merge
> > of mark/gamg-fix-viennacl-rebased and master'.
> >
> > This branch has 21 commits. I think its best if you can collapse them
> > into reasonable chunks of changes. [presumably a single commit for all
> > the changes is not the correct thing here. But the current set of 21
> > commits are all over the place]
> >
> > If you are able to migrate to this branch - its best to delete the old
> > one [i.e origin/mark/gamg-fix-viennacl-rebased]
> >
> > Satish
> >
> > On Fri, 2 Aug 2019, Mark Adams via petsc-dev wrote:
> >
> >> I have been cherry-picking, etc, branch mark/gamg-fix-viennacl-rebased
> and
> >> it is very messed up. Can someone please update this branch when all the
> >> fixes are settled down? eg, I am seeing dozens of modified files that I
> >> don't know anything about and I certainly don't want to put in a PR for
> >> them.
> >>
> >> I also seem to lose my pinToCPU method for cuda matrices. I don't
> >> understand how that conflicted with anyone else but it did.
> >>
> >> Thanks,
> >> Mark
> >>
> >
>


Re: [petsc-dev] hypre and CUDA

2019-08-16 Thread Mark Adams via petsc-dev
Thanks Karl,

From what Barry said, the hypre configure has not done the CUDA setup, so it
can't work right now.

Mark

On Thu, Aug 15, 2019 at 8:57 PM Karl Rupp  wrote:

> Hi,
>
> one way to test is to run a sequential example through nv-prof:
>   $> nvprof ./ex56 ...
>
>
> https://devblogs.nvidia.com/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
>
> If it uses the GPU, then you will get some information on the GPU
> kernels called. If it doesn't use the GPU, the list will be (almost) empty.
>
> Best regards,
> Karli
>
>
>
> On 8/15/19 5:47 PM, Mark Adams via petsc-dev wrote:
> > I have configured with Hypre on SUMMIT, with cuda, and it ran. I'm now
> > trying to verify that it used GPUs (I doubt it). Any ideas on how to
> > verify this? Should I use the cuda vecs and mats, or does Hypre not
> > care. Can I tell hypre not to use GPUs other than configuring an
> > non-cude PETSc? I'm not sure how to run a job without GPUs, but I will
> > look into it.
> >
> > Mark
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
I am getting this error with single:

22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
aijcusparse -fp_trap
[0] 81 global equations, 27 vertices
[0]PETSC ERROR: *** unknown floating point error occurred ***
[0]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[0]PETSC ERROR: debugger traps the signal, the exception can be found with
fetestexcept(0x3e00)
[0]PETSC ERROR: where the result is a bitwise OR of the following flags:
[0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400
FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
[0]PETSC ERROR: Try option -start_in_debugger
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: -  Stack Frames

[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:   INSTEAD the line number of the start of the function
[0]PETSC ERROR:   is given.
[0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
/autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
[0]PETSC ERROR: [0] PetscStrtod line 1964
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
[0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
/autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
[0]PETSC ERROR: [0] KSPSetFromOptions line 329
/autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
[0]PETSC ERROR: [0] SNESSetFromOptions line 869
/autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: - Error Message
--
[0]PETSC ERROR: Floating point exception
[0]PETSC ERROR: trapped floating point error
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html
for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1  GIT
Date: 2019-08-13 06:33:29 -0400
[0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda named
h36n11 by adams Wed Aug 14 22:21:56 2019
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC
--with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon"
FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0
--with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc
CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis
--download-fblaslapack --with-x=0 --with-64-bit-indices=0
--with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
[0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
--

On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F.  wrote:

>
>   Oh, doesn't even have to be that large. We just need to be able to look
> at the flop rates (as a surrogate for run times) and compare with the
> previous runs. So long as the size per process is pretty much the same that
> is good enough.
>
>Barry
>
>
> > On Aug 14, 2019, at 8:45 PM, Mark Adams  wrote:
> >
> > I can run single, I just can't scale up. But I can use like 1500
> processors.
> >
> > On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F. 
> wrote:
> >
> >   Oh, are all your integers 8 bytes? Even on one node?
> >
> >   Once Karl's new middleware is in place we should see about reducing to
> 4 bytes on the GPU.
> >
> >Barry
> >
> >
> > > On Aug 14, 2019, at 7:44 PM, Mark Adams  wrote:
> > >
> > > OK, I'll run single. It a bit perverse to run with 4 byte floats and 8
> byte integers ... I could use 32 bit ints and just not scale out.
> > >
> > > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. 
> wrote:
> > >
> > >  Mark,
> > >
> > >Oh, I don't even care if it converges, just put in a fixed number
> of iterations. The idea is to just get a baseline of the possible
> improvement.
> > >
> > > ECP is literally dropping millions into research on "multi
> precision" computations on GPUs, we need to have some actual numbers for
> the best potential benefit to determine how much we invest in further
> investigating it, or not.
> > >
> > > I am not expressing any opinions on the approach, we are just in
> the fact gathering stage.
> > >
> > >
> > >Barry
> > >
> > >
> > > > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> > > >
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> > > >
> > > >   Mark,
> > > >
> > > >Would you be able to make one run using single precision? Just
> single everywhere since that is all we support currently?
> > > >
> > > >
> > > > Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
OK, I'll run single. It's a bit perverse to run with 4-byte floats and 8-byte
integers ... I could use 32-bit ints and just not scale out.

On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F.  wrote:

>
>  Mark,
>
>Oh, I don't even care if it converges, just put in a fixed number of
> iterations. The idea is to just get a baseline of the possible improvement.
>
> ECP is literally dropping millions into research on "multi precision"
> computations on GPUs, we need to have some actual numbers for the best
> potential benefit to determine how much we invest in further investigating
> it, or not.
>
> I am not expressing any opinions on the approach, we are just in the
> fact gathering stage.
>
>
>Barry
>
>
> > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> >
> >
> >
> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> >
> >   Mark,
> >
> >Would you be able to make one run using single precision? Just single
> everywhere since that is all we support currently?
> >
> >
> > Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from others.
> This problem is pretty simple other than using Q2. I suppose I could try
> it, but just be aware the FE people might say that single sucks.
> >
> >The results will give us motivation (or anti-motivation) to have
> support for running KSP (or PC (or Mat)  in single precision while the
> simulation is double.
> >
> >Thanks.
> >
> >  Barry
> >
> > For example if the GPU speed on KSP is a factor of 3 over the double on
> GPUs this is serious motivation.
> >
> >
> > > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> > >
> > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> speedup with 98K dof/proc (3D Q2 elasticity).
> > >
> > > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> > >
> > > Comments welcome.
> > >
> 
> >
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 2:19 PM Smith, Barry F.  wrote:

>
>   Mark,
>
> This is great, we can study these for months.
>
> 1) At the top of the plots you say SNES  but that can't be right, there is
> no way it is getting such speed ups for the entire SNES solve since the
> Jacobians are CPUs and take much of the time. Do you mean the KSP part of
> the SNES solve?
>

It uses KSPONLY, and the solve times are for KSPSolve, with KSPSetUp called
beforehand.
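
As a rough sketch of that measurement pattern (the driver details here are an
assumption, not the actual ex56 code): KSPSetUp is called first, and only
KSPSolve is timed, inside its own log stage.

#include <petscksp.h>

/* Sketch only: time KSPSolve by itself, with setup done beforehand. */
static PetscErrorCode SolveOnly(KSP ksp,Vec b,Vec x)
{
  PetscLogStage  stage;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);                            /* setup excluded from the timed stage */
  ierr = PetscLogStageRegister("KSP only",&stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);                        /* this is what the reported times cover */
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}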


>
> 2) For the case of a bit more than 1000 processes the speedup with GPUs is
> fantastic, more than 6?
>

I did not see that one, but it is plausible and there is some noise in this
data. The largest solve had a speedup of about 4x.


>
> 3) People will ask about runs using all 48 CPUs, since they are there it
> is a little unfair to only compare 24 CPUs with the GPUs. Presumably due to
> memory bandwidth limits 48 won't be much better than 24 but you need it in
> your back pocket for completeness.
>
>
Ah, good point. I just cut and pasted, but I should run a little test and see
where it saturates.


> 4) From the table
>
> KSPSolve   1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02
> 8.3e+01  0  0  4  0  3  10 57 97 52 81  19113494114 3.06e-01  129
> 1.38e-01 84
> PCApply   17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02
> 3.4e+01  0  0  3  0  1   8 49 81 44 33  19684007 98 2.58e-01  113
> 1.19e-01 81
>
> only 84 percent of the total flops in the KSPSolve are on the GPU and only
> 81 for the PCApply() where are the rest? MatMult() etc are doing 100
> percent on the GPU, MatSolve on the coarsest level should be tiny and not
> taking 19 percent of the flops?
>
>
That is the smallest test, with 3465 equations on 24 cores. The R and P and
the coarse grid are on the CPU. Look at the larger tests.


>   Thanks
>
>Barry
>
>
> > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> >
> > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> speedup with 98K dof/proc (3D Q2 elasticity).
> >
> > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> >
> > Comments welcome.
> >
> 
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
I can run single, I just can't scale up. But I can use like 1500 processors.

On Wed, Aug 14, 2019 at 9:31 PM Smith, Barry F.  wrote:

>
>   Oh, are all your integers 8 bytes? Even on one node?
>
>   Once Karl's new middleware is in place we should see about reducing to 4
> bytes on the GPU.
>
>Barry
>
>
> > On Aug 14, 2019, at 7:44 PM, Mark Adams  wrote:
> >
> > OK, I'll run single. It a bit perverse to run with 4 byte floats and 8
> byte integers ... I could use 32 bit ints and just not scale out.
> >
> > On Wed, Aug 14, 2019 at 6:48 PM Smith, Barry F. 
> wrote:
> >
> >  Mark,
> >
> >Oh, I don't even care if it converges, just put in a fixed number of
> iterations. The idea is to just get a baseline of the possible improvement.
> >
> > ECP is literally dropping millions into research on "multi
> precision" computations on GPUs, we need to have some actual numbers for
> the best potential benefit to determine how much we invest in further
> investigating it, or not.
> >
> > I am not expressing any opinions on the approach, we are just in the
> fact gathering stage.
> >
> >
> >Barry
> >
> >
> > > On Aug 14, 2019, at 2:27 PM, Mark Adams  wrote:
> > >
> > >
> > >
> > > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> > >
> > >   Mark,
> > >
> > >Would you be able to make one run using single precision? Just
> single everywhere since that is all we support currently?
> > >
> > >
> > > Experience in engineering at least is single does not work for FE
> elasticity. I have tried it many years ago and have heard this from others.
> This problem is pretty simple other than using Q2. I suppose I could try
> it, but just be aware the FE people might say that single sucks.
> > >
> > >The results will give us motivation (or anti-motivation) to have
> support for running KSP (or PC (or Mat)  in single precision while the
> simulation is double.
> > >
> > >Thanks.
> > >
> > >  Barry
> > >
> > > For example if the GPU speed on KSP is a factor of 3 over the double
> on GPUs this is serious motivation.
> > >
> > >
> > > > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> > > >
> > > > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x
> GPU speedup with 98K dof/proc (3D Q2 elasticity).
> > > >
> > > > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> > > >
> > > > Comments welcome.
> > > >
> 
> > >
> >
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 3:37 PM Jed Brown  wrote:

> Mark Adams via petsc-dev  writes:
>
> > On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F. 
> wrote:
> >
> >>
> >>   Mark,
> >>
> >>Would you be able to make one run using single precision? Just single
> >> everywhere since that is all we support currently?
> >>
> >>
> > Experience in engineering at least is single does not work for FE
> > elasticity. I have tried it many years ago and have heard this from
> others.
> > This problem is pretty simple other than using Q2. I suppose I could try
> > it, but just be aware the FE people might say that single sucks.
>
> When they say that single sucks, is it for the definition of the
> operator or the preconditioner?
>

Operator.

And "ve seen GMRES stagnate when using single in communication in parallel
Gauss-Seidel. Roundoff is nonlinear.


>
> As point of reference, we can apply Q2 elasticity operators in double
> precision at nearly a billion dofs/second per GPU.


> I'm skeptical of big wins in preconditioning (especially setup) due to
> the cost and irregularity of indexing being large compared to the
> bandwidth cost of the floating point values.
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
>
>
>
> Do you have any applications that specifically want Q2 (versus Q1)
> elasticity or have some test problems that would benefit?
>
>
No, I'm just trying to push things.


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
Here are the times for KSPSolve on one node with 2,280,285 equations. These
nodes seem to have 42 cores. There are 6 "devices" (GPUs), with 7 cores
attached to each device. The anomalous 28-core result could be from only
using 4 "devices". I figure I will use 36 cores for now. I should really
do this with a lot of processors to include MPI communication...

NP   KSPSolve
20   5.6634e+00
24   4.7382e+00
28   6.0349e+00
32   4.7543e+00
36   4.2574e+00
42   4.2022e+00


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
FYI, this test has a smooth (polynomial) body force and it runs a
convergence study.

On Wed, Aug 14, 2019 at 6:15 PM Brad Aagaard via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Q2 is often useful in problems with body forces (such as gravitational
> body forces), which tend to have linear variations in stress.
>
> On 8/14/19 2:51 PM, Mark Adams via petsc-dev wrote:
> >
> >
> > Do you have any applications that specifically want Q2 (versus Q1)
> > elasticity or have some test problems that would benefit?
> >
> >
> > No, I'm just trying to push things.
>


[petsc-dev] hypre and CUDA

2019-08-15 Thread Mark Adams via petsc-dev
I have configured with hypre on SUMMIT, with CUDA, and it ran. I'm now
trying to verify that it used GPUs (I doubt it). Any ideas on how to verify
this? Should I use the CUDA vecs and mats, or does hypre not care? Can I
tell hypre not to use GPUs other than by configuring a non-CUDA PETSc? I'm
not sure how to run a job without GPUs, but I will look into it.

Mark


Re: [petsc-dev] hypre and CUDA

2019-08-15 Thread Mark Adams via petsc-dev
On Thu, Aug 15, 2019 at 2:34 PM Smith, Barry F.  wrote:

>
>   Mark,
>
>I don't know how one uses it; we don't yet have an option in hypre.py
> to turn it on.
>

Any ideas about when this might get done?


>
>You should just use regular PETSc matrices and vectors, not CUDA ones.
> Hypre manages all that stuff internally for itsetl.
>
>I don't know how one knows if hypre is using the GPU or not, there are
> some Nvidia profiling tools for tracking GPU usage, perhaps you can use
> those to see if it says the GPU is being used.
>

I have just been timing it. It is very slow: 3x slower than GAMG/CPU and
20x slower than GAMG/GPU. But hypre's parameters tend to be optimized for 2D
problems and I have not tuned them. But clearly it's not using GPUs.


>
>Barry
>
>
> > On Aug 15, 2019, at 10:47 AM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > I have configured with Hypre on SUMMIT, with cuda, and it ran. I'm now
> trying to verify that it used GPUs (I doubt it). Any ideas on how to verify
> this? Should I use the cuda vecs and mats, or does Hypre not care. Can I
> tell hypre not to use GPUs other than configuring an non-cude PETSc? I'm
> not sure how to run a job without GPUs, but I will look into it.
> >
> > Mark
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-14 Thread Mark Adams via petsc-dev
On Wed, Aug 14, 2019 at 2:35 PM Smith, Barry F.  wrote:

>
>   Mark,
>
>Would you be able to make one run using single precision? Just single
> everywhere since that is all we support currently?
>
>
Experience in engineering, at least, is that single precision does not work
for FE elasticity. I tried it many years ago and have heard the same from
others. This problem is pretty simple other than using Q2. I suppose I could
try it, but just be aware that the FE people might say that single sucks.


>The results will give us motivation (or anti-motivation) to have
> support for running KSP (or PC (or Mat)  in single precision while the
> simulation is double.
>
>Thanks.
>
>  Barry
>
> For example if the GPU speed on KSP is a factor of 3 over the double on
> GPUs this is serious motivation.
>
>
> > On Aug 14, 2019, at 12:45 PM, Mark Adams  wrote:
> >
> > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> speedup with 98K dof/proc (3D Q2 elasticity).
> >
> > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> >
> > Comments welcome.
> >
> 
>
>


Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-08-31 Thread Mark Adams via petsc-dev
On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F.  wrote:

>
>   Any explanation for why the scaling is much better for CPUs and than
> GPUs? Is it the "extra" time needed for communication from the GPUs?
>

The GPU work is well load balanced, so it weak scales perfectly. When you
put that work on the CPU, you add more perfectly scalable work, so the curve
looks better. For instance, the 98K dof/proc data goes up by about 1/2 sec
from the 1-node to the 512-node case for both GPU and CPU, because this
non-scaling comes from communication that is the same in both cases. (The
same absolute overhead is a larger fraction of the shorter GPU solve time,
which is why the GPU curve looks like it scales worse.)


>
>   Perhaps you could try the GPU version with Junchao's new MPI-aware CUDA
> branch (in the gitlab merge requests)  that can speed up the communication
> from GPUs?
>

Sure. Do I just check out jczhang/feature-sf-on-gpu and run as usual?


>
>Barry
>
>
> > On Aug 30, 2019, at 11:56 AM, Mark Adams  wrote:
> >
> > Here is some more weak scaling data with a fixed number of iterations (I
> have given a test with the numerical problems to ORNL and they said they
> would give it to Nvidia).
> >
> > I implemented an option to "spread" the reduced coarse grids across the
> whole machine as opposed to a "compact" layout where active processes are
> laid out in simple lexicographical order. This spread approach looks a
> little better.
> >
> > Mark
> >
> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. 
> wrote:
> >
> >   Ahh, PGI compiler, that explains it :-)
> >
> >   Ok, thanks. Don't worry about the runs right now. We'll figure out the
> fix. The code is just
> >
> >   *a = (PetscReal)strtod(name,endptr);
> >
> >   could be a compiler bug.
> >
> >
> >
> >
> > > On Aug 14, 2019, at 9:23 PM, Mark Adams  wrote:
> > >
> > > I am getting this error with single:
> > >
> > > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
> ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
> aijcusparse -fp_trap
> > > [0] 81 global equations, 27 vertices
> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
> > > [0]PETSC ERROR: The specific exception can be determined by running in
> a debugger.  When the
> > > [0]PETSC ERROR: debugger traps the signal, the exception can be found
> with fetestexcept(0x3e00)
> > > [0]PETSC ERROR: where the result is a bitwise OR of the following
> flags:
> > > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400
> FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
> > > [0]PETSC ERROR: Try option -start_in_debugger
> > > [0]PETSC ERROR: likely location of problem given in stack below
> > > [0]PETSC ERROR: -  Stack Frames
> 
> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> > > [0]PETSC ERROR:   INSTEAD the line number of the start of the
> function
> > > [0]PETSC ERROR:   is given.
> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
> > > [0]PETSC ERROR: [0] PetscStrtod line 1964
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329
> /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869
> /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
> > > [0]PETSC ERROR: - Error Message
> --
> > > [0]PETSC ERROR: Floating point exception
> > > [0]PETSC ERROR: trapped floating point error
> > > [0]PETSC ERROR: See
> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> > > [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1685-gd3eb2e1
> GIT Date: 2019-08-13 06:33:29 -0400
> > > [0]PETSC ERROR: ./ex56_single on a arch-summit-dbg-single-pgi-cuda
> named h36n11 by adams Wed Aug 14 22:21:56 2019
> > > [0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpiCC
> --with-fc=mpif90 COPTFLAGS="-g -Mfcon" CXXOPTFLAGS="-g -Mfcon"
> FOPTFLAGS="-g -Mfcon" --with-precision=single --with-ssl=0 --with-batch=0
> --with-mpiexec="jsrun -g 1" --with-cuda=1 --with-cudac=nvcc
> CUDAFLAGS="-ccbin pgc++" --download-metis --download-parmetis
> --download-fblaslapack --with-x=0 --with-64-bit-indices=0
> --with-debugging=1 PETSC_ARCH=arch-summit-dbg-single-pgi-cuda
> > > [0]PETSC ERROR: #1 User provided function() line 0 in Unknown file
> > >
> --
> > >
> > > On Wed, Aug 14, 2019 at 9:51 PM Smith, Barry F. 
> wrote:
> > >
> > >   Oh, doesn't even have 

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-09-01 Thread Mark Adams via petsc-dev
Junchao and Barry,

I am using mark/fix-cuda-with-gamg-pintocpu, which is built on Barry's
robustify branch. Is this in master yet? If so, I'd like to get my branch
merged to master, then merge Junchao's branch, and then use it.

I think we were waiting for some refactoring from Karl to proceed.

Anyway, I'm not sure how to proceed.

Thanks,
Mark


On Sun, Sep 1, 2019 at 8:45 AM Zhang, Junchao  wrote:

>
>
>
> On Sat, Aug 31, 2019 at 8:04 PM Mark Adams  wrote:
>
>>
>>
>> On Sat, Aug 31, 2019 at 4:28 PM Smith, Barry F. 
>> wrote:
>>
>>>
>>>   Any explanation for why the scaling is much better for CPUs and than
>>> GPUs? Is it the "extra" time needed for communication from the GPUs?
>>>
>>
>> The GPU work is well load balanced so it weak scales perfectly. When you
>> put that work in the CPU you get more perfectly scalable work added so it
>> looks better. For instance, the 98K dof/proc data goes up by about 1/2 sec.
>> from the 1 node to 512 node case for both GPU and CPU, because this
>> non-scaling is from communication that is the same for both cases
>>
>>
>>>
>>>   Perhaps you could try the GPU version with Junchao's new MPI-aware
>>> CUDA branch (in the gitlab merge requests)  that can speed up the
>>> communication from GPUs?
>>>
>>
>> Sure, Do I just checkout jczhang/feature-sf-on-gpu and run as ussual?
>>
>
> Use jsrun --smpiargs="-gpu"  to enable IBM MPI's cuda-aware support, then
> add -use_gpu_aware_mpi in option to let PETSc use that feature.
>
>
>>
>>
>>>
>>>Barry
>>>
>>>
>>> > On Aug 30, 2019, at 11:56 AM, Mark Adams  wrote:
>>> >
>>> > Here is some more weak scaling data with a fixed number of iterations
>>> (I have given a test with the numerical problems to ORNL and they said they
>>> would give it to Nvidia).
>>> >
>>> > I implemented an option to "spread" the reduced coarse grids across
>>> the whole machine as opposed to a "compact" layout where active processes
>>> are laid out in simple lexicographical order. This spread approach looks a
>>> little better.
>>> >
>>> > Mark
>>> >
>>> > On Wed, Aug 14, 2019 at 10:46 PM Smith, Barry F. 
>>> wrote:
>>> >
>>> >   Ahh, PGI compiler, that explains it :-)
>>> >
>>> >   Ok, thanks. Don't worry about the runs right now. We'll figure out
>>> the fix. The code is just
>>> >
>>> >   *a = (PetscReal)strtod(name,endptr);
>>> >
>>> >   could be a compiler bug.
>>> >
>>> >
>>> >
>>> >
>>> > > On Aug 14, 2019, at 9:23 PM, Mark Adams  wrote:
>>> > >
>>> > > I am getting this error with single:
>>> > >
>>> > > 22:21  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 1 -a 1 -c 1 -g 1
>>> ./ex56_single -cells 2,2,2 -ex56_dm_vec_type cuda -ex56_dm_mat_type
>>> aijcusparse -fp_trap
>>> > > [0] 81 global equations, 27 vertices
>>> > > [0]PETSC ERROR: *** unknown floating point error occurred ***
>>> > > [0]PETSC ERROR: The specific exception can be determined by running
>>> in a debugger.  When the
>>> > > [0]PETSC ERROR: debugger traps the signal, the exception can be
>>> found with fetestexcept(0x3e00)
>>> > > [0]PETSC ERROR: where the result is a bitwise OR of the following
>>> flags:
>>> > > [0]PETSC ERROR: FE_INVALID=0x2000 FE_DIVBYZERO=0x400
>>> FE_OVERFLOW=0x1000 FE_UNDERFLOW=0x800 FE_INEXACT=0x200
>>> > > [0]PETSC ERROR: Try option -start_in_debugger
>>> > > [0]PETSC ERROR: likely location of problem given in stack below
>>> > > [0]PETSC ERROR: -  Stack Frames
>>> 
>>> > > [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>> available,
>>> > > [0]PETSC ERROR:   INSTEAD the line number of the start of the
>>> function
>>> > > [0]PETSC ERROR:   is given.
>>> > > [0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/error/fp.c
>>> > > [0]PETSC ERROR: [0] PetscStrtod line 1964
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsStringToReal line 2021
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsGetReal line 2321
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/options.c
>>> > > [0]PETSC ERROR: [0] PetscOptionsReal_Private line 1015
>>> /autofs/nccs-svm1_home1/adams/petsc/src/sys/objects/aoptions.c
>>> > > [0]PETSC ERROR: [0] KSPSetFromOptions line 329
>>> /autofs/nccs-svm1_home1/adams/petsc/src/ksp/ksp/interface/itcl.c
>>> > > [0]PETSC ERROR: [0] SNESSetFromOptions line 869
>>> /autofs/nccs-svm1_home1/adams/petsc/src/snes/interface/snes.c
>>> > > [0]PETSC ERROR: - Error Message
>>> --
>>> > > [0]PETSC ERROR: Floating point exception
>>> > > [0]PETSC ERROR: trapped floating point error
>>> > > [0]PETSC ERROR: See
>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>> shooting.
>>> > > [0]PETSC ERROR: Petsc Development GIT revision:
>>> v3.11.3-1685-gd3eb2e1  GIT Date: 2019-08-13 06:33:29 -0400
>>> > > 

Re: [petsc-dev] Should we add something about GPU support to the user manual?

2019-09-12 Thread Mark Adams via petsc-dev
>
>
>> And are there any thoughts on where this belongs in the manual?
>>
>
> I think just make another chapter.
>
>
Agreed. That way we can make it very clear that this is WIP, interfaces
will change, etc.


>   Thanks,
>
> Matt
>
>
>> --Richard
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>


Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Mark Adams via petsc-dev
Barry, I fixed CUDA to pin to CPUs correctly for GAMG at least. There are
some hacks here that we can work on.

I will start testing it tomorrow, but I am pretty sure that I have not
regressed. I am hoping that this will fix the numerical problems, which
seem to be associated with empty processors.

I did need to touch code outside of GAMG and CUDA. It might be nice to test
this in next.

GAMG now puts all reduced processor grids on the CPU. This could be looked
at in the future.
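
In terms of the API discussed in these threads, that amounts to something like
the sketch below when a reduced coarse level is created (the wrapper function,
variable names, and call site are assumptions, not the actual GAMG code):

#include <petscmat.h>

/* Sketch only: keep a reduced (partially populated) coarse level on the CPU. */
static PetscErrorCode PinReducedLevelToCPU(Mat Acoarse,Vec xcoarse,Vec bcoarse)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatPinToCPU(Acoarse,PETSC_TRUE);CHKERRQ(ierr);  /* coarse operator stays on the CPU */
  ierr = VecPinToCPU(xcoarse,PETSC_TRUE);CHKERRQ(ierr);  /* and so do its work vectors */
  ierr = VecPinToCPU(bcoarse,PETSC_TRUE);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}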


On Sat, Jul 27, 2019 at 1:00 PM Smith, Barry F.  wrote:

>
>
> > On Jul 27, 2019, at 11:53 AM, Mark Adams  wrote:
> >
> >
> > On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F. 
> wrote:
> >
> >   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
> >
> > THis is done  (I may have done it).
> >
> > Now it seems to me that when you call VecPinToCPU you are setting up and
> don't have data, so this copy does not seem necessary. Maybe remove the
> copy here:
> >
> > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> > {
> >   PetscErrorCode ierr;
> >
> >   PetscFunctionBegin;
> >   V->pinnedtocpu = pin;
> >   if (pin) {
> > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); 
>
>The copy from GPU should actually only do anything if the GPU already
> has data and PETSC_OFFLOAD_GPU. If the GPU does not have data
> the copy doesn't do anything. When one calls VecPinToCPU() one doesn't
> know where the data is so the call must be made, but it may do nothing
>
>   Note that VecCUDACopyFromGPU() calls VecCUDAAllocateCheckHost() not
> VecCUDAAllocateCheck() so the GPU will not allocate space,
> VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().
>
>Yes, perhaps the naming could be more consistent:
>
> 1) in one place it is Host in an other place it is nothing
> 2) some places it is Host, Device, some places GPU,CPU
>
>Perhaps Karl can make these all consistent and simpler in his
> refactorization
>
>
>   Barry
>
>
> >
> > or
> >
> > Not allocate the GPU if it is pinned, by adding a check here:
> >
> > PetscErrorCode VecCUDAAllocateCheck(Vec v)
> > {
> >   PetscErrorCode ierr;
> >   cudaError_t    err;
> >   cudaStream_t   stream;
> >   Vec_CUDA   *veccuda;
> >
> >   PetscFunctionBegin;
> >   if (!v->spptr) {
> > ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
> > veccuda = (Vec_CUDA*)v->spptr;
> > if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
> > err =
> cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
> > veccuda->GPUarray = veccuda->GPUarray_allocated;
> > err = cudaStreamCreate(&stream);CHKERRCUDA(err);
> > veccuda->stream = stream;
> > veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
> > if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> >   if (v->data && ((Vec_Seq*)v->data)->array) {
> > v->valid_GPU_array = PETSC_OFFLOAD_CPU;
> >   } else {
> > v->valid_GPU_array = PETSC_OFFLOAD_GPU;
> >   }
> > }
> > }
> >   }
> >   PetscFunctionReturn(0);
> > }
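
An equivalent way to express that guard, using the pin flag directly rather
than the offload state, might look like the sketch below (an assumption about
the intent, not the actual change):

PetscErrorCode VecCUDAAllocateCheck(Vec v)
{
  PetscFunctionBegin;
  if (v->pinnedtocpu) PetscFunctionReturn(0);  /* pinned vectors never get a device buffer */
  /* ... existing allocation path from the code quoted above, unchanged ... */
  PetscFunctionReturn(0);
}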
> >
> >
> >
> >
> >
> > > On Jul 27, 2019, at 10:40 AM, Mark Adams  wrote:
> > >
> > > Yea, I just figured out the problem. VecDuplicate_MPICUDA did not call
> PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and am
> testing:
> > >
> > >   ierr =
> VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
> > >   vw   = (Vec_MPI*)(*v)->data;
> > >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
> _VecOps));CHKERRQ(ierr);
> > >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
> > >
> > > Thanks,
> > >
> > > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F. 
> wrote:
> > >
> > >   I don't understand the context. Once a vector is pinned to the CPU
> the flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is
> turned off.  Do you have a pinned vector that has the value
> PETSC_OFFLOAD_GPU?  For example here it is set to PETSC_OFFLOAD_CPU
> > >
> > > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> > > {
> > > 
> > >   if (pin) {
> > > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> > > V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will
> likely change values in the vector */
> > >
> > >
> > >   Is there any way to reproduce the problem?
> > >
> > >   Barry
> > >
> > >
> > >
> > >
> > > > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> > > >
> > > > I'm not sure what to do here. The problem is that pinned-to-cpu
> vectors are calling VecCUDACopyFromGPU here.
> > > >
> > > > Should I set x->valid_GPU_array to something else, like
> PETSC_OFFLOAD_CPU, in PinToCPU so this block of code i s not executed?
> > > >
> > > > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> > > > {
> > > >   PetscErrorCode ierr;
> > > > #if defined(PETSC_HAVE_VIENNACL)
> > > >   PetscBool  is_viennacltype = PETSC_FALSE;
> > > > #endif
> > > >
> > > >   PetscFunctionBegin;
> > > >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> > > >   ierr = 

Re: [petsc-dev] PCREDUNDANT

2019-07-28 Thread Mark Adams via petsc-dev
On Sun, Jul 28, 2019 at 2:54 AM Pierre Jolivet via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Hello,
> I’m facing multiple issues with PCREDUNDANT and MATMPISBAIJ:
> 1)
> https://www.mcs.anl.gov/petsc/petsc-current/src/mat/impls/sbaij/mpi/mpisbaij.c.html#line3354
>  shouldn’t
> this be sum != N? I’m running an example where it says that sum (4) !=
> Nbs (60), with a bs=15.
>

Clearly a bug.


> 2) when I’m using MATMPIBAIJ, I can do stuff like: -prefix_mat_type baij
> -prefix_pc_type redundant -prefix_redundant_pc_type ilu, and in the
> KSPView, I have "package used to perform factorization: petsc”, so the
> underlying MatType is indeed MATSEQBAIJ.
>
However, with MATMPISBAIJ, if I do: -prefix_mat_type sbaij -prefix_pc_type
> redundant, first, it looks like you are hardwiring a PCLU (MatGetFactor()
> line 4440 in src/mat/interface/matrix.c
>

Using LU as a default for symmetric matrices does seem wrong.


> Could not locate a solver package.), then, if I
> append -prefix_redundant_pc_type cholesky, I end up with an error related
> to MUMPS: MatGetFactor_sbaij_mumps() line 2625 in
> src/mat/impls/aij/mpi/mumps/mumps.c Cannot use PETSc SBAIJ matrices with
> block size > 1 with MUMPS Cholesky, use AIJ matrix instead. Why isn’t this
> call dispatched to PETSc Cholesky for SeqSBAIJ matrices?
>
>
Generally, we don't like to switch parameters under the covers like this.
We would rather you get your inputs right so you know what is going on.


> Thanks,
> Pierre
>
> 1) I don’t think this is tested right now, at least not in
> src/ksp/ksp/examples/tutorials
> 2) reproducer: src/ksp/ksp/examples/tutorials/ex2.c
> $ mpirun -np 2 ./ex2 -pc_type redundant -mat_type sbaij
> // error because trying to do LU with a symmetric matrix
> $ mpirun -np 2 ./ex2 -pc_type redundant -mat_type sbaij -redundant_pc_type
> cholesky -ksp_view
> // you’ll see: that MUMPS is being used, but since bs=1, it’s working, but
> it won’t for the general case
> //  the MatType is mpisbaij with "1 MPI processes" whereas
> with baij, it’s seqbaij
>
>


Re: [petsc-dev] MatPinToCPU

2019-07-30 Thread Mark Adams via petsc-dev
On Mon, Jul 29, 2019 at 11:27 PM Smith, Barry F.  wrote:

>
>   Thanks. Could you please send the 24 processors with the GPU?
>

That is in out_cuda_24.


>Note the final column of the table gives you the percentage of flops
> (not rates, actual operations) on the GPU. For you biggest run it is
>
>For the MatMult it is 18 percent and for KSP solve it is 23 percent. I
> think this is much too low, we'd like to see well over 90 percent of the
> flops on the GPU; or 95 or more. Is this because you are forced to put very
> large matrices only the CPU?
>

Humm, that is strange. BLAS1 stuff is 100% GPU but the coarse grids are on
the CPU. This could be because it is > 99.5%. And there is this in the last
solve phase:

MatMult  679 1.0 5.2220e+00 1.2 7.58e+09 1.3 8.0e+07 1.1e+04
0.0e+00  1 39 14  8  0   3 74 79 60  0 16438647   438720307578 1.99e+02
 519 2.55e+02 18
MatMultAdd   150 1.0 1.1836e+00 4.7 3.41e+08 1.2 1.0e+07 1.8e+03
0.0e+00  0  2  2  0  0   1  3 10  1  0 3409019   191195194120 2.48e+01
  60 2.25e+00 21
MatMultTranspose 150 1.0 5.7940e-01 2.4 3.37e+08 1.2 1.0e+07 1.8e+03
0.0e+00  0  2  2  0  0   0  3 10  1  0 6867795   2539317196 38 1.02e+02
 150 3.22e+00 92

I have added print statements to MatMult_[CUDA,CPU] and it looks fine. Well
over 90% should be on the GPU. I am puzzled. I'll keep digging but the log
statements look OK.


>For the MatMult if we assume the flop rate for the GPU is 25 times as
> fast as the CPU and 18 percent of the flops are done on the GPU then the
> ratio of time for the GPU should be 82.7 percent of the time for the CPU
> but  it is .90; so where is the extra time? Seems too much than just for
> the communication.
>

I don't follow this analysis, but there is something funny about the
logging ...
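
For reference, the arithmetic behind the 82.7 percent figure quoted above,
taking (as stated) 18 percent of the flops on a GPU assumed to be 25x faster:

  T_GPU / T_CPU = (1 - 0.18) + 0.18/25 = 0.82 + 0.0072 = 0.827

so an observed ratio of 0.90 leaves roughly 7 percent of the time unaccounted
for, presumably communication, transfers, or the logging itself.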


>
>There is so much information and so much happening in the final stage
> that it is hard to discern what is killing the performance in the GPU case
> for the KSP solve. Anyway you can just have a stage at the end with several
> KSP solves and nothing else?
>

I added this, e.g.:

--- Event Stage 7: KSP only

SFBcastOpBegin   263 1.0 8.4140e-03 2.7 0.00e+00 0.0 6.1e+04 2.5e+03
0.0e+00  0  0 15  7  0   1  0 91 98  0 0   0  0 0.00e+000
0.00e+00  0
SFBcastOpEnd 263 1.0 6.6676e-02 6.9 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   8  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
SFReduceBegin 48 1.0 4.5977e-04 2.1 0.00e+00 0.0 6.4e+03 6.0e+02
0.0e+00  0  0  2  0  0   0  0  9  2  0 0   0  0 0.00e+000
0.00e+00  0
SFReduceEnd   48 1.0 5.4065e-0321.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
MatMult  215 1.0 3.9271e-01 1.0 6.33e+08 1.4 5.5e+04 2.7e+03
0.0e+00  1 24 14  7  0  83 89 81 95  0 33405   177859430 1.75e+01  358
2.23e+01 17
MatMultAdd48 1.0 3.3079e-02 1.3 3.20e+07 1.3 6.4e+03 6.0e+02
0.0e+00  0  1  2  0  0   7  5  9  2  0 20318   106989 48 2.33e+00   48
2.24e-01 20
MatMultTranspose  48 1.0 1.1967e-02 1.8 3.15e+07 1.3 6.4e+03 6.0e+02
0.0e+00  0  1  2  0  0   2  4  9  2  0 55325   781863  0 0.00e+00   72
3.23e-01 93
MatSolve  24 0.0 3.6270e-03 0.0 1.02e+07 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0  2810   0  0 0.00e+000
0.00e+00  0
MatResidual   48 1.0 8.2272e-02 1.0 1.33e+08 1.4 1.2e+04 2.6e+03
0.0e+00  0  5  3  1  0  17 19 18 20  0 33284   136803 96 3.62e+00   72
4.50e+00 19
VecTDot   46 1.0 6.1646e-03 1.3 1.13e+06 1.2 0.0e+00 0.0e+00
4.6e+01  0  0  0  0  2   1  0  0  0 66  41096814  0 0.00e+000
0.00e+00 100
VecNorm   24 1.0 5.2724e-03 1.9 5.90e+05 1.2 0.0e+00 0.0e+00
2.4e+01  0  0  0  0  1   1  0  0  0 34  25075050  0 0.00e+000
0.00e+00 100
VecCopy  146 1.0 3.9029e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+00   24
9.87e-02  0
VecSet   169 1.0 1.3301e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000
0.00e+00  0
VecAXPY   46 1.0 1.5963e-03 1.2 1.13e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   0  0  0  0  0 15870   23070  0 0.00e+000
0.00e+00 100
VecAYPX  310 1.0 1.3059e-02 1.1 4.25e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   3  1  0  0  0  7273   12000 48 1.97e-010
0.00e+00 100
VecAXPBYCZ96 1.0 6.8591e-03 1.2 6.19e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   1  1  0  0  0 20134   46381  0 0.00e+000
0.00e+00 100
VecPointwiseMult 192 1.0 7.1075e-03 1.2 1.24e+06 1.2 0.0e+00 0.0e+00
0.0e+00  0  0  0  0  0   1  0  0  0  0  38864184 24 9.87e-020
0.00e+00 100
VecScatterBegin  311 1.0 1.1026e-02 2.0 0.00e+00 0.0 6.8e+04 2.3e+03
0.0e+00  0  0 17  7  0   2  0100100  0 0   0  0 0.00e+00   72

[petsc-dev] Is master broken?

2019-07-31 Thread Mark Adams via petsc-dev
I am seeing this when I pull master into my branch:

"/autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/dense/seq/cuda/
densecuda.cu"
  , line 243: error: function call is not allowed in a constant
  expression
  #if PETSC_PKG_CUDA_VERSION_GE(10,1,0)

and I see that this macro does not seem to be defined:

22:24 master= ~/Codes/petsc$ git grep PETSC_PKG_CUDA_VERSION_GE
src/mat/impls/dense/seq/cuda/densecuda.cu:#if
PETSC_PKG_CUDA_VERSION_GE(10,1,0)
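
For context, the generated header is what is supposed to provide that macro; a
sketch of what arch-*/include/petscpkg_version.h might contain when CUDA is
configured (the exact values and layout are assumptions):

/* illustrative sketch of a generated petscpkg_version.h */
#define PETSC_PKG_CUDA_VERSION_MAJOR    10
#define PETSC_PKG_CUDA_VERSION_MINOR    1
#define PETSC_PKG_CUDA_VERSION_SUBMINOR 0
#define PETSC_PKG_CUDA_VERSION_GE(MAJOR,MINOR,SUBMINOR) \
  ((PETSC_PKG_CUDA_VERSION_MAJOR*10000 + PETSC_PKG_CUDA_VERSION_MINOR*100 + PETSC_PKG_CUDA_VERSION_SUBMINOR) >= \
   ((MAJOR)*10000 + (MINOR)*100 + (SUBMINOR)))

If the header is missing or empty, PETSC_PKG_CUDA_VERSION_GE never gets
defined, and the #if line above is the first place the compiler trips over it,
which matches the error shown.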


Re: [petsc-dev] Is master broken?

2019-08-01 Thread Mark Adams via petsc-dev
I get the same error with a fresh clone of master.

On Thu, Aug 1, 2019 at 6:03 AM Mark Adams  wrote:

> Tried again after deleting the arch dirs and still have it.
> This is my branch that just merged master. I will try with just master.
> Thanks,
>
> On Thu, Aug 1, 2019 at 1:36 AM Smith, Barry F.  wrote:
>
>>
>>   It is generated automatically and put in
>> arch-linux2-c-debug/include/petscpkg_version.h  this include file is
>> included at top of the "bad" source  file crashes so in theory everything
>> is in order check that arch-linux2-c-debug/include/petscpkg_version.h
>> contains PETSC_PKG_CUDA_VERSION_GE and similar macros. If not send
>> configure.lo
>>
>> check what is in arch-linux2-c-debug/include/petscpkg_version.h it
>> nothing or broken send configure.lo
>>
>>
>>   Barry
>>
>>
>>
>> > On Jul 31, 2019, at 9:28 PM, Mark Adams via petsc-dev <
>> petsc-dev@mcs.anl.gov> wrote:
>> >
>> > I am seeing this when I pull master into my branch:
>> >
>> > "/autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/dense/seq/cuda/
>> densecuda.cu"
>> >   , line 243: error: function call is not allowed in a constant
>> >   expression
>> >   #if PETSC_PKG_CUDA_VERSION_GE(10,1,0)
>> >
>> > and I see that this macro does not seem to be defined:
>> >
>> > 22:24 master= ~/Codes/petsc$ git grep PETSC_PKG_CUDA_VERSION_GE
>> > src/mat/impls/dense/seq/cuda/densecuda.cu:#if
>> PETSC_PKG_CUDA_VERSION_GE(10,1,0)
>>
>>


Re: [petsc-dev] MatPinToCPU

2019-07-28 Thread Mark Adams via petsc-dev
This is looking good. I'm not seeing the numerical problems, but I've just
hidden them by avoiding the GPU on coarse grids.

Should I submit a pull request now or test more or wait for Karl?

On Sat, Jul 27, 2019 at 7:37 PM Mark Adams  wrote:

> Barry, I fixed CUDA to pin to CPUs correctly for GAMG at least. There are
> some hacks here that we can work on.
>
> I will start testing it tomorrow, but I am pretty sure that I have not
> regressed. I am hoping that this will fix the numerical problems, which
> seem to be associated with empty processors.
>
> I did need to touch code outside of GAMG and CUDA. It might be nice to
> test this in a next.
>
> GAMG now puts all reduced processorg grids on the CPU. This could be
> looked at in the future.
>
>
> On Sat, Jul 27, 2019 at 1:00 PM Smith, Barry F. 
> wrote:
>
>>
>>
>> > On Jul 27, 2019, at 11:53 AM, Mark Adams  wrote:
>> >
>> >
>> > On Sat, Jul 27, 2019 at 11:39 AM Smith, Barry F. 
>> wrote:
>> >
>> >   Good catch. Thanks. Maybe the SeqCUDA has the same problem?
>> >
>> > THis is done  (I may have done it).
>> >
>> > Now it seems to me that when you call VecPinToCPU you are setting up
>> and don't have data, so this copy does not seem necessary. Maybe remove the
>> copy here:
>> >
>> > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
>> > {
>> >   PetscErrorCode ierr;
>> >
>> >   PetscFunctionBegin;
>> >   V->pinnedtocpu = pin;
>> >   if (pin) {
>> > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr); 
>>
>>The copy from GPU should actually only do anything if the GPU already
>> has data and PETSC_OFFLOAD_GPU. If the GPU does not have data
>> the copy doesn't do anything. When one calls VecPinToCPU() one doesn't
>> know where the data is so the call must be made, but it may do nothing
>>
>>   Note that VecCUDACopyFromGPU() calls VecCUDAAllocateCheckHost() not
>> VecCUDAAllocateCheck() so the GPU will not allocate space,
>> VecCUDAAllocateCheck() is called from VecCUDACopyToGPU().
>>
>>Yes, perhaps the naming could be more consistent:
>>
>> 1) in one place it is Host in an other place it is nothing
>> 2) some places it is Host, Device, some places GPU,CPU
>>
>>Perhaps Karl can make these all consistent and simpler in his
>> refactorization
>>
>>
>>   Barry
>>
>>
>> >
>> > or
>> >
>> > Not allocate the GPU if it is pinned by added in a check here:
>> >
>> > PetscErrorCode VecCUDAAllocateCheck(Vec v)
>> > {
>> >   PetscErrorCode ierr;
>> >   cudaError_t    err;
>> >   cudaStream_t   stream;
>> >   Vec_CUDA   *veccuda;
>> >
>> >   PetscFunctionBegin;
>> >   if (!v->spptr) {
>> > ierr = PetscMalloc(sizeof(Vec_CUDA),&v->spptr);CHKERRQ(ierr);
>> > veccuda = (Vec_CUDA*)v->spptr;
>> > if (v->valid_GPU_array != PETSC_OFFLOAD_CPU) {
>> > err =
>> cudaMalloc((void**)&veccuda->GPUarray_allocated,sizeof(PetscScalar)*((PetscBLASInt)v->map->n));CHKERRCUDA(err);
>> > veccuda->GPUarray = veccuda->GPUarray_allocated;
>> > err = cudaStreamCreate(&stream);CHKERRCUDA(err);
>> > veccuda->stream = stream;
>> > veccuda->hostDataRegisteredAsPageLocked = PETSC_FALSE;
>> > if (v->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
>> >   if (v->data && ((Vec_Seq*)v->data)->array) {
>> > v->valid_GPU_array = PETSC_OFFLOAD_CPU;
>> >   } else {
>> > v->valid_GPU_array = PETSC_OFFLOAD_GPU;
>> >   }
>> > }
>> > }
>> >   }
>> >   PetscFunctionReturn(0);
>> > }
>> >
>> >
>> >
>> >
>> >
>> > > On Jul 27, 2019, at 10:40 AM, Mark Adams  wrote:
>> > >
>> > > Yea, I just figured out the problem. VecDuplicate_MPICUDA did not
>> call PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and
>> am testing:
>> > >
>> > >   ierr =
>> VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
>> > >   vw   = (Vec_MPI*)(*v)->data;
>> > >   ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct
>> _VecOps));CHKERRQ(ierr);
>> > >   ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);
>> > >
>> > > Thanks,
>> > >
>> > > On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F. 
>> wrote:
>> > >
>> > >   I don't understand the context. Once a vector is pinned to the CPU
>> the flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is
>> turned off.  Do you have a pinned vector that has the value
>> PETSC_OFFLOAD_GPU?  For example here it is set to PETSC_OFFLOAD_CPU
>> > >
>> > > PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
>> > > {
>> > > 
>> > >   if (pin) {
>> > > ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
>> > > V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code
>> will likely change values in the vector */
>> > >
>> > >
>> > >   Is there any way to reproduce the problem?
>> > >
>> > >   Barry
>> > >
>> > >
>> > >
>> > >
>> > > > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
>> > > >
>> > > > I'm not sure what to do here. The problem is that pinned-to-cpu
>> vectors are calling VecCUDACopyFromGPU here.
>> > > >
>> > > > Should I set x->valid_GPU_array to something else, like
>> 

Re: [petsc-dev] Is master broken?

2019-08-01 Thread Mark Adams via petsc-dev
On Thu, Aug 1, 2019 at 10:30 AM Smith, Barry F.  wrote:

>
>   Send
>
> ls arch-linux2-c-debug/include/
>

That is not my arch name. It is something like arch-summit-dbg64-pgi-cuda

>
>  arch-linux2-c-debug/include/petscpkg_version.h
>
> and configure.log
>
>
>
> > On Aug 1, 2019, at 5:23 AM, Mark Adams  wrote:
> >
> > I get the same error with a fresh clone of master.
> >
> > On Thu, Aug 1, 2019 at 6:03 AM Mark Adams  wrote:
> > Tried again after deleting the arch dirs and still have it.
> > This is my branch that just merged master. I will try with just master.
> > Thanks,
> >
> > On Thu, Aug 1, 2019 at 1:36 AM Smith, Barry F. 
> wrote:
> >
> >   It is generated automatically and put in
> arch-linux2-c-debug/include/petscpkg_version.h  this include file is
> included at top of the "bad" source  file crashes so in theory everything
> is in order check that arch-linux2-c-debug/include/petscpkg_version.h
> contains PETSC_PKG_CUDA_VERSION_GE and similar macros. If not send
> configure.lo
> >
> > check what is in arch-linux2-c-debug/include/petscpkg_version.h it
> nothing or broken send configure.lo
> >
> >
> >   Barry
> >
> >
> >
> > > On Jul 31, 2019, at 9:28 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> > >
> > > I am seeing this when I pull master into my branch:
> > >
> > > "/autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/dense/seq/cuda/
> densecuda.cu"
> > >   , line 243: error: function call is not allowed in a constant
> > >   expression
> > >   #if PETSC_PKG_CUDA_VERSION_GE(10,1,0)
> > >
> > > and I see that this macro does not seem to be defined:
> > >
> > > 22:24 master= ~/Codes/petsc$ git grep PETSC_PKG_CUDA_VERSION_GE
> > > src/mat/impls/dense/seq/cuda/densecuda.cu:#if
> PETSC_PKG_CUDA_VERSION_GE(10,1,0)
> >
>
>


Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Mark Adams via petsc-dev
I'm not sure what to do here. The problem is that pinned-to-CPU vectors are
calling VecCUDACopyFromGPU here.

Should I set x->valid_GPU_array to something else, like
PETSC_OFFLOAD_CPU, in PinToCPU so this block of code is not executed?

PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
{
  PetscErrorCode ierr;
#if defined(PETSC_HAVE_VIENNACL)
  PetscBool  is_viennacltype = PETSC_FALSE;
#endif

  PetscFunctionBegin;
  PetscValidHeaderSpecific(x,VEC_CLASSID,1);
  ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
  if (x->petscnative) {
#if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
#if defined(PETSC_HAVE_VIENNACL)
  ierr =
PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
  if (is_viennacltype) {
ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
  } else
#endif
  {
#if defined(PETSC_HAVE_CUDA)
ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
#endif
 }
} else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
#if defined(PETSC_HAVE_VIENNACL)
  ierr =
PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
  if (is_viennacltype) {
ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
  } else
#endif
  {
#if defined(PETSC_HAVE_CUDA)
ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
#endif
  }
}
#endif
*a = *((PetscScalar**)x->data);
  } else {


On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F.  wrote:

>
>  Yes, it needs to be able to switch back and forth between the CPU and GPU
> methods so you need to move into it the setting of the methods that is
> currently directly in the create method. See how
> MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr =
> MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the methods
> for the GPU initially.
>
>   Barry
>
>
> > On Jul 23, 2019, at 7:32 PM, Mark Adams  wrote:
> >
> >
> >   What are the symptoms of it not working? Does it appear to be still
> copying the matrices to the GPU? then running the functions on the GPU?
> >
> >
> > The object is dispatching the CUDA mat-vec etc.
> >
> >   I suspect the pinning is incompletely done for CUDA (and MPIOpenCL)
> matrices.
> >
> >
> > Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.
> >
> > I guess I can add something like this below. Do we need to set the
> device methods? They are already set when this method is set, right?
> >
> > We need the equivalent of
> >
> > static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> > {
> >   PetscFunctionBegin;
> >   A->pinnedtocpu = flg;
> >   if (flg) {
> > A->ops->mult   = MatMult_SeqAIJ;
> > A->ops->multadd= MatMultAdd_SeqAIJ;
> > A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> > A->ops->duplicate  = MatDuplicate_SeqAIJ;
> >   } else {
> > A->ops->mult   = MatMult_SeqAIJViennaCL;
> > A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> > A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> > A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> > A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
> >   }
> >   PetscFunctionReturn(0);
> > }
> >
> > for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has
> been written yet.
> >
> >
> > >
> > > It does not seem to work. It does not look like CUDA has an
> MatCreateVecs. Should I add one and copy this flag over?
> >
> >We do need this function. But I don't see how it relates to pinning.
> When the matrix is pinned to the CPU we want it to create CPU vectors which
> I assume it does.
> >
> >
> > >
> > > Mark
> >
>
>


Re: [petsc-dev] MatPinToCPU

2019-07-27 Thread Mark Adams via petsc-dev
Yea, I just figured out the problem. VecDuplicate_MPICUDA did not call
PinToCPU or even copy pinnedtocpu. It just copied ops, so I added and am
testing:

  ierr = VecCreate_MPICUDA_Private(*v,PETSC_TRUE,w->nghost,0);CHKERRQ(ierr);
  vw   = (Vec_MPI*)(*v)->data;
  ierr = PetscMemcpy((*v)->ops,win->ops,sizeof(struct _VecOps));CHKERRQ(ierr);
  ierr = VecPinToCPU(*v,win->pinnedtocpu);CHKERRQ(ierr);  /* <-- the added line */

Thanks,

On Sat, Jul 27, 2019 at 11:33 AM Smith, Barry F.  wrote:

>
>   I don't understand the context. Once a vector is pinned to the CPU the
> flag should be PETSC_OFFLOAD_CPU permanently until the pin to cpu is turned
> off.  Do you have a pinned vector that has the value PETSC_OFFLOAD_GPU?
> For example here it is set to PETSC_OFFLOAD_CPU
>
> PetscErrorCode VecPinToCPU_MPICUDA(Vec V,PetscBool pin)
> {
> 
>   if (pin) {
> ierr = VecCUDACopyFromGPU(V);CHKERRQ(ierr);
> V->valid_GPU_array = PETSC_OFFLOAD_CPU; /* since the CPU code will
> likely change values in the vector */
>
>
>   Is there any way to reproduce the problem?
>
>   Barry
>
>
>
>
> > On Jul 27, 2019, at 10:28 AM, Mark Adams  wrote:
> >
> > I'm not sure what to do here. The problem is that pinned-to-cpu vectors
> are calling VecCUDACopyFromGPU here.
> >
> > Should I set x->valid_GPU_array to something else, like
> PETSC_OFFLOAD_CPU, in PinToCPU so this block of code is not executed?
> >
> > PetscErrorCode VecGetArray(Vec x,PetscScalar **a)
> > {
> >   PetscErrorCode ierr;
> > #if defined(PETSC_HAVE_VIENNACL)
> >   PetscBool  is_viennacltype = PETSC_FALSE;
> > #endif
> >
> >   PetscFunctionBegin;
> >   PetscValidHeaderSpecific(x,VEC_CLASSID,1);
> >   ierr = VecSetErrorIfLocked(x,1);CHKERRQ(ierr);
> >   if (x->petscnative) {
> > #if defined(PETSC_HAVE_VIENNACL) || defined(PETSC_HAVE_CUDA)
> > if (x->valid_GPU_array == PETSC_OFFLOAD_GPU) {
> > #if defined(PETSC_HAVE_VIENNACL)
> >   ierr = PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >   if (is_viennacltype) {
> > ierr = VecViennaCLCopyFromGPU(x);CHKERRQ(ierr);
> >   } else
> > #endif
> >   {
> > #if defined(PETSC_HAVE_CUDA)
> > ierr = VecCUDACopyFromGPU(x);CHKERRQ(ierr);
> > #endif
> >  }
> > } else if (x->valid_GPU_array == PETSC_OFFLOAD_UNALLOCATED) {
> > #if defined(PETSC_HAVE_VIENNACL)
> >   ierr = PetscObjectTypeCompareAny((PetscObject)x,&is_viennacltype,VECSEQVIENNACL,VECMPIVIENNACL,VECVIENNACL,"");CHKERRQ(ierr);
> >   if (is_viennacltype) {
> > ierr = VecViennaCLAllocateCheckHost(x);CHKERRQ(ierr);
> >   } else
> > #endif
> >   {
> > #if defined(PETSC_HAVE_CUDA)
> > ierr = VecCUDAAllocateCheckHost(x);CHKERRQ(ierr);
> > #endif
> >   }
> > }
> > #endif
> > *a = *((PetscScalar**)x->data);
> >   } else {
> >
> >
> > On Tue, Jul 23, 2019 at 9:18 PM Smith, Barry F. 
> wrote:
> >
> >  Yes, it needs to be able to switch back and forth between the CPU and
> GPU methods so you need to move into it the setting of the methods that is
> currently directly in the create method. See how
> MatConvert_SeqAIJ_SeqAIJViennaCL() calls ierr =
> MatPinToCPU_SeqAIJViennaCL(A,PETSC_FALSE);CHKERRQ(ierr); to set the methods
> for the GPU initially.
> >
> >   Barry
> >
> >
> > > On Jul 23, 2019, at 7:32 PM, Mark Adams  wrote:
> > >
> > >
> > >   What are the symptoms of it not working? Does it appear to be still
> copying the matrices to the GPU? then running the functions on the GPU?
> > >
> > >
> > > The object is dispatching the CUDA mat-vec etc.
> > >
> > >   I suspect the pinning is incompletely done for CUDA (and MPIOpenCL)
> matrices.
> > >
> > >
> > > Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.
> > >
> > > I guess I can add something like this below. Do we need to set the
> device methods? They are already set when this method is set, right?
> > >
> > > We need the equivalent of
> > >
> > > static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> > > {
> > >   PetscFunctionBegin;
> > >   A->pinnedtocpu = flg;
> > >   if (flg) {
> > > A->ops->mult   = MatMult_SeqAIJ;
> > > A->ops->multadd= MatMultAdd_SeqAIJ;
> > > A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> > > A->ops->duplicate  = MatDuplicate_SeqAIJ;
> > >   } else {
> > > A->ops->mult   = MatMult_SeqAIJViennaCL;
> > > A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> > > A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> > > A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> > > A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
> > >   }
> > >   PetscFunctionReturn(0);
> > > }
> > >
> > > for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has
> been written yet.
> > >
> > >
> > > >
> > > > It does not seem to work. It does not look like CUDA has an
> MatCreateVecs. Should I add one and copy this flag over?
> > >
> > >We do need this function. 

[petsc-dev] CUDA GAMG coarse grid solver

2019-07-21 Thread Mark Adams via petsc-dev
I am running ex56 with -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse
and I see no GPU communication in MatSolve (the serial LU coarse grid
solver). I am thinking the dispatch of the CUDA version of this got dropped
somehow.

I see that this is getting called:

PETSC_EXTERN PetscErrorCode MatSolverTypeRegister_CUSPARSE(void)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr =
MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_LU,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
  ierr =
MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_CHOLESKY,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
  ierr =
MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_ILU,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
  ierr =
MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_ICC,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

but MatGetFactor_seqaijcusparse_cusparse is not getting  called.

GAMG does set the coarse grid solver to LU manually like this: ierr =
PCSetType(pc2, PCLU);CHKERRQ(ierr);

Any ideas?

Thanks,
Mark


Re: [petsc-dev] CUDA GAMG coarse grid solver

2019-07-21 Thread Mark Adams via petsc-dev
Barry, I do NOT see communication. This is what made me think it was not
running on the GPU. I added print statements and found that
MatSolverTypeRegister_CUSPARSE IS called but (what it registers)
MatGetFactor_seqaijcusparse_cusparse does NOT get called.

I have a job waiting on the queue. I'll send ksp_view when it runs. I will
try -mg_coarse_mat_solver_type cusparse. That is probably the problem.
Maybe I should set the coarse grid solver in a more robust way in GAMG,
e.g., based on the matrix type somehow? I currently use PCSetType(pc, PCLU).
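
Something along these lines is what I have in mind -- just a sketch, where
pc2 is the coarse-grid PC as in the current GAMG code and Amat is an assumed
name for the coarse-grid operator:

  /* sketch: choose the factorization package from the coarse grid matrix type */
  Mat       Amat;
  PetscBool iscusp;
  ierr = PCGetOperators(pc2,NULL,&Amat);CHKERRQ(ierr); /* the matrix that will be factored */
  ierr = PetscObjectTypeCompareAny((PetscObject)Amat,&iscusp,MATSEQAIJCUSPARSE,MATMPIAIJCUSPARSE,"");CHKERRQ(ierr);
  ierr = PCSetType(pc2,PCLU);CHKERRQ(ierr);
  if (iscusp) {
    ierr = PCFactorSetMatSolverType(pc2,MATSOLVERCUSPARSE);CHKERRQ(ierr);
  }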

I can't get an interactive shell now to run DDT, but I can try stepping
through from MatGetFactor to see what it's doing.

Thanks,
Mark

On Sun, Jul 21, 2019 at 11:14 AM Smith, Barry F.  wrote:

>
>
> > On Jul 21, 2019, at 8:55 AM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > I am running ex56 with -ex56_dm_vec_type cuda -ex56_dm_mat_type
> aijcusparse and I see no GPU communication in MatSolve (the serial LU
> coarse grid solver).
>
>Do you mean to say, you DO see communication?
>
>What does -ksp_view should you? It should show the factor type in the
> information about the coarse grid solve?
>
>You might need something like -mg_coarse_mat_solver_type cusparse
> (because it may default to the PETSc one, it may be possible to have it
> default to the cusparse if it exists and the matrix is of type
> MATSEQAIJCUSPARSE).
>
>The determination of the MatGetFactor() is a bit involved including
> pasting together strings and string compares and could be finding a CPU
> factorization.
>
>I could run on one MPI_Rank() in the debugger and put a break point in
> MatGetFactor() and track along to see what it picks and why. You could do
> this debugging without GAMG first, just -pc_type lu
>
> > GAMG does set the coarse grid solver to LU manually like this: ierr =
> PCSetType(pc2, PCLU);CHKERRQ(ierr);
>
>   For parallel runs this won't work using the GPU code and only sequential
> direct solvers, so it must using BJACOBI in that case?
>
>Barry
>
>
>
>
>
> > I am thinking the dispatch of the CUDA version of this got dropped
> somehow.
> >
> > I see that this is getting called:
> >
> > PETSC_EXTERN PetscErrorCode MatSolverTypeRegister_CUSPARSE(void)
> > {
> >   PetscErrorCode ierr;
> >
> >   PetscFunctionBegin;
> >   ierr =
> MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_LU,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
> >   ierr =
> MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_CHOLESKY,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
> >   ierr =
> MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_ILU,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
> >   ierr =
> MatSolverTypeRegister(MATSOLVERCUSPARSE,MATSEQAIJCUSPARSE,MAT_FACTOR_ICC,MatGetFactor_seqaijcusparse_cusparse);CHKERRQ(ierr);
> >   PetscFunctionReturn(0);
> > }
> >
> > but MatGetFactor_seqaijcusparse_cusparse is not getting  called.
> >
> > GAMG does set the coarse grid solver to LU manually like this: ierr =
> PCSetType(pc2, PCLU);CHKERRQ(ierr);
> >
> > Any ideas?
> >
> > Thanks,
> > Mark
> >
> >
>
>


[petsc-dev] MatPinToCPU

2019-07-23 Thread Mark Adams via petsc-dev
I've tried to add pinning the matrix and prolongator to the CPU on coarse
grids in GAMG with this:

/* pin reduced coarse grid - could do something smarter */
ierr = MatPinToCPU(*a_Amat_crs,PETSC_TRUE);CHKERRQ(ierr);
ierr = MatPinToCPU(*a_P_inout,PETSC_TRUE);CHKERRQ(ierr);

It does not seem to work. It does not look like CUDA has an MatCreateVecs.
Should I add one and copy this flag over?

Mark
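
Roughly what I have in mind is below -- only a sketch: the function name, and
whether calling VecPinToCPU on the new vectors is the right way to "copy the
flag over", are my assumptions.

static PetscErrorCode MatCreateVecs_AIJCUSPARSE(Mat A,Vec *right,Vec *left)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  if (right) {
    ierr = VecCreate(PetscObjectComm((PetscObject)A),right);CHKERRQ(ierr);
    ierr = VecSetSizes(*right,A->cmap->n,PETSC_DETERMINE);CHKERRQ(ierr);
    ierr = VecSetType(*right,VECCUDA);CHKERRQ(ierr);
    ierr = VecPinToCPU(*right,A->pinnedtocpu);CHKERRQ(ierr); /* carry the pin flag over */
  }
  if (left) {
    ierr = VecCreate(PetscObjectComm((PetscObject)A),left);CHKERRQ(ierr);
    ierr = VecSetSizes(*left,A->rmap->n,PETSC_DETERMINE);CHKERRQ(ierr);
    ierr = VecSetType(*left,VECCUDA);CHKERRQ(ierr);
    ierr = VecPinToCPU(*left,A->pinnedtocpu);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}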


Re: [petsc-dev] MatPinToCPU

2019-07-23 Thread Mark Adams via petsc-dev
>
>
>   What are the symptoms of it not working? Does it appear to be still
> copying the matrices to the GPU? then running the functions on the GPU?
>
>
The object is dispatching the CUDA mat-vec etc.

  I suspect the pinning is incompletely done for CUDA (and MPIOpenCL)
> matrices.
>
>
Yes, git grep MatPinToCPU shows stuff for ViennaCL but not CUDA.

I guess I can add something like this below. Do we need to set the device
methods? They are already set when this method is set, right?


> We need the equivalent of
>
> static PetscErrorCode MatPinToCPU_SeqAIJViennaCL(Mat A,PetscBool flg)
> {
>   PetscFunctionBegin;
>   A->pinnedtocpu = flg;
>   if (flg) {
> A->ops->mult   = MatMult_SeqAIJ;
> A->ops->multadd= MatMultAdd_SeqAIJ;
> A->ops->assemblyend= MatAssemblyEnd_SeqAIJ;
> A->ops->duplicate  = MatDuplicate_SeqAIJ;
>   } else {
> A->ops->mult   = MatMult_SeqAIJViennaCL;
> A->ops->multadd= MatMultAdd_SeqAIJViennaCL;
> A->ops->assemblyend= MatAssemblyEnd_SeqAIJViennaCL;
> A->ops->destroy= MatDestroy_SeqAIJViennaCL;
> A->ops->duplicate  = MatDuplicate_SeqAIJViennaCL;
>   }
>   PetscFunctionReturn(0);
> }
>
> for MPIViennaCL and MPISeqAIJ Cusparse but it doesn't look like it has
> been written yet.
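
For CUSPARSE that would be something like the sketch below. The
MatXxx_SeqAIJCUSPARSE names are assumptions and would have to match whatever
aijcusparse.cu actually provides (e.g., if there is no
MatDuplicate_SeqAIJCUSPARSE, the duplicate entry just stays MatDuplicate_SeqAIJ):

static PetscErrorCode MatPinToCPU_SeqAIJCUSPARSE(Mat A,PetscBool flg)
{
  PetscFunctionBegin;
  A->pinnedtocpu = flg;
  if (flg) { /* pinned: fall back to the plain SeqAIJ (CPU) methods */
    A->ops->mult        = MatMult_SeqAIJ;
    A->ops->multadd     = MatMultAdd_SeqAIJ;
    A->ops->assemblyend = MatAssemblyEnd_SeqAIJ;
    A->ops->duplicate   = MatDuplicate_SeqAIJ;
  } else {   /* unpinned: restore the CUSPARSE (GPU) methods */
    A->ops->mult        = MatMult_SeqAIJCUSPARSE;
    A->ops->multadd     = MatMultAdd_SeqAIJCUSPARSE;
    A->ops->assemblyend = MatAssemblyEnd_SeqAIJCUSPARSE;
    A->ops->destroy     = MatDestroy_SeqAIJCUSPARSE;
    A->ops->duplicate   = MatDuplicate_SeqAIJ;
  }
  PetscFunctionReturn(0);
}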
>
>
> >
> > It does not seem to work. It does not look like CUDA has an
> MatCreateVecs. Should I add one and copy this flag over?
>
>We do need this function. But I don't see how it relates to pinning.
> When the matrix is pinned to the CPU we want it to create CPU vectors which
> I assume it does.
>
>
> >
> > Mark
>
>


Re: [petsc-dev] CUDA GAMG coarse grid solver

2019-07-21 Thread Mark Adams via petsc-dev
Barry,

Option left: name:-mg_coarse_mat_solver_type value: cusparse

I tried this too:

Option left: name:-mg_coarse_sub_mat_solver_type value: cusparse

Here is the view. cusparse did not get into the factor type: the factored
matrix is plain seqaij and the factorization package is petsc.

PC Object: 24 MPI processes
  type: gamg
type is MULTIPLICATIVE, levels=5 cycles=v
  Cycles per PCApply=1
  Using externally compute Galerkin coarse grid matrices
  GAMG specific options
Threshold for dropping small values in graph on each level =   0.05
  0.025   0.0125
Threshold scaling factor for each level not specified = 0.5
AGG specific options
  Symmetric graph false
  Number of levels to square graph 10
  Number smoothing steps 1
Complexity:grid = 1.14213
  Coarse grid solver -- level ---
KSP Object: (mg_coarse_) 24 MPI processes
  type: preonly
  maximum iterations=1, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
  left preconditioning
  using NONE norm type for convergence test
PC Object: (mg_coarse_) 24 MPI processes
  type: bjacobi
number of blocks = 24
Local solve is same for all blocks, in the following KSP and PC
objects:
  KSP Object: (mg_coarse_sub_) 1 MPI processes
type: preonly
maximum iterations=1, initial guess is zero
tolerances:  relative=1e-05, absolute=1e-50, divergence=1.
left preconditioning
using NONE norm type for convergence test
  PC Object: (mg_coarse_sub_) 1 MPI processes
type: lu
  out-of-place factorization
  tolerance for zero pivot 2.22045e-14
  using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
  matrix ordering: nd
  factor fill ratio given 5., needed 1.
Factored matrix follows:
  Mat Object: 1 MPI processes
*type: seqaij*
rows=6, cols=6
package used to perform factorization: petsc
total: nonzeros=36, allocated nonzeros=36
total number of mallocs used during MatSetValues calls =0
  using I-node routines: found 2 nodes, limit used is 5
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
  *type: seqaijcusparse*
  rows=6, cols=6
  total: nonzeros=36, allocated nonzeros=36
  total number of mallocs used during MatSetValues calls =0
using I-node routines: found 2 nodes, limit used is 5
  linear system matrix = precond matrix:
  Mat Object: 24 MPI processes
   * type: mpiaijcusparse*
rows=6, cols=6, bs=6
total: nonzeros=36, allocated nonzeros=36
total number of mallocs used during MatSetValues calls =0
  using scalable MatPtAP() implementation
  using I-node (on process 0) routines: found 2 nodes, limit used
is 5
  Down solver (pre-smoother) on level 1 ---
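
One more hedged guess as to why those options were left unused: the factor
package option hangs off the PC, not the Mat, so it may need the factor
spelling, something like

  -mg_coarse_sub_pc_factor_mat_solver_type cusparse

i.e., whatever option corresponds to PCFactorSetMatSolverType() for the sub
PC (I have not verified the exact option name).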



On Sun, Jul 21, 2019 at 3:58 PM Mark Adams  wrote:

> Barry, I do NOT see communication. This is what made me think it was not
> running on the GPU. I added print statements and found that
> MatSolverTypeRegister_CUSPARSE IS called but (what it registers)
> MatGetFactor_seqaijcusparse_cusparse does NOT get called.
>
> I have a job waiting on the queue. I'll send ksp_view when it runs. I will
> try -mg_coarse_mat_solver_type cusparse. That is probably the problem.
> Maybe I should set the coarse grid solver in a more robust way in GAMG,
> like use the matrix somehow? I currently use PCSetType(pc, PCLU).
>
> I can't get an interactive shell now to run DDT, but I can try stepping
> through from MatGetFactor to see what its doing.
>
> Thanks,
> Mark
>
> On Sun, Jul 21, 2019 at 11:14 AM Smith, Barry F. 
> wrote:
>
>>
>>
>> > On Jul 21, 2019, at 8:55 AM, Mark Adams via petsc-dev <
>> petsc-dev@mcs.anl.gov> wrote:
>> >
>> > I am running ex56 with -ex56_dm_vec_type cuda -ex56_dm_mat_type
>> aijcusparse and I see no GPU communication in MatSolve (the serial LU
>> coarse grid solver).
>>
>>Do you mean to say, you DO see communication?
>>
>>What does -ksp_view should you? It should show the factor type in the
>> information about the coarse grid solve?
>>
>>You might need something like -mg_coarse_mat_solver_type cusparse
>> (because it may default to the PETSc one, it may be possible to have it
>> default to the cusparse if it exists and the matrix is of type
>> MATSEQAIJCUSPARSE).
>>
>>The determination of the MatGetFactor() is a bit involved including
>> pasting together strings and string compares and could be finding a CPU
>> factorization.
>>
>>I could run on one MPI_Rank() in the debugger and put a break 

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Mark Adams via petsc-dev
I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty
saturated at that point.

On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Here are CPU version results on one node with 24 cores, 42 cores. Click
> the links for core layout.
>
> 24 MPI ranks,
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04
> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04
> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000
> 0.00e+00  0
> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
>
> 42 MPI ranks,
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04
> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000
> 0.00e+00  0
> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04
> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000
> 0.00e+00  0
> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000
> 0.00e+00  0
>
> --Junchao Zhang
>
>
> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. 
> wrote:
>
>>
>>   Junchao,
>>
>>Very interesting. For completeness please run also 24 and 42 CPUs
>> without the GPUs. Note that the default layout for CPU cores is not good.
>> You will want 3 cores on each socket then 12 on each.
>>
>>   Thanks
>>
>>Barry
>>
>>   Since Tim is one of our reviewers next week this is a very good test
>> matrix :-)
>>
>>
>> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <
>> petsc-dev@mcs.anl.gov> wrote:
>> >
>> > Click the links to visualize it.
>> >
>> > 6 ranks
>> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
>> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU
>> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f
>> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >
>> > 24 ranks
>> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU
>> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f
>> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >
>> > --Junchao Zhang
>> >
>> >
>> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <
>> petsc-dev@mcs.anl.gov> wrote:
>> > Junchao,
>> >
>> > Can you share your 'jsrun' command so that we can see how you are
>> mapping things to resource sets?
>> >
>> > --Richard
>> >
>> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>> >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix
>> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100
>> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I
>> found MatMult was almost dominated by VecScatter in this simple test. Using
>> 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve performance. But
>> if I enabled Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I
>> found CUDA aware SF hurt performance. I don't know why and have to profile
>> it. I will also collect  data with multiple nodes. Are the matrix and tests
>> proper?
>> >>
>> >>
>> 
>> >> EventCount  Time (sec) Flop
>>   --- Global ---  --- Stage   Total   GPU- CpuToGpu -   -
>> GpuToCpu - GPU
>> >>Max Ratio  Max Ratio   Max  Ratio  Mess
>>  AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count
>>  Size   Count   Size  %F
>> >>
>> ---
>> >> 6 MPI ranks (CPU version)
>> >> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
>> 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0
>> 0.00e+000 0.00e+00  0
>> >> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03
>> 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0
>> 0.00e+000 0.00e+00  0
>> >> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00
>> 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0
>> 0.00e+000 0.00e+00  0
>> >>
>> >> 6 MPI ranks + 6 GPUs + regular SF
>> >> MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03
>> 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100
>> 1.02e+02  100 2.69e+02 100
>> >> 

Re: [petsc-dev] MatMult on Summit

2019-09-21 Thread Mark Adams via petsc-dev
On Sat, Sep 21, 2019 at 12:48 AM Smith, Barry F. via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

>
>   Junchao,
>
>Very interesting. For completeness please run also 24 and 42 CPUs
> without the GPUs. Note that the default layout for CPU cores is not good.
> You will want 3 cores on each socket then 12 on each.
>

His params are balanced; see:
https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=


>
>   Thanks
>
>Barry
>
>   Since Tim is one of our reviewers next week this is a very good test
> matrix :-)
>
>
> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > Click the links to visualize it.
> >
> > 6 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU
> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f
> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > 24 ranks
> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU
> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f
> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >
> > --Junchao Zhang
> >
> >
> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> > Junchao,
> >
> > Can you share your 'jsrun' command so that we can see how you are
> mapping things to resource sets?
> >
> > --Richard
> >
> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
> >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix
> Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100
> times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I
> found MatMult was almost dominated by VecScatter in this simple test. Using
> 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve performance. But
> if I enabled Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I
> found CUDA aware SF hurt performance. I don't know why and have to profile
> it. I will also collect  data with multiple nodes. Are the matrix and tests
> proper?
> >>
> >>
> 
> >> EventCount  Time (sec) Flop
>   --- Global ---  --- Stage   Total   GPU- CpuToGpu -   -
> GpuToCpu - GPU
> >>Max Ratio  Max Ratio   Max  Ratio  Mess
>  AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count
>  Size   Count   Size  %F
> >>
> ---
> >> 6 MPI ranks (CPU version)
> >> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
> 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0
> 0.00e+000 0.00e+00  0
> >> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03
> 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0
> 0.00e+000 0.00e+00  0
> >> VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0
> 0.00e+000 0.00e+00  0
> >>
> >> 6 MPI ranks + 6 GPUs + regular SF
> >> MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03
> 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100
> 1.02e+02  100 2.69e+02 100
> >> VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03
> 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0 0   0  0
> 0.00e+00  100 2.69e+02  0
> >> VecScatterEnd100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0 0   0  0
> 0.00e+000 0.00e+00  0
> >> VecCUDACopyTo100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0100
> 1.02e+020 0.00e+00  0
> >> VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0 0   0  0
> 0.00e+00  100 2.69e+02  0
> >>
> >> 6 MPI ranks + 6 GPUs + CUDA-aware SF
> >> MatMult  100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03
> 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0
> 0.00e+000 0.00e+00 100
> >> VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03
> 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0 0   0  0
> 0.00e+000 0.00e+00  0
> >> VecScatterEnd100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00
> 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0 0   0  0
> 0.00e+000 0.00e+00  0
> >>
> >> 24 MPI ranks + 6 GPUs + regular SF
> >> MatMult  100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04
> 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100
> 4.61e+01  100 6.72e+01 100
> 

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-28 Thread Mark Adams via petsc-dev
On Sat, Sep 28, 2019 at 12:55 AM Karl Rupp  wrote:

> Hi Mark,
>
> > OK, so now the problem has shifted somewhat in that it now manifests
> > itself on small cases.


It is somewhat random and anecdotal, but it does happen on the smaller test
problem now. When I try to narrow down where the problem manifests by
reducing the number of GPUs/procs, the problem cannot be too small (i.e., the
bug does not manifest on even smaller problems).

But it is much more stable and there does seem to be only this one problem
with mat-transpose-mult. You made a lot of progress.


> In earlier investigation I was drawn to
> > MatTranspose but had a hard time pinning it down. The bug seems more
> > stable now or you probably fixed what looks like all the other bugs.
> >
> > I added print statements with norms of vectors in mg.c (v-cycle) and
> > found that the diffs between the CPU and GPU runs came in MatRestrict,
> > which calls MatMultTranspose. I added identical print statements in the
> > two versions of MatMultTranspose and see this. (pinning to the CPU does
> > not seem to make any difference). Note that the problem comes in the 2nd
> > iteration where the *output* vector is non-zero coming in (this should
> > not matter).
> >
> > Karl, I zeroed out the output vector (yy) when I come into this method
> > and it fixed the problem. This is with -n 4, and this always works with
> > -n 3. See the attached process layouts. It looks like this comes when
> > you use the 2nd socket.
> >
> > So this looks like an Nvidia bug. Let me know what you think and I can
> > pass it on to ORNL.
>
> Hmm, there were some issues with MatMultTranspose_MPIAIJ at some point.
> I've addressed some of them, but I can't confidently say that all of the
> issues were fixed. Thus, I don't think it's a problem in NVIDIA's
> cuSparse, but rather something we need to fix in PETSc. Note that the
> problem shows up with multiple MPI ranks;


It seems to need to use two sockets. My current test works with 1,2, and 3
GPUs (one socket) but fails with 4, when you go to the second socket.


> if it were a problem in
> cuSparse, it would show up on a single rank as well.
>

What I am seeing is consistent with CUSPARSE having a race condition in
zeroing out the output vector in some way, but I don't know.


>
> Best regards,
> Karli
>
>
>
>
>
> > 06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun*-n 4 *-a 4 -c 4 -g 1
> > ./ex56 -cells 8,12,16 *-ex56_dm_vec_type cuda -ex56_dm_mat_type
> aijcusparse*
> > [0] 3465 global equations, 1155 vertices
> > [0] 3465 equations in vector, 1155 vertices
> >0 SNES Function norm 1.725526579328e+01
> >  0 KSP Residual norm 1.725526579328e+01
> >  2) call Restrict with |r| = 1.402719214830704e+01
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.40271921483070e+01
> > *MatMultTranspose_MPIAIJ |y in| =
> > 0.00e+00
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 3.43436359545813e+00
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.29055494844681e+01
> >  3) |R| = 1.290554948446808e+01
> >  2) call Restrict with |r| = 4.109771717986951e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.10977171798695e+00
> > *MatMultTranspose_MPIAIJ |y in| =
> > 0.00e+00
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.79415048609144e-01
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 9.01083013948788e-01
> >  3) |R| = 9.010830139487883e-01
> >  4) |X| = 2.864698671963022e+02
> >  5) |x| = 9.76328911783e+02
> >  6) post smooth |x| = 8.940011621494751e+02
> >  4) |X| = 8.940011621494751e+02
> >  5) |x| = 1.005081556495388e+03
> >  6) post smooth |x| = 1.029043994031627e+03
> >  1 KSP Residual norm 8.102614049404e+00
> >  2) call Restrict with |r| = 4.402603749876137e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.40260374987614e+00
> > *MatMultTranspose_MPIAIJ |y in| =
> > 1.29055494844681e+01
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.68544559626318e+00
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.82129824300863e+00
> >  3) |R| = 1.821298243008628e+00
> >  2) call Restrict with |r| = 1.068309793900564e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.06830979390056e+00
> >  

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-28 Thread Mark Adams via petsc-dev
The logic is basically correct because I simply zero out the yy vector (the
output vector) and it runs great now. The numerics look fine without CPU
pinning.

AND, it worked with 1, 2, and 3 GPUs (one node, one socket), but failed with
4 GPUs, which uses the second socket. Strange.
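
To be concrete, the change I am testing amounts to this on entry to
MatMultTranspose_MPIAIJCUSPARSE (a sketch -- the exact call I used may differ
slightly):

  ierr = VecSet(yy,0.0);CHKERRQ(ierr); /* zero the output vector before the transpose multiply */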

On Sat, Sep 28, 2019 at 3:43 AM Stefano Zampini 
wrote:

> Mark,
>
>
> MatMultTransposeAdd_SeqAIJCUSPARSE checks if the matrix is in compressed
> row storage, MatMultTranspose_SeqAIJCUSPARSE does not. Could this be the
> issue? The CUSPARSE classes are kind of messy
>
>
>
> On Sat, Sep 28, 2019 at 07:55, Karl Rupp via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> Hi Mark,
>>
>> > OK, so now the problem has shifted somewhat in that it now manifests
>> > itself on small cases. In earlier investigation I was drawn to
>> > MatTranspose but had a hard time pinning it down. The bug seems more
>> > stable now or you probably fixed what looks like all the other bugs.
>> >
>> > I added print statements with norms of vectors in mg.c (v-cycle) and
>> > found that the diffs between the CPU and GPU runs came in MatRestrict,
>> > which calls MatMultTranspose. I added identical print statements in the
>> > two versions of MatMultTranspose and see this. (pinning to the CPU does
>> > not seem to make any difference). Note that the problem comes in the
>> 2nd
>> > iteration where the *output* vector is non-zero coming in (this should
>> > not matter).
>> >
>> > Karl, I zeroed out the output vector (yy) when I come into this method
>> > and it fixed the problem. This is with -n 4, and this always works with
>> > -n 3. See the attached process layouts. It looks like this comes when
>> > you use the 2nd socket.
>> >
>> > So this looks like an Nvidia bug. Let me know what you think and I can
>> > pass it on to ORNL.
>>
>> Hmm, there were some issues with MatMultTranspose_MPIAIJ at some point.
>> I've addressed some of them, but I can't confidently say that all of the
>> issues were fixed. Thus, I don't think it's a problem in NVIDIA's
>> cuSparse, but rather something we need to fix in PETSc. Note that the
>> problem shows up with multiple MPI ranks; if it were a problem in
>> cuSparse, it would show up on a single rank as well.
>>
>> Best regards,
>> Karli
>>
>>
>>
>>
>>
>> > 06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun*-n 4 *-a 4 -c 4 -g 1
>> > ./ex56 -cells 8,12,16 *-ex56_dm_vec_type cuda -ex56_dm_mat_type
>> aijcusparse*
>> > [0] 3465 global equations, 1155 vertices
>> > [0] 3465 equations in vector, 1155 vertices
>> >0 SNES Function norm 1.725526579328e+01
>> >  0 KSP Residual norm 1.725526579328e+01
>> >  2) call Restrict with |r| = 1.402719214830704e+01
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 1.40271921483070e+01
>> > *MatMultTranspose_MPIAIJ |y in| =
>> > 0.00e+00
>> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
>> > 0.00e+00
>> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
>> > 3.43436359545813e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
>> > 1.29055494844681e+01
>> >  3) |R| = 1.290554948446808e+01
>> >  2) call Restrict with |r| = 4.109771717986951e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 4.10977171798695e+00
>> > *MatMultTranspose_MPIAIJ |y in| =
>> > 0.00e+00
>> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
>> > 0.00e+00
>> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
>> > 1.79415048609144e-01
>> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
>> > 9.01083013948788e-01
>> >  3) |R| = 9.010830139487883e-01
>> >  4) |X| = 2.864698671963022e+02
>> >  5) |x| = 9.76328911783e+02
>> >  6) post smooth |x| = 8.940011621494751e+02
>> >  4) |X| = 8.940011621494751e+02
>> >  5) |x| = 1.005081556495388e+03
>> >  6) post smooth |x| = 1.029043994031627e+03
>> >  1 KSP Residual norm 8.102614049404e+00
>> >  2) call Restrict with |r| = 4.402603749876137e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 4.40260374987614e+00
>> > *MatMultTranspose_MPIAIJ |y in| =
>> > 1.29055494844681e+01
>> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
>> > 0.00e+00
>> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
>> > 1.68544559626318e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
>> > 1.82129824300863e+00
>> >  3) |R| = 1.821298243008628e+00
>> >  2) call Restrict with |r| = 1.068309793900564e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 1.06830979390056e+00

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-26 Thread Mark Adams via petsc-dev
Karl, I have it running but I am not seeing any difference from master. I
wonder if I have the right version:

Using Petsc Development GIT revision: v3.11.3-2207-ga8e311a

I could not find karlrupp/fix-cuda-streams on the gitlab page to check your
last commit SHA1 (???), and now I get:

08:37 karlrupp/fix-cuda-streams= ~/petsc-karl$ git pull origin
karlrupp/fix-cuda-streams
fatal: Couldn't find remote ref karlrupp/fix-cuda-streams
Unexpected end of command stream
10:09 1 karlrupp/fix-cuda-streams= ~/petsc-karl$



On Wed, Sep 25, 2019 at 8:51 AM Karl Rupp  wrote:

>
> > I double checked that a clean build of your (master) branch has this
> > error by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > stuff from Barry that is not yet in master, works.
>
> so did master work recently (i.e. right before my branch got merged)?
>
> Best regards,
> Karli
>
>
>
> >
> > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> >
> >
> >
> > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> >  > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > get this
> >  > error:
> >  >
> >  > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe
> -n 1
> >  > printenv']":
> >  > Error, invalid argument:  1
> >  >
> >  > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> > did edit
> >  > the jsrun command but Karl's branch still fails. (SUMMIT was down
> > today
> >  > so there could have been updates).
> >  >
> >  > Any suggestions?
> >
> > Looks very much like a systems issue to me.
> >
> > Best regards,
> > Karli
> >
>


Re: [petsc-dev] getting eigen estimates from GAMG to CHEBY

2019-09-26 Thread Mark Adams via petsc-dev
>
> Okay, it seems like they should be stored in GAMG.
>

Before, we stored them in the matrix. By the time you get to the test in
Cheby you don't have the caller (GAMG) anymore.


> Why would the PC type change anything?
>

Oh, the eigenvalues are the preconditioned ones, so the PC (Jacobi) matters,
but the estimate is not too sensitive to the normal PCs that you would use in
a smoother, and it is probably not an underestimate.
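
(For context, what the smoother ultimately consumes is a call like the one
below; the 1.05/0.05 safety factors here are made up for illustration, not
what GAMG actually uses, and "smoother"/"emax" are assumed names:)

  /* sketch: pass bounds on the spectrum of the *preconditioned* operator to Chebyshev */
  ierr = KSPChebyshevSetEigenvalues(smoother,1.05*emax,0.05*emax);CHKERRQ(ierr);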


>
>   Thanks,
>
> Matt
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> 
>


Re: [petsc-dev] getting eigen estimates from GAMG to CHEBY

2019-09-27 Thread Mark Adams via petsc-dev
As I recall we attached the eigen estimates to the matrix. Is that old attach
mechanism still used/recommended? Or is there a better way to do this now?
Thanks,
Mark
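
(The attach mechanism I mean is presumably PetscObjectCompose()/PetscObjectQuery()
with a PetscContainer holding the estimates; a minimal sketch, where the key
string, Amat, emin, and emax are assumed names:)

  /* attach side (e.g., in GAMG setup) */
  PetscContainer c;
  PetscReal      *ev;
  ierr  = PetscMalloc1(2,&ev);CHKERRQ(ierr);
  ev[0] = emin; ev[1] = emax;
  ierr  = PetscContainerCreate(PetscObjectComm((PetscObject)Amat),&c);CHKERRQ(ierr);
  ierr  = PetscContainerSetPointer(c,ev);CHKERRQ(ierr);
  ierr  = PetscObjectCompose((PetscObject)Amat,"gamg_eig_estimates",(PetscObject)c);CHKERRQ(ierr);
  ierr  = PetscContainerDestroy(&c);CHKERRQ(ierr); /* a real version would also set a user destroy to free ev */

  /* query side (e.g., in the Chebyshev smoother setup) */
  ierr = PetscObjectQuery((PetscObject)Amat,"gamg_eig_estimates",(PetscObject*)&c);CHKERRQ(ierr);
  if (c) { ierr = PetscContainerGetPointer(c,(void**)&ev);CHKERRQ(ierr); }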

On Thu, Sep 26, 2019 at 7:45 AM Mark Adams  wrote:

>
>
>> Okay, it seems like they should be stored in GAMG.
>>
>
> Before we stored them in the matrix. When you get to the test in Cheby you
> don't have caller anymore (GAMG).
>
>
>> Why would the PC type change anything?
>>
>
> Oh, the eigenvalues are the preconditioned ones, the PC (Jacobi) matters
> but it is not too sensitive to normal PCs that you would use in a smoother
> and it is probably not an understatement.
>
>
>>
>>   Thanks,
>>
>> Matt
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> 
>>
>


Re: [petsc-dev] MatMult on Summit

2019-09-24 Thread Mark Adams via petsc-dev
Yes, please, thank you.

On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Karl, that would be fantastic. Much obliged!
>
> --Richard
>
> On 9/23/19 8:09 PM, Karl Rupp wrote:
>
> Hi,
>
> `git grep cudaStreamCreate` reports that vectors, matrices and scatters
> create their own streams. This will almost inevitably create races (there
> is no synchronization mechanism implemented), unless one calls WaitForGPU()
> after each operation. Some of the non-deterministic tests can likely be
> explained by this.
>
> I'll clean this up in the next few hours if there are no objections.
>
> Best regards,
> Karli
>
>
>
> On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
>
> I'm no CUDA expert (not yet, anyway), but, from what I've read, the
> default stream (stream 0) is (mostly) synchronous to host and device, so
> WaitForGPU() is not needed in that case. I don't know if there is any
> performance penalty in explicitly calling it in that case, anyway.
>
> In any case, it looks like there are still some cases where potentially
> asynchronous CUDA library calls are being "timed" without a WaitForGPU() to
> ensure that the calls actually complete. I will make a pass through the
> aijcusparse and aijviennacl code looking for these.
>
> --Richard
>
> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>
> It looks cusparsestruct->stream is always created (not NULL).  I don't
> know logic of the "if (!cusparsestruct->stream)".
> --Junchao Zhang
>
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <
> petsc-dev@mcs.anl.gov 
> > wrote:
>
> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
> the end of the function it had
>
>   if (!yy) { /* MatMult */
> if (!cusparsestruct->stream) {
>   ierr = WaitForGPU();CHKERRCUDA(ierr);
> }
>   }
>
> I assume we don't need the logic to do this only in the MatMult()
> with no add case and should just do this all the time, for the
> purposes of timing if no other reason. Is there some reason to NOT
> do this because of worries the about effects that these
> WaitForGPU() invocations might have on performance?
>
> I notice other problems in aijcusparse.cu 
> ,
> now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
> see that we have GPU timing calls around the cusparse_csr_spmv()
> (but no WaitForGPU() inside the timed region). I believe this is
> another area in which we get a meaningless timing. It looks like
> we need a WaitForGPU() there, and then maybe inside the timed
> region handling the scatter. (I don't know if this stuff happens
> asynchronously or not.) But do we potentially want two
> WaitForGPU() calls in one function, just to help with getting
> timings? I don't have a good idea of how much overhead this adds.
>
> --Richard
>
> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>
> I made the following changes:
> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
> The old code swapped the first two lines. Since with
> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
> order to have better overlap.
>   ierr =
>
> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>   ierr =
>
> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
> 3) Log time directly in the test code so we can also know
> execution time without -log_view (hence cuda synchronization). I
> manually calculated the Total Mflop/s for these cases for easy
> comparison.
>
> <>
>
>
> 
> EventCount  Time (sec) Flop
>  --- Global ---  --- Stage   Total   GPU-
> CpuToGpu -   - GpuToCpu - GPU
>Max Ratio  Max Ratio   Max  Ratio  Mess
> AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s
> Count   Size   Count   Size  %F
>
> ---
> 6 MPI ranks,
> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
> 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743   0 0
> 0.00e+000 0.00e+00  0
> VecScatterBegin  100 1.0 

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mark Adams via petsc-dev
Note, the numerical problems that we have look a lot like a race condition
of some sort. Happens with empty processors and goes away under
cuda-memcheck (valgrind like thing).

I did try adding WaitForGPU(), but maybe I didn't do it right, or there are
other synchronization mechanisms at play.
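
What I tried was roughly the pattern below at the end of the transpose
multiply (a sketch; the placement is my guess at what is needed, which is
exactly what I am unsure about):

  ierr = WaitForGPU();CHKERRCUDA(ierr);      /* block until the cusparse kernel has actually finished */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr); /* only then stop the GPU timer */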


On Mon, Sep 23, 2019 at 6:28 PM Zhang, Junchao via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> It looks cusparsestruct->stream is always created (not NULL).  I don't
> know logic of the "if (!cusparsestruct->stream)".
> --Junchao Zhang
>
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end
>> of the function it had
>>
>>   if (!yy) { /* MatMult */
>> if (!cusparsestruct->stream) {
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>> }
>>   }
>>
>> I assume we don't need the logic to do this only in the MatMult() with no
>> add case and should just do this all the time, for the purposes of timing
>> if no other reason. Is there some reason to NOT do this because of worries
>> the about effects that these WaitForGPU() invocations might have on
>> performance?
>>
>> I notice other problems in aijcusparse.cu, now that I look closer. In
>> MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls
>> around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed
>> region). I believe this is another area in which we get a meaningless
>> timing. It looks like we need a WaitForGPU() there, and then maybe inside
>> the timed region handling the scatter. (I don't know if this stuff happens
>> asynchronously or not.) But do we potentially want two WaitForGPU() calls
>> in one function, just to help with getting timings? I don't have a good
>> idea of how much overhead this adds.
>>
>> --Richard
>>
>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>
>> I made the following changes:
>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old
>> code swapped the first two lines. Since with
>> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the order to
>> have better overlap.
>>   ierr =
>> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>   ierr =
>> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>> 3) Log time directly in the test code so we can also know execution
>> time without -log_view (hence cuda synchronization). I manually calculated
>> the Total Mflop/s for these cases for easy comparison.
>>
>> <>
>>
>>
>> 
>> EventCount  Time (sec) Flop
>>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   -
>> GpuToCpu - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size
>> Count   Size  %F
>>
>> ---
>> 6 MPI ranks,
>> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05
>> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05
>> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>>
>> 24 MPI ranks
>> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04
>> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04
>> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>>
>> 42 MPI ranks
>> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04
>> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04
>> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 

[petsc-dev] CUDA STREAMS

2019-10-02 Thread Mark Adams via petsc-dev
I found a CUDAVersion.cu of STREAMS and tried to build it. I got it to
compile manually with:

nvcc -o CUDAVersion.o -ccbin pgc++
-I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/include
-Wno-deprecated-gpu-targets -c --compiler-options="-g
-I/ccs/home/adams/petsc/include
-I/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/include   "
`pwd`/CUDAVersion.cu
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu(22): warning: conversion
from a string literal to "char *" is deprecated
 

And this did produce a .o file. But I get this when I try to link.

make -f makestreams CUDAVersion
mpicc -g -fast  -o CUDAVersion CUDAVersion.o
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/pgi.ld
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
-L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
-L/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
-Wl,-rpath,/usr/lib/gcc/ppc64le-redhat-linux/4.8.5
-L/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -lpetsc -llapack -lblas
-lparmetis -lmetis -lstdc++ -ldl -lpthread -lmpiprofilesupport
-lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lpgf90rtl -lpgf90 -lpgf90_rpm1
-lpgf902 -lpgftnrtl -latomic -lpgkomp -lomp -lomptarget -lpgmath -lpgc -lrt
-lmass_simdp9 -lmassvp9 -lmassp9 -lm -lgcc_s -lstdc++ -ldl
CUDAVersion.o: In function `setupStream(long, PetscBool, PetscBool)':
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:394: undefined reference
to `cudaGetDeviceCount'
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:406: undefined reference
to `cudaSetDevice'
 

I have compared this link line with working examples and it looks the same.
There is no .c file here -- main is in the .cu file. I assume that is the
difference.
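
One guess: the undefined symbols (cudaGetDeviceCount, cudaSetDevice) are CUDA
runtime calls, so maybe the CUDA runtime library is simply not on this link
line and something like

mpicc -g -fast -o CUDAVersion CUDAVersion.o  -L$CUDA_DIR/lib64 -lcudart

is needed (the library path / environment variable here is a guess for this
machine).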

Any ideas?
Thanks,
Mark


Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?

2019-10-02 Thread Mark Adams via petsc-dev
FWIW, I've heard that CUSPARSE is going to provide integer matrix-matrix
products for indexing applications, and that it should be easy to extend
that to double, etc.

On Wed, Oct 2, 2019 at 6:00 PM Mills, Richard Tran via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Fellow PETSc developers,
>
> I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
> support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult() in
> PETSc parlance) routines provided by cuSPARSE and ViennaCL, respectively.
> Is there a good reason that I shouldn't add those? My guess is that support
> was not added because SpGEMM is hard to do well on a GPU compared to many
> CPUs (it is hard to compete with, say, Intel Xeon CPUs with their huge
> caches) and it has been the case that one would generally be better off
> doing these operations on the CPU. Since the trend at the big
> supercomputing centers seems to be to put more and more of the
> computational power into GPUs, I'm thinking that I should add the option to
> use the GPU library routines for SpGEMM, though. Is there some good reason
> to *not* do this that I am not aware of? (Maybe the CPUs are better for
> this even on a machine like Summit, but I think we're at the point that we
> should at least be able to experimentally verify this.)
>
> --Richard
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
>
> If jsrun is not functional from configure, alternatives are
> --with-mpiexec=/bin/true or --with-batch=1
>
>
--with-mpiexec=/bin/true  seems to be working.

Thanks,
Mark


> Satish
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
On Wed, Sep 25, 2019 at 8:40 PM Balay, Satish  wrote:

> > Unable to run jsrun -g 1 with option "-n 1"
> > Error: It is only possible to use js commands within a job allocation
> > unless CSM is running
>
>
> Nope  this is a different error message.
>
> The message suggests - you can't run 'jsrun -g 1 -n 1 binary' Can you try
> this manually and see
> what you get?
>
> jsrun -g 1 -n 1 printenv
>

I tested this earlier today, and also originally when I was figuring out a
minimal run command:

22:08  /gpfs/alpine/geo127/scratch/adams$ jsrun -g 1 -n 1 printenv
GIT_PS1_SHOWDIRTYSTATE=1
XDG_SESSION_ID=494
SHELL=/bin/bash
HISTSIZE=100
PETSC_ARCH=arch-summit-opt64-pgi-cuda
SSH_CLIENT=160.91.202.152 48626 22
LC_ALL=
USER=adams
 ...


>
> Satish
>
>
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>
> > On Wed, Sep 25, 2019 at 6:23 PM Balay, Satish  wrote:
> >
> > > > 18:16 (cb53a04...) ~/petsc-karl$
> > >
> > > So this is the commit I recommended you test against - and that's what
> > > you have got now. Please go ahead and test.
> > >
> > >
> > I sent the log for this. This is the output:
> >
> > 18:16 (cb53a04...) ~/petsc-karl$ ../arch-summit-opt64idx-pgi-cuda.py
> > PETSC_DIR=$PWD
> >
> ===
> >  Configuring PETSc to compile on your system
> >
> >
> ===
> >
> ===
> >
> > * WARNING: F77 (set to
> >
> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linux
> >   use ./configure F77=$F77 if you really want to use that value
> **
> >
> >
> >
> ===
> >
> >
> >
> ===
> >
> > * WARNING: Using default optimization C flags -O
> >
> >You might consider manually
> setting
> > optimal optimization flags for your system with
> >
> >  COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py for
> > examples
> >
> >
> ===
> >
> >
> >
> ===
> >
> > * WARNING: You have an older version of Gnu make,
> > it will work,
> > but may not support all the
> > parallel testing options. You can install the
> >   latest
> > Gnu make with your package manager, such as brew or macports, or use
> >
> > the --download-make option to get the latest Gnu make warning
> > message *
> >
> >
> ===
> >
> >   TESTING: configureMPIEXEC from
> > config.packages.MPI(config/BuildSystem/config/packages/MPI.py:174)
> >
> >
> ***
> >  UNABLE to CONFIGURE with GIVEN OPTIONS(see configure.log for
> > details):
> >
> ---
> > Unable to run jsrun -g 1 with option "-n 1"
> > Error: It is only possible to use js commands within a job allocation
> > unless CSM is running
> > 09-25-2019 18:20:13:224 108023 main: Error initializing RM connection.
> > Exiting.
> >
> ***********
> >
> > 18:20 1 (cb53a04...) ~/petsc-karl$
> >
> >
> > > [note: the branch is rebased - so 'git pull' won't work -(as you can
> > > see from the "(forced update)" message - and '<>' status from git
> > > prompt on balay/fix-mpiexec-shell-escape). So perhaps its easier to
> > > deal with in detached mode - which makes this obvious]
> > >
> >
> > I got this <> and "fixed" it by deleting the branch and repulling it. I
> > guess I needed to fetch also.
> >
> > Mark
> >
> >
> > >
> > > Satish
> > &

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
On Wed, Sep 25, 2019 at 6:23 PM Balay, Satish  wrote:

> > 18:16 (cb53a04...) ~/petsc-karl$
>
> So this is the commit I recommended you test against - and that's what
> you have got now. Please go ahead and test.
>
>
I sent the log for this. This is the output:

18:16 (cb53a04...) ~/petsc-karl$ ../arch-summit-opt64idx-pgi-cuda.py
PETSC_DIR=$PWD
===
 Configuring PETSc to compile on your system

===
===

* WARNING: F77 (set to
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linux
  use ./configure F77=$F77 if you really want to use that value **


===


===

* WARNING: Using default optimization C flags -O

   You might consider manually setting
optimal optimization flags for your system with

 COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py for
examples

 ===


===

* WARNING: You have an older version of Gnu make,
it will work,
but may not support all the
parallel testing options. You can install the
  latest
Gnu make with your package manager, such as brew or macports, or use

the --download-make option to get the latest Gnu make warning
message *

===

  TESTING: configureMPIEXEC from
config.packages.MPI(config/BuildSystem/config/packages/MPI.py:174)

***
 UNABLE to CONFIGURE with GIVEN OPTIONS(see configure.log for
details):
---
Unable to run jsrun -g 1 with option "-n 1"
Error: It is only possible to use js commands within a job allocation
unless CSM is running
09-25-2019 18:20:13:224 108023 main: Error initializing RM connection.
Exiting.
***

18:20 1 (cb53a04...) ~/petsc-karl$


> [note: the branch is rebased - so 'git pull' won't work -(as you can
> see from the "(forced update)" message - and '<>' status from git
> prompt on balay/fix-mpiexec-shell-escape). So perhaps its easier to
> deal with in detached mode - which makes this obvious]
>

I got this <> and "fixed" it by deleting the branch and repulling it. I
guess I needed to fetch also.

Mark


>
> Satish
>
>
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>
> > I will test this now but 
> >
> > 17:52 balay/fix-mpiexec-shell-escape= ~/petsc-karl$ git fetch
> > remote: Enumerating objects: 119, done.
> > remote: Counting objects: 100% (119/119), done.
> > remote: Compressing objects: 100% (91/91), done.
> > remote: Total 119 (delta 49), reused 74 (delta 28)
> > Receiving objects: 100% (119/119), 132.88 KiB | 0 bytes/s, done.
> > Resolving deltas: 100% (49/49), completed with 1 local objects.
> > From https://gitlab.com/petsc/petsc
> >  + b5e99a5...cb53a04 balay/fix-mpiexec-shell-escape ->
> > origin/balay/fix-mpiexec-shell-escape  (forced update)
> >  + ffdc635...7eeb5f9 jczhang/feature-sf-on-gpu ->
> > origin/jczhang/feature-sf-on-gpu  (forced update)
> >cb9de97..f9ff08a  jolivet/fix-error-col-row ->
> > origin/jolivet/fix-error-col-row
> >40ea605..de5ad60  oanam/jacobf/cell-to-ref-mapping ->
> > origin/oanam/jacobf/cell-to-ref-mapping
> >  + ecac953...9fb579e stefanozampini/hypre-cuda-rebased ->
> > origin/stefanozampini/hypre-cuda-rebased  (forced update)
> > 18:16 balay/fix-mpiexec-shell-escape<> ~/petsc-karl$ git checkout
> > origin/balay/fix-mpiexec-shell-escape
> > Note: checking out 'origin/balay/fix-mpiexec-shell-escape'.
> >
> > You are in 'detached HEAD' state. You can look around, make experimental
> > changes and commit them, and you can discard any commits you make in this
> > state without impacting any branches by performing another checkout.
> >
> > If you want to create a new branch to retain commits you create, you may

Re: [petsc-dev] [petsc-maint] running CUDA on SUMMIT

2019-07-09 Thread Mark Adams via petsc-dev
I am stumped with this GPU bug(s). Maybe someone has an idea.

I did find a bug in the cuda transpose mat-vec that cuda-memcheck detected,
but I still have differences between the GPU and CPU transpose mat-vec.
I've got it down to a very simple test: bicg/none on a tiny mesh with two
processors. It works on one processor or with cg/none. So it is the
transpose mat-vec.

I see that the result of the off-diagonal (a->lvec) is different *only on
proc 1*. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat
and vec and printed out matlab vectors. Below is the CPU output and then
the GPU with a view of the scatter object, which is identical as you can
see.
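
For reference, a minimal sketch of that instrumentation as it might sit inside
MatMultTranspose_MPIAIJ (a, A, xx and ierr are the names already in scope in
that routine; the per-rank output file name is my own):

{ /* debugging sketch: print norms and dump a->lvec in MATLAB format so the
     CPU and GPU runs can be diffed (cf. lvcpu.m vs lvgpu.m below) */
  PetscMPIInt rank;
  PetscReal   nx, nl, nB;
  PetscViewer viewer;
  char        fname[PETSC_MAX_PATH_LEN];

  ierr = MPI_Comm_rank(PetscObjectComm((PetscObject)A), &rank);CHKERRQ(ierr);
  ierr = VecNorm(xx, NORM_2, &nx);CHKERRQ(ierr);
  ierr = VecNorm(a->lvec, NORM_2, &nl);CHKERRQ(ierr);
  ierr = MatNorm(a->B, NORM_FROBENIUS, &nB);CHKERRQ(ierr);
  ierr = PetscSynchronizedPrintf(PetscObjectComm((PetscObject)A),
           "[%d] |x|= %e |a->lvec|= %e |B|= %e\n", rank, (double)nx, (double)nl, (double)nB);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(PetscObjectComm((PetscObject)A), PETSC_STDOUT);CHKERRQ(ierr);
  /* each rank writes its local (sequential) a->lvec as a MATLAB vector */
  ierr = PetscSNPrintf(fname, sizeof(fname), "lvec_rank%d.m", rank);CHKERRQ(ierr);
  ierr = PetscViewerASCIIOpen(PETSC_COMM_SELF, fname, &viewer);CHKERRQ(ierr);
  ierr = PetscViewerPushFormat(viewer, PETSC_VIEWER_ASCII_MATLAB);CHKERRQ(ierr);
  ierr = VecView(a->lvec, viewer);CHKERRQ(ierr);
  ierr = PetscViewerPopFormat(viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
}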

The matlab B matrix and xx vector are identical. Maybe the GPU copy
is wrong ...

The only/first difference between CPU and GPU is a->lvec (the off-diagonal
contribution) on processor 1 (you can see the norms are *different*). Here
is the diff on the process 1 a->lvec vector (all values are off).

Any thoughts would be appreciated,
Mark

15:30 1  /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m
2,12c2,12
< %  type: seqcuda
< Vec_0x53738630_0 = [
< 9.5702137431412879e+00
< 2.1970298791152253e+01
< 4.5422290209190646e+00
< 2.0185031807270226e+00
< 4.2627312508573375e+01
< 1.0889191983882025e+01
< 1.6038202417695462e+01
< 2.7155672033607665e+01
< 6.2540357853223556e+00
---
> %  type: seq
> Vec_0x3a546440_0 = [
> 4.5565851251714653e+00
> 1.0460532998971189e+01
> 2.1626531807270220e+00
> 9.6105288923182408e-01
> 2.0295782656035659e+01
> 5.1845791066529463e+00
> 7.6361340020576058e+00
> 1.2929401011659799e+01
> 2.9776812928669392e+00

15:15 130  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1
./ex56 -cells 2,2,1
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
  0 SNES Function norm 1.223958326481e+02
0 KSP Residual norm 1.223958326481e+02
[0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
 1.424708937136e+00
[1] |x|=  1.223958326481e+02 |a->lvec|=  *2.844171413778e+01* |B|=
 1.424708937136e+00
[1] 1) |yy|=  2.007423334680e+02
[0] 1) |yy|=  2.007423334680e+02
[0] 2) |yy|=  1.957605719265e+02
[1] 2) |yy|=  1.957605719265e+02
[1] Number sends = 1; Number to self = 0
[1]   0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1] 0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
1 KSP Residual norm 8.199932342150e+01
  Linear solve did not converge due to DIVERGED_ITS iterations 1
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0


15:19  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 ./ex56
-cells 2,2,1 *-ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda*
[0] 27 global equations, 9 vertices
[0] 27 equations in vector, 9 vertices
  0 SNES Function norm 1.223958326481e+02
0 KSP Residual norm 1.223958326481e+02
[0] |x|=  1.223958326481e+02 |a->lvec|=  1.773965489475e+01 |B|=
 1.424708937136e+00
[1] |x|=  1.223958326481e+02 |a->lvec|=  *5.973624458725e+01* |B|=
 1.424708937136e+00
[0] 1) |yy|=  2.007423334680e+02
[1] 1) |yy|=  2.007423334680e+02
[0] 2) |yy|=  1.953571867633e+02
[1] 2) |yy|=  1.953571867633e+02
[1] Number sends = 1; Number to self = 0
[1]   0 length = 9 to whom 0
Now the indices for all remote sends (in order by process sent to)
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] Number receives = 1; Number from self = 0
[1] 0 length 9 from whom 0
Now the indices for all remote receives (in order by process received from)
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
1 KSP Residual norm 8.199932342150e+01


Re: [petsc-dev] Parmetis bug

2019-11-10 Thread Mark Adams via petsc-dev
Fande, It looks to me like this branch in ParMetis must be taken to trigger
this error. First *Match_SHEM* and then CreateCoarseGraphNoMask.

  /* determine which matching scheme you will use */
  switch (ctrl->ctype) {
    case METIS_CTYPE_RM:
      Match_RM(ctrl, graph);
      break;
    case METIS_CTYPE_SHEM:
      if (eqewgts || graph->nedges == 0)
        Match_RM(ctrl, graph);
      else
        Match_SHEM(ctrl, graph);   /* <-- the branch that must be taken */
      break;
    default:
      gk_errexit(SIGERR, "Unknown ctype: %d\n", ctrl->ctype);
  }

---

  /* Check if the mask-version of the code is a good choice */
  mask = HTLENGTH;
  if (cnvtxs < 2*mask || graph->nedges/graph->nvtxs > mask/20) {
CreateCoarseGraphNoMask(ctrl, graph, cnvtxs, match);
return;
  }



The actual error is in CreateCoarseGraphNoMask: graph->cmap is too small and
this gets garbage (parmetis coarsen.c:856):

istart = xadj[v];
iend   = xadj[v+1];
for (j=istart; j<iend; j++) { ...

On Sun, Nov 10, 2019, Mark Adams  wrote:

> Fande, the problem is that k below seems to index beyond the end of htable,
> resulting in a crazy m and a segv on the last line below.
>
> I don't have a clean valgrind machine now, that is what is needed if no
> one has seen anything like this. I could add a test in a MR and get the
> pipeline to do it.
>
> void CreateCoarseGraphNoMask(ctrl_t *ctrl, graph_t *graph, idx_t cnvtxs,
>  idx_t *match)
> {
>   idx_t j, k, m, istart, iend, nvtxs, nedges, ncon, cnedges, v, u, dovsize;
>   idx_t *xadj, *vwgt, *vsize, *adjncy, *adjwgt;
>   idx_t *cmap, *htable;
>   idx_t *cxadj, *cvwgt, *cvsize, *cadjncy, *cadjwgt;
>   graph_t *cgraph;
>   WCOREPUSH;
>
>   dovsize = (ctrl->objtype == METIS_OBJTYPE_VOL ? 1 : 0);
>
>   IFSET(ctrl->dbglvl, METIS_DBG_TIME, gk_startcputimer(ctrl->ContractTmr));
>
>   nvtxs   = graph->nvtxs;
>   ncon= graph->ncon;
>   xadj= graph->xadj;
>   vwgt= graph->vwgt;
>   vsize   = graph->vsize;
>   adjncy  = graph->adjncy;
>   adjwgt  = graph->adjwgt;
>   cmap= graph->cmap;
>
>
>   /* Initialize the coarser graph */
>   cgraph = SetupCoarseGraph(graph, cnvtxs, dovsize);
>   cxadj= cgraph->xadj;
>   cvwgt= cgraph->vwgt;
>   cvsize   = cgraph->vsize;
>   cadjncy  = cgraph->adjncy;
>   cadjwgt  = cgraph->adjwgt;
>
>   htable = iset(cnvtxs, -1, iwspacemalloc(ctrl, cnvtxs));
>
>   cxadj[0] = cnvtxs = cnedges = 0;
>   for (v=0; v<nvtxs; v++) {
> if ((u = match[v]) < v)
>   continue;
>
> ASSERT(cmap[v] == cnvtxs);
> ASSERT(cmap[match[v]] == cnvtxs);
>
> if (ncon == 1)
>   cvwgt[cnvtxs] = vwgt[v];
> else
>   icopy(ncon, vwgt+v*ncon, cvwgt+cnvtxs*ncon);
>
> if (dovsize)
>   cvsize[cnvtxs] = vsize[v];
>
> nedges = 0;
>
> istart = xadj[v];
> iend   = xadj[v+1];
> for (j=istart; j<iend; j++) {
>   k = cmap[adjncy[j]];
>   if ((m = htable[k]) == -1) {
> cadjncy[nedges] = k;
> cadjwgt[nedges] = adjwgt[j];
> htable[k] = nedges++;
>   }
>   else {
> cadjwgt[m] += adjwgt[j];
>
> On Sun, Nov 10, 2019 at 1:35 AM Mark Adams  wrote:
>
>>
>>
>> On Sat, Nov 9, 2019 at 10:51 PM Fande Kong  wrote:
>>
>>> Hi Mark,
>>>
>>> Thanks for reporting this bug. I was surprised because we have
>>> sufficient heavy tests in moose using partition weights and do not have any
>>> issue so far.
>>>
>>>
>> I have been pounding on this code with elasticity and have not seen this
>> issue. I am now looking at Laplacians and I only see it with pretty large
>> problems. The example below is pretty minimal (eg, it works with 16 cores
>> and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
>> my laptop.
>>
>>
>>> I will take a shot on this.
>>>
>>
>> Thanks, I'll try to take a look at it also. I have seen it in DDT, but
>> did not dig further. It looked like a typical segv in ParMetis.
>>
>>
>>>
>>> Fande,
>>>
>>> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams  wrote:
>>>
 snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
 repartitioning. Below shows the branch and how to run it.

 I've tried valgrind on Cori but it gives a lot of false positives. I've
 seen this error in DDT but I have not had a chance to dig and try to fix
 it. At least I know it has something to do with weights.

 If anyone wants to take a shot at it feel free. This bug rarely happens.

 The changes use weights and are just a few lines of code (from 1.5
 years ago):

 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
 commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
 Author: Fande Kong 
 Date:   Thu Jun 21 18:21:19 2018 -0600

 Let parmetis and ptsotch take edge weights and vertex weights

  src/mat/partition/impls/pmetis/pmetis.c | 7 +++
  src/mat/partition/impls/scotch/scotch.c | 6 +++---
  2 files changed, 10 insertions(+), 3 deletions(-)

 > mpiexec -n 32 ./ex13 

Re: [petsc-dev] GPU counters

2019-11-06 Thread Mark Adams via petsc-dev
Yea, that is what I thought.
Oh, I am probably seeing the flops from KSP. The PC is a monolithic code
(AMGx).
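
For what it's worth, here is the pattern Junchao describes below, as a
minimal sketch (the 2*nnz count for a CSR mat-vec and the a->nz field name
follow the usual SeqAIJ convention; this is not a quote of any exact kernel):

  /* inside a PETSc kernel, e.g. a sequential CSR mat-vec: the flop count is
     computed analytically from the matrix data (about two flops per stored
     nonzero) and logged; no hardware counters are read */
  ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);

So a monolithic external PC like AMGx contributes nothing to these counters
unless it calls PetscLogFlops() itself; the rates reported come from the
PETSc Vec/Mat/KSP operations wrapped around it.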

On Wed, Nov 6, 2019 at 11:18 AM Zhang, Junchao  wrote:

> No. For each vector/matrix operation, PETSc can get its flop count based
> on number of nonzeros, for example.
>
> --Junchao Zhang
>
>
> On Wed, Nov 6, 2019 at 8:44 AM Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> I am puzzled.
>>
>> I am running AMGx now, and I am getting flop counts/rates. How does that
>> happen? Does PETSc use hardware counters to get flops?
>>
>


[petsc-dev] GPU counters

2019-11-06 Thread Mark Adams via petsc-dev
I am puzzled.

I am running AMGx now, and I am getting flop counts/rates. How does that
happen? Does PETSc use hardware counters to get flops?


Re: [petsc-dev] Parmetis bug

2019-11-09 Thread Mark Adams via petsc-dev
On Sat, Nov 9, 2019 at 10:51 PM Fande Kong  wrote:

> Hi Mark,
>
> Thanks for reporting this bug. I was surprised because we have sufficient
> heavy tests in moose using partition weights and do not have any issue so
> far.
>
>
I have been pounding on this code with elasticity and have not seen this
issue. I am now looking at Laplacians and I only see it with pretty large
problems. The example below is pretty minimal (eg, it works with 16 cores
and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
my laptop.


> I will take a shot on this.
>

Thanks, I'll try to take a look at it also. I have seen it in DDT, but did
not dig further. It looked like a typical segv in ParMetis.
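
For context, a hedged user-level sketch of the path those weights take (the
calls are existing PETSc API; ierr, the adjacency matrix adj, and the unit
weights are placeholders, and GAMG's repartitioning goes through this same
MatPartitioning interface internally):

  MatPartitioning part;
  IS              is;
  PetscInt       *vwgts, i, nlocal;

  ierr = MatGetLocalSize(adj, &nlocal, NULL);CHKERRQ(ierr);
  ierr = PetscMalloc1(nlocal, &vwgts);CHKERRQ(ierr);  /* ownership passes to the MatPartitioning */
  for (i = 0; i < nlocal; i++) vwgts[i] = 1;          /* placeholder weights */
  ierr = MatPartitioningCreate(PETSC_COMM_WORLD, &part);CHKERRQ(ierr);
  ierr = MatPartitioningSetAdjacency(part, adj);CHKERRQ(ierr);
  ierr = MatPartitioningSetType(part, MATPARTITIONINGPARMETIS);CHKERRQ(ierr);
  ierr = MatPartitioningSetVertexWeights(part, vwgts);CHKERRQ(ierr);
  ierr = MatPartitioningApply(part, &is);CHKERRQ(ierr);
  ierr = MatPartitioningDestroy(&part);CHKERRQ(ierr);

Going by its title, the bisected commit (quoted below) is what lets pmetis.c
forward such weights on to ParMetis instead of ignoring them.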


>
> Fande,
>
> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams  wrote:
>
>> snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
>> repartitioning. Below shows the branch and how to run it.
>>
>> I've tried valgrind on Cori but it gives a lot of false positives. I've
>> seen this error in DDT but I have not had a chance to dig and try to fix
>> it. At least I know it has something to do with weights.
>>
>> If anyone wants to take a shot at it feel free. This bug rarely happens.
>>
>> The changes use weights and are just a few lines of code (from 1.5 years
>> ago):
>>
>> 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
>> 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
>> commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
>> Author: Fande Kong 
>> Date:   Thu Jun 21 18:21:19 2018 -0600
>>
>> Let parmetis and ptsotch take edge weights and vertex weights
>>
>>  src/mat/partition/impls/pmetis/pmetis.c | 7 +++
>>  src/mat/partition/impls/scotch/scotch.c | 6 +++---
>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>
>> > mpiexec -n 32 ./ex13 -cells 2,4,4, -dm_refine 5 -simplex 0 -dim 3
>> -potential_petscspace_degree 1 -potential_petscspace_order 1 -pc_type gamg
>> -petscpartitioner_type simple -pc_gamg_repartition
>> true -check_pointer_intensity 0
>>
>


Re: [petsc-dev] Parmetis bug

2019-11-10 Thread Mark Adams via petsc-dev
Fande, the problem is that k below seems to index beyond the end of htable,
resulting in a crazy m and a segv on the last line below.

I don't have a clean valgrind machine now, that is what is needed if no one
has seen anything like this. I could add a test in a MR and get the
pipeline to do it.
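
In the meantime, a hedged sketch of the kind of defensive check that would
catch this (the helper and its name are mine, not stock ParMetis; it assumes
the variables in the listing below, with the incoming cnvtxs saved before it
is reused as a counter):

/* hypothetical helper, called right after k = cmap[adjncy[j]] in the inner
   loop below; htable has htable_len slots (the cnvtxs the function was called
   with), so any k outside [0, htable_len) means cmap points past htable */
static void CheckCoarseIndex(idx_t k, idx_t htable_len, idx_t j)
{
  if (k < 0 || k >= htable_len)
    gk_errexit(SIGERR, "cmap[adjncy[%"PRIDX"]] = %"PRIDX" out of range [0, %"PRIDX")\n",
               j, k, htable_len);
}

That, or valgrind/a memory sanitizer on the reproducer in the quoted thread,
should at least localize which array is undersized.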

void CreateCoarseGraphNoMask(ctrl_t *ctrl, graph_t *graph, idx_t cnvtxs,
 idx_t *match)
{
  idx_t j, k, m, istart, iend, nvtxs, nedges, ncon, cnedges, v, u, dovsize;
  idx_t *xadj, *vwgt, *vsize, *adjncy, *adjwgt;
  idx_t *cmap, *htable;
  idx_t *cxadj, *cvwgt, *cvsize, *cadjncy, *cadjwgt;
  graph_t *cgraph;
  WCOREPUSH;

  dovsize = (ctrl->objtype == METIS_OBJTYPE_VOL ? 1 : 0);

  IFSET(ctrl->dbglvl, METIS_DBG_TIME, gk_startcputimer(ctrl->ContractTmr));

  nvtxs   = graph->nvtxs;
  ncon    = graph->ncon;
  xadj    = graph->xadj;
  vwgt    = graph->vwgt;
  vsize   = graph->vsize;
  adjncy  = graph->adjncy;
  adjwgt  = graph->adjwgt;
  cmap    = graph->cmap;


  /* Initialize the coarser graph */
  cgraph = SetupCoarseGraph(graph, cnvtxs, dovsize);
  cxadj    = cgraph->xadj;
  cvwgt    = cgraph->vwgt;
  cvsize   = cgraph->vsize;
  cadjncy  = cgraph->adjncy;
  cadjwgt  = cgraph->adjwgt;

  htable = iset(cnvtxs, -1, iwspacemalloc(ctrl, cnvtxs));

  cxadj[0] = cnvtxs = cnedges = 0;
  for (v=0; v<nvtxs; v++) { ...

On Sun, Nov 10, 2019 at 1:35 AM Mark Adams  wrote:

>
>
> On Sat, Nov 9, 2019 at 10:51 PM Fande Kong  wrote:
>
>> Hi Mark,
>>
>> Thanks for reporting this bug. I was surprised because we have sufficient
>> heavy tests in moose using partition weights and do not have any issue so
>> far.
>>
>>
> I have been pounding on this code with elasticity and have not seen this
> issue. I am now looking at Laplacians and I only see it with pretty large
> problems. The example below is pretty minimal (eg, it works with 16 cores
> and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
> my laptop.
>
>
>> I will take a shot on this.
>>
>
> Thanks, I'll try to take a look at it also. I have seen it in DDT, but did
> not dig further. It looked like a typical segv in ParMetis.
>
>
>>
>> Fande,
>>
>> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams  wrote:
>>
>>> snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
>>> repartitioning. Below shows the branch and how to run it.
>>>
>>> I've tried valgrind on Cori but it gives a lot of false positives. I've
>>> seen this error in DDT but I have not had a chance to dig and try to fix
>>> it. At least I know it has something to do with weights.
>>>
>>> If anyone wants to take a shot at it feel free. This bug rarely happens.
>>>
>>> The changes use weights and are just a few lines of code (from 1.5 years
>>> ago):
>>>
>>> 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
>>> 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
>>> commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
>>> Author: Fande Kong 
>>> Date:   Thu Jun 21 18:21:19 2018 -0600
>>>
>>> Let parmetis and ptsotch take edge weights and vertex weights
>>>
>>>  src/mat/partition/impls/pmetis/pmetis.c | 7 +++
>>>  src/mat/partition/impls/scotch/scotch.c | 6 +++---
>>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>>
>>> > mpiexec -n 32 ./ex13 -cells 2,4,4, -dm_refine 5 -simplex 0 -dim 3
>>> -potential_petscspace_degree 1 -potential_petscspace_order 1 -pc_type gamg
>>> -petscpartitioner_type simple -pc_gamg_repartition
>>> true -check_pointer_intensity 0
>>>
>>


Re: [petsc-dev] ksp_error_if_not_converged in multilevel solvers

2019-10-20 Thread Mark Adams via petsc-dev
> If one just wants to run a fixed number of iterations, not checking for
> convergence, why would one set ksp->errorifnotconverged to true?
>
>
Good question. I can see not worrying too much about convergence on the
coarse grids, but to not allow it ... and now that I think about it, it
seems like we might want to error out with an indefinite PC. Maybe make
ksp->errorifnotconverged an int (sketched below):
0: no error
1: error if indefinite only
2: error if any error
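
A hedged sketch of what that could look like (none of these names exist in
PETSc today; the current field is a PetscBool):

typedef enum {
  KSP_ERROR_IF_NOT_CONVERGED_NONE       = 0, /* never generate an error */
  KSP_ERROR_IF_NOT_CONVERGED_INDEFINITE = 1, /* error only on an indefinite PC/operator */
  KSP_ERROR_IF_NOT_CONVERGED_ANY        = 2  /* error on any convergence failure */
} KSPErrorIfNotConvergedType;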



> Thanks,
> Pierre
>
>


[petsc-dev] SuperLU + GPUs

2019-10-18 Thread Mark Adams via petsc-dev
What is the status of supporting SuperLU_DIST with GPUs?
Thanks,
Mark


[petsc-dev] getting eigen estimates from GAMG to CHEBY

2019-09-25 Thread Mark Adams via petsc-dev
It's been a few years since we lost the ability to cache the eigen
estimates that smoothed aggregation computes and hand them to Chebyshev
smoothers. I'd like to see if we can bring this back.

This is slightly (IMO) complicated by the fact that the smoother PC may not
be Jacobi, but I think it is close enough (and probably an overestimate).
Maybe provide a Cheby option such as chebyshev_recompute_eig_est.
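
A hedged sketch of the idea (KSPChebyshevSetEigenvalues() is existing PETSc
API; where emax comes from and the safety factors are my assumptions):

  /* GAMG already computes an eigen estimate emax during smoothed aggregation
     (for prolongator smoothing); hand it to the level smoother instead of
     having Chebyshev re-estimate it */
  ierr = KSPSetType(smoother, KSPCHEBYSHEV);CHKERRQ(ierr);
  ierr = KSPChebyshevSetEigenvalues(smoother, 1.05*emax, 0.1*emax);CHKERRQ(ierr);

A chebyshev_recompute_eig_est-style option could then opt back into
re-estimation when the smoother PC is not close to Jacobi.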

What do people think?

