Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

2024-02-09 Thread Zhang, Chonglin
I don’t understand the reason, but I think I figured out a solution by trial 
and error. Basically, I copied the configure line from the “configure.log” file 
to the terminal, deleted the “--with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90” 
arguments, and added a “--with-mpi-dir=xxx” argument, as below. With this it was 
able to configure. (I still need to test that it compiles.)

./configure --with-mpiexec="jsrun -g 1" --with-shared-libraries=1 --with-debugging=yes \
  --COPTFLAGS="-g -Ofast -mcpu=power9 -fPIC" --CXXOPTFLAGS="-g -Ofast -mcpu=power9 -fPIC" \
  --FOPTFLAGS="-g -Ofast -mcpu=power9 -fPIC" --with-cuda=1 --with-fortran-bindings=0 \
  --with-batch=0 --with-cuda-arch=70 --with-cudac=nvcc --download-metis --download-parmetis \
  --with-blaslapack-lib="-L/sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/netlib-lapack-3.11.0-g3vx4sojdbcj5ph6t4gzimzbtkfjpn4y/lib64 -lblas -llapack" \
  --download-triangle --with-make-np=4 \
  --with-mpi-dir="/sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu" \
  PETSC_ARCH=arch-summit

Thanks,
Chonglin
From: petsc-users  on behalf of Zhang, 
Chonglin 
Date: Friday, February 9, 2024 at 4:21 PM
To: Barry Smith 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0
Thanks Barry! I am still getting the same error message. Any more suggestions?

I can see that library from the login node:
lrwxrwxrwx 1 sauesw ccsstaff 29 Jan 16 16:39 /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/lib/libmpi_ibm_usempif08.so -> libmpi_ibm_usempif08.so.3.1.0

Thanks,
Chonglin
From: Barry Smith 
Date: Friday, February 9, 2024 at 4:10 PM
To: Zhang, Chonglin 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

 error while loading shared libraries: libmpi_ibm_usempif08.so: cannot open 
shared object file: No such file or directory

 So using mpif90 does not work because it links a shared library that cannot be 
found at run time.

 Perhaps that library is only visible on the batch nodes. You can try adding 
--with-batch=0 to the ./configure options.
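
 (One quick way to check, assuming the standard tools: compile any small program 
with mpif90 on a login node and run ldd on the resulting binary; if 
libmpi_ibm_usempif08.so shows up as "not found", the Spectrum MPI lib directory 
is probably missing from LD_LIBRARY_PATH at run time.)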


  Barry


On Feb 9, 2024, at 5:01 PM, Zhang, Chonglin  wrote:

Dear PETSc developers,

I am trying to compile PETSc on Summit with gcc 12.1.0 and spectrum-mpi 
10.4.0.6, but encountered the following configuration issues:

===============================================================================
             Configuring PETSc to compile on your system
===============================================================================
TESTING: checkFortranCompiler from config.setCompilers(config/BuildSystem/config/setCompilers.py:1271)
*******************************************************************************
         OSError while running ./configure
-------------------------------------------------------------------------------
Cannot run executables created with FC. If this machine uses a batch system
to submit jobs you will need to configure using ./configure with the additional
option --with-batch.
Otherwise there is problem with the compilers. Can you compile and run code
with your compiler 'mpif90'?
*******************************************************************************

Also attached is the configure.log file. Could you help with this issue?

Thanks,
Chonglin







Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

2024-02-09 Thread Zhang, Chonglin
Thanks Barry! I am still getting the same error message. Any more suggestions?

I can see that library from the login node:
lrwxrwxrwx 1 sauesw ccsstaff 29 Jan 16 16:39 /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/lib/libmpi_ibm_usempif08.so -> libmpi_ibm_usempif08.so.3.1.0

Thanks,
Chonglin
From: Barry Smith 
Date: Friday, February 9, 2024 at 4:10 PM
To: Zhang, Chonglin 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

 error while loading shared libraries: libmpi_ibm_usempif08.so: cannot open 
shared object file: No such file or directory

 So using mpif90 does not work because it links a shared library that cannot be 
found at run time.

 Perhaps that library is only visible on the batch nodes. You can try adding 
--with-batch=0 to the ./configure options.


  Barry



On Feb 9, 2024, at 5:01 PM, Zhang, Chonglin  wrote:

Dear PETSc developers,

I am trying to compile PETSc on Summit with gcc 12.1.0 and spectrum-mpi 
10.4.0.6, but encountered the following configuration issues:

===============================================================================
             Configuring PETSc to compile on your system
===============================================================================
TESTING: checkFortranCompiler from config.setCompilers(config/BuildSystem/config/setCompilers.py:1271)
*******************************************************************************
         OSError while running ./configure
-------------------------------------------------------------------------------
Cannot run executables created with FC. If this machine uses a batch system
to submit jobs you will need to configure using ./configure with the additional
option --with-batch.
Otherwise there is problem with the compilers. Can you compile and run code
with your compiler 'mpif90'?
*******************************************************************************

Also attached is the configure.log file. Could you help with this issue?

Thanks,
Chonglin







[petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

2024-02-09 Thread Zhang, Chonglin
Dear PETSc developers,

I am trying to compile PETSc on Summit with gcc 12.1.0 and spectrum-mpi 
10.4.0.6, but encountered the following configuration issues:

===============================================================================
             Configuring PETSc to compile on your system
===============================================================================
TESTING: checkFortranCompiler from config.setCompilers(config/BuildSystem/config/setCompilers.py:1271)
*******************************************************************************
         OSError while running ./configure
-------------------------------------------------------------------------------
Cannot run executables created with FC. If this machine uses a batch system
to submit jobs you will need to configure using ./configure with the additional
option --with-batch.
Otherwise there is problem with the compilers. Can you compile and run code
with your compiler 'mpif90'?
*******************************************************************************

Also attached is the configure.log file. Could you help with this issue?

Thanks,
Chonglin





configure.log
Description: configure.log


Re: [petsc-users] [EXTERNAL] GPU implementation of serial smoothers

2023-01-10 Thread Zhang, Chonglin
I am using the following options in my Poisson solver running on the GPU; they 
were suggested by Barry and Mark (Dr. Mark Adams).
  -ksp_type cg
  -pc_type gamg
  -mg_levels_ksp_type chebyshev
  -mg_levels_pc_type jacobi
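
For anyone who wants a quick, self-contained way to try these options, below is 
a minimal sketch (not my actual solver; a 1D Laplacian stands in for the real 
operator) of a KSP solve that picks everything up from the command line via 
KSPSetFromOptions():

  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    Vec            x, b;
    KSP            ksp;
    PetscInt       i, n = 100, Istart, Iend;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* assemble a 1D Laplacian as a stand-in operator */
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);
    ierr = MatSetUp(A);CHKERRQ(ierr);
    ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
    for (i = Istart; i < Iend; ++i) {
      if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
      if (i < n-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
      ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
    ierr = VecSet(b, 1.0);CHKERRQ(ierr);
    /* the solver configuration comes entirely from the options database */
    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* picks up -ksp_type cg -pc_type gamg ... */
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&b);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

Running it with, e.g., "mpirun -np 4 ./ksp_test -ksp_type cg -pc_type gamg 
-mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi -ksp_monitor" exercises 
exactly the option set above (ksp_test is just a placeholder name for the 
compiled sketch).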

On Jan 10, 2023, at 3:31 PM, Mark Lohry  wrote:

So what are people using for GAMG configs on GPU? I was hoping petsc today 
would be performance competitive with AMGx but it sounds like that's not the 
case?

On Tue, Jan 10, 2023 at 3:03 PM Jed Brown <j...@jedbrown.org> wrote:
Mark Lohry <mlo...@gmail.com> writes:

> I definitely need multigrid. I was under the impression that GAMG was
> relatively cuda-complete, is that not the case? What functionality works
> fully on GPU and what doesn't, without any host transfers (aside from
> what's needed for MPI)?
>
> If I use -pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type
> richardson is that fully on device, but -mg_levels_pc_type ilu or
> -mg_levels_pc_type sor require transfers?

You can do `-mg_levels_pc_type ilu`, but it'll be extremely slow (like 20x 
slower than an operator apply). One can use Krylov smoothers, though that's 
more synchronization. Automatic construction of operator-dependent multistage 
smoothers for linear multigrid (because Chebyshev only works for problems that 
have eigenvalues near the real axis) is something I've wanted to develop for at 
least a decade, but time is always short. I might put some effort into p-MG 
with such smoothers this year as we add DDES to our scale-resolving 
compressible solver.



Re: [petsc-users] is PETSc's random deterministic?

2021-07-28 Thread Zhang, Chonglin
I was having the same website-not-found problem the other day. I remember an 
email from Satish saying PETSc has a new website. It seems that all the manual 
pages are now hosted there: https://petsc.org/release/documentation/manualpages/; 
https://petsc.org/release/docs/manualpages/singleindex.html.

Thanks!
Chonglin

On Jul 28, 2021, at 9:31 PM, Mark Adams <mfad...@lbl.gov> wrote:

Also, when I google a function, the Argonne web pages are not found (MIT seems 
to have mirrored them, and that works).
Thanks,
Mark



Re: [petsc-users] Proper GPU usage in PETSc

2020-09-24 Thread Zhang, Chonglin
Hi Matt,

Thanks for the comments and the nice example code. Right now our objective is 
to use the XGC unstructured flux-surface-following mesh (fixed in size); I will 
keep your comment on mesh refinement in mind.

Thanks!
Chonglin

On Sep 24, 2020, at 3:26 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 3:08 PM Zhang, Chonglin <zhang...@rpi.edu> wrote:
Hi Matt,

I will quickly summarize what I found with “CreateMesh” for running ex12 here: 
https://gitlab.com/petsc/petsc/-/blob/master/src/snes/tutorials/ex12.c. If this 
is not the proper thread to discuss this, I can open a new one.

Commands used (relevant to mesh creation) to run ex12 (quad core desktop 
computer with CPU only, 4 MPI ranks):
mpirun -np 4 -cells 100, 100, 0 -options_left -log_view
I built PETSc commit: 2bbfc05, dated Sep 23, 2020, with debug=no.

Mesh size     CreateMesh (seconds)   DMPlexDistribute (seconds)
100 x 100     0.14                   0.081
500 x 500     2.28                   1.33
1000 x 1000   10.1                   5.10
2000 x 1000   24.6                   10.96
2000 x 2000   73.7                   22.72

Is the performance reasonable for the “CreateMesh” functionality?

Anything I am not doing correctly with DMPlex running ex12?

ex12 is a little old. I have been meaning to update it. ex13 does the same 
thing in a more modern way.

Above looks reasonable I think. The CreateMesh time includes generating the 
mesh using Triangle, since simplex is the
default. In example 12, you could use

  -simplex 0

or in ex13

  -dm_plex_box_simplex 0

to get hexes, which do not use a generator.  Second, you are interpolating all 
on process 0, which is probably
the bulk of the time. I do that because I do not care about parallel 
performance in the examples and it is simpler.
You can also refine the mesh after distribution, which is faster, and cuts down 
on the distribution time. So if you
want the whole thing, you could use

  DM odm, dm;

  /* Create a cell-vertex box mesh */
  ierr = DMPlexCreateBoxMesh(comm, 2, PETSC_TRUE, NULL, NULL, NULL, NULL, PETSC_FALSE, &odm);CHKERRQ(ierr);
  ierr = PetscObjectSetOptionsPrefix((PetscObject) odm, "orig_");CHKERRQ(ierr);
  /* Distributes the mesh here */
  ierr = DMSetFromOptions(odm);CHKERRQ(ierr);
  /* Interpolate the mesh */
  ierr = DMPlexInterpolate(odm, &dm);CHKERRQ(ierr);
  ierr = DMDestroy(&odm);CHKERRQ(ierr);
  /* Refine the mesh */
  ierr = DMSetFromOptions(dm);CHKERRQ(ierr);

and run with

  -dm_plex_box_simplex 0 -dm_plex_box_faces 100,100 -orig_dm_distribute 
-dm_refine 3
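
Here the "orig_" prefix means -orig_dm_distribute is consumed by the first 
DMSetFromOptions() call (distributing the cell-vertex mesh), while -dm_refine 3 
is consumed by the second one and refines the interpolated mesh after 
distribution.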

  Thanks,

 Matt

Thanks!
Chonglin

On Sep 24, 2020, at 2:06 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
On Thu, Sep 24, 2020 at 1:38 PM Matthew Knepley <knep...@gmail.com> wrote:
On Thu, Sep 24, 2020 at 12:48 PM Zhang, Chonglin <zhang...@rpi.edu> wrote:
Thanks Mark and Barry!

A quick try of “-pc_type jacobi” did reduce the “CpuToGpu” and “GpuToCpu” 
counts, although “-pc_type gamg” (whose counts did not decrease in this case) 
solves the problem faster (this may not mean much since the problem size is too 
small; also, the function “DMPlexCreateFromCellListParallelPetsc()” is slow for 
large problem sizes, preventing larger runs, which is a separate issue).

It sounds like something is wrong then, or I do not understand what you mean by 
slow.

SOR may be the default, so you need to set -mg_levels_ksp_type chebyshev and 
-mg_levels_pc_type jacobi; chebyshev is the KSP default.

I meant for the mesh creation.

  Thanks,

 Matt

  Thanks,

 Matt

Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a significant 
amount of time for a realistic sized problem, say for example a linear problem 
with ~1-2 million DOFs?

Also, is there any plan to have the SNES and DMPlex code run on GPU?

Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:


   MatSOR() runs on the CPU; this causes a copy to the CPU for each application 
of MatSOR() and then a copy back to the GPU for the next step.

   You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is 
amenable for your problem.

   Also, the problem is way too small for a GPU.

   There will be copies between the GPU/CPU for each SNES iteration since the 
DMPLEX code does not run on GPUs.

   Barry



On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the 
proper way to:
(1) solve a linear equation in SNES using the GPU in PETSc; what 
syntax/arguments should I be using;
(2) avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data transfers shown 
in the PETSc log file, when using CUDA-aware MPI.


D

Re: [petsc-users] Proper GPU usage in PETSc

2020-09-24 Thread Zhang, Chonglin
Hi Matt,

I will quickly summarize what I found with “CreateMesh” for running ex12 here: 
https://gitlab.com/petsc/petsc/-/blob/master/src/snes/tutorials/ex12.c. If this 
is not the proper thread to discuss this, I can open a new one.

Commands used (relevant to mesh creation) to run ex12 (quad core desktop 
computer with CPU only, 4 MPI ranks):
mpirun -np 4 -cells 100, 100, 0 -options_left -log_view
I built PETSc commit: 2bbfc05, dated Sep 23, 2020, with debug=no.

Mesh size     CreateMesh (seconds)   DMPlexDistribute (seconds)
100 x 100     0.14                   0.081
500 x 500     2.28                   1.33
1000 x 1000   10.1                   5.10
2000 x 1000   24.6                   10.96
2000 x 2000   73.7                   22.72

Is the performance reasonable for the “CreateMesh” functionality?

Anything I am not doing correctly with DMPlex running ex12?

Thanks!
Chonglin

On Sep 24, 2020, at 2:06 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
On Thu, Sep 24, 2020 at 1:38 PM Matthew Knepley <knep...@gmail.com> wrote:
On Thu, Sep 24, 2020 at 12:48 PM Zhang, Chonglin <zhang...@rpi.edu> wrote:
Thanks Mark and Barry!

A quick try of “-pc_type jacobi” did reduce the “CpuToGpu” and “GpuToCpu” 
counts, although “-pc_type gamg” (whose counts did not decrease in this case) 
solves the problem faster (this may not mean much since the problem size is too 
small; also, the function “DMPlexCreateFromCellListParallelPetsc()” is slow for 
large problem sizes, preventing larger runs, which is a separate issue).

It sounds like something is wrong then, or I do not understand what you mean by 
slow.

SOR may be the default, so you need to set -mg_levels_ksp_type chebyshev and 
-mg_levels_pc_type jacobi; chebyshev is the KSP default.

I meant for the mesh creation.

  Thanks,

 Matt

  Thanks,

 Matt

Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a significant 
amount of time for a realistic sized problem, say for example a linear problem 
with ~1-2 million DOFs?

Also, is there any plan to have the SNES and DMPlex code run on GPU?

Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:


   MatSOR() runs on the CPU; this causes a copy to the CPU for each application 
of MatSOR() and then a copy back to the GPU for the next step.

   You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is 
amenable for your problem.

   Also, the problem is way too small for a GPU.

   There will be copies between the GPU/CPU for each SNES iteration since the 
DMPLEX code does not run on GPUs.

   Barry



On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the 
proper way to:
(1) solve a linear equation in SNES using the GPU in PETSc; what 
syntax/arguments should I be using;
(2) avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data transfers shown 
in the PETSc log file, when using CUDA-aware MPI.


Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI’s AiMOS computer (node wise, it is the same as Summit);
(2) using GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES 
interface, under DMPlex:
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create vector and matrix;
(3) using SNES interface to solve the linear Poisson equation, with “-snes_type 
ksponly”;
(4) using “dm_vec_type cuda”, “dm_mat_type aijcusparse “ to use GPU vector and 
matrix, as suggested in this webpage: 
https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using “use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to enable 
GPU-Direct (similar to "srun --smpiargs=“-gpu”" for Summit): 
https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; 
https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using “-options_left” to check and make sure all the arguments are accepted 
and used by PETSc.
(7) After problem setup, I am running the “SNESSolve()” multiple times to solve 
the linear problem and observe the log file with “-log_view"

I noticed that if I run “SNESSolve()” 500 times, instead of 50 times, the 
“CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for some of 
the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, 
MatCUSPARSCopyTo. See below for a truncated log corresponding to running 
SNESSolve() 500 times:


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  

Re: [petsc-users] Proper GPU usage in PETSc

2020-09-24 Thread Zhang, Chonglin


On Sep 24, 2020, at 1:11 PM, Barry Smith <bsm...@petsc.dev> wrote:



On Sep 24, 2020, at 11:48 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Thanks Mark and Barry!

A quick try of “-pc_type jacobi” did reduce the “CpuToGpu” and “GpuToCpu” 
counts, although “-pc_type gamg” (whose counts did not decrease in this case) 
solves the problem faster (this may not mean much since the problem size is too 
small; also, the function “DMPlexCreateFromCellListParallelPetsc()” is slow for 
large problem sizes, preventing larger runs, which is a separate issue).

Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a significant 
amount of time for a realistic sized problem, say for example a linear problem 
with ~1-2 million DOFs?

   It depends on how often the copies are done.

   With GAMG, once the preconditioner is built the entire linear solve can run 
on the GPU, and Mark has some good speedups of the linear solve using GAMG on 
the GPU instead of the CPU on Summit.

   The speedup of the entire simulation will depend on the relative cost of the 
finite element matrix assembly vs the linear solver time, and Amdahl's law kicks 
in: for example, if the finite element assembly takes 50 percent of the time, 
then even if the linear solve takes zero time one can only get a speedup of two, 
which is not much.
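
   (In Amdahl's-law terms: if the accelerated part is a fraction p of the 
runtime and it is sped up by a factor s, the overall speedup is 
1/((1 - p) + p/s); with p = 0.5 that is bounded by 2 no matter how large s gets.)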


Thanks for the detailed explanation Barry!

Mark: could you share the results of GAMG on GPU vs CPU on Summit, or point me 
to where I could see them? (The actual code showing how you are doing this would 
be even better, as a learning opportunity for me.) Thanks!


Also, is there any plan to have the SNES and DMPlex code run on GPU?

  Basically, the finite element computation for the nonlinear function and its 
Jacobian needs to run on the GPU; this is a big project that we've barely begun 
thinking about. If this is something you are interested in, it would be 
fantastic if you could take a look at that.

I see. I will think about this, discuss internally and get back to you if I can!

Thanks!
Chonglin


  Barry




Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:


   MatSOR() runs on the CPU; this causes a copy to the CPU for each application 
of MatSOR() and then a copy back to the GPU for the next step.

   You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is 
amenable for your problem.

   Also, the problem is way too small for a GPU.

   There will be copies between the GPU/CPU for each SNES iteration since the 
DMPLEX code does not run on GPUs.

   Barry



On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the 
proper way to:
(1) solve a linear equation in SNES using the GPU in PETSc; what 
syntax/arguments should I be using;
(2) avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data transfers shown 
in the PETSc log file, when using CUDA-aware MPI.


Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI’s AiMOS computer (node wise, it is the same as Summit);
(2) using GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES 
interface, under DMPlex:
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create vector and matrix;
(3) using SNES interface to solve the linear Poisson equation, with “-snes_type 
ksponly”;
(4) using “dm_vec_type cuda”, “dm_mat_type aijcusparse “ to use GPU vector and 
matrix, as suggested in this webpage: 
https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using “use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to enable 
GPU-Direct (similar to "srun --smpiargs=“-gpu”" for Summit): 
https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; 
https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using “-options_left” to check and make sure all the arguments are accepted 
and used by PETSc.
(7) After problem setup, I am running the “SNESSolve()” multiple times to solve 
the linear problem and observe the log file with “-log_view"

I noticed that if I run “SNESSolve()” 500 times, instead of 50 times, the 
“CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for some of 
the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, 
MatCUSPARSCopyTo. See below for a truncated log corresponding to running 
SNESSolve() 500 times:


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
--

Re: [petsc-users] Proper GPU usage in PETSc

2020-09-24 Thread Zhang, Chonglin
Thanks Mark and Barry!

A quick try of “-pc_type jacobi” did reduce the “CpuToGpu” and “GpuToCpu” 
counts, although “-pc_type gamg” (whose counts did not decrease in this case) 
solves the problem faster (this may not mean much since the problem size is too 
small; also, the function “DMPlexCreateFromCellListParallelPetsc()” is slow for 
large problem sizes, preventing larger runs, which is a separate issue).

Would this “CpuToGpu” and “GpuToCpu” data transfer contribute a significant 
amount of time for a realistic sized problem, say for example a linear problem 
with ~1-2 million DOFs?

Also, is there any plan to have the SNES and DMPlex code run on GPU?

Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:


   MatSOR() runs on the CPU; this causes a copy to the CPU for each application 
of MatSOR() and then a copy back to the GPU for the next step.

   You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is 
amenable for your problem.

   Also, the problem is way too small for a GPU.

   There will be copies between the GPU/CPU for each SNES iteration since the 
DMPLEX code does not run on GPUs.

   Barry



On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the 
proper way to:
(1) solve a linear equation in SNES using the GPU in PETSc; what 
syntax/arguments should I be using;
(2) avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data transfers shown 
in the PETSc log file, when using CUDA-aware MPI.


Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI’s AiMOS computer (node wise, it is the same as Summit);
(2) using GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES 
interface, under DMPlex:
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create vector and matrix;
(3) using SNES interface to solve the linear Poisson equation, with “-snes_type 
ksponly”;
(4) using “dm_vec_type cuda”, “dm_mat_type aijcusparse “ to use GPU vector and 
matrix, as suggested in this webpage: 
https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using “use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to enable 
GPU-Direct (similar to "srun --smpiargs=“-gpu”" for Summit): 
https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; 
https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using “-options_left” to check and make sure all the arguments are accepted 
and used by PETSc.
(7) After problem setup, I am running the “SNESSolve()” multiple times to solve 
the linear problem and observe the log file with “-log_view"

I noticed that if I run “SNESSolve()” 500 times, instead of 50 times, the 
“CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for some of 
the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, 
MatCUSPARSCopyTo. See below for a truncated log corresponding to running 
SNESSolve() 500 times:


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---

--- Event Stage 0: Main Stage

BuildTwoSided510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 
1.0e+03  0  0  0  0  0   0  0 21  0  0 0   0  0 0.00e+000 
0.00e+00  0
BuildTwoSidedF   501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
1.0e+03  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
SNESSolve500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 
8.7e+05100100  0  0100 100100  0  0100   144 202   31947 7.82e+02 63363 
1.44e+03 82
SNESSetUp  1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
1.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
SNESFunctionEval 500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 
5.0e+02 12  3  0  0  0  12  3  0  0  036  13  0 0.00e+00 1000 
2.48e+01  0
SNESJacobianEval 500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 
2.0e+03 15  5  0  0  0  15  5  0  0  050   0   1000 7.77e+01  500 
1.24e+01  0
DMPlexResidualFE 500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 10  3  0  0  0  10  3  0  0  039   0  0 0.00e+00  500 
1.24e+01  0
DMPlexJacobianFE 500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 
2.0e+03 14  5  0  0  0  14  5  0  0  052   0   1000 7.77e+010 
0.00e+00  0
MatSOR 30947 1.0 3.1254e+00 1.1 1.21e+09 1.

[petsc-users] Proper GPU usage in PETSc

2020-09-24 Thread Zhang, Chonglin
Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the 
proper way to:
(1) solve a linear equation in SNES using the GPU in PETSc; what 
syntax/arguments should I be using;
(2) avoid/reduce the “CpuToGpu count” and “GpuToCpu count” data transfers shown 
in the PETSc log file, when using CUDA-aware MPI.


Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI’s AiMOS computer (node wise, it is the same as Summit);
(2) using GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES 
interface, under DMPlex (a minimal sketch of the option-driven part follows the 
list below):
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create vector and matrix;
(3) using SNES interface to solve the linear Poisson equation, with “-snes_type 
ksponly”;
(4) using “dm_vec_type cuda”, “dm_mat_type aijcusparse “ to use GPU vector and 
matrix, as suggested in this webpage: 
https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using “use_gpu_aware_mpi” with PETSc, and using `mpirun -gpu` to enable 
GPU-Direct (similar to "srun --smpiargs=“-gpu”" for Summit): 
https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; 
https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using “-options_left” to check and make sure all the arguments are accepted 
and used by PETSc.
(7) After problem setup, I am running the “SNESSolve()” multiple times to solve 
the linear problem and observe the log file with “-log_view"
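
To make steps (1), (2), (4), and (6) concrete, here is a minimal sketch of the 
option-driven part of this setup (a small box mesh and a single P1 field stand 
in for my actual mesh and discretization, and the solve itself is omitted):

  #include <petsc.h>

  int main(int argc, char **argv)
  {
    DM             dm;
    PetscFE        fe;
    Vec            u;
    Mat            J;
    VecType        vtype;
    MatType        mtype;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* (1) a DMPlex mesh; a box mesh stands in for the real unstructured mesh */
    ierr = DMPlexCreateBoxMesh(PETSC_COMM_WORLD, 2, PETSC_TRUE, NULL, NULL, NULL, NULL, PETSC_TRUE, &dm);CHKERRQ(ierr);
    /* (4) picks up -dm_vec_type cuda and -dm_mat_type aijcusparse from the command line */
    ierr = DMSetFromOptions(dm);CHKERRQ(ierr);
    /* a single P1 field, so the DM can size its vectors and matrices */
    ierr = PetscFECreateDefault(PETSC_COMM_WORLD, 2, 1, PETSC_TRUE, NULL, -1, &fe);CHKERRQ(ierr);
    ierr = DMSetField(dm, 0, NULL, (PetscObject) fe);CHKERRQ(ierr);
    ierr = DMCreateDS(dm);CHKERRQ(ierr);
    ierr = PetscFEDestroy(&fe);CHKERRQ(ierr);
    /* (2) vector and matrix created through the DM inherit the requested types */
    ierr = DMCreateGlobalVector(dm, &u);CHKERRQ(ierr);
    ierr = DMCreateMatrix(dm, &J);CHKERRQ(ierr);
    /* (6) print the types actually used, as a sanity check */
    ierr = VecGetType(u, &vtype);CHKERRQ(ierr);
    ierr = MatGetType(J, &mtype);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Vec type: %s  Mat type: %s\n", vtype, mtype);CHKERRQ(ierr);
    ierr = MatDestroy(&J);CHKERRQ(ierr);
    ierr = VecDestroy(&u);CHKERRQ(ierr);
    ierr = DMDestroy(&dm);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

Run with "-dm_vec_type cuda -dm_mat_type aijcusparse" (and `mpirun -gpu` as in 
(5)) to confirm that the CUDA vector and matrix types are actually picked up.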

I noticed that if I run “SNESSolve()” 500 times, instead of 50 times, the 
“CpuToGpu count” and/or “GpuToCpu count” increased roughly 10 times for some of 
the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, 
MatCUSPARSCopyTo. See below for a truncated log corresponding to running 
SNESSolve() 500 times:


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---

--- Event Stage 0: Main Stage

BuildTwoSided510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 
1.0e+03  0  0  0  0  0   0  0 21  0  0 0   0  0 0.00e+000 
0.00e+00  0
BuildTwoSidedF   501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 
1.0e+03  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
SNESSolve500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 
8.7e+05100100  0  0100 100100  0  0100   144 202   31947 7.82e+02 63363 
1.44e+03 82
SNESSetUp  1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
1.0e+00  0  0  0  0  0   0  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
SNESFunctionEval 500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 
5.0e+02 12  3  0  0  0  12  3  0  0  036  13  0 0.00e+00 1000 
2.48e+01  0
SNESJacobianEval 500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 
2.0e+03 15  5  0  0  0  15  5  0  0  050   0   1000 7.77e+01  500 
1.24e+01  0
DMPlexResidualFE 500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 10  3  0  0  0  10  3  0  0  039   0  0 0.00e+00  500 
1.24e+01  0
DMPlexJacobianFE 500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 
2.0e+03 14  5  0  0  0  14  5  0  0  052   0   1000 7.77e+010 
0.00e+00  0
MatSOR 30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 
0.0e+00  1 10  0  0  0   1 10  0  0  0  1542   0  0 0.00e+00 61863 
1.41e+03  0
MatAssemblyBegin 511 1.0 5.3428e+00256.4 0.00e+00 0.0 0.0e+00 0.0e+00 
2.0e+03  1  0  0  0  0   1  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
MatAssemblyEnd   511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
2.1e+01  0  0  0  0  0   0  0  0  0  0 0   0   1002 7.80e+010 
0.00e+00  0
MatCUSPARSCopyTo1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0   0   1002 7.80e+010 
0.00e+00  0
VecMDot29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00 0.0e+00 
6.0e+04 12 22  0  0  7  12 22  0  0  7   2773236   29930 6.81e+020 
0.00e+00 100
VecNorm31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00 0.0e+00 
6.3e+04  5  2  0  0  7   5  2  0  0  734  55   1017 2.31e+010 
0.00e+00 100
VecNormalize   30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00 0.0e+00 
6.2e+04  7  2  0  0  7   7  2  0  0  744  51   1017 2.31e+010 
0.00e+00 100
VecCUDACopyTo  30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0   2  0  0  0  0 0   0   30947 7.04e+020 
0.00e+00  0
VecCUDACopyFrom63363 1.0 1.0873e+00 1.1 0.00e+00 0.0