Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0
I don't understand the reason, but I think I figured out a solution by trial and error. Basically, I copied the configure line from the "configure.log" file into the terminal, deleted the "--with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90" arguments, and added a "--with-mpi-dir=..." argument, as shown below. With this it was able to configure (I still need to test that it compiles).

./configure --with-mpiexec="jsrun -g 1" --with-shared-libraries=1 --with-debugging=yes --COPTFLAGS="-g -Ofast -mcpu=power9 -fPIC" --CXXOPTFLAGS="-g -Ofast -mcpu=power9 -fPIC" --FOPTFLAGS="-g -Ofast -mcpu=power9 -fPIC" --with-cuda=1 --with-fortran-bindings=0 --with-batch=0 --with-cuda-arch=70 --with-cudac=nvcc --download-metis --download-parmetis --with-blaslapack-lib="-L/sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/netlib-lapack-3.11.0-g3vx4sojdbcj5ph6t4gzimzbtkfjpn4y/lib64 -lblas -llapack" --download-triangle --with-make-np=4 --with-mpi-dir="/sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu" PETSC_ARCH=arch-summit

Thanks,
Chonglin

From: petsc-users on behalf of Zhang, Chonglin
Date: Friday, February 9, 2024 at 4:21 PM
To: Barry Smith
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

Thanks Barry! I am still getting the same error message. Any more suggestions? I can see that library from the login node:

lrwxrwxrwx 1 sauesw ccsstaff 29 Jan 16 16:39 /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/lib/libmpi_ibm_usempif08.so -> libmpi_ibm_usempif08.so.3.1.0

Thanks,
Chonglin

From: Barry Smith
Date: Friday, February 9, 2024 at 4:10 PM
To: Zhang, Chonglin
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

  error while loading shared libraries: libmpi_ibm_usempif08.so: cannot open shared object file: No such file or directory

So using mpif90 does not work because it links a shared library that cannot be found at run time. Perhaps that library is only visible on the batch nodes. You can try adding -with-batch=0 to the ./configure options.

Barry

On Feb 9, 2024, at 5:01 PM, Zhang, Chonglin wrote:

Dear PETSc developers,

I am trying to compile PETSc on Summit with gcc 12.1.0 and spectrum-mpi 10.4.0.6, but encountered the following configuration issues:

=============================================================================
                Configuring PETSc to compile on your system
=============================================================================
TESTING: checkFortranCompiler from config.setCompilers(config/BuildSystem/config/setCompilers.py:1271)
*******************************************************************************
         OSError while running ./configure
-------------------------------------------------------------------------------
Cannot run executables created with FC. If this machine uses a batch system to submit jobs you will need to configure using ./configure with the additional option --with-batch. Otherwise there is problem with the compilers. Can you compile and run code with your compiler 'mpif90'?
*******************************************************************************

Also attached is the configure.log file. Could you help with this issue?

Thanks,
Chonglin
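Unrelated to the specific configure flags, a quick way to sanity-check the wrapper compilers outside of PETSc is to build and run a tiny MPI program with the mpicc from the same spectrum-mpi tree (the failing configure check used mpif90, so the analogous Fortran test would be even closer to it). This is only a sketch of what I would try, not something from the thread:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank;
  /* If this links but cannot start, the problem is the runtime
     environment (library paths, MPI only usable on batch nodes),
     not the PETSc configure options. */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Hello from rank %d\n", rank);
  MPI_Finalize();
  return 0;
}

If even this hello-world cannot be launched on the node where ./configure is run, that matches the "Cannot run executables created with FC" message, and the fix is environmental (or configuring with --with-batch) rather than a different set of PETSc options.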
Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0
Thanks Barry! I am still getting the same error message. Any more suggestions? I can see that library from the login node:

lrwxrwxrwx 1 sauesw ccsstaff 29 Jan 16 16:39 /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/lib/libmpi_ibm_usempif08.so -> libmpi_ibm_usempif08.so.3.1.0

Thanks,
Chonglin

From: Barry Smith
Date: Friday, February 9, 2024 at 4:10 PM
To: Zhang, Chonglin
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0

  error while loading shared libraries: libmpi_ibm_usempif08.so: cannot open shared object file: No such file or directory

So using mpif90 does not work because it links a shared library that cannot be found at run time. Perhaps that library is only visible on the batch nodes. You can try adding -with-batch=0 to the ./configure options.

Barry

On Feb 9, 2024, at 5:01 PM, Zhang, Chonglin wrote:

Dear PETSc developers,

I am trying to compile PETSc on Summit with gcc 12.1.0 and spectrum-mpi 10.4.0.6, but encountered the following configuration issues:

=============================================================================
                Configuring PETSc to compile on your system
=============================================================================
TESTING: checkFortranCompiler from config.setCompilers(config/BuildSystem/config/setCompilers.py:1271)
*******************************************************************************
         OSError while running ./configure
-------------------------------------------------------------------------------
Cannot run executables created with FC. If this machine uses a batch system to submit jobs you will need to configure using ./configure with the additional option --with-batch. Otherwise there is problem with the compilers. Can you compile and run code with your compiler 'mpif90'?
*******************************************************************************

Also attached is the configure.log file. Could you help with this issue?

Thanks,
Chonglin
[petsc-users] Help with compiling PETSc on Summit with gcc 12.1.0
Dear PETSc developers,

I am trying to compile PETSc on Summit with gcc 12.1.0 and spectrum-mpi 10.4.0.6, but encountered the following configuration issues:

=============================================================================
                Configuring PETSc to compile on your system
=============================================================================
TESTING: checkFortranCompiler from config.setCompilers(config/BuildSystem/config/setCompilers.py:1271)
*******************************************************************************
         OSError while running ./configure
-------------------------------------------------------------------------------
Cannot run executables created with FC. If this machine uses a batch system to submit jobs you will need to configure using ./configure with the additional option --with-batch. Otherwise there is problem with the compilers. Can you compile and run code with your compiler 'mpif90'?
*******************************************************************************

Also attached is the configure.log file. Could you help with this issue?

Thanks,
Chonglin
Re: [petsc-users] [EXTERNAL] GPU implementation of serial smoothers
I am using the following options in my Poisson solver running on GPU, which were suggested by Barry and Mark (Dr. Mark Adams):

-ksp_type cg -pc_type gamg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi

On Jan 10, 2023, at 3:31 PM, Mark Lohry wrote:

So what are people using for GAMG configs on GPU? I was hoping petsc today would be performance competitive with AMGx but it sounds like that's not the case?

On Tue, Jan 10, 2023 at 3:03 PM Jed Brown <j...@jedbrown.org> wrote:

Mark Lohry <mlo...@gmail.com> writes:

> I definitely need multigrid. I was under the impression that GAMG was
> relatively cuda-complete, is that not the case? What functionality works
> fully on GPU and what doesn't, without any host transfers (aside from
> what's needed for MPI)?
>
> If I use -ksp-pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type
> richardson is that fully on device, but -mg_levels_pc_type ilu or
> -mg_levels_pc_type sor require transfers?

You can do `-mg_levels_pc_type ilu`, but it'll be extremely slow (like 20x slower than an operator apply). One can use Krylov smoothers, though that's more synchronization. Automatic construction of operator-dependent multistage smoothers for linear multigrid (because Chebyshev only works for problems that have eigenvalues near the real axis) is something I've wanted to develop for at least a decade, but time is always short. I might put some effort into p-MG with such smoothers this year as we add DDES to our scale-resolving compressible solver.
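For anyone wiring the same configuration up in code rather than on the command line, here is a rough, self-contained sketch. The option and type names (cg, gamg, chebyshev, jacobi) are the ones quoted above; everything else -- the 1D Laplacian stand-in for the Poisson operator and the use of PetscOptionsSetValue for the level options -- is my own illustrative choice, not something posted in the thread:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PC             pc;
  PetscInt       i, n = 100, Istart, Iend;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* Same smoother options as -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi */
  ierr = PetscOptionsSetValue(NULL, "-mg_levels_ksp_type", "chebyshev");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-mg_levels_pc_type", "jacobi");CHKERRQ(ierr);

  /* A 1D Laplacian as a stand-in for the Poisson operator */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; ++i) {
    if (i > 0)     {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    if (i < n - 1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  /* -ksp_type cg -pc_type gamg */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCGAMG);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

On a CUDA-enabled build, running this with -mat_type aijcusparse moves the matrix (and the vectors created from it) to the GPU; on a CPU-only build the same code runs unchanged.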
Re: [petsc-users] is PETSc's random deterministic?
I was having the same website-not-found problem the other day. I remember an email from Satish saying PETSc has a new website. It seems that all the manual pages are now hosted there: https://petsc.org/release/documentation/manualpages/; https://petsc.org/release/docs/manualpages/singleindex.html.

Thanks!
Chonglin

On Jul 28, 2021, at 9:31 PM, Mark Adams <mfad...@lbl.gov> wrote:

Also, when I google a function the Argonne web pages are not found (MIT seems to have mirrored this and that works).

Thanks,
Mark
Re: [petsc-users] Proper GPU usage in PETSc
Hi Matt,

Thanks for the comments and the nice example code. Right now our objective is to use the XGC unstructured flux-surface-following mesh (fixed in size); I will keep your comment on mesh refinement in mind.

Thanks!
Chonglin

On Sep 24, 2020, at 3:26 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 3:08 PM Zhang, Chonglin <zhang...@rpi.edu> wrote:

Hi Matt,

I will quickly summarize what I found with "CreateMesh" for running ex12 here: https://gitlab.com/petsc/petsc/-/blob/master/src/snes/tutorials/ex12.c. If this is not the proper thread to discuss this, I can open a new one.

Commands used (relevant to mesh creation) to run ex12 (quad-core desktop computer with CPU only, 4 MPI ranks):

mpirun -np 4 -cells 100,100,0 -options_left -log_view

I built PETSc commit 2bbfc05, dated Sep 23, 2020, with debug=no.

Mesh size      CreateMesh (seconds)   DMPlexDistribute (seconds)
100 x 100             0.14                    0.081
500 x 500             2.28                    1.33
1000 x 1000          10.1                     5.10
2000 x 1000          24.6                    10.96
2000 x 2000          73.7                    22.72

Is the performance reasonable for the "CreateMesh" functionality? Anything I am not doing correctly with DMPlex running ex12?

ex12 is a little old. I have been meaning to update it. ex13 does the same thing in a more modern way. The above looks reasonable, I think. The CreateMesh time includes generating the mesh using Triangle, since simplex is the default. In example 12 you could use -simplex 0, or in ex13 -dm_plex_box_simplex 0, to get hexes, which do not use a generator. Second, you are interpolating all on process 0, which is probably the bulk of the time. I do that because I do not care about parallel performance in the examples and it is simpler. You can also refine the mesh after distribution, which is faster and cuts down on the distribution time. So if you want the whole thing, you could use

  DM odm, dm;

  /* Create a cell-vertex box mesh */
  ierr = DMPlexCreateBoxMesh(comm, 2, PETSC_TRUE, NULL, NULL, NULL, NULL, PETSC_FALSE, &odm);CHKERRQ(ierr);
  ierr = PetscObjectSetOptionsPrefix((PetscObject) odm, "orig_");CHKERRQ(ierr);
  /* Distribute the mesh here */
  ierr = DMSetFromOptions(odm);CHKERRQ(ierr);
  /* Interpolate the mesh */
  ierr = DMPlexInterpolate(odm, &dm);CHKERRQ(ierr);
  ierr = DMDestroy(&odm);CHKERRQ(ierr);
  /* Refine the mesh */
  ierr = DMSetFromOptions(dm);CHKERRQ(ierr);

and run with

  -dm_plex_box_simplex 0 -dm_plex_box_faces 100,100 -orig_dm_distribute -dm_refine 3

Thanks,
Matt

Thanks!
Chonglin

On Sep 24, 2020, at 2:06 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:

On Thu, Sep 24, 2020 at 1:38 PM Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 12:48 PM Zhang, Chonglin <zhang...@rpi.edu> wrote:

Thanks Mark and Barry! A quick try of using "-pc_type jacobi" did reduce the count for "CpuToGpu" and "GpuToCpu", although using "-pc_type gamg" (the counts did not decrease in this case) solves the problem faster (this may not mean much since the problem size is too small; also, the function "DMPlexCreateFromCellListParallelPetsc()" is slow for large problem sizes, preventing me from running larger problems -- separate issue).

It sounds like something is wrong then, or I do not understand what you mean by slow.

sor may be the default so you need to set the -mg_level_ksp[pc]_type chebyshev[jacobi]. chebyshev is the ksp default.

I meant for the mesh creation.

Thanks,
Matt

Would this "CpuToGpu" and "GpuToCpu" data transfer contribute a significant amount of time for a realistic-sized problem, say for example a linear problem with ~1-2 million DOFs?

Also, is there any plan to have the SNES and DMPlex code run on GPU?

Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:

MatSOR() runs on the CPU; this causes a copy to the CPU for each application of MatSOR() and then a copy to the GPU for the next step. You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is amenable for your problem. Also, the problem is way too small for a GPU. There will be copies between the GPU/CPU for each SNES iteration since the DMPLEX code does not run on GPUs.

Barry

On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the proper way to: (1) solve a linear equation in SNES, using the GPU in PETSc; what syntax/arguments should I be using; (2) avoid/reduce the "CpuToGpu count" and "GpuToCpu count" data transfers shown in the PETSc log file, when using CUDA-aware MPI.
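A side note on how the two timing columns above can be separated cleanly: each phase can be pushed onto its own logging stage so that -log_view reports them independently of the Main Stage. The sketch below is mine, not from the thread; the stage names are made up, and the DMPlexCreateBoxMesh call mirrors the 2020-era signature in Matt's snippet (newer releases changed it).

#include <petscdmplex.h>

int main(int argc, char **argv)
{
  DM             dm, dmDist = NULL;
  PetscLogStage  stageCreate, stageDistribute;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = PetscLogStageRegister("CreateMesh", &stageCreate);CHKERRQ(ierr);
  ierr = PetscLogStageRegister("DistributeMesh", &stageDistribute);CHKERRQ(ierr);

  /* Serial mesh generation, timed in its own stage */
  ierr = PetscLogStagePush(stageCreate);CHKERRQ(ierr);
  ierr = DMPlexCreateBoxMesh(PETSC_COMM_WORLD, 2, PETSC_TRUE, NULL, NULL, NULL, NULL, PETSC_TRUE, &dm);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  /* Distribution timed separately */
  ierr = PetscLogStagePush(stageDistribute);CHKERRQ(ierr);
  ierr = DMPlexDistribute(dm, 0, NULL, &dmDist);CHKERRQ(ierr);
  if (dmDist) {ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmDist;}
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Run with -log_view and the CreateMesh and DistributeMesh stages then show up as separate sections in the summary.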
Re: [petsc-users] Proper GPU usage in PETSc
Hi Matt,

I will quickly summarize what I found with "CreateMesh" for running ex12 here: https://gitlab.com/petsc/petsc/-/blob/master/src/snes/tutorials/ex12.c. If this is not the proper thread to discuss this, I can open a new one.

Commands used (relevant to mesh creation) to run ex12 (quad-core desktop computer with CPU only, 4 MPI ranks):

mpirun -np 4 -cells 100,100,0 -options_left -log_view

I built PETSc commit 2bbfc05, dated Sep 23, 2020, with debug=no.

Mesh size      CreateMesh (seconds)   DMPlexDistribute (seconds)
100 x 100             0.14                    0.081
500 x 500             2.28                    1.33
1000 x 1000          10.1                     5.10
2000 x 1000          24.6                    10.96
2000 x 2000          73.7                    22.72

Is the performance reasonable for the "CreateMesh" functionality? Anything I am not doing correctly with DMPlex running ex12?

Thanks!
Chonglin

On Sep 24, 2020, at 2:06 PM, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:

On Thu, Sep 24, 2020 at 1:38 PM Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Sep 24, 2020 at 12:48 PM Zhang, Chonglin <zhang...@rpi.edu> wrote:

Thanks Mark and Barry! A quick try of using "-pc_type jacobi" did reduce the count for "CpuToGpu" and "GpuToCpu", although using "-pc_type gamg" (the counts did not decrease in this case) solves the problem faster (this may not mean much since the problem size is too small; also, the function "DMPlexCreateFromCellListParallelPetsc()" is slow for large problem sizes, preventing me from running larger problems -- separate issue).

It sounds like something is wrong then, or I do not understand what you mean by slow.

sor may be the default so you need to set the -mg_level_ksp[pc]_type chebyshev[jacobi]. chebyshev is the ksp default.

I meant for the mesh creation.

Thanks,
Matt

Would this "CpuToGpu" and "GpuToCpu" data transfer contribute a significant amount of time for a realistic-sized problem, say for example a linear problem with ~1-2 million DOFs?

Also, is there any plan to have the SNES and DMPlex code run on GPU?

Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:

MatSOR() runs on the CPU; this causes a copy to the CPU for each application of MatSOR() and then a copy to the GPU for the next step. You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is amenable for your problem. Also, the problem is way too small for a GPU. There will be copies between the GPU/CPU for each SNES iteration since the DMPLEX code does not run on GPUs.

Barry

On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the proper way to: (1) solve a linear equation in SNES, using the GPU in PETSc; what syntax/arguments should I be using; (2) avoid/reduce the "CpuToGpu count" and "GpuToCpu count" data transfers shown in the PETSc log file, when using CUDA-aware MPI.

Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI's AiMOS computer (node-wise, it is the same as Summit);
(2) GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex:
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create the vector and matrix;
(3) using the SNES interface to solve the linear Poisson equation, with "-snes_type ksponly";
(4) using "-dm_vec_type cuda" and "-dm_mat_type aijcusparse" to use GPU vectors and matrices, as suggested on this webpage: https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using "-use_gpu_aware_mpi" with PETSc, and using `mpirun -gpu` to enable GPU-Direct (similar to `srun --smpiargs="-gpu"` for Summit): https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using "-options_left" to check and make sure all the arguments are accepted and used by PETSc;
(7) after problem setup, running "SNESSolve()" multiple times to solve the linear problem and observing the log file with "-log_view".

I noticed that if I run "SNESSolve()" 500 times, instead of 50 times, the "CpuToGpu count" and/or "GpuToCpu count" increased roughly 10 times for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo.

See below for a truncated log corresponding to running SNESSolve() 500 times:

Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
Re: [petsc-users] Proper GPU usage in PETSc
On Sep 24, 2020, at 1:11 PM, Barry Smith <bsm...@petsc.dev> wrote:

> On Sep 24, 2020, at 11:48 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:
>>
>> Thanks Mark and Barry! A quick try of using "-pc_type jacobi" did reduce the count for "CpuToGpu" and "GpuToCpu", although using "-pc_type gamg" (the counts did not decrease in this case) solves the problem faster (this may not mean much since the problem size is too small; also, the function "DMPlexCreateFromCellListParallelPetsc()" is slow for large problem sizes, preventing me from running larger problems -- separate issue).
>>
>> Would this "CpuToGpu" and "GpuToCpu" data transfer contribute a significant amount of time for a realistic-sized problem, say for example a linear problem with ~1-2 million DOFs?
>
> It depends on how often the copies are done. With GAMG, once the preconditioner is built the entire linear solve can run on the GPU, and Mark has some good speedups of the linear solve using GAMG on the GPU instead of the CPU on Summit. The speedup of the entire simulation will depend on the relative cost of the finite element matrix assembly vs the linear solver time, and Amdahl's law kicks in: for example, if the finite element assembly takes 50 percent of the time, then even if the linear solve takes no time at all one can only get a speedup of two, which is not much.

Thanks for the detailed explanation Barry!

Mark: could you share the results of GAMG on GPU vs CPU on Summit, or point me to where I could see them? (Actual code showing how you are doing this would be even better, as a learning opportunity for me.) Thanks!

>> Also, is there any plan to have the SNES and DMPlex code run on GPU?
>
> Basically the finite element computation for the nonlinear function and its Jacobian need to run on the GPU; this is a big project that we've barely begun thinking about. If this is something you are interested in, it would be fantastic if you could take a look at that.

I see. I will think about this, discuss internally, and get back to you if I can!

Thanks!
Chonglin

> Barry
>
>> Thanks!
>> Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:

MatSOR() runs on the CPU; this causes a copy to the CPU for each application of MatSOR() and then a copy to the GPU for the next step. You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is amenable for your problem. Also, the problem is way too small for a GPU. There will be copies between the GPU/CPU for each SNES iteration since the DMPLEX code does not run on GPUs.

Barry

On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the proper way to: (1) solve a linear equation in SNES, using the GPU in PETSc; what syntax/arguments should I be using; (2) avoid/reduce the "CpuToGpu count" and "GpuToCpu count" data transfers shown in the PETSc log file, when using CUDA-aware MPI.

Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI's AiMOS computer (node-wise, it is the same as Summit);
(2) GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex:
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create the vector and matrix;
(3) using the SNES interface to solve the linear Poisson equation, with "-snes_type ksponly";
(4) using "-dm_vec_type cuda" and "-dm_mat_type aijcusparse" to use GPU vectors and matrices, as suggested on this webpage: https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using "-use_gpu_aware_mpi" with PETSc, and using `mpirun -gpu` to enable GPU-Direct (similar to `srun --smpiargs="-gpu"` for Summit): https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using "-options_left" to check and make sure all the arguments are accepted and used by PETSc;
(7) after problem setup, running "SNESSolve()" multiple times to solve the linear problem and observing the log file with "-log_view".

I noticed that if I run "SNESSolve()" 500 times, instead of 50 times, the "CpuToGpu count" and/or "GpuToCpu count" increased roughly 10 times for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo.

See below for a truncated log corresponding to running SNESSolve() 500 times:

Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
--
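Barry's Amdahl's-law point above (assembly at 50 percent of the run time caps the overall speedup at two, no matter how fast the solve gets) is just the standard formula; a few illustrative lines, not from the thread, make the arithmetic explicit:

#include <stdio.h>

/* Amdahl's law: with a serial (CPU-only) fraction s of the total run time,
   the best possible overall speedup is 1/s, however fast the rest becomes. */
int main(void)
{
  double s             = 0.5;   /* e.g. finite element assembly stays on the CPU */
  double solve_speedup = 1e9;   /* pretend the linear solve becomes essentially free */
  double overall       = 1.0 / (s + (1.0 - s) / solve_speedup);
  printf("overall speedup = %.2f\n", overall);  /* prints ~2.00 */
  return 0;
}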
Re: [petsc-users] Proper GPU usage in PETSc
Thanks Mark and Barry!

A quick try of using "-pc_type jacobi" did reduce the count for "CpuToGpu" and "GpuToCpu", although using "-pc_type gamg" (the counts did not decrease in this case) solves the problem faster (this may not mean much since the problem size is too small; also, the function "DMPlexCreateFromCellListParallelPetsc()" is slow for large problem sizes, preventing me from running larger problems -- separate issue).

Would this "CpuToGpu" and "GpuToCpu" data transfer contribute a significant amount of time for a realistic-sized problem, say for example a linear problem with ~1-2 million DOFs?

Also, is there any plan to have the SNES and DMPlex code run on GPU?

Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:

MatSOR() runs on the CPU; this causes a copy to the CPU for each application of MatSOR() and then a copy to the GPU for the next step. You can try, for example, -pc_type jacobi, or better yet use PCGAMG if it is amenable for your problem. Also, the problem is way too small for a GPU. There will be copies between the GPU/CPU for each SNES iteration since the DMPLEX code does not run on GPUs.

Barry

On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the proper way to: (1) solve a linear equation in SNES, using the GPU in PETSc; what syntax/arguments should I be using; (2) avoid/reduce the "CpuToGpu count" and "GpuToCpu count" data transfers shown in the PETSc log file, when using CUDA-aware MPI.

Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI's AiMOS computer (node-wise, it is the same as Summit);
(2) GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex:
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create the vector and matrix;
(3) using the SNES interface to solve the linear Poisson equation, with "-snes_type ksponly";
(4) using "-dm_vec_type cuda" and "-dm_mat_type aijcusparse" to use GPU vectors and matrices, as suggested on this webpage: https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using "-use_gpu_aware_mpi" with PETSc, and using `mpirun -gpu` to enable GPU-Direct (similar to `srun --smpiargs="-gpu"` for Summit): https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using "-options_left" to check and make sure all the arguments are accepted and used by PETSc;
(7) after problem setup, running "SNESSolve()" multiple times to solve the linear problem and observing the log file with "-log_view".

I noticed that if I run "SNESSolve()" 500 times, instead of 50 times, the "CpuToGpu count" and/or "GpuToCpu count" increased roughly 10 times for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo.
See below for a truncated log corresponding to running SNESSolve() 500 times:

Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided        510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 1.0e+03   0 0 0 0 0    0 0 21 0 0     0       0      0 0.00e+00     0 0.00e+00   0
BuildTwoSidedF       501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+03   0 0 0 0 0    0 0 0 0 0      0       0      0 0.00e+00     0 0.00e+00   0
SNESSolve            500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 8.7e+05 100 100 0 0 100  100 100 0 0 100  144  202  31947 7.82e+02 63363 1.44e+03  82
SNESSetUp              1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00   0 0 0 0 0    0 0 0 0 0      0       0      0 0.00e+00     0 0.00e+00   0
SNESFunctionEval     500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 5.0e+02  12 3 0 0 0   12 3 0 0 0     36      13      0 0.00e+00  1000 2.48e+01   0
SNESJacobianEval     500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 2.0e+03  15 5 0 0 0   15 5 0 0 0     50       0   1000 7.77e+01   500 1.24e+01   0
DMPlexResidualFE     500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 0.0e+00  10 3 0 0 0   10 3 0 0 0     39       0      0 0.00e+00   500 1.24e+01   0
DMPlexJacobianFE     500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 2.0e+03  14 5 0 0 0   14 5 0 0 0     52       0   1000 7.77e+01     0 0.00e+00   0
MatSOR             30947 1.0 3.1254e+00 1.1 1.21e+09 1.
[petsc-users] Proper GPU usage in PETSc
Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the proper way to: (1) solve a linear equation in SNES, using the GPU in PETSc; what syntax/arguments should I be using; (2) avoid/reduce the "CpuToGpu count" and "GpuToCpu count" data transfers shown in the PETSc log file, when using CUDA-aware MPI.

Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI's AiMOS computer (node-wise, it is the same as Summit);
(2) GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex:
(1) using DMPlex to set up the unstructured mesh;
(2) using DM to create the vector and matrix;
(3) using the SNES interface to solve the linear Poisson equation, with "-snes_type ksponly";
(4) using "-dm_vec_type cuda" and "-dm_mat_type aijcusparse" to use GPU vectors and matrices, as suggested on this webpage: https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using "-use_gpu_aware_mpi" with PETSc, and using `mpirun -gpu` to enable GPU-Direct (similar to `srun --smpiargs="-gpu"` for Summit): https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using "-options_left" to check and make sure all the arguments are accepted and used by PETSc;
(7) after problem setup, running "SNESSolve()" multiple times to solve the linear problem and observing the log file with "-log_view".

I noticed that if I run "SNESSolve()" 500 times, instead of 50 times, the "CpuToGpu count" and/or "GpuToCpu count" increased roughly 10 times for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo.

See below for a truncated log corresponding to running SNESSolve() 500 times:

Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided        510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 1.0e+03   0 0 0 0 0    0 0 21 0 0     0       0      0 0.00e+00     0 0.00e+00   0
BuildTwoSidedF       501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+03   0 0 0 0 0    0 0 0 0 0      0       0      0 0.00e+00     0 0.00e+00   0
SNESSolve            500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 8.7e+05 100 100 0 0 100  100 100 0 0 100  144  202  31947 7.82e+02 63363 1.44e+03  82
SNESSetUp              1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00   0 0 0 0 0    0 0 0 0 0      0       0      0 0.00e+00     0 0.00e+00   0
SNESFunctionEval     500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 5.0e+02  12 3 0 0 0   12 3 0 0 0     36      13      0 0.00e+00  1000 2.48e+01   0
SNESJacobianEval     500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 2.0e+03  15 5 0 0 0   15 5 0 0 0     50       0   1000 7.77e+01   500 1.24e+01   0
DMPlexResidualFE     500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 0.0e+00  10 3 0 0 0   10 3 0 0 0     39       0      0 0.00e+00   500 1.24e+01   0
DMPlexJacobianFE     500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 2.0e+03  14 5 0 0 0   14 5 0 0 0     52       0   1000 7.77e+01     0 0.00e+00   0
MatSOR             30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00   1 10 0 0 0   1 10 0 0 0  1542       0      0 0.00e+00 61863 1.41e+03   0
MatAssemblyBegin     511 1.0 5.3428e+00 256.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+03  1 0 0 0 0    1 0 0 0 0      0       0      0 0.00e+00     0 0.00e+00   0
MatAssemblyEnd       511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.1e+01   0 0 0 0 0    0 0 0 0 0      0       0   1002 7.80e+01     0 0.00e+00   0
MatCUSPARSCopyTo    1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   0 0 0 0 0    0 0 0 0 0      0       0   1002 7.80e+01     0 0.00e+00   0
VecMDot            29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00 0.0e+00 6.0e+04  12 22 0 0 7  12 22 0 0 7   277    3236  29930 6.81e+02     0 0.00e+00 100
VecNorm            31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00 0.0e+00 6.3e+04   5 2 0 0 7    5 2 0 0 7     34      55   1017 2.31e+01     0 0.00e+00 100
VecNormalize       30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00 0.0e+00 6.2e+04   7 2 0 0 7    7 2 0 0 7     44      51   1017 2.31e+01     0 0.00e+00 100
VecCUDACopyTo      30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00   2 0 0 0 0    2 0 0 0 0      0       0  30947 7.04e+02     0 0.00e+00   0
VecCUDACopyFrom    63363 1.0 1.0873e+00 1.1 0.00e+00 0.0
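To tie steps (1)-(4) above together, here is a minimal sketch of selecting the CUDA vector and matrix types on a DMPlex from code. It is my own illustration, not code from the thread: the DMPlexCreateBoxMesh signature shown is the 2020-era one, a P1 PetscFE field is added only so the DM can lay out a vector and matrix, and the SNES residual/Jacobian setup for the actual Poisson solve is omitted.

#include <petscdmplex.h>
#include <petscfe.h>

int main(int argc, char **argv)
{
  DM             dm;
  PetscFE        fe;
  Vec            u;
  Mat            J;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

  /* (1) an unstructured mesh through DMPlex */
  ierr = DMPlexCreateBoxMesh(PETSC_COMM_WORLD, 2, PETSC_TRUE, NULL, NULL, NULL, NULL, PETSC_TRUE, &dm);CHKERRQ(ierr);

  /* (4) CUDA vector and matrix types, the programmatic equivalent of
         -dm_vec_type cuda -dm_mat_type aijcusparse */
  ierr = DMSetVecType(dm, VECCUDA);CHKERRQ(ierr);
  ierr = DMSetMatType(dm, MATAIJCUSPARSE);CHKERRQ(ierr);

  /* A P1 finite element field so the DM can lay out a vector and matrix */
  ierr = PetscFECreateDefault(PETSC_COMM_WORLD, 2, 1, PETSC_TRUE, NULL, -1, &fe);CHKERRQ(ierr);
  ierr = DMSetField(dm, 0, NULL, (PetscObject) fe);CHKERRQ(ierr);
  ierr = DMCreateDS(dm);CHKERRQ(ierr);
  ierr = PetscFEDestroy(&fe);CHKERRQ(ierr);

  /* (2) the vector and matrix created from the DM inherit the GPU types */
  ierr = DMCreateGlobalVector(dm, &u);CHKERRQ(ierr);
  ierr = DMCreateMatrix(dm, &J);CHKERRQ(ierr);

  /* (3) a SNES with -snes_type ksponly would be created here, its FE
         residual/Jacobian callbacks attached, and SNESSolve() called */

  ierr = MatDestroy(&J);CHKERRQ(ierr);
  ierr = VecDestroy(&u);CHKERRQ(ierr);
  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

On a CUDA-enabled PETSc build this produces a VECCUDA global vector and a MATAIJCUSPARSE Jacobian; the two DMSet*Type calls do the same thing that -dm_vec_type cuda -dm_mat_type aijcusparse do when DMSetFromOptions() is used, which is what step (4) relies on.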