Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-07 Thread Matthew Knepley
On Wed, Apr 7, 2021 at 5:47 PM Scott Kruger  wrote:

> On 2021-04-06 14:44, Matthew Knepley did write:
> > > > Does spack have some magic for this we could use?
> > > >
> > >
> > > spack developed the archspec repo to abstract all of these issues:
> > > https://github.com/archspec/archspec
> >
> >
> > I do not love it. Besides the actual code (you can always complain about
> > code), they do not really do any tests; they just look in a few places
> > where the data should be. We can do the same thing in probably 10x less
> > code. It would be great to actually test the hardware to verify.
> >
>
> My impression is that the current project is languishing because they
> are focusing on the spack side right now.   But if this is the project
> that is the ECP-anointed solution, then it has the best chance of
> succeeding through sheer resources.
>

Maybe this will eventually end up being good. However, in my lifetime at
DOE, funded projects are usually the best indicator of what not to do.

   Matt


> The thing I like best is that a stand-alone project to handle
> these issues is a real forehead-slapper (i.e., "why didn't I think of
> that?!").  Todd Gamblin has stated that the goal is to allow vendors to
> contribute because it will be in their interest to contribute.  This
> should have been done years ago.
>
> Regarding whether we could do better:  Now would actually be a good time
> to contribute while the project is young, but I don't have the time
> (like everyone else, which is why this is a perennial problem).   It
> would also be a good time to create a separate project if this one is
> too annoying for folks.  In general, like spack, they have done a good
> job on the interface, so that part is important.
>
> Scott
>
>
>
>
> >   Thanks,
> >
> >  Matt
> >
> >
> > > This is a *great* idea and eventually BuildSystem should incorporate it
> > > as the standard way of doing things; however, it has been focused mostly
> > > on the CPU issues, and is still under active development (my
> > > understanding is that pulling it out of spack and getting those interop
> > > issues sorted out is tangled up in how spack handles dependencies and
> > > compilers).  It'd be nice if someone would go in and port the Kokkos gpu
> > > mappings to archspec, as there is some great knowledge of these mappings
> > > buried in the Kokkos build system (not volunteering); i.e., translating
> > > that webpage to some real code (even if it is in make) is valuable.
> > >
> > > TL;DR:  It's a known problem with currently no good solution AFAIK.
> > > Waiting until archspec gets further along seems like the best solution.
> > >
> > > Scott
> > >
> > > P.S. ROCm has rocminfo which also doesn't solve the problem but is at
> > > least sane.
> > >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which
> their
> > experiments lead.
> > -- Norbert Wiener
> >
> > https://www.cse.buffalo.edu/~knepley/
>
> --
> Scott Kruger
> Tech-X Corporation   kru...@txcorp.com
> 5621 Arapahoe Ave, Suite A   Phone: (720) 466-3196
> Boulder, CO 80303            Fax:   (303) 448-7756
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-07 Thread Scott Kruger
On 2021-04-06 14:44, Matthew Knepley did write:
> > > Does spack have some magic for this we could use?
> > >
> >
> > spack developed the archspec repo to abstract all of these issues:
> > https://github.com/archspec/archspec
> 
> 
> I do not love it. Besides the actual code (you can always complain about
> code), they do not really do any tests; they just look in a few places
> where the data should be. We can do the same thing in probably 10x less
> code. It would be great to actually test the hardware to verify.
> 

My impression is that the current project is languishing because they
are focusing on the spack side right now.   But if this is the project
that is the ECP-anointed solution, then it has the best chance of
succeeding through sheer resources.   

The thing I like best is that a stand-alone project to handle
these issues is a real forehead-slapper (i.e., "why didn't I think of
that?!").  Todd Gamblin has stated that the goal is to allow vendors to
contribute because it will be in their interest to contribute.  This
should have been done years ago.

Regarding whether we could do better:  Now would actually be a good time
to contribute while the project is young, but I don't have the time
(like everyone else, which is why this is a perennial problem).   It
would also be a good time to create a separate project if this one is
too annoying for folks.  In general, like spack, they have done a good
job on the interface, so that part is important.

Scott




>   Thanks,
> 
>  Matt
> 
> 
> > This is a *great* idea and eventually BuildSystem should incorporate it as
> > the standard way of doing things; however, it has been focused mostly on
> > the CPU issues, and is still under active development (my understanding
> > is that pulling it out of spack and getting those interop issues
> > sorted out is tangled up in how spack handles dependencies and
> > compilers).  It'd be nice if someone would go in and port the Kokkos gpu
> > mappings to archspec, as there is some great knowledge of these mappings
> > buried in the Kokkos build system (not volunteering); i.e., translating
> > that webpage to some real code (even if it is in make) is valuable.
> >
> > TL;DR:  It's a known problem with currently no good solution AFAIK.
> > Waiting until archspec gets further along seems like the best solution.
> >
> > Scott
> >
> > P.S. ROCm has rocminfo which also doesn't solve the problem but is at
> > least sane.
> >
> 
> 
> -- 
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 

-- 
Scott Kruger
Tech-X Corporation   kru...@txcorp.com
5621 Arapahoe Ave, Suite A   Phone: (720) 466-3196
Boulder, CO 80303            Fax:   (303) 448-7756


Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-06 Thread Matthew Knepley
On Tue, Apr 6, 2021 at 2:08 PM Scott Kruger  wrote:

>
> I wrote and sent this yesterday but am having some strange mailing issues.
>
> On 2021-04-03 22:42, Barry Smith did write:
> >
> >   It would be very nice to NOT require PETSc users to provide this flag;
> > how the heck will they know what it should be when we cannot automate it
> > ourselves?
> >
> >   Any ideas of how this can be determined based on the current system?
> > NVIDIA does not help, since these "advertising" names don't seem to
> > trivially map to information you can get from a particular GPU when you
> > are logged into it. For example, nvidia-smi doesn't use these names
> > directly. Is there some mapping from nvidia-smi to these names we could
> > use? If we are serious about having a non-trivial number of users
> > utilizing GPUs, which we need to be for the future, we cannot have these
> > absurd demands in our installation process.
>
> The mapping of the NVIDIA card to gencodes and CUDA arch values is one of
> those annoyances so ridiculous it is hard to believe.
> The best reference I have found is this:
>
> https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
>
> To this end, the fact that Kokkos provides a mapping from colloquial
> card name to gencode/arch is a real benefit.  The problem is
> that this mapping is buried in their build system and lacks
> introspection.
>
> >
> >   Barry
> >
> > Does spack have some magic for this we could use?
> >
>
> spack developed the archspec repo to abstract all of these issues:
> https://github.com/archspec/archspec


I do not love it. Besides the actual code (you can always complain about
code), they do not really do any tests; they just look in a few places
where the data should be. We can do the same thing in probably 10x less
code. It would be great to actually test the hardware to verify.

  Thanks,

 Matt


> This is a *great* idea and eventually BuildSystem should incorporate it as
> the standard way of doing things; however, it has been focused mostly on
> the CPU issues, and is still under active development (my understanding
> is that pulling it out of spack and getting those interop issues
> sorted out is tangled up in how spack handles dependencies and
> compilers).  It'd be nice if someone would go in and port the Kokkos gpu
> mappings to archspec, as there is some great knowledge of these mappings
> buried in the Kokkos build system (not volunteering); i.e., translating
> that webpage to some real code (even if it is in make) is valuable.
>
> TL;DR:  It's a known problem with currently no good solution AFAIK.
> Waiting until archspec gets further along seems like the best solution.
>
> Scott
>
> P.S. ROCm has rocminfo which also doesn't solve the problem but is at
> least sane.
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-06 Thread Scott Kruger


I wrote and sent this yesterday but am having some strange mailing issues.

On 2021-04-03 22:42, Barry Smith did write:
> 
>   It would be very nice to NOT require PETSc users to provide this flag; how
> the heck will they know what it should be when we cannot automate it
> ourselves?
>
>   Any ideas of how this can be determined based on the current system? NVIDIA
> does not help, since these "advertising" names don't seem to trivially map to
> information you can get from a particular GPU when you are logged into it. For
> example, nvidia-smi doesn't use these names directly. Is there some mapping
> from nvidia-smi to these names we could use? If we are serious about having a
> non-trivial number of users utilizing GPUs, which we need to be for the
> future, we cannot have these absurd demands in our installation process.

The mapping of the NVIDIA card to gencodes and CUDA arch values is one of
those annoyances so ridiculous it is hard to believe.
The best reference I have found is this:
https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

To this end, the fact that Kokkos provides a mapping from colloquial
card name to gencode/arch is a real benefit.  The problem is
that this mapping is buried in their build system and lacks
introspection.
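
For reference, the core of that mapping is tiny; below is a sketch of the
kind of table such introspection would expose, mapping compute capability to
sm value and marketing generation. This is illustrative only (see the
arnon.dk link above for the complete list), not Kokkos's actual code:

  /* Partial, illustrative compute-capability table (not Kokkos's code). */
  struct cc_map { int major, minor; const char *sm; const char *generation; };
  static const struct cc_map cc_table[] = {
    {3, 5, "sm_35", "Kepler"},
    {5, 0, "sm_50", "Maxwell"},
    {6, 0, "sm_60", "Pascal"},
    {6, 1, "sm_61", "Pascal"},
    {7, 0, "sm_70", "Volta"},
    {7, 5, "sm_75", "Turing"},
    {8, 0, "sm_80", "Ampere"},
  };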

> 
>   Barry
> 
> Does spack have some magic for this we could use?
> 

spack developed the archspec repo to abstract all of these issues:
https://github.com/archspec/archspec

This is a *great* idea and eventually BuildSystem should incorporate it as
the standard way of doing things; however, it has been focused mostly on
the CPU issues, and is still under active development (my understanding
is that pulling it out of spack and getting those interop issues
sorted out is tangled up in how spack handles dependencies and
compilers).  It'd be nice if someone would go in and port the Kokkos gpu
mappings to archspec, as there is some great knowledge of these mappings
buried in the Kokkos build system (not volunteering); i.e., translating
that webpage to some real code (even if it is in make) is valuable.

TL;DR:  It's a known problem with currently no good solution AFAIK.
Waiting until archspec gets further along seems like the best solution.

Scott

P.S. ROCm has rocminfo which also doesn't solve the problem but is at
least sane.


Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-06 Thread Barry Smith

   Jeff,

  Likely deviceQuery provides more than enough information; sometimes it is
prebuilt, but it seems it is now only provided as source code, so the user needs
to build it (and the Makefile is huge :-)). I think it would be enough if
NVIDIA just always provided a prebuilt deviceQuery in a standard location.
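
A minimal sketch of such a probe, assuming only the CUDA runtime (compile with
something like "nvcc ccprobe.cu -o ccprobe" and run it on the target node):

  #include <stdio.h>
  #include <cuda_runtime.h>

  /* Tiny deviceQuery stand-in: print the name and compute capability of
     every visible device. */
  int main(void)
  {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
      fprintf(stderr, "no CUDA device visible\n");
      return 1;
    }
    for (int d = 0; d < n; d++) {
      struct cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, d);
      /* prop.major/prop.minor is the compute capability, e.g. 8.0 */
      printf("device %d: %s  cc %d.%d\n", d, prop.name, prop.major, prop.minor);
    }
    return 0;
  }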

  Barry


> On Apr 6, 2021, at 12:42 AM, Jeff Hammond  wrote:
> 
> 
> Generically, independent of Kokkos, ideally I would run a single
> precompiled NVIDIA program that gave me all the information about the current
> hardware I was running, and that would provide in a simple format exactly the
> information I needed to configure PETSc, Kokkos, etc. for THAT system.
> 
> I will try to write something for you tomorrow.  For NVIDIA hardware, the 
> sole dependency will be nvcc.
> 
> Jeff
> 
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com 
> http://jeffhammond.github.io/ 


Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Jeff Hammond
>
>
> Generically, independent of Kokkos, ideally I would run a single
> precompiled NVIDIA program that gave me all the information about the
> current hardware I was running, and that would provide in a simple format
> exactly the information I needed to configure PETSc, Kokkos, etc. for THAT
> system.
>

I will try to write something for you tomorrow.  For NVIDIA hardware, the
sole dependency will be nvcc.

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Barry Smith

  Junchao,

I hope my latest MRs manage that for the current generation of those
values. If not, we need refinement.

  Barry


> On Apr 5, 2021, at 9:30 PM, Junchao Zhang  wrote:
> 
> 
> 
> 
On Mon, Apr 5, 2021 at 7:33 PM Jeff Hammond  wrote:
> NVCC has supported multi-versioned "fat" binaries since I worked for Argonne.
> Libraries should figure out what the oldest hardware they care about is and
> then compile for everything from that point forward.  Kepler (3.5) is the
> oldest version any reasonable person should be thinking about at this point.
> The oldest thing I know of in the DOE HPC fleet is Pascal (6.x).  Volta and
> Turing are 7.x and Ampere is 8.x.
> 
> The biggest architectural changes came with unified memory
> (https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) and
> cooperative groups (https://developer.nvidia.com/blog/cooperative-groups/,
> in CUDA 9), but Kokkos doesn't use the latter.  Both features can be used
> on quite old GPU architectures, although the performance is better on
> newer ones.
> 
> I haven't dug into what Kokkos and PETSc are doing but the direct use of this 
> stuff in CUDA is well-documented, certainly as well as the CPU switches for 
> x86 binaries in the Intel compiler are.
> 
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
>
> Devices with the same major revision number are of the same core 
> architecture. The major revision number is 8 for devices based on the NVIDIA 
> Ampere GPU architecture, 7 for devices based on the Volta architecture, 6 for 
> devices based on the Pascal architecture, 5 for devices based on the Maxwell 
> architecture, 3 for devices based on the Kepler architecture, 2 for devices 
> based on the Fermi architecture, and 1 for devices based on the Tesla 
> architecture.
> Kokkos has config options Kokkos_ARCH_TURING75, Kokkos_ARCH_VOLTA70, 
> Kokkos_ARCH_VOLTA72.  Any idea how one can map compute capability versions
> to arch names?
>  
> 
> 
> https://docs.nvidia.com/cuda/pascal-compatibility-guide/index.html#building-pascal-compatible-apps-using-cuda-8-0
> https://docs.nvidia.com/cuda/volta-compatibility-guide/index.html#building-volta-compatible-apps-using-cuda-9-0
> https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html#building-turing-compatible-apps-using-cuda-10-0
> https://docs.nvidia.com/cuda/ampere-compatibility-guide/index.html#building-ampere-compatible-apps-using-cuda-11-0
>
> 
> Programmatic querying can be done with the following
> (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html):
> 
> cudaDeviceGetAttribute
>   cudaDevAttrComputeCapabilityMajor: Major compute capability version number;
>   cudaDevAttrComputeCapabilityMinor: Minor compute capability version number;
> The compiler help tells me this, which can be cross-referenced with CUDA 
> documentation above.
> 
> $ /usr/local/cuda-10.0/bin/nvcc -h
> 
> Usage  : nvcc [options] 
> 
> ...
> 
> Options for steering GPU code generation.
> =
> 
> --gpu-architecture   (-arch) 
> Specify the name of the class of NVIDIA 'virtual' GPU architecture 
> for which
> the CUDA input files must be compiled.
> With the exception as described for the shorthand below, the 
> architecture
> specified with this option must be a 'virtual' architecture (such as 
> compute_50).
> Normally, this option alone does not trigger assembly of the 
> generated PTX
> for a 'real' architecture (that is the role of nvcc option 
> '--gpu-code',
> see below); rather, its purpose is to control preprocessing and 
> compilation
> of the input to PTX.
> For convenience, in case of simple nvcc compilations, the following 
> shorthand
>  

Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Barry Smith

  Thanks Jeff,

     The information is eventually there somewhere; the issue is more getting
the information in a simple way, automatically, at PETSc configure time, in a
way that is portable and will never crash.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html
seems to require compiling a program and running it to get the information;
this means invoking nvcc (with what sub-compiler for nvcc, and with what flags,
etc.? Not so easy on systems like Summit where there are multiple choices for
the sub-compiler). So how complicated and fragile do we want to make PETSc
configure (for each particular piece of hardware) to always get the best
information about the current hardware?

   It looks like Kokkos really only needs the NVIDIA numerical generation 
information, not the code name, but their API requires both the codename 
(irrelevant) and the numerical information (relevant) in what is passed to 
Kokkos. The problem has always been generating the irrelevant part so Kokkos 
does not complain. We can, with a little pain, possibly automate completely the 
CUDA device information, the numerical part, but the mapping to code name has 
been problematic because it is hard to find in a single place the mapping from 
numerical information to codename. But I think, thanks to Max's input, I now 
understand the mapping and have put it in PETSc's configure.
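
For reference, the arch names appear to simply encode the compute capability
(PASCAL60 = CC 6.0, VOLTA70 = CC 7.0, TURING75 = CC 7.5, AMPERE80 = CC 8.0),
so the mapping is mechanical once the numerical part is known. A minimal
sketch of that mapping, illustrative rather than PETSc's actual configure
code:

  #include <stdio.h>

  /* Illustrative only: build the Kokkos arch option name from a queried
     compute capability; the names encode the capability directly. */
  static const char *generation(int major, int minor)
  {
    if (major == 8) return "AMPERE";
    if (major == 7) return (minor >= 5) ? "TURING" : "VOLTA";
    if (major == 6) return "PASCAL";
    if (major == 5) return "MAXWELL";
    if (major == 3) return "KEPLER";
    return "UNKNOWN";
  }

  int main(void)
  {
    int major = 7, minor = 0; /* e.g. from cudaDeviceGetAttribute */
    printf("Kokkos_ARCH_%s%d%d\n", generation(major, minor), major, minor);
    return 0; /* prints Kokkos_ARCH_VOLTA70 */
  }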

Generically, independent of Kokkos, ideally I would run a single
precompiled NVIDIA program that gave me all the information about the current
hardware I was running, and that would provide in a simple format exactly the
information I needed to configure PETSc, Kokkos, etc. for THAT system. The idea
of supporting a multitude of hardware is important for package management
systems, but is not important for the 99% of PETSc users who are configuring
for exactly the hardware they have on the system they are configuring on; all
they care about is "give me the best reasonable performance on the machine I
am using today". This means the system software should be able to provide in a
trivial way what the current hardware is. The problem is not unique to GPUs,
of course; it is not always easy to get this information in a portable way for
generic CPUs either.


  Barry






> On Apr 5, 2021, at 7:32 PM, Jeff Hammond  wrote:
> 
> NVCC has supported multi-versioned "fat" binaries since I worked for Argonne.
> Libraries should figure out what the oldest hardware they care about is and
> then compile for everything from that point forward.  Kepler (3.5) is the
> oldest version any reasonable person should be thinking about at this point.
> The oldest thing I know of in the DOE HPC fleet is Pascal (6.x).  Volta and
> Turing are 7.x and Ampere is 8.x.
> 
> The biggest architectural changes came with unified memory
> (https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) and
> cooperative groups (https://developer.nvidia.com/blog/cooperative-groups/,
> in CUDA 9), but Kokkos doesn't use the latter.  Both features can be used
> on quite old GPU architectures, although the performance is better on
> newer ones.
> 
> I haven't dug into what Kokkos and PETSc are doing but the direct use of this 
> stuff in CUDA is well-documented, certainly as well as the CPU switches for 
> x86 binaries in the Intel compiler are.
> 
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
>
> 
> Devices with the same major revision number are of the same core 
> architecture. The major revision number is 8 for devices based on the NVIDIA 
> Ampere GPU architecture, 7 for devices based on the Volta architecture, 6 for 
> devices based on the Pascal architecture, 5 for devices based on the Maxwell 
> architecture, 3 for devices based on the Kepler architecture, 2 for devices 
> based on the Fermi architecture, and 1 for devices based on the Tesla 
> architecture.
> 
> https://docs.nvidia.com/cuda/pascal-compatibility-guide/index.html#building-pascal-compatible-apps-using-cuda-8-0
> https://docs.nvidia.com/cuda/volta-compatibility-guide/index.html#building-volta-compatible-apps-using-cuda-9-0
> https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html#building-turing-compatible-apps-using-cuda-10-0

Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Junchao Zhang
On Mon, Apr 5, 2021 at 7:33 PM Jeff Hammond  wrote:

> NVCC has supported multi-versioned "fat" binaries since I worked for
> Argonne.  Libraries should figure out what the oldest hardware they care
> about is and then compile for everything from that point forward.  Kepler
> (3.5) is the oldest version any reasonable person should be thinking about
> at this point.  The oldest thing I know of in the DOE HPC fleet is Pascal
> (6.x).  Volta and Turing are 7.x and Ampere is 8.x.
>
> The biggest architectural changes came with unified memory
> (https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) and
> cooperative groups (https://developer.nvidia.com/blog/cooperative-groups/,
> in CUDA 9), but Kokkos doesn't use the latter.  Both features can be used on
> quite old GPU architectures, although the performance is better on newer
> ones.
>
> I haven't dug into what Kokkos and PETSc are doing but the direct use of
> this stuff in CUDA is well-documented, certainly as well as the CPU
> switches for x86 binaries in the Intel compiler are.
>
>
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
>
> Devices with the same major revision number are of the same core
> architecture. The major revision number is 8 for devices based on the NVIDIA
> Ampere GPU architecture, 7 for devices based on the Volta architecture, 6
> for devices based on the Pascal architecture, 5 for devices based on the
> Maxwell architecture, 3 for devices based on the Kepler architecture, 2
> for devices based on the Fermi architecture, and 1 for devices based on
> the Tesla architecture.
>
Kokkos has config options Kokkos_ARCH_TURING75,
Kokkos_ARCH_VOLTA70, and Kokkos_ARCH_VOLTA72.  Any idea how one can map
compute capability versions to arch names?


>
>
>
> https://docs.nvidia.com/cuda/pascal-compatibility-guide/index.html#building-pascal-compatible-apps-using-cuda-8-0
>
> https://docs.nvidia.com/cuda/volta-compatibility-guide/index.html#building-volta-compatible-apps-using-cuda-9-0
>
> https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html#building-turing-compatible-apps-using-cuda-10-0
>
> https://docs.nvidia.com/cuda/ampere-compatibility-guide/index.html#building-ampere-compatible-apps-using-cuda-11-0
>
> Programmatic querying can be done with the following (
> https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html):
>
> cudaDeviceGetAttribute
>
>    cudaDevAttrComputeCapabilityMajor: Major compute capability version number;
>    cudaDevAttrComputeCapabilityMinor: Minor compute capability version number;
>
> The compiler help tells me this, which can be cross-referenced with CUDA
> documentation above.
>
> $ /usr/local/cuda-10.0/bin/nvcc -h
>
>
> Usage  : nvcc [options] 
>
>
> ...
>
>
> Options for steering GPU code generation.
>
> =
>
>
> --gpu-architecture   (-arch)
>
>
> Specify the name of the class of NVIDIA 'virtual' GPU
> architecture for which
>
> the CUDA input files must be compiled.
>
> With the exception as described for the shorthand below, the
> architecture
>
> specified with this option must be a 'virtual' architecture (such
> as compute_50).
>
> Normally, this option alone does not trigger assembly of the
> generated PTX
>
> for a 'real' architecture (that is the role of nvcc option
> '--gpu-code',
>
> see below); rather, its purpose is to control preprocessing and
> compilation
>
> of the input to PTX.
>
> For convenience, in case of simple nvcc compilations, the
> following shorthand
>
> is supported.  If no value for option '--gpu-code' is specified,
> then the
>
> value of this option defaults to the value of
> '--gpu-architecture'.  In this
>
> situation, as only exception to the description above, the value
> specified
>
> for '--gpu-architecture' may be a 'real' architecture (such as a
> sm_50),
>
> in which case nvcc uses the specified 'real' architecture and its
> closest
>
> 'virtual' architecture as effective architecture values.  For
> example, 'nvcc
>
> --gpu-architecture=sm_50' is equivalent to 'nvcc
> --gpu-architecture=compute_50
>
> --gpu-code=sm_50,compute_50'.
>
> Allowed values for this option:
> 'compute_30','compute_32','compute_35',
>
>
> 'compute_37','compute_50','compute_52','compute_53','compute_60','compute_61',
>
>
> 'compute_62','compute_70','compute_72','compute_75','sm_30','sm_32','sm_35',
>
>
> 'sm_37','sm_50','sm_52','sm_53','sm_60','sm_61','sm_62','sm_70','sm_72',
>
>  

Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Jeff Hammond
NVCC has supported multi-versioned "fat" binaries since I worked for
Argonne.  Libraries should figure out what the oldest hardware they care
about is and then compile for everything from that point forward.  Kepler
(3.5) is the oldest version any reasonable person should be thinking about
at this point.  The oldest thing I know of in the DOE HPC fleet is Pascal
(6.x).  Volta and Turing are 7.x and Ampere is 8.x.
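
For example, something like the following (assuming a CUDA 11 toolkit, where
compute_80 is available) produces one fat binary covering Pascal through
Ampere, plus PTX for forward compatibility:

  nvcc -gencode arch=compute_60,code=sm_60 \
       -gencode arch=compute_70,code=sm_70 \
       -gencode arch=compute_75,code=sm_75 \
       -gencode arch=compute_80,code=sm_80 \
       -gencode arch=compute_80,code=compute_80 \
       app.cu -o app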

The biggest architectural changes came with unified memory
(https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) and
cooperative groups (https://developer.nvidia.com/blog/cooperative-groups/,
in CUDA 9), but Kokkos doesn't use the latter.  Both features can be used on
quite old GPU architectures, although the performance is better on newer ones.

I haven't dug into what Kokkos and PETSc are doing but the direct use of
this stuff in CUDA is well-documented, certainly as well as the CPU
switches for x86 binaries in the Intel compiler are.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

Devices with the same major revision number are of the same core
architecture. The major revision number is 8 for devices based on the NVIDIA
Ampere GPU architecture, 7 for devices based on the Volta architecture, 6
for devices based on the Pascal architecture, 5 for devices based on the
Maxwell architecture, 3 for devices based on the Kepler architecture, 2 for
devices based on the Fermi architecture, and 1 for devices based on the
Tesla architecture.

https://docs.nvidia.com/cuda/pascal-compatibility-guide/index.html#building-pascal-compatible-apps-using-cuda-8-0
https://docs.nvidia.com/cuda/volta-compatibility-guide/index.html#building-volta-compatible-apps-using-cuda-9-0
https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html#building-turing-compatible-apps-using-cuda-10-0
https://docs.nvidia.com/cuda/ampere-compatibility-guide/index.html#building-ampere-compatible-apps-using-cuda-11-0

Programmatic querying can be done with the following (
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html):

cudaDeviceGetAttribute

   cudaDevAttrComputeCapabilityMajor: Major compute capability version number;
   cudaDevAttrComputeCapabilityMinor: Minor compute capability version number;
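
A minimal sketch of querying exactly those two attributes (device 0, assuming
the CUDA runtime headers):

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
    int major = 0, minor = 0;
    /* Query the compute capability of device 0. */
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, 0);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, 0);
    printf("compute capability %d.%d\n", major, minor); /* e.g. 7.0 on Volta */
    return 0;
  }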

The compiler help tells me this, which can be cross-referenced with CUDA
documentation above.

$ /usr/local/cuda-10.0/bin/nvcc -h


Usage  : nvcc [options] 


...


Options for steering GPU code generation.

=


--gpu-architecture   (-arch)

Specify the name of the class of NVIDIA 'virtual' GPU architecture
for which

the CUDA input files must be compiled.

With the exception as described for the shorthand below, the
architecture

specified with this option must be a 'virtual' architecture (such
as compute_50).

Normally, this option alone does not trigger assembly of the
generated PTX

for a 'real' architecture (that is the role of nvcc option
'--gpu-code',

see below); rather, its purpose is to control preprocessing and
compilation

of the input to PTX.

For convenience, in case of simple nvcc compilations, the following
shorthand

is supported.  If no value for option '--gpu-code' is specified,
then the

value of this option defaults to the value of '--gpu-architecture'.
In this

situation, as only exception to the description above, the value
specified

for '--gpu-architecture' may be a 'real' architecture (such as a
sm_50),

in which case nvcc uses the specified 'real' architecture and its
closest

'virtual' architecture as effective architecture values.  For
example, 'nvcc

--gpu-architecture=sm_50' is equivalent to 'nvcc
--gpu-architecture=compute_50

--gpu-code=sm_50,compute_50'.

Allowed values for this option:
'compute_30','compute_32','compute_35',


'compute_37','compute_50','compute_52','compute_53','compute_60','compute_61',


'compute_62','compute_70','compute_72','compute_75','sm_30','sm_32','sm_35',


'sm_37','sm_50','sm_52','sm_53','sm_60','sm_61','sm_62','sm_70','sm_72',

'sm_75'.


--gpu-code ,...  (-code)

Specify the name of the NVIDIA GPU to assemble and optimize PTX for.

nvcc embeds a compiled code image in the resulting executable for
each specified

 architecture, which is a true binary load image for each
'real' architecture

(such as sm_50), and PTX code for the 'virtual' architecture (such
as compute_50).

During runtime, such embedded PTX 

Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Satish Balay via petsc-dev
This is an nvidia mess-up. Why isn't there a command that gives me these values
[if they insist on this interface for nvcc]?

I see Barry wants configure to do something here - but whatever we do - we would
be shifting the problem around.
[even if we detect stuff - the build box might not have the GPU used for runs.]

We have --with-cuda-arch - which I tried to remove from configure - but it's
come back in a different form (--with-cuda-gencodearch)

And I see other packages:

  --with-kokkos-cuda-arch

Wrt spack - I'm having to do:

spack install xsdk+cuda ^magma cuda_arch=60

[magma uses CudaPackage() infrastructure in spack]

Satish

On Mon, 5 Apr 2021, Mills, Richard Tran via petsc-dev wrote:

> You raise a good point, Barry. I've been completely mystified by what some of 
> these names even mean. What does "PASCAL60" vs. "PASCAL61" even mean? Do you 
> know where this is even documented? I can't really find anything about it 
> in the Kokkos documentation. The only thing I can really find is an issue or 
> two about "hey, shouldn't our CMake stuff figure this out automatically" and 
> then some posts about why it can't really do that. Not encouraging.
> 
> --Richard
> 
> On 4/3/21 8:42 PM, Barry Smith wrote:
> 
> 
>   It would be very nice to NOT require PETSc users to provide this flag; how
> the heck will they know what it should be when we cannot automate it
> ourselves?
>
>   Any ideas of how this can be determined based on the current system? NVIDIA
> does not help, since these "advertising" names don't seem to trivially map to
> information you can get from a particular GPU when you are logged into it. For
> example, nvidia-smi doesn't use these names directly. Is there some mapping
> from nvidia-smi to these names we could use? If we are serious about having a
> non-trivial number of users utilizing GPUs, which we need to be for the
> future, we cannot have these absurd demands in our installation process.
> 
>   Barry
> 
> Does spack have some magic for this we could use?
> 
> 
> 
> 



Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Mills, Richard Tran via petsc-dev
Hmm, OK, I found a table at

  https://sparta.sandia.gov/doc/accelerate_kokkos.html

and it tells me that "PASCAL60" refers to "NVIDIA Pascal generation CC 6.0 GPU" 
and "PASCAL61" refers to "NVIDIA Pascal generation CC 6.1 GPU". But I have no 
idea what those 6.0 vs 6.1 version numbers mean, and I can't seem to easily 
find any information from NVIDIA that connects anything in the output of 
"nvidia-smi -a" to these versions.

I think maybe what I want is an NVIDIA equivalent to Intel's ark.intel.com, 
which decodes the mysterious Intel version numbers to tell me what 
architectural features are present. But does anything like this exist for 
NVIDIA?

--Richard



On 4/5/21 1:10 PM, Mills, Richard Tran wrote:
You raise a good point, Barry. I've been completely mystified by what some of 
these names even mean. What does "PASCAL60" vs. "PASCAL61" even mean? Do you 
know where this is even documented? I can't really find anything about it in 
the Kokkos documentation. The only thing I can really find is an issue or two 
about "hey, shouldn't our CMake stuff figure this out automatically" and then 
some posts about why it can't really do that. Not encouraging.

--Richard

On 4/3/21 8:42 PM, Barry Smith wrote:

  It would be very nice to NOT require PETSc users to provide this flag; how
the heck will they know what it should be when we cannot automate it ourselves?

  Any ideas of how this can be determined based on the current system? NVIDIA
does not help, since these "advertising" names don't seem to trivially map to
information you can get from a particular GPU when you are logged into it. For
example, nvidia-smi doesn't use these names directly. Is there some mapping from
nvidia-smi to these names we could use? If we are serious about having a
non-trivial number of users utilizing GPUs, which we need to be for the future,
we cannot have these absurd demands in our installation process.

  Barry

Does spack have some magic for this we could use?






Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Mills, Richard Tran via petsc-dev
You raise a good point, Barry. I've been completely mystified by what some of 
these names even mean. What does "PASCAL60" vs. "PASCAL61" even mean? Do you 
know where this is even documented? I can't really find anything about it in 
the Kokkos documentation. The only thing I can really find is an issue or two 
about "hey, shouldn't our CMake stuff figure this out automatically" and then 
some posts about why it can't really do that. Not encouraging.

--Richard

On 4/3/21 8:42 PM, Barry Smith wrote:


  It would be very nice to NOT require PETSc users to provide this flag; how
the heck will they know what it should be when we cannot automate it ourselves?

  Any ideas of how this can be determined based on the current system? NVIDIA
does not help, since these "advertising" names don't seem to trivially map to
information you can get from a particular GPU when you are logged into it. For
example, nvidia-smi doesn't use these names directly. Is there some mapping from
nvidia-smi to these names we could use? If we are serious about having a
non-trivial number of users utilizing GPUs, which we need to be for the future,
we cannot have these absurd demands in our installation process.

  Barry

Does spack have some magic for this we could use?