Re: [hwloc-users] Why do I get such little information back about GPU's on my system

2017-07-07 Thread fabricio

On 07-07-2017 18:44, Brice Goglin wrote:

Does "ldd /path/to/libhwloc.so" say that it depends on libcuda?

Check the end of the configure output, you should see "CUDA" on the
"probe/display I/O devices" line:

-
Hwloc optional build support status (more details can be found above):

Probe / display I/O devices: PCI(pciaccess+linux) LinuxIO OpenCL CUDA NVML GL
Graphical output (Cairo):    yes
XML input / output:          full
Netloc functionality:        yes (with scotch: no)
Plugin support:              no
-

If not, go above and look for CUDA checks:

checking for cuda.h... yes
checking if CUDA_VERSION >= 3020... yes
checking for cuInit in -lcuda... yes
checking cuda_runtime_api.h usability... yes
checking cuda_runtime_api.h presence... yes
checking for cuda_runtime_api.h... yes
checking if CUDART_VERSION >= 3020... yes
checking for cudaGetDeviceProperties in -lcudart... yes

Brice


Hi David

Try exporting LDFLAGS='-L/usr/local/cuda-8.0/lib64 -L/usr/lib64/nvidia' 
CPPFLAGS='-I/usr/local/cuda-8.0/include' before running './configure'



HTH,
Fabricio
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users


Re: [hwloc-users] Why do I get such little information back about GPU's on my system

2017-07-07 Thread Brice Goglin
Does "ldd /path/to/libhwloc.so" say that it depends on libcuda?

Check the end of the configure output, you should see "CUDA" on the
"probe/display I/O devices" line:

-
Hwloc optional build support status (more details can be found above):

Probe / display I/O devices: PCI(pciaccess+linux) LinuxIO OpenCL CUDA NVML GL
Graphical output (Cairo):    yes
XML input / output:          full
Netloc functionality:        yes (with scotch: no)
Plugin support:              no
-

If not, go above and look for CUDA checks:

checking for cuda.h... yes
checking if CUDA_VERSION >= 3020... yes
checking for cuInit in -lcuda... yes
checking cuda_runtime_api.h usability... yes
checking cuda_runtime_api.h presence... yes
checking for cuda_runtime_api.h... yes
checking if CUDART_VERSION >= 3020... yes
checking for cudaGetDeviceProperties in -lcudart... yes

Brice




On 07/07/2017 23:36, David Solt wrote:
> Ok here we have a machine with cuda installed:
>
> # rpm -qa | grep cuda
> cuda-nvgraph-8-0-8.0.54-1.ppc64le
> cuda-curand-dev-8-0-8.0.54-1.ppc64le
> cuda-nvrtc-dev-8-0-8.0.54-1.ppc64le
> cuda-cudart-8-0-8.0.54-1.ppc64le
> cuda-cusolver-dev-8-0-8.0.54-1.ppc64le
> cuda-8.0.54-1.ppc64le
> cuda-driver-dev-8-0-8.0.54-1.ppc64le
> cuda-core-8-0-8.0.54-1.ppc64le
> etc, etc, etc
>
> #lspci | grep NVIDIA
>
> [root@c712f6n06 ~]# lspci | grep NVI
> 0002:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
> 0003:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
> 0006:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
> 0007:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
>
> But the only devices returned by hwloc are named "cardX" (same as what
> lstopo shows) and have osdev.type of HWLOC_OBJ_OSDEV_GPU, and we see no
> devices of type HWLOC_OBJ_OSDEV_COPROC.
>
> Sorry, I'm sure I'm doing something stupid here... I'm certainly new
> to using hwloc.
>
> Dave
>
>
>
> From: Brice Goglin <brice.gog...@inria.fr>
> To: Hardware locality user list <hwloc-users@lists.open-mpi.org>
> Date: 07/07/2017 02:54 PM
> Subject: Re: [hwloc-users] Why do I get such little information
> back about GPU's on my system
> Sent by: "hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>
> 
>
>
>
> On 07/07/2017 21:51, David Solt wrote:
> Oh, Geoff Paulsen will be there at the Open MPI meeting and he can help
> with the discussion. We tried searching for
>
>   // Iterate over each osdevice and identify the GPUs on each socket.
>   while ((obj = hwloc_get_next_osdev(machine_topology, obj)) != NULL) {
>     if (HWLOC_OBJ_OSDEV_COPROC == obj->attr->osdev.type) {   // was HWLOC_OBJ_OSDEV_GPU
>
> Currently we are not finding any such devices.   Does this require
> that the cuda libraries be installed on the system for hwloc to find
> the hardware?
>
> Yes. We use the cuda API for listing these objects.
> Brice


Re: [hwloc-users] Why do I get such little information back about GPU's on my system

2017-07-07 Thread David Solt
Ok here we have a machine with cuda installed:

# rpm -qa | grep cuda
cuda-nvgraph-8-0-8.0.54-1.ppc64le
cuda-curand-dev-8-0-8.0.54-1.ppc64le
cuda-nvrtc-dev-8-0-8.0.54-1.ppc64le
cuda-cudart-8-0-8.0.54-1.ppc64le
cuda-cusolver-dev-8-0-8.0.54-1.ppc64le
cuda-8.0.54-1.ppc64le
cuda-driver-dev-8-0-8.0.54-1.ppc64le
cuda-core-8-0-8.0.54-1.ppc64le
etc, etc, etc

#lspci | grep NVIDIA

[root@c712f6n06 ~]# lspci | grep NVI
0002:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
0003:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
0006:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
0007:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)

But the only devices returned by hwloc are named "cardX" (same as what
lstopo shows) and have osdev.type of HWLOC_OBJ_OSDEV_GPU, and we see no
devices of type HWLOC_OBJ_OSDEV_COPROC.

Sorry, I'm sure I'm doing something stupid here... I'm certainly new to 
using hwloc.

Dave



From:   Brice Goglin <brice.gog...@inria.fr>
To: Hardware locality user list <hwloc-users@lists.open-mpi.org>
Date:   07/07/2017 02:54 PM
Subject:        Re: [hwloc-users] Why do I get such little information 
back about GPU's on my system
Sent by:"hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>



On 07/07/2017 21:51, David Solt wrote:
Oh, Geoff Paulsen will be there at the Open MPI meeting and he can help
with the discussion. We tried searching for

  // Iterate over each osdevice and identify the GPUs on each socket.
  while ((obj = hwloc_get_next_osdev(machine_topology, obj)) != NULL) {
    if (HWLOC_OBJ_OSDEV_COPROC == obj->attr->osdev.type) {   // was HWLOC_OBJ_OSDEV_GPU

Currently we are not finding any such devices.   Does this require that 
the cuda libraries be installed on the system for hwloc to find the 
hardware?

Yes. We use the cuda API for listing these objects.
Brice

Re: [hwloc-users] Why do I get such little information back about GPU's on my system

2017-07-07 Thread David Solt
Oh, Geoff Paulsen will be there at the Open MPI meeting and he can help
with the discussion. We tried searching for

  // Iterate over each osdevice and identify the GPUs on each socket.
  while ((obj = hwloc_get_next_osdev(machine_topology, obj)) != NULL) {
    if (HWLOC_OBJ_OSDEV_COPROC == obj->attr->osdev.type) {   // was HWLOC_OBJ_OSDEV_GPU

Currently we are not finding any such devices.   Does this require that 
the cuda libraries be installed on the system for hwloc to find the 
hardware?

Thanks,
Dave



From:   "David Solt" <ds...@us.ibm.com>
To: Hardware locality user list <hwloc-users@lists.open-mpi.org>
Date:   07/07/2017 02:21 PM
Subject:        Re: [hwloc-users] Why do I get such little information 
back about GPU's on my system
Sent by:"hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>



Thank you so much! We are going to try this out and I'll let you know
what we find. Yes, I think we got distracted by GPU when CUDA is what we
want. I'll give you a status of how it went soon.

Dave



From: Brice Goglin <brice.gog...@inria.fr>
To: hwloc-users@lists.open-mpi.org
Date: 07/07/2017 02:02 PM
Subject: Re: [hwloc-users] Why do I get such little information
back about GPU's on my system
Sent by: "hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>



On 07/07/2017 20:38, David Solt wrote:
We are using the hwloc API to identify GPUs on our cluster. While we are
able to "discover" the GPUs, other information about them does not appear
to be getting filled in. See below for an example:

(gdb) p *obj->attr
$20 = {
  cache = {
size = 1, 
depth = 0, 
linesize = 0, 
associativity = 0, 
type = HWLOC_OBJ_CACHE_UNIFIED
  }, 
  group = {
depth = 1
  }, 
  pcidev = {
domain = 1, 
bus = 0 '\000', 
dev = 0 '\000', 
func = 0 '\000', 
class_id = 0, 
vendor_id = 0,
device_id = 0, 
subvendor_id = 0, 
subdevice_id = 0, 
revision = 0 '\000', 
linkspeed = 0
  }, 
  bridge = {
upstream = {
  pci = {
domain = 1, 
bus = 0 '\000', 
dev = 0 '\000', 
func = 0 '\000', 
class_id = 0, 
vendor_id = 0, 
device_id = 0, 
subvendor_id = 0, 
subdevice_id = 0, 
revision = 0 '\000', 
linkspeed = 0
  }
}, 
upstream_type = HWLOC_OBJ_BRIDGE_HOST, 
downstream = {
  pci = {
domain = 0, 
secondary_bus = 0 '\000', 
subordinate_bus = 0 '\000'
  }
}, 
downstream_type = HWLOC_OBJ_BRIDGE_HOST, 
depth = 0
  }, 
  osdev = {
type = HWLOC_OBJ_OSDEV_GPU
  }
}

The name is generally just "cardX". 


Hello

attr is a union, so only the "osdev" portion above matters. "osdev" can be
a lot of different things, so instead of having all possible attributes in
a struct, we use info key/value pairs (hwloc_obj->infos). But those
"cardX" devices are the GPUs reported by the Linux kernel DRM subsystem;
we don't have much information about them anyway.

If you're looking at a Power machine, I am going to assume you care about
CUDA devices. Those are "osdev" objects of type "COPROC" instead of "GPU".
They have many more attributes. Here's what I see on one of our machines:
  PCI 10de:1094 (P#540672 busid=:84:00.0 class=0302(3D) 
PCIVendor="NVIDIA Corporation" PCIDevice="Tesla M2075 Dual-Slot Computing 
Processor Module") "NVIDIA Corporation Tesla M2075 Dual-Slot Computing 
Processor Module"
Co-Processor L#5 (CoProcType=CUDA Backend=CUDA GPUVendor="NVIDIA 
Corporation" GPUModel="Tesla M2075" CUDAGlobalMemorySize=5428224 
CUDAL2CacheSize=768 CUDAMultiProcessors=14 CUDACoresPerMP=32 
CUDASharedMemorySizePerMP=48) "cuda2"

On recent kernels, you would see both a "cardX" GPU osdev and a "cudaX"
COPROC osdev in the PCI device. There can even be "nvmlX" and ":0.0"
osdevs if you have the nvml and nvctrl libraries. Those are basically
different ways to talk to the GPU (Linux kernel DRM, CUDA, etc.).

Given that I have never seen anybody use "cardX" for placing task/data 
near a GPU, I am wondering if we should disable those by default. Or maybe 
rename "GPU" into something that wouldn't attract people as much, maybe 
"DRM".

Does this mean that the cards are not configured correctly? Or is there an 
additional flag that needs to be set to get this information?


Make sure "cuda" appears in the summary at the end of the configure.

Currently the code does:

  hwloc_topology_init(&machine_topology);
  hwloc_topology_set_flags(machine_topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
  hwloc_topology_load(machine_topology);

And this is enough to identify the CPUs and GPUs, but any additional
information - particularly the device and vendor IDs - seems to be missing.

Re: [hwloc-users] Why do I get such little information back about GPU's on my system

2017-07-07 Thread David Solt
Thank you so much! We are going to try this out and I'll let you know
what we find. Yes, I think we got distracted by GPU when CUDA is what we
want. I'll give you a status of how it went soon.

Dave



From:   Brice Goglin <brice.gog...@inria.fr>
To: hwloc-users@lists.open-mpi.org
Date:   07/07/2017 02:02 PM
Subject:        Re: [hwloc-users] Why do I get such little information 
back about GPU's on my system
Sent by:"hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>



On 07/07/2017 20:38, David Solt wrote:
We are using the hwloc API to identify GPUs on our cluster. While we are
able to "discover" the GPUs, other information about them does not appear
to be getting filled in. See below for an example:

(gdb) p *obj->attr
$20 = {
  cache = {
size = 1, 
depth = 0, 
linesize = 0, 
associativity = 0, 
type = HWLOC_OBJ_CACHE_UNIFIED
  }, 
  group = {
depth = 1
  }, 
  pcidev = {
domain = 1, 
bus = 0 '\000', 
dev = 0 '\000', 
func = 0 '\000', 
class_id = 0, 
vendor_id = 0,
device_id = 0, 
subvendor_id = 0, 
subdevice_id = 0, 
revision = 0 '\000', 
linkspeed = 0
  }, 
  bridge = {
upstream = {
  pci = {
domain = 1, 
bus = 0 '\000', 
dev = 0 '\000', 
func = 0 '\000', 
class_id = 0, 
vendor_id = 0, 
device_id = 0, 
subvendor_id = 0, 
subdevice_id = 0, 
revision = 0 '\000', 
linkspeed = 0
  }
}, 
upstream_type = HWLOC_OBJ_BRIDGE_HOST, 
downstream = {
  pci = {
domain = 0, 
secondary_bus = 0 '\000', 
subordinate_bus = 0 '\000'
  }
}, 
downstream_type = HWLOC_OBJ_BRIDGE_HOST, 
depth = 0
  }, 
  osdev = {
type = HWLOC_OBJ_OSDEV_GPU
  }
}

The name is generally just "cardX". 


Hello

attr is a union, so only the "osdev" portion above matters. "osdev" can be
a lot of different things, so instead of having all possible attributes in
a struct, we use info key/value pairs (hwloc_obj->infos). But those
"cardX" devices are the GPUs reported by the Linux kernel DRM subsystem;
we don't have much information about them anyway.

If you're looking at a Power machine, I am going to assume you care about
CUDA devices. Those are "osdev" objects of type "COPROC" instead of "GPU".
They have many more attributes. Here's what I see on one of our machines:
  PCI 10de:1094 (P#540672 busid=:84:00.0 class=0302(3D) 
PCIVendor="NVIDIA Corporation" PCIDevice="Tesla M2075 Dual-Slot Computing 
Processor Module") "NVIDIA Corporation Tesla M2075 Dual-Slot Computing 
Processor Module"
Co-Processor L#5 (CoProcType=CUDA Backend=CUDA GPUVendor="NVIDIA 
Corporation" GPUModel="Tesla M2075" CUDAGlobalMemorySize=5428224 
CUDAL2CacheSize=768 CUDAMultiProcessors=14 CUDACoresPerMP=32 
CUDASharedMemorySizePerMP=48) "cuda2"

On recent kernels, you would see both a "cardX" GPU osdev and a "cudaX"
COPROC osdev in the PCI device. There can even be "nvmlX" and ":0.0"
osdevs if you have the nvml and nvctrl libraries. Those are basically
different ways to talk to the GPU (Linux kernel DRM, CUDA, etc.).

Given that I have never seen anybody use "cardX" for placing task/data 
near a GPU, I am wondering if we should disable those by default. Or maybe 
rename "GPU" into something that wouldn't attract people as much, maybe 
"DRM".

Does this mean that the cards are not configured correctly? Or is there an 
additional flag that needs to be set to get this information?


Make sure "cuda" appears in the summary at the end of the configure.

Currently the code does:

  hwloc_topology_init(&machine_topology);
  hwloc_topology_set_flags(machine_topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
  hwloc_topology_load(machine_topology);

And this is enough to identify the CPUs and GPUs, but any additional
information - particularly the device and vendor IDs - seems to be missing.

I tried this with the most recent release (1.11.7) and saw the same 
results.   

We tried this on a variety of PowerPC machines and I think even some 
x86_64 machines with similar results.   

Thoughts?
Dave

BTW, it looks like you're not going to the OMPI dev meeting next week.
I'll be there if one of your colleagues wants to discuss this face to face.

Brice

Re: [hwloc-users] Why do I get such little information back about GPU's on my system

2017-07-07 Thread Brice Goglin
On 07/07/2017 20:38, David Solt wrote:
> We are using the hwloc API to identify GPUs on our cluster. While we
> are able to "discover" the GPUs, other information about them does not
> appear to be getting filled in. See below for an example:

> (gdb) p *obj->attr
> $20 = {
>   cache = {
> size = 1,
> depth = 0,
> linesize = 0,
> associativity = 0,
> type = HWLOC_OBJ_CACHE_UNIFIED
>   },
>   group = {
> depth = 1
>   },
>   pcidev = {
> domain = 1,
> bus = 0 '\000',
> dev = 0 '\000',
> func = 0 '\000',
> class_id = 0,
> *vendor_id = 0,*
> *device_id = 0,*
> subvendor_id = 0,
> subdevice_id = 0,
> revision = 0 '\000',
> linkspeed = 0
>   },
>   bridge = {
> upstream = {
>   pci = {
> domain = 1,
> bus = 0 '\000',
> dev = 0 '\000',
> func = 0 '\000',
> class_id = 0,
> vendor_id = 0,
> device_id = 0,
> subvendor_id = 0,
> subdevice_id = 0,
> revision = 0 '\000',
> linkspeed = 0
>   }
> },
> upstream_type = HWLOC_OBJ_BRIDGE_HOST,
> downstream = {
>   pci = {
> domain = 0,
> secondary_bus = 0 '\000',
> subordinate_bus = 0 '\000'
>   }
> },
> downstream_type = HWLOC_OBJ_BRIDGE_HOST,
> depth = 0
>   },
>   osdev = {
> type = *HWLOC_OBJ_OSDEV_GPU*
>   }
> }

> The name is generally just "cardX". 


Hello

attr is a union, so only the "osdev" portion above matters. "osdev" can
be a lot of different things, so instead of having all possible
attributes in a struct, we use info key/value pairs (hwloc_obj->infos).
But those "cardX" devices are the GPUs reported by the Linux kernel DRM
subsystem; we don't have much information about them anyway.

If you're looking at a Power machine, I am going to assume you care about
CUDA devices. Those are "osdev" objects of type "COPROC" instead of
"GPU". They have many more attributes. Here's what I see on one of our
machines:

  PCI 10de:1094 (P#540672 busid=:84:00.0 class=0302(3D) PCIVendor="NVIDIA 
Corporation" PCIDevice="Tesla M2075 Dual-Slot Computing Processor Module") 
"NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module"
Co-Processor L#5 (CoProcType=CUDA Backend=CUDA GPUVendor="NVIDIA 
Corporation" GPUModel="Tesla M2075" CUDAGlobalMemorySize=5428224 
CUDAL2CacheSize=768 CUDAMultiProcessors=14 CUDACoresPerMP=32 
CUDASharedMemorySizePerMP=48) "cuda2"


On recent kernels, you would see both a "cardX" GPU osdev and a "cudaX"
COPROC osdev in the PCI device. There can even be "nvmlX" and ":0.0"
osdevs if you have the nvml and nvctrl libraries. Those are basically
different ways to talk to the GPU (Linux kernel DRM, CUDA, etc.).

Given that I have never seen anybody use "cardX" for placing task/data
near a GPU, I am wondering if we should disable those by default. Or
maybe rename "GPU" into something that wouldn't attract people as much,
maybe "DRM".

> Does this mean that the cards are not configured correctly? Or is
> there an additional flag that needs to be set to get this information?


Make sure "cuda" appears in the summary at the end of the configure.

> Currently the code does:

>   hwloc_topology_init(&machine_topology);
>   hwloc_topology_set_flags(machine_topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
>   hwloc_topology_load(machine_topology);

> And this is enough to identify the CPUs and GPUs, but any additional
> information - particularly the device and vendor IDs - seems to be missing.

> I tried this with the most recent release (1.11.7) and saw the same
> results.   

> We tried this on a variety of PowerPC machines and I think even some
> x86_64 machines with similar results.   

> Thoughts?
> Dave

BTW, it looks like you're not going to the OMPI dev meeting next week.
I'll be there if one of your colleagues wants to discuss this face to face.

Brice
