Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Em 07-07-2017 18:44, Brice Goglin escreveu:
> Does "ldd /path/to/libhwloc.so" say that it depends on libcuda?
> Check the end of the configure output; you should see "CUDA" on the
> "Probe / display I/O devices" line:
> [...]

Hi David,

Try exporting

  LDFLAGS='-L/usr/local/cuda-8.0/lib64 -L/usr/lib64/nvidia'
  CPPFLAGS='-I/usr/local/cuda-8.0/include'

before running './configure'.

HTH,
Fabricio
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users
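Fabricio's suggestion amounts to rebuilding hwloc with the CUDA headers and libraries visible to configure. A minimal sketch of the rebuild, assuming a default CUDA 8.0 install under /usr/local/cuda-8.0 (paths vary by system and CUDA version):

```shell
# Rebuild hwloc so configure can detect CUDA.
# Paths are examples for a default CUDA 8.0 install; adjust to your system.
export CPPFLAGS='-I/usr/local/cuda-8.0/include'
export LDFLAGS='-L/usr/local/cuda-8.0/lib64 -L/usr/lib64/nvidia'
./configure
make
make install
# The configure summary should now show CUDA on the
# "Probe / display I/O devices" line.
```

After this, `ldd` on the resulting libhwloc.so should list libcuda among its dependencies.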
Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Does "ldd /path/to/libhwloc.so" say that it depends on libcuda? Check the end of the configure output, you should see "CUDA" on the "probe/display I/O devices" line: - Hwloc optional build support status (more details can be found above): Probe / display I/O devices: PCI(pciaccess+linux) LinuxIO OpenCL CUDA NVML GL Graphical output (Cairo):yes XML input / output: full Netloc functionality:yes (with scotch: no) Plugin support: no - If not, go above and look for CUDA checks: checking for cuda.h... yes checking if CUDA_VERSION >= 3020... yes checking for cuInit in -lcuda... yes checking cuda_runtime_api.h usability... yes checking cuda_runtime_api.h presence... yes checking for cuda_runtime_api.h... yes checking if CUDART_VERSION >= 3020... yes checking for cudaGetDeviceProperties in -lcudart... yes Brice Le 07/07/2017 23:36, David Solt a écrit : > Ok here we have a machine with cuda installed: > > # rpm -qa | grep cuda > cuda-nvgraph-8-0-8.0.54-1.ppc64le > cuda-curand-dev-8-0-8.0.54-1.ppc64le > cuda-nvrtc-dev-8-0-8.0.54-1.ppc64le > cuda-cudart-8-0-8.0.54-1.ppc64le > cuda-cusolver-dev-8-0-8.0.54-1.ppc64le > cuda-8.0.54-1.ppc64le > cuda-driver-dev-8-0-8.0.54-1.ppc64le > cuda-core-8-0-8.0.54-1.ppc64le > etc, etc, etc > > #lspci | grep NVIDIA > > [root@c712f6n06 ~]# lspci | grep NVI > 0002:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1) > 0003:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1) > 0006:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1) > 0007:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1) > > But the only devices returned by hwloc are named "cardX" (same as what > lstopo shows) and have osdev.type of HWLOC_OBJ_OSDEV_GPU and we see no > devices of type HWLOC_OBJ_OSDEV_COPROC > > Sorry, I'm sure I'm doing something stupid here... I'm certainly new > to using hwloc. 
> > Dave > > > > From:Brice Goglin > To:Hardware locality user list > Date:07/07/2017 02:54 PM > Subject:Re: [hwloc-users] Why do I get such little information > back about GPU's on my system > Sent by:"hwloc-users" > > > > > Le 07/07/2017 21:51, David Solt a écrit : > Oh, Geoff Paulsen will be there at Open MPI meeting and he can help > with the discussion. We tried searching for > > // Iterate over each osdevice and identify the GPU's on each socket. > while ((obj = hwloc_get_next_osdev(machine_topology, obj)) != NULL) { > if (HWLOC_OBJ_OSDEV_COPROC == obj->attr->osdev.type) { // was > HWLOC_OBJ_OSDEV_GPU > > Currently we are not finding any such devices. Does this require > that the cuda libraries be installed on the system for hwloc to find > the hardware? > > Yes. We use the cuda API for listing these objects. > Brice > ___ > hwloc-users mailing list > hwloc-users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users > > > > > ___ > hwloc-users mailing list > hwloc-users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users ___ hwloc-users mailing list hwloc-users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users
Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Ok, here we have a machine with CUDA installed:

  # rpm -qa | grep cuda
  cuda-nvgraph-8-0-8.0.54-1.ppc64le
  cuda-curand-dev-8-0-8.0.54-1.ppc64le
  cuda-nvrtc-dev-8-0-8.0.54-1.ppc64le
  cuda-cudart-8-0-8.0.54-1.ppc64le
  cuda-cusolver-dev-8-0-8.0.54-1.ppc64le
  cuda-8.0.54-1.ppc64le
  cuda-driver-dev-8-0-8.0.54-1.ppc64le
  cuda-core-8-0-8.0.54-1.ppc64le
  etc, etc, etc

  [root@c712f6n06 ~]# lspci | grep NVI
  0002:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
  0003:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
  0006:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)
  0007:01:00.0 3D controller: NVIDIA Corporation GP100GL (rev a1)

But the only devices returned by hwloc are named "cardX" (same as what lstopo shows) and have osdev.type of HWLOC_OBJ_OSDEV_GPU, and we see no devices of type HWLOC_OBJ_OSDEV_COPROC.

Sorry, I'm sure I'm doing something stupid here... I'm certainly new to using hwloc.

Dave
Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Le 07/07/2017 21:51, David Solt a écrit :
> Oh, Geoff Paulsen will be there at Open MPI meeting and he can help
> with the discussion. We tried searching for
>
>   // Iterate over each osdevice and identify the GPU's on each socket.
>   while ((obj = hwloc_get_next_osdev(machine_topology, obj)) != NULL) {
>     if (HWLOC_OBJ_OSDEV_COPROC == obj->attr->osdev.type) { // was HWLOC_OBJ_OSDEV_GPU
>
> Currently we are not finding any such devices. Does this require
> that the cuda libraries be installed on the system for hwloc to find
> the hardware?

Yes. We use the CUDA API for listing these objects.

Brice
Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Oh, Geoff Paulsen will be there at the Open MPI meeting and he can help with the discussion. We tried searching for

  // Iterate over each osdevice and identify the GPUs on each socket.
  while ((obj = hwloc_get_next_osdev(machine_topology, obj)) != NULL) {
    if (HWLOC_OBJ_OSDEV_COPROC == obj->attr->osdev.type) { // was HWLOC_OBJ_OSDEV_GPU

Currently we are not finding any such devices. Does this require that the CUDA libraries be installed on the system for hwloc to find the hardware?

Thanks,
Dave
Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Thank you so much! We are going to try this out and I'll let you know what we find. Yes, I think we got distracted by GPU when CUDA is what we want. I'll give you a status of how it went soon.

Dave
Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Le 07/07/2017 20:38, David Solt a écrit :
> We are using the hwloc api to identify GPUs on our cluster. While we
> are able to "discover" the GPUs, other information about them does not
> appear to be getting filled in. See below for example:
>
> (gdb) p *obj->attr
> $20 = {
>   [...]
>   pcidev = {
>     [...]
>     vendor_id = 0,
>     device_id = 0,
>     [...]
>   },
>   [...]
>   osdev = {
>     type = HWLOC_OBJ_OSDEV_GPU
>   }
> }
>
> The name is generally just "cardX".

Hello,

attr is a union, so only the "osdev" portion above matters. "osdev" can be a lot of different things, so instead of having all possible attributes in a struct, we use info key/value pairs (hwloc_obj->infos). But those "cardX" devices are the GPUs reported by the Linux kernel DRM subsystem; we don't have much information about them anyway.

If you're looking at a Power machine, I am going to assume you care about CUDA devices. Those are "osdev" objects of type "COPROC" instead of "GPU". They have many more attributes. Here's what I see on one of our machines:

  PCI 10de:1094 (P#540672 busid=:84:00.0 class=0302(3D) PCIVendor="NVIDIA Corporation" PCIDevice="Tesla M2075 Dual-Slot Computing Processor Module") "NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module"
    Co-Processor L#5 (CoProcType=CUDA Backend=CUDA GPUVendor="NVIDIA Corporation" GPUModel="Tesla M2075" CUDAGlobalMemorySize=5428224 CUDAL2CacheSize=768 CUDAMultiProcessors=14 CUDACoresPerMP=32 CUDASharedMemorySizePerMP=48) "cuda2"

On recent kernels, you would see both a "cardX" GPU osdev and a "cudaX" COPROC osdev in the PCI device. There can even be "nvmlX" and ":0.0" if you have the nvml and nvctrl libraries. Those are basically different ways to talk to the GPU (Linux kernel DRM, CUDA, etc.). Given that I have never seen anybody use "cardX" for placing tasks/data near a GPU, I am wondering if we should disable those by default. Or maybe rename "GPU" into something that wouldn't attract people as much, maybe "DRM".

> Does this mean that the cards are not configured correctly? Or is
> there an additional flag that needs to be set to get this information?

Make sure "cuda" appears in the summary at the end of configure.

> Currently the code does:
>
>   hwloc_topology_init(&machine_topology);
>   hwloc_topology_set_flags(machine_topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
>   hwloc_topology_load(machine_topology);
>
> And this is enough to identify the CPUs and GPUs, but any additional
> information - particularly the device and vendor ids - seems to not be
> there.

BTW, it looks like you're not going to the OMPI dev meeting next week. I'll be there if one of your colleagues wants to discuss this face to face.

Brice
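Putting Brice's advice together, the enumeration he describes can be sketched against the hwloc 1.x API. This is only an illustrative sketch, not code from the thread: it assumes an hwloc built with CUDA support, and the `GPUModel` info key shown in Brice's lstopo output above. Output depends entirely on the machine, so none is shown.

```c
/* Sketch: list CUDA co-processor osdevs and read their info pairs.
 * Assumes hwloc 1.x built with CUDA support. Compile with e.g.:
 *   cc coprocs.c $(pkg-config --cflags --libs hwloc)
 */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t obj = NULL;

    hwloc_topology_init(&topo);
    /* Without this flag, no I/O objects (PCI devices, osdevs) are reported. */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topo);

    while ((obj = hwloc_get_next_osdev(topo, obj)) != NULL) {
        /* Skip the kernel DRM "cardX" GPU osdevs; CUDA devices are COPROC. */
        if (obj->attr->osdev.type != HWLOC_OBJ_OSDEV_COPROC)
            continue;

        /* osdev details live in info key/value pairs, not in the attr union. */
        const char *model = hwloc_obj_get_info_by_name(obj, "GPUModel");
        printf("%s: GPUModel=%s\n", obj->name, model ? model : "(unknown)");

        /* Vendor/device ids are on an ancestor PCI object, if one exists. */
        hwloc_obj_t p = obj->parent;
        while (p && p->type != HWLOC_OBJ_PCI_DEVICE)
            p = p->parent;
        if (p)
            printf("  PCI %04x:%04x\n",
                   p->attr->pcidev.vendor_id, p->attr->pcidev.device_id);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```

On a machine without CUDA libraries (or with an hwloc built without CUDA), the loop simply finds no COPROC osdevs, which matches the behavior David reports.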
[hwloc-users] Why do I get such little information back about GPU's on my system
We are using the hwloc api to identify GPUs on our cluster. While we are able to "discover" the GPUs, other information about them does not appear to be getting filled in. See below for example:

(gdb) p *obj->attr
$20 = {
  cache = {
    size = 1,
    depth = 0,
    linesize = 0,
    associativity = 0,
    type = HWLOC_OBJ_CACHE_UNIFIED
  },
  group = {
    depth = 1
  },
  pcidev = {
    domain = 1,
    bus = 0 '\000',
    dev = 0 '\000',
    func = 0 '\000',
    class_id = 0,
    vendor_id = 0,
    device_id = 0,
    subvendor_id = 0,
    subdevice_id = 0,
    revision = 0 '\000',
    linkspeed = 0
  },
  bridge = {
    upstream = {
      pci = {
        domain = 1,
        bus = 0 '\000',
        dev = 0 '\000',
        func = 0 '\000',
        class_id = 0,
        vendor_id = 0,
        device_id = 0,
        subvendor_id = 0,
        subdevice_id = 0,
        revision = 0 '\000',
        linkspeed = 0
      }
    },
    upstream_type = HWLOC_OBJ_BRIDGE_HOST,
    downstream = {
      pci = {
        domain = 0,
        secondary_bus = 0 '\000',
        subordinate_bus = 0 '\000'
      }
    },
    downstream_type = HWLOC_OBJ_BRIDGE_HOST,
    depth = 0
  },
  osdev = {
    type = HWLOC_OBJ_OSDEV_GPU
  }
}

The name is generally just "cardX". Does this mean that the cards are not configured correctly? Or is there an additional flag that needs to be set to get this information?

Currently the code does:

  hwloc_topology_init(&machine_topology);
  hwloc_topology_set_flags(machine_topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
  hwloc_topology_load(machine_topology);

And this is enough to identify the CPUs and GPUs, but any additional information - particularly the device and vendor ids - seems to not be there. I tried this with the most recent release (1.11.7) and saw the same results. We tried this on a variety of PowerPC machines and I think even some x86_64 machines with similar results.

Thoughts?
Dave
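A note on the gdb dump above: the zeroed vendor_id/device_id are expected for an osdev, because attr is a union and the pcidev fields are only meaningful on objects of type HWLOC_OBJ_PCI_DEVICE (as Brice explains in his reply). A sketch against the hwloc 1.x API of reading the ids from the right object type (illustrative only; output depends on the machine):

```c
/* Sketch (hwloc 1.x): vendor/device ids are only valid on objects of
 * type HWLOC_OBJ_PCI_DEVICE, not on osdevs (attr is a union). */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t pci = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topo);

    /* Iterate over PCI device objects; here attr->pcidev is the active
     * union member, so the ids are filled in. */
    while ((pci = hwloc_get_next_pcidev(topo, pci)) != NULL)
        printf("%04x:%04x class=0x%04x\n",
               pci->attr->pcidev.vendor_id,
               pci->attr->pcidev.device_id,
               pci->attr->pcidev.class_id);

    hwloc_topology_destroy(topo);
    return 0;
}
```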