[gcc r14-9844] GCN, nvptx: Errors during device probing are fatal

Thomas Schwinge via Gcc-cvs Mon, 08 Apr 2024 13:09:05 -0700

https://gcc.gnu.org/g:a02d7f0edc47495ffe456af7ab7718896e0a0c25


commit r14-9844-ga02d7f0edc47495ffe456af7ab7718896e0a0c25
Author: Thomas Schwinge <tschwi...@baylibre.com>
Date:   Thu Mar 7 14:42:07 2024 +0100

    GCN, nvptx: Errors during device probing are fatal
    
    Currently, we silently disable libgomp GCN and nvptx plugins/devices in
    the presence of certain error conditions during device probing, thus
    typically silently resorting to host-fallback execution.  Make such errors
    fatal, as for any other device access later on, so that we notice early
    and reliably when things go wrong.  (Keep just two cases non-fatal:
    (a) libgomp GCN or nvptx plugins are available but 'libhsa-runtime64.so.1'
    or 'libcuda.so.1' are not, and (b) those are available, but the
    corresponding devices are not.)
    
    This resolves the issue that we've got execution test cases unexpectedly
    PASSing, despite:
    
        libgomp: GCN fatal error: Run-time could not be initialized
        Runtime message: HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
    
    ..., and therefore they were not offloaded to the GCN device, but ran in
    host-fallback execution mode.  What happened in that scenario is that in
    'init_hsa_context' during the initial 'GOMP_OFFLOAD_get_num_devices' we ran
    into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'; this wasn't fatal, but just
    silently disabled the libgomp plugin/device.
    
    Especially "entertaining" were cases where such unintended host-fallback
    execution happened during effective-target checks like
    'offload_device_available' (host-fallback execution there meaning: no
    offload device available), but actual test cases then were running with an
    offload device available, and therefore mis-configured.
    
            include/
            * cuda/cuda.h (CUresult): Add 'CUDA_ERROR_NO_DEVICE'.
            libgomp/
            * plugin/plugin-gcn.c (init_hsa_context): Add and handle
            'bool probe' parameter.  Adjust all users; errors during device
            probing are fatal.
            * plugin/plugin-nvptx.c (nvptx_get_num_devices): Aside from
            'CUDA_ERROR_NO_DEVICE', errors during device probing are fatal.

Diff:
---
 include/cuda/cuda.h           |  1 +
 libgomp/plugin/plugin-gcn.c   | 14 ++++++++------
 libgomp/plugin/plugin-nvptx.c |  4 +++-
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 114aba4e074..0dca4b3a5c0 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -57,6 +57,7 @@ typedef enum {
   CUDA_ERROR_OUT_OF_MEMORY = 2,
   CUDA_ERROR_NOT_INITIALIZED = 3,
   CUDA_ERROR_DEINITIALIZED = 4,
+  CUDA_ERROR_NO_DEVICE = 100,
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_INVALID_HANDLE = 400,
   CUDA_ERROR_NOT_FOUND = 500,
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 1d183b61ca4..27947801ccd 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1513,10 +1513,12 @@ assign_agent_ids (hsa_agent_t agent, void *data)
 }
 
 /* Initialize hsa_context if it has not already been done.
-   Return TRUE on success.  */
+   If !PROBE: returns TRUE on success.
+   If PROBE: returns TRUE on success or if the plugin/device shall be silently
+   ignored, and otherwise emits an error and returns FALSE.  */
 
 static bool
-init_hsa_context (void)
+init_hsa_context (bool probe)
 {
   hsa_status_t status;
   int agent_index = 0;
@@ -1531,7 +1533,7 @@ init_hsa_context (void)
        GOMP_PLUGIN_fatal ("%s\n", msg);
       else
        GCN_WARNING ("%s\n", msg);
-      return false;
+      return probe ? true : false;
     }
   status = hsa_fns.hsa_init_fn ();
   if (status != HSA_STATUS_SUCCESS)
@@ -3337,8 +3339,8 @@ GOMP_OFFLOAD_version (void)
 int
 GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
-  if (!init_hsa_context ())
-    return 0;
+  if (!init_hsa_context (true))
+    exit (EXIT_FAILURE);
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
      devices were present.  */
   if (hsa_context.agent_count > 0
@@ -3355,7 +3357,7 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 bool
 GOMP_OFFLOAD_init_device (int n)
 {
-  if (!init_hsa_context ())
+  if (!init_hsa_context (false))
     return false;
   if (n >= hsa_context.agent_count)
     {
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ced6e014ece..5aad3448a8d 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -604,12 +604,14 @@ nvptx_get_num_devices (void)
       CUresult r = CUDA_CALL_NOCHECK (cuInit, 0);
       /* This is not an error: e.g. we may have CUDA libraries installed but
          no devices available.  */
-      if (r != CUDA_SUCCESS)
+      if (r == CUDA_ERROR_NO_DEVICE)
        {
          GOMP_PLUGIN_debug (0, "Disabling nvptx offloading; cuInit: %s\n",
                             cuda_error (r));
          return 0;
        }
+      else if (r != CUDA_SUCCESS)
+       GOMP_PLUGIN_fatal ("cuInit error: %s", cuda_error (r));
     }
 
   CUDA_CALL_ASSERT (cuDeviceGetCount, &n);
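
The probing policy described in the commit message can be sketched on its
own, outside libgomp.  This is only an illustration under stated assumptions:
'probe_result', 'probe_action' and 'classify_probe' are hypothetical names
made up for this sketch and are not libgomp identifiers.  It shows the two
conditions that stay non-fatal, (a) runtime library missing and (b) no
device present, and treats every other probing error as fatal.

```c
/* Hypothetical probe outcomes; illustrative names, not libgomp
   identifiers.  */
enum probe_result
{
  PROBE_OK,           /* Runtime initialized; devices can be counted.  */
  PROBE_NO_LIBRARY,   /* Case (a): runtime library could not be loaded.  */
  PROBE_NO_DEVICE,    /* Case (b): runtime present, but no device.  */
  PROBE_OTHER_ERROR   /* Anything else, for example out of resources.  */
};

/* What the (hypothetical) caller should do with a probe outcome.  */
enum probe_action
{
  ACTION_USE_DEVICES,       /* Continue and report the device count.  */
  ACTION_DISABLE_SILENTLY,  /* Report zero devices; host fallback.  */
  ACTION_FATAL              /* Emit an error and terminate.  */
};

/* Map a probe outcome to an action: only cases (a) and (b) are
   silently ignored; any other failure is fatal.  */
enum probe_action
classify_probe (enum probe_result r)
{
  switch (r)
    {
    case PROBE_OK:
      return ACTION_USE_DEVICES;
    case PROBE_NO_LIBRARY:
    case PROBE_NO_DEVICE:
      return ACTION_DISABLE_SILENTLY;
    default:
      return ACTION_FATAL;
    }
}
```

In terms of the nvptx hunk above, 'ACTION_DISABLE_SILENTLY' corresponds to
the 'return 0' path taken for 'CUDA_ERROR_NO_DEVICE', and 'ACTION_FATAL' to
the new 'GOMP_PLUGIN_fatal' call for any other non-success 'cuInit' result.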
