Re: [gmx-users] Fw: cudaFuncGetAttributes failed: out of memory

2020-02-26 Thread Szilárd Páll
Hi,

Indeed, there is an issue with the GPU detection code's consistency checks
that trip and abort the run if any of the detected GPUs behaves in
unexpected ways (e.g. runs out of memory during checks).

This should be fixed in an upcoming release, but until then, as you have
observed, you can always restrict the set of GPUs exposed to GROMACS using
the CUDA_VISIBLE_DEVICES environment variable.
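
For example, something along these lines should work (the device IDs are only
an illustration based on your nvidia-smi output, where GPUs 1 and 3 look free):

# hide the busy GPUs (0 and 2) so GROMACS never touches them
export CUDA_VISIBLE_DEVICES=1,3
# the remaining GPUs are renumbered 0 and 1 inside the process,
# so -gpu_id refers to that remapped numbering
gmx mdrun -deffnm pull -ntmpi 1 -nb gpu -pme gpu -gpu_id 0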

Cheers,


--
Szilárd


On Sun, Feb 23, 2020 at 7:51 AM bonjour899  wrote:

> I think I've temporarily solved this problem. Only when I use
> CUDA_VISIBLE_DEVICES to hide the GPUs whose memory is almost fully occupied
> can I run GROMACS smoothly (using -gpu_id alone does not help). I think there
> may be a bug in GROMACS's GPU usage model in a multi-GPU environment (it seems
> that as long as one of the GPUs is fully occupied, GROMACS cannot submit work to
> any GPU and returns the error "cudaFuncGetAttributes failed: out of
> memory").
>
>
>
> Best regards,
> W
>
>
>
>
>  Forwarding messages 
> From: "bonjour899" 
> Date: 2020-02-23 11:32:53
> To:  gromacs.org_gmx-users@maillist.sys.kth.se
> Subject: [gmx-users] cudaFuncGetAttributes failed: out of memory
> I also tried restricting the run to a different GPU using -gpu_id, but still got
> the same error. I've also posted my question at
> https://devtalk.nvidia.com/default/topic/1072038/cuda-programming-and-performance/cudafuncgetattributes-failed-out-of-memory/
> Below is the output of nvidia-smi:
>
>
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.33.01      Driver Version: 440.33.01      CUDA Version: 10.2  |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla P100-PCIE...    On | :04:00.0         Off |                    0 |
> | N/A   35C    P0    34W / 250W |  16008MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Tesla P100-PCIE...    On | :06:00.0         Off |                    0 |
> | N/A   35C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   2  Tesla P100-PCIE...    On | :07:00.0         Off |                    0 |
> | N/A   35C    P0    33W / 250W |  16063MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   3  Tesla P100-PCIE...    On | :08:00.0         Off |                    0 |
> | N/A   36C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   4  Quadro P4000          On | :0B:00.0         Off |                  N/A |
> |  46%  27C    P8     8W / 105W |     12MiB /  8119MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID   Type   Process name                             Usage      |
> |=============================================================================|
> |    0     20497      C   /usr/bin/python3                            5861MiB |
> |    0     24503      C   /usr/bin/python3                           10137MiB |
> |    2     23162      C   /home/appuser/Miniconda3/bin/python        16049MiB |
> +-----------------------------------------------------------------------------+
>
>
>
>
>
>
>
>  Forwarding messages 
> From: "bonjour899" 
> Date: 2020-02-20 10:30:36
> To: "gromacs.org_gmx-users@maillist.sys.kth.se" <
> gromacs.org_gmx-users@maillist.sys.kth.se>
> Subject: cudaFuncGetAttributes failed: out of memory
>
> Hello,
>
>
> I have encountered a weird problem. I've been using GROMACS with GPUs on a
> server and it has always performed well. However, when I reran a job today I
> suddenly got this error:
>
>
>
> Command line:
>
> gmx mdrun -deffnm pull -ntmpi 1 -nb gpu -pme gpu -gpu_id 3
>
> Back Off! I just backed up pull.log to ./#pull.log.1#
>
> ---
>
> Program: gmx mdrun, version 2019.4
>
> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 100)
>
>
>
> Fatal error:
>
> cudaFuncGetAttributes failed: out of memory
>
>
>
> For more information and tips for troubleshooting, please check the GROMACS
>
> website at http://www.gromacs.org/Documentation/Errors
>
> ---
>
>
>
>
> It seems the GPU is not occupied at all and I can run other GPU applications,
> but I cannot run GROMACS mdrun anymore, not even for an energy minimization.
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a
> mail to gmx-users-requ...@gromacs.org.
