Re: [gmx-users] strange GPU load distribution

2018-05-08 Thread Alex
Hi Szilárd, It really does appear that GMX_DISABLE_GPU_DETECTION=1 in the user's .bashrc fixed it right up. We haven't tried his runs alongside GPU-accelerated jobs yet, but he reports that none of his PIDs ever appear in nvidia-smi anymore and overall his jobs start much faster. This was an

Re: [gmx-users] strange GPU load distribution

2018-05-07 Thread Alex
I think we have everything ready at this point: a separate binary (not sourced yet), and these options. We've set GMX_DISABLE_GPU_DETECTION=1 in the user's .bashrc and will try the other option if this one fails. Will update here on the bogging-down situation. Thanks a lot. Alex On Mon, May

Re: [gmx-users] strange GPU load distribution

2018-05-07 Thread Szilárd Páll
Hi, You have at least one option more elegant than using a separate binary for EM. Set the GMX_DISABLE_GPU_DETECTION=1 environment variable, which is the internal GROMACS override that forces detection off in cases like yours. That should solve the detection latency. If for some reason it
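
A minimal sketch of how that override could be applied, either persistently (e.g. in the user's ~/.bashrc) or for a single invocation; the EM command line is the one quoted elsewhere in this thread:

    # Persistent: disable GPU detection for every mdrun started from this shell
    export GMX_DISABLE_GPU_DETECTION=1

    # Or one-off, for a single EM run only
    GMX_DISABLE_GPU_DETECTION=1 gmx mdrun -nt 1 -nb cpu -pme cpu -deffnm em_steep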

Re: [gmx-users] strange GPU load distribution

2018-05-07 Thread Alex
Thanks Mark. No need to be sorry, a CPU-only build is a simple enough fix. Inelegant, but if it works, it's all good. I'll report as soon as we have tried. I myself run things in a way that you would find very familiar, but we have a colleague developing forcefields and that involves tons of

Re: [gmx-users] strange GPU load distribution

2018-05-07 Thread Mark Abraham
Hi, I don't see any problems there, but I note that there are run-time settings for the driver/runtime to block until no other process is using the GPU, which may be a contributing factor here. As Justin noted, if your EM jobs would use a build of GROMACS that is not configured to have access to
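
A rough sketch of configuring such a CPU-only build with CMake; the install prefix and the _cpu suffix are placeholders, not taken from this thread:

    # Separate CPU-only GROMACS build: -DGMX_GPU=off skips CUDA entirely
    cmake .. -DGMX_GPU=off \
             -DGMX_BUILD_OWN_FFTW=ON \
             -DCMAKE_INSTALL_PREFIX=/opt/gromacs-2018.1-cpu \
             -DGMX_DEFAULT_SUFFIX=OFF -DGMX_BINARY_SUFFIX=_cpu -DGMX_LIBS_SUFFIX=_cpu
    make -j 8 && make install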

Re: [gmx-users] strange GPU load distribution

2018-05-06 Thread Alex
Mark, I am forwarding the response I received from the colleague who prepared the box for my GMX install -- this is from the latest installation of 2018.1. See text below and please let me know what you think. We have no problem rebuilding things, but would like to understand what is wrong

Re: [gmx-users] strange GPU load distribution

2018-05-06 Thread Alex
Hi Mark, I forwarded your email to the person who installed CUDA on our boxes. Just to be clear, there is no persistent occupancy of the GPUs _after_ the process has finished. The observation is as follows: EM jobs submitted > low CPU use by the EM jobs, GPUs bogged down, no output files yet

Re: [gmx-users] strange GPU load distribution

2018-05-06 Thread Mark Abraham
Hi, In 2018 and 2018.1, mdrun does indeed run GPU detection and compatibility checks before any logic about whether it should use any GPUs that were in fact detected. However, there's nothing about those checks that should a) take any noticeable time, b) acquire any ongoing resources, or c) lead

Re: [gmx-users] strange GPU load distribution

2018-05-06 Thread Justin Lemkul
On 5/6/18 6:11 PM, Alex wrote: A separate CPU-only build is what we were going to try, but if it succeeds with not touching GPUs, then what -- keep several builds? If your CPU-only run produces something that doesn't touch the GPU (which it shouldn't), that test would rather conclusively
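
A simple way to run that check, assuming the CPU-only binary was installed under a hypothetical name such as gmx_cpu:

    # Terminal 1: run the EM job with the CPU-only binary
    gmx_cpu mdrun -nt 1 -nb cpu -pme cpu -deffnm em_steep

    # Terminal 2: confirm its PID never appears among the GPU compute processes
    watch -n 1 nvidia-smi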

Re: [gmx-users] strange GPU load distribution

2018-05-06 Thread Alex
A separate CPU-only build is what we were going to try, but if it succeeds without touching the GPUs, then what -- keep several builds? That latency you mention is definitely there; I think it is related to my earlier report of one of the regression tests failing (I think Mark might remember

Re: [gmx-users] strange GPU load distribution

2018-05-06 Thread Justin Lemkul
On 5/6/18 5:51 PM, Alex wrote: Unfortunately, we're still bogged down when the EM runs (example below) start -- CPU usage by these jobs is initially low, while their PIDs show up in nvidia-smi. After about a minute all goes back to normal. Because the user is doing it frequently (scripted),

Re: [gmx-users] strange GPU load distribution

2018-05-06 Thread Alex
Unfortunately, we're still bogged down when the EM runs (example below) start -- CPU usage by these jobs is initially low, while their PIDs show up in nvidia-smi. After about a minute all goes back to normal. Because the user is doing it frequently (scripted), everything is slowed down by a

Re: [gmx-users] strange GPU load distribution

2018-04-30 Thread Alex
Hi Mark, We checked and one example is below. Thanks, Alex PID TTY  STAT   TIME COMMAND 60432 pts/8    Dl+ 0:01 gmx mdrun -table ../../../tab_it.xvg -nt 1 -nb cpu -pme cpu -deffnm em_steep On 4/27/2018 2:16 PM, Mark Abraham wrote: Hi, What you think was run isn't nearly as useful

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Alex
I see. :) I will check again when I am back at work. Thanks! Alex On 4/27/2018 2:16 PM, Mark Abraham wrote: Hi, What you think was run isn't nearly as useful when troubleshooting as asking the kernel what is actually running. Mark On Fri, Apr 27, 2018, 21:59 Alex

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Mark Abraham
Hi, What you think was run isn't nearly as useful when troubleshooting as asking the kernel what is actually running. Mark On Fri, Apr 27, 2018, 21:59 Alex wrote: > Mark, I copied the exact command line from the script, right above the > mdp file. It is literally how the

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Alex
Mark, I copied the exact command line from the script, right above the mdp file. It is literally how the script calls mdrun in this case: gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm On 4/27/2018 1:52 PM, Mark Abraham wrote: Group cutoff scheme can never run on a gpu, so none of that should

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Mark Abraham
The group cutoff scheme can never run on a GPU, so none of that should matter. Use ps and find out what the command lines were. Mark On Fri, Apr 27, 2018, 21:37 Alex wrote: > Update: we're basically removing commands one by one from the script that > submits the jobs causing

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Alex
Update: we're basically removing commands one by one from the script that submits the jobs causing the issue. The culprit is both the EM and the MD run, and GPUs are being affected _before_ MD starts loading the CPU, i.e. this is the initial setup of the EM run -- CPU load is near zero,

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Alex
As I said, only two users, and nvidia-smi shows the process name. We're investigating, and it does appear that it is the EM that uses cutoff electrostatics, so the user did not bother with -pme cpu in the mdrun call. What would be the correct way to enforce CPU-only mdrun when coulombtype =

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Mark Abraham
No. Look at the processes that are running, e.g. with top or ps. Either old simulations are still running, or another user is running jobs. Mark On Fri, Apr 27, 2018, 20:33 Alex wrote: > Strange. There are only two people using this machine, myself being one of > them, and the other person
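
For the record, a quick way to do exactly that and tie an nvidia-smi entry back to its command line (plain ps/nvidia-smi usage, nothing GROMACS-specific):

    # Which compute processes are currently on the GPUs?
    nvidia-smi

    # Full command line behind a PID taken from that output
    ps -o pid,user,args -p <PID>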

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Alex
Strange. There are only two people using this machine, myself being one of them, and the other person specifically forces -nb cpu -pme cpu in his calls to mdrun. Are any other GMX utilities (e.g. insert-molecules, grompp, or energy) trying to use GPUs? Thanks, Alex On Fri, Apr 27, 2018 at 5:33

Re: [gmx-users] strange GPU load distribution

2018-04-27 Thread Szilárd Páll
The second column is the PID, so there is a whole lot more going on there than just a single simulation with a single rank using two GPUs. That would be one PID with two entries, one for each GPU. Are you sure you're not running other processes? -- Szilárd On Thu, Apr 26, 2018 at 5:52 AM, Alex

[gmx-users] strange GPU load distribution

2018-04-25 Thread Alex
Hi all, I am running GMX 2018 with gmx mdrun -pinoffset 0 -pin on -nt 24 -ntmpi 4 -npme 1 -pme gpu -nb gpu -gputasks 1122. Once in a while the simulation slows down and nvidia-smi reports something like this: |    1 12981  C gmx  175MiB |
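
For context, my reading of that command line (not output from the run itself): -ntmpi 4 with -npme 1 gives three PP ranks plus one dedicated PME rank, i.e. four GPU tasks in total, and -gputasks 1122 maps those tasks in order onto device IDs 1, 1, 2, 2 -- two tasks per GPU:

    # 24 threads over 4 thread-MPI ranks, one of them a dedicated PME rank;
    # -nb gpu -pme gpu offloads both the short-range nonbondeds and PME;
    # -gputasks 1122 assigns the four GPU tasks, in order, to devices 1,1,2,2
    gmx mdrun -pinoffset 0 -pin on -nt 24 -ntmpi 4 -npme 1 \
              -pme gpu -nb gpu -gputasks 1122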