Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-13 Thread Alex
Just a quick update here regarding regression tests. On an old machine with a single puny GTX 960, the 2018 build passes all tests, with: Start 9: GpuUtilsUnitTests ... 9/39 Test #9: GpuUtilsUnitTests ... Passed 5.64 sec. Hope this is useful. Alex -- Gromacs Users mailing list *

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-09 Thread Szilárd Páll
Great to hear! (Also note that one thing we have explicitly focused on is not only peak performance, but getting as close to peak as possible with just a few CPU cores! You should be able to get >75% of peak performance with just 3-5 Xeon or 2-3 desktop cores, rather than needing a full fast CPU.) -- Szilárd On

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
With -pme gpu, I am reporting 383.032 ns/day vs 270 ns/day with the 2016.4 version. I _did not_ mistype. The system is close to a cubic box of water with some ions. Incredible. Alex On Thu, Feb 8, 2018 at 12:27 PM, Szilárd Páll wrote: > Note that the actual mdrun
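For anyone reproducing this comparison: PME offload in GROMACS 2018 is requested per run. A minimal sketch of the invocation (the input name and thread counts are placeholders, not taken from this thread):

```shell
# GROMACS 2018+: offload both short-range nonbonded and PME work to the GPU.
# "topol" is a hypothetical -deffnm prefix; substitute your own .tpr.
gmx mdrun -deffnm topol -nb gpu -pme gpu
```

Without `-pme gpu`, 2018 falls back to CPU PME as in the 2016.x series, which is presumably the 270 ns/day baseline quoted above.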

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Szilárd Páll
Note that the actual mdrun performance need not be affected, whether it's a driver persistence issue (you'll just see a few seconds of lag at mdrun startup) or some other CUDA application startup-related lag (an mdrun run does mostly very different kinds of things than this particular set of unit

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
I keep getting bounce messages from the list, so in case things didn't get posted... 1. We enabled PM -- still times out. 2. 3-4 days ago we had very fast runs with GPU (2016.4), so I don't know if we miraculously broke everything to the point where our $25K box performs worse than Mark's

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
On Thu, Feb 8, 2018 at 6:54 PM Szilárd Páll wrote: > BTW, do you have persistence mode (PM) set (see in the nvidia-smi output)? > If you do not have PM it set nor is there an X server that keeps the driver > loaded, the driver gets loaded every time a CUDA application is

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Got it. Given all the messing around, I am rebuilding GMX and if make check results are the same, will install. We have an angry postdoc here demanding tools. Thank you gentlemen. Alex On Thu, Feb 8, 2018 at 10:50 AM, Szilárd Páll wrote: > On Thu, Feb 8, 2018 at 6:46

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Szilárd Páll
BTW, do you have persistence mode (PM) set (see the nvidia-smi output)? If you do not have PM set, nor an X server that keeps the driver loaded, the driver gets loaded every time a CUDA application is started. This could be causing the lag which shows up as long execution time for
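For readers of the archive, persistence mode can be inspected and enabled with nvidia-smi roughly as follows (a sketch, not a transcript from this thread; enabling requires root):

```shell
# Query the current persistence-mode setting per GPU.
nvidia-smi --query-gpu=persistence_mode --format=csv

# Enable persistence mode so the driver stays loaded between CUDA apps,
# avoiding the multi-second startup lag discussed above.
sudo nvidia-smi -pm 1
```

Note that on newer driver stacks NVIDIA recommends the persistence daemon (`nvidia-persistenced`) over the legacy `-pm` flag, but the effect on startup lag is the same.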

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
Hi, Assuming the other test binary has the same behaviour (succeeds when run manually), then the build is working correctly and you could install it for general use. But I suspect its performance will suffer from whatever is causing the slowdown (e.g. compare with old numbers). That's not really

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Szilárd Páll
On Thu, Feb 8, 2018 at 6:46 PM, Alex wrote: > Are you suggesting that I should accept these results and install the 2018 > version? > Yes, your GROMACS build seems fine. make check simply runs the test I suggested you run manually (and which finished successfully).

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Are you suggesting that I should accept these results and install the 2018 version? Thanks, Alex On Thu, Feb 8, 2018 at 10:43 AM, Mark Abraham wrote: > Hi, > > PATH doesn't matter, only what ldd thinks matters. > > I have opened

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
Hi, PATH doesn't matter, only what ldd thinks matters. I have opened https://redmine.gromacs.org/issues/2405 to address that the implementation of these tests are perhaps proving more pain than usefulness (from this thread and others I have seen). Mark On Thu, Feb 8, 2018 at 6:41 PM Alex
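Mark's point about ldd can be checked directly. A quick way to see which CUDA libraries the dynamic loader actually resolves for the test binary, independent of PATH (the binary path assumes you are in the build directory, as elsewhere in this thread):

```shell
# The runtime linker, not PATH, decides which libcudart/libcufft get loaded.
# Grep the resolved dependencies of the stalling test binary:
ldd bin/gpu_utils-test | grep -Ei 'cuda|cufft'
```

A library resolving to a leftover installation (or showing "not found") would explain behavior that PATH edits cannot change.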

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
That is quite weird. We found that I have PATH values pointing to the old gmx installation while running these tests. Do you think that could cause issues? Alex On Thu, Feb 8, 2018 at 10:36 AM, Mark Abraham wrote: > Hi, > > Great. The manual run took 74.5 seconds,

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
Hi, Great. The manual run took 74.5 seconds, failing the 30 second timeout. So the code is fine. But you have some crazy large overhead going on - gpu_utils-test runs in 7s on my 2013 desktop with CUDA 9.1. Mark On Thu, Feb 8, 2018 at 6:29 PM Alex wrote: > uh, no sir. >

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
uh, no sir. > 9/39 Test #9: GpuUtilsUnitTests ***Timeout 30.43 sec On Thu, Feb 8, 2018 at 10:25 AM, Mark Abraham wrote: > Hi, > > Those all succeeded. Does make check now also succeed? > > Mark > > On Thu, Feb 8, 2018 at 6:24 PM Alex

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
Hi, Those all succeeded. Does make check now also succeed? Mark On Thu, Feb 8, 2018 at 6:24 PM Alex wrote: > Here you are: > > [==] Running 35 tests from 7 test cases. > [--] Global test environment set-up. > [--] 7 tests from HostAllocatorTest/0,

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Here you are: [==] Running 35 tests from 7 test cases. [--] Global test environment set-up. [--] 7 tests from HostAllocatorTest/0, where TypeParam = int [ RUN ] HostAllocatorTest/0.EmptyMemoryAlwaysWorks [ OK ] HostAllocatorTest/0.EmptyMemoryAlwaysWorks (5457

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Szilárd Páll
It might help to know which of the unit test(s) in that group stall? Can you run it manually (bin/gpu_utils-test) and report back the standard output? -- Szilárd On Thu, Feb 8, 2018 at 3:56 PM, Alex wrote: > Nope, still persists after reboot and no other jobs running: >
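The manual run Szilárd suggests, plus an equivalent CTest invocation that reruns only that test group with full output (both assume the GROMACS build directory as the working directory):

```shell
# Run the stalling unit-test binary directly to see which test hangs:
bin/gpu_utils-test

# Or rerun just that group through CTest, printing output on failure:
ctest -R GpuUtilsUnitTests --output-on-failure
```

Running the binary directly bypasses CTest's 30-second timeout, which is what later revealed the 74.5 s run time in this thread.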

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Here's some additional info: # cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.12 Wed Dec 20 07:19:16 PST 2017 GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Forwarding colleague's email below, any suggestions highly appreciated. Thanks! Alex *** I ran the minimal tests suggested in the cuda installation guide. (bandwidthTest, deviceQuery) and then I individually ran 10 of the samples provided. However, many of the samples require a graphics

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
I did hear yesterday that CUDA's own tests passed, but will update on that in more detail as soon as people start showing up -- it's 8 am right now... :) Thanks Mark, Alex On 2/8/2018 7:59 AM, Mark Abraham wrote: Hi, OK, but it's not clear to me if you followed the other advice - cleaned out all

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
Hi, OK, but it's not clear to me if you followed the other advice - cleaning out all the NVIDIA stuff (CUDA, runtime, drivers) - nor whether CUDA's own tests work. Mark On Thu, Feb 8, 2018 at 3:57 PM Alex wrote: > Nope, still persists after reboot and no other jobs running: > 9/39 Test
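A hedged sketch of the cleanup step on Ubuntu, which this machine appears to run given the gcc version string elsewhere in the thread (the package patterns are a distribution-specific assumption; review what apt proposes to remove before confirming):

```shell
# Remove NVIDIA driver and CUDA packages before a clean reinstall.
# Inspect the removal list carefully before accepting!
sudo apt-get purge 'nvidia-*' 'cuda-*'
sudo apt-get autoremove
sudo reboot
```

A reboot afterwards matters: as noted later in the thread, stale kernel modules from the old driver can linger until then.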

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Nope, still persists after reboot and no other jobs running: 9/39 Test #9: GpuUtilsUnitTests ***Timeout 30.59 sec Any additional suggestions?

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
I am rebooting the box and kicking out all the jobs until we figure this out. Thanks! Alex On 2/8/2018 7:27 AM, Szilárd Páll wrote: BTW, timeouts can be caused by contention from stupid number of ranks/tMPI threads hammering a single GPU (especially with 2 threads/core with HT), but I'm not

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Mark, Peter -- thanks. Your comments make sense.

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Szilárd Páll
BTW, timeouts can be caused by contention from stupid number of ranks/tMPI threads hammering a single GPU (especially with 2 threads/core with HT), but I'm not sure if the tests are ever executed with such a huge rank count. -- Szilárd On Thu, Feb 8, 2018 at 2:40 PM, Mark Abraham
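One way to rule out the contention Szilárd describes is to pin a run to a modest, explicit thread count rather than letting mdrun spawn one rank per hardware thread. The counts below are illustrative, not taken from this thread:

```shell
# Limit the run to 1 thread-MPI rank with a few OpenMP threads on GPU 0,
# instead of many ranks hammering a single GPU.
gmx mdrun -deffnm topol -ntmpi 1 -ntomp 6 -gpu_id 0
```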

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
Hi, On Thu, Feb 8, 2018 at 2:15 PM Alex wrote: > Mark and Peter, > > Thanks for commenting. I was told that all CUDA tests passed, but I will > double check on how many of those were actually run. Also, we never > rebooted the box after CUDA install, and finally we had a

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Peter Kroon
Jup, start with rebooting before trying anything else. There's probably still old drivers loaded in the kernel. Peter On 08-02-18 14:14, Alex wrote: > Mark and Peter, > > Thanks for commenting. I was told that all CUDA tests passed, but I > will double check on how many of those were actually
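Before and after the reboot Peter recommends, one can compare the loaded kernel module against the installed driver (a generic check, not a transcript from this thread):

```shell
# Which NVIDIA kernel modules are currently loaded?
lsmod | grep nvidia

# What driver version does the loaded module report?
cat /proc/driver/nvidia/version
```

A mismatch between the module version here and the freshly installed driver package is a classic sign that old drivers are still resident in the kernel.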

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Alex
Mark and Peter, Thanks for commenting. I was told that all CUDA tests passed, but I will double check on how many of those were actually run. Also, we never rebooted the box after CUDA install, and finally we had a bunch of gromacs (2016.4) jobs running, because we didn't want to interrupt

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Mark Abraham
Hi, Or leftovers of the drivers that are now mismatching. That has caused timeouts for us. Mark On Thu, Feb 8, 2018 at 10:55 AM Peter Kroon wrote: > Hi, > > > with changing failures like this I would start to suspect the hardware > as well. Mark's suggestion of looking at

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-08 Thread Peter Kroon
Hi, with changing failures like this I would start to suspect the hardware as well. Mark's suggestion of looking at simpler test programs than GMX is a good one :) Peter On 08-02-18 09:10, Mark Abraham wrote: > Hi, > > That suggests that your new CUDA installation is differently incomplete.

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-07 Thread Alex
Update: we seem to have had a hiccup with an orphan CUDA install and that was causing issues. After wiping everything off and rebuilding the errors from the initial post disappeared. However, two tests failed during regression: 95% tests passed, 2 tests failed out of 39 Label Time Summary: GTest

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-07 Thread Alex
Hi Mark, Nothing has been installed yet, so the commands were issued from /build/bin and so I am not sure about the output of that mdrun-test (let me know what exact command could make it more informative). Thank you, Alex *** > ./gmx -version GROMACS version: 2018 Precision:

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-07 Thread Mark Abraham
Hi, I checked back with the CUDA-facing GROMACS developers. They've run the code with 9.1 and believe there's no intrinsic problem within GROMACS. > So I don't have much to suggest other than rebuilding everything cleanly, as this is an internal non-descript cuFFT/driver error that is not

Re: [gmx-users] GMX 2018 regression tests: cufftPlanMany R2C plan failure (error code 5)

2018-02-06 Thread Alex
And this is with: > gcc --version > gcc (Ubuntu 5.4.0-6ubuntu1~16.04.6) 5.4.0 20160609 On Tue, Feb 6, 2018 at 1:18 PM, Alex wrote: > Hi all, > > I've just built the latest version and regression tests are running. Here > is one error: > > "Program: mdrun-test, version