Re: CUDNN 7.5 Issues

2019-04-09 Thread Per da Silva
Hey Kellen,

I really appreciate that. Thank you!

And thanks to the community for supporting me ^^

Per


Re: CUDNN 7.5 Issues

2019-04-09 Thread kellen sunderland
Hey Per, just wanted to drop a line and say thanks for supporting the
community on this one.


Re: CUDNN 7.5 Issues

2019-04-09 Thread Per da Silva
I've created an issue to track this problem:
https://github.com/apache/incubator-mxnet/issues/14652


CUDNN 7.5 Issues

2019-04-09 Thread Per da Silva
Dear MXNet community,

I've been trying to update the CI GPU images to CUDA 10, but the tests are
failing. I'm not sure why and would really appreciate some help =D

I've managed, at least, to narrow down the problem to the cuDNN version.
The current CUDA 10 image uses cuDNN version 7.5.0.56 (
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
).

I noticed that the binary in the Python packages we release uses cuDNN
7.3.1.20 (
https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34),
so I decided to create a PR updating CI to CUDA 10 with cuDNN 7.3.1.20, and
sure enough the tests passed (
https://github.com/apache/incubator-mxnet/pull/14513).
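
For reference, pinning the older cuDNN in the image boils down to something
like the sketch below (this is not the actual content of the PR; the package
version strings are assumed from NVIDIA's usual apt naming for cuDNN debs):

# Hypothetical sketch: pin cuDNN 7.3.1 instead of 7.5 in a CUDA 10 Ubuntu
# image, assuming NVIDIA's apt repository is already configured.
CUDNN_VERSION=7.3.1.20
apt-get update && apt-get install -y --allow-downgrades \
    libcudnn7=${CUDNN_VERSION}-1+cuda10.0 \
    libcudnn7-dev=${CUDNN_VERSION}-1+cuda10.0
# Keep apt from silently upgrading cuDNN back to 7.5 later.
apt-mark hold libcudnn7 libcudnn7-dev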

After talking with another contributor, we decided that I would try to
create a PR with CUDA 10 and cuDNN 7.5 and just disable the failing tests
(to be fixed later). But it seems the problem is a bit more heinous: I
disable one test, and another one fails... So, it might make sense to reach
out now and see if we can find the root cause and fix it.

Some things I've sanity checked:

We run the tests on g3.8xlarge instances. These instances contain Tesla M60
GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10 supports
compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA).
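
To double-check the compute capability on the instance itself, something like
this sketch should work (it assumes the CUDA samples were installed alongside
the toolkit, so the path may differ):

cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery && sudo make
./deviceQuery | grep "CUDA Capability"   # expect 5.2 on a Tesla M60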

According to the cuDNN support matrix (
https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html),
cuDNN 7.5 is compatible with the GPU and with CUDA 10, and requires driver
r410.48 (I assume that means 410.48 or newer).

The AMIs running on the g3.8xlarge instances have CUDA 10 and driver 410.73.
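
(A quick way to verify that on the AMI, if anyone wants to compare; the query
flags and the version.txt path are the usual ones, but treat this as a sketch:)

nvidia-smi --query-gpu=name,driver_version --format=csv,noheader   # e.g. "Tesla M60, 410.73"
cat /usr/local/cuda/version.txt                                    # e.g. "CUDA Version 10.0.130"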

So, as best I can tell, our environment ought to support cuDNN 7.5, which
leads me to conclude that maybe there's something wrong in the code.

The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check failed: e
== CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".
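
In case someone wants to poke at this without running the whole test suite, my
best guess at a minimal repro (an untested sketch, assuming an MXNet GPU build
is importable) is to push an LSTM forward pass through the cuDNN RNN path,
since that's the operator the check fires in:

# Untested sketch: exercise the cuDNN RNN operator on the GPU.
python3 -c "
import mxnet as mx
lstm = mx.gluon.rnn.LSTM(hidden_size=10)
lstm.initialize(ctx=mx.gpu(0))
out = lstm(mx.nd.ones((5, 2, 4), ctx=mx.gpu(0)))  # (seq_len, batch, input_size)
out.wait_to_read()
print('cuDNN RNN forward pass OK')
"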

According to the cuDNN user guide (
https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html):

CUDNN_STATUS_ARCH_MISMATCH

The function requires a feature absent from the current GPU device. Note
that cuDNN only supports devices with compute capabilities greater than or
equal to 3.0.

To correct: compile and run the application on a device with appropriate
compute capability.

But, as we've seen, our environment seems to support this version of cuDNN,
and other cuDNN versions go through CI without any problem...
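
For anyone reproducing this, the cuDNN version the image actually ships can be
checked roughly like so (header and library paths assumed from the NVIDIA base
images, so adjust as needed):

grep -E "#define CUDNN_(MAJOR|MINOR|PATCHLEVEL)" /usr/include/cudnn.h
ldconfig -p | grep libcudnn   # which libcudnn.so the linker picks up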

You can see some logs here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/

I have about 13 runs of this pipeline. The errors for different runs can be
seen by changing the number before /pipeline (e.g.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
for the 2nd run, etc.)

Thanks in advance for the help!

You can reach me here or on Slack if you have any questions =D

Cheers,

Per

P.S. I'm attaching some instructions on how to reproduce the issue at home
(or at least on a g3.8xlarge instance running Ubuntu 16.04).
# Launch g3.8xlarge instance

# ==-_-==-_-== Environment Setup ==-_-==-_-==

sudo apt update
sudo apt-get install -y \
apt-transport-https \
build-essential \
ca-certificates \
curl \
git \
libatlas-base-dev \
libcurl4-openssl-dev \
libjemalloc-dev \
libhdf5-dev \
liblapack-dev \
libopenblas-dev \
libopencv-dev \
libturbojpeg \
libzmq3-dev \
ninja-build \
software-properties-common \
sudo \
unzip \
wget

sudo apt-get install -y python-dev python3-dev virtualenv wget

# The version of pip shipped with Ubuntu may be too old; install a recent version here
wget -nv https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo python2 get-pip.py

pip2 install --user nose cpplint==1.3.0 pylint==1.9.3 'numpy<=1.15.2,>=1.8.2' \
    nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
pip3 install --user nose cpplint==1.3.0 pylint==2.1.1 'numpy<=1.15.2,>=1.8.2' \
    nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3

# ==-_-==-_-== CUDA Installation ==-_-==-_-==

wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
chmod +x cuda_10.0.130_410.48_linux && sudo ./cuda_10.0.130_410.48_linux

# Installation excerpt:
# Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
# (y)es/(n)o/(q)uit: y
# 
# Do you want to install the OpenGL libraries?
# (y)es/(n)o/(q)uit [ default is yes ]:
#
# Do you want to run nvidia-xconfig?
# This will update the system X configuration file so that the NVIDIA X driver
# is used. The pre-existing X configuration file will be backed