Re: CUDNN 7.5 Issues

2019-04-09 Thread Per da Silva
Hey Kellen,

I really appreciate that. Thank you!

And thanks to the community for supporting me ^^

Per


On Wed, Apr 10, 2019 at 5:53 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Hey Per, just wanted to drop a line and say thanks for supporting the
> community on this one.
>
> On Tue, Apr 9, 2019 at 4:20 AM Per da Silva  wrote:
>
> > I've created an issue to track this problem:
> > https://github.com/apache/incubator-mxnet/issues/14652
> >
> > On Tue, Apr 9, 2019 at 9:07 AM Per da Silva 
> wrote:
> >
> > > Dear MXNet community,
> > >
> > > I've been trying to update the CI GPU images to CUDA 10, but the tests
> > are
> > > failing. I'm not sure why and would really appreciate some help =D
> > >
> > > I've managed, at least, to narrow down the problem to the cuDNN
> version.
> > > The current CUDA 10 image uses cuDNN version 7.5.0.56 (
> > >
> >
> https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
> > > ).
> > >
> > > I noticed that the binary in the python packages we release uses cuDNN
> > > 7.3.1.20 (
> > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34
> > ),
> > > so decided to create a PR with CI updated to CUDA 10 with cuDNN
> 7.3.1.20
> > > and sure enough the tests passed (
> > > https://github.com/apache/incubator-mxnet/pull/14513).
> > >
> > > After talking with another contributor, we decided that I would try to
> > > create a PR with CUDA 10 and cuDNN 7.5 and just disable the failing
> tests
> > > (to be fixed later). But, it seems the problem is a bit more heinous. I
> > > disable one test, and another one fails...So, it might make sense to
> > reach
> > > out now and see if we can find the root cause and fix it.
> > >
> > > Some things I've sanity checked:
> > >
> > > We run the tests on g3.8xlarge instances. These instances contain Tesla
> > > M60 GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10
> > supports
> > > compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA
> ).
> > >
> > > According to the cuDNN support matrix (
> > >
> https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html
> > ),
> > > cuDNN 7.5 is compatible with the GPU, CUDA 10, and requires driver
> > r410.48
> > > (I assume greater or equal).
> > >
> > > The AMIs running on the g3.8xlarge have CUDA 10 and driver 410.73.
> > >
> > > So, as best I can tell, our environment ought to support cuDNN 7.5,
> which
> > > leads me to conclude that maybe there's something wrong in the code.
> > >
> > > The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check
> failed:
> > > e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".
> > >
> > > According to the cuDNN user guide (
> > >
> >
> https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
> > > ):
> > >
> > > CUDNN_STATUS_ARCH_MISMATCH
> > >
> > > The function requires a feature absent from the current GPU device.
> Note
> > > that cuDNN only supports devices with compute capabilities greater than
> > or
> > > equal to 3.0.
> > >
> > > To correct: compile and run the application on a device with
> appropriate
> > > compute capability.
> > >
> > > But, as we've seen, our environment seems to support this version of
> > cuDNN
> > > and other versions go through CI w/o any problem...
> > >
> > > You can see some logs here:
> > >
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/
> > >
> > >
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/
> > >
> > > I have about 13 runs of this pipeline. The errors for different runs
> can
> > > be seen by changing the number before /pipeline (e.g.
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
> > > for the 2nd run, etc.)
> > >
> > > Thanks in advance for the help!
> > >
> > > You can reach me here or on Slack if you have any questions =D
> > >
> > > Cheers,
> > >
> > > Per
> > >
> > > P.S. I'm attaching some instructions on how to reproduce the issue at
> > home
> > > (or at least on a g3.8xlarge instance running ubuntu 16.04).
> > >
> >
>


Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Junru Shao
Agreed with Tianqi that we could have a better implementation once we have
better TVM/NNVM v2 integration. For now, I believe we shouldn't block
the development of the Intel folks.

On Tue, Apr 9, 2019 at 10:10 PM Tianqi Chen 
wrote:

> Such kind of conversion can be viewed as an enhanced version of
> AlterOpLayout in the TVM relay Pass
>
> On Tue, Apr 9, 2019 at 8:03 PM Lv, Tao A  wrote:
>
> >
> > Thank you Tianqi and Sam for the kind suggestions.
> >
> > @Tianqi,
> >
> > Can you please point me to the code of this pass or do you think anyone
> > from TVM community can help to educate me on this? I'm very happy to
> learn
> > from that.
> >
> > Just one note, we are not only doing layout transformation but also want
> > to have more memory for layout transformation.
> > For example, (N=32, C=3, H=256, W=256) will be padded to (N=32, C=16,
> > H=256, W=256) on channel dimension then convert (N=32, C=16, H=256,
> W=256)
> > to nchw16c so we can leverage corresponding optimal computation kernels.
> > That's why we also need changes to the memory planning pass.
> >
> >
> > @Sam,
> >
> > Yes, definitely we're treating MKL-DNN as an accelerator on CPU.
> > Previously we used it to accelerate certain critical operators in MXNet
> in
> > certain situations, eg. FP32 convolution/deconvolution/fullyConnected,
> etc.
> > But along with the evolving of both MXNet and MKL-DNN, we started to do
> > more which might not supported by MXNet in original CPU implementation,
> > such as quantization and graph fusion. So MKL-DNN backend is also
> changing
> > from a simple `accelerator` to a `default` backend on CPU. And I totally
> > agree with you that we need think more about the software architecture
> for
> > maintainability, testability and readability - that's why I sent out this
> > proposal to get more ideas from the community.
> >
> >
> > -tao
> >
> > -Original Message-
> > From: Skalicky, Sam [mailto:sska...@amazon.com.INVALID]
> > Sent: Wednesday, April 10, 2019 2:24 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType
> > and memory planning pass
> >
> > I agree with Tianqi. We should let MKLDNN participate in memory planning
> > by first having a separate NNVM pass and then using that info in the
> > regular memory planning phase.
> >
> > It's starting to sound like MKLDNN should be treated like an accelerator
> > rather than an operator library. As it has explicit needs and can provide
> > acceleration when given extra capabilities in MXNet like having input to
> > the memory planning NNVM pass. It also has special tensor formatting
> needs
> > and conversions that could be best architected in another way than they
> > currently are.
> >
> > We need to think about how we want to architect this for maintainability,
> > testability, and readability.
> >
> > Sam
> >
> >
> > > On Apr 9, 2019, at 11:11 AM, Tianqi Chen 
> > wrote:
> > >
> > > The layout transformation should really be a separate optimization
> > > pass rather than memory planning. As is done in the TVM stack. If we
> > > want to do a clean slate solution, I would recommend looking into that
> > instead.
> > >
> > > TIanqi
> > >
> > > On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A  wrote:
> > >
> > >>
> > >>
> > >> Hi dev,
> > >>
> > >>
> > >>
> > >> As we're discussing the roadmap for MXNet 2.0, I would like to start
> > >> a thread about refining the InferStorageType and memory planning pass
> > >> in MXNet and hope it can happen as a part of the 2.0 release.
> > >>
> > >>
> > >>
> > >> Thanks to @eric-haibin-lin, part of the proposal has already been
> > >> discussed in issue #13598 [1].
> > >>
> > >>
> > >>
> > >> As mentioned in the description of issue #13598, there are several
> > >> drawbacks of the existing flow. Please allow me to quote them here:
> > >> *the selection of MKL/CPU/GPU/CUDNN implementation happens
> after
> > >> graph attribute inference and memory planning, memory planning is
> > >> thus not aware of the implementation that will be used for execution
> > >> in the future, which may result in sub-optimal result. For example,
> > >> the memory inplace option may vary depending on the accelerator
> > >> backend (the new version of CUDNN enables x/dx inplace for
> > _backward_conv).
> > >> *some sparse operator need to access dtype/shape information
> to
> > >> decide which implementation to invoke for execution, and whether to
> > >> perform fallback. This information is not yet exposed in the existing
> > >> infer storage type interface.
> > >>
> > >>
> > >>
> > >> Besides, the existing memory planning pass calculates and afterwards
> > >> allocates memory strictly according to the input/output tensor shapes
> > >> (which can be got from operators' arithmetic formulas through
> > InferShape).
> > >> That's not true anymore when we come to accelerators like MKL-DNN on
> > >> CPU which wants to pad input/output tensor to optimal formats (eg.
> > >> nc

Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Tianqi Chen
Such a conversion can be viewed as an enhanced version of the
AlterOpLayout pass in TVM Relay.

On Tue, Apr 9, 2019 at 8:03 PM Lv, Tao A  wrote:

>
> Thank you Tianqi and Sam for the kind suggestions.
>
> @Tianqi,
>
> Can you please point me to the code of this pass or do you think anyone
> from TVM community can help to educate me on this? I'm very happy to learn
> from that.
>
> Just one note, we are not only doing layout transformation but also want
> to have more memory for layout transformation.
> For example, (N=32, C=3, H=256, W=256) will be padded to (N=32, C=16,
> H=256, W=256) on channel dimension then convert (N=32, C=16, H=256, W=256)
> to nchw16c so we can leverage corresponding optimal computation kernels.
> That's why we also need changes to the memory planning pass.
>
>
> @Sam,
>
> Yes, definitely we're treating MKL-DNN as an accelerator on CPU.
> Previously we used it to accelerate certain critical operators in MXNet in
> certain situations, eg. FP32 convolution/deconvolution/fullyConnected, etc.
> But along with the evolving of both MXNet and MKL-DNN, we started to do
> more which might not supported by MXNet in original CPU implementation,
> such as quantization and graph fusion. So MKL-DNN backend is also changing
> from a simple `accelerator` to a `default` backend on CPU. And I totally
> agree with you that we need think more about the software architecture for
> maintainability, testability and readability - that's why I sent out this
> proposal to get more ideas from the community.
>
>
> -tao
>
> -Original Message-
> From: Skalicky, Sam [mailto:sska...@amazon.com.INVALID]
> Sent: Wednesday, April 10, 2019 2:24 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType
> and memory planning pass
>
> I agree with Tianqi. We should let MKLDNN participate in memory planning
> by first having a separate NNVM pass and then using that info in the
> regular memory planning phase.
>
> It's starting to sound like MKLDNN should be treated like an accelerator
> rather than an operator library. As it has explicit needs and can provide
> acceleration when given extra capabilities in MXNet like having input to
> the memory planning NNVM pass. It also has special tensor formatting needs
> and conversions that could be best architected in another way than they
> currently are.
>
> We need to think about how we want to architect this for maintainability,
> testability, and readability.
>
> Sam
>
>
> > On Apr 9, 2019, at 11:11 AM, Tianqi Chen 
> wrote:
> >
> > The layout transformation should really be a separate optimization
> > pass rather than memory planning. As is done in the TVM stack. If we
> > want to do a clean slate solution, I would recommend looking into that
> instead.
> >
> > TIanqi
> >
> > On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A  wrote:
> >
> >>
> >>
> >> Hi dev,
> >>
> >>
> >>
> >> As we're discussing the roadmap for MXNet 2.0, I would like to start
> >> a thread about refining the InferStorageType and memory planning pass
> >> in MXNet and hope it can happen as a part of the 2.0 release.
> >>
> >>
> >>
> >> Thanks to @eric-haibin-lin, part of the proposal has already been
> >> discussed in issue #13598 [1].
> >>
> >>
> >>
> >> As mentioned in the description of issue #13598, there are several
> >> drawbacks of the existing flow. Please allow me to quote them here:
> >> *the selection of MKL/CPU/GPU/CUDNN implementation happens after
> >> graph attribute inference and memory planning, memory planning is
> >> thus not aware of the implementation that will be used for execution
> >> in the future, which may result in sub-optimal result. For example,
> >> the memory inplace option may vary depending on the accelerator
> >> backend (the new version of CUDNN enables x/dx inplace for
> _backward_conv).
> >> *some sparse operator need to access dtype/shape information to
> >> decide which implementation to invoke for execution, and whether to
> >> perform fallback. This information is not yet exposed in the existing
> >> infer storage type interface.
> >>
> >>
> >>
> >> Besides, the existing memory planning pass calculates and afterwards
> >> allocates memory strictly according to the input/output tensor shapes
> >> (which can be got from operators' arithmetic formulas through
> InferShape).
> >> That's not true anymore when we come to accelerators like MKL-DNN on
> >> CPU which wants to pad input/output tensor to optimal formats (eg.
> >> nchw16c) according to hardware architecture. It also can be described
> >> as shape + stride. As many of you know, MKL-DNN shows great
> >> performance on these optimal formats which is blocked by the vector
> length of AVX512 or AVX2.
> >> It's very natural for us to pad on the channel dimension for those
> >> inputs/outputs which IC or OC is not multiples of vector length and
> >> leverage optimal kernels for blocked formats. Unfortunately this
> >> cannot be implemented w

Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Junru Shao
+1 for this proposal. Probably this is doable prior to 2.0?

While I totally agree with Tianqi that, in the sense of a compiler, we
should make layout transformation a separate pass, I would like to mention
that it will be a non-trivial engineering effort given that our
current NNVM does not have a pass manager for optionally applying passes.
Moreover, I believe Tao's proposal is roughly equivalent to adding a new
pass in NNVM (but one with the same name).

By the way, treating MKLDNN as an accelerator is a nice proposal, which I
guess could be a wish for MXNet 2.0.

On Tue, Apr 9, 2019 at 8:39 PM Zhao, Patric  wrote:

> BTW, "maintainability, testability and readability" has always been our design
> goal from the starting point of the MKL-DNN integration :)
>
> > -Original Message-
> > From: Lv, Tao A [mailto:tao.a...@intel.com]
> > Sent: Wednesday, April 10, 2019 11:03 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: RE: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType
> and
> > memory planning pass
> >
> >
> > Thank you Tianqi and Sam for the kind suggestions.
> >
> > @Tianqi,
> >
> > Can you please point me to the code of this pass or do you think anyone
> > from TVM community can help to educate me on this? I'm very happy to
> > learn from that.
> >
> > Just one note, we are not only doing layout transformation but also want
> to
> > have more memory for layout transformation.
> > For example, (N=32, C=3, H=256, W=256) will be padded to (N=32, C=16,
> > H=256, W=256) on channel dimension then convert (N=32, C=16, H=256,
> > W=256) to nchw16c so we can leverage corresponding optimal computation
> > kernels.
> > That's why we also need changes to the memory planning pass.
> >
> >
> > @Sam,
> >
> > Yes, definitely we're treating MKL-DNN as an accelerator on CPU.
> Previously
> > we used it to accelerate certain critical operators in MXNet in certain
> > situations, eg. FP32 convolution/deconvolution/fullyConnected, etc. But
> > along with the evolving of both MXNet and MKL-DNN, we started to do more
> > which might not supported by MXNet in original CPU implementation, such
> > as quantization and graph fusion. So MKL-DNN backend is also changing
> from
> > a simple `accelerator` to a `default` backend on CPU. And I totally
> agree with
> > you that we need think more about the software architecture for
> > maintainability, testability and readability - that's why I sent out
> this proposal
> > to get more ideas from the community.
> >
> >
> > -tao
> >
> > -Original Message-
> > From: Skalicky, Sam [mailto:sska...@amazon.com.INVALID]
> > Sent: Wednesday, April 10, 2019 2:24 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType
> and
> > memory planning pass
> >
> > I agree with Tianqi. We should let MKLDNN participate in memory planning
> > by first having a separate NNVM pass and then using that info in the
> regular
> > memory planning phase.
> >
> > It's starting to sound like MKLDNN should be treated like an accelerator
> rather
> > than an operator library. As it has explicit needs and can provide
> acceleration
> > when given extra capabilities in MXNet like having input to the memory
> > planning NNVM pass. It also has special tensor formatting needs and
> > conversions that could be best architected in another way than they
> > currently are.
> >
> > We need to think about how we want to architect this for maintainability,
> > testability, and readability.
> >
> > Sam
> >
> >
> > > On Apr 9, 2019, at 11:11 AM, Tianqi Chen 
> > wrote:
> > >
> > > The layout transformation should really be a separate optimization
> > > pass rather than memory planning. As is done in the TVM stack. If we
> > > want to do a clean slate solution, I would recommend looking into that
> > instead.
> > >
> > > TIanqi
> > >
> > > On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A  wrote:
> > >
> > >>
> > >>
> > >> Hi dev,
> > >>
> > >>
> > >>
> > >> As we're discussing the roadmap for MXNet 2.0, I would like to start
> > >> a thread about refining the InferStorageType and memory planning pass
> > >> in MXNet and hope it can happen as a part of the 2.0 release.
> > >>
> > >>
> > >>
> > >> Thanks to @eric-haibin-lin, part of the proposal has already been
> > >> discussed in issue #13598 [1].
> > >>
> > >>
> > >>
> > >> As mentioned in the description of issue #13598, there are several
> > >> drawbacks of the existing flow. Please allow me to quote them here:
> > >> *the selection of MKL/CPU/GPU/CUDNN implementation happens
> > after
> > >> graph attribute inference and memory planning, memory planning is
> > >> thus not aware of the implementation that will be used for execution
> > >> in the future, which may result in sub-optimal result. For example,
> > >> the memory inplace option may vary depending on the accelerator
> > >> backend (the new version of CUDNN enables x/dx inplace for
> > _backward_conv).
> > >> *some sparse operat

Re: CUDNN 7.5 Issues

2019-04-09 Thread kellen sunderland
Hey Per, just wanted to drop a line and say thanks for supporting the
community on this one.

On Tue, Apr 9, 2019 at 4:20 AM Per da Silva  wrote:

> I've created an issue to track this problem:
> https://github.com/apache/incubator-mxnet/issues/14652
>
> On Tue, Apr 9, 2019 at 9:07 AM Per da Silva  wrote:
>
> > Dear MXNet community,
> >
> > I've been trying to update the CI GPU images to CUDA 10, but the tests
> are
> > failing. I'm not sure why and would really appreciate some help =D
> >
> > I've managed, at least, to narrow down the problem to the cuDNN version.
> > The current CUDA 10 image uses cuDNN version 7.5.0.56 (
> >
> https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
> > ).
> >
> > I noticed that the binary in the python packages we release uses cuDNN
> > 7.3.1.20 (
> >
> https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34
> ),
> > so decided to create a PR with CI updated to CUDA 10 with cuDNN 7.3.1.20
> > and sure enough the tests passed (
> > https://github.com/apache/incubator-mxnet/pull/14513).
> >
> > After talking with another contributor, we decided that I would try to
> > create a PR with CUDA 10 and cuDNN 7.5 and just disable the failing tests
> > (to be fixed later). But, it seems the problem is a bit more heinous. I
> > disable one test, and another one fails...So, it might make sense to
> reach
> > out now and see if we can find the root cause and fix it.
> >
> > Some things I've sanity checked:
> >
> > We run the tests on g3.8xlarge instances. These instances contain Tesla
> > M60 GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10
> supports
> > compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA).
> >
> > According to the cuDNN support matrix (
> > https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html
> ),
> > cuDNN 7.5 is compatible with the GPU, CUDA 10, and requires driver
> r410.48
> > (I assume greater or equal).
> >
> > The AMIs running on the g3.8xlarge have CUDA 10 and driver 410.73.
> >
> > So, as best I can tell, our environment ought to support cuDNN 7.5, which
> > leads me to conclude that maybe there's something wrong in the code.
> >
> > The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check failed:
> > e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".
> >
> > According to the cuDNN user guide (
> >
> https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
> > ):
> >
> > CUDNN_STATUS_ARCH_MISMATCH
> >
> > The function requires a feature absent from the current GPU device. Note
> > that cuDNN only supports devices with compute capabilities greater than
> or
> > equal to 3.0.
> >
> > To correct: compile and run the application on a device with appropriate
> > compute capability.
> >
> > But, as we've seen, our environment seems to support this version of
> cuDNN
> > and other versions go through CI w/o any problem...
> >
> > You can see some logs here:
> >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/
> >
> >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/
> >
> > I have about 13 runs of this pipeline. The errors for different runs can
> > be seen by changing the number before /pipeline (e.g.
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
> > <
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/>
> for
> > the 2nd run, etc.)
> >
> > Thanks in advance for the help!
> >
> > You can reach me here or on Slack if you have any questions =D
> >
> > Cheers,
> >
> > Per
> >
> > P.S. I'm attaching some instructions on how to reproduce the issue at
> home
> > (or at least on a g3.8xlarge instance running ubuntu 16.04).
> >
>


RE: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Zhao, Patric
BTW, "maintainability, testability and readability" has always been our design goal
from the starting point of the MKL-DNN integration :)

> -Original Message-
> From: Lv, Tao A [mailto:tao.a...@intel.com]
> Sent: Wednesday, April 10, 2019 11:03 AM
> To: dev@mxnet.incubator.apache.org
> Subject: RE: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and
> memory planning pass
> 
> 
> Thank you Tianqi and Sam for the kind suggestions.
> 
> @Tianqi,
> 
> Can you please point me to the code of this pass or do you think anyone
> from TVM community can help to educate me on this? I'm very happy to
> learn from that.
> 
> Just one note, we are not only doing layout transformation but also want to
> have more memory for layout transformation.
> For example, (N=32, C=3, H=256, W=256) will be padded to (N=32, C=16,
> H=256, W=256) on channel dimension then convert (N=32, C=16, H=256,
> W=256) to nchw16c so we can leverage corresponding optimal computation
> kernels.
> That's why we also need changes to the memory planning pass.
> 
> 
> @Sam,
> 
> Yes, definitely we're treating MKL-DNN as an accelerator on CPU. Previously
> we used it to accelerate certain critical operators in MXNet in certain
> situations, eg. FP32 convolution/deconvolution/fullyConnected, etc. But
> along with the evolving of both MXNet and MKL-DNN, we started to do more
> which might not supported by MXNet in original CPU implementation, such
> as quantization and graph fusion. So MKL-DNN backend is also changing from
> a simple `accelerator` to a `default` backend on CPU. And I totally agree with
> you that we need think more about the software architecture for
> maintainability, testability and readability - that's why I sent out this 
> proposal
> to get more ideas from the community.
> 
> 
> -tao
> 
> -Original Message-
> From: Skalicky, Sam [mailto:sska...@amazon.com.INVALID]
> Sent: Wednesday, April 10, 2019 2:24 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and
> memory planning pass
> 
> I agree with Tianqi. We should let MKLDNN participate in memory planning
> by first having a separate NNVM pass and then using that info in the regular
> memory planning phase.
> 
> It's starting to sound like MKLDNN should be treated like an accelerator rather
> than an operator library. As it has explicit needs and can provide 
> acceleration
> when given extra capabilities in MXNet like having input to the memory
> planning NNVM pass. It also has special tensor formatting needs and
> conversions that could be best architected in another way than they
> currently are.
> 
> We need to think about how we want to architect this for maintainability,
> testability, and readability.
> 
> Sam
> 
> 
> > On Apr 9, 2019, at 11:11 AM, Tianqi Chen 
> wrote:
> >
> > The layout transformation should really be a separate optimization
> > pass rather than memory planning. As is done in the TVM stack. If we
> > want to do a clean slate solution, I would recommend looking into that
> instead.
> >
> > TIanqi
> >
> > On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A  wrote:
> >
> >>
> >>
> >> Hi dev,
> >>
> >>
> >>
> >> As we're discussing the roadmap for MXNet 2.0, I would like to start
> >> a thread about refining the InferStorageType and memory planning pass
> >> in MXNet and hope it can happen as a part of the 2.0 release.
> >>
> >>
> >>
> >> Thanks to @eric-haibin-lin, part of the proposal has already been
> >> discussed in issue #13598 [1].
> >>
> >>
> >>
> >> As mentioned in the description of issue #13598, there are several
> >> drawbacks of the existing flow. Please allow me to quote them here:
> >> *the selection of MKL/CPU/GPU/CUDNN implementation happens
> after
> >> graph attribute inference and memory planning, memory planning is
> >> thus not aware of the implementation that will be used for execution
> >> in the future, which may result in sub-optimal result. For example,
> >> the memory inplace option may vary depending on the accelerator
> >> backend (the new version of CUDNN enables x/dx inplace for
> _backward_conv).
> >> *some sparse operator need to access dtype/shape information to
> >> decide which implementation to invoke for execution, and whether to
> >> perform fallback. This information is not yet exposed in the existing
> >> infer storage type interface.
> >>
> >>
> >>
> >> Besides, the existing memory planning pass calculates and afterwards
> >> allocates memory strictly according to the input/output tensor shapes
> >> (which can be got from operators' arithmetic formulas through
> InferShape).
> >> That's not true anymore when we come to accelerators like MKL-DNN on
> >> CPU which wants to pad input/output tensor to optimal formats (eg.
> >> nchw16c) according to hardware architecture. It also can be described
> >> as shape + stride. As many of you know, MKL-DNN shows great
> >> performance on these optimal formats which is blocked by the vector
> length 

RE: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Lv, Tao A


Thank you Tianqi and Sam for the kind suggestions.

@Tianqi,

Can you please point me to the code of this pass, or do you think anyone from the
TVM community can help educate me on this? I'm very happy to learn from that.

Just one note: we are not only doing layout transformation but also need more
memory than the arithmetic shape implies for that transformation.
For example, (N=32, C=3, H=256, W=256) will be padded to (N=32, C=16, H=256,
W=256) on the channel dimension and then converted to nchw16c, so we can leverage
the corresponding optimal computation kernels.
That's why we also need changes to the memory planning pass.
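
For concreteness, a small stand-alone C++ sketch (plain code, not MKL-DNN's API) of
what the example above means for the planner: the padded nchw16c buffer is larger
than the arithmetic FP32 NCHW buffer, so planning from the arithmetic shape alone
would under-allocate.

#include <cstdint>
#include <cstdio>

// Bytes needed for an FP32 NCHW buffer.
int64_t NchwBytes(int64_t n, int64_t c, int64_t h, int64_t w) {
  return n * c * h * w * static_cast<int64_t>(sizeof(float));
}

int main() {
  const int64_t kBlock = 16;                      // channel block for nchw16c (AVX-512 lanes)
  const int64_t n = 32, c = 3, h = 256, w = 256;  // arithmetic shape from InferShape
  const int64_t c_padded = ((c + kBlock - 1) / kBlock) * kBlock;  // 3 -> 16

  std::printf("arithmetic: %lld bytes\n",
              static_cast<long long>(NchwBytes(n, c, h, w)));         // 25,165,824
  std::printf("padded    : %lld bytes\n",
              static_cast<long long>(NchwBytes(n, c_padded, h, w)));  // 134,217,728
  return 0;
}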


@Sam,

Yes, definitely we're treating MKL-DNN as an accelerator on CPU. Previously we
used it to accelerate certain critical operators in MXNet in certain
situations, e.g. FP32 convolution/deconvolution/fullyConnected, etc. But as
both MXNet and MKL-DNN have evolved, we started to do more that is not
supported by MXNet's original CPU implementation, such as quantization and
graph fusion. So the MKL-DNN backend is also changing from a simple `accelerator`
to a `default` backend on CPU. And I totally agree with you that we need to think
more about the software architecture for maintainability, testability and
readability - that's why I sent out this proposal to get more ideas from the
community.


-tao

-Original Message-
From: Skalicky, Sam [mailto:sska...@amazon.com.INVALID] 
Sent: Wednesday, April 10, 2019 2:24 AM
To: dev@mxnet.incubator.apache.org
Subject: Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and 
memory planning pass

I agree with Tianqi. We should let MKLDNN participate in memory planning by
first having a separate NNVM pass and then using that info in the regular 
memory planning phase.

It's starting to sound like MKLDNN should be treated like an accelerator rather
than an operator library. As it has explicit needs and can provide acceleration 
when given extra capabilities in MXNet like having input to the memory planning 
NNVM pass. It also has special tensor formatting needs and conversions that 
could be best architected in another way than they currently are.

We need to think about how we want to architect this for maintainability, 
testability, and readability.

Sam


> On Apr 9, 2019, at 11:11 AM, Tianqi Chen  wrote:
> 
> The layout transformation should really be a separate optimization 
> pass rather than memory planning. As is done in the TVM stack. If we 
> want to do a clean slate solution, I would recommend looking into that 
> instead.
> 
> TIanqi
> 
> On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A  wrote:
> 
>> 
>> 
>> Hi dev,
>> 
>> 
>> 
>> As we're discussing the roadmap for MXNet 2.0, I would like to start 
>> a thread about refining the InferStorageType and memory planning pass 
>> in MXNet and hope it can happen as a part of the 2.0 release.
>> 
>> 
>> 
>> Thanks to @eric-haibin-lin, part of the proposal has already been 
>> discussed in issue #13598 [1].
>> 
>> 
>> 
>> As mentioned in the description of issue #13598, there are several 
>> drawbacks of the existing flow. Please allow me to quote them here:
>> *the selection of MKL/CPU/GPU/CUDNN implementation happens after
>> graph attribute inference and memory planning, memory planning is 
>> thus not aware of the implementation that will be used for execution 
>> in the future, which may result in sub-optimal result. For example, 
>> the memory inplace option may vary depending on the accelerator 
>> backend (the new version of CUDNN enables x/dx inplace for _backward_conv).
>> *some sparse operator need to access dtype/shape information to
>> decide which implementation to invoke for execution, and whether to 
>> perform fallback. This information is not yet exposed in the existing 
>> infer storage type interface.
>> 
>> 
>> 
>> Besides, the existing memory planning pass calculates and afterwards 
>> allocates memory strictly according to the input/output tensor shapes 
>> (which can be got from operators' arithmetic formulas through InferShape).
>> That's not true anymore when we come to accelerators like MKL-DNN on 
>> CPU which wants to pad input/output tensor to optimal formats (eg. 
>> nchw16c) according to hardware architecture. It also can be described 
>> as shape + stride. As many of you know, MKL-DNN shows great 
>> performance on these optimal formats which is blocked by the vector length 
>> of AVX512 or AVX2.
>> It's very natural for us to pad on the channel dimension for those 
>> inputs/outputs which IC or OC is not multiples of vector length and 
>> leverage optimal kernels for blocked formats. Unfortunately this 
>> cannot be implemented without changing the logic in the memory planning pass.
>> Currently we always fallback to slow reference kernels for both 
>> convolution [1] and deconvolution [2].
>> 
>> 
>> 
>> AFAIK, the padding feature of MKL-DNN has already been used in 
>> TensorFlow and other frameworks. We also found that

Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Skalicky, Sam
I agree with Tianqi. We should let MKLDNN participate in memory planning by
first having a separate NNVM pass and then using that info in the regular
memory planning phase.

It's starting to sound like MKLDNN should be treated like an accelerator rather
than an operator library, since it has explicit needs and can provide acceleration
when given extra capabilities in MXNet, such as input to the memory planning
NNVM pass. It also has special tensor formatting needs and conversions that
could be better architected than they currently are.

We need to think about how we want to architect this for maintainability, 
testability, and readability.

Sam


> On Apr 9, 2019, at 11:11 AM, Tianqi Chen  wrote:
> 
> The layout transformation should really be a separate optimization pass
> rather than memory planning. As is done in the TVM stack. If we want to do
> a clean slate solution, I would recommend looking into that instead.
> 
> TIanqi
> 
> On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A  wrote:
> 
>> 
>> 
>> Hi dev,
>> 
>> 
>> 
>> As we're discussing the roadmap for MXNet 2.0, I would like to start a
>> thread about refining the InferStorageType and memory planning pass in
>> MXNet and hope it can happen as a part of the 2.0 release.
>> 
>> 
>> 
>> Thanks to @eric-haibin-lin, part of the proposal has already been
>> discussed in issue #13598 [1].
>> 
>> 
>> 
>> As mentioned in the description of issue #13598, there are several
>> drawbacks of the existing flow. Please allow me to quote them here:
>> *the selection of MKL/CPU/GPU/CUDNN implementation happens after
>> graph attribute inference and memory planning, memory planning is thus not
>> aware of the implementation that will be used for execution in the future,
>> which may result in sub-optimal result. For example, the memory inplace
>> option may vary depending on the accelerator backend (the new version of
>> CUDNN enables x/dx inplace for _backward_conv).
>> *some sparse operator need to access dtype/shape information to
>> decide which implementation to invoke for execution, and whether to perform
>> fallback. This information is not yet exposed in the existing infer storage
>> type interface.
>> 
>> 
>> 
>> Besides, the existing memory planning pass calculates and afterwards
>> allocates memory strictly according to the input/output tensor shapes
>> (which can be got from operators' arithmetic formulas through InferShape).
>> That's not true anymore when we come to accelerators like MKL-DNN on CPU
>> which wants to pad input/output tensor to optimal formats (eg. nchw16c)
>> according to hardware architecture. It also can be described as shape +
>> stride. As many of you know, MKL-DNN shows great performance on these
>> optimal formats which is blocked by the vector length of AVX512 or AVX2.
>> It's very natural for us to pad on the channel dimension for those
>> inputs/outputs which IC or OC is not multiples of vector length and
>> leverage optimal kernels for blocked formats. Unfortunately this cannot be
>> implemented without changing the logic in the memory planning pass.
>> Currently we always fallback to slow reference kernels for both convolution
>> [1] and deconvolution [2].
>> 
>> 
>> 
>> AFAIK, the padding feature of MKL-DNN has already been used in TensorFlow
>> and other frameworks. We also found that, without supporting this feature,
>> many other new features from MKL-DNN cannot be applied to MXNet,  such as
>> the deconvolution primitive, winograd, etc.
>> 
>> 
>> 
>> Changes for this proposal can be divided into following parts:
>> 1.  Following the proposal in issue #13598, we need add new
>> InferStorageTypeEx functions to operators which need to do dispatch in a
>> more fine-grained way. This also need the InfereStorage pass can handle the
>> new -Ex function as what we did for FCompute and FComputeEx.
>> 2.  Attach more information to the computation graph/node, eg.
>> accelerator specific information. Currently we add `IsMKLDNN` directly
>> during operator registration if MXNET_USE_MKLDNN == 1. It looks simple and
>> rude to me.
>> 3.  Do memory planning according to more information: topology,
>> shapes, data types, in-place options and more accurate accelerator
>> information (accelerator path, memory size requirements, accelerator-wise
>> attributes).
>> 4.  Improve MKL-DNN operators so they can work on those well planned
>> memory which may be larger than the arithmetic requirements and work with
>> optimal kernels. Also, with more accurate dispatching in
>> InferStorageTypeEx, there is no need for us to write complicated fallback
>> logic in MKL-DNN operators.
>> 5.  If users feel uncomfortable with more memory usage, we can disable
>> this feature by environmental variables.
>> 
>> 
>> 
>> Since the memory planning pass is implemented in NNVM, so we also need
>> support from TVM community.
>> 
>> 
>> 
>> Please let me know what do you think. Thank you.
>> 
>> 
>> 
>> -tao
>> 
>>

Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Tianqi Chen
The layout transformation should really be a separate optimization pass
rather than part of memory planning, as is done in the TVM stack. If we want to
do a clean-slate solution, I would recommend looking into that instead.

Tianqi

On Tue, Apr 9, 2019 at 1:46 AM Lv, Tao A  wrote:

>
>
> Hi dev,
>
>
>
> As we're discussing the roadmap for MXNet 2.0, I would like to start a
> thread about refining the InferStorageType and memory planning pass in
> MXNet and hope it can happen as a part of the 2.0 release.
>
>
>
> Thanks to @eric-haibin-lin, part of the proposal has already been
> discussed in issue #13598 [1].
>
>
>
> As mentioned in the description of issue #13598, there are several
> drawbacks of the existing flow. Please allow me to quote them here:
> *the selection of MKL/CPU/GPU/CUDNN implementation happens after
> graph attribute inference and memory planning, memory planning is thus not
> aware of the implementation that will be used for execution in the future,
> which may result in sub-optimal result. For example, the memory inplace
> option may vary depending on the accelerator backend (the new version of
> CUDNN enables x/dx inplace for _backward_conv).
> *some sparse operator need to access dtype/shape information to
> decide which implementation to invoke for execution, and whether to perform
> fallback. This information is not yet exposed in the existing infer storage
> type interface.
>
>
>
> Besides, the existing memory planning pass calculates and afterwards
> allocates memory strictly according to the input/output tensor shapes
> (which can be got from operators' arithmetic formulas through InferShape).
> That's not true anymore when we come to accelerators like MKL-DNN on CPU
> which wants to pad input/output tensor to optimal formats (eg. nchw16c)
> according to hardware architecture. It also can be described as shape +
> stride. As many of you know, MKL-DNN shows great performance on these
> optimal formats which is blocked by the vector length of AVX512 or AVX2.
> It's very natural for us to pad on the channel dimension for those
> inputs/outputs which IC or OC is not multiples of vector length and
> leverage optimal kernels for blocked formats. Unfortunately this cannot be
> implemented without changing the logic in the memory planning pass.
> Currently we always fallback to slow reference kernels for both convolution
> [1] and deconvolution [2].
>
>
>
> AFAIK, the padding feature of MKL-DNN has already been used in TensorFlow
> and other frameworks. We also found that, without supporting this feature,
> many other new features from MKL-DNN cannot be applied to MXNet,  such as
> the deconvolution primitive, winograd, etc.
>
>
>
> Changes for this proposal can be divided into following parts:
> 1.  Following the proposal in issue #13598, we need add new
> InferStorageTypeEx functions to operators which need to do dispatch in a
> more fine-grained way. This also need the InfereStorage pass can handle the
> new -Ex function as what we did for FCompute and FComputeEx.
> 2.  Attach more information to the computation graph/node, eg.
> accelerator specific information. Currently we add `IsMKLDNN` directly
> during operator registration if MXNET_USE_MKLDNN == 1. It looks simple and
> rude to me.
> 3.  Do memory planning according to more information: topology,
> shapes, data types, in-place options and more accurate accelerator
> information (accelerator path, memory size requirements, accelerator-wise
> attributes).
> 4.  Improve MKL-DNN operators so they can work on those well planned
> memory which may be larger than the arithmetic requirements and work with
> optimal kernels. Also, with more accurate dispatching in
> InferStorageTypeEx, there is no need for us to write complicated fallback
> logic in MKL-DNN operators.
> 5.  If users feel uncomfortable with more memory usage, we can disable
> this feature by environmental variables.
>
>
>
> Since the memory planning pass is implemented in NNVM, so we also need
> support from TVM community.
>
>
>
> Please let me know what do you think. Thank you.
>
>
>
> -tao
>
>
>
> [1] https://github.com/apache/incubator-mxnet/issues/13598
>
> [2]
> https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_convolution.cc#L194
>
> [3]
> https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_deconvolution.cc#L55
>
>


MXNet Berlin User Group

2019-04-09 Thread Jose Luis Contreras Santos
Hello dev,



This is a friendly reminder that the MXNet Berlin User Group will be held
today, starting in a few minutes at 6pm-7pm (CEST) / 9am-10am (PST).


More info here:

https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28Incubating%29+User+Groups+recurring+meetings



https://chime.aws/4907624430



Thanks!


Jose


Re: CUDNN 7.5 Issues

2019-04-09 Thread Per da Silva
I've created an issue to track this problem:
https://github.com/apache/incubator-mxnet/issues/14652

On Tue, Apr 9, 2019 at 9:07 AM Per da Silva  wrote:

> Dear MXNet community,
>
> I've been trying to update the CI GPU images to CUDA 10, but the tests are
> failing. I'm not sure why and would really appreciate some help =D
>
> I've managed, at least, to narrow down the problem to the cuDNN version.
> The current CUDA 10 image uses cuDNN version 7.5.0.56 (
> https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
> ).
>
> I noticed that the binary in the python packages we release uses cuDNN
> 7.3.1.20 (
> https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34),
> so decided to create a PR with CI updated to CUDA 10 with cuDNN 7.3.1.20
> and sure enough the tests passed (
> https://github.com/apache/incubator-mxnet/pull/14513).
>
> After talking with another contributor, we decided that I would try to
> create a PR with CUDA 10 and cuDNN 7.5 and just disable the failing tests
> (to be fixed later). But, it seems the problem is a bit more heinous. I
> disable one test, and another one fails...So, it might make sense to reach
> out now and see if we can find the root cause and fix it.
>
> Some things I've sanity checked:
>
> We run the tests on g3.8xlarge instances. These instances contain Tesla
> M60 GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10 supports
> compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA).
>
> According to the cuDNN support matrix (
> https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html),
> cuDNN 7.5 is compatible with the GPU, CUDA 10, and requires driver r410.48
> (I assume greater or equal).
>
> The AMIs running on the g3.8xlarge have CUDA 10 and driver 410.73.
>
> So, as best I can tell, our environment ought to support cuDNN 7.5, which
> leads me to conclude that maybe there's something wrong in the code.
>
> The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check failed:
> e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".
>
> According to the cuDNN user guide (
> https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
> ):
>
> CUDNN_STATUS_ARCH_MISMATCH
>
> The function requires a feature absent from the current GPU device. Note
> that cuDNN only supports devices with compute capabilities greater than or
> equal to 3.0.
>
> To correct: compile and run the application on a device with appropriate
> compute capability.
>
> But, as we've seen, our environment seems to support this version of cuDNN
> and other versions go through CI w/o any problem...
>
> You can see some logs here:
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/
>
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/
>
> I have about 13 runs of this pipeline. The errors for different runs can
> be seen by changing the number before /pipeline (e.g.
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
> for the 2nd run, etc.)
>
> Thanks in advance for the help!
>
> You can reach me here or on Slack if you have any questions =D
>
> Cheers,
>
> Per
>
> P.S. I'm attaching some instructions on how to reproduce the issue at home
> (or at least on a g3.8xlarge instance running ubuntu 16.04).
>


[MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

2019-04-09 Thread Lv, Tao A


Hi dev,



As we're discussing the roadmap for MXNet 2.0, I would like to start a thread 
about refining the InferStorageType and memory planning pass in MXNet and hope 
it can happen as a part of the 2.0 release.



Thanks to @eric-haibin-lin, part of the proposal has already been discussed in 
issue #13598 [1].



As mentioned in the description of issue #13598, there are several drawbacks of
the existing flow. Please allow me to quote them here:
* the selection of the MKL/CPU/GPU/CUDNN implementation happens after graph
attribute inference and memory planning; memory planning is thus not aware of
the implementation that will be used for execution later, which may result in
sub-optimal results. For example, the memory inplace option may vary depending
on the accelerator backend (the new version of CUDNN enables x/dx inplace for
_backward_conv).
* some sparse operators need to access dtype/shape information to decide
which implementation to invoke for execution, and whether to perform fallback.
This information is not yet exposed in the existing infer storage type
interface (a rough sketch of an extended interface follows right below).
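
To make the second point concrete, here is a rough, hypothetical sketch of what
such an extended interface could look like. The type names, codes and dispatch
policy below are illustrative placeholders, not the existing MXNet/NNVM API:

#include <cstdint>
#include <vector>

// Illustrative placeholders only - not the real MXNet types, codes or registration macros.
enum StorageType { kDefaultStorage, kRowSparseStorage, kCSRStorage };
enum DispatchMode { kFCompute, kFComputeEx, kFComputeFallback };
using Shape = std::vector<int64_t>;

// Unlike the current infer-storage hook, the "-Ex" variant also sees shapes and dtypes,
// so an operator can pick an implementation (or request fallback) with full information.
bool InferStorageTypeEx(int dev_mask,                              // cpu / gpu
                        const std::vector<Shape>& in_shapes,
                        const std::vector<int>& in_dtypes,         // e.g. 0 = float32
                        std::vector<StorageType>* in_stypes,
                        std::vector<StorageType>* out_stypes,
                        DispatchMode* dispatch_mode) {
  // Example policy: the specialized kernel only handles float32 4-D inputs on CPU;
  // everything else goes through the dense fallback path.
  const bool supported = dev_mask == 1 /* cpu, placeholder value */ &&
                         !in_dtypes.empty() && in_dtypes[0] == 0 &&
                         !in_shapes.empty() && in_shapes[0].size() == 4;
  for (auto& s : *in_stypes) s = kDefaultStorage;
  for (auto& s : *out_stypes) s = kDefaultStorage;
  *dispatch_mode = supported ? kFComputeEx : kFComputeFallback;
  return true;
}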



Besides, the existing memory planning pass calculates and afterwards allocates
memory strictly according to the input/output tensor shapes (which can be obtained
from operators' arithmetic formulas through InferShape). That's no longer sufficient
when we come to accelerators like MKL-DNN on CPU, which want to pad input/output
tensors to optimal formats (e.g. nchw16c) according to the hardware architecture;
such a layout can also be described as shape + stride. As many of you know,
MKL-DNN shows great performance on these optimal formats, which are blocked by
the vector length of AVX512 or AVX2. It's very natural for us to pad the
channel dimension of those inputs/outputs whose IC or OC is not a multiple of the
vector length and leverage the optimal kernels for blocked formats. Unfortunately
this cannot be implemented without changing the logic in the memory planning
pass. Currently we always fall back to slow reference kernels for both
convolution [2] and deconvolution [3].
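
As an illustration of what padding plus the blocked format means at the buffer
level, here is a minimal C++ sketch with ordinary loops (MKL-DNN performs this
reorder internally through its own primitives; the layout math below only shows
where the extra, zero-padded elements go):

#include <cstdint>
#include <vector>

// Reorder an NCHW float buffer into NCHW16c ("nchw16c"), zero-padding the channel
// dimension up to a multiple of 16. Output layout: [n][c/16][h][w][16].
std::vector<float> NchwToNchw16c(const std::vector<float>& src,
                                 int64_t n, int64_t c, int64_t h, int64_t w) {
  const int64_t kBlock = 16;
  const int64_t cb = (c + kBlock - 1) / kBlock;            // number of channel blocks
  std::vector<float> dst(n * cb * h * w * kBlock, 0.0f);   // channels c..16*cb-1 stay zero
  for (int64_t in = 0; in < n; ++in)
    for (int64_t ic = 0; ic < c; ++ic)
      for (int64_t ih = 0; ih < h; ++ih)
        for (int64_t iw = 0; iw < w; ++iw) {
          const int64_t block = ic / kBlock, lane = ic % kBlock;
          const int64_t src_idx = ((in * c + ic) * h + ih) * w + iw;
          const int64_t dst_idx = (((in * cb + block) * h + ih) * w + iw) * kBlock + lane;
          dst[dst_idx] = src[src_idx];
        }
  return dst;
}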



AFAIK, the padding feature of MKL-DNN has already been used in TensorFlow and 
other frameworks. We also found that, without supporting this feature, many 
other new features from MKL-DNN cannot be applied to MXNet,  such as the 
deconvolution primitive, winograd, etc.



Changes for this proposal can be divided into the following parts:
1.  Following the proposal in issue #13598, we need to add new
InferStorageTypeEx functions to operators which need to dispatch in a more
fine-grained way. This also needs the InferStorage pass to handle the new -Ex
functions, as we did for FCompute and FComputeEx.
2.  Attach more information to the computation graph/nodes, e.g.
accelerator-specific information. Currently we add `IsMKLDNN` directly during
operator registration if MXNET_USE_MKLDNN == 1, which looks simple and crude to me.
3.  Do memory planning according to more information: topology, shapes,
data types, in-place options and more accurate accelerator information
(accelerator path, memory size requirements, accelerator-wise attributes). A
rough sketch of points 2 and 3 is given below.
4.  Improve MKL-DNN operators so they can work on memory planned this way,
which may be larger than the arithmetic requirements, and work with the optimal
kernels. Also, with more accurate dispatching in InferStorageTypeEx, there is
no need for us to write complicated fallback logic in MKL-DNN operators.
5.  If users feel uncomfortable with the extra memory usage, we can disable
this feature via environment variables.
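
A rough sketch of points 2 and 3, in hypothetical, self-contained C++ (none of
these types exist in NNVM/MXNet; it only illustrates letting an accelerator-aware
pass attach a per-output memory requirement that the planner then honors):

#include <cstdint>
#include <vector>

// Hypothetical sketch only - not the actual NNVM/MXNet data structures.
struct TensorSpec {
  std::vector<int64_t> shape;         // arithmetic shape from InferShape, e.g. {32, 3, 256, 256}
  int64_t bytes_per_element = 4;      // FP32
};

struct AcceleratorHint {              // attached to a node by, e.g., an MKL-DNN-aware pass
  bool use_accelerator = false;
  std::vector<int64_t> padded_shape;  // e.g. {32, 16, 256, 256} for nchw16c
};

// The planner allocates max(arithmetic size, accelerator-requested size) per output.
int64_t PlannedBytes(const TensorSpec& spec, const AcceleratorHint& hint) {
  int64_t arith = spec.bytes_per_element;
  for (int64_t d : spec.shape) arith *= d;
  if (!hint.use_accelerator) return arith;
  int64_t padded = spec.bytes_per_element;
  for (int64_t d : hint.padded_shape) padded *= d;
  return padded > arith ? padded : arith;
}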



Since the memory planning pass is implemented in NNVM, we will also need support
from the TVM community.



Please let me know what you think. Thank you.



-tao



[1] https://github.com/apache/incubator-mxnet/issues/13598

[2] 
https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_convolution.cc#L194

[3] 
https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_deconvolution.cc#L55



CUDNN 7.5 Issues

2019-04-09 Thread Per da Silva
Dear MXNet community,

I've been trying to update the CI GPU images to CUDA 10, but the tests are
failing. I'm not sure why and would really appreciate some help =D

I've managed, at least, to narrow down the problem to the cuDNN version.
The current CUDA 10 image uses cuDNN version 7.5.0.56 (
https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
).

I noticed that the binary in the python packages we release uses cuDNN
7.3.1.20 (
https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34),
so decided to create a PR with CI updated to CUDA 10 with cuDNN 7.3.1.20
and sure enough the tests passed (
https://github.com/apache/incubator-mxnet/pull/14513).

After talking with another contributor, we decided that I would try to
create a PR with CUDA 10 and cuDNN 7.5 and just disable the failing tests
(to be fixed later). But, it seems the problem is a bit more heinous. I
disable one test, and another one fails...So, it might make sense to reach
out now and see if we can find the root cause and fix it.

Some things I've sanity checked:

We run the tests on g3.8xlarge instances. These instances contain Tesla M60
GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10 supports
compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA).

According to the cuDNN support matrix (
https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html),
cuDNN 7.5 is compatible with the GPU, CUDA 10, and requires driver r410.48
(I assume greater or equal).

The AMIs running on the g3.8xlarge have CUDA 10 and driver 410.73.

So, as best I can tell, our environment ought to support cuDNN 7.5, which
leads me to conclude that maybe there's something wrong in the code.

The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check failed: e
== CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".

According to the cuDNN user guide (
https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html):

CUDNN_STATUS_ARCH_MISMATCH

The function requires a feature absent from the current GPU device. Note
that cuDNN only supports devices with compute capabilities greater than or
equal to 3.0.

To correct: compile and run the application on a device with appropriate
compute capability.

But, as we've seen, our environment seems to support this version of cuDNN
and other versions go through CI w/o any problem...
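
For anyone digging into this, here is a small stand-alone check I put together (my
own sketch, not part of the MXNet code base; build it with nvcc, or with g++ linked
against cudart and cudnn). It prints the compute capability and the CUDA/cuDNN
versions actually visible at runtime, which is what the numbers above are based on:

#include <cstdio>
#include <cuda_runtime.h>
#include <cudnn.h>

int main() {
  int driver = 0, runtime = 0;
  cudaDriverGetVersion(&driver);    // CUDA version supported by the installed driver
  cudaRuntimeGetVersion(&runtime);  // CUDA runtime the binary was built against
  std::printf("CUDA driver %d, runtime %d, cuDNN header %d, cuDNN library %zu\n",
              driver, runtime, CUDNN_VERSION, cudnnGetVersion());

  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    // A Tesla M60 should report compute capability 5.2 here.
    std::printf("GPU %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
  }
  return 0;
}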

You can see some logs here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/

I have about 13 runs of this pipeline. The errors for different runs can be
seen by changing the number before /pipeline (e.g.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
for the 2nd run, etc.)

Thanks in advance for the help!

You can reach me here or on Slack if you have any questions =D

Cheers,

Per

P.S. I'm attaching some instructions on how to reproduce the issue at home
(or at least on a g3.8xlarge instance running ubuntu 16.04).
# Launch g3.8xlarge instance

# ==-_-==-_-== Environment Setup ==-_-==-_-==

sudo apt update
sudo apt-get install -y \
apt-transport-https \
build-essential \
ca-certificates \
curl \
git \
libatlas-base-dev \
libcurl4-openssl-dev \
libjemalloc-dev \
libhdf5-dev \
liblapack-dev \
libopenblas-dev \
libopencv-dev \
libturbojpeg \
libzmq3-dev \
ninja-build \
software-properties-common \
sudo \
unzip \
wget

sudo apt-get install -y python-dev python3-dev virtualenv wget

# The version of pip shipped with Ubuntu may be too old; install a recent version here.
wget -nv https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo python2 get-pip.py

pip2 install --user nose cpplint==1.3.0 pylint==1.9.3 'numpy<=1.15.2,>=1.8.2' \
    nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3
pip3 install --user nose cpplint==1.3.0 pylint==2.1.1 'numpy<=1.15.2,>=1.8.2' \
    nose-timer 'requests<2.19.0,>=2.18.4' h5py==2.8.0rc1 scipy==1.0.1 boto3

# ==-_-==-_-== CUDA Installation ==-_-==-_-==

wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
chmod +x cuda_10.0.130_410.48_linux && sudo ./cuda_10.0.130_410.48_linux

# Installation excerpt:
# Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
# (y)es/(n)o/(q)uit: y
# 
# Do you want to install the OpenGL libraries?
# (y)es/(n)o/(q)uit [ default is yes ]:
#
# Do you want to run nvidia-xconfig?
# This will update the system X configuration file so that the NVIDIA X driver
# is used. The pre-existing X configuration file will be backed up