RE: Include MKLDNN into default mxnet pip package

2018-11-21 Thread Lv, Tao A
Thanks for the information, Kellen and Naveen.

Unlike onnx-tensorrt, MKL-DNN already provides versioning and release tags. 
My concern is that MKL-DNN is still under intensive development: if a new 
feature or bug fix lands on its master branch, do we really want to wait for 
the next release to get it supported in MXNet?

Take the LSTM regression as an example. MKL-DNN will probably land a fix or 
improvement on its master branch soon; do we need to wait for the 0.18 
release to get it fixed for MXNet users? AFAIK, TensorFlow also pins MKL-DNN 
to a specific commit id, not a release.
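The release-pinning workflow under discussion can be sketched with plain git. The demo below is self-contained in a throwaway directory; the repository name `dep` and the tag `v0.17` are made up for illustration and are not MXNet's actual layout.

```shell
set -e
# Self-contained demo in a temp dir: pin a submodule to a release tag
# instead of a floating master commit. "dep" and "v0.17" are made-up names.
demo=$(mktemp -d)
cd "$demo"

# A stand-in for the dependency (think MKL-DNN) with a tagged release
# and newer, unreleased work on master.
git init -q dep
git -C dep -c user.email=a@b -c user.name=a commit -q --allow-empty -m "0.17 release"
git -C dep tag v0.17
git -C dep -c user.email=a@b -c user.name=a commit -q --allow-empty -m "bleeding-edge work"

# The superproject records the tagged commit, not the tip of master.
git init -q app
cd app
git -c protocol.file.allow=always submodule add -q "$demo/dep" 3rdparty/dep
git -C 3rdparty/dep checkout -q v0.17
git add 3rdparty/dep
git -c user.email=a@b -c user.name=a commit -q -m "Pin dep submodule to v0.17"

git -C 3rdparty/dep describe --tags   # shows the pinned state
```

Once the superproject records the tagged commit, a fix that lands only on the dependency's master is not picked up until the pin is deliberately moved, which is exactly the trade-off being discussed here.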

Regarding the LSTM regression, we are using internal JIRA tickets rather than 
GitHub issues to track MKL-DNN defects. But I agree with you; we should 
update its progress in Alex's issue.

Thanks,
-tao

-Original Message-
From: kellen sunderland [mailto:kellen.sunderl...@gmail.com] 
Sent: Thursday, November 22, 2018 10:55 AM
To: dev@mxnet.incubator.apache.org
Subject: Re: Include MKLDNN into default mxnet pip package

Agree with your point about other repos also not being based on versioning, 
Tao. I would point out that I've given some repos that I've worked with 
similar feedback: https://github.com/onnx/onnx-tensorrt/issues/68

On Wed, Nov 21, 2018 at 6:48 PM Naveen Swamy  wrote:

> Tao,
>
> You are right, there are many submodules in 3rdparty. We have to start
> somewhere, and I believe this one is a good candidate. The point is not
> to tie MXNet releases to the releases of the submodules, but to pick
> only stable releases rather than bleeding-edge commits from the tip of
> master. That gives us confidence in the submodules MXNet users depend
> on, especially if we make MKLDNN the default.
>
> Good to know it is already a known regression. Alex has created
> https://github.com/apache/incubator-mxnet/issues/13369; please add
> details there and link the corresponding MKLDNN issue (I couldn't find
> one).
>
> -Naveen
>
> On Wed, Nov 21, 2018 at 6:04 PM Lv, Tao A  wrote:
>
> > Here are my answers to the questions from Kellen and Naveen about
> > MKL-DNN. This doesn't mean I'm in favor of making MKL-DNN the default
> > here.
> >
> > @Kellen,
> >
> > FYI, here is the list of platforms officially supported by MKL-DNN:
> > https://github.com/intel/mkl-dnn#system-requirements
> >
> > Most of the computation-intensive kernels in MKL-DNN are JITed, so
> > they generate code for the running platform at runtime. The non-JIT
> > code in MKL-DNN, like the rest of the MXNet code base, is compiled
> > according to the compiler options/flags. We can set -DARCH_OPT_FLAGS
> > when building MKL-DNN to avoid optimizing for the compiling machine;
> > that's exactly what we do for the MKL-DNN build in MXNet. Even without
> > MKL-DNN, I have noticed issues with illegal instructions in MXNet when
> > users import the pip package on a lower-end machine which probably
> > only supports SSE.
> >
> > @Naveen,
> >
> > The LSTM issue has already been identified as a regression in the
> > recent version of MKL-DNN. Hopefully it will be fixed soon with a new
> > update of MKL-DNN.
> >
> > MXNet has many submodule dependencies under the 3rdparty folder, and
> > it seems we don't require release versions for most of them. The
> > release cadences of MKL-DNN and MXNet are not well matched. I think it
> > would be a risk for MXNet releases to depend strictly on the release
> > of one submodule, let alone on the releases of all submodules.
> >
> > -tao
> >
> > -Original Message-
> > From: Naveen Swamy [mailto:mnnav...@gmail.com]
> > Sent: Thursday, November 22, 2018 9:08 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: d...@mxnet.apache.org
> > Subject: Re: Include MKLDNN into default mxnet pip package
> >
> > Hi Alex,
> >
> > Thanks for promptly running the numbers on AMD and reporting here.
> >
> > Can you please update the AMD numbers here for posterity?
> > https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking
> >
> > Are there any outstanding issues when MKLDNN is enabled? From my
> > offline conversations I am aware of performance issues with LSTM; is
> > there a GitHub issue for it?
> >
> > MKLDNN is a submodule dependency; are we pulling the latest commit or
> > releases? If not, we should move to releases before we make it a
> > default. Ideally we should use platform-specific distributions (-dev
> > packages); at the least we should rely on well-tested releases.
> >
> >
> > Thanks, Naveen
> >
> > On Wed, Nov 21, 2018 at 4:55 PM Zai, Alexander
>  > >
> > wrote:
> >
> > > AMD benchmarks have been published. We are seeing a 15.8x speedup
> > > with Resnet50 (batch size 32) on AWS's new m5a.24xlarge machine.
> > > With a smaller network (Mobilenet, batch size 32) the speedup is
> > > more significant, at 38.7x. Let's have a vote to see if the PR to
> > > have MKLDNN enabled by default
> > > (https://github.com/apache/incubator-mxnet/pull/12591) can be
> > > merged before the 1.4.0 release.



RE: Include MKLDNN into default mxnet pip package

2018-11-21 Thread Zhao, Patric
Hi Kellen,

Thank you very much for recognizing our work :)

This is a great joint effort between the community (Wu Jun, Zheng Da, etc.) 
and the Intel team.

We are continuously improving the quantization flow, and more amazing 
features will be ready soon.

Thanks,

--Patric

> -Original Message-
> From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> Sent: Thursday, November 22, 2018 9:07 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: Include MKLDNN into default mxnet pip package
> 
> I've spent the last few days testing MXNet w/ MKLDNN and quantized models
> and it's a beast.  Really good speed improvements on my models, no bugs
> that I've noticed.
> 
> I'm in general supportive, but I'm still wondering what the story is like
> when there are no AVX instructions present on the CPU.  Do we get an
> illegal instruction error, or does it fall back gracefully?  So far it
> sounds like it works on Threadripper and Zen AMD CPUs.  I can try on a
> Ryzen.  What about older Intel or AMD CPUs?
> 



Re: Include MKLDNN into default mxnet pip package

2018-11-21 Thread kellen sunderland
I've spent the last few days testing MXNet w/ MKLDNN and quantized models
and it's a beast.  Really good speed improvements on my models, no bugs
that I've noticed.

I'm in general supportive, but I'm still wondering what the story is like
when there are no AVX instructions present on the CPU.  Do we get an
illegal instruction error, or does it fall back gracefully?  So far it
sounds like it works on Threadripper and Zen AMD CPUs.  I can try on a
Ryzen.  What about older Intel or AMD CPUs?



Re: Include MKLDNN into default mxnet pip package

2018-11-21 Thread Zai, Alexander
AMD benchmarks have been published. We are seeing a 15.8x speedup with 
Resnet50 (batch size 32) on AWS's new m5a.24xlarge machine. With a smaller 
network (Mobilenet, batch size 32) the speedup is more significant, at 
38.7x. Let's have a vote to see if the PR to have MKLDNN enabled by default 
(https://github.com/apache/incubator-mxnet/pull/12591) can be merged before 
the 1.4.0 release.

On 10/19/18, 9:17 AM, "Pedro Larroy"  wrote:

I did pip install mxnet-mkl==1.3.1b20181018 on an AMD Ryzen 1950X and unit
tests are passing.

Is this build using AVX512? In /proc/cpuinfo I see only the "avx" flag.
There's no "avx2" like on recent Intel CPUs.

Pedro.
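For anyone repeating Pedro's check, here is a slightly more systematic version that lists which SIMD levels the local CPU advertises. It is Linux-only (it reads /proc/cpuinfo), and the set of instruction-set flags probed is just an illustrative selection.

```shell
# List which SIMD levels this CPU advertises (Linux-only: /proc/cpuinfo;
# the "Features" spelling covers ARM, where none of these will match).
cpuflags=$(grep -m1 -E '^(flags|Features)' /proc/cpuinfo | cut -d: -f2-)
report=$(for isa in sse4_2 avx avx2 avx512f; do
  case " $cpuflags " in
    *" $isa "*) echo "$isa: yes" ;;
    *)          echo "$isa: no" ;;
  esac
done)
echo "$report"
```

A machine that prints "avx512f: no" would exercise the fallback paths being asked about in this thread.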

On Fri, Oct 19, 2018 at 5:12 PM Hagay Lupesko  wrote:

> Awesome collaborative effort across many contributors and companies!
>
> The boost is impressive and for MXNet users to get this boost "out of the
> box" is a great benefit and makes MXNet an even better choice.
>
> Alex - can you clarify whether there are any downsides with regards to
> non-AVX-512 architectures, AMD CPUs, etc? Will it gracefully fall back?
>
> Hagay
>
>
> On Fri, Oct 19, 2018, 15:46 Sergio Fernández  wrote:
>
> > If there is no downside on platforms not supporting AVX512 instructions,
> > then +1
> >
> >
> > On Wed, Oct 17, 2018, 14:10 Alex Zai  wrote:
> >
> > > Hey all,
> > > We have been working hard these past few months to integrate and
> > > stabilize Intel's MKLDNN deep learning CPU accelerator into MXNet,
> > > and have made incredible progress. On CPUs with AVX512 instructions
> > > (such as c5.18x) we have seen performance increases of up to 12x,
> > > and on other platforms (Macs, AVX2) we have seen a speedup of 1.5x+.
> > > The full list of benchmarks can be found at
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95650764
> > > and https://github.com/apache/incubator-mxnet/pull/12591.
> > >
> > > Currently, using this accelerator requires the developer to either
> > > pip install the mxnet-mkl version of mxnet or to build it themselves
> > > from source. Given that we should try to provide the best
> > > performance "out of the box" with mxnet, we should include this in
> > > the default build. The mkldnn library is included within the pip
> > > package build, so it does not require an external dependency.
> > >
> > > There were concerns that MKLDNN could cause regressions on certain
> > > platforms (as it did with the tensorflow version a while back), but
> > > we added an env flag (MXNET_MKLDNN_ENABLED) that allows users to
> > > turn off this feature at runtime. Please bring up any other concerns
> > > you may have and your thoughts on including this accelerator in the
> > > default build.
> > >
> > > Best,
> > > Alex
> > >
> >
>
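The runtime kill switch Alex describes is an environment variable, so opting out requires no rebuild. A minimal sketch follows; the flag name comes from the proposal, while the convention that 0 disables the feature is my assumption here.

```shell
# Turn the MKL-DNN code path off for this shell session only; any MXNet
# process started afterwards would fall back to the default CPU operators.
# (Flag name per the proposal above; 0 = disabled is assumed here.)
export MXNET_MKLDNN_ENABLED=0
echo "MXNET_MKLDNN_ENABLED=$MXNET_MKLDNN_ENABLED"
```

Because it is read at runtime, a user who hits a platform-specific regression can disable MKL-DNN without reinstalling a different pip package.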




Re: CI impaired

2018-11-21 Thread Qing Lan
Appreciate your effort and help to make CI a better place!

Qing 

On 11/21/18, 4:38 PM, "Lin Yuan"  wrote:

Thanks for your efforts, Marco!

On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian 
wrote:

> Thanks for the quick response and mitigation!
>
> On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
>  wrote:
>
> > Hello,
> >
> > today, CI had some issues and I had to cancel all jobs a few minutes
> > ago. This was basically caused by the high load that is currently
> > being put on our CI system due to the pre-release efforts for this
> > Friday.
> >
> > It's really unfortunate that we just had outages of three core
> > components within the last two days - sorry about that! To recap, we
> > had the following outages (which are unrelated to the parallel
> > refactor of the Jenkins pipeline):
> > - (yesterday evening) The Jenkins master ran out of disk space and
> > thus processed requests at reduced capacity.
> > - (this morning) The Jenkins master got updated, which broke our
> > autoscaling's upscaling capabilities.
> > - (new, this evening) The Jenkins API was unresponsive: due to the
> > high number of jobs and a bad API design in the Jenkins REST API, the
> > time-complexity of a simple create or delete request was quadratic,
> > which resulted in all requests timing out (that was the current
> > outage). This left our auto scaling unable to interface with the
> > Jenkins master.
> >
> > I have now made improvements to our REST API calls which reduced the
> > complexity from O(N^2) to O(1). The cause was an underlying redirect
> > loop in the Jenkins createNode and deleteNode REST API, in combination
> > with unrolling the entire slave and job graph (which got quite huge
> > under extensive load) upon every single request. Since we had about
> > 150 registered slaves and 1000 jobs in the queue, the duration of a
> > single REST API call rose to up to 45 seconds (and we execute up to a
> > few hundred queries per auto scaling loop). This led to our auto
> > scaling timing out.
> >
> > Everything should be back to normal now. I'm closely observing the
> > situation and I'll let you know if I encounter any additional issues.
> >
> > Again, sorry for any caused inconvenience.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell 
> > wrote:
> >
> > > Yes, let me add to the kudos, very nice work Marco.
> > >
> > >
> > > "I'm trying real hard to be the shepherd." -Jules Winnfield
> > >
> > >
> > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > >  wrote:
> > > >
> > > > Appreciate the big effort in bring the CI back so quickly.  Thanks
> > Marco.
> > > >
> > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <
> marco.g.ab...@googlemail.com
> > .INVALID>
> > > wrote:
> > > > Thanks Aaron! Just for the record, the new Jenkins jobs were
> unrelated
> > to
> > > > that incident.
> > > >
> > > > If somebody is interested in the details around the outage:
> > > >
> > > > Due to a required maintenance (disk running full), we had to upgrade
> > our
> > > > Jenkins master because it was running on Ubuntu 17.04 (for an 
unknown
> > > > reason, it used to be 16.04) and we needed to install some packages.
> > > Since
> > > > the support for Ubuntu 17.04 was stopped, this resulted in all
> package
> > > > updates and installations to fail because the repositories were 
taken
> > > > offline. Due to the unavailable maintenance package and other issues
> > with
> > > > the installed OpenJDK8 version, we made the decision to upgrade the
> > > Jenkins
> > > > master to Ubuntu 18.04 LTS in order to get back to a supported
> version
> > > with
> > > > maintenance tools. During this upgrade, Jenkins was automatically
> > updated
> > > > by APT as part of the dist-upgrade process.
> > > >
> > > > In the latest version of Jenkins, some labels have been changed 
which
> > we
> > > > depend on for our auto scaling. To be more specific:
> > > >> Waiting for next available executor on mxnetlinux-gpu
> > > > has been changed to
> > > >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > > > Notice the quote characters.
> > > >
> > > > Jenkins does not offer a better way than to parse these messages
> > > > unfortunately - there's no standardized way to express queue items.
> > Since
> > > > our parser expected the above message without quote signs, this
> message
> > > was
> > > > discarded.
> > > >
> > > > We support various queue reasons (5 of them to be exact) that
> indicate
> > > > resource starvation. If we run super low on capacity, 

Re: CI impaired

2018-11-21 Thread Marco de Abreu
Hello,

today, CI had some issues and I had to cancel all jobs a few minutes ago.
This was basically caused by the high load that is currently being put on
our CI system due to the pre-release efforts for this Friday.

It's really unfortunate that we just had outages of three core components
within the last two days - sorry about that! To recap, we had the
following outages (which are unrelated to the parallel refactor of the
Jenkins pipeline):
- (yesterday evening) The Jenkins master ran out of disk space and thus
processed requests at reduced capacity
- (this morning) The Jenkins master got updated, which broke our
auto scaling's upscaling capabilities.
- (new, this evening) Jenkins API was unresponsive: Due to the high number
of jobs and a bad API design in the Jenkins REST API, the time-complexity
of a simple create or delete request was quadratic which resulted in all
requests timing out (that was the current outage). This resulted in our
auto scaling to be unable to interface with the Jenkins master.

I have now made improvements to our REST API calls which reduced the
complexity from O(N^2) to O(1). The reason was an underlying redirect loop
in the Jenkins createNode and deleteNode REST API in combination with
unrolling the entire slave and job graph (which got quite huge during
extensive load) upon every single request. Since we had about 150
registered slaves and 1000 jobs in the queue, the duration for a single
REST API call rose to up to 45 seconds (we execute up to a few hundred
queries per auto scaling loop). This led to our auto scaling timing out.
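The O(N^2)-to-O(1) fix described above amounts to replacing one expensive graph-unrolling request per node with a single snapshot per auto-scaling loop. A small illustrative sketch (not the actual mxnet-ci autoscaling code; `SlowJenkins` is a stand-in for a master whose every REST response serializes the whole node graph):

```python
class SlowJenkins:
    """Stand-in for a Jenkins master where every request unrolls the
    entire slave/job graph, i.e. each call costs O(N) on the server."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.requests = 0

    def full_graph(self):
        self.requests += 1
        return {n: {"idle": True} for n in self.nodes}  # O(N) payload

def node_states_naive(jenkins):
    # One request per node, each O(N) on the server: O(N^2) overall.
    return {n: jenkins.full_graph()[n] for n in jenkins.nodes}

def node_states_batched(jenkins):
    # One request per scaling loop: O(1) requests, one O(N) payload.
    snapshot = jenkins.full_graph()
    return {n: snapshot[n] for n in jenkins.nodes}

j = SlowJenkins(["slave-%d" % i for i in range(150)])
node_states_naive(j)
print(j.requests)   # 150 requests
j.requests = 0
node_states_batched(j)
print(j.requests)   # 1 request
```

With ~150 registered slaves and hundreds of lookups per scaling loop, moving the per-request cost out of the loop is what brings the 45-second calls back down.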

Everything should be back to normal now. I'm closely observing the
situation and I'll let you know if I encounter any additional issues.

Again, sorry for any inconvenience caused.

Best regards,
Marco


Re: Splitting Jenkins pipelines - stop changes to Jenkinsfiles!

2018-11-21 Thread Anirudh
Hi Marco,

Can you point out specifically which checks we have to make sure pass
before merging PRs? Currently, apart from the required one, there are six
steps added. Also, is the CI down currently:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13324/17/pipeline


Anirudh

On Wed, Nov 21, 2018 at 9:31 AM Marco de Abreu
 wrote:

> Please notice that the "continuous-integration/jenkins/pr-merge" currently
> is overlapping with the new pipelines. Please make sure all checks pass
> (also the non-required ones) before merging the PRs. I will work on a fix
> for this overlap.
>
> -Marco
>
> On Wed, Nov 21, 2018 at 5:42 PM Anton Chernov  wrote:
>
> > The ability to retrigger the pipelines separately is an amazing step
> > forward. Great job Marco!
> >

Re: A New API for creating .rec files

2018-11-21 Thread Anirudh Acharya
Hi All,

Sorry for the delay, but here is the design spec for the API -
https://cwiki.apache.org/confluence/display/MXNET/Image+Transforms+and+RecordIO+file+Creation

Look forward to feedback from the community.


Regards
Anirudh


On Tue, Sep 25, 2018 at 2:15 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> This makes a lot of sense to me Anirudh.
>
> On Tue, Sep 25, 2018 at 11:38 AM Anirudh Acharya 
> wrote:
>
> > Hi,
> >
> > During some recent MXNet user surveys, one of the user requests was to
> have
> > an im2rec API that will have similar functionality to the im2rec tool(
> > https://mxnet.incubator.apache.org/faq/recordio.html?highlight=im2rec).
> > The
> > advantage with the API would be that the user can access this
> functionality
> > from the PyPi package itself, instead of cloning the repo.
> >
> > I was thinking of converting the tool into an API call under the mx.io
> > package. I will send the API design shortly. I wanted to know what the
> > community thinks of this change.
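The idea behind a .rec file is to pack many images into one flat, sequentially readable file of length-prefixed records. A simplified sketch of that concept (illustration only; the real MXRecordIO binary format additionally uses a magic number and padding, and the function names here are hypothetical, not the proposed API):

```python
import struct
import io

def write_records(stream, records):
    """Append each payload with a 4-byte little-endian length prefix."""
    for payload in records:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)

def read_records(stream):
    """Read records back sequentially until the stream is exhausted."""
    out = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # end of file (or truncated header)
        (length,) = struct.unpack("<I", header)
        out.append(stream.read(length))
    return out

buf = io.BytesIO()
write_records(buf, [b"image-0-bytes", b"image-1-bytes"])
buf.seek(0)
print(read_records(buf))  # [b'image-0-bytes', b'image-1-bytes']
```

Exposing this packing step as a library call (rather than a standalone script) is what would let pip users build .rec files without cloning the repo.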
> >
> >
> > Thanks
> > Anirudh Acharya
> >
>


Re: CI impaired

2018-11-21 Thread Gavin M Bell
Yes, let me add to the kudos, very nice work Marco. 


"I'm trying real hard to be the shepherd." -Jules Winnfield


> On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen  
> wrote:
> 
> Appreciate the big effort in bring the CI back so quickly.  Thanks Marco.
> 

Re: CI impaired

2018-11-21 Thread Sunderland, Kellen
Appreciate the big effort in bringing the CI back so quickly.  Thanks Marco.



Re: [RESULTS] [VOTE] Release Apache MXNet (incubating) version 1.3.1.rc0

2018-11-21 Thread Anton Chernov
The vote on the @general list of incubator.apache.org has been started:

https://lists.apache.org/thread.html/adf86e1c3332559ad91880a412aa1063dc72cd6f4f3d6c4c0d91a2dd@%3Cgeneral.incubator.apache.org%3E

The vote closes on 24th of November 2018 14:30 CET.


Best
Anton


On Tue, 20 Nov 2018 at 20:34, Hagay Lupesko wrote:

> Great - congrats!
>
> On Tue, Nov 20, 2018 at 8:51 AM Anton Chernov  wrote:
>
> > Dear MXNet community,
> >
> > I'm happy to announce the results of the vote.
> >
> > This vote passes with 8 +1 votes (4 binding) and no 0 or -1 votes.
> >
> > +1 votes
> >
> > * Carin / binding
> > * Indhu / binding
> > * Sandeep / binding
> > * Jim / binding
> > * Kellen
> > * Steffen
> > * Roshani
> > * Aaron
> >
> > 0 votes
> > * No votes
> >
> > -1 votes
> > * No votes
> >
> > Vote thread can be found here [1]. The list of members can be found here
> > [2].
> >
> > I'll continue with the release process and the release announcement will
> > follow in the next few days.
> >
> >
> > Best
> > Anton
> >
> > [1]
> >
> >
> https://lists.apache.org/thread.html/32ab13b6d2d80fd75dbc2ec62151d12d09f6e0ca89799ae0aa26894b@%3Cdev.mxnet.apache.org%3E
> > [2] http://incubator.apache.org/projects/mxnet.html
> >
>


Re: Splitting Jenkins pipelines - stop changes to Jenkinsfiles!

2018-11-21 Thread Marco de Abreu
Hello,

the PR has been merged and I've created the new pipelines at [1]. You can
see the new reports if you have a look at this example PR at [2].

The new status messages will be the ones starting with
"ci/jenkins/mxnet-validation/".

This now allows you to retrigger specific pipelines if they fail. For
example, if you're interested in the website pipeline, you can now go to
[3] and just retrigger that instead of running the entire suite. Whenever
there's a new commit, all pipelines will still be scheduled as before (the
overall behaviour or coverage of our pipeline did not change, I just
decoupled them and increased the usability).

The next step will be the deprecation of the main Jenkinsfile (the one
which reports the status as "continuous-integration/jenkins/pr-merge") and
requesting these new statuses to be marked as required (protected master
branch). Since we have to change some reporting tools to point to the new
jobs and I'd like to observe the stability for some time, this will take
some time.

You can now resume changes in the Jenkinsfiles. But please do not modify
the Jenkinsfile in the root directory but instead the ones at [4]. The
nightly Jenkinsfiles (or basically all Jenkinsfiles that are not part of
the main pipeline) have not been migrated yet and I will do that at a later
point in time.

Best regards,
Marco

[1]: http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/
[2]: https://github.com/apache/incubator-mxnet/pull/13352
[3]:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwebsite/detail/PR-13352/1/pipeline
[4]: https://github.com/apache/incubator-mxnet/tree/master/ci/jenkins

On Tue, Nov 20, 2018 at 9:33 PM Marco de Abreu 
wrote:

> I have just submitted my PR at
> https://github.com/apache/incubator-mxnet/pull/13344. Test jobs are
> available at
> http://jenkins.mxnet-ci-dev.amazon-ml.com/view/test-marco-mxnet/.
>
> As soon as I'm done with my tests, I will mark it as ready for review.
>
> Best regards,
> Marco
>
> On Tue, Nov 20, 2018 at 9:09 PM Marco de Abreu <
> marco.g.ab...@googlemail.com> wrote:
>
>> Thanks, Pedro!
>>
>> I have also been looking into that issue, but it seems like this would
>> require changes in the groovy interpreter of Jenkins. From what I can tell,
>> a refactor will give us multiple benefits (clarity and speed) aside from
>> resolving this issue.
>>
>> Best regards,
>> Marco
>>
>> Am Di., 20. Nov. 2018, 19:54 hat Pedro Larroy <
>> pedro.larroy.li...@gmail.com> geschrieben:
>>
>>> I think this is a big problem, which has blocked us before. I want to
>>> point out that you are doing a great thing by avoiding everyone
>>> getting blocked by refactoring the pipelines.
>>>
>>> My concern is that we are kicking the can down the road and not
>>> addressing the root cause of the problem, which is known:
>>> https://issues.jenkins-ci.org/browse/JENKINS-37984
>>>
>>> Pedro.
>>>
>>>
>>> On Tue, Nov 20, 2018 at 6:08 PM Marco de Abreu
>>>  wrote:
>>> >
>>> > Hello Steffen,
>>> >
>>> > no, there won't be any impact on the PR process or nightly regressions.
>>> > Only the reporting will have to be updated with the new job links, but
>>> that
>>> > should be a minor issue. To avoid any outage, I have been thinking
>>> about
>>> > running both versions in parallel.
>>> >
>>> > Best regards,
>>> > Marco
>>> >
>>> >
>>> >
>>> > On Tue, Nov 20, 2018 at 5:53 PM Steffen Rochel <
>>> steffenroc...@gmail.com>
>>> > wrote:
>>> >
>>> > > Hi Marco - is there any impact on reporting, the PR process or
>>> nightly
>>> > > regression besides a reduction in TAT? If yes, please elaborate.
>>> > > Steffen
>>> > >
>>> > > On Tue, Nov 20, 2018 at 8:05 AM Marco de Abreu
>>> > >  wrote:
>>> > >
>>> > > > Hello,
>>> > > >
>>> > > > we ran into issues around the maximum filesize of the Jenkinsfile
>>> a few
>>> > > > times already. In order to resolve this issue, I'd like to combine
>>> this
>>> > > > with some refactors I have planned for quite some time.
>>> > > >
>>> > > > The idea is basically to move away from one big Jenkinsfile and
>>> instead
>>> > > > split it into separate jobs that run in parallel and report their
>>> status
>>> > > > individually. Besides avoiding the size restriction, this will
>>> greatly
>>> > > > speed up the PR validation process by reducing the critical path.
>>> Instead
>>> > > > of having to wait for every single step within a stage to finish
>>> before
>>> > > the
>>> > > > next stage (e.g. tests) is getting executed, these pipelines would
>>> now be
>>> > > > able to move forward individually. I'm still in the process of
>>> > > refactoring
>>> > > > and can't provide any numbers or documentation at this time, but I
>>> would
>>> > > > like to announce this early on to avoid conflicts:
>>> > > >
>>> > > > Since I will remove the original Jenkinsfile, this might cause
>>> conflicts
>>> > > > with ongoing efforts that try to change the Jenkinsfile. This
>>> poses the
>>> > > > risk that I might forget to port a 

Re: CI impaired

2018-11-21 Thread Marco de Abreu
Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
that incident.

If somebody is interested in the details around the outage:

Due to a required maintenance (disk running full), we had to upgrade our
Jenkins master because it was running on Ubuntu 17.04 (for an unknown
reason, it used to be 16.04) and we needed to install some packages. Since
the support for Ubuntu 17.04 was stopped, this resulted in all package
updates and installations to fail because the repositories were taken
offline. Due to the unavailable maintenance package and other issues with
the installed OpenJDK8 version, we made the decision to upgrade the Jenkins
master to Ubuntu 18.04 LTS in order to get back to a supported version with
maintenance tools. During this upgrade, Jenkins was automatically updated
by APT as part of the dist-upgrade process.

In the latest version of Jenkins, some labels have been changed which we
depend on for our auto scaling. To be more specific:
> Waiting for next available executor on mxnetlinux-gpu
has been changed to
> Waiting for next available executor on ‘mxnetlinux-gpu’
Notice the quote characters.

Unfortunately, Jenkins does not offer a better way than to parse these
messages - there's no standardized way to express queue items. Since
our parser expected the above message without quote signs, this message was
discarded.

We support various queue reasons (5 of them to be exact) that indicate
resource starvation. If we run super low on capacity, the queue reason is
different and we would still be able to scale up, but most of the cases
would have printed the unsupported message. This resulted in reduced
capacity (to be specific, the limit during that time was 1 slave per type).

We have now fixed our autoscaling to automatically strip these characters
and added that message to our test suite.
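The fix described above boils down to normalizing the queue-reason string so that both the old (unquoted) and new (typographically quoted) label wordings map to the same slave type. A hedged sketch of such a parser (illustrative only; the real autoscaling code and its supported queue reasons are not shown here):

```python
import re

# Accept the label bare or wrapped in typographic quotes (U+2018/U+2019),
# as emitted by the newer Jenkins version.
QUEUE_REASON = re.compile(
    r"Waiting for next available executor on ['\u2018\u2019]?"
    r"(?P<label>[\w-]+)['\u2018\u2019]?$"
)

def parse_queue_reason(reason):
    """Extract the slave label from a Jenkins queue-reason message,
    or return None for unsupported messages."""
    match = QUEUE_REASON.match(reason.strip())
    return match.group("label") if match else None

print(parse_queue_reason("Waiting for next available executor on mxnetlinux-gpu"))
print(parse_queue_reason("Waiting for next available executor on \u2018mxnetlinux-gpu\u2019"))
# both print: mxnetlinux-gpu
```

Keeping both message variants in the test suite is what guards against the next silent wording change.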

Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham 
wrote:

> Marco, thanks for your hard work on this. I'm super excited about the new
> Jenkins jobs. This is going to be very helpful and improve sanity for our
> PRs and ourselves!
>
> Cheers,
> Aaron
>
> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> 
> > Hello,
> >
> > the CI is now back up and running. Auto scaling is working as expected
> and
> > it passed our load tests.
> >
> > Please excuse the caused inconveniences.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > marco.g.ab...@googlemail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I'd like to let you know that our CI was impaired and down for the last
> > > few hours. After getting the CI back up, I noticed that our auto
> scaling
> > > broke due to a silent update of Jenkins which broke our
> > upscale-detection.
> > > Manual scaling is currently not possible and stopping the scaling won't
> > > help either because there are currently no p3 instances available,
> which
> > > means that all jobs will fail none the less. In a few hours, the auto
> > > scaling will have recycled all slaves through the down-scale mechanism
> > and
> > > we will be out of capacity. This will lead to resource starvation and
> > thus
> > > timeouts.
> > >
> > > Your PRs will be properly registered by Jenkins, but please expect the
> > > jobs to time out and thus fail your PRs.
> > >
> > > I will fix the auto scaling as soon as I'm awake again.
> > >
> > > Sorry for the inconvenience caused.
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > > 5:30AM now and I've been working for 17 hours.
> > >
> >
>


Re: CI impaired

2018-11-21 Thread Aaron Markham
Marco, thanks for your hard work on this. I'm super excited about the new
Jenkins jobs. This is going to be very helpful and improve sanity for our
PRs and ourselves!

Cheers,
Aaron

On Wed, Nov 21, 2018, 05:37 Marco de Abreu
wrote:

> Hello,
>
> the CI is now back up and running. Auto scaling is working as expected and
> it passed our load tests.
>
> Please excuse the inconvenience caused.
>
> Best regards,
> Marco
>
> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> marco.g.ab...@googlemail.com>
> wrote:
>
> > Hello,
> >
> > I'd like to let you know that our CI was impaired and down for the last
> > few hours. After getting the CI back up, I noticed that our auto scaling
> > broke due to a silent update of Jenkins which broke our
> upscale-detection.
> > Manual scaling is currently not possible and stopping the scaling won't
> > help either because there are currently no p3 instances available, which
> > means that all jobs will fail nonetheless. In a few hours, the auto
> > scaling will have recycled all slaves through the down-scale mechanism
> and
> > we will be out of capacity. This will lead to resource starvation and
> thus
> > timeouts.
> >
> > Your PRs will be properly registered by Jenkins, but please expect the
> > jobs to time out and thus fail your PRs.
> >
> > I will fix the auto scaling as soon as I'm awake again.
> >
> > Sorry for the inconvenience caused.
> >
> > Best regards,
> > Marco
> >
> >
> > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > 5:30AM now and I've been working for 17 hours.
> >
>


Re: CI impaired

2018-11-21 Thread Marco de Abreu
Hello,

the CI is now back up and running. Auto scaling is working as expected and
it passed our load tests.

Please excuse the inconvenience caused.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu 
wrote:

> Hello,
>
> I'd like to let you know that our CI was impaired and down for the last
> few hours. After getting the CI back up, I noticed that our auto scaling
> broke due to a silent update of Jenkins which broke our upscale-detection.
> Manual scaling is currently not possible and stopping the scaling won't
> help either because there are currently no p3 instances available, which
> means that all jobs will fail nonetheless. In a few hours, the auto
> scaling will have recycled all slaves through the down-scale mechanism and
> we will be out of capacity. This will lead to resource starvation and thus
> timeouts.
>
> Your PRs will be properly registered by Jenkins, but please expect the
> jobs to time out and thus fail your PRs.
>
> I will fix the auto scaling as soon as I'm awake again.
>
> Sorry for the inconvenience caused.
>
> Best regards,
> Marco
>
>
> P.S. Sorry for the brief email and my lack of further fixes, but it's
> 5:30AM now and I've been working for 17 hours.
>