Re: Custom C++ Operators

2019-12-14 Thread kellen sunderland
Awesome news Sam, should make maintaining and integrating custom ops a lot
easier.  Thanks for the efforts everyone.

On Mon, Dec 9, 2019 at 5:55 AM Skalicky, Sam 
wrote:

> Thanks Ciyong,
>
> Absolutely! Here's how a backward function is registered [1] and here's an
> example backward function for GEMM [2]. We'll be working on documentation
> and a blog post/tutorial soon; hopefully that will help clarify things as
> well.
>
> Keep the questions coming!
>
> Thanks,
> Sam
>
> [1]
> https://github.com/apache/incubator-mxnet/blob/master/example/extensions/lib_custom_op/gemm_lib.cc#L171
> [2]
> https://github.com/apache/incubator-mxnet/blob/master/example/extensions/lib_custom_op/gemm_lib.cc#L90-L116
>
> On Dec 8, 2019, at 6:48 AM, Chen, Ciyong <ciyong.c...@intel.com> wrote:
>
> Really great features, it will provide a more convenient way for
> deployment.
> BTW, does it support backward ops too?
>
> -Ciyong
>
> -Original Message-
> From: Marco de Abreu <marco.g.ab...@gmail.com>
> Sent: Sunday, December 8, 2019 2:56 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: Custom C++ Operators
>
> Awesome project, love it! It really seems easy to use, great job!
>
> -Marco
>
> Skalicky, Sam <sska...@amazon.com.invalid> wrote on Sat., 7 Dec. 2019,
> 19:50:
>
> Hi MXNet Community,
>
> We have been working on adding support for custom C++ operators for a
> while and are happy to announce that the initial functionality is now
> available for you to try out in the master branch!
>
> CustomOp support in MXNet began with allowing users to write custom
> operators in Python and has been available for years. If you wanted to
> write a high-performance C++ operator you had to do it by adding it to
> the MXNet source code, recompiling a custom version of MXNet, and
> distributing that custom build. The Custom C++ operator support
> enhances this by enabling users to write high-performance C++
> operators and compile them separately from MXNet. This frees up users
> from having to recompile MXNet from source and makes it easier to add
> custom operators to suit their needs.
>
> Here are a few pointers to get started:
> 1. Check out the overview in the cwiki [1]
> 2. Check out the PR [2]
> 3. Try it out using the new nightly builds that are available in S3 [3]
> 4. Leave feedback on features to add or things to fix in a follow-up PR
> here [4]
>
> Credit goes to everyone involved (in no particular order): Manu Seth,
> Sheng Zha, Jackie Wu, Junru Shao, Ziyi Mu.
>
> Special thanks to all the PR reviewers!
>
> Thanks!
> Sam
>
>
> [1] https://cwiki.apache.org/confluence/display/MXNET/Dynamic+CustomOp+Support
> [2] https://github.com/apache/incubator-mxnet/pull/15921
> [3] https://lists.apache.org/thread.html/0a22e10b290b4ad322ed50024d778c3736b0a772811caea317790732%40%3Cdev.mxnet.apache.org%3E
> [4] https://github.com/apache/incubator-mxnet/issues/17006
>
>
>


Re: BytePS-MXNet Integration

2019-11-08 Thread kellen sunderland
Quite interested in BytePS.  Looking forward to seeing how integration
could evolve.

On Wed, Nov 6, 2019, 8:14 AM Yimin Jiang  wrote:

> Hi Zhennan,
>
> Thanks for your interest. To be honest, our team does not currently have
> a plan for CPU training. That said, the notion of BytePS is not
> GPU-specific and should also apply to CPU; I do not see a fundamental
> challenge yet. We welcome contributions on this.
>
> Thank you,
> Yimin
>
> On Wed, Nov 6, 2019 at 2:59 PM Qin, Zhennan  wrote:
>
> > Hi Yimin,
> >
> > Welcome to make contribution to MXNet project!
> >
> > From https://github.com/bytedance/byteps/blob/master/README.md I found
> > another limitation that isn't shown in your proposal:
> >
> > BytePS does not support pure CPU training for now. One reason is that
> > the cheap PS assumption
> > (https://github.com/bytedance/byteps/blob/master/docs/rationale.md) of
> > BytePS does not hold for CPU training. Consequently, you need CUDA and
> > NCCL to build and run BytePS.
> >
> > I have a couple of questions about this: How's the status of CPU
> > training support? If CPU training isn't supported yet, what's the
> > challenge to support it? Do you have a plan to support it?
> >
> > Thanks,
> > Zhennan
> >
> > On Wed, 2019-11-06 at 12:14 +0800, Yimin Jiang wrote:
> >
> > Hi MXNet Community,
> >
> >
> > BytePS (https://github.com/bytedance/byteps) is a high-performance,
> >
> > cross-framework architecture for distributed training. BytePS developers
> >
> > are planning to integrate a part of BytePS into MXNet. The link below is
> >
> > the proposal. Feedback is welcome.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/MXNET/BytePS-MXNet+Integration
> >
> >
> >
> > Thank you,
> >
> > Yimin Jiang
> >
> >
>


Re: [REVIEW][ANNOUNCE] Release Apache MXNet (incubating) version 1.5.1

2019-10-10 Thread kellen sunderland
Lgtm Tao.

On Thu, Oct 10, 2019, 7:23 AM Tao Lv  wrote:

> Okay, it looks like there are no objections. I will send the announcement
> to announce@ and general@ soon.
>
> Thanks,
> -tao
>
> On Wed, Oct 9, 2019 at 10:35 AM Tao Lv  wrote:
>
> > Dear community,
> >
> > This is to review the announcement for the 1.5.1 release, according to
> > section 3.4 of the release process.
> >
> > ===
> >
> > The Apache MXNet (incubating) community is happy to announce Apache MXNet
> > (incubating) version 1.5.1!
> >
> > Apache MXNet (incubating) is a deep learning framework designed for both
> > efficiency and flexibility. It allows you to mix symbolic and imperative
> > programming to maximize efficiency and productivity.
> >
> > 1.5.1 is a maintenance release incorporating important bug fixes and
> > performance improvements.
> >
> > A full list of the changes in this release can be found in the release
> > notes:
> > https://cwiki.apache.org/confluence/display/MXNET/1.5.1+Release+Notes
> >
> > A link to the download can be found here:
> > http://mxnet.incubator.apache.org/get_started/download
> >
> > If you prefer to build from source and experiment with various
> > compile-time configuration options, use these links to get the
> > instructions:
> > http://mxnet.incubator.apache.org/get_started/ubuntu_setup.html
> >
> > http://mxnet.incubator.apache.org/get_started/centos_setup.html
> >
> > Or you can download and play with MXNet easily using one of the options
> > below:
> >
> > 1. The Pip packages can be found here:
> > https://pypi.python.org/pypi/mxnet
> >
> > 2. The Docker Images can be found here:
> > https://hub.docker.com/r/mxnet/python/
> >
> > Links in Maven to the published Scala packages:
> >
> >
> >
> https://repository.apache.org/content/repositories/releases/org/apache/mxnet/
> > https://repository.apache.org/#nexus-search;quick~org.apache.mxnet
> >
> > and to the experimental Clojure packages:
> >
> >
> https://repository.apache.org/content/repositories/releases/org/apache/mxnet/contrib/clojure/
> >
> > The Release Tag:
> > https://github.com/apache/incubator-mxnet/tree/1.5.1
> >
> > MXNet Resources
> > - Our discussion forum (https://discuss.mxnet.io)
> > - MXNet user mailing list (
> > https://lists.apache.org/list.html?u...@mxnet.apache.org)
> > - MXNet dev mailing list (
> > https://lists.apache.org/list.html?d...@mxnet.apache.org)
> > - StackOverflow mxnet tag (
> > https://stackoverflow.com/questions/tagged/mxnet)
> > - MXNet website (https://mxnet.incubator.apache.org)
> > - Github issues (https://github.com/apache/incubator-mxnet/issues)
> > - Wiki (https://cwiki.apache.org/confluence/display/MXNET)
> >
> > Attend one of the regular user groups meetings:
> > https://cwiki.apache.org/confluence/x/7BY0BQ
> >
> > For more information on Apache MXNet (incubating), please see:
> > https://mxnet.incubator.apache.org/
> >
> >
> > Best regards,
> > Apache MXNet (incubating) Team
> >
> > ___
> >
> > DISCLAIMER:
> >
> > Apache MXNet (incubating) is an effort undergoing incubation at The
> > Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.
> > Incubation is required of all newly accepted projects until a further
> > review indicates that the infrastructure, communications, and decision
> > making process have stabilized in a manner consistent with other
> successful
> > ASF projects. While incubation status is not necessarily a reflection of
> > the completeness or stability of the code, it does indicate that the
> > project has yet to be fully endorsed by the ASF.
> >
> > https://cwiki.apache.org/confluence/x/BINjB
> >
>


Re: new website, docs code freeze

2019-09-22 Thread kellen sunderland
New site looks good.  I do notice that a few tutorials from the old site
are missing (for example the TensorRT tutorial).  Any plans to bring them
back?

On Sun, Sep 22, 2019 at 10:04 AM Haibin Lin 
wrote:

> Another issue I found with the current website: the Sphinx object
> inventory file https://mxnet.apache.org/objects.inv is missing. GluonNLP
> relies on this file to link documents across projects. Shall we add it
> back?
>
> Best,
> Haibin
>
> On Sun, Sep 22, 2019 at 2:04 AM Lieven Govaerts  wrote:
>
> > Hi,
> >
> >
> > On Sat, 21 Sep 2019 at 06:28, Thomas DELTEIL 
> > wrote:
> >
> > > Thanks all for the feedback,
> > >
> > > We'll send an email next week with the list of missing features,
> content
> > > and bugs that we plan to fix.
> > > We took the option of releasing early, with some features missing,
> rather
> > > than trying to be at feature parity with the old website before
> launching
> > > the website.
> > > The reason why we decided to do that is two-fold:
> > > - playing catch-up with docs in master introduces daily conflicts that
> > > need to be resolved and introduces opportunities for errors
> > > - by releasing early, we can take advantage of community contributions
> > > in modifying whatever the community feels is a better way of doing
> > > things.
> > >
> > > One of the goals of the new website was to disentangle the main website,
> > > now called "static_site", from the auto-generated docs. The overall site
> > > is now made of a main static site, with easy-to-modify content and an
> > > easy-to-understand architecture for anybody familiar with basic HTML,
> > > and a collection of self-contained mini-websites for each language
> > > binding that can be built in isolation. The new CI jobs build all of
> > > them in parallel, independently.
> > >
> > > There is PLENTY of room for improvement. It would be great if the
> > > community could help bring the new website to the same level of content
> > > richness as the old one, and then even further.
> > >
> > > Missing features:
> > > - As pointed out by Haibin, the API docs do not have the full list of
> > > operators and classes. There is a mix of auto-generated docs based on
> > > packages, and some docs that are spelled out manually to improve the
> > > logical organization of the package where there is a need. The drawback
> > > with manually listed classes in a package is that it's very easy to
> > > miss some. If someone wanted to build a sanity check that would
> > > automatically detect which classes are not in the documentation, or if
> > > someone knew how to enable that with Sphinx, that would be a great
> > > addition to the python docs.
> > > - There is missing content in the python tutorials, and the
> > > discoverability could be improved. Some old tutorials have not been
> > > migrated just yet.
> > > - The nightly tests on tutorials have been disabled for now.
> > > - There is no "Download jupyter notebook" option for tutorials just yet.
> > > - Non-python tutorials might benefit from a blurb description and
> > > better content organization.
> > > - Python tutorials could be better organized and have a picture
> > > accompanying their description.
> > > - There is no site-wide search. To be fair, this is not an easy problem
> > > to solve given the static nature of the website, but an external plugin
> > > might be able to give a half-way solution.
> > > - There is no version selector for the docs.
> > > - There is a bug in the search box of the python docs, but this is just
> > > a small JS bug that can be fixed easily (on my list for next week).
> > > - Most old links have not had a redirect put in place.
> > >
> > >
> > I noticed on the Ubuntu home page, in the Developer dropdown, that the
> > link "MXNet on Ubuntu with Nvidia"
> > (https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/mxnet/)
> > doesn't work anymore; it points to:
> > https://mxnet.incubator.apache.org/install/index.html
> >
> > Also, on the MXNet 'getting started' page
> > https://mxnet.incubator.apache.org/get_started , the link "Ubuntu
> > Installation Guide" at the bottom doesn't work either, it points to:
> > https://mxnet.incubator.apache.org/ubuntu_setup.html
> >
> >
> > I suggest you do a scan of the new website to find these dangling links.
> >
> > regards,
> >
> > Lieven
> >
> >
> >
> > > We'll formalize this in github issues next week, but they are all
> fairly
> > > small and helping out on these would be a great way of familiarizing
> > > yourself with the new website build system and website architecture.
> > >
> > >  Thanks all for the feedback, please keep it coming!
> > >
> > > Thomas Delteil
> > >
> > > On Sat., Sep. 21, 2019 at 09:53, Haibin Lin wrote:
> > >
> > > > It looks like my previous 

Re: Code freeze for 1.5.1 patch release

2019-09-02 Thread kellen sunderland
Thanks for organizing the release Tao.

On Sun, Sep 1, 2019, 5:53 PM Tao Lv  wrote:

> Hi Community,
>
> Code freeze for the 1.5.1 patch release will be 9/3 6pm PST (9/4 9am CST).
> If you have any additional fixes in progress and would like to include
> them in this release, please ensure they have been merged before the code
> freeze.
>
> Thanks for all your support and contribution.
>
> -tao
>


Re: Turning CUDNN on/off in anaconda distribution

2019-07-11 Thread kellen sunderland
Having runtime-loadable / pluggable operators might help with this.

On Thu, Jul 11, 2019 at 10:20 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Once it's compiled, the forward/backward, etc., kernel implementations
> are hard-coded to use cuDNN.  In theory we could support raw CUDA in
> addition to cuDNN, but the additional CUDA kernel code would bloat the
> binary (it targets several GPU types).
>
> On Thu, Jul 11, 2019 at 9:36 AM Chris Olivier 
> wrote:
>
>> Is there an environment variable or some other way to not use CUDNN in the
>> anaconda distribution of mxnet?
>>
>


Re: Turning CUDNN on/off in anaconda distribution

2019-07-11 Thread kellen sunderland
Once it's compiled, the forward/backward, etc., kernel implementations are
hard-coded to use cuDNN.  In theory we could support raw CUDA in addition
to cuDNN, but the additional CUDA kernel code would bloat the binary (it
targets several GPU types).

On Thu, Jul 11, 2019 at 9:36 AM Chris Olivier  wrote:

> Is there an environment variable or some other way to not use CUDNN in the
> anaconda distribution of mxnet?
>


Re: OMP

2019-06-24 Thread kellen sunderland
I remember at the time we also had a read-through of this blog post, but to
us the code looked like it was following the advice:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/

On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> believe the stacks for the hang are here:
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and
> the trick was we could only debug it up to the point that we hit:
>
> #0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
> futex_word=0x7fec60843758)
> at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
> at ../sysdeps/nptl/futex-internal.h:135
> #2  __pthread_once_slow (once_control=0x7fec60843758,
> init_routine=0x7fec605f38f0)
> at pthread_once.c:105
> ...
> #6  0x7fec6061c577 in cudaSetDevice () from
> /usr/local/cuda/lib64/libcudart.so.9.0
>
> because the code in libcudart is obviously closed source we couldn't dig
> into what threading work was going on when we called cudaSetDevice.
>
> On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy 
> wrote:
>
>> If you check initialize.cc we seem to be explicitly disabling that
>> behaviour in pthread_atfork, which seems to cause thread contention
>> during multiprocessing. Why do we need this major advantage for the
>> library if that's the case?
>>
>> Related PRs:
>>
>> https://github.com/apache/incubator-mxnet/pull/10820
>> https://github.com/apache/incubator-mxnet/issues/14396
>>
>> The original code was authored in this PR:
>>
>> https://github.com/apache/incubator-mxnet/pull/8677
>>
>> I actually remember this fix: it was done during a release, as the cuda
>> runtime was forking and the engine was being re-entered. If that
>> situation is not happening anymore it might not be needed any longer.
>> I don't think we know why there was a fork inside cuda, so the code has
>> grown around a fix for an issue whose root cause was not understood,
>> plus side effects which this fix caused afterwards.
>>
>> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
>> the link above, no libgomp.
>>
>> I didn't try the Make build.
>>
>> I would refactor the code linked above and stop using pthread_atfork,
>> since OMP assumes it won't be initialized twice, but this needs to be
>> very well tested to make sure it doesn't cause bugs or affect the fixes
>> done in the linked PRs above.
>>
>> Pedro.
>>
>> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier 
>> wrote:
>> >
>> > one major advantage of intel/llvm omp is that it spawns a new thread
>> pool
>> > after fork if a thread pool was already created. this is so that omp
>> can be
>> > used in the forked processes. libgomp doesn’t do this so it’ll just
>> lock up
>> > if you try to do omp in the forked process.
>> >
>> > is your build linking libgomp as well?
>> >
>> > standard mkl build (from Makefile) uses same omp library. are there
>> > problems with that build?
>> >
>> > what changes need to be made to make the assertion not fire?
>> >
>> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> > > There's an assertion which is easily reproducible, and also there's a
>> > > crash including core dump, the latter is not easy to reproduce for me
>> > > in different environments. I have also seen mxnet getting stuck
>> > > without progressing with this build configuration and using no CPU at
>> > > all when running unit tests.
>> > >
>> > > In my view, the root cause of the assertion is that we are re-entering
>> > > OMP initialization when spawning threads in the following code through
>> > > pthread_atfork:
>> > >
>> > >
>> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
>> > >
>> > > This causes double initialization of the OMP engine, including the
>> > > assertion which you are asking about,  and I suspect some additional
>> > > overhead. That's the shady forking part you are asking for.
>> > >
>> > > A question for you: What is the cause of runtime differences between
>> > > OMP runtimes? Shouldn't the implementation overhead diminish as
>> > > threa

Re: OMP

2019-06-24 Thread kellen sunderland
> > > > > >
> > > > >
> > >
> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so
> > > > > > > (0x7f05b09f4000)
> > > > > > >
> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484
> > > > > > > INFO:root:Epoch[19] Batch [0-100]   Speed: 36651.63
> samples/sec
> > > > > > >  accuracy=0.999691
> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98
> samples/sec
> > > > > > >  accuracy=0.999844
> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84
> samples/sec
> > > > > > >  accuracy=0.999687
> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90
> samples/sec
> > > > > > >  accuracy=0.999687
> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96
> samples/sec
> > > > > > >  accuracy=0.999531
> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25
> samples/sec
> > > > > > >  accuracy=0.999844
> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79
> samples/sec
> > > > > > >  accuracy=0.999844
> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 44962.78
> samples/sec
> > > > > > >  accuracy=0.999844
> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 44945.47
> samples/sec
> > > > > > >  accuracy=0.999375
> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717
> > > > > > > INFO:root:Epoch[19] Time cost=1.367
> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783
> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU
> (0avgtext+0avgdata
> > > > > > > 1154348maxresident)k
> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps
> > > > > > >
> > > > > > >
> > > > > > > MKL OFF:
> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i
> MKL
> > > > > > > cmake_options.yml
> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL
> found) IF
> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
> > > > > > > build/libmxnet.so |grep -i omp
> > > > > > > libomp.so =>
> > > > > > >
> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > > > (0x7fb720c54000)
> > > > > > >
> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479
> > > > > > > INFO:root:Epoch[19] Batch [0-100]   Speed: 46784.02
> samples/sec
> > > > > > >  accuracy=1.00
> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 48824.29
> samples/sec
> > > > > > >  accuracy=0.999687
> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 49190.31
> samples/sec
> > > > > > >  accuracy=0.999687
> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77
> samples/sec
> > > > > > >  accuracy=0.999844
> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62
> samples/sec
> > > > > > >  accuracy=0.999844
> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35
> samples/sec
> > > > > > >  accuracy=0.999844
> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46
> samples/sec
> > > > > > >  accuracy=0.999375
> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55
> samples/sec
> > > > > > >  accuracy=0.999687
> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56
> samples/sec
> > > > > > >  accuracy=0.999844
> > > >

Re: OMP

2019-06-19 Thread kellen sunderland
"if you’re linking in two then you’re doing something wrong." Correct,
that's one thing I believe we've got consensus on.  So let's call that out
as a bug to be fixed.

Let's move forward with some reproducible numbers and then discuss the pros
/ cons of which particular OMP implementation we should use.

On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy 
wrote:

> Hi Chris
>
> I would ask you to have a bit of patience and help us with your
> experience in this matter. Nobody is ignoring anything; I think we are
> individually gathering feedback and trying to understand the multiple
> contributions made to this topic, including yours, then going step by
> step to understand what is going on, run experiments, and report back
> to the list or the corresponding github item. It was suggested by
> Kellen to prepare some containers, this takes effort.
>
> Regarding your final comment, most of us also have many other things
> to do and responsibilities even if our daytime jobs might involve
> MXNet in some form or another. I think that's part of the privilege
> and responsibility of working close with an open source project and
> the magic of collaboration across organizations. Let's all be patient
> and take some time to understand and reason about this topic which is
> not simple. Since we decided to step back and gather more data let's
> take time and do it properly.
>
> Personally I hope to find time to look again into this issue before
> the end of the week.
>
> Thanks.
>
> Pedro.
>
> On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier 
> wrote:
> >
> > if you’re linking in two then you’re doing something wrong. You can see
> by
> > my email yesterday that only one is linked in. This is also the case with
> > the mkl version built by the Makefile — only the Intel OMP library is
> used
> > (no libgomp).
> >
> > That being said, do you have clear evidence that using Intel OMP is both
> > problematic and the situation isn’t fixable?  The burden of proof is on
> the
> > ones requesting the change — it is not my responsibility to justify the
> > current state.  There must be something “terrible” and unfixable to
> justify
> > a change.  I have seen no proof of this in all this time.
> >
> > On a side note, I mentioned a couple of things in my email yesterday that
> > still are not being responded to (they were also ignored in the last
> > incarnation of this “discussion” — I have enough experience in this
> > matter to assume “discussion” is a waste of my time, seeing as I am not
> > paid to “work on” mxnet like y’all are).
> >
> > -C
> >
> >
> >
> >
> >
> >
> > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > I've also quite often seen two versions of OpenMP linked.  I think we
> can
> > > all agree we probably want to avoid linking in two libraries that do
> > > effectively the same thing.
> > >
> > > The performance questions should be fairly straight forward to
> demonstrate
> > > right?  Could we just collaborate on a few minimal Dockerfiles that
> show
> > > (or don't show) Intel OpenMP performance speedups with the workloads
> Chris
> > > is referencing?
> > >
> > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > stanislav.tsuk...@gmail.com> wrote:
> > >
> > > > Hi, Chris!
> > > >
> > > > Stas here - I've gathered that performance data.
> > > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > > missing.
> > > > Be assured, intentional misdirection was never the case.
> > > >
> > > > Thanks a lot for being constructive.
> > > >
> > > > > Turning Intel OMP on and off (and MKL as well, since it tends to
> pull
> > > in
> > > > omp, depending which one is linked in).
> > > >
> > > > We never ever considered turning MKL off. We are on the same page
> here -
> > > > MKL is crucial for the performance.
> > > > Why should we? There's a GOMP-linked version of MKL that we can use.
> > > >
> > > > What we did: we measured whether using the compiler's default OpenMP
> > > > implementation instead of referenced source code distribution of
> OpenMP
> > > > makes anything slower.
> > > > We have found the impact to be hardly measurable.
> > > > The difference between GOMP and iOMP is <5% on our benchmarks, most
> of
> > > the
> > > > time less than t

Re: OMP

2019-06-19 Thread kellen sunderland
I've also quite often seen two versions of OpenMP linked.  I think we can
all agree we probably want to avoid linking in two libraries that do
effectively the same thing.

The performance questions should be fairly straight forward to demonstrate
right?  Could we just collaborate on a few minimal Dockerfiles that show
(or don't show) Intel OpenMP performance speedups with the workloads Chris
is referencing?

On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
stanislav.tsuk...@gmail.com> wrote:

> Hi, Chris!
>
> Stas here - I've gathered that performance data.
> Sure thing, I can be wrong, but please elaborate a bit on what we are
> missing.
> Be assured, intentional misdirection was never the case.
>
> Thanks a lot for being constructive.
>
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> omp, depending which one is linked in).
>
> We never ever considered turning MKL off. We are on the same page here -
> MKL is crucial for the performance.
> Why should we? There's a GOMP-linked version of MKL that we can use.
>
> What we did: we measured whether using the compiler's default OpenMP
> implementation instead of referenced source code distribution of OpenMP
> makes anything slower.
> We have found the impact to be hardly measurable.
> The difference between GOMP and iOMP is <5% on our benchmarks, most of the
> time less than that.
>
> We just suggest simplifying the build of mxnet by removing the
> unnecessary dependency.
>
> During that we discovered for example the following amazing issue:
> https://github.com/apache/incubator-mxnet/issues/14087
>
> Best Regards
>
> Stas
>
> On 18.06.19, 18:24, "Chris Olivier"  wrote:
>
> I am very reluctant to feed the trolls again, and this will be the last
> time I address Pedro or Anton on the subject, but since I think the
> numbers being presented are incorrect (either by the builders not really
> understanding what they are building, or possibly intentional
> misdirection):
>
> Turning Intel OMP on and off (and MKL as well, since it tends to pull
> in
> omp, depending which one is linked in).
> There is a HUGE difference.  This is consistent with my experience
> before
> when it was added.
>
>
> default mnist:
>
> python ../example/image-classification/train_mnist.py
> INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
> disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> gpus=None, image_shape='1, 28, 28', initializer='default',
> kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> monitor=0, network='mlp', num_classes=10, num_epochs=20,
> num_examples=6, num_layers=None, optimizer='sgd',
> profile_server_suffix='', profile_worker_suffix='', save_period=1,
> test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> wd=0.0001)
>
> INTEL OMP:
>
> ldd libmxnet.so | grep omp
> libomp.so =>
> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> (0x7f978fde7000)
>
> :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
> accuracy=0.780012
> INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
> accuracy=0.920469
> INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
> accuracy=0.928281
> INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
> accuracy=0.942813
> INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
> accuracy=0.938750
> INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
> accuracy=0.946562
> INFO:root:Epoch[0] Batch [600-700]  Speed: 19547.41 samples/sec
> accuracy=0.953281
> INFO:root:Epoch[0] Batch [700-800]  Speed: 24111.73 samples/sec
> accuracy=0.951562
> INFO:root:Epoch[0] Batch [800-900]  Speed: 13959.88 samples/sec
> accuracy=0.957500
> INFO:root:Epoch[0] Train-accuracy=0.925423
> INFO:root:Epoch[0] Time cost=3.806
> INFO:root:Epoch[0] Validation-accuracy=0.962580
> INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec
> accuracy=0.968131
> INFO:root:Epoch[1] Batch [100-200]  Speed: 23457.03 samples/sec
> accuracy=0.966250
>
>
> LIBGOMP:
>
> ldd libmxnet.so | grep omp
> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x7f25c25dd000)
>
> INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 samples/sec
>  accuracy=0.782488
> INFO:root:Epoch[0] Batch [100-200]  Speed: 3551.32 samples/sec
>  accuracy=0.907813
> INFO:root:Epoch[0] Batch [200-300]  Speed: 1991.00 samples/sec
>  accuracy=0.927188
> INFO:root:Epoch[0] Batch [300-400]  Speed: 2175.45 samples/sec
>  accuracy=0.937969
> INFO:root:Epoch[0] Batch [400-500]  Speed: 1644.95 samples/sec
>  
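The two runs above differ only in which OpenMP runtime libmxnet.so resolves to. As a rough sanity check on the gap (purely illustrative, using the per-batch numbers copied from the logs):

```python
# Back-of-the-envelope comparison of the two logged runs. The values are the
# per-batch "samples/sec" numbers from the INFO logs above; averaging them is
# only a rough estimate, not a controlled benchmark.
intel_omp = [31548.09, 16073.21, 19075.91, 23211.36, 22139.79,
             23225.52, 19547.41, 24111.73, 13959.88]
libgomp = [1731.01, 3551.32, 1991.00, 2175.45, 1644.95]

def avg(xs):
    return sum(xs) / len(xs)

speedup = avg(intel_omp) / avg(libgomp)
print(f"Intel/LLVM OpenMP vs libgomp: ~{speedup:.1f}x throughput on this run")
```

On these logs the difference works out to roughly an order of magnitude in favor of the Intel/LLVM runtime.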

Re: CUDA / CUDNN support revisited

2019-06-19 Thread kellen sunderland
Just double-checked: CUDA 9, 10 and 10.1 all support SM3, so actually I
don't believe there's any need to drop SMs.
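For context, a hedged sketch of the compute-capability (SM) ranges each toolkit supports. The bounds below are recalled from NVIDIA's release notes and should be verified against the official CUDA documentation before anyone relies on them:

```python
# Approximate supported compute-capability ranges per CUDA toolkit
# (inclusive, encoded as sm_XY integers). Recalled from NVIDIA release
# notes -- double-check before relying on these bounds.
SUPPORTED_SM = {
    "9.0":  (30, 70),   # CUDA 9 dropped Fermi (sm_2x); Kepler sm_30 onward
    "10.0": (30, 75),   # CUDA 10 added Turing (sm_75)
    "10.1": (30, 75),
}

def sm_supported(cuda_version, sm):
    lo, hi = SUPPORTED_SM[cuda_version]
    return lo <= sm <= hi

# All three toolkits still target SM 3.x, so moving to CUDA 9+ alone does
# not force dropping any SM.
print(all(sm_supported(v, 30) for v in SUPPORTED_SM))
```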

On Wed, Jun 19, 2019 at 9:56 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I think where we're all going to have agreement is that we shouldn't have
> code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
> than 6.  We can go ahead and remove any code that targets those old
> versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
> suggest we also add some logging for users with prior versions letting them
> know they can still use MXNet 1.4.
>
> Where things get interesting is CUDA 9 / cuDNN 6 support.  I was
> originally a proponent of the N and N-1 route for simplicity.  Looking back
> at the choice, one complication I see is that there's competing concerns
> between semver library compatibility and feature releases on NVIDIA's
> part.  NVIDIA is releasing new libraries with a lot of new features on a
> regular basis, which is good, but for compatibility reasons they've begun
> to bump major versions less often, which is also probably good.  For
> example if memory serves correctly cuDNN used to get an MV bump every 6
> months or so, but now the N-1 MV (6) was released in April of 2017.  As a
> project maintainer I would certainly like to drop support for library
> versions that are 2 years old in my latest release.  Supporting a 2 year
> wide range of dependency libraries in the CI for example is going to be a
> burden.
>
> From the MXNet users' perspective obviously having to update dependencies
> is a pain, but updating these libs are likely to give significant
> performance increases (occasional perf regressions aside).  I think a
> consistent thread I've heard from users is that training takes too long,
> inference costs too much, and they want their DL framework to abstract the
> complexity of using custom hardware like TCs or AVX without them having to put
> in a lot of effort.  Another consideration is that using old versions of
> MXNet is actually quite easy and convenient thanks to (IMO) some solid
> release practices and naming conventions.
>
> Given how easy it is to use old MXNet versions I think it's reasonable to
> target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
> versions).
>
> On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu 
> wrote:
>
>> Good points anirudh. Generally I would understand N as being the major
>> versions. That is, we would maintain CUDA 9 and 10.1 in your given example
>> and
>> drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would only
>> be
>> dropped when 11 is released and tested.
>>
>> At the same time, we would always only support the latest compatible
>> cudnn version. Or is there any reason somebody would use an old cudnn
>> version?
>>
>> Wdyt?
>>
>> -Marco
>>
>> Anirudh Subramanian  schrieb am Mi., 19. Juni
>> 2019,
>> 01:47:
>>
>> > +1, Agree this should be done for both CUDA and CUDNN versions. At max
>> CUDA
>> > Version N and CUDA Version N - 1 should be supported in CI.
>> >
>> > My question is what happens when we are in a position where we are on
>> a
>> > CUDA version N and removed support for CUDA version N - 1. Within a
>> small
>> > duration Nvidia comes up with a CUDA patch version N + 1, where  some
>> perf
>> > regressions and some bugs have been fixed. Should we just move to N + 1,
>> > since version N will have all these issues for users and may also slow
>> us
>> > down on CI.
>> >
>> > I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
>> > causing intermittent CI failures:
>> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
>> already
>> > a
>> > PR to bump up Nvidia version to 10.1 (
>> > https://github.com/apache/incubator-mxnet/pull/14986/files).
>> >
>> > I think for situations where there is a quick follow up release like
>> 10.1
>> > and MXNet users are impacted by certain issues, we should just bump up
>> the
>> > version and stop support for 10.0.
>> > Would like to hear more from Nvidia folks (on this particular case of
>> CUDA
>> > 10.0 vs CUDA 10.1 and what are the recommendations for existing
>> customers).
>> >
>> > Anirudh
>> >
>> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter 
>> wrote:
>> >
>> > > Actually, I tried to say that support *doesn't necessarily* include
>> N-1.
>> > > I'm proposing that the supported vers

Re: CUDA / CUDNN support revisited

2019-06-19 Thread kellen sunderland
I think where we're all going to have agreement is that we shouldn't have
code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
than 6.  We can go ahead and remove any code that targets those old
versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
suggest we also add some logging for users with prior versions letting them
know they can still use MXNet 1.4.

Where things get interesting is CUDA 9 / cuDNN 6 support.  I was originally
a proponent of the N and N-1 route for simplicity.  Looking back at the
choice, one complication I see is that there's competing concerns between
semver library compatibility and feature releases on NVIDIA's part.  NVIDIA
is releasing new libraries with a lot of new features on a regular basis,
which is good, but for compatibility reasons they've begun to bump major
versions less often, which is also probably good.  For example if
memory serves correctly cuDNN used to get an MV bump every 6 months or so,
but now the N-1 MV (6) was released in April of 2017.  As a project
maintainer I would certainly like to drop support for library versions that
are 2 years old in my latest release.  Supporting a 2 year wide range of
dependency libraries in the CI for example is going to be a burden.

From the MXNet users' perspective obviously having to update dependencies
is a pain, but updating these libs are likely to give significant
performance increases (occasional perf regressions aside).  I think a
consistent thread I've heard from users is that training takes too long,
inference costs too much, and they want their DL framework to abstract the
complexity of using custom hardware like TCs or AVX without them having to put
in a lot of effort.  Another consideration is that using old versions of
MXNet is actually quite easy and convenient thanks to (IMO) some solid
release practices and naming conventions.

Given how easy it is to use old MXNet versions I think it's reasonable to
target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
versions).
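The "N and N-1" policy debated in this thread can be stated concretely. A minimal illustrative sketch — the release lists below are examples, not a project decision:

```python
# Illustrative sketch of an "N and N-1" dependency-support policy: keep the
# newest release plus n_back predecessors. Version lists are examples only.
def supported_versions(releases, n_back=1):
    ordered = sorted(releases, key=lambda v: tuple(int(p) for p in v.split(".")))
    return ordered[-(n_back + 1):]

cuda_releases = ["9.0", "9.2", "10.0", "10.1"]
print(supported_versions(cuda_releases))             # N and N-1
print(supported_versions(cuda_releases, n_back=0))   # N only, as proposed above
```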

On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu 
wrote:

> Good points anirudh. Generally I would understand N as being the major
> versions. That is, we would maintain CUDA 9 and 10.1 in your given example and
> drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would only be
> dropped when 11 is released and tested.
>
> At the same time, we would always only support the latest compatible
> cudnn version. Or is there any reason somebody would use an old cudnn
> version?
>
> Wdyt?
>
> -Marco
>
> Anirudh Subramanian  schrieb am Mi., 19. Juni 2019,
> 01:47:
>
> > +1, Agree this should be done for both CUDA and CUDNN versions. At max
> CUDA
> > Version N and CUDA Version N - 1 should be supported in CI.
> >
> > My question is what happens when we are in a position where we are on a
> > CUDA version N and removed support for CUDA version N - 1. Within a small
> > duration Nvidia comes up with a CUDA patch version N + 1, where  some
> perf
> > regressions and some bugs have been fixed. Should we just move to N + 1,
> > since version N will have all these issues for users and may also slow us
> > down on CI.
> >
> > I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
> > causing intermittent CI failures:
> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
> already
> > a
> > PR to bump up Nvidia version to 10.1 (
> > https://github.com/apache/incubator-mxnet/pull/14986/files).
> >
> > I think for situations where there is a quick follow up release like 10.1
> > and MXNet users are impacted by certain issues, we should just bump up
> the
> > version and stop support for 10.0.
> > Would like to hear more from Nvidia folks (on this particular case of
> CUDA
> > 10.0 vs CUDA 10.1 and what are the recommendations for existing
> customers).
> >
> > Anirudh
> >
> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter  wrote:
> >
> > > Actually, I tried to say that support *doesn't necessarily* include
> N-1.
> > > I'm proposing that the supported versions are 1) covered by CI and 2)
> > have
> > > been available in a usable form long enough that a semi-motivated user
> > has
> > > been able to transition to it.  That might mean only N (e.g. per my
> > > proposal, only cuDNN v7).
> > >
> > > Regarding precedent for N / N-1,  when a new CUDA version comes out,
> > users
> > > will transition to it at their own pace, thereby creating a N / N-1
> > support
> > > situation for some period.
> > >
> > >
> > > On 2019/06/03 22:43:20, Pedro Larroy 
> > > wrote:
> > > > Your proposal of having support for N and N-1 makes a lot of sense to
> > > > me. Are there use cases for supporting older CUDA versions?
> > > >
> > > >
> > > > Thanks.
> > > >
> > > > On Mon, Jun 3, 2019 at 3:06 PM Dick Carter 
> > wrote:
> > > > >
> > > > > I'd like to revisit the discussion of:
> > >
> >
> 

Re: [VOTE] Release Apache MXNet (incubating) version 1.4.1.rc0

2019-05-06 Thread kellen sunderland
Hey Tao.  Fixed my problem by recompiling cmake with ssl support (so
basically just a problem on my end).  After that MKL downloaded correctly,
and everything compiled correctly with Anirudh's build flags.

On Sun, May 5, 2019 at 6:59 PM Lv, Tao A  wrote:

> Hi Kellen, does the problem still exist for you? I just built mxnet
> 1.4.1rc0 + mkldnn from source with cmake on my centos machine and
> everything works well:
>
> -- Downloading MKLML...
> -- [download 0% complete]
> ...
> -- [download 100% complete]
> -- Setting MKLROOT path to
> /home/lvtao/Workspace/mxnet-official/build/mklml/mklml_lnx_2019.0.1.20180928
> -- CMAKE_BUILD_TYPE is unset, defaulting to Release
> -- Detecting Intel(R) MKL: trying mklml_intel
> -- Intel(R) MKL: include
> /home/lvtao/Workspace/mxnet-official/build/mklml/mklml_lnx_2019.0.1.20180928/include
> -- Intel(R) MKL: lib
> /home/lvtao/Workspace/mxnet-official/build/mklml/mklml_lnx_2019.0.1.20180928/lib/libmklml_intel.so
>
> Thank you Junru for managing this release. We also verified MKL-DNN
> related tests, convergence, quantization and FP32/INT8 performance. They
> all look good to me.
>
> -tao
>
> -Original Message-
> From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> Sent: Monday, May 6, 2019 3:20 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.4.1.rc0
>
> I gave checking a shot but am now getting
>
> -- Downloading MKLML...
> CMake Error at cmake/DownloadMKLML.cmake:62 (file):
>   file DOWNLOAD HASH mismatch
>
> Assuming that's a transient mkl dep download error, I'll try again later.
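When a hash-mismatch error like the one above shows up, hashing the downloaded file yourself can distinguish a corrupted or transient download from a stale expected hash in the build script. A self-contained sketch (the filename and contents are placeholders; in practice you would hash the actual downloaded MKLML archive):

```python
# Hash a file in chunks and compare against an expected digest. The file
# written here is a stand-in so the example runs on its own.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

with open("mklml_download.test", "wb") as f:
    f.write(b"not the real archive")

digest = sha256_of("mklml_download.test")
print(digest)
```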
>
>
> On Sat, May 4, 2019 at 12:09 AM Junru Shao 
> wrote:
>
> > Thank you Anirudh for your quick response! I will change the result
> > accordingly :-)
> >
> > On Fri, May 3, 2019 at 11:58 PM Anirudh Subramanian
> >  > >
> > wrote:
> >
> > > No worries, maybe its just something with my setup.
> > > Moving my vote to +0, pending someone else check.
> > >
> > > On Fri, May 3, 2019 at 11:39 PM Junru Shao 
> > > wrote:
> > >
> > > > Hi Anirudh,
> > > >
> > > > Thanks for reporting this!
> > > >
> > > > I verified on my EC2 machine for the second time. It perfectly
> > > > builds
> > > with
> > > > your commands. It is a bit weird...I noticed that there is a
> > > > subtle difference that my ninja progress bar is like "[xxx/506]",
> > > > while yours
> > is
> > > > "[xxx/488]". I am not sure if there is anything different between
> > > > our settings.
> > > >
> > > > My understanding is that cmake should work because it is tested in
> > > > our
> > CI
> > > > system under "ci/jenkins/incubator-mxnet" (
> > > >
> > > >
> > >
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incub
> > ator-mxnet/detail/v1.4.x/201/pipeline
> > > > ).
> > > >
> > > > It will be much appreciated if someone could help confirm whether
> > > > cmake works on their side.
> > > >
> > > > Thanks,
> > > > Junru
> > > >
> > > >
> > > > On Fri, May 3, 2019 at 9:43 PM Anirudh Subramanian <
> > > anirudh2...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Junru,
> > > > >
> > > > > I am on v1.4.x , and my dmlc-core commit is this one :
> > > > >
> > > > >
> > > >
> > >
> > https://github.com/dmlc/dmlc-core/tree/0a0e8addf92e1287fd7a25c6314016b
> > 8c0138dee
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Fri, May 3, 2019 at 8:30 PM Junru Shao
> > > > > 
> > > > wrote:
> > > > >
> > > > > > Hey Anirudh,
> > > > > >
> > > > > > Although the vote has been closed, I am very interested in
> > > > > > digging
> > > into
> > > > > > this issue.
> > > > > >
> > > > > > I build on my EC2 machine using your instructions, and it
> > > > > > seems
> > that
> > > > > > everything is working fine...
> > > > > >
> > > > > > Then, I noticed that your issue seems to be related to
> > > > > > unittests in dmlc-core, not in mxnet. Could you kindly check
> > > 

Re: [VOTE] Release Apache MXNet (incubating) version 1.4.1.rc0

2019-05-05 Thread kellen sunderland
am.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_parser.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_array_view.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_any.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_config.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_threaditer.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_serializer.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_threaditer_exc_handling.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_inputsplit.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_logging.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_json.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_optional.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_main.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_env.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_thread_group.cc.o
> > > > > > -o 3rdparty/dmlc-core/test/unittest/dmlc_unit_tests  -rdynamic
> > > > > > lib/libgtestd.a 3rdparty/dmlc-core/libdmlc.a -lpthread && :
> > > > > > FAILED: : && /usr/bin/c++   -Wall -Wno-unknown-pragmas -fPIC -g
> -O0
> > > > > -msse2
> > > > > > -std=c++11 -fopenmp -g  -pthread
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_lockfree.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_param.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_parser.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_array_view.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_any.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dmlc-core/test/unittest/CMakeFiles/dmlc_unit_tests.dir/unittest_config.cc.o
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 3rdparty/dm

Re: [VOTE] add conan support for Apache MXNet (incubating)

2019-05-03 Thread kellen sunderland
So firstly, let's try to keep our responses empathetic and avoid ad hominem
comments.  It might be beneficial to take some time to review the Apache
Code of Conduct [1].  Konstantin has taken a lot of time to think about
dependency management in MXNet on a volunteer basis which is commendable.

Second, Junru, I share many of your security concerns, but my understanding
is that Conan.io allows you to pull dependencies as binaries, or as source
using the --build option, so we're not limited to strictly pulling binaries
from remote servers.

Some benefits I see:
* A uniform method of pulling dependencies is much easier to maintain and
reason about.  Need to update a package because of a security
vulnerability?  Go to the single place we configure dependencies and update
it.
* We have many dependencies that do not need to be checked out depending on
the build options a user desires (so called build conditional
dependencies).  There's not much point in downloading / checking out these
dependencies if you're not going to use them.
* Subrepo sources have to have certified license reviews every release.
Using a package manager would solve this issue.
* We have an extra user base (Conan.io users) who get exposure to MXNet,
growing our user base.

Many of these benefits we'd get with other package management systems.  One
option I had previously proposed was Hunter, which is basically a wrapper
around CMake's ExternalProject functionality.  The tradeoff I see between
the two is that Hunter (or ExternalProject via CMake) would be lighter
weight but would have less support, a smaller community and would be hard
to use consistently across the project with the variety of collaborators it
has.

1: https://www.apache.org/foundation/policies/conduct.html
2: https://github.com/ruslo/hunter
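For readers unfamiliar with Conan: dependencies are declared in a single manifest. A hypothetical conanfile.txt sketch — the package names and versions here are illustrative only, not the ones proposed in the actual PR:

```text
[requires]
# Illustrative entries only -- the real PR defines its own list.
OpenBLAS/0.3.5@conan/stable
opencv/3.4.5@conan/stable

[generators]
cmake
```

Running `conan install . --build=missing` in the directory containing this file pulls binaries where available and builds the rest from source, which matches the build-from-source option mentioned above.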

On Fri, May 3, 2019 at 9:10 AM Junru Shao  wrote:

> I am actually a bit concerned about the security issues. We are asked to
> download binaries from third-party websites, which are not controlled or
> validated by Apache. Although it is claimed to be “decentralized”, I am
> really not convinced where the security comes from.
>
> In the meantime, sacrificing security doesn’t really bring us tangible
> benefits. CMake does support download artifacts from a pre-defined website
> very well. We may also have pre-built binaries stored in our CI docker
> without having to download it them from the internet.
>
> Another point is that I am not convinced of any advantage of Conan over
> other package managers for C++. Would you mind at least mentioning the
> benefits somewhere in this thread, rather than carelessly including tons of
> irrelevant links (some of which are even wrong) and eagerly asking for a
> vote? I believe that reasonable discussion would keep us within a *healthy*
> discussion.
>
> Last but not least, as we all know, most of the dependencies you mentioned
> adopt CMake, rather than Conan, the meta-generator which generates CMake.
> I don’t see how the logic “oh, you have tons of dependencies, so you must
> use Conan” stands; why not simply specify versions in git submodules,
> which everyone understands? Everyone knows how to include a
> sub-directory in cmake in one line, so why do we have to write python to make
> it less understandable and more complicated?
>
> In conclusion, we need to be reasonable in healthy discussion. I don’t
> particularly want to rudely +1 or -1 for a thing that is unclear to me, but
> I really want to see pros and cons, discussion about issues and concerns,
> rather than broken links to nonsense.
>
> Looking forward to your reply!
>
> Thanks,
> Junru
>
> On Fri, May 3, 2019 at 08:05 kellen sunderland <
> kellen.sunderl...@gmail.com>
> wrote:
>
> > Hey Konstantin.  Thanks for starting an email thread and sorry for the
> > confusion.  I think the idea is that we should discuss and agree on
> > Conan.io adoption first on the dev list, then start merging PRs.  Release
> > 1.4.1 is already in testing and the 1.5 code freeze deadline is also near
> > so I think it could be difficult to make such a large change on one of
> > those releases.  I've looked into package management solutions for the
> > project before.  I was in favour of hunter, but I think Conan's adoption
> > rate now makes it the best option.  It's simple to use and is becoming
> > industry standard, with a minor downside of requiring Python (which has
> > meanwhile become the most popular dev language).  I'd personally be -1
> for
> > 1.4.1 or 1.5, +1 for using Conan in 1.6 or 2.0.
> >
> > -Kellen
> >
> > On Fri, May 3, 2019 at 12:59 AM Konstantin Ivlev 
> > wrote:
> >
> > > hi Sheng Zha,
> > >
> > > on pull request review I was told by Anirudh anirudhacharya and Roshani
> &g

Re: [VOTE] add conan support for Apache MXNet (incubating)

2019-05-03 Thread kellen sunderland
Hey Konstantin.  Thanks for starting an email thread and sorry for the
confusion.  I think the idea is that we should discuss and agree on
Conan.io adoption first on the dev list, then start merging PRs.  Release
1.4.1 is already in testing and the 1.5 code freeze deadline is also near
so I think it could be difficult to make such a large change on one of
those releases.  I've looked into package management solutions for the
project before.  I was in favour of hunter, but I think Conan's adoption
rate now makes it the best option.  It's simple to use and is becoming
industry standard, with a minor downside of requiring Python (which has
meanwhile become the most popular dev language).  I'd personally be -1 for
1.4.1 or 1.5, +1 for using Conan in 1.6 or 2.0.

-Kellen

On Fri, May 3, 2019 at 12:59 AM Konstantin Ivlev 
wrote:

> hi Sheng Zha,
>
> on pull request review I was told by Anirudh anirudhacharya and Roshani
> Nagmote to start discussion/vote on the mxnet dev list. it seems to be a
> vicious circle now - on GitHub I am told to start a vote, and on the vote I
> am told to use GitHub; this doesn't help much.
> FYI GitHub review stuck, it's already opened since November 2018, and it's
> still not approved (however, there were no objections during the review).
> Previous discussion in e-mail thread also didn't encounter any objections,
> and all questions were answered.
> JIRA ticket has no discussion at all (except it has duplicates of comments
> made on GitHub).
> so let's proceed with a 3-day vote for now, as other communication channels
> were already tried with no success.
>
> yours sincerely, Konstantin
>
> пт, 3 мая 2019 г. в 14:17, Sheng Zha :
>
> > Hi Konstantin,
> >
> > While conan looks like an option that's worth exploring, given that your
> > request is to merge the pull request, I'd suggest that the request should
> > go through the regular pull request review and it doesn't really need a
> > vote (as it doesn't substitute reviews anyway)
> >
> > If you would like to gather more attention to it, feel free to ping in a
> > discussion thread.
> >
> > -sz
> >
> > On 2019/05/03 06:29:55, Konstantin Ivlev  wrote:
> > > Dear MXNet community,
> > >
> > > This is the 3-day vote to add conan support for Apache MXNet
> (incubating)
> > > version v1.4.1.
> > > The voting on dev@ list will start May 03 23:59:59 (PST) and close on
> > May
> > > 06 23:59:59.
> > >
> > > Background: conan is an open-source, freeware, cross-platform package
> > manager
> > > for C and C++ projects, written in python. it provides integration with
> > > various build systems, include CMake. conan may use bintray as a server
> > to
> > > store and download pre-built packages, or packages might be always
> built
> > > from sources.
> > >
> > > Problem: currently (as for v1.4.1), Apache MXNet (incubating) is using
> > > several ways to fetch 3rd-party dependencies simultaneously, for
> > instance:
> > > 1. download GitHub archives during the build
> > > - OpenBLAS
> > > - OpenCV
> > > 2. conda (alternative way to GitHub archives)
> > > 3. download from CMake
> > > - Intel Math Kernel Library (MKL)
> > > 4. Git submodules
> > > - cub
> > > - dlpack
> > > - dmlc-core
> > > - googletest
> > > - mkldnn
> > > - mshadow
> > > - onnx-tensorrt
> > > - openmp
> > > - ps-lite
> > > - tvm
> > > therefore, there are multiple places to look for 3rd parties, and it's
> > > hard to update them, as you need to remember or figure out how to update
> > > a particular dependency to a newer version, for instance.
> > > current Apache MXNet (incubating) build instructions differ greatly per
> > > platform: they require you to download and unzip some archives manually,
> > > specify variables with paths to these archives, and update git
> > > submodules.
> > >
> > > Action: merge the pull request providing initial conan support for Apache
> > > MXNet (incubating): support conan as an alternate approach to fetching
> > > various 3rd-party dependencies. Old approaches will still be available,
> > > supported, and left intact.
> > >
> > > Below are links to
> > > 1) conan web-site:  https://conan.io/
> > > 2) conan GitHub repository: https://github.com/conan-io/conan
> > > 3) conan documentation: https://docs.conan.io/en/latest/
> > > 4) bintray: https://bintray.com
> > > 5) pull request adding conan support to Apache MXNet (incubating):
> > > https://github.com/apache/incubator-mxnet/pull/13400
> > > 6) JIRA issue: https://issues.apache.org/jira/browse/MXNET-1229
> > > 7) previous email discussion:
> > >
> >
> https://lists.apache.org/thread.html/301a46a637f7e3c249c475713f701bef7530c32bc92d8834c0882897@%3Cdev.mxnet.apache.org%3E
> > > 8) MXNet build instructions:
> > > https://mxnet-tqchen.readthedocs.io/en/latest/how_to/build.html
> > > 9) MXNet build instructions (Windows):
> > >
> >
> https://mxnet.incubator.apache.org/versions/master/install/windows_setup.html
> > > 10) MXNet build instructions (OSX):
> > >
> 

Re: [VOTE] Release Apache MXNet (incubating) version 1.4.1.rc0

2019-05-03 Thread kellen sunderland
No problem Damien, glad to have you helping us validating the release.
Just wanted to make sure we have enough votes to pass the general vote (the
next release step) and with Sheng I think we should.

On Fri, May 3, 2019 at 7:52 AM Damien Stanton 
wrote:

> Ah, I misunderstood the binding/non-binding distinction. I am not a PPMC
> member, so my vote is non-binding.
>
> Best,
> Damien
>
> On Fri, May 3, 2019 at 3:19 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hi Junru could you give a quick summary of the binding / non-binding
> votes.
> >
> > Damien just want to confirm, are you a member of the PPMC for MXNet?
> > Usually committers or community members (like most of us) are encouraged
> to
> > test and vote, but technically count as non-binding for releases.
> >
> > Sheng can we assume you're +1 on the release?
> >
> > On Fri, May 3, 2019 at 12:09 AM Junru Shao 
> > wrote:
> >
> > > Hi folks,
> > >
> > > So far we have collected enough binding votes. Thank you guys for the
> > hard
> > > work testing the release!
> > >
> > > The vote on dev@ is closed on May 02 23:59:59 (PST). Next, we are
> going
> > to
> > > vote for the Apache MXNet (incubating) release 1.4.1 on general@
> > tomorrow,
> > > which starts on May 3 2019, 23:59:59 PST, and ends on May 07 2019,
> > 23:59:59
> > > PST.
> > >
> > > Best,
> > > Junru
> > >
> > > On Thu, May 2, 2019 at 11:29 PM Aston Zhang 
> > wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > Passed all the code at zh.d2l.ai
> > > >
> > > > On Thu, May 2, 2019 at 1:46 PM Joshua Z. Zhang  >
> > > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Build from source with cuda/cudnn.
> > > > >
> > > > > - All tests passed
> > > > > - GluonCV unittest scripts passed
> > > > > - GluonCV training scripts passed
> > > > > - No issue with python multiprocessing
> > > > >
> > > > > Best,
> > > > > Zhi
> > > > > > On May 2, 2019, at 11:34 AM, kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > I checked TRT integration builds and tests pass.
> > > > > > MD5s
> > > > > > Sigs look good.
> > > > > >
> > > > > > -Kellen
> > > > > >
> > > > > > On Thu, May 2, 2019 at 10:51 AM Damien Stanton <
> > > > damien.stan...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> +1 (binding)
> > > > > >>
> > > > > >> Built from source / Scala / Clojure. All tests pass. The only
> > issue
> > > of
> > > > > >> minor note: The macOS build guide indicates a directive `brew
> > > install
> > > > > >> opencv` however this installs OpenCV 4, which is currently
> > > > incompatible
> > > > > >> with mxnet and causes a failed build. The guide should specify
> > `brew
> > > > > >> install opencv@3` until/if version 4 is supported.
> > > > > >>
> > > > > >> Best,
> > > > > >> Damien
> > > > > >>
> > > > > >> On Thu, May 2, 2019 at 12:53 PM Lai Wei 
> > > wrote:
> > > > > >>
> > > > > >>> +1
> > > > > >>>
> > > > > >>> Built from source and tested keras-mxnet working fine.
> > > > > >>>
> > > > > >>> Best Regards
> > > > > >>>
> > > > > >>> Lai
> > > > > >>>
> > > > > >>>
> > > > > >>> On Wed, May 1, 2019 at 4:22 PM Carin Meier <
> carinme...@gmail.com
> > >
> > > > > wrote:
> > > > > >>>
> > > > > >>>> + 1 (binding)
> > > > > >>>>
> > > > > >>>> Built Scala/ Clojure and ran tests
> > > > > >>>>
> > > > > >>>> On Wed, May

Re: [VOTE] Release Apache MXNet (incubating) version 1.4.1.rc0

2019-05-03 Thread kellen sunderland
Awesome.  That should do it.  Thanks Sheng, and Junru for being the manager
this time around.

On Fri, May 3, 2019, 12:24 AM Sheng Zha  wrote:

> Hi Kellen,
>
> Of course, feel free to count in my vote if that’s ok. Since I helped
> prepare the artifacts I wasn’t sure if it was appropriate for me to vote so
> I refrained from voting till now.
>
> +1
>
> -sz
>
> > On May 3, 2019, at 12:19 AM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> >
> > Hi Junru could you give a quick summary of the binding / non-binding
> votes.
> >
> > Damien just want to confirm, are you a member of the PPMC for MXNet?
> > Usually committers or community members (like most of us) are encouraged
> to
> > test and vote, but technically count as non-binding for releases.
> >
> > Sheng can we assume you're +1 on the release?
> >
> >> On Fri, May 3, 2019 at 12:09 AM Junru Shao 
> wrote:
> >>
> >> Hi folks,
> >>
> >> So far we have collected enough binding votes. Thank you guys for the
> hard
> >> work testing the release!
> >>
> >> The vote on dev@ is closed on May 02 23:59:59 (PST). Next, we are
> going to
> >> vote for the Apache MXNet (incubating) release 1.4.1 on general@
> tomorrow,
> >> which starts on May 3 2019, 23:59:59 PST, and ends on May 07 2019,
> 23:59:59
> >> PST.
> >>
> >> Best,
> >> Junru
> >>
> >>> On Thu, May 2, 2019 at 11:29 PM Aston Zhang 
> wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> Passed all the code at zh.d2l.ai
> >>>
> >>> On Thu, May 2, 2019 at 1:46 PM Joshua Z. Zhang 
> >>> wrote:
> >>>
> >>>> +1 (non-binding)
> >>>>
> >>>> Build from source with cuda/cudnn.
> >>>>
> >>>> - All tests passed
> >>>> - GluonCV unittest scripts passed
> >>>> - GluonCV training scripts passed
> >>>> - No issue with python multiprocessing
> >>>>
> >>>> Best,
> >>>> Zhi
> >>>>> On May 2, 2019, at 11:34 AM, kellen sunderland <
> >>>> kellen.sunderl...@gmail.com> wrote:
> >>>>>
> >>>>> +1 (non-binding)
> >>>>>
> >>>>> I checked TRT integration builds and tests pass.
> >>>>> MD5s
> >>>>> Sigs look good.
> >>>>>
> >>>>> -Kellen
> >>>>>
> >>>>> On Thu, May 2, 2019 at 10:51 AM Damien Stanton <
> >>> damien.stan...@gmail.com
> >>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> +1 (binding)
> >>>>>>
> >>>>>> Built from source / Scala / Clojure. All tests pass. The only issue
> >> of
> >>>>>> minor note: The macOS build guide indicates a directive `brew
> >> install
> >>>>>> opencv` however this installs OpenCV 4, which is currently
> >>> incompatible
> >>>>>> with mxnet and causes a failed build. The guide should specify `brew
> >>>>>> install opencv@3` until/if version 4 is supported.
> >>>>>>
> >>>>>> Best,
> >>>>>> Damien
> >>>>>>
> >>>>>> On Thu, May 2, 2019 at 12:53 PM Lai Wei 
> >> wrote:
> >>>>>>
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Built from source and tested keras-mxnet working fine.
> >>>>>>>
> >>>>>>> Best Regards
> >>>>>>>
> >>>>>>> Lai
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, May 1, 2019 at 4:22 PM Carin Meier 
> >>>> wrote:
> >>>>>>>
> >>>>>>>> + 1 (binding)
> >>>>>>>>
> >>>>>>>> Built Scala/ Clojure and ran tests
> >>>>>>>>
> >>>>>>>> On Wed, May 1, 2019 at 7:06 PM Aaron Markham <
> >>>>>> aaron.s.mark...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Make that +1 (non-binding)
> >>>>>>>>>
> >>>>>>>>> On Wed, May 1, 2019 at 3:42 PM

Re: [VOTE] Release Apache MXNet (incubating) version 1.4.1.rc0

2019-05-03 Thread kellen sunderland
Hi Junru could you give a quick summary of the binding / non-binding votes.

Damien just want to confirm, are you a member of the PPMC for MXNet?
Usually committers or community members (like most of us) are encouraged to
test and vote, but technically count as non-binding for releases.

Sheng can we assume you're +1 on the release?

On Fri, May 3, 2019 at 12:09 AM Junru Shao  wrote:

> Hi folks,
>
> So far we have collected enough binding votes. Thank you guys for the hard
> work testing the release!
>
> The vote on dev@ is closed on May 02 23:59:59 (PST). Next, we are going to
> vote for the Apache MXNet (incubating) release 1.4.1 on general@ tomorrow,
> which starts on May 3 2019, 23:59:59 PST, and ends on May 07 2019, 23:59:59
> PST.
>
> Best,
> Junru
>
> On Thu, May 2, 2019 at 11:29 PM Aston Zhang  wrote:
>
> > +1 (non-binding)
> >
> > Passed all the code at zh.d2l.ai
> >
> > On Thu, May 2, 2019 at 1:46 PM Joshua Z. Zhang 
> > wrote:
> >
> > > +1 (non-binding)
> > >
> > > Build from source with cuda/cudnn.
> > >
> > > - All tests passed
> > > - GluonCV unittest scripts passed
> > > - GluonCV training scripts passed
> > > - No issue with python multiprocessing
> > >
> > > Best,
> > > Zhi
> > > > On May 2, 2019, at 11:34 AM, kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > > I checked TRT integration builds and tests pass.
> > > > MD5s
> > > > Sigs look good.
> > > >
> > > > -Kellen
> > > >
> > > > On Thu, May 2, 2019 at 10:51 AM Damien Stanton <
> > damien.stan...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > >> +1 (binding)
> > > >>
> > > >> Built from source / Scala / Clojure. All tests pass. The only issue
> of
> > > >> minor note: The macOS build guide indicates a directive `brew
> install
> > > >> opencv` however this installs OpenCV 4, which is currently
> > incompatible
> > > >> with mxnet and causes a failed build. The guide should specify `brew
> > > >> install opencv@3` until/if version 4 is supported.
> > > >>
> > > >> Best,
> > > >> Damien
> > > >>
> > > >> On Thu, May 2, 2019 at 12:53 PM Lai Wei 
> wrote:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> Built from source and tested keras-mxnet working fine.
> > > >>>
> > > >>> Best Regards
> > > >>>
> > > >>> Lai
> > > >>>
> > > >>>
> > > >>> On Wed, May 1, 2019 at 4:22 PM Carin Meier 
> > > wrote:
> > > >>>
> > > >>>> + 1 (binding)
> > > >>>>
> > > >>>> Built Scala/ Clojure and ran tests
> > > >>>>
> > > >>>> On Wed, May 1, 2019 at 7:06 PM Aaron Markham <
> > > >> aaron.s.mark...@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Make that +1 (non-binding)
> > > >>>>>
> > > >>>>> On Wed, May 1, 2019 at 3:42 PM Aaron Markham <
> > > >>> aaron.s.mark...@gmail.com>
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>> +1 (binding)
> > > >>>>>>
> > > >>>>>> * Built with GPU and tested the first part of the ssd example.
> > > >>>>>> * Built with GPU / cross-compiled to arm8 for Jetson.
> > > >>>>>> * Built Scala/Java on top of the cross-compiled arm8 (ran into
> > > >>> trouble
> > > >>>>>> here, but I think this is not popular enough yet to derail
> things,
> > > >>>>>> plus there are workarounds)
> > > >>>>>> * Built on CPU instance and tested docs.
> > > >>>>>> http://34.201.8.176/versions/1.4.1/api/python/io/io.html
> > > >>>>>> I don't see anything specific being different in this patch for
> > > >> docs,
> > > >>>>>> so hard to tell if there's an issue. I'll assume not given the
> > > >>>>>> successful generation of the API docs.

Re: [VOTE] Release Apache MXNet (incubating) version 1.4.1.rc0

2019-05-02 Thread kellen sunderland
+1 (non-binding)

I checked TRT integration builds and tests pass.
MD5s and sigs look good.

-Kellen

On Thu, May 2, 2019 at 10:51 AM Damien Stanton 
wrote:

> +1 (binding)
>
> Built from source / Scala / Clojure. All tests pass. The only issue of
> minor note: The macOS build guide indicates a directive `brew install
> opencv` however this installs OpenCV 4, which is currently incompatible
> with mxnet and causes a failed build. The guide should specify `brew
> install opencv@3` until/if version 4 is supported.
>
> Best,
> Damien
>
> On Thu, May 2, 2019 at 12:53 PM Lai Wei  wrote:
>
> > +1
> >
> > Built from source and tested keras-mxnet working fine.
> >
> > Best Regards
> >
> > Lai
> >
> >
> > On Wed, May 1, 2019 at 4:22 PM Carin Meier  wrote:
> >
> > > + 1 (binding)
> > >
> > > Built Scala/ Clojure and ran tests
> > >
> > > On Wed, May 1, 2019 at 7:06 PM Aaron Markham <
> aaron.s.mark...@gmail.com>
> > > wrote:
> > >
> > > > Make that +1 (non-binding)
> > > >
> > > > On Wed, May 1, 2019 at 3:42 PM Aaron Markham <
> > aaron.s.mark...@gmail.com>
> > > > wrote:
> > > > >
> > > > > +1 (binding)
> > > > >
> > > > > * Built with GPU and tested the first part of the ssd example.
> > > > > * Built with GPU / cross-compiled to arm8 for Jetson.
> > > > > * Built Scala/Java on top of the cross-compiled arm8 (ran into
> > trouble
> > > > > here, but I think this is not popular enough yet to derail things,
> > > > > plus there are workarounds)
> > > > > * Built on CPU instance and tested docs.
> > > > > http://34.201.8.176/versions/1.4.1/api/python/io/io.html
> > > > > I don't see anything specific being different in this patch for
> docs,
> > > > > so hard to tell if there's an issue. I'll assume not given the
> > > > > successful generation of the API docs.
> > > > >
> > > > >
> > > > > On Wed, May 1, 2019 at 1:28 PM Pedro Larroy
> > > > >  wrote:
> > > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > Tried CPU build + C++ tests + 714 Python unit tests in 605s.
> > > > > > ARMv7 build + small unit test in QEMU + ARMv8 builds.
> > > > > >
> > > > > > Thanks. Regards
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Wed, May 1, 2019 at 10:41 AM Qing Lan 
> > > wrote:
> > > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > build from source works for OSX and Ubuntu CPU
> > > > > > > Scala build/test successfully with Dynamic link and static
> link.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Qing
> > > > > > >
> > > > > > > 
> > > > > > > From: Sheng Zha 
> > > > > > > Sent: Wednesday, May 1, 2019 13:14
> > > > > > > To: d...@mxnet.apache.org
> > > > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version
> > > > 1.4.1.rc0
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Reminder that the vote for 1.4.1 release is still ongoing. If
> you
> > > > can, please help out. Thank you.
> > > > > > >
> > > > > > > -sz
> > > > > > >
> > > > > > > On 2019/04/30 06:51:45, Junru Shao 
> > > wrote:
> > > > > > > > Dear MXNet community,
> > > > > > > >
> > > > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> > > > version v1.4.1.
> > > > > > > > The voting on dev@ list will start Apr 29 23:59:59 (PST) and
> > > > close on May
> > > > > > > > 02 23:59:59.
> > > > > > > >
> > > > > > > > Below are links to
> > > > > > > > 1) Release notes:
> > > > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.1+Release+Notes
> > > > > > > > .
> > > > > > > > 2) Release Candidate:
> > > > > > > >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.4.1.rc0
> > > .
> > > > > > > > 3) Source and signatures on Apache dist server:
> > > > > > > >
> > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.4.1.rc0/.
> > > > > > > >
> > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > +1 = approve
> > > > > > > > +0 = no opinion
> > > > > > > > -1 = disapprove (provide reason)
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Junru Shao
> > > > > > > >
> > > >
> > >
> >
>


Re: [Announcement] New Committer - Wang Jiajun

2019-04-16 Thread kellen sunderland
Welcome!  Very impressed with the work fixing memory leaks so far.

On Tue, Apr 16, 2019 at 9:14 AM Carin Meier  wrote:

> Congrats!
>
> On Tue, Apr 16, 2019 at 11:58 AM Anirudh Subramanian <
> anirudh2...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Please join me to welcome Wang Jiajun (https://github.com/arcadiaphy)
> as a
> > new committer of Apache (incubating) MXNet!
> >
> > Wang has been solving some tough bugs with respect to memory leaks,
> process
> > fork handling, dependency engine issues and custom op exception handling.
> >
> > Issue Involvement:
> >
> >
> https://github.com/apache/incubator-mxnet/issues?utf8=%E2%9C%93&q=is%3Aissue+involves%3Aarcadiaphy
> >
> > PRs authored:
> >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Aarcadiaphy+
> >
> > Anirudh
> >
>


Re: CUDNN 7.5 Issues

2019-04-09 Thread kellen sunderland
Hey Per, just wanted to drop a line and say thanks for supporting the
community on this one.

On Tue, Apr 9, 2019 at 4:20 AM Per da Silva  wrote:

> I've created an issue to track this problem:
> https://github.com/apache/incubator-mxnet/issues/14652
>
> On Tue, Apr 9, 2019 at 9:07 AM Per da Silva  wrote:
>
> > Dear MXNet community,
> >
> > I've been trying to update the CI GPU images to CUDA 10, but the tests
> are
> > failing. I'm not sure why and would really appreciate some help =D
> >
> > I've managed, at least, to narrow down the problem to the cuDNN version.
> > The current CUDA 10 images uses cuDNN version 7.5.0.56 (
> >
> https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile
> > ).
> >
> > I noticed that the binary in the python packages we release uses cuDNN
> > 7.3.1.20 (
> >
> https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34
> ),
> > so decided to create a PR with CI updated to CUDA 10 with cuDNN 7.3.1.20
> > and sure enough the tests passed (
> > https://github.com/apache/incubator-mxnet/pull/14513).
> >
> > After talking with another contributor, we decided that I would try to
> > create a PR with CUDA 10 and cuDNN 7.5 and just disable the failing tests
> > (to be fixed later). But, it seems the problem is a bit more heinous. I
> > disable one test, and another one fails...So, it might make sense to
> reach
> > out now and see if we can find the root cause and fix it.
> >
> > Some things I've sanity checked:
> >
> > We run the tests on g3.8xlarge instances. These instances contain Tesla
> > M60 GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10
> supports
> > compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA).
> >
> > According to the cuDNN support matrix (
> > https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html
> ),
> > cuDNN 7.5 is compatible with the GPU, CUDA 10, and requires driver
> r410.48
> > (I assume greater or equal).
> >
> > The AMIs running on the g3.8xlarge have CUDA 10 and driver 410.73.
> >
> > So, as best I can tell, our environment ought to support cuDNN 7.5, which
> > leads me to conclude that maybe there's something wrong in the code.
> >
> > The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check failed:
> > e == CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".
> >
> > According to the cuDNN user guide (
> >
> https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
> > ):
> >
> > CUDNN_STATUS_ARCH_MISMATCH
> >
> > The function requires a feature absent from the current GPU device. Note
> > that cuDNN only supports devices with compute capabilities greater than
> or
> > equal to 3.0.
> >
> > To correct: compile and run the application on a device with appropriate
> > compute capability.
> >
> > But, as we've seen, our environment seems to support this version of
> cuDNN
> > and other versions go through CI w/o any problem...
> >
> > You can see some logs here:
> >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/
> >
> >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/
> >
> > I have about 13 runs of this pipeline. The errors for different runs can
> > be seen by changing the number before /pipeline (e.g.
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
> > <
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/>
> for
> > the 2nd run, etc.)
> >
> > Thanks in advance for the help!
> >
> > You can reach me here or on Slack if you have any questions =D
> >
> > Cheers,
> >
> > Per
> >
> > P.S. I'm attaching some instructions on how to reproduce the issue at
> home
> > (or at least on a g3.8xlarge instance running ubuntu 16.04).
> >
>
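
Per's compatibility reasoning above can be condensed into a quick sanity check. A minimal sketch in Python (not part of the CI code; the version numbers and ranges are the ones quoted in the thread):

```python
def version_tuple(v):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in v.split("."))

# cuDNN 7.5 requires driver r410.48 or newer; the CI AMIs run 410.73.
assert version_tuple("410.73") >= version_tuple("410.48")

# The Tesla M60's compute capability (5.2) sits inside both cuDNN's
# supported range (>= 3.0) and CUDA 10's range (3.0 - 7.5).
assert 3.0 <= 5.2 <= 7.5

print("environment meets the documented requirements")
```

On paper the environment checks out, which is consistent with Per's suspicion that the problem lies in the code rather than in the driver stack.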


Re: [MXNET 2.0 Wishlist] [DISCUSS] Backend choices during runtime

2019-04-07 Thread kellen sunderland
I think we can make some incremental progress.  My thoughts were along the
lines of plugins (thinking about what happens with the VLC project).  At
process launch time we could gather some information about our execution
environment (either through configuration, or by convention looking at our
folder structure and libraries available).  We could then later load the
components we need after understanding if we're using a CUDA backend and
what operators or subgraph components we would need.  Advantages would be
that we would move a lot of the current conditional compile logic to
runtime, and automate a lot of it.  It would also make packaging binaries
for targeted environments a little easier.  As an example we could compile
once, then remove CUDA focused libraries for systems that are going to run
on CPUs.
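
The load-at-launch idea can be sketched with Python's ctypes standing in for a real plugin loader. This is only an illustration of the mechanism, not MXNet code: libm plays the role of a hypothetical backend module, and a POSIX system is assumed.

```python
import ctypes
import ctypes.util

# Discover a shared library at process launch instead of linking it at
# build time; libm stands in for a CUDA- or MKLDNN-focused backend module.
path = ctypes.util.find_library("m")  # e.g. "libm.so.6" on Linux
backend = ctypes.CDLL(path)

# Resolve a C-ABI entry point by name, the way a loader would look up a
# registered operator or subgraph component.
cos = backend.cos
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]

print(cos(0.0))  # 1.0
```

Sticking to a C ABI for these entry points, as Tianqi suggests, is what makes this kind of by-name symbol lookup portable across compilers.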

On Sun, Apr 7, 2019 at 2:45 PM Tianqi Chen  wrote:

> While I personally like the idea. This can be something that is fairly
> technical challenging and I would caution against this idea vs pushing for
> good features and just allow runtime configuration.
>
> The main problem here is due to the C++ ABI. There is no standard c++ ABI
> across compilers, which means resorting to runtime DLL and dynamic loading
> brings all sorts of technical problems, especially when multiple modules
> depend on the same third dependency(CUDA runtime).
> There is no good to go solution can be made here, especially given the
> explosion of the backend variants and dependencies in C++.
> A partial solution could be achieved, through the sole use of C ABI.
> Combing this with code generation can result in some simplifications and
> enable some runtime loadable module. TVM does this, and perhaps MXNet could
> reuse some of that component for operator libraries. Similarly, having a
> customizable operator library that is loadable via C ABI might be possible.
>
> So to summarize, while I really like the idea of dynamically loadable
> modules. My past experience suggests that this will bring a lot of
> additional engineering burden and technical debts without significant
> benefit. I would suggest starting by supporting something simple like a
> plugin module, before moving toward the general direction.
>
> Tianqi
>
> On Sun, Apr 7, 2019 at 1:31 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Strongly support the idea of runtime loadable components in MXNet.
> There's
> > no reason (other than perhaps engineering effort) we can't have a single
> > compilation of MXNet that finds dependencies and chooses execution paths
> > intelligently (or based on configuration) at runtime.
> >
> > On Thu, Apr 4, 2019 at 12:29 PM Marco de Abreu 
> > wrote:
> >
> > > Hello,
> > >
> > > I'd like to start a discussion about something that I've noticed being
> > > troublesome to maintain in the current version: Backend choices being
> > made
> > > at compile time.
> > >
> > > Right now, the different backends and accelerators (CPU, cuda, mkl, AWS
> > > elastic inference, (future) AMD, openblas,TVM, etc) are all scattered
> > > across the different layers of MXNet. On one hand, we have compile time
> > > flags that decide which backends are being compiled into the binary,
> > while
> > > at the same time choices can be made in the frontend during runtime.
> > >
> > > At the moment, we have a lot of conditional build logic that picks
> > > different parts. With the addition of MKLML and later MKLDNN the clear
> > > separation of CPU and GPU got kind of broken up. While we have some
> > places
> > > where each code lives, in the end we resort to some files containing a
> > lot
> > > of conditional logic for the different backends (sorry I can't provide
> > > links right now since I'm on mobile). To me this seems like a residue
> of
> > > the fast development style from the early days (more preprocessor
> statements
> > > and less object orientation) while also having organic growth with new
> > > accelerators. When I see how much AMD had to hack to fit in their
> > > implementation, it seemed like we have to make this part more developer
> > > friendly.
> > >
> > > At the moment, every new flavour of MXNet has to be entirely
> recompiled.
> > > This makes it hard for users to figure out which options to use, while
> it
> > > makes it harder for us to test since the overhead to test every single
> > > combination of compile parameters would be overwhelming.
> > >
> > > I'd propose to have a clear class hierarchy based structure for
> > > accelerators, operators and memory management

Re: [MXNET 2.0 Wishlist] [DISCUSS] Backend choices during runtime

2019-04-07 Thread kellen sunderland
Strongly support the idea of runtime loadable components in MXNet.  There's
no reason (other than perhaps engineering effort) we can't have a single
compilation of MXNet that finds dependencies and chooses execution paths
intelligently (or based on configuration) at runtime.

On Thu, Apr 4, 2019 at 12:29 PM Marco de Abreu 
wrote:

> Hello,
>
> I'd like to start a discussion about something that I've noticed being
> troublesome to maintain in the current version: Backend choices being made
> at compile time.
>
> Right now, the different backends and accelerators (CPU, cuda, mkl, AWS
> elastic inference, (future) AMD, openblas,TVM, etc) are all scattered
> across the different layers of MXNet. On one hand, we have compile time
> flags that decide which backends are being compiled into the binary, while
> at the same time choices can be made in the frontend during runtime.
>
> At the moment, we have a lot of conditional build logic that picks
> different parts. With the addition of MKLML and later MKLDNN the clear
> separation of CPU and GPU got kind of broken up. While we have some places
> where each code lives, in the end we resort to some files containing a lot
> of conditional logic for the different backends (sorry I can't provide
> links right now since I'm on mobile). To me this seems like a residue of
> the fast development style from the early days (more preprocessor statements
> and less object orientation) while also having organic growth with new
> accelerators. When I see how much AMD had to hack to fit in their
> implementation, it seemed like we have to make this part more developer
> friendly.
>
> At the moment, every new flavour of MXNet has to be entirely recompiled.
> This makes it hard for users to figure out which options to use, while it
> makes it harder for us to test since the overhead to test every single
> combination of compile parameters would be overwhelming.
>
> I'd propose to have a clear class hierarchy based structure for
> accelerators, operators and memory management. This structure can then be
> implemented by the different backends. To reduce the compile burden, we
> would introduce dynamic loading and split the different backends into
> modules. These could then be developed, maintained and compiled on their
> own and then placed in a "module" folder to be loaded at runtime. Adding a
> new accelerator would be a matter of placing the precompiled binary into
> the folder. The detailed configuration of that Backend would then be done
> on runtime - the user shouldn't worry at the point of downloading mxnet
> whether they want mkl, MKLDNN, mkl, openblas, atlas, TVM, cuda or what ever
> else there is. I have an idea how we could help the user choosing, but
> that's outside the scope of this proposal.
>
> This would allow us to have a "core" MXNet that takes care of the engine,
> scheduling, communication and all the other crucial parts. On the other
> hand we could make MXNet less of a monolith and have clear interfaces. This
> would also act as a forcing function because the different parts wouldn't
> be intermingled but have to follow the common interface.
>
> Of course this comes with the question what these interfaces would look
> like. For operators, I'd like to propose getting inspiring (or fully
> adapting) ONNX. For memory management and other Backend specific things we
> could look at the current implementations and find a common ground.
>
> Back when I had a community driven project, we heavily used this modularity
> and it brought great benefits - besides the fact that our core was closed
> source. It allowed community developers to act entirely independent from
> other parts and even allowed them to add their own logic without having to
> touch the core. Thinking about companies that implement their own backends
> or have special tweaked operators without wanting to disclose them, this
> structure would avoid them having to fork the project and then spend a lot
> of effort porting the changes to the latest source release versions.
> Instead, they would maintain their module and we as MXNet community would
> only have to maintain these interfaces.
>
> Right now this is a lot of prosa and basically a brain dump of my thoughts.
> I'd be happy to follow up with details, but first I'd be curious what the
> community thinks about this design.
>
> Best regards,
> Marco
>


Re: assimilation of mshadow into the MXNet codebase

2019-04-07 Thread kellen sunderland
"Does merging mshadow into mxnet bring any actual benefit for customers in
sense of performance, portability, or anything else?"

It would improve the contributor experience in that if we find a bug which
requires fixes in both repos, we won't have to coordinate 2 PRs.  It would
also make compilation more straightforward (as others have mentioned).

On Sun, Apr 7, 2019 at 11:56 AM Aaron Markham 
wrote:

> +1
> Reduced complexity. Choice of math library... Hopefully you can just
> install MKL and not be forced into mshadow's dependency on OpenBLAS. This
> could make Windows setup easier.
> Maybe this issue will get fixed: #11769.
>
> On Sun, Apr 7, 2019, 00:51 Junru Shao  wrote:
>
> > Does merging mshadow into mxnet bring any actual benefit for customers in
> > sense of performance, portability, or anything else?
> >
> > On Fri, Apr 5, 2019 at 9:38 PM Tianqi Chen 
> > wrote:
> >
> > > Technically, mshadow is sufficient for MXNet. Adopting other libraries
> (
> > > eigen or xtensor) will unnecessarily increase the codebase complexity
> > > without any additional gains.
> > >
> > > Given that mshadow is only used by mxnet. I do support donating it into
> > > mxnet codebase.
> > > To respect the original mshadow community. I would recommend starting a
> > > community RFC In the mshadow github issue for a week, before we start
> the
> > > migrating process.
> > > Also, I would recommend a rebase merge just like the case of MXNet.jl
> > code
> > > base to preserve the contribution history.
> > >
> > > Tianqi
> > >
> > >
> > > On Fri, Apr 5, 2019 at 9:25 PM Alfredo Luque
> > >  wrote:
> > >
> > > > Do you have a link to both of these proposals?
> > > >
> > > > On Fri, Apr 5, 2019 at 20:14 Anirudh Acharya 
> > > > wrote:
> > > >
> > > > > Hi Pedro,
> > > > >
> > > > > mshadow is mostly used for tensor arithmetic. There have been
> > > discussions
> > > > > about including it within mxnet. I think it is a good idea.
> > > > >
> > > > > As a more long term solution using libraries like eigen to perform
> > > linear
> > > > > algebra operations was also suggested by anirudh2290@. I think
> > > xtensor(
> > > > > https://github.com/QuantStack/xtensor ) can also be a candidate
> > here.
> > > > >
> > > > > -
> > > > > Anirudh
> > > > >
> > > > >
> > > > > On Fri, Apr 5, 2019 at 7:03 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > Some developers have noticed that working in mshadow is
> cumbersome
> > as
> > > > > > it's a 3rdparty subrepo.
> > > > > >
> > > > > > Since mshadow is a bunch of headers which don't have much of
> > > > > > independent tests / library functionality, me and other
> developers
> > > > > > believe that it would be good to assimilate this code in the
> > > > > > repository for ease of contribution and changes without having to
> > go
> > > > > > through contortions to test PRs that modify mshadow.
> > > > > >
> > > > > > Would anybody oppose this change?
> > > > > >
> > > > > > Thanks and have a nice weekend.
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > >
> > > >
> > >
> >
>


[MXNET 2.0 Wishlist] [DISCUSS] Single build system

2019-04-04 Thread Kellen Sunderland
Hello MXNet devs,

I'd like to start a thread discussing what our build system should look
like in MXNet 2.0.  I'd propose that although the current make system has
served us well in the past, we remove it along with the bump to 2.0.  The
end goal I'd like to see is that we have a clean build system, without a
bunch of conditional logic that makes contributing and testing MXNet a
simpler process.  Additionally I'd propose we target a minimum cmake
version of 3.7 for reasons described below.

First I'd like to give some context on why I'd propose we don't just switch
to cmake, but we also target a relatively new version (version 3.7 from
Nov, 2016) of cmake.  The largest benefits in making this change would
apply to CUDA builds where cmake itself has quite inconsistent
functionality between versions.  One persistent annoyance I've had with
cmake is that we've had conditional logic for the FindCUDA command which at
one point targeted some modern cmake features, but then in subsequent
versions of cmake the way these features works was tweaked, and now I find
these cmake features are consistently broken to the point that I require a
bunch of -D defines to compile properly or to use an IDE.  An additional
CUDA related issue is that every time there's a new SM added to NVCC we
have to make a few source changes to support it.  I could see this being
problematic for users who may suddenly realize that due to their
compilation settings, they may not actually be enabling the features they
think they are with their shiny new GPUs.

As an alternative if we, for example, target cmake 3.7 at a minimum, and we
want to find cuda and then build a list of reasonable PTX/BINS we could use
the following command[1]:


find_package(CUDA REQUIRED)  # loads cmake's FindCUDA module
...
CUDA_SELECT_NVCC_ARCH_FLAGS(ARCH_FLAGS 3.0 3.5+PTX 5.2(5.0) Maxwell)
LIST(APPEND CUDA_NVCC_FLAGS ${ARCH_FLAGS})


Simple, concise, and it would help to make the building experience more
consistent across platforms, build environments and IDEs (looking at you
CLion).  We'd of course need to do a little experimentation work to make
sure that this does indeed work as intended, and can replace the currently
complex findCuda logic we have in our build systems, but for the sake of
the proposal let's assume these cmake commands do indeed work consistently
as documented from cmake 3.7 onwards.
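
For illustration, the PTX/SASS selection that CUDA_SELECT_NVCC_ARCH_FLAGS automates boils down to something like the following sketch (a hypothetical helper, not the actual cmake implementation):

```python
def gencode_flags(archs):
    """Build NVCC -gencode flags: SASS for each listed arch, plus PTX for
    the newest arch so future GPUs can still JIT-compile the kernels."""
    flags = ["-gencode=arch=compute_%d,code=sm_%d" % (a, a)
             for a in sorted(archs)]
    newest = max(archs)
    flags.append("-gencode=arch=compute_%d,code=compute_%d" % (newest, newest))
    return flags

for flag in gencode_flags([30, 52]):
    print(flag)
# -gencode=arch=compute_30,code=sm_30
# -gencode=arch=compute_52,code=sm_52
# -gencode=arch=compute_52,code=compute_52
```

Having the build system generate these flags, instead of hand-maintaining them per SM, is exactly what removes the per-GPU source changes mentioned above.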

To give users a chance to update their tooling I'd also suggest we begin
warning users at least a release in advance that make based builds will be
deprecated in MXNet 2.0 so they can begin migrating to cmake.  I'd also
want to display deprecation messages for unused cmake flags (such as the
profiler flag) for a release before the 2.0 release, and then remove them
in 2.0.

Of course not all users have cmake 3.7 on their systems, some of our
employers force us to use ridiculously outdated Linux distributions.  The
good news for these users is that if we can offer Docker compilation with
an image that has a supported version of cmake and we should be able to
build a portable binary that work even with very old distributions of
Linux.  Additionally installing cmake from source is also fairly
straightforward [2] and works quite well on older distros in my experience.

Looking forward to hearing what others think.  Any preferred build systems
that you all would want to use?  Is cmake the right system to centralize
on?  If so, is version 3.7 a reasonable minimum version to target?  Is the
2.0 release a good point at which we can think about simplifying build
logic?

1: https://cmake.org/cmake/help/v3.7/module/FindCUDA.html
2: https://github.com/Kitware/CMake


Re: Discussing plans for next MXNet releases

2019-04-02 Thread kellen sunderland
Release breakdown makes sense to me Hagay.  Thanks for initiating a
discussion.

Some features that I'm personally looking forward to that I hope can make
it into 1.5 (schedule permitting):
*  TensorRT being integrated with the subgraph API
*  VNNI MKLDNN support
*  AMP training in MXNet

I like the idea of having a call to align on features for the 1.5 release.
For those unable to dial in we have a rapporteur who can send notes around
after the meeting.

For the 2.0 release I wonder if we could start a thread that would have a
list of big changes/features people would like to see.  I know there's been
a few changes I've made that required sub-optimal implementations to avoid
a breaking change.  This could be a good opportunity to clean up prior
work.  It'd also be a good opportunity to prune our operators to those that
are well supported, and to make sure they're named and structured in an
understandable way for users and contributors.

-Kellen

On Tue, Apr 2, 2019 at 5:06 PM Hagay Lupesko  wrote:

> Dear MXNet community,
>
> I wanted to initiate a discussion about the plan and scope for the next
> MXNet releases.
> I suggest we focus on three releases, and get the process going in
> parallel:
> (1) 1.4.1 - patch release on top of 1.4.0 to address some perf regressions
> and memory leaks I am aware of, such as the memory leak fixed on Scala
> [0]. I went ahead and
> created a draft release proposal wiki [1
> <
> https://cwiki.apache.org/confluence/display/MXNET/%5BDRAFT+PROPOSAL%5D+Apache+MXNet+%28incubating%29+1.4.1+Release+Plan+and+Status
> >
> ].
> (2) 1.5.0 - a minor release to add new features introduced since 1.4.0
> release started (back in Nov 2018!), such as various performance
> improvements: aggregate SGD, in-place updates in optimizers, gpu support
> for image processing operators and many more features useful for MXNet’s
> users.
> (3) 2.0 - an exciting major release that will include major enhancements to
> MXNet.
>
> Timeframes will probably vary based on the scope. I think we should plan to
> start 1.4.1 release within a couple of weeks, 1.5.0 should target starting
> once we release 1.4.1, and 2.0 timeline is TBD - but such a major release
> will require more time to discuss and decide in the community.
>
> I was thinking to get started through:
> (1) Draft proposals on CWiki, where the community can add content and
> propose scope and features.
> (2) Setup online meetings, where anyone can dial into, from anywhere, where
> we will have a chance to discuss in voice+video.
> (3) With (1)+(2) have a scope and timeline that the community, in large,
> supports.
>
> Would be great to get the community's feedback and suggestions, and please
> reply if you would like to be involved in the effort of supporting the
> releases!
>
> MXNet is awesome, looking forward to working together to make it even
> better!
> Hagay
>
> [0] https://github.com/apache/incubator-mxnet/pull/14586
> [1]
>
> https://cwiki.apache.org/confluence/display/MXNET/%5BDRAFT+PROPOSAL%5D+Apache+MXNet+%28incubating%29+1.4.1+Release+Plan+and+Status
>


Re: R help

2019-03-25 Thread kellen sunderland
Is this the error?
"test_model.R:129: error: Fine-tune

cannot open URL
'http://data.dmlc.ml/models/imagenet/inception-bn/Inception-BN-0126.params'
1: GetInception() at R-package/tests/testthat/test_model.R:129
2: 
download.file("http://data.dmlc.ml/models/imagenet/inception-bn/Inception-BN-0126.params",
   destfile = "model/Inception-BN-0126.params")"

Looks like the 
http://data.dmlc.ml/models/imagenet/inception-bn/Inception-BN-0126.params
is failing for me as well.
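
For what it's worth, failures like this can be caught before the download with a small pre-flight check. The sketch below is plain Python standard library; the `url_reachable` helper is hypothetical and not part of the MXNet or R test suites:

```python
import urllib.error
import urllib.request


def url_reachable(url, timeout=10):
    """Return True if the URL can be opened, False on any network or URL error.

    Hypothetical pre-flight helper for tests that download model files.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, ValueError):
        return False
```

A test harness could then skip the fine-tune case when the model host is unreachable instead of failing the whole run.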


On Mon, Mar 25, 2019 at 10:37 AM Anirudh Acharya 
wrote:

> Hi Per da Silva,
>
> Let me know if I can help, we can chat offline.
>
> From first glance it would seem
>
>- R:MKLDNN CPU is passing whereas R:CPU is failing
>- R:GPU might have failed due to this "cannot open URL '
>
> http://data.dmlc.ml/models/imagenet/inception-bn/Inception-BN-0126.params
>'"
>
>
> Thanks
> Anirudh
>
>
> On Mon, Mar 25, 2019 at 7:34 AM Per da Silva  wrote:
>
> > Dear community,
> >
> > I'm working on a PR <
> https://github.com/apache/incubator-mxnet/pull/14513>
> > to update CI GPU jobs to be based on CUDA v10. However, for some reason,
> > amongst other things, the R tests are failing
> > <
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-14513/4/pipeline
> > >.
> > I would really appreciate some help from the R experts to get it sorted
> =D
> >
> > Thanks in advance,
> >
> > Per
> >
>


Re: [Announcement] New Committer - Patric Zhao

2019-03-20 Thread kellen sunderland
Congrats Patric!

On Sun, Mar 17, 2019 at 10:34 PM Hagay Lupesko  wrote:

> Congrats Patric!
>
> On Fri, Mar 15, 2019 at 7:49 AM Joshua Z. Zhang 
> wrote:
>
> >
> >
> >
> >  Congrats Patrick!
> >
> >
> >
> >
> >
> >  Zhi
> >
> > >
> > > > On Mar 15, 2019 at 10:46 PM, marco.g.ab...@gmail.com wrote:
> > >
> > >
> > >
> > >  Congratulations, great to have you on board!
> > >
> > > -Marco
> > >
> > > Lv, Tao A wrote on Fri., Mar 15, 2019, 15:38:
> > >
> > > >  Wow, congratulation Patric!
> > > >
> > > >  -Original Message-
> > > >  From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> > > >  Sent: Friday, March 15, 2019 10:25 PM
> > > >  To: dev@mxnet.incubator.apache.org
> > > >  Cc: patric zhao  
> > > >  Subject: Re: [Announcement] New Committer - Patric Zhao
> > > >
> > > >  Congratulation Patrick!
> > > >  Steffen
> > > >
> > > >  On Fri, Mar 15, 2019 at 5:38 AM Zhao, Patric  <
> patric.z...@intel.com>
> >
> > > >  wrote:
> > > >
> > > >   >  I am very glad to have this opportunity to contribute to the
> > > >   >  Apache/MXNet community :)
> > > >   >
> > > >   >  Thanks all of the supports from the community and Intel.
> > > >   >
> > > >   >  BR,
> > > >   >
> > > >   >  --Patric
> > > >   >
> > > >   >
> > > >   >   >  -Original Message-
> > > >   >   >  From: MiraiWK WKCN [mailto:w...@live.cn]
> > > >   >   >  Sent: Friday, March 15, 2019 12:52 AM
> > > >   >   >  To: dev@mxnet.incubator.apache.org; patric zhao
> > > >   >   >   
> > > >   >   >  Subject: Re: [Announcement] New Committer - Patric Zhao
> > > >   >   >
> > > >   >   >  Welcome Peng Zhao!
> > > >   >   >  Peng is the AI Tech Leader at Intel Corporation. We have had
> > > >   >   >  good cooperation before. He is very professional and
> > > >   >   >  contributes a lot to MXNet, especially deep learning
> > > >   >   >  acceleration on CPU.
> > > >   >   >
> > > >   >   >  
> > > >   >   >  From: Anirudh Subramanian  
> > > >   >   >  Sent: Thursday, March 14, 2019 3:54:50 PM
> > > >   >   >  To: dev@mxnet.incubator.apache.org; patric zhao
> > > >   >   >  Subject: [Announcement] New Committer - Patric Zhao
> > > >   >   >
> > > >   >   >  Hi all,
> > > >   >   >
> > > >   >   >  Please join me to welcome Patric Zhao as a new committer of
> > Apache
> > > >   >   >  (incubating) MXNet!
> > > >   >   >
> > > >   >   >  Patric has put in great effort around MKLDNN integration
> into
> > MXNet
> > > >   >   >  and
> > > >   >  has
> > > >   >   >  been involved in features like quantization, graph fusion
> and
> > fused
> > > >   >   >  RNN operators for CPU.
> > > >   >   >
> > > >   >   >  Dev List activity:
> > > >   >   >
> > > >   >
> > https://lists.apache.org/list.html?d...@mxnet.apache.org:lte=3y:patric.
> > > >   >  zhao
> > > >   >   >
> > > >   >   >  Issues:
> > > >   >   >  https://github.com/apache/incubator-
> > > >   >   >
> > mxnet/issues?utf8=%E2%9C%93=is%3Aissue+involves%3Apengzhao-intel+
> > > >   >   >
> > > >   >   >  PR Reviews:
> > > >   >   >  https://github.com/apache/incubator-
> > > >   >   >
> > mxnet/pulls?utf8=%E2%9C%93=is%3Apr+reviewed-by%3Apengzhao-intel
> > > >   >   >
> > > >   >   >  Proposals involved in:
> > > >   >   >
> > https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimi
> > > >   >   >  z
> > > >   >   >  ation+and+Quantization+based+on+subgraph+and+MKL-DNN
> > > >   >   >
> > https://cwiki.apache.org/confluence/display/MXNET/Fused+RNN+Operator
> > > >   >   >  s
> > > >   >   >  +for+CPU
> > > >   >   >   <
> > https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optim
> > > >   >   >  i
> > > >   >   >  zation+and+Quantization+based+on+subgraph+and+MKL-DNN>
> > > >   >   >
> > > >   >   >
> > > >   >   >  Thanks,
> > > >   >   >  Anirudh
> > > >   >
> > > >
> > >
>


Re: [Announcement] New Committer -- Steffen Rochel

2019-02-04 Thread kellen sunderland
Great news.  Congrats Steffen.

On Mon, Feb 4, 2019, 5:29 PM Thomas DELTEIL wrote:

> Welcome Steffen!
>
> On Mon, Feb 4, 2019, 15:55 Marco de Abreu wrote:
> > Welcome!
> >
> > On Tue., Feb 5, 2019, 00:45 Chris Olivier wrote:
> >
> > > Dear Community:
> > >
> > > Please join me to welcome Steffen Rochel (steffenroc...@gmail.com) as
> a
> > > new
> > > committer of Apache (incubating) MXNet!
> > >
> > > Steffen has played a role in nearly every MXNet release in the past 18
> > > months, managed several of the wiki pages and has contributed in
> > expanding
> > > the community by managing and hosting meetups in different parts of the
> > > world.
> > >
> > > -Chris
> > >
> >
>


Re: [Announcement] New Committer -- Lin Yuan

2019-02-02 Thread kellen sunderland
Congrats Lin!  Well deserved.

On Sat, Feb 2, 2019 at 11:05 PM Marco de Abreu 
wrote:

> Congratulations, welcome!
>
> On Sun., Feb 3, 2019, 04:04 Chaitanya Bapat wrote:
>
> > Congratulations Lin! Way to go!
> >
> > On Sat, 2 Feb 2019 at 19:39, sandeep krishnamurthy <
> > sandeep.krishn...@gmail.com> wrote:
> >
> > > Welcome Lin :-)
> > >
> > > On Sat, Feb 2, 2019, 3:28 PM Yuan Tang wrote:
> > > > Welcome Lin!
> > > >
> > > > > On Sat, Feb 2, 2019 at 6:27 PM Tianqi Chen
> > > > wrote:
> > > >
> > > > > Dear Community:
> > > > >
> > > > > Please join me to welcome Lin Yuan(@apeforest) as a new committer
> of
> > > > > Apache(incubating) MXNet!
> > > > >
> > > > > He has contributed to various improvements, including better
> > > > compatibility
> > > > > of larger arrays across the codebase.
> > > > >
> > > > > Commits:
> > > > > https://github.com/apache/incubator-mxnet/commits?author=apeforest
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+author%3Aapeforest
> > > > >
> > > > >
> > > > > Reviews:
> > > > > https://github.com/apache/incubator-mxnet/pulls?utf8=%
> > > > > E2%9C%93=reviewed-by%3Aapeforest
> > > > >
> > > > > dev@ activity
> > > > >
> > >
> https://lists.apache.org/list.html?*@mxnet.apache.org:lte=6M:Lin%20Yuan
> > > > >
> > > > > Tianqi
> > > > >
> > > >
> > >
> >
> >
> > --
> > *Chaitanya Prakash Bapat*
> > *+1 (973) 953-6299*
> >
> >
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.4.0.rc2

2019-01-31 Thread kellen sunderland
Great, thanks Steffen!  I added a few key files but missed that one.

+1 from me.

On Thu, Jan 31, 2019 at 9:35 AM Steffen Rochel 
wrote:

> Kellen - Sergey, the 1.4.0 release co-manager signed the tar file. Please
> use his public key to validate the asc.
> I was able to validate:
>
> curl https://dist.apache.org/repos/dist/dev/incubator/mxnet/KEYS -o KEYS
>
> gpg --import KEYS
>
> gpg --verify apache-mxnet-src-1.4.0.rc2-incubating.tar.gz.asc
>
>
> output:
>
> gpg: assuming signed data in 'apache-mxnet-src-1.4.0.rc2-incubating.tar.gz'
>
> gpg: Signature made Sat Jan 26 16:25:41 2019 PST
>
> gpg: using RSA key BD52136E76B7BD68E7843B0B591C06669F740FD7
>
> gpg: Good signature from "Sergey Kolychev "
> [unknown]
>
> gpg: WARNING: This key is not certified with a trusted signature!
>
> gpg:  There is no indication that the signature belongs to the
> owner.
>
> Primary key fingerprint: BD52 136E 76B7 BD68 E784  3B0B 591C 0666 9F74 0FD7
>
>
> Best,
> Steffen
>
> On Wed, Jan 30, 2019 at 10:39 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > +0
> >
> > Overall release looks good.  Probably something I'm doing wrong, but so
> far
> > not able to validate the .asc.  I'm getting "Can't check signature: No
> > public key".  I've added the keys from GitHub and the release folder, and
> > also added your public key "40C9346904DFCE37" from the MIT key server
> > Steffen.  Is there another key I'm missing?
> >
> > 1. sha512 look good.
> > 2. Compile from source successfully
> > 3. TensorRT build succeeds and runs inference for demo models
> > 4. License, notice and disclaimer exist.
> >
> > -Kellen
> >
> > On Wed, Jan 30, 2019 at 8:58 PM Steffen Rochel 
> > wrote:
> >
> > > Dear MXNet community -
> > > we currently have three +1 votes, one binding.
> > > As the vote did not reach the necessary number of binding votes I'm
> > > extending voting.
> > >
> > > I'm calling on all PMC member, please test and vote.
> > >
> > > Regards,
> > > Steffen
> > >
> > > On Wed, Jan 30, 2019 at 6:43 PM Aston Zhang 
> > wrote:
> > >
> > > > +1
> > > >
> > > > Tested with the Dive into Deep Learning book.
> > > >
> > > > On Wed, Jan 30, 2019 at 1:25 PM Steffen Rochel <
> > steffenroc...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Carin and Yuxi.
> > > > >
> > > > > Committers and PMC members - please test and send your vote to
> > release
> > > > > Apache MXNet (incubating) v1.4.0.
> > > > >
> > > > > Regards,
> > > > > Steffen
> > > > >
> > > > > On Wed, Jan 30, 2019 at 10:55 AM Yuxi Hu 
> > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > verified the training throughput for resnet50_v1 looks normal
> > > compared
> > > > to
> > > > > > 1.3.1 release
> > > > > >
> > > > > > On Tue, Jan 29, 2019 at 3:36 PM Carin Meier <
> carinme...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > +1 - checked out from the release tag and built and tested
> > > > > Scala/Clojure
> > > > > > > package.
> > > > > > >
> > > > > > > On Sat, Jan 26, 2019 at 8:53 PM Steffen Rochel <
> > > > > steffenroc...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Dear MXNet community,
> > > > > > > >
> > > > > > > > This is the vote to release Apache MXNet (incubating) version
> > > > v1.4.0.
> > > > > > > > Voting will
> > > > > > > > start today, Saturday January 26th 6pm PST and will close on
> > > > > Wednesday,
> > > > > > > > January 30th 7pm PST.
> > > > > > > >
> > > > > > > > Link to release notes:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+
> > > > > > > > 1.4.0+Release+Notes
> > > > > > > >
> > > > > > > > Link to release candidate:
> > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/
> > > > > > > > <
> > > https://github.com/apache/incubator-mxnet/releases/tag/1.4.0.rc2
> > > > > > > >1.4.0.rc
> > > > > > > > <
> > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.4.0.rc1
> > > > >2
> > > > > > > >
> > > > > > > > Link to source and signatures on apache dist server:
> > > > > > > >
> > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.4.0.rc2
> > > > > > > >
> > > > > > > >
> > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > +1 = approve
> > > > > > > > +0 = no opinion
> > > > > > > > -1 = disapprove (provide reason)
> > > > > > > >
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Steffen
> > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Yuxi(Darren) Hu, Ph.D.
> > > > > > Software Development Engineer
> > > > > > Amazon Web Services
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.4.0.rc2

2019-01-30 Thread kellen sunderland
+0

Overall release looks good.  Probably something I'm doing wrong, but so far
not able to validate the .asc.  I'm getting "Can't check signature: No
public key".  I've added the keys from GitHub and the release folder, and
also added your public key "40C9346904DFCE37" from the MIT key server
Steffen.  Is there another key I'm missing?

1. sha512 look good.
2. Compile from source successfully
3. TensorRT build succeeds and runs inference for demo models
4. License, notice and disclaimer exist.
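
For anyone else validating a candidate, step 1 above can be scripted. This is a minimal sketch using only the Python standard library; the helper names and the digest argument are placeholders for the value published on the Apache dist server:

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-512 so large release tarballs are not read into memory at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a local artifact against the digest published alongside the release."""
    return sha512_of(path) == published_hex.strip().lower()
```

The `.asc` signature check is separate and still goes through `gpg --verify` after importing the release manager's key.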

-Kellen

On Wed, Jan 30, 2019 at 8:58 PM Steffen Rochel 
wrote:

> Dear MXNet community -
> we currently have three +1 votes, one binding.
> As the vote did not reach the necessary number of binding votes I'm
> extending voting.
>
> I'm calling on all PMC member, please test and vote.
>
> Regards,
> Steffen
>
> On Wed, Jan 30, 2019 at 6:43 PM Aston Zhang  wrote:
>
> > +1
> >
> > Tested with the Dive into Deep Learning book.
> >
> > On Wed, Jan 30, 2019 at 1:25 PM Steffen Rochel 
> > wrote:
> >
> > > Thanks Carin and Yuxi.
> > >
> > > Committers and PMC members - please test and send your vote to release
> > > Apache MXNet (incubating) v1.4.0.
> > >
> > > Regards,
> > > Steffen
> > >
> > > On Wed, Jan 30, 2019 at 10:55 AM Yuxi Hu  wrote:
> > >
> > > > +1
> > > >
> > > > verified the training throughput for resnet50_v1 looks normal
> compared
> > to
> > > > 1.3.1 release
> > > >
> > > > On Tue, Jan 29, 2019 at 3:36 PM Carin Meier 
> > > wrote:
> > > >
> > > > > +1 - checked out from the release tag and built and tested
> > > Scala/Clojure
> > > > > package.
> > > > >
> > > > > On Sat, Jan 26, 2019 at 8:53 PM Steffen Rochel <
> > > steffenroc...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Dear MXNet community,
> > > > > >
> > > > > > This is the vote to release Apache MXNet (incubating) version
> > v1.4.0.
> > > > > > Voting will
> > > > > > start today, Saturday January 26th 6pm PST and will close on
> > > Wednesday,
> > > > > > January 30th 7pm PST.
> > > > > >
> > > > > > Link to release notes:
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+
> > > > > > 1.4.0+Release+Notes
> > > > > >
> > > > > > Link to release candidate:
> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/
> > > > > > <
> https://github.com/apache/incubator-mxnet/releases/tag/1.4.0.rc2
> > > > > >1.4.0.rc
> > > > > > <
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.4.0.rc1
> > >2
> > > > > >
> > > > > > Link to source and signatures on apache dist server:
> > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.4.0.rc2
> > > > > >
> > > > > >
> > > > > > Please remember to TEST first before voting accordingly:
> > > > > > +1 = approve
> > > > > > +0 = no opinion
> > > > > > -1 = disapprove (provide reason)
> > > > > >
> > > > > >
> > > > > > Best regards,
> > > > > > Steffen
> > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Yuxi(Darren) Hu, Ph.D.
> > > > Software Development Engineer
> > > > Amazon Web Services
> > > >
> > >
> >
>


Re: Rust Client Lib

2019-01-29 Thread kellen sunderland
Great response Carin.

Just wanted to chime in and say that, while the amount of work required to
maintain a new language binding shouldn't be underestimated, I'd love to
see some Rust support.  The interop patterns between Rust and C/C++ in
particular could make propagating errors a nicer experience.  I've also
often wished we had a native but memory-safe language with an MXNet
binding.

On Tue, Jan 29, 2019 at 10:13 AM Carin Meier  wrote:

> Hi Zach,
>
> I'm the original author of the Clojure package so I can give you my
> perspective, (although your path might be different).
>
> First, one of the advantages that MXNet has of the other deep learning
> libraries is its multi-language support. People can program and develop in
> the language of their choice.
>
> The path the Clojure package took is that it originated in an github issue.
> From there, the main package was developed in my personal repo until it got
> to a point that I could share it and get feedback from other people in the
> Clojure community. Once I felt like it was developed enough, I sent out a
> email to the dev list, opened a PR, and drafted up some documentation on
> the Design and Architecture as well as the state of things on the wiki
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Clojure.
>
> After much feedback and review, it was brought in under as a "contrib"
> package, where it is spending time stabilizing and generally improving to
> the point that it can "graduate".
>
> It's a long term commitment to bring a new language support in, but it is
> very rewarding for both the MXNet project and your language community.
>
> One valuable piece of feedback that I got on the original PR
> https://github.com/apache/incubator-mxnet/pull/11205 that might be
> valuable
> for you as well to think of, came from Kovas Boguta on higher level
> concerns:
>
> - What will the debug experience be like? How do users track down errors
> that happen in the front end Rust code or lower level code?
> - What does maintenance look like? How does the Rust API evolve with the
> rest of the library?
> - How do people learn to use this thing? Is there any easy way to go from
> the current MXNet docs to the Rust version. How is the documentation going
> to work long term.
>
> I hope this helps,
> Carin
>
>
>
>
> On Tue, Jan 29, 2019 at 5:05 AM Zach Boldyga  wrote:
>
> > Hey y'all!
> >
> > I'm thinking about spending this week working on a rust client lib for
> > MXNet. saw a little bit of chatter about this in the github issues and no
> > strong existing crates at the moment. Any pointers on approaching this
> in a
> > way that will lead to it being adopted as an officially supported client
> > library? And overall yay/nay on whether adding a Rust lib makes sense &
> why
> > / why not?
> >
> > Zach Boldyga
> > Scalabull  |  Founder
> > 1 (866) 846-8771 x 101
> >
>


Re: [DISCUSS] Current Publish problems

2019-01-23 Thread kellen sunderland
Hey Qing, thanks for the summary and to everyone for automating the
deployment process.  I've left a few comments on the doc.

On Wed, Jan 23, 2019 at 11:46 AM Qing Lan  wrote:

> Hi all,
>
> Recently Zach announced the availability for MXNet Maven publishing
> pipeline and general static-build instructions. In order to make it better,
> I drafted a document that includes the problems we have for this pipeline:
> https://cwiki.apache.org/confluence/display/MXNET/Outstanding+problems+with+publishing.
> Some of them may need to be addressed very soon.
>
> Please kindly review and leave any comments you may have in this thread or
> in the document.
>
> thanks,
> Qing
>
>


Re: Apache MXNet v1.4.0 release status

2019-01-20 Thread kellen sunderland
Hey Steffen, thanks for allowing a little extra time to merge PRs.  All my
PRs are in.  #13905 looks good to me, but we may want someone a little more
familiar with MKLDNN / the ndarray.h file to review it as well.

On Sun, Jan 20, 2019 at 11:53 AM Steffen Rochel 
wrote:

> Still waiting for merge of
> https://github.com/apache/incubator-mxnet/pull/13905
>
> All other PR are merged and CI tests are passing. Please no more changes on
> 1.4.x branch beside merge for PR 13905, so we can move forward with 1.4.0
> release.
>
> Best Regards,
> Steffen
>
> On Fri, Jan 18, 2019 at 9:48 AM Steffen Rochel 
> wrote:
>
> > Dear MXNet community -
> > thanks for merging previously agreed PR's into v1.4.x branch.
> >
> > Kellen and Zhennan - what is the ETA for
> > https://github.com/apache/incubator-mxnet/pull/13905 ? Please try to
> > merge today.
> >
> > Yuxi asked offline to merge
> > https://github.com/apache/incubator-mxnet/pull/13922 to complete Horovod
> > integration. PR will be merged today.
> >
> > After above PR are merge and CI passed successfully 1.4.0.rc1 will be
> > created and voting started.
> >
> > Regards,
> > Steffen
> >
> > On Wed, Jan 16, 2019 at 2:34 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> >> Sounds good Steffen.  I believe most of these PRs only fix functional
> >> problems (they don't add features) and should be fairly low risk.
> >>
> >> Update from my side:
> >> 13695: https://github.com/apache/incubator-mxnet/pull/13899 <- Already
> >> merged, thanks Haibin!
> >>
> >> Ready for review / merge with all tests passed:
> >> https://github.com/apache/incubator-mxnet/pull/13898
> >> https://github.com/apache/incubator-mxnet/pull/13900
> >> https://github.com/apache/incubator-mxnet/pull/13897
> >>
> >> -Kellen
> >>
> >> On Tue, Jan 15, 2019 at 10:06 PM Steffen Rochel <
> steffenroc...@gmail.com>
> >> wrote:
> >>
> >> > Kellen - thanks, please go ahead. I'm ok as long we avoid risky PR and
> >> can
> >> > get to a stable and tested build by Friday.
> >> >
> >> > Best,
> >> > Steffen
> >> >
> >> > On Tue, Jan 15, 2019 at 9:48 PM kellen sunderland <
> >> > kellen.sunderl...@gmail.com> wrote:
> >> >
> >> > > Many thanks for the license fixes and allowing some other PRs to
> come
> >> > into
> >> > > the release.
> >> > >
> >> > > For #13697 I've contacted the author Zhennan and let him know he can
> >> cut
> >> > a
> >> > > branch to v1.4.x to update any APIs that are required.
> >> > >
> >> > > For the other PRs listed here's some new PRs for the v1.4.x branch.
> >> > > 13188: https://github.com/apache/incubator-mxnet/pull/13898
> >> > > 13727: https://github.com/apache/incubator-mxnet/pull/13900
> >> > > 13695: https://github.com/apache/incubator-mxnet/pull/13899 <-
> >> Already
> >> > > merged, thanks Haibin!
> >> > >
> >> > > I'd also propose that we include this TensorRT PR which fixes
> >> inference
> >> > > bugs and updates to a more stable commit of onnx-trt:
> >> > > https://github.com/apache/incubator-mxnet/pull/13897
> >> > >
> >> > > -Kellen
> >> > >
> >> > > On Tue, Jan 15, 2019 at 5:57 PM Steffen Rochel <
> >> steffenroc...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi Lin - please go ahead to integrate into 1.4.x.
> >> > > > Steffen
> >> > > >
> >> > > > On Tue, Jan 15, 2019 at 4:17 PM Lin Yuan 
> >> wrote:
> >> > > >
> >> > > > > Hi Steffen,
> >> > > > >
> >> > > > > I would like to ask to include one more PR for 1.4.0.rc1:
> >> > > > > https://github.com/apache/incubator-mxnet/pull/13845
> >> > > > >
> >> > > > > This PR exports exception handling API of MXNet. It is needed by
> >> > > Horovod
> >> > > > > with MXNet integration to elegantly throw exception at Python
> >> level
> >> > > > rather
> >> > > > > than a C++ abort.
> >> > > > >
> >> > > > > Thanks,
> >> > > >

Re: Apache MXNet v1.4.0 release status

2019-01-16 Thread kellen sunderland
Sounds good Steffen.  I believe most of these PRs only fix functional
problems (they don't add features) and should be fairly low risk.

Update from my side:
13695: https://github.com/apache/incubator-mxnet/pull/13899 <- Already
merged, thanks Haibin!

Ready for review / merge with all tests passed:
https://github.com/apache/incubator-mxnet/pull/13898
https://github.com/apache/incubator-mxnet/pull/13900
https://github.com/apache/incubator-mxnet/pull/13897

-Kellen

On Tue, Jan 15, 2019 at 10:06 PM Steffen Rochel 
wrote:

> Kellen - thanks, please go ahead. I'm ok as long we avoid risky PR and can
> get to a stable and tested build by Friday.
>
> Best,
> Steffen
>
> On Tue, Jan 15, 2019 at 9:48 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Many thanks for the license fixes and allowing some other PRs to come
> into
> > the release.
> >
> > For #13697 I've contacted the author Zhennan and let him know he can cut
> a
> > branch to v1.4.x to update any APIs that are required.
> >
> > For the other PRs listed here's some new PRs for the v1.4.x branch.
> > 13188: https://github.com/apache/incubator-mxnet/pull/13898
> > 13727: https://github.com/apache/incubator-mxnet/pull/13900
> > 13695: https://github.com/apache/incubator-mxnet/pull/13899 <- Already
> > merged, thanks Haibin!
> >
> > I'd also propose that we include this TensorRT PR which fixes inference
> > bugs and updates to a more stable commit of onnx-trt:
> > https://github.com/apache/incubator-mxnet/pull/13897
> >
> > -Kellen
> >
> > On Tue, Jan 15, 2019 at 5:57 PM Steffen Rochel 
> > wrote:
> >
> > > Hi Lin - please go ahead to integrate into 1.4.x.
> > > Steffen
> > >
> > > On Tue, Jan 15, 2019 at 4:17 PM Lin Yuan  wrote:
> > >
> > > > Hi Steffen,
> > > >
> > > > I would like to ask to include one more PR for 1.4.0.rc1:
> > > > https://github.com/apache/incubator-mxnet/pull/13845
> > > >
> > > > This PR exports exception handling API of MXNet. It is needed by
> > Horovod
> > > > with MXNet integration to elegantly throw exception at Python level
> > > rather
> > > > than a C++ abort.
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > > >
> > > > On Tue, Jan 15, 2019 at 2:24 PM Steffen Rochel <
> > steffenroc...@gmail.com>
> > > > wrote:
> > > >
> > > > > Dear MXNet community -
> > > > > Zach & friends made good progress resolving the licensing issues.
> One
> > > > more
> > > > > PR on 1.4.x branch is expected today.
> > > > > The code freeze for 1.4.0.rc1 is Thursday Jan 17th 6pm PST.
> > > > > I'm asking the requester to add following PR to 1.4.x branch:
> > > > > Tao:
> > > > > https://github.com/apache/incubator-mxnet/pull/13882
> > > > > Kellen:
> > > > > https://github.com/apache/incubator-mxnet/pull/13697
> > > > > https://github.com/apache/incubator-mxnet/pull/13188
> > > > > https://github.com/apache/incubator-mxnet/pull/13727
> > > > > https://github.com/apache/incubator-mxnet/pull/13695
> > > > > Pedro:
> > > > > https://github.com/apache/incubator-mxnet/pull/13535
> > > > >
> > > > > If there are additional PR to be considered for 1.4.0.rc1 please
> send
> > > > > request to dev@.
> > > > >
> > > > > Regards,
> > > > > Steffen
> > > > >
> > > > > On Tue, Jan 8, 2019 at 11:28 AM Qing Lan 
> > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I added a section F in the document that explained the current
> > > > > > static-linked dependencies we used for the official release. As
> > > > > > there are a few licenses under BSD3 and GPL, we need to handle
> > > > > > them in our next release. Please take a look and leave any
> > > > > > concerns you may have.
> > > > > >
> > > > > > Thanks,
> > > > > > Qing
> > > > > >
> > > > > > On 1/7/19, 8:33 PM, "kellen sunderland" <
> > > kellen.sunderl...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > So I see two quick options that should cut down on the
> > > > > > dependency licenses required for TRT in the source release.

Re: Apache MXNet v1.4.0 release status

2019-01-15 Thread kellen sunderland
Many thanks for the license fixes and allowing some other PRs to come into
the release.

For #13697 I've contacted the author Zhennan and let him know he can cut a
branch to v1.4.x to update any APIs that are required.

For the other PRs listed here's some new PRs for the v1.4.x branch.
13188: https://github.com/apache/incubator-mxnet/pull/13898
13727: https://github.com/apache/incubator-mxnet/pull/13900
13695: https://github.com/apache/incubator-mxnet/pull/13899 <- Already
merged, thanks Haibin!

I'd also propose that we include this TensorRT PR which fixes inference
bugs and updates to a more stable commit of onnx-trt:
https://github.com/apache/incubator-mxnet/pull/13897

-Kellen

On Tue, Jan 15, 2019 at 5:57 PM Steffen Rochel 
wrote:

> Hi Lin - please go ahead to integrate into 1.4.x.
> Steffen
>
> On Tue, Jan 15, 2019 at 4:17 PM Lin Yuan  wrote:
>
> > Hi Steffen,
> >
> > I would like to ask to include one more PR for 1.4.0.rc1:
> > https://github.com/apache/incubator-mxnet/pull/13845
> >
> > This PR exports exception handling API of MXNet. It is needed by Horovod
> > with MXNet integration to elegantly throw exception at Python level
> rather
> > than a C++ abort.
> >
> > Thanks,
> >
> > Lin
> >
> >
> > On Tue, Jan 15, 2019 at 2:24 PM Steffen Rochel 
> > wrote:
> >
> > > Dear MXNet community -
> > > Zach & friends made good progress resolving the licensing issues. One
> > more
> > > PR on 1.4.x branch is expected today.
> > > The code freeze for 1.4.0.rc1 is Thursday Jan 17th 6pm PST.
> > > I'm asking the requester to add following PR to 1.4.x branch:
> > > Tao:
> > > https://github.com/apache/incubator-mxnet/pull/13882
> > > Kellen:
> > > https://github.com/apache/incubator-mxnet/pull/13697
> > > https://github.com/apache/incubator-mxnet/pull/13188
> > > https://github.com/apache/incubator-mxnet/pull/13727
> > > https://github.com/apache/incubator-mxnet/pull/13695
> > > Pedro:
> > > https://github.com/apache/incubator-mxnet/pull/13535
> > >
> > > If there are additional PR to be considered for 1.4.0.rc1 please send
> > > request to dev@.
> > >
> > > Regards,
> > > Steffen
> > >
> > > On Tue, Jan 8, 2019 at 11:28 AM Qing Lan  wrote:
> > >
> > > > Hi all,
> > > >
> > > > I added a section F in the document that explained the current
> > > > static-linked dependencies we used for the official release. As
> > > > there are a few licenses under BSD3 and GPL, we need to handle them
> > > > in our next
> > > > release. Please take a look and leave any concerns you may have.
> > > >
> > > > Thanks,
> > > > Qing
> > > >
> > > > On 1/7/19, 8:33 PM, "kellen sunderland" <
> kellen.sunderl...@gmail.com>
> > > > wrote:
> > > >
> > > > So I see two quick options that should cut down on the dependency
> > > > licenses
> > > > required for TRT in the source release.
> > > >
> > > > 1: We can simply remove in the release package the submodules for
> > > onnx
> > > > in
> > > > folder
> > > > incubator-mxnet/3rdparty/onnx-tensorrt/third_party/onnx/third_party.
> > > > None of those dependencies are used in the build (I've just
> > verified
> > > > locally on my machine).
> > > > 2: We can make a cmake based checkout system and ensure we only
> > > > checkout
> > > > the required files when TRT builds are enabled (similar to the
> > > current
> > > > mkl-ml setup).
> > > >
> > > > I'd suggest option 1 for this release, and that we support
> option 2
> > > > for the
> > > > 1.5 release.
> > > >
> > > > On Mon, Jan 7, 2019 at 8:19 PM Lv, Tao A 
> > wrote:
> > > >
> > > > > What should I do for the double headers in
> > > > 3rdparty/mkldnn/src/cpu/xbyak/?
> > > > >
> > > > > -tao
> > > > >
> > > > > -Original Message-
> > > > > From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> > > > > Sent: Tuesday, January 8, 2019 10:51 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: Apache MXNet v1.4.0 release status
> > > > >
> > > > > 

Re: Cherry pick bug fix from master branch to v1.4.x

2019-01-15 Thread kellen sunderland
We may want to consider having a new code freeze deadline for RC1.  We
could allow users to open PRs against the 1.4.x branch up until this
deadline.

One advantage is we can have a second look at some API changes which we may
not have got 100% right before we push them out and have to support them.
One PR I know of that could benefit from this:
https://github.com/apache/incubator-mxnet/pull/13697

Other PRs we may want to consider migrating because they fix functional
issues:
https://github.com/apache/incubator-mxnet/pull/13188
https://github.com/apache/incubator-mxnet/pull/13727
https://github.com/apache/incubator-mxnet/pull/13695

-Kellen

On Tue, Jan 15, 2019 at 12:24 AM Lv, Tao A  wrote:

>
> Hi community,
>
> As 1.4.0 release is still in process, I would like to propose to cherry
> pick https://github.com/apache/incubator-mxnet/pull/13843  into the
> v1.4.x branch. It fixes a crash in the quantized SSD example on the master
> branch which was reported by an MXNet user. This issue also exists on the
> v1.4.x branch. Since quantization is an important feature of the MKL-DNN
> backend in the 1.4.0 release, I think this fix is critical and we should
> have it in the release.
>
> A PR is filed to do that:
> https://github.com/apache/incubator-mxnet/pull/13882
>
> Thank you,
> -tao
>
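The cherry-pick flow proposed above can be demonstrated end-to-end in a throwaway repository; the branch name mirrors the thread, while the repo, file, and commit message are stand-ins rather than the actual change behind PR #13843:

```shell
# Demonstrate cherry-picking a fix that landed on the default branch onto
# a release branch, in a disposable repo (all names here are stand-ins).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name "MXNet Dev"

echo base > example.txt
git add example.txt && git commit -qm "initial commit"
git branch v1.4.x                     # release branch forks from here

echo fix >> example.txt               # the fix lands on the default branch first
git add example.txt && git commit -qm "fix quantized SSD crash"
fix_sha=$(git rev-parse HEAD)

git checkout -q v1.4.x
git cherry-pick "$fix_sha"            # bring only that commit onto v1.4.x
git log --oneline --max-count=2
```

The real equivalent would substitute the MXNet remote and the actual commit from master; the point is that only the single fix moves to the release branch, not the rest of master.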


Re: [Announcement] New Committer - Roshani Nagmote

2019-01-08 Thread kellen sunderland
Congrats Roshani.  Well deserved.

On Tue, Jan 8, 2019, 8:29 AM Marco de Abreu  wrote:

> Great to have you on board, Roshani!
>
> -Marco
>
> On Tue., Jan. 8, 2019, 15:18 Carin Meier 
> wrote:
>
> > Please join me in welcoming Roshani Nagmote as a new committer.
> >
> > She has been active in the project for quite some time. She has managed
> the
> > 1.3.0 release as well as being involved in various features including the
> Java
> > API and ONNX operators.
> >
> > We are excited to have her onboard as a committer.
> >
> >  Github Activity
> >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+involves%3ARoshrini
> > +
> >
> > Confluence
> >
> >
> https://cwiki.apache.org/confluence/users/viewuserprofile.action?username=roshrini
> >
>


Re: Apache MXNet v1.4.0 release status

2019-01-07 Thread kellen sunderland
So I see two quick options that should cut down on the dependency licenses
required for TRT in the source release.

1: We can simply remove from the release package the onnx submodules in the
folder incubator-mxnet/3rdparty/onnx-tensorrt/third_party/onnx/third_party.
None of those dependencies are used in the build (I've just verified
locally on my machine).
2: We can make a cmake-based checkout system and ensure we only check out
the required files when TRT builds are enabled (similar to the current
mkl-ml setup).

I'd suggest option 1 for this release, and that we support option 2 for the
1.5 release.
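Option 1 above amounts to pruning the unused submodule tree before the source tarball is packaged. A rough sketch against a stand-in directory layout (only the pruned path comes from this thread; the tree and tarball name are illustrative):

```shell
# Strip the unused onnx third-party submodules from a stand-in source tree
# before creating the release tarball, then list the result.
set -e
work=$(mktemp -d)
release="$work/incubator-mxnet"
unused="$release/3rdparty/onnx-tensorrt/third_party/onnx/third_party"
mkdir -p "$unused/pybind11" "$unused/benchmark"   # stand-ins for the submodules
touch "$release/CMakeLists.txt"

rm -rf "$unused"                                  # not used in the build
tar -C "$work" -czf "$work/apache-mxnet-src.tar.gz" incubator-mxnet
tar -tzf "$work/apache-mxnet-src.tar.gz"
```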

On Mon, Jan 7, 2019 at 8:19 PM Lv, Tao A  wrote:

> What should I do for the double headers in 3rdparty/mkldnn/src/cpu/xbyak/?
>
> -tao
>
> -Original Message-
> From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> Sent: Tuesday, January 8, 2019 10:51 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: Apache MXNet v1.4.0 release status
>
> Kellen and Tao -
> yes, the understanding is that dependencies need to be considered and all
> licences referenced to include in top level LICENSE file.
> Appreciate your help with it.
> Steffen
>
> On Mon, Jan 7, 2019 at 6:39 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Sorry to hear about the licensing issues.  I was following the general
> > vote but I'm still lacking some clarity around what licenses in the
> > onnx-trt repo need to be surfaced.  I believe onnx-trt is MIT
> > licensed, but it includes Onnx as a third party repo which then brings
> > in dependencies with a variety of licenses.  The proposal is that we
> > look at these on an individual basis and then add them to our top level
> LICENSE file right?
> >
> > An alternative is that we may be able to check out a smaller source
> > code dependency tree if we remove a few of ONNX's unneeded dependencies
> > (pybind and google-bench).  My hope is that this wouldn't affect our
> > compilation process and would get us down to two licenses to report
> > (just Onnx and Onnx-TRT, both MIT).
> >
> > On Mon, Jan 7, 2019 at 6:07 PM Meghna Baijal
> > 
> > wrote:
> >
> > > Hi All,
> > > For some more context, these were the last emails I sent on the dev
> > > and legal lists requesting help on the open questions  –
> > >
> > > 1. Question on legal about the CC-By-2.5 <
> > >
> > http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201805.mbox
> > /%3CCAK1xzDe6ECToKt_2cTR_7txQQCwHeYfvxXDfmuGgfA3jaTs=j...@mail.gmail.com
> > %3E
> > > >
> > > 2. Question on dev about googletest file <
> > >
> > http://mail-archives.apache.org/mod_mbox/mxnet-dev/201804.mbox/%3CCAMG
> > gKDC8szdfFqQhhSNpwwT_3zi4LBS7A=u4v7kj4ule44u...@mail.gmail.com%3E
> > > >
> > > 3. General Request for review of the licenses wiki <
> > >
> > https://mail-archives.apache.org/mod_mbox/mxnet-dev/201801.mbox/%3CCAM
> > GgKDCi=s933zcVWwei15i5uBC1h88VUogt3Br=Vq28=vi...@mail.gmail.com%3E
> > > >
> > >
> > >  (Note: You can click on the “>>” next to the thread on the top
> > > right to view the next responses in the email threads in the apache
> > > archive. )
> > >
> > > Thanks,
> > > Meghna Baijal
> > >
> > > On Mon, Jan 7, 2019 at 4:30 PM Steffen Rochel
> > > 
> > > wrote:
> > >
> > > > Dear MXNet community -
> > > > as you should have seen in previous email, voting for v1.4.0.rc0
> > > > has
> > been
> > > > cancelled. We received a -1 vote due to outstanding license issues.
> > > > Please help to update
> > > >
> > https://cwiki.apache.org/confluence/display/MXNET/MXNet+Source+License
> > s
> > > > and
> > > > resolve outstanding issues.
> > > >
> > > > I would like to ask specifically for help from contributors to
> > > > mkldnn, openmp and onnx-tensorrt to address the feedback from
> > > > Justin - see
> > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/ebb8c4c00fb66dd98da13621c7dcb8753
> > fee57562a861d61379d31e9@%3Cgeneral.incubator.apache.org%3E
> > > > .
> > > >
> > > > I suggest to fix the issues first on master, then cherry-pick and
> > > > merge
> > > to
> > > > 1.4.x branch.
> > > >
> > > > I'm suggesting to exclude Julia from 1.4.0 release as integration
> > > > into MXNet repo and upgrade to 0.7+ is WIP.
> > > > I'm suggesting to exclude googletest/googlemock from 1.4.0 release as
> > > > outstanding license issues are not resolved yet. This should not
> > > > impact users.

Re: Apache MXNet v1.4.0 release status

2019-01-07 Thread kellen sunderland
Sorry to hear about the licensing issues.  I was following the general vote
but I'm still lacking some clarity around what licenses in the onnx-trt
repo need to be surfaced.  I believe onnx-trt is MIT licensed, but it
includes Onnx as a third party repo which then brings in dependencies with
a variety of licenses.  The proposal is that we look at these on an
individual basis and then add them to our top level LICENSE file right?

An alternative is that we may be able to check out a smaller source code
dependency tree if we remove a few of ONNX's unneeded dependencies (pybind and
google-bench).  My hope is that this wouldn't affect our compilation
process and would get us down to two licenses to report (just Onnx and
Onnx-TRT, both MIT).
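One way to make that individual review concrete is to enumerate the license files sitting in the vendored dependency tree. A rough sketch over a stand-in layout (a real run would point at the MXNet checkout's 3rdparty directory; the directory names and license texts below are illustrative):

```shell
# List every LICENSE/COPYING file under 3rdparty together with its first
# line, so each vendored license can be reviewed for inclusion in the
# top-level LICENSE file. Stand-in tree and contents for demonstration.
set -e
tree=$(mktemp -d)
mkdir -p "$tree/3rdparty/onnx-tensorrt" "$tree/3rdparty/mkldnn"
echo "MIT License" > "$tree/3rdparty/onnx-tensorrt/LICENSE"
echo "Apache License 2.0" > "$tree/3rdparty/mkldnn/LICENSE"

find "$tree/3rdparty" -type f \( -iname 'LICENSE*' -o -iname 'COPYING*' \) |
sort | while read -r f; do
  printf '%s: %s\n' "${f#"$tree"/}" "$(head -n 1 "$f")"
done
```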

On Mon, Jan 7, 2019 at 6:07 PM Meghna Baijal 
wrote:

> Hi All,
> For some more context, these were the last emails I sent on the dev and
> legal lists requesting help on the open questions  –
>
> 1. Question on legal about the CC-By-2.5
> <
> http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201805.mbox/%3CCAK1xzDe6ECToKt_2cTR_7txQQCwHeYfvxXDfmuGgfA3jaTs=j...@mail.gmail.com%3E
> >
> 2. Question on dev about googletest file
> <
> http://mail-archives.apache.org/mod_mbox/mxnet-dev/201804.mbox/%3CCAMGgKDC8szdfFqQhhSNpwwT_3zi4LBS7A=u4v7kj4ule44u...@mail.gmail.com%3E
> >
> 3. General Request for review of the licenses wiki
> <
> https://mail-archives.apache.org/mod_mbox/mxnet-dev/201801.mbox/%3CCAMGgKDCi=s933zcVWwei15i5uBC1h88VUogt3Br=Vq28=vi...@mail.gmail.com%3E
> >
>
>  (Note: You can click on the “>>” next to the thread on the top right
> to view the next responses in the email threads in the apache archive. )
>
> Thanks,
> Meghna Baijal
>
> On Mon, Jan 7, 2019 at 4:30 PM Steffen Rochel 
> wrote:
>
> > Dear MXNet community -
> > as you should have seen in previous email, voting for v1.4.0.rc0 has been
> > cancelled. We received a -1 vote due to outstanding license issues.
> > Please help to update
> > https://cwiki.apache.org/confluence/display/MXNET/MXNet+Source+Licenses
> > and
> > resolve outstanding issues.
> >
> > I would like to ask specifically for help from contributors to mkldnn,
> > openmp and onnx-tensorrt to address the feedback from Justin - see
> >
> >
> https://lists.apache.org/thread.html/ebb8c4c00fb66dd98da13621c7dcb8753fee57562a861d61379d31e9@%3Cgeneral.incubator.apache.org%3E
> > .
> >
> > I suggest to fix the issues first on master, then cherry-pick and merge
> to
> > 1.4.x branch.
> >
> > I'm suggesting to exclude Julia from 1.4.0 release as integration into
> > MXNet repo and upgrade to 0.7+ is WIP.
> > I'm suggesting to exclude googletest/googlemock from 1.4.0 release as
> > outstanding license issues are not resolved yet. This should not impact
> > users.
> >
> > Please provide your feedback to the suggestions.
> >
> > Regards,
> > Steffen
> >
> >
> > On Fri, Dec 21, 2018 at 1:34 PM Steffen Rochel 
> > wrote:
> >
> > > Dear MXNet community -
> > > I hope you have seen that voting for v1.4.0.rc0 has started and will
> > > continue until December 27th noon. So far two binding +1 votes.
> > > I suggest the following schedule to account for holidays and of
> course
> > > depending on voting feedback.
> > >
> > > Vote on dev@ until 12/27
> > >
> > > Vote on general@ 12/28 – 1/3
> > >
> > > Release announcement with pre-build language bindings 1/9
> > >
> > >
> > > Please let me know if you have concerns with the proposed schedule.
> > >
> > > Regards,
> > > Steffen
> > >
> > >
> > > On Wed, Dec 19, 2018 at 11:22 AM Haibin Lin 
> > > wrote:
> > >
> > >> Hi Steffen,
> > >>
> > >> Aston and I would like to bring this PR to your attention:
> > >> https://github.com/apache/incubator-mxnet/pull/13686, where Zhi fixed
> > the
> > >> num_worker argument of DataLoader on Windows. Without this fix, using
> > >> DataLoader with num_worker > 0 would result in a crash on Windows.
> > Bringing
> > >> this PR to 1.4.x would greatly benefit windows users of MXNet. Aston
> is
> > >> working on the dive into deep learning book
> > >>  based on MXNet, which is due and
> > frozen
> > >> for publication next week. Currently the book will depend on MXNet
> 1.4.0
> > >> and discourages readers from using multi-worker DataLoaders due to
> this
> > >> bug
> > >> on Windows. With this fix Aston can update the examples in the book
> with
> > >> DataLoader using multiple workers, which will be very beneficial to
> the
> > >> broader MXNet community.
> > >>
> > >> Best,
> > >> Haibin
> > >>
> > >> On Mon, Dec 17, 2018 at 6:11 AM Pedro Larroy <
> > >> pedro.larroy.li...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi Steffen
> > >> >
> > >> > Added some notes in your PR for the release notes.
> > >> >
> > >> > In particular, I'm a bit concerned about the status of topology
> aware
> > >> > communication, since it has open issues and is not being tested in
> CI.
> > >> > (The tests also fail). I think we should announce it when it's
> working
> > 

Re: [DISCUSS] About the usage of CUDA/CUDNN

2018-12-17 Thread kellen sunderland
Restricted nodes may provide enough security for some use cases, but in my
opinion they don't provide enough for artifact publishing. An example would
be if there were an exploit available that worked against a Jenkins master.
In this case I think an attacker could still pivot to a secure node (correct
me if I'm wrong).

To your second point, it shouldn't be too hard for us to maintain all the
deps for our packages in Dockerfiles which are checked into source and
built on a regular basis.  To publish these artifacts I'd recommend doing
this from a separate, secure environment.  The flow I'd recommend would be
something like: (1) Developers commit PRs with verification that the
artifacts build properly on a continual basis from the CI. (2) In a
separate, secure environment we do the same artifact build generation
again, but this time we publish to various repos as a convenience to our
MXNet users.

On Mon, Dec 17, 2018 at 2:34 PM Qing Lan  wrote:

> Hi Kellen,
>
> Firstly the restricted node is completely isolated to the PR-checking CI
> system (physically) which is explained in here:
> https://cwiki.apache.org/confluence/display/MXNET/Restricted+jobs+and+nodes
> .
> What you are mentioning: the public CIs are all having troubles if they
> are publicly accessible. I am not sure how secure the restricted node is.
> However, the only way I can think of from your end is to download all
> deps on a single machine and run everything there (disconnected from
> the internet). It would give us the best security available.
>
> Thanks,
> Qing
>
> On 12/17/18, 2:06 PM, "kellen sunderland" 
> wrote:
>
> I'm not in favour of publishing artifacts from any Jenkins based
> systems.
> There are many ways to bundle artifacts and publish them from an
> automated
> system.  Why would we use a CI system like Jenkins for this task?
> Jenkins
> frequently has security vulnerabilities and is designed to run
> arbitrary
> code from the internet.  It is a real possibility that an attacker
> could
> pivot from any Jenkins based CI system to infect artifacts which would
> then
> potentially be pushed to repositories our users would consume.  I would
> consider any system using Jenkins as insecure-by-design, and encourage
> us
> to air-gap any artifact generation (websites, jars, PyPi packages)
> completely from a system like that.
>
> An alternative I could see is a simple Dockerfile (no Jenkins) that
> builds
> all artifacts end-to-end and can be run in an automated account well
> outside our CI account.
>
> On Mon, Dec 17, 2018 at 1:53 PM Qing Lan  wrote:
>
> > Dear community,
> >
> > Currently Zach and I are working on the automated-publish pipeline on
> > Jenkins, which is used to publish nightly builds of the Maven and pip
> > packages. We are trying to use the NVIDIA deb packages, which could
> > help us build different CUDA/CUDNN versions in the publish system.
> > Sheng has provided a
> > script here: https://github.com/apache/incubator-mxnet/pull/13646.
> > This provides a very concrete and automatic solution, from downloading
> > to installing on the system. The only concern we are facing is: it
> > seems NVIDIA has a restriction on distributing CUDA. We are not sure
> > if it is legally safe for us to use this in public.
> >
> > We would be grateful if somebody with better context on it could help
> > us out!
> >
> > Thanks,
> > Qing
> >
>
>
>


Re: [DISCUSS] About the usage of CUDA/CUDNN

2018-12-17 Thread kellen sunderland
I'm not in favour of publishing artifacts from any Jenkins based systems.
There are many ways to bundle artifacts and publish them from an automated
system.  Why would we use a CI system like Jenkins for this task?  Jenkins
frequently has security vulnerabilities and is designed to run arbitrary
code from the internet.  It is a real possibility that an attacker could
pivot from any Jenkins based CI system to infect artifacts which would then
potentially be pushed to repositories our users would consume.  I would
consider any system using Jenkins as insecure-by-design, and encourage us
to air-gap any artifact generation (websites, jars, PyPi packages)
completely from a system like that.

An alternative I could see is a simple Dockerfile (no Jenkins) that builds
all artifacts end-to-end and can be run in an automated account well
outside our CI account.

On Mon, Dec 17, 2018 at 1:53 PM Qing Lan  wrote:

> Dear community,
>
> Currently Zach and I are working on the automated-publish pipeline on
> Jenkins, which is used to publish nightly builds of the Maven and pip
> packages. We are trying to use the NVIDIA deb packages, which could help us
> build different CUDA/CUDNN versions in the publish system. Sheng has provided a
> script here: https://github.com/apache/incubator-mxnet/pull/13646. This
> provides a very concrete and automatic solution, from downloading to
> installing on the system. The only concern we are facing is: it seems
> NVIDIA has a restriction on distributing CUDA. We are not sure if it is
> legally safe for us to use this in public.
>
> We would be grateful if somebody with better context on it could help us
> out!
>
> Thanks,
> Qing
>


Re: Julia CI Package

2018-12-14 Thread kellen sunderland
If it's hanging consistently would you be able to dump a native stack trace
and see what call specifically is hanging?
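For reference, one common way to get such a trace is to attach gdb to the hung process in batch mode. A sketch against a stand-in process (this assumes gdb is installed and that the host allows ptrace attach, which locked-down CI containers may not):

```shell
# Attach gdb to a (stand-in) stuck process and dump every thread's native
# stack; in the real case the PID would be the hanging Julia/libmxnet one.
sleep 60 &
pid=$!
if command -v gdb >/dev/null 2>&1; then
  gdb -p "$pid" -batch -ex 'thread apply all bt' 2>/dev/null | tail -n 20
else
  echo "gdb not installed; 'cat /proc/$pid/wchan' is a coarse fallback on Linux"
fi
kill "$pid" 2>/dev/null || true
```

The frame names in the dump usually show which library call the process is blocked in, which is exactly the information needed to narrow down a CI hang like this one.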

On Fri, Dec 14, 2018 at 11:38 AM Alex Zai  wrote:

> Is anyone familiar with the Julia build and can help debug an issue where
> the Julia stage in the CI just hangs? I have made a change where I am
> making mkldnn default on the master branch. This means that the julia
> package is now being built with a version of mxnet that links to
> mkldnn. The julia stage on the CI just hangs without any error message and
> gets killed after the 2 hour timeout (
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-13464/40/pipeline
> ).
>
> Alex
>


Re: [Announcement] New Committer -- Aaron Markham

2018-12-03 Thread kellen sunderland
Congrats Aaron.  Really appreciate all the effort spent improving the
documentation.

On Mon, Dec 3, 2018 at 6:30 PM Hagay Lupesko  wrote:

> Congrats Aaron!
> Your work on the docs definitely set a new standard and helps the community
> tremendously - well deserved!
>
>
> On Mon, Dec 3, 2018 at 6:22 PM Tianqi Chen  wrote:
>
> > Let us welcome Aaron Markham as a new committer of MXNet. Aaron has been
> > actively working on improving the documentation and website of MXNet.
> > PRs  *
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+reviewed-by%3Aaaronmarkham
> > <
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+reviewed-by%3Aaaronmarkham
> > >*
> > reviews  *
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+reviewed-by%3Aaaronmarkham+
> > <
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+reviewed-by%3Aaaronmarkham+
> > >*
> > dev@  https://lists.apache.org/list.html?d...@mxnet.apache.org:lte=3y:
> > 
> > *aaronmarkham*
> >
> > Tianqi
> >
>


Re: [Announcement] New Committer -- Rahul Huilgol

2018-12-03 Thread kellen sunderland
Congrats Rahul, well deserved.

On Mon, Dec 3, 2018 at 6:24 PM Tianqi Chen  wrote:

> Let us welcome Rahul Huilgol as a new Committer of MXNet. He has
> contributed to many fronts, including the FP16 support, distributed
> training and mixed precision support of MXNet. He has a breadth of
> knowledge across multiple modules of the system and would be a valuable
> member of the committer team.
>
> PRs https://github.com/apache/incubator-mxnet/commits?author=rahul003
> Reviews
>
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+reviewed-by%3Arahul003
> dev@
> https://lists.apache.org/list.html?d...@mxnet.apache.org:lte=3y:rahul003
>
>
> Tianqi
>


Re: Adding AMD CPU to CI

2018-11-30 Thread kellen sunderland
Damn, knew I should have double-checked!  Oh well, it's also carbon neutral.

On Fri, Nov 30, 2018 at 10:27 AM Pedro Larroy 
wrote:

> Agree with Tianqi and Hao. Adding AMD brings no value and increases
> complexity and CI cost. The instructions sets are the same. For
> benchmarking it might make sense though.
>
> Pedro
>
> > On 30. Nov 2018, at 18:19, Tianqi Chen  wrote:
> >
> > I still think it is overkill to add AMD CPU to the CI, given the
> additional
> > cost it could bring and the little additional information we can get out of
> > it.
> >
> > A middle ground is to add AMD CPU to a nightly build or final sweep before
> > release. If there is a case that we find that AMD CPU really makes a
> > difference, then we add it to the CI
> >
> > Tianqi
> >
> >> On Thu, Nov 29, 2018 at 6:29 PM Hao Jin  wrote:
> >>
> >> For CPUs, the supported instruction sets may also vary between the same
> >> manufacturer's different product lines of the same generation
> (Skylake-SP
> >> versus Skylake).
> >> For the same instruction set, the two manufacturers should both have a
> >> working version of the hardware implementation. If any of the
> >> implementations does not work, then the chip would not even be
> considered
> >> functioning properly.
> >> If some AMD CPUs only support up to AVX2 instruction sets, they would
> just
> >> function in the same way as an Intel CPU that supports up to AVX2
> >> instruction sets. The performance may vary, but the capability and
> behavior
> >> of the two chips would be the same when given the same machine code.
> >> For AMD GPUs it's a totally different story, as AMD GPUs do not share
> the
> >> same instruction sets with the NVIDIA ones, thus testing on AMD GPUs(if
> we
> >> do have support for them) would definitely add values.
> >> Hao
> >>
> >> On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian <
> anirudh2...@gmail.com
> >>>
> >> wrote:
> >>
> >>> Instruction set extensions support like AVX2, AVX512 etc. can vary
> >> between
> >>> AMD and Intel and there can also be a time lag between when Intel
> >> supports
> >>> it versus when AMD supports it.
> >>> Also, in the future this setup may be useful in case MXNet supports AMD
> >>> GPUs and AWS also happens to have support for it.
> >>>
> >>> Anirudh
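As a concrete point of reference, the extension gap between two hosts can be checked directly from the flags the kernel reports. A Linux-only sketch (the flag names such as avx2 or avx512f are whatever the kernel prints; non-x86 hosts expose a Features line instead):

```shell
# Print the SIMD-related CPU flags the current host advertises, so an
# Intel box and an AMD box can be diffed for gaps such as AVX512 support.
flags=$(grep -m1 -E '^(flags|Features)' /proc/cpuinfo 2>/dev/null |
        tr ' \t' '\n\n' | grep -iE '^(sse|avx|fma|f16c|neon)' | sort -u)
echo "${flags:-no SIMD flags detected}"
```

Running this on both instance types and diffing the output would show at a glance whether a feature such as f16c or AVX512 is present on one vendor's part and missing on the other's.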
> >>>
> >>>
> >>> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> >>>  wrote:
> >>>
> >>>> I think it's worth a discussion to do a sanity check. While generally
> >>> these
> >>>> instructions are standardized, we also had the experience with ARM
> >> that
> >>>> the theory and reality sometimes don't match. Thus, it's always good
> to
> >>>> check.
> >>>>
> >>>> In the next months we are going to refactor our slave creation
> >> processes.
> >>>> Chance Bair has been working on rewriting Windows slaves from scratch
> >> (we
> >>>> used images that haven't really been updated for 2 years - we still
> >> don't
> >>>> know what was done on them) and they're ready soon. In the following
> >>>> months, we will also port our Ubuntu slaves to the new method (don't
> >>> have a
> >>>> timeline yet). Ideally, the integration of AMD instances will only be
> a
> >>>> matter of running the same pipeline on a different instance type. In
> >> that
> >>>> case, it should not be a big deal.
> >>>>
> >>>> If there are big differences, that's already a yellow flag for
> >>>> compatibility, but that's unlikely. But in that case, we would have to
> >>> make
> >>>> a more thorough time analysis and whether it's worth the effort.
> Maybe,
> >>>> somebody else could also lend us a hand and help us with adding AMD
> >>>> support.
> >>>>
> >>>> -Marco
> >>>>
> >>>> On Fri., Nov. 30, 2018, 01:22 Hao Jin 
> >>>> wrote:
> >>>>
> >>>>> f16c is also an instruction set supported by both brands' recent CPUs
> >>>> just
> >>>>> like x86, AVX, SSE etc., and any difference in behaviors (quite
> >>>> impossible
> >>>>> to happen or it will be a major defect) would most likely be caused
> >

Re: Adding AMD CPU to CI

2018-11-30 Thread kellen sunderland
+1 to nightly.

Given the awesome results shown by Alex for AMD cpus I think MKLDNN
actually would probably be something I'd use, even on my AMD machines.
Kudos to Intel for releasing this lib which works great on their hardware,
but still pretty well w/ AMD.  The upshot of MKLDNN supporting AMD to me is
that it makes me much more likely to support it as the default PyPi package
(discussed in another thread).  This is part of the reason I'd like to have
a sanity test in CI somewhere for AMD hardware.

Unrelated note: regarding global warming I actually partially chose
eu-west-1 to host CI because it's carbon neutral.  The cost of the CI is
significant, and although it's donated by AWS I'm glad the community is
cognizant of that.

On Fri, Nov 30, 2018 at 9:54 AM Kumar, Vikas 
wrote:

> I concur. +1 for nightly for pre-release suite.
>
> On 11/30/18, 9:49 AM, "Tianqi Chen"  wrote:
>
+1 for nightly for pre-release suite, but not the CI that is triggered on
> every
> test.  The best engineering practice is not to add things, but to
> remove
> things so that there is nothing can be removed.
>
> In terms of MLDNN, since it is an Intel product, I doubt optimizing
> for AMD
> CPUs is its goal, adding CI to guard against backward compatibility is
> a
> bit overkill even. Since the AMD CPU user would likely disable this
> feature
> and use the original CPU version of the project.
>
> At least we can contribute to reducing the carbon footprint and slowing
> down
> global warming :)
>
> Tianqi
>
> On Fri, Nov 30, 2018 at 9:38 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Regarding cost, yes we could run this nightly or simply make it run
> an
> > existing test suite where that makes sense, rather than having it
> duplicate a
> > suite.
> >
> > On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas
> 
> > wrote:
> >
> > > I don't think there is any downside to this proposal. I think a
> basic
> > > sanity CI testing on AMD processors will give an extra boost to our
> tests.
> > > This adds to developer productivity and they have one less thing
> to worry
> > > about. Developers have spent time in the past where they had to
> manually test
> > > on AMD  processors, MKLDNN being the recent instance. It's good to
> have
> > > those test in CI pipeline.
> > > All I see is benefit. If the $ cost is not too high for basic
> sanity
> > > testing, we should do this, until and unless some strong downside
> is
> > called
> > > out.
> > >
> > > +1
> > >
> > >
> > > On 11/29/18, 5:37 PM, "Anirudh Subramanian"  >
> > > wrote:
> > >
> > > Instruction set extensions support like AVX2, AVX512 etc. can
> vary
> > > between
> > > AMD and Intel and there can also be a time lag between when
> Intel
> > > supports
> > > it versus when AMD supports it.
> > > Also, in the future this setup may be useful in case MXNet
> supports
> > AMD
> > > GPUs and AWS also happens to have support for it.
> > >
> > > Anirudh
> > >
> > >
> > > On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> > >  wrote:
> > >
> > > > I think it's worth a discussion to do a sanity check. While
> > > generally these
> > > > instructions are standardized, we also had the experience
> with ARM
> > > that
> > > > the theory and reality sometimes don't match. Thus, it's
> always
> > good
> > > to
> > > > check.
> > > >
> > > > In the next months we are going to refactor our slave
> creation
> > > processes.
> > > > Chance Bair has been working on rewriting Windows slaves from
> > > scratch (we
> > > > used images that haven't really been updated for 2 years -
> we still
> > > don't
> > > > know what was done on them) and they're ready soon. In the
> > following
> > > > months, we will also port our Ubuntu slaves to the new method
> > (don't
> > > have a
> > > > timeline yet). Ideally, the integration of AMD instances
> will only
> > > be a
> > > > matter of running the same pipeline on a different instance type. In
> > > > that case, it should not be a big deal.

Re: Adding AMD CPU to CI

2018-11-29 Thread kellen sunderland
Just looked at the mf16c work and wanted to mention Rahul clearly _was_
thinking about AMD users in that PR.

On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> From my perspective we're developing a few features like mf16c and MKLDNN
> integration specifically for Intel CPUs.  It wouldn't hurt to make sure
> those changes also run properly on AMD cpus.
>
On Thu, Nov 29, 2018, 3:38 PM Hao Jin  wrote:
>> I'm a bit confused about why we need extra functionality tests just for
>> AMD
>> CPUs, aren't AMD CPUs supporting roughly the same instruction sets as the
>> Intel ones? In the very unlikely case that something working on Intel
>> CPUs is not functioning on AMD CPUs (or vice versa), it would most
>> likely be related to the underlying hardware implementation of the same
>> ISA, to which we definitely do not have a good solution. So I don't think
>> performing extra tests on functional aspect of the system on AMD CPUs is
>> adding any values.
>> Hao
>>
>> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu 
>> wrote:
>>
>> > +1
>> >
>> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
>> >
>> > What are people's thoughts on having AMD machines tested on the CI?
>> AMD
>> > machines are now available on AWS.
>> >
>> > Best,
>> > Alex
>> >
>> >
>> >
>>
>


Re: Adding AMD CPU to CI

2018-11-29 Thread kellen sunderland
From my perspective we're developing a few features like mf16c and MKLDNN
integration specifically for Intel CPUs.  It wouldn't hurt to make sure
those changes also run properly on AMD cpus.

On Thu, Nov 29, 2018, 3:38 PM Hao Jin  wrote:

> I'm a bit confused about why we need extra functionality tests just for AMD
> CPUs, aren't AMD CPUs supporting roughly the same instruction sets as the
> Intel ones? In the very unlikely case that something working on Intel
> CPUs is not functioning on AMD CPUs (or vice versa), it would most
> likely be related to the underlying hardware implementation of the same
> ISA, to which we definitely do not have a good solution. So I don't think
> performing extra tests on functional aspect of the system on AMD CPUs is
> adding any values.
> Hao
>
> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu 
> wrote:
>
> > +1
> >
> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> >
> > What are people's thoughts on having AMD machines tested on the CI?
> AMD
> > machines are now available on AWS.
> >
> > Best,
> > Alex
> >
> >
> >
>


Re: Adding AMD CPU to CI

2018-11-29 Thread kellen sunderland
+1

On Thu, Nov 29, 2018 at 2:50 PM Seth, Manu 
wrote:

> +1
>
> On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
>
> What are people's thoughts on having AMD machines tested on the CI? AMD
> machines are now available on AWS.
>
> Best,
> Alex
>
>
>


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread kellen sunderland
I believe these PRs are ready to merge but so far I don't have any approvals.
Would appreciate it if someone could do a quick review:

https://github.com/apache/incubator-mxnet/pull/13311
and
https://github.com/apache/incubator-mxnet/pull/13310

-Kellen

On Thu, Nov 29, 2018 at 12:43 PM Steffen Rochel 
wrote:

> Kellen - please merge your PR before v1.4.x branch is created or integrate
> afterwards.
> Steffen
>
> On Tue, Nov 20, 2018 at 7:01 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > regression in master which causes incorrect feature vectors to be output
> > when using the TensorRT feature.  (Thanks to Nathalie for helping me
> track
> > down the root cause of the issue).   I'm currently blocked on a CI issue
> I
> > haven't seen before, but hope to have it resolved by EOW.
> >
> > One call-out I would make is that we currently don't support Turing
> > architecture (sm_75).  I've been slowly trying to add support, but I
> don't
> > think I'd have capacity to do this done by EOW.  Does anyone feel
> strongly
> > we need this in the 1.4 release?  From my perspective this will already
> be
> > a strong release without it.
> >
> > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel 
> > wrote:
> >
> > > Thanks Patrick, lets target to get the PR's merged this week.
> > >
> > > Call for contributions from the community: Right now we have 10 PR
> > awaiting
> > > merge
> > > <
> > >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+
> > > >
> > > and
> > > we have 61 open PR awaiting review.
> > > <
> > >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-review
> > > >
> > > I would appreciate if you all can help to review the open PR and the
> > > committers can drive the merge before code freeze for 1.4.0.
> > >
> > > The contributors on the Java API are making progress, but not all
> > > performance issues are resolved. With some luck it should be possible
> to
> > > code freeze towards end of this week.
> > >
> > > Are there other critical features/bugs/PR you think need to be included
> > in
> > > 1.4.0? If so, please communicate as soon as possible.
> > >
> > > Regards,
> > > Steffen
> > >
> > > On Mon, Nov 19, 2018 at 8:26 PM Zhao, Patric 
> > > wrote:
> > >
> > > > Thanks, Steffen. I think there is NO open issue to block the MKLDNN
> to
> > GA
> > > > now.
> > > >
> > > > BTW, several quantization related PRs (#13297,#13260) are under the
> > > review
> > > > and I think it can be merged in this week.
> > > >
> > > > Thanks,
> > > >
> > > > --Patric
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> > > > > Sent: Tuesday, November 20, 2018 2:57 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0
> > > release
> > > > >
> > > > > On Friday the contributors working on Java API discovered a
> potential
> > > > > performance problem with inference using Java API vs. Python.
> > > > Investigation
> > > > > is ongoing.
> > > > > As the Java API is one of the main features for the upcoming
> > release, I
> > > > > suggest to post-pone the code freeze towards end of this week.
> > > > >
> > > > > Please provide feedback and concern about the change in dates for
> > code
> > > > > freeze and 1.4.0 release. I will provide updates on progress
> > resolving
> > > > the
> > > > > potential performance problem.
> > > > >
> > > > > Patrick - do you think it is possible to resolve the remaining
> issues
> > > on
> > > > MKL-
> > > > > DNN this week, so we can consider GA for MKL-DNN with 1.4.0?
> > > > >
> > > > > Regards,
> > > > > Steffen
> > > > >
> > > > > On Thu, Nov 15, 2018 at 5:26 AM Anton Chernov wrote:

Re: [Announcement] New Committer: Tao Lv

2018-11-26 Thread kellen sunderland
Welcome Tao!

On Mon, Nov 26, 2018 at 7:13 PM Sheng Zha  wrote:

> We are pleased to announce Tao Lv as a new committer of Apache
> MXNet. Tao's sustained contribution to the project has been greatly helping
> the CPU performance of MXNet.
>
> Please join me to welcome Tao to the team!
>
> -sz
>


Re: CI impaired

2018-11-25 Thread kellen sunderland
Sorry, [1] meant to reference
https://issues.jenkins-ci.org/browse/JENKINS-37984 .

On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Marco and I ran into another urgent issue over the weekend that was
> causing builds to fail.  This issue was unrelated to any feature
> development work, or other CI fixes applied recently, but it did require
> quite a bit of work from Marco (and a little from me) to fix.
>
> We spent enough time on the problem that it caused us to take a step back
> and consider how we could both fix issues in CI and support the 1.4 release
> with the least impact possible on MXNet devs.  Marco had planned to make a
> significant change to the CI to fix a long-standing Jenkins error [1], but
> we feel that most developers would prioritize having a stable build
> environment for the next few weeks over having this fix in place.
>
> To properly introduce a new CI system the intent was to do a gradual
> blue/green roll out of the fix.  To manage this rollout would have taken
> operational effort and double compute load as we run systems in parallel.
> This risks outages due to scaling limits, and we’d rather make this change
> during a period of low-developer activity, i.e. shortly after the 1.4
> release.
>
> This means that from now until the 1.4 release, in order to reduce
> complexity MXNet developers should only see a single Jenkins verification
> check, and a single Travis check.
>
>


Re: CI impaired

2018-11-25 Thread kellen sunderland
Marco and I ran into another urgent issue over the weekend that was causing
builds to fail.  This issue was unrelated to any feature development work,
or other CI fixes applied recently, but it did require quite a bit of work
from Marco (and a little from me) to fix.

We spent enough time on the problem that it caused us to take a step back
and consider how we could both fix issues in CI and support the 1.4 release
with the least impact possible on MXNet devs.  Marco had planned to make a
significant change to the CI to fix a long-standing Jenkins error [1], but
we feel that most developers would prioritize having a stable build
environment for the next few weeks over having this fix in place.

To properly introduce a new CI system the intent was to do a gradual
blue/green roll out of the fix.  To manage this rollout would have taken
operational effort and double compute load as we run systems in parallel.
This risks outages due to scaling limits, and we’d rather make this change
during a period of low-developer activity, i.e. shortly after the 1.4
release.

This means that from now until the 1.4 release, in order to reduce
complexity MXNet developers should only see a single Jenkins verification
check, and a single Travis check.


Re: CI impaired

2018-11-24 Thread kellen sunderland
Hey Marco, I'm still having quite a few issues passing PRs.  Would you be
able to at least test a handful of PRs and make sure they pass/fail tests
as you expect?

On Sat, Nov 24, 2018, 7:01 PM Marco de Abreu wrote:

> Hello Steffen,
>
> thank you for bringing up these PRs.
>
> I had to abort the builds during the outage which means that the jobs
> didn't finish and not even the status propagation could have finished
> (hence they show pending instead of failure or aborted).
>
> Recently, we merged a PR that adds utility slaves. This will ensure that
> status updates will always be posted, no matter whether the main queue
> hangs or not. This means that the status would then be properly reflected
> and there should be no hanging pending runs.
>
> I could retrigger all PRs to kick off another round of validation, but this
> would result in 240 jobs (2 main pipelines times 120 open PRs) to run.
> Since we are currently in the pre-release stage, I wanted to avoid putting
> the system under such heavy load.
>
> Instead, I'd kindly like to request the PR creators to make a new commit to
> trigger the pipelines. In order to merge a PR, only PR-merge has to pass
> and I tried to retrigger all PRs that have been aborted during the outage.
> It might have been possible that I missed a few.
>
> Since it's still the weekend and there's not much going on, I can use the
> time to trigger all PRs. Please advise whether you think I should move
> forward (I expect the CI to finish all PRs within 6-10 hours) or if it's
> fine to ask people to retrigger themselves.
>
> Please excuse the caused inconveniences.
>
> Best regards,
> Marco
>
>
> On Sun, Nov 25, 2018 at 03:48, Steffen Rochel
> wrote:
>
> > Thanks Marco for the updates and resolving the issues.
> > However, I do see a number of PR waiting to be merged with inconsistent
> PR
> > validation status check.
> > E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9
> pending
> > checks being queued. However, when you look at the details, either the
> > checks have passed or failed (centos-cpu, edge, unix-cpu, window-cpu,
> > windows-gpu failed, required pr-merge which includes edge, gpu tests
> > passed).
> > Similar also for other PR with label pr-awaiting-merge (
> >
> >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge
> > )
> > Please advice on resolution.
> >
> > Regards,
> > Steffen
> >
> > On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu
> >  wrote:
> >
> > > Thanks everybody, I really appreciate it!
> > >
> > > Today was a good day, there were no incidents and everything appears to
> > be
> > > stable. In the meantime I did a deep dive on why we had such a
> > significant
> > > performance decrease in our compilation jobs - which then clogged
> up
> > > the queue and resulted in 1000 jobs waiting to be scheduled.
> > >
> > > The reason was the way how we use ccache to speed up our compilation
> > jobs.
> > > Usually, this yields us a huge performance improvement (CPU openblas,
> for
> > > example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes down
> > to
> > > ~1.5min, etc.). Unfortunately in this case, ccache was our limiting
> > factor.
> > > Here's some background about how we operate our cache:
> > >
> > > We use EFS to have a distributed ccache between all of our
> > > unrestricted-prod-slaves. EFS is classified for almost unlimited
> > > scalability (being consumed by thousands of instances in parallel [1])
> > with
> > > a theoretical throughput of over 10Gbps. One thing I didn't know when I
> > > designed this approach was the method how throughput is being granted.
> > > Similar to T2-CPU-Credits, EFS uses BurstCredits to allow you higher
> > > throughput (default is 50MiB/s) [2]. Due to the high load, we consumed
> > all
> > > of our credits - here's a very interesting graph: [3].
> > >
> > > To avoid similar incidents in future, I have taken the following
> actions:
> > > 1. I switched EFS from burst-mode to provisioned throughput with
> 300MB/s
> > > (in the graph at [3] you can see how our IO immediately increases - and
> > > thus our CI gets faster - as soon as I added provisioned throughput).
> > > 2. I created internal follow-up tickets to add monitoring and automated
> > > actions.
> > >
> > > First, we should be notified if we are running low on credits to
> kick-off
> > > an investigation. Second (nice to have), we could have a
> lambda-function
> > > which listens for that event and automatically switches the EFS volume
> > from
> > > burst-mode to provisioned throughput during high-load-times. The
> required
> > > throughput could be retrieved via CloudWatch and then multiplied by a
> > > factor. EFS allows downgrading the throughput mode 24h after the last
> > > change (to reduce capacity once the load is over) and always allows
> > > upgrading the provisioned capacity (if the load goes even higher). I've
> > been
> > > looking for a pre-made 
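The burst-credit switch proposed above can be sketched roughly as follows. This is an illustrative sketch, not the actual CI tooling: the watermark, headroom factor, and floor are assumed values, and the commented-out `update_file_system` call shows the real AWS EFS API such a lambda would invoke.

```python
# Illustrative sketch of the proposed lambda logic (values are assumptions,
# not the actual MXNet CI configuration).

def needs_provisioned_mode(burst_credit_balance_bytes, low_watermark_bytes=1.0e12):
    """True when EFS burst credits fall below a chosen watermark."""
    return burst_credit_balance_bytes < low_watermark_bytes

def provisioned_mibps(observed_mibps, headroom=1.5, floor=50):
    """Provisioned throughput: observed CloudWatch load times a headroom
    factor, never below EFS's 50 MiB/s default burst baseline."""
    return max(floor, int(observed_mibps * headroom))

# The actual switch would be a single call against the AWS API, e.g.:
#   boto3.client("efs").update_file_system(
#       FileSystemId=fs_id,
#       ThroughputMode="provisioned",
#       ProvisionedThroughputInMibps=provisioned_mibps(observed_mibps),
#   )
```

With a measured load of 200 MiB/s and 1.5x headroom this would provision 300 MiB/s, the same order of magnitude as the 300 MB/s figure mentioned in the thread.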

Re: Include MKLDNN into default mxnet pip package

2018-11-21 Thread kellen sunderland
Agree with your point about other repos also not being based on versioning
Tao.  I would point out that I've given some that I've worked with similar
feedback: https://github.com/onnx/onnx-tensorrt/issues/68

On Wed, Nov 21, 2018 at 6:48 PM Naveen Swamy  wrote:

> Tao,
>
> You are right there are many submodules in 3rd party. We have to start
> somewhere and I believe this one is a good candidate to start with. This is
> not to cater to release of MXNet or to tie them with the releases of the
> submodules but instead to pick only stable releases and not to pick up
> bleeding edge commits from the tip of the master, this gives us confidence
> in the submodule that MXNet users are depending on that especially if we
> make MKLDNN the default.
>
> Good to know it is already known as a regression. Alex has created this
> issue https://github.com/apache/incubator-mxnet/issues/13369, please add
> details and link the corresponding issue in MKLDNN(I couldn't find).
>
> -Naveen
>
> On Wed, Nov 21, 2018 at 6:04 PM Lv, Tao A  wrote:
>
> > Here are my answers for the questions from Kellen and Naveen about
> > MKL-DNN. It doesn't mean that I'm supportive of making MKL-DNN the default
> > here.
> >
> > @Kellen,
> >
> > FYI, here is a list for those platforms which are officially supported by
> > MKL-DNN.
> > https://github.com/intel/mkl-dnn#system-requirements
> >
> > Most of computation intensive kernels in MKL-DNN are JITed. So they are
> > supposed to generate code according to the platform during runtime. For
> > non-JIT code in MKL-DNN, same as other code in MXNet, it will generate
> > instructions according to the options/flags of compiler. We can set
> > -DARCH_OPT_FLAGS when build MKL-DNN to avoid optimization for compiling
> > machine. That's exactly what we are doing for MKL-DNN build in MXNet.
> Even
> > without MKL-DNN, I noticed there were issues about illegal instructions
> of
> > MXNet when users import the pip package on a lower end machine which
> > probably only supports SSE.
> >
> > @Naveen,
> >
> > The LSTM issue has already been identified as a regression from the
> recent
> > version of MKL-DNN. Hopefully it will be fixed soon with a new update of
> > MKL-DNN.
> >
> > MXNet has many submodule dependencies under the 3rd party folder. Seems
> we
> > don't require release versions for most of these dependencies. The
> release
> > period of MKL-DNN and MXNet are not matched very well. I think it would
> be
> > a risk for an MXNet release if it depends strictly on the release of a
> > submodule, let alone on the releases of all submodules.
> >
> > -tao
> >
> > -Original Message-
> > From: Naveen Swamy [mailto:mnnav...@gmail.com]
> > Sent: Thursday, November 22, 2018 9:08 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: d...@mxnet.apache.org
> > Subject: Re: Include MKLDNN into default mxnet pip package
> >
> > Hi Alex,
> >
> > Thanks for promptly running the numbers on AMD and reporting here.
> >
> > Can you please update the AMD numbers here for posterity
> >
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking
> > ?
> >
> > are there any outstanding issues when MKLDNN is enabled? from my offline
> > conversation I am briefly aware performance issues with LSTM, is there an
> > GitHub issue for it?
> >
> > MKLDNN is a submodule dependency, are we pulling the latest commit or
> > releases  ? If not we should move to releases before we make it a
> default.
> > Ideally we should use platform specific distributions (-dev packages) at
> > least we should rely on well tested releases.
> >
> >
> > Thanks, Naveen
> >
> > On Wed, Nov 21, 2018 at 4:55 PM Zai, Alexander
>  > >
> > wrote:
> >
> > > AMD benchmarks have been published. We are seeing a x15.8 speedup with
> > > Resnet50 (batch size 32) on AWS's new m5a.24xlarge machine. With a
> > > smaller network (Mobilenet - batch size 32) the speedup is more
> > > significant at x38.7. Let's have a vote to see if the PR to have
> > > MKLDNN enabled by default
> > > (https://github.com/apache/incubator-mxnet/pull/12591) can be merged
> > > before 1.4.0 release.
> > >
> > > On 10/19/18, 9:17 AM, "Pedro Larroy" 
> > > wrote:
> > >
> > > I did  pip install mxnet-mkl==1.3.1b20181018 on an AMD Ryzen 1950X
> > > and unit
> > > tests are passing.
> > >
> > > Is this build using AVX512?  in /proc/cpuinfo I see only "avx"
> flag.
> > > There's no "avx2" like on recent intel cpus.
> > >
> > > Pedro.
> > >
> > > On Fri, Oct 19, 2018 at 5:12 PM Hagay Lupesko 
> > > wrote:
> > >
> > > > Awesome collaborative effort across many contributors and
> > companies!
> > > >
> > > > The boost is impressive and for MXNet users to get this boost
> > > "out of the
> > > > box" is a great benefit and makes MXNet an even better choice.
> > > >
> > > > Alex - can you clarify whether there are any down sides with
> > > regards to
> > > > non-AVX-512 architectures, AMD CPUs, 

Re: Include MKLDNN into default mxnet pip package

2018-11-21 Thread kellen sunderland
I've spent the last few days testing MXNet w/ MKLDNN and quantized models
and it's a beast.  Really good speed improvements on my models, no bugs
that I've noticed.

I'm in general supportive but I'm still wondering what the story is like
when there's no AVX instructions present on CPUs.  Do we get an illegal
instruction error, or does it fallback gracefully?  So far it sounds like
it works on a Threadripper and Xen AMD CPU.  I can try on a Ryzen.  What
about older Intel or AMD CPUs?
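One way to answer the "which instructions does this CPU actually have" question raised here is to inspect the flags line of /proc/cpuinfo, as discussed in the quoted messages. A small sketch, with illustrative function names; the tier names mirror the SIMD levels mentioned in the thread:

```python
def isa_flags(cpuinfo_text):
    """Parse the 'flags' line of /proc/cpuinfo into a set of ISA features."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def best_simd_level(flags):
    """Rough tier of SIMD support relevant to the discussion above."""
    for level in ("avx512f", "avx2", "avx", "sse2"):
        if level in flags:
            return level
    return "none"

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(best_simd_level(isa_flags(f.read())))
    except OSError:
        pass  # /proc/cpuinfo only exists on Linux
```

On the Ryzen 1950X Pedro tested, this would report "avx2", since Zen CPUs support AVX2 (the "avx2" flag absence he observed would instead indicate an older core).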

On Wed, Nov 21, 2018 at 4:55 PM Zai, Alexander 
wrote:

> AMD benchmarks have been published. We are seeing a x15.8 speedup with
> Resnet50 (batch size 32) on AWS's new m5a.24xlarge machine. With a smaller
> network (Mobilenet - batch size 32) the speedup is more significant at
> x38.7. Let's have a vote to see if the PR to have MKLDNN enabled by default
> (https://github.com/apache/incubator-mxnet/pull/12591) can be merged
> before 1.4.0 release.
>
> On 10/19/18, 9:17 AM, "Pedro Larroy" 
> wrote:
>
> I did  pip install mxnet-mkl==1.3.1b20181018 on an AMD Ryzen 1950X and
> unit
> tests are passing.
>
> Is this build using AVX512?  in /proc/cpuinfo I see only "avx" flag.
> There's no "avx2" like on recent intel cpus.
>
> Pedro.
>
> On Fri, Oct 19, 2018 at 5:12 PM Hagay Lupesko 
> wrote:
>
> > Awesome collaborative effort across many contributors and companies!
> >
> > The boost is impressive and for MXNet users to get this boost "out
> of the
> > box" is a great benefit and makes MXNet an even better choice.
> >
> > Alex - can you clarify whether there are any down sides with regards
> to
> > non-AVX-512 architectures, AMD CPUs, etc? Will it gracefully
> fallback?
> >
> > Hagay
> >
> >
> > On Fri, Oct 19, 2018, 15:46 Sergio Fernández 
> wrote:
> >
> > > If there is no downside on platforms not supporting AVX512
> instructions,
> > > then +1
> > >
> > >
> > > On Wed, Oct 17, 2018, 14:10 Alex Zai  wrote:
> > >
> > > > Hey all,
> > > > We have been working hard these past few months to integrate and
> > > stabilize
> > > > Intel’s MKLDNN deep learning CPU accelerator into Mxnet and have
> made
> > > > incredible progress. On CPUs with AVX512 instructions (such as
> c5.18x)
> > we
> > > > have seen performance increase up to 12x and on other platforms
> (Macs,
> > > > AVX2) we seen a speedup of 1.5+. Full list of benchmarks can be
> found
> > > here
> > > > (
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95650764
> > > >  and https://github.com/apache/incubator-mxnet/pull/12591).
> > > >
> > > > Currently, using this accelerator requires the developer to
> either pip
> > > > install the mxnet-mkl version of mxnet or to build it themselves
> from
> > > > source. Given that we should try to provide the best performance
> "out
> > of
> > > > the box” with mxnet we should include this in the default build.
> The
> > > mkldnn
> > > > library is included with in the pip package build so it does not
> > require
> > > an
> > > > external dependency.
> > > >
> > > > There were concerns that MKLDNN could cause regressions on
> certain
> > > > platforms (as it did with the tensorflow version a while back);
> but we
> > > > added a env flag (MXNET_MKLDNN_ENABLED) that allows users to
> turn of
> > this
> > > > feature during runtime. Please bring up any other concerns you
> may have
> > > and
> > > > your thoughts on including this accelerator in the default build.
> > > >
> > > > Best,
> > > > Alex
> > > >
> > >
> >
>
>
>
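The MXNET_MKLDNN_ENABLED runtime flag mentioned above can be toggled before mxnet is imported. A small sketch: the flag name comes from the thread, but the helper itself is illustrative, and the assumption that "0" is the accepted disable value should be checked against the MXNet docs.

```python
import os
from contextlib import contextmanager

@contextmanager
def mkldnn_disabled():
    """Temporarily turn off the MKL-DNN accelerator via the runtime flag
    discussed above (MXNET_MKLDNN_ENABLED); restores the prior value.
    Assumes "0" disables the feature."""
    prev = os.environ.get("MXNET_MKLDNN_ENABLED")
    os.environ["MXNET_MKLDNN_ENABLED"] = "0"
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop("MXNET_MKLDNN_ENABLED", None)
        else:
            os.environ["MXNET_MKLDNN_ENABLED"] = prev
```

This is the kind of escape hatch that addresses the regression concern: a user hitting an MKL-DNN-specific problem can fall back without rebuilding.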


Re: Should PR-860 (Use modernized range loops where possible) be reverted?

2018-11-20 Thread kellen sunderland
Hey Carin, I don't think there's any issues merging this PR.  The veto'd
aspect was around _requiring_ modern loop usage, and failing the build if
clang tidy detected modern loops could be used but weren't.  The original
PR included a check for this and would fail any builds not using modern
loops.  Several people didn't like this aspect so I updated the PR and
removed that overly-strict check.  The current PR doesn't have anything it
in that's been vetod.  We're continuing to warn if clang-tidy detects a
loop could be modernized, but are not causing an error (which was actually
the behaviour before this PR was merged).

On Tue, Nov 20, 2018 at 7:29 AM Anton Chernov  wrote:

> Hi Carin,
>
> The discussion [1] was about whether to enable automatic checks on using
> old behaviour in new PR's. Kellens PR [2] was about modernizing the actual
> code itself and was not up for voting, thus could not receive any technical
> veto votes.
>
> Per the discussion (as I have understood it), we won't get veto votes if we
> would enable the check on CI, if it would be treated as a warning.
>
> Thank you for merging the PR in the first place. I see no reason for
> reverting it.
>
> Best
> Anton
>
> [1]
>
> https://lists.apache.org/thread.html/b47f285a80bef47c5ead6c361614e338a0661f6c0c76196c1e3719c5@%3Cdev.mxnet.apache.org%3E
> [2] https://github.com/apache/incubator-mxnet/pull/12356
>
>
> On Tue, Nov 20, 2018 at 15:24, Pedro Larroy wrote:
>
> > Hi all
> >
> > I think we have to make the clear separation between the thread votes
> > on "uniformly adopting C++11 range loops in the MXNet project" and a
> > PR which refactored code to be more legible and with improved variable
> > names.
> > Merging that PR doesn't imply that we have to uniformly adopt the
> > previous proposal.  The PR was reviewed and approved by several
> > people. I would keep the two topics separate, merging this PR doesn't
> > prescribe any particular idiom for future commits or reviews.
> >
> > Pedro.
> >
> > On Tue, Nov 20, 2018 at 2:58 PM Carin Meier 
> wrote:
> > >
> > > My intent was to be helpful, but I think I may have merged this PR
> > > yesterday too soon thinking it was approved and ready to merge
> > > https://github.com/apache/incubator-mxnet/pull/12356
> > >
> > > I didn't see the connected dev discussion
> > >
> >
> https://lists.apache.org/thread.html/b47f285a80bef47c5ead6c361614e338a0661f6c0c76196c1e3719c5@%3Cdev.mxnet.apache.org%3E
> > > where there were -1 votes, which I believe are vetos?
> > >
> > > So the question is confirm: should PR should be reverted?
> > >
> > > Sorry for any confusion,
> > > Carin
> >
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.3.1.rc0

2018-11-16 Thread kellen sunderland
Just tested with 1.3.0 and those tests were failing for that release as
well.  Given it's not a regression I'm +1 (non-binding).

On Thu, Nov 15, 2018 at 11:52 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Thanks for organizing the release Anton and for testing Carin and
> Steffen.  Lots of great fixes in this release.  As we don't have the
> required 3 committers I'd suggest extending the vote for a few days.
>
> I tested the following on MacOS 10.13, High Sierra:
>
> INCUBATING IN RELEASE FILE: check.
> LICENSE check.
> NOTICE check.
> SIGNATURE check.
> HASH check.
> DISCLAIMER check.
> SOURCE COMPILES VIA MAKEFILE check.
> SOURCE COMPILES VIA CMAKE check.
> C++ TESTS PASS fail
> Two tests failing for me.
> Build with flags: cmake -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_OPENMP=0
> -DUSE_OPENCV=0 ..
> Ran c++ tests with exclusions: ./tests/mxnet_unit_tests
> --gtest_filter=-GpuTopology.*
> Result:
> [  FAILED  ] 2 tests, listed below:
> [  FAILED  ] ACTIVATION_PERF.ExecuteBidirectional
> [  FAILED  ] ACTIVATION_PERF.TimingCPU
>
> PYTHON UNIT TESTS PASS check.
>
> Not sure if the test failures are a regression so I'm +0 (non-binding)
>
> On Thu, Nov 15, 2018 at 5:43 PM Steffen Rochel 
> wrote:
>
>> +1 build on MacOS Sierra following instructions on
>>
>> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Developer+Setup+on+Mac
>> and run one training test.
>>
>> On Tue, Nov 13, 2018 at 2:34 PM Carin Meier  wrote:
>>
>> > +1 - Clojure package tested fine with Scala jars
>> >
>> > On Mon, Nov 12, 2018 at 6:53 PM Anton Chernov 
>> wrote:
>> >
>> > > Dear MXNet community,
>> > >
>> > > This is the vote to release Apache MXNet (incubating) version 1.3.1.
>> > Voting
>> > > will start now, on Monday the 12th of November 2018 and close on 14:00
>> > > Thursday the 15th of November 2018, Pacific Time (PT).
>> > >
>> > > Link to release notes:
>> > > https://cwiki.apache.org/confluence/x/eZGzBQ
>> > >
>> > > Link to release candidate 1.3.1.rc0:
>> > > https://github.com/apache/incubator-mxnet/releases/tag/1.3.1.rc0
>> > >
>> > > Link to source and signatures on apache dist server:
>> > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.3.1.rc0/
>> > >
>> > > Link to scala packages on the staging repo:
>> > >
>> > > * CPU
>> > >
>> > >
>> >
>> https://repository.apache.org/content/repositories/snapshots/org/apache/mxnet/mxnet-full_2.11-osx-x86_64-cpu/1.3.1-SNAPSHOT/
>> > >
>> > > * GPU
>> > >
>> > >
>> >
>> https://repository.apache.org/content/repositories/snapshots/org/apache/mxnet/mxnet-full_2.11-linux-x86_64-gpu/1.3.1-SNAPSHOT/
>> > >
>> > > Please remember to TEST first before voting accordingly:
>> > > +1 = approve
>> > > +0 = no opinion
>> > > -1 = disapprove (provide reason)
>> > >
>> > >
>> > > Best regards,
>> > > Anton
>> > >
>> >
>>
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.3.1.rc0

2018-11-15 Thread kellen sunderland
Thanks for organizing the release Anton and for testing Carin and Steffen.
Lots of great fixes in this release.  As we don't have the required 3
committers I'd suggest extending the vote for a few days.

I tested the following on MacOS 10.13, High Sierra:

INCUBATING IN RELEASE FILE: check.
LICENSE check.
NOTICE check.
SIGNATURE check.
HASH check.
DISCLAIMER check.
SOURCE COMPILES VIA MAKEFILE check.
SOURCE COMPILES VIA CMAKE check.
C++ TESTS PASS fail
Two tests failing for me.
Build with flags: cmake -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_OPENMP=0
-DUSE_OPENCV=0 ..
Ran c++ tests with exclusions: ./tests/mxnet_unit_tests
--gtest_filter=-GpuTopology.*
Result:
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] ACTIVATION_PERF.ExecuteBidirectional
[  FAILED  ] ACTIVATION_PERF.TimingCPU

PYTHON UNIT TESTS PASS check.

Not sure if the test failures are a regression so I'm +0 (non-binding)

On Thu, Nov 15, 2018 at 5:43 PM Steffen Rochel 
wrote:

> +1 build on MacOS Sierra following instructions on
>
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Developer+Setup+on+Mac
> and run one training test.
>
> On Tue, Nov 13, 2018 at 2:34 PM Carin Meier  wrote:
>
> > +1 - Clojure package tested fine with Scala jars
> >
> > On Mon, Nov 12, 2018 at 6:53 PM Anton Chernov 
> wrote:
> >
> > > Dear MXNet community,
> > >
> > > This is the vote to release Apache MXNet (incubating) version 1.3.1.
> > Voting
> > > will start now, on Monday the 12th of November 2018 and close on 14:00
> > > Thursday the 15th of November 2018, Pacific Time (PT).
> > >
> > > Link to release notes:
> > > https://cwiki.apache.org/confluence/x/eZGzBQ
> > >
> > > Link to release candidate 1.3.1.rc0:
> > > https://github.com/apache/incubator-mxnet/releases/tag/1.3.1.rc0
> > >
> > > Link to source and signatures on apache dist server:
> > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.3.1.rc0/
> > >
> > > Link to scala packages on the staging repo:
> > >
> > > * CPU
> > >
> > >
> >
> https://repository.apache.org/content/repositories/snapshots/org/apache/mxnet/mxnet-full_2.11-osx-x86_64-cpu/1.3.1-SNAPSHOT/
> > >
> > > * GPU
> > >
> > >
> >
> https://repository.apache.org/content/repositories/snapshots/org/apache/mxnet/mxnet-full_2.11-linux-x86_64-gpu/1.3.1-SNAPSHOT/
> > >
> > > Please remember to TEST first before voting accordingly:
> > > +1 = approve
> > > +0 = no opinion
> > > -1 = disapprove (provide reason)
> > >
> > >
> > > Best regards,
> > > Anton
> > >
> >
>


Re: MKLDNN dynamically linked

2018-11-08 Thread kellen sunderland
I think we should bias towards static linking.  It should make using mxnet
easier in a lot of cases for users.  As long as the license permits static
linking (i.e. is non-gpl) I'd +1 static linking for portability and ease of
use.  The only caveat would be in cases where the package size would cause
grief for PyPi maintainers.

On Thu, Nov 8, 2018, 3:54 PM Sheng Zha wrote:

> +1. Ideally, MKLDNN can be statically linked. mxnet-mkl relies on Make for
> building it so help is wanted on mxnet.
>
> -sz
>
> On 2018/11/08 21:28:50, Alex Zai  wrote:
> > Currently in mxnet-mkl the libmxnet.so is dynamically linked to to
> > libmkldnn.so.0. This is known to cause some issues if the wrong version
> of
> > mkldnn is linked. Can we static link this file instead?
> >
> > Alex
> >
>
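The wrong-version problem Alex describes above can be detected by checking whether libmxnet.so carries a dynamic mkldnn dependency at all. A hedged sketch that parses `ldd` output — the parsing helper is illustrative; a statically linked build would simply show no libmkldnn entry:

```python
def dynamic_mkldnn_dep(ldd_output):
    """Return the mkldnn soname that libmxnet.so links against, or None
    if no dynamic mkldnn dependency appears (i.e. statically linked)."""
    for line in ldd_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        token = stripped.split()[0]
        if token.startswith("libmkldnn"):
            return token
    return None

# In practice the input would come from e.g.:
#   subprocess.run(["ldd", "libmxnet.so"],
#                  capture_output=True, text=True).stdout
```

A result like "libmkldnn.so.0" confirms the dynamic link this thread proposes replacing with static linking.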


Re: [VOTE] Separating PMC and Committership

2018-11-08 Thread kellen sunderland
+1 (non-binding)

On Thu, Nov 8, 2018 at 10:37 AM Thomas DELTEIL 
wrote:

> +1 (non-binding)
>
> On Thu, Nov 8, 2018 at 10:04, Carin Meier wrote:
>
> > Reminder - Vote ends tomorrow-  Friday Nov 9th at 6:00 am EST
> >
> > On Mon, Nov 5, 2018 at 11:29 AM Carin Meier 
> wrote:
> >
> > > This is a procedural vote on whether to separate the committer and PPMC
> > > levels in the project. The current state is that a user is considered
> as
> > > both a committer and a PPMC member at the same time. This vote is to
> > change
> > > that to be able to invite a person in as a committer separately from a
> > PPMC
> > > member.
> > >
> > > Document reference:
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member
> > >
> > > Discussion thread:
> > >
> >
> https://lists.apache.org/thread.html/9c6ecda02e081aa6b689c92badc9dcf05ced6fb3691fd370471773d1@%3Cdev.mxnet.apache.org%3E
> > >
> > > The vote will be a procedural issue vote as defined
> > > https://www.apache.org/foundation/voting.html
> > >
> > > Votes on procedural issues follow the common format of majority rule
> > unless
> > > otherwise stated. That is, if there are more favourable votes than
> > > unfavourable ones, the issue is considered to have passed -- regardless
> > of
> > > the number of votes in each category. (If the number of votes seems too
> > > small to be representative of a community consensus, the issue is
> > typically
> > > not pursued. However, see the description of lazy consensus for a
> > > modifying factor.)
> > >
> > > The vote will run until Friday Nov 9th at 6:00 am EST
> > >
> > > Thanks,
> > > Carin
> > >
> > >
> > >
> >
>


Re: [RESULT][LAZY VOTE] Next MXNet release

2018-11-07 Thread kellen sunderland
+1 to trying to get a 1.4.0 Nov release.  I think the MKLDNN work alone is
a headline feature that users would love to get their hands on.

On Tue, Nov 6, 2018 at 11:32 PM Sheng Zha  wrote:

> I'd like to propose that we expedite the 1.4.0 release slightly as there
> doesn't seem to be a rule that prevents a minor release from happening at
> the same time of a patch release. This would shorten the time it takes for
> new features to reach users. Proposed revision to the timeline:
> - Code freeze: 11/9
> - Release published: 11/22
>
> If there's no issue about both the proposal and new timeline, I'd be happy
> to manage 1.4.0 release as release manager.
>
> -sz
>
> On Thu, Nov 1, 2018 at 7:56 AM Steffen Rochel 
> wrote:
>
> > There have been no objections, so lazy vote passed.
> > Anton volunteered to manage the 1.3.1 release and Naveen will support him
> > as co-manager to handle the release tasks requiring committer powers.
> > Please support Anton for a smooth 1.3.1 release process.
> >
> > I'm still looking for volunteers to manage / co-manage the 1.4.0 release.
> >
> > Regards,
> > Steffen
> >
> > On Sun, Oct 28, 2018 at 7:33 PM Steffen Rochel 
> > wrote:
> >
> > > I am calling a lazy vote to release MXNet
> > > 1.3.1 (patch release) and 1.4.0 (minor release).
> > >
> > > Release content: release proposal page
> > > <
> >
> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+for+next+MXNet+Release
> > >
> > >
> > > Target milestones:
> > > *1.3.1*
> > >
> > >- Code Freeze: 10/31
> > >- Release published: 11/13
> > >
> > > *1.4.0:*
> > >
> > >- Code Freeze: 11/13
> > >- Release published: 12/13 (if possible announce during NIPS)
> > >
> > >
> > > The vote will be open until Wednesday October 31, 2018 8.00pm PDT.
> > >
> > > Regards,
> > > Steffen
> > >
> > > On Fri, Oct 26, 2018 at 7:56 AM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > >
> > >> During the Hangout on Wednesday multiple release proposals have been
> > >> discussed. I summarized discussion here
> > >> <
> >
> https://cwiki.apache.org/confluence/display/MXNET/Hangout+October+24th+2018+8am+and+5pm+PDT
> >
> > and
> > >> updated the release proposal page
> > >> <
> >
> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+for+next+MXNet+Release
> > >
> > >> .
> > >> Please review, provide feedback and propose changes.
> > >> I plan to start a lazy vote on Sunday regarding the release proposal.
> > >>
> > >> Calling for volunteers to manage the 1.3.1 and 1.4.0 release.
> > >>
> > >> Regards,
> > >> Steffen
> > >>
> > >> On Tue, Oct 9, 2018 at 7:20 AM kellen sunderland <
> > >> kellen.sunderl...@gmail.com> wrote:
> > >>
> > >>> Hey Steffen,
> > >>>
> > >>> Recommend these be merged into patch release:
> > >>>
> > >>> https://github.com/apache/incubator-mxnet/pull/12631
> > >>> https://github.com/apache/incubator-mxnet/pull/12603
> > >>> https://github.com/apache/incubator-mxnet/pull/12499
> > >>>
> > >>> -Kellen
> > >>>
> > >>> On Tue, Oct 2, 2018 at 7:17 AM Zhao, Patric 
> > >>> wrote:
> > >>>
> > >>> > Thanks to let us know this discussion.
> > >>> > Because we don't have enough bandwidth to track the different
> > sources,
> > >>> > like discussion forum.
> > >>> >
> > >>> > I think the best way is to open issue in the github so that we can
> > >>> > answer/solve the issue in time :)
> > >>> >
> > >>> > Thanks,
> > >>> >
> > >>> > --Patric
> > >>> >
> > >>> > > -Original Message-
> > >>> > > From: Afrooze, Sina [mailto:sina@gmail.com]
> > >>> > > Sent: Tuesday, October 2, 2018 1:14 AM
> > >>> > > To: dev@mxnet.incubator.apache.org
> > >>> > > Cc: Ye, Jason Y ; Zai, Alexander
> > >>> > > ; Zheng, Da 
> > >>> > > Subject: Re: [Discuss] Next MXNet release
> > >>> > >

Re: [DISCUSS] Build OSX builds in CI (possibly with TravisCI).

2018-11-06 Thread kellen sunderland
 >>>> > >
> >>>> > > > This is awesome. Thanks a lot Kellen and Marco. With this work
> >>>> > complete,
> >>>> > > we
> >>>> > > > will have MXNet Python tests running for Mac on Travis CI, for
> PR
> >>>> and
> >>>> > > > Branch builds?
> >>>> > > > Thank you for working on fixing the tests and making it run as
> >>>> part of
> >>>> > > > Travis CI for Mac platform. Is there any Github issue or Jira
> >>>> where we
> >>>> > > can
> >>>> > > > see disabled / tests that needs to be fixed for Mac? This might
> be
> >>>> > useful
> >>>> > > > if we can call for contributions.
> >>>> > > >
> >>>> > > > Best,
> >>>> > > > Sandeep
> >>>> > > >
> >>>> > > >
> >>>> > > > On Tue, Sep 18, 2018 at 9:51 AM Marco de Abreu
> >>>> > > >  wrote:
> >>>> > > >
> >>>> > > > > Hey everyone,
> >>>> > > > >
> >>>> > > > > we are about to enable Python tests for Mac. The outstanding
> >>>> bugs
> >>>> > have
> >>>> > > > been
> >>>> > > > > fixed by Kellen and we're just waiting for the PRs to pass.
> >>>> We'll
> >>>> > send
> >>>> > > a
> >>>> > > > > separate email as soon as they are enabled.
> >>>> > > > >
> >>>> > > > > Additionally, we had a small problem that Travis runs got
> >>>> aborted if
> >>>> > > > > multiple commits were done in a short timeframe. While this is
> >>>> > > acceptable
> >>>> > > > > for PRs, this causes our branch jobs to also fail. An examples
> >>>> is
> >>>> > > > available
> >>>> > > > > at [1]. In order to cope with this, I have asked Apache Infra
> to
> >>>> > > disable
> >>>> > > > > cancellation of concurrent jobs. They agreed to this, but
> >>>> reminded us
> >>>> > > > that
> >>>> > > > > they might turn it back on if we consume too many resources.
> >>>> > > > >
> >>>> > > > > The dashboard to review the Travis resource utilization is
> >>>> available
> >>>> > at
> >>>> > > > > [2]. Just log in as Guest.
> >>>> > > > >
> >>>> > > > > Best regards,
> >>>> > > > > Marco
> >>>> > > > >
> >>>> > > > > [1]:
> >>>> > > > >
> >>>> > > > >
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> https://travis-ci.org/apache/incubator-mxnet/builds/430135867?utm_source=github_status_medium=notification
> >>>> > > > > [2]:
> >>>> > > > >
> >>>> > > > >
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> https://demo.kibble.apache.org/dashboard.html?page=ci=e0ce4eee89a77ec231eee1fdbbc647cb3de2f6ecfc3cef8d8c11dc2d=hour
> >>>> > > > >
> >>>> > > > >
> >>>> > > > > On Thu, Sep 13, 2018 at 1:06 AM kellen sunderland <
> >>>> > > > > kellen.sunderl...@gmail.com> wrote:
> >>>> > > > >
> >>>> > > > > > We've got fairly limited ability to change what's reported
> by
> >>>> > Travis.
> >>>> > > > > Most
> >>>> > > > > > administration is done by the ASF Infra crew, so it's tough
> >>>> for us
> >>>> > to
> >>>> > > > > > experiment with settings.  It'd be great if you could bear
> >>>> with us
> >>>> > > for
> >>>> > > > a
> >>>> > > > > > few days.  It shouldn't take too long to either (1) get
> >>>

Re: Hold on the merge of new MKL-DNN operator tests

2018-11-03 Thread kellen sunderland
Hey Tao, thanks for letting the community know.  It's completely
understandable if you want to dig deep on the failure.  Don't worry about
taking a little extra time to get to the bottom of test failures, that's
exactly the reason we have the CI setup.  Let us know if there's anything
you think we can help with.

On Fri, Nov 2, 2018 at 7:56 PM Lv, Tao A  wrote:

>
> Hi MXNet developers,
>
> I am working on PR#12953<
> https://github.com/apache/incubator-mxnet/pull/12953> to update the
> version of MKL-DNN dependency. This PR will help to address several
> requests and issues from the community of both MXNet and MKL-DNN. It will
> also improve MXNet performance a lot when using MKL-DNN backend. Currently,
> this work is almost done except for the failure of the LRN CPP test. Since this PR
> doesn't touch the integration code of the MKL-DNN LRN operator, I guess
> there is something new in MKL-DNN which conflicts with the CPP test. An
> internal discussion is ongoing with MKL-DNN developers about that and
> hopefully this issue will be fixed soon.
>
> I noticed that several other CPP tests for MKL-DNN operators are
> under review, following the same methodology as the LRN CPP test.
> To avoid more conflicts, I suggest holding off on merging these new tests
> until the LRN issue in #12953 is clearly addressed.
>
> Here is the list of these PRs:
> https://github.com/apache/incubator-mxnet/pull/13084
> https://github.com/apache/incubator-mxnet/pull/12985
> https://github.com/apache/incubator-mxnet/pull/12884
>
> Let me know what you think. Thanks.
> -tao
>


Re: Coverity scan

2018-11-02 Thread kellen sunderland
Totally agree Pedro, reporting the data in a more accessible way would be a
huge improvement.  For this reason alone I think it might be worthwhile
adopting Coverity.

On Fri, Nov 2, 2018, 11:38 AM Pedro Larroy  wrote:

> Thanks a lot, I think it is very beneficial that we invest in this kind of
> tooling for code quality. As a developer I wonder, do we have actionable
> items for looking at / fixing these issues or right now is done in an
> informational / good will basis?
>
> Is there a way to colorize this output?
>
> Pedro.
>
> On Fri, Nov 2, 2018 at 5:10 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Reference scan here (I believe I also count 5 memory violations):
> >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1856/nodes/104/log/?start=0
> >
> > -Kellen
> >
> > On Fri, Nov 2, 2018 at 9:07 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey Anton, can you provide a sample scan?  I'm interested to see if it
> > > catches different memory access violations, or if it gets the same ones
> > > we've already seen reported by clang-tidy.  For example are these
> > > violations in the reports:
> > > --
> > > "/work/mxnet/3rdparty/dmlc-core/include/dmlc/concurrentqueue.h:3443:24:
> > > warning: Access to field 'capacity' results in a dereference of a null
> > > pointer (loaded from variable 'mainHash')
> > > [clang-analyzer-core.NullDereference]"
> > >
> > > ---
> > >
> > > /work/mxnet/3rdparty/mshadow/mshadow/./tensor.h:64:23: warning:
> Assigned
> > value is garbage or undefined [clang-analyzer-core.uninitialized.Assign]
> > >   this->shape_[i] = s[i];"
> > >
> > > -
> > >
> > >
> > >
> >
> /usr/bin/../lib/gcc/x86_64-linux-gnu/8.0.1/../../../../include/c++/8.0.1/ext/atomicity.h:67:29:
> > warning: Use of memory after it is freed
> > [clang-analyzer-cplusplus.NewDelete]
> > >
> > > --
> > >
> > > -Kellen
> > >
> > >
> > >
> > > On Fri, Nov 2, 2018 at 2:20 AM Anton Chernov 
> > wrote:
> > >
> > >> Dear MXNet community,
> > >>
> > >> I had investigated the possibility to adopt Coverity static analysis
> > tools
> > >> for the MXNet project and it turned out that there is a tool provided
> by
> > >> Synopsys for open-source projects:
> > >>
> > >> https://scan.coverity.com
> > >>
> > >> The tool works nicely with GitHub [1] and I found that a scan for a
> fork
> > >> (from @apeforest) [2] was already set up. I can not tell how long ago
> > the
> > >> scan was performed, but at the time of writing the project page shows
> 5
> > >> illegal memory access errors, that I think would be worth
> investigating.
> > >>
> > >> If there is interest I would suggest that we would setup a Coverity
> scan
> > >> for the main repository instead of a fork and people that have
> interest
> > >> managing and fixing issues would request add them to the project.
> > >>
> > >> I would appreciate feedback for this proposal and help from people
> > having
> > >> rights for the main repository to set things up.
> > >>
> > >> Best regards,
> > >> Anton
> > >>
> > >> [1] https://scan.coverity.com/github
> > >> [2] https://scan.coverity.com/projects/apeforest-incubator-mxnet
> > >>
> > >
> >
>


Re: Coverity scan

2018-11-02 Thread kellen sunderland
Reference scan here (I believe I also count 5 memory violations):
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1856/nodes/104/log/?start=0

-Kellen

On Fri, Nov 2, 2018 at 9:07 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Hey Anton, can you provide a sample scan?  I'm interested to see if it
> catches different memory access violations, or if it gets the same ones
> we've already seen reported by clang-tidy.  For example are these
> violations in the reports:
> --
> "/work/mxnet/3rdparty/dmlc-core/include/dmlc/concurrentqueue.h:3443:24:
> warning: Access to field 'capacity' results in a dereference of a null
> pointer (loaded from variable 'mainHash')
> [clang-analyzer-core.NullDereference]"
>
> ---
>
> /work/mxnet/3rdparty/mshadow/mshadow/./tensor.h:64:23: warning: Assigned 
> value is garbage or undefined [clang-analyzer-core.uninitialized.Assign]
>   this->shape_[i] = s[i];"
>
> -
>
>
> /usr/bin/../lib/gcc/x86_64-linux-gnu/8.0.1/../../../../include/c++/8.0.1/ext/atomicity.h:67:29:
>  warning: Use of memory after it is freed [clang-analyzer-cplusplus.NewDelete]
>
> --
>
> -Kellen
>
>
>
> On Fri, Nov 2, 2018 at 2:20 AM Anton Chernov  wrote:
>
>> Dear MXNet community,
>>
>> I had investigated the possibility to adopt Coverity static analysis tools
>> for the MXNet project and it turned out that there is a tool provided by
>> Synopsys for open-source projects:
>>
>> https://scan.coverity.com
>>
>> The tool works nicely with GitHub [1] and I found that a scan for a fork
>> (from @apeforest) [2] was already set up. I can not tell how long ago the
>> scan was performed, but at the time of writing the project page shows 5
>> illegal memory access errors, that I think would be worth investigating.
>>
>> If there is interest I would suggest that we would setup a Coverity scan
>> for the main repository instead of a fork and people that have interest
>> managing and fixing issues would request add them to the project.
>>
>> I would appreciate feedback for this proposal and help from people having
>> rights for the main repository to set things up.
>>
>> Best regards,
>> Anton
>>
>> [1] https://scan.coverity.com/github
>> [2] https://scan.coverity.com/projects/apeforest-incubator-mxnet
>>
>


Re: Coverity scan

2018-11-02 Thread kellen sunderland
Hey Anton, can you provide a sample scan?  I'm interested to see if it
catches different memory access violations, or if it gets the same ones
we've already seen reported by clang-tidy.  For example are these
violations in the reports:
--
"/work/mxnet/3rdparty/dmlc-core/include/dmlc/concurrentqueue.h:3443:24:
warning: Access to field 'capacity' results in a dereference of a null
pointer (loaded from variable 'mainHash')
[clang-analyzer-core.NullDereference]"

---

/work/mxnet/3rdparty/mshadow/mshadow/./tensor.h:64:23: warning:
Assigned value is garbage or undefined
[clang-analyzer-core.uninitialized.Assign]
  this->shape_[i] = s[i];"

-


/usr/bin/../lib/gcc/x86_64-linux-gnu/8.0.1/../../../../include/c++/8.0.1/ext/atomicity.h:67:29:
warning: Use of memory after it is freed
[clang-analyzer-cplusplus.NewDelete]

--

-Kellen



On Fri, Nov 2, 2018 at 2:20 AM Anton Chernov  wrote:

> Dear MXNet community,
>
> I had investigated the possibility to adopt Coverity static analysis tools
> for the MXNet project and it turned out that there is a tool provided by
> Synopsys for open-source projects:
>
> https://scan.coverity.com
>
> The tool works nicely with GitHub [1] and I found that a scan for a fork
> (from @apeforest) [2] was already set up. I can not tell how long ago the
> scan was performed, but at the time of writing the project page shows 5
> illegal memory access errors, that I think would be worth investigating.
>
> If there is interest I would suggest that we would setup a Coverity scan
> for the main repository instead of a fork and people that have interest
> managing and fixing issues would request add them to the project.
>
> I would appreciate feedback for this proposal and help from people having
> rights for the main repository to set things up.
>
> Best regards,
> Anton
>
> [1] https://scan.coverity.com/github
> [2] https://scan.coverity.com/projects/apeforest-incubator-mxnet
>


Re: [VOTE] - Adopt "Become a Committer and PPMC Member" Document

2018-10-29 Thread kellen sunderland
+1 non-binding.  As mentioned in various threads, this model should be much
more scalable.  I like the idea of hierarchies of contributors on the
project.

On Mon, Oct 29, 2018 at 3:47 PM Carin Meier  wrote:

> This vote is to adopt the document
>
> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
> to replace the current document
> https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
>
> The dev discussion thread is here
>
> https://lists.apache.org/thread.html/e61ffa26af374de7a99c475d406e462a00b26cfc1155e232198dd53e@%3Cdev.mxnet.apache.org%3E
>
> The vote will be a procedural issue vote as defined
> https://www.apache.org/foundation/voting.html
>
> Votes on procedural issues follow the common format of majority rule unless
> otherwise stated. That is, if there are more favourable votes than
> unfavourable ones, the issue is considered to have passed -- regardless of
> the number of votes in each category. (If the number of votes seems too
> small to be representative of a community consensus, the issue is typically
> not pursued. However, see the description of lazy consensus
>  for a
> modifying factor.)
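For readers new to ASF process, the majority-rule counting described above is simple enough to sketch. This is only an illustration of the rule as quoted (more favourable votes than unfavourable ones means the issue passes, regardless of totals), not an official tallying tool:

```python
# Illustrative sketch of majority rule for procedural-issue votes.
# The function name and vote representation are assumptions for this example.

def procedural_vote_passes(votes):
    """votes: iterable of "+1", "0", or "-1" strings; "0" is an abstention."""
    favourable = sum(1 for v in votes if v == "+1")
    unfavourable = sum(1 for v in votes if v == "-1")
    # Passes only with strictly more +1s than -1s; a tie does not pass.
    return favourable > unfavourable

print(procedural_vote_passes(["+1", "+1", "-1", "0"]))  # True
```

Note that abstentions are counted in neither category, matching the "regardless of the number of votes in each category" wording above.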
>
> The vote will run until Friday Nov 2nd at 6:00 am EST
>
> Thanks,
> Carin
>


Re: [DISCUSS] - Revisions to Committer Criteria

2018-10-29 Thread kellen sunderland
I believe the wording _must_ comes from the fact that the PMC (as a body)
must have a formal vote for a release, otherwise the release will not
happen.  I don't believe it means every PMC member is required to vote on
the release.  I can see where the confusion comes from, but also feel the
wording is correct.

On Mon, Oct 29, 2018 at 8:53 AM Yuan Tang  wrote:

> On Sun, Oct 28, 2018 at 11:12 PM Naveen Swamy  wrote:
>
> > I added clarifying sections to explicitly call out committers/PMC
> > privileges. Please review.
> >
> > Pasting here for convenience
> > Committer Privileges
> >
> >- Committers have write access to the code repository.
> >- Committers have an @apache.org email address.
> >- Committers can make short-term decisions for the project, approving
> >and merging pull requests.
> >- Committer Vote is *NOT* considered *binding* thus the vote you cast
> do
> >not have *Veto* on issues that require consensus.
> >- Committer's can request changes on Pull Requests but it does not
> >constitute Veto, PMC can agree to approve or reject requested changes.
> >
> > PMC Privileges
> >
> >- PMC makes the long-term decisions with regard to the project.
> >- PMC members have write access to the code repository.
> >- PMC members have @apache.org email address.
> >- PMC has access to private@ email list
> >- PMC has the right to vote for the community-related decisions, PMC
> >Votes are *binding*.
> >- PMC has the right to propose active users for committership.
> >- PMC must vote on any formal release of the project's software
> product.
> >
> Could you clarify on this (I don't think you meant "PMC *must* vote")? How
> many votes are required by PMCs before the formal release can happen? Is
> this considered community-related decision as well, e.g. PMC vetos are
> binding?
>
>
> >
> >
> > All, I suggest you review the proposal and if there is any concern please
> > voice it here before this goes out for voting.
> >
> >
> > On Sun, Oct 28, 2018 at 8:04 AM Carin Meier 
> wrote:
> >
> > > I plan to start a vote on the adopting
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
> > > to
> > > replace our current document
> > > https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
> > > tomorrow
> > > (Monday).
> > >
> > > - Carin
> > >
> > > On Thu, Oct 25, 2018 at 8:32 AM Carin Meier 
> > wrote:
> > >
> > > > Thanks for publishing the notes and also thanks everyone for
> providing
> > > > valuable feedback and discussion.
> > > >
> > > > I encourage everyone that has ideas for improvement to the document
> to
> > > > feel free to edit and revise. If you need a login to the wiki, please
> > > just
> > > > ask.
> > > >
> > > > Also, while editing, please keep in mind that the intent is to have a
> > > vote
> > > > on adopting the new
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
> > > > to replace our current document
> > > >
> https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
> > > > before a vote on separating levels of committer and PPMC as a
> process.
> > > So,
> > > > if possible, adopting wording that would work in either outcome of
> that
> > > > vote.
> > > >
> > > > On the subject of voting, I was thinking of starting a vote on
> Friday,
> > > but
> > > > will delay that until the active discussions and revisions are
> > complete.
> > > >
> > > > Best,
> > > > Carin
> > > >
> > > > On Thu, Oct 25, 2018 at 6:39 AM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > > wrote:
> > > >
> > > >> This is the first hangout that I was able to attend, I liked the
> > format
> > > >> and
> > > >> found them valuable. Thanks for organizing and publishing the notes.
> > > >> Looking forward to the next one.
> > > >>
> > > >> Pedro
> > > >>
> > > >> On Thu, Oct 25, 2018 at 6:44 AM Steffen Rochel <
> > steffenroc...@gmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >> > Carin - please see
> > > >> >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Hangout+October+24th+2018+8am+and+5pm+PDT
> > > >> > :
> > > >> > Discussion about committer proposal:
> > > >> >
> > > >> >- Proposal default should be to have separation between
> committer
> > > and
> > > >> >PPMC election
> > > >> >- Criteria are vague, should we add some example persona?
> > > >> >- Spell out privileges of committer and PPMC member
> > > >> >
> > > >> >
> > > >> > Note: I update the project proposal to address first bullet.
> > > >> >
> > > >> > Steffen
> > > >> >
> > > >> >
> > > >> > On Wed, Oct 24, 2018 at 11:29 AM Carin Meier <
> carinme...@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > > A request to whoever is taking notes at the MXNet Hangouts that
> > are
> > > >> > > occurring today. Could you please 

Re: Apache MXNet (incubating) Python Docker Images

2018-10-22 Thread kellen sunderland
Hey Sergio, I think it's mostly to keep the Dockerfile size down by
matching the system python package.  Of course people can extend the image
and use python 3.6 / 3.7.  I think we should follow this up with an update
to the new Ubuntu LTS version as a base docker image at which point it
would use python 3.6.

On Fri, Oct 19, 2018 at 6:41 AM Sergio Fernández  wrote:

> Python 2.7 reaches end of life at the end of 2019. Take that into account.
>
> About Python 3.x, why not have 3.6 docker images? I know 3.7 is not yet
> supported by MXNet. But starting with 3.5 doesn't make much sense to me.
>
> On Wed, Oct 17, 2018, 11:52 Meghna Baijal 
> wrote:
>
> > Hi All,
> >
> > I am currently in the process of updating the python docker images for
> > Apache MXNet such that they are built on top of the pip binaries.
> > Until now these were built to use python 2.7 but with an upcoming PR I am
> > also adding python 3.5 docker images. I would like to know the
> community’s
> > preference on whether I should keep the *Python 2.7 Docker image as the
> > default or should I move to Python 3.5 as the default version*?
> >
> > [1] The new python2 dockerfiles and build script can be found here.
> > <
> >
> https://github.com/apache/incubator-mxnet/tree/master/docker/docker-python
> > >
> > [2] The PR for python3 images is in progress and is here.
> > 
> >
> > Thanks,
> > Meghna Baijal
> >
>


Re: Include MKLDNN into default mxnet pip package

2018-10-17 Thread kellen sunderland
First of all thanks to Intel for these improvements, really a great effort.

What would the compatibility story look like for users that don't have
these AVX instructions?  Would there be any negative effect for AMD users?
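For anyone wondering whether their own machine advertises these instructions, the CPU feature flags can be read from /proc/cpuinfo on Linux. A minimal sketch (the helper name and parsing are illustrative, not part of MXNet; the flag names such as "avx2" and "avx512f" are the standard Linux cpuinfo flags):

```python
# Hypothetical helper: extract AVX-family feature flags from the text
# of Linux's /proc/cpuinfo. Pass in the file contents as a string.

def cpu_simd_support(cpuinfo_text):
    """Return the set of AVX-family flags advertised by the CPU."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The "flags" line is "flags : fpu avx avx2 ...".
            flags.update(line.split(":", 1)[1].split())
            break
    return {f for f in flags if f.startswith("avx")}

sample = "processor : 0\nflags : fpu avx avx2 avx512f avx512cd sse4_2\n"
print(sorted(cpu_simd_support(sample)))  # ['avx', 'avx2', 'avx512cd', 'avx512f']
```

On a real machine you would call it as `cpu_simd_support(open("/proc/cpuinfo").read())`.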

Regarding TensorRT: It's a possibility but not planned in the short term. A
few considerations would be the limits on PyPi package sizes and the bloat
incurred with TRT, the requirements of TRT to be installed on the user
side, and the TRT engine build times which are non-trivial.  We can work
towards fixing or working around these issues in the future if default TRT
is something the user community would like to see for Cuda packages.  While
the feature is experimental we'll likely continue to use
'mxnet-tensorrt-cu92' and 'mxnet-tensorrt-cu90'.

On Wed, Oct 17, 2018 at 2:12 PM Alfredo Luque
 wrote:

> This is huge. Thanks for working on this. Is there a similar plan with, e.g.,
> tensor-rt support being ported into the main cuda-9.x packages?
>
> On October 17, 2018 at 2:10:20 PM, Alex Zai (aza...@gmail.com) wrote:
>
> Hey all,
> We have been working hard these past few months to integrate and stabilize
> Intel’s MKLDNN deep learning CPU accelerator into Mxnet and have made
> incredible progress. On CPUs with AVX512 instructions (such as c5.18x) we
> have seen performance increase up to 12x and on other platforms (Macs,
> AVX2) we seen a speedup of 1.5+. Full list of benchmarks can be found here
> (
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95650764
> and https://github.com/apache/incubator-mxnet/pull/12591).
>
> Currently, using this accelerator requires the developer to either pip
> install the mxnet-mkl version of mxnet or to build it themselves from
> source. Given that we should try to provide the best performance "out of
> the box" with mxnet we should include this in the default build. The mkldnn
> library is included in the pip package build so it does not require an
> external dependency.
>
> There were concerns that MKLDNN could cause regressions on certain
> platforms (as it did with the tensorflow version a while back); but we
> added an env flag (MXNET_MKLDNN_ENABLED) that allows users to turn off this
> feature during runtime. Please bring up any other concerns you may have and
> your thoughts on including this accelerator in the default build.
>
> Best,
> Alex
>
> —
> Alfredo Luque
> Software Engineer
> Machine Learning Infrastructure
> Airbnb
> San Francisco, CA
>
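The runtime opt-out described above (the MXNET_MKLDNN_ENABLED environment variable) follows a common feature-gating pattern. A minimal sketch of that pattern — the exact values MXNet accepts for the flag may differ, so treat this as an illustration only:

```python
import os

# Illustrative env-var feature gate, in the spirit of MXNET_MKLDNN_ENABLED.
# The accepted "off" spellings here are an assumption for this example.

def feature_enabled(name, default=True):
    """Interpret a 0/1-style environment flag, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None:
        return default  # flag unset: keep the built-in behavior
    return raw.strip().lower() not in ("0", "false", "off")

os.environ["MXNET_MKLDNN_ENABLED"] = "0"
print(feature_enabled("MXNET_MKLDNN_ENABLED"))  # False: user opted out
```

Defaulting to enabled matches the proposal here: the accelerator is on out of the box, and the variable exists purely as an escape hatch against regressions.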


Re: Apache MXNet (incubating) Python Docker Images

2018-10-17 Thread kellen sunderland
This feels like something we should get a little data on before making a
decision, but I also don't have a strong opinion.  I would bias towards
pushing something that might be imperfect and moving on to develop other
improvements for users rather than determining a 'perfect' solution.

The questions/tradeoffs I see are (1) should we support multiple python
versions in the first place, which requires extra work on our part but
supports more users and (2) should we favor forwards or backwards
compatibility, i.e. should we prioritize supporting existing users or
prioritize making something that won't cause future problems for new and
existing users.

The best data I can find with a quick google is the annual Jetbrains survey
which shows python2 went from 47% in 2017 to 25% in 2018:
https://www.jetbrains.com/research/devecosystem-2017/python/
https://www.jetbrains.com/research/devecosystem-2018/python/

So python2 usage is trending sharply down but is not yet low enough to
ignore which I think means we should try and support both on Dockerhub (1).

I don't see backwards compatibility with existing Docker users as a major
concern given these Dockerfiles haven't been supported for a long time.  I
would prioritize forwards compatibility (2) and assume we want to create
something that will remain compatible for as long as possible.

So I would push both python2 and python3 images, but make python 3.5 the
default version, and python2 a version with a postfixed py2 tag in
Dockerhub.

Thanks to Mu (and others?) for originally creating these Dockerhub images, I
used to use them and found them very convenient, and to you Meghna for
updating them.  I think basing them on the pip packages is also a good way
to lower maintenance burden and make sure we leverage the great work Sheng
has done to create those packages.

On Wed, Oct 17, 2018 at 11:52 AM Meghna Baijal 
wrote:

> Hi All,
>
> I am currently in the process of updating the python docker images for
> Apache MXNet such that they are built on top of the pip binaries.
> Until now these were built to use python 2.7 but with an upcoming PR I am
> also adding python 3.5 docker images. I would like to know the community’s
> preference on whether I should keep the *Python 2.7 Docker image as the
> default or should I move to Python 3.5 as the default version*?
>
> [1] The new python2 dockerfiles and build script can be found here.
> <
> https://github.com/apache/incubator-mxnet/tree/master/docker/docker-python
> >
> [2] The PR for python3 images is in progress and is here.
> 
>
> Thanks,
> Meghna Baijal
>


Re: [LAZY VOTE]: rename dockerfiles s/.build.//

2018-10-17 Thread kellen sunderland
May be of interest to people that we're trying get a good set of
production-ready Dockerfiles (which I'm referring to as runtime Dockerfiles
in this thread) with a PR open here:
https://github.com/apache/incubator-mxnet/pull/12791 (thanks for updating
these Meghna).

On Wed, Oct 17, 2018 at 12:00 PM Naveen Swamy  wrote:

> I agree with Kellen on not renaming the CI docker files (by renaming - i
> think it's implicit you can use these for production) I don't think we
> should be telling our users to go use these bloated docker files; you could
> create lean separate docker files for production use-case with only
> necessary runtime packages.
>
> -1
>
> On Wed, Oct 17, 2018 at 11:48 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hey Pedro, sorry I still don't see a good reason to justify changing the
> > filenames.  Renaming them to be less specific isn't going to explain to
> > users what the purpose of the files is, and it could cause breakages with
> > any system that refers to these files, including external companies' CI
> > systems.  If I think of the benefits versus potential errors introduced
> by
> > making the change I see more potential risk than obvious benefits.  I
> also
> > feel that this change will make the difference between the runtime docker
> > files and the CI docker files less clear to users, not more clear.  In
> > general I think adding a descriptive README.md would serve our purpose
> > better here.  Happy to hear what others think.
> >
> > On Wed, Oct 17, 2018 at 6:45 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Hi Kellen, thank you for your response.
> > >
> > > Maybe I didn't explain myself correctly. The purpose of this
> > infrastructure
> > > is not changed.
> > >
> > > I'm not planning to use these Dockerfiles as MXNet docker containers
> for
> > > users to run MXNet, that is a separate concern.
> > >
> > > It is just that some of these Dockerfiles are used in CI to build, test
> and
> > > generate documentation, so are used as a runtime container as well.
> Thus
> > > i'm just changing the pathing for semantic reasons and remove the
> .build.
> > > which is just noise.
> > >
> > > As an example I would like to explain that we are about to merge the PR
> > > which uses QEMU to run the unit tests, so there's an associated
> > Dockerfile
> > > which hosts the QEMU runtime environment used to execute the unit tests
> > in
> > > an ARM-emulated machine. Thus it makes little sense that these Dockerfiles
> > are
> > > called "build".  I don't know if my explanation changes your vote.
> Either
> > > way please let me know. Separating this change in a different PR was
> > > suggested by several MXNet contributors during review.
> > >
> > > Pedro.
> > >
> > > On Wed, Oct 17, 2018 at 11:21 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > -1. (non-binding)
> > > >
> > > > These Dockerfiles are very bloated and imo only useful for creating a
> > > build
> > > > environment or running tests.  Just as you wouldn't setup a server
> for
> > a
> > > > service and then install 200 packages that may or may not be used for
> > the
> > > > service I wouldn't recommend using these Dockerfiles at runtime.
> > Runtime
> > > > Dockerfiles should in my opinion be as lightweight and suited to
> their
> > > task
> > > > as possible.
> > > >
> > > > On Wed, Oct 17, 2018, 1:58 AM Hagay Lupesko 
> wrote:
> > > >
> > > > > The PR provides a good explanation of this change and all code
> > updates.
> > > > > LGTM.
> > > > >
> > > > > On Tue, Oct 16, 2018 at 8:41 AM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > I would like to rename the dockerfiles since they are used as a
> > > runtime
> > > > > > environment and not only as build as they were initially
> intended.
> > > > > >
> > > > > > More info about the change in this PR:
> > > > > > https://github.com/apache/incubator-mxnet/pull/12423/files
> > > > > >
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [LAZY VOTE]: rename dockerfiles s/.build.//

2018-10-17 Thread kellen sunderland
Hey Pedro, sorry I still don't see a good reason to justify changing the
filenames.  Renaming them to be less specific isn't going to explain to
users what the purpose of the files is, and it could cause breakages with
any system that refers to these files, including external companies' CI
systems.  If I weigh the benefits against the potential errors introduced by
making the change I see more potential risk than obvious benefit.  I also
feel that this change will make the difference between the runtime docker
files and the CI docker files less clear to users, not more clear.  In
general I think adding a descriptive README.md would serve our purpose
better here.  Happy to hear what others think.

On Wed, Oct 17, 2018 at 6:45 AM Pedro Larroy 
wrote:

> Hi Kellen, thank you for your response.
>
> Maybe I didn't explain myself correctly. The purpose of this infrastructure
> is not changed.
>
> I'm not planning to use these Dockerfiles as MXNet docker containers for
> users to run MXNet, that is a separate concern.
>
> It is just that some of these Dockerfiles are used in CI to build, test and
> generate documentation, so they serve as runtime containers as well. Thus
> I'm just changing the pathing for semantic reasons and removing the .build.
> which is just noise.
>
> As an example I would like to explain that we are about to merge the PR
> which uses QEMU to run the unit tests, so there's an associated Dockerfile
> which hosts the QEMU runtime environment used to execute the unit tests in
> an ARM-emulated machine. Thus it makes little sense that these Dockerfiles are
> called "build".  I don't know if my explanation changes your vote. Either
> way please let me know. Separating this change in a different PR was
> suggested by several MXNet contributors during review.
>
> Pedro.
>
> On Wed, Oct 17, 2018 at 11:21 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > -1. (non-binding)
> >
> > These Dockerfiles are very bloated and imo only useful for creating a
> build
> > environment or running tests.  Just as you wouldn't setup a server for a
> > service and then install 200 packages that may or may not be used for the
> > service I wouldn't recommend using these Dockerfiles at runtime.  Runtime
> > Dockerfiles should in my opinion be as lightweight and suited to their
> task
> > as possible.
> >
> > On Wed, Oct 17, 2018, 1:58 AM Hagay Lupesko  wrote:
> >
> > > The PR provides a good explanation of this change and all code updates.
> > > LGTM.
> > >
> > > On Tue, Oct 16, 2018 at 8:41 AM Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I would like to rename the dockerfiles since they are used as a
> runtime
> > > > environment and not only as build as they were initially intended.
> > > >
> > > > More info about the change in this PR:
> > > > https://github.com/apache/incubator-mxnet/pull/12423/files
> > > >
> > > >
> > > > Pedro.
> > > >
> > >
> >
>


Re: [LAZY VOTE]: rename dockerfiles s/.build.//

2018-10-17 Thread kellen sunderland
-1. (non-binding)

These Dockerfiles are very bloated and imo only useful for creating a build
environment or running tests.  Just as you wouldn't set up a server for a
service and then install 200 packages that may or may not be used for the
service, I wouldn't recommend using these Dockerfiles at runtime.  Runtime
Dockerfiles should in my opinion be as lightweight and suited to their task
as possible.

On Wed, Oct 17, 2018, 1:58 AM Hagay Lupesko  wrote:

> The PR provides a good explanation of this change and all code updates.
> LGTM.
>
> On Tue, Oct 16, 2018 at 8:41 AM Pedro Larroy  >
> wrote:
>
> > Hi
> >
> > I would like to rename the dockerfiles since they are used as a runtime
> > environment and not only as build as they were initially intended.
> >
> > More info about the change in this PR:
> > https://github.com/apache/incubator-mxnet/pull/12423/files
> >
> >
> > Pedro.
> >
>


Re: MXNet - Label Bot functionality

2018-10-12 Thread kellen sunderland
Awesome work!  Many thanks.

On Fri, Oct 12, 2018, 1:19 AM Harsh Patel 
wrote:

> Hey,
> I am looking to contribute to MXNet. I have a working implementation based
> on my proposed design structure according to this wiki page (
>
> https://cwiki.apache.org/confluence/display/MXNET/Machine+Learning+Based+GitHub+Bot
> )
> - under 7.
> I have provided users with functionality allowing for adding, updating, and
> deleting labels for our issues. The bot's response time in providing the
> aforementioned functionality has been reduced as well. Users should
> expect speedy updates to labels based on requests made to the label bot.
> The total number of GitHub API calls has been further reduced as well,
> preventing potential bottlenecks that could result from GitHub.
> I would like to have a webhook for this repo:
> https://github.com/apache/incubator-mxnet so that this functionality will
> be used and tested by the developers here. Thanks.
>
> Best,
> -Harsh Patel
>


Call for contributions - CI Runtime Improvements

2018-10-10 Thread kellen sunderland
Hello MXNet Community,

Some community members recently held an offline brainstorming session focused
on how to speed up CI builds and test runs.  I've summarized some of that
offline discussion, but we'd like to call out that we're also open to new
ideas from the community.  If others have speedup approaches they feel would
help CI, feel free to suggest them in this thread, or edit the doc directly.

https://cwiki.apache.org/confluence/display/MXNET/CI+Runtime+Improvements

We've also come to realize that there's a variety of different approaches
that can be taken to speed up CI, and that many of the approaches can be
developed in parallel.  If anyone in the community wants to help us in
implementing those approaches they're more than welcome to participate.

-Kellen


Re: [Discussion] Separating PMC and Committership

2018-10-10 Thread kellen sunderland
I think it makes a lot of sense to separate these roles Haibin.  My
impression is there's a high degree of knowledge and experience required to
make strategic design decisions on the project.  There's a bunch of core
members of the team that have that knowledge, and I feel there's a bit of
an unwritten rule at the moment within the community that we defer to
their judgement for important decisions.

Given this I think it makes sense to have some tiered membership.  This
gives some opportunities for contributors to be recognized, and would allow
them to help out with some of the day-to-day tasks that PPMC members
wouldn't have time for.

When it comes to responsibilities, one high-level suggestion I'd make is
that core members retain decision-making authority for the 'big' decisions
where experience is required.  Anything controversial, or anything that has
wide impact on the project, should be signed off on by a PPMC member.
Committers could then potentially be free to work on anything that doesn't
fall in this category.  As an example, a committer could help update code to
follow existing style guides, and a PPMC member would be required to sign
off on new guides.

On Wed, Oct 10, 2018 at 6:44 AM Isabel Drost-Fromm 
wrote:

>
>
> Am 10. Oktober 2018 04:31:49 MESZ schrieb Chris Olivier <
> cjolivie...@gmail.com>:
> >is it convenient to define the difference and the rights and privileges
> >of
> >each? write access, private list, voting and veto power, etc?
>
> +1 - also, likely it would make sense to also list the responsibilities of
> each.
>
> Isabel
>
> --
> Diese Nachricht wurde von meinem Android-Gerät mit K-9 Mail gesendet.
>


Re: [Discuss] Next MXNet release

2018-10-09 Thread kellen sunderland
Hey Steffen,

Recommend these be merged into patch release:

https://github.com/apache/incubator-mxnet/pull/12631
https://github.com/apache/incubator-mxnet/pull/12603
https://github.com/apache/incubator-mxnet/pull/12499

-Kellen

On Tue, Oct 2, 2018 at 7:17 AM Zhao, Patric  wrote:

> Thanks to let us know this discussion.
> Because we don't have enough bandwidth to track the different sources,
> like discussion forum.
>
> I think the best way is to open issue in the github so that we can
> answer/solve the issue in time :)
>
> Thanks,
>
> --Patric
>
> > -Original Message-
> > From: Afrooze, Sina [mailto:sina@gmail.com]
> > Sent: Tuesday, October 2, 2018 1:14 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: Ye, Jason Y ; Zai, Alexander
> > ; Zheng, Da 
> > Subject: Re: [Discuss] Next MXNet release
> >
> > This post suggests there is a regression from 1.1.0 to 1.2.1 related to
> > MKLDNN integration: https://discuss.mxnet.io/t/mxnet-1-2-1-module-get-
> > outputs/1882
> >
> > The error is related to MKLDNN layout not being converted back to MXNet
> > layout in some operator: " !IsMKLDNNData() We can’t generate TBlob for
> > MKLDNN data. Please use Reorder2Default() to generate a new NDArray
> > first"
> >
> > Sina
> >
> >
> >
> >
> > On 9/30/18, 6:55 PM, "Steffen Rochel"  wrote:
> >
> > Thanks Patrick.
> > Updated roadmap and next release content.
> >
> > Patrick - suggest to send a reminder to review the design doc and
> collect
> > feedback.
> > Are there still known issues or gaps before we declare MKL-DNN
> > integration
> > as GA?
> >
> > Regards,
> > Steffen
> >
> > On Sat, Sep 29, 2018 at 1:31 AM Zhao, Patric 
> > wrote:
> >
> > > Thanks, Steffen.
> > >
> > > Regarding the next release note, two items from our side:
> > >
> > > 1. (-remove) MKL-DNN integration is done. I think we can remove
> this
> > item.
> > > 2. (+add) MKL-DNN based graph optimization and quantization by
> > subgraph
> > > Design doc:
> > >
> > https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimiz
> > ation+and+Quantization+based+on+subgraph+and+MKL-DNN
> > > Lead Contributor: Patric Zhao,
> https://github.com/pengzhao-intel/
> > >
> > > Regarding the Roadmap
> > > (+add) Q1 2019: MKL-DNN RNN API supports
> > >
> > > BR,
> > >
> > > Thanks,
> > >
> > > --Patric
> > >
> > >
> > > > -Original Message-
> > > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > > Sent: Saturday, September 29, 2018 11:31 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Subject: Re: [Discuss] Next MXNet release
> > > >
> > > > Sorry I meant to say next 'Regarding the *minor* release'.
> > > >
> > > > On Sat, Sep 29, 2018 at 5:27 AM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Thanks for transparently setting a rough timeline Steffen.  I
> think
> > > > > this will go a long way in helping the community plan their
> work, even
> > > > > if the details change somewhat on the road to the release.
> > > > >
> > > > > Regarding the major release: I would propose we unify TensorRT
> with
> > > > > the subgraph operator work.
> > > > >
> > > > > Regarding the patch release:  There were a few minor
> stack/buffer
> > > > > overflows exposed by ASAN that have been addressed.  It's
> probably
> > a
> > > > > good idea to include them in a patch release, as they at best
> result
> > > > > in non-deterministic behaviour.
> > > > >
> > > > > -Kellen
> > > > >
> > > > >
> > > > > On Sat, Sep 29, 2018 at 1:39 AM Steffen Rochel
> > > > > 
> > > > > wrote:
> > > > >
> > > > >> I updated
> > > > >>
> > > > >>
> > https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+f
> > > > >> or+next+MXNet+Releas

Re: CUDNN algorithm selection failure

2018-10-04 Thread kellen sunderland
"I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't
reproduce the issue."

One thing to keep in mind is that the SelectAlgo call will cache results in
a registry that is in static scope.  To repro you'd likely have to create a
new process each time you run the test.  (Apologies if this is already how
you're reproducing).

SelectAlgo call:
https://github.com/apache/incubator-mxnet/blob/403831ace46eab4447794df9411351e439e8983e/src/operator/nn/cudnn/cudnn_convolution-inl.h#L609

Static local / singleton registry pattern here:
https://github.com/apache/incubator-mxnet/blob/024b5a916dd3a39a39031ce5e6565cd7d9d60fe2/src/operator/nn/cudnn/cudnn_algoreg.cc#L37

On Thu, Oct 4, 2018 at 8:58 PM Marco de Abreu
 wrote:

> For GPU, we don't run any tests in parallel.
>
> -Marco
>
> Naveen Swamy  schrieb am Do., 4. Okt. 2018, 19:54:
>
> > Looking at the error raised, you can see that the workspace size (GPU mem
> > size) of 1GB isn't sufficient. I am wondering if it is due to tests running
> > in parallel on CI; if this is true (tests running in parallel), is it
> > possible to reduce the parallelism?
> > Error:
> > "mxnet.base.MXNetError: [05:40:12]
> > src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any
> > forward convolution algorithm.  with workspace size of 1073741824 bytes,
> > please consider reducing batch/model size or increasing the workspace
> size"
> >
> > I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't
> > reproduce the issue. I will look into it further to see if there are
> other
> > alternatives.
> >
> >
> > On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai 
> wrote:
> >
> > > Another build where test_slice_batchnorm_reshape_batchnorm fails :
> > >
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
> > > <
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
> > > >
> > >
> > > —
> > > Piyush
> > >
> > > > On Oct 3, 2018, at 9:32 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > > wrote:
> > > >
> > > > Seems it's not the only test:
> > > >
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
> > > >
> > > > test_slice_batchnorm_reshape_batchnorm is also failing and hasn't
> been
> > > > touched for a while. It doesn't look like a problem with the test to
> > me,
> > > > (not a flaky test). Looks to me that we should find and address the root
> > > cause
> > > > instead of disabling the test in this case.
> > > >
> > > > Pedro.
> > > >
> > > > On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
> > > >  wrote:
> > > >
> > > >> I have created an issue at
> > > >> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to
> > > disable
> > > >> the test at https://github.com/apache/incubator-mxnet/pull/12716.
> > > >>
> > > >> This test is pretty new and was submitted with a number of other
> > > >> problematic (and disabled) tests:
> > > >> https://github.com/apache/incubator-mxnet/issues/11164 It could be
> > > >> possible
> > > >> that the test is simply not stable enough. The PR that introduced
> that
> > > test
> > > >> is https://github.com/apache/incubator-mxnet/pull/10921 - it was
> > merged
> > > >> two
> > > >> days ago.
> > > >>
> > > >> Best regards,
> > > >> Marco
> > > >>
> > > >> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Thanks for checking Lin. If it happens again we will have to dig
> > > deeper.
> > > >> We
> > > >>> have just one executor in GPU so I wonder what could be the root
> > cause
> > > of
> > > >>> this.
> > > >>>
> > > >>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan 
> > wrote:
> > > >>>
> > >  I could not reproduce the error on an EC2 g3x8 instance making it
> > hard
> > > >> to
> > >  debug. I also suspect it was due to resource usage limit on ci
> > > >>> Instance.
> > > 
> > >  On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <
> > > >>> pedro.larroy.li...@gmail.com
> > > >
> > >  wrote:
> > > 
> > > > It doesn't look like flakiness to me at first sight. I think it
> > might
> > > >>> be
> > > > related to resource usage / allocation / leak in the worst case.
> > > >
> > > > Could be that there was not enough memory GPU memory at the time
> of
> > > >>> test
> > > > execution. But I'm just speculating, hence my original question.
> > > >
> > > > Pedro.
> > > >
> > > > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan 
> > wrote:
> > > >
> > > >> Hi Pedro,
> > > >>
> > > >> I also got this failure in my PR
> > > >>
> > > >>
> > > >
> > > 
> > > >>>
> > > >>
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > >>
> > > >> I was not able to identify the root cause of it from 

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Well I'd propose we get clarification from Travis before bringing the issue up
with infra.  No point debating something with infra or amongst ourselves if
it's not possible.

Orthogonal to the paid account option, let's merge this speedup to unblock
Intel.

On Oct 2, 2018 4:37 AM, "Marco de Abreu"
 wrote:

I think the timeout and other limitations have been employed by Apache
Infra and not by Travis. They didn't say that specifically, but they
already made me aware that we might get further restrictions if we consume
too many resources.


kellen sunderland  schrieb am Di., 2. Okt.
2018, 04:34:


> Still worth following up with Travis (I've already messaged them).
They're
> in the middle of reorganizing their business model and merging paid and
> free accounts into the same service, so maybe this policy is changing.  It
> doesn't make a lot of sense to me that public repo accounts would have
> timeout limits that are different to private repo accounts in cases where
> they are both paid.
>
> On Tue, Oct 2, 2018, 4:27 AM Marco de Abreu
>  wrote:
>
> > Apache has it's own shared Travis fleet. We are basically using an
> > on-premise version of the paid Travis plan. That was the information I
> got
> > from Infra when I had a chat with them a few days ago. But from that
> > conversation it was made pretty clear that we cannot increase the
limits.
> >
> > -Marco
> >
> > kellen sunderland  schrieb am Di., 2. Okt.
> > 2018, 03:25:
> >
> > > Interesting, this page seems to indicate that private projects do have
> a
> > > longer time out.  I'll drop Travis a quick email and see what the deal
> > > would be for our project.
> > > https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.
> > >
> > > On Tue, Oct 2, 2018, 3:15 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com>
> > > wrote:
> > >
> > > > I actually thought we were already using a paid plan through Apache
> > > >
> https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
> > > >
> > > > On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
> > > >
> > > >> Are we currently on a free plan? If we are, probably the unlimited
> > build
> > > >> minutes would help
> > > >>
> > > >> Thanks,
> > > >> Qing
> > > >>
> > > >> On 10/1/18, 6:08 PM, "kellen sunderland" <
> > kellen.sunderl...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> Does the global time out change for paid plans?  I looked into
> it
> > > >> briefly
> > > >> but didn't see anything that would indicate it does.
> > > >>
> > > >> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> > > >> pedro.larroy.li...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > I think there's two approaches that we can take to mitigate
> the
> > > >> build &
> > > >> > test time problem, in one hand use a paid travis CI plan, in
> > other
> > > >> improve
> > > >> > the unit tests in suites and only run a core set of tests, as
> we
> > > >> should do
> > > >> > on devices, but on this case we reduce coverage.
> > > >> >
> > > >> > https://travis-ci.com/plans
> > > >> >
> > > >> > Pedro.
> > > >> >
> > > >> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu <
> eazhi@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > > This makes sense. Thanks
> > > >> > >
> > > >> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > > >> > > kellen.sunderl...@gmail.com> wrote:
> > > >> > >
> > > >> > > > Hey Zhennan, yes this is the exact problem, and I agree
> with
> > > >> your
> > > >> > points
> > > >> > > > completely.  This is why when we first added Travis we
> > > >> attempted to
> > > >> > > > communicate that it would be informational only, and that
> > we'd
> > > >> need to
> > > >> > > > iterate on the config before it would be a test that
> people
> > > >> should
> > > >> > > c

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Still worth following up with Travis (I've already messaged them).  They're
in the middle of reorganizing their business model and merging paid and
free accounts into the same service, so maybe this policy is changing.  It
doesn't make a lot of sense to me that public repo accounts would have
timeout limits that are different to private repo accounts in cases where
they are both paid.

On Tue, Oct 2, 2018, 4:27 AM Marco de Abreu
 wrote:

> Apache has it's own shared Travis fleet. We are basically using an
> on-premise version of the paid Travis plan. That was the information I got
> from Infra when I had a chat with them a few days ago. But from that
> conversation it was made pretty clear that we cannot increase the limits.
>
> -Marco
>
> kellen sunderland  schrieb am Di., 2. Okt.
> 2018, 03:25:
>
> > Interesting, this page seems to indicate that private projects do have a
> > longer time out.  I'll drop Travis a quick email and see what the deal
> > would be for our project.
> > https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.
> >
> > On Tue, Oct 2, 2018, 3:15 AM kellen sunderland <
> > kellen.sunderl...@gmail.com>
> > wrote:
> >
> > > I actually thought we were already using a paid plan through Apache
> > > https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
> > >
> > > On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
> > >
> > >> Are we currently on a free plan? If we are, probably the unlimited
> build
> > >> minutes would help
> > >>
> > >> Thanks,
> > >> Qing
> > >>
> > >> On 10/1/18, 6:08 PM, "kellen sunderland" <
> kellen.sunderl...@gmail.com>
> > >> wrote:
> > >>
> > >> Does the global time out change for paid plans?  I looked into it
> > >> briefly
> > >> but didn't see anything that would indicate it does.
> > >>
> > >> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> > >> pedro.larroy.li...@gmail.com>
> > >> wrote:
> > >>
> > >> > I think there's two approaches that we can take to mitigate the
> > >> build &
> > >> > test time problem, in one hand use a paid travis CI plan, in
> other
> > >> improve
> > >> > the unit tests in suites and only run a core set of tests, as we
> > >> should do
> > >> > on devices, but on this case we reduce coverage.
> > >> >
> > >> > https://travis-ci.com/plans
> > >> >
> > >> > Pedro.
> > >> >
> > >> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
> > >> wrote:
> > >> >
> > >> > > This makes sense. Thanks
> > >> > >
> > >> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > >> > > kellen.sunderl...@gmail.com> wrote:
> > >> > >
> > >> > > > Hey Zhennan, yes this is the exact problem, and I agree with
> > >> your
> > >> > points
> > >> > > > completely.  This is why when we first added Travis we
> > >> attempted to
> > >> > > > communicate that it would be informational only, and that
> we'd
> > >> need to
> > >> > > > iterate on the config before it would be a test that people
> > >> should
> > >> > > consider
> > >> > > > 'required'.  Apologies, we should have been more
> > >> straightforward about
> > >> > > > those tradeoffs.  The strong point in favour of adding
> Travis
> > in
> > >> > > > informational mode was that we had a serious MacOS specific
> > bug
> > >> that we
> > >> > > > wanted to verify was fixed.
> > >> > > >
> > >> > > > The good news is I've opened a PR which I hope will speed up
> > >> these
> > >> > builds
> > >> > > > to the point that they won't rely on caching.  Once it is
> > >> merged it
> > >> > would
> > >> > > > be very helpful if you could rebase on this PR and test to
> > >> ensure that
> > >> > > > large changes no longer hit the global timeout without
> cache.
> > >> > > > https://github.com/

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Interesting, this page seems to indicate that private projects do have a
longer time out.  I'll drop Travis a quick email and see what the deal
would be for our project.
https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.

On Tue, Oct 2, 2018, 3:15 AM kellen sunderland 
wrote:

> I actually thought we were already using a paid plan through Apache
> https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
>
> On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
>
>> Are we currently on a free plan? If we are, probably the unlimited build
>> minutes would help
>>
>> Thanks,
>> Qing
>>
>> On 10/1/18, 6:08 PM, "kellen sunderland" 
>> wrote:
>>
>> Does the global time out change for paid plans?  I looked into it
>> briefly
>> but didn't see anything that would indicate it does.
>>
>> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> wrote:
>>
>> > I think there's two approaches that we can take to mitigate the
>> build &
>> > test time problem, in one hand use a paid travis CI plan, in other
>> improve
>> > the unit tests in suites and only run a core set of tests, as we
>> should do
>> > on devices, but on this case we reduce coverage.
>> >
>> > https://travis-ci.com/plans
>> >
>> > Pedro.
>> >
>> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
>> wrote:
>> >
>> > > This makes sense. Thanks
>> > >
>> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
>> > > kellen.sunderl...@gmail.com> wrote:
>> > >
>> > > > Hey Zhennan, yes this is the exact problem, and I agree with
>> your
>> > points
>> > > > completely.  This is why when we first added Travis we
>> attempted to
>> > > > communicate that it would be informational only, and that we'd
>> need to
>> > > > iterate on the config before it would be a test that people
>> should
>> > > consider
>> > > > 'required'.  Apologies, we should have been more
>> straightforward about
>> > > > those tradeoffs.  The strong point in favour of adding Travis in
>> > > > informational mode was that we had a serious MacOS specific bug
>> that we
>> > > > wanted to verify was fixed.
>> > > >
>> > > > The good news is I've opened a PR which I hope will speed up
>> these
>> > builds
>> > > > to the point that they won't rely on caching.  Once it is
>> merged it
>> > would
>> > > > be very helpful if you could rebase on this PR and test to
>> ensure that
>> > > > large changes no longer hit the global timeout without cache.
>> > > > https://github.com/apache/incubator-mxnet/pull/12706
>> > > >
>> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
>> zhennan@intel.com>
>> > > > wrote:
>> > > >
>> > > > > Hi YiZhi and Kellen,
>> > > > >
>> > > > > From my point of view, travis should be able to get passed
>> from a
>> > > scratch
>> > > > > build. Pending result on ccache hit/miss is not a good idea.
>> For this
>> > > PR,
>> > > > > as it changed many header file, lots of files need be
>> recompiled,
>> > just
>> > > > like
>> > > > > a scratch build. I think that's the reason that travis
>> timeout. This
>> > > > should
>> > > > > be fixed before enabling travis, as it will block any change
>> to those
>> > > > base
>> > > > > header file. Again, it's not a special case with this PR
>> only, you
>> > can
>> > > > find
>> > > > > same problem on other PRs:
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> ht

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
I actually thought we were already using a paid plan through Apache
https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci

On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:

> Are we currently on a free plan? If we are, probably the unlimited build
> minutes would help
>
> Thanks,
> Qing
>
> On 10/1/18, 6:08 PM, "kellen sunderland" 
> wrote:
>
> Does the global time out change for paid plans?  I looked into it
> briefly
> but didn't see anything that would indicate it does.
>
> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I think there's two approaches that we can take to mitigate the
> build &
> > test time problem, in one hand use a paid travis CI plan, in other
> improve
> > the unit tests in suites and only run a core set of tests, as we
> should do
> > on devices, but on this case we reduce coverage.
> >
> > https://travis-ci.com/plans
> >
> > Pedro.
> >
> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
> wrote:
> >
> > > This makes sense. Thanks
> > >
> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hey Zhennan, yes this is the exact problem, and I agree with your
> > points
> > > > completely.  This is why when we first added Travis we attempted
> to
> > > > communicate that it would be informational only, and that we'd
> need to
> > > > iterate on the config before it would be a test that people
> should
> > > consider
> > > > 'required'.  Apologies, we should have been more straightforward
> about
> > > > those tradeoffs.  The strong point in favour of adding Travis in
> > > > informational mode was that we had a serious MacOS specific bug
> that we
> > > > wanted to verify was fixed.
> > > >
> > > > The good news is I've opened a PR which I hope will speed up
> these
> > builds
> > > > to the point that they won't rely on caching.  Once it is merged
> it
> > would
> > > > be very helpful if you could rebase on this PR and test to
> ensure that
> > > > large changes no longer hit the global timeout without cache.
> > > > https://github.com/apache/incubator-mxnet/pull/12706
> > > >
> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
> zhennan@intel.com>
> > > > wrote:
> > > >
> > > > > Hi YiZhi and Kellen,
> > > > >
> > > > > From my point of view, travis should be able to get passed
> from a
> > > scratch
> > > > > build. Pending result on ccache hit/miss is not a good idea.
> For this
> > > PR,
> > > > > as it changed many header file, lots of files need be
> recompiled,
> > just
> > > > like
> > > > > a scratch build. I think that's the reason that travis
> timeout. This
> > > > should
> > > > > be fixed before enabling travis, as it will block any change
> to those
> > > > base
> > > > > header file. Again, it's not a special case with this PR only,
> you
> > can
> > > > find
> > > > > same problem on other PRs:
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
> > > > >
> > > > >
> > > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
>     > > > >
> > > > >
> > > > > Thanks,
> > > > > Zhennan
> > > > >
> > > > > -Original Message-
> > > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> > > > > Sent: Sunday, September 30, 2018 5:15 AM
> > > > > To: eazhi@gmail.com
> > > > > Cc: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: Time out for Travis CI
> > > > >
> > > > > while other PRs are all good.
> > > > >

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Does the global time out change for paid plans?  I looked into it briefly
but didn't see anything that would indicate it does.

On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy 
wrote:

> I think there are two approaches that we can take to mitigate the build &
> test time problem: on one hand use a paid Travis CI plan, on the other
> improve the unit tests in suites and only run a core set of tests, as we
> should do on devices, but in this case we reduce coverage.
>
> https://travis-ci.com/plans
>
> Pedro.
>
> On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu  wrote:
>
> > This makes sense. Thanks
> >
> > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey Zhennan, yes this is the exact problem, and I agree with your
> points
> > > completely.  This is why when we first added Travis we attempted to
> > > communicate that it would be informational only, and that we'd need to
> > > iterate on the config before it would be a test that people should
> > consider
> > > 'required'.  Apologies, we should have been more straightforward about
> > > those tradeoffs.  The strong point in favour of adding Travis in
> > > informational mode was that we had a serious MacOS specific bug that we
> > > wanted to verify was fixed.
> > >
> > > The good news is I've opened a PR which I hope will speed up these
> builds
> > > to the point that they won't rely on caching.  Once it is merged it
> would
> > > be very helpful if you could rebase on this PR and test to ensure that
> > > large changes no longer hit the global timeout without cache.
> > > https://github.com/apache/incubator-mxnet/pull/12706
> > >
> > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan 
> > > wrote:
> > >
> > > > Hi YiZhi and Kellen,
> > > >
> > > > From my point of view, travis should be able to get passed from a
> > > > scratch build. Pending result on ccache hit/miss is not a good idea.
> > > > For this PR, as it changed many header files, lots of files need to be
> > > > recompiled, just like a scratch build. I think that's the reason that
> > > > travis times out. This should be fixed before enabling travis, as it
> > > > will block any change to those base header files. Again, it's not a
> > > > special case with this PR only, you can find the same problem on other
> > > > PRs:
> > > >
> > > > https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status&utm_medium=notification
> > > >
> > > > https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status&utm_medium=notification
> > > >
> > > >
> > > > Thanks,
> > > > Zhennan
> > > >
> > > > -Original Message-
> > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> > > > Sent: Sunday, September 30, 2018 5:15 AM
> > > > To: eazhi@gmail.com
> > > > Cc: dev@mxnet.incubator.apache.org
> > > > Subject: Re: Time out for Travis CI
> > > >
> > > > while other PRs are all good.
> > > > On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu 
> wrote:
> > > > >
> > > > > Honestly I don't know yet. I can help to investigate. Just given the
> > > > > evidence that, travis times out every time it gets re-triggered - 2
> > > > > times at least. Correct me if I'm wrong @ Zhennan. On Sat, Sep 29,
> > > > > 2018 at 1:54 PM kellen sunderland  wrote:
> > > > > >
> > > > > > Reading over the PR I don't see what aspects would cause extra
> > > > > > runtime YiZhi, could you point them out?
> > > > > >
> > > > > > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu 
> > > wrote:
> > > > > >
> > > > > > > Kellen, I think this PR introduces extra runtime in CI, thus
> > > > > > > causes the timeout. Which means, once merged, every PR later will
> > > > > > > see the same timeout in travis.
> > > > > > >
> > > > > > > So shall we modify the changes to decrease the test running time?
> > > > > > > or just disable the Travis CI?

Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-30 Thread kellen sunderland
About 60 but they're all addressed In the ref PR.

On Sun, Sep 30, 2018, 6:12 AM Chris Olivier  wrote:

> How many errors exist in the code base right now if it were to be enabled?
>
> On Sat, Sep 29, 2018 at 7:03 PM Naveen Swamy  wrote:
>
> > Thanks Kellen & Anton, for your detailed explanation and links to
> > advantages, appreciate it.
> > changing my vote to *-0*, I suggest to show as warnings.
> >
> > On Sat, Sep 29, 2018 at 8:06 PM Anton Chernov 
> wrote:
> >
> > > And if you want a more authoritative opinion on that check out what the
> > C++
> > > core guidelines are saying [1]:
> > >
> > > > ES.71: Prefer a range-for-statement to a for-statement when there is
> a
> > > choice
> > > > Reason
> > > > Readability. Error prevention. Efficiency.
> > >
> > > Best regards
> > > Anton
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Res-for-range
> > >
> > >
> > > Sat, Sep 29, 2018 at 16:13, Anton Chernov :
> > >
> > > > +1
> > > >
> > > > Maybe it's not necessary to enforce usage of range-based for, but I
> > > > would highly encourage to do it due to the already named advantages. If
> > > > code would be introduced using the old style there could be a comment
> > > > suggesting the new way. But why do the manual work and not leave that
> > > > to the automated tool?
> > > >
> > > > And since it's already automated - wouldn't it be better to keep a
> > > > unified modern style?
> > > >
> > > > Just to make this a trend - C++ evolves quickly and this will not be
> > > > the only upgrade that needs to be made. And the easier such upgrades
> > > > get accepted, the easier it is in general to upgrade the codebase.
> > > >
> > > > Soon the standard will get ranges and concepts and this will change the
> > > > way C++ applications get written significantly. It is a good habit to
> > > > be open to changes and keep up with the trends. By using the new
> > > > possibilities the language can offer, you prepare yourself for further
> > > > changes and are more likely to accept them, evolving your programming
> > > > style.
> > > >
> > > > Take a look at new examples of modern usage (taken from [1]):
> > > >
> > > > // since C++17
> > > > for (auto&& [first,second] : mymap) {
> > > > // use first and second
> > > > }
> > > >
> > > > // since C++20
> > > > for (auto& x : foo().items()) { /* .. */ } // undefined behavior if
> > foo()
> > > > returns by value
> > > > for (T thing = foo(); auto& x : thing.items()) { /* ... */ } // OK
> > > >
> > > > // since C++11
> > > > struct cow_string { /* ... */ };  // a copy-on-write string
> > > > cow_string str = /* ... */;
> > > > // for (auto x : str) { /* ... */ }  // may cause deep copy
> > > > for (auto x : std::as_const(str)) { /* ... */ }
> > > >
> > > > Regarding performance: it's really easy to prove that generated
> > > > assembly is not changing at all. There is a really handy tool for that
> > > > [2]. You can check online the assembly for different language
> > > > constructs and different compilers.
> > > >
> > > > Best regards,
> > > > Anton
> > > >
> > > > [1] https://en.cppreference.com/w/cpp/language/range-for
> > > > [2] https://gcc.godbolt.org
> > > >
> > > > Sat, Sep 29, 2018 at 13:15, kellen sunderland <
> > > > kellen.sunderl...@gmail.com>:
> > > >
> > > >> It's more readable because it's concise and it's consistent for many
> > > >> types you're looping over (i.e. primitive arrays, stl iterators, etc
> > > >> all work the same way).  It's also useful because it's consistent with
> > > >> other programming languages, making C++ codebases much easier to read
> > > >> for novice and intermediate developers.  IMO it also leads to better
> > > >> naming in loop bodies as the concise s

Re: Time out for Travis CI

2018-09-29 Thread kellen sunderland
Hey Zhennan, yes this is the exact problem, and I agree with your points
completely.  This is why when we first added Travis we attempted to
communicate that it would be informational only, and that we'd need to
iterate on the config before it would be a test that people should consider
'required'.  Apologies, we should have been more straightforward about
those tradeoffs.  The strong point in favour of adding Travis in
informational mode was that we had a serious MacOS specific bug that we
wanted to verify was fixed.

The good news is I've opened a PR which I hope will speed up these builds
to the point that they won't rely on caching.  Once it is merged it would
be very helpful if you could rebase on this PR and test to ensure that
large changes no longer hit the global timeout without cache.
https://github.com/apache/incubator-mxnet/pull/12706

On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan  wrote:

> Hi YiZhi and Kellen,
>
> From my point of view, travis should be able to get passed from a scratch
> build. Pending result on ccache hit/miss is not a good idea. For this PR,
> as it changed many header files, lots of files need to be recompiled, just
> like a scratch build. I think that's the reason that travis times out. This
> should be fixed before enabling travis, as it will block any change to those
> base header files. Again, it's not a special case with this PR only, you can
> find the same problem on other PRs:
>
>
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status&utm_medium=notification
>
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status&utm_medium=notification
>
>
> Thanks,
> Zhennan
>
> -Original Message-
> From: YiZhi Liu [mailto:eazhi@gmail.com]
> Sent: Sunday, September 30, 2018 5:15 AM
> To: eazhi@gmail.com
> Cc: dev@mxnet.incubator.apache.org
> Subject: Re: Time out for Travis CI
>
> while other PRs are all good.
> On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu  wrote:
> >
> > Honestly I don't know yet. I can help to investigate. Just given the
> > evidence that, travis timeout every time it gets re-triggered - 2
> > times at least. Correct me if I'm wrong @ Zhennan On Sat, Sep 29, 2018
> > at 1:54 PM kellen sunderland  wrote:
> > >
> > > Reading over the PR I don't see what aspects would cause extra
> > > runtime YiZhi, could you point them out?
> > >
> > > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu  wrote:
> > >
> > > > Kellen, I think this PR introduces extra runtime in CI, thus
> > > > causes the timeout. Which means, once merged, every PR later will
> > > > see same timeout in travis.
> > > >
> > > > So shall we modify the changes to decrease the test running time?
> > > > or just disable the Travis CI?
> > > >
> > > >
> > > > On Fri, Sep 28, 2018 at 9:17 PM Qin, Zhennan
> > > > 
> > > > wrote:
> > > > >
> > > > > Hi Kellen,
> > > > >
> > > > > Thanks for your explanation. Do you have a time plan to solve
> > > > > the
> > > > timeout issue? Rebasing can't work for my case. Or shall we run it
> > > > silently to disallow it voting X for overall CI result? Because
> > > > most developers are used to ignoring the PRs with 'X'.
> > > > >
> > > > > Thanks,
> > > > > Zhennan
> > > > >
> > > > > -Original Message-
> > > > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > > > Sent: Friday, September 28, 2018 10:38 PM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: Time out for Travis CI
> > > > >
> > > > > Hey Zhennan, you're safe to ignore Travis failures for now.
> > > > > They're
> > > > just informational.
> > > > >
> > > > > The reason you sometimes see quick builds and sometimes see slow
> > > > > builds
> > > > is that we're making use of ccache in between builds.  If your PR
> > > > is similar to what's in master you should build very quickly, if
> > > > not it's going to take a while and likely time out.  If you see
> > > > timeouts rebasing may speed things up.  Unfortunately the timeouts
> > > > are global and we're not able to increase them.  I'm hoping that
> > > > adding artifact caching will speed up future builds to the point
> > > > that test runs and builds can be executed in under the global limit
> (which is ~50 minutes).
> > >

Re: Time out for Travis CI

2018-09-29 Thread kellen sunderland
Reading over the PR I don't see what aspects would cause extra runtime
YiZhi, could you point them out?

On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu  wrote:

> Kellen, I think this PR introduces extra runtime in CI, thus causes
> the timeout. Which means, once merged, every PR later will see same
> timeout in travis.
>
> So shall we modify the changes to decrease the test running time? or
> just disable the Travis CI?
>
>
> On Fri, Sep 28, 2018 at 9:17 PM Qin, Zhennan 
> wrote:
> >
> > Hi Kellen,
> >
> > Thanks for your explanation. Do you have a time plan to solve the
> timeout issue? Rebasing can't work for my case. Or shall we run it silently
> to disallow it voting X for overall CI result? Because most developers are
> used to ignoring the PRs with 'X'.
> >
> > Thanks,
> > Zhennan
> >
> > -Original Message-
> > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > Sent: Friday, September 28, 2018 10:38 PM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: Time out for Travis CI
> >
> > Hey Zhennan, you're safe to ignore Travis failures for now.  They're
> just informational.
> >
> > The reason you sometimes see quick builds and sometimes see slow builds
> is that we're making use of ccache in between builds.  If your PR is
> similar to what's in master you should build very quickly, if not it's
> going to take a while and likely time out.  If you see timeouts rebasing
> may speed things up.  Unfortunately the timeouts are global and we're not
> able to increase them.  I'm hoping that adding artifact caching will speed
> up future builds to the point that test runs and builds can be executed in
> under the global limit (which is ~50 minutes).
> >
> > -Kellen
> >
> >
> > On Fri, Sep 28, 2018 at 4:05 PM Qin, Zhennan 
> wrote:
> >
> > > Hi MXNet devs,
> > >
> > > I'm struggled with new Travis CI for a while, it always run time out
> > > for this PR:
> > > https://github.com/apache/incubator-mxnet/pull/12530
> > >
> > > Most of the time, Jenkins CI can pass, while Travis can't be finished
> > > within 50 minutes. For this PR, it shouldn't affect much on the build
> > > time or unit test time. Also, I saw other PR has same problem, eg.
> > >
> > >
> > > https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status&utm_medium=notification
> > >
> > > https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status&utm_medium=notification
> > >
> > > According to the time stamps from Travis, all passed PRs are within
> > > small code changes, and can complete `make -j2` within 25s. But for
> > > timeout cases, 'make -j2' will need about 1600s. Does Travis do
> > > incremental builds for each test? Shall we increase the time limit for
> > > large PRs? Can we add more time stamps for the build and unit test
> > > stages to help understand what's going on there?
> > >
> > > Thanks in advance,
> > > Zhennan
> > >
>
>
>
> --
> Yizhi Liu
> DMLC member
> Amazon Web Services
> Vancouver, Canada
>


Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-29 Thread kellen sunderland
It's more readable because it's concise and it's consistent for many types
you're looping over (i.e. primitive arrays, stl iterators, etc all work the
same way).  It's also useful because it's consistent with other programming
languages, making C++ codebases much easier to read for novice and
intermediate developers.  IMO it also leads to better naming in loop bodies
as the concise style means you're less likely to have uninformative 1-letter
variable names describing loop elements (e.g. no int i = 0 or it ...).  More
motivation can be found in the cpp standards proposals for C++11
http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1868.html and
http://open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3853.htm.



On Sat, Sep 29, 2018 at 6:38 PM Naveen Swamy  wrote:

> Kellen,
>
> Could you please explain why you think range loops are better and how it
> improves readability?  this is a relatively new feature, many of them are
> used to the old syntax, shouldn't we leave it for the developers to choose
> the one that best suits the need and their familiarity.
> In general I support the notion of standardizing where necessary, enforcing
> rules on loops seems a little bit like micro-managing how you should write
> C++ code for MXNet.
>
> -1(open to change based on new information)
>
>
>
> On Fri, Sep 28, 2018 at 5:20 PM Chris Olivier 
> wrote:
>
> > ok then, my vote is still -1, however, because it’s just adding needless
> > friction for developers imho.
> >
> > On Fri, Sep 28, 2018 at 7:42 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > "Range loops aren’t always the most performant way" Do you have an
> > example
> > > where there's a perf difference?
> > >
> > > "In addition, sometimes you want the index. Or maybe you want to
> iterate
> > > backwards, or not start from the first, etc. Maybe you want the
> iterator
> > > because you remove it from the list at the bottom of the loop Seems
> > > like a rule for the sake of having a rule."
> > >
> > > I should have been more clear about this point.  If you're using the
> > index
> > > in the loop, doing reverse iteration, or not iterating from
> start-to-end
> > > this inspection is smart enough to realize it and will not suggest
> > > optimizing that type of loop.  The loops that would be changed are
> _only_
> > > the loops which are detected as equivalent to range-loops.  Examples
> can
> > be
> > > found here:
> > >
> >
> https://clang.llvm.org/extra/clang-tidy/checks/modernize-loop-convert.html
> > > or you can look at what's been changed in the ref PR.  I've initially
> set
> > > our confidence level at 'reasonable' but we could also set to 'safe'
> > which
> > > would further reduce the number of loops the check would apply to.
> > >
> > > -Kellen
> > >
> > > On Fri, Sep 28, 2018 at 3:54 PM Chris Olivier 
> > > wrote:
> > >
> > > > -1
> > > >
> > > > Range loops aren’t always the most performant way. In addition,
> > sometimes
> > > > you want the index. Or maybe you want to iterate backwards, or not
> > start
> > > > from the first, etc. Maybe you want the iterator because you remove
> it
> > > from
> > > > the list at the bottom of the loop Seems like a rule for the sake
> > of
> > > > having a rule.
> > > >
> > > > On Fri, Sep 28, 2018 at 2:12 AM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Hello MXNet devs,
> > > > >
> > > > > I'd like to discuss uniformly adopting C++11 range loops in the
> MXNet
> > > > > project.  The benefits I see are:
> > > > >
> > > > > *  Improved C++ readability (examples below).
> > > > > *  Consistency with other languages.  The range-loops are quite
> > similar
> > > > to
> > > > > loops almost all other programming languages.  Given we're a
> project
> > > that
> > > > > supports many languages this language consistency could be positive
> > for
> > > > our
> > > > > community.
> > > > > * Consistency within the same project.  Currently different authors
> > > have
> > > > > different loops styles which hurts codebase readability.
> > > > > *  Best available performance.  There are often multiple ways to
> > write
> > > > > loops in C+

Re: [Discuss] Next MXNet release

2018-09-28 Thread kellen sunderland
Sorry I meant to say next 'Regarding the *minor* release'.

On Sat, Sep 29, 2018 at 5:27 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Thanks for transparently setting a rough timeline Steffen.  I think this
> will go a long way in helping the community plan their work, even if the
> details change somewhat on the road to the release.
>
> Regarding the major release: I would propose we unify TensorRT with the
> subgraph operator work.
>
> Regarding the patch release:  There were a few minor stack/buffer
> overflows exposed by ASAN that have been addressed.  It's probably a good
> idea to include them in a patch release, as they at best result in
> non-deterministic behaviour.
>
> -Kellen
>
>
> On Sat, Sep 29, 2018 at 1:39 AM Steffen Rochel 
> wrote:
>
>> I updated
>>
>> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+for+next+MXNet+Release
>> ,
>> removed the completed items from 1.3 release and would like to kick off
>> discussion about the next release. Please suggest what you would like to
>> see included in the next release together with link to design proposal
>> (appropriately for the size and complexity of the proposal) or suggest
>> changes.
>> I suggest to target the next release for December 2018 to frame the
>> discussion.
>> Lets include review of
>> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Roadmap - time to
>> update and discuss changes.
>>
>> From the 1.3 release we had discussion regarding
>> https://github.com/apache/incubator-mxnet/issues/11849 and resolution in
>> https://github.com/apache/incubator-mxnet/pull/12412 .
>> Are you aware of critical issues and feedback from user which we should
>> consider for a potential 1.3.1 patch release. Should we include PR 12412
>> in
>> a potential patch release?
>>
>> Regards,
>> Steffen
>>
>


Re: [Discuss] Next MXNet release

2018-09-28 Thread kellen sunderland
Thanks for transparently setting a rough timeline Steffen.  I think this
will go a long way in helping the community plan their work, even if the
details change somewhat on the road to the release.

Regarding the major release: I would propose we unify TensorRT with the
subgraph operator work.

Regarding the patch release:  There were a few minor stack/buffer overflows
exposed by ASAN that have been addressed.  It's probably a good idea to
include them in a patch release, as they at best result in
non-deterministic behaviour.

-Kellen


On Sat, Sep 29, 2018 at 1:39 AM Steffen Rochel 
wrote:

> I updated
>
> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+for+next+MXNet+Release
> ,
> removed the completed items from 1.3 release and would like to kick off
> discussion about the next release. Please suggest what you would like to
> see included in the next release together with link to design proposal
> (appropriately for the size and complexity of the proposal) or suggest
> changes.
> I suggest to target the next release for December 2018 to frame the
> discussion.
> Lets include review of
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Roadmap - time to
> update and discuss changes.
>
> From the 1.3 release we had discussion regarding
> https://github.com/apache/incubator-mxnet/issues/11849 and resolution in
> https://github.com/apache/incubator-mxnet/pull/12412 .
> Are you aware of critical issues and feedback from user which we should
> consider for a potential 1.3.1 patch release. Should we include PR 12412 in
> a potential patch release?
>
> Regards,
> Steffen
>


Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-28 Thread kellen sunderland
"Range loops aren’t always the most performant way" Do you have an example
where there's a perf difference?

"In addition, sometimes you want the index. Or maybe you want to iterate
backwards, or not start from the first, etc. Maybe you want the iterator
because you remove it from the list at the bottom of the loop Seems
like a rule for the sake of having a rule."

I should have been more clear about this point.  If you're using the index
in the loop, doing reverse iteration, or not iterating from start-to-end
this inspection is smart enough to realize it and will not suggest
optimizing that type of loop.  The loops that would be changed are _only_
the loops which are detected as equivalent to range-loops.  Examples can be
found here:
https://clang.llvm.org/extra/clang-tidy/checks/modernize-loop-convert.html
or you can look at what's been changed in the ref PR.  I've initially set
our confidence level at 'reasonable' but we could also set to 'safe' which
would further reduce the number of loops the check would apply to.

-Kellen

On Fri, Sep 28, 2018 at 3:54 PM Chris Olivier 
wrote:

> -1
>
> Range loops aren’t always the most performant way. In addition, sometimes
> you want the index. Or maybe you want to iterate backwards, or not start
> from the first, etc. Maybe you want the iterator because you remove it from
> the list at the bottom of the loop Seems like a rule for the sake of
> having a rule.
>
> On Fri, Sep 28, 2018 at 2:12 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hello MXNet devs,
> >
> > I'd like to discuss uniformly adopting C++11 range loops in the MXNet
> > project.  The benefits I see are:
> >
> > *  Improved C++ readability (examples below).
> > *  Consistency with other languages.  The range-loops are quite similar
> to
> > loops almost all other programming languages.  Given we're a project that
> > supports many languages this language consistency could be positive for
> our
> > community.
> > * Consistency within the same project.  Currently different authors have
> > different loops styles which hurts codebase readability.
> > *  Best available performance.  There are often multiple ways to write
> > loops in C++ with subtle differences in performance and memory usage
> > between loop methods.  Using range-loops ensures we get the best possible
> > perf using an intuitive loop pattern.
> > *  Slightly lower chance for bugs / OOB accesses when dealing with
> indexing
> > in an array for example.
> >
> > If we decide to enable this uniformly throughout the project we can
> enable
> > this policy with a simple clang-tidy configuration change.  There would
> be
> > no need for reviewers to have to manually provide feedback when someone
> > uses an older C++ loops style.
> >
> > -Kellen
> >
> > Reference PR:  https://github.com/apache/incubator-mxnet/pull/12356/
> > Previous clang-tidy discussion on the list:
> >
> >
> https://lists.apache.org/thread.html/b0ae5a9df5dfe0d9074cb2ebe432264db4fa2175b89fa43a5f6e36be@%3Cdev.mxnet.apache.org%3E
> >
> > -
> > Examples:
> > for (auto axis_iter = param.axis.begin() ; axis_iter!= param.axis.end();
> > ++axis_iter) {
> > CHECK_LT(*axis_iter, static_cast<int>(ishape.ndim()));
> > stride_[reverse_index] = ishape[*axis_iter];
> > ...
> > -->
> > for (int axis : param.axis) {
> > CHECK_LT(axis, static_cast<int>(ishape.ndim()));
> > stride_[reverse_index] = ishape[axis];
> > ...
> > --
> > for (size_t i = 0; i < in_array.size(); i++) {
> > auto &nd = in_array[i];
> > pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
> > }
> > -->
> > for (auto & nd : in_array) {
> > pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
> > }
> >
>


Re: Time out for Travis CI

2018-09-28 Thread kellen sunderland
Hey Zhennan, you're safe to ignore Travis failures for now.  They're just
informational.

The reason you sometimes see quick builds and sometimes see slow builds is
that we're making use of ccache in between builds.  If your PR is similar
to what's in master you should build very quickly, if not it's going to
take a while and likely time out.  If you see timeouts rebasing may speed
things up.  Unfortunately the timeouts are global and we're not able to
increase them.  I'm hoping that adding artifact caching will speed up
future builds to the point that test runs and builds can be executed in
under the global limit (which is ~50 minutes).

-Kellen


On Fri, Sep 28, 2018 at 4:05 PM Qin, Zhennan  wrote:

> Hi MXNet devs,
>
> I'm struggled with new Travis CI for a while, it always run time out for
> this PR:
> https://github.com/apache/incubator-mxnet/pull/12530
>
> Most of the time, Jenkins CI can pass, while Travis can't be finished
> within 50 minutes. For this PR, it shouldn't affect much on the build time
> or unit test time. Also, I saw other PR has same problem, eg.
>
>
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status&utm_medium=notification
>
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status&utm_medium=notification
>
> According to the time stamps from Travis, all passed PRs are within small
> code changes, and can complete `make -j2` within 25s. But for timeout cases,
> 'make -j2' will need about 1600s. Does Travis do incremental builds for each
> test? Shall we increase the time limit for large PRs? Can we add more time
> stamps for the build and unit test stages to help understand what's going on
> there?
>
> Thanks in advance,
> Zhennan
>


Re: Maturity Model and Graduation

2018-09-28 Thread kellen sunderland
Hey Jim, welcome to the community.

To the best of my knowledge we have not yet discussed/run a Maturity
Model.  My gut feel is that MXNet would come away a fairly bi-model
result.  My view of the project is that it's getting the Apache Way right
in terms of Code, Releases, and Quality.  I think the project is doing
decently well with Licensing (although it's maybe a little more complex
than other projects given the many required code dependencies).  From my
observations I would say the community often struggles with consensus
building.  My opinion is that the project is doing a lot right with
community, especially in question answer, but is lacking in other areas
such as community expansion and ownership.

Independence is an area where the project is clearly behind, with almost
all active committers coming from Amazon.  We've had some great
contributions from Intel and NVIDIA, but so far have not been able to add
the members from those organizations to the IPMC for various reasons.
MXNet seems to not have had much support from other open-source communities
(with a notable exception of Carin who did a great job with Clojure and was
made a committer/ipmc member).  My impression is that the community would
love to improve in this area, so the lack of progress is not due to any
lack of desire on the community's part.  I'd love to see some more
sustained contribution from other open source communities to help us out in
this area (and am hopeful we can reach out to the Julia community as an
example).

-Kellen

On Wed, Sep 26, 2018 at 11:24 PM Jim Jagielski  wrote:

> As a newly "minted" mentor, I'm getting my feet wet on determining where
> the project is and where it needs to go in order to be ready for
> graduation...
>
> Has the project run the Maturity Model against itself? How do we stack up?
> What areas of improvement could we benefit from (this might be independent
> of what the MatModel sez, btw. If you have ideas on where we could be
> working and collaborating better, please bring them up!)?
>
> Cheers!


[DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-28 Thread kellen sunderland
Hello MXNet devs,

I'd like to discuss uniformly adopting C++11 range loops in the MXNet
project.  The benefits I see are:

*  Improved C++ readability (examples below).
*  Consistency with other languages.  The range-loops are quite similar to
loops almost all other programming languages.  Given we're a project that
supports many languages this language consistency could be positive for our
community.
* Consistency within the same project.  Currently different authors have
different loops styles which hurts codebase readability.
*  Best available performance.  There are often multiple ways to write
loops in C++ with subtle differences in performance and memory usage
between loop methods.  Using range-loops ensures we get the best possible
perf using an intuitive loop pattern.
*  Slightly lower chance for bugs / OOB accesses when dealing with indexing
in an array for example.

If we decide to enable this uniformly throughout the project we can enable
this policy with a simple clang-tidy configuration change.  There would be
no need for reviewers to have to manually provide feedback when someone
uses an older C++ loops style.

-Kellen

Reference PR:  https://github.com/apache/incubator-mxnet/pull/12356/
Previous clang-tidy discussion on the list:
https://lists.apache.org/thread.html/b0ae5a9df5dfe0d9074cb2ebe432264db4fa2175b89fa43a5f6e36be@%3Cdev.mxnet.apache.org%3E

-
Examples:
for (auto axis_iter = param.axis.begin() ; axis_iter!= param.axis.end();
++axis_iter) {
CHECK_LT(*axis_iter, static_cast<int>(ishape.ndim()));
stride_[reverse_index] = ishape[*axis_iter];
...
-->
for (int axis : param.axis) {
CHECK_LT(axis, static_cast<int>(ishape.ndim()));
stride_[reverse_index] = ishape[axis];
...
--
for (size_t i = 0; i < in_array.size(); i++) {
auto &nd = in_array[i];
pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
}
-->
for (auto & nd : in_array) {
pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
}


Re: Which merge option to use on the Import Julia binding PR?

2018-09-26 Thread kellen sunderland
My gut feel would be just to squash and merge, it usually works quite well.

Is there any chance that someone might want to cherry-pick, revert or
rebase any portions of the PR?

If so, what I try to do is provide atomic commits that bring small
individual pieces of value to the codebase.  This often means at the end of
the PR I'd do some git hygiene, rework my commits, and then force push.
I try to ensure that I also leave a backup branch in GitHub that contains
my original git history.  If you have an atomic chain of commits then it
might make more sense to rebase and merge.

On Wed, Sep 26, 2018, 3:41 PM Carin Meier  wrote:

> The Import Julia binding PR ,(
> https://github.com/apache/incubator-mxnet/pull/10149), is getting very
> close to being merged. Because of the large number of commits there was a
> suggestion not to use the usual "Squash and Merge".  The only option would
> be "Rebase and Merge" since merging with a merge commit is not enabled for
> the project.
>
> *Squash and Merge* - The commits from this branch will be combined into one
> commit in the base branch (With all the commit messages combined)
>
> *Rebase and Merge* - The commits from this branch will be rebased and added
> to the base branch
>
> The PR is over 250+ commits (Github won't show all of them)
>
> Thoughts about how we should handle the merge?
>
> Thanks,
> Carin
>

