Re: Stopping nightly releases to Pypi

2019-12-01 Thread Sunderland, Kellen
Makes sense to me to release nightlies to S3 only.  Can we reduce the size by
cutting down on the SM architectures we release?  Was the main complaint about the
CUDA release sizes?

On Dec 1, 2019 9:43 PM, "Lausen, Leonard"  wrote:
Hi MXNet Community,

For more than 2 months now, our nightly binary Python releases published on Pypi
have been broken. The problem is that our binaries exceed Pypi's size limit.
Decreasing the binary size by adding compression breaks third-party libraries
that load libmxnet.so: https://github.com/apache/incubator-mxnet/issues/16193

Sheng requested Pypi to increase their size limit:
https://github.com/pypa/pypi-support/issues/50

Currently "the biggest cost for PyPI from [the many MXNet binaries with nightly
release to Pypi] is the bandwidth consumed when several hundred mirrors attempt
to mirror each release immediately after it's published". So Pypi is not
inclined to allow us to upload even larger binaries on a nightly schedule.
Their compromise is to allow it on a weekly cadence.

However, I would like the community to revisit the necessity of releasing
pre-release binaries to Pypi on a nightly (or weekly) cadence. Instead, we can
release nightly binaries ONLY to a public S3 bucket and instruct users to
install from there. On our side, we only need to prepare an HTML document that
contains links to all released nightly binaries (a sketch of generating such a
page follows below). Users would then install the nightly releases via

  pip install --pre mxnet-cu101 -f http://mxnet.s3.amazonaws.com/mxnet-cu101/nightly.html

Instead of

  pip install --pre mxnet-cu101
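
For illustration, a minimal sketch of how such a find-links page could be
generated from the bucket contents. It assumes boto3; the bucket name, key
prefix and output path below are hypothetical placeholders, not our actual
infrastructure:

  # Hypothetical sketch: build a pip "find-links" page listing the nightly
  # wheels stored in a public S3 bucket.
  import boto3

  BUCKET = "mxnet"           # hypothetical bucket name
  PREFIX = "mxnet-cu101/"    # hypothetical key prefix for one package flavour

  s3 = boto3.client("s3")
  links = []
  paginator = s3.get_paginator("list_objects_v2")
  for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
      for obj in page.get("Contents", []):
          key = obj["Key"]
          if key.endswith(".whl"):
              name = key.rsplit("/", 1)[-1]
              links.append('<a href="https://{}.s3.amazonaws.com/{}">{}</a><br>'
                           .format(BUCKET, key, name))

  with open("nightly.html", "w") as f:
      f.write("<html><body>\n" + "\n".join(links) + "\n</body></html>\n")

pip accepts any HTML page of links via -f/--find-links, so pointing a --pre
install at such a file is enough for it to pick up the newest nightly wheel.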

Of course proper releases and release candidates should still be made available
via Pypi. Thus releases would be installed via

  pip install mxnet-cu101

And release candidates via

  pip install --pre mxnet-cu101

This will substantially reduce the costs for the Pypi project and in fact matches
the installation experience provided by PyTorch. I don't think the benefit of not
having to pass "-f http://mxnet.s3.amazonaws.com/mxnet-cu101/nightly.html"
outweighs the costs we currently externalize to the Pypi team.

This suggestion seems uncontroversial to me, so I would like to start lazy
consensus. If there are no objections within 72 hours, I will assume lazy
consensus on stopping nightly releases to Pypi.

Best regards
Leonard


Re: Update GCC 4.8 dependency?

2019-08-27 Thread Sunderland, Kellen
We could think about moving to a newer version and updating the standard.  I'm
using gcc 4.9 for my work builds, but more modern compilers everywhere else
(and I'd be willing to update the work compiler).

One of the cons is that it makes our code less portable. When we update the
minimum required compiler (on Linux), we also move to a toolchain with a newer
libc version, meaning MXNet could not be used on older platforms without docker
or virtualization.  In our case, updating to a C++17-capable compiler might mean
dropping CentOS 5 or Ubuntu 14.04 support.

If you look at CUDA releases as an example, they continually release binaries
compiled against older toolchains so that the libc requirement is met on most
platforms.

What are the features you'd like to see in C++17?  I'd recommend we call out 
interesting features and then see what compiler support we would need to use 
the feature.  It could be the case that the feature is supported in a fairly 
old compiler version.

If we want to immediately modernize the codebase, I've noticed that there are
actually a few C++11/14 features we could be using but aren't (clang-tidy
lists a number of them in each build).  We could start with those.

On Aug 27, 2019 2:53 AM, Leonard Lausen  wrote:
Hi,

"Currently, we only support gcc-4.8 build." [1]

Do we ever want to change this? gcc-4.8 has now been available for more than
6 years and a lot has happened during that time. Platforms have also upgraded
their default compiler versions, and gcc-7 is now commonly available
(e.g. Ubuntu 18.04 LTS, Amazon Linux 2). With gcc-7 we could, for example,
rely on C++17.

Wikipedia says:
- GCC since version 7 has complete support for C++17.
- Clang 5 and later implement all the features of C++17.
- Visual Studio 2017 15.7 (MSVC 19.14) supports almost all of C++17.

As Mu mentioned "Conservatism is not an option" if we want to bring
MXNet forward. The benefits of 6 years of work on compilers as well as
C++ ISO committee work may help us with that.

Should we adopt a newer compiler toolchain and perhaps the C++17 standard?

Best regards
Leonard

[1]: https://github.com/apache/incubator-mxnet/blob/681cfc4/tools/dependencies/README.md


Re: TensorRT blocker

2019-05-15 Thread Sunderland, Kellen
Looks like it's merged.  Can I help with a fix, Per?

On May 15, 2019 3:00 AM, Per da Silva  wrote:
Hi everyone,

Could a committer please merge this PR:
https://github.com/apache/incubator-mxnet/pull/14958

It disables the TensorRT steps to unblock CI while a fix is being worked on.

Cheers,

Per


Re: CI impaired

2018-11-21 Thread Sunderland, Kellen
Appreciate the big effort in bringing the CI back so quickly.  Thanks, Marco.

On Nov 21, 2018 5:52 AM, Marco de Abreu  
wrote:
Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
that incident.

If somebody is interested in the details around the outage:

Due to required maintenance (a disk running full), we had to upgrade our
Jenkins master because it was running on Ubuntu 17.04 (for an unknown
reason, it used to be 16.04) and we needed to install some packages. Since
support for Ubuntu 17.04 had ended, all package updates and installations
failed because the repositories had been taken offline. Because of the
unavailable maintenance packages and other issues with the installed OpenJDK 8
version, we decided to upgrade the Jenkins master to Ubuntu 18.04 LTS in order
to get back to a supported version with maintenance tooling. During this
upgrade, Jenkins was automatically updated by APT as part of the dist-upgrade
process.

In the latest version of Jenkins, some of the queue messages that we depend on
for our auto scaling have changed. To be more specific:
> Waiting for next available executor on mxnetlinux-gpu
has been changed to
> Waiting for next available executor on ‘mxnetlinux-gpu’
Notice the quote characters.

Unfortunately, Jenkins does not offer a better way than parsing these messages -
there is no standardized way to express queue items. Since our parser expected
the above message without quote characters, the message was discarded.

We support various queue reasons (5 of them to be exact) that indicate
resource starvation. If we run super low on capacity, the queue reason is
different and we would still be able to scale up, but most of the cases
would have printed the unsupported message. This resulted in reduced
capacity (to be specific, the limit during that time was 1 slave per type).

We have now fixed our autoscaling to automatically strip these characters
and added that message to our test suite.
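
For illustration, a minimal sketch of the kind of parsing described above; the
regex, label name and quote handling are illustrative, not the actual
autoscaling code:

  # Hypothetical sketch: extract the label from the Jenkins queue reason and
  # strip the ASCII and typographic quotes newer Jenkins versions add.
  import re

  QUEUE_REASON = re.compile(r"Waiting for next available executor on (?P<label>.+)")

  def parse_queue_label(why):
      match = QUEUE_REASON.match(why)
      if not match:
          return None
      return match.group("label").strip("'\"\u2018\u2019\u201c\u201d")

  assert parse_queue_label(
      "Waiting for next available executor on mxnetlinux-gpu") == "mxnetlinux-gpu"
  assert parse_queue_label(
      "Waiting for next available executor on \u2018mxnetlinux-gpu\u2019") == "mxnetlinux-gpu"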

Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham 
wrote:

> Marco, thanks for your hard work on this. I'm super excited about the new
> Jenkins jobs. This is going to be very helpful and improve sanity for our
> PRs and ourselves!
>
> Cheers,
> Aaron
>
> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> 
> > Hello,
> >
> > the CI is now back up and running. Auto scaling is working as expected
> and
> > it passed our load tests.
> >
> > Please excuse the caused inconveniences.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > marco.g.ab...@googlemail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I'd like to let you know that our CI was impaired and down for the last
> > > few hours. After getting the CI back up, I noticed that our auto
> scaling
> > > broke due to a silent update of Jenkins which broke our
> > upscale-detection.
> > > Manual scaling is currently not possible and stopping the scaling won't
> > > help either because there are currently no p3 instances available,
> which
> > > means that all jobs will fail none the less. In a few hours, the auto
> > > scaling will have recycled all slaves through the down-scale mechanism
> > and
> > > we will be out of capacity. This will lead to resource starvation and
> > thus
> > > timeouts.
> > >
> > > Your PRs will be properly registered by Jenkins, but please expect the
> > > jobs to time out and thus fail your PRs.
> > >
> > > I will fix the auto scaling as soon as I'm awake again.
> > >
> > > Sorry for the caused inconveniences.
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > > 5:30AM now and I've been working for 17 hours.
> > >
> >
>


Re: Fix slicing for 0.12

2017-10-24 Thread Sunderland, Kellen
FYI, we hit a bug in both numpy and mxnet, so 8383 is a little more complicated
than we thought.  Details will eventually be in the ticket, but there won't be a
fix for that one tonight.



On Oct 24, 2017 6:20 PM, Pedro Larroy  wrote:

We could also get this one in:

https://github.com/apache/incubator-mxnet/issues/8383

We are working on a fix with Kellen.

How much time until the time window closes?

Pedro.

On Tue, Oct 24, 2017 at 4:50 PM, Chris Olivier 
wrote:

> Does anyone else want to make the case that they have a critical fix that
> should go into 0.12.0.rc1?  Hopefully the PR already passed CI or is in
> master already.
>
> On Tue, Oct 24, 2017 at 6:31 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > Hi
> >
> > Can we get this PR in for 0.12?
> >
> > https://github.com/apache/incubator-mxnet/pull/8400
> >
> > It's a critical fix for undefined behaviour, which shows itself especially
> > on ARM platforms.
> >
> > --
> > Pedro.
> >
>


Re: CI system seems to be using python3 for python2 builds

2017-09-27 Thread Sunderland, Kellen
Many thanks Gautam.

On 9/26/17, 8:37 PM, "Kumar, Gautam" <ga...@amazon.com> wrote:

Hi Kellen, 

   This issue has been happening for the last 3-4 days, along with a few other
test failures. I am looking into it.

-Gautam 

On 9/26/17, 7:45 AM, "Sunderland, Kellen" <kell...@amazon.de> wrote:

I’ve been noticing in a few failed builds that the stack trace indicates we’re
actually running Python 3.4 in the Python 2 tests. I know the CI folks are
working hard getting everything set up; is this a known issue for the CI team?

For example: 
https://builds.apache.org/blue/organizations/jenkins/incubator-mxnet/detail/PR-8026/3/pipeline/281

Steps Python2: MKLML-CPU

StackTrace:
Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fadb8999aac]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal12GroupKVPairsISt4pairIPNS_7NDArrayES4_EZNS1_19GroupKVPairsPullRspERKSt6vectorIiSaIiEERKS7_IS6_SaIS6_EEPS9_PS7_ISD_SaISD_EEEUliRKS6_E_EEvSB_RKS7_IT_SaISN_EESG_PS7_ISP_SaISP_EERKT0_+0x56b) [0x7fadba32c01b]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal17PullRowSparseImplERKSt6vectorIiSaIiEERKS2_ISt4pairIPNS_7NDArrayES8_ESaISA_EEi+0xa6) [0x7fadba32c856]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(MXKVStorePullRowSparse+0x245) [0x7fadba18f165]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fadde26cadc]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7fadde26c40c]
[bt] (6) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(_ctypes_callproc+0x21d) [0x7fadde47e12d]
[bt] (7) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(+0xf6a3) [0x7fadde47e6a3]
[bt] (8) /usr/bin/python3(PyEval_EvalFrameEx+0x41d7) [0x48a487]
[bt] (9) /usr/bin/python3() [0x48f2df]
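
One way to catch this class of misconfiguration early would be a guard at the
start of the Python 2 test entry point. A minimal sketch, not existing CI code:

  # Hypothetical guard: fail fast if the suite is launched under the wrong
  # Python major version (the expected version would come from the CI job).
  import sys

  def require_major_version(expected):
      if sys.version_info[0] != expected:
          raise RuntimeError("Expected Python %d.x but running %s"
                             % (expected, sys.version.split()[0]))

  require_major_version(2)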

-Kellen