Re: Stopping nightly releases to Pypi
Makes sense to me to release nightlies to S3 only. Can we reduce size by cutting down on the SMs we release? Was the main complaint around CUDA release sizes?

On Dec 1, 2019 9:43 PM, "Lausen, Leonard" wrote:

Hi MXNet Community,

for more than 2 months, our binary Python nightly releases published on PyPI have been broken. The problem is that our binaries exceed PyPI's size limit. Decreasing the binary size by adding compression breaks third-party libraries that load libmxnet.so: https://github.com/apache/incubator-mxnet/issues/16193

Sheng requested that PyPI increase their size limit: https://github.com/pypa/pypi-support/issues/50

Currently "the biggest cost for PyPI from [the many MXNet binaries with nightly release to PyPI] is the bandwidth consumed when several hundred mirrors attempt to mirror each release immediately after it's published". So PyPI is not inclined to allow us to upload even larger binaries on a nightly schedule. Their compromise is to allow it on a weekly cadence.

However, I would like the community to revisit the necessity of releasing pre-release binaries to PyPI on a nightly (or weekly) cadence. Instead, we can release nightly binaries ONLY to a public S3 bucket and instruct users to install from there. On our side, we only need to prepare an HTML document that contains links to all released nightly binaries. Users would then install the nightly releases via

  pip install --pre mxnet-cu101 -f http://mxnet.s3.amazonaws.com/mxnet-cu101/nightly.html

instead of

  pip install --pre mxnet-cu101

Of course, proper releases and release candidates should still be made available via PyPI. Thus releases would be installed via

  pip install mxnet-cu101

and release candidates via

  pip install --pre mxnet-cu101

This will substantially reduce the costs to the PyPI project and in fact matches the installation experience provided by PyTorch.
I don't think the benefit of not having to include "-f http://mxnet.s3.amazonaws.com/mxnet-cu101/nightly.html" outweighs the costs we currently externalize to the PyPI team.

This suggestion seems uncontroversial to me, so I would like to start lazy consensus. If there are no objections within 72 hours, I will assume lazy consensus on stopping nightly releases to PyPI.

Best regards
Leonard
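The nightly.html document described above is just a pip "find-links" page: an HTML file whose anchor tags point at wheel files, which pip discovers via the -f flag. As a hedged sketch of what generating such a page could look like (the function name and the wheel filename below are illustrative examples, not the actual MXNet release tooling):

```python
# Hedged sketch: build a minimal pip "find-links" page (nightly.html) for
# wheels hosted in an S3 bucket. `build_findlinks_page` and the example wheel
# filename are hypothetical; only the bucket URL comes from the proposal above.
BASE_URL = "http://mxnet.s3.amazonaws.com/mxnet-cu101/"

def build_findlinks_page(wheel_names, base_url=BASE_URL):
    """Return an HTML page whose <a> tags pip can follow via `pip install -f`."""
    links = "\n".join(
        '<a href="{0}{1}">{1}</a><br>'.format(base_url, name)
        for name in wheel_names
    )
    return "<!DOCTYPE html>\n<html><body>\n{}\n</body></html>\n".format(links)

page = build_findlinks_page(
    ["mxnet_cu101-1.6.0b20191201-py2.py3-none-manylinux1_x86_64.whl"]
)
```

Regenerating this page after each nightly upload would be the only extra step on the release side; users' pip invocations stay unchanged apart from the -f flag.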
Re: Update GCC 4.8 dependency?
We could think about moving to a newer version and updating the standard. I'm using gcc 4.9 for my work builds, but more modern compilers everywhere else (and would be willing to update the work compiler).

One of the cons is that it makes our code less portable. When we update the minimum required compiler (on Linux), we use a toolchain with a newer libc version, meaning MXNet could no longer be used on older platforms without docker or virtualization. In our case, updating to a C++17 compiler might mean dropping CentOS 5 or Ubuntu 14.04 support. If you look at CUDA releases as an example, they continually release binaries compiled against older toolchains so that the libc requirement can be met on most platforms.

What are the features you'd like to see from C++17? I'd recommend we call out interesting features and then see what compiler support we would need to use each feature. It could be the case that a feature is supported in a fairly old compiler version.

If we want to immediately modernize the codebase, I've noticed that there are actually a few C++11/14 features we could be using but aren't (clang-tidy lists a number of them in each build). We could start with those.

On Aug 27, 2019 2:53 AM, Leonard Lausen wrote:

Hi,

"Currently, we only support gcc-4.8 build." [1]

Do we ever want to change this? gcc-4.8 has now been available for more than 6 years, and a lot has happened during that time. Platforms have also upgraded their default compiler versions, and gcc-7 is now commonly available (e.g. Ubuntu 18.04 LTS, Amazon Linux 2). With gcc-7 we could, for example, rely on C++17. Wikipedia says:

- GCC since version 7 has complete support for C++17.
- Clang 5 and later implement all the features of C++17.
- Visual Studio 2017 15.7 (MSVC 19.14) supports almost all of C++17.

As Mu mentioned, "Conservatism is not an option" if we want to bring MXNet forward. The benefits of 6 years of work on compilers as well as C++ ISO committee work may help us with that.
Should we adopt a newer compiler toolchain and perhaps the C++17 standard?

Best regards
Leonard

[1]: https://github.com/apache/incubator-mxnet/blob/681cfc4/tools/dependencies/README.md
Re: TensorRT blocker
Looks like it's merged. Can I help with a fix, Per?

On May 15, 2019 3:00 AM, Per da Silva wrote:

Hi everyone,

Could a committer please merge this PR: https://github.com/apache/incubator-mxnet/pull/14958

It disables the TensorRT steps to unblock CI while a fix is being worked on.

Cheers,
Per
Re: CI impaired
Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.

On Nov 21, 2018 5:52 AM, Marco de Abreu wrote:

Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to that incident.

If somebody is interested in the details around the outage: due to required maintenance (a disk running full), we had to upgrade our Jenkins master because it was running on Ubuntu 17.04 (for an unknown reason, it used to be 16.04) and we needed to install some packages. Since support for Ubuntu 17.04 was stopped, all package updates and installations failed because the repositories were taken offline. Due to the unavailable maintenance packages and other issues with the installed OpenJDK 8 version, we decided to upgrade the Jenkins master to Ubuntu 18.04 LTS in order to get back to a supported version with maintenance tools.

During this upgrade, Jenkins was automatically updated by APT as part of the dist-upgrade process. In the latest version of Jenkins, some labels which we depend on for our auto scaling have been changed. To be more specific:

> Waiting for next available executor on mxnetlinux-gpu

has been changed to

> Waiting for next available executor on ‘mxnetlinux-gpu’

Notice the quote characters. Jenkins unfortunately does not offer a better way than to parse these messages - there's no standardized way to express queue items. Since our parser expected the above message without quote characters, the message was discarded. We support various queue reasons (5 of them, to be exact) that indicate resource starvation. If we run super low on capacity, the queue reason is different and we would still be able to scale up, but most of the cases would have printed the unsupported message. This resulted in reduced capacity (to be specific, the limit during that time was 1 slave per type). We have now fixed our auto scaling to automatically strip these characters and added that message to our test suite.
Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham wrote:
> Marco, thanks for your hard work on this. I'm super excited about the new
> Jenkins jobs. This is going to be very helpful and improve sanity for our
> PRs and ourselves!
>
> Cheers,
> Aaron
>
> On Wed, Nov 21, 2018, 05:37 Marco de Abreu wrote:
> > Hello,
> >
> > the CI is now back up and running. Auto scaling is working as expected
> > and it passed our load tests.
> >
> > Please excuse the caused inconveniences.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
> > > Hello,
> > >
> > > I'd like to let you know that our CI was impaired and down for the last
> > > few hours. After getting the CI back up, I noticed that our auto scaling
> > > broke due to a silent update of Jenkins which broke our upscale-detection.
> > > Manual scaling is currently not possible, and stopping the scaling won't
> > > help either because there are currently no p3 instances available, which
> > > means that all jobs will fail nonetheless. In a few hours, the auto
> > > scaling will have recycled all slaves through the down-scale mechanism
> > > and we will be out of capacity. This will lead to resource starvation
> > > and thus timeouts.
> > >
> > > Your PRs will be properly registered by Jenkins, but please expect the
> > > jobs to time out and thus fail your PRs.
> > >
> > > I will fix the auto scaling as soon as I'm awake again.
> > >
> > > Sorry for the caused inconveniences.
> > >
> > > Best regards,
> > > Marco
> > >
> > > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > > 5:30AM now and I've been working for 17 hours.
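The parser fix Marco describes (accepting the executor label both with and without the new typographic quote characters) could be sketched roughly as follows. The function and regex names are illustrative, not the actual MXNet CI auto-scaling code:

```python
import re

# Hedged sketch of the queue-message fix described above: newer Jenkins wraps
# the executor label in typographic quotes, older versions do not, so the
# parser must accept both forms. Names here are hypothetical.
QUEUE_MESSAGE = re.compile(
    u"Waiting for next available executor on [\u2018']?(?P<label>[\\w-]+)[\u2019']?$"
)

def parse_queue_label(message):
    """Return the slave label a queued job is waiting for, or None."""
    match = QUEUE_MESSAGE.match(message)
    return match.group("label") if match else None
```

Adding both message variants to the test suite, as Marco mentions, guards against the same silent breakage on future Jenkins upgrades.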
Re: Fix slicing for 0.12
FYI, we hit a bug in both numpy and mxnet, so 8383 is a little more complicated than we thought. Details will eventually be in the ticket, but there won't be a fix for that one tonight.

On Oct 24, 2017 6:20 PM, Pedro Larroy wrote:

We could also get this one in: https://github.com/apache/incubator-mxnet/issues/8383

We are working on a fix with Kellen. How much time until the time window closes?

Pedro.

On Tue, Oct 24, 2017 at 4:50 PM, Chris Olivier wrote:
> Does anyone else want to make the case that they have a critical fix that
> should go into 0.12.0.rc1? Hopefully the PR already passed CI or is in
> master already.
>
> On Tue, Oct 24, 2017 at 6:31 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > Hi
> >
> > Can we get this PR in for 0.12?
> >
> > https://github.com/apache/incubator-mxnet/pull/8400
> >
> > It's a critical fix for undefined behaviour, which shows itself especially
> > on ARM platforms.
> >
> > --
> > Pedro.

Amazon Development Center Germany GmbH Berlin - Dresden - Aachen main office: Krausenstr. 38, 10117 Berlin Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger Ust-ID: DE289237879 Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
Re: CI system seems to be using python3 for python2 builds
Many thanks Gautam.

On 9/26/17, 8:37 PM, "Kumar, Gautam" wrote:

Hi Kellen,

This issue has been happening for the last 3-4 days, along with a few other test failures. I am looking into it.

-Gautam

On 9/26/17, 7:45 AM, "Sunderland, Kellen" wrote:

I've been noticing in a few failed builds that the stack trace indicates we're actually running Python 3.4 in the Python 2 tests. I know the CI folks are working hard getting everything set up; is this a known issue for the CI team?

For example: https://builds.apache.org/blue/organizations/jenkins/incubator-mxnet/detail/PR-8026/3/pipeline/281

Steps Python2: MKLML-CPU

StackTrace:

Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fadb8999aac]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal12GroupKVPairsISt4pairIPNS_7NDArrayES4_EZNS1_19GroupKVPairsPullRspERKSt6vectorIiSaIiEERKS7_IS6_SaIS6_EEPS9_PS7_ISD_SaISD_EEEUliRKS6_E_EEvSB_RKS7_IT_SaISN_EESG_PS7_ISP_SaISP_EERKT0_+0x56b) [0x7fadba32c01b]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal17PullRowSparseImplERKSt6vectorIiSaIiEERKS2_ISt4pairIPNS_7NDArrayES8_ESaISA_EEi+0xa6) [0x7fadba32c856]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(MXKVStorePullRowSparse+0x245) [0x7fadba18f165]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fadde26cadc]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7fadde26c40c]
[bt] (6) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(_ctypes_callproc+0x21d) [0x7fadde47e12d]
[bt] (7) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(+0xf6a3) [0x7fadde47e6a3]
[bt] (8) /usr/bin/python3(PyEval_EvalFrameEx+0x41d7) [0x48a487]
[bt] (9) /usr/bin/python3() [0x48f2df]

-Kellen
CI system seems to be using python3 for python2 builds
I've been noticing in a few failed builds that the stack trace indicates we're actually running Python 3.4 in the Python 2 tests. I know the CI folks are working hard getting everything set up; is this a known issue for the CI team?

For example: https://builds.apache.org/blue/organizations/jenkins/incubator-mxnet/detail/PR-8026/3/pipeline/281

Steps Python2: MKLML-CPU

StackTrace:

Stack trace returned 10 entries:
[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fadb8999aac]
[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal12GroupKVPairsISt4pairIPNS_7NDArrayES4_EZNS1_19GroupKVPairsPullRspERKSt6vectorIiSaIiEERKS7_IS6_SaIS6_EEPS9_PS7_ISD_SaISD_EEEUliRKS6_E_EEvSB_RKS7_IT_SaISN_EESG_PS7_ISP_SaISP_EERKT0_+0x56b) [0x7fadba32c01b]
[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal17PullRowSparseImplERKSt6vectorIiSaIiEERKS2_ISt4pairIPNS_7NDArrayES8_ESaISA_EEi+0xa6) [0x7fadba32c856]
[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(MXKVStorePullRowSparse+0x245) [0x7fadba18f165]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fadde26cadc]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7fadde26c40c]
[bt] (6) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(_ctypes_callproc+0x21d) [0x7fadde47e12d]
[bt] (7) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(+0xf6a3) [0x7fadde47e6a3]
[bt] (8) /usr/bin/python3(PyEval_EvalFrameEx+0x41d7) [0x48a487]
[bt] (9) /usr/bin/python3() [0x48f2df]

-Kellen
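The mismatch Kellen observes (the [bt] frames reference /usr/lib/python3.4 and /usr/bin/python3 in a stage labeled Python2) could be caught early by a guard run at the start of each version-specific CI stage. A hedged sketch, not the actual MXNet CI code:

```python
import sys

# Hedged sketch: a guard a Python-2-specific CI stage could run before the
# test suite, failing fast when the wrong interpreter is on PATH. The
# function name is hypothetical.
def check_interpreter(expected_major):
    actual = sys.version_info.major
    if actual != expected_major:
        raise RuntimeError(
            "CI stage expected Python %d but is running Python %d.%d"
            % (expected_major, actual, sys.version_info.minor)
        )
    return actual
```

Invoking something like `python -c "..."` with this check as the first step of the "Python2: MKLML-CPU" stage would turn the confusing downstream stack trace into an immediate, self-explanatory failure.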