Re: CUDNN algorithm selection failure

2018-10-01 Thread Lin Yuan
I could not reproduce the error on an EC2 g3x8 instance, which makes it hard to debug. I also suspect it was due to a resource usage limit on the CI instance. On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy wrote: > It doesn't look like flakiness to me at first sight. I think it might be > related to

Re: CUDNN algorithm selection failure

2018-10-01 Thread Pedro Larroy
It doesn't look like flakiness to me at first sight. I think it might be related to resource usage / allocation / leak in the worst case. Could be that there was not enough GPU memory at the time of test execution. But I'm just speculating, hence my original question. Pedro. On Mon, Oct

RE: [Discuss] Next MXNet release

2018-10-01 Thread Zhao, Patric
Thanks for letting us know about this discussion. We don't have enough bandwidth to track the different sources, such as the discussion forum, so I think the best way is to open an issue on GitHub so that we can answer/solve the issue in time :) Thanks, --Patric > -Original Message- > From:

RE: [Discuss] Next MXNet release

2018-10-01 Thread Zhao, Patric
Thanks, Steffen. I will send the reminder again, and currently Da, Jun, Haibin and Marco are reviewing our 1st PR (12530). Regarding MKL-DNN integration, the MKL-DNN backend has now reached GA, in my view. In the last development cycle, lots of tests, both unit tests and real models, were added to

Re: CUDNN algorithm selection failure

2018-10-01 Thread Lin Yuan
Hi Pedro, I also got this failure in my PR: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline I was not able to identify the root cause of it from the changelist. Are you suggesting there is some flakiness in the master branch too? Thanks,

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Well, I'd propose we get clarification from Travis before bringing the issue up with infra. No point debating something with infra or amongst ourselves if it's not possible. Orthogonal to the paid account option, let's merge this speedup to unblock Intel. On Oct 2, 2018 4:37 AM, "Marco de Abreu"

Re: Time out for Travis CI

2018-10-01 Thread Marco de Abreu
I think the timeout and other limitations have been imposed by Apache Infra and not by Travis. They didn't say that specifically, but they already made me aware that we might get further restrictions if we consume too many resources. kellen sunderland wrote on Tue, Oct 2, 2018, 04:34: >

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Still worth following up with Travis (I've already messaged them). They're in the middle of reorganizing their business model and merging paid and free accounts into the same service, so maybe this policy is changing. It doesn't make a lot of sense to me that public repo accounts would have

Re: Time out for Travis CI

2018-10-01 Thread Marco de Abreu
Apache has its own shared Travis fleet. We are basically using an on-premise version of the paid Travis plan. That was the information I got from Infra when I had a chat with them a few days ago. But from that conversation it was made pretty clear that we cannot increase the limits. -Marco

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Interesting, this page seems to indicate that private projects do have a longer time out. I'll drop Travis a quick email and see what the deal would be for our project. https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts. On Tue, Oct 2, 2018, 3:15 AM kellen sunderland wrote:

Re: Time out for Travis CI

2018-10-01 Thread Qing Lan
From the link it looks like "Travis CI offers a free account" rather than Apache buying it. It may just be a free user account with an extension on the number of nodes it can run on. I think we may need to reach out to Travis or Apache to clarify that we currently have the service that paid version

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
I actually thought we were already using a paid plan through Apache https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci On Tue, Oct 2, 2018, 3:11 AM Qing Lan wrote: > Are we currently on a free plan? If we are, probably the unlimited build > minutes would help > > Thanks, >

Re: Time out for Travis CI

2018-10-01 Thread Qing Lan
Are we currently on a free plan? If we are, probably the unlimited build minutes would help Thanks, Qing On 10/1/18, 6:08 PM, "kellen sunderland" wrote: Does the global time out change for paid plans? I looked into it briefly but didn't see anything that would indicate it does.

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Does the global time out change for paid plans? I looked into it briefly but didn't see anything that would indicate it does. On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy wrote: > I think there's two approaches that we can take to mitigate the build & > test time problem, in one hand use a paid

MXNet Podling Report - October

2018-10-01 Thread Haibin Lin
Hi MXNet community, The podling report for MXNet is due on October 3rd. The report covers MXNet's progress on community development and project development (the previous one can be found here). You can search "MXNet" at

Re: Time out for Travis CI

2018-10-01 Thread Pedro Larroy
I think there are two approaches that we can take to mitigate the build & test time problem: on the one hand, use a paid Travis CI plan; on the other, organize the unit tests into suites and only run a core set of tests, as we should do on devices, though in this case we reduce coverage.

CUDNN algorithm selection failure

2018-10-01 Thread Pedro Larroy
Hi, I saw this failure on CI: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline Have you seen other cases where we fail to select the best CUDNN algorithm? In which circumstances could this happen, and do you think it is a good idea to have

Re: [Discuss] Next MXNet release

2018-10-01 Thread Haibin Lin
I found 2 bugs related to Gluon Trainer with distributed KVStore. Basically, if someone uses Gluon for distributed training with a learning rate schedule (e.g. train ResNet50 for image classification), it won't work. https://github.com/apache/incubator-mxnet/issues/12713 I have the fix for the

Re: [Discuss] Next MXNet release

2018-10-01 Thread Afrooze, Sina
This post suggests there is a regression from 1.1.0 to 1.2.1 related to MKLDNN integration: https://discuss.mxnet.io/t/mxnet-1-2-1-module-get-outputs/1882 The error is related to MKLDNN layout not being converted back to MXNet layout in some operator: " !IsMKLDNNData() We can’t generate TBlob

Re: Subscription

2018-10-01 Thread Naveen Swamy
Invited On Mon, Oct 1, 2018 at 12:39 PM Jim Jagielski wrote: > I'd like an invite as well, please :) > > > On Sep 29, 2018, at 12:03 PM, Naveen Swamy wrote: > > > > Invite sent. Welcome to Apache MXNet Cosmin :). > > > > > > On Sat, Sep 29, 2018 at 11:38 AM Cosmin Cătălin Sanda < > >

Re: Subscription

2018-10-01 Thread Jim Jagielski
I'd like an invite as well, please :) > On Sep 29, 2018, at 12:03 PM, Naveen Swamy wrote: > > Invite sent. Welcome to Apache MXNet Cosmin :). > > > On Sat, Sep 29, 2018 at 11:38 AM Cosmin Cătălin Sanda < > cosmincata...@gmail.com> wrote: > >> Hi, I would like to subscribe to the ASF mxnet