Re: CI and PRs

2019-08-26 Thread Pedro Larroy
Hi Chris, you are reading confrontation or negativity into things where there is no bad intention, just diverse opinions and different ways of expressing them. Marco and I went for a beer and dinner together and talked about this, and we had a good exchange of technical ideas and opinions with mutual

Re: CI and PRs

2019-08-23 Thread Chris Olivier
Pedro, I don’t see where Marco says that he “designed and implemented all aspects of CI by himself”. I do think, however, that it’s fair to say that Marco was in charge of the design and most likely made the majority of design decisions as the CI was being built, especially around those tenets

Re: CI and PRs

2019-08-23 Thread Pedro Larroy
Thanks for your response Marco, I think you have totally missed my original point which was basically that someone volunteering effort on the CI is as important as someone contributing a feature. From my perspective this hasn't been the case, and we had to rely a lot on you and Sheng to submit

Re: CI and PRs

2019-08-23 Thread Marco de Abreu
I've heard this request multiple times and so far, I'm having issues understanding the direct correlation between having committer permissions and being able to manage CI. When I designed the CI, one of the tenets was maintainability and accessibility for the community: I wanted to avoid that

Re: CI and PRs

2019-08-23 Thread Pedro Larroy
As Marco has open sourced the bulk of the CI infrastructure donated from Amazon to the community, I would like to raise the recommendation that the community takes action to help volunteers working on the CI have a better experience. In the past, it's my impression that there hasn't been much

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Hi Aaron. This is difficult to diagnose, because I don't know what to do when the hash of the layer in Docker doesn't match and it decides to rebuild it. The R script seems not to have changed. I have observed this in the past and I think it's due to bugs in Docker. Maybe Kellen is able to give some

Re: CI and PRs

2019-08-16 Thread Aaron Markham
Is -R already in there? Here's an example of it happening to me right now: I am making minor changes to the runtime_functions logic for handling the R docs output. I pull the fix, then run the container, but I see the R deps layer re-running. I didn't touch that. Why is that running again?

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Also, I forgot, another workaround is that I added the -R flag to the build logic (build.py) so the container is not rebuilt for manual use. On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy wrote: > > Hi Aaron. > > As Marco explained, if you are in master the cache usually works, there's > two
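
For reference, a hypothetical invocation of that flag (the platform name, function name, and exact semantics of -R are assumptions based on the description above, not confirmed against build.py):

    # Illustrative only: assumes -R makes ci/build.py reuse the existing
    # container image instead of rebuilding it.
    ci/build.py -R --platform ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu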

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Hi Aaron. As Marco explained, if you are in master the cache usually works, there's two issues that I have observed: 1 - Docker doesn't automatically pull the base image (ex. ubuntu:16.04) so if your cached base which is used in the FROM statement becomes outdated your caching won't work. (Using
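
A minimal sketch of working around the stale base image issue with the plain Docker CLI (the image tag and Dockerfile name are placeholders):

    # Refresh the base image referenced in FROM so the cached copy is not stale.
    docker pull ubuntu:16.04

    # Alternatively, ask docker build to always attempt to pull a newer base.
    docker build --pull -t mxnet-ci:ubuntu_cpu -f Dockerfile.build.ubuntu_cpu .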

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
It's rerunning as soon as that particular script has been modified. Since the following steps depend on it, it means that once step 4 has a cache mismatch, steps 5-15 are also no longer valid. Our cache is always controlled by master. This means that the only thing that matters is the diff
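
A small sketch of that invalidation behaviour (tags, file names, and the touched script path are illustrative):

    # The first build populates the layer cache; an identical second build
    # reuses every layer ("Using cache" for all steps).
    docker build -t mxnet-ci:ubuntu_cpu -f Dockerfile.build.ubuntu_cpu .
    docker build -t mxnet-ci:ubuntu_cpu -f Dockerfile.build.ubuntu_cpu .

    # Changing a script used at step 4 changes that layer's hash, so step 4
    # and every later step rebuild on the next run.
    touch install/ubuntu_core.sh
    docker build -t mxnet-ci:ubuntu_cpu -f Dockerfile.build.ubuntu_cpu .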

Re: CI and PRs

2019-08-15 Thread Aaron Markham
When you create a new Dockerfile and use that on CI, it doesn't seem to cache some of the steps... like this: Step 13/15 : RUN /work/ubuntu_docs.sh ---> Running in a1e522f3283b + echo 'Installing dependencies...' + apt-get update Installing dependencies. Or this Step 4/13 : RUN

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
Do I understand it correctly that you are saying that the Docker cache doesn't work properly and regularly reinstalls dependencies? Or do you mean that you only have cache misses when you modify the dependencies - which would be expected? -Marco On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham

Re: CI and PRs

2019-08-15 Thread Aaron Markham
Many of the CI pipelines follow this pattern: Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why repeat steps 1-3 over and over? Now, some tests use a stashed binary and docker cache. And I see this work locally, but for the most part, on CI, you're gonna sit through a

Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Chris. I suggest you send a PR to illustrate your proposal so we have a concrete example to look into. Pedro. On Wed, Aug 14, 2019 at 6:16 PM Chris Olivier wrote: > I see it done daily now, and while I can’t share all the details, it’s not > an incredibly complex thing, and involves not much

Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Aaron. Why does it speed things up? What's the difference? Pedro. On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham wrote: > The PRs Thomas and I are working on for the new docs and website share the > mxnet binary in the new CI pipelines we made. Speeds things up a lot. > > On Wed, Aug 14, 2019,

Re: new website (RE: CI and PRs)

2019-08-15 Thread Marco de Abreu
> ...kage in its own pipeline, I see opportunities to use the GitHub API to check the PR payload and be selective about what tests to run. > On Wed, Aug 14, 2019 at 10:03 PM Zhao, Patric wrote:

Re: new website (RE: CI and PRs)

2019-08-15 Thread Aaron Markham
> Recently, we are working on improving the documents of CPU backend based on the current website. > I saw there're several PRs to update the new website and it's really great. > Thus, I'd like to know when the new website w

Re: new website (RE: CI and PRs)

2019-08-15 Thread Marco de Abreu
> ...new website. Thanks, --Patric

Re: new website (RE: CI and PRs)

2019-08-15 Thread Aaron Markham
> --Patric > The PRs Thomas and I are working on for the new docs and

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
No worries, auto scaling is taking care of that :) -Marco Sheng Zha wrote on Thu., Aug. 15, 2019, 19:50: > The AWS Batch approach should also help with hardware utilization as > machines are launched only when needed :) > > -sz > > > On Aug 15, 2019, at 9:11 AM, Marco de Abreu > wrote: > > >

Re: CI and PRs

2019-08-15 Thread Sheng Zha
The AWS Batch approach should also help with hardware utilization as machines are launched only when needed :) -sz > On Aug 15, 2019, at 9:11 AM, Marco de Abreu wrote: > > Thanks Leonard. Naively dividing by test files would certainly be an easy > and doable way before going into to proper

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
Thanks Leonard. Naively dividing by test files would certainly be an easy and doable way before going into proper nose parallelization. Great idea! Scalability in terms of nodes is not an issue. Our system can handle at least 600 slaves (didn't want to go higher for obvious reasons). But I
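
A naive sketch of dividing by test files across workers (the worker/count environment variables and test directory are assumptions for illustration):

    # Worker $WORKER_INDEX (0-based) out of $NUM_WORKERS runs every Nth test file.
    i=0
    for f in $(ls tests/python/unittest/test_*.py | sort); do
      if [ $((i % NUM_WORKERS)) -eq "$WORKER_INDEX" ]; then
        nosetests "$f" || exit 1
      fi
      i=$((i + 1))
    done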

Re: CI and PRs

2019-08-15 Thread Leonard Lausen
To parallelize across machines: For GluonNLP we started submitting test jobs to AWS Batch. Just adding a for-loop over the units in the Jenkinsfile [1] and submitting a job for each [2] works quite well. Then Jenkins just waits for all jobs to finish and retrieves their status. This works since
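
A rough sketch of that pattern using the AWS CLI, assuming a Batch job queue and job definition already exist (all names are placeholders, not the actual GluonNLP setup from [1]/[2]):

    # Submit one Batch job per test unit; the pipeline can then poll the
    # returned job ids until all of them finish.
    for unit in test_models test_data test_utils; do
      aws batch submit-job \
        --job-name "ci-${unit}" \
        --job-queue ci-test-queue \
        --job-definition ci-test-jobdef \
        --container-overrides '{"command":["nosetests","tests/unit/'"${unit}"'.py"]}'
    done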

Re: CI and PRs

2019-08-14 Thread Marco de Abreu
The first step wrt parallelization could certainly be to start adding parallel test execution in nosetests. -Marco Aaron Markham wrote on Thu., Aug. 15, 2019, 05:39: > The PRs Thomas and I are working on for the new docs and website share the > mxnet binary in the new CI pipelines we made.
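
For reference, nose's multiprocess plugin already provides this; a minimal example (process count, timeout, and test path are arbitrary):

    # Run the unit test suite across 8 worker processes with a per-test timeout.
    nosetests --processes=8 --process-timeout=1800 tests/python/unittest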

new website (RE: CI and PRs)

2019-08-14 Thread Zhao, Patric
...to the new website. Thanks, --Patric > The PRs Thomas and I are working on for the new docs and website share the mx

Re: CI and PRs

2019-08-14 Thread Aaron Markham
The PRs Thomas and I are working on for the new docs and website share the mxnet binary in the new CI pipelines we made. Speeds things up a lot. On Wed, Aug 14, 2019, 18:16 Chris Olivier wrote: > I see it done daily now, and while I can’t share all the details, it’s not > an incredibly complex

Re: CI and PRs

2019-08-14 Thread Chris Olivier
I see it done daily now, and while I can’t share all the details, it’s not an incredibly complex thing, and involves not much more than nfs/efs sharing and remote ssh commands. All it takes is a little ingenuity and some imagination. On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy wrote: > Sounds

Re: CI and PRs

2019-08-14 Thread Chris Olivier
+1 Rather than remove tests (which doesn’t scale as a solution), why not scale them horizontally so that they finish more quickly? Across processes or even on a pool of machines that aren’t necessarily the build machine? On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu wrote: > With regards to

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Hi Marco. I have to agree with you on that, from past experience. What do you suggest for maintenance? Do we need a watermark that fails the validation if the total runtime exceeds a high threshold? Pedro. On Wed, Aug 14, 2019 at 1:03 PM Marco de Abreu wrote: > With regards to time I rather
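
One cheap way to implement such a watermark, sketched with GNU timeout (the wrapper script name and threshold are placeholders):

    # Fail the stage if the whole test run exceeds a 3-hour watermark.
    timeout 3h ./run_test_stage.sh
    status=$?
    if [ "$status" -eq 124 ]; then   # 124 = terminated by timeout
      echo "CI runtime watermark exceeded (3h)"
      exit 1
    fi
    exit "$status"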

Re: CI and PRs

2019-08-14 Thread Marco de Abreu
With regards to time I rather prefer us spending a bit more time on maintenance than somebody running into an error that could've been caught with a test. I mean, our Publishing pipeline for Scala GPU has been broken for quite some time now, but nobody noticed that. Basically my stance on that

Re: CI and PRs

2019-08-14 Thread Carin Meier
If a language binding test is failing for an unimportant reason, then it is too brittle and needs to be fixed (we have fixed some of these with the Clojure package [1]). But in general, if we think of the MXNet project as one project that spans all the language bindings, then we want to

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
From what I have seen Clojure is 15 minutes, which I think is reasonable. The only question is that when a binding such as R, Perl or Clojure fails, some devs are a bit confused about how to fix them since they are not familiar with the testing tools and the language. On Wed, Aug 14, 2019 at

Re: CI and PRs

2019-08-14 Thread Carin Meier
Great idea Marco! Anything that you think would be valuable to share would be good. The duration of each node in the test stage sounds like a good start. - Carin On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu wrote: > Hi, > > we record a bunch of metrics about run statistics (down to the

Re: CI and PRs

2019-08-14 Thread Marco de Abreu
Hi, we record a bunch of metrics about run statistics (down to the duration of every individual step). If you tell me which ones you're particularly interested in (probably total duration of each node in the test stage), I'm happy to provide them. Dimensions are (in hierarchical order): - job -

Re: CI and PRs

2019-08-14 Thread Carin Meier
I would prefer to keep the language binding in the PR process. Perhaps we could do some analytics to see how much each of the language bindings is contributing to overall run time. If we have some metrics on that, maybe we can come up with a guideline of how much time each should take. Another

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Hi Carin. That's a good point. All things considered, would your preference be to keep the Clojure tests as part of the PR process or in Nightly? Some options are having notifications here or in Slack. But if we think breakages would go unnoticed, maybe it's not a good idea to fully remove bindings

Re: CI and PRs

2019-08-14 Thread Chaitanya Bapat
Pedro, great job summarizing the set of tasks to restore CI's glory! As far as your list goes, > * Address unit tests that take more than 10-20s, streamline them or move > them to nightly if it can't be done. I would like to call out this request specifically. I'm tracking the # of timeouts that

Re: CI and PRs

2019-08-14 Thread Carin Meier
Before any binding tests are moved to nightly, I think we need to figure out how the community can get proper notifications of failure and success on those nightly runs. Otherwise, I think that breakages would go unnoticed. -Carin On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy wrote: > Hi > >

CI and PRs

2019-08-13 Thread Pedro Larroy
Hi. Seems we are hitting some problems in CI. I propose the following action items to remedy the situation and accelerate turnaround times in CI, reduce cost, complexity, and the probability of failures blocking PRs and frustrating developers: * Upgrade Windows Visual Studio from VS 2015 to VS 2017.