Re: CI and PRs

2019-08-26 Thread Pedro Larroy
Hi Chris, you are reading confrontational or negative intent into things where there is none, just diverse opinions and different ways of expressing them. Marco and I went for a beer and dinner together, talked about this, and had a good exchange of technical ideas and opinions with mutual respect

Re: CI and PRs

2019-08-23 Thread Chris Olivier
Pedro, I don’t see where Marco says that he “designed and implemented all aspects of CI by himself”. I do think, however, that it’s fair to say that Marco was in charge of the design and most likely made the majority of design decisions as the CI was being built, especially around those tenets

Re: CI and PRs

2019-08-23 Thread Pedro Larroy
Thanks for your response, Marco. I think you have missed my original point, which was basically that someone volunteering effort on the CI is as important as someone contributing a feature. From my perspective this hasn't been the case, and we had to rely a lot on you and Sheng to submit fixes

Re: CI and PRs

2019-08-23 Thread Marco de Abreu
I've heard this request multiple times and so far, I'm having trouble understanding the direct correlation between having committer permissions and being able to manage CI. When I designed the CI, one of the tenets was maintainability and accessibility for the community: I wanted to avoid that somebody

Re: CI and PRs

2019-08-23 Thread Pedro Larroy
As Marco has open sourced the bulk of the CI infrastructure donated by Amazon to the community, I would like to recommend that the community take action to help volunteers working on the CI have a better experience. In the past, my impression is that there hasn't been much action

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Hi Aaron. This is difficult to diagnose, because I don't know what to do when the hash of a layer in Docker doesn't match and Docker decides to rebuild it. The R script seems not to have changed. I have observed this in the past and I think it is due to bugs in Docker. Maybe Kellen is able to give some

Re: CI and PRs

2019-08-16 Thread Aaron Markham
Is -R already in there? Here's an example of it happening to me right now: I am making minor changes to the runtime_functions logic for handling the R docs output. I pull the fix, then run the container, but I see the R deps layer re-running. I didn't touch that. Why is that running again?

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Also, I forgot: another workaround is that I added the -R flag to the build logic (build.py) so the container is not rebuilt for manual use. On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy wrote: > > Hi Aaron. > > As Marco explained, if you are in master the cache usually works; there are > two issues
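As a rough sketch of how that flag might be used locally (only the -R flag and build.py are from the thread; the platform name and build target are illustrative assumptions):

```shell
# Hypothetical local invocation of MXNet's CI build script; -R skips
# rebuilding the Docker container, as described above.
cd incubator-mxnet
ci/build.py -R --platform ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu
```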

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Hi Aaron. As Marco explained, if you are in master the cache usually works. There are two issues that I have observed: 1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04), so if the cached base image used in the FROM statement becomes outdated, your caching won't work. (Using
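A common mitigation for the stale-base-image issue (assuming standard Docker CLI behavior; the image tag and file names are illustrative):

```shell
# FROM caches whichever ubuntu:16.04 was pulled last; refresh it explicitly.
docker pull ubuntu:16.04

# Or ask docker build to re-pull all referenced base images itself:
docker build --pull -f docker/Dockerfile.build.ubuntu_cpu -t mxnet/build.ubuntu_cpu .
```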

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
It reruns as soon as that particular script has been modified. Since the following steps depend on it, once step 4 has a cache mismatch, steps 5-15 are also no longer valid. Our cache is always controlled by master. This means that the only thing that matters is the diff between
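The invalidation cascade described here falls directly out of how Docker layers work; a minimal sketch (the file names are hypothetical):

```dockerfile
FROM ubuntu:16.04
# Rarely-changing steps go first so their layers stay cached.
COPY install/ubuntu_core.sh /work/
RUN /work/ubuntu_core.sh
# Frequently-edited scripts go last: editing ubuntu_docs.sh invalidates
# only this layer and every layer after it, not the ones above.
COPY install/ubuntu_docs.sh /work/
RUN /work/ubuntu_docs.sh
```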

Re: CI and PRs

2019-08-15 Thread Aaron Markham
When you create a new Dockerfile and use it on CI, it doesn't seem to cache some of the steps, like this: Step 13/15 : RUN /work/ubuntu_docs.sh ---> Running in a1e522f3283b + echo 'Installing dependencies...' + apt-get update Installing dependencies. Or this: Step 4/13 : RUN /wo

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
Do I understand correctly that you are saying the Docker cache doesn't work properly and regularly reinstalls dependencies? Or do you mean that you only have cache misses when you modify the dependencies, which would be expected? -Marco On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham wrote:

Re: CI and PRs

2019-08-15 Thread Aaron Markham
Many of the CI pipelines follow this pattern: load ubuntu 16.04, install deps, build mxnet, then run some tests. Why repeat steps 1-3 over and over? Now, some tests use a stashed binary and the Docker cache. And I see this work locally, but for the most part, on CI, you're gonna sit through a dependency
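The stashed-binary pattern mentioned above can be sketched in a Jenkinsfile roughly like this (a scripted-pipeline sketch; `stash`/`unstash` are standard Jenkins steps, while the node label, stage names, and paths are assumptions):

```groovy
node('linux-cpu') {
  stage('Build') {
    sh 'ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu'
    // Save the built binary so later stages do not repeat steps 1-3.
    stash name: 'mxnet_lib', includes: 'lib/libmxnet.so'
  }
}
node('linux-cpu') {
  stage('Docs') {
    unstash 'mxnet_lib'  // reuse the binary instead of rebuilding it
    sh 'ci/docs/build_docs.sh'
  }
}
```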

Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Chris. I suggest you send a PR to illustrate your proposal so we have a concrete example to look into. Pedro. On Wed, Aug 14, 2019 at 6:16 PM Chris Olivier wrote: > I see it done daily now, and while I can’t share all the details, it’s not > an incredibly complex thing, and involves not much

Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Aaron. Why does it speed things up? What's the difference? Pedro. On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham wrote: > The PRs Thomas and I are working on for the new docs and website share the > mxnet binary in the new CI pipelines we made. Speeds things up a lot. > > On Wed, Aug 14, 2019, 18:16

Re: new website (RE: CI and PRs)

2019-08-15 Thread Marco de Abreu
…any documentation on the S3 publishing steps and how to test this. > After breaking out each docs package into its own pipeline, I see opportunities to use the GitHub API to check the PR payload and be selective about what

Re: new website (RE: CI and PRs)

2019-08-15 Thread Aaron Markham
…2019 at 10:03 PM Zhao, Patric wrote: > Hi Aaron, > Recently, we are working on improving the documents of the CPU backend based on the current website. > I saw there are several PRs to update the new website

Re: new website (RE: CI and PRs)

2019-08-15 Thread Marco de Abreu
…when the new website will be online. > If it's very near, we will switch our work to the new website. > Thanks, > --Patric > -Original Message- > From: Aaron Markham > Sent: Thursday, August

Re: new website (RE: CI and PRs)

2019-08-15 Thread Aaron Markham
…our work to the new website. > Thanks, > --Patric > -Original Message- > From: Aaron Markham > Sent: Thursday, August 15, 2019 11:40 AM > To: dev@mxnet.incubator.apache.org > Subject: Re: CI and PRs > > The PRs Thomas

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
No worries, auto scaling is taking care of that :) -Marco Sheng Zha wrote on Thu, 15 Aug 2019, 19:50: > The AWS Batch approach should also help with hardware utilization as > machines are launched only when needed :) > > -sz > > > On Aug 15, 2019, at 9:11 AM, Marco de Abreu > wrote: > >

Re: CI and PRs

2019-08-15 Thread Sheng Zha
The AWS Batch approach should also help with hardware utilization as machines are launched only when needed :) -sz > On Aug 15, 2019, at 9:11 AM, Marco de Abreu wrote: > > Thanks Leonard. Naively dividing by test files would certainly be an easy > and doable way before going into proper nose

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
Thanks Leonard. Naively dividing by test files would certainly be an easy and doable way before going into proper nose parallelization. Great idea! Scalability in terms of nodes is not an issue. Our system can handle at least 600 slaves (I didn't want to go higher for obvious reasons). But I think

Re: CI and PRs

2019-08-15 Thread Leonard Lausen
To parallelize across machines: For GluonNLP we started submitting test jobs to AWS Batch. Just adding a for-loop over the units in the Jenkinsfile [1] and submitting a job for each [2] works quite well. Then Jenkins just waits for all jobs to finish and retrieves their status. This works since AWS
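The fan-out Leonard describes can be sketched in Python. The `chunk` helper below is plain logic; the Batch call is shown as comments because the job name, queue, and job definition are hypothetical (only `submit_job` itself is the real boto3 API):

```python
# Sketch of fanning test units out to AWS Batch, as described above.
# Queue and job-definition names are assumptions, not MXNet's actual setup.

def chunk(units, n_jobs):
    """Split test units into at most n_jobs roughly equal groups."""
    groups = [[] for _ in range(n_jobs)]
    for i, unit in enumerate(units):
        groups[i % n_jobs].append(unit)
    return [g for g in groups if g]  # drop empty groups

def submit_all(units, n_jobs=4):
    # import boto3
    # batch = boto3.client("batch")
    jobs = []
    for group in chunk(units, n_jobs):
        # resp = batch.submit_job(
        #     jobName="mxnet-tests",
        #     jobQueue="ci-queue",             # hypothetical queue
        #     jobDefinition="mxnet-test-def",  # hypothetical definition
        #     containerOverrides={"command": ["nosetests"] + group},
        # )
        # jobs.append(resp["jobId"])
        jobs.append(group)  # stand-in so the sketch runs without AWS
    return jobs

print(submit_all(["test_a.py", "test_b.py", "test_c.py"], n_jobs=2))
# → [['test_a.py', 'test_c.py'], ['test_b.py']]
```

Jenkins would then poll the returned job IDs until all of them finish, as the message describes.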

Re: CI and PRs

2019-08-14 Thread Marco de Abreu
The first step towards parallelization could certainly be to start adding parallel test execution in nosetests. -Marco Aaron Markham wrote on Thu, 15 Aug 2019, 05:39: > The PRs Thomas and I are working on for the new docs and website share the > mxnet binary in the new CI pipelines we made. Speeds
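nose ships a multiprocess plugin that does exactly this (the flags are nose's own; the test path and process count are illustrative):

```shell
# Spread test collection across 4 worker processes, with a generous
# per-process timeout for slow integration-style tests.
nosetests --processes=4 --process-timeout=1800 tests/python/unittest
```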

new website (RE: CI and PRs)

2019-08-14 Thread Zhao, Patric
…switch our work to the new website. Thanks, --Patric > -Original Message- > From: Aaron Markham > Sent: Thursday, August 15, 2019 11:40 AM > To: dev@mxnet.incubator.apache.org > Subject: Re: CI and PRs > > The PRs Thomas and I are working on for the new docs and website

Re: CI and PRs

2019-08-14 Thread Aaron Markham
The PRs Thomas and I are working on for the new docs and website share the mxnet binary in the new CI pipelines we made. Speeds things up a lot. On Wed, Aug 14, 2019, 18:16 Chris Olivier wrote: > I see it done daily now, and while I can’t share all the details, it’s not > an incredibly complex thing

Re: CI and PRs

2019-08-14 Thread Chris Olivier
I see it done daily now, and while I can’t share all the details, it’s not an incredibly complex thing, and involves not much more than nfs/efs sharing and remote ssh commands. All it takes is a little ingenuity and some imagination. On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy wrote: > Sounds

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Sounds good in theory. I think there are complex details with regard to resource sharing during parallel execution. Still, I think both ways can be explored. I think some tests run for unreasonably long times for what they are doing. We already scale parts of the pipeline horizontally across worker

Re: CI and PRs

2019-08-14 Thread Chris Olivier
+1 Rather than removing tests (which doesn’t scale as a solution), why not scale them horizontally so that they finish more quickly? Across processes, or even on a pool of machines that aren’t necessarily the build machine? On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu wrote: > With regards to time

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Hi Marco. I have to agree with you on that, from past experience. What do you suggest for maintenance? Do we need a watermark that fails the validation if the total runtime exceeds a high threshold? Pedro. On Wed, Aug 14, 2019 at 1:03 PM Marco de Abreu wrote: > With regards to time I rather prefer
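One way to enforce such a watermark (a declarative-pipeline sketch; `timeout` is a standard Jenkins option, the 3-hour threshold is purely illustrative):

```groovy
options {
    // Abort, and thereby fail, any run whose total time exceeds the watermark.
    timeout(time: 3, unit: 'HOURS')
}
```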

Re: CI and PRs

2019-08-14 Thread Marco de Abreu
With regards to time, I'd rather we spend a bit more time on maintenance than have somebody run into an error that could've been caught with a test. I mean, our Publishing pipeline for Scala GPU has been broken for quite some time now, but nobody noticed that. Basically my stance on that matter

Re: CI and PRs

2019-08-14 Thread Carin Meier
If a language binding test is failing for an unimportant reason, then it is too brittle and needs to be fixed (we have fixed some of these in the Clojure package [1]). But in general, if we think of the MXNet project as one project spanning all the language bindings, then we want to know

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
From what I have seen, Clojure is 15 minutes, which I think is reasonable. The only question is that when a binding such as R, Perl or Clojure fails, some devs are a bit confused about how to fix it, since they are not familiar with the testing tools and the language. On Wed, Aug 14, 2019 at 11:5

Re: CI and PRs

2019-08-14 Thread Carin Meier
Great idea, Marco! Anything that you think would be valuable to share would be good. The duration of each node in the test stage sounds like a good start. - Carin On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu wrote: > Hi, > > we record a bunch of metrics about run statistics (down to the duration

Re: CI and PRs

2019-08-14 Thread Marco de Abreu
Hi, we record a bunch of metrics about run statistics (down to the duration of every individual step). If you tell me which ones you're particularly interested in (probably the total duration of each node in the test stage), I'm happy to provide them. Dimensions are (in hierarchical order): - job -

Re: CI and PRs

2019-08-14 Thread Carin Meier
I would prefer to keep the language bindings in the PR process. Perhaps we could do some analytics to see how much each of the language bindings contributes to overall run time. If we have some metrics on that, maybe we can come up with a guideline for how much time each should take. Another possibility

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Yes, another point is that pushing again to a PR should cancel previous builds, which is currently not happening and wastes resources. Any ideas on how to make builds more robust against connection errors? The Ivy cache for JVM packages, for example, could be pre-populated in the workers. It's a balance between complexity
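One known Jenkins pattern for cancelling superseded builds uses `milestone` steps (a scripted-pipeline sketch; whether it fits this CI's setup is an open question):

```groovy
// Each build passes its own milestone; once a newer build passes a
// milestone, older builds that have not yet reached it are aborted.
def n = env.BUILD_NUMBER as int
if (n > 1) {
    milestone(n - 1)
}
milestone(n)
```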

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Hi Carin. That's a good point. All things considered, would your preference be to keep the Clojure tests as part of the PR process or in nightly? Some options are having notifications here or in Slack. But if we think breakages would go unnoticed, maybe it's not a good idea to fully remove bindings from

Re: CI and PRs

2019-08-14 Thread Chaitanya Bapat
Pedro, great job summarizing the set of tasks to restore CI's glory! As far as your list goes: > * Address unit tests that take more than 10-20s; streamline them or move > them to nightly if that can't be done. I would like to call out this request specifically. I'm tracking the number of timeouts that
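To find those 10-20s offenders, per-test runtimes can be reported with the third-party nose-timer plugin (assuming it works with this suite; the flags are the plugin's own, the test path is illustrative):

```shell
pip install nose-timer
# Print the 20 slowest tests after the run finishes.
nosetests --with-timer --timer-top-n 20 tests/python/unittest
```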

Re: CI and PRs

2019-08-14 Thread Carin Meier
Before any binding tests are moved to nightly, I think we need to figure out how the community can get proper notifications of failure and success for those nightly runs. Otherwise, I think breakages would go unnoticed. -Carin On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy wrote: > Hi > > See