Re: PR validation and runtime of CI

2018-06-13 Thread Pedro Larroy
Hi, thanks for your insightful comments. I see the concerns about moving PR checks to nightly, and it worries me too. We should all agree on some sort of tradeoff. Flaky tests not only fail to help validate the code, but are detrimental to our progress. Agree that disabling them is suboptimal but

Re: PR validation and runtime of CI

2018-06-07 Thread Steffen Rochel
Agree, we need to get serious about reliably fixing and re-enabling the tests. Mapping from tests to folders might be a good enough approximation for the mxnet repo. In general you would have to trace code-to-test dependencies. Steffen On Thu, Jun 7, 2018 at 6:48 PM Marco de Abreu wrote: > We already
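
To illustrate the folder-to-test mapping idea mentioned above, here is a minimal sketch. It is not taken from the thread or from the MXNet CI: the folder names, suite names, and the `suites_for_changes` helper are all hypothetical, and the only design assumption is that any change outside a known folder conservatively triggers every suite.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level folders to the test suites they affect.
# Neither the folder names nor the suite names come from the thread.
FOLDER_TO_SUITES = {
    "python": {"python-unittest"},
    "scala-package": {"scala-unittest"},
    "docs": set(),  # documentation-only changes select no test suite
}

# Hypothetical full set of suites, used as the conservative fallback.
ALL_SUITES = {"python-unittest", "scala-unittest", "cpp-unittest"}


def suites_for_changes(changed_files):
    """Return the set of test suites to run for a list of changed paths."""
    selected = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        top = parts[0] if parts else ""
        if top not in FOLDER_TO_SUITES:
            # Unknown folder (e.g. core sources): run everything to stay safe.
            return set(ALL_SUITES)
        selected |= FOLDER_TO_SUITES[top]
    return selected
```

A folder-level mapping like this is only an approximation: it cannot see that a change in one folder affects tests in another, which is why the sketch falls back to the full set for anything it does not recognize.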

Re: PR validation and runtime of CI

2018-06-07 Thread Marco de Abreu
We already have GitHub issues for most of the flaky tests. Even for some that have been disabled for almost half a year - they have never been re-enabled and thus we still lack coverage. I think I have an idea how to do it. I will check if it actually works and then provide a small POC. Basically

Re: PR validation and runtime of CI

2018-06-07 Thread Steffen Rochel
I support creating GitHub/Jira issues for flaky tests and disabling them for now. However, we need to get serious and prioritize fixing the disabled tests. Making PR checks smart so that they test only code impacted by a change is a good idea; does anybody have experience with tools enabling smart validation? I'm concerned

Re: PR validation and runtime of CI

2018-06-07 Thread Naveen Swamy
Sorry, I missed reading that Pedro was asking to move the tests that run training. I agree with that. Additionally we should make the CI smart as I mentioned above. -Naveen On Thu, Jun 7, 2018 at 3:59 PM, Naveen Swamy wrote: > -1 for moving to nightly. I think that would be detrimental. > >

Re: PR validation and runtime of CI

2018-06-07 Thread Naveen Swamy
-1 for moving to nightly. I think that would be detrimental. We have to make our CI a little smarter so that it builds only the required components, rather than all of them, to reduce cost and the time it takes to run CI. A Scala build need not build everything and run tests related to Python, etc.,

Re: PR validation and runtime of CI

2018-06-07 Thread Marco de Abreu
Thanks a lot for your input, Thomas! You are right, 3h is only hit if somebody makes changes in their Dockerfiles and thus every node has to rebuild its containers - but this is expected and inevitable. So far there have not been any big attempts to reduce the number of flaky tests. We had a

Re: PR validation and runtime of CI

2018-06-07 Thread Thomas DELTEIL
Thanks for bringing up the issue of CI stability! However, I disagree with some points in this thread: - "We are at approximately 3h for a full successful run." => Looking at Jenkins I see the last successful runs oscillating between 1h53 and 2h42 with a mean that seems to be at 2h20. Or are you

Re: PR validation and runtime of CI

2018-06-07 Thread Marco de Abreu
Yeah, I think we are at the point at which we have to disable tests. If a test fails in nightly, the commit would not be reverted since it's hard to pin a failure to a specific PR. We will have reporting for failures on nightly (they have proven to be stable, so we can enable it right from the

Re: PR validation and runtime of CI

2018-06-07 Thread Aaron Markham
I'd like to disable flaky tests until they're fixed. What would the process be for fixing a failure if the tests are done nightly? Would the commit be reverted? Won't we end up in the same situation with so many flaky tests? I'd like to see if we can separate the test pipelines based on the

PR validation and runtime of CI

2018-06-06 Thread Pedro Larroy
Hi Team, the time to validate a PR is growing, due to our number of supported platforms and increased time spent in testing and running models. We are at approximately 3h for a full successful run. This is compounded by a build failure rate of more than 50% due to flaky tests, which is a