Re: [DISCUSS] Re-think CI strategy?
Our CI is looking much healthier now after recent work (thank you!); example build: https://travis-ci.org/apache/arrow/builds/417700344. I think we've bought ourselves a few months at least. We'll have to see what the impact on CI health is of adding a couple more things:

* parquet-cpp unit tests (per [1])
* Gandiva build + tests

I suspect at some point in the future we may need to have a combination of "fast Travis CI builds" and more exhaustive / longer-running builds in Jenkins. Projects like Apache Kudu have much more intense testing procedures, and these are run on dedicated infrastructure rather than CI.

I also think that more parts of our CI could be handled by creating an "Arrow test bot" that can respond to directions. There are a number of frameworks and examples now for writing GitHub bots; we could create a bot that can execute on-demand tests of optional components using the crossbow tool. Other things that we run on every commit, like the Python manylinux1 build, could be run on-demand and nightly. That being said, I just worked on a PR that broke the manylinux1 build (https://github.com/apache/arrow/pull/2428), so we risk having to hunt down the root cause of a broken build if we don't run such tests on every commit. I'm not sure we can simultaneously have fast CI builds while also catching all possible problems.

- Wes

[1]: https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E

On Tue, Aug 7, 2018 at 2:55 AM, Antoine Pitrou wrote:
>
> It would be good to test all Python versions in a cron build, but I
> agree we may not need to test all Python 3 versions in per-commit builds.
>
> Regards
>
> Antoine.
>
> On 07/08/2018 03:14, Robert Nishihara wrote:
>> Thanks Wes.
>>
>> As for Python 3.5, 3.6, and 3.7, I think testing any one of them should be
>> sufficient (I can't recall any errors that happened with one version and
>> not the other).
>>
>> On Mon, Aug 6, 2018 at 12:01 PM Wes McKinney wrote:
>>
>>> @Robert, it looks like NumPy is making LTS releases until Jan 1, 2020
>>>
>>> https://docs.scipy.org/doc/numpy-1.14.0/neps/dropping-python2.7-proposal.html
>>>
>>> Based on this, I think it's fine for us to continue to support Python
>>> 2.7 until then. It's only 16 months away; are you all ready for the
>>> next decade?
>>>
>>> We should also discuss if we want to continue to build and test Python
>>> 3.5. From download statistics it appears that there are 5-10x as many
>>> Python 3.6 users as 3.5. I would prefer to drop 3.5 and begin
>>> supporting 3.7 soon.
>>>
>>> @Antoine, I think we can avoid building the C++ codebase 3 times, but
>>> it will require a bit of retooling of the scripts. The reason that
>>> ccache isn't working properly is probably that the Python include
>>> directory is being included even for compilation units that do not use
>>> the Python C API:
>>> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L721.
>>> I'm opening a JIRA about fixing this:
>>> https://issues.apache.org/jira/browse/ARROW-2994
>>>
>>> Created https://issues.apache.org/jira/browse/ARROW-2995 about
>>> removing the redundant build cycle
>>>
>>> On Mon, Aug 6, 2018 at 2:19 PM, Robert Nishihara wrote:
>>>>> Also, at this point we're sometimes hitting the 50-minute time limit on
>>>>> our slowest Travis-CI matrix job, which means we have to restart it...
>>>>> making the build even slower.
>>>>
>>>> Only a short-term fix, but Travis can lengthen the max build time if you
>>>> email them and ask them to.
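The "Arrow test bot" idea above could start as little more than a comment parser that maps directives to crossbow task groups. A minimal sketch in Python — the command grammar, the group names, and the submit hook are all hypothetical illustrations, not an existing API:

```python
import re

# Hypothetical mapping from bot directives to crossbow task groups.
COMMANDS = {
    "exhaustive": ["manylinux1", "valgrind", "coverage", "asv"],
    "packaging": ["conda", "wheel"],
}

# Matches directives of the form "@arrow-test-bot please run <command>".
BOT_MENTION = re.compile(r"@arrow-test-bot\s+please\s+run\s+(\w+)")

def parse_command(comment_body):
    """Return the crossbow task groups requested in a PR comment,
    or None if the comment does not address the bot."""
    match = BOT_MENTION.search(comment_body)
    if match is None:
        return None
    return COMMANDS.get(match.group(1), [])

def handle_comment(comment_body, submit):
    """Dispatch a PR comment; `submit` would hand the task groups
    to crossbow (stubbed out here)."""
    groups = parse_command(comment_body)
    if groups:
        submit(groups)
```

A real bot would run this from a GitHub webhook handler on issue comments and report results back to the PR, but the parsing/dispatch core stays this small.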
Re: [DISCUSS] Re-think CI strategy?
Thanks Wes.

As for Python 3.5, 3.6, and 3.7, I think testing any one of them should be sufficient (I can't recall any errors that happened with one version and not the other).

On Mon, Aug 6, 2018 at 12:01 PM Wes McKinney wrote:
> @Robert, it looks like NumPy is making LTS releases until Jan 1, 2020
>
> https://docs.scipy.org/doc/numpy-1.14.0/neps/dropping-python2.7-proposal.html
>
> Based on this, I think it's fine for us to continue to support Python
> 2.7 until then. It's only 16 months away; are you all ready for the
> next decade?
>
> We should also discuss if we want to continue to build and test Python
> 3.5. From download statistics it appears that there are 5-10x as many
> Python 3.6 users as 3.5. I would prefer to drop 3.5 and begin
> supporting 3.7 soon.
>
> @Antoine, I think we can avoid building the C++ codebase 3 times, but
> it will require a bit of retooling of the scripts. The reason that
> ccache isn't working properly is probably that the Python include
> directory is being included even for compilation units that do not use
> the Python C API:
> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L721.
> I'm opening a JIRA about fixing this:
> https://issues.apache.org/jira/browse/ARROW-2994
>
> Created https://issues.apache.org/jira/browse/ARROW-2995 about
> removing the redundant build cycle
>
> On Mon, Aug 6, 2018 at 2:19 PM, Robert Nishihara wrote:
>>> Also, at this point we're sometimes hitting the 50-minute time limit on
>>> our slowest Travis-CI matrix job, which means we have to restart it...
>>> making the build even slower.
>>
>> Only a short-term fix, but Travis can lengthen the max build time if you
>> email them and ask them to.
Re: [DISCUSS] Re-think CI strategy?
@Robert, it looks like NumPy is making LTS releases until Jan 1, 2020:

https://docs.scipy.org/doc/numpy-1.14.0/neps/dropping-python2.7-proposal.html

Based on this, I think it's fine for us to continue to support Python 2.7 until then. It's only 16 months away; are you all ready for the next decade?

We should also discuss if we want to continue to build and test Python 3.5. From download statistics it appears that there are 5-10x as many Python 3.6 users as 3.5. I would prefer to drop 3.5 and begin supporting 3.7 soon.

@Antoine, I think we can avoid building the C++ codebase 3 times, but it will require a bit of retooling of the scripts. The reason that ccache isn't working properly is probably that the Python include directory is being included even for compilation units that do not use the Python C API: https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L721. I'm opening a JIRA about fixing this: https://issues.apache.org/jira/browse/ARROW-2994

Created https://issues.apache.org/jira/browse/ARROW-2995 about removing the redundant build cycle.

On Mon, Aug 6, 2018 at 2:19 PM, Robert Nishihara wrote:
>> Also, at this point we're sometimes hitting the 50-minute time limit on
>> our slowest Travis-CI matrix job, which means we have to restart it...
>> making the build even slower.
>
> Only a short-term fix, but Travis can lengthen the max build time if you
> email them and ask them to.
Re: [DISCUSS] Re-think CI strategy?
> Also, at this point we're sometimes hitting the 50-minute time limit on
> our slowest Travis-CI matrix job, which means we have to restart it...
> making the build even slower.

Only a short-term fix, but Travis can lengthen the max build time if you email them and ask them to.
Re: [DISCUSS] Re-think CI strategy?
Also, at this point we're sometimes hitting the 50-minute time limit on our slowest Travis-CI matrix job, which means we have to restart it... making the build even slower.

There's something perhaps suboptimal in the way we build Arrow C++ on Travis:
- first we build it for no particular Python version
- second we build it for Python 2.7
- third we build it for Python 3.6

Even those C++ files that don't depend on Python get re-compiled thrice (and ccache doesn't save us, probably because the compile flags are different).

Regards

Antoine.

On 06/08/2018 19:57, Antoine Pitrou wrote:
>
> Not wanting to answer for Wes, but those are two sides of the same coin:
> reducing CI overhead and complexity helps increase developer
> productivity. Reducing CI overhead is not a goal *in itself* (unless
> there are money issues I don't know about) ;-)
>
> The productivity cost of being Python 2-compatible is not very high
> *currently* (since much of the cost is a sunk cost by now), but these
> things all add up. So at some point we should really drop Python 2.
> Whether it's 2019 or 2020, I don't know and I don't get to decide.
>
> However, anything later than 2020 is excessively conservative IMHO.
>
> Regards
>
> Antoine.
>
> On 06/08/2018 19:46, Robert Nishihara wrote:
>> Wes, do you primarily want to drop Python 2 to speed up Travis or to reduce
>> the development overhead? In my experience the development overhead is
>> minimal and well worth it. For Travis, we could consider looking into other
>> options like paying for more concurrency.
>>
>> January 2019 is very soon and Python 2 is still massively popular.
>>
>> On Mon, Aug 6, 2018 at 5:11 AM Wes McKinney wrote:
>>
>>>> The 40+ minute Travis-CI job already uses the toolchain packages AFAIK.
>>>> Don't they include Thrift?
>>>
>>> I was referring to your comment about "parquet-cpp AppVeyor builds are
>>> abysmally slow". I think the slowness is in significant part due to
>>> the ExternalProject builds, where Thrift is the worst offender.
Re: [DISCUSS] Re-think CI strategy?
Not wanting to answer for Wes, but those are two sides of the same coin: reducing CI overhead and complexity helps increase developer productivity. Reducing CI overhead is not a goal *in itself* (unless there are money issues I don't know about) ;-)

The productivity cost of being Python 2-compatible is not very high *currently* (since much of the cost is a sunk cost by now), but these things all add up. So at some point we should really drop Python 2. Whether it's 2019 or 2020, I don't know and I don't get to decide.

However, anything later than 2020 is excessively conservative IMHO.

Regards

Antoine.

On 06/08/2018 19:46, Robert Nishihara wrote:
> Wes, do you primarily want to drop Python 2 to speed up Travis or to reduce
> the development overhead? In my experience the development overhead is
> minimal and well worth it. For Travis, we could consider looking into other
> options like paying for more concurrency.
>
> January 2019 is very soon and Python 2 is still massively popular.
>
> On Mon, Aug 6, 2018 at 5:11 AM Wes McKinney wrote:
>
>>> The 40+ minute Travis-CI job already uses the toolchain packages AFAIK.
>>> Don't they include Thrift?
>>
>> I was referring to your comment about "parquet-cpp AppVeyor builds are
>> abysmally slow". I think the slowness is in significant part due to
>> the ExternalProject builds, where Thrift is the worst offender.
Re: [DISCUSS] Re-think CI strategy?
Wes, do you primarily want to drop Python 2 to speed up Travis or to reduce the development overhead? In my experience the development overhead is minimal and well worth it. For Travis, we could consider looking into other options like paying for more concurrency.

January 2019 is very soon and Python 2 is still massively popular.

On Mon, Aug 6, 2018 at 5:11 AM Wes McKinney wrote:
>> The 40+ minute Travis-CI job already uses the toolchain packages AFAIK.
>> Don't they include Thrift?
>
> I was referring to your comment about "parquet-cpp AppVeyor builds are
> abysmally slow". I think the slowness is in significant part due to
> the ExternalProject builds, where Thrift is the worst offender.
Re: [DISCUSS] Re-think CI strategy?
> The 40+ minute Travis-CI job already uses the toolchain packages AFAIK.
> Don't they include Thrift?

I was referring to your comment about "parquet-cpp AppVeyor builds are abysmally slow". I think the slowness is in significant part due to the ExternalProject builds, where Thrift is the worst offender.
Re: [DISCUSS] Re-think CI strategy?
hi,

On Mon, Aug 6, 2018 at 7:52 AM, Antoine Pitrou wrote:
>
> On 06/08/2018 13:42, Wes McKinney wrote:
>> hi Antoine,
>>
>> I completely agree. Part of why I've been so consistently pressing for
>> nightly build tooling is to be able to shift more exhaustive testing
>> out of per-commit runs into a daily build or an on-demand build to be
>> invoked by the user either manually or by means of a bot. If you
>> search in JIRA for the term "nightly" you can see a lot of issues that
>> I have created for this already.
>
> My worry with nightly jobs is that they don't clearly pinpoint the
> specific changeset where a regression occurred; also, since they happen
> after merging, there is less incentive to clear any mess introduced by a
> PR; moreover, it makes trying out a fix more difficult, since you don't
> get direct feedback on a commit or PR.
>
> So in exchange for lowering per-commit CI times, nightly jobs require
> extra human care to track regressions and make sure they get fixed.

The way that projects with much more complex testing deal with this is to have a pre-commit test run. The Apache Impala "verify merge" step takes several hours to run, for example. I don't think we are at that point yet. What I'm suggesting is that we be able to write

@arrow-test-bot please run exhaustive

There are a lot of patches that clearly don't require exhaustive testing. In any case, we are likely to accumulate more testing than we can run on every commit / patch, but as long as we have a convenient mechanism to run the tests, then it is OK.

>> it would be useful to be
>> able to validate if desired (and in an automated way) that the NMake
>> build works properly
>
> I guess the main question is why we're testing for NMake at all. CMake
> supports a range of different build tools; we can't exercise all of
> them. So I'd say on each platform we should exercise at most two build
> tools:
> - Ninja, because it's cross-platform and the fastest (which makes it
> desirable for developers *and* for CI)
> - the standard platform-specific build tool, i.e. GNU make on Unix and
> Visual Studio (or "msbuild") on Windows

Well, I would be OK with dropping NMake in favor of supporting Ninja.

>> I think we can also improve CI build times by caching certain
>> toolchain artifacts that are taking a long time to build (Thrift, I'm
>> looking at you).
>
> The 40+ minute Travis-CI job already uses the toolchain packages AFAIK.
> Don't they include Thrift?

The most time-consuming parts of the job are AFAIK:

* The asv benchmarks
* Code coverage uploads
* Using valgrind
* Building multiple versions of Python

These can be addressed respectively by:

* Nightly/opt-in asv testing (via bot)
* Nightly/opt-in coverage
* Nightly valgrind
* No immediate solution. I would like to drop Python 2 in January 2019

- Wes

> (another thought: when do we want to drop Python 2 compatibility?)
>
> Regards
>
> Antoine.
Re: [DISCUSS] Re-think CI strategy?
On 06/08/2018 13:42, Wes McKinney wrote:
> hi Antoine,
>
> I completely agree. Part of why I've been so consistently pressing for
> nightly build tooling is to be able to shift more exhaustive testing
> out of per-commit runs into a daily build or an on-demand build to be
> invoked by the user either manually or by means of a bot. If you
> search in JIRA for the term "nightly" you can see a lot of issues that
> I have created for this already.

My worry with nightly jobs is that they don't clearly pinpoint the specific changeset where a regression occurred; also, since they happen after merging, there is less incentive to clear any mess introduced by a PR; moreover, it makes trying out a fix more difficult, since you don't get direct feedback on a commit or PR.

So in exchange for lowering per-commit CI times, nightly jobs require extra human care to track regressions and make sure they get fixed.

> it would be useful to be
> able to validate if desired (and in an automated way) that the NMake
> build works properly

I guess the main question is why we're testing for NMake at all. CMake supports a range of different build tools; we can't exercise all of them. So I'd say on each platform we should exercise at most two build tools:
- Ninja, because it's cross-platform and the fastest (which makes it desirable for developers *and* for CI)
- the standard platform-specific build tool, i.e. GNU make on Unix and Visual Studio (or "msbuild") on Windows

> I think we can also improve CI build times by caching certain
> toolchain artifacts that are taking a long time to build (Thrift, I'm
> looking at you).

The 40+ minute Travis-CI job already uses the toolchain packages AFAIK. Don't they include Thrift?

(another thought: when do we want to drop Python 2 compatibility?)

Regards

Antoine.
Re: [DISCUSS] Re-think CI strategy?
hi Antoine,

I completely agree. Part of why I've been so consistently pressing for nightly build tooling is to be able to shift more exhaustive testing out of per-commit runs into a daily build or an on-demand build to be invoked by the user either manually or by means of a bot. If you search in JIRA for the term "nightly" you can see a lot of issues that I have created for this already.

In the case of AppVeyor, for example, I don't think we need the exhaustive build matrix on each commit, but it would be useful to be able to validate if desired (and in an automated way) that the NMake build works properly. This could be automated with the crossbow build tool.

I think we can also improve CI build times by caching certain toolchain artifacts that are taking a long time to build (Thrift, I'm looking at you). We can verify that the toolchain will successfully build automatically via ExternalProject in a nightly.

- Wes

On Mon, Aug 6, 2018 at 7:23 AM, Antoine Pitrou wrote:
>
> Hello,
>
> Our CI jobs are taking longer and longer. The main reason seems not to
> be that our test suites become more thorough (running tests actually
> seems to account for a very minor fraction of CI times) but the combined
> fact that 1) fetching dependencies and building is slow and 2) we have
> many configurations tested on our CI jobs.
>
> The slowest job on our Travis-CI configuration routinely takes more than
> 40 minutes (this is for a single job, where everything is essentially
> serialized). As for AppVeyor, the different jobs in a build there are
> mostly serialized, which balloons the total build time.
>
> If we ever move parquet-cpp into the Arrow repo, this will probably
> become worse again (though it might be better for parquet-cpp itself,
> which doesn't seem to have received a lot of care on the CI side: for
> instance, parquet-cpp AppVeyor builds are abysmally slow).
>
> I think we will have to cut down on the number of things we exercise on
> the per-commit CI builds. This can be done along two axes:
> 1) remove some build steps that we deem non-critical or even unimportant
> 2) remove some build configurations entirely (for instance I don't
> understand why we need a "static CRT" build on Windows, or worse, why we
> need to have NMake-based builds at all)
>
> Any thoughts?
>
> Regards
>
> Antoine.
Re: [DISCUSS] Re-think CI strategy?
Hi,

A straightforward way would be to run non-critical CI jobs as nightlies. Nightly package builds work pretty well; see the following link:

https://github.com/kszucs/crossbow/branches/all?query=nightly

The notification logic requires improvement, though. We should also run integration tests regularly (dask, hdfs, spark). If we use a non-Apache queue repository for crossbow, we could even submit jobs to other CI services (e.g. CircleCI) to increase the build parallelism.

Krisztian

On Aug 6 2018, at 1:23 pm, Antoine Pitrou wrote:
>
> Hello,
>
> Our CI jobs are taking longer and longer. The main reason seems not to
> be that our test suites become more thorough (running tests actually
> seems to account for a very minor fraction of CI times) but the combined
> fact that 1) fetching dependencies and building is slow and 2) we have
> many configurations tested on our CI jobs.
>
> The slowest job on our Travis-CI configuration routinely takes more than
> 40 minutes (this is for a single job, where everything is essentially
> serialized). As for AppVeyor, the different jobs in a build there are
> mostly serialized, which balloons the total build time.
>
> If we ever move parquet-cpp into the Arrow repo, this will probably
> become worse again (though it might be better for parquet-cpp itself,
> which doesn't seem to have received a lot of care on the CI side: for
> instance, parquet-cpp AppVeyor builds are abysmally slow).
>
> I think we will have to cut down on the number of things we exercise on
> the per-commit CI builds. This can be done along two axes:
> 1) remove some build steps that we deem non-critical or even unimportant
> 2) remove some build configurations entirely (for instance I don't
> understand why we need a "static CRT" build on Windows, or worse, why we
> need to have NMake-based builds at all)
>
> Any thoughts?
>
> Regards
>
> Antoine.
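The "notification logic" mentioned above could, for instance, diff two nightly status reports and alert only on regressions rather than on every failure. A minimal sketch — the status model (task name mapped to a "success"/"failure" string) is a hypothetical simplification of what a crossbow status poller might report, not crossbow's actual API:

```python
def new_failures(previous, current):
    """Return task names that failed in the current nightly run but not in
    the previous one, so notifications fire only on regressions.

    `previous` and `current` are dicts mapping task names to a status
    string such as "success" or "failure" (hypothetical status model).
    Tasks absent from the previous run count as new failures.
    """
    return sorted(
        task for task, status in current.items()
        if status == "failure" and previous.get(task) != "failure"
    )
```

A nightly cron job could call this with yesterday's and today's reports and email the dev list only when the returned list is non-empty, which keeps the signal-to-noise ratio of notifications high.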
[DISCUSS] Re-think CI strategy?
Hello,

Our CI jobs are taking longer and longer. The main reason seems not to be that our test suites become more thorough (running tests actually seems to account for a very minor fraction of CI times) but the combined fact that 1) fetching dependencies and building is slow and 2) we have many configurations tested on our CI jobs.

The slowest job on our Travis-CI configuration routinely takes more than 40 minutes (this is for a single job, where everything is essentially serialized). As for AppVeyor, the different jobs in a build there are mostly serialized, which balloons the total build time.

If we ever move parquet-cpp into the Arrow repo, this will probably become worse again (though it might be better for parquet-cpp itself, which doesn't seem to have received a lot of care on the CI side: for instance, parquet-cpp AppVeyor builds are abysmally slow).

I think we will have to cut down on the number of things we exercise on the per-commit CI builds. This can be done along two axes:
1) remove some build steps that we deem non-critical or even unimportant
2) remove some build configurations entirely (for instance I don't understand why we need a "static CRT" build on Windows, or worse, why we need to have NMake-based builds at all)

Any thoughts?

Regards

Antoine.