Re: [DISCUSS] Re-think CI strategy?

2018-08-18 Thread Wes McKinney
Our CI is looking much healthier now after recent work (thank you!),
example build:

https://travis-ci.org/apache/arrow/builds/417700344

I think we've bought ourselves a few months at least. We'll have to
see what impact adding a couple more things has on CI health:

* parquet-cpp unit tests (per [1])
* Gandiva build + tests

I suspect at some point in the future we may need a combination of
"fast Travis CI builds" and more exhaustive, longer-running builds in
Jenkins. Projects like Apache Kudu have much more intense testing
procedures, and those run on dedicated infrastructure rather than on
hosted CI.

I also think that more parts of our CI could be handled by creating an
"Arrow test bot" that can respond to directions. There are a number of
frameworks and examples now for writing GitHub bots; we could create a
bot that can execute on-demand tests of optional components using the
crossbow tool.
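Most of the GitHub-specific plumbing for such a bot comes from those
frameworks; the Arrow-specific part is recognizing directives in PR
comments. A minimal sketch of that parsing step (the bot name and
command grammar here are hypothetical, not an existing tool):

```python
import re

# Directives would look like: "@arrow-test-bot please run exhaustive".
# The bot name and the "please run <group>" grammar are placeholders.
BOT_NAME = "arrow-test-bot"
COMMAND_RE = re.compile(
    r"@" + re.escape(BOT_NAME) + r"\s+please\s+run\s+(\w+)", re.IGNORECASE
)

def parse_directive(comment_body):
    """Return the requested test group from a PR comment,
    or None if the comment does not address the bot."""
    match = COMMAND_RE.search(comment_body)
    return match.group(1).lower() if match else None
```

The bot framework would then map the returned group name to a crossbow
task group and post the results back on the PR.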

Other things that we run on every commit, like the Python manylinux1
build, could be run on-demand and nightly. That being said, I just
worked on a PR that broke the manylinux1 build
(https://github.com/apache/arrow/pull/2428), so we risk having to
hunt down the root cause of a broken build if we don't run such tests
on every commit. I'm not sure we can simultaneously have fast CI
builds and catch all possible problems.

- Wes

[1]: 
https://lists.apache.org/thread.html/53f77f9f1f04b97709a0286db1b73a49b7f1541d8f8b2cb32db5c922@%3Cdev.parquet.apache.org%3E

On Tue, Aug 7, 2018 at 2:55 AM, Antoine Pitrou  wrote:
>
> It would be good to test all Python versions in a cron build, but I
> agree we may not need to test all Python 3 versions in per-commit builds.
>
> Regards
>
> Antoine.
>
>
> On 07/08/2018 at 03:14, Robert Nishihara wrote:
>> Thanks Wes.
>>
>> As for Python 3.5, 3.6, and 3.7, I think testing any one of them should be
>> sufficient (I can't recall any errors that happened with one version and
>> not the other).
>>
>> On Mon, Aug 6, 2018 at 12:01 PM Wes McKinney  wrote:
>>
>>> @Robert, it looks like NumPy is making LTS releases until Jan 1, 2020
>>>
>>>
>>> https://docs.scipy.org/doc/numpy-1.14.0/neps/dropping-python2.7-proposal.html
>>>
>>> Based on this, I think it's fine for us to continue to support Python
>>> 2.7 until then. It's only 16 months away; are you all ready for the
>>> next decade?
>>>
>>> We should also discuss if we want to continue to build and test Python
>>> 3.5. From download statistics it appears that there are 5-10x as many
>>> Python 3.6 users as 3.5. I would prefer to drop 3.5 and begin
>>> supporting 3.7 soon.
>>>
>>> @Antoine, I think we can avoid building the C++ codebase 3 times, but
>>> it will require a bit of retooling of the scripts. The reason that
>>> ccache isn't working properly is probably because the Python include
>>> directory is being included even for compilation units that do not use
>>> the Python C API.
>>> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L721.
>>> I'm opening a JIRA about fixing this
>>> https://issues.apache.org/jira/browse/ARROW-2994
>>>
>>> Created https://issues.apache.org/jira/browse/ARROW-2995 about
>>> removing the redundant build cycle
>>>
>>> On Mon, Aug 6, 2018 at 2:19 PM, Robert Nishihara
>>>  wrote:
>
> Also, at this point we're sometimes hitting the 50 minutes time limit on
> our slowest Travis-CI matrix job, which means we have to restart it...
> making the build even slower.
>
 Only a short-term fix, but Travis can lengthen the max build time if you
 email them and ask them to.
>>>
>>


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Robert Nishihara
Thanks Wes.

As for Python 3.5, 3.6, and 3.7, I think testing any one of them should be
sufficient (I can't recall any errors that happened with one version and
not the other).

On Mon, Aug 6, 2018 at 12:01 PM Wes McKinney  wrote:

> @Robert, it looks like NumPy is making LTS releases until Jan 1, 2020
>
>
> https://docs.scipy.org/doc/numpy-1.14.0/neps/dropping-python2.7-proposal.html
>
> Based on this, I think it's fine for us to continue to support Python
> 2.7 until then. It's only 16 months away; are you all ready for the
> next decade?
>
> We should also discuss if we want to continue to build and test Python
> 3.5. From download statistics it appears that there are 5-10x as many
> Python 3.6 users as 3.5. I would prefer to drop 3.5 and begin
> supporting 3.7 soon.
>
> @Antoine, I think we can avoid building the C++ codebase 3 times, but
> it will require a bit of retooling of the scripts. The reason that
> ccache isn't working properly is probably because the Python include
> directory is being included even for compilation units that do not use
> the Python C API.
> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L721.
> I'm opening a JIRA about fixing this
> https://issues.apache.org/jira/browse/ARROW-2994
>
> Created https://issues.apache.org/jira/browse/ARROW-2995 about
> removing the redundant build cycle
>
> On Mon, Aug 6, 2018 at 2:19 PM, Robert Nishihara
>  wrote:
> >>
> >> Also, at this point we're sometimes hitting the 50 minutes time limit on
> >> our slowest Travis-CI matrix job, which means we have to restart it...
> >> making the build even slower.
> >>
> > Only a short-term fix, but Travis can lengthen the max build time if you
> > email them and ask them to.
>


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Wes McKinney
@Robert, it looks like NumPy is making LTS releases until Jan 1, 2020

https://docs.scipy.org/doc/numpy-1.14.0/neps/dropping-python2.7-proposal.html

Based on this, I think it's fine for us to continue to support Python
2.7 until then. It's only 16 months away; are you all ready for the
next decade?

We should also discuss if we want to continue to build and test Python
3.5. From download statistics it appears that there are 5-10x as many
Python 3.6 users as 3.5. I would prefer to drop 3.5 and begin
supporting 3.7 soon.

@Antoine, I think we can avoid building the C++ codebase 3 times, but
it will require a bit of retooling of the scripts. The reason that
ccache isn't working properly is probably because the Python include
directory is being included even for compilation units that do not use
the Python C API.
https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L721.
I'm opening a JIRA about fixing this
https://issues.apache.org/jira/browse/ARROW-2994

Created https://issues.apache.org/jira/browse/ARROW-2995 about
removing the redundant build cycle
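For context, one plausible shape for the ARROW-2994 fix (a sketch only;
the actual target and variable names in Arrow's CMake files may differ)
is to scope the Python include path to the Python-facing target rather
than adding it globally:

```cmake
# A global include_directories(${PYTHON_INCLUDE_DIRS}) puts -I<python> on
# the command line of every compilation unit, so changing the Python
# version changes the compile command for pure-C++ files and defeats
# ccache. Scoping it to the one target that actually uses the Python C
# API avoids that (target and variable names below are illustrative):
target_include_directories(arrow_python SYSTEM PRIVATE
  ${PYTHON_INCLUDE_DIRS}
  ${NUMPY_INCLUDE_DIRS})
```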

On Mon, Aug 6, 2018 at 2:19 PM, Robert Nishihara
 wrote:
>>
>> Also, at this point we're sometimes hitting the 50 minutes time limit on
>> our slowest Travis-CI matrix job, which means we have to restart it...
>> making the build even slower.
>>
> Only a short-term fix, but Travis can lengthen the max build time if you
> email them and ask them to.


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Robert Nishihara
>
> Also, at this point we're sometimes hitting the 50 minutes time limit on
> our slowest Travis-CI matrix job, which means we have to restart it...
> making the build even slower.
>
Only a short-term fix, but Travis can lengthen the max build time if you
email them and ask them to.


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Antoine Pitrou


Also, at this point we're sometimes hitting the 50-minute time limit on
our slowest Travis-CI matrix job, which means we have to restart it...
making the build even slower.

There's something perhaps suboptimal in the way we build Arrow C++ on
Travis:
- first we build it for no particular Python version
- second we build it for Python 2.7
- third we build it for Python 3.6

Even those C++ files that don't depend on Python get re-compiled thrice
(and ccache doesn't save us, probably because the compile flags are
different).
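One way to check that flag-mismatch theory before restructuring the
build would be to zero and print ccache's statistics around each build
phase in the CI script (a sketch; the build paths are illustrative):

```shell
ccache -z                        # zero the statistics counters
cmake --build cpp/build          # first build (no particular Python)
ccache -s                        # cold cache: mostly misses, as expected
ccache -z
cmake --build python/build-2.7   # rebuild for Python 2.7
ccache -s                        # a low hit rate here would confirm that
                                 # differing compile flags defeat the cache
```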

Regards

Antoine.


On 06/08/2018 at 19:57, Antoine Pitrou wrote:
> 
> Not wanting to answer for Wes, but those are two sides of the same coin:
> reducing CI overhead and complexity helps increase developer
> productivity.  Reducing CI overhead is not a goal *in itself* (unless
> there are money issues I don't know about) ;-)
> 
> The productivity cost of being Python 2-compatible is not very high
> *currently* (since much of the cost is a sunk cost by now), but these
> things all add up.  So at some point we should really drop Python 2.
> Whether it's 2019 or 2020, I don't know and I don't get to decide.
> 
> However, anything later than 2020 is excessively conservative IMHO.
> 
> Regards
> 
> Antoine.
> 
> 
> On 06/08/2018 at 19:46, Robert Nishihara wrote:
>> Wes, do you primarily want to drop Python 2 to speed up Travis or to reduce
>> the development overhead? In my experience the development overhead is
>> minimal and well worth it. For Travis, we could consider looking into other
>> options like paying for more concurrency.
>>
>> January 2019 is very soon and Python 2 is still massively popular.
>>
>> On Mon, Aug 6, 2018 at 5:11 AM Wes McKinney  wrote:
>>
 The 40+ minutes Travis-CI job already uses the toolchain packages AFAIK.
  Don't they include thrift?
>>>
>>> I was referring to your comment about "parquet-cpp AppVeyor builds are
>>> abysmally slow". I think the slowness is in significant part due to
>>> the ExternalProject builds, where Thrift is the worst offender.
>>>
>>


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Antoine Pitrou


Not wanting to answer for Wes, but those are two sides of the same coin:
reducing CI overhead and complexity helps increase developer
productivity.  Reducing CI overhead is not a goal *in itself* (unless
there are money issues I don't know about) ;-)

The productivity cost of being Python 2-compatible is not very high
*currently* (since much of the cost is a sunk cost by now), but these
things all add up.  So at some point we should really drop Python 2.
Whether it's 2019 or 2020, I don't know and I don't get to decide.

However, anything later than 2020 is excessively conservative IMHO.

Regards

Antoine.


On 06/08/2018 at 19:46, Robert Nishihara wrote:
> Wes, do you primarily want to drop Python 2 to speed up Travis or to reduce
> the development overhead? In my experience the development overhead is
> minimal and well worth it. For Travis, we could consider looking into other
> options like paying for more concurrency.
> 
> January 2019 is very soon and Python 2 is still massively popular.
> 
> On Mon, Aug 6, 2018 at 5:11 AM Wes McKinney  wrote:
> 
>>> The 40+ minutes Travis-CI job already uses the toolchain packages AFAIK.
>>>  Don't they include thrift?
>>
>> I was referring to your comment about "parquet-cpp AppVeyor builds are
>> abysmally slow". I think the slowness is in significant part due to
>> the ExternalProject builds, where Thrift is the worst offender.
>>
> 


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Robert Nishihara
Wes, do you primarily want to drop Python 2 to speed up Travis or to reduce
the development overhead? In my experience the development overhead is
minimal and well worth it. For Travis, we could consider looking into other
options like paying for more concurrency.

January 2019 is very soon and Python 2 is still massively popular.

On Mon, Aug 6, 2018 at 5:11 AM Wes McKinney  wrote:

> > The 40+ minutes Travis-CI job already uses the toolchain packages AFAIK.
> >  Don't they include thrift?
>
> I was referring to your comment about "parquet-cpp AppVeyor builds are
> abysmally slow". I think the slowness is in significant part due to
> the ExternalProject builds, where Thrift is the worst offender.
>


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Wes McKinney
> The 40+ minutes Travis-CI job already uses the toolchain packages AFAIK.
>  Don't they include thrift?

I was referring to your comment about "parquet-cpp AppVeyor builds are
abysmally slow". I think the slowness is in significant part due to
the ExternalProject builds, where Thrift is the worst offender.


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Wes McKinney
hi,

On Mon, Aug 6, 2018 at 7:52 AM, Antoine Pitrou  wrote:
>
> On 06/08/2018 at 13:42, Wes McKinney wrote:
>> hi Antoine,
>>
>> I completely agree. Part of why I've been so consistently pressing for
>> nightly build tooling is to be able to shift more exhaustive testing
>> out of per-commit runs into a daily build or an on-demand build to be
>> invoked by the user either manually or by means of a bot. If you
>> search in JIRA for the term "nightly" you can see a lot of issues that
>> I have created for this already.
>
> My worry with nightly jobs is that they don't clearly pinpoint the
> specific changeset where a regression occurred; also, since it happens
> after merging, there is less incentive to clear any mess introduced by a
> PR; moreover, it makes trying out a fix more difficult, since you don't
> get direct feedback on a commit or PR.
>
> So in exchange for lowering per-commit CI times, nightly jobs require
> extra human care to track regressions and make sure they get fixed.

The way that projects with much more complex testing deal with this is
to have a pre-commit test run. The Apache Impala "verify merge" step
takes several hours to run, for example. I don't think we are anywhere
near that point yet.

What I'm suggesting is that we be able to write something like:

@arrow-test-bot please run exhaustive

There are a lot of patches that clearly don't require exhaustive
testing. In any case, we are likely to accumulate more testing than we
can run on every commit / patch, but as long as we have a convenient
mechanism to run the tests, that is OK.

>
>> it would be useful to be
>> able to validate if desired (and in an automated way) that the NMake
>> build works properly
>
> I guess the main question is why we're testing for NMake at all.  CMake
> supports a range of different build tools, we can't exercise all of
> them.  So I'd say on each platform we should exercise at most two build
> tools:
>   - Ninja, because it's cross-platform and the fastest (which makes it
> desirable for developers *and* for CI)
>   - the standard platform-specific build tool, i.e. GNU make on Unix and
> Visual Studio (or "msbuild") on Windows

Well, I would be OK with dropping NMake in favor of supporting Ninja.

>
>> I think we can also improve CI build times by caching certain
>> toolchain artifacts that are taking a long time to build (Thrift, I'm
>> looking at you).
>
> The 40+ minutes Travis-CI job already uses the toolchain packages AFAIK.
>  Don't they include thrift?

The most time-consuming parts of the job are, AFAIK:

* The asv benchmarks
* Code coverage uploads
* Using valgrind
* Building multiple versions of Python

These can be addressed respectively by:

* Nightly/opt-in testing asv (via bot)
* Nightly/opt-in coverage
* Nightly valgrind
* No immediate solution. I would like to drop Python 2 in January 2019

- Wes

>
> (another thought: when do we want to drop Python 2 compatibility?)
>
> Regards
>
> Antoine.


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Antoine Pitrou


On 06/08/2018 at 13:42, Wes McKinney wrote:
> hi Antoine,
> 
> I completely agree. Part of why I've been so consistently pressing for
> nightly build tooling is to be able to shift more exhaustive testing
> out of per-commit runs into a daily build or an on-demand build to be
> invoked by the user either manually or by means of a bot. If you
> search in JIRA for the term "nightly" you can see a lot of issues that
> I have created for this already.

My worry with nightly jobs is that they don't clearly pinpoint the
specific changeset where a regression occurred; also, since it happens
after merging, there is less incentive to clear any mess introduced by a
PR; moreover, it makes trying out a fix more difficult, since you don't
get direct feedback on a commit or PR.

So in exchange for lowering per-commit CI times, nightly jobs require
extra human care to track regressions and make sure they get fixed.

> it would be useful to be
> able to validate if desired (and in an automated way) that the NMake
> build works properly

I guess the main question is why we're testing for NMake at all.  CMake
supports a range of different build tools, we can't exercise all of
them.  So I'd say on each platform we should exercise at most two build
tools:
  - Ninja, because it's cross-platform and the fastest (which makes it
desirable for developers *and* for CI)
  - the standard platform-specific build tool, i.e. GNU make on Unix and
Visual Studio (or "msbuild") on Windows

> I think we can also improve CI build times by caching certain
> toolchain artifacts that are taking a long time to build (Thrift, I'm
> looking at you).

The 40+ minute Travis-CI job already uses the toolchain packages AFAIK.
 Don't they include Thrift?

(another thought: when do we want to drop Python 2 compatibility?)

Regards

Antoine.


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Wes McKinney
hi Antoine,

I completely agree. Part of why I've been so consistently pressing for
nightly build tooling is to be able to shift more exhaustive testing
out of per-commit runs into a daily build or an on-demand build to be
invoked by the user either manually or by means of a bot. If you
search in JIRA for the term "nightly" you can see a lot of issues that
I have created for this already.

In the case of Appveyor for example, I don't think we need the
exhaustive build matrix on each commit, but it would be useful to be
able to validate if desired (and in an automated way) that the NMake
build works properly. This could be automated with the crossbow build
tool.

I think we can also improve CI build times by caching certain
toolchain artifacts that are taking a long time to build (Thrift, I'm
looking at you). We can verify that the toolchain will successfully
build automatically via ExternalProject in a nightly.

- Wes

On Mon, Aug 6, 2018 at 7:23 AM, Antoine Pitrou  wrote:
>
> Hello,
>
> Our CI jobs are taking longer and longer.  The main reason seems not to
> be that our test suites become more thorough (running tests actually
> seems to account for a very minor fraction of CI times) but the combined
> fact that 1) fetching dependencies and building is slow 2) we have many
> configurations tested on our CI jobs.
>
> The slowest job on our Travis-CI configuration routinely takes more than 40
> minutes (this is for a single job, where everything is essentially
> serialized).  As for AppVeyor, the different jobs in a build there are
> mostly serialized, which balloons the total build time.
>
> If we ever move parquet-cpp into the Arrow repo, this will probably
> become worse again (though it might be better for parquet-cpp itself,
> which doesn't seem to have received a lot of care on the CI side: for
> instance, parquet-cpp AppVeyor builds are abysmally slow).
>
> I think we will have to cut down on the number of things we exercise on
> the per-build CI builds.  This can be done along two axes:
> 1) remove some build steps that we deem non-critical or even unimportant
> 2) remove some build configurations entirely (for instance I don't
> understand why we need a "static CRT" build on Windows, or worse, why we
> need to have NMake-based builds at all)
>
> Any thoughts?
>
> Regards
>
> Antoine.


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Krisztián Szűcs
Hi,

A straightforward way would be to run non-critical CI jobs as nightlies.
Nightly package builds work pretty well; see
https://github.com/kszucs/crossbow/branches/all?query=nightly
(the notification logic requires improvement, though).
We should also run integration tests regularly (dask, hdfs, spark).
If we use a non-Apache queue repository for crossbow, we could even
submit jobs to other CI services (e.g. CircleCI) to increase the build
parallelism.
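Travis-CI itself can also host such nightlies: cron schedules are
enabled in the repository settings, and a job can be restricted to
cron-triggered builds with a condition. A sketch of what a nightly-only
entry in .travis.yml might look like (the script name is illustrative):

```yaml
matrix:
  include:
    # Runs only when the build was triggered by the cron scheduler;
    # ordinary per-commit and PR builds skip this job entirely.
    - name: "nightly integration tests (dask, hdfs, spark)"
      if: type = cron
      script: ./ci/travis_script_integration_nightly.sh
```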

Krisztian
On Aug 6 2018, at 1:23 pm, Antoine Pitrou  wrote:
>
>
> Hello,
> Our CI jobs are taking longer and longer. The main reason seems not to
> be that our test suites become more thorough (running tests actually
> seems to account for a very minor fraction of CI times) but the combined
> fact that 1) fetching dependencies and building is slow 2) we have many
> configurations tested on our CI jobs.
>
> The slowest job on our Travis-CI configuration routinely takes more than 40
> minutes (this is for a single job, where everything is essentially
> serialized). As for AppVeyor, the different jobs in a build there are
> mostly serialized, which balloons the total build time.
>
> If we ever move parquet-cpp into the Arrow repo, this will probably
> become worse again (though it might be better for parquet-cpp itself,
> which doesn't seem to have received a lot of care on the CI side: for
> instance, parquet-cpp AppVeyor builds are abysmally slow).
>
> I think we will have to cut down on the number of things we exercise on
> the per-build CI builds. This can be done along two axes:
> 1) remove some build steps that we deem non-critical or even unimportant
> 2) remove some build configurations entirely (for instance I don't
> understand why we need a "static CRT" build on Windows, or worse, why we
> need to have NMake-based builds at all)
>
> Any thoughts?
> Regards
> Antoine.

[DISCUSS] Re-think CI strategy?

2018-08-06 Thread Antoine Pitrou


Hello,

Our CI jobs are taking longer and longer.  The main reason seems not to
be that our test suites become more thorough (running tests actually
seems to account for a very minor fraction of CI times) but the combined
fact that 1) fetching dependencies and building is slow 2) we have many
configurations tested on our CI jobs.

The slowest job on our Travis-CI configuration routinely takes more than 40
minutes (this is for a single job, where everything is essentially
serialized).  As for AppVeyor, the different jobs in a build there are
mostly serialized, which balloons the total build time.

If we ever move parquet-cpp into the Arrow repo, this will probably
become worse again (though it might be better for parquet-cpp itself,
which doesn't seem to have received a lot of care on the CI side: for
instance, parquet-cpp AppVeyor builds are abysmally slow).

I think we will have to cut down on the number of things we exercise on
the per-build CI builds.  This can be done along two axes:
1) remove some build steps that we deem non-critical or even unimportant
2) remove some build configurations entirely (for instance I don't
understand why we need a "static CRT" build on Windows, or worse, why we
need to have NMake-based builds at all)

Any thoughts?

Regards

Antoine.