Re: [DISCUSS] Reducing build times

2019-09-05 Thread Robert Metzger
I do have a working Azure setup, yes. E2E tests are not included in the
3.5 hours.

Yesterday, I became aware of a major blocker with Azure pipelines: Apache
Infra does not allow it to be integrated with Apache GitHub repositories,
because it requires write access (for a simple usability feature) [1]. This
means that we "have" to use CiBot for the time being.
I've also reached out to Microsoft to see if they can do anything about it.

+1 for setting it up with CiBot immediately.

[1] https://issues.apache.org/jira/browse/INFRA-17030

On Thu, Sep 5, 2019 at 11:04 AM Chesnay Schepler wrote:

> I assume you already have a working (and verified) azure setup?
>
> Once we're running things on azure on the apache repo people will
> invariably use that as a source of truth because fancy check marks will
> yet again appear on commits. Hence I'm wary of running experiments here;
> I would prefer if we only activate it once things are confirmed to be
> working.
>
> For observation purposes, we could also add it to flink-ci with
> notifications to people who are interested in this experiment.
> This wouldn't impact CiBot.
>
> On 03/09/2019 18:57, Robert Metzger wrote:
> > Hi all,
> >
> > I wanted to give a short update on this:
> > - Arvid, Aljoscha and I have started working on a Gradle PoC, currently
> > working on making all modules compile and test with Gradle. We've also
> > identified some problematic areas (shading being the most obvious one)
> > which we will analyse as part of the PoC.
> > The goal is to see how much Gradle helps to parallelise our build, and to
> > avoid duplicate work (incremental builds).
> >
> > - I am working on setting up a Flink testing infrastructure based on Azure
> > Pipelines, using more powerful hardware. Alibaba kindly provided me with
> > two 32 core machines (temporarily), and another company reached out to me
> > privately, looking into options for cheap, fast machines :)
> > If nobody in the community disagrees, I am going to set up Azure Pipelines
> > with our apache/flink GitHub as a build infrastructure that exists next to
> > Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
> > equally or even more reliable than Travis, and I want to see what the
> > required maintenance work is.
> > On top of that, Azure Pipelines is a very feature-rich tool with a lot of
> > nice options for us to improve the build experience (statistics about tests
> > (flaky tests etc.), nice docker support, plenty of free build resources for
> > open source projects, ...)
> >
> > Best,
> > Robert
> >
> >
> >
> >
> >
> > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:
> >
> >> Hi all,
> >>
> >> I have summarized all arguments mentioned so far + some additional
> >> research into a Wiki page here:
> >>
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> >>
> >> I'm happy to hear further comments on my summary! I'm pretty sure we can
> >> find more pro's and con's for the different options.
> >>
> >> My opinion after looking at the options:
> >>
> >> - Flink relies on an outdated build tool (Maven), while a good
> >> alternative is well-established (gradle), and will likely provide a much
> >> better CI and local build experience through incremental build and cached
> >> intermediates.
> >> Scripting around Maven, or splitting modules / test execution /
> >> repositories won't solve this problem. We should rather spend the effort in
> >> migrating to a modern build tool which will provide us benefits in the long
> >> run.
> >> - Flink relies on a fairly slow build service (Travis CI), while
> >> simply putting more money onto the problem could cut the build time at
> >> least in half.
> >> We should consider using a build service that provides bigger machines
> >> to solve our build time problem.
> >>
> >> My opinion is based on many assumptions (gradle is actually as fast as
> >> promised (haven't used it before), we can build Flink with gradle, we find
> >> sponsors for bigger build machines) that we need to test first through PoCs.
> >>
> >> Best,
> >> Robert
> >>
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek wrote:
> >>
> >>> I did a quick test: a normal "mvn clean install -DskipTests
> >>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine
> >>> takes about 14 minutes. After removing all mentions of maven-shade-plugin
> >>> the build time goes down to roughly 11.5 minutes. (Obviously the resulting
> >>> Flink won’t work, because some expected stuff is not packaged and most of
> >>> the end-to-end tests use the shade plugin to package the jars for testing.)
> >>>
> >>> Aljoscha
> >>>
> >>>> On 18. Aug 2019, at 19:52, Robert Metzger wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I wanted to understand the impact of the hardware we are using for running
> >>>> our tests. Each travis worker has 2 virtual

Re: [DISCUSS] Reducing build times

2019-09-05 Thread Chesnay Schepler

I assume you already have a working (and verified) azure setup?

Once we're running things on azure on the apache repo people will 
invariably use that as a source of truth because fancy check marks will 
yet again appear on commits. Hence I'm wary of running experiments here; 
I would prefer if we only activate it once things are confirmed to be 
working.


For observation purposes, we could also add it to flink-ci with 
notifications to people who are interested in this experiment.

This wouldn't impact CiBot.

On 03/09/2019 18:57, Robert Metzger wrote:

Hi all,

I wanted to give a short update on this:
- Arvid, Aljoscha and I have started working on a Gradle PoC, currently
working on making all modules compile and test with Gradle. We've also
identified some problematic areas (shading being the most obvious one)
which we will analyse as part of the PoC.
The goal is to see how much Gradle helps to parallelise our build, and to
avoid duplicate work (incremental builds).

- I am working on setting up a Flink testing infrastructure based on Azure
Pipelines, using more powerful hardware. Alibaba kindly provided me with
two 32 core machines (temporarily), and another company reached out to me
privately, looking into options for cheap, fast machines :)
If nobody in the community disagrees, I am going to set up Azure Pipelines
with our apache/flink GitHub as a build infrastructure that exists next to
Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
equally or even more reliable than Travis, and I want to see what the
required maintenance work is.
On top of that, Azure Pipelines is a very feature-rich tool with a lot of
nice options for us to improve the build experience (statistics about tests
(flaky tests etc.), nice docker support, plenty of free build resources for
open source projects, ...)

Best,
Robert





On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:


Hi all,

I have summarized all arguments mentioned so far + some additional
research into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279

I'm happy to hear further comments on my summary! I'm pretty sure we can
find more pro's and con's for the different options.

My opinion after looking at the options:

- Flink relies on an outdated build tool (Maven), while a good
alternative is well-established (gradle), and will likely provide a much
better CI and local build experience through incremental build and cached
intermediates.
Scripting around Maven, or splitting modules / test execution /
repositories won't solve this problem. We should rather spend the effort in
migrating to a modern build tool which will provide us benefits in the long
run.
- Flink relies on a fairly slow build service (Travis CI), while
simply putting more money onto the problem could cut the build time at
least in half.
We should consider using a build service that provides bigger machines
to solve our build time problem.

My opinion is based on many assumptions (gradle is actually as fast as
promised (haven't used it before), we can build Flink with gradle, we find
sponsors for bigger build machines) that we need to test first through PoCs.

Best,
Robert




On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek wrote:


I did a quick test: a normal "mvn clean install -DskipTests
-Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine
takes about 14 minutes. After removing all mentions of maven-shade-plugin
the build time goes down to roughly 11.5 minutes. (Obviously the resulting
Flink won’t work, because some expected stuff is not packaged and most of
the end-to-end tests use the shade plugin to package the jars for testing.)

Aljoscha


On 18. Aug 2019, at 19:52, Robert Metzger wrote:

Hi all,

I wanted to understand the impact of the hardware we are using for running
our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
They are using Google Cloud Compute Engine *n1-standard-2* instances.
Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
Running the same workload on a 32 virtual cores, 64 gb machine, takes
*1:21 h*.

What is interesting are the per-module build time differences.
Modules which are parallelizing tests well greatly benefit from the
additional cores:
"flink-tests" 36:51 min vs 4:33 min
"flink-runtime" 23:41 min vs 3:47 min
"flink-table-planner" 15:54 min vs 3:13 min

On the other hand, we have modules which are not parallel at all:
"flink-connector-kafka": 16:32 min vs 15:19 min
"flink-connector-kafka-0.11": 9:52 min vs 7:46 min
Also, the checkstyle plugin is not scaling at all.

Chesnay reported some significant speedups by reusing forks.
I don't know how much effort it would be to make the Kafka tests
parallelizable. In total, they currently use 30 minutes on the big machine
(while 31 CPUs are idling :) )

Let me know what you think about these results. If the 

Re: [DISCUSS] Reducing build times

2019-09-04 Thread Chesnay Schepler
e2e tests on Travis add another 4-5 hours, but we never optimized these 
to make use of the cached Flink artifact.
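
A back-of-the-envelope check of what these numbers would mean for a hosted CI timeout. This is only a minimal Python sketch: it assumes (hypothetically) that the full build and the e2e tests run back to back in a single build, and that the e2e duration measured on Travis carries over to similar free machines. The 3.5 hour build and the 6 hour timeout are the figures quoted elsewhere in this thread; the constant and function names are made up for illustration.

```python
# Figures quoted in this thread, in hours.
BUILD_HOURS = 3.5             # full build on the free machines, no caching
E2E_HOURS_RANGE = (4.0, 5.0)  # e2e tests on Travis, not optimized
TIMEOUT_HOURS = 6.0           # per-build timeout quoted for open source projects


def fits_in_timeout(build_hours: float, e2e_hours: float,
                    timeout_hours: float = TIMEOUT_HOURS) -> bool:
    """Return True if build plus e2e tests, run back to back, stay within the timeout."""
    return build_hours + e2e_hours <= timeout_hours


# Check both ends of the quoted 4-5 hour e2e range.
for e2e_hours in E2E_HOURS_RANGE:
    total = BUILD_HOURS + e2e_hours
    verdict = "fits" if fits_in_timeout(BUILD_HOURS, e2e_hours) else "exceeds the timeout"
    print(f"build {BUILD_HOURS} h + e2e {e2e_hours} h = {total} h: {verdict}")
```

Under those assumptions both ends of the 4-5 hour range exceed the 6 hour limit, so the e2e tests would have to run as separate builds or get faster, for example by reusing the cached Flink artifact.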


On 04/09/2019 13:26, Till Rohrmann wrote:

How long do we need to run all e2e tests? They are not included in the 3.5
hours, I assume.

Cheers,
Till

On Wed, Sep 4, 2019 at 11:59 AM Robert Metzger wrote:


Yes, we can ensure the same (or better) experience for contributors.

On the powerful machines, builds finish in 1.5 hours (without any caching
enabled).

Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours for a
build for open source projects. Flink needs 3.5 hours on that infra (not
parallelized at all, no caching). These free machines are very similar to
those of Travis, so I expect no build time regressions, if we set it up
similarly.


On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler wrote:


Will using more powerful machines for the project make it more difficult
to ensure that contributor builds are still running in a reasonable time?

As an example of this happening on Travis, contributors currently cannot
run all e2e tests since they timeout, but on apache we have a larger
timeout.

On 03/09/2019 18:57, Robert Metzger wrote:

Hi all,

I wanted to give a short update on this:
- Arvid, Aljoscha and I have started working on a Gradle PoC, currently
working on making all modules compile and test with Gradle. We've also
identified some problematic areas (shading being the most obvious one)
which we will analyse as part of the PoC.
The goal is to see how much Gradle helps to parallelise our build, and to
avoid duplicate work (incremental builds).

- I am working on setting up a Flink testing infrastructure based on Azure
Pipelines, using more powerful hardware. Alibaba kindly provided me with
two 32 core machines (temporarily), and another company reached out to me
privately, looking into options for cheap, fast machines :)
If nobody in the community disagrees, I am going to set up Azure Pipelines
with our apache/flink GitHub as a build infrastructure that exists next to
Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
equally or even more reliable than Travis, and I want to see what the
required maintenance work is.
On top of that, Azure Pipelines is a very feature-rich tool with a lot of
nice options for us to improve the build experience (statistics about tests
(flaky tests etc.), nice docker support, plenty of free build resources for
open source projects, ...)

Best,
Robert





On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:

Hi all,

I have summarized all arguments mentioned so far + some additional
research into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279

I'm happy to hear further comments on my summary! I'm pretty sure we can
find more pro's and con's for the different options.

My opinion after looking at the options:

- Flink relies on an outdated build tool (Maven), while a good
alternative is well-established (gradle), and will likely provide a much
better CI and local build experience through incremental build and cached
intermediates.
Scripting around Maven, or splitting modules / test execution /
repositories won't solve this problem. We should rather spend the effort in
migrating to a modern build tool which will provide us benefits in the long
run.
- Flink relies on a fairly slow build service (Travis CI), while
simply putting more money onto the problem could cut the build time at
least in half.
We should consider using a build service that provides bigger machines
to solve our build time problem.

My opinion is based on many assumptions (gradle is actually as fast as
promised (haven't used it before), we can build Flink with gradle, we find
sponsors for bigger build machines) that we need to test first through PoCs.

Best,
Robert




On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <aljos...@apache.org> wrote:


I did a quick test: a normal "mvn clean install -DskipTests
-Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine
takes about 14 minutes. After removing all mentions of maven-shade-plugin
the build time goes down to roughly 11.5 minutes. (Obviously the resulting
Flink won’t work, because some expected stuff is not packaged and most of
the end-to-end tests use the shade plugin to package the jars for testing.)

Aljoscha


On 18. Aug 2019, at 19:52, Robert Metzger wrote:

Hi all,

I wanted to understand the impact of the hardware we are using for running
our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
They are using Google Cloud Compute Engine *n1-standard-2* instances.
Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
Running the same workload on a 32 virtual cores, 64 gb machine, takes
*1:21 h*.

What is interesting are the per-module build time differences.
Modules which 

Re: [DISCUSS] Reducing build times

2019-09-04 Thread Till Rohrmann
How long do we need to run all e2e tests? They are not included in the 3.5
hours, I assume.

Cheers,
Till

On Wed, Sep 4, 2019 at 11:59 AM Robert Metzger wrote:

> Yes, we can ensure the same (or better) experience for contributors.
>
> On the powerful machines, builds finish in 1.5 hours (without any caching
> enabled).
>
> Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours for a
> build for open source projects. Flink needs 3.5 hours on that infra (not
> parallelized at all, no caching). These free machines are very similar to
> those of Travis, so I expect no build time regressions, if we set it up
> similarly.
>
>
> On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler wrote:
>
> > Will using more powerful machines for the project make it more difficult
> > to ensure that contributor builds are still running in a reasonable time?
> >
> > As an example of this happening on Travis, contributors currently cannot
> > run all e2e tests since they timeout, but on apache we have a larger
> > timeout.
> >
> > On 03/09/2019 18:57, Robert Metzger wrote:
> > > Hi all,
> > >
> > > I wanted to give a short update on this:
> > > - Arvid, Aljoscha and I have started working on a Gradle PoC, currently
> > > working on making all modules compile and test with Gradle. We've also
> > > identified some problematic areas (shading being the most obvious one)
> > > which we will analyse as part of the PoC.
> > > The goal is to see how much Gradle helps to parallelise our build, and to
> > > avoid duplicate work (incremental builds).
> > >
> > > - I am working on setting up a Flink testing infrastructure based on Azure
> > > Pipelines, using more powerful hardware. Alibaba kindly provided me with
> > > two 32 core machines (temporarily), and another company reached out to me
> > > privately, looking into options for cheap, fast machines :)
> > > If nobody in the community disagrees, I am going to set up Azure Pipelines
> > > with our apache/flink GitHub as a build infrastructure that exists next to
> > > Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
> > > equally or even more reliable than Travis, and I want to see what the
> > > required maintenance work is.
> > > On top of that, Azure Pipelines is a very feature-rich tool with a lot of
> > > nice options for us to improve the build experience (statistics about tests
> > > (flaky tests etc.), nice docker support, plenty of free build resources for
> > > open source projects, ...)
> > >
> > > Best,
> > > Robert
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:
> > >
> > >> Hi all,
> > >>
> > >> I have summarized all arguments mentioned so far + some additional
> > >> research into a Wiki page here:
> > >>
> > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> > >>
> > >> I'm happy to hear further comments on my summary! I'm pretty sure we can
> > >> find more pro's and con's for the different options.
> > >>
> > >> My opinion after looking at the options:
> > >>
> > >> - Flink relies on an outdated build tool (Maven), while a good
> > >> alternative is well-established (gradle), and will likely provide a much
> > >> better CI and local build experience through incremental build and cached
> > >> intermediates.
> > >> Scripting around Maven, or splitting modules / test execution /
> > >> repositories won't solve this problem. We should rather spend the effort in
> > >> migrating to a modern build tool which will provide us benefits in the long
> > >> run.
> > >> - Flink relies on a fairly slow build service (Travis CI), while
> > >> simply putting more money onto the problem could cut the build time at
> > >> least in half.
> > >> We should consider using a build service that provides bigger machines
> > >> to solve our build time problem.
> > >>
> > >> My opinion is based on many assumptions (gradle is actually as fast as
> > >> promised (haven't used it before), we can build Flink with gradle, we find
> > >> sponsors for bigger build machines) that we need to test first through PoCs.
> > >>
> > >> Best,
> > >> Robert
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <aljos...@apache.org>
> > >> wrote:
> > >>
> > >>> I did a quick test: a normal "mvn clean install -DskipTests
> > >>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine
> > >>> takes about 14 minutes. After removing all mentions of maven-shade-plugin
> > >>> the build time goes down to roughly 11.5 minutes. (Obviously the resulting
> > >>> Flink won’t work, because some expected stuff is not packaged and most of
> > >>> the end-to-end tests use the shade plugin to package the jars for testing.)
> > >>>
> > >>> Aljoscha
> > >>>
> > >>>> On 18. Aug 2019, at 19:52, Robert Metzger wrote:
> > >>>>
> > >>>> Hi

Re: [DISCUSS] Reducing build times

2019-09-04 Thread Robert Metzger
Yes, we can ensure the same (or better) experience for contributors.

On the powerful machines, builds finish in 1.5 hours (without any caching
enabled).

Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours for a
build for open source projects. Flink needs 3.5 hours on that infra (not
parallelized at all, no caching). These free machines are very similar to
those of Travis, so I expect no build time regressions, if we set it up
similarly.


On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler wrote:

> Will using more powerful machines for the project make it more difficult
> to ensure that contributor builds are still running in a reasonable time?
>
> As an example of this happening on Travis, contributors currently cannot
> run all e2e tests since they timeout, but on apache we have a larger
> timeout.
>
> On 03/09/2019 18:57, Robert Metzger wrote:
> > Hi all,
> >
> > I wanted to give a short update on this:
> > - Arvid, Aljoscha and I have started working on a Gradle PoC, currently
> > working on making all modules compile and test with Gradle. We've also
> > identified some problematic areas (shading being the most obvious one)
> > which we will analyse as part of the PoC.
> > The goal is to see how much Gradle helps to parallelise our build, and to
> > avoid duplicate work (incremental builds).
> >
> > - I am working on setting up a Flink testing infrastructure based on Azure
> > Pipelines, using more powerful hardware. Alibaba kindly provided me with
> > two 32 core machines (temporarily), and another company reached out to me
> > privately, looking into options for cheap, fast machines :)
> > If nobody in the community disagrees, I am going to set up Azure Pipelines
> > with our apache/flink GitHub as a build infrastructure that exists next to
> > Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
> > equally or even more reliable than Travis, and I want to see what the
> > required maintenance work is.
> > On top of that, Azure Pipelines is a very feature-rich tool with a lot of
> > nice options for us to improve the build experience (statistics about tests
> > (flaky tests etc.), nice docker support, plenty of free build resources for
> > open source projects, ...)
> >
> > Best,
> > Robert
> >
> >
> >
> >
> >
> > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:
> >
> >> Hi all,
> >>
> >> I have summarized all arguments mentioned so far + some additional
> >> research into a Wiki page here:
> >>
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> >>
> >> I'm happy to hear further comments on my summary! I'm pretty sure we can
> >> find more pro's and con's for the different options.
> >>
> >> My opinion after looking at the options:
> >>
> >> - Flink relies on an outdated build tool (Maven), while a good
> >> alternative is well-established (gradle), and will likely provide a much
> >> better CI and local build experience through incremental build and cached
> >> intermediates.
> >> Scripting around Maven, or splitting modules / test execution /
> >> repositories won't solve this problem. We should rather spend the effort in
> >> migrating to a modern build tool which will provide us benefits in the long
> >> run.
> >> - Flink relies on a fairly slow build service (Travis CI), while
> >> simply putting more money onto the problem could cut the build time at
> >> least in half.
> >> We should consider using a build service that provides bigger machines
> >> to solve our build time problem.
> >>
> >> My opinion is based on many assumptions (gradle is actually as fast as
> >> promised (haven't used it before), we can build Flink with gradle, we find
> >> sponsors for bigger build machines) that we need to test first through PoCs.
> >>
> >> Best,
> >> Robert
> >>
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek wrote:
> >>
> >>> I did a quick test: a normal "mvn clean install -DskipTests
> >>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine
> >>> takes about 14 minutes. After removing all mentions of maven-shade-plugin
> >>> the build time goes down to roughly 11.5 minutes. (Obviously the resulting
> >>> Flink won’t work, because some expected stuff is not packaged and most of
> >>> the end-to-end tests use the shade plugin to package the jars for testing.)
> >>>
> >>> Aljoscha
> >>>
> >>>> On 18. Aug 2019, at 19:52, Robert Metzger wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I wanted to understand the impact of the hardware we are using for running
> >>>> our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
> >>>> They are using Google Cloud Compute Engine *n1-standard-2* instances.
> >>>> Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> >>>> Running the same workload on a 32 virtual cores, 64 gb machine, takes
> >>>> *1:21 h*.
> >>>>

Re: [DISCUSS] Reducing build times

2019-09-04 Thread Chesnay Schepler
Will using more powerful machines for the project make it more difficult
to ensure that contributor builds are still running in a reasonable time?


As an example of this happening on Travis, contributors currently cannot 
run all e2e tests since they timeout, but on apache we have a larger 
timeout.


On 03/09/2019 18:57, Robert Metzger wrote:

Hi all,

I wanted to give a short update on this:
- Arvid, Aljoscha and I have started working on a Gradle PoC, currently
working on making all modules compile and test with Gradle. We've also
identified some problematic areas (shading being the most obvious one)
which we will analyse as part of the PoC.
The goal is to see how much Gradle helps to parallelise our build, and to
avoid duplicate work (incremental builds).

- I am working on setting up a Flink testing infrastructure based on Azure
Pipelines, using more powerful hardware. Alibaba kindly provided me with
two 32 core machines (temporarily), and another company reached out to me
privately, looking into options for cheap, fast machines :)
If nobody in the community disagrees, I am going to set up Azure Pipelines
with our apache/flink GitHub as a build infrastructure that exists next to
Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
equally or even more reliable than Travis, and I want to see what the
required maintenance work is.
On top of that, Azure Pipelines is a very feature-rich tool with a lot of
nice options for us to improve the build experience (statistics about tests
(flaky tests etc.), nice docker support, plenty of free build resources for
open source projects, ...)

Best,
Robert





On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:


Hi all,

I have summarized all arguments mentioned so far + some additional
research into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279

I'm happy to hear further comments on my summary! I'm pretty sure we can
find more pro's and con's for the different options.

My opinion after looking at the options:

- Flink relies on an outdated build tool (Maven), while a good
alternative is well-established (gradle), and will likely provide a much
better CI and local build experience through incremental build and cached
intermediates.
Scripting around Maven, or splitting modules / test execution /
repositories won't solve this problem. We should rather spend the effort in
migrating to a modern build tool which will provide us benefits in the long
run.
- Flink relies on a fairly slow build service (Travis CI), while
simply putting more money onto the problem could cut the build time at
least in half.
We should consider using a build service that provides bigger machines
to solve our build time problem.

My opinion is based on many assumptions (gradle is actually as fast as
promised (haven't used it before), we can build Flink with gradle, we find
sponsors for bigger build machines) that we need to test first through PoCs.

Best,
Robert




On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek wrote:


I did a quick test: a normal "mvn clean install -DskipTests
-Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine
takes about 14 minutes. After removing all mentions of maven-shade-plugin
the build time goes down to roughly 11.5 minutes. (Obviously the resulting
Flink won’t work, because some expected stuff is not packaged and most of
the end-to-end tests use the shade plugin to package the jars for testing.)
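
To put the two timings in proportion, here is a minimal Python sketch of the arithmetic. It uses only the two rounded figures from this test (about 14 minutes with shading, roughly 11.5 minutes without); the function and variable names are hypothetical, and a single local run is of course not a benchmark.

```python
def shading_share(with_shade_min: float, without_shade_min: float) -> float:
    """Fraction of the build time that disappears when shading is removed."""
    return (with_shade_min - without_shade_min) / with_shade_min


with_shade = 14.0     # "about 14 minutes", maven-shade-plugin enabled
without_shade = 11.5  # "roughly 11.5 minutes", all mentions of the plugin removed

saved = with_shade - without_shade
share = shading_share(with_shade, without_shade)
print(f"saved {saved:.1f} min, {share:.0%} of the build")
```

That is about 2.5 minutes, a bit under a fifth of this test-free local build.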

Aljoscha


On 18. Aug 2019, at 19:52, Robert Metzger wrote:

Hi all,

I wanted to understand the impact of the hardware we are using for running
our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
They are using Google Cloud Compute Engine *n1-standard-2* instances.
Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
Running the same workload on a 32 virtual cores, 64 gb machine, takes
*1:21 h*.

What is interesting are the per-module build time differences.
Modules which are parallelizing tests well greatly benefit from the
additional cores:
"flink-tests" 36:51 min vs 4:33 min
"flink-runtime" 23:41 min vs 3:47 min
"flink-table-planner" 15:54 min vs 3:13 min

On the other hand, we have modules which are not parallel at all:
"flink-connector-kafka": 16:32 min vs 15:19 min
"flink-connector-kafka-0.11": 9:52 min vs 7:46 min
Also, the checkstyle plugin is not scaling at all.

Chesnay reported some significant speedups by reusing forks.
I don't know how much effort it would be to make the Kafka tests
parallelizable. In total, they currently use 30 minutes on the big machine
(while 31 CPUs are idling :) )
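
The speedup factors behind these timings can be worked out with a minimal Python sketch like the one below. The durations are copied from the lists above; the helper names are hypothetical, and the overall figures (03:32 h and 1:21 h) are read as hours:minutes, which is an assumption about the notation.

```python
def to_units(duration: str) -> int:
    """Convert 'A:B' into the smaller unit (min:sec -> seconds, h:min -> minutes)."""
    total = 0
    for part in duration.split(":"):
        total = total * 60 + int(part)
    return total


def speedup(small_machine: str, big_machine: str) -> float:
    """How many times faster the 32-core machine is than the 2-core worker."""
    return to_units(small_machine) / to_units(big_machine)


# (module, time on the 2-core Travis worker, time on the 32-core machine), min:sec
MODULE_TIMES = [
    ("flink-tests", "36:51", "4:33"),
    ("flink-runtime", "23:41", "3:47"),
    ("flink-table-planner", "15:54", "3:13"),
    ("flink-connector-kafka", "16:32", "15:19"),
    ("flink-connector-kafka-0.11", "9:52", "7:46"),
]

for module, small, big in MODULE_TIMES:
    print(f"{module:28s} {speedup(small, big):4.1f}x")

# Full "mvn clean verify", given as hours:minutes.
print(f"{'full build':28s} {speedup('3:32', '1:21'):4.1f}x")
```

flink-tests comes out at roughly 8x and flink-runtime at roughly 6x, while flink-connector-kafka stays close to 1.1x, against about 2.6x for the whole build.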

Let me know what you think about these results. If the community is
generally interested in further investigating into that direction, I could
look into software to orchestrate this, as well as sponsors for such an
infrastructure.

[1] 

Re: [DISCUSS] Reducing build times

2019-09-03 Thread Arvid Heise
+1 for Azure Pipelines. I have had very good experiences with it in the past,
and its open source and payment models are much better.

The upcoming GitHub CI/CD also seems like a promising alternative, but at
first glance it looks like the little brother of Azure Pipelines. So any
effort going into Azure Pipelines probably also carries over in that
direction.

Best,

Arvid

On Tue, Sep 3, 2019 at 6:57 PM Robert Metzger wrote:

> Hi all,
>
> I wanted to give a short update on this:
> - Arvid, Aljoscha and I have started working on a Gradle PoC, currently
> working on making all modules compile and test with Gradle. We've also
> identified some problematic areas (shading being the most obvious one)
> which we will analyse as part of the PoC.
> The goal is to see how much Gradle helps to parallelise our build, and to
> avoid duplicate work (incremental builds).
>
> - I am working on setting up a Flink testing infrastructure based on Azure
> Pipelines, using more powerful hardware. Alibaba kindly provided me with
> two 32 core machines (temporarily), and another company reached out to me
> privately, looking into options for cheap, fast machines :)
> If nobody in the community disagrees, I am going to set up Azure Pipelines
> with our apache/flink GitHub as a build infrastructure that exists next to
> Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
> equally or even more reliable than Travis, and I want to see what the
> required maintenance work is.
> On top of that, Azure Pipelines is a very feature-rich tool with a lot of
> nice options for us to improve the build experience (statistics about tests
> (flaky tests etc.), nice docker support, plenty of free build resources for
> open source projects, ...)
>
> Best,
> Robert
>
>
>
>
>
> On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:
>
> > Hi all,
> >
> > I have summarized all arguments mentioned so far + some additional
> > research into a Wiki page here:
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> >
> > I'm happy to hear further comments on my summary! I'm pretty sure we can
> > find more pro's and con's for the different options.
> >
> > My opinion after looking at the options:
> >
> >- Flink relies on an outdated build tool (Maven), while a good
> >alternative is well-established (gradle), and will likely provide a
> much
> >better CI and local build experience through incremental build and
> cached
> >intermediates.
> >Scripting around Maven, or splitting modules / test execution /
> >repositories won't solve this problem. We should rather spend the
> effort in
> >migrating to a modern build tool which will provide us benefits in
> the long
> >run.
> >- Flink relies on a fairly slow build service (Travis CI), while
> >simply putting more money onto the problem could cut the build time at
> >least in half.
> >We should consider using a build service that provides bigger machines
> >to solve our build time problem.
> >
> > My opinion is based on many assumptions (gradle is actually as fast as
> > promised (haven't used it before), we can build Flink with gradle, we
> find
> > sponsors for bigger build machines) that we need to test first through
> PoCs.
> >
> > Best,
> > Robert
> >
> >
> >
> >
> > On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek 
> > wrote:
> >
> >> I did a quick test: a normal "mvn clean install -DskipTests
> >> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo” on my
> machine
> >> takes about 14 minutes. After removing all mentions of
> maven-shade-plugin
> >> the build time goes down to roughly 11.5 minutes. (Obviously the
> resulting
> >> Flink won’t work, because some expected stuff is not packaged and most
> of
> >> the end-to-end tests use the shade plugin to package the jars for
> testing.
> >>
> >> Aljoscha
> >>
> >> > On 18. Aug 2019, at 19:52, Robert Metzger 
> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > I wanted to understand the impact of the hardware we are using for
> >> running
> >> > our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory
> >> [1].
> >> > They are using Google Cloud Compute Engine *n1-standard-2* instances.
> >> > Running a full "mvn clean verify" takes *03:32 h* on such a machine
> >> type.
> >> >
> >> > Running the same workload on a 32 virtual cores, 64 gb machine, takes
> >> *1:21
> >> > h*.
> >> >
> >> > What is interesting are the per-module build time differences.
> >> > Modules which are parallelizing tests well greatly benefit from the
> >> > additional cores:
> >> > "flink-tests" 36:51 min vs 4:33 min
> >> > "flink-runtime" 23:41 min vs 3:47 min
> >> > "flink-table-planner" 15:54 min vs 3:13 min
> >> >
> >> > On the other hand, we have modules which are not parallel at all:
> >> > "flink-connector-kafka": 16:32 min vs 15:19 min
> >> > "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> >> > Also, the checkstyle plugin is not scaling at all.
> >> >
> >> > 

Re: [DISCUSS] Reducing build times

2019-09-03 Thread Robert Metzger
Hi all,

I wanted to give a short update on this:
- Arvid, Aljoscha and I have started working on a Gradle PoC, currently
working on making all modules compile and test with Gradle. We've also
identified some problematic areas (shading being the most obvious one)
which we will analyse as part of the PoC.
The goal is to see how much Gradle helps to parallelise our build, and to
avoid duplicate work (incremental builds).

- I am working on setting up a Flink testing infrastructure based on Azure
Pipelines, using more powerful hardware. Alibaba kindly provided me with
two 32 core machines (temporarily), and another company reached out to me
privately, looking into options for cheap, fast machines :)
If nobody in the community disagrees, I am going to set up Azure Pipelines
with our apache/flink GitHub as a build infrastructure that exists next to
Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
equally or even more reliable than Travis, and I want to see what the
required maintenance work is.
On top of that, Azure Pipelines is a very feature-rich tool with a lot of
nice options for us to improve the build experience (statistics about tests
(flaky tests etc.), nice docker support, plenty of free build resources for
open source projects, ...)

Best,
Robert





On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger  wrote:

> Hi all,
>
> I have summarized all arguments mentioned so far + some additional
> research into a Wiki page here:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
>
> I'm happy to hear further comments on my summary! I'm pretty sure we can
> find more pro's and con's for the different options.
>
> My opinion after looking at the options:
>
>- Flink relies on an outdated build tool (Maven), while a good
>alternative is well-established (gradle), and will likely provide a much
>better CI and local build experience through incremental build and cached
>intermediates.
>Scripting around Maven, or splitting modules / test execution /
>repositories won't solve this problem. We should rather spend the effort in
>migrating to a modern build tool which will provide us benefits in the long
>run.
>- Flink relies on a fairly slow build service (Travis CI), while
>simply putting more money onto the problem could cut the build time at
>least in half.
>We should consider using a build service that provides bigger machines
>to solve our build time problem.
>
> My opinion is based on many assumptions (gradle is actually as fast as
> promised (haven't used it before), we can build Flink with gradle, we find
> sponsors for bigger build machines) that we need to test first through PoCs.
>
> Best,
> Robert
>
>
>
>
> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek 
> wrote:
>
>> I did a quick test: a normal "mvn clean install -DskipTests
>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo” on my machine
>> takes about 14 minutes. After removing all mentions of maven-shade-plugin
>> the build time goes down to roughly 11.5 minutes. (Obviously the resulting
>> Flink won’t work, because some expected stuff is not packaged and most of
>> the end-to-end tests use the shade plugin to package the jars for testing.
>>
>> Aljoscha
>>
>> > On 18. Aug 2019, at 19:52, Robert Metzger  wrote:
>> >
>> > Hi all,
>> >
>> > I wanted to understand the impact of the hardware we are using for
>> running
>> > our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory
>> [1].
>> > They are using Google Cloud Compute Engine *n1-standard-2* instances.
>> > Running a full "mvn clean verify" takes *03:32 h* on such a machine
>> type.
>> >
>> > Running the same workload on a 32 virtual cores, 64 gb machine, takes
>> *1:21
>> > h*.
>> >
>> > What is interesting are the per-module build time differences.
>> > Modules which are parallelizing tests well greatly benefit from the
>> > additional cores:
>> > "flink-tests" 36:51 min vs 4:33 min
>> > "flink-runtime" 23:41 min vs 3:47 min
>> > "flink-table-planner" 15:54 min vs 3:13 min
>> >
>> > On the other hand, we have modules which are not parallel at all:
>> > "flink-connector-kafka": 16:32 min vs 15:19 min
>> > "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
>> > Also, the checkstyle plugin is not scaling at all.
>> >
>> > Chesnay reported some significant speedups by reusing forks.
>> > I don't know how much effort it would be to make the Kafka tests
>> > parallelizable. In total, they currently use 30 minutes on the big
>> machine
>> > (while 31 CPUs are idling :) )
>> >
>> > Let me know what you think about these results. If the community is
>> > generally interested in further investigating into that direction, I
>> could
>> > look into software to orchestrate this, as well as sponsors for such an
>> > infrastructure.
>> >
>> > [1] https://docs.travis-ci.com/user/reference/overview/
>> >
>> >
>> > On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler 
>> wrote:
>> >
>> >> 

Re: [DISCUSS] Reducing build times

2019-08-19 Thread Robert Metzger
Hi all,

I have summarized all arguments mentioned so far + some additional research
into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279

I'm happy to hear further comments on my summary! I'm pretty sure we can
find more pros and cons for the different options.

My opinion after looking at the options:

   - Flink relies on an outdated build tool (Maven), while a good
   alternative is well-established (gradle), and will likely provide a much
   better CI and local build experience through incremental build and cached
   intermediates.
   Scripting around Maven, or splitting modules / test execution /
   repositories won't solve this problem. We should rather spend the effort in
   migrating to a modern build tool which will provide us benefits in the long
   run.
   - Flink relies on a fairly slow build service (Travis CI), while simply
   throwing more money at the problem could cut the build time at least in
   half.
   We should consider using a build service that provides bigger machines
   to solve our build time problem.

My opinion is based on many assumptions (gradle is actually as fast as
promised (haven't used it before), we can build Flink with gradle, we find
sponsors for bigger build machines) that we need to test first through PoCs.

Best,
Robert




On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek 
wrote:

> I did a quick test: a normal "mvn clean install -DskipTests
> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo” on my machine
> takes about 14 minutes. After removing all mentions of maven-shade-plugin
> the build time goes down to roughly 11.5 minutes. (Obviously the resulting
> Flink won’t work, because some expected stuff is not packaged and most of
> the end-to-end tests use the shade plugin to package the jars for testing.
>
> Aljoscha
>
> > On 18. Aug 2019, at 19:52, Robert Metzger  wrote:
> >
> > Hi all,
> >
> > I wanted to understand the impact of the hardware we are using for
> running
> > our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
> > They are using Google Cloud Compute Engine *n1-standard-2* instances.
> > Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> >
> > Running the same workload on a 32 virtual cores, 64 gb machine, takes
> *1:21
> > h*.
> >
> > What is interesting are the per-module build time differences.
> > Modules which are parallelizing tests well greatly benefit from the
> > additional cores:
> > "flink-tests" 36:51 min vs 4:33 min
> > "flink-runtime" 23:41 min vs 3:47 min
> > "flink-table-planner" 15:54 min vs 3:13 min
> >
> > On the other hand, we have modules which are not parallel at all:
> > "flink-connector-kafka": 16:32 min vs 15:19 min
> > "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> > Also, the checkstyle plugin is not scaling at all.
> >
> > Chesnay reported some significant speedups by reusing forks.
> > I don't know how much effort it would be to make the Kafka tests
> > parallelizable. In total, they currently use 30 minutes on the big
> machine
> > (while 31 CPUs are idling :) )
> >
> > Let me know what you think about these results. If the community is
> > generally interested in further investigating into that direction, I
> could
> > look into software to orchestrate this, as well as sponsors for such an
> > infrastructure.
> >
> > [1] https://docs.travis-ci.com/user/reference/overview/
> >
> >
> > On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler 
> wrote:
> >
> >> @Aljoscha Shading takes a few minutes for a full build; you can see this
> >> quite easily by looking at the compile step in the misc profile
> >> ; all modules that
> >> longer than a fraction of a section are usually caused by shading lots
> >> of classes. Note that I cannot tell you how much of this is spent on
> >> relocations, and how much on writing the jar.
> >>
> >> Personally, I'd very much like us to move all shading to flink-shaded;
> >> this would finally allows us to use newer maven versions without needing
> >> cumbersome workarounds for flink-dist. However, this isn't a trivial
> >> affair in some cases; IIRC calcite could be difficult to handle.
> >>
> >> On another note, this would also simplify switching the main repo to
> >> another build system, since you would no longer had to deal with
> >> relocations, just packaging + merging NOTICE files.
> >>
> >> @BowenLi I disagree, flink-shaded does not include any tests,  API
> >> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
> >> and flink-dist, where both relocate dependencies and one is bundled by
> >> the other), and, most importantly, CI (and really, without CI being
> >> covered in a PoC there's nothing to discuss).
> >>
> >> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> >>> Speaking of flink-shaded, do we have any idea what the impact of
> shading
> >> is on the build time? We could get rid of shading completely 

Re: [DISCUSS] Reducing build times

2019-08-19 Thread Aljoscha Krettek
I did a quick test: a normal "mvn clean install -DskipTests -Drat.skip=true
-Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine takes about 14
minutes. After removing all mentions of maven-shade-plugin the build time goes
down to roughly 11.5 minutes. (Obviously the resulting Flink won’t work,
because some expected stuff is not packaged and most of the end-to-end tests
use the shade plugin to package the jars for testing.)
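
As a rough back-of-the-envelope check of what these two measurements imply, here is a small sketch. The function name is made up for illustration, and the inputs are just the two approximate timings quoted above from a single machine, so treat the result as an estimate only.

```python
# Illustrative sketch: estimates the share of a non-test build that is
# spent in shading, based on the two approximate timings quoted above.
# The function name is hypothetical and not part of any Flink tooling.

def shading_share(full_minutes: float, without_shading_minutes: float) -> float:
    """Fraction of the full build time that disappears without shading."""
    return (full_minutes - without_shading_minutes) / full_minutes


if __name__ == "__main__":
    full = 14.0             # "about 14 minutes" with maven-shade-plugin
    without_shading = 11.5  # "roughly 11.5 minutes" without it
    saved = full - without_shading
    share = shading_share(full, without_shading)
    print(f"shading costs about {saved:.1f} min ({share:.0%} of the build)")
```

In other words, under these assumptions shading accounts for a bit under a fifth of a build that skips tests.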

Aljoscha

> On 18. Aug 2019, at 19:52, Robert Metzger  wrote:
> 
> Hi all,
> 
> I wanted to understand the impact of the hardware we are using for running
> our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
> They are using Google Cloud Compute Engine *n1-standard-2* instances.
> Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> 
> Running the same workload on a 32 virtual cores, 64 gb machine, takes *1:21
> h*.
> 
> What is interesting are the per-module build time differences.
> Modules which are parallelizing tests well greatly benefit from the
> additional cores:
> "flink-tests" 36:51 min vs 4:33 min
> "flink-runtime" 23:41 min vs 3:47 min
> "flink-table-planner" 15:54 min vs 3:13 min
> 
> On the other hand, we have modules which are not parallel at all:
> "flink-connector-kafka": 16:32 min vs 15:19 min
> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> Also, the checkstyle plugin is not scaling at all.
> 
> Chesnay reported some significant speedups by reusing forks.
> I don't know how much effort it would be to make the Kafka tests
> parallelizable. In total, they currently use 30 minutes on the big machine
> (while 31 CPUs are idling :) )
> 
> Let me know what you think about these results. If the community is
> generally interested in further investigating into that direction, I could
> look into software to orchestrate this, as well as sponsors for such an
> infrastructure.
> 
> [1] https://docs.travis-ci.com/user/reference/overview/
> 
> 
> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler  wrote:
> 
>> @Aljoscha Shading takes a few minutes for a full build; you can see this
>> quite easily by looking at the compile step in the misc profile
>> ; all modules that
>> longer than a fraction of a section are usually caused by shading lots
>> of classes. Note that I cannot tell you how much of this is spent on
>> relocations, and how much on writing the jar.
>> 
>> Personally, I'd very much like us to move all shading to flink-shaded;
>> this would finally allows us to use newer maven versions without needing
>> cumbersome workarounds for flink-dist. However, this isn't a trivial
>> affair in some cases; IIRC calcite could be difficult to handle.
>> 
>> On another note, this would also simplify switching the main repo to
>> another build system, since you would no longer had to deal with
>> relocations, just packaging + merging NOTICE files.
>> 
>> @BowenLi I disagree, flink-shaded does not include any tests,  API
>> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
>> and flink-dist, where both relocate dependencies and one is bundled by
>> the other), and, most importantly, CI (and really, without CI being
>> covered in a PoC there's nothing to discuss).
>> 
>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
>>> Speaking of flink-shaded, do we have any idea what the impact of shading
>> is on the build time? We could get rid of shading completely in the Flink
>> main repository by moving everything that we shade to flink-shaded.
>>> 
>>> Aljoscha
>>> 
>>>> On 16. Aug 2019, at 14:58, Bowen Li  wrote:
>>>>
>>>> +1 to Till's points on #2 and #5, especially the potential non-disruptive,
>>>> gradual migration approach if we decide to go that route.
>>>>
>>>> To add on, I want to point it out that we can actually start with
>>>> flink-shaded project [1] which is a perfect candidate for PoC. It's of much
>>>> smaller size, totally isolated from and not interfered with flink project
>>>> [2], and it actually covers most of our practical feature requirements for
>>>> a build tool - all making it an ideal experimental field.
>>>>
>>>> [1] https://github.com/apache/flink-shaded
>>>> [2] https://github.com/apache/flink
>>>>
>>>>
>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:
>>>>
>>>>> For the sake of keeping the discussion focused and not cluttering the
>>>>> discussion thread I would suggest to split the detailed reporting for
>>>>> reusing JVMs to a separate thread and cross linking it from here.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler  wrote:
>>>>>
>>>>>> Update:
>>>>>>
>>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
>>>>>> away, while flink-tests has the potential for huge savings, but we have
>>>>>> to figure out some issues first.
>>>>>>
>>>>>>
>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>>>>>

Re: [DISCUSS] Reducing build times

2019-08-18 Thread Robert Metzger
Hi all,

I wanted to understand the impact of the hardware we are using for running
our tests. Each Travis worker has 2 virtual cores and 7.5 GB of memory [1].
They are using Google Cloud Compute Engine *n1-standard-2* instances.
Running a full "mvn clean verify" takes *03:32 h* on such a machine type.

Running the same workload on a machine with 32 virtual cores and 64 GB of
memory takes *1:21 h*.

What is interesting are the per-module build time differences.
Modules that parallelize their tests well benefit greatly from the
additional cores:
"flink-tests" 36:51 min vs 4:33 min
"flink-runtime" 23:41 min vs 3:47 min
"flink-table-planner" 15:54 min vs 3:13 min

On the other hand, we have modules which are not parallel at all:
"flink-connector-kafka": 16:32 min vs 15:19 min
"flink-connector-kafka-0.11": 9:52 min vs 7:46 min
Also, the checkstyle plugin is not scaling at all.
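
To make the differences easier to compare, here is a small sketch that turns the timings listed above into speedup factors. It is only an illustration: the helper names are made up, and the numbers are copied from this mail (one run per machine type), so they say nothing about variance between runs.

```python
# Illustrative sketch: converts the per-module timings quoted above into
# speedup factors. Helper names (to_seconds, speedup) are hypothetical.

def to_seconds(mm_ss: str) -> int:
    """Convert a 'minutes:seconds' string such as '36:51' into seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(small_machine: str, big_machine: str) -> float:
    """How many times faster the 32-core run was than the 2-core run."""
    return to_seconds(small_machine) / to_seconds(big_machine)


# module -> (time on the 2-core Travis worker, time on the 32-core machine)
MODULE_TIMES = {
    "flink-tests": ("36:51", "4:33"),
    "flink-runtime": ("23:41", "3:47"),
    "flink-table-planner": ("15:54", "3:13"),
    "flink-connector-kafka": ("16:32", "15:19"),
    "flink-connector-kafka-0.11": ("9:52", "7:46"),
}

# Full "mvn clean verify": 3:32 h and 1:21 h, expressed in minutes.
FULL_BUILD_MINUTES = (3 * 60 + 32, 1 * 60 + 21)

if __name__ == "__main__":
    for module, (small, big) in MODULE_TIMES.items():
        print(f"{module}: {speedup(small, big):.1f}x")
    overall = FULL_BUILD_MINUTES[0] / FULL_BUILD_MINUTES[1]
    print(f"full build: {overall:.1f}x")
```

The well-parallelized modules end up several times faster, while the Kafka connector modules stay close to 1x, which is why the full build improves far less than the core count alone would suggest.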

Chesnay reported some significant speedups by reusing forks.
I don't know how much effort it would be to make the Kafka tests
parallelizable. In total, they currently use 30 minutes on the big machine
(while 31 CPUs are idling :) )

Let me know what you think about these results. If the community is
generally interested in investigating further in that direction, I could
look into software to orchestrate this, as well as sponsors for such an
infrastructure.

[1] https://docs.travis-ci.com/user/reference/overview/


On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler  wrote:

> @Aljoscha Shading takes a few minutes for a full build; you can see this
> quite easily by looking at the compile step in the misc profile
> ; all modules that
> longer than a fraction of a section are usually caused by shading lots
> of classes. Note that I cannot tell you how much of this is spent on
> relocations, and how much on writing the jar.
>
> Personally, I'd very much like us to move all shading to flink-shaded;
> this would finally allows us to use newer maven versions without needing
> cumbersome workarounds for flink-dist. However, this isn't a trivial
> affair in some cases; IIRC calcite could be difficult to handle.
>
> On another note, this would also simplify switching the main repo to
> another build system, since you would no longer had to deal with
> relocations, just packaging + merging NOTICE files.
>
> @BowenLi I disagree, flink-shaded does not include any tests,  API
> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
> and flink-dist, where both relocate dependencies and one is bundled by
> the other), and, most importantly, CI (and really, without CI being
> covered in a PoC there's nothing to discuss).
>
> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> > Speaking of flink-shaded, do we have any idea what the impact of shading
> is on the build time? We could get rid of shading completely in the Flink
> main repository by moving everything that we shade to flink-shaded.
> >
> > Aljoscha
> >
> >> On 16. Aug 2019, at 14:58, Bowen Li  wrote:
> >>
> >> +1 to Till's points on #2 and #5, especially the potential
> non-disruptive,
> >> gradual migration approach if we decide to go that route.
> >>
> >> To add on, I want to point it out that we can actually start with
> >> flink-shaded project [1] which is a perfect candidate for PoC. It's of
> much
> >> smaller size, totally isolated from and not interfered with flink
> project
> >> [2], and it actually covers most of our practical feature requirements
> for
> >> a build tool - all making it an ideal experimental field.
> >>
> >> [1] https://github.com/apache/flink-shaded
> >> [2] https://github.com/apache/flink
> >>
> >>
> >> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann 
> wrote:
> >>
> >>> For the sake of keeping the discussion focused and not cluttering the
> >>> discussion thread I would suggest to split the detailed reporting for
> >>> reusing JVMs to a separate thread and cross linking it from here.
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
> >>> wrote:
> >>>
> >>>> Update:
> >>>>
> >>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
> >>>> away, while flink-tests has the potential for huge savings, but we have
> >>>> to figure out some issues first.
> >>>>
> >>>>
> >>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >>>>
> >>>> 4/8 profiles failed.
> >>>>
> >>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
> >>>> libraries (table-planner).
> >>>>
> >>>> The kafka and connectors profiles both fail in kafka tests due to
> >>>> producer leaks, and no speed up could be confirmed so far:
> >>>>
> >>>> java.lang.AssertionError: Detected producer leak. Thread name:
> >>>> kafka-producer-network-thread | producer-239
> >>>> at org.junit.Assert.fail(Assert.java:88)
> >>>> at
> 
> >>>
> 

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler
@Aljoscha Shading takes a few minutes for a full build; you can see this
quite easily by looking at the compile step in the misc profile; all modules
that take longer than a fraction of a second usually do so because they
shade lots of classes. Note that I cannot tell you how much of this is spent
on relocations, and how much on writing the jar.


Personally, I'd very much like us to move all shading to flink-shaded; 
this would finally allow us to use newer maven versions without needing
cumbersome workarounds for flink-dist. However, this isn't a trivial 
affair in some cases; IIRC calcite could be difficult to handle.


On another note, this would also simplify switching the main repo to 
another build system, since you would no longer have to deal with
relocations, just packaging + merging NOTICE files.


@BowenLi I disagree, flink-shaded does not include any tests, API
compatibility checks, checkstyle, layered shading (e.g., flink-runtime 
and flink-dist, where both relocate dependencies and one is bundled by 
the other), and, most importantly, CI (and really, without CI being 
covered in a PoC there's nothing to discuss).


On 16/08/2019 15:13, Aljoscha Krettek wrote:

Speaking of flink-shaded, do we have any idea what the impact of shading is on 
the build time? We could get rid of shading completely in the Flink main 
repository by moving everything that we shade to flink-shaded.

Aljoscha


On 16. Aug 2019, at 14:58, Bowen Li  wrote:

+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point it out that we can actually start with
flink-shaded project [1] which is a perfect candidate for PoC. It's of much
smaller size, totally isolated from and not interfered with flink project
[2], and it actually covers most of our practical feature requirements for
a build tool - all making it an ideal experimental field.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink


On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:


For the sake of keeping the discussion focused and not cluttering the
discussion thread I would suggest to split the detailed reporting for
reusing JVMs to a separate thread and cross linking it from here.

Cheers,
Till

On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
wrote:


Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right
away, while flink-tests has the potential for huge savings, but we have
to figure out some issues first.


Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner, 7 minutes saved in
libraries (table-planner).

The kafka and connectors profiles both fail in kafka tests due to
producer leaks, and no speed up could be confirmed so far:

java.lang.AssertionError: Detected producer leak. Thread name:
kafka-producer-network-thread | producer-239
at org.junit.Assert.fail(Assert.java:88)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)


The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected accumulator
results within time limit.
at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal tests run takes 40 minutes, while this one above
failed after 19 minutes and is only missing the migration tests (which
currently need 6-7 minutes). So we could save somewhere between 15 to 20
minutes here.


Finally, the misc profile fails in YARN:

java.lang.AssertionError
at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for
flink-yarn-tests we can maybe get a minute or 2 out of it.

On 16/08/2019 10:43, Chesnay Schepler wrote:

There appears to be a general agreement that 1) should be looked into;
I've setup a branch with fork reuse being enabled for all tests; will
report back the results.

On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's
discuss the different ways how they could be reduced.


   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total
time"), and in the ideal case takes about 1h20m ("run time") to
complete from start to finish. The run time may fluctuate of course
depending on the current Travis load. This applies both to builds on
the Apache and flink-ci Travis.

At the time of writing, the current queue time 

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Aljoscha Krettek
Speaking of flink-shaded, do we have any idea what the impact of shading is on 
the build time? We could get rid of shading completely in the Flink main 
repository by moving everything that we shade to flink-shaded.

Aljoscha

> On 16. Aug 2019, at 14:58, Bowen Li  wrote:
> 
> +1 to Till's points on #2 and #5, especially the potential non-disruptive,
> gradual migration approach if we decide to go that route.
> 
> To add on, I want to point it out that we can actually start with
> flink-shaded project [1] which is a perfect candidate for PoC. It's of much
> smaller size, totally isolated from and not interfered with flink project
> [2], and it actually covers most of our practical feature requirements for
> a build tool - all making it an ideal experimental field.
> 
> [1] https://github.com/apache/flink-shaded
> [2] https://github.com/apache/flink
> 
> 
> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:
> 
>> For the sake of keeping the discussion focused and not cluttering the
>> discussion thread I would suggest to split the detailed reporting for
>> reusing JVMs to a separate thread and cross linking it from here.
>> 
>> Cheers,
>> Till
>> 
>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
>> wrote:
>> 
>>> Update:
>>> 
>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
>>> away, while flink-tests has the potential for huge savings, but we have
>>> to figure out some issues first.
>>> 
>>> 
>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>> 
>>> 4/8 profiles failed.
>>> 
>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
>>> libraries (table-planner).
>>> 
>>> The kafka and connectors profiles both fail in kafka tests due to
>>> producer leaks, and no speed up could be confirmed so far:
>>> 
>>> java.lang.AssertionError: Detected producer leak. Thread name:
>>> kafka-producer-network-thread | producer-239
>>>at org.junit.Assert.fail(Assert.java:88)
>>>at
>>> 
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>at
>>> 
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>> 
>>> 
>>> The tests profile failed due to various errors in migration tests:
>>> 
>>> junit.framework.AssertionFailedError: Did not see the expected
>> accumulator
>>> results within time limit.
>>>at
>>> 
>> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>> 
>>> *However*, a normal tests run takes 40 minutes, while this one above
>>> failed after 19 minutes and is only missing the migration tests (which
>>> currently need 6-7 minutes). So we could save somewhere between 15 to 20
>>> minutes here.
>>> 
>>> 
>>> Finally, the misc profiles fails in YARN:
>>> 
>>> java.lang.AssertionError
>>>at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>> 
>>> No significant speedup could be observed in other modules; for
>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
>>> 
>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
>>>> There appears to be a general agreement that 1) should be looked into;
>>>> I've set up a branch with fork reuse being enabled for all tests; will
>>>> report back the results.
>>>>
>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
>>>>> Hello everyone,
>>>>>
>>>>> improving our build times is a hot topic at the moment so let's
>>>>> discuss the different ways how they could be reduced.
>>>>>
>>>>>
>>>>>   Current state:
>>>>>
>>>>> First up, let's look at some numbers:
>>>>>
>>>>> 1 full build currently consumes 5h of build time total ("total
>>>>> time"), and in the ideal case takes about 1h20m ("run time") to
>>>>> complete from start to finish. The run time may fluctuate of course
>>>>> depending on the current Travis load. This applies both to builds on
>>>>> the Apache and flink-ci Travis.
>>>>>
>>>>> At the time of writing, the current queue time for PR jobs (reminder:
>>>>> running on flink-ci) is about 30 minutes (which basically means that
>>>>> we are processing builds at the rate that they come in), however we
>>>>> are in an admittedly quiet period right now.
>>>>> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
>>>>> everyone was scrambling to get their changes merged in time for the
>>>>> feature freeze.
>>>>>
>>>>> (Note: Recently optimizations were added to ci-bot where pending
>>>>> builds are canceled if a new commit was pushed to the PR or the PR
>>>>> was closed, which should prove especially useful during the rush
>>>>> hours we see before feature-freezes.)
>>>>>
>>>>>
>>>>>   Past approaches
>>>>>
>>>>> Over the years we have done rather few things to improve this
>>>>> situation (hence our current predicament).
>>>>>
>>>>> Beyond the sporadic speedup of some

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Bowen Li
+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point out that we could actually start with the
flink-shaded project [1], which is a perfect candidate for a PoC. It is much
smaller, totally isolated from and unaffected by the flink project [2], and
it covers most of our practical feature requirements for a build tool, all
of which make it an ideal testing ground.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink


On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:

> For the sake of keeping the discussion focused and not cluttering the
> discussion thread I would suggest to split the detailed reporting for
> reusing JVMs to a separate thread and cross linking it from here.
>
> Cheers,
> Till
>
> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
> wrote:
>
> > Update:
> >
> > TL;DR: table-planner is a good candidate for enabling fork reuse right
> > away, while flink-tests has the potential for huge savings, but we have
> > to figure out some issues first.
> >
> >
> > Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >
> > 4/8 profiles failed.
> >
> > No speedup in libraries, python, blink_planner, 7 minutes saved in
> > libraries (table-planner).
> >
> > The kafka and connectors profiles both fail in kafka tests due to
> > producer leaks, and no speed up could be confirmed so far:
> >
> > java.lang.AssertionError: Detected producer leak. Thread name:
> > kafka-producer-network-thread | producer-239
> > at org.junit.Assert.fail(Assert.java:88)
> > at
> >
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> > at
> >
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >
> >
> > The tests profile failed due to various errors in migration tests:
> >
> > junit.framework.AssertionFailedError: Did not see the expected
> accumulator
> > results within time limit.
> > at
> >
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >
> > *However*, a normal tests run takes 40 minutes, while this one above
> > failed after 19 minutes and is only missing the migration tests (which
> > currently need 6-7 minutes). So we could save somewhere between 15 to 20
> > minutes here.
> >
> >
> > Finally, the misc profiles fails in YARN:
> >
> > java.lang.AssertionError
> > at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >
> > No significant speedup could be observed in other modules; for
> > flink-yarn-tests we can maybe get a minute or 2 out of it.
> >
> > On 16/08/2019 10:43, Chesnay Schepler wrote:
> > > There appears to be a general agreement that 1) should be looked into;
> > > I've setup a branch with fork reuse being enabled for all tests; will
> > > report back the results.
> > >
> > > On 15/08/2019 09:38, Chesnay Schepler wrote:
> > >> Hello everyone,
> > >>
> > >> improving our build times is a hot topic at the moment so let's
> > >> discuss the different ways how they could be reduced.
> > >>
> > >>
> > >>Current state:
> > >>
> > >> First up, let's look at some numbers:
> > >>
> > >> 1 full build currently consumes 5h of build time total ("total
> > >> time"), and in the ideal case takes about 1h20m ("run time") to
> > >> complete from start to finish. The run time may fluctuate of course
> > >> depending on the current Travis load. This applies both to builds on
> > >> the Apache and flink-ci Travis.
> > >>
> > >> At the time of writing, the current queue time for PR jobs (reminder:
> > >> running on flink-ci) is about 30 minutes (which basically means that
> > >> we are processing builds at the rate that they come in), however we
> > >> are in an admittedly quiet period right now.
> > >> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> > >> everyone was scrambling to get their changes merged in time for the
> > >> feature freeze.
> > >>
> > >> (Note: Recently optimizations where added to ci-bot where pending
> > >> builds are canceled if a new commit was pushed to the PR or the PR
> > >> was closed, which should prove especially useful during the rush
> > >> hours we see before feature-freezes.)
> > >>
> > >>
> > >>Past approaches
> > >>
> > >> Over the years we have done rather few things to improve this
> > >> situation (hence our current predicament).
> > >>
> > >> Beyond the sporadic speedup of some tests, the only notable reduction
> > >> in total build times was the introduction of cron jobs, which
> > >> consolidated the per-commit matrix from 4 configurations (different
> > >> scala/hadoop versions) to 1.
> > >>
> > >> The separation into multiple build profiles was only a work-around
> > >> for 

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Till Rohrmann
For the sake of keeping the discussion focused and not cluttering the
discussion thread I would suggest to split the detailed reporting for
reusing JVMs to a separate thread and cross linking it from here.

Cheers,
Till

On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler  wrote:

> Update:
>
> TL;DR: table-planner is a good candidate for enabling fork reuse right
> away, while flink-tests has the potential for huge savings, but we have
> to figure out some issues first.
>
>
> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>
> 4/8 profiles failed.
>
> No speedup in libraries, python, blink_planner, 7 minutes saved in
> libraries (table-planner).
>
> The kafka and connectors profiles both fail in kafka tests due to
> producer leaks, and no speed up could be confirmed so far:
>
> java.lang.AssertionError: Detected producer leak. Thread name:
> kafka-producer-network-thread | producer-239
> at org.junit.Assert.fail(Assert.java:88)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>
>
> The tests profile failed due to various errors in migration tests:
>
> junit.framework.AssertionFailedError: Did not see the expected accumulator
> results within time limit.
> at
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>
> *However*, a normal tests run takes 40 minutes, while this one above
> failed after 19 minutes and is only missing the migration tests (which
> currently need 6-7 minutes). So we could save somewhere between 15 to 20
> minutes here.
>
>
> Finally, the misc profiles fails in YARN:
>
> java.lang.AssertionError
> at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>
> No significant speedup could be observed in other modules; for
> flink-yarn-tests we can maybe get a minute or 2 out of it.
>
> On 16/08/2019 10:43, Chesnay Schepler wrote:
> > There appears to be a general agreement that 1) should be looked into;
> > I've setup a branch with fork reuse being enabled for all tests; will
> > report back the results.
> >
> > On 15/08/2019 09:38, Chesnay Schepler wrote:
> >> Hello everyone,
> >>
> >> improving our build times is a hot topic at the moment so let's
> >> discuss the different ways how they could be reduced.
> >>
> >>
> >>Current state:
> >>
> >> First up, let's look at some numbers:
> >>
> >> 1 full build currently consumes 5h of build time total ("total
> >> time"), and in the ideal case takes about 1h20m ("run time") to
> >> complete from start to finish. The run time may fluctuate of course
> >> depending on the current Travis load. This applies both to builds on
> >> the Apache and flink-ci Travis.
> >>
> >> At the time of writing, the current queue time for PR jobs (reminder:
> >> running on flink-ci) is about 30 minutes (which basically means that
> >> we are processing builds at the rate that they come in), however we
> >> are in an admittedly quiet period right now.
> >> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> >> everyone was scrambling to get their changes merged in time for the
> >> feature freeze.
> >>
> >> (Note: Recently optimizations where added to ci-bot where pending
> >> builds are canceled if a new commit was pushed to the PR or the PR
> >> was closed, which should prove especially useful during the rush
> >> hours we see before feature-freezes.)
> >>
> >>
> >>Past approaches
> >>
> >> Over the years we have done rather few things to improve this
> >> situation (hence our current predicament).
> >>
> >> Beyond the sporadic speedup of some tests, the only notable reduction
> >> in total build times was the introduction of cron jobs, which
> >> consolidated the per-commit matrix from 4 configurations (different
> >> scala/hadoop versions) to 1.
> >>
> >> The separation into multiple build profiles was only a work-around
> >> for the 50m limit on Travis. Running tests in parallel has the
> >> obvious potential of reducing run time, but we're currently hitting a
> >> hard limit since a few modules (flink-tests, flink-runtime,
> >> flink-table-planner-blink) are so loaded with tests that they nearly
> >> consume an entire profile by themselves (and thus no further
> >> splitting is possible).
> >>
> >> The rework that introduced stages, at the time of introduction, did
> >> also not provide a speed up, although this changed slightly once more
> >> profiles were added and some optimizations to the caching have been
> >> made.
> >>
> >> Very recently we modified the surefire-plugin configuration for
> >> flink-table-planner-blink to reuse JVM forks for IT cases, providing
> >> a significant speedup (18 minutes!). So far we have not seen any
> >> negative 

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler

Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right 
away, while flink-tests has the potential for huge savings, but we have 
to figure out some issues first.



Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner, 7 minutes saved in 
libraries (table-planner).


The kafka and connectors profiles both fail in kafka tests due to 
producer leaks, and no speed up could be confirmed so far:


java.lang.AssertionError: Detected producer leak. Thread name: 
kafka-producer-network-thread | producer-239
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)


The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected accumulator 
results within time limit.
at 
org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal tests run takes 40 minutes, while this one above 
failed after 19 minutes and is only missing the migration tests (which 
currently need 6-7 minutes). So we could save somewhere between 15 and 20 
minutes here.
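
A rough sanity check of this estimate, written out in Python. It takes the
40, 19 and 6-7 minute figures above at face value and assumes (this is an
assumption, not something the build logs confirm) that the failed run would
have been complete once the missing migration tests had also run:

```python
# Figures quoted above, in minutes.
NORMAL_RUN = 40
FAILED_RUN = 19
MIGRATION_TESTS = (6, 7)  # range for the migration tests that did not run


def estimated_savings(normal_run, partial_run, missing):
    """Return (low, high) savings if the partial run had also run the missing tests."""
    low = normal_run - (partial_run + max(missing))
    high = normal_run - (partial_run + min(missing))
    return low, high


savings = estimated_savings(NORMAL_RUN, FAILED_RUN, MIGRATION_TESTS)
print(savings)
```

Under that assumption the saving comes out at roughly 14 to 15 minutes, which
sits at the lower end of the range given above.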



Finally, the misc profile fails in YARN:

java.lang.AssertionError
at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for 
flink-yarn-tests we can maybe get a minute or 2 out of it.


On 16/08/2019 10:43, Chesnay Schepler wrote:
There appears to be general agreement that 1) should be looked into; 
I've set up a branch with fork reuse enabled for all tests; will 
report back the results.


On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's 
discuss the different ways how they could be reduced.



   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total 
time"), and in the ideal case takes about 1h20m ("run time") to 
complete from start to finish. The run time may fluctuate of course 
depending on the current Travis load. This applies both to builds on 
the Apache and flink-ci Travis.


At the time of writing, the current queue time for PR jobs (reminder: 
running on flink-ci) is about 30 minutes (which basically means that 
we are processing builds at the rate that they come in), however we 
are in an admittedly quiet period right now.
2 weeks ago the queue times on flink-ci peaked at around 5-6h as 
everyone was scrambling to get their changes merged in time for the 
feature freeze.


(Note: Recently optimizations were added to ci-bot where pending 
builds are canceled if a new commit was pushed to the PR or the PR 
was closed, which should prove especially useful during the rush 
hours we see before feature-freezes.)



   Past approaches

Over the years we have done rather few things to improve this 
situation (hence our current predicament).


Beyond the sporadic speedup of some tests, the only notable reduction 
in total build times was the introduction of cron jobs, which 
consolidated the per-commit matrix from 4 configurations (different 
scala/hadoop versions) to 1.


The separation into multiple build profiles was only a work-around 
for the 50m limit on Travis. Running tests in parallel has the 
obvious potential of reducing run time, but we're currently hitting a 
hard limit since a few modules (flink-tests, flink-runtime, 
flink-table-planner-blink) are so loaded with tests that they nearly 
consume an entire profile by themselves (and thus no further 
splitting is possible).


The rework that introduced stages, at the time of introduction, did 
also not provide a speed up, although this changed slightly once more 
profiles were added and some optimizations to the caching have been 
made.


Very recently we modified the surefire-plugin configuration for 
flink-table-planner-blink to reuse JVM forks for IT cases, providing 
a significant speedup (18 minutes!). So far we have not seen any 
negative consequences.



   Suggestions

This is a list of /all/ suggestions for reducing run/total times that 
I have seen recently (in other words, they aren't necessarily mine 
nor may I agree with all of them).


1. Enable JVM reuse for IT cases in more modules.
 * We've seen significant speedups in the blink planner, and this
   should be applicable for all modules. However, I presume there's
   a reason why we disabled JVM reuse (information on this would be
   appreciated)
2. Custom differential build scripts
 * Setup custom scripts for determining which 

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler
There appears to be general agreement that 1) should be looked into; 
I've set up a branch with fork reuse enabled for all tests; will 
report back the results.


On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's 
discuss the different ways how they could be reduced.



   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total time"), 
and in the ideal case takes about 1h20m ("run time") to complete from 
start to finish. The run time may fluctuate of course depending on the 
current Travis load. This applies both to builds on the Apache and 
flink-ci Travis.


At the time of writing, the current queue time for PR jobs (reminder: 
running on flink-ci) is about 30 minutes (which basically means that 
we are processing builds at the rate that they come in), however we 
are in an admittedly quiet period right now.
2 weeks ago the queue times on flink-ci peaked at around 5-6h as 
everyone was scrambling to get their changes merged in time for the 
feature freeze.


(Note: Recently optimizations were added to ci-bot where pending 
builds are canceled if a new commit was pushed to the PR or the PR was 
closed, which should prove especially useful during the rush hours we 
see before feature-freezes.)



   Past approaches

Over the years we have done rather few things to improve this 
situation (hence our current predicament).


Beyond the sporadic speedup of some tests, the only notable reduction 
in total build times was the introduction of cron jobs, which 
consolidated the per-commit matrix from 4 configurations (different 
scala/hadoop versions) to 1.


The separation into multiple build profiles was only a work-around for 
the 50m limit on Travis. Running tests in parallel has the obvious 
potential of reducing run time, but we're currently hitting a hard 
limit since a few modules (flink-tests, flink-runtime, 
flink-table-planner-blink) are so loaded with tests that they nearly 
consume an entire profile by themselves (and thus no further splitting 
is possible).


The rework that introduced stages, at the time of introduction, did 
also not provide a speed up, although this changed slightly once more 
profiles were added and some optimizations to the caching have been made.


Very recently we modified the surefire-plugin configuration for 
flink-table-planner-blink to reuse JVM forks for IT cases, providing a 
significant speedup (18 minutes!). So far we have not seen any 
negative consequences.



   Suggestions

This is a list of /all/ suggestions for reducing run/total times that 
I have seen recently (in other words, they aren't necessarily mine nor 
may I agree with all of them).


1. Enable JVM reuse for IT cases in more modules.
 * We've seen significant speedups in the blink planner, and this
   should be applicable for all modules. However, I presume there's
   a reason why we disabled JVM reuse (information on this would be
   appreciated)
2. Custom differential build scripts
 * Set up custom scripts for determining which modules might be
   affected by a change, and manipulate the splits accordingly. This
   approach is conceptually quite straight-forward, but has limits
   since it has to be pessimistic; i.e. a change in flink-core
   _must_ result in testing all modules.
3. Only run smoke tests when PR is opened, run heavy tests on demand.
 * With the introduction of the ci-bot we now have significantly
   more options on how to handle PR builds. One option could be to
   only run basic tests when the PR is created (which may be only
   modified modules, or all unit tests, or another low-cost
   scheme), and then have a committer trigger other builds (full
   test run, e2e tests, etc...) on demand.
4. Move more tests into cron builds
 * The budget version of 3); move certain tests that are either
   expensive (like some runtime tests that take minutes) or in
   rarely modified modules (like gelly) into cron jobs.
5. Gradle
 * Gradle was brought up a few times for its built-in support for
   differential builds; basically providing 2) without the overhead
   of maintaining additional scripts.
 * To date no PoC was provided that shows it working in our CI
   environment (i.e., handling splits & caching etc).
 * This is the most disruptive change by a fair margin, as it would
   affect the entire project, developers and potentially users (if
   they build from source).
6. CI service
 * Our current artifact caching setup on Travis is essentially a
   hack; we're abusing the Travis cache, which is meant
   for long-term caching, to ship build artifacts across jobs. It's
   brittle at times due to timing/visibility issues and on branches
   the cleanup processes can interfere with running builds. It is
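
Returning to suggestion 2 above, the pessimistic module selection could be
sketched roughly as follows. This is only an illustration: the dependency
graph is a hypothetical, heavily simplified stand-in (the real one would
come from the Maven build), and `affected_modules` is a made-up helper
name, not an existing script.

```python
from collections import deque

# Hypothetical, simplified graph: module -> modules it depends on.
# The real graph would be derived from the Maven build, not hardcoded.
DEPENDS_ON = {
    "flink-core": [],
    "flink-runtime": ["flink-core"],
    "flink-gelly": ["flink-runtime"],
    "flink-yarn": ["flink-runtime"],
}


def affected_modules(changed, depends_on):
    """Return the changed modules plus everything that transitively depends on them."""
    # Invert the graph so we can walk from a module to its dependents.
    dependents = {module: set() for module in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(affected_modules(["flink-core"], DEPENDS_ON)))
print(sorted(affected_modules(["flink-gelly"], DEPENDS_ON)))
```

In this toy graph a change in flink-core selects every module, while a
change in a leaf module such as flink-gelly selects only that module, which
is the limit described in suggestion 2.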
   

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Xiyuan Wang
6. CI service
I'm not very familiar with Travis, but judging by its official
docs [1][2], is it possible to run jobs in parallel? AFAIK, many CI systems
support this kind of feature.

[1]:
https://docs.travis-ci.com/user/speeding-up-the-build/#parallelizing-your-builds-across-virtual-machines
[2]: https://docs.travis-ci.com/user/build-matrix/

On Fri, Aug 16, 2019 at 4:14 PM Arvid Heise  wrote:

> Thank you for starting the discussion as well!
>
> +1 to 1. it seems to be a quite low-hanging fruit that we should try to
> employ as much as possible.
>
> -0 to 2. the build setup is already very complicated. Adding new
> functionality that I would expect to come out of the box of a modern build
> tool seems like too much effort for me. I'm proposing a 7. action item that
> I would like to try out first before making the setup more complicated.
>
> +0 to 3. What is the actual intent here? If it's about failing earlier,
> then I'd rather propose to reorder the tests such that unit and smoke tests
> of every module are run before IT tests. If it's about being able to
> approve a PR quicker, are smoke tests really enough? However, if we have
> layered tests, then it would be rather easy to omit IT tests altogether in
> specific (local) builds.
>
> -1 to 4. I really want to see when stuff breaks not only once per day (or
> whatever the CRON cycle is). I can really see more broken code being merged
> into master because of the disconnect.
>
> +1 to 5. Gradle build cache has worked well for me in the past. If there is
> a general interest, I can start a POC (or improve upon older POCs). I
> currently expect shading to be the most effort.
>
> +1 to 6. Travis had so many drawbacks in the past and now that most of the
> senior staff has been layed off, I don't expect any improvements at all.
> At my old company, I switched our open source projects to Azure pipelines
> with great success. Azure pipelines offers 10 instances for open source
> projects and it's payment model is pay-as-you-go [1]. Since artifact
> sharing seems to be an issue with Travis anyways, it looks rather easy to
> use in pipelines [2].
> I'd also expect Github CI to be a good fit for our needs [3], but it's
> rather young and I have no experience.
>
> ---
>
> 7. Option I'd like to try the global build cache that's provided by Gradle
> enterprise for Maven first [4]. It basically fingerprints a task
> (fingerprint of upstream tasks, source files + black magic) and whenever
> the fingerprint matches it fetches the results from the build cache. In
> theory, we would get the results of 2. implicitly without any effort. Of
> course, Gradle enterprise costs money (which I could inquire if general
> interest exists) but it would also allow us to downgrade the Travis plan
> (and Travis is really expensive).
>
>
> [1]
>
> https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/
> [2]
>
> https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops=yaml
> [3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/
> [4] https://docs.gradle.com/enterprise/maven-extension/
>
> On Fri, Aug 16, 2019 at 5:20 AM Jark Wu  wrote:
>
> > Thanks Chesnay for starting this discussion.
> >
> > +1 for #1, it might be the easiest way to get a significant speedup.
> > If the only reason is for isolation. I think we can fix the static fields
> > or global state used in Flink if possible.
> >
> > +1 for #2, and thanks Aleksey for the prototype. I think it's a good
> > approach which doesn't introduce too much things to maintain.
> >
> > +1 for #3(run CRON or e2e tests on demand).
> > We have this requirement when reviewing some pull requests, because we
> > don't sure whether it will broken some specific e2e test.
> > Currently, we have to run it locally by building the whole project. Or
> > enable CRON jobs for the pushed branch in contributor's own travis.
> >
> > Besides that, I think FLINK-11464[1] is also a good way to cache
> > distributions to save a lot of download time.
> >
> > Best,
> > Jark
> >
> > [1]: https://issues.apache.org/jira/browse/FLINK-11464
> >
> > On Thu, 15 Aug 2019 at 21:47, Aleksey Pak  wrote:
> >
> > > Hi all!
> > >
> > > Thanks for starting this discussion.
> > >
> > > I'd like to also add my 2 cents:
> > >
> > > +1 for #2, differential build scripts.
> > > I've worked on the approach. And with it, I think it's possible to
> reduce
> > > total build time with relatively low effort, without enforcing any new
> > > build tool and low maintenance cost.
> > >
> > > You can check a proposed change (for the old CI setup, when Flink PRs
> > were
> > > running in Apache common CI pool) here:
> > > https://github.com/apache/flink/pull/9065
> > > In the proposed change, the dependency check is not heavily hardcoded
> and
> > > just uses maven's results for dependency graph analysis.
> > >
> > > > This approach is conceptually quite straight-forward, but has limits
> 

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Arvid Heise
Thank you for starting the discussion as well!

+1 to 1. It seems to be quite a low-hanging fruit that we should try to
employ as much as possible.

-0 to 2. the build setup is already very complicated. Adding new
functionality that I would expect to come out of the box of a modern build
tool seems like too much effort for me. I'm proposing a 7. action item that
I would like to try out first before making the setup more complicated.

+0 to 3. What is the actual intent here? If it's about failing earlier,
then I'd rather propose to reorder the tests such that unit and smoke tests
of every module are run before IT tests. If it's about being able to
approve a PR quicker, are smoke tests really enough? However, if we have
layered tests, then it would be rather easy to omit IT tests altogether in
specific (local) builds.

-1 to 4. I really want to see when stuff breaks right away, not only once
per day (or whatever the CRON cycle is). I can really see more broken code
being merged into master because of the disconnect.

+1 to 5. Gradle build cache has worked well for me in the past. If there is
a general interest, I can start a POC (or improve upon older POCs). I
currently expect shading to be the most effort.

+1 to 6. Travis had so many drawbacks in the past and now that most of the
senior staff has been laid off, I don't expect any improvements at all.
At my old company, I switched our open source projects to Azure Pipelines
with great success. Azure Pipelines offers 10 instances for open source
projects and its payment model is pay-as-you-go [1]. Artifact sharing,
which seems to be an issue with Travis anyway, looks rather easy to do in
Pipelines [2].
I'd also expect GitHub CI to be a good fit for our needs [3], but it's
rather young and I have no experience with it.

---

7. As a further option, I'd like to first try the global build cache that
Gradle Enterprise provides for Maven [4]. It basically fingerprints a task
(fingerprint of upstream tasks, source files + black magic) and whenever
the fingerprint matches it fetches the results from the build cache. In
theory, we would get the results of 2. implicitly without any effort. Of
course, Gradle Enterprise costs money (I could inquire about pricing if
there is general interest) but it would also allow us to downgrade the
Travis plan (and Travis is really expensive).
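
The fingerprinting idea in option 7 can be illustrated with a small Python
sketch. This is not how Gradle Enterprise is implemented; `fingerprint` and
`BuildCache` are hypothetical names, and the "black magic" part (compiler
flags, environment, and so on) is left out entirely.

```python
import hashlib


def fingerprint(sources, upstream_fingerprints):
    """Hash source contents together with the fingerprints of upstream tasks.

    sources: dict mapping file name -> file content as bytes.
    upstream_fingerprints: iterable of hex digests of the tasks this one depends on.
    """
    digest = hashlib.sha256()
    for name in sorted(sources):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(sources[name])
        digest.update(b"\0")
    for upstream in sorted(upstream_fingerprints):
        digest.update(upstream.encode("ascii"))
    return digest.hexdigest()


class BuildCache:
    """In-memory stand-in for a shared build cache keyed by fingerprint."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run(self, key, task):
        # Only execute the task if no result is stored for this fingerprint.
        if key not in self._results:
            self.executions += 1
            self._results[key] = task()
        return self._results[key]


cache = BuildCache()
core = fingerprint({"Core.java": b"class Core {}"}, [])
tests = fingerprint({"CoreTest.java": b"class CoreTest {}"}, [core])

cache.run(tests, lambda: "tests passed")
cache.run(tests, lambda: "tests passed")  # served from the cache
print(cache.executions)
```

Because the upstream fingerprint is part of the hash, changing Core.java
changes the fingerprint of the dependent test task as well, so only tasks
downstream of a change are re-executed.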


[1]
https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/
[2]
https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops=yaml
[3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/
[4] https://docs.gradle.com/enterprise/maven-extension/

On Fri, Aug 16, 2019 at 5:20 AM Jark Wu  wrote:

> Thanks Chesnay for starting this discussion.
>
> +1 for #1, it might be the easiest way to get a significant speedup.
> If the only reason is for isolation. I think we can fix the static fields
> or global state used in Flink if possible.
>
> +1 for #2, and thanks Aleksey for the prototype. I think it's a good
> approach which doesn't introduce too much things to maintain.
>
> +1 for #3(run CRON or e2e tests on demand).
> We have this requirement when reviewing some pull requests, because we
> don't sure whether it will broken some specific e2e test.
> Currently, we have to run it locally by building the whole project. Or
> enable CRON jobs for the pushed branch in contributor's own travis.
>
> Besides that, I think FLINK-11464[1] is also a good way to cache
> distributions to save a lot of download time.
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/FLINK-11464
>
> On Thu, 15 Aug 2019 at 21:47, Aleksey Pak  wrote:
>
> > Hi all!
> >
> > Thanks for starting this discussion.
> >
> > I'd like to also add my 2 cents:
> >
> > +1 for #2, differential build scripts.
> > I've worked on the approach. And with it, I think it's possible to reduce
> > total build time with relatively low effort, without enforcing any new
> > build tool and low maintenance cost.
> >
> > You can check a proposed change (for the old CI setup, when Flink PRs
> were
> > running in Apache common CI pool) here:
> > https://github.com/apache/flink/pull/9065
> > In the proposed change, the dependency check is not heavily hardcoded and
> > just uses maven's results for dependency graph analysis.
> >
> > > This approach is conceptually quite straight-forward, but has limits
> > since it has to be pessimistic; > i.e. a change in flink-core _must_
> result
> > in testing all modules.
> >
> > Agree, in Flink case, there are some core modules that would trigger
> whole
> > tests run with such approach. For developers who modify such components,
> > the build time would be the longest. But this approach should really help
> > for developers who touch more-or-less independent modules.
> >
> > Even for core modules, it's possible to create "abstraction" barriers by
> > changing dependency graph. For example, it can look like: flink-core-api
> > <-- 

Re: [DISCUSS] Reducing build times

2019-08-15 Thread Jark Wu
Thanks Chesnay for starting this discussion.

+1 for #1, it might be the easiest way to get a significant speedup.
If isolation is the only reason JVM reuse was disabled, I think we can fix
the static fields or global state used in Flink where possible.

+1 for #2, and thanks Aleksey for the prototype. I think it's a good
approach which doesn't introduce too many things to maintain.

+1 for #3 (run CRON or e2e tests on demand).
We have this requirement when reviewing some pull requests, because we
aren't sure whether they will break a specific e2e test.
Currently, we have to run it locally by building the whole project, or
enable CRON jobs for the pushed branch in the contributor's own Travis.

Besides that, I think FLINK-11464[1] is also a good way to cache
distributions to save a lot of download time.

Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-11464

On Thu, 15 Aug 2019 at 21:47, Aleksey Pak  wrote:

> Hi all!
>
> Thanks for starting this discussion.
>
> I'd like to also add my 2 cents:
>
> +1 for #2, differential build scripts.
> I've worked on the approach, and with it I think it's possible to reduce
> total build time with relatively low effort and low maintenance cost,
> without enforcing any new build tool.
>
> You can check a proposed change (for the old CI setup, when Flink PRs were
> running in Apache common CI pool) here:
> https://github.com/apache/flink/pull/9065
> In the proposed change, the dependency check is not heavily hardcoded and
> just uses maven's results for dependency graph analysis.
>
> > This approach is conceptually quite straight-forward, but has limits
> > since it has to be pessimistic; i.e. a change in flink-core _must_ result
> > in testing all modules.
>
> Agreed, in Flink's case there are some core modules that would trigger a
> full test run with such an approach. For developers who modify such
> components, the build time would be the longest. But this approach should
> really help developers who touch more-or-less independent modules.
>
> Even for core modules, it's possible to create "abstraction" barriers by
> changing the dependency graph. For example, it can look like: flink-core-api
> <-- flink-core, flink-core-api <-- flink-connectors.
> In that case, only a change in flink-core-api would trigger a full test run.
>
> +1 for #3, separating PR CI runs into different stages.
> Imo, it may require more changes to the current CI setup compared to #2,
> and it should not be done naively. Ideally, it integrates with the Flink bot
> and triggers follow-up build steps only when some prerequisites are
> done.
>
> +1 for #4, to move some tests into cron runs.
> But imo, this does not scale well, it applies only to a small subset of
> tests.
>
> +1 for #6, to use other CI service(s).
> More specifically, GitHub gives build actions for free that can be used to
> offload some build steps/PR checks. It can help to move out some PR checks
> from the main CI build (for example: documentation builds, license checks,
> code formatting checks).
>
> Regards,
> Aleksey
>
> On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann 
> wrote:
>
> > Thanks for starting this discussion Chesnay. I think it has become obvious
> > to the Flink community that with the existing build setup we cannot really
> > deliver fast build times which are essential for fast iteration cycles and
> > high developer productivity. The reasons for this situation are manifold
> > but it is definitely affected by Flink's project growth, not always optimal
> > tests and the inflexibility that everything needs to be built. Hence, I
> > consider the reduction of build times crucial for the project's health and
> > future growth.
> >
> > Without necessarily voicing a strong preference for any of the presented
> > suggestions, I wanted to comment on each of them:
> >
> > 1. This sounds promising. Could the reason why we don't reuse JVMs date
> > back to the time when we still had a lot of static fields in Flink which
> > made it hard to reuse JVMs and the potentially mutated global state?
> >
> > 2. Building hand-crafted solutions around a build system in order to
> > compensate for its limitations which other build systems support out of the
> > box sounds like the not invented here syndrome to me. Reinventing the wheel
> > has historically proven to be usually not the best solution and it often
> > comes with a high maintenance price tag. Moreover, it would add just
> > another layer of complexity around our existing build system. I think the
> > current state where we have the maven setup in pom files and for Travis
> > multiple bash scripts specializing the builds to make it fit the time limit
> > is already not very transparent/easy to understand.
> >
> > 3. I could see this work but it also requires a very good understanding of
> > Flink of every committer because the committer needs to know which tests
> > would be good to run additionally.
> >
> > 4. I would be against this option solely to decrease our build time. My

Re: [DISCUSS] Reducing build times

2019-08-15 Thread Aleksey Pak
Hi all!

Thanks for starting this discussion.

I'd like to also add my 2 cents:

+1 for #2, differential build scripts.
I've worked on this approach, and I think it makes it possible to reduce
total build time with relatively low effort and low maintenance cost,
without enforcing any new build tool.

You can check a proposed change (for the old CI setup, when Flink PRs were
running in Apache common CI pool) here:
https://github.com/apache/flink/pull/9065
In the proposed change, the dependency check is not heavily hardcoded and
just uses maven's results for dependency graph analysis.

> This approach is conceptually quite straight-forward, but has limits
> since it has to be pessimistic; i.e. a change in flink-core _must_ result
> in testing all modules.

Agreed, in Flink's case there are some core modules that would trigger a
full test run with such an approach. For developers who modify such
components, the build time would be the longest. But this approach should
really help developers who touch more-or-less independent modules.

Even for core modules, it's possible to create "abstraction" barriers by
changing the dependency graph. For example, it can look like: flink-core-api
<-- flink-core, flink-core-api <-- flink-connectors.
In that case, only a change in flink-core-api would trigger a full test run.
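
To illustrate the idea, here is a minimal sketch (not the logic from the PR
above): given a map from each module to its direct dependencies, the modules
to test are the changed ones plus everything that transitively depends on
them. The module graph is hypothetical, including the flink-core-api split,
and the function name is made up for this example.

```python
from collections import defaultdict, deque

# Hypothetical module graph: module -> modules it depends on directly.
# flink-core-api is the assumed "abstraction barrier" module from the text.
DEPENDS_ON = {
    "flink-core-api": [],
    "flink-core": ["flink-core-api"],
    "flink-connectors": ["flink-core-api"],
    "flink-runtime": ["flink-core"],
    "flink-docs": [],
}


def affected_modules(changed, depends_on):
    """Return the changed modules plus all modules that transitively depend on them."""
    # Invert the graph: module -> modules that depend on it directly.
    dependents = defaultdict(set)
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)

    # Breadth-first walk over the inverted graph, starting from the changes.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With this graph, a change in flink-core only pulls in flink-runtime, while a
change in flink-core-api pulls in everything except flink-docs, which is the
pessimistic case discussed above.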

+1 for #3, separating PR CI runs into different stages.
Imo, it may require more changes to the current CI setup compared to #2,
and it should not be done naively. Ideally, it integrates with the Flink bot
and triggers follow-up build steps only when some prerequisites are
done.

+1 for #4, to move some tests into cron runs.
But imo, this does not scale well, it applies only to a small subset of
tests.

+1 for #6, to use other CI service(s).
More specifically, GitHub gives build actions for free that can be used to
offload some build steps/PR checks. It can help to move out some PR checks
from the main CI build (for example: documentation builds, license checks,
code formatting checks).

Regards,
Aleksey

On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann  wrote:

> Thanks for starting this discussion Chesnay. I think it has become obvious
> to the Flink community that with the existing build setup we cannot really
> deliver fast build times which are essential for fast iteration cycles and
> high developer productivity. The reasons for this situation are manifold
> but it is definitely affected by Flink's project growth, not always optimal
> tests and the inflexibility that everything needs to be built. Hence, I
> consider the reduction of build times crucial for the project's health and
> future growth.
>
> Without necessarily voicing a strong preference for any of the presented
> suggestions, I wanted to comment on each of them:
>
> 1. This sounds promising. Could the reason why we don't reuse JVMs date
> back to the time when we still had a lot of static fields in Flink which
> made it hard to reuse JVMs and the potentially mutated global state?
>
> 2. Building hand-crafted solutions around a build system in order to
> compensate for its limitations which other build systems support out of the
> box sounds like the not invented here syndrome to me. Reinventing the wheel
> has historically proven to be usually not the best solution and it often
> comes with a high maintenance price tag. Moreover, it would add just
> another layer of complexity around our existing build system. I think the
> current state where we have the maven setup in pom files and for Travis
> multiple bash scripts specializing the builds to make it fit the time limit
> is already not very transparent/easy to understand.
>
> 3. I could see this work but it also requires a very good understanding of
> Flink of every committer because the committer needs to know which tests
> would be good to run additionally.
>
> 4. I would be against this option solely to decrease our build time. My
> observation is that the community does not monitor the health of the cron
> jobs well enough. In the past the cron jobs have been unstable for as long
> as a complete release cycle. Moreover, I've seen that PRs were merged which
> passed Travis but broke the cron jobs. Consequently, I fear that this
> option would deteriorate Flink's stability.
>
> 5. I would rephrase this point into changing the build system. Gradle could
> be one candidate but there are also other build systems out there like
> Bazel. Changing the build system would indeed be a major endeavour but I
> could see the long term benefits of such a change (similar to having a
> consistent and enforced code style) in particular if the build system
> supports the functionality which we would otherwise build & maintain on our
> own. I think there would be ways to make the transition not as disruptive
> as described. For example, one could keep the Maven build and the new build
> side by side until one is confident enough that the new build produces the
> same output as the Maven build. Maybe it would 

Re: [DISCUSS] Reducing build times

2019-08-15 Thread Till Rohrmann
Thanks for starting this discussion Chesnay. I think it has become obvious
to the Flink community that with the existing build setup we cannot really
deliver fast build times which are essential for fast iteration cycles and
high developer productivity. The reasons for this situation are manifold
but it is definitely affected by Flink's project growth, not always optimal
tests and the inflexibility that everything needs to be built. Hence, I
consider the reduction of build times crucial for the project's health and
future growth.

Without necessarily voicing a strong preference for any of the presented
suggestions, I wanted to comment on each of them:

1. This sounds promising. Could the reason why we don't reuse JVMs date
back to the time when we still had a lot of static fields in Flink, which
made it hard to reuse JVMs because of potentially mutated global state?

2. Building hand-crafted solutions around a build system in order to
compensate for its limitations which other build systems support out of the
box sounds like the not-invented-here syndrome to me. Reinventing the wheel
has historically proven to be usually not the best solution and it often
comes with a high maintenance price tag. Moreover, it would add just
another layer of complexity around our existing build system. I think the
current state where we have the maven setup in pom files and for Travis
multiple bash scripts specializing the builds to make it fit the time limit
is already not very transparent/easy to understand.

3. I could see this work but it also requires a very good understanding of
Flink of every committer because the committer needs to know which tests
would be good to run additionally.

4. I would be against this option solely to decrease our build time. My
observation is that the community does not monitor the health of the cron
jobs well enough. In the past the cron jobs have been unstable for as long
as a complete release cycle. Moreover, I've seen that PRs were merged which
passed Travis but broke the cron jobs. Consequently, I fear that this
option would deteriorate Flink's stability.

5. I would rephrase this point into changing the build system. Gradle could
be one candidate but there are also other build systems out there like
Bazel. Changing the build system would indeed be a major endeavour but I
could see the long term benefits of such a change (similar to having a
consistent and enforced code style) in particular if the build system
supports the functionality which we would otherwise build & maintain on our
own. I think there would be ways to make the transition not as disruptive
as described. For example, one could keep the Maven build and the new build
side by side until one is confident enough that the new build produces the
same output as the Maven build. Maybe it would also be possible to migrate
individual modules starting from the leaves. However, I admit that changing
the build system will affect every Flink developer because she needs to
learn & understand it.

6. I would like to learn about other people's experience with different CI
systems. Travis has worked okish for Flink so far, but we sometimes see
problems with its caching mechanism, as Chesnay stated. I think that this
topic is actually orthogonal to the other suggestions.

My gut feeling is that no single suggestion will be our solution, but rather
a combination of them.

Cheers,
Till

On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu  wrote:

> Thanks Chesnay for bringing up this discussion and sharing those thoughts
> to speed up the building process.
>
> I'd +1 for option 2 and 3.
>
> We can benefit a lot from option 2. Changes to the table, connectors,
> libraries, or docs modules would result in far fewer tests to run (1/3
> down to a few percent of them).
> PRs for those modules make up more than half of all PRs, in my
> observation.
>
> Option 3 can supplement option 2 when a PR modifies fundamental
> modules like flink-core or flink-runtime.
> It could even be a switch for the test scope (basic/full) of a PR, so that
> committers do not need to trigger it multiple times.
> With it we can postpone the testing of IT cases or connectors until the PR
> reaches a stable state.
>
> Thanks,
> Zhu Zhu
>
> On Thu, Aug 15, 2019 at 3:38 PM Chesnay Schepler wrote:
>
> > Hello everyone,
> >
> > improving our build times is a hot topic at the moment so let's discuss
> > the different ways how they could be reduced.
> >
> >
> > Current state:
> >
> > First up, let's look at some numbers:
> >
> > 1 full build currently consumes 5h of build time total ("total time"),
> > and in the ideal case takes about 1h20m ("run time") to complete from
> > start to finish. The run time may fluctuate of course depending on the
> > current Travis load. This applies both to builds on the Apache and
> > flink-ci Travis.
> >
> > At the time of writing, the current queue time for PR jobs (reminder:
> > running on flink-ci) is about 30 minutes (which basically means that we
> > are processing builds 

Re: [DISCUSS] Reducing build times

2019-08-15 Thread Zhu Zhu
Thanks Chesnay for bringing up this discussion and sharing those thoughts
to speed up the building process.

I'd +1 for option 2 and 3.

We can benefit a lot from option 2. Changes to the table, connectors,
libraries, or docs modules would result in far fewer tests to run (1/3
down to a few percent of them).
PRs for those modules make up more than half of all PRs, in my
observation.

Option 3 can supplement option 2 when a PR modifies fundamental
modules like flink-core or flink-runtime.
It could even be a switch for the test scope (basic/full) of a PR, so that
committers do not need to trigger it multiple times.
With it we can postpone the testing of IT cases or connectors until the PR
reaches a stable state.
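
As a rough sketch of how such a switch could combine options 2 and 3 (this
is only one possible reading, and every name below is made up for
illustration): PRs touching fundamental modules start with the basic scope,
other PRs get a differential run, and a committer can request the full scope
once.

```python
# Assumption: which modules count as "fundamental" would need to be agreed on.
FUNDAMENTAL_MODULES = {"flink-core", "flink-runtime"}


def select_scope(changed_modules, full_run_requested=False):
    """Pick a test scope for a PR build.

    "full":         everything, triggered explicitly by a committer.
    "basic":        cheap tests only, used while a PR to fundamental
                    modules has not yet reached a stable state.
    "differential": only the modules affected by the change (option 2).
    """
    if full_run_requested:
        return "full"
    if FUNDAMENTAL_MODULES & set(changed_modules):
        # A differential build would select nearly every module here,
        # so fall back to the basic scope until a full run is requested.
        return "basic"
    return "differential"
```

For example, a PR that only touches a connector module would get the
differential scope, while a flink-runtime PR would stay on the basic scope
until a committer flips the switch.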

Thanks,
Zhu Zhu

On Thu, Aug 15, 2019 at 3:38 PM Chesnay Schepler wrote:

> Hello everyone,
>
> improving our build times is a hot topic at the moment so let's discuss
> the different ways how they could be reduced.
>
>
> Current state:
>
> First up, let's look at some numbers:
>
> 1 full build currently consumes 5h of build time total ("total time"),
> and in the ideal case takes about 1h20m ("run time") to complete from
> start to finish. The run time may fluctuate of course depending on the
> current Travis load. This applies both to builds on the Apache and
> flink-ci Travis.
>
> At the time of writing, the current queue time for PR jobs (reminder:
> running on flink-ci) is about 30 minutes (which basically means that we
> are processing builds at the rate that they come in), however we are in
> an admittedly quiet period right now.
> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> everyone was scrambling to get their changes merged in time for the
> feature freeze.
>
> (Note: Recently optimizations were added to ci-bot where pending builds
> are canceled if a new commit was pushed to the PR or the PR was closed,
> which should prove especially useful during the rush hours we see before
> feature-freezes.)
>
>
> Past approaches
>
> Over the years we have done rather few things to improve this situation
> (hence our current predicament).
>
> Beyond the sporadic speedup of some tests, the only notable reduction in
> total build times was the introduction of cron jobs, which consolidated
> the per-commit matrix from 4 configurations (different scala/hadoop
> versions) to 1.
>
> The separation into multiple build profiles was only a work-around for
> the 50m limit on Travis. Running tests in parallel has the obvious
> potential of reducing run time, but we're currently hitting a hard limit
> since a few modules (flink-tests, flink-runtime,
> flink-table-planner-blink) are so loaded with tests that they nearly
> consume an entire profile by themselves (and thus no further splitting
> is possible).
>
> The rework that introduced stages, at the time of introduction, did also
> not provide a speed up, although this changed slightly once more
> profiles were added and some optimizations to the caching have been made.
>
> Very recently we modified the surefire-plugin configuration for
> flink-table-planner-blink to reuse JVM forks for IT cases, providing a
> significant speedup (18 minutes!). So far we have not seen any negative
> consequences.
>
>
> Suggestions
>
> This is a list of /all/ suggestions for reducing run/total times that I
> have seen recently (in other words, they aren't necessarily mine, nor do
> I necessarily agree with all of them).
>
>  1. Enable JVM reuse for IT cases in more modules.
>       * We've seen significant speedups in the blink planner, and this
>         should be applicable for all modules. However, I presume there's
>         a reason why we disabled JVM reuse (information on this would be
>         appreciated)
>  2. Custom differential build scripts
>       * Set up custom scripts for determining which modules might be
>         affected by a change, and manipulate the splits accordingly. This
>         approach is conceptually quite straight-forward, but has limits
>         since it has to be pessimistic; i.e. a change in flink-core
>         _must_ result in testing all modules.
>  3. Only run smoke tests when PR is opened, run heavy tests on demand.
>       * With the introduction of the ci-bot we now have significantly
>         more options on how to handle PR builds. One option could be to
>         only run basic tests when the PR is created (which may be only
>         modified modules, or all unit tests, or another low-cost
>         scheme), and then have a committer trigger other builds (full
>         test run, e2e tests, etc...) on demand.
>  4. Move more tests into cron builds
>       * The budget version of 3); move certain tests that are either
>         expensive (like some runtime tests that take minutes) or in
>         rarely modified modules (like gelly) into cron jobs.
>  5. Gradle
>       * Gradle was brought up a few times for its built-in support for
>         differential builds; basically providing 2) without the overhead
>