Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Shaoxuan Wang
+1 (binding)

On Fri, Aug 16, 2019 at 7:48 PM Chesnay Schepler  wrote:

> +1 (binding)
>
> Although I think it would be a good idea to always cc
> priv...@flink.apache.org when modifying bylaws, if anything to speed up
> the voting process.
>
> On 16/08/2019 11:26, Ufuk Celebi wrote:
> > +1 (binding)
> >
> > – Ufuk
> >
> >
> > On Wed, Aug 14, 2019 at 4:50 AM Biao Liu  wrote:
> >
> >> +1 (non-binding)
> >>
> >> Thanks for pushing this!
> >>
> >> Thanks,
> >> Biao /'bɪ.aʊ/
> >>
> >>
> >>
> >> On Wed, 14 Aug 2019 at 09:37, Jark Wu  wrote:
> >>
> >>> +1 (non-binding)
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> On Wed, 14 Aug 2019 at 09:22, Kurt Young  wrote:
> >>>
>  +1 (binding)
> 
>  Best,
>  Kurt
> 
> 
>  On Wed, Aug 14, 2019 at 1:34 AM Yun Tang  wrote:
> 
> > +1 (non-binding)
> >
> > But I have a minor question about the "code change" action: for those
> > "[hotfix]" GitHub pull requests [1], the dev mailing list is currently
> > not notified. I think we should change the description of this action.
> >
> > [1]
> > https://flink.apache.org/contributing/contribute-code.html#code-contribution-process
> > Best
> > Yun Tang
> > 
> > From: JingsongLee 
> > Sent: Tuesday, August 13, 2019 23:56
> > To: dev 
> > Subject: Re: [VOTE] Flink Project Bylaws
> >
> > +1 (non-binding)
> > Thanks Becket.
> > I've learned a lot from current bylaws.
> >
> > Best,
> > Jingsong Lee
> >
> >
> > --
> > From:Yu Li 
> > Send Time:2019年8月13日(星期二) 17:48
> > To:dev 
> > Subject:Re: [VOTE] Flink Project Bylaws
> >
> > +1 (non-binding)
> >
> > Thanks for the efforts Becket!
> >
> > Best Regards,
> > Yu
> >
> >
> > On Tue, 13 Aug 2019 at 16:09, Xintong Song 
>  wrote:
> >> +1 (non-binding)
> >>
> >> Thank you~
> >>
> >> Xintong Song
> >>
> >>
> >>
> >> On Tue, Aug 13, 2019 at 1:48 PM Robert Metzger <
> >> rmetz...@apache.org>
> >> wrote:
> >>
> >>> +1 (binding)
> >>>
> >>> On Tue, Aug 13, 2019 at 1:47 PM Becket Qin  > wrote:
>  Thanks everyone for voting.
> 
>  For those who have already voted, just want to bring this up to your
>  attention that there is a minor clarification to the bylaws wiki this
>  morning. The change is in bold format below:
> 
>  one +1 from a committer followed by a Lazy approval (not counting the
>  vote of the contributor), moving to lazy majority if a -1 is received.
> 
>  Note that this implies that committers can +1 their own commits and
>  merge right away. *However, the committers should use their best
>  judgement to respect the component expertise and ongoing development
>  plan.*
> 
>  This addition does not really change anything the bylaws meant to set.
>  It is simply a clarification. If anyone who has already cast a vote
>  objects, please feel free to withdraw the vote.
> 
>  Thanks,
> 
>  Jiangjie (Becket) Qin
> 
> 
>  On Tue, Aug 13, 2019 at 1:29 PM Piotr Nowojski <
>  pi...@ververica.com>
>  wrote:
> 
> > +1
> >
> >> On 13 Aug 2019, at 13:22, vino yang  > wrote:
> >> +1
> >>
> >> Tzu-Li (Gordon) Tai  于2019年8月13日周二
> > 下午6:32写道:
> >>> +1
> >>>
> >>> On Tue, Aug 13, 2019, 12:31 PM Hequn Cheng <
> > chenghe...@gmail.com>
> > wrote:
>  +1 (non-binding)
> 
>  Thanks a lot for driving this! Good job. @Becket Qin <
> >>> becket@gmail.com
>  Best, Hequn
> 
>  On Tue, Aug 13, 2019 at 6:26 PM Stephan Ewen <
>  se...@apache.org
>  wrote:
> > +1
> >
> > On Tue, Aug 13, 2019 at 12:22 PM Maximilian Michels <
> >>> m...@apache.org
> > wrote:
> >
> >> +1 It's good that we formalize this.
> >>
> >> On 13.08.19 10:41, Fabian Hueske wrote:
> >>> +1 for the proposed bylaws.
> >>> Thanks for pushing this Becket!
> >>>
> >>> Cheers, Fabian
> >>>
> >>> Am Mo., 12. Aug. 2019 um 16:31 Uhr schrieb Robert
> >>> Metzger
>  <
> >>> rmetz...@apache.org>:
> >>>
>  I changed the permissions of the page.
> 
>  On Mon, Aug 12, 2019 at 4:21 PM Till Rohrmann <
>  trohrm...@apache.org>
>  wrote:
> >

[jira] [Created] (FLINK-13757) Document error for `logical functions`

2019-08-16 Thread hehuiyuan (JIRA)
hehuiyuan created FLINK-13757:
-

 Summary: Document error for  `logical functions`
 Key: FLINK-13757
 URL: https://issues.apache.org/jira/browse/FLINK-13757
 Project: Flink
  Issue Type: Wish
  Components: Documentation
Reporter: hehuiyuan
 Attachments: image-2019-08-17-11-58-53-247.png

[https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/table/functions.html#logical-functions]

Incorrect (current doc):
|{{boolean IS NOT TRUE}}|Returns TRUE if _boolean_ is FALSE or UNKNOWN; returns 
FALSE if _boolean_ is FALSE.|

Correct:
|{{boolean IS NOT TRUE}}|Returns TRUE if _boolean_ is FALSE or UNKNOWN; returns 
FALSE if _boolean_ is TRUE.|
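
For clarity, a minimal Java sketch of the corrected semantics, modeling SQL's 
UNKNOWN as a null Boolean (illustrative only, not Flink code):

final class TernaryLogic {
    // boolean IS NOT TRUE: true when b is FALSE or UNKNOWN (null),
    // false only when b is TRUE, matching the corrected description above.
    static boolean isNotTrue(Boolean b) {
        return b == null || !b;
    }
}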




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13756) Modify Code Annotations for findAndCreateTableSource in TableFactoryUtil

2019-08-16 Thread hehuiyuan (JIRA)
hehuiyuan created FLINK-13756:
-

 Summary:  Modify Code Annotations for findAndCreateTableSource  in 
TableFactoryUtil
 Key: FLINK-13756
 URL: https://issues.apache.org/jira/browse/FLINK-13756
 Project: Flink
  Issue Type: Wish
  Components: Table SQL / API
Reporter: hehuiyuan


 

/**
 * Returns a *table sink* matching the {@link org.apache.flink.table.catalog.CatalogTable}.
 */
public static <T> TableSource<T> findAndCreateTableSource(CatalogTable table) {
    return findAndCreateTableSource(table.toProperties());
}

 

Hi, this method `findAndCreateTableSource` is used for returning a 
`TableSource`, but the annotation says *`Returns a table sink`*.
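
For reference, the corrected Javadoc might read as follows (a sketch; the 
generic parameter is assumed from context):

/**
 * Returns a table source matching the {@link org.apache.flink.table.catalog.CatalogTable}.
 */
public static <T> TableSource<T> findAndCreateTableSource(CatalogTable table) {
    return findAndCreateTableSource(table.toProperties());
}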

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13755) support Hive built-in functions in Flink

2019-08-16 Thread Bowen Li (JIRA)
Bowen Li created FLINK-13755:


 Summary: support Hive built-in functions in Flink
 Key: FLINK-13755
 URL: https://issues.apache.org/jira/browse/FLINK-13755
 Project: Flink
  Issue Type: New Feature
  Components: Connectors / Hive
Affects Versions: 1.10.0
Reporter: Bowen Li
Assignee: Bowen Li
 Fix For: 1.10.0






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13754) Decouple OperatorChain from StreamStatusMaintainer

2019-08-16 Thread zhijiang (JIRA)
zhijiang created FLINK-13754:


 Summary: Decouple OperatorChain from StreamStatusMaintainer
 Key: FLINK-13754
 URL: https://issues.apache.org/jira/browse/FLINK-13754
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Reporter: zhijiang
Assignee: zhijiang


There are two motivations for this refactoring:
 * It is a precondition for the follow-up work of decoupling the dependency 
between the two inputs' statuses in ForwardingValveOutputHandler.
 * From a design perspective, the current OperatorChain takes on many 
unrelated roles, such as StreamStatusMaintainer, which makes it hard to 
maintain. The root cause is the cyclic dependency between RecordWriterOutput 
(created by OperatorChain) and StreamStatusMaintainer.

The solution is to move the creation of StreamStatusMaintainer and 
RecordWriterOutput to the StreamTask level, breaking the cyclic dependency 
between their implementations. The array of RecordWriters, which is closely 
related to RecordWriterOutput, is already created in StreamTask, so it is 
reasonable to create them together. The StreamStatusMaintainer created in 
StreamTask can then be directly referenced by subclasses such as 
OneInputStreamTask/TwoInputStreamTask.
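
As an abstract illustration of breaking such a construction cycle 
(hypothetical types, not the actual Flink internals), a third owner creates 
and wires both objects:

// Before: the chain creates the writer output, which needs the status
// maintainer that the chain itself provides, forming a cycle.
// After: the task creates both and hands them to the chain.
final class StatusMaintainerSketch {}

final class WriterOutputSketch {
    WriterOutputSketch(StatusMaintainerSketch maintainer) {}
}

final class OperatorChainSketch {
    OperatorChainSketch(WriterOutputSketch output) {}
}

final class StreamTaskSketch {
    void setup() {
        StatusMaintainerSketch maintainer = new StatusMaintainerSketch(); // owned by the task
        WriterOutputSketch output = new WriterOutputSketch(maintainer);   // depends on the maintainer only
        OperatorChainSketch chain = new OperatorChainSketch(output);      // no longer owns the maintainer
    }
}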



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13753) Integrate new Source Operator with Mailbox Model in StreamTask

2019-08-16 Thread zhijiang (JIRA)
zhijiang created FLINK-13753:


 Summary: Integrate new Source Operator with Mailbox Model in 
StreamTask
 Key: FLINK-13753
 URL: https://issues.apache.org/jira/browse/FLINK-13753
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Reporter: zhijiang
Assignee: zhijiang


This is the umbrella issue for integrating new source operator with mailbox 
model in StreamTask.

The motivation is based on 
[FLIP-27|https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface], 
which proposes to refactor the whole source API and to integrate task-level 
actions (including checkpoints, timers, and async operators) with the unified 
[mailbox model|https://docs.google.com/document/d/1eDpsUKv2FqwZiS1Pm6gYO5eFHScBHfULKmH1-ZEWB4g] 
on the runtime side.
 * The benefit is simpler, unified processing logic: a single thread handles 
all actions, so there are no concurrency issues, and we can further get rid 
of the lock dependency that causes the unfair-lock concern in the 
checkpointing process.
 * We still need to support the current legacy sources for some releases, as 
they will probably be used for a while, especially where performance is a 
concern.

The design doc is 
https://docs.google.com/document/d/13x9M7k1SRqkOFXP0bETcJemIRyJzoqGgkdy11pz5qHM/edit
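
To make the single-threaded idea concrete, here is a minimal, self-contained 
sketch of a mailbox (illustrative only, not Flink's actual implementation):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class Mailbox {
    private final BlockingQueue<Runnable> letters = new LinkedBlockingQueue<>();

    // May be called from any thread, e.g. a checkpoint trigger or a timer.
    void send(Runnable letter) throws InterruptedException {
        letters.put(letter);
    }

    // Run by the single task thread: all actions execute sequentially,
    // so record processing, timers and checkpoints never race with each
    // other and no checkpoint lock is needed.
    void runMailboxLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            letters.take().run();
        }
    }
}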



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13752) TaskDeploymentDescriptor cannot be recycled by GC due to referenced by an anonymous function

2019-08-16 Thread Yun Gao (JIRA)
Yun Gao created FLINK-13752:
---

 Summary: TaskDeploymentDescriptor cannot be recycled by GC due to 
referenced by an anonymous function
 Key: FLINK-13752
 URL: https://issues.apache.org/jira/browse/FLINK-13752
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Yun Gao


When comparing 1.8 and 1.9.0-rc2 on a test streaming job, we found that the 
performance of 1.9.0-rc2 is much lower than that of 1.8. Comparing the two 
versions, we found that the count of full GCs on 1.9.0-rc2 is much higher 
than on 1.8.

A further analysis showed that the difference comes from 
TaskExecutor#setupResultPartitionBookkeeping: the anonymous function in 
taskTerminationWithResourceCleanFuture references the TaskDeploymentDescriptor. 
Since this function is kept until the task terminates, the 
TaskDeploymentDescriptor is also kept referenced in the closure and cannot be 
recycled by GC. In this job, the TaskDeploymentDescriptors of some tasks are 
as large as 10MB while the total heap is about 113MB, so the retained 
TaskDeploymentDescriptors have a relatively large impact on GC and performance.
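
The pattern can be illustrated with a small sketch (hypothetical names, not 
the actual Flink code):

import java.util.concurrent.CompletableFuture;

final class CaptureLeakExample {
    // Problematic: the lambda captures the whole (potentially huge)
    // descriptor, keeping it reachable until the termination future completes.
    static void leaky(CompletableFuture<Void> taskTermination, BigDescriptor descriptor) {
        taskTermination.thenRun(() -> cleanUp(descriptor.partitionId()));
    }

    // Better: extract the small value up front so that only it is captured,
    // letting the descriptor be garbage collected right after deployment.
    static void fixed(CompletableFuture<Void> taskTermination, BigDescriptor descriptor) {
        long partitionId = descriptor.partitionId();
        taskTermination.thenRun(() -> cleanUp(partitionId));
    }

    static void cleanUp(long partitionId) {}
}

final class BigDescriptor {
    long partitionId() { return 42L; }
    // ... in the real case, megabytes of serialized job information
}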



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

2019-08-16 Thread JingsongLee
+1 to this, thanks Timo and Dawid for the design.
This allows the currently cluttered configuration of the various
modules to be unified.
It is also a first step toward making the new unified
TableEnvironment available for production.

Previously, we did encounter complex configurations, such as
specifying the skewed values of a column in DDL. The skew may
be a single field or a combination of multiple fields, so the
configuration was quite troublesome; we ended up using a JSON
string to configure it.

Best,
Jingsong Lee



--
From:Jark Wu 
Send Time:2019年8月16日(星期五) 16:44
To:dev 
Subject:Re: [DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

Thanks for starting this design Timo and Dawid,

Improving ConfigOption has been hovering in my mind for a long time.
We have seen the benefit when developing blink configurations and connector
properties in 1.9 release.
Thanks for bringing it up and making such a detailed design.
I will leave my thoughts and comments there.

Cheers,
Jark


On Fri, 16 Aug 2019 at 22:30, Zili Chen  wrote:

> Hi Timo,
>
> It looks interesting. Thanks for preparing this FLIP!
>
> The client API enhancement benefits from this evolution, which
> hopefully provides a better view of the configuration of Flink.
> In the client API enhancement, we will likely make the deployment
> of the cluster and the submission of jobs totally defined by configuration.
>
> I will take a look at the document in the coming days.
>
> Best,
> tison.
>
>
> Timo Walther  于2019年8月16日周五 下午10:12写道:
>
> > Hi everyone,
> >
> > Dawid and I are working on making parts of ExecutionConfig and
> > TableConfig configurable via config options. This is necessary to make
> > all properties also available in SQL. Additionally, with the new SQL DDL
> > based on properties as well as more connectors and formats coming up,
> > unified configuration becomes more important.
> >
> > We need more features around string-based configuration in the future,
> > which is why Dawid and I would like to propose FLIP-54 for evolving the
> > ConfigOption and Configuration classes:
> >
> >
> >
> https://docs.google.com/document/d/1IQ7nwXqmhCy900t2vQLEL3N2HIdMg-JO8vTzo1BtyKU/edit
> >
> > In summary it adds:
> > - documented types and validation
> > - more common types such as memory size, duration, list
> > - simple non-nested object types
> >
> > Looking forward to your feedback,
> > Timo
> >
> >
>



[VOTE] FLIP-50: Spill-able Heap State Backend

2019-08-16 Thread Yu Li
Hi All,

Since we have reached a consensus in the discussion thread [1], I'd like to
start the voting for FLIP-50 [2].

This vote will be open for at least 72 hours. Unless objection I will try
to close it by end of Tuesday August 20, 2019 if we have sufficient votes.
Thanks.

[1] https://s.apache.org/cq358
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend

Best Regards,
Yu


Re: [DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

2019-08-16 Thread Jark Wu
Thanks for starting this design Timo and Dawid,

Improving ConfigOption has been hovering in my mind for a long time.
We have seen the benefit when developing blink configurations and connector
properties in 1.9 release.
Thanks for bringing it up and making such a detailed design.
I will leave my thoughts and comments there.

Cheers,
Jark


On Fri, 16 Aug 2019 at 22:30, Zili Chen  wrote:

> Hi Timo,
>
> It looks interesting. Thanks for preparing this FLIP!
>
> The client API enhancement benefits from this evolution, which
> hopefully provides a better view of the configuration of Flink.
> In the client API enhancement, we will likely make the deployment
> of the cluster and the submission of jobs totally defined by configuration.
>
> I will take a look at the document in the coming days.
>
> Best,
> tison.
>
>
> Timo Walther  于2019年8月16日周五 下午10:12写道:
>
> > Hi everyone,
> >
> > Dawid and I are working on making parts of ExecutionConfig and
> > TableConfig configurable via config options. This is necessary to make
> > all properties also available in SQL. Additionally, with the new SQL DDL
> > based on properties as well as more connectors and formats coming up,
> > unified configuration becomes more important.
> >
> > We need more features around string-based configuration in the future,
> > which is why Dawid and I would like to propose FLIP-54 for evolving the
> > ConfigOption and Configuration classes:
> >
> >
> >
> https://docs.google.com/document/d/1IQ7nwXqmhCy900t2vQLEL3N2HIdMg-JO8vTzo1BtyKU/edit
> >
> > In summary it adds:
> > - documented types and validation
> > - more common types such as memory size, duration, list
> > - simple non-nested object types
> >
> > Looking forward to your feedback,
> > Timo
> >
> >
>


Re: [DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

2019-08-16 Thread Zili Chen
Hi Timo,

It looks interesting. Thanks for preparing this FLIP!

The client API enhancement benefits from this evolution, which
hopefully provides a better view of the configuration of Flink.
In the client API enhancement, we will likely make the deployment
of the cluster and the submission of jobs totally defined by configuration.

I will take a look at the document in the coming days.

Best,
tison.


Timo Walther  于2019年8月16日周五 下午10:12写道:

> Hi everyone,
>
> Dawid and I are working on making parts of ExecutionConfig and
> TableConfig configurable via config options. This is necessary to make
> all properties also available in SQL. Additionally, with the new SQL DDL
> based on properties as well as more connectors and formats coming up,
> unified configuration becomes more important.
>
> We need more features around string-based configuration in the future,
> which is why Dawid and I would like to propose FLIP-54 for evolving the
> ConfigOption and Configuration classes:
>
>
> https://docs.google.com/document/d/1IQ7nwXqmhCy900t2vQLEL3N2HIdMg-JO8vTzo1BtyKU/edit
>
> In summary it adds:
> - documented types and validation
> - more common types such as memory size, duration, list
> - simple non-nested object types
>
> Looking forward to your feedback,
> Timo
>
>


[DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

2019-08-16 Thread Timo Walther

Hi everyone,

Dawid and I are working on making parts of ExecutionConfig and 
TableConfig configurable via config options. This is necessary to make 
all properties also available in SQL. Additionally, with the new SQL DDL 
based on properties as well as more connectors and formats coming up, 
unified configuration becomes more important.


We need more features around string-based configuration in the future, 
which is why Dawid and I would like to propose FLIP-54 for evolving the 
ConfigOption and Configuration classes:


https://docs.google.com/document/d/1IQ7nwXqmhCy900t2vQLEL3N2HIdMg-JO8vTzo1BtyKU/edit

In summary it adds:
- documented types and validation
- more common types such as memory size, duration, list
- simple non-nested object types
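
For illustration, a typed and documented option under this proposal might 
look something like the following (a hypothetical sketch; the builder method 
names are illustrative, not the final API):

import java.time.Duration;

final class TypedOptionExample {
    static final ConfigOption<Duration> CHECKPOINT_INTERVAL =
        ConfigOptions.key("execution.checkpointing.interval")
            .durationType()
            .defaultValue(Duration.ofMinutes(10))
            .withDescription("Interval between two consecutive checkpoints.");
}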

Looking forward to your feedback,
Timo



Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler
@Aljoscha Shading takes a few minutes for a full build; you can see this 
quite easily by looking at the compile step in the misc profile; all 
modules that take longer than a fraction of a second usually do so 
because they shade lots of classes. Note that I cannot tell you how much 
of this is spent on relocations, and how much on writing the jar.


Personally, I'd very much like us to move all shading to flink-shaded; 
this would finally allow us to use newer maven versions without needing 
cumbersome workarounds for flink-dist. However, this isn't a trivial 
affair in some cases; IIRC calcite could be difficult to handle.


On another note, this would also simplify switching the main repo to 
another build system, since you would no longer have to deal with 
relocations, just packaging + merging NOTICE files.


@BowenLi I disagree; flink-shaded does not include any tests, API 
compatibility checks, checkstyle, layered shading (e.g., flink-runtime 
and flink-dist, where both relocate dependencies and one is bundled by 
the other), and, most importantly, CI (and really, without CI being 
covered in a PoC there's nothing to discuss).


On 16/08/2019 15:13, Aljoscha Krettek wrote:

Speaking of flink-shaded, do we have any idea what the impact of shading is on 
the build time? We could get rid of shading completely in the Flink main 
repository by moving everything that we shade to flink-shaded.

Aljoscha


On 16. Aug 2019, at 14:58, Bowen Li  wrote:

+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point out that we can actually start with the
flink-shaded project [1], which is a perfect candidate for a PoC. It's much
smaller, totally isolated from and not interfering with the flink project
[2], and it actually covers most of our practical feature requirements for
a build tool - all making it an ideal experimental field.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink


On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:


For the sake of keeping the discussion focused and not cluttering the
discussion thread I would suggest to split the detailed reporting for
reusing JVMs to a separate thread and cross linking it from here.

Cheers,
Till

On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
wrote:


Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right
away, while flink-tests has the potential for huge savings, but we have
to figure out some issues first.


Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner, 7 minutes saved in
libraries (table-planner).

The kafka and connectors profiles both fail in kafka tests due to
producer leaks, and no speed up could be confirmed so far:

java.lang.AssertionError: Detected producer leak. Thread name:
kafka-producer-network-thread | producer-239
at org.junit.Assert.fail(Assert.java:88)
at


org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)

at


org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)


The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected

accumulator

results within time limit.
at


org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal tests run takes 40 minutes, while this one above
failed after 19 minutes and is only missing the migration tests (which
currently need 6-7 minutes). So we could save somewhere between 15 to 20
minutes here.


Finally, the misc profile fails in YARN:

java.lang.AssertionError
at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for
flink-yarn-tests we can maybe get a minute or 2 out of it.

On 16/08/2019 10:43, Chesnay Schepler wrote:

There appears to be a general agreement that 1) should be looked into;
I've setup a branch with fork reuse being enabled for all tests; will
report back the results.

On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's
discuss the different ways how they could be reduced.


   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total
time"), and in the ideal case takes about 1h20m ("run time") to
complete from start to finish. The run time may fluctuate of course
depending on the current Travis load. This applies both to builds on
the Apache and flink-ci Travis.

At the time of writing, the current queue time for

Re: [DISCUSS] Flink client api enhancement for downstream project

2019-08-16 Thread Aljoscha Krettek
Hi,

I read both Jeff's initial design document and the newer document by Tison. I 
also finally found the time to collect our thoughts on the issue; I had quite 
some discussions with Kostas, and this is the result: [1].

I think overall we agree that this part of the code is in dire need of some 
refactoring/improvements but I think there are still some open questions and 
some differences in opinion what those refactorings should look like.

I think the API-side is quite clear, i.e. we need some JobClient API that 
allows interacting with a running Job. It could be worthwhile to spin that off 
into a separate FLIP because we can probably find consensus on that part more 
easily.
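
For illustration, such a JobClient could be as small as the following sketch 
(illustrative names and types, not the FLIP's API):

import java.util.concurrent.CompletableFuture;

// A minimal interface for interacting with a running job: status queries
// and lifecycle operations, all asynchronous.
public interface JobClient {
    String getJobId();
    CompletableFuture<String> getJobStatus();   // e.g. "RUNNING", "FINISHED"
    CompletableFuture<Void> cancel();
    CompletableFuture<String> stopWithSavepoint(String targetDirectory);
}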

For the rest, the main open questions from our doc are these:

  - Do we want to separate cluster creation and job submission for per-job 
mode? In the past, there were conscious efforts to *not* separate job 
submission from cluster creation for per-job clusters on Mesos, YARN, and 
Kubernetes (see StandaloneJobClusterEntryPoint). Tison suggests in his design 
document to decouple this in order to unify job submission.

  - How to deal with plan preview, which needs to hijack execute() and let the 
outside code catch an exception?

  - How to deal with Jar Submission at the Web Frontend, which needs to hijack 
execute() and let the outside code catch an exception? CliFrontend.run() 
“hijacks” ExecutionEnvironment.execute() to get a JobGraph and then execute 
that JobGraph manually. We could get around that by letting execute() do the 
actual execution. One caveat for this is that now the main() method doesn’t 
return (or is forced to return by throwing an exception from execute()) which 
means that for Jar Submission from the WebFrontend we have a long-running 
main() method running in the WebFrontend. This doesn’t sound very good. We 
could get around this by removing the plan preview feature and by removing Jar 
Submission/Running.

  - How to deal with detached mode? Right now, DetachedEnvironment will execute 
the job and return immediately. If users control when they want to return, by 
waiting on the job completion future, how do we deal with this? Do we simply 
remove the distinction between detached/non-detached?

  - How does per-job mode interact with “interactive programming” (FLIP-36). 
For YARN, each execute() call could spawn a new Flink YARN cluster. What about 
Mesos and Kubernetes?

The first open question is where the opinions diverge, I think. The rest are 
just open questions and interesting things that we need to consider.

Best,
Aljoscha

[1] 
https://docs.google.com/document/d/1E-8UjOLz4QPUTxetGWbU23OlsIH9VIdodpTsxwoQTs0/edit#heading=h.na7k0ad88tix
 


> On 31. Jul 2019, at 15:23, Jeff Zhang  wrote:
> 
> Thanks tison for the effort. I left a few comments.
> 
> 
> Zili Chen  于2019年7月31日周三 下午8:24写道:
> 
>> Hi Flavio,
>> 
>> Thanks for your reply.
>> 
>> In both the current implementation and the design, ClusterClient
>> never takes responsibility for generating the JobGraph.
>> (What you see in the current codebase are several class methods.)
>> 
>> Instead, the user describes his program in the main method
>> with the ExecutionEnvironment APIs and calls env.compile()
>> or env.optimize() to get the FlinkPlan and JobGraph respectively.
>> 
>> For listing main classes in a jar and choosing one for
>> submission, you're now able to customize a CLI to do it.
>> Specifically, the path of the jar is passed as an argument, and
>> in the customized CLI you list the main classes and choose one
>> to submit to the cluster.
>> 
>> Best,
>> tison.
>> 
>> 
>> Flavio Pompermaier  于2019年7月31日周三 下午8:12写道:
>> 
>>> Just one note on my side: it is not clear to me whether the client needs
>>> to be able to generate a job graph or not.
>>> In my opinion, the job jar must reside only on the server/jobManager
>>> side, and the client requires a way to get the job graph.
>>> If you really want to access to the job graph, I'd add a dedicated method
>>> on the ClusterClient. like:
>>> 
>>>   - getJobGraph(jarId, mainClass): JobGraph
>>>   - listMainClasses(jarId): List
>>> 
>>> These would require some addition also on the job manager endpoint as
>>> well..what do you think?
>>> 
>>> On Wed, Jul 31, 2019 at 12:42 PM Zili Chen  wrote:
>>> 
 Hi all,
 
 Here is a document[1] on client api enhancement from our perspective.
 We have investigated current implementations. And we propose
 
 1. Unify the implementation of cluster deployment and job submission in
 Flink.
 2. Provide programmatic interfaces to allow flexible job and cluster
 management.
 
 The first proposal is aimed at reducing code paths of cluster
>> deployment
 and
 job submission so that one can adopt Flink in his usage easily. The
>>> second
 proposal is aimed at providing rich interfaces for advanced users
 who want to make accurate control of these st

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Aljoscha Krettek
Speaking of flink-shaded, do we have any idea what the impact of shading is on 
the build time? We could get rid of shading completely in the Flink main 
repository by moving everything that we shade to flink-shaded.

Aljoscha

> On 16. Aug 2019, at 14:58, Bowen Li  wrote:
> 
> +1 to Till's points on #2 and #5, especially the potential non-disruptive,
> gradual migration approach if we decide to go that route.
> 
> To add on, I want to point out that we can actually start with the
> flink-shaded project [1], which is a perfect candidate for a PoC. It's much
> smaller, totally isolated from and not interfering with the flink project
> [2], and it actually covers most of our practical feature requirements for
> a build tool - all making it an ideal experimental field.
> 
> [1] https://github.com/apache/flink-shaded
> [2] https://github.com/apache/flink
> 
> 
> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:
> 
>> For the sake of keeping the discussion focused and not cluttering the
>> discussion thread I would suggest to split the detailed reporting for
>> reusing JVMs to a separate thread and cross linking it from here.
>> 
>> Cheers,
>> Till
>> 
>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
>> wrote:
>> 
>>> Update:
>>> 
>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
>>> away, while flink-tests has the potential for huge savings, but we have
>>> to figure out some issues first.
>>> 
>>> 
>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>> 
>>> 4/8 profiles failed.
>>> 
>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
>>> libraries (table-planner).
>>> 
>>> The kafka and connectors profiles both fail in kafka tests due to
>>> producer leaks, and no speed up could be confirmed so far:
>>> 
>>> java.lang.AssertionError: Detected producer leak. Thread name:
>>> kafka-producer-network-thread | producer-239
>>>at org.junit.Assert.fail(Assert.java:88)
>>>at
>>> 
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>at
>>> 
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>> 
>>> 
>>> The tests profile failed due to various errors in migration tests:
>>> 
>>> junit.framework.AssertionFailedError: Did not see the expected
>> accumulator
>>> results within time limit.
>>>at
>>> 
>> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>> 
>>> *However*, a normal tests run takes 40 minutes, while this one above
>>> failed after 19 minutes and is only missing the migration tests (which
>>> currently need 6-7 minutes). So we could save somewhere between 15 to 20
>>> minutes here.
>>> 
>>> 
> >>> Finally, the misc profile fails in YARN:
>>> 
>>> java.lang.AssertionError
>>>at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>> 
>>> No significant speedup could be observed in other modules; for
>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
>>> 
>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
 There appears to be a general agreement that 1) should be looked into;
 I've setup a branch with fork reuse being enabled for all tests; will
 report back the results.
 
 On 15/08/2019 09:38, Chesnay Schepler wrote:
> Hello everyone,
> 
> improving our build times is a hot topic at the moment so let's
> discuss the different ways how they could be reduced.
> 
> 
>   Current state:
> 
> First up, let's look at some numbers:
> 
> 1 full build currently consumes 5h of build time total ("total
> time"), and in the ideal case takes about 1h20m ("run time") to
> complete from start to finish. The run time may fluctuate of course
> depending on the current Travis load. This applies both to builds on
> the Apache and flink-ci Travis.
> 
> At the time of writing, the current queue time for PR jobs (reminder:
> running on flink-ci) is about 30 minutes (which basically means that
> we are processing builds at the rate that they come in), however we
> are in an admittedly quiet period right now.
> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> everyone was scrambling to get their changes merged in time for the
> feature freeze.
> 
> (Note: Recently optimizations where added to ci-bot where pending
> builds are canceled if a new commit was pushed to the PR or the PR
> was closed, which should prove especially useful during the rush
> hours we see before feature-freezes.)
> 
> 
>   Past approaches
> 
> Over the years we have done rather few things to improve this
> situation (hence our current predicament).
> 
> Beyond the sporadic speedup of some te

Re: [VOTE] FLIP-51: Rework of the Expression Design

2019-08-16 Thread Aljoscha Krettek
+1

This seems to be a good refactoring/cleanup step to me!

> On 16. Aug 2019, at 10:59, Dawid Wysakowicz  wrote:
> 
> +1 from my side
> 
> Best,
> 
> Dawid
> 
> On 16/08/2019 10:31, Jark Wu wrote:
>> +1 from my side.
>> 
>> Thanks Jingsong for driving this.
>> 
>> Best,
>> Jark
>> 
>> On Thu, 15 Aug 2019 at 22:09, Timo Walther  wrote:
>> 
>>> +1 for this.
>>> 
>>> Thanks,
>>> Timo
>>> 
>>> Am 15.08.19 um 15:57 schrieb JingsongLee:
 Hi Flink devs,
 
 I would like to start the voting for FLIP-51 Rework of the Expression
  Design.
 
 FLIP wiki:
 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design
 Discussion thread:
 
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html
 Google Doc:
 
>>> https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing
 Thanks,
 
 Best,
 Jingsong Lee
>>> 
>>> 
> 



Re: [DISCUSS] Reducing build times

2019-08-16 Thread Bowen Li
+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point out that we can actually start with the
flink-shaded project [1], which is a perfect candidate for a PoC. It's much
smaller, totally isolated from and not interfering with the flink project
[2], and it actually covers most of our practical feature requirements for
a build tool - all making it an ideal experimental field.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink


On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:

> For the sake of keeping the discussion focused and not cluttering the
> discussion thread I would suggest to split the detailed reporting for
> reusing JVMs to a separate thread and cross linking it from here.
>
> Cheers,
> Till
>
> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
> wrote:
>
> > Update:
> >
> > TL;DR: table-planner is a good candidate for enabling fork reuse right
> > away, while flink-tests has the potential for huge savings, but we have
> > to figure out some issues first.
> >
> >
> > Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >
> > 4/8 profiles failed.
> >
> > No speedup in libraries, python, blink_planner, 7 minutes saved in
> > libraries (table-planner).
> >
> > The kafka and connectors profiles both fail in kafka tests due to
> > producer leaks, and no speed up could be confirmed so far:
> >
> > java.lang.AssertionError: Detected producer leak. Thread name:
> > kafka-producer-network-thread | producer-239
> > at org.junit.Assert.fail(Assert.java:88)
> > at
> >
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> > at
> >
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >
> >
> > The tests profile failed due to various errors in migration tests:
> >
> > junit.framework.AssertionFailedError: Did not see the expected
> accumulator
> > results within time limit.
> > at
> >
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >
> > *However*, a normal tests run takes 40 minutes, while this one above
> > failed after 19 minutes and is only missing the migration tests (which
> > currently need 6-7 minutes). So we could save somewhere between 15 to 20
> > minutes here.
> >
> >
> > Finally, the misc profile fails in YARN:
> >
> > java.lang.AssertionError
> > at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >
> > No significant speedup could be observed in other modules; for
> > flink-yarn-tests we can maybe get a minute or 2 out of it.
> >
> > On 16/08/2019 10:43, Chesnay Schepler wrote:
> > > There appears to be a general agreement that 1) should be looked into;
> > > I've setup a branch with fork reuse being enabled for all tests; will
> > > report back the results.
> > >
> > > On 15/08/2019 09:38, Chesnay Schepler wrote:
> > >> Hello everyone,
> > >>
> > >> improving our build times is a hot topic at the moment so let's
> > >> discuss the different ways how they could be reduced.
> > >>
> > >>
> > >>Current state:
> > >>
> > >> First up, let's look at some numbers:
> > >>
> > >> 1 full build currently consumes 5h of build time total ("total
> > >> time"), and in the ideal case takes about 1h20m ("run time") to
> > >> complete from start to finish. The run time may fluctuate of course
> > >> depending on the current Travis load. This applies both to builds on
> > >> the Apache and flink-ci Travis.
> > >>
> > >> At the time of writing, the current queue time for PR jobs (reminder:
> > >> running on flink-ci) is about 30 minutes (which basically means that
> > >> we are processing builds at the rate that they come in), however we
> > >> are in an admittedly quiet period right now.
> > >> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> > >> everyone was scrambling to get their changes merged in time for the
> > >> feature freeze.
> > >>
> > >> (Note: Recently optimizations where added to ci-bot where pending
> > >> builds are canceled if a new commit was pushed to the PR or the PR
> > >> was closed, which should prove especially useful during the rush
> > >> hours we see before feature-freezes.)
> > >>
> > >>
> > >>Past approaches
> > >>
> > >> Over the years we have done rather few things to improve this
> > >> situation (hence our current predicament).
> > >>
> > >> Beyond the sporadic speedup of some tests, the only notable reduction
> > >> in total build times was the introduction of cron jobs, which
> > >> consolidated the per-commit matrix from 4 configurations (different
> > >> scala/hadoop versions) to 1.
> > >>
> > >> The separation into multiple build profiles was only a work-around
> > >> for the

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

2019-08-16 Thread Xintong Song
Thanks for the feedbacks, Yangze and Till.

Yangze,

I agree with you that we should make the scheduling strategy pluggable and
optimize the strategy to reduce the memory fragmentation problem, and
thanks for the inputs on the potential algorithmic solutions. However, I'm
in favor of keeping this FLIP focused on the overall mechanism design rather
than strategies. Solving the fragmentation issue should be considered an
optimization, and I agree with Till that we should probably tackle it
afterwards.
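
To make the strategy discussion concrete, here is a toy sketch (not Flink
code) of the first-fit and best-fit placements from Yangze's mail quoted
below, over task executors' free memory in MB:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.IntStream;

final class SlotPlacement {
    // First fit: pick the first task executor with enough free memory.
    static Optional<Integer> firstFit(List<Integer> freeMemoryMb, int requestMb) {
        for (int i = 0; i < freeMemoryMb.size(); i++) {
            if (freeMemoryMb.get(i) >= requestMb) {
                return Optional.of(i);
            }
        }
        return Optional.empty();
    }

    // Best fit: pick the executor that exceeds the request by the smallest
    // margin, which avoids the fragmentation in the T1/T2 example.
    static Optional<Integer> bestFit(List<Integer> freeMemoryMb, int requestMb) {
        return IntStream.range(0, freeMemoryMb.size())
                .filter(i -> freeMemoryMb.get(i) >= requestMb)
                .boxed()
                .min(Comparator.comparingInt(i -> freeMemoryMb.get(i) - requestMb));
    }
}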

Till,

- Regarding splitting the FLIP, I think it makes sense. The operator
resource management and dynamic slot allocation do not have much dependency
on each other.

- Regarding the default slot size, I think this is similar to FLIP-49 [1],
where we want all the deriving to happen in one place. I think it would be
nice to pass the default slot size into the task executor in the same way
that we pass in the memory pool sizes in FLIP-49 [1].

- Regarding the return value of TaskExecutorGateway#requestResource, I
think you're right. We should avoid using null as the return value; we
should probably throw an exception here instead.
Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann  wrote:

> Hi Xintong,
>
> thanks for drafting this FLIP. I think your proposal helps to improve the
> execution of batch jobs more efficiently. Moreover, it enables the proper
> integration of the Blink planner which is very important as well.
>
> Overall, the FLIP looks good to me. I was wondering whether it wouldn't
> make sense to actually split it up into two FLIPs: Operator resource
> management and dynamic slot allocation. I think these two FLIPs could be
> seen as orthogonal and it would decrease the scope of each individual FLIP.
>
> Some smaller comments:
>
> - I'm not sure whether we should pass in the default slot size via an
> environment variable. Without having unified the way how Flink components
> are configured [1], I think it would be better to pass it in as part of the
> configuration.
> - I would avoid returning a null value from
> TaskExecutorGateway#requestResource if it cannot be fulfilled. Either we
> should introduce an explicit return value saying this or throw an
> exception.
>
> Concerning Yangze's comments: I think you are right that it would be
> helpful to make the selection strategy pluggable. Also batching slot
> requests to the RM could be a good optimization. For the sake of keeping
> the scope of this FLIP smaller I would try to tackle these things after the
> initial version has been completed (without spoiling these optimization
> opportunities). In particular batching the slot requests depends on the
> current scheduler refactoring and could also be realized on the RM side
> only.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
>
> Cheers,
> Till
>
>
>
> On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo  wrote:
>
> > Hi, Xintong
> >
> > Thanks for proposing this FLIP. The general design looks good to me; +1
> > for this feature.
> >
> > Since slots in the same task executor could have different resource
> > profiles, we will run into a resource fragmentation problem. Think about
> > this case:
> >  - request A wants 1G memory while requests B & C want 0.5G memory
> >  - There are two task executors T1 & T2 with 1G and 0.5G free memory
> > respectively
> > If B comes first and we cut a slot from T1 for B, A must wait for free
> > resources from other tasks. But A could have been scheduled immediately
> > if we had cut a slot from T2 for B.
> >
> > The logic of findMatchingSlot now becomes finding a task executor which
> > has enough resources and then cutting a slot from it. The current method
> > could be seen as a "first-fit strategy", which works well in general but
> > is sometimes not optimal.
> >
> > Actually, this problem can be abstracted as the "Bin Packing Problem"[1].
> > Here are some common approximate algorithms:
> > - First fit
> > - Next fit
> > - Best fit
> >
> > But it becomes a multi-dimensional bin packing problem if we take CPU
> > into account. It is hard to define which one is the best fit then. Some
> > research has addressed this problem, such as Tetris[2].
> >
> > Here are some thoughts about it:
> > 1. We could make the strategy of finding a matching task executor
> > pluggable, letting users configure the best strategy for their scenario.
> > 2. We could support a batch request interface in the RM, because we have
> > opportunities to optimize if we have more information. If we know A, B,
> > and C at the same time, we could always make the best decision.
> >
> > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> > [2] https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> >
> > Best,
> > Yangze Guo
> >
> > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song 
> > wrote:
> > >
> > > H

[jira] [Created] (FLINK-13751) Add Built-in vector types

2019-08-16 Thread Xu Yang (JIRA)
Xu Yang created FLINK-13751:
---

 Summary: Add Built-in vector types
 Key: FLINK-13751
 URL: https://issues.apache.org/jira/browse/FLINK-13751
 Project: Flink
  Issue Type: Sub-task
  Components: Library / Machine Learning
Reporter: Xu Yang


The built-in vector types are the TypeInformation for Vector, DenseVector and 
SparseVector. The class contains the mapping between each TypeInformation and 
its String representation.

 Add the class for the built-in vector types
 Add the test cases
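
For context, a minimal sketch of the two concrete representations such a 
TypeInformation would describe (illustrative only, not the actual flink-ml 
classes):

// Dense: one double per dimension.
final class DenseVectorSketch {
    final double[] values;
    DenseVectorSketch(double[] values) { this.values = values; }
}

// Sparse: only the non-zero entries are stored, alongside their positions.
final class SparseVectorSketch {
    final int size;          // logical dimensionality
    final int[] indices;     // positions of the non-zero entries
    final double[] values;   // the non-zero values themselves
    SparseVectorSketch(int size, int[] indices, double[] values) {
        this.size = size;
        this.indices = indices;
        this.values = values;
    }
}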



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS] Update our Roadmap

2019-08-16 Thread Robert Metzger
Flink 1.9 is feature frozen and almost released.
I guess it makes sense to update the roadmap on the website again.

Who feels like they have a good overview of what's coming up?

On Tue, May 7, 2019 at 4:33 PM Fabian Hueske  wrote:

> Yes, that's a very good proposal Jark.
> +1
>
> Best, Fabian
>
> Am Mo., 6. Mai 2019 um 16:33 Uhr schrieb Till Rohrmann <
> trohrm...@apache.org
> >:
>
> > I think this is a good idea Jark. Putting the last update date on the
> > roadmap would also force us to regularly update it.
> >
> > Cheers,
> > Till
> >
> > On Mon, May 6, 2019 at 4:14 AM Jark Wu  wrote:
> >
> > > Hi,
> > >
> > > One suggestion for the roadmap:
> > >
> > > Shall we add a `latest-update-time` to the top of Roadmap page? So that
> > > users can know this is a up-to-date Roadmap.
> > >
> > > On Thu, 2 May 2019 at 04:49, Bowen Li  wrote:
> > >
> > > > +1
> > > >
> > > > On Mon, Apr 29, 2019 at 11:41 PM jincheng sun <
> > sunjincheng...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jeff&Fabian,
> > > > >
> > > > > I have open the PR about add Python Table API section to the
> > roadmap. I
> > > > > appreciate if you have time to look at it. :)
> > > > >
> > > > > https://github.com/apache/flink-web/pull/204
> > > > >
> > > > > Regards,
> > > > > Jincheng
> > > > >
> > > > > jincheng sun  于2019年4月29日周一 下午11:12写道:
> > > > >
> > > > > > Sure, I will do it!I think the python table api info should in
> the
> > > > > >  roadmap! Thank you @Jeff @Fabian
> > > > > >
> > > > > > Fabian Hueske 于2019年4月29日 周一23:05写道:
> > > > > >
> > > > > >> Great, thanks Jeff and Timo!
> > > > > >>
> > > > > >> @Jincheng do you want to write a paragraph about the Python
> effort
> > > and
> > > > > >> open a PR for it?
> > > > > >>
> > > > > >> I'll remove the issue about Hadoop convenience builds
> > (FLINK-11266).
> > > > > >>
> > > > > >> Best, Fabian
> > > > > >>
> > > > > >> Am Mo., 29. Apr. 2019 um 16:37 Uhr schrieb Jeff Zhang <
> > > > zjf...@gmail.com
> > > > > >:
> > > > > >>
> > > > > >>> jincheng(cc) is driving the python effort, I think he can help
> to
> > > > > >>> prepare it.
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> Fabian Hueske  于2019年4月29日周一 下午10:15写道:
> > > > > >>>
> > > > >  Hi everyone,
> > > > > 
> > > > >  Since we had no more comments on this thread, I think we
> proceed
> > > to
> > > > >  update the roadmap.
> > > > > 
> > > > >  @Jeff Zhang  I agree, we should add the
> > Python
> > > > >  efforts to the roadmap.
> > > > >  Do you want to prepare a short paragraph that we can add to
> the
> > > > >  document?
> > > > > 
> > > > >  Best, Fabian
> > > > > 
> > > > >  Am Mi., 17. Apr. 2019 um 15:04 Uhr schrieb Jeff Zhang <
> > > > > zjf...@gmail.com
> > > > >  >:
> > > > > 
> > > > > > Hi Fabian,
> > > > > >
> > > > > > One thing missing is python api and python udf, we already
> > > > discussed
> > > > > > it in
> > > > > > community, and it is very close to reach consensus.
> > > > > >
> > > > > >
> > > > > > Fabian Hueske  于2019年4月17日周三 下午7:51写道:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > We recently added a roadmap to our project website [1] and
> > > > decided
> > > > > to
> > > > > > > update it after every release. Flink 1.8.0 was released a
> few
> > > > days
> > > > > > ago, so
> > > > > > > I think it we should check and remove from the roadmap what
> > was
> > > > > > achieved so
> > > > > > > far and add features / improvements that we plan for the
> > > future.
> > > > > > >
> > > > > > > I had a look at the roadmap and found that
> > > > > > >
> > > > > > > > We are changing the build setup to not bundle Hadoop by
> > > > default,
> > > > > > but
> > > > > > > rather offer pre-packaged
> > > > > > > > Hadoop libraries for the use with Yarn, HDFS, etc. as
> > > > convenience
> > > > > > > downloads FLINK-11266 <
> > > > > > https://issues.apache.org/jira/browse/FLINK-11266>.
> > > > > > >
> > > > > > > was implemented for 1.8.0 and should be removed from the
> > > roadmap.
> > > > > > > All other issues are still ongoing efforts.
> > > > > > >
> > > > > > > Are there any other efforts that we want to put on the
> > roadmap?
> > > > > > >
> > > > > > > Best, Fabian
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards
> > > > > >
> > > > > > Jeff Zhang
> > > > > >
> > > > > 
> > > > > >>>
> > > > > >>> --
> > > > > >>> Best Regards
> > > > > >>>
> > > > > >>> Jeff Zhang
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

2019-08-16 Thread Till Rohrmann
Hi Xintong,

thanks for drafting this FLIP. I think your proposal helps to improve the
execution of batch jobs more efficiently. Moreover, it enables the proper
integration of the Blink planner which is very important as well.

Overall, the FLIP looks good to me. I was wondering whether it wouldn't
make sense to actually split it up into two FLIPs: Operator resource
management and dynamic slot allocation. I think these two FLIPs could be
seen as orthogonal and it would decrease the scope of each individual FLIP.

Some smaller comments:

- I'm not sure whether we should pass in the default slot size via an
environment variable. Without having unified the way how Flink components
are configured [1], I think it would be better to pass it in as part of the
configuration.
- I would avoid returning a null value from
TaskExecutorGateway#requestResource if it cannot be fulfilled. Either we
should introduce an explicit return value saying this or throw an exception.

Concerning Yangze's comments: I think you are right that it would be
helpful to make the selection strategy pluggable. Also batching slot
requests to the RM could be a good optimization. For the sake of keeping
the scope of this FLIP smaller I would try to tackle these things after the
initial version has been completed (without spoiling these optimization
opportunities). In particular batching the slot requests depends on the
current scheduler refactoring and could also be realized on the RM side
only.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration

Cheers,
Till



On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo  wrote:

> Hi, Xintong
>
> Thanks for proposing this FLIP. The general design looks good to me; +1
> for this feature.
>
> Since slots in the same task executor could have different resource
> profiles, we will run into a resource fragmentation problem. Think about
> this case:
>  - request A wants 1G memory while requests B & C want 0.5G memory
>  - There are two task executors T1 & T2 with 1G and 0.5G free memory
> respectively
> If B comes first and we cut a slot from T1 for B, A must wait for free
> resources from other tasks. But A could have been scheduled immediately
> if we had cut a slot from T2 for B.
>
> The logic of findMatchingSlot now becomes finding a task executor which
> has enough resources and then cutting a slot from it. The current method
> could be seen as a "first-fit strategy", which works well in general but
> is sometimes not optimal.
>
> Actually, this problem can be abstracted as the "Bin Packing Problem"[1].
> Here are some common approximate algorithms:
> - First fit
> - Next fit
> - Best fit
>
> But it becomes a multi-dimensional bin packing problem if we take CPU
> into account. It is hard to define which one is the best fit then. Some
> research has addressed this problem, such as Tetris[2].
>
> Here are some thoughts about it:
> 1. We could make the strategy of finding a matching task executor
> pluggable, letting users configure the best strategy for their scenario.
> 2. We could support a batch request interface in the RM, because we have
> opportunities to optimize if we have more information. If we know A, B,
> and C at the same time, we could always make the best decision.
>
> [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> [2] https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
>
> Best,
> Yangze Guo
>
> On Thu, Aug 15, 2019 at 10:40 PM Xintong Song 
> wrote:
> >
> > Hi everyone,
> >
> > We would like to start a discussion thread on "FLIP-53: Fine Grained
> > Resource Management"[1], where we propose how to improve Flink resource
> > management and scheduling.
> >
> > This FLIP mainly discusses the following issues.
> >
> >- How to support tasks with fine grained resource requirements.
> >- How to unify resource management for jobs with / without fine
> grained
> >resource requirements.
> >- How to unify resource management for streaming / batch jobs.
> >
> > Key changes proposed in the FLIP are as follows.
> >
> >- Unify memory management for operators with / without fine grained
> >resource requirements by applying a fraction based quota mechanism.
> >- Unify resource scheduling for streaming and batch jobs by setting
> slot
> >sharing groups for pipelined regions during compiling stage.
> >- Dynamically allocate slots from task executors' available resources.
> >
> > Please find more details in the FLIP wiki document [1]. Looking forward
> to
> > your feedbacks.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
>


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-16 Thread Xintong Song
Thanks for sharing your opinion Till.

I'm also in favor of alternative 2. I was wondering whether we can avoid
using Unsafe.allocate() for off-heap managed memory and network memory with
alternative 3. But after giving it a second thought, I think even for
alternative 3 using direct memory for off-heap managed memory could cause
problems.

Hi Yang,

Regarding your concern, I think what is proposed in this FLIP is to have both
off-heap managed memory and network memory allocated through
Unsafe.allocate(), which means they are practically native memory and not
limited by the JVM max direct memory. The only parts of memory limited by the
JVM max direct memory are task off-heap memory and JVM overhead, which is
exactly what alternative 2 suggests setting the JVM max direct memory to.
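
As a back-of-the-envelope sketch (hypothetical budgets matching the 1GB
example quoted below), alternative 2 would derive the JVM flags like this:

final class MemoryFlagsSketch {
    public static void main(String[] args) {
        long taskOffHeapMb = 150, jvmOverheadMb = 50; // capped by MaxDirectMemorySize
        long heapMb = 400;                            // JVM heap
        // Off-heap managed memory and network memory are allocated via
        // Unsafe.allocate(), i.e. native memory, and therefore do NOT
        // count against -XX:MaxDirectMemorySize.
        System.out.println("-Xmx" + heapMb + "m");
        System.out.println("-XX:MaxDirectMemorySize=" + (taskOffHeapMb + jvmOverheadMb) + "m");
    }
}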

Thank you~

Xintong Song



On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann  wrote:

> Thanks for the clarification Xintong. I understand the two alternatives
> now.
>
> I would be in favour of option 2 because it makes things explicit. If we
> don't limit the direct memory, I fear that we might end up in a similar
> situation as we are currently in: The user might see that her process gets
> killed by the OS and does not know why this is the case. Consequently, she
> tries to decrease the process memory size (similar to increasing the cutoff
> ratio) in order to accommodate for the extra direct memory. Even worse, she
> tries to decrease memory budgets which are not fully used and hence won't
> change the overall memory consumption.
>
> Cheers,
> Till
>
> On Fri, Aug 16, 2019 at 11:01 AM Xintong Song 
> wrote:
>
> > Let me explain this with a concrete example Till.
> >
> > Let's say we have the following scenario.
> >
> > Total Process Memory: 1GB
> > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and
> > Network Memory): 800MB
> >
> >
> > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > For alternative 3, we set -XX:MaxDirectMemorySize to a very large value,
> > let's say 1TB.
> >
> > If the actual direct memory usage of Task Off-Heap Memory and JVM
> Overhead
> > do not exceed 200MB, then alternative 2 and alternative 3 should have the
> > same utility. Setting larger -XX:MaxDirectMemorySize will not reduce the
> > sizes of the other memory pools.
> >
> > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > Overhead potentially exceed 200MB, then
> >
> >- Alternative 2 suffers from frequent OOM. To avoid that, the only
> thing
> >user can do is to modify the configuration and increase JVM Direct
> > Memory
> >(Task Off-Heap Memory + JVM Overhead). Let's say that user increases
> JVM
> >Direct Memory to 250MB, this will reduce the total size of other
> memory
> >pools to 750MB, given the total process memory remains 1GB.
> >- For alternative 3, there is no chance of direct OOM. There are
> chances
> >of exceeding the total process memory limit, but given that the
> process
> > may
> >not use up all the reserved native memory (Off-Heap Managed Memory,
> > Network
> >Memory, JVM Metaspace), if the actual direct memory usage is slightly
> > above
>    yet very close to 200MB, the user probably does not need to change the
> >configurations.
> >
> > Therefore, I think from the user's perspective, a feasible configuration
> > for alternative 2 may lead to lower resource utilization compared to
> > alternative 3.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann 
> > wrote:
> >
> > > I guess you have to help me understand the difference between
> > alternative 2
> > > and 3 wrt memory under-utilization, Xintong.
> > >
> > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory and
> > JVM
> > > Overhead. Then there is the risk that this size is too low resulting
> in a
> > > lot of garbage collection and potentially an OOM.
> > > - Alternative 3: set XX:MaxDirectMemorySize to something larger than
> > > alternative 2. This would of course reduce the sizes of the other
> memory
> > > types.
> > >
> > > How would alternative 2 now result in an under-utilization of memory
> > > compared to alternative 3? If alternative 3 strictly sets a higher max
> > > direct memory size and we use only a little, then I would expect that
> > > alternative 3 results in memory under-utilization.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang 
> wrote:
> > >
> > > > Hi Xintong, Till,
> > > >
> > > >
> > > > > Native and Direct Memory
> > > >
> > > > My point is about setting a very large max direct memory size when we
> > > > do not differentiate direct and native memory. If the direct memory,
> > > > including user direct memory and framework direct memory, could be
> > > > calculated correctly, then I am in favor of setting direct memory to a
> > > > fixed value.
> > > >
> > > >
> > > >
> > > > > Memory Calculation

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Till Rohrmann
For the sake of keeping the discussion focused and not cluttering the
discussion thread, I would suggest splitting the detailed reporting on
reusing JVMs into a separate thread and cross-linking it from here.

Cheers,
Till

On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler  wrote:

> Update:
>
> TL;DR: table-planner is a good candidate for enabling fork reuse right
> away, while flink-tests has the potential for huge savings, but we have
> to figure out some issues first.
>
>
> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>
> 4/8 profiles failed.
>
> No speedup in libraries, python, blink_planner; 7 minutes saved in
> libraries (table-planner).
>
> The kafka and connectors profiles both fail in kafka tests due to
> producer leaks, and no speed up could be confirmed so far:
>
> java.lang.AssertionError: Detected producer leak. Thread name:
> kafka-producer-network-thread | producer-239
> at org.junit.Assert.fail(Assert.java:88)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>
>
> The tests profile failed due to various errors in migration tests:
>
> junit.framework.AssertionFailedError: Did not see the expected accumulator
> results within time limit.
> at
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>
> *However*, a normal tests run takes 40 minutes, while this one above
> failed after 19 minutes and is only missing the migration tests (which
> currently need 6-7 minutes). So we could save somewhere between 15 and 20
> minutes here.
>
>
> Finally, the misc profile fails in YARN:
>
> java.lang.AssertionError
> at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>
> No significant speedup could be observed in other modules; for
> flink-yarn-tests we can maybe get a minute or 2 out of it.
>
> On 16/08/2019 10:43, Chesnay Schepler wrote:
> > There appears to be a general agreement that 1) should be looked into;
> > I've set up a branch with fork reuse being enabled for all tests; will
> > report back the results.
> >
> > On 15/08/2019 09:38, Chesnay Schepler wrote:
> >> Hello everyone,
> >>
> >> improving our build times is a hot topic at the moment so let's
> >> discuss the different ways how they could be reduced.
> >>
> >>
> >>Current state:
> >>
> >> First up, let's look at some numbers:
> >>
> >> 1 full build currently consumes 5h of build time total ("total
> >> time"), and in the ideal case takes about 1h20m ("run time") to
> >> complete from start to finish. The run time may fluctuate of course
> >> depending on the current Travis load. This applies both to builds on
> >> the Apache and flink-ci Travis.
> >>
> >> At the time of writing, the current queue time for PR jobs (reminder:
> >> running on flink-ci) is about 30 minutes (which basically means that
> >> we are processing builds at the rate that they come in), however we
> >> are in an admittedly quiet period right now.
> >> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> >> everyone was scrambling to get their changes merged in time for the
> >> feature freeze.
> >>
> >> (Note: Recently optimizations were added to ci-bot where pending
> >> builds are canceled if a new commit was pushed to the PR or the PR
> >> was closed, which should prove especially useful during the rush
> >> hours we see before feature-freezes.)
> >>
> >>
> >>Past approaches
> >>
> >> Over the years we have done rather few things to improve this
> >> situation (hence our current predicament).
> >>
> >> Beyond the sporadic speedup of some tests, the only notable reduction
> >> in total build times was the introduction of cron jobs, which
> >> consolidated the per-commit matrix from 4 configurations (different
> >> scala/hadoop versions) to 1.
> >>
> >> The separation into multiple build profiles was only a work-around
> >> for the 50m limit on Travis. Running tests in parallel has the
> >> obvious potential of reducing run time, but we're currently hitting a
> >> hard limit since a few modules (flink-tests, flink-runtime,
> >> flink-table-planner-blink) are so loaded with tests that they nearly
> >> consume an entire profile by themselves (and thus no further
> >> splitting is possible).
> >>
> >> The rework that introduced stages did not, at the time of introduction,
> >> provide a speedup either, although this changed slightly once more
> >> profiles were added and some optimizations to the caching were made.
> >>
> >> Very recently we modified the surefire-plugin configuration for
> >> flink-table-planner-blink to reuse JVM forks for IT cases, providing
> >> a significant speedup (18 minutes!). So far we have not seen any
> >> negative consequences.

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-16 Thread Till Rohrmann
Thanks for the clarification Xintong. I understand the two alternatives now.

I would be in favour of option 2 because it makes things explicit. If we
don't limit the direct memory, I fear that we might end up in a similar
situation as we are currently in: The user might see that her process gets
killed by the OS and does not know why this is the case. Consequently, she
tries to decrease the process memory size (similar to increasing the cutoff
ratio) in order to accommodate the extra direct memory. Even worse, she
tries to decrease memory budgets which are not fully used and hence won't
change the overall memory consumption.

Cheers,
Till

On Fri, Aug 16, 2019 at 11:01 AM Xintong Song  wrote:

> Let me explain this with a concrete example Till.
>
> Let's say we have the following scenario.
>
> Total Process Memory: 1GB
> JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and
> Network Memory): 800MB
>
>
> For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> For alternative 3, we set -XX:MaxDirectMemorySize to a very large value,
> let's say 1TB.
>
> If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead
> does not exceed 200MB, then alternative 2 and alternative 3 should have the
> same utility. Setting larger -XX:MaxDirectMemorySize will not reduce the
> sizes of the other memory pools.
>
> If the actual direct memory usage of Task Off-Heap Memory and JVM
> Overhead potentially exceeds 200MB, then
>
>- Alternative 2 suffers from frequent OOM. To avoid that, the only thing
>    the user can do is to modify the configuration and increase JVM Direct
> Memory
>    (Task Off-Heap Memory + JVM Overhead). Let's say that the user increases JVM
>Direct Memory to 250MB, this will reduce the total size of other memory
>pools to 750MB, given the total process memory remains 1GB.
>- For alternative 3, there is no chance of direct OOM. There are chances
>of exceeding the total process memory limit, but given that the process
> may
>not use up all the reserved native memory (Off-Heap Managed Memory,
> Network
>Memory, JVM Metaspace), if the actual direct memory usage is slightly
> above
>    yet very close to 200MB, the user probably does not need to change the
>configurations.
>
> Therefore, I think from the user's perspective, a feasible configuration
> for alternative 2 may lead to lower resource utilization compared to
> alternative 3.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann 
> wrote:
>
> > I guess you have to help me understand the difference between
> alternative 2
> > and 3 wrt memory under-utilization, Xintong.
> >
> > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory and
> JVM
> > Overhead. Then there is the risk that this size is too low resulting in a
> > lot of garbage collection and potentially an OOM.
> > - Alternative 3: set XX:MaxDirectMemorySize to something larger than
> > alternative 2. This would of course reduce the sizes of the other memory
> > types.
> >
> > How would alternative 2 now result in an under-utilization of memory
> > compared to alternative 3? If alternative 3 strictly sets a higher max
> > direct memory size and we use only a little, then I would expect that
> > alternative 3 results in memory under-utilization.
> >
> > Cheers,
> > Till
> >
> > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang  wrote:
> >
> > > Hi Xintong, Till,
> > >
> > >
> > > > Native and Direct Memory
> > >
> > > My point is about setting a very large max direct memory size when we do not
> > > differentiate direct and native memory. If the direct memory, including user
> > > direct memory and framework direct memory, could be calculated correctly,
> > > then I am in favor of setting direct memory to a fixed value.
> > >
> > >
> > >
> > > > Memory Calculation
> > >
> > > I agree with Xintong. For YARN and K8s, we need to check the memory
> > > configurations on the client side to avoid submitting successfully and
> > > then failing in the Flink master.
> > >
> > >
> > > Best,
> > >
> > > Yang
> > >
> > > Xintong Song wrote on Tue, Aug 13, 2019 at 22:07:
> > >
> > > > Thanks for replying, Till.
> > > >
> > > > About MemorySegment, I think you are right that we should not include
> > > this
> > > > issue in the scope of this FLIP. This FLIP should concentrate on how
> to
> > > > configure memory pools for TaskExecutors, with minimum involvement on
> > how
> > > > memory consumers use it.
> > > >
> > > > About direct memory, I think alternative 3 may not have the same
> > > > over-reservation issue that alternative 2 does, but at the cost of a risk
> > > > of over-using memory at the container level, which is not good. My point
> > > > is that both "Task Off-Heap Memory" and "JVM Overhead" are not easy to
> > > > configure. For alternative 2, users might configure them higher than what
> > > > is actually needed, just to avoid getting a direct OOM.

Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Chesnay Schepler

+1 (binding)

Although I think it would be a good idea to always cc 
priv...@flink.apache.org when modifying bylaws, if anything to speed up 
the voting process.







Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Becket Qin
Hi Chesnay,

Thanks for responding. I think cc private@ is a good idea. I just added
that to the CC list.

We are following the 2/3 majority voting scheme defined in the bylaws here.
I should have referred to the terms in the bylaws instead of rephrasing them.

Thanks,

Jiangjie (Becket) Qin



On Fri, Aug 16, 2019 at 1:14 PM Chesnay Schepler  wrote:

> The wording of the original mail is ambiguous imo.
>
> "The vote requires 2/3 majority of the binding +1s to pass."
>
> This to me reads very much like "This vote passes if 2/3 of all votes after
> the voting period are +1."
>
> Maybe it's just a wording thing, but it was not clear to me that this
> follows the 2/3 majority scheme laid out in the bylaws.
>
> On 16/08/2019 12:51, Dawid Wysakowicz wrote:
> > AFAIK this voting scheme is described in the "Modifying Bylaws" section,
> > in the end, introducing bylaws is a modify operation ;). I think it is a
> > valid point to CC priv...@flink.apache.org in the future. I wouldn't say
> > it is a must though. The voting scheme requires that every PMC member
> > has to be reached out directly, via a private address if he/she did not
> > vote in a thread. So every PMC member should be aware of the voting
> thread.
> >
> > Best,
> >
> > Dawid
> >
> > On 16/08/2019 12:38, Chesnay Schepler wrote:
> >> I'm very late to the party, but isn't it a bit weird that we're using
> >> a voting scheme that isn't laid out in the bylaws?
> >>
> >> Additionally, I would heavily suggest to CC priv...@flink.apache.org,
> >> as we want as many PMC as possible to look at this.
> >> (I would regard this point as a reason for delaying the vote
> >> conclusion)
> >>
> >> On 11/08/2019 10:07, Becket Qin wrote:
> >>> Hi all,
> >>>
> >>> I would like to start a voting thread on the project bylaws of Flink.
> It
> >>> aims to help the community coordinate more smoothly. Please see the
> >>> bylaws
> >>> wiki page below for details.
> >>>
> >>>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
> >>>
> >>>
> >>> The discussion thread is following:
> >>>
> >>>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-project-bylaws-td30409.html
> >>>
> >>>
> >>> The vote will be open for at least 6 days. PMC members' votes are
> >>> considered as binding. The vote requires 2/3 majority of the binding
> >>> +1s to
> >>> pass.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
>
>


Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler

Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right 
away, while flink-tests has the potential for huge savings, but we have 
to figure out some issues first.



Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner; 7 minutes saved in 
libraries (table-planner).


The kafka and connectors profiles both fail in kafka tests due to 
producer leaks, and no speed up could be confirmed so far:


java.lang.AssertionError: Detected producer leak. Thread name: 
kafka-producer-network-thread | producer-239
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)


The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected accumulator 
results within time limit.
at 
org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal tests run takes 40 minutes, while this one above 
failed after 19 minutes and is only missing the migration tests (which 
currently need 6-7 minutes). So we could save somewhere between 15 and 20 
minutes here.



Finally, the misc profile fails in YARN:

java.lang.AssertionError
at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for 
flink-yarn-tests we can maybe get a minute or 2 out of it.



Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Chesnay Schepler

The wording of the original mail is ambiguous imo.

"The vote requires 2/3 majority of the binding +1s to pass."

This to me reads very much like "This vote passes if 2/3 of all votes after 
the voting period are +1."


Maybe it's just a wording thing, but it was not clear to me that this 
follows the 2/3 majority scheme laid out in the bylaws.







Re: Inverted classloading for client

2019-08-16 Thread Paul Lam
Hi,

I’ve created a ticket to track this problem [1]. Any comments will be 
appreciated. 

[1] https://issues.apache.org/jira/browse/FLINK-13749 


Best,
Paul Lam

> On Aug 9, 2019, at 11:16, Paul Lam  wrote:
> 
> Hi devs,
> 
> Flink uses inverted class loading by default to allow different versions of 
> dependencies in user code, but currently this approach is not applied to the 
> client, so I'm wondering whether there is a special reason for that?
> 
> If not, I think it would be great to add inverted class loading as the 
> behavior for the client and make it respect the classloading configuration, 
> so that the components work in a uniform manner. 
> 
> Moreover, I've seen some jobs fail to upgrade to 1.8, because we added the 
> hadoop classpath to the client classpath to compensate for the removed 
> convenience hadoop binaries that shaded lots of common dependencies, which 
> significantly increases the chances of dependency conflicts.
> 
> Looking forward to your feedback.
> 
> Best,
> Paul Lam
> 



Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Dawid Wysakowicz
AFAIK this voting scheme is described in the "Modifying Bylaws" section,
in the end, introducing bylaws is a modify operation ;). I think it is a
valid point to CC priv...@flink.apache.org in the future. I wouldn't say
it is a must though. The voting scheme requires that every PMC member
has to be reached out to directly, via a private address, if he/she did not
vote in a thread. So every PMC member should be aware of the voting thread.

Best,

Dawid

On 16/08/2019 12:38, Chesnay Schepler wrote:
> I'm very late to the party, but isn't it a bit weird that we're using
> a voting scheme that isn't laid out in the bylaws?
>
> Additionally, I would heavily suggest to CC priv...@flink.apache.org,
> as we want as many PMC as possible to look at this.
> (I would regard this point as a reason for delaying the vote
> conclusion)
>
> On 11/08/2019 10:07, Becket Qin wrote:
>> Hi all,
>>
>> I would like to start a voting thread on the project bylaws of Flink. It
>> aims to help the community coordinate more smoothly. Please see the
>> bylaws
>> wiki page below for details.
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
>>
>>
>> The discussion thread is following:
>>
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-project-bylaws-td30409.html
>>
>>
>> The vote will be open for at least 6 days. PMC members' votes are
>> considered as binding. The vote requires 2/3 majority of the binding
>> +1s to
>> pass.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>





Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Chesnay Schepler
I'm very late to the party, but isn't it a bit weird that we're using a 
voting scheme that isn't laid out in the bylaws?


Additionally, I would heavily suggest to CC priv...@flink.apache.org, as 
we want as many PMC as possible to look at this.
(I would regard this point as a reason for delaying the vote 
conclusion)







[jira] [Created] (FLINK-13750) Separate HA services between client-/ and server-side

2019-08-16 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-13750:


 Summary: Separate HA services between client-/ and server-side
 Key: FLINK-13750
 URL: https://issues.apache.org/jira/browse/FLINK-13750
 Project: Flink
  Issue Type: Improvement
  Components: Command Line Client, Runtime / Coordination
Reporter: Chesnay Schepler


Currently, we use the same {{HighAvailabilityServices}} on the client and 
server. However, the client does not need several of the features that the 
services currently provide (access to the blobstore or checkpoint metadata).

Additionally, due to how these services are set up, they also require the client 
to have access to the blob storage, despite it never actually being used, which 
can cause issues, like FLINK-13500.

[~Tison] Would you be interested in this issue?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13749) Make Flink client respect classloading policy

2019-08-16 Thread Paul Lin (JIRA)
Paul Lin created FLINK-13749:


 Summary: Make Flink client respect classloading policy
 Key: FLINK-13749
 URL: https://issues.apache.org/jira/browse/FLINK-13749
 Project: Flink
  Issue Type: Improvement
  Components: Command Line Client, Runtime / REST
Affects Versions: 1.9.0
Reporter: Paul Lin


Currently, the Flink client does not respect the classloading policy and uses a 
hardcoded parent-first classloader, while the other components like the jobmanager 
and taskmanager use a child-first classloader by default and respect the 
classloading options. This makes the client more likely to run into dependency 
conflicts, especially after we removed the convenience hadoop binaries (so users 
need to add the hadoop classpath to the client classpath).

So I propose to align the Flink client's classloading behavior (including the 
CLI and REST handler) with that of the other components.
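
For illustration, a minimal sketch of child-first classloading follows. This is
an assumption-laden example, not Flink's actual implementation, and the
parent-first prefixes are placeholders:

import java.net.URL;
import java.net.URLClassLoader;

// Child-first ("inverted") classloading: look in the user-code jars first,
// fall back to the parent only for classes not found there or for core
// prefixes that must always come from the parent.
public class ChildFirstClassLoaderSketch extends URLClassLoader {
    private final String[] alwaysParentFirst; // e.g. {"java.", "org.apache.flink."}

    public ChildFirstClassLoaderSketch(URL[] urls, ClassLoader parent, String[] alwaysParentFirst) {
        super(urls, parent);
        this.alwaysParentFirst = alwaysParentFirst;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                for (String prefix : alwaysParentFirst) {
                    if (name.startsWith(prefix)) {
                        return super.loadClass(name, resolve); // parent-first for core classes
                    }
                }
                try {
                    c = findClass(name); // child first: search the user jars
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, resolve); // fall back to the parent
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}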



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Ufuk Celebi
+1 (binding)

– Ufuk


On Wed, Aug 14, 2019 at 4:50 AM Biao Liu  wrote:

> +1 (non-binding)
>
> Thanks for pushing this!
>
> Thanks,
> Biao /'bɪ.aʊ/
>
>
>
> On Wed, 14 Aug 2019 at 09:37, Jark Wu  wrote:
>
> > +1 (non-binding)
> >
> > Best,
> > Jark
> >
> > On Wed, 14 Aug 2019 at 09:22, Kurt Young  wrote:
> >
> > > +1 (binding)
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Wed, Aug 14, 2019 at 1:34 AM Yun Tang  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > But I have a minor question about "code change" action, for those
> > > > "[hotfix]" github pull requests [1], the dev mailing list would not
> be
> > > > notified currently. I think we should change the description of this
> > > action.
> > > >
> > > >
> > > > [1]
> > > >
> > >
> >
> https://flink.apache.org/contributing/contribute-code.html#code-contribution-process
> > > >
> > > > Best
> > > > Yun Tang
> > > > 
> > > > From: JingsongLee 
> > > > Sent: Tuesday, August 13, 2019 23:56
> > > > To: dev 
> > > > Subject: Re: [VOTE] Flink Project Bylaws
> > > >
> > > > +1 (non-binding)
> > > > Thanks Becket.
> > > > I've learned a lot from current bylaws.
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > >
> > > > --
> > > > From:Yu Li 
> > > > Send Time: Tuesday, Aug 13, 2019, 17:48
> > > > To:dev 
> > > > Subject:Re: [VOTE] Flink Project Bylaws
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Thanks for the efforts Becket!
> > > >
> > > > Best Regards,
> > > > Yu
> > > >
> > > >
> > > > On Tue, 13 Aug 2019 at 16:09, Xintong Song 
> > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 13, 2019 at 1:48 PM Robert Metzger <
> rmetz...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 1:47 PM Becket Qin  >
> > > > wrote:
> > > > > >
> > > > > > > Thanks everyone for voting.
> > > > > > >
> > > > > > > For those who have already voted, just want to bring this up to
> > > your
> > > > > > > attention that there is a minor clarification to the bylaws
> wiki
> > > this
> > > > > > > morning. The change is in bold format below:
> > > > > > >
> > > > > > > one +1 from a committer followed by a Lazy approval (not
> counting
> > > the
> > > > > > vote
> > > > > > > > of the contributor), moving to lazy majority if a -1 is
> > received.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Note that this implies that committers can +1 their own commits
> > and
> > > > > merge
> > > > > > > > right away. *However, the committers should use their best
> > > > > judgement
> > > > > > to
> > > > > > > > respect the components' expertise and ongoing development
> plan.*
> > > > > > >
> > > > > > >
> > > > > > > This addition does not really change anything the bylaws meant
> to
> > > > set.
> > > > > It
> > > > > > > is simply a clarification. If anyone who has cast the vote
> > > > objects,
> > > > > > > please feel free to withdraw the vote.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jiangjie (Becket) Qin
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 1:29 PM Piotr Nowojski <
> > > pi...@ververica.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > > On 13 Aug 2019, at 13:22, vino yang  >
> > > > wrote:
> > > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > Tzu-Li (Gordon) Tai  wrote on Tue, Aug 13, 2019 at 6:32 PM:
> > > > > > > > >
> > > > > > > > >> +1
> > > > > > > > >>
> > > > > > > > >> On Tue, Aug 13, 2019, 12:31 PM Hequn Cheng <
> > > > chenghe...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >>> +1 (non-binding)
> > > > > > > > >>>
> > > > > > > > >>> Thanks a lot for driving this! Good job. @Becket Qin <
> > > > > > > > >> becket@gmail.com
> > > > > > > > 
> > > > > > > > >>>
> > > > > > > > >>> Best, Hequn
> > > > > > > > >>>
> > > > > > > > >>> On Tue, Aug 13, 2019 at 6:26 PM Stephan Ewen <
> > > se...@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > >>>
> > > > > > > >  +1
> > > > > > > > 
> > > > > > > >  On Tue, Aug 13, 2019 at 12:22 PM Maximilian Michels <
> > > > > > m...@apache.org
> > > > > > > >
> > > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > > +1 It's good that we formalize this.
> > > > > > > > >
> > > > > > > > > On 13.08.19 10:41, Fabian Hueske wrote:
> > > > > > > > >> +1 for the proposed bylaws.
> > > > > > > > >> Thanks for pushing this Becket!
> > > > > > > > >>
> > > > > > > > >> Cheers, Fabian
> > > > > > > > >>
> > > > > > > > >> On Mon., Aug 12, 2019 at 16:31, Robert Metzger <
> > > > > > > > >> rmetz...@apache.org> wrote:
> > > > > > > > >>
> > > > > > > > >>> I changed the permissions of the page.
> > > > > > > > >>>
> > > > > > > > >>>

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-16 Thread Gyula Fóra
Hi all,
I agree with Till that we should investigate the suspected performance
regression issue before proceeding with the release.

If we do not find any problem, I vote +1.

I have verified the following behaviour:
 - Built flink with custom hadoop version
 - YARN Deployment with and without high-availability
 - Simulated TM and JM failures
 - Test recovery with savepoints and checkpoints for simple stateful job
with kafka connectors

Gyula



On Fri, Aug 16, 2019 at 10:34 AM Till Rohrmann  wrote:

> Thanks for reporting this issue Guowei. Could you share a bit more details
> what the job exactly does and which operators it uses? Does the job use
> the new `TwoInputSelectableStreamTask` which might cause the performance
> regression?
>
> I think it is important to understand where the problem comes from before
> we proceed with the release.
>
> Cheers,
> Till
>
> On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma  wrote:
>
> > Hi,
> > -1
> > We have a benchmark job, which includes a two-input operator.
> > This job has a big performance regression using 1.9 compared to 1.8.
> > It's still not very clear why this regression happens.
> >
> > Best,
> > Guowei
> >
> >
> > Yu Li  wrote on Fri, Aug 16, 2019 at 3:27 PM:
> >
> > > +1 (non-binding)
> > >
> > > - checked release notes: OK
> > > - checked sums and signatures: OK
> > > - source release
> > >  - contains no binaries: OK
> > >  - contains no 1.9-SNAPSHOT references: OK
> > >  - build from source: OK (8u102)
> > >  - mvn clean verify: OK (8u102)
> > > - binary release
> > >  - no examples appear to be missing
> > >  - started a cluster; WebUI reachable, example ran successfully
> > > - repository appears to contain all expected artifacts
> > >
> > > Best Regards,
> > > Yu
> > >
> > >
> > > On Fri, 16 Aug 2019 at 06:06, Bowen Li  wrote:
> > >
> > > > Hi Jark,
> > > >
> > > > Thanks for letting me know that it's been like this in previous
> > releases.
> > > > Though I don't think that's the right behavior, it can be discussed
> for
> > > > later release. Thus I retract my -1 for RC2.
> > > >
> > > > Bowen
> > > >
> > > >
> > > > On Thu, Aug 15, 2019 at 7:49 PM Jark Wu  wrote:
> > > >
> > > > > Hi Bowen,
> > > > >
> > > > > Thanks for reporting this.
> > > > > However, I don't think this is an issue. IMO, it is by design.
> > > > > The `tEnv.listUserDefinedFunctions()` in Table API and `show
> > > functions;`
> > > > in
> > > > > SQL CLI are intended to return only the registered UDFs, not
> > including
> > > > > built-in functions.
> > > > > This is also the behavior in previous versions.
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Fri, 16 Aug 2019 at 06:52, Bowen Li 
> wrote:
> > > > >
> > > > > > -1 for RC2.
> > > > > >
> > > > > > I found a bug https://issues.apache.org/jira/browse/FLINK-13741,
> > > and I
> > > > > > think it's a blocker.  The bug means currently if users call
> > > > > > `tEnv.listUserDefinedFunctions()` in Table API or `show
> functions;`
> > > > thru
> > > > > > SQL would not be able to see Flink's built-in functions.
> > > > > >
> > > > > > I'm preparing a fix right now.
> > > > > >
> > > > > > Bowen
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai <
> > > > tzuli...@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for all the test efforts, verifications and votes so
> far.
> > > > > > >
> > > > > > > So far, things are looking good, but we still require one more
> > PMC
> > > > > > binding
> > > > > > > vote for this RC to be the official release, so I would like to
> > > > extend
> > > > > > the
> > > > > > > vote time for 1 more day, until *Aug. 16th 17:00 CET*.
> > > > > > >
> > > > > > > In the meantime, the release notes for 1.9.0 had only just been
> > > > > finalized
> > > > > > > [1], and could use a few more eyes before closing the vote.
> > > > > > > Any help with checking if anything else should be mentioned
> there
> > > > > > regarding
> > > > > > > breaking changes / known shortcomings would be appreciated.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gordon
> > > > > > >
> > > > > > > [1] https://github.com/apache/flink/pull/9438
> > > > > > >
> > > > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young 
> > > wrote:
> > > > > > >
> > > > > > > > Great, then I have no other comments on legal check.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Kurt
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler <
> > > > ches...@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > The licensing items aren't a problem; we don't care about
> > Flink
> > > > > > modules
> > > > > > > > > in NOTICE files, and we don't have to update the
> > source-release
> > > > > > > > > licensing since we don't have a pre-built version of the
> > WebUI
> > > in
> > > > > the
> > > > > > > > > source.
> > > > > > > > >
> > > > > > > > > On 15/08/2019 15:22, Kurt Young wrote:
> > >

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

2019-08-16 Thread Yangze Guo
Hi, Xintong

Thanks for proposing this FLIP. The general design looks good to me, +1
for this feature.

Since slots in the same task executor could have different resource
profiles, we will run into a resource fragmentation problem. Think about this case:
 - Request A wants 1G of memory while requests B & C want 0.5G each.
 - There are two task executors T1 & T2 with 1G and 0.5G of free memory
respectively.
If B comes first and we cut a slot from T1 for B, A must wait for resources
to be freed by other tasks. But A could have been scheduled immediately if we
had cut the slot for B from T2.

The logic of findMatchingSlot now becomes finding a task executor which
has enough resources and then cutting a slot from it. The current method
could be seen as a "first-fit strategy", which works well in general but is
not always optimal (see the sketch after the references below).

Actually, this problem can be abstracted as the bin packing problem [1]. Here are
some common approximation algorithms:
- First fit
- Next fit
- Best fit

But it becomes a multi-dimensional bin packing problem if we take CPU
into account, and it is hard to define which fit is "best" then. Some research
has addressed this problem, such as Tetris [2].

Here are some thoughts on it:
1. We could make the strategy for finding a matching task executor pluggable,
letting users configure the best strategy for their scenario.
2. We could support a batch request interface in the RM, because we have more
opportunities to optimize with more information. If we know A, B and C at the
same time, we can always make the best decision.

[1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
[2] https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
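
A minimal sketch of first fit vs. best fit on the scenario above (a
memory-only toy model; the TaskExecutor class here is a placeholder, not
Flink's):

import java.util.*;

public class SlotMatchingSketch {

    static class TaskExecutor {
        final String id;
        int freeMemoryMb; // simplified: memory only, ignoring CPU etc.
        TaskExecutor(String id, int freeMemoryMb) {
            this.id = id;
            this.freeMemoryMb = freeMemoryMb;
        }
    }

    // First fit: cut a slot from the first executor that fits.
    static Optional<TaskExecutor> firstFit(List<TaskExecutor> executors, int requestMb) {
        for (TaskExecutor te : executors) {
            if (te.freeMemoryMb >= requestMb) {
                te.freeMemoryMb -= requestMb;
                return Optional.of(te);
            }
        }
        return Optional.empty();
    }

    // Best fit: cut from the executor with the least free memory that still
    // fits, keeping large budgets intact for large requests.
    static Optional<TaskExecutor> bestFit(List<TaskExecutor> executors, int requestMb) {
        return executors.stream()
                .filter(te -> te.freeMemoryMb >= requestMb)
                .min(Comparator.comparingInt((TaskExecutor te) -> te.freeMemoryMb))
                .map(te -> { te.freeMemoryMb -= requestMb; return te; });
    }

    public static void main(String[] args) {
        List<TaskExecutor> executors = Arrays.asList(
                new TaskExecutor("T1", 1024), new TaskExecutor("T2", 512));
        // Best fit places B (0.5G) on T2, so A (1G) can still go to T1.
        // First fit would have placed B on T1 and left A waiting.
        bestFit(executors, 512).ifPresent(te -> System.out.println("B -> " + te.id));
        firstFit(executors, 1024).ifPresent(te -> System.out.println("A -> " + te.id));
    }
}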

Best,
Yangze Guo

On Thu, Aug 15, 2019 at 10:40 PM Xintong Song  wrote:
>
> Hi everyone,
>
> We would like to start a discussion thread on "FLIP-53: Fine Grained
> Resource Management"[1], where we propose how to improve Flink resource
> management and scheduling.
>
> This FLIP mainly discusses the following issues.
>
>- How to support tasks with fine grained resource requirements.
>- How to unify resource management for jobs with / without fine grained
>resource requirements.
>- How to unify resource management for streaming / batch jobs.
>
> Key changes proposed in the FLIP are as follows.
>
>- Unify memory management for operators with / without fine grained
>resource requirements by applying a fraction based quota mechanism.
>- Unify resource scheduling for streaming and batch jobs by setting slot
>sharing groups for pipelined regions during compiling stage.
>- Dynamically allocate slots from task executors' available resources.
>
> Please find more details in the FLIP wiki document [1]. Looking forward to
> your feedback.
>
> Thank you~
>
> Xintong Song
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management


Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-16 Thread Terry Wang
Congratulations Andrey!

Best,
Terry Wang



> On Aug 15, 2019, at 9:27 PM, Hequn Cheng  wrote:
> 
> Congratulations Andrey!
> 
> On Thu, Aug 15, 2019 at 3:30 PM Fabian Hueske  > wrote:
> Congrats Andrey!
> 
> On Thu., Aug 15, 2019 at 07:58, Gary Yao  > wrote:
> 
> > Congratulations Andrey, well deserved!
> >
> > Best,
> > Gary
> >
> > On Thu, Aug 15, 2019 at 7:50 AM Bowen Li  > > wrote:
> >
> > > Congratulations Andrey!
> > >
> > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong  > > > wrote:
> > >
> > >> Congratulations Andrey!
> > >>
> > >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok  > >> > wrote:
> > >>
> > >> > Congratulations Andrey!
> > >> > At 2019-08-14 21:26:37, "Till Rohrmann"  > >> > > wrote:
> > >> > >Hi everyone,
> > >> > >
> > >> > >I'm very happy to announce that Andrey Zagrebin accepted the offer of
> > >> the
> > >> > >Flink PMC to become a committer of the Flink project.
> > >> > >
> > >> > >Andrey has been an active community member for more than 15 months.
> > He
> > >> has
> > >> > >helped shaping numerous features such as State TTL, FRocksDB release,
> > >> > >Shuffle service abstraction, FLIP-1, result partition management and
> > >> > >various fixes/improvements. He's also frequently helping out on the
> > >> > >user@f.a.o mailing lists.
> > >> > >
> > >> > >Congratulations Andrey!
> > >> > >
> > >> > >Best, Till
> > >> > >(on behalf of the Flink PMC)
> > >> >
> > >>
> > >
> >



Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-16 Thread Xintong Song
Let me explain this with a concrete example Till.

Let's say we have the following scenario.

Total Process Memory: 1GB
JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and
Network Memory): 800MB


For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
For alternative 3, we set -XX:MaxDirectMemorySize to a very large value,
let's say 1TB.

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead
does not exceed 200MB, then alternative 2 and alternative 3 should have the
same utility. Setting larger -XX:MaxDirectMemorySize will not reduce the
sizes of the other memory pools.

If the actual direct memory usage of Task Off-Heap Memory and JVM
Overhead potentially exceeds 200MB, then

   - Alternative 2 suffers from frequent OOM. To avoid that, the only thing
   the user can do is to modify the configuration and increase JVM Direct Memory
   (Task Off-Heap Memory + JVM Overhead). Let's say that the user increases JVM
   Direct Memory to 250MB, this will reduce the total size of other memory
   pools to 750MB, given the total process memory remains 1GB.
   - For alternative 3, there is no chance of direct OOM. There are chances
   of exceeding the total process memory limit, but given that the process may
   not use up all the reserved native memory (Off-Heap Managed Memory, Network
   Memory, JVM Metaspace), if the actual direct memory usage is slightly above
   yet very close to 200MB, the user probably does not need to change the
   configurations.

Therefore, I think from the user's perspective, a feasible configuration
for alternative 2 may lead to lower resource utilization compared to
alternative 3.
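
To make the two settings concrete, here is a small sketch of the resulting JVM
flags (illustrative only; the 150/50 split of the 200MB budget is an assumed
example, not part of the proposal):

// Deriving -XX:MaxDirectMemorySize from the example budgets above.
public class MaxDirectMemoryChoice {
    public static void main(String[] args) {
        long taskOffHeapMb = 150; // assumed split of the 200MB direct budget
        long jvmOverheadMb = 50;
        long directBudgetMb = taskOffHeapMb + jvmOverheadMb; // 200MB

        // Alternative 2: cap equals the budget, so overuse fails fast with
        // java.lang.OutOfMemoryError: Direct buffer memory.
        System.out.println("-XX:MaxDirectMemorySize=" + directBudgetMb + "m");

        // Alternative 3: cap far above the budget (1TB here); direct overuse
        // only surfaces if the whole process exceeds the container limit.
        System.out.println("-XX:MaxDirectMemorySize=" + (1024L * 1024) + "m");
    }
}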

Thank you~

Xintong Song



On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann  wrote:

> I guess you have to help me understand the difference between alternative 2
> and 3 wrt memory under-utilization, Xintong.
>
> - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory and JVM
> Overhead. Then there is the risk that this size is too low resulting in a
> lot of garbage collection and potentially an OOM.
> - Alternative 3: set XX:MaxDirectMemorySize to something larger than
> alternative 2. This would of course reduce the sizes of the other memory
> types.
>
> How would alternative 2 now result in an under-utilization of memory
> compared to alternative 3? If alternative 3 strictly sets a higher max
> direct memory size and we use only a little, then I would expect that
> alternative 3 results in memory under-utilization.
>
> Cheers,
> Till
>
> On Tue, Aug 13, 2019 at 4:19 PM Yang Wang  wrote:
>
> > Hi Xintong, Till,
> >
> >
> > > Native and Direct Memory
> >
> > My point is about setting a very large max direct memory size when we do not
> > differentiate direct and native memory. If the direct memory, including user
> > direct memory and framework direct memory, could be calculated correctly,
> > then I am in favor of setting direct memory to a fixed value.
> >
> >
> >
> > > Memory Calculation
> >
> > I agree with Xintong. For YARN and K8s, we need to check the memory
> > configurations on the client side to avoid submitting successfully and
> > then failing in the Flink master.
> >
> >
> > Best,
> >
> > Yang
> >
> > Xintong Song wrote on Tue, Aug 13, 2019 at 22:07:
> >
> > > Thanks for replying, Till.
> > >
> > > About MemorySegment, I think you are right that we should not include
> > this
> > > issue in the scope of this FLIP. This FLIP should concentrate on how to
> > > configure memory pools for TaskExecutors, with minimum involvement on
> how
> > > memory consumers use it.
> > >
> > > About direct memory, I think alternative 3 may not have the same
> > > over-reservation issue that alternative 2 does, but at the cost of a risk
> > > of over-using memory at the container level, which is not good. My point
> > > is that both "Task Off-Heap Memory" and "JVM Overhead" are not easy to
> > > configure. For alternative 2, users might configure them higher than what
> > > is actually needed, just to avoid getting a direct OOM. For alternative 3,
> > > users do not get a direct OOM, so they may not configure the two options
> > > aggressively high. But the consequence is a risk that the overall container
> > > memory usage exceeds the budget.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann 
> > > wrote:
> > >
> > > > Thanks for proposing this FLIP Xintong.
> > > >
> > > > All in all I think it already looks quite good. Concerning the first
> > open
> > > > question about allocating memory segments, I was wondering whether
> this
> > > is
> > > > strictly necessary to do in the context of this FLIP or whether this
> > > could
> > > > be done as a follow up? Without knowing all details, I would be
> > concerned
> > > > that we would widen the scope of this FLIP too much because we would
> > have
> > > > to touch all the existing call sites of the MemoryManager where we
> > > a

Re: [VOTE] FLIP-51: Rework of the Expression Design

2019-08-16 Thread Dawid Wysakowicz
+1 from my side

Best,

Dawid

On 16/08/2019 10:31, Jark Wu wrote:
> +1 from my side.
>
> Thanks Jingsong for driving this.
>
> Best,
> Jark
>
> On Thu, 15 Aug 2019 at 22:09, Timo Walther  wrote:
>
>> +1 for this.
>>
>> Thanks,
>> Timo
>>
>> On 15.08.19 at 15:57, JingsongLee wrote:
>>> Hi Flink devs,
>>>
>>> I would like to start the voting for FLIP-51 Rework of the Expression
>>>   Design.
>>>
>>> FLIP wiki:
>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design
>>> Discussion thread:
>>>
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html
>>> Google Doc:
>>>
>> https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing
>>> Thanks,
>>>
>>> Best,
>>> Jingsong Lee
>>
>>





Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-16 Thread Guowei Ma
Hi Till,
I can send the job to you offline.
It is just a datastream job and does not use TwoInputSelectableStreamTask.
A->B
 \
   C
 /
D->E
Best,
Guowei


Till Rohrmann  wrote on Fri, Aug 16, 2019 at 4:34 PM:

> Thanks for reporting this issue Guowei. Could you share a bit more details
> what the job exactly does and which operators it uses? Does the job use
> the new `TwoInputSelectableStreamTask` which might cause the performance
> regression?
>
> I think it is important to understand where the problem comes from before
> we proceed with the release.
>
> Cheers,
> Till
>
> On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma  wrote:
>
> > Hi,
> > -1
> > We have a benchmark job, which includes a two-input operator.
> > This job has a big performance regression using 1.9 compared to 1.8.
> > It's still not very clear why this regression happens.
> >
> > Best,
> > Guowei
> >
> >
> > Yu Li  wrote on Fri, Aug 16, 2019 at 3:27 PM:
> >
> > > +1 (non-binding)
> > >
> > > - checked release notes: OK
> > > - checked sums and signatures: OK
> > > - source release
> > >  - contains no binaries: OK
> > >  - contains no 1.9-SNAPSHOT references: OK
> > >  - build from source: OK (8u102)
> > >  - mvn clean verify: OK (8u102)
> > > - binary release
> > >  - no examples appear to be missing
> > >  - started a cluster; WebUI reachable, example ran successfully
> > > - repository appears to contain all expected artifacts
> > >
> > > Best Regards,
> > > Yu
> > >
> > >
> > > On Fri, 16 Aug 2019 at 06:06, Bowen Li  wrote:
> > >
> > > > Hi Jark,
> > > >
> > > > Thanks for letting me know that it's been like this in previous
> > releases.
> > > > Though I don't think that's the right behavior, it can be discussed
> for
> > > > later release. Thus I retract my -1 for RC2.
> > > >
> > > > Bowen
> > > >
> > > >
> > > > On Thu, Aug 15, 2019 at 7:49 PM Jark Wu  wrote:
> > > >
> > > > > Hi Bowen,
> > > > >
> > > > > Thanks for reporting this.
> > > > > However, I don't think this is an issue. IMO, it is by design.
> > > > > The `tEnv.listUserDefinedFunctions()` in Table API and `show
> > > functions;`
> > > > in
> > > > > SQL CLI are intended to return only the registered UDFs, not
> > including
> > > > > built-in functions.
> > > > > This is also the behavior in previous versions.
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Fri, 16 Aug 2019 at 06:52, Bowen Li 
> wrote:
> > > > >
> > > > > > -1 for RC2.
> > > > > >
> > > > > > I found a bug https://issues.apache.org/jira/browse/FLINK-13741,
> > > and I
> > > > > > think it's a blocker.  The bug means currently if users call
> > > > > > `tEnv.listUserDefinedFunctions()` in Table API or `show
> functions;`
> > > > thru
> > > > > > SQL would not be able to see Flink's built-in functions.
> > > > > >
> > > > > > I'm preparing a fix right now.
> > > > > >
> > > > > > Bowen
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai <
> > > > tzuli...@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for all the test efforts, verifications and votes so
> far.
> > > > > > >
> > > > > > > So far, things are looking good, but we still require one more
> > PMC
> > > > > > binding
> > > > > > > vote for this RC to be the official release, so I would like to
> > > > extend
> > > > > > the
> > > > > > > vote time for 1 more day, until *Aug. 16th 17:00 CET*.
> > > > > > >
> > > > > > > In the meantime, the release notes for 1.9.0 had only just been
> > > > > finalized
> > > > > > > [1], and could use a few more eyes before closing the vote.
> > > > > > > Any help with checking if anything else should be mentioned
> there
> > > > > > regarding
> > > > > > > breaking changes / known shortcomings would be appreciated.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gordon
> > > > > > >
> > > > > > > [1] https://github.com/apache/flink/pull/9438
> > > > > > >
> > > > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young 
> > > wrote:
> > > > > > >
> > > > > > > > Great, then I have no other comments on legal check.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Kurt
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler <
> > > > ches...@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > The licensing items aren't a problem; we don't care about Flink
> > > > > > > > > modules in NOTICE files, and we don't have to update the
> > > > > > > > > source-release licensing since we don't have a pre-built version
> > > > > > > > > of the WebUI in the source.
> > > > > > > > >
> > > > > > > > > On 15/08/2019 15:22, Kurt Young wrote:
> > > > > > > > > > After going through the licenses, I found two suspicious
> > > > > > > > > > items, but I am not sure whether they are valid.
> > > > > > > > > >
> > > > > > > > > > 1. flink-state-processing-api is packaged in t
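
For reference, the `listUserDefinedFunctions` call discussed earlier in this
thread looks like the following against a 1.9 TableEnvironment (a minimal
sketch; `MyUpperFunction` is an invented scalar UDF, not part of Flink):

    import org.apache.flink.table.functions.ScalarFunction;

    // Invented UDF, for illustration only.
    public class MyUpperFunction extends ScalarFunction {
        public String eval(String s) {
            return s == null ? null : s.toUpperCase();
        }
    }

    // Given an existing TableEnvironment `tEnv`:
    //   tEnv.registerFunction("myUpper", new MyUpperFunction());
    //   String[] functions = tEnv.listUserDefinedFunctions();
    // `functions` then contains the registered "myupper" but no built-in
    // functions such as CONCAT -- the behavior FLINK-13741 reports as a bug.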

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler
There appears to be general agreement that 1) should be looked into;
I've set up a branch with fork reuse enabled for all tests and will
report back with the results.


On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's 
discuss the different ways how they could be reduced.



   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total time"), 
and in the ideal case takes about 1h20m ("run time") to complete from 
start to finish. The run time may fluctuate of course depending on the 
current Travis load. This applies both to builds on the Apache and 
flink-ci Travis.


At the time of writing, the current queue time for PR jobs (reminder: 
running on flink-ci) is about 30 minutes (which basically means that 
we are processing builds at the rate that they come in); however, we
are in an admittedly quiet period right now.
2 weeks ago the queue times on flink-ci peaked at around 5-6h as 
everyone was scrambling to get their changes merged in time for the 
feature freeze.


(Note: Recently optimizations were added to ci-bot where pending 
builds are canceled if a new commit was pushed to the PR or the PR was 
closed, which should prove especially useful during the rush hours we 
see before feature-freezes.)



   Past approaches

Over the years we have done rather few things to improve this 
situation (hence our current predicament).


Beyond the sporadic speedup of some tests, the only notable reduction 
in total build times was the introduction of cron jobs, which 
consolidated the per-commit matrix from 4 configurations (different 
scala/hadoop versions) to 1.


The separation into multiple build profiles was only a work-around for 
the 50m limit on Travis. Running tests in parallel has the obvious 
potential of reducing run time, but we're currently hitting a hard 
limit since a few modules (flink-tests, flink-runtime, 
flink-table-planner-blink) are so loaded with tests that they nearly 
consume an entire profile by themselves (and thus no further splitting 
is possible).


The rework that introduced stages did not, at the time of introduction,
provide a speedup, although this changed slightly once more profiles
were added and some optimizations to the caching were made.


Very recently we modified the surefire-plugin configuration for 
flink-table-planner-blink to reuse JVM forks for IT cases, providing a 
significant speedup (18 minutes!). So far we have not seen any 
negative consequences.
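
For context, fork reuse here refers to the surefire plugin's reuseForks /
forkCount parameters: successive test classes share one JVM instead of each
getting a fresh one. What reuse trades away is isolation, since static or
global state set by one test class stays visible to the next. A contrived
sketch of that hazard (class and field names invented; the two classes would
live in separate files):

    // File GlobalConfigTest1.java
    import org.junit.Test;

    public class GlobalConfigTest1 {
        static String mode = "default"; // stands in for any static/global state

        @Test
        public void mutatesGlobalState() {
            mode = "experimental"; // leaks into later tests in the same JVM fork
        }
    }

    // File GlobalConfigTest2.java
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class GlobalConfigTest2 {
        @Test
        public void assumesDefaultState() {
            // Passes in a fresh fork per class; can fail in a reused fork if
            // GlobalConfigTest1 happened to run first.
            assertEquals("default", GlobalConfigTest1.mode);
        }
    }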



   Suggestions

This is a list of /all/ suggestions for reducing run/total times that
I have seen recently (in other words, they aren't necessarily mine, nor
do I necessarily agree with all of them).


1. Enable JVM reuse for IT cases in more modules.
 * We've seen significant speedups in the blink planner, and this
   should be applicable for all modules. However, I presume there's
   a reason why we disabled JVM reuse (information on this would be
   appreciated)
2. Custom differential build scripts
 * Setup custom scripts for determining which modules might be
   affected by change, and manipulate the splits accordingly. This
   approach is conceptually quite straightforward, but has limits
   since it has to be pessimistic; i.e. a change in flink-core
   _must_ result in testing all modules (see the sketch below).
3. Only run smoke tests when PR is opened, run heavy tests on demand.
 * With the introduction of the ci-bot we now have significantly
   more options on how to handle PR builds. One option could be to
   only run basic tests when the PR is created (which may be only
   modified modules, or all unit tests, or another low-cost
   scheme), and then have a committer trigger other builds (full
   test run, e2e tests, etc...) on demand.
4. Move more tests into cron builds
 * The budget version of 3); move certain tests that are either
   expensive (like some runtime tests that take minutes) or in
   rarely modified modules (like gelly) into cron jobs.
5. Gradle
 * Gradle was brought up a few times for its built-in support for
   differential builds; basically providing 2) without the overhead
   of maintaining additional scripts.
 * To date no PoC was provided that shows it working in our CI
   environment (i.e., handling splits & caching etc).
 * This is the most disruptive change by a fair margin, as it would
   affect the entire project, developers and potentially users (if
   they build from source).
6. CI service
 * Our current artifact caching setup on Travis is basically a
   hack; we're abusing the Travis cache, which is meant for
   long-term caching, to ship build artifacts across jobs. It's
   brittle at times due to timing/visibility issues and on branches
   the cleanup processes can interfere with running builds. It is
   also
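
As referenced in suggestion 2, here is a sketch of the pessimistic
module-selection logic behind a differential build (module names and the
dependency map are invented; a real implementation would derive the graph
from Maven):

    import java.util.*;

    // Given the changed modules and a reverse-dependency map
    // (module -> modules that depend on it), test every transitive dependent.
    public class AffectedModules {
        static Set<String> affected(Set<String> changed, Map<String, Set<String>> dependents) {
            Set<String> result = new HashSet<>(changed);
            Deque<String> queue = new ArrayDeque<>(changed);
            while (!queue.isEmpty()) {
                String module = queue.poll();
                for (String dep : dependents.getOrDefault(module, Collections.emptySet())) {
                    if (result.add(dep)) {
                        queue.add(dep); // newly affected, walk its dependents too
                    }
                }
            }
            return result;
        }

        public static void main(String[] args) {
            Map<String, Set<String>> dependents = new HashMap<>();
            dependents.put("flink-core", new HashSet<>(Arrays.asList("flink-runtime", "flink-table")));
            dependents.put("flink-runtime", new HashSet<>(Collections.singletonList("flink-tests")));
            // A flink-core change pessimistically fans out to everything downstream.
            System.out.println(affected(Collections.singleton("flink-core"), dependents));
        }
    }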

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Xiyuan Wang
6. CI service
I'm not very familiar with Travis, but according to its official
docs [1][2], is it possible to run jobs in parallel? AFAIK, many CI systems
support this kind of feature.

[1]:
https://docs.travis-ci.com/user/speeding-up-the-build/#parallelizing-your-builds-across-virtual-machines
[2]: https://docs.travis-ci.com/user/build-matrix/

Arvid Heise wrote on Fri, Aug 16, 2019 at 4:14 PM:

> Thank you for starting the discussion as well!
>
> +1 to 1. it seems to be a quite low-hanging fruit that we should try to
> employ as much as possible.
>
> -0 to 2. the build setup is already very complicated. Adding new
> functionality that I would expect to come out of the box of a modern build
> tool seems like too much effort for me. I'm proposing a 7. action item that
> I would like to try out first before making the setup more complicated.
>
> +0 to 3. What is the actual intent here? If it's about failing earlier,
> then I'd rather propose to reorder the tests such that unit and smoke tests
> of every module are run before IT tests. If it's about being able to
> approve a PR quicker, are smoke tests really enough? However, if we have
> layered tests, then it would be rather easy to omit IT tests altogether in
> specific (local) builds.
>
> -1 to 4. I really want to see when stuff breaks not only once per day (or
> whatever the CRON cycle is). I can really see more broken code being merged
> into master because of the disconnect.
>
> +1 to 5. Gradle build cache has worked well for me in the past. If there is
> a general interest, I can start a POC (or improve upon older POCs). I
> currently expect shading to be the most effort.
>
> +1 to 6. Travis had so many drawbacks in the past and now that most of the
> senior staff has been laid off, I don't expect any improvements at all.
> At my old company, I switched our open source projects to Azure pipelines
> with great success. Azure pipelines offers 10 instances for open source
> projects and its payment model is pay-as-you-go [1]. Since artifact
> sharing seems to be an issue with Travis anyways, it looks rather easy to
> use in pipelines [2].
> I'd also expect Github CI to be a good fit for our needs [3], but it's
> rather young and I have no experience.
>
> ---
>
> 7. An option I'd like to try first is the global build cache provided by
> Gradle Enterprise for Maven [4]. It basically fingerprints a task
> (fingerprint of upstream tasks, source files + black magic) and whenever
> the fingerprint matches it fetches the results from the build cache. In
> theory, we would get the results of 2. implicitly without any effort. Of
> course, Gradle enterprise costs money (which I could inquire if general
> interest exists) but it would also allow us to downgrade the Travis plan
> (and Travis is really expensive).
>
>
> [1]
>
> https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/
> [2]
>
> https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops&tabs=yaml
> [3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/
> [4] https://docs.gradle.com/enterprise/maven-extension/
>
> On Fri, Aug 16, 2019 at 5:20 AM Jark Wu  wrote:
>
> > Thanks Chesnay for starting this discussion.
> >
> > +1 for #1, it might be the easiest way to get a significant speedup.
> > If the only reason is isolation, I think we can fix the static fields
> > or global state used in Flink if possible.
> >
> > +1 for #2, and thanks Aleksey for the prototype. I think it's a good
> > approach which doesn't introduce too many things to maintain.
> >
> > +1 for #3(run CRON or e2e tests on demand).
> > We have this requirement when reviewing some pull requests, because we
> > aren't sure whether it will break some specific e2e test.
> > Currently, we have to run it locally by building the whole project, or
> > enable CRON jobs for the pushed branch in the contributor's own Travis.
> >
> > Besides that, I think FLINK-11464[1] is also a good way to cache
> > distributions to save a lot of download time.
> >
> > Best,
> > Jark
> >
> > [1]: https://issues.apache.org/jira/browse/FLINK-11464
> >
> > On Thu, 15 Aug 2019 at 21:47, Aleksey Pak  wrote:
> >
> > > Hi all!
> > >
> > > Thanks for starting this discussion.
> > >
> > > I'd like to also add my 2 cents:
> > >
> > > +1 for #2, differential build scripts.
> > > I've worked on this approach, and with it I think it's possible to
> > > reduce total build time with relatively low effort and low maintenance
> > > cost, without enforcing any new build tool.
> > >
> > > You can check a proposed change (for the old CI setup, when Flink PRs
> > were
> > > running in Apache common CI pool) here:
> > > https://github.com/apache/flink/pull/9065
> > > In the proposed change, the dependency check is not heavily hardcoded
> and
> > > just uses maven's results for dependency graph analysis.
> > >
> > > > This approach is conceptually quite straight-forward, but has limi

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-16 Thread Till Rohrmann
Thanks for reporting this issue, Guowei. Could you share a few more details
about what the job does exactly and which operators it uses? Does the job use
the new `TwoInputSelectableStreamTask`, which might cause the performance
regression?

I think it is important to understand where the problem comes from before
we proceed with the release.

Cheers,
Till

On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma  wrote:

> Hi,
> -1
> We have a benchmark job, which includes a two-input operator.
> This job has a big performance regression using 1.9 compared to 1.8.
> It's still not very clear why this regression happens.
>
> Best,
> Guowei
>
>
> Yu Li wrote on Fri, Aug 16, 2019 at 3:27 PM:
>
> > +1 (non-binding)
> >
> > - checked release notes: OK
> > - checked sums and signatures: OK
> > - source release
> >  - contains no binaries: OK
> >  - contains no 1.9-SNAPSHOT references: OK
> >  - build from source: OK (8u102)
> >  - mvn clean verify: OK (8u102)
> > - binary release
> >  - no examples appear to be missing
> >  - started a cluster; WebUI reachable, example ran successfully
> > - repository appears to contain all expected artifacts
> >
> > Best Regards,
> > Yu
> >
> >
> > On Fri, 16 Aug 2019 at 06:06, Bowen Li  wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for letting me know that it's been like this in previous
> > > releases. Though I don't think that's the right behavior, it can be
> > > discussed in a later release. Thus I retract my -1 for RC2.
> > >
> > > Bowen
> > >
> > >
> > > On Thu, Aug 15, 2019 at 7:49 PM Jark Wu  wrote:
> > >
> > > > Hi Bowen,
> > > >
> > > > Thanks for reporting this.
> > > > However, I don't think this is an issue. IMO, it is by design.
> > > > The `tEnv.listUserDefinedFunctions()` in Table API and `show
> > > > functions;` in SQL CLI are intended to return only the registered
> > > > UDFs, not including built-in functions.
> > > > This is also the behavior in previous versions.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Fri, 16 Aug 2019 at 06:52, Bowen Li  wrote:
> > > >
> > > > > -1 for RC2.
> > > > >
> > > > > I found a bug, https://issues.apache.org/jira/browse/FLINK-13741, and
> > > > > I think it's a blocker. The bug means that currently users who call
> > > > > `tEnv.listUserDefinedFunctions()` in the Table API or `show functions;`
> > > > > through SQL are not able to see Flink's built-in functions.
> > > > >
> > > > > I'm preparing a fix right now.
> > > > >
> > > > > Bowen
> > > > >
> > > > >
> > > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai <
> > > tzuli...@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks for all the test efforts, verifications and votes so far.
> > > > > >
> > > > > > So far, things are looking good, but we still require one more PMC
> > > > > > binding vote for this RC to be the official release, so I would like
> > > > > > to extend the vote time for 1 more day, until *Aug. 16th 17:00 CET*.
> > > > > >
> > > > > > In the meantime, the release notes for 1.9.0 had only just been
> > > > > > finalized [1], and could use a few more eyes before closing the vote.
> > > > > > Any help with checking if anything else should be mentioned there
> > > > > > regarding breaking changes / known shortcomings would be appreciated.
> > > > > >
> > > > > > Cheers,
> > > > > > Gordon
> > > > > >
> > > > > > [1] https://github.com/apache/flink/pull/9438
> > > > > >
> > > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young 
> > wrote:
> > > > > >
> > > > > > > Great, then I have no other comments on legal check.
> > > > > > >
> > > > > > > Best,
> > > > > > > Kurt
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler <
> > > ches...@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > The licensing items aren't a problem; we don't care about Flink
> > > > > > > > modules in NOTICE files, and we don't have to update the
> > > > > > > > source-release licensing since we don't have a pre-built version
> > > > > > > > of the WebUI in the source.
> > > > > > > >
> > > > > > > > On 15/08/2019 15:22, Kurt Young wrote:
> > > > > > > > > After going through the licenses, I found two suspicious items,
> > > > > > > > > but I am not sure whether they are valid.
> > > > > > > > >
> > > > > > > > > 1. flink-state-processing-api is packaged into the flink-dist
> > > > > > > > > jar, but not included in the NOTICE-binary file (the one under
> > > > > > > > > the root directory) like other modules.
> > > > > > > > > 2. flink-runtime-web distributed some JavaScript dependencies
> > > > > > > > > through source code; the licenses and NOTICE file were only
> > > > > > > > > updated inside the flink-runtime-web module, but not the NOTICE
> > > > > > > > > file and licenses dir

Re: [VOTE] FLIP-51: Rework of the Expression Design

2019-08-16 Thread Jark Wu
+1 from my side.

Thanks Jingsong for driving this.

Best,
Jark

On Thu, 15 Aug 2019 at 22:09, Timo Walther  wrote:

> +1 for this.
>
> Thanks,
> Timo
>
> Am 15.08.19 um 15:57 schrieb JingsongLee:
> > Hi Flink devs,
> >
> > I would like to start the voting for FLIP-51 Rework of the Expression
> >   Design.
> >
> > FLIP wiki:
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design
> >
> > Discussion thread:
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html
> >
> > Google Doc:
> >
> https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing
> >
> > Thanks,
> >
> > Best,
> > Jingsong Lee
>
>
>


[DISCUSS] Release flink-shaded 8.0

2019-08-16 Thread Chesnay Schepler

Hello,

I would like to kick off the next flink-shaded release next week. There 
are 2 ongoing efforts that are blocked on this release:


 * [FLINK-13467] Java 11 support requires a bump to ASM to correctly
   handle Java 11 bytecode
 * [FLINK-11767] Reworking the typeSerializerSnapshotMigrationTestBase
   requires asm-commons to be added to flink-shaded-asm

Are there any other changes on anyone's radar that we will have to make 
for 1.10? (will bumping calcite require anything, for example)





Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-16 Thread Till Rohrmann
I guess you have to help me understand the difference between alternatives 2
and 3 with regard to memory underutilization, Xintong.

- Alternative 2: set -XX:MaxDirectMemorySize to Task Off-Heap Memory and JVM
Overhead. Then there is the risk that this size is too low, resulting in a
lot of garbage collection and potentially an OOM.
- Alternative 3: set -XX:MaxDirectMemorySize to something larger than in
alternative 2. This would of course reduce the sizes of the other memory
types.

How would alternative 2 now result in an underutilization of memory
compared to alternative 3? If alternative 3 strictly sets a higher max
direct memory size and we use only a little of it, then I would expect that
alternative 3 results in memory underutilization.
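
As a toy illustration of that trade-off (sizes invented; this is not Flink
code): direct allocations are capped by -XX:MaxDirectMemorySize, so a tight
cap fails fast while a generous cap shifts the risk to the container budget.

    import java.nio.ByteBuffer;

    // Run with: java -XX:MaxDirectMemorySize=64m DirectMemoryCap
    public class DirectMemoryCap {
        public static void main(String[] args) {
            // 32 MB: fits under the 64 MB cap.
            ByteBuffer first = ByteBuffer.allocateDirect(32 << 20);
            // 64 MB: exceeds the remaining direct-memory budget. With a tight cap
            // (alternative 2) this throws "OutOfMemoryError: Direct buffer memory";
            // with a generous cap (alternative 3) it succeeds, but the overall
            // process footprint may then exceed the container's memory budget.
            ByteBuffer second = ByteBuffer.allocateDirect(64 << 20);
            System.out.println(first.capacity() + " / " + second.capacity());
        }
    }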

Cheers,
Till

On Tue, Aug 13, 2019 at 4:19 PM Yang Wang  wrote:

> Hi xintong,till
>
>
> > Native and Direct Memory
>
> My point is to set a very large max direct memory size when we do not
> differentiate direct and native memory. If the direct memory, including
> user direct memory and framework direct memory, could be calculated
> correctly, then I am in favor of setting direct memory to a fixed value.
>
>
>
> > Memory Calculation
>
> I agree with Xintong. For Yarn and K8s, we need to check the memory
> configurations on the client side to avoid a job submitting successfully
> and then failing in the Flink master.
>
>
> Best,
>
> Yang
>
> Xintong Song wrote on Tue, Aug 13, 2019 at 22:07:
>
> > Thanks for replying, Till.
> >
> > About MemorySegment, I think you are right that we should not include
> this
> > issue in the scope of this FLIP. This FLIP should concentrate on how to
> > configure memory pools for TaskExecutors, with minimum involvement on how
> > memory consumers use it.
> >
> > About direct memory, I think alternative 3 may not have the same
> > over-reservation issue that alternative 2 does, but at the cost of
> > risking memory overuse at the container level, which is not good. My
> > point is that
> > both "Task Off-Heap Memory" and "JVM Overhead" are not easy to configure.
> > For alternative 2, users might configure them higher than what is actually
> > needed, just to avoid getting a direct OOM. For alternative 3, users do
> > not get a direct OOM, so they may not configure the two options
> > aggressively high. But the consequence is a risk that the overall
> > container memory usage exceeds the budget.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann 
> > wrote:
> >
> > > Thanks for proposing this FLIP Xintong.
> > >
> > > All in all I think it already looks quite good. Concerning the first
> open
> > > question about allocating memory segments, I was wondering whether this
> > is
> > > strictly necessary to do in the context of this FLIP or whether this
> > could
> > > be done as a follow up? Without knowing all details, I would be
> concerned
> > > that we would widen the scope of this FLIP too much because we would
> have
> > > to touch all the existing call sites of the MemoryManager where we
> > allocate
> > > memory segments (this should mainly be batch operators). The addition
> of
> > > the memory reservation call to the MemoryManager should not be affected
> > by
> > > this and I would hope that this is the only point of interaction a
> > > streaming job would have with the MemoryManager.
> > >
> > > Concerning the second open question about setting or not setting a max
> > > direct memory limit, I would also be interested why Yang Wang thinks
> > > leaving it open would be best. My concern about this would be that we
> > would
> > > be in a similar situation as we are now with the RocksDBStateBackend.
> If
> > > the different memory pools are not clearly separated and can spill over
> > to
> > > a different pool, then it is quite hard to understand what exactly
> > causes a
> > > process to get killed for using too much memory. This could then easily
> > > lead to a similar situation what we have with the cutoff-ratio. So why
> > not
> > > setting a sane default value for max direct memory and giving the user
> an
> > > option to increase it if he runs into an OOM.
> > >
> > > @Xintong, how would alternative 2 lead to lower memory utilization than
> > > alternative 3 where we set the direct memory to a higher value?
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song 
> > wrote:
> > >
> > > > Thanks for the feedback, Yang.
> > > >
> > > > Regarding your comments:
> > > >
> > > > *Native and Direct Memory*
> > > > I think setting a very large max direct memory size definitely has
> some
> > > > good sides. E.g., we do not worry about direct OOM, and we don't even
> > > need
> > > > to allocate managed / network memory with Unsafe.allocate() .
> > > > However, there are also some down sides of doing this.
> > > >
> > > >- One thing I can think of is that if a task executor container is
> > > >killed due to overusing memory, it could be hard for us to know which
> > > >part
> > > >of the memory 

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-16 Thread Guowei Ma
Hi,
-1
We have a benchmark job, which includes a two-input operator.
This job has a big performance regression using 1.9 compared to 1.8.
It's still not very clear why this regression happens.

Best,
Guowei


Yu Li wrote on Fri, Aug 16, 2019 at 3:27 PM:

> +1 (non-binding)
>
> - checked release notes: OK
> - checked sums and signatures: OK
> - source release
>  - contains no binaries: OK
>  - contains no 1.9-SNAPSHOT references: OK
>  - build from source: OK (8u102)
>  - mvn clean verify: OK (8u102)
> - binary release
>  - no examples appear to be missing
>  - started a cluster; WebUI reachable, example ran successfully
> - repository appears to contain all expected artifacts
>
> Best Regards,
> Yu
>
>
> On Fri, 16 Aug 2019 at 06:06, Bowen Li  wrote:
>
> > Hi Jark,
> >
> > Thanks for letting me know that it's been like this in previous releases.
> > Though I don't think that's the right behavior, it can be discussed in a
> > later release. Thus I retract my -1 for RC2.
> >
> > Bowen
> >
> >
> > On Thu, Aug 15, 2019 at 7:49 PM Jark Wu  wrote:
> >
> > > Hi Bowen,
> > >
> > > Thanks for reporting this.
> > > However, I don't think this is an issue. IMO, it is by design.
> > > The `tEnv.listUserDefinedFunctions()` in Table API and `show
> > > functions;` in SQL CLI are intended to return only the registered
> > > UDFs, not including built-in functions.
> > > This is also the behavior in previous versions.
> > >
> > > Best,
> > > Jark
> > >
> > > On Fri, 16 Aug 2019 at 06:52, Bowen Li  wrote:
> > >
> > > > -1 for RC2.
> > > >
> > > > I found a bug, https://issues.apache.org/jira/browse/FLINK-13741, and
> > > > I think it's a blocker. The bug means that currently users who call
> > > > `tEnv.listUserDefinedFunctions()` in the Table API or `show functions;`
> > > > through SQL are not able to see Flink's built-in functions.
> > > >
> > > > I'm preparing a fix right now.
> > > >
> > > > Bowen
> > > >
> > > >
> > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai <
> > tzuli...@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > Thanks for all the test efforts, verifications and votes so far.
> > > > >
> > > > > So far, things are looking good, but we still require one more PMC
> > > > > binding vote for this RC to be the official release, so I would like
> > > > > to extend the vote time for 1 more day, until *Aug. 16th 17:00 CET*.
> > > > >
> > > > > In the meantime, the release notes for 1.9.0 had only just been
> > > > > finalized [1], and could use a few more eyes before closing the vote.
> > > > > Any help with checking if anything else should be mentioned there
> > > > > regarding breaking changes / known shortcomings would be appreciated.
> > > > >
> > > > > Cheers,
> > > > > Gordon
> > > > >
> > > > > [1] https://github.com/apache/flink/pull/9438
> > > > >
> > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young 
> wrote:
> > > > >
> > > > > > Great, then I have no other comments on legal check.
> > > > > >
> > > > > > Best,
> > > > > > Kurt
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler <
> > ches...@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > The licensing items aren't a problem; we don't care about Flink
> > > > > > > modules in NOTICE files, and we don't have to update the
> > > > > > > source-release licensing since we don't have a pre-built version
> > > > > > > of the WebUI in the source.
> > > > > > >
> > > > > > > On 15/08/2019 15:22, Kurt Young wrote:
> > > > > > > > After going through the licenses, I found two suspicious items,
> > > > > > > > but I am not sure whether they are valid.
> > > > > > > >
> > > > > > > > 1. flink-state-processing-api is packaged into the flink-dist
> > > > > > > > jar, but not included in the NOTICE-binary file (the one under
> > > > > > > > the root directory) like other modules.
> > > > > > > > 2. flink-runtime-web distributed some JavaScript dependencies
> > > > > > > > through source code; the licenses and NOTICE file were only
> > > > > > > > updated inside the flink-runtime-web module, but not the NOTICE
> > > > > > > > file and licenses directory which is under the root directory.
> > > > > > > >
> > > > > > > > Another minor issue I just found is:
> > > > > > > > FLINK-13558 tries to include the table examples in flink-dist,
> > > > > > > > but I cannot find them in the binary distribution of RC2.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Kurt
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young  >
> > > > wrote:
> > > > > > > >
> > > > > > > >> Hi Gordon & Timo,
> > > > > > > >>
> > > > > > > >> Thanks for the feedback, and I agree with it. I will
> document
> > > this
> > > > > in
> > > > > > > the
> > > > > > > >> rele

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Arvid Heise
Thank you for starting the discussion as well!

+1 to 1. it seems to be a quite low-hanging fruit that we should try to
employ as much as possible.

-0 to 2. the build setup is already very complicated. Adding new
functionality that I would expect to come out of the box of a modern build
tool seems like too much effort for me. I'm proposing a 7. action item that
I would like to try out first before making the setup more complicated.

+0 to 3. What is the actual intent here? If it's about failing earlier,
then I'd rather propose to reorder the tests such that unit and smoke tests
of every module are run before IT tests. If it's about being able to
approve a PR quicker, are smoke tests really enough? However, if we have
layered tests, then it would be rather easy to omit IT tests altogether in
specific (local) builds.

-1 to 4. I really want to see when stuff breaks not only once per day (or
whatever the CRON cycle is). I can really see more broken code being merged
into master because of the disconnect.

+1 to 5. Gradle build cache has worked well for me in the past. If there is
a general interest, I can start a POC (or improve upon older POCs). I
currently expect shading to be the most effort.

+1 to 6. Travis had so many drawbacks in the past and now that most of the
senior staff has been laid off, I don't expect any improvements at all.
At my old company, I switched our open source projects to Azure pipelines
with great success. Azure pipelines offers 10 instances for open source
projects and its payment model is pay-as-you-go [1]. Since artifact
sharing seems to be an issue with Travis anyways, it looks rather easy to
use in pipelines [2].
I'd also expect Github CI to be a good fit for our needs [3], but it's
rather young and I have no experience.

---

7. An option I'd like to try first is the global build cache provided by
Gradle Enterprise for Maven [4]. It basically fingerprints a task
(fingerprint of upstream tasks, source files + black magic) and whenever
the fingerprint matches it fetches the results from the build cache. In
theory, we would get the results of 2. implicitly without any effort. Of
course, Gradle enterprise costs money (which I could inquire if general
interest exists) but it would also allow us to downgrade the Travis plan
(and Travis is really expensive).
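
The fingerprinting idea can be sketched in a few lines (purely illustrative;
this is not how the Gradle extension is actually implemented):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;

    // A task's cache key is derived from everything that influences its output;
    // a key already present in the remote cache means the task's outputs can be
    // fetched instead of recomputed.
    public class TaskFingerprint {
        static String fingerprint(String toolVersion, List<String> inputFileHashes,
                                  List<String> upstreamTaskKeys) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(toolVersion.getBytes(StandardCharsets.UTF_8));
            for (String hash : inputFileHashes) {
                digest.update(hash.getBytes(StandardCharsets.UTF_8));
            }
            for (String key : upstreamTaskKeys) {
                digest.update(key.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }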


[1]
https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/
[2]
https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops&tabs=yaml
[3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/
[4] https://docs.gradle.com/enterprise/maven-extension/

On Fri, Aug 16, 2019 at 5:20 AM Jark Wu  wrote:

> Thanks Chesnay for starting this discussion.
>
> +1 for #1, it might be the easiest way to get a significant speedup.
> If the only reason is isolation, I think we can fix the static fields
> or global state used in Flink if possible.
>
> +1 for #2, and thanks Aleksey for the prototype. I think it's a good
> approach which doesn't introduce too many things to maintain.
>
> +1 for #3(run CRON or e2e tests on demand).
> We have this requirement when reviewing some pull requests, because we
> aren't sure whether it will break some specific e2e test.
> Currently, we have to run it locally by building the whole project, or
> enable CRON jobs for the pushed branch in the contributor's own Travis.
>
> Besides that, I think FLINK-11464[1] is also a good way to cache
> distributions to save a lot of download time.
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/FLINK-11464
>
> On Thu, 15 Aug 2019 at 21:47, Aleksey Pak  wrote:
>
> > Hi all!
> >
> > Thanks for starting this discussion.
> >
> > I'd like to also add my 2 cents:
> >
> > +1 for #2, differential build scripts.
> > I've worked on this approach, and with it I think it's possible to reduce
> > total build time with relatively low effort and low maintenance cost,
> > without enforcing any new build tool.
> >
> > You can check a proposed change (for the old CI setup, when Flink PRs
> were
> > running in Apache common CI pool) here:
> > https://github.com/apache/flink/pull/9065
> > In the proposed change, the dependency check is not heavily hardcoded and
> > just uses maven's results for dependency graph analysis.
> >
> > > This approach is conceptually quite straight-forward, but has limits
> > since it has to be pessimistic; > i.e. a change in flink-core _must_
> result
> > in testing all modules.
> >
> > Agreed; in Flink's case, there are some core modules that would trigger a
> > whole test run with such an approach. For developers who modify such
> > components, the build time would be the longest. But this approach should
> > really help developers who touch more-or-less independent modules.
> >
> > Even for core modules, it's possible to create "abstraction" barriers by
> > changing the dependency graph. For example, it can look like: flink-core-api
> > <-- 

Re: flink 1.9 DDL nested json derived

2019-08-16 Thread Danny Chan
Hi, Shengnan YU ~

You can check the test cases in FlinkDDLDataTypeTest [1] for a quick
reference of what a DDL column type looks like.

[1] 
https://github.com/apache/flink/blob/a194b37d9b99a47174de9108a937f821816d61f5/flink-table/flink-sql-parser/src/test/java/org/apache/flink/sql/parser/FlinkDDLDataTypeTest.java#L165

Best,
Danny Chan
On Aug 15, 2019 at 2:12 PM +0800, Shengnan YU wrote:
>
> Hi guys,
> I am trying the DDL feature in the release-1.9 branch. I am stuck on
> creating a table from Kafka with nested JSON format. Is it possible to
> specify a "Row" type for columns to derive the nested JSON schema?
>
> String sql = "create table kafka_stream(\n" +
> " a varchar, \n" +
> " b varchar,\n" +
> " c int,\n" +
> " inner_json row\n" +
> ") with (\n" +
> " 'connector.type' ='kafka',\n" +
> " 'connector.version' = '0.11',\n" +
> " 'update-mode' = 'append', \n" +
> " 'connector.topic' = 'test',\n" +
> " 'connector.properties.0.key' = 'bootstrap.servers',\n" +
> " 'connector.properties.0.value' = 'localhost:9092',\n" +
> " 'format.type' = 'json', \n" +
> " 'format.derive-schema' = 'true'\n" +
> ")\n";
>
> Thank you very much!
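
For what it's worth, the nested type can also be spelled out explicitly
instead of relying on derivation; the quoted mail's "inner_json row" most
likely lost its type parameters to the mail archiver. A hedged sketch (the
ROW field names "d" and "e" are invented):

    // Assuming an existing TableEnvironment `tableEnv`.
    String sql = "create table kafka_stream (\n" +
            "  a varchar,\n" +
            "  b varchar,\n" +
            "  c int,\n" +
            "  inner_json row<d varchar, e int>\n" +  // nested JSON object as a ROW
            ") with (\n" +
            "  'connector.type' = 'kafka',\n" +
            "  'connector.version' = '0.11',\n" +
            "  'update-mode' = 'append',\n" +
            "  'connector.topic' = 'test',\n" +
            "  'connector.properties.0.key' = 'bootstrap.servers',\n" +
            "  'connector.properties.0.value' = 'localhost:9092',\n" +
            "  'format.type' = 'json',\n" +
            "  'format.derive-schema' = 'true'\n" +
            ")";
    tableEnv.sqlUpdate(sql);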


[jira] [Created] (FLINK-13748) Streaming File Sink s3 end-to-end test failed on Travis

2019-08-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13748:
-

 Summary: Streaming File Sink s3 end-to-end test failed on Travis
 Key: FLINK-13748
 URL: https://issues.apache.org/jira/browse/FLINK-13748
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem, Tests
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0


The {{Streaming File Sink s3 end-to-end test}} failed on Travis because it did 
not produce any output for 10 minutes.

https://api.travis-ci.org/v3/job/572255913/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13747) Remove some TODOs in Hive connector

2019-08-16 Thread Rui Li (JIRA)
Rui Li created FLINK-13747:
--

 Summary: Remove some TODOs in Hive connector
 Key: FLINK-13747
 URL: https://issues.apache.org/jira/browse/FLINK-13747
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive
Reporter: Rui Li






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13746) Elasticsearch (v2.3.5) sink end-to-end test fails on Travis

2019-08-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13746:
-

 Summary: Elasticsearch (v2.3.5) sink end-to-end test fails on 
Travis
 Key: FLINK-13746
 URL: https://issues.apache.org/jira/browse/FLINK-13746
 Project: Flink
  Issue Type: Bug
  Components: Connectors / ElasticSearch, Tests
Affects Versions: 1.9.0, 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0, 1.9.1


The {{Elasticsearch (v2.3.5) sink end-to-end test}} fails on Travis because its
logs contain the following line:

{code}
INFO  org.elasticsearch.plugins - [Terror] modules [], plugins [], sites []
{code}

Due to this, the error check is triggered.

https://api.travis-ci.org/v3/job/572255901/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13745) Flink cache on Travis does not exist

2019-08-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13745:
-

 Summary: Flink cache on Travis does not exist
 Key: FLINK-13745
 URL: https://issues.apache.org/jira/browse/FLINK-13745
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.9.0, 1.10.0
Reporter: Till Rohrmann


More and more often I observe that Flink builds fail on Travis because of 
missing Flink caches:

{code}
Cached flink dir /home/travis/flink_cache/40072/flink does not exist. Exiting 
build.
{code}

It seems as if Travis cannot guarantee that a cache survives as long as the 
different profiles of a build are running. It would be good to solve this 
problem because now we have regularly failing builds:

https://travis-ci.org/apache/flink/builds/572559629
https://travis-ci.org/apache/flink/builds/572523730
https://travis-ci.org/apache/flink/builds/571576734



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-16 Thread Yu Li
+1 (non-binding)

- checked release notes: OK
- checked sums and signatures: OK
- source release
 - contains no binaries: OK
 - contains no 1.9-SNAPSHOT references: OK
 - build from source: OK (8u102)
 - mvn clean verify: OK (8u102)
- binary release
 - no examples appear to be missing
 - started a cluster; WebUI reachable, example ran successfully
- repository appears to contain all expected artifacts

Best Regards,
Yu


On Fri, 16 Aug 2019 at 06:06, Bowen Li  wrote:

> Hi Jark,
>
> Thanks for letting me know that it's been like this in previous releases.
> Though I don't think that's the right behavior, it can be discussed in a
> later release. Thus I retract my -1 for RC2.
>
> Bowen
>
>
> On Thu, Aug 15, 2019 at 7:49 PM Jark Wu  wrote:
>
> > Hi Bowen,
> >
> > Thanks for reporting this.
> > However, I don't think this is an issue. IMO, it is by design.
> > The `tEnv.listUserDefinedFunctions()` in Table API and `show functions;`
> > in SQL CLI are intended to return only the registered UDFs, not including
> > built-in functions.
> > This is also the behavior in previous versions.
> >
> > Best,
> > Jark
> >
> > On Fri, 16 Aug 2019 at 06:52, Bowen Li  wrote:
> >
> > > -1 for RC2.
> > >
> > > I found a bug https://issues.apache.org/jira/browse/FLINK-13741, and I
> > > think it's a blocker. The bug means that currently users who call
> > > `tEnv.listUserDefinedFunctions()` in the Table API or `show functions;`
> > > through SQL are not able to see Flink's built-in functions.
> > >
> > > I'm preparing a fix right now.
> > >
> > > Bowen
> > >
> > >
> > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai <
> tzuli...@apache.org
> > >
> > > wrote:
> > >
> > > > Thanks for all the test efforts, verifications and votes so far.
> > > >
> > > > So far, things are looking good, but we still require one more PMC
> > > > binding vote for this RC to be the official release, so I would like
> > > > to extend the vote time for 1 more day, until *Aug. 16th 17:00 CET*.
> > > >
> > > > In the meantime, the release notes for 1.9.0 had only just been
> > > > finalized [1], and could use a few more eyes before closing the vote.
> > > > Any help with checking if anything else should be mentioned there
> > > > regarding breaking changes / known shortcomings would be appreciated.
> > > >
> > > > Cheers,
> > > > Gordon
> > > >
> > > > [1] https://github.com/apache/flink/pull/9438
> > > >
> > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young  wrote:
> > > >
> > > > > Great, then I have no other comments on legal check.
> > > > >
> > > > > Best,
> > > > > Kurt
> > > > >
> > > > >
> > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler <
> ches...@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > The licensing items aren't a problem; we don't care about Flink
> > > > > > modules in NOTICE files, and we don't have to update the
> > > > > > source-release licensing since we don't have a pre-built version of
> > > > > > the WebUI in the source.
> > > > > >
> > > > > > On 15/08/2019 15:22, Kurt Young wrote:
> > > > > > > After going through the licenses, I found two suspicious items,
> > > > > > > but I am not sure whether they are valid.
> > > > > > >
> > > > > > > 1. flink-state-processing-api is packaged into the flink-dist
> > > > > > > jar, but not included in the NOTICE-binary file (the one under
> > > > > > > the root directory) like other modules.
> > > > > > > 2. flink-runtime-web distributed some JavaScript dependencies
> > > > > > > through source code; the licenses and NOTICE file were only
> > > > > > > updated inside the flink-runtime-web module, but not the NOTICE
> > > > > > > file and licenses directory which is under the root directory.
> > > > > > >
> > > > > > > Another minor issue I just found is:
> > > > > > > FLINK-13558 tries to include the table examples in flink-dist,
> > > > > > > but I cannot find them in the binary distribution of RC2.
> > > > > > >
> > > > > > > Best,
> > > > > > > Kurt
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young 
> > > wrote:
> > > > > > >
> > > > > > >> Hi Gordon & Timo,
> > > > > > >>
> > > > > > >> Thanks for the feedback, and I agree with it. I will document
> > this
> > > > in
> > > > > > the
> > > > > > >> release notes.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Kurt
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai <
> > > > > > tzuli...@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Kurt,
> > > > > > >>>
> > > > > > >>> With the same argument as before, given that it is mentioned
> in
> > > the
> > > > > > >>> release
> > > > > > >>> announcement that it is a preview feature, I would not block
> > this
> > > > > > release
> > > > > > >>> because of it.
> > > > > > >>> Nevertheless, it

Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend

2019-08-16 Thread Till Rohrmann
+1 for this FLIP and the feature. I think this feature will be super
helpful for many Flink users.

Once the SpillableHeapKeyedStateBackend has proven to be superior to the
HeapKeyedStateBackend we should think about removing the latter completely
to reduce maintenance burden.

Cheers,
Till

On Fri, Aug 16, 2019 at 4:06 AM Congxian Qiu  wrote:

> Big +1 for this feature.
>
> This FLIP can help improve at least the following two scenarios:
> - Temporary data peaks when using the Heap StateBackend
> - The Heap StateBackend has better performance than the RocksDBStateBackend,
> especially on SATA disks. Some users have told me that they increased the
> parallelism of operators (and used the HeapStateBackend) rather than using
> the RocksDBStateBackend to get better performance. But increasing the
> parallelism brings some other problems; after this FLIP, we can run a Flink
> job with the same parallelism as with the RocksDBStateBackend and also get
> better performance.
>
> Best,
> Congxian
>
>
> Yu Li wrote on Fri, Aug 16, 2019 at 12:14 AM:
>
> > Thanks all for the reviews and comments!
> >
> > bq. From the implementation plan, it looks like this exists purely in a
> new
> > module and does not require any changes in other parts of Flink's code.
> Can
> > you confirm that?
> > Confirmed, thanks!
> >
> > Best Regards,
> > Yu
> >
> >
> > On Thu, 15 Aug 2019 at 18:04, Tzu-Li (Gordon) Tai 
> > wrote:
> >
> > > +1 to start a VOTE for this FLIP.
> > >
> > > Given the properties of this new state backend and that it will exist
> as
> > a
> > > new module without touching the original heap backend, I don't see a
> harm
> > > in including this.
> > > Regarding design of the feature, I've already mentioned my comments in
> > the
> > > original discussion thread.
> > >
> > > Cheers,
> > > Gordon
> > >
> > > On Thu, Aug 15, 2019 at 5:53 PM Yun Tang  wrote:
> > >
> > > > Big +1 for this feature.
> > > >
> > > > Our customers, including me, have met the dilemma where we have to use
> > > > windows to aggregate events in applications like real-time monitoring.
> > > > The larger the timer and window state, the poorer the performance of
> > > > RocksDB. However, switching to FsStateBackend would always make me fear
> > > > OOM errors.
> > > >
> > > > Looking forward to more powerful enrichments to the state backends,
> > > > and to helping Flink achieve better performance together.
> > > >
> > > > Best
> > > > Yun Tang
> > > > 
> > > > From: Stephan Ewen 
> > > > Sent: Thursday, August 15, 2019 23:07
> > > > To: dev 
> > > > Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
> > > >
> > > > +1 for this feature. I think this will be appreciated by users, as a
> > way
> > > to
> > > > use the HeapStateBackend with a safety-net against OOM errors.
> > > > And having had major production exposure is great.
> > > >
> > > > From the implementation plan, it looks like this exists purely in a
> new
> > > > module and does not require any changes in other parts of Flink's
> code.
> > > Can
> > > > you confirm that?
> > > >
> > > > Other that that, I have no further questions and we could proceed to
> > vote
> > > > on this FLIP, from my side.
> > > >
> > > > Best,
> > > > Stephan
> > > >
> > > >
> > > > On Tue, Aug 13, 2019 at 10:00 PM Yu Li  wrote:
> > > >
> > > > > Sorry for forgetting to give the link of the FLIP, here it is:
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Best Regards,
> > > > > Yu
> > > > >
> > > > >
> > > > > On Tue, 13 Aug 2019 at 18:06, Yu Li  wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > We held a discussion about this feature before [1], but are now
> > > > > > opening another thread because, on second thought, introducing a new
> > > > > > backend instead of modifying the existing heap backend is a better
> > > > > > option to avoid causing any regression or surprise to existing
> > > > > > in-production usage. And since introducing a new backend is a
> > > > > > relatively big change, we regard it as a FLIP and need another
> > > > > > discussion and voting process according to our newly drafted bylaw [2].
> > > > > >
> > > > > > Please allow me to quote the brief description from the old thread
> > > > > > [1] for the convenience of those who notice this feature for the
> > > > > > first time:
> > > > > >
> > > > > >
> > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in
> > > > > > Flink; since state lives as Java objects on the heap in
> > > > > > HeapKeyedStateBackend and the de/serialization only happens during
> > > > > > state snapshot and restore, it outperforms RocksDBKeyedStateBackend
> > > > > > when all data could reside in memory.* *However, along with the
> > > > > > advantage, HeapKeyedStateBacke