[jira] [Comment Edited] (FLINK-15000) Metrics is very slow when parallelism is over 1000

2019-12-09 Thread fa zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988573#comment-16988573
 ] 

fa zheng edited comment on FLINK-15000 at 12/10/19 7:58 AM:


It is mainly caused by loading all metrics at once in nz-select; in this case, there 
are 63599 metrics in total. It is not suitable to load all metrics at once, as it 
will lead to a performance problem.


was (Author: faaronzheng):
It is mainly caused by loading all metrics at once; in this case, there are 63599 
metrics in total. It is not suitable to load all metrics at once, as it will lead 
to a performance problem.

> Metrics is very slow when parallelism is over 1000
> --
>
> Key: FLINK-15000
> URL: https://issues.apache.org/jira/browse/FLINK-15000
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.9.0
>Reporter: fa zheng
>Priority: Major
> Fix For: 1.11.0
>
>
> Metrics in the web UI are very slow when parallelism is over 1000. It's hard to 
> add a metric and choose one metric. I ran carTopSpeedWindowingExample with the 
> command 
> {code:java}
> // code placeholder
> flink run -m yarn-cluster -p 1200 -yjm 60g -ytm 60g -yn 60 -ys 20 
> examples/streaming/TopSpeedWindowing.jar
> {code}
>  





[GitHub] [flink] zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355886957
 
 

 ##
 File path: flink-kubernetes/src/main/java/org/apache/flink/kubernetes/cli/KubernetesCliOptions.java
 ##
 @@ -46,15 +44,16 @@ public static Option getOptionWithPrefix(Option option, String shortPrefix, Stri
 		.longOpt("clusterId")
 		.required(false)
 		.hasArg(true)
 -		.desc(KubernetesConfigOptions.CLUSTER_ID.description().toString())
 +		.desc("The cluster id that will be used for flink cluster. If it's not set, the client will generate " +
 +			"a random UUID name.")
 		.build();
 
 	public static final Option IMAGE_OPTION = Option.builder("i")
 		.longOpt("image")
 		.required(false)
 		.hasArg(true)
 		.argName("image-name")
 -		.desc(KubernetesConfigOptions.CONTAINER_IMAGE.description().toString())
 +		.desc("Image to use for Flink containers")
 
 Review comment:
   Thanks for the explanation. Then it is better to put these reasons in the 
commit description, to let others know the motivation of this commit.




[jira] [Updated] (FLINK-7151) Support function DDL

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-7151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-7151:
-
Fix Version/s: (was: 1.10.0)
   1.11.0

Change the fix version to 1.11.0 since we already reached feature freeze for 
1.10.0.

On the other hand, since many sub-tasks have been completed, can we call this a 
"preview" feature for 1.10.0? If so, could you supply some documentation about 
where we are now and what's missing? [~ZhenqiuHuang] [~phoenixjiangnan] Thanks.

> Support function DDL
> 
>
> Key: FLINK-7151
> URL: https://issues.apache.org/jira/browse/FLINK-7151
> Project: Flink
>  Issue Type: New Feature
>  Components: Table SQL / API
>Reporter: yuemeng
>Assignee: Zhenqiu Huang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Based on CREATE FUNCTION and table DDL, we can register a UDF, UDAF, or UDTF using SQL:
> {code}
> CREATE FUNCTION [IF NOT EXISTS] [catalog_name.db_name.]function_name AS 
> class_name;
> DROP FUNCTION [IF EXISTS] [catalog_name.db_name.]function_name;
> ALTER FUNCTION [IF EXISTS] [catalog_name.db_name.]function_name RENAME TO 
> new_name;
> {code}
> {code}
> CREATE function 'TOPK' AS 
> 'com..aggregate.udaf.distinctUdaf.topk.ITopKUDAF';
> INSERT INTO db_sink SELECT id, TOPK(price, 5, 'DESC') FROM kafka_source GROUP 
> BY id;
> {code}
> This ticket can assume that the function class is already loaded in the 
> classpath by users. Advanced syntax, such as how to dynamically load UDF 
> libraries from external locations, can be handled in a separate ticket.





[jira] [Comment Edited] (FLINK-15103) Performance regression on 3.12.2019 in various benchmarks

2019-12-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992220#comment-16992220
 ] 

Xintong Song edited comment on FLINK-15103 at 12/10/19 7:50 AM:


OK, I think I have found something that might be the cause of the regression.

I'll explain the details below, but let me start with my conclusion: *With the 
commit in question we actually have more network buffers than before, which take 
more time to allocate when initializing the task executors, causing the 
regression.*

For simplicity, I'll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) as *before-regression*, and to the 
commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) as 
*cause-regression*. All the test results shown below are from running the 
benchmarks locally on my laptop.

First, I logged the number of network buffers and the page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

The result shows that cause-regression has more network buffers, as expected.

*Why do we have more network buffers?* For the two commits discussed, the network 
buffer memory size is calculated in 
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}. 
Basically, this method calculates the network memory size with the following 
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
 heapAndManagedMemory = jvmFreeHeapMemory, if managed memory is on-heap
 heapAndManagedMemory = jvmFreeHeapMemory + managedMemory, if managed memory is off-heap
{quote}
Assuming we have the same {{jvmFreeHeapMemory}}, changing managed memory from 
on-heap to off-heap gives a larger network memory size, and thus a larger number 
of network buffers when the page size stays the same.
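
For reference, a minimal Java sketch of that calculation (simplified, illustrative 
names only; this mirrors the equations above, not the actual 
{{NettyShuffleEnvironmentConfiguration}} code):
{code:java}
// Illustrative sketch of the network memory calculation described above.
static long calculateNetworkMemorySize(
        long jvmFreeHeapMemory,
        long managedMemory,
        boolean managedMemoryOffHeap,
        double networkFraction) {
    // Managed memory only adds to the base when it lives off-heap,
    // i.e. outside of the measured JVM free heap.
    long heapAndManagedMemory = managedMemoryOffHeap
            ? jvmFreeHeapMemory + managedMemory
            : jvmFreeHeapMemory;
    // networkMemorySize = heapAndManagedMemory * fraction / (1 - fraction)
    return (long) (heapAndManagedMemory * networkFraction / (1 - networkFraction));
}
{code}
Dividing the result by the page size (32 kb here) gives the number of network 
buffers, which is why moving managed memory off-heap increases the buffer count.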

What goes against intuition is that with more network buffers we actually have 
worse performance. The only explanation I can think of is that our benchmarks 
include the cluster initialization time in the statistics, and with more network 
buffers we need more time to allocate those direct memory buffers.

To verify that, I explicitly configured 
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks and 
logged the time consumed by allocating all the buffers in the constructor of 
{{NetworkBufferPool}}. Here are the results; the suffixes '-small' / '-large' 
stand for setting the number of network buffers to 12945 / 32768, respectively.
|| ||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|twoInputMapSink score (ops / ms)|11025.369|10471.976|11511.51|10749.657|
|globalWindow (ops / ms)|4399.537|4063.925|4538.209|4020.957|
|buffer allocation time (ms)|79.033|242.567|78.167|237.717|

The benchmark scores show that the larger number of network buffers indeed leads 
to the regression in the statistics. Digging further into the results, take 
{{arrayKeyBy}} in before-regression as an example: the total number of records is 
7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}) and the score for the 
small network memory size case is 1934 ops / ms, so the total execution time of 
one invocation of {{KeyByBenchmarks#arrayKeyBy}} with the small network buffer 
size is roughly 7,000,000 / 1934 = 3619 ms. Similarly, the total execution time 
with the large network buffer size is roughly 7,000,000 / 1835 = 3815 ms. The 
difference in total execution time between the small and large network buffer 
sizes is about 196 ms, which is very close to the difference in buffer allocation 
time (242 - 79 = 163 ms). If we also take the randomness into consideration, this 
basically explains where the benchmark regression comes from.
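
As a quick sanity check of the arithmetic above (values taken from the tables; 
this is just the back-of-envelope calculation, not benchmark code):
{code:java}
// Back-of-envelope check of the arrayKeyBy numbers (before-regression).
long records = 7_000_000L;              // KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION
double smallMs = records / 1934.100;    // ~3619 ms per invocation (small buffer count)
double largeMs = records / 1835.320;    // ~3814 ms per invocation (large buffer count)
double executionDiffMs = largeMs - smallMs;  // ~195 ms slower with more buffers
double allocationDiffMs = 242.567 - 79.033;  // ~164 ms extra buffer allocation time
{code}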

I still need to look into the performance of the later commit 
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just posting 
my findings so far.


was (Author: xintongsong):
Ok, I think I find something that might be the cause of the regression.

I’ll explain the details below, but maybe start from my conclusion: *With the 
questioned commit we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) with *before-regression*, and refer 
to the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
with *cause-regression*. All the test results shown below are from running the 
benchmarks on my laptop locally.

First, I tried to log out the number of network buffers and page sizes.
||before-regression||cause-regression||
|12945 

[GitHub] [flink] flinkbot edited a comment on issue #10348: [FLINK-14951][tests] Harden the thread safety of State TTL backend tests

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10348: [FLINK-14951][tests] Harden the 
thread safety of State TTL backend tests
URL: https://github.com/apache/flink/pull/10348#issuecomment-559509076
 
 
   
   ## CI report:
   
   * 54ecee627e7036f4d150aad330b9772406a19494 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/138580810) 
   * 3e89b2b0d85dd29a76bb10807e2959a7d2ee8295 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140343903) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3383)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[jira] [Comment Edited] (FLINK-15103) Performance regression on 3.12.2019 in various benchmarks

2019-12-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992220#comment-16992220
 ] 

Xintong Song edited comment on FLINK-15103 at 12/10/19 7:49 AM:


OK, I think I have found something that might be the cause of the regression.

I'll explain the details below, but let me start with my conclusion: *With the 
commit in question we actually have more network buffers than before, which take 
more time to allocate when initializing the task executors, causing the 
regression.*

For simplicity, I'll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) as *before-regression*, and to the 
commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) as 
*cause-regression*. All the test results shown below are from running the 
benchmarks locally on my laptop.

First, I logged the number of network buffers and the page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

The result shows that cause-regression has more network buffers, as expected.

*Why do we have more network buffers?* For the two commits discussed, the network 
buffer memory size is calculated in 
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}. 
Basically, this method calculates the network memory size with the following 
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
 heapAndManagedMemory = jvmFreeHeapMemory, if managed memory is on-heap
 heapAndManagedMemory = jvmFreeHeapMemory + managedMemory, if managed memory is off-heap
{quote}
Assuming we have the same {{jvmFreeHeapMemory}}, changing managed memory from 
on-heap to off-heap gives a larger network memory size, and thus a larger number 
of network buffers when the page size stays the same.

What goes against intuition is that with more network buffers we actually have 
worse performance. The only explanation I can think of is that our benchmarks 
include the cluster initialization time in the statistics, and with more network 
buffers we need more time to allocate those direct memory buffers.

To verify that, I explicitly configured 
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks and 
logged the time consumed by allocating all the buffers in the constructor of 
{{NetworkBufferPool}}. Here are the results; the suffixes '-small' / '-large' 
stand for setting the number of network buffers to 12945 / 32768, respectively.
|| ||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|twoInputMapSink score (ops / ms)|11025.369|10471.976|11511.51|10749.657|
|globalWindow (ops / ms)|4399.537|4063.925|4538.209|4020.957|
|avg. buffer allocation time (ms)|79.033|242.567|78.167|237.717|

The benchmark scores show that the larger number of network buffers indeed leads 
to the regression in the statistics. Digging further into the results, take 
{{arrayKeyBy}} in before-regression as an example: the total number of records is 
7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}) and the score for the 
small network memory size case is 1934 ops / ms, so the total execution time of 
one invocation of {{KeyByBenchmarks#arrayKeyBy}} with the small network buffer 
size is roughly 7,000,000 / 1934 = 3619 ms. Similarly, the total execution time 
with the large network buffer size is roughly 7,000,000 / 1835 = 3815 ms. The 
difference in total execution time between the small and large network buffer 
sizes is about 196 ms, which is very close to the difference in buffer allocation 
time (242 - 79 = 163 ms). If we also take the randomness into consideration, this 
basically explains where the benchmark regression comes from.

I still need to look into the performance of the later commit 
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just posting 
my findings so far.


was (Author: xintongsong):
Ok, I think I find something that might be the cause of the regression.

I’ll explain the details below, but maybe start from my conclusion: *With the 
questioned commit we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) with *before-regression*, and refer 
to the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
with *cause-regression*. All the test results shown below are from running the 
benchmarks on my laptop locally.

First, I tried to log out the number of network buffers and page sizes.
||before-regression||cause-regression||
|12945 

[GitHub] [flink] flinkbot edited a comment on issue #9768: [FLINK-14086][test][core] Add OperatingArchitecture Enum

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #9768: [FLINK-14086][test][core] Add 
OperatingArchitecture Enum
URL: https://github.com/apache/flink/pull/9768#issuecomment-534899022
 
 
   
   ## CI report:
   
   * 51ae12da735fd7fd1bdadd282839c9bfe7b80146 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/129056828) 
   * faa23214d94a2c1105fbadee358f32ed04e3693d Travis: 
[FAILURE](https://travis-ci.com/flink-ci/flink/builds/140207916) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3353)
 
   * 78a28f191158b78b9aed45f95051b7c398c2c124 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140366097) Azure: 
[PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3393)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[GitHub] [flink] flinkbot edited a comment on issue #10507: [FLINK-15146][CORE]Fix the value of that should be strictly greater than zero

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10507: [FLINK-15146][CORE]Fix the value of  
that should  be strictly greater than zero 
URL: https://github.com/apache/flink/pull/10507#issuecomment-563900578
 
 
   
   ## CI report:
   
   * 5e6a4e1ba03194e13928342ab5d77fa516e16b00 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140366087) Azure: 
[PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3392)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[GitHub] [flink] flinkbot edited a comment on issue #10506: [FLINK-15166][runtime] Fix that buffer is wrongly recycled when data compression is enabled for blocking shuffle.

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10506: [FLINK-15166][runtime] Fix that 
buffer is wrongly recycled when data compression is enabled for blocking 
shuffle.
URL: https://github.com/apache/flink/pull/10506#issuecomment-563900511
 
 
   
   ## CI report:
   
   * 4f2b4aa04f47518706f004f7702cf42d76b79f4e Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140366073) Azure: 
[PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3391)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[GitHub] [flink] flinkbot edited a comment on issue #10505: [FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10505: 
[FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID 
instead of String
URL: https://github.com/apache/flink/pull/10505#issuecomment-563892373
 
 
   
   ## CI report:
   
   * f742b7defe3d40ae00fb3e0adb66e4dc3ec20087 Travis: 
[FAILURE](https://travis-ci.com/flink-ci/flink/builds/140362846) Azure: 
[PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3390)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[jira] [Updated] (FLINK-12352) [FLIP-36] [Phase 1] Support cache() / invalidateCache() in Table with default ShuffleService and NetworkStack

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-12352:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change the fix version to 1.11.0 since we already reached feature freeze for 
1.10.0

> [FLIP-36] [Phase 1] Support cache() / invalidateCache() in Table with default 
> ShuffleService and NetworkStack
> -
>
> Key: FLINK-12352
> URL: https://issues.apache.org/jira/browse/FLINK-12352
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / API
>Reporter: Jiangjie Qin
>Priority: Major
> Fix For: 1.11.0
>
>
> The goals of this phase are following:
>  * cache and release intermediate result with shuffle service.
>  * benefit from locality of default shuffle service





[jira] [Commented] (FLINK-14392) Introduce JobClient API(FLIP-74)

2019-12-09 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992272#comment-16992272
 ] 

Zili Chen commented on FLINK-14392:
---

From my side the {{JobClient}} is full-featured and ready for use. The pending 
tickets are about {{JobListener}}, which satisfies the requirements of downstream 
projects such as Zeppelin.

For the release note, we might later check in a sample using the {{JobClient}} 
interface on our documentation site. Possibly we announce that {{executeAsync}} 
of the environments is an experimental interface that returns a {{JobClient}} for 
the user to communicate with the submitted job.
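
For such a sample, a minimal sketch of the intended usage could look like the 
following (based on FLIP-74; exact method names and signatures may still change):
{code:java}
import org.apache.flink.api.common.JobID;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JobClientExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();

        // Non-blocking submission: returns a JobClient for communicating
        // with the submitted job instead of waiting for the job result.
        JobClient jobClient = env.executeAsync("job-client-example");
        JobID jobId = jobClient.getJobID();
        jobClient.getJobStatus()
                .thenAccept(status -> System.out.println(jobId + ": " + status));
    }
}
{code}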

> Introduce JobClient API(FLIP-74)
> 
>
> Key: FLINK-14392
> URL: https://issues.apache.org/jira/browse/FLINK-14392
> Project: Flink
>  Issue Type: New Feature
>  Components: Client / Job Submission
>Reporter: Zili Chen
>Assignee: Zili Chen
>Priority: Major
> Fix For: 1.10.0
>
>
> This is the umbrella issue to track all efforts toward {{JobClient}} proposed 
> in 
> [FLIP-74|https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API].





[jira] [Comment Edited] (FLINK-13437) Add Hive SQL E2E test

2019-12-09 Thread Yu Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992269#comment-16992269
 ] 

Yu Li edited comment on FLINK-13437 at 12/10/19 7:40 AM:
-

Thanks for the confirmation and efforts [~ykt836] [~lzljs3620320] !

[~Terry1897] Please change the status to "In-Progress" accordingly, thanks.


was (Author: carp84):
Thanks for the confirmation and efforts [~ykt836] [~lzljs3620320] !

> Add Hive SQL E2E test
> -
>
> Key: FLINK-13437
> URL: https://issues.apache.org/jira/browse/FLINK-13437
> Project: Flink
>  Issue Type: Test
>  Components: Connectors / Hive, Tests
>Affects Versions: 1.9.0
>Reporter: Till Rohrmann
>Assignee: Terry Wang
>Priority: Major
> Fix For: 1.10.0
>
>
> We should add an E2E test for the Hive integration: List all tables and read 
> some metadata, read from an existing table, register a new table in Hive, use 
> a registered function, write to an existing table, write to a new table.





[jira] [Commented] (FLINK-13437) Add Hive SQL E2E test

2019-12-09 Thread Yu Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992269#comment-16992269
 ] 

Yu Li commented on FLINK-13437:
---

Thanks for the confirmation and efforts [~ykt836] [~lzljs3620320] !

> Add Hive SQL E2E test
> -
>
> Key: FLINK-13437
> URL: https://issues.apache.org/jira/browse/FLINK-13437
> Project: Flink
>  Issue Type: Test
>  Components: Connectors / Hive, Tests
>Affects Versions: 1.9.0
>Reporter: Till Rohrmann
>Assignee: Terry Wang
>Priority: Major
> Fix For: 1.10.0
>
>
> We should add an E2E test for the Hive integration: List all tables and read 
> some metadata, read from an existing table, register a new table in Hive, use 
> a registered function, write to an existing table, write to a new table.





[jira] [Updated] (FLINK-14359) Create a module called flink-sql-connector-hbase to shade HBase

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14359:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Create a module called flink-sql-connector-hbase to shade HBase
> ---
>
> Key: FLINK-14359
> URL: https://issues.apache.org/jira/browse/FLINK-14359
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / HBase
>Reporter: Jingsong Lee
>Assignee: Zheng Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need to do the same thing for HBase as we did for kafka and elasticsearch.





[jira] [Commented] (FLINK-13998) Fix ORC test failure with Hive 2.0.x

2019-12-09 Thread Jingsong Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992266#comment-16992266
 ] 

Jingsong Lee commented on FLINK-13998:
--

I will change the fix version to 1.11.

> Fix ORC test failure with Hive 2.0.x
> 
>
> Key: FLINK-13998
> URL: https://issues.apache.org/jira/browse/FLINK-13998
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hive
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Major
> Fix For: 1.11.0
>
>
> Our test is using the local file system, and ORC in Hive 2.0.x seems to have an 
> issue with that. 
> {code}
> 06:54:43.156 [ORC_GET_SPLITS #0] ERROR org.apache.hadoop.hive.ql.io.AcidUtils 
> - Failed to get files with ID; using regular API
> java.lang.UnsupportedOperationException: Only supported for DFS; got class 
> org.apache.hadoop.fs.LocalFileSystem
>   at 
> org.apache.hadoop.hive.shims.Hadoop23Shims.ensureDfs(Hadoop23Shims.java:813) 
> ~[hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.shims.Hadoop23Shims.listLocatedHdfsStatus(Hadoop23Shims.java:784)
>  ~[hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:477) 
> [hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:890)
>  [hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:875)
>  [hive-exec-2.0.0.jar:2.0.0]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_181]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_181]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_181]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_181]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_181]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
> {code}





[jira] [Updated] (FLINK-13998) Fix ORC test failure with Hive 2.0.x

2019-12-09 Thread Jingsong Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingsong Lee updated FLINK-13998:
-
Fix Version/s: (was: 1.10.0)
   1.11.0

> Fix ORC test failure with Hive 2.0.x
> 
>
> Key: FLINK-13998
> URL: https://issues.apache.org/jira/browse/FLINK-13998
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hive
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Major
> Fix For: 1.11.0
>
>
> Our test is using the local file system, and ORC in Hive 2.0.x seems to have an 
> issue with that. 
> {code}
> 06:54:43.156 [ORC_GET_SPLITS #0] ERROR org.apache.hadoop.hive.ql.io.AcidUtils 
> - Failed to get files with ID; using regular API
> java.lang.UnsupportedOperationException: Only supported for DFS; got class 
> org.apache.hadoop.fs.LocalFileSystem
>   at 
> org.apache.hadoop.hive.shims.Hadoop23Shims.ensureDfs(Hadoop23Shims.java:813) 
> ~[hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.shims.Hadoop23Shims.listLocatedHdfsStatus(Hadoop23Shims.java:784)
>  ~[hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:477) 
> [hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:890)
>  [hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:875)
>  [hive-exec-2.0.0.jar:2.0.0]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_181]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_181]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_181]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_181]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_181]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
> {code}





[jira] [Commented] (FLINK-13998) Fix ORC test failure with Hive 2.0.x

2019-12-09 Thread Jingsong Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992265#comment-16992265
 ] 

Jingsong Lee commented on FLINK-13998:
--

Since useOrcVectorizedRead has been introduced, the impact of this bug will be 
greatly reduced.

> Fix ORC test failure with Hive 2.0.x
> 
>
> Key: FLINK-13998
> URL: https://issues.apache.org/jira/browse/FLINK-13998
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hive
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Major
> Fix For: 1.10.0
>
>
> Our test is using the local file system, and ORC in Hive 2.0.x seems to have an 
> issue with that. 
> {code}
> 06:54:43.156 [ORC_GET_SPLITS #0] ERROR org.apache.hadoop.hive.ql.io.AcidUtils 
> - Failed to get files with ID; using regular API
> java.lang.UnsupportedOperationException: Only supported for DFS; got class 
> org.apache.hadoop.fs.LocalFileSystem
>   at 
> org.apache.hadoop.hive.shims.Hadoop23Shims.ensureDfs(Hadoop23Shims.java:813) 
> ~[hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.shims.Hadoop23Shims.listLocatedHdfsStatus(Hadoop23Shims.java:784)
>  ~[hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:477) 
> [hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:890)
>  [hive-exec-2.0.0.jar:2.0.0]
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:875)
>  [hive-exec-2.0.0.jar:2.0.0]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_181]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_181]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_181]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_181]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_181]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
> {code}





[jira] [Closed] (FLINK-13947) Check Hive shim serialization in Hive UDF wrapper classes and test coverage

2019-12-09 Thread Jingsong Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingsong Lee closed FLINK-13947.

Resolution: Fixed

> Check Hive shim serialization in Hive UDF wrapper classes and test coverage
> ---
>
> Key: FLINK-13947
> URL: https://issues.apache.org/jira/browse/FLINK-13947
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hive
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>
> Check 1. if HiveShim needs to be serializable and serialized in a few Hive 
> UDF wrapper classes such as HiveGenericUDF. 2. Make sure we have end-to-end 
> test coverage for Hive UDF usage.





[jira] [Commented] (FLINK-13947) Check Hive shim serialization in Hive UDF wrapper classes and test coverage

2019-12-09 Thread Jingsong Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992264#comment-16992264
 ] 

Jingsong Lee commented on FLINK-13947:
--

HiveShim is now serializable, and there are some tests covering HiveGenericUDF.

I will close this JIRA.

> Check Hive shim serialization in Hive UDF wrapper classes and test coverage
> ---
>
> Key: FLINK-13947
> URL: https://issues.apache.org/jira/browse/FLINK-13947
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hive
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>
> Check 1. if HiveShim needs to be serializable and serialized in a few Hive 
> UDF wrapper classes such as HiveGenericUDF. 2. Make sure we have end-to-end 
> test coverage for Hive UDF usage.





[GitHub] [flink] flinkbot edited a comment on issue #10458: [FLINK-14815][rest]Expose network metric in IOMetricsInfo

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10458: [FLINK-14815][rest]Expose network 
metric in IOMetricsInfo
URL: https://github.com/apache/flink/pull/10458#issuecomment-562482569
 
 
   
   ## CI report:
   
   * 303f921f58ff61aa3e332c36ffdc07d5f9ceb352 Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/139659908) 
   * 0122cc64d2c4a46ee8ba05ca67832a1589b024b1 Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/139667569) 
   * 2b332510e6d2cd2dee836de532139241c20ce00c Travis: 
[FAILURE](https://travis-ci.com/flink-ci/flink/builds/139677676) 
   * 30f1b29134a585e7d19c0de5e4ef000282a285b2 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140158853) 
   * 8a5207e384e995fb2f8929e5de3d87e89e801b3e Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/140199482) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3348)
 
   * 54049f914f9ac88570f882317e7f497ccb09587e Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140203351) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3350)
 
   * 308229677cea17a602f48cce2ce9e289a4ede30a Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/140345632) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3384)
 
   * 7aa5e9c2b36d3fde471ecf7d751c02a17f2b0ce4 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140348794) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3386)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[jira] [Assigned] (FLINK-13437) Add Hive SQL E2E test

2019-12-09 Thread Kurt Young (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kurt Young reassigned FLINK-13437:
--

Assignee: Terry Wang  (was: Jingsong Lee)

> Add Hive SQL E2E test
> -
>
> Key: FLINK-13437
> URL: https://issues.apache.org/jira/browse/FLINK-13437
> Project: Flink
>  Issue Type: Test
>  Components: Connectors / Hive, Tests
>Affects Versions: 1.9.0
>Reporter: Till Rohrmann
>Assignee: Terry Wang
>Priority: Major
> Fix For: 1.10.0
>
>
> We should add an E2E test for the Hive integration: List all tables and read 
> some metadata, read from an existing table, register a new table in Hive, use 
> a registered function, write to an existing table, write to a new table.





[jira] [Updated] (FLINK-15169) Errors happen in the scheduling of DefaultScheduler is not shown in WebUI

2019-12-09 Thread Gary Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-15169:
-
Priority: Blocker  (was: Major)

> Errors happen in the scheduling of DefaultScheduler is not shown in WebUI
> -
>
> Key: FLINK-15169
> URL: https://issues.apache.org/jira/browse/FLINK-15169
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Blocker
> Fix For: 1.10.0
>
>
> WebUI relies on {{ExecutionGraph#failureInfo}} and {{Execution#failureCause}} 
> to generate error info (via 
> {{JobExceptionsHandler#createJobExceptionsInfo}}). 
> Errors that happen in the scheduling of the DefaultScheduler are not recorded 
> into those fields, and thus cannot be shown to users in the WebUI (nor via REST queries).
> To solve it, 
> 1. global failures should be recorded into {{ExecutionGraph#failureInfo}}, 
> via {{ExecutionGraph#initFailureCause}} which can be exposed as 
> {{SchedulerBase#initFailureCause}}.
> 2. for task failures, one solution I can think of is to avoid invoking 
> {{DefaultScheduler#handleTaskFailure}} directly on scheduler's internal 
> failures. Instead, we can introduce 
> {{ExecutionVertexOperations#fail(ExecutionVertex)}} to hand the error to 
> {{ExecutionVertex}} as a common failure.
> cc [~gjy]





[jira] [Updated] (FLINK-15169) Errors happen in the scheduling of DefaultScheduler is not shown in WebUI

2019-12-09 Thread Zhu Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu updated FLINK-15169:

Description: 
WebUI relies on {{ExecutionGraph#failureInfo}} and {{Execution#failureCause}} 
to generate error info (via {{JobExceptionsHandler#createJobExceptionsInfo}}). 
Errors that happen in the scheduling of the DefaultScheduler are not recorded 
into those fields, and thus cannot be shown to users in the WebUI (nor via REST queries).

To solve it, 
1. global failures should be recorded into {{ExecutionGraph#failureInfo}}, via 
{{ExecutionGraph#initFailureCause}} which can be exposed as 
{{SchedulerBase#initFailureCause}}.
2. for task failures, one solution I can think of is to avoid invoking 
{{DefaultScheduler#handleTaskFailure}} directly on scheduler's internal 
failures. Instead, we can introduce 
{{ExecutionVertexOperations#fail(ExecutionVertex)}} to hand the error to 
{{ExecutionVertex}} as a common failure.

cc [~gjy]

  was:
WebUI relies on {{ExecutionGraph#failureInfo}} and {{Execution#failureCause}} 
to generate error info (vis {{JobExceptionsHandler#createJobExceptionsInfo}}). 
Errors happen in the scheduling of DefaultScheduler are not recorded into those 
fields, thus cannot be shown to users in WebUI (nor via REST queries).

To solve it, 
1. global failures should be recorded into {{ExecutionGraph#failureInfo}}, via 
{{ExecutionGraph#initFailureCause}} which can be exposed as 
{{SchedulerBase#initFailureCause}}.
2. for task failures, one solution I can think of is to avoid invoking 
{{DefaultScheduler#handleTaskFailure}} directly on scheduler's internal 
failures. Instead, we can introduce 
{{ExecutionVertexOperations#fail(ExecutionVertex)}} to hand the error to 
{{ExecutionVertex}} as a common failure.

cc [~gjy]


> Errors happen in the scheduling of DefaultScheduler is not shown in WebUI
> -
>
> Key: FLINK-15169
> URL: https://issues.apache.org/jira/browse/FLINK-15169
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Blocker
> Fix For: 1.10.0
>
>
> WebUI relies on {{ExecutionGraph#failureInfo}} and {{Execution#failureCause}} 
> to generate error info (via 
> {{JobExceptionsHandler#createJobExceptionsInfo}}). 
> Errors that happen in the scheduling of the DefaultScheduler are not recorded 
> into those fields, and thus cannot be shown to users in the WebUI (nor via REST queries).
> To solve it, 
> 1. global failures should be recorded into {{ExecutionGraph#failureInfo}}, 
> via {{ExecutionGraph#initFailureCause}} which can be exposed as 
> {{SchedulerBase#initFailureCause}}.
> 2. for task failures, one solution I can think of is to avoid invoking 
> {{DefaultScheduler#handleTaskFailure}} directly on scheduler's internal 
> failures. Instead, we can introduce 
> {{ExecutionVertexOperations#fail(ExecutionVertex)}} to hand the error to 
> {{ExecutionVertex}} as a common failure.
> cc [~gjy]





[jira] [Comment Edited] (FLINK-15103) Performance regression on 3.12.2019 in various benchmarks

2019-12-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992220#comment-16992220
 ] 

Xintong Song edited comment on FLINK-15103 at 12/10/19 7:23 AM:


OK, I think I have found something that might be the cause of the regression.

I'll explain the details below, but let me start with my conclusion: *With the 
commit in question we actually have more network buffers than before, which take 
more time to allocate when initializing the task executors, causing the 
regression.*

For simplicity, I'll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) as *before-regression*, and to the 
commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) as 
*cause-regression*. All the test results shown below are from running the 
benchmarks locally on my laptop.

First, I logged the number of network buffers and the page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

The result shows that cause-regression has more network buffers, as expected.

*Why do we have more network buffers?* For the two commits discussed, the network 
buffer memory size is calculated in 
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}. 
Basically, this method calculates the network memory size with the following 
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
 heapAndManagedMemory = jvmFreeHeapMemory, if managed memory is on-heap
 heapAndManagedMemory = jvmFreeHeapMemory + managedMemory, if managed memory is off-heap
{quote}
Assuming we have the same {{jvmFreeHeapMemory}}, changing managed memory from 
on-heap to off-heap gives a larger network memory size, and thus a larger number 
of network buffers when the page size stays the same.

What goes against intuition is that with more network buffers we actually have 
worse performance. The only explanation I can think of is that our benchmarks 
include the cluster initialization time in the statistics, and with more network 
buffers we need more time to allocate those direct memory buffers.

To verify that, I explicitly configured 
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks and 
logged the time consumed by allocating all the buffers in the constructor of 
{{NetworkBufferPool}}. Here are the results; the suffixes '-small' / '-large' 
stand for setting the number of network buffers to 12945 / 32768, respectively.
|| ||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|twoInputMapSink score (ops / ms)|11025.369|10471.976|11511.51|10749.657|
|avg. buffer allocation time (ms)|79.033|242.567|78.167|237.717|

The benchmark scores show that the larger number of network buffers indeed leads 
to the regression in the statistics. Digging further into the results, take 
{{arrayKeyBy}} in before-regression as an example: the total number of records is 
7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}) and the score for the 
small network memory size case is 1934 ops / ms, so the total execution time of 
one invocation of {{KeyByBenchmarks#arrayKeyBy}} with the small network buffer 
size is roughly 7,000,000 / 1934 = 3619 ms. Similarly, the total execution time 
with the large network buffer size is roughly 7,000,000 / 1835 = 3815 ms. The 
difference in total execution time between the small and large network buffer 
sizes is about 196 ms, which is very close to the difference in buffer allocation 
time (242 - 79 = 163 ms). If we also take the randomness into consideration, this 
basically explains where the benchmark regression comes from.

I still need to look into the other two cases (twoInputMapSink and globalWindow), 
as well as the performance of the later commit 
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just posting 
my findings so far.


was (Author: xintongsong):
Ok, I think I find something that might be the cause of the regression.

I’ll explain the details below, but maybe start from my conclusion: *With the 
questioned commit we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) with *before-regression*, and refer 
to the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
with *cause-regression*. All the test results shown below are from running the 
benchmarks on my laptop locally.

First, I tried to log out the number of network buffers and page sizes.
||before-regression||cause-regression||
|12945 

[jira] [Created] (FLINK-15169) Errors happen in the scheduling of DefaultScheduler is not shown in WebUI

2019-12-09 Thread Zhu Zhu (Jira)
Zhu Zhu created FLINK-15169:
---

 Summary: Errors happen in the scheduling of DefaultScheduler is 
not shown in WebUI
 Key: FLINK-15169
 URL: https://issues.apache.org/jira/browse/FLINK-15169
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Zhu Zhu
 Fix For: 1.10.0


WebUI relies on {{ExecutionGraph#failureInfo}} and {{Execution#failureCause}} 
to generate error info (via {{JobExceptionsHandler#createJobExceptionsInfo}}). 
Errors that happen in the scheduling of the DefaultScheduler are not recorded 
into those fields, and thus cannot be shown to users in the WebUI (nor via REST queries).

To solve it, 
1. global failures should be recorded into {{ExecutionGraph#failureInfo}}, via 
{{ExecutionGraph#initFailureCause}} which can be exposed as 
{{SchedulerBase#initFailureCause}}.
2. for task failures, one solution I can think of is to avoid invoking 
{{DefaultScheduler#handleTaskFailure}} directly on the scheduler's internal 
failures. Instead, we can introduce 
{{ExecutionVertexOperations#fail(ExecutionVertex)}} to hand the error to the 
{{ExecutionVertex}} as a common failure (see the sketch below).

cc [~gjy]
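
A minimal sketch of the interface change proposed in point 2 (hypothetical; the 
names follow the suggestion above and are not existing code):
{code:java}
import java.util.concurrent.CompletableFuture;

import org.apache.flink.runtime.JobException;
import org.apache.flink.runtime.executiongraph.ExecutionVertex;

interface ExecutionVertexOperations {

    void deploy(ExecutionVertex executionVertex) throws JobException;

    CompletableFuture<?> cancel(ExecutionVertex executionVertex);

    // Proposed addition: hand an internal scheduler error to the vertex as a
    // common task failure, so that it ends up in Execution#failureCause and
    // becomes visible in the WebUI, instead of invoking
    // DefaultScheduler#handleTaskFailure directly.
    void fail(ExecutionVertex executionVertex, Throwable cause);
}
{code}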





[jira] [Comment Edited] (FLINK-15103) Performance regression on 3.12.2019 in various benchmarks

2019-12-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992220#comment-16992220
 ] 

Xintong Song edited comment on FLINK-15103 at 12/10/19 7:19 AM:


OK, I think I have found something that might be the cause of the regression.

I'll explain the details below, but let me start with my conclusion: *With the 
commit in question we actually have more network buffers than before, which take 
more time to allocate when initializing the task executors, causing the 
regression.*

For simplicity, I'll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) as *before-regression*, and to the 
commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) as 
*cause-regression*. All the test results shown below are from running the 
benchmarks locally on my laptop.

First, I logged the number of network buffers and the page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

The result shows that cause-regression has more network buffers, as expected.

*Why do we have more network buffers?* For the two commits discussed, the network 
buffer memory size is calculated in 
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}. 
Basically, this method calculates the network memory size with the following 
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - networkFraction)
 heapAndManagedMemory = jvmFreeHeapMemory, if managed memory is on-heap
 heapAndManagedMemory = jvmFreeHeapMemory + managedMemory, if managed memory is off-heap
{quote}
Assuming we have the same {{jvmFreeHeapMemory}}, changing managed memory from 
on-heap to off-heap gives a larger network memory size, and thus a larger number 
of network buffers when the page size stays the same.

What goes against intuition is that with more network buffers we actually have 
worse performance. The only explanation I can think of is that our benchmarks 
include the cluster initialization time in the statistics, and with more network 
buffers we need more time to allocate those direct memory buffers.

To verify that, I explicitly configured 
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks and 
logged the time consumed by allocating all the buffers in the constructor of 
{{NetworkBufferPool}}. Here are the results; the suffixes '-small' / '-large' 
stand for setting the number of network buffers to 12945 / 32768, respectively.
|| ||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|avg. buffer allocation time (ms)|79.033|242.567|78.167|237.717|

The benchmark scores show that the larger number of network buffers indeed leads 
to the regression in the statistics. Digging further into the results, take 
{{arrayKeyBy}} in before-regression as an example: the total number of records is 
7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}) and the score for the 
small network memory size case is 1934 ops / ms, so the total execution time of 
one invocation of {{KeyByBenchmarks#arrayKeyBy}} with the small network buffer 
size is roughly 7,000,000 / 1934 = 3619 ms. Similarly, the total execution time 
with the large network buffer size is roughly 7,000,000 / 1835 = 3815 ms. The 
difference in total execution time between the small and large network buffer 
sizes is about 196 ms, which is very close to the difference in buffer allocation 
time (242 - 79 = 163 ms). If we also take the randomness into consideration, this 
basically explains where the benchmark regression comes from.

I still need to look into the other two cases (twoInputMapSink and globalWindow), 
as well as the performance of the later commit 
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just posting 
my findings so far.


was (Author: xintongsong):
Ok, I think I find something that might be the cause of the regression.

I’ll explain the details below, but maybe start from my conclusion: *With the 
questioned commit we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) with *before-regression*, and refer 
to the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
with *cause-regression*. All the test results shown below are from running the 
benchmarks on my laptop locally.

First, I tried to log out the number of network buffers and page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

The result shows that, 

[GitHub] [flink] flinkbot edited a comment on issue #9768: [FLINK-14086][test][core] Add OperatingArchitecture Enum

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #9768: [FLINK-14086][test][core] Add 
OperatingArchitecture Enum
URL: https://github.com/apache/flink/pull/9768#issuecomment-534899022
 
 
   
   ## CI report:
   
   * 51ae12da735fd7fd1bdadd282839c9bfe7b80146 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/129056828) 
   * faa23214d94a2c1105fbadee358f32ed04e3693d Travis: 
[FAILURE](https://travis-ci.com/flink-ci/flink/builds/140207916) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3353)
 
   * 78a28f191158b78b9aed45f95051b7c398c2c124 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[GitHub] [flink] flinkbot edited a comment on issue #10505: [FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10505: 
[FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID 
instead of String
URL: https://github.com/apache/flink/pull/10505#issuecomment-563892373
 
 
   
   ## CI report:
   
   * f742b7defe3d40ae00fb3e0adb66e4dc3ec20087 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140362846) Azure: 
[PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3390)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   




[GitHub] [flink] flinkbot commented on issue #10507: [FLINK-15146][CORE]Fix the value of that should be strictly greater than zero

2019-12-09 Thread GitBox
flinkbot commented on issue #10507: [FLINK-15146][CORE]Fix the value of  that 
should  be strictly greater than zero 
URL: https://github.com/apache/flink/pull/10507#issuecomment-563900578
 
 
   
   ## CI report:
   
   * 5e6a4e1ba03194e13928342ab5d77fa516e16b00 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot commented on issue #10506: [FLINK-15166][runtime] Fix that buffer is wrongly recycled when data compression is enabled for blocking shuffle.

2019-12-09 Thread GitBox
flinkbot commented on issue #10506: [FLINK-15166][runtime] Fix that buffer is 
wrongly recycled when data compression is enabled for blocking shuffle.
URL: https://github.com/apache/flink/pull/10506#issuecomment-563900511
 
 
   
   ## CI report:
   
   * 4f2b4aa04f47518706f004f7702cf42d76b79f4e UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot edited a comment on issue #10503: [FLINK-15137][avro] Improve schema derivation for Avro format

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10503: [FLINK-15137][avro] Improve schema 
derivation for Avro format
URL: https://github.com/apache/flink/pull/10503#issuecomment-563692047
 
 
   
   ## CI report:
   
   * 0a63ef8576f25cdec9fe106d7f69429fff6a4c7e Travis: 
[FAILURE](https://travis-ci.com/flink-ci/flink/builds/140348792) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3385)
 
   * 440383dbd0da6a722c19f0da3a5d7fe21a74bb7d Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140362832) Azure: 
[PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3389)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (FLINK-15168) Exception is thrown when using kafka source connector with flink planner

2019-12-09 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-15168:


 Summary: Exception is thrown when using kafka source connector 
with flink planner
 Key: FLINK-15168
 URL: https://issues.apache.org/jira/browse/FLINK-15168
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Legacy Planner
Affects Versions: 1.10.0
Reporter: Huang Xingbo


When running the following case using Kafka as the source connector with the 
legacy Flink planner, we get a RuntimeException:
{code:java}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);

StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

tEnv.connect(new Kafka()
        .version("0.11")
        .topic("user")
        .startFromEarliest()
        .property("zookeeper.connect", "localhost:2181")
        .property("bootstrap.servers", "localhost:9092"))
    .withFormat(new Json()
        .failOnMissingField(true)
        .jsonSchema("{" +
            "  type: 'object'," +
            "  properties: {" +
            "a: {" +
            "  type: 'string'" +
            "}," +
            "b: {" +
            "  type: 'string'" +
            "}," +
            "c: {" +
            "  type: 'string'" +
            "}," +
            "time: {" +
            "  type: 'string'," +
            "  format: 'date-time'" +
            "}" +
            "  }" +
            "}"))
    .withSchema(new Schema()
        .field("rowtime", Types.SQL_TIMESTAMP)
        .rowtime(new Rowtime()
            .timestampsFromField("time")
            .watermarksPeriodicBounded(6))
        .field("a", Types.STRING)
        .field("b", Types.STRING)
        .field("c", Types.STRING))
    .inAppendMode()
    .registerTableSource("source");

Table t = tEnv.scan("source").select("a");
tEnv.toAppendStream(t, Row.class).print();
tEnv.execute("test");
{code}
The RuntimeException detail:
{code:java}
Exception in thread "main" java.lang.RuntimeException: Error while applying rule PushProjectIntoTableSourceScanRule, args [rel#26:FlinkLogicalCalc.LOGICAL(input=RelSubset#25,expr#0..3={inputs},a=$t1), Scan(table:[default_catalog, default_database, source], fields:(rowtime, a, b, c), source:Kafka011TableSource(rowtime, a, b, c))]
	at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:235)
	at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:631)
	at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:327)
	at org.apache.flink.table.plan.Optimizer.runVolcanoPlanner(Optimizer.scala:280)
	at org.apache.flink.table.plan.Optimizer.optimizeLogicalPlan(Optimizer.scala:199)
	at org.apache.flink.table.plan.StreamOptimizer.optimize(StreamOptimizer.scala:66)
	at org.apache.flink.table.planner.StreamPlanner.translateToType(StreamPlanner.scala:389)
	at org.apache.flink.table.planner.StreamPlanner.org$apache$flink$table$planner$StreamPlanner$$translate(StreamPlanner.scala:180)
	at org.apache.flink.table.planner.StreamPlanner$$anonfun$translate$1.apply(StreamPlanner.scala:117)
	at org.apache.flink.table.planner.StreamPlanner$$anonfun$translate$1.apply(StreamPlanner.scala:117)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.flink.table.planner.StreamPlanner.translate(StreamPlanner.scala:117)
	at org.apache.flink.table.api.java.internal.StreamTableEnvironmentImpl.toDataStream(StreamTableEnvironmentImpl.java:351)
	at org.apache.flink.table.api.java.internal.StreamTableEnvironmentImpl.toAppendStream(StreamTableEnvironmentImpl.java:259)
	at 
{code}

[GitHub] [flink] flinkbot edited a comment on issue #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10496: [FLINK-15153]Service selector needs 
to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#issuecomment-563198013
 
 
   
   ## CI report:
   
   * 793df2e323b4f84433f99f21515b30a21524658c Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/140221508) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3359)
 
   * e66e8431fedb094be49d3d6748ad0452cd6ebc4c Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140233384) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3361)
 
   * 07ac32a4c8b347370794666af2d8b252c9cfc5f4 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140362804) Azure: 
[PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3388)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (FLINK-14378) Cleanup rocksDB lib folder if fail to load library

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li resolved FLINK-14378.
---
Resolution: Fixed

Merged via: 0d9058c56e4f420d1c501e5aef981433a96650dd
(Thanks [~sewen] for helping to review and merge this)

> Cleanup rocksDB lib folder if fail to load library
> --
>
> Key: FLINK-14378
> URL: https://issues.apache.org/jira/browse/FLINK-14378
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This improvement is inspired by the fact that some of our machines need some time 
> to load the RocksDB library. When other unrecoverable exceptions keep happening, 
> the process of loading the library can be interrupted, which leaves the 
> {{rocksdb-lib}} folder created but not cleaned up. As the job continues to 
> failover, more and more {{rocksdb-lib}} folders are created. We even came across 
> a machine that was running out of inodes!
> Details could refer to current 
> [implementation|https://github.com/apache/flink/blob/80b27a150026b7b5cb707bd9fa3e17f565bb8112/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java#L860]
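
For illustration, a minimal sketch of the cleanup-on-failure idea (the helper shape 
is an assumption; the actual change lives in RocksDBStateBackend's native-library 
loading code):

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.UUID;

import org.apache.flink.util.FileUtils;
import org.rocksdb.NativeLibraryLoader;

// Hypothetical sketch: if loading the native library fails, delete the temporary
// rocksdb-lib directory instead of leaking one folder per failover attempt.
static void loadRocksDbLibrary(File tempDirParent) throws IOException {
	File rocksLibFolder = new File(tempDirParent, "rocksdb-lib-" + UUID.randomUUID());
	try {
		rocksLibFolder.mkdirs();
		NativeLibraryLoader.getInstance().loadLibrary(rocksLibFolder.getAbsolutePath());
	} catch (Throwable t) {
		// clean up eagerly so repeated failovers cannot run the machine out of inodes
		FileUtils.deleteDirectoryQuietly(rocksLibFolder);
		throw new IOException("Could not load the native RocksDB library.", t);
	}
}
{code}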



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] wangyang0918 commented on issue #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
wangyang0918 commented on issue #10496: [FLINK-15153]Service selector needs to 
contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#issuecomment-563899206
 
 
   @zhijiangW Thanks for your comments.
   1. The reason why I changed the description of `KubernetesCliOptions` is that 
the description was not shown correctly.
   2. I have updated the existing tests to verify that the service selector labels 
contain the jobmanager component label. Without the changes to 
`ServiceDecorator.java`, the updated tests will fail.
   3. I will add a more detailed description to the hotfix commit.
   
   
   ```
 Options for kubernetes-cluster mode:
 -kid,--kubernetesclusterId    org.apache.flink.configuration.description.Description@166fa74d
   ```
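
   (The `ClassName@hashCode` form is the default `Object#toString()` output, which 
suggests `Description` does not override `toString()`; rendering the description 
text explicitly avoids this.)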


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (FLINK-15133) support sql_alchemy in Flink for broader python sql integration

2019-12-09 Thread sunjincheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992252#comment-16992252
 ] 

sunjincheng edited comment on FLINK-15133 at 12/10/19 7:09 AM:
---

I think this is a tool library for Python users who want to query a DB via 
objects; i.e., SQLAlchemy is an ORM tool, useful for Python users working with 
common databases, similar to Hibernate or MyBatis for Java users. I don't see the 
benefit of integrating this with PyFlink. Could you please share more of your 
thoughts? :)


was (Author: sunjincheng121):
I think this is tool lib for python user who want query DB by using Object. 
i.n., sqlalchemy is ORM tool, it is useful for the python user who want using 
the common DB, Similar as Hibernate,MyBatis for Java user. I don't see the 
benefit of integrating this with PyFlink, Could you please share more your 
thoughts :)

> support sql_alchemy in Flink for broader python sql integration
> ---
>
> Key: FLINK-15133
> URL: https://issues.apache.org/jira/browse/FLINK-15133
> Project: Flink
>  Issue Type: New Feature
>  Components: API / Python, Table SQL / Ecosystem
>Reporter: Bowen Li
>Priority: Major
>
> sql_alchemy is the standard interface for python sql ecosystem
> examples of integrations requiring sql_alchemy:
> - https://github.com/cloudera/hue
> - https://github.com/lyft/amundsen



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14351) Refactor MetricRegistry delimiter retrieval into separate interface

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14351:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Refactor MetricRegistry delimiter retrieval into separate interface
> ---
>
> Key: FLINK-14351
> URL: https://issues.apache.org/jira/browse/FLINK-14351
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The MetricRegistry offers a few methods for retrieving configured delimiters, 
> which are used a fair bit during scope operations; however other methods 
> aren't being used in these contexts.
> Hence we could reduce access and simplify testing by introducing a dedicated 
> interface for these methods that the registry extends.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14820) Replace Time with Duration in FutureUtils

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14820:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Replace Time with Duration in FutureUtils
> -
>
> Key: FLINK-14820
> URL: https://issues.apache.org/jira/browse/FLINK-14820
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15133) support sql_alchemy in Flink for broader python sql integration

2019-12-09 Thread sunjincheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992252#comment-16992252
 ] 

sunjincheng commented on FLINK-15133:
-

I think this is a tool library for Python users who want to query a DB via 
objects; i.e., SQLAlchemy is an ORM tool, useful for Python users working with 
common databases, similar to Hibernate or MyBatis for Java users. I don't see the 
benefit of integrating this with PyFlink. Could you please share more of your 
thoughts? :)

> support sql_alchemy in Flink for broader python sql integration
> ---
>
> Key: FLINK-15133
> URL: https://issues.apache.org/jira/browse/FLINK-15133
> Project: Flink
>  Issue Type: New Feature
>  Components: API / Python, Table SQL / Ecosystem
>Reporter: Bowen Li
>Priority: Major
>
> sql_alchemy is the standard interface for python sql ecosystem
> examples of integrations requiring sql_alchemy:
> - https://github.com/cloudera/hue
> - https://github.com/lyft/amundsen



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13437) Add Hive SQL E2E test

2019-12-09 Thread Jingsong Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992250#comment-16992250
 ] 

Jingsong Lee commented on FLINK-13437:
--

PR: [https://github.com/apache/flink/pull/9919]

[~Terry1897] will continue to develop it, [~ykt836] can you assign this ticket 
to him?

> Add Hive SQL E2E test
> -
>
> Key: FLINK-13437
> URL: https://issues.apache.org/jira/browse/FLINK-13437
> Project: Flink
>  Issue Type: Test
>  Components: Connectors / Hive, Tests
>Affects Versions: 1.9.0
>Reporter: Till Rohrmann
>Assignee: Jingsong Lee
>Priority: Major
> Fix For: 1.10.0
>
>
> We should add an E2E test for the Hive integration: List all tables and read 
> some metadata, read from an existing table, register a new table in Hive, use 
> a registered function, write to an existing table, write to a new table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-13390) Clarify the exact meaning of state size when executing incremental checkpoint

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-13390:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Thanks for following this up from user ML. However, since we already reached 
feature freeze for 1.10.0, change the fix version to 1.11.0.

> Clarify the exact meaning of state size when executing incremental checkpoint
> -
>
> Key: FLINK-13390
> URL: https://issues.apache.org/jira/browse/FLINK-13390
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue is inspired by [a user 
> mail|https://lists.apache.org/thread.html/56069ce869afbfca66179e89788c05d3b092e3fe363f3540dcdeb7a1@%3Cuser.flink.apache.org%3E]
>  which was confused about the meaning of state size.
> I think changing the description of state size and adding some notices to the 
> documentation could help. Moreover, changing the log message when a checkpoint 
> completes should also be taken into account.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on issue #10458: [FLINK-14815][rest]Expose network metric in IOMetricsInfo

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10458: [FLINK-14815][rest]Expose network 
metric in IOMetricsInfo
URL: https://github.com/apache/flink/pull/10458#issuecomment-562482569
 
 
   
   ## CI report:
   
   * 303f921f58ff61aa3e332c36ffdc07d5f9ceb352 Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/139659908) 
   * 0122cc64d2c4a46ee8ba05ca67832a1589b024b1 Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/139667569) 
   * 2b332510e6d2cd2dee836de532139241c20ce00c Travis: 
[FAILURE](https://travis-ci.com/flink-ci/flink/builds/139677676) 
   * 30f1b29134a585e7d19c0de5e4ef000282a285b2 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140158853) 
   * 8a5207e384e995fb2f8929e5de3d87e89e801b3e Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/140199482) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3348)
 
   * 54049f914f9ac88570f882317e7f497ccb09587e Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140203351) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3350)
 
   * 308229677cea17a602f48cce2ce9e289a4ede30a Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/140345632) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3384)
 
   * 7aa5e9c2b36d3fde471ecf7d751c02a17f2b0ce4 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140348794) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3386)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355871124
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/cli/KubernetesCliOptions.java
 ##
 @@ -46,15 +44,16 @@ public static Option getOptionWithPrefix(Option option, 
String shortPrefix, Stri
.longOpt("clusterId")
.required(false)
.hasArg(true)
-   
.desc(KubernetesConfigOptions.CLUSTER_ID.description().toString())
+   .desc("The cluster id that will be used for flink cluster. If 
it's not set, the client will generate " +
+   "a random UUID name.")
.build();
 
public static final Option IMAGE_OPTION = Option.builder("i")
.longOpt("image")
.required(false)
.hasArg(true)
.argName("image-name")
-   
.desc(KubernetesConfigOptions.CONTAINER_IMAGE.description().toString())
+   .desc("Image to use for Flink containers")
 
 Review comment:
   `KubernetesConfigOptions.CONTAINER_IMAGE.description().toString()` will get 
`org.apache.flink.configuration.description.Description@166fa74d`, not the real 
description.
   
   The cli option description may differ from the config option's, so I just put 
the real description text here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot commented on issue #10507: [FLINK-15146][CORE]Fix the value of cleanupSize that should be strictly greater than zero

2019-12-09 Thread GitBox
flinkbot commented on issue #10507: [FLINK-15146][CORE]Fix the value of 
cleanupSize that should be strictly greater than zero
URL: https://github.com/apache/flink/pull/10507#issuecomment-563895954
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit 5e6a4e1ba03194e13928342ab5d77fa516e16b00 (Tue Dec 10 
07:03:48 UTC 2019)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
* **This pull request references an unassigned [Jira 
ticket](https://issues.apache.org/jira/browse/FLINK-15146).** According to the 
[code contribution 
guide](https://flink.apache.org/contributing/contribute-code.html), tickets 
need to be assigned before starting with the implementation work.
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.

   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355870735
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/cli/KubernetesCliOptions.java
 ##
 @@ -46,15 +44,16 @@ public static Option getOptionWithPrefix(Option option, 
String shortPrefix, Stri
.longOpt("clusterId")
.required(false)
.hasArg(true)
-   
.desc(KubernetesConfigOptions.CLUSTER_ID.description().toString())
+   .desc("The cluster id that will be used for flink cluster. If 
it's not set, the client will generate " +
 
 Review comment:
   Makes sense to me.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355870584
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/ServiceDecorator.java
 ##
 @@ -89,7 +90,13 @@ protected Service decorateInternalResource(Service 
resource, Configuration flink
}
 
spec.setPorts(servicePorts);
-   spec.setSelector(resource.getMetadata().getLabels());
+
+   final Map labels = new LabelBuilder()
 
 Review comment:
   I have updated the tests `testCreateInternalService` and 
`testCreateRestService` to verify the expected selector labels; the jobmanager 
component label should exist.
   
   If you revert the change of `ServiceDecorator`, the updated tests will fail.
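
   For context, a minimal sketch of what the fixed selector construction could 
look like. The builder method names (`withExist`, `withJobManagerComponent`, 
`toLabels`) are assumed from the truncated diff above and not verified:

   ```java
   // Hypothetical: build the selector from the existing resource labels plus the
   // jobmanager component label, so the service selects only jobmanager pods and
   // never taskmanager pods that share the cluster-id label.
   final Map<String, String> labels = new LabelBuilder()
       .withExist(resource.getMetadata().getLabels())
       .withJobManagerComponent()
       .toLabels();
   spec.setSelector(labels);
   ```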


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] hehuiyuan opened a new pull request #10507: [FLINK-15146][CORE]Fix the value of cleanupSize that should be strictly greater than zero

2019-12-09 Thread GitBox
hehuiyuan opened a new pull request #10507: [FLINK-15146][CORE]Fix the value of 
cleanupSize that should be strictly greater than zero
URL: https://github.com/apache/flink/pull/10507
 
 
   ## What is the purpose of the change
   Hi, currently the value of cleanupSize is only required to be greater than or 
equal to 0. Requiring the value to be strictly greater than 0 is more practical.
   
![image](https://user-images.githubusercontent.com/18002496/70502733-98a34700-1b5c-11ea-8b68-88b8ac4f1c80.png)
   
![image](https://user-images.githubusercontent.com/18002496/70502745-9fca5500-1b5c-11ea-89fd-a1bc2486cc1c.png)
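
   For illustration, here is a minimal sketch of the stricter check, assuming it 
lands in the constructor of `StateTtlConfig.IncrementalCleanupStrategy` (the exact 
placement and message are assumptions, not the final patch):

   ```java
   import org.apache.flink.util.Preconditions;

   // Hypothetical sketch of the proposed precondition: a cleanupSize of 0 would
   // make incremental cleanup a no-op, so require a strictly positive value.
   static void checkCleanupSize(int cleanupSize) {
       Preconditions.checkArgument(
           cleanupSize > 0,
           "Number of incrementally cleaned up state entries should be strictly greater than zero.");
   }
   ```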
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-15146) The value of `cleanupSize` should be greater than 0 for `IncrementalCleanupStrategy`

2019-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-15146:
---
Labels: pull-request-available  (was: )

> The value of `cleanupSize` should be greater than 0 for 
> `IncrementalCleanupStrategy`
> ---
>
> Key: FLINK-15146
> URL: https://issues.apache.org/jira/browse/FLINK-15146
> Project: Flink
>  Issue Type: Wish
>Reporter: hehuiyuan
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2019-12-09-17-03-59-014.png, 
> image-2019-12-09-17-09-18-062.png
>
>
>  
> Hi, currently the value of cleanupSize is only required to be greater than or 
> equal to 0. Requiring the value to be strictly greater than 0 is more practical.
> !image-2019-12-09-17-03-59-014.png|width=615,height=108!
> !image-2019-12-09-17-09-18-062.png|width=491,height=309!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14537) Improve influxdb reporter performance for kafka source/sink

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14537:
--
Fix Version/s: (was: 1.10.0)

Remove 1.10.0 from the fix version(s) since we already reached feature freeze.

> Improve influxdb reporter performance for kafka source/sink
> ---
>
> Key: FLINK-14537
> URL: https://issues.apache.org/jira/browse/FLINK-14537
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.9.1
>Reporter: ouyangwulin
>Priority: Minor
> Fix For: 1.9.2, 1.11.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> In our production environment, our data sources mostly come from Kafka, and the 
> InfluxDB reporter uses the Kafka topic and partition to create InfluxDB 
> measurements, which results in 13531 measurements in InfluxDB. This makes it 
> troublesome to find the measurements we want when querying metric data, and it 
> hurts performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355869450
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesClusterDescriptor.java
 ##
 @@ -108,17 +101,31 @@ public String getClusterDescription() {
 
@Override
public ClusterClientProvider retrieve(String clusterId) {
-   return createClusterClientProvider(clusterId);
+   final ClusterClientProvider clusterClientProvider = 
createClusterClientProvider(clusterId);
+
+   try (ClusterClient clusterClient = 
clusterClientProvider.getClusterClient()) {
+   LOG.info(
+   "Retrieve flink cluster {} successfully, 
JobManager Web Interface: {}",
+   clusterId,
+   clusterClient.getWebInterfaceURL());
+   }
+   return clusterClientProvider;
}
 
@Override
public ClusterClientProvider 
deploySessionCluster(ClusterSpecification clusterSpecification) throws 
ClusterDeploymentException {
-   final ClusterClientProvider clusterClient = 
deployClusterInternal(
+   final ClusterClientProvider clusterClientProvider = 
deployClusterInternal(
KubernetesSessionClusterEntrypoint.class.getName(),
clusterSpecification,
false);
 
-   return clusterClient;
+   try (ClusterClient clusterClient = 
clusterClientProvider.getClusterClient()) {
 
 Review comment:
   If we print the log in `createClusterClientProvider`, then the log will show 
whenever we call `clusterClientProvider.getClusterClient()`. When we create a new 
cluster, the log would show twice, which is confusing. So I want to separate the 
logs for creating and retrieving.
   
   The log changed when `ClusterClientProvider` was introduced.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-14863) remove default in-memory catalog from CatalogManager

2019-12-09 Thread Bowen Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bowen Li updated FLINK-14863:
-
Fix Version/s: (was: 1.11.0)

> remove default in-memory catalog from CatalogManager
> 
>
> Key: FLINK-14863
> URL: https://issues.apache.org/jira/browse/FLINK-14863
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / API
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
wangyang0918 commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355869102
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesClusterDescriptor.java
 ##
 @@ -108,17 +101,31 @@ public String getClusterDescription() {
 
@Override
public ClusterClientProvider retrieve(String clusterId) {
-   return createClusterClientProvider(clusterId);
+   final ClusterClientProvider clusterClientProvider = 
createClusterClientProvider(clusterId);
+
+   try (ClusterClient clusterClient = 
clusterClientProvider.getClusterClient()) {
 
 Review comment:
   We do not change any behavior here. We just get the real cluster client from 
`ClusterClientProvider` for logging and then close it. The 
`ClusterClientProvider` does not need to be closed; the `ClusterClient` returned 
by `ClusterClientProvider.getClusterClient()` does.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-14046) DDL property 'format.fields.#.type' should ignore case

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14046:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change the fix version to 1.11.0 to facilitate issue tracking for 1.10.0 (after 
feature freeze); please feel free to close it later if consensus is reached, 
[~jark]. Thanks.

> DDL property 'format.fields.#.type' should ignore case 
> ---
>
> Key: FLINK-14046
> URL: https://issues.apache.org/jira/browse/FLINK-14046
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / API
>Affects Versions: 1.9.0
>Reporter: hailong wang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When define DDL as follow:
> {code:java}
> create table RubberOrders(
>   b int
> ) with (
>  'connector.type' = 'filesystem',
>  'format.type' = 'csv',                        
>  'connector.path' = '##',
>  'format.fields.0.name' = 'b',
>  'format.fields.0.type' = 'int'
> )
> {code}
> It has an exception:
> {code:java}
> Could not parse type information at position 0: Unsupported type: int
> Input type string: int 
> at 
> org.apache.flink.table.utils.TypeStringUtils$TokenConverter.parsingError(TypeStringUtils.java:491)
>  at 
> org.apache.flink.table.utils.TypeStringUtils$TokenConverter.convertType(TypeStringUtils.java:318)
>  at 
> org.apache.flink.table.utils.TypeStringUtils$TokenConverter.convert(TypeStringUtils.java:261)
>  at 
> org.apache.flink.table.utils.TypeStringUtils.readTypeInfo(TypeStringUtils.java:169)
>  at 
> org.apache.flink.table.descriptors.DescriptorProperties.lambda$validateType$32(DescriptorProperties.java:1149)
>  at 
> org.apache.flink.table.descriptors.DescriptorProperties.validateOptional(DescriptorProperties.java:1357)
>  at 
> org.apache.flink.table.descriptors.DescriptorProperties.validateType(DescriptorProperties.java:1143)
>  at 
> org.apache.flink.table.descriptors.DescriptorProperties.lambda$validateTableSchema$26(DescriptorProperties.java:1011)
>  at 
> org.apache.flink.table.descriptors.DescriptorProperties.validateFixedIndexedProperties(DescriptorProperties.java:998)
>  at 
> org.apache.flink.table.descriptors.DescriptorProperties.validateTableSchema(DescriptorProperties.java:1017)
> {code}
> The reason is that we do not ignore case in the TypeStringUtils.convertType method.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355869144
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/ServiceDecorator.java
 ##
 @@ -89,7 +90,13 @@ protected Service decorateInternalResource(Service 
resource, Configuration flink
}
 
spec.setPorts(servicePorts);
-   spec.setSelector(resource.getMetadata().getLabels());
+
+   final Map labels = new LabelBuilder()
 
 Review comment:
   `decorateInternalResource` is only used for the job manager?
   I think we should add a unit test covering the changes if it is really a 
bug. I mean, before the changes the test should fail, and after this change it 
should succeed. Otherwise we cannot measure what the difference is.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-12302:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

It seems we haven't reached a consensus here yet, and since we already reached 
feature freeze for 1.10.0, change the fix version to 1.11.0.

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>Reporter: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> ResourceManager of YARN reran the job.
> When the job failed again, the application finished, but the finalStatus 
> is +UNDEFINED+. It would be better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-12273) The default value of CheckpointRetentionPolicy should be RETAIN_ON_FAILURE, otherwise it cannot be recovered according to checkpoint after failure.

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-12273:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

From the discussion in the related PR, we haven't reached consensus on whether to 
apply this change.

Since we already reached feature freeze for 1.10.0, change the fix version to 
1.11.0 for further discussion.

> The default value of CheckpointRetentionPolicy should be RETAIN_ON_FAILURE, 
> otherwise it cannot be recovered according to checkpoint after failure.
> ---
>
> Key: FLINK-12273
> URL: https://issues.apache.org/jira/browse/FLINK-12273
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.6.3, 1.6.4, 1.7.2, 1.8.0
>Reporter: Mr.Nineteen
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The default value of CheckpointRetentionPolicy should be RETAIN_ON_FAILURE; 
> otherwise the job cannot be recovered from its checkpoints after a failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355867648
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/cli/KubernetesCliOptions.java
 ##
 @@ -46,15 +44,16 @@ public static Option getOptionWithPrefix(Option option, 
String shortPrefix, Stri
.longOpt("clusterId")
.required(false)
.hasArg(true)
-   
.desc(KubernetesConfigOptions.CLUSTER_ID.description().toString())
+   .desc("The cluster id that will be used for flink cluster. If 
it's not set, the client will generate " +
+   "a random UUID name.")
.build();
 
public static final Option IMAGE_OPTION = Option.builder("i")
.longOpt("image")
.required(false)
.hasArg(true)
.argName("image-name")
-   
.desc(KubernetesConfigOptions.CONTAINER_IMAGE.description().toString())
+   .desc("Image to use for Flink containers")
 
 Review comment:
   Also I am not sure why we make this change.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot commented on issue #10505: [FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String

2019-12-09 Thread GitBox
flinkbot commented on issue #10505: 
[FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID 
instead of String
URL: https://github.com/apache/flink/pull/10505#issuecomment-563892373
 
 
   
   ## CI report:
   
   * f742b7defe3d40ae00fb3e0adb66e4dc3ec20087 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot edited a comment on issue #10503: [FLINK-15137][avro] Improve schema derivation for Avro format

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10503: [FLINK-15137][avro] Improve schema 
derivation for Avro format
URL: https://github.com/apache/flink/pull/10503#issuecomment-563692047
 
 
   
   ## CI report:
   
   * 0a63ef8576f25cdec9fe106d7f69429fff6a4c7e Travis: 
[FAILURE](https://travis-ci.com/flink-ci/flink/builds/140348792) Azure: 
[FAILURE](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3385)
 
   * 440383dbd0da6a722c19f0da3a5d7fe21a74bb7d UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-15167) SQL CLI library option doesn't work for Hive connector

2019-12-09 Thread Jingsong Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingsong Lee updated FLINK-15167:
-
Fix Version/s: 1.10.0

> SQL CLI library option doesn't work for Hive connector
> --
>
> Key: FLINK-15167
> URL: https://issues.apache.org/jira/browse/FLINK-15167
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hive, Table SQL / Client
>Reporter: Rui Li
>Priority: Major
> Fix For: 1.10.0
>
>
> Put all Hive connector dependency jars in a folder and start sql cli like: 
> {{sql-client.sh embedded -l }}. Hit the following exception:
> {noformat}
>   at 
> org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:186)
>   at 
> org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:102)
>   at 
> org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:233)
>   at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.initializeCatalogs(ExecutionContext.java:519)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.initializeTableEnvironment(ExecutionContext.java:463)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.(ExecutionContext.java:156)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.(ExecutionContext.java:115)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:724)
>   ... 3 more
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.table.catalog.hive.client.HiveShimV230.getHiveMetastoreClient(HiveShimV230.java:55)
>   ... 15 more
> Caused by: 
> MetaException(message:org.apache.hadoop.hive.metastore.HiveMetaStoreClient 
> class not found)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getClass(MetaStoreUtils.java:1676)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:131)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:89)
>   ... 20 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot commented on issue #10506: [FLINK-15166][runtime] Fix that buffer is wrongly recycled when data compression is enabled for blocking shuffle.

2019-12-09 Thread GitBox
flinkbot commented on issue #10506: [FLINK-15166][runtime] Fix that buffer is 
wrongly recycled when data compression is enabled for blocking shuffle.
URL: https://github.com/apache/flink/pull/10506#issuecomment-563891883
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit 4f2b4aa04f47518706f004f7702cf42d76b79f4e (Tue Dec 10 
06:50:04 UTC 2019)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
* **This pull request references an unassigned [Jira 
ticket](https://issues.apache.org/jira/browse/FLINK-15166).** According to the 
[code contribution 
guide](https://flink.apache.org/contributing/contribute-code.html), tickets 
need to be assigned before starting with the implementation work.
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.

   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot edited a comment on issue #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10496: [FLINK-15153]Service selector needs 
to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#issuecomment-563198013
 
 
   
   ## CI report:
   
   * 793df2e323b4f84433f99f21515b30a21524658c Travis: 
[CANCELED](https://travis-ci.com/flink-ci/flink/builds/140221508) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3359)
 
   * e66e8431fedb094be49d3d6748ad0452cd6ebc4c Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140233384) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3361)
 
   * 07ac32a4c8b347370794666af2d8b252c9cfc5f4 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355866830
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/cli/KubernetesCliOptions.java
 ##
 @@ -46,15 +44,16 @@ public static Option getOptionWithPrefix(Option option, 
String shortPrefix, Stri
.longOpt("clusterId")
.required(false)
.hasArg(true)
-   
.desc(KubernetesConfigOptions.CLUSTER_ID.description().toString())
+   .desc("The cluster id that will be used for flink cluster. If 
it's not set, the client will generate " +
 
 Review comment:
   nit: remove `that`, 
   `used for identifying the unique flink cluster`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-15166) Shuffle data compression wrongly decreases the buffer reference count.

2019-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-15166:
---
Labels: pull-request-available  (was: )

> Shuffle data compression wrongly decreases the buffer reference count.
> -
>
> Key: FLINK-15166
> URL: https://issues.apache.org/jira/browse/FLINK-15166
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
>Reporter: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>
> FLINK-15140 reports two relevant problems which are both triggered by the 
> broadcast partitioner; to make it clearer, I created this Jira to address 
> the problems separately.
>  
> For blocking shuffle compression, we recycle the compressed intermediate 
> buffer each time after we write data out. However, when the data is not 
> compressed, the returned buffer is the original buffer and should not be 
> recycled, but we wrongly recycle it.
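
For illustration, a minimal sketch of the intended fix. This is hedged: `writer` 
and its `writeBuffer` call are hypothetical stand-ins for the blocking partition 
write path; the compressor call follows Flink's `BufferCompressor`:

{code:java}
// Recycle the buffer returned by the compressor only when compression really
// produced a new intermediate buffer. If compression was skipped, the returned
// buffer is the original one and its ownership stays with the caller.
Buffer toWrite = bufferCompressor.compressToIntermediateBuffer(buffer);
writer.writeBuffer(toWrite);
if (toWrite != buffer) {
	toWrite.recycleBuffer();
}
{code}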



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15066) Cannot run multiple `insert into csvTable values ()`

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-15066:
--
Issue Type: Bug  (was: Improvement)

Correct me if I'm wrong, but this seems more like a bug to fix instead of 
something to improve. Thanks.

> Cannot run multiple `insert into csvTable values ()`
> 
>
> Key: FLINK-15066
> URL: https://issues.apache.org/jira/browse/FLINK-15066
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Client
>Reporter: Kurt Young
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.10.0
>
>
> I created a csv table in sql client, and tried to insert some data into this 
> table.
> The first insert succeeded, but the second one failed with an exception: 
> {code:java}
> // Caused by: java.io.IOException: File or directory /.../xxx.csv already 
> exists. Existing files and directories are not overwritten in NO_OVERWRITE 
> mode. Use OVERWRITE mode to overwrite existing files and directories.at 
> org.apache.flink.core.fs.FileSystem.initOutPathLocalFS(FileSystem.java:817)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15167) SQL CLI library option doesn't work for Hive connector

2019-12-09 Thread Rui Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated FLINK-15167:
---
Description: 
Put all Hive connector dependency jars in a folder and start sql cli like: 
{{sql-client.sh embedded -l }}. Hit the following exception:
{noformat}
at 
org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:186)
at 
org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:102)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:233)
at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.initializeCatalogs(ExecutionContext.java:519)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.initializeTableEnvironment(ExecutionContext.java:463)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:156)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:115)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:724)
... 3 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.table.catalog.hive.client.HiveShimV230.getHiveMetastoreClient(HiveShimV230.java:55)
... 15 more
Caused by: 
MetaException(message:org.apache.hadoop.hive.metastore.HiveMetaStoreClient 
class not found)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getClass(MetaStoreUtils.java:1676)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:131)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:89)
... 20 more
{noformat}

  was:
Put all Hive connector dependency jars in a folder and start sql cli like 
{{sql-client.sh embedded -l <lib-dir>}}. Hit the following exception:
{noformat}
at 
org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:186)
at 
org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:102)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:233)
at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.initializeCatalogs(ExecutionContext.java:519)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.initializeTableEnvironment(ExecutionContext.java:463)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:156)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:115)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:724)
... 3 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.table.catalog.hive.client.HiveShimV230.getHiveMetastoreClient(HiveShimV230.java:55)
... 15 more
Caused by: 
MetaException(message:org.apache.hadoop.hive.metastore.HiveMetaStoreClient 
class not found)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getClass(MetaStoreUtils.java:1676)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:131)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:89)
... 20 more
{noformat}


> SQL CLI library option doesn't work for Hive connector
> --
>
> Key: FLINK-15167
> URL: https://issues.apache.org/jira/browse/FLINK-15167
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hive, Table SQL / Client
>Reporter: Rui Li
>Priority: Major
>
> Put all Hive connector dependency jars in a folder and start sql cli like: 
> {{sql-client.sh embedded -l <lib-dir>}}. Hit the following exception:
> {noformat}
>   at 
> org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:186)
>   at 
> 

[jira] [Updated] (FLINK-15167) SQL CLI library option doesn't work for Hive connector

2019-12-09 Thread Rui Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated FLINK-15167:
---
Description: 
Put all Hive connector dependency jars in a folder and start sql cli like 
{{sql-client.sh embedded -l <lib-dir>}}. Hit the following exception:
{noformat}
at 
org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:186)
at 
org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:102)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:233)
at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.initializeCatalogs(ExecutionContext.java:519)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.initializeTableEnvironment(ExecutionContext.java:463)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:156)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:115)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:724)
... 3 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.table.catalog.hive.client.HiveShimV230.getHiveMetastoreClient(HiveShimV230.java:55)
... 15 more
Caused by: 
MetaException(message:org.apache.hadoop.hive.metastore.HiveMetaStoreClient 
class not found)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getClass(MetaStoreUtils.java:1676)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:131)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:89)
... 20 more
{noformat}

> SQL CLI library option doesn't work for Hive connector
> --
>
> Key: FLINK-15167
> URL: https://issues.apache.org/jira/browse/FLINK-15167
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hive, Table SQL / Client
>Reporter: Rui Li
>Priority: Major
>
> Put all Hive connector dependency jars in a folder and start sql cli like 
> {{sql-client.sh embedded -l <lib-dir>}}. Hit the following exception:
> {noformat}
>   at 
> org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:186)
>   at 
> org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:102)
>   at 
> org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:233)
>   at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.initializeCatalogs(ExecutionContext.java:519)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.initializeTableEnvironment(ExecutionContext.java:463)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:156)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:115)
>   at 
> org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:724)
>   ... 3 more
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.table.catalog.hive.client.HiveShimV230.getHiveMetastoreClient(HiveShimV230.java:55)
>   ... 15 more
> Caused by: 
> MetaException(message:org.apache.hadoop.hive.metastore.HiveMetaStoreClient 
> class not found)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getClass(MetaStoreUtils.java:1676)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:131)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:89)
>   ... 20 more
> {noformat}
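
The bottom of the trace suggests the metastore client is instantiated reflectively; a minimal sketch of that failure mode (the classloader wiring here is an assumption for illustration, not the actual Hive code):
{code:java}
public class MetastoreClientLoading {
    public static void main(String[] args) throws Exception {
        // If the classloader consulted by the reflective lookup does not see the
        // jars passed via '-l', the class cannot be found even though the SQL
        // client's user classloader contains them.
        ClassLoader loader = MetastoreClientLoading.class.getClassLoader();
        Class.forName("org.apache.hadoop.hive.metastore.HiveMetaStoreClient", true, loader);
        // -> ClassNotFoundException, surfaced by Hive as
        //    "MetaException(...HiveMetaStoreClient class not found)"
    }
}
{code}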



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355866084
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesClusterDescriptor.java
 ##
 @@ -108,17 +101,31 @@ public String getClusterDescription() {
 
@Override
	public ClusterClientProvider<String> retrieve(String clusterId) {
-   return createClusterClientProvider(clusterId);
+   final ClusterClientProvider<String> clusterClientProvider = 
createClusterClientProvider(clusterId);
+
+   try (ClusterClient<String> clusterClient = 
clusterClientProvider.getClusterClient()) {
+   LOG.info(
+   "Retrieve flink cluster {} successfully, 
JobManager Web Interface: {}",
+   clusterId,
+   clusterClient.getWebInterfaceURL());
+   }
+   return clusterClientProvider;
}
 
@Override
	public ClusterClientProvider<String> 
deploySessionCluster(ClusterSpecification clusterSpecification) throws 
ClusterDeploymentException {
-   final ClusterClientProvider<String> clusterClient = 
deployClusterInternal(
+   final ClusterClientProvider<String> clusterClientProvider = 
deployClusterInternal(
KubernetesSessionClusterEntrypoint.class.getName(),
clusterSpecification,
false);
 
-   return clusterClient;
+   try (ClusterClient<String> clusterClient = 
clusterClientProvider.getClusterClient()) {
 
 Review comment:
   I am not sure what the benefits of moving the log are, or whether there are 
any other considerations for doing so?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] wsry opened a new pull request #10506: [FLINK-15166][runtime] Fix that buffer is wrongly recycled when data compression is enabled for blocking shuffle.

2019-12-09 Thread GitBox
wsry opened a new pull request #10506: [FLINK-15166][runtime] Fix that buffer 
is wrongly recycled when data compression is enabled for blocking shuffle.
URL: https://github.com/apache/flink/pull/10506
 
 
   ## What is the purpose of the change
   
   For blocking shuffle data compression, the compressed intermediate buffer is 
recycled each time after data is written out; however, when the data cannot be 
compressed, the returned buffer is the original buffer, which should not be 
recycled. This PR fixes the wrong recycling problem by checking the returned 
buffer.
   
   ## Brief change log
   
 - Do not recycle the buffer returned by the compressor if it is the original buffer.
 - Modify the ```ShuffleCompressionITCase``` to cover the scenario.
   
   
   ## Verifying this change
   
   ```ShuffleCompressionITCase``` is modified to cover the scenario.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / **no**)
 - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (FLINK-15167) SQL CLI library option doesn't work for Hive connector

2019-12-09 Thread Rui Li (Jira)
Rui Li created FLINK-15167:
--

 Summary: SQL CLI library option doesn't work for Hive connector
 Key: FLINK-15167
 URL: https://issues.apache.org/jira/browse/FLINK-15167
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive, Table SQL / Client
Reporter: Rui Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service selector needs to contain jobmanager component label

2019-12-09 Thread GitBox
zhijiangW commented on a change in pull request #10496: [FLINK-15153]Service 
selector needs to contain jobmanager component label
URL: https://github.com/apache/flink/pull/10496#discussion_r355865915
 
 

 ##
 File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesClusterDescriptor.java
 ##
 @@ -108,17 +101,31 @@ public String getClusterDescription() {
 
@Override
	public ClusterClientProvider<String> retrieve(String clusterId) {
-   return createClusterClientProvider(clusterId);
+   final ClusterClientProvider<String> clusterClientProvider = 
createClusterClientProvider(clusterId);
+
+   try (ClusterClient<String> clusterClient = 
clusterClientProvider.getClusterClient()) {
 
 Review comment:
   I think the behavior is changed here. The previous code would not close the 
`clusterClient`, but now we close the client right after logging. If so, this 
should not be a hotfix commit and might need a unit test to cover the change.
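
For reference, a minimal illustration of the concern (names reuse the diff above; this is a sketch, not the PR code):
```java
import org.apache.flink.client.program.ClusterClient;
import org.apache.flink.client.program.ClusterClientProvider;

class RetrieveSketch {
    static String webInterfaceUrl(ClusterClientProvider<String> provider) throws Exception {
        // try-with-resources always invokes close() on the resource, so the
        // client obtained for logging is closed before the method returns.
        try (ClusterClient<String> clusterClient = provider.getClusterClient()) {
            return clusterClient.getWebInterfaceURL();
        } // clusterClient.close() runs here, unlike the pre-change code path
    }
}
```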


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-14971) Move ACK and declined message handling in the same thread with triggering

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14971:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Move ACK and declined message handling in the same thread with triggering
> -
>
> Key: FLINK-14971
> URL: https://issues.apache.org/jira/browse/FLINK-14971
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Biao Liu
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently the ACK and declined message handling is executed in the IO thread. 
> It should eventually be moved into the main thread.
> After this step, all operations could be executed in the main thread, and we 
> would not need the coordinator-wide lock anymore.
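
A hedged sketch of the direction (the executor, handler interface, and method names are assumptions for illustration):
{code:java}
import java.util.concurrent.Executor;

class CheckpointMessageHandling {
    /** Stand-in for the checkpoint coordinator's ACK handling (names assumed). */
    interface AckHandler {
        void handleAcknowledge(Object ackMessage);
    }

    private final Executor mainThreadExecutor;  // the coordinator's main-thread executor
    private final AckHandler handler;

    CheckpointMessageHandling(Executor mainThreadExecutor, AckHandler handler) {
        this.mainThreadExecutor = mainThreadExecutor;
        this.handler = handler;
    }

    // Instead of handling the ACK directly on the IO thread, hand it over to the
    // main thread, where all other coordinator operations would run as well;
    // then the coordinator-wide lock becomes unnecessary.
    void onAcknowledge(Object ackMessage) {
        mainThreadExecutor.execute(() -> handler.handleAcknowledge(ackMessage));
    }
}
{code}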



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14934) Remove error log statement in ES connector

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14934:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Remove error log statement in ES connector
> --
>
> Key: FLINK-14934
> URL: https://issues.apache.org/jira/browse/FLINK-14934
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / ElasticSearch
>Affects Versions: 1.9.1
>Reporter: Arvid Heise
>Priority: Major
> Fix For: 1.11.0
>
>
> The ES connector currently uses the [log and throw 
> antipattern|https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-elasticsearch-base/src/main/java/org/apache/flink/streaming/connectors/elasticsearch/ElasticsearchSinkBase.java#L406],
>  which doesn't allow users to ignore certain types of errors without getting 
> their logs spammed.
> The log statement should be removed completely.
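
A minimal sketch of the proposed change, loosely modeled on the linked method (simplified, not the exact code; {{failureThrowable}} is assumed to be the sink's stored first failure):
{code:java}
private void checkErrorAndRethrow() {
    Throwable cause = failureThrowable.get();
    if (cause != null) {
        // before: a LOG.error(...) call here duplicated the failure in the logs,
        // which callers could not suppress; the proposal is to only rethrow and
        // leave logging to whoever handles the exception
        throw new RuntimeException("An error occurred in ElasticsearchSink.", cause);
    }
}
{code}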



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14893) Using Child ClassLoader to load class when Parent ClassLoader couldn't load in ParentFirstPatterns

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14893:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Using Child ClassLoader  to load class when Parent ClassLoader couldn't load 
> in ParentFirstPatterns 
> 
>
> Key: FLINK-14893
> URL: https://issues.apache.org/jira/browse/FLINK-14893
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.9.0
>Reporter: hailong wang
>Assignee: hailong wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In ChildFirstClassLoader#loadClass, when a class matches a parent-first 
> pattern but the parent class loader fails to load it (because the parent does 
> not contain it), a ClassNotFoundException is thrown. I think when the parent 
> fails to load the class, we should fall back to findClass.
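
A hedged sketch of the proposed fallback inside {{ChildFirstClassLoader#loadClass}} (structure simplified; the pattern-matching helper name is an assumption):
{code:java}
@Override
protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
    if (matchesParentFirstPattern(name)) {
        try {
            // parent-first attempt, as today
            return super.loadClass(name, resolve);
        } catch (ClassNotFoundException e) {
            // proposed: fall back to this (child) classloader instead of failing
            return findClass(name);
        }
    }
    // ... existing child-first logic for all other classes
    return findClass(name);
}
{code}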



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14902) JDBCTableSource support AsyncLookupFunction

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14902:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> JDBCTableSource support AsyncLookupFunction
> ---
>
> Key: FLINK-14902
> URL: https://issues.apache.org/jira/browse/FLINK-14902
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / JDBC
>Affects Versions: 1.9.0
>Reporter: hailong wang
>Assignee: hailong wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> JDBCTableSource support AsyncLookupFunction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-14647) Improve the exception message when required property is not matched

2019-12-09 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992206#comment-16992206
 ] 

Nicholas Jiang edited comment on FLINK-14647 at 12/10/19 6:44 AM:
--

Yeah, my original intention was that requiredContext includes some required 
properties, so I added a matchContext method to handle the comparison with the 
required context. As you suggested, your proposal of exposing two interfaces is 
feasible, and I have modified the TableFactory interface design so that there 
is no match-context concept in the interface. [~jark] Please check the [Improve 
exception message 
design|https://docs.google.com/presentation/d/1SWLC_hjQ5FAXX1cXwvfduA-s6e8w1CeDFNH8y3mPFVM/edit?usp=sharing]
 again.


was (Author: nicholasjiang):
[~jark]Yeah, my original intention was that requiredContext includes some 
required properties, so add matchContext method to tell the comparsion with 
required context. As your feel,  your exposion of two interfaces is feasible, 
and I will modify the TableFactory interface design, not match context in 
interface concept.

> Improve the exception message when required property is not matched
> ---
>
> Key: FLINK-14647
> URL: https://issues.apache.org/jira/browse/FLINK-14647
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / API
>Reporter: Jark Wu
>Priority: Major
>
> Currently, all the required properties should exist and match, otherwise, 
> {{NoMatchingTableFactoryException}} will be thrown.
> For example, if we have {{connector.type=hbase,  connector.versions=1.1.1}}, 
> the following exception will be thrown.
> {code}
> org.apache.flink.table.api.NoMatchingTableFactoryException: Could not find a 
> suitable table factory for 'org.apache.flink.addons.hbase.HBaseTableFactory' 
> in
> the classpath.
> Reason: No context matches.
> The following properties are requested:
> connector.type=hbase
> connector.version=1.1.1
> {code}
> It's hard to tell that the problem is a wrong version. A quick fix is to move 
> the version out of {{requiredContext()}} if we only support one version, and 
> throw a readable exception in {{ConnectorDescriptorValidator#validate}}. 
> However, for the multiple-version connectors, e.g. Kafka, maybe we should 
> improve the design of {{TableFactory}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14863) remove default in-memory catalog from CatalogManager

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14863:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0.

> remove default in-memory catalog from CatalogManager
> 
>
> Key: FLINK-14863
> URL: https://issues.apache.org/jira/browse/FLINK-14863
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / API
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14652) Refactor checkpointing related parts into one place on task side

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14652:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Refactor checkpointing related parts into one place on task side
> 
>
> Key: FLINK-14652
> URL: https://issues.apache.org/jira/browse/FLINK-14652
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
> Fix For: 1.11.0
>
>
> As suggested by [~sewen] within [review for 
> PR-8693|https://github.com/apache/flink/pull/8693#issuecomment-542834147], it 
> would be worthy to refactor all checkpointing parts into a single place on 
> task side.
> This issue focus on refactoring these parts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14551) Unaligned checkpoints

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-14551:
-
Fix Version/s: (was: 1.10.0)
   1.11.0

> Unaligned checkpoints
> -
>
> Key: FLINK-14551
> URL: https://issues.apache.org/jira/browse/FLINK-14551
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Network
>Reporter: zhijiang
>Priority: Major
> Fix For: 1.11.0
>
>
> This is the umbrella issue for the feature of unaligned checkpoints. Refer to 
> the 
> [FLIP-76|https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints]
>  for more details.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15031) Calculate required shuffle memory before allocating slots if resources are specified

2019-12-09 Thread Zhu Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu updated FLINK-15031:

Parent: (was: FLINK-14058)
Issue Type: Task  (was: Sub-task)

> Calculate required shuffle memory before allocating slots if resources are 
> specified
> 
>
> Key: FLINK-15031
> URL: https://issues.apache.org/jira/browse/FLINK-15031
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In cases where resources are specified, we expect each operator to declare 
> the resources it requires before using them. In this way, no resource-related 
> error should happen as long as no resource is used beyond what was declared, 
> and a deployed task would not fail due to insufficient resources in the TM. 
> Otherwise, insufficient resources may result in unnecessary failures and may 
> even cause a job to hang forever, failing repeatedly on deploying tasks to a 
> TM with insufficient resources.
> Shuffle memory is the last missing piece for this goal at the moment. A 
> minimum number of network buffers is required for tasks to work. Currently a 
> task may be deployed to a TM with insufficient network buffers and fail on 
> launching.
> To avoid that, we should calculate required network memory for a 
> task/SlotSharingGroup before allocating a slot for it.
> The required shuffle memory can be derived from the number of required 
> network buffers. The number of buffers required by a task (ExecutionVertex) 
> is 
> {code:java}
> exclusive buffers for input channels (i.e. numInputChannels * 
> buffersPerChannel) + required buffers for the result partition buffer 
> pool (currently numberOfSubpartitions + 1)
> {code}
> Note that this is for the {{NettyShuffleService}} case. For custom shuffle 
> services, currently there is no way to get the required shuffle memory of a 
> task.
> To make it simple under dynamic slot sharing, the required shuffle memory for 
> a task should be the max required shuffle memory of all {{ExecutionVertex}} 
> of the same {{ExecutionJobVertex}}. And the required shuffle memory for a 
> slot sharing group should be the sum of shuffle memory for each 
> {{ExecutionJobVertex}} instance within.
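
For illustration, the buffer accounting above as a hedged sketch ({{NettyShuffleService}} case; names are illustrative):
{code:java}
// Buffers required by one task (ExecutionVertex):
static int requiredBuffersPerVertex(int numInputChannels, int buffersPerChannel, int numSubpartitions) {
    int inputBuffers = numInputChannels * buffersPerChannel;  // exclusive input buffers
    int outputBuffers = numSubpartitions + 1;                 // result partition buffer pool
    return inputBuffers + outputBuffers;
}
// Per ExecutionJobVertex, take the max over its ExecutionVertex instances;
// per slot sharing group, sum those maxima over the ExecutionJobVertex
// instances within, then multiply by the page size to get the memory amount.
{code}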



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14510) Remove the lazy vertex attaching mechanism from ExecutionGraph

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14510:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0

> Remove the lazy vertex attaching mechanism from ExecutionGraph
> --
>
> Key: FLINK-14510
> URL: https://issues.apache.org/jira/browse/FLINK-14510
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently in production, the vertex attaching is only invoked right after the 
> ExecutionGraph is created in ExecutionGraphBuilder. That means lazy attaching 
> is not necessary at the moment. It however adds extra complexity to 
> ExecutionGraph, since we need to assume that the vertices may be not 
> initialized or even get changed.
> Moreover, attaching vertices after a job starts scheduling is an undefined 
> behavior which would not work properly.
> I'd propose to remove the lazy attaching mechanism, and do vertices building 
> and related components initialization in ExecutionGraph constructor. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] jinglining commented on issue #10458: [FLINK-14815][rest]Expose network metric in IOMetricsInfo

2019-12-09 Thread GitBox
jinglining commented on issue #10458: [FLINK-14815][rest]Expose network metric 
in IOMetricsInfo
URL: https://github.com/apache/flink/pull/10458#issuecomment-563888090
 
 
   @flinkbot run travis


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-15159) the string of json is mapped to VARCHAR or STRING?

2019-12-09 Thread Jark Wu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992227#comment-16992227
 ] 

Jark Wu commented on FLINK-15159:
-

Currently, VARCHAR is the same as STRING; they are both aliases of 
VARCHAR(MAX_LENGTH). But we suggest using STRING, because VARCHAR should take a 
length parameter all the time.

> the string of json is mapped to VARCHAR or STRING?
> --
>
> Key: FLINK-15159
> URL: https://issues.apache.org/jira/browse/FLINK-15159
> Project: Flink
>  Issue Type: Wish
>  Components: Documentation
>Reporter: hehuiyuan
>Priority: Major
> Attachments: image-2019-12-09-21-14-08-183.png
>
>
> !image-2019-12-09-21-14-08-183.png|width=356,height=180!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot commented on issue #10505: [FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String

2019-12-09 Thread GitBox
flinkbot commented on issue #10505: 
[FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID 
instead of String
URL: https://github.com/apache/flink/pull/10505#issuecomment-563887264
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit f742b7defe3d40ae00fb3e0adb66e4dc3ec20087 (Tue Dec 10 
06:34:37 UTC 2019)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.
 Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (FLINK-15103) Performance regression on 3.12.2019 in various benchmarks

2019-12-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992220#comment-16992220
 ] 

Xintong Song edited comment on FLINK-15103 at 12/10/19 6:32 AM:


Ok, I think I found something that might be the cause of the regression.

I’ll explain the details below, but maybe start from my conclusion: *With the 
questioned commit we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) with *before-regression*, and refer 
to the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
with *cause-regression*. All the test results shown below are from running the 
benchmarks on my laptop locally.

First, I tried to log out the number of network buffers and page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

The result shows that, cause-regression has more network buffers, which is as 
expected.

*Why do we have more network buffers?* For the two commits discussed, network 
buffer memory size is calculated in 
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}. 
Basically this method calculates network memory size with the following 
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - 
networkFraction)
 heapAndManagedMemory = jvmFreeHeapMemory, if managed memory is on-heap
 heapAndManagedMemory = jvmFreeHeapMemory + managedMemory, if managed memory is 
off-heap
{quote}
Assuming we have the same {{jvmFreeHeapMemory}}, with managed memory changed 
from on-heap to off-heap, we should have larger network memory size, thus 
larger number of network buffers when the page size stays the same.

What goes against intuition is that, with more network buffers, we actually 
have worse performance. The only explanation I can think of is that our 
benchmarks include the cluster initialization time in the statistics, and with 
more network buffers we need more time for allocating those direct memory 
buffers.

To verify that, I explicitly configured 
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks, and 
logged out the time consumed by allocating all the buffers in the constructor 
of {{NetworkBufferPool}}. Here are the results. Suffix '-small' / '-large' 
stand for setting number of network buffers to 12945 / 32768 respectively.
|| 
||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|avg. buffer allocation time (ms)|79.033|242.567|78.167|237.717|

The benchmark scores show that a larger number of network buffers indeed leads 
to the regression in the statistics. Digging further into the results, taking 
{{arrayKeyBy}} in before-regression as an example: the total number of records 
is 7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}) and ops / ms 
for the small network memory case is 1934, which gives a total execution time 
for one invocation of {{KeyByBenchmarks#arrayKeyBy}} with the small network 
buffer count of roughly 7,000,000 / 1934 = 3619ms. Similarly, the total 
execution time with the large network buffer count is roughly 7,000,000 / 1835 
= 3815ms. The total execution time difference between the small and large 
network buffer counts is about 196ms, which is very close to the difference in 
buffer allocation time (242 - 79 = 163ms). If we also take the randomness and 
the potential slight performance improvement due to more network buffers into 
consideration, this basically explains where the benchmark regression comes 
from.

I still need to look into the other two cases (twoInputMapSink and 
globalWindow), and the performance of the later commit 
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just to post 
my findings so far.


was (Author: xintongsong):
Ok, I think I find the cause of the regression.

I’ll explain the details below, but maybe start from my conclusion: *With the 
questioned commit we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) with *before-regression*, and refer 
to the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
with *cause-regression*. All the test results shown below are from running the 
benchmarks on my laptop locally.

First, I tried to log out the number of network buffers and page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

[jira] [Updated] (FLINK-14958) ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String

2019-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-14958:
---
Labels: pull-request-available  (was: )

> ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String
> -
>
> Key: FLINK-14958
> URL: https://issues.apache.org/jira/browse/FLINK-14958
> Project: Flink
>  Issue Type: Improvement
>Reporter: Zili Chen
>Assignee: AT-Fieldless
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] AT-Fieldless opened a new pull request #10505: [FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String

2019-12-09 Thread GitBox
AT-Fieldless opened a new pull request #10505: 
[FLINK-14958][table]ProgramTargetDescriptor#jobID possibly can be of type JobID 
instead of String
URL: https://github.com/apache/flink/pull/10505
 
 
   ## What is the purpose of the change
   
   ProgramTargetDescriptor#jobID possibly can be of type JobID instead of String
   
   ## Brief change log
   
   Replace the type of ProgramTargetDescriptor#jobID from String to JobID (see 
the sketch below).
   Change the related test code.
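
   A hedged sketch of the resulting shape (field and accessor names may differ from the actual PR):
```java
import org.apache.flink.api.common.JobID;

public class ProgramTargetDescriptor {
    private final JobID jobId; // previously a String

    public ProgramTargetDescriptor(JobID jobId) {
        this.jobId = jobId;
    }

    public JobID getJobId() {
        return jobId;
    }
}
```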
   
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
 - The serializers: (don't know)
 - The runtime per-record code paths (performance sensitive): (don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (don't know)
 - The S3 file system connector: (don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? ( no)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot edited a comment on issue #10348: [FLINK-14951][tests] Harden the thread safety of State TTL backend tests

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10348: [FLINK-14951][tests] Harden the 
thread safety of State TTL backend tests
URL: https://github.com/apache/flink/pull/10348#issuecomment-559509076
 
 
   
   ## CI report:
   
   * 54ecee627e7036f4d150aad330b9772406a19494 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/138580810) 
   * 3e89b2b0d85dd29a76bb10807e2959a7d2ee8295 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140343903) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3383)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-14471) Hide error message when metric api failed

2019-12-09 Thread Yu Li (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-14471:
--
Fix Version/s: (was: 1.10.0)
   1.11.0

Change fix version to 1.11.0 since we already reached feature freeze for 1.10.0.

Also noticed the parent issue is marked as a duplicate, so if there are still 
reasons to work on this issue, I'd suggest extracting it as a standalone one.

> Hide error message when metric api failed
> -
>
> Key: FLINK-14471
> URL: https://issues.apache.org/jira/browse/FLINK-14471
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>Reporter: Yadong Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The error message should be hidden when the metric API fails



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14953) Parquet table source should use schema type to build FilterPredicate

2019-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-14953:
---
Labels: pull-request-available  (was: )

> Parquet table source should use schema type to build FilterPredicate
> 
>
> Key: FLINK-14953
> URL: https://issues.apache.org/jira/browse/FLINK-14953
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.8.0, 1.8.2, 1.9.0, 1.9.1
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.2
>
>
> The issue happens when the data type of the predicate value inferred from SQL 
> doesn't match the Parquet schema. For example, foo is a column of long type 
> and foo < 1 is the predicate. The literal will be recognized as an integer, 
> which causes the Parquet FilterPredicate to be mistakenly created for a 
> column of Integer type. Then the following exception comes:
> java.lang.UnsupportedOperationException
>   at 
> org.apache.parquet.filter2.recordlevel.IncrementallyUpdatedFilterPredicate$ValueInspector.update(IncrementallyUpdatedFilterPredicate.java:71)
>   at 
> org.apache.parquet.filter2.recordlevel.FilteringPrimitiveConverter.addLong(FilteringPrimitiveConverter.java:105)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:268)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:367)
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>   at 
> org.apache.flink.formats.parquet.utils.ParquetRecordReader.readNextRecord(ParquetRecordReader.java:235)
>   at 
> org.apache.flink.formats.parquet.utils.ParquetRecordReader.reachEnd(ParquetRecordReader.java:207)
>   at 
> org.apache.flink.formats.parquet.ParquetInputFormat.reachedEnd(ParquetInputFormat.java:233)
>   at 
> org.apache.flink.api.common.operators.GenericDataSourceBase.executeOnCollections(GenericDataSourceBase.java:231)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.executeDataSource(CollectionExecutor.java:219)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.execute(CollectionExecutor.java:155)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.executeUnaryOperator(CollectionExecutor.java:229)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.execute(CollectionExecutor.java:149)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.execute(CollectionExecutor.java:131)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.executeDataSink(CollectionExecutor.java:182)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.execute(CollectionExecutor.java:158)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.execute(CollectionExecutor.java:131)
>   at 
> org.apache.flink.api.common.operators.CollectionExecutor.execute(CollectionExecutor.java:115)
>   at 
> org.apache.flink.api.java.CollectionEnvironment.execute(CollectionEnvironment.java:38)
>   at 
> org.apache.flink.test.util.CollectionTestEnvironment.execute(CollectionTestEnvironment.java:52)
>   at 
> org.apache.flink.test.util.CollectionTestEnvironment.execute(CollectionTestEnvironment.java:47)
>   at org.apache.flink.api.java.DataSet.collect(DataSet.java:413)
>   at 
> org.apache.flink.formats.parquet.ParquetTableSourceITCase.testScanWithProjectionAndFilter(ParquetTableSourceITCase.java:91)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> 
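
For illustration, a hedged sketch of building the predicate against the schema type with Parquet's public {{FilterApi}} (this shows the expected behavior, not Flink's predicate conversion code):
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class ParquetPredicateExample {
    public static void main(String[] args) {
        // For an INT64 parquet column 'foo', the literal must be widened to long:
        FilterPredicate correct = FilterApi.lt(FilterApi.longColumn("foo"), 1L);
        System.out.println(correct);
        // The bug effectively built the int-typed equivalent instead, e.g.
        // FilterApi.lt(FilterApi.intColumn("foo"), 1), which later fails with
        // UnsupportedOperationException when applied to the INT64 column.
    }
}
{code}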

[GitHub] [flink] KurtYoung closed pull request #10371: [FLINK-14953][formats] use table type to build parquet FilterPredicate

2019-12-09 Thread GitBox
KurtYoung closed pull request #10371: [FLINK-14953][formats] use table type to 
build parquet FilterPredicate
URL: https://github.com/apache/flink/pull/10371
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (FLINK-14032) Make the cache size of RocksDBPriorityQueueSetFactory configurable

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14032:


Assignee: Yun Tang

> Make the cache size of RocksDBPriorityQueueSetFactory configurable
> --
>
> Key: FLINK-14032
> URL: https://issues.apache.org/jira/browse/FLINK-14032
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, the cache size of {{RocksDBPriorityQueueSetFactory}} is hard-coded 
> to 128 and there is no way to configure it to another value. (We could 
> increase this to obtain better performance if necessary.) Actually, this has 
> also been a TODO for quite a long time.
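
A hedged sketch of what such an option could look like (the key, default, and description are assumptions; no such option exists yet per this ticket):
{code:java}
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class RocksDBPriorityQueueOptions {
    // Hypothetical option for illustration only.
    public static final ConfigOption<Integer> PRIORITY_QUEUE_CACHE_SIZE =
        ConfigOptions.key("state.backend.rocksdb.priority-queue.cache-size")
            .defaultValue(128)
            .withDescription("Cache size used by the RocksDB-based priority queue set factory.");
}
{code}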



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on issue #10504: [FLINK-15069][benchmark] Supplement the pipelined shuffle compression case for benchmark

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10504: [FLINK-15069][benchmark] Supplement 
the pipelined shuffle compression case for benchmark
URL: https://github.com/apache/flink/pull/10504#issuecomment-563761500
 
 
   
   ## CI report:
   
   * 64f397f5a4fdb5cf37119c1251e7a553d5c4da0f Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/140352443) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3387)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-14392) Introduce JobClient API(FLIP-74)

2019-12-09 Thread Yu Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992221#comment-16992221
 ] 

Yu Li commented on FLINK-14392:
---

Since we already reached feature freeze, I'm afraid this work (FLIP-74) cannot 
be included in 1.10.0. I'm planning to change the fix version to 1.11.0; please 
let me know if there are any concerns [~tison] [~aljoscha].

On the other hand, for the already completed tasks, is there anything we could 
write in our release notes? Could we call it some kind of preview version?

Thanks!

> Introduce JobClient API(FLIP-74)
> 
>
> Key: FLINK-14392
> URL: https://issues.apache.org/jira/browse/FLINK-14392
> Project: Flink
>  Issue Type: New Feature
>  Components: Client / Job Submission
>Reporter: Zili Chen
>Assignee: Zili Chen
>Priority: Major
> Fix For: 1.10.0
>
>
> This is the umbrella issue to track all efforts toward {{JobClient}} proposed 
> in 
> [FLIP-74|https://cwiki.apache.org/confluence/display/FLINK/FLIP-74%3A+Flink+JobClient+API].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15103) Performance regression on 3.12.2019 in various benchmarks

2019-12-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992220#comment-16992220
 ] 

Xintong Song commented on FLINK-15103:
--

Ok, I think I found the cause of the regression.

I’ll explain the details below, but maybe start from my conclusion: *With the 
questioned commit we actually have more network buffers than before, which take 
more time to allocate when initiating the task executors, causing the 
regression.*

For simplicity, I’ll refer to the last commit before the regression 
(6681e111f2cee580c86422082d4409004df4f096) with *before-regression*, and refer 
to the commit causing the regression (4b8ed643a4d85c9440a8adbc0798b8a4bbd9520b) 
with *cause-regression*. All the test results shown below are from running the 
benchmarks on my laptop locally.

First, I tried to log out the number of network buffers and page sizes.
||before-regression||cause-regression||
|12945 buffers, page size 32kb|32768 buffers, page size 32kb|

The result shows that, cause-regression has more network buffers, which is as 
expected.

*Why do we have more network buffers?* For the two commits discussed, network 
buffer memory size is calculated in 
{{NettyShuffleEnvironmentConfiguration#calculateNewNetworkBufferMemory}}. 
Basically this method calculates network memory size with the following 
equations.
{quote}networkMemorySize = heapAndManagedMemory * networkFraction / (1 - 
networkFraction)
 heapAndManagedMemory = jvmFreeHeapMemory, if managed memory is on-heap
 heapAndManagedMemory = jvmFreeHeapMemory + managedMemory, if managed memory is 
off-heap
{quote}
Assuming we have the same {{jvmFreeHeapMemory}}, with managed memory changed 
from on-heap to off-heap, we should have larger network memory size, thus 
larger number of network buffers when the page size stays the same.

What goes against intuition is that, with more network buffers, we actually 
have worse performance. The only explanation I can think of is that our 
benchmarks include the cluster initialization time in the statistics, and with 
more network buffers we need more time for allocating those direct memory 
buffers.

To verify that, I explicitly configured 
{{NettyShuffleEnvironmentOptions.NETWORK_NUM_BUFFERS}} in the benchmarks, and 
logged out the time consumed by allocating all the buffers in the constructor 
of {{NetworkBufferPool}}. Here are the results. Suffix '-small' / '-large' 
stand for setting number of network buffers to 12945 / 32768 respectively.
|| 
||before-regression-small||before-regression-large||cause-regression-small||cause-regression-large||
|arrayKeyBy score (ops / ms)|1934.100|1835.320|2007.567|1966.123|
|tupleKeyBy score (ops / ms)|3403.841|3118.959|3696.727|3219.226|
|avg. buffer allocation time (ms)|79.033|242.567|78.167|237.717|

The benchmark scores show that a larger number of network buffers indeed leads 
to the regression in the statistics. Digging further into the results, taking 
{{arrayKeyBy}} in before-regression as an example: the total number of records 
is 7,000,000 ({{KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION}}) and ops / ms 
for the small network memory case is 1934, which gives a total execution time 
for one invocation of {{KeyByBenchmarks#arrayKeyBy}} with the small network 
buffer count of roughly 7,000,000 / 1934 = 3619ms. Similarly, the total 
execution time with the large network buffer count is roughly 7,000,000 / 1835 
= 3815ms. The total execution time difference between the small and large 
network buffer counts is about 196ms, which is very close to the difference in 
buffer allocation time (242 - 79 = 163ms). If we also take the randomness and 
the potential slight performance improvement due to more network buffers into 
consideration, this basically explains where the benchmark regression comes 
from.

I still need to look into the other two cases (twoInputMapSink and 
globalWindow), and the performance of the later commit 
(9d1256ccbf8eb1556016b6805c3a91e2787d298a) that activates FLIP-49. Just to post 
my findings so far.
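
For reference, a quick back-of-envelope check of the timing argument above (pure arithmetic; the numbers are copied from the tables):
{code:java}
public class RegressionArithmetic {
    public static void main(String[] args) {
        long records = 7_000_000L;  // KeyByBenchmarks#ARRAY_RECORDS_PER_INVOCATION
        double smallOps = 1934.0;   // ops/ms with 12945 buffers
        double largeOps = 1835.0;   // ops/ms with 32768 buffers
        double execDiffMs = records / largeOps - records / smallOps;  // ~196 ms
        double allocDiffMs = 242.567 - 79.033;                        // ~163 ms
        System.out.printf("execution diff: %.0f ms, allocation diff: %.1f ms%n",
            execDiffMs, allocDiffMs);
    }
}
{code}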

> Performance regression on 3.12.2019 in various benchmarks
> -
>
> Key: FLINK-15103
> URL: https://issues.apache.org/jira/browse/FLINK-15103
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks
>Reporter: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.10.0
>
>
> Various benchmarks show a performance regression that happened on December 
> 3rd:
> [arrayKeyBy (probably the most easily 
> visible)|http://codespeed.dak8s.net:8000/timeline/#/?exe=1=arrayKeyBy=2=200=off=on=on]
>  
> [tupleKeyBy|http://codespeed.dak8s.net:8000/timeline/#/?exe=1=tupleKeyBy=2=200=off=on=on]
>  
> [twoInputMapSink|http://codespeed.dak8s.net:8000/timeline/#/?exe=1=twoInputMapSink=2=200=off=on=on]
>  [globalWindow (small 
> 

[jira] [Commented] (FLINK-15146) The value of `cleanupSize` should be greater than 0 for `IncrementalCleanupStrategy`

2019-12-09 Thread hehuiyuan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992218#comment-16992218
 ] 

hehuiyuan commented on FLINK-15146:
---

Hi [~azagrebin], I would be pleased to do this.

> The value of `cleanupSize` should be greater than 0 for 
> `IncrementalCleanupStrategy`
> ---
>
> Key: FLINK-15146
> URL: https://issues.apache.org/jira/browse/FLINK-15146
> Project: Flink
>  Issue Type: Wish
>Reporter: hehuiyuan
>Priority: Minor
> Attachments: image-2019-12-09-17-03-59-014.png, 
> image-2019-12-09-17-09-18-062.png
>
>
>  
> Hi, currently the value of cleanupSize is only required to be greater than or 
> equal to 0. Requiring the value to be strictly greater than 0 would be more 
> practical.
> !image-2019-12-09-17-03-59-014.png|width=615,height=108!
> !image-2019-12-09-17-09-18-062.png|width=491,height=309!
>  
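
The change presumably amounts to tightening a precondition; a minimal sketch (the exact message and surrounding code are assumptions):
{code:java}
import org.apache.flink.util.Preconditions;

class IncrementalCleanupCheck {
    static void validateCleanupSize(int cleanupSize) {
        // before: cleanupSize >= 0 was accepted; require a strictly positive size
        Preconditions.checkArgument(cleanupSize > 0, "cleanupSize should be greater than 0");
    }
}
{code}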



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on issue #10348: [FLINK-14951][tests] Harden the thread safety of State TTL backend tests

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10348: [FLINK-14951][tests] Harden the 
thread safety of State TTL backend tests
URL: https://github.com/apache/flink/pull/10348#issuecomment-559509076
 
 
   
   ## CI report:
   
   * 54ecee627e7036f4d150aad330b9772406a19494 Travis: 
[SUCCESS](https://travis-ci.com/flink-ci/flink/builds/138580810) 
   * 3e89b2b0d85dd29a76bb10807e2959a7d2ee8295 Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140343903) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3383)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [flink] flinkbot edited a comment on issue #10504: [FLINK-15069][benchmark] Supplement the pipelined shuffle compression case for benchmark

2019-12-09 Thread GitBox
flinkbot edited a comment on issue #10504: [FLINK-15069][benchmark] Supplement 
the pipelined shuffle compression case for benchmark
URL: https://github.com/apache/flink/pull/10504#issuecomment-563761500
 
 
   
   ## CI report:
   
   * 64f397f5a4fdb5cf37119c1251e7a553d5c4da0f Travis: 
[PENDING](https://travis-ci.com/flink-ci/flink/builds/140352443) Azure: 
[SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=3387)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (FLINK-14236) Make LazyFromSourcesSchedulingStrategy do lazy scheduling based on partition state only

2019-12-09 Thread Zhu Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu updated FLINK-14236:

Fix Version/s: (was: 1.10.0)
   1.11.0

> Make LazyFromSourcesSchedulingStrategy do lazy scheduling based on partition 
> state only
> ---
>
> Key: FLINK-14236
> URL: https://issues.apache.org/jira/browse/FLINK-14236
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
> Fix For: 1.11.0
>
>
> The philosophy of lazy_from_sources scheduling is to schedule a task when its 
> inputs are ready, i.e. when needed result partitions are consumable. It 
> actually does not directly care whether a task is finished, though a finished 
> task will cause its partitions to become consumable.
> The {{LazyFromSourcesSchedulingStrategy}} currently does lazy scheduling in 
> two ways:
> 1. trigger scheduling on FINISHED execution state for blocking partition 
> consumers
> 2. trigger scheduling on partition consumable events for pipelined partition 
> consumers
> This makes the scheduling decision a bit messy. And it also requires the 
> {{LazyFromSourcesSchedulingStrategy}} to manage the partition state by 
> itself, to decide when a partition is consumable, which is not good.
> I'd propose to make {{LazyFromSourcesSchedulingStrategy}} do lazy scheduling 
> only based on partition state. This can make the scheduling decision-making 
> clearer and relieve {{LazyFromSourcesSchedulingStrategy}} from partition 
> state management.
> This work requires FLINK-14234 so that all partition consumable events can be 
> notified.
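
A hedged sketch of the proposed decision rule (type and state names loosely follow the scheduler API and should be treated as assumptions):
{code:java}
// Schedule a vertex purely from partition state: it becomes schedulable once
// every result partition it consumes is consumable, regardless of whether the
// producing task has reached the FINISHED state.
boolean shouldSchedule(SchedulingExecutionVertex vertex) {
    for (SchedulingResultPartition partition : vertex.getConsumedResultPartitions()) {
        if (partition.getState() != ResultPartitionState.CONSUMABLE) {
            return false;
        }
    }
    return true;
}
{code}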



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

