[jira] [Updated] (FLINK-21318) ExecutionContextTest.testCatalogs fail

2021-02-08 Thread Jark Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jark Wu updated FLINK-21318:

Description: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13064=logs=51fed01c-4eb0-5511-d479-ed5e8b9a7820=e5682198-9e22-5770-69f6-7551182edea8
{code:java}
2021-02-07T22:25:13.3526341Z [ERROR] testDatabases(org.apache.flink.table.client.gateway.local.ExecutionContextTest)  Time elapsed: 0.044 s  <<< ERROR!
2021-02-07T22:25:13.3526885Z org.apache.flink.table.client.gateway.SqlExecutionException: Could not create execution context.
2021-02-07T22:25:13.3527484Z 	at org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:972)
2021-02-07T22:25:13.3528070Z 	at org.apache.flink.table.client.gateway.local.ExecutionContextTest.createExecutionContext(ExecutionContextTest.java:398)
2021-02-07T22:25:13.3528676Z 	at org.apache.flink.table.client.gateway.local.ExecutionContextTest.createCatalogExecutionContext(ExecutionContextTest.java:434)
2021-02-07T22:25:13.3529280Z 	at org.apache.flink.table.client.gateway.local.ExecutionContextTest.testDatabases(ExecutionContextTest.java:230)
2021-02-07T22:25:13.3529775Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-02-07T22:25:13.3530232Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-02-07T22:25:13.3530773Z 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-02-07T22:25:13.3531246Z 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2021-02-07T22:25:13.3531641Z 	at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:68)
2021-02-07T22:25:13.3532223Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.runTestMethod(PowerMockJUnit44RunnerDelegateImpl.java:326)
2021-02-07T22:25:13.3532909Z 	at org.junit.internal.runners.MethodRoadie$2.run(MethodRoadie.java:89)
2021-02-07T22:25:13.3533369Z 	at org.junit.internal.runners.MethodRoadie.runBeforesThenTestThenAfters(MethodRoadie.java:97)
2021-02-07T22:25:13.3534119Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.executeTest(PowerMockJUnit44RunnerDelegateImpl.java:310)
2021-02-07T22:25:13.3534995Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.executeTestInSuper(PowerMockJUnit47RunnerDelegateImpl.java:131)
2021-02-07T22:25:13.3535784Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.access$100(PowerMockJUnit47RunnerDelegateImpl.java:59)
2021-02-07T22:25:13.3536590Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner$TestExecutorStatement.evaluate(PowerMockJUnit47RunnerDelegateImpl.java:147)
2021-02-07T22:25:13.3537395Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.evaluateStatement(PowerMockJUnit47RunnerDelegateImpl.java:107)
2021-02-07T22:25:13.3538183Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.executeTest(PowerMockJUnit47RunnerDelegateImpl.java:82)
2021-02-07T22:25:13.3538998Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.runBeforesThenTestThenAfters(PowerMockJUnit44RunnerDelegateImpl.java:298)
2021-02-07T22:25:13.3539633Z 	at org.junit.internal.runners.MethodRoadie.runTest(MethodRoadie.java:87)
2021-02-07T22:25:13.3540039Z 	at org.junit.internal.runners.MethodRoadie.run(MethodRoadie.java:50)
2021-02-07T22:25:13.3540592Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.invokeTestMethod(PowerMockJUnit44RunnerDelegateImpl.java:218)
2021-02-07T22:25:13.3541269Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.runMethods(PowerMockJUnit44RunnerDelegateImpl.java:160)
2021-02-07T22:25:13.3541969Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$1.run(PowerMockJUnit44RunnerDelegateImpl.java:134)
2021-02-07T22:25:13.3542523Z 	at org.junit.internal.runners.ClassRoadie.runUnprotected(ClassRoadie.java:34)
2021-02-07T22:25:13.3543045Z 	at org.junit.internal.runners.ClassRoadie.runProtected(ClassRoadie.java:44)
2021-02-07T22:25:13.3543567Z 	at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.run(PowerMockJUnit44RunnerDelegateImpl.java:136)
2021-02-07T22:25:13.3544312Z 	at org.powermock.modules.junit4.common.internal.impl.JUnit4TestSuiteChunkerImpl.run(JUnit4TestSuiteChunkerImpl.java:117)
{code}

[jira] [Created] (FLINK-21332) Optimize releasing result partitions in RegionPartitionReleaseStrategy

2021-02-08 Thread Zhilong Hong (Jira)
Zhilong Hong created FLINK-21332:


 Summary: Optimize releasing result partitions in 
RegionPartitionReleaseStrategy
 Key: FLINK-21332
 URL: https://issues.apache.org/jira/browse/FLINK-21332
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Zhilong Hong
 Fix For: 1.13.0


RegionPartitionReleaseStrategy is responsible for releasing result partitions 
when all the downstream tasks finish.

The current implementation is:
{code:java}
for each consumed SchedulingResultPartition of current finished SchedulingPipelinedRegion:
  for each consumer SchedulingPipelinedRegion of the SchedulingResultPartition:
    if all the regions are finished:
      release the partitions
{code}
The time complexity of releasing a single result partition is O(N^2). However, since all the result partitions of the stage need to be released over its lifetime, the overall time complexity is actually O(N^3).

After the optimization of DefaultSchedulingTopology, the consumed result partitions are grouped. Since the result partitions in one group are isomorphic, we can simply cache the finished status of the result partition groups and the corresponding pipelined regions.

The optimized implementation is:
{code:java}
for each ConsumedPartitionGroup of current finished SchedulingPipelinedRegion:
  if all consumer SchedulingPipelinedRegions of the ConsumedPartitionGroup are finished:
    set the ConsumedPartitionGroup to be fully consumed
    for each result partition in the ConsumedPartitionGroup:
      if all the ConsumedPartitionGroups it belongs to are fully consumed:
        release the result partition
{code}
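As a rough illustration of the cached check in the pseudocode above, here is a hedged sketch in plain Java. All class, field, and method names (ConsumedPartitionGroup, markFullyConsumed, groupsOfPartition) are simplified stand-ins for this example, not Flink's actual scheduler API:

```java
import java.util.*;

public class PartitionReleaseSketch {

    static class ConsumedPartitionGroup {
        final List<String> partitions;
        boolean fullyConsumed = false;
        ConsumedPartitionGroup(String... partitions) {
            this.partitions = Arrays.asList(partitions);
        }
    }

    // partition id -> all groups that consume it
    final Map<String, List<ConsumedPartitionGroup>> groupsOfPartition = new HashMap<>();

    // Invoked once every consumer region of `group` has finished; returns the
    // partitions that are now fully consumed and can be released.
    List<String> markFullyConsumed(ConsumedPartitionGroup group) {
        group.fullyConsumed = true;
        List<String> releasable = new ArrayList<>();
        for (String partition : group.partitions) {
            boolean allConsumed = true;
            for (ConsumedPartitionGroup g : groupsOfPartition.get(partition)) {
                allConsumed &= g.fullyConsumed;
            }
            if (allConsumed) {
                releasable.add(partition);
            }
        }
        return releasable;
    }

    public static void main(String[] args) {
        PartitionReleaseSketch sketch = new PartitionReleaseSketch();
        ConsumedPartitionGroup g1 = new ConsumedPartitionGroup("p1", "p2");
        ConsumedPartitionGroup g2 = new ConsumedPartitionGroup("p2");
        sketch.groupsOfPartition.put("p1", Collections.singletonList(g1));
        sketch.groupsOfPartition.put("p2", Arrays.asList(g1, g2));
        // p1 is only consumed by g1, so it is releasable; p2 is still held by g2.
        System.out.println(sketch.markFullyConsumed(g1));
        // Once g2 is also fully consumed, p2 becomes releasable.
        System.out.println(sketch.markFullyConsumed(g2));
    }
}
```

The point of the cache is that the per-group finished flag is computed once per group instead of once per partition-consumer pair.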



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on pull request #14904: [WIP][FLINK-20663][runtime] Instantly release unsafe memory on freeing segment.

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14904:
URL: https://github.com/apache/flink/pull/14904#issuecomment-775639746


   
   ## CI report:
   
   * 797afe4dcb649ffa6e1e761581389c6055af0c9c Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13133)
 
   * b194119f275e5a19a86281965584b60df59d5c39 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (FLINK-21331) Optimize calculating tasks to restart in RestartPipelinedRegionFailoverStrategy

2021-02-08 Thread Zhilong Hong (Jira)
Zhilong Hong created FLINK-21331:


 Summary: Optimize calculating tasks to restart in 
RestartPipelinedRegionFailoverStrategy
 Key: FLINK-21331
 URL: https://issues.apache.org/jira/browse/FLINK-21331
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Zhilong Hong
 Fix For: 1.13.0


RestartPipelinedRegionFailoverStrategy is used to calculate the tasks to restart when a task failure occurs. It consists of two parts: first, calculating the regions to restart; second, adding all the tasks in these regions to the restart queue.

The bottleneck is mainly in the first part. This part traverses all the 
upstream and downstream regions of the failed region to determine whether they 
should be restarted or not.

For the failed region itself, if one of its consumed result partitions is not available, the producer, i.e., the upstream region, must restart as well. Likewise, since the failed region will restart, its own result partitions become unavailable, so all the downstream regions need to restart, too.

1. Calculating the upstream regions that should restart

The current implementation is:
{code:java}
for each SchedulingExecutionVertex in current visited SchedulingPipelinedRegion:
  for each consumed SchedulingResultPartition of the SchedulingExecutionVertex:
    if the result partition is not available:
      add the producer region to the restart queue
{code}
Based on FLINK-21328, the consumed result partitions of a vertex are already grouped. Here we can use a HashSet to record the visited result partition groups. Vertices connected with all-to-all edges then only need to traverse each group once, which decreases the time complexity from O(N^2) to O(N).
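The visited-group idea can be sketched as follows; the data shapes here (group ids, an availability map, a producer-region lookup) are hypothetical simplifications of the scheduler's real structures:

```java
import java.util.*;

public class UpstreamRestartSketch {

    // For each vertex of the failed region: the ids of its consumed partition
    // groups. With all-to-all edges every vertex shares one group, so the
    // visited set makes the availability check run only once overall.
    static List<String> regionsToRestart(
            List<List<String>> consumedGroupsPerVertex,
            Map<String, Boolean> groupAvailable,
            Map<String, String> producerRegionOfGroup) {
        Set<String> visited = new HashSet<>();
        List<String> toRestart = new ArrayList<>();
        for (List<String> groups : consumedGroupsPerVertex) {
            for (String groupId : groups) {
                if (!visited.add(groupId)) {
                    continue; // group already checked once
                }
                if (!groupAvailable.get(groupId)) {
                    toRestart.add(producerRegionOfGroup.get(groupId));
                }
            }
        }
        return toRestart;
    }

    public static void main(String[] args) {
        // Three vertices, all consuming the same all-to-all group "g0",
        // whose partitions are not available anymore.
        List<List<String>> consumed = Arrays.asList(
                Collections.singletonList("g0"),
                Collections.singletonList("g0"),
                Collections.singletonList("g0"));
        Map<String, Boolean> available = Collections.singletonMap("g0", false);
        Map<String, String> producer = Collections.singletonMap("g0", "upstreamRegion");
        System.out.println(regionsToRestart(consumed, available, producer));
    }
}
```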

2. Calculating the downstream regions that should restart

The current implementation is:
{code:java}
for each SchedulingExecutionVertex in current visited SchedulingPipelinedRegion:
  for each produced SchedulingResultPartition of the SchedulingExecutionVertex:
    for each consumer SchedulingExecutionVertex of the produced SchedulingResultPartition:
      if the region containing the consumer SchedulingExecutionVertex is not visited:
        add the region to the restart queue
{code}
Since the count of the produced result partitions of a vertex equals the count 
of output JobEdges, the time complexity of this procedure is actually O(N^2). 
As the consumer vertices of a result partition are already grouped, we can use 
a HashSet to record the visited ConsumerVertexGroup. The time complexity 
decreases from O(N^2) to O(N).
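A minimal sketch of the downstream counterpart, again using illustrative stand-ins for ConsumerVertexGroup and the region lookup rather than the actual Flink classes:

```java
import java.util.*;

public class DownstreamRestartSketch {

    // For each produced partition of the failed region: the id of its
    // ConsumerVertexGroup, plus a lookup from group id to the regions that
    // contain its consumer vertices.
    static Set<String> downstreamRegionsToRestart(
            List<String> consumerGroupOfPartition,
            Map<String, List<String>> regionsOfGroup) {
        Set<String> visitedGroups = new HashSet<>();
        Set<String> toRestart = new LinkedHashSet<>();
        for (String groupId : consumerGroupOfPartition) {
            if (!visitedGroups.add(groupId)) {
                continue; // each ConsumerVertexGroup is handled once: O(N)
            }
            toRestart.addAll(regionsOfGroup.get(groupId));
        }
        return toRestart;
    }

    public static void main(String[] args) {
        // Two produced partitions feeding the same all-to-all consumer group.
        List<String> partitions = Arrays.asList("cvg0", "cvg0");
        Map<String, List<String>> regions =
                Collections.singletonMap("cvg0", Arrays.asList("r1", "r2"));
        System.out.println(downstreamRegionsToRestart(partitions, regions));
    }
}
```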





[jira] [Updated] (FLINK-21110) Optimize scheduler performance for large-scale jobs

2021-02-08 Thread Zhilong Hong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhilong Hong updated FLINK-21110:
-
Summary: Optimize scheduler performance for large-scale jobs  (was: 
Optimize Scheduler Performance for Large-Scale Jobs)

> Optimize scheduler performance for large-scale jobs
> ---
>
> Key: FLINK-21110
> URL: https://issues.apache.org/jira/browse/FLINK-21110
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Zhilong Hong
>Assignee: Zhilong Hong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
> Attachments: Illustration of Group.jpg
>
>
> According to the result of scheduler benchmarks we implemented in 
> FLINK-20612, the bottleneck of deploying and running a large-scale job in 
> Flink is mainly focused on the following procedures:
> |Procedure|Time complexity|
> |Initializing ExecutionGraph|O(N^2)|
> |Building DefaultExecutionTopology|O(N^2)|
> |Initializing PipelinedRegionSchedulingStrategy|O(N^2)|
> |Scheduling downstream tasks when a task finishes|O(N^2)|
> |Calculating tasks to restart when a failover occurs|O(N^2)|
> |Releasing result partitions|O(N^3)|
> These procedures are all related to the complexity of the topology in the 
> ExecutionGraph. Between two vertices connected with the all-to-all edges, all 
> the upstream Intermediate ResultPartitions are connected to all downstream 
> ExecutionVertices. The computation complexity of building and traversing all 
> these edges will be O(N^2). 
> As for memory usage, we currently use ExecutionEdges to store the information 
> of connections. For the all-to-all distribution type, there are O(N^2) 
> ExecutionEdges. We tested a simple job with only two vertices, each with a 
> parallelism of 10k, connected with all-to-all edges. It takes 4.175 GiB 
> (estimated via MXBean) to store the 100M ExecutionEdges.
> In most large-scale jobs there are more than two vertices with large 
> parallelisms, so deploying such jobs costs a lot of time and memory.
> As we can see, for two JobVertices connected with the all-to-all distribution 
> type, all IntermediateResultPartitions produced by the upstream 
> ExecutionVertices are isomorphic, which means that the downstream 
> ExecutionVertices they connected are exactly the same. The downstream 
> ExecutionVertices belonging to the same JobVertex are also isomorphic, as the 
> upstream ResultPartitions they connect are the same, too.
> Since every JobEdge has exactly one distribution type, we can divide the 
> vertices and result partitions into groups according to the distribution type 
> of the JobEdge. 
> For the all-to-all distribution type, since all downstream vertices are 
> isomorphic, they belong to a single group, and all the upstream result 
> partitions are connected to this group. Vice versa, all the upstream result 
> partitions also belong to a single group, and all the downstream vertices are 
> connected to this group. In the past, when we wanted to iterate all the 
> downstream vertices, we needed to loop over them n times, which leads to the 
> complexity of O(N^2). Now since all upstream result partitions are connected 
> to one downstream group, we just need to loop over them once, with the 
> complexity of O(N).
> For the pointwise distribution type, because each result partition is 
> connected to different downstream vertices, they should belong to different 
> groups. Vice versa, all the vertices belong to different groups. Since one 
> result partition group is connected to one vertex group pointwisely, the 
> computation complexity of looping over them is still O(N).
> !Illustration of Group.jpg|height=249!
> After we group the result partitions and vertices, ExecutionEdge is no longer 
> needed. For the test job we mentioned above, the optimization can effectively 
> reduce the memory usage from 4.175 GiB to 12.076 MiB (estimated via MXBean) 
> in our POC. The time cost is reduced from 62.090 seconds to 8.551 seconds 
> (with 10k parallelism).
>  
> The detailed design doc: 
> https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit?usp=sharing
>  
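The memory argument in the description above can be illustrated with a back-of-the-envelope count: for two vertices of parallelism n connected all-to-all, per-edge bookkeeping needs n * n entries, while the grouped form needs only one shared group plus one reference per vertex on each side. The numbers below are this arithmetic only, not a measurement of Flink itself:

```java
public class EdgeCountSketch {
    public static void main(String[] args) {
        long n = 10_000;
        long executionEdges = n * n;      // one ExecutionEdge per producer/consumer pair
        long groupedReferences = n + n;   // each side holds one reference to a shared group
        System.out.println("edges: " + executionEdges);
        System.out.println("grouped: " + groupedReferences);
    }
}
```

This is why removing ExecutionEdge in favor of groups shrinks the footprint from quadratic to linear in the parallelism.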





[jira] [Updated] (FLINK-21330) Optimize the initialization of PipelinedRegionSchedulingStrategy

2021-02-08 Thread Zhilong Hong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhilong Hong updated FLINK-21330:
-
Summary: Optimize the initialization of PipelinedRegionSchedulingStrategy  
(was: Optimization the initialization of PipelinedRegionSchedulingStrategy)

> Optimize the initialization of PipelinedRegionSchedulingStrategy
> 
>
> Key: FLINK-21330
> URL: https://issues.apache.org/jira/browse/FLINK-21330
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Zhilong Hong
>Priority: Major
> Fix For: 1.13.0
>
>
> PipelinedRegionSchedulingStrategy is used for task scheduling. Its 
> initialization is located at {{PipelinedRegionSchedulingStrategy#init}}. The 
> initialization can be divided into two parts:
>  # Calculating consumed result partitions of SchedulingPipelinedRegions
>  # Calculating the consumer pipelined region of SchedulingResultPartition
> Based on FLINK-21328, the {{consumedResults}} of 
> {{DefaultSchedulingPipelinedRegion}} can be replaced with 
> {{ConsumedPartitionGroup}}.
> Then we can optimize the procedures we mentioned above. After the 
> optimization, the time complexity decreases from O(N^2) to O(N).
> The related usage of {{getConsumedResults}} should be replaced, too.
> The detailed design doc is located at: 
> [https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit#heading=h.a1mz4yjpry6m]





[jira] [Updated] (FLINK-21328) Optimize the initialization of DefaultExecutionTopology

2021-02-08 Thread Zhilong Hong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhilong Hong updated FLINK-21328:
-
Description: 
Based on FLINK-21326, the {{consumedResults}} in {{DefaultExecutionVertex}} and 
{{consumers}} in {{DefaultResultPartition}} can be replaced with 
{{ConsumedPartitionGroup}} and {{ConsumerVertexGroup}} in {{EdgeManager}}.
 # The method {{DefaultExecutionTopology#connectVerticesToConsumedPartitions}} 
could be removed.
 # All the related usages should be fixed.

The detailed design doc is located at: 
https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit#heading=h.zary7uv1ev22

  was:
Based on FLINK-21326, the {{consumedResults}} in {{DefaultExecutionVertex}} and 
{{consumers}} in {{DefaultResultPartition}} can be replaced with 
{{ConsumedPartitionGroup}} and {{ConsumerVertexGroup}} in {{EdgeManager}}.
 # The method {{DefaultExecutionTopology#connectVerticesToConsumedPartitions}} 
could be removed.
 # All the related usages should be fixed.


> Optimize the initialization of DefaultExecutionTopology
> ---
>
> Key: FLINK-21328
> URL: https://issues.apache.org/jira/browse/FLINK-21328
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Zhilong Hong
>Priority: Major
> Fix For: 1.13.0
>
>
> Based on FLINK-21326, the {{consumedResults}} in {{DefaultExecutionVertex}} 
> and {{consumers}} in {{DefaultResultPartition}} can be replaced with 
> {{ConsumedPartitionGroup}} and {{ConsumerVertexGroup}} in {{EdgeManager}}.
>  # The method 
> {{DefaultExecutionTopology#connectVerticesToConsumedPartitions}} could be 
> removed.
>  # All the related usages should be fixed.
> The detailed design doc is located at: 
> https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit#heading=h.zary7uv1ev22





[jira] [Updated] (FLINK-21330) Optimization the initialization of PipelinedRegionSchedulingStrategy

2021-02-08 Thread Zhilong Hong (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhilong Hong updated FLINK-21330:
-
Description: 
PipelinedRegionSchedulingStrategy is used for task scheduling. Its 
initialization is located at {{PipelinedRegionSchedulingStrategy#init}}. The 
initialization can be divided into two parts:
 # Calculating consumed result partitions of SchedulingPipelinedRegions
 # Calculating the consumer pipelined region of SchedulingResultPartition

Based on FLINK-21328, the {{consumedResults}} of 
{{DefaultSchedulingPipelinedRegion}} can be replaced with 
{{ConsumedPartitionGroup}}.

Then we can optimize the procedures we mentioned above. After the optimization, 
the time complexity decreases from O(N^2) to O(N).

The related usage of {{getConsumedResults}} should be replaced, too.

The detailed design doc is located at: 
[https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit#heading=h.a1mz4yjpry6m]

  was:
{{PipelinedRegionSchedulingStrategy}} is used for task scheduling. Its 
initialization is located at {{PipelinedRegionSchedulingStrategy#init}}. The 
initialization can be divided into two parts:
 # Calculating consumed result partitions of SchedulingPipelinedRegions
 # Calculating the consumer pipelined region of SchedulingResultPartition

Based on FLINK-21328, the {{consumedResults}} of 
{{DefaultSchedulingPipelinedRegion}} can be replaced with 
{{ConsumedPartitionGroup}}.

Then we can optimize the procedures we mentioned above. After the optimization, 
the time complexity decreases from O(N^2) to O(N).

The related usage of {{getConsumedResults}} should be replaced, too.

The detailed design doc is located at: 
[https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit#heading=h.a1mz4yjpry6m]


> Optimization the initialization of PipelinedRegionSchedulingStrategy
> 
>
> Key: FLINK-21330
> URL: https://issues.apache.org/jira/browse/FLINK-21330
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Zhilong Hong
>Priority: Major
> Fix For: 1.13.0
>
>
> PipelinedRegionSchedulingStrategy is used for task scheduling. Its 
> initialization is located at {{PipelinedRegionSchedulingStrategy#init}}. The 
> initialization can be divided into two parts:
>  # Calculating consumed result partitions of SchedulingPipelinedRegions
>  # Calculating the consumer pipelined region of SchedulingResultPartition
> Based on FLINK-21328, the {{consumedResults}} of 
> {{DefaultSchedulingPipelinedRegion}} can be replaced with 
> {{ConsumedPartitionGroup}}.
> Then we can optimize the procedures we mentioned above. After the 
> optimization, the time complexity decreases from O(N^2) to O(N).
> The related usage of {{getConsumedResults}} should be replaced, too.
> The detailed design doc is located at: 
> [https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit#heading=h.a1mz4yjpry6m]





[GitHub] [flink] flinkbot edited a comment on pull request #14906: [FLINK-21274][runtime] Change the ClusterEntrypoint.runClusterEntrypoint to wait on the result of clusterEntrypoint.getTerminationFut

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14906:
URL: https://github.com/apache/flink/pull/14906#issuecomment-775715265


   
   ## CI report:
   
   * 2951f7e582aecc00293b92003de99c6bf84aa470 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13136)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[jira] [Created] (FLINK-21330) Optimization the initialization of PipelinedRegionSchedulingStrategy

2021-02-08 Thread Zhilong Hong (Jira)
Zhilong Hong created FLINK-21330:


 Summary: Optimization the initialization of 
PipelinedRegionSchedulingStrategy
 Key: FLINK-21330
 URL: https://issues.apache.org/jira/browse/FLINK-21330
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Zhilong Hong
 Fix For: 1.13.0


{{PipelinedRegionSchedulingStrategy}} is used for task scheduling. Its 
initialization is located at {{PipelinedRegionSchedulingStrategy#init}}. The 
initialization can be divided into two parts:
 # Calculating consumed result partitions of SchedulingPipelinedRegions
 # Calculating the consumer pipelined region of SchedulingResultPartition

Based on FLINK-21328, the {{consumedResults}} of 
{{DefaultSchedulingPipelinedRegion}} can be replaced with 
{{ConsumedPartitionGroup}}.

Then we can optimize the procedures we mentioned above. After the optimization, 
the time complexity decreases from O(N^2) to O(N).

The related usage of {{getConsumedResults}} should be replaced, too.

The detailed design doc is located at: 
[https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit#heading=h.a1mz4yjpry6m]





[GitHub] [flink] becketqin commented on a change in pull request #14789: [FLINK-21159][connector/kafka] Send NoMoreSplitsEvent to all readers even if the reader is not assigned with any splits

2021-02-08 Thread GitBox


becketqin commented on a change in pull request #14789:
URL: https://github.com/apache/flink/pull/14789#discussion_r572647867



##
File path: flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/enumerator/KafkaSourceEnumerator.java
##
@@ -276,16 +284,21 @@ private void assignPendingPartitionSplits() {
                         .addAll(newPartitionSplits);
                 // Clear the pending splits for the reader owner.
                 pendingPartitionSplitAssignment.remove(readerOwner);
-                // Sends NoMoreSplitsEvent to the readers if there is no more partition splits
-                // to be assigned.
-                if (noMoreNewPartitionSplits) {
-                    LOG.debug(
-                            "No more KafkaPartitionSplits to assign. Sending NoMoreSplitsEvent to the readers "
-                                    + "in consumer group {}.",
-                            consumerGroupId);
-                    context.signalNoMoreSplits(readerOwner);
-                }
                 });
+        if (noMoreNewPartitionSplits) {
+            signalNoMoreSplitsToNotNotifiedReaders();
+        }
+    }
+
+    private void signalNoMoreSplitsToNotNotifiedReaders() {
+        context.registeredReaders()
+                .forEach(
+                        (readerId, ignore) -> {
+                            if (!finishedReaders.contains(readerId)) {
+                                context.signalNoMoreSplits(readerId);
+                                finishedReaders.add(readerId);

Review comment:
   What happens if the readers failover? Do we assume that the source 
readers remember the reception of `NoMoreSplitsEvent`?

##
File path: flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/enumerator/KafkaSourceEnumerator.java
##
@@ -64,6 +64,8 @@
     private final Properties properties;
     private final long partitionDiscoveryIntervalMs;
     private final SplitEnumeratorContext context;
+    private volatile boolean partitionDiscoveryFinished = false;

Review comment:
   It seems that we can just reuse the `noMoreNewPartitionSplits` flag.
   
   - If the periodic partition discovery is disabled, then after the first 
partition discovery is done, set the `noMoreNewPartitionSplits` flag to true. 
The subsequent `assignPendingPartitionSplits` will just send the 
`NoMoreSplitsEvent` to all the readers. As long as the flag is set in the main 
thread, the readers who registered before the first partition discovery is done 
will not receive duplicate `NoMoreSplitsEvent` in this case.
   - Otherwise, the `noMoreNewPartitionSplits` will always be set to false, and 
no `NoMoreSplitsEvent` will be sent.
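The de-duplication being discussed in this thread can be sketched as below. `signalNoMoreSplitsToNotNotifiedReaders` and the recording `signals` list are illustrative stand-ins for this example; in the real enumerator, `signalNoMoreSplits` would be the call on `SplitEnumeratorContext`:

```java
import java.util.*;

public class NoMoreSplitsSketch {

    // Readers that have already been told there are no more splits.
    private final Set<Integer> finishedReaders = new HashSet<>();
    // Records which readers got the signal, so the sketch is observable.
    private final List<Integer> signals = new ArrayList<>();

    void signalNoMoreSplits(int readerId) {
        signals.add(readerId);
    }

    // Safe to call on every assignment round: each reader is notified once.
    void signalNoMoreSplitsToNotNotifiedReaders(Set<Integer> registeredReaders) {
        for (int readerId : registeredReaders) {
            if (finishedReaders.add(readerId)) { // first notification only
                signalNoMoreSplits(readerId);
            }
        }
    }

    public static void main(String[] args) {
        NoMoreSplitsSketch enumerator = new NoMoreSplitsSketch();
        Set<Integer> readers = new TreeSet<>(Arrays.asList(0, 1));
        enumerator.signalNoMoreSplitsToNotNotifiedReaders(readers);
        enumerator.signalNoMoreSplitsToNotNotifiedReaders(readers); // no duplicates sent
        System.out.println(enumerator.signals);
    }
}
```

Note this sketch deliberately sidesteps the failover question raised above: if a reader restarts and forgets the event, the `finishedReaders` set would need to be pruned when the reader re-registers.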









[jira] [Closed] (FLINK-21102) Add ScaleUpController

2021-02-08 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-21102.
--
Resolution: Fixed

Merged to master in 
https://github.com/apache/flink/commit/b59057cdec0339e73a6252d99d20c2d67b205b24

> Add ScaleUpController
> -
>
> Key: FLINK-21102
> URL: https://issues.apache.org/jira/browse/FLINK-21102
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.13.0
>Reporter: Till Rohrmann
>Assignee: Robert Metzger
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> Add the {{ScaleUpController}} according to the definition in 
> [FLIP-160|https://cwiki.apache.org/confluence/x/mwtRCg].





[GitHub] [flink] rmetzger merged pull request #14874: [FLINK-21102] Add ScaleUpController for declarative scheduler

2021-02-08 Thread GitBox


rmetzger merged pull request #14874:
URL: https://github.com/apache/flink/pull/14874


   







[jira] [Commented] (FLINK-21133) FLIP-27 Source does not work with synchronous savepoint

2021-02-08 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281581#comment-17281581
 ] 

Matthias commented on FLINK-21133:
--

Findings mentioned in duplicate FLINK-21323:

{quote}
When looking into FLINK-21030, analyzing the stop-with-savepoint behavior, we implemented different test jobs covering the old addSource and the new fromSource methods for adding sources. Stop-with-savepoint consists of two phases:
 # Create the savepoint
 # Stop the source to trigger finalizing the job

The test using fromSource does not succeed in the second phase. The reason for this might be that finishTask is not implemented by SourceOperatorStreamTask, in contrast to SourceStreamTask, which is used when calling addSource in the job definition. Hence, the job termination is never triggered.

We might have missed this due to a naming error in the JobMasterStopWithSavepointIT test, which is not triggered by Maven because of the wrong suffix used in this case. The IT is failing right now. FLINK-21031 already covers the fix of JobMasterStopWithSavepointIT.
{quote}

> FLIP-27 Source does not work with synchronous savepoint
> ---
>
> Key: FLINK-21133
> URL: https://issues.apache.org/jira/browse/FLINK-21133
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, Runtime / Checkpointing
>Affects Versions: 1.11.3, 1.12.1
>Reporter: Kezhu Wang
>Priority: Major
>
> I have pushed branch 
> [synchronous-savepoint-conflict-with-bounded-end-input-case|https://github.com/kezhuw/flink/commits/synchronous-savepoint-conflict-with-bounded-end-input-case]
>  in my repository. {{SavepointITCase.testStopSavepointWithFlip27Source}} 
> failed due to timeout.
> See also FLINK-21132 and 
> [apache/iceberg#2033|https://github.com/apache/iceberg/issues/2033]..





[jira] [Closed] (FLINK-21323) Stop-with-savepoint is not supported by SourceOperatorStreamTask

2021-02-08 Thread Matthias (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias closed FLINK-21323.

Resolution: Duplicate

Thanks for looking into this, [~roman_khachatryan] and [~kezhuw]. I guess, 
you're right. I overlooked FLINK-21133 when browsing for already opened issues. 
I'm gonna close this ticket in favor of FLINK-21133 and move the findings over 
there.

> Stop-with-savepoint is not supported by SourceOperatorStreamTask
> 
>
> Key: FLINK-21323
> URL: https://issues.apache.org/jira/browse/FLINK-21323
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.11.3, 1.12.1
>Reporter: Matthias
>Priority: Critical
> Fix For: 1.11.4, 1.12.2, 1.13.0
>
>
> When looking into FLINK-21030 analyzing the stop-with-savepoint behavior we 
> implemented different test jobs covering the old {{addSource}} and new 
> {{fromSource}} methods for adding sources. The stop-with-savepoint consists 
> of two phases:
>  # Create the savepoint
>  # Stop the source to trigger finalizing the job
> The test using {{fromSource}} does not succeed in the second phase. 
> The reason for this might be that {{finishTask}} is not implemented by 
> {{SourceOperatorStreamTask}} in contrast to {{SourceStreamTask}} which is 
> used when calling {{addSource}} in the job definition. Hence, the job 
> termination is never triggered.
> We might have missed this due to some naming error of 
> {{JobMasterStopWithSavepointIT}} test that is not triggered by Maven due to 
> the wrong suffix used in this case. The IT is failing right now. FLINK-21031 
> is covering the fix of {{JobMasterStopWithSavepointIT}} already.





[jira] [Created] (FLINK-21329) "Local recovery and sticky scheduling end-to-end test" does not finish within 600 seconds

2021-02-08 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21329:
--

 Summary: "Local recovery and sticky scheduling end-to-end test" 
does not finish within 600 seconds
 Key: FLINK-21329
 URL: https://issues.apache.org/jira/browse/FLINK-21329
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13118=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=ff888d9b-cd34-53cc-d90f-3e446d355529

{code}
Feb 08 22:25:46 
==
Feb 08 22:25:46 Running 'Local recovery and sticky scheduling end-to-end test'
Feb 08 22:25:46 
==
Feb 08 22:25:46 TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-46881214821
Feb 08 22:25:47 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT
Feb 08 22:25:47 Running local recovery test with configuration:
Feb 08 22:25:47 parallelism: 4
Feb 08 22:25:47 max attempts: 10
Feb 08 22:25:47 backend: rocks
Feb 08 22:25:47 incremental checkpoints: false
Feb 08 22:25:47 kill JVM: false
Feb 08 22:25:47 Starting zookeeper daemon on host fv-az127-394.
Feb 08 22:25:47 Starting HA cluster with 1 masters.
Feb 08 22:25:48 Starting standalonesession daemon on host fv-az127-394.
Feb 08 22:25:49 Starting taskexecutor daemon on host fv-az127-394.
Feb 08 22:25:49 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:50 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:51 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:53 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:54 Dispatcher REST endpoint is up.
Feb 08 22:25:54 Started TM watchdog with PID 28961.
Feb 08 22:25:58 Job has been submitted with JobID 
e790e85a39040539f9386c0df7ca4812
Feb 08 22:35:47 Test (pid: 27970) did not finish after 600 seconds.
Feb 08 22:35:47 Printing Flink logs and killing it:

{code}

and

{code}

at 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.unhandledError(ZooKeeperLeaderRetrievalDriver.java:184)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at 
org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss
at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
... 10 more

{code}




[GitHub] [flink] rmetzger commented on pull request #14874: [FLINK-21102] Add ScaleUpController for declarative scheduler

2021-02-08 Thread GitBox


rmetzger commented on pull request #14874:
URL: https://github.com/apache/flink/pull/14874#issuecomment-775727039


   Created JIRA for the failed e2e: 
https://issues.apache.org/jira/browse/FLINK-21329



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (FLINK-21328) Optimize the initialization of DefaultExecutionTopology

2021-02-08 Thread Zhilong Hong (Jira)
Zhilong Hong created FLINK-21328:


 Summary: Optimize the initialization of DefaultExecutionTopology
 Key: FLINK-21328
 URL: https://issues.apache.org/jira/browse/FLINK-21328
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Zhilong Hong
 Fix For: 1.13.0


Based on FLINK-21326, the {{consumedResults}} in {{DefaultExecutionVertex}} and 
{{consumers}} in {{DefaultResultPartition}} can be replaced with 
{{ConsumedPartitionGroup}} and {{ConsumerVertexGroup}} in {{EdgeManager}}.
 # The method {{DefaultExecutionTopology#connectVerticesToConsumedPartitions}} 
could be removed.
 # All the related usages should be fixed.





[jira] [Created] (FLINK-21327) Support window TVF in batch mode

2021-02-08 Thread Jark Wu (Jira)
Jark Wu created FLINK-21327:
---

 Summary: Support window TVF in batch mode
 Key: FLINK-21327
 URL: https://issues.apache.org/jira/browse/FLINK-21327
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jark Wu


As a unified batch and streaming engine, we should also support running window 
TVFs in batch mode. Users can then use one query in streaming mode to produce 
data in real time, and the same query in batch mode to backfill data for a 
specific day.
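Illustrative only: a tumbling-window TVF query of the kind meant here (the {{Bid}} table and its columns are hypothetical), which would then run unchanged in either streaming or batch mode:

```sql
-- Hypothetical table: Bid(bidtime TIMESTAMP(3), price DECIMAL(10, 2))
-- The same tumbling-window aggregation serves live production and backfill.
SELECT window_start, window_end, SUM(price) AS total_price
FROM TABLE(
    TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```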





[jira] [Updated] (FLINK-21327) Support window TVF in batch mode

2021-02-08 Thread Jark Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jark Wu updated FLINK-21327:

Description: 
As a unified batch and streaming engine, we should also support running window 
TVFs in batch mode. Users can then use one query in streaming mode to produce 
data in real time, and the same query in batch mode to backfill data for a 
specific day.

The batch implementation should be straightforward: we can simply introduce a 
physical node and an exec node for the window TVF.

  was:As a batch and streaming unified engine, we should also support to run 
window TVF in batch mode. Then users can use one query with streaming mode to 
produce data in real-time and use the same query with batch mode to backfill 
data for a specific day.


> Support window TVF in batch mode
> 
>
> Key: FLINK-21327
> URL: https://issues.apache.org/jira/browse/FLINK-21327
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Planner
>Reporter: Jark Wu
>Priority: Major
>
> As a unified batch and streaming engine, we should also support running window 
> TVFs in batch mode. Users can then use one query in streaming mode to produce 
> data in real time, and the same query in batch mode to backfill data for a 
> specific day.
> The batch implementation should be straightforward: we can simply introduce a 
> physical node and an exec node for the window TVF.





[GitHub] [flink] flinkbot commented on pull request #14906: [FLINK-21274][runtime] Change the ClusterEntrypoint.runClusterEntrypoint to wait on the result of clusterEntrypoint.getTerminationFuture().g

2021-02-08 Thread GitBox


flinkbot commented on pull request #14906:
URL: https://github.com/apache/flink/pull/14906#issuecomment-775715265


   
   ## CI report:
   
   * 2951f7e582aecc00293b92003de99c6bf84aa470 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[jira] [Created] (FLINK-21326) Optimize building topology when initializing ExecutionGraph

2021-02-08 Thread Zhilong Hong (Jira)
Zhilong Hong created FLINK-21326:


 Summary: Optimize building topology when initializing 
ExecutionGraph
 Key: FLINK-21326
 URL: https://issues.apache.org/jira/browse/FLINK-21326
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Zhilong Hong
 Fix For: 1.13.0


The main idea of optimizing topology building is to put all the vertices that 
consume the same result partitions into one group, and all the result partitions 
that have the same consumer vertices into another group. The corresponding data 
structures are {{ConsumedPartitionGroup}} and {{ConsumerVertexGroup}}. 
{{EdgeManager}} stores the relationship between the groups, and the procedure of 
creating {{ExecutionEdge}}s is replaced with building the {{EdgeManager}}.

With these improvements, the complexity of building the topology in the 
ExecutionGraph decreases from O(N^2) to O(N).

Furthermore, {{ExecutionEdge}} and all its related usages are replaced with 
{{ConsumedPartitionGroup}} and {{ConsumerVertexGroup}}.

The detailed design doc is located at: 
https://docs.google.com/document/d/1OjGAyJ9Z6KsxcMtBHr6vbbrwP9xye7CdCtrLvf8dFYw/edit?usp=sharing
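As a rough sketch of the grouping idea (plain Java with illustrative names, not Flink's actual classes): for an all-to-all edge, every consumer reads every partition, so instead of materializing N×N individual edges, all N partitions go into one shared group and each consumer stores a single reference to it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EdgeGroupingSketch {

    // Build one shared "consumed partition group" and hand every consumer a
    // reference to it: O(N) work instead of O(N^2) individual edges.
    public static Map<String, List<String>> buildConsumedGroups(int parallelism) {
        List<String> partitions = new ArrayList<>();
        List<String> consumers = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add("partition-" + i);
            consumers.add("vertex-" + i);
        }

        // All-to-all: the consumer set is identical for every partition,
        // so a single immutable group replaces parallelism^2 edges.
        List<String> consumedPartitionGroup = Collections.unmodifiableList(partitions);

        Map<String, List<String>> consumedGroupsByVertex = new HashMap<>();
        for (String vertex : consumers) {
            consumedGroupsByVertex.put(vertex, consumedPartitionGroup); // O(1) per vertex
        }
        return consumedGroupsByVertex;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = buildConsumedGroups(4);
        // 4 vertices share one group of 4 partitions: 4 references, not 16 edges.
        System.out.println(groups.get("vertex-0").size()); // prints 4
    }
}
```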





[jira] [Commented] (FLINK-21321) Change RocksDB incremental checkpoint re-scaling to use deleteRange

2021-02-08 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281569#comment-17281569
 ] 

Yun Tang commented on FLINK-21321:
--

I think this is an interesting idea. Though I have not tracked the bug-fix 
history of {{deleteRange}}, [~legojoey17], do you know in which version of 
RocksDB {{deleteRange}} became non-experimental, or at least has no known bugs?

> Change RocksDB incremental checkpoint re-scaling to use deleteRange
> ---
>
> Key: FLINK-21321
> URL: https://issues.apache.org/jira/browse/FLINK-21321
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Joey Pereira
>Priority: Minor
>  Labels: pull-request-available
>
> In FLINK-8790, it was suggested to use RocksDB's {{deleteRange}} API to more 
> efficiently clip the databases for the desired target group.
> During the PR for that ticket, 
> [#5582|https://github.com/apache/flink/pull/5582], the change did not end up 
> using the {{deleteRange}} method  as it was an experimental feature in 
> RocksDB.
> At this point {{deleteRange}} is in a far less experimental state, although I 
> believe it is still formally "experimental". It is used heavily by others like 
> CockroachDB and TiKV, which have teased out several bugs in complex 
> interactions over the years.
> For certain re-scaling situations where restores trigger 
> {{restoreWithScaling}} and the DB clipping logic, this would likely reduce an 
> O(N) operation (N = state size/records) to O(1). For large-state apps, this 
> potentially represents a non-trivial amount of time spent on re-scaling. At my 
> workplace, we have an operator with 100s of billions of records in state, and 
> re-scaling was taking a long time (>>30min, but it has been a while since we 
> did it).
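To illustrate why range deletion matters, here is a plain-Java analogy using {{NavigableMap}} (this is not the RocksDB API; in RocksJava the corresponding call is, to my understanding, {{RocksDB.deleteRange(beginKey, endKey)}}). Clipping a key range by issuing one delete per record costs O(n) in the number of records, while a range deletion is a single operation:

```java
import java.util.ArrayList;
import java.util.NavigableMap;
import java.util.TreeMap;

public class RangeClipSketch {

    // Per-key clipping: one delete per record inside [begin, end) -- O(n),
    // comparable to iterating and deleting keys while clipping a key-group range.
    public static int clipPerKey(NavigableMap<Integer, String> db, int begin, int end) {
        int deleted = 0;
        // Copy the keys first so we can remove from the map while iterating.
        for (Integer key : new ArrayList<>(db.subMap(begin, true, end, false).keySet())) {
            db.remove(key);
            deleted++;
        }
        return deleted;
    }

    // Range clipping: a single operation over [begin, end), analogous to
    // deleteRange writing one range tombstone instead of n point deletes.
    public static void clipRange(NavigableMap<Integer, String> db, int begin, int end) {
        db.subMap(begin, true, end, false).clear();
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> db = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            db.put(i, "value-" + i);
        }
        clipRange(db, 3, 7); // drop keys 3..6 in one call
        System.out.println(db.size()); // prints 6
    }
}
```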





[GitHub] [flink] flinkbot commented on pull request #14906: [FLINK-21274][runtime] Change the ClusterEntrypoint.runClusterEntrypoint to wait on the result of clusterEntrypoint.getTerminationFuture().g

2021-02-08 Thread GitBox


flinkbot commented on pull request #14906:
URL: https://github.com/apache/flink/pull/14906#issuecomment-775707192


   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit 2951f7e582aecc00293b92003de99c6bf84aa470 (Tue Feb 09 
06:35:20 UTC 2021)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.
Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   







[GitHub] [flink] flinkbot edited a comment on pull request #14895: [FLINK-21295][table] Support 'useModules' and 'listFullModules' in ModuleManager and TableEnvironment

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14895:
URL: https://github.com/apache/flink/pull/14895#issuecomment-774905526


   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * f85e0cb64c50146c1be7a040c91c48128e140bc5 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13135)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[jira] [Updated] (FLINK-21274) At per-job mode, during the exit of the JobManager process, if ioExecutor exits at the end, the System.exit() method will not be executed.

2021-02-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-21274:
---
Labels: pull-request-available  (was: )

> At per-job mode, during the exit of the JobManager process, if ioExecutor 
> exits at the end, the System.exit() method will not be executed.
> --
>
> Key: FLINK-21274
> URL: https://issues.apache.org/jira/browse/FLINK-21274
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0, 1.12.0
>Reporter: Jichao Wang
>Assignee: Jichao Wang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.12.2, 1.13.0
>
> Attachments: 1.png, 2.png, Add wait 5 seconds in 
> org.apache.flink.runtime.history.FsJobArchivist#archiveJob.log, Not add wait 
> 5 seconds.log, application_1612404624605_0010-JobManager.log
>
>
> h2. Latest issue description (2021.02.07)
> I want to try to describe the issue in a more concise way:
> *My issue only appears in per-job mode,*
> In JsonResponseHistoryServerArchivist#archiveExecutionGraph, submit the 
> archive task to ioExecutor for execution. At the same time, 
> ClusterEntrypoint#stopClusterServices exits multiple thread pools in parallel 
> (for example, commonRpcService, metricRegistry, 
> MetricRegistryImpl#executor(in metricRegistry.shutdown())). Think about it, 
> assuming that the archiving process takes 10 seconds to execute, then 
> ExecutorUtils.nonBlockingShutdown will wait 10 seconds before exiting. However, 
> through testing, it was found that the JobManager process exited immediately 
> after commonRpcService and metricRegistry exited. At this time, 
> ExecutorUtils.nonBlockingShutdown is still waiting for the end of the 
> archiving process, so the archiving process will not be completely executed.
> *There are two specific reproduction methods:*
> *Method one:*
> Modify the org.apache.flink.runtime.history.FsJobArchivist#archiveJob method 
> to wait 5 seconds before actually writing to HDFS (simulating a slow write 
> speed scenario).
> {code:java}
> public static Path archiveJob(Path rootPath, JobID jobId, Collection jsonToArchive)
>         throws IOException {
>     try {
>         FileSystem fs = rootPath.getFileSystem();
>         Path path = new Path(rootPath, jobId.toString());
>         OutputStream out = fs.create(path, FileSystem.WriteMode.NO_OVERWRITE);
>         try {
>             LOG.info("===Wait 5 seconds..");
>             Thread.sleep(5000);
>         } catch (InterruptedException e) {
>             e.printStackTrace();
>         }
>         try (JsonGenerator gen = jacksonFactory.createGenerator(out, JsonEncoding.UTF8)) {
>             ...  // Part of the code is omitted here
>         } catch (Exception e) {
>             fs.delete(path, false);
>             throw e;
>         }
>         LOG.info("Job {} has been archived at {}.", jobId, path);
>         return path;
>     } catch (IOException e) {
>         LOG.error("Failed to archive job.", e);
>         throw e;
>     }
> }
> {code}
> The above modification will cause the archive to fail.
> *Method two:*
> In ClusterEntrypoint#stopClusterServices, before 
> ExecutorUtils.nonBlockingShutdown is called, submit a task that waits 10 
> seconds to ioExecutor.
> {code:java}
> ioExecutor.execute(new Runnable() {
>     @Override
>     public void run() {
>         try {
>             LOG.info("===ioExecutor before sleep");
>             Thread.sleep(10000); // wait 10 seconds
>             LOG.info("===ioExecutor after sleep");
>         } catch (InterruptedException e) {
>             e.printStackTrace();
>         }
>     }
> });
> terminationFutures.add(ExecutorUtils.nonBlockingShutdown(
>         shutdownTimeout, TimeUnit.MILLISECONDS, ioExecutor));
> {code}
> According to the above modification, ===ioExecutor before sleep will be 
> printed, but ===ioExecutor after sleep will not be printed.
> *The root cause of the above issue is that all user threads (in the Akka 
> ActorSystem) have exited during the wait, so the daemon threads (in 
> ioExecutor) cannot run to completion.*
>  
> {color:#de350b} *If you already understand my issue, you can skip the 
> following old version of the issue description, and browse the comment area 
> directly*{color}
>  
>  
>  
>  
>  
> h2. Older issue description (2021.02.04)
> This is a partial configuration of my Flink History service (flink-conf.yaml); 
> it is also the configuration of my Flink client.
> {code:java}
> jobmanager.archive.fs.dir: hdfs://hdfsHACluster/flink/completed-jobs/
> 

[jira] [Commented] (FLINK-21323) Stop-with-savepoint is not supported by SourceOperatorStreamTask

2021-02-08 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281568#comment-17281568
 ] 

Kezhu Wang commented on FLINK-21323:


[~mapohl] [~roman_khachatryan] [~pnowojski] Maybe we can provide a fix solely 
for {{SourceOperatorStreamTask}}, but not for chained FLIP-27 sources? This way 
we can deliver it to 1.12.2 on time. I could open a fix for this if it makes 
sense.

> Stop-with-savepoint is not supported by SourceOperatorStreamTask
> 
>
> Key: FLINK-21323
> URL: https://issues.apache.org/jira/browse/FLINK-21323
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.11.3, 1.12.1
>Reporter: Matthias
>Priority: Critical
> Fix For: 1.11.4, 1.12.2, 1.13.0
>
>
> When looking into FLINK-21030 to analyze the stop-with-savepoint behavior, we 
> implemented different test jobs covering the old {{addSource}} and the new 
> {{fromSource}} methods for adding sources. Stop-with-savepoint consists of 
> two phases:
>  # Create the savepoint
>  # Stop the source to trigger finalizing the job
> The test using {{fromSource}} fails in the second phase. The reason might be 
> that {{finishTask}} is not implemented by {{SourceOperatorStreamTask}}, in 
> contrast to {{SourceStreamTask}}, which is used when calling {{addSource}} in 
> the job definition. Hence, job termination is never triggered.
> We might have missed this due to a naming error in the 
> {{JobMasterStopWithSavepointIT}} test, which is not picked up by Maven because 
> of the wrong suffix. The IT is failing right now. FLINK-21031 already covers 
> the fix of {{JobMasterStopWithSavepointIT}}.





[GitHub] [flink] wjc920 opened a new pull request #14906: [FLINK-21274][runtime] Change the ClusterEntrypoint.runClusterEntrypoint to wait on the result of clusterEntrypoint.getTerminationFuture().get

2021-02-08 Thread GitBox


wjc920 opened a new pull request #14906:
URL: https://github.com/apache/flink/pull/14906


   
   ## What is the purpose of the change
   
   In some cases no non-daemon threads remain, which causes the JVM to 
terminate before all tasks have completed. 
   
   To solve this problem, my fix is to block the main thread so that the JVM 
exits only after all tasks have completed.
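A minimal sketch of the failure mode and the fix (illustrative names, not Flink's actual shutdown code): work queued on a daemon-thread executor dies with the JVM once the last non-daemon thread returns, so the main thread blocks on the task's future before exiting.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class DaemonShutdownSketch {

    // Submit a slow "archive" task to a daemon-thread executor and block the
    // calling (non-daemon) thread until it finishes. Without the get(), the
    // JVM could exit before the daemon thread completes the task.
    public static boolean runArchiveAndWait() throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "io-daemon");
            t.setDaemon(true); // daemon threads cannot keep the JVM alive
            return t;
        });

        AtomicBoolean archived = new AtomicBoolean(false);
        Future<?> archiveTask = ioExecutor.submit(() -> {
            try {
                Thread.sleep(200); // simulate a slow archive write
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            archived.set(true);
        });

        archiveTask.get(); // keep a non-daemon thread alive until the work is done
        ioExecutor.shutdown();
        return archived.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("archived=" + runArchiveAndWait()); // archived=true
    }
}
```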
   
   ## Brief change log
   
 - Change the ClusterEntrypoint.runClusterEntrypoint to wait on the result 
of clusterEntrypoint.getTerminationFuture().get() and do the System.exit 
outside of the future callback
   
   
   ## Verifying this change
   
   This change is already covered by existing tests.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / **no** / 
don't know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / **no**)
 - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)







[jira] [Commented] (FLINK-21132) BoundedOneInput.endInput is called when taking synchronous savepoint

2021-02-08 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281563#comment-17281563
 ] 

Kezhu Wang commented on FLINK-21132:


{quote}I saw a hole in notifyCheckpointAbortAsync where there is no 
resetSynchronousSavepointId. I will verify it using test case. – from me
{quote}
[~roman_khachatryan] [~pnowojski] This does not hold, as 
{{CheckpointCoordinator}} will fail the job 
({{CheckpointFailureManager.handleSynchronousSavepointFailure}}) after a 
synchronous-savepoint (e.g. stop-with-savepoint) failure. But the causal chain is 
a bit long, so I suggest always calling {{resetSynchronousSavepointId}}. 
[pr#14815|https://github.com/apache/flink/pull/14815/files#diff-0c5fe245445b932fa83fdaf5c4802dbe671b73e8d76f39058c6eaaaffd9639faR1080]
 already has this, so there is no more concern from my side.
{quote}Data flow and events between TaskManagers are using the same channels. 
RPCs from JobManager to the TaskManagers are using a different channel, but 
they won't be processed if the Task's thread (mailbox) is blocked.
{quote}
I probably used the wrong words. I use "data flow" to represent {{StreamElement}} 
and {{RuntimeEvent}} (including {{EndOfPartitionEvent}}, {{CheckpointBarrier}}, 
etc.) and "gateway event" to represent 
{{TaskExecutorGateway.triggerCheckpoint/confirmCheckpoint}}, which are RPC 
methods. I think we are aligned on this.

Checkpoint completion and cancellation are delivered using 
{{TaskMailbox.MAX_PRIORITY}}, so they will be processed along with other 
control mails in {{runSynchronousSavepointMailboxLoop}}. I don't think either 
approach changes this part of the code, so if there is a deadlock afterwards, it 
probably existed before.
{quote}To avoid deadlocks one would have to carefully thought this through.
{quote}
[~roman_khachatryan] [~pnowojski] Is it worth a dedicated issue to discuss this 
no-end-of-partition style of stop-with-savepoint? I see value in this approach:
 # It is applicable to the mailbox model. FLINK-21133 could be fixed without 
touching FLIP-27 source code. FLINK-21133 may be relatively complicated due to 
source chaining with multiple input streams. Though I think this could also be 
achieved in end-of-partition style by rearranging {{StreamTask.finishTask}} to 
be reached from {{StreamTask.triggerCheckpointAsync}}, that would need more 
bookkeeping.
 # Cleaner code? We don't need end-of-input detection during 
stop-with-savepoint. Otherwise it could become tangled/confusing when reasoning 
about the data flow during stop-with-savepoint in the future, e.g. why 
end-of-input events are emitted and then stripped.

> BoundedOneInput.endInput is called when taking synchronous savepoint
> 
>
> Key: FLINK-21132
> URL: https://issues.apache.org/jira/browse/FLINK-21132
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.10.2, 1.10.3, 1.11.3, 1.12.1
>Reporter: Kezhu Wang
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.4, 1.12.2, 1.13.0
>
>
> [~elkhand](?) reported on project iceberg that {{BoundedOneInput.endInput}} 
> was 
> [called|https://github.com/apache/iceberg/issues/2033#issuecomment-765864038] 
> when [stopping job with 
> savepoint|https://github.com/apache/iceberg/issues/2033#issuecomment-765557995].
> I think it is a bug in Flink, introduced in FLINK-14230. The 
> [changes|https://github.com/apache/flink/pull/9854/files#diff-0c5fe245445b932fa83fdaf5c4802dbe671b73e8d76f39058c6eaaaffd9639faL577]
>  rely on {{StreamTask.afterInvoke}} and {{OperatorChain.closeOperators}} being 
> invoked only after *end of input*. But that has not been true since 
> [FLIP-34: Terminate/Suspend Job with 
> Savepoint|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212].
>  A task could enter the 
> [*finished*|https://github.com/apache/flink/blob/3a8e06cd16480eacbbf0c10f36b8c79a6f741814/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L467]
>  state after a synchronous savepoint, which is an expected job suspension and stopping.
> [~sunhaibotb] [~pnowojski] [~roman_khachatryan] Could you help confirm this?
> For full context, see 
> [apache/iceberg#2033|https://github.com/apache/iceberg/issues/2033]. I have 
> pushed branch 
> [synchronous-savepoint-conflict-with-bounded-end-input-case|https://github.com/kezhuw/flink/commits/synchronous-savepoint-conflict-with-bounded-end-input-case]
>  in my repository. Test case 
> {{SavepointITCase.testStopSavepointWithBoundedInput}} failed due to 
> {{BoundedOneInput.endInput}} called.
> I am also aware of [FLIP-147: Support Checkpoints After Tasks 
> Finished|https://cwiki.apache.org/confluence/x/mw-ZCQ], maybe the three 
> should align on what *finished* means 

[GitHub] [flink] aimiebell commented on a change in pull request #14754: [FLINK-21127][runtime][checkpoint] Stores finished status for fully finished operators in checkpoint

2021-02-08 Thread GitBox


aimiebell commented on a change in pull request #14754:
URL: https://github.com/apache/flink/pull/14754#discussion_r572622956



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java
##
@@ -1095,7 +1084,7 @@ public void initFailureCause(Throwable t) {
  */
 void vertexFinished() {
 assertRunningInJobMasterMainThread();
-final int numFinished = ++verticesFinished;
+final int numFinished = ++numFinishedVertices;

Review comment:
   I detect that this code is problematic. According to the [Correctness 
(CORRECTNESS)](https://spotbugs.readthedocs.io/en/stable/bugDescriptions.html#correctness-correctness),
 [DLS: Overwritten increment 
(DLS_OVERWRITTEN_INCREMENT)](https://spotbugs.readthedocs.io/en/stable/bugDescriptions.html#dls-overwritten-increment-dls-overwritten-increment).
   The code performs an increment operation (e.g., i++) and then immediately 
overwrites it. For example, i = i++ immediately overwrites the incremented 
value with the original value.
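As an aside (my own reading, not part of the original report): the SpotBugs pattern targets expressions like `i = i++`, where the incremented value is immediately overwritten; assigning a pre-increment to a *different* variable, as the line under review does, is well-defined. A minimal demonstration:

```java
public class OverwrittenIncrementSketch {
    public static void main(String[] args) {
        int i = 5;
        i = i++; // DLS_OVERWRITTEN_INCREMENT: the increment is overwritten
        System.out.println(i); // prints 5, not 6

        int counter = 5;
        int numFinished = ++counter; // different target variable: well-defined
        System.out.println(counter + " " + numFinished); // prints 6 6
    }
}
```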
   









[GitHub] [flink] flinkbot edited a comment on pull request #14895: [FLINK-21295][table] Support 'useModules' and 'listFullModules' in ModuleManager and TableEnvironment

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14895:
URL: https://github.com/apache/flink/pull/14895#issuecomment-774905526


   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * f85e0cb64c50146c1be7a040c91c48128e140bc5 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] LadyForest commented on pull request #14895: [FLINK-21295][table] Support 'useModules' and 'listFullModules' in ModuleManager and TableEnvironment

2021-02-08 Thread GitBox


LadyForest commented on pull request #14895:
URL: https://github.com/apache/flink/pull/14895#issuecomment-775689515


   @flinkbot run azure







[GitHub] [flink] flinkbot edited a comment on pull request #14905: [FLINK-19608][table-planner-blink] Support TVF based window aggregate in planner

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14905:
URL: https://github.com/apache/flink/pull/14905#issuecomment-775660513


   
   ## CI report:
   
   * 7908eb32eee9fba4b38936424fe8e40a6be209d2 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13134)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] flinkbot edited a comment on pull request #14905: [FLINK-19608][table-planner-blink] Support TVF based window aggregate in planner

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14905:
URL: https://github.com/apache/flink/pull/14905#issuecomment-775660513


   
   ## CI report:
   
   * 7908eb32eee9fba4b38936424fe8e40a6be209d2 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13134)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[jira] [Assigned] (FLINK-20950) SinkITCase.writerAndGlobalCommitterExecuteInStreamingMode test failed with "AssertionError"

2021-02-08 Thread Guowei Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guowei Ma reassigned FLINK-20950:
-

Assignee: Guowei Ma

> SinkITCase.writerAndGlobalCommitterExecuteInStreamingMode test failed with 
> "AssertionError"
> ---
>
> Key: FLINK-20950
> URL: https://issues.apache.org/jira/browse/FLINK-20950
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.13.0
>Reporter: Huang Xingbo
>Assignee: Guowei Ma
>Priority: Critical
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=11940=logs=34f41360-6c0d-54d3-11a1-0292a2def1d9=2d56e022-1ace-542f-bf1a-b37dd63243f2]
> {code:java}
> 2021-01-12T16:47:00.7579536Z [ERROR] Failures: 
> 2021-01-12T16:47:00.7580061Z [ERROR]   
> SinkITCase.writerAndGlobalCommitterExecuteInStreamingMode:218 
> 2021-01-12T16:47:00.7587186Z Expected: iterable over 
> ["(895,null,-9223372036854775808)", "(895,null,-9223372036854775808)", 
> "(127,null,-9223372036854775808)", "(127,null,-9223372036854775808)", 
> "(148,null,-9223372036854775808)", "(148,null,-9223372036854775808)", 
> "(161,null,-9223372036854775808)", "(161,null,-9223372036854775808)", 
> "(148,null,-9223372036854775808)", "(148,null,-9223372036854775808)", 
> "(662,null,-9223372036854775808)", "(662,null,-9223372036854775808)", 
> "(822,null,-9223372036854775808)", "(822,null,-9223372036854775808)", 
> "(491,null,-9223372036854775808)", "(491,null,-9223372036854775808)", 
> "(275,null,-9223372036854775808)", "(275,null,-9223372036854775808)", 
> "(122,null,-9223372036854775808)", "(122,null,-9223372036854775808)", 
> "(850,null,-9223372036854775808)", "(850,null,-9223372036854775808)", 
> "(630,null,-9223372036854775808)", "(630,null,-9223372036854775808)", 
> "(682,null,-9223372036854775808)", "(682,null,-9223372036854775808)", 
> "(765,null,-9223372036854775808)", "(765,null,-9223372036854775808)", 
> "(434,null,-9223372036854775808)", "(434,null,-9223372036854775808)", 
> "(970,null,-9223372036854775808)", "(970,null,-9223372036854775808)", 
> "(714,null,-9223372036854775808)", "(714,null,-9223372036854775808)", 
> "(795,null,-9223372036854775808)", "(795,null,-9223372036854775808)", 
> "(288,null,-9223372036854775808)", "(288,null,-9223372036854775808)", 
> "(422,null,-9223372036854775808)", "(422,null,-9223372036854775808)"] in any 
> order
> 2021-01-12T16:47:00.7591663Z  but: Not matched: "end of input"
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot commented on pull request #14905: [FLINK-19608][table-planner-blink] Support TVF based window aggregate in planner

2021-02-08 Thread GitBox


flinkbot commented on pull request #14905:
URL: https://github.com/apache/flink/pull/14905#issuecomment-775660513


   
   ## CI report:
   
   * 7908eb32eee9fba4b38936424fe8e40a6be209d2 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] flinkbot edited a comment on pull request #14904: [WIP][FLINK-20663][runtime] Instantly release unsafe memory on freeing segment.

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14904:
URL: https://github.com/apache/flink/pull/14904#issuecomment-775639746


   
   ## CI report:
   
   * 797afe4dcb649ffa6e1e761581389c6055af0c9c Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13133)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] flinkbot commented on pull request #14905: [FLINK-19608][table-planner-blink] Support TVF based window aggregate in planner

2021-02-08 Thread GitBox


flinkbot commented on pull request #14905:
URL: https://github.com/apache/flink/pull/14905#issuecomment-775654551


   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit 7908eb32eee9fba4b38936424fe8e40a6be209d2 (Tue Feb 09 
04:28:03 UTC 2021)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.

 Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   







[jira] [Updated] (FLINK-19608) Support window TVF based window aggregate in planner

2021-02-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-19608:
---
Labels: pull-request-available  (was: )

> Support window TVF based window aggregate in planner
> 
>
> Key: FLINK-19608
> URL: https://issues.apache.org/jira/browse/FLINK-19608
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Planner
>Reporter: Jark Wu
>Assignee: Jark Wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> Support window TVF based window aggregate in planner.  We will introduce new 
> physical nodes for {{StreamExecWindowPropertyAggregate}} and 
> {{StreamExecWindowAssigner}}, and optimize them into 
> {{StreamExecGroupWindowAggregate}} if possible. 





[GitHub] [flink] wuchong opened a new pull request #14905: [FLINK-19608][table-planner-blink] Support TVF based window aggregate in planner

2021-02-08 Thread GitBox


wuchong opened a new pull request #14905:
URL: https://github.com/apache/flink/pull/14905


   
   
   ## What is the purpose of the change
   
   Support window TVF based window aggregate in planner. We should be able to 
run the new window aggregate syntax end-to-end after this PR. 
   
   ## Brief change log
   
   1. Introduce SqlOperator definitions for the TUMBLE, HOP, CUMULATE table-valued 
functions
   
   We don't use the definitions in Calcite, because we have some special needs:
   1) we check the time attribute type of the DESCRIPTOR time column
   2) we need to derive the types of window columns from the time attribute type 
(e.g. TIMESTAMP_LTZ)
   3) we output an additional column `window_time` which extends the time 
attribute type
   
   2. Support translating into the TVF based window aggregate physical node
   
   This commit makes plan tests possible.
   1) [FLINK-19611] Introduce WindowProperties MetadataHandler to propagate 
window properties.
   2) Recognize window aggregates and translate `FlinkLogicalAggregate` into 
`StreamPhysicalWindowAggregate`, by introducing 
`StreamPhysicalWindowAggregateRule`
   3) Push the window TVF into `StreamPhysicalWindowAggregate`, by introducing 
`PushWindowTableFunctionIntoWindowAggregateRule`
   4) [FLINK-21304] Support split distinct aggregates for TVF based window 
aggregates, by introducing `ExpandWindowTableFunctionTransposeRule`
   5) Support cascading window aggregates, mainly by fixing 
`RelTimeIndicatorConverter` to propagate the time attribute of window_time.
   
   3. Introduce the TVF based window aggregate exec node
   
   This commit makes integration tests possible. 
   1) Introduce StreamExecWindowAggregate to translate into the slicing window 
aggregate operator
   2) Update AggsHandlerCodeGenerator to support generating window properties 
based on the window end timestamp
   3) Fix the implementation of WindowedSliceAssigner::getWindowStart; it must 
delegate calls to the real slice assigner
   
   ## Verifying this change
   
   1. Rename WindowAggregateTest to GroupWindowTest
   
   Because we are going to introduce the new window TVF based aggregates,
   to avoid confusion, we call the legacy window aggregate 
"GroupWindow",
   and the new window aggregate "WindowAggregate".
   
   2. Introduce plan tests in `WindowAggregateTest` and 
`WindowTableFunctionTest`
   3. Introduce IT cases in `WindowAggregateITCase`
   4. Introduce distinct aggregate IT cases in `WindowDistinctAggregateITCase`
   5. Introduce harness tests for processing-time window aggregates  in 
`WindowAggregateHarnessTest`
   6. Add tests to reproduce problems with 
`WindowedSliceAssigner#getWindowStart` in `WindowedSliceAssignerTest`.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / **no**)
 - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)
   
   







[GitHub] [flink] flinkbot edited a comment on pull request #14904: [WIP][FLINK-20663][runtime] Instantly release unsafe memory on freeing segment.

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14904:
URL: https://github.com/apache/flink/pull/14904#issuecomment-775639746


   
   ## CI report:
   
   * 797afe4dcb649ffa6e1e761581389c6055af0c9c Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13133)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] wenlong88 commented on a change in pull request #14878: [FLINK-21245][table-planner-blink] Support StreamExecCalc json serialization/deserialization

2021-02-08 Thread GitBox


wenlong88 commented on a change in pull request #14878:
URL: https://github.com/apache/flink/pull/14878#discussion_r570897982



##
File path: 
flink-table/flink-table-planner-blink/src/main/java/org/apache/flink/table/planner/plan/nodes/exec/serde/FlinkDeserializationContext.java
##
@@ -22,13 +22,17 @@
 import 
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.DeserializationConfig;
 import 
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.DeserializationContext;
 import 
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.InjectableValues;
+import 
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
 import 
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.deser.DefaultDeserializationContext;
 import 
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.deser.DeserializerFactory;
 
+import static org.apache.flink.util.Preconditions.checkNotNull;
+
 /** Custom JSON {@link DeserializationContext} which wraps a {@link 
SerdeContext}. */
 public class FlinkDeserializationContext extends DefaultDeserializationContext 
{
 private static final long serialVersionUID = 1L;
 private final SerdeContext serdeCtx;
+private ObjectMapper ownerMapper;

Review comment:
   Is this a typo? Should it be `objectMapper`?

##
File path: 
flink-table/flink-table-planner-blink/src/main/java/org/apache/flink/table/planner/plan/nodes/exec/serde/EnumTypes.java
##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.planner.plan.nodes.exec.serde;
+
+import org.apache.flink.table.functions.FunctionKind;
+import org.apache.flink.table.types.logical.TimestampKind;
+
+import org.apache.calcite.avatica.util.TimeUnitRange;
+import org.apache.calcite.rel.type.StructKind;
+import org.apache.calcite.sql.fun.SqlTrimFunction;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Registry of {@link Enum} classes that can be serialized to JSON.
+ *
+ * Suppose you want to serialize the value {@link 
SqlTrimFunction.Flag#LEADING} to JSON. First,
+ * make sure that {@link SqlTrimFunction.Flag} is registered. The type will be 
serialized as
+ * "SYMBOL". The value will be serialized as the string "LEADING".
+ *
+ * When we deserialize, we rely on the fact that the registered {@code 
enum} classes have
+ * distinct values. Therefore, knowing that {@code (type="SYMBOL", 
value="LEADING")} we can convert
+ * the string "LEADING" to the enum {@code Flag.LEADING}.
+ */
+@SuppressWarnings({"rawtypes", "unchecked"})
+public abstract class EnumTypes {
+private EnumTypes() {}
+
+private static final Map> ENUM_BY_NAME;

Review comment:
   Why do we need this? As far as I know, the default JSON serialization for an 
enum uses the name of the value.
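The reviewer's point can be sketched without Jackson: an enum value serializes by default to its name, and a name-based lookup is enough to round-trip it as long as the names are unambiguous. The `Flag` enum and helper methods below are hypothetical stand-ins for `SqlTrimFunction.Flag` and the real serde code.

```java
// Hypothetical enum standing in for SqlTrimFunction.Flag.
enum Flag { LEADING, TRAILING, BOTH }

public class EnumSerdeDemo {
    // Default-style serialization: an enum value becomes a JSON string
    // holding its name.
    static String toJson(Enum<?> value) {
        return "\"" + value.name() + "\"";
    }

    // Name-based deserialization: recover the constant from the string.
    static Flag fromJson(String json) {
        return Flag.valueOf(json.replace("\"", ""));
    }

    public static void main(String[] args) {
        System.out.println(toJson(Flag.LEADING));    // prints "LEADING"
        System.out.println(fromJson("\"LEADING\"")); // prints LEADING
    }
}
```

A dedicated registry like `EnumTypes` only becomes necessary when several enum classes share one serialized type tag (here, "SYMBOL"), so the deserializer must map a bare name back to the right class.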

##
File path: 
flink-table/flink-table-planner-blink/src/main/java/org/apache/flink/table/planner/plan/nodes/exec/serde/JsonSerdeUtil.java
##
@@ -44,15 +44,17 @@ public static boolean hasJsonCreatorAnnotation(Class 
clazz) {
 
 /** Create an {@link ObjectMapper} which DeserializationContext wraps a 
{@link SerdeContext}. */
 public static ObjectMapper createObjectMapper(SerdeContext serdeCtx) {
+FlinkDeserializationContext ctx =
+new FlinkDeserializationContext(
+new 
DefaultDeserializationContext.Impl(BeanDeserializerFactory.instance),
+serdeCtx);
 ObjectMapper mapper =
 new ObjectMapper(
 null, // JsonFactory
 null, // DefaultSerializerProvider
-new FlinkDeserializationContext(
-new DefaultDeserializationContext.Impl(
-BeanDeserializerFactory.instance),
-serdeCtx));
+ctx);
 mapper.configure(MapperFeature.USE_GETTERS_AS_SETTERS, false);
+ctx.setObjectMapper(mapper);

Review comment:
   can we prevent the cyclic dependency here? 

##
File path: 

[GitHub] [flink] flinkbot commented on pull request #14904: [WIP][FLINK-20663][runtime] Instantly release unsafe memory on freeing segment.

2021-02-08 Thread GitBox


flinkbot commented on pull request #14904:
URL: https://github.com/apache/flink/pull/14904#issuecomment-775639746


   
   ## CI report:
   
   * 797afe4dcb649ffa6e1e761581389c6055af0c9c UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] PChou commented on pull request #14891: [FLINK-21289][deployment] FIX missing load pipeline.classpaths in app…

2021-02-08 Thread GitBox


PChou commented on pull request #14891:
URL: https://github.com/apache/flink/pull/14891#issuecomment-775638451


   @wangyang0918 Err... I actually have no idea about 
`StandaloneApplicationClusterEntryPoint`. Do you mean 
`StandaloneApplicationClusterEntryPoint` also has the issue, and that the fix 
should be implemented inside `ClassPathPackagedProgramRetriever`?
   If so, `ClassPathPackagedProgramRetriever` or 
`ClassPathPackagedProgramRetriever.Builder` would have to access the 
Configuration directly. 







[GitHub] [flink] flinkbot commented on pull request #14904: [WIP][FLINK-20663][runtime] Instantly release unsafe memory on freeing segment.

2021-02-08 Thread GitBox


flinkbot commented on pull request #14904:
URL: https://github.com/apache/flink/pull/14904#issuecomment-775636912


   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit 797afe4dcb649ffa6e1e761581389c6055af0c9c (Tue Feb 09 
03:38:18 UTC 2021)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.

 Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   







[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-08 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Description: 
 I have five stream jobs and wanted to clear all their state, so I canceled all 
of those jobs and then resubmitted them one by one. Two jobs ended up in 
RUNNING status, while three jobs stayed in CREATED status with the error 
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout".
I am sure my slots are sufficient.

The problem was fixed by restarting the k8s JM and TM pods.

Below are the error logs:

{code:java}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 at 
org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
 at 
org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
 at 
org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
 at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout
 at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)
 ... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 30 
ms
 ... 25 more
{code}


  was:
 I have five stream jobs and wanted to clear all their state, so I canceled all 
of those jobs and then resubmitted them one by one. Two jobs ended up in 
RUNNING status, while three jobs stayed in CREATED status with the error 
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout".

The problem was fixed by restarting the k8s JM and TM pods.

Below are the error logs:

{code:java}
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Slot request bulk is not fulfillable! Could not allocate the required 

[jira] [Updated] (FLINK-20663) Managed memory may not be released in time when operators use managed memory frequently

2021-02-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-20663:
---
Labels: pull-request-available  (was: )

> Managed memory may not be released in time when operators use managed memory 
> frequently
> ---
>
> Key: FLINK-20663
> URL: https://issues.apache.org/jira/browse/FLINK-20663
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.12.0
>Reporter: Caizhi Weng
>Assignee: Xintong Song
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.2
>
>
> Some batch operators (like sort merge join or hash aggregate) use managed 
> memory frequently. When these operators are chained together and the cluster 
> load is a bit heavy, it is very likely that the following exception occurs:
> {code:java}
> 2020-12-18 10:04:32
> java.lang.RuntimeException: 
> org.apache.flink.runtime.memory.MemoryAllocationException: Could not allocate 
> 512 pages
>   at 
> org.apache.flink.table.runtime.util.LazyMemorySegmentPool.nextSegment(LazyMemorySegmentPool.java:85)
>   at 
> org.apache.flink.runtime.io.disk.SimpleCollectingOutputView.(SimpleCollectingOutputView.java:49)
>   at 
> org.apache.flink.table.runtime.operators.aggregate.BytesHashMap$RecordArea.(BytesHashMap.java:297)
>   at 
> org.apache.flink.table.runtime.operators.aggregate.BytesHashMap.(BytesHashMap.java:103)
>   at 
> org.apache.flink.table.runtime.operators.aggregate.BytesHashMap.(BytesHashMap.java:90)
>   at LocalHashAggregateWithKeys$209161.open(Unknown Source)
>   at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:401)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:506)
>   at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:501)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:530)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547)
>   at java.lang.Thread.run(Thread.java:834)
>   Suppressed: java.lang.NullPointerException
>   at LocalHashAggregateWithKeys$209161.close(Unknown Source)
>   at org.apache.flink.table.runtime.operators.TableStreamOperator.dispose(TableStreamOperator.java:46)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:739)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:719)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:642)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:551)
>   ... 3 more
>   Suppressed: java.lang.NullPointerException
>   at LocalHashAggregateWithKeys$209766.close(Unknown Source)
>   ... 8 more
> Caused by: org.apache.flink.runtime.memory.MemoryAllocationException: Could not allocate 512 pages
>   at org.apache.flink.runtime.memory.MemoryManager.allocatePages(MemoryManager.java:231)
>   at org.apache.flink.table.runtime.util.LazyMemorySegmentPool.nextSegment(LazyMemorySegmentPool.java:83)
>   ... 13 more
> Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could not allocate 16777216 bytes, only 9961487 bytes are remaining. This usually indicates that you are requesting more memory than you have reserved. However, when running an old JVM version it can also be caused by slow garbage collection. Try to upgrade to Java 8u72 or higher if running on an old Java version.
>   at org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:164)
>   at org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:80)
>   at org.apache.flink.runtime.memory.MemoryManager.allocatePages(MemoryManager.java:229)
>   ... 14 more
> {code}
> It seems that this is caused by relying on GC to release managed memory, as 
> {{System.gc()}} may not trigger GC in time. See {{UnsafeMemoryBudget.java}}.
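Why a {{System.gc()}} hint may not help in time can be illustrated with a toy budget. All names below are hypothetical, loosely modeled on the idea behind {{UnsafeMemoryBudget}}: freed memory only returns to the budget once the garbage collector has actually run the cleaners that release it, so a reservation retries around a GC hint and can still fail when collection does not happen in time.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal sketch (hypothetical class and method names): a memory budget whose
 * "release" path is normally driven by GC cleaners. A reservation retries
 * around System.gc() and can still fail, because System.gc() is only a hint.
 */
public class GcBackedBudget {
    private final AtomicLong available;

    public GcBackedBudget(long totalBytes) {
        this.available = new AtomicLong(totalBytes);
    }

    /** In the real system this would be invoked by a GC cleaner, not directly. */
    public void release(long bytes) {
        available.addAndGet(bytes);
    }

    public long availableBytes() {
        return available.get();
    }

    public void reserve(long bytes, int maxGcAttempts) {
        for (int attempt = 0; ; attempt++) {
            long current = available.get();
            if (current >= bytes && available.compareAndSet(current, current - bytes)) {
                return; // reservation succeeded
            }
            if (attempt >= maxGcAttempts) {
                throw new IllegalStateException(
                        "Could not allocate " + bytes + " bytes, only "
                                + available.get() + " bytes are remaining");
            }
            // A hint only: the JVM may not collect (in time), which is exactly
            // why tests relying on this pattern can be flaky.
            System.gc();
        }
    }
}
```

If nothing returns bytes to the budget between retries, the reservation fails with a message of the same shape as the one in the log above.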



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-08 Thread hayden zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hayden zhou updated FLINK-21325:

Description: 
 I have five streaming jobs and wanted to clear all state in them, so I canceled 
all of those jobs and then resubmitted them one by one. As a result, two jobs are 
in RUNNING status, while three jobs are stuck in CREATED status with the error 
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout".

The problem was fixed by restarting the k8s JM and TM pods.

below is the error logs:

{code:java}
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 at org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
 at org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
 at org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
 at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
 at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
 at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
 at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)
 ... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 30 ms
 ... 25 more
{code}



[GitHub] [flink] xintongsong opened a new pull request #14904: [WIP][FLINK-20663][runtime] Instantly release unsafe memory on freeing segment.

2021-02-08 Thread GitBox


xintongsong opened a new pull request #14904:
URL: https://github.com/apache/flink/pull/14904


   
   
   ## What is the purpose of the change
   
   *(For example: This pull request makes task deployment go through the blob 
server, rather than through RPC. That way we avoid re-transferring them on each 
deployment (during recovery).)*
   
   
   ## Brief change log
   
   *(for example:)*
 - *The TaskInfo is stored in the blob store on job creation time as a 
persistent artifact*
 - *Deployments RPC transmits only the blob storage reference*
 - *TaskManagers retrieve the TaskInfo from the blob cache*
   
   
   ## Verifying this change
   
   *(Please pick either of the following options)*
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This change is already covered by existing tests, such as *(please describe 
tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
 - *Added integration tests for end-to-end deployment with large payloads 
(100MB)*
 - *Extended integration test for recovery after master (JobManager) 
failure*
 - *Added test that validates that TaskInfo is transferred only once across 
recoveries*
 - *Manually verified the change by running a 4 node cluser with 2 
JobManagers and 4 TaskManagers, a stateful streaming program, and killing one 
JobManager and two TaskManagers during the execution, verifying that recovery 
happens correctly.*
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / no)
 - The serializers: (yes / no / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / no / 
don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't 
know)
 - The S3 file system connector: (yes / no / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / no)
 - If yes, how is the feature documented? (not applicable / docs / JavaDocs 
/ not documented)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (FLINK-21325) NoResourceAvailableException while cancel then resubmit jobs

2021-02-08 Thread hayden zhou (Jira)
hayden zhou created FLINK-21325:
---

 Summary: NoResourceAvailableException while cancel then resubmit  
jobs
 Key: FLINK-21325
 URL: https://issues.apache.org/jira/browse/FLINK-21325
 Project: Flink
  Issue Type: Bug
 Environment: Flink 1.12 with 
[flink-kubernetes_2.11-1.12-SNAPSHOT.jar] in the libs directory, added to fix the 
Flink restart problem in k8s HA session mode.
Reporter: hayden zhou


 I have five streaming jobs and wanted to clear all state in them, so I canceled 
all of those jobs and then resubmitted them one by one. As a result, two jobs are 
in RUNNING status, while three jobs are stuck in CREATED status with the error 
"NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout".

below is the error logs:

{code:java}
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 at org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:195)
 at org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:147)
 at org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:84)
 at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
 at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:87)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
 at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
 at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:84)
 ... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 30 ms
 ... 25 more
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17170) Cannot stop streaming job with savepoint which uses kinesis consumer

2021-02-08 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281492#comment-17281492
 ] 

Kezhu Wang commented on FLINK-17170:


{quote}
In the stop-with-savepoint path, {{FlinkKinesisConsumer.cancel}} is called with 
the {{checkpointLock}} held.
{quote}

This has been the case since 1.10, so this issue should exist in all versions from 1.10 onward.

> Cannot stop streaming job with savepoint which uses kinesis consumer
> 
>
> Key: FLINK-17170
> URL: https://issues.apache.org/jira/browse/FLINK-17170
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kinesis
>Affects Versions: 1.10.0
>Reporter: Vasii Cosmin Radu
>Priority: Critical
>  Labels: usability
> Attachments: Screenshot 2020-04-15 at 18.16.26.png, threaddump_tm1
>
>
> I am encountering a very strange situation where I can't stop with savepoint 
> a streaming job.
> The job reads from kinesis and sinks to S3, very simple job, no mapping 
> function, no watermarks, just source->sink. 
> Source is using flink-kinesis-consumer, sink is using StreamingFileSink. 
> Everything works fine, except stopping the job with savepoints.
> The behaviour happens only when multiple task managers are involved, with 
> sub-tasks of the job spread across multiple task manager instances. When a 
> single task manager has all the sub-tasks, this issue never occurred.
> Using latest Flink 1.10.0 version, deployment done in HA mode (2 job 
> managers), in EC2, savepoints and checkpoints written on S3.
> When trying to stop, the savepoint is created correctly and appears on S3, 
> but not all sub-tasks are stopped. Some of them finished, but some just 
> remain hung. Sometimes, on the same task manager, part of the sub-tasks are 
> finished and part aren't.
> The logs don't show any errors. For the ones that succeed, the standard 
> messages appear, with "Source: <> switched from RUNNING to FINISHED".
> For the sub-tasks hanged the last message is 
> "org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher - 
> Shutting down the shard consumer threads of subtask 0 ..." and that's it.
>  
> I tried using the cli (flink stop )
> Timeout Message:
> {code:java}
> root@ec2-XX-XX-XX-XX:/opt/flink/current/bin# ./flink stop 
> cf43cecd9339e8f02a12333e52966a25
> root@ec2-XX-XX-XX-XX:/opt/flink/current/bin# ./flink stop 
> cf43cecd9339e8f02a12333e52966a25Suspending job 
> "cf43cecd9339e8f02a12333e52966a25" with a savepoint. 
>  The program 
> finished with the following exception: org.apache.flink.util.FlinkException: 
> Could not stop with a savepoint job "cf43cecd9339e8f02a12333e52966a25". at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:462) 
> at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:454) at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:907) 
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) 
> at java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)Caused 
> by: java.util.concurrent.TimeoutException at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) 
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:460) 
> ... 9 more{code}
>  
> Using the monitoring api, I keep getting infinite message when querying based 
> on the savepoint id, that the status id is still "IN_PROGRESS".
>  
> When performing a cancel instead of stop, it works. But cancel is deprecated, 
> so I am a bit concerned that this might fail also, maybe I was just lucky.
>  
> I attached a screenshot with what the UI is showing when this happens
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17170) Cannot stop streaming job with savepoint which uses kinesis consumer

2021-02-08 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281491#comment-17281491
 ] 

Kezhu Wang commented on FLINK-17170:


I think the problem is in {{FlinkKinesisConsumer.cancel}}: it should not wait for 
the fetcher to finish; it should only signal it.

[~qinjunjerry] is correct about the deadlock. In the stop-with-savepoint path, 
{{FlinkKinesisConsumer.cancel}} is called with the {{checkpointLock}} held.
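The deadlock shape being described is the classic lock-held-while-joining pattern. A minimal, self-contained sketch (hypothetical names, not the actual connector code) with a join timeout so it terminates instead of hanging:

```java
/**
 * Hypothetical sketch of the deadlock: the task thread calls cancel() while
 * holding the checkpoint lock and waits for the fetcher thread, while the
 * fetcher itself needs that same lock to make progress and finish.
 */
public class CancelUnderLockDeadlock {
    public static boolean demonstrates() {
        final Object checkpointLock = new Object();
        Thread fetcher = new Thread(() -> {
            // The fetcher needs the checkpoint lock to emit records / shut down.
            synchronized (checkpointLock) {
                // would emit final records here
            }
        });
        synchronized (checkpointLock) {
            // cancel() runs on the task thread while it already holds the lock...
            fetcher.start();
            try {
                // ...and then waits for the fetcher, which can never acquire it.
                fetcher.join(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return fetcher.isAlive(); // true: the fetcher is still stuck
        }
    }
}
```

Signalling cancellation without waiting (and letting the fetcher finish after the lock is released) avoids this shape.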

> Cannot stop streaming job with savepoint which uses kinesis consumer
> 
>
> Key: FLINK-17170
> URL: https://issues.apache.org/jira/browse/FLINK-17170
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kinesis
>Affects Versions: 1.10.0
>Reporter: Vasii Cosmin Radu
>Priority: Critical
>  Labels: usability
> Attachments: Screenshot 2020-04-15 at 18.16.26.png, threaddump_tm1
>
>
> I am encountering a very strange situation where I can't stop with savepoint 
> a streaming job.
> The job reads from kinesis and sinks to S3, very simple job, no mapping 
> function, no watermarks, just source->sink. 
> Source is using flink-kinesis-consumer, sink is using StreamingFileSink. 
> Everything works fine, except stopping the job with savepoints.
> The behaviour happens only when multiple task managers are involved, with 
> sub-tasks of the job spread across multiple task manager instances. When a 
> single task manager has all the sub-tasks, this issue never occurred.
> Using latest Flink 1.10.0 version, deployment done in HA mode (2 job 
> managers), in EC2, savepoints and checkpoints written on S3.
> When trying to stop, the savepoint is created correctly and appears on S3, 
> but not all sub-tasks are stopped. Some of them finished, but some just 
> remain hung. Sometimes, on the same task manager, part of the sub-tasks are 
> finished and part aren't.
> The logs don't show any errors. For the ones that succeed, the standard 
> messages appear, with "Source: <> switched from RUNNING to FINISHED".
> For the sub-tasks hanged the last message is 
> "org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher - 
> Shutting down the shard consumer threads of subtask 0 ..." and that's it.
>  
> I tried using the cli (flink stop )
> Timeout Message:
> {code:java}
> root@ec2-XX-XX-XX-XX:/opt/flink/current/bin# ./flink stop 
> cf43cecd9339e8f02a12333e52966a25
> root@ec2-XX-XX-XX-XX:/opt/flink/current/bin# ./flink stop 
> cf43cecd9339e8f02a12333e52966a25Suspending job 
> "cf43cecd9339e8f02a12333e52966a25" with a savepoint. 
>  The program 
> finished with the following exception: org.apache.flink.util.FlinkException: 
> Could not stop with a savepoint job "cf43cecd9339e8f02a12333e52966a25". at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:462) 
> at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>  at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:454) at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:907) 
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) 
> at java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)Caused 
> by: java.util.concurrent.TimeoutException at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) 
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) at 
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:460) 
> ... 9 more{code}
>  
> Using the monitoring api, I keep getting infinite message when querying based 
> on the savepoint id, that the status id is still "IN_PROGRESS".
>  
> When performing a cancel instead of stop, it works. But cancel is deprecated, 
> so I am a bit concerned that this might fail also, maybe I was just lucky.
>  
> I attached a screenshot with what the UI is showing when this happens
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-11838) Create RecoverableWriter for GCS

2021-02-08 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281488#comment-17281488
 ] 

Xintong Song commented on FLINK-11838:
--

Thanks for the information.
FYI, the Chinese New Year holidays are from Feb. 11 to 17, during which I could 
be less responsive.

> Create RecoverableWriter for GCS
> 
>
> Key: FLINK-11838
> URL: https://issues.apache.org/jira/browse/FLINK-11838
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0
>Reporter: Fokko Driesprong
>Assignee: Galen Warren
>Priority: Major
>  Labels: pull-request-available, usability
> Fix For: 1.13.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> GCS supports the resumable upload which we can use to create a Recoverable 
> writer similar to the S3 implementation:
> https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload
> After using the Hadoop compatible interface: 
> https://github.com/apache/flink/pull/7519
> We've noticed that the current implementation relies heavily on the renaming 
> of the files on the commit: 
> https://github.com/apache/flink/blob/master/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopRecoverableFsDataOutputStream.java#L233-L259
> This is suboptimal on an object store such as GCS. Therefore we would like to 
> implement a more GCS native RecoverableWriter 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] wangyang0918 commented on pull request #14891: [FLINK-21289][deployment] FIX missing load pipeline.classpaths in app…

2021-02-08 Thread GitBox


wangyang0918 commented on pull request #14891:
URL: https://github.com/apache/flink/pull/14891#issuecomment-775610643


   @PChou What I mean is adding a private method 
`ClassPathPackagedProgramRetriever#buildUserClassPaths`, which would be used to 
initialize the `userClassPaths` variable in the constructor. Are there 
situations where the `pipeline.classpaths` should not be added to the user 
classpaths? If not, then I think merging the `pipeline.classpaths` and `usrlib` 
internally in `ClassPathPackagedProgramRetriever` is reasonable.
   
   BTW, based on your current implementation, the 
`StandaloneApplicationClusterEntryPoint` also needs to set the pipeline 
classpath.
   
   > When "other mode" access the ClassPathPackagedProgramRetriever, 
pipelineClassPath is empty collection by default, that will not affect the 
mode's behavior
   
   So do you mean that in "other mode" we do not need to set the pipeline 
classpaths? I am afraid I cannot agree with you. 
`ClassPathPackagedProgramRetriever` should always set the pipeline classpaths.
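As a rough illustration of the suggestion (class, method, and parameter names below are assumptions, not the actual Flink API), the merge could happen once in the constructor so every entry point sees the same user classpath:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/**
 * Hypothetical sketch: merge the jars discovered under usrlib with the
 * configured pipeline classpaths in one private method, invoked from the
 * constructor, instead of handling pipeline classpaths per deployment mode.
 */
public class ClassPathRetrieverSketch {
    private final List<URL> userClassPaths;

    public ClassPathRetrieverSketch(
            Collection<URL> usrLibJars, Collection<String> pipelineClasspaths) {
        this.userClassPaths = buildUserClassPaths(usrLibJars, pipelineClasspaths);
    }

    /** Single merge point, in the spirit of the review comment above. */
    private static List<URL> buildUserClassPaths(
            Collection<URL> usrLibJars, Collection<String> pipelineClasspaths) {
        List<URL> result = new ArrayList<>(usrLibJars);
        for (String entry : pipelineClasspaths) {
            try {
                result.add(new URL(entry));
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException(
                        "Invalid classpath entry: " + entry, e);
            }
        }
        return result;
    }

    public List<URL> getUserClassPaths() {
        return userClassPaths;
    }
}
```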



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] KarmaGYZ commented on a change in pull request #14897: [FLINK-21221][runtime] Deduplication for multiple ResourceCounters

2021-02-08 Thread GitBox


KarmaGYZ commented on a change in pull request #14897:
URL: https://github.com/apache/flink/pull/14897#discussion_r572517152



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/util/ResourceCounter.java
##
@@ -35,11 +37,12 @@
  * associated counts. The counts are always positive (> 0).
  */
 public final class ResourceCounter {

Review comment:
   Each time we add a `ResourceProfile` to it, all the recorded 
`ResourceProfile`s will be put into a new map, with O(n log n) complexity. For 
coarse-grained resource management, that might be ok since we only have one 
type of requirement and one type of slot (for standalone mode, that assumption 
might also be broken). However, for fine-grained resource management, 
performance can get bad.
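The copy-per-update cost can be illustrated with a toy string-keyed counter (hypothetical; the real class counts `ResourceProfile`s): an immutable design copies the entire backing map on every add or subtract, which is what the O(n log n)-per-update concern refers to.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy immutable counter (illustrative only): every update copies the whole
 * backing map, so k updates over n distinct keys cost O(k * n) copying,
 * whereas a mutable counter would update in place.
 */
public final class ImmutableCounter {
    private final Map<String, Integer> counts;

    public ImmutableCounter() {
        this(new HashMap<>());
    }

    private ImmutableCounter(Map<String, Integer> counts) {
        this.counts = counts;
    }

    public ImmutableCounter add(String key, int n) {
        Map<String, Integer> copy = new HashMap<>(this.counts); // full copy per update
        copy.merge(key, n, Integer::sum);
        copy.remove(key, 0); // keep counts strictly positive, as in ResourceCounter
        return new ImmutableCounter(copy);
    }

    public int get(String key) {
        return counts.getOrDefault(key, 0);
    }
}
```

A mutable variant would apply `merge`/`remove` directly on its own map, trading the copying (and the GC pressure from discarded maps) for shared-state discipline at the call sites.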





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] KarmaGYZ commented on a change in pull request #14897: [FLINK-21221][runtime] Deduplication for multiple ResourceCounters

2021-02-08 Thread GitBox


KarmaGYZ commented on a change in pull request #14897:
URL: https://github.com/apache/flink/pull/14897#discussion_r572517152



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/util/ResourceCounter.java
##
@@ -35,11 +37,12 @@
  * associated counts. The counts are always positive (> 0).
  */
 public final class ResourceCounter {

Review comment:
   Each time we add a `ResourceProfile` to it, all the recorded 
`ResourceProfile`s will be put into a new map, with O(n log n) complexity. For 
coarse-grained resource management, that might be ok since we only have one 
type of requirement and one type of slot (for standalone mode, that assumption 
might be broken). However, for fine-grained resource management, performance 
can get bad.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] KarmaGYZ commented on a change in pull request #14897: [FLINK-21221][runtime] Deduplication for multiple ResourceCounters

2021-02-08 Thread GitBox


KarmaGYZ commented on a change in pull request #14897:
URL: https://github.com/apache/flink/pull/14897#discussion_r572511815



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/util/ResourceCounter.java
##
@@ -35,11 +37,12 @@
  * associated counts. The counts are always positive (> 0).
  */
 public final class ResourceCounter {

Review comment:
   I made it mutable mainly for performance reasons. If someone uses it 
like a "counter", calling `add` and `subtract` with resource profiles many 
times, constructing hundreds of `ResourceCounter`s might be expensive. It may 
also put a lot of pressure on the GC. For instance, every time a slot is 
allocated, we construct 2 `ResourceCounter`s in 
`ResourceTracker#notifyAcquiredResource`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] KarmaGYZ commented on a change in pull request #14897: [FLINK-21221][runtime] Deduplication for multiple ResourceCounters

2021-02-08 Thread GitBox


KarmaGYZ commented on a change in pull request #14897:
URL: https://github.com/apache/flink/pull/14897#discussion_r572511815



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/util/ResourceCounter.java
##
@@ -35,11 +37,12 @@
  * associated counts. The counts are always positive (> 0).
  */
 public final class ResourceCounter {

Review comment:
   I made it mutable mainly for performance reasons. If someone uses it 
like a "counter", calling `add` and `subtract` with resource profiles many 
times, constructing `ResourceCounter`s might be expensive. It may also put a 
lot of pressure on the GC. For instance, every time a slot is allocated, we 
construct 2 `ResourceCounter`s in `ResourceTracker#notifyAcquiredResource`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-11838) Create RecoverableWriter for GCS

2021-02-08 Thread Galen Warren (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281481#comment-17281481
 ] 

Galen Warren commented on FLINK-11838:
--

Thanks, that all sounds reasonable. I'll take a look at the REST option. I'm 
going to be a bit tied up with some work-related stuff for the next couple of 
days so I might not get back until later in the week, but I'll definitely get 
back on this soon.

> Create RecoverableWriter for GCS
> 
>
> Key: FLINK-11838
> URL: https://issues.apache.org/jira/browse/FLINK-11838
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / FileSystem
>Affects Versions: 1.8.0
>Reporter: Fokko Driesprong
>Assignee: Galen Warren
>Priority: Major
>  Labels: pull-request-available, usability
> Fix For: 1.13.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> GCS supports the resumable upload which we can use to create a Recoverable 
> writer similar to the S3 implementation:
> https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload
> After using the Hadoop compatible interface: 
> https://github.com/apache/flink/pull/7519
> We've noticed that the current implementation relies heavily on the renaming 
> of the files on the commit: 
> https://github.com/apache/flink/blob/master/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopRecoverableFsDataOutputStream.java#L233-L259
> This is suboptimal on an object store such as GCS. Therefore we would like to 
> implement a more GCS native RecoverableWriter 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-21323) Stop-with-savepoint is not supported by SourceOperatorStreamTask

2021-02-08 Thread Kezhu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281480#comment-17281480
 ] 

Kezhu Wang commented on FLINK-21323:


[~mapohl]  [~roman_khachatryan] FLIP-27 sources can be chained (see 
{{ChainingStrategy.HEAD_WITH_SOURCES}}); this should be taken into account to 
make FLIP-27 sources fully functional with stop-with-savepoint.

> Stop-with-savepoint is not supported by SourceOperatorStreamTask
> 
>
> Key: FLINK-21323
> URL: https://issues.apache.org/jira/browse/FLINK-21323
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.11.3, 1.12.1
>Reporter: Matthias
>Priority: Critical
> Fix For: 1.11.4, 1.12.2, 1.13.0
>
>
> While looking into FLINK-21030 and analyzing the stop-with-savepoint 
> behavior, we implemented different test jobs covering the old {{addSource}} 
> and new {{fromSource}} methods for adding sources. Stop-with-savepoint 
> consists of two phases:
>  # Create the savepoint
>  # Stop the source to trigger finalizing the job
> The test using {{fromSource}} fails in the second phase. The reason might be 
> that {{finishTask}} is not implemented by {{SourceOperatorStreamTask}}, in 
> contrast to {{SourceStreamTask}}, which is used when calling {{addSource}} 
> in the job definition. Hence, the job termination is never triggered.
> We might have missed this due to a naming error: the 
> {{JobMasterStopWithSavepointIT}} test is not picked up by Maven because it 
> uses the wrong suffix. The IT is failing right now. FLINK-21031 already 
> covers the fix of {{JobMasterStopWithSavepointIT}}.
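The missing-hook failure mode described above can be illustrated with a small, self-contained toy model. Class names echo Flink's tasks, but this is a sketch under our own assumptions, not the actual runtime code.

```java
// Toy model: a base task whose finishTask() hook defaults to a no-op.
// A subclass that does not override the hook never signals termination,
// mirroring the described gap in SourceOperatorStreamTask.
public class StopWithSavepointSketch {
    abstract static class StreamTask {
        boolean terminated = false;

        // Hook invoked during phase 2 of stop-with-savepoint.
        void finishTask() {
            // Default: do nothing -- the job never reaches termination.
        }

        final void stopWithSavepoint() {
            // Phase 1 (take savepoint) elided; phase 2 stops the source.
            finishTask();
        }
    }

    static final class LegacySourceTask extends StreamTask {
        // Analogous to SourceStreamTask, which implements the hook.
        @Override void finishTask() { terminated = true; }
    }

    static final class Flip27SourceTask extends StreamTask {
        // finishTask() not overridden: stop-with-savepoint hangs in phase 2.
    }

    public static void main(String[] args) {
        StreamTask legacy = new LegacySourceTask();
        StreamTask flip27 = new Flip27SourceTask();
        legacy.stopWithSavepoint();
        flip27.stopWithSavepoint();
        System.out.println("legacy terminated: " + legacy.terminated);
        System.out.println("flip27 terminated: " + flip27.terminated);
    }
}
```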



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 82de9d36ff5364984912c935c60c39a623dec537 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13131)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] PChou commented on pull request #14891: [FLINK-21289][deployment] FIX missing load pipeline.classpaths in app…

2021-02-08 Thread GitBox


PChou commented on pull request #14891:
URL: https://github.com/apache/flink/pull/14891#issuecomment-775585080


   @wangyang0918 `ClassPathPackagedProgramRetriever` is an implementation of 
`PackagedProgramRetriever`, which only defines the method 
`getPackagedProgram`; callers should only know about `getPackagedProgram` and 
should only control the behavior of `ClassPathPackagedProgramRetriever` 
through `ClassPathPackagedProgramRetriever.Builder`. So I prefer making the 
changes in the `Builder`.
   When the "other modes" access `ClassPathPackagedProgramRetriever`, 
`pipelineClassPath` is an empty collection by default, so it will not affect 
those modes' behavior.
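As a hedged illustration of this design point (illustrative names only, not the real Flink classes): the object's optional inputs are settable only through its Builder, and default to an empty collection, so callers in other modes that never set them see no behavior change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of the Builder-controlled design: the retriever itself is
// immutable, and all optional configuration flows through the Builder.
public final class Retriever {
    private final List<String> pipelineClasspaths;

    private Retriever(List<String> pipelineClasspaths) {
        this.pipelineClasspaths = pipelineClasspaths;
    }

    public List<String> pipelineClasspaths() {
        return pipelineClasspaths;
    }

    public static Builder newBuilder() { return new Builder(); }

    public static final class Builder {
        // Empty by default: modes that never set it are unaffected.
        private List<String> pipelineClasspaths = Collections.emptyList();

        public Builder setPipelineClasspaths(List<String> classpaths) {
            this.pipelineClasspaths = new ArrayList<>(classpaths);
            return this;
        }

        public Retriever build() {
            return new Retriever(pipelineClasspaths);
        }
    }

    public static void main(String[] args) {
        Retriever defaults = Retriever.newBuilder().build();
        Retriever custom = Retriever.newBuilder()
                .setPipelineClasspaths(Arrays.asList("a.jar", "b.jar"))
                .build();
        System.out.println(defaults.pipelineClasspaths().size()); // 0
        System.out.println(custom.pipelineClasspaths().size());   // 2
    }
}
```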



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 9406a10fbbc9a2ccf4cc08236184a9f76c076d43 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13130)
 
   * 82de9d36ff5364984912c935c60c39a623dec537 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13131)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14834: [FLINK-21234][build system] Adding timeout to all tasks

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14834:
URL: https://github.com/apache/flink/pull/14834#issuecomment-771607546


   
   ## CI report:
   
   * 3c55d1b87076148f095c08a55319f1b898dc932d UNKNOWN
   * 5c8ac11faf9d6fbcad7550bec8abf6a6343347c3 UNKNOWN
   * 178027f14650f60c887f02b7add583d12a0aef86 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13127)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 9406a10fbbc9a2ccf4cc08236184a9f76c076d43 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13130)
 
   * 82de9d36ff5364984912c935c60c39a623dec537 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] Jason-liujc commented on pull request #5415: [FLINK-3655] [core] Support multiple paths in FileInputFormat

2021-02-08 Thread GitBox


Jason-liujc commented on pull request #5415:
URL: https://github.com/apache/flink/pull/5415#issuecomment-775567716


   Yeah, I basically cloned the source code for 
`ContinuousFileMonitoringFunction` and added the capability to read from 
multiple directories.
   
   I can't really share the source code right now, as we need internal 
approval to do so. But basically, in the `run()` function, loop through each 
path and call `monitorDirAndForwardSplits()` on that path. There are some 
kinks you might have to work out to get checkpointing to work properly, but 
that's the general idea.
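A minimal, self-contained sketch of that general idea, using plain `java.nio` instead of Flink's classes. The name `monitorDirAndForwardSplits` is borrowed from the discussion above, and the checkpointable state is reduced to a per-directory modification-time watermark; everything else is our own assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy multi-path monitor: each scan loops over every configured directory
// and forwards files modified since the previous scan of that directory.
public class MultiPathMonitor {
    private final List<Path> paths;
    // Last-seen modification time per directory -- the state that would
    // need to be checkpointed in a real source function.
    private final Map<Path, Long> lastSeen = new HashMap<>();

    public MultiPathMonitor(List<Path> paths) {
        this.paths = paths;
    }

    // One monitoring pass: loop through each path, as the comment suggests,
    // instead of watching a single directory.
    public List<Path> scanOnce() throws IOException {
        List<Path> discovered = new ArrayList<>();
        for (Path dir : paths) {
            discovered.addAll(monitorDirAndForwardSplits(dir));
        }
        return discovered;
    }

    // Simplified stand-in for monitorDirAndForwardSplits(): returns files
    // modified strictly after the directory's current watermark.
    private List<Path> monitorDirAndForwardSplits(Path dir) throws IOException {
        long watermark = lastSeen.getOrDefault(dir, Long.MIN_VALUE);
        long newWatermark = watermark;
        List<Path> splits = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path file : stream) {
                long modTime = Files.getLastModifiedTime(file).toMillis();
                if (modTime > watermark) {
                    splits.add(file);
                    newWatermark = Math.max(newWatermark, modTime);
                }
            }
        }
        lastSeen.put(dir, newWatermark);
        return splits;
    }

    public static void main(String[] args) throws IOException {
        Path dirA = Files.createTempDirectory("a");
        Path dirB = Files.createTempDirectory("b");
        Files.createFile(dirA.resolve("one.txt"));
        Files.createFile(dirB.resolve("two.txt"));
        MultiPathMonitor monitor = new MultiPathMonitor(Arrays.asList(dirA, dirB));
        System.out.println("first scan: " + monitor.scanOnce().size());
        System.out.println("second scan: " + monitor.scanOnce().size());
    }
}
```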



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 47c27930dce7d7ea2dee3d0f434fe2511ab5 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13128)
 
   * 9406a10fbbc9a2ccf4cc08236184a9f76c076d43 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13130)
 
   * 82de9d36ff5364984912c935c60c39a623dec537 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14834: [FLINK-21234][build system] Adding timeout to all tasks

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14834:
URL: https://github.com/apache/flink/pull/14834#issuecomment-771607546


   
   ## CI report:
   
   * 3c55d1b87076148f095c08a55319f1b898dc932d UNKNOWN
   * 5c8ac11faf9d6fbcad7550bec8abf6a6343347c3 UNKNOWN
   * fc886cc2ebb0d9ffee932e5b6ed00df5e6be9136 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13129)
 
   * 178027f14650f60c887f02b7add583d12a0aef86 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13127)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-21196) Sensitive information(password) is in plain text format in flink-conf.yaml configuration file

2021-02-08 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281462#comment-17281462
 ] 

Chesnay Schepler commented on FLINK-21196:
--

Security-wise we usually only focus on external connectivity (e.g., the REST 
API / WebUI, clients, networking between nodes).

If an attacker has read access to such files then you have probably already 
lost.

> Sensitive information(password) is in plain text format in flink-conf.yaml 
> configuration file
> -
>
> Key: FLINK-21196
> URL: https://issues.apache.org/jira/browse/FLINK-21196
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Scripts
>Affects Versions: 1.10.0, 1.11.3
>Reporter: Suchithra V N
>Priority: Major
>
> When security settings are enabled in the Flink configuration, password 
> information is written to the flink-conf.yaml file as plain text. Any 
> attacker who is able to access this file can use these passwords to decrypt 
> the files. Hence, a secure mechanism is required to mask this sensitive 
> information in the flink-conf.yaml file.
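One common mitigation, sketched here under our own assumptions (the key patterns and names are illustrative, not Flink's actual configuration handling), is to mask values whose keys look sensitive before the configuration is displayed or logged:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative value masking: keys containing a sensitive marker have
// their values replaced before the configuration is rendered anywhere.
public class ConfigMasker {
    // Assumed list of sensitive key substrings for this sketch.
    private static final List<String> SENSITIVE = Arrays.asList("password", "secret");

    public static String maskValue(String key, String value) {
        String lower = key.toLowerCase(Locale.ROOT);
        for (String marker : SENSITIVE) {
            if (lower.contains(marker)) {
                return "******";
            }
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("jobmanager.rpc.address", "localhost");
        conf.put("security.ssl.keystore-password", "hunter2");
        conf.forEach((k, v) -> System.out.println(k + ": " + maskValue(k, v)));
    }
}
```

Masking only protects rendered output (logs, UIs); keeping secrets out of the file entirely (e.g. via an external secret store) is the stronger fix.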



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 521acd6627e3b85ab8ef7386cae10728fd97db9c Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13123)
 
   * 47c27930dce7d7ea2dee3d0f434fe2511ab5 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13128)
 
   * 9406a10fbbc9a2ccf4cc08236184a9f76c076d43 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] sjwiesman edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


sjwiesman edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775546460


   I've fixed the whitespace and added an arrow to the rest API docs. I'm still 
struggling with the side nav, I'll keep working on it tomorrow. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] sjwiesman commented on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


sjwiesman commented on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775546460


   I've fixed the whitespace and added an arrow to the rest API docs. I'm still 
struggling with the side bar, I'll keep working on it tomorrow. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14874: [FLINK-21102] Add ScaleUpController for declarative scheduler

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14874:
URL: https://github.com/apache/flink/pull/14874#issuecomment-773408240


   
   ## CI report:
   
   * 2f8766121adb3016eddb1e3145bd1cedd0a3da59 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13118)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-21211) Looking for reviews on a framework based on flink-statefun

2021-02-08 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281456#comment-17281456
 ] 

Chesnay Schepler commented on FLINK-21211:
--

You may have better chances of getting feedback on the user mailing list.

> Looking for reviews on a framework based on flink-statefun 
> ---
>
> Key: FLINK-21211
> URL: https://issues.apache.org/jira/browse/FLINK-21211
> Project: Flink
>  Issue Type: New Feature
>Reporter: Zixuan Rao
>Priority: Minor
>
> Hi, I am currently developing a framework targeting back-end state 
> management. To ensure exactly-once processing of events in the back end, I 
> intend to use the Flink Stateful Functions runtime in combination with 
> Python's asyncio. I hope to receive some feedback. 
> The following code shows an example (draft) of writing a back-end 
> microservice using my framework. It is intended to be equivalent 
> (exchangeable) with the flink-statefun examples/ridesharing. The idea is 
> that an "Event" is reducible to an async function call, and external egress 
> can be emitted by saving an object. This preserves the exactly-once 
> features of flink-statefun while adding a great deal of readability to the 
> code. 
> Reviews are appreciated. Thank you! 
> {code}
> """
> Equivalent implementation for flink stateful functions example - ridesharing
> ref: 
> https://github.com/apache/flink-statefun/blob/master/statefun-examples/statefun-ridesharing-example/statefun-ridesharing-example-functions/src/main/java/org/apache/flink/statefun/examples/ridesharing/FnDriver.java
> """
> from onto.models.base import Serializable
> """
> Rewrite callback-style code to async-await: 
> ref: 
> https://www.coreycleary.me/how-to-rewrite-a-callback-function-in-promise-form-and-async-await-form-in-javascript
>  
> """
> from onto.domain_model import DomainModel
> from onto.attrs import attrs
> class RideshareBase(DomainModel):
>     pass
> class Passenger(RideshareBase):
>     async def request_ride(self, start_geo_cell, end_geo_cell):
>         r = Ride.create()  # TODO: implement create
>         await r.passenger_joins(
>             passenger=self,
>             start_geo_cell=start_geo_cell,
>             end_geo_cell=end_geo_cell
>         )
>     class PassengerMessage(DomainModel):
>         passenger = attrs.relation('Passenger')
>         class RideFailedMessage(Serializable):
>             ride = attrs.relation('Ride')
>         ride_failed = attrs.embed(RideFailedMessage).optional
>         class DriverHasBeenFoundMessage(Serializable):
>             driver = attrs.relation('Driver')
>             driver_geo_cell = attrs.relation('GeoCell')
>         driver_found = attrs.embed(RideFailedMessage).optional
>         class RideHasStarted(Serializable):
>             driver = attrs.relation('Driver')
>         ride_started = attrs.embed(RideHasStarted).optional
>         class RideHasEnded(Serializable):
>             pass  # TODO: make sure that empty class works
>         ride_ended = attrs.embed(RideHasEnded).optional
>     async def ride_failed(self, ride: 'Ride'):
>         message = self.PassengerMessage.new(
>             passenger=self,
>             ride_failed=self.PassengerMessage.RideFailedMessage.new(
>                 ride=ride
>             )
>         )
>         message.save()
>     async def driver_joins_ride(self, driver: 'Driver', driver_geo_cell: 'GeoCell'):
>         message = self.PassengerMessage.new(
>             passenger=self,
>             driver_found=self.PassengerMessage.DriverHasBeenFoundMessage.new(
>                 driver=driver,
>                 driver_geo_cell=driver_geo_cell
>             )
>         )
>         message.save()
>     async def ride_started(self, driver: 'Driver'):
>         message = self.PassengerMessage.new(
>             passenger=self,
>             ride_started=self.PassengerMessage.RideHasStarted.new(
>                 driver=driver
>             )
>         )
>         message.save()
>     async def ride_ended(self):
>         message = self.PassengerMessage.new(
>             passenger=self,
>             ride_started=self.PassengerMessage.RideHasEnded.new()
>         )
>         message.save()
> class DriverRejectsPickupError(RideshareBase, Exception):
>     driver = attrs.relation(dm_cls='Driver')
>     ride = attrs.relation(dm_cls='Ride')
> class Driver(RideshareBase):
>     is_taken: bool = attrs.required
>     current_ride = attrs.relation(dm_cls='Ride').optional
>     current_location: 'GeoCell' = attrs.relation(dm_cls='GeoCell')
>     @is_taken.getter
>     def is_taken(self):
>         # TODO: make better
>         return self.current_ride is not None
>     async def pickup_passenger(self, ride: 'Ride', passenger: Passenger,
>                                passenger_start_cell: 'GeoCell',
>   

[jira] [Commented] (FLINK-21287) Failed to build flink source code

2021-02-08 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281452#comment-17281452
 ] 

Chesnay Schepler commented on FLINK-21287:
--

Can you give us the full Maven output?

> Failed to build flink source code
> -
>
> Key: FLINK-21287
> URL: https://issues.apache.org/jira/browse/FLINK-21287
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>Affects Versions: 1.12.1
>Reporter: Lei Qu
>Priority: Major
>
> [ERROR] Failed to execute goal 
> org.xolstice.maven.plugins:protobuf-maven-plugin:0.5.1:test-compile (default) 
> on project flink-parquet_2.11: protoc did not exit cleanly. Review output for 
> more information. -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR] mvn  -rf :flink-parquet_2.11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-21314) Airflow | Edit DAG code from the UI

2021-02-08 Thread Chesnay Schepler (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chesnay Schepler closed FLINK-21314.

Resolution: Invalid

Head over to https://github.com/apache/airflow.

> Airflow | Edit DAG code from the UI
> ---
>
> Key: FLINK-21314
> URL: https://issues.apache.org/jira/browse/FLINK-21314
> Project: Flink
>  Issue Type: New Feature
>Reporter: ohad
>Priority: Minor
>
> Hi. 
> [The project should be AIRFLOW; I don't know why I can't change the project.]
>  I want to create a new pull request and add a feature that enables editing 
> the DAG code and saving the changes.
>  This would be super helpful in my daily work.
>  What do you think?
> Thanks :)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on pull request #14678: [FLINK-20833][runtime] Add pluggable failure listener in job manager

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14678:
URL: https://github.com/apache/flink/pull/14678#issuecomment-761905721


   
   ## CI report:
   
   * cade20e85b29ca63c51383dca04976c1d9801042 UNKNOWN
   * 34720c9f7ea37afb5d7f3d2a824b78fee916b755 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13120)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] sjwiesman edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


sjwiesman edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775465028


   @zentol I've updated the styling for the rest API and configurations. 
Additonally, I've externalized the css to `_custom.scss` so if we want to 
continue to tweak it that's where you should look. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14834: [FLINK-21234][build system] Adding timeout to all tasks

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14834:
URL: https://github.com/apache/flink/pull/14834#issuecomment-771607546


   
   ## CI report:
   
   * 3c55d1b87076148f095c08a55319f1b898dc932d UNKNOWN
   * 5c8ac11faf9d6fbcad7550bec8abf6a6343347c3 UNKNOWN
   * 37ac04e8b84cab10b3a98541429ffc1a764ad859 Azure: 
[CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13109)
 
   * fc886cc2ebb0d9ffee932e5b6ed00df5e6be9136 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13129)
 
   * 178027f14650f60c887f02b7add583d12a0aef86 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13127)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-21324) statefun-testutil can't assert messages function sends to itself

2021-02-08 Thread Timur (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281421#comment-17281421
 ] 

Timur commented on FLINK-21324:
---

Please review the pull request that fixes this bug: 
https://github.com/apache/flink-statefun/pull/199

> statefun-testutil can't assert messages function sends to itself
> 
>
> Key: FLINK-21324
> URL: https://issues.apache.org/jira/browse/FLINK-21324
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-2.2.2
>Reporter: Timur
>Priority: Major
>
> Assertions don't work for messages sent by functions to themselves. The 
> reason is that TestContext doesn't add a message to responses:
>  
> {code:java}
> @Override
> public void send(Address to, Object message) {
>   if (to.equals(selfAddress)) {
> messages.add(new Envelope(self(), to, message));
> return;
>   }
>   responses.computeIfAbsent(to, ignore -> new ArrayList<>()).add(message);
> }
> {code}
> Instead of adding the message to responses, the method returns right after 
> adding it to messages.
>  
> Here is an example of an assertion that does not work when a function has 
> sent a message to itself:
> {code:java}
> assertThat(harness.invoke(aliceDiseaseDiagnosedEvent()), sentNothing());
> {code}
> The test won't fail even though the message was really sent.
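A self-contained model of one possible fix is sketched below: the self-addressed message is still enqueued for delivery, but the early return is removed so the message is also recorded where assertions look. The names mirror the snippet above in miniature; this is our own sketch, not the actual statefun-testutil code or the linked PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Miniature TestContext-like model demonstrating the fixed send():
// self-sent messages are both delivered and observable by assertions.
public class SelfSendSketch {
    static final class Address {
        final String id;
        Address(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof Address && ((Address) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    final Address selfAddress = new Address("self");
    // Pending self-deliveries (drive further invocations in a harness).
    final List<Object> messages = new ArrayList<>();
    // What assertions inspect: everything the function sent, per target.
    final Map<Address, List<Object>> responses = new HashMap<>();

    // Fixed send(): no early return, so the self-message is recorded too.
    public void send(Address to, Object message) {
        if (to.equals(selfAddress)) {
            messages.add(message);
        }
        responses.computeIfAbsent(to, ignore -> new ArrayList<>()).add(message);
    }

    public List<Object> sentTo(Address to) {
        return responses.getOrDefault(to, Collections.emptyList());
    }

    public static void main(String[] args) {
        SelfSendSketch ctx = new SelfSendSketch();
        ctx.send(ctx.selfAddress, "ping");
        // With the early return removed, assertions now see the message.
        System.out.println("recorded: " + ctx.sentTo(ctx.selfAddress).size());
    }
}
```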



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-21324) statefun-testutil can't assert messages function sends to itself

2021-02-08 Thread Timur (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-21324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timur updated FLINK-21324:
--
Description: 
Assertions don't work for messages sent by functions to themselves. The reason 
is that TestContext doesn't add a message to responses: 
{code:java}
@Override
public void send(Address to, Object message) {
  if (to.equals(selfAddress)) {
messages.add(new Envelope(self(), to, message));
return;
  }
  responses.computeIfAbsent(to, ignore -> new ArrayList<>()).add(message);
}
{code}
Instead of adding the message to responses, the method returns right after 
adding it to messages.

Here is an example of an assertion that does not work when a function has 
sent a message to itself:
{code:java}
assertThat(harness.invoke(aliceDiseaseDiagnosedEvent()), sentNothing());
{code}
The test won't fail even though the message was really sent.

  was:
Assertions don't work for messages sent by functions to themselves. The reason 
is that TestContext doesn't add a message to responses:

 
{code:java}
@Override
public void send(Address to, Object message) {
  if (to.equals(selfAddress)) {
messages.add(new Envelope(self(), to, message));
return;
  }
  responses.computeIfAbsent(to, ignore -> new ArrayList<>()).add(message);
}
{code}
Instead of adding the message to responses the method returns right after 
message added to messages. 

 

Here is the example of the assertion that doesn't work in case a function sent 
a message to itself:
{code:java}
assertThat(harness.invoke(aliceDiseaseDiagnosedEvent()), sentNothing());
{code}
The test won't fail even though the message was really sent.


> statefun-testutil can't assert messages function sends to itself
> 
>
> Key: FLINK-21324
> URL: https://issues.apache.org/jira/browse/FLINK-21324
> Project: Flink
>  Issue Type: Bug
>  Components: Stateful Functions
>Affects Versions: statefun-2.2.2
>Reporter: Timur
>Priority: Major
>
> Assertions don't work for messages sent by functions to themselves. The 
> reason is that TestContext doesn't add a message to responses: 
> {code:java}
> @Override
> public void send(Address to, Object message) {
>   if (to.equals(selfAddress)) {
> messages.add(new Envelope(self(), to, message));
> return;
>   }
>   responses.computeIfAbsent(to, ignore -> new ArrayList<>()).add(message);
> }
> {code}
> Instead of adding the message to responses, the method returns right after 
> adding it to messages.
> Here is an example of an assertion that does not work when a function has 
> sent a message to itself:
> {code:java}
> assertThat(harness.invoke(aliceDiseaseDiagnosedEvent()), sentNothing());
> {code}
> The test won't fail even though the message was really sent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink-statefun] f1xmAn opened a new pull request #199: fix assertions for messages function sends to itself

2021-02-08 Thread GitBox


f1xmAn opened a new pull request #199:
URL: https://github.com/apache/flink-statefun/pull/199


   Here is the bug this PR fixes 
https://issues.apache.org/jira/browse/FLINK-21324



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] zentol commented on a change in pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


zentol commented on a change in pull request #14903:
URL: https://github.com/apache/flink/pull/14903#discussion_r572422797



##
File path: 
flink-docs/src/main/java/org/apache/flink/docs/rest/RestAPIDocGenerator.java
##
@@ -220,32 +220,42 @@ private static String createHtmlEntry(MessageHeaders spec) {
 {
 sb.append("\n");
 sb.append("  \n");
-sb.append(
-"Request\n");
-sb.append("\n");
+sb.append("  \n");
+sb.append("\n");
+sb.append("  \n");
+sb.append("Request\n");
+sb.append("...\n");

Review comment:
   actually, why not just a single `span` without the `justify-between`? 
Having the dots/arrow not jump to the right when opened seems nicer?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #14834: [FLINK-21234][build system] Adding timeout to all tasks

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14834:
URL: https://github.com/apache/flink/pull/14834#issuecomment-771607546


   
   ## CI report:
   
   * 3c55d1b87076148f095c08a55319f1b898dc932d UNKNOWN
   * 5c8ac11faf9d6fbcad7550bec8abf6a6343347c3 UNKNOWN
   * 37ac04e8b84cab10b3a98541429ffc1a764ad859 Azure: 
[CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13109)
 
   * fc886cc2ebb0d9ffee932e5b6ed00df5e6be9136 UNKNOWN
   * 178027f14650f60c887f02b7add583d12a0aef86 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (FLINK-21324) statefun-testutil can't assert messages function sends to itself

2021-02-08 Thread Timur (Jira)
Timur created FLINK-21324:
-

 Summary: statefun-testutil can't assert messages function sends to 
itself
 Key: FLINK-21324
 URL: https://issues.apache.org/jira/browse/FLINK-21324
 Project: Flink
  Issue Type: Bug
  Components: Stateful Functions
Affects Versions: statefun-2.2.2
Reporter: Timur


Assertions don't work for messages sent by a function to itself. The reason is 
that TestContext doesn't add the message to responses:

 
{code:java}
@Override
public void send(Address to, Object message) {
  if (to.equals(selfAddress)) {
messages.add(new Envelope(self(), to, message));
return;
  }
  responses.computeIfAbsent(to, ignore -> new ArrayList<>()).add(message);
}
{code}
Instead of adding the message to responses, the method returns right after the 
message is added to messages.

 

Here is an example of an assertion that doesn't work when a function sends a 
message to itself:
{code:java}
assertThat(harness.invoke(aliceDiseaseDiagnosedEvent()), sentNothing());
{code}
The test won't fail even though the message was really sent.
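A minimal, self-contained sketch of a possible fix: drop the early return so a 
self-addressed message is recorded in responses as well. The Address, Envelope, 
and TestContext types below are hypothetical stand-ins, not the real 
statefun-testutil classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SelfSendSketch {
    // Hypothetical stand-ins, just to make the sketch compile on its own.
    record Address(String type, String id) {}
    record Envelope(Address from, Address to, Object message) {}

    static class TestContext {
        final Address selfAddress;
        final List<Envelope> messages = new ArrayList<>();
        final Map<Address, List<Object>> responses = new HashMap<>();

        TestContext(Address selfAddress) {
            this.selfAddress = selfAddress;
        }

        Address self() {
            return selfAddress;
        }

        // Sketch of the fix: a self-addressed message is still enqueued for
        // delivery, but without the early `return` it is also recorded in
        // `responses`, which is where the assertion matchers look.
        void send(Address to, Object message) {
            if (to.equals(selfAddress)) {
                messages.add(new Envelope(self(), to, message));
            }
            responses.computeIfAbsent(to, ignore -> new ArrayList<>()).add(message);
        }
    }

    public static void main(String[] args) {
        Address alice = new Address("example/fn", "alice");
        TestContext ctx = new TestContext(alice);
        ctx.send(alice, "self-note");
        // The self-sent message now shows up in responses, so a matcher like
        // sentNothing() would correctly fail.
        System.out.println(ctx.responses.get(alice));
    }
}
```

With this change, the self-sent message is visible to response assertions while 
still being delivered back to the function via the messages queue.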



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zentol commented on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


zentol commented on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775488365


   When some item in the sidebar is expanded such that the scrollbar appears, 
everything gets resized. Ideally, it should just appear without things shifting 
around.







[GitHub] [flink] flinkbot edited a comment on pull request #14902: [FLINK-21138][queryableState] - User ClassLoader in KvStateServerHandler

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14902:
URL: https://github.com/apache/flink/pull/14902#issuecomment-775312905


   
   ## CI report:
   
   * 7739e63ba32156084a9c9f5dc47d681c62a3 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13119)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 521acd6627e3b85ab8ef7386cae10728fd97db9c Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13123)
 
   * 47c27930dce7d7ea2dee3d0f434fe2511ab5 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13128)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] zentol commented on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


zentol commented on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775486659


   The issue with headings having massive whitespace above them is due to both 
`margin-top` and `padding-top` being set. Removing either would do the trick, I 
think. (Throwing both out isn't so bad either, actually...)







[GitHub] [flink] zentol commented on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


zentol commented on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775484371


   I think the code tabs have lost a bit of clarity; it's not immediately 
obvious that you can click them in the first place.
   
   
![tabs](https://user-images.githubusercontent.com/5725237/107282734-ae134a00-6a5b-11eb-8fad-1cd4724f86b5.png)
   
   In a similar vein, the REST API request/response drop-down elements are also 
not rendered in a way that indicates they are interactive. The previous 
approach was admittedly crude, but a down-arrow after the text would likely do 
the trick.
   
   This will remove some excessive whitespace, and ensure that the url font 
size isn't smaller than usual text:
   
   ```
   .rest-api h5 {
       margin-top: .5em;
       margin-bottom: .5em;
       font-size: 1em;
   }
   ```
   







[GitHub] [flink] zentol commented on a change in pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


zentol commented on a change in pull request #14903:
URL: https://github.com/apache/flink/pull/14903#discussion_r572395853



##
File path: 
flink-docs/src/main/java/org/apache/flink/docs/rest/RestAPIDocGenerator.java
##
@@ -220,32 +220,42 @@ private static String createHtmlEntry(MessageHeaders spec) {
 {
 sb.append("\n");
 sb.append("  \n");
-sb.append(
-"Request\n");
-sb.append("\n");
+sb.append("  \n");
+sb.append("\n");
+sb.append("  \n");
+sb.append("Request\n");
+sb.append("...\n");

Review comment:
   How about:
   
   `sb.append("▾\n");`









[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 521acd6627e3b85ab8ef7386cae10728fd97db9c Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13123)
 
   * 47c27930dce7d7ea2dee3d0f434fe2511ab5 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] flinkbot edited a comment on pull request #14834: [FLINK-21234][build system] Adding timeout to all tasks

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14834:
URL: https://github.com/apache/flink/pull/14834#issuecomment-771607546


   
   ## CI report:
   
   * 3c55d1b87076148f095c08a55319f1b898dc932d UNKNOWN
   * 5c8ac11faf9d6fbcad7550bec8abf6a6343347c3 UNKNOWN
   * 37ac04e8b84cab10b3a98541429ffc1a764ad859 Azure: 
[CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13109)
 
   * fc886cc2ebb0d9ffee932e5b6ed00df5e6be9136 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] sjwiesman commented on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


sjwiesman commented on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775465028


   @zentol I've updated the styling for the REST API and configurations. 
Additionally, I've externalized the CSS to ` so if we want to continue to tweak 
it, that's where you should look.







[jira] [Commented] (FLINK-21323) Stop-with-savepoint is not supported by SourceOperatorStreamTask

2021-02-08 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281373#comment-17281373
 ] 

Roman Khachatryan commented on FLINK-21323:
---

Probably a duplicate of FLINK-21133.

> Stop-with-savepoint is not supported by SourceOperatorStreamTask
> 
>
> Key: FLINK-21323
> URL: https://issues.apache.org/jira/browse/FLINK-21323
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 1.11.3, 1.12.1
>Reporter: Matthias
>Priority: Critical
> Fix For: 1.11.4, 1.12.2, 1.13.0
>
>
> While looking into FLINK-21030 and analyzing the stop-with-savepoint 
> behavior, we implemented different test jobs covering the old {{addSource}} 
> and new {{fromSource}} methods for adding sources. Stop-with-savepoint 
> consists of two phases:
>  # Create the savepoint
>  # Stop the source to trigger finalizing the job
> The test using {{fromSource}} fails in the second phase. The reason might be 
> that {{finishTask}} is not implemented by {{SourceOperatorStreamTask}}, in 
> contrast to {{SourceStreamTask}}, which is used when calling {{addSource}} 
> in the job definition. Hence, the job termination is never triggered.
> We might have missed this because the {{JobMasterStopWithSavepointIT}} test 
> is not run by Maven due to its wrong suffix. The IT is failing right now. 
> FLINK-21031 already covers fixing {{JobMasterStopWithSavepointIT}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink-training] xhudik opened a new pull request #17: Readme updated

2021-02-08 Thread GitBox


xhudik opened a new pull request #17:
URL: https://github.com/apache/flink-training/pull/17


   small changes in Readmes







[GitHub] [flink-training] xhudik closed pull request #16: added link + typo

2021-02-08 Thread GitBox


xhudik closed pull request #16:
URL: https://github.com/apache/flink-training/pull/16


   







[GitHub] [flink] flinkbot edited a comment on pull request #14900: (1.12) [FLINK-21312][checkpointing] Reset OperatorChain.isStoppingBySyncSavуpoint properly

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14900:
URL: https://github.com/apache/flink/pull/14900#issuecomment-775088580


   
   ## CI report:
   
   * 18ad713d9f142ea56a521284d26c95746ee862e3 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13108)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] flinkbot edited a comment on pull request #14903: [FLINK-21193][docs] Migrate Flink docs from Jekyll to Hugo

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14903:
URL: https://github.com/apache/flink/pull/14903#issuecomment-775348759


   
   ## CI report:
   
   * 10b2499a2a18cd4ccec518d0aeaf719658036659 UNKNOWN
   * 8f7a3dac732790d6b6de91dc2a68cf344ece945b Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13122)
 
   * 521acd6627e3b85ab8ef7386cae10728fd97db9c Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13123)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   







[GitHub] [flink] flinkbot edited a comment on pull request #14892: [FLINK-21312][checkpointing] Reset OperatorChain.isStoppingBySyncSavуpoint properly

2021-02-08 Thread GitBox


flinkbot edited a comment on pull request #14892:
URL: https://github.com/apache/flink/pull/14892#issuecomment-774773480


   
   ## CI report:
   
   * 8d1c709aa79aed1e18324dbafef7b7bac305c452 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=13105)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   






