[GitHub] [flink] flinkbot edited a comment on pull request #13735: [FLINK-19533][checkpoint] Add channel state reassignment for unaligned checkpoints.

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13735:
URL: https://github.com/apache/flink/pull/13735#issuecomment-714048414


   
   ## CI report:
   
   * e081fecba324a0e35017f4b89cc0499c9112a768 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8049)
 
   * ee73393a13c999be626f7b8b98a170be88a093d9 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] QingdongZeng3 commented on pull request #13690: [FLINK-16595][YARN]support more HDFS nameServices in yarn mode when security enabled. Is…

2020-10-22 Thread GitBox


QingdongZeng3 commented on pull request #13690:
URL: https://github.com/apache/flink/pull/13690#issuecomment-714257949


   Hi Robert,
   I have passed the CI. Would you help review my code? Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] gaoyunhaii opened a new pull request #13740: [FLINK-19758][filesystem] Implement a new unified file sink based on the new sink API

2020-10-22 Thread GitBox


gaoyunhaii opened a new pull request #13740:
URL: https://github.com/apache/flink/pull/13740


   ## What is the purpose of the change
   
   This pull request implements a new file sink based on the new sink API.
   
   
   ## Brief change log
   
   af69ad89405936040b55d54acb0ef18e6141e5ad implements the sink.
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
- Unit tests for each component of the new sink.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): **no**
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **yes**
 - The serializers: **no**
 - The runtime per-record code paths (performance sensitive): **no**
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: **no**
 - The S3 file system connector:  **no**
   
   ## Documentation
   
 - Does this pull request introduce a new feature? **no**
 - If yes, how is the feature documented? **not applicable**
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] QingdongZeng3 edited a comment on pull request #13690: [FLINK-16595][YARN]support more HDFS nameServices in yarn mode when security enabled. Is…

2020-10-22 Thread GitBox


QingdongZeng3 edited a comment on pull request #13690:
URL: https://github.com/apache/flink/pull/13690#issuecomment-714257949


   Hi Robert,
   I have passed the CI. Would you help review my code? Thanks!
   @rmetzger 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (FLINK-19758) Implement a new unified File Sink based on the new Sink API

2020-10-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-19758:
---
Labels: pull-request-available  (was: )

> Implement a new unified File Sink based on the new Sink API
> ---
>
> Key: FLINK-19758
> URL: https://issues.apache.org/jira/browse/FLINK-19758
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / DataStream, Connectors / FileSystem
>Affects Versions: 1.12.0
>Reporter: Yun Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] gaoyunhaii commented on pull request #11326: [FLINK-11427][formats] Add protobuf parquet support for StreamingFileSink

2020-10-22 Thread GitBox


gaoyunhaii commented on pull request #11326:
URL: https://github.com/apache/flink/pull/11326#issuecomment-714271222


   Many thanks @aljoscha! I'll close this PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] gaoyunhaii closed pull request #11326: [FLINK-11427][formats] Add protobuf parquet support for StreamingFileSink

2020-10-22 Thread GitBox


gaoyunhaii closed pull request #11326:
URL: https://github.com/apache/flink/pull/11326


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery

2020-10-22 Thread GitBox


curcur edited a comment on pull request #13648:
URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577


   > * Visibility in normal case: none of the fields written in `releaseView` 
are `volatile`. So in the normal case (`t1:release` then `t2:createReadView`) `t2` 
can see some inconsistent state. For example, `readView == null`, but 
`isPartialBufferCleanupRequired == false`. Right?
   >   Maybe call `releaseView()` from `createReadView()` unconditionally?
   
   Aren't they guarded by the synchronization block? I think in the existing 
code, almost all access is guarded by the lock. But I have a simpler solution 
just in case. See the last sentence.
   
   > * Overwrites when release is slow: won't `t1` overwrite changes to 
`PipelinedSubpartition` already made by `t2`? For example, reset 
`sequenceNumber` after `t2` has sent some data?
   >   Maybe `PipelinedSubpartition.readerView` should be `AtomicReference` and 
then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it?
   
   I think this is the same question I answered in the write-up above.
   In short, it won't be possible, because a view can only be released once, and 
this is guarded by the release flag of the view; details are quoted below. The 
updates are guarded by the lock. 
   
   - What if netty thread1 releases the view after netty thread2 recreates the 
view?
   Thread2 releases the view that thread1 holds a reference to before 
creating a new view. Thread1 cannot release the old view (through the view 
reference) again afterwards, since a view can only be released once.
   
   I actually have an idea to simplify this whole model:
   **If we only release the view before creation and nowhere else, this whole 
threading interaction model would be greatly simplified. That is, only the new 
netty thread can release the view.**
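   A minimal sketch of the `AtomicReference`/CAS idea under discussion, using 
hypothetical names (`ViewReader`, `readerView`) rather than Flink's actual classes:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only; not the actual PipelinedApproximateSubpartition code.
class CasReleaseSketch {

	// Current read view; null means no view is attached.
	private final AtomicReference<ViewReader> readerView = new AtomicReference<>();

	// Releases the given view exactly once: only the thread that wins the
	// CAS performs the cleanup, so a slow releaser can never overwrite state
	// already written for a newer view.
	void releaseView(ViewReader expected) {
		if (readerView.compareAndSet(expected, null)) {
			expected.release();
		}
	}

	// Creates a new view, unconditionally releasing the old one first.
	ViewReader createReadView() {
		final ViewReader old = readerView.get();
		if (old != null) {
			releaseView(old);
		}
		final ViewReader view = new ViewReader();
		readerView.set(view);
		return view;
	}

	static final class ViewReader {
		void release() {
			// free buffers, reset partial-buffer cleanup flags, etc.
		}
	}
}
```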



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (FLINK-19626) Introduce multi-input operator construction algorithm

2020-10-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-19626:
---
Labels: pull-request-available  (was: )

> Introduce multi-input operator construction algorithm
> -
>
> Key: FLINK-19626
> URL: https://issues.apache.org/jira/browse/FLINK-19626
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Planner
>Reporter: Caizhi Weng
>Assignee: Caizhi Weng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> We should introduce an algorithm to organize exec nodes into multi-input exec 
> nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] TsReaper opened a new pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm

2020-10-22 Thread GitBox


TsReaper opened a new pull request #13742:
URL: https://github.com/apache/flink/pull/13742


   ## What is the purpose of the change
   
   As multiple input exec nodes have been introduced, we're going to construct 
multiple input operators with a new construction algorithm. This PR introduces 
such an algorithm.
   
   Multiple input optimization is currently not used by default and is only used 
for tests, as its operator is not ready. We'll change this once the multiple 
input operator is ready.
   
   ## Brief change log
   
- Introduce multi-input operator construction algorithm
   
   ## Verifying this change
   
   This change added tests and can be verified as follows: Run the newly added 
tests.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers: no
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13703: [FLINK-19696] Add runtime batch committer operators for the new sink API

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13703:
URL: https://github.com/apache/flink/pull/13703#issuecomment-712821004


   
   ## CI report:
   
   * 5166271f74497afddca3a7f350b8191e2a843298 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7938)
 
   * 58ee54d456da57f99b6124c787b7cbaeedcfb261 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (FLINK-16218) Chinese character is garbled when reading from MySQL using JDBC source

2020-10-22 Thread Jark Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jark Wu closed FLINK-16218.
---
Fix Version/s: (was: 1.11.3)
   (was: 1.10.3)
   (was: 1.12.0)
   Resolution: Not A Problem

> Chinese character is garbled when reading from MySQL using JDBC source
> --
>
> Key: FLINK-16218
> URL: https://issues.apache.org/jira/browse/FLINK-16218
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / JDBC, Table SQL / Planner
>Affects Versions: 1.10.0
>Reporter: Jark Wu
>Priority: Major
> Attachments: 15821322356269.jpg
>
>
> I have set the database and table to use UTF8 collations, and use 
> {{jdbc:mysql://localhost:3306/db_test?useUnicode=true&characterEncoding=utf-8}}
>  as the connection jdbc url. However, the Chinese characters are still 
> garbled.
> Btw, we should have a test to cover this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19761) Add lookup method for registered ShuffleDescriptor in ShuffleMaster

2020-10-22 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218839#comment-17218839
 ] 

Till Rohrmann commented on FLINK-19761:
---

Thanks for creating this issue [~xuannan]. How will this work if the 
{{ShuffleMaster}} you are asking does not know about the cluster partition 
(e.g. if you are using two distinct external shuffle services, or when using 
the NettyShuffleService and requesting a cluster partition from a different 
cluster)?

> Add lookup method for registered ShuffleDescriptor in ShuffleMaster
> ---
>
> Key: FLINK-19761
> URL: https://issues.apache.org/jira/browse/FLINK-19761
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Xuannan Su
>Priority: Major
>
> Currently, the ShuffleMaster can register a partition and get the shuffle 
> descriptor. However, it lacks the ability to look up the registered 
> ShuffleDescriptors belonging to an IntermediateResult by the 
> IntermediateDataSetID.
> Adding the lookup method to the ShuffleMaster makes reusing the cluster 
> partition easier. For example, we don't have to return the 
> ShuffleDescriptor to the client just so that another job can somehow encode 
> the ShuffleDescriptor in the JobGraph to consume the cluster partition. 
> Instead, we only need to return the IntermediateDataSetID, which another job 
> can use to look up the ShuffleDescriptor.
> By adding the lookup method to the ShuffleMaster, if we have an external 
> shuffle service and the lifecycle of the IntermediateResult is not bound to 
> the cluster, a job running on another cluster can look up the 
> ShuffleDescriptor and reuse the IntermediateResult even if the cluster 
> that produced the IntermediateResult has been shut down.
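A possible shape of the proposed lookup (a sketch only; the interface and 
method name below are assumptions for illustration, not the actual API):

{code:java}
import java.util.List;
import java.util.Optional;

import org.apache.flink.runtime.jobgraph.IntermediateDataSetID;
import org.apache.flink.runtime.shuffle.ShuffleDescriptor;

// Illustrative sketch of the proposed addition; names are assumptions.
interface ShuffleMasterLookup {

	/**
	 * Returns the ShuffleDescriptors registered for the IntermediateResult
	 * identified by the given id, or Optional.empty() if this ShuffleMaster
	 * does not know the cluster partition (e.g. it was produced by a
	 * different shuffle service or on another cluster, as raised above).
	 */
	Optional<List<ShuffleDescriptor>> lookupShuffleDescriptors(IntermediateDataSetID dataSetId);
}
{code}

Returning Optional.empty() rather than throwing would also give a natural 
answer to the cross-cluster case raised in the comment above.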



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-19136) MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the predicate within the allowed time"

2020-10-22 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-19136:
--

Assignee: Robert Metzger

> MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the 
> predicate within the allowed time"
> 
>
> Key: FLINK-19136
> URL: https://issues.apache.org/jira/browse/FLINK-19136
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.12.0
>Reporter: Dian Fu
>Assignee: Robert Metzger
>Priority: Major
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6179&view=logs&j=9fca669f-5c5f-59c7-4118-e31c641064f0&t=91bf6583-3fb2-592f-e4d4-d79d79c3230a]
> {code}
> 2020-09-03T23:33:18.3687261Z [ERROR] 
> testReporter(org.apache.flink.metrics.tests.MetricsAvailabilityITCase)  Time 
> elapsed: 15.217 s  <<< ERROR!
> 2020-09-03T23:33:18.3698260Z java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
> satisfy the predicate within the allowed time.
> 2020-09-03T23:33:18.3698749Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-09-03T23:33:18.3699163Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> 2020-09-03T23:33:18.3699754Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.fetchMetric(MetricsAvailabilityITCase.java:162)
> 2020-09-03T23:33:18.3700234Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.checkJobManagerMetricAvailability(MetricsAvailabilityITCase.java:116)
> 2020-09-03T23:33:18.3700726Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.testReporter(MetricsAvailabilityITCase.java:101)
> 2020-09-03T23:33:18.3701097Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-09-03T23:33:18.3701425Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-09-03T23:33:18.3701798Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-09-03T23:33:18.3702146Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-09-03T23:33:18.3702471Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-09-03T23:33:18.3702866Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-09-03T23:33:18.3703253Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-09-03T23:33:18.3703621Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-09-03T23:33:18.3703997Z  at 
> org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-09-03T23:33:18.3704339Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-09-03T23:33:18.3704629Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-09-03T23:33:18.3704940Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-09-03T23:33:18.3705354Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-09-03T23:33:18.3705725Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-09-03T23:33:18.3706072Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-03T23:33:18.3706397Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-03T23:33:18.3706714Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-03T23:33:18.3707044Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-09-03T23:33:18.3707373Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-09-03T23:33:18.3707708Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-03T23:33:18.3708073Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-09-03T23:33:18.3708410Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-09-03T23:33:18.3708691Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2020-09-03T23:33:18.3708976Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2020-09-03T23:33:18.3709273Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-03T23:33:18.3709579Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-03T23:33:18.3709910Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-03T23:33:18.3710242Z  at 
> 

[GitHub] [flink] AHeise merged pull request #13733: (1.10 backport) [FLINK-19401][checkpointing] Download checkpoints only if needed

2020-10-22 Thread GitBox


AHeise merged pull request #13733:
URL: https://github.com/apache/flink/pull/13733


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (FLINK-19765) flink SqlUseCatalog.getCatalogName is not unified with SqlCreateCatalog and SqlDropCatalog

2020-10-22 Thread jackylau (Jira)
jackylau created FLINK-19765:


 Summary: flink SqlUseCatalog.getCatalogName is not unified with 
SqlCreateCatalog and SqlDropCatalog
 Key: FLINK-19765
 URL: https://issues.apache.org/jira/browse/FLINK-19765
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.11.0
Reporter: jackylau
 Fix For: 1.12.0


While developing a Flink Ranger plugin at the operation level, I found that 
this method is not unified with SqlCreateCatalog and SqlDropCatalog.

Also, SqlToOperationConverter.convert needs a well-defined ordering so that 
users can find the relevant code easily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zhijiangW commented on a change in pull request #13595: [FLINK-19582][network] Introduce sort-merge based blocking shuffle to Flink

2020-10-22 Thread GitBox


zhijiangW commented on a change in pull request #13595:
URL: https://github.com/apache/flink/pull/13595#discussion_r509983937



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/SortBuffer.java
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.io.network.partition;
+
+import org.apache.flink.core.memory.MemorySegment;
+import org.apache.flink.runtime.io.network.buffer.Buffer;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+import static org.apache.flink.util.Preconditions.checkNotNull;
+
+/**
+ * Data of different channels can be appended to a {@link SortBuffer} and 
after the {@link SortBuffer} is
+ * finished, the appended data can be copied from it in channel index order.
+ */
+public interface SortBuffer {
+
+   /**
+* Appends data of the specified channel to this {@link SortBuffer} and 
returns true if all bytes of
+* the source buffer are copied to this {@link SortBuffer} successfully; 
otherwise it returns false and
+* nothing is copied.
+*/
+   boolean append(ByteBuffer source, int targetChannel, Buffer.DataType 
dataType) throws IOException;
+
+   /**
+* Copies data in this {@link SortBuffer} to the target {@link 
MemorySegment} in channel index order
+* and returns {@link BufferWithChannel} which contains the copied data 
and the corresponding channel
+* index.
+*/
+   BufferWithChannel copyData(MemorySegment target);

Review comment:
   nit: rename to `copyIntoSegment` for better readability; otherwise I might 
be a bit confused about whether it is `copyDataTo` or `copyDataFrom` in 
internal processes.
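   For reference, a minimal usage sketch of the interface above (the wrapper 
class, control flow, and spill step are illustrative assumptions, not the 
PR's code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.flink.core.memory.MemorySegment;
import org.apache.flink.runtime.io.network.buffer.Buffer;

// Sketch only: assumes SortBuffer and BufferWithChannel from the diff above
// are on the classpath (same package).
class SortBufferUsageSketch {

	void appendOrSpill(SortBuffer sortBuffer, ByteBuffer record, int channel,
			MemorySegment segment) throws IOException {
		// append() returns false when the record does not fit; in that case
		// nothing was copied and the buffer must be drained first.
		if (!sortBuffer.append(record, channel, Buffer.DataType.DATA_BUFFER)) {
			// After the buffer is finished, copy it out in channel index
			// order, one MemorySegment at a time.
			BufferWithChannel copied = sortBuffer.copyData(segment);
			// ... spill 'copied' to the partition file, then retry append ...
		}
	}
}
```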





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (FLINK-19738) Remove the `getCommitter` from the `AbstractStreamingCommitterOperator`

2020-10-22 Thread Guowei Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guowei Ma closed FLINK-19738.
-
Fix Version/s: 1.12.0
   Resolution: Resolved

> Remove the `getCommitter` from the `AbstractStreamingCommitterOperator`
> ---
>
> Key: FLINK-19738
> URL: https://issues.apache.org/jira/browse/FLINK-19738
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / DataStream
>Reporter: Guowei Ma
>Priority: Minor
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] gm7y8 edited a comment on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI

2020-10-22 Thread GitBox


gm7y8 edited a comment on pull request #13458:
URL: https://github.com/apache/flink/pull/13458#issuecomment-714289878


   @XComp  You are welcome! I have also verified the fix with the Word Count job:
   
   Screenshot: https://user-images.githubusercontent.com/12487613/96838691-06ff9980-13fd-11eb-93dd-ae8545cabbdb.png
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] tillrohrmann commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API

2020-10-22 Thread GitBox


tillrohrmann commented on a change in pull request #13644:
URL: https://github.com/apache/flink/pull/13644#discussion_r509942532



##
File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionService.java
##
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.kubernetes.highavailability;
+
+import org.apache.flink.kubernetes.kubeclient.FlinkKubeClient;
+import 
org.apache.flink.kubernetes.kubeclient.KubernetesLeaderElectionConfiguration;
+import org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMap;
+import 
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector;
+import org.apache.flink.kubernetes.kubeclient.resources.KubernetesWatch;
+import org.apache.flink.kubernetes.utils.KubernetesUtils;
+import org.apache.flink.runtime.leaderelection.AbstractLeaderElectionService;
+import org.apache.flink.runtime.leaderelection.LeaderContender;
+import org.apache.flink.util.function.FunctionUtils;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.Executor;
+
+import static 
org.apache.flink.kubernetes.utils.Constants.LABEL_CONFIGMAP_TYPE_HIGH_AVAILABILITY;
+import static org.apache.flink.kubernetes.utils.Constants.LEADER_ADDRESS_KEY;
+import static 
org.apache.flink.kubernetes.utils.Constants.LEADER_SESSION_ID_KEY;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+
+/**
+ * Leader election service for multiple JobManagers. The active JobManager is 
elected using Kubernetes.
+ * The current leader's address as well as its leader session ID are published 
via a Kubernetes ConfigMap.
+ * Note that the contending lock and leader storage use the same 
ConfigMap, and every component (e.g.
+ * ResourceManager, Dispatcher, RestEndpoint, JobManager for each job) will 
have a separate ConfigMap.
+ */
+public class KubernetesLeaderElectionService extends 
AbstractLeaderElectionService {
+
+   private final FlinkKubeClient kubeClient;
+
+   private final Executor executor;
+
+   private final String configMapName;
+
+   private final KubernetesLeaderElector leaderElector;
+
+   private KubernetesWatch kubernetesWatch;
+
+   // Labels will be used to clean up the ha related ConfigMaps.
+   private Map<String, String> configMapLabels;
+
+   KubernetesLeaderElectionService(
+   FlinkKubeClient kubeClient,
+   Executor executor,
+   KubernetesLeaderElectionConfiguration leaderConfig) {
+
+   this.kubeClient = checkNotNull(kubeClient, "Kubernetes client 
should not be null.");
+   this.executor = checkNotNull(executor, "Executor should not be 
null.");
+   this.configMapName = leaderConfig.getConfigMapName();
+   this.leaderElector = 
kubeClient.createLeaderElector(leaderConfig, new LeaderCallbackHandlerImpl());
+   this.leaderContender = null;
+   this.configMapLabels = KubernetesUtils.getConfigMapLabels(
+   leaderConfig.getClusterId(), 
LABEL_CONFIGMAP_TYPE_HIGH_AVAILABILITY);
+   }
+
+   @Override
+   public void internalStart(LeaderContender contender) {
+   CompletableFuture.runAsync(leaderElector::run, executor);
+   kubernetesWatch = kubeClient.watchConfigMaps(configMapName, new 
ConfigMapCallbackHandlerImpl());
+   }
+
+   @Override
+   public void internalStop() {
+   if (kubernetesWatch != null) {
+   kubernetesWatch.close();
+   }
+   }
+
+   @Override
+   protected void writeLeaderInformation() {
+   try {
+   kubeClient.checkAndUpdateConfigMap(
+   configMapName,
+   configMap -> {
+   if 
(leaderElector.hasLeadership(configMap)) {
+   // Get the updated ConfigMap 
with new leader information
+   

[jira] [Closed] (FLINK-14406) Add metric for managed memory

2020-10-22 Thread Andrey Zagrebin (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Zagrebin closed FLINK-14406.
---
Fix Version/s: 1.12.0
 Release Note: 
New metrics are available to monitor managed memory:
`Status.Flink.Memory.Managed.[Used|Total]`
   Resolution: Done

merged into master by 1a70fe50da096a1a44fc71ef0b911de666cdbeb7

> Add metric for managed memory
> -
>
> Key: FLINK-14406
> URL: https://issues.apache.org/jira/browse/FLINK-14406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics, Runtime / Task
>Reporter: lining
>Assignee: Matthias
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a user wants to track managed memory usage over time, there is currently 
> no way to do so, as there are no managed memory metrics.
> *Propose*
>  * add default memory type in MemoryManager
>  
> {code:java}
> public static final MemoryType DEFAULT_MEMORY_TYPE = MemoryType.OFF_HEAP;
> {code}
>  * add getManagedMemoryTotal in TaskExecutor:
>  
> {code:java}
> public long getManagedMemoryTotal() {
> return this.taskSlotTable.getAllocatedSlots().stream().mapToLong(
> slot -> 
> slot.getMemoryManager().getMemorySizeByType(MemoryManager.DEFAULT_MEMORY_TYPE)
> ).sum();
> }
> {code}
>  
>  * add getManagedMemoryUsed in TaskExecutor:
>  
> {code:java}
> public long getManagedMemoryUsed() {
> return this.taskSlotTable.getAllocatedSlots().stream().mapToLong(
> slot -> 
> slot.getMemoryManager().getMemorySizeByType(MemoryManager.DEFAULT_MEMORY_TYPE)
>  - slot.getMemoryManager().availableMemory(MemoryManager.DEFAULT_MEMORY_TYPE)
> ).sum();
> }
> {code}
>  
>  * add instantiateMemoryManagerMetrics in MetricUtils
>  
> {code:java}
> public static void instantiateMemoryManagerMetrics(MetricGroup 
> statusMetricGroup, TaskExecutor taskExecutor) {
> checkNotNull(statusMetricGroup);
> MetricGroup memoryManagerGroup = 
> statusMetricGroup.addGroup("Managed").addGroup("Memory");
> memoryManagerGroup.<Long, Gauge<Long>>gauge("TotalCapacity", 
> taskExecutor::getManagedMemoryTotal);
> memoryManagerGroup.<Long, Gauge<Long>>gauge("MemoryUsed", 
> taskExecutor::getManagedMemoryUsed);
> }
> {code}
>  * register it in TaskManagerRunner#startTaskManager
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot commented on pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm

2020-10-22 Thread GitBox


flinkbot commented on pull request #13742:
URL: https://github.com/apache/flink/pull/13742#issuecomment-714296461


   
   ## CI report:
   
   * 37cba38abc15a3ee7d1193644be564c0c75ac4c1 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot commented on pull request #13741: [FLINK-19680][checkpointing] Announce timeoutable CheckpointBarriers

2020-10-22 Thread GitBox


flinkbot commented on pull request #13741:
URL: https://github.com/apache/flink/pull/13741#issuecomment-714295796


   
   ## CI report:
   
   * 8a68242a90fd180fcd0b4b4c60b5e1e49136c00f UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] vthinkxie commented on a change in pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI

2020-10-22 Thread GitBox


vthinkxie commented on a change in pull request #13458:
URL: https://github.com/apache/flink/pull/13458#discussion_r50994



##
File path: 
flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/detail/job-checkpoints-detail.component.html
##
@@ -21,6 +21,8 @@
 Path: {{checkPointDetail?.external_path || '-'}}
 
 Discarded: {{checkPointDetail?.discarded || '-'}}
+
+Checkpoint Type: {{checkPointDetail?.checkpoint_type === 
"CHECKPOINT" ? (checkPointConfig?.unaligned_checkpoints ? 'unaligned 
checkpoint' : 'aligned checkpoint') : (checkPointDetail?.checkpoint_type === 
"SYNC_SAVEPOINT" ? 'savepoint on cancel' : 'savepoint')|| '-'}}

Review comment:
   Hi @gm7y8 
   it would be better to move this calculation into the TypeScript file as a 
getter and bind the getter here, since the calculation is a little too 
complex for the template file
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (FLINK-19762) Selecting Job-ID in web UI covers more than the ID

2020-10-22 Thread Matthias (Jira)
Matthias created FLINK-19762:


 Summary: Selecting Job-ID in web UI covers more than the ID
 Key: FLINK-19762
 URL: https://issues.apache.org/jira/browse/FLINK-19762
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Affects Versions: 1.11.2, 1.10.2
Reporter: Matthias
 Fix For: 1.12.0
 Attachments: Screenshot 2020-10-22 at 09.47.41.png

More than just the ID is selected when double-clicking the Job ID in the web UI 
to copy it. See the attached screenshot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] shizhengchao opened a new pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…

2020-10-22 Thread GitBox


shizhengchao opened a new pull request #13743:
URL: https://github.com/apache/flink/pull/13743


   …l entry to the map
   
   
   
   ## What is the purpose of the change
   
   This pull request fixes the NPE in the Avro format when there is a null 
entry in the map.
   
   
   ## Brief change log
   
 -  Fixes the NPE in the Avro format when there is a null entry in the map.
 -  Adds a null entry to the map in 
`AvroRowDataDeSerializationSchemaTest#testSerializeDeserialize`
   
   
   ## Verifying this change
   
   This change is already covered by existing tests, such as 
*AvroRowDataDeSerializationSchemaTest#testSerializeDeserialize*.
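   
   For illustration, a hedged sketch of the class of fix described (a 
hypothetical map converter, not the actual AvroRowDataSerializationSchema code):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical converter sketching the null-entry guard; illustrative only.
class NullSafeMapConversionSketch {

	static Map<String, byte[]> convertMap(Map<String, String> input) {
		Map<String, byte[]> result = new HashMap<>(input.size());
		for (Map.Entry<String, String> entry : input.entrySet()) {
			String value = entry.getValue();
			// Guard the null entry: calling value.getBytes(...) on a null
			// map value is exactly the kind of NPE being fixed here.
			result.put(entry.getKey(),
					value == null ? null : value.getBytes(StandardCharsets.UTF_8));
		}
		return result;
	}
}
```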
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
 - The serializers: (no)
 - The runtime per-record code paths (performance sensitive): (no)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
 - The S3 file system connector: (no)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (no)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13726: [BP-1.11][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13726:
URL: https://github.com/apache/flink/pull/13726#issuecomment-713567666


   
   ## CI report:
   
   * c673d6c2530a3484e2a315301d0eee7154a3a5d9 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8020)
 
   * f0ac074feed577550c7026b190db031561a25f77 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack

2020-10-22 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218855#comment-17218855
 ] 

Roman Khachatryan commented on FLINK-19688:
---

https://github.com/apache/flink/pull/13723

> Flink batch job fails because of InterruptedExceptions from network stack
> -
>
> Key: FLINK-19688
> URL: https://issues.apache.org/jira/browse/FLINK-19688
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Runtime / Task
>Affects Versions: 1.12.0
>Reporter: Robert Metzger
>Assignee: Roman Khachatryan
>Priority: Blocker
> Fix For: 1.12.0
>
> Attachments: logs.tgz
>
>
> I have a benchmarking test job, that throws RuntimeExceptions at any operator 
> at a configured, random interval. When using low intervals, such as mean 
> failure rate = 60 s, the job will get into a state where it frequently fails 
> with InterruptedExceptions.
> The same job does not have this problem on Flink 1.11.2 (at least not after 
> running the job for 15 hours, on 1.12-SN, it happens within a few minutes)
> This is the job: 
> https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java
> This is the exception:
> {code}
> 2020-10-16 16:02:15,653 WARN  org.apache.flink.runtime.taskmanager.Task   
>  [] - CHAIN GroupReduce (GroupReduce at 
> main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38)) (8/8)#1 
> (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) 
> switched from RUNNING to FAILED.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
> (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38))' , caused an error: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> Caused by: org.apache.flink.util.WrappingRuntimeException: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   ... 4 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error obtaining the sorted input: Thread 
> 'SortMerger Reading Thread' terminated due to an exception: Connection for 
> partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> ~[?:1.8.0_222]
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) 
> ~[?:1.8.0_222]
>   at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 

[GitHub] [flink] flinkbot edited a comment on pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13725:
URL: https://github.com/apache/flink/pull/13725#issuecomment-713560830


   
   ## CI report:
   
   * a606f618ea3fb7bf179657a060c47ee32b55e514 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8019)
 
   * cd48869c084675464d37449a3db1f9d3cd10a040 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (FLINK-7286) Flink Dashboard fails to display bytes/records received by sources / emitted by sinks

2020-10-22 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-7286.
-
Fix Version/s: (was: 1.12.0)
   Resolution: Duplicate

Closed as duplicate of FLINK-11576.

> Flink Dashboard fails to display bytes/records received by sources / emitted 
> by sinks
> -
>
> Key: FLINK-7286
> URL: https://issues.apache.org/jira/browse/FLINK-7286
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics, Runtime / Web Frontend
>Affects Versions: 1.3.1
>Reporter: Elias Levy
>Priority: Major
>  Labels: usability
>
> It appears Flink can't measure the number of bytes read or records produced 
> by a source (e.g. Kafka source). This is particularly problematic for simple 
> jobs where the job pipeline is chained, and in which there are no 
> measurements between operators. Thus, in the UI it appears that the job is 
> not consuming any data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on pull request #13736: [FLINK-19654][python][e2e] Reduce pyflink e2e test parallelism

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13736:
URL: https://github.com/apache/flink/pull/13736#issuecomment-714183394


   
   ## CI report:
   
   * 259c893dcb4b5692df54cfe7739f95d9e34096e6 Azure: 
[CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8056)
 
   * 111ec785929a0742b46ea98408f585aa03314b1d Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8070)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (FLINK-19759) DeadlockBreakupTest.testSubplanReuse_AddSingletonExchange is instable

2020-10-22 Thread Dian Fu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dian Fu closed FLINK-19759.
---
Fix Version/s: (was: 1.12.0)
   Resolution: Fixed

Just noticed that it's already fixed in 
https://github.com/apache/flink/commit/0b1c7327e9d0efa9c32c95815a0cff5f5ad0856b,
 closing this ticket.

> DeadlockBreakupTest.testSubplanReuse_AddSingletonExchange is instable
> -
>
> Key: FLINK-19759
> URL: https://issues.apache.org/jira/browse/FLINK-19759
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 1.12.0
>Reporter: Dian Fu
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8052&view=logs&j=e25d5e7e-2a9c-5589-4940-0b638d75a414&t=a6e0f756-5bb9-5ea8-a468-5f60db442a29
> {code}
> [ERROR]   DeadlockBreakupTest.testSubplanReuse_AddSingletonExchange:217 
> planAfter expected:<...=[>(cnt, 3)])
> :  +- [SortAggregate(isMerge=[true], select=[Final_COUNT(count1$0) AS cnt], 
> reuse_id=[1])
> : +- Exchange(distribution=[single])
> :+- LocalSort]Aggregate(select=[Pa...> but was:<...=[>(cnt, 3)])
> :  +- [HashAggregate(isMerge=[true], select=[Final_COUNT(count1$0) AS cnt], 
> reuse_id=[1])
> : +- Exchange(distribution=[single])
> :+- LocalHash]Aggregate(select=[Pa...>
> [INFO] 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-19760) Make the `GlobalCommitter` a standalone interface that does not extend the `Committer`

2020-10-22 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-19760:
-

 Summary: Make the `GlobalCommitter` a standalone interface that 
does not extend the `Committer`
 Key: FLINK-19760
 URL: https://issues.apache.org/jira/browse/FLINK-19760
 Project: Flink
  Issue Type: Sub-task
Reporter: Guowei Ma






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery

2020-10-22 Thread GitBox


curcur edited a comment on pull request #13648:
URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577


   > * Visibility in normal case: none of the fields written in `releaseView` 
are `volatile`. So in the normal case (`t1:release` then `t2:createReadView`) `t2` 
can see some inconsistent state. For example, `readView == null`, but 
`isPartialBufferCleanupRequired == false`. Right?
   >   Maybe call `releaseView()` from `createReadView()` unconditionally?
   
   Aren't they guarded by the synchronization block? I think in the existing 
code, almost all access is guarded by the lock. But I have a simpler solution 
just in case. See the last paragraph.
   
   > * Overwrites when release is slow: won't `t1` overwrite changes to 
`PipelinedSubpartition` already made by `t2`? For example, reset 
`sequenceNumber` after `t2` has sent some data?
   >   Maybe `PipelinedSubpartition.readerView` should be `AtomicReference` and 
then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it?
   
   I think this is the same question I answered in the write-up above.
   In short, it won't be possible, because a view can only be released once, and 
this is guarded by the release flag of the view; details are quoted below. The 
updates are guarded by the lock. 
   
   - What if netty thread1 releases the view after netty thread2 recreates the 
view?
   Thread2 releases the view that thread1 holds a reference to before 
creating a new view. Thread1 cannot release the old view (through the view 
reference) again afterwards, since a view can only be released once.
   
   I actually have an idea to simplify this whole model:
   **If we only release the view before creation and nowhere else, this whole 
threading interaction model would be greatly simplified. That is, only the new 
netty thread can release the view.**



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] tillrohrmann commented on a change in pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


tillrohrmann commented on a change in pull request #13725:
URL: https://github.com/apache/flink/pull/13725#discussion_r509929799



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##
@@ -170,7 +174,6 @@ public void nodeChanged() throws Exception {
notifyIfNewLeaderAddress(leaderAddress, 
leaderSessionID);
} catch (Exception e) {
leaderListener.handleError(new 
Exception("Could not handle node changed event.", e));
-   throw e;

Review comment:
   Yes, that is the reasoning. The `leaderListener.handleError` should 
handle all exceptions and usually terminate the process. But let me actually 
add `ExceptionUtils.checkInterrupted(e);` here again.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13740: [FLINK-19758][filesystem] Implement a new unified file sink based on the new sink API

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13740:
URL: https://github.com/apache/flink/pull/13740#issuecomment-714266782


   
   ## CI report:
   
   * af69ad89405936040b55d54acb0ef18e6141e5ad Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8078)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] XComp commented on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI

2020-10-22 Thread GitBox


XComp commented on pull request #13458:
URL: https://github.com/apache/flink/pull/13458#issuecomment-714288781


   Thanks, @gm7y8 . It looks good from my side. @vthinkxie: Could you have a 
brief look over the frontend code diff? I guess there have been some changes 
after your last review.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] tillrohrmann commented on a change in pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


tillrohrmann commented on a change in pull request #13725:
URL: https://github.com/apache/flink/pull/13725#discussion_r509935113



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##
@@ -203,6 +207,11 @@ protected void handleStateChange(ConnectionState newState) 
{
}
}
 
+   private void onReconnectedConnectionState() {
+   // check whether we find some new leader information in 
ZooKeeper
+   retrieveLeaderInformationFromZooKeeper();

Review comment:
   I think you are right that in some cases we will report a stale leader 
here. However, once the asynchronous fetch from the `NodeCache` completes, the 
listener should be notified about the new leader. What happens in the meantime 
is that the listener will try to connect to the old leader which should either 
be gone or reject all connection attempts since he is no longer the leader.
   
   The problem `notifyLossOfLeader` tried to solve is that a listener thinks 
that a stale leader is still the leader and, thus, continues working for it w/o 
questioning it (e.g. check with the leader) until the connection to the leader 
times out. With the `notifyLossOfLeader` change, once the retrieval service 
loses connection to ZooKeeper, it will tell the listener that the current 
leader is no longer valid. This will tell the listener to stop working for this 
leader (e.g. cancelling all tasks, disconnecting from it, etc.). If the 
listener should shortly after be told that the old leader is still the leader 
because of stale information, then it will first try to connect to the leader 
which will fail (assuming that the old leader is indeed no longer the leader) 
before it starts doing work for the leader.
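   
   For illustration, a sketch of the listener behaviour described above, 
assuming Flink's LeaderRetrievalListener contract where loss of leadership is 
signalled with a null address (the listener class itself is hypothetical):

```java
import java.util.UUID;

import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;

// Hypothetical listener; only the LeaderRetrievalListener interface is real.
class StaleLeaderTolerantListener implements LeaderRetrievalListener {

	@Override
	public void notifyLeaderAddress(String leaderAddress, UUID leaderSessionID) {
		if (leaderAddress == null) {
			// Loss of leader: stop working for the old leader right away
			// (cancel tasks, disconnect) instead of waiting for timeouts.
		} else {
			// Possibly stale information: first (re)connect to the announced
			// leader; a node that is no longer the leader will reject the
			// connection before any work is done on its behalf.
		}
	}

	@Override
	public void handleError(Exception exception) {
		// Typically escalates a fatal error and terminates the process.
	}
}
```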





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot commented on pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm

2020-10-22 Thread GitBox


flinkbot commented on pull request #13742:
URL: https://github.com/apache/flink/pull/13742#issuecomment-714288929


   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit 37cba38abc15a3ee7d1193644be564c0c75ac4c1 (Thu Oct 22 
07:22:28 UTC 2020)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (FLINK-17159) ES6 ElasticsearchSinkITCase unstable

2020-10-22 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-17159:
--

Assignee: (was: Robert Metzger)

> ES6 ElasticsearchSinkITCase unstable
> 
>
> Key: FLINK-17159
> URL: https://issues.apache.org/jira/browse/FLINK-17159
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / ElasticSearch, Tests
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Chesnay Schepler
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.12.0, 1.11.3
>
>
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7482=logs=64110e28-73be-50d7-9369-8750330e0bf1=aa84fb9a-59ae-5696-70f7-011bc086e59b]
> {code:java}
> 2020-04-15T02:37:04.4289477Z [ERROR] 
> testElasticsearchSinkWithSmile(org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase)
>   Time elapsed: 0.145 s  <<< ERROR!
> 2020-04-15T02:37:04.4290310Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2020-04-15T02:37:04.4290790Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147)
> 2020-04-15T02:37:04.4291404Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:659)
> 2020-04-15T02:37:04.4291956Z  at 
> org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:77)
> 2020-04-15T02:37:04.4292548Z  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1643)
> 2020-04-15T02:37:04.4293254Z  at 
> org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticSearchSinkTest(ElasticsearchSinkTestBase.java:128)
> 2020-04-15T02:37:04.4293990Z  at 
> org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticsearchSinkSmileTest(ElasticsearchSinkTestBase.java:106)
> 2020-04-15T02:37:04.4295096Z  at 
> org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase.testElasticsearchSinkWithSmile(ElasticsearchSinkITCase.java:45)
> 2020-04-15T02:37:04.4295923Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-04-15T02:37:04.4296489Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-04-15T02:37:04.4297076Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-04-15T02:37:04.4297513Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-04-15T02:37:04.4297951Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-04-15T02:37:04.4298688Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-04-15T02:37:04.4299374Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-04-15T02:37:04.4300069Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-04-15T02:37:04.4300960Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-04-15T02:37:04.4301705Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-04-15T02:37:04.4302204Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-04-15T02:37:04.4302661Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-04-15T02:37:04.4303234Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-04-15T02:37:04.4303706Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-04-15T02:37:04.4304127Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-04-15T02:37:04.4304716Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-04-15T02:37:04.4305394Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-04-15T02:37:04.4305965Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-04-15T02:37:04.4306425Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-04-15T02:37:04.4306942Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-04-15T02:37:04.4307466Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-04-15T02:37:04.4307920Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-04-15T02:37:04.4308375Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-04-15T02:37:04.4308782Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-04-15T02:37:04.4309182Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 

[GitHub] [flink] tillrohrmann commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API

2020-10-22 Thread GitBox


tillrohrmann commented on a change in pull request #13644:
URL: https://github.com/apache/flink/pull/13644#discussion_r509945969



##
File path: 
flink-kubernetes/src/test/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderRetrievalServiceTest.java
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.kubernetes.highavailability;
+
+import org.apache.flink.kubernetes.kubeclient.FlinkKubeClient;
+import org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMap;
+import org.apache.flink.kubernetes.utils.Constants;
+
+import org.junit.Test;
+
+import java.util.Collections;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.Matchers.is;
+import static org.hamcrest.Matchers.notNullValue;
+import static org.hamcrest.Matchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the {@link KubernetesLeaderRetrievalService}.
+ */
+public class KubernetesLeaderRetrievalServiceTest extends 
KubernetesHighAvailabilityTestBase {

Review comment:
   Shouldn't these tests be part of the commits where we add the K8s 
services? That way it will be a bit easier to review them for completeness.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot commented on pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…

2020-10-22 Thread GitBox


flinkbot commented on pull request #13743:
URL: https://github.com/apache/flink/pull/13743#issuecomment-714308277


   Thanks a lot for your contribution to the Apache Flink project. I'm the 
@flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress 
of the review.
   
   
   ## Automated Checks
   Last check on commit 60d9d786803903b37709700a16c01bc6f7bef003 (Thu Oct 22 
07:59:10 UTC 2020)
   
   **Warnings:**
* No documentation files were touched! Remember to keep the Flink docs up 
to date!
   
   
   Mention the bot in a comment to re-run the automated checks.
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review 
Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full 
explanation of the review process.
The Bot is tracking the review progress through labels. Labels are applied 
according to the order of the review items. For consensus, approval by a Flink 
committer or PMC member is required.
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot approve description` to approve one or more aspects (aspects: 
`description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until 
`architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's 
attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] shizhengchao commented on pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…

2020-10-22 Thread GitBox


shizhengchao commented on pull request #13743:
URL: https://github.com/apache/flink/pull/13743#issuecomment-714308295


   Hi @wuchong, this is the hotfix for release-1.11, could you help me review 
this :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-18811) if a disk is damaged, taskmanager should choose another disk for temp dir , rather than throw an IOException, which causes flink job restart over and over again

2020-10-22 Thread Timo Walther (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218843#comment-17218843
 ] 

Timo Walther commented on FLINK-18811:
--

[~pnowojski] could you find someone to help review this contribution?

> if a disk is damaged, taskmanager should choose another disk for temp dir , 
> rather than throw an IOException, which causes flink job restart over and 
> over again
> 
>
> Key: FLINK-18811
> URL: https://issues.apache.org/jira/browse/FLINK-18811
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
> Environment: flink-1.10
>Reporter: Kai Chen
>Priority: Major
>  Labels: pull-request-available
> Attachments: flink_disk_error.png
>
>
> I met this Exception when a hard disk was damaged:
> !flink_disk_error.png!
> I checked the code and found that Flink will create a temp file when the 
> record length exceeds 5 MB:
>  
> {code:java}
> // SpillingAdaptiveSpanningRecordDeserializer.java
> if (nextRecordLength > THRESHOLD_FOR_SPILLING) {
>    // create a spilling channel and put the data there
>    this.spillingChannel = createSpillingChannel();
>    ByteBuffer toWrite = partial.segment.wrap(partial.position, numBytesChunk);
>    FileUtils.writeCompletely(this.spillingChannel, toWrite);
> }
> {code}
> The tempDir is randomly picked from all `tempDirs`. In YARN mode, one 
> `tempDir` usually represents one hard disk.
>  
> In my opinion, if a hard disk is damaged, the taskmanager should pick another 
> disk (tempDir) for the spilling channel rather than throw an IOException, 
> which causes the Flink job to restart over and over again.
> We could just change "SpillingAdaptiveSpanningRecordDeserializer" like 
> this:
> {code:java}
> // SpillingAdaptiveSpanningRecordDeserializer.java
> private FileChannel createSpillingChannel() throws IOException {
>    if (spillFile != null) {
>       throw new IllegalStateException("Spilling file already exists.");
>    }
>
>    // try to find a unique file name for the spilling channel
>    int maxAttempts = 10;
>    String[] tempDirs = this.tempDirs;
>    for (int attempt = 0; attempt < maxAttempts; attempt++) {
>       int dirIndex = rnd.nextInt(tempDirs.length);
>       String directory = tempDirs[dirIndex];
>       spillFile = new File(directory, randomString(rnd) + ".inputchannel");
>       try {
>          if (spillFile.createNewFile()) {
>             return new RandomAccessFile(spillFile, "rw").getChannel();
>          }
>       } catch (IOException e) {
>          // if there is no tempDir left to try
>          if (tempDirs.length <= 1) {
>             throw e;
>          }
>          LOG.warn("Caught an IOException when creating spill file: " + directory + ". Attempt " + attempt, e);
>          // drop the bad directory and retry on the remaining ones
>          tempDirs = (String[]) ArrayUtils.remove(tempDirs, dirIndex);
>       }
>    }
>    // added for completeness: fail if every directory was tried without success
>    throw new IOException("Could not create a spilling channel after " + maxAttempts + " attempts.");
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-19401.
-
Resolution: Fixed

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Assignee: Roman Khachatryan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
> org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.lambda$allocateSlotsFor$0(DefaultExecutionSlotAllocator.java:104)
> at 
> 

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-22 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218851#comment-17218851
 ] 

Arvid Heise commented on FLINK-19401:
-

Merged into release-1.11 as bff79f5efffccd1793e09a5e08d0ceb9fe90cf2
Merged into release-1.10 as 88c06e38d4eff06ac09e4141b988f9a561f286a4
Closing as resolved.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Assignee: Roman Khachatryan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
> 

[jira] [Closed] (FLINK-19136) MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the predicate within the allowed time"

2020-10-22 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger closed FLINK-19136.
--
Resolution: Cannot Reproduce

> MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the 
> predicate within the allowed time"
> 
>
> Key: FLINK-19136
> URL: https://issues.apache.org/jira/browse/FLINK-19136
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.12.0
>Reporter: Dian Fu
>Assignee: Robert Metzger
>Priority: Major
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6179=logs=9fca669f-5c5f-59c7-4118-e31c641064f0=91bf6583-3fb2-592f-e4d4-d79d79c3230a]
> {code}
> 2020-09-03T23:33:18.3687261Z [ERROR] 
> testReporter(org.apache.flink.metrics.tests.MetricsAvailabilityITCase)  Time 
> elapsed: 15.217 s  <<< ERROR!
> 2020-09-03T23:33:18.3698260Z java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
> satisfy the predicate within the allowed time.
> 2020-09-03T23:33:18.3698749Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-09-03T23:33:18.3699163Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> 2020-09-03T23:33:18.3699754Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.fetchMetric(MetricsAvailabilityITCase.java:162)
> 2020-09-03T23:33:18.3700234Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.checkJobManagerMetricAvailability(MetricsAvailabilityITCase.java:116)
> 2020-09-03T23:33:18.3700726Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.testReporter(MetricsAvailabilityITCase.java:101)
> 2020-09-03T23:33:18.3701097Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-09-03T23:33:18.3701425Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-09-03T23:33:18.3701798Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-09-03T23:33:18.3702146Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-09-03T23:33:18.3702471Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-09-03T23:33:18.3702866Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-09-03T23:33:18.3703253Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-09-03T23:33:18.3703621Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-09-03T23:33:18.3703997Z  at 
> org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-09-03T23:33:18.3704339Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-09-03T23:33:18.3704629Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-09-03T23:33:18.3704940Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-09-03T23:33:18.3705354Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-09-03T23:33:18.3705725Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-09-03T23:33:18.3706072Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-03T23:33:18.3706397Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-03T23:33:18.3706714Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-03T23:33:18.3707044Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-09-03T23:33:18.3707373Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-09-03T23:33:18.3707708Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-03T23:33:18.3708073Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-09-03T23:33:18.3708410Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-09-03T23:33:18.3708691Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2020-09-03T23:33:18.3708976Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2020-09-03T23:33:18.3709273Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-03T23:33:18.3709579Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-03T23:33:18.3709910Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-03T23:33:18.3710242Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 

[jira] [Commented] (FLINK-19136) MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the predicate within the allowed time"

2020-10-22 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218869#comment-17218869
 ] 

Robert Metzger commented on FLINK-19136:


The logs of TM and JM both do not look suspicious. I'm closing this as "Cannot 
Reproduce" for now.

> MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the 
> predicate within the allowed time"
> 
>
> Key: FLINK-19136
> URL: https://issues.apache.org/jira/browse/FLINK-19136
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.12.0
>Reporter: Dian Fu
>Assignee: Robert Metzger
>Priority: Major
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6179=logs=9fca669f-5c5f-59c7-4118-e31c641064f0=91bf6583-3fb2-592f-e4d4-d79d79c3230a]
> {code}
> 2020-09-03T23:33:18.3687261Z [ERROR] 
> testReporter(org.apache.flink.metrics.tests.MetricsAvailabilityITCase)  Time 
> elapsed: 15.217 s  <<< ERROR!
> 2020-09-03T23:33:18.3698260Z java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
> satisfy the predicate within the allowed time.
> 2020-09-03T23:33:18.3698749Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-09-03T23:33:18.3699163Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> 2020-09-03T23:33:18.3699754Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.fetchMetric(MetricsAvailabilityITCase.java:162)
> 2020-09-03T23:33:18.3700234Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.checkJobManagerMetricAvailability(MetricsAvailabilityITCase.java:116)
> 2020-09-03T23:33:18.3700726Z  at 
> org.apache.flink.metrics.tests.MetricsAvailabilityITCase.testReporter(MetricsAvailabilityITCase.java:101)
> 2020-09-03T23:33:18.3701097Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-09-03T23:33:18.3701425Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-09-03T23:33:18.3701798Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-09-03T23:33:18.3702146Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-09-03T23:33:18.3702471Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-09-03T23:33:18.3702866Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-09-03T23:33:18.3703253Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-09-03T23:33:18.3703621Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-09-03T23:33:18.3703997Z  at 
> org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-09-03T23:33:18.3704339Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-09-03T23:33:18.3704629Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-09-03T23:33:18.3704940Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-09-03T23:33:18.3705354Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-09-03T23:33:18.3705725Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-09-03T23:33:18.3706072Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-03T23:33:18.3706397Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-03T23:33:18.3706714Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-03T23:33:18.3707044Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-09-03T23:33:18.3707373Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-09-03T23:33:18.3707708Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-03T23:33:18.3708073Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-09-03T23:33:18.3708410Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-09-03T23:33:18.3708691Z  at 
> org.junit.runners.Suite.runChild(Suite.java:128)
> 2020-09-03T23:33:18.3708976Z  at 
> org.junit.runners.Suite.runChild(Suite.java:27)
> 2020-09-03T23:33:18.3709273Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-03T23:33:18.3709579Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-03T23:33:18.3709910Z  at 
> 

[GitHub] [flink] flinkbot edited a comment on pull request #13721: [FLINK-19694][table] Support Upsert ChangelogMode for ScanTableSource

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13721:
URL: https://github.com/apache/flink/pull/13721#issuecomment-713489063


   
   ## CI report:
   
   * fa0c87c737cee61fe6fd5b6655916eb97d18aaeb Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8072)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery

2020-10-22 Thread GitBox


curcur edited a comment on pull request #13648:
URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577


   > * Visibility in the normal case: none of the fields written in `releaseView` 
are `volatile`. So in the normal case (`t1:release` then `t2:createReadView`) `t2` 
can see some inconsistent state. For example, `readView == null`, but 
`isPartialBufferCleanupRequired == false`. Right?
   >   Maybe call `releaseView()` from `createReadView()` unconditionally?
   
   Aren't they guarded by the synchronization block?
   
   > * Overwrites when release is slow: won't `t1` overwrite changes to 
`PipelinedSubpartition` already made by `t2`? For example, reset 
`sequenceNumber` after `t2` has sent some data?
   >   Maybe `PipelinedSubpartition.readerView` should be an `AtomicReference` and 
then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it?
   
   I think this is the same question I answered in the write-up above.
   In short, it is not possible, because a view can only be released once, and 
this is guarded by the release flag of the view; details quoted below.
   
   - What if netty thread1 releases the view after netty thread2 recreates the 
view?
   Thread2 releases the view that thread1 holds the reference on before 
creating a new view. Thread1 cannot release the old view (through its view 
reference) again afterwards, since a view can only be released once.
   
   I actually have an idea to simplify this whole model:
   **If we only release right before creation and nowhere else, this whole 
threading interaction model would be greatly simplified. That is, only the new 
netty thread can release the view.**



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13678: [FLINK-19586] Add stream committer operators for new Sink API

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13678:
URL: https://github.com/apache/flink/pull/13678#issuecomment-711439155


   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 463b5c8ed21f93caaeb7b938aa9e72abb35619b2 UNKNOWN
   * 2ba5bda5c5a15c370c231e1cedfa03fd0bc0b41b Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8076)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery

2020-10-22 Thread GitBox


curcur edited a comment on pull request #13648:
URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577


   > * Visibility in the normal case: none of the fields written in `releaseView` 
are `volatile`. So in the normal case (`t1:release` then `t2:createReadView`) `t2` 
can see some inconsistent state. For example, `readView == null`, but 
`isPartialBufferCleanupRequired == false`. Right?
   >   Maybe call `releaseView()` from `createReadView()` unconditionally?
   
   Aren't they guarded by the synchronization block?
   
   > * Overwrites when release is slow: won't `t1` overwrite changes to 
`PipelinedSubpartition` already made by `t2`? For example, reset 
`sequenceNumber` after `t2` has sent some data?
   >   Maybe `PipelinedSubpartition.readerView` should be an `AtomicReference` and 
then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it?
   
   I think this is the same question I answered in the write-up above.
   In short, it is not possible, because a view can only be released once, and 
this is guarded by the release flag of the view; details quoted below.
   
   - What if netty thread1 releases the view after netty thread2 recreates the 
view?
   Thread2 releases the view that thread1 holds the reference on before 
creating a new view. Thread1 cannot release the old view (through its view 
reference) again afterwards, since a view can only be released once, and the 
release is guarded by the lock.
   
   I actually have an idea to simplify this whole model:
   **If we only release right before creation and nowhere else, this whole 
threading interaction model would be greatly simplified. That is, only the new 
netty thread can release the view.**
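
   A minimal sketch of that release-once guard (all names invented for 
illustration; this is not the actual `PipelinedSubpartitionView` code):

   ```java
   final class ReadView {
       private final Object lock = new Object();
       private boolean released; // the release flag discussed above

       /** Returns true only for the first caller; later calls are no-ops. */
       boolean release() {
           synchronized (lock) {
               if (released) {
                   return false; // a stale reference cannot release the view again
               }
               released = true;
               // ... free buffers, reset partial-buffer cleanup state, etc.
               return true;
           }
       }
   }
   ```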



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13735: [FLINK-19533][checkpoint] Add channel state reassignment for unaligned checkpoints.

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13735:
URL: https://github.com/apache/flink/pull/13735#issuecomment-714048414


   
   ## CI report:
   
   * ee73393a13c999be626f7b8b98a170be88a093d9 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8077)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot commented on pull request #13740: [FLINK-19758][filesystem] Implement a new unified file sink based on the new sink API

2020-10-22 Thread GitBox


flinkbot commented on pull request #13740:
URL: https://github.com/apache/flink/pull/13740#issuecomment-714266782


   
   ## CI report:
   
   * af69ad89405936040b55d54acb0ef18e6141e5ad UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] pnowojski opened a new pull request #13741: [FLINK-19680][checkpointing] Announce timeoutable CheckpointBarriers

2020-10-22 Thread GitBox


pnowojski opened a new pull request #13741:
URL: https://github.com/apache/flink/pull/13741


   This PR adds `CheckpointBarrierAnnouncement` messages.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (**yes** / no)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / **no** / 
don't know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (**yes** / no)
 - If yes, how is the feature documented? (not applicable / **docs** / 
**JavaDocs** / not documented)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (FLINK-19680) Add CheckpointBarrierAnnouncement priority message

2020-10-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-19680:
---
Labels: pull-request-available  (was: )

> Add CheckpointBarrierAnnouncement priority message
> --
>
> Key: FLINK-19680
> URL: https://issues.apache.org/jira/browse/FLINK-19680
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Affects Versions: 1.12.0
>Reporter: Piotr Nowojski
>Assignee: Piotr Nowojski
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> When an aligned checkpoint barrier arrives in an input channel, we should 
> enqueue a priority event at the head of the queue, announcing the arrival of 
> the checkpoint barrier.
> Such an announcement event should be propagated to the 
> `CheckpointBarrierHandler`. Based on that, the `CheckpointBarrierHandler` will 
> be able to implement the logic for timing out from an aligned checkpoint to 
> an unaligned checkpoint.
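
A rough sketch of the intended flow; every type and method below is invented 
for illustration and does not mirror Flink's actual classes:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

final class BarrierAnnouncer {

    static final class Barrier {
        final long checkpointId;
        Barrier(long checkpointId) { this.checkpointId = checkpointId; }
    }

    static final class Announcement {
        final Barrier barrier;
        final int sequenceNumber; // position of the announced barrier in the queue
        Announcement(Barrier barrier, int sequenceNumber) {
            this.barrier = barrier;
            this.sequenceNumber = sequenceNumber;
        }
    }

    final Deque<Object> queue = new ArrayDeque<>();

    void onBarrier(Barrier barrier) {
        // Priority event at the head: the handler learns of the barrier before
        // the buffers queued in front of it are consumed and can start the
        // aligned-to-unaligned timeout logic early.
        queue.addFirst(new Announcement(barrier, queue.size()));
        // The barrier itself keeps its in-order position in the channel.
        queue.addLast(barrier);
    }
}
{code}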



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19761) Add lookup method for registered ShuffleDescriptor in ShuffleMaster

2020-10-22 Thread Xuannan Su (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218826#comment-17218826
 ] 

Xuannan Su commented on FLINK-19761:


cc [~trohrmann] [~azagrebin] [~zjwang] 
I'd love to hear your thoughts on this.

> Add lookup method for registered ShuffleDescriptor in ShuffleMaster
> ---
>
> Key: FLINK-19761
> URL: https://issues.apache.org/jira/browse/FLINK-19761
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Xuannan Su
>Priority: Major
>
> Currently, the ShuffleMaster can register a partition and get the shuffle 
> descriptor. However, it lacks the ability to look up the registered 
> ShuffleDescriptors that belong to an IntermediateResult by its 
> IntermediateDataSetID.
> Adding a lookup method to the ShuffleMaster makes reusing cluster partitions 
> easier. For example, we don't have to return the ShuffleDescriptor to the 
> client just so that another job can somehow encode the ShuffleDescriptor in 
> its JobGraph to consume the cluster partition. Instead, we only need to 
> return the IntermediateDataSetID, which the other job can use to look up the 
> ShuffleDescriptor.
> With a lookup method in the ShuffleMaster, if we have an external shuffle 
> service and the lifecycle of the IntermediateResult is not bound to the 
> cluster, a job running on another cluster can look up the ShuffleDescriptor 
> and reuse the IntermediateResult even if the cluster that produced it has 
> been shut down.
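
Sketched as an interface change (the existing ShuffleMaster is simplified here, 
and the new method's signature is only this proposal, not committed API):

{code:java}
import java.util.Collection;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.runtime.jobgraph.IntermediateDataSetID;
import org.apache.flink.runtime.shuffle.PartitionDescriptor;
import org.apache.flink.runtime.shuffle.ProducerDescriptor;
import org.apache.flink.runtime.shuffle.ShuffleDescriptor;

public interface ShuffleMaster<T extends ShuffleDescriptor> {

    // existing registration entry point (simplified)
    CompletableFuture<T> registerPartitionWithProducer(
            PartitionDescriptor partitionDescriptor,
            ProducerDescriptor producerDescriptor);

    // proposed: look up all descriptors registered for an IntermediateResult
    Collection<T> getShuffleDescriptors(IntermediateDataSetID intermediateDataSetId);
}
{code}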



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] tillrohrmann commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API

2020-10-22 Thread GitBox


tillrohrmann commented on a change in pull request #13644:
URL: https://github.com/apache/flink/pull/13644#discussion_r509937977



##
File path: 
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/FlinkKubeClient.java
##
@@ -104,6 +106,67 @@ KubernetesWatch watchPodsAndDoCallback(
		Map<String, String> labels,
		WatchCallbackHandler<KubernetesPod> podCallbackHandler);
 
+   /**
+    * Create the ConfigMap with the specified content. If the ConfigMap 
+    * already exists, a FlinkRuntimeException will be thrown.
+    *
+    * @param configMap ConfigMap.
+    *
+    * @return Return the ConfigMap create future.
+    */
+   CompletableFuture<Void> createConfigMap(KubernetesConfigMap configMap);
+
+   /**
+    * Get the ConfigMap with the specified name.
+    *
+    * @param name ConfigMap name.
+    *
+    * @return Return the ConfigMap, or empty if the ConfigMap does not exist.
+    */
+   Optional<KubernetesConfigMap> getConfigMap(String name);
+
+   /**
+    * Update an existing ConfigMap with the data. Benefiting from the
+    * <a href="https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions">resource version</a>
+    * and combined with {@link #getConfigMap(String)}, we could perform a get-check-and-update
+    * transactional operation. Since concurrent modification could happen on the same ConfigMap,
+    * the update operation may fail. We need to retry internally. The max retry attempts can be
+    * configured via {@link org.apache.flink.kubernetes.configuration.KubernetesConfigOptions#KUBERNETES_TRANSACTIONAL_OPERATION_MAX_RETRIES}.
+    *
+    * @param configMapName ConfigMap to be replaced with.
+    * @param function      Function to be applied to the obtained ConfigMap to get a new updated one. If the returned

Review comment:
   Would we do these kind of calls in the update function of the configMap?
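
   For context, a hedged sketch of the get-check-and-update loop the javadoc 
describes; `getConfigMap`/`replaceConfigMap` and the conflict exception are 
stand-ins for the real client calls, not the committed API:

   ```java
   // Method-level sketch only; helper calls and exception type are assumptions.
   KubernetesConfigMap checkAndUpdateConfigMap(
           String configMapName,
           Function<KubernetesConfigMap, Optional<KubernetesConfigMap>> function,
           int maxRetries) {
       for (int attempt = 0; attempt < maxRetries; attempt++) {
           KubernetesConfigMap current = getConfigMap(configMapName)
                   .orElseThrow(() -> new FlinkRuntimeException(
                           "ConfigMap " + configMapName + " does not exist."));
           Optional<KubernetesConfigMap> updated = function.apply(current);
           if (!updated.isPresent()) {
               return current; // check failed; nothing to write
           }
           try {
               // the API server rejects the write if the resource version changed
               // since the get, which turns the whole loop into a compare-and-swap
               return replaceConfigMap(updated.get());
           } catch (KubernetesClientException conflict) {
               // concurrent modification: re-read and retry
           }
       }
       throw new FlinkRuntimeException(
               "Could not update ConfigMap " + configMapName + " within " + maxRetries + " retries.");
   }
   ```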





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (FLINK-19763) Missing test MetricUtilsTest.testNonHeapMetricUsageNotStatic

2020-10-22 Thread Matthias (Jira)
Matthias created FLINK-19763:


 Summary: Missing test 
MetricUtilsTest.testNonHeapMetricUsageNotStatic
 Key: FLINK-19763
 URL: https://issues.apache.org/jira/browse/FLINK-19763
 Project: Flink
  Issue Type: Test
Affects Versions: 1.11.2, 1.10.2
Reporter: Matthias
 Fix For: 1.12.0


We have tests for the heap and metaspace metrics that check whether the metric 
is dynamically generated. The corresponding test for the non-heap space is 
missing. There was a test added in 
[296107e|https://github.com/apache/flink/commit/296107e] but reverted in 
[2d86256|https://github.com/apache/flink/commit/2d86256] because the test 
turned out to fail intermittently.

We might want to add the test again, fixing the underlying issue.
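
A deterministic sketch of the property such a test should pin down (the fake 
source below stands in for the real MemoryMXBean; asserting on real JVM 
allocations is what made the reverted test flaky):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class NonHeapGaugeSketchTest {

    interface Gauge<T> { T getValue(); }

    @Test
    public void nonHeapGaugeIsNotAStaticSnapshot() {
        AtomicLong fakeNonHeapUsed = new AtomicLong(100);
        // a correctly built metric reads its source on every access
        Gauge<Long> used = fakeNonHeapUsed::get;

        assertEquals(100L, used.getValue().longValue());
        fakeNonHeapUsed.set(200); // simulate non-heap growth
        assertEquals(200L, used.getValue().longValue());
    }
}
{code}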



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] AHeise merged pull request #13734: (1.11 backport) [FLINK-19401][checkpointing] Download checkpoints only if needed

2020-10-22 Thread GitBox


AHeise merged pull request #13734:
URL: https://github.com/apache/flink/pull/13734


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack

2020-10-22 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218858#comment-17218858
 ] 

Arvid Heise commented on FLINK-19688:
-

Merged into master as 840e8af879e69c1bf9ad121b670a2703eb88b858.
Closing issue as resolved.

> Flink batch job fails because of InterruptedExceptions from network stack
> -
>
> Key: FLINK-19688
> URL: https://issues.apache.org/jira/browse/FLINK-19688
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Runtime / Task
>Affects Versions: 1.12.0
>Reporter: Robert Metzger
>Assignee: Roman Khachatryan
>Priority: Blocker
> Fix For: 1.12.0
>
> Attachments: logs.tgz
>
>
> I have a benchmarking test job, that throws RuntimeExceptions at any operator 
> at a configured, random interval. When using low intervals, such as mean 
> failure rate = 60 s, the job will get into a state where it frequently fails 
> with InterruptedExceptions.
> The same job does not have this problem on Flink 1.11.2 (at least not after 
> running the job for 15 hours, on 1.12-SN, it happens within a few minutes)
> This is the job: 
> https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java
> This is the exception:
> {code}
> 2020-10-16 16:02:15,653 WARN  org.apache.flink.runtime.taskmanager.Task   
>  [] - CHAIN GroupReduce (GroupReduce at 
> main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38)) (8/8)#1 
> (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) 
> switched from RUNNING to FAILED.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
> (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38))' , caused an error: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> Caused by: org.apache.flink.util.WrappingRuntimeException: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   ... 4 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error obtaining the sorted input: Thread 
> 'SortMerger Reading Thread' terminated due to an exception: Connection for 
> partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> ~[?:1.8.0_222]
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) 
> ~[?:1.8.0_222]
>   at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 

[GitHub] [flink] flinkbot edited a comment on pull request #13741: [FLINK-19680][checkpointing] Announce timeoutable CheckpointBarriers

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13741:
URL: https://github.com/apache/flink/pull/13741#issuecomment-714295796


   
   ## CI report:
   
   * 8a68242a90fd180fcd0b4b4c60b5e1e49136c00f Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8082)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13742:
URL: https://github.com/apache/flink/pull/13742#issuecomment-714296461


   
   ## CI report:
   
   * 37cba38abc15a3ee7d1193644be564c0c75ac4c1 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8084)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13736: [FLINK-19654][python][e2e] Reduce pyflink e2e test parallelism

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13736:
URL: https://github.com/apache/flink/pull/13736#issuecomment-714183394


   
   ## CI report:
   
   * 111ec785929a0742b46ea98408f585aa03314b1d Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8070)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] AHeise merged pull request #13723: [FlINK-19688][network] Don't cache InterruptedExceptions in PartitionRequestClientFactory

2020-10-22 Thread GitBox


AHeise merged pull request #13723:
URL: https://github.com/apache/flink/pull/13723


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13727: [BP-1.10][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13727:
URL: https://github.com/apache/flink/pull/13727#issuecomment-713572668


   
   ## CI report:
   
   * c2ae5ba37a69ad1f40f585421d38edb30d40c0fa Travis: 
[FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/191431633) 
   * 8e2aaf84d255adab8420800d5e7ec20d502868f4 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] AHeise commented on a change in pull request #13723: [FlINK-19688][network] Don't cache InterruptedExceptions in PartitionRequestClientFactory

2020-10-22 Thread GitBox


AHeise commented on a change in pull request #13723:
URL: https://github.com/apache/flink/pull/13723#discussion_r509969318



##
File path: 
flink-runtime/src/test/java/org/apache/flink/runtime/io/network/netty/PartitionRequestClientFactoryTest.java
##
@@ -58,6 +59,38 @@
 
private static final int SERVER_PORT = NetUtils.getAvailablePort();
 
+   @Test
+   public void testInterruptsNotCached() throws Exception {
+   ConnectionID connectionId = new ConnectionID(new 
InetSocketAddress(InetAddress.getLocalHost(), 8080), 0);

Review comment:
   Okay, if the test is not affected by that, I'll merge.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (FLINK-14242) Collapse task names in job graph visualization if too long

2020-10-22 Thread Yadong Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Xie closed FLINK-14242.
--
Resolution: Fixed

> Collapse task names in job graph visualization if too long
> --
>
> Key: FLINK-14242
> URL: https://issues.apache.org/jira/browse/FLINK-14242
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.0
>Reporter: Paul Lin
>Assignee: Yadong Xie
>Priority: Minor
>
> For some complex jobs, especially SQL jobs, the task names are quite long 
> which makes the job graph hard to read.  We could auto collapse these task 
> names if they exceed a certain length, and provide an uncollapse button for 
> the full task names.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zhijiangW commented on a change in pull request #13595: [FLINK-19582][network] Introduce sort-merge based blocking shuffle to Flink

2020-10-22 Thread GitBox


zhijiangW commented on a change in pull request #13595:
URL: https://github.com/apache/flink/pull/13595#discussion_r509983577



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/SortMergeResultPartition.java
##
@@ -0,0 +1,360 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.io.network.partition;
+
+import org.apache.flink.core.memory.MemorySegment;
+import org.apache.flink.core.memory.MemorySegmentFactory;
+import org.apache.flink.runtime.event.AbstractEvent;
+import org.apache.flink.runtime.io.disk.FileChannelManager;
+import org.apache.flink.runtime.io.network.api.EndOfPartitionEvent;
+import org.apache.flink.runtime.io.network.api.serialization.EventSerializer;
+import org.apache.flink.runtime.io.network.buffer.Buffer;
+import org.apache.flink.runtime.io.network.buffer.BufferCompressor;
+import org.apache.flink.runtime.io.network.buffer.BufferPool;
+import org.apache.flink.runtime.io.network.buffer.NetworkBuffer;
+import org.apache.flink.util.function.SupplierWithException;
+
+import javax.annotation.Nullable;
+import javax.annotation.concurrent.GuardedBy;
+import javax.annotation.concurrent.NotThreadSafe;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+
+import static org.apache.flink.runtime.io.network.buffer.Buffer.DataType;
+import static 
org.apache.flink.runtime.io.network.partition.SortBuffer.BufferWithChannel;
+import static org.apache.flink.util.Preconditions.checkElementIndex;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+import static org.apache.flink.util.Preconditions.checkState;
+
+/**
+ * {@link SortMergeResultPartition} appends records and events to {@link 
SortBuffer} and after the {@link SortBuffer}
+ * is full, all data in the {@link SortBuffer} will be copied and spilled to a 
{@link PartitionedFile} in subpartition
+ * index order sequentially. Large records that can not be appended to an 
empty {@link SortBuffer} will be spilled to
+ * the {@link PartitionedFile} separately.
+ */
+@NotThreadSafe
+public class SortMergeResultPartition extends ResultPartition {
+
+   private final Object lock = new Object();
+
+   /** All active readers which are consuming data from this result 
partition now. */
+   @GuardedBy("lock")
+   private final Set<SortMergeSubpartitionReader> readers = new HashSet<>();
+
+   /** {@link PartitionedFile} produced by this result partition. */
+   @GuardedBy("lock")
+   private PartitionedFile resultFile;
+
+   /** Used to generate random file channel ID. */
+   private final FileChannelManager channelManager;
+
+   /** Number of data buffers (excluding events) written for each 
subpartition. */
+   private final int[] numDataBuffers;
+
+   /** A piece of unmanaged memory for data writing. */
+   private final MemorySegment writeBuffer;
+
+   /** Size of network buffer and write buffer. */
+   private final int networkBufferSize;
+
+   /** Current {@link SortBuffer} to append records to. */
+   private SortBuffer currentSortBuffer;
+
+   /** File writer for this result partition. */
+   private PartitionedFileWriter fileWriter;
+
+   public SortMergeResultPartition(
+   String owningTaskName,
+   int partitionIndex,
+   ResultPartitionID partitionId,
+   ResultPartitionType partitionType,
+   int numSubpartitions,
+   int numTargetKeyGroups,
+   int networkBufferSize,
+   ResultPartitionManager partitionManager,
+   FileChannelManager channelManager,
+   @Nullable BufferCompressor bufferCompressor,
+   SupplierWithException<BufferPool, IOException> bufferPoolFactory) {
+
+   super(
+   owningTaskName,
+   partitionIndex,
+   partitionId,
+   partitionType,
+   numSubpartitions,
+   numTargetKeyGroups,
+   

[GitHub] [flink] flinkbot commented on pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…

2020-10-22 Thread GitBox


flinkbot commented on pull request #13743:
URL: https://github.com/apache/flink/pull/13743#issuecomment-714332320


   
   ## CI report:
   
   * 60d9d786803903b37709700a16c01bc6f7bef003 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (FLINK-19757) TimeStampData can cause time inconsistent problem

2020-10-22 Thread Jark Wu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218809#comment-17218809
 ] 

Jark Wu edited comment on FLINK-19757 at 10/22/20, 6:55 AM:


Hi [~zhoujira86], {{TimestampData}} is the internal data structure for both 
{{TIMESTAMP}} (= Java {{LocalDateTime}}) and {{TIMESTAMP WITH LOCAL TIME ZONE}} 
(= Java {{Instant}}). It always stores the milliseconds since the epoch 
({{1970-01-01 00:00:00}}).

It is always safe to convert without a time zone if both sides have the same 
logical type, but not if they have different logical types, e.g. {{Instant}} => 
{{TimestampData}} => {{LocalDateTime}} is not correct if no time zone is 
considered. Such conversions only happen inside Flink SQL internals, and they 
are handled correctly there.

The problem you describe is not in the {{TimestampData}} data structure; 
rather, {{PROCTIME()}} currently returns the {{TIMESTAMP}} type, which is not 
right. It should be {{TIMESTAMP WITH LOCAL TIME ZONE}}.
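
For illustration, a JDK-only sketch of why dropping the time zone changes the 
wall-clock value (illustrative code, not Flink internals):

{code:java}
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampSemanticsDemo {
    public static void main(String[] args) {
        // An absolute point in time, i.e. what TIMESTAMP WITH LOCAL TIME ZONE models.
        Instant instant = Instant.ofEpochMilli(0L); // 1970-01-01T00:00:00Z

        // Interpreting the same epoch value as wall-clock time (TIMESTAMP) requires
        // choosing a zone; different zones yield different LocalDateTime values.
        LocalDateTime utcWallClock = LocalDateTime.ofInstant(instant, ZoneOffset.UTC);
        LocalDateTime localWallClock = LocalDateTime.ofInstant(instant, ZoneId.systemDefault());

        System.out.println(utcWallClock);   // 1970-01-01T00:00
        System.out.println(localWallClock); // shifted by the default zone's offset

        // Hence Instant => TimestampData => LocalDateTime is only well-defined
        // once both sides agree on the logical type (and thus on the zone handling).
    }
}
{code}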


was (Author: jark):
Hi [~zhoujira86], {{TimestampData}} is the internal data structure for both 
{{TIMESTAMP}} (= Java {{LocalDateTime}}) and {{TIMESTAMP WITH LOCAL TIME ZONE}} 
(= Java {{Instant}}). It always stores the epoch seconds since {{1970-01-01 
00:00:00}}. 

It always safe to convert without time zone if they are they same logical type. 
But not correct if they are different logical type, e.g.  {{Instant}} => 
{{TimestampData}} => {{LocalDateTime}} is not correct if no time zone 
considered. But such conversion only happends in Flink SQL internal, and 
handled correctly. 

You problem above is not in {{TimestampData}} data structure, but the 
{{PROCTIME()}}  currently returns {{TIMESTAMP}} type which is not right, it 
should be {{}}

 {{LocalDateTime}} => {{TimestampData}} => {{LocalDateTime}} and 
{{TimestampData}} <=> {{Instant}} without time zone considered. But it it right 
to convert  {{TIMESTAMP WITH LOCAL TIME ZONE}}.

> TimeStampData can cause time inconsistent problem
> -
>
> Key: FLINK-19757
> URL: https://issues.apache.org/jira/browse/FLINK-19757
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Runtime
>Affects Versions: 1.11.1
>Reporter: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
>
> when we check the JDK LocalDateTime code, we find that
>  
> {code:java}
> // code placeholder
> public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, 
> ZoneOffset offset) {
> Objects.requireNonNull(offset, "offset");
> NANO_OF_SECOND.checkValidValue(nanoOfSecond);
> long localSecond = epochSecond + offset.getTotalSeconds();  // overflow 
> caught later
> long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY);
> int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY);
> LocalDate date = LocalDate.ofEpochDay(localEpochDay);
> LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + 
> nanoOfSecond);
> return new LocalDateTime(date, time);
> }
> {code}
>  
> offset.getTotalSeconds() adds the offset, but in TimestampData's
> toLocalDateTime we don't add an offset.
>  
> I'd like to add a TimeZone.getDefault().getRawOffset() in
> toLocalDateTime()
> and subtract a TimeZone.getDefault().getRawOffset() in
> fromLocalDateTime



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] wuchong commented on pull request #13653: [FLINK-17528][table] Remove RowData#get() and ArrayData#get() and use FieldGetter and ElementGetter instead

2020-10-22 Thread GitBox


wuchong commented on pull request #13653:
URL: https://github.com/apache/flink/pull/13653#issuecomment-714275903


   @flinkbot run azure
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] azagrebin closed pull request #13547: [FLINK-14406][runtime] Exposes managed memory usage through the REST API

2020-10-22 Thread GitBox


azagrebin closed pull request #13547:
URL: https://github.com/apache/flink/pull/13547


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-19554) Unified testing framework for connectors

2020-10-22 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218837#comment-17218837
 ] 

Robert Metzger commented on FLINK-19554:


What's the difference between this testing framework and the one developed in 
FLINK-11463? I'm concerned that we will end up with two competing frameworks 
instead of focusing our resources on one that is really good.

> Unified testing framework for connectors
> 
>
> Key: FLINK-19554
> URL: https://issues.apache.org/jira/browse/FLINK-19554
> Project: Flink
>  Issue Type: New Feature
>  Components: Connectors / Common, Test Infrastructure
>Reporter: Qingsheng Ren
>Assignee: Qingsheng Ren
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> As the community and ecosystem of Flink grow, more and more Flink-owned and 
> third-party connectors are developed and added to the Flink community. In 
> order to provide standardized quality control for all connectors, it's 
> necessary to develop a unified connector testing framework to simplify and 
> standardize end-to-end testing of connectors. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] godfreyhe commented on a change in pull request #13696: [FLINK-19726][table] Implement new providers for blink planner

2020-10-22 Thread GitBox


godfreyhe commented on a change in pull request #13696:
URL: https://github.com/apache/flink/pull/13696#discussion_r509945463



##
File path: 
flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/runtime/operators/sink/SinkNotNullEnforcer.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.operators.sink;
+
+import org.apache.flink.api.common.functions.FilterFunction;
+import org.apache.flink.table.api.TableException;
+import org.apache.flink.table.api.config.ExecutionConfigOptions;
+import 
org.apache.flink.table.api.config.ExecutionConfigOptions.NotNullEnforcer;
+import org.apache.flink.table.data.RowData;
+
+/**
+ * Checks writing null values into NOT NULL columns.
+ */
+public class SinkNotNullEnforcer implements FilterFunction<RowData> {

Review comment:
   nit: add `serialVersionUID`

##
File path: 
flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/runtime/operators/sink/SinkNotNullEnforcer.java
##
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.operators.sink;
+
+import org.apache.flink.api.common.functions.FilterFunction;
+import org.apache.flink.table.api.TableException;
+import org.apache.flink.table.api.config.ExecutionConfigOptions;
+import 
org.apache.flink.table.api.config.ExecutionConfigOptions.NotNullEnforcer;
+import org.apache.flink.table.data.RowData;
+
+/**
+ * Checks writing null values into NOT NULL columns.
+ */
+public class SinkNotNullEnforcer implements FilterFunction<RowData> {
+
+   private final NotNullEnforcer notNullEnforcer;
+   private final int[] notNullFieldIndices;
+   private final boolean notNullCheck;
+   private final String[] allFieldNames;
+
+   public SinkNotNullEnforcer(
+   NotNullEnforcer notNullEnforcer, int[] 
notNullFieldIndices, String[] allFieldNames) {
+   this.notNullFieldIndices = notNullFieldIndices;
+   this.notNullEnforcer = notNullEnforcer;
+   this.notNullCheck = notNullFieldIndices.length > 0;
+   this.allFieldNames = allFieldNames;
+   }
+
+   @Override
+   public boolean filter(RowData row) {
+   if (notNullCheck) {

Review comment:
   nit: return `false` directly when notNullCheck is false, reduce nesting 
depth.
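
   For illustration, a sketch of the class with both nits applied (a 
`serialVersionUID` and an early return). This is a simplified version: the 
`NotNullEnforcer` option handling is omitted, and the early-return value of 
`true` (keep the row) is an assumption, since the original method tail is not 
shown in the excerpt above:

{code:java}
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.table.api.TableException;
import org.apache.flink.table.data.RowData;

public class SinkNotNullEnforcer implements FilterFunction<RowData> {

    private static final long serialVersionUID = 1L;

    private final int[] notNullFieldIndices;
    private final String[] allFieldNames;
    private final boolean notNullCheck;

    public SinkNotNullEnforcer(int[] notNullFieldIndices, String[] allFieldNames) {
        this.notNullFieldIndices = notNullFieldIndices;
        this.allFieldNames = allFieldNames;
        this.notNullCheck = notNullFieldIndices.length > 0;
    }

    @Override
    public boolean filter(RowData row) {
        if (!notNullCheck) {
            // No NOT NULL columns configured: keep every row (assumed intent).
            return true;
        }
        for (int index : notNullFieldIndices) {
            if (row.isNullAt(index)) {
                throw new TableException("Column '" + allFieldNames[index]
                        + "' is NOT NULL, but a null value is being written into it.");
            }
        }
        return true;
    }
}
{code}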





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] vthinkxie commented on a change in pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI

2020-10-22 Thread GitBox


vthinkxie commented on a change in pull request #13458:
URL: https://github.com/apache/flink/pull/13458#discussion_r50994



##
File path: 
flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/detail/job-checkpoints-detail.component.html
##
@@ -21,6 +21,8 @@
 Path: {{checkPointDetail?.external_path || '-'}}
 
 Discarded: {{checkPointDetail?.discarded || '-'}}
+
+Checkpoint Type: {{checkPointDetail?.checkpoint_type === 
"CHECKPOINT" ? (checkPointConfig?.unaligned_checkpoints ? 'unaligned 
checkpoint' : 'aligned checkpoint') : (checkPointDetail?.checkpoint_type === 
"SYNC_SAVEPOINT" ? 'savepoint on cancel' : 'savepoint')|| '-'}}

Review comment:
   Hi @gm7y8 
   it would be better to move this calculation into the TypeScript file as a 
getter
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (FLINK-18117) "Kerberized YARN per-job on Docker test" fails with "Could not start hadoop cluster."

2020-10-22 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-18117:
--

Assignee: Robert Metzger

> "Kerberized YARN per-job on Docker test" fails with "Could not start hadoop 
> cluster."
> -
>
> Key: FLINK-18117
> URL: https://issues.apache.org/jira/browse/FLINK-18117
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.12.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2683=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5
> {code}
> 2020-06-04T06:03:53.2844296Z Creating slave1 ... done
> 2020-06-04T06:03:53.4981251Z Waiting for hadoop cluster to come up. We 
> have been trying for 0 seconds, retrying ...
> 2020-06-04T06:03:58.5980181Z Waiting for hadoop cluster to come up. We have 
> been trying for 5 seconds, retrying ...
> 2020-06-04T06:04:03.6997087Z Waiting for hadoop cluster to come up. We have 
> been trying for 10 seconds, retrying ...
> 2020-06-04T06:04:08.7910791Z Waiting for hadoop cluster to come up. We have 
> been trying for 15 seconds, retrying ...
> 2020-06-04T06:04:13.8921621Z Waiting for hadoop cluster to come up. We have 
> been trying for 20 seconds, retrying ...
> 2020-06-04T06:04:18.9648844Z Waiting for hadoop cluster to come up. We have 
> been trying for 25 seconds, retrying ...
> 2020-06-04T06:04:24.0381851Z Waiting for hadoop cluster to come up. We have 
> been trying for 31 seconds, retrying ...
> 2020-06-04T06:04:29.1220264Z Waiting for hadoop cluster to come up. We have 
> been trying for 36 seconds, retrying ...
> 2020-06-04T06:04:34.1882187Z Waiting for hadoop cluster to come up. We have 
> been trying for 41 seconds, retrying ...
> 2020-06-04T06:04:39.2784948Z Waiting for hadoop cluster to come up. We have 
> been trying for 46 seconds, retrying ...
> 2020-06-04T06:04:44.3843337Z Waiting for hadoop cluster to come up. We have 
> been trying for 51 seconds, retrying ...
> 2020-06-04T06:04:49.4703561Z Waiting for hadoop cluster to come up. We have 
> been trying for 56 seconds, retrying ...
> 2020-06-04T06:04:54.5463207Z Waiting for hadoop cluster to come up. We have 
> been trying for 61 seconds, retrying ...
> 2020-06-04T06:04:59.6650405Z Waiting for hadoop cluster to come up. We have 
> been trying for 66 seconds, retrying ...
> 2020-06-04T06:05:04.7500168Z Waiting for hadoop cluster to come up. We have 
> been trying for 71 seconds, retrying ...
> 2020-06-04T06:05:09.8177904Z Waiting for hadoop cluster to come up. We have 
> been trying for 76 seconds, retrying ...
> 2020-06-04T06:05:14.9751297Z Waiting for hadoop cluster to come up. We have 
> been trying for 81 seconds, retrying ...
> 2020-06-04T06:05:20.0336417Z Waiting for hadoop cluster to come up. We have 
> been trying for 87 seconds, retrying ...
> 2020-06-04T06:05:25.1627704Z Waiting for hadoop cluster to come up. We have 
> been trying for 92 seconds, retrying ...
> 2020-06-04T06:05:30.2583315Z Waiting for hadoop cluster to come up. We have 
> been trying for 97 seconds, retrying ...
> 2020-06-04T06:05:35.3283678Z Waiting for hadoop cluster to come up. We have 
> been trying for 102 seconds, retrying ...
> 2020-06-04T06:05:40.4184029Z Waiting for hadoop cluster to come up. We have 
> been trying for 107 seconds, retrying ...
> 2020-06-04T06:05:45.5388372Z Waiting for hadoop cluster to come up. We have 
> been trying for 112 seconds, retrying ...
> 2020-06-04T06:05:50.6155334Z Waiting for hadoop cluster to come up. We have 
> been trying for 117 seconds, retrying ...
> 2020-06-04T06:05:55.7225186Z Command: start_hadoop_cluster failed. Retrying...
> 2020-06-04T06:05:55.7237999Z Starting Hadoop cluster
> 2020-06-04T06:05:56.5188293Z kdc is up-to-date
> 2020-06-04T06:05:56.5292716Z master is up-to-date
> 2020-06-04T06:05:56.5301735Z slave2 is up-to-date
> 2020-06-04T06:05:56.5306179Z slave1 is up-to-date
> 2020-06-04T06:05:56.6800566Z Waiting for hadoop cluster to come up. We have 
> been trying for 0 seconds, retrying ...
> 2020-06-04T06:06:01.7668291Z Waiting for hadoop cluster to come up. We have 
> been trying for 5 seconds, retrying ...
> 2020-06-04T06:06:06.8620265Z Waiting for hadoop cluster to come up. We have 
> been trying for 10 seconds, retrying ...
> 2020-06-04T06:06:11.9753596Z Waiting for hadoop cluster to come up. We have 
> been trying for 15 seconds, retrying ...
> 2020-06-04T06:06:17.0402846Z Waiting for hadoop cluster to come up. We have 
> been trying for 21 seconds, retrying ...
> 2020-06-04T06:06:22.1650005Z Waiting 

[GitHub] [flink] rmetzger commented on pull request #13208: [FLINK-18676] [FileSystem] Bump s3 aws version

2020-10-22 Thread GitBox


rmetzger commented on pull request #13208:
URL: https://github.com/apache/flink/pull/13208#issuecomment-714316318


   Maybe I'll take over this pull request. Since the 1.12 release is coming up, 
it would be nice to address this issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (FLINK-18676) Update version of aws to support use of default constructor of "WebIdentityTokenCredentialsProvider"

2020-10-22 Thread Robert Metzger (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-18676:
---
Component/s: (was: Connectors / FileSystem)
 FileSystems

> Update version of aws to support use of default constructor of 
> "WebIdentityTokenCredentialsProvider"
> 
>
> Key: FLINK-18676
> URL: https://issues.apache.org/jira/browse/FLINK-18676
> Project: Flink
>  Issue Type: Improvement
>  Components: FileSystems
>Affects Versions: 1.11.0
>Reporter: Ravi Bhushan Ratnakar
>Priority: Minor
>  Labels: pull-request-available
>
> *Background:*
> I am using Flink 1.11.0 on the Kubernetes platform. To give the 
> taskmanager/jobmanager access to AWS services, we are using "IAM Roles for 
> Service Accounts". I have configured the property below in flink-conf.yaml to 
> use the credential provider.
> fs.s3a.aws.credentials.provider: 
> com.amazonaws.auth.WebIdentityTokenCredentialsProvider
>  
> *Issue:*
> While the taskmanager/jobmanager is starting up, it complains that 
> "WebIdentityTokenCredentialsProvider" doesn't have a "public constructor", 
> and the container doesn't come up.
>  
> *Solution:*
> Currently the above credentials class comes from "*flink-s3-fs-hadoop*", 
> which gets its "aws-java-sdk-core" dependency from "*flink-s3-fs-base*". In 
> "*flink-s3-fs-base*", the AWS version is 1.11.754. The default constructor 
> for "WebIdentityTokenCredentialsProvider" is supported from AWS version 
> 1.11.788 onward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] zhijiangW commented on a change in pull request #13595: [FLINK-19582][network] Introduce sort-merge based blocking shuffle to Flink

2020-10-22 Thread GitBox


zhijiangW commented on a change in pull request #13595:
URL: https://github.com/apache/flink/pull/13595#discussion_r509989145



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/SortMergeResultPartition.java
##
@@ -0,0 +1,360 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.io.network.partition;
+
+import org.apache.flink.core.memory.MemorySegment;
+import org.apache.flink.core.memory.MemorySegmentFactory;
+import org.apache.flink.runtime.event.AbstractEvent;
+import org.apache.flink.runtime.io.disk.FileChannelManager;
+import org.apache.flink.runtime.io.network.api.EndOfPartitionEvent;
+import org.apache.flink.runtime.io.network.api.serialization.EventSerializer;
+import org.apache.flink.runtime.io.network.buffer.Buffer;
+import org.apache.flink.runtime.io.network.buffer.BufferCompressor;
+import org.apache.flink.runtime.io.network.buffer.BufferPool;
+import org.apache.flink.runtime.io.network.buffer.NetworkBuffer;
+import org.apache.flink.util.function.SupplierWithException;
+
+import javax.annotation.Nullable;
+import javax.annotation.concurrent.GuardedBy;
+import javax.annotation.concurrent.NotThreadSafe;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+
+import static org.apache.flink.runtime.io.network.buffer.Buffer.DataType;
+import static 
org.apache.flink.runtime.io.network.partition.SortBuffer.BufferWithChannel;
+import static org.apache.flink.util.Preconditions.checkElementIndex;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+import static org.apache.flink.util.Preconditions.checkState;
+
+/**
+ * {@link SortMergeResultPartition} appends records and events to {@link 
SortBuffer} and after the {@link SortBuffer}
+ * is full, all data in the {@link SortBuffer} will be copied and spilled to a 
{@link PartitionedFile} in subpartition
+ * index order sequentially. Large records that can not be appended to an 
empty {@link SortBuffer} will be spilled to
+ * the {@link PartitionedFile} separately.
+ */
+@NotThreadSafe
+public class SortMergeResultPartition extends ResultPartition {
+
+   private final Object lock = new Object();
+
+   /** All active readers which are consuming data from this result 
partition now. */
+   @GuardedBy("lock")
+   private final Set<SortMergeSubpartitionReader> readers = new HashSet<>();
+
+   /** {@link PartitionedFile} produced by this result partition. */
+   @GuardedBy("lock")
+   private PartitionedFile resultFile;
+
+   /** Used to generate random file channel ID. */
+   private final FileChannelManager channelManager;
+
+   /** Number of data buffers (excluding events) written for each 
subpartition. */
+   private final int[] numDataBuffers;
+
+   /** A piece of unmanaged memory for data writing. */
+   private final MemorySegment writeBuffer;
+
+   /** Size of network buffer and write buffer. */
+   private final int networkBufferSize;
+
+   /** Current {@link SortBuffer} to append records to. */
+   private SortBuffer currentSortBuffer;
+
+   /** File writer for this result partition. */
+   private PartitionedFileWriter fileWriter;
+
+   public SortMergeResultPartition(
+   String owningTaskName,
+   int partitionIndex,
+   ResultPartitionID partitionId,
+   ResultPartitionType partitionType,
+   int numSubpartitions,
+   int numTargetKeyGroups,
+   int networkBufferSize,
+   ResultPartitionManager partitionManager,
+   FileChannelManager channelManager,
+   @Nullable BufferCompressor bufferCompressor,
+   SupplierWithException<BufferPool, IOException> bufferPoolFactory) {
+
+   super(
+   owningTaskName,
+   partitionIndex,
+   partitionId,
+   partitionType,
+   numSubpartitions,
+   numTargetKeyGroups,
+   

[jira] [Commented] (FLINK-19757) TimeStampData can cause time inconsistent problem

2020-10-22 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218784#comment-17218784
 ] 

xiaogang zhou commented on FLINK-19757:
---

[~jark] the idea that TimestampData and LocalDateTime do not contain the zone 
offset is right, but when converting currentTimeMillis or a Timestamp to 
LocalDateTime, we should consider the time zone.

The JDK code suggests so:
{code:java}
public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, 
ZoneOffset offset) {
Objects.requireNonNull(offset, "offset");
NANO_OF_SECOND.checkValidValue(nanoOfSecond);
long localSecond = epochSecond + offset.getTotalSeconds();  // overflow 
caught later
long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY);
int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY);
LocalDate date = LocalDate.ofEpochDay(localEpochDay);
LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + 
nanoOfSecond);
return new LocalDateTime(date, time);
}
{code}

> TimeStampData can cause time inconsistent problem
> -
>
> Key: FLINK-19757
> URL: https://issues.apache.org/jira/browse/FLINK-19757
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Runtime
>Affects Versions: 1.11.1
>Reporter: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
>
> when we check the JDK LocalDateTime code, we find that
>  
> {code:java}
> // code placeholder
> public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, 
> ZoneOffset offset) {
> Objects.requireNonNull(offset, "offset");
> NANO_OF_SECOND.checkValidValue(nanoOfSecond);
> long localSecond = epochSecond + offset.getTotalSeconds();  // overflow 
> caught later
> long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY);
> int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY);
> LocalDate date = LocalDate.ofEpochDay(localEpochDay);
> LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + 
> nanoOfSecond);
> return new LocalDateTime(date, time);
> }
> {code}
>  
> offset.getTotalSeconds() adds the offset, but in TimestampData's
> toLocalDateTime we don't add an offset.
>  
> I'd like to add a TimeZone.getDefault().getRawOffset() in
> toLocalDateTime()
> and subtract a TimeZone.getDefault().getRawOffset() in
> fromLocalDateTime



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on pull request #13678: [FLINK-19586] Add stream committer operators for new Sink API

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13678:
URL: https://github.com/apache/flink/pull/13678#issuecomment-711439155


   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 463b5c8ed21f93caaeb7b938aa9e72abb35619b2 UNKNOWN
   * 2ba5bda5c5a15c370c231e1cedfa03fd0bc0b41b UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13458:
URL: https://github.com/apache/flink/pull/13458#issuecomment-697041219


   
   ## CI report:
   
   * 7311b0d12d19a645391ea0359a9aa6318806363b UNKNOWN
   * 79714f02f700aebe4a65dff017deba6dc9d911d9 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7989)
 
   * 9bc87ad6b2a8cbad2f226c483c74f1fa1523b70e UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13653: [FLINK-17528][table] Remove RowData#get() and ArrayData#get() and use FieldGetter and ElementGetter instead

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13653:
URL: https://github.com/apache/flink/pull/13653#issuecomment-709337488


   
   ## CI report:
   
   * 0c876d6befc487412629e6a1e8883fa5fbeb31e6 Azure: 
[CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8067)
 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8071)
 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8080)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-19757) TimeStampData can cause time inconsistent problem

2020-10-22 Thread xiaogang zhou (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218821#comment-17218821
 ] 

xiaogang zhou commented on FLINK-19757:
---

Hi [~jark], my point is that using LocalDateTime to represent TIMESTAMP 
(TIMESTAMP WITHOUT LOCAL TIME ZONE) is not very proper :)

Since ofEpochSecond takes an offset and the toString method returns a local 
time, wouldn't it be more proper to treat it as a TIMESTAMP WITH LOCAL TIME 
ZONE? 

> TimeStampData can cause time inconsistent problem
> -
>
> Key: FLINK-19757
> URL: https://issues.apache.org/jira/browse/FLINK-19757
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Runtime
>Affects Versions: 1.11.1
>Reporter: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
>
> when we check the JDK LocalDateTime code, we find that
>  
> {code:java}
> // code placeholder
> public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, 
> ZoneOffset offset) {
> Objects.requireNonNull(offset, "offset");
> NANO_OF_SECOND.checkValidValue(nanoOfSecond);
> long localSecond = epochSecond + offset.getTotalSeconds();  // overflow 
> caught later
> long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY);
> int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY);
> LocalDate date = LocalDate.ofEpochDay(localEpochDay);
> LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + 
> nanoOfSecond);
> return new LocalDateTime(date, time);
> }
> {code}
>  
> offset.getTotalSeconds() adds the offset, but in TimestampData's
> toLocalDateTime we don't add an offset.
>  
> I'd like to add a TimeZone.getDefault().getRawOffset() in
> toLocalDateTime()
> and subtract a TimeZone.getDefault().getRawOffset() in
> fromLocalDateTime



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-19761) Add lookup method for registered ShuffleDescriptor in ShuffleMaster

2020-10-22 Thread Xuannan Su (Jira)
Xuannan Su created FLINK-19761:
--

 Summary: Add lookup method for registered ShuffleDescriptor in 
ShuffleMaster
 Key: FLINK-19761
 URL: https://issues.apache.org/jira/browse/FLINK-19761
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: Xuannan Su


Currently, the ShuffleMaster can register a partition and get its shuffle 
descriptor. However, it lacks the ability to look up the registered 
ShuffleDescriptors belonging to an IntermediateResult by the 
IntermediateDataSetID.
Adding such a lookup method to the ShuffleMaster makes reusing cluster 
partitions easier. For example, we no longer have to return the 
ShuffleDescriptor to the client just so that another job can somehow encode it 
in its JobGraph to consume the cluster partition. Instead, we only need to 
return the IntermediateDataSetID, which the other job can then use to look up 
the ShuffleDescriptor.
With the lookup method in place, if we have an external shuffle service and 
the lifecycle of the IntermediateResult is not bound to the cluster, a job 
running on another cluster can look up the ShuffleDescriptor and reuse the 
IntermediateResult even after the cluster that produced it has been shut down.
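
A hypothetical shape for such a lookup (the interface name, method name, and 
return type here are illustrative, not an agreed API):

{code:java}
import java.util.Collection;
import java.util.Optional;

import org.apache.flink.runtime.jobgraph.IntermediateDataSetID;
import org.apache.flink.runtime.shuffle.ShuffleDescriptor;

// Sketch of a possible addition to the ShuffleMaster interface.
public interface ShuffleDescriptorLookup {

    /**
     * Returns the ShuffleDescriptors of all partitions registered for the given
     * IntermediateResult, or Optional.empty() if the data set is unknown.
     */
    Optional<Collection<ShuffleDescriptor>> lookupShuffleDescriptors(
            IntermediateDataSetID intermediateDataSetId);
}
{code}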



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] godfreyhe commented on pull request #13694: [FLINK-19720][table-api] Introduce new Providers and parallelism API

2020-10-22 Thread GitBox


godfreyhe commented on pull request #13694:
URL: https://github.com/apache/flink/pull/13694#issuecomment-714290557


   LGTM



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] gm7y8 commented on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI

2020-10-22 Thread GitBox


gm7y8 commented on pull request #13458:
URL: https://github.com/apache/flink/pull/13458#issuecomment-714289878


   @XComp  You are welcome! I have also verified the fix with the Word Count 
job
   
   ![Uploading Screen Shot 2020-10-22 at 12.23.01 AM.png…]()
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] tillrohrmann commented on a change in pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


tillrohrmann commented on a change in pull request #13725:
URL: https://github.com/apache/flink/pull/13725#discussion_r509935113



##
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##
@@ -203,6 +207,11 @@ protected void handleStateChange(ConnectionState newState) 
{
}
}
 
+   private void onReconnectedConnectionState() {
+   // check whether we find some new leader information in 
ZooKeeper
+   retrieveLeaderInformationFromZooKeeper();

Review comment:
   I think you are right that in some cases we will report a stale leader 
here. However, once the asynchronous fetch from the `NodeCache` completes, the 
listener should be notified about the new leader. What happens in the meantime 
is that the listener will try to connect to the old leader which should either 
be gone or reject all connection attempts since he is no longer the leader.
   
   The problem `notifyLossOfLeader` tried to solve is that a listener thinks 
that a stale leader is still the leader and, thus, continues working for it w/o 
questioning it (e.g. check with the leader) until the connection to the leader 
times out. With the `notifyLossOfLeader` change, once the retrieval service 
loses connection to ZooKeeper, it will tell the listener that the current 
leader is no longer valid. This will tell the listener to stop working for this 
leader (e.g. cancelling all tasks, disconnecting from it, etc.). If the 
listener should shortly after be told that the old leader is still the leader 
because of stale information, then it will first try to connect to the leader 
which will fail (assuming that the old leader is indeed no longer the leader) 
before it can start doing work for the leader.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] wangyang0918 commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API

2020-10-22 Thread GitBox


wangyang0918 commented on a change in pull request #13644:
URL: https://github.com/apache/flink/pull/13644#discussion_r509939287



##
File path: 
flink-kubernetes/src/test/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderRetrievalServiceTest.java
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.kubernetes.highavailability;
+
+import org.apache.flink.kubernetes.kubeclient.FlinkKubeClient;
+import org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMap;
+import org.apache.flink.kubernetes.utils.Constants;
+
+import org.junit.Test;
+
+import java.util.Collections;
+import java.util.concurrent.TimeUnit;
+
+import static org.hamcrest.Matchers.is;
+import static org.hamcrest.Matchers.notNullValue;
+import static org.hamcrest.Matchers.nullValue;
+import static org.junit.Assert.assertThat;
+
+/**
+ * Tests for the {@link KubernetesLeaderRetrievalService}.
+ */
+public class KubernetesLeaderRetrievalServiceTest extends 
KubernetesHighAvailabilityTestBase {

Review comment:
   Yes, you are right. What I mean is that we could also add some individual 
tests for the HA services, not just a full Flink cluster with K8s HA enabled. 
They will be implemented in the E2E module. I have left a comment in 
FLINK-19545.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13653: [FLINK-17528][table] Remove RowData#get() and ArrayData#get() and use FieldGetter and ElementGetter instead

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13653:
URL: https://github.com/apache/flink/pull/13653#issuecomment-709337488


   
   ## CI report:
   
   * 0c876d6befc487412629e6a1e8883fa5fbeb31e6 Azure: 
[CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8067)
 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8080)
 Azure: 
[FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8071)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13458:
URL: https://github.com/apache/flink/pull/13458#issuecomment-697041219


   
   ## CI report:
   
   * 7311b0d12d19a645391ea0359a9aa6318806363b UNKNOWN
   * 79714f02f700aebe4a65dff017deba6dc9d911d9 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7989)
 
   * 9bc87ad6b2a8cbad2f226c483c74f1fa1523b70e Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8079)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] wangyang0918 removed a comment on pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API

2020-10-22 Thread GitBox


wangyang0918 removed a comment on pull request #13644:
URL: https://github.com/apache/flink/pull/13644#issuecomment-713591200


   @tillrohrmann I have gone through all your comments and will update the PR 
soon. Just two quick points to confirm with you.
   
   * Do you think we should not handle external deletions/updates? For 
example, a Flink cluster with HA configured is running, and some user 
deletes/updates the ConfigMap via `kubectl`. If yes, I will remove the 
operations in `KubernetesLeaderElectionService#Watcher` and change some of the 
"ConfigMap does not exist" behavior.
   * I am afraid it is hard to use a real K8s server in the UT because it is 
not very easy to start a `minikube`. I will try to add the unit tests for the 
contract testing now and leave the real cluster test to the E2E test 
implementation. Does that make sense?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-17159) ES6 ElasticsearchSinkITCase unstable

2020-10-22 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218836#comment-17218836
 ] 

Robert Metzger commented on FLINK-17159:


Uff. So I was really not able to reproduce this issue at all, even though it 
seems to happen quite frequently.

Ideally we would reproduce it at DEBUG log level so that we can see why ES is 
not able to process the ping request.
Does it make sense to use {{CommonTestUtils.waitUntilCondition()}} and retry 
for 30 seconds for the cluster to come up properly?
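
For reference, roughly what such a retry looks like when hand-rolled (a 
generic sketch, not the actual {{CommonTestUtils}} signature):

{code:java}
import java.time.Duration;
import java.util.function.BooleanSupplier;

public final class RetryUtil {

    /** Polls the condition until it holds or the timeout elapses. */
    public static void waitUntil(BooleanSupplier condition, Duration timeout)
            throws InterruptedException {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadlineNanos) {
                throw new AssertionError("Condition not met within " + timeout);
            }
            Thread.sleep(100L); // back off briefly between probes
        }
    }

    private RetryUtil() {}
}
{code}

The test setup would then call something like {{waitUntil(() -> 
clusterRespondsToPing(), Duration.ofSeconds(30))}}, where 
{{clusterRespondsToPing()}} is a hypothetical probe against the embedded ES 
cluster.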

> ES6 ElasticsearchSinkITCase unstable
> 
>
> Key: FLINK-17159
> URL: https://issues.apache.org/jira/browse/FLINK-17159
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / ElasticSearch, Tests
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Chesnay Schepler
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.12.0, 1.11.3
>
>
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7482=logs=64110e28-73be-50d7-9369-8750330e0bf1=aa84fb9a-59ae-5696-70f7-011bc086e59b]
> {code:java}
> 2020-04-15T02:37:04.4289477Z [ERROR] 
> testElasticsearchSinkWithSmile(org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase)
>   Time elapsed: 0.145 s  <<< ERROR!
> 2020-04-15T02:37:04.4290310Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2020-04-15T02:37:04.4290790Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147)
> 2020-04-15T02:37:04.4291404Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:659)
> 2020-04-15T02:37:04.4291956Z  at 
> org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:77)
> 2020-04-15T02:37:04.4292548Z  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1643)
> 2020-04-15T02:37:04.4293254Z  at 
> org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticSearchSinkTest(ElasticsearchSinkTestBase.java:128)
> 2020-04-15T02:37:04.4293990Z  at 
> org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticsearchSinkSmileTest(ElasticsearchSinkTestBase.java:106)
> 2020-04-15T02:37:04.4295096Z  at 
> org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase.testElasticsearchSinkWithSmile(ElasticsearchSinkITCase.java:45)
> 2020-04-15T02:37:04.4295923Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-04-15T02:37:04.4296489Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-04-15T02:37:04.4297076Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-04-15T02:37:04.4297513Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-04-15T02:37:04.4297951Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-04-15T02:37:04.4298688Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-04-15T02:37:04.4299374Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-04-15T02:37:04.4300069Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-04-15T02:37:04.4300960Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-04-15T02:37:04.4301705Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-04-15T02:37:04.4302204Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-04-15T02:37:04.4302661Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-04-15T02:37:04.4303234Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-04-15T02:37:04.4303706Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-04-15T02:37:04.4304127Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-04-15T02:37:04.4304716Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-04-15T02:37:04.4305394Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-04-15T02:37:04.4305965Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-04-15T02:37:04.4306425Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-04-15T02:37:04.4306942Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-04-15T02:37:04.4307466Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-04-15T02:37:04.4307920Z  at 
> 

[GitHub] [flink] flinkbot edited a comment on pull request #13703: [FLINK-19696] Add runtime batch committer operators for the new sink API

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13703:
URL: https://github.com/apache/flink/pull/13703#issuecomment-712821004


   
   ## CI report:
   
   * 5166271f74497afddca3a7f350b8191e2a843298 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7938)
 
   * 58ee54d456da57f99b6124c787b7cbaeedcfb261 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8081)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack

2020-10-22 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-19688.
-
Resolution: Fixed

> Flink batch job fails because of InterruptedExceptions from network stack
> -
>
> Key: FLINK-19688
> URL: https://issues.apache.org/jira/browse/FLINK-19688
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Runtime / Task
>Affects Versions: 1.12.0
>Reporter: Robert Metzger
>Assignee: Roman Khachatryan
>Priority: Blocker
> Fix For: 1.12.0
>
> Attachments: logs.tgz
>
>
> I have a benchmarking test job that throws RuntimeExceptions at any operator 
> at a configured, random interval. When using short intervals, such as a mean 
> failure interval of 60 s, the job gets into a state where it frequently fails 
> with InterruptedExceptions. (A sketch of such a fault-injecting mapper follows 
> at the end of this message.)
> The same job does not have this problem on Flink 1.11.2 (at least not after 
> running the job for 15 hours; on 1.12-SNAPSHOT, it happens within a few 
> minutes).
> This is the job: 
> https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java
> This is the exception:
> {code}
> 2020-10-16 16:02:15,653 WARN  org.apache.flink.runtime.taskmanager.Task   
>  [] - CHAIN GroupReduce (GroupReduce at 
> main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38)) (8/8)#1 
> (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) 
> switched from RUNNING to FAILED.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
> (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at 
> appendMapper(KillerClientMapper.java:38))' , caused an error: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) 
> [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> Caused by: org.apache.flink.util.WrappingRuntimeException: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due 
> to an exception: Connection for partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   ... 4 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.RuntimeException: Error obtaining the sorted input: Thread 
> 'SortMerger Reading Thread' terminated due to an exception: Connection for 
> partition 
> 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1
>  not reachable.
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> ~[?:1.8.0_222]
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) 
> ~[?:1.8.0_222]
>   at 
> org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) 
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99)
>  ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) 
> 
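
A minimal sketch of the fault-injection mapper this issue describes, assuming a
RichMapFunction-based design. The class name, constructor parameter, and
scheduling logic below are illustrative assumptions, not the actual code; the
real KillerClientMapper lives in the flip1-bench repository linked above.

{code}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical fault injector: passes records through until a randomly
// scheduled deadline, then throws to force a task failure and recovery.
public class RandomFailureMapper<T> extends RichMapFunction<T, T> {

    private final long meanFailureIntervalMs; // mean time between injected failures
    private long nextFailureTime;

    public RandomFailureMapper(long meanFailureIntervalMs) {
        this.meanFailureIntervalMs = meanFailureIntervalMs;
    }

    @Override
    public void open(Configuration parameters) {
        // Draw the failure time uniformly from [0, 2 * mean), so the expected
        // interval between injected failures equals the configured mean.
        nextFailureTime = System.currentTimeMillis()
                + ThreadLocalRandom.current().nextLong(2 * meanFailureIntervalMs);
    }

    @Override
    public T map(T value) {
        if (System.currentTimeMillis() >= nextFailureTime) {
            throw new RuntimeException("Injected failure for recovery benchmarking");
        }
        return value;
    }
}
{code}

With a mean failure interval of 60 s, each subtask running such a mapper fails
roughly once a minute, which repeatedly exercises the task cancellation and
network teardown paths where the InterruptedExceptions surface.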

[jira] [Created] (FLINK-19764) Add More Metrics to TaskManager in Web UI

2020-10-22 Thread Yadong Xie (Jira)
Yadong Xie created FLINK-19764:
--

 Summary: Add More Metrics to TaskManager in Web UI
 Key: FLINK-19764
 URL: https://issues.apache.org/jira/browse/FLINK-19764
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Web Frontend
Reporter: Yadong Xie


Update the UI now that https://issues.apache.org/jira/browse/FLINK-14422 and 
https://issues.apache.org/jira/browse/FLINK-14406 have been fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [flink] flinkbot edited a comment on pull request #13726: [BP-1.11][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13726:
URL: https://github.com/apache/flink/pull/13726#issuecomment-713567666


   
   ## CI report:
   
   * c673d6c2530a3484e2a315301d0eee7154a3a5d9 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8020)
 
   * f0ac074feed577550c7026b190db031561a25f77 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8087)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13727: [BP-1.10][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13727:
URL: https://github.com/apache/flink/pull/13727#issuecomment-713572668


   
   ## CI report:
   
   * c2ae5ba37a69ad1f40f585421d38edb30d40c0fa Travis: 
[FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/191431633) 
   * 8e2aaf84d255adab8420800d5e7ec20d502868f4 Travis: 
[PENDING](https://travis-ci.com/github/flink-ci/flink/builds/191598195) 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] flinkbot edited a comment on pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService

2020-10-22 Thread GitBox


flinkbot edited a comment on pull request #13725:
URL: https://github.com/apache/flink/pull/13725#issuecomment-713560830


   
   ## CI report:
   
   * a606f618ea3fb7bf179657a060c47ee32b55e514 Azure: 
[SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8019)
 
   * cd48869c084675464d37449a3db1f9d3cd10a040 Azure: 
[PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8086)
 
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



