[GitHub] [flink] flinkbot edited a comment on pull request #13735: [FLINK-19533][checkpoint] Add channel state reassignment for unaligned checkpoints.
flinkbot edited a comment on pull request #13735: URL: https://github.com/apache/flink/pull/13735#issuecomment-714048414
## CI report:
* e081fecba324a0e35017f4b89cc0499c9112a768 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8049)
* ee73393a13c999be626f7b8b98a170be88a093d9 UNKNOWN
Bot commands: The @flinkbot bot supports the following commands:
- `@flinkbot run travis` re-run the last Travis build
- `@flinkbot run azure` re-run the last Azure build
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] QingdongZeng3 commented on pull request #13690: [FLINK-16595][YARN]support more HDFS nameServices in yarn mode when security enabled. Is…
QingdongZeng3 commented on pull request #13690: URL: https://github.com/apache/flink/pull/13690#issuecomment-714257949 Hi Robert, I have passed the CI. Would you help to review my code? Thanks!
[GitHub] [flink] gaoyunhaii opened a new pull request #13740: [FLINK-19758][filesystem] Implement a new unified file sink based on the new sink API
gaoyunhaii opened a new pull request #13740: URL: https://github.com/apache/flink/pull/13740
## What is the purpose of the change
This pull request implements a new File sink based on the new sink API.
## Brief change log
af69ad89405936040b55d54acb0ef18e6141e5ad implements the sink.
## Verifying this change
This change added tests and can be verified as follows:
- Unit tests for each component of the new sink.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **yes**
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **no**
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: **no**
- The S3 file system connector: **no**
## Documentation
- Does this pull request introduce a new feature? **no**
- If yes, how is the feature documented? **not applicable**
[GitHub] [flink] QingdongZeng3 edited a comment on pull request #13690: [FLINK-16595][YARN]support more HDFS nameServices in yarn mode when security enabled. Is…
QingdongZeng3 edited a comment on pull request #13690: URL: https://github.com/apache/flink/pull/13690#issuecomment-714257949 Hi Robert, I have passed the CI. Would you help to review my code? Thanks! @rmetzger
[jira] [Updated] (FLINK-19758) Implement a new unified File Sink based on the new Sink API
[ https://issues.apache.org/jira/browse/FLINK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-19758: --- Labels: pull-request-available (was: ) > Implement a new unified File Sink based on the new Sink API > --- > > Key: FLINK-19758 > URL: https://issues.apache.org/jira/browse/FLINK-19758 > Project: Flink > Issue Type: Sub-task > Components: API / DataStream, Connectors / FileSystem >Affects Versions: 1.12.0 >Reporter: Yun Gao >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] gaoyunhaii commented on pull request #11326: [FLINK-11427][formats] Add protobuf parquet support for StreamingFileSink
gaoyunhaii commented on pull request #11326: URL: https://github.com/apache/flink/pull/11326#issuecomment-714271222 Many thanks @aljoscha! I'll close this PR.
[GitHub] [flink] gaoyunhaii closed pull request #11326: [FLINK-11427][formats] Add protobuf parquet support for StreamingFileSink
gaoyunhaii closed pull request #11326: URL: https://github.com/apache/flink/pull/11326
[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery
curcur edited a comment on pull request #13648: URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577
> * Visibility in the normal case: none of the fields written in `releaseView` are `volatile`. So in the normal case (`t1:release` then `t2:createReadView`) `t2` can see some inconsistent state. For example, `readView == null`, but `isPartialBufferCleanupRequired == false`. Right?
> Maybe call `releaseView()` from `createReadView()` unconditionally?

Aren't they guarded by the synchronization block? I think in the existing code almost all access is guarded by the lock. But I have a simpler solution just in case; see the last sentence.

> * Overwrites when release is slow: won't `t1` overwrite changes to `PipelinedSubpartition` made already by `t2`? For example, reset `sequenceNumber` after `t2` has sent some data?
> Maybe `PipelinedSubpartition.readerView` should be `AtomicReference` and then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it?

I think this is the same question I answered in the write-up above. In short, it won't be possible, because a view can only be released once, and this is guarded by the release flag of the view; details quoted below. The updates are guarded by the lock.
- What if netty thread1 releases the view after netty thread2 recreates it? Thread2 releases the view that thread1 holds a reference to before creating the new view. Thread1 cannot release the old view (through its view reference) again afterwards, since a view can only be released once.

I actually have an idea to simplify this whole model: **if we only release before creation and in no other place, this whole threading interaction model is simplified greatly. That is, only the new netty thread can release the view.**
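The "release only before creation" idea discussed above can be sketched with an `AtomicReference` holding the reader view, as the reviewer suggested. This is a minimal, hypothetical illustration (the `ViewHolder`/`View` names and methods are stand-ins, not the actual Flink classes): the recreating thread atomically swaps in the new view, so only it releases the old one, and a slow releasing thread can never clobber the new view's state.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: only the thread that atomically swaps the reader view
// releases the previous one, and a view is released at most once.
class ViewHolder {
    static final class View {
        volatile boolean released = false;

        void release() {
            released = true; // a view only transitions to released once
        }
    }

    private final AtomicReference<View> readerView = new AtomicReference<>();

    // Release-before-create: the recreating thread installs the new view
    // atomically, then releases the old view it swapped out.
    View recreateView() {
        View next = new View();
        View previous = readerView.getAndSet(next);
        if (previous != null && !previous.released) {
            previous.release();
        }
        return next;
    }

    View current() {
        return readerView.get();
    }
}
```

With this shape, a thread that still holds a reference to an old view cannot release the current one: it only ever releases the view it swapped out itself.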
[jira] [Updated] (FLINK-19626) Introduce multi-input operator construction algorithm
[ https://issues.apache.org/jira/browse/FLINK-19626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-19626: --- Labels: pull-request-available (was: ) > Introduce multi-input operator construction algorithm > Key: FLINK-19626 > URL: https://issues.apache.org/jira/browse/FLINK-19626 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Planner > Reporter: Caizhi Weng > Assignee: Caizhi Weng > Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > We should introduce an algorithm to organize exec nodes into multi-input exec nodes.
[GitHub] [flink] TsReaper opened a new pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm
TsReaper opened a new pull request #13742: URL: https://github.com/apache/flink/pull/13742
## What is the purpose of the change
As multiple-input exec nodes have been introduced, we're going to construct multiple-input operators with a new construction algorithm. This PR introduces that algorithm. Multiple-input optimization is currently not used by default and is only used for tests, as its operator is not ready. We'll change this once the multiple-input operator is ready.
## Brief change log
- Introduce multi-input operator construction algorithm
## Verifying this change
This change added tests and can be verified as follows: run the newly added tests.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
[GitHub] [flink] flinkbot edited a comment on pull request #13703: [FLINK-19696] Add runtime batch committer operators for the new sink API
flinkbot edited a comment on pull request #13703: URL: https://github.com/apache/flink/pull/13703#issuecomment-712821004 ## CI report: * 5166271f74497afddca3a7f350b8191e2a843298 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7938) * 58ee54d456da57f99b6124c787b7cbaeedcfb261 UNKNOWN
[jira] [Closed] (FLINK-16218) Chinese character is garbled when reading from MySQL using JDBC source
[ https://issues.apache.org/jira/browse/FLINK-16218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jark Wu closed FLINK-16218. --- Fix Version/s: (was: 1.11.3) (was: 1.10.3) (was: 1.12.0) Resolution: Not A Problem > Chinese character is garbled when reading from MySQL using JDBC source > Key: FLINK-16218 > URL: https://issues.apache.org/jira/browse/FLINK-16218 > Project: Flink > Issue Type: Bug > Components: Connectors / JDBC, Table SQL / Planner > Affects Versions: 1.10.0 > Reporter: Jark Wu > Priority: Major > Attachments: 15821322356269.jpg > > I have set the database and table to use UTF8 collations, and use > {{jdbc:mysql://localhost:3306/db_test?useUnicode=true&characterEncoding=utf-8}} > as the connection JDBC URL. However, the Chinese characters are still garbled. > Btw, we should have a test to cover this.
[jira] [Commented] (FLINK-19761) Add lookup method for registered ShuffleDescriptor in ShuffleMaster
[ https://issues.apache.org/jira/browse/FLINK-19761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218839#comment-17218839 ] Till Rohrmann commented on FLINK-19761: --- Thanks for creating this issue [~xuannan]. How will this work if the {{ShuffleMaster}} you are asking does not know about the cluster partition (e.g. if you are using two distinct external shuffle services, or when using the NettyShuffleService and requesting a cluster partition from a different cluster)? > Add lookup method for registered ShuffleDescriptor in ShuffleMaster > Key: FLINK-19761 > URL: https://issues.apache.org/jira/browse/FLINK-19761 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network > Reporter: Xuannan Su > Priority: Major > > Currently, the ShuffleMaster can register a partition and get the shuffle descriptor. However, it lacks the ability to look up the registered ShuffleDescriptors belonging to an IntermediateResult by the IntermediateDataSetID. > Adding the lookup method to the ShuffleMaster makes reusing a cluster partition easier. For example, we don't have to return the ShuffleDescriptor to the client just so that another job can somehow encode the ShuffleDescriptor in its JobGraph to consume the cluster partition. Instead, we only need to return the IntermediateDataSetID, and another job can use it to look up the ShuffleDescriptor. > With the lookup method in ShuffleMaster, if we have an external shuffle service and the lifecycle of the IntermediateResult is not bound to the cluster, a job running on another cluster can look up the ShuffleDescriptor and reuse the IntermediateResult even if the cluster that produced it has been shut down.
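The lookup proposed in the issue above could look roughly like the following sketch. It uses simplified stand-in types (`IntermediateDataSetID` and `ShuffleDescriptor` here are plain placeholders, not the Flink runtime classes, and `LookupShuffleMaster` is a hypothetical name): registration remembers descriptors keyed by data set ID, and a consuming job only needs that ID rather than serialized descriptors in its JobGraph.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Placeholder types standing in for the Flink runtime classes of the same name.
final class IntermediateDataSetID { }
final class ShuffleDescriptor { }

class LookupShuffleMaster {
    private final Map<IntermediateDataSetID, List<ShuffleDescriptor>> registered =
            new ConcurrentHashMap<>();

    // Existing registration path: remember each descriptor under its data set ID.
    void registerPartition(IntermediateDataSetID dataSetId, ShuffleDescriptor descriptor) {
        registered.computeIfAbsent(dataSetId, id -> new CopyOnWriteArrayList<>())
                  .add(descriptor);
    }

    // Proposed lookup: a consuming job only needs the IntermediateDataSetID,
    // not the serialized descriptors shipped through its JobGraph.
    Collection<ShuffleDescriptor> lookup(IntermediateDataSetID dataSetId) {
        return registered.getOrDefault(dataSetId, Collections.emptyList());
    }
}
```

Till's question maps directly onto this sketch: if the descriptors live only in one ShuffleMaster's `registered` map, a different cluster's ShuffleMaster would return an empty result for the same ID.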
[jira] [Assigned] (FLINK-19136) MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the predicate within the allowed time"
[ https://issues.apache.org/jira/browse/FLINK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger reassigned FLINK-19136: -- Assignee: Robert Metzger > MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the > predicate within the allowed time" > > > Key: FLINK-19136 > URL: https://issues.apache.org/jira/browse/FLINK-19136 > Project: Flink > Issue Type: Bug > Components: Runtime / Metrics >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Robert Metzger >Priority: Major > Labels: test-stability > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6179=logs=9fca669f-5c5f-59c7-4118-e31c641064f0=91bf6583-3fb2-592f-e4d4-d79d79c3230a] > {code} > 2020-09-03T23:33:18.3687261Z [ERROR] > testReporter(org.apache.flink.metrics.tests.MetricsAvailabilityITCase) Time > elapsed: 15.217 s <<< ERROR! > 2020-09-03T23:33:18.3698260Z java.util.concurrent.ExecutionException: > org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not > satisfy the predicate within the allowed time. 
> 2020-09-03T23:33:18.3698749Z at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > 2020-09-03T23:33:18.3699163Z at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) > 2020-09-03T23:33:18.3699754Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.fetchMetric(MetricsAvailabilityITCase.java:162) > 2020-09-03T23:33:18.3700234Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.checkJobManagerMetricAvailability(MetricsAvailabilityITCase.java:116) > 2020-09-03T23:33:18.3700726Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.testReporter(MetricsAvailabilityITCase.java:101) > 2020-09-03T23:33:18.3701097Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-09-03T23:33:18.3701425Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-09-03T23:33:18.3701798Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-09-03T23:33:18.3702146Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-09-03T23:33:18.3702471Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-09-03T23:33:18.3702866Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-09-03T23:33:18.3703253Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-09-03T23:33:18.3703621Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-09-03T23:33:18.3703997Z at > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-09-03T23:33:18.3704339Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-09-03T23:33:18.3704629Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-09-03T23:33:18.3704940Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-09-03T23:33:18.3705354Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-09-03T23:33:18.3705725Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-09-03T23:33:18.3706072Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-09-03T23:33:18.3706397Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-09-03T23:33:18.3706714Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-09-03T23:33:18.3707044Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-09-03T23:33:18.3707373Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-09-03T23:33:18.3707708Z at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > 2020-09-03T23:33:18.3708073Z at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > 2020-09-03T23:33:18.3708410Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-09-03T23:33:18.3708691Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2020-09-03T23:33:18.3708976Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2020-09-03T23:33:18.3709273Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-09-03T23:33:18.3709579Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-09-03T23:33:18.3709910Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-09-03T23:33:18.3710242Z at >
[GitHub] [flink] AHeise merged pull request #13733: (1.10 backport) [FLINK-19401][checkpointing] Download checkpoints only if needed
AHeise merged pull request #13733: URL: https://github.com/apache/flink/pull/13733
[jira] [Created] (FLINK-19765) flink SqlUseCatalog.getCatalogName is not unified with SqlCreateCatalog and SqlDropCatalog
jackylau created FLINK-19765: Summary: flink SqlUseCatalog.getCatalogName is not unified with SqlCreateCatalog and SqlDropCatalog Key: FLINK-19765 URL: https://issues.apache.org/jira/browse/FLINK-19765 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.11.0 Reporter: jackylau Fix For: 1.12.0 While developing a Flink Ranger plugin at the operation level, I found that this method is not unified. Also, SqlToOperationConverter.convert needs a well-defined ordering so that users can find the relevant code.
[GitHub] [flink] zhijiangW commented on a change in pull request #13595: [FLINK-19582][network] Introduce sort-merge based blocking shuffle to Flink
zhijiangW commented on a change in pull request #13595: URL: https://github.com/apache/flink/pull/13595#discussion_r509983937 ## File path: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/SortBuffer.java ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.runtime.io.network.partition; + +import org.apache.flink.core.memory.MemorySegment; +import org.apache.flink.runtime.io.network.buffer.Buffer; + +import java.io.IOException; +import java.nio.ByteBuffer; + +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * Data of different channels can be appended to a {@link SortBuffer} and after the {@link SortBuffer} is + * finished, the appended data can be copied from it in channel index order. + */ +public interface SortBuffer { + + /** +* Appends data of the specified channel to this {@link SortBuffer} and returns true if all bytes of +* the source buffer is copied to this {@link SortBuffer} successfully, otherwise if returns false, +* nothing will be copied. 
+*/ + boolean append(ByteBuffer source, int targetChannel, Buffer.DataType dataType) throws IOException; + + /** +* Copies data in this {@link SortBuffer} to the target {@link MemorySegment} in channel index order +* and returns {@link BufferWithChannel} which contains the copied data and the corresponding channel +* index. +*/ + BufferWithChannel copyData(MemorySegment target); Review comment: nit: rename to `copyIntoSegment` for better readability; otherwise I might be a bit confused about whether it is `copyDataTo` or `copyDataFrom` in internal processes.
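The channel-index-order contract described in the `SortBuffer` javadoc above can be illustrated with a small, self-contained stand-in (plain Java collections replace Flink's `MemorySegment`/`Buffer`, and `SimpleSortBuffer` is a hypothetical name, not the actual implementation): records appended for arbitrary channels are drained back grouped by ascending channel index.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the SortBuffer contract: append() copies the source bytes for a
// target channel, and drainInChannelOrder() emits them in channel index order.
class SimpleSortBuffer {
    private final TreeMap<Integer, List<byte[]>> dataPerChannel = new TreeMap<>();
    private boolean finished = false;

    // Mirrors append(): returns false (copying nothing) once finished.
    boolean append(ByteBuffer source, int targetChannel) {
        if (finished) {
            return false;
        }
        byte[] record = new byte[source.remaining()];
        source.get(record);
        dataPerChannel.computeIfAbsent(targetChannel, c -> new ArrayList<>()).add(record);
        return true;
    }

    void finish() {
        finished = true;
    }

    // Mirrors copyData(): (channel, record) pairs come out in ascending
    // channel index order because TreeMap iterates keys in sorted order.
    List<Map.Entry<Integer, byte[]>> drainInChannelOrder() {
        List<Map.Entry<Integer, byte[]>> result = new ArrayList<>();
        dataPerChannel.forEach((channel, records) ->
                records.forEach(r -> result.add(Map.entry(channel, r))));
        return result;
    }
}
```

Appending for channels 2, then 0, then draining yields channel 0 before channel 2, regardless of append order.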
[jira] [Closed] (FLINK-19738) Remove the `getCommitter` from the `AbstractStreamingCommitterOperator`
[ https://issues.apache.org/jira/browse/FLINK-19738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guowei Ma closed FLINK-19738. --- Fix Version/s: 1.12.0 Resolution: Resolved > Remove the `getCommitter` from the `AbstractStreamingCommitterOperator` > Key: FLINK-19738 > URL: https://issues.apache.org/jira/browse/FLINK-19738 > Project: Flink > Issue Type: Sub-task > Components: API / DataStream > Reporter: Guowei Ma > Priority: Minor > Fix For: 1.12.0
[GitHub] [flink] gm7y8 edited a comment on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI
gm7y8 edited a comment on pull request #13458: URL: https://github.com/apache/flink/pull/13458#issuecomment-714289878 @XComp You are welcome! I have also verified the fix with the Word Count job: https://user-images.githubusercontent.com/12487613/96838691-06ff9980-13fd-11eb-93dd-ae8545cabbdb.png
[GitHub] [flink] tillrohrmann commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API
tillrohrmann commented on a change in pull request #13644: URL: https://github.com/apache/flink/pull/13644#discussion_r509942532 ## File path: flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionService.java ## @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.kubernetes.highavailability; + +import org.apache.flink.kubernetes.kubeclient.FlinkKubeClient; +import org.apache.flink.kubernetes.kubeclient.KubernetesLeaderElectionConfiguration; +import org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMap; +import org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector; +import org.apache.flink.kubernetes.kubeclient.resources.KubernetesWatch; +import org.apache.flink.kubernetes.utils.KubernetesUtils; +import org.apache.flink.runtime.leaderelection.AbstractLeaderElectionService; +import org.apache.flink.runtime.leaderelection.LeaderContender; +import org.apache.flink.util.function.FunctionUtils; + +import java.util.List; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.concurrent.CompletableFuture; +import java.util.concurrent.Executor; + +import static org.apache.flink.kubernetes.utils.Constants.LABEL_CONFIGMAP_TYPE_HIGH_AVAILABILITY; +import static org.apache.flink.kubernetes.utils.Constants.LEADER_ADDRESS_KEY; +import static org.apache.flink.kubernetes.utils.Constants.LEADER_SESSION_ID_KEY; +import static org.apache.flink.util.Preconditions.checkNotNull; + +/** + * Leader election service for multiple JobManagers. The active JobManager is elected using Kubernetes. + * The current leader's address as well as its leader session ID is published via Kubernetes ConfigMap. + * Note that the contending lock and leader storage are using the same ConfigMap. And every component(e.g. + * ResourceManager, Dispatcher, RestEndpoint, JobManager for each job) will have a separate ConfigMap. 
+ */ +public class KubernetesLeaderElectionService extends AbstractLeaderElectionService { + + private final FlinkKubeClient kubeClient; + + private final Executor executor; + + private final String configMapName; + + private final KubernetesLeaderElector leaderElector; + + private KubernetesWatch kubernetesWatch; + + // Labels will be used to clean up the ha related ConfigMaps. + private Map configMapLabels; + + KubernetesLeaderElectionService( + FlinkKubeClient kubeClient, + Executor executor, + KubernetesLeaderElectionConfiguration leaderConfig) { + + this.kubeClient = checkNotNull(kubeClient, "Kubernetes client should not be null."); + this.executor = checkNotNull(executor, "Executor should not be null."); + this.configMapName = leaderConfig.getConfigMapName(); + this.leaderElector = kubeClient.createLeaderElector(leaderConfig, new LeaderCallbackHandlerImpl()); + this.leaderContender = null; + this.configMapLabels = KubernetesUtils.getConfigMapLabels( + leaderConfig.getClusterId(), LABEL_CONFIGMAP_TYPE_HIGH_AVAILABILITY); + } + + @Override + public void internalStart(LeaderContender contender) { + CompletableFuture.runAsync(leaderElector::run, executor); + kubernetesWatch = kubeClient.watchConfigMaps(configMapName, new ConfigMapCallbackHandlerImpl()); + } + + @Override + public void internalStop() { + if (kubernetesWatch != null) { + kubernetesWatch.close(); + } + } + + @Override + protected void writeLeaderInformation() { + try { + kubeClient.checkAndUpdateConfigMap( + configMapName, + configMap -> { + if (leaderElector.hasLeadership(configMap)) { + // Get the updated ConfigMap with new leader information +
[jira] [Closed] (FLINK-14406) Add metric for managed memory
[ https://issues.apache.org/jira/browse/FLINK-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin closed FLINK-14406. --- Fix Version/s: 1.12.0 Release Note: New metrics are available to monitor managed memory: `Status.Flink.Memory.Managed.[Used|Total]` Resolution: Done merged into master by 1a70fe50da096a1a44fc71ef0b911de666cdbeb7
> Add metric for managed memory
> Key: FLINK-14406
> URL: https://issues.apache.org/jira/browse/FLINK-14406
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Metrics, Runtime / Task
> Reporter: lining
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
> Time Spent: 10m
> Remaining Estimate: 0h
>
> If a user wants to see how much managed memory is used at a point in time, they cannot, because there are no managed-memory metrics.
> *Proposal*
> * add a default memory type in MemoryManager
> {code:java}
> public static final MemoryType DEFAULT_MEMORY_TYPE = MemoryType.OFF_HEAP;
> {code}
> * add getManagedMemoryTotal in TaskExecutor:
> {code:java}
> public long getManagedMemoryTotal() {
>     return this.taskSlotTable.getAllocatedSlots().stream().mapToLong(
>         slot -> slot.getMemoryManager().getMemorySizeByType(MemoryManager.DEFAULT_MEMORY_TYPE)
>     ).sum();
> }
> {code}
> * add getManagedMemoryUsed in TaskExecutor:
> {code:java}
> public long getManagedMemoryUsed() {
>     return this.taskSlotTable.getAllocatedSlots().stream().mapToLong(
>         slot -> slot.getMemoryManager().getMemorySizeByType(MemoryManager.DEFAULT_MEMORY_TYPE)
>             - slot.getMemoryManager().availableMemory(MemoryManager.DEFAULT_MEMORY_TYPE)
>     ).sum();
> }
> {code}
> * add instantiateMemoryManagerMetrics in MetricUtils
> {code:java}
> public static void instantiateMemoryManagerMetrics(MetricGroup statusMetricGroup, TaskExecutor taskExecutor) {
>     checkNotNull(statusMetricGroup);
>     MetricGroup memoryManagerGroup = statusMetricGroup.addGroup("Managed").addGroup("Memory");
>     memoryManagerGroup.<Long, Gauge<Long>>gauge("TotalCapacity", taskExecutor::getManagedMemoryTotal);
>     memoryManagerGroup.<Long, Gauge<Long>>gauge("MemoryUsed", taskExecutor::getManagedMemoryUsed);
> }
> {code}
> * register it in TaskManagerRunner#startTaskManager
[GitHub] [flink] flinkbot commented on pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm
flinkbot commented on pull request #13742: URL: https://github.com/apache/flink/pull/13742#issuecomment-714296461 ## CI report: * 37cba38abc15a3ee7d1193644be564c0c75ac4c1 UNKNOWN
[GitHub] [flink] flinkbot commented on pull request #13741: [FLINK-19680][checkpointing] Announce timeoutable CheckpointBarriers
flinkbot commented on pull request #13741: URL: https://github.com/apache/flink/pull/13741#issuecomment-714295796 ## CI report: * 8a68242a90fd180fcd0b4b4c60b5e1e49136c00f UNKNOWN
[GitHub] [flink] vthinkxie commented on a change in pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI
vthinkxie commented on a change in pull request #13458: URL: https://github.com/apache/flink/pull/13458#discussion_r50994 ## File path: flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/detail/job-checkpoints-detail.component.html ## @@ -21,6 +21,8 @@ Path: {{checkPointDetail?.external_path || '-'}} Discarded: {{checkPointDetail?.discarded || '-'}} + +Checkpoint Type: {{checkPointDetail?.checkpoint_type === "CHECKPOINT" ? (checkPointConfig?.unaligned_checkpoints ? 'unaligned checkpoint' : 'aligned checkpoint') : (checkPointDetail?.checkpoint_type === "SYNC_SAVEPOINT" ? 'savepoint on cancel' : 'savepoint')|| '-'}} Review comment: Hi @gm7y8, it would be better to move this calculation into the typescript file as a getter, and bind the getter here, since the calculation is a little too complex for the template file This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-19762) Selecting Job-ID in web UI covers more than the ID
Matthias created FLINK-19762: Summary: Selecting Job-ID in web UI covers more than the ID Key: FLINK-19762 URL: https://issues.apache.org/jira/browse/FLINK-19762 Project: Flink Issue Type: Bug Components: Runtime / Web Frontend Affects Versions: 1.11.2, 1.10.2 Reporter: Matthias Fix For: 1.12.0 Attachments: Screenshot 2020-10-22 at 09.47.41.png Not only the ID is selected when trying to copy the Job ID from the web UI by double-clicking it. See the attached screenshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] shizhengchao opened a new pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…
shizhengchao opened a new pull request #13743: URL: https://github.com/apache/flink/pull/13743 …l entry to the map ## What is the purpose of the change This pull request fixes the NPE in avro format when there is a null entry to the map. ## Brief change log - Fixes the NPE in avro format when there is a null entry to the map. - Adds a null entry to the map in `AvroRowDataDeSerializationSchemaTest#testSerializeDeserialize` ## Verifying this change This change is already covered by existing tests, such as *AvroRowDataDeSerializationSchemaTest#testSerializeDeserialize*. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
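The null-guard pattern behind a fix like this can be shown standalone. This sketch only illustrates the pattern; it is not Flink's actual Avro row-data converter code, and the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeMapConversion {

    // Converts each map value, passing nulls through instead of dereferencing them.
    static Map<String, String> convert(Map<String, Object> input) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, Object> e : input.entrySet()) {
            Object value = e.getValue();
            // Without this null check, value.toString() throws an NPE for a
            // null map entry -- the failure mode reported in FLINK-19629.
            result.put(e.getKey(), value == null ? null : value.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> input = new HashMap<>();
        input.put("a", 1);
        input.put("b", null); // the problematic null entry
        System.out.println(convert(input)); // {a=1, b=null}
    }
}
```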
[GitHub] [flink] flinkbot edited a comment on pull request #13726: [BP-1.11][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
flinkbot edited a comment on pull request #13726: URL: https://github.com/apache/flink/pull/13726#issuecomment-713567666 ## CI report: * c673d6c2530a3484e2a315301d0eee7154a3a5d9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8020) * f0ac074feed577550c7026b190db031561a25f77 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack
[ https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218855#comment-17218855 ] Roman Khachatryan commented on FLINK-19688: --- https://github.com/apache/flink/pull/13723 > Flink batch job fails because of InterruptedExceptions from network stack > - > > Key: FLINK-19688 > URL: https://issues.apache.org/jira/browse/FLINK-19688 > Project: Flink > Issue Type: Bug > Components: Runtime / Network, Runtime / Task >Affects Versions: 1.12.0 >Reporter: Robert Metzger >Assignee: Roman Khachatryan >Priority: Blocker > Fix For: 1.12.0 > > Attachments: logs.tgz > > > I have a benchmarking test job, that throws RuntimeExceptions at any operator > at a configured, random interval. When using low intervals, such as mean > failure rate = 60 s, the job will get into a state where it frequently fails > with InterruptedExceptions. > The same job does not have this problem on Flink 1.11.2 (at least not after > running the job for 15 hours, on 1.12-SN, it happens within a few minutes) > This is the job: > https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java > This is the exception: > {code} > 2020-10-16 16:02:15,653 WARN org.apache.flink.runtime.taskmanager.Task > [] - CHAIN GroupReduce (GroupReduce at > main(TPCHQuery3.java:199)) -> Map (Map at > appendMapper(KillerClientMapper.java:38)) (8/8)#1 > (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) > switched from RUNNING to FAILED. 
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce > (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at > appendMapper(KillerClientMapper.java:38))' , caused an error: > java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error > obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due > to an exception: Connection for partition > 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 > not reachable. > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) > [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) > [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222] > Caused by: org.apache.flink.util.WrappingRuntimeException: > java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error > obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due > to an exception: Connection for partition > 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 > not reachable. 
> at > org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > ... 4 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: Error obtaining the sorted input: Thread > 'SortMerger Reading Thread' terminated due to an exception: Connection for > partition > 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 > not reachable. > at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > ~[?:1.8.0_222] > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > ~[?:1.8.0_222] > at > org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at
[GitHub] [flink] flinkbot edited a comment on pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
flinkbot edited a comment on pull request #13725: URL: https://github.com/apache/flink/pull/13725#issuecomment-713560830 ## CI report: * a606f618ea3fb7bf179657a060c47ee32b55e514 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8019) * cd48869c084675464d37449a3db1f9d3cd10a040 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (FLINK-7286) Flink Dashboard fails to display bytes/records received by sources / emitted by sinks
[ https://issues.apache.org/jira/browse/FLINK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger closed FLINK-7286. - Fix Version/s: (was: 1.12.0) Resolution: Duplicate Closed as duplicate of FLINK-11576. > Flink Dashboard fails to display bytes/records received by sources / emitted > by sinks > - > > Key: FLINK-7286 > URL: https://issues.apache.org/jira/browse/FLINK-7286 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics, Runtime / Web Frontend >Affects Versions: 1.3.1 >Reporter: Elias Levy >Priority: Major > Labels: usability > > It appears Flink can't measure the number of bytes read or records produced > by a source (e.g. Kafka source). This is particularly problematic for simple > jobs where the job pipeline is chained, and in which there are no > measurements between operators. Thus, in the UI it appears that the job is > not consuming any data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] flinkbot edited a comment on pull request #13736: [FLINK-19654][python][e2e] Reduce pyflink e2e test parallelism
flinkbot edited a comment on pull request #13736: URL: https://github.com/apache/flink/pull/13736#issuecomment-714183394 ## CI report: * 259c893dcb4b5692df54cfe7739f95d9e34096e6 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8056) * 111ec785929a0742b46ea98408f585aa03314b1d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8070) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (FLINK-19759) DeadlockBreakupTest.testSubplanReuse_AddSingletonExchange is instable
[ https://issues.apache.org/jira/browse/FLINK-19759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu closed FLINK-19759. --- Fix Version/s: (was: 1.12.0) Resolution: Fixed Just noticed that it's already fixed in https://github.com/apache/flink/commit/0b1c7327e9d0efa9c32c95815a0cff5f5ad0856b, closing this ticket. > DeadlockBreakupTest.testSubplanReuse_AddSingletonExchange is instable > - > > Key: FLINK-19759 > URL: https://issues.apache.org/jira/browse/FLINK-19759 > Project: Flink > Issue Type: Bug > Components: Table SQL / Planner >Affects Versions: 1.12.0 >Reporter: Dian Fu >Priority: Blocker > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8052=logs=e25d5e7e-2a9c-5589-4940-0b638d75a414=a6e0f756-5bb9-5ea8-a468-5f60db442a29 > {code} > [ERROR] DeadlockBreakupTest.testSubplanReuse_AddSingletonExchange:217 > planAfter expected:<...=[>(cnt, 3)]) > : +- [SortAggregate(isMerge=[true], select=[Final_COUNT(count1$0) AS cnt], > reuse_id=[1]) > : +- Exchange(distribution=[single]) > :+- LocalSort]Aggregate(select=[Pa...> but was:<...=[>(cnt, 3)]) > : +- [HashAggregate(isMerge=[true], select=[Final_COUNT(count1$0) AS cnt], > reuse_id=[1]) > : +- Exchange(distribution=[single]) > :+- LocalHash]Aggregate(select=[Pa...> > [INFO] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-19760) Make the `GlobalCommitter` a standalone interface that does not extend the `Committer`
Guowei Ma created FLINK-19760: - Summary: Make the `GlobalCommitter` a standalone interface that does not extend the `Committer` Key: FLINK-19760 URL: https://issues.apache.org/jira/browse/FLINK-19760 Project: Flink Issue Type: Sub-task Reporter: Guowei Ma -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery
curcur edited a comment on pull request #13648: URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577 > * Visibility in normal case: none of the fields written in `releaseView` are `volatile`. So in the normal case (`t1:release` then `t2:createReadView`) `t2` can see some inconsistent state. For example, `readView == null`, but `isPartialBufferCleanupRequired == false`. Right? > Maybe call `releaseView()` from `createReadView()` unconditionally? Aren't they guarded by the synchronization block? I think in the existing code, almost all access is guarded by the lock. But I have a simpler solution just in case. See the last paragraph. > * Overwrites when release is slow: won't `t1` overwrite changes to `PipelinedSubpartition` already made by `t2`? For example, reset `sequenceNumber` after `t2` has sent some data? > Maybe `PipelinedSubpartition.readerView` should be an `AtomicReference` and then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it? I think this is the same question I answered in the write-up above. In short, it is not possible, because a view can only be released once, and this is guarded by the release flag of the view; details quoted below. The updates are guarded by the lock. - What if netty thread1 releases the view after netty thread2 recreates it? Thread2 releases the view that thread1 holds a reference to before creating a new view. Thread1 cannot release the old view (through its view reference) again afterwards, since a view can only be released once. I actually have an idea to simplify this whole model: **If we release the view only right before creation and nowhere else, this whole threading interaction model is greatly simplified: only the netty thread creating a new view can release the old one.** This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
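The simplification proposed in the comment above can be sketched minimally: a view is released only immediately before its replacement is created, under the subpartition lock, and a released flag makes a second release a no-op. Class and field names here are illustrative, not the actual `PipelinedApproximateSubpartition` code:

```java
public class ApproximateSubpartitionSketch {

    static final class ReadView {
        private boolean released;
        // A view can be released at most once; a second call is a no-op.
        synchronized boolean release() {
            if (released) {
                return false; // already released by an earlier netty thread
            }
            released = true;
            return true;
        }
    }

    private final Object lock = new Object();
    private ReadView currentView;

    // The only place a view is ever released: right before its replacement is
    // created, so stale references held by old netty threads stay harmless.
    ReadView createReadView() {
        synchronized (lock) {
            if (currentView != null) {
                currentView.release();
            }
            currentView = new ReadView();
            return currentView;
        }
    }

    public static void main(String[] args) {
        ApproximateSubpartitionSketch sub = new ApproximateSubpartitionSketch();
        ReadView v1 = sub.createReadView();
        ReadView v2 = sub.createReadView(); // releases v1 internally
        System.out.println(v1.release());   // false: v1 was already released
        System.out.println(v2.release());   // true: first release of v2
    }
}
```

Because release happens only inside `createReadView()`, a slow thread holding a stale `v1` reference can never undo state written by the thread that created `v2`.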
[GitHub] [flink] tillrohrmann commented on a change in pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
tillrohrmann commented on a change in pull request #13725: URL: https://github.com/apache/flink/pull/13725#discussion_r509929799 ## File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java ## @@ -170,7 +174,6 @@ public void nodeChanged() throws Exception { notifyIfNewLeaderAddress(leaderAddress, leaderSessionID); } catch (Exception e) { leaderListener.handleError(new Exception("Could not handle node changed event.", e)); - throw e; Review comment: Yes, that is the reasoning. The `leaderListener.handleError` should handle all exceptions and usually terminate the process. But let me actually add `ExceptionUtils.checkInterrupted(e);` here again. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13740: [FLINK-19758][filesystem] Implement a new unified file sink based on the new sink API
flinkbot edited a comment on pull request #13740: URL: https://github.com/apache/flink/pull/13740#issuecomment-714266782 ## CI report: * af69ad89405936040b55d54acb0ef18e6141e5ad Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8078) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] XComp commented on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI
XComp commented on pull request #13458: URL: https://github.com/apache/flink/pull/13458#issuecomment-714288781 Thanks, @gm7y8 . It looks good from my side. @vthinkxie: Could you have a brief look over the frontend code diff? I guess there have been some changes after your last review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] tillrohrmann commented on a change in pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
tillrohrmann commented on a change in pull request #13725: URL: https://github.com/apache/flink/pull/13725#discussion_r509935113 ## File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java ## @@ -203,6 +207,11 @@ protected void handleStateChange(ConnectionState newState) { } } + private void onReconnectedConnectionState() { + // check whether we find some new leader information in ZooKeeper + retrieveLeaderInformationFromZooKeeper(); Review comment: I think you are right that in some cases we will report a stale leader here. However, once the asynchronous fetch from the `NodeCache` completes, the listener should be notified about the new leader. What happens in the meantime is that the listener will try to connect to the old leader which should either be gone or reject all connection attempts since he is no longer the leader. The problem `notifyLossOfLeader` tried to solve is that a listener thinks that a stale leader is still the leader and, thus, continues working for it w/o questioning it (e.g. check with the leader) until the connection to the leader times out. With the `notifyLossOfLeader` change, once the retrieval service loses connection to ZooKeeper, it will tell the listener that the current leader is no longer valid. This will tell the listener to stop working for this leader (e.g. cancelling all tasks, disconnecting from it, etc.). If the listener should shortly after be told that the old leader is still the leader because of stale information, then it will first try to connect to the leader which will fail (assuming that the old leader is indeed no longer the leader) before it starts doing work for the leader. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot commented on pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm
flinkbot commented on pull request #13742: URL: https://github.com/apache/flink/pull/13742#issuecomment-714288929 Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review. ## Automated Checks Last check on commit 37cba38abc15a3ee7d1193644be564c0c75ac4c1 (Thu Oct 22 07:22:28 UTC 2020) **Warnings:** * No documentation files were touched! Remember to keep the Flink docs up to date! Mention the bot in a comment to re-run the automated checks. ## Review Progress * ❓ 1. The [description] looks good. * ❓ 2. There is [consensus] that the contribution should go into to Flink. * ❓ 3. Needs [attention] from. * ❓ 4. The change fits into the overall [architecture]. * ❓ 5. Overall code [quality] is good. Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands The @flinkbot bot supports the following commands: - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`) - `@flinkbot approve all` to approve all aspects - `@flinkbot approve-until architecture` to approve everything until `architecture` - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention - `@flinkbot disapprove architecture` to remove an approval you gave earlier This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (FLINK-17159) ES6 ElasticsearchSinkITCase unstable
[ https://issues.apache.org/jira/browse/FLINK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger reassigned FLINK-17159: -- Assignee: (was: Robert Metzger) > ES6 ElasticsearchSinkITCase unstable > > > Key: FLINK-17159 > URL: https://issues.apache.org/jira/browse/FLINK-17159 > Project: Flink > Issue Type: Bug > Components: Connectors / ElasticSearch, Tests >Affects Versions: 1.11.0, 1.12.0 >Reporter: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.12.0, 1.11.3 > > > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7482=logs=64110e28-73be-50d7-9369-8750330e0bf1=aa84fb9a-59ae-5696-70f7-011bc086e59b] > {code:java} > 2020-04-15T02:37:04.4289477Z [ERROR] > testElasticsearchSinkWithSmile(org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase) > Time elapsed: 0.145 s <<< ERROR! > 2020-04-15T02:37:04.4290310Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2020-04-15T02:37:04.4290790Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147) > 2020-04-15T02:37:04.4291404Z at > org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:659) > 2020-04-15T02:37:04.4291956Z at > org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:77) > 2020-04-15T02:37:04.4292548Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1643) > 2020-04-15T02:37:04.4293254Z at > org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticSearchSinkTest(ElasticsearchSinkTestBase.java:128) > 2020-04-15T02:37:04.4293990Z at > org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticsearchSinkSmileTest(ElasticsearchSinkTestBase.java:106) > 2020-04-15T02:37:04.4295096Z at > org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase.testElasticsearchSinkWithSmile(ElasticsearchSinkITCase.java:45) > 2020-04-15T02:37:04.4295923Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-04-15T02:37:04.4296489Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-04-15T02:37:04.4297076Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-04-15T02:37:04.4297513Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-04-15T02:37:04.4297951Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-04-15T02:37:04.4298688Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-04-15T02:37:04.4299374Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-04-15T02:37:04.4300069Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-04-15T02:37:04.4300960Z at > 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-04-15T02:37:04.4301705Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-04-15T02:37:04.4302204Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-04-15T02:37:04.4302661Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-04-15T02:37:04.4303234Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-04-15T02:37:04.4303706Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-04-15T02:37:04.4304127Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-04-15T02:37:04.4304716Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-04-15T02:37:04.4305394Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-04-15T02:37:04.4305965Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-04-15T02:37:04.4306425Z at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > 2020-04-15T02:37:04.4306942Z at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > 2020-04-15T02:37:04.4307466Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-15T02:37:04.4307920Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-15T02:37:04.4308375Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-15T02:37:04.4308782Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-04-15T02:37:04.4309182Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) >
[GitHub] [flink] tillrohrmann commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API
tillrohrmann commented on a change in pull request #13644: URL: https://github.com/apache/flink/pull/13644#discussion_r509945969 ## File path: flink-kubernetes/src/test/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderRetrievalServiceTest.java ## @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.kubernetes.highavailability; + +import org.apache.flink.kubernetes.kubeclient.FlinkKubeClient; +import org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMap; +import org.apache.flink.kubernetes.utils.Constants; + +import org.junit.Test; + +import java.util.Collections; +import java.util.concurrent.TimeUnit; + +import static org.hamcrest.Matchers.is; +import static org.hamcrest.Matchers.notNullValue; +import static org.hamcrest.Matchers.nullValue; +import static org.junit.Assert.assertThat; + +/** + * Tests for the {@link KubernetesLeaderRetrievalService}. + */ +public class KubernetesLeaderRetrievalServiceTest extends KubernetesHighAvailabilityTestBase { Review comment: Shouldn't these tests be part of the commits where we add the K8s services? That way it will be a bit easier to review them for completeness. 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot commented on pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…
flinkbot commented on pull request #13743: URL: https://github.com/apache/flink/pull/13743#issuecomment-714308277 Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review. ## Automated Checks Last check on commit 60d9d786803903b37709700a16c01bc6f7bef003 (Thu Oct 22 07:59:10 UTC 2020) **Warnings:** * No documentation files were touched! Remember to keep the Flink docs up to date! Mention the bot in a comment to re-run the automated checks. ## Review Progress * ❓ 1. The [description] looks good. * ❓ 2. There is [consensus] that the contribution should go into to Flink. * ❓ 3. Needs [attention] from. * ❓ 4. The change fits into the overall [architecture]. * ❓ 5. Overall code [quality] is good. Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands The @flinkbot bot supports the following commands: - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`) - `@flinkbot approve all` to approve all aspects - `@flinkbot approve-until architecture` to approve everything until `architecture` - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention - `@flinkbot disapprove architecture` to remove an approval you gave earlier This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] shizhengchao commented on pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…
shizhengchao commented on pull request #13743: URL: https://github.com/apache/flink/pull/13743#issuecomment-714308295 Hi @wuchong, this is the hotfix for release-1.11, could you help me review this :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-18811) if a disk is damaged, taskmanager should choose another disk for temp dir , rather than throw an IOException, which causes flink job restart over and over again
[ https://issues.apache.org/jira/browse/FLINK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218843#comment-17218843 ] Timo Walther commented on FLINK-18811: -- [~pnowojski] could you find someone to help reviewing this contribution? > if a disk is damaged, taskmanager should choose another disk for temp dir, > rather than throw an IOException, which causes flink job restart over and > over again > > > Key: FLINK-18811 > URL: https://issues.apache.org/jira/browse/FLINK-18811 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network > Environment: flink-1.10 >Reporter: Kai Chen >Priority: Major > Labels: pull-request-available > Attachments: flink_disk_error.png > > > I met this Exception when a hard disk was damaged: > !flink_disk_error.png! > I checked the code and found that flink will create a temp file when Record > length > 5 MB: > > {code:java} > // SpillingAdaptiveSpanningRecordDeserializer.java > if (nextRecordLength > THRESHOLD_FOR_SPILLING) { >// create a spilling channel and put the data there >this.spillingChannel = createSpillingChannel(); >ByteBuffer toWrite = partial.segment.wrap(partial.position, numBytesChunk); >FileUtils.writeCompletely(this.spillingChannel, toWrite); > } > {code} > The tempDir is randomly picked from all `tempDirs`. In yarn mode, one > `tempDir` usually represents one hard disk. > > In my opinion, if a hard disk is damaged, taskmanager should pick another > disk (tmpDir) for the Spilling Channel, rather than throw an IOException, which > causes the flink job to restart over and over again. 
> If we could just change "SpillingAdaptiveSpanningRecordDeserializer" like > this: > {code:java} > // SpillingAdaptiveSpanningRecordDeserializer.java > private FileChannel createSpillingChannel() throws IOException { >if (spillFile != null) { > throw new IllegalStateException("Spilling file already exists."); >} >// try to find a unique file name for the spilling channel >int maxAttempts = 10; >String[] tempDirs = this.tempDirs; >for (int attempt = 0; attempt < maxAttempts; attempt++) { > int dirIndex = rnd.nextInt(tempDirs.length); > String directory = tempDirs[dirIndex]; > spillFile = new File(directory, randomString(rnd) + ".inputchannel"); > try { > if (spillFile.createNewFile()) { > return new RandomAccessFile(spillFile, "rw").getChannel(); > } > } catch (IOException e) { > // if there is no tempDir left to try > if (tempDirs.length <= 1) { > throw e; > } > LOG.warn("Caught an IOException when creating spill file: " + > directory + ". Attempt " + attempt, e); > tempDirs = (String[]) ArrayUtils.remove(tempDirs, dirIndex); > } >} >throw new IOException("Could not create spilling file after " + maxAttempts + " attempts."); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
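The fallback proposed in the issue can be sketched as a standalone method. This is an illustrative sketch only: `SpillDirFallback` and `createSpillFile` are made-up names, not Flink's actual API, and the real deserializer also tracks state such as `spillFile`.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class SpillDirFallback {
    private static final Random RND = new Random();

    /**
     * Creates a spill file, randomly picking a temp directory. If file creation
     * fails with an IOException (e.g. a damaged disk), that directory is dropped
     * from the candidate list and another one is tried.
     */
    static File createSpillFile(String[] tempDirs, int maxAttempts) throws IOException {
        List<String> candidates = new ArrayList<>(Arrays.asList(tempDirs));
        IOException lastError = null;
        for (int attempt = 0; attempt < maxAttempts && !candidates.isEmpty(); attempt++) {
            int dirIndex = RND.nextInt(candidates.size());
            File spillFile = new File(candidates.get(dirIndex), "spill-" + Math.abs(RND.nextLong()) + ".inputchannel");
            try {
                if (spillFile.createNewFile()) {
                    return spillFile;
                }
            } catch (IOException e) {
                // likely a bad disk: remember the error, drop this directory, retry another
                lastError = e;
                candidates.remove(dirIndex);
            }
        }
        throw lastError != null ? lastError : new IOException("No spill directory available");
    }
}
```

With one unreadable path and one healthy directory in the array, the method ends up creating the file in the healthy directory instead of failing the job.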
[jira] [Resolved] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-19401. - Resolution: Fixed > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Assignee: Roman Khachatryan >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. 
> 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > 
org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at > org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.lambda$allocateSlotsFor$0(DefaultExecutionSlotAllocator.java:104) > at >
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218851#comment-17218851 ] Arvid Heise commented on FLINK-19401: - Merged into release-1.11 as bff79f5efffccd1793e09a5e08d0ceb9fe90cf2 Merged into release-1.10 as 88c06e38d4eff06ac09e4141b988f9a561f286a4 Closing as resolved. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Assignee: Roman Khachatryan >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. 
> 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > 
org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at >
[jira] [Closed] (FLINK-19136) MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the predicate within the allowed time"
[ https://issues.apache.org/jira/browse/FLINK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger closed FLINK-19136. -- Resolution: Cannot Reproduce > MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the > predicate within the allowed time" > > > Key: FLINK-19136 > URL: https://issues.apache.org/jira/browse/FLINK-19136 > Project: Flink > Issue Type: Bug > Components: Runtime / Metrics >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Robert Metzger >Priority: Major > Labels: test-stability > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6179=logs=9fca669f-5c5f-59c7-4118-e31c641064f0=91bf6583-3fb2-592f-e4d4-d79d79c3230a] > {code} > 2020-09-03T23:33:18.3687261Z [ERROR] > testReporter(org.apache.flink.metrics.tests.MetricsAvailabilityITCase) Time > elapsed: 15.217 s <<< ERROR! > 2020-09-03T23:33:18.3698260Z java.util.concurrent.ExecutionException: > org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not > satisfy the predicate within the allowed time. 
> 2020-09-03T23:33:18.3698749Z at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > 2020-09-03T23:33:18.3699163Z at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) > 2020-09-03T23:33:18.3699754Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.fetchMetric(MetricsAvailabilityITCase.java:162) > 2020-09-03T23:33:18.3700234Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.checkJobManagerMetricAvailability(MetricsAvailabilityITCase.java:116) > 2020-09-03T23:33:18.3700726Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.testReporter(MetricsAvailabilityITCase.java:101) > 2020-09-03T23:33:18.3701097Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-09-03T23:33:18.3701425Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-09-03T23:33:18.3701798Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-09-03T23:33:18.3702146Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-09-03T23:33:18.3702471Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-09-03T23:33:18.3702866Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-09-03T23:33:18.3703253Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-09-03T23:33:18.3703621Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-09-03T23:33:18.3703997Z at > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-09-03T23:33:18.3704339Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-09-03T23:33:18.3704629Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-09-03T23:33:18.3704940Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-09-03T23:33:18.3705354Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-09-03T23:33:18.3705725Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-09-03T23:33:18.3706072Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-09-03T23:33:18.3706397Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-09-03T23:33:18.3706714Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-09-03T23:33:18.3707044Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-09-03T23:33:18.3707373Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-09-03T23:33:18.3707708Z at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > 2020-09-03T23:33:18.3708073Z at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > 2020-09-03T23:33:18.3708410Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-09-03T23:33:18.3708691Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2020-09-03T23:33:18.3708976Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2020-09-03T23:33:18.3709273Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-09-03T23:33:18.3709579Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-09-03T23:33:18.3709910Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-09-03T23:33:18.3710242Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) >
[jira] [Commented] (FLINK-19136) MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the predicate within the allowed time"
[ https://issues.apache.org/jira/browse/FLINK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218869#comment-17218869 ] Robert Metzger commented on FLINK-19136: The logs of TM and JM both do not look suspicious. I'm closing this as "Cannot Reproduce" for now. > MetricsAvailabilityITCase.testReporter failed with "Could not satisfy the > predicate within the allowed time" > > > Key: FLINK-19136 > URL: https://issues.apache.org/jira/browse/FLINK-19136 > Project: Flink > Issue Type: Bug > Components: Runtime / Metrics >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Robert Metzger >Priority: Major > Labels: test-stability > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6179=logs=9fca669f-5c5f-59c7-4118-e31c641064f0=91bf6583-3fb2-592f-e4d4-d79d79c3230a] > {code} > 2020-09-03T23:33:18.3687261Z [ERROR] > testReporter(org.apache.flink.metrics.tests.MetricsAvailabilityITCase) Time > elapsed: 15.217 s <<< ERROR! > 2020-09-03T23:33:18.3698260Z java.util.concurrent.ExecutionException: > org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not > satisfy the predicate within the allowed time. 
> 2020-09-03T23:33:18.3698749Z at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > 2020-09-03T23:33:18.3699163Z at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) > 2020-09-03T23:33:18.3699754Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.fetchMetric(MetricsAvailabilityITCase.java:162) > 2020-09-03T23:33:18.3700234Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.checkJobManagerMetricAvailability(MetricsAvailabilityITCase.java:116) > 2020-09-03T23:33:18.3700726Z at > org.apache.flink.metrics.tests.MetricsAvailabilityITCase.testReporter(MetricsAvailabilityITCase.java:101) > 2020-09-03T23:33:18.3701097Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-09-03T23:33:18.3701425Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-09-03T23:33:18.3701798Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-09-03T23:33:18.3702146Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-09-03T23:33:18.3702471Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-09-03T23:33:18.3702866Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-09-03T23:33:18.3703253Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-09-03T23:33:18.3703621Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-09-03T23:33:18.3703997Z at > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-09-03T23:33:18.3704339Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-09-03T23:33:18.3704629Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-09-03T23:33:18.3704940Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-09-03T23:33:18.3705354Z at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-09-03T23:33:18.3705725Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-09-03T23:33:18.3706072Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-09-03T23:33:18.3706397Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-09-03T23:33:18.3706714Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-09-03T23:33:18.3707044Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-09-03T23:33:18.3707373Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-09-03T23:33:18.3707708Z at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > 2020-09-03T23:33:18.3708073Z at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > 2020-09-03T23:33:18.3708410Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-09-03T23:33:18.3708691Z at > org.junit.runners.Suite.runChild(Suite.java:128) > 2020-09-03T23:33:18.3708976Z at > org.junit.runners.Suite.runChild(Suite.java:27) > 2020-09-03T23:33:18.3709273Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-09-03T23:33:18.3709579Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-09-03T23:33:18.3709910Z at >
[GitHub] [flink] flinkbot edited a comment on pull request #13721: [FLINK-19694][table] Support Upsert ChangelogMode for ScanTableSource
flinkbot edited a comment on pull request #13721: URL: https://github.com/apache/flink/pull/13721#issuecomment-713489063 ## CI report: * fa0c87c737cee61fe6fd5b6655916eb97d18aaeb Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8072) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery
curcur edited a comment on pull request #13648: URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577 > * Visibility in normal case: none of the fields written in `releaseView` are `volatile`. So in normal case (`t1:release` then `t2:createReadView`) `t2` can see some inconsistent state. For example, `readView == null`, but `isPartialBufferCleanupRequired == false`. Right? > Maybe call `releaseView()` from `createReadView()` unconditionally? Aren't they guarded by the synchronization block? > * Overwrites when release is slow: won't `t1` overwrite changes to `PipelinedSubpartition` made already by `t2`? For example, reset `sequenceNumber` after `t2` has sent some data? > Maybe `PipelinedSubpartition.readerView` should be `AtomicReference` and then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it? I think this is the same question I answered in the write-up above. In short, it won't be possible, because a view can only be released once and this is guarded by the release flag of the view, details quoted below. - What if netty thread1 releases the view after netty thread2 recreates the view? Thread2 releases the view that thread1 holds the reference on before creating a new view. Thread1 cannot release the old view (through the view reference) again afterwards, since a view can only be released once. I actually have an idea to simplify this whole model: **If we only release before creation and nowhere else, the whole threading interaction model would be greatly simplified. That is, only the new netty thread can release the view** This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13678: [FLINK-19586] Add stream committer operators for new Sink API
flinkbot edited a comment on pull request #13678: URL: https://github.com/apache/flink/pull/13678#issuecomment-711439155 ## CI report: * Unknown: [CANCELED](TBD) * 463b5c8ed21f93caaeb7b938aa9e72abb35619b2 UNKNOWN * 2ba5bda5c5a15c370c231e1cedfa03fd0bc0b41b Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8076) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] curcur edited a comment on pull request #13648: [FLINK-19632] Introduce a new ResultPartitionType for Approximate Local Recovery
curcur edited a comment on pull request #13648: URL: https://github.com/apache/flink/pull/13648#issuecomment-714211577 > * Visibility in normal case: none of the fields written in `releaseView` are `volatile`. So in normal case (`t1:release` then `t2:createReadView`) `t2` can see some inconsistent state. For example, `readView == null`, but `isPartialBufferCleanupRequired == false`. Right? > Maybe call `releaseView()` from `createReadView()` unconditionally? Aren't they guarded by the synchronization block? > * Overwrites when release is slow: won't `t1` overwrite changes to `PipelinedSubpartition` made already by `t2`? For example, reset `sequenceNumber` after `t2` has sent some data? > Maybe `PipelinedSubpartition.readerView` should be `AtomicReference` and then we can guard `PipelinedApproximateSubpartition.releaseView()` by CAS on it? I think this is the same question I answered in the write-up above. In short, it won't be possible, because a view can only be released once and this is guarded by the release flag of the view, details quoted below. - What if netty thread1 releases the view after netty thread2 recreates the view? Thread2 releases the view that thread1 holds the reference on before creating a new view. Thread1 cannot release the old view (through the view reference) again afterwards, since a view can only be released once, and this is guarded by the lock. I actually have an idea to simplify this whole model: **If we only release before creation and nowhere else, the whole threading interaction model would be greatly simplified. That is, only the new netty thread can release the view** This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
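The release-once invariant discussed in the comment above can be sketched in miniature. The class and method names below (`ReadView`, `Subpartition`) are illustrative stand-ins, not Flink's actual `PipelinedSubpartitionView` API; the point is only that a CAS-guarded flag makes a second release attempt through a stale reference a no-op.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** A read view that can be released at most once. */
class ReadView {
    private final AtomicBoolean released = new AtomicBoolean(false);

    /** Returns true only for the first caller; later calls are rejected no-ops. */
    boolean release() {
        return released.compareAndSet(false, true);
    }

    boolean isReleased() {
        return released.get();
    }
}

class Subpartition {
    private final Object lock = new Object();
    private ReadView view;

    /** Release-before-create: the creating thread releases the previous view under the lock. */
    ReadView createReadView() {
        synchronized (lock) {
            if (view != null) {
                // idempotent: a stale reference held elsewhere cannot release it again
                view.release();
            }
            view = new ReadView();
            return view;
        }
    }
}
```

If thread1 still holds a reference to the old view after thread2 recreated it, thread1's late `release()` returns false and changes nothing, which matches the "a view can only be released once" argument.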
[GitHub] [flink] flinkbot edited a comment on pull request #13735: [FLINK-19533][checkpoint] Add channel state reassignment for unaligned checkpoints.
flinkbot edited a comment on pull request #13735: URL: https://github.com/apache/flink/pull/13735#issuecomment-714048414 ## CI report: * ee73393a13c999be626f7b8b98a170be88a093d9 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8077) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot commented on pull request #13740: [FLINK-19758][filesystem] Implement a new unified file sink based on the new sink API
flinkbot commented on pull request #13740: URL: https://github.com/apache/flink/pull/13740#issuecomment-714266782 ## CI report: * af69ad89405936040b55d54acb0ef18e6141e5ad UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] pnowojski opened a new pull request #13741: [FLINK-19680][checkpointing] Announce timeoutable CheckpointBarriers
pnowojski opened a new pull request #13741: URL: https://github.com/apache/flink/pull/13741 This PR adds `CheckpointBarrierAnnouncement` messages ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (**yes** / no) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (**yes** / no) - If yes, how is the feature documented? (not applicable / **docs** / **JavaDocs** / not documented) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-19680) Add CheckpointBarrierAnnouncement priority message
[ https://issues.apache.org/jira/browse/FLINK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-19680: --- Labels: pull-request-available (was: ) > Add CheckpointBarrierAnnouncement priority message > -- > > Key: FLINK-19680 > URL: https://issues.apache.org/jira/browse/FLINK-19680 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing >Affects Versions: 1.12.0 >Reporter: Piotr Nowojski >Assignee: Piotr Nowojski >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > When an aligned checkpoint barrier arrives in an input channel, we should > enqueue a priority event at the head of the queue, announcing the arrival of the > checkpoint barrier. > Such an announcement event should be propagated to the > `CheckpointBarrierHandler`. Based on that, `CheckpointBarrierHandler` will be > able to implement the logic of timing out from an aligned checkpoint to an > unaligned checkpoint. -- This message was sent by Atlassian Jira (v8.3.4#803005)
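The priority-enqueueing described in the issue (an announcement jumping ahead of already-buffered data in the channel queue) can be sketched as follows. `ChannelQueue` and the string events are illustrative stand-ins, not Flink's actual buffer queue classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ChannelQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    /** Regular data buffers are appended at the tail. */
    void add(String event) {
        queue.addLast(event);
    }

    /** A priority event (e.g. a barrier announcement) is inserted at the head. */
    void addPriority(String event) {
        queue.addFirst(event);
    }

    String poll() {
        return queue.pollFirst();
    }
}
```

A consumer polling the queue thus sees the announcement before the buffered records that arrived earlier, which is what lets the barrier handler start its alignment timeout early.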
[jira] [Commented] (FLINK-19761) Add lookup method for registered ShuffleDescriptor in ShuffleMaster
[ https://issues.apache.org/jira/browse/FLINK-19761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218826#comment-17218826 ] Xuannan Su commented on FLINK-19761: cc [~trohrmann] [~azagrebin] [~zjwang] I'd love to hear your thoughts on this. > Add lookup method for registered ShuffleDescriptor in ShuffleMaster > --- > > Key: FLINK-19761 > URL: https://issues.apache.org/jira/browse/FLINK-19761 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network >Reporter: Xuannan Su >Priority: Major > > Currently, the ShuffleMaster can register a partition and get the shuffle > descriptor. However, it lacks the ability to look up the registered > ShuffleDescriptors that belong to an IntermediateResult by the > IntermediateDataSetID. > Adding the lookup method to the ShuffleMaster can make reusing the cluster > partition much easier. For example, we don't have to return the > ShuffleDescriptor to the client just so that the other job can somehow encode > the ShuffleDescriptor in the JobGraph to consume the cluster partition. > Instead, we only need to return the IntermediateDataSetID, and another job can > use it to look up the > ShuffleDescriptor. > By adding the lookup method in the ShuffleMaster, if we have an external shuffle > service and the lifecycle of the IntermediateResult is not bound to the > cluster, a job running on another cluster can look up the ShuffleDescriptor and reuse the > IntermediateResult even if the cluster that produced the IntermediateResult has been shut down. -- This message was sent by Atlassian Jira (v8.3.4#803005)
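A minimal sketch of the proposed lookup. Plain strings stand in for Flink's `IntermediateDataSetID` and `ShuffleDescriptor` types, and the method names are assumptions for illustration, not the real `ShuffleMaster` interface:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hedged sketch: registration stores the descriptor under the ID of the
// IntermediateResult it belongs to, and a consuming job can later resolve
// the descriptor from the ID alone, without the descriptor ever having to
// travel through the client or be encoded into the JobGraph.
public class ShuffleRegistrySketch {

    private final Map<String, String> descriptorsByDataSetId = new HashMap<>();

    // Stand-in for partition registration on the ShuffleMaster.
    public void register(String intermediateDataSetId, String shuffleDescriptor) {
        descriptorsByDataSetId.put(intermediateDataSetId, shuffleDescriptor);
    }

    // The proposed addition: look up a registered descriptor by its
    // IntermediateDataSetID; empty if nothing was registered under that ID.
    public Optional<String> lookup(String intermediateDataSetId) {
        return Optional.ofNullable(descriptorsByDataSetId.get(intermediateDataSetId));
    }
}
```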
[GitHub] [flink] tillrohrmann commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API
tillrohrmann commented on a change in pull request #13644: URL: https://github.com/apache/flink/pull/13644#discussion_r509937977 ## File path: flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/FlinkKubeClient.java ## @@ -104,6 +106,67 @@ KubernetesWatch watchPodsAndDoCallback( Map labels, WatchCallbackHandler podCallbackHandler); + /** +* Create the ConfigMap with specified content. If the ConfigMap already exists, a FlinkRuntimeException will be +* thrown. +* +* @param configMap ConfigMap. +* +* @return Return the ConfigMap create future. +*/ + CompletableFuture createConfigMap(KubernetesConfigMap configMap); + + /** +* Get the ConfigMap with specified name. +* +* @param name ConfigMap name. +* +* @return Return the ConfigMap, or empty if the ConfigMap does not exist. +*/ + Optional getConfigMap(String name); + + /** +* Update an existing ConfigMap with the data. Benefit from https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions> +* resource version and combined with {@link #getConfigMap(String)}, we could perform a get-check-and-update +* transactional operation. Since concurrent modification could happen on a same ConfigMap, +* the update operation may fail. We need to retry internally. The max retry attempts could be +* configured via {@link org.apache.flink.kubernetes.configuration.KubernetesConfigOptions#KUBERNETES_TRANSACTIONAL_OPERATION_MAX_RETRIES}. +* +* @param configMapName ConfigMap to be replaced with. +* @param function Function to be applied to the obtained ConfigMap and get a new updated one. If the returned Review comment: Would we do these kind of calls in the update function of the configMap? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
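The get-check-and-update operation described in the Javadoc above can be sketched as follows. An `AtomicReference` stands in for the Kubernetes resource-version check; a real client would re-fetch the ConfigMap and retry when the API server rejects a stale resource version. All names here are illustrative.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Sketch of a transactional get-check-and-update loop with bounded retries,
// as described for FlinkKubeClient. compareAndSet only succeeds when no
// concurrent modification happened since the "get".
public class ConfigMapUpdateSketch {

    public static boolean tryUpdate(
            AtomicReference<String> configMap,
            UnaryOperator<String> function,
            int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            String current = configMap.get();                // "get"
            String updated = function.apply(current);        // compute new content
            if (configMap.compareAndSet(current, updated)) { // "check-and-update"
                return true;                                 // no concurrent change
            }
            // Another writer modified the ConfigMap in between; loop and retry.
        }
        // Gave up after the configured number of attempts
        // (KUBERNETES_TRANSACTIONAL_OPERATION_MAX_RETRIES in the PR).
        return false;
    }
}
```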
[jira] [Created] (FLINK-19763) Missing test MetricUtilsTest.testNonHeapMetricUsageNotStatic
Matthias created FLINK-19763: Summary: Missing test MetricUtilsTest.testNonHeapMetricUsageNotStatic Key: FLINK-19763 URL: https://issues.apache.org/jira/browse/FLINK-19763 Project: Flink Issue Type: Test Affects Versions: 1.11.2, 1.10.2 Reporter: Matthias Fix For: 1.12.0 We have tests for the heap and metaspace to check whether the metric is dynamically generated. The test for the non-heap space is missing. There was a test added in [296107e|https://github.com/apache/flink/commit/296107e] but reverted in [2d86256|https://github.com/apache/flink/commit/2d86256] as it appeared that the test was partially failing. We might want to add the test again, fixing the underlying issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] AHeise merged pull request #13734: (1.11 backport) [FLINK-19401][checkpointing] Download checkpoints only if needed
AHeise merged pull request #13734: URL: https://github.com/apache/flink/pull/13734
[jira] [Commented] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack
[ https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218858#comment-17218858 ] Arvid Heise commented on FLINK-19688: - Merged into master as 840e8af879e69c1bf9ad121b670a2703eb88b858. Closing issue as resolved. > Flink batch job fails because of InterruptedExceptions from network stack > - > > Key: FLINK-19688 > URL: https://issues.apache.org/jira/browse/FLINK-19688 > Project: Flink > Issue Type: Bug > Components: Runtime / Network, Runtime / Task >Affects Versions: 1.12.0 >Reporter: Robert Metzger >Assignee: Roman Khachatryan >Priority: Blocker > Fix For: 1.12.0 > > Attachments: logs.tgz > > > I have a benchmarking test job, that throws RuntimeExceptions at any operator > at a configured, random interval. When using low intervals, such as mean > failure rate = 60 s, the job will get into a state where it frequently fails > with InterruptedExceptions. > The same job does not have this problem on Flink 1.11.2 (at least not after > running the job for 15 hours, on 1.12-SN, it happens within a few minutes) > This is the job: > https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java > This is the exception: > {code} > 2020-10-16 16:02:15,653 WARN org.apache.flink.runtime.taskmanager.Task > [] - CHAIN GroupReduce (GroupReduce at > main(TPCHQuery3.java:199)) -> Map (Map at > appendMapper(KillerClientMapper.java:38)) (8/8)#1 > (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) > switched from RUNNING to FAILED. 
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce > (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at > appendMapper(KillerClientMapper.java:38))' , caused an error: > java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error > obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due > to an exception: Connection for partition > 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 > not reachable. > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) > [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) > [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222] > Caused by: org.apache.flink.util.WrappingRuntimeException: > java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error > obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due > to an exception: Connection for partition > 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 > not reachable. 
> at > org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > ... 4 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: Error obtaining the sorted input: Thread > 'SortMerger Reading Thread' terminated due to an exception: Connection for > partition > 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 > not reachable. > at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > ~[?:1.8.0_222] > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) > ~[?:1.8.0_222] > at > org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at > org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) > ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT] > at
[GitHub] [flink] flinkbot edited a comment on pull request #13741: [FLINK-19680][checkpointing] Announce timeoutable CheckpointBarriers
flinkbot edited a comment on pull request #13741: URL: https://github.com/apache/flink/pull/13741#issuecomment-714295796 ## CI report: * 8a68242a90fd180fcd0b4b4c60b5e1e49136c00f Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8082) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13742: [FLINK-19626][table-planner-blink] Introduce multi-input operator construction algorithm
flinkbot edited a comment on pull request #13742: URL: https://github.com/apache/flink/pull/13742#issuecomment-714296461 ## CI report: * 37cba38abc15a3ee7d1193644be564c0c75ac4c1 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8084)
[GitHub] [flink] flinkbot edited a comment on pull request #13736: [FLINK-19654][python][e2e] Reduce pyflink e2e test parallelism
flinkbot edited a comment on pull request #13736: URL: https://github.com/apache/flink/pull/13736#issuecomment-714183394 ## CI report: * 111ec785929a0742b46ea98408f585aa03314b1d Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8070)
[GitHub] [flink] AHeise merged pull request #13723: [FlINK-19688][network] Don't cache InterruptedExceptions in PartitionRequestClientFactory
AHeise merged pull request #13723: URL: https://github.com/apache/flink/pull/13723
[GitHub] [flink] flinkbot edited a comment on pull request #13727: [BP-1.10][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
flinkbot edited a comment on pull request #13727: URL: https://github.com/apache/flink/pull/13727#issuecomment-713572668 ## CI report: * c2ae5ba37a69ad1f40f585421d38edb30d40c0fa Travis: [FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/191431633) * 8e2aaf84d255adab8420800d5e7ec20d502868f4 UNKNOWN
[GitHub] [flink] AHeise commented on a change in pull request #13723: [FlINK-19688][network] Don't cache InterruptedExceptions in PartitionRequestClientFactory
AHeise commented on a change in pull request #13723: URL: https://github.com/apache/flink/pull/13723#discussion_r509969318 ## File path: flink-runtime/src/test/java/org/apache/flink/runtime/io/network/netty/PartitionRequestClientFactoryTest.java ## @@ -58,6 +59,38 @@ private static final int SERVER_PORT = NetUtils.getAvailablePort(); + @Test + public void testInterruptsNotCached() throws Exception { + ConnectionID connectionId = new ConnectionID(new InetSocketAddress(InetAddress.getLocalHost(), 8080), 0); Review comment: Okay if the test is not affected by that I'll merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (FLINK-14242) Collapse task names in job graph visualization if too long
[ https://issues.apache.org/jira/browse/FLINK-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Xie closed FLINK-14242. -- Resolution: Fixed > Collapse task names in job graph visualization if too long > -- > > Key: FLINK-14242 > URL: https://issues.apache.org/jira/browse/FLINK-14242 > Project: Flink > Issue Type: Improvement > Components: Runtime / Web Frontend >Affects Versions: 1.9.0 >Reporter: Paul Lin >Assignee: Yadong Xie >Priority: Minor > > For some complex jobs, especially SQL jobs, the task names are quite long > which makes the job graph hard to read. We could auto collapse these task > names if they exceed a certain length, and provide an uncollapse button for > the full task names. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] zhijiangW commented on a change in pull request #13595: [FLINK-19582][network] Introduce sort-merge based blocking shuffle to Flink
zhijiangW commented on a change in pull request #13595: URL: https://github.com/apache/flink/pull/13595#discussion_r509983577 ## File path: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/SortMergeResultPartition.java ## @@ -0,0 +1,360 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.runtime.io.network.partition; + +import org.apache.flink.core.memory.MemorySegment; +import org.apache.flink.core.memory.MemorySegmentFactory; +import org.apache.flink.runtime.event.AbstractEvent; +import org.apache.flink.runtime.io.disk.FileChannelManager; +import org.apache.flink.runtime.io.network.api.EndOfPartitionEvent; +import org.apache.flink.runtime.io.network.api.serialization.EventSerializer; +import org.apache.flink.runtime.io.network.buffer.Buffer; +import org.apache.flink.runtime.io.network.buffer.BufferCompressor; +import org.apache.flink.runtime.io.network.buffer.BufferPool; +import org.apache.flink.runtime.io.network.buffer.NetworkBuffer; +import org.apache.flink.util.function.SupplierWithException; + +import javax.annotation.Nullable; +import javax.annotation.concurrent.GuardedBy; +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.HashSet; +import java.util.Set; +import java.util.concurrent.CompletableFuture; + +import static org.apache.flink.runtime.io.network.buffer.Buffer.DataType; +import static org.apache.flink.runtime.io.network.partition.SortBuffer.BufferWithChannel; +import static org.apache.flink.util.Preconditions.checkElementIndex; +import static org.apache.flink.util.Preconditions.checkNotNull; +import static org.apache.flink.util.Preconditions.checkState; + +/** + * {@link SortMergeResultPartition} appends records and events to {@link SortBuffer} and after the {@link SortBuffer} + * is full, all data in the {@link SortBuffer} will be copied and spilled to a {@link PartitionedFile} in subpartition + * index order sequentially. Large records that can not be appended to an empty {@link SortBuffer} will be spilled to + * the {@link PartitionedFile} separately. 
+ */ +@NotThreadSafe +public class SortMergeResultPartition extends ResultPartition { + + private final Object lock = new Object(); + + /** All active readers which are consuming data from this result partition now. */ + @GuardedBy("lock") + private final Set readers = new HashSet<>(); + + /** {@link PartitionedFile} produced by this result partition. */ + @GuardedBy("lock") + private PartitionedFile resultFile; + + /** Used to generate random file channel ID. */ + private final FileChannelManager channelManager; + + /** Number of data buffers (excluding events) written for each subpartition. */ + private final int[] numDataBuffers; + + /** A piece of unmanaged memory for data writing. */ + private final MemorySegment writeBuffer; + + /** Size of network buffer and write buffer. */ + private final int networkBufferSize; + + /** Current {@link SortBuffer} to append records to. */ + private SortBuffer currentSortBuffer; + + /** File writer for this result partition. */ + private PartitionedFileWriter fileWriter; + + public SortMergeResultPartition( + String owningTaskName, + int partitionIndex, + ResultPartitionID partitionId, + ResultPartitionType partitionType, + int numSubpartitions, + int numTargetKeyGroups, + int networkBufferSize, + ResultPartitionManager partitionManager, + FileChannelManager channelManager, + @Nullable BufferCompressor bufferCompressor, + SupplierWithException bufferPoolFactory) { + + super( + owningTaskName, + partitionIndex, + partitionId, + partitionType, + numSubpartitions, + numTargetKeyGroups, +
[GitHub] [flink] flinkbot commented on pull request #13743: [FLINK-19629][Formats]Fix the NPE of avro format, when there is a nul…
flinkbot commented on pull request #13743: URL: https://github.com/apache/flink/pull/13743#issuecomment-714332320 ## CI report: * 60d9d786803903b37709700a16c01bc6f7bef003 UNKNOWN
[jira] [Comment Edited] (FLINK-19757) TimeStampData can cause time inconsistent problem
[ https://issues.apache.org/jira/browse/FLINK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218809#comment-17218809 ] Jark Wu edited comment on FLINK-19757 at 10/22/20, 6:55 AM: Hi [~zhoujira86], {{TimestampData}} is the internal data structure for both {{TIMESTAMP}} (= Java {{LocalDateTime}}) and {{TIMESTAMP WITH LOCAL TIME ZONE}} (= Java {{Instant}}). It always stores the epoch seconds since {{1970-01-01 00:00:00}}. It is always safe to convert without a time zone if both sides have the same logical type, but not correct if they are different logical types, e.g. {{Instant}} => {{TimestampData}} => {{LocalDateTime}} is not correct if no time zone is considered. Such conversions only happen inside Flink SQL internals, and are handled correctly there. The problem above is not in the {{TimestampData}} data structure; rather, {{PROCTIME()}} currently returns the {{TIMESTAMP}} type, which is not right: it should be {{TIMESTAMP WITH LOCAL TIME ZONE}}.
> TimeStampData can cause time inconsistent problem > - > > Key: FLINK-19757 > URL: https://issues.apache.org/jira/browse/FLINK-19757 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Runtime >Affects Versions: 1.11.1 >Reporter: xiaogang zhou >Priority: Major > Labels: pull-request-available > > When we check the JDK LocalDateTime code, we find that > > {code:java} > // code placeholder > public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, > ZoneOffset offset) { > Objects.requireNonNull(offset, "offset"); > NANO_OF_SECOND.checkValidValue(nanoOfSecond); > long localSecond = epochSecond + offset.getTotalSeconds(); // overflow > caught later > long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY); > int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY); > LocalDate date = LocalDate.ofEpochDay(localEpochDay); > LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + > nanoOfSecond); > return new LocalDateTime(date, time); > } > {code} > > In offset.getTotalSeconds() they add the offset, but in TimestampData's > toLocalDateTime we don't add an offset. > > I'd like to add a TimeZone.getDefault().getRawOffset() in > toLocalDateTime() > and subtract a TimeZone.getDefault().getRawOffset() in > fromLocalDateTime. -- This message was sent by Atlassian Jira (v8.3.4#803005)
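Jark's point in the comment above is easy to demonstrate with plain JDK code: the same epoch second maps to different `LocalDateTime` values depending on the offset applied, which is why a conversion chain like `Instant` => `TimestampData` => `LocalDateTime` is only safe when the time zone is accounted for.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Plain-JDK demonstration (no Flink code involved): converting an epoch
// value to a LocalDateTime without fixing the zone shifts the wall-clock
// time by whatever offset happens to be applied.
public class EpochOffsetDemo {
    public static void main(String[] args) {
        long epochSecond = 0L; // 1970-01-01T00:00:00Z
        LocalDateTime utc = LocalDateTime.ofEpochSecond(epochSecond, 0, ZoneOffset.UTC);
        LocalDateTime cst = LocalDateTime.ofEpochSecond(epochSecond, 0, ZoneOffset.ofHours(8));
        System.out.println(utc); // 1970-01-01T00:00
        System.out.println(cst); // 1970-01-01T08:00
    }
}
```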
[GitHub] [flink] wuchong commented on pull request #13653: [FLINK-17528][table] Remove RowData#get() and ArrayData#get() and use FieldGetter and ElementGetter instead
wuchong commented on pull request #13653: URL: https://github.com/apache/flink/pull/13653#issuecomment-714275903 @flinkbot run azure
[GitHub] [flink] azagrebin closed pull request #13547: [FLINK-14406][runtime] Exposes managed memory usage through the REST API
azagrebin closed pull request #13547: URL: https://github.com/apache/flink/pull/13547
[jira] [Commented] (FLINK-19554) Unified testing framework for connectors
[ https://issues.apache.org/jira/browse/FLINK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218837#comment-17218837 ] Robert Metzger commented on FLINK-19554: What's the difference of this testing framework compared to the one developed in FLINK-11463? I'm concerned that we end up with two competing frameworks, instead of focusing our resources on one that is really good. > Unified testing framework for connectors > > > Key: FLINK-19554 > URL: https://issues.apache.org/jira/browse/FLINK-19554 > Project: Flink > Issue Type: New Feature > Components: Connectors / Common, Test Infrastructure >Reporter: Qingsheng Ren >Assignee: Qingsheng Ren >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > As the community and eco-system of Flink growing up, more and more > Flink-owned and third party connectors are developed and added into Flink > community. In order to provide a standardized quality controlling for all > connectors, it's necessary to develop a unified connector testing framework > to simplify and standardize end-to-end test of connectors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] godfreyhe commented on a change in pull request #13696: [FLINK-19726][table] Implement new providers for blink planner
godfreyhe commented on a change in pull request #13696: URL: https://github.com/apache/flink/pull/13696#discussion_r509945463 ## File path: flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/runtime/operators/sink/SinkNotNullEnforcer.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.table.runtime.operators.sink; + +import org.apache.flink.api.common.functions.FilterFunction; +import org.apache.flink.table.api.TableException; +import org.apache.flink.table.api.config.ExecutionConfigOptions; +import org.apache.flink.table.api.config.ExecutionConfigOptions.NotNullEnforcer; +import org.apache.flink.table.data.RowData; + +/** + * Checks writing null values into NOT NULL columns. + */ +public class SinkNotNullEnforcer implements FilterFunction { Review comment: nit: add `serialVersionUID` ## File path: flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/runtime/operators/sink/SinkNotNullEnforcer.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.table.runtime.operators.sink; + +import org.apache.flink.api.common.functions.FilterFunction; +import org.apache.flink.table.api.TableException; +import org.apache.flink.table.api.config.ExecutionConfigOptions; +import org.apache.flink.table.api.config.ExecutionConfigOptions.NotNullEnforcer; +import org.apache.flink.table.data.RowData; + +/** + * Checks writing null values into NOT NULL columns. + */ +public class SinkNotNullEnforcer implements FilterFunction { + + private final NotNullEnforcer notNullEnforcer; + private final int[] notNullFieldIndices; + private final boolean notNullCheck; + private final String[] allFieldNames; + + public SinkNotNullEnforcer( + NotNullEnforcer notNullEnforcer, int[] notNullFieldIndices, String[] allFieldNames) { + this.notNullFieldIndices = notNullFieldIndices; + this.notNullEnforcer = notNullEnforcer; + this.notNullCheck = notNullFieldIndices.length > 0; + this.allFieldNames = allFieldNames; + } + + @Override + public boolean filter(RowData row) { + if (notNullCheck) { Review comment: nit: return `false` directly when notNullCheck is false, reduce nesting depth. This is an automated message from the Apache Git Service. 
[GitHub] [flink] vthinkxie commented on a change in pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI
vthinkxie commented on a change in pull request #13458: URL: https://github.com/apache/flink/pull/13458#discussion_r50994 ## File path: flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/detail/job-checkpoints-detail.component.html ## @@ -21,6 +21,8 @@ Path: {{checkPointDetail?.external_path || '-'}} Discarded: {{checkPointDetail?.discarded || '-'}} + +Checkpoint Type: {{checkPointDetail?.checkpoint_type === "CHECKPOINT" ? (checkPointConfig?.unaligned_checkpoints ? 'unaligned checkpoint' : 'aligned checkpoint') : (checkPointDetail?.checkpoint_type === "SYNC_SAVEPOINT" ? 'savepoint on cancel' : 'savepoint')|| '-'}} Review comment: Hi @gm7y8 it would be better to move this calculation into typescript file as a getter This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (FLINK-18117) "Kerberized YARN per-job on Docker test" fails with "Could not start hadoop cluster."
[ https://issues.apache.org/jira/browse/FLINK-18117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger reassigned FLINK-18117: -- Assignee: Robert Metzger > "Kerberized YARN per-job on Docker test" fails with "Could not start hadoop > cluster." > - > > Key: FLINK-18117 > URL: https://issues.apache.org/jira/browse/FLINK-18117 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.11.0, 1.12.0 >Reporter: Robert Metzger >Assignee: Robert Metzger >Priority: Critical > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2683=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5 > {code} > 2020-06-04T06:03:53.2844296Z Creating slave1 ... [32mdone[0m > 2020-06-04T06:03:53.4981251Z [1BWaiting for hadoop cluster to come up. We > have been trying for 0 seconds, retrying ... > 2020-06-04T06:03:58.5980181Z Waiting for hadoop cluster to come up. We have > been trying for 5 seconds, retrying ... > 2020-06-04T06:04:03.6997087Z Waiting for hadoop cluster to come up. We have > been trying for 10 seconds, retrying ... > 2020-06-04T06:04:08.7910791Z Waiting for hadoop cluster to come up. We have > been trying for 15 seconds, retrying ... > 2020-06-04T06:04:13.8921621Z Waiting for hadoop cluster to come up. We have > been trying for 20 seconds, retrying ... > 2020-06-04T06:04:18.9648844Z Waiting for hadoop cluster to come up. We have > been trying for 25 seconds, retrying ... > 2020-06-04T06:04:24.0381851Z Waiting for hadoop cluster to come up. We have > been trying for 31 seconds, retrying ... > 2020-06-04T06:04:29.1220264Z Waiting for hadoop cluster to come up. We have > been trying for 36 seconds, retrying ... > 2020-06-04T06:04:34.1882187Z Waiting for hadoop cluster to come up. We have > been trying for 41 seconds, retrying ... > 2020-06-04T06:04:39.2784948Z Waiting for hadoop cluster to come up. 
We have > been trying for 46 seconds, retrying ... > 2020-06-04T06:04:44.3843337Z Waiting for hadoop cluster to come up. We have > been trying for 51 seconds, retrying ... > 2020-06-04T06:04:49.4703561Z Waiting for hadoop cluster to come up. We have > been trying for 56 seconds, retrying ... > 2020-06-04T06:04:54.5463207Z Waiting for hadoop cluster to come up. We have > been trying for 61 seconds, retrying ... > 2020-06-04T06:04:59.6650405Z Waiting for hadoop cluster to come up. We have > been trying for 66 seconds, retrying ... > 2020-06-04T06:05:04.7500168Z Waiting for hadoop cluster to come up. We have > been trying for 71 seconds, retrying ... > 2020-06-04T06:05:09.8177904Z Waiting for hadoop cluster to come up. We have > been trying for 76 seconds, retrying ... > 2020-06-04T06:05:14.9751297Z Waiting for hadoop cluster to come up. We have > been trying for 81 seconds, retrying ... > 2020-06-04T06:05:20.0336417Z Waiting for hadoop cluster to come up. We have > been trying for 87 seconds, retrying ... > 2020-06-04T06:05:25.1627704Z Waiting for hadoop cluster to come up. We have > been trying for 92 seconds, retrying ... > 2020-06-04T06:05:30.2583315Z Waiting for hadoop cluster to come up. We have > been trying for 97 seconds, retrying ... > 2020-06-04T06:05:35.3283678Z Waiting for hadoop cluster to come up. We have > been trying for 102 seconds, retrying ... > 2020-06-04T06:05:40.4184029Z Waiting for hadoop cluster to come up. We have > been trying for 107 seconds, retrying ... > 2020-06-04T06:05:45.5388372Z Waiting for hadoop cluster to come up. We have > been trying for 112 seconds, retrying ... > 2020-06-04T06:05:50.6155334Z Waiting for hadoop cluster to come up. We have > been trying for 117 seconds, retrying ... > 2020-06-04T06:05:55.7225186Z Command: start_hadoop_cluster failed. Retrying... 
> 2020-06-04T06:05:55.7237999Z Starting Hadoop cluster > 2020-06-04T06:05:56.5188293Z kdc is up-to-date > 2020-06-04T06:05:56.5292716Z master is up-to-date > 2020-06-04T06:05:56.5301735Z slave2 is up-to-date > 2020-06-04T06:05:56.5306179Z slave1 is up-to-date > 2020-06-04T06:05:56.6800566Z Waiting for hadoop cluster to come up. We have > been trying for 0 seconds, retrying ... > 2020-06-04T06:06:01.7668291Z Waiting for hadoop cluster to come up. We have > been trying for 5 seconds, retrying ... > 2020-06-04T06:06:06.8620265Z Waiting for hadoop cluster to come up. We have > been trying for 10 seconds, retrying ... > 2020-06-04T06:06:11.9753596Z Waiting for hadoop cluster to come up. We have > been trying for 15 seconds, retrying ... > 2020-06-04T06:06:17.0402846Z Waiting for hadoop cluster to come up. We have > been trying for 21 seconds, retrying ... > 2020-06-04T06:06:22.1650005Z Waiting
[GitHub] [flink] rmetzger commented on pull request #13208: [FLINK-18676] [FileSystem] Bump s3 aws version
rmetzger commented on pull request #13208: URL: https://github.com/apache/flink/pull/13208#issuecomment-714316318 Maybe I'll take over this pull request. Since the 1.12 release is coming up, it would be nice to address this issue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-18676) Update version of aws to support use of default constructor of "WebIdentityTokenCredentialsProvider"
[ https://issues.apache.org/jira/browse/FLINK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-18676: --- Component/s: (was: Connectors / FileSystem) FileSystems > Update version of aws to support use of default constructor of > "WebIdentityTokenCredentialsProvider" > > > Key: FLINK-18676 > URL: https://issues.apache.org/jira/browse/FLINK-18676 > Project: Flink > Issue Type: Improvement > Components: FileSystems >Affects Versions: 1.11.0 >Reporter: Ravi Bhushan Ratnakar >Priority: Minor > Labels: pull-request-available > > *Background:* > I am using Flink 1.11.0 on kubernetes platform. To give access of aws > services to taskmanager/jobmanager, we are using "IAM Roles for Service > Accounts" . I have configured below property in flink-conf.yaml to use > credential provider. > fs.s3a.aws.credentials.provider: > com.amazonaws.auth.WebIdentityTokenCredentialsProvider > > *Issue:* > When taskmanager/jobmanager is starting up, during this it complains that > "WebIdentityTokenCredentialsProvider" doesn't have "public constructor" and > container doesn't come up. > > *Solution:* > Currently the above credential's class is being used from > "*flink-s3-fs-hadoop"* which gets "aws-java-sdk-core" dependency from > "*flink-s3-fs-base*". In *"flink-s3-fs-base",* version of aws is 1.11.754 . > The support of default constructor for "WebIdentityTokenCredentialsProvider" > is provided from aws version 1.11.788 and onward. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] zhijiangW commented on a change in pull request #13595: [FLINK-19582][network] Introduce sort-merge based blocking shuffle to Flink
zhijiangW commented on a change in pull request #13595: URL: https://github.com/apache/flink/pull/13595#discussion_r509989145 ## File path: flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/SortMergeResultPartition.java ## @@ -0,0 +1,360 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.runtime.io.network.partition; + +import org.apache.flink.core.memory.MemorySegment; +import org.apache.flink.core.memory.MemorySegmentFactory; +import org.apache.flink.runtime.event.AbstractEvent; +import org.apache.flink.runtime.io.disk.FileChannelManager; +import org.apache.flink.runtime.io.network.api.EndOfPartitionEvent; +import org.apache.flink.runtime.io.network.api.serialization.EventSerializer; +import org.apache.flink.runtime.io.network.buffer.Buffer; +import org.apache.flink.runtime.io.network.buffer.BufferCompressor; +import org.apache.flink.runtime.io.network.buffer.BufferPool; +import org.apache.flink.runtime.io.network.buffer.NetworkBuffer; +import org.apache.flink.util.function.SupplierWithException; + +import javax.annotation.Nullable; +import javax.annotation.concurrent.GuardedBy; +import javax.annotation.concurrent.NotThreadSafe; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.HashSet; +import java.util.Set; +import java.util.concurrent.CompletableFuture; + +import static org.apache.flink.runtime.io.network.buffer.Buffer.DataType; +import static org.apache.flink.runtime.io.network.partition.SortBuffer.BufferWithChannel; +import static org.apache.flink.util.Preconditions.checkElementIndex; +import static org.apache.flink.util.Preconditions.checkNotNull; +import static org.apache.flink.util.Preconditions.checkState; + +/** + * {@link SortMergeResultPartition} appends records and events to {@link SortBuffer} and after the {@link SortBuffer} + * is full, all data in the {@link SortBuffer} will be copied and spilled to a {@link PartitionedFile} in subpartition + * index order sequentially. Large records that can not be appended to an empty {@link SortBuffer} will be spilled to + * the {@link PartitionedFile} separately. 
+ */ +@NotThreadSafe +public class SortMergeResultPartition extends ResultPartition { + + private final Object lock = new Object(); + + /** All active readers which are consuming data from this result partition now. */ + @GuardedBy("lock") + private final Set readers = new HashSet<>(); + + /** {@link PartitionedFile} produced by this result partition. */ + @GuardedBy("lock") + private PartitionedFile resultFile; + + /** Used to generate random file channel ID. */ + private final FileChannelManager channelManager; + + /** Number of data buffers (excluding events) written for each subpartition. */ + private final int[] numDataBuffers; + + /** A piece of unmanaged memory for data writing. */ + private final MemorySegment writeBuffer; + + /** Size of network buffer and write buffer. */ + private final int networkBufferSize; + + /** Current {@link SortBuffer} to append records to. */ + private SortBuffer currentSortBuffer; + + /** File writer for this result partition. */ + private PartitionedFileWriter fileWriter; + + public SortMergeResultPartition( + String owningTaskName, + int partitionIndex, + ResultPartitionID partitionId, + ResultPartitionType partitionType, + int numSubpartitions, + int numTargetKeyGroups, + int networkBufferSize, + ResultPartitionManager partitionManager, + FileChannelManager channelManager, + @Nullable BufferCompressor bufferCompressor, + SupplierWithException bufferPoolFactory) { + + super( + owningTaskName, + partitionIndex, + partitionId, + partitionType, + numSubpartitions, + numTargetKeyGroups, +
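The class comment above describes the core idea of the sort-merge shuffle: append records to a `SortBuffer`, then spill them to a `PartitionedFile` in subpartition-index order so each subpartition's data ends up contiguous on disk. A minimal standalone sketch of that sort-then-spill step (the types and method below are simplified stand-ins, not Flink's implementation):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: records are buffered as (subpartitionIndex, payload) pairs; at spill
// time a stable sort by subpartition index groups each subpartition's records
// together while preserving their original per-subpartition order.
public class SortSpillSketch {

    public static List<String> spillOrder(List<Map.Entry<Integer, String>> buffered) {
        // List.sort is stable (TimSort), so records of the same subpartition
        // keep their append order.
        buffered.sort(Comparator.comparing(Map.Entry::getKey));
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : buffered) {
            out.add(e.getKey() + ":" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> buf = new ArrayList<>(Arrays.asList(
                new AbstractMap.SimpleEntry<>(1, "a"),
                new AbstractMap.SimpleEntry<>(0, "b"),
                new AbstractMap.SimpleEntry<>(1, "c")));
        System.out.println(spillOrder(buf)); // [0:b, 1:a, 1:c]
    }
}
```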
[jira] [Commented] (FLINK-19757) TimeStampData can cause time inconsistent problem
[ https://issues.apache.org/jira/browse/FLINK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218784#comment-17218784 ] xiaogang zhou commented on FLINK-19757: --- [~jark] the idea that TimestampData and LocalDateTime do not contain the zone offset is right, but when converting currentTimeMillis or a Timestamp to LocalDateTime, we should consider the time zone issue. The JDK code suggests so: {code:java} public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, ZoneOffset offset) { Objects.requireNonNull(offset, "offset"); NANO_OF_SECOND.checkValidValue(nanoOfSecond); long localSecond = epochSecond + offset.getTotalSeconds(); // overflow caught later long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY); int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY); LocalDate date = LocalDate.ofEpochDay(localEpochDay); LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + nanoOfSecond); return new LocalDateTime(date, time); } {code} > TimeStampData can cause time inconsistent problem > - > > Key: FLINK-19757 > URL: https://issues.apache.org/jira/browse/FLINK-19757 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Runtime > Affects Versions: 1.11.1 > Reporter: xiaogang zhou > Priority: Major > Labels: pull-request-available > > when we check the JDK LocalDateTime code, we find that > > {code:java} > public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, > ZoneOffset offset) { > Objects.requireNonNull(offset, "offset"); > NANO_OF_SECOND.checkValidValue(nanoOfSecond); > long localSecond = epochSecond + offset.getTotalSeconds(); // overflow > caught later > long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY); > int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY); > LocalDate date = LocalDate.ofEpochDay(localEpochDay); > LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + > nanoOfSecond); > return new LocalDateTime(date, time); > } > {code} > > offset.getTotalSeconds() adds the offset, but in TimeStampData's > toLocalDateTime we don't add an offset. > > I'd like to add a TimeZone.getDefault().getRawOffset() in > toLocalDateTime() > and subtract a TimeZone.getDefault().getRawOffset() in > fromLocalDateTime -- This message was sent by Atlassian Jira (v8.3.4#803005)
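The time-zone sensitivity discussed in this issue can be demonstrated directly: converting the same epoch instant to a `LocalDateTime` with two different `ZoneOffset` values yields different wall-clock times, which is why the offset chosen during conversion matters. A minimal standalone demonstration (plain JDK code, not Flink's `TimestampData`):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// The same instant (the epoch, 1970-01-01T00:00:00Z) interpreted at two
// different offsets produces two different local wall-clock times.
public class EpochToLocalDateTime {
    public static void main(String[] args) {
        long epochMilli = 0L;

        // No offset applied (UTC):
        LocalDateTime utc =
                LocalDateTime.ofEpochSecond(epochMilli / 1000, 0, ZoneOffset.UTC);

        // Same instant at a fixed +08:00 offset:
        LocalDateTime shifted =
                LocalDateTime.ofEpochSecond(epochMilli / 1000, 0, ZoneOffset.ofHours(8));

        System.out.println(utc);     // 1970-01-01T00:00
        System.out.println(shifted); // 1970-01-01T08:00
    }
}
```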
[GitHub] [flink] flinkbot edited a comment on pull request #13678: [FLINK-19586] Add stream committer operators for new Sink API
flinkbot edited a comment on pull request #13678: URL: https://github.com/apache/flink/pull/13678#issuecomment-711439155 ## CI report: * Unknown: [CANCELED](TBD) * 463b5c8ed21f93caaeb7b938aa9e72abb35619b2 UNKNOWN * 2ba5bda5c5a15c370c231e1cedfa03fd0bc0b41b UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI
flinkbot edited a comment on pull request #13458: URL: https://github.com/apache/flink/pull/13458#issuecomment-697041219 ## CI report: * 7311b0d12d19a645391ea0359a9aa6318806363b UNKNOWN * 79714f02f700aebe4a65dff017deba6dc9d911d9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7989) * 9bc87ad6b2a8cbad2f226c483c74f1fa1523b70e UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13653: [FLINK-17528][table] Remove RowData#get() and ArrayData#get() and use FieldGetter and ElementGetter instead
flinkbot edited a comment on pull request #13653: URL: https://github.com/apache/flink/pull/13653#issuecomment-709337488 ## CI report: * 0c876d6befc487412629e6a1e8883fa5fbeb31e6 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8067) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8071) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8080) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-19757) TimeStampData can cause time inconsistent problem
[ https://issues.apache.org/jira/browse/FLINK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218821#comment-17218821 ] xiaogang zhou commented on FLINK-19757: --- Hi [~jark], my point is that using LocalDateTime to represent TIMESTAMP (TIMESTAMP WITHOUT LOCAL TIME ZONE) is not very proper :) as ofEpochSecond takes an offset and the toString method returns a local time. Should we consider it more properly a TIMESTAMP WITH LOCAL TIME ZONE? > TimeStampData can cause time inconsistent problem > - > > Key: FLINK-19757 > URL: https://issues.apache.org/jira/browse/FLINK-19757 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Runtime > Affects Versions: 1.11.1 > Reporter: xiaogang zhou > Priority: Major > Labels: pull-request-available > > when we check the JDK LocalDateTime code, we find that > > {code:java} > public static LocalDateTime ofEpochSecond(long epochSecond, int nanoOfSecond, > ZoneOffset offset) { > Objects.requireNonNull(offset, "offset"); > NANO_OF_SECOND.checkValidValue(nanoOfSecond); > long localSecond = epochSecond + offset.getTotalSeconds(); // overflow > caught later > long localEpochDay = Math.floorDiv(localSecond, SECONDS_PER_DAY); > int secsOfDay = (int)Math.floorMod(localSecond, SECONDS_PER_DAY); > LocalDate date = LocalDate.ofEpochDay(localEpochDay); > LocalTime time = LocalTime.ofNanoOfDay(secsOfDay * NANOS_PER_SECOND + > nanoOfSecond); > return new LocalDateTime(date, time); > } > {code} > > offset.getTotalSeconds() adds the offset, but in TimeStampData's > toLocalDateTime we don't add an offset. > > I'd like to add a TimeZone.getDefault().getRawOffset() in > toLocalDateTime() > and subtract a TimeZone.getDefault().getRawOffset() in > fromLocalDateTime -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-19761) Add lookup method for registered ShuffleDescriptor in ShuffleMaster
Xuannan Su created FLINK-19761: -- Summary: Add lookup method for registered ShuffleDescriptor in ShuffleMaster Key: FLINK-19761 URL: https://issues.apache.org/jira/browse/FLINK-19761 Project: Flink Issue Type: Improvement Components: Runtime / Network Reporter: Xuannan Su Currently, the ShuffleMaster can register a partition and get the shuffle descriptor. However, it lacks the ability to look up the registered ShuffleDescriptors belonging to an IntermediateResult by the IntermediateDataSetID. Adding a lookup method to the ShuffleMaster makes reusing the cluster partition easier. For example, we don't have to return the ShuffleDescriptor to the client just so that another job can somehow encode the ShuffleDescriptor in the JobGraph to consume the cluster partition. Instead, we only need to return the IntermediateDataSetID, which another job can then use to look up the ShuffleDescriptor. With a lookup method in ShuffleMaster, if we have an external shuffle service and the lifecycle of the IntermediateResult is not bound to the cluster, a job running on another cluster can look up the ShuffleDescriptor and reuse the IntermediateResult even after the cluster that produced it has been shut down. -- This message was sent by Atlassian Jira (v8.3.4#803005)
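The proposed lookup amounts to keeping a registry keyed by the intermediate result's ID. A hypothetical in-memory sketch (the class, field, and method names below are simplified stand-ins, not Flink's actual ShuffleMaster API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: registration stores each descriptor under its dataset ID, and the
// proposed lookup returns all descriptors of an IntermediateResult by that ID,
// so a consuming job only needs the ID rather than the descriptors themselves.
public class ShuffleRegistrySketch {

    public static class ShuffleDescriptor {
        public final String partitionId;
        public ShuffleDescriptor(String partitionId) { this.partitionId = partitionId; }
    }

    private final Map<String, List<ShuffleDescriptor>> byDataSet = new HashMap<>();

    public void register(String dataSetId, ShuffleDescriptor descriptor) {
        byDataSet.computeIfAbsent(dataSetId, k -> new ArrayList<>()).add(descriptor);
    }

    // The lookup method this issue proposes to add.
    public List<ShuffleDescriptor> lookup(String dataSetId) {
        return byDataSet.getOrDefault(dataSetId, Collections.emptyList());
    }

    public static void main(String[] args) {
        ShuffleRegistrySketch master = new ShuffleRegistrySketch();
        master.register("ds-1", new ShuffleDescriptor("p-0"));
        master.register("ds-1", new ShuffleDescriptor("p-1"));
        System.out.println(master.lookup("ds-1").size()); // 2
    }
}
```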
[GitHub] [flink] godfreyhe commented on pull request #13694: [FLINK-19720][table-api] Introduce new Providers and parallelism API
godfreyhe commented on pull request #13694: URL: https://github.com/apache/flink/pull/13694#issuecomment-714290557 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] gm7y8 commented on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI
gm7y8 commented on pull request #13458: URL: https://github.com/apache/flink/pull/13458#issuecomment-714289878 @XComp You are welcome! I have also verified fix with Word Count job ![Uploading Screen Shot 2020-10-22 at 12.23.01 AM.png…]() This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] tillrohrmann commented on a change in pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
tillrohrmann commented on a change in pull request #13725: URL: https://github.com/apache/flink/pull/13725#discussion_r509935113 ## File path: flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java ## @@ -203,6 +207,11 @@ protected void handleStateChange(ConnectionState newState) { } } + private void onReconnectedConnectionState() { + // check whether we find some new leader information in ZooKeeper + retrieveLeaderInformationFromZooKeeper(); Review comment: I think you are right that in some cases we will report a stale leader here. However, once the asynchronous fetch from the `NodeCache` completes, the listener should be notified about the new leader. What happens in the meantime is that the listener will try to connect to the old leader, which should either be gone or reject all connection attempts since it is no longer the leader. The problem `notifyLossOfLeader` tried to solve is that a listener thinks that a stale leader is still the leader and, thus, continues working for it without questioning it (e.g. checking with the leader) until the connection to the leader times out. With the `notifyLossOfLeader` change, once the retrieval service loses connection to ZooKeeper, it will tell the listener that the current leader is no longer valid. This will tell the listener to stop working for this leader (e.g. cancelling all tasks, disconnecting from it, etc.). If the listener should shortly after be told that the old leader is still the leader because of stale information, then it will first try to connect to the leader, which will fail (assuming that the old leader is indeed no longer the leader) before it can start doing work for the leader. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
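The reconnect handling discussed in the review comment above can be condensed into a small state-change handler: on SUSPENDED/LOST the listener is told the leader is gone, and on RECONNECTED the (possibly stale) stored leader is re-read, relying on connection attempts to a dead leader failing fast. A hypothetical sketch only; the names are illustrative, not Flink's actual classes:

```java
import java.util.function.Consumer;

// Sketch of the discussed behavior: loss of ZooKeeper connection revokes the
// current leader; reconnection re-reads whatever the store currently holds,
// which may briefly be stale until the asynchronous cache update corrects it.
public class LeaderRetrievalSketch {

    public enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final Consumer<String> listener; // receives leader address, or null on loss
    private String storedLeader;             // what the store (e.g. ZooKeeper) currently holds

    public LeaderRetrievalSketch(Consumer<String> listener) { this.listener = listener; }

    public void setStoredLeader(String address) { this.storedLeader = address; }

    public void handleStateChange(ConnectionState state) {
        switch (state) {
            case SUSPENDED:
            case LOST:
                // notifyLossOfLeader: stop working for the old leader immediately.
                listener.accept(null);
                break;
            case RECONNECTED:
                // May report a stale leader; connecting to it will simply fail.
                listener.accept(storedLeader);
                break;
            default:
                break;
        }
    }
}
```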
[GitHub] [flink] wangyang0918 commented on a change in pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API
wangyang0918 commented on a change in pull request #13644: URL: https://github.com/apache/flink/pull/13644#discussion_r509939287 ## File path: flink-kubernetes/src/test/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderRetrievalServiceTest.java ## @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.kubernetes.highavailability; + +import org.apache.flink.kubernetes.kubeclient.FlinkKubeClient; +import org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMap; +import org.apache.flink.kubernetes.utils.Constants; + +import org.junit.Test; + +import java.util.Collections; +import java.util.concurrent.TimeUnit; + +import static org.hamcrest.Matchers.is; +import static org.hamcrest.Matchers.notNullValue; +import static org.hamcrest.Matchers.nullValue; +import static org.junit.Assert.assertThat; + +/** + * Tests for the {@link KubernetesLeaderRetrievalService}. + */ +public class KubernetesLeaderRetrievalServiceTest extends KubernetesHighAvailabilityTestBase { Review comment: Yes. You are right. What I mean is that we could also add some individual tests for HA service, not just a full Flink cluster with K8s ha enabled. 
They will be implemented in the E2E module. I have left a comment in FLINK-19545. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13653: [FLINK-17528][table] Remove RowData#get() and ArrayData#get() and use FieldGetter and ElementGetter instead
flinkbot edited a comment on pull request #13653: URL: https://github.com/apache/flink/pull/13653#issuecomment-709337488 ## CI report: * 0c876d6befc487412629e6a1e8883fa5fbeb31e6 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8067) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8080) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8071) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13458: FLINK-18851 [runtime] Add checkpoint type to checkpoint history entries in Web UI
flinkbot edited a comment on pull request #13458: URL: https://github.com/apache/flink/pull/13458#issuecomment-697041219 ## CI report: * 7311b0d12d19a645391ea0359a9aa6318806363b UNKNOWN * 79714f02f700aebe4a65dff017deba6dc9d911d9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7989) * 9bc87ad6b2a8cbad2f226c483c74f1fa1523b70e Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8079) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] wangyang0918 removed a comment on pull request #13644: [FLINK-19542][k8s]Implement LeaderElectionService and LeaderRetrievalService based on Kubernetes API
wangyang0918 removed a comment on pull request #13644: URL: https://github.com/apache/flink/pull/13644#issuecomment-713591200 @tillrohrmann I have gone through all your comments and will update the PR soon. Just two quick points to confirm with you. * Do you think we should not handle external deletion/update? For example, a Flink cluster with HA configured is running, and some user deletes/updates the ConfigMap via `kubectl`. If yes, I will remove the operations in `KubernetesLeaderElectionService#Watcher` and change some "ConfigMap not exists" behavior. * I am afraid it is hard to use a real K8s server in the UT because it is not very easy to start a `minikube`. I will try to add the unit tests for the contract testing now and leave the real cluster test to the E2E test implementation. Does it make sense? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-17159) ES6 ElasticsearchSinkITCase unstable
[ https://issues.apache.org/jira/browse/FLINK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218836#comment-17218836 ] Robert Metzger commented on FLINK-17159: Uff. So I was really not able to even reproduce this issue, even though it seems to happen quite frequently. Ideally we would reproduce it on DEBUG log level so that we can see why ES is not able to process the ping request. Does it make sense to use {{CommonTestUtils.waitUntilCondition()}} and retry for 30 seconds for the cluster to come up properly? > ES6 ElasticsearchSinkITCase unstable > > > Key: FLINK-17159 > URL: https://issues.apache.org/jira/browse/FLINK-17159 > Project: Flink > Issue Type: Bug > Components: Connectors / ElasticSearch, Tests >Affects Versions: 1.11.0, 1.12.0 >Reporter: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.12.0, 1.11.3 > > > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7482=logs=64110e28-73be-50d7-9369-8750330e0bf1=aa84fb9a-59ae-5696-70f7-011bc086e59b] > {code:java} > 2020-04-15T02:37:04.4289477Z [ERROR] > testElasticsearchSinkWithSmile(org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase) > Time elapsed: 0.145 s <<< ERROR! > 2020-04-15T02:37:04.4290310Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2020-04-15T02:37:04.4290790Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147) > 2020-04-15T02:37:04.4291404Z at > org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:659) > 2020-04-15T02:37:04.4291956Z at > org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:77) > 2020-04-15T02:37:04.4292548Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1643) > 2020-04-15T02:37:04.4293254Z at > org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticSearchSinkTest(ElasticsearchSinkTestBase.java:128) > 2020-04-15T02:37:04.4293990Z at > org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkTestBase.runElasticsearchSinkSmileTest(ElasticsearchSinkTestBase.java:106) > 2020-04-15T02:37:04.4295096Z at > org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSinkITCase.testElasticsearchSinkWithSmile(ElasticsearchSinkITCase.java:45) > 2020-04-15T02:37:04.4295923Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-04-15T02:37:04.4296489Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-04-15T02:37:04.4297076Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-04-15T02:37:04.4297513Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-04-15T02:37:04.4297951Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-04-15T02:37:04.4298688Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-04-15T02:37:04.4299374Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-04-15T02:37:04.4300069Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-04-15T02:37:04.4300960Z at > 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-04-15T02:37:04.4301705Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-04-15T02:37:04.4302204Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-04-15T02:37:04.4302661Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-04-15T02:37:04.4303234Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-04-15T02:37:04.4303706Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-04-15T02:37:04.4304127Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-04-15T02:37:04.4304716Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-04-15T02:37:04.4305394Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-04-15T02:37:04.4305965Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-04-15T02:37:04.4306425Z at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > 2020-04-15T02:37:04.4306942Z at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > 2020-04-15T02:37:04.4307466Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-15T02:37:04.4307920Z at >
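The `CommonTestUtils.waitUntilCondition()` retry suggested in the comment above boils down to a poll-with-timeout loop. A generic standalone sketch (hypothetical method and parameter names, not Flink's actual utility):

```java
import java.util.function.Supplier;

// Sketch: poll a condition until it holds or a deadline passes, sleeping a
// fixed interval between attempts — e.g. waiting 30 seconds for an
// Elasticsearch cluster to come up before running tests against it.
public class WaitUntilCondition {

    public static boolean waitUntil(Supplier<Boolean> condition,
                                    long timeoutMillis,
                                    long intervalMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.get()) {
                return true;
            }
            Thread.sleep(intervalMillis);
        }
        // One final check at the deadline.
        return condition.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~50 ms, well within the 30 s timeout.
        boolean ok = waitUntil(() -> System.currentTimeMillis() - start > 50, 30_000, 10);
        System.out.println(ok); // true
    }
}
```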
[GitHub] [flink] flinkbot edited a comment on pull request #13703: [FLINK-19696] Add runtime batch committer operators for the new sink API
flinkbot edited a comment on pull request #13703: URL: https://github.com/apache/flink/pull/13703#issuecomment-712821004

## CI report:

* 5166271f74497afddca3a7f350b8191e2a843298 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7938)
* 58ee54d456da57f99b6124c787b7cbaeedcfb261 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8081)

Bot commands: the @flinkbot bot supports the following commands:

* `@flinkbot run travis` re-run the last Travis build
* `@flinkbot run azure` re-run the last Azure build

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (FLINK-19688) Flink batch job fails because of InterruptedExceptions from network stack
[ https://issues.apache.org/jira/browse/FLINK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-19688. Resolution: Fixed

> Flink batch job fails because of InterruptedExceptions from network stack
>
> Key: FLINK-19688
> URL: https://issues.apache.org/jira/browse/FLINK-19688
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network, Runtime / Task
> Affects Versions: 1.12.0
> Reporter: Robert Metzger
> Assignee: Roman Khachatryan
> Priority: Blocker
> Fix For: 1.12.0
> Attachments: logs.tgz
>
> I have a benchmarking test job that throws RuntimeExceptions at any operator at a configured, random interval. When using low intervals, such as a mean failure rate of 60 s, the job gets into a state where it frequently fails with InterruptedExceptions. The same job does not have this problem on Flink 1.11.2 (at least not after running the job for 15 hours; on 1.12-SN it happens within a few minutes).
> This is the job: https://github.com/rmetzger/flip1-bench/blob/master/flip1-bench-jobs/src/main/java/com/ververica/TPCHQuery3.java
> This is the exception:
> {code}
> 2020-10-16 16:02:15,653 WARN org.apache.flink.runtime.taskmanager.Task [] - CHAIN GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at appendMapper(KillerClientMapper.java:38)) (8/8)#1 (06d656f696bf4ed98831938a7ac2359d_c1c4a56fea0536703d37867c057f0cc8_7_1) switched from RUNNING to FAILED.
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (GroupReduce at main(TPCHQuery3.java:199)) -> Map (Map at appendMapper(KillerClientMapper.java:38))', caused an error: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:481) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:370) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) [flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> Caused by: org.apache.flink.util.WrappingRuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>     at org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:253) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     ... 4 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection for partition 060d457c4163472f65a4b741993c83f8#0@06d656f696bf4ed98831938a7ac2359d_0bcc9fbf9ac242d5aac540917d980e44_0_1 not reachable.
>     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_222]
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_222]
>     at org.apache.flink.runtime.operators.sort.ExternalSorter.getIterator(ExternalSorter.java:250) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1122) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:99) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475)
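The FLINK-19688 report above describes its reproduction technique only in prose: a benchmark job whose operators throw RuntimeExceptions after a configured, random interval (the linked `KillerClientMapper`). A minimal sketch of that failure-injection logic is below; all class and method names are hypothetical and not taken from the linked repository, and the actual benchmark job may implement this differently:

```java
import java.util.Random;

// Sketch of a failure injector for recovery benchmarks: after a random
// interval drawn from an exponential distribution with a configured mean
// (e.g. 60 000 ms for "mean failure rate = 60 s"), it throws a
// RuntimeException. In a Flink job, maybeFail() would be called once per
// record inside an operator's map() method.
class RandomFailureInjector {
    private final Random random;
    private final double meanFailureIntervalMs;
    private long nextFailureAtMs;

    RandomFailureInjector(double meanFailureIntervalMs, long seed) {
        this.random = new Random(seed);
        this.meanFailureIntervalMs = meanFailureIntervalMs;
        this.nextFailureAtMs = System.currentTimeMillis() + sampleIntervalMs();
    }

    // Inverse-transform sampling: exponentially distributed interval
    // whose mean is meanFailureIntervalMs.
    private long sampleIntervalMs() {
        return (long) (-meanFailureIntervalMs * Math.log(1.0 - random.nextDouble()));
    }

    // Call on every record; throws once the sampled failure time has passed,
    // then schedules the next failure.
    void maybeFail() {
        if (System.currentTimeMillis() >= nextFailureAtMs) {
            nextFailureAtMs = System.currentTimeMillis() + sampleIntervalMs();
            throw new RuntimeException("Injected failure for recovery testing");
        }
    }
}
```

Sampling from an exponential distribution (rather than a fixed period) makes failures memoryless, so they land at arbitrary points in the checkpoint/recovery cycle, which is what surfaces races like the InterruptedExceptions reported here.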
[jira] [Created] (FLINK-19764) Add More Metrics to TaskManager in Web UI
Yadong Xie created FLINK-19764:

Summary: Add More Metrics to TaskManager in Web UI
Key: FLINK-19764
URL: https://issues.apache.org/jira/browse/FLINK-19764
Project: Flink
Issue Type: Improvement
Components: Runtime / Web Frontend
Reporter: Yadong Xie

Update the UI now that https://issues.apache.org/jira/browse/FLINK-14422 and https://issues.apache.org/jira/browse/FLINK-14406 have been fixed.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] flinkbot edited a comment on pull request #13726: [BP-1.11][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
flinkbot edited a comment on pull request #13726: URL: https://github.com/apache/flink/pull/13726#issuecomment-713567666

## CI report:

* c673d6c2530a3484e2a315301d0eee7154a3a5d9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8020)
* f0ac074feed577550c7026b190db031561a25f77 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8087)

Bot commands: the @flinkbot bot supports the following commands:

* `@flinkbot run travis` re-run the last Travis build
* `@flinkbot run azure` re-run the last Azure build

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13727: [BP-1.10][FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
flinkbot edited a comment on pull request #13727: URL: https://github.com/apache/flink/pull/13727#issuecomment-713572668

## CI report:

* c2ae5ba37a69ad1f40f585421d38edb30d40c0fa Travis: [FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/191431633)
* 8e2aaf84d255adab8420800d5e7ec20d502868f4 Travis: [PENDING](https://travis-ci.com/github/flink-ci/flink/builds/191598195)

Bot commands: the @flinkbot bot supports the following commands:

* `@flinkbot run travis` re-run the last Travis build
* `@flinkbot run azure` re-run the last Azure build

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13725: [FLINK-19557] Trigger LeaderRetrievalListener notification upon ZooKeeper reconnection in ZooKeeperLeaderRetrievalService
flinkbot edited a comment on pull request #13725: URL: https://github.com/apache/flink/pull/13725#issuecomment-713560830

## CI report:

* a606f618ea3fb7bf179657a060c47ee32b55e514 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8019)
* cd48869c084675464d37449a3db1f9d3cd10a040 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8086)

Bot commands: the @flinkbot bot supports the following commands:

* `@flinkbot run travis` re-run the last Travis build
* `@flinkbot run azure` re-run the last Azure build

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org