[jira] [Assigned] (FLINK-20168) Translate page 'Flink Architecture' into Chinese
[ https://issues.apache.org/jira/browse/FLINK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jark Wu reassigned FLINK-20168:
-------------------------------

    Assignee: CaoZhen

> Translate page 'Flink Architecture' into Chinese
> ------------------------------------------------
>
>                 Key: FLINK-20168
>                 URL: https://issues.apache.org/jira/browse/FLINK-20168
>             Project: Flink
>          Issue Type: Improvement
>          Components: chinese-translation, Documentation
>            Reporter: CaoZhen
>            Assignee: CaoZhen
>            Priority: Minor
>
> Translate the page [Flink Architecture|https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html].
> The doc is located in "flink/docs/concepts/flink-architecture.zh.md".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [flink] flinkbot edited a comment on pull request #14020: [FLINK-20078][coordination] Factor out an ExecutionGraph factory method for DefaultExecutionTopology
flinkbot edited a comment on pull request #14020: URL: https://github.com/apache/flink/pull/14020#issuecomment-724735089 ## CI report: * bb25d7f6325759f7400ae0b5728d41d478784659 UNKNOWN * c7df499afee4829fd30823f277597e04de20a8f2 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9449) * ab964bbe7a026494fca227e41979b74597dc7067 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232595#comment-17232595 ]

Yuan Mei commented on FLINK-17726:
----------------------------------

Double-checked with [~nicholasjiang]; he said he would finish this ticket by the end of this week.

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Priority: Critical
>             Fix For: 1.12.0, 1.11.3, 1.13.0
>
> The JobManager will not trigger failure handling when receiving a CANCELED task update.
> This is because CANCELED tasks are usually caused by another FAILED task; these CANCELED tasks will be restarted by the failover process triggered by the FAILED task.
> However, if a task is directly CANCELED by the TaskManager due to its own runtime issue, the task will not be recovered by the JobManager and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let the JobManager treat tasks transitioning to CANCELED from all states except CANCELING as failed tasks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
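The rule proposed in the last paragraph can be sketched as a small predicate. This is an illustrative sketch only; the enum values mirror Flink's execution states, but the class and method names are hypothetical and not actual Flink scheduler code:

```java
// Hypothetical sketch of the proposed rule: a transition to CANCELED triggers
// failover unless the task was already in CANCELING (i.e. the JobManager itself
// initiated the cancellation). Names are illustrative, not Flink internals.
enum ExecutionState { CREATED, SCHEDULED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

final class CanceledTaskPolicy {
    /** Returns true if a transition to CANCELED should be treated as a failure. */
    static boolean treatAsFailed(ExecutionState previous, ExecutionState next) {
        return next == ExecutionState.CANCELED && previous != ExecutionState.CANCELING;
    }

    public static void main(String[] args) {
        // TaskManager cancels a RUNNING task on its own: handle as a failure.
        assert treatAsFailed(ExecutionState.RUNNING, ExecutionState.CANCELED);
        // Normal JobManager-initiated cancellation: no failover needed.
        assert !treatAsFailed(ExecutionState.CANCELING, ExecutionState.CANCELED);
    }
}
```

Under this rule a task that the TaskManager cancels directly (e.g. from RUNNING) would go through the regular failover path instead of leaving the job hanging.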
[GitHub] [flink] flinkbot edited a comment on pull request #13991: [FLINK-19983][network] Remove wrong state checking in SortMergeSubpartitionReader
flinkbot edited a comment on pull request #13991: URL: https://github.com/apache/flink/pull/13991#issuecomment-723737863 ## CI report: * 9d1b60190d1e7c588b225301792b63789227a91b Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9332) * c44538769371ce3114274f3322168a9fe18e4875 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-20098) Don't add flink-connector-files to flink-dist
[ https://issues.apache.org/jira/browse/FLINK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232593#comment-17232593 ]

Jark Wu commented on FLINK-20098:
---------------------------------

If we want to solve the "provided" problem, then we may also need to remove the csv and json formats from the dist. That sounds like a regression in the out-of-the-box experience for SQL users. We had a long discussion before and decided to put some lightweight dependencies into the dist. Now, however, users have to add the filesystem jar manually, even though it is a frequently used connector in PoCs.

> Don't add flink-connector-files to flink-dist
> ---------------------------------------------
>
>                 Key: FLINK-20098
>                 URL: https://issues.apache.org/jira/browse/FLINK-20098
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataStream
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
> We currently add both {{flink-connector-files}} and {{flink-connector-base}} to {{flink-dist}}.
> This implies that users should use the dependency like this:
> {code}
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-connector-files</artifactId>
>   <version>${project.version}</version>
>   <scope>provided</scope>
> </dependency>
> {code}
> which differs from other connectors, where users don't need to specify {{provided}}.
> Also, {{flink-connector-files}} has {{flink-connector-base}} as a provided dependency, which means that examples that use this dependency will not run out of the box in IntelliJ, because transitive provided dependencies are not considered.
> I propose to just remove the dependencies from {{flink-dist}} and let users use the File Connector like any other connector.
> I believe the initial motivation for "providing" the File Connector in {{flink-dist}} was to allow us to use the File Connector under the hood in methods such as {{StreamExecutionEnvironment.readFile(...)}}. We could decide to deprecate and remove those methods, or re-add the File Connector as an explicit (non-provided) dependency again in the future.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
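The "like any other connector" point can be illustrated by how such connectors are normally declared: default (compile) scope, with no {{provided}}. The Kafka connector coordinates below are only an illustration of the pattern:

```xml
<!-- Illustrative only: a typical Flink connector dependency is declared with
     default (compile) scope, without <scope>provided</scope>, because the
     connector is not bundled in flink-dist. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>${flink.version}</version>
</dependency>
```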
[jira] [Comment Edited] (FLINK-20098) Don't add flink-connector-files to flink-dist
[ https://issues.apache.org/jira/browse/FLINK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232593#comment-17232593 ]

Jark Wu edited comment on FLINK-20098 at 11/16/20, 7:53 AM:
------------------------------------------------------------

If we want to solve the "provided" problem, then we may also need to remove the csv and json formats from the dist. That sounds like a regression in the out-of-the-box experience for SQL users. We had a long discussion [1] before and decided to put some lightweight dependencies into the dist. Now, however, users have to add the filesystem jar manually, even though it is a frequently used connector in PoCs.

[1]: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-quot-fat-quot-and-quot-slim-quot-Flink-distributions-td40237i40.html#a42276

was (Author: jark):
If we want to solve the "provided" problem, then we may also need to remove the csv and json formats from the dist. That sounds like a regression in the out-of-the-box experience for SQL users. We had a long discussion before and decided to put some lightweight dependencies into the dist. Now, however, users have to add the filesystem jar manually, even though it is a frequently used connector in PoCs.

> Don't add flink-connector-files to flink-dist
> ---------------------------------------------
>
>                 Key: FLINK-20098
>                 URL: https://issues.apache.org/jira/browse/FLINK-20098
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataStream
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
> We currently add both {{flink-connector-files}} and {{flink-connector-base}} to {{flink-dist}}.
> This implies that users should use the dependency like this:
> {code}
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-connector-files</artifactId>
>   <version>${project.version}</version>
>   <scope>provided</scope>
> </dependency>
> {code}
> which differs from other connectors, where users don't need to specify {{provided}}.
> Also, {{flink-connector-files}} has {{flink-connector-base}} as a provided dependency, which means that examples that use this dependency will not run out of the box in IntelliJ, because transitive provided dependencies are not considered.
> I propose to just remove the dependencies from {{flink-dist}} and let users use the File Connector like any other connector.
> I believe the initial motivation for "providing" the File Connector in {{flink-dist}} was to allow us to use the File Connector under the hood in methods such as {{StreamExecutionEnvironment.readFile(...)}}. We could decide to deprecate and remove those methods, or re-add the File Connector as an explicit (non-provided) dependency again in the future.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [flink] wangyang0918 commented on a change in pull request #14006: [FLINK-19546][doc] Add documentation for native Kubernetes HA
wangyang0918 commented on a change in pull request #14006:
URL: https://github.com/apache/flink/pull/14006#discussion_r523933632

##########
File path: docs/ops/jobmanager_high_availability.md
##########
@@ -215,6 +215,76 @@ Starting zookeeper daemon on host localhost.
 $ bin/yarn-session.sh -n 2

+## Kubernetes Cluster High Availability
+When running the Flink JobManager as a Kubernetes deployment, the replica count should be configured to 1 or greater.
+* The value `1` means that a new JobManager will be launched to take over leadership if the current one terminates exceptionally.
+* The value `N` (greater than 1) means that multiple JobManagers will be launched simultaneously; one is active and the others are standby. Starting more than one JobManager makes recovery faster.
+
+### Configuration
+{% highlight yaml %}
+kubernetes.cluster-id: <cluster-id>
+high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
+high-availability.storageDir: hdfs:///flink/recovery
+{% endhighlight %}
+
+#### Example: Highly Available Standalone Flink Cluster on Kubernetes
+Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml]({{ site.baseurl }}/ops/deployment/kubernetes.html#common-cluster-resource-definitions). All other yamls do not need to be updated.
+
+Note: The filesystem which corresponds to the scheme of your configured HA storage directory must be available to the runtime. Refer to [custom Flink image]({{ site.baseurl }}/ops/deployment/docker.html#customize-flink-image) and [enable plugins]({{ site.baseurl }}/ops/deployment/docker.html#using-plugins) for more information.
+
+{% highlight yaml %}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: flink-config
+  labels:
+    app: flink
+data:
+  flink-conf.yaml: |+
+    ...
+    kubernetes.cluster-id: <cluster-id>
+    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
+    high-availability.storageDir: hdfs:///flink/recovery
+    restart-strategy: fixed-delay
+    restart-strategy.fixed-delay.attempts: 10
+    ...
+{% endhighlight %}
+
+#### Example: Highly Available Native Kubernetes Cluster
+Use the following command to start a native Flink application cluster on Kubernetes with high availability configured.
+{% highlight bash %}
+$ ./bin/flink run-application -p 8 -t kubernetes-application \
+  -Dkubernetes.cluster-id=<cluster-id> \
+  -Dtaskmanager.memory.process.size=4096m \
+  -Dkubernetes.taskmanager.cpu=2 \
+  -Dtaskmanager.numberOfTaskSlots=4 \
+  -Dkubernetes.container.image=<custom-image-name> \
+  -Dhigh-availability=org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory \
+  -Dhigh-availability.storageDir=s3://flink/flink-ha \
+  -Drestart-strategy=fixed-delay -Drestart-strategy.fixed-delay.attempts=10 \
+  -Dcontainerized.master.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-{{site.version}}.jar \
+  -Dcontainerized.taskmanager.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-{{site.version}}.jar \
+  local:///opt/flink/examples/streaming/StateMachineExample.jar
+{% endhighlight %}
+
+### High Availability Data Clean Up
+Currently, when a Flink job reaches a terminal state (`FAILED`, `CANCELED`, `FINISHED`), all the HA data, including the metadata in the Kubernetes ConfigMaps and the HA state on the DFS, is cleaned up.
+
+So the following command will only shut down the Flink session cluster and leave all HA-related ConfigMaps and state untouched.
+{% highlight bash %}
+$ echo 'stop' | ./bin/kubernetes-session.sh -Dkubernetes.cluster-id=<cluster-id> -Dexecution.attached=true
+{% endhighlight %}
+
+The following commands will cancel the job in an application or session cluster and effectively remove all of its HA data.
+{% highlight bash %}
+# Cancel a Flink job in the existing session
+$ ./bin/flink cancel -t kubernetes-session -Dkubernetes.cluster-id=<cluster-id> <job-id>
+# Cancel a Flink application
+$ ./bin/flink cancel -t kubernetes-application -Dkubernetes.cluster-id=<cluster-id> <job-id>
+{% endhighlight %}
+
+To keep HA data while restarting the Flink cluster, simply delete the deploy (via `kubectl delete deploy <cluster-id>`). All the Flink cluster related resources will be destroyed (e.g. JobManager Deployment, TaskManager pods, services, Flink conf ConfigMap), as to not occupy the Kubernetes cluster resources. However, the HA-related ConfigMaps do not set the owner reference and will be retained. When restarting the session / application, use `kubernetes-session.sh` or `flink run-application`. All the previously running jobs will recover from the latest checkpoint successfully.

Review comment:
       > I don't understand the subclause ", as to not occupy resources".

What I want to state here is that we will free all the resources allocated from Kubernetes so that they can be used by others. At the same time, we still have the ability to recover the Flink application at any time.
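The retention behavior discussed above hinges on Kubernetes owner references: resources whose `metadata.ownerReferences` point at the JobManager Deployment are garbage-collected when that Deployment is deleted, while the HA ConfigMaps omit that field and therefore survive. A minimal illustration, with a hypothetical ConfigMap name and labels (not guaranteed to match Flink's actual naming scheme):

```yaml
# Hypothetical HA ConfigMap: because it carries no metadata.ownerReferences
# entry, deleting the JobManager Deployment does not cascade to it, so the
# leader/checkpoint metadata it holds is retained for later recovery.
apiVersion: v1
kind: ConfigMap
metadata:
  name: <cluster-id>-dispatcher-leader   # illustrative name
  labels:
    app: <cluster-id>
data: {}                                 # HA metadata would live here
```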
[jira] [Commented] (FLINK-20143) use `yarn.provided.lib.dirs` config deploy job failed in yarn per job mode
[ https://issues.apache.org/jira/browse/FLINK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232590#comment-17232590 ] Kostas Kloudas commented on FLINK-20143: Thanks [~fly_in_gis], feel free to ping me for a review. > use `yarn.provided.lib.dirs` config deploy job failed in yarn per job mode > -- > > Key: FLINK-20143 > URL: https://issues.apache.org/jira/browse/FLINK-20143 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission, Deployment / YARN >Affects Versions: 1.12.0 >Reporter: zhisheng >Priority: Major > Attachments: image-2020-11-13-17-21-47-751.png, > image-2020-11-13-17-22-06-111.png, image-2020-11-13-18-43-55-188.png > > > use follow command deploy flink job to yarn failed > {code:java} > ./bin/flink run -m yarn-cluster -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jar > {code} > log: > {code:java} > $ ./bin/flink run -m yarn-cluster -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jar$ ./bin/flink run -m yarn-cluster > -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jarSLF4J: Class path contains > multiple SLF4J bindings.SLF4J: Found binding in > [jar:file:/data1/app/flink-1.12-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > Found binding in > [jar:file:/data1/app/hadoop-2.7.3-snappy-32core12disk/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > Found binding in > [jar:file:/data1/app/hadoop-2.7.3-snappy-32core12disk/share/hadoop/tools/lib/hadoop-aliyun-2.9.2-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > See http://www.slf4j.org/codes.html#multiple_bindings for an > 
explanation.SLF4J: Actual binding is of type > [org.apache.logging.slf4j.Log4jLoggerFactory]2020-11-13 16:14:30,347 INFO > org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Dynamic > Property set: > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib2020-11-13 > 16:14:30,347 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli > [] - Dynamic Property set: > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/libUsage with > built-in data generator: StateMachineExample [--error-rate > ] [--sleep ]Usage > with Kafka: StateMachineExample --kafka-topic [--brokers > ]Options for both the above setups: [--backend ] > [--checkpoint-dir ] [--async-checkpoints ] > [--incremental-checkpoints ] [--output OR null for > stdout] > Using standalone source with error rate 0.00 and sleep delay 1 millis > 2020-11-13 16:14:30,706 WARN > org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The > configuration directory ('/data1/app/flink-1.12-SNAPSHOT/conf') already > contains a LOG4J config file.If you want to use logback, then please delete > or rename the log configuration file.2020-11-13 16:14:30,947 INFO > org.apache.hadoop.yarn.client.AHSProxy [] - Connecting > to Application History server at > FAT-hadoopuat-69117.vm.dc01.tech/10.69.1.17:102002020-11-13 16:14:30,958 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - No path > for the flink jar passed. Using the location of class > org.apache.flink.yarn.YarnClusterDescriptor to locate the jar2020-11-13 > 16:14:31,065 INFO > org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] - Failing > over to rm22020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - The > configured JobManager memory is 3072 MB. YARN will allocate 4096 MB to make > up an integer multiple of its minimum allocation memory (2048 MB, configured > via 'yarn.scheduler.minimum-allocation-mb'). 
The extra 1024 MB may not be > used by Flink.2020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - The > configured TaskManager memory is 3072 MB. YARN will allocate 4096 MB to make > up an integer multiple of its minimum allocation memory (2048 MB, configured > via 'yarn.scheduler.minimum-allocation-mb'). The extra 1024 MB may not be > used by Flink.2020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - Cluster > specification: ClusterSpecification{masterMemoryMB=3072, > taskManagerMemoryMB=3072, slotsPerTaskManager=2}2020-11-13 16:14:31,681 WARN > org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory [] - The > sh
[GitHub] [flink] flinkbot edited a comment on pull request #14075: [FLINK-20163][docs-zh] Translate page "raw format" into Chinese
flinkbot edited a comment on pull request #14075: URL: https://github.com/apache/flink/pull/14075#issuecomment-727598953 ## CI report: * 08214d0608cf8a45395d003dc136a3e51eb1db68 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9607) Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9595) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-20168) Translate page 'Flink Architecture' into Chinese
[ https://issues.apache.org/jira/browse/FLINK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232589#comment-17232589 ]

CaoZhen commented on FLINK-20168:
---------------------------------

Hi [~jark], I want to translate this document. Can you assign it to me?

> Translate page 'Flink Architecture' into Chinese
> ------------------------------------------------
>
>                 Key: FLINK-20168
>                 URL: https://issues.apache.org/jira/browse/FLINK-20168
>             Project: Flink
>          Issue Type: Improvement
>          Components: chinese-translation, Documentation
>            Reporter: CaoZhen
>            Priority: Minor
>
> Translate the page [Flink Architecture|https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html].
> The doc is located in "flink/docs/concepts/flink-architecture.zh.md".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [flink] wsry commented on pull request #13991: [FLINK-19983][network] Remove wrong state checking in SortMergeSubpartitionReader
wsry commented on pull request #13991:
URL: https://github.com/apache/flink/pull/13991#issuecomment-727793009

@AHeise Thanks for your review and comments. Your concern is reasonable; I have updated my fix. For the current bounded partition, only the Netty thread can release or poll the subpartition read view, so we should never poll a released view. In this case, two things caused the failure: 1) there were redundant data availability notifications; 2) the read buffers were not cleared, which means the read view could still appear available after being released. For the multi-threaded scenario, as I understand it, the root failure cause needs to be propagated to the downstream. If nothing is polled from a read view, we try to find out whether the view has been released and whether there is a failure cause; the logic looks like this:
```
next = reader.getNextBuffer();
if (next == null) {
    if (!reader.isReleased()) {
        // Nothing available yet, but the view is still alive: keep polling.
        continue;
    }
    // The view has been released; propagate the root cause (if any) downstream.
    Throwable cause = reader.getFailureCause();
    if (cause != null) {
        ErrorResponse msg = new ErrorResponse(
                new ProducerFailedException(cause), reader.getReceiverId());
        ctx.writeAndFlush(msg);
    }
}
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (FLINK-20145) Streaming job fails with IllegalStateException: Should only poll priority events
[ https://issues.apache.org/jira/browse/FLINK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232587#comment-17232587 ] Robert Metzger commented on FLINK-20145: Thanks a lot for addressing this so quickly! I will verify the fix. Another question: We have found this issue through manual testing only, and the PR also doesn't contain any new tests. I wonder if we need additional test coverage to verify this fix, and ensure that it won't be broken again in the future. > Streaming job fails with IllegalStateException: Should only poll priority > events > > > Key: FLINK-20145 > URL: https://issues.apache.org/jira/browse/FLINK-20145 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Network >Affects Versions: 1.12.0 >Reporter: Robert Metzger >Assignee: Arvid Heise >Priority: Blocker > Labels: pull-request-available > Fix For: 1.12.0 > > > While testing the 1.12 release, I came across the following failure cause: > {code} > 2020-11-13 09:41:52,110 WARN org.apache.flink.runtime.taskmanager.Task > [] - dynamic filter (3/4)#0 (b977944851531f96e5324e786f055eb7) > switched from RUNNING to FAILED. 
> java.lang.IllegalStateException: Should only poll priority events
>         at org.apache.flink.util.Preconditions.checkState(Preconditions.java:198) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.processPriorityEvents(CheckpointedInputGate.java:116) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:283) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:184) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:577) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:541) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) [flink-dist_2.11-1.12.0.jar:1.12.0]
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) [flink-dist_2.11-1.12.0.jar:1.12.0]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
> {code}
> I have unaligned checkpointing enabled, and the failing operator is a CoFlatMapFunction. The error happened on all four TaskManagers, very soon after job submission. The error doesn't happen when unaligned checkpointing is disabled.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (FLINK-20168) Translate page 'Flink Architecture' into Chinese
CaoZhen created FLINK-20168:
-------------------------------

             Summary: Translate page 'Flink Architecture' into Chinese
                 Key: FLINK-20168
                 URL: https://issues.apache.org/jira/browse/FLINK-20168
             Project: Flink
          Issue Type: Improvement
          Components: chinese-translation, Documentation
            Reporter: CaoZhen


Translate the page [Flink Architecture|https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html].

The doc is located in "flink/docs/concepts/flink-architecture.zh.md".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Closed] (FLINK-20149) SQLClientKafkaITCase.testKafka failed with "Did not get expected results before timeout."
[ https://issues.apache.org/jira/browse/FLINK-20149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger closed FLINK-20149. -- Resolution: Fixed I reverted https://issues.apache.org/jira/browse/FLINK-20098 and reopened the ticket. > SQLClientKafkaITCase.testKafka failed with "Did not get expected results > before timeout." > - > > Key: FLINK-20149 > URL: https://issues.apache.org/jira/browse/FLINK-20149 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Table SQL / Ecosystem >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Robert Metzger >Priority: Blocker > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 > {code} > 2020-11-13T13:47:21.6002762Z Nov 13 13:47:21 [ERROR] testKafka[0: > kafka-version:2.4.1 > kafka-sql-version:universal](org.apache.flink.tests.util.kafka.SQLClientKafkaITCase) > Time elapsed: 183.015 s <<< FAILURE! 2020-11-13T13:47:21.6003744Z Nov 13 > 13:47:21 java.lang.AssertionError: Did not get expected results before > timeout. 2020-11-13T13:47:21.6004745Z Nov 13 13:47:21 at > org.junit.Assert.fail(Assert.java:88) 2020-11-13T13:47:21.6005325Z Nov 13 > 13:47:21 at org.junit.Assert.assertTrue(Assert.java:41) > 2020-11-13T13:47:21.6006007Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.checkCsvResultFile(SQLClientKafkaITCase.java:226) > 2020-11-13T13:47:21.6007091Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.testKafka(SQLClientKafkaITCase.java:166) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] flinkbot edited a comment on pull request #13900: [FLINK-19949][csv] Unescape CSV format line delimiter character
flinkbot edited a comment on pull request #13900: URL: https://github.com/apache/flink/pull/13900#issuecomment-720989833 ## CI report: * d9e47451664a976692e4e6110bbffa2842f2ae7a UNKNOWN * c9a24a3ac861addff4a26ffa47cffd38db7427ae Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9600) * 67451b7cb7e42729ac9f74b0b1b642ac685954f7 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (FLINK-20098) Don't add flink-connector-files to flink-dist
[ https://issues.apache.org/jira/browse/FLINK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232585#comment-17232585 ]

Robert Metzger edited comment on FLINK-20098 at 11/16/20, 7:27 AM:
-------------------------------------------------------------------

Reverted in f3f5e316622255b86f416ba5eb1c283562732823
See: https://issues.apache.org/jira/browse/FLINK-20149

was (Author: rmetzger):
Reverted f3f5e316622255b86f416ba5eb1c283562732823
See: https://issues.apache.org/jira/browse/FLINK-20149

> Don't add flink-connector-files to flink-dist
> ---------------------------------------------
>
>                 Key: FLINK-20098
>                 URL: https://issues.apache.org/jira/browse/FLINK-20098
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataStream
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
> We currently add both {{flink-connector-files}} and {{flink-connector-base}} to {{flink-dist}}.
> This implies that users should use the dependency like this:
> {code}
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-connector-files</artifactId>
>   <version>${project.version}</version>
>   <scope>provided</scope>
> </dependency>
> {code}
> which differs from other connectors, where users don't need to specify {{provided}}.
> Also, {{flink-connector-files}} has {{flink-connector-base}} as a provided dependency, which means that examples that use this dependency will not run out of the box in IntelliJ, because transitive provided dependencies are not considered.
> I propose to just remove the dependencies from {{flink-dist}} and let users use the File Connector like any other connector.
> I believe the initial motivation for "providing" the File Connector in {{flink-dist}} was to allow us to use the File Connector under the hood in methods such as {{StreamExecutionEnvironment.readFile(...)}}. We could decide to deprecate and remove those methods, or re-add the File Connector as an explicit (non-provided) dependency again in the future.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Reopened] (FLINK-20098) Don't add flink-connector-files to flink-dist
[ https://issues.apache.org/jira/browse/FLINK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Metzger reopened FLINK-20098:
------------------------------------

Reverted f3f5e316622255b86f416ba5eb1c283562732823
See: https://issues.apache.org/jira/browse/FLINK-20149

> Don't add flink-connector-files to flink-dist
> ---------------------------------------------
>
>                 Key: FLINK-20098
>                 URL: https://issues.apache.org/jira/browse/FLINK-20098
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataStream
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
> We currently add both {{flink-connector-files}} and {{flink-connector-base}} to {{flink-dist}}.
> This implies that users should use the dependency like this:
> {code}
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-connector-files</artifactId>
>   <version>${project.version}</version>
>   <scope>provided</scope>
> </dependency>
> {code}
> which differs from other connectors, where users don't need to specify {{provided}}.
> Also, {{flink-connector-files}} has {{flink-connector-base}} as a provided dependency, which means that examples that use this dependency will not run out of the box in IntelliJ, because transitive provided dependencies are not considered.
> I propose to just remove the dependencies from {{flink-dist}} and let users use the File Connector like any other connector.
> I believe the initial motivation for "providing" the File Connector in {{flink-dist}} was to allow us to use the File Connector under the hood in methods such as {{StreamExecutionEnvironment.readFile(...)}}. We could decide to deprecate and remove those methods, or re-add the File Connector as an explicit (non-provided) dependency again in the future.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-20149) SQLClientKafkaITCase.testKafka failed with "Did not get expected results before timeout."
[ https://issues.apache.org/jira/browse/FLINK-20149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232583#comment-17232583 ] Robert Metzger commented on FLINK-20149: Thanks for looking into it. I will revert FLINK-20098. > SQLClientKafkaITCase.testKafka failed with "Did not get expected results > before timeout." > - > > Key: FLINK-20149 > URL: https://issues.apache.org/jira/browse/FLINK-20149 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Table SQL / Ecosystem >Affects Versions: 1.12.0 >Reporter: Dian Fu >Priority: Blocker > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 > {code} > 2020-11-13T13:47:21.6002762Z Nov 13 13:47:21 [ERROR] testKafka[0: > kafka-version:2.4.1 > kafka-sql-version:universal](org.apache.flink.tests.util.kafka.SQLClientKafkaITCase) > Time elapsed: 183.015 s <<< FAILURE! 2020-11-13T13:47:21.6003744Z Nov 13 > 13:47:21 java.lang.AssertionError: Did not get expected results before > timeout. 2020-11-13T13:47:21.6004745Z Nov 13 13:47:21 at > org.junit.Assert.fail(Assert.java:88) 2020-11-13T13:47:21.6005325Z Nov 13 > 13:47:21 at org.junit.Assert.assertTrue(Assert.java:41) > 2020-11-13T13:47:21.6006007Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.checkCsvResultFile(SQLClientKafkaITCase.java:226) > 2020-11-13T13:47:21.6007091Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.testKafka(SQLClientKafkaITCase.java:166) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] AHeise merged pull request #14073: [FLINK-20145][task] Don't expose modifiable PrioritizedDeque.iterator
AHeise merged pull request #14073: URL: https://github.com/apache/flink/pull/14073 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] rmetzger commented on pull request #14044: [FLINK-20098] Don't add flink-connector-files to flink-dist
rmetzger commented on pull request #14044: URL: https://github.com/apache/flink/pull/14044#issuecomment-727789464 Note to all people involved in this PR: The CI run of this PR showed a test break that has been merged to master: https://issues.apache.org/jira/browse/FLINK-20149 I will revert this change and reopen the ticket. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (FLINK-20149) SQLClientKafkaITCase.testKafka failed with "Did not get expected results before timeout."
[ https://issues.apache.org/jira/browse/FLINK-20149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger reassigned FLINK-20149: -- Assignee: Robert Metzger > SQLClientKafkaITCase.testKafka failed with "Did not get expected results > before timeout." > - > > Key: FLINK-20149 > URL: https://issues.apache.org/jira/browse/FLINK-20149 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Table SQL / Ecosystem >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Robert Metzger >Priority: Blocker > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 > {code} > 2020-11-13T13:47:21.6002762Z Nov 13 13:47:21 [ERROR] testKafka[0: > kafka-version:2.4.1 > kafka-sql-version:universal](org.apache.flink.tests.util.kafka.SQLClientKafkaITCase) > Time elapsed: 183.015 s <<< FAILURE! 2020-11-13T13:47:21.6003744Z Nov 13 > 13:47:21 java.lang.AssertionError: Did not get expected results before > timeout. 2020-11-13T13:47:21.6004745Z Nov 13 13:47:21 at > org.junit.Assert.fail(Assert.java:88) 2020-11-13T13:47:21.6005325Z Nov 13 > 13:47:21 at org.junit.Assert.assertTrue(Assert.java:41) > 2020-11-13T13:47:21.6006007Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.checkCsvResultFile(SQLClientKafkaITCase.java:226) > 2020-11-13T13:47:21.6007091Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.testKafka(SQLClientKafkaITCase.java:166) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] wangyang0918 commented on pull request #14006: [FLINK-19546][doc] Add documentation for native Kubernetes HA
wangyang0918 commented on pull request #14006: URL: https://github.com/apache/flink/pull/14006#issuecomment-727789055 I have created a new ticket FLINK-20167 for the follow-up(rework the HA documentation). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] AHeise commented on pull request #14073: [FLINK-20145][task] Don't expose modifiable PrioritizedDeque.iterator
AHeise commented on pull request #14073: URL: https://github.com/apache/flink/pull/14073#issuecomment-727789160 I'm merging, but not closing the ticket, as I couldn't yet confirm whether it's indeed causing the issue. (It's only relevant for union gates.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
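The change named in the PR title ("don't expose a modifiable `PrioritizedDeque.iterator`") can be sketched in a few lines. This is a hypothetical, illustrative stand-in, not Flink's actual `PrioritizedDeque` code: instead of handing out the backing deque's own iterator, the class returns a wrapper that forwards `hasNext()`/`next()` but does not override `remove()`, so callers cannot mutate the internal queue through the iterator.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Illustrative sketch only -- names and structure are not Flink's.
class GuardedDeque<T> {
    private final Deque<T> deque = new ArrayDeque<>();

    void add(T element) {
        deque.add(element);
    }

    // Expose a read-only iteration: the wrapper inherits Iterator's
    // default remove(), which throws UnsupportedOperationException
    // instead of removing the element from the backing deque.
    Iterator<T> iterator() {
        Iterator<T> inner = deque.iterator();
        return new Iterator<T>() {
            @Override
            public boolean hasNext() {
                return inner.hasNext();
            }

            @Override
            public T next() {
                return inner.next();
            }
        };
    }
}
```

Returning a guarded iterator (rather than `Collections.unmodifiableCollection` over a copy) keeps iteration allocation-free while still protecting the queue's internal state.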
[jira] [Created] (FLINK-20167) Rework the high availability documentation
Yang Wang created FLINK-20167:
---------------------------------

             Summary: Rework the high availability documentation
                 Key: FLINK-20167
                 URL: https://issues.apache.org/jira/browse/FLINK-20167
             Project: Flink
          Issue Type: Improvement
          Components: Documentation
            Reporter: Yang Wang


Currently, the [high availability documentation|https://ci.apache.org/projects/flink/flink-docs-master/ops/jobmanager_high_availability.html] and its structure are not ideal. We should rework it together with the documentation about the different deployment modes. Then we will have Deployment -> Standalone -> HA, Deployment -> Yarn -> HA, Deployment -> K8s -> HA, and so on.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (FLINK-20164) Add docs for ProcessFunction and Timer in Python DataStream API.
[ https://issues.apache.org/jira/browse/FLINK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuiqiang Chen updated FLINK-20164: --- Summary: Add docs for ProcessFunction and Timer in Python DataStream API. (was: Add docs for ProcessFunction and Timer for Python DataStream API.) > Add docs for ProcessFunction and Timer in Python DataStream API. > > > Key: FLINK-20164 > URL: https://issues.apache.org/jira/browse/FLINK-20164 > Project: Flink > Issue Type: Improvement > Components: API / Python, Documentation >Reporter: Shuiqiang Chen >Assignee: Shuiqiang Chen >Priority: Minor > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] flinkbot edited a comment on pull request #14075: [FLINK-20163][docs-zh] Translate page "raw format" into Chinese
flinkbot edited a comment on pull request #14075: URL: https://github.com/apache/flink/pull/14075#issuecomment-727598953 ## CI report: * 08214d0608cf8a45395d003dc136a3e51eb1db68 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9595) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9607) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #14077: [FLINK-20141][fs-connector] Translate FileSink document into Chinese
flinkbot edited a comment on pull request #14077: URL: https://github.com/apache/flink/pull/14077#issuecomment-727773204 ## CI report: * 605e600a9ed9a0fec3cdd872dfa6208626f0a087 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9606) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #14068: [FLINK-200137][python] Emit timestamps of current records to downstream in PythonFunctionOperator.
flinkbot edited a comment on pull request #14068: URL: https://github.com/apache/flink/pull/14068#issuecomment-726851708 ## CI report: * 5babde60923df8d09e936677752ac1140f69993f Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9592) * 04c250c0c8c9489eb622c91f057a7e5a496f1f35 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9605) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] wangyang0918 commented on a change in pull request #14006: [FLINK-19546][doc] Add documentation for native Kubernetes HA
wangyang0918 commented on a change in pull request #14006: URL: https://github.com/apache/flink/pull/14006#discussion_r523933632 ## File path: docs/ops/jobmanager_high_availability.md ## @@ -215,6 +215,76 @@ Starting zookeeper daemon on host localhost. $ bin/yarn-session.sh -n 2 +## Kubernetes Cluster High Availability +When running Flink JobManager as a Kubernetes deployment, the replica count should be configured to 1 or greater. +* The value `1` means that a new JobManager will be launched to take over leadership if the current one terminates exceptionally. +* The value `N` (greater than 1) means that multiple JobManagers will be launched simultaneously while one is active and others are standby. Starting more than one JobManager will make the recovery faster. + +### Configuration +{% highlight yaml %} +kubernetes.cluster-id: +high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory +high-availability.storageDir: hdfs:///flink/recovery +{% endhighlight %} + + Example: Highly Available Standalone Flink Cluster on Kubernetes +Both session and job/application clusters support using the Kubernetes high availability service. Users just need to add the following Flink config options to [flink-configuration-configmap.yaml]({{ site.baseurl}}/ops/deployment/kubernetes.html#common-cluster-resource-definitions). All other yamls do not need to be updated. + +Note The filesystem which corresponds to the scheme of your configured HA storage directory must be available to the runtime. Refer to [custom Flink image]({{ site.baseurl}}/ops/deployment/docker.html#customize-flink-image) and [enable plugins]({{ site.baseurl}}/ops/deployment/docker.html#using-plugins) for more information. + +{% highlight yaml %} +apiVersion: v1 +kind: ConfigMap +metadata: + name: flink-config + labels: +app: flink +data: + flink-conf.yaml: |+ + ... 
+kubernetes.cluster-id: +high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory +high-availability.storageDir: hdfs:///flink/recovery +restart-strategy: fixed-delay +restart-strategy.fixed-delay.attempts: 10 + ... +{% endhighlight %} + + Example: Highly Available Native Kubernetes Cluster +Using the following command to start a native Flink application cluster on Kubernetes with high availability configured. +{% highlight bash %} +$ ./bin/flink run-application -p 8 -t kubernetes-application \ + -Dkubernetes.cluster-id= \ + -Dtaskmanager.memory.process.size=4096m \ + -Dkubernetes.taskmanager.cpu=2 \ + -Dtaskmanager.numberOfTaskSlots=4 \ + -Dkubernetes.container.image= \ + -Dhigh-availability=org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory \ + -Dhigh-availability.storageDir=s3://flink/flink-ha \ + -Drestart-strategy=fixed-delay -Drestart-strategy.fixed-delay.attempts=10 \ + -Dcontainerized.master.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-{{site.version}}.jar \ + -Dcontainerized.taskmanager.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-{{site.version}}.jar \ + local:///opt/flink/examples/streaming/StateMachineExample.jar +{% endhighlight %} + +### High Availability Data Clean Up +Currently, when a Flink job reached the terminal state (`FAILED`, `CANCELED`, `FINISHED`), all the HA data, including metadata in Kubernetes ConfigMap and HA state on DFS, will be cleaned up. + +So the following command will only shut down the Flink session cluster and leave all the HA related ConfigMaps, state untouched. +{% highlight bash %} +$ echo 'stop' | ./bin/kubernetes-session.sh -Dkubernetes.cluster-id= -Dexecution.attached=true +{% endhighlight %} + +The following commands will cancel the job in application or session cluster and effectively remove all its HA data. 
+{% highlight bash %} +# Cancel a Flink job in the existing session +$ ./bin/flink cancel -t kubernetes-session -Dkubernetes.cluster-id= +# Cancel a Flink application +$ ./bin/flink cancel -t kubernetes-application -Dkubernetes.cluster-id= +{% endhighlight %} + +To keep HA data while restarting the Flink cluster, simply delete the deploy (via `kubectl delete deploy `). All the Flink cluster related resources will be destroyed (e.g. JobManager Deployment, TaskManager pods, services, Flink conf ConfigMap), as to not occupy the Kubernetes cluster resources. However, HA related ConfigMaps do not set the owner reference and they will be retained. When restarting the session / application, use `kubernetes-session.sh` or `flink run-application`. All the previous suspending running jobs will recover from the latest checkpoint successfully. Review comment: > I don't understand the subclause, as to not occupy resources. What I want to state here is that we will free all the allocated resources from K8s so that they could be used by others. At the same time, we still have the ability to recover the Flink application at any time.
[GitHub] [flink] zhuzhurk commented on a change in pull request #14020: [FLINK-20078][coordination] Factor out an ExecutionGraph factory method for DefaultExecutionTopology
zhuzhurk commented on a change in pull request #14020: URL: https://github.com/apache/flink/pull/14020#discussion_r523932872 ## File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adapter/DefaultExecutionTopology.java ## @@ -140,6 +115,53 @@ public DefaultSchedulingPipelinedRegion getPipelinedRegionOfVertex(final Executi return pipelinedRegion; } + public static DefaultExecutionTopology fromExecutionGraph(ExecutionGraph executionGraph) { + checkNotNull(executionGraph, "execution graph can not be null"); + + ExecutionGraphIndex executionGraphIndex = computeExecutionGraphIndex( + executionGraph.getAllExecutionVertices(), + executionGraph.getNumberOfExecutionJobVertices()); Review comment: ```suggestion executionGraph.getTotalNumberOfVertices()); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #14061: [FLINK-20141][fs-connector] Add FileSink documentation
guoweiM commented on pull request #14061: URL: https://github.com/apache/flink/pull/14061#issuecomment-727779205 > The translation is in #14077, @guoweiM could you also have a look at the Chinese translation ? ok. I will take a look. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-12884) FLIP-144: Native Kubernetes HA Service
[ https://issues.apache.org/jira/browse/FLINK-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232568#comment-17232568 ] Yang Wang commented on FLINK-12884:
-----------------------------------
[~ksp0422] Thanks for your suggestion. I second your idea and am trying to add an E2E test to cover the whole process.
* Start a Flink application with HA configured
* The Flink job completes checkpoints successfully
* Kill the JobManager
* A new one should be launched and take over the leadership
* The Flink job should be recovered from the latest checkpoint successfully

> FLIP-144: Native Kubernetes HA Service
> --------------------------------------
>
>                 Key: FLINK-12884
>                 URL: https://issues.apache.org/jira/browse/FLINK-12884
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes, Runtime / Coordination
>            Reporter: MalcolmSanders
>            Assignee: Yang Wang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Currently Flink only supports a HighAvailabilityService using ZooKeeper. As a result, it requires a ZooKeeper cluster to be deployed on the k8s cluster if our customers need high availability for Flink. If we support a HighAvailabilityService based on native k8s APIs, it will save the effort of ZooKeeper deployment as well as the resources used by the ZooKeeper cluster. It might be especially helpful for customers who run small-scale k8s clusters, so that the Flink HighAvailabilityService does not cause too much overhead on their k8s clusters.
> Previously, [FLINK-11105|https://issues.apache.org/jira/browse/FLINK-11105] proposed a HighAvailabilityService using etcd. As [~NathanHowell] suggested in FLINK-11105, since k8s doesn't expose its own etcd cluster by design (see [Securing etcd clusters|https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#securing-etcd-clusters]), it also requires the deployment of an etcd cluster if Flink uses etcd to achieve HA.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [flink] metaotao commented on pull request #14075: [FLINK-20163][docs-zh] Translate page "raw format" into Chinese
metaotao commented on pull request #14075: URL: https://github.com/apache/flink/pull/14075#issuecomment-727775897 @flinkbot run azure This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] gaoyunhaii commented on a change in pull request #14061: [FLINK-20141][fs-connector] Add FileSink documentation
gaoyunhaii commented on a change in pull request #14061: URL: https://github.com/apache/flink/pull/14061#discussion_r523923047 ## File path: docs/dev/connectors/streamfile_sink.md ## @@ -765,6 +765,7 @@ and Flink will throw an exception. normal job termination (*e.g.* finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the "finished" state. +TODO I think this is wrong. Review comment: Why might this be wrong? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] gaoyunhaii edited a comment on pull request #14061: [FLINK-20141][fs-connector] Add FileSink documentation
gaoyunhaii edited a comment on pull request #14061: URL: https://github.com/apache/flink/pull/14061#issuecomment-727768552 The translation is in https://github.com/apache/flink/pull/14077, @guoweiM could you also have a look at the Chinese translation ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot commented on pull request #14077: [FLINK-20141][fs-connector] Translate FileSink document into Chinese
flinkbot commented on pull request #14077: URL: https://github.com/apache/flink/pull/14077#issuecomment-727773204 ## CI report: * 605e600a9ed9a0fec3cdd872dfa6208626f0a087 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
flinkbot edited a comment on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727741544 ## CI report: * 6082a1d0c0f4274e553de47fae1285537c38ace8 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9602) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9603) * Unknown: [CANCELED](TBD) * a88211378f8e89d1ed89a22795643210352ebe81 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9604) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #14068: [FLINK-200137][python] Emit timestamps of current records to downstream in PythonFunctionOperator.
flinkbot edited a comment on pull request #14068: URL: https://github.com/apache/flink/pull/14068#issuecomment-726851708 ## CI report: * 5babde60923df8d09e936677752ac1140f69993f Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9592) * 04c250c0c8c9489eb622c91f057a7e5a496f1f35 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot commented on pull request #14077: [FLINK-20141][fs-connector] Translate FileSink document into Chinese
flinkbot commented on pull request #14077: URL: https://github.com/apache/flink/pull/14077#issuecomment-727769142 Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review. ## Automated Checks Last check on commit 605e600a9ed9a0fec3cdd872dfa6208626f0a087 (Mon Nov 16 06:36:55 UTC 2020) ✅no warnings Mention the bot in a comment to re-run the automated checks. ## Review Progress * ❓ 1. The [description] looks good. * ❓ 2. There is [consensus] that the contribution should go into to Flink. * ❓ 3. Needs [attention] from. * ❓ 4. The change fits into the overall [architecture]. * ❓ 5. Overall code [quality] is good. Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands The @flinkbot bot supports the following commands: - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`) - `@flinkbot approve all` to approve all aspects - `@flinkbot approve-until architecture` to approve everything until `architecture` - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention - `@flinkbot disapprove architecture` to remove an approval you gave earlier This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] gaoyunhaii edited a comment on pull request #14061: [FLINK-20141][fs-connector] Add FileSink documentation
gaoyunhaii edited a comment on pull request #14061: URL: https://github.com/apache/flink/pull/14061#issuecomment-727768552 The translation is in https://github.com/apache/flink/pull/14077 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] gaoyunhaii commented on pull request #14061: [FLINK-20141][fs-connector] Add FileSink documentation
gaoyunhaii commented on pull request #14061: URL: https://github.com/apache/flink/pull/14061#issuecomment-727768552 The translation is in `https://github.com/apache/flink/pull/14077` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] gaoyunhaii opened a new pull request #14077: Pr14061 add doc zh
gaoyunhaii opened a new pull request #14077: URL: https://github.com/apache/flink/pull/14077 ## What is the purpose of the change Translating the `FileSink` document into Chinese. ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): **no** - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no** - The serializers: **no** - The runtime per-record code paths (performance sensitive): **no** - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: **no** - The S3 file system connector: **no** ## Documentation - Does this pull request introduce a new feature? **no** - If yes, how is the feature documented? **not applicable** This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
flinkbot edited a comment on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727741544 ## CI report: * 6082a1d0c0f4274e553de47fae1285537c38ace8 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9602) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9603) * a88211378f8e89d1ed89a22795643210352ebe81 UNKNOWN * Unknown: [CANCELED](TBD) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-20143) use `yarn.provided.lib.dirs` config deploy job failed in yarn per job mode
[ https://issues.apache.org/jira/browse/FLINK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232558#comment-17232558 ] Yang Wang commented on FLINK-20143: --- Hmm. I think maybe I find the root cause. When the {{yarn.provided.lib.dirs}} is set to the non-qualified path(e.g. hdfs:///path/of/sharedLib), the {{URI#relativize}} in {{YarnApplicationFileUploader#getAllFilesInProvidedLibDirs}} could not work as expected. [~zhisheng] So for your situation, I guess all the deployment(e.g. yarn-per-job, yarn-application, yarn-session) could not work effectively if you are using non-qualified path. [~kkl0u] Even though we have a work around, specify a qualified path(e.g. hdfs://hdpdev/path/of/sharedLib), I think it is better to fix this issue. I will attach a PR for this ticket. > use `yarn.provided.lib.dirs` config deploy job failed in yarn per job mode > -- > > Key: FLINK-20143 > URL: https://issues.apache.org/jira/browse/FLINK-20143 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission, Deployment / YARN >Affects Versions: 1.12.0 >Reporter: zhisheng >Priority: Major > Attachments: image-2020-11-13-17-21-47-751.png, > image-2020-11-13-17-22-06-111.png, image-2020-11-13-18-43-55-188.png > > > use follow command deploy flink job to yarn failed > {code:java} > ./bin/flink run -m yarn-cluster -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jar > {code} > log: > {code:java} > $ ./bin/flink run -m yarn-cluster -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jar$ ./bin/flink run -m yarn-cluster > -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jarSLF4J: Class path contains > multiple SLF4J bindings.SLF4J: Found 
binding in > [jar:file:/data1/app/flink-1.12-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > Found binding in > [jar:file:/data1/app/hadoop-2.7.3-snappy-32core12disk/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > Found binding in > [jar:file:/data1/app/hadoop-2.7.3-snappy-32core12disk/share/hadoop/tools/lib/hadoop-aliyun-2.9.2-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation.SLF4J: Actual binding is of type > [org.apache.logging.slf4j.Log4jLoggerFactory]2020-11-13 16:14:30,347 INFO > org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Dynamic > Property set: > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib2020-11-13 > 16:14:30,347 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli > [] - Dynamic Property set: > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/libUsage with > built-in data generator: StateMachineExample [--error-rate > ] [--sleep ]Usage > with Kafka: StateMachineExample --kafka-topic [--brokers > ]Options for both the above setups: [--backend ] > [--checkpoint-dir ] [--async-checkpoints ] > [--incremental-checkpoints ] [--output OR null for > stdout] > Using standalone source with error rate 0.00 and sleep delay 1 millis > 2020-11-13 16:14:30,706 WARN > org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The > configuration directory ('/data1/app/flink-1.12-SNAPSHOT/conf') already > contains a LOG4J config file.If you want to use logback, then please delete > or rename the log configuration file.2020-11-13 16:14:30,947 INFO > org.apache.hadoop.yarn.client.AHSProxy [] - Connecting > to Application History server at > FAT-hadoopuat-69117.vm.dc01.tech/10.69.1.17:102002020-11-13 16:14:30,958 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - No path > for the flink jar passed. 
Using the location of class > org.apache.flink.yarn.YarnClusterDescriptor to locate the jar2020-11-13 > 16:14:31,065 INFO > org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] - Failing > over to rm22020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - The > configured JobManager memory is 3072 MB. YARN will allocate 4096 MB to make > up an integer multiple of its minimum allocation memory (2048 MB, configured > via 'yarn.scheduler.minimum-allocation-mb'). The extra 1024 MB may not be > used by Flink.2020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - The > configur
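The relativization failure Yang Wang describes can be reproduced with plain `java.net.URI`: when the configured base directory is non-qualified (empty authority, `hdfs:///...`) but the files listed by the filesystem come back fully qualified, `URI#relativize` silently returns its argument instead of a relative path. A minimal sketch; the paths below are illustrative, not taken from the ticket's logs:

```java
import java.net.URI;

public class RelativizeCheck {

    // Relativize child against base, the operation the uploader relies on
    // to compute a file's name relative to the provided lib directory.
    public static String relativize(String base, String child) {
        return URI.create(base).relativize(URI.create(child)).toString();
    }

    public static void main(String[] args) {
        // Fully-qualified base and child: relativization succeeds.
        System.out.println(relativize(
                "hdfs://hdpdev/flink/lib/",
                "hdfs://hdpdev/flink/lib/flink-dist.jar")); // flink-dist.jar

        // Non-qualified base (empty authority) vs. qualified child: the
        // authorities differ, so relativize() returns the child unchanged,
        // and no relative name can be derived for the file.
        System.out.println(relativize(
                "hdfs:///flink/lib/",
                "hdfs://hdpdev/flink/lib/flink-dist.jar"));
    }
}
```

Per the `java.net.URI` contract, relativization only proceeds when scheme and authority of both URIs are identical, which is why the qualified-path workaround avoids the problem.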
[GitHub] [flink] flinkbot edited a comment on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
flinkbot edited a comment on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727741544 ## CI report: * 6082a1d0c0f4274e553de47fae1285537c38ace8 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9602) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9603) * Unknown: [CANCELED](TBD) * a88211378f8e89d1ed89a22795643210352ebe81 UNKNOWN
[GitHub] [flink] xiaoHoly commented on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
xiaoHoly commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727752237 Hi @tillrohrmann, PTAL.
[GitHub] [flink] xiaoHoly commented on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
xiaoHoly commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727750065 > @xiaoHoly It will trigger the test automatically when you push a new commit. Usually there is no need to trigger the test manually. Besides, you can use the command `@flinkbot run azure` instead of `run azure re-run the last Azure build` to trigger the tests. Thanks for your advice.
[GitHub] [flink] dianfu commented on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
dianfu commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727749697 I'm not quite familiar with this part. cc @tillrohrmann
[GitHub] [flink] dianfu commented on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
dianfu commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727748627 @xiaoHoly It will trigger the test automatically when you push a new commit. Usually there is no need to trigger the test manually. Besides, you can use the command `@flinkbot run azure` instead of `run azure re-run the last Azure build` to trigger the tests.
[GitHub] [flink] flinkbot edited a comment on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
flinkbot edited a comment on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727741544 ## CI report: * 6082a1d0c0f4274e553de47fae1285537c38ace8 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9602) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9603) * Unknown: [CANCELED](TBD)
[GitHub] [flink] xiaoHoly commented on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
xiaoHoly commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727747755 @flinkbot run azure re-run the last Azure build
[GitHub] [flink] xiaoHoly commented on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
xiaoHoly commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727747601 @flinkbot run travis re-run the last Travis build
[jira] [Commented] (FLINK-19775) SystemProcessingTimeServiceTest.testImmediateShutdown is instable
[ https://issues.apache.org/jira/browse/FLINK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232536#comment-17232536 ] jiawen xiao commented on FLINK-19775: - Hi,[~dian.fu], Looking forward to your follow-up to this pr (https://github.com/apache/flink/pull/14076) > SystemProcessingTimeServiceTest.testImmediateShutdown is instable > - > > Key: FLINK-19775 > URL: https://issues.apache.org/jira/browse/FLINK-19775 > Project: Flink > Issue Type: Bug > Components: API / DataStream >Affects Versions: 1.11.0 >Reporter: Dian Fu >Assignee: jiawen xiao >Priority: Major > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8131&view=logs&j=d89de3df-4600-5585-dadc-9bbc9a5e661c&t=66b5c59a-0094-561d-0e44-b149dfdd586d > {code} > 2020-10-22T21:12:54.9462382Z [ERROR] > testImmediateShutdown(org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest) > Time elapsed: 0.009 s <<< ERROR! > 2020-10-22T21:12:54.9463024Z java.lang.InterruptedException > 2020-10-22T21:12:54.9463331Z at java.lang.Object.wait(Native Method) > 2020-10-22T21:12:54.9463766Z at java.lang.Object.wait(Object.java:502) > 2020-10-22T21:12:54.9464140Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63) > 2020-10-22T21:12:54.9466014Z at > org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest.testImmediateShutdown(SystemProcessingTimeServiceTest.java:154) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-19775) SystemProcessingTimeServiceTest.testImmediateShutdown is instable
[ https://issues.apache.org/jira/browse/FLINK-19775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-19775: --- Labels: pull-request-available test-stability (was: test-stability) > SystemProcessingTimeServiceTest.testImmediateShutdown is instable > - > > Key: FLINK-19775 > URL: https://issues.apache.org/jira/browse/FLINK-19775 > Project: Flink > Issue Type: Bug > Components: API / DataStream >Affects Versions: 1.11.0 >Reporter: Dian Fu >Assignee: jiawen xiao >Priority: Major > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8131&view=logs&j=d89de3df-4600-5585-dadc-9bbc9a5e661c&t=66b5c59a-0094-561d-0e44-b149dfdd586d > {code} > 2020-10-22T21:12:54.9462382Z [ERROR] > testImmediateShutdown(org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest) > Time elapsed: 0.009 s <<< ERROR! > 2020-10-22T21:12:54.9463024Z java.lang.InterruptedException > 2020-10-22T21:12:54.9463331Z at java.lang.Object.wait(Native Method) > 2020-10-22T21:12:54.9463766Z at java.lang.Object.wait(Object.java:502) > 2020-10-22T21:12:54.9464140Z at > org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63) > 2020-10-22T21:12:54.9466014Z at > org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest.testImmediateShutdown(SystemProcessingTimeServiceTest.java:154) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] flinkbot commented on pull request #14076: [FLINK-19775][test] SystemProcessingTimeServiceTest.testImmediateShutdown is instable
flinkbot commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727741544 ## CI report: * 6082a1d0c0f4274e553de47fae1285537c38ace8 UNKNOWN
[GitHub] [flink] xiaoHoly commented on pull request #14076: Pr holy patch
xiaoHoly commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727739140 @flinkbot run travis re-run the last Travis build
[GitHub] [flink] xiaoHoly commented on pull request #14076: Pr holy patch
xiaoHoly commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727739202 @flinkbot run azure re-run the last Azure build
[GitHub] [flink] xiaoHoly commented on pull request #14076: Pr holy patch
xiaoHoly commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727737851 Hi @dianfu, I have finished my work on this issue. Could you help me review my code?
[GitHub] [flink] flinkbot commented on pull request #14076: Pr holy patch
flinkbot commented on pull request #14076: URL: https://github.com/apache/flink/pull/14076#issuecomment-727737772 Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review. ## Automated Checks Last check on commit 6082a1d0c0f4274e553de47fae1285537c38ace8 (Mon Nov 16 05:10:09 UTC 2020) **Warnings:** * No documentation files were touched! Remember to keep the Flink docs up to date! * **Invalid pull request title: No valid Jira ID provided** Mention the bot in a comment to re-run the automated checks. ## Review Progress * ❓ 1. The [description] looks good. * ❓ 2. There is [consensus] that the contribution should go into Flink. * ❓ 3. Needs [attention] from. * ❓ 4. The change fits into the overall [architecture]. * ❓ 5. Overall code [quality] is good. Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required Bot commands The @flinkbot bot supports the following commands: - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`) - `@flinkbot approve all` to approve all aspects - `@flinkbot approve-until architecture` to approve everything until `architecture` - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention - `@flinkbot disapprove architecture` to remove an approval you gave earlier
[GitHub] [flink] xiaoHoly opened a new pull request #14076: Pr holy patch
xiaoHoly opened a new pull request #14076: URL: https://github.com/apache/flink/pull/14076 ## What is the purpose of the change Repair the unstable test SystemProcessingTimeServiceTest.testImmediateShutdown, as described in FLINK-19775. ## Brief change log - change flink/flink-test-utils-parent/flink-test-utils-junit/src/main/java/org/apache/flink/core/testutils/OneShotLatch.java ## Verifying this change This change is already covered by existing tests, such as SystemProcessingTimeServiceTest.testImmediateShutdown. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (no)
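For context, the utility this PR touches is a one-shot latch whose await() blocks on Object.wait(); the stack trace in FLINK-19775 shows the InterruptedException surfacing from exactly such a wait. A simplified, hypothetical latch of that shape (names are illustrative and do not mirror Flink's actual OneShotLatch):

```java
// Simplified sketch of a one-shot latch; field and method names are
// illustrative, not copied from Flink's OneShotLatch implementation.
public class SimpleOneShotLatch {

    private final Object lock = new Object();
    private volatile boolean triggered;

    // Releases every current and future await() caller; a one-shot latch
    // never resets.
    public void trigger() {
        synchronized (lock) {
            triggered = true;
            lock.notifyAll();
        }
    }

    // Blocks until trigger() is called. The while-loop guards against
    // spurious wakeups; a thread interrupt propagates to the caller, which
    // is the exception the flaky test observed when shutdown raced await().
    public void await() throws InterruptedException {
        synchronized (lock) {
            while (!triggered) {
                lock.wait();
            }
        }
    }

    public boolean isTriggered() {
        return triggered;
    }
}
```

The wait/notifyAll pairing under a single monitor is what makes the latch safe; the race the ticket describes lives in how the test coordinates shutdown with a thread parked inside await().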
[GitHub] [flink] flinkbot edited a comment on pull request #13998: [FLINK-20062][hive] ContinuousHiveSplitEnumerator should be lock-free
flinkbot edited a comment on pull request #13998: URL: https://github.com/apache/flink/pull/13998#issuecomment-724001789 ## CI report: * 73aaf9862a1d30857c4605b26a8997eeb11281c1 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9351) * 530e65e61ae94f952ffb682f0573071724da27d6 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9601)
[GitHub] [flink] flinkbot edited a comment on pull request #13998: [FLINK-20062][hive] ContinuousHiveSplitEnumerator should be lock-free
flinkbot edited a comment on pull request #13998: URL: https://github.com/apache/flink/pull/13998#issuecomment-724001789 ## CI report: * 73aaf9862a1d30857c4605b26a8997eeb11281c1 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9351) * 530e65e61ae94f952ffb682f0573071724da27d6 UNKNOWN
[GitHub] [flink] flinkbot edited a comment on pull request #13900: [FLINK-19949][csv] Unescape CSV format line delimiter character
flinkbot edited a comment on pull request #13900: URL: https://github.com/apache/flink/pull/13900#issuecomment-720989833 ## CI report: * d9e47451664a976692e4e6110bbffa2842f2ae7a UNKNOWN * c9a24a3ac861addff4a26ffa47cffd38db7427ae Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9600)
[GitHub] [flink] flinkbot edited a comment on pull request #13900: [FLINK-19949][csv] Unescape CSV format line delimiter character
flinkbot edited a comment on pull request #13900: URL: https://github.com/apache/flink/pull/13900#issuecomment-720989833 ## CI report: * 9dc60ef8af4b2a4e11c749a817b454c983f938aa Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8969) * d9e47451664a976692e4e6110bbffa2842f2ae7a UNKNOWN * c9a24a3ac861addff4a26ffa47cffd38db7427ae UNKNOWN
[jira] [Comment Edited] (FLINK-20038) Rectify the usage of ResultPartitionType#isPipelined() in partition tracker.
[ https://issues.apache.org/jira/browse/FLINK-20038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232495#comment-17232495 ] Jin Xing edited comment on FLINK-20038 at 11/16/20, 3:56 AM: - Hi [~trohrmann] [~ym] Thanks a lot for your feedback and sorry for late reply, was busy during 11.11 shopping festival support ~ We indeed need a proper design for what we want to support and how it could be mapped to properties. The characteristics of shuffle manner is exposed by ResultPartitionType. We should sort out what should be exposed to scheduling layer, which should be respected; and what are implementation details, which should be kept inside; >From my side, I think there are 3 kinds of properties should be exposed to >scheduling layer: # *_+Writing property+_* – – how the shuffle service is writable, that's what '_hasBackpressure_' indicates, but seems that it's used nowhere and blurred with the property of 'isPipeliened'. In current code, Flink assumes that if a shuffle is '_isPipiplined=true_', it's always '_hasBackpressure=true_', which is not true from the concept. # *_+Reading property+_* – – when the shuffle data is readable, that's what 'isPipelined' and '_isBlocking_' indicates. They are used when dividing PipelinedRegion. My concern is how we define the meaning of a PIpelinedRegion in short. From my understanding, a PipelinedRegion is a set of vertices which should be scheduled together because of back pressure and the internal data flow can be consumed before task finish. If my understanding is correct, when judge whether two vertices should be divided into the same region, should the condition be '_isPipeliend=true && hasBackpressure=true_', rather than the current impl of ([PipelinedRegionComputeUtil|#L158])] (_producedResult.getResultType().isPipelined()_) ? # *+_Data lifecycle_+* – – GC could happen a). after data consumption; b). 
after job finished; c) by recycle manually If above classification is valid, we need to rectify some misuse in scheduling layer and give full respect to shuffle property. was (Author: jinxing6...@126.com): Hi [~trohrmann] [~ym] Thanks a lot for your feedback and sorry for late reply, was busy during 11.11 shopping festival support ~ We indeed need a proper design for what we want to support and how it could be mapped to properties. The characteristics of shuffle manner is exposed by ResultPartitionType. We should sort out what should be exposed to scheduling layer, which should be respected; and what are implementation details, which should be kept inside; >From my side, I think there are 3 kinds of properties should be exposed to >scheduling layer: # *_+Writing property+_* – – when the shuffle service is writable, that's what '_hasBackpressure_' indicates, but seems that it's used nowhere and blurred with the property of 'isPipeliened'. In current code, Flink assumes that if a shuffle is '_isPipiplined=true_', it's always '_hasBackpressure=true_', which is not true from the concept. # *_+Reading property+_* – – when the shuffle data is readable, that's what 'isPipelined' and '_isBlocking_' indicates. They are used when dividing PipelinedRegion. My concern is how we define the meaning of a PIpelinedRegion in short. From my understanding, a PipelinedRegion is a set of vertices which should be scheduled together because of back pressure and the internal data flow can be consumed before task finish. If my understanding is correct, when judge whether two vertices should be divided into the same region, should the condition be '_isPipeliend=true && hasBackpressure=true_', rather than the current impl of ([PipelinedRegionComputeUtil|#L158])] (_producedResult.getResultType().isPipelined()_) ? # *+_Data lifecycle_+* – – GC could happen a). after data consumption; b). 
after job finished; c) by recycle manually If above classification is valid, we need to rectify some misuse in scheduling layer and give full respect to shuffle property. > Rectify the usage of ResultPartitionType#isPipelined() in partition tracker. > > > Key: FLINK-20038 > URL: https://issues.apache.org/jira/browse/FLINK-20038 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination, Runtime / Network >Reporter: Jin Xing >Priority: Major > > After "FLIP-31: Pluggable Shuffle Service", users can extend and plug in new > shuffle manner, thus to benefit different scenarios. New shuffle manner tend > to bring in new abilities which could be leveraged by scheduling layer to > provide better performance. > From my understanding, the characteristics of shuffle manner is exposed by > ResultPartitionType (e.g. isPipelined, isBlocking, hasBackPressure .
[jira] [Comment Edited] (FLINK-20038) Rectify the usage of ResultPartitionType#isPipelined() in partition tracker.
[ https://issues.apache.org/jira/browse/FLINK-20038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232495#comment-17232495 ] Jin Xing edited comment on FLINK-20038 at 11/16/20, 3:54 AM: - Hi [~trohrmann] [~ym] Thanks a lot for your feedback and sorry for late reply, was busy during 11.11 shopping festival support ~ We indeed need a proper design for what we want to support and how it could be mapped to properties. The characteristics of shuffle manner is exposed by ResultPartitionType. We should sort out what should be exposed to scheduling layer, which should be respected; and what are implementation details, which should be kept inside; >From my side, I think there are 3 kinds of properties should be exposed to >scheduling layer: # *_+Writing property+_* – – when the shuffle service is writable, that's what '_hasBackpressure_' indicates, but seems that it's used nowhere and blurred with the property of 'isPipeliened'. In current code, Flink assumes that if a shuffle is '_isPipiplined=true_', it's always '_hasBackpressure=true_', which is not true from the concept. # *_+Reading property+_* – – when the shuffle data is readable, that's what 'isPipelined' and '_isBlocking_' indicates. They are used when dividing PipelinedRegion. My concern is how we define the meaning of a PIpelinedRegion in short. From my understanding, a PipelinedRegion is a set of vertices which should be scheduled together because of back pressure and the internal data flow can be consumed before task finish. If my understanding is correct, when judge whether two vertices should be divided into the same region, should the condition be '_isPipeliend=true && hasBackpressure=true_', rather than the current impl of ([PipelinedRegionComputeUtil|#L158])] (_producedResult.getResultType().isPipelined()_) ? # *+_Data lifecycle_+* – – GC could happen a). after data consumption; b). 
after job finished; c) by recycle manually If above classification is valid, we need to rectify some misuse in scheduling layer and give full respect to shuffle property. was (Author: jinxing6...@126.com): Hi [~trohrmann] [~ym] Thanks a lot for your feedback and sorry for late reply, was busy during 11.11 shopping festival support ~ We indeed need a proper design for what we want to support and how it could be mapped to properties. The characteristics of shuffle manner is exposed by ResultPartitionType. We should sort out what should be exposed to scheduling layer, which should be respected; and what is implementation details, which should be kept inside; >From my side, I think there are 3 kinds of properties should be exposed to >scheduling layer: # *_+Writing property+_* – – when the shuffle service is writable, that's what '_hasBackpressure_' indicates, but seems that it's used nowhere and blurred with the property of 'isPipeliened'. In current code, Flink assumes that if a shuffle is '_isPipiplined=true_', it's always '_hasBackpressure=true_', which is not true from the concept. # *_+Reading property+_* – – when the shuffle data is readable, that's what 'isPipelined' and '_isBlocking_' indicates. They are used when dividing PipelinedRegion. My concern is how we define the meaning of a PIpelinedRegion in short. From my understanding, a PipelinedRegion is a set of vertices which should be scheduled together because of back pressure and the internal data flow can be consumed before task finish. If my understanding is correct, when judge whether two vertices should be divided into the same region, should the condition be '_isPipeliend=true && hasBackpressure=true_', rather than the current impl of ([PipelinedRegionComputeUtil|[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/PipelinedRegionComputeUtil.java#L158])] (_producedResult.getResultType().isPipelined()_) ? 
# *+_Data lifecycle_+* – – GC could happen a). after data consumption; b). after job finished; c) by recycle manually If above classification is valid, we need to rectify some misuse in scheduling layer and give full respect to shuffle property. > Rectify the usage of ResultPartitionType#isPipelined() in partition tracker. > > > Key: FLINK-20038 > URL: https://issues.apache.org/jira/browse/FLINK-20038 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination, Runtime / Network >Reporter: Jin Xing >Priority: Major > > After "FLIP-31: Pluggable Shuffle Service", users can extend and plug in new > shuffle manner, thus to benefit different scenarios. New shuffle manner tend > to bring in new abilities which could be leveraged by scheduling layer to > provide better pe
[jira] [Commented] (FLINK-20038) Rectify the usage of ResultPartitionType#isPipelined() in partition tracker.
[ https://issues.apache.org/jira/browse/FLINK-20038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232495#comment-17232495 ] Jin Xing commented on FLINK-20038: -- Hi [~trohrmann] [~ym] Thanks a lot for your feedback and sorry for late reply, was busy during 11.11 shopping festival support ~ We indeed need a proper design for what we want to support and how it could be mapped to properties. The characteristics of shuffle manner is exposed by ResultPartitionType. We should sort out what should be exposed to scheduling layer, which should be respected; and what is implementation details, which should be kept inside; >From my side, I think there are 3 kinds of properties should be exposed to >scheduling layer: # *_+Writing property+_* – – when the shuffle service is writable, that's what '_hasBackpressure_' indicates, but seems that it's used nowhere and blurred with the property of 'isPipeliened'. In current code, Flink assumes that if a shuffle is '_isPipiplined=true_', it's always '_hasBackpressure=true_', which is not true from the concept. # *_+Reading property+_* – – when the shuffle data is readable, that's what 'isPipelined' and '_isBlocking_' indicates. They are used when dividing PipelinedRegion. My concern is how we define the meaning of a PIpelinedRegion in short. From my understanding, a PipelinedRegion is a set of vertices which should be scheduled together because of back pressure and the internal data flow can be consumed before task finish. If my understanding is correct, when judge whether two vertices should be divided into the same region, should the condition be '_isPipeliend=true && hasBackpressure=true_', rather than the current impl of ([PipelinedRegionComputeUtil|[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/PipelinedRegionComputeUtil.java#L158])] (_producedResult.getResultType().isPipelined()_) ? 
# *+_Data lifecycle_+* – – GC could happen a). after data consumption; b). after job finished; c) by recycle manually If above classification is valid, we need to rectify some misuse in scheduling layer and give full respect to shuffle property. > Rectify the usage of ResultPartitionType#isPipelined() in partition tracker. > > > Key: FLINK-20038 > URL: https://issues.apache.org/jira/browse/FLINK-20038 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination, Runtime / Network >Reporter: Jin Xing >Priority: Major > > After "FLIP-31: Pluggable Shuffle Service", users can extend and plug in new > shuffle manner, thus to benefit different scenarios. New shuffle manner tend > to bring in new abilities which could be leveraged by scheduling layer to > provide better performance. > From my understanding, the characteristics of shuffle manner is exposed by > ResultPartitionType (e.g. isPipelined, isBlocking, hasBackPressure ...), and > leveraged by scheduling layer to conduct job. But seems that Flink doesn't > provide a way to describe the new characteristics from a plugged in shuffle > manner. I also find that scheduling layer have some weak assumptions on > ResultPartitionType. I detail by below example. > In our internal Flink, we develop a new shuffle manner for batch jobs. > Characteristics can be summarized as below briefly: > 1. Upstream task shuffle writes data to DISK; > 2. Upstream task commits data while producing and notify "consumable" to > downstream BEFORE task finished; > 3. Downstream is notified when upstream data is consumable and can be > scheduled according to resource; > 4. When downstream task failover, only itself needs to be restarted because > upstream data is written into disk and replayable; > We can tell the character of this new shuffle manner as: > a. isPipelined=true – downstream task can consume data before upstream > finished; > b. 
hasBackPressure=false – the upstream task shuffle-writes data to disk and can > finish by itself regardless of whether a downstream task consumes the data in > time. > But the above new ResultPartitionType (isPipelined=true, hasBackPressure=false) > seems to contradict the partition lifecycle management in the current scheduling > layer: > 1. The above new shuffle manner needs the partition tracker for lifecycle > management, but current Flink assumes that ALL "isPipelined=true" result > partitions are released on consumption and will not be taken care of by the > partition tracker > ([link|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/JobMasterPartitionTrackerImpl.java#L66]) > – this assumption does not hold for this case. > From my understanding, the method of ResultPartitionType#isPipel
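The condition proposed in the comment above can be contrasted with the current one in a tiny, self-contained sketch. This is plain Java for illustration only, not Flink code; the method names are made up:

```java
// Illustrative only: contrasts the current region-grouping condition with the
// one proposed above. These helpers are hypothetical, not Flink's API.
public class RegionGroupingSketch {

    // Current implementation (approximately): a producer and its consumer land
    // in the same pipelined region whenever the result is pipelined.
    public static boolean sameRegionCurrent(boolean isPipelined, boolean hasBackPressure) {
        return isPipelined;
    }

    // Proposed condition: group them only when the result is pipelined AND
    // back-pressured, i.e. the tasks genuinely must run at the same time.
    public static boolean sameRegionProposed(boolean isPipelined, boolean hasBackPressure) {
        return isPipelined && hasBackPressure;
    }

    public static void main(String[] args) {
        // The new shuffle manner: pipelined reads, but writes to disk, so no
        // back pressure; the proposed condition keeps the vertices separable.
        System.out.println(sameRegionCurrent(true, false));  // true
        System.out.println(sameRegionProposed(true, false)); // false
    }
}
```

Under the proposed condition, a pipelined-but-disk-backed result no longer forces producer and consumer into one region, which is exactly the scheduling freedom the comment argues for.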
[jira] [Comment Edited] (FLINK-12884) FLIP-144: Native Kubernetes HA Service
[ https://issues.apache.org/jira/browse/FLINK-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232493#comment-17232493 ] Kevin Kwon edited comment on FLINK-12884 at 11/16/20, 3:48 AM: --- Just in my opinion, I think the e2e test should mostly focus on killing the job manager since Zookeeper was used for checkpoint metadata storage aside from leader election (which is innately handled by Kubernetes' pod respawning) was (Author: ksp0422): Just in my opinion, I think the e2e test should mostly focus on killing the job manager since Zookeeper was used for checkpoint metadata storage aside from leader election (which is innately handled by Kubernetes' pod spawning) > FLIP-144: Native Kubernetes HA Service > -- > > Key: FLINK-12884 > URL: https://issues.apache.org/jira/browse/FLINK-12884 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Runtime / Coordination >Reporter: MalcolmSanders >Assignee: Yang Wang >Priority: Major > Fix For: 1.12.0 > > > Currently flink only supports HighAvailabilityService using zookeeper. As a > result, it requires a zookeeper cluster to be deployed on k8s cluster if our > customers needs high availability for flink. If we support > HighAvailabilityService based on native k8s APIs, it will save the efforts of > zookeeper deployment as well as the resources used by zookeeper cluster. It > might be especially helpful for customers who run small-scale k8s clusters so > that flink HighAvailabilityService may not cause too much overhead on k8s > clusters. > Previously [FLINK-11105|https://issues.apache.org/jira/browse/FLINK-11105] > has proposed a HighAvailabilityService using etcd. 
As [~NathanHowell] > suggested in FLINK-11105, since k8s doesn't expose its own etcd cluster by > design (see [Securing etcd > clusters|https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#securing-etcd-clusters]), > it also requires the deployment of etcd cluster if flink uses etcd to > achieve HA. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-12884) FLIP-144: Native Kubernetes HA Service
[ https://issues.apache.org/jira/browse/FLINK-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232493#comment-17232493 ] Kevin Kwon commented on FLINK-12884: Just in my opinion, I think the e2e test should mostly focus on killing the job manager since Zookeeper was used for checkpoint metadata storage aside from leader election (which is innately handled by Kubernetes' pod spawning) > FLIP-144: Native Kubernetes HA Service > -- > > Key: FLINK-12884 > URL: https://issues.apache.org/jira/browse/FLINK-12884 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes, Runtime / Coordination >Reporter: MalcolmSanders >Assignee: Yang Wang >Priority: Major > Fix For: 1.12.0 > > > Currently flink only supports HighAvailabilityService using zookeeper. As a > result, it requires a zookeeper cluster to be deployed on k8s cluster if our > customers needs high availability for flink. If we support > HighAvailabilityService based on native k8s APIs, it will save the efforts of > zookeeper deployment as well as the resources used by zookeeper cluster. It > might be especially helpful for customers who run small-scale k8s clusters so > that flink HighAvailabilityService may not cause too much overhead on k8s > clusters. > Previously [FLINK-11105|https://issues.apache.org/jira/browse/FLINK-11105] > has proposed a HighAvailabilityService using etcd. As [~NathanHowell] > suggested in FLINK-11105, since k8s doesn't expose its own etcd cluster by > design (see [Securing etcd > clusters|https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#securing-etcd-clusters]), > it also requires the deployment of etcd cluster if flink uses etcd to > achieve HA. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20143) use `yarn.provided.lib.dirs` config deploy job failed in yarn per job mode
[ https://issues.apache.org/jira/browse/FLINK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232488#comment-17232488 ] Yang Wang commented on FLINK-20143: --- [~zhisheng] Could you add the hdfs scheme to {{yarn.provided.lib.dirs}} and try again? For example, {{-yD yarn.provided.lib.dirs=hdfs://hdpdev/flink/flink-1.12-SNAPSHOT/lib}} > use `yarn.provided.lib.dirs` config deploy job failed in yarn per job mode > -- > > Key: FLINK-20143 > URL: https://issues.apache.org/jira/browse/FLINK-20143 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission, Deployment / YARN >Affects Versions: 1.12.0 >Reporter: zhisheng >Priority: Major > Attachments: image-2020-11-13-17-21-47-751.png, > image-2020-11-13-17-22-06-111.png, image-2020-11-13-18-43-55-188.png > > > deploying a flink job to yarn with the following command failed > {code:java} > ./bin/flink run -m yarn-cluster -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jar > {code} > log: > {code:java} > $ ./bin/flink run -m yarn-cluster -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jar$ ./bin/flink run -m yarn-cluster > -d -ynm flink-1.12-test -ytm 3g -yjm 3g -yD > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib > ./examples/streaming/StateMachineExample.jarSLF4J: Class path contains > multiple SLF4J bindings.SLF4J: Found binding in > [jar:file:/data1/app/flink-1.12-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > Found binding in > [jar:file:/data1/app/hadoop-2.7.3-snappy-32core12disk/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > Found binding in > 
[jar:file:/data1/app/hadoop-2.7.3-snappy-32core12disk/share/hadoop/tools/lib/hadoop-aliyun-2.9.2-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]SLF4J: > See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation.SLF4J: Actual binding is of type > [org.apache.logging.slf4j.Log4jLoggerFactory]2020-11-13 16:14:30,347 INFO > org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Dynamic > Property set: > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/lib2020-11-13 > 16:14:30,347 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli > [] - Dynamic Property set: > yarn.provided.lib.dirs=hdfs:///flink/flink-1.12-SNAPSHOT/libUsage with > built-in data generator: StateMachineExample [--error-rate > ] [--sleep ]Usage > with Kafka: StateMachineExample --kafka-topic [--brokers > ]Options for both the above setups: [--backend ] > [--checkpoint-dir ] [--async-checkpoints ] > [--incremental-checkpoints ] [--output OR null for > stdout] > Using standalone source with error rate 0.00 and sleep delay 1 millis > 2020-11-13 16:14:30,706 WARN > org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The > configuration directory ('/data1/app/flink-1.12-SNAPSHOT/conf') already > contains a LOG4J config file.If you want to use logback, then please delete > or rename the log configuration file.2020-11-13 16:14:30,947 INFO > org.apache.hadoop.yarn.client.AHSProxy [] - Connecting > to Application History server at > FAT-hadoopuat-69117.vm.dc01.tech/10.69.1.17:102002020-11-13 16:14:30,958 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - No path > for the flink jar passed. Using the location of class > org.apache.flink.yarn.YarnClusterDescriptor to locate the jar2020-11-13 > 16:14:31,065 INFO > org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] - Failing > over to rm22020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - The > configured JobManager memory is 3072 MB. 
YARN will allocate 4096 MB to make > up an integer multiple of its minimum allocation memory (2048 MB, configured > via 'yarn.scheduler.minimum-allocation-mb'). The extra 1024 MB may not be > used by Flink.2020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - The > configured TaskManager memory is 3072 MB. YARN will allocate 4096 MB to make > up an integer multiple of its minimum allocation memory (2048 MB, configured > via 'yarn.scheduler.minimum-allocation-mb'). The extra 1024 MB may not be > used by Flink.2020-11-13 16:14:31,130 INFO > org.apache.flink.yarn.YarnClusterDescriptor [] - Cluster > specification: ClusterSpecification{masterMemoryMB=3072, > taskManagerMemoryMB=3072, slotsPer
[jira] [Assigned] (FLINK-20164) Add docs for ProcessFunction and Timer for Python DataStream API.
[ https://issues.apache.org/jira/browse/FLINK-20164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu reassigned FLINK-20164: --- Assignee: Shuiqiang Chen > Add docs for ProcessFunction and Timer for Python DataStream API. > - > > Key: FLINK-20164 > URL: https://issues.apache.org/jira/browse/FLINK-20164 > Project: Flink > Issue Type: Improvement > Components: API / Python, Documentation >Reporter: Shuiqiang Chen >Assignee: Shuiqiang Chen >Priority: Minor > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] flinkbot edited a comment on pull request #14060: [hotfix][doc] fix the usage example of datagen connector doc
flinkbot edited a comment on pull request #14060: URL: https://github.com/apache/flink/pull/14060#issuecomment-726548680 ## CI report: * deb5686ae2485f1fd700e5f1e69e6bf8aa6e Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9599) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9549) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] flinkbot edited a comment on pull request #13900: [FLINK-19949][csv] Unescape CSV format line delimiter character
flinkbot edited a comment on pull request #13900: URL: https://github.com/apache/flink/pull/13900#issuecomment-720989833 ## CI report: * 9dc60ef8af4b2a4e11c749a817b454c983f938aa Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8969) * d9e47451664a976692e4e6110bbffa2842f2ae7a UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (FLINK-20134) Test non-vectorized Python UDAF
[ https://issues.apache.org/jira/browse/FLINK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu closed FLINK-20134. --- Resolution: Done [~zhongwei] Thanks a lot~ > Test non-vectorized Python UDAF > --- > > Key: FLINK-20134 > URL: https://issues.apache.org/jira/browse/FLINK-20134 > Project: Flink > Issue Type: Task > Components: API / Python >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Wei Zhong >Priority: Critical > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-20134) Test non-vectorized Python UDAF
[ https://issues.apache.org/jira/browse/FLINK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu reassigned FLINK-20134: --- Assignee: Wei Zhong > Test non-vectorized Python UDAF > --- > > Key: FLINK-20134 > URL: https://issues.apache.org/jira/browse/FLINK-20134 > Project: Flink > Issue Type: Task > Components: API / Python >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Wei Zhong >Priority: Critical > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20134) Test non-vectorized Python UDAF
[ https://issues.apache.org/jira/browse/FLINK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232484#comment-17232484 ] Wei Zhong commented on FLINK-20134: --- Hi [~dian.fu], I have tested the non-vectorized Python UDAF in the following scenarios: * simple count UDAF * count distinct UDAF with MapView * mixed use of built-in agg functions and non-vectorized Python UDAF > Test non-vectorized Python UDAF > --- > > Key: FLINK-20134 > URL: https://issues.apache.org/jira/browse/FLINK-20134 > Project: Flink > Issue Type: Task > Components: API / Python >Affects Versions: 1.12.0 >Reporter: Dian Fu >Priority: Critical > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-19852) Managed memory released check can block IterativeTask
[ https://issues.apache.org/jira/browse/FLINK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232483#comment-17232483 ] Xintong Song commented on FLINK-19852: -- [~gaoyunhaii] and I also had a discussion around this issue. We have a similar opinion to yours: the least invasive fix might be to have iterative tasks cache the pages and not re-allocate them on each iteration. > Managed memory released check can block IterativeTask > - > > Key: FLINK-19852 > URL: https://issues.apache.org/jira/browse/FLINK-19852 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0, 1.10.2, 1.12.0, 1.11.1, 1.11.2 >Reporter: shaomeng.wang >Priority: Critical > Attachments: image-2020-10-28-17-48-28-395.png, > image-2020-10-28-17-48-48-583.png > > > UnsafeMemoryBudget#reserveMemory, called on TempBarrier, needs time to wait > on GC of all allocated/released managed memory at every iteration. > > stack: > !image-2020-10-28-17-48-48-583.png! > new TempBarrier in BatchTask > !image-2020-10-28-17-48-28-395.png! > > This is much slower than before. -- This message was sent by Atlassian Jira (v8.3.4#803005)
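The fix discussed in the comment above (keep pages across iterations instead of releasing and re-reserving them) might look roughly like this. A stdlib-only sketch, not Flink's MemoryManager; class and method names are made up:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stdlib-only sketch of the idea above: an iterative task reuses its pages
// across iterations, so the expensive reserve/release (and any GC wait)
// happens once, not once per iteration.
public class PageCacheSketch {
    static final int PAGE_SIZE = 32 * 1024;

    private final Deque<byte[]> cachedPages = new ArrayDeque<>();
    public int allocations = 0; // counts real allocations, for illustration

    public byte[] acquirePage() {
        byte[] page = cachedPages.poll();
        if (page == null) {
            allocations++;            // only pay the allocation cost once
            page = new byte[PAGE_SIZE];
        }
        return page;
    }

    public void releasePage(byte[] page) {
        cachedPages.push(page);       // keep the page for the next iteration
    }

    public static void main(String[] args) {
        PageCacheSketch task = new PageCacheSketch();
        for (int iteration = 0; iteration < 100; iteration++) {
            byte[] page = task.acquirePage();
            // ... the iteration would use the page here ...
            task.releasePage(page);
        }
        // prints: allocations across 100 iterations: 1
        System.out.println("allocations across 100 iterations: " + task.allocations);
    }
}
```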
[jira] [Commented] (FLINK-20155) java.lang.OutOfMemoryError: Direct buffer memory
[ https://issues.apache.org/jira/browse/FLINK-20155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232482#comment-17232482 ] Xintong Song commented on FLINK-20155: -- Hi [~roeehersh], The problem you described sounds like a direct memory leak to me. A previously finished job may have failed to properly release its direct memory. While the exception is thrown from the pulsar client, the leak may also come from other places. I would suggest looking into a memory dump for unreleased resources from previously finished jobs. {quote}i also notice that even thought i configure 10gb memory, my flink managed memory is much smaller: {quote} That's true. Currently the metrics on this page do not cover all of Flink's memory use cases. The most important part missing is the native memory. There will be an improvement to this page in the upcoming release 1.12. > java.lang.OutOfMemoryError: Direct buffer memory > > > Key: FLINK-20155 > URL: https://issues.apache.org/jira/browse/FLINK-20155 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission >Affects Versions: 1.11.1 >Reporter: roee hershko >Priority: Major > Fix For: 1.11.1 > > Attachments: image-2020-11-13-17-52-54-217.png > > > update: > this issue occurs every time after a job fails; the only way to fix it is to > manually re-create the task manager pods (I am using the flink operator) > > after submitting a job, it runs for a few hours and then the job manager > crashes; when trying to re-create the job I get the following error: > > {code:java} > 2020-11-13 17:44:58org.apache.pulsar.client.admin.PulsarAdminException: > org.apache.pulsar.shade.io.netty.handler.codec.EncoderException: > java.lang.OutOfMemoryError: Direct buffer memoryat > org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:228) > at > org.apache.pulsar.client.admin.internal.TopicsImpl$7.failed(TopicsImpl.java:324) > at > 
org.apache.pulsar.shade.org.glassfish.jersey.client.JerseyInvocation$4.failed(JerseyInvocation.java:1030) > at > org.apache.pulsar.shade.org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:231) > at > org.apache.pulsar.shade.org.glassfish.jersey.client.ClientRuntime.access$100(ClientRuntime.java:85) > at > org.apache.pulsar.shade.org.glassfish.jersey.client.ClientRuntime$2.lambda$failure$1(ClientRuntime.java:183) > at > org.apache.pulsar.shade.org.glassfish.jersey.internal.Errors$1.call(Errors.java:272) > at > org.apache.pulsar.shade.org.glassfish.jersey.internal.Errors$1.call(Errors.java:268) > at > org.apache.pulsar.shade.org.glassfish.jersey.internal.Errors.process(Errors.java:316) > at > org.apache.pulsar.shade.org.glassfish.jersey.internal.Errors.process(Errors.java:298) > at > org.apache.pulsar.shade.org.glassfish.jersey.internal.Errors.process(Errors.java:268) > at > org.apache.pulsar.shade.org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:312) > at > org.apache.pulsar.shade.org.glassfish.jersey.client.ClientRuntime$2.failure(ClientRuntime.java:183) > at > org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector$3.onThrowable(AsyncHttpConnector.java:279) > at > org.apache.pulsar.shade.org.asynchttpclient.netty.NettyResponseFuture.abort(NettyResponseFuture.java:277) > at > org.apache.pulsar.shade.org.asynchttpclient.netty.request.WriteListener.abortOnThrowable(WriteListener.java:50) > at > org.apache.pulsar.shade.org.asynchttpclient.netty.request.WriteListener.operationComplete(WriteListener.java:61) > at > org.apache.pulsar.shade.org.asynchttpclient.netty.request.WriteCompleteListener.operationComplete(WriteCompleteListener.java:28) > at > org.apache.pulsar.shade.org.asynchttpclient.netty.request.WriteCompleteListener.operationComplete(WriteCompleteListener.java:20) > at > org.apache.pulsar.shade.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) > at > 
org.apache.pulsar.shade.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551) > at > org.apache.pulsar.shade.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) > at > org.apache.pulsar.shade.io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:183) > at > org.apache.pulsar.shade.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95) > at > org.apache.pulsar.shade.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30) > at > org.apache.pulsar.shade.org.asynchttpclient.netty.request.NettyRequestSender.writeRequest(NettyRequestSender.java:421) > at > org.
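As a starting point for the kind of investigation suggested above, the JVM itself reports direct-buffer usage through the standard `java.lang.management.BufferPoolMXBean`; polling it before and after jobs finish can show whether direct memory is actually being released. A stdlib-only sketch:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Prints the JVM's buffer-pool statistics. On HotSpot there is a "direct"
// pool (and a "mapped" one); a "direct" count/used that keeps growing after
// jobs finish points at a direct-memory leak like the one described above.
public class DirectBufferUsage {
    public static void main(String[] args) {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.println(pool.getName()
                    + ": count=" + pool.getCount()
                    + ", used=" + pool.getMemoryUsed() + " bytes"
                    + ", capacity=" + pool.getTotalCapacity() + " bytes");
        }
    }
}
```

The same beans are exposed over JMX, so the numbers can also be watched from a remote console without attaching a profiler.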
[jira] [Closed] (FLINK-20133) Test Pandas UDAF
[ https://issues.apache.org/jira/browse/FLINK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu closed FLINK-20133. --- Resolution: Done > Test Pandas UDAF > > > Key: FLINK-20133 > URL: https://issues.apache.org/jira/browse/FLINK-20133 > Project: Flink > Issue Type: Task > Components: API / Python >Affects Versions: 1.12.0 >Reporter: Dian Fu >Assignee: Huang Xingbo >Priority: Critical > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20142) Update the document for CREATE TABLE LIKE that source table from different catalog is supported
[ https://issues.apache.org/jira/browse/FLINK-20142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jark Wu updated FLINK-20142: Description: The confusion from the [USER mailing list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CREATE-TABLE-LIKE-clause-from-different-catalog-or-database-td39364.html]: Hi, Is it disallowed to refer to a table from different databases or catalogs when someone creates a table? According to [1], there's no way to refer to tables belonging to different databases or catalogs. [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table Best, Dongwon was: The confusion from the USER mailing list: Hi, Is it disallowed to refer to a table from different databases or catalogs when someone creates a table? According to [1], there's no way to refer to tables belonging to different databases or catalogs. [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table Best, Dongwon > Update the document for CREATE TABLE LIKE that source table from different > catalog is supported > --- > > Key: FLINK-20142 > URL: https://issues.apache.org/jira/browse/FLINK-20142 > Project: Flink > Issue Type: Improvement > Components: Documentation, Table SQL / API >Affects Versions: 1.11.2 >Reporter: Danny Chen >Priority: Major > > The confusion from the [USER mailing > list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CREATE-TABLE-LIKE-clause-from-different-catalog-or-database-td39364.html]: > Hi, > Is it disallowed to refer to a table from different databases or catalogs > when someone creates a table? > According to [1], there's no way to refer to tables belonging to different > databases or catalogs. > [1] > https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table > Best, > Dongwon -- This message was sent by Atlassian Jira (v8.3.4#803005)
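The documentation update the ticket asks for boils down to: the LIKE clause accepts a fully qualified `catalog.database.table` name. A hedged sketch of such a statement, here just composed as a string (all identifiers are made up, and the exact connector options are illustrative):

```java
// Builds an example CREATE TABLE ... LIKE statement whose source table lives
// in a different catalog/database. All identifiers here are hypothetical.
public class CreateTableLikeSketch {
    public static String crossCatalogDdl() {
        return String.join("\n",
                "CREATE TABLE derived_table (",
                "  extra_field STRING",
                ") WITH (",
                "  'connector' = 'kafka'",
                ")",
                "LIKE other_catalog.other_db.source_table");
    }

    public static void main(String[] args) {
        System.out.println(crossCatalogDdl());
    }
}
```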
[GitHub] [flink] flinkbot edited a comment on pull request #14060: [hotfix][doc] fix the usage example of datagen connector doc
flinkbot edited a comment on pull request #14060: URL: https://github.com/apache/flink/pull/14060#issuecomment-726548680 ## CI report: * deb5686ae2485f1fd700e5f1e69e6bf8aa6e Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9599) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9549) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (FLINK-20125) Test memory configuration display in the web ui
[ https://issues.apache.org/jira/browse/FLINK-20125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xintong Song reassigned FLINK-20125: Assignee: Xintong Song > Test memory configuration display in the web ui > --- > > Key: FLINK-20125 > URL: https://issues.apache.org/jira/browse/FLINK-20125 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Affects Versions: 1.12.0 >Reporter: Robert Metzger >Assignee: Xintong Song >Priority: Critical > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-20149) SQLClientKafkaITCase.testKafka failed with "Did not get expected results before timeout."
[ https://issues.apache.org/jira/browse/FLINK-20149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232458#comment-17232458 ] Dian Fu commented on FLINK-20149: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9596&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 > SQLClientKafkaITCase.testKafka failed with "Did not get expected results > before timeout." > - > > Key: FLINK-20149 > URL: https://issues.apache.org/jira/browse/FLINK-20149 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Table SQL / Ecosystem >Affects Versions: 1.12.0 >Reporter: Dian Fu >Priority: Blocker > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9558&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 > {code} > 2020-11-13T13:47:21.6002762Z Nov 13 13:47:21 [ERROR] testKafka[0: > kafka-version:2.4.1 > kafka-sql-version:universal](org.apache.flink.tests.util.kafka.SQLClientKafkaITCase) > Time elapsed: 183.015 s <<< FAILURE! 2020-11-13T13:47:21.6003744Z Nov 13 > 13:47:21 java.lang.AssertionError: Did not get expected results before > timeout. 2020-11-13T13:47:21.6004745Z Nov 13 13:47:21 at > org.junit.Assert.fail(Assert.java:88) 2020-11-13T13:47:21.6005325Z Nov 13 > 13:47:21 at org.junit.Assert.assertTrue(Assert.java:41) > 2020-11-13T13:47:21.6006007Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.checkCsvResultFile(SQLClientKafkaITCase.java:226) > 2020-11-13T13:47:21.6007091Z Nov 13 13:47:21 at > org.apache.flink.tests.util.kafka.SQLClientKafkaITCase.testKafka(SQLClientKafkaITCase.java:166) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20166) FlinkKafkaProducerITCase.testRunOutOfProducersInThePool failed with "CheckpointException: Could not complete snapshot 1 for operator MockTask (1/1)#0. Failure reason: Ch
Dian Fu created FLINK-20166: --- Summary: FlinkKafkaProducerITCase.testRunOutOfProducersInThePool failed with "CheckpointException: Could not complete snapshot 1 for operator MockTask (1/1)#0. Failure reason: Checkpoint was declined." Key: FLINK-20166 URL: https://issues.apache.org/jira/browse/FLINK-20166 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Runtime / Checkpointing Affects Versions: 1.12.0 Reporter: Dian Fu Fix For: 1.12.0 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9596&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5 {code} 2020-11-15T22:54:53.4151222Z [ERROR] testRunOutOfProducersInThePool(org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase) Time elapsed: 61.224 s <<< ERROR! 2020-11-15T22:54:53.4152487Z org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 1 for operator MockTask (1/1)#0. Failure reason: Checkpoint was declined. 2020-11-15T22:54:53.4153577Zat org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:226) 2020-11-15T22:54:53.4154747Zat org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:158) 2020-11-15T22:54:53.4155760Zat org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:343) 2020-11-15T22:54:53.4156772Zat org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.snapshotWithLocalState(AbstractStreamOperatorTestHarness.java:638) 2020-11-15T22:54:53.4157865Zat org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.snapshot(AbstractStreamOperatorTestHarness.java:630) 2020-11-15T22:54:53.4159065Zat org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase.testRunOutOfProducersInThePool(FlinkKafkaProducerITCase.java:527) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20166) FlinkKafkaProducerITCase.testRunOutOfProducersInThePool failed with "CheckpointException: Could not complete snapshot 1 for operator MockTask (1/1)#0. Failure reason: Ch
[ https://issues.apache.org/jira/browse/FLINK-20166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated FLINK-20166: Labels: test-stability (was: ) > FlinkKafkaProducerITCase.testRunOutOfProducersInThePool failed with > "CheckpointException: Could not complete snapshot 1 for operator MockTask > (1/1)#0. Failure reason: Checkpoint was declined." > > > Key: FLINK-20166 > URL: https://issues.apache.org/jira/browse/FLINK-20166 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Runtime / Checkpointing >Affects Versions: 1.12.0 >Reporter: Dian Fu >Priority: Major > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9596&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5 > {code} > 2020-11-15T22:54:53.4151222Z [ERROR] > testRunOutOfProducersInThePool(org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase) > Time elapsed: 61.224 s <<< ERROR! > 2020-11-15T22:54:53.4152487Z > org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete > snapshot 1 for operator MockTask (1/1)#0. Failure reason: Checkpoint was > declined. 
> 2020-11-15T22:54:53.4153577Z at > org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:226) > 2020-11-15T22:54:53.4154747Z at > org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:158) > 2020-11-15T22:54:53.4155760Z at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:343) > 2020-11-15T22:54:53.4156772Z at > org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.snapshotWithLocalState(AbstractStreamOperatorTestHarness.java:638) > 2020-11-15T22:54:53.4157865Z at > org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.snapshot(AbstractStreamOperatorTestHarness.java:630) > 2020-11-15T22:54:53.4159065Z at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase.testRunOutOfProducersInThePool(FlinkKafkaProducerITCase.java:527) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] flinkbot edited a comment on pull request #14060: [hotfix][doc] fix the usage example of datagen connector doc
flinkbot edited a comment on pull request #14060: URL: https://github.com/apache/flink/pull/14060#issuecomment-726548680 ## CI report: * deb5686ae2485f1fd700e5f1e69e6bf8aa6e Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9549) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=9599) Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run travis` re-run the last Travis build - `@flinkbot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-16908) FlinkKafkaProducerITCase testScaleUpAfterScalingDown Timeout expired while initializing transactional state in 60000ms.
[ https://issues.apache.org/jira/browse/FLINK-16908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232450#comment-17232450 ] Dian Fu commented on FLINK-16908: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9596&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5 > FlinkKafkaProducerITCase testScaleUpAfterScalingDown Timeout expired while > initializing transactional state in 60000ms. > --- > > Key: FLINK-16908 > URL: https://issues.apache.org/jira/browse/FLINK-16908 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka >Affects Versions: 1.11.0, 1.12.0 >Reporter: Piotr Nowojski >Priority: Major > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6889&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=f66652e3-384e-5b25-be29-abfea69ea8da > {noformat} > [ERROR] > testScaleUpAfterScalingDown(org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase) > Time elapsed: 64.353 s <<< ERROR! > org.apache.kafka.common.errors.TimeoutException: Timeout expired while > initializing transactional state in 60000ms. > {noformat} > After this initial error many other tests (I think all following unit tests) > failed with errors like: > {noformat} > [ERROR] > testFailAndRecoverSameCheckpointTwice(org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase) > Time elapsed: 7.895 s <<< FAILURE! > java.lang.AssertionError: Detected producer leak. Thread name: > kafka-producer-network-thread | producer-196 > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase.checkProducerLeak(FlinkKafkaProducerITCase.java:675) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerITCase.testFailAndRecoverSameCheckpointTwice(FlinkKafkaProducerITCase.java:311) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-20165) YARNSessionFIFOITCase.checkForProhibitedLogContents: Error occurred during initialization of boot layer java.lang.IllegalStateException: Module system already initialized
[ https://issues.apache.org/jira/browse/FLINK-20165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated FLINK-20165: Labels: test-stability (was: ) > YARNSessionFIFOITCase.checkForProhibitedLogContents: Error occurred during > initialization of boot layer java.lang.IllegalStateException: Module system > already initialized > -- > > Key: FLINK-20165 > URL: https://issues.apache.org/jira/browse/FLINK-20165 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.11.3 >Reporter: Dian Fu >Priority: Major > Labels: test-stability > Fix For: 1.11.3 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9597&view=logs&j=298e20ef-7951-5965-0e79-ea664ddc435e&t=8560c56f-9ec1-5c40-4ff5-9d3e882d > {code} > 2020-11-15T22:42:03.3053212Z 22:42:03,303 [ Time-limited test] INFO > org.apache.flink.yarn.YARNSessionFIFOITCase [] - Finished > testDetachedMode() > 2020-11-15T22:42:37.9020133Z [ERROR] Tests run: 5, Failures: 2, Errors: 0, > Skipped: 2, Time elapsed: 67.485 s <<< FAILURE! - in > org.apache.flink.yarn.YARNSessionFIFOSecuredITCase > 2020-11-15T22:42:37.9022015Z [ERROR] > testDetachedMode(org.apache.flink.yarn.YARNSessionFIFOSecuredITCase) Time > elapsed: 12.841 s <<< FAILURE! > 2020-11-15T22:42:37.9023701Z java.lang.AssertionError: > 2020-11-15T22:42:37.9025649Z Found a file > /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-1_0/application_1605480087188_0002/container_1605480087188_0002_01_02/taskmanager.out > with a prohibited string (one of [Exception, Started > SelectChannelConnector@0.0.0.0:8081]). 
Excerpts: > 2020-11-15T22:42:37.9026730Z [ > 2020-11-15T22:42:37.9027080Z Error occurred during initialization of boot > layer > 2020-11-15T22:42:37.9027623Z java.lang.IllegalStateException: Module system > already initialized > 2020-11-15T22:42:37.9033278Z java.lang.IllegalStateException: Module system > already initialized > 2020-11-15T22:42:37.9033825Z ] > 2020-11-15T22:42:37.9034291Z at org.junit.Assert.fail(Assert.java:88) > 2020-11-15T22:42:37.9034971Z at > org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:479) > 2020-11-15T22:42:37.9035814Z at > org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:83) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20165) YARNSessionFIFOITCase.checkForProhibitedLogContents: Error occurred during initialization of boot layer java.lang.IllegalStateException: Module system already initialized
Dian Fu created FLINK-20165: --- Summary: YARNSessionFIFOITCase.checkForProhibitedLogContents: Error occurred during initialization of boot layer java.lang.IllegalStateException: Module system already initialized Key: FLINK-20165 URL: https://issues.apache.org/jira/browse/FLINK-20165 Project: Flink Issue Type: Bug Components: Deployment / YARN Affects Versions: 1.11.3 Reporter: Dian Fu Fix For: 1.11.3 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9597&view=logs&j=298e20ef-7951-5965-0e79-ea664ddc435e&t=8560c56f-9ec1-5c40-4ff5-9d3e882d {code} 2020-11-15T22:42:03.3053212Z 22:42:03,303 [ Time-limited test] INFO org.apache.flink.yarn.YARNSessionFIFOITCase [] - Finished testDetachedMode() 2020-11-15T22:42:37.9020133Z [ERROR] Tests run: 5, Failures: 2, Errors: 0, Skipped: 2, Time elapsed: 67.485 s <<< FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOSecuredITCase 2020-11-15T22:42:37.9022015Z [ERROR] testDetachedMode(org.apache.flink.yarn.YARNSessionFIFOSecuredITCase) Time elapsed: 12.841 s <<< FAILURE! 2020-11-15T22:42:37.9023701Z java.lang.AssertionError: 2020-11-15T22:42:37.9025649Z Found a file /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-1_0/application_1605480087188_0002/container_1605480087188_0002_01_02/taskmanager.out with a prohibited string (one of [Exception, Started SelectChannelConnector@0.0.0.0:8081]). 
Excerpts: 2020-11-15T22:42:37.9026730Z [ 2020-11-15T22:42:37.9027080Z Error occurred during initialization of boot layer 2020-11-15T22:42:37.9027623Z java.lang.IllegalStateException: Module system already initialized 2020-11-15T22:42:37.9033278Z java.lang.IllegalStateException: Module system already initialized 2020-11-15T22:42:37.9033825Z ] 2020-11-15T22:42:37.9034291Z at org.junit.Assert.fail(Assert.java:88) 2020-11-15T22:42:37.9034971Z at org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:479) 2020-11-15T22:42:37.9035814Z at org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:83) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] HuangXingBo commented on pull request #14060: [hotfix][doc] fix the usage example of datagen connector doc
HuangXingBo commented on pull request #14060: URL: https://github.com/apache/flink/pull/14060#issuecomment-727681489 @flinkbot run azure This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-17482) KafkaITCase.testMultipleSourcesOnePartition unstable
[ https://issues.apache.org/jira/browse/FLINK-17482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232444#comment-17232444 ] Dian Fu commented on FLINK-17482: - Another instance on 1.11: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9597&view=logs&j=1fc6e7bf-633c-5081-c32a-9dea24b05730&t=0d9ad4c1-5629-5ffc-10dc-113ca91e23c5 > KafkaITCase.testMultipleSourcesOnePartition unstable > > > Key: FLINK-17482 > URL: https://issues.apache.org/jira/browse/FLINK-17482 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Tests >Affects Versions: 1.11.0, 1.12.0 >Reporter: Robert Metzger >Priority: Major > Labels: test-stability > Fix For: 1.12.0 > > > CI run: > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=454&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=684b1416-4c17-504e-d5ab-97ee44e08a20 > {code} > 07:29:40,472 [main] INFO > org.apache.flink.streaming.connectors.kafka.KafkaTestBase[] - > - > [ERROR] Tests run: 22, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 152.018 s <<< FAILURE! - in > org.apache.flink.streaming.connectors.kafka.KafkaITCase > [ERROR] > testMultipleSourcesOnePartition(org.apache.flink.streaming.connectors.kafka.KafkaITCase) > Time elapsed: 4.257 s <<< FAILURE! > java.lang.AssertionError: Test failed: Job execution failed. 
> at org.junit.Assert.fail(Assert.java:88) > at org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:45) > at > org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.runMultipleSourcesOnePartitionExactlyOnceTest(KafkaConsumerTestBase.java:963) > at > org.apache.flink.streaming.connectors.kafka.KafkaITCase.testMultipleSourcesOnePartition(KafkaITCase.java:107) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [flink] XComp commented on a change in pull request #14033: [FLINK-19979][e2e] Add sanity check after bash e2e tests for no leftovers
XComp commented on a change in pull request #14033: URL: https://github.com/apache/flink/pull/14033#discussion_r523823323 ## File path: flink-end-to-end-tests/test-scripts/common.sh ## @@ -822,11 +839,20 @@ internal_run_with_timeout() { ( command_pid=$BASHPID - (sleep "${timeout_in_seconds}" # set a timeout for this command - echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds." - eval "${on_failure}" - kill "$command_pid") & watchdog_pid=$! + (# this subshell contains the watchdog +local wakeup_time=$(( ${timeout_in_seconds} + $(date +%s) )) +while true; do + sleep 1 + if [ $wakeup_time -le $(date +%s) ]; then +echo "${command_label:-"The command '${command}'"} (pid: $command_pid) did not finish after $timeout_in_seconds seconds." +eval "${on_failure}" +kill "$command_pid" +pkill -P "$command_pid" Review comment: Ok, I did some more research on it. Looks like it works like that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-19250) SplitFetcherManager does not propagate errors correctly
[ https://issues.apache.org/jira/browse/FLINK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232412#comment-17232412 ] Stephan Ewen commented on FLINK-19250: -- Fixed in 1.11.3 (release-1.11) via f220c2486181e4af19b246e9745a4990cd6aa9ce > SplitFetcherManager does not propagate errors correctly > --- > > Key: FLINK-19250 > URL: https://issues.apache.org/jira/browse/FLINK-19250 > Project: Flink > Issue Type: Bug > Components: Connectors / Common >Affects Versions: 1.11.2 >Reporter: Stephan Ewen >Assignee: Stephan Ewen >Priority: Blocker > Fix For: 1.12.0, 1.11.3 > > > The first exception that is reported does not lead to a notification, only > successive exceptions do. -- This message was sent by Atlassian Jira (v8.3.4#803005)
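The bug described in FLINK-19250 above (the first reported exception never surfaces, only successive ones do) boils down to completing an error notification on the first failure rather than on later ones. A minimal, hypothetical sketch of the intended "first error wins and is always notified" behavior — the names and types below are illustrative and are not Flink's actual SplitFetcherManager API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only: record the first Throwable reported by a fetcher
// thread and complete the notification future immediately, so the FIRST
// error already wakes up the consumer (the bug was that only later ones did).
public class ErrorPropagationSketch {
    private final AtomicReference<Throwable> firstError = new AtomicReference<>();
    private final CompletableFuture<Void> errorFuture = new CompletableFuture<>();

    public void reportError(Throwable t) {
        // compareAndSet keeps only the first reported error...
        if (firstError.compareAndSet(null, t)) {
            errorFuture.completeExceptionally(t); // ...and it triggers the notification
        }
    }

    public boolean errorNotified() {
        return errorFuture.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        ErrorPropagationSketch manager = new ErrorPropagationSketch();
        manager.reportError(new RuntimeException("fetcher thread died"));
        if (!manager.errorNotified()) {
            throw new AssertionError("first error was lost");
        }
        System.out.println("ok");
    }
}
```

Because the future is completed at the moment the first error is recorded, no window exists in which an error has been stored but the consumer is never woken.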
[jira] [Updated] (FLINK-19251) Avoid confusing queue handling in "SplitReader.handleSplitsChanges()"
[ https://issues.apache.org/jira/browse/FLINK-19251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-19251: - Fix Version/s: 1.11.3 > Avoid confusing queue handling in "SplitReader.handleSplitsChanges()" > - > > Key: FLINK-19251 > URL: https://issues.apache.org/jira/browse/FLINK-19251 > Project: Flink > Issue Type: Improvement > Components: Connectors / Common >Reporter: Stephan Ewen >Assignee: Stephan Ewen >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0, 1.11.3 > > > Currently, the method {{SplitReader.handleSplitsChanges()}} is passed a queue > of split changes to handle. The method may decide to handle only a subset of > them and is passed all remaining changes again later. > In practice, this ends up being confusing and problematic: > - It is important to remove the elements from the queue, not accidentally > iterate, or the splits will get handled multiple times > - If the queue is not left empty, the task to handle the changes is > immediately re-enqueued. No other operation can happen before all split > changes from the queue are handled. > A simpler and more efficient contract would be to simply pass a list of split > changes directly and once, for the fetcher to handle. For all implementations > so far, this was sufficient and easier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
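The contrast between the two contracts discussed in FLINK-19251 can be sketched as follows. These are hypothetical simplified types, not the real SplitReader interface: with the queue-based contract the reader must remove what it handles, while the proposed contract hands over all pending changes as a list, exactly once.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Illustrative sketch of the two contracts discussed in FLINK-19251.
public class SplitChangesSketch {

    // Current contract (simplified): the reader drains a shared queue and must
    // remove each handled element via poll(); merely iterating would leave the
    // elements in place and they would be handled again on re-enqueue.
    public static int handleFromQueue(Queue<String> changes) {
        int handled = 0;
        while (changes.poll() != null) {
            handled++;
        }
        return handled;
    }

    // Proposed contract (simplified): the caller passes all pending changes as
    // a list, exactly once, so the reader has no removal or re-enqueue logic.
    public static int handleFromList(List<String> changes) {
        return changes.size();
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(List.of("add-split-1", "add-split-2"));
        if (handleFromQueue(queue) != 2 || !queue.isEmpty()) {
            throw new AssertionError("queue contract violated");
        }
        if (handleFromList(List.of("add-split-1", "add-split-2")) != 2) {
            throw new AssertionError("list contract violated");
        }
        System.out.println("ok");
    }
}
```

The list variant also removes the failure mode the ticket calls out: there is no queue left non-empty, so the handling task can never starve other operations by being re-enqueued.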
[jira] [Commented] (FLINK-19251) Avoid confusing queue handling in "SplitReader.handleSplitsChanges()"
[ https://issues.apache.org/jira/browse/FLINK-19251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232413#comment-17232413 ] Stephan Ewen commented on FLINK-19251: -- Fixed in 1.11.3 (release-1.11) via 257a0da75feb0cf791de01b0ec43467cc433b955 > Avoid confusing queue handling in "SplitReader.handleSplitsChanges()" > - > > Key: FLINK-19251 > URL: https://issues.apache.org/jira/browse/FLINK-19251 > Project: Flink > Issue Type: Improvement > Components: Connectors / Common >Reporter: Stephan Ewen >Assignee: Stephan Ewen >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0, 1.11.3 > > > Currently, the method {{SplitReader.handleSplitsChanges()}} is passed a queue > of split changes to handle. The method may decide to handle only a subset of > them and is passed all remaining changes again later. > In practice, this ends up being confusing and problematic: > - It is important to remove the elements from the queue, not accidentally > iterate, or the splits will get handled multiple times > - If the queue is not left empty, the task to handle the changes is > immediately re-enqueued. No other operation can happen before all split > changes from the queue are handled. > A simpler and more efficient contract would be to simply pass a list of split > changes directly and once, for the fetcher to handle. For all implementations > so far, this was sufficient and easier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-19245) Set default queue capacity for FLIP-27 source handover queue to 2
[ https://issues.apache.org/jira/browse/FLINK-19245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232411#comment-17232411 ] Stephan Ewen commented on FLINK-19245: -- Fixed in 1.11.3 (release-1.11) via 406aa9f3f647568410ca0d9d27c475ce84a58ece > Set default queue capacity for FLIP-27 source handover queue to 2 > - > > Key: FLINK-19245 > URL: https://issues.apache.org/jira/browse/FLINK-19245 > Project: Flink > Issue Type: Improvement > Components: Connectors / Common >Reporter: Stephan Ewen >Assignee: Stephan Ewen >Priority: Major > Fix For: 1.12.0, 1.11.3 > > > Based on initial investigation by [~jqin] a capacity value of two results in > good queue throughput while still keeping the number of elements small. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-19250) SplitFetcherManager does not propagate errors correctly
[ https://issues.apache.org/jira/browse/FLINK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-19250: - Fix Version/s: 1.11.3 > SplitFetcherManager does not propagate errors correctly > --- > > Key: FLINK-19250 > URL: https://issues.apache.org/jira/browse/FLINK-19250 > Project: Flink > Issue Type: Bug > Components: Connectors / Common >Affects Versions: 1.11.2 >Reporter: Stephan Ewen >Assignee: Stephan Ewen >Priority: Blocker > Fix For: 1.12.0, 1.11.3 > > > The first exception that is reported does not lead to a notification, only > successive exceptions do. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18128) CoordinatedSourceITCase.testMultipleSources gets stuck
[ https://issues.apache.org/jira/browse/FLINK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232409#comment-17232409 ] Stephan Ewen commented on FLINK-18128: -- Fixed in 1.11.3 (release-1.11) via e72e48533902fe6a7271310736584e77b64d05b8 > CoordinatedSourceITCase.testMultipleSources gets stuck > -- > > Key: FLINK-18128 > URL: https://issues.apache.org/jira/browse/FLINK-18128 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Task, Tests >Affects Versions: 1.11.0 >Reporter: Robert Metzger >Assignee: Stephan Ewen >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.12.0, 1.11.3 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2705&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=6b04ca5f-0b52-511d-19c9-52bf0d9fbdfa > {code} > 2020-06-04T11:19:39.6335702Z [INFO] > 2020-06-04T11:19:39.6337440Z [INFO] > --- > 2020-06-04T11:19:39.6338176Z [INFO] T E S T S > 2020-06-04T11:19:39.6339305Z [INFO] > --- > 2020-06-04T11:19:40.1906157Z [INFO] Running > org.apache.flink.connector.base.source.reader.CoordinatedSourceITCase > 2020-06-04T11:34:51.0599860Z > == > 2020-06-04T11:34:51.0603015Z Maven produced no output for 900 seconds. 
> 2020-06-04T11:34:51.0604174Z > == > 2020-06-04T11:34:51.0613908Z > == > 2020-06-04T11:34:51.0615097Z The following Java processes are running (JPS) > 2020-06-04T11:34:51.0616043Z > == > 2020-06-04T11:34:51.0762007Z Picked up JAVA_TOOL_OPTIONS: > -XX:+HeapDumpOnOutOfMemoryError > 2020-06-04T11:34:51.2775341Z 29635 surefirebooter5307550588991461882.jar > 2020-06-04T11:34:51.2931264Z 2100 Launcher > 2020-06-04T11:34:51.3012583Z 32203 Jps > 2020-06-04T11:34:51.3258038Z Picked up JAVA_TOOL_OPTIONS: > -XX:+HeapDumpOnOutOfMemoryError > 2020-06-04T11:34:51.5443730Z > == > 2020-06-04T11:34:51.5445134Z Printing stack trace of Java process 29635 > 2020-06-04T11:34:51.5445984Z > == > 2020-06-04T11:34:51.5528602Z Picked up JAVA_TOOL_OPTIONS: > -XX:+HeapDumpOnOutOfMemoryError > 2020-06-04T11:34:51.9617670Z 2020-06-04 11:34:51 > 2020-06-04T11:34:51.9619131Z Full thread dump OpenJDK 64-Bit Server VM > (25.242-b08 mixed mode): > 2020-06-04T11:34:51.9619732Z > 2020-06-04T11:34:51.9620618Z "Attach Listener" #299 daemon prio=9 os_prio=0 > tid=0x7f4d60001000 nid=0x7e59 waiting on condition [0x] > 2020-06-04T11:34:51.9621720Z java.lang.Thread.State: RUNNABLE > 2020-06-04T11:34:51.9622190Z > 2020-06-04T11:34:51.9623631Z "flink-akka.actor.default-dispatcher-185" #297 > prio=5 os_prio=0 tid=0x7f4ca0003000 nid=0x7db4 waiting on condition > [0x7f4d10136000] > 2020-06-04T11:34:51.9624972Z java.lang.Thread.State: WAITING (parking) > 2020-06-04T11:34:51.9625716Z at sun.misc.Unsafe.park(Native Method) > 2020-06-04T11:34:51.9627072Z - parking to wait for <0x80c557f0> (a > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool) > 2020-06-04T11:34:51.9628593Z at > akka.dispatch.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075) > 2020-06-04T11:34:51.9629649Z at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > 2020-06-04T11:34:51.9630825Z at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 2020-06-04T11:34:51.9631559Z > 
2020-06-04T11:34:51.9633020Z "flink-akka.actor.default-dispatcher-186" #298 > prio=5 os_prio=0 tid=0x7f4d08006800 nid=0x7db3 waiting on condition > [0x7f4d12974000] > 2020-06-04T11:34:51.9634074Z java.lang.Thread.State: WAITING (parking) > 2020-06-04T11:34:51.9634965Z at sun.misc.Unsafe.park(Native Method) > 2020-06-04T11:34:51.9636384Z - parking to wait for <0x80c557f0> (a > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool) > 2020-06-04T11:34:51.9637683Z at > akka.dispatch.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075) > 2020-06-04T11:34:51.9638795Z at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > 2020-06-04T11:34:51.9639845Z at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 2020-06-04T11:34:51.9640475Z > 2020-06-04T11:34:51.9642008Z "flink-akka.actor.default-
[jira] [Updated] (FLINK-19245) Set default queue capacity for FLIP-27 source handover queue to 2
[ https://issues.apache.org/jira/browse/FLINK-19245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen updated FLINK-19245: - Fix Version/s: 1.11.3 > Set default queue capacity for FLIP-27 source handover queue to 2 > - > > Key: FLINK-19245 > URL: https://issues.apache.org/jira/browse/FLINK-19245 > Project: Flink > Issue Type: Improvement > Components: Connectors / Common >Reporter: Stephan Ewen >Assignee: Stephan Ewen >Priority: Major > Fix For: 1.12.0, 1.11.3 > > > Based on initial investigation by [~jqin] a capacity value of two results in > good queue throughput while still keeping the number of elements small. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-19223) Simplify Availability Future Model in Base Connector
[ https://issues.apache.org/jira/browse/FLINK-19223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232410#comment-17232410 ] Stephan Ewen commented on FLINK-19223: -- Fixed in 1.11.3 (release-1.11) via 0e821eaa483b0371ac1df1339f0d3c9ad7376976 > Simplify Availability Future Model in Base Connector > > > Key: FLINK-19223 > URL: https://issues.apache.org/jira/browse/FLINK-19223 > Project: Flink > Issue Type: Improvement > Components: Connectors / Common >Reporter: Stephan Ewen >Assignee: Stephan Ewen >Priority: Critical > Fix For: 1.12.0 > > > The current model implemented by the {{FutureNotifier}} and the > {{SourceReaderBase}} has a shortcoming: > - It does not support availability notifications where the notification > comes before the check. IN that case the notification is lost. > - One can see the added complexity created by this model also in the > {{SourceReaderBase#isAvailable()}} where the returned future needs to be > "post-processed" and eagerly completed if the reader is in fact available. > This is based on queue size, which makes it hard to have other conditions. > I think we can do something that is both easier and a bit more efficient by > following a similar model as the > {{org.apache.flink.runtime.io.AvailabilityProvider.AvailabilityHelper}}. > Furthermore, I believe we can win more efficiency by integrating this better > with the {{FutureCompletingBlockingQueue}}. > I suggest to do a similar implementation as the {{AvailabilityHelper}} > directly in the {{FutureCompletingBlockingQueue}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
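The "notification before check" problem described in FLINK-19223 above can be sketched with a small availability-aware queue. The code below is an assumed model of the behavior the ticket describes, not Flink's actual FutureCompletingBlockingQueue or AvailabilityHelper: the availability future is completed under the same lock that adds an element, and a fresh incomplete future is only installed while the queue is provably empty, so an early notification can never be lost between "check" and "wait".

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: availability state integrated with the queue itself,
// so that no wake-up is lost even when the notification precedes the check.
public class AvailabilityQueueSketch<E> {
    private final Queue<E> elements = new ArrayDeque<>();
    private CompletableFuture<Void> availability = new CompletableFuture<>();

    // Producer side: add an element and complete the current future.
    // Completing an already-complete future is a harmless no-op.
    public synchronized void put(E element) {
        elements.add(element);
        availability.complete(null);
    }

    // Consumer side: take an element; only when the queue is empty afterwards
    // is a new, incomplete future installed -- under the same lock, so no
    // put() can slip in between the emptiness check and the reset.
    public synchronized E poll() {
        E e = elements.poll();
        if (elements.isEmpty()) {
            availability = new CompletableFuture<>();
        }
        return e;
    }

    public synchronized CompletableFuture<Void> getAvailabilityFuture() {
        return availability;
    }

    public static void main(String[] args) {
        AvailabilityQueueSketch<String> q = new AvailabilityQueueSketch<>();
        CompletableFuture<Void> f = q.getAvailabilityFuture();
        q.put("record"); // notification arrives before anyone re-checks
        if (!f.isDone()) throw new AssertionError("notification was lost");
        if (!"record".equals(q.poll())) throw new AssertionError();
        if (q.getAvailabilityFuture().isDone()) throw new AssertionError("should be unavailable again");
        System.out.println("ok");
    }
}
```

This avoids the post-processing of returned futures that the ticket criticizes: availability is derived directly from queue state inside the queue, rather than patched up in the reader.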