[jira] [Created] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests
Gary Yao created FLINK-17794: Summary: Tear down installed software in reverse order in Jepsen Tests Key: FLINK-17794 URL: https://issues.apache.org/jira/browse/FLINK-17794 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.1, 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Tear down installed software in reverse order in the Jepsen tests. This mitigates an issue where Hadoop's NodeManager directories sometimes cannot be removed with {{rm -rf}}, because Flink processes keep running and generating files after the YARN NodeManager has been shut down. {{rm -rf}} removes files recursively, but if files are concurrently created in the background, the command can still fail with a non-zero exit code. {noformat}
sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty
rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty
{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
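To illustrate the race described above, here is a minimal Java sketch of a recursive delete that retries when files appear concurrently under the tree. This is a hypothetical illustration only: flink-jepsen is written in Clojure, and the names {{RetryingDelete}} and {{deleteRecursivelyWithRetries}} are invented for this example, not taken from the project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

public class RetryingDelete {

    /**
     * Recursively deletes a directory tree, retrying a few times.
     * A single-pass recursive delete can fail with "Directory not empty"
     * when another process concurrently creates files under the tree,
     * which is the failure mode from the ticket. maxAttempts is assumed >= 1.
     */
    static void deleteRecursivelyWithRetries(Path root, int maxAttempts) throws IOException {
        IOException lastError = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                deleteRecursively(root);
                return;
            } catch (IOException e) {
                lastError = e; // files may have appeared concurrently; try again
            } catch (UncheckedIOException e) {
                lastError = e.getCause(); // thrown while walking the tree
            }
        }
        throw lastError;
    }

    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // reverse lexicographic order deletes children before parents
            for (Path p : (Iterable<Path>) paths.sorted(Comparator.reverseOrder())::iterator) {
                Files.delete(p);
            }
        }
    }
}
```

Tearing down software in reverse install order (the actual fix) attacks the root cause by stopping the writers first; retrying only papers over the symptom.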
[jira] [Created] (FLINK-17792) Failing to invoke jstack on TM processes should not fail Jepsen Tests
Gary Yao created FLINK-17792: Summary: Failing to invoke jstack on TM processes should not fail Jepsen Tests Key: FLINK-17792 URL: https://issues.apache.org/jira/browse/FLINK-17792 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.1, 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 {{jstack}} can fail if the JVM process exits prematurely before or while we invoke {{jstack}}. If {{jstack}} fails, the exception propagates and aborts the Jepsen tests prematurely. -- This message was sent by Atlassian Jira (v8.3.4#803005)
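The intended behavior can be sketched as a best-effort invocation that swallows failures instead of propagating them. Again a hypothetical Java illustration (flink-jepsen itself is Clojure); {{BestEffortJstack}} and {{tryJstack}} are invented names.

```java
import java.util.concurrent.TimeUnit;

public class BestEffortJstack {

    /**
     * Invokes jstack on the given pid and reports whether it succeeded.
     * Any failure (process already gone, jstack missing, timeout) is
     * swallowed rather than propagated, so a TaskManager JVM that exits
     * prematurely cannot abort the surrounding test run.
     */
    static boolean tryJstack(long pid) {
        try {
            Process p = new ProcessBuilder("jstack", Long.toString(pid))
                    .redirectErrorStream(true)
                    .start();
            if (!p.waitFor(30, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return false; // timed out; best effort only
            }
            return p.exitValue() == 0;
        } catch (Exception e) {
            // the JVM may have exited before or during jstack, or jstack
            // may not be on the PATH; do not fail the test run
            return false;
        }
    }
}
```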
[jira] [Created] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink
Gary Yao created FLINK-17777: Summary: Make Mesos Jepsen Tests pass with Hadoop-free Flink Key: FLINK-17777 URL: https://issues.apache.org/jira/browse/FLINK-17777 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests
Gary Yao created FLINK-17687: Summary: Collect TaskManager logs in Mesos Jepsen Tests Key: FLINK-17687 URL: https://issues.apache.org/jira/browse/FLINK-17687 Project: Flink Issue Type: Improvement Components: Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17621) Use default akka.ask.timeout in TPC-DS e2e test
Gary Yao created FLINK-17621: Summary: Use default akka.ask.timeout in TPC-DS e2e test Key: FLINK-17621 URL: https://issues.apache.org/jira/browse/FLINK-17621 Project: Flink Issue Type: Task Components: Runtime / Coordination, Tests Affects Versions: 1.11.0 Reporter: Gary Yao Revert the changes in FLINK-17616 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test
Gary Yao created FLINK-17616: Summary: Temporarily increase akka.ask.timeout in TPC-DS e2e test Key: FLINK-17616 URL: https://issues.apache.org/jira/browse/FLINK-17616 Project: Flink Issue Type: Task Components: Runtime / Coordination, Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17558) Partitions are released in TaskExecutor Main Thread
Gary Yao created FLINK-17558: Summary: Partitions are released in TaskExecutor Main Thread Key: FLINK-17558 URL: https://issues.apache.org/jira/browse/FLINK-17558 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.11.0 Reporter: Gary Yao Fix For: 1.11.0 Partitions are released in the main thread of the TaskExecutor (see the stack trace below). Because deleting files is blocking I/O, this can lead to missed heartbeats, RPC timeouts, etc. The partitions should instead be released in a dedicated I/O thread pool ({{TaskExecutor#ioExecutor}} is a candidate). {noformat}
"flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
	at sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146)
	at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231)
	at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
	at java.nio.file.Files.delete(Files.java:1126)
	at org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93)
	at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247)
	at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208)
	- locked <0xff836d78> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290)
	at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80)
	- locked <0x9d452b90> (a java.util.HashMap)
	at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153)
	at org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62)
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776)
	at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown Source)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
	at akka.actor.ActorCell.invoke(ActorCell.scala:561)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
	at akka.dispatch.Mailbox.run(Mailbox.scala:225)
	[remainder truncated in the archive]
{noformat}
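The direction proposed in the ticket can be sketched as follows: hand the blocking deletion to a dedicated I/O executor so the RPC main thread returns immediately. {{PartitionReleaser}} and {{releasePartitionFile}} are hypothetical names for illustration, not Flink's actual API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionReleaser {

    private final ExecutorService ioExecutor;

    PartitionReleaser(ExecutorService ioExecutor) {
        this.ioExecutor = ioExecutor;
    }

    /**
     * Schedules the blocking file deletion on the I/O executor so the
     * caller (e.g. an actor/RPC main thread) is never blocked on disk
     * I/O and cannot miss heartbeats while the OS unlinks files.
     */
    void releasePartitionFile(Path partitionFile) {
        ioExecutor.execute(() -> {
            try {
                Files.deleteIfExists(partitionFile);
            } catch (Exception e) {
                // deletion is best effort during release
            }
        });
    }
}
```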
[jira] [Created] (FLINK-17522) Document flink-jepsen Command Line Options
Gary Yao created FLINK-17522: Summary: Document flink-jepsen Command Line Options Key: FLINK-17522 URL: https://issues.apache.org/jira/browse/FLINK-17522 Project: Flink Issue Type: Improvement Components: Tests Reporter: Gary Yao Assignee: Gary Yao -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17501) Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object)
Gary Yao created FLINK-17501: Summary: Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object) Key: FLINK-17501 URL: https://issues.apache.org/jira/browse/FLINK-17501 Project: Flink Issue Type: Bug Components: Runtime / Queryable State Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Improve logging in {{AbstractServerHandler#channelRead(ChannelHandlerContext, Object)}}. If an Error is thrown, it should be logged as early as possible. Currently we try to serialize and send an error response to the client before logging the error; this can fail and mask the original exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
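The fix described above amounts to "log first, respond best-effort". A minimal sketch of that ordering, with {{EarlyLoggingHandler}} and {{onError}} as hypothetical stand-ins for the real Netty handler:

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyLoggingHandler {

    final List<String> log = new ArrayList<>();

    /**
     * Log the failure as early as possible, and only then attempt to
     * notify the client. If serializing/sending the response fails,
     * the root cause has already been recorded and cannot be masked.
     */
    void onError(Throwable error, Runnable sendErrorResponse) {
        // 1) log first so the original error is never lost
        log.add("ERROR: " + error.getMessage());
        try {
            // 2) best-effort response; this step may itself fail
            sendErrorResponse.run();
        } catch (Exception e) {
            log.add("WARN: failed to send error response: " + e.getMessage());
        }
    }
}
```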
[jira] [Created] (FLINK-17473) Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder
Gary Yao created FLINK-17473: Summary: Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder Key: FLINK-17473 URL: https://issues.apache.org/jira/browse/FLINK-17473 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Tests Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Remove unused classes {{ArchivedExecutionVertexBuilder}} and {{ArchivedExecutionJobVertexBuilder}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17369) Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest
Gary Yao created FLINK-17369: Summary: Migrate RestartPipelinedRegionFailoverStrategyBuildingTest to PipelinedRegionComputeUtilTest Key: FLINK-17369 URL: https://issues.apache.org/jira/browse/FLINK-17369 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Tests in {{RestartPipelinedRegionFailoverStrategyBuildingTest}} are actually testing the behavior of {{PipelinedRegionComputeUtil}}. Therefore, the tests should be moved to a new class {{PipelinedRegionComputeUtilTest}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17266) WorkerResourceSpec is not serializable
Gary Yao created FLINK-17266: Summary: WorkerResourceSpec is not serializable Key: FLINK-17266 URL: https://issues.apache.org/jira/browse/FLINK-17266 Project: Flink Issue Type: Bug Components: Deployment / Mesos Affects Versions: 1.11.0 Reporter: Gary Yao Fix For: 1.11.0 {{MesosResourceManager}} cannot acquire new resources due to {{WorkerResourceSpec}} not being serializable. {code}
Caused by: java.lang.Exception: Could not open output stream for state backend
	at org.apache.flink.runtime.zookeeper.filesystem.FileSystemStateStorageHelper.store(FileSystemStateStorageHelper.java:70) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:136) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.mesos.runtime.clusterframework.store.ZooKeeperMesosWorkerStore.putWorker(ZooKeeperMesosWorkerStore.java:216) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.mesos.runtime.clusterframework.MesosResourceManager.startNewWorker(MesosResourceManager.java:441) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	... 34 more
Caused by: java.io.NotSerializableException: org.apache.flink.runtime.resourcemanager.WorkerResourceSpec
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) ~[?:1.8.0_242]
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) ~[?:1.8.0_242]
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) ~[?:1.8.0_242]
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) ~[?:1.8.0_242]
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) ~[?:1.8.0_242]
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) ~[?:1.8.0_242]
	at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:594) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.runtime.zookeeper.filesystem.FileSystemStateStorageHelper.store(FileSystemStateStorageHelper.java:62) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:136) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.mesos.runtime.clusterframework.store.ZooKeeperMesosWorkerStore.putWorker(ZooKeeperMesosWorkerStore.java:216) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	at org.apache.flink.mesos.runtime.clusterframework.MesosResourceManager.startNewWorker(MesosResourceManager.java:441) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
	... 34 more
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
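The failure mode can be reproduced in isolation: Java's {{ObjectOutputStream}} throws {{NotSerializableException}} for any object whose class does not implement {{java.io.Serializable}}, and the usual fix is adding the marker interface. The two {{WorkerSpec*}} classes below are simplified stand-ins for the real class before and after such a fix, not Flink code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {

    // stand-in for the class before the fix: no Serializable marker
    static class WorkerSpecWithoutMarker { final int cpus = 1; }

    // stand-in for the fixed class
    static class WorkerSpecWithMarker implements Serializable { final int cpus = 1; }

    /** Returns true iff the object survives Java serialization. */
    static boolean isJavaSerializable(Object o) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```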
Re: [DISCUSS] Switch to Azure Pipelines as the primary CI tool / switch off Travis
I am in favour of decommissioning Travis. Moreover, I wanted to use this thread to raise another issue with Travis that I have discovered recently; many of the builds running in my private Travis account are timing out in the compilation stage (i.e., compilation takes more than 50 minutes). This means that I am not able to reliably run a full build on a CI server without creating a pull request. If other developers also experience this issue, it would speak for putting more effort into making Azure Pipelines the project-wide default. Best, Gary On Thu, Mar 26, 2020 at 12:26 PM Yu Li wrote: > Thanks for the clarification Robert. > > Since the first step plan is to replace the travis PR runs, I checked all > PR builds from 2020-01-01 (PR#10735-11526) [1], and below is the result: > > * Travis FAILURE: 298 > * Travis SUCCESS: 649 (68.5%) > * Azure FAILURE: 420 > * Azure SUCCESS: 571 (57.6%) > > Since the patch for each run is equivalent for Travis and Azure, there > seems to be slightly higher failure rate (~10%) when running in Azure. > > However, with the just-merged fix for uploading logs (FLINK-16480), I > believe the success rate of Azure could compete with Travis now (uploading > files contribute to 20% of the failures according to the report [2]). > > So I'm +1 to disable travis runs according to the numbers. > > Best Regards, > Yu > > [1] > https://github.com/apache/flink/pulls?q=is%3Apr+created%3A%3E%3D2020-01-01 > [2] > > https://dev.azure.com/rmetzger/Flink/_pipeline/analytics/stageawareoutcome?definitionId=4 > > On Thu, 26 Mar 2020 at 03:28, Robert Metzger wrote: > > > Thank you for your responses. > > > > @Yu Li: In the current master, the log upload always fails, if the e2e > job > > failed. I just merged a PR that fixes this issue [1]. 
The problem was not > > really the network stability, rather a problem with the interaction of > the > > jobs in the pipeline (the e2e job did not set the right variables for the > > log upload) > > Secondly, you are looking at the report of the "flink-ci.flink" pipeline, > > where pull requests are build. Naturally, pull request builds fail all > the > > time, because the PRs are not yet perfect. > > > > "flink-ci.flink-master" is the right pipeline to look at: > > > > > https://dev.azure.com/rmetzger/Flink/_pipeline/analytics/stageawareoutcome?definitionId=8&contextType=build > > We have a fairly high number of failures there, because we currently have > > some issues downloading the maven artifacts [2]. I'm working already with > > Chesnay on merging a fix for that. > > > > > > [1] > > > > > https://github.com/apache/flink/commit/1c86b8b9dd05615a3b2600984db738a9bf388259 > > [2]https://issues.apache.org/jira/browse/FLINK-16720 > > > > > > > > On Wed, Mar 25, 2020 at 4:48 PM Chesnay Schepler > > wrote: > > > > > The easiest way to disable travis for pushes is to remove all builds > > > from the .travis.yml with a push/pr condition. > > > > > > On 25/03/2020 15:03, Robert Metzger wrote: > > > > Thank you for the feedback so far. > > > > > > > > Responses to the items Chesnay raised: > > > > > > > > - by virtue of maintaining the past 2 releases we will have to > maintain > > > any > > > >> Travis infrastructure as long as 1.10 is supported, i.e., until 1.12 > > > >> > > > > Okay. I wasn't sure about the exact policy there. 
> > > > > > > > > > > >> - the azure setup doesn't appear to be equivalent yet since the java > > e2e > > > >> profile isn't setting the hadoop switch (-Pe2e-hadoop), as a result > of > > > >> which SQLClientKafkaITCase isn't run > > > >> > > > > I filed a ticket to address this: > > > > https://issues.apache.org/jira/browse/FLINK-16778 > > > > > > > > > > > >> - the nightly scripts still seems to be using a maven version other > > than > > > >> 3.2.5; from today on master: > > > >> 2020-03-25T05:31:52.7412964Z [INFO] < > > > >> org.apache.flink:flink-end-to-end-tests-common-kafka > > > > >> 2020-03-25T05:31:52.7413854Z [INFO] Building > > > >> flink-end-to-end-tests-common-kafka 1.11-SNAPSHOT [39/46] > > > >> 2020-03-25T05:31:52.7414689Z [INFO] > [ > > > jar > > > >> ]- > > > >> 2020-03-25T05:31:52.7518360Z [INFO] > > > >> 2020-03-25T05:31:52.7519770Z [INFO] --- > > > maven-checkstyle-plugin:2.17:check > > > >> (validate) @ flink-end-to-end-tests-common-kafka --- > > > >> > > > > I'm planning to address this as part of > > > > https://issues.apache.org/jira/browse/FLINK-16411, where I work on > > > > centralizing all mvn invocations. > > > > > > > > > > > >> - there is no real benefit in retiring the travis support in CiBot; > > the > > > >> important part is whether Travis is run or not for pull requests. > > > >> From what I can tell though azure seems to be working fine for pull > > > >> requests, so +1 to at least disable the travis PR runs. > > > > > > > > So we disable Travis for https://github.com/flink-ci/flink ? I will > do > > > it > > > > once th
[jira] [Created] (FLINK-17181) Simplify generic Types in Topology Interface
Gary Yao created FLINK-17181: Summary: Simplify generic Types in Topology Interface Key: FLINK-17181 URL: https://issues.apache.org/jira/browse/FLINK-17181 Project: Flink Issue Type: Sub-task Reporter: Gary Yao Assignee: Gary Yao -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17180) Implement PipelinedRegion interface for SchedulingTopology
Gary Yao created FLINK-17180: Summary: Implement PipelinedRegion interface for SchedulingTopology Key: FLINK-17180 URL: https://issues.apache.org/jira/browse/FLINK-17180 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Gary Yao -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17172) Re-enable debug level logging in Jepsen Tests
Gary Yao created FLINK-17172: Summary: Re-enable debug level logging in Jepsen Tests Key: FLINK-17172 URL: https://issues.apache.org/jira/browse/FLINK-17172 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Since log4j2 was enabled, logs in Jepsen tests are on INFO level. We should re-enable debug level logging in Jepsen Tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17050) Remove methods getVertex() and getResultPartition() from SchedulingTopology
Gary Yao created FLINK-17050: Summary: Remove methods getVertex() and getResultPartition() from SchedulingTopology Key: FLINK-17050 URL: https://issues.apache.org/jira/browse/FLINK-17050 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Remove the following methods (they are only called from tests):
# {{Optional getVertex(ExecutionVertexID)}}
# {{Optional getResultPartition(IntermediateResultPartitionID)}}
and replace them with the following new methods:
# {{V getVertex(ExecutionVertexID)}}
# {{V getResultPartition(IntermediateResultPartitionID)}}
The new methods should throw {{IllegalArgumentException}} if no matching vertex or result partition could be found. -- This message was sent by Atlassian Jira (v8.3.4#803005)
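The proposed contract, replacing an {{Optional}}-returning getter with one that throws on an unknown id, can be sketched as below. This is a simplified illustration using {{String}} ids in place of {{ExecutionVertexID}}; the class name {{StrictTopology}} is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class StrictTopology {

    private final Map<String, String> vertices = new HashMap<>();

    StrictTopology add(String id, String vertex) {
        vertices.put(id, vertex);
        return this;
    }

    /**
     * Non-Optional getter as proposed in the ticket: a caller passing an
     * unknown id gets an IllegalArgumentException instead of having to
     * unwrap an empty Optional at every call site.
     */
    String getVertex(String vertexId) {
        String vertex = vertices.get(vertexId);
        if (vertex == null) {
            throw new IllegalArgumentException("Unknown vertex: " + vertexId);
        }
        return vertex;
    }
}
```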
[jira] [Created] (FLINK-17035) Replace FailoverTopology with SchedulingTopology
Gary Yao created FLINK-17035: Summary: Replace FailoverTopology with SchedulingTopology Key: FLINK-17035 URL: https://issues.apache.org/jira/browse/FLINK-17035 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 Replace usages of {{FailoverTopology}} with {{SchedulingTopology}}. Due to their similarities it does not make sense to maintain multiple interfaces. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [VOTE] FLIP-119: Pipelined Region Scheduling
Hi all, The voting time for FLIP-119 has passed. I am closing the vote now. There were 6 +1 votes, 4 of which are binding: - Till (binding) - Zhijiang (binding) - Xintong (non-binding) - Yangze (non-binding) - Kurt (binding) - Zhu Zhu (binding) There were no disapproving votes. Thus, FLIP-119 has been accepted. Best, Gary On Fri, Apr 3, 2020 at 12:02 PM Zhu Zhu wrote: > +1 (binding) > > Thanks, > Zhu Zhu > > Kurt Young 于2020年4月3日周五 下午5:25写道: > > > +1 (binding) > > > > Best, > > Kurt > > > > > > On Fri, Apr 3, 2020 at 4:02 PM Yangze Guo wrote: > > > > > +1 (non-binding) > > > > > > Best, > > > Yangze Guo > > > > > > On Fri, Apr 3, 2020 at 2:11 PM Xintong Song > > wrote: > > > > > > > > +1 (non-binding) > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Fri, Apr 3, 2020 at 1:18 PM Zhijiang > > .invalid> > > > > wrote: > > > > > > > > > +1 (binding) > > > > > > > > > > Best, > > > > > Zhijiang > > > > > > > > > > > > > > > -- > > > > > From:Till Rohrmann > > > > > Send Time:2020 Apr. 2 (Thu.) 23:09 > > > > > To:dev > > > > > Cc:zhuzh > > > > > Subject:Re: [VOTE] FLIP-119: Pipelined Region Scheduling > > > > > > > > > > +1 > > > > > > > > > > Cheers, > > > > > Till > > > > > > > > > > On Tue, Mar 31, 2020 at 5:52 PM Gary Yao wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > I would like to start the vote for FLIP-119 [1], which is > discussed > > > and > > > > > > reached a consensus in the discussion thread [2]. > > > > > > > > > > > > The vote will be open until April 3 (72h) unless there is an > > > objection > > > > > > or not enough votes. 
> > > > > > > > > > > > Best, > > > > > > Gary > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling > > > > > > [2] > > > > > > > > > > > > > > > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-119-Pipelined-Region-Scheduling-tp39350.html > > > > > > > > > > > > > > > > > > > > > >
[jira] [Created] (FLINK-16960) Add PipelinedRegion Interface to Topology
Gary Yao created FLINK-16960: Summary: Add PipelinedRegion Interface to Topology Key: FLINK-16960 URL: https://issues.apache.org/jira/browse/FLINK-16960 Project: Flink Issue Type: Sub-task Reporter: Gary Yao Assignee: Gary Yao {code}
interface Topology {
    ...
    Iterable getAllPipelinedRegions();

    PipelinedRegion getPipelinedRegionOfVertex(VID vertexId);
    ...
}

interface PipelinedRegion {
    Iterable getVertices();

    Iterable getVertex(VID vertexId);

    Iterable getConsumedResults();
}
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
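The mail archive stripped the generic parameters from the snippet above. Below is a hedged, compilable reconstruction under the assumption that {{VID}} is the vertex-id type and {{V}} the vertex type; the exact signatures in the ticket may differ, and {{SimpleRegion}} is an invented toy implementation.

```java
import java.util.Arrays;
import java.util.List;

public class PipelinedRegions {

    // assumed shape of the proposed interface (generics are a guess)
    interface PipelinedRegion<VID, V> {
        Iterable<V> getVertices();
        V getVertex(VID vertexId);
    }

    /** Minimal list-backed region for illustration. */
    static class SimpleRegion implements PipelinedRegion<Integer, String> {
        private final List<String> vertices;

        SimpleRegion(String... vertices) {
            this.vertices = Arrays.asList(vertices);
        }

        @Override
        public Iterable<String> getVertices() {
            return vertices;
        }

        @Override
        public String getVertex(Integer vertexId) {
            return vertices.get(vertexId);
        }
    }
}
```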
[jira] [Created] (FLINK-16940) Avoid creating currentRegion HashSet with manually set initialCapacity
Gary Yao created FLINK-16940: Summary: Avoid creating currentRegion HashSet with manually set initialCapacity Key: FLINK-16940 URL: https://issues.apache.org/jira/browse/FLINK-16940 Project: Flink Issue Type: Bug Components: Runtime / REST Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 The {{currentRegion}} HashSet in {{PipelinedRegionComputeUtil}} is created with an initialCapacity of 1. This is pointless because the set's capacity must already be increased when the first element is added. From the style guidelines: {quote} Set the initial capacity for a collection only if there is a good proven reason for that, otherwise do not clutter the code. In case of Maps it can be even deluding because the Map’s load factor effectively reduces the capacity. {quote} https://flink.apache.org/contributing/code-style-and-quality-java.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
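The guideline can be shown side by side. With {{new HashSet<>(1)}}, the backing {{HashMap}} gets table size 1 and, with the default load factor 0.75, a resize threshold of 0, so (to my understanding of OpenJDK's {{HashMap}}) the table already has to grow when the first element is inserted; the pre-sizing buys nothing. Both variants are of course observably identical.

```java
import java.util.HashSet;
import java.util.Set;

public class HashSetSizing {

    // Discouraged: initialCapacity 1 with the default load factor 0.75
    // yields a resize threshold of (int) (1 * 0.75) = 0, so inserting
    // the very first element already triggers a resize.
    static Set<Integer> sizedRegion() {
        return new HashSet<>(1);
    }

    // Preferred per the Flink style guide: use the default constructor
    // unless there is a good, proven reason to pre-size.
    static Set<Integer> defaultRegion() {
        return new HashSet<>();
    }
}
```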
[jira] [Created] (FLINK-16939) TaskManagerMessageParameters#taskManagerIdParameter is not declared final
Gary Yao created FLINK-16939: Summary: TaskManagerMessageParameters#taskManagerIdParameter is not declared final Key: FLINK-16939 URL: https://issues.apache.org/jira/browse/FLINK-16939 Project: Flink Issue Type: Bug Components: Runtime / REST Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 The field {{TaskManagerMessageParameters#taskManagerIdParameter}} is not declared final but it should be. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[VOTE] FLIP-119: Pipelined Region Scheduling
Hi all, I would like to start the vote for FLIP-119 [1], which is discussed and reached a consensus in the discussion thread [2]. The vote will be open until April 3 (72h) unless there is an objection or not enough votes. Best, Gary [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-119-Pipelined-Region-Scheduling-tp39350.html
Re: [DISCUSS] FLIP-119: Pipelined Region Scheduling
new resources within timeout, we fail this request. > > > > > > A small comment concerning "Resource deadlocks when slot allocation > > > competition happens between multiple jobs in a session cluster": > Another > > > idea to solve this situation would be to give the ResourceManager the > > right > > > to revoke slot assignments in order to change the mapping between > > requests > > > and available slots. > > > > > > Cheers, > > > Till > > > > > > On Fri, Mar 27, 2020 at 12:44 PM Xintong Song > > > wrote: > > > > > > > Gary & Zhu Zhu, > > > > > > > > Thanks for preparing this FLIP, and a BIG +1 from my side. The > > trade-off > > > > between resource utilization and potential deadlock problems has > always > > > > been a pain. Despite not solving all the deadlock cases, this FLIP is > > > > definitely a big improvement. IIUC, it has already covered all the > > > existing > > > > single job cases, and all the mentioned non-covered cases are either > in > > > > multi-job session clusters or with diverse slot resources in future. > > > > > > > > I've read through the FLIP, and it looks really good to me. Good job! > > All > > > > the concerns and limitations that I can think of have already been > > > clearly > > > > stated, with reasonable potential future solutions. From the > > perspective > > > of > > > > fine-grained resource management, I do not see any > serious/irresolvable > > > > conflict at this time. > > > > > > > > nit: The in-page links are not working. I guess those are copied from > > > > google docs directly? > > > > > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Fri, Mar 27, 2020 at 6:26 PM Zhu Zhu wrote: > > > > > > > > > To Yangze, > > > > > > > > > > >> the blocking edge will not be consumable before the upstream is > > > > > finished. > > > > > Yes. 
This is how we define a BLOCKING result partition, "Blocking > > > > > partitions represent blocking data exchanges, where the data stream > > is > > > > > first fully produced and then consumed". > > > > > > > > > > >> I'm also wondering could we execute the upstream and downstream > > > > regions > > > > > at the same time if we have enough resources > > > > > It may lead to resource waste since the tasks in downstream regions > > > > cannot > > > > > read any data before the upstream region finishes. It saves a bit > > time > > > on > > > > > schedule, but usually it does not make much difference for large > > jobs, > > > > > since data processing takes much more time. For small jobs, one can > > > make > > > > > all edges PIPELINED so that all the tasks can be scheduled at the > > same > > > > > time. > > > > > > > > > > >> is it possible to change the data exchange mode of two regions > > > > > dynamically? > > > > > This is not in the scope of the FLIP. But we are moving forward to > a > > > more > > > > > extensible scheduler (FLINK-10429) and resource aware scheduling > > > > > (FLINK-10407). > > > > > So I think it's possible we can have a scheduler in the future > which > > > > > dynamically changes the shuffle type wisely regarding available > > > > resources. > > > > > > > > > > Thanks, > > > > > Zhu Zhu > > > > > > > > > > Yangze Guo 于2020年3月27日周五 下午4:49写道: > > > > > > > > > > > Thanks for updating! > > > > > > > > > > > > +1 for supporting the pipelined region scheduling. Although we > > could > > > > > > not prevent resource deadlock in all scenarios, it is really a > big > > > > > > step. > > > > > > > > > > > > The design generally LGTM. > > > > > > > > > > > > One minor thing I want to make sure. If I understand correctly, > the > > > > > > blocking edge will not be consumable before the upstre
Re: [DISCUSS] Creating a new repo to host Stateful Functions Dockerfiles
+1 for a separate repository. Best, Gary On Fri, Mar 27, 2020 at 2:46 AM Hequn Cheng wrote: > +1 for a separate repository. > The dedicated `flink-docker` repo works fine now. We can do it similarly. > > Best, > Hequn > > On Fri, Mar 27, 2020 at 1:16 AM Till Rohrmann > wrote: > > > +1 for a separate repository. > > > > Cheers, > > Till > > > > On Thu, Mar 26, 2020 at 5:13 PM Ufuk Celebi wrote: > > > > > +1. > > > > > > The repo creation process is a light-weight, automated process on the > ASF > > > side. When Patrick Lucas contributed docker-flink back to the Flink > > > community (as flink-docker), there was virtually no overhead in > creating > > > the repository. Reusing build scripts should still be possible at the > > cost > > > of some duplication which is fine imo. > > > > > > – Ufuk > > > > > > On Thu, Mar 26, 2020 at 4:18 PM Stephan Ewen wrote: > > > > > > > > +1 to a separate repository. > > > > > > > > It seems to be best practice in the docker community. > > > > And since it does not add overhead, why not go with the best > practice? > > > > > > > > Best, > > > > Stephan > > > > > > > > > > > > On Thu, Mar 26, 2020 at 4:15 PM Tzu-Li (Gordon) Tai < > > tzuli...@apache.org > > > > > > > wrote: > > > >> > > > >> Hi Flink devs, > > > >> > > > >> As part of a Stateful Functions release, we would like to publish > > > Stateful > > > >> Functions Docker images to Dockerhub as an official image. > > > >> > > > >> Some background context on Stateful Function images, for those who > are > > > not > > > >> familiar with the project yet: > > > >> > > > >>- Stateful Function images are built on top of the Flink official > > > >>images, with additional StateFun dependencies being added. > > > >>You can take a look at the scripts we currently use to build the > > > images > > > >>locally for development purposes [1]. 
> > > >>- They are quite important for user experience, since building a > > > Docker > > > >>image is the recommended go-to deployment mode for StateFun user > > > >>applications [2]. > > > >> > > > >> > > > >> A prerequisite for all of this is to first decide where we host the > > > >> Stateful Functions Dockerfiles, > > > >> before we can proceed with the process of requesting a new official > > > image > > > >> repository at Dockerhub. > > > >> > > > >> We’re proposing to create a new dedicated repo for this purpose, > > > >> with the name `apache/flink-statefun-docker`. > > > >> > > > >> While we did initially consider integrating the StateFun Dockerfiles > > to > > > be > > > >> hosted together with the Flink ones in the existing > > > `apache/flink-docker` > > > >> repo, we had the following concerns: > > > >> > > > >>- In general, it is a convention that each official Dockerhub > image > > > is > > > >>backed by a dedicated source repo hosting the Dockerfiles. > > > >>- The `apache/flink-docker` repo already has quite a few > dedicated > > > >>tooling and CI smoke tests specific for the Flink images. > > > >>- Flink and StateFun have separate versioning schemes and > > independent > > > >>release cycles. A new Flink release does not necessarily require > a > > > >>“lock-step” to release new StateFun images as well. > > > >>- Considering the above all-together, and the fact that creating > a > > > new > > > >>repo is rather low-effort, having a separate repo would probably > > make > > > more > > > >>sense here. > > > >> > > > >> > > > >> What do you think? > > > >> > > > >> Cheers, > > > >> Gordon > > > >> > > > >> [1] > > > >> > > > > > > > > > https://github.com/apache/flink-statefun/blob/master/tools/docker/build-stateful-functions.sh > > > >> [2] > > > >> > > > > > > > > > https://ci.apache.org/projects/flink/flink-statefun-docs-master/deployment-and-operations/packaging.html > > > > > >
[DISCUSS] FLIP-119: Pipelined Region Scheduling
Hi community,

In the past releases, we have been working on refactoring Flink's scheduler with the goal of making the scheduler extensible [1]. We have rolled out most of the intended refactoring in Flink 1.10, and we think it is now time to leverage our newly introduced abstractions to implement a new resource-optimized scheduling strategy: Pipelined Region Scheduling.

This scheduling strategy aims at:
* avoiding resource deadlocks when running batch jobs
* being tunable with respect to resource consumption and throughput

More details can be found in the Wiki [2]. We are looking forward to your feedback.

Best,
Zhu Zhu & Gary

[1] https://issues.apache.org/jira/browse/FLINK-10429
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
Re: [VOTE] FLIP 116: Unified Memory Configuration for Job Managers
+1 (binding) Best, Gary On Wed, Mar 18, 2020 at 3:16 PM Andrey Zagrebin wrote: > Hi All, > > The discussion for FLIP-116 looks to be resolved [1]. > Therefore, I start the vote for it. > The vote will end at 6pm CET on Monday, 23 March. > > Best, > Andrey > > [1] > > http://mail-archives.apache.org/mod_mbox/flink-dev/202003.mbox/%3CCAJNyZN7AJAU_RUVhnWa7r%2B%2BtXpmUqWFH%2BG0hfoLVBzgRMmAO2w%40mail.gmail.com%3E >
[jira] [Created] (FLINK-16718) KvStateServerHandlerTest leaks Netty ByteBufs
Gary Yao created FLINK-16718: Summary: KvStateServerHandlerTest leaks Netty ByteBufs Key: FLINK-16718 URL: https://issues.apache.org/jira/browse/FLINK-16718 Project: Flink Issue Type: Bug Components: Runtime / Queryable State, Tests Affects Versions: 1.10.0, 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 The {{KvStateServerHandlerTest}} leaks Netty {{ByteBuf}}s. -- This message was sent by Atlassian Jira (v8.3.4#803005)
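Netty `ByteBuf`s are reference counted, so a test that obtains buffers without releasing them leaks memory. The usual Java fix is to pair every allocation with a `release()` in a `finally` block. The sketch below illustrates that discipline in Python; `RefCounted` is a made-up stand-in for Netty's `ByteBuf`, not Netty API:

```python
# Sketch of the reference-counting discipline behind Netty ByteBufs.
# RefCounted is a stand-in class, not Netty API; the point is that every
# allocation (and every retain()) must be balanced by a release(),
# typically in a finally block so exceptions cannot cause a leak.

class RefCounted:
    def __init__(self):
        self.ref_cnt = 1  # freshly allocated buffers start at refCnt 1

    def retain(self):
        self.ref_cnt += 1
        return self

    def release(self):
        if self.ref_cnt <= 0:
            raise RuntimeError("already released")
        self.ref_cnt -= 1
        return self.ref_cnt == 0  # True once the buffer is deallocated

def process(buf):
    # Consume the buffer, releasing it even if processing fails.
    try:
        pass  # ... read from buf ...
    finally:
        buf.release()

buf = RefCounted()
process(buf)
assert buf.ref_cnt == 0  # no leak: refCnt dropped back to zero
```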
[jira] [Created] (FLINK-16710) Log Upload blocks Main Thread in TaskExecutor
Gary Yao created FLINK-16710: Summary: Log Upload blocks Main Thread in TaskExecutor Key: FLINK-16710 URL: https://issues.apache.org/jira/browse/FLINK-16710 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.11.0 Uploading logs to the BlobServer blocks the TaskExecutor's main thread. We should introduce an IO thread pool that carries out file system accesses. -- This message was sent by Atlassian Jira (v8.3.4#803005)
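The proposed fix — offloading blocking file-system access to a dedicated IO thread pool so the main thread only schedules work and stays responsive — can be sketched as follows. This is a Python illustration of the pattern, not the TaskExecutor's actual Java code; all names are made up:

```python
# Sketch: keep the "main thread" free of blocking IO by handing file
# accesses to a separate IO executor and picking up the result via a
# future. Mirrors the idea of a dedicated IO pool in the TaskExecutor;
# names here are illustrative.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

io_executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="io")

def read_log_file(path):
    # Blocking file-system access; must not run on the main thread.
    with open(path, "rb") as f:
        return f.read()

def upload_logs(path):
    # The main thread only schedules the work and returns immediately.
    return io_executor.submit(read_log_file, path)

fd, path = tempfile.mkstemp()
os.write(fd, b"log line")
os.close(fd)

future = upload_logs(path)            # returns without blocking
assert future.result() == b"log line" # result collected asynchronously

os.remove(path)
io_executor.shutdown()
```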
[jira] [Created] (FLINK-16590) flink-oss-fs-hadoop: Not all dependencies in NOTICE file are bundled
Gary Yao created FLINK-16590: Summary: flink-oss-fs-hadoop: Not all dependencies in NOTICE file are bundled Key: FLINK-16590 URL: https://issues.apache.org/jira/browse/FLINK-16590 Project: Flink Issue Type: Bug Components: Connectors / FileSystem, Release System Affects Versions: 1.11.0 Reporter: Gary Yao Fix For: 1.11.0 The NOTICE file in flink-oss-fs-hadoop lists {{org.apache.commons:commons-compress}} as a bundled dependency, which is not correct. There are likely other dependencies that are wrongly listed in the NOTICE file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16445) Raise japicmp.referenceVersion to 1.10.0
Gary Yao created FLINK-16445: Summary: Raise japicmp.referenceVersion to 1.10.0 Key: FLINK-16445 URL: https://issues.apache.org/jira/browse/FLINK-16445 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.11.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.11.0 In {{pom.xml}}, change property {{japicmp.referenceVersion}} to {{1.10.0}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [VOTE] FLIP-100[NEW]: Add Attempt Information
+1 (binding) Best, Gary On Wed, Mar 4, 2020 at 1:18 PM Yadong Xie wrote: > Hi all > > I want to start the vote for FLIP-100, which proposes to add attempt > information inside subtask and timeline in web UI. > > To help everyone better understand the proposal, we spent some efforts on > making an online POC > > Timeline Attempt (click the vertex timeline to see the differences): > previous web: > http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/timeline > POC web: > > http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/timeline > > Subtask Attempt (click the vertex and switch to subtask tab to see the > differences): > previous web: > http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/overview > POC web: > > http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/overview > > > The vote will last for at least 72 hours, following the consensus voting > process. > > FLIP wiki: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information > > Discussion thread: > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html > > Thanks, > > Yadong >
Re: [VOTE] FLIP-100: Add Attempt Information
Hi Yadong,

Thank you for updating the wiki page. Only one minor suggestion – I would change:

> If show-history is true return the information of attempt.

to

> If show-history is true, information for all attempts including previous ones will be returned

That being said, FLIP-100 looks good to me. From my side there is nothing else to discuss.

@Kurt and @Jark: Can you look into the improvements that have been made since the last time you looked at the PoC? If you are happy, we can restart the voting.

Best,
Gary

On Tue, Mar 3, 2020 at 2:34 PM Yadong Xie wrote:

> Hi all
>
> The REST API part has been updated with Gary and Till's suggestions.
> Here is the link:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information
>
> Yadong Xie 于2020年3月3日周二 下午9:14写道:
>
> > Hi Chesnay
> >
> > Most discussions in this vote are about additional feature/demo requests
> > for the POC or about the response format; the main proposal, the web UI
> > part, has not changed.
> >
> > The discussion about the response is converging, and the response format
> > discussion could happen either here or at the code review stage, which
> > would be a minor change from my point of view.
> >
> > Chesnay Schepler 于2020年3月3日周二 下午8:20写道:
> >
> >> I suggest to cancel this vote.
> >> Several discussion items have been brought up during the vote, some of
> >> which are still unresolved, others which resulted in changes to the
> >> proposal.
> >>
> >> My conclusion is that this proposal needs more discussions.
> >>
> >> On 20/02/2020 10:46, Yadong Xie wrote:
> >> > Hi all
> >> >
> >> > I want to start the vote for FLIP-100, which proposes to add attempt
> >> > information inside subtask and timeline in web UI.
> >> > > >> > To help everyone better understand the proposal, we spent some efforts > >> on > >> > making an online POC > >> > > >> > Timeline Attempt (click the vertex timeline to see the differences): > >> > previous web: > >> > > >> > http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/timeline > >> > POC web: > >> > > >> > http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/timeline > >> > > >> > Subtask Attempt (click the vertex and switch to subtask tab to see the > >> > differences): > >> > previous web: > >> > > >> > http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/overview > >> > POC web: > >> > > >> > http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/overview > >> > > >> > > >> > The vote will last for at least 72 hours, following the consensus > voting > >> > process. > >> > > >> > FLIP wiki: > >> > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information > >> > > >> > Discussion thread: > >> > > >> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html > >> > > >> > Thanks, > >> > > >> > Yadong > >> > > >> > >> >
Re: [VOTE] FLIP-100: Add Attempt Information
Hi Yadong, Thanks for driving this FLIP. I have a few questions/remarks: * Why are we duplicating the subtask index in the objects that are stored in the attempts-time-info array? I thought that all objects in the same array share the same subtask index. * Are we confident that the attempts-time-info array does not grow too large during the lifetime of a job? Should the size of the array be limited? * Have we considered placing the historic attempts in the same array as the current attempts, i.e., flatten the arrays? One could toggle the historic attempts on and off with a query parameter. * I think 'attempt-history' would be a better name instead of 'attempts-time-info'. Let me know what you think. Best, Gary On Thu, Feb 20, 2020 at 10:46 AM Yadong Xie wrote: > Hi all > > I want to start the vote for FLIP-100, which proposes to add attempt > information inside subtask and timeline in web UI. > > To help everyone better understand the proposal, we spent some efforts on > making an online POC > > Timeline Attempt (click the vertex timeline to see the differences): > previous web: > http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/timeline > POC web: > > http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/timeline > > Subtask Attempt (click the vertex and switch to subtask tab to see the > differences): > previous web: > http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/overview > POC web: > > http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/overview > > > The vote will last for at least 72 hours, following the consensus voting > process. > > FLIP wiki: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information > > Discussion thread: > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html > > Thanks, > > Yadong >
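To make the "flatten the arrays" suggestion from the questions above concrete, here is a small Python sketch of what a flattened subtask response could look like, with a `show-history`-style toggle. All field names here are hypothetical — the real schema is whatever FLIP-100 finally specifies:

```python
# Hypothetical flattened subtask response: current and historic attempts
# share one array (instead of a separate attempts-time-info array), the
# subtask index is stored once, and a query parameter toggles whether
# history is included. Field names are illustrative, not the FLIP schema.
subtask = {
    "subtask": 0,  # stored once, not duplicated per attempt
    "attempts": [
        {"attempt": 2, "status": "RUNNING", "current": True},
        {"attempt": 1, "status": "FAILED",  "current": False},
    ],
}

def visible_attempts(subtask, show_history=False):
    """Mimic a `show-history` query parameter: without it, only the
    current attempt is returned; with it, all attempts are."""
    if show_history:
        return subtask["attempts"]
    return [a for a in subtask["attempts"] if a["current"]]

assert len(visible_attempts(subtask)) == 1
assert len(visible_attempts(subtask, show_history=True)) == 2
```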
Re: [VOTE] FLIP-99: Make Max Exception Configurable
+1 (binding) Best, Gary On Thu, Feb 20, 2020 at 10:39 AM Yadong Xie wrote: > Hi all > > I want to start the vote for FLIP-99, which proposes to make the max > exception configurable in web UI. > > To help everyone better understand the proposal, we spent some efforts on > making an online POC > > previous web: > > http://101.132.122.69:8081/#/job/543e9dc0cb2cca4433116007f0931d1a/exceptions > POC web: > > http://101.132.122.69:8081/web/#/job/543e9dc0cb2cca4433116007f0931d1a/exceptions > > The vote will last for at least 72 hours, following the consensus voting > process. > > FLIP wiki: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-99%3A+Make+Max+Exception+Configurable > > Discussion thread: > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html > > Thanks, > > Yadong >
[RESULT] [VOTE] Release 1.10.0, release candidate #3
I'm happy to announce that we have unanimously approved this release. There are 16 approving votes, 5 of which are binding: * Kurt Young (binding) * Jark Wu (binding) * Kostas Kloudas (binding) * Thomas Weise (binding) * Jincheng Sun (binding) * Aihua Li (non-binding) * Zhu Zhu (non-binding) * Congxian Qiu (non-binding) * Rui Li (non-binding) * Jingsong Li (non-binding) * Yang Wang (non-binding) * Piotr Nowojski (non-binding) * Xintong Song (non-binding) * Benchao Li (non-binding) * Zili Chen (non-binding) * Yangze Guo (non-binding) There are no disapproving votes. Thanks everyone!
[VOTE] Release 1.10.0, release candidate #3
Hi everyone, Please review and vote on the release candidate #3 for the version 1.10.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], * the official Apache source release and binary convenience releases to be deployed to dist.apache.org [2], which are signed with the key with fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3], * all artifacts to be deployed to the Maven Central Repository [4], * source code tag "release-1.10.0-rc3" [5], * website pull request listing the new release and adding announcement blog post [6][7]. The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. Thanks, Yu & Gary [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845 [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc3/ [3] https://dist.apache.org/repos/dist/release/flink/KEYS [4] https://repository.apache.org/content/repositories/orgapacheflink-1333 [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc3 [6] https://github.com/apache/flink-web/pull/302 [7] https://github.com/apache/flink-web/pull/301
Re: [VOTE] Release 1.10.0, release candidate #2
-shaded-hadoop-2-uber-2.7.5-7.0.jar:/Users/jzhang/github/flink/build-target/lib/log4j-1.2.17.jar:/Users/jzhang/github/flink/build-target/lib/flink-table-blink_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-table_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-dist_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-python_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-hadoop-compatibility_2.11-1.9.1.jar*:/Users/jzhang/github/zeppelin/interpreter/flink/*:/Users/jzhang/github/zeppelin/zeppelin-interpreter-api/target/*:/Users/jzhang/github/zeppelin/zeppelin-zengine/target/test-classes/*::/Users/jzhang/github/zeppelin/interpreter/zeppelin-interpreter-api-0.9.0-SNAPSHOT.jar:/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/classes:/Users/jzhang/github/zeppelin/zeppelin-zengine/target/test-classes:/Users/jzhang/github/flink/build-target/opt/flink-python_2.11-1.10-SNAPSHOT.jar > > > > > > > > > > > > > > > > > > > > /Users/jzhang/github/flink/build-target/opt/tmp/flink-python_2.11-1.10-SNAPSHOT.jar > > > > > org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer > > 0.0.0.0 > > > > > 52603 flink-shared_process : > > > > > > > > > > > > > > > Jingsong Li 于2020年2月6日周四 下午4:10写道: > > > > > > > > > > > Hi Jeff, > > > > > > > > > > > > > > > > > > For FLINK-15935 [1], > > > > > > > > > > > > I try to think of it as a non blocker. But it's really an > important > > > > > issue. > > > > > > > > > > > > > > > > > > The problem is the class loading order. We want to load the class > > in > > > > the > > > > > > blink-planner.jar, but actually load the class in the > > > > flink-planner.jar. > > > > > > > > > > > > > > > > > > First of all, the order of class loading is based on the order of > > > > > > classpath. > > > > > > > > > > > > > > > > > > I just tried, the order of classpath of the folder is depends on > > the > > > > > order > > > > > > of file names. 
> > > > > > -That is to say, our order is OK now: because
> > > > > > flink-table-blink_2.11-1.11-SNAPSHOT.jar comes before
> > > > > > flink-table_2.11-1.11-SNAPSHOT.jar in name order.
> > > > > >
> > > > > > -But once I change the name of flink-table_2.11-1.11-SNAPSHOT.jar
> > > > > > to aflink-table_2.11-1.11-SNAPSHOT.jar, an error will be reported.
> > > > > >
> > > > > > The order of classpath entries should be influenced by the ls of
> > > > > > Linux. By default the ls command lists the files in alphabetical
> > > > > > order. [2]
> > > > > >
> > > > > > The IDE seems to have another rule, because it loads classes
> > > > > > directly, not jars.
> > > > > >
> > > > > > So the possible impact of this problem is:
> > > > > >
> > > > > > -In the IDE with two planners, using watermarks will report an error.
> > > > > >
> > > > > > -In a real production environment, under Linux, no error will be
> > > > > > reported, for the reasons above.
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-15935
> > > > > >
> > > > > > [2]
> > > > > > https://linuxize.com/post/how-to-list-files-in-linux-using-the-ls-command/
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Jingsong Lee
> > > > > >
> > > > > > On Thu, Feb 6, 2020 at 3:09 PM Jeff Zhang wrote:
> > > > > >
> > > > > > > -1, I just found one critical issue
> > > > > > > https://issues.apache.org/jira/browse/FLINK-15935
> > > > > > > This ticket means user unable to use watermar
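The alphabetical-ordering effect Jingsong describes above can be reproduced in a few lines. This is an illustrative Python sketch: on a flat classpath the first jar containing a class wins, so if the listing order is alphabetical (as with plain `ls`), renaming a jar silently changes which duplicate class gets loaded. The class name below is made up:

```python
# Sketch: on a flat classpath, the first jar containing a class wins.
# With an alphabetical directory listing, renaming a jar can change
# which duplicate class is loaded. "WatermarkGenerator" is a made-up
# class name used only for illustration.

def first_provider(jars_in_listing_order, cls, providers):
    """Return the first jar on the classpath that contains `cls`."""
    for jar in jars_in_listing_order:
        if cls in providers.get(jar, ()):
            return jar
    return None

providers = {
    "flink-table-blink_2.11-1.11-SNAPSHOT.jar": {"WatermarkGenerator"},
    "flink-table_2.11-1.11-SNAPSHOT.jar": {"WatermarkGenerator"},
}

# Alphabetically, '-' sorts before '_', so the blink planner jar wins:
assert first_provider(sorted(providers), "WatermarkGenerator", providers) \
    == "flink-table-blink_2.11-1.11-SNAPSHOT.jar"

# Rename the old planner jar so it sorts first, and the wrong jar wins:
renamed = {
    "aflink-table_2.11-1.11-SNAPSHOT.jar": {"WatermarkGenerator"},
    "flink-table-blink_2.11-1.11-SNAPSHOT.jar": {"WatermarkGenerator"},
}
assert first_provider(sorted(renamed), "WatermarkGenerator", renamed) \
    == "aflink-table_2.11-1.11-SNAPSHOT.jar"
```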
[jira] [Created] (FLINK-15953) Job Status is hard to read for some Statuses
Gary Yao created FLINK-15953: Summary: Job Status is hard to read for some Statuses Key: FLINK-15953 URL: https://issues.apache.org/jira/browse/FLINK-15953 Project: Flink Issue Type: Bug Components: Runtime / Web Frontend Affects Versions: 1.9.2, 1.10.0 Reporter: Gary Yao Assignee: Yadong Xie Fix For: 1.10.1, 1.11.0 Attachments: 769B08ED-D644-4DEB-BA4C-14B18E562A52.png The job status {{RESTARTING}} is rendered in a white font on white background which makes it hard to read. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15936) TaskExecutorTest#testSlotAcceptance deadlocks
Gary Yao created FLINK-15936: Summary: TaskExecutorTest#testSlotAcceptance deadlocks Key: FLINK-15936 URL: https://issues.apache.org/jira/browse/FLINK-15936 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.1 https://api.travis-ci.org/v3/job/646510877/log.txt {noformat} "main" #1 prio=5 os_prio=0 tid=0x7f2f5800b800 nid=0x499 waiting on condition [0x7f2f61733000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x8669b3a8> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at org.apache.flink.runtime.taskexecutor.TaskExecutorTest.testSlotAcceptance(TaskExecutorTest.java:875) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
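The stack trace above shows the test blocking forever in `CompletableFuture.get()`. A common guard in tests is the timed variant, `get(timeout, unit)`, so a hang fails fast with a diagnosable timeout instead of deadlocking the whole build. The same pattern in Python, purely as an illustration (the actual fix belongs in the Java test):

```python
# Sketch: never block a test indefinitely on a future. A bounded wait
# turns a deadlock into a fast, diagnosable timeout failure.
# (Java equivalent: future.get(10, TimeUnit.SECONDS) instead of
# future.get().)
from concurrent.futures import Future, TimeoutError as FutureTimeout

never_completed = Future()  # stands in for the signal the test waits on

try:
    never_completed.result(timeout=0.1)  # bounded wait
    timed_out = False
except FutureTimeout:
    timed_out = True  # the test fails here instead of hanging forever

assert timed_out
```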
Re: [VOTE] Release 1.10.0, release candidate #2
Note that there is currently an ongoing discussion about whether FLINK-15917 and FLINK-15918 should be fixed in 1.10.0 [1]. [1] http://mail-archives.apache.org/mod_mbox/flink-dev/202002.mbox/%3CCA%2B5xAo3D21-T5QysQg3XOdm%3DL9ipz3rMkA%3DqMzxraJRgfuyg2A%40mail.gmail.com%3E On Wed, Feb 5, 2020 at 8:00 PM Gary Yao wrote: > Hi everyone, > Please review and vote on the release candidate #2 for the version 1.10.0, > as follows: > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * the official Apache source release and binary convenience releases to be > deployed to dist.apache.org [2], which are signed with the key with > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3], > * all artifacts to be deployed to the Maven Central Repository [4], > * source code tag "release-1.10.0-rc2" [5], > * website pull request listing the new release and adding announcement > blog post [6][7]. > > The vote will be open for at least 24 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > Thanks, > Yu & Gary > > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845 > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc2/ > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1332 > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc2 > [6] https://github.com/apache/flink-web/pull/302 > [7] https://github.com/apache/flink-web/pull/301 >
Re: [VOTE] Release 1.10.0, release candidate #1
It is indeed unfortunate that these issues are discovered only now. I think Thomas has a valid point, and we might be risking the trust of our users here. What are our options?

1. Document this behavior and how to work around it copiously in the release notes [1]
2. Try to restore the previous behavior
3. Change the default value of jobmanager.scheduler to "legacy" and roll out the feature in 1.11
4. Change the default value of jobmanager.scheduler to "legacy" and roll out the feature in 1.10.1 at the earliest

[1] https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86

On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen wrote:

> Should we make these a blocker? I am not sure - we could also clearly
> state in the release notes how to restore the old behavior, if your setup
> assumes that behavior.
>
> Release candidates for this release have been out since mid December, it
> is a bit unfortunate that these things have been raised so late.
> Having these rather open ended tickets (how to re-define the existing
> metrics in the new scheduler/failover handling) now as release blockers
> would mean that 1.10 is indefinitely delayed.
>
> Are we sure we want to do that?
>
> On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise wrote:
>
>> Hi Gary,
>>
>> Thanks for the clarification!
>>
>> When we upgrade to a new Flink release, we don't start with a default
>> flink-conf.yaml but upgrade our existing tooling and configuration.
>> Therefore we notice this issue as part of the upgrade to 1.10, and not
>> when we upgraded to 1.9.
>>
>> I would expect many other users to be in the same camp, and therefore
>> consider making these regressions a blocker for 1.10?
>>
>> Thanks,
>> Thomas
>>
>> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao wrote:
>>
>> > > also notice that the exception causing a restart is no longer
>> > > displayed in the UI, which is probably related?
>> >
>> > Yes, this is also related to the new scheduler. I created FLINK-15917
>> > [1] to track this. Moreover, I created a ticket about the uptime
>> > metric not resetting [2]. Both issues already exist in 1.9 if
>> > "jobmanager.execution.failover-strategy" is set to "region", which is
>> > the case in the default flink-conf.yaml.
>> >
>> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough
>> > to fall back to the previous behavior.
>> >
>> > In 1.10, you can still fall back to the previous behavior by setting
>> > "jobmanager.scheduler: legacy" and unsetting
>> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml.
>> >
>> > I would not consider these issues blockers since there is a workaround
>> > for them, but of course we would like to see the new scheduler getting
>> > some production exposure. More detailed release notes about the
>> > caveats of the new scheduler will be added to the user documentation.
>> >
>> > > The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
>> >
>> > This should be fixed now [3].
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-15917
>> > [2] https://issues.apache.org/jira/browse/FLINK-15918
>> > [3] https://issues.apache.org/jira/browse/FLINK-8949
>> >
>> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise wrote:
>> >
>> >> Hi Gary,
>> >>
>> >> Thanks for the reply.
>> >>
>> >> -->
>> >>
>> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao wrote:
>> >>
>> >> > Hi Thomas,
>> >> >
>> >> > > 2) Was there a change in how job recovery reflects in the uptime
>> >> > > metric? Didn't uptime previously reset to 0 on recovery (now it
>> >> > > just keeps increasing)?
>> >> >
>> >> > The uptime is the difference between the current time and the time
>> >> > when the job transitioned to RUNNING state. By default we no longer
>> >> > transition the job out of the RUNNING state when restarting. This
>> >> > has something to do with the new scheduler which enables pipelined
>> >> > region failover by default [1].
[VOTE] Release 1.10.0, release candidate #2
Hi everyone, Please review and vote on the release candidate #2 for the version 1.10.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], * the official Apache source release and binary convenience releases to be deployed to dist.apache.org [2], which are signed with the key with fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3], * all artifacts to be deployed to the Maven Central Repository [4], * source code tag "release-1.10.0-rc2" [5], * website pull request listing the new release and adding announcement blog post [6][7]. The vote will be open for at least 24 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. Thanks, Yu & Gary [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845 [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc2/ [3] https://dist.apache.org/repos/dist/release/flink/KEYS [4] https://repository.apache.org/content/repositories/orgapacheflink-1332 [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc2 [6] https://github.com/apache/flink-web/pull/302 [7] https://github.com/apache/flink-web/pull/301
Re: [VOTE] Release 1.10.0, release candidate #1
> also notice that the exception causing a restart is no longer displayed > in the UI, which is probably related? Yes, this is also related to the new scheduler. I created FLINK-15917 [1] to track this. Moreover, I created a ticket about the uptime metric not resetting [2]. Both issues already exist in 1.9 if "jobmanager.execution.failover-strategy" is set to "region", which is the case in the default flink-conf.yaml. In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to fall back to the previous behavior. In 1.10, you can still fall back to the previous behavior by setting "jobmanager.scheduler: legacy" and unsetting "jobmanager.execution.failover-strategy" in your flink-conf.yaml I would not consider these issues blockers since there is a workaround for them, but of course we would like to see the new scheduler getting some production exposure. More detailed release notes about the caveats of the new scheduler will be added to the user documentation. > The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470 This should be fixed now [3]. [1] https://issues.apache.org/jira/browse/FLINK-15917 [2] https://issues.apache.org/jira/browse/FLINK-15918 [3] https://issues.apache.org/jira/browse/FLINK-8949 On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise wrote: > Hi Gary, > > Thanks for the reply. > > --> > > On Tue, Feb 4, 2020 at 5:20 AM Gary Yao wrote: > > > Hi Thomas, > > > > > 2) Was there a change in how job recovery reflects in the uptime > metric? > > > Didn't uptime previously reset to 0 on recovery (now it just keeps > > > increasing)? > > > > The uptime is the difference between the current time and the time when > the > > job transitioned to RUNNING state. By default we no longer transition the > > job > > out of the RUNNING state when restarting. This has something to do with > the > > new scheduler which enables pipelined region failover by default [1]. 
> > Actually > > we enabled pipelined region failover already in the binary distribution > of > > Flink 1.9 by setting: > > > > jobmanager.execution.failover-strategy: region > > > > in the default flink-conf.yaml. Unless you have removed this config > option > > or > > you are using a custom yaml, you should be seeing this behavior in Flink > > 1.9. > > If you do not want region failover, set > > > > jobmanager.execution.failover-strategy: full > > > > > We are using the default (the jobmanager.execution.failover-strategy > setting is not present in our flink config). > > The change in behavior I see is between the 1.9 based deployment and the > 1.10 RC. > > Our 1.9 branch is here: > https://github.com/lyft/flink/tree/release-1.9-lyft > > I also notice that the exception causing a restart is no longer displayed > in the UI, which is probably related? > > > > > > > 1) Is the low watermark display in the UI still broken? > > > > I was not aware that this is broken. Is there an issue tracking this bug? > > > > The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470 > > (I don't have a good way to verify it is fixed at the moment.) > > Another problem with this 1.10 RC is that the checkpointAlignmentTime > metric is missing. (I have not been able to investigate this further yet.) > > > > > > Best, > > Gary > > > > [1] https://issues.apache.org/jira/browse/FLINK-14651 > > > > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise wrote: > > > >> I opened a PR for FLINK-15868 > >> <https://issues.apache.org/jira/browse/FLINK-15868>: > >> https://github.com/apache/flink/pull/11006 > >> > >> With that change, I was able to run an application that consumes from > >> Kinesis. > >> > >> I should have data tomorrow regarding the performance. > >> > >> Two questions/observations: > >> > >> 1) Is the low watermark display in the UI still broken? > >> 2) Was there a change in how job recovery reflects in the uptime metric? 
> >> Didn't uptime previously reset to 0 on recovery (now it just keeps > >> increasing)? > >> > >> Thanks, > >> Thomas > >> > >> > >> > >> > >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise wrote: > >> > >> > I found another issue with the Kinesis connector: > >> > > >> > https://issues.apache.org/jira/browse/FLINK-15868 > >> > > >> > > >> > On Mon, Feb 3, 2020 at 3:35
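The 1.10 fallback described in the reply above can be sketched as a flink-conf.yaml fragment (illustrative only; the option names are the ones quoted in the thread):

```yaml
# Fall back to the pre-1.10 scheduling behavior (per the thread above):
jobmanager.scheduler: legacy
# ...and remove/unset any "jobmanager.execution.failover-strategy" entry.
# On 1.9, restoring the old behavior only required unsetting:
# jobmanager.execution.failover-strategy: region
```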
[jira] [Created] (FLINK-15918) Uptime Metric not reset on Job Restart
Gary Yao created FLINK-15918: Summary: Uptime Metric not reset on Job Restart Key: FLINK-15918 URL: https://issues.apache.org/jira/browse/FLINK-15918 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.9.2, 1.10.0 Reporter: Gary Yao Fix For: 1.11.0, 1.10.1 *Description* The {{uptime}} metric is not reset when the job restarts, which is a change in behavior compared to Flink 1.8. This change of behavior has existed since 1.9.0 if {{jobmanager.execution.failover-strategy: region}} is configured, which is the case in the default flink-conf.yaml. *Workarounds* Users that find this behavior problematic can set {{jobmanager.scheduler: legacy}} in their {{flink-conf.yaml}}. *How to reproduce* trivial *Expected behavior* This is up for discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
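The behavior in the description can be illustrated with a minimal sketch (hypothetical names, not Flink's actual metric code): the gauge reports the current time minus the timestamp of the last transition into RUNNING, so a restart that never leaves RUNNING never refreshes that timestamp and the reported uptime keeps increasing.

```java
// Hypothetical sketch, not Flink's implementation: an uptime gauge based on the
// timestamp of the last transition into RUNNING.
class UptimeGauge {
    private long runningSince; // millis at which the job entered RUNNING

    UptimeGauge(long runningSince) {
        this.runningSince = runningSince;
    }

    // Only invoked on a full transition out of and back into RUNNING; with
    // region failover the job stays RUNNING, so this is never called on restart.
    void onTransitionToRunning(long now) {
        runningSince = now;
    }

    long getValue(long now) {
        return now - runningSince; // keeps growing if runningSince is never refreshed
    }
}
```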
[jira] [Created] (FLINK-15917) Root Exception not shown in Web UI
Gary Yao created FLINK-15917: Summary: Root Exception not shown in Web UI Key: FLINK-15917 URL: https://issues.apache.org/jira/browse/FLINK-15917 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.9.2, 1.10.0 Reporter: Gary Yao Fix For: 1.11.0, 1.10.1 *Description* On the job details page in the Exceptions → Root Exception tab, exceptions that cause the job to restart are not displayed. This has been a problem since 1.9.0 if {{jobmanager.execution.failover-strategy: region}} is configured, which is the case in the default flink-conf.yaml. *Workarounds* Users that run into this problem can set {{jobmanager.scheduler: legacy}} in their {{flink-conf.yaml}}. *How to reproduce* In {{flink-conf.yaml}}, set {{restart-strategy: fixed-delay}} to enable job restarts. {noformat} $ bin/start-cluster.sh $ bin/flink run -d examples/streaming/TopSpeedWindowing.jar $ bin/taskmanager.sh stop {noformat} Assert that no exception is displayed in the Web UI. *Expected behavior* The stacktrace of the exception should be displayed. Whether the exception should also be shown if only a single region of the job failed is up for discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [VOTE] Release 1.10.0, release candidate #1
Hi Thomas, > 2) Was there a change in how job recovery reflects in the uptime metric? > Didn't uptime previously reset to 0 on recovery (now it just keeps > increasing)? The uptime is the difference between the current time and the time when the job transitioned to RUNNING state. By default we no longer transition the job out of the RUNNING state when restarting. This has something to do with the new scheduler which enables pipelined region failover by default [1]. Actually we enabled pipelined region failover already in the binary distribution of Flink 1.9 by setting: jobmanager.execution.failover-strategy: region in the default flink-conf.yaml. Unless you have removed this config option or you are using a custom yaml, you should be seeing this behavior in Flink 1.9. If you do not want region failover, set jobmanager.execution.failover-strategy: full > 1) Is the low watermark display in the UI still broken? I was not aware that this is broken. Is there an issue tracking this bug? Best, Gary [1] https://issues.apache.org/jira/browse/FLINK-14651 On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise wrote: > I opened a PR for FLINK-15868 > <https://issues.apache.org/jira/browse/FLINK-15868>: > https://github.com/apache/flink/pull/11006 > > With that change, I was able to run an application that consumes from > Kinesis. > > I should have data tomorrow regarding the performance. > > Two questions/observations: > > 1) Is the low watermark display in the UI still broken? > 2) Was there a change in how job recovery reflects in the uptime metric? > Didn't uptime previously reset to 0 on recovery (now it just keeps > increasing)? 
> > Thanks, > Thomas > > > > > On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise wrote: > > > I found another issue with the Kinesis connector: > > > > https://issues.apache.org/jira/browse/FLINK-15868 > > > > > > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao wrote: > > > >> Hi everyone, > >> > >> I am hereby canceling the vote due to: > >> > >> FLINK-15837 > >> FLINK-15840 > >> > >> Another RC will be created later today. > >> > >> Best, > >> Gary > >> > >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao wrote: > >> > >> > Hi everyone, > >> > Please review and vote on the release candidate #1 for the version > >> 1.10.0, > >> > as follows: > >> > [ ] +1, Approve the release > >> > [ ] -1, Do not approve the release (please provide specific comments) > >> > > >> > > >> > The complete staging area is available for your review, which > includes: > >> > * JIRA release notes [1], > >> > * the official Apache source release and binary convenience releases > to > >> be > >> > deployed to dist.apache.org [2], which are signed with the key with > >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3], > >> > * all artifacts to be deployed to the Maven Central Repository [4], > >> > * source code tag "release-1.10.0-rc1" [5], > >> > > >> > The announcement blog post is in the works. I will update this voting > >> > thread with a link to the pull request soon. > >> > > >> > The vote will be open for at least 72 hours. It is adopted by majority > >> > approval, with at least 3 PMC affirmative votes. > >> > > >> > Thanks, > >> > Yu & Gary > >> > > >> > [1] > >> > > >> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845 > >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/ > >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > >> > [4] > >> https://repository.apache.org/content/repositories/orgapacheflink-1325 > >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1 > >> > > >> > > >
[jira] [Created] (FLINK-15900) JoinITCase#testRightJoinWithPk failed on Travis
Gary Yao created FLINK-15900: Summary: JoinITCase#testRightJoinWithPk failed on Travis Key: FLINK-15900 URL: https://issues.apache.org/jira/browse/FLINK-15900 Project: Flink Issue Type: Bug Affects Versions: 1.10.0 Reporter: Gary Yao {noformat} org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.table.planner.runtime.stream.sql.JoinITCase.testRightJoinWithPk(JoinITCase.scala:672) Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, backoffTimeMS=0) Caused by: java.lang.Exception: Exception while creating StreamOperatorStateContext. Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_17aecc34cf8aa256be6fe4836cbdf29a_(2/4) from any of the 1 provided restore options. Caused by: java.io.IOException: Failed to acquire shared cache resource for RocksDB Caused by: org.apache.flink.runtime.memory.MemoryAllocationException: Could not created the shared memory resource of size 20971520. Not enough memory left to reserve from the slot's managed memory. Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could not allocate 20971520 bytes. Only 0 bytes are remaining. {noformat} https://api.travis-ci.org/v3/job/645466432/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
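The chain of causes in the log above boils down to a reserve-from-a-fixed-budget pattern: the RocksDB shared cache asks the slot's managed memory for 20971520 bytes, but the budget is already exhausted. A simplified sketch of that pattern follows (hypothetical names, not Flink's actual MemoryManager API):

```java
// Hypothetical sketch of a reserve-or-fail memory budget, similar in spirit to
// the slot-managed-memory reservation failing in the log above.
class MemoryBudget {
    private long remaining;

    MemoryBudget(long totalBytes) {
        this.remaining = totalBytes;
    }

    // Reserve bytes from the budget, or fail if the budget cannot cover them.
    void reserve(long bytes) {
        if (bytes > remaining) {
            throw new IllegalStateException(
                "Could not allocate " + bytes + " bytes. Only " + remaining + " bytes are remaining.");
        }
        remaining -= bytes;
    }

    long remaining() {
        return remaining;
    }
}
```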
Re: [VOTE] Release 1.10.0, release candidate #1
Hi everyone, I am hereby canceling the vote due to: FLINK-15837 FLINK-15840 Another RC will be created later today. Best, Gary On Mon, Jan 27, 2020 at 10:06 PM Gary Yao wrote: > Hi everyone, > Please review and vote on the release candidate #1 for the version 1.10.0, > as follows: > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * the official Apache source release and binary convenience releases to be > deployed to dist.apache.org [2], which are signed with the key with > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3], > * all artifacts to be deployed to the Maven Central Repository [4], > * source code tag "release-1.10.0-rc1" [5], > > The announcement blog post is in the works. I will update this voting > thread with a link to the pull request soon. > > The vote will be open for at least 72 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > Thanks, > Yu & Gary > > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845 > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/ > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1325 > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1 >
[jira] [Created] (FLINK-15781) Manual Smoke Test with Batch Job
Gary Yao created FLINK-15781: Summary: Manual Smoke Test with Batch Job Key: FLINK-15781 URL: https://issues.apache.org/jira/browse/FLINK-15781 Project: Flink Issue Type: Task Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 Try out a (larger) scale batch job with {{"jobmanager.scheduler: ng"}} enabled in {{flink-conf.yaml}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15780) Manual Smoke Test of Native Kubernetes Integration
Gary Yao created FLINK-15780: Summary: Manual Smoke Test of Native Kubernetes Integration Key: FLINK-15780 URL: https://issues.apache.org/jira/browse/FLINK-15780 Project: Flink Issue Type: Task Components: Deployment / Kubernetes Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 Try out the [native Kubernetes integration|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html], and report any usability issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15779) Manual Smoke Test of Python API
Gary Yao created FLINK-15779: Summary: Manual Smoke Test of Python API Key: FLINK-15779 URL: https://issues.apache.org/jira/browse/FLINK-15779 Project: Flink Issue Type: Task Components: API / Python Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 Try out the [Python API|https://ci.apache.org/projects/flink/flink-docs-release-1.10/tutorials/python_table_api.html], and report any usability issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[VOTE] Release 1.10.0, release candidate #1
Hi everyone, Please review and vote on the release candidate #1 for the version 1.10.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], * the official Apache source release and binary convenience releases to be deployed to dist.apache.org [2], which are signed with the key with fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3], * all artifacts to be deployed to the Maven Central Repository [4], * source code tag "release-1.10.0-rc1" [5], The announcement blog post is in the works. I will update this voting thread with a link to the pull request soon. The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. Thanks, Yu & Gary [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845 [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/ [3] https://dist.apache.org/repos/dist/release/flink/KEYS [4] https://repository.apache.org/content/repositories/orgapacheflink-1325 [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
[jira] [Created] (FLINK-15754) Remove config options table.exec.resource.*memory from Documentation
Gary Yao created FLINK-15754: Summary: Remove config options table.exec.resource.*memory from Documentation Key: FLINK-15754 URL: https://issues.apache.org/jira/browse/FLINK-15754 Project: Flink Issue Type: Task Components: Documentation Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 {noformat} table.exec.resource.external-buffer-memory table.exec.resource.hash-agg.memory table.exec.resource.hash-join.memory table.exec.resource.sort.memory {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15751) RocksDB Memory Management end-to-end test fails
Gary Yao created FLINK-15751: Summary: RocksDB Memory Management end-to-end test fails Key: FLINK-15751 URL: https://issues.apache.org/jira/browse/FLINK-15751 Project: Flink Issue Type: Task Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 RocksDB Memory Management end-to-end test fails due to non-whitelisted exceptions in the TaskManager log. {noformat} org.apache.flink.runtime.execution.CancelTaskException: Consumed partition PipelinedSubpartitionView(index: 1) of ResultPartition a451ee900d8b1e282d877868cdb69dbe@b9cbd0aed214bb91ce4c4fb0fa6473a6 has been released. at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:190) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:509) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:487) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:475) at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75) at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527) at 
java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15743) Add Flink 1.10 release notes to documentation
Gary Yao created FLINK-15743: Summary: Add Flink 1.10 release notes to documentation Key: FLINK-15743 URL: https://issues.apache.org/jira/browse/FLINK-15743 Project: Flink Issue Type: Task Components: Documentation Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 Gather, edit, and add Flink 1.10 release notes to documentation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[ANNOUNCE] Apache Flink 1.10.0, release candidate #0
Hi all, RC0 for Apache Flink 1.10.0 has been created. This has all the artifacts that we would typically have for a release, except for a source code tag and a PR for the release announcement. This preview-only RC is created only to drive the current testing efforts, and no official vote will take place. It includes the following: * the preview source release and binary convenience releases [1], which are signed with the key with fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [2], * all artifacts that would normally be deployed to the Maven Central Repository [3] To test with these artifacts, you can create a settings.xml file with the content shown below [4]. This settings file can be referenced in your maven commands via --settings /path/to/settings.xml. This is useful for creating a quickstart project based on the staged release and also for building against the staged jars. Happy testing! Best, Gary [1] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc0/ [2] https://dist.apache.org/repos/dist/release/flink/KEYS [3] https://repository.apache.org/content/repositories/orgapacheflink-1318/ [4]
<settings>
  <activeProfiles>
    <activeProfile>flink-1.10.0</activeProfile>
  </activeProfiles>
  <profiles>
    <profile>
      <id>flink-1.10.0</id>
      <repositories>
        <repository>
          <id>flink-1.10.0</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1318/</url>
        </repository>
        <repository>
          <id>archetype</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1318/</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>
[jira] [Created] (FLINK-15638) releasing/create_release_branch.sh does not set version in flink-python/pyflink/version.py
Gary Yao created FLINK-15638: Summary: releasing/create_release_branch.sh does not set version in flink-python/pyflink/version.py Key: FLINK-15638 URL: https://issues.apache.org/jira/browse/FLINK-15638 Project: Flink Issue Type: Bug Components: Release System Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 {{releasing/create_release_branch.sh}} does not set the version in {{flink-python/pyflink/version.py}}. Currently, version.py contains: {noformat} __version__ = "1.10.dev0" {noformat} {{setup.py}} replaces .dev0 with -SNAPSHOT and tries to find the corresponding flink distribution in flink-dist/target, which will not exist. -- This message was sent by Atlassian Jira (v8.3.4#803005)
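The suffix translation that {{setup.py}} performs can be sketched as a one-liner (hypothetical helper name; the real script does more than this):

```java
// Hypothetical sketch of the ".dev0" -> "-SNAPSHOT" translation described above.
class VersionTranslation {
    static String toMavenVersion(String pyVersion) {
        // The PEP 440 dev suffix in version.py maps to the Maven snapshot suffix,
        // which is then used to look up the flink-dist build.
        return pyVersion.replace(".dev0", "-SNAPSHOT");
    }
}
```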
[jira] [Created] (FLINK-15623) Building flink-python with maven profile docs-and-source fails
Gary Yao created FLINK-15623: Summary: Building flink-python with maven profile docs-and-source fails Key: FLINK-15623 URL: https://issues.apache.org/jira/browse/FLINK-15623 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.10.0 Environment: rev: 91d96abe5f42bd088a326870b4885d79611fccb5 Reporter: Gary Yao Fix For: 1.10.0 *Description* Building flink-python with maven profile docs-and-source fails due to checkstyle violations. *How to reproduce* Running {noformat} mvn clean install -pl flink-python -Pdocs-and-source -DskipTests -DretryFailedDeploymentCount=10 {noformat} should fail with the following error {noformat} [...] [ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8343] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. [ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8344] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. [ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8345] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. [ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8346] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. [ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8347] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. [ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8348] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. 
[ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8349] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. [ERROR] generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8350] (regexp) RegexpSinglelineJava: Line has leading space characters; indentation should be performed with tabs only. [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 18.046 s [INFO] Finished at: 2020-01-16T16:44:01+00:00 [INFO] Final Memory: 158M/2826M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on project flink-python_2.11: You have 7603 Checkstyle violations. -> [Help 1] [ERROR] {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
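The violated rule rejects any space character in a line's leading whitespace. A sketch of that kind of check follows (an assumption about what the RegexpSinglelineJava pattern matches, not the actual checkstyle configuration):

```java
// Sketch of a tabs-only indentation check, similar in spirit to the violated rule.
class IndentCheck {
    static boolean hasLeadingSpaces(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                return true;  // space found in the leading whitespace: violation
            }
            if (c != '\t') {
                return false; // first non-whitespace character reached: no violation
            }
        }
        return false;         // empty line or tabs only
    }
}
```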
[jira] [Created] (FLINK-15317) State TTL Heap backend end-to-end test fails on Travis
Gary Yao created FLINK-15317: Summary: State TTL Heap backend end-to-end test fails on Travis Key: FLINK-15317 URL: https://issues.apache.org/jira/browse/FLINK-15317 Project: Flink Issue Type: Bug Components: Runtime / Network, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 https://api.travis-ci.org/v3/job/626286529/log.txt {noformat} Checking for errors... Found error in log files: ... java.lang.IllegalStateException: Released at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:483) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:474) at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75) at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:488) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527) at java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15316) SQL Client end-to-end test (Old planner) failed on Travis
Gary Yao created FLINK-15316: Summary: SQL Client end-to-end test (Old planner) failed on Travis Key: FLINK-15316 URL: https://issues.apache.org/jira/browse/FLINK-15316 Project: Flink Issue Type: Bug Components: Table SQL / Client, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 https://api.travis-ci.org/v3/job/626286501/log.txt {noformat} Checking for errors... Found error in log files: ... org.apache.flink.table.client.gateway.SqlExecutionException: Invalid SQL update statement. at org.apache.flink.table.client.gateway.local.LocalExecutor.applyUpdate(LocalExecutor.java:697) at org.apache.flink.table.client.gateway.local.LocalExecutor.executeUpdateInternal(LocalExecutor.java:576) at org.apache.flink.table.client.gateway.local.LocalExecutor.executeUpdate(LocalExecutor.java:527) at org.apache.flink.table.client.cli.CliClient.callInsertInto(CliClient.java:535) at org.apache.flink.table.client.cli.CliClient.lambda$submitUpdate$0(CliClient.java:231) at java.util.Optional.map(Optional.java:215) at org.apache.flink.table.client.cli.CliClient.submitUpdate(CliClient.java:228) at org.apache.flink.table.client.SqlClient.openCli(SqlClient.java:129) at org.apache.flink.table.client.SqlClient.start(SqlClient.java:104) at org.apache.flink.table.client.SqlClient.main(SqlClient.java:178) Caused by: java.lang.RuntimeException: Error while applying rule PushProjectIntoTableSourceScanRule, args [rel#88:FlinkLogicalCalc.LOGICAL(input=RelSubset#87,expr#0..2={inputs},expr#3=IS NOT NULL($t1),user=$t1,rowtime=$t0,$condition=$t3), Scan(table:[default_catalog, default_database, JsonSourceTable], fields:(rowtime, user, event), source:KafkaTableSource(rowtime, user, event))] at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:235) at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:631) at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:327) at 
org.apache.flink.table.plan.Optimizer.runVolcanoPlanner(Optimizer.scala:280) at org.apache.flink.table.plan.Optimizer.optimizeLogicalPlan(Optimizer.scala:199) at org.apache.flink.table.plan.StreamOptimizer.optimize(StreamOptimizer.scala:66) at org.apache.flink.table.planner.StreamPlanner.writeToUpsertSink(StreamPlanner.scala:353) at org.apache.flink.table.planner.StreamPlanner.org$apache$flink$table$planner$StreamPlanner$$writeToSink(StreamPlanner.scala:281) at org.apache.flink.table.planner.StreamPlanner$$anonfun$2.apply(StreamPlanner.scala:166) at org.apache.flink.table.planner.StreamPlanner$$anonfun$2.apply(StreamPlanner.scala:145) at scala.Option.map(Option.scala:146) at org.apache.flink.table.planner.StreamPlanner.org$apache$flink$table$planner$StreamPlanner$$translate(StreamPlanner.scala:145) at org.apache.flink.table.planner.StreamPlanner$$anonfun$translate$1.apply(StreamPlanner.scala:117) at org.apache.flink.table.planner.StreamPlanner$$anonfun$translate$1.apply(StreamPlanner.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.flink.table.planner.StreamPlanner.translate(StreamPlanner.scala:117) at org.apache.flink.table.api.internal.TableEnvironmentImpl.translate(TableEnvironmentImpl.java:680) at org.apache.flink.table.api.internal.TableEnvironmentImpl.sqlUpdate(TableEnvironmentImpl.java:493) at org.apache.flink.table.api.java.internal.StreamTableEnvironmentImpl.sqlUpdate(StreamTableEnvironmentImpl.java:331) at 
org.apache.flink.table.client.gateway.local.LocalExecutor.lambda$applyUpdate$15(LocalExecutor.java:690) at org.apache.flink.table.client.gateway.local.ExecutionContext.wrapClassLoader(ExecutionContext.java:240) at org.apache.flink.table.client.gateway.local.LocalExecutor.applyUpdate(LocalExecutor.java:688) ... 9 more Caused by: org.apache.flink.table.api.ValidationException: Rowtime field 'rowtime' has invalid type LocalDateTime. Rowtime attributes must be of type Timestamp. at org.apache.flink.table.sources.TableSourceUtil$$anonf
Re: [ANNOUNCE] Zhu Zhu becomes a Flink committer
Congratulations, well deserved. On Tue, Dec 17, 2019 at 10:09 AM lining jing wrote: > Congratulations Zhu Zhu~ > > Andrey Zagrebin 于2019年12月16日周一 下午5:01写道: > > > Congrats Zhu Zhu! > > > > On Mon, Dec 16, 2019 at 8:10 AM Xintong Song > > wrote: > > > > > Congratulations Zhu Zhu~ > > > > > > Thank you~ > > > > > > Xintong Song > > > > > > > > > > > > On Mon, Dec 16, 2019 at 12:34 PM Danny Chan > > wrote: > > > > > > > Congrats Zhu Zhu! > > > > > > > > Best, > > > > Danny Chan > > > > 在 2019年12月14日 +0800 AM12:51,dev@flink.apache.org,写道: > > > > > > > > > > Congrats Zhu Zhu and welcome on board! > > > > > > > > > >
[jira] [Created] (FLINK-15285) Resuming Externalized Checkpoint (file, async, no parallelism change) E2E test fails on Travis
Gary Yao created FLINK-15285: Summary: Resuming Externalized Checkpoint (file, async, no parallelism change) E2E test fails on Travis Key: FLINK-15285 URL: https://issues.apache.org/jira/browse/FLINK-15285 Project: Flink Issue Type: Bug Components: Runtime / Network, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 {noformat} == Running 'Resuming Externalized Checkpoint (file, async, no parallelism change) end-to-end test' == TEST_DATA_DIR: /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-08680387784 Flink dist directory: /home/travis/build/apache/flink/build-target Starting cluster. Starting standalonesession daemon on host travis-job-9b06ae82-c259-4d51-895c-c3439137e999. Starting taskexecutor daemon on host travis-job-9b06ae82-c259-4d51-895c-c3439137e999. Waiting for dispatcher REST endpoint to come up... Waiting for dispatcher REST endpoint to come up... Waiting for dispatcher REST endpoint to come up... Waiting for dispatcher REST endpoint to come up... Waiting for dispatcher REST endpoint to come up... Dispatcher REST endpoint is up. Running externalized checkpoints test, with ORIGINAL_DOP=2 NEW_DOP=2 and STATE_BACKEND_TYPE=file STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true SIMULATE_FAILURE=false ... Job (6eb020e02667feb4fd69b2b29be2a9ec) is running. Waiting for job (6eb020e02667feb4fd69b2b29be2a9ec) to have at least 1 completed checkpoints ... Waiting for job to process up to 200 records, current progress: 93 records ... Waiting for job to process up to 200 records, current progress: 146 records ... Cancelling job 6eb020e02667feb4fd69b2b29be2a9ec. Cancelled job 6eb020e02667feb4fd69b2b29be2a9ec. Restoring job with externalized checkpoint at /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-08680387784/externalized-chckpt-e2e-backend-dir/6eb020e02667feb4fd69b2b29be2a9ec/chk-9 ... Job (7604e7773d1f7b54b8b1c8655704aa86) is running. 
Waiting for job to process up to 200 records, current progress: 195 records ... Checking for errors... Found error in log files: {noformat} {noformat} 2019-12-16 11:51:30,518 WARN org.apache.flink.streaming.runtime.tasks.StreamTask - Error while canceling task. java.lang.IllegalStateException: Released at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:483) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:474) at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75) at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:488) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527) at java.lang.Thread.run(Thread.java:748) {noformat} https://api.travis-ci.org/v3/job/625630520/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15247) Closing (Testing)MiniCluster may cause ConcurrentModificationException
Gary Yao created FLINK-15247: Summary: Closing (Testing)MiniCluster may cause ConcurrentModificationException Key: FLINK-15247 URL: https://issues.apache.org/jira/browse/FLINK-15247 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 {noformat} at org.apache.flink.test.streaming.runtime.BackPressureITCase.tearDown(BackPressureITCase.java:165) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:27) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55) at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137) at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:107) at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:83) at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75) at org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:158) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) Caused by: org.apache.flink.util.FlinkException: Error while shutting the TaskExecutor down. 
at org.apache.flink.runtime.taskexecutor.TaskExecutor.handleOnStopException(TaskExecutor.java:397) at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$onStop$0(TaskExecutor.java:382) at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$16(FutureUtils.java:518) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) at java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:529) at java.util.concurrent.CompletableFuture.uniWhenComplete
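The failure class named in the issue title — a `ConcurrentModificationException` while closing the (Testing)MiniCluster — is the standard fail-fast behavior of Java collections that are structurally modified while being iterated, which is easy to trigger during concurrent shutdown. A minimal stdlib-only sketch of the failure mode and of a collection that tolerates it (illustrative only; this is not Flink's actual shutdown code):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CmeDemo {

    /** Returns true: mutating an ArrayList mid-iteration fails fast. */
    static boolean arrayListFailsFast() {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer i : list) {
                list.remove(i); // structural modification during iteration
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    /** CopyOnWriteArrayList iterates over a snapshot, so the same pattern is safe. */
    static boolean copyOnWriteIsSafe() {
        List<Integer> list = new CopyOnWriteArrayList<>(List.of(1, 2, 3));
        for (Integer i : list) {
            list.remove(i); // no exception: the iterator sees the old snapshot
        }
        return list.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("ArrayList fails fast: " + arrayListFailsFast());
        System.out.println("CopyOnWriteArrayList is safe: " + copyOnWriteIsSafe());
    }
}
```

Shutdown paths that let multiple components deregister themselves concurrently typically need either a concurrent collection like the one above or external synchronization around iteration.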
Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration
Thanks for your explanation. I think the proposal is reasonable. On Thu, Dec 12, 2019 at 3:32 AM Yangze Guo wrote: > Thanks for the feedback, Gary. > > Regarding the WordCount test: > - True. There is no test coverage increment compared to others. > However, I think each test case better not have multiple purposes so > that we could find out the root cause quickly. As discussed in > FLINK-15135[1], I prefer only including WordCount test as the first > step. If the time overhead of E2E tests become severe in the future, I > agree to remove it. WDYT? > - I think the main overhead comes from building the image. The > subsequent tests will run fast since they will not build it again. > > Regarding the Rocks test, I think it is a typical scenario using > off-heap memory. The main purpose is to verify the memory usage and > memory configuration in Mesos mode. Two typical use cases are off-heap > and on-heap. Thus, I think the following two test cases are valuable > to be included: > - A streaming task using heap backend. It should explicitly set the > “taskmanager.memory.managed.size” to zero to check the potential > unexpected usage of off-heap memory. > - A streaming task using rocks backend. It covers the scenario using > off-heap memory. > > Look forward to your kind feedback. > > [1]https://issues.apache.org/jira/browse/FLINK-15135 > > Best, > Yangze Guo > > > > On Wed, Dec 11, 2019 at 6:14 PM Gary Yao wrote: > > > > Thanks for driving this effort. Also +1 from my side. I have left a few > > questions below. > > > > > - Wordcount end-to-end test. For verifying the basic process of Mesos > > > deployment. > > > > Would this add additional test coverage compared to the > > "multiple submissions" test case? I am asking because the E2E tests are > > already > > expensive to run, and adding new tests should be carefully considered. > > > > > - State TTL RocksDb backend end-to-end test. 
For verifying memory > > > configuration behaviors, since Mesos has its own config options and > > > logic. > > > > Can you elaborate more on this? Which config options are relevant here? > > > > On Wed, Dec 11, 2019 at 9:58 AM Till Rohrmann > wrote: > > > > > +1 for building the image locally. If need should arise, then we could > > > change it always later. > > > > > > Cheers, > > > Till > > > > > > On Wed, Dec 11, 2019 at 4:05 AM Xintong Song > > > wrote: > > > > > > > Thanks, Yangze. > > > > > > > > +1 for building the image locally. > > > > The time consumption for both building image locally and pulling it > from > > > > DockerHub sounds reasonable and affordable. Therefore, I'm also in > favor > > > of > > > > avoiding the cost of maintaining a custom image. > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Wed, Dec 11, 2019 at 10:11 AM Yangze Guo > wrote: > > > > > > > > > Thanks for the feedback, Yang. > > > > > > > > > > Some updates I want to share in this thread. > > > > > I have built a PoC version of Mesos e2e test with WordCount > > > > > workflow.[1] Then, I ran it in the testing environment. As the > result > > > > > shown here[2]: > > > > > - For pulling image from DockerHub, it took 1 minute and 21 seconds > > > > > - For building it locally, it took 2 minutes and 54 seconds. > > > > > > > > > > I prefer building it locally. Although it is slower, I think the > time > > > > > overhead, comparing to the cost of maintaining the image in > DockerHub > > > > > and the whole test process, is trivial for building or pulling the > > > > > image. > > > > > > > > > > I look forward to hearing from you. 
;) > > > > > > > > > > Best, > > > > > Yangze Guo > > > > > > > > > > [1] > > > > > > > > > > > > > https://github.com/KarmaGYZ/flink/commit/0406d942446a1b17f81d93235b21a829bf88ccf0 > > > > > [2]https://travis-ci.org/KarmaGYZ/flink/jobs/623207957 > > > > > Best, > > > > > Yangze Guo > > > > > > > > > > On Mon, Dec 9, 2019 at 2:39 PM Yang Wang > > > wrote: > > > > > > > > > > > > Thanks Yangze
Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration
Thanks for driving this effort. Also +1 from my side. I have left a few questions below. > - Wordcount end-to-end test. For verifying the basic process of Mesos > deployment. Would this add additional test coverage compared to the "multiple submissions" test case? I am asking because the E2E tests are already expensive to run, and adding new tests should be carefully considered. > - State TTL RocksDb backend end-to-end test. For verifying memory > configuration behaviors, since Mesos has its own config options and > logic. Can you elaborate more on this? Which config options are relevant here? On Wed, Dec 11, 2019 at 9:58 AM Till Rohrmann wrote: > +1 for building the image locally. If need should arise, then we could > change it always later. > > Cheers, > Till > > On Wed, Dec 11, 2019 at 4:05 AM Xintong Song > wrote: > > > Thanks, Yangze. > > > > +1 for building the image locally. > > The time consumption for both building image locally and pulling it from > > DockerHub sounds reasonable and affordable. Therefore, I'm also in favor > of > > avoiding the cost of maintaining a custom image. > > > > Thank you~ > > > > Xintong Song > > > > > > > > On Wed, Dec 11, 2019 at 10:11 AM Yangze Guo wrote: > > > > > Thanks for the feedback, Yang. > > > > > > Some updates I want to share in this thread. > > > I have built a PoC version of Mesos e2e test with WordCount > > > workflow.[1] Then, I ran it in the testing environment. As the result > > > shown here[2]: > > > - For pulling image from DockerHub, it took 1 minute and 21 seconds > > > - For building it locally, it took 2 minutes and 54 seconds. > > > > > > I prefer building it locally. Although it is slower, I think the time > > > overhead, comparing to the cost of maintaining the image in DockerHub > > > and the whole test process, is trivial for building or pulling the > > > image. > > > > > > I look forward to hearing from you. 
;) > > > > > > Best, > > > Yangze Guo > > > > > > [1] > > > > > > https://github.com/KarmaGYZ/flink/commit/0406d942446a1b17f81d93235b21a829bf88ccf0 > > > [2]https://travis-ci.org/KarmaGYZ/flink/jobs/623207957 > > > Best, > > > Yangze Guo > > > > > > On Mon, Dec 9, 2019 at 2:39 PM Yang Wang > wrote: > > > > > > > > Thanks Yangze for starting this discussion. > > > > > > > > Just share my thoughts. > > > > > > > > If the mesos official docker image could not meet our requirement, i > > > suggest to build the image locally. > > > > We have done the same things for yarn e2e tests. This way is more > > > flexible and easy to maintain. However, > > > > i have no idea how long building the mesos image locally will take. > > > Based on previous experience of yarn, i > > > > think it may not take too much time. > > > > > > > > > > > > > > > > Best, > > > > Yang > > > > > > > > Yangze Guo 于2019年12月7日周六 下午4:25写道: > > > >> > > > >> Thanks for your feedback! > > > >> > > > >> @Till > > > >> Regarding the time overhead, I think it mainly come from the network > > > >> transmission. For building the image locally, it will totally > download > > > >> 260MB files including the base image and packages. For pulling from > > > >> DockerHub, the compressed size of the image is 347MB. Thus, I agree > > > >> that it is ok to build the image locally. > > > >> > > > >> @Piyush > > > >> Thank you for offering the help and sharing your usage scenario. In > > > >> current stage, I think it will be really helpful if you can compress > > > >> the custom image[1] or reduce the time overhead to build it locally. > > > >> Any ideas for improving test coverage will also be appreciated. 
> > > >> > > > >> [1] > > > > > > https://hub.docker.com/layers/karmagyz/mesos-flink/latest/images/sha256-4e1caefea107818aa11374d6ac8a6e889922c81806f5cd791ead141f18ec7e64 > > > >> > > > >> Best, > > > >> Yangze Guo > > > >> > > > >> On Sat, Dec 7, 2019 at 3:17 AM Piyush Narang > > > wrote: > > > >> > > > > >> > +1 from our end as well. At Criteo, we are running some Flink jobs > > on > > > Mesos in production to compute short term features for machine > learning. > > > We’d love to help out and contribute on this initiative. > > > >> > > > > >> > Thanks, > > > >> > -- Piyush > > > >> > > > > >> > > > > >> > From: Till Rohrmann > > > >> > Date: Friday, December 6, 2019 at 8:10 AM > > > >> > To: dev > > > >> > Cc: user > > > >> > Subject: Re: [DISCUSS] Adding e2e tests for Flink's Mesos > > integration > > > >> > > > > >> > Big +1 for adding a fully working e2e test for Flink's Mesos > > > integration. Ideally we would have it ready for the 1.10 release. The > > lack > > > of such a test has bitten us already multiple times. > > > >> > > > > >> > In general I would prefer to use the official image if possible > > since > > > it frees us from maintaining our own custom image. Since Java 9 is no > > > longer officially supported as we opted for supporting Java 11 (LTS) it > > > might not be feasible, though. How much longer would building the > custom > > > image vs. downloadin
Re: [ANNOUNCE] Feature freeze for Apache Flink 1.10.0 release
Hi all, We have just created the release-1.10 branch. Please remember to merge bug fixes to both release-1.10 and master branches from now on if you want the fix to be included in the Flink 1.10 release. Best, Yu & Gary On Tue, Nov 19, 2019 at 4:44 PM Yu Li wrote: > Hi devs, > > Per the feature discussions and progress updates for 1.10.0 [1] [2] [3], we > hereby > announce the official feature freeze for Flink 1.10.0 to be on December 8. > A release feature branch for 1.10 will be cut following that date. > > We’re roughly two and a half weeks away from this date, but please keep in > mind > that we still shouldn’t rush things. If you feel that there may be problems > with > this schedule for the things you are working on, please let us know here. > > Cheers, > Gary and Yu > > [1] > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Features-for-Apache-Flink-1-10-td32824.html > [2] > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Progress-of-Apache-Flink-1-10-1-td33570.html > [3] > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Progress-of-Apache-Flink-1-10-2-td34585.html >
[jira] [Created] (FLINK-15163) japicmp should use 1.9 as the old version
Gary Yao created FLINK-15163: Summary: japicmp should use 1.9 as the old version Key: FLINK-15163 URL: https://issues.apache.org/jira/browse/FLINK-15163 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 We should configure the japicmp-maven-plugin to use the latest Flink 1.9 release as the reference version to compare against. Currently 1.8.0 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
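The fix would amount to pointing the plugin's `oldVersion` at a 1.9 artifact instead of 1.8.0. A sketch of what such a japicmp-maven-plugin configuration can look like — the plugin coordinates are the plugin's real ones, but the artifact wiring in Flink's actual pom may differ:

```xml
<plugin>
  <groupId>com.github.siom79.japicmp</groupId>
  <artifactId>japicmp-maven-plugin</artifactId>
  <configuration>
    <oldVersion>
      <dependency>
        <groupId>org.apache.flink</groupId>
        <!-- compare each module against its own 1.9.0 artifact -->
        <artifactId>${project.artifactId}</artifactId>
        <version>1.9.0</version>
      </dependency>
    </oldVersion>
  </configuration>
</plugin>
```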
[jira] [Created] (FLINK-15082) Mesos App Master does not respect taskmanager.memory.total-process.size
Gary Yao created FLINK-15082: Summary: Mesos App Master does not respect taskmanager.memory.total-process.size Key: FLINK-15082 URL: https://issues.apache.org/jira/browse/FLINK-15082 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 *Description* When the Mesos App Master is started with {{taskmanager.memory.total-process.size}}, [the value is not respected|https://github.com/apache/flink/blob/d08beaa3255b3df96afe35f17e257df31a0d71ed/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosTaskManagerParameters.java#L339]. One can reproduce this when starting the App Master with the command below: {noformat} /bin/mesos-appmaster.sh \ -Dtaskmanager.memory.total-process.size=2048m \ -Djobmanager.heap.size=2048m \ ... {noformat} The ClusterEntryPoint will fail with an exception (see below). The reason is that the default value of {{mesos.resourcemanager.tasks.mem}} will be taken as the total process memory size (1024 mb). {noformat} org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint MesosSessionClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:518) at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:126) Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent. 
at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:261) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168) ... 2 more Caused by: org.apache.flink.configuration.IllegalConfigurationException: Sum of configured Framework Heap Memory (134217728 bytes), Framework Off-Heap Memory (134217728 bytes), Task Off-Heap Memory (0 bytes), Managed Memory (719407031 bytes) and Shuffle Memory (80530638 bytes) exceed configured Total Flink Memory (805306368 bytes). 
at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveInternalMemoryFromTotalFlinkMemory(TaskExecutorResourceUtils.java:273) at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveResourceSpecWithTotalProcessMemory(TaskExecutorResourceUtils.java:210) at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:108) at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:94) at org.apache.flink.mesos.runtime.clusterframework.MesosTaskManagerParameters.create(MesosTaskManagerParameters.java:341) at org.apache.flink.mesos.util.MesosUtils.createTmParameters(MesosUtils.java:109) at org.apache.flink.mesos.runtime.clusterframework.MesosResourceManagerFactory.createActiveResourceManager(MesosResourceManagerFactory.java:80) at org.apache.flink.runtime.resourcemanager.ActiveResourceManagerFactory.createResourceManager(ActiveResourceManagerFactory.java:58) at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:170) ... 9 more {noformat} *Expected Behavior* * If taskmanager.memory.total-process.size and mesos.resourcemanager.tasks.mem are both set and differ in their values, an exception should be thrown * If only taskmanager.memory.total-process.size is set and mesos.resourcemanager.tasks.mem is not set, then the value configured by the former should be respected * If only mesos.resourcemanager.tasks.mem is set and taskmanager.memory.total-process.size is not set, then the value configured by the former should be respected -- This message was sent by Atlassian Jira (v8.3.4#803005)
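The three expected-behavior rules above boil down to a precedence/conflict check between the two options. A stdlib-only Java sketch of that resolution logic — method names and exception types are illustrative, not Flink's real implementation:

```java
import java.util.Optional;

public class MemoryConfigResolver {

    /**
     * Resolves the effective total process memory (MB) from the generic option
     * taskmanager.memory.total-process.size and the Mesos-specific option
     * mesos.resourcemanager.tasks.mem, following the three rules in the
     * issue's "Expected Behavior" section.
     */
    static long resolveTotalProcessMemoryMb(Optional<Long> totalProcessSize,
                                            Optional<Long> mesosTasksMem) {
        if (totalProcessSize.isPresent() && mesosTasksMem.isPresent()
                && !totalProcessSize.get().equals(mesosTasksMem.get())) {
            // Rule 1: both set but different -> fail loudly instead of
            // silently preferring one option over the other.
            throw new IllegalArgumentException(
                "taskmanager.memory.total-process.size and "
                    + "mesos.resourcemanager.tasks.mem are both set but differ");
        }
        // Rules 2 and 3: whichever single option is set is respected.
        return totalProcessSize.or(() -> mesosTasksMem)
            .orElseThrow(() -> new IllegalArgumentException(
                "no total process memory configured"));
    }
}
```

The point of rule 1 is precisely to avoid the silent fallback described in the bug, where the default of `mesos.resourcemanager.tasks.mem` (1024 MB) overrode the explicitly configured value.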
[jira] [Created] (FLINK-15058) Log required config keys if TaskManager memory configuration is invalid
Gary Yao created FLINK-15058: Summary: Log required config keys if TaskManager memory configuration is invalid Key: FLINK-15058 URL: https://issues.apache.org/jira/browse/FLINK-15058 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 Currently the error message is {noformat} Either Task Heap Memory size and Managed Memory size, or Total Flink Memory size, or Total Process Memory size need to be configured explicitly {noformat} However, it would be good to immediately see which config keys are expected to be configured.
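The requested improvement is essentially to enumerate the concrete config keys in the error text rather than their human-readable descriptions. A stdlib-only sketch of such a message builder — the key names below are assumptions inferred from this thread, not verified Flink option constants:

```java
import java.util.List;

public class MemoryOptionsError {

    // Candidate option groups; key names are illustrative stand-ins.
    static final List<List<String>> ALTERNATIVES = List.of(
        List.of("taskmanager.memory.task.heap.size",
                "taskmanager.memory.managed.size"),
        List.of("taskmanager.memory.flink.size"),
        List.of("taskmanager.memory.total-process.size"));

    /** Builds an error message that names the concrete config keys. */
    static String missingConfigMessage() {
        StringBuilder sb = new StringBuilder(
            "TaskManager memory is not configured; set one of the following: ");
        for (int i = 0; i < ALTERNATIVES.size(); i++) {
            if (i > 0) {
                sb.append(", or ");
            }
            sb.append(String.join(" together with ", ALTERNATIVES.get(i)));
        }
        return sb.toString();
    }
}
```

With a message like this, a user hitting the failure can copy the exact keys into `flink-conf.yaml` without first searching the documentation.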
[jira] [Created] (FLINK-15057) Set taskmanager.memory.total-process.size in jepsen tests
Gary Yao created FLINK-15057: Summary: Set taskmanager.memory.total-process.size in jepsen tests Key: FLINK-15057 URL: https://issues.apache.org/jira/browse/FLINK-15057 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Set {{taskmanager.memory.total-process.size}} in {{flink-conf.yaml}} used by tests. Currently the taskmanager process fails due to {noformat} org.apache.flink.configuration.IllegalConfigurationException: Either Task Heap Memory size and Managed Memory size, or Total Flink Memory size, or Total Process Memory size need to be configured explicitly. at org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:110) at org.apache.flink.runtime.taskexecutor.TaskManagerServicesConfiguration.fromConfiguration(TaskManagerServicesConfiguration.java:219) at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:357) at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.(TaskManagerRunner.java:153) at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:327) at org.apache.flink.runtime.taskexecutor.TaskManagerRunner$1.call(TaskManagerRunner.java:298) at org.apache.flink.runtime.taskexecutor.TaskManagerRunner$1.call(TaskManagerRunner.java:295) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:295) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15045) SchedulerBase should only log the RestartStrategy in legacy scheduling mode
Gary Yao created FLINK-15045: Summary: SchedulerBase should only log the RestartStrategy in legacy scheduling mode Key: FLINK-15045 URL: https://issues.apache.org/jira/browse/FLINK-15045 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 In NG (new generation) scheduling, we configure {{ThrowingRestartStrategy}} in the execution graph to assert that legacy code paths are not executed. Hence, the restart strategy should not be logged in NG scheduling mode. Currently, the following obsolete log message is logged: {noformat} 2019-12-03 20:22:14,426 INFO org.apache.flink.runtime.jobmaster.JobMaster - Using restart strategy org.apache.flink.runtime.executiongraph.restart.ThrowingRestartStrategy@44003e98 for General purpose test job (9af8d0845d449f3c3447f817b2150bc8). {noformat}
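The "assert that legacy code paths are not executed" idea can be sketched with a tiny fail-fast stand-in. The interface and method names below are illustrative only; Flink's real `ThrowingRestartStrategy` and its surrounding interfaces differ:

```java
public class ThrowingRestartStrategySketch {

    /** Minimal stand-in for a restart hook. */
    interface RestartCallback {
        void triggerRestart();
    }

    /**
     * Returns a callback that fails fast: in NG scheduling the legacy restart
     * path must never run, so any invocation becomes a hard error instead of
     * a silent restart.
     */
    static RestartCallback throwingRestartStrategy() {
        return () -> {
            throw new IllegalStateException(
                "Unexpected restart: a legacy scheduling code path was executed");
        };
    }
}
```

Because the object exists only as a tripwire, logging it as the job's "restart strategy" (as in the message above) is misleading, which is what this ticket removes.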
[jira] [Created] (FLINK-15008) Tests in flink-yarn-tests fail with ClassNotFoundException (JDK11)
Gary Yao created FLINK-15008: Summary: Tests in flink-yarn-tests fail with ClassNotFoundException (JDK11) Key: FLINK-15008 URL: https://issues.apache.org/jira/browse/FLINK-15008 Project: Flink Issue Type: Bug Components: Deployment / YARN, Tests Affects Versions: 1.10.0 Reporter: Gary Yao {noformat} 1) Error injecting constructor, java.lang.NoClassDefFoundError: javax/activation/DataSource at org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver.(JAXBContextResolver.java:41) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebApp.setup(RMWebApp.java:51) while locating org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver 1 error at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:987) at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1013) at com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory$GuiceInstantiatedComponentProvider.getInstance(GuiceComponentProviderFactory.java:332) at com.sun.jersey.core.spi.component.ioc.IoCProviderFactory$ManagedSingleton.(IoCProviderFactory.java:179) at com.sun.jersey.core.spi.component.ioc.IoCProviderFactory.wrap(IoCProviderFactory.java:100) at com.sun.jersey.core.spi.component.ioc.IoCProviderFactory._getComponentProvider(IoCProviderFactory.java:93) at com.sun.jersey.core.spi.component.ProviderFactory.getComponentProvider(ProviderFactory.java:153) at com.sun.jersey.core.spi.component.ProviderServices.getComponent(ProviderServices.java:251) at com.sun.jersey.core.spi.component.ProviderServices.getProviders(ProviderServices.java:148) at com.sun.jersey.core.spi.factory.ContextResolverFactory.init(ContextResolverFactory.java:83) at com.sun.jersey.server.impl.application.WebApplicationImpl._initiate(WebApplicationImpl.java:1271) at com.sun.jersey.server.impl.application.WebApplicationImpl.access$700(WebApplicationImpl.java:169) at com.sun.jersey.server.impl.application.WebApplicationImpl$13.f(WebApplicationImpl.java:775) at 
com.sun.jersey.server.impl.application.WebApplicationImpl$13.f(WebApplicationImpl.java:771) at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193) at com.sun.jersey.server.impl.application.WebApplicationImpl.initiate(WebApplicationImpl.java:771) at com.sun.jersey.guice.spi.container.servlet.GuiceContainer.initiate(GuiceContainer.java:121) at com.sun.jersey.spi.container.servlet.ServletContainer$InternalWebComponent.initiate(ServletContainer.java:318) at com.sun.jersey.spi.container.servlet.WebComponent.load(WebComponent.java:609) at com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:210) at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373) at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:710) at com.google.inject.servlet.FilterDefinition.init(FilterDefinition.java:114) at com.google.inject.servlet.ManagedFilterPipeline.initPipeline(ManagedFilterPipeline.java:98) at com.google.inject.servlet.GuiceFilter.init(GuiceFilter.java:172) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713) at org.mortbay.jetty.servlet.Context.startContext(Context.java:140) at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282) at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130) at 
org.mortbay.jetty.Server.doStart(Server.java:224) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:936) ... 50 more Caused by: java.lang.NoClassDefFoundError: javax/activation/DataSource at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl.(RuntimeBuiltinLeafInfoImpl.java:457) at com.sun.xml.bind.v2.model.impl.RuntimeTypeInfoSetImpl.(RuntimeTypeInfoSetImpl.java:65) at com.sun.xml.bind.v2
[jira] [Created] (FLINK-14940) Travis build passes despite Test failures
Gary Yao created FLINK-14940: Summary: Travis build passes despite Test failures Key: FLINK-14940 URL: https://issues.apache.org/jira/browse/FLINK-14940 Project: Flink Issue Type: Bug Components: Test Infrastructure Affects Versions: 1.10.0 Reporter: Gary Yao Build https://travis-ci.org/apache/flink/jobs/616462870 is green despite the presence of test failures.
[jira] [Created] (FLINK-14939) StreamingKafkaITCase fails due to distDir property not being set
Gary Yao created FLINK-14939: Summary: StreamingKafkaITCase fails due to distDir property not being set Key: FLINK-14939 URL: https://issues.apache.org/jira/browse/FLINK-14939 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Tests Affects Versions: 1.10.0 Reporter: Gary Yao https://api.travis-ci.org/v3/job/616462870/log.txt {noformat} 08:12:34.965 [INFO] --- 08:12:34.965 [INFO] T E S T S 08:12:34.965 [INFO] --- 08:12:35.868 [INFO] Running org.apache.flink.tests.util.kafka.StreamingKafkaITCase 08:12:35.893 [ERROR] Tests run: 3, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 0.02 s <<< FAILURE! - in org.apache.flink.tests.util.kafka.StreamingKafkaITCase 08:12:35.893 [ERROR] testKafka[0: kafka-version:0.10.2.0](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) Time elapsed: 0.009 s <<< FAILURE! java.lang.AssertionError: The distDir property was not set. You can set it when running maven via -DdistDir= . at org.apache.flink.tests.util.kafka.StreamingKafkaITCase.(StreamingKafkaITCase.java:71) 08:12:35.893 [ERROR] testKafka[1: kafka-version:0.11.0.2](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) Time elapsed: 0.001 s <<< FAILURE! java.lang.AssertionError: The distDir property was not set. You can set it when running maven via -DdistDir= . at org.apache.flink.tests.util.kafka.StreamingKafkaITCase.(StreamingKafkaITCase.java:71) 08:12:35.893 [ERROR] testKafka[2: kafka-version:2.2.0](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) Time elapsed: 0.001 s <<< FAILURE! java.lang.AssertionError: The distDir property was not set. You can set it when running maven via -DdistDir= . at org.apache.flink.tests.util.kafka.StreamingKafkaITCase.(StreamingKafkaITCase.java:71) 08:12:36.233 [INFO] 08:12:36.233 [INFO] Results: 08:12:36.233 [INFO] 08:12:36.233 [ERROR] Failures: 08:12:36.233 [ERROR] StreamingKafkaITCase.:71 The distDir property was not set. You can set it when running maven via -DdistDir= . 
08:12:36.233 [ERROR] StreamingKafkaITCase.:71 The distDir property was not set. You can set it when running maven via -DdistDir= . 08:12:36.233 [ERROR] StreamingKafkaITCase.:71 The distDir property was not set. You can set it when running maven via -DdistDir= . 08:12:36.233 [INFO] 08:12:36.233 [ERROR] Tests run: 3, Failures: 3, Errors: 0, Skipped: 0 08:12:36.233 [INFO] {noformat}
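The failure is simply a missing JVM system property at test startup. A minimal stdlib-only sketch of such a guard — it mirrors the assertion seen in `StreamingKafkaITCase`, but the helper name and exact message are illustrative, not Flink's actual code:

```java
public class DistDirCheck {

    /**
     * Reads the Flink distribution directory from the -DdistDir system
     * property and fails with an actionable message when it is absent.
     */
    static String requireDistDir() {
        String distDir = System.getProperty("distDir");
        if (distDir == null || distDir.isEmpty()) {
            throw new AssertionError(
                "The distDir property was not set. "
                    + "You can set it when running maven via -DdistDir=<path>.");
        }
        return distDir;
    }
}
```

In a Maven build, the property would typically be wired in via surefire's `systemPropertyVariables` (or `-DdistDir=...` on the command line) so that it is always present when the test constructor runs.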
[jira] [Created] (FLINK-14929) ContinuousFileProcessingCheckpointITCase sporadically fails due to FileNotFoundException
Gary Yao created FLINK-14929: Summary: ContinuousFileProcessingCheckpointITCase sporadically fails due to FileNotFoundException Key: FLINK-14929 URL: https://issues.apache.org/jira/browse/FLINK-14929 Project: Flink Issue Type: Bug Components: Connectors / Common, Tests Affects Versions: 1.10.0 Environment: rev: c7ae2b8559c0fe4cf17613249db0c2857bb91d94 Reporter: Gary Yao *Description* Test fails locally approximately 1 out of 200 times. *Stacktrace* {noformat} org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 73b2ee613731d76fb9ab20142c3ba52f) at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:144) at org.apache.flink.test.checkpointing.StreamFaultToleranceTestBase.runCheckpointedProgram(StreamFaultToleranceTestBase.java:132) at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:27) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:54) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:142) ... 34 more Caused by: java.lang.IllegalStateException: Cannot process mail Report throwable java.io.FileNotFoundException: File file:/home/gary/code/flink/flink-tests/target/localfs/fs_tests/..file3.crc does not exist or the user running Flink ('gary') has insufficient permissions to access it. 
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:68) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:213) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:154) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:445) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.util.WrappingRuntimeException: java.io.FileNotFoundException: File file:/home/gary/code/flink/flink-tests/target/localfs/fs_tests/..file3.crc does not exist or the user running Flink ('gary') has insufficient permissions to access it. at org.apache.flink.util.WrappingRuntimeException.wrapIfNecessary(WrappingRuntimeException.java:65) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.lambda$reportThrowable$0(MailboxPro
Re: [VOTE] FLIP-83: Flink End-to-end Performance Testing Framework
+1 (binding) Best, Gary On Thu, Nov 21, 2019 at 7:55 AM Zhu Zhu wrote: > Thanks Yu for this proposal. > +1(non-binding) for the overall design. > > I have 2 questions about the result check though. Posted in the discussion > ML thread. > > Thanks, > Zhu Zhu > > Becket Qin 于2019年11月21日周四 上午11:57写道: > > > Hi Yu, > > > > Thanks for updating the FLIP wiki. +1 from me. I do not have further > > questions. > > > > Jiangjie (Becket) Qin > > > > > > > > On Wed, Nov 20, 2019 at 8:58 PM Yu Li wrote: > > > > > Thanks all for the vote and further discussions. > > > > > > I have updated the FLIP document according to aihua's reply (with some > > > minor refinement/supplements). For your convenience, below are the > > contents > > > I added: > > > ** > > > > > > *There're also other dimensions other than Flink characteristics, > > > including:* > > > > > >- *Record size: to check both the processing (records/s) and data > > >(bytes/s) throughput, we will test the 10B, 100B and 1KB record size > > for > > >each test job.* > > >- *Resource for each task: we will use the Flink default settings to > > >cover the most used cases.* > > >- *Job Parallelism: we will increase the parallelism to saturate the > > >system until back-pressure.* > > >- *Source and Sink: to focus on Flink performance, we generate the > > >source data randomly and use a blackhole consumer as the sink.* > > > > > > ** > > > > > > @becket.qin please let us know if the updates > > look > > > good to you to turn your voting to a complete binding +1, or if any > more > > > concerns/comments. Thanks! > > > > > > @voters: please also take a look at the updates and let us know if any > > new > > > comments. Thanks. > > > > > > Best Regards, > > > Yu > > > > > > > > > On Wed, 20 Nov 2019 at 11:10, aihua li wrote: > > > > > >> hi,Becket, > > >> > > >> Thanks for the comments! > > >> > > >> > 1. How do the testing records look like? The size and key > > distributions. 
> > >> > > >> The records looks like a long string,which default size is 1k. The key > > is > > >> randomly generated according to the specified range plus a fixed > string > > to > > >> assure that the data is evenly distributed on each task. > > >> > > >> > 2. The resources for each task. > > >> > > >> The resources for each task used the default value. > > >> > > >> > 3. The intended configuration for the jobs. > > >> > > >> The parallelism of test Job will be adjusted according to resource > > >> conditions to fill the cluster as much as possible. Other > configurations > > >> are not supported at this time. > > >> > > >> > 4. What exact source and sink it would use. > > >> > > >> In order to reduce the dependence on the external, the source data is > > >> generated randomly, the sink only supports hdfs or no sinks. > > >> > > >> we will add this details to the flip laterly. > > >> > > >> > > >> > 在 2019年11月18日,下午7:59,Becket Qin 写道: > > >> > > > >> > +1 (binding) on having the test suite. > > >> > > > >> > BTW, it would be good to have a few more details about the > performance > > >> > tests. For example: > > >> > 1. How do the testing records look like? The size and key > > distributions. > > >> > 2. The resources for each task. > > >> > 3. The intended configuration for the jobs. > > >> > 4. What exact source and sink it would use. > > >> > > > >> > Thanks, > > >> > > > >> > Jiangjie (Becket) Qin > > >> > > > >> > On Mon, Nov 18, 2019 at 7:25 PM Zhijiang < > wangzhijiang...@aliyun.com > > >> .invalid> > > >> > wrote: > > >> > > > >> >> +1 (binding)! > > >> >> > > >> >> It is a good thing to enhance our testing work. > > >> >> > > >> >> Best, > > >> >> Zhijiang > > >> >> > > >> >> > > >> >> -- > > >> >> From:Hequn Cheng > > >> >> Send Time:2019 Nov. 18 (Mon.) 18:22 > > >> >> To:dev > > >> >> Subject:Re: [VOTE] FLIP-83: Flink End-to-end Performance Testing > > >> Framework > > >> >> > > >> >> +1 (binding)! 
> > >> >> I think this would be very helpful to detect regression problems. > > >> >> > > >> >> Best, Hequn > > >> >> > > >> >> On Mon, Nov 18, 2019 at 4:28 PM vino yang > > >> wrote: > > >> >> > > >> >>> +1 (non-binding) > > >> >>> > > >> >>> Best, > > >> >>> Vino > > >> >>> > > >> >>> jincheng sun 于2019年11月18日周一 下午2:31写道: > > >> >>> > > >> +1 (binding) > > >> > > >> OpenInx 于2019年11月18日周一 下午12:09写道: > > >> > > >> > +1 (non-binding) > > >> > > > >> > On Mon, Nov 18, 2019 at 11:54 AM aihua li < > liaihua1...@gmail.com> > > >> >>> wrote: > > >> > > > >> >> +1 (non-binding) > > >> >> > > >> >> Thanks Yu Li for driving on this. > > >> >> > > >> >>> 在 2019年11月15日,下午8:10,Yu Li 写道: > > >> >>> > > >> >>> Hi All, > > >> >>> > > >> >>> I would like to start the vote for FLIP-83 [1] which is > > discussed > > >> >>> and > > >> >>> reached co
[jira] [Created] (FLINK-14894) HybridOffHeapUnsafeMemorySegmentTest#testByteBufferWrap failed on Travis
Gary Yao created FLINK-14894: Summary: HybridOffHeapUnsafeMemorySegmentTest#testByteBufferWrap failed on Travis Key: FLINK-14894 URL: https://issues.apache.org/jira/browse/FLINK-14894 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao {noformat} HybridOffHeapUnsafeMemorySegmentTest>MemorySegmentTestBase.testByteBufferWrapping:2465 expected:<992288337> but was:<196608> {noformat} https://api.travis-ci.com/v3/job/258950527/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14859) Avoid leaking unassigned Slot in DefaultScheduler when Deployment is outdated
Gary Yao created FLINK-14859: Summary: Avoid leaking unassigned Slot in DefaultScheduler when Deployment is outdated Key: FLINK-14859 URL: https://issues.apache.org/jira/browse/FLINK-14859 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 In {{DefaultScheduler#assignResourceOrHandleError()}}, if the deployment is outdated, we should release the possibly acquired {{LogicalSlot}} so that we do not leak resources. Below is an example to illustrate how a slot leak is currently possible: # Vertices A1, A2, A3 are scheduled in a batch. # A2 acquires a slot. A1 and A3 do not. # A1 fails due to slot allocation timeout and triggers failover ({{DefaultScheduler#cancelTasksAsync}}). # A2 is canceled first and its returned slot is assigned to A3, which triggers {{DefaultScheduler#assignResourceOrHandleError}} for A3. However, A3 has not been canceled yet, but it is outdated because {{executionVertexVersioner#recordVertexModifications}} was already invoked. -- This message was sent by Atlassian Jira (v8.3.4#803005)
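The guard described above can be sketched in plain Java. This is a minimal model, not Flink's actual classes: `Vertex`, `Slot`, and `assignResourceOrRelease` are hypothetical stand-ins for the execution vertex, the `LogicalSlot`, and `DefaultScheduler#assignResourceOrHandleError()`, and the integer `version` stands in for the `executionVertexVersioner` bookkeeping.

```java
// Minimal sketch (illustrative names, not Flink's API): if the recorded
// execution version no longer matches, the deployment is outdated, so the
// acquired slot must be released instead of silently leaking.
public class SlotReleaseSketch {
    static class Slot {
        boolean released = false;
        void release() { released = true; }
    }

    static class Vertex {
        int version = 0;        // bumped by the versioner on failover
        Slot assignedSlot = null;
    }

    /** Returns true if the slot was assigned, false if it was released. */
    static boolean assignResourceOrRelease(Vertex vertex, Slot slot, int requiredVersion) {
        if (vertex.version != requiredVersion) {
            // Deployment is outdated: release the slot rather than leaking it.
            slot.release();
            return false;
        }
        vertex.assignedSlot = slot;
        return true;
    }

    public static void main(String[] args) {
        Vertex a3 = new Vertex();
        a3.version = 1;           // failover already recorded a modification
        Slot returnedSlot = new Slot();
        boolean assigned = assignResourceOrRelease(a3, returnedSlot, 0);
        System.out.println("assigned=" + assigned + " released=" + returnedSlot.released);
    }
}
```

In the scenario from the ticket, A3's version was bumped by `recordVertexModifications` before A2's slot arrived, so the check fails and the slot is returned to the pool.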
[jira] [Created] (FLINK-14843) Streaming bucketing end-to-end test can fail with Output hash mismatch
Gary Yao created FLINK-14843: Summary: Streaming bucketing end-to-end test can fail with Output hash mismatch Key: FLINK-14843 URL: https://issues.apache.org/jira/browse/FLINK-14843 Project: Flink Issue Type: Bug Components: Connectors / FileSystem, Tests Affects Versions: 1.10.0 Environment: rev: dcc1330375826b779e4902176bb2473704dabb11 Reporter: Gary Yao *Description* Streaming bucketing end-to-end test ({{test_streaming_bucketing.sh}}) can fail with Output hash mismatch. {noformat} Number of running task managers has reached 4. Job (67212178694f8b2a9bc9d9572567a53f) is running. Waiting until all values have been produced Truncating buckets Number of produced values 26325/6 Truncating buckets Number of produced values 31315/6 Truncating buckets Number of produced values 36735/6 Truncating buckets Number of produced values 40705/6 Truncating buckets Number of produced values 46125/6 Truncating buckets Number of produced values 51135/6 Truncating buckets Number of produced values 56555/6 Truncating buckets Number of produced values 61935/6 Cancelling job 67212178694f8b2a9bc9d9572567a53f. Cancelled job 67212178694f8b2a9bc9d9572567a53f. Waiting for job (67212178694f8b2a9bc9d9572567a53f) to reach terminal state CANCELED ... Job (67212178694f8b2a9bc9d9572567a53f) reached terminal state CANCELED Job 67212178694f8b2a9bc9d9572567a53f was cancelled, time to verify FAIL Bucketing Sink: Output hash mismatch. Got 4e2d1859e41184a38e5bc95090fe9941, expected 01aba5ff77a0ef5e5cf6a727c248bdc3. head hexdump of actual: 000 ( 2 , 1 0 , 0 , S o m e p a y 010 l o a d . . . ) \n ( 2 , 1 0 , 1 020 , S o m e p a y l o a d . . . 030 ) \n ( 2 , 1 0 , 2 , S o m e p 040 a y l o a d . . . ) \n ( 2 , 1 0 050 , 3 , S o m e p a y l o a d . 060 . . ) \n ( 2 , 1 0 , 4 , S o m e 070 p a y l o a d . . . ) \n ( 2 , 080 1 0 , 5 , S o m e p a y l o a 090 d . . . ) \n ( 2 , 1 0 , 6 , S o 0a0 m e p a y l o a d . . . ) \n ( 0b0 2 , 1 0 , 7 , S o m e p a y l 0c0 o a d . . . 
) \n ( 2 , 1 0 , 8 , 0d0 S o m e p a y l o a d . . . ) 0e0 \n ( 2 , 1 0 , 9 , S o m e p a 0f0 y l o a d . . . ) \n 0fa Stopping taskexecutor daemon (pid: 654547) on host gyao-desktop. Stopping standalonesession daemon (pid: 650368) on host gyao-desktop. Stopping taskexecutor daemon (pid: 650812) on host gyao-desktop. Skipping taskexecutor daemon (pid: 651347), because it is not running anymore on gyao-desktop. Skipping taskexecutor daemon (pid: 651795), because it is not running anymore on gyao-desktop. Skipping taskexecutor daemon (pid: 652249), because it is not running anymore on gyao-desktop. Stopping taskexecutor daemon (pid: 653481) on host gyao-desktop. Stopping taskexecutor daemon (pid: 654099) on host gyao-desktop. [FAIL] Test script contains errors. Checking of logs skipped. [FAIL] 'flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh' failed after 2 minutes and 3 seconds! Test exited with exit code 1 {noformat} *How to reproduce* Comment out the delay of 10s after the 1st TM is restarted to provoke the issue: {code:bash} echo "Restarting 1 TM" $FLINK_DIR/bin/taskmanager.sh start wait_for_number_of_running_tms 4 #sleep 10 echo "Killing 2 TMs" kill_random_taskmanager kill_random_taskmanager wait_for_number_of_running_tms 2 {code} Command to run the test: {noformat} FLINK_DIR=build-target/ flink-end-to-end-tests/run-single-test.sh skip flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14834) Running Kerberized YARN on Docker test (custom fs plugin) fails on Travis
Gary Yao created FLINK-14834: Summary: Running Kerberized YARN on Docker test (custom fs plugin) fails on Travis Key: FLINK-14834 URL: https://issues.apache.org/jira/browse/FLINK-14834 Project: Flink Issue Type: Bug Components: Deployment / YARN, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 https://api.travis-ci.org/v3/job/612782888/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14826) Enable 'Streaming bucketing end-to-end test' to pass with new DefaultScheduler
Gary Yao created FLINK-14826: Summary: Enable 'Streaming bucketing end-to-end test' to pass with new DefaultScheduler Key: FLINK-14826 URL: https://issues.apache.org/jira/browse/FLINK-14826 Project: Flink Issue Type: Sub-task Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 The test fails because we exhaust the number of restarts (3). The reason is that the new scheduler may re-schedule tasks faster: we start counting down the restart back-off time as soon as we have triggered task cancellation, whereas the legacy scheduler only starts counting down after task cancellation has finished. Thus, re-scheduled tasks may be deployed to a TM that was killed, thereby increasing the number of restarts multiple times. The speed of TM loss detection depends on {{heartbeat.interval}} and {{heartbeat.timeout}}, which default to 10s and 50s respectively. The problem can even be reproduced with the legacy scheduler on the current master by setting {{heartbeat.timeout}} to a high value, such as 18. -- This message was sent by Atlassian Jira (v8.3.4#803005)
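The timing race above can be illustrated with back-of-the-envelope arithmetic. All values are in seconds and purely illustrative; the method names below are hypothetical, not Flink's.

```java
// Illustrative timeline arithmetic (seconds): with the new scheduler the
// restart back-off starts at cancellation *trigger* time, so a task can be
// redeployed well before the heartbeat mechanism (default timeout 50s) has
// noticed that one of the TMs was killed.
public class RestartRaceSketch {
    /** Earliest redeploy time when back-off starts at the cancel trigger. */
    static long redeployTimeNew(long cancelTriggeredAt, long backoff) {
        return cancelTriggeredAt + backoff;
    }

    /** Earliest time a killed TM's loss is detected via heartbeats. */
    static long tmLossDetectedAt(long tmKilledAt, long heartbeatTimeout) {
        return tmKilledAt + heartbeatTimeout;
    }

    public static void main(String[] args) {
        long redeploy = redeployTimeNew(0, 1);    // short back-off of 1s
        long detected = tmLossDetectedAt(0, 50);  // default 50s heartbeat timeout
        // Redeploy happens before the dead TM is noticed, so the task lands
        // on the dead TM and counts as another restart.
        System.out.println("redeploy before TM-loss detection: " + (redeploy < detected));
    }
}
```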
[jira] [Created] (FLINK-14823) Enable 'Streaming bucketing end-to-end test' to pass with new DefaultScheduler
Gary Yao created FLINK-14823: Summary: Enable 'Streaming bucketing end-to-end test' to pass with new DefaultScheduler Key: FLINK-14823 URL: https://issues.apache.org/jira/browse/FLINK-14823 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14822) Enable 'Streaming File Sink end-to-end test' to pass with new DefaultScheduler
Gary Yao created FLINK-14822: Summary: Enable 'Streaming File Sink end-to-end test' to pass with new DefaultScheduler Key: FLINK-14822 URL: https://issues.apache.org/jira/browse/FLINK-14822 Project: Flink Issue Type: Sub-task Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 The test fails because we exhaust the number of restarts (3). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14821) Enable 'Queryable state (rocksdb) with TM restart' E2E test to pass with new DefaultScheduler
Gary Yao created FLINK-14821: Summary: Enable 'Queryable state (rocksdb) with TM restart' E2E test to pass with new DefaultScheduler Key: FLINK-14821 URL: https://issues.apache.org/jira/browse/FLINK-14821 Project: Flink Issue Type: Sub-task Components: Tests Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 The test expects that the job transitions from {{RESTARTING}} to {{CREATED}}. However, the {{DefaultScheduler}} will not transition the job to {{CREATED}}. Moreover, waiting for the transition is not needed because the loss of the TM is already guaranteed by {code} wait_for_number_of_running_tms 0 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14809) DataStreamAllroundTestProgram does not run because return types cannot be determined
Gary Yao created FLINK-14809: Summary: DataStreamAllroundTestProgram does not run because return types cannot be determined Key: FLINK-14809 URL: https://issues.apache.org/jira/browse/FLINK-14809 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 {noformat} 2019-11-14 19:34:55,185 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command. org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: The return type of function 'main(DataStreamAllroundTestProgram.java:182)' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface. at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:336) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:206) at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:173) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:747) at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:282) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:219) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1011) at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1084) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1084) Caused by: org.apache.flink.api.common.functions.InvalidTypesException: The return type of function 
'main(DataStreamAllroundTestProgram.java:182)' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface. at org.apache.flink.api.dag.Transformation.getOutputType(Transformation.java:412) at org.apache.flink.streaming.api.datastream.DataStream.addSink(DataStream.java:1296) at org.apache.flink.streaming.tests.DataStreamAllroundTestProgram.main(DataStreamAllroundTestProgram.java:185) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:322) ... 12 more Caused by: org.apache.flink.api.common.functions.InvalidTypesException: Input mismatch: Generic type 'org.apache.flink.streaming.tests.Event' or a subclass of it expected but was 'org.apache.flink.streaming.tests.Event'. at org.apache.flink.api.java.typeutils.TypeExtractor.validateInputType(TypeExtractor.java:1298) at org.apache.flink.api.java.typeutils.TypeExtractor.getUnaryOperatorReturnType(TypeExtractor.java:585) at org.apache.flink.api.java.typeutils.TypeExtractor.getFlatMapReturnTypes(TypeExtractor.java:196) at org.apache.flink.streaming.api.datastream.DataStream.flatMap(DataStream.java:634) at org.apache.flink.streaming.tests.DataStreamAllroundTestProgram.main(DataStreamAllroundTestProgram.java:182) ... 17 more Caused by: org.apache.flink.api.common.functions.InvalidTypesException: Generic type 'org.apache.flink.streaming.tests.Event' or a subclass of it expected but was 'org.apache.flink.streaming.tests.Event'. 
at org.apache.flink.api.java.typeutils.TypeExtractor.validateInfo(TypeExtractor.java:1481) at org.apache.flink.api.java.typeutils.TypeExtractor.validateInfo(TypeExtractor.java:1491) at org.apache.flink.api.java.typeutils.TypeExtractor.validateInputType(TypeExtractor.java:1295) ... 21 more {noformat} This happens in multiple nightlies and jepsen runs. Example https://api.travis-ci.org/v3/job/611848582/log.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
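The root of errors like the one above is Java type erasure: generic type arguments are not available at runtime, so a framework cannot always reconstruct a function's return type by reflection, which is why the error message suggests a `returns(...)` hint or `ResultTypeQueryable`. The following self-contained snippet (plain Java, no Flink dependency) demonstrates the erasure itself:

```java
import java.util.ArrayList;
import java.util.List;

// Generic type arguments are erased at runtime: a List<String> and a
// List<Integer> share the exact same runtime class, so reflection alone
// cannot tell a framework what the element type was.
public class ErasureDemo {
    public static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        return strings.getClass() == ints.getClass(); // true: type args erased
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass()); // prints "true"
    }
}
```

Note also that the nested cause here, "Generic type 'Event' ... expected but was 'Event'", names the same class on both sides, which typically points at a classloading or type-extraction inconsistency rather than a genuine type mismatch.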
[jira] [Created] (FLINK-14780) Avoid leaking instance of DefaultScheduler before object is constructed
Gary Yao created FLINK-14780: Summary: Avoid leaking instance of DefaultScheduler before object is constructed Key: FLINK-14780 URL: https://issues.apache.org/jira/browse/FLINK-14780 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 An instance of {{DefaultScheduler}} may leak to a metric reporter thread before the object is fully constructed. This can lead to an NPE (see below). https://api.travis-ci.org/v3/job/611597698/log.txt {noformat} java.lang.NullPointerException at org.apache.flink.runtime.scheduler.DefaultScheduler.getNumberOfRestarts(DefaultScheduler.java:156) at org.apache.flink.metrics.slf4j.Slf4jReporter.tryReport(Slf4jReporter.java:114) at org.apache.flink.metrics.slf4j.Slf4jReporter.report(Slf4jReporter.java:80) at org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:436) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
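This is the classic "leaking this" hazard: a constructor hands out a reference to the object (here, by registering a metric gauge) before its fields are initialized, so another thread can observe nulls. The sketch below uses illustrative names, not Flink's classes; the reporter is simulated by a callback that fires synchronously during construction, which makes the race deterministic:

```java
import java.util.function.Consumer;

// Sketch of the "leaking this" hazard behind the NPE (illustrative names).
public class ThisEscapeSketch {
    interface Gauge { long value(); }

    static class Scheduler {
        private Long numberOfRestarts; // still null early in construction

        Scheduler(Consumer<Gauge> registry) {
            // BAD ordering: a gauge capturing `this` is handed out before the
            // field below is assigned; a reporter firing now observes null.
            registry.accept(this::getNumberOfRestarts);
            this.numberOfRestarts = 0L;
        }

        long getNumberOfRestarts() { return numberOfRestarts; } // unboxes; NPE if null
    }

    /** Returns true if a reporter firing mid-construction hits an NPE. */
    public static boolean reporterHitsNpe() {
        try {
            new Scheduler(gauge -> gauge.value()); // reporter fires immediately
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    /** The fix: finish construction first, publish the gauge afterwards. */
    public static boolean safePublicationWorks() {
        Scheduler s = new Scheduler(gauge -> { }); // nothing fires during construction
        Gauge published = s::getNumberOfRestarts;  // publish only now
        return published.value() == 0L;
    }

    public static void main(String[] args) {
        System.out.println("mid-construction NPE: " + reporterHitsNpe());
        System.out.println("safe publication ok:  " + safePublicationWorks());
    }
}
```

The usual remedy is exactly what `safePublicationWorks` shows: move the registration out of the constructor, e.g. into a factory method that registers the gauge only after the object is fully constructed.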
[jira] [Created] (FLINK-14682) Enable AbstractTaskManagerProcessFailureRecoveryTest to pass with new DefaultScheduler
Gary Yao created FLINK-14682: Summary: Enable AbstractTaskManagerProcessFailureRecoveryTest to pass with new DefaultScheduler Key: FLINK-14682 URL: https://issues.apache.org/jira/browse/FLINK-14682 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 Investigate the reason {{testTaskManagerProcessFailure()}} fails, and fix issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14681) Enable YARNHighAvailabilityITCase to pass with new DefaultScheduler
Gary Yao created FLINK-14681: Summary: Enable YARNHighAvailabilityITCase to pass with new DefaultScheduler Key: FLINK-14681 URL: https://issues.apache.org/jira/browse/FLINK-14681 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0 Investigate why {{testKillYarnSessionClusterEntrypoint}} and {{testJobRecoversAfterKillingTaskManager}} fail and fix issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14680) Enable KafkaConsumerTestBase#runFailOnNoBrokerTest to pass with new DefaultScheduler
Gary Yao created FLINK-14680: Summary: Enable KafkaConsumerTestBase#runFailOnNoBrokerTest to pass with new DefaultScheduler Key: FLINK-14680 URL: https://issues.apache.org/jira/browse/FLINK-14680 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 {{KafkaConsumerTestBase#runFailOnNoBrokerTest}} makes assumptions about the causal chain of the {{JobExecutionException}}. In particular, it assumes that the exception caused by user code is the direct cause of the {{JobExecutionException}}. However, this is no longer true when using the {{DefaultScheduler}}, which wraps the exception in a {{JobException}} that additionally records why job recovery was suppressed. The code in question is listed below: {code:java} } catch (JobExecutionException jee) { if (kafkaServer.getVersion().equals("0.9") || kafkaServer.getVersion().equals("0.10") || kafkaServer.getVersion().equals("0.11") || kafkaServer.getVersion().equals("2.0")) { assertTrue(jee.getCause() instanceof TimeoutException); TimeoutException te = (TimeoutException) jee.getCause(); assertEquals("Timeout expired while fetching topic metadata", te.getMessage()); } else { assertTrue(jee.getCause() instanceof RuntimeException); RuntimeException re = (RuntimeException) jee.getCause(); assertTrue(re.getMessage().contains("Unable to retrieve any partitions")); } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
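A robust alternative to asserting on the *direct* cause is to search the whole causal chain, so an extra wrapping layer (such as the {{JobException}} the {{DefaultScheduler}} introduces) no longer breaks the assertion. The sketch below is self-contained plain Java; Flink ships a comparable helper in {{org.apache.flink.util.ExceptionUtils#findThrowable}}, which the test could use instead.

```java
import java.util.Optional;
import java.util.concurrent.TimeoutException;

// Walk the causal chain of a Throwable looking for the first cause of a
// given type, instead of assuming it sits exactly one level deep.
public class CauseChainSketch {
    public static <T extends Throwable> Optional<T> findThrowable(Throwable top, Class<T> type) {
        for (Throwable t = top; t != null; t = t.getCause()) {
            if (type.isInstance(t)) {
                return Optional.of(type.cast(t));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulated chain: job failure -> wrapping layer -> user-code cause.
        Throwable chain = new RuntimeException("job failed",
                new IllegalStateException("recovery suppressed",
                        new TimeoutException("Timeout expired while fetching topic metadata")));
        // Found despite the extra wrapping layer in between.
        System.out.println(findThrowable(chain, TimeoutException.class).isPresent());
    }
}
```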
[jira] [Created] (FLINK-14651) Set default value of config option jobmanager.scheduler to "ng"
Gary Yao created FLINK-14651: Summary: Set default value of config option jobmanager.scheduler to "ng" Key: FLINK-14651 URL: https://issues.apache.org/jira/browse/FLINK-14651 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14636) Handle schedule mode LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST correctly in DefaultScheduler
Gary Yao created FLINK-14636: Summary: Handle schedule mode LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST correctly in DefaultScheduler Key: FLINK-14636 URL: https://issues.apache.org/jira/browse/FLINK-14636 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.10.0 Reporter: Gary Yao Assignee: Gary Yao Fix For: 1.10.0 It should be possible to schedule a job with {{ScheduleMode.LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14632) ExecutionGraphSchedulingTest.testSlotReleasingFailsSchedulingOperation deadlocks
Gary Yao created FLINK-14632: Summary: ExecutionGraphSchedulingTest.testSlotReleasingFailsSchedulingOperation deadlocks Key: FLINK-14632 URL: https://issues.apache.org/jira/browse/FLINK-14632 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Tests Affects Versions: 1.10.0 Reporter: Gary Yao https://api.travis-ci.com/v3/job/253364947/log.txt {noformat} "main" #1 prio=5 os_prio=0 tid=0x7fac3c00b800 nid=0x956 waiting on condition [0x7fac449f1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x8eb74600> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at org.apache.flink.runtime.executiongraph.ExecutionGraphSchedulingTest.testSlotReleasingFailsSchedulingOperation(ExecutionGraphSchedulingTest.java:501) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
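The thread dump above shows the test parked forever in `CompletableFuture.get()`: with no timeout argument, the main thread blocks indefinitely when the future is never completed. A minimal sketch of the defensive pattern, which turns a silent deadlock into a diagnosable failure, looks like this (illustrative helper, not the test's actual code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// A bounded wait on a future: returns true if the wait timed out instead of
// hanging the calling thread forever the way a plain get() would.
public class FutureTimeoutSketch {
    public static boolean timesOut(CompletableFuture<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return false; // completed within the bound
        } catch (TimeoutException e) {
            return true;  // would have deadlocked with a plain get()
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> neverCompleted = new CompletableFuture<>();
        System.out.println(timesOut(neverCompleted, 50)); // prints "true"
    }
}
```

In a JUnit test the same effect is usually achieved with a method-level `@Test(timeout = ...)` or by passing a timeout to `get()` directly.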
[jira] [Created] (FLINK-14628) Wordcount on Docker test (custom fs plugin) fails on Travis
Gary Yao created FLINK-14628: Summary: Wordcount on Docker test (custom fs plugin) fails on Travis Key: FLINK-14628 URL: https://issues.apache.org/jira/browse/FLINK-14628 Project: Flink Issue Type: Bug Components: FileSystems, Tests Affects Versions: 1.10.0 Reporter: Gary Yao https://api.travis-ci.org/v3/job/607616429/log.txt {noformat} Successfully tagged test_docker_embedded_job:latest ~/build/apache/flink sort: cannot read: '/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53405131685/out/docker_wc_out*': No such file or directory FAIL WordCount: Output hash mismatch. Got d41d8cd98f00b204e9800998ecf8427e, expected 72a690412be8928ba239c2da967328a5. head hexdump of actual: head: cannot open '/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53405131685/out/docker_wc_out*' for reading: No such file or directory [FAIL] Test script contains errors. Checking for errors... No errors in log files. Checking for exceptions... No exceptions in log files. Checking for non-empty .out files... grep: /home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/log/*.out: No such file or directory No non-empty .out files. {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[ANNOUNCE] Progress of Apache Flink 1.10 #2
Hi community, Because we have approximately one month of development time left until the targeted Flink 1.10 feature freeze, we thought now would be a good time to give another progress update. Below we have included a list of the ongoing efforts that have made progress since our last release progress update [1]. As always, if you are working on something that is not included here, feel free to use this thread to share your progress. - Support Java 11 [2] - Implementation is in progress (18/21 subtasks resolved) - Table API improvements - Full Data Type Support in Planner [3] - Implementing (1/8 subtasks resolved) - FLIP-66 Support Time Attribute in SQL DDL [4] - Implementation is in progress (1/7 subtasks resolved). - FLIP-70 Support Computed Column [5] - FLIP voting [6] - FLIP-63 Rework Table Partition Support [7] - Implementation is in progress (3/15 subtasks resolved). - FLIP-51 Rework of Expression Design [8] - Implementation is in progress (2/12 subtasks resolved). - FLIP-64 Support for Temporary Objects in Table Module [9] - Implementation is in progress - Hive compatibility completion (DDL/UDF) to support full Hive integration - FLIP-57 Rework FunctionCatalog [10] - Implementation is in progress (6/9 subtasks resolved) - FLIP-68 Extend Core Table System with Modular Plugins [11] - Implementation is in progress (2/8 subtasks resolved) - Finer grained resource management - FLIP-49: Unified Memory Configuration for TaskExecutors [12] - Implementation is in progress (6/10 subtasks resolved) - FLIP-53: Fine Grained Operator Resource Management [13] - Implementation is in progress (1/9 subtasks resolved) - Finish scheduler re-architecture [14] - Integration tests are being enabled for new scheduler - Executor/Client refactoring [15] - FLIP-81: Executor-related new ConfigOptions [16] - done - FLIP-73: Introducing Executors for job submission [17] - Implementation is in progress - FLIP-36 Support Interactive Programming [18] - Is built on top of FLIP-67 [19], which 
has been accepted - Implementation in progress - FLIP-58: Flink Python User-Defined Stateless Function for Table [20] - Implementation is in progress (12/22 subtasks resolved) - FLIP-50: Spill-able Heap Keyed State Backend [21] - Implementation is in progress (2/11 subtasks resolved) - RocksDB Backend Memory Control [22] - FLIP for resource management on state backend will be opened soon - Write Buffer Manager will be backported to FRocksDB due to performance regression [23] in new RocksDB versions - Unaligned Checkpoints - FLIP-76 [24] was published and received positive feedback - Implementation is in progress - Separate framework and user class loader in per-job mode [25] - First PR is almost done. Remaining PRs will be ready next week - Active Kubernetes Integration [26] - Implementation is in progress (6/11 in review, 3/11 in progress, 2/11 todo) - FLIP-39 Flink ML pipeline and ML libs [27] - A few abstract ML classes have been merged (FLINK-13339, FLINK-13513) - Starting review of algorithms Again, the feature freeze is targeted to be at the end of November. Please make sure that all important work threads can be completed by that date. Feel free to use this thread to communicate any concerns about features that might not be finished until then. We will send another announcement later in the release cycle to make the date of the feature freeze official.
Best, Yu & Gary [1] https://s.apache.org/wc0dc [2] https://issues.apache.org/jira/browse/FLINK-10725 [3] https://issues.apache.org/jira/browse/FLINK-14079 [4] https://cwiki.apache.org/confluence/display/FLINK/FLIP-66%3A+Support+time+attribute+in+SQL+DDL [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-70%3A+Flink+SQL+Computed+Column+Design [6] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-FLIP-70-Flink-SQL-Computed-Column-Design-td34385.html [7] https://cwiki.apache.org/confluence/display/FLINK/FLIP-63%3A+Rework+table+partition+support [8] https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design [9] https://cwiki.apache.org/confluence/display/FLINK/FLIP-64%3A+Support+for+Temporary+Objects+in+Table+module [10] https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog [11] https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Pluggable+Modules [12] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors [13] https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management [14] https://issues.apache.org/jira/browse/FLINK-10429 [15] https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E [16] https:
[jira] [Created] (FLINK-14592) FlinkKafkaInternalProducerITCase fails with BindException
Gary Yao created FLINK-14592: Summary: FlinkKafkaInternalProducerITCase fails with BindException Key: FLINK-14592 URL: https://issues.apache.org/jira/browse/FLINK-14592 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Tests Affects Versions: 1.10.0 Reporter: Gary Yao
FlinkKafkaInternalProducerITCase fails with java.net.BindException: Address already in use.
Logs: https://api.travis-ci.org/v3/job/605478801/log.txt
{noformat}
02:04:04.878 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 8.822 s <<< FAILURE! - in org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase
02:04:04.882 [ERROR] org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase Time elapsed: 8.822 s <<< ERROR!
org.apache.kafka.common.KafkaException: Socket server failed to bind to 0.0.0.0:38437: Address already in use.
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase.prepare(FlinkKafkaInternalProducerITCase.java:59)
Caused by: java.net.BindException: Address already in use
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase.prepare(FlinkKafkaInternalProducerITCase.java:59)
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
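A BindException like the one above typically means the test fixture picked a fixed or previously probed port that another process has since claimed. One common mitigation (a sketch of the general technique, not necessarily the fix applied for this ticket; the helper name `findFreePort` is hypothetical, not Flink API) is to bind to port 0 so the OS assigns a free ephemeral port:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPort {
    // Ask the OS for a free ephemeral port by binding to port 0.
    // Caveat: closing the socket before handing the port to another process
    // leaves a small race window; having the server under test bind to
    // port 0 itself avoids the race entirely.
    static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = findFreePort();
        System.out.println(port > 0 && port <= 65535); // prints "true"
    }
}
```

Binding to 0.0.0.0 on a hard-coded port, as in the failing Kafka socket server above, is exactly the pattern this technique avoids.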
[jira] [Created] (FLINK-14587) HiveTableSourceTest timeouts on Travis
Gary Yao created FLINK-14587: Summary: HiveTableSourceTest timeouts on Travis Key: FLINK-14587 URL: https://issues.apache.org/jira/browse/FLINK-14587 Project: Flink Issue Type: Bug Components: Connectors / Hive, Tests Affects Versions: 1.10.0 Reporter: Gary Yao
{{HiveTableSourceTest}} failed in nightly tests running on Travis (stage: misc - scala 2.12).
https://travis-ci.org/apache/flink/builds/604945292?utm_source=slack&utm_medium=notification
https://api.travis-ci.org/v3/job/604945309/log.txt
{noformat}
15:54:17.188 [INFO] Results:
15:54:17.188 [INFO]
15:54:17.188 [ERROR] Errors:
15:54:17.188 [ERROR]   HiveTableSourceTest.org.apache.flink.connectors.hive.HiveTableSourceTest » Timeout
15:54:17.188 [INFO]
15:54:17.188 [ERROR] Tests run: 232, Failures: 0, Errors: 1, Skipped: 1
15:54:17.188 [INFO]
{noformat}
[jira] [Created] (FLINK-14576) Elasticsearch (v6.3.1) sink end-to-end test instable
Gary Yao created FLINK-14576: Summary: Elasticsearch (v6.3.1) sink end-to-end test instable Key: FLINK-14576 URL: https://issues.apache.org/jira/browse/FLINK-14576 Project: Flink Issue Type: Bug Components: Connectors / ElasticSearch, Tests Reporter: Gary Yao
https://api.travis-ci.org/v3/job/604417300/log.txt
{noformat}
2019-10-29 17:12:13,668 ERROR org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase - Failed Elasticsearch item request: [:intentional invalid index:] ElasticsearchException[Elasticsearch exception [type=invalid_index_name_exception, reason=Invalid index name [:intentional invalid index:], must not contain the following characters [ , ", *, \, <, |, ,, >, /, ?]]]
[:intentional invalid index:] ElasticsearchException[Elasticsearch exception [type=invalid_index_name_exception, reason=Invalid index name [:intentional invalid index:], must not contain the following characters [ , ", *, \, <, |, ,, >, /, ?]]]
	at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:510)
	at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:421)
	at org.elasticsearch.action.bulk.BulkItemResponse.fromXContent(BulkItemResponse.java:135)
	at org.elasticsearch.action.bulk.BulkResponse.fromXContent(BulkResponse.java:198)
	at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:653)
	at org.elasticsearch.client.RestHighLevelClient.lambda$performRequestAsyncAndParseEntity$3(RestHighLevelClient.java:549)
	at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:580)
	at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:621)
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:375)
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:366)
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	at java.lang.Thread.run(Thread.java:748)
2019-10-29 17:12:13,753 ERROR org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase - Failed Elasticsearch item request: [:intentional invalid index:] ElasticsearchException[Elasticsearch exception [type=invalid_index_name_exception, reason=Invalid index name [:intentional invalid index:], must not contain the following characters [ , ", *, \, <, |, ,, >, /, ?]]]
[:intentional invalid index:] ElasticsearchException[Elasticsearch exception [type=invalid_index_name_exception, reason=Invalid index name [:intentional invalid index:], must not contain the following characters [ , ", *, \, <, |, ,, >, /, ?]]]
	at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:510)
	at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:421)
	at org.elasticsearch.action.bulk.BulkItemResponse.fromXContent(BulkItemResponse.java:135)
	at org.elasticsearch.action.bulk.BulkResponse.fromXContent(BulkResponse.java:198)
	at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:653)
	at org.elasticsearch.client.RestHighLevelClient.lambda$performRequestAsyncAndParseEntity$3(RestHighLevelClient.java:549)
	at org.elasticsearch.client.RestHighLevelCli
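The `invalid_index_name_exception` above is expected here, since the e2e test deliberately uses the index name `:intentional invalid index:`. For reference, the characters listed in the message can be checked client-side before a bulk request; a minimal sketch under the assumption that only the character blacklist from the message matters (`IndexNameCheck` is a hypothetical helper, not part of the Flink connector, and Elasticsearch enforces additional rules such as lowercase-only names):

```java
import java.util.Set;

public class IndexNameCheck {
    // Characters Elasticsearch rejects in index names, exactly as listed
    // in the invalid_index_name_exception message above.
    private static final Set<Character> FORBIDDEN =
            Set.of(' ', '"', '*', '\\', '<', '|', ',', '>', '/', '?');

    static boolean containsForbiddenChar(String indexName) {
        return indexName.chars().anyMatch(c -> FORBIDDEN.contains((char) c));
    }

    public static void main(String[] args) {
        // The test's index name contains spaces, which are forbidden.
        System.out.println(containsForbiddenChar(":intentional invalid index:")); // prints "true"
        System.out.println(containsForbiddenChar("valid-index-1"));               // prints "false"
    }
}
```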
[jira] [Created] (FLINK-14572) BlobsCleanupITCase failed in Travis stage core - scheduler_ng
Gary Yao created FLINK-14572: Summary: BlobsCleanupITCase failed in Travis stage core - scheduler_ng Key: FLINK-14572 URL: https://issues.apache.org/jira/browse/FLINK-14572 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Tests Affects Versions: 1.10.0 Reporter: Gary Yao Fix For: 1.10.0
{noformat}
java.lang.AssertionError: Expected: is but: was
	at org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
	at org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
{noformat}
[jira] [Created] (FLINK-14555) Streaming File Sink s3 end-to-end test stalls
Gary Yao created FLINK-14555: Summary: Streaming File Sink s3 end-to-end test stalls Key: FLINK-14555 URL: https://issues.apache.org/jira/browse/FLINK-14555 Project: Flink Issue Type: Bug Components: Connectors / FileSystem, Tests Affects Versions: 1.10.0 Reporter: Gary Yao
[https://api.travis-ci.org/v3/job/603882577/log.txt]
{noformat}
== Running 'Streaming File Sink s3 end-to-end test' ==
TEST_DATA_DIR: /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-36388677539
Flink dist directory: /home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT
Found AWS bucket [secure], running the e2e test.
Found AWS access key, running the e2e test.
Found AWS secret key, running the e2e test.
Executing test with dynamic openSSL linkage (random selection between 'dynamic' and 'static')
Setting up SSL with: internal OPENSSL dynamic
Using SAN dns:travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866,ip:10.20.1.215,ip:172.17.0.1
Certificate was added to keystore
Certificate was added to keystore
Certificate reply was installed in keystore
MAC verified OK
Setting up SSL with: rest OPENSSL dynamic
Using SAN dns:travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866,ip:10.20.1.215,ip:172.17.0.1
Certificate was added to keystore
Certificate was added to keystore
Certificate reply was installed in keystore
MAC verified OK
Mutual ssl auth: true
Use s3 output
Starting cluster.
Starting standalonesession daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
[INFO] 1 instance(s) of taskexecutor are already running on travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
[INFO] 2 instance(s) of taskexecutor are already running on travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
[INFO] 3 instance(s) of taskexecutor are already running on travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Submitting job.
Job (c3a9bb7d3f47d63ebccbec5acb1342cb) is running.
Waiting for job (c3a9bb7d3f47d63ebccbec5acb1342cb) to have at least 3 completed checkpoints ...
Killing TM
TaskManager 9227 killed.
Starting TM
[INFO] 3 instance(s) of taskexecutor are already running on travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Waiting for restart to happen
Killing 2 TMs
TaskManager 8618 killed.
TaskManager 9658 killed.
Starting 2 TMs
[INFO] 2 instance(s) of taskexecutor are already running on travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
[INFO] 3 instance(s) of taskexecutor are already running on travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Waiting for restart to happen
Waiting until all values have been produced
Number of produced values 18080/6
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated
{noformat}
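The run above hung in "Waiting until all values have been produced" until Travis's 10-minute no-output watchdog terminated the build. Wait loops in e2e harnesses benefit from an explicit deadline so a stalled job fails fast with a diagnostic instead of tripping the CI watchdog. A minimal sketch of such a bounded wait in Java (the Flink test scripts themselves are bash; `waitUntil` is a hypothetical helper, shown only to illustrate the pattern):

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class BoundedWait {
    // Poll a condition until it holds or the deadline passes.
    // Returns true if the condition was satisfied in time, false on timeout,
    // so the caller can fail the test with a clear message instead of hanging.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition that becomes true after roughly 200 ms.
        boolean ok = waitUntil(() -> System.currentTimeMillis() - start > 200,
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(ok); // prints "true"

        // A condition that never holds returns false once the timeout elapses.
        boolean timedOut = waitUntil(() -> false,
                Duration.ofMillis(200), Duration.ofMillis(50));
        System.out.println(timedOut); // prints "false"
    }
}
```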