Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
snuyanzin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1475692973 ## flink-table/flink-table-planner/src/main/java/org/apache/calcite/sql/validate/SqlValidatorImpl.java: ## @@ -158,16 +160,17 @@ * Default implementation of {@link SqlValidator}, the class was copied over because of * CALCITE-4554. * - * Lines 1958 ~ 1978, Flink improves error message for functions without appropriate arguments in + * Lines 1961 ~ 1981, Flink improves error message for functions without appropriate arguments in * handleUnresolvedFunction at {@link SqlValidatorImpl#handleUnresolvedFunction}. * - * Lines 3736 ~ 3740, Flink improves Optimize the retrieval of sub-operands in SqlCall when using - * NamedParameters at {@link SqlValidatorImpl#checkRollUp}. + * Lines 3739 ~ 3743, 6333 ~ 6339, Flink improves validating the SqlCall that uses named + * parameters, rearrange the order of sub-operands, and fill in missing operands with the default Review Comment: can you please create a corresponding issue for Calcite to make it supported there as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
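The validator change described above — validating a SqlCall that uses named parameters, rearranging its sub-operands into declared order, and filling missing operands with defaults — can be sketched roughly as follows. This is a simplified, hypothetical illustration (the class and method names are made up), not the actual SqlValidatorImpl code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: given a function's declared parameter order and the
// named arguments supplied in the call, build the positional operand list,
// substituting a DEFAULT placeholder for parameters that were not supplied.
public class NamedParamSketch {
    static List<String> reorderOperands(List<String> declaredParams,
                                        Map<String, String> namedArgs) {
        List<String> operands = new ArrayList<>();
        for (String param : declaredParams) {
            // Supplied arguments are placed at their declared position;
            // missing ones fall back to the default placeholder.
            operands.add(namedArgs.getOrDefault(param, "DEFAULT"));
        }
        return operands;
    }

    public static void main(String[] args) {
        // Named args given out of declared order, OFFSET omitted.
        Map<String, String> named = new HashMap<>();
        named.put("SIZE", "INTERVAL '15' MINUTE");
        named.put("TIMECOL", "DESCRIPTOR(rowtime)");
        named.put("DATA", "TABLE MyTable");
        System.out.println(reorderOperands(
                Arrays.asList("DATA", "TIMECOL", "SIZE", "OFFSET"), named));
        // prints [TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE, DEFAULT]
    }
}
```

The point of the sketch is only the two-step idea under discussion: position each named argument by the declared parameter list, then fill the gaps with defaults.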
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
1996fanrui commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1475680928 ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -328,7 +330,7 @@ public void testCheckpointRescalingNonPartitionedStateCausesException() throws E // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: IIUC, we can call the new `waitForNewCheckpoint` here. Semantically, the old `waitForOneMoreCheckpoint` waits for one or more checkpoints: it checks for a checkpoint every 100 ms by default, so if the job generates 10 checkpoints within 100 ms, `waitForOneMoreCheckpoint` will wait for all 10. So I don't think `waitForNewCheckpoint` breaks the semantics of `waitForOneMoreCheckpoint`, and it's clearer than before. Also, once we introduce `waitForNewCheckpoint`, we might not need a `waitForOneMoreCheckpoint` that checks the checkpoint count; that would only be needed if we had to wait for a specific number of new checkpoints, and at least I don't see a strong requirement for that now. WDYT? ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -411,7 +413,7 @@ public void testCheckpointRescalingWithKeyedAndNonPartitionedState() throws Exce // clear the CollectionSink set for the restarted job CollectionSink.clearElementsSet(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: Do you mean `createJobGraphWithKeyedState` and `createJobGraphWithKeyedAndNonPartitionedOperatorState` have redundant code? Or `testCheckpointRescalingWithKeyedAndNonPartitionedState` and `testCheckpointRescalingKeyedState`? I checked them; they differ in many details.
Such as: - The sources are different - The parallelism and maxParallelism are fixed for `NonPartitionedOperator` I will check later whether some common code can be extracted. If yes, I can submit a hotfix PR and cc you. ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -513,7 +515,7 @@ public void testCheckpointRescalingPartitionedOperatorState( // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: Same as this comment: https://github.com/apache/flink/pull/24246/files#r1474917357 ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -261,7 +263,7 @@ public void testCheckpointRescalingKeyedState(boolean scaleOut) throws Exception waitForAllTaskRunning(cluster.getMiniCluster(), jobGraph.getJobID(), false); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: Added, please help double-check, thanks~
[jira] [Comment Edited] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813215#comment-17813215 ] Matthias Pohl edited comment on FLINK-34333 at 2/2/24 7:46 AM: --- I think it's fine to do the upgrade from v6.6.2 to v6.9.0 even for 1.18: * The only change that seems to affect Flink is in the Mock-framework and only affects test code * Most of the changes are cleanup changes, except for: ** Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Changed behavior in certain scenarios ** Fix [#5125|https://github.com/fabric8io/kubernetes-client/issues/5125]: Different default TLS version ** Fix [#1335|https://github.com/fabric8io/kubernetes-client/issues/1335]: Different fallback used for proxy configuration Generally, we could argue that downstream projects should rely on shaded dependencies provided by Flink. And fixing the bug outweighs the stability concerns here. Do you see a problem with this argument [~gyfora], [~wangyang0918]? The alternatives to doing the upgrade are: * Reverting the upgrade (i.e. going back from v6.6.2 to v5.12.4). This would allow us to get back to the stable version that was tested with 1.17- * Provide a Flink-customized implementation of the fabric8io {{LeaderElector}} class with the cherry-picked changes of [8f8c438f|https://github.com/fabric8io/kubernetes-client/commit/8f8c438f] and [0f6c6965|https://github.com/fabric8io/kubernetes-client/commit/0f6c6965].
As a consequence, we would stick to fabric8io:kubernetes-client v.6.6.2 was (Author: mapohl): I think it's fine to do the upgrade from v6.6.2 to v6.9.0 even for 1.18: * The only change that seems to affect Flink is in the Mock-framework and only affects test code * Most of the changes are cleanup changes, except for: ** Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Changed behavior in certain scenarios ** Fix [#5125|https://github.com/fabric8io/kubernetes-client/issues/5125]: Different default TLS version ** Fix [#1335|https://github.com/fabric8io/kubernetes-client/issues/1335]: Different fallback used for proxy configuration Generally, we could argue that downstream project should rely on transitive dependencies provided by Flink. And fixing the bug out-weighs the stability concerns here. Do you see a problem with this argument [~gyfora], [~wangyang0918]? The alternatives to doing the upgrade are: * Reverting the upgrade (i.e. going back from v6.6.2 to v5.12.4). This would allow us to get to the stable version that was tested with 1.17- * Provide a Flink-customized implementation of the fabric8io {{LeaderElector}} class with the cherry-picked changes of [8f8c438f|https://github.com/fabric8io/kubernetes-client/commit/8f8c438f] and [0f6c6965|https://github.com/fabric8io/kubernetes-client/commit/0f6c6965]. As a consequence, we would stick to fabric8io:kubernetes-client v.6.6.2 > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. 
> This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007. > Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in > v6.6.2 which might prevent the leadership lost event being forwarded to the > client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). > An initial proposal where the release call was handled in Flink's > {{KubernetesLeaderElector}} didn't work due to the leadership lost event > being triggered twice (see [FLINK-34007 PR > comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902]) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34202) python tests take suspiciously long in some of the cases
[ https://issues.apache.org/jira/browse/FLINK-34202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813537#comment-17813537 ] Matthias Pohl commented on FLINK-34202: --- https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57203=logs=9cada3cb-c1d3-5621-16da-0f718fb86602=c67e71ed-6451-5d26-8920-5a8cf9651901=27977 > python tests take suspiciously long in some of the cases > > > Key: FLINK-34202 > URL: https://issues.apache.org/jira/browse/FLINK-34202 > Project: Flink > Issue Type: Bug > Components: API / Python >Affects Versions: 1.17.2, 1.19.0, 1.18.1 >Reporter: Matthias Pohl >Priority: Critical > Labels: test-stability > > [This release-1.18 > build|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56603=logs=3e4dd1a2-fe2f-5e5d-a581-48087e718d53=b4612f28-e3b5-5853-8a8b-610ae894217a] > has the python stage running into a timeout without any obvious reason. The > [python stage run for > JDK17|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56603=logs=b53e1644-5cb4-5a3b-5d48-f523f39bcf06] > was also getting close to the 4h timeout. > I'm creating this issue for documentation purposes.
[jira] [Reopened] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reopened FLINK-34007: --- I'm reopening the issue because we're seeing test instabilities now: * [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57203=logs=64debf87-ecdb-5aef-788d-8720d341b5cb=2302fb98-0839-5df2-3354-bbae636f81a7=8066] * [https://github.com/XComp/flink/actions/runs/7745486791/job/21121947504#step:14:7309] (this one happened on the FLINK-34333 1.18 backport which covers the same change) > Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.19.0, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
1996fanrui commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1475645896 ## flink-runtime/src/test/java/org/apache/flink/runtime/testutils/CommonTestUtils.java: ## @@ -367,9 +367,11 @@ public static void waitForCheckpoint(JobID jobID, MiniCluster miniCluster, int n }); } -/** Wait for on more completed checkpoint. */ -public static void waitForOneMoreCheckpoint(JobID jobID, MiniCluster miniCluster) -throws Exception { +/** + * Wait for a new completed checkpoint. Note: we wait for 2 or more checkpoint to ensure the + * latest checkpoint start after waitForTwoOrMoreCheckpoint is called. + */ +public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster) throws Exception { Review Comment: > That way, we can pass in 2 in the test implementation analogously to waitForCheckpoint. I also feel like we can remove some redundant code within the two methods. IIUC, the semantics of `waitForCheckpoint` and `waitForOneMoreCheckpoint` are different. (`waitForOneMoreCheckpoint` is renamed to `waitForNewCheckpoint` in this PR.) - `waitForCheckpoint` checks the total count of all completed checkpoints. - `waitForOneMoreCheckpoint` checks whether a new checkpoint completed after it was called. - For example, if the job had 10 completed checkpoints before it was called, `waitForOneMoreCheckpoint` will wait for checkpoint-11 to complete. BTW, I have refactored `waitForNewCheckpoint`: I check the checkpoint trigger time instead of the checkpoint count. I think checking the trigger time is clearer than `checkpointCount >= 2`; other developers might not know why we check for 2 checkpoints here, and `checkpointCount >= 2` doesn't work when concurrent checkpoints are enabled. WDYT?
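The trigger-time idea discussed in this thread can be sketched in isolation as follows. The snippet is a hypothetical, self-contained simulation — a `LongSupplier` stands in for querying the latest completed checkpoint's trigger timestamp from the MiniCluster — and is not the actual CommonTestUtils code:

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: instead of counting completed checkpoints, remember
// when the wait started and poll until the latest completed checkpoint was
// triggered strictly after that point in time. Such a checkpoint is
// guaranteed to be "new", even with concurrent checkpoints enabled.
public class WaitForNewCheckpointSketch {
    static void waitForNewCheckpoint(LongSupplier latestTriggerTimestamp,
                                     long pollIntervalMs,
                                     long timeoutMs) throws InterruptedException {
        final long calledAt = System.currentTimeMillis();
        final long deadline = calledAt + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (latestTriggerTimestamp.getAsLong() > calledAt) {
                return; // a checkpoint started after we began waiting
            }
            Thread.sleep(pollIntervalMs);
        }
        throw new IllegalStateException("no new checkpoint within timeout");
    }
}
```

Comparing the trigger timestamp against the call time avoids reasoning about checkpoint counts entirely, which is why it stays correct when several checkpoints run concurrently.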
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
1996fanrui commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1475645896 ## flink-runtime/src/test/java/org/apache/flink/runtime/testutils/CommonTestUtils.java: ## @@ -367,9 +367,11 @@ public static void waitForCheckpoint(JobID jobID, MiniCluster miniCluster, int n }); } -/** Wait for on more completed checkpoint. */ -public static void waitForOneMoreCheckpoint(JobID jobID, MiniCluster miniCluster) -throws Exception { +/** + * Wait for a new completed checkpoint. Note: we wait for 2 or more checkpoint to ensure the + * latest checkpoint start after waitForTwoOrMoreCheckpoint is called. + */ +public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster) throws Exception { Review Comment: I have refactored `waitForNewCheckpoint`: I check the checkpoint trigger time instead of the checkpoint count. I think checking the trigger time is clearer than `checkpointCount >= 2`, WDYT?
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
xuyangzhong commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475601398 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: I also added a related test for it.
Re: [PR] [FLINK-34336][test] Fix the bug that AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes [flink]
1996fanrui commented on PR #24248: URL: https://github.com/apache/flink/pull/24248#issuecomment-1923016887 org.apache.flink.test.checkpointing.AutoRescalingITCase#testCheckpointRescalingNonPartitionedStateCausesException hangs as well. I checked: all test jobs of AutoRescalingITCase have more than one job vertex, so the `desiredNumberOfRunningTasks` of all `waitForRunningTasks` calls should be `2 * parallelism2` or `p1 + p2`.
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
hackergin commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475597199 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: +1, we can handle it in a separate PR; it seems that other types of TVFs may also have similar issues.
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
xuyangzhong commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475596548 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: I have created one for it. https://issues.apache.org/jira/browse/FLINK-34338
[jira] [Created] (FLINK-34338) An exception is thrown when some named params change order when using window tvf
xuyang created FLINK-34338: -- Summary: An exception is thrown when some named params change order when using window tvf Key: FLINK-34338 URL: https://issues.apache.org/jira/browse/FLINK-34338 Project: Flink Issue Type: Bug Components: Table SQL / API Affects Versions: 1.15.0 Reporter: xuyang Fix For: 1.19.0 This bug can be reproduced by the following sql in `WindowTableFunctionTest` {code:java} @Test def test(): Unit = { val sql = """ |SELECT * |FROM TABLE(TUMBLE( | DATA => TABLE MyTable, | SIZE => INTERVAL '15' MINUTE, | TIMECOL => DESCRIPTOR(rowtime) | )) |""".stripMargin util.verifyRelPlan(sql) }{code} In FLIP-145 and the user doc, we can find `the DATA param must be the first`, but it seems that we also can't change the order of the other params.
Re: [PR] [FLINK-26369][doc-zh] Translate the part zh-page mixed with not be translated. [flink]
Jiabao-Sun commented on code in PR #20340: URL: https://github.com/apache/flink/pull/20340#discussion_r1475595276 ## docs/content.zh/docs/deployment/ha/overview.md: ## @@ -74,18 +74,9 @@ Flink 提供了两种高可用服务实现: {{< top >}} -## JobResultStore - -The JobResultStore is used to archive the final result of a job that reached a globally-terminal -state (i.e. finished, cancelled or failed). The data is stored on a file system (see -[job-result-store.storage-path]({{< ref "docs/deployment/config#job-result-store-storage-path" >}})). -Entries in this store are marked as dirty as long as the corresponding job wasn't cleaned up properly -(artifacts are found in the job's subfolder in [high-availability.storageDir]({{< ref "docs/deployment/config#high-availability-storagedir" >}})). - -Dirty entries are subject to cleanup, i.e. the corresponding job is either cleaned up by Flink at -the moment or will be picked up for cleanup as part of a recovery. The entries will be deleted as -soon as the cleanup succeeds. Check the JobResultStore configuration parameters under -[HA configuration options]({{< ref "docs/deployment/config#high-availability" >}}) for further -details on how to adapt the behavior. +## 作业结果存储 +作业结果存储用于归档达到全局结束状态作业(即完成、取消或失败)的最终结果,其数据存储在文件系统上 (请参阅[job-result-store.storage-path]({{< ref "docs/deployment/config#job-result-store-storage-path" >}}))。 +只要没有正确清理相应的作业,此数据条目就是脏数据 (数据位于作业的子文件夹中 [high-availability.storageDir]({{< ref "docs/deployment/config#high-availability-storagedir" >}}))。 +脏数据是需要清理的,即通过相应的作业要么被 Flink 清理,要么将作为恢复的一部分被选取进行清理。清理成功后,数据条目将被删除。请参阅 [HA configuration options]({{< ref "docs/deployment/config#high-availability" >}}) 下作业结果存储的配置参数以获取有关如何调整行为的更多详细信息。 Review Comment: 脏数据将被清理，即相应的作业要么在当前时刻被清理，要么在作业恢复过程中被清理。一旦清理成功，这些脏数据条目将被删除。
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
xuyangzhong commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475591118 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: I have double-checked FLIP-145 and the user doc; I only found `the DATA param must be the first`. That means we should allow the DATA and GAP params to change order. But I think it's better to open a separate Jira for that; this PR focuses on the param-name mismatch between `SIZE` in Calcite and `GAP` in Flink. WDYT?
Re: [PR] [FLINK-33396][table-planner] Fix table alias not be cleared in subquery [flink]
xuyangzhong commented on code in PR #24239: URL: https://github.com/apache/flink/pull/24239#discussion_r1475586086 ## flink-table/flink-table-planner/src/test/resources/org/apache/flink/table/planner/plan/hints/batch/NestLoopJoinHintTest.xml: ## @@ -679,8 +680,8 @@ HashJoin(joinType=[LeftSemiJoin], where=[=(a1, a2)], select=[a1, b1], build=[rig +- Calc(select=[a2]) +- NestedLoopJoin(joinType=[InnerJoin], where=[=(a2, a3)], select=[a2, a3], build=[left]) :- Exchange(distribution=[broadcast]) - : +- TableSourceScan(table=[[default_catalog, default_database, T2, project=[a2], metadata=[]]], fields=[a2], hints=[[[ALIAS options:[T2) - +- TableSourceScan(table=[[default_catalog, default_database, T3, project=[a3], metadata=[]]], fields=[a3], hints=[[[ALIAS options:[T3) Review Comment: Hi, I found that ClearQueryBlockAliasResolverTest#testJoinHintWithJoinHintInCorrelateAndWithAgg does a similar test. Can you take a look and check whether it's what you mean? In this test, the join hint will not be cleared and only the alias hint will be cleared.
Re: [PR] [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method [flink]
flinkbot commented on PR #24249: URL: https://github.com/apache/flink/pull/24249#issuecomment-1922891781 ## CI report: * a57b0e7e04a2f3f3b61a7a5c2ed5f3e8cbae9f47 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
Re: [PR] [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method [flink]
Jiabao-Sun commented on PR #24249: URL: https://github.com/apache/flink/pull/24249#issuecomment-1922891276 Hi @pvary, could you help review it when you have time?
[jira] [Updated] (FLINK-34192) Update flink-connector-kafka to be compatible with updated SinkV2 interfaces
[ https://issues.apache.org/jira/browse/FLINK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34192: --- Labels: pull-request-available (was: ) > Update flink-connector-kafka to be compatible with updated SinkV2 interfaces > > > Key: FLINK-34192 > URL: https://issues.apache.org/jira/browse/FLINK-34192 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / Kafka >Reporter: Martijn Visser >Priority: Blocker > Labels: pull-request-available > > {code:java} > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/streaming/connectors/kafka/table/KafkaTableTestUtils.java:[101,76] > incompatible types: java.util.List cannot be > converted to java.util.List > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[136,46] > cannot find symbol > Error:symbol: method > mock(org.apache.flink.metrics.MetricGroup,org.apache.flink.metrics.groups.OperatorIOMetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[171,46] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[204,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > 
/home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[233,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[263,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[294,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[337,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[525,46] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: -> [Help 1] > {code} > https://github.com/apache/flink-connector-kafka/actions/runs/7597711401/job/20692858078#step:14:221
[jira] [Resolved] (FLINK-34178) The ScalingTracking of autoscaler is wrong
[ https://issues.apache.org/jira/browse/FLINK-34178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan resolved FLINK-34178. - Fix Version/s: kubernetes-operator-1.8.0 Resolution: Fixed > The ScalingTracking of autoscaler is wrong > -- > > Key: FLINK-34178 > URL: https://issues.apache.org/jira/browse/FLINK-34178 > Project: Flink > Issue Type: Bug > Components: Autoscaler >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The ScalingTracking of autoscaler is wrong, it's always greater than > AutoScalerOptions#STABILIZATION_INTERVAL. > h2. Reason: > When the flink job isStabilizing, ScalingMetricCollector#updateMetrics will > return an empty metric history. In the JobAutoScalerImpl#runScalingLogic > method, if `collectedMetrics.getMetricHistory().isEmpty()`, we don't update > the ScalingTracking. > > The default value of AutoScalerOptions#STABILIZATION_INTERVAL is 5 min, so > the restartTime is always greater than 5 min. > However, the actual restart is quick when we use the rescale API.
[jira] [Commented] (FLINK-34178) The ScalingTracking of autoscaler is wrong
[ https://issues.apache.org/jira/browse/FLINK-34178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813488#comment-17813488 ] Rui Fan commented on FLINK-34178: - Merged to main(1.8.0) via: * 2bd6b1b38171ce821509d46d687291b876f72a2e * 746a996732e56e573f7882764c82e2faa2cba71d > The ScalingTracking of autoscaler is wrong > -- > > Key: FLINK-34178 > URL: https://issues.apache.org/jira/browse/FLINK-34178 > Project: Flink > Issue Type: Bug > Components: Autoscaler >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > > The ScalingTracking of autoscaler is wrong, it's always greater than > AutoScalerOptions#STABILIZATION_INTERVAL. > h2. Reason: > When the flink job isStabilizing, ScalingMetricCollector#updateMetrics will > return an empty metric history. In the JobAutoScalerImpl#runScalingLogic > method, if `collectedMetrics.getMetricHistory().isEmpty()`, we don't update > the ScalingTracking. > > The default value of AutoScalerOptions#STABILIZATION_INTERVAL is 5 min, so > the restartTime is always greater than 5 min. > However, the actual restart is quick when we use the rescale API.
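The bug pattern described above — recording the end of a restart only once metric history is available, so the observed restart time can never be shorter than the stabilization interval — can be sketched with invented names (this is an illustrative model, not the operator's actual ScalingTracking API):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical tracker; field and method names are illustrative only.
class RestartTracker {
    private Instant restartStart;
    private Duration observedRestartTime;

    void onRescaleTriggered(Instant now) {
        restartStart = now;
    }

    // Buggy variant: only records the restart time when metric history is
    // non-empty, i.e. after the stabilization interval has already elapsed.
    void onMetricsCollected(boolean metricHistoryEmpty, Instant now) {
        if (!metricHistoryEmpty && observedRestartTime == null) {
            observedRestartTime = Duration.between(restartStart, now);
        }
    }

    // Fixed variant: record as soon as the job reports RUNNING, independent
    // of whether metrics have been collected yet.
    void onJobRunning(Instant now) {
        if (observedRestartTime == null) {
            observedRestartTime = Duration.between(restartStart, now);
        }
    }

    Duration observed() {
        return observedRestartTime;
    }
}
```

With a 10-second actual restart and a 5-minute stabilization interval, the buggy path reports 5 minutes while the fixed path reports 10 seconds.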
[jira] [Comment Edited] (FLINK-34329) ScalingReport format tests fail locally on decimal format
[ https://issues.apache.org/jira/browse/FLINK-34329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813487#comment-17813487 ] Rui Fan edited comment on FLINK-34329 at 2/2/24 5:54 AM: - Merged to main(1.8.0) via : 398a87c9012ddc79bfe4b2378cea740642283628 was (Author: fanrui): Merged to main(1.9.0) via : 398a87c9012ddc79bfe4b2378cea740642283628 > ScalingReport format tests fail locally on decimal format > - > > Key: FLINK-34329 > URL: https://issues.apache.org/jira/browse/FLINK-34329 > Project: Flink > Issue Type: Bug > Components: Autoscaler, Kubernetes Operator >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Gyula Fora >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The recently introduced scaling event format tests fail locally due to > different decimal format: > ``` > [ERROR] AutoScalerEventHandlerTest.testScalingReport:55 > expected: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424.68 -> 123.40 | Target data rate 403.67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812.58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404.73 -> 645.00 | Target data rate 404.27}" > but was: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424,68 -> 123,40 | Target data rate 403,67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812,58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404,73 -> 645,00 | Target data rate 404,27}" > [INFO] > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34329) ScalingReport format tests fail locally on decimal format
[ https://issues.apache.org/jira/browse/FLINK-34329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan resolved FLINK-34329. - Resolution: Fixed > ScalingReport format tests fail locally on decimal format > - > > Key: FLINK-34329 > URL: https://issues.apache.org/jira/browse/FLINK-34329 > Project: Flink > Issue Type: Bug > Components: Autoscaler, Kubernetes Operator >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Gyula Fora >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The recently introduced scaling event format tests fail locally due to > different decimal format: > ``` > [ERROR] AutoScalerEventHandlerTest.testScalingReport:55 > expected: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424.68 -> 123.40 | Target data rate 403.67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812.58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404.73 -> 645.00 | Target data rate 404.27}" > but was: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424,68 -> 123,40 | Target data rate 403,67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812,58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404,73 -> 645,00 | Target data rate 404,27}" > [INFO] > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34329) ScalingReport format tests fail locally on decimal format
[ https://issues.apache.org/jira/browse/FLINK-34329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813487#comment-17813487 ] Rui Fan commented on FLINK-34329: - Merged to main(1.9.0) via : 398a87c9012ddc79bfe4b2378cea740642283628 > ScalingReport format tests fail locally on decimal format > - > > Key: FLINK-34329 > URL: https://issues.apache.org/jira/browse/FLINK-34329 > Project: Flink > Issue Type: Bug > Components: Autoscaler, Kubernetes Operator >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Gyula Fora >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The recently introduced scaling event format tests fail locally due to > different decimal format: > ``` > [ERROR] AutoScalerEventHandlerTest.testScalingReport:55 > expected: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424.68 -> 123.40 | Target data rate 403.67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812.58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404.73 -> 645.00 | Target data rate 404.27}" > but was: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424,68 -> 123,40 | Target data rate 403,67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812,58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404,73 -> 645,00 | Target data rate 404,27}" > [INFO] > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
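The root cause of the mismatch above is that `String.format` with `%f` uses the JVM's default locale for the decimal separator, so a German-locale machine renders `424.68` as `424,68`. A minimal demonstration — pinning `Locale.ROOT` is one common way to make the output deterministic; whether the linked PR does exactly this is not shown here:

```java
import java.util.Locale;

class ScalingReportLocaleDemo {
    // Formats a processing-capacity value with an explicit locale, mirroring
    // the locale-sensitive formatting behind the failing assertion.
    static String format(Locale locale, double v) {
        return String.format(locale, "%.2f", v);
    }

    public static void main(String[] args) {
        // Locale-sensitive: German locale uses a comma decimal separator.
        System.out.println(format(Locale.GERMANY, 424.68)); // 424,68
        // Locale-pinned: identical on every machine.
        System.out.println(format(Locale.ROOT, 424.68));    // 424.68
    }
}
```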
Re: [PR] [FLINK-34178][autoscaler] Fix the bug that observed scaling restart time is always greater than `stabilization.interval` [flink-kubernetes-operator]
1996fanrui merged PR #759: URL: https://github.com/apache/flink-kubernetes-operator/pull/759 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-34178][autoscaler] Fix the bug that observed scaling restart time is always greater than `stabilization.interval` [flink-kubernetes-operator]
1996fanrui commented on PR #759: URL: https://github.com/apache/flink-kubernetes-operator/pull/759#issuecomment-1922886105 Thanks for the review. I tested locally, it works fine, the tracking time is correct now. Merging~
[jira] [Updated] (FLINK-34337) Sink.InitContextWrapper should implement metadataConsumer method
[ https://issues.apache.org/jira/browse/FLINK-34337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34337: --- Labels: pull-request-available (was: ) > Sink.InitContextWrapper should implement metadataConsumer method > > > Key: FLINK-34337 > URL: https://issues.apache.org/jira/browse/FLINK-34337 > Project: Flink > Issue Type: Bug > Components: API / Core >Affects Versions: 1.19.0 >Reporter: Jiabao Sun >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > > Sink.InitContextWrapper should implement metadataConsumer method. > If the metadataConsumer method is not implemented, the behavior of the > wrapped WriterInitContext's metadataConsumer will be lost.
[PR] [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method [flink]
Jiabao-Sun opened a new pull request, #24249: URL: https://github.com/apache/flink/pull/24249 ## What is the purpose of the change [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method ## Brief change log Sink.InitContextWrapper should implement metadataConsumer method. If the metadataConsumer method is not implemented, the behavior of the wrapped WriterInitContext's metadataConsumer will be lost. ## Verifying this change Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing This change is already covered by existing tests. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not documented)
[jira] [Created] (FLINK-34337) Sink.InitContextWrapper should implement metadataConsumer method
Jiabao Sun created FLINK-34337: -- Summary: Sink.InitContextWrapper should implement metadataConsumer method Key: FLINK-34337 URL: https://issues.apache.org/jira/browse/FLINK-34337 Project: Flink Issue Type: Bug Components: API / Core Affects Versions: 1.19.0 Reporter: Jiabao Sun Fix For: 1.19.0 Sink.InitContextWrapper should implement metadataConsumer method. If the metadataConsumer method is not implemented, the behavior of the wrapped WriterInitContext's metadataConsumer will be lost.
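The failure mode behind this issue is the classic default-method delegation pitfall: a wrapper that forwards only the abstract methods silently replaces a defaulted method with the interface default, losing the wrapped object's behavior. A self-contained sketch with hypothetical interfaces (these are not Flink's real Sink.InitContextWrapper/WriterInitContext signatures):

```java
import java.util.Optional;
import java.util.function.Consumer;

// Hypothetical stand-in for an init-context interface with a defaulted method.
interface InitContext {
    String jobId();

    default Optional<Consumer<String>> metadataConsumer() {
        return Optional.empty();
    }
}

class LoggingContext implements InitContext {
    public String jobId() { return "job-1"; }

    @Override
    public Optional<Consumer<String>> metadataConsumer() {
        return Optional.of(meta -> System.out.println("meta: " + meta));
    }
}

// Buggy wrapper: forwards jobId() but not metadataConsumer(), so the
// interface default (Optional.empty()) hides the delegate's consumer.
class BuggyWrapper implements InitContext {
    private final InitContext delegate;
    BuggyWrapper(InitContext delegate) { this.delegate = delegate; }
    public String jobId() { return delegate.jobId(); }
}

// Fixed wrapper: delegates the defaulted method as well.
class FixedWrapper implements InitContext {
    private final InitContext delegate;
    FixedWrapper(InitContext delegate) { this.delegate = delegate; }
    public String jobId() { return delegate.jobId(); }

    @Override
    public Optional<Consumer<String>> metadataConsumer() {
        return delegate.metadataConsumer();
    }
}
```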
Re: [PR] [FLINK-34329][autoscaler] Fix the bug that scaling report parser doesn't support Locale [flink-kubernetes-operator]
1996fanrui merged PR #769: URL: https://github.com/apache/flink-kubernetes-operator/pull/769
[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
[ https://issues.apache.org/jira/browse/FLINK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-34336: Description: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. h2. How to reproduce this bug? [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] * Disable the cooldown * Sleep for a while before waitForRunningTasks If so, the job runs at the new parallelism, so `waitForRunningTasks` will hang forever. was: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. h2. How to reproduce this bug? [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] * Disable the cooldown * Sleep for a while before waitForRunningTasks > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang sometimes > - > > Key: FLINK-34336 > URL: https://issues.apache.org/jira/browse/FLINK-34336 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.19.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang in > waitForRunningTasks(restClusterClient, jobID, parallelism2); > h2. Reason: > The job has 2 tasks (vertices) after calling updateJobResourceRequirements. > The source parallelism isn't changed (it's parallelism), and the > FlatMapper+Sink is changed from parallelism to parallelism2. > So we expect the task number to be parallelism + parallelism2 instead of > parallelism2. > > h2. Why can it be passed for now? > Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by > default. It means the flink job will rescale 30 seconds after > updateJobResourceRequirements is called. > > So the running tasks are still at the old parallelism when we call > waitForRunningTasks(restClusterClient, jobID, parallelism2);. > IIUC, it cannot be guaranteed, and it's unexpected. > > h2. How to reproduce this bug? > [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] > * Disable the cooldown > * Sleep for a while before waitForRunningTasks > If so, the job runs at the new parallelism, so `waitForRunningTasks` will hang > forever.
Re: [PR] [FLINK-34336][test] Fix the bug that AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes [flink]
flinkbot commented on PR #24248: URL: https://github.com/apache/flink/pull/24248#issuecomment-1922845516 ## CI report: * 9732c01c33a38cb989d9fb73a5838329f35d8f7e UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
[ https://issues.apache.org/jira/browse/FLINK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34336: --- Labels: pull-request-available (was: ) > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang sometimes > - > > Key: FLINK-34336 > URL: https://issues.apache.org/jira/browse/FLINK-34336 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.19.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang in > waitForRunningTasks(restClusterClient, jobID, parallelism2); > h2. Reason: > The job has 2 tasks (vertices) after calling updateJobResourceRequirements. > The source parallelism isn't changed (it's parallelism), and the > FlatMapper+Sink is changed from parallelism to parallelism2. > So we expect the task number to be parallelism + parallelism2 instead of > parallelism2. > > h2. Why can it be passed for now? > Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by > default. It means the flink job will rescale 30 seconds after > updateJobResourceRequirements is called. > > So the running tasks are still at the old parallelism when we call > waitForRunningTasks(restClusterClient, jobID, parallelism2);. > IIUC, it cannot be guaranteed, and it's unexpected. > > h2. How to reproduce this bug? > [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] > * Disable the cooldown > * Sleep for a while before waitForRunningTasks
[PR] [FLINK-34336][test] Fix the bug that AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes [flink]
1996fanrui opened a new pull request, #24248: URL: https://github.com/apache/flink/pull/24248 ## Purpose AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); ## Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. ## Why it can be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. ## How to reproduce this bug? https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6 1. Disable the cooldown 2. Sleep for a while before waitForRunningTasks
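The essence of the fix is the expected task count: the source stays at `parallelism` while FlatMapper+Sink move to `parallelism2`, so the wait must target `parallelism + parallelism2` running tasks. A minimal, generic polling-wait sketch (a hypothetical helper with invented names, not Flink's actual MiniCluster test utilities) showing why a wrong expected count makes the loop spin until timeout:

```java
import java.time.Duration;
import java.util.function.IntSupplier;

class WaitUtil {
    // Polls until the reported running-task count reaches the expected total.
    // With a wrong expectation (e.g. parallelism2 instead of
    // parallelism + parallelism2) the condition is never met and the caller
    // hangs until the timeout — which is the hang this PR fixes.
    static boolean waitForRunningTasks(IntSupplier runningTasks, int expected, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (runningTasks.getAsInt() == expected) {
                return true;
            }
            Thread.sleep(10); // back off between polls
        }
        return false;
    }
}
```

For a source at parallelism 2 and FlatMapper+Sink at parallelism 4, the correct expectation is 6 tasks, not 4.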
[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
[ https://issues.apache.org/jira/browse/FLINK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-34336: Description: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. h2. How to reproduce this bug? [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] * Disable the cooldown * Sleep for a while before waitForRunningTasks was: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang sometimes > - > > Key: FLINK-34336 > URL: https://issues.apache.org/jira/browse/FLINK-34336 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.19.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Fix For: 1.19.0 > > > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang in > waitForRunningTasks(restClusterClient, jobID, parallelism2); > h2. Reason: > The job has 2 tasks (vertices) after calling updateJobResourceRequirements. > The source parallelism isn't changed (it's parallelism), and the > FlatMapper+Sink is changed from parallelism to parallelism2. > So we expect the task number to be parallelism + parallelism2 instead of > parallelism2. > > h2. Why can it be passed for now? > Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by > default. It means the flink job will rescale 30 seconds after > updateJobResourceRequirements is called. > > So the running tasks are still at the old parallelism when we call > waitForRunningTasks(restClusterClient, jobID, parallelism2);. > IIUC, it cannot be guaranteed, and it's unexpected. > > h2. How to reproduce this bug? > [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] > * Disable the cooldown > * Sleep for a while before waitForRunningTasks
[jira] [Created] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
Rui Fan created FLINK-34336: --- Summary: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes Key: FLINK-34336 URL: https://issues.apache.org/jira/browse/FLINK-34336 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.19.0 Reporter: Rui Fan Assignee: Rui Fan Fix For: 1.19.0 AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected.
Re: [PR] [FLINK-32075][FLIP-306][Checkpoint] Delete merged files on checkpoint abort or subsumption [flink]
fredia commented on code in PR #24181: URL: https://github.com/apache/flink/pull/24181#discussion_r1475523078 ## flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/filemerging/WithinCheckpointFileMergingSnapshotManager.java: ## @@ -50,6 +50,43 @@ public WithinCheckpointFileMergingSnapshotManager(String id, Executor ioExecutor writablePhysicalFilePool = new HashMap<>(); } +// +// CheckpointListener +// + +@Override +public void notifyCheckpointComplete(SubtaskKey subtaskKey, long checkpointId) Review Comment: Why is `notifyCheckpointSubsumed` not overridden? IIUC, checkpoint subsume and complete are triggered at the same time([code](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1082-L1085)), although the effect of the two methods is the same, it feels more straightforward to close the physical file in `notifyCheckpointSubsumed`, WDYT? ## flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/filemerging/FileMergingSnapshotManagerBase.java: ## @@ -345,6 +371,79 @@ protected abstract PhysicalFile getOrCreatePhysicalFileForCheckpoint( protected abstract void returnPhysicalFileForNextReuse( SubtaskKey subtaskKey, long checkpointId, PhysicalFile physicalFile) throws IOException; +/** + * The callback which will be triggered when all subtasks discarded (aborted or subsumed). + * + * @param checkpointId the discarded checkpoint id. + * @throws IOException if anything goes wrong with file system.
+ */ +protected abstract void discardCheckpoint(long checkpointId) throws IOException; + +// +// Checkpoint Listener +// + +@Override +public void notifyCheckpointComplete(SubtaskKey subtaskKey, long checkpointId) +throws Exception { +// does nothing +} + +@Override +public void notifyCheckpointAborted(SubtaskKey subtaskKey, long checkpointId) throws Exception { +synchronized (lock) { +Set logicalFilesForCurrentCp = uploadedStates.get(checkpointId); +if (logicalFilesForCurrentCp == null) { +return; +} +Iterator logicalFileIterator = logicalFilesForCurrentCp.iterator(); +while (logicalFileIterator.hasNext()) { +LogicalFile logicalFile = logicalFileIterator.next(); +if (logicalFile.getSubtaskKey().equals(subtaskKey) +&& logicalFile.getLastUsedCheckpointID() <= checkpointId) { +logicalFile.discardWithCheckpointId(checkpointId); +logicalFileIterator.remove(); +} +} + +if (logicalFilesForCurrentCp.isEmpty()) { +uploadedStates.remove(checkpointId); +discardCheckpoint(checkpointId); +} Review Comment: nit: There is some overlap with the code of `notifyCheckpointSubsumed`, maybe it can be extracted into a function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
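The de-duplication the reviewer suggests can be sketched generically: both the "aborted" and "subsumed" paths drop one subtask's entries for a checkpoint and fire a discard callback once the checkpoint's set is empty. This is a simplified model with invented names and types (strings instead of `LogicalFile`, no locking), not the actual FileMergingSnapshotManagerBase code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongConsumer;

// Generic sketch of the extracted helper shared by both notification paths.
class CheckpointFileRegistry {
    private final Map<Long, Set<String>> uploadedStates = new HashMap<>();
    private final LongConsumer onCheckpointDiscarded;

    CheckpointFileRegistry(LongConsumer onCheckpointDiscarded) {
        this.onCheckpointDiscarded = onCheckpointDiscarded;
    }

    void register(long checkpointId, String file) {
        uploadedStates.computeIfAbsent(checkpointId, k -> new HashSet<>()).add(file);
    }

    // Extracted helper: drop the given subtask's files for a checkpoint and,
    // if nothing is left, discard the whole checkpoint exactly once.
    void discardFilesForSubtask(long checkpointId, String subtaskPrefix) {
        Set<String> files = uploadedStates.get(checkpointId);
        if (files == null) {
            return;
        }
        files.removeIf(f -> f.startsWith(subtaskPrefix));
        if (files.isEmpty()) {
            uploadedStates.remove(checkpointId);
            onCheckpointDiscarded.accept(checkpointId);
        }
    }
}
```

The discard callback fires only after the last subtask's files are gone, which is the invariant both `notifyCheckpointAborted` and `notifyCheckpointSubsumed` need.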
[jira] [Created] (FLINK-34335) Query hints in RexSubQuery could not be printed
xuyang created FLINK-34335: -- Summary: Query hints in RexSubQuery could not be printed Key: FLINK-34335 URL: https://issues.apache.org/jira/browse/FLINK-34335 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: xuyang Fix For: 1.19.0 That is because in `RelTreeWriterImpl`, we don't take the `RexSubQuery` into account, and `RexSubQuery` uses `RelOptUtil.toString(rel)` to print itself instead of adding extra information such as query hints.
[jira] [Commented] (FLINK-34335) Query hints in RexSubQuery could not be printed
[ https://issues.apache.org/jira/browse/FLINK-34335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813470#comment-17813470 ] xuyang commented on FLINK-34335: I'll try to fix it. > Query hints in RexSubQuery could not be printed > --- > > Key: FLINK-34335 > URL: https://issues.apache.org/jira/browse/FLINK-34335 > Project: Flink > Issue Type: Bug > Components: Table SQL / Planner >Reporter: xuyang >Priority: Major > Fix For: 1.19.0 > > > That is because in `RelTreeWriterImpl`, we don't take the > `RexSubQuery` into account, and `RexSubQuery` uses `RelOptUtil.toString(rel)` to print > itself instead of adding extra information such as query hints.
Re: [PR] [FLINK-25537] [JUnit5 Migration] Module: flink-core package api-common [flink]
Jiabao-Sun commented on code in PR #23960: URL: https://github.com/apache/flink/pull/23960#discussion_r1475422173 ## flink-core/src/test/java/org/apache/flink/api/common/io/FileInputFormatTest.java: ## @@ -603,55 +587,52 @@ public void testGetStatisticsMultipleOneFileWithCachedVersion() throws IOExcepti * split has to start from the beginning. */ @Test -public void testFileInputFormatWithCompression() { -try { -String tempFile = -TestFileUtils.createTempFileDirForProvidedFormats( -temporaryFolder.newFolder(), -FileInputFormat.getSupportedCompressionFormats()); -final DummyFileInputFormat format = new DummyFileInputFormat(); -format.setFilePath(tempFile); -format.configure(new Configuration()); -FileInputSplit[] splits = format.createInputSplits(2); -final Set supportedCompressionFormats = -FileInputFormat.getSupportedCompressionFormats(); -Assert.assertEquals(supportedCompressionFormats.size(), splits.length); -for (FileInputSplit split : splits) { -Assert.assertEquals( -FileInputFormat.READ_WHOLE_SPLIT_FLAG, -split.getLength()); // unsplittable compressed files have this size as a -// flag for "read whole file" -Assert.assertEquals(0L, split.getStart()); // always read from the beginning. 
-} +void testFileInputFormatWithCompression() throws IOException { -// test if this also works for "mixed" directories -TestFileUtils.createTempFileInDirectory( -tempFile.replace("file:", ""), -"this creates a test file with a random extension (at least not .deflate)"); - -final DummyFileInputFormat formatMixed = new DummyFileInputFormat(); -formatMixed.setFilePath(tempFile); -formatMixed.configure(new Configuration()); -FileInputSplit[] splitsMixed = formatMixed.createInputSplits(2); -Assert.assertEquals(supportedCompressionFormats.size() + 1, splitsMixed.length); -for (FileInputSplit split : splitsMixed) { -final String extension = - FileInputFormat.extractFileExtension(split.getPath().getName()); -if (supportedCompressionFormats.contains(extension)) { -Assert.assertEquals( -FileInputFormat.READ_WHOLE_SPLIT_FLAG, -split.getLength()); // unsplittable compressed files have this size as a -// flag for "read whole file" -Assert.assertEquals(0L, split.getStart()); // always read from the beginning. -} else { -Assert.assertEquals(0L, split.getStart()); -Assert.assertTrue("split size not correct", split.getLength() > 0); -} -} +String tempFile = +TestFileUtils.createTempFileDirForProvidedFormats( +TempDirUtils.newFolder(temporaryFolder), +FileInputFormat.getSupportedCompressionFormats()); +final DummyFileInputFormat format = new DummyFileInputFormat(); +format.setFilePath(tempFile); +format.configure(new Configuration()); +FileInputSplit[] splits = format.createInputSplits(2); +final Set supportedCompressionFormats = +FileInputFormat.getSupportedCompressionFormats(); +assertThat(splits).hasSize(supportedCompressionFormats.size()); +for (FileInputSplit split : splits) { +assertThat(split.getLength()) +.isEqualTo( +FileInputFormat.READ_WHOLE_SPLIT_FLAG); // unsplittable compressed files +// have this size as a +// flag for "read whole file" +assertThat(split.getStart()).isEqualTo(0L); // always read from the beginning. 
Review Comment: ```suggestion assertThat(split.getStart()).isZero(); // always read from the beginning. ``` ## flink-core/src/test/java/org/apache/flink/api/common/io/FileInputFormatTest.java: ## @@ -662,35 +643,32 @@ public void testFileInputFormatWithCompression() { * split is not the compressed file size and that the compression decorator is called. */ @Test -public void testFileInputFormatWithCompressionFromFileSource() { -try { -String tempFile = -TestFileUtils.createTempFileDirForProvidedFormats( -temporaryFolder.newFolder(), -FileInputFormat.getSupportedCompressionFormats()); -DummyFileInputFormat format = new DummyFileInputFormat(); -
[jira] [Commented] (FLINK-34169) [benchmark] CI fails during test running
[ https://issues.apache.org/jira/browse/FLINK-34169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813469#comment-17813469 ] Yunfeng Zhou commented on FLINK-34169: -- Fixed in 8dd28a2e8bf10ec278ff90c99563468e294a135a. Hi [~Zakelly] Could you please help close this issue? > [benchmark] CI fails during test running > > > Key: FLINK-34169 > URL: https://issues.apache.org/jira/browse/FLINK-34169 > Project: Flink > Issue Type: Bug > Components: Benchmarks >Reporter: Zakelly Lan >Assignee: Yunfeng Zhou >Priority: Critical > > [CI link: > https://github.com/apache/flink-benchmarks/actions/runs/7580834955/job/20647663157?pr=85|https://github.com/apache/flink-benchmarks/actions/runs/7580834955/job/20647663157?pr=85] > which says: > {code:java} > // omit some stack traces > Caused by: java.util.concurrent.ExecutionException: Boxed Error > 1115 at scala.concurrent.impl.Promise$.resolver(Promise.scala:87) > 1116 at > scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:79) > 1117 at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > 1118 at > org.apache.pekko.pattern.PromiseActorRef.$bang(AskSupport.scala:629) > 1119 at org.apache.pekko.actor.ActorRef.tell(ActorRef.scala:141) > 1120 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:317) > 1121 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:222) > 1122 ... 
22 more > 1123 Caused by: java.lang.NoClassDefFoundError: > javax/activation/UnsupportedDataTypeException > 1124 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.createKnownInputChannel(SingleInputGateFactory.java:387) > 1125 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.lambda$createInputChannel$2(SingleInputGateFactory.java:353) > 1126 at > org.apache.flink.runtime.shuffle.ShuffleUtils.applyWithShuffleTypeCheck(ShuffleUtils.java:51) > 1127 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.createInputChannel(SingleInputGateFactory.java:333) > 1128 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.createInputChannelsAndTieredStorageService(SingleInputGateFactory.java:284) > 1129 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.create(SingleInputGateFactory.java:204) > 1130 at > org.apache.flink.runtime.io.network.NettyShuffleEnvironment.createInputGates(NettyShuffleEnvironment.java:265) > 1131 at > org.apache.flink.runtime.taskmanager.Task.(Task.java:418) > 1132 at > org.apache.flink.runtime.taskexecutor.TaskExecutor.submitTask(TaskExecutor.java:821) > 1133 at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 1134 at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 1135 at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 1136 at java.base/java.lang.reflect.Method.invoke(Method.java:566) > 1137 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRpcInvocation$1(PekkoRpcActor.java:309) > 1138 at > org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) > 1139 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:307) > 1140 ... 
23 more > 1141 Caused by: java.lang.ClassNotFoundException: > javax.activation.UnsupportedDataTypeException > 1142 at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) > 1143 at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) > 1144 at > java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527) > 1145 ... 39 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34295) Release Testing Instructions: Verify FLINK-33712 Deprecate RuntimeContext#getExecutionConfig
[ https://issues.apache.org/jira/browse/FLINK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813465#comment-17813465 ] Junrui Li commented on FLINK-34295: --- [~lincoln.86xy] This change doesn't need cross testing. I think we could close this ticket. > Release Testing Instructions: Verify FLINK-33712 Deprecate > RuntimeContext#getExecutionConfig > > > Key: FLINK-34295 > URL: https://issues.apache.org/jira/browse/FLINK-34295 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Junrui Li >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34296) Release Testing Instructions: Verify FLINK-33581 Deprecate configuration getters/setters that return/set complex Java objects
[ https://issues.apache.org/jira/browse/FLINK-34296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813466#comment-17813466 ] Junrui Li commented on FLINK-34296: --- [~lincoln.86xy] This change doesn't need cross testing. I think we could close this ticket. > Release Testing Instructions: Verify FLINK-33581 Deprecate configuration > getters/setters that return/set complex Java objects > - > > Key: FLINK-34296 > URL: https://issues.apache.org/jira/browse/FLINK-34296 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Junrui Li >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32514) FLIP-309: Support using larger checkpointing interval when source is processing backlog
[ https://issues.apache.org/jira/browse/FLINK-32514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunfeng Zhou updated FLINK-32514: - Release Note: ProcessingBacklog is introduced to indicate whether a record should be processed with low latency or high throughput. ProcessingBacklog can be set by source operators, and can be used to change the checkpoint interval of a job during runtime. > FLIP-309: Support using larger checkpointing interval when source is > processing backlog > --- > > Key: FLINK-32514 > URL: https://issues.apache.org/jira/browse/FLINK-32514 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Yunfeng Zhou >Assignee: Yunfeng Zhou >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > Umbrella issue for > https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34299) Release Testing Instructions: Verify FLINK-33203 Adding a separate configuration for specifying Java Options of the SQL Gateway
[ https://issues.apache.org/jira/browse/FLINK-34299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813464#comment-17813464 ] Yangze Guo commented on FLINK-34299: This change doesn't need cross-team testing. > Release Testing Instructions: Verify FLINK-33203 Adding a separate > configuration for specifying Java Options of the SQL Gateway > > > Key: FLINK-34299 > URL: https://issues.apache.org/jira/browse/FLINK-34299 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Gateway >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Yangze Guo >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-34299) Release Testing Instructions: Verify FLINK-33203 Adding a separate configuration for specifying Java Options of the SQL Gateway
[ https://issues.apache.org/jira/browse/FLINK-34299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yangze Guo closed FLINK-34299. -- Resolution: Fixed > Release Testing Instructions: Verify FLINK-33203 Adding a separate > configuration for specifying Java Options of the SQL Gateway > > > Key: FLINK-34299 > URL: https://issues.apache.org/jira/browse/FLINK-34299 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Gateway >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Yangze Guo >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-34258) Incorrect example of accumulator usage within emitUpdateWithRetract for TableAggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-34258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jane Chan closed FLINK-34258. - > Incorrect example of accumulator usage within emitUpdateWithRetract for > TableAggregateFunction > -- > > Key: FLINK-34258 > URL: https://issues.apache.org/jira/browse/FLINK-34258 > Project: Flink > Issue Type: Bug > Components: Documentation, Table SQL / API >Affects Versions: 1.19.0, 1.18.1 >Reporter: Jane Chan >Assignee: Jane Chan >Priority: Minor > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: GroupTableAggHandler$10.java > > > The > [documentation|https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#retraction-example] > provides an example of using `emitUpdateWithRetract`. However, the example > is misleading as it incorrectly suggests that the accumulator can be updated > within the `emitUpdateWithRetract method`. In reality, the order of > invocation is to first call `getAccumulator` and then > `emitUpdateWithRetract`, which means that updating the accumulator within > `emitUpdateWithRetract` will not take effect. Please see > [GroupTableAggFunction#L141|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L141] > ~ > [GroupTableAggFunction#L146|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L146] > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34258) Incorrect example of accumulator usage within emitUpdateWithRetract for TableAggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-34258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jane Chan resolved FLINK-34258. --- Fix Version/s: 1.19.0 Resolution: Fixed Fixed in master 0779c91e581dc16c4aef61d6cc27774f11495907 > Incorrect example of accumulator usage within emitUpdateWithRetract for > TableAggregateFunction > -- > > Key: FLINK-34258 > URL: https://issues.apache.org/jira/browse/FLINK-34258 > Project: Flink > Issue Type: Bug > Components: Documentation, Table SQL / API >Affects Versions: 1.19.0, 1.18.1 >Reporter: Jane Chan >Assignee: Jane Chan >Priority: Minor > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: GroupTableAggHandler$10.java > > > The > [documentation|https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#retraction-example] > provides an example of using `emitUpdateWithRetract`. However, the example > is misleading as it incorrectly suggests that the accumulator can be updated > within the `emitUpdateWithRetract method`. In reality, the order of > invocation is to first call `getAccumulator` and then > `emitUpdateWithRetract`, which means that updating the accumulator within > `emitUpdateWithRetract` will not take effect. Please see > [GroupTableAggFunction#L141|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L141] > ~ > [GroupTableAggFunction#L146|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L146] > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-34258) Incorrect example of accumulator usage within emitUpdateWithRetract for TableAggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-34258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jane Chan reassigned FLINK-34258: - Assignee: Jane Chan > Incorrect example of accumulator usage within emitUpdateWithRetract for > TableAggregateFunction > -- > > Key: FLINK-34258 > URL: https://issues.apache.org/jira/browse/FLINK-34258 > Project: Flink > Issue Type: Bug > Components: Documentation, Table SQL / API >Affects Versions: 1.19.0, 1.18.1 >Reporter: Jane Chan >Assignee: Jane Chan >Priority: Minor > Labels: pull-request-available > Attachments: GroupTableAggHandler$10.java > > > The > [documentation|https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#retraction-example] > provides an example of using `emitUpdateWithRetract`. However, the example > is misleading as it incorrectly suggests that the accumulator can be updated > within the `emitUpdateWithRetract method`. In reality, the order of > invocation is to first call `getAccumulator` and then > `emitUpdateWithRetract`, which means that updating the accumulator within > `emitUpdateWithRetract` will not take effect. Please see > [GroupTableAggFunction#L141|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L141] > ~ > [GroupTableAggFunction#L146|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L146] > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-34258][docs][table] Fix incorrect retract example for TableAggregateFunction [flink]
LadyForest merged PR #24215: URL: https://github.com/apache/flink/pull/24215 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DRAFT][FLINK-33734] Merge unaligned checkpoint state handle [flink]
flinkbot commented on PR #24247: URL: https://github.com/apache/flink/pull/24247#issuecomment-1922655270 ## CI report: * 5d0a461e79a4b6446e73c171bac69482f9d3251f UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-33734) Merge unaligned checkpoint state handle
[ https://issues.apache.org/jira/browse/FLINK-33734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-33734: --- Labels: pull-request-available (was: ) > Merge unaligned checkpoint state handle > --- > > Key: FLINK-33734 > URL: https://issues.apache.org/jira/browse/FLINK-33734 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Reporter: Feifan Wang >Assignee: Feifan Wang >Priority: Major > Labels: pull-request-available > > h3. Background > Unaligned checkpoint will write the inflight-data of all InputChannel and > ResultSubpartition of the same subtask to the same file during checkpoint. > The InputChannelStateHandle and ResultSubpartitionStateHandle organize the > metadata of inflight-data at the channel granularity, which causes the file > name to be repeated many times. When a job is under backpressure and task > parallelism is high, the metadata of unaligned checkpoints will bloat. This > will result in: > # The amount of data reported by taskmanager to jobmanager increases, and > jobmanager takes longer to process these RPC requests. > # The metadata of the entire checkpoint becomes very large, and it takes > longer to serialize and write it to dfs. > Both of the above points ultimately lead to longer checkpoint duration. > h3. A Production example > Take our production job with a parallelism of 4800 as an example: > # When there is no back pressure, checkpoint end-to-end duration is within 7 > seconds. > # When under pressure: checkpoint end-to-end duration often exceeds 1 > minute. We found that jobmanager took more than 40 seconds to process rpc > requests, and serialized metadata took more than 20 seconds. Some checkpoint > statistics: > |metadata file size|950 MB| > |channel state count|12,229,854| > |channel file count|5536| > Of the 950MB in the metadata file, 68% are redundant file paths. 
> We enabled log-based checkpoint on this job and hoped that the checkpoint > could be completed within 30 seconds. This problem made it difficult to > achieve this goal. > h3. Propose changes > I suggest introducing MergedInputChannelStateHandle and > MergedResultSubpartitionStateHandle to eliminate redundant file paths. > The taskmanager merges all InputChannelStateHandles with the same delegated > StreamStateHandle in the same subtask into one MergedInputChannelStateHandle > before reporting. When recovering from checkpoint, jobmanager converts > MergedInputChannelStateHandle to InputChannelStateHandle collection before > assigning state handle, and the rest of the process does not need to be > changed. > Structure of MergedInputChannelStateHandle : > > {code:java} > { // MergedInputChannelStateHandle > "delegate": { > "filePath": > "viewfs://hadoop-meituan/flink-yg15/checkpoints/retained/1234567/ab8d0c2f02a47586490b15e7a2c30555/chk-31/ffe54c0a-9b6e-4724-aae7-61b96bf8b1cf", > "stateSize": 123456 > }, > "size": 2000, > "subtaskIndex":0, > "channels": [ // One InputChannel per element > { > "info": { > "gateIdx": 0, > "inputChannelIdx": 0 > }, > "offsets": [ > 100,200,300,400 > ], > "size": 1400 > }, > { > "info": { > "gateIdx": 0, > "inputChannelIdx": 1 > }, > "offsets": [ > 500,600 > ], > "size": 600 > } > ] > } > {code} > MergedResultSubpartitionStateHandle is similar. > > > WDYT [~roman] , [~pnowojski] , [~fanrui] ? -- This message was sent by Atlassian Jira (v8.20.10#820010)
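The grouping step proposed above can be sketched as follows. The record types here are simplified illustrative stand-ins, not Flink's real handle classes: handles sharing the same delegate file path are collected under one merged handle, so the path is stored once instead of once per channel.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MergeHandlesSketch {
    // Simplified stand-ins for Flink's channel state handle types.
    record ChannelInfo(int gateIdx, int inputChannelIdx) {}

    record InputChannelStateHandle(String delegatePath, ChannelInfo info, List<Long> offsets, long size) {}

    record MergedInputChannelStateHandle(
            String delegatePath, List<InputChannelStateHandle> channels, long totalSize) {}

    // Group per-channel handles by their delegate file, emitting one merged
    // handle per underlying file.
    static List<MergedInputChannelStateHandle> merge(List<InputChannelStateHandle> handles) {
        Map<String, List<InputChannelStateHandle>> byFile =
                handles.stream()
                        .collect(Collectors.groupingBy(
                                InputChannelStateHandle::delegatePath,
                                LinkedHashMap::new,
                                Collectors.toList()));
        return byFile.entrySet().stream()
                .map(e -> new MergedInputChannelStateHandle(
                        e.getKey(),
                        e.getValue(),
                        e.getValue().stream().mapToLong(InputChannelStateHandle::size).sum()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<InputChannelStateHandle> handles = List.of(
                new InputChannelStateHandle(
                        "chk-31/file-a", new ChannelInfo(0, 0), List.of(100L, 200L, 300L, 400L), 1400L),
                new InputChannelStateHandle(
                        "chk-31/file-a", new ChannelInfo(0, 1), List.of(500L, 600L), 600L));
        List<MergedInputChannelStateHandle> merged = merge(handles);
        // Two channels collapse into one merged handle; the shared path appears once.
        System.out.println(merged.size() + " " + merged.get(0).channels().size() + " " + merged.get(0).totalSize());
    }
}
```

Recovery would do the inverse: expand each merged handle back into per-channel handles before state assignment, as the proposal describes.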
[PR] [FLINK-33734] Merge unaligned checkpoint state handle [flink]
zoltar9264 opened a new pull request, #24247: URL: https://github.com/apache/flink/pull/24247 ## What is the purpose of the change Merge and serialize the state handles of unaligned checkpoint on TaskManager to potentially reduce checkpoint production time. Details in [FLINK-33734](https://issues.apache.org/jira/browse/FLINK-33734). ## Brief change log - Extract interface ChannelState,InputStateHandle,OutputStateHandle as mark of state handle of unaligned checkpoint - Introduce merged channel state handle - Extract interface ChannelStateHandleSerializer - Introduce MetadataV5Serializer for merged channel state handle - Merge and serialize channel state handle before send checkpoint ack to JobManager ## Verifying this change Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing *(Please pick either of the following options)* This change is a trivial rework / code cleanup without any test coverage. *(or)* This change is already covered by existing tests, such as *(please describe tests)*. 
*(or)* This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end deployment with large payloads (100MB)* - *Extended integration test for recovery after master (JobManager) failure* - *Added test that validates that TaskInfo is transferred only once across recoveries* - *Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.* ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-27054) Elasticsearch SQL connector SSL issue
[ https://issues.apache.org/jira/browse/FLINK-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813391#comment-17813391 ] Mingliang Liu commented on FLINK-27054: --- Any updates on this? My understanding is this problem (not supporting SSL) exists in both ES 6 and ES 7 connectors, both SQL and non-SQL (DataStream), correct? > Elasticsearch SQL connector SSL issue > - > > Key: FLINK-27054 > URL: https://issues.apache.org/jira/browse/FLINK-27054 > Project: Flink > Issue Type: Bug > Components: Connectors / ElasticSearch >Reporter: ricardo >Assignee: Kelu Tao >Priority: Major > > The current Flink ElasticSearch SQL connector > https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/elasticsearch/ > is missing SSL options, can't connect to ES clusters which require SSL > certificate. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-34111][table] Add JSON_QUOTE and JSON_UNQUOTE function [flink]
JingGe commented on PR #24156: URL: https://github.com/apache/flink/pull/24156#issuecomment-1922027269 @flinkbot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
XComp commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1474488421 ## flink-runtime/src/test/java/org/apache/flink/runtime/testutils/CommonTestUtils.java: ## @@ -367,9 +367,11 @@ public static void waitForCheckpoint(JobID jobID, MiniCluster miniCluster, int n }); } -/** Wait for on more completed checkpoint. */ -public static void waitForOneMoreCheckpoint(JobID jobID, MiniCluster miniCluster) -throws Exception { +/** + * Wait for a new completed checkpoint. Note: we wait for 2 or more checkpoint to ensure the + * latest checkpoint start after waitForTwoOrMoreCheckpoint is called. + */ +public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster) throws Exception { Review Comment: ```suggestion public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster, int checkpointCount) throws Exception { ``` Can't we make the number of checkpoints to wait for configurable? That way, we can pass in `2` in the test implementation analogously to `waitForCheckpoint`. I also feel like we can remove some redundant code within the two methods. :thinking: ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -328,7 +330,7 @@ public void testCheckpointRescalingNonPartitionedStateCausesException() throws E // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: So far, we've only seen the issue in `#testCheckpointRescalingInKeyedState`. We don't need the two checkpoints here, actually, because we're not relying on elements in the test. We could keep the tests functionality if we make the `waitForOneMoreCheckpoint` configurable as suggested one of my previous comments. 
## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -261,7 +263,7 @@ public void testCheckpointRescalingKeyedState(boolean scaleOut) throws Exception waitForAllTaskRunning(cluster.getMiniCluster(), jobGraph.getJobID(), false); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: We should add a comment here explaining why we need to wait for 2 instead of one checkpoint. ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -513,7 +515,7 @@ public void testCheckpointRescalingPartitionedOperatorState( // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: AFAIU, we don't need to change it here. ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -411,7 +413,7 @@ public void testCheckpointRescalingWithKeyedAndNonPartitionedState() throws Exce // clear the CollectionSink set for the restarted job CollectionSink.clearElementsSet(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: I guess the scenario can happen in this test as well because it's almost the same test implementation as in `#testCheckpointRescalingKeyedState` :thinking: ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -411,7 +413,7 @@ public void testCheckpointRescalingWithKeyedAndNonPartitionedState() throws Exce // clear the CollectionSink set for the restarted job CollectionSink.clearElementsSet(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: I'm wondering whether the redundant code could be removed here. 
But that's probably a bit out-of-scope for this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
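The reviewer's suggestion — making the number of checkpoints to wait for a parameter — can be sketched as a generic polling helper. This is a hypothetical analogue of the test utility under discussion, not Flink's actual `CommonTestUtils` code; the supplier stands in for querying the MiniCluster's completed-checkpoint count:

```java
import java.util.function.Supplier;

public class WaitSketch {
    // Wait until at least `count` new checkpoints have completed since this
    // method was called, polling on a fixed interval until the timeout.
    static void waitForNewCheckpoints(
            Supplier<Integer> completedCheckpoints, int count, long timeoutMs)
            throws InterruptedException {
        int baseline = completedCheckpoints.get();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (completedCheckpoints.get() - baseline < count) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException(
                        "Timed out waiting for " + count + " new checkpoint(s)");
            }
            Thread.sleep(10);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counter = {0};
        // Simulate a job completing a checkpoint every few milliseconds.
        Thread job = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (counter) {
                    counter[0]++;
                }
            }
        });
        job.start();
        // Passing count = 2 matches the "wait for 2 or more" semantics discussed
        // above: the latest completed checkpoint must have started after the call.
        waitForNewCheckpoints(() -> {
            synchronized (counter) {
                return counter[0];
            }
        }, 2, 5_000);
        job.join();
        System.out.println(counter[0] >= 2);
    }
}
```

Taking the baseline first is what distinguishes "new checkpoints" from "total checkpoints", which is the semantic point debated in the thread.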
Re: [PR] [FLINK-33634] Add Conditions to Flink CRD's Status field [flink-kubernetes-operator]
gyfora commented on PR #749: URL: https://github.com/apache/flink-kubernetes-operator/pull/749#issuecomment-1921722522 @lajith2006 do you already have a Jira account? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-34148) Potential regression (Jan. 13): stringWrite with Java8
[ https://issues.apache.org/jira/browse/FLINK-34148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813319#comment-17813319 ] Sergey Nuyanzin commented on FLINK-34148: - As decided in the release meeting, the flink-shaded upgrade was reverted in d6c7eee8243b4fe3e593698f250643534dc79cb5. [~Zakelly], [~martijnvisser] could you please check the perf report? > Potential regression (Jan. 13): stringWrite with Java8 > -- > > Key: FLINK-34148 > URL: https://issues.apache.org/jira/browse/FLINK-34148 > Project: Flink > Issue Type: Improvement > Components: API / Type Serialization System >Reporter: Zakelly Lan >Priority: Blocker > Fix For: 1.19.0 > > > Significant drop of performance in stringWrite with Java8 from commit > [881062f352|https://github.com/apache/flink/commit/881062f352f8bf8c21ab7cbea95e111fd82fdf20] > to > [5d9d8748b6|https://github.com/apache/flink/commit/5d9d8748b64ff1a75964a5cd2857ab5061312b51] > . It only involves strings not so long (128 or 4). > stringWrite.128.ascii(Java8) baseline=1089.107756 current_value=754.52452 > stringWrite.128.chinese(Java8) baseline=504.244575 current_value=295.358989 > stringWrite.128.russian(Java8) baseline=655.582639 current_value=421.030188 > stringWrite.4.chinese(Java8) baseline=9598.791964 current_value=6627.929927 > stringWrite.4.russian(Java8) baseline=11070.666415 current_value=8289.95767 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34132) Batch WordCount job fails when run with AdaptiveBatch scheduler
[ https://issues.apache.org/jira/browse/FLINK-34132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813316#comment-17813316 ] Wencong Liu commented on FLINK-34132: - Thanks for the reminding. [~zhuzh] I will address these issues when I have some free time. > Batch WordCount job fails when run with AdaptiveBatch scheduler > --- > > Key: FLINK-34132 > URL: https://issues.apache.org/jira/browse/FLINK-34132 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.17.1, 1.18.1 >Reporter: Prabhu Joseph >Assignee: Junrui Li >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > Batch WordCount job fails when run with AdaptiveBatch scheduler. > *Repro Steps* > {code:java} > flink-yarn-session -Djobmanager.scheduler=adaptive -d > flink run -d /usr/lib/flink/examples/batch/WordCount.jar --input > s3://prabhuflinks3/INPUT --output s3://prabhuflinks3/OUT > {code} > *Error logs* > {code:java} > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: The main method > caused an error: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. 
> at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) > at > org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:105) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:851) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:245) > at > org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1095) > at > org.apache.flink.client.cli.CliFrontend.lambda$mainInternal$9(CliFrontend.java:1189) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > at > org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at > org.apache.flink.client.cli.CliFrontend.mainInternal(CliFrontend.java:1189) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1157) > Caused by: java.lang.RuntimeException: > java.util.concurrent.ExecutionException: java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. 
> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321) > at > org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:1067) > at > org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:144) > at > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:73) > at > org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:106) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) > ... 12 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > at > org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:1062) > ... 20 more > Caused by: java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321) > at > org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) > at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) > at > java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457) > at
Re: [PR] [FLINK-33089] Clean up 1.13 and 1.14 references from docs and code [flink-kubernetes-operator]
mxm merged PR #768: URL: https://github.com/apache/flink-kubernetes-operator/pull/768
[jira] [Updated] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34333: -- Description: FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which required an update of the k8s client to v6.9.0. This Jira issue is about finding a solution in Flink 1.18 for the very same problem FLINK-34007 covered. It's a dedicated Jira issue because we want to unblock the release of 1.19 by resolving FLINK-34007. Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in v6.6.2 which might prevent the leadership lost event being forwarded to the client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). An initial proposal where the release call was handled in Flink's {{KubernetesLeaderElector}} didn't work due to the leadership lost event being triggered twice (see [FLINK-34007 PR comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902]) was: FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which required an update of the k8s client to v6.9.0. This Jira issue is about finding a solution in Flink 1.18 for the very same problem FLINK-34007 covered. It's a dedicated Jira issue because we want to unblock the release of 1.19 by resolving FLINK-34007. > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. 
> This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007. > Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in > v6.6.2 which might prevent the leadership lost event being forwarded to the > client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). > An initial proposal where the release call was handled in Flink's > {{KubernetesLeaderElector}} didn't work due to the leadership lost event > being triggered twice (see [FLINK-34007 PR > comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902]) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-34319) Bump okhttp version to 4.12.0
[ https://issues.apache.org/jira/browse/FLINK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gyula Fora closed FLINK-34319. -- Fix Version/s: kubernetes-operator-1.8.0 (was: 1.8.0) Assignee: ConradJam Resolution: Fixed merged to main ca3a746e42beb4816c4f3fb7d5b9aea6fc757b32 > Bump okhttp version to 4.12.0 > - > > Key: FLINK-34319 > URL: https://issues.apache.org/jira/browse/FLINK-34319 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.7.2 >Reporter: ConradJam >Assignee: ConradJam >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > Bump okhttp version to 4.12.0
Re: [PR] [FLINK-34319] Bump okhttp version to 4.12.0 [flink-kubernetes-operator]
gyfora merged PR #766: URL: https://github.com/apache/flink-kubernetes-operator/pull/766
Re: [PR] [FLINK-33557] Externalize Cassandra Python connector code [flink-connector-cassandra]
ferenc-csaky commented on PR #26: URL: https://github.com/apache/flink-connector-cassandra/pull/26#issuecomment-1921525300 @pvary @gaborgsomogyi JAR included, CI run passed.
[jira] [Updated] (FLINK-31220) Replace Pod with PodTemplateSpec for the pod template properties
[ https://issues.apache.org/jira/browse/FLINK-31220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-31220: --- Labels: pull-request-available (was: ) > Replace Pod with PodTemplateSpec for the pod template properties > > > Key: FLINK-31220 > URL: https://issues.apache.org/jira/browse/FLINK-31220 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Assignee: Gyula Fora >Priority: Major > Labels: pull-request-available > > The current podtemplate fields in the CR use the Pod object for schema. This > doesn't make sense as status and other fields should never be specified and > they take no effect. > We should replace this with PodTemplateSpec and make sure that this is not a > breaking change even if users incorrectly specified status before.
Re: [PR] [FLINK-31220] Replace Pod with PodTemplateSpec for the pod template properties [flink-kubernetes-operator]
gyfora commented on PR #770: URL: https://github.com/apache/flink-kubernetes-operator/pull/770#issuecomment-1921435716 cc @afedulov
[PR] [FLINK-31220] Replace Pod with PodTemplateSpec for the pod template properties [flink-kubernetes-operator]
gyfora opened a new pull request, #770: URL: https://github.com/apache/flink-kubernetes-operator/pull/770 ## What is the purpose of the change The PodTemplate fields in the CRD currently incorrectly use the Pod type, when they should be of the PodTemplateSpec type. This leads to unnecessary (meaningless) fields and a very large CRD for no gain. Also, accidentally changing fields like apiKind would trigger an unnecessary upgrade of the job. This PR is the first step of correcting this mistake. Unfortunately we cannot simply change the type from Pod to PodTemplateSpec as it's a backward incompatible change for the users. While Kubernetes itself would drop the now unsupported fields from the stored CRs themselves, users would not be able to submit the same CR again as they would get a validation error from the API server. We should defer the CRD schema change to the next version. ## Brief change log - *Change Pod -> PodTemplateSpec but keep the schema in CRD* - *Improve SpecDiff logic to avoid accidental upgrade due to the dropped fields* ## Verifying this change New unit tests + manual verification in local environment. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no - Core observer or reconciler logic that is regularly executed: yes ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? JavaDocs
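The second change-log item above can be sketched as follows. This is not the operator's actual SpecDiff implementation — the map-based representation and the field names are illustrative assumptions — but it shows the idea: fields that the narrowed PodTemplateSpec schema drops (such as `status`) must be ignored when diffing specs, so their disappearance from a resubmitted CR does not look like a user-initiated change and trigger an upgrade.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of diff logic that tolerates fields dropped by a schema
// change. Field names here are assumptions, not the operator's real schema.
public class PodTemplateDiff {

    // Fields present under the old Pod schema but absent from PodTemplateSpec.
    private static final Set<String> DROPPED_FIELDS = Set.of("status", "apiVersion", "kind");

    public static boolean specChanged(Map<String, ?> oldSpec, Map<String, ?> newSpec) {
        return !withoutDropped(oldSpec).equals(withoutDropped(newSpec));
    }

    private static Map<String, Object> withoutDropped(Map<String, ?> spec) {
        Map<String, Object> copy = new HashMap<>(spec);
        copy.keySet().removeAll(DROPPED_FIELDS);
        return copy;
    }

    public static void main(String[] args) {
        // A stored CR that still carries a "status" field under the old schema.
        Map<String, String> stored = Map.of("spec", "podSpec-v1", "status", "Running");
        // The same CR resubmitted after the schema narrowed and dropped "status".
        Map<String, String> submitted = Map.of("spec", "podSpec-v1");
        if (specChanged(stored, submitted)) {
            throw new AssertionError("dropped fields must not count as a spec change");
        }
        System.out.println("no spurious upgrade triggered");
    }
}
```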
Re: [PR] [hotfix] Remove unnecessary fields from podTemplate examples [flink-kubernetes-operator]
gyfora closed pull request #767: [hotfix] Remove unnecessary fields from podTemplate examples URL: https://github.com/apache/flink-kubernetes-operator/pull/767
Re: [PR] [FLINK-31472] Disable Intermittently failing throttling test [flink]
vahmed-hamdy commented on PR #24175: URL: https://github.com/apache/flink/pull/24175#issuecomment-1921379510 @flinkbot run azure
[jira] [Commented] (FLINK-34325) Inconsistent state with data loss after OutOfMemoryError
[ https://issues.apache.org/jira/browse/FLINK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813238#comment-17813238 ] Alexis Sarda-Espinosa commented on FLINK-34325: --- I will say that, while I can of course reproduce the OOM problems, I cannot reliably reproduce the inconsistency, most of the time the job really ends up in a crashloop until I increase memory or clean up the state. > Inconsistent state with data loss after OutOfMemoryError > > > Key: FLINK-34325 > URL: https://issues.apache.org/jira/browse/FLINK-34325 > Project: Flink > Issue Type: Bug >Affects Versions: 1.17.1 > Environment: Flink on Kubernetes with HA, RocksDB with incremental > checkpoints on Azure >Reporter: Alexis Sarda-Espinosa >Priority: Major > Attachments: jobmanager_log.txt > > > I have a job that uses broadcast state to maintain a cache of required > metadata. I am currently evaluating memory requirements of my specific use > case, and I ran into a weird situation that seems worrisome. > All sources in my job are Kafka sources. I wrote a large amount of messages > in Kafka to force the broadcast state's cache to grow. At some point, this > caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job > Manager. I would have expected the whole java process of the JM to crash, but > the job was simply restarted. What's worrisome is that, after 2 checkpoint > failures ^1^, the job restarted and subsequently resumed from the latest > successful checkpoint and completely ignored all the events I wrote to Kafka, > which I can verify because I have a custom metric that exposes the > approximate size of this cache, and the fact that the job didn't crashloop at > this point after reading all the messages from Kafka over and over again. > I'm attaching an excerpt of the Job Manager's logs. 
My main concerns are: > # It seems the memory error from the JM didn't prevent the Kafka offsets from > being "rolled back", so eventually the Kafka events that should have ended in > the broadcast state's cache were ignored. > # -Is it normal that the state is somehow "materialized" in the JM and is > thus affected by the size of the JM's heap? Is this something particular due > to the use of broadcast state? I found this very surprising.- See comments > ^1^ Two failures are expected since > {{execution.checkpointing.tolerable-failed-checkpoints=1}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34325) Inconsistent state with data loss after OutOfMemoryError
[ https://issues.apache.org/jira/browse/FLINK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813238#comment-17813238 ] Alexis Sarda-Espinosa edited comment on FLINK-34325 at 2/1/24 1:49 PM: --- I will say that, while I can of course reproduce the OOM problems, I cannot reliably reproduce the state inconsistency, most of the time the job really ends up in a crashloop until I increase memory or clean up the state. was (Author: asardaes): I will say that, while I can of course reproduce the OOM problems, I cannot reliably reproduce the inconsistency, most of the time the job really ends up in a crashloop until I increase memory or clean up the state. > Inconsistent state with data loss after OutOfMemoryError > > > Key: FLINK-34325 > URL: https://issues.apache.org/jira/browse/FLINK-34325 > Project: Flink > Issue Type: Bug >Affects Versions: 1.17.1 > Environment: Flink on Kubernetes with HA, RocksDB with incremental > checkpoints on Azure >Reporter: Alexis Sarda-Espinosa >Priority: Major > Attachments: jobmanager_log.txt > > > I have a job that uses broadcast state to maintain a cache of required > metadata. I am currently evaluating memory requirements of my specific use > case, and I ran into a weird situation that seems worrisome. > All sources in my job are Kafka sources. I wrote a large amount of messages > in Kafka to force the broadcast state's cache to grow. At some point, this > caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job > Manager. I would have expected the whole java process of the JM to crash, but > the job was simply restarted. 
What's worrisome is that, after 2 checkpoint > failures ^1^, the job restarted and subsequently resumed from the latest > successful checkpoint and completely ignored all the events I wrote to Kafka, > which I can verify because I have a custom metric that exposes the > approximate size of this cache, and the fact that the job didn't crashloop at > this point after reading all the messages from Kafka over and over again. > I'm attaching an excerpt of the Job Manager's logs. My main concerns are: > # It seems the memory error from the JM didn't prevent the Kafka offsets from > being "rolled back", so eventually the Kafka events that should have ended in > the broadcast state's cache were ignored. > # -Is it normal that the state is somehow "materialized" in the JM and is > thus affected by the size of the JM's heap? Is this something particular due > to the use of broadcast state? I found this very surprising.- See comments > ^1^ Two failures are expected since > {{execution.checkpointing.tolerable-failed-checkpoints=1}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34325) Inconsistent state with data loss after OutOfMemoryError
[ https://issues.apache.org/jira/browse/FLINK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Sarda-Espinosa updated FLINK-34325: -- Description: I have a job that uses broadcast state to maintain a cache of required metadata. I am currently evaluating memory requirements of my specific use case, and I ran into a weird situation that seems worrisome. All sources in my job are Kafka sources. I wrote a large amount of messages in Kafka to force the broadcast state's cache to grow. At some point, this caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job Manager. I would have expected the whole java process of the JM to crash, but the job was simply restarted. What's worrisome is that, after 2 checkpoint failures ^1^, the job restarted and subsequently resumed from the latest successful checkpoint and completely ignored all the events I wrote to Kafka, which I can verify because I have a custom metric that exposes the approximate size of this cache, and the fact that the job didn't crashloop at this point after reading all the messages from Kafka over and over again. I'm attaching an excerpt of the Job Manager's logs. My main concerns are: # It seems the memory error from the JM didn't prevent the Kafka offsets from being "rolled back", so eventually the Kafka events that should have ended in the broadcast state's cache were ignored. # -Is it normal that the state is somehow "materialized" in the JM and is thus affected by the size of the JM's heap? Is this something particular due to the use of broadcast state? I found this very surprising.- See comments ^1^ Two failures are expected since {{execution.checkpointing.tolerable-failed-checkpoints=1}} was: I have a job that uses broadcast state to maintain a cache of required metadata. I am currently evaluating memory requirements of my specific use case, and I ran into a weird situation that seems worrisome. All sources in my job are Kafka sources. 
I wrote a large amount of messages in Kafka to force the broadcast state's cache to grow. At some point, this caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job Manager. I would have expected the whole java process of the JM to crash, but the job was simply restarted. What's worrisome is that, after 2 restarts, the job resumed from the latest successful checkpoint and completely ignored all the events I wrote to Kafka, which I can verify because I have a custom metric that exposes the approximate size of this cache, and the fact that the job didn't crashloop at this point after reading all the messages from Kafka over and over again. I'm attaching an excerpt of the Job Manager's logs. My main concerns are: # It seems the memory error from the JM didn't prevent the Kafka offsets from being "rolled back", so eventually the Kafka events that should have ended in the broadcast state's cache were ignored. # Is it normal that the state is somehow "materialized" in the JM and is thus affected by the size of the JM's heap? Is this something particular due to the use of broadcast state? I found this very surprising. > Inconsistent state with data loss after OutOfMemoryError > > > Key: FLINK-34325 > URL: https://issues.apache.org/jira/browse/FLINK-34325 > Project: Flink > Issue Type: Bug >Affects Versions: 1.17.1 > Environment: Flink on Kubernetes with HA, RocksDB with incremental > checkpoints on Azure >Reporter: Alexis Sarda-Espinosa >Priority: Major > Attachments: jobmanager_log.txt > > > I have a job that uses broadcast state to maintain a cache of required > metadata. I am currently evaluating memory requirements of my specific use > case, and I ran into a weird situation that seems worrisome. > All sources in my job are Kafka sources. I wrote a large amount of messages > in Kafka to force the broadcast state's cache to grow. At some point, this > caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job > Manager. 
I would have expected the whole java process of the JM to crash, but > the job was simply restarted. What's worrisome is that, after 2 checkpoint > failures ^1^, the job restarted and subsequently resumed from the latest > successful checkpoint and completely ignored all the events I wrote to Kafka, > which I can verify because I have a custom metric that exposes the > approximate size of this cache, and the fact that the job didn't crashloop at > this point after reading all the messages from Kafka over and over again. > I'm attaching an excerpt of the Job Manager's logs. My main concerns are: > # It seems the memory error from the JM didn't prevent the Kafka offsets from > being "rolled back", so eventually the Kafka events that should have ended in
[jira] [Commented] (FLINK-34328) Release Testing Instructions: Verify FLINK-34037 Improve Serialization Configuration And Usage In Flink
[ https://issues.apache.org/jira/browse/FLINK-34328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813234#comment-17813234 ] lincoln lee commented on FLINK-34328: - [~Zhanghao Chen] Assigned to you :) > Release Testing Instructions: Verify FLINK-34037 Improve Serialization > Configuration And Usage In Flink > --- > > Key: FLINK-34328 > URL: https://issues.apache.org/jira/browse/FLINK-34328 > Project: Flink > Issue Type: Sub-task > Components: API / Type Serialization System, Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: Zhanghao Chen >Assignee: Zhanghao Chen >Priority: Major > Fix For: 1.19.0 > >
[jira] [Assigned] (FLINK-34328) Release Testing Instructions: Verify FLINK-34037 Improve Serialization Configuration And Usage In Flink
[ https://issues.apache.org/jira/browse/FLINK-34328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lincoln lee reassigned FLINK-34328: --- Assignee: Zhanghao Chen > Release Testing Instructions: Verify FLINK-34037 Improve Serialization > Configuration And Usage In Flink > --- > > Key: FLINK-34328 > URL: https://issues.apache.org/jira/browse/FLINK-34328 > Project: Flink > Issue Type: Sub-task > Components: API / Type Serialization System, Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: Zhanghao Chen >Assignee: Zhanghao Chen >Priority: Major > Fix For: 1.19.0 > >
[jira] [Commented] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813215#comment-17813215 ] Matthias Pohl commented on FLINK-34333: --- I think it's fine to do the upgrade from v6.6.2 to v6.9.0 even for 1.18: * The only change that seems to affect Flink is in the Mock-framework and only affects test code * Most of the changes are cleanup changes, except for: ** Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Changed behavior in certain scenarios ** Fix [#5125|https://github.com/fabric8io/kubernetes-client/issues/5125]: Different default TLS version ** Fix [#1335|https://github.com/fabric8io/kubernetes-client/issues/1335]: Different fallback used for proxy configuration Generally, we could argue that downstream project should rely on transitive dependencies provided by Flink. And fixing the bug out-weighs the stability concerns here. Do you see a problem with this argument [~gyfora], [~wangyang0918]? The alternatives to doing the upgrade are: * Reverting the upgrade (i.e. going back from v6.6.2 to v5.12.4). This would allow us to get to the stable version that was tested with 1.17- * Provide a Flink-customized implementation of the fabric8io {{LeaderElector}} class with the cherry-picked changes of [8f8c438f|https://github.com/fabric8io/kubernetes-client/commit/8f8c438f] and [0f6c6965|https://github.com/fabric8io/kubernetes-client/commit/0f6c6965]. As a consequence, we would stick to fabric8io:kubernetes-client v.6.6.2 > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. 
This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. > This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007.
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
flinkbot commented on PR #24246: URL: https://github.com/apache/flink/pull/24246#issuecomment-1921304022 ## CI report: * 9eec91beb3f2ac2344454b9402ba2505201f0e49 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Commented] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813205#comment-17813205 ] Rui Fan commented on FLINK-34200: - I have attached the detailed log in this JIRA. > AutoRescalingITCase#testCheckpointRescalingInKeyedState fails > - > > Key: FLINK-34200 > URL: https://issues.apache.org/jira/browse/FLINK-34200 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.19.0 >Reporter: Matthias Pohl >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available, test-stability > Attachments: FLINK-34200.failure.log.gz, debug-34200.log > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56601=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=8200] > {code:java} > Jan 19 02:31:53 02:31:53.954 [ERROR] Tests run: 32, Failures: 1, Errors: 0, > Skipped: 0, Time elapsed: 1050 s <<< FAILURE! -- in > org.apache.flink.test.checkpointing.AutoRescalingITCase > Jan 19 02:31:53 02:31:53.954 [ERROR] > org.apache.flink.test.checkpointing.AutoRescalingITCase.testCheckpointRescalingInKeyedState[backend > = rocksdb, buffersPerChannel = 2] -- Time elapsed: 59.10 s <<< FAILURE! 
> Jan 19 02:31:53 java.lang.AssertionError: expected:<[(0,8000), (0,32000), > (0,48000), (0,72000), (1,78000), (1,3), (1,54000), (0,2000), (0,1), > (0,5), (0,66000), (0,74000), (0,82000), (1,8), (1,0), (1,16000), > (1,24000), (1,4), (1,56000), (1,64000), (0,12000), (0,28000), (0,52000), > (0,6), (0,68000), (0,76000), (1,18000), (1,26000), (1,34000), (1,42000), > (1,58000), (0,6000), (0,14000), (0,22000), (0,38000), (0,46000), (0,62000), > (0,7), (1,4000), (1,2), (1,36000), (1,44000)]> but was:<[(0,8000), > (0,32000), (0,48000), (0,72000), (1,78000), (1,3), (1,54000), (0,2000), > (0,1), (0,5), (0,66000), (0,74000), (0,82000), (1,8), (1,0), > (1,16000), (1,24000), (1,4), (1,56000), (1,64000), (0,12000), (0,28000), > (0,52000), (0,6), (0,68000), (0,76000), (0,1000), (0,25000), (0,33000), > (0,41000), (1,18000), (1,26000), (1,34000), (1,42000), (1,58000), (0,6000), > (0,14000), (0,22000), (0,38000), (0,46000), (0,62000), (0,7), (1,4000), > (1,2), (1,36000), (1,44000)]> > Jan 19 02:31:53 at org.junit.Assert.fail(Assert.java:89) > Jan 19 02:31:53 at org.junit.Assert.failNotEquals(Assert.java:835) > Jan 19 02:31:53 at org.junit.Assert.assertEquals(Assert.java:120) > Jan 19 02:31:53 at org.junit.Assert.assertEquals(Assert.java:146) > Jan 19 02:31:53 at > org.apache.flink.test.checkpointing.AutoRescalingITCase.testCheckpointRescalingKeyedState(AutoRescalingITCase.java:296) > Jan 19 02:31:53 at > org.apache.flink.test.checkpointing.AutoRescalingITCase.testCheckpointRescalingInKeyedState(AutoRescalingITCase.java:196) > Jan 19 02:31:53 at java.lang.reflect.Method.invoke(Method.java:498) > Jan 19 02:31:53 at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
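The review discussion on the PR for this issue centers on replacing `waitForOneMoreCheckpoint` (which counts completed checkpoints) with `waitForNewCheckpoint` (which waits for any checkpoint newer than the one observed on entry). A self-contained sketch of that semantic — the helper name and supplier-based interface are hypothetical; the real test utility queries the MiniCluster directly — looks like this:

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Sketch of the waitForNewCheckpoint semantic: record the latest completed
// checkpoint id on entry, then poll until a strictly newer one completes.
// Whether one or ten checkpoints complete in between does not matter.
public class CheckpointWait {

    public static long waitForNewCheckpoint(LongSupplier latestCheckpointId,
                                            long pollIntervalMillis,
                                            long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long initial = latestCheckpointId.getAsLong();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long current = latestCheckpointId.getAsLong();
            if (current > initial) {
                return current; // completed after the wait began
            }
            Thread.sleep(pollIntervalMillis);
        }
        throw new TimeoutException("no new checkpoint within " + timeoutMillis + " ms");
    }

    public static void main(String[] args) throws Exception {
        AtomicLong id = new AtomicLong(5);
        // Simulate a checkpoint completing shortly after the wait begins.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            id.set(6);
        }).start();
        long seen = waitForNewCheckpoint(id::get, 10, 5_000);
        System.out.println("observed new checkpoint " + seen);
    }
}
```

This captures why the reviewer argues the new helper does not break the old semantics: a poll interval of 100 ms could observe many completed checkpoints as a single "one more", so waiting for any strictly newer checkpoint is both simpler and no weaker.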
[jira] [Updated] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34200: --- Labels: pull-request-available test-stability (was: test-stability)
[jira] [Assigned] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan reassigned FLINK-34200: --- Assignee: Rui Fan
Re: [PR] Revert "[FLINK-33705] Upgrade to flink-shaded 18.0" [flink]
snuyanzin merged PR #24227: URL: https://github.com/apache/flink/pull/24227 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Revert "[FLINK-33705] Upgrade to flink-shaded 18.0" [flink]
snuyanzin commented on PR #24227: URL: https://github.com/apache/flink/pull/24227#issuecomment-1921276645 thanks for taking a look
Re: [PR] [FLINK-34333][k8s] 1.18 backport of FLINK-34007 [flink]
flinkbot commented on PR #24245: URL: https://github.com/apache/flink/pull/24245#issuecomment-1921277811 ## CI report: * 087ef9d4311dcacaa6d7e7caf15d9abce0e8fb95 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Comment Edited] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813201#comment-17813201 ] Rui Fan edited comment on FLINK-34200 at 2/1/24 12:59 PM: -- After re-running this test hundreds of times and adding a series of log statements to analyze it, I found the root cause. In brief, it's a test issue: the counter and sum of SubtaskIndexFlatMapper in the checkpoint taken before the rescale are incomplete. Here is a test commit where I added some logs to analyze it: * [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe] I have added the detailed reasoning in this comment: [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe#r138161972] was (Author: fanrui): After re-running this test hundreds of times and adding a series of System.out statements to analyze the log, I found the root cause. In brief, it's a test issue: the counter and sum of SubtaskIndexFlatMapper in the checkpoint taken before the rescale are incomplete. Here is a test commit where I added some logs to analyze it: * [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe] I have added the detailed reasoning in this comment: https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe#r138161972
[jira] [Commented] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813201#comment-17813201 ] Rui Fan commented on FLINK-34200: - After re-running this test hundreds of times and adding a series of System.out statements to analyze the log, I found the root cause. In brief, it's a test issue: the counter and sum of SubtaskIndexFlatMapper in the checkpoint taken before the rescale are incomplete. Here is a test commit where I added some logs to analyze it: * [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe] I have added the detailed reasoning in this comment: https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe#r138161972
[jira] [Comment Edited] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813199#comment-17813199 ] Matthias Pohl edited comment on FLINK-34333 at 2/1/24 12:55 PM: Breaking changes between v6.6.2 and v6.9.2 based on the {{fabric8io:kubernetes-client}} release notes. * Icons: ** (x) Change on the Flink side required. This doesn't come with a change of behavior for the user (because it only affects test code) ** (!) Not used by Flink but might cause issues with projects which rely on the Flink dependency ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Deprecation work that doesn't remove API, yet. * (x) [v6.9.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.9.0] ** (x) Fix [#5368|https://github.com/fabric8io/kubernetes-client/issues/5368]: ListOptions parameter ordering is now alphabetical. If you are using non-crud mocking for lists with options, you may need to update your parameter order. *** This issue causes a change in Flink's {{KubernetesClientTestBase}} because the order in which the GET parameters are processed to match the mocked requests changed. ** (!) Fix [#5343|https://github.com/fabric8io/kubernetes-client/issues/5343]: Removed io.fabric8.kubernetes.model.annotation.PrinterColumn, use io.fabric8.crd.generator.annotation.PrinterColumn ** (!) Fix [#5391|https://github.com/fabric8io/kubernetes-client/issues/5391]: Removed the vertx-uri-template dependency from the vertx client, if you need that for your application, then introduce your own dependency. ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#5220|https://github.com/fabric8io/kubernetes-client/issues/5220]: KubernetesResourceUtil.isValidLabelOrAnnotation has been deprecated because the rules for labels and annotations are different * (!) 
[v6.8.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#2718|https://github.com/fabric8io/kubernetes-client/issues/2718]: KubernetesResourceUtil.isResourceReady was deprecated. Use client.resource(item).isReady() or Readiness.getInstance().isReady(item) instead. ** (!) Fix [#5171|https://github.com/fabric8io/kubernetes-client/issues/5171]: Removed Camel-K extension, use org.apache.camel.k:camel-k-crds instead. ** (!) Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Built-in resources were inconsistent with respect to their serialization of empty collections. In many circumstances this was confusing behavior. In order to be consistent all built-in resources will omit empty collections by default. This is a breaking change if you are relying on an empty collection in a json merge or a strategic merge where the list has a patchStrategy of atomic. In these circumstances the empty collection will no longer be serialized. You may instead use a json patch, server side apply, or modify the serialized form of the patch. ** (!) Fix [#5279|https://github.com/fabric8io/kubernetes-client/issues/5279]: (java-generator) Add native support for date-time fields, they are now mapped to native java.time.ZonedDateTime ** (!) Fix [#5315|https://github.com/fabric8io/kubernetes-client/issues/5315]: kubernetes-junit-jupiter no longer registers the NamespaceExtension and KubernetesExtension extensions to be used in combination with the junit-platform.properties > junit.jupiter.extensions.autodetection.enabled=true configuration. If you wish to use these extensions and autodetect them, change your dependency to kubernetes-junit-jupiter-autodetect. ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Deprecating io.fabric8.kubernetes.model.annotation.PrinterColumn in favor of: io.fabric8.crd.generator.annotation.PrinterColumn ** (!)
Resource classes in resource.k8s.io/v1alpha1 have been moved to resource.k8s.io/v1alpha2 apiGroup in Kubernetes 1.27. Users are required to change package of the following classes: *** io.fabric8.kubernetes.api.model.resource.v1alpha1.PodSchedulingContext -> - io.fabric8.kubernetes.api.model.resource.v1alpha2.PodSchedulingContext *** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaim -> - io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaim *** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaimTemplate -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaimTemplate *** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClass -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClass * (!) [v6.7.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.7.0] ** (!) Fix [#4911|https://github.com/fabric8io/kubernetes-client/issues/4911]: *** Config/RequestConfig.scaleTimeout has been deprecated along with
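The #5368 change that forced the {{KubernetesClientTestBase}} update can be sketched with a small, self-contained example (the class and method names below are illustrative, not the actual fabric8 code): since kubernetes-client v6.9.0, list-option query parameters are emitted in alphabetical order, so mocked request paths must expect e.g. fieldSelector before labelSelector regardless of insertion order.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class AlphabeticalListOptions {
    // Builds a query string with parameters in alphabetical key order,
    // mimicking the v6.9.0 behavior that broke insertion-order mock URLs.
    static String toQueryString(Map<String, String> options) {
        // TreeMap iterates its keys in natural (alphabetical) order.
        return new TreeMap<>(options).entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> opts = Map.of(
                "labelSelector", "app=flink",
                "fieldSelector", "status.phase=Running");
        // fieldSelector sorts before labelSelector, whatever order they were added in.
        System.out.println(toQueryString(opts));
    }
}
```

Non-crud mock expectations that hard-coded the old parameter order therefore had to be rewritten to the alphabetical form.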
[jira] [Updated] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34333: --- Labels: pull-request-available (was: ) > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. > This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [FLINK-34333][k8s] 1.18 backport of FLINK-34007 [flink]
XComp opened a new pull request, #24245: URL: https://github.com/apache/flink/pull/24245 ## What is the purpose of the change This is the fix of FLINK-34007 applied to Flink 1.18. In the end, it's a straight-forward backport including the upgrade to fabric8io:kubernetes-client v6.9.0. See FLINK-34333 for further details on the discussion whether the dependency upgrade is suitable. ## Brief change log * Makes `KubernetesLeaderElector` re-instantiate the fabric8io LeaderElector at the beginning of a new leadership lifecycle. * Upgrades `fabric8io:kubernetes-client` from `v6.6.2` to `v6.9.0` in order to include https://github.com/fabric8io/kubernetes-client/issues/5463 ## Verifying this change * Extends `KubernetesLeaderElectorITCase` to include a test case for the scenario that was revealed in FLINK-34007 and a check that verifies the leadership lost event being sent when stopping the `KubernetesLeaderElector` ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
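The core idea of the backport, re-instantiating the fabric8io LeaderElector at the beginning of each leadership lifecycle, can be sketched as follows (names are hypothetical, not the actual {{KubernetesLeaderElector}} API): the fabric8 elector is modeled as single-use, so a fresh delegate is created per lifecycle instead of caching one instance.

```java
class PerLifecycleElector {
    // Models a fabric8 LeaderElector that must not be run more than once.
    static final class Delegate {
        private boolean started = false;

        void run() {
            if (started) {
                throw new IllegalStateException("LeaderElector reused");
            }
            started = true;
        }
    }

    private Delegate current;

    // Called whenever a new leadership lifecycle begins: a fresh delegate is
    // created each time, so the single-use constraint is never violated.
    void startLeadershipLifecycle() {
        current = new Delegate();
        current.run();
    }

    public static void main(String[] args) {
        PerLifecycleElector elector = new PerLifecycleElector();
        elector.startLeadershipLifecycle();
        elector.startLeadershipLifecycle(); // second lifecycle after lost leadership: no reuse
        System.out.println("ok");
    }
}
```

Caching and re-running a single delegate would throw in the second lifecycle; re-instantiation avoids the reuse problem the PR describes.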
[jira] [Updated] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-34200: Attachment: debug-34200.log
[jira] [Comment Edited] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813199#comment-17813199 ] Matthias Pohl edited comment on FLINK-34333 at 2/1/24 12:48 PM:
[jira] [Commented] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813199#comment-17813199 ] Matthias Pohl commented on FLINK-34333:
---
Breaking changes between v6.6.2 and v6.9.2:
* (x) [v6.9.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.9.0]
** (x) Fix [#5368|https://github.com/fabric8io/kubernetes-client/issues/5368]: ListOptions parameter ordering is now alphabetical. If you are using non-crud mocking for lists with options, you may need to update your parameter order.
*** This issue causes a change in Flink's {{KubernetesClientTestBase}} because the order in which the GET parameters are processed to match the mocked requests changed.
** (!) Fix [#5343|https://github.com/fabric8io/kubernetes-client/issues/5343]: Removed io.fabric8.kubernetes.model.annotation.PrinterColumn, use io.fabric8.crd.generator.annotation.PrinterColumn
** (!) Fix [#5391|https://github.com/fabric8io/kubernetes-client/issues/5391]: Removed the vertx-uri-template dependency from the vertx client; if you need it for your application, introduce your own dependency.
** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#5220|https://github.com/fabric8io/kubernetes-client/issues/5220]: KubernetesResourceUtil.isValidLabelOrAnnotation has been deprecated because the rules for labels and annotations are different
* (!) [v6.8.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0]
** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#2718|https://github.com/fabric8io/kubernetes-client/issues/2718]: KubernetesResourceUtil.isResourceReady was deprecated. Use client.resource(item).isReady() or Readiness.getInstance().isReady(item) instead.
** (!) Fix [#5171|https://github.com/fabric8io/kubernetes-client/issues/5171]: Removed Camel-K extension, use org.apache.camel.k:camel-k-crds instead.
** (!) Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Built-in resources were inconsistent with respect to their serialization of empty collections. In many circumstances this was confusing behavior. In order to be consistent, all built-in resources will omit empty collections by default. This is a breaking change if you are relying on an empty collection in a json merge or a strategic merge where the list has a patchStrategy of atomic. In these circumstances the empty collection will no longer be serialized. You may instead use a json patch, server side apply, or modify the serialized form of the patch.
** (!) Fix [#5279|https://github.com/fabric8io/kubernetes-client/issues/5279]: (java-generator) Add native support for date-time fields, they are now mapped to native java.time.ZonedDateTime
** (!) Fix [#5315|https://github.com/fabric8io/kubernetes-client/issues/5315]: kubernetes-junit-jupiter no longer registers the NamespaceExtension and KubernetesExtension extensions to be used in combination with the junit-platform.properties {{junit.jupiter.extensions.autodetection.enabled=true}} configuration. If you wish to use these extensions and autodetect them, change your dependency to kubernetes-junit-jupiter-autodetect.
** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Deprecating io.fabric8.kubernetes.model.annotation.PrinterColumn in favor of: io.fabric8.crd.generator.annotation.PrinterColumn
** (!) Resource classes in resource.k8s.io/v1alpha1 have been moved to the resource.k8s.io/v1alpha2 apiGroup in Kubernetes 1.27. Users are required to change the package of the following classes:
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.PodSchedulingContext -> io.fabric8.kubernetes.api.model.resource.v1alpha2.PodSchedulingContext
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaim -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaim
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaimTemplate -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaimTemplate
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClass -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClass
* (!) [v6.7.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.7.0]
** (!) Fix [#4911|https://github.com/fabric8io/kubernetes-client/issues/4911]:
*** Config/RequestConfig.scaleTimeout has been deprecated along with Scalable.scale(count, wait) and DeployableScalableResource.deployLatest(wait). withTimeout may be called before the operation to control the timeout.
*** Config/RequestConfig.websocketTimeout has been removed. Config/RequestConfig.requestTimeout will be used for websocket connection timeouts.
*** HttpClient api/building changes - writeTimeout has been removed, readTimeout has moved to the HttpRequest
** (!) Fix [#4662|https://github.com/fabric8io/kubernetes-client/issues/4662]: ***
[jira] [Created] (FLINK-34334) Add sub-task level RocksDB file count metric
Jufang He created FLINK-34334:
-
Summary: Add sub-task level RocksDB file count metric
Key: FLINK-34334
URL: https://issues.apache.org/jira/browse/FLINK-34334
Project: Flink
Issue Type: Improvement
Components: Runtime / State Backends
Affects Versions: 1.18.0
Reporter: Jufang He
Attachments: img_v3_027i_7ed0b8ba-3f12-48eb-aab3-cc368ac47cdg.jpg

In our production environment, we encountered task deployment failures. The root cause was that too many SST files in a single sub-task inflated the task deployment information (OperatorSubtaskState), which then caused an Akka request timeout during the task deploy phase. Therefore, I want to add a sub-task level RocksDB file count metric, making it possible to detect performance problems caused by too many SST files in time. RocksDB already provides the JNI API ([getColumnFamilyMetaData|https://javadoc.io/doc/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/RocksDB.html#getColumnFamilyMetaData()]), so we can easily retrieve the file count and report it via the metrics reporter.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
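The proposed metric boils down to a gauge that sums SST file counts across levels. A minimal sketch of that idea is below; all class and method names are illustrative assumptions, and in the real state backend the supplier would wrap RocksDB's JNI call `db.getColumnFamilyMetaData()` and sum `files().size()` over `levels()` for each registered column family.

```java
import java.util.List;
import java.util.function.Supplier;

/**
 * Hedged sketch of a sub-task level SST file count gauge (names illustrative;
 * not the actual Flink/RocksDB integration).
 */
public class SstFileCountGauge {

    // Per-level SST file counts of one column family, supplied lazily so the
    // metric reflects the current state whenever the reporter polls it.
    private final Supplier<List<Integer>> filesPerLevel;

    public SstFileCountGauge(Supplier<List<Integer>> filesPerLevel) {
        this.filesPerLevel = filesPerLevel;
    }

    /** Total number of SST files across all levels. */
    public int getValue() {
        return filesPerLevel.get().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // e.g. 4 files in L0, 10 in L1, 32 in L2
        SstFileCountGauge gauge = new SstFileCountGauge(() -> List.of(4, 10, 32));
        System.out.println(gauge.getValue()); // prints 46
    }
}
```

A `Supplier` keeps the gauge cheap to register: nothing is queried until the metrics reporter actually reads the value.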
Re: [PR] [FLINK-33960][Scheduler] Fix the bug that Adaptive Scheduler doesn't respect the lowerBound when one flink job has more than 1 tasks [flink]
zentol commented on code in PR #24012: URL: https://github.com/apache/flink/pull/24012#discussion_r1474310073

## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/allocator/SlotSharingSlotAllocator.java:
## @@ -301,27 +301,24 @@ public Collection getContainedExecutionVertices() {
     private static class SlotSharingGroupMetaInfo {
-        private final int minLowerBound;
         private final int maxUpperBound;
-        private final int maxLowerUpperBoundRange;
+        private final int minResourceRequirement;

-        private SlotSharingGroupMetaInfo(
-                int minLowerBound, int maxUpperBound, int maxLowerUpperBoundRange) {
-            this.minLowerBound = minLowerBound;
+        private SlotSharingGroupMetaInfo(int maxUpperBound, int minResourceRequirement) {
             this.maxUpperBound = maxUpperBound;
-            this.maxLowerUpperBoundRange = maxLowerUpperBoundRange;
-        }
-
-        public int getMinLowerBound() {
-            return minLowerBound;
+            this.minResourceRequirement = minResourceRequirement;
         }

        public int getMaxUpperBound() {
            return maxUpperBound;
        }

-        public int getMaxLowerUpperBoundRange() {
-            return maxLowerUpperBoundRange;
+        public int getMaxUpperMinRequirementBoundRange() {
+            return maxUpperBound - minResourceRequirement;
+        }
+
+        public int getMinResourceRequirement() {

Review Comment: I'd prefer if we'd name this `getMaxLowerBound` because that's exactly what it is. In this area of the code it is always good to use consistent terminology and be as explicit as possible.
## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/allocator/SlotSharingSlotAllocator.java:
## @@ -301,27 +301,24 @@ public Collection getContainedExecutionVertices() {
     private static class SlotSharingGroupMetaInfo {
-        private final int minLowerBound;
         private final int maxUpperBound;
-        private final int maxLowerUpperBoundRange;
+        private final int minResourceRequirement;

-        private SlotSharingGroupMetaInfo(
-                int minLowerBound, int maxUpperBound, int maxLowerUpperBoundRange) {
-            this.minLowerBound = minLowerBound;
+        private SlotSharingGroupMetaInfo(int maxUpperBound, int minResourceRequirement) {

Review Comment: reorder so min is first argument
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
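Taken together, the two review comments describe a specific end state for the class. A hypothetical sketch of what it might look like after applying both suggestions (this is not the merged Flink implementation, just the reviewer's requested shape):

```java
/**
 * Hypothetical shape of SlotSharingGroupMetaInfo after both review comments:
 * the "min" value is the first constructor argument, and its accessor uses the
 * bound terminology (getMaxLowerBound) instead of getMinResourceRequirement.
 */
public class SlotSharingGroupMetaInfoSketch {

    // The largest lower bound across all vertices in the group; this is what
    // the "minimum resource requirement" of the group amounts to.
    private final int maxLowerBound;
    private final int maxUpperBound;

    public SlotSharingGroupMetaInfoSketch(int maxLowerBound, int maxUpperBound) {
        this.maxLowerBound = maxLowerBound;
        this.maxUpperBound = maxUpperBound;
    }

    public int getMaxLowerBound() {
        return maxLowerBound;
    }

    public int getMaxUpperBound() {
        return maxUpperBound;
    }

    /** Slack between what the group can use at most and what it minimally needs. */
    public int getMaxUpperMinRequirementBoundRange() {
        return maxUpperBound - maxLowerBound;
    }

    public static void main(String[] args) {
        SlotSharingGroupMetaInfoSketch info = new SlotSharingGroupMetaInfoSketch(1, 4);
        System.out.println(info.getMaxUpperMinRequirementBoundRange()); // prints 3
    }
}
```

Keeping "lower bound"/"upper bound" as the only terminology avoids having two names (requirement vs. bound) for the same quantity in the allocator.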
Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
hackergin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1474310557

## flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/calcite/FlinkSqlCallBinding.java:
## @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.planner.calcite;
+
+import org.apache.flink.table.planner.functions.sql.SqlDefaultOperator;
+
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.sql.SqlCall;
+import org.apache.calcite.sql.SqlCallBinding;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.sql.SqlNode;
+import org.apache.calcite.sql.fun.SqlStdOperatorTable;
+import org.apache.calcite.sql.parser.SqlParserPos;
+import org.apache.calcite.sql.type.SqlOperandMetadata;
+import org.apache.calcite.sql.type.SqlOperandTypeChecker;
+import org.apache.calcite.sql.type.SqlTypeUtil;
+import org.apache.calcite.sql.validate.SqlValidator;
+import org.apache.calcite.sql.validate.SqlValidatorNamespace;
+import org.apache.calcite.sql.validate.SqlValidatorScope;
+import org.checkerframework.checker.nullness.qual.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/** Binding supports to rewrite the DEFAULT operator. */
+public class FlinkSqlCallBinding extends SqlCallBinding {
+
+    private final List<RelDataType> fixArgumentTypes;
+
+    private final List<SqlNode> rewrittenOperands;
+
+    public FlinkSqlCallBinding(
+            SqlValidator validator, @Nullable SqlValidatorScope scope, SqlCall call) {
+        super(validator, scope, call);
+        this.fixArgumentTypes = getFixArgumentTypes();
+        this.rewrittenOperands = getRewrittenOperands();
+    }
+
+    @Override
+    public int getOperandCount() {
+        return rewrittenOperands.size();
+    }
+
+    @Override
+    public List<SqlNode> operands() {
+        if (isFixedParameters()) {
+            return rewrittenOperands;
+        } else {
+            return super.operands();
+        }
+    }
+
+    @Override
+    public RelDataType getOperandType(int ordinal) {
+        if (!isFixedParameters()) {
+            return super.getOperandType(ordinal);
+        }
+
+        SqlNode operand = rewrittenOperands.get(ordinal);
+        if (operand.getKind() == SqlKind.DEFAULT) {
+            return fixArgumentTypes.get(ordinal);
+        }
+
+        final RelDataType type = SqlTypeUtil.deriveType(this, operand);

Review Comment: @fsk119 Here I made some modifications.
Previously, the logic was: if fixed argument types are present, return the declared argument type. That was actually a problem: we should only return the declared argument type when the operand is a DEFAULT node; for any other node we should still derive the actual operand's type.
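The corrected decision can be modeled in isolation. In the toy sketch below, all Calcite types are replaced by plain stubs, so it only mirrors the control flow of `getOperandType`, not the real API:

```java
import java.util.List;

/**
 * Toy model of the corrected FlinkSqlCallBinding#getOperandType decision;
 * Calcite types are replaced by stubs, so only the branching is faithful.
 */
public class OperandTypeResolution {

    public enum Kind { DEFAULT, LITERAL }

    public static String operandType(
            boolean fixedParameters, List<Kind> operands, List<String> declaredTypes, int ordinal) {
        if (!fixedParameters) {
            return "derived:super"; // delegate to the superclass as before
        }
        if (operands.get(ordinal) == Kind.DEFAULT) {
            return declaredTypes.get(ordinal); // only DEFAULT nodes fall back to the declared type
        }
        return "derived:operand"; // any real operand still derives its own actual type
    }

    public static void main(String[] args) {
        List<Kind> ops = List.of(Kind.LITERAL, Kind.DEFAULT);
        List<String> declared = List.of("INT", "STRING");
        System.out.println(operandType(true, ops, declared, 0)); // prints derived:operand
        System.out.println(operandType(true, ops, declared, 1)); // prints STRING
    }
}
```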
Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
hackergin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1474303521

## flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/calcite/FlinkSqlCallBinding.java:
## @@ -77,20 +76,15 @@ public List operands() {
     @Override
     public RelDataType getOperandType(int ordinal) {
-        return isNamedArgument() && !argumentTypes.isEmpty()
+        return isNamedArgument()
                 ? ((SqlOperandMetadata) getCall().getOperator().getOperandTypeChecker())
                         .paramTypes(typeFactory)
                         .get(ordinal)
                 : super.getOperandType(ordinal);
     }

     public boolean isNamedArgument() {
-        for (SqlNode operand : getCall().getOperandList()) {
-            if (operand != null && operand.getKind() == SqlKind.ARGUMENT_ASSIGNMENT) {
-                return !getArgumentTypes().isEmpty();
-            }
-        }
-        return false;
+        return !argumentTypes.isEmpty();
     }

Review Comment: Updated.
Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
hackergin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1474302721

## flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/inference/OperatorBindingCallContext.java:
## @@ -59,7 +59,7 @@ public OperatorBindingCallContext(
                 sqlOperatorBinding.getOperator().getNameAsId().toString(),
                 sqlOperatorBinding.getGroupCount() > 0);

-        this.binding = new OperatorBindingDecorator(sqlOperatorBinding);
+        this.binding = sqlOperatorBinding;

Review Comment: Updated both `OperatorBindingCallContext` and `CallBindingCallContext`.
[jira] [Updated] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34007: -- Release Note: Fixes a bug where the leader election wasn't able to pick up leadership again after renewing the lease token caused a leadership loss. This required fabric8io:kubernetes-client to be upgraded from v6.6.2 to v6.9.0 > Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.19.0, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34007: -- Release Note: Fixes a bug where the leader election wasn't able to pick up leadership again after renewing the lease token caused a leadership loss. This required fabric8io:kubernetes-client to be upgraded from v6.6.2 to v6.9.0. (was: Fixes a bug where the leader election wasn't able to pick up leadership again after renewing the lease token caused a leadership loss. This required fabric8io:kubernetes-client to be upgraded from v6.6.2 to v6.9.0) > Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.19.0, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34229) Duplicate entry in InnerClasses attribute in class file FusionStreamOperator
[ https://issues.apache.org/jira/browse/FLINK-34229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813169#comment-17813169 ] Dan Zou commented on FLINK-34229:
-
[~xiasun] Could you please check whether this [PR|https://github.com/apache/flink/pull/24228] solves your problem?

> Duplicate entry in InnerClasses attribute in class file FusionStreamOperator
>
> Key: FLINK-34229
> URL: https://issues.apache.org/jira/browse/FLINK-34229
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / Runtime
> Affects Versions: 1.19.0
> Reporter: xingbe
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-01-24-17-05-47-883.png, taskmanager_log.txt
>
> I noticed a runtime error in the 10TB TPC-DS (q35.sql) benchmarks in 1.19; the problem did not happen in 1.18.0. This issue may have been newly introduced recently. !image-2024-01-24-17-05-47-883.png|width=589,height=279!
Re: [PR] [FLINK-34324][test] Makes all s3 related operations being declared and called in a single location [flink]
zentol commented on code in PR #24244: URL: https://github.com/apache/flink/pull/24244#discussion_r1474268805

## flink-end-to-end-tests/test-scripts/test_file_sink.sh:
## @@ -96,13 +51,54 @@ function get_complete_result {
 # line number in part files
 ###
 function get_total_number_of_valid_lines {
-  if [ "${OUT_TYPE}" == "local" ]; then
-    get_complete_result | wc -l | tr -d '[:space:]'
-  elif [ "${OUT_TYPE}" == "s3" ]; then
-    s3_get_number_of_lines_by_prefix "${S3_PREFIX}" "part-"
-  fi
+  get_complete_result | wc -l | tr -d '[:space:]'
 }

+if [ "${OUT_TYPE}" == "s3" ]; then
+  source "$(dirname "$0")"/common_s3.sh
+  s3_setup hadoop
+
+  JOB_OUTPUT_PATH="s3://$IT_CASE_S3_BUCKET/$S3_PREFIX"
+  set_config_key "state.checkpoints.dir" "s3://$IT_CASE_S3_BUCKET/$S3_PREFIX-chk"
+  mkdir -p "$OUTPUT_PATH-chk"
+
+  # overwrites implementation for local runs
+  function get_complete_result {
+    s3_get_by_full_path_and_filename_prefix "$OUTPUT_PATH" "$S3_PREFIX" "part-" true
+  }

Review Comment: I wonder why we are reading from OUTPUT_PATH and not JOB_OUTPUT_PATH 🤔
Re: [PR] [FLINK-34260][Connectors/AWS] Update flink-connector-aws to be compatible with updated SinkV2 interfaces [flink-connector-aws]
z3d1k commented on code in PR #127: URL: https://github.com/apache/flink-connector-aws/pull/127#discussion_r1474264702

## flink-connector-aws/flink-connector-kinesis/src/test/java/org/apache/flink/streaming/connectors/kinesis/FlinkKinesisConsumerTest.java:
## @@ -284,7 +284,7 @@ public void testListStateChangedAfterSnapshotState() throws Exception {
                 new FlinkKinesisConsumer<>("fakeStream", new SimpleStringSchema(), config);
         FlinkKinesisConsumer mockedConsumer = spy(consumer);
-        RuntimeContext context = new MockStreamingRuntimeContext(true, 1, 1);
+        RuntimeContext context = new MockStreamingRuntimeContext(true, 1, 0);

Review Comment: Sure, in 1.19 this test is failing with:
```
java.lang.IllegalArgumentException: Task index must be less than parallelism.
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
    at org.apache.flink.api.common.TaskInfoImpl.<init>(TaskInfoImpl.java:68)
    at org.apache.flink.api.common.TaskInfoImpl.<init>(TaskInfoImpl.java:44)
```
Failed check: https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/TaskInfoImpl.java#L68-L70
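The failing precondition is easy to reproduce in isolation: subtask indices are zero-based, so a valid index must be strictly less than the parallelism. The sketch below is a hedged model of that check, not Flink's actual `TaskInfoImpl`:

```java
/** Minimal model of the zero-based subtask-index check (illustrative only). */
public class TaskIndexCheck {

    public static void checkTaskIndex(int indexOfSubtask, int parallelism) {
        // Mirrors the kind of argument check done via Preconditions.checkArgument:
        // with parallelism p, the valid indices are 0 .. p-1.
        if (indexOfSubtask < 0 || indexOfSubtask >= parallelism) {
            throw new IllegalArgumentException("Task index must be less than parallelism.");
        }
    }

    public static void main(String[] args) {
        checkTaskIndex(0, 1); // ok: the fixed test arguments (parallelism 1, index 0)
        try {
            checkTaskIndex(1, 1); // the old test arguments: index == parallelism
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Task index must be less than parallelism.
        }
    }
}
```

This is why changing the mock context from `(true, 1, 1)` to `(true, 1, 0)` fixes the test: with parallelism 1, the only valid subtask index is 0.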
Re: [PR] [FLINK-34324][test] Makes all s3 related operations being declared and called in a single location [flink]
zentol commented on code in PR #24244: URL: https://github.com/apache/flink/pull/24244#discussion_r1474261856

## flink-end-to-end-tests/test-scripts/test_file_sink.sh:
## @@ -96,13 +51,54 @@ function get_complete_result {
 # line number in part files
 ###
 function get_total_number_of_valid_lines {
-  if [ "${OUT_TYPE}" == "local" ]; then
-    get_complete_result | wc -l | tr -d '[:space:]'
-  elif [ "${OUT_TYPE}" == "s3" ]; then
-    s3_get_number_of_lines_by_prefix "${S3_PREFIX}" "part-"
-  fi
+  get_complete_result | wc -l | tr -d '[:space:]'
 }

+if [ "${OUT_TYPE}" == "s3" ]; then
+  source "$(dirname "$0")"/common_s3.sh
+  s3_setup hadoop
+
+  JOB_OUTPUT_PATH="s3://$IT_CASE_S3_BUCKET/$S3_PREFIX"
+  set_config_key "state.checkpoints.dir" "s3://$IT_CASE_S3_BUCKET/$S3_PREFIX-chk"
+  mkdir -p "$OUTPUT_PATH-chk"
+
+  # overwrites implementation for local runs
+  function get_complete_result {
+    s3_get_by_full_path_and_filename_prefix "$OUTPUT_PATH" "$S3_PREFIX" "part-" true
+  }

Review Comment: Or is that a no-op because in s3 mode we never write anything there...
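The refactoring under discussion relies on a bash property worth making explicit: re-declaring a function after it has been defined (or sourced) silently replaces the earlier definition, which is how the s3 branch swaps in its own `get_complete_result`. A standalone sketch of the pattern (the function body and `OUT_TYPE` value are made up for illustration):

```shell
#!/usr/bin/env bash
# Default implementation, as the local-mode script defines it.
function get_complete_result {
  echo "local-result"
}

OUT_TYPE="s3"
if [ "${OUT_TYPE}" == "s3" ]; then
  # Re-declaring the function overrides the earlier definition,
  # the same way test_file_sink.sh overrides the local implementation.
  function get_complete_result {
    echo "s3-result"
  }
fi

get_complete_result  # prints "s3-result"
```

Because the override happens at source time, every later caller (such as `get_total_number_of_valid_lines`) transparently picks up the s3 variant without any `if` branching of its own.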