Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
snuyanzin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1475692973 ## flink-table/flink-table-planner/src/main/java/org/apache/calcite/sql/validate/SqlValidatorImpl.java: ## @@ -158,16 +160,17 @@ * Default implementation of {@link SqlValidator}, the class was copied over because of * CALCITE-4554. * - * Lines 1958 ~ 1978, Flink improves error message for functions without appropriate arguments in + * Lines 1961 ~ 1981, Flink improves error message for functions without appropriate arguments in * handleUnresolvedFunction at {@link SqlValidatorImpl#handleUnresolvedFunction}. * - * Lines 3736 ~ 3740, Flink improves Optimize the retrieval of sub-operands in SqlCall when using - * NamedParameters at {@link SqlValidatorImpl#checkRollUp}. + * Lines 3739 ~ 3743, 6333 ~ 6339, Flink improves validating the SqlCall that uses named + * parameters, rearrange the order of sub-operands, and fill in missing operands with the default Review Comment: can you please create a corresponding issue for Calcite to make it supported there as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
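The validator change described above — validating a SqlCall that uses named parameters, rearranging its sub-operands into declared order, and filling missing operands with defaults — can be sketched roughly as follows. This is a simplified, hypothetical illustration (the class and method names are made up), not the actual SqlValidatorImpl code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: given a function's declared parameter order and the
// named arguments supplied in the call, build the positional operand list,
// substituting a DEFAULT placeholder for parameters that were not supplied.
public class NamedParamSketch {
    static List<String> reorderOperands(List<String> declaredParams,
                                        Map<String, String> namedArgs) {
        List<String> operands = new ArrayList<>();
        for (String param : declaredParams) {
            // Supplied arguments are placed at their declared position;
            // missing ones fall back to the default placeholder.
            operands.add(namedArgs.getOrDefault(param, "DEFAULT"));
        }
        return operands;
    }

    public static void main(String[] args) {
        // Named args given out of declared order, OFFSET omitted.
        Map<String, String> named = new HashMap<>();
        named.put("SIZE", "INTERVAL '15' MINUTE");
        named.put("TIMECOL", "DESCRIPTOR(rowtime)");
        named.put("DATA", "TABLE MyTable");
        System.out.println(reorderOperands(
                Arrays.asList("DATA", "TIMECOL", "SIZE", "OFFSET"), named));
        // prints [TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE, DEFAULT]
    }
}
```

The point of the sketch is only the two-step idea under discussion: position each named argument by the declared parameter list, then fill the gaps with defaults.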
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
1996fanrui commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1475680928 ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -328,7 +330,7 @@ public void testCheckpointRescalingNonPartitionedStateCausesException() throws E // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: IIUC, we can call the new `waitForNewCheckpoint` here. Semantically, the old `waitForOneMoreCheckpoint` waits for one or more checkpoints: it checks for a checkpoint every 100 ms by default, so if the job generates 10 checkpoints within 100 ms, `waitForOneMoreCheckpoint` will wait for all 10. So I don't think `waitForNewCheckpoint` breaks the semantics of `waitForOneMoreCheckpoint`, and it's clearer than before. Also, once we introduce `waitForNewCheckpoint`, we might not need a `waitForOneMoreCheckpoint` that checks the checkpoint count; that would only be needed if we had to wait for a specific number of new checkpoints, and at least I don't see a strong requirement for that now. WDYT? ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -411,7 +413,7 @@ public void testCheckpointRescalingWithKeyedAndNonPartitionedState() throws Exce // clear the CollectionSink set for the restarted job CollectionSink.clearElementsSet(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: Do you mean `createJobGraphWithKeyedState` and `createJobGraphWithKeyedAndNonPartitionedOperatorState` have redundant code? Or `testCheckpointRescalingWithKeyedAndNonPartitionedState` and `testCheckpointRescalingKeyedState`? I checked them; they differ in many details.
Such as: - The sources are different - The parallelism and maxParallelism are fixed for `NonPartitionedOperator` I will check later whether some common code can be extracted. If yes, I can submit a hotfix PR and cc you. ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -513,7 +515,7 @@ public void testCheckpointRescalingPartitionedOperatorState( // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: Same as this comment: https://github.com/apache/flink/pull/24246/files#r1474917357 ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -261,7 +263,7 @@ public void testCheckpointRescalingKeyedState(boolean scaleOut) throws Exception waitForAllTaskRunning(cluster.getMiniCluster(), jobGraph.getJobID(), false); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: Added, please help double-check, thanks~
[jira] [Comment Edited] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813215#comment-17813215 ] Matthias Pohl edited comment on FLINK-34333 at 2/2/24 7:46 AM: --- I think it's fine to do the upgrade from v6.6.2 to v6.9.0 even for 1.18: * The only change that seems to affect Flink is in the Mock-framework and only affects test code * Most of the changes are cleanup changes, except for: ** Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Changed behavior in certain scenarios ** Fix [#5125|https://github.com/fabric8io/kubernetes-client/issues/5125]: Different default TLS version ** Fix [#1335|https://github.com/fabric8io/kubernetes-client/issues/1335]: Different fallback used for proxy configuration Generally, we could argue that downstream projects should rely on shaded dependencies provided by Flink. And fixing the bug outweighs the stability concerns here. Do you see a problem with this argument [~gyfora], [~wangyang0918]? The alternatives to doing the upgrade are: * Reverting the upgrade (i.e. going back from v6.6.2 to v5.12.4). This would allow us to get back to the stable version that was tested with 1.17- * Provide a Flink-customized implementation of the fabric8io {{LeaderElector}} class with the cherry-picked changes of [8f8c438f|https://github.com/fabric8io/kubernetes-client/commit/8f8c438f] and [0f6c6965|https://github.com/fabric8io/kubernetes-client/commit/0f6c6965].
As a consequence, we would stick to fabric8io:kubernetes-client v.6.6.2 was (Author: mapohl): I think it's fine to do the upgrade from v6.6.2 to v6.9.0 even for 1.18: * The only change that seems to affect Flink is in the Mock-framework and only affects test code * Most of the changes are cleanup changes, except for: ** Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Changed behavior in certain scenarios ** Fix [#5125|https://github.com/fabric8io/kubernetes-client/issues/5125]: Different default TLS version ** Fix [#1335|https://github.com/fabric8io/kubernetes-client/issues/1335]: Different fallback used for proxy configuration Generally, we could argue that downstream project should rely on transitive dependencies provided by Flink. And fixing the bug out-weighs the stability concerns here. Do you see a problem with this argument [~gyfora], [~wangyang0918]? The alternatives to doing the upgrade are: * Reverting the upgrade (i.e. going back from v6.6.2 to v5.12.4). This would allow us to get to the stable version that was tested with 1.17- * Provide a Flink-customized implementation of the fabric8io {{LeaderElector}} class with the cherry-picked changes of [8f8c438f|https://github.com/fabric8io/kubernetes-client/commit/8f8c438f] and [0f6c6965|https://github.com/fabric8io/kubernetes-client/commit/0f6c6965]. As a consequence, we would stick to fabric8io:kubernetes-client v.6.6.2 > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. 
> This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007. > Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in > v6.6.2 which might prevent the leadership lost event being forwarded to the > client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). > An initial proposal where the release call was handled in Flink's > {{KubernetesLeaderElector}} didn't work due to the leadership lost event > being triggered twice (see [FLINK-34007 PR > comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902]) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34202) python tests take suspiciously long in some of the cases
[ https://issues.apache.org/jira/browse/FLINK-34202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813537#comment-17813537 ] Matthias Pohl commented on FLINK-34202: --- https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57203=logs=9cada3cb-c1d3-5621-16da-0f718fb86602=c67e71ed-6451-5d26-8920-5a8cf9651901=27977 > python tests take suspiciously long in some of the cases > > > Key: FLINK-34202 > URL: https://issues.apache.org/jira/browse/FLINK-34202 > Project: Flink > Issue Type: Bug > Components: API / Python >Affects Versions: 1.17.2, 1.19.0, 1.18.1 >Reporter: Matthias Pohl >Priority: Critical > Labels: test-stability > > [This release-1.18 > build|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56603=logs=3e4dd1a2-fe2f-5e5d-a581-48087e718d53=b4612f28-e3b5-5853-8a8b-610ae894217a] > has the python stage running into a timeout without any obvious reason. The > [python stage run for > JDK17|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56603=logs=b53e1644-5cb4-5a3b-5d48-f523f39bcf06] > was also getting close to the 4h timeout. > I'm creating this issue for documentation purposes.
[jira] [Reopened] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl reopened FLINK-34007: --- I'm reopening the issue because we're seeing test instabilities now: * [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57203=logs=64debf87-ecdb-5aef-788d-8720d341b5cb=2302fb98-0839-5df2-3354-bbae636f81a7=8066] * [https://github.com/XComp/flink/actions/runs/7745486791/job/21121947504#step:14:7309] (this one happened on the FLINK-34333 1.18 backport which covers the same change) > Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.19.0, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
1996fanrui commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1475645896 ## flink-runtime/src/test/java/org/apache/flink/runtime/testutils/CommonTestUtils.java: ## @@ -367,9 +367,11 @@ public static void waitForCheckpoint(JobID jobID, MiniCluster miniCluster, int n }); } -/** Wait for on more completed checkpoint. */ -public static void waitForOneMoreCheckpoint(JobID jobID, MiniCluster miniCluster) -throws Exception { +/** + * Wait for a new completed checkpoint. Note: we wait for 2 or more checkpoint to ensure the + * latest checkpoint start after waitForTwoOrMoreCheckpoint is called. + */ +public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster) throws Exception { Review Comment: > That way, we can pass in 2 in the test implementation analogously to waitForCheckpoint. I also feel like we can remove some redundant code within the two methods. IIUC, the semantics of `waitForCheckpoint` and `waitForOneMoreCheckpoint` are different. (`waitForOneMoreCheckpoint` is renamed to `waitForNewCheckpoint` in this PR.) - `waitForCheckpoint` checks the total count of all completed checkpoints. - `waitForOneMoreCheckpoint` checks whether a new checkpoint completed after it was called. - For example, if the job had 10 completed checkpoints before it was called, `waitForOneMoreCheckpoint` will wait for checkpoint-11 to complete. BTW, I have refactored `waitForNewCheckpoint`: I check the checkpoint trigger time instead of the checkpoint count. I think checking the trigger time is clearer than `checkpointCount >= 2`; other developers might not know why we check for 2 checkpoints here, and `checkpointCount >= 2` doesn't work when concurrent checkpoints are enabled. WDYT?
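The trigger-time idea discussed in this thread can be sketched in isolation as follows. The snippet is a hypothetical, self-contained simulation — a `LongSupplier` stands in for querying the latest completed checkpoint's trigger timestamp from the MiniCluster — and is not the actual CommonTestUtils code:

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: instead of counting completed checkpoints, remember
// when the wait started and poll until the latest completed checkpoint was
// triggered strictly after that point in time. Such a checkpoint is
// guaranteed to be "new", even with concurrent checkpoints enabled.
public class WaitForNewCheckpointSketch {
    static void waitForNewCheckpoint(LongSupplier latestTriggerTimestamp,
                                     long pollIntervalMs,
                                     long timeoutMs) throws InterruptedException {
        final long calledAt = System.currentTimeMillis();
        final long deadline = calledAt + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (latestTriggerTimestamp.getAsLong() > calledAt) {
                return; // a checkpoint started after we began waiting
            }
            Thread.sleep(pollIntervalMs);
        }
        throw new IllegalStateException("no new checkpoint within timeout");
    }
}
```

Comparing the trigger timestamp against the call time avoids reasoning about checkpoint counts entirely, which is why it stays correct when several checkpoints run concurrently.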
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
1996fanrui commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1475645896 ## flink-runtime/src/test/java/org/apache/flink/runtime/testutils/CommonTestUtils.java: ## @@ -367,9 +367,11 @@ public static void waitForCheckpoint(JobID jobID, MiniCluster miniCluster, int n }); } -/** Wait for on more completed checkpoint. */ -public static void waitForOneMoreCheckpoint(JobID jobID, MiniCluster miniCluster) -throws Exception { +/** + * Wait for a new completed checkpoint. Note: we wait for 2 or more checkpoint to ensure the + * latest checkpoint start after waitForTwoOrMoreCheckpoint is called. + */ +public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster) throws Exception { Review Comment: I have refactored `waitForNewCheckpoint`: I check the checkpoint trigger time instead of the checkpoint count. I think checking the trigger time is clearer than `checkpointCount >= 2`, WDYT?
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
xuyangzhong commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475601398 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: I also added a related test for it.
Re: [PR] [FLINK-34336][test] Fix the bug that AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes [flink]
1996fanrui commented on PR #24248: URL: https://github.com/apache/flink/pull/24248#issuecomment-1923016887 org.apache.flink.test.checkpointing.AutoRescalingITCase#testCheckpointRescalingNonPartitionedStateCausesException hangs as well. I checked: all test jobs of AutoRescalingITCase have more than one job vertex, so the `desiredNumberOfRunningTasks` of all `waitForRunningTasks` calls should be `2 * parallelism2` or `p1 + p2`.
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
hackergin commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475597199 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: +1, we can handle it in a separate PR; it seems that other types of TVFs may also have similar issues.
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
xuyangzhong commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475596548 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: I have created one for it. https://issues.apache.org/jira/browse/FLINK-34338
[jira] [Created] (FLINK-34338) An exception is thrown when some named params change order when using window tvf
xuyang created FLINK-34338: -- Summary: An exception is thrown when some named params change order when using window tvf Key: FLINK-34338 URL: https://issues.apache.org/jira/browse/FLINK-34338 Project: Flink Issue Type: Bug Components: Table SQL / API Affects Versions: 1.15.0 Reporter: xuyang Fix For: 1.19.0 This bug can be reproduced by the following sql in `WindowTableFunctionTest` {code:java} @Test def test(): Unit = { val sql = """ |SELECT * |FROM TABLE(TUMBLE( | DATA => TABLE MyTable, | SIZE => INTERVAL '15' MINUTE, | TIMECOL => DESCRIPTOR(rowtime) | )) |""".stripMargin util.verifyRelPlan(sql) }{code} In FLIP-145 and the user doc, we can find `the DATA param must be the first`, but it seems that we also can't change the order of the other params.
Re: [PR] [FLINK-26369][doc-zh] Translate the part zh-page mixed with not be translated. [flink]
Jiabao-Sun commented on code in PR #20340: URL: https://github.com/apache/flink/pull/20340#discussion_r1475595276 ## docs/content.zh/docs/deployment/ha/overview.md: ## @@ -74,18 +74,9 @@ Flink 提供了两种高可用服务实现: {{< top >}} -## JobResultStore - -The JobResultStore is used to archive the final result of a job that reached a globally-terminal -state (i.e. finished, cancelled or failed). The data is stored on a file system (see -[job-result-store.storage-path]({{< ref "docs/deployment/config#job-result-store-storage-path" >}})). -Entries in this store are marked as dirty as long as the corresponding job wasn't cleaned up properly -(artifacts are found in the job's subfolder in [high-availability.storageDir]({{< ref "docs/deployment/config#high-availability-storagedir" >}})). - -Dirty entries are subject to cleanup, i.e. the corresponding job is either cleaned up by Flink at -the moment or will be picked up for cleanup as part of a recovery. The entries will be deleted as -soon as the cleanup succeeds. Check the JobResultStore configuration parameters under -[HA configuration options]({{< ref "docs/deployment/config#high-availability" >}}) for further -details on how to adapt the behavior. +## 作业结果存储 +作业结果存储用于归档达到全局结束状态作业(即完成、取消或失败)的最终结果,其数据存储在文件系统上 (请参阅[job-result-store.storage-path]({{< ref "docs/deployment/config#job-result-store-storage-path" >}}))。 +只要没有正确清理相应的作业,此数据条目就是脏数据 (数据位于作业的子文件夹中 [high-availability.storageDir]({{< ref "docs/deployment/config#high-availability-storagedir" >}}))。 +脏数据是需要清理的,即通过相应的作业要么被 Flink 清理,要么将作为恢复的一部分被选取进行清理。清理成功后,数据条目将被删除。请参阅 [HA configuration options]({{< ref "docs/deployment/config#high-availability" >}}) 下作业结果存储的配置参数以获取有关如何调整行为的更多详细信息。 Review Comment: 脏数据将被清理，即相应的作业要么在当前时刻被清理，要么在作业恢复过程中被清理。一旦清理成功，这些脏数据条目将被删除。
Re: [PR] [FLINK-34323][table-planner] Fix param name in session window tvf when using named params [flink]
xuyangzhong commented on code in PR #24243: URL: https://github.com/apache/flink/pull/24243#discussion_r1475591118 ## flink-table/flink-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/WindowTableFunctionTest.scala: ## @@ -257,4 +257,48 @@ class WindowTableFunctionTest extends TableTestBase { util.verifyRelPlan(sql) } + @Test + def testSessionTVF(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFProctime(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable, DESCRIPTOR(proctime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithPartitionKeys(): Unit = { +val sql = + """ +|SELECT * +|FROM TABLE(SESSION(TABLE MyTable PARTITION BY (b, a), DESCRIPTOR(rowtime), INTERVAL '15' MINUTE)) +|""".stripMargin +util.verifyRelPlan(sql) + } + + @Test + def testSessionTVFWithNamedParams(): Unit = { Review Comment: I have double-checked FLIP-145 and the user doc; I only found `the DATA param must be the first`. That means we should allow the DATA and GAP params to change order. But I think it's better to open a separate Jira for that; this PR focuses on the param-name mismatch between `SIZE` in Calcite and `GAP` in Flink. WDYT?
Re: [PR] [FLINK-33396][table-planner] Fix table alias not be cleared in subquery [flink]
xuyangzhong commented on code in PR #24239: URL: https://github.com/apache/flink/pull/24239#discussion_r1475586086 ## flink-table/flink-table-planner/src/test/resources/org/apache/flink/table/planner/plan/hints/batch/NestLoopJoinHintTest.xml: ## @@ -679,8 +680,8 @@ HashJoin(joinType=[LeftSemiJoin], where=[=(a1, a2)], select=[a1, b1], build=[rig +- Calc(select=[a2]) +- NestedLoopJoin(joinType=[InnerJoin], where=[=(a2, a3)], select=[a2, a3], build=[left]) :- Exchange(distribution=[broadcast]) - : +- TableSourceScan(table=[[default_catalog, default_database, T2, project=[a2], metadata=[]]], fields=[a2], hints=[[[ALIAS options:[T2) - +- TableSourceScan(table=[[default_catalog, default_database, T3, project=[a3], metadata=[]]], fields=[a3], hints=[[[ALIAS options:[T3) Review Comment: Hi, I found that ClearQueryBlockAliasResolverTest#testJoinHintWithJoinHintInCorrelateAndWithAgg does a similar test. Can you take a look and check whether it's what you mean? In this test, the join hint will not be cleared and only the alias hint will be cleared.
Re: [PR] [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method [flink]
flinkbot commented on PR #24249: URL: https://github.com/apache/flink/pull/24249#issuecomment-1922891781 ## CI report: * a57b0e7e04a2f3f3b61a7a5c2ed5f3e8cbae9f47 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
Re: [PR] [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method [flink]
Jiabao-Sun commented on PR #24249: URL: https://github.com/apache/flink/pull/24249#issuecomment-1922891276 Hi @pvary, could you help review it when you have time?
[jira] [Updated] (FLINK-34192) Update flink-connector-kafka to be compatible with updated SinkV2 interfaces
[ https://issues.apache.org/jira/browse/FLINK-34192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34192: --- Labels: pull-request-available (was: ) > Update flink-connector-kafka to be compatible with updated SinkV2 interfaces > > > Key: FLINK-34192 > URL: https://issues.apache.org/jira/browse/FLINK-34192 > Project: Flink > Issue Type: Technical Debt > Components: Connectors / Kafka >Reporter: Martijn Visser >Priority: Blocker > Labels: pull-request-available > > {code:java} > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/streaming/connectors/kafka/table/KafkaTableTestUtils.java:[101,76] > incompatible types: java.util.List cannot be > converted to java.util.List > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[136,46] > cannot find symbol > Error:symbol: method > mock(org.apache.flink.metrics.MetricGroup,org.apache.flink.metrics.groups.OperatorIOMetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[171,46] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[204,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > 
/home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[233,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[263,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[294,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[337,54] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: > /home/runner/work/flink-connector-kafka/flink-connector-kafka/flink-connector-kafka/src/test/java/org/apache/flink/connector/kafka/sink/KafkaWriterITCase.java:[525,46] > cannot find symbol > Error:symbol: method mock(org.apache.flink.metrics.MetricGroup) > Error:location: class > org.apache.flink.runtime.metrics.groups.InternalSinkWriterMetricGroup > Error: -> [Help 1] > {code} > https://github.com/apache/flink-connector-kafka/actions/runs/7597711401/job/20692858078#step:14:221
[jira] [Resolved] (FLINK-34178) The ScalingTracking of autoscaler is wrong
[ https://issues.apache.org/jira/browse/FLINK-34178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan resolved FLINK-34178. - Fix Version/s: kubernetes-operator-1.8.0 Resolution: Fixed > The ScalingTracking of autoscaler is wrong > -- > > Key: FLINK-34178 > URL: https://issues.apache.org/jira/browse/FLINK-34178 > Project: Flink > Issue Type: Bug > Components: Autoscaler >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The ScalingTracking of autoscaler is wrong, it's always greater than > AutoScalerOptions#STABILIZATION_INTERVAL. > h2. Reason: > When the flink job isStabilizing, ScalingMetricCollector#updateMetrics will > return an empty metric history. In the JobAutoScalerImpl#runScalingLogic > method, if `collectedMetrics.getMetricHistory().isEmpty()`, we don't update > the ScalingTracking. > > The default value of AutoScalerOptions#STABILIZATION_INTERVAL is 5 min, so > the restartTime is always greater than 5 min. > However, the actual restart is quick when we use the rescale API.
[jira] [Commented] (FLINK-34178) The ScalingTracking of autoscaler is wrong
[ https://issues.apache.org/jira/browse/FLINK-34178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813488#comment-17813488 ] Rui Fan commented on FLINK-34178: - Merged to main(1.8.0) via: * 2bd6b1b38171ce821509d46d687291b876f72a2e * 746a996732e56e573f7882764c82e2faa2cba71d > The ScalingTracking of autoscaler is wrong > -- > > Key: FLINK-34178 > URL: https://issues.apache.org/jira/browse/FLINK-34178 > Project: Flink > Issue Type: Bug > Components: Autoscaler >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > > The ScalingTracking of autoscaler is wrong, it's always greater than > AutoScalerOptions#STABILIZATION_INTERVAL. > h2. Reason: > When the flink job isStabilizing, ScalingMetricCollector#updateMetrics will > return an empty metric history. In the JobAutoScalerImpl#runScalingLogic > method, if `collectedMetrics.getMetricHistory().isEmpty()`, we don't update > the ScalingTracking. > > The default value of AutoScalerOptions#STABILIZATION_INTERVAL is 5 min, so > the restartTime is always greater than 5 min. > However, the actual restart is quick when we use the rescale API.
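The bug pattern described above — recording the end of a restart only once metric history is available, so the observed restart time can never be shorter than the stabilization interval — can be sketched with invented names (this is an illustrative model, not the operator's actual ScalingTracking API):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical tracker; field and method names are illustrative only.
class RestartTracker {
    private Instant restartStart;
    private Duration observedRestartTime;

    void onRescaleTriggered(Instant now) {
        restartStart = now;
    }

    // Buggy variant: only records the restart time when metric history is
    // non-empty, i.e. after the stabilization interval has already elapsed.
    void onMetricsCollected(boolean metricHistoryEmpty, Instant now) {
        if (!metricHistoryEmpty && observedRestartTime == null) {
            observedRestartTime = Duration.between(restartStart, now);
        }
    }

    // Fixed variant: record as soon as the job reports RUNNING, independent
    // of whether metrics have been collected yet.
    void onJobRunning(Instant now) {
        if (observedRestartTime == null) {
            observedRestartTime = Duration.between(restartStart, now);
        }
    }

    Duration observed() {
        return observedRestartTime;
    }
}
```

With a 10-second actual restart and a 5-minute stabilization interval, the buggy path reports 5 minutes while the fixed path reports 10 seconds.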
[jira] [Comment Edited] (FLINK-34329) ScalingReport format tests fail locally on decimal format
[ https://issues.apache.org/jira/browse/FLINK-34329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813487#comment-17813487 ] Rui Fan edited comment on FLINK-34329 at 2/2/24 5:54 AM: - Merged to main(1.8.0) via : 398a87c9012ddc79bfe4b2378cea740642283628 was (Author: fanrui): Merged to main(1.9.0) via : 398a87c9012ddc79bfe4b2378cea740642283628 > ScalingReport format tests fail locally on decimal format > - > > Key: FLINK-34329 > URL: https://issues.apache.org/jira/browse/FLINK-34329 > Project: Flink > Issue Type: Bug > Components: Autoscaler, Kubernetes Operator >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Gyula Fora >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The recently introduced scaling event format tests fail locally due to > different decimal format: > ``` > [ERROR] AutoScalerEventHandlerTest.testScalingReport:55 > expected: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424.68 -> 123.40 | Target data rate 403.67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812.58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404.73 -> 645.00 | Target data rate 404.27}" > but was: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424,68 -> 123,40 | Target data rate 403,67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812,58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404,73 -> 645,00 | Target data rate 404,27}" > [INFO] > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34329) ScalingReport format tests fail locally on decimal format
[ https://issues.apache.org/jira/browse/FLINK-34329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan resolved FLINK-34329. - Resolution: Fixed > ScalingReport format tests fail locally on decimal format > - > > Key: FLINK-34329 > URL: https://issues.apache.org/jira/browse/FLINK-34329 > Project: Flink > Issue Type: Bug > Components: Autoscaler, Kubernetes Operator >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Gyula Fora >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The recently introduced scaling event format tests fail locally due to > different decimal format: > ``` > [ERROR] AutoScalerEventHandlerTest.testScalingReport:55 > expected: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424.68 -> 123.40 | Target data rate 403.67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812.58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404.73 -> 645.00 | Target data rate 404.27}" > but was: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424,68 -> 123,40 | Target data rate 403,67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812,58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404,73 -> 645,00 | Target data rate 404,27}" > [INFO] > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34329) ScalingReport format tests fail locally on decimal format
[ https://issues.apache.org/jira/browse/FLINK-34329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813487#comment-17813487 ] Rui Fan commented on FLINK-34329: - Merged to main(1.9.0) via : 398a87c9012ddc79bfe4b2378cea740642283628 > ScalingReport format tests fail locally on decimal format > - > > Key: FLINK-34329 > URL: https://issues.apache.org/jira/browse/FLINK-34329 > Project: Flink > Issue Type: Bug > Components: Autoscaler, Kubernetes Operator >Affects Versions: kubernetes-operator-1.8.0 >Reporter: Gyula Fora >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > The recently introduced scaling event format tests fail locally due to > different decimal format: > ``` > [ERROR] AutoScalerEventHandlerTest.testScalingReport:55 > expected: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424.68 -> 123.40 | Target data rate 403.67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812.58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404.73 -> 645.00 | Target data rate 404.27}" > but was: "Scaling execution enabled, begin scaling vertices:\{ Vertex ID > ea632d67b7d595e5b851708ae9ad79d6 | Parallelism 3 -> 1 | Processing capacity > 424,68 -> 123,40 | Target data rate 403,67}{ Vertex ID > bc764cd8ddf7a0cff126f51c16239658 | Parallelism 4 -> 2 | Processing capacity > Infinity -> Infinity | Target data rate 812,58}\{ Vertex ID > 0a448493b4782967b150582570326227 | Parallelism 5 -> 8 | Processing capacity > 404,73 -> 645,00 | Target data rate 404,27}" > [INFO] > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
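The root cause of the mismatch above is that `String.format` with `%f` uses the JVM's default locale for the decimal separator, so a German-locale machine renders `424.68` as `424,68`. A minimal demonstration — pinning `Locale.ROOT` is one common way to make the output deterministic; whether the linked PR does exactly this is not shown here:

```java
import java.util.Locale;

class ScalingReportLocaleDemo {
    // Formats a processing-capacity value with an explicit locale, mirroring
    // the locale-sensitive formatting behind the failing assertion.
    static String format(Locale locale, double v) {
        return String.format(locale, "%.2f", v);
    }

    public static void main(String[] args) {
        // Locale-sensitive: German locale uses a comma decimal separator.
        System.out.println(format(Locale.GERMANY, 424.68)); // 424,68
        // Locale-pinned: identical on every machine.
        System.out.println(format(Locale.ROOT, 424.68));    // 424.68
    }
}
```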
Re: [PR] [FLINK-34178][autoscaler] Fix the bug that observed scaling restart time is always greater than `stabilization.interval` [flink-kubernetes-operator]
1996fanrui merged PR #759: URL: https://github.com/apache/flink-kubernetes-operator/pull/759 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-34178][autoscaler] Fix the bug that observed scaling restart time is always greater than `stabilization.interval` [flink-kubernetes-operator]
1996fanrui commented on PR #759: URL: https://github.com/apache/flink-kubernetes-operator/pull/759#issuecomment-1922886105 Thanks for the review. I tested locally, it works fine, the tracking time is correct now. Merging~
[jira] [Updated] (FLINK-34337) Sink.InitContextWrapper should implement metadataConsumer method
[ https://issues.apache.org/jira/browse/FLINK-34337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34337: --- Labels: pull-request-available (was: ) > Sink.InitContextWrapper should implement metadataConsumer method > > > Key: FLINK-34337 > URL: https://issues.apache.org/jira/browse/FLINK-34337 > Project: Flink > Issue Type: Bug > Components: API / Core >Affects Versions: 1.19.0 >Reporter: Jiabao Sun >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > > Sink.InitContextWrapper should implement metadataConsumer method. > If the metadataConsumer method is not implemented, the behavior of the > wrapped WriterInitContext's metadataConsumer will be lost.
[PR] [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method [flink]
Jiabao-Sun opened a new pull request, #24249: URL: https://github.com/apache/flink/pull/24249 ## What is the purpose of the change [FLINK-34337][core] Sink.InitContextWrapper should implement metadataConsumer method ## Brief change log Sink.InitContextWrapper should implement metadataConsumer method. If the metadataConsumer method is not implemented, the behavior of the wrapped WriterInitContext's metadataConsumer will be lost. ## Verifying this change Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing This change is already covered by existing tests. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not documented)
[jira] [Created] (FLINK-34337) Sink.InitContextWrapper should implement metadataConsumer method
Jiabao Sun created FLINK-34337: -- Summary: Sink.InitContextWrapper should implement metadataConsumer method Key: FLINK-34337 URL: https://issues.apache.org/jira/browse/FLINK-34337 Project: Flink Issue Type: Bug Components: API / Core Affects Versions: 1.19.0 Reporter: Jiabao Sun Fix For: 1.19.0 Sink.InitContextWrapper should implement metadataConsumer method. If the metadataConsumer method is not implemented, the behavior of the wrapped WriterInitContext's metadataConsumer will be lost.
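The failure mode behind this issue is the classic default-method delegation pitfall: a wrapper that forwards only the abstract methods silently replaces a defaulted method with the interface default, losing the wrapped object's behavior. A self-contained sketch with hypothetical interfaces (these are not Flink's real Sink.InitContextWrapper/WriterInitContext signatures):

```java
import java.util.Optional;
import java.util.function.Consumer;

// Hypothetical stand-in for an init-context interface with a defaulted method.
interface InitContext {
    String jobId();

    default Optional<Consumer<String>> metadataConsumer() {
        return Optional.empty();
    }
}

class LoggingContext implements InitContext {
    public String jobId() { return "job-1"; }

    @Override
    public Optional<Consumer<String>> metadataConsumer() {
        return Optional.of(meta -> System.out.println("meta: " + meta));
    }
}

// Buggy wrapper: forwards jobId() but not metadataConsumer(), so the
// interface default (Optional.empty()) hides the delegate's consumer.
class BuggyWrapper implements InitContext {
    private final InitContext delegate;
    BuggyWrapper(InitContext delegate) { this.delegate = delegate; }
    public String jobId() { return delegate.jobId(); }
}

// Fixed wrapper: delegates the defaulted method as well.
class FixedWrapper implements InitContext {
    private final InitContext delegate;
    FixedWrapper(InitContext delegate) { this.delegate = delegate; }
    public String jobId() { return delegate.jobId(); }

    @Override
    public Optional<Consumer<String>> metadataConsumer() {
        return delegate.metadataConsumer();
    }
}
```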
Re: [PR] [FLINK-34329][autoscaler] Fix the bug that scaling report parser doesn't support Locale [flink-kubernetes-operator]
1996fanrui merged PR #769: URL: https://github.com/apache/flink-kubernetes-operator/pull/769
[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
[ https://issues.apache.org/jira/browse/FLINK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-34336: Description: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. h2. How to reproduce this bug? [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] * Disable the cooldown * Sleep for a while before waitForRunningTasks If so, the job runs at the new parallelism, so `waitForRunningTasks` will hang forever. was: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. h2. How to reproduce this bug? [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] * Disable the cooldown * Sleep for a while before waitForRunningTasks > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang sometimes > - > > Key: FLINK-34336 > URL: https://issues.apache.org/jira/browse/FLINK-34336 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.19.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang in > waitForRunningTasks(restClusterClient, jobID, parallelism2); > h2. Reason: > The job has 2 tasks (vertices) after calling updateJobResourceRequirements. > The source parallelism isn't changed (it's parallelism), and the > FlatMapper+Sink is changed from parallelism to parallelism2. > So we expect the task number to be parallelism + parallelism2 instead of > parallelism2. > > h2. Why can it be passed for now? > Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by > default. It means the flink job will rescale 30 seconds after > updateJobResourceRequirements is called. > > So the running tasks are still at the old parallelism when we call > waitForRunningTasks(restClusterClient, jobID, parallelism2);. > IIUC, it cannot be guaranteed, and it's unexpected. > > h2. How to reproduce this bug? > [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] > * Disable the cooldown > * Sleep for a while before waitForRunningTasks > If so, the job runs at the new parallelism, so `waitForRunningTasks` will hang > forever.
Re: [PR] [FLINK-34336][test] Fix the bug that AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes [flink]
flinkbot commented on PR #24248: URL: https://github.com/apache/flink/pull/24248#issuecomment-1922845516 ## CI report: * 9732c01c33a38cb989d9fb73a5838329f35d8f7e UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
[ https://issues.apache.org/jira/browse/FLINK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34336: --- Labels: pull-request-available (was: ) > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang sometimes > - > > Key: FLINK-34336 > URL: https://issues.apache.org/jira/browse/FLINK-34336 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.19.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang in > waitForRunningTasks(restClusterClient, jobID, parallelism2); > h2. Reason: > The job has 2 tasks (vertices) after calling updateJobResourceRequirements. > The source parallelism isn't changed (it's parallelism), and the > FlatMapper+Sink is changed from parallelism to parallelism2. > So we expect the task number to be parallelism + parallelism2 instead of > parallelism2. > > h2. Why can it be passed for now? > Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by > default. It means the flink job will rescale 30 seconds after > updateJobResourceRequirements is called. > > So the running tasks are still at the old parallelism when we call > waitForRunningTasks(restClusterClient, jobID, parallelism2);. > IIUC, it cannot be guaranteed, and it's unexpected. > > h2. How to reproduce this bug? > [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] > * Disable the cooldown > * Sleep for a while before waitForRunningTasks
[PR] [FLINK-34336][test] Fix the bug that AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes [flink]
1996fanrui opened a new pull request, #24248: URL: https://github.com/apache/flink/pull/24248 ## Purpose AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); ## Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. ## Why it can be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. ## How to reproduce this bug? https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6 1. Disable the cooldown 2. Sleep for a while before waitForRunningTasks
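The essence of the fix is the expected task count: the source stays at `parallelism` while FlatMapper+Sink move to `parallelism2`, so the wait must target `parallelism + parallelism2` running tasks. A minimal, generic polling-wait sketch (a hypothetical helper with invented names, not Flink's actual MiniCluster test utilities) showing why a wrong expected count makes the loop spin until timeout:

```java
import java.time.Duration;
import java.util.function.IntSupplier;

class WaitUtil {
    // Polls until the reported running-task count reaches the expected total.
    // With a wrong expectation (e.g. parallelism2 instead of
    // parallelism + parallelism2) the condition is never met and the caller
    // hangs until the timeout — which is the hang this PR fixes.
    static boolean waitForRunningTasks(IntSupplier runningTasks, int expected, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (runningTasks.getAsInt() == expected) {
                return true;
            }
            Thread.sleep(10); // back off between polls
        }
        return false;
    }
}
```

For a source at parallelism 2 and FlatMapper+Sink at parallelism 4, the correct expectation is 6 tasks, not 4.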
[jira] [Updated] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
[ https://issues.apache.org/jira/browse/FLINK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-34336: Description: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. h2. How to reproduce this bug? [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] * Disable the cooldown * Sleep for a while before waitForRunningTasks was: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected. > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang sometimes > - > > Key: FLINK-34336 > URL: https://issues.apache.org/jira/browse/FLINK-34336 > Project: Flink > Issue Type: Bug > Components: Tests >Affects Versions: 1.19.0 >Reporter: Rui Fan >Assignee: Rui Fan >Priority: Major > Fix For: 1.19.0 > > > AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState > may hang in > waitForRunningTasks(restClusterClient, jobID, parallelism2); > h2. Reason: > The job has 2 tasks (vertices) after calling updateJobResourceRequirements. > The source parallelism isn't changed (it's parallelism), and the > FlatMapper+Sink is changed from parallelism to parallelism2. > So we expect the task number to be parallelism + parallelism2 instead of > parallelism2. > > h2. Why can it be passed for now? > Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by > default. It means the flink job will rescale 30 seconds after > updateJobResourceRequirements is called. > > So the running tasks are still at the old parallelism when we call > waitForRunningTasks(restClusterClient, jobID, parallelism2);. > IIUC, it cannot be guaranteed, and it's unexpected. > > h2. How to reproduce this bug? > [https://github.com/1996fanrui/flink/commit/ffd713e24d37db2c103e4cd4361d0cd916d0d2f6] > * Disable the cooldown > * Sleep for a while before waitForRunningTasks
[jira] [Created] (FLINK-34336) AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes
Rui Fan created FLINK-34336: --- Summary: AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang sometimes Key: FLINK-34336 URL: https://issues.apache.org/jira/browse/FLINK-34336 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.19.0 Reporter: Rui Fan Assignee: Rui Fan Fix For: 1.19.0 AutoRescalingITCase#testCheckpointRescalingWithKeyedAndNonPartitionedState may hang in waitForRunningTasks(restClusterClient, jobID, parallelism2); h2. Reason: The job has 2 tasks (vertices) after calling updateJobResourceRequirements. The source parallelism isn't changed (it's parallelism), and the FlatMapper+Sink is changed from parallelism to parallelism2. So we expect the task number to be parallelism + parallelism2 instead of parallelism2. h2. Why can it be passed for now? Flink 1.19 supports the scaling cooldown, and the cooldown time is 30s by default. It means the flink job will rescale 30 seconds after updateJobResourceRequirements is called. So the running tasks are still at the old parallelism when we call waitForRunningTasks(restClusterClient, jobID, parallelism2);. IIUC, it cannot be guaranteed, and it's unexpected.
Re: [PR] [FLINK-32075][FLIP-306][Checkpoint] Delete merged files on checkpoint abort or subsumption [flink]
fredia commented on code in PR #24181: URL: https://github.com/apache/flink/pull/24181#discussion_r1475523078 ## flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/filemerging/WithinCheckpointFileMergingSnapshotManager.java: ## @@ -50,6 +50,43 @@ public WithinCheckpointFileMergingSnapshotManager(String id, Executor ioExecutor writablePhysicalFilePool = new HashMap<>(); } +// +// CheckpointListener +// + +@Override +public void notifyCheckpointComplete(SubtaskKey subtaskKey, long checkpointId) Review Comment: Why is `notifyCheckpointSubsumed` not overridden? IIUC, checkpoint subsume and complete are triggered at the same time([code](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1082-L1085)), although the effect of the two methods is the same, it feels more straightforward to close the physical file in `notifyCheckpointSubsumed`, WDYT? ## flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/filemerging/FileMergingSnapshotManagerBase.java: ## @@ -345,6 +371,79 @@ protected abstract PhysicalFile getOrCreatePhysicalFileForCheckpoint( protected abstract void returnPhysicalFileForNextReuse( SubtaskKey subtaskKey, long checkpointId, PhysicalFile physicalFile) throws IOException; +/** + * The callback which will be triggered when all subtasks discarded (aborted or subsumed). + * + * @param checkpointId the discarded checkpoint id. + * @throws IOException if anything goes wrong with file system.
+ */ +protected abstract void discardCheckpoint(long checkpointId) throws IOException; + +// +// Checkpoint Listener +// + +@Override +public void notifyCheckpointComplete(SubtaskKey subtaskKey, long checkpointId) +throws Exception { +// does nothing +} + +@Override +public void notifyCheckpointAborted(SubtaskKey subtaskKey, long checkpointId) throws Exception { +synchronized (lock) { +Set logicalFilesForCurrentCp = uploadedStates.get(checkpointId); +if (logicalFilesForCurrentCp == null) { +return; +} +Iterator logicalFileIterator = logicalFilesForCurrentCp.iterator(); +while (logicalFileIterator.hasNext()) { +LogicalFile logicalFile = logicalFileIterator.next(); +if (logicalFile.getSubtaskKey().equals(subtaskKey) +&& logicalFile.getLastUsedCheckpointID() <= checkpointId) { +logicalFile.discardWithCheckpointId(checkpointId); +logicalFileIterator.remove(); +} +} + +if (logicalFilesForCurrentCp.isEmpty()) { +uploadedStates.remove(checkpointId); +discardCheckpoint(checkpointId); +} Review Comment: nit: There is some overlap with the code of `notifyCheckpointSubsumed`, maybe it can be extracted into a function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
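The de-duplication the reviewer suggests can be sketched generically: both the "aborted" and "subsumed" paths drop one subtask's entries for a checkpoint and fire a discard callback once the checkpoint's set is empty. This is a simplified model with invented names and types (strings instead of `LogicalFile`, no locking), not the actual FileMergingSnapshotManagerBase code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongConsumer;

// Generic sketch of the extracted helper shared by both notification paths.
class CheckpointFileRegistry {
    private final Map<Long, Set<String>> uploadedStates = new HashMap<>();
    private final LongConsumer onCheckpointDiscarded;

    CheckpointFileRegistry(LongConsumer onCheckpointDiscarded) {
        this.onCheckpointDiscarded = onCheckpointDiscarded;
    }

    void register(long checkpointId, String file) {
        uploadedStates.computeIfAbsent(checkpointId, k -> new HashSet<>()).add(file);
    }

    // Extracted helper: drop the given subtask's files for a checkpoint and,
    // if nothing is left, discard the whole checkpoint exactly once.
    void discardFilesForSubtask(long checkpointId, String subtaskPrefix) {
        Set<String> files = uploadedStates.get(checkpointId);
        if (files == null) {
            return;
        }
        files.removeIf(f -> f.startsWith(subtaskPrefix));
        if (files.isEmpty()) {
            uploadedStates.remove(checkpointId);
            onCheckpointDiscarded.accept(checkpointId);
        }
    }
}
```

The discard callback fires only after the last subtask's files are gone, which is the invariant both `notifyCheckpointAborted` and `notifyCheckpointSubsumed` need.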
[jira] [Created] (FLINK-34335) Query hints in RexSubQuery could not be printed
xuyang created FLINK-34335: -- Summary: Query hints in RexSubQuery could not be printed Key: FLINK-34335 URL: https://issues.apache.org/jira/browse/FLINK-34335 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: xuyang Fix For: 1.19.0 That is because in `RelTreeWriterImpl`, we don't take the `RexSubQuery` into account, and `RexSubQuery` uses `RelOptUtil.toString(rel)` to print itself instead of adding extra information such as query hints.
[jira] [Commented] (FLINK-34335) Query hints in RexSubQuery could not be printed
[ https://issues.apache.org/jira/browse/FLINK-34335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813470#comment-17813470 ] xuyang commented on FLINK-34335: I'll try to fix it. > Query hints in RexSubQuery could not be printed > --- > > Key: FLINK-34335 > URL: https://issues.apache.org/jira/browse/FLINK-34335 > Project: Flink > Issue Type: Bug > Components: Table SQL / Planner >Reporter: xuyang >Priority: Major > Fix For: 1.19.0 > > > That is because in `RelTreeWriterImpl`, we don't take the > `RexSubQuery` into account, and `RexSubQuery` uses `RelOptUtil.toString(rel)` to print > itself instead of adding extra information such as query hints.
Re: [PR] [FLINK-25537] [JUnit5 Migration] Module: flink-core package api-common [flink]
Jiabao-Sun commented on code in PR #23960: URL: https://github.com/apache/flink/pull/23960#discussion_r1475422173 ## flink-core/src/test/java/org/apache/flink/api/common/io/FileInputFormatTest.java: ## @@ -603,55 +587,52 @@ public void testGetStatisticsMultipleOneFileWithCachedVersion() throws IOExcepti * split has to start from the beginning. */ @Test -public void testFileInputFormatWithCompression() { -try { -String tempFile = -TestFileUtils.createTempFileDirForProvidedFormats( -temporaryFolder.newFolder(), -FileInputFormat.getSupportedCompressionFormats()); -final DummyFileInputFormat format = new DummyFileInputFormat(); -format.setFilePath(tempFile); -format.configure(new Configuration()); -FileInputSplit[] splits = format.createInputSplits(2); -final Set supportedCompressionFormats = -FileInputFormat.getSupportedCompressionFormats(); -Assert.assertEquals(supportedCompressionFormats.size(), splits.length); -for (FileInputSplit split : splits) { -Assert.assertEquals( -FileInputFormat.READ_WHOLE_SPLIT_FLAG, -split.getLength()); // unsplittable compressed files have this size as a -// flag for "read whole file" -Assert.assertEquals(0L, split.getStart()); // always read from the beginning. 
-} +void testFileInputFormatWithCompression() throws IOException { -// test if this also works for "mixed" directories -TestFileUtils.createTempFileInDirectory( -tempFile.replace("file:", ""), -"this creates a test file with a random extension (at least not .deflate)"); - -final DummyFileInputFormat formatMixed = new DummyFileInputFormat(); -formatMixed.setFilePath(tempFile); -formatMixed.configure(new Configuration()); -FileInputSplit[] splitsMixed = formatMixed.createInputSplits(2); -Assert.assertEquals(supportedCompressionFormats.size() + 1, splitsMixed.length); -for (FileInputSplit split : splitsMixed) { -final String extension = - FileInputFormat.extractFileExtension(split.getPath().getName()); -if (supportedCompressionFormats.contains(extension)) { -Assert.assertEquals( -FileInputFormat.READ_WHOLE_SPLIT_FLAG, -split.getLength()); // unsplittable compressed files have this size as a -// flag for "read whole file" -Assert.assertEquals(0L, split.getStart()); // always read from the beginning. -} else { -Assert.assertEquals(0L, split.getStart()); -Assert.assertTrue("split size not correct", split.getLength() > 0); -} -} +String tempFile = +TestFileUtils.createTempFileDirForProvidedFormats( +TempDirUtils.newFolder(temporaryFolder), +FileInputFormat.getSupportedCompressionFormats()); +final DummyFileInputFormat format = new DummyFileInputFormat(); +format.setFilePath(tempFile); +format.configure(new Configuration()); +FileInputSplit[] splits = format.createInputSplits(2); +final Set supportedCompressionFormats = +FileInputFormat.getSupportedCompressionFormats(); +assertThat(splits).hasSize(supportedCompressionFormats.size()); +for (FileInputSplit split : splits) { +assertThat(split.getLength()) +.isEqualTo( +FileInputFormat.READ_WHOLE_SPLIT_FLAG); // unsplittable compressed files +// have this size as a +// flag for "read whole file" +assertThat(split.getStart()).isEqualTo(0L); // always read from the beginning. 
Review Comment: ```suggestion assertThat(split.getStart()).isZero(); // always read from the beginning. ``` ## flink-core/src/test/java/org/apache/flink/api/common/io/FileInputFormatTest.java: ## @@ -662,35 +643,32 @@ public void testFileInputFormatWithCompression() { * split is not the compressed file size and that the compression decorator is called. */ @Test -public void testFileInputFormatWithCompressionFromFileSource() { -try { -String tempFile = -TestFileUtils.createTempFileDirForProvidedFormats( -temporaryFolder.newFolder(), -FileInputFormat.getSupportedCompressionFormats()); -DummyFileInputFormat format = new DummyFileInputFormat(); -
[jira] [Commented] (FLINK-34169) [benchmark] CI fails during test running
[ https://issues.apache.org/jira/browse/FLINK-34169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813469#comment-17813469 ] Yunfeng Zhou commented on FLINK-34169: -- Fixed in 8dd28a2e8bf10ec278ff90c99563468e294a135a. Hi [~Zakelly] Could you please help close this issue? > [benchmark] CI fails during test running > > > Key: FLINK-34169 > URL: https://issues.apache.org/jira/browse/FLINK-34169 > Project: Flink > Issue Type: Bug > Components: Benchmarks >Reporter: Zakelly Lan >Assignee: Yunfeng Zhou >Priority: Critical > > [CI link: > https://github.com/apache/flink-benchmarks/actions/runs/7580834955/job/20647663157?pr=85|https://github.com/apache/flink-benchmarks/actions/runs/7580834955/job/20647663157?pr=85] > which says: > {code:java} > // omit some stack traces > Caused by: java.util.concurrent.ExecutionException: Boxed Error > 1115 at scala.concurrent.impl.Promise$.resolver(Promise.scala:87) > 1116 at > scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:79) > 1117 at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > 1118 at > org.apache.pekko.pattern.PromiseActorRef.$bang(AskSupport.scala:629) > 1119 at org.apache.pekko.actor.ActorRef.tell(ActorRef.scala:141) > 1120 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:317) > 1121 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:222) > 1122 ... 
22 more > 1123 Caused by: java.lang.NoClassDefFoundError: > javax/activation/UnsupportedDataTypeException > 1124 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.createKnownInputChannel(SingleInputGateFactory.java:387) > 1125 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.lambda$createInputChannel$2(SingleInputGateFactory.java:353) > 1126 at > org.apache.flink.runtime.shuffle.ShuffleUtils.applyWithShuffleTypeCheck(ShuffleUtils.java:51) > 1127 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.createInputChannel(SingleInputGateFactory.java:333) > 1128 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.createInputChannelsAndTieredStorageService(SingleInputGateFactory.java:284) > 1129 at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGateFactory.create(SingleInputGateFactory.java:204) > 1130 at > org.apache.flink.runtime.io.network.NettyShuffleEnvironment.createInputGates(NettyShuffleEnvironment.java:265) > 1131 at > org.apache.flink.runtime.taskmanager.Task.(Task.java:418) > 1132 at > org.apache.flink.runtime.taskexecutor.TaskExecutor.submitTask(TaskExecutor.java:821) > 1133 at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 1134 at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 1135 at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 1136 at java.base/java.lang.reflect.Method.invoke(Method.java:566) > 1137 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRpcInvocation$1(PekkoRpcActor.java:309) > 1138 at > org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) > 1139 at > org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:307) > 1140 ... 
23 more > 1141 Caused by: java.lang.ClassNotFoundException: > javax.activation.UnsupportedDataTypeException > 1142 at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) > 1143 at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) > 1144 at > java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527) > 1145 ... 39 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34295) Release Testing Instructions: Verify FLINK-33712 Deprecate RuntimeContext#getExecutionConfig
[ https://issues.apache.org/jira/browse/FLINK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813465#comment-17813465 ] Junrui Li commented on FLINK-34295: --- [~lincoln.86xy] This change doesn't need cross testing. I think we could close this ticket. > Release Testing Instructions: Verify FLINK-33712 Deprecate > RuntimeContext#getExecutionConfig > > > Key: FLINK-34295 > URL: https://issues.apache.org/jira/browse/FLINK-34295 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Junrui Li >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34296) Release Testing Instructions: Verify FLINK-33581 Deprecate configuration getters/setters that return/set complex Java objects
[ https://issues.apache.org/jira/browse/FLINK-34296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813466#comment-17813466 ] Junrui Li commented on FLINK-34296: --- [~lincoln.86xy] This change doesn't need cross testing. I think we could close this ticket. > Release Testing Instructions: Verify FLINK-33581 Deprecate configuration > getters/setters that return/set complex Java objects > - > > Key: FLINK-34296 > URL: https://issues.apache.org/jira/browse/FLINK-34296 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Junrui Li >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-32514) FLIP-309: Support using larger checkpointing interval when source is processing backlog
[ https://issues.apache.org/jira/browse/FLINK-32514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunfeng Zhou updated FLINK-32514: - Release Note: ProcessingBacklog is introduced to indicate whether a record should be processed with low latency or high throughput. ProcessingBacklog can be set by source operators, and can be used to change the checkpoint interval of a job during runtime. > FLIP-309: Support using larger checkpointing interval when source is > processing backlog > --- > > Key: FLINK-32514 > URL: https://issues.apache.org/jira/browse/FLINK-32514 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Yunfeng Zhou >Assignee: Yunfeng Zhou >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > Umbrella issue for > https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34299) Release Testing Instructions: Verify FLINK-33203 Adding a separate configuration for specifying Java Options of the SQL Gateway
[ https://issues.apache.org/jira/browse/FLINK-34299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813464#comment-17813464 ] Yangze Guo commented on FLINK-34299: This change doesn't need cross-team testing. > Release Testing Instructions: Verify FLINK-33203 Adding a separate > configuration for specifying Java Options of the SQL Gateway > > > Key: FLINK-34299 > URL: https://issues.apache.org/jira/browse/FLINK-34299 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Gateway >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Yangze Guo >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-34299) Release Testing Instructions: Verify FLINK-33203 Adding a separate configuration for specifying Java Options of the SQL Gateway
[ https://issues.apache.org/jira/browse/FLINK-34299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yangze Guo closed FLINK-34299. -- Resolution: Fixed > Release Testing Instructions: Verify FLINK-33203 Adding a separate > configuration for specifying Java Options of the SQL Gateway > > > Key: FLINK-34299 > URL: https://issues.apache.org/jira/browse/FLINK-34299 > Project: Flink > Issue Type: Sub-task > Components: Table SQL / Gateway >Affects Versions: 1.19.0 >Reporter: lincoln lee >Assignee: Yangze Guo >Priority: Blocker > Labels: release-testing > Fix For: 1.19.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-34258) Incorrect example of accumulator usage within emitUpdateWithRetract for TableAggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-34258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jane Chan closed FLINK-34258. - > Incorrect example of accumulator usage within emitUpdateWithRetract for > TableAggregateFunction > -- > > Key: FLINK-34258 > URL: https://issues.apache.org/jira/browse/FLINK-34258 > Project: Flink > Issue Type: Bug > Components: Documentation, Table SQL / API >Affects Versions: 1.19.0, 1.18.1 >Reporter: Jane Chan >Assignee: Jane Chan >Priority: Minor > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: GroupTableAggHandler$10.java > > > The > [documentation|https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#retraction-example] > provides an example of using `emitUpdateWithRetract`. However, the example > is misleading as it incorrectly suggests that the accumulator can be updated > within the `emitUpdateWithRetract method`. In reality, the order of > invocation is to first call `getAccumulator` and then > `emitUpdateWithRetract`, which means that updating the accumulator within > `emitUpdateWithRetract` will not take effect. Please see > [GroupTableAggFunction#L141|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L141] > ~ > [GroupTableAggFunction#L146|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L146] > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (FLINK-34258) Incorrect example of accumulator usage within emitUpdateWithRetract for TableAggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-34258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jane Chan resolved FLINK-34258. --- Fix Version/s: 1.19.0 Resolution: Fixed Fixed in master 0779c91e581dc16c4aef61d6cc27774f11495907 > Incorrect example of accumulator usage within emitUpdateWithRetract for > TableAggregateFunction > -- > > Key: FLINK-34258 > URL: https://issues.apache.org/jira/browse/FLINK-34258 > Project: Flink > Issue Type: Bug > Components: Documentation, Table SQL / API >Affects Versions: 1.19.0, 1.18.1 >Reporter: Jane Chan >Assignee: Jane Chan >Priority: Minor > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: GroupTableAggHandler$10.java > > > The > [documentation|https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#retraction-example] > provides an example of using `emitUpdateWithRetract`. However, the example > is misleading as it incorrectly suggests that the accumulator can be updated > within the `emitUpdateWithRetract method`. In reality, the order of > invocation is to first call `getAccumulator` and then > `emitUpdateWithRetract`, which means that updating the accumulator within > `emitUpdateWithRetract` will not take effect. Please see > [GroupTableAggFunction#L141|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L141] > ~ > [GroupTableAggFunction#L146|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L146] > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-34258) Incorrect example of accumulator usage within emitUpdateWithRetract for TableAggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-34258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jane Chan reassigned FLINK-34258: - Assignee: Jane Chan > Incorrect example of accumulator usage within emitUpdateWithRetract for > TableAggregateFunction > -- > > Key: FLINK-34258 > URL: https://issues.apache.org/jira/browse/FLINK-34258 > Project: Flink > Issue Type: Bug > Components: Documentation, Table SQL / API >Affects Versions: 1.19.0, 1.18.1 >Reporter: Jane Chan >Assignee: Jane Chan >Priority: Minor > Labels: pull-request-available > Attachments: GroupTableAggHandler$10.java > > > The > [documentation|https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#retraction-example] > provides an example of using `emitUpdateWithRetract`. However, the example > is misleading as it incorrectly suggests that the accumulator can be updated > within the `emitUpdateWithRetract method`. In reality, the order of > invocation is to first call `getAccumulator` and then > `emitUpdateWithRetract`, which means that updating the accumulator within > `emitUpdateWithRetract` will not take effect. Please see > [GroupTableAggFunction#L141|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L141] > ~ > [GroupTableAggFunction#L146|https://github.com/apache/flink/blob/20450485b20cb213b96318b0c3275e42c0300e15/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/operators/aggregate/GroupTableAggFunction.java#L146] > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-34258][docs][table] Fix incorrect retract example for TableAggregateFunction [flink]
LadyForest merged PR #24215: URL: https://github.com/apache/flink/pull/24215 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DRAFT][FLINK-33734] Merge unaligned checkpoint state handle [flink]
flinkbot commented on PR #24247: URL: https://github.com/apache/flink/pull/24247#issuecomment-1922655270 ## CI report: * 5d0a461e79a4b6446e73c171bac69482f9d3251f UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-33734) Merge unaligned checkpoint state handle
[ https://issues.apache.org/jira/browse/FLINK-33734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-33734: --- Labels: pull-request-available (was: ) > Merge unaligned checkpoint state handle > --- > > Key: FLINK-33734 > URL: https://issues.apache.org/jira/browse/FLINK-33734 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Reporter: Feifan Wang >Assignee: Feifan Wang >Priority: Major > Labels: pull-request-available > > h3. Background > Unaligned checkpoint will write the inflight-data of all InputChannel and > ResultSubpartition of the same subtask to the same file during checkpoint. > The InputChannelStateHandle and ResultSubpartitionStateHandle organize the > metadata of inflight-data at the channel granularity, which causes the file > name to be repeated many times. When a job is under backpressure and task > parallelism is high, the metadata of unaligned checkpoints will bloat. This > will result in: > # The amount of data reported by taskmanager to jobmanager increases, and > jobmanager takes longer to process these RPC requests. > # The metadata of the entire checkpoint becomes very large, and it takes > longer to serialize and write it to dfs. > Both of the above points ultimately lead to longer checkpoint duration. > h3. A Production example > Take our production job with a parallelism of 4800 as an example: > # When there is no back pressure, checkpoint end-to-end duration is within 7 > seconds. > # When under pressure: checkpoint end-to-end duration often exceeds 1 > minute. We found that jobmanager took more than 40 seconds to process rpc > requests, and serialized metadata took more than 20 seconds. Some checkpoint > statistics: > |metadata file size|950 MB| > |channel state count|12,229,854| > |channel file count|5536| > Of the 950MB in the metadata file, 68% are redundant file paths. 
> We enabled log-based checkpoint on this job and hoped that the checkpoint > could be completed within 30 seconds. This problem made it difficult to > achieve this goal. > h3. Propose changes > I suggest introducing MergedInputChannelStateHandle and > MergedResultSubpartitionStateHandle to eliminate redundant file paths. > The taskmanager merges all InputChannelStateHandles with the same delegated > StreamStateHandle in the same subtask into one MergedInputChannelStateHandle > before reporting. When recovering from checkpoint, jobmanager converts > MergedInputChannelStateHandle to InputChannelStateHandle collection before > assigning state handle, and the rest of the process does not need to be > changed. > Structure of MergedInputChannelStateHandle : > > {code:java} > { // MergedInputChannelStateHandle > "delegate": { > "filePath": > "viewfs://hadoop-meituan/flink-yg15/checkpoints/retained/1234567/ab8d0c2f02a47586490b15e7a2c30555/chk-31/ffe54c0a-9b6e-4724-aae7-61b96bf8b1cf", > "stateSize": 123456 > }, > "size": 2000, > "subtaskIndex":0, > "channels": [ // One InputChannel per element > { > "info": { > "gateIdx": 0, > "inputChannelIdx": 0 > }, > "offsets": [ > 100,200,300,400 > ], > "size": 1400 > }, > { > "info": { > "gateIdx": 0, > "inputChannelIdx": 1 > }, > "offsets": [ > 500,600 > ], > "size": 600 > } > ] > } > {code} > MergedResultSubpartitionStateHandle is similar. > > > WDYT [~roman] , [~pnowojski] , [~fanrui] ? -- This message was sent by Atlassian Jira (v8.20.10#820010)
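The grouping step proposed above can be sketched as follows. The record types here are simplified illustrative stand-ins, not Flink's real handle classes: handles sharing the same delegate file path are collected under one merged handle, so the path is stored once instead of once per channel.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MergeHandlesSketch {
    // Simplified stand-ins for Flink's channel state handle types.
    record ChannelInfo(int gateIdx, int inputChannelIdx) {}

    record InputChannelStateHandle(String delegatePath, ChannelInfo info, List<Long> offsets, long size) {}

    record MergedInputChannelStateHandle(
            String delegatePath, List<InputChannelStateHandle> channels, long totalSize) {}

    // Group per-channel handles by their delegate file, emitting one merged
    // handle per underlying file.
    static List<MergedInputChannelStateHandle> merge(List<InputChannelStateHandle> handles) {
        Map<String, List<InputChannelStateHandle>> byFile =
                handles.stream()
                        .collect(Collectors.groupingBy(
                                InputChannelStateHandle::delegatePath,
                                LinkedHashMap::new,
                                Collectors.toList()));
        return byFile.entrySet().stream()
                .map(e -> new MergedInputChannelStateHandle(
                        e.getKey(),
                        e.getValue(),
                        e.getValue().stream().mapToLong(InputChannelStateHandle::size).sum()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<InputChannelStateHandle> handles = List.of(
                new InputChannelStateHandle(
                        "chk-31/file-a", new ChannelInfo(0, 0), List.of(100L, 200L, 300L, 400L), 1400L),
                new InputChannelStateHandle(
                        "chk-31/file-a", new ChannelInfo(0, 1), List.of(500L, 600L), 600L));
        List<MergedInputChannelStateHandle> merged = merge(handles);
        // Two channels collapse into one merged handle; the shared path appears once.
        System.out.println(merged.size() + " " + merged.get(0).channels().size() + " " + merged.get(0).totalSize());
    }
}
```

Recovery would do the inverse: expand each merged handle back into per-channel handles before state assignment, as the proposal describes.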
[PR] [FLINK-33734] Merge unaligned checkpoint state handle [flink]
zoltar9264 opened a new pull request, #24247: URL: https://github.com/apache/flink/pull/24247 ## What is the purpose of the change Merge and serialize the state handles of unaligned checkpoint on TaskManager to potentially reduce checkpoint production time. Details in [FLINK-33734](https://issues.apache.org/jira/browse/FLINK-33734). ## Brief change log - Extract interface ChannelState,InputStateHandle,OutputStateHandle as mark of state handle of unaligned checkpoint - Introduce merged channel state handle - Extract interface ChannelStateHandleSerializer - Introduce MetadataV5Serializer for merged channel state handle - Merge and serialize channel state handle before send checkpoint ack to JobManager ## Verifying this change Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing *(Please pick either of the following options)* This change is a trivial rework / code cleanup without any test coverage. *(or)* This change is already covered by existing tests, such as *(please describe tests)*. 
*(or)* This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end deployment with large payloads (100MB)* - *Extended integration test for recovery after master (JobManager) failure* - *Added test that validates that TaskInfo is transferred only once across recoveries* - *Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.* ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (no) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-27054) Elasticsearch SQL connector SSL issue
[ https://issues.apache.org/jira/browse/FLINK-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813391#comment-17813391 ] Mingliang Liu commented on FLINK-27054: --- Any updates on this? My understanding is this problem (not supporting SSL) exists in both ES 6 and ES 7 connectors, both SQL and non-SQL (DataStream), correct? > Elasticsearch SQL connector SSL issue > - > > Key: FLINK-27054 > URL: https://issues.apache.org/jira/browse/FLINK-27054 > Project: Flink > Issue Type: Bug > Components: Connectors / ElasticSearch >Reporter: ricardo >Assignee: Kelu Tao >Priority: Major > > The current Flink ElasticSearch SQL connector > https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/elasticsearch/ > is missing SSL options, can't connect to ES clusters which require SSL > certificate. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-34111][table] Add JSON_QUOTE and JSON_UNQUOTE function [flink]
JingGe commented on PR #24156: URL: https://github.com/apache/flink/pull/24156#issuecomment-1922027269 @flinkbot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
XComp commented on code in PR #24246: URL: https://github.com/apache/flink/pull/24246#discussion_r1474488421 ## flink-runtime/src/test/java/org/apache/flink/runtime/testutils/CommonTestUtils.java: ## @@ -367,9 +367,11 @@ public static void waitForCheckpoint(JobID jobID, MiniCluster miniCluster, int n }); } -/** Wait for on more completed checkpoint. */ -public static void waitForOneMoreCheckpoint(JobID jobID, MiniCluster miniCluster) -throws Exception { +/** + * Wait for a new completed checkpoint. Note: we wait for 2 or more checkpoint to ensure the + * latest checkpoint start after waitForTwoOrMoreCheckpoint is called. + */ +public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster) throws Exception { Review Comment: ```suggestion public static void waitForNewCheckpoint(JobID jobID, MiniCluster miniCluster, int checkpointCount) throws Exception { ``` Can't we make the number of checkpoints to wait for configurable? That way, we can pass in `2` in the test implementation analogously to `waitForCheckpoint`. I also feel like we can remove some redundant code within the two methods. :thinking: ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -328,7 +330,7 @@ public void testCheckpointRescalingNonPartitionedStateCausesException() throws E // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: So far, we've only seen the issue in `#testCheckpointRescalingInKeyedState`. We don't need the two checkpoints here, actually, because we're not relying on elements in the test. We could keep the tests functionality if we make the `waitForOneMoreCheckpoint` configurable as suggested one of my previous comments. 
## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -261,7 +263,7 @@ public void testCheckpointRescalingKeyedState(boolean scaleOut) throws Exception waitForAllTaskRunning(cluster.getMiniCluster(), jobGraph.getJobID(), false); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: We should add a comment here explaining why we need to wait for 2 instead of one checkpoint. ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -513,7 +515,7 @@ public void testCheckpointRescalingPartitionedOperatorState( // wait until the operator handles some data StateSourceBase.workStartedLatch.await(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: AFAIU, we don't need to change it here. ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -411,7 +413,7 @@ public void testCheckpointRescalingWithKeyedAndNonPartitionedState() throws Exce // clear the CollectionSink set for the restarted job CollectionSink.clearElementsSet(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: I guess the scenario can happen in this test as well because it's almost the same test implementation as in `#testCheckpointRescalingKeyedState` :thinking: ## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ## @@ -411,7 +413,7 @@ public void testCheckpointRescalingWithKeyedAndNonPartitionedState() throws Exce // clear the CollectionSink set for the restarted job CollectionSink.clearElementsSet(); -waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); +waitForNewCheckpoint(jobID, cluster.getMiniCluster()); Review Comment: I'm wondering whether the redundant code could be removed here. 
But that's probably a bit out-of-scope for this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
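The reviewer's suggestion — making the number of checkpoints to wait for a parameter — can be sketched as a generic polling helper. This is a hypothetical analogue of the test utility under discussion, not Flink's actual `CommonTestUtils` code; the supplier stands in for querying the MiniCluster's completed-checkpoint count:

```java
import java.util.function.Supplier;

public class WaitSketch {
    // Wait until at least `count` new checkpoints have completed since this
    // method was called, polling on a fixed interval until the timeout.
    static void waitForNewCheckpoints(
            Supplier<Integer> completedCheckpoints, int count, long timeoutMs)
            throws InterruptedException {
        int baseline = completedCheckpoints.get();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (completedCheckpoints.get() - baseline < count) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException(
                        "Timed out waiting for " + count + " new checkpoint(s)");
            }
            Thread.sleep(10);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counter = {0};
        // Simulate a job completing a checkpoint every few milliseconds.
        Thread job = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (counter) {
                    counter[0]++;
                }
            }
        });
        job.start();
        // Passing count = 2 matches the "wait for 2 or more" semantics discussed
        // above: the latest completed checkpoint must have started after the call.
        waitForNewCheckpoints(() -> {
            synchronized (counter) {
                return counter[0];
            }
        }, 2, 5_000);
        job.join();
        System.out.println(counter[0] >= 2);
    }
}
```

Taking the baseline first is what distinguishes "new checkpoints" from "total checkpoints", which is the semantic point debated in the thread.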
Re: [PR] [FLINK-33634] Add Conditions to Flink CRD's Status field [flink-kubernetes-operator]
gyfora commented on PR #749: URL: https://github.com/apache/flink-kubernetes-operator/pull/749#issuecomment-1921722522 @lajith2006 do you already have a Jira account? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-34148) Potential regression (Jan. 13): stringWrite with Java8
[ https://issues.apache.org/jira/browse/FLINK-34148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813319#comment-17813319 ] Sergey Nuyanzin commented on FLINK-34148: - As decided in the release meeting, the flink-shaded upgrade was reverted in d6c7eee8243b4fe3e593698f250643534dc79cb5. [~Zakelly], [~martijnvisser] could you please check the perf report? > Potential regression (Jan. 13): stringWrite with Java8 > -- > > Key: FLINK-34148 > URL: https://issues.apache.org/jira/browse/FLINK-34148 > Project: Flink > Issue Type: Improvement > Components: API / Type Serialization System >Reporter: Zakelly Lan >Priority: Blocker > Fix For: 1.19.0 > > > Significant drop of performance in stringWrite with Java8 from commit > [881062f352|https://github.com/apache/flink/commit/881062f352f8bf8c21ab7cbea95e111fd82fdf20] > to > [5d9d8748b6|https://github.com/apache/flink/commit/5d9d8748b64ff1a75964a5cd2857ab5061312b51] > . It only involves strings not so long (128 or 4). > stringWrite.128.ascii(Java8) baseline=1089.107756 current_value=754.52452 > stringWrite.128.chinese(Java8) baseline=504.244575 current_value=295.358989 > stringWrite.128.russian(Java8) baseline=655.582639 current_value=421.030188 > stringWrite.4.chinese(Java8) baseline=9598.791964 current_value=6627.929927 > stringWrite.4.russian(Java8) baseline=11070.666415 current_value=8289.95767 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34132) Batch WordCount job fails when run with AdaptiveBatch scheduler
[ https://issues.apache.org/jira/browse/FLINK-34132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813316#comment-17813316 ] Wencong Liu commented on FLINK-34132: - Thanks for the reminding. [~zhuzh] I will address these issues when I have some free time. > Batch WordCount job fails when run with AdaptiveBatch scheduler > --- > > Key: FLINK-34132 > URL: https://issues.apache.org/jira/browse/FLINK-34132 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.17.1, 1.18.1 >Reporter: Prabhu Joseph >Assignee: Junrui Li >Priority: Major > Labels: pull-request-available > Fix For: 1.19.0 > > > Batch WordCount job fails when run with AdaptiveBatch scheduler. > *Repro Steps* > {code:java} > flink-yarn-session -Djobmanager.scheduler=adaptive -d > flink run -d /usr/lib/flink/examples/batch/WordCount.jar --input > s3://prabhuflinks3/INPUT --output s3://prabhuflinks3/OUT > {code} > *Error logs* > {code:java} > The program finished with the following exception: > org.apache.flink.client.program.ProgramInvocationException: The main method > caused an error: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. 
> at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) > at > org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:105) > at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:851) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:245) > at > org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1095) > at > org.apache.flink.client.cli.CliFrontend.lambda$mainInternal$9(CliFrontend.java:1189) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > at > org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at > org.apache.flink.client.cli.CliFrontend.mainInternal(CliFrontend.java:1189) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1157) > Caused by: java.lang.RuntimeException: > java.util.concurrent.ExecutionException: java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. 
> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321) > at > org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:1067) > at > org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:144) > at > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:73) > at > org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:106) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) > ... 12 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > at > org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:1062) > ... 20 more > Caused by: java.lang.RuntimeException: > org.apache.flink.runtime.client.JobInitializationException: Could not start > the JobMaster. > at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321) > at > org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) > at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) > at > java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457) > at
Re: [PR] [FLINK-33089] Clean up 1.13 and 1.14 references from docs and code [flink-kubernetes-operator]
mxm merged PR #768: URL: https://github.com/apache/flink-kubernetes-operator/pull/768
[jira] [Updated] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34333: -- Description: FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which required an update of the k8s client to v6.9.0. This Jira issue is about finding a solution in Flink 1.18 for the very same problem FLINK-34007 covered. It's a dedicated Jira issue because we want to unblock the release of 1.19 by resolving FLINK-34007. Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in v6.6.2 which might prevent the leadership lost event being forwarded to the client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). An initial proposal where the release call was handled in Flink's {{KubernetesLeaderElector}} didn't work due to the leadership lost event being triggered twice (see [FLINK-34007 PR comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902]) was: FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which required an update of the k8s client to v6.9.0. This Jira issue is about finding a solution in Flink 1.18 for the very same problem FLINK-34007 covered. It's a dedicated Jira issue because we want to unblock the release of 1.19 by resolving FLINK-34007. > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. 
> This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007. > Just to summarize why the upgrade to v6.9.0 is desired: There's a bug in > v6.6.2 which might prevent the leadership lost event being forwarded to the > client ([#5463|https://github.com/fabric8io/kubernetes-client/issues/5463]). > An initial proposal where the release call was handled in Flink's > {{KubernetesLeaderElector}} didn't work due to the leadership lost event > being triggered twice (see [FLINK-34007 PR > comment|https://github.com/apache/flink/pull/24132#discussion_r1467175902]) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-34319) Bump okhttp version to 4.12.0
[ https://issues.apache.org/jira/browse/FLINK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gyula Fora closed FLINK-34319. -- Fix Version/s: kubernetes-operator-1.8.0 (was: 1.8.0) Assignee: ConradJam Resolution: Fixed merged to main ca3a746e42beb4816c4f3fb7d5b9aea6fc757b32 > Bump okhttp version to 4.12.0 > - > > Key: FLINK-34319 > URL: https://issues.apache.org/jira/browse/FLINK-34319 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: 1.7.2 >Reporter: ConradJam >Assignee: ConradJam >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-1.8.0 > > > Bump okhttp version to 4.12.0
Re: [PR] [FLINK-34319] Bump okhttp version to 4.12.0 [flink-kubernetes-operator]
gyfora merged PR #766: URL: https://github.com/apache/flink-kubernetes-operator/pull/766
Re: [PR] [FLINK-33557] Externalize Cassandra Python connector code [flink-connector-cassandra]
ferenc-csaky commented on PR #26: URL: https://github.com/apache/flink-connector-cassandra/pull/26#issuecomment-1921525300 @pvary @gaborgsomogyi JAR included, CI run passed.
[jira] [Updated] (FLINK-31220) Replace Pod with PodTemplateSpec for the pod template properties
[ https://issues.apache.org/jira/browse/FLINK-31220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-31220: --- Labels: pull-request-available (was: ) > Replace Pod with PodTemplateSpec for the pod template properties > > > Key: FLINK-31220 > URL: https://issues.apache.org/jira/browse/FLINK-31220 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Assignee: Gyula Fora >Priority: Major > Labels: pull-request-available > > The current podtemplate fields in the CR use the Pod object for schema. This > doesn't make sense as status and other fields should never be specified and > they take no effect. > We should replace this with PodTemplateSpec and make sure that this is not a > breaking change even if users incorrectly specified status before.
Re: [PR] [FLINK-31220] Replace Pod with PodTemplateSpec for the pod template properties [flink-kubernetes-operator]
gyfora commented on PR #770: URL: https://github.com/apache/flink-kubernetes-operator/pull/770#issuecomment-1921435716 cc @afedulov
[PR] [FLINK-31220] Replace Pod with PodTemplateSpec for the pod template properties [flink-kubernetes-operator]
gyfora opened a new pull request, #770: URL: https://github.com/apache/flink-kubernetes-operator/pull/770 ## What is the purpose of the change The PodTemplate fields in the CRD currently incorrectly use the Pod type, when they should be of the PodTemplateSpec type. This leads to unnecessary (meaningless) fields and a very large CRD for no gain. Also, accidentally changing fields like apiKind would trigger an unnecessary upgrade of the job. This PR is the first step of correcting this mistake. Unfortunately we cannot simply change the type from Pod to PodTemplateSpec as it's a backward incompatible change for the users. While Kubernetes itself would drop the now unsupported fields from the stored CRs themselves, users would not be able to submit the same CR again as they would get a validation error from the API server. We should defer the CRD schema change to the next version. ## Brief change log - *Change Pod -> PodTemplateSpec but keep the schema in CRD* - *Improve SpecDiff logic to avoid accidental upgrade due to the dropped fields* ## Verifying this change New unit tests + manual verification in local environment. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no - Core observer or reconciler logic that is regularly executed: yes ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? JavaDocs
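The second change-log item above can be sketched as follows. This is not the operator's actual SpecDiff implementation — the map-based representation and the field names are illustrative assumptions — but it shows the idea: fields that the narrowed PodTemplateSpec schema drops (such as `status`) must be ignored when diffing specs, so their disappearance from a resubmitted CR does not look like a user-initiated change and trigger an upgrade.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of diff logic that tolerates fields dropped by a schema
// change. Field names here are assumptions, not the operator's real schema.
public class PodTemplateDiff {

    // Fields present under the old Pod schema but absent from PodTemplateSpec.
    private static final Set<String> DROPPED_FIELDS = Set.of("status", "apiVersion", "kind");

    public static boolean specChanged(Map<String, ?> oldSpec, Map<String, ?> newSpec) {
        return !withoutDropped(oldSpec).equals(withoutDropped(newSpec));
    }

    private static Map<String, Object> withoutDropped(Map<String, ?> spec) {
        Map<String, Object> copy = new HashMap<>(spec);
        copy.keySet().removeAll(DROPPED_FIELDS);
        return copy;
    }

    public static void main(String[] args) {
        // A stored CR that still carries a "status" field under the old schema.
        Map<String, String> stored = Map.of("spec", "podSpec-v1", "status", "Running");
        // The same CR resubmitted after the schema narrowed and dropped "status".
        Map<String, String> submitted = Map.of("spec", "podSpec-v1");
        if (specChanged(stored, submitted)) {
            throw new AssertionError("dropped fields must not count as a spec change");
        }
        System.out.println("no spurious upgrade triggered");
    }
}
```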
Re: [PR] [hotfix] Remove unnecessary fields from podTemplate examples [flink-kubernetes-operator]
gyfora closed pull request #767: [hotfix] Remove unnecessary fields from podTemplate examples URL: https://github.com/apache/flink-kubernetes-operator/pull/767
Re: [PR] [FLINK-31472] Disable Intermittently failing throttling test [flink]
vahmed-hamdy commented on PR #24175: URL: https://github.com/apache/flink/pull/24175#issuecomment-1921379510 @flinkbot run azure
[jira] [Commented] (FLINK-34325) Inconsistent state with data loss after OutOfMemoryError
[ https://issues.apache.org/jira/browse/FLINK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813238#comment-17813238 ] Alexis Sarda-Espinosa commented on FLINK-34325: --- I will say that, while I can of course reproduce the OOM problems, I cannot reliably reproduce the inconsistency, most of the time the job really ends up in a crashloop until I increase memory or clean up the state. > Inconsistent state with data loss after OutOfMemoryError > > > Key: FLINK-34325 > URL: https://issues.apache.org/jira/browse/FLINK-34325 > Project: Flink > Issue Type: Bug >Affects Versions: 1.17.1 > Environment: Flink on Kubernetes with HA, RocksDB with incremental > checkpoints on Azure >Reporter: Alexis Sarda-Espinosa >Priority: Major > Attachments: jobmanager_log.txt > > > I have a job that uses broadcast state to maintain a cache of required > metadata. I am currently evaluating memory requirements of my specific use > case, and I ran into a weird situation that seems worrisome. > All sources in my job are Kafka sources. I wrote a large amount of messages > in Kafka to force the broadcast state's cache to grow. At some point, this > caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job > Manager. I would have expected the whole java process of the JM to crash, but > the job was simply restarted. What's worrisome is that, after 2 checkpoint > failures ^1^, the job restarted and subsequently resumed from the latest > successful checkpoint and completely ignored all the events I wrote to Kafka, > which I can verify because I have a custom metric that exposes the > approximate size of this cache, and the fact that the job didn't crashloop at > this point after reading all the messages from Kafka over and over again. > I'm attaching an excerpt of the Job Manager's logs. 
My main concerns are: > # It seems the memory error from the JM didn't prevent the Kafka offsets from > being "rolled back", so eventually the Kafka events that should have ended in > the broadcast state's cache were ignored. > # -Is it normal that the state is somehow "materialized" in the JM and is > thus affected by the size of the JM's heap? Is this something particular due > to the use of broadcast state? I found this very surprising.- See comments > ^1^ Two failures are expected since > {{execution.checkpointing.tolerable-failed-checkpoints=1}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34325) Inconsistent state with data loss after OutOfMemoryError
[ https://issues.apache.org/jira/browse/FLINK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813238#comment-17813238 ] Alexis Sarda-Espinosa edited comment on FLINK-34325 at 2/1/24 1:49 PM: --- I will say that, while I can of course reproduce the OOM problems, I cannot reliably reproduce the state inconsistency, most of the time the job really ends up in a crashloop until I increase memory or clean up the state. was (Author: asardaes): I will say that, while I can of course reproduce the OOM problems, I cannot reliably reproduce the inconsistency, most of the time the job really ends up in a crashloop until I increase memory or clean up the state. > Inconsistent state with data loss after OutOfMemoryError > > > Key: FLINK-34325 > URL: https://issues.apache.org/jira/browse/FLINK-34325 > Project: Flink > Issue Type: Bug >Affects Versions: 1.17.1 > Environment: Flink on Kubernetes with HA, RocksDB with incremental > checkpoints on Azure >Reporter: Alexis Sarda-Espinosa >Priority: Major > Attachments: jobmanager_log.txt > > > I have a job that uses broadcast state to maintain a cache of required > metadata. I am currently evaluating memory requirements of my specific use > case, and I ran into a weird situation that seems worrisome. > All sources in my job are Kafka sources. I wrote a large amount of messages > in Kafka to force the broadcast state's cache to grow. At some point, this > caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job > Manager. I would have expected the whole java process of the JM to crash, but > the job was simply restarted. 
What's worrisome is that, after 2 checkpoint > failures ^1^, the job restarted and subsequently resumed from the latest > successful checkpoint and completely ignored all the events I wrote to Kafka, > which I can verify because I have a custom metric that exposes the > approximate size of this cache, and the fact that the job didn't crashloop at > this point after reading all the messages from Kafka over and over again. > I'm attaching an excerpt of the Job Manager's logs. My main concerns are: > # It seems the memory error from the JM didn't prevent the Kafka offsets from > being "rolled back", so eventually the Kafka events that should have ended in > the broadcast state's cache were ignored. > # -Is it normal that the state is somehow "materialized" in the JM and is > thus affected by the size of the JM's heap? Is this something particular due > to the use of broadcast state? I found this very surprising.- See comments > ^1^ Two failures are expected since > {{execution.checkpointing.tolerable-failed-checkpoints=1}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34325) Inconsistent state with data loss after OutOfMemoryError
[ https://issues.apache.org/jira/browse/FLINK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Sarda-Espinosa updated FLINK-34325: -- Description: I have a job that uses broadcast state to maintain a cache of required metadata. I am currently evaluating memory requirements of my specific use case, and I ran into a weird situation that seems worrisome. All sources in my job are Kafka sources. I wrote a large amount of messages in Kafka to force the broadcast state's cache to grow. At some point, this caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job Manager. I would have expected the whole java process of the JM to crash, but the job was simply restarted. What's worrisome is that, after 2 checkpoint failures ^1^, the job restarted and subsequently resumed from the latest successful checkpoint and completely ignored all the events I wrote to Kafka, which I can verify because I have a custom metric that exposes the approximate size of this cache, and the fact that the job didn't crashloop at this point after reading all the messages from Kafka over and over again. I'm attaching an excerpt of the Job Manager's logs. My main concerns are: # It seems the memory error from the JM didn't prevent the Kafka offsets from being "rolled back", so eventually the Kafka events that should have ended in the broadcast state's cache were ignored. # -Is it normal that the state is somehow "materialized" in the JM and is thus affected by the size of the JM's heap? Is this something particular due to the use of broadcast state? I found this very surprising.- See comments ^1^ Two failures are expected since {{execution.checkpointing.tolerable-failed-checkpoints=1}} was: I have a job that uses broadcast state to maintain a cache of required metadata. I am currently evaluating memory requirements of my specific use case, and I ran into a weird situation that seems worrisome. All sources in my job are Kafka sources. 
I wrote a large amount of messages in Kafka to force the broadcast state's cache to grow. At some point, this caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job Manager. I would have expected the whole java process of the JM to crash, but the job was simply restarted. What's worrisome is that, after 2 restarts, the job resumed from the latest successful checkpoint and completely ignored all the events I wrote to Kafka, which I can verify because I have a custom metric that exposes the approximate size of this cache, and the fact that the job didn't crashloop at this point after reading all the messages from Kafka over and over again. I'm attaching an excerpt of the Job Manager's logs. My main concerns are: # It seems the memory error from the JM didn't prevent the Kafka offsets from being "rolled back", so eventually the Kafka events that should have ended in the broadcast state's cache were ignored. # Is it normal that the state is somehow "materialized" in the JM and is thus affected by the size of the JM's heap? Is this something particular due to the use of broadcast state? I found this very surprising. > Inconsistent state with data loss after OutOfMemoryError > > > Key: FLINK-34325 > URL: https://issues.apache.org/jira/browse/FLINK-34325 > Project: Flink > Issue Type: Bug >Affects Versions: 1.17.1 > Environment: Flink on Kubernetes with HA, RocksDB with incremental > checkpoints on Azure >Reporter: Alexis Sarda-Espinosa >Priority: Major > Attachments: jobmanager_log.txt > > > I have a job that uses broadcast state to maintain a cache of required > metadata. I am currently evaluating memory requirements of my specific use > case, and I ran into a weird situation that seems worrisome. > All sources in my job are Kafka sources. I wrote a large amount of messages > in Kafka to force the broadcast state's cache to grow. At some point, this > caused an "{{java.lang.OutOfMemoryError: Java heap space}}" error in the Job > Manager. 
I would have expected the whole java process of the JM to crash, but > the job was simply restarted. What's worrisome is that, after 2 checkpoint > failures ^1^, the job restarted and subsequently resumed from the latest > successful checkpoint and completely ignored all the events I wrote to Kafka, > which I can verify because I have a custom metric that exposes the > approximate size of this cache, and the fact that the job didn't crashloop at > this point after reading all the messages from Kafka over and over again. > I'm attaching an excerpt of the Job Manager's logs. My main concerns are: > # It seems the memory error from the JM didn't prevent the Kafka offsets from > being "rolled back", so eventually the Kafka events that should have ended in
[jira] [Commented] (FLINK-34328) Release Testing Instructions: Verify FLINK-34037 Improve Serialization Configuration And Usage In Flink
[ https://issues.apache.org/jira/browse/FLINK-34328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813234#comment-17813234 ] lincoln lee commented on FLINK-34328: - [~Zhanghao Chen] Assigned to you :) > Release Testing Instructions: Verify FLINK-34037 Improve Serialization > Configuration And Usage In Flink > --- > > Key: FLINK-34328 > URL: https://issues.apache.org/jira/browse/FLINK-34328 > Project: Flink > Issue Type: Sub-task > Components: API / Type Serialization System, Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: Zhanghao Chen >Assignee: Zhanghao Chen >Priority: Major > Fix For: 1.19.0 > >
[jira] [Assigned] (FLINK-34328) Release Testing Instructions: Verify FLINK-34037 Improve Serialization Configuration And Usage In Flink
[ https://issues.apache.org/jira/browse/FLINK-34328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lincoln lee reassigned FLINK-34328: --- Assignee: Zhanghao Chen > Release Testing Instructions: Verify FLINK-34037 Improve Serialization > Configuration And Usage In Flink > --- > > Key: FLINK-34328 > URL: https://issues.apache.org/jira/browse/FLINK-34328 > Project: Flink > Issue Type: Sub-task > Components: API / Type Serialization System, Runtime / Configuration >Affects Versions: 1.19.0 >Reporter: Zhanghao Chen >Assignee: Zhanghao Chen >Priority: Major > Fix For: 1.19.0 > >
[jira] [Commented] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813215#comment-17813215 ] Matthias Pohl commented on FLINK-34333: --- I think it's fine to do the upgrade from v6.6.2 to v6.9.0 even for 1.18: * The only change that seems to affect Flink is in the Mock-framework and only affects test code * Most of the changes are cleanup changes, except for: ** Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Changed behavior in certain scenarios ** Fix [#5125|https://github.com/fabric8io/kubernetes-client/issues/5125]: Different default TLS version ** Fix [#1335|https://github.com/fabric8io/kubernetes-client/issues/1335]: Different fallback used for proxy configuration Generally, we could argue that downstream project should rely on transitive dependencies provided by Flink. And fixing the bug out-weighs the stability concerns here. Do you see a problem with this argument [~gyfora], [~wangyang0918]? The alternatives to doing the upgrade are: * Reverting the upgrade (i.e. going back from v6.6.2 to v5.12.4). This would allow us to get to the stable version that was tested with 1.17- * Provide a Flink-customized implementation of the fabric8io {{LeaderElector}} class with the cherry-picked changes of [8f8c438f|https://github.com/fabric8io/kubernetes-client/commit/8f8c438f] and [0f6c6965|https://github.com/fabric8io/kubernetes-client/commit/0f6c6965]. As a consequence, we would stick to fabric8io:kubernetes-client v.6.6.2 > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. 
This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. > This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007.
Re: [PR] [FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally [flink]
flinkbot commented on PR #24246: URL: https://github.com/apache/flink/pull/24246#issuecomment-1921304022 ## CI report: * 9eec91beb3f2ac2344454b9402ba2505201f0e49 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Commented] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813205#comment-17813205 ] Rui Fan commented on FLINK-34200: - I have attached the detailed log in this JIRA. > AutoRescalingITCase#testCheckpointRescalingInKeyedState fails > - > > Key: FLINK-34200 > URL: https://issues.apache.org/jira/browse/FLINK-34200 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.19.0 >Reporter: Matthias Pohl >Assignee: Rui Fan >Priority: Critical > Labels: pull-request-available, test-stability > Attachments: FLINK-34200.failure.log.gz, debug-34200.log > > > [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56601=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=8200] > {code:java} > Jan 19 02:31:53 02:31:53.954 [ERROR] Tests run: 32, Failures: 1, Errors: 0, > Skipped: 0, Time elapsed: 1050 s <<< FAILURE! -- in > org.apache.flink.test.checkpointing.AutoRescalingITCase > Jan 19 02:31:53 02:31:53.954 [ERROR] > org.apache.flink.test.checkpointing.AutoRescalingITCase.testCheckpointRescalingInKeyedState[backend > = rocksdb, buffersPerChannel = 2] -- Time elapsed: 59.10 s <<< FAILURE! 
> Jan 19 02:31:53 java.lang.AssertionError: expected:<[(0,8000), (0,32000), > (0,48000), (0,72000), (1,78000), (1,3), (1,54000), (0,2000), (0,1), > (0,5), (0,66000), (0,74000), (0,82000), (1,8), (1,0), (1,16000), > (1,24000), (1,4), (1,56000), (1,64000), (0,12000), (0,28000), (0,52000), > (0,6), (0,68000), (0,76000), (1,18000), (1,26000), (1,34000), (1,42000), > (1,58000), (0,6000), (0,14000), (0,22000), (0,38000), (0,46000), (0,62000), > (0,7), (1,4000), (1,2), (1,36000), (1,44000)]> but was:<[(0,8000), > (0,32000), (0,48000), (0,72000), (1,78000), (1,3), (1,54000), (0,2000), > (0,1), (0,5), (0,66000), (0,74000), (0,82000), (1,8), (1,0), > (1,16000), (1,24000), (1,4), (1,56000), (1,64000), (0,12000), (0,28000), > (0,52000), (0,6), (0,68000), (0,76000), (0,1000), (0,25000), (0,33000), > (0,41000), (1,18000), (1,26000), (1,34000), (1,42000), (1,58000), (0,6000), > (0,14000), (0,22000), (0,38000), (0,46000), (0,62000), (0,7), (1,4000), > (1,2), (1,36000), (1,44000)]> > Jan 19 02:31:53 at org.junit.Assert.fail(Assert.java:89) > Jan 19 02:31:53 at org.junit.Assert.failNotEquals(Assert.java:835) > Jan 19 02:31:53 at org.junit.Assert.assertEquals(Assert.java:120) > Jan 19 02:31:53 at org.junit.Assert.assertEquals(Assert.java:146) > Jan 19 02:31:53 at > org.apache.flink.test.checkpointing.AutoRescalingITCase.testCheckpointRescalingKeyedState(AutoRescalingITCase.java:296) > Jan 19 02:31:53 at > org.apache.flink.test.checkpointing.AutoRescalingITCase.testCheckpointRescalingInKeyedState(AutoRescalingITCase.java:196) > Jan 19 02:31:53 at java.lang.reflect.Method.invoke(Method.java:498) > Jan 19 02:31:53 at > org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
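The review discussion on the PR for this issue centers on replacing `waitForOneMoreCheckpoint` (which counts completed checkpoints) with `waitForNewCheckpoint` (which waits for any checkpoint newer than the one observed on entry). A self-contained sketch of that semantic — the helper name and supplier-based interface are hypothetical; the real test utility queries the MiniCluster directly — looks like this:

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Sketch of the waitForNewCheckpoint semantic: record the latest completed
// checkpoint id on entry, then poll until a strictly newer one completes.
// Whether one or ten checkpoints complete in between does not matter.
public class CheckpointWait {

    public static long waitForNewCheckpoint(LongSupplier latestCheckpointId,
                                            long pollIntervalMillis,
                                            long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long initial = latestCheckpointId.getAsLong();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long current = latestCheckpointId.getAsLong();
            if (current > initial) {
                return current; // completed after the wait began
            }
            Thread.sleep(pollIntervalMillis);
        }
        throw new TimeoutException("no new checkpoint within " + timeoutMillis + " ms");
    }

    public static void main(String[] args) throws Exception {
        AtomicLong id = new AtomicLong(5);
        // Simulate a checkpoint completing shortly after the wait begins.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            id.set(6);
        }).start();
        long seen = waitForNewCheckpoint(id::get, 10, 5_000);
        System.out.println("observed new checkpoint " + seen);
    }
}
```

This captures why the reviewer argues the new helper does not break the old semantics: a poll interval of 100 ms could observe many completed checkpoints as a single "one more", so waiting for any strictly newer checkpoint is both simpler and no weaker.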
[jira] [Updated] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34200: --- Labels: pull-request-available test-stability (was: test-stability)
[jira] [Assigned] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan reassigned FLINK-34200: --- Assignee: Rui Fan
Re: [PR] Revert "[FLINK-33705] Upgrade to flink-shaded 18.0" [flink]
snuyanzin merged PR #24227: URL: https://github.com/apache/flink/pull/24227 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Revert "[FLINK-33705] Upgrade to flink-shaded 18.0" [flink]
snuyanzin commented on PR #24227: URL: https://github.com/apache/flink/pull/24227#issuecomment-1921276645 thanks for taking a look
Re: [PR] [FLINK-34333][k8s] 1.18 backport of FLINK-34007 [flink]
flinkbot commented on PR #24245: URL: https://github.com/apache/flink/pull/24245#issuecomment-1921277811 ## CI report: * 087ef9d4311dcacaa6d7e7caf15d9abce0e8fb95 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build
[jira] [Comment Edited] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813201#comment-17813201 ] Rui Fan edited comment on FLINK-34200 at 2/1/24 12:59 PM: -- After re-running this test hundreds of times and adding a series of log statements to analyze it, I found the root cause. In brief, it's a test issue: the counter and sum of SubtaskIndexFlatMapper in the checkpoint taken before the rescale are incomplete. Here is a test commit where I added some logs to analyze it: * [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe] I have added the detailed reasoning in this comment: [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe#r138161972] was (Author: fanrui): After re-running this test hundreds of times and adding a series of System.out statements to analyze the log, I found the root cause. In brief, it's a test issue: the counter and sum of SubtaskIndexFlatMapper in the checkpoint taken before the rescale are incomplete. Here is a test commit where I added some logs to analyze it: * [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe] I have added the detailed reasoning in this comment: https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe#r138161972
[jira] [Commented] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813201#comment-17813201 ] Rui Fan commented on FLINK-34200: - After re-running this test hundreds of times and adding a series of System.out statements to analyze the log, I found the root cause. In brief, it's a test issue: the counter and sum of SubtaskIndexFlatMapper in the checkpoint taken before the rescale are incomplete. Here is a test commit where I added some logs to analyze it: * [https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe] I have added the detailed reasoning in this comment: https://github.com/1996fanrui/flink/commit/420fdf3cfec4e2ec6d1f600fa0c87a3fc131b5fe#r138161972
[jira] [Comment Edited] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813199#comment-17813199 ] Matthias Pohl edited comment on FLINK-34333 at 2/1/24 12:55 PM: Breaking changes between v6.6.2 and v6.9.2 based on the {{fabric8io:kubernetes-client}} release notes. * Icons: ** (x) Change on the Flink side required. This doesn't come with a change of behavior for the user (because it only affects test code) ** (!) Not used by Flink but might cause issues with projects which rely on the Flink dependency ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Deprecation work that doesn't remove API, yet. * (x) [v6.9.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.9.0] ** (x) Fix [#5368|https://github.com/fabric8io/kubernetes-client/issues/5368]: ListOptions parameter ordering is now alphabetical. If you are using non-crud mocking for lists with options, you may need to update your parameter order. *** This issue causes a change in Flink's {{KubernetesClientTestBase}} because the order in which the GET parameters are processed to match the mocked requests changed. ** (!) Fix [#5343|https://github.com/fabric8io/kubernetes-client/issues/5343]: Removed io.fabric8.kubernetes.model.annotation.PrinterColumn, use io.fabric8.crd.generator.annotation.PrinterColumn ** (!) Fix [#5391|https://github.com/fabric8io/kubernetes-client/issues/5391]: Removed the vertx-uri-template dependency from the vertx client, if you need that for your application, then introduce your own dependency. ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#5220|https://github.com/fabric8io/kubernetes-client/issues/5220]: KubernetesResourceUtil.isValidLabelOrAnnotation has been deprecated because the rules for labels and annotations are different * (!) 
[v6.8.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#2718|https://github.com/fabric8io/kubernetes-client/issues/2718]: KubernetesResourceUtil.isResourceReady was deprecated. Use client.resource(item).isReady() or Readiness.getInstance().isReady(item) instead. ** (!) Fix [#5171|https://github.com/fabric8io/kubernetes-client/issues/5171]: Removed Camel-K extension, use org.apache.camel.k:camel-k-crds instead. ** (!) Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Built-in resources were inconsistent with respect to their serialization of empty collections. In many circumstances this was confusing behavior. In order to be consistent all built-in resources will omit empty collections by default. This is a breaking change if you are relying on an empty collection in a json merge or a strategic merge where the list has a patchStrategy of atomic. In these circumstances the empty collection will no longer be serialized. You may instead use a json patch, server side apply, or modify the serialized form of the patch. ** (!) Fix [#5279|https://github.com/fabric8io/kubernetes-client/issues/5279]: (java-generator) Add native support for date-time fields, they are now mapped to native java.time.ZonedDateTime ** (!) Fix [#5315|https://github.com/fabric8io/kubernetes-client/issues/5315]: kubernetes-junit-jupiter no longer registers the NamespaceExtension and KubernetesExtension extensions to be used in combination with the junit-platform.properties > junit.jupiter.extensions.autodetection.enabled=true configuration. If you wish to use these extensions and autodetect them, change your dependency to kubernetes-junit-jupiter-autodetect. ** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Deprecating io.fabric8.kubernetes.model.annotation.PrinterColumn in favor of: io.fabric8.crd.generator.annotation.PrinterColumn ** (!)
Resource classes in resource.k8s.io/v1alpha1 have been moved to resource.k8s.io/v1alpha2 apiGroup in Kubernetes 1.27. Users are required to change package of the following classes: *** io.fabric8.kubernetes.api.model.resource.v1alpha1.PodSchedulingContext -> - io.fabric8.kubernetes.api.model.resource.v1alpha2.PodSchedulingContext *** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaim -> - io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaim *** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaimTemplate -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaimTemplate *** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClass -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClass * (!) [v6.7.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.7.0] ** (!) Fix [#4911|https://github.com/fabric8io/kubernetes-client/issues/4911]: *** Config/RequestConfig.scaleTimeout has been deprecated along with
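The #5368 change that forced the {{KubernetesClientTestBase}} update can be sketched with a small, self-contained example (the class and method names below are illustrative, not the actual fabric8 code): since kubernetes-client v6.9.0, list-option query parameters are emitted in alphabetical order, so mocked request paths must expect e.g. fieldSelector before labelSelector regardless of insertion order.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class AlphabeticalListOptions {
    // Builds a query string with parameters in alphabetical key order,
    // mimicking the v6.9.0 behavior that broke insertion-order mock URLs.
    static String toQueryString(Map<String, String> options) {
        // TreeMap iterates its keys in natural (alphabetical) order.
        return new TreeMap<>(options).entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> opts = Map.of(
                "labelSelector", "app=flink",
                "fieldSelector", "status.phase=Running");
        // fieldSelector sorts before labelSelector, whatever order they were added in.
        System.out.println(toQueryString(opts));
    }
}
```

Non-crud mock expectations that hard-coded the old parameter order therefore had to be rewritten to the alphabetical form.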
[jira] [Updated] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-34333: --- Labels: pull-request-available (was: ) > Fix FLINK-34007 LeaderElector bug in 1.18 > - > > Key: FLINK-34333 > URL: https://issues.apache.org/jira/browse/FLINK-34333 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.18.1 >Reporter: Matthias Pohl >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > > FLINK-34007 revealed a bug in the k8s client v6.6.2 which we're using since > Flink 1.18. This issue was fixed with FLINK-34007 for Flink 1.19 which > required an update of the k8s client to v6.9.0. > This Jira issue is about finding a solution in Flink 1.18 for the very same > problem FLINK-34007 covered. It's a dedicated Jira issue because we want to > unblock the release of 1.19 by resolving FLINK-34007. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [FLINK-34333][k8s] 1.18 backport of FLINK-34007 [flink]
XComp opened a new pull request, #24245: URL: https://github.com/apache/flink/pull/24245 ## What is the purpose of the change This is the fix of FLINK-34007 applied to Flink 1.18. In the end, it's a straight-forward backport including the upgrade to fabric8io:kubernetes-client v6.9.0. See FLINK-34333 for further details on the discussion whether the dependency upgrade is suitable. ## Brief change log * Makes `KubernetesLeaderElector` re-instantiate the fabric8io LeaderElector at the beginning of a new leadership lifecycle. * Upgrades `fabric8io:kubernetes-client` from `v6.6.2` to `v6.9.0` in order to include https://github.com/fabric8io/kubernetes-client/issues/5463 ## Verifying this change * Extends `KubernetesLeaderElectorITCase` to include a test case for the scenario that was revealed in FLINK-34007 and a check that verifies the leadership lost event being sent when stopping the `KubernetesLeaderElector` ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
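The core idea of the backport, re-instantiating the fabric8io LeaderElector at the beginning of each leadership lifecycle, can be sketched as follows (names are hypothetical, not the actual {{KubernetesLeaderElector}} API): the fabric8 elector is modeled as single-use, so a fresh delegate is created per lifecycle instead of caching one instance.

```java
class PerLifecycleElector {
    // Models a fabric8 LeaderElector that must not be run more than once.
    static final class Delegate {
        private boolean started = false;

        void run() {
            if (started) {
                throw new IllegalStateException("LeaderElector reused");
            }
            started = true;
        }
    }

    private Delegate current;

    // Called whenever a new leadership lifecycle begins: a fresh delegate is
    // created each time, so the single-use constraint is never violated.
    void startLeadershipLifecycle() {
        current = new Delegate();
        current.run();
    }

    public static void main(String[] args) {
        PerLifecycleElector elector = new PerLifecycleElector();
        elector.startLeadershipLifecycle();
        elector.startLeadershipLifecycle(); // second lifecycle after lost leadership: no reuse
        System.out.println("ok");
    }
}
```

Caching and re-running a single delegate would throw in the second lifecycle; re-instantiation avoids the reuse problem the PR describes.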
[jira] [Updated] (FLINK-34200) AutoRescalingITCase#testCheckpointRescalingInKeyedState fails
[ https://issues.apache.org/jira/browse/FLINK-34200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Fan updated FLINK-34200: Attachment: debug-34200.log
[jira] [Comment Edited] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813199#comment-17813199 ] Matthias Pohl edited comment on FLINK-34333 at 2/1/24 12:48 PM:
[jira] [Commented] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18
[ https://issues.apache.org/jira/browse/FLINK-34333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813199#comment-17813199 ] Matthias Pohl commented on FLINK-34333:
---
Breaking changes between v6.6.2 and v6.9.2:
* (x) [v6.9.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.9.0]
** (x) Fix [#5368|https://github.com/fabric8io/kubernetes-client/issues/5368]: ListOptions parameter ordering is now alphabetical. If you are using non-crud mocking for lists with options, you may need to update your parameter order.
*** This issue causes a change in Flink's {{KubernetesClientTestBase}} because the order in which the GET parameters are processed to match the mocked requests changed.
** (!) Fix [#5343|https://github.com/fabric8io/kubernetes-client/issues/5343]: Removed io.fabric8.kubernetes.model.annotation.PrinterColumn, use io.fabric8.crd.generator.annotation.PrinterColumn
** (!) Fix [#5391|https://github.com/fabric8io/kubernetes-client/issues/5391]: Removed the vertx-uri-template dependency from the vertx client; if you need it for your application, introduce your own dependency.
** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#5220|https://github.com/fabric8io/kubernetes-client/issues/5220]: KubernetesResourceUtil.isValidLabelOrAnnotation has been deprecated because the rules for labels and annotations are different
* (!) [v6.8.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0]
** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Fix [#2718|https://github.com/fabric8io/kubernetes-client/issues/2718]: KubernetesResourceUtil.isResourceReady was deprecated. Use client.resource(item).isReady() or Readiness.getInstance().isReady(item) instead.
** (!) Fix [#5171|https://github.com/fabric8io/kubernetes-client/issues/5171]: Removed Camel-K extension, use org.apache.camel.k:camel-k-crds instead.
** (!) Fix [#5262|https://github.com/fabric8io/kubernetes-client/issues/5262]: Built-in resources were inconsistent with respect to their serialization of empty collections. In many circumstances this was confusing behavior. In order to be consistent, all built-in resources will omit empty collections by default. This is a breaking change if you are relying on an empty collection in a json merge or a strategic merge where the list has a patchStrategy of atomic. In these circumstances the empty collection will no longer be serialized. You may instead use a json patch, server side apply, or modify the serialized form of the patch.
** (!) Fix [#5279|https://github.com/fabric8io/kubernetes-client/issues/5279]: (java-generator) Add native support for date-time fields, they are now mapped to native java.time.ZonedDateTime
** (!) Fix [#5315|https://github.com/fabric8io/kubernetes-client/issues/5315]: kubernetes-junit-jupiter no longer registers the NamespaceExtension and KubernetesExtension extensions to be used in combination with the junit-platform.properties {{junit.jupiter.extensions.autodetection.enabled=true}} configuration. If you wish to use these extensions and autodetect them, change your dependency to kubernetes-junit-jupiter-autodetect.
** [(/)|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.8.0] Deprecating io.fabric8.kubernetes.model.annotation.PrinterColumn in favor of: io.fabric8.crd.generator.annotation.PrinterColumn
** (!) Resource classes in resource.k8s.io/v1alpha1 have been moved to the resource.k8s.io/v1alpha2 apiGroup in Kubernetes 1.27. Users are required to change the package of the following classes:
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.PodSchedulingContext -> io.fabric8.kubernetes.api.model.resource.v1alpha2.PodSchedulingContext
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaim -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaim
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClaimTemplate -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClaimTemplate
*** io.fabric8.kubernetes.api.model.resource.v1alpha1.ResourceClass -> io.fabric8.kubernetes.api.model.resource.v1alpha2.ResourceClass
* (!) [v6.7.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.7.0]
** (!) Fix [#4911|https://github.com/fabric8io/kubernetes-client/issues/4911]:
*** Config/RequestConfig.scaleTimeout has been deprecated along with Scalable.scale(count, wait) and DeployableScalableResource.deployLatest(wait). withTimeout may be called before the operation to control the timeout.
*** Config/RequestConfig.websocketTimeout has been removed. Config/RequestConfig.requestTimeout will be used for websocket connection timeouts.
*** HttpClient api/building changes - writeTimeout has been removed, readTimeout has moved to the HttpRequest
** (!) Fix [#4662|https://github.com/fabric8io/kubernetes-client/issues/4662]: ***
[jira] [Created] (FLINK-34334) Add sub-task level RocksDB file count metric
Jufang He created FLINK-34334:
-
Summary: Add sub-task level RocksDB file count metric
Key: FLINK-34334
URL: https://issues.apache.org/jira/browse/FLINK-34334
Project: Flink
Issue Type: Improvement
Components: Runtime / State Backends
Affects Versions: 1.18.0
Reporter: Jufang He
Attachments: img_v3_027i_7ed0b8ba-3f12-48eb-aab3-cc368ac47cdg.jpg

In our production environment, we encountered task deployment failures. The root cause was that too many SST files in a single sub-task inflated the task deployment information (OperatorSubtaskState), which then caused an Akka request timeout during the task deploy phase. Therefore, I want to add a sub-task level RocksDB file count metric, making it possible to detect performance problems caused by too many SST files in time. RocksDB already provides the JNI API ([getColumnFamilyMetaData|https://javadoc.io/doc/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/RocksDB.html#getColumnFamilyMetaData()]), so we can easily retrieve the file count and report it via the metrics reporter.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
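The proposed metric boils down to a gauge that sums SST file counts across levels. A minimal sketch of that idea is below; all class and method names are illustrative assumptions, and in the real state backend the supplier would wrap RocksDB's JNI call `db.getColumnFamilyMetaData()` and sum `files().size()` over `levels()` for each registered column family.

```java
import java.util.List;
import java.util.function.Supplier;

/**
 * Hedged sketch of a sub-task level SST file count gauge (names illustrative;
 * not the actual Flink/RocksDB integration).
 */
public class SstFileCountGauge {

    // Per-level SST file counts of one column family, supplied lazily so the
    // metric reflects the current state whenever the reporter polls it.
    private final Supplier<List<Integer>> filesPerLevel;

    public SstFileCountGauge(Supplier<List<Integer>> filesPerLevel) {
        this.filesPerLevel = filesPerLevel;
    }

    /** Total number of SST files across all levels. */
    public int getValue() {
        return filesPerLevel.get().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // e.g. 4 files in L0, 10 in L1, 32 in L2
        SstFileCountGauge gauge = new SstFileCountGauge(() -> List.of(4, 10, 32));
        System.out.println(gauge.getValue()); // prints 46
    }
}
```

A `Supplier` keeps the gauge cheap to register: nothing is queried until the metrics reporter actually reads the value.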
Re: [PR] [FLINK-33960][Scheduler] Fix the bug that Adaptive Scheduler doesn't respect the lowerBound when one flink job has more than 1 tasks [flink]
zentol commented on code in PR #24012: URL: https://github.com/apache/flink/pull/24012#discussion_r1474310073

## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/allocator/SlotSharingSlotAllocator.java:
## @@ -301,27 +301,24 @@ public Collection getContainedExecutionVertices() {
     private static class SlotSharingGroupMetaInfo {
-        private final int minLowerBound;
         private final int maxUpperBound;
-        private final int maxLowerUpperBoundRange;
+        private final int minResourceRequirement;

-        private SlotSharingGroupMetaInfo(
-                int minLowerBound, int maxUpperBound, int maxLowerUpperBoundRange) {
-            this.minLowerBound = minLowerBound;
+        private SlotSharingGroupMetaInfo(int maxUpperBound, int minResourceRequirement) {
             this.maxUpperBound = maxUpperBound;
-            this.maxLowerUpperBoundRange = maxLowerUpperBoundRange;
-        }
-
-        public int getMinLowerBound() {
-            return minLowerBound;
+            this.minResourceRequirement = minResourceRequirement;
         }

        public int getMaxUpperBound() {
            return maxUpperBound;
        }

-        public int getMaxLowerUpperBoundRange() {
-            return maxLowerUpperBoundRange;
+        public int getMaxUpperMinRequirementBoundRange() {
+            return maxUpperBound - minResourceRequirement;
+        }
+
+        public int getMinResourceRequirement() {

Review Comment: I'd prefer if we'd name this `getMaxLowerBound` because that's exactly what it is. In this area of the code it is always good to use consistent terminology and be as explicit as possible.
## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/allocator/SlotSharingSlotAllocator.java:
## @@ -301,27 +301,24 @@ public Collection getContainedExecutionVertices() {
     private static class SlotSharingGroupMetaInfo {
-        private final int minLowerBound;
         private final int maxUpperBound;
-        private final int maxLowerUpperBoundRange;
+        private final int minResourceRequirement;

-        private SlotSharingGroupMetaInfo(
-                int minLowerBound, int maxUpperBound, int maxLowerUpperBoundRange) {
-            this.minLowerBound = minLowerBound;
+        private SlotSharingGroupMetaInfo(int maxUpperBound, int minResourceRequirement) {

Review Comment: reorder so min is first argument
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
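Taken together, the two review comments describe a specific end state for the class. A hypothetical sketch of what it might look like after applying both suggestions (this is not the merged Flink implementation, just the reviewer's requested shape):

```java
/**
 * Hypothetical shape of SlotSharingGroupMetaInfo after both review comments:
 * the "min" value is the first constructor argument, and its accessor uses the
 * bound terminology (getMaxLowerBound) instead of getMinResourceRequirement.
 */
public class SlotSharingGroupMetaInfoSketch {

    // The largest lower bound across all vertices in the group; this is what
    // the "minimum resource requirement" of the group amounts to.
    private final int maxLowerBound;
    private final int maxUpperBound;

    public SlotSharingGroupMetaInfoSketch(int maxLowerBound, int maxUpperBound) {
        this.maxLowerBound = maxLowerBound;
        this.maxUpperBound = maxUpperBound;
    }

    public int getMaxLowerBound() {
        return maxLowerBound;
    }

    public int getMaxUpperBound() {
        return maxUpperBound;
    }

    /** Slack between what the group can use at most and what it minimally needs. */
    public int getMaxUpperMinRequirementBoundRange() {
        return maxUpperBound - maxLowerBound;
    }

    public static void main(String[] args) {
        SlotSharingGroupMetaInfoSketch info = new SlotSharingGroupMetaInfoSketch(1, 4);
        System.out.println(info.getMaxUpperMinRequirementBoundRange()); // prints 3
    }
}
```

Keeping "lower bound"/"upper bound" as the only terminology avoids having two names (requirement vs. bound) for the same quantity in the allocator.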
Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
hackergin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1474310557

## flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/calcite/FlinkSqlCallBinding.java:
## @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.planner.calcite;
+
+import org.apache.flink.table.planner.functions.sql.SqlDefaultOperator;
+
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.sql.SqlCall;
+import org.apache.calcite.sql.SqlCallBinding;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.sql.SqlNode;
+import org.apache.calcite.sql.fun.SqlStdOperatorTable;
+import org.apache.calcite.sql.parser.SqlParserPos;
+import org.apache.calcite.sql.type.SqlOperandMetadata;
+import org.apache.calcite.sql.type.SqlOperandTypeChecker;
+import org.apache.calcite.sql.type.SqlTypeUtil;
+import org.apache.calcite.sql.validate.SqlValidator;
+import org.apache.calcite.sql.validate.SqlValidatorNamespace;
+import org.apache.calcite.sql.validate.SqlValidatorScope;
+import org.checkerframework.checker.nullness.qual.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/** Binding supports to rewrite the DEFAULT operator. */
+public class FlinkSqlCallBinding extends SqlCallBinding {
+
+    private final List<RelDataType> fixArgumentTypes;
+
+    private final List<SqlNode> rewrittenOperands;
+
+    public FlinkSqlCallBinding(
+            SqlValidator validator, @Nullable SqlValidatorScope scope, SqlCall call) {
+        super(validator, scope, call);
+        this.fixArgumentTypes = getFixArgumentTypes();
+        this.rewrittenOperands = getRewrittenOperands();
+    }
+
+    @Override
+    public int getOperandCount() {
+        return rewrittenOperands.size();
+    }
+
+    @Override
+    public List<SqlNode> operands() {
+        if (isFixedParameters()) {
+            return rewrittenOperands;
+        } else {
+            return super.operands();
+        }
+    }
+
+    @Override
+    public RelDataType getOperandType(int ordinal) {
+        if (!isFixedParameters()) {
+            return super.getOperandType(ordinal);
+        }
+
+        SqlNode operand = rewrittenOperands.get(ordinal);
+        if (operand.getKind() == SqlKind.DEFAULT) {
+            return fixArgumentTypes.get(ordinal);
+        }
+
+        final RelDataType type = SqlTypeUtil.deriveType(this, operand);

Review Comment: @fsk119 Here I made some modifications.
Previously, the logic was: if fixed argument types are present, return the declared argument type. That was actually a problem: we should only return the declared argument type when the operand is a DEFAULT node; for any other node we should still derive the actual operand's type.
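The corrected decision can be modeled in isolation. In the toy sketch below, all Calcite types are replaced by plain stubs, so it only mirrors the control flow of `getOperandType`, not the real API:

```java
import java.util.List;

/**
 * Toy model of the corrected FlinkSqlCallBinding#getOperandType decision;
 * Calcite types are replaced by stubs, so only the branching is faithful.
 */
public class OperandTypeResolution {

    public enum Kind { DEFAULT, LITERAL }

    public static String operandType(
            boolean fixedParameters, List<Kind> operands, List<String> declaredTypes, int ordinal) {
        if (!fixedParameters) {
            return "derived:super"; // delegate to the superclass as before
        }
        if (operands.get(ordinal) == Kind.DEFAULT) {
            return declaredTypes.get(ordinal); // only DEFAULT nodes fall back to the declared type
        }
        return "derived:operand"; // any real operand still derives its own actual type
    }

    public static void main(String[] args) {
        List<Kind> ops = List.of(Kind.LITERAL, Kind.DEFAULT);
        List<String> declared = List.of("INT", "STRING");
        System.out.println(operandType(true, ops, declared, 0)); // prints derived:operand
        System.out.println(operandType(true, ops, declared, 1)); // prints STRING
    }
}
```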
Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
hackergin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1474303521

## flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/calcite/FlinkSqlCallBinding.java:
## @@ -77,20 +76,15 @@ public List operands() {
     @Override
     public RelDataType getOperandType(int ordinal) {
-        return isNamedArgument() && !argumentTypes.isEmpty()
+        return isNamedArgument()
                 ? ((SqlOperandMetadata) getCall().getOperator().getOperandTypeChecker())
                         .paramTypes(typeFactory)
                         .get(ordinal)
                 : super.getOperandType(ordinal);
     }

     public boolean isNamedArgument() {
-        for (SqlNode operand : getCall().getOperandList()) {
-            if (operand != null && operand.getKind() == SqlKind.ARGUMENT_ASSIGNMENT) {
-                return !getArgumentTypes().isEmpty();
-            }
-        }
-        return false;
+        return !argumentTypes.isEmpty();
     }

Review Comment: Updated.
Re: [PR] [FLINK-34312][table] Improve the handling of default node types for named parameters [flink]
hackergin commented on code in PR #24235: URL: https://github.com/apache/flink/pull/24235#discussion_r1474302721

## flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/inference/OperatorBindingCallContext.java:
## @@ -59,7 +59,7 @@ public OperatorBindingCallContext(
                 sqlOperatorBinding.getOperator().getNameAsId().toString(),
                 sqlOperatorBinding.getGroupCount() > 0);

-        this.binding = new OperatorBindingDecorator(sqlOperatorBinding);
+        this.binding = sqlOperatorBinding;

Review Comment: Updated both `OperatorBindingCallContext` and `CallBindingCallContext`.
[jira] [Updated] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34007: -- Release Note: Fixes a bug where the leader election wasn't able to pick up leadership again after renewing the lease token caused a leadership loss. This required fabric8io:kubernetes-client to be upgraded from v6.6.2 to v6.9.0 > Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.19.0, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Pohl updated FLINK-34007: -- Release Note: Fixes a bug where the leader election wasn't able to pick up leadership again after renewing the lease token caused a leadership loss. This required fabric8io:kubernetes-client to be upgraded from v6.6.2 to v6.9.0. (was: Fixes a bug where the leader election wasn't able to pick up leadership again after renewing the lease token caused a leadership loss. This required fabric8io:kubernetes-client to be upgraded from v6.6.2 to v6.9.0) > Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.19.0, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Assignee: Matthias Pohl >Priority: Blocker > Labels: pull-request-available > Fix For: 1.19.0 > > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34229) Duplicate entry in InnerClasses attribute in class file FusionStreamOperator
[ https://issues.apache.org/jira/browse/FLINK-34229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813169#comment-17813169 ] Dan Zou commented on FLINK-34229:
-
[~xiasun] Could you please check whether this [PR|https://github.com/apache/flink/pull/24228] solves your problem?

> Duplicate entry in InnerClasses attribute in class file FusionStreamOperator
>
> Key: FLINK-34229
> URL: https://issues.apache.org/jira/browse/FLINK-34229
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / Runtime
> Affects Versions: 1.19.0
> Reporter: xingbe
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-01-24-17-05-47-883.png, taskmanager_log.txt
>
> I noticed a runtime error in the 10TB TPC-DS (q35.sql) benchmarks in 1.19; the problem did not happen in 1.18.0. This issue may have been newly introduced recently. !image-2024-01-24-17-05-47-883.png|width=589,height=279!
Re: [PR] [FLINK-34324][test] Makes all s3 related operations being declared and called in a single location [flink]
zentol commented on code in PR #24244: URL: https://github.com/apache/flink/pull/24244#discussion_r1474268805

## flink-end-to-end-tests/test-scripts/test_file_sink.sh:
## @@ -96,13 +51,54 @@ function get_complete_result {
 # line number in part files
 ###
 function get_total_number_of_valid_lines {
-  if [ "${OUT_TYPE}" == "local" ]; then
-    get_complete_result | wc -l | tr -d '[:space:]'
-  elif [ "${OUT_TYPE}" == "s3" ]; then
-    s3_get_number_of_lines_by_prefix "${S3_PREFIX}" "part-"
-  fi
+  get_complete_result | wc -l | tr -d '[:space:]'
 }

+if [ "${OUT_TYPE}" == "s3" ]; then
+  source "$(dirname "$0")"/common_s3.sh
+  s3_setup hadoop
+
+  JOB_OUTPUT_PATH="s3://$IT_CASE_S3_BUCKET/$S3_PREFIX"
+  set_config_key "state.checkpoints.dir" "s3://$IT_CASE_S3_BUCKET/$S3_PREFIX-chk"
+  mkdir -p "$OUTPUT_PATH-chk"
+
+  # overwrites implementation for local runs
+  function get_complete_result {
+    s3_get_by_full_path_and_filename_prefix "$OUTPUT_PATH" "$S3_PREFIX" "part-" true
+  }

Review Comment: I wonder why we are reading from OUTPUT_PATH and not JOB_OUTPUT_PATH 🤔
Re: [PR] [FLINK-34260][Connectors/AWS] Update flink-connector-aws to be compatible with updated SinkV2 interfaces [flink-connector-aws]
z3d1k commented on code in PR #127: URL: https://github.com/apache/flink-connector-aws/pull/127#discussion_r1474264702

## flink-connector-aws/flink-connector-kinesis/src/test/java/org/apache/flink/streaming/connectors/kinesis/FlinkKinesisConsumerTest.java:
## @@ -284,7 +284,7 @@ public void testListStateChangedAfterSnapshotState() throws Exception {
                 new FlinkKinesisConsumer<>("fakeStream", new SimpleStringSchema(), config);
         FlinkKinesisConsumer mockedConsumer = spy(consumer);
-        RuntimeContext context = new MockStreamingRuntimeContext(true, 1, 1);
+        RuntimeContext context = new MockStreamingRuntimeContext(true, 1, 0);

Review Comment: Sure, in 1.19 this test is failing with:
```
java.lang.IllegalArgumentException: Task index must be less than parallelism.
    at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138)
    at org.apache.flink.api.common.TaskInfoImpl.<init>(TaskInfoImpl.java:68)
    at org.apache.flink.api.common.TaskInfoImpl.<init>(TaskInfoImpl.java:44)
```
Failed check: https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/TaskInfoImpl.java#L68-L70
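The failing precondition is easy to reproduce in isolation: subtask indices are zero-based, so a valid index must be strictly less than the parallelism. The sketch below is a hedged model of that check, not Flink's actual `TaskInfoImpl`:

```java
/** Minimal model of the zero-based subtask-index check (illustrative only). */
public class TaskIndexCheck {

    public static void checkTaskIndex(int indexOfSubtask, int parallelism) {
        // Mirrors the kind of argument check done via Preconditions.checkArgument:
        // with parallelism p, the valid indices are 0 .. p-1.
        if (indexOfSubtask < 0 || indexOfSubtask >= parallelism) {
            throw new IllegalArgumentException("Task index must be less than parallelism.");
        }
    }

    public static void main(String[] args) {
        checkTaskIndex(0, 1); // ok: the fixed test arguments (parallelism 1, index 0)
        try {
            checkTaskIndex(1, 1); // the old test arguments: index == parallelism
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Task index must be less than parallelism.
        }
    }
}
```

This is why changing the mock context from `(true, 1, 1)` to `(true, 1, 0)` fixes the test: with parallelism 1, the only valid subtask index is 0.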
Re: [PR] [FLINK-34324][test] Makes all s3 related operations being declared and called in a single location [flink]
zentol commented on code in PR #24244: URL: https://github.com/apache/flink/pull/24244#discussion_r1474261856

## flink-end-to-end-tests/test-scripts/test_file_sink.sh:
## @@ -96,13 +51,54 @@ function get_complete_result {
 # line number in part files
 ###
 function get_total_number_of_valid_lines {
-  if [ "${OUT_TYPE}" == "local" ]; then
-    get_complete_result | wc -l | tr -d '[:space:]'
-  elif [ "${OUT_TYPE}" == "s3" ]; then
-    s3_get_number_of_lines_by_prefix "${S3_PREFIX}" "part-"
-  fi
+  get_complete_result | wc -l | tr -d '[:space:]'
 }

+if [ "${OUT_TYPE}" == "s3" ]; then
+  source "$(dirname "$0")"/common_s3.sh
+  s3_setup hadoop
+
+  JOB_OUTPUT_PATH="s3://$IT_CASE_S3_BUCKET/$S3_PREFIX"
+  set_config_key "state.checkpoints.dir" "s3://$IT_CASE_S3_BUCKET/$S3_PREFIX-chk"
+  mkdir -p "$OUTPUT_PATH-chk"
+
+  # overwrites implementation for local runs
+  function get_complete_result {
+    s3_get_by_full_path_and_filename_prefix "$OUTPUT_PATH" "$S3_PREFIX" "part-" true
+  }

Review Comment: Or is that a no-op because in s3 mode we never write anything there...
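The refactoring under discussion relies on a bash property worth making explicit: re-declaring a function after it has been defined (or sourced) silently replaces the earlier definition, which is how the s3 branch swaps in its own `get_complete_result`. A standalone sketch of the pattern (the function body and `OUT_TYPE` value are made up for illustration):

```shell
#!/usr/bin/env bash
# Default implementation, as the local-mode script defines it.
function get_complete_result {
  echo "local-result"
}

OUT_TYPE="s3"
if [ "${OUT_TYPE}" == "s3" ]; then
  # Re-declaring the function overrides the earlier definition,
  # the same way test_file_sink.sh overrides the local implementation.
  function get_complete_result {
    echo "s3-result"
  }
fi

get_complete_result  # prints "s3-result"
```

Because the override happens at source time, every later caller (such as `get_total_number_of_valid_lines`) transparently picks up the s3 variant without any `if` branching of its own.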