[jira] [Commented] (FLINK-34582) release build tools lost the newly added py3.11 packages for mac

2024-05-24 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849185#comment-17849185
 ] 

Matthias Pohl commented on FLINK-34582:
---

You're checking [~hxb]'s fork where the {{master}} branch doesn't seem to be 
up-to-date. 
[apache/flink:flink-python/dev/build-wheels.sh|https://github.com/apache/flink/blob/master/flink-python/dev/build-wheels.sh#L19-L26]
 does, indeed, have 3.11 added to the python version list.

> release build tools lost the newly added py3.11 packages for mac
> 
>
> Key: FLINK-34582
> URL: https://issues.apache.org/jira/browse/FLINK-34582
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.19.0, 1.20.0
>Reporter: lincoln lee
>Assignee: Xingbo Huang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.19.0, 1.20.0
>
> Attachments: image-2024-03-07-10-39-49-341.png
>
>
> During 1.19.0-rc1, building binaries via 
> tools/releasing/create_binary_release.sh
> lost the two newly added py3.11 packages for mac.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34672) HA deadlock between JobMasterServiceLeadershipRunner and DefaultLeaderElectionService

2024-05-22 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848648#comment-17848648
 ] 

Matthias Pohl commented on FLINK-34672:
---

I'm still trying to find a reviewer. It's on my plate. But it's not a blocker 
because the issue already existed in older versions of Flink:
{quote}
I also verified that this is not something that was introduced in Flink 1.18 
with the FLIP-285 changes. AFAIS, it can also happen in 1.17- (I didn't check 
the pre-FLINK-24038 code but only looked into release-1.17).
{quote}

> HA deadlock between JobMasterServiceLeadershipRunner and 
> DefaultLeaderElectionService
> -
>
> Key: FLINK-34672
> URL: https://issues.apache.org/jira/browse/FLINK-34672
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0
>Reporter: Chesnay Schepler
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> We recently observed a deadlock in the JM within the HA system.
> (see below for the thread dump)
> [~mapohl] and I looked a bit into it and there appears to be a race condition 
> when leadership is revoked while a JobMaster is being started.
> It appears to be caused by 
> {{JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess}} 
> forwarding futures while holding a lock; depending on whether the forwarded 
> future is already complete, the next stage may or may not run while holding 
> that same lock.
> We haven't determined yet whether we should be holding that lock or not.
> {code}
> "DefaultLeaderElectionService-leadershipOperationExecutor-thread-1" #131 
> daemon prio=5 os_prio=0 cpu=157.44ms elapsed=78749.65s tid=0x7f531f43d000 
> nid=0x19d waiting for monitor entry  [0x7f53084fd000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:462)
> - waiting to lock <0xf1c0e088> (a java.lang.Object)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:397)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.notifyLeaderContenderOfLeadershipLoss(DefaultLeaderElectionService.java:484)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1252/0x000840ddec40.accept(Unknown
>  Source)
> at java.util.HashMap.forEach(java.base@11.0.22/HashMap.java:1337)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadershipInternal(DefaultLeaderElectionService.java:452)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1251/0x000840dcf840.run(Unknown
>  Source)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.lambda$runInLeaderEventThread$3(DefaultLeaderElectionService.java:549)
> - locked <0xf0e3f4d8> (a java.lang.Object)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1075/0x000840c23040.run(Unknown
>  Source)
> at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.22/CompletableFuture.java:1736)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
> {code}
> {code}
> "jobmanager-io-thread-1" #636 daemon prio=5 os_prio=0 cpu=125.56ms 
> elapsed=78699.01s tid=0x7f5321c6e800 nid=0x396 waiting for monitor entry  
> [0x7f530567d000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.hasLeadership(DefaultLeaderElectionService.java:366)
> - waiting to lock <0xf0e3f4d8> (a java.lang.Object)
> at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElection.hasLeadership(DefaultLeaderElection.java:52)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.isValidLeader(JobMasterServiceLeadershipRunner.java:509)
> at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$forwardIfValidLeader$15(JobMasterServiceLeadershipRunner.java:520)
> - locked <0xf1c0e088> (a java.lang.Object)
> at 
> 
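To illustrate the race described above: a minimal sketch (plain Java, not the actual Flink code; lock and method names are made up) of how forwarding a future while holding a lock can invert the lock order taken by the revocation path.

{code:java}
import java.util.concurrent.CompletableFuture;

/**
 * Minimal illustration of the hazard described in the ticket (all names are
 * illustrative). If the forwarded future is already complete, the dependent
 * stage runs synchronously on the registering thread, i.e. while runnerLock is
 * still held, and then tries to grab electionLock - the opposite order of the
 * revocation path below.
 */
public class ForwardingUnderLockSketch {

    private final Object runnerLock = new Object();   // plays the role of <0xf1c0e088> in the dump
    private final Object electionLock = new Object(); // plays the role of <0xf0e3f4d8> in the dump

    void createNewProcess(CompletableFuture<String> leaderAddressFuture) {
        synchronized (runnerLock) {
            // If leaderAddressFuture is ALREADY complete, this callback executes
            // right here, on the current thread, with runnerLock still held.
            leaderAddressFuture.thenAccept(address -> {
                synchronized (electionLock) { // lock order: runnerLock -> electionLock
                    // e.g. verify leadership is still valid
                }
            });
        }
    }

    void revokeLeadership() {
        synchronized (electionLock) {         // lock order: electionLock -> runnerLock
            synchronized (runnerLock) {
                // stop the JobMaster service process
            }
        }
    }
}
{code}

With both lock orders possible, each thread can end up blocked on the monitor the other one holds, which matches the two BLOCKED stacks above. Completing the forwarded stage asynchronously (e.g. {{thenAcceptAsync}} with a separate executor) or releasing the lock before forwarding would keep the dependent stage from running under the lock.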

[jira] [Assigned] (FLINK-20402) Migrate test_tpch.sh

2024-05-21 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-20402:
-

Assignee: Muhammet Orazov

> Migrate test_tpch.sh
> 
>
> Key: FLINK-20402
> URL: https://issues.apache.org/jira/browse/FLINK-20402
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table SQL / Ecosystem, Tests
>Reporter: Jark Wu
>Assignee: Muhammet Orazov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker

2024-05-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846924#comment-17846924
 ] 

Matthias Pohl commented on FLINK-20392:
---

Sure, sounds reasonable. Feel free to update it.

> Migrating bash e2e tests to Java/Docker
> ---
>
> Key: FLINK-20392
> URL: https://issues.apache.org/jira/browse/FLINK-20392
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Test Infrastructure, Tests
>Reporter: Matthias Pohl
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> starter
>
> This Jira issue serves as an umbrella ticket for single e2e test migration 
> tasks. This should enable us to migrate all bash-based e2e tests step-by-step.
> The goal is to utilize the e2e test framework (see 
> [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]).
>  Ideally, the test should use Docker containers as much as possible to 
> disconnect the execution from the environment. A good source to achieve that 
> is [testcontainers.org|https://www.testcontainers.org/].
> The related ML discussion is [Stop adding new bash-based e2e tests to 
> Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html].
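
For reference, a minimal sketch of what such a container-backed test can look like using plain Testcontainers (generic Testcontainers API with an arbitrary nginx image, not Flink's flink-end-to-end-tests-common classes):

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

import static org.assertj.core.api.Assertions.assertThat;

class DockerBackedSmokeTest {

    @Test
    void containerStartsAndServesRequests() throws Exception {
        // Testcontainers pulls the image, starts the container and cleans it up afterwards,
        // so the test does not depend on anything pre-installed on the build machine.
        try (GenericContainer<?> nginx =
                new GenericContainer<>(DockerImageName.parse("nginx:1.25"))
                        .withExposedPorts(80)) {
            nginx.start();

            // Talk to the container via the host and the dynamically mapped port.
            String endpoint =
                    "http://" + nginx.getHost() + ":" + nginx.getMappedPort(80) + "/";
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(endpoint)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());

            assertThat(response.statusCode()).isEqualTo(200);
        }
    }
}
{code}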



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker

2024-05-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846838#comment-17846838
 ] 

Matthias Pohl commented on FLINK-20392:
---

This discussion feels similar to our efforts around migrating to JUnit5 and 
assertj as the standard for our JUnit tests. It cost (and is still costing) quite a 
bit of resources, with the risk of missing things when reviewing the tests.
That is why I still see value in just keeping both options around. That 
requires fewer resources and we're not losing much. The pros and cons are still 
a good guideline for developers to decide which technology to use if they 
are planning to create a new e2e test in Java. WDYT?

> Migrating bash e2e tests to Java/Docker
> ---
>
> Key: FLINK-20392
> URL: https://issues.apache.org/jira/browse/FLINK-20392
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Test Infrastructure, Tests
>Reporter: Matthias Pohl
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> starter
>
> This Jira issue serves as an umbrella ticket for single e2e test migration 
> tasks. This should enable us to migrate all bash-based e2e tests step-by-step.
> The goal is to utilize the e2e test framework (see 
> [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]).
>  Ideally, the test should use Docker containers as much as possible to 
> disconnect the execution from the environment. A good source to achieve that 
> is [testcontainers.org|https://www.testcontainers.org/].
> The related ML discussion is [Stop adding new bash-based e2e tests to 
> Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-20392) Migrating bash e2e tests to Java/Docker

2024-05-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846540#comment-17846540
 ] 

Matthias Pohl commented on FLINK-20392:
---

Thanks for the write-up. I'm just wondering whether we gain anything from only 
allowing one of the two approaches. What about allowing both options?

> Migrating bash e2e tests to Java/Docker
> ---
>
> Key: FLINK-20392
> URL: https://issues.apache.org/jira/browse/FLINK-20392
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Test Infrastructure, Tests
>Reporter: Matthias Pohl
>Priority: Minor
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> starter
>
> This Jira issue serves as an umbrella ticket for single e2e test migration 
> tasks. This should enable us to migrate all bash-based e2e tests step-by-step.
> The goal is to utilize the e2e test framework (see 
> [flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]).
>  Ideally, the test should use Docker containers as much as possible to 
> disconnect the execution from the environment. A good source to achieve that 
> is [testcontainers.org|https://www.testcontainers.org/].
> The related ML discussion is [Stop adding new bash-based e2e tests to 
> Flink|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stop-adding-new-bash-based-e2e-tests-to-Flink-td46607.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34324) s3_setup is called in test_file_sink.sh even if the common_s3.sh is not sourced

2024-05-10 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845238#comment-17845238
 ] 

Matthias Pohl edited comment on FLINK-34324 at 5/10/24 8:07 AM:


* master: 
[93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43]
* 1.19: 
[8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914]
* 1.18: 
[7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18]


was (Author: mapohl):
* master
** 
[93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43]
* 1.19
** 
[8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914]
* 1.18
** 
[7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18]

> s3_setup is called in test_file_sink.sh even if the common_s3.sh is not 
> sourced
> ---
>
> Key: FLINK-34324
> URL: https://issues.apache.org/jira/browse/FLINK-34324
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hadoop Compatibility, Tests
>Affects Versions: 1.17.2, 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> See example CI run from the FLINK-34150 PR:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56570=logs=af184cdd-c6d8-5084-0b69-7e9c67b35f7a=0f3adb59-eefa-51c6-2858-3654d9e0749d=3191
> {code}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_file_sink.sh: 
> line 38: s3_setup: command not found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34324) s3_setup is called in test_file_sink.sh even if the common_s3.sh is not sourced

2024-05-10 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34324.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

* master
** 
[93526c2f3247598ce80854cf65dd4440eb5aaa43|https://github.com/apache/flink/commit/93526c2f3247598ce80854cf65dd4440eb5aaa43]
* 1.19
** 
[8707c63ee147085671a9ae1b294854bac03fc914|https://github.com/apache/flink/commit/8707c63ee147085671a9ae1b294854bac03fc914]
* 1.18
** 
[7d98ab060be82fe3684d15501b9eb83373303d18|https://github.com/apache/flink/commit/7d98ab060be82fe3684d15501b9eb83373303d18]

> s3_setup is called in test_file_sink.sh even if the common_s3.sh is not 
> sourced
> ---
>
> Key: FLINK-34324
> URL: https://issues.apache.org/jira/browse/FLINK-34324
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Hadoop Compatibility, Tests
>Affects Versions: 1.17.2, 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> See example CI run from the FLINK-34150 PR:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=56570=logs=af184cdd-c6d8-5084-0b69-7e9c67b35f7a=0f3adb59-eefa-51c6-2858-3654d9e0749d=3191
> {code}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_file_sink.sh: 
> line 38: s3_setup: command not found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-34937) Apache Infra GHA policy update

2024-05-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-34937:
-

Assignee: Matthias Pohl

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced, which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-04 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833910#comment-17833910
 ] 

Matthias Pohl commented on FLINK-34989:
---

[~martijnvisser] pointed out that we might need to fix this in the connector 
repos as well.

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024), Flink was ranked #19 among the projects 
> that used the Apache Infra runner resources the most, which doesn't seem too 
> bad. This covered not only Apache Flink but also the Kubernetes operator, 
> connectors and other repositories. According to [this 
> source|https://infra.apache.org/github-actions-secrets.html], Apache Infra 
> manages 180 runners right now.
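
As a quick sanity check of the runner-minute equivalences quoted in the policy above (not part of the policy text; the weekly figure is rounded there to 250,000 minutes):

{code:java}
public class RunnerMinuteCheck {
    public static void main(String[] args) {
        // 25 full-time runners over one calendar week (7 days):
        int weeklyMinutes = 25 * 7 * 24 * 60;   // 252,000 minutes (quoted as ~250,000), i.e. 4,200 hours
        // 30 full-time runners over five consecutive days:
        int fiveDayMinutes = 30 * 5 * 24 * 60;  // 216,000 minutes, i.e. 3,600 hours
        System.out.println(weeklyMinutes + " min/week = " + (weeklyMinutes / 60) + " hours");
        System.out.println(fiveDayMinutes + " min/5 days = " + (fiveDayMinutes / 60) + " hours");
    }
}
{code}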



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34999) PR CI stopped operating

2024-04-04 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34999.
---
Resolution: Fixed

Thanks for working on it. I verified that [PR 
CI|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] is 
picked up again. (y)

> PR CI stopped operating
> ---
>
> Key: FLINK-34999
> URL: https://issues.apache.org/jira/browse/FLINK-34999
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Blocker
>
> There are no [new PR CI 
> runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
> being picked up anymore. [Recently updated 
> PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not 
> picked up by the @flinkbot.
> In the meantime there was a notification sent from GitHub that the password 
> of the [@flinkbot|https://github.com/flinkbot] was reset for security 
> reasons. It's quite likely that these two events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35005) SqlClientITCase Failed to build JobManager image

2024-04-04 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35005:
--
Component/s: Test Infrastructure

> SqlClientITCase Failed to build JobManager image
> 
>
> Key: FLINK-35005
> URL: https://issues.apache.org/jira/browse/FLINK-35005
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Priority: Critical
>  Labels: test-stability
>
> jdk21 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58708=logs=dc1bf4ed-4646-531a-f094-e103042be549=fb3d654d-52f8-5b98-fe9d-b18dd2e2b790=15140
> {code}
> Apr 03 02:59:16 02:59:16.247 [INFO] 
> ---
> Apr 03 02:59:16 02:59:16.248 [INFO]  T E S T S
> Apr 03 02:59:16 02:59:16.248 [INFO] 
> ---
> Apr 03 02:59:17 02:59:17.841 [INFO] Running SqlClientITCase
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
> Apr 03 03:03:15 Caused by: 
> org.apache.flink.connector.testframe.container.ImageBuildException: Failed to 
> build image "flink-configured-jobmanager"
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkImageBuilder.build(FlinkImageBuilder.java:234)
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkTestcontainersConfigurator.configureJobManagerContainer(FlinkTestcontainersConfigurator.java:65)
> Apr 03 03:03:15   ... 12 more
> Apr 03 03:03:15 Caused by: java.lang.RuntimeException: 
> com.github.dockerjava.api.exception.DockerClientException: Could not build 
> image: Head 
> "https://registry-1.docker.io/v2/library/eclipse-temurin/manifests/21-jre-jammy":
>  received unexpected HTTP status: 500 Internal Server Error
> Apr 03 03:03:15   at 
> org.rnorth.ducttape.timeouts.Timeouts.callFuture(Timeouts.java:68)
> Apr 03 03:03:15   at 
> org.rnorth.ducttape.timeouts.Timeouts.getWithTimeout(Timeouts.java:43)
> Apr 03 03:03:15   at 
> org.testcontainers.utility.LazyFuture.get(LazyFuture.java:47)
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkImageBuilder.buildBaseImage(FlinkImageBuilder.java:255)
> Apr 03 03:03:15   at 
> org.apache.flink.connector.testframe.container.FlinkImageBuilder.build(FlinkImageBuilder.java:206)
> Apr 03 03:03:15   ... 13 more
> Apr 03 03:03:15 Caused by: 
> com.github.dockerjava.api.exception.DockerClientException: Could not build 
> image: Head 
> "https://registry-1.docker.io/v2/library/eclipse-temurin/manifests/21-jre-jammy":
>  received unexpected HTTP status: 500 Internal Server Error
> Apr 03 03:03:15   at 
> com.github.dockerjava.api.command.BuildImageResultCallback.getImageId(BuildImageResultCallback.java:78)
> Apr 03 03:03:15   at 
> com.github.dockerjava.api.command.BuildImageResultCallback.awaitImageId(BuildImageResultCallback.java:50)
> Apr 03 03:03:15   at 
> org.testcontainers.images.builder.ImageFromDockerfile.resolve(ImageFromDockerfile.java:159)
> Apr 03 03:03:15   at 
> org.testcontainers.images.builder.ImageFromDockerfile.resolve(ImageFromDockerfile.java:40)
> Apr 03 03:03:15   at 
> org.testcontainers.utility.LazyFuture.getResolvedValue(LazyFuture.java:19)
> Apr 03 03:03:15   at 
> org.testcontainers.utility.LazyFuture.get(LazyFuture.java:41)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> Apr 03 03:03:15   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> Apr 03 03:03:15   at java.base/java.lang.Thread.run(Thread.java:1583)
> Apr 03 03:03:15 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35004) SqlGatewayE2ECase could not start container

2024-04-04 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35004:
--
Component/s: Test Infrastructure

> SqlGatewayE2ECase could not start container
> ---
>
> Key: FLINK-35004
> URL: https://issues.apache.org/jira/browse/FLINK-35004
> Project: Flink
>  Issue Type: Bug
>  Components: Test Infrastructure
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Priority: Critical
>  Labels: github-actions, test-stability
>
> 1.20, jdk17: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58708=logs=e8e46ef5-75cc-564f-c2bd-1797c35cbebe=60c49903-2505-5c25-7e46-de91b1737bea=15078
> There is an error: "Process failed due to timeout" in 
> {{SqlGatewayE2ECase.testSqlClientExecuteStatement}}.  In the maven logs, we 
> can see:
> {code:java}
> 02:57:26,979 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Image prestodb/hdp2.6-hive:10 pull took 
> PT43.59218S
> 02:57:26,991 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Creating container for image: 
> prestodb/hdp2.6-hive:10
> 02:57:27,032 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Container prestodb/hdp2.6-hive:10 is starting: 
> 162069678c7d03252a42ed81ca43e1911ca7357c476a4a5de294ffe55bd83145
> 02:57:42,846 [main] INFO  tc.prestodb/hdp2.6-hive:10  
>  [] - Container prestodb/hdp2.6-hive:10 started in 
> PT15.855339866S
> 02:57:53,447 [main] ERROR tc.prestodb/hdp2.6-hive:10  
>  [] - Could not start container
> java.lang.RuntimeException: java.net.SocketTimeoutException: timeout
>   at 
> org.apache.flink.table.gateway.containers.HiveContainer.containerIsStarted(HiveContainer.java:94)
>  ~[test-classes/:?]
>   at 
> org.testcontainers.containers.GenericContainer.containerIsStarted(GenericContainer.java:723)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:543)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:354)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
>  ~[duct-tape-1.0.8.jar:?]
>   at 
> org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:344)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.apache.flink.table.gateway.containers.HiveContainer.doStart(HiveContainer.java:69)
>  ~[test-classes/:?]
>   at 
> org.testcontainers.containers.GenericContainer.start(GenericContainer.java:334)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.GenericContainer.starting(GenericContainer.java:1144)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at 
> org.testcontainers.containers.FailureDetectingExternalResource$1.evaluate(FailureDetectingExternalResource.java:28)
>  ~[testcontainers-1.19.1.jar:1.19.1]
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137) 
> ~[junit-4.13.2.jar:4.13.2]
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115) 
> ~[junit-4.13.2.jar:4.13.2]
>   at 
> org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
>  ~[junit-vintage-engine-5.10.1.jar:5.10.1]
>   at 
> org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
>  ~[junit-vintage-engine-5.10.1.jar:5.10.1]
>   at 
> org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72) 
> ~[junit-vintage-engine-5.10.1.jar:5.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:198)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:169)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:93)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:58)
>  ~[junit-platform-launcher-1.10.1.jar:1.10.1]
>   at 
> 

[jira] [Resolved] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-35000.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

master: 
[d301839dfe2ed9b1313d23f8307bda76868a0c0a|https://github.com/apache/flink/commit/d301839dfe2ed9b1313d23f8307bda76868a0c0a]
1.19: 
[eb58599b434b6c5fe86f6e487ce88315c98b4ec3|https://github.com/apache/flink/commit/eb58599b434b6c5fe86f6e487ce88315c98b4ec3]
1.18: 
[9150f93b18b8694646092a6ed24a14e3653f613f|https://github.com/apache/flink/commit/9150f93b18b8694646092a6ed24a14e3653f613f]

> PullRequest template doesn't use the correct format to refer to the testing 
> code convention
> ---
>
> Key: FLINK-35000
> URL: https://issues.apache.org/jira/browse/FLINK-35000
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI, Project Website
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> The PR template refers to 
> https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
>  rather than 
> https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35002) GitHub action/upload-artifact@v4 can timeout

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-35002:
--
Labels: github-actions test-stability  (was: test-stability)

> GitHub action/upload-artifact@v4 can timeout
> 
>
> Key: FLINK-35002
> URL: https://issues.apache.org/jira/browse/FLINK-35002
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Reporter: Ryan Skraba
>Priority: Major
>  Labels: github-actions, test-stability
>
> A timeout can occur when uploading a successfully built artifact:
>  * [https://github.com/apache/flink/actions/runs/8516411871/job/23325392650]
> {code:java}
> 2024-04-02T02:20:15.6355368Z With the provided path, there will be 1 file 
> uploaded
> 2024-04-02T02:20:15.6360133Z Artifact name is valid!
> 2024-04-02T02:20:15.6362872Z Root directory input is valid!
> 2024-04-02T02:20:20.6975036Z Attempt 1 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 3000 ms...
> 2024-04-02T02:20:28.7084937Z Attempt 2 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 4785 ms...
> 2024-04-02T02:20:38.5015936Z Attempt 3 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 7375 ms...
> 2024-04-02T02:20:50.8901508Z Attempt 4 of 5 failed with error: Request 
> timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. 
> Retrying request in 14988 ms...
> 2024-04-02T02:21:10.9028438Z ##[error]Failed to CreateArtifact: Failed to 
> make request after 5 attempts: Request timeout: 
> /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact
> 2024-04-02T02:22:59.9893296Z Post job cleanup.
> 2024-04-02T02:22:59.9958844Z Post job cleanup. {code}
> (This is unlikely to be something we can fix, but we can track it.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34999) PR CI stopped operating

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34999:
--
Description: 
There are no [new PR CI 
runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
being picked up anymore. [Recently updated 
PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked 
up by the @flinkbot.

In the meantime there was a notification sent from GitHub that the password of 
the [@flinkbot|https://github.com/flinkbot] was reset for security reasons. 
It's quite likely that these two events are related.

  was:
There are no [new PR CI 
runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
being picked up anymore. [Recently updated 
PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked 
up by the @flinkbot.

In the meantime there was a notification sent from GitHub that the password of 
the @flinkbot was reset for security reasons. It's quite likely that these two 
events are related.


> PR CI stopped operating
> ---
>
> Key: FLINK-34999
> URL: https://issues.apache.org/jira/browse/FLINK-34999
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Blocker
>
> There are no [new PR CI 
> runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
> being picked up anymore. [Recently updated 
> PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not 
> picked up by the @flinkbot.
> In the meantime there was a notification sent from GitHub that the password 
> of the [@flinkbot|https://github.com/flinkbot] was reset for security 
> reasons. It's quite likely that these two events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention

2024-04-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35000:
-

 Summary: PullRequest template doesn't use the correct format to 
refer to the testing code convention
 Key: FLINK-35000
 URL: https://issues.apache.org/jira/browse/FLINK-35000
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI, Project Website
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The PR template refers to 
https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
 rather than 
https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-35000:
-

Assignee: Matthias Pohl

> PullRequest template doesn't use the correct format to refer to the testing 
> code convention
> ---
>
> Key: FLINK-35000
> URL: https://issues.apache.org/jira/browse/FLINK-35000
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI, Project Website
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Minor
>
> The PR template refers to 
> https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
>  rather than 
> https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34999) PR CI stopped operating

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833523#comment-17833523
 ] 

Matthias Pohl commented on FLINK-34999:
---

CC [~uce] [~Weijie Guo] [~fanrui] [~rmetzger]
CC [~jingge] since it might be Ververica infrastructure-related

> PR CI stopped operating
> ---
>
> Key: FLINK-34999
> URL: https://issues.apache.org/jira/browse/FLINK-34999
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Blocker
>
> There are no [new PR CI 
> runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
> being picked up anymore. [Recently updated 
> PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not 
> picked up by the @flinkbot.
> In the meantime there was a notification sent from GitHub that the password 
> of the @flinkbot was reset for security reasons. It's quite likely that these 
> two events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34999) PR CI stopped operating

2024-04-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34999:
-

 Summary: PR CI stopped operating
 Key: FLINK-34999
 URL: https://issues.apache.org/jira/browse/FLINK-34999
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


There are no [new PR CI 
runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
being picked up anymore. [Recently updated 
PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked 
up by the @flinkbot.

In the meantime there was a notification sent from GitHub that the password of 
the @flinkbot was reset for security reasons. It's quite likely that these two 
events are related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833505#comment-17833505
 ] 

Matthias Pohl commented on FLINK-34997:
---

The issue seems to be that {{docker-compose}} binaries are missing in the Azure 
VMs.

> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Blocker
>  Labels: test-stability
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34997:
--
Labels: test-stability  (was: )

> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>  Labels: test-stability
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34998) Wordcount on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833504#comment-17833504
 ] 

Matthias Pohl commented on FLINK-34998:
---

I guess, this one is a duplicate of FLINK-34997. In the end, the error happens 
due to the missing {{docker-compose}} binaries in the Azure VMs. WDYT?

> Wordcount on Docker test failed on azure
> 
>
> Key: FLINK-34998
> URL: https://issues.apache.org/jira/browse/FLINK-34998
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh:
>  line 65: docker-compose: command not found
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh:
>  line 66: docker-compose: command not found
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/test_docker_embedded_job.sh:
>  line 67: docker-compose: command not found
> sort: cannot read: 
> '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-24250435151/out/docker_wc_out*':
>  No such file or directory
> Apr 03 02:08:14 FAIL WordCount: Output hash mismatch.  Got 
> d41d8cd98f00b204e9800998ecf8427e, expected 0e5bd0a3dd7d5a7110aa85ff70adb54b.
> Apr 03 02:08:14 head hexdump of actual:
> head: cannot open 
> '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-24250435151/out/docker_wc_out*'
>  for reading: No such file or directory
> Apr 03 02:08:14 Stopping job timeout watchdog (with pid=244913)
> Apr 03 02:08:14 [FAIL] Test script contains errors.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=e9d3d34f-3d15-59f4-0e3e-35067d100dfe=5d91035e-8022-55f2-2d4f-ab121508bf7e=6043



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34997:
--
Description: 
{code}
Apr 03 03:12:37 
==
Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
Apr 03 03:12:37 
==
Apr 03 03:12:37 TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
Apr 03 03:12:37 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Docker version 24.0.9, build 2936816
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: line 
24: docker-compose: command not found
Apr 03 03:12:38 [FAIL] Test script contains errors.
Apr 03 03:12:38 Checking of logs skipped.
Apr 03 03:12:38 
Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
minutes and 1 seconds! Test exited with exit code 1
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226

  was:
Apr 03 03:12:37 
==
Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
Apr 03 03:12:37 
==
Apr 03 03:12:37 TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
Apr 03 03:12:37 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Apr 03 03:12:38 Docker version 24.0.9, build 2936816
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: line 
24: docker-compose: command not found
Apr 03 03:12:38 [FAIL] Test script contains errors.
Apr 03 03:12:38 Checking of logs skipped.
Apr 03 03:12:38 
Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
minutes and 1 seconds! Test exited with exit code 1




https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226


> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Major
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34997) PyFlink YARN per-job on Docker test failed on azure

2024-04-03 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34997:
--
Priority: Blocker  (was: Major)

> PyFlink YARN per-job on Docker test failed on azure
> ---
>
> Key: FLINK-34997
> URL: https://issues.apache.org/jira/browse/FLINK-34997
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI
>Affects Versions: 1.20.0
>Reporter: Weijie Guo
>Priority: Blocker
>  Labels: test-stability
>
> {code}
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 Running 'PyFlink YARN per-job on Docker test'
> Apr 03 03:12:37 
> ==
> Apr 03 03:12:37 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-37046085202
> Apr 03 03:12:37 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
> Apr 03 03:12:38 Docker version 24.0.9, build 2936816
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: 
> line 24: docker-compose: command not found
> Apr 03 03:12:38 [FAIL] Test script contains errors.
> Apr 03 03:12:38 Checking of logs skipped.
> Apr 03 03:12:38 
> Apr 03 03:12:38 [FAIL] 'PyFlink YARN per-job on Docker test' failed after 0 
> minutes and 1 seconds! Test exited with exit code 1
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58709=logs=f8e16326-dc75-5ba0-3e95-6178dd55bf6c=94ccd692-49fc-5c64-8775-d427c6e65440=10226



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34643) JobIDLoggingITCase failed

2024-04-03 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833414#comment-17833414
 ] 

Matthias Pohl commented on FLINK-34643:
---

I guess reopening the issue would be fine. But for the sake of not putting too 
much into a single ticket, it wouldn't be wrong to create a new ticket and 
link FLINK-34643 as the cause, either. I personally would go for the latter 
option.

> JobIDLoggingITCase failed
> -
>
> Key: FLINK-34643
> URL: https://issues.apache.org/jira/browse/FLINK-34643
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=7897
> {code}
> Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in 
> org.apache.flink.test.misc.JobIDLoggingITCase
> Mar 09 01:24:23 01:24:23.498 [ERROR] 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) 
> -- Time elapsed: 1.459 s <<< ERROR!
> Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded 
> for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in 
> the test code
> Mar 09 01:24:23   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132)
> Mar 09 01:24:23   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 09 01:24:23   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 09 01:24:23 
> {code}
> The other test failures of this build were also caused by the same test:
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8349
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=a596f69e-60d2-5a4b-7d39-dc69e4cdaed3=712ade8c-ca16-5b76-3acd-14df33bc1cb1=8209



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833154#comment-17833154
 ] 

Matthias Pohl edited comment on FLINK-34989 at 4/2/24 12:18 PM:


This Jira issue is about adding job concurrency support. Ideally, we should 
make it easily configurable and set it to a concurrency level within the limit 
of 20 requested by Apache Infra. This affects the nightly builds, which run per 
branch with 5 different test profiles, each test profile occupying 11 runners 
(10 stages + a short-running license check) in parallel.

Generally, we should make CI more selective anyway. Apache Infra constantly 
criticizes projects for running heavy-load CI on changes like simple doc 
changes (see [here|https://infra.apache.org/github-actions-secrets.html]).


was (Author: mapohl):
This Jira issue is about adding job concurrency support. Ideally, we should 
make it configurable in an easy way and set it to a concurrency level >20 as 
requested by Apache Infra. This affects the nightly builds which run per branch 
with 5 different test profiles and each test profile having 11 runners (10 
stages + a short-running license check) being occupied in parallel.

Generally, we should make CI be more selective anyway. Apache Infra constantly 
criticizes projects to run heavy-load CI for things like simple doc changes.

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources. According to [this 
> source|https://infra.apache.org/github-actions-secrets.html] Apache Infra 
> manages 180 runners right now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34989:
--
Description: 
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

Currently (last week of March 2024), Flink was ranked #19 among the projects that 
used the Apache Infra runner resources the most, which doesn't seem too bad. 
This included not only Apache Flink but also the Kubernetes operator, 
connectors and other resources. According to [this 
source|https://infra.apache.org/github-actions-secrets.html], Apache Infra 
manages 180 runners right now.

  was:
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

Currently (last week of March 2024) Flink was ranked at #19 of projects that 
used the Apache Infra runner resources the most which doesn't seem too bad. 
This contained not only Apache Flink but also the Kubernetes operator, 
connectors and other resources.


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> 

[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833155#comment-17833155
 ] 

Matthias Pohl commented on FLINK-34989:
---

For this issue, we should keep in mind that it only affects the 
non-ephemeral runners. FLINK-34331 works on enabling ephemeral runners for 
Apache Flink. Ephemeral runners would allow us to add project-specific 
runners, i.e. someone could donate hardware so that Flink has its own 
runners and we don't have to worry too much about blocking other projects with CI.

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34331) Enable Apache INFRA ephemeral runners for nightly builds

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34331:
--
Summary: Enable Apache INFRA ephemeral runners for nightly builds  (was: 
Enable Apache INFRA runners for nightly builds)

> Enable Apache INFRA ephemeral runners for nightly builds
> 
>
> Key: FLINK-34331
> URL: https://issues.apache.org/jira/browse/FLINK-34331
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The nightly CI is currently still utilizing the GitHub runners. We want to 
> switch to Apache INFRA's ephemeral runners (see 
> [docs|https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833154#comment-17833154
 ] 

Matthias Pohl commented on FLINK-34989:
---

This Jira issue is about adding job concurrency support. Ideally, we should 
make it configurable in an easy way and set it to a concurrency level of at most 
20, as requested by Apache Infra. This affects the nightly builds, which run per 
branch with 5 different test profiles, each test profile occupying 11 runners (10 
stages + a short-running license check) in parallel.

Generally, we should make CI more selective anyway. Apache Infra constantly 
criticizes projects for running heavy-load CI on changes as simple as doc changes.

> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833153#comment-17833153
 ] 

Matthias Pohl commented on FLINK-34989:
---

Here's a summary of the requirements and whether we meet them based on the 
most recent report:

|| Requirement || Flink CI ||
| Job concurrency level of 20 (or, better, 15) or below | (n) |
| Do not exceed the equivalent of 25 full-time runners per calendar week, i.e. 4,200 hours per 7 days | (y) |
| Do not exceed the equivalent of 30 full-time runners in any consecutive 5-day period, i.e. 3,600 hours per 5 days | (y) |
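
The hour budgets translate into full-time-runner equivalents with simple 
arithmetic; a minimal sketch (plain Java, using only the figures quoted from 
the Infra policy above):
{code:java}
public class RunnerBudgetSketch {
    public static void main(String[] args) {
        double weeklyLimitHours = 4_200.0;   // MUST limit per calendar week (~25 full-time runners)
        double fiveDayLimitHours = 3_600.0;  // MUST limit per consecutive 5 days (~30 full-time runners)
        double hoursPerRunnerDay = 24.0;     // one full-time runner for one day

        // 4,200 h / 24 h = 175 runner-days = 25 runners * 7 days
        System.out.println("Weekly budget in runner-days: " + (weeklyLimitHours / hoursPerRunnerDay));
        // 3,600 h / 24 h = 150 runner-days = 30 runners * 5 days
        System.out.println("5-day budget in runner-days:  " + (fiveDayLimitHours / hoursPerRunnerDay));
    }
}
{code}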


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}
> Currently (last week of March 2024) Flink was ranked at #19 of projects that 
> used the Apache Infra runner resources the most which doesn't seem too bad. 
> This contained not only Apache Flink but also the Kubernetes operator, 
> connectors and other resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34989:
--
Description: 
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

Currently (last week of March 2024), Flink was ranked #19 among the projects that 
used the Apache Infra runner resources the most, which doesn't seem too bad. 
This included not only Apache Flink but also the Kubernetes operator, 
connectors and other resources.

  was:
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to 

[jira] [Updated] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34989:
--
Description: 
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

There was a policy change announced recently:
{quote}
Policy change on use of GitHub Actions

Due to misconfigurations in their builds, some projects have been using 
unsupportable numbers of GitHub Actions. As part of fixing this situation, 
Infra has added a 'resource use' section to the policy on GitHub Actions. 
This section of the policy will come into effect on April 20, 2024:

All workflows MUST have a job concurrency level less than or equal to 20. 
This means a workflow cannot have more than 20 jobs running at the same time 
across all matrices.
All workflows SHOULD have a job concurrency level less than or equal to 15. 
Just because 20 is the max, doesn't mean you should strive for 20.
The average number of minutes a project uses per calendar week MUST NOT 
exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
hours).
The average number of minutes a project uses in any consecutive five-day 
period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
minutes, or 3,600 hours).
Projects whose builds consistently cross the maximum use limits will lose 
their access to GitHub Actions until they fix their build configurations.
The full policy is at  
https://infra.apache.org/github-actions-policy.html.
{quote}

  was:
The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)


> Apache Infra requests to reduce the runner usage for a project
> --
>
> Key: FLINK-34989
> URL: https://issues.apache.org/jira/browse/FLINK-34989
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
> now. These runners are limited. The runner usage can be monitored via the 
> following links:
> * [Flink-specific 
> report|https://infra-reports.apache.org/#ghactions=flink=168] 
> (needs ASF committer rights) This project-specific report can only be 
> modified through the HTTP GET parameters of the URL.
> * [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
> membership)
> There was a policy change announced recently:
> {quote}
> Policy change on use of GitHub Actions
> Due to misconfigurations in their builds, some projects have been using 
> unsupportable numbers of GitHub Actions. As part of fixing this situation, 
> Infra has added a 'resource use' section to the policy on GitHub Actions. 
> This section of the policy will come into effect on April 20, 2024:
> All workflows MUST have a job concurrency level less than or equal to 20. 
> This means a workflow cannot have more than 20 jobs running at the same time 
> across all matrices.
> All workflows SHOULD have a job concurrency level less than or equal to 15. 
> Just because 20 is the max, doesn't mean you should strive for 20.
> The average number of minutes a project uses per calendar week MUST NOT 
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200 
> hours).
> The average number of minutes a project uses in any consecutive five-day 
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000 
> minutes, or 3,600 hours).
> Projects whose builds consistently cross the maximum use limits will lose 
> their access to GitHub Actions until they fix their build configurations.
> The full policy is at  
> https://infra.apache.org/github-actions-policy.html.
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833149#comment-17833149
 ] 

Matthias Pohl commented on FLINK-34937:
---

I moved the runner usage discussion into FLINK-34989

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34989:
-

 Summary: Apache Infra requests to reduce the runner usage for a 
project
 Key: FLINK-34989
 URL: https://issues.apache.org/jira/browse/FLINK-34989
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights) This project-specific report can only be modified 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34427) FineGrainedSlotManagerTest fails fatally (exit code 239)

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833098#comment-17833098
 ] 

Matthias Pohl commented on FLINK-34427:
---

Copied over from FLINK-33416:
* https://github.com/XComp/flink/actions/runs/6472726326/job/17575765131
* 1.19: 
https://github.com/apache/flink/actions/runs/8467681781/job/23199435037#step:10:8909

> FineGrainedSlotManagerTest fails fatally (exit code 239)
> 
>
> Key: FLINK-34427
> URL: https://issues.apache.org/jira/browse/FLINK-34427
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://github.com/apache/flink/actions/runs/7866453350/job/21460921911#step:10:8959
> {code}
> Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239
> Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests:
> Error: 02:28:53 02:28:53.220 [ERROR] 
> org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest
> Error: 02:28:53 02:28:53.220 [ERROR] 
> org.apache.maven.surefire.booter.SurefireBooterForkException: 
> ExecutionException The forked VM terminated without properly saying goodbye. 
> VM crash or System.exit called?
> Error: 02:28:53 02:28:53.220 [ERROR] Command was /bin/sh -c cd 
> '/root/flink/flink-runtime' && 
> '/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java' '-XX:+UseG1GC' '-Xms256m' 
> '-XX:+IgnoreUnrecognizedVMOptions' 
> '--add-opens=java.base/java.util=ALL-UNNAMED' 
> '--add-opens=java.base/java.lang=ALL-UNNAMED' 
> '--add-opens=java.base/java.net=ALL-UNNAMED' 
> '--add-opens=java.base/java.io=ALL-UNNAMED' 
> '--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' '-Xmx768m' '-jar' 
> '/root/flink/flink-runtime/target/surefire/surefirebooter-20240212022332296_94.jar'
>  '/root/flink/flink-runtime/target/surefire' 
> '2024-02-12T02-21-39_495-jvmRun3' 'surefire-20240212022332296_88tmp' 
> 'surefire_26-20240212022332296_91tmp'
> Error: 02:28:53 02:28:53.220 [ERROR] Error occurred in starting fork, check 
> output in log
> Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239
> Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests:
> Error: 02:28:53 02:28:53.221 [ERROR] 
> org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest
> Error: 02:28:53 02:28:53.221 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:456)
> [...]
> {code}
> The fatal error is triggered most likely within the 
> {{FineGrainedSlotManagerTest}}:
> {code}
> 02:26:39,362 [   pool-643-thread-1] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'pool-643-thread-1' produced an uncaught exception. Stopping the 
> process...
> java.util.concurrent.CompletionException: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4bbc0b10 
> rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@7a45cd9a[Shutting down, pool 
> size = 1, active threads = 1, queued tasks = 1, completed tasks = 194]
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:838) 
> ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:851)
>  ~[?:1.8.0_392]
> at 
> java.util.concurrent.CompletableFuture.handleAsync(CompletableFuture.java:2178)
>  ~[?:1.8.0_392]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer.allocateSlot(DefaultSlotStatusSyncer.java:138)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.allocateSlotsAccordingTo(FineGrainedSlotManager.java:722)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.checkResourceRequirements(FineGrainedSlotManager.java:645)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.lambda$null$12(FineGrainedSlotManager.java:603)
>  ~[classes/:?]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_392]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_392]
> at 
> 
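
The stack trace above follows a common pattern: a delayed task is handed to a 
{{ScheduledThreadPoolExecutor}} that is already shutting down, and the resulting 
{{RejectedExecutionException}} escapes as an uncaught exception, which the fatal 
exit handler turns into a JVM termination (hence exit code 239). A minimal, 
self-contained sketch of that pattern (plain JDK classes only, not the Flink 
code involved here):
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RejectedAfterShutdownSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

        // Simulate the executor being torn down while other code still holds a reference to it.
        executor.shutdown();

        try {
            // Scheduling against a shut-down executor is rejected immediately.
            executor.schedule(() -> System.out.println("never runs"), 50, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            // If an exception like this reaches a fatal uncaught-exception handler
            // instead of being caught, the forked test JVM is terminated.
            System.err.println("rejected: " + e);
        }

        executor.awaitTermination(1, TimeUnit.SECONDS);
    }
}
{code}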

[jira] [Closed] (FLINK-33416) FineGrainedSlotManagerTest failed with fatal error

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl closed FLINK-33416.
-
Resolution: Duplicate

This issue is addressed in FLINK-34427. I'm closing FLINK-33416 in favor of 
FLINK-34427 because the investigation happened there.

> FineGrainedSlotManagerTest failed with fatal error
> --
>
> Key: FLINK-33416
> URL: https://issues.apache.org/jira/browse/FLINK-33416
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: github-actions, test-stability
>
> In FLINK-33245, we reported an error of the 
> {{ZooKeeperLeaderElectionConnectionHandlingTest}} failure due to a fatal 
> error. The corresponding build is [this 
> one|https://github.com/XComp/flink/actions/runs/6472726326/job/17575765131].
> But the stacktrace indicates that it's actually 
> {{FineGrainedSlotManagerTest}} which ran before the ZK-related test:
> {code}
> Test 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest.testSlotAllocationAccordingToStrategyResult[testSlotAllocationAccordingToStrategyResult()]
>  successfully run.
> 
> 19:30:11,463 [   pool-752-thread-1] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'pool-752-thread-1' produced an uncaught exception. Stopping the 
> process...
> java.util.concurrent.CompletionException: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1201ef67[Not
>  completed, task = 
> java.util.concurrent.Executors$RunnableAdapter@1ea6ccfa[Wrapped task = 
> java.util.concurrent.CompletableFuture$UniHandle@36f84d94]] rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@4642c78d[Shutting down, pool 
> size = 1, active threads = 1, queued tasks = 1, completed tasks = 194]
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>  ~[?:?]
> at 
> java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:951)
>  ~[?:?]
> at 
> java.util.concurrent.CompletableFuture.handleAsync(CompletableFuture.java:2276)
>  ~[?:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer.allocateSlot(DefaultSlotStatusSyncer.java:138)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.allocateSlotsAccordingTo(FineGrainedSlotManager.java:722)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.checkResourceRequirements(FineGrainedSlotManager.java:645)
>  ~[classes/:?]
> at 
> org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager.lambda$checkResourceRequirementsWithDelay$12(FineGrainedSlotManager.java:603)
>  ~[classes/:?]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
> at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1201ef67[Not
>  completed, task = 
> java.util.concurrent.Executors$RunnableAdapter@1ea6ccfa[Wrapped task = 
> java.util.concurrent.CompletableFuture$UniHandle@36f84d94]] rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@4642c78d[Shutting down, pool 
> size = 1, active threads = 1, queued tasks = 1, completed tasks = 194]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
>  ~[?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) 
> ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
>  ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
>  ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705)
>  ~[?:?]
> at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:687)
>  ~[?:?]
> at 
> 

[jira] [Comment Edited] (FLINK-34988) Class loading issues in JDK17 and JDK21

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833095#comment-17833095
 ] 

Matthias Pohl edited comment on FLINK-34988 at 4/2/24 10:07 AM:


It's most likely caused by FLINK-34548 based on the git history between the 
most recent successful nightly run on master 
[20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results]
 (based on {{3841f062}}) and 
[20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results]
 (based on {{d271495c}}):
{code}
$ git log 3841f062..d271495c --oneline
d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation
28762497bdf [FLINK-34548][API] Supports sink-v2 Sink
056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source
ceafa5a5705 [FLINK-34548][API] Implement datastream
4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators
e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment
9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector 
to flink-core-api
cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction
13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext 
related interfaces
13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api
59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core 
depend on it
5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules
{code}


was (Author: mapohl):
It's most likely caused by FLINK-34548 based on the git history between the 
most recent successful nightly run on master 
[20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results]
 (based on {{3841f062}}) and 
[20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results]
 (based on {{d271495c}}):
{code}
$ git log 3841f062..d271495c5be34f4e4a518207ca7716f4e8907e5f --oneline
d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation
28762497bdf [FLINK-34548][API] Supports sink-v2 Sink
056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source
ceafa5a5705 [FLINK-34548][API] Implement datastream
4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators
e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment
9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector 
to flink-core-api
cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction
13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext 
related interfaces
13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api
59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core 
depend on it
5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules
{code}

> Class loading issues in JDK17 and JDK21
> ---
>
> Key: FLINK-34988
> URL: https://issues.apache.org/jira/browse/FLINK-34988
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: test-stability
>
> * JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942
> * JDK 17 (misc; ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548
> * JDK 21 (core; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963
> * JDK 21 (misc; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34988) Class loading issues in JDK17 and JDK21

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833095#comment-17833095
 ] 

Matthias Pohl commented on FLINK-34988:
---

It's most likely caused by FLINK-34548 based on the git history between the 
most recent successful nightly run on master 
[20240331.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58645=results]
 (based on {{3841f062}}) and 
[20240402.1|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=results]
 (based on {{d271495c}}):
{code}
$ git log 3841f062..d271495c5be34f4e4a518207ca7716f4e8907e5f --oneline
d271495c5be [hotfix] Fix compile error in DataStreamV2SinkTransformation
28762497bdf [FLINK-34548][API] Supports sink-v2 Sink
056660e0b69 [FLINK-34548][API] Supports FLIP-27 Source
ceafa5a5705 [FLINK-34548][API] Implement datastream
4f71c5b4660 [FLINK-34548][API] Implement process function's underlying operators
e1147ca7e39 [FLINK-34548][API] Introduce ExecutionEnvironment
9fa74a8a706 [FLINK-34548][API] Introduce stream interface and move KeySelector 
to flink-core-api
cedbcce6eff [FLINK-34548][API] Introduce variants of ProcessFunction
13cfaa76b5e [FLINK-34548][API] Introduce ProcessFunction and RuntimeContext 
related interfaces
13790e03207 [FLINK-34548][API] Move Function interface to flink-core-api
59525e460af [FLINK-34548][API] Create flink-core-api module and let flink-core 
depend on it
5b2e923be0a [FLINK-34548][API] Initialize the datastream v2 related modules
{code}

> Class loading issues in JDK17 and JDK21
> ---
>
> Key: FLINK-34988
> URL: https://issues.apache.org/jira/browse/FLINK-34988
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: test-stability
>
> * JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942
> * JDK 17 (misc; ExceptionInInitializeError): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548
> * JDK 21 (core; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963
> * JDK 21 (misc; same as above): 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34988) Class loading issues in JDK17 and JDK21

2024-04-02 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34988:
-

 Summary: Class loading issues in JDK17 and JDK21
 Key: FLINK-34988
 URL: https://issues.apache.org/jira/browse/FLINK-34988
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.20.0
Reporter: Matthias Pohl


* JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializeError): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942
* JDK 17 (misc; ExceptionInInitializeError): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548
* JDK 21 (core; same as above): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963
* JDK 21 (misc; same as above): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due async checkpoint triggering not being completed

2024-04-02 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-33816:
--
Fix Version/s: 1.19.1

> SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due 
> async checkpoint triggering not being completed 
> -
>
> Key: FLINK-33816
> URL: https://issues.apache.org/jira/browse/FLINK-33816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Matthias Pohl
>Assignee: jiabao.sun
>Priority: Major
>  Labels: github-actions, pull-request-available, test-stability
> Fix For: 1.20.0, 1.19.1
>
> Attachments: screenshot-1.png
>
>
> [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430]
> {code:java}
> rror: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest
> 9426Error: 14:39:01 14:39:01.930 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain
>   Time elapsed: 0.034 s  <<< FAILURE!
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: 
> 9428Dec 12 14:39:01 
> 9429Dec 12 14:39:01 Expecting value to be true but was false
> 9430Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
> 9431Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
> 9432Dec 12 14:39:01   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710)
> 9433Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> 9434Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due async checkpoint triggering not being completed

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833056#comment-17833056
 ] 

Matthias Pohl commented on FLINK-33816:
---

master: 
[5aebb04b3055fbec6a74eaf4226c4a88d3fd2d6e|https://github.com/apache/flink/commit/5aebb04b3055fbec6a74eaf4226c4a88d3fd2d6e]
1.19: 
[ece4faee055b3797b39e9c0b55f3e94a3db2f912|https://github.com/apache/flink/commit/ece4faee055b3797b39e9c0b55f3e94a3db2f912]

> SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due 
> async checkpoint triggering not being completed 
> -
>
> Key: FLINK-33816
> URL: https://issues.apache.org/jira/browse/FLINK-33816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Matthias Pohl
>Assignee: jiabao.sun
>Priority: Major
>  Labels: github-actions, pull-request-available, test-stability
> Fix For: 1.20.0
>
> Attachments: screenshot-1.png
>
>
> [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430]
> {code:java}
> rror: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest
> 9426Error: 14:39:01 14:39:01.930 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain
>   Time elapsed: 0.034 s  <<< FAILURE!
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: 
> 9428Dec 12 14:39:01 
> 9429Dec 12 14:39:01 Expecting value to be true but was false
> 9430Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
> 9431Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
> 9432Dec 12 14:39:01   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710)
> 9433Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> 9434Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-04-02 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833036#comment-17833036
 ] 

Matthias Pohl commented on FLINK-34953:
---

Hi [~gongzhongqiang], it sounds like we already reached consensus on this matter. 
But you could bring it up on the dev ML to check whether there are any objections 
to this approach before going ahead with this ticket, so that it has proper 
backing from the community.

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34961) GitHub Actions runner statistics can be monitored per workflow name

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34961:
--
Labels: starter  (was: )

> GitHub Actions runner statistics can be monitored per workflow name
> --
>
> Key: FLINK-34961
> URL: https://issues.apache.org/jira/browse/FLINK-34961
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: starter
>
> Apache Infra allows the monitoring of runner usage per workflow (see [report 
> for 
> Flink|https://infra-reports.apache.org/#ghactions=flink=168=10];
>   only accessible with Apache committer rights). They accumulate the data by 
> workflow name. The Flink space has multiple repositories that use the generic 
> workflow name {{CI}}. That makes the differentiation in the report harder.
> This Jira issue is about identifying all Flink-related projects with a CI 
> workflow (Kubernetes operator and the JDBC connector were identified, for 
> instance) and adding a more distinct name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34961) GitHub Actions statistics can be monitored per workflow name

2024-03-28 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34961:
-

 Summary: GitHub Actions statistics can be monitored per workflow 
name
 Key: FLINK-34961
 URL: https://issues.apache.org/jira/browse/FLINK-34961
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Reporter: Matthias Pohl


Apache Infra allows the monitoring of runner usage per workflow (see [report 
for 
Flink|https://infra-reports.apache.org/#ghactions=flink=168=10];
  only accessible with Apache committer rights). They accumulate the data by 
workflow name. The Flink space has multiple repositories that use the generic 
workflow name {{CI}}. That makes the differentiation in the report harder.

This Jira issue is about identifying all Flink-related projects with a CI 
workflow (Kubernetes operator and the JDBC connector were identified, for 
instance) and adding a more distinct name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34961) GitHub Actions runner statistics can be monitored per workflow name

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34961:
--
Summary: GitHub Actions runner statistics can be monitored per workflow name 
 (was: GitHub Actions statistics can be monitored per workflow name)

> GitHub Actions runner statistics can be monitored per workflow name
> --
>
> Key: FLINK-34961
> URL: https://issues.apache.org/jira/browse/FLINK-34961
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Priority: Major
>
> Apache Infra allows the monitoring of runner usage per workflow (see [report 
> for 
> Flink|https://infra-reports.apache.org/#ghactions=flink=168=10];
>   only accessible with Apache committer rights). They accumulate the data by 
> workflow name. The Flink space has multiple repositories that use the generic 
> workflow name {{CI}}. That makes the differentiation in the report harder.
> This Jira issue is about identifying all Flink-related projects with a CI 
> workflow (Kubernetes operator and the JDBC connector were identified, for 
> instance) and adding a more distinct name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831844#comment-17831844
 ] 

Matthias Pohl commented on FLINK-34937:
---

Looks like Flink ranks 19th in terms of runner minutes used over the past 7 
days:

[Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights)

[Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34933) JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored isn't implemented properly

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34933.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

master: 
[1668a07276929416469392a35a77ba7699aac30b|https://github.com/apache/flink/commit/1668a07276929416469392a35a77ba7699aac30b]
1.19: 
[c11656a2406f07e2ae7cd6f80c46afb14385ee0e|https://github.com/apache/flink/commit/c11656a2406f07e2ae7cd6f80c46afb14385ee0e]
1.18: 
[94d1363c27e26fc8313721e138c7b4de744ca69e|https://github.com/apache/flink/commit/94d1363c27e26fc8313721e138c7b4de744ca69e]

> JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored
>  isn't implemented properly
> ---
>
> Key: FLINK-34933
> URL: https://issues.apache.org/jira/browse/FLINK-34933
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> {{testResultFutureCompletionOfOutdatedLeaderIsIgnored}} doesn't test the 
> desired behavior: The {{TestingJobMasterService#closeAsync()}} callback 
> throws an {{UnsupportedOperationException}} by default which prevents the 
> test from properly finalizing the leadership revocation.
> The test still passes because it implicitly checks for this error. Instead, we 
> should verify that the runner's resultFuture doesn't complete until the runner 
> is closed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-33376) Extend Curator config option for Zookeeper configuration

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-33376.
---
Fix Version/s: 1.20.0
 Release Note: Adds support for the following curator parameters: 
high-availability.zookeeper.client.authorization (curator parameter: 
authorization), high-availability.zookeeper.client.max-close-wait (curator 
parameter: maxCloseWaitMs), 
high-availability.zookeeper.client.simulated-session-expiration-percent 
(curator parameter: simulatedSessionExpirationPercent)
   Resolution: Fixed

master: 
[83f82ab0c865a4fa9e119c96e11e0fb3df4a5ecd|https://github.com/apache/flink/commit/83f82ab0c865a4fa9e119c96e11e0fb3df4a5ecd]
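
For illustration, a minimal sketch of how AuthInfo objects are passed when 
building a Curator client (plain Curator APIs with illustrative values; this is 
not the exact Flink wiring and assumes a ZooKeeper ensemble at localhost:2181):
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

import org.apache.curator.framework.AuthInfo;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorAuthInfoExample {
    public static void main(String[] args) {
        // AuthInfo entries as they could be derived from the new
        // high-availability.zookeeper.client.authorization option (illustrative values)
        List<AuthInfo> authInfos = Collections.singletonList(
                new AuthInfo("digest", "user:password".getBytes(StandardCharsets.UTF_8)));

        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("localhost:2181")
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .authorization(authInfos) // pass the configured AuthInfo objects to Curator
                .build();
        client.start();
        client.close();
    }
}
{code}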

> Extend Curator config option for Zookeeper configuration
> 
>
> Key: FLINK-33376
> URL: https://issues.apache.org/jira/browse/FLINK-33376
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Oleksandr Nitavskyi
>Assignee: Oleksandr Nitavskyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> In certain cases ZooKeeper requires additional authentication information, 
> for example a list of valid [names for 
> ensemble|https://zookeeper.apache.org/doc/r3.8.0/zookeeperAdmin.html#:~:text=for%20secure%20authentication.-,zookeeper.ensembleAuthName,-%3A%20(Java%20system%20property]
>  in order to prevent accidentally connecting to the wrong ensemble.
> Curator allows adding additional AuthInfo objects for such configuration. Thus 
> it would be useful to add an additional Map property that allows passing 
> AuthInfo objects during Curator client creation.
> *Acceptance Criteria:* Flink users can configure a list of auth info entries 
> for the Curator framework client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33376) Extend Curator config option for Zookeeper configuration

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-33376:
--
Release Note: Adds support for the following curator parameters: 
high-availability.zookeeper.client.authorization (corresponding curator 
parameter: authorization), high-availability.zookeeper.client.max-close-wait 
(corresponding curator parameter: maxCloseWaitMs), 
high-availability.zookeeper.client.simulated-session-expiration-percent 
(corresponding curator parameter: simulatedSessionExpirationPercent).  (was: 
Adds support for the following curator parameters: 
high-availability.zookeeper.client.authorization (curator parameter: 
authorization), high-availability.zookeeper.client.max-close-wait (curator 
parameter: maxCloseWaitMs), 
high-availability.zookeeper.client.simulated-session-expiration-percent 
(curator parameter: simulatedSessionExpirationPercent))

> Extend Curator config option for Zookeeper configuration
> 
>
> Key: FLINK-33376
> URL: https://issues.apache.org/jira/browse/FLINK-33376
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Oleksandr Nitavskyi
>Assignee: Oleksandr Nitavskyi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> In certain cases ZooKeeper requires additional authentication information, 
> for example a list of valid [names for 
> ensemble|https://zookeeper.apache.org/doc/r3.8.0/zookeeperAdmin.html#:~:text=for%20secure%20authentication.-,zookeeper.ensembleAuthName,-%3A%20(Java%20system%20property]
>  in order to prevent accidentally connecting to the wrong ensemble.
> Curator allows adding additional AuthInfo objects for such configuration. Thus 
> it would be useful to add an additional Map property that allows passing 
> AuthInfo objects during Curator client creation.
> *Acceptance Criteria:* Flink users can configure a list of auth info entries 
> for the Curator framework client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reopened FLINK-34953:
---

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831665#comment-17831665
 ] 

Matthias Pohl edited comment on FLINK-34953 at 3/28/24 9:52 AM:


I guess we could do it. The [GitHub Actions 
Policy|https://infra.apache.org/github-actions-policy.html] excludes 
non-released artifacts like websites from the restriction:
{quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) 
MAY work on website content and other non-released data such as documentation 
and convenience binaries. Automated services MUST NOT push data to a repository 
or branch that is subject to official release as a software package by the 
project, unless the project secures specific prior authorization of the 
workflow from Infrastructure.
{quote}
Not sure whether they updated that one recently. Or do you have another source 
which is stricter, [~martijnvisser] ?


was (Author: mapohl):
I guess we could do it. The [GitHub Actions 
Policy|https://infra.apache.org/github-actions-policy.html] excludes 
non-released artifacts like website from the restriction:
{quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) 
MAY work on website content and other non-released data such as documentation 
and convenience binaries. Automated services MUST NOT push data to a repository 
or branch that is subject to official release as a software package by the 
project, unless the project secures specific prior authorization of the 
workflow from Infrastructure.
{quote}
Not sure whether they updated that one recently. Or do you have another source 
which is stricter, [~martijnvisser] ?

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34953) Add github ci for flink-web to auto commit build files

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831665#comment-17831665
 ] 

Matthias Pohl commented on FLINK-34953:
---

I guess we could do it. The [GitHub Actions 
Policy|https://infra.apache.org/github-actions-policy.html] excludes 
non-released artifacts like website from the restriction:
{quote}Automated services such as GitHub Actions (and Jenkins, BuildBot, etc.) 
MAY work on website content and other non-released data such as documentation 
and convenience binaries. Automated services MUST NOT push data to a repository 
or branch that is subject to official release as a software package by the 
project, unless the project secures specific prior authorization of the 
workflow from Infrastructure.
{quote}
Not sure whether they updated that one recently. Or do you have another source 
which is stricter, [~martijnvisser] ?

> Add github ci for flink-web to auto commit build files
> --
>
> Key: FLINK-34953
> URL: https://issues.apache.org/jira/browse/FLINK-34953
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website
>Reporter: Zhongqiang Gong
>Priority: Minor
>  Labels: website
>
> Currently, build files for https://github.com/apache/flink-web are committed 
> from local builds. So I want to use GitHub CI to build the docs and commit them.
>  
> Changes:
>  * Add a website build check for PRs
>  * Automatically build and commit the build files after a PR is merged to `asf-site`
>  * Optional: this CI can also be triggered manually



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831659#comment-17831659
 ] 

Matthias Pohl commented on FLINK-34937:
---

Let's check https://github.com/assignUser/stash (which is provided by 
[~assignuser] from the Apache Arrow project and promoted in Apache Infra's 
roundtable group) to see whether our CI can benefit from it.

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-34551) Align retry mechanisms of FutureUtils

2024-03-28 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-34551:
-

Assignee: Matthias Pohl  (was: Kumar Mallikarjuna)

> Align retry mechanisms of FutureUtils
> -
>
> Key: FLINK-34551
> URL: https://issues.apache.org/jira/browse/FLINK-34551
> Project: Flink
>  Issue Type: Technical Debt
>  Components: API / Core
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> The retry mechanisms of FutureUtils include quite a bit of redundant code 
> which makes it hard to understand and to extend. The logic should be aligned 
> properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34551) Align retry mechanisms of FutureUtils

2024-03-28 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831657#comment-17831657
 ] 

Matthias Pohl commented on FLINK-34551:
---

This ticket originated from FLINK-34227, where I wanted to add logic for 
retrying forever. I managed to split {{retrySuccessfulOperationWithDelay}} in 
FLINK-34227 in a way that didn't generate too much additional redundant code. I 
created FLINK-34551 as a follow-up anyway because I noticed that 
{{retrySuccessfulOperationWithDelay}} and {{retryOperation}} share some common 
logic and that we could improve how these methods decide which executor to run 
the {{operation}} on (scheduledExecutor vs. calling thread).

Your current proposal still has redundant code. We would need to iterate on the 
change a bit more and discuss the contract of these methods in more detail. But 
unfortunately, I will be away for quite a while soon, so I would not be able to 
help you. Additionally, it's not a high-priority task right now. I'm wondering 
whether we should unassign the task again. I want to avoid you spending time on 
it and then getting stuck because of missing feedback from my side.

I should have considered this yesterday already. Sorry for that.
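
For illustration, a simplified sketch of the kind of shared retry core mentioned 
above, including the choice of running retries on a scheduled executor rather 
than the calling thread (plain JDK APIs only; these are not the actual 
{{FutureUtils}} signatures):
{code:java}
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

public final class RetrySketch {

    // Retries an asynchronous operation with a fixed delay until it succeeds,
    // the failure is not retryable, or no attempts are left.
    static <T> CompletableFuture<T> retryWithDelay(
            Supplier<CompletableFuture<T>> operation,
            int remainingRetries,
            Duration delay,
            Predicate<Throwable> isRetryable,
            ScheduledExecutorService scheduler) {

        CompletableFuture<T> result = new CompletableFuture<>();
        operation.get().whenComplete((value, failure) -> {
            if (failure == null) {
                result.complete(value);
            } else if (remainingRetries > 0 && isRetryable.test(failure)) {
                // schedule the next attempt on the scheduler instead of the calling thread
                scheduler.schedule(
                        () -> {
                            retryWithDelay(operation, remainingRetries - 1, delay, isRetryable, scheduler)
                                    .whenComplete((v, f) -> {
                                        if (f == null) {
                                            result.complete(v);
                                        } else {
                                            result.completeExceptionally(f);
                                        }
                                    });
                        },
                        delay.toMillis(),
                        TimeUnit.MILLISECONDS);
            } else {
                result.completeExceptionally(failure);
            }
        });
        return result;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        CompletableFuture<String> future = retryWithDelay(
                () -> CompletableFuture.supplyAsync(() -> {
                    // fail the first two attempts to demonstrate the retry loop
                    if (attempts.incrementAndGet() < 3) {
                        throw new IllegalStateException("transient failure");
                    }
                    return "done after " + attempts.get() + " attempts";
                }),
                5,
                Duration.ofMillis(100),
                t -> true,
                scheduler);
        System.out.println(future.get());
        scheduler.shutdown();
    }
}
{code}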

> Align retry mechanisms of FutureUtils
> -
>
> Key: FLINK-34551
> URL: https://issues.apache.org/jira/browse/FLINK-34551
> Project: Flink
>  Issue Type: Technical Debt
>  Components: API / Core
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Kumar Mallikarjuna
>Priority: Major
>  Labels: pull-request-available
>
> The retry mechanisms of FutureUtils include quite a bit of redundant code 
> which makes it hard to understand and to extend. The logic should be aligned 
> properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34937) Apache Infra GHA policy update

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831422#comment-17831422
 ] 

Matthias Pohl edited comment on FLINK-34937 at 3/27/24 3:45 PM:


We should pin all actions (i.e. use the git SHA rather than a version tag) for 
external actions (anything other than {{actions/\*}}, {{github/\*}} and 
{{apache/\*}} prefixed actions). That's not the case right now.


was (Author: mapohl):
We should pin all actions (i.e. use the git SHA rather than a version tag) for 
external actions (anything other than {{actions/*}}, {{github/*}} and 
{{apache/*}} prefixed actions). That's not the case right now.

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831422#comment-17831422
 ] 

Matthias Pohl commented on FLINK-34937:
---

We should pin all actions (i.e. use the git SHA rather than a version tag) for 
external actions (anything other than {{actions/*}}, {{github/*}} and 
{{apache/*}} prefixed actions). That's not the case right now.

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21

2024-03-27 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34419.
---
Resolution: Fixed

> flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
> ---
>
> Key: FLINK-34419
> URL: https://issues.apache.org/jira/browse/FLINK-34419
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Assignee: Muhammet Orazov
>Priority: Major
>  Labels: pull-request-available, starter
>
> [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40]
>  needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 
> support was added in 1.19 (FLINK-33163)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831391#comment-17831391
 ] 

Matthias Pohl edited comment on FLINK-34419 at 3/27/24 2:56 PM:


master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b
dev-master: 1460077743b29e17edd0a2d7efd3897fa097988d
dev-1.19: 67d7c46ed382a665e941f0cf1f1606d10f87dee5
dev-1.18: d93d911b015e535fc2b6f1426c3b36229ff3d02a


was (Author: mapohl):
master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b
dev-master: tba
dev-1.19: tba
dev-1.18: tba

> flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
> ---
>
> Key: FLINK-34419
> URL: https://issues.apache.org/jira/browse/FLINK-34419
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Assignee: Muhammet Orazov
>Priority: Major
>  Labels: pull-request-available, starter
>
> [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40]
>  needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 
> support was added in 1.19 (FLINK-33163)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831391#comment-17831391
 ] 

Matthias Pohl commented on FLINK-34419:
---

master: 9e0041a2c9dace4bf3f32815e3e24e24385b179b
dev-master: tba
dev-1.19: tba
dev-1.18: tba

> flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21
> ---
>
> Key: FLINK-34419
> URL: https://issues.apache.org/jira/browse/FLINK-34419
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Build System / CI
>Reporter: Matthias Pohl
>Assignee: Muhammet Orazov
>Priority: Major
>  Labels: pull-request-available, starter
>
> [.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40]
>  needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736). JDK 21 
> support was added in 1.19 (FLINK-33163)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34897) JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip needs to be enabled again

2024-03-27 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34897.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

master: 
[0e70d89ad9f807a5816290e9808720e71bdad655|https://github.com/apache/flink/commit/0e70d89ad9f807a5816290e9808720e71bdad655]
1.19: 
[6b5c48ff53ddc6e75056a9050afded2ac44a413a|https://github.com/apache/flink/commit/6b5c48ff53ddc6e75056a9050afded2ac44a413a]
1.18: 
[a6aa569f5005041934a2e6398b6749584beeaabd|https://github.com/apache/flink/commit/a6aa569f5005041934a2e6398b6749584beeaabd]

> JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip
>  needs to be enabled again
> --
>
> Key: FLINK-34897
> URL: https://issues.apache.org/jira/browse/FLINK-34897
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> While working on FLINK-34672 I noticed that 
> {{JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip}}
>  is disabled without a reason.
> It looks like I disabled it accidentally as part of FLINK-31783.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34897) JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip needs to be enabled again

2024-03-27 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34897:
--
Affects Version/s: (was: 1.17.2)

> JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip
>  needs to be enabled again
> --
>
> Key: FLINK-34897
> URL: https://issues.apache.org/jira/browse/FLINK-34897
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> While working on FLINK-34672 I noticed that 
> {{JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip}}
>  is disabled without a reason.
> It looks like I disabled it accidentally as part of FLINK-31783.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34551) Align retry mechanisms of FutureUtils

2024-03-27 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831232#comment-17831232
 ] 

Matthias Pohl commented on FLINK-34551:
---

The intention of the ticket is to remove the code redundancy, yes. I'm gonna 
assign the issue to you.

> Align retry mechanisms of FutureUtils
> -
>
> Key: FLINK-34551
> URL: https://issues.apache.org/jira/browse/FLINK-34551
> Project: Flink
>  Issue Type: Technical Debt
>  Components: API / Core
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> The retry mechanisms of FutureUtils include quite a bit of redundant code 
> which makes it hard to understand and to extend. The logic should be aligned 
> properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-34551) Align retry mechanisms of FutureUtils

2024-03-27 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-34551:
-

Assignee: Kumar Mallikarjuna

> Align retry mechanisms of FutureUtils
> -
>
> Key: FLINK-34551
> URL: https://issues.apache.org/jira/browse/FLINK-34551
> Project: Flink
>  Issue Type: Technical Debt
>  Components: API / Core
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Kumar Mallikarjuna
>Priority: Major
>
> The retry mechanisms of FutureUtils include quite a bit of redundant code 
> which makes it hard to understand and to extend. The logic should be aligned 
> properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34940) LeaderContender implementations handle invalid state

2024-03-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34940:
-

 Summary: LeaderContender implementations handle invalid state
 Key: FLINK-34940
 URL: https://issues.apache.org/jira/browse/FLINK-34940
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Reporter: Matthias Pohl


Currently, LeaderContender implementations (e.g. see 
[ResourceManagerServiceImplTest#grantLeadership_withExistingLeader_waitTerminationOfExistingLeader|https://github.com/apache/flink/blob/master/flink-runtime/src/test/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImplTest.java#L219])
 allow handling two leader events of the same type in a row, which shouldn't be 
the case.

Two subsequent leadership grants indicate that the instance which received the 
second grant missed the leadership revocation event in between, leaving the 
overall deployment in an invalid state (i.e. a split-brain scenario). We should 
fail fatally in these scenarios rather than handling them.
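
To illustrate the intended behavior, a minimal sketch in plain Java (class and 
method names are illustrative, not Flink's actual {{LeaderContender}} API) of a 
guard that fails fatally when a second grant arrives without a revocation in 
between:
{code:java}
import java.util.UUID;

// Illustrative guard: leadership grants and revocations must alternate; a second
// grant without a revocation in between indicates a missed revocation event.
final class LeadershipGuard {

    private UUID currentLeaderSessionId; // null while not leading

    synchronized void onGrantLeadership(UUID leaderSessionId) {
        if (currentLeaderSessionId != null) {
            // two subsequent grants: possible split-brain -> fail fatally instead of handling it
            throw new IllegalStateException(
                    "Received leadership grant " + leaderSessionId
                            + " while still holding leadership " + currentLeaderSessionId);
        }
        currentLeaderSessionId = leaderSessionId;
    }

    synchronized void onRevokeLeadership() {
        if (currentLeaderSessionId == null) {
            throw new IllegalStateException("Leadership revoked although no leadership was held.");
        }
        currentLeaderSessionId = null;
    }
}
{code}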



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34937) Apache Infra GHA policy update

2024-03-26 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830876#comment-17830876
 ] 

Matthias Pohl commented on FLINK-34937:
---

I updated the link to refer to a publicly available resource (y)

I haven't gone through the policy in detail. We might have to get back to infra 
if things are unclear. For this, it might be worth it to respond in the [infra 
ML thread|https://lists.apache.org/thread/6qw21x44q88rc3mhkn42jgjjw94rsvb1] 
(for which you would have to subscribe)

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34937) Apache Infra GHA policy update

2024-03-26 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34937:
--
Description: 
There is a policy update [announced in the infra 
ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
asked Apache projects to limit the number of runners per job. Additionally, the 
[GHA policy|https://infra.apache.org/github-actions-policy.html] is referenced 
which I wasn't aware of when working on the action workflow.

This issue is about applying the policy to the Flink GHA workflows.

  was:
There is a policy update [announced in the infra 
ML|https://lists.apache.org/thread/6qw21x44q88rc3mhkn42jgjjw94rsvb1] which 
asked Apache projects to limit the number of runners per job. Additionally, the 
[GHA policy|https://infra.apache.org/github-actions-policy.html] is referenced 
which I wasn't aware of when working on the action workflow.

This issue is about applying the policy to the Flink GHA workflows.


> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://www.mail-archive.com/jdo-dev@db.apache.org/msg13638.html] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34939) Harden TestingLeaderElection

2024-03-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34939:
-

 Summary: Harden TestingLeaderElection
 Key: FLINK-34939
 URL: https://issues.apache.org/jira/browse/FLINK-34939
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The {{TestingLeaderElection}} implementation does not follow the interface 
contract of {{LeaderElection}} in all of its facets (e.g. leadership acquisition 
and revocation events should alternate).

This issue is about hardening the {{LeaderElection}} contract in the test 
implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34937) Apache Infra GHA policy update

2024-03-26 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34937:
--
Parent: FLINK-33901
Issue Type: Sub-task  (was: Bug)

> Apache Infra GHA policy update
> --
>
> Key: FLINK-34937
> URL: https://issues.apache.org/jira/browse/FLINK-34937
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System / CI
>Affects Versions: 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Priority: Major
>
> There is a policy update [announced in the infra 
> ML|https://lists.apache.org/thread/6qw21x44q88rc3mhkn42jgjjw94rsvb1] which 
> asked Apache projects to limit the number of runners per job. Additionally, 
> the [GHA policy|https://infra.apache.org/github-actions-policy.html] is 
> referenced which I wasn't aware of when working on the action workflow.
> This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34937) Apache Infra GHA policy update

2024-03-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34937:
-

 Summary: Apache Infra GHA policy update
 Key: FLINK-34937
 URL: https://issues.apache.org/jira/browse/FLINK-34937
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


There is a policy update [announced in the infra 
ML|https://lists.apache.org/thread/6qw21x44q88rc3mhkn42jgjjw94rsvb1] which 
asked Apache projects to limit the number of runners per job. Additionally, the 
[GHA policy|https://infra.apache.org/github-actions-policy.html] is referenced 
which I wasn't aware of when working on the action workflow.

This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34643) JobIDLoggingITCase failed

2024-03-26 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830812#comment-17830812
 ] 

Matthias Pohl commented on FLINK-34643:
---

Should we try to reproduce the test failure in a PR by modifying the CI scripts 
(i.e. executing the test in a loop)? That way we could disable the test in 
{{master}} for now.

> JobIDLoggingITCase failed
> -
>
> Key: FLINK-34643
> URL: https://issues.apache.org/jira/browse/FLINK-34643
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=7897
> {code}
> Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in 
> org.apache.flink.test.misc.JobIDLoggingITCase
> Mar 09 01:24:23 01:24:23.498 [ERROR] 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) 
> -- Time elapsed: 1.459 s <<< ERROR!
> Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded 
> for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in 
> the test code
> Mar 09 01:24:23   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132)
> Mar 09 01:24:23   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 09 01:24:23   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 09 01:24:23 
> {code}
> The other test failures of this build were also caused by the same test:
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8349
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=a596f69e-60d2-5a4b-7d39-dc69e4cdaed3=712ade8c-ca16-5b76-3acd-14df33bc1cb1=8209



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (FLINK-33900) Multiple failures in WindowRankITCase due to NoResourceAvailableException

2024-03-25 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl closed FLINK-33900.
-
Resolution: Duplicate

For the failures whose logs were not removed yet, I checked that this is 
actually a duplicate of FLINK-34227. Closing this one in favor of FLINK-34227.

> Multiple failures in WindowRankITCase due to NoResourceAvailableException
> -
>
> Key: FLINK-33900
> URL: https://issues.apache.org/jira/browse/FLINK-33900
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 1.18.0, 1.19.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: github-actions, test-stability
>
> [https://github.com/XComp/flink/actions/runs/7244405295/job/19733011527#step:12:14989]
> There are multiple tests in {{WindowRankITCase}} that fail due to a 
> {{NoResourceAvailableException}} supposedly:
> {code:java}
> [...]
> Error: 09:19:33 09:19:32.966 [ERROR] 
> WindowRankITCase.testTumbleWindowTVFWithOffset  Time elapsed: 300.072 s  <<< 
> FAILURE!
> 14558Dec 18 09:19:33 org.opentest4j.MultipleFailuresError: 
> 14559Dec 18 09:19:33 Multiple Failures (2 failures)
> 14560Dec 18 09:19:33  org.apache.flink.runtime.client.JobExecutionException: 
> Job execution failed.
> 14561Dec 18 09:19:33  java.lang.AssertionError: 
> 14562Dec 18 09:19:33  at 
> org.junit.vintage.engine.execution.TestRun.getStoredResultOrSuccessful(TestRun.java:200)
> 14563Dec 18 09:19:33  at 
> org.junit.vintage.engine.execution.RunListenerAdapter.fireExecutionFinished(RunListenerAdapter.java:248)
> 14564Dec 18 09:19:33  at 
> org.junit.vintage.engine.execution.RunListenerAdapter.testFinished(RunListenerAdapter.java:214)
> 14565Dec 18 09:19:33  at 
> org.junit.vintage.engine.execution.RunListenerAdapter.testFinished(RunListenerAdapter.java:88)
> 14566Dec 18 09:19:33  at 
> org.junit.runner.notification.SynchronizedRunListener.testFinished(SynchronizedRunListener.java:87)
> 14567Dec 18 09:19:33  at 
> org.junit.runner.notification.RunNotifier$9.notifyListener(RunNotifier.java:225)
> 14568Dec 18 09:19:33  at 
> org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:72)
> 14569Dec 18 09:19:33  at 
> org.junit.runner.notification.RunNotifier.fireTestFinished(RunNotifier.java:222)
> 14570Dec 18 09:19:33  at 
> org.junit.internal.runners.model.EachTestNotifier.fireTestFinished(EachTestNotifier.java:38)
> 14571Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:372)
> 14572Dec 18 09:19:33  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> 14573Dec 18 09:19:33  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> 14574Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> 14575Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> 14576Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> 14577Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> 14578Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> 14579Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> 14580Dec 18 09:19:33  at org.junit.runners.Suite.runChild(Suite.java:128)
> 14581Dec 18 09:19:33  at org.junit.runners.Suite.runChild(Suite.java:27)
> 14582Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> 14583Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> 14584Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> 14585Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> 14586Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> 14587Dec 18 09:19:33  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> 14588Dec 18 09:19:33  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> 14589Dec 18 09:19:33  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 14590Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 14591Dec 18 09:19:33  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> 14592Dec 18 09:19:33  at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> 14593Dec 18 09:19:33  at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
> 14594Dec 18 09:19:33  at 
> org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
> 14595Dec 18 09:19:33  at 
> org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
> 14596Dec 18 09:19:33  at 
> 

[jira] [Commented] (FLINK-34227) Job doesn't disconnect from ResourceManager

2024-03-25 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830581#comment-17830581
 ] 

Matthias Pohl commented on FLINK-34227:
---

https://github.com/apache/flink/actions/runs/8414062328/job/23037443503#step:10:12562

> Job doesn't disconnect from ResourceManager
> ---
>
> Key: FLINK-34227
> URL: https://issues.apache.org/jira/browse/FLINK-34227
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Critical
>  Labels: github-actions, pull-request-available, test-stability
> Attachments: FLINK-34227.7e7d69daebb438b8d03b7392c9c55115.log, 
> FLINK-34227.log
>
>
> https://github.com/XComp/flink/actions/runs/7634987973/job/20800205972#step:10:14557
> {code}
> [...]
> "main" #1 prio=5 os_prio=0 tid=0x7f4b7000 nid=0x24ec0 waiting on 
> condition [0x7fccce1eb000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0xbdd52618> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>   at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>   at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
>   at 
> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
>   at 
> org.apache.flink.table.planner.runtime.stream.sql.WindowDistinctAggregateITCase.testHopWindow_Cube(WindowDistinctAggregateITCase.scala:550)
> [...]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34273) git fetch fails

2024-03-25 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830565#comment-17830565
 ] 

Matthias Pohl commented on FLINK-34273:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58519=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=bc77b88f-20e6-5fb3-ac3b-0b6efcca48c5=406

> git fetch fails
> ---
>
> Key: FLINK-34273
> URL: https://issues.apache.org/jira/browse/FLINK-34273
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / CI, Test Infrastructure
>Affects Versions: 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: test-stability
>
> We've seen multiple {{git fetch}} failures. I assume this to be an 
> infrastructure issue. This Jira issue is for documentation purposes.
> {code:java}
> error: RPC failed; curl 18 transfer closed with outstanding read data 
> remaining
> error: 5211 bytes of body are still expected
> fetch-pack: unexpected disconnect while reading sideband packet
> fatal: early EOF
> fatal: fetch-pack: invalid index-pack output {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57080=logs=0e7be18f-84f2-53f0-a32d-4a5e4a174679=5d6dc3d3-393d-5111-3a40-c6a5a36202e6=667



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30719) flink-runtime-web failed due to a corrupted nodejs dependency

2024-03-25 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830562#comment-17830562
 ] 

Matthias Pohl commented on FLINK-30719:
---

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58502=logs=52b61abe-a3cc-5bde-cc35-1bbe89bb7df5=54421a62-0c80-5aad-3319-094ff69180bb=9714

Slightly different error but still worth mentioning:
{code}
13:36:43.413 [ERROR] Failed to execute goal 
com.github.eirslett:frontend-maven-plugin:1.11.0:install-node-and-npm (install 
node and npm) on project flink-runtime-web: Could not download Node.js: Got 
error code 525 from the server. -> [Help 1]
{code}

> flink-runtime-web failed due to a corrupted nodejs dependency
> -
>
> Key: FLINK-30719
> URL: https://issues.apache.org/jira/browse/FLINK-30719
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Web Frontend, Test Infrastructure, Tests
>Affects Versions: 1.16.0, 1.17.0, 1.18.0
>Reporter: Matthias Pohl
>Assignee: Sergey Nuyanzin
>Priority: Critical
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=44954=logs=52b61abe-a3cc-5bde-cc35-1bbe89bb7df5=54421a62-0c80-5aad-3319-094ff69180bb=12550
> The build failed due to a corrupted nodejs dependency:
> {code}
> [ERROR] The archive file 
> /__w/1/.m2/repository/com/github/eirslett/node/16.13.2/node-16.13.2-linux-x64.tar.gz
>  is corrupted and will be deleted. Please try the build again.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-21450) Add local recovery support to adaptive scheduler

2024-03-25 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830558#comment-17830558
 ] 

Matthias Pohl commented on FLINK-21450:
---

Enabling the tests for AdaptiveScheduler (see FLINK-34409):
* master
** 
[8f06fb472ba6a10f0829aecf1eedee26e924aa6d|https://github.com/apache/flink/commit/8f06fb472ba6a10f0829aecf1eedee26e924aa6d]
* 1.19
** 
[00492630baa5cf041ea2cce2a3560f3e713bf57a|https://github.com/apache/flink/commit/00492630baa5cf041ea2cce2a3560f3e713bf57a]
* 1.18
** 
[f5c243097ac9fae29c3365a2361b7b0c6be3b3ee|https://github.com/apache/flink/commit/f5c243097ac9fae29c3365a2361b7b0c6be3b3ee]

> Add local recovery support to adaptive scheduler
> 
>
> Key: FLINK-21450
> URL: https://issues.apache.org/jira/browse/FLINK-21450
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Robert Metzger
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> auto-unassigned, pull-request-available
> Fix For: 1.18.0
>
>
> Local recovery means that, on a failure, we are able to re-use the state 
> already present on a TaskManager instead of loading it again from distributed 
> storage (which means the scheduler needs to know which state is located where, 
> and schedule tasks accordingly).
> The Adaptive Scheduler currently does not respect the location of state, so 
> failures require re-loading state from distributed storage.
> Adding this feature will allow us to enable the {{Local recovery and sticky 
> scheduling end-to-end test}} for adaptive scheduler again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-21535) UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap space"

2024-03-25 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830557#comment-17830557
 ] 

Matthias Pohl commented on FLINK-21535:
---

Enabling the tests for the AdaptiveScheduler (see FLINK-34409):
* master
** 
[96142404c143f2094af262b8ac02a8b06aa773d5|https://github.com/apache/flink/commit/96142404c143f2094af262b8ac02a8b06aa773d5]
* 1.19
** 
[7d107966dbe7e38e43680fabf3ffdfeaa71e8d3c|https://github.com/apache/flink/commit/7d107966dbe7e38e43680fabf3ffdfeaa71e8d3c]
* 1.18
** 
[836b332b2d100e21b1d0008257a009d9ec09e13a|https://github.com/apache/flink/commit/836b332b2d100e21b1d0008257a009d9ec09e13a]

> UnalignedCheckpointITCase.execute failed with "OutOfMemoryError: Java heap 
> space"
> -
>
> Key: FLINK-21535
> URL: https://issues.apache.org/jira/browse/FLINK-21535
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.0
>Reporter: Dawid Wysakowicz
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.13.0, 1.12.3
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13866=logs=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3=a99e99c7-21cd-5a1f-7274-585e62b72f56
> {code}
> 2021-02-27T02:11:41.5659201Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2021-02-27T02:11:41.5659947Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
> 2021-02-27T02:11:41.5660794Z  at 
> org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
> 2021-02-27T02:11:41.5661618Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2021-02-27T02:11:41.5662356Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2021-02-27T02:11:41.5663104Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5664016Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5664817Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
> 2021-02-27T02:11:41.5665638Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2021-02-27T02:11:41.5666405Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2021-02-27T02:11:41.5667609Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2021-02-27T02:11:41.5668358Z  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> 2021-02-27T02:11:41.5669218Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1066)
> 2021-02-27T02:11:41.5669928Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:264)
> 2021-02-27T02:11:41.5670540Z  at 
> akka.dispatch.OnComplete.internal(Future.scala:261)
> 2021-02-27T02:11:41.5671268Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> 2021-02-27T02:11:41.5671881Z  at 
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> 2021-02-27T02:11:41.5672512Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5673219Z  at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
> 2021-02-27T02:11:41.5674085Z  at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 2021-02-27T02:11:41.5674794Z  at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 2021-02-27T02:11:41.5675466Z  at 
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
> 2021-02-27T02:11:41.5676181Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
> 2021-02-27T02:11:41.5676977Z  at 
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
> 2021-02-27T02:11:41.5677717Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> 2021-02-27T02:11:41.5678409Z  at 
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> 2021-02-27T02:11:41.5679071Z  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2021-02-27T02:11:41.5679776Z  at 
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 2021-02-27T02:11:41.5680576Z  at 
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> 2021-02-27T02:11:41.5681383Z  at 
> 

[jira] [Commented] (FLINK-21400) Attempt numbers are not maintained across restarts

2024-03-25 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-21400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830556#comment-17830556
 ] 

Matthias Pohl commented on FLINK-21400:
---

Enabling the tests for AdaptiveScheduler (see FLINK-34409):
* master
** 
[a1d17ccf0eec7dd614146e22832037cadd7abe5c|https://github.com/apache/flink/commit/a1d17ccf0eec7dd614146e22832037cadd7abe5c]
* 1.19
** 
[4fc36e9abaa8cc2d0e01c1e389b449f563b87e8e|https://github.com/apache/flink/commit/4fc36e9abaa8cc2d0e01c1e389b449f563b87e8e]
* 1.18
** 
[8f6890fbd757f3d3c9c891ea9139a1e5ac3412a2|https://github.com/apache/flink/commit/8f6890fbd757f3d3c9c891ea9139a1e5ac3412a2]

> Attempt numbers are not maintained across restarts
> --
>
> Key: FLINK-21400
> URL: https://issues.apache.org/jira/browse/FLINK-21400
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> The DeclarativeScheduler discards the ExecutionGraph on each restart attempt, 
> as a result of which the attempt number remains 0.
> Various tests use the attempt number to determine whether an exception should 
> be thrown, and thus continue to throw exceptions on each restart.
> Affected tests:
> UnalignedCheckpointTestBase
> UnalignedCheckpointITCase
> ProcessingTimeWindowCheckpointingITCase
> LocalRecoveryITCase
> EventTimeWindowCheckpointingITCase
> EventTimeAllWindowCheckpointingITCase
> FileSinkITBase#testFileSink



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (FLINK-34409) Increase test coverage for AdaptiveScheduler

2024-03-25 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl resolved FLINK-34409.
---
Fix Version/s: 1.18.2
   1.20.0
   1.19.1
   Resolution: Fixed

master: 
[1aa35b95975560da6afb7fcf0ad80f0a25c5d183|https://github.com/apache/flink/commit/1aa35b95975560da6afb7fcf0ad80f0a25c5d183]
1.19: 
[f82ff7c656d3eeb3e82b456d284639e59624a849|https://github.com/apache/flink/commit/f82ff7c656d3eeb3e82b456d284639e59624a849]
1.18: 
[f2a6ff5a97bf27d68be1188c05158e18df810549|https://github.com/apache/flink/commit/f2a6ff5a97bf27d68be1188c05158e18df810549]

> Increase test coverage for AdaptiveScheduler
> 
>
> Key: FLINK-34409
> URL: https://issues.apache.org/jira/browse/FLINK-34409
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Coordination
>Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> There are still several tests disabled for the {{AdaptiveScheduler}} which we 
> can enable now. All the issues seem to have been fixed.
> We can even remove the annotation {{@FailsWithAdaptiveScheduler}} now. It's 
> not needed anymore.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-34933) JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored isn't implemented properly

2024-03-25 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl reassigned FLINK-34933:
-

Assignee: Matthias Pohl

> JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored
>  isn't implemented properly
> ---
>
> Key: FLINK-34933
> URL: https://issues.apache.org/jira/browse/FLINK-34933
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Major
>
> {{testResultFutureCompletionOfOutdatedLeaderIsIgnored}} doesn't test the 
> desired behavior: the {{TestingJobMasterService#closeAsync()}} callback 
> throws an {{UnsupportedOperationException}} by default, which prevents the 
> test from properly finalizing the leadership revocation.
> The test still passes because it implicitly checks for this error. Instead, 
> we should verify that the runner's resultFuture doesn't complete until the 
> runner is closed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34933) JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored isn't implemented properly

2024-03-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34933:
-

 Summary: 
JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored
 isn't implemented properly
 Key: FLINK-34933
 URL: https://issues.apache.org/jira/browse/FLINK-34933
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.17.2, 1.20.0
Reporter: Matthias Pohl


{{testResultFutureCompletionOfOutdatedLeaderIsIgnored}} doesn't test the 
desired behavior: the {{TestingJobMasterService#closeAsync()}} callback throws 
an {{UnsupportedOperationException}} by default, which prevents the test from 
properly finalizing the leadership revocation.

The test still passes because it implicitly checks for this error. Instead, we 
should verify that the runner's resultFuture doesn't complete until the runner 
is closed.
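
For illustration, a rough sketch of the assertion the test should make
instead; the {{runner}} and {{leaderElection}} setup is only hinted at here,
and the exact helper names in {{JobMasterServiceLeadershipRunnerTest}} may
differ:

{code:java}
// Sketch only: runner/leaderElection setup is assumed to match the existing
// test and is omitted; AssertJ imports are assumed.
@Test
void testResultFutureCompletionOfOutdatedLeaderIsIgnored() throws Exception {
    CompletableFuture<JobManagerRunnerResult> resultFuture = runner.getResultFuture();

    // Revoking leadership of the outdated leader alone must not complete the future.
    leaderElection.notLeader();
    assertThat(resultFuture).isNotDone();

    // Only closing the runner may complete the result future.
    runner.close();
    assertThat(resultFuture).isDone();
}
{code}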



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due async checkpoint triggering not being completed

2024-03-22 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829918#comment-17829918
 ] 

Matthias Pohl edited comment on FLINK-33816 at 3/22/24 3:51 PM:


I created the [1.19 backport|https://github.com/apache/flink/pull/24556]. Is 
this also affecting 1.18? Based on the git history I would assume so.


was (Author: mapohl):
I created the 1.19 backport. Is this also affecting 1.18? Based on the git 
history I would assume so.

> SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due 
> async checkpoint triggering not being completed 
> -
>
> Key: FLINK-33816
> URL: https://issues.apache.org/jira/browse/FLINK-33816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Matthias Pohl
>Assignee: jiabao.sun
>Priority: Major
>  Labels: github-actions, pull-request-available, test-stability
> Fix For: 1.20.0
>
> Attachments: screenshot-1.png
>
>
> [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430]
> {code:java}
> Error: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest
> 9426Error: 14:39:01 14:39:01.930 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain
>   Time elapsed: 0.034 s  <<< FAILURE!
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: 
> 9428Dec 12 14:39:01 
> 9429Dec 12 14:39:01 Expecting value to be true but was false
> 9430Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
> 9431Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
> 9432Dec 12 14:39:01   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710)
> 9433Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> 9434Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due async checkpoint triggering not being completed

2024-03-22 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829918#comment-17829918
 ] 

Matthias Pohl commented on FLINK-33816:
---

I created the 1.19 backport. Is this also affecting 1.18? Based on the git 
history I would assume so.

> SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due 
> async checkpoint triggering not being completed 
> -
>
> Key: FLINK-33816
> URL: https://issues.apache.org/jira/browse/FLINK-33816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Matthias Pohl
>Assignee: jiabao.sun
>Priority: Major
>  Labels: github-actions, pull-request-available, test-stability
> Fix For: 1.20.0
>
> Attachments: screenshot-1.png
>
>
> [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430]
> {code:java}
> Error: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest
> 9426Error: 14:39:01 14:39:01.930 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain
>   Time elapsed: 0.034 s  <<< FAILURE!
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: 
> 9428Dec 12 14:39:01 
> 9429Dec 12 14:39:01 Expecting value to be true but was false
> 9430Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
> 9431Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
> 9432Dec 12 14:39:01   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710)
> 9433Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> 9434Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33816) SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due async checkpoint triggering not being completed

2024-03-22 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-33816:
--
Fix Version/s: 1.20.0
   (was: 2.0.0)

> SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain failed due 
> async checkpoint triggering not being completed 
> -
>
> Key: FLINK-33816
> URL: https://issues.apache.org/jira/browse/FLINK-33816
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.19.0
>Reporter: Matthias Pohl
>Assignee: jiabao.sun
>Priority: Major
>  Labels: github-actions, pull-request-available, test-stability
> Fix For: 1.20.0
>
> Attachments: screenshot-1.png
>
>
> [https://github.com/XComp/flink/actions/runs/7182604625/job/19559947894#step:12:9430]
> {code:java}
> Error: 14:39:01 14:39:01.930 [ERROR] Tests run: 16, Failures: 1, Errors: 0, 
> Skipped: 0, Time elapsed: 1.878 s <<< FAILURE! - in 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest
> 9426Error: 14:39:01 14:39:01.930 [ERROR] 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain
>   Time elapsed: 0.034 s  <<< FAILURE!
> 9427Dec 12 14:39:01 org.opentest4j.AssertionFailedError: 
> 9428Dec 12 14:39:01 
> 9429Dec 12 14:39:01 Expecting value to be true but was false
> 9430Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
> 9431Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
> 9432Dec 12 14:39:01   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTaskTest.testTriggeringStopWithSavepointWithDrain(SourceStreamTaskTest.java:710)
> 9433Dec 12 14:39:01   at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> 9434Dec 12 14:39:01   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:580)
> [...] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34643) JobIDLoggingITCase failed

2024-03-22 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829876#comment-17829876
 ] 

Matthias Pohl edited comment on FLINK-34643 at 3/22/24 3:44 PM:


* 
[https://github.com/apache/flink/actions/runs/8375475096/job/22933386950#step:10:7849]
 * 
[https://github.com/apache/flink/actions/runs/8384698540/job/22962603273#step:10:8296]
 * 
https://github.com/apache/flink/actions/runs/8384423503/job/22961956846#step:10:7958


was (Author: ryanskraba):
* 
[https://github.com/apache/flink/actions/runs/8375475096/job/22933386950#step:10:7849]
 * 
[https://github.com/apache/flink/actions/runs/8384698540/job/22962603273#step:10:8296]
 * 
https://github.com/apache/flink/actions/runs/8375475096/job/22933386950#step:10:7849

> JobIDLoggingITCase failed
> -
>
> Key: FLINK-34643
> URL: https://issues.apache.org/jira/browse/FLINK-34643
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=7897
> {code}
> Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in 
> org.apache.flink.test.misc.JobIDLoggingITCase
> Mar 09 01:24:23 01:24:23.498 [ERROR] 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) 
> -- Time elapsed: 1.459 s <<< ERROR!
> Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded 
> for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in 
> the test code
> Mar 09 01:24:23   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132)
> Mar 09 01:24:23   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 09 01:24:23   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 09 01:24:23 
> {code}
> The other test failures of this build were also caused by the same test:
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8349
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=a596f69e-60d2-5a4b-7d39-dc69e4cdaed3=712ade8c-ca16-5b76-3acd-14df33bc1cb1=8209



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-18476) PythonEnvUtilsTest#testStartPythonProcess fails

2024-03-22 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-18476:
--
Affects Version/s: 1.20.0

> PythonEnvUtilsTest#testStartPythonProcess fails
> ---
>
> Key: FLINK-18476
> URL: https://issues.apache.org/jira/browse/FLINK-18476
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python, Tests
>Affects Versions: 1.11.0, 1.15.3, 1.18.0, 1.19.0, 1.20.0
>Reporter: Dawid Wysakowicz
>Priority: Major
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> test-stability
>
> The 
> {{org.apache.flink.client.python.PythonEnvUtilsTest#testStartPythonProcess}} 
> failed in my local environment as it assumes the environment has 
> {{/usr/bin/python}}. 
> I don't know exactly how I got python on Ubuntu 20.04, but I only have an 
> alias for {{python = python3}}. Therefore the test fails.
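
As an aside (my own sketch, not something proposed in the ticket), the test
could resolve the interpreter from the PATH instead of assuming
{{/usr/bin/python}} exists, e.g.:

{code:java}
import java.io.IOException;

final class PythonResolver {
    // Illustrative helper: probe python3 first, then python, instead of
    // hard-coding /usr/bin/python.
    static String resolvePythonExecutable() throws InterruptedException {
        for (String candidate : new String[] {"python3", "python"}) {
            try {
                Process probe = new ProcessBuilder(candidate, "--version").start();
                if (probe.waitFor() == 0) {
                    return candidate;
                }
            } catch (IOException e) {
                // candidate is not on the PATH; try the next one
            }
        }
        throw new IllegalStateException("No python interpreter found on PATH");
    }
}
{code}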



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34919) WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs fails starting REST server

2024-03-22 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34919:
--
Component/s: Runtime / Coordination

> WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs fails starting REST 
> server
> 
>
> Key: FLINK-34919
> URL: https://issues.apache.org/jira/browse/FLINK-34919
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Priority: Critical
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58482=logs=77a9d8e1-d610-59b3-fc2a-4766541e0e33=125e07e7-8de0-5c6c-a541-a567415af3ef=8641]
> {code:java}
> Mar 22 04:12:50 04:12:50.260 [INFO] Running 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest
> Mar 22 04:12:50 04:12:50.609 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 0.318 s <<< FAILURE! -- in 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest
> Mar 22 04:12:50 04:12:50.609 [ERROR] 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs
>  -- Time elapsed: 0.303 s <<< ERROR!
> Mar 22 04:12:50 java.net.BindException: Could not start rest endpoint on any 
> port in port range 8081
> Mar 22 04:12:50   at 
> org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:286)
> Mar 22 04:12:50   at 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs(WebMonitorEndpointTest.java:69)
> Mar 22 04:12:50   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 22 04:12:50   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 22 04:12:50  {code}
> This was noted as a symptom of FLINK-22980, but doesn't have the same failure.
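
One possible mitigation (an assumption on my part, not something agreed on in
this ticket) is to let the test's REST endpoint bind to an ephemeral port
instead of the fixed default 8081, so concurrently running builds cannot
collide:

{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;

final class EphemeralRestPort {
    // Sketch: "0" asks the OS for a free port instead of insisting on 8081.
    static Configuration restConfigWithEphemeralPort() {
        Configuration config = new Configuration();
        config.set(RestOptions.BIND_PORT, "0");
        return config;
    }
}
{code}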



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34919) WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs fails starting REST server

2024-03-22 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34919:
--
Affects Version/s: 1.19.0

> WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs fails starting REST 
> server
> 
>
> Key: FLINK-34919
> URL: https://issues.apache.org/jira/browse/FLINK-34919
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.20.0
>Reporter: Ryan Skraba
>Priority: Critical
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58482=logs=77a9d8e1-d610-59b3-fc2a-4766541e0e33=125e07e7-8de0-5c6c-a541-a567415af3ef=8641]
> {code:java}
> Mar 22 04:12:50 04:12:50.260 [INFO] Running 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest
> Mar 22 04:12:50 04:12:50.609 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 0.318 s <<< FAILURE! -- in 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest
> Mar 22 04:12:50 04:12:50.609 [ERROR] 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs
>  -- Time elapsed: 0.303 s <<< ERROR!
> Mar 22 04:12:50 java.net.BindException: Could not start rest endpoint on any 
> port in port range 8081
> Mar 22 04:12:50   at 
> org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:286)
> Mar 22 04:12:50   at 
> org.apache.flink.runtime.webmonitor.WebMonitorEndpointTest.cleansUpExpiredExecutionGraphs(WebMonitorEndpointTest.java:69)
> Mar 22 04:12:50   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 22 04:12:50   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 22 04:12:50   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 22 04:12:50  {code}
> This was noted as a symptom of FLINK-22980, but doesn't have the same failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34921) SystemProcessingTimeServiceTest fails due to missing output

2024-03-22 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34921:
-

 Summary: SystemProcessingTimeServiceTest fails due to missing 
output
 Key: FLINK-34921
 URL: https://issues.apache.org/jira/browse/FLINK-34921
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.20.0
Reporter: Matthias Pohl


This PR's CI build with the {{AdaptiveScheduler}} enabled failed:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58476=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=11224

{code}
"ForkJoinPool-61-worker-25" #863 daemon prio=5 os_prio=0 tid=0x7f8c19eba000 
nid=0x60a5 waiting on condition [0x7f8bc2cf9000]
Mar 21 17:19:42java.lang.Thread.State: WAITING (parking)
Mar 21 17:19:42 at sun.misc.Unsafe.park(Native Method)
Mar 21 17:19:42 - parking to wait for  <0xd81959b8> (a 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask)
Mar 21 17:19:42 at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
Mar 21 17:19:42 at 
java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
Mar 21 17:19:42 at 
java.util.concurrent.FutureTask.get(FutureTask.java:191)
Mar 21 17:19:42 at 
org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest$$Lambda$1443/1477662666.call(Unknown
 Source)
Mar 21 17:19:42 at 
org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:63)
Mar 21 17:19:42 at 
org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:892)
Mar 21 17:19:42 at 
org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1366)
Mar 21 17:19:42 at 
org.assertj.core.api.Assertions.assertThatThrownBy(Assertions.java:1210)
Mar 21 17:19:42 at 
org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest.testQuiesceAndAwaitingCancelsScheduledAtFixRateFuture(SystemProcessingTimeServiceTest.java:92)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34911) ChangelogRecoveryRescaleITCase failed fatally with 127 exit code

2024-03-21 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34911:
--
Component/s: Runtime / State Backends

> ChangelogRecoveryRescaleITCase failed fatally with 127 exit code
> 
>
> Key: FLINK-34911
> URL: https://issues.apache.org/jira/browse/FLINK-34911
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Priority: Major
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58455=logs=a657ddbf-d986-5381-9649-342d9c92e7fb=dc085d4a-05c8-580e-06ab-21f5624dab16=9029]
>  
> {code:java}
> Mar 21 01:50:42 01:50:42.553 [ERROR] Command was /bin/sh -c cd 
> '/__w/1/s/flink-tests' && '/usr/lib/jvm/jdk-21.0.1+12/bin/java' 
> '-XX:+UseG1GC' '-Xms256m' '-XX:+IgnoreUnrecognizedVMOptions' 
> '--add-opens=java.base/java.util=ALL-UNNAMED' 
> '--add-opens=java.base/java.io=ALL-UNNAMED' '-Xmx1536m' '-jar' 
> '/__w/1/s/flink-tests/target/surefire/surefirebooter-20240321010847189_810.jar'
>  '/__w/1/s/flink-tests/target/surefire' '2024-03-21T01-08-44_720-jvmRun3' 
> 'surefire-20240321010847189_808tmp' 'surefire_207-20240321010847189_809tmp'
> Mar 21 01:50:42 01:50:42.553 [ERROR] Error occurred in starting fork, check 
> output in log
> Mar 21 01:50:42 01:50:42.553 [ERROR] Process Exit Code: 127
> Mar 21 01:50:42 01:50:42.553 [ERROR] Crashed tests:
> Mar 21 01:50:42 01:50:42.553 [ERROR] 
> org.apache.flink.test.checkpointing.ChangelogRecoveryRescaleITCase
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:456)
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:418)
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:297)
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:250)
> Mar 21 01:50:42 01:50:42.554 [ERROR]  at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1240)
> {code}
> From the watchdog, only {{ChangelogRecoveryRescaleITCase}} didn't complete, 
> specifically parameterized with an {{EmbeddedRocksDBStateBackend}} with 
> incremental checkpointing enabled.
> The base class ({{{}ChangelogRecoveryITCaseBase{}}}) starts a 
> {{MiniClusterWithClientResource}}
> {code:java}
> ~/Downloads/CI/logs-cron_jdk21-test_cron_jdk21_tests-1710982836$ cat 
> watchdog| grep "Tests run\|Running org.apache.flink" | grep -o 
> "org.apache.flink[^ ]*$" | sort | uniq -c | sort -n | head
>       1 org.apache.flink.test.checkpointing.ChangelogRecoveryRescaleITCase
>       2 org.apache.flink.api.connector.source.lib.NumberSequenceSourceITCase
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-34911) ChangelogRecoveryRescaleITCase failed fatally with 127 exit code

2024-03-21 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-34911:
--
Priority: Critical  (was: Major)

> ChangelogRecoveryRescaleITCase failed fatally with 127 exit code
> 
>
> Key: FLINK-34911
> URL: https://issues.apache.org/jira/browse/FLINK-34911
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Priority: Critical
>  Labels: test-stability
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58455=logs=a657ddbf-d986-5381-9649-342d9c92e7fb=dc085d4a-05c8-580e-06ab-21f5624dab16=9029]
>  
> {code:java}
> Mar 21 01:50:42 01:50:42.553 [ERROR] Command was /bin/sh -c cd 
> '/__w/1/s/flink-tests' && '/usr/lib/jvm/jdk-21.0.1+12/bin/java' 
> '-XX:+UseG1GC' '-Xms256m' '-XX:+IgnoreUnrecognizedVMOptions' 
> '--add-opens=java.base/java.util=ALL-UNNAMED' 
> '--add-opens=java.base/java.io=ALL-UNNAMED' '-Xmx1536m' '-jar' 
> '/__w/1/s/flink-tests/target/surefire/surefirebooter-20240321010847189_810.jar'
>  '/__w/1/s/flink-tests/target/surefire' '2024-03-21T01-08-44_720-jvmRun3' 
> 'surefire-20240321010847189_808tmp' 'surefire_207-20240321010847189_809tmp'
> Mar 21 01:50:42 01:50:42.553 [ERROR] Error occurred in starting fork, check 
> output in log
> Mar 21 01:50:42 01:50:42.553 [ERROR] Process Exit Code: 127
> Mar 21 01:50:42 01:50:42.553 [ERROR] Crashed tests:
> Mar 21 01:50:42 01:50:42.553 [ERROR] 
> org.apache.flink.test.checkpointing.ChangelogRecoveryRescaleITCase
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:456)
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:418)
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:297)
> Mar 21 01:50:42 01:50:42.553 [ERROR]  at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:250)
> Mar 21 01:50:42 01:50:42.554 [ERROR]  at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1240)
> {code}
> From the watchdog, only {{ChangelogRecoveryRescaleITCase}} didn't complete, 
> specifically parameterized with an {{EmbeddedRocksDBStateBackend}} with 
> incremental checkpointing enabled.
> The base class ({{{}ChangelogRecoveryITCaseBase{}}}) starts a 
> {{MiniClusterWithClientResource}}
> {code:java}
> ~/Downloads/CI/logs-cron_jdk21-test_cron_jdk21_tests-1710982836$ cat 
> watchdog| grep "Tests run\|Running org.apache.flink" | grep -o 
> "org.apache.flink[^ ]*$" | sort | uniq -c | sort -n | head
>       1 org.apache.flink.test.checkpointing.ChangelogRecoveryRescaleITCase
>       2 org.apache.flink.api.connector.source.lib.NumberSequenceSourceITCase
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34643) JobIDLoggingITCase failed

2024-03-21 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829485#comment-17829485
 ] 

Matthias Pohl edited comment on FLINK-34643 at 3/21/24 11:13 AM:
-

* 
https://github.com/apache/flink/actions/runs/8290287716/job/22688325865#step:10:9328
* 
https://github.com/apache/flink/actions/runs/8304571223/job/22730531076#step:10:9194
* 
https://github.com/apache/flink/actions/runs/8312246651/job/22747312383#step:10:8539
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764925776#step:10:8913
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764920830#step:10:8727
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764903331#step:10:9336
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813901357#step:10:8952
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813876201#step:10:9327
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863786799#step:10:8952
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863772571#step:10:9337
* 
https://github.com/apache/flink/actions/runs/8368626493/job/22913270846#step:10:8418


was (Author: mapohl):
* 
https://github.com/apache/flink/actions/runs/8290287716/job/22688325865#step:10:9328
* 
https://github.com/apache/flink/actions/runs/8304571223/job/22730531076#step:10:9194
* 
https://github.com/apache/flink/actions/runs/8312246651/job/22747312383#step:10:8539
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764925776#step:10:8913
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764920830#step:10:8727
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764903331#step:10:9336
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813901357#step:10:8952
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813876201#step:10:9327
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863786799#step:10:8952
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863772571#step:10:9337

> JobIDLoggingITCase failed
> -
>
> Key: FLINK-34643
> URL: https://issues.apache.org/jira/browse/FLINK-34643
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=7897
> {code}
> Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in 
> org.apache.flink.test.misc.JobIDLoggingITCase
> Mar 09 01:24:23 01:24:23.498 [ERROR] 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) 
> -- Time elapsed: 1.459 s <<< ERROR!
> Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded 
> for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in 
> the test code
> Mar 09 01:24:23   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132)
> Mar 09 01:24:23   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 09 01:24:23   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 09 01:24:23 
> {code}
> The other test failures of this build were also caused by the same test:
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8349
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=a596f69e-60d2-5a4b-7d39-dc69e4cdaed3=712ade8c-ca16-5b76-3acd-14df33bc1cb1=8209



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33186) CheckpointAfterAllTasksFinishedITCase.testRestoreAfterSomeTasksFinished fails on AZP

2024-03-21 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829501#comment-17829501
 ] 

Matthias Pohl commented on FLINK-33186:
---

https://github.com/apache/flink/actions/runs/8369823390/job/22916375709#step:10:7894

>  CheckpointAfterAllTasksFinishedITCase.testRestoreAfterSomeTasksFinished 
> fails on AZP
> -
>
> Key: FLINK-33186
> URL: https://issues.apache.org/jira/browse/FLINK-33186
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.19.0, 1.18.1
>Reporter: Sergey Nuyanzin
>Assignee: Jiang Xin
>Priority: Critical
>  Labels: test-stability
>
> This build 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=53509=logs=baf26b34-3c6a-54e8-f93f-cf269b32f802=8c9d126d-57d2-5a9e-a8c8-ff53f7b35cd9=8762
> fails as
> {noformat}
> Sep 28 01:23:43 Caused by: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Task local 
> checkpoint failure.
> Sep 28 01:23:43   at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:550)
> Sep 28 01:23:43   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2248)
> Sep 28 01:23:43   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2235)
> Sep 28 01:23:43   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:817)
> Sep 28 01:23:43   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> Sep 28 01:23:43   at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Sep 28 01:23:43   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> Sep 28 01:23:43   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> Sep 28 01:23:43   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Sep 28 01:23:43   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Sep 28 01:23:43   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28440) EventTimeWindowCheckpointingITCase failed with restore

2024-03-21 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829500#comment-17829500
 ] 

Matthias Pohl commented on FLINK-28440:
---

https://github.com/apache/flink/actions/runs/8360441603/job/22886656534#step:10:7536

> EventTimeWindowCheckpointingITCase failed with restore
> --
>
> Key: FLINK-28440
> URL: https://issues.apache.org/jira/browse/FLINK-28440
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.16.0, 1.17.0, 1.18.0, 1.19.0
>Reporter: Huang Xingbo
>Assignee: Yanfei Lei
>Priority: Critical
>  Labels: auto-deprioritized-critical, pull-request-available, 
> stale-assigned, test-stability
> Fix For: 1.20.0
>
> Attachments: image-2023-02-01-00-51-54-506.png, 
> image-2023-02-01-01-10-01-521.png, image-2023-02-01-01-19-12-182.png, 
> image-2023-02-01-16-47-23-756.png, image-2023-02-01-16-57-43-889.png, 
> image-2023-02-02-10-52-56-599.png, image-2023-02-03-10-09-07-586.png, 
> image-2023-02-03-12-03-16-155.png, image-2023-02-03-12-03-56-614.png
>
>
> {code:java}
> Caused by: java.lang.Exception: Exception while creating 
> StreamOperatorStateContext.
>   at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:256)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:268)
>   at 
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:722)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:698)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:665)
>   at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
>   at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:904)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
> state backend for WindowOperator_0a448493b4782967b150582570326227_(2/4) from 
> any of the 1 provided restore options.
>   at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
>   at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:353)
>   at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:165)
>   ... 11 more
> Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: 
> /tmp/junit1835099326935900400/junit1113650082510421526/52ee65b7-033f-4429-8ddd-adbe85e27ced
>  (No such file or directory)
>   at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321)
>   at 
> org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:87)
>   at 
> org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.hasNext(StateChangelogHandleStreamHandleReader.java:69)
>   at 
> org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.readBackendHandle(ChangelogBackendRestoreOperation.java:96)
>   at 
> org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.restore(ChangelogBackendRestoreOperation.java:75)
>   at 
> org.apache.flink.state.changelog.ChangelogStateBackend.restore(ChangelogStateBackend.java:92)
>   at 
> org.apache.flink.state.changelog.AbstractChangelogStateBackend.createKeyedStateBackend(AbstractChangelogStateBackend.java:136)
>   at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:336)
>   at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
>   at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>   ... 13 more
> Caused by: java.io.FileNotFoundException: 
> 

[jira] [Comment Edited] (FLINK-34227) Job doesn't disconnect from ResourceManager

2024-03-21 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829499#comment-17829499
 ] 

Matthias Pohl edited comment on FLINK-34227 at 3/21/24 11:11 AM:
-

SetOperatorsITCase: 
https://github.com/apache/flink/actions/runs/8352823891/job/22863768994#step:10:12399


was (Author: mapohl):
https://github.com/apache/flink/actions/runs/8352823891/job/22863768994#step:10:12399

> Job doesn't disconnect from ResourceManager
> ---
>
> Key: FLINK-34227
> URL: https://issues.apache.org/jira/browse/FLINK-34227
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Critical
>  Labels: github-actions, pull-request-available, test-stability
> Attachments: FLINK-34227.7e7d69daebb438b8d03b7392c9c55115.log, 
> FLINK-34227.log
>
>
> https://github.com/XComp/flink/actions/runs/7634987973/job/20800205972#step:10:14557
> {code}
> [...]
> "main" #1 prio=5 os_prio=0 tid=0x7f4b7000 nid=0x24ec0 waiting on 
> condition [0x7fccce1eb000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0xbdd52618> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>   at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>   at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
>   at 
> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
>   at 
> org.apache.flink.table.planner.runtime.stream.sql.WindowDistinctAggregateITCase.testHopWindow_Cube(WindowDistinctAggregateITCase.scala:550)
> [...]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34227) Job doesn't disconnect from ResourceManager

2024-03-21 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829499#comment-17829499
 ] 

Matthias Pohl commented on FLINK-34227:
---

https://github.com/apache/flink/actions/runs/8352823891/job/22863768994#step:10:12399

> Job doesn't disconnect from ResourceManager
> ---
>
> Key: FLINK-34227
> URL: https://issues.apache.org/jira/browse/FLINK-34227
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1
>Reporter: Matthias Pohl
>Assignee: Matthias Pohl
>Priority: Critical
>  Labels: github-actions, pull-request-available, test-stability
> Attachments: FLINK-34227.7e7d69daebb438b8d03b7392c9c55115.log, 
> FLINK-34227.log
>
>
> https://github.com/XComp/flink/actions/runs/7634987973/job/20800205972#step:10:14557
> {code}
> [...]
> "main" #1 prio=5 os_prio=0 tid=0x7f4b7000 nid=0x24ec0 waiting on 
> condition [0x7fccce1eb000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0xbdd52618> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>   at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>   at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
>   at 
> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
>   at 
> org.apache.flink.table.planner.runtime.stream.sql.WindowDistinctAggregateITCase.testHopWindow_Cube(WindowDistinctAggregateITCase.scala:550)
> [...]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34643) JobIDLoggingITCase failed

2024-03-21 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829485#comment-17829485
 ] 

Matthias Pohl edited comment on FLINK-34643 at 3/21/24 11:08 AM:
-

* 
https://github.com/apache/flink/actions/runs/8290287716/job/22688325865#step:10:9328
* 
https://github.com/apache/flink/actions/runs/8304571223/job/22730531076#step:10:9194
* 
https://github.com/apache/flink/actions/runs/8312246651/job/22747312383#step:10:8539
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764925776#step:10:8913
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764920830#step:10:8727
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764903331#step:10:9336
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813901357#step:10:8952
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813876201#step:10:9327
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863786799#step:10:8952
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863772571#step:10:9337


was (Author: mapohl):
* 
https://github.com/apache/flink/actions/runs/8290287716/job/22688325865#step:10:9328
* 
https://github.com/apache/flink/actions/runs/8304571223/job/22730531076#step:10:9194
* 
https://github.com/apache/flink/actions/runs/8312246651/job/22747312383#step:10:8539
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764925776#step:10:8913
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764920830#step:10:8727
* 
https://github.com/apache/flink/actions/runs/8320242443/job/22764903331#step:10:9336
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813901357#step:10:8952
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813876201#step:10:9327
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863786799#step:10:8952

> JobIDLoggingITCase failed
> -
>
> Key: FLINK-34643
> URL: https://issues.apache.org/jira/browse/FLINK-34643
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.20.0
>Reporter: Matthias Pohl
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=7897
> {code}
> Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in 
> org.apache.flink.test.misc.JobIDLoggingITCase
> Mar 09 01:24:23 01:24:23.498 [ERROR] 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) 
> -- Time elapsed: 1.459 s <<< ERROR!
> Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded 
> for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in 
> the test code
> Mar 09 01:24:23   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148)
> Mar 09 01:24:23   at 
> org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132)
> Mar 09 01:24:23   at java.lang.reflect.Method.invoke(Method.java:498)
> Mar 09 01:24:23   at 
> java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> Mar 09 01:24:23   at 
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Mar 09 01:24:23 
> {code}
> The other test failures of this build were also caused by the same test:
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8349
> * 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187=logs=a596f69e-60d2-5a4b-7d39-dc69e4cdaed3=712ade8c-ca16-5b76-3acd-14df33bc1cb1=8209



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34718) KeyedPartitionWindowedStream and NonPartitionWindowedStream IllegalStateException in AZP

2024-03-21 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829484#comment-17829484
 ] 

Matthias Pohl edited comment on FLINK-34718 at 3/21/24 11:08 AM:
-

before the fix was committed to master:
* 
https://github.com/apache/flink/actions/runs/8290287716/job/22688325865#step:10:9329
* 
https://github.com/apache/flink/actions/runs/8304571223/job/22730531076#step:10:8057
* 
https://github.com/apache/flink/actions/runs/8312246651/job/22747312383#step:10:9345
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813876201#step:10:9330
* 
https://github.com/apache/flink/actions/runs/8352823788/job/22863772571#step:10:9347


was (Author: mapohl):
before the fix was committed to master:
* 
https://github.com/apache/flink/actions/runs/8290287716/job/22688325865#step:10:9329
* 
https://github.com/apache/flink/actions/runs/8304571223/job/22730531076#step:10:8057
* 
https://github.com/apache/flink/actions/runs/8312246651/job/22747312383#step:10:9345
* 
https://github.com/apache/flink/actions/runs/8336454518/job/22813876201#step:10:9330

> KeyedPartitionWindowedStream and NonPartitionWindowedStream 
> IllegalStateException in AZP
> 
>
> Key: FLINK-34718
> URL: https://issues.apache.org/jira/browse/FLINK-34718
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.20.0
>Reporter: Ryan Skraba
>Assignee: Ryan Skraba
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.20.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58320=logs=8fd9202e-fd17-5b26-353c-ac1ff76c8f28=ea7cf968-e585-52cb-e0fc-f48de023a7ca=9646]
> 18 of the KeyedPartitionWindowedStreamITCase and 
> NonKeyedPartitionWindowedStreamITCase unit tests introduced in FLINK-34543 
> are failing in the adaptive scheduler profile, with errors similar to:
> {code:java}
> Mar 15 01:54:12 Caused by: java.lang.IllegalStateException: The adaptive 
> scheduler supports pipelined data exchanges (violated by MapPartition 
> (org.apache.flink.streaming.runtime.tasks.OneInputStreamTask) -> 
> ddb598ad156ed281023ba4eebbe487e3).
> Mar 15 01:54:12   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.assertPreconditions(AdaptiveScheduler.java:438)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.(AdaptiveScheduler.java:356)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerFactory.createInstance(AdaptiveSchedulerFactory.java:124)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:121)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:384)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:361)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:128)
> Mar 15 01:54:12   at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:100)
> Mar 15 01:54:12   at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
> Mar 15 01:54:12   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> Mar 15 01:54:12   ... 4 more
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

