[jira] [Created] (FLINK-35159) CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash

2024-04-18 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-35159:


 Summary: CreatingExecutionGraph can leak CheckpointCoordinator and 
cause JM crash
 Key: FLINK-35159
 URL: https://issues.apache.org/jira/browse/FLINK-35159
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.2, 1.20.0, 1.19.1


When a task manager dies while the JM is generating an ExecutionGraph in the 
background, {{CreatingExecutionGraph#handleExecutionGraphCreation}} can 
transition back into WaitingForResources if the TM hosted one of the slots 
that we planned to use in {{tryToAssignSlots}}.

At this point the ExecutionGraph has already been transitioned to RUNNING, 
which implicitly kicks off periodic checkpointing by the 
CheckpointCoordinator, even though the operator coordinator holders have not 
been initialized yet (that only happens after slots are assigned).

This effectively leaks that CheckpointCoordinator, including its timer 
thread, which will keep trying to trigger checkpoints; those triggers 
naturally fail.
This can crash the JM because the failure path calls 
{{OperatorCoordinatorHolder#abortCurrentTriggering}}, which throws an NPE 
since the {{mainThreadExecutor}} was never initialized.
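
For illustration, a self-contained sketch of the general failure pattern (illustrative Java only; the class, field, and method names echo the report but are hypothetical, not the actual Flink classes): a timer-driven trigger starts before a lazily-initialized executor field is set, so the failure path dereferences null.

{code}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class LazyInitRace {

    private Executor mainThreadExecutor; // only set by lazyInitialize(); null until then

    void lazyInitialize(Executor executor) {
        this.mainThreadExecutor = executor;
    }

    void abortCurrentTriggering() {
        // Mirrors the reported NPE: the lazily-set field is dereferenced unconditionally.
        mainThreadExecutor.execute(() -> System.out.println("aborting trigger"));
    }

    public static void main(String[] args) throws InterruptedException {
        LazyInitRace holder = new LazyInitRace();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        // "Periodic checkpointing" starts as soon as the graph is RUNNING...
        ScheduledFuture<?> trigger =
                timer.schedule(holder::abortCurrentTriggering, 10, TimeUnit.MILLISECONDS);

        // ...but lazyInitialize() never runs because we fell back to WaitingForResources.
        try {
            trigger.get();
        } catch (ExecutionException e) {
            System.out.println("trigger failed: " + e.getCause()); // NullPointerException
        } finally {
            timer.shutdown();
        }
    }
}
{code}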

{code}
java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: java.lang.NullPointerException
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
at 
java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
at 
java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CompletionException: 
java.lang.NullPointerException
at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
at 
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
... 7 more
Caused by: java.lang.NullPointerException
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at 
java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
... 8 more
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34672) HA deadlock between JobMasterServiceLeadershipRunner and DefaultLeaderElectionService

2024-03-14 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34672:


 Summary: HA deadlock between JobMasterServiceLeadershipRunner and 
DefaultLeaderElectionService
 Key: FLINK-34672
 URL: https://issues.apache.org/jira/browse/FLINK-34672
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1
Reporter: Chesnay Schepler
 Fix For: 1.18.2, 1.20.0, 1.19.1


We recently observed a deadlock in the JM within the HA system.
(see below for the thread dump)

[~mapohl] and I looked into it, and there appears to be a race condition 
when leadership is revoked while a JobMaster is being started.
It appears to be caused by 
{{JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess}} 
forwarding futures while holding a lock; depending on whether the forwarded 
future is already complete, the next stage may or may not run while holding 
that same lock.
We haven't determined yet whether we should be holding that lock or not.
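
A self-contained sketch of the underlying hazard (hypothetical names, not the actual Flink classes): completing a future while holding one lock runs a dependent callback that takes a second lock, while another thread acquires the two locks in the opposite order.

{code}
import java.util.concurrent.CompletableFuture;

public class FutureForwardingDeadlock {

    static final Object runnerLock = new Object();   // think: JobMasterServiceLeadershipRunner lock
    static final Object electionLock = new Object(); // think: DefaultLeaderElectionService lock

    public static void main(String[] args) {
        CompletableFuture<Void> jobMasterFuture = new CompletableFuture<>();

        // Callback registered while forwarding the future: takes runnerLock, then electionLock.
        jobMasterFuture.whenComplete((ok, err) -> {
            synchronized (runnerLock) {
                synchronized (electionLock) {
                    // isValidLeader() / hasLeadership()
                }
            }
        });

        // Leader-event thread: takes electionLock, then runnerLock (opposite order).
        Thread revoker = new Thread(() -> {
            synchronized (electionLock) {
                synchronized (runnerLock) {
                    // revokeLeadership()
                }
            }
        });

        // If the future completes on this thread, the whenComplete callback above
        // runs here, acquiring the locks in the inverse order of the revoker.
        Thread starter = new Thread(() -> jobMasterFuture.complete(null));

        revoker.start();
        starter.start(); // with unlucky timing, the two threads deadlock
    }
}
{code}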

{code}
"DefaultLeaderElectionService-leadershipOperationExecutor-thread-1" #131 daemon 
prio=5 os_prio=0 cpu=157.44ms elapsed=78749.65s tid=0x7f531f43d000 
nid=0x19d waiting for monitor entry  [0x7f53084fd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:462)
- waiting to lock <0xf1c0e088> (a java.lang.Object)
at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:397)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.notifyLeaderContenderOfLeadershipLoss(DefaultLeaderElectionService.java:484)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1252/0x000840ddec40.accept(Unknown
 Source)
at java.util.HashMap.forEach(java.base@11.0.22/HashMap.java:1337)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadershipInternal(DefaultLeaderElectionService.java:452)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1251/0x000840dcf840.run(Unknown
 Source)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.lambda$runInLeaderEventThread$3(DefaultLeaderElectionService.java:549)
- locked <0xf0e3f4d8> (a java.lang.Object)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1075/0x000840c23040.run(Unknown
 Source)
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.22/CompletableFuture.java:1736)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
{code}

{code}
"jobmanager-io-thread-1" #636 daemon prio=5 os_prio=0 cpu=125.56ms 
elapsed=78699.01s tid=0x7f5321c6e800 nid=0x396 waiting for monitor entry  
[0x7f530567d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.hasLeadership(DefaultLeaderElectionService.java:366)
- waiting to lock <0xf0e3f4d8> (a java.lang.Object)
at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElection.hasLeadership(DefaultLeaderElection.java:52)
at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.isValidLeader(JobMasterServiceLeadershipRunner.java:509)
at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$forwardIfValidLeader$15(JobMasterServiceLeadershipRunner.java:520)
- locked <0xf1c0e088> (a java.lang.Object)
at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$1320/0x000840e1a840.accept(Unknown
 Source)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@11.0.22/CompletableFuture.java:859)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@11.0.22/CompletableFuture.java:837)
at 
java.util.concurrent.CompletableFuture.postComplete(java.base@11.0.22/CompletableFuture.java:506)
at 
java.util.concurrent.CompletableFuture.complete(java.base@11.0.22/CompletableFuture.java:2079)
at 
org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.registerJobMasterServiceFutures(DefaultJobMasterServiceProcess.java:124)
at 
org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:114)
at 
org.apache.flink.runtime.jobmaster
{code}

[jira] [Created] (FLINK-34640) Replace DummyMetricGroup usage with UnregisteredMetricsGroup

2024-03-11 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34640:


 Summary: Replace DummyMetricGroup usage with 
UnregisteredMetricsGroup
 Key: FLINK-34640
 URL: https://issues.apache.org/jira/browse/FLINK-34640
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Metrics, Tests
Reporter: Chesnay Schepler
 Fix For: 1.20.0


The {{DummyMetricGroup}} is terrible because it is decidedly unsafe to use. Use 
the {{UnregisteredMetricsGroup}} instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Move CheckpointingMode to flink-core

2024-02-27 Thread Chesnay Schepler
Moving classes (== keeping the same package) to a module higher up in the 
dependency tree should not be a breaking change and can imo be done 
anytime without any risk to users.


On 27/02/2024 17:01, Lincoln Lee wrote:

Hi Zakelly,

Thanks for letting us 1.19 RMs know about this!

This change has been discussed during today's release sync meeting; we
suggest not merging it into 1.19.
We can continue discussing the removal in 2.x separately.

Best,
Lincoln Lee


Hangxiang Yu wrote on Tue, Feb 27, 2024, 11:28:


Hi, Zakelly.
Thanks for driving this.
Moving this class to flink-core makes sense to me, as it could make the code
path and configs clearer.
It has been marked as @Public since 1.0, and 1.20 should be the next
long-term version, so 1.19 should be a suitable version to do it.
I also look forward to the thoughts of other developers/RMs, since 1.19 is
currently in feature freeze.

On Mon, Feb 26, 2024 at 6:42 PM Zakelly Lan  wrote:


Hi devs,

When working on FLIP-406 [1], I realized that moving all options of
ExecutionCheckpointingOptions (flink-streaming-java) to
CheckpointingOptions (flink-core) depends on relocating the enum
CheckpointingMode (flink-streaming-java) to the flink-core module.

However, the CheckpointingMode is annotated as @Public and is used by the
DataStream API, e.g. via 'CheckpointConfig#setCheckpointingMode'. So I'd
like to start a discussion on moving the CheckpointingMode to flink-core.
This is in a little bit of a hurry: if we want the old enum to be entirely
removed in the Flink 2.x series, the deprecation should ship in the
upcoming Flink 1.19.

I suggest not creating a dedicated FLIP and treating this as a sub-task of
FLIP-406.

I prepared a minimal change that provides the new APIs and deprecates the
old ones [2], which could be merged to 1.19 if we agree to do so.

Looking forward to your thoughts! Also cc'ing the RMs of 1.19 about this.

[1]


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789560

[2]



https://github.com/apache/flink/commit/9bdd237d0322df8853f1b9e6ae658f77b9175237

Best,
Zakelly
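
A minimal sketch of the "new API plus deprecated old API" shape being proposed (hypothetical names; the actual change is in the linked commit [2]):

{code}
// Hypothetical simplified shape; package-private enums so the sketch compiles as one file.

/** New enum, conceptually living in flink-core. */
enum CoreCheckpointingMode {
    EXACTLY_ONCE,
    AT_LEAST_ONCE
}

/** Old enum, conceptually staying in flink-streaming-java only for compatibility. */
@Deprecated
enum LegacyCheckpointingMode {
    EXACTLY_ONCE,
    AT_LEAST_ONCE;

    /** Bridge used by deprecated setters to delegate to the new API. */
    CoreCheckpointingMode toCore() {
        return CoreCheckpointingMode.valueOf(name());
    }
}
{code}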



--
Best,
Hangxiang.





Re: [VOTE] Release flink-connector-parent 1.1.0 release candidate #2

2024-02-27 Thread Chesnay Schepler

+1
- pom contents
- source contents
- Website PR

On 19/02/2024 18:33, Etienne Chauchot wrote:

Hi everyone,
Please review and vote on the release candidate #2 for the version 
1.1.0, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org 
[2], which is signed with the key with fingerprint 
D1A76BA19D6294DD0033F6843A019F0B8DD163EA [3],

* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag v1.1.0-rc2 [5],
* website pull request listing the new release [6].

* confluence wiki: connector parent upgrade to version 1.1.0 that will 
be validated after the artifact is released (there is no PR mechanism 
on the wiki) [7]



The vote will be open for at least 72 hours. It is adopted by majority 
approval, with at least 3 PMC affirmative votes.


Thanks,
Etienne

[1] 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353442
[2] 
https://dist.apache.org/repos/dist/dev/flink/flink-connector-parent-1.1.0-rc2

[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] 
https://repository.apache.org/content/repositories/orgapacheflink-1707
[5] 
https://github.com/apache/flink-connector-shared-utils/releases/tag/v1.1.0-rc2


[6] https://github.com/apache/flink-web/pull/717

[7] 
https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development






[jira] [Created] (FLINK-34499) Configuration#toString should hide sensitive values

2024-02-22 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34499:


 Summary: Configuration#toString should hide sensitive values
 Key: FLINK-34499
 URL: https://issues.apache.org/jira/browse/FLINK-34499
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Chesnay Schepler
 Fix For: 1.20.0


Time and time again people log the entire Flink configuration for no reason, 
risking that sensitive values are logged in plain text.

We should make this harder by changing {{Configuration#toString}} to 
automatically hide sensitive values, for example like this:

{code}
@Override
public String toString() {
    return ConfigurationUtils.hideSensitiveValues(
                    this.confData.entrySet().stream()
                            .collect(
                                    Collectors.toMap(
                                            Map.Entry::getKey,
                                            entry -> entry.getValue().toString())))
            .toString();
}
{code}
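
For illustration, the intended effect might look as follows (a hedged sketch; the masked rendering and the substring-based sensitivity matching are assumptions about the proposal, not confirmed behavior):

{code}
import org.apache.flink.configuration.Configuration;

public class ToStringMaskingExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString("client.password", "hunter2");  // assumed sensitive: key contains "password"
        conf.setString("parallelism.default", "4");

        // With the proposed toString(), the output would be something like:
        // {client.password=******, parallelism.default=4}
        System.out.println(conf);
    }
}
{code}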



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34498) GSFileSystemFactory logs full Flink config

2024-02-22 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34498:


 Summary: GSFileSystemFactory logs full Flink config
 Key: FLINK-34498
 URL: https://issues.apache.org/jira/browse/FLINK-34498
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem
Affects Versions: 1.18.1
Reporter: Chesnay Schepler
 Fix For: 1.19.0, 1.18.2, 1.20.0


This can cause secrets from the config to be logged.
{code}
@Override
public void configure(Configuration flinkConfig) {
    LOGGER.info("Configuring GSFileSystemFactory with Flink configuration {}", flinkConfig);
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34496) Classloading deadlock between ExecNodeMetadataUtil and JsonSerdeUtil

2024-02-22 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34496:


 Summary: Classloading deadlock between ExecNodeMetadataUtil and 
JsonSerdeUtil
 Key: FLINK-34496
 URL: https://issues.apache.org/jira/browse/FLINK-34496
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.18.1
Reporter: Chesnay Schepler
 Fix For: 1.19.0, 1.18.2, 1.20.0


This is a fun one!

ExecNodeMetadataUtil and JsonSerdeUtil have a circular dependency in their 
static initialization, which can deadlock classloading when two threads run 
the class initialization of the two classes at the same time, because the 
JVM holds a per-class lock during static initialization.

{code}
Feb 22 00:31:58 "ForkJoinPool-3-worker-11" #25 daemon prio=5 os_prio=0 
cpu=219.87ms elapsed=995.99s tid=0x7ff11c50e000 nid=0xf0fc in Object.wait() 
 [0x7ff12a4f3000]
Feb 22 00:31:58java.lang.Thread.State: RUNNABLE
Feb 22 00:31:58 at 
org.apache.flink.table.planner.plan.nodes.exec.serde.JsonSerdeUtil.createFlinkTableJacksonModule(JsonSerdeUtil.java:133)
Feb 22 00:31:58 at 
org.apache.flink.table.planner.plan.nodes.exec.serde.JsonSerdeUtil.<clinit>(JsonSerdeUtil.java:111)

Feb 22 00:31:58 "ForkJoinPool-3-worker-7" #23 daemon prio=5 os_prio=0 
cpu=54.83ms elapsed=996.00s tid=0x7ff11c50c000 nid=0xf0fb in Object.wait()  
[0x7ff12a5f4000]
Feb 22 00:31:58java.lang.Thread.State: RUNNABLE
Feb 22 00:31:58 at 
org.apache.flink.table.planner.plan.utils.ExecNodeMetadataUtil.addToLookupMap(ExecNodeMetadataUtil.java:235)
Feb 22 00:31:58 at 
org.apache.flink.table.planner.plan.utils.ExecNodeMetadataUtil.<clinit>(ExecNodeMetadataUtil.java:156)
{code}
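
The hazard can be reproduced outside Flink with a minimal, self-contained example of two classes whose static initializers reference each other (illustrative only):

{code}
public class ClassInitDeadlock {

    static class A {
        static final int VALUE;
        static {
            VALUE = B.VALUE + 1; // A's initializer triggers B's initialization
        }
    }

    static class B {
        static final int VALUE;
        static {
            VALUE = A.VALUE + 1; // B's initializer triggers A's initialization
        }
    }

    public static void main(String[] args) {
        // Each thread holds its class's initialization lock and then waits for
        // the other's; with unlucky timing both block forever, much like the
        // ForkJoinPool workers in the dump above.
        new Thread(() -> System.out.println(A.VALUE)).start();
        new Thread(() -> System.out.println(B.VALUE)).start();
    }
}
{code}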



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34485) Token delegation doesn't work with Presto S3 filesystem

2024-02-21 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34485:


 Summary: Token delegation doesn't work with Presto S3 filesystem
 Key: FLINK-34485
 URL: https://issues.apache.org/jira/browse/FLINK-34485
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem
Affects Versions: 1.18.1
Reporter: Chesnay Schepler
 Fix For: 1.20.0


AFAICT it's not possible to use token delegation with the Presto filesystem.
Token delegation relies on the {{DynamicTemporaryAWSCredentialsProvider}}, 
but that class doesn't have the constructor that Presto requires (ruling out 
{{presto.s3.credentials-provider}}), and other providers can't be used due 
to FLINK-13602.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release flink-connector-parent, release candidate #1

2024-02-15 Thread Chesnay Schepler
Martijn is correct in that if the source release got changed we need to 
vet everything again.
If anything else had been changed instead, there'd be a bit of leeway, but 
the source release itself is the most important part of a release and we 
just can't allow changes to it after the fact while still counting the 
votes.


As for the change itself, it's good that the shared directory is now 
excluded, but we still have a de-facto empty tools directory.
That's ultimately the (really minor) issue I was referring to that I'd 
like to see resolved /eventually/, because an empty directory in a 
source release is just a bit confusing ("Is the release missing 
something?").


On 15/02/2024 17:27, Martijn Visser wrote:

Hi Etienne,


I fixed the source release [1] as requested; it no longer contains the 
tools/release/shared directory.

I don't think that is the correct way: my understanding is that this
invalidates basically all the votes, because now the checked artifact
has changed. It was requested to file a ticket as a follow-up, not to
immediately change the binary. We can't have a lazy consensus on a
release topic, with a changed artifact.

Best regards,

Martijn

On Thu, Feb 15, 2024 at 2:32 PM Etienne Chauchot  wrote:

Hi,

Considering that the code and artifact have not changed since the last vote
(only the source release and tag have changed), and considering that there
were already 3 binding votes for this RC1, I'll do this on a lazy
consensus. I'll release if no one objects until tomorrow, as it will then
be 72h since the last change.

Best

Etienne

Le 13/02/2024 à 13:24, Etienne Chauchot a écrit :

Hi all,

I fixed the source release [1] as requested; it no longer contains the
tools/release/shared directory.

I found out why it contained that directory: the parent_pom branch was
referring to an incorrect sub-module mount point for the release_utils
branch (cf. FLINK-34364 [2]). Here is the fixing PR [3].

By the way, I noticed that all the connector source releases contain an
empty tools/releasing directory, because only tools/releasing/shared is
excluded in the source release script and not the whole tools/releasing
directory. That seems a bit messy to me, so I think we should fix the
release scripts later on for the next connector releases.

I also found out that the RC1 tag was pointing to my fork instead of
the main repo, so I remade the tag [4].

Apart from that, the code and artifact have not changed, so I did not
invalidate the RC1.

Please confirm that I can proceed to the release.

Best

Etienne

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-connector-parent-1.1.0-rc1/

[2]https://issues.apache.org/jira/browse/FLINK-34364

[3]https://github.com/apache/flink-connector-shared-utils/pull/36

[4]
https://github.com/apache/flink-connector-shared-utils/releases/tag/v1.1.0-rc1


Le 05/02/2024 à 12:36, Etienne Chauchot a écrit :

Hi,

I just got back from vacation. I'll close the vote thread and
proceed to the release later this week.

Here is the ticket:https://issues.apache.org/jira/browse/FLINK-34364

Best

Etienne

Le 04/02/2024 à 05:06, Qingsheng Ren a écrit :

+1 (binding)

- Verified checksum and signature
- Verified pom content
- Built flink-connector-kafka from source with the parent pom in staging

Best,
Qingsheng

On Thu, Feb 1, 2024 at 11:19 PM Chesnay Schepler   wrote:


- checked source/maven pom contents

Please file a ticket to exclude tools/release from the source release.

+1 (binding)

On 29/01/2024 15:59, Maximilian Michels wrote:

- Inspected the source for licenses and corresponding headers
- Checksums and signature OK

+1 (binding)

On Tue, Jan 23, 2024 at 4:08 PM Etienne Chauchot

wrote:

Hi everyone,

Please review and vote on the release candidate #1 for the version
1.1.0, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org
[2], which is signed with the key with fingerprint
D1A76BA19D6294DD0033F6843A019F0B8DD163EA [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag v1.1.0-rc1 [5],
* website pull request listing the new release [6]

* confluence wiki: connector parent upgrade to version 1.1.0 that will
be validated after the artifact is released (there is no PR mechanism on
the wiki) [7]

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,

Etienne

[1]


https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353442

[2]


https://dist.apache.org/repos/dist/dev/flink/flink-connector-parent-1.1.0-rc1

[3]https://dist.apache.org/repos/dist/release/flink/KEYS
[4]

https://repository.apache.org/content/repositories/orgapacheflink-1698/

[5]


https://github.com/apache/flink-connector-shared-utils/releases/tag/

[jira] [Created] (FLINK-34431) Move static BlobWriter methods to separate util

2024-02-12 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34431:


 Summary: Move static BlobWriter methods to separate util
 Key: FLINK-34431
 URL: https://issues.apache.org/jira/browse/FLINK-34431
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.20.0


The BlobWriter interface contains several static methods, some of which are 
used elsewhere while others are de-facto internal.
We should move these into a dedicated BlobWriterUtils class so we can 
properly deal with method visibility.
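
A minimal sketch of why the move helps (hypothetical names): static methods on an interface are implicitly public, while a dedicated utility class can use package-private visibility for the internal helpers.

{code}
import java.nio.charset.StandardCharsets;

// Before: static helpers on an interface are always implicitly public.
interface BlobWriterLike {
    static byte[] serialize(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
}

// After: a final utility class can restrict the de-facto internal helpers
// to package-private visibility.
final class BlobWriterUtilsLike {
    private BlobWriterUtilsLike() {}

    static byte[] serialize(String value) { // package-private now
        return value.getBytes(StandardCharsets.UTF_8);
    }
}
{code}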



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34422) BatchTestBase doesn't actually use MiniClusterExtension

2024-02-10 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34422:


 Summary: BatchTestBase doesn't actually use MiniClusterExtension
 Key: FLINK-34422
 URL: https://issues.apache.org/jira/browse/FLINK-34422
 Project: Flink
  Issue Type: Technical Debt
  Components: Test Infrastructure
Affects Versions: 1.18.1
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.19.0, 1.18.2, 1.20.0


BatchTestBase sets up a table environment in instance fields, which run 
before the BeforeEachCallback from the MiniClusterExtension has a chance to 
run. As a result, _all_ tests extending BatchTestBase spawn a separate mini 
cluster for every single job.
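
A self-contained sketch of the ordering problem (hypothetical extension and field names, not the actual Flink test classes; JUnit 5 creates the test instance, and thus runs field initializers, before any BeforeEachCallback):

{code}
import static org.junit.jupiter.api.Assertions.assertFalse;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.BeforeEachCallback;
import org.junit.jupiter.api.extension.ExtensionContext;
import org.junit.jupiter.api.extension.RegisterExtension;

class FieldInitOrderTest {

    static final class ClusterExtension implements BeforeEachCallback {
        boolean clusterReady;

        @Override
        public void beforeEach(ExtensionContext context) {
            clusterReady = true; // the "mini cluster" only exists from here on
        }
    }

    @RegisterExtension
    static final ClusterExtension CLUSTER = new ClusterExtension();

    // Instance initializers run when the test instance is created, i.e. BEFORE the
    // extension's beforeEach() callback. A table environment created here therefore
    // cannot see the extension-managed cluster and falls back to its own.
    private final boolean clusterWasReadyAtFieldInit = CLUSTER.clusterReady;

    @Test
    void fieldInitRunsBeforeExtensionCallback() {
        assertFalse(clusterWasReadyAtFieldInit);
    }
}
{code}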



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34421) Skip post-compile checks in compile.sh if fast profile is active

2024-02-10 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34421:


 Summary: Skip post-compile checks in compile.sh if fast profile is 
active
 Key: FLINK-34421
 URL: https://issues.apache.org/jira/browse/FLINK-34421
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.19.0, 1.18.2, 1.20.0


We currently waste time in our e2e tests by re-running a bunch of 
post-compile checks (like packaging/licensing checks).
Let's couple these to the -Dfast/-Pfast switches.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-02-08 Thread Chesnay Schepler
How exactly are you tuning SQL jobs without compiled plans while ensuring 
that the resulting compiled plans are compatible? That's explicitly not 
supported by Flink, hence why CompiledPlans exist.
If you change _anything_, the planner is free to generate a completely 
different plan, with no guarantee that you can map the state of one onto 
the other.


On 08/02/2024 09:42, Martijn Visser wrote:

Hi,


However, compiled plan is still too complicated for Flink newbies from my point 
of view.

I don't think that the compiled plan was ever positioned to be a
simple solution. If you want to have an easy approach, we have a
declarative solution in place with SQL and/or the Table API imho.

Best regards,

Martijn

On Thu, Feb 8, 2024 at 9:14 AM Zhanghao Chen  wrote:

Hi Piotr,

Thanks for the comment. I agree that the compiled plan is the ultimate tool 
for Flink SQL if one wants to make any changes to the query later, and this 
FLIP indeed is not essential in that sense. However, the compiled plan is 
still too complicated for Flink newbies from my point of view. As I 
mentioned previously, our internal platform provides a visualized tool for 
editing the compiled plan, but most users still find it complex. Therefore, 
the FLIP can still benefit users with better usability, and the proposed 
changes are actually quite lightweight (just copying a new hasher with 2 
lines deleted + extending the OperatorIdPair data structure) without much 
extra effort.

Best,
Zhanghao Chen

From: Piotr Nowojski 
Sent: Thursday, February 8, 2024 14:50
To: Zhanghao Chen 
Cc: Chesnay Schepler ; dev@flink.apache.org 
; Yu Chen 
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hey


AFAIK, there's no way to set UIDs for a SQL job,

AFAIK you can't set UID manually, but  Flink SQL generates a compiled plan
of a query with embedded UIDs. As I understand it, using a compiled plan is
the preferred (only?) way for Flink SQL if one wants to make any changes to
query later on or support Flink's runtime upgrades, without losing the
state.

If that's the case, what would be the usefulness of this FLIP? Only for
DataStream API for users that didn't know that they should have manually
configured UIDs? But they have the workaround to actually post-factum add
the UIDs anyway, right? So maybe indeed Chesnay is right that this FLIP is
not that helpful/worth the extra effort?

Best,
Piotrek

On Thu, Feb 8, 2024 at 03:55, Zhanghao Chen 
wrote:


Hi Chesnay,

AFAIK there's no way to set UIDs for a SQL job; it would be great if you can
share how you allow UID setting for SQL jobs. We've explored providing a
visualized DAG editor for SQL jobs that allows UID setting on our internal
platform, but most users found it too complicated to use. Another possible
way is to utilize SQL hints, but that's complicated as well. From our
experience, many SQL users are not familiar with Flink; what they want is an
experience similar to writing normal SQL in MySQL, without involving extra
concepts like the DAG and UIDs. In fact, some DataStream and PyFlink users
share the same concern.

On the other hand, some performance tuning is inevitable for long-running
jobs in production, and parallelism tuning is among the most common
techniques. FLIP-367 [1] and FLIP-146 [2] allow users to tune the
parallelism of sources and sinks, and both were well received in their
discussion threads. Users definitely don't want to lose state after
parallelism tuning, which is highly risky at present.

Putting these together, I think the FLIP has high value in production.
Through offline discussion, I learnt that multiple companies have developed
or are trying to develop similar hasher changes in their internal
distributions, including ByteDance, Xiaohongshu, and Bilibili. It would be
great if we can improve the SQL experience for all community users as well,
WDYT?

Best,
Zhanghao Chen
--
*From:* Chesnay Schepler 
*Sent:* Thursday, February 8, 2024 2:01
*To:* dev@flink.apache.org ; Zhanghao Chen <
zhanghao.c...@outlook.com>; Piotr Nowojski ; Yu
Chen 
*Subject:* Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID
generation for improved state compatibility on parallelism change

The FLIP is a bit weird, to be honest. It only applies in cases where
users haven't set uids, but that goes against best practices, and as far
as I'm told SQL also sets UIDs everywhere.

I'm wondering if this is really worth the effort.
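
For context, the best practice referred to above is pinning operator IDs explicitly via uid(); a minimal DataStream sketch (standard Flink API, job shape illustrative):

{code}
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3)
                .map(i -> i * 2)
                .returns(Types.INT)
                .uid("double-map") // stable operator ID: state survives chaining changes
                .print();

        env.execute("uid-example");
    }
}
{code}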

On 07/02/2024 10:23, Zhanghao Chen wrote:

After offline discussion with @Yu Chen, I've updated the FLIP [1] to include 
a design that allows for a compatible hasher upgrade by adding 
StreamGraphHasherV2 to the legacy hasher list, which is actually a revival 
of the idea from FLINK-5290 [2] when StreamGraphHasherV2 was introduced in 
Flink 1.2. We're targeting to make V3 the default hasher in Flink 1.20.

Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-02-07 Thread Chesnay Schepler
The FLIP is a bit weird, to be honest. It only applies in cases where 
users haven't set uids, but that goes against best practices, and as far 
as I'm told SQL also sets UIDs everywhere.

I'm wondering if this is really worth the effort.

On 07/02/2024 10:23, Zhanghao Chen wrote:

After offline discussion with @Yu Chen, I've updated the FLIP [1] to include a 
design that allows for a compatible hasher upgrade by adding StreamGraphHasherV2 
to the legacy hasher list, which is actually a revival of the idea from 
FLINK-5290 [2] when StreamGraphHasherV2 was introduced in Flink 1.2. We're 
targeting to make V3 the default hasher in Flink 1.20, given that state 
compatibility is no longer an issue. Take a look when you have a chance; I'd 
like to especially thank @Yu Chen for the thorough offline discussion and 
code-debugging help that made this possible.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change
[2] https://issues.apache.org/jira/browse/FLINK-5290

Best,
Zhanghao Chen

From: Zhanghao Chen 
Sent: Friday, January 12, 2024 10:46
To: Piotr Nowojski ; Yu Chen 
Cc: dev@flink.apache.org 
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Thanks for the input, Piotr. It might still be possible to make it
compatible with the old snapshots, following the direction of FLINK-5290
suggested by Yu. I'll discuss the details with Yu.

Best,
Zhanghao Chen

From: Piotr Nowojski 
Sent: Friday, January 12, 2024 1:55
To: Yu Chen 
Cc: Zhanghao Chen ; dev@flink.apache.org 

Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi,

Using unaligned checkpoints is orthogonal to this FLIP.

Yes, unaligned checkpoints are not supported for pointwise connections, so 
most of the cases go away anyway. It is possible to switch from unchained to 
chained subtasks by removing a keyBy exchange, and this would be a problem, 
but that's just one of the things that we claim unaligned checkpoints do not 
support [1]. But as I stated above, this is an orthogonal issue to this FLIP.

Regarding the proposal itself, generally speaking it makes sense to me as well. 
However I'm quite worried about
the compatibility and/or migration path. The:

(v2.0) Make HasherV3 the default hasher, mark HasherV2 deprecated.

step would break the compatibility with Flink 1.xx snapshots. But as this is 
for v2.0, maybe that's not the end of
the world?

Best,
Piotrek

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations

On Thu, Jan 11, 2024 at 12:10, Yu Chen  wrote:
Hi Zhanghao,

Actually, Stefan did similar compatibility work in the early FLINK-5290 [1],
where he introduced the legacyStreamGraphHashers list for hasher backward
compatibility.

We have attempted to implement a similar feature in our internal version of
Flink and tried to include the new hasher as part of the
legacyStreamGraphHashers, which would ensure that the corresponding operator
state can be found at restore time while ignoring the chaining condition
(without changing the default hasher).

However, we found that such a solution may lead to unexpected situations in
some cases, and I haven't had time to find the root cause recently.

If you're interested, I'd be happy to discuss it with you and try to solve
the problem.

[1] https://issues.apache.org/jira/browse/FLINK-5290

Best,
Yu Chen



On Jan 11, 2024, at 15:07, Zhanghao Chen  wrote:

Hi Yu,

I haven't thought too much about the compatibility design before. By the
nature of the problem, it's impossible to make V3 compatible with V2; what
we can do is somewhat better inform users when switching the hasher, but I
don't have any good ideas so far. Do you have any suggestions?

Best,
Zhanghao Chen

From: Yu Chen 
Sent: Thursday, January 11, 2024 13:52
To: dev@flink.apache.org
Cc: Piotr Nowojski ; zhanghao.c...@outlook.com
Subject: Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for 
improved state compatibility on parallelism change

Hi Zhanghao,

Thanks for driving this; it's really painful for us when we need to switch 
the config `pipeline.operator-chaining`.

But I have a concern: according to the FLIP description, modifying the 
`isChainable`-related code in `StreamGraphHasherV2` will cause the 

[jira] [Created] (FLINK-34397) Resource wait timeout can't be disabled

2024-02-06 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34397:


 Summary: Resource wait timeout can't be disabled
 Key: FLINK-34397
 URL: https://issues.apache.org/jira/browse/FLINK-34397
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Configuration
Affects Versions: 1.17.2
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.19.0, 1.17.3, 1.18.2


The documentation for {{jobmanager.adaptive-scheduler.resource-wait-timeout}} 
states that:

??Setting a negative duration will disable the resource timeout: The JobManager 
will wait indefinitely for resources to appear.??

However, we don't support parsing negative durations.

{code}
Could not parse value '-1 s' for key 
'jobmanager.adaptive-scheduler.resource-wait-timeout'.
Caused by: java.lang.NumberFormatException: text does not start with a number
at org.apache.flink.util.TimeUtils.parseDuration(TimeUtils.java:80)
at 
org.apache.flink.configuration.ConfigurationUtils.convertToDuration(ConfigurationUtils.java:399)
at 
org.apache.flink.configuration.ConfigurationUtils.convertValue(ConfigurationUtils.java:331)
at 
org.apache.flink.configuration.Configuration.lambda$getOptional$3(Configuration.java:729)
at java.base/java.util.Optional.map(Optional.java:260)
at 
org.apache.flink.configuration.Configuration.getOptional(Configuration.java:729)
... 2 more
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Security fixes for Flink 1.18 (flink-shaded)

2024-02-02 Thread Chesnay Schepler
Guava CVEs don't apply because they're all about its createTempDir method, 
which we don't use.

The Zookeeper CVE doesn't really apply because it's a server-side issue.

On 02/02/2024 09:42, Martijn Visser wrote:

To add to this: we can't upgrade to flink-shaded 18.0, since we've just
reverted that for Flink 1.19 because of the performance regression. We will
need a new flink-shaded version to deal with these performance regressions.

On Fri, Feb 2, 2024 at 9:39 AM Martijn Visser 
wrote:


Hi Hong,

I do have objections: upgrading Flink-Shaded in a patch version is
something that we should not take lightly, since it involves components
that are used in the core functionality of Flink. We've seen in the past
that changes in Flink Shaded have an impact on stability and performance. I
would like to see how Flink is affected by these CVEs, since in almost all
cases these are false-positives for Flink.

Best regards,

Martijn

On Thu, Feb 1, 2024 at 4:22 PM Hong Liang  wrote:


Hi all,

Recently, we detected some active CVEs on the flink-shaded-guava and
flink-shaded-zookeeper package used in Flink 1.18. Since Flink 1.18 is
still in support for security fixes, we should consider fixing this.
However, since the vulnerable package is coming from flink-shaded, I
wanted
to check if there are thoughts from the community around releasing a patch
version of flink-shaded.

Problem:
Flink 1.18 uses guava 31.1-jre from flink-shaded-guava 17.0, which is
affected by CVE-2023-2976 (HIGH) [1] and CVE-2020-8908 (LOW) [2]. Flink
1.18 also uses zookeeper 3.7.1, which is affected by CVE-2023-44981
(CRITICAL) [3].

To fix, I can think of two options:
Option 1:
Upgrade Flink 1.18 to use flink.shaded.version 18.0. This is easiest, as we
can backport the change for Flink 1.19 directly (after the performance
regression is addressed) [4]. However, there are also upgrades to jackson,
asm, and netty in flink.shaded.version 18.0.

Option 2:
Release flink.shaded.version 17.1, with just a bump in zookeeper and guava
versions. Then, upgrade Flink 1.18 to use this new flink.shaded.version
17.1. This is harder, but keeps the changes contained and minimal.

Given the version bump is on flink-shaded, whose packages are relocated to
keep the usage of these libraries contained within the Flink runtime itself,
I am inclined to go with Option 1, even though the change is slightly larger
than just the security fixes.

Do people have any objections?


Regards,
Hong

[1] https://nvd.nist.gov/vuln/detail/CVE-2023-2976
[2] https://nvd.nist.gov/vuln/detail/CVE-2020-8908
[3] https://nvd.nist.gov/vuln/detail/CVE-2023-44981
[4] https://issues.apache.org/jira/browse/FLINK-33705





Re: [VOTE] Release flink-connector-parent, release candidate #1

2024-02-01 Thread Chesnay Schepler

- checked source/maven pom contents

Please file a ticket to exclude tools/release from the source release.

+1 (binding)

On 29/01/2024 15:59, Maximilian Michels wrote:

- Inspected the source for licenses and corresponding headers
- Checksums and signature OK

+1 (binding)

On Tue, Jan 23, 2024 at 4:08 PM Etienne Chauchot  wrote:

Hi everyone,

Please review and vote on the release candidate #1 for the version
1.1.0, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org
[2], which is signed with the key with fingerprint
D1A76BA19D6294DD0033F6843A019F0B8DD163EA [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag v1.1.0-rc1 [5],
* website pull request listing the new release [6]

* confluence wiki: connector parent upgrade to version 1.1.0 that will
be validated after the artifact is released (there is no PR mechanism on
the wiki) [7]

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,

Etienne

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353442
[2]
https://dist.apache.org/repos/dist/dev/flink/flink-connector-parent-1.1.0-rc1
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1698/
[5]
https://github.com/apache/flink-connector-shared-utils/releases/tag/v1.1.0-rc1
[6] https://github.com/apache/flink-web/pull/717

[7]
https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development




[jira] [Created] (FLINK-34286) Attach cluster config map labels at creation time

2024-01-30 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34286:


 Summary: Attach cluster config map labels at creation time
 Key: FLINK-34286
 URL: https://issues.apache.org/jira/browse/FLINK-34286
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.19.0


We attach a set of labels to the config maps that we create, to ease manual 
cleanup by users in case Flink fails unrecoverably.

For cluster config maps (which are used for leader election), these labels 
are not set at creation time but when leadership is acquired, in contrast to 
job config maps.

This means there's a gap where we create a CM without any labels attached, 
and should Flink fail before leadership can be acquired, it will continue to 
lack labels indefinitely.

AFAICT it should be straightforward, at least API-wise, to set these labels 
at creation time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34097) Remove unused JobMasterGateway#requestJobDetails

2024-01-15 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34097:


 Summary: Remove unused JobMasterGateway#requestJobDetails
 Key: FLINK-34097
 URL: https://issues.apache.org/jira/browse/FLINK-34097
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.19.0


This method is wired all the way to the scheduler; remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34004) TestingCheckpointIDCounter can easily lead to NPEs

2024-01-05 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-34004:


 Summary: TestingCheckpointIDCounter can easily lead to NPEs
 Key: FLINK-34004
 URL: https://issues.apache.org/jira/browse/FLINK-34004
 Project: Flink
  Issue Type: Technical Debt
  Components: Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.19.0


The TestingCheckpointIDCounter builder doesn't define safe defaults for all 
builder parameters. Using it can easily lead to surprising 
NullPointerExceptions in tests when code is modified to call more methods.
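
A self-contained sketch of the hazard (hypothetical builder, not the actual Flink class): a test double whose builder leaves behavior fields null throws an NPE as soon as production code calls a method the test didn't stub.

{code}
import java.util.function.Supplier;

final class TestingCounter {

    private final Supplier<Long> getBehavior; // null when the test never stubbed it

    private TestingCounter(Supplier<Long> getBehavior) {
        this.getBehavior = getBehavior;
    }

    long get() {
        return getBehavior.get(); // surprise NPE once production code starts calling get()
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private Supplier<Long> getBehavior; // no safe default!

        Builder setGetBehavior(Supplier<Long> behavior) {
            this.getBehavior = behavior;
            return this;
        }

        TestingCounter build() {
            // Safer: default every field, e.g.
            // getBehavior = () -> { throw new UnsupportedOperationException("get"); };
            return new TestingCounter(getBehavior);
        }
    }
}
{code}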



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-395: Migration to GitHub Actions

2023-12-04 Thread Chesnay Schepler

We could limit the (first) trial run to branches.

PRs wouldn't be affected (avoiding a bunch of concerns about maybe 
blocking PRs and misleading people into thinking that CI is green), we'd 
have a better handle on how much capacity we are consuming, but 
contributors would still get the new setup (which for some is better 
than none).

We'd also side-step any potential security issue for the time being.

On 01/12/2023 05:10, Yangze Guo wrote:

Thanks for the efforts, @Matthias. +1 to start a trial on Github
Actions and migrate the CI if we can prove its computation capacity
and stability.

I share the same concern with Xintong that we do not explicitly claim
the effect of this trial on the contribution procedure. I think you
can elaborate more on this in the migration plan section. Here is my
thought about it:
I prefer to enable the CI workflow based on GitHub Actions for each PR
because it helps us understand its stability and performance under
certain pressures. However, I am not inclined to make "passing the CI
via GitHub Actions" a necessity in the code contribution process; we
can encourage contributors to report unstable cases under a specific
ticket umbrella when they encounter them.

Best,
Yangze Guo

On Thu, Nov 30, 2023 at 12:10 AM Matthias Pohl
 wrote:

With regards to Alex' concerns on hardware disparity: I did a bit more
digging on that one. I added my findings in a hardware section to FLIP-396
[1]. It appears that the hardware is more or less the same between the
different hosts. Apache INFRA's runners have more disk space (1TB in
comparison to 14GB), though.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Trial+during+Flink+1.19+Cycle+to+test+migrating+to+GitHub+Actions#FLIP396:TrialduringFlink1.19CycletotestmigratingtoGitHubActions-HardwareSpecifications

On Wed, Nov 29, 2023 at 4:01 PM Matthias Pohl 
wrote:


Thanks for your feedback Alex. I responded to your comments below:

This is mentioned in the "Limitations of GitHub Actions in the past"

section of the FLIP. Does this also apply to the Apache INFRA setup or
can we expect contributors' runs executed there too?


Workflow runs on Flink forks (independent of PRs that would merge to
Apache Flink's core repo) will be executed with runners provided by GitHub
with their own limitations. Secrets are not set in these runs (similar to
what we have right now with PR runs).

If we allow the PR CI to run on Apache INFRA-hosted ephemeral runners, we
might have the same freedom because of their ephemeral nature (the VMs are
discarded after each run).

We only have to start thinking about self-hosted customized runners if we
decide/need to have dedicated VMs for Flink's CI (similar to what we have
right now with Azure CI and Alibaba's VMs). This might happen if the
waiting times for acquiring a runner are too long. In that case, we might
give a certain group of people (e.g. committers) or certain types of events
(for PRs,  nightly builds, PR merges) the ability to use the self-hosted
runners.

As you mentioned in the FLIP, there are some timeout-related test

discrepancies between different setups. Similar discrepancies could
manifest themselves between the Github runners and the Apache INFRA
runners. It would be great if we should have a uniform setup, where if
tests pass in the individual CI, they also pass in the main runner and vice
versa.


I agree. So far, what we've seen is that the timeout instability is coming
from too optimistic timeout configurations in some tests (they eventually
also fail in Azure CI; but the GitHub-provided runners seem to be more
sensitive in this regard). Fixing the tests if such a flakiness is observed
should bring us to a stage where the test behavior is matching between
different runners.

We had a similar issue in the Azure CI setup: Certain tests were more
stable on the Alibaba machines than on Azure VMs. That is why we introduced
a dedicated stage for Azure CI VMs as part of the nightly runs (see
FLINK-18370 [1]). We could do the same for GitHub Actions if necessary.

Currently we have such memory limits-related issues in individual vs main

Azure CI pipelines.


I'm not sure I understand what you mean by memory limit-related issues.
The GitHub-provided runners do not seem to run into memory-related issues.
We have to see whether this also applies to Apache INFRA-provided runners.
My hope is that they have even better hardware than what GitHub offers. But
GitHub-provided runners seem to be a good fallback to rely on (see the
workflows I shared in my previous response to Xintong's message).

[1] https://issues.apache.org/jira/browse/FLINK-18370

On Wed, Nov 29, 2023 at 3:17 PM Matthias Pohl 
wrote:


Thanks for your comments, Xintong. See my answers below.



I think it would be helpful if we can at the end migrate the CI to an
ASF-managed Github Action, as long as it provides us a similar
computation capacity and stability.


The current test runs in my Flink fork (using the GitHub-provided

[jira] [Created] (FLINK-33352) OpenAPI spec is lacking mappings for discriminator properties

2023-10-24 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-33352:


 Summary: OpenAPI spec is lacking mappings for discriminator 
properties
 Key: FLINK-33352
 URL: https://issues.apache.org/jira/browse/FLINK-33352
 Project: Flink
  Issue Type: Bug
  Components: Documentation, Runtime / REST
Affects Versions: 1.17.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.17.2, 1.19.0, 1.18.1






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release 1.18.0, release candidate #2

2023-10-17 Thread Chesnay Schepler

IIRC releases MUST be built with JDK 8.

@Yun Tang how did you discover this?

There is supposed to be a check to enforce this, but weirdly enough it 
isn't rejecting my JDK 11 installation :/
Ah, we didn't set up the enforcer plugin correctly; while we do set a 
required JDK version by default, it is interpreted as ">=", so anything 
equal to or above 8 is currently accepted...


On 17/10/2023 07:55, Yun Tang wrote:

Hi Jing,

I found that the pre-built Flink binary release is built with JDK 17 [1]. 
Since the default target version is still 1.8 [2] and we did not mention 
that building with Java 17 is supported [3], do we really want to make the 
default binary release built with JDK 17?


[1] 
https://dist.apache.org/repos/dist/dev/flink/flink-1.18.0-rc2/flink-1.18.0-bin-scala_2.12.tgz
[2] 
https://github.com/apache/flink/blob/f978a77e2b9ade1e89dfb681d4b99fc13c72d2ed/pom.xml#L128
[3] 
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/flinkdev/building/#build-flink

Best,
Yun Tang

From: Lijie Wang 
Sent: Tuesday, October 17, 2023 12:42
To: dev@flink.apache.org 
Subject: Re: [VOTE] Release 1.18.0, release candidate #2

+1 (non-binding)

  -  Verified the signature and checksum
  -  Built from the source code
  -  Ran an example job on yarn cluster
  -  Checked the website PR

Best,
Lijie

Jing Ge wrote on Mon, Oct 16, 2023, 18:43:


Hi everyone,

Please review and vote on the release candidate #2 for the version
1.18.0, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:

* JIRA release notes [1], and the pull request adding release note for
users [2]
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [3], which are signed with the key with
fingerprint 96AE0E32CBE6E0753CE6 [4],
* all artifacts to be deployed to the Maven Central Repository [5],
* source code tag "release-1.18.0-rc2" [6],
* website pull request listing the new release and adding announcement blog
post [7].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Best regards,
Konstantin, Qingsheng, Sergey, and Jing

[1]

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352885
[2] https://github.com/apache/flink/pull/23527
[3] https://dist.apache.org/repos/dist/dev/flink/flink-1.18.0-rc2/
[4] https://dist.apache.org/repos/dist/release/flink/KEYS
[5] https://repository.apache.org/content/repositories/orgapacheflink-1658
[6] https://github.com/apache/flink/releases/tag/release-1.18.0-rc2
[7] https://github.com/apache/flink-web/pull/680





Re: [VOTE] FLIP-366: Support standard YAML for FLINK configuration

2023-10-16 Thread Chesnay Schepler

+1 (binding)

On 13/10/2023 04:12, Junrui Lee wrote:

Hi all,

Thank you to everyone for the feedback on FLIP-366[1]: Support standard
YAML for FLINK configuration in the discussion thread [2].
I would like to start a vote for it. The vote will be open for at least 72
hours (excluding weekends, unless there is an objection or an insufficient
number of votes).

Thanks,
Junrui

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-366%3A+Support+standard+YAML+for+FLINK+configuration
[2]https://lists.apache.org/thread/qfhcm7h8r5xkv38rtxwkghkrcxg0q7k5





Re: [DISCUSS] Java Record support

2023-10-04 Thread Chesnay Schepler
Kryo isn't required for this; newer versions do support records, but we 
want something like a PojoSerializer for records to be performant.

The core challenges are
a) detecting records during type extraction, and
b) ensuring parameters are passed to the constructor in the right order.

From what I remember from my own experiments this shouldn't exactly be 
/difficult/, just a bit tedious to integrate into the type extraction stack.
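
A self-contained sketch of both challenges using plain reflection (illustrative only, not Flink's type-extraction code; requires Java 16+):

{code}
import java.lang.reflect.Constructor;
import java.lang.reflect.RecordComponent;

public class RecordReflection {

    record Point(int x, int y) {}

    public static void main(String[] args) throws Exception {
        Class<Point> clazz = Point.class;

        // a) detecting records during type extraction
        System.out.println(clazz.isRecord()); // true

        // b) invoking the canonical constructor with parameters in component order
        RecordComponent[] components = clazz.getRecordComponents();
        Class<?>[] paramTypes = new Class<?>[components.length];
        for (int i = 0; i < components.length; i++) {
            paramTypes[i] = components[i].getType();
        }
        Constructor<Point> canonical = clazz.getDeclaredConstructor(paramTypes);

        Point p = canonical.newInstance(1, 2); // values in declaration order
        System.out.println(p); // Point[x=1, y=2]
    }
}
{code}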


On 04/10/2023 16:14, Őrhidi Mátyás wrote:

+1 This would be great

On Wed, Oct 4, 2023 at 7:04 AM Gyula Fóra  wrote:

Hi All!

Flink 1.18 contains experimental Java 17 support but it misses out
on Java
records which can be one of the nice benefits of actually using
newer java
versions.

There is already a Jira to track this feature [1] but I am not
aware of any
previous efforts so far.

Since records have pretty strong guarantees and many users would
probably
want to migrate from their POJOs, I think we should enhance the
current
Pojo TypeInfo/Serializer to accommodate for the records.

I experimented with this locally and the changes are not huge as
we only
need to allow instantiating records through the constructor instead of
setters. This would mean that the serialization format is basically
equivalent to the same non-record pojo, giving us backward
compatibility
and all the features of the Pojo serializer for basically free.

We should make sure to not introduce any performance regression in the
PojoSerializer but I am happy to open a preview PR if there is
interest.

There were mentions of upgrading Kryo to support this but I think that
would add unnecessary complexity.

What do you all think?

Cheers,
Gyula

[1] https://issues.apache.org/jira/browse/FLINK-32380



Re: Flink and Flink shaded dependency

2023-10-04 Thread Chesnay Schepler

There is no "monolithic" flink-shaded dependency.
Connectors shouldn't depend on anything that Flink provides, but be 
self-contained as Martijn pointed out.


Connectors shouldn't depend on flink-shaded.
The overhead and/or risks of doing/supporting that right now far 
outweigh the benefits.
( Because we either have to encode the full version for all dependencies 
into the package, or accept the risk of minor/patch dependency clashes)
Connectors are small enough in scope that depending directly on 
guava/jackson/etc. is a fine approach, and they have plenty of other 
dependencies that they need to manage anyway; let's treat these the same 
way.
Naturally this is also an argument against flink-shaded-connectors; on 
top of that we already experience repo creep and managing releases is 
difficult enough as-is.


As for class-loading, there has been a long-standing goal of each 
connector being loaded in their own classloader. That still is the north 
star and the only reasonable way to ensure that multiple connectors can 
be safely used with SQL.


On 02/10/2023 18:32, Jing Ge wrote:

Hi Sergey,

Thanks for sharing your thoughts. It could somehow help but didn't get to
the root of this issue.

According to the documentation, flink-shaded is used to provide a single
instance of a shaded dependency across sub-modules in the Flink repo. Shaded
namespaces should be used where shaded dependencies are configured. After
connectors were externalized, we ended up with more repos depending on one
shaded jar, e.g. guava. This is a "monolithic" dependency setup that makes
it difficult to change the root (flink-shaded), because any change to the
root has to be propagated to all downstream repos. Even worse, not every
downstream repo is known while modifying the root.

Since all externalized connectors have their own repos and are not
sub-modules of Flink anymore, I would suggest the following upgrade:

1. Connectors should use their own classloader instead of Flink's
classloader. This will break the monolithic dependency. Connectors and
Flink can use different versions of flink-shaded.
2. [optional] It would be even better that all connector repos depend on
their own individual shaded repo, e.g. flink-connector-shaded. flink-shaded
should only be used by Flink.

WDYT?

Best regards,
Jing


On Thu, Sep 14, 2023 at 11:28 PM Sergey Nuyanzin 
wrote:


Yes, that's a reasonable question, thanks for raising it.

I think this is not only about flink-shaded, rather about dependencies in
general

I guess there is no rule of thumb, or at least I'm not aware of one.
Here are my thoughts:
1. If bumping the dependency doesn't require breaking changes and passes
existing tests, then just bump it.
2. In case there are breaking changes, we could consider doing this within
the next major release. For a minor release:
 a. try to answer the question of whether it impacts Flink or not
 b. in case it impacts Flink and the fix itself is relatively small, then to
avoid a breaking change we could copy classes with the fixes into the Flink
repo, like usually happens with Calcite-related fixes. The problem with this
approach is that I guess it will not work for non-JVM deps like e.g. RocksDB.
 c. in case there is no way to do it without breaking changes for a minor
release, then we probably need a sort of announcement motivating users to
move to another major version where the issue is fixed.

looking forward to seeing other opinions about that

On Wed, Sep 13, 2023 at 9:47 PM Jing Ge 
wrote:


Hi Sergey,

Thanks for doing the analysis and providing the great insight. I did my own
analysis and came to the same conclusion. I just wanted to use this example
to kick off a discussion and check if there is a common guideline or concept
in the community to handle such cases, since it seems any bump-up might
have a big impact. This time, we are kind of lucky. What if CVE-related
code had been used in Flink? I do see it as an issue that upgrading a lib
in the flink-shaded repo is not recommended because of the complexity.
IMHO, we don't want to put the cart before the horse.

Best regards,
Jing

On Wed, Sep 13, 2023 at 9:11 PM Sergey Nuyanzin 
wrote:


Thanks for raising this.

I would suggest trying to double-check whether it actually impacts Flink
or not.

For instance, from one side Calcite between 1.22.0..1.31.0 has a critical
CVE-2022-39135 [1]; from the other side, Flink does not use this
functionality and is not impacted by it.

Regarding Guava:
After a closer look at Guava's high CVE you've mentioned [2], and based on
the GitHub issue describing the problem (it exists since 2016) [3], there is
a security issue with the class com.google.common.io.FileBackedOutputStream.

While looking at the source code for usages of
com.google.common.io.FileBackedOutputStream, I was not able to find any.
Also I was not able to find usage of
org.apache.flink.shaded.guava30.com.google.common.io.Files#createTempDir,
which was fixed within commit [4].
Also I was not able to find other Guava classes which use the 

Re: [Discuss] FLIP-366: Support standard YAML for FLINK configuration

2023-10-03 Thread Chesnay Schepler
It is unfortunate that we'll need a separate config file, but the FLIP 
does a good job of justifying it.


Personally I'd just name it "config.yaml"; I never quite understood why 
there was a flink prefix to begin with, and the current proposal 
("flink-configuration.yaml") seems unnecessarily long.


For the deprecation process we could consider logging a warning if the 
old parser is used.


In terms of scope, is the migration of existing e2e tests and the docker 
image to the new parser part of the FLIP?


On 22/09/2023 09:32, Jane Chan wrote:

Hi, Junrui,

Sorry for the late reply. The update looks good to me and thanks for your
effort!

Best,
Jane

On Fri, Sep 22, 2023 at 2:25 PM Yuxin Tan  wrote:


Hi, Junrui

+1 for the proposal.
Thanks for your effort.

Best,
Yuxin


On Fri, Sep 22, 2023 at 1:23 PM Samrat Deb  wrote:


Hello Junrui,

+1 for the proposal.


Bests,
Samrat

On Fri, Sep 22, 2023 at 10:18 AM Shammon FY  wrote:


+1 for the proposal, thanks for driving.

Best,
Shammon FY

On Fri, Sep 22, 2023 at 12:41 PM Yangze Guo 

wrote:

Thanks for driving this, +1 for the proposal.

Best,
Yangze Guo


On Fri, Sep 22, 2023 at 11:59 AM Lijie Wang <

wangdachui9...@gmail.com>

wrote:

Hi Junrui,

+1 for this proposal, thanks for driving.

Best,
Lijie

On Fri, Sep 22, 2023 at 10:07 AM ConradJam  wrote:


+1 Support for standard YAML format facilitates specification

Jing Ge  于2023年9月22日周五 02:23写道:


Hi Junrui,

+1 for following the standard. Thanks for your effort!

Best regards,
Jing

On Thu, Sep 21, 2023 at 5:09 AM Junrui Lee <

jrlee@gmail.com>

wrote:

Hi Jane,

Thank you for your valuable feedback and suggestions.
I agree with your point about differentiating between "flink-config.yaml"
and "flink-conf.yaml" to determine the standard syntax at a glance.

While I understand your suggestion of using "flink-conf-default.yaml" to
represent the default YAML file for Flink 1.x, I have been considering the
option of using "flink-configuration.yaml" as the file name for the new
configuration file.
This name "flink-configuration.yaml" provides a clear distinction between
the new and old configuration files based on their names, and it does not
introduce any additional semantics. Moreover, this name can continue to be
used in future versions, including Flink 2.0.

WDYT? If we can reach a consensus on this, I will update the FLIP
documentation accordingly.

Best regards,
Junrui

On Wed, Sep 20, 2023 at 11:38 PM Jane Chan  wrote:


Hi Junrui,

Thanks for driving this FLIP. +1 for adoption of the standard YAML syntax.

I just have one minor suggestion. It's a little bit challenging to
differentiate between `flink-config.yaml` and `flink-conf.yaml` to
determine which one uses the standard syntax at a glance. How about using
`flink-conf-default.yaml` to represent the default yaml file for Flink 1.x?

Best,
Jane

On Wed, Sep 20, 2023 at 11:06 AM Junrui Lee <

jrlee@gmail.com

wrote:

Hi devs,

I would like to start a discussion about FLIP-366:
Support standard YAML for FLINK configuration [1].

The current flink-conf.yaml parser in FLINK is not a standard YAML parser,
which has some shortcomings.
Firstly, it does not support nested structure configuration items and only
supports key-value pairs, resulting in poor readability. Secondly, if the
value is a collection type, such as a List or Map, users are required to
write the value in a FLINK-specific pattern, which is inconvenient to use.
Additionally, the parser of FLINK has some differences in syntax compared
to a standard YAML parser, such as the syntax for parsing comments and
null values. These inconsistencies can cause confusion for users, as seen
in FLINK-15358 and FLINK-32740.

By supporting standard YAML, these issues can be resolved, and users can
create a Flink configuration file using third-party tools and leverage
some advanced YAML features. Therefore, we propose to support standard
YAML for FLINK configuration.

You can find more details in FLIP-366 [1]. Looking forward to your feedback.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-366%3A+Support+standard+YAML+for+FLINK+configuration
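
As a rough illustration of the difference (hypothetical file contents; the
keys are just examples):

{code}
# Legacy flink-conf.yaml style: flat key-value pairs only, with
# Flink-specific encodings for collections.
taskmanager.numberOfTaskSlots: 4
pipeline.jars: file:///opt/a.jar;file:///opt/b.jar
---
# Standard YAML: nesting and native lists become possible.
taskmanager:
  numberOfTaskSlots: 4
pipeline:
  jars:
    - file:///opt/a.jar
    - file:///opt/b.jar
{code}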

Best,
Junrui








Re: [DISCUSS] Flink annotation strategy/consensus

2023-09-15 Thread Chesnay Schepler
We need the @Internal annotation for cases where a user-facing class has 
methods entirely meant for internal use.
It's not great that we have these cases to begin with, but fixing that 
is another story.


The semantics for non-annotated classes is already documented.
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/upgrading/#api-compatibility-guarantees

On 14/09/2023 02:31, Becket Qin wrote:

Does it make sense to clearly define that APIs without annotation are
internal APIs and should not be used outside of Flink, and to deprecate @Internal?


We can do this. Although I think it is OK to keep the @Internal annotation
in case extra clarity is needed sometimes.

Thanks,

Jiangjie (Becket) Qin

On Tue, Sep 12, 2023 at 7:11 PM Jing Ge  wrote:


Hi Becket,

Thanks for your reply with details.



2. I agree it would be too verbose to annotate every internal method /
class / interface. Currently we already treat the methods / interfaces /
classes without annotations as effectively @Internal.


Does it make sense to clearly define that APIs without annotation are
internal APIs and should not be used outside of Flink, and to deprecate @Internal?

Best regards,
Jing

On Mon, Sep 11, 2023 at 5:05 AM Becket Qin  wrote:


Hi Jing,

Thanks for bringing up the discussion. My two cents:

1. All the public methods / classes / interfaces MUST be annotated with one
of @Experimental / @PublicEvolving / @Public. In practice, all the methods
by default inherit the annotation from the containing class, unless
annotated otherwise, e.g. an @Internal method in a @Public class.

2. I agree it would be too verbose to annotate every internal method /
class / interface. Currently we already treat the methods / interfaces /
classes without annotations as effectively @Internal.

3. Per our discussion in the other thread, @Deprecated SHOULD coexist with
one of @Experimental / @PublicEvolving / @Public. In that case, @Deprecated
overrides the other annotation, which means that the public API will not
evolve and will be removed according to the deprecation process.

4. The internal methods / classes / interfaces SHOULD NOT be marked as
deprecated. Instead, an immediate refactor should be done to remove the
"deprecated" internal methods / classes / interfaces and migrate the code
to their successors. Otherwise, technical debt will build up.
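
A minimal sketch of these conventions, with invented class names:

{code}
import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.Public;
import org.apache.flink.annotation.PublicEvolving;

// Point 1: methods inherit the stability of the containing class unless
// annotated otherwise.
@Public
class ExampleClient {
    public void submit() {}         // inherits @Public

    @Internal
    public void internalHook() {}   // explicitly excluded from the public surface
}

// Point 3: @Deprecated coexists with a stability annotation; the API is
// frozen and will be removed via the deprecation process.
@Deprecated
@PublicEvolving
interface LegacyExampleSink<T> {
    void invoke(T value);
}
{code}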

Thanks,

Jiangjie (Becket) Qin



On Sat, Sep 9, 2023 at 5:29 AM Jing Ge 

wrote:

Hi devs,

While I was joining the flink-avro enhancement and cleanup discussion
driven by Becket[1], I realized that there are some issues with the

current

Flink API annotation usage in the source code.

As far as I am concerned, Flink wants to control the access/visibility

of

APIs across modules and for downstreams. Since no OSGI is used(it

should

not be used because of its complexity, IMHO), Flink decided to use a

very

lightweight but manual solution: customized annotation like @Internal,
@Experimental, @PublicEvolving,
etc. This is a Flink only concept on top of JDK annotation and is

therefore

orthogonal to @Deprecated or any other annotations offered by JDK.

After

this concept has been used, APIs without one of these annotations are

in

the kind of gray area which means they have no contract in the context

of

this new concept. Without any given metadata they could be considered
as @Internal or @Experimental, because changes are allowed to be

applied

at

any time. But there is no clear definition and therefore different

people

will understand it differently.

There are two options to improve it, as far as I could figure out:

option 1: All APIs must have one of those annotations. We should put

some

effort into going through all source code and add missing annotations.
There were discussions[2] and activities going in this direction.
option 2: the community comes to a new consensus that APIs without
annotation equals one of @Internal, @Experimental, or @PublicEvolving.

I

personally will choose @Internal, because it is the safest one. And if
@Internal is chosen as the default one, it could also be deprecated,
because no annotation equals @Internal. If it makes sense, I can

create a

FLIP and help the community reach this consensus.

Both options have their own pros and cons. I would choose option 2,

since

we will not end up with a lot of APIs marked as @Internal.

Looking forward to hearing your thoughts.

Best regards
Jing


[1] https://lists.apache.org/thread/7zsv528swbjxo5zk0bxq33hrkvd77d6f
[2] https://lists.apache.org/thread/zl2rmodsjsdb49tt4hn6wv3gdwo0m31o





[jira] [Created] (FLINK-32888) File upload runs into EndOfDataDecoderException

2023-08-17 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32888:


 Summary: File upload runs into EndOfDataDecoderException
 Key: FLINK-32888
 URL: https://issues.apache.org/jira/browse/FLINK-32888
 Project: Flink
  Issue Type: Bug
  Components: Runtime / REST
Affects Versions: 1.17.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0, 1.17.2


With the right request the FIleUploadHandler runs into a 
EndOfDataDecoderException although everything is fine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32834) Allow compile.sh to be used manually

2023-08-11 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32834:


 Summary: Allow compile.sh to be used manually
 Key: FLINK-32834
 URL: https://issues.apache.org/jira/browse/FLINK-32834
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


For debugging purposes it would be nice if you could run compile.sh locally.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32745) Add a flag to skip InputSelectable preValidate step

2023-08-03 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32745:


 Summary: Add a flag to skip InputSelectable preValidate step
 Key: FLINK-32745
 URL: https://issues.apache.org/jira/browse/FLINK-32745
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream, Runtime / Configuration
Reporter: Chesnay Schepler
 Fix For: 1.19.0


{{StreamingJobGraphGenerator#preValidate}} has a step where it checks that no 
operator implements {{InputSelectable}} if checkpointing is enabled, because 
these features aren't compatible.

This step can be extremely expensive when the {{CodeGenOperatorFactory}} is 
used, because it requires all generated operator classes to actually be 
compiled (which usually only happens on the task manager).

If you know what jobs you're running this step can be pure overhead.
It would be nice if we'd have a flag to skip this validation step.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32681) RocksDBStateDownloaderTest.testMultiThreadCleanupOnFailure unstable

2023-07-26 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32681:


 Summary: 
RocksDBStateDownloaderTest.testMultiThreadCleanupOnFailure unstable
 Key: FLINK-32681
 URL: https://issues.apache.org/jira/browse/FLINK-32681
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / State Backends, Tests
Affects Versions: 1.18.0
Reporter: Chesnay Schepler
 Fix For: 1.18.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=51712&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS][2.0] FLIP-352: Use camelCase for all REST API fields/parameters

2023-07-25 Thread Chesnay Schepler

On 25/07/2023 04:09, Xintong Song wrote:

In this case, I wonder if it makes sense to bump the REST API version to
V2. That should allow us to gradually phase out the old APIs without the
above mentioned problems.
It's ultimately possible to do that; it just implies duplicating _a lot_ of
classes and setting up a class hierarchy between the messages, such that we
don't need to duplicate all the workers, and slapping v2 onto all handlers.
It's quite a lot of monkey work really, with quite a maintenance cost on
the 1.x side.
Personally I'd like to avoid it; if we go with a v2 for this one, all
other changes that I'm proposing will also be pushed into v2, which will
make the efforts more work and way harder to maintain in 1.x (and imo
the point of 2.0 is to be able to break things hard).


Re: [DISCUSS][2.0] FLIP-351: REST API normalizes +/-Inf / NaN to 0

2023-07-25 Thread Chesnay Schepler
Then we'd break the API for users that already applied workarounds, 
even though those users haven't done anything wrong.


On 25/07/2023 04:31, Xintong Song wrote:

we should treat these cases as errors


Looking at the fields listed in the FLIP, I'd agree with this argument. And
based on this, shouldn't we fail the request with e.g., a status code 500,
rather than responding with a fallback value silently?

Best,

Xintong



On Tue, Jul 25, 2023 at 12:22 AM Jing Ge  wrote:


We might consider using 0 as null for values that can never be 0. But null
is still not equal to 0, and it will be very difficult to get every
contributor in this community to follow this rule, especially for future
unknown APIs, which means there will be some cases that still need null.
Personally, I would choose accuracy over convenience and consistency over
convenience. Therefore, null is recommended.

Best regards,
Jing

On Mon, Jul 24, 2023 at 11:48 PM Chesnay Schepler 
wrote:


The downside to null is that it again forces users to handle this case
themselves.

The reality is that there is no good default value.

Ideally we fix all cases where we return such values, such that the
fallback to 0 isn't even used.
Arguably the same could be said for null, but I'd think that 0 is less
of a surprise.

On 24/07/2023 17:21, Gyula Fóra wrote:

I agree that it's a bit strange to have 0 as a fallback value because it
can also naturally occur for many metrics.
If we want to omit the value, null would probably be better, as Matthias
suggested.

Gyula

On Mon, Jul 24, 2023 at 4:02 PM Matthias Pohl
 wrote:


What was the reason you decided to go for 0 as the fallback value instead
of null? Wouldn't that be a more reasonable value for error cases?

On Mon, Jul 24, 2023 at 12:51 PM Chesnay Schepler 
There are a number of cases where the REST API can return infinity or
NaN for certain double fields.

This is problematic because the JSON spec does not allow such values,
and tooling working against that spec may run into issues when
encountering such a value.

Specifically we've seen this become an issue in clients generated from
the OpenAPI spec.





https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425797






Re: [DISCUSS][2.0] FLIP-351: REST API normalizes +/-Inf / NaN to 0

2023-07-24 Thread Chesnay Schepler
The downside to null is that it again forces users to handle this case 
themselves.


The reality is that there is no good default value.

Ideally we fix all cases where we return such values, such that the 
fallback to 0 isn't even used.
Arguably the same could be said for null, but I'd think that 0 is less 
of a surprise.


On 24/07/2023 17:21, Gyula Fóra wrote:

I agree that it's a bit strange to have 0 as a fallback value because it
can also naturally occur for many metrics.
If we want to omit the value, null would probably be better, as Matthias
suggested.

Gyula

On Mon, Jul 24, 2023 at 4:02 PM Matthias Pohl
 wrote:


What was the reason you decided to go for 0 as the fallback value instead
of null? Wouldn't that be a more reasonable value for error cases?

On Mon, Jul 24, 2023 at 12:51 PM Chesnay Schepler 
wrote:


There are a number of cases where the REST API can return infinity or
NaN for certain double fields.

This is problematic because the JSON spec does not allow such values,
and tooling working against that spec may run into issues when
encountering such a value.

Specifically we've seen this become an issue in clients generated from
the OpenAPI spec.





https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425797




Re: [VOTE][2.0] FLIP-340: Remove rescale REST endpoint

2023-07-24 Thread Chesnay Schepler

I updated the endpoint in the FLIP.

On 24/07/2023 14:28, Matthias Pohl wrote:

I should have mentioned it in the discussion thread but I missed going over
that ML thread earlier: We might want to update the FLIP to refer to the
actual endpoint /jobs/:jobid/rescaling (AFAIU) with the corresponding cause
being FLINK-12312 [1].

But that's just a minor thing.

+1 (binding)

[1] https://issues.apache.org/jira/browse/FLINK-12312

On Mon, Jul 24, 2023 at 2:24 PM Konstantin Knauf  wrote:


+1 (binding)

On Mon, Jul 24, 2023 at 2:15 PM Martijn Visser <martijnvis...@apache.org> wrote:


+1 (binding)

On Mon, Jul 24, 2023 at 1:10 PM Chesnay Schepler 
wrote:


Hello,

I'd like to start a vote on FLIP-340.

Discussion thread:
https://lists.apache.org/thread/zkslk0qzttwgs8j3s951rht3v1tsyqqk
FLIP:



https://cwiki.apache.org/confluence/display/FLINK/FLIP-340%3A+Remove+rescale+REST+endpoint

Regards,
Chesnay



--
https://twitter.com/snntrable
https://github.com/knaufk





[VOTE][2.0] FLIP-340: Remove rescale REST endpoint

2023-07-24 Thread Chesnay Schepler

Hello,

I'd like to start a vote on FLIP-340.

Discussion thread: 
https://lists.apache.org/thread/zkslk0qzttwgs8j3s951rht3v1tsyqqk
FLIP: 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-340%3A+Remove+rescale+REST+endpoint


Regards,
Chesnay


[VOTE][2.0] FLIP-336: Remove "now" timestamp field from REST responses

2023-07-24 Thread Chesnay Schepler

Hello,

I'd like to start a vote on FLIP-336.

Discussion thread: 
https://lists.apache.org/thread/ms3sk0p21n7q2oq0fjtq43koqj2pmwv4
FLIP: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424789


Regards,
Chesnay


[DISCUSS][2.0] FLIP-352: Use camelCase for all REST API fields/parameters

2023-07-24 Thread Chesnay Schepler

Make the REST API more consistent and easier to work with from the UI.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425799
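
For illustration, the kind of change this implies for response payloads
(hypothetical before/after; the field names are examples):

{code}
before: {"jid": "...", "start-time": 1690000000000, "end-time": -1}
after:  {"jid": "...", "startTime": 1690000000000, "endTime": -1}
{code}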


[DISCUSS][2.0] FLIP-351: REST API normalizes +/-Inf / NaN to 0

2023-07-24 Thread Chesnay Schepler
There are a number of cases where the REST API can return infinity or 
NaN for certain double fields.


This is problematic because the JSON spec does not allow such values, 
and tooling working against that spec may run into issues when 
encountering such a value.


Specifically we've seen this become an issue in clients generated from 
the OpenAPI spec.




https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425797
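
A minimal sketch of the proposed normalization (not the actual Flink code):

{code}
// JSON (RFC 8259) has no representation for NaN or +/-Infinity, so map
// non-finite doubles to 0 before they reach the JSON encoder.
static double normalize(double value) {
    return Double.isFinite(value) ? value : 0d;
}
{code}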


[DISCUSS][2.0] FLIP-350: Remove query parameters from Jar handlers

2023-07-24 Thread Chesnay Schepler
The jar handlers currently accept a variety of parameters both as query 
parameters and via the request body.


While this is primarily a problem for sending program args as query 
parameters, because escaping them is a nightmare, the remaining 
parameters should follow suit for consistency.


https://cwiki.apache.org/confluence/display/FLINK/FLIP-350%3A+Remove+query+parameters+from+Jar+handlers 



[DISCUSS][2.0] FLIP-349: Move RocksDB statebackend classes to o.a.f.state.rocksdb package

2023-07-24 Thread Chesnay Schepler
To properly reflect the state of the RocksDB statebackend, I propose to 
move all classes in the state-backend-rocksdb module to the 
o.a.f.state.rocksdb package.



https://cwiki.apache.org/confluence/display/FLINK/FLIP-349%3A+Move+RocksDB+statebackend+classes+to+o.a.f.state.rocksdb+package 



Re: [VOTE] Release 2.0 must-have work items

2023-07-21 Thread Chesnay Schepler

On 21/07/2023 11:45, Leonard Xu wrote:

In this way, the user will see the deprecated API first, but they cannot find 
a replacement if we cannot finish all tasks in one minor version.


I'm not convinced that this matters. There will be a whole bunch of APIs 
deprecated in 1.18 (that will remain in 1.x!) without a replacement, so 
that we can remove them in 2.0.

We already accepted this scenario.


Re: FLIP-342: Remove brackets around keys returned by MetricGroup#getAllVariables

2023-07-21 Thread Chesnay Schepler

There's a separate discussion altogether.

On 21/07/2023 02:49, Mason Chen wrote:

Hi all,

I agree getScopeVariables is in line with the existing terminology, but I've
seen that the existing terminology is a bit confusing with regards to how
users end up querying these metrics and building alerts/dashboards. I often
get the question of how to add a tag or label to a Flink metric, and it
would be more intuitive if "labels" or "tags" replaced "group" as a name,
to be more in line with popular metric systems.

Best,
Mason

On Thu, Jul 20, 2023 at 1:41 AM Chesnay Schepler  wrote:


Environment variables are a very different thing though; that'd just
confuse users.
It's also not a term we've used in the documentation.
getScopeVariables would, I guess, be most in line with existing terminology.

On 19/07/2023 18:10, Matthias Pohl wrote:

We don't have a well-defined process for breaking behavioral changes. We
could consider adding a new method with a different name.

Introducing a new API to make the behavioral change visible was also the
suggestion in the deprecation ML thread [1]. getEnvironmentVariables (or
even getEnvironment) might be a reasonable change.

[1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9

On Tue, Jul 18, 2023 at 1:10 PM Jing Ge 

wrote:

+1

On Tue, Jul 18, 2023 at 12:24 PM Xintong Song 
wrote:


+1

Best,

Xintong



On Tue, Jul 18, 2023 at 5:02 PM Chesnay Schepler 
wrote:


The FLIP number was changed to 342.

On 18/07/2023 10:56, Chesnay Schepler wrote:

MetricGroup#getAllVariables returns all variables associated with the
metric, for example:

<host> = abcde
<subtask_index> = 0

The keys are surrounded by brackets for no particular reason.

In virtually every use-case for this method the user is stripping the
brackets from keys, as done in:

   * our datadog reporter:
     https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L244
   * our prometheus reporter (implicitly via a character filter):
     https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/AbstractPrometheusReporter.java#L236
   * our JMX reporter:
     https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-jmx/src/main/java/org/apache/flink/metrics/jmx/JMXReporter.java#L223

I propose to change the method spec and implementation to remove the
brackets around keys.

For migration purposes it may make sense to add a new method with the
new behavior (getVariables()) and deprecate the old method.




https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425202





Re: [DISCUSS][2.0] FLIP-338: Remove terminationMode query parameter from job cancellation REST endpoint

2023-07-20 Thread Chesnay Schepler

We can introduce a V2 API at any point while supporting V1.
Given that this has been the case for years I'm not entirely convinced 
that the pain points are large enough to push anyone to pursue that 
(in particular, providing an _entire_ API and not just a few special 
endpoints, which would also be possible and not necessarily bad!).


I would need more context to comment on the stack trace issue to 
understand what the problem is and how it could be improved.


Anyhow, that'd be a way larger effort, while most of these smaller REST 
API changes are for maintenance reasons, which still makes sense given 
that the V1 API must be supported for all of 2.x. (And even if V2 lands 
in 2.0, having a WebUI-specific API isn't necessarily bad either.)


A separate thread on the topic could be very useful though!

On 20/07/2023 00:51, Austin Cawley-Edwards wrote:

It doesn't need to be part of the Flink 2.0 release perse, but starting to
wonder if we'd get more bang for our buck if we started fresh with a v2
REST API vs. one-off cleanups of the current v1 API. @Chesnay Schepler
 -- wdyt?

The v1 REST API seemed to grow naturally from its original use case of
supporting the Web UI iiuc, but now another of the core use cases is
operational (e.g., supporting the K8s Operator). For the operational use
case, it is clear that this wasn't the original design goal (e.g., cases
exist that require parsing the included Java stack trace to determine what
to do). Maybe @Gyula Fóra  also has some
experience/suggestions to share on if this would be valuable. (also happy
to start a new thread, sorry for co-opting this one)


Austin





Re: FLIP-342: Remove brackets around keys returned by MetricGroup#getAllVariables

2023-07-20 Thread Chesnay Schepler
Environment variables are a very different thing though; that'd just 
confuse users.

It's also not a term we've used in the documentation.
getScopeVariables would, I guess, be most in line with existing terminology.

On 19/07/2023 18:10, Matthias Pohl wrote:

We don't have a well-defined process for breaking behavioral changes. We
could consider adding a new method with a different name.

Introducing a new API to make the behavioral change visible was also the
suggestion in the deprecation ML thread [1]. getEnvironmentVariables (or
even getEnvironment) might be a reasonable change.

[1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9

On Tue, Jul 18, 2023 at 1:10 PM Jing Ge  wrote:


+1

On Tue, Jul 18, 2023 at 12:24 PM Xintong Song 
wrote:


+1

Best,

Xintong



On Tue, Jul 18, 2023 at 5:02 PM Chesnay Schepler 
wrote:


The FLIP number was changed to 342.

On 18/07/2023 10:56, Chesnay Schepler wrote:

MetricGroup#getAllVariables returns all variables associated with the
metric, for example:

<host> = abcde
<subtask_index> = 0

The keys are surrounded by brackets for no particular reason.

In virtually every use-case for this method the user is stripping the
brackets from keys, as done in:

  * our datadog reporter:
    https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L244
  * our prometheus reporter (implicitly via a character filter):
    https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/AbstractPrometheusReporter.java#L236
  * our JMX reporter:
    https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-jmx/src/main/java/org/apache/flink/metrics/jmx/JMXReporter.java#L223


I propose to change the method spec and implementation to remove the
brackets around keys.

For migration purposes it may make sense to add a new method with the
new behavior (getVariables()) and deprecate the old method.




https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425202






Re: [DISCUSS][2.0] FLIP-341: Remove MetricGroup methods accepting an int as a name

2023-07-18 Thread Chesnay Schepler

Good catch; I've fixed the list.

On 18/07/2023 12:20, Xintong Song wrote:

+1 in general.

I think the list of affected public interfaces in the FLIP is not accurate.

- `#counter(int, Counter)` is missed
- `#meter(int)` should be `#meter(int, Meter)`
- `#group(int)` should be `#addGroup(int)`


Best,

Xintong



On Tue, Jul 18, 2023 at 4:39 PM Chesnay Schepler  wrote:


The MetricGroup interface contains methods to create groups and metrics
using an int as a name. The original intention was to allow patterns like
group.addGroup("subtaskIndex").addGroup(0), but this didn't really
work out, with addGroup(String, String) serving this use case much
better.

Metric methods accept an int mostly for consistency, but there's no good
use-case for it.

These methods also offer hardly any convenience, since all they do is
save potential users from using String.valueOf on one argument. That
doesn't seem valuable enough for something that doubles the size of the
interface.

I propose to remove said methods.



https://cwiki.apache.org/confluence/display/FLINK/FLIP-341%3A+Remove+MetricGroup+methods+accepting+an+int+as+a+name





Re: FLIP-342: Remove brackets around keys returned by MetricGroup#getAllVariables

2023-07-18 Thread Chesnay Schepler

The FLIP number was changed to 342.

On 18/07/2023 10:56, Chesnay Schepler wrote:
MetricGroup#getAllVariables returns all variables associated with the 
metric, for example:


<host> = abcde
<subtask_index> = 0

The keys are surrounded by brackets for no particular reason.

In virtually every use-case for this method the user is stripping the 
brackets from keys, as done in:


 * our datadog reporter:
   https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L244
 * our prometheus reporter (implicitly via a character filter):
   https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/AbstractPrometheusReporter.java#L236
 * our JMX reporter:
   https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-jmx/src/main/java/org/apache/flink/metrics/jmx/JMXReporter.java#L223

I propose to change the method spec and implementation to remove the 
brackets around keys.


For migration purposes it may make sense to add a new method with the 
new behavior (getVariables()) and deprecate the old method.



https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425202 





Re: [DISCUSS][2.0] FLIP-337: Remove JarRequestBody#programArgs

2023-07-18 Thread Chesnay Schepler
Something to note is that the UI is using this parameter, and would have 
to be changed to the new one.


Since we want to avoid having to split arguments ourselves, this may 
imply changes to the UI.


On 18/07/2023 10:18, Chesnay Schepler wrote:

We'll log a warn message when it is used and maybe hide it from the docs.

The ArchUnit rule doesn't really work here because it's not annotated with 
stability annotations (as it shouldn't be, since the classes aren't really 
user-facing).


On 17/07/2023 21:56, Jing Ge wrote:

Hi Chesnay,

I am trying to understand what the right removal process is, with this
concrete example. Given that all things about programArgs are private or
package-private except the constructor, will you just mark it as deprecated
with constructor overloading in 1.18 and remove it in 2.0? Should we
describe the deprecation work in the FLIP?

Another, more general question, maybe off-track; I don't know which thread
is the right place to ask: since Java 11 has been recommended, should we
always include "since" and "forRemoval" while adding @Deprecated, e.g. via
an ArchUnit rule?

Best regards,
Jing

On Mon, Jul 17, 2023 at 5:33 AM Xintong Song  
wrote:



+1

Best,

Xintong



On Thu, Jul 13, 2023 at 9:34 PM Chesnay Schepler 
wrote:


Hello,

The request body for the jar run/plan REST endpoints accepts program
arguments as a string (programArgs) or a list of strings
(programArgsList). The latter was introduced as we kept running into
issues with splitting the string into individual arguments.

Ideally we'd force users to use the list argument, and we can simplify the
codebase if there is only one way to pass arguments.

As such I propose to remove the programArgs field from the request body.



https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424796 



Regards,

Chesnay







FLIP-401: Remove brackets around keys returned by MetricGroup#getAllVariables

2023-07-18 Thread Chesnay Schepler
MetricGroup#getAllVariables returns all variables associated with the 
metric, for example:


<host> = abcde
<subtask_index> = 0

The keys are surrounded by brackets for no particular reason.

In virtually every use-case for this method the user is stripping the 
brackets from keys, as done in:


 * our datadog reporter:
   https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L244
 * our prometheus reporter (implicitly via a character filter):
   https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-prometheus/src/main/java/org/apache/flink/metrics/prometheus/AbstractPrometheusReporter.java#L236
 * our JMX reporter:
   https://github.com/apache/flink/blob/9c3c8afbd9325b5df8291bd831da2d9f8785b30a/flink-metrics/flink-metrics-jmx/src/main/java/org/apache/flink/metrics/jmx/JMXReporter.java#L223


I propose to change the method spec and implementation to remove the 
brackets around keys.


For migration purposes it may make sense to add a new method with the 
new behavior (getVariables()) and deprecate the old method.



https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263425202
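
For illustration, a sketch of the stripping that the reporters above
effectively perform today, and which this change would make unnecessary
(not the actual reporter code):

{code}
import java.util.HashMap;
import java.util.Map;

final class BracketStripping {
    // "<host>" -> "host"; keys without brackets pass through unchanged.
    static Map<String, String> strip(Map<String, String> variables) {
        Map<String, String> result = new HashMap<>();
        variables.forEach((key, value) -> result.put(
                key.startsWith("<") && key.endsWith(">")
                        ? key.substring(1, key.length() - 1)
                        : key,
                value));
        return result;
    }
}
{code}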

Re: [DISCUSS] Release 2.0 Work Items

2023-07-18 Thread Chesnay Schepler

On 18/07/2023 10:33, Wencong Liu wrote:

For FLINK-6912:

 There are three implementations of RichFunction that actually use
the Configuration parameter in RichFunction#open:
 1. ContinuousFileMonitoringFunction#open: It uses the configuration
to configure the FileInputFormat. [1]
 2. OutputFormatSinkFunction#open: It uses the configuration
to configure the OutputFormat. [2]
 3. InputFormatSourceFunction#open: It uses the configuration
  to configure the InputFormat. [3]


And none of them should have any effect since the configuration is empty.

See org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator#open.
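
A small sketch of what this means for a hypothetical user function:

{code}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// On the streaming runtime the user function is opened with a freshly
// created, empty Configuration, so reading options from `parameters`
// can never observe user-provided settings.
class UppercaseMapper extends RichMapFunction<String, String> {
    @Override
    public void open(Configuration parameters) {
        // `parameters` is empty here.
    }

    @Override
    public String map(String value) {
        return value.toUpperCase();
    }
}
{code}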


[DISCUSS][2.0] FLIP-341: Remove MetricGroup methods accepting an int as a name

2023-07-18 Thread Chesnay Schepler
The MetricGroup interface contains methods to create groups and metrics 
using an int as a name. The original intention was to allow patterns like 
group.addGroup("subtaskIndex").addGroup(0), but this didn't really 
work out, with addGroup(String, String) serving this use case much 
better.


Metric methods accept an int mostly for consistency, but there's no good 
use-case for it.


These methods also offer hardly any convenience, since all they do is 
save potential users from using String.valueOf on one argument. That 
doesn't seem valuable enough for something that doubles the size of the 
interface.


I propose to remove said methods.


https://cwiki.apache.org/confluence/display/FLINK/FLIP-341%3A+Remove+MetricGroup+methods+accepting+an+int+as+a+name 
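
A sketch of the affected call sites (hypothetical usage):

{code}
import org.apache.flink.metrics.MetricGroup;

class MetricRegistration {
    void register(MetricGroup group) {
        // int-based overload slated for removal:
        // group.addGroup(0);

        // trivially equivalent with the String overload:
        group.addGroup(String.valueOf(0));

        // the key/value variant that serves the original intention better:
        group.addGroup("subtaskIndex", "0");
    }
}
{code}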


[DISCUSS][2.0] FLIP-340: Remove rescale REST endpoint

2023-07-18 Thread Chesnay Schepler
The endpoint hasn't been working for years and was only kept to inform 
users about it. Let's finally remove it.


https://cwiki.apache.org/confluence/display/FLINK/FLIP-340%3A+Remove+rescale+REST+endpoint 



Re: [DISCUSS][2.0] FLIP-337: Remove JarRequestBody#programArgs

2023-07-18 Thread Chesnay Schepler

We'll log a warn message when it is used and maybe hide it from the docs.

Archunit rule doesn't really work here because it's not annotated with 
stability annotations (as it shouldn't since the classes aren't really 
user-facing).


On 17/07/2023 21:56, Jing Ge wrote:

Hi Chesnay,

I am trying to understand what is the right removal process with this
concrete example. Given all things about the programArgs are private or
package private except the constructor. Will you just mark it as deprecated
with constructor overloading in 1.18 and remove it in 2.0?  Should we
describe the deprecation work in the FLIP?

Another more general question, maybe offtrack, I don't know which thread is
the right place to ask, since Java 11 has been recommended, should we
always include "since" and "forRemoval" while adding @Deprecated, i.e.
ArchUnit rule?

Best regards,
Jing

On Mon, Jul 17, 2023 at 5:33 AM Xintong Song  wrote:


+1

Best,

Xintong



On Thu, Jul 13, 2023 at 9:34 PM Chesnay Schepler 
wrote:


Hello,

The request body for the jar run/plan REST endpoints accepts program
arguments as a string (programArgs) or a list of strings
(programArgsList). The latter was introduced as kept running into issues
with splitting the string into individual arguments./
/

We ideally force users to use the list argument, and we can simplify the
codebase if there'd only be 1 way to pass arguments.

As such I propose to remove the programArgs field from the request body.



https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424796


Regards,

Chesnay





Re: [DISCUSS] FLIP-335: Removing Flink's Time classes as part of Flink 2.0

2023-07-17 Thread Chesnay Schepler

I don't understand this bit:

"One minor Scala change is necessary, though: We need to touch the Scala 
implementation of the Pattern class (in flink-cep). Pattern requires a 
new method which needs to be implemented in the Scala Pattern class as 
well to comply with PatternScalaAPICompletenessTest."

FLIP-265 states that all Scala APIs will be removed, which should 
also cover CEP.
On 13/07/2023 12:08, Matthias Pohl wrote:

The 2.0 feature list includes the removal of Flink's Time classes in favor
of the JDKs java.time.Duration class. There was already a discussion about
it in [1] and FLINK-14068 [2] was created as a consequence of this
discussion.

I started working on marking the APIs as deprecated in FLINK-32570 [3]
where Chesnay raised a fair point that there isn't a FLIP, yet, to
formalize this public API change. Therefore, I went ahead and created
FLIP-335 [4] to have this change properly documented.

I'm not 100% sure whether there are better ways of checking whether we're
covering everything Public API-related. There are even classes which I
think might be user-facing but are not labeled accordingly (e.g.
flink-cep). But I don't have the proper knowledge in these parts of the
code. Therefore, I would propose marking these methods as deprecated,
anyway, to be on the safe side.

I'm open to any suggestions on improving the Test Plan of this change.

I'm looking forward to feedback on this FLIP.

Best,
Matthias

[1]https://lists.apache.org/thread/76yywnwf3lk8qn4dby0vz7yoqx7f7pkj
[2]https://issues.apache.org/jira/browse/FLINK-14068
[3]https://issues.apache.org/jira/browse/FLINK-32570
[4]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-335%3A+Removing+Flink%27s+Time+classes



Re: [DISCUSS] FLIP 333 - Redesign Apache Flink website

2023-07-17 Thread Chesnay Schepler

+1

On 16/07/2023 08:10, Mohan, Deepthi wrote:

@Chesnay

Thank you for your feedback.

An important takeaway from the previous discussion [1] and your feedback was to 
keep the design and text/diagram changes separate, as each change for text and 
diagrams likely requires deeper discussion. Therefore, as a first step I am 
proposing only UX changes, with minimal text changes, for the pages mentioned in 
the FLIP.

The feedback we received from customers covers both aesthetic and functional 
aspects of the website. Note that most feedback is focused only on the main 
Flink website [2].

1) New customers who are considering Flink have said about the website “there is a 
lot going on”, “looks too complicated”, “I am not sure *why* I should use this" 
and similar feedback. The proposed redesign in this FLIP helps partially address 
this category of feedback, but we may need to make the use cases and value 
proposition “pop” more than we have currently proposed in the redesign. I’d like to 
get the community’s thoughts on this.

2) On the look and feel of the website, I’ve already shared feedback prior that 
I am repeating here: “like a wiki page thrown together by developers.” 
Customers also point out other related Apache project websites: [3] and [4] as 
having “modern” user design. The proposed redesign in this FLIP will help 
address this feedback. Modernizing the look and feel of the website will appeal 
to customers who are used to what they encounter on other contemporary websites.

3) New and existing Flink developers have said "I am not sure what the diagram 
is supposed to depict" - referencing the main diagram on [2] - and have said that 
the website lacks useful graphics and colors. Apart from removing the diagram 
on the main page [2], the current FLIP does not propose major changes to diagrams 
in the rest of the website; we can discuss them separately as they become 
available. I'd like to keep the FLIP focused only on the website redesign.

Ultimately, to Chesnay’s point in the earlier discussion in [1], I do not want 
to boil the ocean with all the changes at once. In this FLIP, my proposal is to 
first work on the UX design as that gives us a good starting point. We can use 
it as a framework to make iterative changes and enhancements to diagrams and 
the actual website content incrementally.

I’ve added a few more screenshots of additional pages to the FLIP that will 
give you a clearer picture of the proposed changes for the main page, What is 
Flink [Architecture, Applications, and Operations] pages.

And finally, I am not proposing any tooling changes.

[1] https://lists.apache.org/thread/c3pt00cf77lrtgt242p26lgp9l2z5yc8
[2]https://flink.apache.org/
[3] https://spark.apache.org/
[4] https://kafka.apache.org/

On 7/13/23, 6:25 AM, Chesnay Schepler <ches...@apache.org> wrote:


On 13/07/2023 08:07, Mohan, Deepthi wrote:

However, even these developers when explicitly asked in our conversations often 
comment that the website could do with a redesign


Can you go into more detail as to their specific concerns? Are there
functional problems with the page, or is this just a matter of "I don't
like the way it looks"?


What had they trouble with? Which information was
missing/unnecessary/too hard to find?


The FLIP states that "we want to modernize the website so that new and
existing users can easily find information to understand what Flink is,
the primary use cases where Flink is useful, and clearly understand its
value proposition."


From the mock-ups I don't really see how these stated goals are
achieved. It mostly looks like a fresh coat of paint, with a compressed
nav bar (which does reduce how much information and links we throw at
people at once (which isn't necessarily bad)).


Can you go into more detail w.r.t. to the proposed
text/presentation/diagram changes?


I assume you are not proposing any tooling changes?









[DISCUSS][2.0] FLIP-338: Remove terminationMode query parameter from job cancellation REST endpoint

2023-07-13 Thread Chesnay Schepler

Hello,

The job cancellation REST endpoint has a terminationMode query 
parameter, which in the past could be set to either CANCEL or STOP, but 
nowadays the job stop endpoint has subsumed the STOP functionality.


Since then, the cancel endpoint has rejected requests that specified STOP.

I propose to finally remove this parameter, as it currently serves no 
function.


https://cwiki.apache.org/confluence/display/FLINK/FLIP-338%3A+Remove+terminationMode+query+parameter+from+job+cancellation+REST+endpoint 




Regards,

Chesnay


[DISCUSS][2.0] FLIP-337: Remove JarRequestBody#programArgs

2023-07-13 Thread Chesnay Schepler

Hello,

The request body for the jar run/plan REST endpoints accepts program 
arguments as a string (programArgs) or a list of strings 
(programArgsList). The latter was introduced as we kept running into 
issues with splitting the string into individual arguments.

Ideally we'd force users to use the list argument, and we can simplify the 
codebase if there is only one way to pass arguments.


As such I propose to remove the programArgs field from the request body.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424796
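
For illustration, the two redundant ways the request body accepts
arguments today (values are made up):

{code}
{
  "entryClass": "org.example.WordCount",
  "programArgs": "--input /data/in --output /data/out",
  "programArgsList": ["--input", "/data/in", "--output", "/data/out"]
}
{code}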


Regards,

Chesnay


Re: [DISCUSS] FLIP 333 - Redesign Apache Flink website

2023-07-13 Thread Chesnay Schepler

On 13/07/2023 08:07, Mohan, Deepthi wrote:

However, even these developers when explicitly asked in our conversations often 
comment that the website could do with a redesign


Can you go into more detail as to their specific concerns? Are there 
functional problems with the page, or is this just a matter of "I don't 
like the way it looks"?


What had they trouble with? Which information was 
missing/unnecessary/too hard to find?


The FLIP states that "we want to modernize the website so that new and 
existing users can easily find information to understand what Flink is, 
the primary use cases where Flink is useful, and clearly understand its 
value proposition."


From the mock-ups I don't really see how these stated goals are 
achieved. It mostly looks like a fresh coat of paint, with a compressed 
nav bar (which does reduce how much information and links we throw at 
people at once (which isn't necessarily bad)).


Can you go into more detail w.r.t. to the proposed 
text/presentation/diagram changes?


I assume you are not proposing any tooling changes?



[DISCUSS][2.0] FLIP-336: Remove "now" timestamp field from REST responses

2023-07-13 Thread Chesnay Schepler

Hello,

Several REST responses contain a timestamp field of the current time.

There is no known use-case for said timestamp, it makes caching of 
responses technically sketchy (since the response differs on each call), 
and it complicates testing since the timestamp field can't be easily 
predicted.


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424789


Regards,

Chesnay



Re: [DISCUSS] Deprecating Public API for 2.0 requiring FLIPs

2023-07-13 Thread Chesnay Schepler

The issue isn't avoiding 1.19.

The issue is that if things aren't deprecated in 1.18 then for every 
breaking change we have to start discussing exemptions from the API 
deprecation process, which stipulates that all Public APIs must be 
deprecated for at least 2 minor releases before they can be removed 
(which is now unsurprisingly backfiring on us).


So if something isn't deprecated in 1.18 then either:
- we delay 2.0 by at least 1 release cycle
- we effectively ignore the newly agreed upon deprecation process for 2.0
- we circumvent the newly agreed upon deprecation process by creating 2 
minor releases in the same time-frame that we'd usually create 1 release in.


None of these options are great.

On 13/07/2023 14:03, Matthias Pohl wrote:

Jing brought up a question in the FLIP-335 discussion thread [1] which I
want to move into a dedicated discussion thread as it's a bit more general:
How do we handle the deprecation process of Public APIs for Flink 2.0?

I just have a related question: Do we need to create a FLIP each time
when we want to deprecate any classes?


The community defined the requirements of a FLIP [2] in the following way:
- Any major new feature, subsystem, or piece of functionality
- Any change that impacts the public interfaces of the project

public interfaces in this sense are defined as:
- DataStream and DataSet API, including classes related to that, such as
StreamExecutionEnvironment
- Classes marked with the @Public annotation
- On-disk binary formats, such as checkpoints/savepoints
- User-facing scripts/command-line tools, i.e. bin/flink, Yarn scripts,
Mesos scripts
- Configuration settings
- Exposed monitoring information

I think that this makes sense. There should be a proper discussion on the
best approach to change public APIs. Additionally, the FLIP should be a way
to document the changes in the discussion process towards the change
transparently.

In contrast, the impression I have is that we're trying to push all the
deprecation work into 1.18 (which has its feature freeze date by the end of
next week) to avoid having an additional 1.19 minor release. (Correct me if
I'm wrong here but that's the impression I'm getting from the ML threads
I've seen so far.)

I have some concerns on the Flink 2.0 development in this regard. Zhu Zhu
[4] and Chesnay [5] shared similar concerns in the thread about the 2.0
must-have work items.

Considering that quite a few (or some; I haven't checked in detail, to be
honest) of the changes for 2.0 should require a FLIP, and that there are
still some hanging items [6] (Jira issues which are annotated for Flink 2.0
but haven't been properly checked yet): shouldn't we avoid pushing everything
into 1.18? Instead, we should follow the required process properly and
might plan for another 1.19 minor release.

I'm curious how other contributors feel here and sorry in case I have
misinterpreted the ongoing discussions.

Best,
Matthias

[1] https://lists.apache.org/thread/48ysrg1rrtl8s1twg9wmx35l201hnc2w
[2]
https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals#FlinkImprovementProposals-Whatisconsidereda%22majorchange%22thatneedsaFLIP
?
[3] https://lists.apache.org/thread/r0y9syc6k5nmcxvnd0hj33htdpdj9k6m
[4] https://lists.apache.org/thread/45xm348jr8n6s89jldntv5z3t13hdbn8
[5] https://lists.apache.org/thread/98wgqrx0sycpskvgpydvkywsoxt0fkc6
[6] https://lists.apache.org/thread/77hj39ls3kxvx2qd6o09hq1ndtn6hg4y





[jira] [Created] (FLINK-32571) Prebuild HBase testing docker image

2023-07-10 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32571:


 Summary: Prebuild HBase testing docker image
 Key: FLINK-32571
 URL: https://issues.apache.org/jira/browse/FLINK-32571
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / HBase
Reporter: Chesnay Schepler
 Fix For: hbase-3.0.0


We currently build an HBase docker image on-demand during testing. 
We can improve reliability and testing times by building this image ahead of 
time, as the only parameter is the HBase version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Release 2.0 Work Items

2023-07-10 Thread Chesnay Schepler

I'd add a +1 to reworking configs, as suggested by @Jark and @Chesnay,
especially the types. We have various configs that encode Time / MemorySize
that are Long instead!

Regards,
Hong




On 29 Jun 2023, at 16:19, Yuan Mei wrote:



Thanks for driving this effort, Xintong!

To Chesnay:

> I'm curious as to why the "Disaggregated State Management" item is marked
> as a must-have; will it require changes that break something?
> What prevents it from being added in 2.1?

As to "Disaggregated State Management":

We plan to provide a new type of state backend to support DFS as primary
storage. To achieve this, we at least need to include two parts of
amendments (not entirely sure yet, since we are still in the designing and
prototype phase):

1. Statebackend changes
2. State access changes

Not all of the interfaces related are `@Internal`. Some of the interfaces,
like `StateBackend`, are `@PublicEvolving`. So, you are right in the sense
that "Disaggregated State Management" itself probably does not need to be a
"Must Have". But I was hoping that changes related to public APIs can be
finalized and merged in Flink 2.0 (I will fix the wiki accordingly).

I also agree with Jark that 2.0 is a good chance to rework the default
values of configurations.

Best
Yuan


On Thu, Jun 29, 2023 at 8:43 PM Chesnay Schepler <ches...@apache.org> wrote:

Something else configuration-related is that there are a bunch of
options where the type isn't quite correct (e.g., a String where it
could be an enum, a String where it should be an int or something).
Could do a pass over those as well.

On 29/06/2023 13:50, Jark Wu wrote:

Hi,

I think one more thing we need to consider to do in 2.0 is changing
the default value of configuration to improve the out-of-box user
experience.

Currently, in order to run a Flink job, users may need to set
a bunch of configurations, such as minibatch, checkpoint interval,
exactly-once, incremental-checkpoint, etc. It's very verbose and hard
to use for beginners. Most of them can have a universally applicable
value. Because changing the default value is a breaking change, I think
it's worth considering changing them in 2.0.

What do you think?

Best,
Jark
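For illustration, these are the kinds of settings users currently tend to set by hand before a job behaves well (real configuration keys; the values are merely plausible examples, not proposed defaults):

{code}
# excerpt from a typical hand-tuned flink-conf.yaml today
execution.checkpointing.interval: 3min
execution.checkpointing.mode: EXACTLY_ONCE
state.backend.incremental: true
table.exec.mini-batch.enabled: true
table.exec.mini-batch.allow-latency: 1s
table.exec.mini-batch.size: 1000
{code}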


On Wed, 28 Jun 2023 at 14:10, Sergey Nuyanzin  wrote:

Hi Chesnay

> "Move Calcite rules from Scala to Java": I would hope that this would
> be an entirely internal change, and could thus be an incremental
> process independent of major releases. What is the actual scale of
> this item; how much are we actually re-writing?

Thanks for asking. Yes, you're right, that should be an internal change.
Yeah, I was also thinking about an incremental change (rule by rule, or
a reasonably small group of rules). And yes, this could be an activity
independent of major releases.

The problem is actually for children of RelOptRule. Currently I see 60+
such rules (in Scala) using the mentioned deprecated api. There are also
children of ConverterRule (50+) which do not have such issues. Maybe it
could be considered as the next step to have all the rules in Java.

On Tue, Jun 27, 2023 at 1:34 PM Xintong Song  wrote:

Hi Alex & Gyula,

> By compatibility discussion do you mean the "[DISCUSS] FLIP-321:
> Introduce an API deprecation process" thread [1]?

Yes, I meant the FLIP-321 discussion. I just noticed I pasted the wrong
url in my previous email. Sorry for the mistake.

> I am also curious to know if the rationale behind this new API has
> been previously discussed on the mailing list. Do we have a list of
> shortcomings in the current DataStream API that it tries to resolve?
> How does the current ProcessFunction functionality fit into the
> picture? Will it be kept as is or subsumed by new API?

> I don't think we should create a replacement for the DataStream API
> unless we have a very good reason to do so and with a proper
> discussion about this as Alex said.

The ProcessFunction API which is targeting to replace DataStream API is
still a proposal, not a decision. Sorry for the confusion, I should have
been more careful with my words, not giving the impression that this is
something we'll do anyway.

There will be a FLIP describing the motivations and designs in detail,
for the community to discuss and vote on. We are still working on it.
TBH, this is not trivial and we would need more time on it.

Just to quickly share some backgrounds:

- We see quite some problems with the current DataStream APIs
   - Users are working with concrete classes rather than interfaces,
     which means
   - Users can access methods that are designed to be used by internal
     classes, eve

Re: [DISCUSS] Flink and Externalized connectors leads to block and circular dependency problems

2023-07-05 Thread Chesnay Schepler
Have there ever been thoughts about changing flink-python's connector 
setup to use the Table API connectors underneath?


The wrapping of connectors is a bit of a maintenance nightmare and 
doesn't really work with external/custom connectors.


On 04/07/2023 13:35, Dian Fu wrote:

Thanks Ran Tao for proposing this discussion and Martijn for sharing
the thought.


  While flink-python now fails the CI, it shouldn't actually depend on the
externalized connectors. I'm not sure what PyFlink does with it, but if it
belongs to the connector code,

For each DataStream connector, there is a corresponding Python wrapper
and also some test cases in PyFlink. In theory, we should move that
wrapper into each connector repository. In the past, we have not done
that when externalizing the connectors since it may introduce some
burden when releasing since it means that we have to publish each
connector to PyPI separately.

To resolve this problem, I guess we can move the connector support in
PyFlink into the external connector repository.

Regards,
Dian


On Mon, Jul 3, 2023 at 11:08 PM Ran Tao  wrote:

@Martijn
thanks for clear explanations.

If we follow the line you specified (connectors shouldn't rely on
dependencies that may or may not be available in Flink itself), it seems
that we should explicitly add a dependency in the connector pom if we
need it (such as commons-io, commons-collections), and bundle it in the
sql-connector uber jar.

Then there is only one thing left that we need to make flink-python test
not depend on the released flink-connector.
Maybe we should check it out and decouple it like you suggested.

Best Regards,
Ran Tao
https://github.com/chucheng92


Martijn Visser  wrote on Mon, Jul 3, 2023 at 22:06:


Hi Ran Tao,

Thanks for opening this topic. I think there's a couple of things at hand:
1. Connectors shouldn't rely on dependencies that may or may not be
available in Flink itself, like we've seen with flink-shaded. That avoids a
tight coupling between Flink and connectors, which is exactly what we try
to avoid.
2. When following that line, that would also be applicable for things like
commons-collections and commons-io. If a connector wants to use them, it
should make sure that it bundles those artifacts itself.
3. While flink-python now fails the CI, it shouldn't actually depend on the
externalized connectors. I'm not sure what PyFlink does with it, but if it
belongs to the connector code, that code should also be moved to the
individual connector repo. If it's just a generic test, we could consider
creating a generic test against released connector versions to determine
compatibility.

I'm curious about the opinions of others as well.

Best regards,

Martijn

On Mon, Jul 3, 2023 at 3:37 PM Ran Tao  wrote:


I have an issue here that needs to upgrade commons-collections[1] (this is
an example), but PR ci fails because flink-python test cases depend on
flink-sql-connector-kafka, but kafka-sql-connector is a small jar that does
not include this dependency, so flink ci causes an exception[2]. My current
solution is [3]. But even if this PR is done, the upgrade of flink still
requires kafka-connector to be released.

This issue leads to deeper problems. Although the connectors have been
externalized, many UTs of flink-python depend on these connectors, and a
basic agreement of externalized connectors is that other dependencies
cannot be introduced explicitly, which means the externalized connectors
use dependencies inherited from flink. In this way, when flink main
upgrades some dependencies, it is easy to fail when executing flink-python
test cases, because flink no longer has this class, and the connector does
not contain it. It's a circular problem.

Unless the connector self-consistently includes all dependencies, which is
uncontrollable (only a few connectors include all jars in the shade phase).

In short, the current flink-python module's dependencies on the connectors
lead to an incomplete process of externalization and decoupling, which
will lead to circular dependencies when flink upgrades or changes some
dependencies.

I don't know if I made it clear. I hope to get everyone's opinions on what
better solutions we should adopt for similar problems in the future.

[1] https://issues.apache.org/jira/browse/FLINK-30274
[2]



https://user-images.githubusercontent.com/11287509/250120404-d12b60f4-7ff3-457e-a2c4-8cd415bb5ca2.png




https://user-images.githubusercontent.com/11287509/250120522-6b096a4f-83f0-4287-b7ad-d46b9371de4c.png

[3] https://github.com/apache/flink-connector-kafka/pull/38

Best Regards,
Ran Tao
https://github.com/chucheng92





[jira] [Created] (FLINK-32544) PythonFunctionFactoryTest fails on Java 17

2023-07-05 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32544:


 Summary: PythonFunctionFactoryTest fails on Java 17
 Key: FLINK-32544
 URL: https://issues.apache.org/jira/browse/FLINK-32544
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python, Legacy Components / Flink on Tez
Affects Versions: 1.18.0
Reporter: Chesnay Schepler


https://dev.azure.com/chesnay/flink/_build/results?buildId=3676=logs=fba17979-6d2e-591d-72f1-97cf42797c11=727942b6-6137-54f7-1ef9-e66e706ea068

{code}
Jul 05 10:17:23 Exception in thread "main" 
java.lang.reflect.InaccessibleObjectException: Unable to make field private 
static java.util.IdentityHashMap java.lang.ApplicationShutdownHooks.hooks 
accessible: module java.base does not "opens java.lang" to unnamed module 
@1880a322
Jul 05 10:17:23 at 
java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
Jul 05 10:17:23 at 
java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
Jul 05 10:17:23 at 
java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
Jul 05 10:17:23 at 
java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
Jul 05 10:17:23 at 
org.apache.flink.client.python.PythonFunctionFactoryTest.closeStartedPythonProcess(PythonFunctionFactoryTest.java:115)
Jul 05 10:17:23 at 
org.apache.flink.client.python.PythonFunctionFactoryTest.cleanEnvironment(PythonFunctionFactoryTest.java:79)
Jul 05 10:17:23 at 
org.apache.flink.client.python.PythonFunctionFactoryTest.main(PythonFunctionFactoryTest.java:52)
{code}

Side-notes:
* maybe re-evaluate if the test could be run through maven now
* The shutdown hooks business is quite sketchy, and AFAICT would be unnecessary 
if the test were an ITCase
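For reference, the conventional workaround for this class of error is to open the package named in the exception to unnamed modules (whether the test should instead avoid the reflection is the open question above):

{code}
# JVM argument lifting the InaccessibleObjectException above; where to
# apply it (surefire argLine, env.java.opts.all, ...) is left open here.
--add-opens java.base/java.lang=ALL-UNNAMED
{code}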



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISUCSS] Deprecate multiple APIs in 1.18

2023-07-05 Thread Chesnay Schepler

There's a whole bunch of metric APIs that would need to be deprecated.
That is of course if the metric FLIPs are being accepted.

Which makes me wonder if we aren't doing things the wrong way around; 
shouldn't the decision to deprecate an API be part of the FLIP discussion?


On 05/07/2023 07:39, Xintong Song wrote:

Thanks all for the discussion.

It seems to me there's a consensus on marking the following as deprecated
in 1.18:
- DataSet API
- SourceFunction
- Queryable State
- All Scala APIs

More time is needed for deprecating SinkFunction.

I'll leave this discussion open for a few more days. And if there's no
objections, I'll create JIRA tickets accordingly.

Best,

Xintong



On Wed, Jul 5, 2023 at 1:34 PM Xintong Song  wrote:


Thanks for the input, Jing. I'd also be +1 for option 1.

Best,

Xintong



On Mon, Jul 3, 2023 at 7:20 PM Jing Ge  wrote:


Hi Xingtong,

Option 1, secure plan would be:

1. graduate kafka, File, JDBC connectors to @Public
2. graduate SinkV2 to @Public
3. remove SinkFunction.

Option 2, risky plan but at a fast pace:

1. graduate SinkV2 to @Public and expecting more maintenance effort since
there are many known and unsolved issues.
2. remove SinkFunction.
3. It depends on the connectors' contributors whether connectors can
upgrade to Flink 2.0, since we moved forward with SinkV2 API without
taking
care of implementations in external connectors.

I am ok with both of them and personally prefer option 1.

Best regards,
Jing


On Fri, Jun 30, 2023 at 3:41 AM Xintong Song 
wrote:


I see. Thanks for the explanation. I may have not looked into this

deeply

enough, and would trust the decision from you and the community members

who

participated in the discussion & vote.

Best,

Xintong



On Thu, Jun 29, 2023 at 10:28 PM Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:


However, I'm not sure about 2.

I am not aware of a bylaw that states the specific requirements in

order

to

mark something as @Deprecated. My understanding from the discussion

and

the

vote was that the community recognizes the necessity to make it

explicit

that
the usage of the SourceFunction API is discouraged. This can actually
stimulate
authors of connectors that rely on this very specific and non-baseline
functionality to contribute extensions to the new Source API

themselves

in

order to
close the gap. ExternallyInducedSource, for instance, was driven by

Pravega

to
begin with, since it was only needed for their purposes [1]. We are

not

removing
anything - until 2.0 everything will continue to work and we can work

on

resolving the limitations until then, I personally don't see a big

issue

here.


Do you think it is feasible to resolve them by the feature freeze

date

of

1.18?
No, these are rather complex additions that would probably require

FLIP(s).

[1]



https://flink.apache.org/2022/01/20/pravega-flink-connector-101/#checkpoint-integration

On Thu, 29 Jun 2023 at 14:25, Xintong Song 

wrote:

Thanks for the explanation, Alex.

Not blocking the deprecation on 1 & 3 makes sense to me. However,

I'm

not

sure about 2.

It sounds to me that, without FLINK-28051 & FLINK-28054, some of the
connectors cannot migrate to the new Source API, or at least further
investigation is needed to understand the situation. If this is the

case,

we probably should not deprecate the API until these issues are

resolved.

Do you think it is feasible to resolve them by the feature freeze

date

of

1.18?

Best,

Xintong



On Thu, Jun 29, 2023 at 8:02 PM Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:


@Xintong
The original discussion [1] and vote [2] converged on the idea

that

it

is

better
to make it clear to the users that they should stop using

SourceFunction

since it
is going away. The longer we do not have this indication, the more

user

implementations will be based on it and the more pain will be

induced

when

we
finally drop it. Users now have an alternative API that they

should

use

and

which
is fully functional, from that perspective nothing blocks marking

it

@Deprecated.
As for the remaining work items - there are primarily three kinds:

1. Where Flink internally uses SourceFunction, without exposing

this

fact

to the
outside world:
- FLINK-28050 [3]
- FLINK-28229 [4]
- FLINK-28048 [5]

2. Very specific edge cases that might not be covered by the

Source

API

as

is:
- FLINK-28054 [6]
- FLINK-28051 [7]

3. Usability improvements - something that was easily doable with
SourceFunction,
but requires deep knowledge of the new, significantly more

complex,

Source API
to achieve:
- FLINK-28056 [8]

In my mind, none of those are blockers for proceeding with adding

the

@Deprecated
annotation:
(1) is a simple case of encapsulation, internals should not

concern

the

API

users
(2) is really only relevant for "exotic" use cases. Does not mean

we

should

not
consider those, but since it is irrelevant for 99.9% of the

users, 

[jira] [Created] (FLINK-32536) Python tests fail with Arrow DirectBuffer exception

2023-07-04 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32536:


 Summary: Python tests fail with Arrow DirectBuffer exception
 Key: FLINK-32536
 URL: https://issues.apache.org/jira/browse/FLINK-32536
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python, Tests
Affects Versions: 1.18.0
Reporter: Chesnay Schepler


https://dev.azure.com/chesnay/flink/_build/results?buildId=3674=logs=fba17979-6d2e-591d-72f1-97cf42797c11=727942b6-6137-54f7-1ef9-e66e706ea068

{code}
2023-07-04T12:54:15.5296754Z Jul 04 12:54:15 E   
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.flink.table.runtime.arrow.ArrowUtils.collectAsPandasDataFrame.
2023-07-04T12:54:15.5299579Z Jul 04 12:54:15 E   : 
java.lang.RuntimeException: Arrow depends on DirectByteBuffer.(long, int) 
which is not available. Please set the system property 
'io.netty.tryReflectionSetAccessible' to 'true'.
2023-07-04T12:54:15.5302307Z Jul 04 12:54:15 E  at 
org.apache.flink.table.runtime.arrow.ArrowUtils.checkArrowUsable(ArrowUtils.java:184)
2023-07-04T12:54:15.5302859Z Jul 04 12:54:15 E  at 
org.apache.flink.table.runtime.arrow.ArrowUtils.collectAsPandasDataFrame(ArrowUtils.java:546)
2023-07-04T12:54:15.5303177Z Jul 04 12:54:15 E  at 
jdk.internal.reflect.GeneratedMethodAccessor287.invoke(Unknown Source)
2023-07-04T12:54:15.5303515Z Jul 04 12:54:15 E  at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2023-07-04T12:54:15.5303929Z Jul 04 12:54:15 E  at 
java.base/java.lang.reflect.Method.invoke(Method.java:568)
2023-07-04T12:54:15.5307338Z Jul 04 12:54:15 E  at 
org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
2023-07-04T12:54:15.5309888Z Jul 04 12:54:15 E  at 
org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
2023-07-04T12:54:15.5310306Z Jul 04 12:54:15 E  at 
org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
2023-07-04T12:54:15.5337220Z Jul 04 12:54:15 E  at 
org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
2023-07-04T12:54:15.5341859Z Jul 04 12:54:15 E  at 
org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
2023-07-04T12:54:15.5342363Z Jul 04 12:54:15 E  at 
org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
2023-07-04T12:54:15.5344866Z Jul 04 12:54:15 E  at 
java.base/java.lang.Thread.run(Thread.java:833)
{code}
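The exception message itself names the missing knob; as a sketch, the JVM settings involved would be the following (how they get injected into the test JVMs is an assumption left open here):

{code}
# System property requested by the error message:
-Dio.netty.tryReflectionSetAccessible=true
# Arrow's direct-memory access additionally tends to require on JDK 17:
--add-opens java.base/java.nio=ALL-UNNAMED
{code}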

{code}
2023-07-04T12:54:15.5663559Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::BatchPandasConversionTests::test_empty_to_pandas
2023-07-04T12:54:15.5663891Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::BatchPandasConversionTests::test_from_pandas
2023-07-04T12:54:15.5664299Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::BatchPandasConversionTests::test_to_pandas
2023-07-04T12:54:15.5664655Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::BatchPandasConversionTests::test_to_pandas_for_retract_table
2023-07-04T12:54:15.5665003Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::StreamPandasConversionTests::test_empty_to_pandas
2023-07-04T12:54:15.5665360Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::StreamPandasConversionTests::test_from_pandas
2023-07-04T12:54:15.5665704Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::StreamPandasConversionTests::test_to_pandas
2023-07-04T12:54:15.5666045Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::StreamPandasConversionTests::test_to_pandas_for_retract_table
2023-07-04T12:54:15.5666415Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_conversion.py::StreamPandasConversionTests::test_to_pandas_with_event_time
2023-07-04T12:54:15.5666840Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_udaf.py::BatchPandasUDAFITTests::test_group_aggregate_function
2023-07-04T12:54:15.5667189Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_udaf.py::BatchPandasUDAFITTests::test_group_aggregate_with_aux_group
2023-07-04T12:54:15.5667526Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_udaf.py::BatchPandasUDAFITTests::test_group_aggregate_without_keys
2023-07-04T12:54:15.5667882Z Jul 04 12:54:15 FAILED 
pyflink/table/tests/test_pandas_udaf.py::BatchPandasUDAFITTests::test_over_window_aggregate_function
2023-07-04T12:54:15.5668242Z Jul 04 12:54:15 FAILED

Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

2023-07-04 Thread Chesnay Schepler

I think the cooldown still makes sense with FLIP-291 APIs.

If you want to fully control the parallelism and rescale timings then 
you can set the cooldown to zero.
If you don't want complete control but just the target parallelism from 
time to time, then the cooldown within Flink still makes sense imo 
because it can account for all scale up operations, which an external 
scaler would struggle with (because it doesn't actually know when a 
scale up happened).


> Wouldn't a simple case where you add a new TM and remove it before
> the max interval is reached (so there is nothing to do) result in an
> unnecessary job restart?

Depends on how you implement it. If you ignore all of shouldRescale,
yes, but you shouldn't do that in the first place.

Within shouldRescale() the SlotAllocator wouldn't provide us with a new
parallelism alternative and we wouldn't ask the RescaleController, which
is the bit we actually want to override.


On 04/07/2023 09:16, David Morávek wrote:
> They will struggle if they add new resources and nothing happens for
> 5 minutes.

The same applies if they start playing with FLIP-291 APIs. I'm
wondering if the cooldown makes sense there since it was the user's
deliberate choice to push new requirements.


Best,
D.

On Tue, Jul 4, 2023 at 9:11 AM David Morávek  wrote:

The FLIP reads sane to me. I'm unsure about the default values,
though; 5 minutes of wait time between rescales feels rather
strict, and we should rethink it to provide a better
out-of-the-box experience.

I'd focus on newcomers trying AS / Reactive Mode out. They will
struggle if they add new resources and nothing happens for 5
minutes. I'd suggest defaulting to
/jobmanager.adaptive-scheduler.resource-stabilization-timeout/ (which
defaults to 10s).

I'm still struggling to grasp max internal (force rescale).
Ignoring `AdaptiveScheduler#shouldRescale()` condition seems
rather dangerous. Wouldn't a simple case where you add a new TM
and remove it before the max interval is reached (so there is
nothing to do) result in an unnecessary job restart?

Best,
D.

On Thu, Jun 29, 2023 at 3:43 PM Etienne Chauchot
 wrote:

Thanks Chesnay for your feedback. I have updated the FLIP.
I'll start a
vote thread.

Best

Etienne

Le 28/06/2023 à 11:49, Chesnay Schepler a écrit :
> > we should schedule a check that will rescale if
> > min-parallelism-increase is met. Then, what is the use of
> > scaling-interval.max timeout in that context?
>
> To force a rescale if min-parallelism-increase is not met
(but we
> could still run above the current parallelism).
>
> min-parallelism-increase is a trade-off between the cost of
rescaling
> vs the performance benefit of the parallelism increase. Over
time the
> balance tips more and more in favor of the parallelism
increase, hence
> we should eventually rescale anyway even if the minimum
isn't met, or
> at least give users the option to do so.
>
> > I meant the opposite: not having only the cooldown but having only
> > the stabilization time. I must have missed something because what I
> > wonder is: if every rescale entails a restart of the pipeline and
> > every restart entails passing in waiting for resources state, then why
> > introduce a cooldown when there is already at each rescale a stable
> > resource timeout ?
>
> It is technically correct that the stable resource timeout can be used
> to limit the number of rescale operations per interval, however during
> that time the job isn't running, in contrast to the cooldown.
>
> Having both just gives you a lot more flexibility.
> "I want at most 1 rescale operation per hour, and wait at most 1
> minute for resource to stabilize when a rescale happens".
> You can't express this with only one of the options.
>
> On 20/06/2023 14:41, Etienne Chauchot wrote:
>> Hi Chesnay,
>>
>> Thanks for your feedback. Comments inline
>>
>> On 16/06/2023 at 17:24, Chesnay Schepler wrote:
>>> 1) Options specific to the adaptive scheduler should start with
>>> "jobmanager.adaptive-scheduler".
>>
>> ok
>>
>>> 2)
>>> There isn't /really /a notion of a "scaling event". The scheduler is
>>> informed about new/lost slots and job failur

Re: [VOTE] FLIP-309: Support using larger checkpointing interval when source is processing backlog

2023-06-30 Thread Chesnay Schepler

-1 (binding)

I feel like this FLIP needs a bit more time in the oven.

It seems to be very light on actual details; you can summarize the 
entire changes section as "The enumerator calls this method and then 
another checkpoint interval is used."
I would love to know how this is wired into the triggering of 
checkpoints, what the behavior is with multiple sources, if a source is 
allowed to set this at any point or just once, what the semantics of a 
"backlog" are for sources other than Hybrid / MySQL CDC (because catching 
up after a failover is a common enough pattern), whether/how this 
information could also be interesting for the scheduler (because we may 
want to avoid rescalings during the backlog processing), and whether the 
backlog processing should be exposed as a metric for users (or for that 
matter, how we inform users at all that we're using a different 
checkpoint interval at this time).


Following my discussion with Piotr and Stefan I'm also not sure how 
future-proof the proposed API really is. Already I feel like the name 
"setIsProcessingBacklog()" is rather specific for the state of the 
source (making it technically wrong to call it in other situations like 
being backpressured (again, depending on what "backlog processing" even 
means)), while not being clear on what this actually results in. The 
javadocs don't even mention the checkpointing interval at all, but 
instead reference downstream optimizations that, afaict, aren't 
mentioned in the FLIP.


I'd be very hesitant with marking it as public from the get-go. Ideally 
it would maybe even be added as a separate interface (somehow).


On 30/06/2023 16:37, Piotr Nowojski wrote:

Hey,

Sorry to disturb this voting, but after discussing this thoroughly with
Chesnay and Stefan Richter I have to vote:
  -1 (binding)
mainly to suspend the current voting thread. Please take a look at my mail
at dev mailing list.

Best,
Piotrek

On Thu, 29 Jun 2023 at 14:59, feng xiangyu  wrote:


+1 (non-binding)

Best,
Xiangyu

yuxia  wrote on Thu, Jun 29, 2023 at 20:44:


+1 (binding)

Best regards,
Yuxia

- Original Message -
From: "Yuepeng Pan" 
To: "dev" 
Sent: Thursday, June 29, 2023 8:21:14 PM
Subject: Re: [VOTE] FLIP-309: Support using larger checkpointing interval when
source is processing backlog

+1  non-binding.


Best.
Yuepeng Pan


 Replied Message 
| From | Jingsong Li |
| Date | 06/29/2023 13:25 |
| To | dev |
| Cc | flink.zhouyunfeng |
| Subject | Re: [VOTE] FLIP-309: Support using larger checkpointing
interval when source is processing backlog |
+1 binding

On Thu, Jun 29, 2023 at 11:03 AM Dong Lin  wrote:

Hi all,

We would like to start the vote for FLIP-309: Support using larger
checkpointing interval when source is processing backlog [1]. This FLIP

was

discussed in this thread [2].

Flink 1.18 release will feature freeze on July 11. We hope to make this
feature available in Flink 1.18.

The vote will be open until at least July 4th (at least 72 hours),

following

the consensus voting process.

Cheers,
Yunfeng and Dong

[1]


https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog

[2] https://lists.apache.org/thread/l1l7f30h7zldjp6ow97y70dcthx7tl37





[DISCUSS] Ease docker image usage across patch versions

2023-06-29 Thread Chesnay Schepler

Hello,

when using our docker image users can enable certain plugins via an 
environment variable, listing all plugins they'd like to enable.
This is a list of plugin jar names, which contain the full Flink 
version, like 1.16.1.


This implies that users always need to be aware of the patch version 
that they are using (because the plugin name has to match exactly); 
which then means that using the docker tags without a patch version 
becomes rather tricky, because the actual patch version encoded into the 
plugin directory may suddenly change.


In https://github.com/apache/flink-docker/pull/158 a proposal was made 
to add symlinks without the patch version for all plugin directories.


This would work, but I'm not sure if it's the right way to go about this.
Alternatively we could

 * allow globbing patterns to select plugins (e.g., flink-s3-fs*; see the
   sketch below)
   * this adds some complexity to ensure only 1 jar matches the pattern
 * stop encoding the Flink version altogether into the various jars in
   the distribution
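For reference, today's usage looks roughly like this (ENABLE_BUILT_IN_PLUGINS is the existing mechanism; the version numbers are just examples):

{code}
# Today: the plugin jar name must embed the exact patch version baked
# into the image, even when the tag itself doesn't say which one that is.
docker run \
  -e ENABLE_BUILT_IN_PLUGINS="flink-s3-fs-hadoop-1.16.1.jar" \
  flink:1.16 jobmanager

# With globbing (one of the alternatives above), something like this
# could work instead:
#   -e ENABLE_BUILT_IN_PLUGINS="flink-s3-fs-hadoop-*.jar"
{code}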

On the other hand, we could also stop publishing docker tags without a 
patch version altogether, because it is a potential trap for users if 
they are unaware that different processes in a cluster must all run on 
the same version and you can't just do an in-place upgrade for running jobs.



I can't really decide, and am looking for input from others.


Cheers,
Chesnay

Re: [DISCUSS] Release 2.0 Work Items

2023-06-29 Thread Chesnay Schepler
- The current DataStream API seems to be a mixture of many things, making
it hard to understand especially for newcomers. It might be better to
re-organize it into several parts (the taxonomy below is just an example;
we are still working on this):
  - The most fundamental stateful stream processing: streams,
    partitions / key, process functions, state, timeline-service
  - An extension for common batch-streaming unified functions: map,
    flatmap, filter, agg, reduce, join, etc.
  - An extension for windowing supports: window, triggering
  - An extension for event-time supports: event time, watermark
  - The extensions are like short-cuts / sugars, without which users can
    probably still achieve the same behavior by working with the
    fundamental APIs, but it would be a lot easier with the extensions
- The original plan was to do in-place refactors / changes on the
DataStream API. Some related items are listed in this doc [2] attached to
the kicking-off email [3]. Not all of the above issues are listed, because
we hadn't looked into this as deeply as now by that time.
- We proposed this as a new API rather than in-place refactors in the
2.0 work item list, because we realized the changes might be too big for
an in-place change. First having a new API then gradually retiring the
old one would help users to smoothly migrate between them.

A thorough discussion is definitely needed once the FLIP is out. And of
course it's possible that the FLIP might be rejected. Given that we are
planning for release 2.0, I just feel it would be better to bring this up
early even though the concrete plan is not yet ready.

Best,

Xintong


[1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9
[2]



https://docs.google.com/document/d/1_PMGl5RuDQGlV99_gL3y7OiRsF0DgCk91Coua6hFXhE/edit?usp=sharing

[3] https://lists.apache.org/thread/b8w5cx0qqbwzzklyn5xxf54vw9ymys1c

On Tue, Jun 27, 2023 at 5:15 PM Gyula Fóra  wrote:


Hey!

I share the same concerns mentioned above regarding the

"ProcessFunction

API".

I don't think we should create a replacement for the DataStream API

unless

we have a very good reason to do so and with a proper discussion about

this

as Alex said.

Cheers,
Gyula

On Tue, Jun 27, 2023 at 11:03 AM Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:


Hi Xintong,

By compatibility discussion do you mean the "[DISCUSS] FLIP-321:

Introduce

an API deprecation process" thread [1]?

I am also curious to know if the rationale behind this new API has

been

previously discussed on the mailing list. Do we have a list of

shortcomings

in the current DataStream API that it tries to resolve? How does the
current ProcessFunction functionality fit into the picture? Will it

be

kept

as is or subsumed by new API?

[1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9

Best,
Alex

On Mon, 26 Jun 2023 at 14:33, Xintong Song  wrote:

> The ProcessFunction API item is giving me the most headaches because
> it's very unclear what it actually entails; like is it an entirely
> separate API to DataStream (sounds like it is!) or an extension of
> DataStream. How much will it share the internals with DataStream etc.;
> how does it relate to the Table API (w.r.t. switching APIs / what
> Table API uses underneath).

I totally understand your confusion. We started planning this after
kicking off the release 2.0, so there's still a lot to be explored and
the plan keeps changing.

- In the beginning, we planned to do an in-place refactor of DataStream
API, until the API migration period is proposed.
- Then we want to make it an entirely separate API to DataStream, and
listed as a must-have for release 2.0 so that we can remove DataStream
once it's ready.
- However, depending on the outcome of the API compatibility discussion
[1], we may not be able to remove DataStream in 2.0 anyway, which means
we might need to re-evaluate the necessity of this item for 2.0.

I'd say we wait a bit longer for the compatibility discussion [1] and
decide the priority for this item afterwards.

Best,

Xintong

[1] https://lists.apache.org/list.html?dev@flink.apache.org

On Mon, Jun 26, 2023 at 6:00 PM Chesnay Schepler  wrote:


by-and-large I'm quite happy with the list of items.

I'm curious as to why the "Disaggregated State Management" item is
marked as a must-have; will it require changes that break something?
What prevents it from being added in 2.1?

We may want to update the Java 17 item to "Make Java 17 the default,
drop Java 8/11". Maybe even split it into a must-have "Drop Java 8" and
a nice-to-have "Drop Java 11"?

"Move Calcite rules from Scala to Java": I would hope that this would
be an entirely internal change, and could thu

[jira] [Created] (FLINK-32482) Add Java 17 to Docker build matrix

2023-06-29 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32482:


 Summary: Add Java 17 to Docker build matrix
 Key: FLINK-32482
 URL: https://issues.apache.org/jira/browse/FLINK-32482
 Project: Flink
  Issue Type: Sub-task
  Components: flink-docker, Release System
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32479) Tests revoke leadership too early

2023-06-29 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32479:


 Summary: Tests revoke leadership too early
 Key: FLINK-32479
 URL: https://issues.apache.org/jira/browse/FLINK-32479
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination, Tests
Affects Versions: 1.18.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


There are a few tests that issue a request to the dispatcher and immediately 
revoke leadership. In this case there is no guarantee that the request arrived 
before leadership was revoked, so the test could fail if the request arrives 
afterwards, since we reject requests if we aren't the leader anymore.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32467) Move CleanupOnCloseRpcSystem to rpc-core

2023-06-28 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32467:


 Summary: Move CleanupOnCloseRpcSystem to rpc-core
 Key: FLINK-32467
 URL: https://issues.apache.org/jira/browse/FLINK-32467
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / RPC
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


This class is useful for any rpc system implementation and should thus be 
shared.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

2023-06-28 Thread Chesnay Schepler
> we should schedule a check that will rescale if
> min-parallelism-increase is met. Then, what is the use of
> scaling-interval.max timeout in that context?


To force a rescale if min-parallelism-increase is not met (but we could 
still run above the current parallelism).


min-parallelism-increase is a trade-off between the cost of rescaling vs 
the performance benefit of the parallelism increase. Over time the 
balance tips more and more in favor of the parallelism increase, hence 
we should eventually rescale anyway even if the minimum isn't met, or at 
least give users the option to do so.


> I meant the opposite: not having only the cooldown but having only 
the stabilization time. I must have missed something because what I 
wonder is: if every rescale entails a restart of the pipeline and every 
restart entails passing in waiting for resources state, then why 
introduce a cooldown when there is already at each rescale a stable 
resource timeout ?


It is technically correct that the stable resource timeout can be used 
to limit the number of rescale operations per interval, however during 
that time the job isn't running, in contrast to the cooldown.


Having both just gives you a lot more flexibility.
"I want at most 1 rescale operation per hour, and wait at most 1 minute 
for resource to stabilize when a rescale happens".

You can't express this with only one of the options.

On 20/06/2023 14:41, Etienne Chauchot wrote:

Hi Chesnay,

Thanks for your feedback. Comments inline

On 16/06/2023 at 17:24, Chesnay Schepler wrote:
1) Options specific to the adaptive scheduler should start with 
"jobmanager.adaptive-scheduler".



ok



2)
There isn't /really /a notion of a "scaling event". The scheduler is 
informed about new/lost slots and job failures, and reacts 
accordingly by maybe rescaling the job.
(sure, you can think of these as events, but you can think of 
practically everything as events)


There shouldn't be a queue for events. All the scheduler should have 
to know is that the next rescale check is scheduled for time T, which 
in practice boils down to a flag and a scheduled action that runs 
Executing#maybeRescale.



Makes total sense, it's very simple like this. Thanks for the precision 
and pointer. After the related FLIPs, I'll look at the code now.



With that in mind, we also have to look at how we keep this state 
around. Presumably it is scoped to the current state, such that the 
cooldown is reset if a job fails.
Maybe we should add a separate ExecutingWithCooldown state; not sure 
yet.



Yes, losing cooldown state and cooldown reset upon failure is what I 
suggested in point 3 in the previous email. Not sure either about a new 
state; I'll figure it out after experimenting with the code. I'll 
update the FLIP then.





It would be good to clarify whether this FLIP only attempts to cover 
scale up operations, or also scale downs in case of slot losses.



When there is slot loss, most of the time it is due to a TM loss, so 
there should be several slots lost at the same time but (hopefully) 
only once. There should not be many scale downs in a row (but still, 
cascading failures can happen). I think we should just protect against 
scale ups immediately following. For that, I think we could just keep 
the current behavior of transitioning to Restarting state and then back 
to Waiting for Resources state. This state will protect us against 
scale ups immediately following failure/restart.





We should also think about how it relates to the externalized 
declarative resource management. Should we always rescale 
immediately? Should we wait until the cooldown is over?



It relates to point 2, no? We should rescale immediately only if the 
last rescale was done more than scaling-interval.min ago; otherwise 
schedule a rescale at last-rescale + scaling-interval.min.



Related to this, there's the min-parallelism-increase option, that if 
for example set to "2" restricts rescale operations to only occur if 
the parallelism increases by at least 2.



yes I saw that in the code



Ideally however there would be a max timeout for this.

As such we could maybe think about this a bit differently:
Add 2 new options instead of 1:
jobmanager.adaptive-scheduler.scaling-interval.min: The minimum time 
the scheduler will wait for the next effective rescale operations.
jobmanager.adaptive-scheduler.scaling-interval.max: The maximum time 
the scheduler will wait for the next effective rescale operations.



At point 2, we said that when slots change (requirements change or new 
slots available), if the last rescale check (call to maybeRescale) was 
done less than scaling-interval.min ago, we should schedule a check 
that will rescale if min-parallelism-increase is met. Then, what is 
the use of the scaling-interval.max timeout in that context?





3) It sounds fine that we lose the cooldown state, because imo we 
want to reset the cooldown 

Re: [DISCUSS] Release 2.0 Work Items

2023-06-26 Thread Chesnay Schepler

by-and-large I'm quite happy with the list of items.

I'm curious as to why the "Disaggregated State Management" item is 
marked as a must-have; will it require changes that break something? 
What prevents it from being added in 2.1?


We may want to update the Java 17 item to "Make Java 17 the default, 
drop Java 8/11". Maybe even split it into a must-have "Drop Java 8" and 
a nice-to-have "Drop Java 11"?


"Move Calcite rules from Scala to Java": I would hope that this would be 
an entirely internal change, and could thus be an incremental process 
independent of major releases.

What is the actual scale of this item; how much are we actually re-writing?

"Add MetricGroup#getLogicalScope": I'd raise this to a must-have; i 
think I marked it down as nice-to-have only because it depends on 
another item.


The ProcessFunction API item is giving me the most headaches because 
it's very unclear what it actually entails; like is it an entirely 
separate API to DataStream (sounds like it is!) or an extension of 
DataStream. How much will it share the internals with DataStream etc.; 
how does it relate to the Table API (w.r.t. switching APIs / what Table 
API uses underneath).


There are a few items I added as ideas which don't have a priority yet; 
would love to get some feedback on those.


On 21/06/2023 08:41, Xintong Song wrote:

Hi devs,

As previously discussed in [1], we had been collecting work item proposals
for the 2.0 release until June 15th, on the wiki page [2].

- As we have passed the due date, I'd like to kindly remind everyone *not
to add / remove items directly on the wiki page*. If needed, please post
in this thread or reach out to the release managers instead.
- I've reached out to some folks for clarifications about their
proposals. Some of them mentioned that they can not yet tell whether we
should do an item or not, and would need more time / discussions to make
the decision. So I added a new symbol for items whose priorities are `TBD`.

Now it's time to collaboratively decide a minimum set of must-have items.
I've gone through the entire list of proposed items, and found most of them
make quite much sense. So I think an online sync might not be necessary for
this. I'd like to go with this DISCUSS thread, where everyone can comment
on how they think the list can be improved, followed by a VOTE to formally
make the decision.

Any feedback and opinions, including but not limited to the following
aspects, will be appreciated.

- Important items that are missing from the list
- Concerns regarding the listed items or their priorities

Looking forward to your feedback.

Best,

Xintong


[1]
https://lists.apache.org/list?dev@flink.apache.org:lte=1M:release%202.0%20status%20updates

[2] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release



[jira] [Created] (FLINK-32380) Serialization of Java records fails

2023-06-19 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32380:


 Summary: Serialization of Java records fails
 Key: FLINK-32380
 URL: https://issues.apache.org/jira/browse/FLINK-32380
 Project: Flink
  Issue Type: Sub-task
  Components: API / Type Serialization System
Reporter: Chesnay Schepler


Reportedly Java records are not supported, because they are neither detected by 
our Pojo serializer nor supported by Kryo 2.x
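A minimal example of the kind of type that reportedly fails (a sketch; which exact serializer path throws presumably depends on the job):

{code:java}
// A record has no no-arg constructor and no mutable fields, so it fails
// the POJO analysis and falls through to Kryo, which at 2.x predates
// records entirely.
public record Event(long id, String name) {}

// e.g. env.fromElements(new Event(1L, "a")) would then have to pick a
// fallback serializer for Event instead of the POJO serializer.
{code}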



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32379) Skip archunit tests in java1X-target profiles

2023-06-19 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32379:


 Summary: Skip archunit tests in java1X-target profiles
 Key: FLINK-32379
 URL: https://issues.apache.org/jira/browse/FLINK-32379
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


When compiling to Java 11/17 byte code archunit fails; not sure why. Maybe it 
finds more/less stuff or signatures are represented differently.

In any case let's use the Java 8 bytecode version as the "canonical" version 
and skip archunit otherwise.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32378) 2.0 Breaking Metric system changes

2023-06-19 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32378:


 Summary: 2.0 Breaking Metric system changes
 Key: FLINK-32378
 URL: https://issues.apache.org/jira/browse/FLINK-32378
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Metrics
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


Umbrella issue for all breaking changes to the metric system



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32377) 2.0 Breaking REST API changes

2023-06-19 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32377:


 Summary: 2.0 Breaking REST API changes
 Key: FLINK-32377
 URL: https://issues.apache.org/jira/browse/FLINK-32377
 Project: Flink
  Issue Type: Technical Debt
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 2.0.0


Umbrella issue for all breaking changes to the REST API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[NOTICE] Experimental Java 17 support now available on master

2023-06-16 Thread Chesnay Schepler

The master branch now builds and runs with Java 17 out-of-the-box.

Notes:
- a nightly cron build was set up, but I expect it to fail because some 
tests only run in the branches and thus weren't covered by my testing. 
I'll try to stabilize things over the weekend.
- The Python Kafka connector is known to still have issues (FLINK-32327)
  - due to this the Python tests are disabled on CI in Java 17 builds
- In Java 17 builds Scala is being bumped to 2.12.15; don't try to load 
a savepoint from a Java 8/11 build.
- In addition to all tests that are being skipped on Java 11, HBase 2.2 
has now also made the list.


Huge shout-out to Matthias Pohl for reviewing my seemingly endless stream 
of PRs in the last weeks.


If you run into any issues, please report it in FLINK-15736.

Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

2023-06-16 Thread Chesnay Schepler
1) Options specific to the adaptive scheduler should start with 
"jobmanager.adaptive-scheduler".


2)
There isn't /really /a notion of a "scaling event". The scheduler is 
informed about new/lost slots and job failures, and reacts accordingly 
by maybe rescaling the job.
(sure, you can think of these as events, but you can think of 
practically everything as events)


There shouldn't be a queue for events. All the scheduler should have to 
know is that the next rescale check is scheduled for time T, which in 
practice boils down to a flag and a scheduled action that runs 
Executing#maybeRescale.
With that in mind, we also have to look at how we keep this state 
around. Presumably it is scoped to the current state, such that the 
cooldown is reset if a job fails.

Maybe we should add a separate ExecutingWithCooldown state; not sure yet.
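A minimal sketch of that flag-plus-scheduled-action idea (names like checkRescale are illustrative; this is not the actual AdaptiveScheduler code):

{code:java}
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class RescaleCooldownSketch {
    private final ScheduledExecutorService mainThread =
            Executors.newSingleThreadScheduledExecutor();
    private final Duration cooldown;
    private boolean rescaleCheckScheduled;

    RescaleCooldownSketch(Duration cooldown) {
        this.cooldown = cooldown;
    }

    // Called whenever new/lost slots are reported.
    void onResourcesChanged() {
        mainThread.execute(() -> {
            if (!rescaleCheckScheduled) {
                rescaleCheckScheduled = true;
                mainThread.schedule(
                        this::checkRescale, cooldown.toMillis(), TimeUnit.MILLISECONDS);
            }
            // else: a check is already pending for time T; nothing to do.
        });
    }

    private void checkRescale() {
        rescaleCheckScheduled = false;
        // Executing#maybeRescale equivalent: rescale only if worthwhile.
    }
}
{code}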

It would be good to clarify whether this FLIP only attempts to cover 
scale up operations, or also scale downs in case of slot losses.


We should also think about how it relates to the externalized 
declarative resource management. Should we always rescale immediately? 
Should we wait until the cooldown is over?
Related to this, there's the min-parallelism-increase option, that if 
for example set to "2" restricts rescale operations to only occur if the 
parallelism increases by at least 2.

Ideally however there would be a max timeout for this.

As such we could maybe think about this a bit differently:
Add 2 new options instead of 1:
jobmanager.adaptive-scheduler.scaling-interval.min: The minimum time the 
scheduler will wait for the next effective rescale operations.
jobmanager.adaptive-scheduler.scaling-interval.max: The maximum time the 
scheduler will wait for the next effective rescale operations.
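In flink-conf.yaml terms the two options would look like this (the keys are as proposed above; the values are arbitrary examples):

{code}
jobmanager.adaptive-scheduler.scaling-interval.min: 30s
jobmanager.adaptive-scheduler.scaling-interval.max: 1h
{code}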


3) It sounds fine that we lose the cooldown state, because imo we want 
to reset the cooldown anyway on job failures (because a job failure 
inherently implies a potential rescaling).


4) The stabilization time isn't really redundant and serves a different 
use-case. The idea behind is that if a users adds multiple TMs at once 
then we don't want to rescale immediately at the first received slot. 
Without the stabilization time the cooldown would actually cause bad 
behavior here, because not only would we rescale immediately upon 
receiving the minimum required slots to scale up, but we also wouldn't 
use the remaining slots just because the cooldown says so.


On 16/06/2023 15:47, Etienne Chauchot wrote:

Hi Robert,

Thanks for your feedback. I don't know the scheduler part well enough 
yet and I'm taking this ticket as a learning workshop.


Regarding your comments:

1. Taking a look at the AdaptiveScheduler class which takes all its 
configuration from the JobManagerOptions, and also to be consistent 
with other parameters name, I'd suggest 
/jobmanager.scheduler-scaling-cooldown-period/


2. I thought scaling events existed already and the scheduler received 
them as mentioned in FLIP-160 (cf "Whenever the scheduler is in the 
Executing state and receives new slots") or in FLIP-138 (cf "Whenever 
new slots are available the SlotPool notifies the Scheduler"). If it 
is not the case (it is the scheduler who asks for slots), then there 
is no need for storing scaling requests indeed.


=> I need a confirmation here

3. If we loose the JobManager, we loose both the AdaptiveScheduler 
state and the CoolDownTimer state. So, upon recovery, it would be as 
if there was no ongoing coolDown period. So, a first re-scale could 
happen right away and it will start a coolDown period. A second 
re-scale would have to wait for the end of this period.


4. When a pipeline is re-scaled, it is restarted. Upon restart, the 
AdaptiveScheduler passes again in the "waiting for resources" state as 
FLIP-160 suggests. If so, then it seems that the coolDown period is 
kind of redundant with the resource-stabilization-timeout. I guess it 
is not the case otherwise the FLINK-21883 ticket would not have been 
created.


=> I need a confirmation here also.


Thanks for your views on point 2 and 4.


Best

Etienne

On 15/06/2023 at 13:35, Robert Metzger wrote:

Thanks for the FLIP.

Some comments:
1. Can you specify the full proposed configuration name? "
scaling-cooldown-period" is probably not the full config name?
2. Why is the concept of scaling events and a scaling queue needed? If I
remember correctly, the adaptive scheduler will just check how many
TaskManagers are available and then adjust the execution graph 
accordingly.

There's no need to store a number of scaling events. We just need to
determine the time to trigger an adjustment of the execution graph.
3. What's the behavior wrt to JobManager failures (e.g. we lose the 
state

of the Adaptive Scheduler?). My proposal would be to just reset the
cooldown period, so after recovery of a JobManager, we have to wait at
least for the cooldown period until further scaling operations are done.
4. What's the relationship to the

[jira] [Created] (FLINK-32370) JDBC SQL gateway e2e test is unstable

2023-06-16 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32370:


 Summary: JDBC SQL gateway e2e test is unstable
 Key: FLINK-32370
 URL: https://issues.apache.org/jira/browse/FLINK-32370
 Project: Flink
  Issue Type: Technical Debt
Affects Versions: 1.18.0
Reporter: Chesnay Schepler
 Fix For: 1.18.0
 Attachments: flink-vsts-sql-gateway-0-fv-az75-650.log, 
flink-vsts-standalonesession-0-fv-az75-650.log, 
flink-vsts-taskexecutor-0-fv-az75-650.log

The client is failing while trying to collect data when the job already 
finished on the cluster.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32369) Setup cron build

2023-06-16 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32369:


 Summary: Setup cron build
 Key: FLINK-32369
 URL: https://issues.apache.org/jira/browse/FLINK-32369
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-15 Thread Chesnay Schepler

On 13/06/2023 17:26, Becket Qin wrote:

It would be valuable if we can avoid releasing minor versions for previous
major versions.


On paper, /absolutely /agree, but I'm not sure how viable that is in 
practice.


On the current 2.0 agenda is potentially dropping support for Java 8/11, 
which may very well be a problem for our current users.



On 13/06/2023 17:26, Becket Qin wrote:

Thanks for the feedback and sorry for the confusion about Public API
deprecation. I just noticed that there was a mistake in the NOTES part for
Public API due to a copy-paste error... I just fixed it.

I'm very relieved to hear that. Glad to hear that we are on the same 
page on that note.



On 15/06/2023 15:20, Becket Qin wrote:

But it should be
completely OK to bump up the major version if we really want to get rid of
a public API, right?


Technically yes, but look at how long it took to get us to 2.0. ;)

There's a separate discussion to be had on the cadence of major releases 
going forward, and there seem to be different opinions on that.


If we take the Kafka example of 2 minor releases between major ones, 
that for us means that users have to potentially deal with breaking 
changes every 6 months, which seems like a lot.


Given our track record I would prefer a regular cycle (1-2 years) to 
force us to think about this whole topic, not put it to the wayside 
again, and to give us (and users) a clear expectation of when breaking 
changes can be made.


But again, maybe this should be in a separate thread.

On 14/06/2023 11:37, Becket Qin wrote:

Do you have an example of behavioral change in mind? Not sure I fully
understand the concern for behavioral change here.


This could be a lot of things. It can be performance in certain 
edge-cases, a bug fix that users (maybe unknowingly) relied upon 
(https://xkcd.com/1172/), a semantic change to some API.


For a concrete example, consider the job submission. A few releases back 
we made changes such that the initialization of the job master happens 
asynchronously.
This meant the job submission call returns sooner, and the job state 
enum was extended to cover this state.
API-wise we consider this a compatible change, but the observed behavior 
may be different.


Metrics are another example; I believe over time we changed what some 
metrics returned a few times.


[jira] [Created] (FLINK-32359) AdaptiveSchedulerBuilder should accept executor service in constructor

2023-06-15 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32359:


 Summary: AdaptiveSchedulerBuilder should accept executor service 
in constructor
 Key: FLINK-32359
 URL: https://issues.apache.org/jira/browse/FLINK-32359
 Project: Flink
  Issue Type: Technical Debt
  Components: Tests
Reporter: Chesnay Schepler
 Fix For: 1.18.0


The ASBuilder currently accepts mandatory arguments in both the constructor and 
final {{build()}} method.
This makes it difficult to create composite helper factory methods, since you 
always need to pass a special value in build(), usually leaking details of the 
test setup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32358) CI may unintentionally use fallback akka loader

2023-06-15 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32358:


 Summary: CI may unintentionally use fallback akka loader
 Key: FLINK-32358
 URL: https://issues.apache.org/jira/browse/FLINK-32358
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System / CI
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


We have a fallback akka loader for developer convenience in the IDE, that is on 
the classpath of most modules. Depending on the order of jars on the classpath 
it can happen that the fallback loader appears first, which we don't want 
because it slows down the build and creates noisy logs.

We can add a simple prioritization scheme to the rpc system loading to remedy 
that.
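Such a prioritization could be as simple as the following sketch (the getPriority() method is an assumption for illustration; the real loader interface may differ):

{code:java}
import java.util.Comparator;
import java.util.ServiceLoader;
import java.util.stream.StreamSupport;

final class PrioritizedRpcSystemLoading {
    // Hypothetical SPI: the fallback loader would report the lowest priority.
    interface RpcSystemLoader {
        int getPriority();
    }

    // Pick the highest-priority implementation instead of whichever jar
    // happens to come first on the classpath.
    static RpcSystemLoader load() {
        return StreamSupport
                .stream(ServiceLoader.load(RpcSystemLoader.class).spliterator(), false)
                .max(Comparator.comparingInt(RpcSystemLoader::getPriority))
                .orElseThrow(() -> new IllegalStateException("No RPC system found"));
    }
}
{code}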



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32338) Add FailsOnJava17 annotation

2023-06-14 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32338:


 Summary: Add FailsOnJava17 annotation
 Key: FLINK-32338
 URL: https://issues.apache.org/jira/browse/FLINK-32338
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


Add an annotation for disabling specific tests on Java 17, similar to 
FailsOnJava11.
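
Assuming JUnit 5 conventions, such an annotation could be a composed annotation 
along these lines (a sketch, not necessarily how FailsOnJava11 is implemented 
in Flink):

{code:java}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.jupiter.api.condition.DisabledOnJre;
import org.junit.jupiter.api.condition.JRE;

/** Marks tests that are known to fail on Java 17 and disables them on that JRE. */
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@DisabledOnJre(JRE.JAVA_17)
public @interface FailsOnJava17 {}
{code}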



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32336) PartitionITCase#ComparablePojo should be public

2023-06-14 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32336:


 Summary: PartitionITCase#ComparablePojo should be public
 Key: FLINK-32336
 URL: https://issues.apache.org/jira/browse/FLINK-32336
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


POJOs should be public, but this one is private, forcing it to go through Kryo, 
which is currently failing for some odd reason.
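
For context, a sketch of what Flink's POJO rules require (the actual fields of 
the test class differ; this is illustrative): the class, its no-arg 
constructor, and its fields (or getters/setters) all need to be public for the 
PojoSerializer to apply; otherwise Flink falls back to Kryo.

{code:java}
// Illustrative POJO satisfying Flink's serializer requirements; not the actual test class.
public class ComparablePojo implements Comparable<ComparablePojo> {

    public int value; // fields must be public or have public getters/setters

    public ComparablePojo() {} // a public no-arg constructor is required

    @Override
    public int compareTo(ComparablePojo other) {
        return Integer.compare(this.value, other.value);
    }
}
{code}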



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32330) Setup Java 17 in e2e builds

2023-06-13 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32330:


 Summary: Setup Java 17 in e2e builds
 Key: FLINK-32330
 URL: https://issues.apache.org/jira/browse/FLINK-32330
 Project: Flink
  Issue Type: Sub-task
  Components: Test Infrastructure
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32329) Do not overwrite env.java.opts.all in HA e2e test

2023-06-13 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32329:


 Summary: Do not overwrite env.java.opts.all in HA e2e test
 Key: FLINK-32329
 URL: https://issues.apache.org/jira/browse/FLINK-32329
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


Avoid overriding env.java.opts.all since it will soon contain the module 
declarations required for running Java 17.

This is a bit of a hack; a nicer approach would be to append to the existing 
value, but ain't no one got time to deal with bash.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32328) Ensure surefire baseLine is picked up by IntelliJ

2023-06-13 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32328:


 Summary: Ensure surefire baseLine is picked up by IntelliJ
 Key: FLINK-32328
 URL: https://issues.apache.org/jira/browse/FLINK-32328
 Project: Flink
  Issue Type: Sub-task
  Components: Build System
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


We currently configure JVM arguments exclusively within the surefire 
executions, which IntelliJ doesn't read. We should also apply the baseArgsLine 
(which in the future will contain module declarations) to the base surefire 
configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32327) Python Kafka connector runs into strange NullPointerException

2023-06-13 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32327:


 Summary: Python Kafka connector runs into strange 
NullPointerException
 Key: FLINK-32327
 URL: https://issues.apache.org/jira/browse/FLINK-32327
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python
Reporter: Chesnay Schepler


The following error occurs when running the Python Kafka tests (this uses a 
slightly modified version of the code, but the error also happens without the 
modification):

{code:python}
 def set_record_serializer(self, record_serializer: 'KafkaRecordSerializationSchema') \
         -> 'KafkaSinkBuilder':
     """
     Sets the :class:`KafkaRecordSerializationSchema` that transforms incoming
     records to kafka producer records.

     :param record_serializer: The :class:`KafkaRecordSerializationSchema`.
     """
     # NOTE: If topic selector is a generated first-column selector, do extra preprocessing
     j_topic_selector = get_field_value(record_serializer._j_serialization_schema,
                                        'topicSelector')

     caching_name_suffix = 'KafkaRecordSerializationSchemaBuilder.CachingTopicSelector'
     if j_topic_selector.getClass().getCanonicalName().endswith(caching_name_suffix):
         class_name = get_field_value(j_topic_selector, 'topicSelector')\
             .getClass().getCanonicalName()
>        if class_name.startswith('com.sun.proxy') or class_name.startswith('jdk.proxy'):
E        AttributeError: 'NoneType' object has no attribute 'startswith'
{code}

My assumption is that {{getCanonicalName}} returns {{null}} for some objects, 
and this set of objects may have increased in Java 17. I tried adding a null 
check, but that caused other tests to fail.
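
That assumption is easy to reproduce in plain Java: {{getCanonicalName()}} is 
specified to return {{null}} for classes without a canonical name, e.g. 
anonymous and local classes (a standalone demo, unrelated to the connector 
code):

{code:java}
public class CanonicalNameDemo {
    public static void main(String[] args) {
        class Local {}                       // local class: no canonical name
        Object anonymous = new Object() {};  // anonymous class: no canonical name

        System.out.println(Local.class.getCanonicalName());          // prints "null"
        System.out.println(anonymous.getClass().getCanonicalName()); // prints "null"
        System.out.println(String.class.getCanonicalName());         // prints "java.lang.String"
    }
}
{code}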



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-13 Thread Chesnay Schepler

On 13/06/2023 12:50, Jing Ge wrote:

One major issue we have, afaiu, is caused by the lack of housekeeping/house
cleaning, there are many APIs that were marked as deprecated a few years
ago and still don't get removed. Some APIs should be easy to remove and
others will need some more clear rules, like the issue discussed at [1].


This is by design. Most of these are @Public APIs that we had to carry 
around until Flink 2.0, because that was the initial guarantee that we 
gave people.



As for the FLIP, I like the idea of explicitly writing down a 
deprecation period for APIs, particularly PublicEvolving ones.
For Experimental APIs I don't think it'd be a problem to change them 
right away,
but looking back a bit I don't think it would hurt to also enforce some 
deprecation period.

1 release for both of these sounds fine to me.


My major point of contention is the removal of Public APIs between minor 
versions.

This to me would be a major setback on the way to a simpler upgrade path for 
users.
If these can be removed in minor versions, then what even is a major release?
The very definition we have for Public APIs is that they stick around 
until the next major one.
Any rule that theoretically allows breaking changes to Public APIs in 
every minor release is, in my opinion, not a viable option.



The "carry recent Public APIs forward into the next major release" thing 
seems to presume a linear release history (aka, if 2.0 is released after 
1.20, then there will be no 1.21), which I doubt will be the case. The 
idea behind it is good, but I'd say the right conclusion would be to not 
make that API public if we know a new major release hits in 3 months and 
is about to modify it. With a regular schedule for major releases this 
wouldn't be difficult to do.


[jira] [Created] (FLINK-32314) Ignore class-loading errors after RPC system shutdown

2023-06-12 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32314:


 Summary: Ignore class-loading errors after RPC system shutdown
 Key: FLINK-32314
 URL: https://issues.apache.org/jira/browse/FLINK-32314
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / RPC, Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


In tests we occasionally see the akka RPC service throwing class-loading errors 
_after_ it was shut down.
AFAICT our shutdown procedure is correct, and it's just akka shutting down some 
things asynchronously.
I couldn't figure out why/what is still running, so as a band-aid I suggest 
ignoring class-loading errors after the RPC service shutdown has completed.
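
A hedged sketch of what such a band-aid could look like (names and structure 
are assumptions, not the actual fix): record when shutdown completed and drop 
class-loading errors that surface afterwards.

{code:java}
// Illustrative only; the real change lives in Flink's test/RPC infrastructure.
final class PostShutdownErrorFilter {
    private volatile boolean shutdownComplete;

    void onShutdownComplete() {
        shutdownComplete = true;
    }

    /** Returns true if the throwable is a class-loading error arriving after shutdown. */
    boolean shouldIgnore(Throwable t) {
        return shutdownComplete
                && (t instanceof NoClassDefFoundError || t instanceof ClassNotFoundException);
    }
}
{code}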



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32304) Reduce rpc-akka jar

2023-06-09 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32304:


 Summary: Reduce rpc-akka jar
 Key: FLINK-32304
 URL: https://issues.apache.org/jira/browse/FLINK-32304
 Project: Flink
  Issue Type: Improvement
  Components: Build System, Runtime / RPC
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0, 1.17.2


We bundle unnecessary dependencies in the rpc-akka jar; we can easily shave off 
15 MB of dependencies.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32302) Disable Hbase 2.x tests on Java 17

2023-06-09 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32302:


 Summary: Disable Hbase 2.x tests on Java 17
 Key: FLINK-32302
 URL: https://issues.apache.org/jira/browse/FLINK-32302
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / HBase, Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


Lacking support on the HBase side. Version bumps may solve it, but that's out 
of scope for this issue since the connector is being externalized.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32301) common.sh#create_ha_config should use set_config_key

2023-06-09 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32301:


 Summary: common.sh#create_ha_config should use set_config_key
 Key: FLINK-32301
 URL: https://issues.apache.org/jira/browse/FLINK-32301
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


Instead of replacing the entire configuration, set the desired individual 
options.
The current approach isn't great because it prevents us from setting required 
defaults in the flink-dist config.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32297) Use Temurin image in FlinkImageBuilder

2023-06-09 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-32297:


 Summary: Use Temurin image in FlinkImageBuilder
 Key: FLINK-32297
 URL: https://issues.apache.org/jira/browse/FLINK-32297
 Project: Flink
  Issue Type: Sub-task
  Components: Test Infrastructure
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.18.0


The FlinkImageBuilder currently uses openjdk images. I've seen issues with 
these on Java 17, and propose using Temurin instead, similar to the prod images.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

