Re: [VOTE] FLIP-437: Support ML Models in Flink SQL

2024-04-03 Thread David Morávek
+1 (binding)

My only suggestion would be to move Catalog changes into a separate
interface to allow us to begin with lower stability guarantees. Existing
Catalogs would be able to opt-in by implementing it. It's a minor thing
though, overall the FLIP is solid and the direction is pretty exciting.
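A minimal sketch of what such an opt-in mixin could look like; the `ModelCatalog` name and its methods are invented here for illustration and are not part of the FLIP:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical opt-in mixin; existing Catalogs gain model support by implementing it. */
interface ModelCatalog {
    void createModel(String name, String provider);
    Optional<String> getModel(String name);
}

/** A toy in-memory catalog that opts in by implementing the mixin. */
class InMemoryCatalog implements ModelCatalog {
    private final Map<String, String> models = new HashMap<>();

    @Override
    public void createModel(String name, String provider) {
        models.put(name, provider);
    }

    @Override
    public Optional<String> getModel(String name) {
        return Optional.ofNullable(models.get(name));
    }
}

class ModelCatalogDemo {
    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.createModel("fraud", "provider-x");
        System.out.println(catalog.getModel("fraud").orElse("absent")); // prints provider-x
    }
}
```

Keeping the mixin separate from Catalog would let the model methods start with weaker stability guarantees (e.g. experimental) while the core Catalog interface stays stable.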

Best,
D.

On Wed, Apr 3, 2024 at 2:31 AM David Radley  wrote:

> Hi Hao,
> I don’t think this counts as an objection; I have some comments. I should
> have put this on the discussion thread earlier but have just got to this.
> - I suggest we put a model version in the model resource. Versions are
> notoriously difficult to add later; I don’t think we want to proliferate
> differently named models as a model mutates. We may want to work with
> non-latest models.
> - I see that the model name is the unique identifier. I realise this would
> move away from the Oracle syntax – so may not be feasible short term; but I
> wonder if we can have:
>  - a uuid as the main identifier and the model name as an attribute.
> or
>  - a namespace (or something like a system of origin)
> to help organise models with the same name.
> - does the model have an owner? I assume the Flink model resource is the
> master of the model? I imagine in the future that a model that comes in via
> a new connector could be kept up to date with the external model and would
> not be allowed to be changed by anything other than the connector.
>
>Kind regards, David.
>
> From: Hao Li 
> Date: Friday, 29 March 2024 at 16:30
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] [VOTE] FLIP-437: Support ML Models in Flink SQL
> Hi devs,
>
> I'd like to start a vote on the FLIP-437: Support ML Models in Flink
> SQL [1]. The discussion thread is here [2].
>
> The vote will be open for at least 72 hours unless there is an objection or
> insufficient votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
>
> [2] https://lists.apache.org/thread/9z94m2bv4w265xb5l2mrnh4lf9m28ccn
>
> Thanks,
> Hao
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>
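David Radley's versioning and namespace suggestion above boils down to making the identifier a compound of (namespace, name, version) rather than a bare name. A toy sketch, with all names invented for illustration:

```java
import java.util.Objects;

/** Toy identifier: namespace + name + version keep same-named models distinct. */
final class ModelId {
    final String namespace;
    final String name;
    final int version;

    ModelId(String namespace, String name, int version) {
        this.namespace = namespace;
        this.name = name;
        this.version = version;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ModelId)) return false;
        ModelId m = (ModelId) o;
        return version == m.version
                && namespace.equals(m.namespace)
                && name.equals(m.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespace, name, version);
    }

    @Override
    public String toString() {
        return namespace + "." + name + ":v" + version;
    }
}

class ModelIdDemo {
    public static void main(String[] args) {
        // Two teams can own a "fraud" model without colliding.
        ModelId a = new ModelId("teamA", "fraud", 1);
        ModelId b = new ModelId("teamB", "fraud", 1);
        System.out.println(a + " equals " + b + "? " + a.equals(b)); // prints false
    }
}
```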


Re: [DISCUSS] FLIP-388: Support Dynamic Logger Level Adjustment

2024-01-16 Thread David Morávek
Hi Yuepeng,

Thanks for the FLIP! There was already quite a discussion on FLIP-210
[1][2], that has proposed more or less the same thing. FLIP was marked as
out of scope for Flink because underlying technologies already address it.

Are you aware of the effort? If yes, what has changed since then?

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-210%3A+Change+logging+level+dynamically+at+runtime
[2] https://lists.apache.org/thread/n8omkpjf1mk9jphx38b8tfrs4h3nxo3z
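For context on "underlying technologies already address it": Log4j 2, Flink's default logging backend, can pick up level changes at runtime by polling its configuration file. A minimal sketch, assuming an XML-style configuration:

```xml
<!-- monitorInterval makes Log4j 2 re-read this file every 30 seconds,
     so edits to logger levels take effect without restarting the JVM. -->
<Configuration monitorInterval="30">
  <Appenders>
    <Console name="console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d %-5p %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Logger name="org.apache.flink" level="debug"/>
    <Root level="info">
      <AppenderRef ref="console"/>
    </Root>
  </Loggers>
</Configuration>
```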

Best,
D.

On Tue, Jan 16, 2024 at 4:37 PM Yuepeng Pan  wrote:

> Hi all,
>
>
>
>
> I created the FLIP-388[1] to support dynamic logger level
> adjustment.
>
>
>
>
>  Comprehensive and detailed system logs (like debug, trace, all,
> etc.)
>
> could contribute to improved visibility of internal system execution
> information
>
> and also enhance the efficiency of program debugging. Flink currently only
>
> supports static log level configuration (like debug, trace, all, etc.) to
> help application
>
> debugging, which can lead to the following issues when using static log
> configuration:
>
>   1. A sharp increase in log volume, accelerating disk occupancy.
>
>   2. Potential risks of system performance degradation due to a large
> volume of log printing.
>
>   3. The need to subsequently simplify the log configuration, which
> inevitably causes the program to restart.
>
>
>
>  Therefore, introducing a mechanism to dynamically adjust the log
> output level
>
> while debugging programs would be meaningful, as it enables switching
> the log level
>
> configuration without restarting the program.
>
>
>
>
>  I really appreciate Fan Rui (CC'ed) and Zhanghao Chen (CC'ed) for
> providing valuable help and suggestions.
>
>  Please refer to the FLIP[1] document for more details about the
> proposed design and implementation.
>
> We welcome any feedback and opinions on this proposal.
>
>
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-388%3A+Support+Dynamic+Logger+Level+Adjustment
>
> [2] https://issues.apache.org/jira/browse/FLINK-33320
>
>
>
>
> Best,
>
> Yuepeng Pan


Re: FW: [ANNOUNCE] New Apache Flink Committer - Alexander Fedulov

2024-01-09 Thread David Morávek
Congrats, Alex!

Best,
D.

On Fri, Jan 5, 2024 at 7:25 AM Sergey Nuyanzin  wrote:

> Congratulations, Alex!
>
> On Fri, Jan 5, 2024, 05:12 Lincoln Lee  wrote:
>
> > Congratulations, Alex!
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Alexander Fedulov  于2024年1月4日周四 19:08写道:
> >
> > > Thanks, everyone! It is great to be part of such an active and
> > > collaborative community!
> > >
> > > Best,
> > > Alex
> > >
> > > On Thu, 4 Jan 2024 at 10:10, Etienne Chauchot 
> > > wrote:
> > >
> > > > Congrats! Welcome onboard.
> > > >
> > > > Best
> > > >
> > > > Etienne
> > > >
> > > > Le 04/01/2024 à 03:14, Jane Chan a écrit :
> > > > > Congratulations, Alex!
> > > > >
> > > > > Best,
> > > > > Jane
> > > > >
> > > > > On Thu, Jan 4, 2024 at 10:03 AM Junrui Lee
> > > wrote:
> > > > >
> > > > >> Congratulations, Alex!
> > > > >>
> > > > >> Best,
> > > > >> Junrui
> > > > >>
> > > > >> weijie guo  于2024年1月4日周四 09:57写道:
> > > > >>
> > > > >>> Congratulations, Alex!
> > > > >>>
> > > > >>> Best regards,
> > > > >>>
> > > > >>> Weijie
> > > > >>>
> > > > >>>
> > > > >>> Steven Wu  于2024年1月4日周四 02:07写道:
> > > > >>>
> > > >  Congrats, Alex! Well deserved!
> > > > 
> > > >  On Wed, Jan 3, 2024 at 2:31 AM David Radley<
> > david_rad...@uk.ibm.com
> > > >
> > > >  wrote:
> > > > 
> > > > > Sorry for my typo.
> > > > >
> > > > > Many congratulations Alex!
> > > > >
> > > > > From: David Radley
> > > > > Date: Wednesday, 3 January 2024 at 10:23
> > > > > To: David Anderson
> > > > > Cc:dev@flink.apache.org  
> > > > > Subject: Re: [EXTERNAL] [ANNOUNCE] New Apache Flink Committer -
> > > > >>> Alexander
> > > > > Fedulov
> > > > > Many Congratulations David .
> > > > >
> > > > > From: Maximilian Michels
> > > > > Date: Tuesday, 2 January 2024 at 12:16
> > > > > To: dev
> > > > > Cc: Alexander Fedulov
> > > > > Subject: [EXTERNAL] [ANNOUNCE] New Apache Flink Committer -
> > > Alexander
> > > > > Fedulov
> > > > > Happy New Year everyone,
> > > > >
> > > > > I'd like to start the year off by announcing Alexander Fedulov
> > as a
> > > > > new Flink committer.
> > > > >
> > > > > Alex has been active in the Flink community since 2019. He has
> > > > > contributed more than 100 commits to Flink, its Kubernetes
> > > operator,
> > > > > and various connectors [1][2].
> > > > >
> > > > > Especially noteworthy are his contributions on deprecating and
> > > > > migrating the old Source API functions and test harnesses, the
> > > > > enhancement to flame graphs, the dynamic rescale time
> computation
> > > in
> > > > > Flink Autoscaling, as well as all the small enhancements Alex
> has
> > > > > contributed which make a huge difference.
> > > > >
> > > > > Beyond code contributions, Alex has been an active community
> > member
> > > > > with his activity on the mailing lists [3][4], as well as
> various
> > > > > talks and blog posts about Apache Flink [5][6].
> > > > >
> > > > > Congratulations Alex! The Flink community is proud to have you.
> > > > >
> > > > > Best,
> > > > > The Flink PMC
> > > > >
> > > > > [1]
> > > > >
> > > > >>>
> > > >
> > https://github.com/search?type=commits=author%3Aafedulov+org%3Aapache
> > > > > [2]
> > > > >
> > > > >>
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/FLINK-28229?jql=status%20in%20(Resolved%2C%20Closed)%20AND%20assignee%20in%20(afedulov)%20ORDER%20BY%20resolved%20DESC%2C%20created%20DESC
> > > > > [3]
> > > > >>>
> > https://lists.apache.org/list?dev@flink.apache.org:lte=100M:Fedulov
> > > > > [4]
> > > > >>>
> > https://lists.apache.org/list?u...@flink.apache.org:lte=100M:Fedulov
> > > > > [5]
> > > > >
> > > > >>
> > > >
> > >
> >
> https://flink.apache.org/2020/01/15/advanced-flink-application-patterns-vol.1-case-study-of-a-fraud-detection-system/
> > > > > [6]
> > > > >
> > > > >>
> > > >
> > >
> >
> https://www.ververica.com/blog/presenting-our-streaming-concepts-introduction-to-flink-video-series
> > >
> >
>


Re: [DISCUSS] FLIP 411: Chaining-agnostic Operator ID generation for improved state compatibility on parallelism change

2024-01-09 Thread David Morávek
Hi Zhanghao,

Thanks for the FLIP. What you're proposing makes a lot of sense +1

Have you thought about how this works with unaligned checkpoints in case
you go from unchained to chained? I think it should be fine because this
scenario should only apply to forward/rebalance scenarios where we, as far
as I recall, force alignment anyway, so there should be no exchanges to
snapshot. It might just work, but something to double-check. Maybe @Piotr
Nowojski  could confirm it.

Best,
D.

On Wed, Jan 3, 2024 at 7:10 AM Zhanghao Chen 
wrote:

> Dear Flink devs,
>
> I'd like to start a discussion on FLIP 411: Chaining-agnostic Operator ID
> generation for improved state compatibility on parallelism change [1].
>
> Currently, when user does not explicitly set operator UIDs, the chaining
> behavior will still affect state compatibility, as the generation of the
> Operator ID is dependent on its chained output nodes. For example, a simple
> source->sink DAG with source and sink chained together is state
> incompatible with an otherwise identical DAG with source and sink unchained
> (either because the parallelisms of the two ops are changed to be unequal
> or chaining is disabled). This greatly limits the flexibility to perform
> chain-breaking/building for performance tuning.
>
> The dependency on chained output nodes for Operator ID generation can be
> traced back to Flink 1.2. It is unclear at this point why chained output
> nodes are involved in the algorithm, but the following history background
> might be related: prior to Flink 1.3, Flink runtime takes the snapshots by
> the operator ID of the first vertex in a chain, so it somewhat makes sense
> to include chained output nodes into the algorithm as
> chain-breaking/building is expected to break state-compatibility anyway.
>
> Given that operator-level state recovery within a chain has long been
> supported since Flink 1.3, I propose to introduce StreamGraphHasherV3 that
> is agnostic of the chaining behavior of operators, so that users are free
> to tune the parallelism of individual operators without worrying about
> state incompatibility. We can make the V3 hasher an optional choice in
> Flink 1.19 for backwards compatibility, and make it the default hasher
> in 2.0.
>
> Looking forward to your suggestions on it, thanks~
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-411%3A+Chaining-agnostic+Operator+ID+generation+for+improved+state+compatibility+on+parallelism+change
>
> Best,
> Zhanghao Chen
>
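The chain-dependence Zhanghao describes can be illustrated with a toy contrast between the two hashing schemes (this is not Flink's actual StreamGraphHasher code, just a sketch of the idea):

```java
import java.util.List;
import java.util.Objects;

/** Toy sketch contrasting the two schemes; not Flink's real hasher. */
class HasherSketch {

    /** Legacy-style: the hash depends on the IDs of chained output nodes,
     *  so breaking or building a chain changes the operator ID. */
    static int legacyHash(int nodeId, List<Integer> chainedOutputs) {
        return Objects.hash(nodeId, chainedOutputs);
    }

    /** Chaining-agnostic: only the node itself (and, in the real proposal,
     *  its inputs) contribute, so chaining changes are invisible. */
    static int agnosticHash(int nodeId) {
        return Objects.hash(nodeId);
    }

    public static void main(String[] args) {
        // source(1) -> sink(2): first chained together, then unchained.
        int chained = legacyHash(1, List.of(2));
        int unchained = legacyHash(1, List.of());
        System.out.println("legacy hashes differ: " + (chained != unchained));
        System.out.println("agnostic hashes equal: "
                + (agnosticHash(1) == agnosticHash(1)));
    }
}
```

Under the legacy scheme the source's ID changes when the chain breaks, which is exactly what makes the two DAGs state-incompatible.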


Re: [VOTE] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

2023-11-22 Thread David Morávek
+1 (binding)

Best,
D.

On Wed, Nov 22, 2023 at 11:21 AM Roman Khachatryan  wrote:

> +1 (binding)
>
> Regards,
> Roman
>
> On Wed, Nov 22, 2023, 7:08 AM Zakelly Lan  wrote:
>
> > +1(non-binding)
> >
> > Best,
> > Zakelly
> >
> > On Wed, Nov 22, 2023 at 3:04 PM Hangxiang Yu 
> wrote:
> >
> > > +1 (binding)
> > > Thanks for driving this again!
> > >
> > > On Wed, Nov 22, 2023 at 10:30 AM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > > +1(binding)
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Wed, Nov 22, 2023 at 6:43 AM Jing Ge 
> > > > wrote:
> > > >
> > > > > +1(binding) Thanks!
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > On Tue, Nov 21, 2023 at 6:17 PM Piotr Nowojski <
> pnowoj...@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I'd like to start a vote on the FLIP-384: Introduce TraceReporter
> > and
> > > > use
> > > > > > it to create checkpointing and recovery traces [1]. The
> discussion
> > > > thread
> > > > > > is here [2].
> > > > > >
> > > > > > The vote will be open for at least 72 hours unless there is an
> > > > objection
> > > > > or
> > > > > > not enough votes.
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/x/TguZE
> > > > > > [2]
> > https://lists.apache.org/thread/7lql5f5q1np68fw1wc9trq3d9l2ox8f4
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Piotrek
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Hangxiang.
> > >
> >
>


Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-11-20 Thread David Morávek
The paper looks interesting, but it might not deliver the described
benefit in practice:

1. It forces you to remember all keys in the broadcasted (partitioned is
impossible without timeouts, etc.) operator state. Forever. This itself is
a blocker for a bunch of pipelines. The primary motivation for using the
state is that you can't simply recompute the WM as with the global one.
2. It's super easy to run into idleness issues (it's almost inevitable).
3. Thinking about multiple chained aggregations on different keys is just
... mind-bending.
4. This is a significant change to public APIs.

The main problem the paper needs to address is idleness and stuck per-key
watermarks (pipeline not making progress).
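The stuck-watermark problem is easy to demonstrate: if downstream progress is gated on the minimum per-key watermark, one idle key stalls everything. A toy sketch (not the paper's implementation):

```java
import java.util.HashMap;
import java.util.Map;

/** Toy per-key watermark tracker illustrating the stuck-watermark problem. */
class KeyedWatermarks {
    private final Map<String, Long> perKey = new HashMap<>();

    /** Advance the watermark for one key (monotonic per key). */
    void advance(String key, long ts) {
        perKey.merge(key, ts, Math::max);
    }

    /** Anything gated on overall progress observes the minimum across keys. */
    long minWatermark() {
        return perKey.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        KeyedWatermarks wm = new KeyedWatermarks();
        wm.advance("sensor-a", 100);
        wm.advance("sensor-b", 5); // sensor-b goes idle afterwards
        wm.advance("sensor-a", 1_000);
        // sensor-a is far ahead, but idle sensor-b pins overall progress at 5
        System.out.println("min watermark = " + wm.minWatermark()); // prints 5
    }
}
```

Note also that `perKey` only ever grows, which is issue 1 above: without timeouts, state for every key ever seen must be kept forever.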

What do you think about these issues?

Best,
D.


On Sat, Nov 18, 2023 at 6:41 PM Tawfek Yasser Tawfek 
wrote:

> Hello Alexander,
>
> Will we continue the discussion?
>
>
>
> Thanks & BR,
>
> Tawfik
>
> 
> From: Tawfek Yasser Tawfek 
> Sent: 30 October 2023 15:32
> To: dev@flink.apache.org 
> Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
> Hi Alexander,
>
> Thank you for your reply.
>
> Yes. As you showed, the keyed-watermarks mechanism is mainly required for
> the case where we need a fine-grained calculation for each partition
> [calculation over data produced by each individual sensor]. As scalability
> factors require partitioning the calculations,
> the keyed-watermarks mechanism is designed for this type of problem.
>
> Thanks,
> Tawfik
> 
> From: Alexander Fedulov 
> Sent: 30 October 2023 13:37
> To: dev@flink.apache.org 
> Subject: Re: Proposal for Implementing Keyed Watermarks in Apache Flink
>
> [You don't often get email from alexander.fedu...@gmail.com. Learn why
> this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> Hi Tawfek,
>
> > The idea is to generate a watermark for each key (sub-stream), in order
> to avoid the fast progress of the global watermark which affects low-rate
> sources.
>
> Let's consider the sensors example from the paper. Shouldn't it be about
> the delay between the time of taking the measurement and its arrival at
> Flink, rather than the rate at which the measurements are produced? If a
> particular sensor produces no data during a specific time window, it
> doesn't make sense to wait for it—there won't be any corresponding
> measurement arriving because none was produced. Thus, I believe we should
> be talking about situations where data from certain sensors can arrive with
> significant delay compared to most other sensors.
>
> From the perspective of data aggregation, there are two main scenarios:
> 1) Calculation over data produced by multiple sensors
> 2) Calculation over data produced by an individual sensor
>
> In scenario 1), there are two subcategories:
> a) Meaningful results cannot be produced without data from those delayed
> sensors; hence, you need to wait longer.
>   => Time is propagated by the mix of all sources. You just need to set
> a bounded watermark with enough lag to accommodate the delayed results.
> This is precisely what event time processing and bounded watermarks are for
> (no keyed watermarking is required).
> b) You need to produce the results as they are and perhaps patch them later
> when the delayed data arrives.
>  => Time is propagated by the mix of all sources. You produce the
> results as they are but utilize allowedLateness to patch the aggregates if
> needed (no keyed watermarking is required).
>
> So, is it correct to say that keyed watermarking is applicable only in
> scenario 2)?
>
> Best,
> Alexander Fedulov
>
> On Sat, 28 Oct 2023 at 14:33, Tawfek Yasser Tawfek 
> wrote:
>
> > Thanks, Alexander for your reply.
> >
> > Our solution initiated from this inquiry on Stack Overflow:
> >
> >
> https://stackoverflow.com/questions/52179898/does-flink-support-keyed-watermarks-if-not-is-there-any-plan-of-implementing-i
> >
> > The idea is to generate a watermark for each key (sub-stream), in order
> to
> > avoid the fast progress of the global watermark which affects low-rate
> > sources.
> >
> > Instead of using only one watermark (vanilla/global watermark), we
> changed
> > the API to allow moving the keyBy() before the
> > assignTimestampsAndWatermarks() so the stream will be partitioned then
> the
> > TimestampsAndWatermarkOperator will handle the generation of each
> watermark
> > for each key (source/sub-stream/partition).
> >
> > *Let's discuss further if you'd like. I have a presentation at a
> > conference; we can meet or whatever is suitable.*
> >
> > Also, I contacted David Anderson one year ago and he followed me step by
> > step and helped me a lot.
> >
> > I attached some messages with David.
> >
> >
> > *Thanks & BR,*
> >
> >
> > 
> >
> >
> >
> >
> >  Tawfik Yasser Tawfik
> >
> > * Teaching Assistant | AI-ITCS-NU*
> >
> >  Office: UB1-B, Room 229
> >
> >  26th of July Corridor, Sheikh Zayed 

Re: [DISCUSS] REST API behaviour when user main method throws error

2023-11-20 Thread David Morávek
Hi Danny,

> My current proposal is that the REST API should not leave the Flink
> cluster in an inconsistent state.

Regarding consistency, Flink only cares about individual jobs, but I can
see your point.

For streaming, this is probably something we could address by book-keeping
jobs submitted by the jar and canceling them on exception. This is not
robust to JM failure and would be subject to bug reports because there are
no guarantees (it could be addressed, but it's not straightforward; we've
spent years getting this correct for Application mode).
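The book-keeping idea could look roughly like the sketch below; `JobHandle` and `runMain` are hypothetical stand-ins for illustration, not Flink APIs:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch of "book-keep jobs submitted by the jar, cancel them on exception". */
class JarRunSketch {

    /** Stand-in for a handle to a successfully submitted job. */
    interface JobHandle {
        void cancel();
    }

    /** Runs the user main; if it throws, best-effort cancels every submitted job. */
    static List<JobHandle> runMain(Runnable userMain, List<JobHandle> submitted) {
        try {
            userMain.run();
            return submitted;
        } catch (RuntimeException e) {
            for (JobHandle h : submitted) {
                h.cancel(); // roll back to "no running jobs" before surfacing the error
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        List<JobHandle> submitted = new ArrayList<>();
        submitted.add(() -> System.out.println("cancelling job #1"));
        try {
            runMain(() -> { throw new RuntimeException("user main failed"); }, submitted);
        } catch (RuntimeException e) {
            System.out.println("surfaced: " + e.getMessage());
        }
    }
}
```

This covers Danny's case 3 (job submitted, then the main throws) but, as noted above, not side effects outside Flink's control or JM failure mid-cancellation.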

A bigger problem is that the main method could have the side-effects that
you don't know how to roll-back. For example, creating directories for
batch outputs and moving files. To make this 100% correct, we'd need to
introduce a new set of client-facing APIs.

I'm unaware of any framework that got this right (MR, Spark, etc. had
the same issue); you must solve HA for drivers/client programs first.

These are quick thoughts; something simple that works for 90% might be
worth pursuing, ignoring the corner cases.

Best,
D.

On Tue, Nov 14, 2023 at 10:00 AM Danny Cranmer 
wrote:

> Hey all,
>
> We run Flink clusters in session mode; we upload the user jar and then
> invoke "/jars/:jarid/run" [1] REST API endpoint. We have noticed a
> discrepancy in the run endpoint and were hoping to get some feedback from
> the community before proposing a FLIP or Jira to fix it.
>
> Some problem context: The "/jars/:jarid/run" endpoint runs the main()
> method in the user jar. When the call is successful the API will return the
> job ID (case 1). When the call throws an error, the API will return the
> error message (case 2). If a job is submitted successfully AND it throws an
> error, the result is a running job but the API returns the error message
> (case 3). There are two common cases that can result in this failure mode:
> 1/ the user code attempts to run multiple jobs, the first is successful and
> the second is rejected [2]. 2/ the user code throws an arbitrary exception
> after successfully submitting a job. Reproduction code for both is included
> below. For case 3 the client must 1/ run a jar, 2/ query for running jobs
> and 3/ decide how to proceed, either cleaning up or marking it as
> successful. This is overloading the responsibility of the client.
>
> My current proposal is that the REST API should not leave the Flink cluster
> in an inconsistent state. If the run command is successful we should have a
> running job, if the run command fails, we should not have any running jobs.
> There are a few ways to achieve this, but I would like to leave that
> discussion to the FLIP. Right now I am looking for feedback on the desired
> API behaviour.
>
> 1/ The user code attempts to run multiple jobs (Flink 1.15)
>
> public class MultipleJobs {
> public static final long ROWS_PER_SECOND = 1;
> public static final long TOTAL_ROWS = 1_000_000;
>
> public static void main(String[] args) throws Exception {
>
> StreamExecutionEnvironment env =
> StreamExecutionEnvironment.getExecutionEnvironment();
>
> env.addSource(new DataGeneratorSource<>(
> stringGenerator(32), ROWS_PER_SECOND, TOTAL_ROWS))
> .returns(String.class)
> .print();
>
> env.execute("Job #1");
>
> env.addSource(new DataGeneratorSource<>(
> stringGenerator(32), ROWS_PER_SECOND, TOTAL_ROWS))
> .returns(String.class)
> .print();
>
> env.execute("Job #2");
> }
> }
>
>
> 2/ The user code throws an arbitrary exception after successfully
> submitting a job (Flink 1.15)
>
> public class CustomerErrorJobSubmission {
> public static final long ROWS_PER_SECOND = 1;
> public static final long TOTAL_ROWS = 1_000_000;
>
> public static void main(String[] args) throws Exception {
>
> StreamExecutionEnvironment env =
> StreamExecutionEnvironment.getExecutionEnvironment();
>
> env.addSource(new DataGeneratorSource<>(stringGenerator(32),
> ROWS_PER_SECOND, TOTAL_ROWS))
> .returns(String.class)
> .print();
>
> env.execute("Job #1");
>
> throw new RuntimeException("REST API call will fail, but there
> will still be a job running");
> }
> }
>
>
> --
>
>
> Thanks,
> Danny
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/rest_api/#jars-jarid-run
> [2]
>
> https://github.com/apache/flink/blob/release-1.15/flink-clients/src/main/java/org/apache/flink/client/program/StreamContextEnvironment.java#L198
>


Re: Support AWS SDK V2 for Flink's S3 FileSystem

2023-10-20 Thread David Morávek
ving the FileSystem interface as part of the 2.0
> efforts as a follow-up.
>
>
>
> [1]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L695
>
> [2]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L706
>
> [3]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L773
>
>
>
> On Tue, Oct 3, 2023 at 6:25 PM Martijn Visser 
> wrote:
>
> +1 for David's suggestion. We should get away from the current
> approach with two abstractions and get to one rock solid one.
>
> On Mon, Oct 2, 2023 at 11:13 PM David Morávek  wrote:
> >
> > Hi Maomao,
> >
> > I wonder whether it would make sense to take a stab at consolidating the
> S3
> > filesystems instead and introduce a native one. The whole Hadoop wrapper
> > around the S3 client exists for legacy reasons, and it adds complexity
> and
> > probably an unnecessary performance penalty.
> >
> > If you take a look at the underlying presto implementation, it's actually
> > not too complex to adapt to Flink interfaces (since you're proposing to
> > maintain a copy of it anyway).
> >
> > Overall, the S3 FS is probably the most used one that we have so this
> could
> > be rather high impact. It would also eliminate user confusion when
> choosing
> > the implementation to use.
> >
> > WDYT?
> >
> > Best,
> > D.
> >
> > On Fri, Sep 29, 2023 at 2:41 PM Min, Maomao  >
> > wrote:
> >
> > > Hi Flink Dev,
> > >
> > > I’m Maomao, a developer from AWS EMR.
> > >
> > > Recently, our team is working on adding AWS SDK V2 support for Flink’s
> S3
> > > Filesystem. During development, we found out that our work was blocked
> by
> > > Presto. This is because that Presto still uses AWS SDK V1 and won’t add
> > > support for AWS SDK V2 in short term. To unblock, our team proposed
> several
> > > options and I’ve created a JIRA issue as here<
> > > https://issues.apache.org/jira/browse/FLINK-33157>.
> > >
> > > Since our team plans to contribute this work back to the community
> later,
> > > we’d like to collect feedback from the community about the options we
> > > proposed in the long term so that the community won’t need to duplicate
> > > this work in the future.
> > >
> > > Best,
> > > Maomao
> > >
> > >
>
>


Re: [Discuss] FLIP-362: Support minimum resource limitation

2023-10-04 Thread David Morávek
> If not, what is the difference between the spare resources and redundant
> taskmanagers?

I wasn't aware of this one; good catch! The main difference is that you
don't express the spare resources in terms of slots but in terms of task
managers. Also, those options serve slightly different purposes, and users
configuring the slot manager might not look for another option somewhere else.

> Secondly, IMHO the difference between min-reserved resource and spare
resources is that we could configure a rather large min-reserved resource

Agreed; in my mind, this boils down to the ability to quickly allocate new
slots (TMs). This might differ between environments though. In most cases,
there should be some time between interactive queries unless they're
submitted programmatically. I can see the value of having both (min + slots
to keep around).
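The "min + spare slots" combination reduces to simple sizing arithmetic. A hedged sketch, with every name invented for illustration (these are not proposed configuration options):

```java
/** Toy sizing rule: keep at least minSlots, plus spareSlots headroom above usage. */
class SlotSizingSketch {

    static int desiredTaskManagers(int usedSlots, int minSlots,
                                   int spareSlots, int slotsPerTm) {
        // Whichever is larger wins: the static floor, or current usage plus headroom.
        int desiredSlots = Math.max(minSlots, usedSlots + spareSlots);
        return (desiredSlots + slotsPerTm - 1) / slotsPerTm; // ceiling division
    }

    public static void main(String[] args) {
        // Idle cluster: the minimum dominates (8 slots, 4 per TM -> 2 TMs).
        System.out.println(desiredTaskManagers(0, 8, 2, 4));  // prints 2
        // Busy cluster: usage + spare dominates (10 + 2 = 12 slots -> 3 TMs).
        System.out.println(desiredTaskManagers(10, 8, 2, 4)); // prints 3
    }
}
```

This is why the two knobs complement each other: the minimum bounds cold-start latency on an idle cluster, while the spare headroom keeps latency flat as the number of jobs grows.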

All in all, I don't have a strong opinion here, it's a significant
improvement either way. This was just the first thing that I thought about
after reading the FLIP.

Best,
D.

On Tue, Oct 3, 2023 at 2:10 PM xiangyu feng  wrote:

> Hi David,
>
> Thx for your feedback.
>
> First of all, for keeping some spare resources around, do you mean
> 'Redundant TaskManagers'[1]? If not, what is the difference between the
> spare resources and redundant taskmanagers?
>
> Secondly, IMHO the difference between min-reserved resource and spare
> resources is that we could configure a rather large min-reserved resource
> for use cases submitting lots of short-lived jobs concurrently, but we
> don't want to configure a large spare resource since this might double the
> total resource usage and lead to resource waste.
>
> Looking forward to hearing from you.
>
> Regards,
> Xiangyu
>
> [1] https://issues.apache.org/jira/browse/FLINK-18625
>
> David Morávek  于2023年10月3日周二 05:00写道:
>
> > Hi Xiangyu,
> >
> > The sentiment of the FLIP makes sense, but I keep wondering whether this
> > is the best way to think about the problem. I assume that "interactive
> > session cluster" users always want to keep some spare resources around
> (up
> > to a configured threshold) to reduce cold start instead of statically
> > configuring the minimum.
> >
> > It's just a tiny change from the original proposal, but it could make all
> > the difference (eliminate overprovisioning, maintain latencies with a
> > growing # of jobs, ..)
> >
> > WDYT?
> >
> > Best,
> > D.
> >
> > On Mon, Sep 25, 2023 at 5:11 PM Jing Ge 
> > wrote:
> >
> >> Hi Yangze,
> >>
> >> Thanks for the clarification. The example of two batch jobs team up with
> >> one streaming job is interesting.
> >>
> >> Best regards,
> >> Jing
> >>
> >> On Wed, Sep 20, 2023 at 7:19 PM Yangze Guo  wrote:
> >>
> >> > Thanks for the comments, Jing.
> >> >
> >> > > Will the minimum resource configuration also take effect for
> streaming
> >> > jobs in application mode?
> >> > > Since it is not recommended to configure
> >> slotmanager.number-of-slots.max
> >> > for streaming jobs, does it make sense to disable it for common
> >> streaming
> >> jobs? At least disable the check to avoid the oscillation?
> >> >
> >> > Yes. The minimum resource configuration will only be disabled in
> >> > standalone clusters atm. I agree it makes sense to disable it for a pure
> >> > streaming job, however:
> >> > - By default, the minimum resource is configured to 0. If users do not
> >> > proactively set it, either the oscillation check or the minimum
> >> > restriction can be considered as disabled.
> >> > - The minimum resource is a cluster-level configuration rather than a
> >> > job-level configuration. If a user has an application with two batch
> >> > jobs preceding the streaming job, they may also require this
> >> > configuration to accelerate the execution of batch jobs.
> >> >
> >> > WDYT?
> >> >
> >> > Best,
> >> > Yangze Guo
> >> >
> >> > On Thu, Sep 21, 2023 at 4:49 AM Jing Ge 
> >> > wrote:
> >> > >
> >> > > Hi Xiangyu,
> >> > >
> >> > > Thanks for driving it! There is one thing I am not really sure if I
> >> > > understand you correctly.
> >> > >
> >> > > According to the FLIP: "The minimum resource limitation will be
> >> > implemented
> >> > > in the DefaultResourceAllocationStrategy of FineGrainedSlo

Re: Support AWS SDK V2 for Flink's S3 FileSystem

2023-10-02 Thread David Morávek
Hi Maomao,

I wonder whether it would make sense to take a stab at consolidating the S3
filesystems instead and introduce a native one. The whole Hadoop wrapper
around the S3 client exists for legacy reasons, and it adds complexity and
probably an unnecessary performance penalty.

If you take a look at the underlying presto implementation, it's actually
not too complex to adapt to Flink interfaces (since you're proposing to
maintain a copy of it anyway).

Overall, the S3 FS is probably the most used one that we have so this could
be rather high impact. It would also eliminate user confusion when choosing
the implementation to use.

WDYT?

Best,
D.

On Fri, Sep 29, 2023 at 2:41 PM Min, Maomao 
wrote:

> Hi Flink Dev,
>
> I’m Maomao, a developer from AWS EMR.
>
> Recently, our team is working on adding AWS SDK V2 support for Flink’s S3
> Filesystem. During development, we found out that our work was blocked by
> Presto. This is because that Presto still uses AWS SDK V1 and won’t add
> support for AWS SDK V2 in short term. To unblock, our team proposed several
> options and I’ve created a JIRA issue as here<
> https://issues.apache.org/jira/browse/FLINK-33157>.
>
> Since our team plans to contribute this work back to the community later,
> we’d like to collect feedback from the community about the options we
> proposed in the long term so that the community won’t need to duplicate
> this work in the future.
>
> Best,
> Maomao
>
>


Re: [Discuss] FLIP-362: Support minimum resource limitation

2023-10-02 Thread David Morávek
Hi Xiangyu,

The sentiment of the FLIP makes sense, but I keep wondering whether this is
the best way to think about the problem. I assume that "interactive session
cluster" users always want to keep some spare resources around (up to a
configured threshold) to reduce cold start instead of statically
configuring the minimum.

It's just a tiny change from the original proposal, but it could make all
the difference (eliminate overprovisioning, maintain latencies with a
growing # of jobs, ..)

WDYT?

Best,
D.

On Mon, Sep 25, 2023 at 5:11 PM Jing Ge  wrote:

> Hi Yangze,
>
> Thanks for the clarification. The example of two batch jobs team up with
> one streaming job is interesting.
>
> Best regards,
> Jing
>
> On Wed, Sep 20, 2023 at 7:19 PM Yangze Guo  wrote:
>
> > Thanks for the comments, Jing.
> >
> > > Will the minimum resource configuration also take effect for streaming
> > jobs in application mode?
> > > Since it is not recommended to configure
> slotmanager.number-of-slots.max
> > for streaming jobs, does it make sense to disable it for common streaming
> > jobs? At least disable the check for avoiding the oscillation?
> >
> > Yes. The minimum resource configuration will only be disabled in
> > standalone clusters atm. I agree it makes sense to disable it for a pure
> > streaming job, however:
> > - By default, the minimum resource is configured to 0. If users do not
> > proactively set it, either the oscillation check or the minimum
> > restriction can be considered as disabled.
> > - The minimum resource is a cluster-level configuration rather than a
> > job-level configuration. If a user has an application with two batch
> > jobs preceding the streaming job, they may also require this
> > configuration to accelerate the execution of batch jobs.
> >
> > WDYT?
> >
> > Best,
> > Yangze Guo
> >
> > On Thu, Sep 21, 2023 at 4:49 AM Jing Ge 
> > wrote:
> > >
> > > Hi Xiangyu,
> > >
> > > Thanks for driving it! There is one thing I am not really sure if I
> > > understand you correctly.
> > >
> > > According to the FLIP: "The minimum resource limitation will be
> > implemented
> > > in the DefaultResourceAllocationStrategy of FineGrainedSlotManager.
> > >
> > > Each time when SlotManager needs to reconcile the cluster resources or
> > > fulfill job resource requirements, the
> DefaultResourceAllocationStrategy
> > > will check if the minimum resource requirement has been fulfilled. If
> it
> > is
> > > not, DefaultResourceAllocationStrategy will request new
> > PendingTaskManagers
> > > and FineGrainedSlotManager will allocate new worker resources
> > accordingly."
> > >
> > > "To avoid this oscillation, we need to check the worker number derived
> > from
> > > minimum and maximum resource configuration is consistent before
> starting
> > > SlotManager."
> > >
> > > Will the minimum resource configuration also take effect for streaming
> > jobs
> > > in application mode? Since it is not recommended to
> > > configure slotmanager.number-of-slots.max for streaming jobs, does it
> > make
> > > sense to disable it for common streaming jobs? At least disable the
> check
> > > for avoiding the oscillation?
> > >
> > > Best regards,
> > > Jing
> > >
> > >
> > > On Tue, Sep 19, 2023 at 4:58 PM Chen Zhanghao <
> zhanghao.c...@outlook.com
> > >
> > > wrote:
> > >
> > > > Thanks for driving this, Xiangyu. We use Session clusters for quick
> SQL
> > > > debugging internally, and found cold-start job submission slow due to
> > lack
> > > > of the exact minimum resource reservation feature proposed here. This
> > > > should improve the experience a lot for running short-lived jobs in
> > session
> > > > clusters.
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > > 
> > > > From: Yangze Guo 
> > > > Sent: 19 September 2023 13:10
> > > > To: xiangyu feng 
> > > > Cc: dev@flink.apache.org 
> > > > Subject: Re: [Discuss] FLIP-362: Support minimum resource limitation
> > > >
> > > > Thanks for driving this @Xiangyu. This is a feature that many users
> > > > have requested for a long time. +1 for the overall proposal.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Tue, Sep 19, 2023 at 11:48 AM xiangyu feng 
> > > > wrote:
> > > > >
> > > > > Hi Devs,
> > > > >
> > > > > I'm opening this thread to discuss FLIP-362: Support minimum
> resource
> > > > limitation. The design doc can be found at:
> > > > > FLIP-362: Support minimum resource limitation
> > > > >
> > > > > Currently, the Flink cluster only requests Task Managers (TMs) when
> > > > there is a resource requirement, and idle TMs are released after a
> > certain
> > > > period of time. However, in certain scenarios, such as running
> > > > short-lived jobs in session clusters and scheduling batch jobs stage by
> > stage, we
> > > > need to improve the efficiency of job execution by maintaining a
> > certain
> > > > number of available workers in the cluster all the time.
> > > > >
> > > > > After discussing with Yangze, we introduced this new

Re: [DISCUSS] FLIP-370 : Support Balanced Tasks Scheduling

2023-10-02 Thread David Morávek
Hello Yuepeng,

The FLIP reads sane; nice work! To re-phrase my understanding:

The problem you're trying to solve only exists in complex graphs with
different per-vertex parallelism. If the parallelism is set globally
(assuming the pipeline has roughly even data skew), the algorithm could
make things slightly worse by eliminating some local exchanges. Is that
correct?

Where I'm headed with this is that there could be a hybrid strategy that
provides a reasonable default when the pipeline uses slot-sharing (for
per-vertex parallelism, use the new strategy; for global parallelism use
the old one). It's always a shame if improvements like this end up being a
power-user feature and very few workloads benefit from it. Any thoughts?

Best,
D.
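The balanced placement under discussion boils down to always putting the next subtask into the least-loaded slot. A toy sketch of that rule (illustrative Python, not the proposed Flink implementation):

```python
# Illustrative sketch of balanced task scheduling: place each subtask into
# the slot that currently holds the fewest tasks. Not Flink code.

def assign_balanced(parallelisms, num_slots):
    """parallelisms: per-vertex parallelism of each job vertex.
    Returns the resulting number of tasks in each slot."""
    slots = [0] * num_slots
    for parallelism in parallelisms:
        for _ in range(parallelism):
            # Pick the least-loaded slot for the next subtask.
            slots[slots.index(min(slots))] += 1
    return slots

# Three vertices with parallelisms 4, 2 and 1 over 4 slots:
print(assign_balanced([4, 2, 1], 4))  # [2, 2, 2, 1]
```

With a single global parallelism every slot ends up with the same task count anyway (`assign_balanced([8], 4)` gives `[2, 2, 2, 2]`), which is why the strategy mainly pays off for mixed per-vertex parallelisms.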

On Sun, Oct 1, 2023 at 1:38 PM Yangze Guo  wrote:

> Hi, Rui,
>
> 1. With the current mechanism, when physical slots are offered from
> TM, the JobMaster will start deploying tasks and synchronizing their
> states. With the addition of the waiting mechanism, IIUC, the
> JobMaster will deploy and synchronize the states of all tasks only
> after all resources are available. The task deployment and state
> synchronization both occupy the JobMaster's RPC main thread. In
> complex jobs with a lot of tasks, this waiting mechanism may increase
> the pressure on the JobMaster and increase the end-to-end job
> deployment time.
>
> 2. From my understanding, if users enable the
> cluster.evenly-spread-out-slots,
> LeastUtilizationResourceMatchingStrategy will be used to determine the
> slot distribution and the slot allocation in the three TM will be
> (taskmanager.numberOfTaskSlots=3):
> TM1: 3 slot
> TM2: 2 slot
> TM3: 2 slot
>
> Best,
> Yangze Guo
>
> On Sun, Oct 1, 2023 at 6:14 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi Shammon,
> >
> > Thanks for your feedback as well!
> >
> > > IIUC, the overall balance is divided into two parts: slot to TM and
> task
> > to slot.
> > > 1. Slot to TM is guaranteed by SlotManager in ResourceManager
> > > 2. Task to slot is guaranteed by the slot pool in JM
> > >
> > > These two are completely independent, what are the benefits of unifying
> > > these two into one option? Also, do we want to share the same
> > > option between SlotPool in JM and SlotManager in RM? This sounds a bit
> > > strange.
> >
> > Your understanding is totally right, the balance needs 2 parts: slot to
> TM
> > and task to slot.
> >
> > As I understand, the following are benefits of unifying them into one
> > option:
> >
> > - Flink users don't care about these principles inside of flink, they
> don't
> > know these 2 parts.
> > - If flink provides 2 options, flink users need to set 2 options for
> their
> > job.
> > - If one option is missed, the final result may not be good. (Users may
> > have questions when using)
> > - If flink just provides 1 option, enabling one option is enough. (Reduce
> > the probability of misconfiguration)
> >
> > Also, Flink’s options are user-oriented. Each option represents a switch
> or
> > parameter of a feature.
> > A feature may be composed of multiple components inside Flink.
> > It might be better to keep only one switch per feature.
> >
> > Actually, the cluster.evenly-spread-out-slots option is used between
> > SlotPool in JM and SlotManager in RM. 2 components to ensure
> > this feature works well.
> >
> > Please correct me if my understanding is wrong,
> > and looking forward to your feedback, thanks!
> >
> > Best,
> > Rui
> >
> > On Sun, Oct 1, 2023 at 5:52 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi Yangze,
> > >
> > > Thanks for your feedback!
> > >
> > > > 1. Is it possible for the SlotPool to get the slot allocation results
> > > > from the SlotManager in advance instead of waiting for the actual
> > > > physical slots to be registered, and perform pre-allocation? The
> > > > benefit of doing this is to make the task deployment process
> smoother,
> > > > especially when there are a large number of tasks in the job.
> > >
> > > Could you elaborate on that? I didn't understand what the benefit is
> > > or what "smoother" means here.
> > >
> > > > 2. If users enable the cluster.evenly-spread-out-slots, the issue in
> > > > example 2 of section 2.2.3 can be resolved. Do I understand it
> > > > correctly?
> > >
> > > The example assigned result is the final allocation result when flink
> > > user enables the cluster.evenly-spread-out-slots. We think the
> > > assigned result is expected, so I think your understanding is right.
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Sep 28, 2023 at 1:10 PM Shammon FY  wrote:
> > >
> > >> Thanks Yuepeng for initiating this discussion.
> > >>
> > >> +1 in general too, in fact we have implemented a similar mechanism
> > >> internally to ensure a balanced allocation of tasks to slots, it works
> > >> well.
> > >>
> > >> Some comments about the mechanism
> > >>
> > >> 1. This mechanism will be only supported in `SlotPool` or both
> `SlotPool`
> > >> and `DeclarativeSlotPool`? Currently the two 

Re: Re: [DISCUSS] FLIP-357: Deprecate Iteration API of DataStream

2023-09-05 Thread David Morávek
+1 since there is an alternative, more complete implementation available

Best,
D.

On Sat, Sep 2, 2023 at 12:07 AM David Anderson  wrote:

> +1
>
> Keeping the legacy implementation in place is confusing and encourages
> adoption of something that really shouldn't be used.
>
> Thanks for driving this,
> David
>
> On Fri, Sep 1, 2023 at 8:45 AM Jing Ge  wrote:
> >
> > Hi Wencong,
> >
> > Thanks for your clarification! +1
> >
> > Best regards,
> > Jing
> >
> > On Fri, Sep 1, 2023 at 12:36 PM Wencong Liu 
> wrote:
> >
> > > Hi Jing,
> > >
> > >
> > > Thanks for your reply!
> > >
> > >
> > > > Or the "independent module extraction" mentioned in the FLIP does
> mean an
> > > independent module in Flink?
> > >
> > >
> > > Yes. If there are submodules in the Flink repository that need the
> > > iteration (currently there are none),
> > > we could consider extracting it into a new submodule of Flink.
> > >
> > >
> > > > users will have to add one more dependency of Flink ML. If iteration
> is
> > > the
> > > only feature they need, it will look a little bit weird.
> > >
> > >
> > > If users only need to execute iteration jobs, they can simply remove
> the
> > > Flink
> > > dependency and add the necessary dependencies related to Flink ML.
> > > However,
> > > they can still utilize the DataStream API as it is also a dependency of
> > > Flink ML.
> > >
> > >
> > > Keeping an iteration submodule in the Flink repository and making
> > > Flink ML depend on it
> > > is also another solution. But the current implementation of Iteration
> > > in DataStream
> > > should definitely be removed due to its incompleteness.
> > >
> > >
> > > The placement of the Iteration API in the repository is a topic that
> has
> > > multiple
> > > potential solutions. WDYT?
> > >
> > >
> > > Best,
> > > Wencong
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > At 2023-09-01 17:59:34, "Jing Ge"  wrote:
> > > >Hi Wencong,
> > > >
> > > >Thanks for the proposal!
> > > >
> > > >"The Iteration API in DataStream is planned be deprecated in Flink
> 1.19
> > > and
> > > >then finally removed in Flink 2.0. For the users that rely on the
> > > Iteration
> > > >API in DataStream, they will have to migrate to Flink ML."
> > > >- Does it make sense to migrate the iteration module into Flink
> directly?
> > > >Or the "independent module extraction" mentioned in the FLIP does
> mean an
> > > >independent module in Flink? Since the iteration will be removed in
> Flink,
> > > >users will have to add one more dependency of Flink ML. If iteration
> is
> > > the
> > > >only feature they need, it will look a little bit weird.
> > > >
> > > >
> > > >Best regards,
> > > >Jing
> > > >
> > > >On Fri, Sep 1, 2023 at 11:05 AM weijie guo  >
> > > >wrote:
> > > >
> > > >> Thanks, +1 for this.
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Weijie
> > > >>
> > > >>
> > > >> Yangze Guo  wrote on Friday, 1 September 2023 at 14:29:
> > > >>
> > > >> > +1
> > > >> >
> > > >> > Thanks for driving this.
> > > >> >
> > > >> > Best,
> > > >> > Yangze Guo
> > > >> >
> > > >> > On Fri, Sep 1, 2023 at 2:00 PM Xintong Song <
> tonysong...@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > +1
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Xintong
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > On Fri, Sep 1, 2023 at 1:11 PM Dong Lin 
> > > wrote:
> > > >> > >
> > > >> > > > Thanks Wencong for initiating the discussion.
> > > >> > > >
> > > >> > > > +1 for the proposal.
> > > >> > > >
> > > >> > > > On Fri, Sep 1, 2023 at 12:00 PM Wencong Liu <
> liuwencle...@163.com
> > > >
> > > >> > wrote:
> > > >> > > >
> > > >> > > > > Hi devs,
> > > >> > > > >
> > > >> > > > > I would like to start a discussion on FLIP-357: Deprecate
> > > Iteration
> > > >> > API
> > > >> > > > of
> > > >> > > > > DataStream [1].
> > > >> > > > >
> > > >> > > > > Currently, the Iteration API of DataStream is incomplete.
> For
> > > >> > instance,
> > > >> > > > it
> > > >> > > > > lacks support
> > > >> > > > > for iteration in sync mode and exactly once semantics.
> > > >> Additionally,
> > > >> > it
> > > >> > > > > does not offer the
> > > >> > > > > ability to set iteration termination conditions. As a
> result,
> > > it's
> > > >> > hard
> > > >> > > > > for developers to
> > > >> > > > > build an iteration pipeline by DataStream in the practical
> > > >> > applications
> > > >> > > > > such as machine learning.
> > > >> > > > >
> > > >> > > > > FLIP-176: Unified Iteration to Support Algorithms [2] has
> > > >> introduced
> > > >> > a
> > > >> > > > > unified iteration library
> > > >> > > > > in the Flink ML repository. This library addresses all the
> > > issues
> > > >> > present
> > > >> > > > > in the Iteration API of
> > > >> > > > > DataStream and could provide solution for all the iteration
> > > >> > use-cases.
> > > >> > > > > However, maintaining two
> > > >> > > > > separate implementations of iteration in both the Flink
> > > repository
> > > >> > and
> > > >> > > > the
> > > >> > > > > Flink ML 

Re: Proposal for Implementing Keyed Watermarks in Apache Flink

2023-09-05 Thread David Morávek
Hi Tawfik,

It's exciting to see any ongoing research that tries to push Flink forward!

To get the discussion started, can you please share your paper with the
community? Assessing the proposal without further context is tough.

Best,
D.

On Mon, Sep 4, 2023 at 4:42 PM Tawfek Yasser Tawfek 
wrote:

> Dear Apache Flink Development Team,
>
> I hope this email finds you well. I am writing to propose an exciting new
> feature for Apache Flink that has the potential to significantly enhance
> its capabilities in handling unbounded streams of events, particularly in
> the context of event-time windowing.
>
> As you may be aware, Apache Flink has been at the forefront of Big Data
> Stream processing engines, leveraging windowing techniques to manage
> unbounded event streams effectively. The accuracy of the results obtained
> from these streams relies heavily on the ability to gather all relevant
> input within a window. At the core of this process are watermarks, which
> serve as unique timestamps marking the progression of events in time.
>
> However, our analysis has revealed a critical issue with the current
> watermark generation method in Apache Flink. This method, which operates at
> the input stream level, exhibits a bias towards faster sub-streams,
> resulting in the unfortunate consequence of dropped events from slower
> sub-streams. Our investigations showed that Apache Flink's conventional
> watermark generation approach led to an alarming data loss of approximately
> 33% when 50% of the keys around the median experienced delays. This loss
> further escalated to over 37% when 50% of random keys were delayed.
>
> In response to this issue, we have authored a research paper outlining a
> novel strategy named "keyed watermarks" to address data loss and
> substantially enhance data processing accuracy, achieving at least 99%
> accuracy in most scenarios.
>
> Moreover, we have conducted comprehensive comparative studies to evaluate
> the effectiveness of our strategy against the conventional watermark
> generation method, specifically in terms of event-time tracking accuracy.
>
> We believe that implementing keyed watermarks in Apache Flink can greatly
> enhance its performance and reliability, making it an even more valuable
> tool for organizations dealing with complex, high-throughput data
> processing tasks.
>
> We kindly request your consideration of this proposal. We would be eager
> to discuss further details, provide the full research paper, or collaborate
> closely to facilitate the integration of this feature into Apache Flink.
>
> Thank you for your time and attention to this proposal. We look forward to
> the opportunity to contribute to the continued success and evolution of
> Apache Flink.
>
> Best Regards,
>
> Tawfik Yasser
> Senior Teaching Assistant @ Nile University, Egypt
> Email: tyas...@nu.edu.eg
> LinkedIn: https://www.linkedin.com/in/tawfikyasser/
>
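The difference between a stream-level and a per-key watermark can be illustrated with a heavily simplified model (this sketch is not the algorithm from the paper; it ignores out-of-orderness and delay bounds entirely):

```python
# Toy illustration of the keyed-watermark idea: track a watermark per key
# instead of a single watermark for the whole stream.

def keyed_watermarks(events):
    """events: iterable of (key, event_time). Returns the maximum event
    time seen per key, the per-key watermark in this simplified model."""
    wm = {}
    for key, ts in events:
        wm[key] = max(wm.get(key, ts), ts)
    return wm

events = [("fast", 100), ("slow", 10), ("fast", 200), ("slow", 20)]
print(keyed_watermarks(events))                # {'fast': 200, 'slow': 20}
print(min(keyed_watermarks(events).values()))  # 20
```

In this toy model, a single stream-level watermark either stays pinned at 20 by the slow key, or — if it advances with the fast key, as conventional generation tends to — causes the slow key's events to be treated as late and dropped. Per-key watermarks decouple the two sub-streams, which is the data-loss scenario the proposal addresses.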


Re: [DISCUSS][2.0] Deprecating Accumulator in favor of MetricsGroup

2023-08-28 Thread David Morávek
AFAIK Apache Beam also used accumulators for metric collection, which is
indeed a major use case.

I’m not convinced that MetricGroup is fully replacing what accumulators
have to offer though; OperatorCoordinators might be able to replace the
remaining capabilities, but this needs a bit more thought. The missing part
there is that accumulators are part of the JobResult.
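A minimal sketch of that distinction (illustrative Python, not Flink's implementation): accumulators are per-subtask partial values that get merged and handed back as part of the final job result, which plain metrics don't provide:

```python
# Sketch of accumulator semantics: each subtask produces a partial value,
# and the merged view is exposed on the final job result. Illustrative only.

def merge_accumulators(per_subtask_results):
    """per_subtask_results: list of {accumulator_name: int} dicts, one per
    subtask. Returns the merged view a JobResult would expose."""
    merged = {}
    for partial in per_subtask_results:
        for name, value in partial.items():
            merged[name] = merged.get(name, 0) + value
    return merged

subtasks = [{"empty-fields": 3}, {"empty-fields": 5}, {"empty-fields": 0}]
print(merge_accumulators(subtasks))  # {'empty-fields': 8}
```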

On Tue 29. 8. 2023 at 6:12, Xintong Song  wrote:

> Thanks for bringing this up, Matthias.
>
> One thing that a user may achieve with an accumulator but not with a metric
> group is to programmatically fetch the job execution result, rather than
> outputting the results to an external sink, in attached mode. This can also
> be achieved by using CollectSink, which is still @Experimental and
> internally uses accumulators. So I guess it depends on 1) how stable we
> think CollectSink is now, and 2) how many users directly use accumulators
> rather than CollectSink and whether their requirements can be fully covered
> by CollectSink. For 2), we probably also need to involve the user@ ML.
>
> Best,
>
> Xintong
>
>
>
> On Wed, Aug 23, 2023 at 11:00 PM Matthias Pohl
>  wrote:
>
> > Hi everyone,
> > I was looking into serializing the ArchivedExecutionGraph for another
> FLIP
> > and came across Accumulators [1] (don't mix that one up with the window
> > accumulators of the Table/SQL API). Accumulators were introduced in Flink
> > quite a while ago in Stratosphere PR #340 [2].
> >
> > I had a brief chat with Chesnay about it who pointed out that there was
> an
> > intention to use this for collecting metrics in the past. The Accumulator
> > JavaDoc provides a hint that it was inspired by Hadoop's Counter concept
> > [3] which also sounds like it is more or less equivalent to Flink's
> > metrics.
> >
> > The Accumulator is currently accessible through the RuntimeContext
> > interface which provides addAccumuator [4] and getAccumulator [5]. Usages
> > for these messages appear in the following classes:
> > - CollectSinkFunction [6]: Here it's used to collect the final data when
> > closing the function. This feels like a misuse of the feature. Instead,
> the
> > CollectSink could block the close call until all data was fetched from
> the
> > client program.
> > - DataSet.collect() [7]: CollectHelper utilizes
> > SerializedListAccumulator to collect the final data similarly to
> > CollectSinkFunction
> > - EmptyFieldsCountAccumulator [8] is an example program that counts empty
> > fields. This could be migrated to MetricGroups
> > - ChecksumHashCodeHelper [9] is used in DataSetUtils where the calling
> > method is marked as deprecated for 2.0 already
> > - CollectOutputFormat [10] uses SerializedListAccumulator analogously to
> > DataSet.collect(). This class will be removed with the removal of the
> Scala
> > API in 2.0.
> >
> > The initial investigation brings me to the conclusion that we can remove
> > the Accumulator feature in favor of Metrics and proper collect
> > implementations. That would also help clean up the
> > (Archived)ExecutionGraph: IMHO, we should have a clear separation between
> > Metrics (which are part of the ExecutionGraph) and processed data (which
> > shouldn't be part of the ExecutionGraph).
> >
> > I'm curious what others think about this. Did I miss a scenario where
> > Accumulators are actually needed? Or is this already part of some other
> 2.0
> > effort [11] which I missed? I would suggest removing it could be a
> > nice-to-have item for 2.0.
> >
> > Best,
> > Matthias
> >
> >
> >
> > [1]
> >
> >
> https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/api/common/accumulators/Accumulator.java
> > <
> >
> https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/api/common/accumulators/Accumulator.java#L40
> > >
> > [2] https://github.com/stratosphere/stratosphere/pull/340
> > [3]
> >
> >
> https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Counters.html
> > [4]
> >
> >
> https://github.com/apache/flink/blob/63ee60859cac64f2bc6cfe2c5015ceb1199cea9c/flink-core/src/main/java/org/apache/flink/api/common/functions/RuntimeContext.java#L156
> > [5]
> >
> >
> https://github.com/apache/flink/blob/63ee60859cac64f2bc6cfe2c5015ceb1199cea9c/flink-core/src/main/java/org/apache/flink/api/common/functions/RuntimeContext.java#L165
> >
> > [6]
> >
> >
> https://github.com/apache/flink/blob/5ae8cb0503449b07f76d0ab621c3e81734496b26/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/collect/CollectSinkFunction.java#L304
> > [7]
> >
> >
> https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-java/src/main/java/org/apache/flink/api/java/Utils.java#L145
> > [8]
> >
> >
> 

Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

2023-07-04 Thread David Morávek
> waiting 2 min between 2 requirements push seems ok to me

This depends on the workload. Would you care if the cost of rescaling were
close to zero (which it is for most out-of-the-box workloads)? In that case,
it would be desirable to rescale more frequently, for example, if TMs join
incrementally.

Creating a value that covers everything is impossible unless it's
self-tuning, so I'd prefer having a smooth experience for people trying
things out (just imagine doing a demo at the conference) and having them
opt-in for longer cooldowns.


One idea to keep the timeouts lower while getting more balance would be
restarting the cooldown period when new resources or requirements are
received. This would also bring the cooldown's behavior closer to the
resource-stabilization timeout. Would that make sense?

> Depends on how you implement it. If you ignore all of shouldRescale, yes,
but you shouldn't do that in the first place.

That sounds great; let's go ahead and outline this in the FLIP.

Best,
D.
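The restart-the-cooldown-on-new-events behavior could be sketched as follows (a toy Python model, not the AdaptiveScheduler code):

```python
# Sketch of "restart the cooldown when new resources/requirements arrive":
# a rescale check fires only after a quiet period of `cooldown` time.
# Illustrative only.

def rescale_times(event_times, cooldown):
    """event_times: sorted times at which slots or requirements changed.
    Returns the times at which a rescale check would fire."""
    fires = []
    deadline = None
    for t in event_times:
        if deadline is not None and t >= deadline:
            fires.append(deadline)
        deadline = t + cooldown  # every event pushes the deadline out
    if deadline is not None:
        fires.append(deadline)
    return fires

# Events at t=0, 1, 2 with a cooldown of 5: the deadline keeps sliding,
# so a single rescale fires at t=7 instead of three separate ones.
print(rescale_times([0, 1, 2], 5))  # [7]
```

Every event within the cooldown just pushes the deadline out, so bursts of incremental TM registrations collapse into one rescale shortly after the burst ends, much like the resource-stabilization timeout behaves.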


On Tue, Jul 4, 2023 at 12:30 PM Etienne Chauchot 
wrote:

> Hi all,
>
> Thanks David for your feedback. My comments are inline
>
> > On 04/07/2023 at 09:16, David Morávek wrote:
> >> They will struggle if they add new resources and nothing happens for 5
> > minutes.
> >
> > The same applies if they start playing with FLIP-291 APIs. I'm wondering
> if
> > the cooldown makes sense there since it was the user's deliberate choice
> to
> push new requirements.
>
>
> Sure, but remember that the initial rescale is always done immediately.
> Only the time between 2 rescales is controlled by the cooldown period. I
> don't see a user adding resources every 10s (your proposed default
> value) or even with, let's say 2 min, waiting 2 min between 2
> requirements push seems ok to me.
>
>
> >
> > Best,
> > D.
> >
> > On Tue, Jul 4, 2023 at 9:11 AM David Morávek  wrote:
> >
> >> The FLIP reads sane to me. I'm unsure about the default values, though;
> 5
> >> minutes of wait time between rescales feels rather strict, and we should
> >> rethink it to provide a better out-of-the-box experience.
> >>
> >> I'd focus on newcomers trying AS / Reactive Mode out. They will struggle
> >> if they add new resources and nothing happens for 5 minutes. I'd suggest
> >> defaulting to
> >> *jobmanager.adaptive-scheduler.resource-stabilization-timeout* (which
> >> defaults to 10s).
>
>
> If users add resources, the rescale will happen right away. It is only
> for subsequent additions that they will have to wait for the cooldown
> period to end.
>
> But anyway, we could lower the default value, I just took what Robert
> suggested in the ticket.
>
>
> >>
> >> I'm still struggling to grasp max internal (force rescale). Ignoring
> `AdaptiveScheduler#shouldRescale()`
> >> condition seems rather dangerous. Wouldn't a simple case where you add a
> >> new TM and remove it before the max interval is reached (so there is
> >> nothing to do) result in an unnecessary job restart?
>
> With the current behavior (on master): adding the TM will result in
> restarting if the number of slots added leads to a job parallelism
> increase of more than 2. Then removing it can have 2 consequences:
> either it is removed before the resource-stabilisation timeout and there
> will be no restart. Or it is removed after this timeout (the job is in
> Running state) and it will entail another restart and parallelism decrease.
>
> With the proposed behavior: what the scaling-interval.max will change is
> only on the resource addition part: when the TM is added, if the time
> since last rescale > scaling-interval.max, then a restart and
> parallelism increase will be done even if it leads to a parallelism
> increase < 2. The rest regarding TM removal does not change.
>
> => So, the real difference with the current behavior is ** if the slots
> addition was too little ** : in the current behavior nothing happens. In
> the new behavior nothing happens unless the addition arrives after
> scaling-interval.max.
>
>
> Best
>
> Etienne
>
> >>
> >> Best,
> >> D.
> >>
> >> On Thu, Jun 29, 2023 at 3:43 PM Etienne Chauchot
> >> wrote:
> >>
> >>> Thanks Chesnay for your feedback. I have updated the FLIP. I'll start a
> >>> vote thread.
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>> Le 28/06/2023 à 11:49, Chesnay Schepler a écrit :
> >>>>> we should schedule a check that will rescale if
> >>>> min-parallelism-increase is met. Then, what i

Re: Working improving the REST API

2023-07-04 Thread David Morávek
I've left some comments on the PR.

On Tue, Jul 4, 2023 at 9:20 AM Martijn Visser 
wrote:

> Hi Hong,
>
> Given that this changes the REST API, which is a public interface, I'm
> wondering if this shouldn't first have had a (small) FLIP if I follow the
> guidelines from the overview page [1].
>
> Best regards,
>
> Martijn
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> On Mon, Jul 3, 2023 at 6:04 PM Teoh, Hong 
> wrote:
>
> > Just adding the updates from the Slack thread -
> > @SamratDeb and @JulietLee have expressed interest, and there is a PR
> ready
> > for review!
> > https://github.com/apache/flink/pull/22901
> >
> > Regards,
> > Hong
> >
> >
> > On 3 Jul 2023, at 16:56, Teoh, Hong 
> wrote:
> >
> >
> >
> >
> > Hi all,
> >
> > (This is a cross-post from a Slack message posted in #dev [1]. Posting
> > here for visibility)
> >
> > Recently I’ve been looking into improving the REST API that Flink has, to
> > make it more generally available for programmatic access (e.g. control
> > functions) rather than just for the Flink dashboard.
> >
> > In particular, various APIs seem to consume from the ExecutionGraph
> cache,
> > which is useful when displaying on a Flink dashboard (to guarantee
> > consistent behaviour), but since there is no way to “force” the latest
> > result, it might be not very useful for a control-function that wants to
> > retrieve the latest data. This could be achieved via Cache-Control HTTP
> > headers, as mentioned on this thread [2].
> >
> > Looking for any contributors / committers who are interested in working
> on
> > these items together! (To bounce ideas / deliver this faster)
> >
> >
> > [1]
> https://apache-flink.slack.com/archives/C03GV7L3G2C/p1688051972688149
> > [2] https://lists.apache.org/thread/7o330hfyoqqkkrfhtvz3kp448jcspjrm
> >
> >
> >
> > Regards,
> > Hong
> >
> >
> >
>
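For reference, the client side of the Cache-Control idea from [2] might look like this (a hedged sketch: the endpoint is a placeholder, and Flink's REST API does not currently honor this header):

```python
# Sketch of a client asking for a fresh (uncached) result via a
# Cache-Control request header. The endpoint path is a placeholder.
import urllib.request

req = urllib.request.Request(
    "http://localhost:8081/jobs/JOB_ID",    # placeholder endpoint
    headers={"Cache-Control": "no-cache"},  # ask the server to bypass its cache
)
# urllib stores header names capitalized, hence the odd casing below.
print(req.get_header("Cache-control"))  # no-cache
```

The request is only constructed here, not sent; the header is what a server-side implementation would inspect before deciding whether to serve from the ExecutionGraph cache.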


Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

2023-07-04 Thread David Morávek
> They will struggle if they add new resources and nothing happens for 5
minutes.

The same applies if they start playing with FLIP-291 APIs. I'm wondering if
the cooldown makes sense there since it was the user's deliberate choice to
push new requirements.

Best,
D.

On Tue, Jul 4, 2023 at 9:11 AM David Morávek  wrote:

> The FLIP reads sane to me. I'm unsure about the default values, though; 5
> minutes of wait time between rescales feels rather strict, and we should
> rethink it to provide a better out-of-the-box experience.
>
> I'd focus on newcomers trying AS / Reactive Mode out. They will struggle
> if they add new resources and nothing happens for 5 minutes. I'd suggest
> defaulting to
> *jobmanager.adaptive-scheduler.resource-stabilization-timeout* (which
> defaults to 10s).
>
> I'm still struggling to grasp max internal (force rescale). Ignoring 
> `AdaptiveScheduler#shouldRescale()`
> condition seems rather dangerous. Wouldn't a simple case where you add a
> new TM and remove it before the max interval is reached (so there is
> nothing to do) result in an unnecessary job restart?
>
> Best,
> D.
>
> On Thu, Jun 29, 2023 at 3:43 PM Etienne Chauchot 
> wrote:
>
>> Thanks Chesnay for your feedback. I have updated the FLIP. I'll start a
>> vote thread.
>>
>> Best
>>
>> Etienne
>>
>> > On 28/06/2023 at 11:49, Chesnay Schepler wrote:
>> > > we should schedule a check that will rescale if
>> > min-parallelism-increase is met. Then, what it the use of
>> > scaling-interval.max timeout in that context ?
>> >
>> > To force a rescale if min-parallelism-increase is not met (but we
>> > could still run above the current parallelism).
>> >
>> > min-parallelism-increase is a trade-off between the cost of rescaling
>> > vs the performance benefit of the parallelism increase. Over time the
>> > balance tips more and more in favor of the parallelism increase, hence
>> > we should eventually rescale anyway even if the minimum isn't met, or
>> > at least give users the option to do so.
>> >
>> > > I meant the opposite: not having only the cooldown but having only
>> > the stabilization time. I must have missed something because what I
>> > wonder is: if every rescale entails a restart of the pipeline and
>> > every restart entails passing in waiting for resources state, then why
>> > introduce a cooldown when there is already at each rescale a stable
>> > resource timeout ?
>> >
>> > It is technically correct that the stable resource timeout can be used
>> > to limit the number of rescale operations per interval, however during
>> > that time the job isn't running, in contrast to the cooldown.
>> >
>> > Having both just gives you a lot more flexibility.
>> > "I want at most 1 rescale operation per hour, and wait at most 1
>> > minute for resource to stabilize when a rescale happens".
>> > You can't express this with only one of the options.
>> >
>> > On 20/06/2023 14:41, Etienne Chauchot wrote:
>> >> Hi Chesnay,
>> >>
>> >> Thanks for your feedback. Comments inline
>> >>
>> >>> On 16/06/2023 at 17:24, Chesnay Schepler wrote:
>> >>> 1) Options specific to the adaptive scheduler should start with
>> >>> "jobmanager.adaptive-scheduler".
>> >>
>> >>
>> >> ok
>> >>
>> >>
>> >>> 2)
>> >>> There isn't /really /a notion of a "scaling event". The scheduler is
>> >>> informed about new/lost slots and job failures, and reacts
>> >>> accordingly by maybe rescaling the job.
>> >>> (sure, you can think of these as events, but you can think of
>> >>> practically everything as events)
>> >>>
>> >>> There shouldn't be a queue for events. All the scheduler should have
>> >>> to know is that the next rescale check is scheduled for time T,
>> >>> which in practice boils down to a flag and a scheduled action that
>> >>> runs Executing#maybeRescale.
>> >>
>> >>
>> >> Makes total sense, its very simple like this. Thanks for the
>> >> precision and pointer. After the related FLIPs, I'll look at the code
>> >> now.
>> >>
>> >>
>> >>> With that in mind, we also have to look at how we keep this state
>> >>> around. Presumably it is scoped to the current state, such that the
>> >>> cooldow

Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

2023-07-04 Thread David Morávek
The FLIP reads sane to me. I'm unsure about the default values, though; 5
minutes of wait time between rescales feels rather strict, and we should
rethink it to provide a better out-of-the-box experience.

I'd focus on newcomers trying AS / Reactive Mode out. They will struggle if
they add new resources and nothing happens for 5 minutes. I'd suggest
defaulting to *jobmanager.adaptive-scheduler.resource-stabilization-timeout*
(which
defaults to 10s).

I'm still struggling to grasp the max interval (force rescale). Ignoring
`AdaptiveScheduler#shouldRescale()`
condition seems rather dangerous. Wouldn't a simple case where you add a
new TM and remove it before the max interval is reached (so there is
nothing to do) result in an unnecessary job restart?

Best,
D.

On Thu, Jun 29, 2023 at 3:43 PM Etienne Chauchot 
wrote:

> Thanks Chesnay for your feedback. I have updated the FLIP. I'll start a
> vote thread.
>
> Best
>
> Etienne
>
> > On 28/06/2023 at 11:49, Chesnay Schepler wrote:
> > > we should schedule a check that will rescale if
> > min-parallelism-increase is met. Then, what is the use of
> > scaling-interval.max timeout in that context ?
> >
> > To force a rescale if min-parallelism-increase is not met (but we
> > could still run above the current parallelism).
> >
> > min-parallelism-increase is a trade-off between the cost of rescaling
> > vs the performance benefit of the parallelism increase. Over time the
> > balance tips more and more in favor of the parallelism increase, hence
> > we should eventually rescale anyway even if the minimum isn't met, or
> > at least give users the option to do so.
> >
> > > I meant the opposite: not having only the cooldown but having only
> > the stabilization time. I must have missed something because what I
> > wonder is: if every rescale entails a restart of the pipeline and
> > every restart entails passing in waiting for resources state, then why
> > introduce a cooldown when there is already at each rescale a stable
> > resource timeout ?
> >
> > It is technically correct that the stable resource timeout can be used
> > to limit the number of rescale operations per interval, however during
> > that time the job isn't running, in contrast to the cooldown.
> >
> > Having both just gives you a lot more flexibility.
> > "I want at most 1 rescale operation per hour, and wait at most 1
> > minute for resource to stabilize when a rescale happens".
> > You can't express this with only one of the options.
> >
> > On 20/06/2023 14:41, Etienne Chauchot wrote:
> >> Hi Chesnay,
> >>
> >> Thanks for your feedback. Comments inline
> >>
> >>> On 16/06/2023 at 17:24, Chesnay Schepler wrote:
> >>> 1) Options specific to the adaptive scheduler should start with
> >>> "jobmanager.adaptive-scheduler".
> >>
> >>
> >> ok
> >>
> >>
> >>> 2)
> >>> There isn't /really /a notion of a "scaling event". The scheduler is
> >>> informed about new/lost slots and job failures, and reacts
> >>> accordingly by maybe rescaling the job.
> >>> (sure, you can think of these as events, but you can think of
> >>> practically everything as events)
> >>>
> >>> There shouldn't be a queue for events. All the scheduler should have
> >>> to know is that the next rescale check is scheduled for time T,
> >>> which in practice boils down to a flag and a scheduled action that
> >>> runs Executing#maybeRescale.
> >>
> >>
> >> Makes total sense, its very simple like this. Thanks for the
> >> precision and pointer. After the related FLIPs, I'll look at the code
> >> now.
> >>
> >>
> >>> With that in mind, we also have to look at how we keep this state
> >>> around. Presumably it is scoped to the current state, such that the
> >>> cooldown is reset if a job fails.
> >>> Maybe we should add a separate ExecutingWithCooldown state; not sure
> >>> yet.
> >>
> >>
> >> Yes, losing cooldown state and cooldown reset upon failure is what I
> >> suggested in point 3 in previous email. Not sure either for a new
> >> state, I'll figure it out after experimenting with the code. I'll
> >> update the FLIP then.
> >>
> >>
> >>>
> >>> It would be good to clarify whether this FLIP only attempts to cover
> >>> scale up operations, or also scale downs in case of slot losses.
> >>
> >>
> >> When there are slots loss, most of the time it is due to a TM loss so
> >> there should be several slots lost at the same time but (hopefully)
> >> only once. There should not be many scale downs in a row (but still
> >> cascading failures can happen). I think, we should just protect
> >> against having scale ups immediately following. For that, I think we
> >> could just keep the current behavior of transitioning to Restarting
> >> state and then back to Waiting for Resources state. This state will
> >> protect us against scale ups immediately following failure/restart.
> >>
> >>
> >>>
> >>> We should also think about how it relates to the externalized
> >>> declarative resource management. Should we always rescale
> >>> immediately? Should we wait 
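The interplay discussed above ("at most 1 rescale operation per hour, and wait at most 1 minute for resources to stabilize") could be sketched in flink-conf.yaml roughly as follows. The scaling-interval option names follow the FLIP-322 discussion in this thread and may differ from what was eventually merged:

```yaml
# Cooldown: at most one rescale operation per hour (FLIP-322 proposal).
jobmanager.adaptive-scheduler.scaling-interval.min: 1 h
# Eventually force a rescale even if min-parallelism-increase is not met.
jobmanager.adaptive-scheduler.scaling-interval.max: 4 h
# Wait at most one minute for resources to stabilize when a rescale happens.
jobmanager.adaptive-scheduler.resource-stabilization-timeout: 1 min
```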

Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler

2023-07-04 Thread David Morávek
Hmm, sorry for the confusion; it seems that Google is playing games with me
(I see this chained under the old [DISCUSS] thread), but it seems correct
in the mail archive [1] :/

Just ignore the -1 above.

[1] https://lists.apache.org/thread/22fovrkmzcvzblcohhtsp5t96vd64obj

On Tue, Jul 4, 2023 at 8:49 AM David Morávek  wrote:

> The vote closes within 6 hours and, as of now, there was no vote. This
>> is a very short FLIP, that takes a few minutes to read.
>>
>
> Maybe because there should have been a dedicated voting thread (marked as
> [VOTE]), this one was hidden and hard to notice.
>
> We should restart the vote with proper mechanics to allow everyone to
> participate.
>
> Soft -1 from my side until there is a proper voting thread.
>
> Best,
> D.
>
> On Mon, Jul 3, 2023 at 4:40 PM ConradJam  wrote:
>
>> +1 (no-binding)
>>
>> Etienne Chauchot wrote on Mon, Jul 3, 2023 at 15:57:
>>
>> > Hi all,
>> >
>> > The vote closes within 6 hours and, as of now, there was no vote. This
>> > is a very short FLIP, that takes a few minutes to read.
>> >
>> > Please cast your vote so that the development could start.
>> >
>> > Thanks.
>> >
>> > Best
>> >
>> > Etienne
>> >
>> > On 29/06/2023 at 15:51, Etienne Chauchot wrote:
>> > >
>> > > Hi all,
>> > >
>> > > Thanks for all the feedback about the FLIP-322: Cooldown period for
>> > > adaptive scheduler [1].
>> > >
>> > > This FLIP was discussed in [2].
>> > >
>> > > I'd like to start a vote for it. The vote will be open for at least 72
>> > > hours (until July 3rd 14:00 GMT) unless there is an objection or
>> > > insufficient votes.
>> > >
>> > > [1]
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
>> > > [2] https://lists.apache.org/thread/qvgxzhbp9rhlsqrybxdy51h05zwxfns6
>> > >
>> > > Best,
>> > >
>> > > Etienne
>> > >
>>
>


Re: [DISCUSS] Flink REST API improvements

2023-06-26 Thread David Morávek
Hi Hong,

Thanks for starting the discussion.

seems to be using the cached version of the entire Execution graph (stale
> data), when it could just use the CheckpointStatsCache directly


CheckpointStatsCache is also populated using the "cached execution graph,"
so there is nothing to gain from the "staleness" pov; see
AbstractCheckpointHandler for more details.

Anyone aware of a reason we don’t do this already?
>

The CheckpointStatsCache is populated lazily on the request for a
particular checkpoint (so it might not have a full view); the used data
structure is also slightly different; one more thing is that
CheckpointStatsCache is meant for a different purpose -> keeping a particular
checkpoint around while it's being investigated. Otherwise, it might
expire; using it for "overview" would break this.

Configuration for web.refresh-interval controls both dashboard refresh rate
> and ExecutionGraph cache
>

This sounds reasonable as long as it falls back to "web.refresh-interval"
when not defined. For consistency reasons, it should also be named
"rest.cache-timeout"


> Cache-Control on the HTTP headers.
>

In general, I'd be in favor of this ("rest.cache-timeout" would then need
to become "rest.default-cache-timeout"), but I need to see a detailed FLIP
because in my mind this could get quite complicated.

Best,
D.

On Fri, Jun 23, 2023 at 6:26 PM Teoh, Hong 
wrote:

> Hi all,
>
> I have been looking at the Flink REST API implementation, and had some
> questions on potential improvements. Looking to gather some thoughts:
>
> 1. Only use what is necessary. The GET /checkpoints API seems to be using
> the cached version of the entire Execution graph (stale data), when it
> could just use the CheckpointStatsCache directly. I am thinking of doing
> this refactoring. Anyone aware of a reason we don’t do this already?
> 2. Configuration for web.refresh-interval controls both dashboard refresh
> rate and ExecutionGraph cache. I am thinking of introducing a new
> configuration, rest.cache.timeout
> 3. Cache-Control on the HTTP headers. Seems like we are using caches in
> our REST endpoint. It would be step in the right direction to introduce
> cache-control in our REST API headers, so that we can improve the
> programmatic access of the Flink REST API.
>
>
> Looking forwards to hearing people’s thoughts.
>
> Regards,
> Hong
>
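For the third point, a hypothetical exchange showing what Cache-Control on the REST API could look like. The header and its values here are illustrative assumptions, not current Flink behavior:

```http
GET /jobs/<jobid>/checkpoints HTTP/1.1
Host: jobmanager:8081

HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: max-age=3, must-revalidate
```

A client could then honor max-age instead of polling the endpoint on every refresh, and the server side could derive the value from the proposed rest.cache.timeout option.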
>


Re: [DISCUSS] FLIP-294: Support Customized Job Meta Data Listener

2023-06-06 Thread David Morávek
Hi,

Thanks for the FLIP! Data lineage is an important problem to tackle.

Can you please expand on how this is planned to be wired into the
JobManager? As I understand, the listeners will be configured globally (per
cluster), so this won't introduce a new code path for running per-job /
per-session user code. Is that correct?

Best,
D

On Tue, Jun 6, 2023 at 9:17 AM Leonard Xu  wrote:

> Thanks Shammon for driving this FLIP forward, I’ve several comments about
> the updated FLIP.
>
> 1. CatalogModificationContext is introduced as a class instead of an
> interface; is this a typo?
>
> 2. The FLIP defines multiple  Map config();  methods in
> some Context classes. Could we use  Configuration getConfiguration();  instead?
> The class org.apache.flink.configuration.Configuration is recommended as it’s a
> public API and offers more useful methods as well.
>
> 3. The Context of CatalogModificationListenerFactory should be an
> interface too, and getUserClassLoader()
> would be more aligned with Flink’s naming style.
>
>
> Best,
> Leonard
>
> > On May 26, 2023, at 4:08 PM, Shammon FY  wrote:
> >
> > Hi devs,
> >
> > We would like to bring up a discussion about FLIP-294: Support Customized
> > Job Meta Data Listener[1]. We have had several discussions with Jark Wu,
> > Leonard Xu, Dong Lin, Qingsheng Ren and Poorvank about the functions and
> > interfaces, and thanks for their valuable advice.
> > The overall job and connector information is divided into metadata and
> > lineage; this FLIP focuses on metadata, and lineage will be discussed in
> > another FLIP in the future. In this FLIP we want to add a customized
> > listener in Flink to report catalog modifications to external metadata
> > systems such as datahub[2] or atlas[3]. Users can view the specific
> > information of connectors such as source and sink for Flink jobs in these
> > systems, including fields, watermarks, partitions, etc.
> >
> > Looking forward to hearing from you, thanks.
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-294%3A+Support+Customized+Job+Meta+Data+Listener
> > [2] https://datahub.io/
> > [3] https://atlas.apache.org/#/
>
>
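To make the interface comments concrete, here is a minimal, self-contained sketch of the listener shape under discussion (interface-based Context, a Configuration accessor instead of a plain Map). All names and signatures are illustrative stand-ins, not the FLIP's final API:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.flink.configuration.Configuration.
interface Configuration {
    String getString(String key, String defaultValue);
}

// Context as an interface, per the comments above.
interface CatalogModificationContext {
    Configuration getConfiguration();
    ClassLoader getUserClassLoader();
}

// A catalog change event; a real event would carry the catalog object itself.
interface CatalogModificationEvent {
    String describe(); // e.g. "CREATE TABLE db.orders"
}

interface CatalogModificationListener {
    void onEvent(CatalogModificationEvent event, CatalogModificationContext context);
}

// Trivial listener that records events; a real implementation would push them
// to an external metadata system such as DataHub or Atlas.
class RecordingListener implements CatalogModificationListener {
    final List<String> seen = new ArrayList<>();

    @Override
    public void onEvent(CatalogModificationEvent event, CatalogModificationContext context) {
        seen.add(event.describe());
    }
}
```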


Re: Call for help on the Web UI (In-Place Rescaling)

2023-05-22 Thread David Morávek
+1 these things already popped up during the initial review by Chesnay, but
I wasn't able to visualize them in a way that didn't look completely awful
:)

I hope that someone will be able to pick this up.

Best,
D.

On Fri, May 19, 2023 at 1:30 PM Jing Ge  wrote:

> Hi David,
>
> Thanks for driving this. I'd like to second Emre, especially the second
> suggestion. In practice, the parallelism number could be big. Users will be
> frustrated if they have to click a hundred or even more times in order to
> reach the desired number. Whoever takes over the design task, please take
> this into account while designing the new UX/UI. Thanks!
>
> Best regards,
> Jing
>
>
> On Fri, May 19, 2023 at 1:07 PM Kartoglu, Emre  >
> wrote:
>
> > Hi David,
> >
> > This looks awesome. I am no expert on UI/UX, but still have opinions 
> >
> > I normally use the Overview tab for monitoring Flink jobs, and having
> > control inputs there breaks my assumption that Overview is “read-only”
> and
> > for “watching”.
> > Having said that for “educational purposes” that might actually be a good
> > place - I am imagining there would be an “educationalMode: true” flag or
> > something somewhere to enable these buttons (and other educational bits
> in
> > future).
> >
> > The “educational purpose” bit makes me a lot more relaxed about having
> > those buttons as they are in the video!
> >
> > Couple other things to consider:
> >
> >
> >   *   Confirming new parallelism before actually doing it, e.g. having a
> > “Deploy/Commit/Save” button
> >   *   Allow users to enter parallelism without having to
> > increment/decrement one by one
> >
> > Thanks,
> > Emre
> >
> > On 2023/05/19 06:49:08 David Morávek wrote:
> > > Hi Everyone,
> > >
> > > In FLINK-31471, we've introduced new "in-place rescaling features" to
> the
> > > Web UI that show up when the scheduler supports FLIP-291 REST
> endpoints.
> > >
> > > I expect this to be a significant feature for user education (they have
> > an
> > > easy way to try out how rescaling behaves, especially in combination
> > with a
> > > backpressure monitor) and marketing (read as "we can do fancy demos").
> > >
> > > However, the current sketch is not optimal due to my lack of UI/UX
> > skills.
> > >
> > > Are there any volunteers that could and would like to help polish this?
> > >
> > > Here is a short demo [2] of what the current implementation can do.
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-31471
> > > [2] https://www.youtube.com/watch?v=B1NVDTazsZY
> > >
> > > Best,
> > > D.
> > >
> >
>


Re: [DISCUSS] Preventing Mockito usage for the new code with Checkstyle

2023-04-27 Thread David Morávek
Thanks, everyone, for participating. There seems to be a broad consensus,
so I'll move forward. I've created [1] and [2] to track this.

[1] https://issues.apache.org/jira/browse/FLINK-31954
[2] https://issues.apache.org/jira/browse/FLINK-31955

Best,
D.

On Thu, Apr 27, 2023 at 9:25 AM weijie guo 
wrote:

> +1 for introducing this rule for junit4 and mockito.
>
> Best regards,
>
> Weijie
>
>
> Alexander Fedulov wrote on Wed, Apr 26, 2023 at 23:50:
>
> > +1 for the proposal,
> >
> > Best,
> > Alex
> >
> > On Wed, 26 Apr 2023 at 15:50, Chesnay Schepler 
> wrote:
> >
> > > * adds a note to not include "import " in the regex *
> > >
> > > On 26/04/2023 11:22, Maximilian Michels wrote:
> > > > If we ban Mockito imports, I can still write tests using the full
> > > > qualifiers, right?
> > > >
> > > > For example:
> > > >
> > >
> >
> org.mockito.Mockito.when(somethingThatShouldHappen).thenReturn(somethingThatNeverActuallyHappens)
> > > >
> > > > Just kidding, +1 on the proposal.
> > > >
> > > > -Max
> > > >
> > > > On Wed, Apr 26, 2023 at 9:02 AM Panagiotis Garefalakis
> > > >  wrote:
> > > >> Thanks for bringing this up!  +1 for the proposal
> > > >>
> > > >> @Jing Ge -- we don't necessarily need to completely migrate to
> Junit5
> > > (even
> > > >> though it would be ideal).
> > > >> We could introduce the checkstyle rule and add suppressions for the
> > > >> existing problematic paths (as we do today for other rules e.g.,
> > > >> AvoidStarImport)
> > > >>
> > > >> Cheers,
> > > >> Panagiotis
> > > >>
> > > >> On Tue, Apr 25, 2023 at 11:48 PM Weihua Hu 
> > > wrote:
> > > >>
> > > >>> Thanks for driving this.
> > > >>>
> > > >>> +1 for Mockito and Junit4.
> > > >>>
> > > >>> A clear checkstyle rule will be of great help to new developers.
> > > >>>
> > > >>> Best,
> > > >>> Weihua
> > > >>>
> > > >>>
> > > >>> On Wed, Apr 26, 2023 at 1:47 PM Jing Ge  >
> > > >>> wrote:
> > > >>>
> > > >>>> This is a great idea, thanks for bringing this up. +1
> > > >>>>
> > > >>>> Also +1 for Junit4. If I am not mistaken, it could only be done
> > after
> > > the
> > > >>>> Junit5 migration is done.
> > > >>>>
> > > >>>> @Chesnay thanks for the hint. Do we have any doc about it? If not,
> > it
> > > >>> might
> > > >>>> deserve one. WDYT?
> > > >>>>
> > > >>>> Best regards,
> > > >>>> Jing
> > > >>>>
> > > >>>> On Wed, Apr 26, 2023 at 5:13 AM Lijie Wang <
> > wangdachui9...@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Thanks for driving this. +1 for the proposal.
> > > >>>>>
> > > >>>>> Can we also prevent JUnit4 usage in new code this way? Because
> > > >>>> currently
> > > >>>>> we are aiming to migrate our codebase to JUnit 5.
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Lijie
> > > >>>>>
> > > >>>>>> Piotr Nowojski wrote on Tue, Apr 25, 2023 at 23:02:
> > > >>>>>
> > > >>>>>> Ok, thanks for the clarification.
> > > >>>>>>
> > > >>>>>> Piotrek
> > > >>>>>>
> > > >>>>>> On Tue, Apr 25, 2023 at 16:38, Chesnay Schepler
> > > >>>>> wrote:
> > > >>>>>>> The checkstyle rule would just ban certain imports.
> > > >>>>>>> We'd add exclusions for all existing usages as we did when
> > > >>>> introducing
> > > >>>>>>> other rules.
> > > >>>>>>> So far we usually disabled checkstyle rules for a specific
> files.
> > > >>>>>>>
> > > >>>>>>> On 25/04/2023 16:34, Piotr Nowojski wrote:
> > > >>>>>>>> +1 to the idea.
> > > >>>>>>>>
> > > >>>>>>>> How would this checkstyle rule work? Are you suggesting to
> start
> > > >>>>> with a
> > > >>>>>>>> number of exclusions? On what level will those exclusions be?
> > Per
> > > >>>>> file?
> > > >>>>>>> Per
> > > >>>>>>>> line?
> > > >>>>>>>>
> > > >>>>>>>> Best,
> > > >>>>>>>> Piotrek
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Apr 25, 2023 at 13:18, David Morávek
> > > >>>> wrote:
> > > >>>>>>>>> Hi Everyone,
> > > >>>>>>>>>
> > > >>>>>>>>> A long time ago, the community decided not to use
> Mockito-based
> > > >>>>> tests
> > > >>>>>>>>> because those are hard to maintain. This is already baked in
> > our
> > > >>>>> Code
> > > >>>>>>> Style
> > > >>>>>>>>> and Quality Guide [1].
> > > >>>>>>>>>
> > > >>>>>>>>> Because we still have Mockito imported into the code base,
> it's
> > > >>>> very
> > > >>>>>>> easy
> > > >>>>>>>>> for newcomers to unconsciously introduce new tests violating
> > the
> > > >>>>> code
> > > >>>>>>> style
> > > >>>>>>>>> because they're unaware of the decision.
> > > >>>>>>>>>
> > > >>>>>>>>> I propose to prevent Mockito usage with a Checkstyle rule
> for a
> > > >>>> new
> > > >>>>>>> code,
> > > >>>>>>>>> which would eventually allow us to eliminate it. This could
> > also
> > > >>>>>> prevent
> > > >>>>>>>>> some wasted work and unnecessary feedback cycles during
> > reviews.
> > > >>>>>>>>>
> > > >>>>>>>>> WDYT?
> > > >>>>>>>>>
> > > >>>>>>>>> [1]
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>
> > >
> >
> https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-mockito---use-reusable-test-implementations
> > > >>>>>>>>> Best,
> > > >>>>>>>>> D.
> > > >>>>>>>>>
> > > >>>>>>>
> > >
> > >
> >
>


Re: [DISCUSS] Planning Flink 2.0

2023-04-27 Thread David Morávek
Hi,

Great to see this topic moving forward; I agree it's long overdue.

I keep thinking about 2.0 as a chance to eliminate things that didn't work,
make the feature set denser, and fix rough edges and APIs that hold us back.

Some items in the doc (Key Features section) don't tick these boxes for me,
as they could also be implemented in the 1x branch. We should consider
whether we need a backward incompatible release to introduce each feature.
This should help us to keep the discussion more focused.

Best,
D.


On Wed, Apr 26, 2023 at 2:33 PM DONG Weike  wrote:

> Hi,
>
> It is thrilling to see the foreseeable upcoming rollouts of Flink 2.x
> releases, and I believe that this roadmap can take Flink to the next stage
> of a top-notch unified streaming & batch computing engine.
>
> Given that all of the existing user programs are written and run in Flink
> 1.x versions as of now, and some of them are very complex and rely on
> various third-party connectors written with legacy APIs, one thing that I
> have concerns about is if, one day in the future, the community decides
> that new features are only given to 2.x releases, could the last release of
> Flink 1.x be converted to an LTS version (backporting severe bug fixes and
> critical security patches), so that existing users could have enough time
> to wait for third-party connectors to upgrade, test their programs on the
> Flink APIs, and avoid sudden loss of community support.
>
> Just my two cents : )
>
> Best,
> Weike
>
> 
> From: Xintong Song
> Sent: 26 April 2023 20:01
> To: dev
> Subject: Re: [DISCUSS] Planning Flink 2.0
>
> @Chesnay
>
>
> > Technically this implies that every minor release may contain breaking
> > changes, which is exactly what users don't want.
>
>
> It's not necessary to introduce the breaking changes immediately upon
> reaching the minimum guaranteed stable time. If there are multiple changes
> waiting for the stable time, we can still gather them in 1 minor release.
> But I see your point, from the user's perspective, the mechanism does not
> provide any guarantees for the compatibility of minor releases.
>
> What problems to do you see in creating major releases every N years?
> >
>
> It might not be concrete problem, but I'm a bit concerned by the
> uncertainty. I assume N should not be too small, e.g., at least 3. I'd
> expect the decision to ship a major release would be made based on
> comprehensive considerations over the situations at that time. Making a
> decision now that we would ship a major release 3 years later seems a bit
> aggressive to me.
>
> We need to figure out what this release means for connectors
> > compatibility-wise.
> >
>
> +1
>
>
> > What process are you thinking of for deciding what breaking changes to
> > make? The obvious choice would be FLIPs, but I'm worried that this will
> > overload the mailing list / wiki for lots of tiny changes.
> >
>
> This should be a community decision. What I have in mind would be: (1)
> collect a wish list on wiki, (2) schedule a series of online meetings (like
> the release syncs) to get an agreed set of must-have items, (3) develop and
> polish the detailed plans of items via FLIPs, and (4) if the plan for a
> must-have item does not work out then go back to (2) for an update. I'm
> also open to other opinions.
>
> Would we wait a few months for people to prepare/agree on changes so we
> > reduce the time we need to merge things into 2 branches?
> >
>
> That's what I had in mind. Hopefully after 1.18.
>
> @Max
>
> When I look at
> >
> https://docs.google.com/document/d/1_PMGl5RuDQGlV99_gL3y7OiRsF0DgCk91Coua6hFXhE/edit
> > , I'm a bit skeptical we will even be able to reach all these goals. I
> > think we have to prioritize and try to establish a deadline. Otherwise we
> > will end up never releasing 2.0.
>
>
> Sorry for the confusion. I should have explained this more clearly. We are
> not planning to finish all the items in the list. It's more like a
> brainstorm, a list of candidates. We are also expecting to collect more
> ideas from the community. And after collecting the ideas, we should
> prioritize them and decide on a subset of must-have items, following the
> consensus decision making.
>
> +1 on Flink 2.0 by May 2024 (not a hard deadline but I think having a
> > deadline helps).
> >
>
> I agree that having a deadline helps. I proposed mid 2024, which is similar
> to but not as explicit as what you proposed. We may start with having a
> deadline for deciding the must-have items (e.g., by the end of June).  That
> should make it easier for estimating the overall time needed for preparing
> the release.
>
> Best,
>
> Xintong
>
>
>
> On Wed, Apr 26, 2023 at 6:57 PM Gyula Fóra  wrote:
>
> > +1 to everything Max said.
> >
> > Gyula
> >
> > On Wed, 26 Apr 2023 at 11:42, Maximilian Michels  wrote:
> >
> > > Thanks for starting the discussion, Jark and Xingtong!
> > >
> > > Flink 2.0 is long overdue. In the past, the expectations for such 

Re: [DISCUSS] FLINK-31873: Add setMaxParallelism to the DataStreamSink Class

2023-04-25 Thread David Morávek
Hi Eric,

this sounds reasonable, there are definitely cases where you need to limit
sink parallelism for example not to overload the storage or limit the
number of output files

+1

Best,
D.

On Sun, Apr 23, 2023 at 1:09 PM Weihua Hu  wrote:

> Hi, Eric
>
> Thanks for bringing up this discussion.
> I think it's reasonable to add "setMaxParallelism" for DataStreamSink.
>
> +1
>
> Best,
> Weihua
>
>
> On Sat, Apr 22, 2023 at 3:20 AM eric xiao  wrote:
>
> > Hi there devs,
> >
> > I would like to start a discussion thread for FLINK-31873[1].
> >
> > We are in the process of enabling Flink reactive mode as the default
> > scheduling mode. While reading configuration docs [2] (I believe it was
> > also mentioned during one of the training sessions during Flink Forward
> > 2022), one can/should replace all setParallelism calls with
> > setMaxParallelism when migrating to reactive mode.
> >
> > This currently isn't possible on a sink in a Flink pipeline as we do not
> > expose a setMaxParallelism on the DataStreamSink class [3]. The
> underlying
> > Transformation class does have both a setMaxParallelism and
> setParallelism
> > function defined [4], but only setParallelism is offered in the
> > DataStreamSink class.
> >
> > I believe adding setMaxParallelism would be beneficial not just for Flink
> > reactive mode, but for both modes of running a Flink pipeline (non-reactive
> > mode, Flink auto scaling).
> >
> > Best,
> >
> > Eric Xiao
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-31873
> > [2]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/elastic_scaling/#configuration
> > [3]
> >
> >
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/DataStreamSink.java
> > [4]
> >
> >
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/dag/Transformation.java#L248-L285
> >
>
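The requested change is essentially a one-line delegation, since Transformation already has the setter. A self-contained sketch with minimal stand-ins for the real Flink classes (which of course carry much more state and validation):

```java
// Minimal stand-in for org.apache.flink.api.dag.Transformation.
class Transformation {
    private int parallelism = -1;    // -1: use the default
    private int maxParallelism = -1;

    void setParallelism(int parallelism) { this.parallelism = parallelism; }
    void setMaxParallelism(int maxParallelism) { this.maxParallelism = maxParallelism; }
    int getParallelism() { return parallelism; }
    int getMaxParallelism() { return maxParallelism; }
}

// Minimal stand-in for DataStreamSink, showing the proposed delegation.
class DataStreamSink {
    private final Transformation transformation = new Transformation();

    DataStreamSink setParallelism(int parallelism) {
        transformation.setParallelism(parallelism);
        return this; // fluent style, as in the existing API
    }

    // The proposed addition: expose the setter Transformation already defines.
    DataStreamSink setMaxParallelism(int maxParallelism) {
        transformation.setMaxParallelism(maxParallelism);
        return this;
    }

    Transformation getTransformation() { return transformation; }
}
```

With this in place, a reactive-mode job could call sink.setMaxParallelism(n) instead of sink.setParallelism(n), matching the guidance in the elastic scaling docs.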


[DISCUSS] Preventing Mockito usage for the new code with Checkstyle

2023-04-25 Thread David Morávek
Hi Everyone,

A long time ago, the community decided not to use Mockito-based tests
because those are hard to maintain. This is already baked in our Code Style
and Quality Guide [1].

Because we still have Mockito imported into the code base, it's very easy
for newcomers to unconsciously introduce new tests violating the code style
because they're unaware of the decision.

I propose to prevent Mockito usage with a Checkstyle rule for a new code,
which would eventually allow us to eliminate it. This could also prevent
some wasted work and unnecessary feedback cycles during reviews.

WDYT?

[1]
https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-mockito---use-reusable-test-implementations

Best,
D.
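As a sketch, the check could be an IllegalImport Checkstyle module along these lines; the exact configuration is an assumption to be settled during review:

```xml
<!-- Inside the TreeWalker section of Flink's checkstyle configuration. -->
<module name="IllegalImport">
  <!-- Ban Mockito (and, if desired, PowerMock) imports in new code. -->
  <property name="illegalPkgs" value="org.mockito, org.powermock"/>
</module>
```

Existing violations would get entries in the suppressions file, as is already done for other rules. Note that IllegalImport only matches import statements; catching fully qualified org.mockito usages as well would need an additional Regexp-based check.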


Re: [ANNOUNCE] New Apache Flink PMC Member - Qingsheng Ren

2023-04-21 Thread David Morávek
Congratulations, Qingsheng, well deserved!

Best,
D.

On Fri 21. 4. 2023 at 16:41, Feng Jin  wrote:

> Congratulations, Qingsheng
>
>
> 
> Best,
> Feng Jin
>
> On Fri, Apr 21, 2023 at 8:39 PM Mang Zhang  wrote:
>
> > Congratulations, Qingsheng.
> >
> >
> >
> >
> >
> > --
> >
> > Best regards,
> > Mang Zhang
> >
> >
> >
> >
> >
> > At 2023-04-21 19:50:02, "Jark Wu"  wrote:
> > >Hi everyone,
> > >
> > >We are thrilled to announce that Qingsheng Ren has joined the Flink PMC!
> > >
> > >Qingsheng has been contributing to Apache Flink for a long time. He is
> the
> > >core contributor and maintainer of the Kafka connector and
> > >flink-cdc-connectors, bringing users stability and ease of use in both
> > >projects. He drove discussions and implementations in FLIP-221,
> FLIP-288,
> > >and the connector testing framework. He is continuously helping with the
> > >expansion of the Flink community and has given several talks about Flink
> > >connectors at many conferences, such as Flink Forward Global and Flink
> > >Forward Asia. Besides that, he is willing to help a lot in the community
> > >work, such as being the release manager for both 1.17 and 1.18,
> verifying
> > >releases, and answering questions on the mailing list.
> > >
> > >Congratulations and welcome Qingsheng!
> > >
> > >Best,
> > >Jark (on behalf of the Flink PMC)
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Leonard Xu

2023-04-21 Thread David Morávek
Congratulations, Leonard, well deserved!

Best,
D.

On Fri 21. 4. 2023 at 16:40, Feng Jin  wrote:

> Congratulations, Leonard
>
>
> 
> Best,
> Feng Jin
>
> On Fri, Apr 21, 2023 at 8:38 PM Mang Zhang  wrote:
>
> > Congratulations, Leonard.
> >
> >
> > --
> >
> > Best regards,
> > Mang Zhang
> >
> >
> >
> >
> >
> > At 2023-04-21 19:47:52, "Jark Wu"  wrote:
> > >Hi everyone,
> > >
> > >We are thrilled to announce that Leonard Xu has joined the Flink PMC!
> > >
> > >Leonard has been an active member of the Apache Flink community for many
> > >years and became a committer in Nov 2021. He has been involved in
> various
> > >areas of the project, from code contributions to community building. His
> > >contributions are mainly focused on Flink SQL and connectors, especially
> > >leading the flink-cdc-connectors project to receive 3.8+K GitHub stars.
> He
> > >authored 150+ PRs, and reviewed 250+ PRs, and drove several FLIPs (e.g.,
> > >FLIP-132, FLIP-162). He has participated in plenty of discussions in the
> > >dev mailing list, answering questions about 500+ threads in the
> > >user/user-zh mailing list. Besides that, he is community minded, such as
> > >being the release manager of 1.17, verifying releases, managing release
> > >syncs, etc.
> > >
> > >Congratulations and welcome Leonard!
> > >
> > >Best,
> > >Jark (on behalf of the Flink PMC)
> >
>


Re: [VOTE] FLIP-304: Pluggable Failure Enrichers

2023-04-20 Thread David Morávek
Thanks for the update!

+1 (binding)

Best,
D.

On Thu, Apr 20, 2023 at 9:50 AM Piotr Nowojski  wrote:

> Hi,
>
> I see that the FLIP has been updated, thanks Panos!
>
> +1 (binding)
>
> Best,
> Piotrek
>
> śr., 19 kwi 2023 o 13:49 Piotr Nowojski 
> napisał(a):
>
> > +1 to what David wrote. I think we need to update the FLIP and extend the
> > voting?
> >
> > Piotrek
> >
> > śr., 19 kwi 2023 o 09:06 David Morávek  napisał(a):
> >
> >> Hi Panos,
> >>
> >> It seems that most recent discussions (e.g. changing the semantics of
> the
> >> config option) are not reflected in the FLIP. Can you please
> double-check
> >> that this is the correct version?
> >>
> >> Best,
> >> D.
> >>
> >>
> >> On Mon, Apr 17, 2023 at 9:24 AM Panagiotis Garefalakis <
> pga...@apache.org
> >> >
> >> wrote:
> >>
> >> > Hello everyone,
> >> >
> >> > I want to start the vote for FLIP-304: Pluggable Failure Enrichers [1]
> >> --
> >> > discussed as part of [2].
> >> >
> >> > FLIP-304 introduces a pluggable interface allowing users to add custom
> >> > logic and enrich failures with custom metadata labels.
> >> >
> >> > The vote will last for at least 72 hours (Thursday, 20th of April
> 2023,
> >> > 12:30 PST) unless there is an objection or insufficient votes.
> >> >
> >> > [1]
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> >> > [2] https://lists.apache.org/thread/zs9n9p8d7tyvnq4yyxhc8zvq1k2c1hvs
> >> >
> >> >
> >> > Cheers,
> >> > Panagiotis
> >> >
> >>
> >
>


Re: [VOTE] FLIP-304: Pluggable Failure Enrichers

2023-04-19 Thread David Morávek
Hi Panos,

It seems that the most recent discussions (e.g., changing the semantics of the
config option) are not reflected in the FLIP. Can you please double-check
that this is the correct version?

Best,
D.


On Mon, Apr 17, 2023 at 9:24 AM Panagiotis Garefalakis 
wrote:

> Hello everyone,
>
> I want to start the vote for FLIP-304: Pluggable Failure Enrichers [1]  --
> discussed as part of [2].
>
> FLIP-304 introduces a pluggable interface allowing users to add custom
> logic and enrich failures with custom metadata labels.
>
> The vote will last for at least 72 hours (Thursday, 20th of April 2023,
> 12:30 PST) unless there is an objection or insufficient votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> [2] https://lists.apache.org/thread/zs9n9p8d7tyvnq4yyxhc8zvq1k2c1hvs
>
>
> Cheers,
> Panagiotis
>


Re: [Discussion] - Take findify/flink-scala-api under Flink umbrella

2023-04-16 Thread David Morávek
cc dev@f.a.o

On Sun, Apr 16, 2023 at 11:42 AM David Morávek  wrote:

> Hi Alexey,
>
> I'm a bit skeptical because, looking at the project, I see a couple of red
> flags:
>
> - The project is inactive. The last release and commit are both from last May.
> - The project has not been adapted for the last two Flink versions, which
> signals a lack of users.
> - All commits are by a single person, which could mean that there is no
> community around the project.
> - There was no external contribution (except the Scala bot).
> - There is no fork of the project (except the Scala bot).
>
> >  As I know, FIndify does not want or cannot maintain this library.
>
> Who are the users of the library? I'd assume Findify no longer uses it if
> they're abandoning it.
>
> > which would be similar to the StateFun
>
> We're currently dealing with a lack of maintainers for StateFun, so we
> should have a solid building ground around the project to avoid the same
> issue.
>
>
> I think there is value in having a modern Scala API, but we should have a
> bigger plan to address the future of Flink Scala APIs than importing an
> unmaintained library and calling it a day. I suggest starting a thread on
> the dev ML and concluding the overall plan first.
>
> Best,
> D.
>
> On Sun, Apr 16, 2023 at 10:48 AM guenterh.lists 
> wrote:
>
>> Hello Alexey
>>
>> Thank you for your initiative and your suggestion!
>>
>> I can only fully support the following statements in your email:
>>
>>  >Taking into account my Scala experience for the last 8 years, I
>> predict these wrappers will eventually be abandoned, unless such a Scala
>> library is a part of some bigger community like ASF.
>>  >Also, non-official Scala API will lead people to play safe and choose
>> Java API only, even if they didn't want that at the beginning.
>>
>> Second sentence is my current state.
>>
>>  From my point of view it would be very unfortunate if the Flink project
>> would lose the Scala API and thus the integration of concise, flexible
>> and future-oriented language constructs of the Scala language (and
>> further development of version 3).
>>
>> Documentation of the API is essential. I would be interested to support
>> this efforts.
>>
>> Best wishes
>>
>> Günter
>>
>>
>> On 13.04.23 15:39, Alexey Novakov via user wrote:
>> > Hello Flink PMCs and Flink Scala Users,
>> >
>> > I would like to propose an idea to take the 3rd party Scala API
>> > findify/flink-scala-api <https://github.com/findify/flink-scala-api>
>> > project into the Apache Flink organization.
>> >
>> > *Motivation *
>> >
>> > The Scala-free Flink idea was finally implemented by the 1.15 release
>> and
>> > allowed Flink users to bring their own Scala version and use it via the
>> > Flink Java API. See blog-post here: Scala Free in One Fifteen
>> > <https://flink.apache.org/2022/02/22/scala-free-in-one-fifteen/>. Also,
>> > existing Flink Scala API will be deprecated, because it is too hard to
>> > upgrade it to Scala 2.13 or 3.
>> >
>> > Taking into account my Scala experience for the last 8 years, I predict
>> > these wrappers will eventually be abandoned, unless such a Scala
>> library is
>> > a part of some bigger community like ASF.
>> > Also, non-official Scala API will lead people to play safe and choose
>> Java
>> > API only, even if they didn't want that at the beginning.
>> >
>> > https://github.com/findify/flink-scala-api has already advanced and
>> > implemented Scala support for 2.13 and 3 versions on top of Flink Java
>> API.
>> > As I know, Findify does not want or does not have the capacity to maintain
>> > this library. I propose to fork this great library and create a new
>> Flink
>> > project with its own version and build process (SBT, not Maven), which
>> > would be similar to the StateFun or FlinkML projects.
>> >
>> > *Proposal *
>> >
>> > 1. Create a fork of findify/flink-scala-api and host in Apache Flink Git
>> > space (PMCs please advise).
>> > 2. I and Roman
>> > <
>> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=rgrebennikov>
>> > would
>> > be willing to maintain this library in future for the next several
>> years.
>> > Further, we believe it will live on its own.
>> > 3. Flink Docs: PMCs, we need your guidelines here. One way I see is to
>> > create new documentation in a similar way as StateFun docs.
>> Alternatively,
>> > we could just fix existing Flink Scala code examples to make sure they
>> work
>> > with the new wrapper. In any case, I see docs will be upgraded/fixed
>> > gradually.
>> >
>> > I hope you will find this idea interesting and worth going forward.
>> >
>> > P.S. The irony here is that findify/flink-scala-api was also a fork of
>> > Flink Scala-API some time ago, so we have a chance to close the loop :-)
>> >
>> > Best regards.
>> > Alexey
>> >
>> --
>> Günter Hipler
>> https://openbiblio.social/@vog61
>> https://twitter.com/vog61
>>
>>


Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-23 Thread David Morávek
> > > >> I can't think of where we actually need
> > > >> classifiers depending on each other. Supporting such a use case
> right
> > > from
> > > >> the start feels a bit over-engineered and could be covered in a
> > > follow-up
> > > >> FLIP if we really come to that point where such a feature is
> requested
> > > by
> > > >> users.
> > > >>
> > > >> - key/value pairs instead of plain labels: I think that's a good
> idea.
> > > >> key/value pairs are more expressive. +1
> > > >>
> > > >> - extending the FLIP to cover restart strategy: I understand Gen's
> > > concern
> > > >> about introducing too many different types of plugins. But I would
> > still
> > > >> favor not extending the FLIP in this regard. A pluggable restart
> > > strategy
> > > >> sounds reasonable. But an error classifier and a restart strategy
> are
> > > still
> > > >> different enough to justify separate plugins, IMHO. And therefore, I
> > > would
> > > >> think that covering the restart strategy in a separate FLIP is the
> > > better
> > > >> option for the sake of simplicity.
> > > >>
> > > >> - immutable context: Passing in an immutable context and returning
> > data
> > > >> through the interface method's return value sounds like a better
> > > approach
> > > >> to harden the contract of the interface. +1 for that proposal
> > > >>
> > > >> - async operation: I think David is right. An async interface makes
> > the
> > > >> listener implementations more robust when it comes to heavy IO
> > > operations.
> > > >> The ioExecutor can be passed through the context object. +1
> > > >>
> > > >> Matthias
> > > >>
> > > >> On Tue, Mar 21, 2023 at 2:09 PM David Morávek <
> > david.mora...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> *@Piotr*
> > > >>>
> > > >>>
> > > >>>> I was thinking about actually defining the order of the
> > > >>>> classifiers/handlers and not allowing them to be asynchronous.
> > > >>>> Asynchronousity would create some problems: when to actually
> return
> > > the
> > > >>>> error to the user? After all async responses will get back?
> Before,
> > > but
> > > >>>> without classified exception? It would also add implementation
> > > >> complexity
> > > >>>> and I think we can always expand the API with async version in the
> > > >> future
> > > >>>> if needed.
> > > >>>
> > > >>>
> > > >>> As long as the classifiers need to talk to an external system, we
> by
> > > >>> definition need to allow them to be asynchronous to unblock the
> main
> > > >> thread
> > > >>> for handling other RPCs. Exposing ioExecutor via the context
> proposed
> > > >> above
> > > >>> would be great.
> > > >>>
> > > >>> After all async responses will get back
> > > >>>
> > > >>>
> > > >>> This would be the same if we trigger them synchronously one by one,
> > > with
> > > >> a
> > > >>> caveat that synchronous execution might take significantly longer
> and
> > > >>> introduce unnecessary downtime to a job.
> > > >>>
> > > >>> D.
> > > >>>
> > > >>> On Tue, Mar 21, 2023 at 1:12 PM Zhu Zhu  wrote:
> > > >>>
> > > >>>> Hi Piotr,
> > > >>>>
> > > >>>> It's fine to me to have a separate FLIP to extend this
> > > >> `FailureListener`
> > > >>>> to support custom restart strategy.
> > > >>>>
> > > >>>> What I was a bit concerned is that if we just treat the
> > > >> `FailureListener`
> > > >>>> as an error classifier which is not crucial to Flink framework
> > > process,
> > > >>>> we may design it to run asynchronously and not trigger Flink
> > failures.
> > > >>>> This may be a blocker if later we want to enable it to support
> > custom restart strategies.

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread David Morávek
> ... perform heavy actions in a different thread. The context can
> provide
> > > an
> > > > > > `ioExecutor` to the plugins for reuse.
> > > > > >
> > > > > > Thanks,
> > > > > > Zhu
> > > > > >
> > > > > > Shammon FY  于2023年3月20日周一 20:21写道:
> > > > > > >
> > > > > > > Hi Panagiotis
> > > > > > >
> > > > > > > Thank you for your answer. I agree that `FailureListener`
> could be
> > > > > > > stateless, then I have some thoughts as follows
> > > > > > >
> > > > > > > 1. I see that listeners and tag collections are associated.
> When
> > > > > > JobManager
> > > > > > > fails and restarts, how can the new listener be associated
> with the
> > > > tag
> > > > > > > collection before failover? Is the listener loading order?
> > > > > > >
> > > > > > > 2. The tag collection may be too large, resulting in the
> JobManager
> > > > > OOM,
> > > > > > do
> > > > > > > we need to provide a management class that supports some
> > > obsolescence
> > > > > > > strategies instead of a direct Collection?
> > > > > > >
> > > > > > > 3. Is it possible to provide a more complex data structure
> than a
> > > > > simple
> > > > > > > string collection for tags in listeners, such as key-value?
> > > > > > >
> > > > > > > Best,
> > > > > > > Shammon FY
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Mar 20, 2023 at 7:48 PM Leonard Xu 
> > > > wrote:
> > > > > > >
> > > > > > > > Hi,Panagiotis
> > > > > > > >
> > > > > > > >
> > > > > > > > Thank you for kicking off this discussion. Overall, the
> proposed
> > > > > > feature of
> > > > > > > > this FLIP makes sense to me. We have also discussed similar
> > > > > > requirements
> > > > > > > > with our users and developers, and I believe it will help
> many
> > > > users.
> > > > > > > >
> > > > > > > >
> > > > > > > > In terms of FLIP content, I have some thoughts:
> > > > > > > >
> > > > > > > > (1) For the FailureListenerContext interface, the methods
> > > > > > > > FailureListenerContext#addTag and
> FailureListenerContext#getTags
> > > > looks
> > > > > > very
> > > > > > > > inconsistent because they imply specific implementation
> details,
> > > > and
> > > > > > not
> > > > > > > > all FailureListeners need to handle them, we shouldn't put
> them
> > > in
> > > > > the
> > > > > > > > interface. Minor: The comment "UDF loading" in the
> > > > > getUserClassLoader()
> > > > > > > > method looks like a typo, IIUC it should return the
> classloader
> > > of
> > > > > the
> > > > > > > > current job.
> > > > > > > >
> > > > > > > > (2) Regarding the implementation in
> > > > > > ExecutionFailureHandler#handleFailure,
> > > > > > > > some custom listeners may have heavy IO operations, such as
> > > > reporting
> > > > > > to
> > > > > > > > their monitoring system. The current logic appears to be
> > > processing
> > > > > in
> > > > > > the
> > > > > > > > JobMaster's main thread, and it is recommended not to do this
> > > kind
> > > > of
> > > > > > > > processing in the main thread.
> > > > > > > >
> > > > > > > > (3) The results of FailureListener's processing and the
> > > > > > > > FailureHandlingResult returned by ExecutionFailureHandler
> are not
> > > > > > related.
> > > > > > > > I think these two are closely related, the motivation of this
> > > FLIP
> > > > is
> > > > > &g

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-20 Thread David Morávek
>
> however listeners can use previous state (tags/labels) to make decisions


That sounds like a very fragile contract. We should either allow passing
tags between listeners and then need to define ordering or make all of them
independent. I prefer the latter because it allows us to parallelize things
if needed (if all listeners trigger an RPC to the external system, for
example).

Can you expand on why we need more than one classifier to be able to output
the same tag?

system ones come first and then the ones loaded from the plugin manager


Since they're returned as a Set, the order is completely non-deterministic,
no matter in which order they're loaded.

just communicating with external monitoring/alerting systems
>

That makes the need for pushing things out of the main thread even
stronger. This almost sounds like we need to return a CompletableFuture for
the per-throwable classification because an external system might take a
significant time to respond. We need to unblock the main thread for other
RPCs.
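To make the idea concrete, here is a minimal sketch of what an asynchronous enricher could look like. The interface, method, and label names below are illustrative assumptions, not the final FLIP-304 API:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical sketch: classification returns a CompletableFuture so a slow
 * external lookup never blocks the JobMaster's main thread.
 */
interface AsyncFailureEnricher {
    CompletableFuture<Map<String, String>> processFailure(Throwable cause, Executor ioExecutor);
}

class TimeoutClassifier implements AsyncFailureEnricher {
    @Override
    public CompletableFuture<Map<String, String>> processFailure(Throwable cause, Executor ioExecutor) {
        // Heavy work (e.g. an RPC to an external monitoring system) runs on
        // the I/O executor; the main thread only completes the future.
        return CompletableFuture.supplyAsync(
                () -> cause instanceof TimeoutException
                        ? Map.of("failure.type", "SYSTEM")
                        : Map.of("failure.type", "UNKNOWN"),
                ioExecutor);
    }
}
```

Returning a future would also let the scheduler compose all enricher results (e.g. via `CompletableFuture.allOf`) and decide what to do if one of them times out or fails, instead of crashing the job.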

Also, in the proposal, this happens in the failure handler. If that's the
case, this might block the job from being restarted (if the restart
strategy allows for another restart), which would be great to avoid because
it can introduce extra downtime.

This raises another question: what should happen if the classification
fails? Crashing the job (which is what's currently proposed) seems very
dangerous if this might depend on an external system.

That's a valid point, passing the JobGraph containing all the above
> information is also something to consider
>

We should avoid passing JG around because it's mutable (which we must fix
in the long term), and letting users change it might have consequences.

Best,
D.

On Mon, Mar 20, 2023 at 7:23 AM Panagiotis Garefalakis 
wrote:

> Hey David, Shammon,
>
> Thanks for the valuable comments!
> I am glad you find this proposal useful, some thoughts:
>
> @Shammon
>
> 1. How about adding more job information in FailureListenerContext? For
> > example, job vertex, subtask, taskmanager location. And then user can do
> > more statistics according to different dimensions.
>
>
> That's a valid point, passing the JobGraph containing all the above
> information
> is also something to consider, I was mostly trying to be conservative:
> i.e., passingly only the information we need, and extend as we see fit
>
> 2. Users may want to save results in listener, and then they can get the
> > historical results even after jobmanager failover. Can we provide a unified
> > implementation for data storage requirements?
>
>
> The idea is to store only the output of the Listeners (tags) and treat them
> as stateless.
> Tags are stored along with HistoryEntries, and will be available through
> the HistoryServer
> even after a JM dies.
>
> @David
>
> 1) Should we also consider adding labels? The combination of tags and
> > labels seems to be what most systems offer; sometimes, they offer labels
> > only (key=value pairs) because tags can be implemented using those, but
> not
> > the other way around.
>
>
> Indeed changing tags to k:v labels could be more expressive, I like it!
> Let's see what others think.
>
> 2) Since we can not predict how heavy user-defined models ("listeners") are
> > going to be, it would be great to keep the interfaces/data structures
> > immutable so we can push things over to the I/O threads. Also, it sounds
> > off to call the main interface a Listener since it's supposed to enhance
> > the original throwable with additional metadata.
>
>
> The idea was for the name to be generic as there could be Listener
> implementations
> just communicating with external monitoring/alerting systems and no
> metadata output
> -- but lets rethink that. For immutability, see below:
>
> 3) You're proposing to support a set of listeners. Since you're passing the
> > mutable context around, which includes tags set by the previous listener,
> > do you expect users to make any assumptions about the order in which
> > listeners are executed?
>
>
> In the existing proposal we are not making any assumptions about the order
> of listeners,
> (system ones come first and then the ones loaded from the plugin manager)
> however listeners can use previous state (tags/labels) to make decisions:
> e.g., won't assign *UNKNOWN* failureType when we have already seen *USER* or
> the other way around -- when we have seen *UNKNOWN* remove in favor of
> *USER*
>
>
> Cheers,
> Panagiotis
>
> On Sun, Mar 19, 2023 at 10:42 AM David Morávek  wrote:
>
> > Hi Panagiotis,
> >
> > This is an excellent proposal and something everyone trying to provide
> > "Flink as a service" n

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-19 Thread David Morávek
Hi Panagiotis,

This is an excellent proposal and something everyone trying to provide
"Flink as a service" needs to solve at some point. I have a couple of
questions:

If I understand the proposal correctly, this is just about adding tags to
the Throwable by running a tuple of (Throwable, FailureContext) through a
user-defined model.

1) Should we also consider adding labels? The combination of tags and
labels seems to be what most systems offer; sometimes, they offer labels
only (key=value pairs) because tags can be implemented using those, but not
the other way around.
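To illustrate why labels subsume tags (a sketch with made-up names, not a proposed API): a plain tag is just a label whose value carries no information.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative only: encoding plain tags as key=value labels. */
class FailureLabels {
    static Map<String, String> fromTags(Set<String> tags) {
        // Each tag becomes a key with an empty value; a real label can carry
        // a meaningful value instead, e.g. "failure.type" -> "USER".
        return tags.stream().collect(Collectors.toMap(tag -> tag, tag -> ""));
    }
}
```

The reverse encoding (labels as tags) would require mangling keys and values into one string, which is why starting from labels is the more general choice.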

2) Since we can not predict how heavy user-defined models ("listeners") are
going to be, it would be great to keep the interfaces/data structures
immutable so we can push things over to the I/O threads. Also, it sounds
off to call the main interface a Listener since it's supposed to enhance
the original throwable with additional metadata.

I'd propose something along the lines of (we should have better names, this
is just to outline the idea):

interface FailureEnricher {

  ThrowableWithTagsAndLabels enrichFailure(Throwable cause,
ImmutableContextualMetadataAboutTheThrowable context);
}

3) You're proposing to support a set of listeners. Since you're passing the
mutable context around, which includes tags set by the previous listener,
do you expect users to make any assumptions about the order in which
listeners are executed?

*@Shammon*

Users may want to save results in listener, and then they can get the
> historical results even after jobmanager failover. Can we provide a unified
> implementation for data storage requirements?


I think we should explicitly state that all "listeners" are treated as
stateless. I don't see any strong reason for snapshotting them.

Best,
D.

On Sat, Mar 18, 2023 at 1:00 AM Shammon FY  wrote:

> Hi Panagiotis
>
> Thank you for starting this discussion. I think this FLIP is valuable and
> can help user to analyze the causes of job failover better!
>
> I have two comments as follows
>
> 1. How about adding more job information in FailureListenerContext? For
> example, job vertex, subtask, taskmanager location. And then user can do
> more statistics according to different dimensions.
>
> 2. Users may want to save results in listener, and then they can get the
> historical results even after jobmanager failover. Can we provide a unified
> implementation for data storage requirements?
>
>
> Best,
> shammon FY
>
>
> On Saturday, March 18, 2023, Panagiotis Garefalakis 
> wrote:
>
> > Hi everyone,
> >
> > This FLIP [1] proposes a pluggable interface for failure handling
> allowing
> > users to implement custom failure logic using the plugin framework.
> > Motivated by existing proposals [2] and tickets [3], this enables
> use-cases
> > like: assigning particular types to failures (e.g., User or System),
> > emitting custom metrics per type (e.g., application or platform), even
> > exposing errors to downstream consumers (e.g., notification systems).
> >
> > Thanks to Piotr and Anton for the initial reviews and discussions!
> >
> > For anyone interested, the starting point would be the FLIP [1] that I
> > created,
> > describing the motivation and the proposed changes (part of the core,
> > runtime and web).
> >
> > The intuition behind this FLIP is being able to execute custom logic on
> > failures by exposing a FailureListener interface. Implementation by users
> > can be simply loaded to the system as Jar files. FailureListeners may
> also
> > decide to assign failure tags to errors (expressed as strings),
> > that will then be exposed as metadata by the UI/Rest interfaces.
> >
> > Feedback is always appreciated! Looking forward to your thoughts!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%
> > 3A+Pluggable+failure+handling+for+Apache+Flink
> > [2]
> > https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67
> > Hmjgy0-hRDeuFnrMgT4
> > [3] https://issues.apache.org/jira/browse/FLINK-20833
> >
> > Cheers,
> > Panagiotis
> >
>


Re: Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

2023-03-13 Thread David Morávek
> Although YARN serves as the platform for Flink, does YARN also operate on
K8s?

YARN is an alternative to k8s and Flink should make no assumptions about
how it's deployed, even though some companies might deploy it as an overlay
RM on top of k8s (I doubt that, but I guess they might do it for
legacy/migration reasons).

Best,
D.

On Mon, Mar 13, 2023 at 9:39 AM Samrat Deb  wrote:

> Hi Ramkrishna,
>
>
> I hope this email finds you well. Please accept my apologies for the delay
> in responding to your previous message.
>
>
> I would like to discuss the following matter with you: Although YARN serves
> as the platform for Flink, does YARN also operate on K8s? I am curious to
> know if this is the reason why you wished to implement a generic autoscale
> logic within the operator itself.
>
> Upon analyzing the code, we discovered that the autoscaling logic/framework
> is closely tied to the K8s operator. Our intention is to take a gradual and
> measured approach by developing a generic interface that can easily
> integrate with any resource manager initially. Subsequently, we can
> evaluate and determine the suitability of the interface for autoscaling
> logic based on need.
>
>
> Bests,
>
> Samrat
>


Re: [Vote] FLIP-298: Unifying the Implementation of SlotManager

2023-03-09 Thread David Morávek
+1 (binding)

Best,
D.

On Fri, Mar 10, 2023 at 4:49 AM Yuxin Tan  wrote:

> Thanks, Weihua!
> +1 (non-binding)
>
> Best,
> Yuxin
>
>
> weijie guo  于2023年3月10日周五 11:29写道:
>
> > +1 (binding)
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Shammon FY  于2023年3月10日周五 11:02写道:
> >
> > > Thanks weihua, +1 (non-binding)
> > >
> > > Best,
> > > Shammon
> > >
> > > On Fri, Mar 10, 2023 at 10:32 AM Xintong Song 
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Thu, Mar 9, 2023 at 1:28 PM Weihua Hu 
> > wrote:
> > > >
> > > > > Hi Everyone,
> > > > >
> > > > > I would like to start the vote on FLIP-298: Unifying the
> > Implementation
> > > > > of SlotManager [1]. The FLIP was discussed in this thread [2].
> > > > >
> > > > > This FLIP aims to unify the implementation of SlotManager in
> > > > > order to reduce maintenance costs.
> > > > >
> > > > > The vote will last for at least 72 hours (03/14, 15:00 UTC+8)
> > > > > unless there is an objection or insufficient votes. Thank you all.
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > > > [2]
> https://lists.apache.org/thread/ocssfxglpc8z7cto3k8p44mrjxwr67r9
> > > > >
> > > > > Best,
> > > > > Weihua
> > > > >
> > > >
> > >
> >
>


Re: Regarding new command to download jars in flink cluster

2023-03-04 Thread David Morávek
Hi Surendra,

Since you're mentioning docker, I assume you're deploying your application
to k8s. Is that correct?

For handcrafted Kubernetes deployments, you can simply download the jar
into the user lib folder in an init container [1]. You can usually reuse
existing docker images to download the jar. For example, for S3, the AWS
CLI image will do the trick [2].
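As a concrete sketch, the init container could look like the fragment below. The image tag, bucket, paths, and volume name are illustrative assumptions, not taken from any official manifest:

```yaml
# Sketch only: fetch the job jar from S3 into Flink's user-lib directory
# before the JobManager/TaskManager container starts.
initContainers:
  - name: fetch-job-jar
    image: amazon/aws-cli:2.15.0
    command: ["aws", "s3", "cp", "s3://my-bucket/jars/job.jar", "/opt/flink/usrlib/job.jar"]
    volumeMounts:
      - name: usrlib
        mountPath: /opt/flink/usrlib
```

The same pattern works for ABFS or HDFS by swapping in an image that ships the corresponding CLI.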

In general, my take is that this doesn't belong to the Flink itself (we
should keep the core feature matrix dense) but to the
deployment/orchestration layer around it (e.g., the AF Kubernetes Operator
[3]).

[1] https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
[2] https://hub.docker.com/r/amazon/aws-cli
[3] https://github.com/apache/flink-kubernetes-operator

Best,
D.

On Fri, Mar 3, 2023 at 8:11 PM Surendra Singh Lilhore <
surendralilh...@apache.org> wrote:

> Hi Team,
>
>
>
> According to the Flink documentation, in the APP mode, the application jar
> should be bundled with the Flink image. However, building an image for each
> new application can be difficult. Can we introduce new commands that will
> help to download the required jar locally before starting Flink JM or TM
> containers? This should be a simple command that depends on the supported
> file system (S3, HDFS, ABFS) in Flink, and the command format should be
> something like this:
>
> *./flink fs-download <remote-dir> <local-dir>*
>
> Example:
>
> *./flink fs-download
> abfs://mycontai...@storageaccount.dfs.core.windows.net/jars /tmp/appjars*
>
> I have already tested this in my cluster, and it is working fine. Before
> raising a JIRA ticket, I would like to get suggestions from the community.
>
>
> Thanks and Regards
> Surendra
>


[RESULT][VOTE] FLIP-291: Externalized Declarative Resource Management

2023-03-03 Thread David Morávek
I'm happy to announce that we have unanimously approved this FLIP.

There are 8 approving votes, 3 of which are binding:

* John Roesler
* Konstantin Knauf (binding)
* Zhanghao Chen
* ConradJam
* Feng Xiangyu
* Gyula Fóra (binding)
* Roman Khachatryan (binding)
* Shammon FY

There are no disapproving votes.

Thanks everyone for participating!

Best,
D.


Re: [VOTE] FLIP-291: Externalized Declarative Resource Management

2023-03-03 Thread David Morávek
Thanks, everyone, I'm closing this vote now. I'll follow up with the result
in a separate email.

On Wed, Mar 1, 2023 at 10:01 AM Shammon FY  wrote:

> +1 (non-binding)
>
> Best,
> Shammon
>
>
> On Wed, Mar 1, 2023 at 4:51 PM Roman Khachatryan  wrote:
>
> > +1 (binding)
> >
> > Thanks David, and everyone involved :)
> >
> > Regards,
> > Roman
> >
> >
> > On Wed, Mar 1, 2023 at 8:01 AM Gyula Fóra  wrote:
> >
> > > +1 (binding)
> > >
> > > Looking forward to this :)
> > >
> > > Gyula
> > >
> > > On Wed, 1 Mar 2023 at 04:02, feng xiangyu 
> wrote:
> > >
> > > > +1  (non-binding)
> > > >
> > > > ConradJam  于2023年3月1日周三 10:37写道:
> > > >
> > > > > +1  (non-binding)
> > > > >
> > > > > Zhanghao Chen  于2023年3月1日周三 10:18写道:
> > > > >
> > > > > > Thanks for driving this. +1 (non-binding)
> > > > > >
> > > > > > Best,
> > > > > > Zhanghao Chen
> > > > > > 
> > > > > > From: David Morávek 
> > > > > > Sent: Tuesday, February 28, 2023 21:46
> > > > > > To: dev 
> > > > > > Subject: [VOTE] FLIP-291: Externalized Declarative Resource
> > > Management
> > > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > I want to start the vote on FLIP-291: Externalized Declarative
> > > Resource
> > > > > > Management [1]. The FLIP was discussed in this thread [2].
> > > > > >
> > > > > > The goal of the FLIP is to enable external declaration of the
> > > resource
> > > > > > requirements of a running job.
> > > > > >
> > > > > > The vote will last for at least 72 hours (Friday, 3rd of March,
> > 15:00
> > > > > CET)
> > > > > > unless
> > > > > > there is an objection or insufficient votes.
> > > > > >
> > > > > > [1]
> > https://lists.apache.org/thread/b8fnj127jsl5ljg6p4w3c4wvq30cnybh
> > > > > > [2]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > >
> > > > > > Best,
> > > > > > D.
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best
> > > > >
> > > > > ConradJam
> > > > >
> > > >
> > >
> >
>


[VOTE] FLIP-291: Externalized Declarative Resource Management

2023-02-28 Thread David Morávek
Hi Everyone,

I want to start the vote on FLIP-291: Externalized Declarative Resource
Management [1]. The FLIP was discussed in this thread [2].

The goal of the FLIP is to enable external declaration of the resource
requirements of a running job.

The vote will last for at least 72 hours (Friday, 3rd of March, 15:00 CET)
unless
there is an objection or insufficient votes.

[1] https://lists.apache.org/thread/b8fnj127jsl5ljg6p4w3c4wvq30cnybh
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management

Best,
D.


Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-28 Thread David Morávek
> I suppose we could further remove the min because it would always be
safer to scale down if resources are not available than not to run at
all [1].

Apart from what @Roman has already mentioned, there are still cases where
we're certain that there is no point in running the job with resources
lower than X, e.g., because the state is too large to be processed with a
parallelism of 1. This lets you avoid wasting resources when you're certain
that the job would go into a restart loop / wouldn't be able to checkpoint.

I believe that for most use cases, simply keeping the lower bound at 1 will
be sufficient.

> I saw that the minimum bound is currently not used in the code you posted
above [2]. Is that still planned?

Yes. We already allow setting the lower bound via API, but it's not
considered by the scheduler. I'll address this limitation in a separate
issue.

> Note that originally we had assumed min == max but I think that would be
a less safe scaling approach because we would get stuck waiting for
resources when they are not available, e.g. k8s resource limits reached.

100% agreed; The above-mentioned knobs should allow you to balance the
trade-off.


Does that make sense?

Best,
D.



On Tue, Feb 28, 2023 at 1:14 PM Roman Khachatryan  wrote:

> Hi,
>
> Thanks for the update, I think distinguishing the rescaling behaviour and
> the desired parallelism declaration is important.
>
> Having the ability to specify min parallelism might be useful in
> environments with multiple jobs: Scheduler will then have an option to stop
> the less suitable job.
> In other setups, where the job should not be stopped at all, the user can
> always set it to 0.
>
> Regards,
> Roman
>
>
> On Tue, Feb 28, 2023 at 12:58 PM Maximilian Michels 
> wrote:
>
>> Hi David,
>>
>> Thanks for the update! We consider using the new declarative resource
>> API for autoscaling. Currently, we treat a scaling decision as a new
>> deployment which means surrendering all resources to Kubernetes and
>> subsequently reallocating them for the rescaled deployment. The
>> declarative resource management API is a great step forward because it
>> allows us to do faster and safer rescaling. Faster, because we can
>> continue to run while resources are pre-allocated which minimizes
>> downtime. Safer, because we can't get stuck when the desired resources
>> are not available.
>>
>> An example with two vertices and their respective parallelisms:
>>   v1: 50
>>   v2: 10
>> Let's assume slot sharing is disabled, so we need 60 task slots to run
>> the vertices.
>>
>> If the autoscaler was to decide to scale up v1 and v2, it could do so
>> in a safe way by using min/max configuration:
>>   v1: [min: 50, max: 70]
>>   v2: [min: 10, max: 20]
>> This would then need 90 task slots to run at max capacity.
>>
>> I suppose we could further remove the min because it would always be
>> safer to scale down if resources are not available than to not run at
>> all [1]. In fact, I saw that the minimum bound is currently not used
>> in the code you posted above [2]. Is that still planned?
>>
>> -Max
>>
>> PS: Note that originally we had assumed min == max but I think that
>> would be a less safe scaling approach because we would get stuck
>> waiting for resources when they are not available, e.g. k8s resource
>> limits reached.
>>
>> [1] However, there might be costs involved with executing the
>> rescaling, e.g. for using external storage like s3, especially without
>> local recovery.
>> [2]
>> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9
>>
>> On Tue, Feb 28, 2023 at 9:33 AM David Morávek  wrote:
>> >
>> > Hi Everyone,
>> >
>> > We had some more talks about the pre-allocation of resources with @Max,
>> and
>> > here is the final state that we've converged to for now:
>> >
>> > The vital thing to note about the new API is that it's declarative,
>> meaning
>> > we're declaring the desired state to which we want our job to converge;
>> If,
>> > after the requirements update, the job no longer holds the desired resources
>> > (fewer resources than the lower bound), it will be canceled and
>> transition
>> > back into the waiting for resources state.
>> >
>> > In some use cases, you might always want to rescale to the upper bound
>> > (this goes along the lines of "preallocating resources" and minimizing
>> the
>> > number of rescales, which is especially useful with the large state).
>> This
>> > can be controlled by two knobs that already exist:
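
For reference, the slot arithmetic in Max's example quoted above can be
checked with a short sketch (vertex names and numbers are taken from the
example; slot sharing is assumed to be disabled):

```python
# Slot arithmetic from the example above: with slot sharing disabled,
# each vertex needs one task slot per parallel subtask.
def required_slots(parallelisms):
    """Total task slots needed when slot sharing is disabled."""
    return sum(parallelisms.values())

current = {"v1": 50, "v2": 10}       # running parallelisms
upper_bound = {"v1": 70, "v2": 20}   # proposed max bounds

assert required_slots(current) == 60       # slots needed today
assert required_slots(upper_bound) == 90   # slots needed at max capacity
```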

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-28 Thread David Morávek
Hi Everyone,

We had some more talks about the pre-allocation of resources with @Max, and
here is the final state that we've converged to for now:

The vital thing to note about the new API is that it's declarative, meaning
we're declaring the desired state to which we want our job to converge. If,
after the requirements update, the job no longer holds the desired resources
(fewer than the lower bound), it will be canceled and transition back into
the waiting-for-resources state.

In some use cases, you might always want to rescale to the upper bound
(this goes along the lines of "preallocating resources" and minimizing the
number of rescales, which is especially useful with the large state). This
can be controlled by two knobs that already exist:

1) "jobmanager.adaptive-scheduler.min-parallelism-increase" - this affects
the minimal parallelism-increase step of a running job; we'll slightly change
the semantics and trigger rescaling either once this condition is met or
when you hit the ceiling; setting this to a high number ensures that you
always rescale to the upper bound

2) "jobmanager.adaptive-scheduler.resource-stabilization-timeout" - for new
and already restarting jobs, we'll always respect this timeout, which
allows you to wait for more resources even though you already have more
resources than defined in the lower bound; again, in the case we reach the
ceiling (the upper bound), we'll transition into the executing state.
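
For illustration, the two knobs above could be set together like this (shown
as a Python mapping of flink-conf.yaml keys; the keys are the options named
above, but the values are made up for the example, not recommendations):

```python
# Illustrative values for the two existing knobs discussed above.
# The keys are real Flink configuration options; the values are examples.
flink_conf = {
    # Rescale only for sizeable parallelism gains, or when the upper
    # bound is hit; a very high value effectively means "always rescale
    # straight to the upper bound".
    "jobmanager.adaptive-scheduler.min-parallelism-increase": 1000,
    # How long new/restarting jobs keep waiting for resources beyond
    # the lower bound before entering the executing state.
    "jobmanager.adaptive-scheduler.resource-stabilization-timeout": "60 s",
}

for key in flink_conf:
    assert key.startswith("jobmanager.adaptive-scheduler.")
```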


We're still planning to dig deeper in this direction with other efforts,
but this is already good enough and should allow us to move the FLIP
forward.

WDYT? Unless there are any objections to the above, I'd like to
proceed to a vote.

Best,
D.

On Thu, Feb 23, 2023 at 5:39 PM David Morávek  wrote:

> Hi Everyone,
>
> @John
>
> This is a problem that we've spent some time trying to crack; in the end,
> we've decided to go against doing any upgrades to JobGraphStore from
> JobMaster to avoid having multiple writers that are guarded by different
> leader election lock (Dispatcher and JobMaster might live in a different
> process). The contract we've decided to choose instead is leveraging the
> idempotency of the endpoint and having the user of the API retry in case
> we're unable to persist new requirements in the JobGraphStore [1]. We
> eventually need to move JobGraphStore out of the dispatcher, but that's way
> out of the scope of this FLIP. The solution is a deliberate trade-off. The
> worst scenario is that the Dispatcher fails over in between retries, which
> would simply rescale the job to meet the previous resource requirements
> (more extended unavailability of underlying HA storage would have worse
> consequences than this). Does that answer your question?
>
> @Matthias
>
> Good catch! I'm fixing it now, thanks!
>
> [1]
> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
>
> Best,
> D.
>
> On Tue, Feb 21, 2023 at 12:24 AM John Roesler  wrote:
>
>> Thanks for the FLIP, David!
>>
>> I just had one small question. IIUC, the REST API PUT request will go
>> through the new DispatcherGateway method to be handled. Then, after
>> validation, the dispatcher would call the new JobMasterGateway method to
>> actually update the job.
>>
>> Which component will write the updated JobGraph? I just wanted to make
>> sure it’s the JobMaster because if it were the dispatcher, there could be a
>> race condition with the async JobMaster method.
>>
>> Thanks!
>> -John
>>
>> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
>> > Thanks for your clarifications, David. I don't have any additional major
>> > points to add. One thing about the FLIP: The RPC layer API for updating
>> the
>> > JRR returns a future with a JRR? I don't see value in returning a JRR
>> here
>> > since it's an idempotent operation? Wouldn't it be enough to return
>> > CompletableFuture here? Or am I missing something?
>> >
>> > Matthias
>> >
>> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels 
>> wrote:
>> >
>> >> Thanks David! If we could get the pre-allocation working as part of
>> >> the FLIP, that would be great.
>> >>
>> >> Concerning the downscale case, I agree this is a special case for the
>> >> (single-job) application mode where we could re-allocate slots in a
>> >> way that could leave entire task managers unoccupied which we would
>> >> then be able to release. The goal essentially is to reduce slot
> >> fragmentation on scale down by packing the slots efficiently.

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

2023-02-27 Thread David Morávek
Hi Weihua, I still need to dig into the details, but the overall sentiment
of this change sounds reasonable.

Best,
D.

On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen 
wrote:

> Thanks for driving this topic. I think this FLIP could help clean up the
> codebase to make it easier to maintain. +1 on it.
>
> Best,
> Zhanghao Chen
> 
> From: Weihua Hu 
> Sent: Monday, February 27, 2023 20:40
> To: dev 
> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager
>
> Hi everyone,
>
> I would like to begin a discussion on FLIP-298: Unifying the Implementation
> of SlotManager[1]. There are currently two types of SlotManager in Flink:
> DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
> should behave as DeclarativeSlotManager if the user does not configure the
> slot request profile.
>
> Therefore, this FLIP aims to unify the implementation of SlotManager in
> order to reduce maintenance costs.
>
> Looking forward to hearing from you.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>
> Best,
> Weihua
>


Re: [DISCUSS] Deprecating GlobalAggregateManager

2023-02-27 Thread David Morávek
I think this makes sense, +1 from my side; as I wrote on the ticket, I'm
not aware of any other usages apart from the Kinesis connector, and we
already have a more feature-complete API that can replace the functionality
there.

Best,
D.

On Mon, Feb 27, 2023 at 2:44 PM Zhanghao Chen 
wrote:

> Hi dev,
>
> I'd like to open discussion on deprecating Global Aggregate Manager in
> favor of Operator Coordinator.
>
>
>   1.  Global Aggregate Manager is rarely used and can be replaced by
> Operator Coordinator. Global Aggregate Manager was introduced in [1]
> to support event time
> synchronization across sources and more generally, coordination of parallel
> tasks. AFAIK, this was only used in the Kinesis source [2] for an early
> version of watermark alignment. Operator Coordinator, introduced in [3],
> provides a more powerful and elegant solution for that need and is part of
> the new source API standard.
>   2.  Global Aggregate Manager manages state in JobMaster object, causing
> problems for adaptive parallelism changes. It maintains a state (the
> accumulators field in JobMaster) in JM memory. The accumulator state
> content is defined in user code. In my company, a user stores task
> parallelism in the accumulator, assuming task parallelism never changes.
> However, this assumption is broken when using adaptive scheduler. See [4]
> for more details.
>
> Therefore, I think we should deprecate the use of Global Aggregate
> Manager, which can improve the maintainability of the Flink codebase
> without compromising its functionality. Looking forward to your opinions on
> this.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10886
> [2]
> https://github.com/apache/flink-connector-aws/blob/d0817fecdcaa53c4bf039761c2d1a16e8fb9f89b/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/util/JobManagerWatermarkTracker.java
> [3]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-SplitEnumerator
> [4] https://issues.apache.org/jira/browse/FLINK-31245
> (FLINK-31245: Adaptive scheduler does not reset the state of
> GlobalAggregateManager when rescaling)
>
> Best,
> Zhanghao Chen
>


Re: [ANNOUNCE] New Apache Flink Committer - Anton Kalashnikov

2023-02-23 Thread David Morávek
Congratulations Anton, well deserved!

Best,
d.

On Wed, Feb 22, 2023 at 5:01 AM Jane Chan  wrote:

> Congratulations, Anton!
>
> Best regards,
> Jane
>
> On Wed, Feb 22, 2023 at 11:22 AM Yun Tang  wrote:
>
> > Congratulations, Anton!
> >
> > Best
> > Yun Tang
> > 
> > From: Dian Fu 
> > Sent: Wednesday, February 22, 2023 9:44
> > To: dev@flink.apache.org 
> > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Anton Kalashnikov
> >
> > Congratulations Anton!
> >
> > On Tue, Feb 21, 2023 at 9:12 PM Austin Cawley-Edwards <
> > austin.caw...@gmail.com> wrote:
> >
> > > Congrats Anton
> > >
> > > On Tue, Feb 21, 2023 at 04:08 Roman Khachatryan 
> > wrote:
> > >
> > > > Congratulations Anton, well deserved!
> > > >
> > > > Regards,
> > > > Roman
> > > >
> > > >
> > > > On Tue, Feb 21, 2023 at 9:34 AM Martijn Visser <
> > martijnvis...@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > Congratulations Anton!
> > > > >
> > > > > On Tue, Feb 21, 2023 at 8:08 AM Lincoln Lee <
> lincoln.8...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Congratulations, Anton!
> > > > > >
> > > > > > Best,
> > > > > > Lincoln Lee
> > > > > >
> > > > > >
> > > > > > Guowei Ma  wrote on Tue, Feb 21, 2023 at 15:05:
> > > > > >
> > > > > > > Congratulations, Anton!
> > > > > > >
> > > > > > > Best,
> > > > > > > Guowei
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 1:52 PM Shammon FY 
> > > > wrote:
> > > > > > >
> > > > > > > > Congratulations, Anton!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Shammon
> > > > > > > >
> > > > > > > > On Tue, Feb 21, 2023 at 1:41 PM Sergey Nuyanzin <
> > > > snuyan...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Congratulations, Anton!
> > > > > > > > >
> > > > > > > > > On Tue, Feb 21, 2023 at 4:53 AM Weihua Hu <
> > > > huweihua@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Congratulations, Anton!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Weihua
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 21, 2023 at 11:22 AM weijie guo <
> > > > > > > guoweijieres...@gmail.com
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Congratulations, Anton!
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > >
> > > > > > > > > > > Weijie
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Leonard Xu  wrote on Tue, Feb 21, 2023 at 11:02:
> > > > > > > > > > >
> > > > > > > > > > > > Congratulations, Anton!
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Leonard
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Feb 21, 2023, at 10:02 AM, Rui Fan <
> > > fan...@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Congratulations, Anton!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Rui Fan
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Feb 21, 2023 at 9:23 AM yuxia <
> > > > > > > > luoyu...@alumni.sjtu.edu.cn
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Congrats Anton!
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Best regards,
> > > > > > > > > > > > >> Yuxia
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> - Original Message -
> > > > > > > > > > > > >> From: "Matthias Pohl"  > .INVALID>
> > > > > > > > > > > > >> To: "dev" 
> > > > > > > > > > > > >> Sent: Tuesday, Feb 21, 2023, 12:52:40 AM
> > > > > > > > > > > > >> Subject: Re: [ANNOUNCE] New Apache Flink Committer -
> > Anton
> > > > > > > > Kalashnikov
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Congratulations, Anton! :-)
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Mon, Feb 20, 2023 at 5:09 PM Jing Ge
> > > > > > > > >  > > > > > > > > > >
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>> Congrats Anton!
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> On Mon, Feb 20, 2023 at 5:02 PM Samrat Deb <
> > > > > > > > > decordea...@gmail.com>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>>
> > > > > > > > > > > >  congratulations Anton!
> > > > > > > > > > > > 
> > > > > > > > > > > >  Bests,
> > > > > > > > > > > >  Samrat
> > > > > > > > > > > > 
> > > > > > > > > > > >  On Mon, 20 Feb 2023 at 9:29 PM, John Roesler <
> > > > > > > > > vvcep...@apache.org
> > > > > > > > > > >
> > > > > > > > > > > > >>> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > Congratulations, Anton!
> > > > > > > > > > > > > -John
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Feb 20, 2023, at 08:18, Piotr Nowojski
> > > wrote:
> > > > > > > > > > > > >> Hi, everyone
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On behalf of the PMC, I'm very happy to
> announce
> > > > Anton
> > > > > > > > > > 

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-23 Thread David Morávek
Hi Everyone,

@John

This is a problem that we've spent some time trying to crack; in the end,
we've decided to go against doing any upgrades to JobGraphStore from
JobMaster to avoid having multiple writers that are guarded by different
leader election lock (Dispatcher and JobMaster might live in a different
process). The contract we've decided to choose instead is leveraging the
idempotency of the endpoint and having the user of the API retry in case
we're unable to persist new requirements in the JobGraphStore [1]. We
eventually need to move JobGraphStore out of the dispatcher, but that's way
out of the scope of this FLIP. The solution is a deliberate trade-off. The
worst scenario is that the Dispatcher fails over in between retries, which
would simply rescale the job to meet the previous resource requirements
(more extended unavailability of underlying HA storage would have worse
consequences than this). Does that answer your question?

@Matthias

Good catch! I'm fixing it now, thanks!

[1]
https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151

Best,
D.

On Tue, Feb 21, 2023 at 12:24 AM John Roesler  wrote:

> Thanks for the FLIP, David!
>
> I just had one small question. IIUC, the REST API PUT request will go
> through the new DispatcherGateway method to be handled. Then, after
> validation, the dispatcher would call the new JobMasterGateway method to
> actually update the job.
>
> Which component will write the updated JobGraph? I just wanted to make
> sure it’s the JobMaster because if it were the dispatcher, there could be a
> race condition with the async JobMaster method.
>
> Thanks!
> -John
>
> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
> > Thanks for your clarifications, David. I don't have any additional major
> > points to add. One thing about the FLIP: The RPC layer API for updating
> the
> > JRR returns a future with a JRR? I don't see value in returning a JRR
> here
> > since it's an idempotent operation? Wouldn't it be enough to return
> > CompletableFuture here? Or am I missing something?
> >
> > Matthias
> >
> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels 
> wrote:
> >
> >> Thanks David! If we could get the pre-allocation working as part of
> >> the FLIP, that would be great.
> >>
> >> Concerning the downscale case, I agree this is a special case for the
> >> (single-job) application mode where we could re-allocate slots in a
> >> way that could leave entire task managers unoccupied which we would
> >> then be able to release. The goal essentially is to reduce slot
> >> fragmentation on scale down by packing the slots efficiently. The
> >> easiest way to add this optimization when running in application mode
> >> would be to drop as many task managers during the restart such that
> >> NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look into
> >> this independently of the FLIP.
> >>
> >> Feel free to start the vote.
> >>
> >> -Max
> >>
> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek  wrote:
> >> >
> >> > Hi everyone,
> >> >
> >> > Thanks for the feedback! I've updated the FLIP to use idempotent PUT
> API
> >> instead of PATCH and to properly handle lower bound settings, to support
> >> the "pre-allocation" of the resources.
> >> >
> >> > @Max
> >> >
> >> > > How hard would it be to address this issue in the FLIP?
> >> >
> >> > I've included this in the FLIP. It might not be too hard to implement
> >> this in the end.
> >> >
> >> > > B) drop as many superfluous task managers as needed
> >> >
> >> > I've intentionally left this part out for now because this ultimately
> >> needs to be the responsibility of the Resource Manager. After all, in
> the
> >> Session Cluster scenario, the Scheduler doesn't have the bigger picture
> of
> >> other tasks of other jobs running on those TMs. This will most likely
> be a
> >> topic for another FLIP.
> >> >
> >> > WDYT? If there are no other questions or concerns, I'd like to start
> the
> >> vote on Wednesday.
> >> >
> >> > Best,
> >> > D.
> >> >
> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels 
> >> wrote:
> >> >>
> >> >> I missed that the FLIP states:
> >> >>
> >> >> > Currently, even though we’d expose the lower bound

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-20 Thread David Morávek
Hi everyone,

Thanks for the feedback! I've updated the FLIP to use idempotent PUT API
instead of PATCH and to properly handle lower bound settings, to support
the "pre-allocation" of the resources.
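
As a minimal sketch of why the idempotent PUT fits retries well (an
in-memory stand-in; the function and payload names here are illustrative,
not the FLIP's actual API):

```python
# Sketch of idempotent PUT semantics: the request carries the full
# desired state, so applying it once or several times (e.g., retrying
# after a timeout) leaves the store in the same state.
store = {}

def put_requirements(job_id, requirements):
    """Replace the job's resource requirements wholesale."""
    store[job_id] = dict(requirements)
    return store[job_id]

req = {"vertex-1": {"lowerBound": 1, "upperBound": 4}}
first = put_requirements("job-a", req)
second = put_requirements("job-a", req)  # a retry changes nothing
assert first == second == store["job-a"]
```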

@Max

> How hard would it be to address this issue in the FLIP?

I've included this in the FLIP. It might not be too hard to implement this
in the end.

> B) drop as many superfluous task managers as needed

I've intentionally left this part out for now because this ultimately needs
to be the responsibility of the Resource Manager. After all, in the Session
Cluster scenario, the Scheduler doesn't have the bigger picture of other
tasks of other jobs running on those TMs. This will most likely be a topic
for another FLIP.

WDYT? If there are no other questions or concerns, I'd like to start the
vote on Wednesday.

Best,
D.

On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels  wrote:

> I missed that the FLIP states:
>
> > Currently, even though we’d expose the lower bound for clarity and API
> completeness, we won’t allow setting it to any other value than one until
> we have full support throughout the stack.
>
> How hard would it be to address this issue in the FLIP?
>
> There is not much value to offer setting a lower bound which won't be
> respected / throw an error when it is set. If we had support for a
> lower bound, we could enforce a resource contract externally via
> setting lowerBound == upperBound. That ties back to the Rescale API
> discussion we had. I want to better understand what the major concerns
> would be around allowing this.
>
> Just to outline how I imagine the logic to work:
>
> A) The resource constraints are already met => Nothing changes
> B) More resources available than required => Cancel the job, drop as
> many superfluous task managers as needed, restart the job
> C) Less resources available than required => Acquire new task
> managers, wait for them to register, cancel and restart the job
>
> I'm open to helping out with the implementation.
>
> -Max
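>
> The A/B/C logic outlined above can be compressed into a sketch (a
> simplification; the function name and return strings are illustrative):

```python
# Sketch of the three rescale cases above, keyed off the relation
# between required and available task slots.
def rescale_action(required_slots, available_slots):
    if available_slots == required_slots:
        return "A: constraints met, nothing changes"
    if available_slots > required_slots:
        return "B: cancel, drop superfluous task managers, restart"
    return "C: acquire task managers, wait for registration, restart"

assert rescale_action(60, 60).startswith("A")
assert rescale_action(60, 90).startswith("B")
assert rescale_action(90, 60).startswith("C")
```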
>
> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels  wrote:
> >
> > Based on further discussion I had with Chesnay on this PR [1], I think
> > jobs would currently go into a restarting state after the resource
> > requirements have changed. This wouldn't achieve what we had in mind,
> > i.e. sticking to the old resource requirements until enough slots are
> > available to fulfil the new resource requirements. So this may not be
> > 100% what we need but it could be extended to do what we want.
> >
> > -Max
> >
> > [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362
> >
> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels 
> wrote:
> > >
> > > Hi David,
> > >
> > > This is awesome! Great writeup and demo. This is pretty much what we
> > > need for the autoscaler as part of the Flink Kubernetes operator [1].
> > > Scaling Flink jobs effectively is hard but fortunately we have solved
> > > the issue as part of the Flink Kubernetes operator. The only critical
> > > piece we are missing is a better way to execute scaling decisions, as
> > > discussed in [2].
> > >
> > > Looking at your proposal, we would set lowerBound == upperBound for
> > > the parallelism because we want to fully determine the parallelism
> > > externally based on the scaling metrics. Does that sound right?
> > >
> > > What is the timeline for these changes? Is there a JIRA?
> > >
> > > Cheers,
> > > Max
> > >
> > > [1]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > >
> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu 
> wrote:
> > > >
> > > > Hi David,
> > > >
> > > > Thanks for your reply. I think your response totally makes sense.
> This
> > > > FLIP targets declaring required resources to the ResourceManager
> instead of
> > > > using the ResourceManager to add/remove TMs directly.
> > > >
> > > > Best,
> > > > Xiangyu
> > > >
> > > >
> > > >
> > > > David Morávek  wrote on Mon, Feb 13, 2023 at 15:46:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > @Shammon
> > > > >
> > > > > I'm not entirely sure what "config file" you're referring to. You
> can, of
> > > > > course, override the default parallelism in "flink-conf.yaml", but
> for
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-12 Thread David Morávek
Hi everyone,

@Shammon

I'm not entirely sure what "config file" you're referring to. You can, of
course, override the default parallelism in "flink-conf.yaml", but for
sinks and sources, the parallelism needs to be tweaked on the connector
level ("WITH" statement).

This is something that should be achieved with tooling around Flink. We
want to provide an API on the lowest level that generalizes well. Achieving
what you're describing should be straightforward with this API.

@Xiangyu

> Is it possible for this REST API to declare TM resources in the future?


Would you like to add/remove TMs if you use an active Resource Manager?
This would be out of the scope of this effort since it targets the
scheduler component only (we make no assumptions about the used Resource
Manager). Also, the AdaptiveScheduler is only intended to be used for
Streaming.

> And for streaming jobs, I'm wondering if there is any situation we need to
> rescale the TM resources of a flink cluster at first and then the adaptive
> scheduler will rescale the per-vertex ResourceProfiles accordingly.
>

We plan on adding support for the ResourceProfiles (dynamic slot
allocation) as the next step. Again we won't make any assumptions about the
used Resource Manager. In other words, this effort ends by declaring
desired resources to the Resource Manager.

Does that make sense?

@Matthias

We've done another pass on the proposed API and currently lean towards
having an idempotent PUT API.
- We don't care too much about multiple writers' scenarios in terms of who
can write an authoritative payload; this is up to the user of the API to
figure out
- It's indeed tricky to achieve atomicity with PATCH API; switching to PUT
API seems to do the trick
- We won't allow partial "payloads" anymore, meaning you need to define
requirements for all vertices in the JobGraph; This is completely fine for
the programmatic workflows. For DEBUG / DEMO purposes, you can use the GET
endpoint and tweak the response to avoid writing the whole payload by hand.
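
The GET-tweak-PUT workflow described in the last bullet could look roughly
like this (a pure-Python stand-in; the payload field names are assumptions,
the real shape is defined by the FLIP):

```python
# Sketch of "GET the current requirements, tweak one vertex, PUT the
# whole payload back": PUT must carry requirements for every vertex,
# so we start from the current state instead of hand-writing it.
def updated_requirements(current, vertex_id, lower, upper):
    """Return a full payload with a single vertex's bounds changed."""
    payload = {v: dict(bounds) for v, bounds in current.items()}
    payload[vertex_id] = {"lowerBound": lower, "upperBound": upper}
    return payload

current = {  # as returned by the (hypothetical) GET endpoint
    "source": {"lowerBound": 1, "upperBound": 4},
    "sink": {"lowerBound": 1, "upperBound": 2},
}
payload = updated_requirements(current, "source", 1, 8)
assert set(payload) == set(current)  # no partial payloads
assert payload["source"] == {"lowerBound": 1, "upperBound": 8}
assert payload["sink"] == current["sink"]
```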

WDYT?


Best,
D.

On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu  wrote:

> Hi David,
>
> Thanks for creating this flip. I think this work it is very useful,
> especially in autoscaling scenario.  I would like to share some questions
> from my view.
>
> 1, Is it possible for this REST API to declare TM resources in the future?
> I'm asking because we are building the autoscaling feature for Flink OLAP
> Session Cluster in ByteDance. We need to rescale the cluster's resource on
> TM level instead of Job level. It would be very helpful if we have a REST
> API for our external Autoscaling service to use.
>
> 2, And for streaming jobs, I'm wondering if there is any situation we need
> to rescale the TM resources of a flink cluster at first and then the
> adaptive scheduler will rescale the per-vertex ResourceProfiles
> accordingly.
>
> best.
> Xiangyu
>
> Shammon FY  wrote on Thu, Feb 9, 2023 at 11:31:
>
> > Hi David
> >
> > Thanks for your answer.
> >
> > > Can you elaborate more about how you'd intend to use the endpoint? I
> > think we can ultimately introduce a way of re-declaring "per-vertex
> > defaults," but I'd like to understand the use case bit more first.
> >
> > For this issue, I mainly consider the consistency of user configuration
> and
> > job runtime. For sql jobs, users usually set specific parallelism for
> > source and sink, and set a global parallelism for other operators. These
> > config items are stored in a config file. For some high-priority jobs,
> > users may want to manage them manually.
> > 1. When users need to scale the parallelism, they should update the
> config
> > file and restart flink job, which may take a long time.
> > 2. After providing the REST API, users can just send a request to the job
> > via REST API quickly after updating the config file.
> > The configuration in the running job and config file should be the same.
> > What do you think of this?
> >
> > best.
> > Shammon
> >
> >
> >
> > On Tue, Feb 7, 2023 at 4:51 PM David Morávek 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Let's try to answer the questions one by one.
> > >
> > > *@ConradJam*
> > >
> > > > when the number of "slots" is insufficient, can we stop users
> > rescaling
> > > > > or throw something to tell the user "less available slots to upgrade,
> > please
> > > > > check out your available slots"?
> > > >
> > >
> > > The main property of AdaptiveScheduler is that it can adapt to
> "available
> > > resources," which means you're still able to make p

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-07 Thread David Morávek
would enable us to provide resource requirement
> > > changes in the UI or through the REST API. It is related to a problem
> > > around keeping track of the exception history within the
> > AdaptiveScheduler
> > > and also having to consider multiple versions of a JobGraph. But for
> that
> > > one, we use the ExecutionGraphInfoStore right now.
> > > - Updating the JobGraph in the JobGraphStore makes sense. I'm just
> > > wondering whether we bundle two things together that are actually
> > separate:
> > > The business logic and the execution configuration (the resource
> > > requirements). I'm aware that this is not a flaw of the current FLIP
> but
> > > rather something that was not necessary to address in the past because
> > the
> > > JobGraph was kind of static. I don't remember whether that was already
> > > discussed while working on the AdaptiveScheduler for FLIP-160 [1].
> Maybe,
> > > I'm missing some functionality here that requires us to have everything
> > in
> > > one place. But it feels like updating the entire JobGraph which could
> be
> > > actually a "config change" is not reasonable. ...also considering the
> > > amount of data that can be stored in a ConfigMap/ZooKeeper node if
> > > versioning the resource requirement change as proposed in my previous
> > item
> > > is an option for us.
> > > - Updating the JobGraphStore means adding more requests to the HA
> backend
> > > API. There were some concerns shared in the discussion thread [2] for
> > > FLIP-270 [3] on pressuring the k8s API server in the past with too many
> > > calls. Even though it's more likely to be caused by checkpointing, I
> > still
> > > wanted to bring it up. We're working on a standardized performance test
> > to
> > > prepare going forward with FLIP-270 [3] right now.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > [2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > [3]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > >
> > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam  wrote:
> > >
> > > > Hi David:
> > > >
> > > > Thank you for driving this FLIP, which helps reduce Flink downtime.
> > > >
> > > > For this FLIP, I would like to share a few ideas:
> > > >
> > > >
> > > >- when the number of "slots" is insufficient, can we stop users
> > > >from rescaling or throw something to tell the user "not enough
> > > >available slots to upgrade, please check your available slots"?
> > > >Or we could have a request switch (true/false) to allow this
> > > >behavior
> > > >
> > > >
> > > >- when the user updates the job-vertex parallelism, I want to have
> > > >an interface to query the status of the ongoing parallelism update,
> > > >so that the user or a program can understand the current status.
> > > >This also helps with management, similar to the *[1] Flink K8s
> > > >Operator*
> > > >
> > > >
> > > > {
> > > >   status: Failed
> > > >   reason: "not enough available slots to upgrade, please check your
> > > > available slots"
> > > > }
> > > >
> > > >
> > > >
> > > >- *Pending*: the job has joined the upgrade queue; it will be
> > > >updated later
> > > >- *Rescaling*: the job is currently rescaling; wait for it to
> > > >finish
> > > >- *Finished*: the rescale has finished
> > > >- *Failed*: something went wrong, so this job cannot be upgraded
> > > >
> > > > I would like to add the above content to the FLIP; what do you think?
> > > >
> > > >
> > > >1.
> > > >
> > >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > >
> > > >
> > > > David Morávek  wrote on Fri, Feb 3, 2023 at 16:42:
> > > >
> > > > > Hi everyone,
> 

Re: [VOTE] FLIP-285: Refactoring LeaderElection to make Flink support multi-component leader election out-of-the-box

2023-02-03 Thread David Morávek
Thanks for the detailed FLIP, Matthias; this will simplify the HA code-base
significantly.

+1 (binding)

Best,
D.

On Tue, Jan 31, 2023 at 5:22 AM Yang Wang  wrote:

> +1 (Binding)
>
> Best,
> Yang
>
> ConradJam  wrote on Tue, Jan 31, 2023 at 12:09:
>
> > +1 non-binding
> >
> > Matthias Pohl  wrote on Wed, Jan 25, 2023 at 17:34:
> >
> > > Hi everyone,
> > > After the discussion thread [1] on FLIP-285 [2] didn't bring up any new
> > > items, I want to start voting on FLIP-285. This FLIP will align the
> > > leader election code base again through FLINK-26522 [3]. I also plan to
> > > improve the test coverage for the leader election as part of this
> > > change (covered in FLINK-30338 [4]).
> > >
> > > The vote will remain open until at least Jan 30th (at least 72 hours)
> > > unless there are some objections or insufficient votes.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1] https://lists.apache.org/thread/qrl881wykob3jnmzsof5ho8b9fgkklpt
> > > [2]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > [3] https://issues.apache.org/jira/browse/FLINK-26522
> > > [4] https://issues.apache.org/jira/browse/FLINK-30338
> > >
> > > --
> > >
> > > Matthias Pohl
> > >
> > > Software Engineer, Aiven
> > >
> > > matthias.p...@aiven.io
> > >
> > > aiven.io
> > >
> > > Aiven Deutschland GmbH
> > >
> > > Immanuelkirchstraße 26, 10405 Berlin
> > >
> > > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> > >
> > > Amtsgericht Charlottenburg, HRB 209739 B
> > >
> >
> >
> > --
> > Best
> >
> > ConradJam
> >
>


[DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-03 Thread David Morávek
Hi everyone,

This FLIP [1] introduces a new REST API for declaring resource requirements
for the Adaptive Scheduler. There seems to be a clear need for this API
based on the discussion on the "Reworking the Rescale API" [2] thread.

Before we get started, this work is heavily based on the prototype [3]
created by Till Rohrmann, and the FLIP is being published with his consent.
Big shoutout to him!

Last but not least, thanks to Chesnay and Roman for the initial reviews and
discussions.

The best start would be watching a short demo [4] that I've recorded, which
illustrates newly added capabilities (rescaling the running job, handing
back resources to the RM, and session cluster support).

The intuition behind the FLIP is being able to define resource requirements
("resource boundaries") externally that the AdaptiveScheduler can navigate
within. This is a building block for higher-level efforts such as an
external Autoscaler. The natural extension of this work would be to allow
specifying per-vertex ResourceProfiles.
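As a rough illustration of the declarative model (the exact endpoint path
and field names here are assumptions for the sake of the example and may
differ from the final API), declaring such boundaries could look like a
`PUT /jobs/<job-id>/resource-requirements` call with per-vertex parallelism
bounds:

```json
{
  "<vertex-id>": {
    "parallelism": { "lowerBound": 1, "upperBound": 4 }
  }
}
```

The AdaptiveScheduler would then be free to run each vertex at any
parallelism within its declared bounds, depending on the resources
currently available.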

Looking forward to your thoughts; any feedback is appreciated!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
[2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
[3] https://github.com/tillrohrmann/flink/tree/autoscaling
[4] https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view

Best,
D.


Re: Reworking the Rescale API

2023-02-01 Thread David Morávek
It makes sense to give the whole "scheduler ecosystem," not just the
adaptive scheduler, a little bit more structure in the docs. We already
have 4 different schedulers (Default, Adaptive, AdaptiveBatch,
AdaptiveBatchSpeculative), and it becomes quite confusing since the details
are scattered around the docs. Maybe having a "Job Schedulers" subpage, the
same way as we have for "Resource Providers", could do the trick.

I should be able to fill in the details about the streaming ones, but I
will probably need some help with the batch ones.

As for the first FLIP, it's already prepared and we should be able to
publish it by Friday.

Best,
D.


On Wed, Feb 1, 2023 at 9:56 AM Gyula Fóra  wrote:

> Chesnay, David:
>
> Thank you guys for the extra information. We were clearly missing some
> context here around the scheduler related efforts and the currently
> available feature set.
>
> As for the concrete suggestions regarding the docs.
>
> 1. If the adaptive scheduler provides a significantly different feature set
> from the default scheduler we could have its own smaller doc page detailing
> the differences and why people should switch to it for streaming. This will
> also help us when we are making the transition and change the default
> behaviour.
> 2. We could still have an elastic scaling page that links to the adaptive
> scheduler (and vice versa) that focuses on elastic scaling + the Kubernetes
> operator autoscaler for a complete picture on elastic scaling options +
> detailing the limitations of the different approaches.
>
> This way the Adaptive Scheduler docs will be decoupled from elastic scaling
> and will result in a better understanding for the users (it sure would have
> helped us here, and we are on the more advanced user side :))
>
> What do you think?
> Gyula
>
> On Sat, Jan 28, 2023 at 4:20 AM ConradJam  wrote:
>
> > Sorry I'm late to join the discussion; I've gleaned a lot of useful
> > information from you all.
> >
> > *@max*
> >
> >    - when users change the partitioning, we still need to restart the
> >    job; can we try to do this part of the work internally instead of
> >    externally and, as *@konstantin* said, only trigger rescaling when a
> >    checkpoint (or retained checkpoint) is completed, to minimize
> >    reprocessing
> >
> > *@konstantin*
> >
> >    - I think you mentioned that 2 FLIPs are being drafted, which I
> >    consider to be the precondition for achieving the *@max* goal. I would
> >    love to join this discussion and contribute to it. I've tried a native
> >    implementation of this part myself; if I can help the community,
> >    that's the best I can do
> >
> > *@chesnay*
> >
> >    - The docs section is confusing, as *@gyula* says; I'll see if I can
> >    fix it
> >
> >
> > *About Rescale Api*
> >
> >   Some limitations and differences between *default* and *reactive mode*
> > were discussed earlier, and *@chesnay* explained some of their
> > limitations and behaviors; essentially they are two different things. I
> > agree that when reactive mode is ready, it should become the default for
> > *stream processing* jobs.
> >   As for the *[1] Rescale API*, as we know it now seems to be unusable; I
> > believe the goal of this API is to enable fast reparallelism. I would
> > like to wait until the discussion is over and the 2 draft FLIPs
> > mentioned earlier are completed. It is not too late to then decide
> > whether to modify the *[2] Rescale REST API* to support parallelism
> > modification of job vertices.
> >
> >
> >1.
> > *
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/elastic_scaling/
> ><
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/elastic_scaling/
> > >
> >*
> >2.
> > *
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-rescaling
> ><
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-rescaling
> > >
> >*
> >
> >
> > Best~
> >
> >
> >
> > Maximilian Michels  wrote on Tue, Jan 24, 2023 at 01:08:
> >
> > > Hi,
> > >
> > > The current rescale API appears to be a work in progress. A couple of
> > > years ago, we disabled access to the API [1].
> > >
> > > I'm looking into this problem as part of working on autoscaling [2]
> where
> > > we currently require a full restart of the job to apply the parallelism
> > > overrides. This adds additional delay and comes with the caveat that we
> > > don't know whether sufficient resources are available prior to
> executing
> > > the scaling decision. We obviously do not want to get stuck due to a
> lack
> > > of resources. So a rescale API would have to ensure enough resources
> are
> > > available prior to restarting the job.
> > >
> > > I've created an issue here:
> > > https://issues.apache.org/jira/browse/FLINK-30773
> > >
> > > Any comments or interest in working on this?
> > >
> > > -Max
> > >
> > > [1] 

Re: Reworking the Rescale API

2023-01-27 Thread David Morávek
>
> The adaptive scheduler only supports streaming jobs. That's the biggest
> limitation that probably won't be fixed anytime soon.


Since FLIP-283 [1] has been accepted, I think this limitation might have
already been addressed to a certain extent. I'd be completely fine with
having a separate scheduler for batch and streaming (maybe we could build a
hybrid one at some point that automatically switches between the two).

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-283%3A+Use+adaptive+batch+scheduler+as+default+scheduler+for+batch+jobs


On Fri, Jan 27, 2023 at 9:58 AM Chesnay Schepler  wrote:

> The adaptive scheduler only supports streaming jobs. That's the biggest
> limitation that probably won't be fixed anytime soon.
> The goal was though to make the adaptive scheduler the default for
> streaming jobs eventually.
> it was very much meant as a better version of the default scheduler for
> streaming jobs.
>
> On 26/01/2023 19:06, David Morávek wrote:
> > Hi Gyula,
> >
> >
> >> can you please explain why the AdaptiveScheduler is not the default
> >> scheduler?
> >
> > There are still some smaller bits missing. As far as I know, the missing
> > parts are:
> >
> > 1) Local recovery (reusing the already downloaded state files after
> restart
> > / rescale)
> > 2) Support for fine-grained resource management
> > 3) Support for the session cluster (Chesnay will be submitting a FLIP for
> > this soon)
> >
> > We're looking into addressing all of these limitations in the short term.
> >
> > Personally, I'd love to start a discussion about transitioning the
> > AdaptiveScheduler into the default one after those limitations are fixed.
> > Being able to eventually deprecate and remove the DefaultScheduler would
> > simplify the code-base by a lot since there are many adapters between new
> > and old interfaces (eg. SlotPool-related interfaces).
> >
> > Best,
> > D.
> >
> > On Thu, Jan 26, 2023 at 6:27 PM Gyula Fóra  wrote:
> >
> >> Chesnay,
> >>
> >> Seems like you are suggesting that the Adaptive scheduler does
> everything
> >> the standard scheduler does and more.
> >>
> >> I am clearly not an expert on this topic but can you please explain why
> the
> >> AdaptiveScheduler is not the default scheduler?
> >> If it can do everything, why do we even have 2 schedulers? Why not
> simply
> >> drop the "old" one?
> >>
> >> That would probably clear up all confusions then :)
> >>
> >> Gyula
> >>
> >> On Thu, Jan 26, 2023 at 6:23 PM Chesnay Schepler 
> >> wrote:
> >>
> >>> There's the default and reactive mode; nothing else.
> >>> At its core they are the same thing; reactive mode just cranks up the
> >>> desired parallelism to infinity and enforces certain assumptions (e.g.,
> >>> no active resource management).
> >>>
> >>> The advantage is that the adaptive scheduler can run jobs while not
> >>> sufficient resources are available, and scale things up again once they
> >>> are available.
> >>> This is its core functionality, but we always intended to extend it
> >>> such that users can modify the parallelism at runtime as well.
> >>> And since the AS can already rescale jobs (and was purpose-built with
> >>> that functionality in mind), this is just a matter of exposing an API
> >>> for it. Everything else is already there.
> >>>
> >>> As a concrete use-case, let's say you have an SLA that says jobs must
> >>> not be down longer than X seconds, and a TM just crashed.
> >>> If you can absolutely guarantee that your k8s cluster can provision a
> >>> new TM within X seconds, no matter what cruel reality has in store for
> >>> you, than you /may/ not need it.
> >>> If you can't, well then here's a use-case for you.
> >>>
> >>>   > Last time I looked they implemented the same interface and the same
> >>> base class. Of course, their behavior is quite different.
> >>>
> >>> They never shared a base class since day 1. Are you maybe mixing up the
> >>> AdaptiveScheduler and AdaptiveBatchScheduler?
> >>>
> >>> As for FLINK-30773, I think that should be covered.
> >>>
> >>> On 26/01/2023 17:10, Maximilian Michels wrote:
> >>>> Thanks for the explanation. If not for the "reactive mode", what is
> >>>> the advantage of the adap

Re: Reworking the Rescale API

2023-01-26 Thread David Morávek
Hi Gyula,


> can you please explain why the AdaptiveScheduler is not the default
> scheduler?


There are still some smaller bits missing. As far as I know, the missing
parts are:

1) Local recovery (reusing the already downloaded state files after restart
/ rescale)
2) Support for fine-grained resource management
3) Support for the session cluster (Chesnay will be submitting a FLIP for
this soon)

We're looking into addressing all of these limitations in the short term.

Personally, I'd love to start a discussion about transitioning the
AdaptiveScheduler into the default one after those limitations are fixed.
Being able to eventually deprecate and remove the DefaultScheduler would
simplify the code-base by a lot since there are many adapters between new
and old interfaces (eg. SlotPool-related interfaces).

Best,
D.

On Thu, Jan 26, 2023 at 6:27 PM Gyula Fóra  wrote:

> Chesnay,
>
> Seems like you are suggesting that the Adaptive scheduler does everything
> the standard scheduler does and more.
>
> I am clearly not an expert on this topic but can you please explain why the
> AdaptiveScheduler is not the default scheduler?
> If it can do everything, why do we even have 2 schedulers? Why not simply
> drop the "old" one?
>
> That would probably clear up all confusions then :)
>
> Gyula
>
> On Thu, Jan 26, 2023 at 6:23 PM Chesnay Schepler 
> wrote:
>
> > There's the default and reactive mode; nothing else.
> > At its core they are the same thing; reactive mode just cranks up the
> > desired parallelism to infinity and enforces certain assumptions (e.g.,
> > no active resource management).
> >
> > The advantage is that the adaptive scheduler can run jobs while not
> > sufficient resources are available, and scale things up again once they
> > are available.
> > This is its core functionality, but we always intended to extend it
> > such that users can modify the parallelism at runtime as well.
> > And since the AS can already rescale jobs (and was purpose-built with
> > that functionality in mind), this is just a matter of exposing an API
> > for it. Everything else is already there.
> >
> > As a concrete use-case, let's say you have an SLA that says jobs must
> > not be down longer than X seconds, and a TM just crashed.
> > If you can absolutely guarantee that your k8s cluster can provision a
> > new TM within X seconds, no matter what cruel reality has in store for
> > you, then you /may/ not need it.
> > If you can't, well then here's a use-case for you.
> >
> >  > Last time I looked they implemented the same interface and the same
> > base class. Of course, their behavior is quite different.
> >
> > They never shared a base class since day 1. Are you maybe mixing up the
> > AdaptiveScheduler and AdaptiveBatchScheduler?
> >
> > As for FLINK-30773, I think that should be covered.
> >
> > On 26/01/2023 17:10, Maximilian Michels wrote:
> > > Thanks for the explanation. If not for the "reactive mode", what is
> > > the advantage of the adaptive scheduler? What other modes does it
> > > support?
> > >
> > >> Apart from implementing the same interface the implementations of the
> > adaptive and default schedulers are separate.
> > > Last time I looked they implemented the same interface and the same
> > > base class. Of course, their behavior is quite different.
> > >
> > > I'm still very interested in learning about the future FLIPs
> > > mentioned. Based on the replies, I'm assuming that they will support
> > > the changes required for
> > > https://issues.apache.org/jira/browse/FLINK-30773, or at least provide
> > > the basis for implementing them.
> > >
> > > -Max
> > >
> > > On Thu, Jan 26, 2023 at 4:57 PM Chesnay Schepler
> > wrote:
> > >> On 26/01/2023 16:18, Maximilian Michels wrote:
> > >>
> > >> I see slightly different goals for the standard and the adaptive
> > >> scheduler. The adaptive scheduler's goal is to adapt the Flink job
> > >> according to the available resources.
> > >>
> > >> This is really a misconception that we just have to stomp out.
> > >>
> > >> This statement only applies to reactive mode, a special mode the
> > adaptive scheduler (AS) can run in, where active resource management is
> > not supported since requesting infinite resources from k8s doesn't really
> > make sense.
> > >>
> > >> The AS itself can work perfectly fine with active resource management,
> > and has no effect on how the RM talks to k8s. It can just keep the job
> > running in cases where less than desired (==user-provided parallelism)
> > resources are provided by k8s (possibly temporarily).
> > >>
> > >> On 26/01/2023 16:18, Maximilian Michels wrote:
> > >>
> > >> After
> > >> all, both schedulers share the same super class
> > >>
> > >> Apart from implementing the same interface the implementations of the
> > adaptive and default schedulers are separate.
> >
> >
>


Re: [DISCUSS] FLIP-276: Data Consistency of Streaming and Batch ETL in Flink and Table Store

2022-12-12 Thread David Morávek
Hi Shammon,

I'm starting to see what you're trying to achieve, and it's really
exciting. I share Piotr's concerns about e2e latency and the inability to
use unaligned checkpoints.

I have a couple of questions that are not clear to me from going over the
FLIP:

1) Global Checkpoint Commit

Are you planning on committing the checkpoints in a) a "rolling fashion" -
one pipeline after another, or b) altogether - once the data have been
processed by all pipelines?

Option a) would be eventually consistent (for batch queries, you'd need to
use the last checkpoint produced by the most downstream table), whereas b)
would be strongly consistent at the cost of increasing the e2e latency even
more.

I feel that option a) is what this should be headed for.
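To make the two options concrete, here is a small, purely illustrative
sketch (plain Java with invented names — not Flink code) of option a): each
table commits a checkpoint independently as it rolls through the pipeline
chain, and a consistent cross-table read has to use the checkpoint last
committed by the most downstream table:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative model of "rolling" checkpoint commits across a chain of tables. */
public class RollingCommitSketch {

    // Latest committed checkpoint id per table, in pipeline order (upstream first).
    private final Map<String, Long> committed = new LinkedHashMap<>();

    void commit(String table, long checkpointId) {
        committed.put(table, checkpointId);
    }

    /**
     * For an eventually-consistent cross-table query, all tables must be read
     * at the checkpoint of the most downstream table, which trails the others.
     */
    long consistentReadVersion() {
        return committed.values().stream().mapToLong(Long::longValue).min().orElse(-1L);
    }

    public static void main(String[] args) {
        RollingCommitSketch chain = new RollingCommitSketch();
        // Checkpoint 7 rolls through the chain one pipeline at a time.
        chain.commit("Table1", 7);
        chain.commit("Table2", 7);
        chain.commit("Table3", 6); // the downstream ETL has not committed 7 yet
        // A consistent snapshot across all three tables is only available at 6.
        System.out.println(chain.consistentReadVersion()); // prints 6
    }
}
```

In other words, the readable version always trails at the most downstream
table, which is exactly the eventual-consistency trade-off described above.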

2) MetaService

Should this be a new general Flink component or one specific to the Flink
Table Store?

3) Follow-ups

From the above discussion, there is a consensus that, in the ideal case,
watermarks would be the way to go, but some underlying mechanism is
missing. It would be great to discuss this option in more detail and
compare the solutions in terms of implementation cost; maybe it would not
be as complex.


All in all, I don't feel that checkpoints are suitable for providing
consistent table versioning between multiple pipelines. The main reason is
that they are designed to be a fault tolerance mechanism. Somewhere between
the lines, you've already noted that the primitive you're looking for is
cross-pipeline barrier alignment, which is the mechanism a subset of
currently supported checkpointing implementations happens to be using. Is
that correct?

My biggest concern is that tying this to a "side-effect" of the
checkpointing mechanism could block us from evolving it further.

Best,
D.

On Mon, Dec 12, 2022 at 6:11 AM Shammon FY  wrote:

> Hi Piotr,
>
> Thank you for your feedback. I cannot see the DAG in 3.a in your reply, but
> I'd like to answer some questions first.
>
> Your understanding is very correct. We want to align the data versions of
> all intermediate tables through the checkpoint mechanism in Flink. I'm
> sorry that I omitted some default constraints in the FLIP, including that
> only aligned checkpoints are supported and that one table can only be
> written by one ETL job. I will add these later.
>
> Why can't the watermark mechanism achieve the data consistency we want?
> For example, there are 3 tables: Table1 is a word table, Table2 is a
> word->cnt table and Table3 is a cnt1->cnt2 table.
>
> 1. ETL1 from Table1 to Table2: INSERT INTO Table2 SELECT word, count(*)
> FROM Table1 GROUP BY word
>
> 2. ETL2 from Table2 to Table3: INSERT INTO Table3 SELECT cnt, count(*) FROM
> Table2 GROUP BY cnt
>
> ETL1 has 2 subtasks to read multiple buckets from Table1, where subtask1
> reads streaming data as [a, b, c, a, d, a, b, c, d ...] and subtask2 reads
> streaming data as [a, c, d, q, a, v, c, d ...].
>
> 1. Unbounded streaming data is divided into multiple sets according to some
> semantic requirements; in the most extreme case, one set per record.
> Assume that the sets of subtask1 and subtask2 separated by the same
> semantics are [a, b, c, a, d] and [a, c, d, q], respectively.
>
> 2. After the above two sets are computed by ETL1, the result data generated
> in Table 2 is [(a, 3), (b, 1), (c, 1), (d, 2), (q, 1)].
>
> 3. The result data generated in Table 3 after the data in Table 2 is
> computed by ETL2 is [(1, 3), (2, 1), (3, 1)]
>
> We want to align the data of Table1, Table2 and Table3 and manage the data
> versions. When users execute OLAP/Batch queries join on these tables, the
> following consistency data can be found
>
> 1. Table1: [a, b, c, a, d] and [a, c, d, q]
>
> 2. Table2: [a, 3], [b, 1], [c, 1], [d, 2], [q, 1]
>
> 3. Table3: [1, 3], [2, 1], [3, 1]
>
> Users can perform query: SELECT t1.word, t2.cnt, t3.cnt2 from Table1 t1
> JOIN Table2 t2 JOIN Table3 t3 on t1.word=t2.word and t2.cnt=t3.cnt1;
>
> From the users' point of view, the data is consistent on a unified
> "version" across Table1, Table2 and Table3.
>
> In the current Flink implementation, the aligned checkpoint can achieve the
> above capabilities (let's ignore the segmentation semantics of checkpoint
> first). Because the Checkpoint Barrier will align the data when performing
> the global Count aggregation, we can associate the snapshot with the
> checkpoint in the Table Store, query the specified snapshot of
> Table1/Table2/Table3 through the checkpoint, and achieve the consistency
> requirements of the above unified "version".
>
> The current watermark mechanism in Flink cannot achieve the above
> consistency. For example, we use watermarks to divide the data into
> multiple sets in subtask1 and subtask2 as follows
>
> 1. subtask1:[(a, T1), (b, T1), (c, T1), (a, T1), (d, T1)], T1, [(a, T2),
> (b, T2), (c, T2), (d, T2)], T2
>
> 2. subtask2: [(a, T1), (c, T1), (d, T1), (q, T1)], T1, 
>
> As Flink watermarks do not have barriers and cannot align data, the ETL1
> Count operator may compute the data of 

Re: [DISCUSS] Cleaning up HighAvailabilityServices interface to reflect the per-JM-process LeaderElection

2022-12-10 Thread David Morávek
Hi Dong,

> Adding regarding the effort to add back the per-component election
capability: given that the implementation already follows per-process
election, and given that there will likely be a lot of extra
design/implementation/test effort needed to achieve the use-cases described
above, maybe the change proposed in this thread won't affect the overall
effort much?

This might be a misunderstanding; what Chesnay is proposing is _not
removing the existing interfaces_ that allow us to eventually split
components out into separate processes.

Maybe let's be more explicit about what the current state is:

1) _HighAvailabilityServices_ interface contains methods to create
_LeaderElectionService_ and _LeaderRetrievalService_ for each component
separately
2) In 1.15, we've introduced an alternative implementation ->
_MultipleComponentLeaderElectionService_ that can multicast the leader
election to multiple components.
3) In 1.16, we've removed the old HA services because they didn't provide
any extra capability beyond what _MultipleComponentLeaderElectionService_
offers. It indeed did per-component leader election, but it was still
effectively tied with a single JM process, so adding this back would only
help a little with the component split efforts.
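To illustrate, the multicasting idea in (2) can be sketched roughly as
follows (plain Java with invented names — not the actual Flink
interfaces): a single election is run per JM process, and the outcome is
fanned out to all registered components:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of multiplexing one leader election to many components. */
public class ProcessLeaderElectionSketch {

    /** A component (Dispatcher, ResourceManager, ...) reacting to leadership changes. */
    interface LeaderContender {
        void grantLeadership(String leaderSessionId);
        void revokeLeadership();
    }

    private final List<LeaderContender> contenders = new ArrayList<>();
    private boolean isLeader;

    void register(LeaderContender contender) {
        contenders.add(contender);
    }

    /** Called once per process when the underlying (ZK/k8s) election is won. */
    void onGrant(String sessionId) {
        isLeader = true;
        contenders.forEach(c -> c.grantLeadership(sessionId));
    }

    /** Called once per process when leadership is lost. */
    void onRevoke() {
        isLeader = false;
        contenders.forEach(LeaderContender::revokeLeadership);
    }

    boolean isLeader() {
        return isLeader;
    }
}
```

All components of one JM process then win or lose leadership together;
re-introducing per-component election would mean running one such election
per component again, which is the flexibility the per-component interfaces
keep open.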

The biggest motivation for re-factoring from my side would be that it was
tough to fit the _MultipleComponentLeaderElectionService_ into the existing
interfaces, so the implementation is unnecessarily complex.

I think what we should do instead is re-thinking these interfaces, so they
can still provide the flexibility of letting the user split out some
components into a separate process. There is also a pending discussion
(FLIP-257 [1]) that hints that some people are already thinking in this
direction, and it might be required for their use case.

I've also recently started to think that splitting out the
ResourceManager might be crucial for building a large-scale managed
service. There are a lot of companies emerging in this area right now, so I
don't feel like we should be closing these doors just yet.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-257%3A+Flink+JobManager+Process+Split

Best,
D.

On Sat, Dec 10, 2022 at 7:01 AM Dong Lin  wrote:

> Hi Matthias,
>
> Thanks for the explanation. I was trying to understand the concrete
> user-facing benefits of preserving the flexibility of per-component leader
> election. Now I get that maybe they want to scale those components
> independently, and maybe run the UI in an environment that is more
> accessible
> than the other processes.
>
> I replied to Chesnay's email regarding whether it is worthwhile to keep the
> existing interface for those potential but not-yet-realized benefits.
>
> Thanks,
> Dong
>
> On Fri, Dec 9, 2022 at 5:47 PM Matthias Pohl  .invalid>
> wrote:
>
> > Hi Dong,
> > see my answers below.
> >
> > Regarding "Interface change might affect other projects that customize HA
> > > services", are you referring to those projects which hack into Flink's
> > > source code (as opposed to using Flink's public API) to customize HA
> > > services?
> >
> >
> > Yes, the proposed change might affect projects that need to have their
> own
> > HA implementation for whatever reason (interface change) or if a project
> > accesses the HA backend to retrieve metadata from the ZK node/k8s
> ConfigMap
> > (change about how the data is stored in the HA backend). The latter one
> was
> > actually already the case with the change introduced by FLINK-24038 [1].
> >
> > By the way, since Flink already supports zookeeper and kubernetes as the
> > > high availability services, are you aware of many projects that still
> > need
> > > to hack into Flink's code to customize high availability services?
> >
> >
> > I am aware of projects that use customized HA. But based on our
> experience
> > in FLINK-24038 [1] no one complained. So, making people aware through the
> > mailing list might be good enough.
> >
> > And regarding "We lose some flexibility in terms of per-component
> > > LeaderElection", could you explain what flexibility we need so that we
> > can
> > > gauge the associated downside of losing the flexibility?
> >
> >
> > Just to recap: The current interface allows having per-component
> > LeaderElection (e.g. the ResourceManager leader can run on a different
> > JobManager than the Dispatcher). This implementation was replaced by
> > FLINK-24038 [1] and removed in FLINK-25806 [2]. The new implementation
> does
> > LeaderElection per process (e.g. ResourceManager and Dispatcher always
> run
> > on the same JobManager). The changed interface would require us to touch
> > the interface again if (for whatever reason) we want to reintroduce
> > per-component leader election in some form.
> > The interface change is, strictly speaking, not necessary to provide the
> > new functionality. But I like the idea of certain requirements
> (currently,
> > we need per-process leader election to fix what was 

Re: [DISCUSS] Changing the minimal supported version of Hadoop to 2.10.2

2022-10-19 Thread David Morávek
+1; anything below 2.10.x seems to be EOL

Best,
D.

On Mon, Oct 17, 2022 at 10:48 AM Márton Balassi 
wrote:

> Hi Martijn,
>
> +1 for 2.10.2. Do you expect to have bandwidth in the near term to
> implement the bump?
>
> On Wed, Oct 5, 2022 at 5:00 PM Gabor Somogyi 
> wrote:
>
> > Hi Martijn,
> >
> > Thanks for bringing this up! Lately I was thinking about bumping the
> > Hadoop
> > version to at least 2.6.1 to clean up issues like this:
> >
> >
> https://github.com/apache/flink/blob/8d05393f5bcc0a917b2dab3fe81a58acaccabf13/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/util/HadoopUtils.java#L157-L159
> >
> > All in all +1 from my perspective.
> >
> > Just a question here. Are we stating the minimum Hadoop version for users
> > somewhere in the docs, or do they need to find it out from the source
> > code like this?
> >
> >
> https://github.com/apache/flink/blob/3a4c11371e6f2aacd641d86c1d5b4fd86435f802/tools/azure-pipelines/build-apache-repo.yml#L113
> >
> > BR,
> > G
> >
> >
> > On Wed, Oct 5, 2022 at 5:02 AM Martijn Visser 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Little over a year ago a discussion thread was opened on changing the
> > > minimal supported version of Hadoop and bringing that to 2.8.5. [1] In
> > this
> > > discussion thread, I would like to propose to bring that minimal
> > supported
> > > version of Hadoop to 2.10.2.
> > >
> > > Hadoop 2.8.5 is vulnerable for multiple CVEs which are classified as
> > > Critical. [2] [3]. While Flink is not directly impacted by those, we do
> > see
> > > vulnerability scanners flag Flink as being vulnerable. We could easily
> > > mitigate that by bumping the minimal supported version of Hadoop to
> > 2.10.2.
> > >
> > > I'm looking forward to your opinions on this topic.
> > >
> > > Best regards,
> > >
> > > Martijn
> > > https://twitter.com/MartijnVisser82
> > > https://github.com/MartijnVisser
> > >
> > > [1] https://lists.apache.org/thread/81fhnwfxomjhyy59f9bbofk9rxpdxjo5
> > > [2] https://nvd.nist.gov/vuln/detail/CVE-2022-25168
> > > [3] https://nvd.nist.gov/vuln/detail/CVE-2022-26612
> > >
> >
>


Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

2022-08-23 Thread David Morávek
Hi Zheng,

Thanks for the write-up! I tend to agree with Chesnay that this introduces
additional complexity to an already complex deployment model.

One of the main focuses in this area is to reduce feature sparsity and to
have fewer high-quality options. Example efforts are deprecation (and
eventual removal) of per-job mode, removal of Mesos RM, ...

Let's discuss your points:

> This can save some JVM resources and reduce server costs

If so, the savings would IMO be negligible. Why?

- JobMaster is by far the most resource-intensive component inside the
JobManager
- The CPU / memory ratio of the underlying hypervisor remains the same (or
you'd have unused resources on the machine that you still need to pay for)
- Most of the JobMaster's overhead comes from the JVM itself, not from the
RM / Dispatcher

> More adequate resource utilization

Can you elaborate? Is this about sharing TMs between multiple jobs (I'd
discourage that for long-running mission-critical workloads)?

> Starting Application Mode has a long resource application and waiting
(because SessionCluster has already applied for fixed TM and JM resources
at startup)

This means you have to overprovision your SessionCluster. This goes against
the resource-utilization efforts from the previous point (you're shaving off
a little resource from the JM, but keeping spare TMs instead, which are an
order of magnitude more resource intensive).

If you're able to start TMs upfront with the session cluster, you already
know you're going to need them. If this is a concern, you could as well
start the TMs that will eventually connect to your JM once it starts
(you've decided to submit your job) - there might be some enhancements to
ApplicationMode needed to make this robust, but efforts in this direction
are where the things should IMO be headed.

As for the resource utilization, the session cluster actually blocks you
from leveraging reactive scaling efforts and eventually auto-scaling,
because we'd need to enhance Flink surface area with multi-job scheduling
capabilities (queues, pre-emptions, priorities between jobs) - I don't
think we should ever go in that direction, that's outside Flink's scope.

> Poor isolation between JobMaster threads in JobManager: When there are
too many jobs, the JobManager is under great pressure.

The session mode is mainly designed for interactive workloads but agreed
that JM threads might interfere. Still, I fail to see this as a reason for
introducing additional complexity because this could be mitigated on the
user side (smarter job scheduling, multiple clusters, AM for streaming
jobs).

> there will inevitably be more rich functions running on JobMaster.

This is a separate discussion. So far we were mostly pushing against
running any user code on the JM (there are a few exceptions already, but
any enhancement should be carefully considered)

> JobManager's functional responsibilities are too large

from the "architecture perspective", it's just a bundle of independent
components with clearly defined responsibilities, which makes their
coordination simpler and more resource-efficient (networking, fewer JVMs;
each comes with a significant overhead)

--

So far I'm under the impression that this actually introduces more issues than
it tries to solve.

Best,
D.


On Thu, Aug 18, 2022 at 12:10 PM Zheng Yu Chen  wrote:

> You're right, this does add to the complexity of their communication
> coordination
> I understand; what you mean is similar to nginx: load balancing across
> different SessionClusters in front, rather than adding one more component. In
> fact, I have tried this myself, and it seems to solve the problem of high
> load of cluster JM, but it cannot fundamentally solve the following
> problems
>
> Deploying components is complicated and requires one more nginx and related
> configuration. You also need to make sure that your jobs are not assigned
> to a busy JobManager
> As my previous reply mentioned, this is a trade-off solution (after all,
> you can choose Application Mode, so there would be no such problem) for
> when we need to use a SessionCluster for long-running jobs. Can we think of
> it like this?
>
> what do you think ~
>
>
> Chesnay Schepler  于2022年8月17日周三 22:31写道:
>
> > To be honest I'm terrified at the idea of splitting the Dispatcher into
> > several processes, even more so if this is supposed to be opt-in and
> > specific to session mode.
> > It would fragment the coordination layer even more than it already is,
> > and make ops more complicated (yet another set of processes to monitor,
> > configure etc.).
> >
> > I'm not convinced that this proposal really gets us a lot of benefits;
> > and would rather propose that you split your single session cluster into
> > multiple session clusters (with the scheduling component in front of it
> > to distribute jobs) to even the load.
> >
> >  > The currently idling JobManagers could be utilized to take over some
> > of the workload from the leader.
> >
> > 

Re: [VOTE] FLIP-227: Support overdraft buffer

2022-05-26 Thread David Morávek
+1 (binding)

On Thu, May 26, 2022 at 11:55 AM Dawid Wysakowicz 
wrote:

> +1 (binding)
>
> On Thu, 26 May 2022, 11:21 Piotr Nowojski,  wrote:
>
> > Yes, it will be a good improvement :)
> >
> > +1 (binding)
> >
> > Piotrek
> >
> > On Thu, 26 May 2022 at 10:26, Anton Kalashnikov 
> > wrote:
> >
> > > Hi.
> > >
> > > Thanks Fanrui for this FLIP. I think it will be useful thing for us.
> > >
> > > +1(non-binding)
> > >
> > > --
> > >
> > > Best regards,
> > > Anton Kalashnikov
> > >
> > > On 26.05.2022 06:00, rui fan wrote:
> > > > Hi, everyone,
> > > >
> > > > Thanks for your feedback for FLIP-227: Support overdraft buffer[1] on
> > > > the discussion thread[2].
> > > >
> > > > I'd like to start a vote for it. The vote will be open for at least
> 72
> > > > hours unless there is an objection or not enough votes.
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
> > > > [2] https://lists.apache.org/thread/4p3xcf0gg4py61hsnydvwpns07d1nog7
> > > >
> > > > Best wishes
> > > > fanrui
> > >
> > >
> >
>


Re: Flink UI in Application Mode

2022-05-23 Thread David Morávek
Hi Zain,

you can find a link to the web UI either in the CLI output after the job
submission or in the YARN ResourceManager web UI [1]. With YARN, Flink needs
to choose the application master port at random (it can be constrained by
setting _yarn.application-master.port_), as there might be
multiple JMs running on the same NodeManager.
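For reference, the port (range) can be pinned in the Flink configuration so that firewall rules can match it; a minimal config sketch (the range below is only an example, not a recommended value):

```yaml
# flink-conf.yaml -- constrain the application master / JobManager port on YARN.
# Accepts a single port, a range, or a list; the range here is illustrative.
yarn.application-master.port: 50100-50200
```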

OT: I've seen you've opened multiple threads on both dev and user mailing
list. As these are all "user" related questions, can you please focus them
on the user ML only? Separating user & development (the Flink
contributions) threads into separate lists allows the community to work more
efficiently.

Best,
D.

On Sun, May 22, 2022 at 7:44 PM Zain Haider Nemati 
wrote:

> Hi,
> Which port does flink UI run on in application mode?
> If I am running 5 yarn jobs in application mode would the UI be same for
> each or different ports for each?
>


Re: About Native Deployment's Autoscaling implementation

2022-05-23 Thread David Morávek
Hi Talat,

This is definitely an interesting and rather complex topic.

Few unstructured thoughts / notes / questions:

- The main struggle has always been that it's hard to come up with a
generic, one-size-fits-all metric for autoscaling.
  - Flink doesn't have knowledge of the external environment (eg. capacity
planning on the cluster, no notion of pre-emption), so it cannot really
make a qualified decision in some cases.
  - ^ the above goes along the same reasoning as why we don't support
reactive mode with the session cluster (multi-job scheduling)
- The re-scaling decision logic most likely needs to be pluggable for the
above reasons
  - We're in general fairly concerned about running any user code in JM for
stability reasons.
  - The most flexible option would be to allow setting the desired
parallelism via the REST API and leave the scaling decision to an external
process, which could be reused for both standalone and "active" deployment
modes (there is actually a prototype by Till that allows this [1])

How do you intend to make an autoscaling decision? Also note that the
re-scaling is still a fairly expensive operation (especially with large
state), so you need to make sure the autoscaler doesn't oscillate and
doesn't re-scale too often (this is also something that could vary from
workload to workload).

Note on the metrics question with an auto-scaler living in the JM:
- We shouldn't really collect the metrics into the JM; instead, the JM can
pull them from the TMs directly on demand (basically the same thing an
external auto-scaler would do).

Looking forward to your thoughts

[1] https://github.com/tillrohrmann/flink/commits/autoscaling

Best,
D.

On Mon, May 23, 2022 at 8:32 AM Talat Uyarer 
wrote:

> Hi,
> I am working on auto scaling support for native deployments. Today Flink
> provides Reactive mode however it only runs on standalone deployments. We
> use Kubernetes native deployment. So I want to increase or decrease job
> resources for our streaming jobs. The recent FLIP-138 and FLIP-160 are very
> useful for achieving this goal. I started reading the code of Flink's JobManager,
> AdaptiveScheduler and DeclarativeSlotPool etc.
>
> My assumption is Required Resources will be calculated on AdaptiveScheduler
> whenever the scheduler receives a heartbeat from a task manager by calling
> public void updateAccumulators(AccumulatorSnapshot accumulatorSnapshot)
> method.
>
> I checked TaskExecutorToJobManagerHeartbeatPayload class however I only see
> *accumulatorReport* and *executionDeploymentReport* . Do you have any
> suggestions to collect metrics from TaskManagers ? Should I add metrics on
> TaskExecutorToJobManagerHeartbeatPayload ?
>
> I am open to another suggestion for this. Whenever I finalize my
> investigation. I will create a FLIP for more detailed implementation.
>
> Thanks for your help in advance.
> Talat
>


Re: [ANNOUNCE] New Flink PMC member: Yang Wang

2022-05-06 Thread David Morávek
Nice! Congrats Yang, well deserved! ;)

On Fri 6. 5. 2022 at 17:53, Peter Huang  wrote:

> Congrats, Yang!
>
>
>
> Best Regards
> Peter Huang
>
> On Fri, May 6, 2022 at 8:46 AM Yu Li  wrote:
>
> > Congrats and welcome, Yang!
> >
> > Best Regards,
> > Yu
> >
> >
> > On Fri, 6 May 2022 at 14:48, Paul Lam  wrote:
> >
> > > Congrats, Yang! Well Deserved!
> > >
> > > Best,
> > > Paul Lam
> > >
> > > > > On 6 May 2022, at 14:38, Yun Tang  wrote:
> > > >
> > > > Congratulations, Yang!
> > > >
> > > > Best
> > > > Yun Tang
> > > > 
> > > > From: Jing Ge 
> > > > Sent: Friday, May 6, 2022 14:24
> > > > To: dev 
> > > > Subject: Re: [ANNOUNCE] New Flink PMC member: Yang Wang
> > > >
> > > > Congrats Yang and well Deserved!
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > On Fri, May 6, 2022 at 7:38 AM Lincoln Lee 
> > > wrote:
> > > >
> > > >> Congratulations Yang!
> > > >>
> > > >> Best,
> > > >> Lincoln Lee
> > > >>
> > > >>
> > > >> Őrhidi Mátyás  wrote on Fri, May 6, 2022 at 12:46:
> > > >>
> > > >>> Congrats Yang! Well deserved!
> > > >>> Best,
> > > >>> Matyas
> > > >>>
> > > >>> On Fri, May 6, 2022 at 5:30 AM huweihua 
> > > wrote:
> > > >>>
> > >  Congratulations Yang!
> > > 
> > >  Best,
> > >  Weihua
> > > 
> > > 
> > > >>>
> > > >>
> > >
> > >
> >
>


Re: [DISCUSS] Planning Flink 1.16

2022-04-27 Thread David Morávek
Thanks Konstantin and Chesnay for starting the discussion and volunteering.
The timeline proposal sounds reasonable :+1:

Best,
D.

On Tue, Apr 26, 2022 at 1:37 PM Martijn Visser 
wrote:

> Hi everyone,
>
> Thanks for starting this discussion. I would also volunteer to help out as
> a release manager for the 1.16 release.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
>
> On Tue, 26 Apr 2022 at 13:19, godfrey he  wrote:
>
> > Hi Konstantin & Chesnay,
> >
> > Thanks for driving this discussion, I am willing to volunteer as the
> > release manager for 1.16.
> >
> >
> > Best,
> > Godfrey
> >
> > Konstantin Knauf  wrote on Tue, Apr 26, 2022 at 18:23:
> > >
> > > Hi everyone,
> > >
> > > With Flink 1.15 about to be released, the community has started
> planning
> > &
> > > developing features for the next release, Flink 1.16. As such, I would
> > like
> > > to start a discussion around managing this release.
> > >
> > > Specifically, Chesnay & myself would like to volunteer as release
> > managers.
> > > Our focus as release managers would be
> > > * to propose a release timeline
> > > * to provide an overview of all ongoing development threads and ideally
> > > their current status to the community
> > > * to keep an eye on build stability
> > > * facilitate release testing
> > > * to do the actual release incl. communication (blog post, etc.)
> > >
> > > Is anyone else interested in acting as a release manager for Flink
> 1.16?
> > If
> > > so, we are happy to make this a joint effort.
> > >
> > > Besides the question of who will act as a release manager, I think, we
> > can
> > > already use this thread to align on a timeline. For collecting features
> > and
> > > everything else, we would start a dedicated threads shortly.
> > >
> > > Given Flink 1.15 will be released in the next days, and aiming for a 4
> > > months release cycle including stabilization, this would mean *feature
> > > freeze at the end of July*. The exact date could be determined later.
> Any
> > > thoughts on the timeline.?
> > >
> > > Looking forward to your thoughts!
> > >
> > > Thanks,
> > >
> > > Chesnay & Konstantin
> >
>


Re: [DISCUSS] FLIP-220: Temporal State

2022-04-22 Thread David Morávek
>
> With that in mind, we could only offer a couple of selected
> temporal/sorted
> state implementations that are handled internally, but not really a
> generic
> one - even if you let the user explicitly handle binary keys...


If we want to have a generic interface that is portable between different
state backends and allows for all the use-cases described above,
lexicographical binary sort sounds reasonable, because you need to be able
to push sorting out of the JVM boundary.

The only trade-off I can think of is that as long as you stay within the
JVM (heap state backend), you need to pay a slight key serialization cost,
which is IMO ok-ish.

Do you have any future state backend ideas in mind, that might not work
with this assumption?
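To illustrate why a lexicographically sortable binary form matters: a state backend like RocksDB orders keys by an unsigned byte-wise comparison, which does not match Java's signed interpretation. For a fixed-length type such as `long`, flipping the sign bit and writing big-endian is enough to make the two orders agree. The following is an illustrative sketch (not Flink or FLIP-220 API; the class and method names are made up):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative: encode a signed long so that unsigned lexicographic byte
// comparison (what RocksDB's default comparator does) matches numeric order.
final class OrderPreservingLongKey {

    static byte[] encode(long value) {
        // XOR with Long.MIN_VALUE flips the sign bit, mapping
        // [Long.MIN_VALUE, Long.MAX_VALUE] monotonically onto unsigned 64-bit.
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    // Compares encoded keys the way an out-of-JVM store would: unsigned, byte-wise.
    static int compareEncoded(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```

Variable-length types (e.g. strings) are the hard part, as discussed later in the thread: a length prefix breaks this property, so the encoding has to avoid one.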

-

I'm really starting to like the idea of having a BinarySortedMapState +
higher level / composite states.

D.


On Fri, Apr 22, 2022 at 1:58 PM David Morávek 
wrote:

> Isn't allowing a TemporalValueState just a special case of b.III? So if a
>> user
>> of the state wants that, then they can leverage a simple API vs. if you
>> want
>> fancier duplicate handling, you'd just go with TemporalListState and
>> implement
>> the logic you want?
>
>
> Yes it is. But it IMO doesn't justify adding a new state primitive. My
> take would be that as long as we can build TVS using other existing state
> primitives (TLS) we should treat it as a "composite state". We currently
> don't have a good user facing API to do that, but it could be added in
> separate FLIP.
>
> eg. something along the lines of
>
> TemporalValueState<V> state = getRuntimeContext().getCompositeState(
> new CompositeStateDescriptor<>(
> "composite", new TemporalValueState(type)));
>
> On Fri, Apr 22, 2022 at 1:44 PM Nico Kruber  wrote:
>
>> David,
>>
>> 1) Good points on the possibility to make the TemporalListState generic
>> -> actually, if you think about it more, we are currently assuming that
>> all
>> state backends use the same comparison on the binary level because we add
>> an
>> appropriate serializer at an earlier abstraction level. This may not
>> actually
>> hold for all (future) state backends and can limit further
>> implementations (if
>> you think this is something to keep in mind!).
>>
>> So we may have to push this serializer implementation further down the
>> stack,
>> i.e. our current implementation is one that fits RocksDB and that alone...
>>
>> With that in mind, we could only offer a couple of selected
>> temporal/sorted
>> state implementations that are handled internally, but not really a
>> generic
>> one - even if you let the user explicitly handle binary keys...
>>
>>
>> 2) Duplicates
>>
>> Isn't allowing a TemporalValueState just a special case of b.III? So if a
>> user
>> of the state wants that, then they can leverage a simple API vs. if you
>> want
>> fancier duplicate handling, you'd just go with TemporalListState and
>> implement
>> the logic you want?
>>
>>
>>
>> Nico
>>
>> On Friday, 22 April 2022 10:43:48 CEST David Morávek wrote:
>> >  Hi Yun & Nico,
>> >
>> > few thoughts on the discussion
>> >
>> > 1) Making the TemporalListState generic
>> >
>> > This is just not possible with the current infrastructure w.r.t type
>> > serializers as the sorting key *needs to be comparable on the binary
>> level*
>> > (serialized form).
>> >
>> > What I could imagine, is introducing some kind of
>> `Sorted(List|Map)State`
>> > with explicit binary keys. User would either have to work directly with
>> > `byte[]` keys or provide a function for transforming keys into the
>> binary
>> > representation that could be sorted (this would have to be different
>> from
>> > `TypeSerializer` which could get more fancy with the binary
>> representation,
>> > eg. to save up space -> varints).
>> >
>> > This kind of interface might be really hard to grasp by the pipeline
>> > authors. There needs to be a deeper understanding how the byte
>> comparison
>> > works (eg. it needs to be different from the java byte comparison which
>> > compares bytes as `signed`). This could be maybe partially mitigated by
>> > providing predefined `to binary sorting key` functions for the common
>> > primitives / types.
>> >
>> > 2) Duplicates
>> >
>> > I guess, this all boils down to dealing with duplicates / values for the
>> >
>> &g

Re: [DISCUSS] FLIP-220: Temporal State

2022-04-22 Thread David Morávek
>
> Isn't allowing a TemporalValueState just a special case of b.III? So if a
> user
> of the state wants that, then they can leverage a simple API vs. if you
> want
> fancier duplicate handling, you'd just go with TemporalListState and
> implement
> the logic you want?


Yes it is. But it IMO doesn't justify adding a new state primitive. My take
would be that as long as we can build TVS using other existing state
primitives (TLS) we should treat it as a "composite state". We currently
don't have a good user facing API to do that, but it could be added in
separate FLIP.

eg. something along the lines of

TemporalValueState<V> state = getRuntimeContext().getCompositeState(
new CompositeStateDescriptor<>(
"composite", new TemporalValueState(type)));

On Fri, Apr 22, 2022 at 1:44 PM Nico Kruber  wrote:

> David,
>
> 1) Good points on the possibility to make the TemporalListState generic
> -> actually, if you think about it more, we are currently assuming that
> all
> state backends use the same comparison on the binary level because we add
> an
> appropriate serializer at an earlier abstraction level. This may not
> actually
> hold for all (future) state backends and can limit further implementations
> (if
> you think this is something to keep in mind!).
>
> So we may have to push this serializer implementation further down the
> stack,
> i.e. our current implementation is one that fits RocksDB and that alone...
>
> With that in mind, we could only offer a couple of selected
> temporal/sorted
> state implementations that are handled internally, but not really a
> generic
> one - even if you let the user explicitly handle binary keys...
>
>
> 2) Duplicates
>
> Isn't allowing a TemporalValueState just a special case of b.III? So if a
> user
> of the state wants that, then they can leverage a simple API vs. if you
> want
> fancier duplicate handling, you'd just go with TemporalListState and
> implement
> the logic you want?
>
>
>
> Nico
>
> On Friday, 22 April 2022 10:43:48 CEST David Morávek wrote:
> >  Hi Yun & Nico,
> >
> > few thoughts on the discussion
> >
> > 1) Making the TemporalListState generic
> >
> > This is just not possible with the current infrastructure w.r.t type
> > serializers as the sorting key *needs to be comparable on the binary
> level*
> > (serialized form).
> >
> > What I could imagine, is introducing some kind of `Sorted(List|Map)State`
> > with explicit binary keys. User would either have to work directly with
> > `byte[]` keys or provide a function for transforming keys into the binary
> > representation that could be sorted (this would have to be different from
> > `TypeSerializer` which could get more fancy with the binary
> representation,
> > eg. to save up space -> varints).
> >
> > This kind of interface might be really hard to grasp by the pipeline
> > authors. There needs to be a deeper understanding how the byte comparison
> > works (eg. it needs to be different from the java byte comparison which
> > compares bytes as `signed`). This could be maybe partially mitigated by
> > providing predefined `to binary sorting key` functions for the common
> > primitives / types.
> >
> > 2) Duplicates
> >
> > I guess, this all boils down to dealing with duplicates / values for the
> >
> > > same timestamp.
> >
> > We should never have duplicates. Let's try to elaborate on what having
> the
> > duplicates really means in this context.
> >
> > a) Temporal Table point of view
> >
> > There could be only a single snapshot of the table at any given point in
> > time (~ physics). If we allow for duplicates we violate this, as it's not
> > certain what the actual state of the table is at that point in time. In
> > case of the temporal join, what should the elements from the other side
> > join against?
> >
> > If we happen to have a duplicate, it actually brings us to b) causality
> > (which could actually answer the previous question).
> >
> > b) Causality
> >
> > When building any kind of state machine, it's important to think about
> > causality (if we switch the order of events, state transitions no longer
> > result in the same state). Temporal table is a specific type of the state
> > machine.
> >
> > There are several approaches to mitigate this:
> > I) nano-second precision -> the chance that two events affecting the same
> > thing happen at the exactly same nanosecond is negligible (from the
> > physical standpoint)
> > II) the sorting key i

Re: [DISCUSS] FLIP-220: Temporal State

2022-04-22 Thread David Morávek
r of serialized bytes same as original
> java objects first.
> For the fixed-length serializer, such as LongSerializer, we just need to
> ensure all numbers are positive or inverting the sign bit.
> However, for non-fixed-length serializer, such as StringSerializer, it
> will write the length of the bytes first, which will break the natural
> order if comparing the bytes. Thus, we might need to avoid to write the
> length in the serialized bytes.
> On the other hand, changelog logger would record operation per key one by
> one in the logs. We need to consider how to distinguish each key in the
> combined serialized byte arrays.
>
> Best
> Yun Tang
>
> --
> *From:* Nico Kruber 
> *Sent:* Thursday, April 21, 2022 23:50
> *To:* dev 
> *Cc:* David Morávek ; Yun Tang 
> *Subject:* Re: [DISCUSS] FLIP-220: Temporal State
>
> Thanks Yun Tang for your clarifications.
> Let me keep my original structure and reply in these points...
>
> 3. Should we generalise the Temporal***State to offer arbitrary key types
> and
> not just Long timestamps?
>
> The use cases you detailed do indeed look similar to the ones we were
> optimising in our TemporalState PoC...
>
> I don't think, I'd like to offer a full implementation of NavigableMap
> though
> because that seems quite some overhead to implement while we can cover the
> mentioned examples with the proposed APIs already when using iterators as
> well
> as single-value retrievals.
> So far, when we were iterating from the smallest key, we could just use
> Long.MIN_VALUE and start from there. That would be difficult to generalise
> for
> arbitrary data types because you may not always know the smallest possible
> value for a certain serialized type (unless we put this into the
> appropriate
> serializer interface).
>
> I see two options here:
> a) a slim API but using NULL as an indicator for smallest/largest
> depending on
> the context, e.g.
>   - `readRange(null, key)` means from beginning to key
>   - `readRange(key, null)` means from key to end
>   - `readRange(null, null)` means from beginning to end
>   - `value[AtOr]Before(null)` means largest available key
>   - `value[AtOr]After(null)` means smallest available key
> b) a larger API with special methods for each of these use cases similar
> to
> what NavigableMap has but based on iterators and single-value functions
> only
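To make option (a) above concrete, here is a hypothetical, in-memory sketch of such a slim API with null-as-open-bound semantics, backed by a `TreeMap` purely for illustration. All names and signatures are assumptions for discussion, not the FLIP's actual interface:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of option (a): null bounds mean "open on that side".
//   readRange(null, k)  -> beginning..k
//   readRange(k, null)  -> k..end
//   readRange(null,null)-> everything
//   valueAtOrBefore(null) -> value at the largest key present
final class SortedStateSketch<K extends Comparable<K>, V> {
    private final NavigableMap<K, V> map = new TreeMap<>();

    void put(K key, V value) {
        map.put(key, value);
    }

    Iterable<Map.Entry<K, V>> readRange(K fromInclusive, K toInclusive) {
        NavigableMap<K, V> view = map;
        if (fromInclusive != null) view = view.tailMap(fromInclusive, true);
        if (toInclusive != null) view = view.headMap(toInclusive, true);
        return view.entrySet();
    }

    V valueAtOrBefore(K key) {
        Map.Entry<K, V> e = (key == null) ? map.lastEntry() : map.floorEntry(key);
        return e == null ? null : e.getValue();
    }
}
```

A real implementation would of course live on the state-backend level rather than on a heap map, but the null-bound semantics stay the same.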
>
> > BTW, I prefer to introduce another state descriptor instead of current
> map
> > state descriptor.
>
> Can you elaborate on this? We currently don't need extra functionality, so
> this would be a plain copy of the MapStateDescriptor...
>
> > For the API of SortedMapOfListsState, I think this is a bit bounded to
> > current implementation of RocksDB state-backend.
>
> Actually, I don't think this is special to RocksDB but generic to all
> state
> backends that do not hold values in memory and allow fast append-like
> operations.
> Additionally, since this is a very common use case and RocksDB is also
> widely
> used, I wouldn't want to continue without this specialization. For a
> similar
> reason, we offer ListState and not just ValueState...
>
>
> 4. ChangelogStateBackend
>
> > For the discussion of ChangelogStateBackend, you can think of changelog
> > state-backend as a write-ahead-log service. And we need to record the
> > changes to any state, thus this should be included in the design doc as
> we
> > need to introduce another kind of state, especially you might need to
> > consider how to store key bytes serialized by the new serializer (as we
> > might not be able to write the length in the beginning of serialized
> bytes
> > to make the order of bytes same as natural order).
>
> Since the ChangelogStateBackend "holds the working state in the underlying
> delegatedStateBackend, and forwards state changes to State Changelog", I
> honestly still don't see how this needs special handling. As long as the
> delegated state backend suppors sorted state, ChangelogStateBackend
> doesn't
> have to do anything special except for recording changes to state. Our PoC
> simply uses the namespace for these keys and that's the same thing the
> Window
> API is already using - so there's nothing special here. The order in the
> log
> doesn't have to follow the natural order because this is only required
> inside
> the delegatedStateBackend, isn't it?
>
>
> Nico
>
> On Wednesday, 20 April 2022 17:03:11 CEST Yun Tang wrote:
> > Hi Nico,
> >
> > Thanks for your clarification.
> > For the discussion about generalizing Temporal state to sorted map
> state. I
> > could give some example

Re: [DISCUSS] FLIP-220: Temporal State

2022-04-13 Thread David Morávek
Here is a very naive implementation [1] from a prototype I did a few months
back that uses a list and insertion sort. Since the list is sorted, we can use
binary search to create sub-list, that could leverage the same thing I've
described above.

I think back then I didn't go for the SortedMap as it would be hard to
implement with the current heap state backend internals and would have
bigger memory overhead.

The ideal solution would probably use a skip list [2] to lower the overhead
of the binary search, while maintaining a reasonable memory footprint.
Other than that it could be pretty much the same as the prototype
implementation [1].

[1]
https://github.com/dmvk/flink/blob/ecdbc774b13b515e8c0943b2c143fb1e34eca6f0/flink-runtime/src/main/java/org/apache/flink/runtime/state/heap/HeapTemporalListState.java
[2] https://en.wikipedia.org/wiki/Skip_list
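The sorted-list-plus-binary-search idea can be sketched in a few lines. This is illustrative only (not the linked prototype's code); duplicates are kept, and `Long.MAX_VALUE` overflow on the upper bound is ignored for brevity:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative: insertion sort on add, binary search to carve out range views.
final class SortedLongList {
    private final List<Long> values = new ArrayList<>();

    void add(long v) {
        int idx = Collections.binarySearch(values, v);
        // binarySearch returns (-(insertionPoint) - 1) when the key is absent
        values.add(idx >= 0 ? idx : -idx - 1, v);
    }

    // Values in [fromInclusive, toInclusive], as a sub-list view backed by the list.
    List<Long> readRange(long fromInclusive, long toInclusive) {
        return values.subList(lowerBound(fromInclusive), lowerBound(toInclusive + 1));
    }

    // Index of the first element >= key.
    private int lowerBound(long key) {
        int idx = Collections.binarySearch(values, key);
        if (idx < 0) return -idx - 1;
        while (idx > 0 && values.get(idx - 1) == key) idx--; // leftmost duplicate
        return idx;
    }
}
```

Swapping the `ArrayList` for a skip list would turn the O(n) insert into O(log n) on average while keeping the same range-read shape.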

Best,
D.

On Wed, Apr 13, 2022 at 1:27 PM David Morávek 
wrote:

> Hi David,
>
> It seems to me that at least with the heap-based state backend, readRange
>> is going to have to do a lot of unnecessary work to implement this
>> isEmpty() operation, since it have will to consider the entire range from
>> MIN_VALUE to MAX_VALUE. (Maybe we should add an explicit isEmpty method?
>> I'm not convinced we need it, but it would be cheaper to implement. Or
>> perhaps this join can be rewritten to not need this operation; I haven't
>> thought enough about that alternative.)
>>
>
> I think this really boils down to how the returned iterable is going to be
> implemented. Basically for checking whether state is empty, you need to do
> something along the lines of:
>
> Iterables.isEmpty(state.readRange(Long.MIN_VALUE, Long.MAX_VALUE)); //
> basically checking `hasNext() == false` or `isEmpty()` in case of a
> `Collection`
>
> Few notes:
> 1) It could be lazy (the underlying collection doesn't have to be
> materialized - eg. in case of RocksDB);
> 2) For HeapStateBackend it depends on the underlying implementation. I'd
> probably do something along the lines of sorted tree (eg. SortedMap /
> NavigableMap), that allows effective range scans / range deletes. Then you
> could simply do something like (from top of the head):
>
> @Value
> class TimestampedKey<K> {
>   K key;
>   long timestamp;
> }
>
> SortedMap<TimestampedKey<K>, V> internalState;
>
> Iterable<TimestampedValue<V>> readRange(long min, long max) {
>   return toIterable(internalState.subMap(new TimestampedKey<>(currentKey(),
> min), new TimestampedKey<>(currentKey(), max)));
> }
>
> This should be fairly cheap. The important bit is that the returned
> iterator is always non-null, but could be empty.
>
> Does that answer your question?
>
> D.
>
> On Wed, Apr 13, 2022 at 12:21 PM David Anderson 
> wrote:
>
>> Yun Tang and Jingsong,
>>
>> Some flavor of OrderedMapState is certainly feasible, and I do see some
>> appeal in supporting Binary**State.
>>
>> However, I haven't seen a motivating use case for this generalization, and
>> would rather keep this as simple as possible. By handling Longs we can
>> already optimize a wide range of use cases.
>>
>> David
>>
>>
>> On Tue, Apr 12, 2022 at 9:21 AM Yun Tang  wrote:
>>
>> >  Hi David,
>> >
>> > Could you share some details on why SortedMapState cannot work? I
>> > cannot quite follow what the statement below means:
>> >
>> > This was rejected as being overly difficult to implement in a way that
>> > would cleanly leverage RocksDB’s iterators.
>> >
>> >
>> > Best
>> > Yun Tang
>> > 
>> > From: Aitozi 
>> > Sent: Tuesday, April 12, 2022 15:00
>> > To: dev@flink.apache.org 
>> > Subject: Re: [DISCUSS] FLIP-220: Temporal State
>> >
>> > Hi David,
>> >  I have looked through the doc; I think it will be a good improvement
>> > for this usage pattern, and I'm interested in it. Do you have some POC
>> > work to share for a closer look?
>> > Besides, I have one question: can we expose the namespace in the other
>> > state types, not limited to `TemporalState`? That way the user can
>> > specify the namespace, and TemporalState becomes a special case that
>> > uses the timestamp as the namespace. I think it would be more extensible.
>> > What do you think about this?
>> >
>> > Best,
>> > Aitozi.
>> >
>> > David Anderson  于2022年4月11日周一 20:54写道:
>> >
>> > > Greetings, Flink developers.
>> > >
>> > > I would like to open up a discussion of a propo

Re: [DISCUSS] FLIP-220: Temporal State

2022-04-13 Thread David Morávek
Hi David,

It seems to me that at least with the heap-based state backend, readRange
> is going to have to do a lot of unnecessary work to implement this
> isEmpty() operation, since it will have to consider the entire range from
> MIN_VALUE to MAX_VALUE. (Maybe we should add an explicit isEmpty method?
> I'm not convinced we need it, but it would be cheaper to implement. Or
> perhaps this join can be rewritten to not need this operation; I haven't
> thought enough about that alternative.)
>

I think this really boils down to how the returned iterable is going to be
implemented. Basically for checking whether state is empty, you need to do
something along the lines of:

Iterables.isEmpty(state.readRange(Long.MIN_VALUE, Long.MAX_VALUE)); // basically
checking `hasNext() == false` or `isEmpty()` in case of `Collection`

Few notes:
1) It could be lazy (the underlying collection doesn't have to be
materialized - eg. in case of RocksDB);
2) For HeapStateBackend it depends on the underlying implementation. I'd
probably do something along the lines of sorted tree (eg. SortedMap /
NavigableMap), that allows effective range scans / range deletes. Then you
could simply do something like (from top of the head):

@Value
class TimestampedKey<K> {
  K key;
  long timestamp;
}

SortedMap<TimestampedKey<K>, V> internalState;

Iterable<TimestampedValue<V>> readRange(long min, long max) {
  return toIterable(internalState.subMap(new TimestampedKey<>(currentKey(),
min), new TimestampedKey<>(currentKey(), max)));
}

This should be fairly cheap. The important bit is that the returned
iterator is always non-null, but could be empty.
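To illustrate why the emptiness check stays cheap with such a layout, here is a small, self-contained sketch using a plain `TreeMap` (the names are illustrative, not Flink code): the `subMap` view is lazy, so checking emptiness only needs to locate the range bounds rather than materialize the range.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class LazyRangeEmptiness {

    /**
     * Emptiness check over a timestamp range without materializing it:
     * subMap returns a view, so isEmpty() costs O(log n) regardless of
     * how many entries fall inside the range.
     */
    static boolean isRangeEmpty(NavigableMap<Long, String> state, long min, long max) {
        return state.subMap(min, true, max, false).isEmpty();
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> state = new TreeMap<>();
        for (long t = 0; t < 100_000; t += 2) { // only even timestamps
            state.put(t, "event@" + t);
        }
        // The view over the full range is non-null but may be empty -- mirroring
        // the contract that readRange() never returns null.
        if (isRangeEmpty(state, Long.MIN_VALUE, Long.MAX_VALUE)) throw new AssertionError();
        if (!isRangeEmpty(state, 1L, 2L)) throw new AssertionError("[1, 2) contains no even key");
        System.out.println("ok");
    }
}
```
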

Does that answer your question?

D.

On Wed, Apr 13, 2022 at 12:21 PM David Anderson 
wrote:

> Yun Tang and Jingsong,
>
> Some flavor of OrderedMapState is certainly feasible, and I do see some
> appeal in supporting Binary**State.
>
> However, I haven't seen a motivating use case for this generalization, and
> would rather keep this as simple as possible. By handling Longs we can
> already optimize a wide range of use cases.
>
> David
>
>
> On Tue, Apr 12, 2022 at 9:21 AM Yun Tang  wrote:
>
> >  Hi David,
> >
> > Could you share some details on why SortedMapState cannot work? I
> > cannot quite follow what the statement below means:
> >
> > This was rejected as being overly difficult to implement in a way that
> > would cleanly leverage RocksDB’s iterators.
> >
> >
> > Best
> > Yun Tang
> > 
> > From: Aitozi 
> > Sent: Tuesday, April 12, 2022 15:00
> > To: dev@flink.apache.org 
> > Subject: Re: [DISCUSS] FLIP-220: Temporal State
> >
> > Hi David,
> >  I have looked through the doc; I think it will be a good improvement
> > for this usage pattern, and I'm interested in it. Do you have some POC
> > work to share for a closer look?
> > Besides, I have one question: can we expose the namespace in the other
> > state types, not limited to `TemporalState`? That way the user can
> > specify the namespace, and TemporalState becomes a special case that
> > uses the timestamp as the namespace. I think it would be more extensible.
> > What do you think about this?
> >
> > Best,
> > Aitozi.
> >
> > David Anderson  于2022年4月11日周一 20:54写道:
> >
> > > Greetings, Flink developers.
> > >
> > > I would like to open up a discussion of a proposal [1] to add a new
> kind
> > of
> > > state to Flink.
> > >
> > > The goal here is to optimize a fairly common pattern, which is using
> > >
> > > MapState<Long, List<Event>>
> > >
> > > to store lists of events associated with timestamps. This pattern is
> used
> > > internally in quite a few operators that implement sorting and joins,
> and
> > > it also shows up in user code, for example, when implementing custom
> > > windowing in a KeyedProcessFunction.
> > >
> > > Nico Kruber, Seth Wiesman, and I have implemented a POC that achieves a
> > > more than 2x improvement in throughput when performing these operations
> > on
> > > RocksDB by better leveraging the capabilities of the RocksDB state
> > backend.
> > >
> > > See FLIP-220 [1] for details.
> > >
> > > Best,
> > > David
> > >
> > > [1] https://cwiki.apache.org/confluence/x/Xo_FD
> > >
> >
>


Re: [DISCUSS] FLIP-220: Temporal State

2022-04-12 Thread David Morávek
Hi David,

I really like the proposal. This has so much potential for various
optimizations, especially for temporal joins. My only concern is that the
interfaces seem unnecessarily complicated.

My feeling would be that we only need a single, simple interface that would
fit it all (the same way as it's already present in Apache Beam):

@Experimental
public interface TemporalListState<T>
        extends MergingState<TimestampedValue<T>, Iterable<TimestampedValue<T>>>,
                Iterable<TimestampedValue<T>> {

    /**
     * Read a timestamp-limited subrange of the list. The result is ordered
     * by timestamp.
     *
     * All values with timestamps >= minTimestamp and < limitTimestamp
     * will be in the resulting iterable. This means that only timestamps
     * strictly less than Instant.ofEpochMilli(Long.MAX_VALUE) can be used
     * as timestamps.
     */
    Iterable<TimestampedValue<T>> readRange(long minTimestamp, long limitTimestamp);

    /**
     * Clear a timestamp-limited subrange of the list.
     *
     * All values with timestamps >= minTimestamp and < limitTimestamp
     * will be removed from the list.
     */
    void clearRange(long minTimestamp, long limitTimestamp);
}
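To make the intended range semantics concrete (inclusive `minTimestamp`, exclusive `limitTimestamp`), here is a toy in-memory model of the two methods. It is only a sketch of the contract under my reading of it, not an actual state-backend implementation, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the proposed readRange/clearRange semantics; not a real backend. */
public class TemporalListModel<T> {

    private final TreeMap<Long, List<T>> byTimestamp = new TreeMap<>();

    public void add(long timestamp, T value) {
        byTimestamp.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(value);
    }

    /** All values with timestamp >= minTimestamp and < limitTimestamp, ordered by timestamp. */
    public List<T> readRange(long minTimestamp, long limitTimestamp) {
        List<T> result = new ArrayList<>();
        byTimestamp.subMap(minTimestamp, true, limitTimestamp, false)
                .values()
                .forEach(result::addAll);
        return result;
    }

    /** Removes all values with timestamp >= minTimestamp and < limitTimestamp. */
    public void clearRange(long minTimestamp, long limitTimestamp) {
        byTimestamp.subMap(minTimestamp, true, limitTimestamp, false).clear();
    }

    public static void main(String[] args) {
        TemporalListModel<String> state = new TemporalListModel<>();
        state.add(10L, "a");
        state.add(20L, "b");
        state.add(20L, "c");
        state.add(30L, "d");
        // limit is exclusive: timestamp 30 is not included in [10, 30)
        if (!state.readRange(10L, 30L).equals(List.of("a", "b", "c"))) throw new AssertionError();
        state.clearRange(Long.MIN_VALUE, 21L); // a range delete, not per-element removal
        if (!state.readRange(Long.MIN_VALUE, Long.MAX_VALUE).equals(List.of("d"))) throw new AssertionError();
        System.out.println("ok");
    }
}
```
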

Is there anything missing here? Why do we need a temporal value state at
all? In my understanding it's still basically a "temporal list state", just
with a slightly different API. If it is indeed necessary alongside the
"temporal list state" API you've proposed, would it make sense to try
unifying the two? I really think that the Beam community already did a good
job designing this API.

Adding one state primitive is already a big change, so if we can keep it
minimal it would be great.

One more point on the proposed API: being able to clear only a single
"timestamped value" at a time might be limiting for some use cases
(performance-wise, because we can't optimize it the way we can with a range
delete).

Best,
D.

On Tue, Apr 12, 2022 at 9:32 AM Jingsong Li  wrote:

> Hi David,
>
> Thanks for driving.
>
> I understand that state storage itself supports byte ordering; have we
> considered exposing Binary**State? This way the upper layers can be
> implemented on demand, and Temporal is just one of them.
>
> Best,
> Jingsong
>
> On Tue, Apr 12, 2022 at 3:01 PM Aitozi  wrote:
> >
> > Hi David,
> >  I have looked through the doc; I think it will be a good improvement
> > for this usage pattern, and I'm interested in it. Do you have some POC
> > work to share for a closer look?
> > Besides, I have one question: can we expose the namespace in the other
> > state types, not limited to `TemporalState`? That way the user can
> > specify the namespace, and TemporalState becomes a special case that
> > uses the timestamp as the namespace. I think it would be more extensible.
> > What do you think about this?
> >
> > Best,
> > Aitozi.
> >
> > David Anderson  于2022年4月11日周一 20:54写道:
> >
> > > Greetings, Flink developers.
> > >
> > > I would like to open up a discussion of a proposal [1] to add a new
> kind of
> > > state to Flink.
> > >
> > > The goal here is to optimize a fairly common pattern, which is using
> > >
> > > MapState<Long, List<Event>>
> > >
> > > to store lists of events associated with timestamps. This pattern is
> used
> > > internally in quite a few operators that implement sorting and joins,
> and
> > > it also shows up in user code, for example, when implementing custom
> > > windowing in a KeyedProcessFunction.
> > >
> > > Nico Kruber, Seth Wiesman, and I have implemented a POC that achieves a
> > > more than 2x improvement in throughput when performing these
> operations on
> > > RocksDB by better leveraging the capabilities of the RocksDB state
> backend.
> > >
> > > See FLIP-220 [1] for details.
> > >
> > > Best,
> > > David
> > >
> > > [1] https://cwiki.apache.org/confluence/x/Xo_FD
> > >
>


Re: [ANNOUNCE] New Apache Flink Committer - David Morávek

2022-03-06 Thread David Morávek
Thanks everyone!

Best,
D.

On Sun 6. 3. 2022 at 9:07, Yuan Mei  wrote:

> Congratulations, David!
>
> Best Regards,
> Yuan
>
> On Sat, Mar 5, 2022 at 8:13 PM Roman Khachatryan  wrote:
>
> > Congratulations, David!
> >
> > Regards,
> > Roman
> >
> > On Fri, Mar 4, 2022 at 7:54 PM Austin Cawley-Edwards
> >  wrote:
> > >
> > > Congrats David!
> > >
> > > On Fri, Mar 4, 2022 at 12:18 PM Zhilong Hong 
> > wrote:
> > >
> > > > Congratulations, David!
> > > >
> > > > Best,
> > > > Zhilong
> > > >
> > > > On Sat, Mar 5, 2022 at 1:09 AM Piotr Nowojski 
> > > > wrote:
> > > >
> > > > > Congratulations :)
> > > > >
> > > > > pt., 4 mar 2022 o 16:04 Aitozi  napisał(a):
> > > > >
> > > > > > Congratulations David!
> > > > > >
> > > > > > Ingo Bürk  于2022年3月4日周五 22:56写道:
> > > > > >
> > > > > > > Congrats, David!
> > > > > > >
> > > > > > > On 04.03.22 12:34, Robert Metzger wrote:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > On behalf of the PMC, I'm very happy to announce David
> Morávek
> > as a
> > > > > new
> > > > > > > > Flink committer.
> > > > > > > >
> > > > > > > > His first contributions to Flink date back to 2019. He has
> been
> > > > > > > > increasingly active with reviews and driving major
> initiatives
> > in
> > > > the
> > > > > > > > community. David brings valuable experience from being a
> > committer
> > > > in
> > > > > > the
> > > > > > > > Apache Beam project to Flink.
> > > > > > > >
> > > > > > > >
> > > > > > > > Please join me in congratulating David for becoming a Flink
> > > > > committer!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Robert
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Martijn Visser

2022-03-03 Thread David Morávek
Congratulations Martijn, well deserved!

Best,
D.

On Fri, Mar 4, 2022 at 7:25 AM Jiangang Liu 
wrote:

> Congratulations Martijn!
>
> Best
> Liu Jiangang
>
> Lijie Wang  于2022年3月4日周五 14:00写道:
>
> > Congratulations Martijn!
> >
> > Best,
> > Lijie
> >
> > Jingsong Li  于2022年3月4日周五 13:42写道:
> >
> > > Congratulations Martijn!
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Fri, Mar 4, 2022 at 1:09 PM Yang Wang 
> wrote:
> > > >
> > > > Congratulations Martijn!
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > Yangze Guo  于2022年3月4日周五 11:33写道:
> > > >
> > > > > Congratulations!
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > On Fri, Mar 4, 2022 at 11:23 AM Lincoln Lee <
> lincoln.8...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Congratulations Martijn!
> > > > > >
> > > > > > Best,
> > > > > > Lincoln Lee
> > > > > >
> > > > > >
> > > > > > Yu Li  于2022年3月4日周五 11:09写道:
> > > > > >
> > > > > > > Congratulations!
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Yu
> > > > > > >
> > > > > > >
> > > > > > > On Fri, 4 Mar 2022 at 10:31, Zhipeng Zhang <
> > > zhangzhipe...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Congratulations Martijn!
> > > > > > > >
> > > > > > > > Qingsheng Ren  于2022年3月4日周五 10:14写道:
> > > > > > > >
> > > > > > > > > Congratulations Martijn!
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Qingsheng Ren
> > > > > > > > >
> > > > > > > > > > On Mar 4, 2022, at 9:56 AM, Leonard Xu <
> xbjt...@gmail.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Congratulations and well deserved Martjin !
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Leonard
> > > > > > > > > >
> > > > > > > > > >> 2022年3月4日 上午7:55,Austin Cawley-Edwards <
> > > austin.caw...@gmail.com
> > > > > >
> > > > > > > 写道:
> > > > > > > > > >>
> > > > > > > > > >> Congrats Martijn!
> > > > > > > > > >>
> > > > > > > > > >> On Thu, Mar 3, 2022 at 10:50 AM Robert Metzger <
> > > > > rmetz...@apache.org
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Hi everyone,
> > > > > > > > > >>>
> > > > > > > > > >>> On behalf of the PMC, I'm very happy to announce
> Martijn
> > > > > Visser as
> > > > > > > a
> > > > > > > > > new
> > > > > > > > > >>> Flink committer.
> > > > > > > > > >>>
> > > > > > > > > >>> Martijn is a very active Flink community member,
> driving
> > a
> > > lot
> > > > > of
> > > > > > > > > efforts
> > > > > > > > > >>> on the dev@flink mailing list. He also pushes projects
> > > such as
> > > > > > > > > replacing
> > > > > > > > > >>> Google Analytics with Matomo, so that we can generate
> our
> > > web
> > > > > > > > analytics
> > > > > > > > > >>> within the Apache Software Foundation.
> > > > > > > > > >>>
> > > > > > > > > >>> Please join me in congratulating Martijn for becoming a
> > > Flink
> > > > > > > > > committer!
> > > > > > > > > >>>
> > > > > > > > > >>> Cheers,
> > > > > > > > > >>> Robert
> > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > best,
> > > > > > > > Zhipeng
> > > > > > > >
> > > > > > >
> > > > >
> > >
> >
>


Re: Re: [DISCUSS] Future of Per-Job Mode

2022-02-11 Thread David Morávek
Hi Jark, can you please elaborate on the current need for per-job mode with
interactive clients (e.g. Zeppelin, which you've mentioned)? Aren't these a
natural fit for a session cluster?

D.

On Fri, Feb 11, 2022 at 3:25 PM Jark Wu  wrote:

> Hi Konstantin,
>
> I'm not very familiar with the implementation of per-job mode and
> application mode.
> But are there any instructions for users about how to migrate
> platforms/jobs to application mode?
> IIUC, the biggest difference between the two modes is where the main()
> method is executed.
> However, SQL jobs are not jar applications and don't have the main()
> method.
> For example, SQL CLI submits SQL jobs by invoking
> `StreamExecutionEnvironment#executeAsync(StreamGraph)`.
> How SQL Client and SQL platforms (e.g. Zeppelin) support application mode?
>
> Best,
> Jark
>
>
> On Fri, 28 Jan 2022 at 23:33, Konstantin Knauf  wrote:
>
> > Hi everyone,
> >
> > Thank you for sharing your perspectives. I was not aware of
> > these limitations of per-job mode on YARN. It seems that there is a
> general
> > agreement to deprecate per-job mode and to drop it once the limitations
> > around YARN are resolved. I've started a corresponding vote in [1].
> >
> > Thanks again,
> >
> > Konstantin
> >
> >
> > [1] https://lists.apache.org/thread/v6oz92dfp95qcox45l0f8393089oyjv4
> >
> > On Fri, Jan 28, 2022 at 1:53 PM Ferenc Csaky  >
> > wrote:
> >
> > > Hi Yang,
> > >
> > > Thank you for the clarification. In general I think we will have time
> to
> > > experiment with this until it will be removed totally and migrate our
> > > solution to use application mode.
> > >
> > > Regards,
> > > F
> > >
> > > On 2022/01/26 02:42:24 Yang Wang wrote:
> > > > Hi all,
> > > >
> > > > I remember the application mode was initially named "cluster mode".
> As
> > a
> > > > contrast, the per-job mode is the "client mode".
> > > > So I believe application mode should cover all the functionalities of
> > > > per-job except where we are running the user main code.
> > > > In the containerized or the Kubernetes world, the application mode is
> > > more
> > > > native and easy to use since all the Flink and user
> > > > jars are bundled in the image. I am also in favor of deprecating and
> > > > removing the per-job in the long run.
> > > >
> > > >
> > > >
> > > > @Ferenc
> > > > IIRC, the YARN application mode could ship user jars and dependencies
> > via
> > > > "yarn.ship-files" config option. The only
> > > > limitation is that the shipped user dependencies are loaded with the
> > > > parent classloader, not the user classloader.
> > > > FLINK-24897 is trying to fix this via supporting "usrlib" directory
> > > > automatically.
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > >
> > > >
> > > > Ferenc Csaky  于2022年1月25日周二 22:05写道:
> > > >
> > > > > Hi Konstantin,
> > > > >
> > > > > First of all, sorry for the delay. We at Cloudera are currently
> > > relying on
> > > > > per-job mode deploying Flink applications over YARN.
> > > > >
> > > > > Specifically, we allow users to upload connector jars and other
> > > artifacts.
> > > > > There are also some default jars that we need to ship. These are
> all
> > > stored
> > > > > on the local file system of our service’s node. The Flink job is
> > > submitted
> > > > > on the users’ behalf by our service, which also specifies the jars
> to
> > > ship.
> > > > > The service runs on a single node, not on all nodes with Flink
> TM/JM.
> > > It
> > > > > would thus be difficult to manage the jars on every node.
> > > > >
> > > > > We are not familiar with the reasoning behind why application mode
> > > > > currently doesn’t ship the user jars, besides the deployment being
> > > faster
> > > > > this way. Would it be possible for the application mode to
> > (optionally,
> > > > > enabled by some config) distribute these, or are there some
> technical
> > > > > limitations?
> > > > >
> > > > > For us it would be crucial to achieve the functionality we have at
> > the
> > > > > moment over YARN. We started to track
> > > > > https://issues.apache.org/jira/browse/FLINK-24897 that Biao Geng
> > > > > mentioned as well.
> > > > >
> > > > > Considering the above, for us the more soonish removal does not
> sound
> > > > > really well. We can live with this feature as deprecated of course,
> > > but it
> > > > > would be nice to have some time to figure out how we can utilize
> > > > > Application Mode exactly and make necessary changes if required.
> > > > >
> > > > > Thank you,
> > > > > F
> > > > >
> > > > > On 2022/01/13 08:30:48 Konstantin Knauf wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > I would like to discuss and understand if the benefits of having
> > > Per-Job
> > > > > > Mode in Apache Flink outweigh its drawbacks.
> > > > > >
> > > > > >
> > > > > > *# Background: Flink's Deployment Modes*
> > > > > > Flink currently has three deployment modes. They differ in the
> > > following
> > > > > > dimensions:
> > > > > > * 

Re: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-02-11 Thread David Morávek
There are already tools [1] that simplify this for the user.

I honestly don't know; it feels like it could bring more problems than
actual benefits, as this heavily relies on the environment. It can easily
break for some users, e.g. because of kernel settings, or their
architecture might not be supported. Also, we'd need to go the extra mile
regarding security.

Considering there are already other tools that are specifically designed
for this (such as [1]), I personally don't feel that this should be part of
Flink.

[1] https://github.com/yahoo/kubectl-flame


On Fri, Feb 11, 2022 at 9:28 AM Jacky Lau <281293...@qq.com.invalid> wrote:

> Our Flink applications run on k8s. Yes, users can use the async-profiler
> directly, but it is not convenient: they have to download the jars and
> need to know how to use the tool, and some users don't know it at all. If
> we integrate it, users will benefit a lot.
>
> On 2022/01/26 18:56:17 David Morávek wrote:
> > I'd second to Alex's concerns. Is there a reason why you can't use the
> > async-profiler directly? In what kind of environment are your Flink
> > clusters running (YARN / k8s / ...)?
> >
> > Best,
> > D.
> >
> > On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov 
> > wrote:
> >
> >> Hi Jacky,
> >>
> >> Could you please clarify what kind of *problems* you experience with the
> >> large parallelism? You referred to D3, is it something related to
> rendering
> >> on the browser side or is it about the samples collection process? Were
> you
> >> able to identify the bottleneck?
> >>
> >> Fundamentally I have some concerns regarding the proposed approach:
> >> 1. Calling shell scripts triggered via the web UI is a security concern
> and
> >> it needs to be evaluated carefully if it could introduce any unexpected
> >> attack vectors (depending on the implementation, passed parameters etc.)
> >> 2. My understanding is that the async-profiler implementation is
> >> system-dependent. How do you propose to handle multiple architectures?
> >> Would you like to ship each available implementation within Flink? [1]
> >> 3. Do you plan to make use of full async-profiler features including
> native
> >> calls sampling with perf_events? If so, the issue I see is that some
> >> environments restrict ptrace calls by default [2]
> >>
> >> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> >> [2]
> >>
> >>
> https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
> >>
> >>
> >> Best,
> >> Alexander Fedulov
> >>
> >> On Wed, Jan 26, 2022 at 1:59 PM 李森  wrote:
> >>
> >>> This is an expected feature, as we also experienced browser crashes on
> >>> existing operator-level flame graphs
> >>>
> >>> Best,
> >>> Echo Lee
> >>>
> >>>> 在 2022年1月24日,下午6:16,David Morávek  写道:
> >>>>
> >>>> Hi Jacky,
> >>>>
> >>>> The link seems to be broken, here is the correct one [1].
> >>>>
> >>>> [1]
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> >>>>
> >>>> Best,
> >>>> D.
> >>>>
> >>>>> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <28...@qq.com.invalid>
> >>> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>   I would like to start the discussion on FLIP-213 <
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> >>>>> >, which aims to provide TaskManager-level (process-level) flame
> >>>>> graphs via async-profiler, the most popular tool for Java performance
> >>>>> analysis; both Arthas and IntelliJ use it. We support it in our Ant
> >>>>> Group deployments.
> >>>>>   Flink already supports FLIP-165 (Operator's Flame Graphs),
> >>>>> which draws the flame graph with the front-end library
> >>>>> d3-flame-graph; this has problems for jobs with large parallelism.
> >>>>>   Please be aware that the FLIP wiki page is not fully done,
> >>>>> since I don't know yet whether the Flink community will accept it.
> >>>>>   Feel free to add your thoughts to make this feature better!
> >>>>> I am looking forward to all your responses. Thanks very much!
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> Best Jacky Lau
> >>>
> >>
>


Re: [DISCUSS] Drop Jepsen tests

2022-02-09 Thread David Morávek
Network partitions are trickier than simply crashing a process. For example,
they can be asymmetric -> as a TM you're still able to talk to the JM, but
you're not able to talk to other TMs.

In general this could be achieved by manipulating iptables on the host
machine (considering we spawn all the processes locally), but I'm not sure
that will solve the "make it less complicated for others to contribute"
part :/ Also, this kind of test would be executable on *nix systems only.

I assume that jepsen uses the same approach under the hood.
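For illustration only, here is a sketch of the kind of rules such a test harness could generate. The rule format is an assumption on my side: applying it requires root, and for co-located processes one would have to match on ports instead of addresses, so this only stands in for the one-process-per-host case:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: builds iptables rules for an asymmetric network partition. */
public class AsymmetricPartition {

    /**
     * Drop traffic arriving *from* the peers at the target, while leaving the
     * target's outbound path alone: the target can still talk to the peers,
     * but never hears back (the asymmetric case described above).
     */
    static List<String> isolateInbound(String targetHost, List<String> peers) {
        return peers.stream()
                .map(peer -> "iptables -A INPUT -s " + peer + " -d " + targetHost + " -j DROP")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rules = isolateInbound("10.0.0.3", List.of("10.0.0.1", "10.0.0.2"));
        if (rules.size() != 2) throw new AssertionError();
        if (!rules.get(0).equals("iptables -A INPUT -s 10.0.0.1 -d 10.0.0.3 -j DROP")) {
            throw new AssertionError(rules.get(0));
        }
        rules.forEach(System.out::println);
    }
}
```
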

D.

On Wed, Feb 9, 2022 at 5:43 PM Chesnay Schepler  wrote:

> b/c are part of the same test.
>
>   * We have a job running,
>   * trigger a network partition (failing the job),
>   * then crash HDFS (preventing checkpoints and access to the HA
> storageDir),
>   * then the partition is resolved and HDFS is started again.
>
> Conceptually I would think we can replicate this by nuking half the
> cluster, crashing HDFS/ZK, and restarting everything.
>
> On 09/02/2022 17:39, Chesnay Schepler wrote:
> > The jepsen tests cover 3 cases:
> > a) JM/TM crashes
> > b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
> > c) network partitions
> >
> > a) can (and probably is) reasonably covered by existing ITCases and
> > e2e tests
> > b) We could probably figure this out ourselves if we wanted to.
> > c) is the difficult part.
> >
> > Note that the tests also only cover yarn (per-job/session) and
> > standalone (session) deployments.
> >
> > On 09/02/2022 17:11, Konstantin Knauf wrote:
> >> Thank you for raising this issue. What risks do you see if we drop
> >> it? Do
> >> you see any cheaper alternative to (partially) mitigate those risks?
> >>
> >> On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler 
> >> wrote:
> >>
> >>> For a few years by now we had a set of Jepsen tests that verify the
> >>> correctness of Flinks coordination layer in the case of process
> >>> crashes.
> >>> In the past it has indeed found issues and thus provided value to the
> >>> project, and in general the core idea of it (and Jepsen for that
> >>> matter)
> >>> is very sound.
> >>>
> >>> However, so far we neither made attempts to make further use of Jepsen
> >>> (and limited ourselves to very basic tests) nor to familiarize
> >>> ourselves
> >>> with the tests/jepsen at all.
> >>> As a result these tests are difficult to maintain. They (and Jepsen)
> >>> are
> >>> written in Clojure, which makes debugging, changes and upstreaming
> >>> contributions very difficult.
> >>> Additionally, the tests also make use of a very complicated
> >>> (Ververica-internal) terraform+ansible setup to spin up and tear down
> >>> AWS machines. While it works (and is actually pretty cool), it's
> >>> difficult to adjust because the people who wrote it have left the
> >>> company.
> >>>
> >>> Why I'm raising this now (and not earlier) is because so far keeping
> >>> the
> >>> tests running wasn't much of a problem; bump a few dependencies here
> >>> and
> >>> there and we're good to go.
> >>>
> >>> However, this has changed with the recent upgrade to Zookeeper 3.5,
> >>> which isn't supported by Jepsen out-of-the-box, completely breaking the
> >>> tests. We'd now have to write a new Zookeeper 3.5+ integration for
> >>> Jepsen (again, in Clojure). While I started working on that and could
> >>> likely finish it, I started to wonder whether it even makes sense to do
> >>> so, and whether we couldn't invest this time elsewhere.
> >>>
> >>> Let me know what you think.
> >>>
> >>>
> >
>


Re: [RESULT][VOTE] FLIP-211: Kerberos delegation token framework

2022-02-08 Thread David Morávek
Thanks Gabor for driving this, I think the change is going to be really
valuable for some of the enterprise users.

Best,
D.


On Tue, Feb 8, 2022 at 8:33 AM Gabor Somogyi 
wrote:

> Hi devs,
>
> FLIP-211 [1] Has been accepted.
> There were 3 binding votes and 2 non-binding in favor.
> None against.
>
> Votes are in the order of arrival:
>
> Binding:
> Gyula Fora
> Marton Balassi
> Chesnay Schepler
>
> Non-binding:
> Junfan Zhang
> David Moravek
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework
>
> BR,
> G
>


Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-02-04 Thread David Morávek
There has been a strong tendency lately to push many things out of Flink
and keep only the main building blocks. However, I really think that this
belongs among the minimal building blocks that Flink should provide out of
the box. It's also very likely that security topics will start getting more
attention in the near future.

Maybe one thought: I'm not sure that's the case, but if hard-coding the
delegation framework to Kerberos is a concern, I could imagine implementing
this in a more generic fashion, so that other systems that need to
distribute & renew credentials might reuse the same code path (e.g. OAuth
tokens for talking to some external APIs).
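To sketch what such a generic (non-Kerberos-specific) abstraction could look like — all names and signatures here are hypothetical, not the FLIP's actual interfaces:

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of a credential-provider abstraction not tied to Kerberos. */
public class TokenFrameworkSketch {

    /** A provider obtains an opaque credential plus the time it must be renewed. */
    interface DelegationTokenProvider {
        String serviceName();
        ObtainedToken obtainToken() throws Exception;
    }

    /** The credential bytes stay opaque to the framework; only renewAt is interpreted. */
    record ObtainedToken(byte[] token, Instant renewAt) {}

    /** Toy provider standing in for e.g. an OAuth token endpoint. */
    static class OAuthLikeProvider implements DelegationTokenProvider {
        @Override public String serviceName() { return "external-api"; }
        @Override public ObtainedToken obtainToken() {
            // Real code would call the token endpoint; here we fabricate a token.
            return new ObtainedToken("opaque-bytes".getBytes(),
                    Instant.now().plus(Duration.ofMinutes(55)));
        }
    }

    public static void main(String[] args) throws Exception {
        DelegationTokenProvider provider = new OAuthLikeProvider();
        ObtainedToken token = provider.obtainToken();
        if (!provider.serviceName().equals("external-api")) throw new AssertionError();
        if (!token.renewAt().isAfter(Instant.now())) throw new AssertionError("token must be renewable in the future");
        System.out.println("obtained token for " + provider.serviceName());
    }
}
```

The framework would then only need to schedule `obtainToken()` again shortly before each `renewAt` and distribute the opaque bytes to TMs, with Kerberos being just one provider implementation.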

D.

On Fri, Feb 4, 2022 at 11:32 AM Gyula Fóra  wrote:

> Hi Chesnay,
>
> Thanks for the proposal for the alternative mechanism. I see the conceptual
> value of separating this process from Flink but in practice I feel there
> are a few very serious limitations with that.
>
> Just a few points that come to mind:
> 1. Implementing this as independent distributed processes that communicate
> with each other requires:
> - Secure communication channels
> - Process discovery
> - High availability
> This is a huge effort to say the least, more like a separate project
> than a new feature.
> 2. Independent processes with all of the above would come with their own
> set of dependencies and configuration values for everything ranging from
> communication, ssl settings, etc.
> 3. Flink does not have an existing mechanism for spinning up this processes
> and managing their lifecycle. This would require a completely separate
> design.
>
> If Spark had used external processes now we would still have to design a
> process hook mechanism, every user would have to add an extra probably
> large set of config options just to manage the basic secure process
> communication and would pull in their own dependency mess most likely.
>
> I personally prefer to reuse Flink’s solids secure communication channels
> and existing HA and discovery mechanism.
> So from my side, +1 for embedding this in the existing Flink components.
>
> Kerberos is here to stay for a long time in many large production
> use-cases, and we should aim to solve this long standing limitation in an
> elegant way.
>
> Thank you!
> Gyula
>
> On Friday, February 4, 2022, Chesnay Schepler  wrote:
>
> > The concrete proposal would be to add a generic process startup lifecycle
> > hook (essentially a Consumer), that is run at the start of
> > each processs (JobManager, TaskManager, HistoryServer (, CLI?).
> >
> > Everything else would be left to the implementation which would live
> > outside of Flink.
> >
> > For this specific case an implementation of this hook would (_somehow_)
> > establish a connection to the external process (that it discovered
> > _somehow_) to retrieve the delegation token, in a blocking fashion to
> pause
> > the startup procedure, and (presumably) schedule something into an
> executor
> > to renew the token at a later date.
> > This is of course very simplified, but you get the general idea.
> >
> > @Gyula It's certainly a reasonable design, and re-using Flinks existing
> > mechanisms does make sense.
> > However, I do have to point out that if Spark had used an external process,
> > then we could've just re-used the part that integrates Spark with that,
> and
> > this whole discussion could've been resolved in a day.
> > This is actually what irks me most about this topic. It could be a
> > generic solution to address Kerberos scaling issues that other projects
> > could re-use, instead of everyone having to implement their own custom
> > solution.
> >
> > On 04/02/2022 09:46, Gabor Somogyi wrote:
> >
> >> Hi All,
> >>
> >> First of all, sorry that I've taken a couple of mails heavily!
> >> I had the impression that, after we'd invested roughly 2 months into the
> >> FLIP, it was moving toward a rejection without an alternative we could
> >> work on.
> >>
> >> As I said earlier (and it still stands): if there is a better idea of
> >> how this could be solved, I'm open to it, even at the price of rejecting
> >> this FLIP. Even in the case of suggestions or a rejection, please come
> >> up with a concrete proposal that we can agree on.
> >>
> >> During these 2 months I've considered many options, and this is the
> >> design/code that requires the fewest lines of code and has been
> >> relatively rock stable in production in another product; I personally
> >> have roughly 3 years of experience with it. The design is not a 1-to-1
> >> copy-paste, because I've taken my limited knowledge of Flink into
> >> account.
> >>
> >> Since I'm not the one with 7+ years within Flink, I can accept it if
> >> something is not the way it should be done.
> >> Please suggest a better way, and I'm sure we'll come up with something
> >> that makes everybody happy.
> >>
> >> So I'm waiting on the suggestions, and we'll steer the ship from there...
> >>
> >> G
> >>
> >>
> >> On Fri, Feb 4, 2022 at 12:08 AM Till 

Re: [VOTE] FLIP-211: Kerberos delegation token framework

2022-02-01 Thread David Morávek
The updated FLIP looks good to me. Thanks Gabor for addressing the comments
and updating the FLIP.

+1 (non-binding)

D.

On Mon, Jan 31, 2022 at 10:16 AM Márton Balassi 
wrote:

> +1 (binding)
>
> Given [1] I consider the issue David raised resolved. Thanks David and
> please confirm here.
>
> [1]  https://lists.apache.org/thread/cvwknd5fhohj0wfv8mfwn70jwpjvxrjj
>
> On Mon, Jan 24, 2022 at 11:07 AM David Morávek 
> wrote:
>
>> Hi Gabor,
>>
>> Thanks for driving this. This is headed in a right direction, but I feel
>> that the FLIP still might need bit more work.
>>
>> -1 (non-binding) until the discussion thread is resolved [1].
>>
>> [1] https://lists.apache.org/thread/cvwknd5fhohj0wfv8mfwn70jwpjvxrjj
>>
>> Best,
>> D.
>>
>>
>>
>> On Mon, Jan 24, 2022 at 10:47 AM Gyula Fóra  wrote:
>>
>> > Hi Gabor,
>> >
>> > +1 (binding) from me
>> >
>> > This is a great effort and significant improvement to the Kerberos
>> security
>> > story .
>> >
>> > Cheers
>> > Gyula
>> >
>> > On Fri, 21 Jan 2022 at 15:58, Gabor Somogyi 
>> > wrote:
>> >
>> > > Hi devs,
>> > >
>> > > I would like to start the vote for FLIP-211 [1], which was discussed
>> and
>> > > reached a consensus in the discussion thread [2].
>> > >
>> > > The vote will be open for at least 72h, unless there is an objection
>> or
>> > not
>> > > enough votes.
>> > >
>> > > BR,
>> > > G
>> > >
>> > > [1]
>> > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework
>> > >
>> > > [2] https://lists.apache.org/thread/cvwknd5fhohj0wfv8mfwn70jwpjvxrjj
>> > >
>> >
>>
>


Re: [VOTE] Deprecate Per-Job Mode in Flink 1.15

2022-01-28 Thread David Morávek
+1 (non-binding)

D.

On Fri 28. 1. 2022 at 17:53, Till Rohrmann  wrote:

> +1 (binding)
>
> Cheers,
> Till
>
> On Fri, Jan 28, 2022 at 4:57 PM Gabor Somogyi 
> wrote:
>
> > +1 (non-binding)
> >
> > We're intended to make tests when FLINK-24897
> >  is fixed.
> > In case of further issues we're going to create further jiras.
> >
> > BR,
> > G
> >
> >
> > On Fri, Jan 28, 2022 at 4:30 PM Konstantin Knauf 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Based on the discussion in [1], I would like to start a vote on
> > deprecating
> > > per-job mode in Flink 1.15. Consequently, we would target to drop it in
> > > Flink 1.16 or Flink 1.17 latest.
> > >
> > > The only limitation that would block dropping Per-Job mode mentioned in
> > [1]
> > > is tracked in https://issues.apache.org/jira/browse/FLINK-24897. In
> > > general, the implementation of application mode in YARN should be on
> par
> > > with the standalone and Kubernetes before we drop per-job mode.
> > >
> > > The vote will last for at least 72 hours, and will be accepted by a
> > > consensus of active committers.
> > >
> > > Thanks,
> > >
> > > Konstantin
> > >
> > > [1] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> > >
> > > --
> > >
> > > Konstantin Knauf
> > >
> > > https://twitter.com/snntrable
> > >
> > > https://github.com/knaufk
> > >
> >
>


Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-28 Thread David Morávek
Hi,

AFAIU an under registration TM is not added to the registered TMs map until
> RegistrationResponse ..
>

I think you're right; with a careful design around threading (delegating
update broadcasts to the main thread) plus a synchronous initial update
(which would be nice to avoid), this should be doable.

Not sure what you mean "we can't register the TM without providing it with
> token" but in unsecure configuration registration must happen w/o tokens.
>

Exactly as you describe it: this was meant only for the "kerberized /
secured" cluster case; in other cases we wouldn't enforce a non-null token
in the response.

I think this is a good idea in general.
>

+1

If you don't have any more thoughts on the RPC / lifecycle part, can you
please reflect this in the FLIP?

D.

On Fri, Jan 28, 2022 at 3:16 PM Gabor Somogyi 
wrote:

> > - Make sure DTs issued by single DTMs are monotonically increasing (can
> be
> sorted on TM side)
>
> AFAIU a TM that is under registration is not added to the registered TMs
> map until the RegistrationResponse is processed, which would contain the
> initial tokens. If that's true, then how is it possible to race with a
> DTM update, which works on the registered TMs list?
> To be more specific: "taskExecutors" is the registered map of TMs to which
> the DTM can send updated tokens, but this doesn't contain the TM under
> registration while the RegistrationResponse is not yet processed, right?
>
> Of course, if the DTM can update while the RegistrationResponse is being
> processed, then some kind of sorting would be required, and in that case I
> would agree.
>
> - Scope DT updates by the RM ID and ensure that TM only accepts update from
> the current leader
>
> I initially planned it this way, so agreed.
>
> - Return the initial token with the RegistrationResponse, which should
> make the RPC contract a bit clearer (ensure that we can't register the TM
> without providing it with a token)
>
> I think this is a good idea in general. I'm not sure what you mean by "we
> can't register the TM without providing it with a token", but in an
> unsecure configuration registration must happen w/o tokens.
> All in all, the newly added tokens field must somehow be optional.
>
> G
>
>
> On Fri, Jan 28, 2022 at 2:22 PM David Morávek  wrote:
>
> > We had a long discussion with Chesnay about the possible edge cases and
> it
> > basically boils down to the following two scenarios:
> >
> > 1) There is a possible race condition between TM registration (the first
> DT
> > update) and a token refresh if they happen simultaneously. Then the
> > registration might beat the refreshed token. This could be easily
> > addressed
> > if DTs could be sorted (eg. by the expiration time) on the TM side. In
> > other words, if there are multiple updates at the same time we need to
> make
> > sure that we have a deterministic way of choosing the latest one.
> >
> > One idea by Chesnay that popped up during this discussion was whether we
> > could simply return the initial token with the RegistrationResponse to
> > avoid making an extra call during the TM registration.
> >
> > 2) When the RM leadership changes (eg. because zookeeper session times
> out)
> > there might be a race condition where the old RM is shutting down and
> > updates the tokens, so that it might again beat the registration token of
> the
> > new RM. This could be avoided if we scope the token by
> _ResourceManagerId_
> > and only accept updates for the current leader (basically we'd have an
> > extra parameter to the _updateDelegationToken_ method).
> >
> > -
> >
> > DTM is way simpler than, for example, slot management, which could receive
> > updates from the JobMaster that RM might not know about.
> >
> > So if you want to go down the path you're describing, it should be doable
> and
> > we'd propose following to cover all cases:
> >
> > - Make sure DTs issued by single DTMs are monotonically increasing (can
> be
> > sorted on TM side)
> > - Scope DT updates by the RM ID and ensure that TM only accepts update
> from
> > the current leader
> > - Return initial token with the RegistrationResponse, which should make
> the
> > RPC contract a bit clearer (ensure that we can't register the TM without
> > providing it with token)
> >
> > Any thoughts?
> >
> >
> > On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi <
> gabor.g.somo...@gmail.com>
> > wrote:
> >
> > > Thanks for investing your time!
> > >
> > > The first 2 bulletpoint are clear.
> > > If there is a chance that a TM can go to an inconsistent state th

Re: Re: Looking for Maintainers for Flink on YARN

2022-01-28 Thread David Morávek
Thanks Biao and Ferenc for taking care of this, it's really helpful.

D.

On Fri, Jan 28, 2022 at 1:59 PM Ferenc Csaky 
wrote:

> Hi Konstantin,
>
> We at Cloudera will also help out with this. AFAIK there was a
> conversation about this in the past anyways. I will talk this through with
> the team next week and allocate resources accordingly.
>
> Regards,
> F
>
> On 2022/01/26 09:17:03 Konstantin Knauf wrote:
> > Hi everyone,
> >
> > We are seeing an increasing number of test instabilities related to YARN
> > [1]. Does someone in this group have the time to pick these up? The Flink
> > Confluence contains a guide on how to triage test instability tickets.
> >
> > Thanks,
> >
> > Konstantin
> >
> > [1]
> >
> https://issues.apache.org/jira/browse/FLINK-25514?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20%3D%20%22Deployment%20%2F%20YARN%22%20AND%20labels%20%3D%20test-stability
> > [2]
> >
> https://cwiki.apache.org/confluence/display/FLINK/Triage+Test+Instability+Tickets
> >
> > On Mon, Sep 13, 2021 at 2:22 PM 柳尘  wrote:
> >
> > > Thanks to Konstantin for raising this question, and to Marton and Gabor
> > > for stepping up!
> > >
> > > If I can help, in order to better participate in the work, please let
> > > me know.
> > >
> > > Best,
> > > Cheng Xingyuan
> > >
> > >
> > > > On 29 Jul 2021, at 16:15, Konstantin Knauf wrote:
> > > >
> > > > Dear community,
> > > >
> > > > We are looking for community members, who would like to maintain
> Flink's
> > > > YARN support going forward. So far, this has been handled by teams at
> > > > Ververica & Alibaba. The focus of these teams has shifted over the
> past
> > > > months so that we only have little time left for this topic. Still,
> we
> > > > think, it is important to maintain high quality support for Flink on
> > > YARN.
> > > >
> > > > What does "Maintaining Flink on YARN" mean? There are no known bigger
> > > > efforts outstanding. We are mainly talking about addressing
> > > > "test-stability" issues, bugs, version upgrades, community
> contributions
> > > &
> > > > smaller feature requests. The prioritization of these would be up to
> the
> > > > future maintainers, except "test-stability" issues which are
> important to
> > > > address for overall productivity.
> > > >
> > > > If a group of community members forms itself, we are happy to give an
> > > > introduction to relevant pieces of the code base, principles,
> > > assumptions,
> > > > ... and hand over open threads.
> > > >
> > > > If you would like to take on this responsibility or can join this
> effort
> > > in
> > > > a supporting role, please reach out!
> > > >
> > > > Cheers,
> > > >
> > > > Konstantin
> > > > for the Deployment & Coordination Team at Ververica
> > > >
> > > > --
> > > >
> > > > Konstantin Knauf
> > > >
> > > > https://twitter.com/snntrable
> > > >
> > > > https://github.com/knaufk
> > >
> > >
> >
> > --
> >
> > Konstantin Knauf | Head of Product
> >
> > +49 160 91394525
> >
> >
> > Follow us @VervericaData Ververica 
> >
> >
> > --
> >
> > Join Flink Forward  - The Apache Flink
> > Conference
> >
> > Stream Processing | Event Driven | Real Time
> >
> > --
> >
> > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> >
> > --
> > Ververica GmbH
> > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
> > Jinwei (Kevin) Zhang
> >
>


Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-28 Thread David Morávek
We had a long discussion with Chesnay about the possible edge cases and it
basically boils down to the following two scenarios:

1) There is a possible race condition between TM registration (the first DT
update) and a token refresh if they happen simultaneously. Then the
registration might beat the refreshed token. This could be easily addressed
if DTs could be sorted (eg. by the expiration time) on the TM side. In
other words, if there are multiple updates at the same time we need to make
sure that we have a deterministic way of choosing the latest one.

One idea by Chesnay that popped up during this discussion was whether we
could simply return the initial token with the RegistrationResponse to
avoid making an extra call during the TM registration.

2) When the RM leadership changes (eg. because the zookeeper session times out)
there might be a race condition where the old RM is shutting down and
updates the tokens, so that it might again beat the registration token of the
new RM. This could be avoided if we scope the token by _ResourceManagerId_
and only accept updates for the current leader (basically we'd have an
extra parameter to the _updateDelegationToken_ method).

-

DTM is way simpler than, for example, slot management, which could receive
updates from the JobMaster that RM might not know about.

So if you want to go down the path you're describing, it should be doable and
we'd propose following to cover all cases:

- Make sure DTs issued by single DTMs are monotonically increasing (can be
sorted on TM side)
- Scope DT updates by the RM ID and ensure that TM only accepts update from
the current leader
- Return initial token with the RegistrationResponse, which should make the
RPC contract a bit clearer (ensure that we can't register the TM without
providing it with token)

Any thoughts?
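The first bullet (deterministic "latest token wins" ordering on the TM side) could be sketched roughly as follows; the class names and the use of the expiration time as the ordering key are illustrative assumptions, not existing Flink types:

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical token update; expirationMillis makes updates sortable. */
final class DelegationTokenUpdate {
    final byte[] tokens;
    final long expirationMillis;

    DelegationTokenUpdate(byte[] tokens, long expirationMillis) {
        this.tokens = tokens;
        this.expirationMillis = expirationMillis;
    }
}

/** TM-side holder: an out-of-order update cannot overwrite a newer token. */
final class TokenHolder {
    private final AtomicReference<DelegationTokenUpdate> current =
            new AtomicReference<>();

    /** Returns true if the candidate was accepted as the newest token. */
    boolean update(DelegationTokenUpdate candidate) {
        return current.accumulateAndGet(candidate, (cur, cand) ->
                cur == null || cand.expirationMillis > cur.expirationMillis
                        ? cand : cur) == candidate;
    }

    DelegationTokenUpdate get() {
        return current.get();
    }
}
```

With this in place, the race between the registration-time update and a simultaneous refresh resolves deterministically regardless of arrival order.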


On Fri, Jan 28, 2022 at 10:53 AM Gabor Somogyi 
wrote:

> Thanks for investing your time!
>
> The first 2 bulletpoint are clear.
> If there is a chance that a TM can go to an inconsistent state then I agree
> with the 3rd bulletpoint.
> Just before we agree on that, I would like to learn something new and
> understand how it is possible that a TM
> gets corrupted. (In Spark I've never seen such a thing, and there's no
> mechanism to fix it, but Flink is definitely not Spark.)
>
> Here is my understanding:
> * DTM pushes newly obtained DTs to TMs and if any exception occurs then a
> retry after "security.kerberos.tokens.retry-wait"
> happens. This means DTM retries until it's not possible to send new DTs to
> all registered TMs.
> * New TM registration must fail if "updateDelegationToken" fails
> * "updateDelegationToken" fails consistently like a DB (at least I plan to
> implement it that way).
> If DTs are arriving on the TM side then a single
> "UserGroupInformation.getCurrentUser.addCredentials"
> will be called, which I've never seen fail.
> * I hope all other code parts are not touching existing DTs within the JVM
>
> I would like to emphasize that I'm not against adding it; I just want to
> see what kind of problems we are facing.
> It would make it easier to catch bugs early and would help maintenance.
>
> All in all, I would buy the idea of adding the 3rd bullet if we foresee
> the need.
>
> G
>
>
> On Fri, Jan 28, 2022 at 10:07 AM David Morávek  wrote:
>
> > Hi Gabor,
> >
> > This is definitely headed in the right direction, +1.
> >
> > I think we still need to have a safeguard in case some of the TMs get
> > into an inconsistent state though, which will also eliminate the need for
> > implementing a custom retry mechanism (when _updateDelegationToken_ call
> > fails for some reason).
> >
> > We already have this safeguard in place for slot pool (in case there are
> > some slots in inconsistent state - eg. we haven't freed them for some
> > reason) and for the partition tracker, which could be simply enhanced.
> This
> > is done via periodic heartbeat from TaskManagers to the ResourceManager
> > that contains report about state of these two components (from TM
> > perspective) so the RM can reconcile their state if necessary.
> >
> > I don't think adding an additional field to
> _TaskExecutorHeartbeatPayload_
> > should be a concern as we only heartbeat every ~ 10s by default and the
> new
> > field would be small compared to rest of the existing payload. Also
> > heartbeat doesn't need to contain the whole DT, but just some identifier
> > which signals whether it uses the right one, that could be significantly
> > smaller.
> >
> > This is still a PUSH based approach as the RM would again call the newly
> > introduced _updateDelegationToken_ when it encounters inconsistency (eg.
> > due to a temporary network partition / a race co

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-28 Thread David Morávek
Hi Gabor,

This is definitely headed in the right direction, +1.

I think we still need to have a safeguard in case some of the TMs get into
an inconsistent state though, which will also eliminate the need for
implementing a custom retry mechanism (when _updateDelegationToken_ call
fails for some reason).

We already have this safeguard in place for slot pool (in case there are
some slots in inconsistent state - eg. we haven't freed them for some
reason) and for the partition tracker, which could be simply enhanced. This
is done via periodic heartbeat from TaskManagers to the ResourceManager
that contains report about state of these two components (from TM
perspective) so the RM can reconcile their state if necessary.

I don't think adding an additional field to _TaskExecutorHeartbeatPayload_
should be a concern, as we only heartbeat every ~10s by default and the new
field would be small compared to the rest of the existing payload. Also, the
heartbeat doesn't need to contain the whole DT, but just some identifier
which signals whether it uses the right one, which could be significantly
smaller.

This is still a PUSH-based approach, as the RM would again call the newly
introduced _updateDelegationToken_ when it encounters an inconsistency (eg.
due to a temporary network partition / a race condition we didn't test for
/ some other scenario we didn't think about). In practice these
inconsistencies are super hard to avoid and reason about (and unfortunately
yes, we see them happen from time to time), so reusing the existing
mechanism that is designed for this exact problem simplifies things.

To sum this up we'd have three code paths for calling
_updateDelegationToken_:
1) When the TM registers, we push the token (if DTM already has it) to it
2) When DTM obtains a new token it broadcasts it to all currently connected
TMs
3) When a TM gets out of sync, the DTM would reconcile its state
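A rough sketch of how path 3 could reuse the heartbeat report; all names here are hypothetical (the real _TaskExecutorHeartbeatPayload_ and gateway call would look different), and the token is represented by a small identifier as suggested above:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the RM reconciles a TM whose reported token id
 *  (carried in the periodic heartbeat payload) is stale. */
final class DelegationTokenReconciler {
    /** Stands in for the updateDelegationToken RPC on the TM gateway. */
    interface TokenSender {
        void updateDelegationToken(String taskManagerId, long tokenId);
    }

    private final long currentTokenId; // id of the latest token the DTM holds
    private final TokenSender sender;

    DelegationTokenReconciler(long currentTokenId, TokenSender sender) {
        this.currentTokenId = currentTokenId;
        this.sender = sender;
    }

    /** Called from the heartbeat handler with the id the TM reported. */
    void onHeartbeat(String taskManagerId, long reportedTokenId) {
        if (reportedTokenId != currentTokenId) {
            // Path 3: the TM is out of sync, push the current token again.
            sender.updateDelegationToken(taskManagerId, currentTokenId);
        }
    }
}
```

A TM that reports the current identifier causes no extra traffic; only a stale report triggers the reconcile push.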

WDYT?

Best,
D.


On Wed, Jan 26, 2022 at 9:03 PM David Morávek  wrote:

> Thanks for the update, I'll go over it tomorrow.
>
> On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi 
> wrote:
>
>> Hi All,
>>
>> Since it has turned out that the DTM can't be added as a member of JobMaster
>> <
>> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
>> >
>> I've
>> come up with a better proposal.
>> David, thanks for pointing this out; you've caught a bug in the early
>> phase!
>>
>> Namely ResourceManager
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
>> >
>> is
>> a single-instance class where the DTM can be added as a member variable.
>> It has a list of all already-registered TMs, and new TM registration
>> also
>> happens here.
>> The following can be added, to be more specific about the logic:
>> * Create new DTM instance in ResourceManager
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
>> >
>> and
>> start it (a recurring thread to obtain new tokens)
>> * Add a new function named "updateDelegationTokens" to TaskExecutorGateway
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutorGateway.java#L54
>> >
>> * Call "updateDelegationTokens" on all registered TMs to propagate new DTs
>> * In case of a new TM registration, call "updateDelegationTokens" before
>> the registration succeeds, to set up the new TM properly
>>
>> This way:
>> * only a single DTM would live within a cluster, which is the expected
>> behavior
>> * the DTM is going to be added in a central place where every deployment
>> target can make use of it
>> * DTs are going to be pushed to TMs, which generates less network
>> traffic than a pull-based approach
>> (please see my previous mail where I described both approaches)
>> * the HA scenario is going to be consistent because such
>> <
>> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1069
>> >
>> a solution can be added to "updateDelegationTokens"
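The push-based flow Gabor describes above could be sketched as follows; this is a simplified illustration under stated assumptions (the real ResourceManager, DTM, and TaskExecutorGateway have different signatures, and tokens are shown as opaque byte arrays):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a DTM living inside the ResourceManager. */
final class DelegationTokenManagerSketch {
    /** Stand-in for the proposed updateDelegationTokens gateway method. */
    interface TaskExecutorGateway {
        void updateDelegationTokens(byte[] serializedTokens);
    }

    private final Map<String, TaskExecutorGateway> registeredTaskExecutors =
            new ConcurrentHashMap<>();
    private volatile byte[] latestTokens = new byte[0];

    /** The recurring obtain thread calls this with freshly obtained tokens,
     *  which are then pushed to all registered TMs. */
    void onNewTokensObtained(byte[] tokens) {
        latestTokens = tokens;
        registeredTaskExecutors.values()
                .forEach(g -> g.updateDelegationTokens(tokens));
    }

    /** New TM registration: push the current tokens before the TM is added
     *  to the registered map, so the TM is set up before registration
     *  succeeds. */
    void registerTaskExecutor(String id, TaskExecutorGateway gateway) {
        gateway.updateDelegationTokens(latestTokens);
        registeredTaskExecutors.put(id, gateway);
    }
}
```

Because there is exactly one such object per cluster, only a single DTM instance exists, matching the expected behavior above.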
>>
>> @David (or anyone else), please share whether you agree with this or have
>> a better
>> idea/suggestion.
>>
>> BR,
>> G
>>
>>
>> On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi > >
>

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-26 Thread David Morávek
> > In the default configuration, tokens need to be re-obtained after one
> > day. The DTM tries to obtain new tokens after 1 day * 0.75
> > (security.kerberos.tokens.renewal-ratio) = 18 hours.
> > When that fails, it retries after "security.kerberos.tokens.retry-wait",
> > which is 1 hour by default.
> > If it never succeeds, then an authentication error is going to happen on
> > the TM side and the workload is
> > going to stop.
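The schedule described above boils down to simple arithmetic; a minimal sketch (the config option names come from the discussion above, while the class and method names are illustrative):

```java
import java.time.Duration;

/** Sketch of the token renewal schedule: the next obtain attempt is
 *  scheduled at tokenLifetime * renewal-ratio after the token was issued. */
final class RenewalSchedule {
    static Duration nextObtainDelay(Duration tokenLifetime, double renewalRatio) {
        return Duration.ofMillis((long) (tokenLifetime.toMillis() * renewalRatio));
    }
}
```

With a one-day token lifetime and the default ratio of 0.75 this yields the 18 hours mentioned above.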
> >
> > > should we even have the retry mechanism whatsoever?
> >
> > Yes, because there are always temporary cluster issues.
> >
> > > What does it mean for the running application (how does this look like
> > from
> > the user perspective)? As far as I remember the logs are only collected
> > ("aggregated") after the container is stopped, is that correct?
> >
> > With the default config it works like that, but it can be forced to
> > aggregate at specific intervals.
> > A useful feature is forcing YARN to aggregate logs while the job is still
> > running.
> > For long-running jobs such as streaming jobs, this is invaluable. To do
> > this,
> > yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds must be
> > set to a non-negative value.
> > When this is set, a timer will be set for the given duration, and
> whenever
> > that timer goes off,
> > log aggregation will run on new files.
> >
> > > I think
> > this topic should get its own section in the FLIP (having some cross
> > reference to YARN ticket would be really useful, but I'm not sure if
> there
> > are any).
> >
> > I think this is important knowledge, but this FLIP is not touching the
> > already existing behavior.
> > DTs are set on the AM container, which is renewed by YARN until it's no
> > longer possible.
> > No new code is going to change this limitation. BTW, there is
> > no Jira for this.
> > If you think it's worth writing this down, then I think the good place
> > is the official security doc
> > area, as a caveat.
> >
> > > If we split the FLIP into two parts / sections that I've suggested, I
> > don't
> > really think that you need to explicitly test for each deployment
> scenario
> > / cluster framework, because the DTM part is completely independent of
> the
> > deployment target. Basically this is what I'm aiming for with "making it
> > work with the standalone" (as simple as starting a new java process)
> Flink
> > first (which is also how most people deploy streaming application on k8s
> > and the direction we're pushing forward with the auto-scaling / reactive
> > mode initiatives).
> >
> > I see your point and agree with the main direction. k8s is the megatrend
> > that most people
> > will use sooner or later. I'm not 100% sure what kind of split you
> > suggest, but
> > in my view
> > the main target is to add this feature, and I'm open to any logical work
> > ordering.
> > Please share the specific details and we'll work it out...
> >
> > G
> >
> >
> > On Mon, Jan 24, 2022 at 3:04 PM David Morávek  wrote:
> >
> >> >
> >> > Could you point to the code where you think it could be added exactly? A
> >> > helping hand is welcome here 
> >> >
> >>
> >> I think you can take a look at _ResourceManagerPartitionTracker_ [1]
> which
> >> seems to have somewhat similar properties to the DTM.
> >>
> >> One topic that needs to be addressed there is how the RPC with the
> >> _TaskExecutorGateway_ should look like.
> >> - Do we need to introduce a new RPC method or can we for example
> piggyback
> >> on heartbeats?
> >> - What delivery semantics are we looking for? (what if we're only able
> to
> >> update subset of TMs / what happens if we exhaust retries / should we
> even
> >> have the retry mechanism whatsoever) - I have a feeling that somehow
> >> leveraging the existing heartbeat mechanism could help to answer these
> >> questions
> >>
> >> In short, after the DT reaches its max lifetime, log aggregation stops
> >> >
> >>
> >> What does it mean for the running application (how does this look like
> >> from
> >> the user perspective)? As far as I remember the logs are only collected
> >> ("aggregated") after the container is stopped, is that correct? I think
> >> this topic should get its own section in the FLIP (having some cross
> >> reference to YARN ticket would be really u

Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-01-26 Thread David Morávek
I'd second Alex's concerns. Is there a reason why you can't use the
async-profiler directly? In what kind of environment are your Flink
clusters running (YARN / k8s / ...)?

Best,
D.

On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov 
wrote:

> Hi Jacky,
>
> Could you please clarify what kind of *problems* you experience with the
> large parallelism? You referred to D3; is it something related to rendering
> on the browser side or is it about the samples collection process? Were you
> able to identify the bottleneck?
>
> Fundamentally I have some concerns regarding the proposed approach:
> 1. Calling shell scripts triggered via the web UI is a security concern and
> it needs to be evaluated carefully if it could introduce any unexpected
> attack vectors (depending on the implementation, passed parameters etc.)
> 2. My understanding is that the async-profiler implementation is
> system-dependent. How do you propose to handle multiple architectures?
> Would you like to ship each available implementation within Flink? [1]
> 3. Do you plan to make use of full async-profiler features including native
> calls sampling with perf_events? If so, the issue I see is that some
> environments restrict ptrace calls by default [2]
>
> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> [2]
>
> https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
>
>
> Best,
> Alexander Fedulov
>
> On Wed, Jan 26, 2022 at 1:59 PM 李森  wrote:
>
> > This is an expected feature, as we also experienced browser crashes on
> > existing operator-level flame graphs
> >
> > Best,
> > Echo Lee
> >
> > > On 24 Jan 2022, at 18:16, David Morávek wrote:
> > >
> > > Hi Jacky,
> > >
> > > The link seems to be broken, here is the correct one [1].
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >
> > > Best,
> > > D.
> > >
> > >> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <281293...@qq.com.invalid>
> > wrote:
> > >>
> > >> Hi All,
> > >>   I would like to start the discussion on FLIP-213 <
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >> >,
> > >> which aims to provide TaskManager-level (process-level) flame graphs
> > >> via async-profiler, the most popular tool for Java performance
> > >> analysis; both Arthas and IntelliJ use it.
> > >> We already support it in our company, Ant Group.
> > >>   Flink already supports FLIP-165: Operator's Flame Graphs,
> > >> which draws flame graphs with the front-end library d3-flame-graph;
> > >> this has some problems for jobs with large parallelism.
> > >>   Please be aware that the FLIP wiki page is not fully done,
> > >> since I don't know whether it will be accepted by the Flink
> > >> community.
> > >>   Feel free to add your thoughts to make this feature
> > >> better! I am looking forward to all your responses. Thanks very much!
> > >>
> > >>
> > >>
> > >>
> > >> Best Jacky Lau
> >
>


Re: [DISCUSS] Pushing Apache Flink 1.15 Feature Freeze

2022-01-26 Thread David Morávek
+1, especially for the reasons Yuan has mentioned

D.

On Wed, Jan 26, 2022 at 9:15 AM Yu Li  wrote:

> +1 to extend the feature freeze date to Feb. 14th, which might be a good
> Valentine's Day present for all Flink developers as well (smile).
>
> Best Regards,
> Yu
>
>
> On Wed, 26 Jan 2022 at 14:50, Yuan Mei  wrote:
>
> > +1 extending feature freeze for one week.
> >
> > A code freeze on the 6th (the end of the Spring Festival) is equivalent
> > to a code freeze at the end of this week for our Chinese colleagues,
> > since the Spring Festival starts next week.
> > It also means they would have to be partially available during the
> > holiday; otherwise they would block the release if any unexpected issues
> > arise.
> >
> > The situation sounds a bit stressful and can be resolved very well by
> > extending the freeze date a bit.
> >
> > Best
> > Yuan
> >
> > On Wed, Jan 26, 2022 at 11:18 AM Yun Tang  wrote:
> >
> > > Since the official Spring Festival holidays in China run from Jan 31st
> > > to Feb 6th, and many developers in China will enjoy the holidays at
> > > that
> > > time,
> > > +1 for extending the feature freeze.
> > >
> > > Best
> > > Yun Tang
> > > 
> > > From: Jingsong Li 
> > > Sent: Wednesday, January 26, 2022 10:32
> > > To: dev 
> > > Subject: Re: [DISCUSS] Pushing Apache Flink 1.15 Feature Freeze
> > >
> > > +1 for extending the feature freeze.
> > >
> > > Thanks Joe for driving.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Wed, Jan 26, 2022 at 12:04 AM Martijn Visser  >
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > +1 for extending the feature freeze. We could use the time to try to
> > wrap
> > > > up some important SQL related features and improvements.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > On Tue, 25 Jan 2022 at 16:38, Johannes Moser 
> > wrote:
> > > >
> > > > > Dear Flink community,
> > > > >
> > > > > as mentioned in the summary mail earlier, some contributors voiced
> > that
> > > > > they would benefit from pushing the feature freeze for 1.15 by a
> > week.
> > > > > This would mean Monday, 14th of February 2022, end of business
> CEST.
> > > > >
> > > > > Please let us know in case you got any concerns.
> > > > >
> > > > >
> > > > > Best,
> > > > > Till, Yun Gao & Joe
> > >
> >
>


Re: [DISCUSS] Releasing Flink 1.13.6

2022-01-25 Thread David Morávek
Thanks for driving this Martijn, +1 for the release

Also big thanks to Konstantin for volunteering

Best,
D.

On Mon, Jan 24, 2022 at 3:24 PM Till Rohrmann  wrote:

> +1 for the 1.13.6 release and thanks for volunteering Konstantin.
>
> Cheers,
> Till
>
> On Mon, Jan 24, 2022 at 2:57 PM Konstantin Knauf 
> wrote:
>
> > Thanks for starting the discussion and +1 to releasing.
> >
> > I am happy to manage the release aka learn how to do this.
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Mon, Jan 24, 2022 at 2:52 PM Martijn Visser 
> > wrote:
> >
> > > I would like to start a discussion on releasing Flink 1.13.6. Flink
> > 1.13.5
> > > was the latest release on the 16th of December, which was the emergency
> > > release for the Log4j CVE [1]. Flink 1.13.4 was cancelled, leaving
> Flink
> > > 1.13.3 as the last real bugfix release. This one was released on the
> 19th
> > > of October last year.
> > >
> > > Since then, there have been 61 fixed tickets, excluding the test
> > > stabilities [3]. This includes a blocker and a couple of critical
> issues.
> > >
> > > Is there a PMC member who would like to manage the release? I'm more
> than
> > > happy to help with monitoring the status of the tickets.
> > >
> > > Best regards,
> > >
> > > Martijn Visser
> > > https://twitter.com/MartijnVisser82
> > >
> > > [1] https://flink.apache.org/news/2021/12/16/log4j-patch-releases.html
> > > [2] https://flink.apache.org/news/2021/10/19/release-1.13.3.html
> > > [3] JQL filter: project = FLINK AND resolution = Fixed AND fixVersion =
> > > 1.13.6 AND labels != test-stability ORDER BY priority DESC, created
> DESC
> > >
> >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread David Morávek
>
> Do we need to introduce a new RPC method or can we for example piggyback
> on heartbeats?
>

It seems we can use the very same approach as
_ResourceManagerPartitionTracker_ is using:
- _TaskManagers_ periodically report which token they're using (e.g.
identified by some id). This involves adding a new field to
_TaskExecutorHeartbeatPayload_.
- Once a report arrives, the DTM checks the token and updates it if necessary
(we'd introduce a new method for that on TaskExecutorGateway).
- If an update fails, we don't need to retry; the next heartbeat takes care
of that.
- The heartbeat mechanism already covers TM failure scenarios.
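
A minimal sketch of the DTM-side bookkeeping this would imply. All names here
(onHeartbeat, rotateToken, the token-id strings) are hypothetical and only
illustrate the flow, not Flink's actual RPC classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the heartbeat-piggyback idea: TMs report the id of
// the token they hold, and the DTM decides whether an update RPC is needed.
public class DelegationTokenTracker {

    // Latest token id the DTM has obtained.
    private volatile String currentTokenId;

    // Token id last reported by each TaskManager via its heartbeat payload.
    private final Map<String, String> reportedByTm = new ConcurrentHashMap<>();

    public DelegationTokenTracker(String initialTokenId) {
        this.currentTokenId = initialTokenId;
    }

    // Called when a heartbeat payload arrives. Returns true when the TM is
    // behind and an update should be sent; a failed update needs no retry,
    // because the next heartbeat reports the stale id again.
    public boolean onHeartbeat(String tmId, String reportedTokenId) {
        reportedByTm.put(tmId, reportedTokenId);
        return !currentTokenId.equals(reportedTokenId);
    }

    // Called by the renewal logic once a fresh token has been obtained.
    public void rotateToken(String newTokenId) {
        currentTokenId = newTokenId;
    }
}
```

With this shape the delivery semantics fall out of the heartbeat cadence: a
TM that missed an update simply keeps reporting the old id until the next
heartbeat round succeeds.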

On Mon, Jan 24, 2022 at 3:03 PM David Morávek  wrote:

> Could you point to a code where you think it could be added exactly? A
>> helping hand is welcome here 
>>
>
> I think you can take a look at _ResourceManagerPartitionTracker_ [1] which
> seems to have somewhat similar properties to the DTM.
>
> One topic that needs to be addressed there is how the RPC with the
> _TaskExecutorGateway_ should look like.
> - Do we need to introduce a new RPC method or can we for example piggyback
> on heartbeats?
> - What delivery semantics are we looking for? (what if we're only able to
> update subset of TMs / what happens if we exhaust retries / should we even
> have the retry mechanism whatsoever) - I have a feeling that somehow
> leveraging the existing heartbeat mechanism could help to answer these
> questions
>
> In short, after a DT reaches its max lifetime, log aggregation stops
>>
>
> What does it mean for the running application (how does this look
> from the user perspective)? As far as I remember the logs are only
> collected ("aggregated") after the container is stopped, is that correct? I
> think this topic should get its own section in the FLIP (having some cross
> reference to YARN ticket would be really useful, but I'm not sure if there
> are any).
>
> All deployment modes (per-job, per-app, ...) are planned to be tested and
>> expect to work with the initial implementation however not all deployment
>> targets (k8s, local, ...
>>
>
> If we split the FLIP into two parts / sections that I've suggested, I
> don't really think that you need to explicitly test for each deployment
> scenario / cluster framework, because the DTM part is completely
> independent of the deployment target. Basically this is what I'm aiming for
> with "making it work with the standalone" (as simple as starting a new java
> process) Flink first (which is also how most people deploy streaming
> application on k8s and the direction we're pushing forward with the
> auto-scaling / reactive mode initiatives).
>
> The whole integration with YARN (let's forget about log aggregation for a
> moment) / k8s-native only boils down to how do we make the keytab file
> local to the JobManager so the DTM can read it, so it's basically built on
> top of that. The only special thing that needs to be tested there is the
> "keytab distribution" code path.
>
> [1]
> https://github.com/apache/flink/blob/release-1.14.3/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResourceManagerPartitionTracker.java
>
> Best,
> D.
>
> On Mon, Jan 24, 2022 at 12:35 PM Gabor Somogyi 
> wrote:
>
>> > There is a separate JobMaster for each job
>> within a Flink cluster and each JobMaster only has a partial view of the
>> task managers
>>
>> Good point! I've had a deeper look and you're right. We definitely need to
>> find another place.
>>
>> > Related per-cluster or per-job keytab:
>>
>> In the current code per-cluster keytab is implemented and I intend to
>> keep it like this within this FLIP. The reason is simple: tokens on TM
>> side
>> can be stored within the UserGroupInformation (UGI) structure which is
>> global. I'm not telling it's impossible to change that but I think that
>> this is such a complexity which the initial implementation is not required
>> to contain. Additionally we've not seen such need from user side. If the
>> need may rise later on then another FLIP with this topic can be created
>> and
>> discussed. Proper multi-UGI handling within a single JVM is a topic where
>> several round of deep-dive with the Hadoop/YARN guys are required.
>>
>> > single DTM instance embedded with
>> the ResourceManager (the Flink component)
>>
>> Could you point to a code where you think it could be added exactly? A
>> helping hand is welcome here
>>
>> > Then the single (initial) implementation should work with all the
>> deployments modes out of the box (which is not what the FLIP suggests). Is
>> t

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread David Morávek
>
> Could you point to a code where you think it could be added exactly? A
> helping hand is welcome here 
>

I think you can take a look at _ResourceManagerPartitionTracker_ [1] which
seems to have somewhat similar properties to the DTM.

One topic that needs to be addressed there is how the RPC with the
_TaskExecutorGateway_ should look like.
- Do we need to introduce a new RPC method or can we for example piggyback
on heartbeats?
- What delivery semantics are we looking for? (what if we're only able to
update subset of TMs / what happens if we exhaust retries / should we even
have the retry mechanism whatsoever) - I have a feeling that somehow
leveraging the existing heartbeat mechanism could help to answer these
questions

In short, after a DT reaches its max lifetime, log aggregation stops
>

What does it mean for the running application (how does this look from
the user perspective)? As far as I remember the logs are only collected
("aggregated") after the container is stopped, is that correct? I think
this topic should get its own section in the FLIP (having some cross
reference to YARN ticket would be really useful, but I'm not sure if there
are any).

All deployment modes (per-job, per-app, ...) are planned to be tested and
> expect to work with the initial implementation however not all deployment
> targets (k8s, local, ...
>

If we split the FLIP into two parts / sections that I've suggested, I don't
really think that you need to explicitly test for each deployment scenario
/ cluster framework, because the DTM part is completely independent of the
deployment target. Basically this is what I'm aiming for with "making it
work with the standalone" (as simple as starting a new java process) Flink
first (which is also how most people deploy streaming applications on k8s
and the direction we're pushing forward with the auto-scaling / reactive
mode initiatives).

The whole integration with YARN (let's forget about log aggregation for a
moment) / k8s-native only boils down to how do we make the keytab file
local to the JobManager so the DTM can read it, so it's basically built on
top of that. The only special thing that needs to be tested there is the
"keytab distribution" code path.

[1]
https://github.com/apache/flink/blob/release-1.14.3/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResourceManagerPartitionTracker.java

Best,
D.

On Mon, Jan 24, 2022 at 12:35 PM Gabor Somogyi 
wrote:

> > There is a separate JobMaster for each job
> within a Flink cluster and each JobMaster only has a partial view of the
> task managers
>
> Good point! I've had a deeper look and you're right. We definitely need to
> find another place.
>
> > Related per-cluster or per-job keytab:
>
> In the current code per-cluster keytab is implemented and I intend to
> keep it like this within this FLIP. The reason is simple: tokens on TM side
> can be stored within the UserGroupInformation (UGI) structure which is
> global. I'm not telling it's impossible to change that but I think that
> this is such a complexity which the initial implementation is not required
> to contain. Additionally we've not seen such need from user side. If the
> need may rise later on then another FLIP with this topic can be created and
> discussed. Proper multi-UGI handling within a single JVM is a topic where
> several round of deep-dive with the Hadoop/YARN guys are required.
>
> > single DTM instance embedded with
> the ResourceManager (the Flink component)
>
> Could you point to a code where you think it could be added exactly? A
> helping hand is welcome here
>
> > Then the single (initial) implementation should work with all the
> deployments modes out of the box (which is not what the FLIP suggests). Is
> that correct?
>
> All deployment modes (per-job, per-app, ...) are planned to be tested and
> expected to work with the initial implementation; however, not all
> deployment targets (k8s, local, ...) are intended to be tested. Per
> deployment target a new Jira needs to be created, where I expect a small
> amount of code needs to be added and a relatively expensive testing effort
> is required.
>
> > I've taken a look into the prototype and in the "YarnClusterDescriptor"
> you're injecting a delegation token into the AM [1] (that's obtained using
> the provided keytab). If I understand this correctly from previous
> discussion / FLIP, this is to support log aggregation and DT has a limited
> validity. How is this DT going to be renewed?
>
> You're clever and touched a limitation which Spark has too. In short, after
> a DT reaches its max lifetime, log aggregation stops. I've had several
> deep-dive rounds with the YARN guys during my Spark years because I wanted
> to fill this gap. They can't provide

Re: [VOTE] FLIP-203: Incremental savepoints

2022-01-24 Thread David Morávek
+1 (non-binding)

Best,
D.

On Mon, Jan 24, 2022 at 10:54 AM Dawid Wysakowicz 
wrote:

> +1 (binding)
>
> Best,
>
> Dawid
>
> On 24/01/2022 09:56, Piotr Nowojski wrote:
> > Hi,
> >
> > As there seems to be no further questions about the FLIP-203 [1] I would
> > propose to start a voting thread for it.
> >
> > For me there are still two unanswered questions, whether we want to
> support
> > schema evolution and State Processor API with native format snapshots or
> > not. But I would propose to tackle them as follow ups, since those are
> > pre-existing issues of the native format checkpoints, and could be done
> > completely independently of providing the native format support in
> > savepoints.
> >
> > Best,
> > Piotrek
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints
> >
>


Re: [ANNOUNCE] flink-shaded 15.0 released

2022-01-24 Thread David Morávek
That's great news, Chesnay, thanks for driving this! This should unblock
some ongoing Flink efforts +1

Best,
D.

On Mon, Jan 24, 2022 at 10:58 AM Chesnay Schepler 
wrote:

> Hello everyone,
>
> we got a new flink-shaded release, with several nifty things:
>
>   * updated version for ASM, required for Java 17
>   * jackson extensions for optionals/datetime, which will be used by the
> Table API (and maybe REST API)
>   * a relocated version of swagger, finally unblocking the merge of our
> experimental swagger spec
>   * updated version for Netty, providing a proper fix for FLINK-24197
>
>


Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-01-24 Thread David Morávek
Hi Jacky,

The link seems to be broken, here is the correct one [1].

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs

Best,
D.

On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <281293...@qq.com.invalid> wrote:

> Hi All,
>   I would like to start the discussion on FLIP-213 <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> >
> which aims to provide a TaskManager-level (process-level) flame graph
> via async-profiler, which is the most popular tool for Java performance
> analysis; both Arthas and IntelliJ use it, and we support it in our Ant
> Group company.
>  Flink already supports FLIP-165: Operator's Flame Graphs. That
> feature draws flame graphs with the front-end library d3-flame-graph,
> which has some problems for jobs with large parallelism.
>  Please be aware that the FLIP wiki page is not fully done yet,
> since I don't know whether it will be accepted by the Flink community.
>  Feel free to add your thoughts to make this feature better! I am
> looking forward to all your responses. Thanks very much!
>
>
>
>
> Best, Jacky Lau


Re: [VOTE] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread David Morávek
Hi Gabor,

Thanks for driving this. This is headed in the right direction, but I feel
that the FLIP still needs a bit more work.

-1 (non-binding) until the discussion thread is resolved [1].

[1] https://lists.apache.org/thread/cvwknd5fhohj0wfv8mfwn70jwpjvxrjj

Best,
D.



On Mon, Jan 24, 2022 at 10:47 AM Gyula Fóra  wrote:

> Hi Gabor,
>
> +1 (binding) from me
>
> This is a great effort and significant improvement to the Kerberos security
> story .
>
> Cheers
> Gyula
>
> On Fri, 21 Jan 2022 at 15:58, Gabor Somogyi 
> wrote:
>
> > Hi devs,
> >
> > I would like to start the vote for FLIP-211 [1], which was discussed and
> > reached a consensus in the discussion thread [2].
> >
> > The vote will be open for at least 72h, unless there is an objection or
> not
> > enough votes.
> >
> > BR,
> > G
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework
> >
> > [2] https://lists.apache.org/thread/cvwknd5fhohj0wfv8mfwn70jwpjvxrjj
> >
>


Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-24 Thread David Morávek
Hi Gabor,

There is actually a huge difference between JobManager (process) and
JobMaster (job coordinator). The naming is unfortunately a bit misleading
here for historical reasons. There is a separate JobMaster for each job
within a Flink cluster and each JobMaster only has a partial view of the
task managers (depends on where the slots for a particular job are
allocated). This means that you'll end up with N "DelegationTokenManagers"
competing with each other (N = number of running jobs in the cluster).

This makes me think we're mixing two abstraction levels here:

a) Per-cluster delegation tokens
- Simpler approach, it would involve a single DTM instance embedded with
the ResourceManager (the Flink component)
b) Per-job delegation tokens
- More complex approach, but could be more flexible from the user side of
things.
- Multiple DTM instances, that are bound with the JobMaster lifecycle.
Delegation tokens are attached with a particular slots that are executing
the job tasks instead of the whole task manager (TM could be executing
multiple jobs with different tokens).
- The question is which keytab should be used for the clustering framework,
to support log aggregation on YARN (an extra keytab, keytab that comes with
the first job?)

I think these are the things that need to be clarified in the FLIP before
proceeding.

A follow-up question for getting a better understanding where this should
be headed: Are there any use cases where user may want to use different
keytabs with each job, or are we fine with using a cluster-wide keytab? If
we go with per-cluster keytabs, is it OK that all jobs submitted into this
cluster can access it (even the future ones)? Should this be a security
concern?
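
To make the two options concrete, here is a toy model of the lifecycles
involved; every type name below is made up for illustration and none of this
is Flink's actual API:

```java
// Toy model of the two delegation-token scopes discussed above.
public class TokenScopeModel {

    interface DelegationTokenManager {
        void start(); // begin obtaining and renewing tokens
        void stop();
    }

    // Option (a): a single DTM for the whole cluster, started and stopped
    // together with the ResourceManager component.
    static final class PerClusterDtm implements DelegationTokenManager {
        @Override public void start() { /* obtain cluster-wide tokens from the keytab */ }
        @Override public void stop() { }
    }

    // Option (b): one DTM per job, bound to the JobMaster lifecycle; a
    // TaskManager running slots of several jobs would then hold several
    // tokens at once.
    static final class PerJobDtm implements DelegationTokenManager {
        private final String jobId;

        PerJobDtm(String jobId) {
            this.jobId = jobId;
        }

        @Override public void start() { /* obtain tokens for this job's keytab */ }
        @Override public void stop() { }
    }
}
```

The interesting difference is purely where start()/stop() are called from:
the ResourceManager lifecycle in option (a), or each JobMaster's leadership
grant and job termination in option (b).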

Presume you thought I would implement a new class with the JobManager name.
> The plan is not that.
>

I've never suggested such thing.


> No. As I said earlier, DT handling is planned to be done completely in
> Flink. The DTM has a renewal thread which re-obtains tokens at the proper
> time when needed.
>

Then the single (initial) implementation should work with all the
deployments modes out of the box (which is not what the FLIP suggests). Is
that correct?

If the cluster framework also requires delegation tokens for its inner
workings (IMO this only applies to YARN), it might need an extra step
(injecting the token into the application master container).

Separating the individual layers (actual Flink cluster - basically making
this work with a standalone deployment  / "cluster framework" - support for
YARN log aggregation) in the FLIP would be useful.

Reading the linked Spark readme could be useful.
>

I've read that, but please be patient with the questions, Kerberos is not
an easy topic to get into and I've had very little contact with it in the
past.

https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
>

I've taken a look into the prototype and in the "YarnClusterDescriptor"
you're injecting a delegation token into the AM [1] (that's obtained using
the provided keytab). If I understand this correctly from previous
discussion / FLIP, this is to support log aggregation and DT has a limited
validity. How is this DT going to be renewed?

[1]
https://github.com/gaborgsomogyi/flink/commit/8ab75e46013f159778ccfce52463e7bc63e395a9#diff-02416e2d6ca99e1456f9c3949f3d7c2ac523d3fe25378620c09632e4aac34e4eR1261

Best,
D.

On Fri, Jan 21, 2022 at 9:35 PM Gabor Somogyi 
wrote:

> Here is the exact class, I'm from mobile so not had a look at the exact
> class name:
>
> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> That keeps track of TMs where the tokens can be sent to.
>
> > My feeling would be that we shouldn't really introduce a new component
> with
> a custom lifecycle, but rather we should try to incorporate this into
> existing ones.
>
> Can you be more specific? Presume you though I would implement a new class
> with JobManager name. The plan is not that.
>
> > If I understand this correctly, this means that we then push the token
> renewal logic to YARN.
>
> No. That said earlier DT handling is planned to be done completely in
> Flink. DTM has a renewal thread which re-obtains tokens in the proper time
> when needed. YARN log aggregation is a totally different feature, where
> YARN does the renewal. Log aggregation was an example why the code can't be
> 100% reusable for all resource managers. Reading the linked Spark readme
> could be useful.
>
> G
>
> On Fri, 21 Jan 2022, 21:05 David Morávek,  wrote:
>
> > >
> > > JobManager is the Flink class.
> >
> >
> > There is no such class in Flink. The closest thing

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-21 Thread David Morávek
>
> JobManager is the Flink class.


There is no such class in Flink. The closest thing to the JobManager is a
ClusterEntrypoint. The cluster entrypoint spawns new RM Runner & Dispatcher
Runner that start participating in the leader election. Once they gain
leadership they spawn the actual underlying instances of these two "main
components".

My feeling would be that we shouldn't really introduce a new component with
a custom lifecycle, but rather we should try to incorporate this into
existing ones.

My biggest concerns would be:

- How would the lifecycle of the new component look with regards to HA
setups. If we really try to decide to introduce a completely new component,
how should this work in case of multiple JobManager instances?
- Which components does it talk to / how? For example how does the
broadcast of new token to task managers (TaskManagerGateway) look like? Do
we simply introduce a new RPC on the ResourceManagerGateway that broadcasts
it or does the new component need to do some kind of bookkeeping of task
managers that it needs to notify?

YARN based HDFS log aggregation would not work by dropping that code. Just
> to be crystal clear, the actual implementation contains this for exactly
> this reason.
>

This is the missing part +1. If I understand this correctly, this means
that we then push the token renewal logic to YARN. How do you plan to
implement the renewal logic on k8s?
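
Whatever the deployment target, the renewal itself can live in a plain
thread inside the JM process that re-obtains the token before it expires. A
sketch of the usual scheduling heuristic follows; the 0.75 ratio mirrors
what Hadoop-ecosystem projects commonly use and is an assumption here, not
something the FLIP specifies:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of a renewal-time heuristic for delegation tokens: re-obtain the
// token once 75% of its lifetime has elapsed, leaving headroom for retries.
public final class TokenRenewal {

    static final double RENEW_RATIO = 0.75;

    // Returns the instant at which the token should be re-obtained.
    static Instant nextRenewalTime(Instant issued, Instant expires) {
        long lifetimeMillis = Duration.between(issued, expires).toMillis();
        return issued.plusMillis((long) (lifetimeMillis * RENEW_RATIO));
    }

    public static void main(String[] args) {
        Instant issued = Instant.parse("2022-01-24T00:00:00Z");
        Instant expires = issued.plus(Duration.ofHours(24));
        // 75% of a 24h lifetime schedules renewal 18h after issue.
        System.out.println(nextRenewalTime(issued, expires));
    }
}
```

Note that once the max renewable lifetime is reached (the YARN
log-aggregation limitation discussed in this thread), no scheduling trick
helps; only a keytab-based re-obtain can produce a fresh token.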

D.

On Fri, Jan 21, 2022 at 8:37 PM Gabor Somogyi 
wrote:

> > I think we might both mean something different by the RM.
>
> You feel it well, I've not specified these terms well in the explanation.
> By RM I meant the resource management framework. JobManager is the Flink class.
> This means that inside JM instance there will be a DTM instance, so they
> would have the same lifecycle. Hope I've answered the question.
>
> > If we have tokens available on the client side, why do we need to set
> them
> into the AM (yarn specific concept) launch context?
>
> YARN based HDFS log aggregation would not work by dropping that code. Just
> to be crystal clear, the actual implementation contains this for exactly
> this reason.
>
> G
>
> On Fri, 21 Jan 2022, 20:12 David Morávek,  wrote:
>
> > Hi Gabor,
> >
> > 1. One thing is important, token management is planned to be done
> > > generically within Flink and not scattered in RM specific code.
> > JobManager
> > > has a DelegationTokenManager which obtains tokens time-to-time (if
> > > configured properly). JM knows which TaskManagers are in place so it
> can
> > > distribute it to all TMs. That's it basically.
> >
> >
> > I think we might both mean something different by the RM. JobManager is
> > basically just a process encapsulating multiple components, one of which
> is
> > a ResourceManager, which is the component that manages task manager
> > registrations [1]. There is more or less a single implementation of the
> RM
> > with plugable drivers for the active integrations (yarn, k8s).
> >
> > It would be great if you could share more details of how exactly the DTM
> is
> > going to fit in the current JM architecture.
> >
> > 2. 99.9% of the code is generic but each RM handles tokens differently. A
> > > good example is YARN obtains tokens on client side and then sets them
> on
> > > the newly created AM container launch context. This is purely YARN
> > specific
> > > and can't be spared. With my actual plans standalone can be changed to
> > use
> > > the framework. By using it I mean no RM specific DTM or whatsoever is
> > > needed.
> > >
> >
> > If we have tokens available on the client side, why do we need to set
> them
> > into the AM (yarn specific concept) launch context? Why can't we simply
> > send them to the JM, eg. as a parameter of the job submission / via
> > separate RPC call? There might be something I'm missing due to limited
> > knowledge, but handling the token on the "cluster framework" level
> doesn't
> > seem necessary.
> >
> > [1]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager
> >
> > Best,
> > D.
> >
> > On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi  >
> > wrote:
> >
> > > Oh and one more thing. I'm planning to add this feature in small
> > > chunks of PRs because security is a super hairy area. That way
> > > reviewers can more easily grasp the concept.
> > >
> > > On Fri, 21 Jan 2022, 18:03 David Morávek,  wrote:
> > >
> > > > Hi Gabor,
> > &

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-21 Thread David Morávek
Hi Gabor,

1. One thing is important, token management is planned to be done
> generically within Flink and not scattered in RM specific code. JobManager
> has a DelegationTokenManager which obtains tokens time-to-time (if
> configured properly). JM knows which TaskManagers are in place so it can
> distribute it to all TMs. That's it basically.


I think we might both mean something different by the RM. JobManager is
basically just a process encapsulating multiple components, one of which is
a ResourceManager, which is the component that manages task manager
registrations [1]. There is more or less a single implementation of the RM
with pluggable drivers for the active integrations (yarn, k8s).

It would be great if you could share more details of how exactly the DTM is
going to fit in the current JM architecture.

2. 99.9% of the code is generic but each RM handles tokens differently. A
> good example is YARN obtains tokens on client side and then sets them on
> the newly created AM container launch context. This is purely YARN specific
> and can't be spared. With my actual plans standalone can be changed to use
> the framework. By using it I mean no RM specific DTM or whatsoever is
> needed.
>

If we have tokens available on the client side, why do we need to set them
into the AM (yarn specific concept) launch context? Why can't we simply
send them to the JM, eg. as a parameter of the job submission / via
separate RPC call? There might be something I'm missing due to limited
knowledge, but handling the token on the "cluster framework" level doesn't
seem necessary.

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#jobmanager

Best,
D.

On Fri, Jan 21, 2022 at 7:48 PM Gabor Somogyi 
wrote:

> Oh and one more thing. I'm planning to add this feature in small chunks of
> PRs because security is a super hairy area. That way reviewers can more
> easily grasp the concept.
>
> On Fri, 21 Jan 2022, 18:03 David Morávek,  wrote:
>
> > Hi Gabor,
> >
> > thanks for drafting the FLIP, I think having a solid Kerberos support is
> > crucial for many enterprise deployments.
> >
> > I have multiple questions regarding the implementation (note that I have
> > very limited knowledge of Kerberos):
> >
> > 1) If I understand it correctly, we'll only obtain tokens in the job
> > manager and then we'll distribute them via RPC (needs to be secured).
> >
> > Can you please outline how the communication will look like? Is the
> > DelegationTokenManager going to be a part of the ResourceManager? Can you
> > outline it's lifecycle / how it's going to be integrated there?
> >
> > 2) Do we really need YARN / k8s specific implementations? Is it
> possible
> > to obtain / renew a token in a generic way? Maybe to rephrase that, is it
> > possible to implement DelegationTokenManager for the standalone Flink? If
> > we're able to solve this point, it could be possible to target all
> > deployment scenarios with a single implementation.
> >
> > Best,
> > D.
> >
> > On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang 
> > wrote:
> >
> > > Hi G
> > >
> > > Thanks for your detailed explanation. I have understood your thoughts,
> > > and in any case this proposal is a great improvement.
> > >
> > > Looking forward to your implementation and I will keep following it.
> > > Thanks again.
> > >
> > > Best
> > > JunFan.
> > > On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi <
> gabor.g.somo...@gmail.com
> > >,
> > > wrote:
> > > > Just to confirm keeping "security.kerberos.fetch.delegation-token" is
> > > added
> > > > to the doc.
> > > >
> > > > BR,
> > > > G
> > > >
> > > >
> > > > On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi <
> > gabor.g.somo...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi JunFan,
> > > > >
> > > > > > By the way, maybe this should be added in the migration plan or
> > > > > integration section in the FLIP-211.
> > > > >
> > > > > Going to add this soon.
> > > > >
> > > > > > Besides, I have a question that the KDC will collapse when the
> > > cluster
> > > > > reached 200 nodes you described
> > > > > in the google doc. Do you have any attachment or reference to prove
> > it?
> > > > >
> > > > > "KDC *may* collapse under some circumstances" is the

Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-21 Thread David Morávek
Hi Gabor,

thanks for drafting the FLIP, I think having a solid Kerberos support is
crucial for many enterprise deployments.

I have multiple questions regarding the implementation (note that I have
very limited knowledge of Kerberos):

1) If I understand it correctly, we'll only obtain tokens in the job
manager and then we'll distribute them via RPC (needs to be secured).

Can you please outline how the communication will look like? Is the
DelegationTokenManager going to be a part of the ResourceManager? Can you
outline its lifecycle / how it's going to be integrated there?

2) Do we really need YARN / k8s specific implementations? Is it possible
to obtain / renew a token in a generic way? Maybe to rephrase that, is it
possible to implement DelegationTokenManager for the standalone Flink? If
we're able to solve this point, it could be possible to target all
deployment scenarios with a single implementation.

Best,
D.

On Fri, Jan 14, 2022 at 3:47 AM Junfan Zhang 
wrote:

> Hi G
>
> Thanks for your detailed explanation. I have understood your thoughts, and
> in any case this proposal is a great improvement.
>
> Looking forward to your implementation and I will keep following it.
> Thanks again.
>
> Best
> JunFan.
> On Jan 13, 2022, 9:20 PM +0800, Gabor Somogyi ,
> wrote:
> > Just to confirm keeping "security.kerberos.fetch.delegation-token" is
> added
> > to the doc.
> >
> > BR,
> > G
> >
> >
> > On Thu, Jan 13, 2022 at 1:34 PM Gabor Somogyi  >
> > wrote:
> >
> > > Hi JunFan,
> > >
> > > > By the way, maybe this should be added in the migration plan or
> > > integration section in the FLIP-211.
> > >
> > > Going to add this soon.
> > >
> > > > Besides, I have a question that the KDC will collapse when the
> cluster
> > > reached 200 nodes you described
> > > in the google doc. Do you have any attachment or reference to prove it?
> > >
> > > "KDC *may* collapse under some circumstances" is the proper wording.
> > >
> > > We have several customers who are executing workloads on Spark/Flink.
> Most
> > > of the time I'm facing their
> > > daily issues, which are heavily environment and use-case dependent. I've
> > > seen various cases:
> > > * where the mentioned ~1k nodes were working fine
> > > * where KDC thought the number of requests was coming from a DDOS
> > > attack, so it discontinued authentication
> > > * where KDC was simply not responding because of the load
> > > * where KDC intermittently had some outages (this was the nastiest
> > > thing)
> > >
> > > Since you're managing relatively big cluster then you know that KDC is
> not
> > > only used by Spark/Flink workloads
> > > but the whole company IT infrastructure is bombing it so it really
> depends
> > > on other factors too whether KDC is reaching
> > > its limit or not. Not sure what kind of evidence you are looking for
> but
> > > I'm not authorized to share any information about
> > > our clients data.
> > >
> > > One thing is for sure. The more external system types are used in
> > > workloads (for ex. HDFS, HBase, Hive, Kafka) which
> > > are authenticating through KDC, the more likely it is to reach this
> > > threshold when the cluster is big enough.
> > >
> > > All in all this feature is here to help all users never reach this
> > > limitation.
> > >
> > > BR,
> > > G
> > >
> > >
> > > On Thu, Jan 13, 2022 at 1:00 PM 张俊帆  wrote:
> > >
> > > > Hi G
> > > >
> > > > Thanks for your quick reply. I think reserving the config of
> > > > *security.kerberos.fetch.delegation-token*
> > > > and simply disabling the token fetching is a good idea. By the way,
> > > > maybe this should be added
> > > > in the migration plan or intergation section in the FLIP-211.
> > > >
> > > > Besides, I have a question that the KDC will collapse when the
> cluster
> > > > reached 200 nodes you described
> > > > in the google doc. Do you have any attachment or reference to prove
> it?
> > > > Because in our internal per-cluster,
> > > > the node count reaches > 1000 and KDC looks good. Did I miss or
> > > > misunderstand something? Please correct me.
> > > >
> > > > Best
> > > > JunFan.
> > > > On Jan 13, 2022, 5:26 PM +0800, dev@flink.apache.org, wrote:
> > > > >
> > > > >
> > > >
> https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit?fbclid=IwAR0vfeJvAbEUSzHQAAJfnWTaX46L6o7LyXhMfBUCcPrNi-uXNgoOaI8PMDQ
> > > >
> > >
>

