Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-18 Thread Yi Zhang
Hi All,


This discussion has been ongoing for some time now and has 
received excellent suggestions and positive feedback—thanks 
again!


If you have any additional questions, concerns, or ideas, please 
feel free to share them.


If there is no further input, I will move forward with starting the 
vote on this FLIP tomorrow.


Best Regards,
Yi



At 2026-01-15 17:45:36, "Yi Zhang"  wrote:
>Hi!
>
>
>Thanks so much for your feedback and suggestions. I have updated the 
>FLIP accordingly and truly appreciate your input!
>
>
>Best,
>Yi
>
>
>
>At 2026-01-15 14:20:53, [email protected] wrote:
>>Hi!
>>
>>I think a separate endpoint with the effective config makes perfect sense and 
>>will cover the requirements.
>>
>>Thank you for including it :)
>>
>>Gyula
>>
>>Sent from my iPhone
>>
>>> On 15 Jan 2026, at 06:50, Yi Zhang  wrote:
>>> 
>>> Hi Gyula,
>>> 
>>> 
>>> I have given this some more thought, and I agree with your
>>> point: users (and external controllers like the Kubernetes
>>> Operator) need access to critical context such as the state
>>> recovery path, even in cases where no job has been
>>> successfully submitted.
>>> 
>>> 
>>> What if we introduce a new REST API endpoint, such
>>> as /applications/:applicationid/jobmanager/config,
>>> to expose the effective JobManager configuration used (or
>>> intended to be used) by the application? This could include
>>> key settings like state recovery path and other relevant
>>> configured options.
>>> 
>>> 
>>> It might help make the API responsibilities clearer, and also
>>> provide valuable visibility even when no errors occur. For
>>> actual error details, the
>>> /applications/:applicationid/exceptions endpoint can be
>>> used.
>>> 
>>> 
>>> I’d appreciate your thoughts on this approach. Thanks!
>>> 
>>> 
>>> Best,
>>> Yi
>>> 
>>> At 2026-01-12 15:23:58, "Gyula Fóra"  wrote:
 Hi!
 
 Overall it makes sense; I cannot reproduce a job actually being
 submitted/archived properly with an invalid savepoint path (I am testing
 on 2.1, as no 2.2/2.3 Docker image is available at the moment). In
 Kubernetes this seems to cause an immediate JobManager shutdown.
 
 You are right, I am mostly concerned about all the error scenarios that
 currently do not result in FAILED job submissions.
 
 Regarding what error metadata to expose, I think what you write makes
 sense, with the only specific exception that checkpoint/state recovery
 information (what checkpoint we are restoring from at the moment) should
 always be included in the error/job/app metadata. This is crucial for the
 Kubernetes Operator (and possibly other external control planes) to handle
 the error. Currently, since this information is sometimes lost, it leads
 to many corner cases requiring manual intervention from users.
 
 Cheers
 Gyula
 
> On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang  wrote:
> 
> 
> 
> Hi Gyula,
> 
> 
> Thank you very much for your explanation.
> 
> 
> "Some errors such as invalid state path are not even submitted or when it
> is Flink
> uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that 
> doesn't
> actually include information about the checkpoint restore path/configs
> etc."
> 
> 
> 
> I ran some tests on the latest Flink 2.3 release by submitting a job with
> an invalid
> `execution.state-recovery.path`. The job submission itself succeeded, but
> the job
> failed during initialization. It seems that at least for misconfigured
> state recovery
> paths, the job still goes through submission and gets archived with
> sufficient
> diagnostic info. Am I missing anything here? If there’s a specific Jira
> issue
> describing such a scenario, it would be great to reference it for more
> concrete
> discussion around this requirement.
> 
> 
> That said, I agree that there are different error scenarios we might
> encounter, which
> broadly fall into three categories:
> 
> 1. Failures after successful job submission, which result in a FAILED job
> state. In
> these cases, relevant diagnostics are already accessible via existing
> job-related
> REST endpoints.
> 2. Failures during job submission, leaving no concrete job entity to 
> query.
> 3. Failures in the main() method unrelated to job submission/execution.
> 
> The originally proposed /applications/:applicationid/exceptions endpoint
> is intended
> to expose exceptions from all three categories. From my understanding, 
> your
> primary interest lies in scenario #2, where additional context could help
> diagnose
> why submission failed, even though no real job was created.
> 
> 
> Rather than introducing a general endpoint that exposes all possible
> configuration
> and metadata, would it be more practical to conditionally enrich exceptions?

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-15 Thread gyula . fora
Hi!

I think a separate endpoint with the effective config makes perfect sense and 
will cover the requirements.

Thank you for including it :)

Gyula

Sent from my iPhone

> On 15 Jan 2026, at 06:50, Yi Zhang  wrote:
> 
> Hi Gyula,
> 
> 
> I have given this some more thought, and I agree with your
> point: users (and external controllers like the Kubernetes
> Operator) need access to critical context such as the state
> recovery path, even in cases where no job has been
> successfully submitted.
> 
> 
> What if we introduce a new REST API endpoint, such
> as /applications/:applicationid/jobmanager/config,
> to expose the effective JobManager configuration used (or
> intended to be used) by the application? This could include
> key settings like state recovery path and other relevant
> configured options.
> 
> 
> It might help make the API responsibilities clearer, and also
> provide valuable visibility even when no errors occur. For
> actual error details, the
> /applications/:applicationid/exceptions endpoint can be
> used.
> 
> 
> I’d appreciate your thoughts on this approach. Thanks!
> 
> 
> Best,
> Yi
> 
> At 2026-01-12 15:23:58, "Gyula Fóra"  wrote:
>> Hi!
>> 
>> Overall it makes sense; I cannot reproduce a job actually being
>> submitted/archived properly with an invalid savepoint path (I am testing
>> on 2.1, as no 2.2/2.3 Docker image is available at the moment). In
>> Kubernetes this seems to cause an immediate JobManager shutdown.
>> 
>> You are right, I am mostly concerned about all the error scenarios that
>> currently do not result in FAILED job submissions.
>> 
>> Regarding what error metadata to expose, I think what you write makes
>> sense, with the only specific exception that checkpoint/state recovery
>> information (what checkpoint we are restoring from at the moment) should
>> always be included in the error/job/app metadata. This is crucial for the
>> Kubernetes Operator (and possibly other external control planes) to handle
>> the error. Currently, since this information is sometimes lost, it leads
>> to many corner cases requiring manual intervention from users.
>> 
>> Cheers
>> Gyula
>> 
>>> On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang  wrote:
>>> 
>>> 
>>> 
>>> Hi Gyula,
>>> 
>>> 
>>> Thank you very much for your explanation.
>>> 
>>> 
>>> "Some errors such as invalid state path are not even submitted or when it
>>> is Flink
>>> uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
>>> actually include information about the checkpoint restore path/configs
>>> etc."
>>> 
>>> 
>>> 
>>> I ran some tests on the latest Flink 2.3 release by submitting a job with
>>> an invalid
>>> `execution.state-recovery.path`. The job submission itself succeeded, but
>>> the job
>>> failed during initialization. It seems that at least for misconfigured
>>> state recovery
>>> paths, the job still goes through submission and gets archived with
>>> sufficient
>>> diagnostic info. Am I missing anything here? If there’s a specific Jira
>>> issue
>>> describing such a scenario, it would be great to reference it for more
>>> concrete
>>> discussion around this requirement.
>>> 
>>> 
>>> That said, I agree that there are different error scenarios we might
>>> encounter, which
>>> broadly fall into three categories:
>>> 
>>> 1. Failures after successful job submission, which result in a FAILED job
>>> state. In
>>> these cases, relevant diagnostics are already accessible via existing
>>> job-related
>>> REST endpoints.
>>> 2. Failures during job submission, leaving no concrete job entity to query.
>>> 3. Failures in the main() method unrelated to job submission/execution.
>>> 
>>> The originally proposed /applications/:applicationid/exceptions endpoint
>>> is intended
>>> to expose exceptions from all three categories. From my understanding, your
>>> primary interest lies in scenario #2, where additional context could help
>>> diagnose
>>> why submission failed, even though no real job was created.
>>> 
>>> 
>>> Rather than introducing a general endpoint that exposes all possible
>>> configuration
>>> and metadata, would it be more practical to conditionally enrich
>>> exceptions? For
>>> example, when a submission fails due to invalid state paths or
>>> misconfigured
>>> options, we could attach the relevant configuration settings. This
>>> approach would
>>> complement the /applications/:applicationid/exceptions design and allow us
>>> to
>>> incrementally evolve toward richer diagnostics over time.
>>> Having a concrete use case would greatly help align on the scope and
>>> implementation details of such enrichment.
>>> 
>>> 
>>> Thanks again for your valuable feedback and suggestions!
>>> 
>>> 
>>> Best,
>>> Yi
>>> 
>>> P.S. I’ve updated the FLIP to reflect the change regarding using job name
>>> for job
>>> matching. Please let me know if you have any further questions or
>>> suggestions.
>>> 
>>> 
>>> At 2026-01-08 17:07:16, "Gyula Fóra"  wrote:
 Hi Yi!
 
 Sorry for the late reply, I somehow missed your response.

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-15 Thread Yi Zhang
Hi!


Thanks so much for your feedback and suggestions. I have updated the 
FLIP accordingly and truly appreciate your input!


Best,
Yi



At 2026-01-15 14:20:53, [email protected] wrote:
>Hi!
>
>I think a separate endpoint with the effective config makes perfect sense and 
>will cover the requirements.
>
>Thank you for including it :)
>
>Gyula
>
>Sent from my iPhone
>
>> On 15 Jan 2026, at 06:50, Yi Zhang  wrote:
>> 
>> Hi Gyula,
>> 
>> 
>> I have given this some more thought, and I agree with your
>> point: users (and external controllers like the Kubernetes
>> Operator) need access to critical context such as the state
>> recovery path, even in cases where no job has been
>> successfully submitted.
>> 
>> 
>> What if we introduce a new REST API endpoint, such
>> as /applications/:applicationid/jobmanager/config,
>> to expose the effective JobManager configuration used (or
>> intended to be used) by the application? This could include
>> key settings like state recovery path and other relevant
>> configured options.
>> 
>> 
>> It might help make the API responsibilities clearer, and also
>> provide valuable visibility even when no errors occur. For
>> actual error details, the
>> /applications/:applicationid/exceptions endpoint can be
>> used.
>> 
>> 
>> I’d appreciate your thoughts on this approach. Thanks!
>> 
>> 
>> Best,
>> Yi
>> 
>> At 2026-01-12 15:23:58, "Gyula Fóra"  wrote:
>>> Hi!
>>> 
>>> Overall it makes sense; I cannot reproduce a job actually being
>>> submitted/archived properly with an invalid savepoint path (I am testing
>>> on 2.1, as no 2.2/2.3 Docker image is available at the moment). In
>>> Kubernetes this seems to cause an immediate JobManager shutdown.
>>> 
>>> You are right, I am mostly concerned about all the error scenarios that
>>> currently do not result in FAILED job submissions.
>>> 
>>> Regarding what error metadata to expose, I think what you write makes
>>> sense, with the only specific exception that checkpoint/state recovery
>>> information (what checkpoint we are restoring from at the moment) should
>>> always be included in the error/job/app metadata. This is crucial for the
>>> Kubernetes Operator (and possibly other external control planes) to handle
>>> the error. Currently, since this information is sometimes lost, it leads
>>> to many corner cases requiring manual intervention from users.
>>> 
>>> Cheers
>>> Gyula
>>> 
 On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang  wrote:
 
 
 
 Hi Gyula,
 
 
 Thank you very much for your explanation.
 
 
 "Some errors such as invalid state path are not even submitted or when it
 is Flink
 uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
 actually include information about the checkpoint restore path/configs
 etc."
 
 
 
 I ran some tests on the latest Flink 2.3 release by submitting a job with
 an invalid
 `execution.state-recovery.path`. The job submission itself succeeded, but
 the job
 failed during initialization. It seems that at least for misconfigured
 state recovery
 paths, the job still goes through submission and gets archived with
 sufficient
 diagnostic info. Am I missing anything here? If there’s a specific Jira
 issue
 describing such a scenario, it would be great to reference it for more
 concrete
 discussion around this requirement.
 
 
 That said, I agree that there are different error scenarios we might
 encounter, which
 broadly fall into three categories:
 
 1. Failures after successful job submission, which result in a FAILED job
 state. In
 these cases, relevant diagnostics are already accessible via existing
 job-related
 REST endpoints.
 2. Failures during job submission, leaving no concrete job entity to query.
 3. Failures in the main() method unrelated to job submission/execution.
 
 The originally proposed /applications/:applicationid/exceptions endpoint
 is intended
 to expose exceptions from all three categories. From my understanding, your
 primary interest lies in scenario #2, where additional context could help
 diagnose
 why submission failed, even though no real job was created.
 
 
 Rather than introducing a general endpoint that exposes all possible
 configuration
 and metadata, would it be more practical to conditionally enrich
 exceptions? For
 example, when a submission fails due to invalid state paths or
 misconfigured
 options, we could attach the relevant configuration settings. This
 approach would
 complement the /applications/:applicationid/exceptions design and allow us
 to
 incrementally evolve toward richer diagnostics over time.
 Having a concrete use case would greatly help align on the scope and
 implementation details of such enrichment.
 
 
 Thanks again for your valuable feedback and suggestions!

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-14 Thread Yi Zhang
Hi Gyula,


I have given this some more thought, and I agree with your 
point: users (and external controllers like the Kubernetes 
Operator) need access to critical context such as the state 
recovery path, even in cases where no job has been
successfully submitted.


What if we introduce a new REST API endpoint, such 
as /applications/:applicationid/jobmanager/config,
to expose the effective JobManager configuration used (or 
intended to be used) by the application? This could include 
key settings like state recovery path and other relevant
configured options. 


It might help make the API responsibilities clearer, and also
provide valuable visibility even when no errors occur. For
actual error details, the 
/applications/:applicationid/exceptions endpoint can be
used.
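
(For illustration only: a minimal Java sketch of how a client, e.g. an
external controller, could query the proposed endpoint. The endpoint is
only proposed here and exists in no released Flink version; I am assuming
it would return the same flat key/value JSON list as the existing
/jobs/:jobid/jobmanager/config handler, and the host and application id
below are placeholders.)

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class EffectiveConfigProbe {
        public static void main(String[] args) throws Exception {
            // Hypothetical: this endpoint is only proposed in this FLIP and
            // does not exist in any released Flink version.
            String restAddress = "http://localhost:8081"; // placeholder JM REST address
            String appId = "my-application-id";           // placeholder application id
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(restAddress + "/applications/" + appId
                            + "/jobmanager/config"))
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // Assumed response shape, mirroring /jobs/:jobid/jobmanager/config:
            // [{"key": "execution.state-recovery.path", "value": "..."}, ...]
            System.out.println(response.statusCode() + "\n" + response.body());
        }
    }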


I’d appreciate your thoughts on this approach. Thanks!


Best,
Yi

At 2026-01-12 15:23:58, "Gyula Fóra"  wrote:
>Hi!
>
>Overall it makes sense; I cannot reproduce a job actually being
>submitted/archived properly with an invalid savepoint path (I am testing
>on 2.1, as no 2.2/2.3 Docker image is available at the moment). In
>Kubernetes this seems to cause an immediate JobManager shutdown.
>
>You are right, I am mostly concerned about all the error scenarios that
>currently do not result in FAILED job submissions.
>
>Regarding what error metadata to expose, I think what you write makes
>sense, with the only specific exception that checkpoint/state recovery
>information (what checkpoint we are restoring from at the moment) should
>always be included in the error/job/app metadata. This is crucial for the
>Kubernetes Operator (and possibly other external control planes) to handle
>the error. Currently, since this information is sometimes lost, it leads
>to many corner cases requiring manual intervention from users.
>
>Cheers
>Gyula
>
>On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang  wrote:
>
>>
>>
>> Hi Gyula,
>>
>>
>> Thank you very much for your explanation.
>>
>>
>> "Some errors such as invalid state path are not even submitted or when it
>> is Flink
>> uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
>> actually include information about the checkpoint restore path/configs
>> etc."
>>
>>
>>
>> I ran some tests on the latest Flink 2.3 release by submitting a job with
>> an invalid
>> `execution.state-recovery.path`. The job submission itself succeeded, but
>> the job
>> failed during initialization. It seems that at least for misconfigured
>> state recovery
>> paths, the job still goes through submission and gets archived with
>> sufficient
>> diagnostic info. Am I missing anything here? If there’s a specific Jira
>> issue
>> describing such a scenario, it would be great to reference it for more
>> concrete
>> discussion around this requirement.
>>
>>
>> That said, I agree that there are different error scenarios we might
>> encounter, which
>> broadly fall into three categories:
>>
>> 1. Failures after successful job submission, which result in a FAILED job
>> state. In
>> these cases, relevant diagnostics are already accessible via existing
>> job-related
>> REST endpoints.
>> 2. Failures during job submission, leaving no concrete job entity to query.
>> 3. Failures in the main() method unrelated to job submission/execution.
>>
>> The originally proposed /applications/:applicationid/exceptions endpoint
>> is intended
>> to expose exceptions from all three categories. From my understanding, your
>> primary interest lies in scenario #2, where additional context could help
>> diagnose
>> why submission failed, even though no real job was created.
>>
>>
>> Rather than introducing a general endpoint that exposes all possible
>> configuration
>> and metadata, would it be more practical to conditionally enrich
>> exceptions? For
>> example, when a submission fails due to invalid state paths or
>> misconfigured
>> options, we could attach the relevant configuration settings. This
>> approach would
>> complement the /applications/:applicationid/exceptions design and allow us
>> to
>> incrementally evolve toward richer diagnostics over time.
>> Having a concrete use case would greatly help align on the scope and
>> implementation details of such enrichment.
>>
>>
>> Thanks again for your valuable feedback and suggestions!
>>
>>
>> Best,
>> Yi
>>
>> P.S. I’ve updated the FLIP to reflect the change regarding using job name
>> for job
>> matching. Please let me know if you have any further questions or
>> suggestions.
>>
>>
>> At 2026-01-08 17:07:16, "Gyula Fóra"  wrote:
>> >Hi Yi!
>> >
>> >Sorry for the late reply, I somehow missed your response:
>> >
>> >"Flink’s existing archive mechanism—combined
>> >with the HistoryServer—already provides persistent access to job-related
>> >information after failure. Specifically, the existing HistoryServer
>> >endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>> >configuration, including checkpoint restore paths, and remains
>> >accessible after failure."
>> >
>> >You are right: when a job fails, we can indeed see the past checkpoint
>> >history etc.

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-11 Thread Gyula Fóra
Hi!

Overall it makes sense; I cannot reproduce a job actually being
submitted/archived properly with an invalid savepoint path (I am testing
on 2.1, as no 2.2/2.3 Docker image is available at the moment). In
Kubernetes this seems to cause an immediate JobManager shutdown.

You are right, I am mostly concerned about all the error scenarios that
currently do not result in FAILED job submissions.

Regarding what error metadata to expose, I think what you write makes
sense, with the only specific exception that checkpoint/state recovery
information (what checkpoint we are restoring from at the moment) should
always be included in the error/job/app metadata. This is crucial for the
Kubernetes Operator (and possibly other external control planes) to handle
the error. Currently, since this information is sometimes lost, it leads
to many corner cases requiring manual intervention from users.

Cheers
Gyula

On Mon, Jan 12, 2026 at 7:43 AM Yi Zhang  wrote:

>
>
> Hi Gyula,
>
>
> Thank you very much for your explanation.
>
>
> "Some errors such as invalid state path are not even submitted or when it
> is Flink
> uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
> actually include information about the checkpoint restore path/configs
> etc."
>
>
>
> I ran some tests on the latest Flink 2.3 release by submitting a job with
> an invalid
> `execution.state-recovery.path`. The job submission itself succeeded, but
> the job
> failed during initialization. It seems that at least for misconfigured
> state recovery
> paths, the job still goes through submission and gets archived with
> sufficient
> diagnostic info. Am I missing anything here? If there’s a specific Jira
> issue
> describing such a scenario, it would be great to reference it for more
> concrete
> discussion around this requirement.
>
>
> That said, I agree that there are different error scenarios we might
> encounter, which
> broadly fall into three categories:
>
> 1. Failures after successful job submission, which result in a FAILED job
> state. In
> these cases, relevant diagnostics are already accessible via existing
> job-related
> REST endpoints.
> 2. Failures during job submission, leaving no concrete job entity to query.
> 3. Failures in the main() method unrelated to job submission/execution.
>
> The originally proposed /applications/:applicationid/exceptions endpoint
> is intended
> to expose exceptions from all three categories. From my understanding, your
> primary interest lies in scenario #2, where additional context could help
> diagnose
> why submission failed, even though no real job was created.
>
>
> Rather than introducing a general endpoint that exposes all possible
> configuration
> and metadata, would it be more practical to conditionally enrich
> exceptions? For
> example, when a submission fails due to invalid state paths or
> misconfigured
> options, we could attach the relevant configuration settings. This
> approach would
> complement the /applications/:applicationid/exceptions design and allow us
> to
> incrementally evolve toward richer diagnostics over time.
> Having a concrete use case would greatly help align on the scope and
> implementation details of such enrichment.
>
>
> Thanks again for your valuable feedback and suggestions!
>
>
> Best,
> Yi
>
> P.S. I’ve updated the FLIP to reflect the change regarding using job name
> for job
> matching. Please let me know if you have any further questions or
> suggestions.
>
>
> At 2026-01-08 17:07:16, "Gyula Fóra"  wrote:
> >Hi Yi!
> >
> >Sorry for the late reply, I somehow missed your response:
> >
> >"Flink’s existing archive mechanism—combined
> >with the HistoryServer—already provides persistent access to job-related
> >information after failure. Specifically, the existing HistoryServer
> >endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the
> >configuration, including checkpoint restore paths, and remains
> >accessible after failure."
> >
> >You are right: when a job fails, we can indeed see the past checkpoint
> >history etc. But I think this doesn't apply to jobs that failed during
> >submission or in the main method. Some errors, such as an invalid state
> >path, mean the job is not even submitted; or, when it is, Flink uses
> >ArchivedExecutionGraph.createSparseArchivedExecutionGraph, which doesn't
> >actually include information about the checkpoint restore path/configs
> >etc.
> >
> >Cheers
> >Gyula
> >
> >On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang  wrote:
> >
> >> Hi Gyula,
> >>
> >>
> >> Thank you so much for your thoughtful and insightful feedback!
> >>
> >>
> >> 1.  I fully agree that using the job name for job matching is more
> >> user-friendly and
> >> cleaner than relying on a jobIndex parameter. I’ll update the FLIP
> >> accordingly
> >> to reflect this design change.
> >>
> >>
> >> 2. I’d like to dig a bit deeper to make sure I fully understand the
> >> requirement.
> >> You have mentioned the need for a generic information endpoint that
> >> remains accessible even after failure.

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-11 Thread Yi Zhang


Hi Gyula,


Thank you very much for your explanation.


"Some errors such as invalid state path are not even submitted or when it is 
Flink 
uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't
actually include information about the checkpoint restore path/configs etc."



I ran some tests on the latest Flink 2.3 release by submitting a job with an 
invalid 
`execution.state-recovery.path`. The job submission itself succeeded, but the 
job 
failed during initialization. It seems that at least for misconfigured state 
recovery 
paths, the job still goes through submission and gets archived with sufficient 
diagnostic info. Am I missing anything here? If there’s a specific Jira issue 
describing such a scenario, it would be great to reference it for more concrete 
discussion around this requirement.
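
(Illustrative only: a rough sketch of the kind of misconfigured job used
in that test. The restore path is a placeholder, the setup is heavily
simplified, and the API calls reflect my understanding of recent Flink
versions.)

    import java.util.Map;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class InvalidRecoveryPathRepro {
        public static void main(String[] args) throws Exception {
            // Deliberately broken restore location (placeholder path).
            Configuration conf = Configuration.fromMap(Map.of(
                    "execution.state-recovery.path", "file:///tmp/does-not-exist"));
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            env.fromData(1, 2, 3).print();
            // Behavior described above: submission succeeds, then the job
            // fails during initialization while trying to restore state.
            env.execute("invalid-recovery-path-test");
        }
    }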


That said, I agree that there are different error scenarios we might encounter, 
which
broadly fall into three categories:

1. Failures after successful job submission, which result in a FAILED job 
state. In
these cases, relevant diagnostics are already accessible via existing 
job-related
REST endpoints.
2. Failures during job submission, leaving no concrete job entity to query.
3. Failures in the main() method unrelated to job submission/execution.

The originally proposed /applications/:applicationid/exceptions endpoint is 
intended
to expose exceptions from all three categories. From my understanding, your
primary interest lies in scenario #2, where additional context could help 
diagnose
why submission failed, even though no real job was created.


Rather than introducing a general endpoint that exposes all possible 
configuration
and metadata, would it be more practical to conditionally enrich exceptions? 
For 
example, when a submission fails due to invalid state paths or misconfigured 
options, we could attach the relevant configuration settings. This approach 
would 
complement the /applications/:applicationid/exceptions design and allow us to 
incrementally evolve toward richer diagnostics over time.
Having a concrete use case would greatly help align on the scope and 
implementation details of such enrichment.
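
(Purely a strawman of what such enrichment could look like; the class and
fields below are hypothetical and do not exist in Flink today.)

    import java.util.Map;

    /**
     * Hypothetical sketch: a submission failure that carries the
     * configuration entries relevant to the error (e.g. the restore path),
     * so that the /applications/:applicationid/exceptions payload could
     * expose them alongside the stack trace.
     */
    public class EnrichedSubmissionException extends Exception {

        private final Map<String, String> relevantConfig;

        public EnrichedSubmissionException(
                String message, Throwable cause, Map<String, String> relevantConfig) {
            super(message, cause);
            this.relevantConfig = Map.copyOf(relevantConfig);
        }

        public Map<String, String> getRelevantConfig() {
            return relevantConfig;
        }
    }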


Thanks again for your valuable feedback and suggestions!


Best,
Yi

P.S. I’ve updated the FLIP to reflect the change regarding using job name for 
job 
matching. Please let me know if you have any further questions or suggestions.


At 2026-01-08 17:07:16, "Gyula Fóra"  wrote:
>Hi Yi!
>
>Sorry for the late reply, I somehow missed your response:
>
>"Flink’s existing archive mechanism—combined
>with the HistoryServer—already provides persistent access to job-related
>information after failure. Specifically, the existing HistoryServer endpoint
>`/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration,
>including checkpoint restore paths, and remains accessible after failure."
>
>You are right: when a job fails, we can indeed see the past checkpoint
>history etc. But I think this doesn't apply to jobs that failed during
>submission or in the main method. Some errors, such as an invalid state
>path, mean the job is not even submitted; or, when it is, Flink uses
>ArchivedExecutionGraph.createSparseArchivedExecutionGraph, which doesn't
>actually include information about the checkpoint restore path/configs etc.
>
>Cheers
>Gyula
>
>On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang  wrote:
>
>> Hi Gyula,
>>
>>
>> Thank you so much for your thoughtful and insightful feedback!
>>
>>
>> 1.  I fully agree that using the job name for job matching is more
>> user-friendly and
>> cleaner than relying on a jobIndex parameter. I’ll update the FLIP
>> accordingly
>> to reflect this design change.
>>
>>
>> 2. I’d like to dig a bit deeper to make sure I fully understand the
>> requirement.
>> You have mentioned the need for a generic information endpoint that
>> remains
>> accessible even after failure, and that it should include additional info
>> such as
>> the checkpoint restore path and configuration.
>>
>> From my current understanding, Flink’s existing archive mechanism—combined
>> with the HistoryServer—already provides persistent access to job-related
>> information after failure. Specifically, the existing HistoryServer
>> endpoint
>> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>> configuration
>> including checkpoint restore paths, and remains accessible after failure.
>> On the other hand, the proposed /applications/:appid/exceptions endpoint
>> is
>> intended specifically to surface application-level exceptions that occur
>> outside
>> the job lifecycle, which will also be available through the HistoryServer
>> after
>> failure.
>>
>> So could you help clarify whether there is a specific failure scenario or
>> use case
>> where the current archiving/HistoryServer mechanism falls short or where
>> critical
>> debugging information—like the restore path or configuration—is not
>> retrievable
>> after a failure?
>>
>>
>> Thanks again for your excellent suggestions!

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-08 Thread Gyula Fóra
Hi Yi!

Sorry for the late reply, I somehow missed your response:

"Flink’s existing archive mechanism—combined
with the HistoryServer—already provides persistent access to job-related
information after failure. Specifically, the existing HistoryServer endpoint
`/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration,
including checkpoint restore paths, and remains accessible after failure."

You are right: when a job fails, we can indeed see the past checkpoint
history etc. But I think this doesn't apply to jobs that failed during
submission or in the main method. Some errors, such as an invalid state
path, mean the job is not even submitted; or, when it is, Flink uses
ArchivedExecutionGraph.createSparseArchivedExecutionGraph, which doesn't
actually include information about the checkpoint restore path/configs etc.

Cheers
Gyula

On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang  wrote:

> Hi Gyula,
>
>
> Thank you so much for your thoughtful and insightful feedback!
>
>
> 1.  I fully agree that using the job name for job matching is more
> user-friendly and
> cleaner than relying on a jobIndex parameter. I’ll update the FLIP
> accordingly
> to reflect this design change.
>
>
> 2. I’d like to dig a bit deeper to make sure I fully understand the
> requirement.
> You have mentioned the need for a generic information endpoint that
> remains
> accessible even after failure, and that it should include additional info
> such as
> the checkpoint restore path and configuration.
>
> From my current understanding, Flink’s existing archive mechanism—combined
> with the HistoryServer—already provides persistent access to job-related
> information after failure. Specifically, the existing HistoryServer
> endpoint
> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
> configuration
> including checkpoint restore paths, and remains accessible after failure.
> On the other hand, the proposed /applications/:appid/exceptions endpoint
> is
> intended specifically to surface application-level exceptions that occur
> outside
> the job lifecycle, which will also be available through the HistoryServer
> after
> failure.
>
> So could you help clarify whether there is a specific failure scenario or
> use case
> where the current archiving/HistoryServer mechanism falls short or where
> critical
> debugging information—like the restore path or configuration—is not
> retrievable
> after a failure?
>
>
> Thanks again for your excellent suggestions!
>
> Best,
> Yi
>
> At 2025-12-25 21:08:49, "Gyula Fóra"  wrote:
> >Hi!
> >
> >Overall I think the design/improvements look great. Some minor comments,
> >improvement possibilities:
> >
> >1. Could we simply use the job name for job matching? I think it's fair to
> >require unique job names (or if they are not unique attach a sequence
> >number to the name) instead of the jobIndex parameter. JobIndex sounds a
> >bit weird and low level.
> >
> >2. A big problem/limitation of the existing submission logic is that the
> >submit-on-error logic is very limited (only handling certain types of
> >errors and only showing exception info). We should capture different
> errors
> >and metadata for failed applications including checkpoint settings (for
> >instance what checkpoint path was used during restore, which is a common
> >cause of the errors). So instead of introducing a
> >/applications/appid/exceptions endpoint, can we instead introduce a more
> >generic information endpoint that would contain other information? This
> >endpoint should be accessible even in case of failures and populated from
> >the app result store and should also contain some other info such as
> >checkpoint restore path, configuration etc.
> >
> >Capturing more information on failed submissions would help resolve a lot
> >of long outstanding issues in the Flink Kubernetes Operator as well.
> >
> >Cheers
> >Gyula
> >
> >
> >On Thu, Dec 25, 2025 at 1:54 PM Lei Yang  wrote:
> >
> >> Thank you Yi for your reply, looks good to me!
> >> +1 for this proposal
> >> Best,
> >> Lei
> >>
> >> Yi Zhang wrote on Thu, Dec 25, 2025 at 10:02:
> >>
> >> > Hi Lei,
> >> >
> >> >
> >> > Thank you for the feedback!
> >> > The "Archiving Directory Structure" section describes a change in how
> >> > archived
> >> > files are organized under jobmanager.archive.fs.dir. While this change
> >> was
> >> > originally proposed in FLIP-549, it's indeed a significant
> >> > application-level update,
> >> > so I'm glad to have the chance to clarify it here.
> >> >
> >> >
> >> > To answer your question directly: backward compatibility is fully
> >> > preserved.
> >> >
> >> >
> >> > In earlier Flink versions, job archives were written directly under
> the
> >> > configured
> >> > jobmanager.archive.fs.dir. With this update, Flink will instead use a
> >> > hierarchical
> >> > cluster-application-job structure.
> >> > We understand that many users already have archives stored in the
> legacy
> >> > flat
> >> > layout. To ensure a smooth transition, the History Server will be
> >> > updated to read archives from both the old and new directory structures.

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-08 Thread Yi Zhang
Hi Shengkai,


Thanks for raising these questions. I’ll address each of them below:


1. Behavior when an exception occurs during the recovery stage


If I understand correctly, the "recovery stage" here refers to the phase 
where the application is re-executed after JM failover. If an exception 
occurs during the re-execution, the application will transition to a failed 
state. Any jobs that are running as part of this application will be 
canceled and cleaned up.

2.  Expectations for JM high availability


In Application Mode, JM high availability guarantees that the entire
application is re-executed upon JM failover. Compared to running without
HA, this means developers should ensure that user logic outside of
Flink’s job execution (i.e., outside `env.execute()`) is idempotent.
That said, Flink guarantees that the execute() call itself is safe to
re-execute: repeated submissions due to JM failover will not result in
duplicate jobs. However, other user logic is not automatically protected
and must be made resilient to multiple invocations by the user (see the
sketch after point 3 below).
While this differs from the per-job recovery mode—where only the job is
recovered and no other user logic is re-run—it can preserve the integrity
of user logic across failovers.

3. Fine-grained control over re-execution

This is a great point. Providing utilities to allow users to skip certain parts
of their code during recovery would enable more precise control and
reduce side effects. This would likely require tracking execution progress
at the user-code level and exposing that context through a well-designed
interface, which can be a future enhancement.
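
To make the idempotency expectation in point 2 concrete, here is a
deliberately simplistic sketch (class name, marker path, and pipeline are
placeholders, and the marker-file guard is illustrative rather than
production-grade):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IdempotentMain {
        public static void main(String[] args) throws Exception {
            // User logic outside env.execute(): must tolerate main() being
            // re-executed after a JM failover, so guard the side effect.
            Path marker = Path.of("/tmp/my-app-bootstrap-done"); // placeholder
            if (Files.notExists(marker)) {
                Files.createFile(marker); // e.g. create external resources once
            }

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromData(1, 2, 3).print();
            // Safe to re-run: repeated submissions caused by JM failover do
            // not create duplicate jobs, per the guarantee in point 2.
            env.execute("my-app-job");
        }
    }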


Thanks again for the feedback!

Best,
Yi




At 2026-01-07 10:08:32, "Shengkai Fang"  wrote:
>Hi Yi,
>
>+1 for the proposal.
>
>1. What's the behaviour if an exception happens during the recovery stage?
>Will the running job be canceled?
>
>2. Can you describe what users should expect from JM high availability?
>Compared with the per-job deployment mode, my understanding is that HA in
>application mode cannot guarantee that the job will definitely run
>successfully. Compared to running without HA enabled, what should
>application developers be aware of?
>
>3. In the future, it would be better if we could provide some utils to give
>more fine-grained, code-level control to skip re-execution if the JM fails
>over.
>
>Best,
>Shengkai
>
>Yi Zhang wrote on Fri, Dec 26, 2025 at 18:00:
>
>> Hi Gyula,
>>
>>
>> Thank you so much for your thoughtful and insightful feedback!
>>
>>
>> 1.  I fully agree that using the job name for job matching is more
>> user-friendly and
>> cleaner than relying on a jobIndex parameter. I’ll update the FLIP
>> accordingly
>> to reflect this design change.
>>
>>
>> 2. I’d like to dig a bit deeper to make sure I fully understand the
>> requirement.
>> You have mentioned the need for a generic information endpoint that
>> remains
>> accessible even after failure, and that it should include additional info
>> such as
>> the checkpoint restore path and configuration.
>>
>> From my current understanding, Flink’s existing archive mechanism—combined
>> with the HistoryServer—already provides persistent access to job-related
>> information after failure. Specifically, the existing HistoryServer
>> endpoint
>> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
>> configuration
>> including checkpoint restore paths, and remains accessible after failure.
>> On the other hand, the proposed /applications/:appid/exceptions endpoint
>> is
>> intended specifically to surface application-level exceptions that occur
>> outside
>> the job lifecycle, which will also be available through the HistoryServer
>> after
>> failure.
>>
>> So could you help clarify whether there is a specific failure scenario or
>> use case
>> where the current archiving/HistoryServer mechanism falls short or where
>> critical
>> debugging information—like the restore path or configuration—is not
>> retrievable
>> after a failure?
>>
>>
>> Thanks again for your excellent suggestions!
>>
>> Best,
>> Yi
>>
>> At 2025-12-25 21:08:49, "Gyula Fóra"  wrote:
>> >Hi!
>> >
>> >Overall I think the design/improvements look great. Some minor comments,
>> >improvement possibilities:
>> >
>> >1. Could we simply use the job name for job matching? I think it's fair to
>> >require unique job names (or if they are not unique attach a sequence
>> >number to the name) instead of the jobIndex parameter. JobIndex sounds a
>> >bit weird and low level.
>> >
>> >2. A big problem/limitation of the existing submission logic is that the
>> >submit-on-error logic is very limited (only handling certain types of
>> >errors and only showing exception info). We should capture different
>> errors
>> >and metadata for failed applications including checkpoint settings (for
>> >instance what checkpoint path was used during restore, which is a common
>> >cause of the errors). So instead of introducing a
>> >/applications/appid/exceptions endpoint, can we instead introduce a more
>> >generic information endpoint that would contain other information?

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2026-01-06 Thread Shengkai Fang
Hi Yi,

+1 for the proposal.

1. What's the behaviour if an exception happens during the recovery stage?
Will the running job be canceled?

2. Can you describe what users should expect from JM high availability?
Compared with the per-job deployment mode, my understanding is that HA in
application mode cannot guarantee that the job will definitely run
successfully. Compared to running without HA enabled, what should
application developers be aware of?

3. In the future, it would be better if we could provide some utils to give
more fine-grained, code-level control to skip re-execution if the JM fails over.

Best,
Shengkai

Yi Zhang wrote on Fri, Dec 26, 2025 at 18:00:

> Hi Gyula,
>
>
> Thank you so much for your thoughtful and insightful feedback!
>
>
> 1.  I fully agree that using the job name for job matching is more
> user-friendly and
> cleaner than relying on a jobIndex parameter. I’ll update the FLIP
> accordingly
> to reflect this design change.
>
>
> 2. I’d like to dig a bit deeper to make sure I fully understand the
> requirement.
> You have mentioned the need for a generic information endpoint that
> remains
> accessible even after failure, and that it should include additional info
> such as
> the checkpoint restore path and configuration.
>
> From my current understanding, Flink’s existing archive mechanism—combined
> with the HistoryServer—already provides persistent access to job-related
> information after failure. Specifically, the existing HistoryServer
> endpoint
> `/jobs/:jobid/jobmanager/config` seems capable of exposing the
> configuration
> including checkpoint restore paths, and remains accessible after failure.
> On the other hand, the proposed /applications/:appid/exceptions endpoint
> is
> intended specifically to surface application-level exceptions that occur
> outside
> the job lifecycle, which will also be available through the HistoryServer
> after
> failure.
>
> So could you help clarify whether there is a specific failure scenario or
> use case
> where the current archiving/HistoryServer mechanism falls short or where
> critical
> debugging information—like the restore path or configuration—is not
> retrievable
> after a failure?
>
>
> Thanks again for your excellent suggestions!
>
> Best,
> Yi
>
> At 2025-12-25 21:08:49, "Gyula Fóra"  wrote:
> >Hi!
> >
> >Overall I think the design/improvements look great. Some minor comments,
> >improvement possibilities:
> >
> >1. Could we simply use the job name for job matching? I think it's fair to
> >require unique job names (or if they are not unique attach a sequence
> >number to the name) instead of the jobIndex parameter. JobIndex sounds a
> >bit weird and low level.
> >
> >2. A big problem/limitation of the existing submission logic is that the
> >submit-on-error logic is very limited (only handling certain types of
> >errors and only showing exception info). We should capture different
> errors
> >and metadata for failed applications including checkpoint settings (for
> >instance what checkpoint path was used during restore, which is a common
> >cause of the errors). So instead of introducing a
> >/applications/appid/exceptions endpoint, can we instead introduce a more
> >generic information endpoint that would contain other information? This
> >endpoint should be accessible even in case of failures and populated from
> >the app result store and should also contain some other info such as
> >checkpoint restore path, configuration etc.
> >
> >Capturing more information on failed submissions would help resolve a lot
> >of long outstanding issues in the Flink Kubernetes Operator as well.
> >
> >Cheers
> >Gyula
> >
> >
> >On Thu, Dec 25, 2025 at 1:54 PM Lei Yang  wrote:
> >
> >> Thank you Yi for your reply, looks good to me!
> >> +1 for this proposal
> >> Best,
> >> Lei
> >>
> >> Yi Zhang wrote on Thu, Dec 25, 2025 at 10:02:
> >>
> >> > Hi Lei,
> >> >
> >> >
> >> > Thank you for the feedback!
> >> > The "Archiving Directory Structure" section describes a change in how
> >> > archived
> >> > files are organized under jobmanager.archive.fs.dir. While this change
> >> was
> >> > originally proposed in FLIP-549, it's indeed a significant
> >> > application-level update,
> >> > so I'm glad to have the chance to clarify it here.
> >> >
> >> >
> >> > To answer your question directly: backward compatibility is fully
> >> > preserved.
> >> >
> >> >
> >> > In earlier Flink versions, job archives were written directly under
> the
> >> > configured
> >> > jobmanager.archive.fs.dir. With this update, Flink will instead use a
> >> > hierarchical
> >> > cluster-application-job structure.
> >> > We understand that many users already have archives stored in the
> legacy
> >> > flat
> >> > layout. To ensure a smooth transition, the History Server will be
> updated
> >> > to read
> >> > archives from both the old and new directory structures. As a result,
> all
> >> > previously archived jobs will remain accessible and visible.
> >> >
> >> >
> >> > If you have additional questions or specific edge cases in mind, I’d
> >> > be happy to discuss them further!

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2025-12-26 Thread Yi Zhang
Hi Gyula,


Thank you so much for your thoughtful and insightful feedback!


1.  I fully agree that using the job name for job matching is more 
user-friendly and
cleaner than relying on a jobIndex parameter. I’ll update the FLIP accordingly
to reflect this design change.


2. I’d like to dig a bit deeper to make sure I fully understand the requirement.
You have mentioned the need for a generic information endpoint that remains 
accessible even after failure, and that it should include additional info such 
as 
the checkpoint restore path and configuration.

From my current understanding, Flink’s existing archive mechanism—combined 
with the HistoryServer—already provides persistent access to job-related 
information after failure. Specifically, the existing HistoryServer endpoint
`/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration 
including checkpoint restore paths, and remains accessible after failure.
On the other hand, the proposed /applications/:appid/exceptions endpoint is 
intended specifically to surface application-level exceptions that occur 
outside 
the job lifecycle, which will also be available through the HistoryServer after 
failure.
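
(For reference, that existing endpoint returns the configuration as a
flat key/value list, roughly of the following shape; the values here are
placeholders.)

    [
      {"key": "execution.state-recovery.path", "value": "s3://bucket/savepoints/savepoint-123"},
      {"key": "...", "value": "..."}
    ]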

So could you help clarify whether there is a specific failure scenario or use 
case 
where the current archiving/HistoryServer mechanism falls short or where 
critical
debugging information—like the restore path or configuration—is not retrievable
after a failure?


Thanks again for your excellent suggestions!

Best,
Yi

At 2025-12-25 21:08:49, "Gyula Fóra"  wrote:
>Hi!
>
>Overall I think the design/improvements look great. Some minor comments,
>improvement possibilities:
>
>1. Could we simply use the job name for job matching? I think it's fair to
>require unique job names (or if they are not unique attach a sequence
>number to the name) instead of the jobIndex parameter. JobIndex sounds a
>bit weird and low level.
>
>2. A big problem/limitation of the existing submission logic is that the
>submit-on-error logic is very limited (only handling certain types of
>errors and only showing exception info). We should capture different errors
>and metadata for failed applications including checkpoint settings (for
>instance what checkpoint path was used during restore, which is a common
>cause of the errors). So instead of introducing a
>/applications/appid/exceptions endpoint, can we instead introduce a more
>generic information endpoint that would contain other information? This
>endpoint should be accessible even in case of failures and populated from
>the app result store and should also contain some other info such as
>checkpoint restore path, configuration etc.
>
>Capturing more information on failed submissions would help resolve a lot
>of long outstanding issues in the Flink Kubernetes Operator as well.
>
>Cheers
>Gyula
>
>
>On Thu, Dec 25, 2025 at 1:54 PM Lei Yang  wrote:
>
>> Thank you Yi for your reply, looks good to me!
>> +1 for this proposal
>> Best,
>> Lei
>>
>> Yi Zhang wrote on Thu, Dec 25, 2025 at 10:02:
>>
>> > Hi Lei,
>> >
>> >
>> > Thank you for the feedback!
>> > The "Archiving Directory Structure" section describes a change in how
>> > archived
>> > files are organized under jobmanager.archive.fs.dir. While this change
>> was
>> > originally proposed in FLIP-549, it's indeed a significant
>> > application-level update,
>> > so I'm glad to have the chance to clarify it here.
>> >
>> >
>> > To answer your question directly: backward compatibility is fully
>> > preserved.
>> >
>> >
>> > In earlier Flink versions, job archives were written directly under the
>> > configured
>> > jobmanager.archive.fs.dir. With this update, Flink will instead use a
>> > hierarchical
>> > cluster-application-job structure.
>> > We understand that many users already have archives stored in the legacy
>> > flat
>> > layout. To ensure a smooth transition, the History Server will be updated
>> > to read
>> > archives from both the old and new directory structures. As a result, all
>> > previously archived jobs will remain accessible and visible.
>> >
>> >
>> > If you have additional questions or specific edge cases in mind, I’d be
>> > happy to
>> > discuss them further!
>> >
>> >
>> > Best,
>> > Yi
>> >
>> >
>> >
>> > At 2025-12-24 11:35:00, "Lei Yang"  wrote:
>> > >Hi Yi,
>> > >
>> > >Thank you for creating this FLIP! The introduction of the Application
>> > >entity significantly enhances the observability and manageability of
>> > >user logic, especially benefiting batch workloads. This is truly
>> > >excellent work!
>> > >
>> > >However, I have a compatibility concern and would appreciate your
>> > >clarification. In the “Archiving Directory Structure” section, I noticed
>> > >that the directory structure has been changed. If users have configured
>> > >a persistent external path for jobmanager.archive.fs.dir, will their
>> > >existing archives become unreadable after this change? Will the
> > >implementation of this FLIP maintain backward compatibility with
> > >previously archived job data?

Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2025-12-25 Thread Gyula Fóra
Hi!

Overall I think the design/improvements look great. Some minor comments,
improvement possibilities:

1. Could we simply use the job name for job matching? I think it's fair to
require unique job names (or if they are not unique attach a sequence
number to the name) instead of the jobIndex parameter. JobIndex sounds a
bit weird and low level.

2. A big problem/limitation of the existing submission logic is that the
submit-on-error logic is very limited (only handling certain types of
errors and only showing exception info). We should capture different errors
and metadata for failed applications including checkpoint settings (for
instance what checkpoint path was used during restore, which is a common
cause of the errors). So instead of introducing a
/applications/appid/exceptions endpoint, can we instead introduce a more
generic information endpoint that would contain other information? This
endpoint should be accessible even in case of failures and populated from
the app result store and should also contain some other info such as
checkpoint restore path, configuration etc.

Capturing more information on failed submissions would help resolve a lot
of long outstanding issues in the Flink Kubernetes Operator as well.

Cheers
Gyula


On Thu, Dec 25, 2025 at 1:54 PM Lei Yang  wrote:

> Thank you Yi for your reply, looks good to me!
> +1 for this proposal
> Best,
> Lei
>
> Yi Zhang wrote on Thu, Dec 25, 2025 at 10:02:
>
> > Hi Lei,
> >
> >
> > Thank you for the feedback!
> > The "Archiving Directory Structure" section describes a change in how
> > archived
> > files are organized under jobmanager.archive.fs.dir. While this change
> was
> > originally proposed in FLIP-549, it's indeed a significant
> > application-level update,
> > so I'm glad to have the chance to clarify it here.
> >
> >
> > To answer your question directly: backward compatibility is fully
> > preserved.
> >
> >
> > In earlier Flink versions, job archives were written directly under the
> > configured
> > jobmanager.archive.fs.dir. With this update, Flink will instead use a
> > hierarchical
> > cluster-application-job structure.
> > We understand that many users already have archives stored in the legacy
> > flat
> > layout. To ensure a smooth transition, the History Server will be updated
> > to read
> > archives from both the old and new directory structures. As a result, all
> > previously archived jobs will remain accessible and visible.
> >
> >
> > If you have additional questions or specific edge cases in mind, I’d be
> > happy to
> > discuss them further!
> >
> >
> > Best,
> > Yi
> >
> >
> >
> > At 2025-12-24 11:35:00, "Lei Yang"  wrote:
> > >Hi Yi,
> > >
> > >Thank you for creating this FLIP! The introduction of the Application
> > >entity significantly enhances the observability and manageability of
> > >user logic, especially benefiting batch workloads. This is truly
> > >excellent work!
> > >
> > >However, I have a compatibility concern and would appreciate your
> > >clarification. In the “Archiving Directory Structure” section, I noticed
> > >that the directory structure has been changed. If users have configured
> > >a persistent external path for jobmanager.archive.fs.dir, will their
> > >existing archives become unreadable after this change? Will the
> > >implementation of this FLIP maintain backward compatibility with
> > >previously archived job data?
> > >
> > >Best regards,
> > >Lei
> > >
> > >Yi Zhang wrote on Wed, Dec 17, 2025 at 14:18:
> > >
> > >> Hi everyone,
> > >>
> > >> I would like to start a discussion about FLIP-560: Application
> > Capability
> > >> Enhancement [1].
> > >>
> > >> The primary goal of this FLIP is to improve the usability and
> > availability
> > >> of Flink applications
> > >>
> > >>  by introducing the following enhancements:
> > >>
> > >>
> > >>
> > >> 1. Support multi-job execution in Application Mode, which is an
> > >> important batch-processing use case.
> > >> 2. Support re-running the user's main method after JobManager restarts
> > >> due to failures in Session Mode.
> > >> 3. Expose exceptions thrown in the user's main method via REST/UI.
> > >>
> > >>
> > >>
> > >> Looking forward to your feedback and suggestions!
> > >>
> > >>
> > >>
> > >> [1]
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
> > >>
> > >>
> > >>
> > >> Best Regards,
> > >>
> > >> Yi Zhang
> >
>


Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2025-12-25 Thread Lei Yang
Thank you Yi for your reply, looks good to me!
+1 for this proposal
Best,
Lei

Yi Zhang wrote on Thu, Dec 25, 2025 at 10:02:

> Hi Lei,
>
>
> Thank you for the feedback!
> The "Archiving Directory Structure" section describes a change in how
> archived
> files are organized under jobmanager.archive.fs.dir. While this change was
> originally proposed in FLIP-549, it's indeed a significant
> application-level update,
> so I'm glad to have the chance to clarify it here.
>
>
> To answer your question directly: backward compatibility is fully
> preserved.
>
>
> In earlier Flink versions, job archives were written directly under the
> configured
> jobmanager.archive.fs.dir. With this update, Flink will instead use a
> hierarchical
> cluster-application-job structure.
> We understand that many users already have archives stored in the legacy
> flat
> layout. To ensure a smooth transition, the History Server will be updated
> to read
> archives from both the old and new directory structures. As a result, all
> previously archived jobs will remain accessible and visible.
>
>
> If you have additional questions or specific edge cases in mind, I’d be
> happy to
> discuss them further!
>
>
> Best,
> Yi
>
>
>
> At 2025-12-24 11:35:00, "Lei Yang"  wrote:
> >Hi Yi,
> >
> >Thank you for creating this FLIP! The introduction of the Application
> >entity significantly enhances the observability and manageability of
> >user logic, especially benefiting batch workloads. This is truly
> >excellent work!
> >
> >However, I have a compatibility concern and would appreciate your
> >clarification. In the “Archiving Directory Structure” section, I noticed
> >that the directory structure has been changed. If users have configured
> >a persistent external path for jobmanager.archive.fs.dir, will their
> >existing archives become unreadable after this change? Will the
> >implementation of this FLIP maintain backward compatibility with
> >previously archived job data?
> >
> >Best regards,
> >Lei
> >
>Yi Zhang wrote on Wed, Dec 17, 2025 at 14:18:
> >
> >> Hi everyone,
> >>
> >> I would like to start a discussion about FLIP-560: Application
> Capability
> >> Enhancement [1].
> >>
> >> The primary goal of this FLIP is to improve the usability and
> availability
> >> of Flink applications
> >>
> >>  by introducing the following enhancements:
> >>
> >>
> >>
> >> 1. Support multi-job execution in Application Mode, which is an
> >> important batch-processing use case.
> >> 2. Support re-running the user's main method after JobManager restarts
> >> due to failures in Session Mode.
> >> 3. Expose exceptions thrown in the user's main method via REST/UI.
> >>
> >>
> >>
> >> Looking forward to your feedback and suggestions!
> >>
> >>
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
> >>
> >>
> >>
> >> Best Regards,
> >>
> >> Yi Zhang
>


Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2025-12-24 Thread Yi Zhang
Hi Lei,


Thank you for the feedback!
The "Archiving Directory Structure" section describes a change in how archived 
files are organized under jobmanager.archive.fs.dir. While this change was
originally proposed in FLIP-549, it's indeed a significant application-level 
update,
so I'm glad to have the chance to clarify it here.


To answer your question directly: backward compatibility is fully preserved.


In earlier Flink versions, job archives were written directly under the 
configured
jobmanager.archive.fs.dir. With this update, Flink will instead use a 
hierarchical
cluster-application-job structure.
We understand that many users already have archives stored in the legacy flat 
layout. To ensure a smooth transition, the History Server will be updated to 
read
archives from both the old and new directory structures. As a result, all
previously archived jobs will remain accessible and visible.
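
(Roughly, the two layouts compare as follows; the exact path segments are
defined in FLIP-549, so the names below are only illustrative.)

    # legacy (flat) layout
    <jobmanager.archive.fs.dir>/<jobId>

    # new hierarchical layout
    <jobmanager.archive.fs.dir>/<clusterId>/<applicationId>/<jobId>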


If you have additional questions or specific edge cases in mind, I’d be happy to
discuss them further!


Best,
Yi



At 2025-12-24 11:35:00, "Lei Yang"  wrote:
>Hi Yi,
>
>Thank you for creating this FLIP! The introduction of the Application
>entity significantly enhances the observability and manageability of
>user logic, especially benefiting batch workloads. This is truly
>excellent work!
>
>However, I have a compatibility concern and would appreciate your
>clarification. In the “Archiving Directory Structure” section, I noticed
>that the directory structure has been changed. If users have configured
>a persistent external path for jobmanager.archive.fs.dir, will their
>existing archives become unreadable after this change? Will the
>implementation of this FLIP maintain backward compatibility with
>previously archived job data?
>
>Best regards,
>Lei
>
>Yi Zhang wrote on Wed, Dec 17, 2025 at 14:18:
>
>> Hi everyone,
>>
>> I would like to start a discussion about FLIP-560: Application Capability
>> Enhancement [1].
>>
>> The primary goal of this FLIP is to improve the usability and availability
>> of Flink applications
>>
>>  by introducing the following enhancements:
>>
>>
>>
>> 1. Support multi-job execution in Application Mode, which is an important
>> batch-processing use case.
>> 2. Support re-running the user's main method after JobManager restarts due
>> to failures in Session Mode.
>> 3. Expose exceptions thrown in the user's main method via REST/UI.
>>
>>
>>
>> Looking forward to your feedback and suggestions!
>>
>>
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
>>
>>
>>
>> Best Regards,
>>
>> Yi Zhang


Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2025-12-23 Thread Lei Yang
Hi Yi,

Thank you for creating this FLIP! The introduction of the Application
entity significantly enhances the observability and manageability of
user logic, especially benefiting batch workloads. This is truly
excellent work!

However, I have a compatibility concern and would appreciate your
clarification. In the “Archiving Directory Structure” section, I noticed
that the directory structure has been changed. If users have configured
a persistent external path for jobmanager.archive.fs.dir, will their
existing archives become unreadable after this change? Will the
implementation of this FLIP maintain backward compatibility with
previously archived job data?

Best regards,
Lei

Yi Zhang wrote on Wed, Dec 17, 2025 at 14:18:

> Hi everyone,
>
> I would like to start a discussion about FLIP-560: Application Capability
> Enhancement [1].
>
> The primary goal of this FLIP is to improve the usability and availability
> of Flink applications
>
>  by introducing the following enhancements:
>
>
>
> 1. Support multi-job execution in Application Mode, which is an important
> batch-processing use case.
> 2. Support re-running the user's main method after JobManager restarts due
> to failures in Session Mode.
> 3. Expose exceptions thrown in the user's main method via REST/UI.
>
>
>
> Looking forward to your feedback and suggestions!
>
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
>
>
>
> Best Regards,
>
> Yi Zhang


Re: [DISCUSS] FLIP-560: Application Capability Enhancement

2025-12-17 Thread Zhu Zhu
Hi Yi,
Thanks for creating this FLIP.

Supporting the execution of multiple jobs within a single application can
be highly
beneficial for batch processing. It enables more flexible and complex
workflows,
allowing better resource sharing, coordinated job management, and
simplified
deployment.
+1 for this proposal

Thanks,
Zhu

Yi Zhang wrote on Wed, Dec 17, 2025 at 14:18:

> Hi everyone,
>
> I would like to start a discussion about FLIP-560: Application Capability
> Enhancement [1].
>
> The primary goal of this FLIP is to improve the usability and availability
> of Flink applications
>
>  by introducing the following enhancements:
>
>
>
> 1. Support multi-job execution in Application Mode, which is an important
> batch-processing use case.
> 2. Support re-running the user's main method after JobManager restarts due
> to failures in Session Mode.
> 3. Expose exceptions thrown in the user's main method via REST/UI.
>
>
>
> Looking forward to your feedback and suggestions!
>
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement
>
>
>
> Best Regards,
>
> Yi Zhang