Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi All,

This discussion has been ongoing for some time now and has received excellent suggestions and positive feedback—thanks again! If you have any additional questions, concerns, or ideas, please feel free to share them. If there is no further input, I will move forward with starting the vote on this FLIP tomorrow.

Best Regards,
Yi
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi!

I think a separate endpoint with the effective config makes perfect sense and will cover the requirements.

Thank you for including it :)

Gyula
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi!

Thanks so much for your feedback and suggestions. I have updated the FLIP accordingly and truly appreciate your input!

Best,
Yi
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Gyula,

I have given this some more thought, and I agree with your point: users (and external controllers like the Kubernetes Operator) need access to critical context such as the state recovery path, even in cases where no job has been successfully submitted.

What if we introduce a new REST API endpoint, such as /applications/:applicationid/jobmanager/config, to expose the effective JobManager configuration used (or intended to be used) by the application? This could include key settings like the state recovery path and other relevant configured options.

It might help make the API responsibilities clearer, and also provide valuable visibility even when no errors occur. For actual error details, the /applications/:applicationid/exceptions endpoint can be used.

I’d appreciate your thoughts on this approach. Thanks!

Best,
Yi
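To make the proposed endpoint concrete, here is a rough sketch (not part of the FLIP) of how an external controller such as the Kubernetes Operator might consume it. The response shape is an assumption borrowed from the existing per-job `/jobs/:jobid/jobmanager/config` endpoint (a list of key/value pairs); the sample values and helper names are invented for illustration.

```python
import json

# Hypothetical sample payload from the proposed
# /applications/:applicationid/jobmanager/config endpoint; the real schema
# is not yet specified in the FLIP.
sample_response = json.dumps([
    {"key": "execution.state-recovery.path", "value": "s3://bucket/savepoints/sp-1"},
    {"key": "state.checkpoints.dir", "value": "s3://bucket/checkpoints"},
])

def effective_config(payload):
    """Flatten the assumed key/value list into a plain dict."""
    return {entry["key"]: entry["value"] for entry in json.loads(payload)}

def recovery_path(config):
    """Return the state recovery path an external controller would act on."""
    return config.get("execution.state-recovery.path")

config = effective_config(sample_response)
print(recovery_path(config))  # the path the application would restore from
```

The point of the sketch is that the controller can recover the restore path from the effective config alone, without any job entity existing.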
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi!

Overall it makes sense; I cannot reproduce a job actually being submitted/archived properly with an invalid savepoint path (I am testing on 2.1, as no 2.2/2.3 Docker image is available at the moment). In Kubernetes this seems to cause an immediate JobManager shutdown.

You are right, I am mostly concerned about all the error scenarios that currently do not result in FAILED job submissions.

Regarding what error metadata to expose, I think what you write makes sense, with the only specific exception that checkpoint/state recovery information (what checkpoint we are restoring from at the moment) should always be included in the error/job/app metadata. This is crucial for the Kubernetes Operator (and possibly other external control planes) to handle the error. Currently, since this information is sometimes lost, it leads to many corner cases requiring manual intervention from users.

Cheers
Gyula
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Gyula,

Thank you very much for your explanation.

"Some errors such as invalid state path are not even submitted or when it is Flink uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't actually include information about the checkpoint restore path/configs etc."

I ran some tests on the latest Flink 2.3 release by submitting a job with an invalid `execution.state-recovery.path`. The job submission itself succeeded, but the job failed during initialization. It seems that, at least for misconfigured state recovery paths, the job still goes through submission and gets archived with sufficient diagnostic info. Am I missing anything here? If there’s a specific Jira issue describing such a scenario, it would be great to reference it for a more concrete discussion around these requirements.

That said, I agree that there are different error scenarios we might encounter, which broadly fall into three categories:

1. Failures after successful job submission, which result in a FAILED job state. In these cases, relevant diagnostics are already accessible via existing job-related REST endpoints.
2. Failures during job submission, leaving no concrete job entity to query.
3. Failures in the main() method unrelated to job submission/execution.

The originally proposed /applications/:applicationid/exceptions endpoint is intended to expose exceptions from all three categories. From my understanding, your primary interest lies in scenario #2, where additional context could help diagnose why submission failed, even though no real job was created.

Rather than introducing a general endpoint that exposes all possible configuration and metadata, would it be more practical to conditionally enrich exceptions? For example, when a submission fails due to invalid state paths or misconfigured options, we could attach the relevant configuration settings. This approach would complement the /applications/:applicationid/exceptions design and allow us to incrementally evolve toward richer diagnostics over time. Having a concrete use case would greatly help align on the scope and implementation details of such enrichment.

Thanks again for your valuable feedback and suggestions!

Best,
Yi

P.S. I’ve updated the FLIP to reflect the change to using the job name for job matching. Please let me know if you have any further questions or suggestions.
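The three failure categories and the "conditional enrichment" idea can be sketched as follows. This is an illustration only: the enum names, the choice of config keys to attach, and the enrichment mechanism are all assumptions, not anything specified in the FLIP.

```python
from enum import Enum

class FailurePhase(Enum):
    """The three broad categories from the discussion (illustrative only)."""
    AFTER_SUBMISSION = 1   # FAILED job state; existing job REST endpoints apply
    DURING_SUBMISSION = 2  # no concrete job entity to query
    MAIN_METHOD = 3        # user main() failure unrelated to submission

# Hypothetical mapping from failure phase to the config keys worth attaching.
RELEVANT_KEYS = {
    FailurePhase.DURING_SUBMISSION: ["execution.state-recovery.path"],
}

def enrich_exception(phase, message, effective_config):
    """Sketch of conditional enrichment: attach relevant settings to the
    entry that /applications/:applicationid/exceptions would expose."""
    entry = {"phase": phase.name, "message": message}
    extras = {k: effective_config[k]
              for k in RELEVANT_KEYS.get(phase, []) if k in effective_config}
    if extras:
        entry["config"] = extras
    return entry

entry = enrich_exception(
    FailurePhase.DURING_SUBMISSION,
    "Cannot find checkpoint or savepoint at the given path",
    {"execution.state-recovery.path": "s3://bucket/bad-path"},
)
```

The design choice sketched here is that enrichment only fires for phases where extra context actually helps (scenario #2), keeping the exceptions payload small for the other categories.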
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Yi! Sorry for the late reply, I somehow missed your response: "Flink’s existing archive mechanism—combined with the HistoryServer—already provides persistent access to job-related information after failure. Specifically, the existing HistoryServer endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration including checkpoint restore paths, and remains accessible after failure." You are right, when a job fails this is true we can see the past checkpoint history etc. But I think this doesn't apply for jobs that faile during submission or in the main method. Some errors such as invalid state path are not even submitted or when it is Flink uses ArchivedExecutionGraph.createSparseArchivedExecutionGraph that doesn't actually include information about the checkpoint restore path/configs etc. Cheers Gyula On Fri, Dec 26, 2025 at 11:00 AM Yi Zhang wrote: > Hi Gyula, > > > Thank you so much for your thoughtful and insightful feedback! > > > 1. I fully agree that using the job name for job matching is more > user-friendly and > cleaner than relying on a jobIndex parameter. I’ll update the FLIP > accordingly > to reflect this design change. > > > 2. I’d like to dig a bit deeper to make sure I fully understand the > requirement. > You have mentioned the need for a generic information endpoint that > remains > accessible even after failure, and that it should include additional info > such as > the checkpoint restore path and configuration. > > From my current understanding, Flink’s existing archive mechanism—combined > with the HistoryServer—already provides persistent access to job-related > information after failure. Specifically, the existing HistoryServer > endpoint > `/jobs/:jobid/jobmanager/config` seems capable of exposing the > configuration > including checkpoint restore paths, and remains accessible after failure. 
> On the other hand, the proposed /applications/:appid/exceptions endpoint > is > intended specifically to surface application-level exceptions that occur > outside > the job lifecycle, which will also be available through the HistoryServer > after > failure. > > So could you help clarify whether there is a specific failure scenario or > use case > where the current archiving/HistoryServer mechanism falls short or where > critical > debugging information—like the restore path or configuration—is not > retrievable > after a failure? > > > Thanks again for your excellent suggestions! > > Best, > Yi > > At 2025-12-25 21:08:49, "Gyula Fóra" wrote: > >Hi! > > > >Overall I think the design/improvements look great. Some minor comments, > >improvement possibilities: > > > >1. Could we simply use the job name for job matching? I think it's fair to > >require unique job names (or if they are not unique attach a sequence > >number to the name) instead of the jobIndex parameter. JobIndex sounds a > >bit weird and low level. > > > >2.A big problem/limitation of the existing submission logic is that the > >submit-on-error logic is very limited (only handling certain types of > >errors and only showing exception info). We should capture different > errors > >and metadata for failed applications including checkpoint settings (for > >instance what checkpoint path was used during restore, which is a common > >cause of the errors). So instead of introducing a > >/applications/appid/exceptions endpoint, can we instead introduce a more > >generic information endpoint that would contain other information? This > >endpoint should be accessible even in cause of failures and populated from > >the app result store and should also contain some other info such as > >checkpoint restore path, configuration etc. > > > >Capturing more information on failed submissions would help resolve a lot > >of long outstanding issues in the Flink Kubernetes Operator as well. 
> > > >Cheers > >Gyula > > > > > >On Thu, Dec 25, 2025 at 1:54 PM Lei Yang wrote: > > > >> Thank you Yi for your reply, looks good to me! > >> +1 for this proposal > >> Best, > >> Lei > >> > >> Yi Zhang 于2025年12月25日周四 10:02写道: > >> > >> > Hi Lei, > >> > > >> > > >> > Thank you for the feedback! > >> > The "Archiving Directory Structure" section describes a change in how > >> > archived > >> > files are organized under jobmanager.archive.fs.dir. While this change > >> was > >> > originally proposed in FLIP-549, it's indeed a significant > >> > application-level update, > >> > so I'm glad to have the chance to clarify it here. > >> > > >> > > >> > To answer your question directly: backward compatibility is fully > >> > preserved. > >> > > >> > > >> > In earlier Flink versions, job archives were written directly under > the > >> > configured > >> > jobmanager.archive.fs.dir. With this update, Flink will instead use a > >> > hierarchical > >> > cluster-application-job structure. > >> > We understand that many users already have archives stored in the > legacy > >> > flat > >> > layout. To ensure a smooth transition, the History Server will be updated > >> > to read > >> > archives from both the old and new directory structures.
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Shengkai, Thanks for raising these questions. I’ll address each of them below: 1. Behavior when an exception occurs during the recovery If I understand correctly, the "recovery stage" here refers to the phase where the application is re-executed after JM failover. If an exception occurs during the re-execution, the application will transition to a failed state. Any jobs that are running as part of this application will be canceled and cleaned up. 2. Expectations for JM high availability In Application Mode, JM high availability guarantees that the entire application is re-executed upon JM failover. Compared to running without HA, this means developers should ensure that user logic outside of Flink’s job execution (i.e., `env.execute()`) is idempotent. That said, Flink guarantees that the execute() call itself is safe to re-execute: repeated submissions due to JM failover will not result in duplicate jobs. However, other user logic is not automatically protected and must be made resilient to multiple invocations by the user. While this differs from the per-job recovery mode—where only the job is recovered and no other user logic is re-run—it can preserve the integrity of user logic across failovers. 3. Fine-grained control over re-execution This is a great point. Providing utilities to allow users to skip certain parts of their code during recovery would enable more precise control and reduce side effects. This would likely require tracking execution progress at the user-code level and exposing that context through a well-designed interface, which can be a future enhancement. Thanks again for the feedback! Best, Yi At 2026-01-07 10:08:32, "Shengkai Fang" wrote: >Hi Yi, > >+1 for the proposal. > >1. What's the behaviour if an exception happens during the recovery stage? >Will the running job be canceled? > >2. Can you describe what users should expect from JM high availability? 
>Compared with the per-job deployment mode, my understanding is that HA in >application mode cannot guarantee that the job will definitely run >successfully. Compared to running without HA enabled, what should >application developers be aware of? > >3. In the future, it would be better if we could provide some utils to give more >fine-grained code-level control to skip re-execution if the JM fails over. > >Best, >Shengkai > >Yi Zhang 于2025年12月26日周五 18:00写道: > >> Hi Gyula, >> >> >> Thank you so much for your thoughtful and insightful feedback! >> >> >> 1. I fully agree that using the job name for job matching is more >> user-friendly and >> cleaner than relying on a jobIndex parameter. I’ll update the FLIP >> accordingly >> to reflect this design change. >> >> >> 2. I’d like to dig a bit deeper to make sure I fully understand the >> requirement. >> You have mentioned the need for a generic information endpoint that >> remains >> accessible even after failure, and that it should include additional info >> such as >> the checkpoint restore path and configuration. >> >> From my current understanding, Flink’s existing archive mechanism—combined >> with the HistoryServer—already provides persistent access to job-related >> information after failure. Specifically, the existing HistoryServer >> endpoint >> `/jobs/:jobid/jobmanager/config` seems capable of exposing the >> configuration >> including checkpoint restore paths, and remains accessible after failure. >> On the other hand, the proposed /applications/:appid/exceptions endpoint >> is >> intended specifically to surface application-level exceptions that occur >> outside >> the job lifecycle, which will also be available through the HistoryServer >> after >> failure. 
>> >> So could you help clarify whether there is a specific failure scenario or >> use case >> where the current archiving/HistoryServer mechanism falls short or where >> critical >> debugging information—like the restore path or configuration—is not >> retrievable >> after a failure? >> >> >> Thanks again for your excellent suggestions! >> >> Best, >> Yi >> >> At 2025-12-25 21:08:49, "Gyula Fóra" wrote: >> >Hi! >> > >> >Overall I think the design/improvements look great. Some minor comments, >> >improvement possibilities: >> > >> >1. Could we simply use the job name for job matching? I think it's fair to >> >require unique job names (or if they are not unique attach a sequence >> >number to the name) instead of the jobIndex parameter. JobIndex sounds a >> >bit weird and low level. >> > >> >2. A big problem/limitation of the existing submission logic is that the >> >submit-on-error logic is very limited (only handling certain types of >> >errors and only showing exception info). We should capture different >> errors >> >and metadata for failed applications including checkpoint settings (for >> >instance what checkpoint path was used during restore, which is a common >> >cause of the errors). So instead of introducing a >> >/applications/appid/exceptions endpoint, can we instead introduce a more >> >generic information endpoint that would contain other information?
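The idempotency expectation described in the reply to Shengkai above (user logic outside `env.execute()` must tolerate re-execution when the JobManager fails over) could be sketched roughly as below. This is a hypothetical illustration, not Flink API: the class and method names are invented, and the in-memory marker set stands in for the durable store (a filesystem flag or database row) a real application would need, since a restarted JobManager process loses in-memory state.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: guard side-effecting user logic outside
 * env.execute() with a completion marker so that a re-executed main
 * method does not repeat already-finished work. The marker store is
 * simulated with an in-memory set; a real application would persist
 * markers durably so they survive a JobManager process restart.
 */
public class IdempotentStep {
    private final Set<String> completed = ConcurrentHashMap.newKeySet();

    /** Runs the action once per step id; later invocations are no-ops.
     *  Returns true if the action ran, false if it was skipped. */
    public boolean runOnce(String stepId, Runnable action) {
        if (!completed.add(stepId)) {
            return false; // already executed before the failover
        }
        action.run();
        return true;
    }
}
```

With this pattern, a main method re-run after failover would skip steps like "prepare output directory" whose markers already exist, while Flink itself guarantees the `execute()` calls do not produce duplicate jobs.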
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Yi, +1 for the proposal. 1. What's the behaviour if an exception happens during the recovery stage? Will the running job be canceled? 2. Can you describe what users should expect from JM high availability? Compared with the per-job deployment mode, my understanding is that HA in application mode cannot guarantee that the job will definitely run successfully. Compared to running without HA enabled, what should application developers be aware of? 3. In the future, it would be better if we could provide some utils to give more fine-grained code-level control to skip re-execution if the JM fails over. Best, Shengkai Yi Zhang 于2025年12月26日周五 18:00写道: > Hi Gyula, > > > Thank you so much for your thoughtful and insightful feedback! > > > 1. I fully agree that using the job name for job matching is more > user-friendly and > cleaner than relying on a jobIndex parameter. I’ll update the FLIP > accordingly > to reflect this design change. > > > 2. I’d like to dig a bit deeper to make sure I fully understand the > requirement. > You have mentioned the need for a generic information endpoint that > remains > accessible even after failure, and that it should include additional info > such as > the checkpoint restore path and configuration. > > From my current understanding, Flink’s existing archive mechanism—combined > with the HistoryServer—already provides persistent access to job-related > information after failure. Specifically, the existing HistoryServer > endpoint > `/jobs/:jobid/jobmanager/config` seems capable of exposing the > configuration > including checkpoint restore paths, and remains accessible after failure. > On the other hand, the proposed /applications/:appid/exceptions endpoint > is > intended specifically to surface application-level exceptions that occur > outside > the job lifecycle, which will also be available through the HistoryServer > after > failure. 
> > So could you help clarify whether there is a specific failure scenario or > use case > where the current archiving/HistoryServer mechanism falls short or where > critical > debugging information—like the restore path or configuration—is not > retrievable > after a failure? > > > Thanks again for your excellent suggestions! > > Best, > Yi > > At 2025-12-25 21:08:49, "Gyula Fóra" wrote: > >Hi! > > > >Overall I think the design/improvements look great. Some minor comments, > >improvement possibilities: > > > >1. Could we simply use the job name for job matching? I think it's fair to > >require unique job names (or if they are not unique attach a sequence > >number to the name) instead of the jobIndex parameter. JobIndex sounds a > >bit weird and low level. > > > >2. A big problem/limitation of the existing submission logic is that the > >submit-on-error logic is very limited (only handling certain types of > >errors and only showing exception info). We should capture different > errors > >and metadata for failed applications including checkpoint settings (for > >instance what checkpoint path was used during restore, which is a common > >cause of the errors). So instead of introducing a > >/applications/appid/exceptions endpoint, can we instead introduce a more > >generic information endpoint that would contain other information? This > >endpoint should be accessible even in case of failures and populated from > >the app result store and should also contain some other info such as > >checkpoint restore path, configuration etc. > > > >Capturing more information on failed submissions would help resolve a lot > >of long outstanding issues in the Flink Kubernetes Operator as well. > > > >Cheers > >Gyula > > > > > >On Thu, Dec 25, 2025 at 1:54 PM Lei Yang wrote: > > > >> Thank you Yi for your reply, looks good to me! 
> >> +1 for this proposal > >> Best, > >> Lei > >> > >> Yi Zhang 于2025年12月25日周四 10:02写道: > >> > >> > Hi Lei, > >> > > >> > > >> > Thank you for the feedback! > >> > The "Archiving Directory Structure" section describes a change in how > >> > archived > >> > files are organized under jobmanager.archive.fs.dir. While this change > >> was > >> > originally proposed in FLIP-549, it's indeed a significant > >> > application-level update, > >> > so I'm glad to have the chance to clarify it here. > >> > > >> > > >> > To answer your question directly: backward compatibility is fully > >> > preserved. > >> > > >> > > >> > In earlier Flink versions, job archives were written directly under > the > >> > configured > >> > jobmanager.archive.fs.dir. With this update, Flink will instead use a > >> > hierarchical > >> > cluster-application-job structure. > >> > We understand that many users already have archives stored in the > legacy > >> > flat > >> > layout. To ensure a smooth transition, the History Server will be > updated > >> > to read > >> > archives from both the old and new directory structures. As a result, > all > >> > previously archived jobs will remain accessible and visible. > >> > > >> > > >> > If you have additional questions or specific edge cases in mind, I’d be > >> > happy to discuss them further!
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Gyula, Thank you so much for your thoughtful and insightful feedback! 1. I fully agree that using the job name for job matching is more user-friendly and cleaner than relying on a jobIndex parameter. I’ll update the FLIP accordingly to reflect this design change. 2. I’d like to dig a bit deeper to make sure I fully understand the requirement. You have mentioned the need for a generic information endpoint that remains accessible even after failure, and that it should include additional info such as the checkpoint restore path and configuration. From my current understanding, Flink’s existing archive mechanism—combined with the HistoryServer—already provides persistent access to job-related information after failure. Specifically, the existing HistoryServer endpoint `/jobs/:jobid/jobmanager/config` seems capable of exposing the configuration including checkpoint restore paths, and remains accessible after failure. On the other hand, the proposed /applications/:appid/exceptions endpoint is intended specifically to surface application-level exceptions that occur outside the job lifecycle, which will also be available through the HistoryServer after failure. So could you help clarify whether there is a specific failure scenario or use case where the current archiving/HistoryServer mechanism falls short or where critical debugging information—like the restore path or configuration—is not retrievable after a failure? Thanks again for your excellent suggestions! Best, Yi At 2025-12-25 21:08:49, "Gyula Fóra" wrote: >Hi! > >Overall I think the design/improvements look great. Some minor comments, >improvement possibilities: > >1. Could we simply use the job name for job matching? I think it's fair to >require unique job names (or if they are not unique attach a sequence >number to the name) instead of the jobIndex parameter. JobIndex sounds a >bit weird and low level. 
> >2. A big problem/limitation of the existing submission logic is that the >submit-on-error logic is very limited (only handling certain types of >errors and only showing exception info). We should capture different errors >and metadata for failed applications including checkpoint settings (for >instance what checkpoint path was used during restore, which is a common >cause of the errors). So instead of introducing a >/applications/appid/exceptions endpoint, can we instead introduce a more >generic information endpoint that would contain other information? This >endpoint should be accessible even in case of failures and populated from >the app result store and should also contain some other info such as >checkpoint restore path, configuration etc. > >Capturing more information on failed submissions would help resolve a lot >of long outstanding issues in the Flink Kubernetes Operator as well. > >Cheers >Gyula > > >On Thu, Dec 25, 2025 at 1:54 PM Lei Yang wrote: > >> Thank you Yi for your reply, looks good to me! >> +1 for this proposal >> Best, >> Lei >> >> Yi Zhang 于2025年12月25日周四 10:02写道: >> >> > Hi Lei, >> > >> > >> > Thank you for the feedback! >> > The "Archiving Directory Structure" section describes a change in how >> > archived >> > files are organized under jobmanager.archive.fs.dir. While this change >> was >> > originally proposed in FLIP-549, it's indeed a significant >> > application-level update, >> > so I'm glad to have the chance to clarify it here. >> > >> > >> > To answer your question directly: backward compatibility is fully >> > preserved. >> > >> > >> > In earlier Flink versions, job archives were written directly under the >> > configured >> > jobmanager.archive.fs.dir. With this update, Flink will instead use a >> > hierarchical >> > cluster-application-job structure. >> > We understand that many users already have archives stored in the legacy >> > flat >> > layout. 
To ensure a smooth transition, the History Server will be updated >> > to read >> > archives from both the old and new directory structures. As a result, all >> > previously archived jobs will remain accessible and visible. >> > >> > >> > If you have additional questions or specific edge cases in mind, I’d be >> > happy to >> > discuss them further! >> > >> > >> > Best, >> > Yi >> > >> > >> > >> > At 2025-12-24 11:35:00, "Lei Yang" wrote: >> > >Hi Yi, >> > > >> > >Thank you for creating this FLIP! The introduction of the Application >> > >entity significantly enhances the observability and manageability of >> > >user logic, especially benefiting batch workloads. This is truly >> > >excellent work! >> > > >> > >However, I have a compatibility concern and would appreciate your >> > >clarification. In the “Archiving Directory Structure” section, I noticed >> > >that the directory structure has been changed. If users have configured >> > >a persistent external path for jobmanager.archive.fs.dir, will their >> > >existing archives become unreadable after this change? Will the >> > >implementation of this FLIP maintain backward compatibility with >> > >previously archived job data?
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi! Overall I think the design/improvements look great. Some minor comments, improvement possibilities: 1. Could we simply use the job name for job matching? I think it's fair to require unique job names (or if they are not unique attach a sequence number to the name) instead of the jobIndex parameter. JobIndex sounds a bit weird and low level. 2. A big problem/limitation of the existing submission logic is that the submit-on-error logic is very limited (only handling certain types of errors and only showing exception info). We should capture different errors and metadata for failed applications including checkpoint settings (for instance what checkpoint path was used during restore, which is a common cause of the errors). So instead of introducing a /applications/appid/exceptions endpoint, can we instead introduce a more generic information endpoint that would contain other information? This endpoint should be accessible even in case of failures and populated from the app result store and should also contain some other info such as checkpoint restore path, configuration etc. Capturing more information on failed submissions would help resolve a lot of long outstanding issues in the Flink Kubernetes Operator as well. Cheers Gyula On Thu, Dec 25, 2025 at 1:54 PM Lei Yang wrote: > Thank you Yi for your reply, looks good to me! > +1 for this proposal > Best, > Lei > > Yi Zhang 于2025年12月25日周四 10:02写道: > > > Hi Lei, > > > > > > Thank you for the feedback! > > The "Archiving Directory Structure" section describes a change in how > > archived > > files are organized under jobmanager.archive.fs.dir. While this change > was > > originally proposed in FLIP-549, it's indeed a significant > > application-level update, > > so I'm glad to have the chance to clarify it here. > > > > > > To answer your question directly: backward compatibility is fully > > preserved. 
> > > > > > In earlier Flink versions, job archives were written directly under the > > configured > > jobmanager.archive.fs.dir. With this update, Flink will instead use a > > hierarchical > > cluster-application-job structure. > > We understand that many users already have archives stored in the legacy > > flat > > layout. To ensure a smooth transition, the History Server will be updated > > to read > > archives from both the old and new directory structures. As a result, all > > previously archived jobs will remain accessible and visible. > > > > > > If you have additional questions or specific edge cases in mind, I’d be > > happy to > > discuss them further! > > > > > > Best, > > Yi > > > > > > > > At 2025-12-24 11:35:00, "Lei Yang" wrote: > > >Hi Yi, > > > > > >Thank you for creating this FLIP! The introduction of the Application > > >entity significantly enhances the observability and manageability of > > >user logic, especially benefiting batch workloads. This is truly > > >excellent work! > > > > > >However, I have a compatibility concern and would appreciate your > > >clarification. In the “Archiving Directory Structure” section, I noticed > > >that the directory structure has been changed. If users have configured > > >a persistent external path for jobmanager.archive.fs.dir, will their > > >existing archives become unreadable after this change? Will the > > >implementation of this FLIP maintain backward compatibility with > > >previously archived job data? > > > > > >Best regards, > > >Lei > > > > > >Yi Zhang 于2025年12月17日周三 14:18写道: > > > > > >> Hi everyone, > > >> > > >> I would like to start a discussion about FLIP-560: Application > > Capability > > >> Enhancement [1]. > > >> > > >> The primary goal of this FLIP is to improve the usability and > > availability > > >> of Flink applications > > >> > > >> by introducing the following enhancements: > > >> > > >> > > >> > > >> 1. 
Support multi-job execution in Application Mode, which is an > > important > > >> batch-processing use case. > > >> 2. Support re-running the user's main method after JobManager restarts > > due > > >> to failures in Session Mode. > > >> 3. Expose exceptions thrown in the user's main method via REST/UI. > > >> > > >> > > >> Looking forward to your feedback and suggestions! > > >> > > >> > > >> > > >> [1] > > >> > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement > > >> > > >> > > >> > > >> Best Regards, > > >> > > >> Yi Zhang > > >
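Gyula's first suggestion above (match jobs by name, appending a sequence number when names collide, instead of a jobIndex parameter) could look roughly like the following sketch. This is a hypothetical illustration, not Flink code: the class name, method name, and the "name-N" suffix scheme are all invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: keep submitted job names unique by leaving the
 * first occurrence of a name unchanged and appending a sequence number
 * ("-1", "-2", ...) to later duplicates, so jobs in a multi-job
 * application can be matched by name rather than by a positional index.
 */
public class JobNameDeduplicator {
    private final Map<String, Integer> seen = new HashMap<>();

    /** Returns the name unchanged on first use, "name-N" afterwards. */
    public String uniqueName(String name) {
        int count = seen.merge(name, 1, Integer::sum);
        return count == 1 ? name : name + "-" + (count - 1);
    }
}
```

For example, submitting three jobs all named "etl" would yield "etl", "etl-1", and "etl-2", keeping names stable and human-readable across the application.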
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Thank you Yi for your reply, looks good to me! +1 for this proposal Best, Lei Yi Zhang 于2025年12月25日周四 10:02写道: > Hi Lei, > > > Thank you for the feedback! > The "Archiving Directory Structure" section describes a change in how > archived > files are organized under jobmanager.archive.fs.dir. While this change was > originally proposed in FLIP-549, it's indeed a significant > application-level update, > so I'm glad to have the chance to clarify it here. > > > To answer your question directly: backward compatibility is fully > preserved. > > > In earlier Flink versions, job archives were written directly under the > configured > jobmanager.archive.fs.dir. With this update, Flink will instead use a > hierarchical > cluster-application-job structure. > We understand that many users already have archives stored in the legacy > flat > layout. To ensure a smooth transition, the History Server will be updated > to read > archives from both the old and new directory structures. As a result, all > previously archived jobs will remain accessible and visible. > > > If you have additional questions or specific edge cases in mind, I’d be > happy to > discuss them further! > > > Best, > Yi > > > > At 2025-12-24 11:35:00, "Lei Yang" wrote: > >Hi Yi, > > > >Thank you for creating this FLIP! The introduction of the Application > >entity significantly enhances the observability and manageability of > >user logic, especially benefiting batch workloads. This is truly > >excellent work! > > > >However, I have a compatibility concern and would appreciate your > >clarification. In the “Archiving Directory Structure” section, I noticed > >that the directory structure has been changed. If users have configured > >a persistent external path for jobmanager.archive.fs.dir, will their > >existing archives become unreadable after this change? Will the > >implementation of this FLIP maintain backward compatibility with > >previously archived job data? 
> > > >Best regards, > >Lei > > > >Yi Zhang 于2025年12月17日周三 14:18写道: > > > >> Hi everyone, > >> > >> I would like to start a discussion about FLIP-560: Application > Capability > >> Enhancement [1]. > >> > >> The primary goal of this FLIP is to improve the usability and > availability > >> of Flink applications > >> > >> by introducing the following enhancements: > >> > >> > >> > >> 1. Support multi-job execution in Application Mode, which is an > important > >> batch-processing use case. > >> 2. Support re-running the user's main method after JobManager restarts > due > >> to failures in Session Mode. > >> 3. Expose exceptions thrown in the user's main method via REST/UI. > >> > >> > >> > >> Looking forward to your feedback and suggestions! > >> > >> > >> > >> [1] > >> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement > >> > >> > >> > >> Best Regards, > >> > >> Yi Zhang >
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Lei, Thank you for the feedback! The "Archiving Directory Structure" section describes a change in how archived files are organized under jobmanager.archive.fs.dir. While this change was originally proposed in FLIP-549, it's indeed a significant application-level update, so I'm glad to have the chance to clarify it here. To answer your question directly: backward compatibility is fully preserved. In earlier Flink versions, job archives were written directly under the configured jobmanager.archive.fs.dir. With this update, Flink will instead use a hierarchical cluster-application-job structure. We understand that many users already have archives stored in the legacy flat layout. To ensure a smooth transition, the History Server will be updated to read archives from both the old and new directory structures. As a result, all previously archived jobs will remain accessible and visible. If you have additional questions or specific edge cases in mind, I’d be happy to discuss them further! Best, Yi At 2025-12-24 11:35:00, "Lei Yang" wrote: >Hi Yi, > >Thank you for creating this FLIP! The introduction of the Application >entity significantly enhances the observability and manageability of >user logic, especially benefiting batch workloads. This is truly >excellent work! > >However, I have a compatibility concern and would appreciate your >clarification. In the “Archiving Directory Structure” section, I noticed >that the directory structure has been changed. If users have configured >a persistent external path for jobmanager.archive.fs.dir, will their >existing archives become unreadable after this change? Will the >implementation of this FLIP maintain backward compatibility with >previously archived job data? > >Best regards, >Lei > >Yi Zhang 于2025年12月17日周三 14:18写道: > >> Hi everyone, >> >> I would like to start a discussion about FLIP-560: Application Capability >> Enhancement [1]. 
>> >> The primary goal of this FLIP is to improve the usability and availability >> of Flink applications >> >> by introducing the following enhancements: >> >> >> >> 1. Support multi-job execution in Application Mode, which is an important >> batch-processing use case. >> 2. Support re-running the user's main method after JobManager restarts due >> to failures in Session Mode. >> 3. Expose exceptions thrown in the user's main method via REST/UI. >> >> >> >> Looking forward to your feedback and suggestions! >> >> >> >> [1] >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement >> >> >> >> Best Regards, >> >> Yi Zhang
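The backward-compatibility plan described above (the History Server reading archives from both the legacy flat layout under jobmanager.archive.fs.dir and the new hierarchical cluster-application-job layout) could be sketched as follows. This is an illustrative sketch, not Flink code: the class, the method names, and the exact path shapes are assumptions based only on the layout described in the thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Illustrative sketch of the two archive layouts discussed in the thread:
 * legacy archives written directly under jobmanager.archive.fs.dir versus
 * the proposed cluster/application/job hierarchy. A History Server that
 * supports both would probe both candidate locations for a given job.
 */
public class ArchiveLayout {
    /** Legacy flat layout: <archiveDir>/<jobId> */
    public static Path legacyPath(String archiveDir, String jobId) {
        return Paths.get(archiveDir, jobId);
    }

    /** New hierarchical layout: <archiveDir>/<clusterId>/<applicationId>/<jobId> */
    public static Path hierarchicalPath(
            String archiveDir, String clusterId, String applicationId, String jobId) {
        return Paths.get(archiveDir, clusterId, applicationId, jobId);
    }
}
```

Probing both locations is one simple way to keep previously archived jobs visible without migrating existing files, which matches the "both old and new directory structures" behavior promised in the reply.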
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Yi, Thank you for creating this FLIP! The introduction of the Application entity significantly enhances the observability and manageability of user logic, especially benefiting batch workloads. This is truly excellent work! However, I have a compatibility concern and would appreciate your clarification. In the “Archiving Directory Structure” section, I noticed that the directory structure has been changed. If users have configured a persistent external path for jobmanager.archive.fs.dir, will their existing archives become unreadable after this change? Will the implementation of this FLIP maintain backward compatibility with previously archived job data? Best regards, Lei Yi Zhang 于2025年12月17日周三 14:18写道: > Hi everyone, > > I would like to start a discussion about FLIP-560: Application Capability > Enhancement [1]. > > The primary goal of this FLIP is to improve the usability and availability > of Flink applications > > by introducing the following enhancements: > > > > 1. Support multi-job execution in Application Mode, which is an important > batch-processing use case. > 2. Support re-running the user's main method after JobManager restarts due > to failures in Session Mode. > 3. Expose exceptions thrown in the user's main method via REST/UI. > > > > Looking forward to your feedback and suggestions! > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement > > > > Best Regards, > > Yi Zhang
Re: [DISCUSS] FLIP-560: Application Capability Enhancement
Hi Yi, Thanks for creating this FLIP. Supporting the execution of multiple jobs within a single application can be highly beneficial for batch processing. It enables more flexible and complex workflows, allowing better resource sharing, coordinated job management, and simplified deployment. +1 for this proposal Thanks, Zhu Yi Zhang 于2025年12月17日周三 14:18写道: > Hi everyone, > > I would like to start a discussion about FLIP-560: Application Capability > Enhancement [1]. > > The primary goal of this FLIP is to improve the usability and availability > of Flink applications > > by introducing the following enhancements: > > > > 1. Support multi-job execution in Application Mode, which is an important > batch-processing use case. > 2. Support re-running the user's main method after JobManager restarts due > to failures in Session Mode. > 3. Expose exceptions thrown in the user's main method via REST/UI. > > > > Looking forward to your feedback and suggestions! > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement > > > > Best Regards, > > Yi Zhang
