Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi Allison,

Thanks for updating the FLIP. The latest version looks good to me. I think we can move forward to voting.

Thanks,

Jiangjie (Becket) Qin

On Thu, May 22, 2025 at 1:26 PM Allison wrote:

> Hi Becket,
>
> Thank you for your feedback. I have updated the FLIP-505 proposal to
> reflect these comments.
>
> Would appreciate any additional feedback.
>
> Best,
> - Allison Chang
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi Becket,

Thank you for your feedback. I have updated the FLIP-505 proposal to reflect these comments.

Would appreciate any additional feedback.

Best,
- Allison Chang

On Tue, May 13, 2025 at 10:27 AM Becket Qin wrote:

> Thanks for the FLIP, Allison. The proposal makes a lot of sense in
> general. The history server is critical to Flink batch.
>
> A few suggestions:
>
> 1. It might make sense to keep the existing config
>    *historyserver.archive.retained-jobs*. This will only be used to
>    determine the total number of jobs to keep in the remote storage.
> 2. The new configuration *historyserver.archive.cached-retained-jobs*
>    only determines the number of jobs cached locally. The default value
>    is -1, and the valid range is
>    *[1, historyserver.archive.retained-jobs]*. When not set, it
>    basically caches everything, which is the current behavior. When set,
>    the history server is in the "partial caching mode" rather than the
>    "full mirror mode".
> 3. Maybe we don't need the *historyserver.archive.remote-fetch-enabled*
>    config. This config is a little confusing because the job history is
>    fetched remotely even now. The difference is whether we fetch
>    everything as a whole or fetch individual jobs on demand. But this is
>    an internal implementation detail and is not necessary to expose to
>    the end users.
> 4. The config
>    *historyserver.archive.num-cached-most-recently-viewed-jobs* always
>    takes effect, regardless of whether the history server is running in
>    "full mirror mode" or "partial caching mode".
>
> So with the above settings:
>
> 1. By default, users get the same behavior as today.
> 2. When users set *historyserver.archive.cached-retained-jobs*, the
>    history server enters the partial caching mode and fetches the jobs
>    on demand.
> 3. Some most recently viewed jobs are automatically pinned in the cache
>    so they will not be evicted accidentally and cause cache thrashing.
>
> BTW, it would be good to add a future work part to give a heads-up about
> the plan to use RocksDB for job history rather than raw files.
>
> Thanks,
>
> Jiangjie (Becket) Qin
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Thanks for the FLIP, Allison. The proposal makes a lot of sense in general. The history server is critical to Flink batch.

A few suggestions:

1. It might make sense to keep the existing config *historyserver.archive.retained-jobs*. This will only be used to determine the total number of jobs to keep in the remote storage.
2. The new configuration *historyserver.archive.cached-retained-jobs* only determines the number of jobs cached locally. The default value is -1, and the valid range is *[1, historyserver.archive.retained-jobs]*. When not set, it basically caches everything, which is the current behavior. When set, the history server is in the "partial caching mode" rather than the "full mirror mode".
3. Maybe we don't need the *historyserver.archive.remote-fetch-enabled* config. This config is a little confusing because the job history is fetched remotely even now. The difference is whether we fetch everything as a whole or fetch individual jobs on demand. But this is an internal implementation detail and is not necessary to expose to the end users.
4. The config *historyserver.archive.num-cached-most-recently-viewed-jobs* always takes effect, regardless of whether the history server is running in "full mirror mode" or "partial caching mode".

So with the above settings:

1. By default, users get the same behavior as today.
2. When users set *historyserver.archive.cached-retained-jobs*, the history server enters the partial caching mode and fetches the jobs on demand.
3. Some most recently viewed jobs are automatically pinned in the cache so they will not be evicted accidentally and cause cache thrashing.

BTW, it would be good to add a future work part to give a heads-up about the plan to use RocksDB for job history rather than raw files.

Thanks,

Jiangjie (Becket) Qin

On Mon, May 12, 2025 at 11:02 AM Venkatakrishnan Sowrirajan wrote:

> > Regarding decoupling the two features, would your suggestion be to
> > separate them into two separate FLIPs?
>
> Sorry for the late response.
>
> Yes, that is correct. If these 2 features are somewhat coupled with each
> other, then it makes sense to address them in the same FLIP; otherwise I
> think it will be better to tackle them as 2 different FLIPs.
>
> Regards
> Venkatakrishnan
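The mode selection Becket describes (unset means "full mirror", a value in range means "partial caching") can be sketched roughly as below. This is only an illustration of the semantics discussed in this message, not code from the FLIP: `CachePolicy` and `resolve_cache_policy` are hypothetical names, and the handling of an unlimited `retained-jobs` value (-1) is an assumption, since the thread does not say how the upper bound behaves in that case.

```python
from dataclasses import dataclass
from typing import Optional

UNSET = -1  # default value mentioned in the thread for cached-retained-jobs


@dataclass(frozen=True)
class CachePolicy:
    mode: str                   # "full-mirror" or "partial-caching"
    local_limit: Optional[int]  # None means "cache every retained archive"


def resolve_cache_policy(retained_jobs: int = UNSET,
                         cached_retained_jobs: int = UNSET) -> CachePolicy:
    """Hypothetical helper: derive the caching mode from the two options."""
    # Assumption: -1 means "no limit" for retained-jobs; other values must be >= 1.
    if retained_jobs != UNSET and retained_jobs < 1:
        raise ValueError("retained-jobs must be -1 (no limit) or >= 1")

    if cached_retained_jobs == UNSET:
        # Not set: mirror everything locally, i.e. the current behavior.
        return CachePolicy("full-mirror", None)

    if cached_retained_jobs < 1:
        raise ValueError("cached-retained-jobs must be -1 (unset) or >= 1")

    # Assumption: the upper bound only applies when retained-jobs is limited.
    if retained_jobs != UNSET and cached_retained_jobs > retained_jobs:
        raise ValueError("cached-retained-jobs must not exceed retained-jobs")

    return CachePolicy("partial-caching", cached_retained_jobs)
```

With this shape, a deployment that sets neither option keeps today's behavior, which matches point 1 of the summary above.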
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
> Regarding decoupling the two features, would your suggestion be to
> separate them into two separate FLIPs?

Sorry for the late response.

Yes, that is correct. If these 2 features are somewhat coupled with each other, then it makes sense to address them in the same FLIP; otherwise I think it will be better to tackle them as 2 different FLIPs.

Regards
Venkatakrishnan

On Mon, Mar 3, 2025 at 1:42 PM Allison wrote:

> Hi Yanquan,
>
> I've updated the FLIP to contain the default values, thanks for your
> help!
>
> Sincerely
> - Allison
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi Yanquan,

I've updated the FLIP to contain the default values, thanks for your help!

Sincerely
- Allison

On Thu, Jan 30, 2025 at 3:21 AM Yanquan Lv wrote:

> Thank you for your explanation. My previous questions have basically
> been resolved.
>
> Regarding the second point, I would like to suggest clarifying the
> default values for the newly added parameters in the `Public Interfaces`
> section.
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi Venkat,

To reply to your questions:

1. Correct, the remote storage and local cache limits will be decoupled only if remote fetch is enabled in the configuration. Otherwise, the system will behave as before.
2. I've clarified the description in the FLIP.

Regarding decoupling the two features, would your suggestion be to separate them into two separate FLIPs?

Thank you for your feedback.

Best,
- Allison

On Thu, Jan 30, 2025 at 7:10 PM Venkatakrishnan Sowrirajan wrote:

> Thanks for the FLIP, Allison. This will be a great feature addition to
> fetch job archives from remote storage, and also to decouple the local
> cache limits from the remote storage archive limits.
>
> A few questions I have:
>
> 1. In terms of backwards compatibility, are you saying that the remote
>    storage and local cache limits are decoupled only if remote fetch is
>    enabled, and otherwise not?
> 2. The description of what the
>    historyserver.archive.remote-fetch-cached-jobs config is meant for is
>    not very clear. Can you please clarify that in the FLIP? Basically,
>    what I want to confirm is that there is no limit on how many remote
>    archives can be fetched, but the above config is the local cache
>    limit of the most recently accessed jobs, which can include both
>    already locally cached archives and newly fetched remote archives,
>    correct?
>
> It looks like 2 new features or functionalities are described. We should
> decouple them.
>
> 1. Support for fetching job archives from remote storage. This is
>    entirely a new feature. No concerns with respect to backwards
>    compatibility.
> 2. Introduce local archive cache limits that are decoupled from the
>    remote archive limits. This is required to tackle the Flink
>    HistoryServer scaling issue due to local inode exhaustion. This looks
>    to be a new feature and improves the overall experience. But if the
>    existing config historyserver.archive.retained-jobs is renamed to
>    historyserver.archive.cached-retained-jobs, then it won't be
>    backwards compatible with the older config. This should be stated
>    clearly in the FLIP.
>
> Thanks
> Venkat
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Thanks for the FLIP, Allison. This will be a great feature addition to fetch job archives from remote storage, and also to decouple the local cache limits from the remote storage archive limits.

A few questions I have:

1. In terms of backwards compatibility, are you saying that the remote storage and local cache limits are decoupled only if remote fetch is enabled, and otherwise not?
2. The description of what the historyserver.archive.remote-fetch-cached-jobs config is meant for is not very clear. Can you please clarify that in the FLIP? Basically, what I want to confirm is that there is no limit on how many remote archives can be fetched, but the above config is the local cache limit of the most recently accessed jobs, which can include both already locally cached archives and newly fetched remote archives, correct?

It looks like 2 new features or functionalities are described. We should decouple them.

1. Support for fetching job archives from remote storage. This is entirely a new feature. No concerns with respect to backwards compatibility.
2. Introduce local archive cache limits that are decoupled from the remote archive limits. This is required to tackle the Flink HistoryServer scaling issue due to local inode exhaustion. This looks to be a new feature and improves the overall experience. But if the existing config historyserver.archive.retained-jobs is renamed to historyserver.archive.cached-retained-jobs, then it won't be backwards compatible with the older config. This should be stated clearly in the FLIP.

Thanks
Venkat

On Thu, Jan 30, 2025, 3:21 AM Yanquan Lv wrote:

> Thank you for your explanation. My previous questions have basically
> been resolved.
>
> Regarding the second point, I would like to suggest clarifying the
> default values for the newly added parameters in the `Public Interfaces`
> section.
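The behavior asked about in question 2 (a bounded local cache of the most recently accessed jobs, filled either from what is already cached or from a fresh remote fetch) resembles a plain LRU cache with fetch-on-miss. The sketch below is a toy model under that reading, not the FLIP's implementation; `JobArchiveCache` and the `fetch_remote` callback are hypothetical names introduced only for illustration.

```python
from collections import OrderedDict
from typing import Callable, List


class JobArchiveCache:
    """Hypothetical LRU cache of job archives, keyed by job ID."""

    def __init__(self, capacity: int, fetch_remote: Callable[[str], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._fetch_remote = fetch_remote
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, job_id: str) -> bytes:
        if job_id in self._entries:
            # Cache hit: mark the job as most recently accessed.
            self._entries.move_to_end(job_id)
            return self._entries[job_id]
        # Cache miss: fetch only this job's archive from remote storage.
        archive = self._fetch_remote(job_id)
        self._entries[job_id] = archive
        if len(self._entries) > self._capacity:
            # Evict the least recently accessed archive from the local cache.
            self._entries.popitem(last=False)
        return archive

    def cached_job_ids(self) -> List[str]:
        # Ordered from least to most recently accessed.
        return list(self._entries)
```

In this model the number of archives in remote storage is never consulted, so the remote side stays unbounded while the local side is capped, which is the decoupling the question describes.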
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Thank you for your explanation. It has largely resolved my earlier questions. Regarding the second point, I would like to suggest clarifying the default values of the newly added parameters in the `Public Interfaces` section.

---------- Forwarded message ---------
From: Allison
Date: Thu, Jan 30, 2025 at 3:42 AM
Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
To:

Hi Yanquan,

Thanks for taking a look at this. Re: your questions:

1. Yes, I've updated the FLIP to be clearer. It involves changing the existing configuration historyserver.archive.retained-jobs to historyserver.archive.cached-retained-jobs. The number of jobs stored remotely can be unbounded; the thought behind this is that the remote data storage can be cleaned up or limited by a separate protocol that can be customized for each individual use case.
2. Could you clarify this a bit? I'm not sure I understand this part. Do you mean adding to the FLIP the values the configurations take when they are not explicitly set?
3. historyserver.archive.fs.refresh-interval is the time between successive calls to the remote data storage to look for fresh data. In other words, it configures how often the FHS polls the remote data store for new files. The remote data store is written to whenever a job finishes.

Hope this clarifies some things.

Best,
- Allison

On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv wrote:

> Hi, Allison. Thanks for driving this FLIP.
> I have some questions to confirm:
>
> 1. I can’t find any existing configuration named `historyserver.archive.cached-retained-jobs`. I guess what you mean is changing the existing configuration from `historyserver.archive.retained-jobs` to `historyserver.archive.cached-retained-jobs`. If so, and we only limit the number of retained jobs stored locally, is the number of retained jobs stored remotely infinite?
> 2. I think it would be better to provide instructions for adding default values to HistoryServerOptions.
> 3. Does `historyserver.archive.fs.refresh-interval` apply to both local and remote storage simultaneously?
>
> Best,
> Yanquan
>
> On Fri, Jan 17, 2025 at 8:07 AM Allison wrote:
>
> > Hi everyone,
> >
> > I would like to initiate a discussion for the FLIP below, which enhances the Flink History Server to allow greater scalability of the service.
> >
> > Motivation:
> >
> > Currently, the Flink History Server (FHS) is limited in the number of job archives it can serve by the storage capacity of the node that the FHS runs on. Job archives are stored locally in a cache: the FHS creates a local directory that is expanded from the contents of a single JSON archive file. This not only uses up local storage space; because of how the FHS expands the job archives into a nested directory structure, jobs with a large number of TaskManagers or subtasks often exhaust inode space as well. To make the FHS more performant, we would like to decouple job archive storage from the local cache, so that the FHS can store and fetch job archives from a remote file store.
> >
> > FLIP proposal document:
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
> >
> > Thanks!
> >
> > Best,
> > - Allison Chang
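For readers following the thread, point 1 of Allison's reply (a limit on the number of jobs cached locally, with everything else left in the remote store) can be illustrated with a small sketch. This is not the FLIP's implementation: the class name `JobArchiveCache`, the `remoteFetcher` function, and the least-recently-accessed eviction policy are all assumptions made for illustration, since the thread does not state which eviction policy the FLIP uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch: a bounded local cache in front of a remote archive store.
 * maxCachedJobs plays the role of the proposed cached-retained-jobs limit.
 */
final class JobArchiveCache {
    private final int maxCachedJobs;
    // Stands in for a read from the remote store (assumption: jobId -> archive JSON).
    private final Function<String, String> remoteFetcher;
    private final LinkedHashMap<String, String> cache;
    private int remoteFetches = 0;

    JobArchiveCache(int maxCachedJobs, Function<String, String> remoteFetcher) {
        if (maxCachedJobs < 1) {
            throw new IllegalArgumentException("maxCachedJobs must be >= 1");
        }
        this.maxCachedJobs = maxCachedJobs;
        this.remoteFetcher = remoteFetcher;
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently accessed job once the limit is exceeded.
                return size() > JobArchiveCache.this.maxCachedJobs;
            }
        };
    }

    /** Returns the archive for a job, fetching it from the remote store on a cache miss. */
    synchronized String get(String jobId) {
        String archive = cache.get(jobId);
        if (archive == null) {
            archive = remoteFetcher.apply(jobId);
            remoteFetches++;
            cache.put(jobId, archive);
        }
        return archive;
    }

    /** Checks local presence without changing the access order. */
    synchronized boolean isCached(String jobId) {
        return cache.containsKey(jobId);
    }

    synchronized int remoteFetchCount() {
        return remoteFetches;
    }
}
```

With a limit of 2, requesting jobs a, b, a, c in that order would leave a and c cached locally, while b is fetched from the remote store again on its next request. The remote store itself is never trimmed by this cache, which matches the idea that remote cleanup is handled separately.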
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi Yanquan,

Thanks for taking a look at this. Re: your questions:

1. Yes, I've updated the FLIP to be clearer. It involves changing the existing configuration historyserver.archive.retained-jobs to historyserver.archive.cached-retained-jobs. The number of jobs stored remotely can be unbounded; the thought behind this is that the remote data storage can be cleaned up or limited by a separate protocol that can be customized for each individual use case.
2. Could you clarify this a bit? I'm not sure I understand this part. Do you mean adding to the FLIP the values the configurations take when they are not explicitly set?
3. historyserver.archive.fs.refresh-interval is the time between successive calls to the remote data storage to look for fresh data. In other words, it configures how often the FHS polls the remote data store for new files. The remote data store is written to whenever a job finishes.

Hope this clarifies some things.

Best,
- Allison

On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv wrote:

> Hi, Allison. Thanks for driving this FLIP.
> I have some questions to confirm:
>
> 1. I can’t find any existing configuration named `historyserver.archive.cached-retained-jobs`. I guess what you mean is changing the existing configuration from `historyserver.archive.retained-jobs` to `historyserver.archive.cached-retained-jobs`. If so, and we only limit the number of retained jobs stored locally, is the number of retained jobs stored remotely infinite?
> 2. I think it would be better to provide instructions for adding default values to HistoryServerOptions.
> 3. Does `historyserver.archive.fs.refresh-interval` apply to both local and remote storage simultaneously?
>
> Best,
> Yanquan
>
> On Fri, Jan 17, 2025 at 8:07 AM Allison wrote:
>
> > Hi everyone,
> >
> > I would like to initiate a discussion for the FLIP below, which enhances the Flink History Server to allow greater scalability of the service.
> >
> > Motivation:
> >
> > Currently, the Flink History Server (FHS) is limited in the number of job archives it can serve by the storage capacity of the node that the FHS runs on. Job archives are stored locally in a cache: the FHS creates a local directory that is expanded from the contents of a single JSON archive file. This not only uses up local storage space; because of how the FHS expands the job archives into a nested directory structure, jobs with a large number of TaskManagers or subtasks often exhaust inode space as well. To make the FHS more performant, we would like to decouple job archive storage from the local cache, so that the FHS can store and fetch job archives from a remote file store.
> >
> > FLIP proposal document:
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
> >
> > Thanks!
> >
> > Best,
> > - Allison Chang
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi, Allison. Thanks for driving this FLIP.
I have some questions to confirm:

1. I can’t find any existing configuration named `historyserver.archive.cached-retained-jobs`. I guess what you mean is changing the existing configuration from `historyserver.archive.retained-jobs` to `historyserver.archive.cached-retained-jobs`. If so, and we only limit the number of retained jobs stored locally, is the number of retained jobs stored remotely infinite?
2. I think it would be better to provide instructions for adding default values to HistoryServerOptions.
3. Does `historyserver.archive.fs.refresh-interval` apply to both local and remote storage simultaneously?

Best,
Yanquan

On Fri, Jan 17, 2025 at 8:07 AM Allison wrote:

> Hi everyone,
>
> I would like to initiate a discussion for the FLIP below, which enhances the Flink History Server to allow greater scalability of the service.
>
> Motivation:
>
> Currently, the Flink History Server (FHS) is limited in the number of job archives it can serve by the storage capacity of the node that the FHS runs on. Job archives are stored locally in a cache: the FHS creates a local directory that is expanded from the contents of a single JSON archive file. This not only uses up local storage space; because of how the FHS expands the job archives into a nested directory structure, jobs with a large number of TaskManagers or subtasks often exhaust inode space as well. To make the FHS more performant, we would like to decouple job archive storage from the local cache, so that the FHS can store and fetch job archives from a remote file store.
>
> FLIP proposal document:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
>
> Thanks!
>
> Best,
> - Allison Chang
Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi folks,

Just a gentle reminder regarding the FLIP I proposed for improving the Flink History Server.

Thanks for your time and attention.

Best,
- Allison

On Thu, Jan 16, 2025 at 4:06 PM Allison wrote:

> Hi everyone,
>
> I would like to initiate a discussion for the FLIP below, which enhances the Flink History Server to allow greater scalability of the service.
>
> Motivation:
>
> Currently, the Flink History Server (FHS) is limited in the number of job archives it can serve by the storage capacity of the node that the FHS runs on. Job archives are stored locally in a cache: the FHS creates a local directory that is expanded from the contents of a single JSON archive file. This not only uses up local storage space; because of how the FHS expands the job archives into a nested directory structure, jobs with a large number of TaskManagers or subtasks often exhaust inode space as well. To make the FHS more performant, we would like to decouple job archive storage from the local cache, so that the FHS can store and fetch job archives from a remote file store.
>
> FLIP proposal document:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
>
> Thanks!
>
> Best,
> - Allison Chang
[DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch
Hi everyone,

I would like to initiate a discussion for the FLIP below, which enhances the Flink History Server to allow greater scalability of the service.

Motivation:

Currently, the Flink History Server (FHS) is limited in the number of job archives it can serve by the storage capacity of the node that the FHS runs on. Job archives are stored locally in a cache: the FHS creates a local directory that is expanded from the contents of a single JSON archive file. This not only uses up local storage space; because of how the FHS expands the job archives into a nested directory structure, jobs with a large number of TaskManagers or subtasks often exhaust inode space as well. To make the FHS more performant, we would like to decouple job archive storage from the local cache, so that the FHS can store and fetch job archives from a remote file store.

FLIP proposal document:

https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch

Thanks!

Best,
- Allison Chang
