Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-05-29 Thread Becket Qin
Hi Allison,

Thanks for updating the FLIP. The latest version looks good to me. I think we
can move forward to voting.

Thanks,

Jiangjie (Becket) Qin

On Thu, May 22, 2025 at 1:26 PM Allison wrote:

> Hi Becket,
>
> Thank you for your feedback. I have updated the FLIP-505 proposal to
> reflect these comments.
>
> Would appreciate any additional feedback.
>
> Best,
> - Allison Chang

Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-05-22 Thread Allison
Hi Becket,

Thank you for your feedback. I have updated the FLIP-505 proposal to
reflect these comments.

Would appreciate any additional feedback.

Best,
- Allison Chang

Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-05-13 Thread Becket Qin
Thanks for the FLIP, Allison. The proposal makes a lot of sense in general.
The history server is critical for Flink batch workloads.

A few suggestions:
1. It might make sense to keep the existing config
*historyserver.archive.retained-jobs*. It would then only determine the
total number of jobs to keep in the remote storage.
2. The new configuration *historyserver.archive.cached-retained-jobs* only
determines the number of jobs cached locally. The default value is -1, and
the valid range is *[1, historyserver.archive.retained-jobs]*. When not
set, it caches everything, which is the current behavior. When set, the
history server is in the "partial caching mode" rather than the "full
mirror mode".
3. Maybe we don't need the *historyserver.archive.remote-fetch-enabled*
config. It is a little confusing because the job history is fetched
remotely even now. The difference is whether we fetch everything as a whole
or fetch individual jobs on demand. But this is an internal implementation
detail that does not need to be exposed to end users.
4. The config *historyserver.archive.num-cached-most-recently-viewed-jobs*
always takes effect, regardless of whether the history server is running in
the "full mirror mode" or the "partial caching mode".

So with the above settings:
1. By default, users get the same behavior as today.
2. When users set *historyserver.archive.cached-retained-jobs*, the history
server enters the partial caching mode and fetches jobs on demand.
3. The most recently viewed jobs are automatically pinned in the cache so
they will not be evicted accidentally and cause cache thrashing.
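
To make the settings above concrete, here is a minimal Python sketch of how
the two modes and the pinning could fit together. It is an illustration under
stated assumptions, not the FLIP's implementation: the function
resolve_local_limit, the class JobArchiveCache, and the "evict the oldest
unpinned entry first" policy are all hypothetical, and -1 is assumed to mean
"no limit" for both options. The number of pinned slots stands in for
*historyserver.archive.num-cached-most-recently-viewed-jobs* and is assumed
to be no larger than the cache limit.

```python
from collections import OrderedDict, deque
from typing import Callable, List

RETAINED = "historyserver.archive.retained-jobs"
CACHED = "historyserver.archive.cached-retained-jobs"
UNLIMITED = -1  # assumption: -1 means "no limit" for both options


def resolve_local_limit(conf: dict) -> int:
    """Return the local cache limit; UNLIMITED means "full mirror mode"."""
    retained = int(conf.get(RETAINED, UNLIMITED))
    cached = int(conf.get(CACHED, UNLIMITED))
    if cached == UNLIMITED:
        return UNLIMITED  # option not set: cache everything, as today
    if cached < 1 or (retained != UNLIMITED and cached > retained):
        raise ValueError(f"{CACHED} must be in [1, {RETAINED}], got {cached}")
    return cached  # option set: "partial caching mode"


class JobArchiveCache:
    """Hypothetical local cache: oldest entries go first, pinned ones stay."""

    def __init__(self, limit: int, pinned_slots: int):
        self.limit = limit
        self._entries = OrderedDict()  # job id -> archive, oldest first
        self._viewed = deque(maxlen=pinned_slots)  # most recently viewed ids

    def put(self, job_id: str, archive: str) -> None:
        """Add a newly discovered archive, then enforce the limit."""
        self._entries[job_id] = archive
        self._entries.move_to_end(job_id)
        self._evict()

    def view(self, job_id: str, fetch: Callable[[str], str]) -> str:
        """Serve a job, fetching it on demand if it is not cached locally."""
        if job_id not in self._entries:
            self._entries[job_id] = fetch(job_id)
        if job_id in self._viewed:
            self._viewed.remove(job_id)
        self._viewed.append(job_id)  # pin; the oldest pin drops off by itself
        self._evict()
        return self._entries[job_id]

    def cached_ids(self) -> List[str]:
        return list(self._entries)

    def _evict(self) -> None:
        if self.limit == UNLIMITED:
            return
        for job_id in list(self._entries):
            if len(self._entries) <= self.limit:
                break
            if job_id not in self._viewed:  # pinned jobs are never evicted
                del self._entries[job_id]
```

With a limit of 2 and one pinned slot, viewing job "a" and then adding a
third job evicts the older unpinned job rather than "a", which is the
thrashing protection described in point 3.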

BTW, it would be good to add a future work section that gives a heads-up
about the plan to use RocksDB for job history rather than raw files.

Thanks,

Jiangjie (Becket) Qin


Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-05-12 Thread Venkatakrishnan Sowrirajan
> Regarding decoupling the two features, would your suggestion be to
> separate them into two separate FLIPs?

Sorry for the late response.

Yes, that is correct. If these two features are somewhat coupled with each
other, then it makes sense to address them in the same FLIP. Otherwise, I
think it would be better to tackle them as two separate FLIPs.

Regards
Venkata krishnan


Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-03-03 Thread Allison
Hi Yanquan,

I've updated the FLIP to include the default values. Thanks for your help!

Sincerely
- Allison

On Thu, Jan 30, 2025 at 3:21 AM Yanquan Lv wrote:

> Thank you for your explanation. It has basically resolved my previous
> questions.
>
> Regarding the second point, I would like to suggest clarifying the default
> values for the newly added parameters in the `Public Interfaces` section.
>
> ---------- Forwarded message ---------
> From: Allison
> Date: Thu, Jan 30, 2025 at 3:42 AM
> Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability
> Improvements, Remote Data Store Fetch and Per Job Fetch
> To:
>
>
> Hi Yanquan,
>
> Thanks for taking a look at this. Re: your questions:
>
> 1. Yes, I've updated the FLIP to make this clearer. It involves changing
> the existing configuration historyserver.archive.retained-jobs to
> historyserver.archive.cached-retained-jobs. The number of jobs stored
> remotely can be unlimited. The thought behind this is that the remote data
> storage can be cleaned up or limited by a separate process that can be
> customized for each individual use case.
> 2. Could you clarify this a bit? I'm not sure I understand this part. Do
> you mean adding to the FLIP the values the configurations take when they
> are not set?
> 3. historyserver.archive.fs.refresh-interval is the interval between calls
> to the remote data storage to look for fresh data. It configures how often
> the FHS polls the remote data store for new files. The remote data store is
> written to whenever a job finishes.
>
> Hope this clarifies some things.
>
> Best,
> - Allison
>
>
> On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv wrote:
>
> > Hi, Allison. Thanks for driving this FLIP.
> > I have some questions to confirm:
> >
> > 1. I can’t find any existing configuration named
> > `historyserver.archive.cached-retained-jobs`. I guess what you mean is
> > changing the existing configuration from
> > `historyserver.archive.retained-jobs` to
> > `historyserver.archive.cached-retained-jobs`. If so, and we only limit
> > the number of retained jobs stored locally, is the number of retained
> > jobs stored remotely unlimited?
> > 2. I think it would be better to provide instructions for adding default
> > values to HistoryServerOptions.
> > 3. Does `historyserver.archive.fs.refresh-interval` apply to both local
> > and remote storage simultaneously?
> >
> > Best,
> > Yanquan
> >
> > On Fri, Jan 17, 2025 at 8:07 AM Allison wrote:
> >
> > > Hi everyone,
> > >
> > > I would like to initiate a discussion for the FLIP below, which
> > > enhances the Flink History Server to allow greater scalability of the
> > > service.
> > >
> > > Motivation:
> > >
> > > Currently, the Flink History Server (FHS) is limited in the number of
> > > job archives it can serve by the storage capacity of the node that the
> > > FHS runs on. Job archives are stored locally in a cache, where each
> > > single JSON archive file is expanded into a local directory. This not
> > > only uses up local storage space, but also, because the FHS expands
> > > the job archives into a nested directory structure, often exhausts the
> > > available inodes for jobs with a large number of taskmanagers or
> > > subtasks. In order to make the FHS more performant, we would like to
> > > decouple the job archive storage for the FHS from the local cache, so
> > > that job archives can be stored in and fetched from a remote file
> > > store.
> > >
> > > FLIP proposal document:
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
> > >
> > > Thanks!
> > >
> > > Best,
> > > - Allison Chang
> > >
> >
>


Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-03-03 Thread Allison
Hi Venkat,

To reply to your questions:
1. Correct. The remote storage and local cache limits are decoupled only
when remote fetch is enabled in the configuration. Otherwise, the system
behaves as before.
2. I've clarified the description in the FLIP.

Regarding decoupling the two features, would your suggestion be to separate
them into two separate FLIPs?

Thank you for your feedback.
Best,
- Allison

Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-01-30 Thread Venkatakrishnan Sowrirajan
Thanks for the FLIP, Allison. Fetching job archives from remote storage will
be a great feature addition, as will decoupling the local cache limits from
the remote storage archive limits.

I have a few questions:

1. In terms of backwards compatibility, are you saying that the remote
storage and local cache limits are decoupled only when remote fetch is
enabled, and not otherwise?
2. The description of what the historyserver.archive.remote-fetch-cached-jobs
config is meant for is not very clear. Can you please clarify that in the
FLIP? Basically, what I want to confirm is that there is no limit on how
many remote archives can be fetched, and that the above config is the local
cache limit for the most recently accessed jobs, which can include both
archives that are already cached locally and newly fetched remote archives.
Is that correct?

It looks like two new features or functionalities are described. We should
decouple them.

1. Support for fetching job archives from remote storage. This is an
entirely new feature, so there are no concerns with respect to backwards
compatibility.
2. Local archive cache limits that are decoupled from the remote archive
limits. This is required to tackle the Flink HistoryServer scaling issue
caused by local inode exhaustion. This also looks like a new feature, and it
improves the overall experience. But if the existing config
historyserver.archive.retained-jobs is changed to
historyserver.archive.cached-retained-jobs, then it won't be backwards
compatible with the older config. This should be stated clearly in the FLIP.
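
To illustrate the compatibility concern in point 2, here is a minimal Python
sketch of one common way to soften a key rename: read the new key first and
fall back to the old one. This is only an assumption about how a rename could
be handled, not what the FLIP specifies. The helper name
read_cached_retained_jobs and the default of -1 are hypothetical. In Flink
itself, a similar effect is usually achieved by declaring deprecated keys on
the ConfigOption.

```python
NEW_KEY = "historyserver.archive.cached-retained-jobs"
OLD_KEY = "historyserver.archive.retained-jobs"


def read_cached_retained_jobs(conf: dict, default: int = -1) -> int:
    """Prefer the new key; fall back to the old key so existing configs keep working."""
    if NEW_KEY in conf:
        return int(conf[NEW_KEY])
    if OLD_KEY in conf:
        # Hypothetical fallback: treat the old key as an alias of the new one.
        return int(conf[OLD_KEY])
    return default
```

Without such a fallback, a deployment that only sets the old key would
silently get the default after an upgrade, which is the incompatibility
described above.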

Thanks
Venkat

Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-01-30 Thread Yanquan Lv
Thank you for your explanation. It has basically resolved my previous
questions.

Regarding the second point, I would like to suggest clarifying the default
values for the newly added parameters in the `Public Interfaces` section.

---------- Forwarded message ---------
From: Allison
Date: Thu, Jan 30, 2025 at 3:42 AM
Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability
Improvements, Remote Data Store Fetch and Per Job Fetch
To:


Hi Yanquan,

Thanks for taking a look at this. Re: your questions:

1. Yes, I've updated the FLIP to make this clearer. It involves changing
the existing configuration historyserver.archive.retained-jobs to
historyserver.archive.cached-retained-jobs. The number of jobs stored
remotely can be unlimited. The thought behind this is that the remote data
storage can be cleaned up or limited by a separate process that can be
customized for each individual use case.
2. Could you clarify this a bit? I'm not sure I understand this part. Do
you mean adding to the FLIP the values the configurations take when they are
not set?
3. historyserver.archive.fs.refresh-interval is the interval between calls
to the remote data storage to look for fresh data. It configures how often
the FHS polls the remote data store for new files. The remote data store is
written to whenever a job finishes.

Hope this clarifies some things.

Best,
- Allison


On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv  wrote:

> Hi, Allison. Thanks for driving this FLIP.
> I have some questions to confirm:
>
> 1. I can’t find any existing configuration named
> `historyserver.archive.cached-retained-jobs`. I guess that what you mean is
> modifying the existing configuration from `historyserver.archive.retained-jobs`
> to `historyserver.archive.cached-retained-jobs`. If so, if we only limit
> the number of retained-jobs stored locally, is the number of retained-jobs
> stored remotely infinite?
> 2. I think it would be better to provide instructions for adding default
> values to HistoryServerOptions.
> 3. Does `historyserver.archive.fs.refresh-interval` apply to both local and
> remote storage simultaneously?
>
> Best,
> Yanquan
>
> On Fri, Jan 17, 2025 at 8:07 AM Allison wrote:
>
> > Hi everyone,
> >
> > I would like to initiate a discussion for the FLIP below, which enhances
> > the Flink History Server to allow greater scalability of the service.
> >
> > Motivation:
> >
> > Currently, the Flink History Server (FHS) is limited in the number of job
> > archives it can serve based on the storage capacity of the node that the
> > FHS runs in. Job archives are stored locally in a cache which creates a
> > local directory which is expanded out based on the contents of a single
> > json archive file. This not only uses up local memory space, but also
> > because of how the FHS expands the job archives into a nested directory
> > structure, for jobs with a large number of taskmanagers or subtasks, inode
> > space often runs out. In order to make the FHS more performant, we would
> > like to introduce the ability to decouple the job archive storage for the
> > FHS from being limited to the local cache, to being able to store and fetch
> > job archives from a remote file store.
> >
> > FLIP proposal document:
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
> >
> > Thanks!
> >
> > Best,
> > - Allison Chang
> >
>


Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-01-29 Thread Allison
Hi Yanquan,

Thanks for taking a look at this. Re: your questions:

1. Yes, I've updated the FLIP to be clearer: it involves changing the
existing configuration historyserver.archive.retained-jobs to
historyserver.archive.cached-retained-jobs. The number of jobs stored
remotely can be unbounded; the thought behind this is that the remote data
storage can be cleaned up or limited by a separate protocol that can be
customized to each individual use case.
2. Could you clarify this a bit? I'm not sure I understand this part. Do
you mean adding to the FLIP the values the configurations take when they
are not set?
3. historyserver.archive.fs.refresh-interval is the interval between calls
to the remote data storage to find fresh data. That is, it configures how
often the FHS polls the remote data store for new files. The remote data
store is written to whenever a job finishes.
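
As a rough illustration of points 1 and 3, here is a minimal sketch in
Python. It is not code from the FLIP or from Flink. The class and function
names are hypothetical, `max_cached_jobs` merely stands in for
historyserver.archive.cached-retained-jobs, and least-recently-used eviction
is an assumption made for the example rather than something the proposal
specifies.

```python
from collections import OrderedDict


class LocalArchiveCache:
    """Hypothetical bounded local cache of job archives.

    Archives that are not cached locally are fetched from the remote store
    on demand. Eviction is least-recently-used, which is an assumption.
    """

    def __init__(self, max_cached_jobs, fetch_remote):
        self.max_cached_jobs = max_cached_jobs  # stand-in for cached-retained-jobs
        self.fetch_remote = fetch_remote        # hypothetical callable: job_id -> archive
        self._cache = OrderedDict()

    def get(self, job_id):
        if job_id in self._cache:
            # Cache hit: mark the job as most recently used.
            self._cache.move_to_end(job_id)
            return self._cache[job_id]
        # Cache miss: fetch this single job from the remote store.
        archive = self.fetch_remote(job_id)
        self._cache[job_id] = archive
        if len(self._cache) > self.max_cached_jobs:
            # Drop the least recently used archive from the local cache only;
            # the remote copy is untouched.
            self._cache.popitem(last=False)
        return archive

    def cached_job_ids(self):
        return list(self._cache)


def poll_for_new_archives(list_remote_job_ids, known_job_ids):
    """One polling pass, as would run once per refresh interval.

    Returns the remote job ids not seen before and records them as known.
    """
    new_ids = [j for j in list_remote_job_ids() if j not in known_job_ids]
    known_job_ids.update(new_ids)
    return new_ids
```

In this sketch the remote store is never trimmed by the cache, which matches
the idea that remote cleanup is handled by a separate process.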

Hope this clarifies some things.

Best,
- Allison


On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv  wrote:

> Hi, Allison. Thanks for driving this FLIP.
> I have some questions to confirm:
>
> 1. I can’t find any existing configuration named
> `historyserver.archive.cached-retained-jobs`. I guess that what you mean is
> modifying the existing configuration from `historyserver.archive.retained-jobs`
> to `historyserver.archive.cached-retained-jobs`. If so, if we only limit
> the number of retained-jobs stored locally, is the number of retained-jobs
> stored remotely infinite?
> 2. I think it would be better to provide instructions for adding default
> values to HistoryServerOptions.
> 3. Does `historyserver.archive.fs.refresh-interval` apply to both local and
> remote storage simultaneously?
>
> Best,
> Yanquan
>
> On Fri, Jan 17, 2025 at 8:07 AM Allison wrote:
>
> > Hi everyone,
> >
> > I would like to initiate a discussion for the FLIP below, which enhances
> > the Flink History Server to allow greater scalability of the service.
> >
> > Motivation:
> >
> > Currently, the Flink History Server (FHS) is limited in the number of job
> > archives it can serve based on the storage capacity of the node that the
> > FHS runs in. Job archives are stored locally in a cache which creates a
> > local directory which is expanded out based on the contents of a single
> > json archive file. This not only uses up local memory space, but also
> > because of how the FHS expands the job archives into a nested directory
> > structure, for jobs with a large number of taskmanagers or subtasks, inode
> > space often runs out. In order to make the FHS more performant, we would
> > like to introduce the ability to decouple the job archive storage for the
> > FHS from being limited to the local cache, to being able to store and fetch
> > job archives from a remote file store.
> >
> > FLIP proposal document:
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
> >
> > Thanks!
> >
> > Best,
> > - Allison Chang
> >
>


Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-01-27 Thread Yanquan Lv
Hi, Allison. Thanks for driving this FLIP.
I have some questions to confirm:

1. I can’t find any existing configuration named
`historyserver.archive.cached-retained-jobs`. I guess that what you mean is
modifying the existing configuration from `historyserver.archive.retained-jobs`
to `historyserver.archive.cached-retained-jobs`. If so, if we only limit
the number of retained-jobs stored locally, is the number of retained-jobs
stored remotely infinite?
2. I think it would be better to provide instructions for adding default
values to HistoryServerOptions.
3. Does `historyserver.archive.fs.refresh-interval` apply to both local and
remote storage simultaneously?

Best,
Yanquan

On Fri, Jan 17, 2025 at 8:07 AM Allison wrote:

> Hi everyone,
>
> I would like to initiate a discussion for the FLIP below, which enhances
> the Flink History Server to allow greater scalability of the service.
>
> Motivation:
>
> Currently, the Flink History Server (FHS) is limited in the number of job
> archives it can serve based on the storage capacity of the node that the
> FHS runs in. Job archives are stored locally in a cache which creates a
> local directory which is expanded out based on the contents of a single
> json archive file. This not only uses up local memory space, but also
> because of how the FHS expands the job archives into a nested directory
> structure, for jobs with a large number of taskmanagers or subtasks, inode
> space often runs out.  In order to make the FHS more performant, we would
> like to introduce the ability to decouple the job archive storage for the
> FHS from being limited to the local cache, to being able to store and fetch
> job archives from a remote file store.
>
> FLIP proposal document:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
>
> Thanks!
>
> Best,
> - Allison Chang
>


Re: [DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-01-24 Thread Allison
Hi folks,

Just a gentle reminder regarding the FLIP I proposed for improving the
Flink History Server. Thanks for your time and attention.

Best,
- Allison

On Thu, Jan 16, 2025 at 4:06 PM Allison  wrote:

> Hi everyone,
>
> I would like to initiate a discussion for the FLIP below, which enhances
> the Flink History Server to allow greater scalability of the service.
>
> Motivation:
>
> Currently, the Flink History Server (FHS) is limited in the number of job
> archives it can serve based on the storage capacity of the node that the
> FHS runs in. Job archives are stored locally in a cache which creates a
> local directory which is expanded out based on the contents of a single
> json archive file. This not only uses up local memory space, but also
> because of how the FHS expands the job archives into a nested directory
> structure, for jobs with a large number of taskmanagers or subtasks, inode
> space often runs out.  In order to make the FHS more performant, we would
> like to introduce the ability to decouple the job archive storage for the
> FHS from being limited to the local cache, to being able to store and fetch
> job archives from a remote file store.
>
> FLIP proposal document:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
>
> Thanks!
>
> Best,
> - Allison Chang
>


[DISCUSS] FLIP-505: Flink History Server Scability Improvements, Remote Data Store Fetch and Per Job Fetch

2025-01-16 Thread Allison
Hi everyone,

I would like to initiate a discussion for the FLIP below, which enhances
the Flink History Server to allow greater scalability of the service.

Motivation:

Currently, the Flink History Server (FHS) is limited in the number of job
archives it can serve based on the storage capacity of the node that the
FHS runs in. Job archives are stored locally in a cache which creates a
local directory which is expanded out based on the contents of a single
json archive file. This not only uses up local memory space, but also
because of how the FHS expands the job archives into a nested directory
structure, for jobs with a large number of taskmanagers or subtasks, inode
space often runs out.  In order to make the FHS more performant, we would
like to introduce the ability to decouple the job archive storage for the
FHS from being limited to the local cache, to being able to store and fetch
job archives from a remote file store.

FLIP proposal document:
https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch

Thanks!

Best,
- Allison Chang