[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI

2024-01-11 Thread Zakelly Lan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805471#comment-17805471
 ] 

Zakelly Lan commented on FLINK-33897:
-

{quote}By real world motivation, I meant if that really is an issue that 
someone complained about? 
{quote}
The reason behind this is that some of our customers start their jobs with the 
default configuration and then run into back-pressure and checkpoint failures 
that last for a while. They reached out to me to ask whether there is some way 
to eliminate the back-pressure without introducing much more delay or pouring a 
lot of duplicated data into the sink (not exactly-once). Their complaint is 
that they have to suffer first before they can restart the job.

> Allow triggering unaligned checkpoint via CLI
> -
>
> Key: FLINK-33897
> URL: https://issues.apache.org/jira/browse/FLINK-33897
> Project: Flink
> Issue Type: New Feature
> Components: Command Line Client, Runtime / Checkpointing
> Reporter: Zakelly Lan
> Assignee: Zakelly Lan
> Priority: Major
>
> After FLINK-6755, users can trigger a checkpoint through the CLI. However, I 
> noticed there would be value in supporting triggering it in an unaligned way, 
> since the job may encounter high back-pressure, in which case an aligned 
> checkpoint would fail.
>  
> I suggest we provide an option '-unaligned' in the CLI to support that.
>  
> A similar option would also be useful for the REST API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI

2024-01-05 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803535#comment-17803535
 ] 

Piotr Nowojski commented on FLINK-33897:


By real world motivation, I meant if that really is an issue that someone 
complained about? If not, and this is just a theoretical possibility that comes 
from your observation when implementing FLINK-6755 ("it could be implemented, 
someone might find it useful"), I would put it aside for the time being. 
Honestly, I doubt many users would use this feature. In most cases, just 
cancelling the job and restarting it with a new configuration would be faster 
than first having to find out from the docs, the user mailing list or Stack 
Overflow that one can actually trigger an unaligned checkpoint from the CLI. 
This would only be useful to a handful of power users, but those should already 
know that it's better to use unaligned checkpoints from the get-go.

{quote}
I'm not very familiar with this part so if you think this is a big change, I 
won't insist on doing it.
{quote}
Adding a new BarrierHandlerState may not be a very big change per se, but it 
will visibly increase the complexity of the code for anyone who needs to 
read/understand it.

{quote}
I do agree we could enable timeout for aligned cp by default, which would 
greatly reduce such cases
{quote}
Let me start the dev mailing list discussion about that.



[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI

2024-01-04 Thread Zakelly Lan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803407#comment-17803407
 ] 

Zakelly Lan commented on FLINK-33897:
-

[~pnowojski]  Actually there is real world motivation. When a job encounters 
high back-pressure and aligned checkpointing has gone dozens of minutes without 
success, the user finds that they need to switch to unaligned cp or increase 
the parallelism. Such a change requires a job restart, which puts users in a 
dilemma because it involves replaying a lot of data and a longer delay. This 
feature allows users to take an unaligned cp temporarily and restart from it, 
avoiding the large data replay.

I do agree we could enable timeout for aligned cp by default, which would 
greatly reduce such cases. I also think there would be value in giving users a 
chance to change the configuration and restart the job with less pain when they 
have misconfigured their jobs, by supporting triggering a swift checkpoint or 
savepoint that is likely to succeed. As for the complexity of supporting this 
feature, IIUC, some changes would apply to the handler states (possibly 
introducing a new {{BarrierHandlerState}}) and fewer changes would be made to 
the {{SingleCheckpointBarrierHandler}} itself. I'm not very familiar with this part 
so if you think this is a big change, I won't insist on doing it.



[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI

2024-01-04 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803185#comment-17803185
 ] 

Piotr Nowojski commented on FLINK-33897:


I have mixed feelings. Shouldn't the solution be to just use/enable unaligned 
checkpoints? If one sets the alignment timeout to some reasonable value, I 
don't see a reason for someone to use aligned checkpoints anymore. Maybe 
instead let's consider deprecating aligned checkpoints without timeout?

Is there some real world motivation behind this feature?

I would be -1 for this feature if it requires complicating/making changes to 
the actual barrier handling (apart from replacing the 
{{SingleCheckpointBarrierHandler#aligned}} call with a 
{{SingleCheckpointBarrierHandler#alternating}} call). This code is complicated 
and in the past we had a lot of deadlocks, data corruptions and other critical 
bugs in this area, so keeping it as simple as possible and minimising the 
number of supported features is quite important. 
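
To illustrate the alignment timeout discussed above, here is a minimal Java sketch of the decision it implies: a checkpoint starts aligned and falls back to unaligned once alignment has been running for longer than the timeout. This is a toy model and not Flink code; the class {{AlignmentTimeoutSketch}}, the enum {{Mode}} and the method {{modeAfter}} are hypothetical names introduced only for this example.

```java
import java.time.Duration;

// Toy model only: none of these names come from the Flink code base.
final class AlignmentTimeoutSketch {

    enum Mode { ALIGNED, UNALIGNED }

    /**
     * Decides how a checkpoint proceeds, given how long barrier alignment has
     * been running. A null timeout models aligned checkpoints without any
     * timeout; a zero timeout models checkpoints that start unaligned.
     */
    static Mode modeAfter(Duration alignmentElapsed, Duration alignedTimeout) {
        if (alignedTimeout == null) {
            // No fallback: under back-pressure the checkpoint keeps waiting.
            return Mode.ALIGNED;
        }
        // Once the timeout is reached, stop waiting for alignment.
        return alignmentElapsed.compareTo(alignedTimeout) >= 0
                ? Mode.UNALIGNED
                : Mode.ALIGNED;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(30); // assumed value for illustration
        System.out.println(modeAfter(Duration.ofSeconds(5), timeout));
        System.out.println(modeAfter(Duration.ofMinutes(10), timeout));
        System.out.println(modeAfter(Duration.ofMinutes(10), null));
    }
}
```

In a real job this would be a configuration setting rather than hand-written logic; to the best of my knowledge the relevant option is {{execution.checkpointing.aligned-checkpoint-timeout}}, but please check the documentation of the Flink version in use.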

> Allow triggering unaligned checkpoint via CLI
> -
>
> Key: FLINK-33897
> URL: https://issues.apache.org/jira/browse/FLINK-33897
> Project: Flink
>  Issue Type: Improvement
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> After FLINK-6755, user could trigger checkpoint through CLI. However I 
> noticed there would be value supporting trigger it in unaligned way, since 
> the job may encounter a high back-pressure and an aligned checkpoint would 
> fail.
>  
> I suggest we provide an option '-unaligned' in CLI to support that.
>  
> Similar option would also be useful for REST api





[jira] [Commented] (FLINK-33897) Allow triggering unaligned checkpoint via CLI

2023-12-20 Thread Zakelly Lan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799233#comment-17799233
 ] 

Zakelly Lan commented on FLINK-33897:
-

This also requires the {{SingleCheckpointBarrierHandler}} to switch from the 
aligned to the unaligned state when it receives an unaligned barrier. I would 
like to hear your thoughts, [~pnowojski] [~dwysakowicz].
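
The state change described above can be sketched roughly as follows. This is a simplified toy model under stated assumptions, not the actual {{SingleCheckpointBarrierHandler}}: the class {{BarrierSwitchSketch}}, its {{State}} enum and its methods are hypothetical names, and it tracks a single checkpoint only. It shows a handler that blocks channels while aligning and, on receiving an unaligned barrier, releases them and triggers the snapshot right away.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: class, enum and method names are hypothetical, not Flink's.
final class BarrierSwitchSketch {

    enum State { IDLE, ALIGNING, UNALIGNED }

    private final int numChannels;
    private final Set<Integer> blockedChannels = new HashSet<>();
    private State state = State.IDLE;

    BarrierSwitchSketch(int numChannels) {
        this.numChannels = numChannels;
    }

    /** Processes one barrier; returns true if this barrier triggers the snapshot. */
    boolean onBarrier(int channel, boolean unaligned) {
        if (state == State.UNALIGNED) {
            // Snapshot already triggered; remaining barriers are only consumed.
            return false;
        }
        if (unaligned) {
            // Assumed transition: stop aligning, release blocked channels and
            // snapshot immediately (in-flight data would go into the checkpoint).
            blockedChannels.clear();
            state = State.UNALIGNED;
            return true;
        }
        // Aligned barrier: block this channel until all channels have delivered one.
        blockedChannels.add(channel);
        if (blockedChannels.size() == numChannels) {
            blockedChannels.clear();
            state = State.IDLE;
            return true;
        }
        state = State.ALIGNING;
        return false;
    }

    State state() {
        return state;
    }

    int blockedChannelCount() {
        return blockedChannels.size();
    }

    public static void main(String[] args) {
        BarrierSwitchSketch handler = new BarrierSwitchSketch(3);
        System.out.println(handler.onBarrier(0, false) + " " + handler.state());
        System.out.println(handler.onBarrier(1, true) + " " + handler.state());
    }
}
```

Even in this reduced form the extra transition adds a branch to every barrier, which gives a feel for why changes to the real barrier handling need careful review.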
