[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-06-14 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-501997307
 
 
@StefanRRichter I have fixed most of the issues. Please review again.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-06-13 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-501568073
 
 
   Hi @StefanRRichter, thanks for your suggestion. I have fixed some of the issues.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-06-12 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-501212069
 
 
   Hi @StefanRRichter, I have fixed the conflicts and rebased the code. When you have time, please review it again. Thanks.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-29 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-496903602
 
 
   Hi @StefanRRichter, I think I have fixed the existing issues. I have triggered Travis twice and both builds were green, so I'd invite you to review it again. Thanks.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-27 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-496101546
 
 
   Hi @StefanRRichter, in the last few days I did some further refactoring, including:
   
   * removed `org.apache.flink.runtime.checkpoint.decline.xxxException` and enumerated more reasons in `CheckpointFailureReason`
   * removed `org.apache.flink.runtime.taskexecutor.exceptions.CheckpointException` and reused the unified `org.apache.flink.runtime.checkpoint.CheckpointException` (a rough sketch of this shape follows below)
   * temporarily ignored several failure reasons to keep compatibility with `failOnCheckpointingErrors`
   
   There is just one issue left now, so would you mind giving it a full review again? IMO, this refactoring of the checkpoint exceptions and the unified invocation of `PendingCheckpoint#abort` will also benefit other issues.
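   As a rough illustration of the unification mentioned above, here is a minimal sketch of a single checkpoint exception carrying a failure reason. The names follow the classes mentioned in this comment, but the fields and reason values are assumptions, not the PR's exact code.

```java
// Sketch only: one CheckpointException carrying a CheckpointFailureReason,
// replacing the per-cause decline exception classes. Reason values are illustrative.
enum CheckpointFailureReason {
    CHECKPOINT_DECLINED("Checkpoint was declined."),
    CHECKPOINT_DECLINED_TASK_NOT_READY("Checkpoint was declined (task not ready)."),
    CHECKPOINT_EXPIRED("Checkpoint expired before completing.");

    private final String message;

    CheckpointFailureReason(String message) {
        this.message = message;
    }

    String message() {
        return message;
    }
}

class CheckpointException extends Exception {
    private final CheckpointFailureReason reason;

    CheckpointException(CheckpointFailureReason reason) {
        super(reason.message());
        this.reason = reason;
    }

    CheckpointFailureReason getCheckpointFailureReason() {
        return reason;
    }
}
```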




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-25 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495915181
 
 
   Hi @StefanRRichter, I understand what you mean and have given a reply. I think this PR is related to FLINK-11662, but there is no hard dependency. If FLINK-11662 is completed before this PR, it will make the CheckpointFailureManager look better; if it is resolved after this PR, it will not affect this PR's semantics. The semantics of this PR are: tolerate failures in situations we control, and do not tolerate currently unhandled exceptions. In addition, a job failure caused by FLINK-11662 will restart the entire job.
   
   Isn't the point of introducing `CheckpointFailureManager` that it can handle `CheckpointFailureReason` better? In fact, `CheckpointFailureReason` is a change point (it may grow because there are other issues like FLINK-11662 that we have not yet found or tracked, or it may shrink in the future), so we encapsulate the counting logic in `handleCheckpointException`, which lets us change that counting logic later.
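   To make the encapsulated counting logic concrete, here is a minimal sketch, assuming the manager keeps a continuous-failure counter and a job-failing callback; the real `CheckpointFailureManager` in this PR may differ in names and details. It reuses the `CheckpointFailureReason`/`CheckpointException` shapes sketched in the comment above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal counting sketch (illustrative, not the PR's exact class).
class CheckpointFailureManagerSketch {
    private final int tolerableCheckpointFailureNumber;
    private final AtomicInteger continuousFailureCounter = new AtomicInteger(0);
    private final Runnable failJobCallback;

    CheckpointFailureManagerSketch(int tolerableCheckpointFailureNumber, Runnable failJobCallback) {
        this.tolerableCheckpointFailureNumber = tolerableCheckpointFailureNumber;
        this.failJobCallback = failJobCallback;
    }

    void handleCheckpointException(CheckpointException exception) {
        switch (exception.getCheckpointFailureReason()) {
            // Reasons that are (temporarily) not counted, to stay close to the old
            // failOnCheckpointingErrors behavior.
            case CHECKPOINT_DECLINED_TASK_NOT_READY:
                break;
            default:
                if (continuousFailureCounter.incrementAndGet() > tolerableCheckpointFailureNumber) {
                    failJobCallback.run();
                }
        }
    }

    void handleCheckpointSuccess() {
        // A completed checkpoint resets the continuous failure counter.
        continuousFailureCounter.set(0);
    }
}
```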




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-24 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495483854
 
 
   > @StefanRRichter After thinking about it more seriously, my point of view has changed.
   > 
   > There are two options:
   > 
   > * to keep the default behavior of all test cases and user jobs, the default value of `tolerableCheckpointFailureNumber` should seemingly be `Integer.MAX_VALUE`;
   > * to keep compatibility, the default value of `tolerableCheckpointFailureNumber` should seemingly be 0.
   > 
   > The key problem is that the two options' **range of action** is different. The `failOnCheckpointingErrors` option only covers a task's checkpointing errors on the TaskManager side (not even the whole execution phase), while the `tolerableCheckpointFailureNumber` option needs to cover the whole trigger and execution phases.
   > 
   > The argument for option 1:
   > 
   > In many scenarios, if we set the default value of `tolerableCheckpointFailureNumber` to 0, the behavior of users' jobs would change: they would fail and restart more frequently. For example, a task that is not ready to checkpoint sends a decline message, which triggers the failure manager to fail and restart the job. This changes the default behavior of the test cases and user jobs, which is why I changed the default value to `Integer.MAX_VALUE`, even though such declines are sporadic.
   > 
   > The argument for option 2:
   > 
   > If we set the default value of `tolerableCheckpointFailureNumber` to `Integer.MAX_VALUE`, it may introduce a **compatibility issue**: how do we handle the `failOnCheckpointingErrors` config option in the future? The original thought was that `failOnCheckpointingErrors` (two effective values: 0 and Integer.MAX_VALUE) is a subset of `tolerableCheckpointFailureNumber` (0 to Integer.MAX_VALUE), and we had planned to deprecate the `failOnCheckpointingErrors` option in the third step.
   > 
   > What do you think?
   
   I tried to find a way to fix the compatibility issue (user job behavior and `failOnCheckpointingErrors`) by refining the decline exception. Currently, I have extracted a new failure reason, `CHECKPOINT_DECLINED_TASK_NOT_READY` (for `CheckpointDeclineTaskNotReadyException`), from the coarse-grained `CHECKPOINT_DECLINED`; a small classification sketch follows below.
   
   I think this way we can keep the default value of `tolerableCheckpointFailureNumber` at 0 without changing the behavior of users' jobs. In addition, we should also continue to reduce the exception classes in the `org.apache.flink.runtime.checkpoint.decline` package; that seems to be unfinished work left over from the first step. WDYT?
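   To illustrate the extraction, here is a hypothetical helper (not part of the PR's actual code) that classifies a task-side decline into the finer-grained reasons sketched earlier, so the failure manager can skip counting the "task not ready" case:

```java
import org.apache.flink.runtime.checkpoint.decline.CheckpointDeclineTaskNotReadyException;

// Hypothetical helper: map a decline cause onto the finer-grained reasons, so
// "task not ready" no longer hides inside the generic CHECKPOINT_DECLINED bucket.
final class DeclineReasonClassifier {

    static CheckpointFailureReason classify(Throwable declineCause) {
        if (declineCause instanceof CheckpointDeclineTaskNotReadyException) {
            return CheckpointFailureReason.CHECKPOINT_DECLINED_TASK_NOT_READY;
        }
        return CheckpointFailureReason.CHECKPOINT_DECLINED;
    }
}
```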




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-23 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495481836
 
 
   > About the problem with the SQL test, having the detailed logs from JM/TMs 
would be helpful. However, if I see correctly, those are batch tests and should 
not care about changes to checkpointing - so the error might very well be 
unrelated.
   
   @StefanRRichter I have debugged this locally. It seems the problem comes from the job-failing mechanism. `DeduplicateITCase` also triggered a job failure because of `CheckpointDeclineTaskNotReadyException`. But the `ExecutionGraph#failGlobal` method checks the calling thread via `assertRunningInJobMasterMainThread`, and I found the execution could not get past that check. My guess is that the triggering thread is the timer in `CheckpointCoordinator`, not the main thread.
   
   So we may need to figure out a new way to fail the job.
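   One possible shape for such a "new way" (a sketch under the assumption that the JobMaster exposes a main-thread executor; not the PR's final solution) is to hand the fail-job call over to that executor instead of calling `failGlobal` from the checkpoint timer thread:

```java
import java.util.concurrent.Executor;

import org.apache.flink.runtime.executiongraph.ExecutionGraph;

// Sketch: re-schedule the failGlobal call onto the JobMaster main thread so that
// assertRunningInJobMasterMainThread holds. The executor field is an assumption.
final class MainThreadJobFailer {
    private final Executor jobMasterMainThreadExecutor;
    private final ExecutionGraph executionGraph;

    MainThreadJobFailer(Executor jobMasterMainThreadExecutor, ExecutionGraph executionGraph) {
        this.jobMasterMainThreadExecutor = jobMasterMainThreadExecutor;
        this.executionGraph = executionGraph;
    }

    void failJob(Throwable cause) {
        // Calling executionGraph.failGlobal(cause) directly from the checkpoint timer
        // thread trips the main-thread assertion, so defer it to the main thread.
        jobMasterMainThreadExecutor.execute(() -> executionGraph.failGlobal(cause));
    }
}
```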




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-23 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495081839
 
 
   @StefanRRichter I happened to hit an exception in the Kafka end-to-end test: https://api.travis-ci.org/v3/job/536105563/log.txt




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-22 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495066930
 
 
   @klion26 Thanks for your review suggestion. What do you think about the new 
change?




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-22 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-494781182
 
 
   The essential difference between `failOnCheckpointingErrors` and `tolerableCheckpointFailureNumber` is how each causes the job to fail: the former is triggered on the TaskManager side, while the latter is triggered on the JobManager side.
   
   It is difficult to maintain compatibility at the API level, but I personally prefer to set the default value of `tolerableCheckpointFailureNumber` to `Integer.MAX_VALUE`. WDYT? @StefanRRichter 
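   For orientation, here is a hedged usage sketch of how the two knobs might sit next to each other on the user-facing `CheckpointConfig`, assuming the new counter is eventually exposed there (the exact API surface was still under discussion in this thread):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToleranceExample {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoint every 60s

        // Legacy, TaskManager-side switch: any checkpointing error fails the task.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(true);

        // Proposed, JobManager-side counter: tolerate up to N checkpoint failures
        // (covering both the trigger and execution phases) before failing the job.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
```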




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-21 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-494394030
 
 
   @StefanRRichter OK, if it is unrelated, we can ignore it. In addition, what do you think about my earlier point?




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-19 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-493837160
 
 
   Triggered again; the SQL deadlock is still there: https://travis-ci.org/apache/flink/jobs/534620910




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-19 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-493826613
 
 
   @StefanRRichter After thinking about it more seriously, my point of view has changed.
   
   There are two options:
   
   * to keep the default behavior of all test cases and user jobs, the default value of `tolerableCheckpointFailureNumber` should seemingly be `Integer.MAX_VALUE`;
   * to keep compatibility, the default value of `tolerableCheckpointFailureNumber` should seemingly be 0.
   
   The key problem is that the two options' **range of action** is different. The `failOnCheckpointingErrors` option only covers a task's checkpointing errors on the TaskManager side (not even the whole execution phase), while the `tolerableCheckpointFailureNumber` option needs to cover the whole trigger and execution phases.
   
   The argument for option 1:
   
   In many scenarios, if we set the default value of `tolerableCheckpointFailureNumber` to 0, the behavior of users' jobs would change: they would fail and restart more frequently. For example, a task that is not ready to checkpoint sends a decline message, which triggers the failure manager to fail and restart the job. This changes the default behavior of the test cases and user jobs, which is why I changed the default value to `Integer.MAX_VALUE`, even though such declines are sporadic.
   
   The argument for option 2:
   
   If we set the default value of `tolerableCheckpointFailureNumber` to `Integer.MAX_VALUE`, it may introduce a **compatibility issue**: how do we handle the `failOnCheckpointingErrors` config option in the future? The original thought was that `failOnCheckpointingErrors` (two effective values: 0 and Integer.MAX_VALUE) is a subset of `tolerableCheckpointFailureNumber` (0 to Integer.MAX_VALUE), and we had planned to deprecate the `failOnCheckpointingErrors` option in the third step.
   
   What do you think?
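   As a worked illustration of the "subset" relationship described above (a purely illustrative helper, not code from the PR), the legacy boolean collapses to the two extreme values of the new integer option:

```java
// Illustrative mapping only: the legacy flag corresponds to the two extremes of the
// proposed tolerableCheckpointFailureNumber range.
final class ToleranceCompatibility {

    static int toTolerableCheckpointFailureNumber(boolean failOnCheckpointingErrors) {
        return failOnCheckpointingErrors ? 0 : Integer.MAX_VALUE;
    }
}
```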




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-19 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-493817814
 
 
   Retriggered once; the result was a failure caused by a deadlock in a SQL test. Details: https://travis-ci.org/apache/flink/jobs/534067083




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-17 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-493402537
 
 
   For a clearer analysis, I plan to take three steps:
   
   * rebase onto the master branch
   * remove all four settings related to the tolerance number
   * squash unrelated commits
   
   Then we can analyze the log again. What do you think?




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-17 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-493345060
 
 
   @StefanRRichter So far, the test problems caused by this PR have been fixed, but Travis has not been very stable recently. Across the multiple builds, other unrelated failures have occurred, and Jira issues have been created to track them.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-16 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-493080226
 
 
   @StefanRRichter I have updated it. What do you think about the new commit and the whole PR?




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-15 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-492885118
 
 
   Now it seems the Travis failure is not related to this PR.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-15 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-492636770
 
 
   I have fixed the conflicts. In addition, I refactored the `CheckpointCoordinator` constructor and `ExecutionGraph#enableCheckpointing` with respect to the newly introduced `isPreferCheckpointForRecovery` parameter, because we have converged multiple parameters into a single `CheckpointCoordinatorConfiguration` instance.
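   A rough sketch of the "converge parameters into one configuration object" idea follows; the field names are assumptions for illustration, and the real `CheckpointCoordinatorConfiguration` differs.

```java
// Sketch: a single value object passed to CheckpointCoordinator and
// ExecutionGraph#enableCheckpointing instead of a long parameter list.
final class CheckpointCoordinatorConfigurationSketch {

    private final long checkpointInterval;
    private final long checkpointTimeout;
    private final int maxConcurrentCheckpoints;
    private final boolean isPreferCheckpointForRecovery;
    private final int tolerableCheckpointFailureNumber;

    CheckpointCoordinatorConfigurationSketch(
            long checkpointInterval,
            long checkpointTimeout,
            int maxConcurrentCheckpoints,
            boolean isPreferCheckpointForRecovery,
            int tolerableCheckpointFailureNumber) {
        this.checkpointInterval = checkpointInterval;
        this.checkpointTimeout = checkpointTimeout;
        this.maxConcurrentCheckpoints = maxConcurrentCheckpoints;
        this.isPreferCheckpointForRecovery = isPreferCheckpointForRecovery;
        this.tolerableCheckpointFailureNumber = tolerableCheckpointFailureNumber;
    }

    boolean isPreferCheckpointForRecovery() {
        return isPreferCheckpointForRecovery;
    }

    int getTolerableCheckpointFailureNumber() {
        return tolerableCheckpointFailureNumber;
    }
    // Remaining getters omitted; the point is that the coordinator takes this one object.
}
```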




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-15 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-492620693
 
 
   @StefanRRichter Another PR of mine, #8410, which was just merged, caused many conflicts, because both PRs change checkpoint-specific configuration. If you do not mind, I'd like to squash this PR, rebase it, and fix the conflicts.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-14 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-492471326
 
 
   Unrelated test error; I have created an issue to report it and will trigger a rebuild.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-11 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-491497620
 
 
   @StefanRRichter Sorry, I have not yet provided a final implementation for this issue; this is a temporary version just for discussion.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-05-08 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-490365680
 
 
   @StefanRRichter Really sorry for the late reply; I just took a holiday and attended the QCon conference in Beijing. I have provided a more detailed implementation so that we can discuss it more easily. The new commit will not pass Travis; it is only meant for discussion. Once we agree on the approach, I will complete this PR.




[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure

2019-04-30 Thread GitBox
yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-487947096
 
 
   @StefanRRichter This is the PR for the second step of the checkpoint failure handling improvement.

