[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-08 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1188101108


##
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java:
##
@@ -473,26 +500,50 @@ public CompletableFuture cancel(Time 
timeout) {
 @Override
 public CompletableFuture updateTaskExecutionState(
 final TaskExecutionState taskExecutionState) {
-FlinkException taskExecutionException;
+checkNotNull(taskExecutionState, "taskExecutionState");
+// Use the main/caller thread for all updates to make sure they are 
processed in order.
+// (MainThreadExecutor i.e., the akka thread pool does not guarantee 
that)
+// Only detach for a FAILED state update that is terminal and may 
perform io heavy labeling.
+if 
(ExecutionState.FAILED.equals(taskExecutionState.getExecutionState())) {
+return labelFailure(taskExecutionState)
+.thenApplyAsync(
+taskStateWithLabels -> {
+try {
+return 
doUpdateTaskExecutionState(taskStateWithLabels);
+} catch (FlinkException e) {
+throw new CompletionException(e);
+}
+},
+getMainThreadExecutor());
+}
 try {
-checkNotNull(taskExecutionState, "taskExecutionState");
+return CompletableFuture.completedFuture(
+doUpdateTaskExecutionState(taskExecutionState));
+} catch (FlinkException e) {
+return FutureUtils.completedExceptionally(e);
+}
+}
 
+private Acknowledge doUpdateTaskExecutionState(final TaskExecutionState 
taskExecutionState)
+throws FlinkException {
+@Nullable FlinkException taskExecutionException;
+try {
 if (schedulerNG.updateTaskExecutionState(taskExecutionState)) {

Review Comment:
   Looks like the main decision we have to take here is if the failure-labels 
going to be used by restart strategies or not -- relying on enrichment for 
restarts is what makes it crucial.
   
   The existing implementation was [based on the 
assumption](https://lists.apache.org/thread/tq8yrncg7zqtpc8ddpxrkxfpovs1wkkw) 
that labels are going to be used by the custom restart strategies in the 
future. Since we wanted them asynchronous, the less risky way was through 
existing async calls e.g., `JobMaster#updateTaskExecutionState`, and probably 
modifying the InternalFailuresListener (rather than changing SchedulerNG update 
state to async).
   
   Deciding the failure enrichment is crucial enough to be synchronous -- maybe 
part of `DefaultScheduler#restartTasksWithDelay`--  is also an option. 
   
   However, decoupling failure labels completely from restart strategies sounds 
like a step back here.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-05 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases within the ExecutionGraph that are not 
easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   The intuition here was that all these failures are internal and could maybe 
skip labeling.
   Davids suggestion following all the `fromSchedulerNg  == false` paths could 
also work (need to verify) --  but happy to discuss other thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-05 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases within the ExecutionGraph that are not 
easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   The intuition here was that all these failures are internal and could maybe 
skip labeling.
   Davids suggestion following all the `fromSchedulerNg  == false` paths could 
also work --  but happy to discuss other thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-05 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases within the ExecutionGraph that are not 
easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   The intuition here was that all these failures are internal and could maybe 
skip labeling -- Davids suggestion following all the `fromSchedulerNg  == 
false` paths also seems to cover this --  but happy to discuss other thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-05 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases part of the Deployment/Execution that are 
not easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   The intuition here was that all these failures are internal and could maybe 
skip labeling -- Davids suggestion following all the `fromSchedulerNg  == 
false` paths also seems to cover this --  but happy to discuss other thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases part of the Deployment/Execution that are 
not easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   The intuition here is that all failures are internal and could skip labeling 
-- but happy to discuss other thoughts?



##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases part of the Deployment/Execution that are 
not easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   The intuition here is that all these failures are internal and could skip 
labeling -- but happy to discuss other thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases part of the Deployment/Execution that are 
not easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   The intuition here is that all failures are internal and could skip labeling 
-- but happy to discuss alternatives



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases part of the Deployment/Execution that are 
not easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
 
   
   Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases part of the Deployment/Execution that are 
not easily covered, e.g., 
[deploy](https://github.com/pgaref/flink/blob/41728f1bae85ce19f843fc18d0d80ddfe75af6b9/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L608),
 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104),
 
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
   
   Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But there are still corner cases that are not easily covered, e.g., 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104)
  
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
   
   Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185674060


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java:
##
@@ -778,7 +778,7 @@ private static PartitionInfo createFinishedPartitionInfo(
  */
 @Override
 public void fail(Throwable t) {
-processFail(t, true);
+processFail(t, true, Collections.emptyMap());

Review Comment:
   Again this is covered with the next PR: 
https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR781-R782
   
   But still there are corner cases that are not easily covered, e.g., 
[triggerTaskFailover](https://github.com/apache/flink/pull/22511/files#diff-c7f8fc6bbf59b935414529c76720dcafce6105e16783b918af6b06a10e247ab2R103-R104)
  
[UpdatePartitionInfo](https://github.com/apache/flink/pull/22511/files#diff-f9e390431932be9a4505e581632815b81c1f0520e65cec7614ea4b90c8a52afcR1378)
   
   Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185672001


##
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java:
##
@@ -473,26 +500,50 @@ public CompletableFuture cancel(Time 
timeout) {
 @Override
 public CompletableFuture updateTaskExecutionState(
 final TaskExecutionState taskExecutionState) {
-FlinkException taskExecutionException;
+checkNotNull(taskExecutionState, "taskExecutionState");
+// Use the main/caller thread for all updates to make sure they are 
processed in order.
+// (MainThreadExecutor i.e., the akka thread pool does not guarantee 
that)
+// Only detach for a FAILED state update that is terminal and may 
perform io heavy labeling.
+if 
(ExecutionState.FAILED.equals(taskExecutionState.getExecutionState())) {
+return labelFailure(taskExecutionState)
+.thenApplyAsync(
+taskStateWithLabels -> {
+try {
+return 
doUpdateTaskExecutionState(taskStateWithLabels);
+} catch (FlinkException e) {
+throw new CompletionException(e);
+}
+},
+getMainThreadExecutor());
+}
 try {
-checkNotNull(taskExecutionState, "taskExecutionState");
+return CompletableFuture.completedFuture(
+doUpdateTaskExecutionState(taskExecutionState));
+} catch (FlinkException e) {
+return FutureUtils.completedExceptionally(e);
+}
+}
 
+private Acknowledge doUpdateTaskExecutionState(final TaskExecutionState 
taskExecutionState)
+throws FlinkException {
+@Nullable FlinkException taskExecutionException;
+try {
 if (schedulerNG.updateTaskExecutionState(taskExecutionState)) {

Review Comment:
   Btw, InternalFailuresListener#notifyTaskFailure is already covered as part 
of FLINK-31891 -- when a TM disconnection happens we need to Release Payload 
Slot and since the error is not fromSchedulerNg we use the 
internalTaskFailuresListener: 
https://github.com/apache/flink/pull/22511/files#diff-d535f910a10f835962b0637e12014068a9727b2152a84223fd9b1bf9c6c074d6R1665
   
   This is not however covering the `onMissingDeploymentsOf` case David 
mentioned above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-04 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1185667103


##
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java:
##
@@ -98,4 +112,13 @@ public String getExceptionAsString() {
 public long getTimestamp() {
 return timestamp;
 }
+
+/**
+ * Returns the labels associated with the exception.
+ *
+ * @return Map of exception labels
+ */
+public Map getLabels() {
+return labels;

Review Comment:
   Great idea, added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-03 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1184449298


##
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/factories/DefaultJobMasterServiceFactory.java:
##
@@ -122,6 +123,7 @@ private JobMasterService internalCreateJobMasterService(
 DefaultExecutionDeploymentReconciler::new,
 BlocklistUtils.loadBlocklistHandlerFactory(
 jobMasterConfiguration.getConfiguration()),
+Collections.emptySet(),

Review Comment:
   Right, this is a placeholder for loaded FailureEnrichers using a conf 
(probably from ClusterEndpoint or Dispatcher -- empty for now to facilitate 
testing) -- tracking this as part of FLINK-31993



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-03 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1184450001


##
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/exceptionhistory/ExceptionHistoryEntryMatcher.java:
##
@@ -87,6 +107,18 @@ protected boolean matchesSafely(
 match = false;
 }
 
+if (exceptionHistoryEntry.getLabels() == null) {
+if (expectedLabels != null) {
+description.appendText(" actualLabels=null");
+match = false;
+}
+} else if (exceptionHistoryEntry.getLabels().equals(expectedLabels)) {
+description
+.appendText(" actualLabels=")
+
.appendText(String.valueOf(exceptionHistoryEntry.getLabels()));
+match = false;

Review Comment:
   Agreed, this is now simplified



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] pgaref commented on a diff in pull request #22506: [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling

2023-05-03 Thread via GitHub


pgaref commented on code in PR #22506:
URL: https://github.com/apache/flink/pull/22506#discussion_r1184449298


##
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/factories/DefaultJobMasterServiceFactory.java:
##
@@ -122,6 +123,7 @@ private JobMasterService internalCreateJobMasterService(
 DefaultExecutionDeploymentReconciler::new,
 BlocklistUtils.loadBlocklistHandlerFactory(
 jobMasterConfiguration.getConfiguration()),
+Collections.emptySet(),

Review Comment:
   Right, this is a placeholder for loaded FailureEnrichers using a conf (empty 
for now to facilitate testing) -- tracking this as part of FLINK-31993



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org