[jira] [Updated] (FLINK-32631) FlinkSessionJob stuck in Created/Reconciling state because of No Job found error in JobManager

2024-01-04 Thread Bhupendra Yadav (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhupendra Yadav updated FLINK-32631:

Description: 
{*}Background{*}: We use Flink Kubernetes Operator 1.5.0 and submit jobs to a 
session cluster through FlinkSessionJob resources.

{*}Bug{*}: The job frequently gets stuck in the CREATED/RECONCILING state. The 
Flink operator logs show the error {_}Job could not be found{_}. Full trace 
[here|https://ideone.com/NuAyEK]. The sequence leading to it:
 # When a Flink session job is submitted, the Flink operator submits the job to 
the Flink Cluster.
 # If the job has finished (or reached another terminal state) and the job 
manager (JM) then restarts for some reason, the job no longer exists in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the 
job using its jobID, but it receives a 404 error, indicating that the job is 
not found.
 # The operator then encounters an error and logs it, leading to the job 
getting stuck in an indefinite state.
 # Attempting to restart or suspend the job using the operator's provided 
mechanisms also fails because the operator keeps calling the REST API and 
receiving the same 404 error.

{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and 
finds that it no longer exists in the Flink Cluster, it should handle the 
situation gracefully. Instead of getting stuck and logging errors indefinitely, 
the operator should mark the job as failed or deleted, or set an appropriate 
status for it.

  was:
{*}Background{*}: We are using FlinkSessionJob for submitting jobs to a session 
cluster and flink kubernetes operator 1.5.0.

{*}Bug{*}: We frequently encounter a problem where the job gets stuck in 
CREATED/RECONCILING state. On checking flink operator logs we see the error 
{_}Job could not be found{_}. Full trace [here|https://ideone.com/NuAyEK].
 # When a Flink session job is submitted, the Flink operator submits the job to 
the Flink Cluster.
 # If the Flink job manager (JM) restarts for some reason, the job may no 
longer exist in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the 
job using its jobID, but it receives a 404 error, indicating that the job is 
not found.
 # The operator then encounters an error and logs it, leading to the job 
getting stuck in an indefinite state.
 # Attempting to restart or suspend the job using the operator's provided 
mechanisms also fails because the operator keeps calling the REST API and 
receiving the same 404 error.

{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and 
finds that it no longer exists in the Flink Cluster, it should handle the 
situation gracefully. Instead of getting stuck and logging errors indefinitely, 
the operator should mark the job as failed or deleted, or set an appropriate 
status for it.


> FlinkSessionJob stuck in Created/Reconciling state because of No Job found 
> error in JobManager
> --
>
> Key: FLINK-32631
> URL: https://issues.apache.org/jira/browse/FLINK-32631
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.16.0
> Environment: Local
> Reporter: Bhupendra Yadav
> Priority: Major
>
> {*}Background{*}: We use Flink Kubernetes Operator 1.5.0 and submit jobs to a 
> session cluster through FlinkSessionJob resources.
> {*}Bug{*}: The job frequently gets stuck in the CREATED/RECONCILING state. The 
> Flink operator logs show the error {_}Job could not be found{_}. Full trace 
> [here|https://ideone.com/NuAyEK]. The sequence leading to it:
>  # When a Flink session job is submitted, the Flink operator submits the job 
> to the Flink Cluster.
>  # If the job has finished (or reached another terminal state) and the job 
> manager (JM) then restarts for some reason, the job no longer exists in the JM.
>  # Upon reconciliation, the Flink operator queries the JM's REST API for the 
> job using its jobID, but it receives a 404 error, indicating that the job is 
> not found.
>  # The operator then encounters an error and logs it, leading to the job 
> getting stuck in an indefinite state.
>  # Attempting to restart or suspend the job using the operator's provided 
> mechanisms also fails because the operator keeps calling the REST API and 
> receiving the same 404 error.
> {*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job 
> and finds that it no longer exists in the Flink Cluster, it should handle the 
> situation gracefully. Instead of getting stuck and logging errors 
> indefinitely, the operator should mark the job as failed or deleted, or set 
> an appropriate status for it.



--
This message was sent by Atlassian Jira

[jira] [Updated] (FLINK-32631) FlinkSessionJob stuck in Created/Reconciling state because of No Job found error in JobManager

2023-07-19 Thread Bhupendra Yadav (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-32631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhupendra Yadav updated FLINK-32631:

Description: 
{*}Background{*}: We use Flink Kubernetes Operator 1.5.0 and submit jobs to a 
session cluster through FlinkSessionJob resources.

{*}Bug{*}: The job frequently gets stuck in the CREATED/RECONCILING state. The 
Flink operator logs show the error {_}Job could not be found{_}. Full trace 
[here|https://ideone.com/NuAyEK]. The sequence leading to it:
 # When a Flink session job is submitted, the Flink operator submits the job to 
the Flink Cluster.
 # If the Flink job manager (JM) restarts for some reason, the job may no 
longer exist in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the 
job using its jobID, but it receives a 404 error, indicating that the job is 
not found.
 # The operator then encounters an error and logs it, leading to the job 
getting stuck in an indefinite state.
 # Attempting to restart or suspend the job using the operator's provided 
mechanisms also fails because the operator keeps calling the REST API and 
receiving the same 404 error.

{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and 
finds that it no longer exists in the Flink Cluster, it should handle the 
situation gracefully. Instead of getting stuck and logging errors indefinitely, 
the operator should mark the job as failed or deleted, or set an appropriate 
status for it.

  was:
{*}Background{*}: We are using FlinkSessionJob for submitting jobs to a session 
cluster.

{*}Bug{*}: We frequently encounter a problem where the job gets stuck in 
CREATED/RECONCILING state. On checking flink operator logs we see the error 
{_}Job could not be found{_}. Full trace [here|https://ideone.com/NuAyEK].
 # When a Flink session job is submitted, the Flink operator submits the job to 
the Flink Cluster.
 # If the Flink job manager (JM) restarts for some reason, the job may no 
longer exist in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the 
job using its jobID, but it receives a 404 error, indicating that the job is 
not found.
 # The operator then encounters an error and logs it, leading to the job 
getting stuck in an indefinite state.
 # Attempting to restart or suspend the job using the operator's provided 
mechanisms also fails because the operator keeps calling the REST API and 
receiving the same 404 error.

{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and 
finds that it no longer exists in the Flink Cluster, it should handle the 
situation gracefully. Instead of getting stuck and logging errors indefinitely, 
the operator should mark the job as failed or deleted, or set an appropriate 
status for it.


> FlinkSessionJob stuck in Created/Reconciling state because of No Job found 
> error in JobManager
> --
>
> Key: FLINK-32631
> URL: https://issues.apache.org/jira/browse/FLINK-32631
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.16.0
> Environment: Local
> Reporter: Bhupendra Yadav
> Priority: Major
>
> {*}Background{*}: We use Flink Kubernetes Operator 1.5.0 and submit jobs to a 
> session cluster through FlinkSessionJob resources.
> {*}Bug{*}: The job frequently gets stuck in the CREATED/RECONCILING state. The 
> Flink operator logs show the error {_}Job could not be found{_}. Full trace 
> [here|https://ideone.com/NuAyEK]. The sequence leading to it:
>  # When a Flink session job is submitted, the Flink operator submits the job 
> to the Flink Cluster.
>  # If the Flink job manager (JM) restarts for some reason, the job may no 
> longer exist in the JM.
>  # Upon reconciliation, the Flink operator queries the JM's REST API for the 
> job using its jobID, but it receives a 404 error, indicating that the job is 
> not found.
>  # The operator then encounters an error and logs it, leading to the job 
> getting stuck in an indefinite state.
>  # Attempting to restart or suspend the job using the operator's provided 
> mechanisms also fails because the operator keeps calling the REST API and 
> receiving the same 404 error.
> {*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job 
> and finds that it no longer exists in the Flink Cluster, it should handle the 
> situation gracefully. Instead of getting stuck and logging errors 
> indefinitely, the operator should mark the job as failed or deleted, or set 
> an appropriate status for it.





[jira] [Created] (FLINK-32631) FlinkSessionJob stuck in Created/Reconciling state because of No Job found error in JobManager

2023-07-19 Thread Bhupendra Yadav (Jira)
Bhupendra Yadav created FLINK-32631:
---

 Summary: FlinkSessionJob stuck in Created/Reconciling state 
 because of No Job found error in JobManager
 Key: FLINK-32631
 URL: https://issues.apache.org/jira/browse/FLINK-32631
 Project: Flink
 Issue Type: Bug
 Components: Kubernetes Operator
 Affects Versions: 1.16.0
 Environment: Local
 Reporter: Bhupendra Yadav


{*}Background{*}: We are using FlinkSessionJob for submitting jobs to a session 
cluster.

{*}Bug{*}: The job frequently gets stuck in the CREATED/RECONCILING state. The 
Flink operator logs show the error {_}Job could not be found{_}. Full trace 
[here|https://ideone.com/NuAyEK]. The sequence leading to it:
 # When a Flink session job is submitted, the Flink operator submits the job to 
the Flink Cluster.
 # If the Flink job manager (JM) restarts for some reason, the job may no 
longer exist in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the 
job using its jobID, but it receives a 404 error, indicating that the job is 
not found.
 # The operator then encounters an error and logs it, leading to the job 
getting stuck in an indefinite state.
 # Attempting to restart or suspend the job using the operator's provided 
mechanisms also fails because the operator keeps calling the REST API and 
receiving the same 404 error.

{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and 
finds that it no longer exists in the Flink Cluster, it should handle the 
situation gracefully. Instead of getting stuck and logging errors indefinitely, 
the operator should mark the job as failed or deleted, or set an appropriate 
status for it.
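
To make the expected behavior concrete, here is a minimal sketch in plain Java. It is not the operator's actual code: the class name, the LifecycleState enum (including the MISSING value) and the jobLookup function are all hypothetical, and an empty Optional stands in for the 404 returned by the JM's REST API. The idea is only that a "job not found" answer is mapped to a terminal status instead of being raised as an error on every reconcile loop.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class MissingJobReconcileSketch {

    /** Hypothetical lifecycle states; not the operator's real status model. */
    enum LifecycleState { RECONCILING, RUNNING, FINISHED, MISSING }

    /**
     * Decides the next state from one observation of the session cluster.
     * jobLookup stands in for the JM REST query by job ID (an assumption):
     * an empty Optional models the 404 "job not found" response.
     */
    static LifecycleState observe(String jobId,
                                  Function<String, Optional<String>> jobLookup) {
        Optional<String> jmState = jobLookup.apply(jobId);
        if (!jmState.isPresent()) {
            // Job unknown to the JM: record a terminal state instead of
            // throwing, so later reconcile loops do not repeat the lookup.
            return LifecycleState.MISSING;
        }
        switch (jmState.get()) {
            case "RUNNING":
                return LifecycleState.RUNNING;
            case "FINISHED":
                return LifecycleState.FINISHED;
            default:
                // Any other state: keep reconciling on the next loop.
                return LifecycleState.RECONCILING;
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a JM that only knows about "job-a",
        // e.g. because "job-b" finished before the JM restarted.
        Map<String, String> jobsKnownToJm = Map.of("job-a", "RUNNING");
        Function<String, Optional<String>> lookup =
                id -> Optional.ofNullable(jobsKnownToJm.get(id));

        System.out.println(observe("job-a", lookup));
        System.out.println(observe("job-b", lookup));
    }
}
```

Whether the missing job should end up as failed, deleted, or some other status is the open question in this report; the sketch only shows that the lookup result can be turned into a status update rather than an exception.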


