Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2026-02-05 Thread via GitHub


shahar1 commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3858488161

   > Hi - we have been using `apache-airflow-providers-google==19.3.0` for the 
past several weeks and the issue recurred:
   > 
   > ```
   > [2026-02-02 04:42:52] ERROR - Exception occurred while checking for job 
completion. 
source=airflow.providers.google.cloud.triggers.dataflow.TemplateJobStartTrigger 
loc=dataflow.py:149
   > ServiceUnavailable: 503 Visibility check was unavailable. Please retry the 
request and contact support if the problem persists
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/triggers/dataflow.py",
 line 113 in run
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/hooks/dataflow.py",
 line 1480 in get_job_status
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/hooks/dataflow.py",
 line 1457 in get_job
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/google/cloud/dataflow_v1beta3/services/jobs_v1_beta3/async_client.py",
 line 478 in get_job
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/google/api_core/grpc_helpers_async.py",
 line 88 in __await__
   > AioRpcError: status = StatusCode.UNAVAILABLE
   >details = "Visibility check was unavailable. Please retry the request 
and contact support if the problem persists"
   >debug_error_string = "UNKNOWN:Error received from peer 
ipv4:74.125.126.95:443 {created_time:"2026-02-02T12:42:52.166480427+00:00", 
grpc_status:14, grpc_message:"Visibility check was unavailable. Please retry 
the request and contact support if the problem persists"}"
   > >
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/google/api_core/grpc_helpers_async.py",
 line 85 in __await__
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/grpc/aio/_interceptor.py", 
line 472 in __await__
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/grpc/aio/_call.py", line 327 
in __await__
   > ```
   > 
   > The task retry error from the initial bug report (`ValueError: dictionary 
update sequence element #0 has length 1; 2 is required`) no longer occurs, but 
as expected the task is still simply considered failed, since the triggerer 
marked it as failed; the Dataflow job is _not_ actually retried, nor does the 
triggerer or any other component attempt to fetch the status a second time.
   > 
   > ```
   > [2026-02-02 04:45:10] ERROR - Task failed with exception source=task 
loc=task_runner.py:972
   > AirflowException: 503 Visibility check was unavailable. Please retry the 
request and contact support if the problem persists
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
 line 920 in run
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
 line 1307 in _execute_task
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py",
 line 1632 in resume_execution
   > File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/operators/dataflow.py",
 line 650 in execute_complete
   > ```
   > 
   > > manage to find how to reproduce it, please comment (without clear 
reproduction steps there's not too much that we can do).
   > 
   > It depends on the GCP Dataflow API returning a 503. The only way to 
reliably reproduce it would be with a mock HTTP server that the GCP client 
connects to, having it return a 503 (or by mocking out the 
`JobsV1Beta3AsyncClient.get_job` method to throw the `AioRpcError` in the 
stack trace above). I don't have permissions to reopen this issue, but it is 
definitely still a bug in the provider's retry logic.
   
   I've reopened it - if you or someone else could implement a mock for 
reproducing the issue, it would be helpful.
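A minimal sketch of such a mock, assuming nothing beyond the standard library: the `FakeJobsClient` and the `ServiceUnavailable` stub below stand in for `JobsV1Beta3AsyncClient` and `google.api_core.exceptions.ServiceUnavailable`; in a real test you would `mock.patch.object` the provider's actual client class the same way.

```python
import asyncio
from unittest import mock


class ServiceUnavailable(Exception):
    """Stand-in for google.api_core.exceptions.ServiceUnavailable."""


class FakeJobsClient:
    """Stand-in for JobsV1Beta3AsyncClient; only get_job is modeled."""

    async def get_job(self, request):
        return {"currentState": "JOB_STATE_RUNNING"}


async def poll_once(client):
    # Mirrors a single status poll, as the trigger's hook performs it.
    return await client.get_job(request={"job_id": "job-1"})


def reproduce():
    """Patch get_job to raise the 503 seen in the logs; return its message."""
    with mock.patch.object(
        FakeJobsClient,
        "get_job",
        side_effect=ServiceUnavailable("503 Visibility check was unavailable"),
    ):
        try:
            asyncio.run(poll_once(FakeJobsClient()))
        except ServiceUnavailable as exc:
            return str(exc)
    return None


print(reproduce())  # 503 Visibility check was unavailable
```

Pointing the same `side_effect` patch at the real client class inside a pytest case would deterministically exercise the trigger's 503 path without any live GCP traffic.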


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2026-02-04 Thread via GitHub


pmcquighan-camus commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3848772245

   Hi - we have been using `apache-airflow-providers-google==19.3.0` for the 
past several weeks and the issue recurred:
   
   ```
   [2026-02-02 04:42:52] ERROR - Exception occurred while checking for job 
completion. 
source=airflow.providers.google.cloud.triggers.dataflow.TemplateJobStartTrigger 
loc=dataflow.py:149
   ServiceUnavailable: 503 Visibility check was unavailable. Please retry the 
request and contact support if the problem persists
   File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/triggers/dataflow.py",
 line 113 in run
   File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/hooks/dataflow.py",
 line 1480 in get_job_status
   File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/hooks/dataflow.py",
 line 1457 in get_job
   File 
"/home/airflow/.local/lib/python3.12/site-packages/google/cloud/dataflow_v1beta3/services/jobs_v1_beta3/async_client.py",
 line 478 in get_job
   File 
"/home/airflow/.local/lib/python3.12/site-packages/google/api_core/grpc_helpers_async.py",
 line 88 in __await__
   AioRpcError: 
   File 
"/home/airflow/.local/lib/python3.12/site-packages/google/api_core/grpc_helpers_async.py",
 line 85 in __await__
   File 
"/home/airflow/.local/lib/python3.12/site-packages/grpc/aio/_interceptor.py", 
line 472 in __await__
   File "/home/airflow/.local/lib/python3.12/site-packages/grpc/aio/_call.py", 
line 327 in __await__
   ```
   
   The task retry error from the initial bug report (`ValueError: dictionary 
update sequence element #0 has length 1; 2 is required`) no longer occurs, but 
as expected the task is still simply considered failed, since the triggerer 
marked it as failed; the Dataflow job is *not* actually retried, nor does the 
triggerer or any other component attempt to fetch the status a second time.

   ```
   [2026-02-02 04:45:10] ERROR - Task failed with exception source=task 
loc=task_runner.py:972
   AirflowException: 503 Visibility check was unavailable. Please retry the 
request and contact support if the problem persists
   File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
 line 920 in run
   File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py",
 line 1307 in _execute_task
   File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py",
 line 1632 in resume_execution
   File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/operators/dataflow.py",
 line 650 in execute_complete
   ```
   
   >  manage to find how to reproduce it, please comment (without clear 
reproduction steps there's not too much that we can do).
   
   It depends on the GCP Dataflow API returning a 503. The only way to 
reliably reproduce it would be with a mock HTTP server that the GCP client 
connects to, having it return a 503 (or by mocking out the 
`JobsV1Beta3AsyncClient.get_job` method to throw the `AioRpcError` in the 
stack trace above). I don't have permissions to reopen this issue, but it is 
definitely still a bug in the provider's retry logic.





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2026-01-24 Thread via GitHub


shahar1 closed issue #57359: Google Dataflow provider does not retry on service 
503 errors
URL: https://github.com/apache/airflow/issues/57359





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2026-01-24 Thread via GitHub


shahar1 commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3795303706

   I'm closing this issue as non-reproducible. If you encounter it again after 
trying the newest version and manage to find out how to reproduce it, please 
comment (without clear reproduction steps, there's not much we can do).





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2026-01-13 Thread via GitHub


pmcquighan-camus commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3746389415

   Hi @olegkachur-e - thanks, I will give that a try. Since this particular 
error occurs when GCP returns 503s, it's hard to predict when it might recur, 
but I will see if this makes things a little more stable.





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2026-01-08 Thread via GitHub


olegkachur-e commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3724581013

   Hi, @pmcquighan-camus
   
   I did some tests and took a closer look at the logs you provided. The retry 
error `ValueError: dictionary update sequence element #0 has length 1; 2 is 
required` is linked to a link-construction problem that was fixed a while ago 
in the google provider >= 18.1.0 
(https://github.com/apache/airflow/pull/55821).
   
   With that fix in place, the task retry should work.
   
   Can you please try with a newer version of the google provider and share 
the results?





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2025-12-05 Thread via GitHub


pmcquighan-camus commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3618389257

   This can happen from any DAG using the `DataflowStartFlexTemplateOperator` 
(possibly only in deferrable mode?). The log from `TemplateJobStartTrigger` is 
from the triggerer while it waits for the job (launched by 
`DataflowStartFlexTemplateOperator`) to complete. The triggerer marks the task 
as failed, and then on retries the `DataflowStartFlexTemplateOperator` tries 
to resume executing, sees that the job was marked as failed, and dies again.
   
   A sample task is defined like this, though it's not very useful without a 
flex template defined in your GCP project:
   ```
   DataflowStartFlexTemplateOperator(
       task_id="mytask",
       body={
           "launchParameter": {
               "containerSpecGcsPath": "gs:///templates/",  # Need a dataflow flex template defined
               "environment": {},  # Any job-specific parameters needed here, like workerRegion
               "jobName": "sample-job",
               "parameters": {},  # Any params here
           },
       },
       location="",
       project_id="",
       deferrable=True,
       # Add a unique suffix to job names, so retries create unique names
       append_job_name=True,
   )
   ```
   
   Since this failure case depends on GCP throwing 503s, it's not very easy to 
replicate. The trigger catches the 503 exception and sets an error 
TriggerEvent 
[here](https://github.com/apache/airflow/blob/3.1.0/providers/google/src/airflow/providers/google/cloud/triggers/dataflow.py#L113-L150); 
that is where a fix could check the exception for a 503 from the service and, 
if so, continue looping just as it does when the job [is still 
running](https://github.com/apache/airflow/blob/3.1.0/providers/google/src/airflow/providers/google/cloud/triggers/dataflow.py#L144).
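A hedged, self-contained sketch of that suggestion (the names `run_trigger`, `max_503_retries`, and the `ServiceUnavailable` stub are illustrative, not the provider's actual API): a transient 503 during polling increments a counter and continues the loop, instead of immediately yielding an error event.

```python
import asyncio


class ServiceUnavailable(Exception):
    """Stand-in for google.api_core.exceptions.ServiceUnavailable."""


async def run_trigger(get_job_status, poll_interval=0.0, max_503_retries=3):
    """Poll for job status, tolerating up to max_503_retries consecutive 503s."""
    consecutive_503s = 0
    while True:
        try:
            status = await get_job_status()
        except ServiceUnavailable:
            consecutive_503s += 1
            if consecutive_503s > max_503_retries:
                # Only give up after repeated consecutive failures.
                return {"status": "error", "message": "503 retries exhausted"}
            await asyncio.sleep(poll_interval)  # back off, then poll again
            continue
        consecutive_503s = 0  # a successful poll resets the budget
        if status == "JOB_STATE_DONE":
            return {"status": "success"}
        await asyncio.sleep(poll_interval)


# Example: one 503 followed by a normal completion -> success, not failure.
_responses = [ServiceUnavailable(), "JOB_STATE_RUNNING", "JOB_STATE_DONE"]


async def fake_get_job_status():
    item = _responses.pop(0)
    if isinstance(item, Exception):
        raise item
    return item


result = asyncio.run(run_trigger(fake_get_job_status))
print(result)  # {'status': 'success'}
```

Bounding the retries (rather than looping forever) keeps a genuinely unavailable service from pinning the trigger indefinitely.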





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2025-12-04 Thread via GitHub


olegkachur-e commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3611288443

   Hi @pmcquighan-camus! 
   
   Thank you for reporting this issue. 
   I spotted two operators in the logs: `TemplateJobStartTrigger` and 
`DataflowStartFlexTemplateOperator` (as a retry?). 
   
   Can you please clarify your usage? A sample DAG for reproduction would also 
be highly appreciated.





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2025-10-28 Thread via GitHub


pmcquighan-camus commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3457423860

   > [@pmcquighan-camus](https://github.com/pmcquighan-camus), just my two 
cents, but I think that we should stick to Airflow retries for something like 
this. When you add retries within the logic in `.execute()`, it can cause some 
general confusion/hinder understanding as to what's actually going on.
   > 
   > Let's say I'm a new user, and I only ever want my job to retry 3 times. 
I'd set my `retries=3` at the Task-level. Now, unknown to me, there is logic in 
the operator that retries 10 times without the Task failing. This would be 
unintended behavior.
   
   That makes sense to me in general. In this specific instance, the actual 
Dataflow job was only ever launched once, and the task then failed while 
polling for completion status in the trigger. Airflow did 2 retries, but both 
seemed to fail immediately: the trigger was marked as "failed" from the first 
attempt, and the 2nd/3rd attempts just started executing with 
`execute_complete` and failed immediately 
[here](https://github.com/apache/airflow/blob/3.1.0/providers/google/src/airflow/providers/google/cloud/operators/dataflow.py#L646-L649). 
My understanding is that the 3 Airflow task attempts resulted in 1 Dataflow 
job being executed, and all 3 attempts failed from a single service-level 503 
while polling for status.
   
   I feel that the 2nd/3rd tries should attempt to re-run the Dataflow job (or 
at least retry querying the job's status, since the Dataflow job did in fact 
run to completion, and it would be surprising to have multiple Dataflow jobs 
running for one Airflow task).





Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]

2025-10-28 Thread via GitHub


jroachgolf84 commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3456275091

   @pmcquighan-camus, just my two cents, but I think that we should stick to 
Airflow retries for something like this. When you add retries within the logic 
of `.execute()`, it can cause confusion and hinder understanding of what's 
actually going on. 
   
   Let's say I'm a new user, and I only ever want my job to retry 3 times. I'd 
set `retries=3` at the task level. Now, unknown to me, there is logic in the 
operator that retries 10 times without the task failing. This would be 
unintended behavior.
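A toy arithmetic model of that concern (the function name is invented for illustration): if an operator silently retried internally, the effective attempt count would multiply past what the user's task-level `retries` setting implies.

```python
def effective_attempts(task_retries: int, internal_retries: int) -> int:
    """Total API attempts = (task tries) x (internal tries per task try)."""
    return (task_retries + 1) * (internal_retries + 1)


# A user setting retries=3 expects 4 attempts; hidden internal retries of 10
# per attempt would instead produce 44 attempts against the service.
print(effective_attempts(3, 0))   # 4
print(effective_attempts(3, 10))  # 44
```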

