Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
shahar1 commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3858488161
> Hi - we have been using `apache-airflow-providers-google==19.3.0` for the past several weeks and the issue recurred:
>
> ```
> [2026-02-02 04:42:52] ERROR - Exception occurred while checking for job completion. source=airflow.providers.google.cloud.triggers.dataflow.TemplateJobStartTrigger loc=dataflow.py:149
> ServiceUnavailable: 503 Visibility check was unavailable. Please retry the request and contact support if the problem persists
> File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/triggers/dataflow.py", line 113 in run
> File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/hooks/dataflow.py", line 1480 in get_job_status
> File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/hooks/dataflow.py", line 1457 in get_job
> File "/home/airflow/.local/lib/python3.12/site-packages/google/cloud/dataflow_v1beta3/services/jobs_v1_beta3/async_client.py", line 478 in get_job
> File "/home/airflow/.local/lib/python3.12/site-packages/google/api_core/grpc_helpers_async.py", line 88 in __await__
> AioRpcError: status = StatusCode.UNAVAILABLE
> details = "Visibility check was unavailable. Please retry the request and contact support if the problem persists"
> debug_error_string = "UNKNOWN:Error received from peer ipv4:74.125.126.95:443 {created_time:"2026-02-02T12:42:52.166480427+00:00", grpc_status:14, grpc_message:"Visibility check was unavailable. Please retry the request and contact support if the problem persists"}"
> File "/home/airflow/.local/lib/python3.12/site-packages/google/api_core/grpc_helpers_async.py", line 85 in __await__
> File "/home/airflow/.local/lib/python3.12/site-packages/grpc/aio/_interceptor.py", line 472 in __await__
> File "/home/airflow/.local/lib/python3.12/site-packages/grpc/aio/_call.py", line 327 in __await__
> ```
>
> The task retry error from the initial bug report (`ValueError: dictionary update sequence element #0 has length 1; 2 is required`) no longer occurs, but, as expected, the task is still considered failed since the triggerer marked it as failed, and the Dataflow job is _not_ actually retried (nor does the triggerer or any other component attempt to get the status a second time).
>
> ```
> [2026-02-02 04:45:10] ERROR - Task failed with exception source=task loc=task_runner.py:972
> AirflowException: 503 Visibility check was unavailable. Please retry the request and contact support if the problem persists
> File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 920 in run
> File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", line 1307 in _execute_task
> File "/home/airflow/.local/lib/python3.12/site-packages/airflow/sdk/bases/operator.py", line 1632 in resume_execution
> File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/google/cloud/operators/dataflow.py", line 650 in execute_complete
> ```
>
> > manage to find how to reproduce it, please comment (without clear reproduction steps there's not too much that we can do).
>
> It is dependent on the GCP Dataflow API returning a 503; the only way to reliably reproduce it would be with a mock HTTP server that the GCP client connects to, having it return a 503 (or by mocking out the `JobsV1Beta3AsyncClient.get_job` method to throw the AioRpcError in the stack trace above). I don't have permissions to reopen this issue, but it is definitely still a bug in the provider's retry logic.
I've reopened it. If you or someone else could implement the mock for reproducing the issue, it would be helpful.
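For reference, the mocking approach described in the quoted comment could be sketched without any GCP access. The following is a self-contained, stdlib-only stand-in (names like `FakeServiceUnavailable` and `poll_job` are illustrative, not the provider's real API): it mimics a poll loop that turns the first exception into a terminal error event, and patches the status call to raise a 503.

```python
import asyncio
from unittest import mock

class FakeServiceUnavailable(Exception):
    """Illustrative stand-in for google.api_core.exceptions.ServiceUnavailable (503)."""

async def poll_job(get_job_status):
    # Mimics the trigger's loop as described in the issue: any exception
    # becomes a terminal "error" event; there is no second status fetch.
    while True:
        try:
            status = await get_job_status()
        except Exception as e:
            return {"status": "error", "message": str(e)}
        if status == "JOB_STATE_DONE":
            return {"status": "success", "message": "Job completed"}
        await asyncio.sleep(0)

# Mock the status call to raise a 503 on the very first poll.
get_job_status = mock.AsyncMock(
    side_effect=FakeServiceUnavailable("503 Visibility check was unavailable")
)
event = asyncio.run(poll_job(get_job_status))
print(event)  # {'status': 'error', 'message': '503 Visibility check was unavailable'}
```

A real reproduction would patch `JobsV1Beta3AsyncClient.get_job` the same way (via `mock.patch` with an `AsyncMock` side effect) and run the actual `TemplateJobStartTrigger.run()`.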
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
shahar1 closed issue #57359: Google Dataflow provider does not retry on service 503 errors
URL: https://github.com/apache/airflow/issues/57359
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
shahar1 commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3795303706
I'm closing this issue as non-reproducible. If you encounter this issue again after trying the newest version and manage to find how to reproduce it, please comment (without clear reproduction steps there's not too much that we can do).
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
pmcquighan-camus commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3746389415
Hi @olegkachur-e - Thanks, I will give that a try. Since this particular error occurs when GCP returns 503s, it is hard to predict when it might recur, but I will see if this makes things a little more stable.
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
olegkachur-e commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3724581013
Hi @pmcquighan-camus, I did some tests and took a closer look at the logs you provided. The retry error `ValueError: dictionary update sequence element #0 has length 1; 2 is required` is linked to a link-construction problem that was fixed a while ago in google-provider >= 18.1.0 (https://github.com/apache/airflow/pull/55821). With that fixed, the task retry should work. Can you please try with a newer version of the google provider and share the results?
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
pmcquighan-camus commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3618389257
This can happen from any DAG using the `DataflowStartFlexTemplateOperator` (possibly only in deferrable mode?). The log from `TemplateJobStartTrigger` is from the triggerer while waiting for the job (launched by `DataflowStartFlexTemplateOperator`) to complete. The triggerer marks the task as failed, and then on each retry the `DataflowStartFlexTemplateOperator` tries to resume execution, sees that the job was marked as failed, and dies again.
A sample task is defined like this, but it's not super useful without having a flex template defined in your GCP project:
```
DataflowStartFlexTemplateOperator(
    task_id="mytask",
    body={
        "launchParameter": {
            "containerSpecGcsPath": "gs:///templates/",  # Need a Dataflow flex template defined
            "environment": {},  # Any job-specific parameters needed here, like workerRegion
            "jobName": "sample-job",
            "parameters": {},  # Any params here
        },
    },
    location="",
    project_id="",
    deferrable=True,
    append_job_name=True,  # Add a unique suffix to job names, so retries will create unique names
)
```
Since this failure case depends on GCP throwing 503s, it's not very easy to replicate. The trigger catches the 503 exception and sets a TriggerEvent with an error [here](https://github.com/apache/airflow/blob/3.1.0/providers/google/src/airflow/providers/google/cloud/triggers/dataflow.py#L113-L150), which is where something like checking the exception for a 503 from the service could be added; if that's the case, the trigger could continue looping like it does when the job [is still running](https://github.com/apache/airflow/blob/3.1.0/providers/google/src/airflow/providers/google/cloud/triggers/dataflow.py#L144).
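The suggested change (treat a 503 as transient and keep looping, as the trigger already does for a still-running job) could be sketched roughly as below. This is a stdlib-only illustration, not the provider's actual code; `FakeServiceUnavailable` and the `max_transient_errors` cap are assumed names for the sketch.

```python
import asyncio

class FakeServiceUnavailable(Exception):
    """Illustrative stand-in for a 503 from the service."""

async def poll_job(get_job_status, max_transient_errors=5):
    # Proposed behavior: on a 503, keep looping instead of emitting a
    # terminal error event; only give up after repeated consecutive failures.
    transient = 0
    while True:
        try:
            status = await get_job_status()
        except FakeServiceUnavailable as e:
            transient += 1
            if transient > max_transient_errors:
                return {"status": "error", "message": str(e)}
            await asyncio.sleep(0)
            continue
        transient = 0  # reset the counter on any successful poll
        if status == "JOB_STATE_DONE":
            return {"status": "success", "message": "Job completed"}
        await asyncio.sleep(0)

# One transient 503 followed by a running and then completed job.
responses = iter([FakeServiceUnavailable("503"), "JOB_STATE_RUNNING", "JOB_STATE_DONE"])

async def get_job_status():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item

event = asyncio.run(poll_job(get_job_status))
print(event["status"])  # success
```

With this shape, a single service-level blip no longer terminates the trigger, while a persistent outage still surfaces as an error event.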
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
olegkachur-e commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3611288443
Hi @pmcquighan-camus! Thank you for reporting this issue. I spotted two operators in the logs: `TemplateJobStartTrigger` and `DataflowStartFlexTemplateOperator` (as a retry?). Can you please clarify your usage? A sample DAG for reproduction would also be highly appreciated.
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
pmcquighan-camus commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3457423860
> [@pmcquighan-camus](https://github.com/pmcquighan-camus), just my two cents, but I think that we should stick to Airflow retries for something like this. When you add retries within the logic in `.execute()`, it can cause some general confusion/hinder understanding as to what's actually going on.
>
> Let's say I'm a new user, and I only ever want my job to retry 3 times. I'd set my `retries=3` at the Task-level. Now, unknown to me, there is logic in the operator that retries 10 times without the Task failing. This would be unintended behavior.

That makes sense to me in general. In this specific instance, the actual Dataflow job was only ever launched/tried once, and then the task failed while polling for completion status in the trigger. Airflow did 2 retries, but both seemed to fail immediately: the trigger was marked as "failed" from the first attempt, and the 2nd/3rd attempts just started executing with `on_complete` and failed immediately [here](https://github.com/apache/airflow/blob/3.1.0/providers/google/src/airflow/providers/google/cloud/operators/dataflow.py#L646-L649). My understanding of what happened is that the 3 Airflow task attempts resulted in 1 Dataflow job being executed, and all 3 task attempts failed from 1 service-level 503 when polling for status. I feel that the 2nd/3rd tries should attempt to re-run the Dataflow job (or at least retry querying for the status of the job, since the Dataflow job did in fact continue executing to completion, and it would be surprising to have multiple Dataflow jobs running for one Airflow task).
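The failure mode discussed in this thread (every task retry replaying the trigger's stored error event) can be illustrated with a small stand-in. The `execute_complete` below is a simplified sketch of the operator's resume path, not the real implementation:

```python
# Simplified stand-in for the operator's resume path: when a deferred task
# is retried, execute_complete() receives the trigger's stored event; if
# that event was an error, it raises immediately, so every Airflow retry
# fails without re-launching the job or re-querying its status.
class TaskFailed(Exception):
    pass

def execute_complete(event):
    if event["status"] == "error":
        raise TaskFailed(event["message"])
    return event["message"]

stored_event = {"status": "error", "message": "503 Visibility check was unavailable"}

outcomes = []
for attempt in range(1, 4):  # 3 task attempts, all replaying one stored event
    try:
        execute_complete(stored_event)
        outcomes.append("success")
    except TaskFailed:
        outcomes.append("failed")
print(outcomes)  # ['failed', 'failed', 'failed']
```

This is why task-level `retries` alone cannot recover from a single polling 503: the retries never reach the point of polling the service again.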
Re: [I] Google Dataflow provider does not retry on service 503 errors [airflow]
jroachgolf84 commented on issue #57359:
URL: https://github.com/apache/airflow/issues/57359#issuecomment-3456275091
@pmcquighan-camus, just my two cents, but I think that we should stick to Airflow retries for something like this. When you add retries within the logic in `.execute()`, it can cause some general confusion/hinder understanding as to what's actually going on.
Let's say I'm a new user, and I only ever want my job to retry 3 times. I'd set my `retries=3` at the Task-level. Now, unknown to me, there is logic in the operator that retries 10 times without the Task failing. This would be unintended behavior.
