Re: [I] Airflow task stuck on DEFERRED status [airflow]
gopidesupavan closed issue #43634: Airflow task stuck on DEFERRED status URL: https://github.com/apache/airflow/issues/43634
Re: [I] Airflow task stuck on DEFERRED status [airflow]
gopidesupavan commented on issue #43634: URL: https://github.com/apache/airflow/issues/43634#issuecomment-2465514983

This is difficult to reproduce and debug. There are trigger-related optimizations in Airflow 2.10.3, so please try upgrading and check whether the issue persists. Moving this to discussions.
Re: [I] Airflow task stuck on DEFERRED status [airflow]
kunaljubce commented on issue #43634: URL: https://github.com/apache/airflow/issues/43634#issuecomment-2464412214

@pvieito I looked through the logs and they do not tell us much about the actual issue. I would point you back to the platform/devops team that maintains your Airflow environment to check whether this is memory related, because the logs show some of the tasks being submitted, executed, and completed normally; only after a period of time does your triggerer start misbehaving and leaving tasks deferred indefinitely.
Re: [I] Airflow task stuck on DEFERRED status [airflow]
pvieito commented on issue #43634: URL: https://github.com/apache/airflow/issues/43634#issuecomment-2460358188

Hi @kunaljubce! I attach the full logs from the tasks, workers & scheduler from 2024-11-04T09:00:00 to 2024-11-04T11:30:00.

- At around 9:00 UTC, the task with AWS Batch identifier '16ffa872-59c1-4eb9-a7b4-50622364db0b', for example, is correctly deferred and `waiter_with_logging.py` logs the expected checks.
- Everything seems to work fine until around 10:00 UTC. From then on, all tasks (for example the one with AWS Batch identifier 'c6157b39-4197-4a98-9a0d-2310866a2495') are deferred but no checks are performed: there are no more `waiter_with_logging.py` logs after the one at 2024-11-04 09:55:39.871.

[Airflow-Logs.csv](https://github.com/user-attachments/files/17650761/Airflow-Logs.csv)
Re: [I] Airflow task stuck on DEFERRED status [airflow]
kunaljubce commented on issue #43634: URL: https://github.com/apache/airflow/issues/43634#issuecomment-2457503916

@pvieito - Thanks for these; nothing apparent jumps out from these logs. Would it be possible for you to look at all logs for the triggerer, scheduler, etc.? Maybe try clearing the tasks and then follow the system logs to see if they print anything referring to your deferred tasks. If you can add those logs below, we can try looking for a possible issue, because given your information that nothing changed on the Airflow platform side, there doesn't seem to be much for us to go on.

cc: @andrewgodwin I see a similar issue reported long back (#25630), hence tagging you here.
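(Not from the original thread, but possibly useful for the triggerer check suggested above: a minimal sketch, assuming Airflow 2.x ORM models and direct access to the metadata database, that lists task instances stuck in DEFERRED together with the triggerer their trigger row is assigned to. A trigger with no `triggerer_id` would suggest no triggerer is picking the work up. On MWAA, direct DB access may not be available, so treat this as illustrative.)

```python
# Hedged sketch: inspect deferred task instances and their trigger assignments
# (assumes Airflow 2.x models; run where the metadata DB is reachable).
from airflow.models import TaskInstance
from airflow.models.trigger import Trigger
from airflow.utils.session import create_session
from airflow.utils.state import TaskInstanceState

with create_session() as session:
    deferred_tis = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == TaskInstanceState.DEFERRED)
        .all()
    )
    for ti in deferred_tis:
        trigger = session.get(Trigger, ti.trigger_id) if ti.trigger_id else None
        print(
            f"{ti.dag_id}.{ti.task_id} run={ti.run_id} "
            f"trigger_id={ti.trigger_id} "
            f"assigned_triggerer={getattr(trigger, 'triggerer_id', None)}"
        )
```

On a self-managed deployment, something like `airflow jobs check --job-type TriggererJob` can also confirm whether a triggerer heartbeat is recent; on MWAA those components are managed for you.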
Re: [I] Airflow task stuck on DEFERRED status [airflow]
pvieito commented on issue #43634: URL: https://github.com/apache/airflow/issues/43634#issuecomment-2456756439

Hi @kunaljubce!

- No updates in that time window.
- All our tasks are vanilla Airflow AWS operators and super-simple DAGs based on some templates (basically `BatchOperator` & `GlueOperator`); an explicitly deferrable variant of this template is sketched after this comment:

```python
with DAG(
    dag_id=dag_id,
    schedule=dag_schedule,
    start_date=pendulum.parse(dag_schedule_start_date),
    tags=dag_deployment_tags,
    catchup=False,
) as dag:
    batch_job = BatchOperator(
        task_id="batch_job",
        job_name=job_name,
        job_definition=job_definition_arn,
        job_queue=job_queue_name,
        tags=job_tags,
        aws_conn_id=job_aws_connection_id,
        container_overrides=dict(
            environment=aws_batch_job_environment_variables,
        ),
    )

    log_batch_job() >> batch_job
```

- All deferred tasks started to misbehave indefinitely, and all new deferred tasks also misbehave.
- Some extra full logs of tasks stuck in DEFERRED without any trigger logs:
  - With `GlueOperator`:

```
ip-10-5-146-175.eu-west-1.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-DT-LAB-Airflow-Environment-Task log_stream: dag_id=DT-LAB-Autostop-Schedule-PushProximity-1/run_id=manual__2024-11-04T09_55_51+00_00/task_id=glue_job/attempt=1.log.
[2024-11-04, 09:55:59 UTC] {local_task_job_runner.py:123} ▼ Pre task execution logs
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=non-requeueable deps ti=
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=requeueable deps ti=
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2865} INFO - Starting attempt 1 of 1
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2888} INFO - Executing on 2024-11-04 09:55:51+00:00
[2024-11-04, 09:55:59 UTC] {standard_task_runner.py:72} INFO - Started process 2450 to run task
[2024-11-04, 09:55:59 UTC] {standard_task_runner.py:104} INFO - Running: ['airflow', 'tasks', 'run', 'DT-LAB-Autostop-Schedule-PushProximity-1', 'glue_job', 'manual__2024-11-04T09:55:51+00:00', '--job-id', '39075', '--raw', '--subdir', 'DAGS_FOLDER/DT-LAB-Autostop-Schedule-PushProximity-1.py', '--cfg-path', '/tmp/tmpceiye8fs']
[2024-11-04, 09:55:59 UTC] {standard_task_runner.py:105} INFO - Job 39075: Subtask glue_job
[2024-11-04, 09:55:59 UTC] {task_command.py:467} INFO - Running on host ip-10-5-146-175.eu-west-1.compute.internal
[2024-11-04, 09:56:00 UTC] {taskinstance.py:3131} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='DT-LAB-Autostop-Schedule-PushProximity-1' AIRFLOW_CTX_TASK_ID='glue_job' AIRFLOW_CTX_EXECUTION_DATE='2024-11-04T09:55:51+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2024-11-04T09:55:51+00:00'
[2024-11-04, 09:56:00 UTC] {taskinstance.py:731} ▲▲▲ Log group end
[2024-11-04, 09:56:00 UTC] {glue.py:188} INFO - Initializing AWS Glue Job: DT-LAB-Autostop-Schedule-PushProximity-1. Wait for completion: True
[2024-11-04, 09:56:00 UTC] {glue.py:365} INFO - Checking if job already exists: DT-LAB-Autostop-Schedule-PushProximity-1
[2024-11-04, 09:56:00 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data'
[2024-11-04, 09:56:02 UTC] {glue.py:209} INFO - You can monitor this Glue Job run at: https://console.aws.amazon.com/gluestudio/home?region=eu-west-1#/job/DT-LAB-Autostop-Schedule-PushProximity-1/run/jr_d8efd3e7c7ca097b048db1e2e4342cd3e8eb3f03a225335bdef726c3ff63abb3
[2024-11-04, 09:56:02 UTC] {taskinstance.py:288} INFO - Pausing task as DEFERRED. dag_id=DT-LAB-Autostop-Schedule-PushProximity-1, task_id=glue_job, run_id=manual__2024-11-04T09:55:51+00:00, execution_date=20241104T095551, start_date=20241104T095559
[2024-11-04, 09:56:02 UTC] {taskinstance.py:340} ▼ Post task execution logs
[2024-11-04, 09:56:02 UTC] {local_task_job_runner.py:260} INFO - Task exited with return code 100 (task deferral)
[2024-11-04, 09:56:02 UTC] {local_task_job_runner.py:245} ▲▲▲ Log group end
[2024-11-04, 09:56:04 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data'
[2024-11-04, 09:56:05 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data'
```

  - With `BatchOperator`:

```
ip-10-5-146-175.eu-west-1.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-DT-LAB-Airflow-Environment-Task log_stream: dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1/run_id=scheduled__2024-11-04T09_00_00+00_00/task_id=batch_job/attempt=1.log.
[2024-11-04, 10:00:13 UTC] {local_task_job_runner.py:123} ▼ Pre task execution logs
[2024-11-04, 10:00:13 UTC] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=non-requeueable deps ti=
[2024-11-04, 10:00:13 UTC] {taskinsta
```
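(Not part of the original comment, but possibly useful alongside the template above: a minimal sketch, assuming a recent Amazon provider where `BatchOperator` accepts `deferrable` and `poll_interval`, of requesting deferral explicitly on the operator instead of relying on the global `operators.default_deferrable` setting. The DAG id and job names below are placeholders.)

```python
# Hedged sketch: explicit per-operator deferral (assumes a recent
# apache-airflow-providers-amazon where BatchOperator takes `deferrable`).
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.batch import BatchOperator

with DAG(
    dag_id="example-explicit-deferrable",  # hypothetical DAG id
    schedule=None,
    start_date=pendulum.datetime(2024, 11, 1, tz="UTC"),
    catchup=False,
) as dag:
    batch_job = BatchOperator(
        task_id="batch_job",
        job_name="example-job",                  # placeholder job settings
        job_definition="example-job-definition",
        job_queue="example-job-queue",
        container_overrides={},                  # no overrides in this sketch
        deferrable=True,   # defer explicitly, independent of the global config
        poll_interval=30,  # seconds between status checks while deferred
    )
```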
Re: [I] Airflow task stuck on DEFERRED status [airflow]
kunaljubce commented on issue #43634: URL: https://github.com/apache/airflow/issues/43634#issuecomment-2455253102

@pvieito -
* Given that it was working all this while and suddenly stopped today, and assuming that your Airflow platform was not upgraded in the meantime, it seems something changed upstream? If you're querying an API, maybe you can add retry logic if it's not already there (see the sketch after this comment)?
* Do you see any helpful details in the `▶ Post task execution logs`?
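(Not part of the original comment, but a hedged illustration of the retry suggestion above: `retries`, `retry_delay`, and `retry_exponential_backoff` are standard BaseOperator arguments, so transient upstream/API failures can be retried at the task level, for example DAG-wide via `default_args`.)

```python
# Hedged sketch: DAG-wide retry settings via default_args; these are standard
# BaseOperator arguments and apply to every task unless overridden per task.
from datetime import timedelta

default_args = {
    "retries": 3,                          # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),   # wait between attempts
    "retry_exponential_backoff": True,     # grow the delay on repeated failures
}

# Passed when constructing the DAG, e.g. DAG(dag_id=..., default_args=default_args, ...)
```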
Re: [I] Airflow task stuck on DEFERRED status [airflow]
boring-cyborg[bot] commented on issue #43634: URL: https://github.com/apache/airflow/issues/43634#issuecomment-2454641827

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
[I] Airflow task stuck on DEFERRED status [airflow]
pvieito opened a new issue, #43634: URL: https://github.com/apache/airflow/issues/43634

### Apache Airflow version

2.10.2

### If "Other Airflow 2 version" selected, which one?

_No response_

### What happened?

We are trying to migrate all our DAGs to use the deferrable operators. I set it as the default in our MWAA configuration (a quick check of the effective value is sketched at the end of this issue):

| Configuration option | Value |
| -- | -- |
| operators.default_deferrable | true |

This seemed to work properly for a week, but today all jobs started being stuck in the DEFERRED status. For example:

Typical run:

```
ip-10-5-146-175.eu-west-1.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-DT-LAB-Airflow-Environment-Task log_stream: dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1/run_id=scheduled__2024-11-04T08_00_00+00_00/task_id=batch_job/attempt=1.log.
[2024-11-04, 09:00:15 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2024-11-04, 09:00:15 UTC] {batch.py:303} INFO - Running AWS Batch job - job definition: arn:aws:batch:eu-west-1:532132528437:job-definition/DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1:358 - on queue DT-LAB-BatchExecution-Standard
[2024-11-04, 09:00:15 UTC] {batch.py:310} INFO - AWS Batch job - container overrides: {'environment': [{'name': '_DATATOOLS_DAG_DEPLOYMENT_IDENTIFIER', 'value': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1'}, {'name': '_DATATOOLS_DAG_DEPLOYMENT_TARGET', 'value': 'airflow'}, {'name': 'DATATOOLS_ENVIRONMENT', 'value': 'LAB'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_DAG_ID', 'value': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_ENVIRONMENT_NAME', 'value': 'shopfully-data-execution'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_DATA_INTERVAL_START_DATETIME', 'value': '2024-11-04 08:00:00+00:00'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_DATA_INTERVAL_END_DATETIME', 'value': '2024-11-04 09:00:00+00:00'}, {'name': '_DATATOOLS_PROCESS_EXTRA_ENVIRONMENT_BLOB', 'value': ''}]}
[2024-11-04, 09:00:15 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data_execution'
[2024-11-04, 09:00:17 UTC] {batch.py:347} INFO - AWS Batch job (16ffa872-59c1-4eb9-a7b4-50622364db0b) started: {'ResponseMetadata': {'RequestId': '18c85cda-9a0e-4347-8801-ac6e5a1da429', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 04 Nov 2024 09:00:17 GMT', 'content-type': 'application/json', 'content-length': '198', 'connection': 'keep-alive', 'x-amzn-requestid': '18c85cda-9a0e-4347-8801-ac6e5a1da429', 'access-control-allow-origin': '*', 'x-amz-apigw-id': 'Atr9NEPzDoEEWPg=', 'access-control-expose-headers': 'X-amzn-errortype,X-amzn-requestid,X-amzn-errormessage,X-amzn-trace-id,X-amz-apigw-id,date', 'x-amzn-trace-id': 'Root=1-67288d20-10faae0a2da5180b3c952dc2'}, 'RetryAttempts': 0}, 'jobArn': 'arn:aws:batch:eu-west-1:532132528437:job/16ffa872-59c1-4eb9-a7b4-50622364db0b', 'jobName': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1', 'jobId': '16ffa872-59c1-4eb9-a7b4-50622364db0b'}
[2024-11-04, 09:00:17 UTC] {taskinstance.py:288} INFO - Pausing task as DEFERRED. dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1, task_id=batch_job, run_id=scheduled__2024-11-04T08:00:00+00:00, execution_date=20241104T08, start_date=20241104T090015
[2024-11-04, 09:00:17 UTC] {taskinstance.py:340} ▶ Post task execution logs
[2024-11-04, 09:00:18 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data_execution'
[2024-11-04, 09:00:18 UTC] {waiter_with_logging.py:129} INFO - Batch job 16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['SUBMITTED']
[2024-11-04, 09:00:48 UTC] {waiter_with_logging.py:129} INFO - Batch job 16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['STARTING']
[2024-11-04, 09:01:19 UTC] {waiter_with_logging.py:129} INFO - Batch job 16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['STARTING']
[2024-11-04, 09:01:49 UTC] {waiter_with_logging.py:129} INFO - Batch job 16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['RUNNING']
[2024-11-04, 09:02:19 UTC] {waiter_with_logging.py:129} INFO - Batch job 16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['RUNNING']
[2024-11-04, 09:02:49 UTC] {waiter_with_logging.py:129} INFO - Batch job 16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['RUNNING']
[2024-11-04, 09:03:19 UTC] {triggerer_job_runner.py:631} INFO - Trigger DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1/scheduled__2024-11-04T08:00:00+00:00/batch_job/-1/1 (ID 4448) fired: TriggerEvent<{'status': 'success', 'job_id': '16ffa872-59c1-4eb9-a7b4-50622364db0b'}>
[2024-11-04, 09:03:23 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2024-11-04, 09:03:23 UTC] {batch.py:290} INFO - Job completed.
[2024-11-04, 09:03:24 UTC] {taskinstance.py:340} ▶ Post task execution logs
```

All runs since today at around 10:00 UTC:

```
ip-10-5-146-175.eu-west-1.compute.internal
```
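(Not part of the original issue, but a small hedged check that can confirm the MWAA configuration override is actually in effect: `conf.getboolean` reads the live Airflow configuration, so running this in a throwaway task or a Python shell on the environment shows the default that deferrable-capable operators will pick up.)

```python
# Hedged sketch: print the effective value of operators.default_deferrable
# as seen by the running Airflow environment.
from airflow.configuration import conf

default_deferrable = conf.getboolean("operators", "default_deferrable", fallback=False)
print(f"operators.default_deferrable = {default_deferrable}")
```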