Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-08 Thread via GitHub


gopidesupavan closed issue #43634: Airflow task stuck on DEFERRED status
URL: https://github.com/apache/airflow/issues/43634





Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-08 Thread via GitHub


gopidesupavan commented on issue #43634:
URL: https://github.com/apache/airflow/issues/43634#issuecomment-2465514983

   This is difficult to reproduce and debug. There are optimizations related to 
triggers in Airflow 2.10.3, so try upgrading and check if any issues persist. 
Moving this to discussions.
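
   For anyone following up after an upgrade, here is a minimal sketch to confirm what the environment is actually running; `operators.default_deferrable` is the standard config option, and the snippet just needs to run somewhere the Airflow package is importable:

   ```python
   # Minimal sketch: confirm the running Airflow version and whether
   # deferrable operators are enabled by default (e.g. after upgrading to 2.10.3).
   import airflow
   from airflow.configuration import conf

   print("Airflow version:", airflow.__version__)
   print(
       "operators.default_deferrable:",
       conf.getboolean("operators", "default_deferrable", fallback=False),
   )
   ```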





Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-08 Thread via GitHub


kunaljubce commented on issue #43634:
URL: https://github.com/apache/airflow/issues/43634#issuecomment-2464412214

   @pvieito I looked through the logs and they don't tell us much about the actual issue. I would point you back to the platform/DevOps team that maintains your Airflow environment to check whether this is memory related. From the logs, I can see some of the tasks being submitted, executed, and completed; only after a period of time does your triggerer start misbehaving and leave tasks deferred indefinitely.
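
   One hypothetical way to narrow down whether the triggerer has dropped the triggers (rather than the tasks merely waiting on slow jobs) is to check whether the DEFERRED task instances still have a trigger row assigned; a minimal sketch, assuming direct metadata-DB access (which a managed MWAA environment may not expose):

   ```python
   # Minimal sketch (assumes direct access to the Airflow metadata DB):
   # list task instances sitting in DEFERRED and whether a trigger row
   # is still attached to them.
   from airflow.models import TaskInstance
   from airflow.utils.session import create_session
   from airflow.utils.state import TaskInstanceState

   with create_session() as session:
       deferred = (
           session.query(TaskInstance)
           .filter(TaskInstance.state == TaskInstanceState.DEFERRED)
           .all()
       )
       for ti in deferred:
           print(ti.dag_id, ti.task_id, ti.run_id, "trigger_id =", ti.trigger_id)
   ```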





Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-06 Thread via GitHub


pvieito commented on issue #43634:
URL: https://github.com/apache/airflow/issues/43634#issuecomment-2460358188

   Hi @kunaljubce!
   
   I attach the full logs from the tasks, workers & scheduler from 
2024-11-04T09:00:00 to 2024-11-04T11:30:00.
   
   - At around 9:00 UTC, as you can see, the task with AWS Batch identifier '16ffa872-59c1-4eb9-a7b4-50622364db0b', for example, is correctly deferred and `waiter_with_logging.py` logs the expected checks.
   - Everything seems to work fine until around 10:00 UTC. After that, every task (for example the one with AWS Batch identifier 'c6157b39-4197-4a98-9a0d-2310866a2495') is still deferred, but no checks are performed: there are no more `waiter_with_logging.py` logs after the one at 2024-11-04 09:55:39.871.
   
   
[Airflow-Logs.csv](https://github.com/user-attachments/files/17650761/Airflow-Logs.csv)
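
   For anyone digging further, one way to pinpoint exactly when the waiter checks stop is to filter the triggerer's CloudWatch log group for `waiter_with_logging` entries; a minimal sketch with boto3 (the log group name below is only a guess patterned on the task log group above, so adjust it for your environment):

   ```python
   # Minimal sketch (boto3): pull `waiter_with_logging` entries around the
   # time the checks stopped. The log group name is a placeholder.
   from datetime import datetime, timezone

   import boto3

   logs = boto3.client("logs", region_name="eu-west-1")
   start = datetime(2024, 11, 4, 9, 30, tzinfo=timezone.utc)
   end = datetime(2024, 11, 4, 11, 30, tzinfo=timezone.utc)
   response = logs.filter_log_events(
       logGroupName="airflow-DT-LAB-Airflow-Environment-Triggerer",  # placeholder
       filterPattern="waiter_with_logging",
       startTime=int(start.timestamp() * 1000),
       endTime=int(end.timestamp() * 1000),
   )
   for event in response["events"]:
       print(event["timestamp"], event["message"])
   ```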
   
   





Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-05 Thread via GitHub


kunaljubce commented on issue #43634:
URL: https://github.com/apache/airflow/issues/43634#issuecomment-2457503916

   @pvieito - Thanks for these; nothing obvious jumps out from these logs.
   
   Would it be possible for you to look at all the logs for the triggerer, scheduler, etc.? Maybe try clearing the tasks and then follow the system logs to see whether anything is printed referring to your deferred tasks. If you can add those logs below, we can look for a possible issue, because given that nothing changed on the Airflow platform side, there isn't much for us to go on.
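
   If it helps, clearing the stuck task instances programmatically is also possible; a minimal sketch, assuming you can run it somewhere with access to the metadata DB (the dag_id and dates are placeholders; on MWAA the same thing is usually done from the UI or with `airflow tasks clear`):

   ```python
   # Minimal sketch: clear the stuck run so the task re-executes and
   # re-registers its trigger. dag_id and dates are placeholders.
   import pendulum
   from airflow.models import DagBag

   dagbag = DagBag()  # parse DAGs from the configured DAGs folder
   dag = dagbag.get_dag("DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1")  # placeholder
   dag.clear(
       start_date=pendulum.datetime(2024, 11, 4, 9, 0, tz="UTC"),
       end_date=pendulum.datetime(2024, 11, 4, 12, 0, tz="UTC"),
   )
   ```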
   
   cc: @andrewgodwin I see a similar issue reported long back (#25630), hence 
tagging you here. 





Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-05 Thread via GitHub


pvieito commented on issue #43634:
URL: https://github.com/apache/airflow/issues/43634#issuecomment-2456756439

   Hi @kunaljubce!
   
   - No updates in that time window.
   - All our tasks are vanilla Airflow AWS operators and super-simple DAGs 
based on some templates (basically `BatchOperator` & `GlueOperator`):
   
   ```python
   # Template variables (dag_id, schedule, job parameters, etc.) are filled in elsewhere.
   import pendulum
   from airflow import DAG
   from airflow.providers.amazon.aws.operators.batch import BatchOperator

   with DAG(
       dag_id=dag_id,
       schedule=dag_schedule,
       start_date=pendulum.parse(dag_schedule_start_date),
       tags=dag_deployment_tags,
       catchup=False,
   ) as dag:
       batch_job = BatchOperator(
           task_id="batch_job",
           job_name=job_name,
           job_definition=job_definition_arn,
           job_queue=job_queue_name,
           tags=job_tags,
           aws_conn_id=job_aws_connection_id,
           container_overrides=dict(
               environment=aws_batch_job_environment_variables,
           ),
       )

       # log_batch_job() is a helper task defined in the template.
       log_batch_job() >> batch_job
   ```
   - All deferred tasks started to misbehave indefinitely, and all newly deferred tasks also misbehave.
   - Some extra full logs of tasks stuck in DEFERRED without any trigger logs:
   
   - With `GlueOperator`:
   ```
   ip-10-5-146-175.eu-west-1.compute.internal
   *** Reading remote log from Cloudwatch log_group: 
airflow-DT-LAB-Airflow-Environment-Task log_stream: 
dag_id=DT-LAB-Autostop-Schedule-PushProximity-1/run_id=manual__2024-11-04T09_55_51+00_00/task_id=glue_job/attempt=1.log.
   [2024-11-04, 09:55:59 UTC] {local_task_job_runner.py:123} ▼ Pre task 
execution logs
   [2024-11-04, 09:55:59 UTC] {taskinstance.py:2612} INFO - Dependencies all 
met for dep_context=non-requeueable deps ti=
   [2024-11-04, 09:55:59 UTC] {taskinstance.py:2612} INFO - Dependencies all 
met for dep_context=requeueable deps ti=
   [2024-11-04, 09:55:59 UTC] {taskinstance.py:2865} INFO - Starting attempt 1 
of 1
   [2024-11-04, 09:55:59 UTC] {taskinstance.py:2888} INFO - Executing 
 on 2024-11-04 09:55:51+00:00
   [2024-11-04, 09:55:59 UTC] {standard_task_runner.py:72} INFO - Started 
process 2450 to run task
   [2024-11-04, 09:55:59 UTC] {standard_task_runner.py:104} INFO - Running: 
['airflow', 'tasks', 'run', 'DT-LAB-Autostop-Schedule-PushProximity-1', 
'glue_job', 'manual__2024-11-04T09:55:51+00:00', '--job-id', '39075', '--raw', 
'--subdir', 'DAGS_FOLDER/DT-LAB-Autostop-Schedule-PushProximity-1.py', 
'--cfg-path', '/tmp/tmpceiye8fs']
   [2024-11-04, 09:55:59 UTC] {standard_task_runner.py:105} INFO - Job 39075: 
Subtask glue_job
   [2024-11-04, 09:55:59 UTC] {task_command.py:467} INFO - Running 
 on host 
ip-10-5-146-175.eu-west-1.compute.internal
   [2024-11-04, 09:56:00 UTC] {taskinstance.py:3131} INFO - Exporting env vars: 
AIRFLOW_CTX_DAG_OWNER='airflow' 
AIRFLOW_CTX_DAG_ID='DT-LAB-Autostop-Schedule-PushProximity-1' 
AIRFLOW_CTX_TASK_ID='glue_job' 
AIRFLOW_CTX_EXECUTION_DATE='2024-11-04T09:55:51+00:00' 
AIRFLOW_CTX_TRY_NUMBER='1' 
AIRFLOW_CTX_DAG_RUN_ID='manual__2024-11-04T09:55:51+00:00'
   [2024-11-04, 09:56:00 UTC] {taskinstance.py:731} ▲▲▲ Log group end
   [2024-11-04, 09:56:00 UTC] {glue.py:188} INFO - Initializing AWS Glue Job: 
DT-LAB-Autostop-Schedule-PushProximity-1. Wait for completion: True
   [2024-11-04, 09:56:00 UTC] {glue.py:365} INFO - Checking if job already 
exists: DT-LAB-Autostop-Schedule-PushProximity-1
   [2024-11-04, 09:56:00 UTC] {base.py:84} INFO - Retrieving connection 
'aws_shopfully_data'
   [2024-11-04, 09:56:02 UTC] {glue.py:209} INFO - You can monitor this Glue 
Job run at: 
https://console.aws.amazon.com/gluestudio/home?region=eu-west-1#/job/DT-LAB-Autostop-Schedule-PushProximity-1/run/jr_d8efd3e7c7ca097b048db1e2e4342cd3e8eb3f03a225335bdef726c3ff63abb3
   [2024-11-04, 09:56:02 UTC] {taskinstance.py:288} INFO - Pausing task as 
DEFERRED. dag_id=DT-LAB-Autostop-Schedule-PushProximity-1, task_id=glue_job, 
run_id=manual__2024-11-04T09:55:51+00:00, execution_date=20241104T095551, 
start_date=20241104T095559
   [2024-11-04, 09:56:02 UTC] {taskinstance.py:340} ▼ Post task execution logs
   [2024-11-04, 09:56:02 UTC] {local_task_job_runner.py:260} INFO - Task exited 
with return code 100 (task deferral)
   [2024-11-04, 09:56:02 UTC] {local_task_job_runner.py:245} ▲▲▲ Log group end
   [2024-11-04, 09:56:04 UTC] {base.py:84} INFO - Retrieving connection 
'aws_shopfully_data'
   [2024-11-04, 09:56:05 UTC] {base.py:84} INFO - Retrieving connection 
'aws_shopfully_data'
   ```
   
   - With `BatchOperator`:
   
   ```
   ip-10-5-146-175.eu-west-1.compute.internal
   *** Reading remote log from Cloudwatch log_group: 
airflow-DT-LAB-Airflow-Environment-Task log_stream: 
dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1/run_id=scheduled__2024-11-04T09_00_00+00_00/task_id=batch_job/attempt=1.log.
   [2024-11-04, 10:00:13 UTC] {local_task_job_runner.py:123} ▼ Pre task 
execution logs
   [2024-11-04, 10:00:13 UTC] {taskinstance.py:2612} INFO - Dependencies all 
met for dep_context=non-requeueable deps ti=
   [2024-11-04, 10:00:13 UTC] {taskinsta
   ```

Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-04 Thread via GitHub


kunaljubce commented on issue #43634:
URL: https://github.com/apache/airflow/issues/43634#issuecomment-2455253102

   @pvieito - 
   
   * Given that it was working all this while and suddenly stopped working today, and assuming your Airflow platform was not upgraded in the meantime, it seems something may have changed upstream? If you're querying an API, maybe you can add retry logic if it's not already there (a small sketch follows below)?
   * Do you see any helpful details in the `▶ Post task execution logs`?
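
   On the retry suggestion, a minimal sketch of DAG-level retry defaults (all values are placeholders); note this only kicks in when a task actually fails, so it is a mitigation rather than a fix for a triggerer that stops firing events:

   ```python
   # Minimal sketch (placeholder values): DAG-level retry defaults, so a
   # transient upstream/API failure re-runs the task instead of leaving it hung.
   from datetime import timedelta

   import pendulum
   from airflow import DAG

   with DAG(
       dag_id="example-dag",  # placeholder
       schedule=None,
       start_date=pendulum.datetime(2024, 11, 1, tz="UTC"),
       catchup=False,
       default_args={
           "retries": 3,
           "retry_delay": timedelta(minutes=5),
       },
   ) as dag:
       ...
   ```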





Re: [I] Airflow task stuck on DEFERRED status [airflow]

2024-11-04 Thread via GitHub


boring-cyborg[bot] commented on issue #43634:
URL: https://github.com/apache/airflow/issues/43634#issuecomment-2454641827

   Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
   





[I] Airflow task stuck on DEFERRED status [airflow]

2024-11-04 Thread via GitHub


pvieito opened a new issue, #43634:
URL: https://github.com/apache/airflow/issues/43634

   ### Apache Airflow version
   
   2.10.2
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   We are trying to migrate all our DAGs to use the deferrable operators. I set it as the default in our MWAA configuration:
   
   
   operators.default_deferrable = true
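
   For context, that setting should make operators such as `BatchOperator` behave as if `deferrable=True` were passed explicitly; a minimal sketch of the explicit per-operator form (all values are placeholders):

   ```python
   # Minimal sketch (placeholder values): the explicit per-operator form of
   # what operators.default_deferrable=true turns on globally.
   from airflow.providers.amazon.aws.operators.batch import BatchOperator

   # Would normally sit inside a `with DAG(...)` block.
   batch_job = BatchOperator(
       task_id="batch_job",
       job_name="example-job",                   # placeholder
       job_definition="example-job-definition",  # placeholder
       job_queue="example-queue",                # placeholder
       deferrable=True,  # hand the wait to the triggerer instead of tying up a worker slot
   )
   ```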
   
   This seemed to work properly for a week, but today all jobs started getting stuck in the DEFERRED status. For example:
   
   Typical run:
   
   ```
   ip-10-5-146-175.eu-west-1.compute.internal
   *** Reading remote log from Cloudwatch log_group: 
airflow-DT-LAB-Airflow-Environment-Task log_stream: 
dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1/run_id=scheduled__2024-11-04T08_00_00+00_00/task_id=batch_job/attempt=1.log.
   [2024-11-04, 09:00:15 UTC] {local_task_job_runner.py:123} ▶ Pre task 
execution logs
   [2024-11-04, 09:00:15 UTC] {batch.py:303} INFO - Running AWS Batch job - job 
definition: 
arn:aws:batch:eu-west-1:532132528437:job-definition/DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1:358
 - on queue DT-LAB-BatchExecution-Standard
   [2024-11-04, 09:00:15 UTC] {batch.py:310} INFO - AWS Batch job - container 
overrides: {'environment': [{'name': '_DATATOOLS_DAG_DEPLOYMENT_IDENTIFIER', 
'value': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1'}, {'name': 
'_DATATOOLS_DAG_DEPLOYMENT_TARGET', 'value': 'airflow'}, {'name': 
'DATATOOLS_ENVIRONMENT', 'value': 'LAB'}, {'name': 
'_DATATOOLS_AIRFLOW_DEPLOYMENT_DAG_ID', 'value': 
'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1'}, {'name': 
'_DATATOOLS_AIRFLOW_DEPLOYMENT_ENVIRONMENT_NAME', 'value': 
'shopfully-data-execution'}, {'name': 
'_DATATOOLS_AIRFLOW_DEPLOYMENT_DATA_INTERVAL_START_DATETIME', 'value': 
'2024-11-04 08:00:00+00:00'}, {'name': 
'_DATATOOLS_AIRFLOW_DEPLOYMENT_DATA_INTERVAL_END_DATETIME', 'value': 
'2024-11-04 09:00:00+00:00'}, {'name': 
'_DATATOOLS_PROCESS_EXTRA_ENVIRONMENT_BLOB', 'value': ''}]}
   [2024-11-04, 09:00:15 UTC] {base.py:84} INFO - Retrieving connection 
'aws_shopfully_data_execution'
   [2024-11-04, 09:00:17 UTC] {batch.py:347} INFO - AWS Batch job 
(16ffa872-59c1-4eb9-a7b4-50622364db0b) started: {'ResponseMetadata': 
{'RequestId': '18c85cda-9a0e-4347-8801-ac6e5a1da429', 'HTTPStatusCode': 200, 
'HTTPHeaders': {'date': 'Mon, 04 Nov 2024 09:00:17 GMT', 'content-type': 
'application/json', 'content-length': '198', 'connection': 'keep-alive', 
'x-amzn-requestid': '18c85cda-9a0e-4347-8801-ac6e5a1da429', 
'access-control-allow-origin': '*', 'x-amz-apigw-id': 'Atr9NEPzDoEEWPg=', 
'access-control-expose-headers': 
'X-amzn-errortype,X-amzn-requestid,X-amzn-errormessage,X-amzn-trace-id,X-amz-apigw-id,date',
 'x-amzn-trace-id': 'Root=1-67288d20-10faae0a2da5180b3c952dc2'}, 
'RetryAttempts': 0}, 'jobArn': 
'arn:aws:batch:eu-west-1:532132528437:job/16ffa872-59c1-4eb9-a7b4-50622364db0b',
 'jobName': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1', 'jobId': 
'16ffa872-59c1-4eb9-a7b4-50622364db0b'}
   [2024-11-04, 09:00:17 UTC] {taskinstance.py:288} INFO - Pausing task as 
DEFERRED. dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1, 
task_id=batch_job, run_id=scheduled__2024-11-04T08:00:00+00:00, 
execution_date=20241104T08, start_date=20241104T090015
   [2024-11-04, 09:00:17 UTC] {taskinstance.py:340} ▶ Post task execution logs
   [2024-11-04, 09:00:18 UTC] {base.py:84} INFO - Retrieving connection 
'aws_shopfully_data_execution'
   [2024-11-04, 09:00:18 UTC] {waiter_with_logging.py:129} INFO - Batch job 
16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['SUBMITTED']
   [2024-11-04, 09:00:48 UTC] {waiter_with_logging.py:129} INFO - Batch job 
16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['STARTING']
   [2024-11-04, 09:01:19 UTC] {waiter_with_logging.py:129} INFO - Batch job 
16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['STARTING']
   [2024-11-04, 09:01:49 UTC] {waiter_with_logging.py:129} INFO - Batch job 
16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['RUNNING']
   [2024-11-04, 09:02:19 UTC] {waiter_with_logging.py:129} INFO - Batch job 
16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['RUNNING']
   [2024-11-04, 09:02:49 UTC] {waiter_with_logging.py:129} INFO - Batch job 
16ffa872-59c1-4eb9-a7b4-50622364db0b not ready yet: ['RUNNING']
   [2024-11-04, 09:03:19 UTC] {triggerer_job_runner.py:631} INFO - Trigger 
DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1/scheduled__2024-11-04T08:00:00+00:00/batch_job/-1/1
 (ID 4448) fired: TriggerEvent<{'status': 'success', 'job_id': 
'16ffa872-59c1-4eb9-a7b4-50622364db0b'}>
   [2024-11-04, 09:03:23 UTC] {local_task_job_runner.py:123} ▶ Pre task 
execution logs
   [2024-11-04, 09:03:23 UTC] {batch.py:290} INFO - Job completed.
   [2024-11-04, 09:03:24 UTC] {taskinstance.py:340} ▶ Post task execution logs
   ```
   
   All runs since today at around 10:00 UTC:
   
   ```
   ip-10-5-146-175.eu-west-1.compute.internal