Darren Weber created AIRFLOW-5889:
-------------------------------------

             Summary: AWS Batch Operator - API request limits should not fail a task
                 Key: AIRFLOW-5889
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5889
             Project: Apache Airflow
          Issue Type: Improvement
          Components: aws, contrib
    Affects Versions: 1.10.4
            Reporter: Darren Weber
            Assignee: Darren Weber
             Fix For: 2.0.0, 1.10.6


The AWS Batch Operator attempts to use a boto3 feature that is not available; 
the pull request that would add it has gone unmerged for years. See:
 - [https://github.com/boto/botocore/pull/1307]
 - see also [https://github.com/broadinstitute/cromwell/issues/4303]

Relying on that feature is a curious case of premature optimization. Until it 
is available, the operator falls back to the exponential backoff routine for 
the status checks on the batch job. Unfortunately, when the concurrency of 
Airflow jobs is very high (hundreds of tasks), this fallback polling hits the 
AWS Batch API too hard and the API throttle throws an error. That error fails 
the Airflow task simply because the status is polled too frequently.

Check the output from the retry algorithm: within the first 10 retries, the 
status of an AWS Batch job is checked about 10 times, at a rate of roughly one 
check per second. When an Airflow instance is running tens or hundreds of 
concurrent batch jobs, this hits the API too frequently and fails the Airflow 
task (and it also ties up a worker with busy work).
{code:python}
In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
 Out[4]: 
 [1.0,
 1.01,
 1.04,
 1.09,
 1.1600000000000001,
 1.25,
 1.36,
 1.4900000000000002,
 1.6400000000000001,
 1.81,
 2.0,
 2.21,
 2.4400000000000004,
 2.6900000000000004,
 2.9600000000000004,
 3.25,
 3.5600000000000005,
 3.8900000000000006,
 4.24,
 4.61]
{code}
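To put a rough number on the aggregate load, here is a small sketch that reuses the delay formula above and counts how many status checks a single task issues in a given time window. It assumes each check is preceded by its delay; the helper names and the figure of 100 concurrent tasks are illustrative assumptions, not part of the operator.

```python
def poll_delay(retries):
    """Delay in seconds before the next status check (formula quoted above)."""
    return 1 + pow(retries * 0.1, 2)


def requests_within(window_sec, max_retries=1000):
    """Count the status checks one task issues within the first window_sec seconds."""
    elapsed = 0.0
    count = 0
    for retries in range(max_retries):
        elapsed += poll_delay(retries)
        if elapsed > window_sec:
            break
        count += 1
    return count


concurrent_tasks = 100  # assumed number of tasks started at about the same time
per_task = requests_within(60)
print("checks per task in the first minute:", per_task)
print("aggregate checks in the first minute:", per_task * concurrent_tasks)
```

Because the delays grow so slowly at the start, the aggregate request count scales almost linearly with the number of concurrent tasks during the first minute.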
One possible solution is to introduce an initial sleep (say 60 sec?) right 
after issuing the request, so that the batch job has some time to spin up. The 
job progresses through several phases before it gets to the RUNNING state, and 
polling for each phase of that sequence might help. Since batch jobs tend to be 
long-running jobs (rather than near-real-time jobs), it might help to issue 
less frequent polls when the job is in the RUNNING state. Something on the 
order of tens of seconds might be reasonable for batch jobs? Maybe the class 
could expose a parameter for the rate of polling (or a callable)?
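A minimal sketch of what such a polling loop might look like, using only the standard library. Everything here is hypothetical: `wait_for_job`, `by_state`, the `ThrottledError` stand-in for whatever exception the API client raises when throttled, and all of the interval values. The point is only to show an initial delay, a poll interval that can be a number or a callable, and a throttled request being treated as "try again later" rather than as a task failure.

```python
import random
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


class ThrottledError(Exception):
    """Hypothetical stand-in for the error raised when the API throttles a request."""


def _interval(poll_interval, status):
    """Resolve a fixed or callable poll interval to a number of seconds."""
    return poll_interval(status) if callable(poll_interval) else poll_interval


def wait_for_job(get_status, initial_delay=60.0, poll_interval=30.0,
                 max_polls=1000, sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state.

    poll_interval is either a number of seconds or a callable that maps the
    last observed status to a number of seconds.
    """
    sleep(initial_delay)  # give the job time to spin up before the first check
    status = None
    for _ in range(max_polls):
        try:
            status = get_status()
        except ThrottledError:
            # A throttled status check is not a job failure: wait longer, retry.
            sleep(random.uniform(1.0, 2.0) * _interval(poll_interval, status))
            continue
        if status in TERMINAL_STATES:
            return status
        sleep(_interval(poll_interval, status))
    raise TimeoutError("job did not finish within max_polls status checks")


def by_state(status):
    """Example callable: poll less often once the job is RUNNING (assumed values)."""
    return 60.0 if status == "RUNNING" else 15.0
```

Passing `poll_interval=by_state` gives per-phase polling; passing a plain number gives a fixed rate. The `sleep` argument is injectable only so that the loop can be exercised without actually waiting.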

 

Another option is to use something like the sensor-poke approach, with 
rescheduling, e.g.

 - [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
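As a rough illustration of the sensor-poke idea: a sensor's poke method returns True when the condition is met and False otherwise, and in reschedule mode the worker slot is released between pokes. The function below is a hypothetical sketch of what the body of such a poke might do, not existing operator code. It assumes `describe_jobs` behaves like the boto3 Batch client call of the same name, returning a dict with a "jobs" list whose entries carry a "status" field.

```python
SUCCESS = "SUCCEEDED"
FAILURE = "FAILED"


def poke_batch_job(describe_jobs, job_id):
    """Return True once the job has succeeded, False while it is still in progress.

    describe_jobs is assumed to accept jobs=[...] and to return
    {"jobs": [{"status": ...}]}, like the boto3 Batch client call.
    """
    response = describe_jobs(jobs=[job_id])
    jobs = response.get("jobs", [])
    if not jobs:
        raise RuntimeError("AWS Batch returned no job for id %s" % job_id)
    status = jobs[0]["status"]
    if status == FAILURE:
        # A failed batch job should fail the task; an unfinished one should not.
        raise RuntimeError("AWS Batch job %s failed" % job_id)
    return status == SUCCESS
```

With this shape, each poke makes exactly one API request, and the poke interval (rather than a busy loop inside the worker) controls how often the Batch API is hit.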

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
