[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-10-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949736#comment-16949736
 ] 

ASF subversion and git services commented on AIRFLOW-5218:
--

Commit a198969b5e3acaee67479ebab145d29866607453 in airflow's branch 
refs/heads/v1-10-stable from Darren Weber
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=a198969 ]

[AIRFLOW-5218] Less polling of AWS Batch job status (#5825)

https://issues.apache.org/jira/browse/AIRFLOW-5218
- avoid the AWS API throttle limits for highly concurrent tasks
- a small increase in the backoff factor could avoid excessive polling
- random sleep before polling to allow the batch task to spin-up
  - the random sleep helps to avoid API throttling
- revise the retry logic slightly to avoid unnecessary pause
  when there are no more retries required

(cherry picked from commit fc972fb6c82010f9809a437eb6b9772918a584d2)
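
For illustration only, a minimal sketch of the polling scheme the commit
message describes (a random startup sleep plus a capped exponential backoff);
the function and parameter names are hypothetical and this is not the actual
AwsBatchOperator code:

{code:python}
import random
import time


def poll_job_status(get_status, max_retries=20, backoff_factor=0.3,
                    startup_delay_range=(1, 10), max_interval=30):
    """Poll get_status() until it returns a terminal AWS Batch job state.

    get_status is a caller-supplied callable returning one of 'SUBMITTED',
    'PENDING', 'RUNNABLE', 'STARTING', 'RUNNING', 'SUCCEEDED' or 'FAILED'.
    """
    # random sleep before the first poll, so many concurrent tasks do not
    # hit the DescribeJobs API at the same instant
    time.sleep(random.uniform(*startup_delay_range))

    for retries in range(max_retries):
        status = get_status()
        if status in ('SUCCEEDED', 'FAILED'):
            return status
        if retries == max_retries - 1:
            break  # no retries left, so skip the final pause
        # exponential backoff with a cap, same shape as 1 + (retries * factor)^2
        time.sleep(min(1 + pow(retries * backoff_factor, 2), max_interval))

    raise RuntimeError('job did not reach a terminal state in time')
{code}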


> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
> Fix For: 2.0.0, 1.10.6
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?
>  
> Another option is to use something like the sensor-poke approach, with 
> rescheduling, e.g.
> - 
> [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
>  
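
As an aside, a rough sketch of the "parameter for the rate of polling (or a
callable)" idea from the description above; everything here (names, defaults)
is hypothetical and is not the current operator interface:

{code:python}
import time


def default_poll_interval(tries, status):
    # slow down once the job is RUNNING, since batch jobs are long-running
    return 30.0 if status == 'RUNNING' else min(1 + pow(tries * 0.3, 2), 30.0)


def wait_for_job(get_status, poll_interval=default_poll_interval,
                 initial_sleep=60.0, max_tries=100):
    """Wait for a terminal job state.

    poll_interval may be a number (fixed seconds between polls) or a callable
    of (tries, status) returning the seconds to sleep before the next poll.
    """
    time.sleep(initial_sleep)  # give the batch job some time to spin up
    for tries in range(max_tries):
        status = get_status()
        if status in ('SUCCEEDED', 'FAILED'):
            return status
        if callable(poll_interval):
            delay = poll_interval(tries, status)
        else:
            delay = poll_interval
        time.sleep(delay)
    raise RuntimeError('gave up waiting for a terminal job state')
{code}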



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-10-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949735#comment-16949735
 ] 

ASF subversion and git services commented on AIRFLOW-5218:
--

Commit a198969b5e3acaee67479ebab145d29866607453 in airflow's branch 
refs/heads/v1-10-stable from Darren Weber
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=a198969 ]

[AIRFLOW-5218] Less polling of AWS Batch job status (#5825)

https://issues.apache.org/jira/browse/AIRFLOW-5218
- avoid the AWS API throttle limits for highly concurrent tasks
- a small increase in the backoff factor could avoid excessive polling
- random sleep before polling to allow the batch task to spin-up
  - the random sleep helps to avoid API throttling
- revise the retry logic slightly to avoid unnecessary pause
  when there are no more retries required

(cherry picked from commit fc972fb6c82010f9809a437eb6b9772918a584d2)


> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
> Fix For: 2.0.0, 1.10.6
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?
>  
> Another option is to use something like the sensor-poke approach, with 
> rescheduling, e.g.
> - 
> [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
>  
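
To make the sensor-reschedule option mentioned above concrete, a hypothetical
sketch (this sensor does not exist in Airflow; the class and arguments are
made up), assuming the Airflow 1.10 BaseSensorOperator and the boto3 Batch
describe_jobs call:

{code:python}
import boto3
from airflow.exceptions import AirflowException
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class BatchJobSensor(BaseSensorOperator):
    """Poke AWS Batch for a job status (illustration only)."""

    @apply_defaults
    def __init__(self, job_id, aws_region='us-east-1', *args, **kwargs):
        super(BatchJobSensor, self).__init__(*args, **kwargs)
        self.job_id = job_id
        self.aws_region = aws_region

    def poke(self, context):
        client = boto3.client('batch', region_name=self.aws_region)
        jobs = client.describe_jobs(jobs=[self.job_id])['jobs']
        status = jobs[0]['status'] if jobs else 'UNKNOWN'
        self.log.info('AWS Batch job %s status: %s', self.job_id, status)
        if status == 'FAILED':
            raise AirflowException('AWS Batch job %s failed' % self.job_id)
        return status == 'SUCCEEDED'


# usage sketch: poll every 60 s and give the worker slot back between pokes
# BatchJobSensor(task_id='wait_for_batch_job', job_id=my_job_id,
#                poke_interval=60, mode='reschedule', dag=dag)
{code}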



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914430#comment-16914430
 ] 

ASF subversion and git services commented on AIRFLOW-5218:
--

Commit fc972fb6c82010f9809a437eb6b9772918a584d2 in airflow's branch 
refs/heads/master from Darren Weber
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=fc972fb ]

[AIRFLOW-5218] Less polling of AWS Batch job status (#5825)

https://issues.apache.org/jira/browse/AIRFLOW-5218
- avoid the AWS API throttle limits for highly concurrent tasks
- a small increase in the backoff factor could avoid excessive polling
- random sleep before polling to allow the batch task to spin-up
  - the random sleep helps to avoid API throttling
- revise the retry logic slightly to avoid unnecessary pause
  when there are no more retries required

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
> Fix For: 2.0.0
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?
>  
> Another option is to use something like the sensor-poke approach, with 
> rescheduling, e.g.
> - 
> [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
>  
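
As a quick check on the "roughly 1 poll/sec" figure quoted above, summing the
first ten intervals of that backoff formula:

{code:python}
# ten DescribeJobs calls spread over roughly 12.85 s, i.e. ~0.8 polls/sec,
# for every concurrently running task early in its life
delays = [1 + pow(retries * 0.1, 2) for retries in range(10)]
print(round(sum(delays), 2))                # ~12.85 seconds
print(round(len(delays) / sum(delays), 2))  # ~0.78 polls per second
{code}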



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914429#comment-16914429
 ] 

ASF GitHub Bot commented on AIRFLOW-5218:
-

kaxil commented on pull request #5825: [AIRFLOW-5218] less polling for AWS 
Batch status
URL: https://github.com/apache/airflow/pull/5825
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?
>  
> Another option is to use something like the sensor-poke approach, with 
> rescheduling, e.g.
> - 
> [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914431#comment-16914431
 ] 

ASF subversion and git services commented on AIRFLOW-5218:
--

Commit fc972fb6c82010f9809a437eb6b9772918a584d2 in airflow's branch 
refs/heads/master from Darren Weber
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=fc972fb ]

[AIRFLOW-5218] Less polling of AWS Batch job status (#5825)

https://issues.apache.org/jira/browse/AIRFLOW-5218
- avoid the AWS API throttle limits for highly concurrent tasks
- a small increase in the backoff factor could avoid excessive polling
- random sleep before polling to allow the batch task to spin-up
  - the random sleep helps to avoid API throttling
- revise the retry logic slightly to avoid unnecessary pause
  when there are no more retries required

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
> Fix For: 2.0.0
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?
>  
> Another option is to use something like the sensor-poke approach, with 
> rescheduling, e.g.
> - 
> [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907774#comment-16907774
 ] 

Darren Weber commented on AIRFLOW-5218:
---

There is something weird in the polling logs. The timestamps indicate that the 
retry polling interval is not what the log message says it will be: the message 
reports the retry attempt count as the number of seconds, which it is not.
{noformat}
[2019-08-15 02:33:57,163] {awsbatch_operator.py:103} INFO - AWS Batch Job 
started: ...
[2019-08-15 02:33:57,166] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 0 seconds
[2019-08-15 02:33:58,284] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 1 seconds
[2019-08-15 02:33:59,412] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 2 seconds 
[2019-08-15 02:34:00,568] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 3 seconds 
[2019-08-15 02:34:01,866] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 4 seconds 
[2019-08-15 02:34:03,140] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 5 seconds 
[2019-08-15 02:34:04,695] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 6 seconds 
[2019-08-15 02:34:06,165] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 7 seconds 
[2019-08-15 02:34:07,764] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 8 seconds 
[2019-08-15 02:34:09,514] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 9 seconds
[2019-08-15 02:34:11,440] {awsbatch_operator.py:137} INFO - AWS Batch retry in 
the next 10 seconds
{noformat}
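
A hedged reconstruction of what those log lines suggest (this is not quoted
from awsbatch_operator.py and the function name is made up): the message
prints the retry counter as if it were seconds, while the actual sleep is the
much smaller backoff value, which is why the log lines stay roughly one second
apart even as the reported "seconds" climb:

{code:python}
import logging
import time

log = logging.getLogger(__name__)


def _wait_for_task_ended_sketch(max_retries=10):
    for retries in range(max_retries):
        # what the log claims: "retry in the next <retries> seconds"
        log.info('AWS Batch retry in the next %s seconds', retries)
        # what actually happens: sleep 1 + (retries * 0.1)^2 seconds (~1-2 s)
        time.sleep(1 + pow(retries * 0.1, 2))
{code}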

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907762#comment-16907762
 ] 

ASF GitHub Bot commented on AIRFLOW-5218:
-

darrenleeweber commented on pull request #5825: [AIRFLOW-5218] less polling for 
AWS Batch status
URL: https://github.com/apache/airflow/pull/5825
 
 
   ### Jira
   
   - [x] My PR addresses the following [Airflow Jira]
   - https://issues.apache.org/jira/browse/AIRFLOW-5218
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI 
changes:
   - a small increase in the backoff factor could avoid excessive polling
   - avoid the AWS API throttle limits for highly concurrent tasks
   
   ### Tests
   
   - [ ] My PR does not need testing for this extremely good reason:
   - it's the smallest possible change that might address the issue
   - the change does not impact any public API
   - if there are tests on the polling interval (or should be), LMK
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines
   - it's just one commit
   - the commit message is succinct, LMK if you want it amended
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - no changes required to documentation
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907749#comment-16907749
 ] 

Darren Weber commented on AIRFLOW-5218:
---

Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code}
from datetime import datetime
from time import sleep

In [18]: for i in [1 + pow(retries * 0.3, 2) for retries in range(10)]:
    ...:     print(f"{datetime.now()}: sleeping for {i}")
    ...:     sleep(i)
    ...:
2019-08-14 18:52:01.688705: sleeping for 1.0
2019-08-14 18:52:02.690385: sleeping for 1.09
2019-08-14 18:52:03.781384: sleeping for 1.3599
2019-08-14 18:52:05.144492: sleeping for 1.8098
2019-08-14 18:52:06.956547: sleeping for 2.44
2019-08-14 18:52:09.401454: sleeping for 3.25
2019-08-14 18:52:12.652212: sleeping for 4.239
2019-08-14 18:52:16.897060: sleeping for 5.41
2019-08-14 18:52:22.313692: sleeping for 6.76
2019-08-14 18:52:29.082087: sleeping for 8.29
{code}
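
The same formula makes it easy to compare the two factors over the first 20
retries (just arithmetic on the backoff shape shown above):

{code:python}
# total sleep across 20 retries of the 1 + (retries * factor)^2 backoff
for factor in (0.1, 0.3):
    delays = [1 + pow(retries * factor, 2) for retries in range(20)]
    print(factor, round(sum(delays), 1))
# factor 0.1 -> ~44.7 s total (about 0.45 polls/sec on average per job)
# factor 0.3 -> ~242.3 s total, i.e. far fewer DescribeJobs calls per job
{code}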

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
> - https://github.com/boto/botocore/pull/1307
> - see also https://github.com/broadinstitute/cromwell/issues/4303
> This is a curious case of premature optimization. In the meantime, it means 
> that the fallback is the exponential backoff routine for the status checks 
> on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
> very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard, the AWS API throttle throws an error, and the Airflow task fails, 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm: within the first 10 retries, the 
> status of an AWS Batch job is checked about 10 times at a rate of roughly 
> 1 poll/sec. When an Airflow instance is running tens or hundreds of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker with too much busy work).
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)]
> Out[4]: 
> [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]
> One possible solution is to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running (rather than near-real-time) jobs, it might 
> also help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of 
> polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)