[jira] [Commented] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files
[ https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907850#comment-16907850 ] ASF subversion and git services commented on AIRFLOW-5161: -- Commit 58fa6b56e514996e8a63d3cab32ece3721253724 in airflow's branch refs/heads/v1-10-test from Jarek Potiuk [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=58fa6b5 ] [AIRFLOW-5161] Static checks are run automatically in pre-commit hooks (#5777) > Add pre-commit hooks to run static checks for only changed files > > > Key: AIRFLOW-5161 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5161 > Project: Apache Airflow > Issue Type: Improvement > Components: ci >Affects Versions: 2.0.0 >Reporter: Jarek Potiuk >Priority: Major > Fix For: 1.10.5 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] potiuk commented on issue #5808: [AIRFLOW-5205] Check xml files with xmllint + Licenses
potiuk commented on issue #5808: [AIRFLOW-5205] Check xml files with xmllint + Licenses URL: https://github.com/apache/airflow/pull/5808#issuecomment-521523483 Made the PR standalone (not depending on a series of PRs). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] potiuk commented on a change in pull request #5808: [AIRFLOW-5205] Check xml files with xmllint + Licenses
potiuk commented on a change in pull request #5808: [AIRFLOW-5205] Check xml files with xmllint + Licenses URL: https://github.com/apache/airflow/pull/5808#discussion_r314181813 ## File path: airflow/_vendor/slugify/slugify.py ## @@ -1,3 +1,6 @@ +# -*- coding: utf-8 -*- +# pylint: skip-file +"""Slugify !""" Review comment: Removed in the first commit.
[GitHub] [airflow] potiuk commented on issue #5807: [AIRFLOW-5204] Shellcheck + common licences in shell files
potiuk commented on issue #5807: [AIRFLOW-5204] Shellcheck + common licences in shell files URL: https://github.com/apache/airflow/pull/5807#issuecomment-521519887 Again - another set of checks. This time for shell files (shellcheck + shebangs/executable + licenses).
[GitHub] [airflow] potiuk commented on a change in pull request #5807: [AIRFLOW-5204] Shellcheck + common licence in shell files
potiuk commented on a change in pull request #5807: [AIRFLOW-5204] Shellcheck + common licence in shell files URL: https://github.com/apache/airflow/pull/5807#discussion_r314178750 ## File path: airflow/example_dags/entrypoint.sh ## @@ -1,20 +1,20 @@ -# -*- coding: utf-8 -*- +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at +#http://www.apache.org/licenses/LICENSE-2.0 # -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
-["/bin/bash", "-c", "/bin/sleep 30; /bin/mv {{params.source_location}}/{{ ti.xcom_pull('view_file') }} {{params.target_location}}; /bin/echo '{{params.target_location}}/{{ ti.xcom_pull('view_file') }}';"] +# TODO: Uncomment this code when we start using it +#[ "/bin/bash", "-c", "/bin/sleep 30; /bin/mv {{params.source_location}}/{{ ti.xcom_pull('view_file') }} {{params.target_location}}; /bin/echo '{{params.target_location}}/{{ ti.xcom_pull('view_file') }}';" ] # shellcheck disable=SC1073,SC1072,SC1035 Review comment: This is a problematic implementation of DockerOperator with regard to the command. The command can be either a string or an array. It can be templated, and it can also be a file with a .bash or .sh extension. In this case the Python array was stored in a file with a .sh extension - that was valid from the DockerOperator point of view (see docker_copy_data.py), but it makes little sense to store an array in a .sh file. Those tests in docker_copy_data.py were commented out anyhow, with a suggestion to uncomment them if you want to run your own testing. Rather than commenting the array out, I simply moved it to docker_copy_data.py and removed entrypoint.sh.
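As the comment notes, the command can arrive as either a string or an array. A minimal sketch of normalizing the two in-memory forms to the argv-style list that the Docker API ultimately works with; the helper name is illustrative, not Airflow's actual implementation:

```python
import shlex


def normalize_command(command):
    """Normalize a command given as a string or a list to list form.

    Hypothetical helper: DockerOperator accepts both forms, but the
    Docker API expects an argv-style list of strings.
    """
    if isinstance(command, str):
        # Split the string while respecting shell-style quoting.
        return shlex.split(command)
    return list(command)
```

For example, `normalize_command("/bin/sleep 30")` yields `["/bin/sleep", "30"]`, while a list such as `["/bin/bash", "-c", "echo hi"]` passes through unchanged.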
[jira] [Updated] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darren Weber updated AIRFLOW-5218: -- Description: The AWS Batch Operator attempts to use a boto3 feature that is not available and has not been merged in years, see - [https://github.com/boto/botocore/pull/1307] - see also [https://github.com/broadinstitute/cromwell/issues/4303] This is a curious case of premature optimization. So, in the meantime, this means that the fallback is the exponential backoff routine for the status checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is very high (100's of tasks), this fallback polling hits the AWS Batch API too hard and the AWS API throttle throws an error, which fails the Airflow task, simply because the status is polled too frequently. Check the output from the retry algorithm, e.g. within the first 10 retries, the status of an AWS batch job is checked about 10 times at a rate that is approx 1 retry/sec. When an Airflow instance is running 10's or 100's of concurrent batch jobs, this hits the API too frequently and crashes the Airflow task (plus it occupies a worker in too much busy work). {code:java} In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] Out[4]: [1.0, 1.01, 1.04, 1.09, 1.1601, 1.25, 1.36, 1.4902, 1.6401, 1.81, 2.0, 2.21, 2.4404, 2.6904, 2.9604, 3.25, 3.5605, 3.8906, 4.24, 4.61]{code} Possible solutions are to introduce an initial sleep (say 60 sec?) right after issuing the request, so that the batch job has some time to spin up. The job progresses through several phases before it gets to the RUNNING state, and polling for each phase of that sequence might help. Since batch jobs tend to be long-running jobs (rather than near-real-time jobs), it might help to issue less frequent polls when the job is in the RUNNING state. Something on the order of tens of seconds might be reasonable for batch jobs?
Maybe the class could expose a parameter for the rate of polling (or a callable)? Another option is to use something like the sensor-poke approach, with rescheduling, e.g. - [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117] > AWS Batch Operator - status polling too often, esp. for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Assignee: Darren Weber >Priority: Major > > The AWS Batch Operator attempts to use a boto3 feature that is not available > and has not been merged i
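The suggestions in the description (an initial sleep while the job spins up, then backoff with an upper bound on the pause) could be sketched roughly as follows; the function and parameter names are illustrative, not the operator's actual API:

```python
import itertools
import time


def poll_until_done(get_status, base=1.0, factor=0.3,
                    max_pause=30.0, initial_delay=60.0):
    """Poll get_status() until a terminal state, backing off between polls.

    Illustrative sketch: initial_delay gives the batch job time to spin
    up before the first poll, and max_pause caps the quadratic backoff
    so long-running jobs are polled at a gentle, bounded rate.
    """
    time.sleep(initial_delay)
    for retries in itertools.count():
        status = get_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        # Same backoff formula as in the description, capped at max_pause.
        time.sleep(min(base + (retries * factor) ** 2, max_pause))
```

A callable `get_status` also hints at how the class could expose the polling rate (or the whole polling policy) as a parameter.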
[GitHub] [airflow] potiuk commented on issue #5790: [AIRFLOW-5180] Added static checks (yamllint) + auto-licences for yaml
potiuk commented on issue #5790: [AIRFLOW-5180] Added static checks (yamllint) + auto-licences for yaml URL: https://github.com/apache/airflow/pull/5790#issuecomment-521513998 Part of static checks dealing with yaml (yamllint + consistent licenses). Removed the chain of depending commits.
[jira] [Resolved] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files
[ https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarek Potiuk resolved AIRFLOW-5161. --- Resolution: Fixed Fix Version/s: 1.10.5 > Add pre-commit hooks to run static checks for only changed files > > > Key: AIRFLOW-5161 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5161 > Project: Apache Airflow > Issue Type: Improvement > Components: ci >Affects Versions: 2.0.0 >Reporter: Jarek Potiuk >Priority: Major > Fix For: 1.10.5 > >
[jira] [Commented] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files
[ https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907824#comment-16907824 ] ASF subversion and git services commented on AIRFLOW-5161: -- Commit df4dc31ea109b4a6b832a9d6b3a4d54e1efd6e5a in airflow's branch refs/heads/v1-10-test from Jarek Potiuk [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=df4dc31 ] [AIRFLOW-5161] Static checks are run automatically in pre-commit hooks (#5777) > Add pre-commit hooks to run static checks for only changed files > > > Key: AIRFLOW-5161 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5161 > Project: Apache Airflow > Issue Type: Improvement > Components: ci >Affects Versions: 2.0.0 >Reporter: Jarek Potiuk >Priority: Major >
[GitHub] [airflow] potiuk commented on issue #5786: [AIRFLOW-5170] Fix encoding pragmas, consistent licences for python files and related pylint fixes
potiuk commented on issue #5786: [AIRFLOW-5170] Fix encoding pragmas, consistent licences for python files and related pylint fixes URL: https://github.com/apache/airflow/pull/5786#issuecomment-521509287 @ashb @dimberman @Fokko -> this is the first additional set of checks (for python files) added after merging the pylint/mypy/flake checks in pre-commit. It will make our python code much more consistent (and fixes/disables a lot of pylint errors). We also have a script that can refresh pylint_todo.txt
[GitHub] [airflow] derrick-mink-sp opened a new pull request #5826: Sailpoint internal/pod aliases
derrick-mink-sp opened a new pull request #5826: Sailpoint internal/pod aliases URL: https://github.com/apache/airflow/pull/5826 Make sure you have checked _all_ steps below. ### Jira - [X] My PR addresses the following [Airflow Jira] - https://issues.apache.org/jira/browse/AIRFLOW-5221 ### Description - [X] Here are some details about my PR, including screenshots of any UI changes: - This PR will give users the ability to add DNS entries to their Kubernetes pods via hostAliases ### Tests - [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: tests/minikube/test_kubernetes_pod_operator.py - test_host_aliases ### Commits - [ ] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [X] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release ### Code Quality - [ ] Passes `flake8`
[jira] [Created] (AIRFLOW-5221) Add host alias support to the KubernetesPodOperator
Derrick Mink created AIRFLOW-5221: - Summary: Add host alias support to the KubernetesPodOperator Key: AIRFLOW-5221 URL: https://issues.apache.org/jira/browse/AIRFLOW-5221 Project: Apache Airflow Issue Type: Improvement Components: operators Affects Versions: 1.10.4 Reporter: Derrick Mink Assignee: Derrick Mink [https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/] The only way to manage DNS entries for Kubernetes pods is through host aliases
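The hostAliases field referenced in the Kubernetes docs above maps an IP address to one or more hostnames in the pod's /etc/hosts. A small sketch building that pod-spec fragment as a plain dict; the helper function is hypothetical, but the field names `ip` and `hostnames` match the Kubernetes API:

```python
def host_aliases_fragment(aliases):
    """Build the pod-spec 'hostAliases' fragment from (ip, hostnames) pairs.

    Illustrative helper: the resulting structure mirrors what
    spec.hostAliases looks like in a pod manifest.
    """
    return {
        "hostAliases": [
            {"ip": ip, "hostnames": list(hostnames)}
            for ip, hostnames in aliases
        ]
    }
```

For example, `host_aliases_fragment([("10.1.2.3", ["foo.local"])])` produces the same shape a manifest would declare under `spec.hostAliases`.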
[jira] [Created] (AIRFLOW-5220) Easy form to create airflow dags
huangyan created AIRFLOW-5220: - Summary: Easy form to create airflow dags Key: AIRFLOW-5220 URL: https://issues.apache.org/jira/browse/AIRFLOW-5220 Project: Apache Airflow Issue Type: New Feature Components: DAG, database Affects Versions: 1.10.5 Reporter: huangyan Assignee: huangyan Airflow has a high barrier to entry: users must write a Python DAG file. However, many users don't write Python; they want to create DAGs directly from forms.
[jira] [Created] (AIRFLOW-5219) Alarm if the task is not executed within the expected time range.
huangyan created AIRFLOW-5219: - Summary: Alarm if the task is not executed within the expected time range. Key: AIRFLOW-5219 URL: https://issues.apache.org/jira/browse/AIRFLOW-5219 Project: Apache Airflow Issue Type: New Feature Components: DAG Affects Versions: 1.10.4 Reporter: huangyan Assignee: huangyan Fix For: 1.10.4 When using Airflow, a user has an expected time range for a task. If the task does not run within that range, the user expects to get an alert instead of having the task execute directly. They may not want the task to run automatically; instead, they would trigger it manually after analyzing the cause.
[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907774#comment-16907774 ] Darren Weber commented on AIRFLOW-5218: --- There is something weird in the polling logs. The timestamps in the logs indicate that the retry polling interval is not what it says it will be, e.g. it reports the retry attempt count as the number of seconds (it's not). {noformat} [2019-08-15 02:33:57,163] {awsbatch_operator.py:103} INFO - AWS Batch Job started: ... [2019-08-15 02:33:57,166] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 0 seconds [2019-08-15 02:33:58,284] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 1 seconds [2019-08-15 02:33:59,412] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 2 seconds [2019-08-15 02:34:00,568] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 3 seconds [2019-08-15 02:34:01,866] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 4 seconds [2019-08-15 02:34:03,140] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 5 seconds [2019-08-15 02:34:04,695] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 6 seconds [2019-08-15 02:34:06,165] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 7 seconds [2019-08-15 02:34:07,764] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 8 seconds [2019-08-15 02:34:09,514] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 9 seconds [2019-08-15 02:34:11,440] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 10 seconds {noformat} > AWS Batch Operator - status polling too often, esp. 
for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Assignee: Darren Weber >Priority: Major > > The AWS Batch Operator attempts to use a boto3 feature that is not available > and has not been merged in years, see > - [https://github.com/boto/botocore/pull/1307] > - see also [https://github.com/broadinstitute/cromwell/issues/4303] > This is a curious case of premature optimization. So, in the meantime, this > means that the fallback is the exponential backoff routine for the status > checks on the batch job. Unfortunately, when the concurrency of Airflow jobs > is very high (100's of tasks), this fallback polling hits the AWS Batch API > too hard and the AWS API throttle throws an error, which fails the Airflow > task, simply because the status is polled too frequently. > Check the output from the retry algorithm, e.g. within the first 10 retries, > the status of an AWS batch job is checked about 10 times at a rate that is > approx 1 retry/sec. When an Airflow instance is running 10's or 100's of > concurrent batch jobs, this hits the API too frequently and crashes the > Airflow task (plus it occupies a worker in too much busy work). > {code:java} > In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] > Out[4]: > [1.0, > 1.01, > 1.04, > 1.09, > 1.1601, > 1.25, > 1.36, > 1.4902, > 1.6401, > 1.81, > 2.0, > 2.21, > 2.4404, > 2.6904, > 2.9604, > 3.25, > 3.5605, > 3.8906, > 4.24, > 4.61]{code} > Possible solutions are to introduce an initial sleep (say 60 sec?) right > after issuing the request, so that the batch job has some time to spin up. > The job progresses through a through phases before it gets to RUNNING state > and polling for each phase of that sequence might help. 
Since batch jobs tend > to be long-running jobs (rather than near-real time jobs), it might help to > issue less frequent polls when it's in the RUNNING state. Something on the > order of 10's seconds might be reasonable for batch jobs? Maybe the class > could expose a parameter for the rate of polling (or a callable)?
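The log mismatch noted in the comment above (the message reports the retry attempt count, not the pause) is visible when computing the pause the fallback actually uses, per the formula quoted in the issue description:

```python
def batch_pause(retries, factor=0.1):
    """Pause (seconds) used by the fallback backoff in the description:
    1 + (retries * factor) ** 2."""
    return 1 + (retries * factor) ** 2


# After 10 retries the actual pause is only 2 seconds, even though the
# log line reads "retry in the next 10 seconds" - matching the roughly
# 1 poll/sec cadence visible in the log timestamps above.
```

The helper name is illustrative; the formula itself is the one from the operator snippet in the description.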
[jira] [Assigned] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darren Weber reassigned AIRFLOW-5218: - Assignee: Darren Weber > AWS Batch Operator - status polling too often, esp. for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Assignee: Darren Weber >Priority: Major > > The AWS Batch Operator attempts to use a boto3 feature that is not available > and has not been merged in years, see > - [https://github.com/boto/botocore/pull/1307] > - see also [https://github.com/broadinstitute/cromwell/issues/4303] > This is a curious case of premature optimization. So, in the meantime, this > means that the fallback is the exponential backoff routine for the status > checks on the batch job. Unfortunately, when the concurrency of Airflow jobs > is very high (100's of tasks), this fallback polling hits the AWS Batch API > too hard and the AWS API throttle throws an error, which fails the Airflow > task, simply because the status is polled too frequently. > Check the output from the retry algorithm, e.g. within the first 10 retries, > the status of an AWS batch job is checked about 10 times at a rate that is > approx 1 retry/sec. When an Airflow instance is running 10's or 100's of > concurrent batch jobs, this hits the API too frequently and crashes the > Airflow task (plus it occupies a worker in too much busy work). > {code:java} > In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] > Out[4]: > [1.0, > 1.01, > 1.04, > 1.09, > 1.1601, > 1.25, > 1.36, > 1.4902, > 1.6401, > 1.81, > 2.0, > 2.21, > 2.4404, > 2.6904, > 2.9604, > 3.25, > 3.5605, > 3.8906, > 4.24, > 4.61]{code} > Possible solutions are to introduce an initial sleep (say 60 sec?) right > after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a through phases before it gets to RUNNING state > and polling for each phase of that sequence might help. Since batch jobs tend > to be long-running jobs (rather than near-real time jobs), it might help to > issue less frequent polls when it's in the RUNNING state. Something on the > order of 10's seconds might be reasonable for batch jobs? Maybe the class > could expose a parameter for the rate of polling (or a callable)?
[jira] [Comment Edited] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907749#comment-16907749 ] Darren Weber edited comment on AIRFLOW-5218 at 8/15/19 2:15 AM: PR at [https://github.com/apache/airflow/pull/5825] applies the following suggestion. Even bumping the backoff factor from `0.1` to `0.3` might help, e.g. {code:java} from datetime import datetime from time import sleep for retries in range(10): pause = 1 + pow(retries * 0.3, 2) print(f"{datetime.now()}: retry ({retries:04d}) sleeping for {pause:6.2f} sec") sleep(pause) 2019-08-14 19:02:58.745923: retry (0000) sleeping for 1.00 sec 2019-08-14 19:02:59.747635: retry (0001) sleeping for 1.09 sec 2019-08-14 19:03:00.840129: retry (0002) sleeping for 1.36 sec 2019-08-14 19:03:02.202734: retry (0003) sleeping for 1.81 sec 2019-08-14 19:03:04.015686: retry (0004) sleeping for 2.44 sec 2019-08-14 19:03:06.458972: retry (0005) sleeping for 3.25 sec 2019-08-14 19:03:09.713452: retry (0006) sleeping for 4.24 sec 2019-08-14 19:03:13.954253: retry (0007) sleeping for 5.41 sec 2019-08-14 19:03:19.368445: retry (0008) sleeping for 6.76 sec 2019-08-14 19:03:26.135600: retry (0009) sleeping for 8.29 sec {code} > AWS Batch Operator - status polling too often, esp. for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Priority: Major > > The AWS Batch Operator attempts to use a boto3 feature that is not available > and has not been merged in years, see > - [https://github.com/boto/botocore/pull/1307] > - see also [https://github.com/broadinstitute/cromwell/issues/4303] > This is a curious case of premature optimization. So, in the meantime, this > means that the fallback is the exponential backoff routine for the status > checks on the batch job. Unfortunately, when the concurrency of Airflow jobs > is very high (100's of tasks), this fallback polling hits the AWS Batch API > too hard and the AWS API throttle throws an error, which fails the Airflow > task, simply because the status is polled too frequently. > Check the output from the retry algorithm, e.g.
within the first 10 retries, > the status of an AWS batch job is checked about 10 times at a rate that is > approx 1 retry/sec. When an Airflow instance is running 10's or 100's of > concurrent batch jobs, this hits the API too frequently and crashes the > Airflow task (plus it occupies a worker in too much busy work). > {code:java} > In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] > Out[4]: > [1.0, > 1.01, > 1.04, > 1.09, > 1.1601, > 1.25, > 1.36, > 1.4902, > 1.6401, > 1.81, > 2.0, > 2.21, > 2.4404, > 2.6904, > 2.9604, > 3.25, > 3.5605, > 3.8906, > 4.24, > 4.61]{code} > Possible solutions are to introduce an initial sleep (say 60 sec?) right > after issuing the request, so that the batch job has some time to spin up. > The job progresses through a through phases before it gets to RUNNING state > and polling for each phase of that sequence might help. Since batch jobs tend > to be long-running jobs (rather than near-real time jobs), it might help to > issue less frequent polls when it's in the RUNNING state. Something on the > order of 10's seconds might be reasonable for batch jobs? Maybe the class > could expose a parameter for the rate of polling (or a callable)? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
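The effect of the proposed bump of the backoff factor from 0.1 to 0.3 can be quantified by summing the pauses across the first polls; an assumed-for-illustration helper:

```python
def total_wait(polls, factor):
    """Cumulative sleep time (seconds) across the first `polls` polls
    for the backoff formula 1 + (retries * factor) ** 2."""
    return sum(1 + (r * factor) ** 2 for r in range(polls))


# With factor 0.1 the first 10 polls fit in about 12.85 s; with factor
# 0.3 they spread over about 35.65 s - roughly a third of the API calls
# per unit time, which is gentler on the AWS Batch API throttle.
```

This is only a back-of-the-envelope comparison; the actual rate also depends on the API round-trip time visible in the log timestamps.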
[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907762#comment-16907762 ] ASF GitHub Bot commented on AIRFLOW-5218: - darrenleeweber commented on pull request #5825: [AIRFLOW-5218] less polling for AWS Batch status URL: https://github.com/apache/airflow/pull/5825 ### Jira - [x] My PR addresses the following [Airflow Jira] - https://issues.apache.org/jira/browse/AIRFLOW-5218 ### Description - [x] Here are some details about my PR, including screenshots of any UI changes: - a small increase in the backoff factor could avoid excessive polling - avoid the AWS API throttle limits for highly concurrent tasks ### Tests - [ ] My PR does not need testing for this extremely good reason: - it's the smallest possible change that might address the issue - the change does not impact any public API - if there are tests on the polling interval (or should be), LMK ### Commits - [x] My commits all reference Jira issues in their subject lines - it's just one commit - the commit message is succinct, LMK if you want it amended ### Documentation - [x] In case of new functionality, my PR adds documentation that describes how to use it. - no changes required to documentation ### Code Quality - [ ] Passes `flake8` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > AWS Batch Operator - status polling too often, esp. 
for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Priority: Major > > The AWS Batch Operator attempts to use a boto3 feature that is not available > and has not been merged in years, see > - [https://github.com/boto/botocore/pull/1307] > - see also [https://github.com/broadinstitute/cromwell/issues/4303] > This is a curious case of premature optimization. So, in the meantime, this > means that the fallback is the exponential backoff routine for the status > checks on the batch job. Unfortunately, when the concurrency of Airflow jobs > is very high (100's of tasks), this fallback polling hits the AWS Batch API > too hard and the AWS API throttle throws an error, which fails the Airflow > task, simply because the status is polled too frequently. > Check the output from the retry algorithm, e.g. within the first 10 retries, > the status of an AWS batch job is checked about 10 times at a rate that is > approx 1 retry/sec. When an Airflow instance is running 10's or 100's of > concurrent batch jobs, this hits the API too frequently and crashes the > Airflow task (plus it occupies a worker in too much busy work). > {code:java} > In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] > Out[4]: > [1.0, > 1.01, > 1.04, > 1.09, > 1.1601, > 1.25, > 1.36, > 1.4902, > 1.6401, > 1.81, > 2.0, > 2.21, > 2.4404, > 2.6904, > 2.9604, > 3.25, > 3.5605, > 3.8906, > 4.24, > 4.61]{code} > Possible solutions are to introduce an initial sleep (say 60 sec?) right > after issuing the request, so that the batch job has some time to spin up. > The job progresses through a through phases before it gets to RUNNING state > and polling for each phase of that sequence might help. 
Since batch jobs tend > to be long-running jobs (rather than near-real time jobs), it might help to > issue less frequent polls when it's in the RUNNING state. Something on the > order of 10's seconds might be reasonable for batch jobs? Maybe the class > could expose a parameter for the rate of polling (or a callable)? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
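The trade-off discussed in the issue above can be made concrete by comparing the quoted quadratic backoff for the two factors raised in this thread. The sketch below is illustrative only (the `poll_delays` helper name is made up for this example and is not an Airflow or boto3 API):

```python
# Sketch of the quadratic backoff quoted above: the delay before the n-th
# status poll is 1 + (n * factor)**2 seconds. `poll_delays` is a
# hypothetical helper name used only for this illustration.
def poll_delays(factor, polls=20):
    """Seconds to sleep before each of the first `polls` status checks."""
    return [1 + (n * factor) ** 2 for n in range(polls)]

# factor 0.1 (current): 20 polls squeezed into roughly 45 seconds
# factor 0.3 (proposed): the same 20 polls spread over roughly 4 minutes
for factor in (0.1, 0.3):
    delays = poll_delays(factor)
    print(f"factor {factor}: {sum(delays):.1f}s total, "
          f"last delay {delays[-1]:.2f}s")
```

Under this sketch, tripling the backoff factor spreads the same number of status checks over roughly five times the wall-clock window, which is the mechanism the PR relies on to stay under the AWS API throttle for highly concurrent tasks.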
[GitHub] [airflow] darrenleeweber opened a new pull request #5825: [AIRFLOW-5218] less polling for AWS Batch status
darrenleeweber opened a new pull request #5825: [AIRFLOW-5218] less polling for AWS Batch status URL: https://github.com/apache/airflow/pull/5825 ### Jira - [x] My PR addresses the following [Airflow Jira] - https://issues.apache.org/jira/browse/AIRFLOW-5218 ### Description - [x] Here are some details about my PR, including screenshots of any UI changes: - a small increase in the backoff factor could avoid excessive polling - avoid the AWS API throttle limits for highly concurrent tasks ### Tests - [ ] My PR does not need testing for this extremely good reason: - it's the smallest possible change that might address the issue - the change does not impact any public API - if there are tests on the polling interval (or should be), LMK ### Commits - [x] My commits all reference Jira issues in their subject lines - it's just one commit - the commit message is succinct, LMK if you want it amended ### Documentation - [x] In case of new functionality, my PR adds documentation that describes how to use it. - no changes required to documentation ### Code Quality - [ ] Passes `flake8` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Comment Edited] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907749#comment-16907749 ] Darren Weber edited comment on AIRFLOW-5218 at 8/15/19 2:04 AM: Even bumping the backoff factor from `0.1` to `0.3` might help, e.g. {code:java} from datetime import datetime from time import sleep for retries in range(10): pause = 1 + pow(retries * 0.3, 2) print(f"{datetime.now()}: retry ({retries:04d}) sleeping for {pause:6.2f} sec") sleep(pause) 2019-08-14 19:02:58.745923: retry (0000) sleeping for 1.00 sec 2019-08-14 19:02:59.747635: retry (0001) sleeping for 1.09 sec 2019-08-14 19:03:00.840129: retry (0002) sleeping for 1.36 sec 2019-08-14 19:03:02.202734: retry (0003) sleeping for 1.81 sec 2019-08-14 19:03:04.015686: retry (0004) sleeping for 2.44 sec 2019-08-14 19:03:06.458972: retry (0005) sleeping for 3.25 sec 2019-08-14 19:03:09.713452: retry (0006) sleeping for 4.24 sec 2019-08-14 19:03:13.954253: retry (0007) sleeping for 5.41 sec 2019-08-14 19:03:19.368445: retry (0008) sleeping for 6.76 sec 2019-08-14 19:03:26.135600: retry (0009) sleeping for 8.29 sec {code} was (Author: dazza): Even bumping the backoff factor from `0.1` to `0.3` might help, e.g. {code} from datetime import datetime from time import sleep In [18]: for i in [1 + pow(retries * 0.3, 2) for retries in range(10)]: ...: print(f"{datetime.now()}: sleeping for {i}") ...: sleep(i) ...: 2019-08-14 18:52:01.688705: sleeping for 1.0 2019-08-14 18:52:02.690385: sleeping for 1.09 2019-08-14 18:52:03.781384: sleeping for 1.3599 2019-08-14 18:52:05.144492: sleeping for 1.8098 2019-08-14 18:52:06.956547: sleeping for 2.44 2019-08-14 18:52:09.401454: sleeping for 3.25 2019-08-14 18:52:12.652212: sleeping for 4.239 2019-08-14 18:52:16.897060: sleeping for 5.41 2019-08-14 18:52:22.313692: sleeping for 6.76 2019-08-14 18:52:29.082087: sleeping for 8.29 {code} > AWS Batch Operator - status polling too often, esp. 
for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Priority: Major > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907749#comment-16907749 ] Darren Weber commented on AIRFLOW-5218: --- Even bumping the backoff factor from `0.1` to `0.3` might help, e.g. {code} from datetime import datetime from time import sleep In [18]: for i in [1 + pow(retries * 0.3, 2) for retries in range(10)]: ...: print(f"{datetime.now()}: sleeping for {i}") ...: sleep(i) ...: 2019-08-14 18:52:01.688705: sleeping for 1.0 2019-08-14 18:52:02.690385: sleeping for 1.09 2019-08-14 18:52:03.781384: sleeping for 1.3599 2019-08-14 18:52:05.144492: sleeping for 1.8098 2019-08-14 18:52:06.956547: sleeping for 2.44 2019-08-14 18:52:09.401454: sleeping for 3.25 2019-08-14 18:52:12.652212: sleeping for 4.239 2019-08-14 18:52:16.897060: sleeping for 5.41 2019-08-14 18:52:22.313692: sleeping for 6.76 2019-08-14 18:52:29.082087: sleeping for 8.29 {code} > AWS Batch Operator - status polling too often, esp. for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Priority: Major > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darren Weber updated AIRFLOW-5218: -- Description: The AWS Batch Operator attempts to use a boto3 feature that is not available and has not been merged in years, see - [https://github.com/boto/botocore/pull/1307] - see also [https://github.com/broadinstitute/cromwell/issues/4303] This is a curious case of premature optimization. So, in the meantime, this means that the fallback is the exponential backoff routine for the status checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is very high (100's of tasks), this fallback polling hits the AWS Batch API too hard and the AWS API throttle throws an error, which fails the Airflow task, simply because the status is polled too frequently. Check the output from the retry algorithm, e.g. within the first 10 retries, the status of an AWS batch job is checked about 10 times at a rate that is approx 1 retry/sec. When an Airflow instance is running 10's or 100's of concurrent batch jobs, this hits the API too frequently and crashes the Airflow task (plus it occupies a worker in too much busy work). {code:java} In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] Out[4]: [1.0, 1.01, 1.04, 1.09, 1.1601, 1.25, 1.36, 1.4902, 1.6401, 1.81, 2.0, 2.21, 2.4404, 2.6904, 2.9604, 3.25, 3.5605, 3.8906, 4.24, 4.61]{code} Possible solutions are to introduce an initial sleep (say 60 sec?) right after issuing the request, so that the batch job has some time to spin up. The job progresses through a few phases before it gets to RUNNING state and polling for each phase of that sequence might help. Since batch jobs tend to be long-running jobs (rather than near-real time jobs), it might help to issue less frequent polls when it's in the RUNNING state. Something on the order of 10's seconds might be reasonable for batch jobs? 
Maybe the class could expose a parameter for the rate of polling (or a callable)? was: The AWS Batch Operator attempts to use a boto3 feature that is not available and has not been merged in years, see - https://github.com/boto/botocore/pull/1307 - see also https://github.com/broadinstitute/cromwell/issues/4303 This is a curious case of premature optimization. So, in the meantime, this means that the fallback is the exponential backoff routine for the status checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is very high (100's of tasks), this fallback polling hits the AWS Batch API too hard and the AWS API throttle throws an error, which fails the Airflow task, simply because the status is polled too frequently. Check the output from the retry algorithm, e.g. within the first 10 retries, the status of an AWS batch job is checked about 10 times at a rate that is approx 1 retry/sec. When an Airflow instance is running 10's or 100's of concurrent batch jobs, this hits the API too frequently and crashes the Airflow task (plus it occupies a worker in too much busy work). In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] Out[4]: [1.0, 1.01, 1.04, 1.09, 1.1601, 1.25, 1.36, 1.4902, 1.6401, 1.81, 2.0, 2.21, 2.4404, 2.6904, 2.9604, 3.25, 3.5605, 3.8906, 4.24, 4.61] Possible solutions are to introduce an initial sleep (say 60 sec?) right after issuing the request, so that the batch job has some time to spin up. The job progresses through a through phases before it gets to RUNNING state and polling for each phase of that sequence might help. Since batch jobs tend to be long-running jobs (rather than near-real time jobs), it might help to issue less frequent polls when it's in the RUNNING state. Something on the order of 10's seconds might be reasonable for batch jobs? Maybe the class could expose a parameter for the rate of polling (or a callable)? > AWS Batch Operator - status polling too often, esp. 
for high concurrency > > > Key: AIRFLOW-5218 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, contrib >Affects Versions: 1.10.4 >Reporter: Darren Weber >Priority: Major > > The AWS Batch Operator attempts to use a boto3 feature that is not available > and has not been merged in years, see > - [https://github.com/boto/botocore/pull/1307] > - see also [https://github.com/b
[jira] [Updated] (AIRFLOW-5170) Add static checks for encoding pragma, consistent licences for python files and related pylint fixes
[ https://issues.apache.org/jira/browse/AIRFLOW-5170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarek Potiuk updated AIRFLOW-5170: -- Description: Automated check for encoding pragma, consistent licence files can be added for python files. Since we have pylint checks in pre-commits added we should also make sure to fix all pylint related changes for all the changed python files. was:Automated check for encoding pragma can be easily added. Since we have pylint checks in pre-commits added we should also make sure to fix all pylint related changes however. Summary: Add static checks for encoding pragma, consistent licences for python files and related pylint fixes (was: Add static checks for encoding pragma (and related pylint fixes)) > Add static checks for encoding pragma, consistent licences for python files > and related pylint fixes > > > Key: AIRFLOW-5170 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5170 > Project: Apache Airflow > Issue Type: Sub-task > Components: ci >Affects Versions: 2.0.0 >Reporter: Jarek Potiuk >Assignee: Jarek Potiuk >Priority: Major > > Automated check for encoding pragma, consistent licence files can be added for > python files. > Since we have pylint checks in pre-commits added we should also make sure to > fix all pylint related changes for all the changed python files. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
Darren Weber created AIRFLOW-5218: - Summary: AWS Batch Operator - status polling too often, esp. for high concurrency Key: AIRFLOW-5218 URL: https://issues.apache.org/jira/browse/AIRFLOW-5218 Project: Apache Airflow Issue Type: Improvement Components: aws, contrib Affects Versions: 1.10.4 Reporter: Darren Weber The AWS Batch Operator attempts to use a boto3 feature that is not available and has not been merged in years, see - https://github.com/boto/botocore/pull/1307 - see also https://github.com/broadinstitute/cromwell/issues/4303 This is a curious case of premature optimization. So, in the meantime, this means that the fallback is the exponential backoff routine for the status checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is very high (100's of tasks), this fallback polling hits the AWS Batch API too hard and the AWS API throttle throws an error, which fails the Airflow task, simply because the status is polled too frequently. Check the output from the retry algorithm, e.g. within the first 10 retries, the status of an AWS batch job is checked about 10 times at a rate that is approx 1 retry/sec. When an Airflow instance is running 10's or 100's of concurrent batch jobs, this hits the API too frequently and crashes the Airflow task (plus it occupies a worker in too much busy work). In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] Out[4]: [1.0, 1.01, 1.04, 1.09, 1.1601, 1.25, 1.36, 1.4902, 1.6401, 1.81, 2.0, 2.21, 2.4404, 2.6904, 2.9604, 3.25, 3.5605, 3.8906, 4.24, 4.61] Possible solutions are to introduce an initial sleep (say 60 sec?) right after issuing the request, so that the batch job has some time to spin up. The job progresses through a few phases before it gets to RUNNING state and polling for each phase of that sequence might help. Since batch jobs tend to be long-running jobs (rather than near-real time jobs), it might help to issue less frequent polls when it's in the RUNNING state. 
Something on the order of 10's seconds might be reasonable for batch jobs? Maybe the class could expose a parameter for the rate of polling (or a callable)? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (AIRFLOW-5207) Mark Success and Mark Failed views error out due to DAG reassignment
[ https://issues.apache.org/jira/browse/AIRFLOW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Levine closed AIRFLOW-5207. -- Resolution: Not A Problem This turned out to be an issue with one of our plugins > Mark Success and Mark Failed views error out due to DAG reassignment > > > Key: AIRFLOW-5207 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5207 > Project: Apache Airflow > Issue Type: Bug > Components: ui >Affects Versions: 1.10.4 >Reporter: Marcus Levine >Assignee: Marcus Levine >Priority: Major > Fix For: 1.10.5 > > Original Estimate: 1h > Remaining Estimate: 1h > > When trying to clear a task after upgrading to 1.10.4, I get the following > traceback: > {code:java} > File "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line > 1451, in failed future, past, State.FAILED) File > "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line 1396, in > _mark_task_instance_state task.dag = dag File > "/usr/local/lib/python3.7/site-packages/airflow/models/baseoperator.py", line > 509, in dag "The DAG assigned to {} can not be changed.".format(self)) > airflow.exceptions.AirflowException: The DAG assigned to > can not be changed.{code} > This should be a simple fix by either dropping the offending line, or if it > is required to keep things working, just set the private attribute instead: > {code:java} > task._dag = dag > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
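The failure mode in the traceback quoted above can be reproduced with a stripped-down sketch of the guard it describes. The classes below are illustrative stand-ins, not Airflow's actual `BaseOperator`: a `dag` property that raises once a different DAG is assigned, plus the private-attribute workaround the report proposes.

```python
# Minimal stand-in for the DAG-reassignment guard described in the
# traceback above; class names are illustrative, not Airflow's real code.
class AirflowException(Exception):
    pass

class OperatorSketch:
    def __init__(self):
        self._dag = None

    @property
    def dag(self):
        return self._dag

    @dag.setter
    def dag(self, dag):
        # Reassigning a task to a different DAG is rejected, which is
        # what made the Mark Success / Mark Failed views error out.
        if self._dag is not None and self._dag is not dag:
            raise AirflowException(
                "The DAG assigned to {} can not be changed.".format(self))
        self._dag = dag

task = OperatorSketch()
task.dag = "dag_a"       # first assignment succeeds
try:
    task.dag = "dag_b"   # reassignment raises AirflowException
except AirflowException:
    task._dag = "dag_b"  # the workaround proposed in the report
```

Setting `task._dag` directly bypasses the property setter, which is why the report suggests it as the minimal fix if the guard must stay in place.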
[jira] [Commented] (AIRFLOW-5207) Mark Success and Mark Failed views error out due to DAG reassignment
[ https://issues.apache.org/jira/browse/AIRFLOW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907716#comment-16907716 ] ASF GitHub Bot commented on AIRFLOW-5207: - marcusianlevine commented on pull request #5811: [AIRFLOW-5207] Fix Mark Success and Failure views URL: https://github.com/apache/airflow/pull/5811 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Mark Success and Mark Failed views error out due to DAG reassignment > > > Key: AIRFLOW-5207 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5207 > Project: Apache Airflow > Issue Type: Bug > Components: ui >Affects Versions: 1.10.4 >Reporter: Marcus Levine >Assignee: Marcus Levine >Priority: Major > Fix For: 1.10.5 > > Original Estimate: 1h > Remaining Estimate: 1h > > When trying to clear a task after upgrading to 1.10.4, I get the following > traceback: > {code:java} > File "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line > 1451, in failed future, past, State.FAILED) File > "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line 1396, in > _mark_task_instance_state task.dag = dag File > "/usr/local/lib/python3.7/site-packages/airflow/models/baseoperator.py", line > 509, in dag "The DAG assigned to {} can not be changed.".format(self)) > airflow.exceptions.AirflowException: The DAG assigned to > can not be changed.{code} > This should be a simple fix by either dropping the offending line, or if it > is required to keep things working, just set the private attribute instead: > {code:java} > task._dag = dag > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] marcusianlevine commented on issue #5811: [AIRFLOW-5207] Fix Mark Success and Failure views
marcusianlevine commented on issue #5811: [AIRFLOW-5207] Fix Mark Success and Failure views URL: https://github.com/apache/airflow/pull/5811#issuecomment-521479241 Nevermind, this turned out to be an issue with one of our dynamic DAG plugins This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] marcusianlevine closed pull request #5811: [AIRFLOW-5207] Fix Mark Success and Failure views
marcusianlevine closed pull request #5811: [AIRFLOW-5207] Fix Mark Success and Failure views URL: https://github.com/apache/airflow/pull/5811 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] dossett commented on issue #5419: [AIRFLOW-XXXX] Update pydoc of mlengine_operator
dossett commented on issue #5419: [AIRFLOW-XXXX] Update pydoc of mlengine_operator URL: https://github.com/apache/airflow/pull/5419#issuecomment-521474769 Thanks @mik-laj, comment updated This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] potiuk merged pull request #5777: [AIRFLOW-5161] Static checks are run automatically in pre-commit hooks
potiuk merged pull request #5777: [AIRFLOW-5161] Static checks are run automatically in pre-commit hooks URL: https://github.com/apache/airflow/pull/5777 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files
[ https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907710#comment-16907710 ] ASF subversion and git services commented on AIRFLOW-5161: -- Commit 70e937a8d8ff308a9fb9055ceb7ef2c034200b36 in airflow's branch refs/heads/master from Jarek Potiuk [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=70e937a ] [AIRFLOW-5161] Static checks are run automatically in pre-commit hooks (#5777) > Add pre-commit hooks to run static checks for only changed files > > > Key: AIRFLOW-5161 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5161 > Project: Apache Airflow > Issue Type: Improvement > Components: ci >Affects Versions: 2.0.0 >Reporter: Jarek Potiuk >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files
[ https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907709#comment-16907709 ] ASF GitHub Bot commented on AIRFLOW-5161: - potiuk commented on pull request #5777: [AIRFLOW-5161] Static checks are run automatically in pre-commit hooks URL: https://github.com/apache/airflow/pull/5777 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add pre-commit hooks to run static checks for only changed files > > > Key: AIRFLOW-5161 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5161 > Project: Apache Airflow > Issue Type: Improvement > Components: ci >Affects Versions: 2.0.0 >Reporter: Jarek Potiuk >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class
pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class URL: https://github.com/apache/airflow/pull/5824#issuecomment-521466853 Test failure seems unrelated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Updated] (AIRFLOW-5217) Fix Pod docstring
[ https://issues.apache.org/jira/browse/AIRFLOW-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philippe Gagnon updated AIRFLOW-5217: - Description: {{Pod}} class docstring is currently out of date with regards to its {{__init__}} method's arguments. (was: {{Pod}} class docstring is currently out of date with regards to its {{__init__}} method's docstring.) > Fix Pod docstring > - > > Key: AIRFLOW-5217 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5217 > Project: Apache Airflow > Issue Type: Improvement > Components: executors >Affects Versions: 2.0.0 >Reporter: Philippe Gagnon >Assignee: Philippe Gagnon >Priority: Minor > > {{Pod}} class docstring is currently out of date with regards to its > {{__init__}} method's arguments. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (AIRFLOW-5217) Fix Pod docstring
Philippe Gagnon created AIRFLOW-5217: Summary: Fix Pod docstring Key: AIRFLOW-5217 URL: https://issues.apache.org/jira/browse/AIRFLOW-5217 Project: Apache Airflow Issue Type: Improvement Components: executors Affects Versions: 2.0.0 Reporter: Philippe Gagnon Assignee: Philippe Gagnon {{Pod}} class docstring is currently out of date with regards to its {{__init__}} method's docstring. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class
pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class URL: https://github.com/apache/airflow/pull/5824#issuecomment-521460427 @ashb @dimberman This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (AIRFLOW-5215) Add sidecar container support to Pod object
[ https://issues.apache.org/jira/browse/AIRFLOW-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907695#comment-16907695 ] ASF GitHub Bot commented on AIRFLOW-5215: - pgagnon commented on pull request #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class URL: https://github.com/apache/airflow/pull/5824 Make sure you have checked _all_ steps below. ### Jira - [X] My PR addresses the following [Airflow Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR" - https://issues.apache.org/jira/browse/AIRFLOW-XXX - In case you are fixing a typo in the documentation you can prepend your commit with \[AIRFLOW-XXX\], code changes always need a Jira issue. - In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)). - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Description - [X] Here are some details about my PR, including screenshots of any UI changes: Adds a `sidecar_containers` argument to `Pod`, allowing users to pass a list of sidecar container definitions to add to the Pod. This is notably useful with the pod mutation hook. ### Tests - [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: - `test_extract_sidecar_containers`. ### Commits - [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. 
Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [X] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to a appropriate release `Pod`'s docstring is currently not up to date. Will address in a subsequent PR. ### Code Quality - [X] Passes `flake8` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add sidecar container support to Pod object > --- > > Key: AIRFLOW-5215 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5215 > Project: Apache Airflow > Issue Type: New Feature > Components: scheduler >Affects Versions: 2.0.0 >Reporter: Philippe Gagnon >Assignee: Philippe Gagnon >Priority: Major > > Add sidecar container support to Pod object. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
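As a rough illustration of what a `sidecar_containers` argument like the one described in this PR might look like, here is a hypothetical sketch. The class and field names below are made up for illustration and are not Airflow's actual `Pod` class or its pod request factory:

```python
# Hypothetical sketch of merging sidecar container definitions into a pod
# spec; structure and names are illustrative, not Airflow's actual code.
class PodSketch:
    def __init__(self, image, cmds, sidecar_containers=None):
        self.image = image
        self.cmds = cmds
        # Extra container dicts appended alongside the main "base" container.
        self.sidecar_containers = sidecar_containers or []

    def to_spec(self):
        """Build a minimal pod spec dict with base + sidecar containers."""
        base = {"name": "base", "image": self.image, "command": self.cmds}
        return {"containers": [base] + list(self.sidecar_containers)}

pod = PodSketch(
    image="python:3.7",
    cmds=["python", "-c", "print('hello')"],
    sidecar_containers=[{"name": "log-shipper", "image": "fluentd:latest"}],
)
spec = pod.to_spec()
print([c["name"] for c in spec["containers"]])  # → ['base', 'log-shipper']
```

Because the sidecar list is just appended to the container list, a pod mutation hook could inject sidecars (log shippers, proxies) without touching the main task container, which is the use case the PR description calls out.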
[GitHub] [airflow] pgagnon opened a new pull request #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class
pgagnon opened a new pull request #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class URL: https://github.com/apache/airflow/pull/5824 Make sure you have checked _all_ steps below. ### Jira - [X] My PR addresses the following [Airflow Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR" - https://issues.apache.org/jira/browse/AIRFLOW-XXX - In case you are fixing a typo in the documentation you can prepend your commit with \[AIRFLOW-XXX\], code changes always need a Jira issue. - In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)). - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Description - [X] Here are some details about my PR, including screenshots of any UI changes: Adds a `sidecar_containers` argument to `Pod`, allowing users to pass a list of sidecar container definitions to add to the Pod. This is notably useful with the pod mutation hook. ### Tests - [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: - `test_extract_sidecar_containers`. ### Commits - [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. 
Body explains "what" and "why", not "how" ### Documentation - [X] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release `Pod`'s docstring is currently not up to date. Will address in a subsequent PR. ### Code Quality - [X] Passes `flake8` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
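The effect of a `sidecar_containers` argument can be illustrated with plain Kubernetes-style dicts. This is a hypothetical sketch only: the function name `add_sidecar_containers`, the `pod_spec` dict layout, and the container names are assumptions, not the PR's actual API.

```python
# Hypothetical illustration: extra container definitions are appended to the
# pod's container list, e.g. from inside a pod mutation hook.
def add_sidecar_containers(pod_spec, sidecar_containers):
    """Return a copy of pod_spec with sidecar container definitions appended."""
    spec = dict(pod_spec)  # shallow copy so the caller's spec is untouched
    spec["containers"] = list(spec.get("containers", [])) + list(sidecar_containers)
    return spec

base = {"containers": [{"name": "base", "image": "airflow-worker:latest"}]}
log_shipper = {"name": "log-shipper", "image": "fluentd:latest"}
merged = add_sidecar_containers(base, [log_shipper])
```

After the merge, `merged["containers"]` holds the base container followed by the sidecar, while the original `base` spec is unchanged.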
[GitHub] [airflow] mik-laj opened a new pull request #5823: [AIRFLOW-XXX] Create "Using the CLI" page
mik-laj opened a new pull request #5823: [AIRFLOW-XXX] Create "Using the CLI" page URL: https://github.com/apache/airflow/pull/5823
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r314102570 ## File path: airflow/models/serialized_dag.py ## @@ -0,0 +1,155 @@ +# -*- coding: utf-8 -*- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +"""Serialzed DAG table in database.""" + +import hashlib +from typing import Any, Dict, List, Optional, TYPE_CHECKING +from sqlalchemy import Column, Index, Integer, String, Text, and_ +from sqlalchemy.sql import exists + +from airflow.models.base import Base, ID_LEN +from airflow.utils import db, timezone +from airflow.utils.sqlalchemy import UtcDateTime + + +if TYPE_CHECKING: +from airflow.dag.serialization.serialized_dag import SerializedDAG # noqa: F401, E501; # pylint: disable=cyclic-import +from airflow.models import DAG # noqa: F401; # pylint: disable=cyclic-import + + +class SerializedDagModel(Base): +"""A table for serialized DAGs. + +serialized_dag table is a snapshot of DAG files synchronized by scheduler. 
+This feature is controlled by: +[core] dagcached = False: enable this feature +[core] dagcached_min_update_interval = 30 (s): +serialized DAGs are updated in DB when a file gets processed by scheduler, +to reduce DB write rate, there is a minimal interval of updating serialized DAGs. +[scheduler] dag_dir_list_interval = 300 (s): +interval of deleting serialized DAGs in DB when the files are deleted, suggest +to use a smaller interval such as 60 + +It is used by webserver to load dagbags when dagcached=True. Because reading from +database is lightweight compared to importing from files, it solves the webserver +scalability issue. +""" +__tablename__ = 'serialized_dag' + +dag_id = Column(String(ID_LEN), primary_key=True) +fileloc = Column(String(2000)) +# The max length of fileloc exceeds the limit of indexing. +fileloc_hash = Column(Integer) +data = Column(Text) +last_updated = Column(UtcDateTime) + +__table_args__ = ( +Index('idx_fileloc_hash', fileloc_hash, unique=False), +) + +def __init__(self, dag): +from airflow.dag.serialization import Serialization + +self.dag_id = dag.dag_id +self.fileloc = dag.full_filepath +self.fileloc_hash = SerializedDagModel.dag_fileloc_hash(self.fileloc) +self.data = Serialization.to_json(dag) Review comment: > Either here, or inside to_json we should ensure that the JSON blob is valid - I want to minimize the chance of writing "odd"/invalid data in to our DB. Done in https://github.com/apache/airflow/pull/5743/commits/977a2fe3fd244bc3f366a1228324f8b3c58f30ac . WDYT - is that OK? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
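One lightweight way to ensure the blob is valid JSON before the DB write, as the review asks, is a parse round-trip. This is a sketch of the idea, not the PR's actual implementation; the function name `validate_json_blob` is assumed.

```python
import json

def validate_json_blob(blob: str) -> str:
    """Raise ValueError if blob is not parseable JSON; return it unchanged otherwise."""
    try:
        json.loads(blob)
    except json.JSONDecodeError as err:
        # Refuse to persist "odd"/invalid data into the serialized_dag table.
        raise ValueError(f"refusing to write invalid JSON to DB: {err}") from err
    return blob
```

A well-formed blob such as `'{"dag_id": "example"}'` passes through unchanged; a truncated one such as `'{"dag_id":'` raises before anything reaches the database.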
[GitHub] [airflow] mik-laj merged pull request #5776: [AIRFLOW-XXX] Group references in one section
mik-laj merged pull request #5776: [AIRFLOW-XXX] Group references in one section URL: https://github.com/apache/airflow/pull/5776
[jira] [Commented] (AIRFLOW-3333) New features enable transferring of files or data from GCS to a SFTP remote path and SFTP to GCS path.
[ https://issues.apache.org/jira/browse/AIRFLOW-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907620#comment-16907620 ] Kamil Bregula commented on AIRFLOW-3333: [~pulinpathneja] Any progress? Maybe I can help in some way. > New features enable transferring of files or data from GCS to a SFTP remote > path and SFTP to GCS path. > --- > > Key: AIRFLOW-3333 > URL: https://issues.apache.org/jira/browse/AIRFLOW-3333 > Project: Apache Airflow > Issue Type: New Feature > Components: contrib, gcp >Reporter: Pulin Pathneja >Assignee: Pulin Pathneja >Priority: Major > > New features enable transferring of files or data from GCS (Google Cloud > Storage) to a SFTP remote path and SFTP to GCS (Google Cloud Storage) path. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (AIRFLOW-4758) Add GoogleCloudStorageToGoogleDrive Operator
[ https://issues.apache.org/jira/browse/AIRFLOW-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907598#comment-16907598 ] Kamil Bregula commented on AIRFLOW-4758: I have created an operator that copies data from GCS to GDrive. Writing an operator that copies directories between GDrive and GCS will not be easy, because GDrive stores files as a graph, and the directory structure may contain cycles. It is possible to write an operator that copies one file from GDrive, but its usability will be very limited. What do you think? > Add GoogleCloudStorageToGoogleDrive Operator > > > Key: AIRFLOW-4758 > URL: https://issues.apache.org/jira/browse/AIRFLOW-4758 > Project: Apache Airflow > Issue Type: Wish > Components: gcp, operators >Affects Versions: 1.10.3 >Reporter: jack >Priority: Major > > Add Operators: > GoogleCloudStorageToGoogleDrive > GoogleDriveToGoogleCloudStorage
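Because a Drive folder graph may contain cycles, any directory-copy operator would need to track visited nodes (e.g. by file ID) to terminate. A minimal sketch of that traversal over an in-memory graph, with the actual Drive API calls omitted:

```python
def walk_drive_graph(children, root):
    """Yield each node reachable from root exactly once, even if the graph has cycles."""
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # already copied; skipping here is what breaks the cycle
        visited.add(node)
        yield node
        stack.extend(children.get(node, []))

# A directory graph with a cycle: a -> b -> c -> a
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

Walking `graph` from `"a"` visits each of the three nodes once and terminates, instead of looping forever around the cycle.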
[jira] [Assigned] (AIRFLOW-5176) Add integration with Azure Data Explorer
[ https://issues.apache.org/jira/browse/AIRFLOW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Spector reassigned AIRFLOW-5176: Assignee: (was: Michael Spector) > Add integration with Azure Data Explorer > > > Key: AIRFLOW-5176 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5176 > Project: Apache Airflow > Issue Type: New Feature > Components: hooks, operators >Affects Versions: 1.10.4, 2.0.0 >Reporter: Michael Spector >Priority: Major > > Add a hook and an operator that allow sending queries to Azure Data Explorer > (Kusto) cluster. > ADX (Azure Data Explorer) is relatively new but very promising analytics data > store / data processing offering in Azure. > PR: https://github.com/apache/airflow/pull/5785
[jira] [Commented] (AIRFLOW-4758) Add GoogleCloudStorageToGoogleDrive Operator
[ https://issues.apache.org/jira/browse/AIRFLOW-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907597#comment-16907597 ] ASF GitHub Bot commented on AIRFLOW-4758: - mik-laj commented on pull request #5822: [AIRFLOW-4758] Add GcsToGDriveOperator URL: https://github.com/apache/airflow/pull/5822
[GitHub] [airflow] matwerber1 commented on issue #4068: [AIRFLOW-2310]: Add AWS Glue Job Compatibility to Airflow
matwerber1 commented on issue #4068: [AIRFLOW-2310]: Add AWS Glue Job Compatibility to Airflow URL: https://github.com/apache/airflow/pull/4068#issuecomment-521402915 I see the merge failed due to what is (hopefully?) a small conflict. Can we get eyes on this? Can I help?
[GitHub] [airflow] mik-laj opened a new pull request #5822: [AIRFLOW-4758] Add GcsToGDriveOperator
mik-laj opened a new pull request #5822: [AIRFLOW-4758] Add GcsToGDriveOperator URL: https://github.com/apache/airflow/pull/5822
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r314046484 ## File path: airflow/models/serialized_dag.py ## @@ -0,0 +1,155 @@ +# -*- coding: utf-8 -*- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +"""Serialzed DAG table in database.""" + +import hashlib +from typing import Any, Dict, List, Optional, TYPE_CHECKING +from sqlalchemy import Column, Index, Integer, String, Text, and_ +from sqlalchemy.sql import exists + +from airflow.models.base import Base, ID_LEN +from airflow.utils import db, timezone +from airflow.utils.sqlalchemy import UtcDateTime + + +if TYPE_CHECKING: +from airflow.dag.serialization.serialized_dag import SerializedDAG # noqa: F401, E501; # pylint: disable=cyclic-import +from airflow.models import DAG # noqa: F401; # pylint: disable=cyclic-import + + +class SerializedDagModel(Base): +"""A table for serialized DAGs. + +serialized_dag table is a snapshot of DAG files synchronized by scheduler. 
+This feature is controlled by: +[core] dagcached = False: enable this feature +[core] dagcached_min_update_interval = 30 (s): +serialized DAGs are updated in DB when a file gets processed by scheduler, +to reduce DB write rate, there is a minimal interval of updating serialized DAGs. +[scheduler] dag_dir_list_interval = 300 (s): +interval of deleting serialized DAGs in DB when the files are deleted, suggest +to use a smaller interval such as 60 + +It is used by webserver to load dagbags when dagcached=True. Because reading from +database is lightweight compared to importing from files, it solves the webserver +scalability issue. +""" +__tablename__ = 'serialized_dag' + +dag_id = Column(String(ID_LEN), primary_key=True) +fileloc = Column(String(2000)) +# The max length of fileloc exceeds the limit of indexing. +fileloc_hash = Column(Integer) +data = Column(Text) +last_updated = Column(UtcDateTime) + +__table_args__ = ( +Index('idx_fileloc_hash', fileloc_hash, unique=False), +) + +def __init__(self, dag): +from airflow.dag.serialization import Serialization + +self.dag_id = dag.dag_id +self.fileloc = dag.full_filepath +self.fileloc_hash = SerializedDagModel.dag_fileloc_hash(self.fileloc) +self.data = Serialization.to_json(dag) +self.last_updated = timezone.utcnow() + +@staticmethod +def dag_fileloc_hash(full_filepath: str) -> int: +"""Hashing file location for indexing. + +:param full_filepath: full filepath of DAG file +:return: hashed full_filepath +""" +# hashing is needed because the length of fileloc is 2000 as an Airflow convention, +# which is over the limit of indexing. If we can reduce the length of fileloc, then +# hashing is not needed. +return int(0x & int( +hashlib.sha1(full_filepath.encode('utf-8')).hexdigest(), 16)) + +@classmethod +def write_dag(cls, dag: 'DAG', min_update_interval: Optional[int] = None): +"""Serializes a DAG and writes it into database.
+ +:param dag: a DAG to be written into database +:param min_update_interval: minimal interval in seconds to update serialized DAG +""" +with db.create_session() as session: +if min_update_interval is not None: +result = session.query(cls.last_updated).filter( +cls.dag_id == dag.dag_id).first() +if result is not None and ( +timezone.utcnow() - result.last_updated).total_seconds() < min_update_interval: +return +session.merge(cls(dag)) + +@classmethod +def read_all_dags(cls) -> Dict[str, 'SerializedDAG']: +"""Reads all DAGs in serialized_dag table. + +:returns: a dict of DAGs read from database +""" +from airflow.dag.serialization import Serialization + +with db.create_session() as session: +seriali
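The `dagcached_min_update_interval` throttle described in the docstring boils down to: skip the DB write when the stored row was refreshed within the interval. A standalone sketch of that guard, with plain timestamps standing in for the session query (`should_write` is an assumed name, not the PR's API):

```python
from datetime import datetime, timedelta, timezone

def should_write(last_updated, min_update_interval=None, now=None):
    """Write unless the existing row was updated within the last min_update_interval seconds."""
    if min_update_interval is None or last_updated is None:
        return True  # no throttle configured, or no row exists yet
    now = now or datetime.now(timezone.utc)
    return (now - last_updated).total_seconds() >= min_update_interval

now = datetime(2019, 8, 14, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(seconds=10)   # refreshed 10s ago: skip the write
stale = now - timedelta(seconds=60)    # refreshed 60s ago: write again
```

With a 30-second interval, a row refreshed 10 seconds ago is skipped and one refreshed 60 seconds ago is rewritten, which is exactly the DB-write-rate reduction the docstring describes.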
[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#issuecomment-520101071 Pending Issues: - ~Add Timezone support to `serialized_dag` table~ - ~We still have the issue of `SerializedBaseOperator` being displayed in Graph View.~ ![image](https://user-images.githubusercontent.com/8811558/62814712-56b0b880-bb0a-11e9-9ef0-0dd9090b624b.png) - ~Issue displaying SubDags~: ![image](https://user-images.githubusercontent.com/8811558/62814991-66c99780-bb0c-11e9-9a36-f692b2ec3db5.png)
[GitHub] [airflow] JCoder01 commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails in ImapHook and related operators
JCoder01 commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails in ImapHook and related operators URL: https://github.com/apache/airflow/pull/5672#issuecomment-521384333 I'm actually not using IMAP anymore after the powers that be made me switch to Office 365 and disabled IMAP access, but looking it over, it looks good.
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r314042957 ## File path: airflow/models/dag.py ## @@ -1509,8 +1518,19 @@ def get_last_dagrun(self, session=None, include_externally_triggered=False): def safe_dag_id(self): return self.dag_id.replace('.', '__dot__') -def get_dag(self): -return DagBag(dag_folder=self.fileloc).get_dag(self.dag_id) +def get_dag(self, dagcached_enabled=False): +"""Creates a dagbag to load and return a DAG. + +Calling it from UI should set dagcached_enabled = DAGCACHED_ENABLED. +There may be a delay for scheduler to write serialized DAG into database, +loads from file in this case. +FIXME: removes it when webserver does not access to DAG folder in future. +""" +dag = DagBag( +dag_folder=self.fileloc, dagcached_enabled=dagcached_enabled).get_dag(self.dag_id) +if dagcached_enabled and dag is None: Review comment: >There may be a delay for scheduler to write serialized DAG into database, loads from file in this case. I guess the idea is if for any reason (connectivity probably or some other DB issue) this method will load the DAG from file, hence it uses recursion to reload DagBag without cache_enabled. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
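The fallback described in this review — try the serialized (DB) copy first, reload from the DAG file if it is missing — can be sketched independently of Airflow. Here the loader callables stand in for the DagBag machinery; the names are assumptions for illustration:

```python
def get_dag(dag_id, load_from_db, load_from_file, use_cache=True):
    """Prefer the serialized (DB) copy; fall back to parsing the file when absent."""
    if use_cache:
        dag = load_from_db(dag_id)
        if dag is not None:
            return dag
        # The scheduler may not have written the serialized DAG yet
        # (or the DB read failed), so fall through to the file.
    return load_from_file(dag_id)

db = {}  # serialized_dag table stand-in: initially empty
files = {"example_dag": "DAG<example_dag>"}  # DAG-folder stand-in
```

Before the scheduler writes anything, `get_dag` parses the file; once a serialized copy exists in `db`, it is preferred, which is the webserver-scalability win the PR is after.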
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r314041450 ## File path: airflow/models/serialized_dag.py ## (quotes the same serialized_dag.py hunk shown above, down to:) +[scheduler] dag_dir_list_interval = 300 (s): +interval of deleting serialized DAGs in DB when the files are deleted, suggest +to use a smaller interval such as 60 Review comment: Sorry didn't understand this one, what do you mean?
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r314041221 ## File path: airflow/models/serialized_dag.py ## (quotes the same serialized_dag.py hunk shown above)
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r314041221

## File path: airflow/models/serialized_dag.py

## @@ -0,0 +1,155 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialized DAG table in database."""
+
+import hashlib
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+from sqlalchemy import Column, Index, Integer, String, Text, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import Base, ID_LEN
+from airflow.utils import db, timezone
+from airflow.utils.sqlalchemy import UtcDateTime
+
+
+if TYPE_CHECKING:
+    from airflow.dag.serialization.serialized_dag import SerializedDAG  # noqa: F401, E501; # pylint: disable=cyclic-import
+    from airflow.models import DAG  # noqa: F401; # pylint: disable=cyclic-import
+
+
+class SerializedDagModel(Base):
+    """A table for serialized DAGs.
+
+    serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+    This feature is controlled by:
+    [core] dagcached = False: enable this feature
+    [core] dagcached_min_update_interval = 30 (s):
+        serialized DAGs are updated in DB when a file gets processed by scheduler,
+        to reduce DB write rate, there is a minimal interval of updating serialized DAGs.
+    [scheduler] dag_dir_list_interval = 300 (s):
+        interval of deleting serialized DAGs in DB when the files are deleted, suggest
+        to use a smaller interval such as 60
+
+    It is used by webserver to load dagbags when dagcached=True. Because reading from
+    database is lightweight compared to importing from files, it solves the webserver
+    scalability issue.
+    """
+    __tablename__ = 'serialized_dag'
+
+    dag_id = Column(String(ID_LEN), primary_key=True)
+    fileloc = Column(String(2000))
+    # The max length of fileloc exceeds the limit of indexing.
+    fileloc_hash = Column(Integer)
+    data = Column(Text)
+    last_updated = Column(UtcDateTime)
+
+    __table_args__ = (
+        Index('idx_fileloc_hash', fileloc_hash, unique=False),
+    )
+
+    def __init__(self, dag):
+        from airflow.dag.serialization import Serialization
+
+        self.dag_id = dag.dag_id
+        self.fileloc = dag.full_filepath
+        self.fileloc_hash = SerializedDagModel.dag_fileloc_hash(self.fileloc)
+        self.data = Serialization.to_json(dag)
+        self.last_updated = timezone.utcnow()
+
+    @staticmethod
+    def dag_fileloc_hash(full_filepath: str) -> int:
+        """Hashing file location for indexing.
+
+        :param full_filepath: full filepath of DAG file
+        :return: hashed full_filepath
+        """
+        # hashing is needed because the length of fileloc is 2000 as an Airflow convention,
+        # which is over the limit of indexing. If we can reduce the length of fileloc, then
+        # hashing is not needed.
+        return int(0xFFFF_FFFF & int(

Review comment: Will let @coufon answer this one.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
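The hashing trick discussed in the review above can be sketched without Airflow at all: SHA-1 the long file path, then mask the digest down so it fits a plain Integer column and stays indexable. The 32-bit mask width and the sample path below are assumptions for illustration.

```python
import hashlib


def fileloc_hash(full_filepath: str) -> int:
    """Map an arbitrarily long DAG file path to a small integer for indexing.

    Same idea as dag_fileloc_hash in the patch under review: hash the path,
    then mask the digest down to 32 bits (mask width assumed here).
    """
    digest = int(hashlib.sha1(full_filepath.encode("utf-8")).hexdigest(), 16)
    return digest & 0xFFFFFFFF


# Placeholder path; any string works.
h = fileloc_hash("/home/airflow/dags/example_dag.py")
assert 0 <= h < 2 ** 32  # always fits an unsigned 32-bit integer
assert h == fileloc_hash("/home/airflow/dags/example_dag.py")  # deterministic
```

Indexing the small hash instead of the 2000-character `fileloc` column is what keeps the lookup cheap; the full path is still stored for verification on read.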
[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954

@zhongjiajie

> > Pending Issues:
> >
> > * We still have the issue of `SerializedBaseOperator` being displayed in Graph View.
>
> This is because graph.html and tree.html use `op.__class__.__name__`. Replaced that by op.task_type to fix it.

Found this issue with that fix: We have a `BashOperator` label for each task instead of unique labels.

![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png)

This is caused because we are making a dict of unique TaskInstance and not Operator Class in **L1335**: https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336

Previously it was comparing Classes directly, hence it would remove duplicates. https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1332-L1338

Fixing this now. **Fixed** it with https://github.com/apache/airflow/pull/5743/commits/7859d787ca70225a32ce9dbc21d87facb59a3143

![image](https://user-images.githubusercontent.com/8811558/63048079-a8bc5a00-becd-11e9-8c05-02449f30fffc.png)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
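The deduplication bug described in the comment above is easy to reproduce in a few lines: keying the dict on the task instance keeps one entry per task, while keying on the operator type collapses duplicates the way the old class comparison did. The `BashOperator` stand-in below is illustrative only, not Airflow's real class.

```python
class BashOperator:
    """Illustrative stand-in for an Airflow operator."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.task_type = type(self).__name__  # what op.task_type exposes


tasks = [BashOperator("t1"), BashOperator("t2"), BashOperator("t3")]

# Buggy variant: keying on the task instance keeps one entry per task, so
# the Graph View legend shows a "BashOperator" label for every task.
by_instance = {task: task.task_type for task in tasks}
assert len(by_instance) == 3

# Fixed variant: deduplicate on the operator type, as the old code did by
# comparing classes directly; three BashOperator tasks yield one label.
by_type = {task.task_type: task for task in tasks}
assert len(by_type) == 1
```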
[jira] [Updated] (AIRFLOW-5216) Add Azure File Share Sensor
[ https://issues.apache.org/jira/browse/AIRFLOW-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Yau updated AIRFLOW-5216: Description: Add sensor to check if a file exists on Azure File Share. (was: Add sensor to c**heck if a file exists on Azure File Share.) > Add Azure File Share Sensor > --- > > Key: AIRFLOW-5216 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5216 > Project: Apache Airflow > Issue Type: New Feature > Components: contrib >Affects Versions: 1.10.4 >Reporter: Albert Yau >Assignee: Albert Yau >Priority: Minor > > Add sensor to check if a file exists on Azure File Share. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (AIRFLOW-5216) Add Azure File Share Sensor
[ https://issues.apache.org/jira/browse/AIRFLOW-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Yau updated AIRFLOW-5216: Description: Add sensor to check if a file exists on Azure File Share using the existing AzureFileShareHook. (was: Add sensor to check if a file exists on Azure File Share.) > Add Azure File Share Sensor > --- > > Key: AIRFLOW-5216 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5216 > Project: Apache Airflow > Issue Type: New Feature > Components: contrib >Affects Versions: 1.10.4 >Reporter: Albert Yau >Assignee: Albert Yau >Priority: Minor > > Add sensor to check if a file exists on Azure File Share using the existing > AzureFileShareHook. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
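The proposed sensor's poke logic is simple enough to sketch without the Airflow framework. `FakeFileShareHook` below stands in for the real AzureFileShareHook, and the `check_for_file` method name and signature are assumptions for illustration, not a confirmed API.

```python
class FakeFileShareHook:
    """Stand-in for AzureFileShareHook; holds a set of known files."""

    def __init__(self, files):
        self._files = set(files)

    def check_for_file(self, share_name, directory_name, file_name):
        # Assumed method name; a real hook would call the Azure SDK here.
        return (share_name, directory_name, file_name) in self._files


class AzureFileShareSensorSketch:
    """Minimal sensor-style wrapper: True once the file exists."""

    def __init__(self, hook, share_name, directory_name, file_name):
        self.hook = hook
        self.share_name = share_name
        self.directory_name = directory_name
        self.file_name = file_name

    def poke(self):
        # A real Airflow sensor is rescheduled until this returns True.
        return self.hook.check_for_file(
            self.share_name, self.directory_name, self.file_name)


hook = FakeFileShareHook({("myshare", "data", "report.csv")})
assert AzureFileShareSensorSketch(hook, "myshare", "data", "report.csv").poke()
assert not AzureFileShareSensorSketch(hook, "myshare", "data", "missing.csv").poke()
```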
[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components
[ https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907535#comment-16907535 ] ASF GitHub Bot commented on AIRFLOW-5118: - idralyuk commented on pull request #5821: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5821 …ClusterCreateOperator

### Jira
https://issues.apache.org/jira/browse/AIRFLOW-5118

### Description
This PR adds the ability to specify optional components in DataprocClusterCreateOperator. For more info see https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component

### Tests
One test added: it checks whether the optional components were set correctly

### Commits
- [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation
- [X] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain docstrings that explain what it does
  - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release

### Code Quality
- [X] Passes `flake8`

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Airflow DataprocClusterCreateOperator does not currently support setting > optional components > > > Key: AIRFLOW-5118 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5118 > Project: Apache Airflow > Issue Type: New Feature > Components: operators >Affects Versions: 1.10.3 >Reporter: Omid Vahdaty >Assignee: Igor >Priority: Minor > > There needs to be an option to install optional components via > DataprocClusterCreateOperator, components such as Zeppelin. > From the source code of the DataprocClusterCreateOperator[1], the only > software configs that can be set are the imageVersion and the properties. As > the Zeppelin component needs to be set through softwareConfig > optionalComponents[2], the DataprocClusterCreateOperator does not currently > support setting optional components. > > As a workaround for the time being, you could create your clusters by > directly using the gcloud command rather than the > DataprocClusterCreateOperator. Using the Airflow BashOperator[3], you can > execute gcloud commands that create your Dataproc cluster with the required > optional components. > [1] > [https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py] > > [2] > [https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig] > > [3] [https://airflow.apache.org/howto/operator/bash.html] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
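As a sketch of what the new operator argument would ultimately populate, the `softwareConfig.optionalComponents` field of the ClusterConfig REST resource cited in the issue can be built like this. The image version and component names are placeholder values, and the helper function is hypothetical, not part of the operator's API.

```python
def build_cluster_config(optional_components):
    """Return a minimal Dataproc ClusterConfig fragment.

    Field names follow the ClusterConfig REST resource referenced in the
    issue; values here are placeholders for illustration.
    """
    return {
        "softwareConfig": {
            "imageVersion": "1.4",
            "optionalComponents": list(optional_components),
        }
    }


config = build_cluster_config(["ZEPPELIN", "ANACONDA"])
assert config["softwareConfig"]["optionalComponents"] == ["ZEPPELIN", "ANACONDA"]
```

The same components can be requested through the gcloud workaround mentioned in the issue, via the `--optional-components` flag on `gcloud dataproc clusters create`.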
[GitHub] [airflow] idralyuk opened a new pull request #5821: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…
idralyuk opened a new pull request #5821: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5821 …ClusterCreateOperator ### Jira https://issues.apache.org/jira/browse/AIRFLOW-5118 ### Description This PR adds the ability to specify optional components in DataprocClusterCreateOperator For more info see https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component ### Tests One test added: it checks whether the optional components were set correctly ### Commits - [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [X] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release ### Code Quality - [X] Passes `flake8` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (AIRFLOW-5216) Add Azure File Share Sensor
Albert Yau created AIRFLOW-5216: --- Summary: Add Azure File Share Sensor Key: AIRFLOW-5216 URL: https://issues.apache.org/jira/browse/AIRFLOW-5216 Project: Apache Airflow Issue Type: New Feature Components: contrib Affects Versions: 1.10.4 Reporter: Albert Yau Assignee: Albert Yau Add sensor to check if a file exists on Azure File Share. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components
[ https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907528#comment-16907528 ] ASF GitHub Bot commented on AIRFLOW-5118: - idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…
idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] feluelle commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails in ImapHook and related operators
feluelle commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails in ImapHook and related operators URL: https://github.com/apache/airflow/pull/5672#issuecomment-521363251 @JCoder01 @kurtqq aren't you using the IMAP thingy? 😁 ..and want to have a final look at it? :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components
[ https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907520#comment-16907520 ] ASF GitHub Bot commented on AIRFLOW-5118: - idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 …ClusterCreateOperator ### Jira https://issues.apache.org/jira/browse/AIRFLOW-5118 ### Description This PR adds the ability to specify optional components in DataprocClusterCreateOperator For more info see https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component ### Tests One test added: it checks whether the optional components were set correctly ### Commits - [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [X] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release ### Code Quality - [X] Passes `flake8` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…
idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 …ClusterCreateOperator ### Jira https://issues.apache.org/jira/browse/AIRFLOW-5118 ### Description This PR adds the ability to specify optional components in DataprocClusterCreateOperator For more info see https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component ### Tests One test added: it checks whether the optional components were set correctly ### Commits - [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [X] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release ### Code Quality - [X] Passes `flake8` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954 @zhongjiajie > > Pending Issues: > > > > * We still have the issue of `SerializedBaseOperator` being displayed in Graph View. > > This is because graph.html and tree.html use `op.__class__.__name__`. Replaced that by op.task_type to fix it. Found this issue with that fix: We have a `BashOperator` label for each task instead of unique labels. ![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png) This is caused because we are making a dict of unique TaskInstance and not Operator Class in **L1335**: https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336 Previously it was comparing Classes directly, hence it would remove duplicates. https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1322-L1328 Fixing this now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954 @zhongjiajie > > Pending Issues: > > > > * We still have the issue of `SerializedBaseOperator` being displayed in Graph View. > > This is because graph.html and tree.html use `op.__class__.__name__`. Replaced that by op.task_type to fix it. Found this issue with that fix: We have a `BashOperator` label for each task instead of unique labels. ![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png) This is caused because we are making a dict of unique TaskInstance and not Operator Class in **L1335**: https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336 Previously it was comparing Classes directly, hence it would remove duplicates. https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1332-L1338 Fixing this now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] kaxil commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954 > > Pending Issues: > > > > * We still have the issue of `SerializedBaseOperator` being displayed in Graph View. > > This is because graph.html and tree.html use `op.__class__.__name__`. Replaced that by op.task_type to fix it. Found this issue with that fix: We have a `BashOperator` label for each task instead of unique labels. ![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png) This is caused because we are making a dict of unique TaskInstance and not Operator Class in **L1335**: https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336 Previously it was comparing Classes directly, hence it would remove duplicates. https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1422-L1436 Fixing this now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954 @zhongjiajie > > Pending Issues: > > > > * We still have the issue of `SerializedBaseOperator` being displayed in Graph View. > > This is because graph.html and tree.html use `op.__class__.__name__`. Replaced that by op.task_type to fix it. Found this issue with that fix: We have a `BashOperator` label for each task instead of unique labels. ![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png) This is caused because we are making a dict of unique TaskInstance and not Operator Class in **L1335**: https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336 Previously it was comparing Classes directly, hence it would remove duplicates. https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1422-L1436 Fixing this now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components
[ https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907510#comment-16907510 ] ASF GitHub Bot commented on AIRFLOW-5118: - idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] rasmi commented on issue #5720: [AIRFLOW-5099][WIP-DONT-MERGE] Implement Google Cloud AutoML operators
rasmi commented on issue #5720: [AIRFLOW-5099][WIP-DONT-MERGE] Implement Google Cloud AutoML operators URL: https://github.com/apache/airflow/pull/5720#issuecomment-521356636 No review comments here, I'm just excited for this to be merged -- thank you all for your work! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [airflow] idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…
idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (AIRFLOW-5215) Add sidecar container support to Pod object
Philippe Gagnon created AIRFLOW-5215: Summary: Add sidecar container support to Pod object Key: AIRFLOW-5215 URL: https://issues.apache.org/jira/browse/AIRFLOW-5215 Project: Apache Airflow Issue Type: New Feature Components: scheduler Affects Versions: 2.0.0 Reporter: Philippe Gagnon Assignee: Philippe Gagnon Add sidecar container support to Pod object. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components
[ https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907469#comment-16907469 ] ASF GitHub Bot commented on AIRFLOW-5118: - idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 …ClusterCreateOperator ### Jira https://issues.apache.org/jira/browse/AIRFLOW-5118 ### Description This PR adds the ability to specify optional components in DataprocClusterCreateOperator. For more info see https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component ### Tests One test added: it checks whether the optional components were set correctly ### Commits - [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [X] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain docstrings that explain what it does - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release ### Code Quality - [X] Passes `flake8`
> Airflow DataprocClusterCreateOperator does not currently support setting optional components > Key: AIRFLOW-5118 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5118 > Project: Apache Airflow > Issue Type: New Feature > Components: operators > Affects Versions: 1.10.3 > Reporter: Omid Vahdaty > Assignee: Igor > Priority: Minor > > There needs to be an option to install optional components, such as Zeppelin, via DataprocClusterCreateOperator. > From the source code of the DataprocClusterCreateOperator [1], the only software configs that can be set are the imageVersion and the properties. As the Zeppelin component needs to be set through softwareConfig optionalComponents [2], the DataprocClusterCreateOperator does not currently support setting optional components. > > As a workaround for the time being, you could create your clusters by directly using the gcloud command rather than the DataprocClusterCreateOperator. Using the Airflow BashOperator [3], you can execute gcloud commands that create your Dataproc cluster with the required optional components. > [1] https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py > [2] https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig > [3] https://airflow.apache.org/howto/operator/bash.html
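The gcloud workaround described above can be sketched as follows. This is an illustration only: the cluster name, region, and component names are placeholder assumptions, and the resulting string would be handed to Airflow's BashOperator as its `bash_command`.

```python
# Sketch of the workaround: build a `gcloud dataproc clusters create` command
# that sets optional components. Cluster name, region, and component names
# below are illustrative, not values from this issue.
def build_create_cluster_cmd(cluster_name, region, optional_components):
    cmd = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region={}".format(region),
    ]
    if optional_components:
        # gcloud expects a comma-separated list, e.g.
        # --optional-components=ZEPPELIN,ANACONDA
        cmd.append("--optional-components={}".format(",".join(optional_components)))
    return " ".join(cmd)

cmd = build_create_cluster_cmd("demo-cluster", "us-central1", ["ZEPPELIN", "ANACONDA"])
# The string could then be used roughly as:
#   BashOperator(task_id="create_cluster", bash_command=cmd, dag=dag)
```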
[GitHub] [airflow] idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…
idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc… URL: https://github.com/apache/airflow/pull/5820 …ClusterCreateOperator
[jira] [Closed] (AIRFLOW-5183) Preprare documentation for new GCP import paths
[ https://issues.apache.org/jira/browse/AIRFLOW-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kamil Bregula closed AIRFLOW-5183. -- Resolution: Fixed Fix Version/s: 2.0.0 > Preprare documentation for new GCP import paths > Key: AIRFLOW-5183 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5183 > Project: Apache Airflow > Issue Type: Improvement > Components: gcp > Affects Versions: 2.0.0 > Reporter: Tomasz Urbaszek > Priority: Major > Fix For: 2.0.0
[jira] [Commented] (AIRFLOW-5183) Preprare documentation for new GCP import paths
[ https://issues.apache.org/jira/browse/AIRFLOW-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907456#comment-16907456 ] ASF subversion and git services commented on AIRFLOW-5183: -- Commit 40745aa225ae14ad700e2da1f421cd5d0df8e292 in airflow's branch refs/heads/master from Tomek [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=40745aa ] [AIRFLOW-5183] Preprare documentation for new GCP import paths (#5791)
[jira] [Commented] (AIRFLOW-5183) Preprare documentation for new GCP import paths
[ https://issues.apache.org/jira/browse/AIRFLOW-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907455#comment-16907455 ] ASF GitHub Bot commented on AIRFLOW-5183: - mik-laj commented on pull request #5791: [AIRFLOW-5183] Preprare documentation for new GCP import paths URL: https://github.com/apache/airflow/pull/5791
[GitHub] [airflow] mik-laj merged pull request #5791: [AIRFLOW-5183] Preprare documentation for new GCP import paths
mik-laj merged pull request #5791: [AIRFLOW-5183] Preprare documentation for new GCP import paths URL: https://github.com/apache/airflow/pull/5791
[jira] [Assigned] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components
[ https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor reassigned AIRFLOW-5118: - Assignee: Igor (was: Kaxil Naik)
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r313965300
## File path: airflow/api/common/experimental/delete_dag.py ##
@@ -45,6 +49,11 @@ def delete_dag(dag_id: str, keep_records_in_log: bool = True, session=None) -> int:
         raise DagFileExists("Dag id {} is still in DagBag. "
                             "Remove the DAG file first: {}".format(dag_id, dag.fileloc))
+    # Scheduler removes DAGs without files from serialized_dag table every dag_dir_list_interval.
+    # There may be a lag, so explicitly remove the serialized DAG here.
+    if DAGCACHED_ENABLED and SerializedDagModel.has_dag(dag_id):
+        SerializedDagModel.remove_dag(dag_id)
Review comment: Updated in https://github.com/apache/airflow/pull/5743/commits/b814f8dfd9448ee3ceef2722c7f0291d8a680700
[jira] [Commented] (AIRFLOW-5147) Annotations for k8s executors should support extended alphabet (like '/'))
[ https://issues.apache.org/jira/browse/AIRFLOW-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907384#comment-16907384 ] ASF GitHub Bot commented on AIRFLOW-5147: - andrei-l commented on pull request #5819: [AIRFLOW-5147] extended character set for for k8s worker pods annotations URL: https://github.com/apache/airflow/pull/5819 ### Jira - https://issues.apache.org/jira/browse/AIRFLOW-5147 ### Description This PR fixes the previous solution (https://github.com/apache/airflow/pull/4589) for providing k8s annotations to workers created by the k8s executor. Previously, each annotation key had to be declared as part of the airflow config key, which imposed limitations on it (for example, it could not contain the `/` character). ### Tests My PR adds the following unit tests: executors.TestKubeConfig.test_kube_config_worker_annotations_properly_parsed executors.TestKubeConfig.test_kube_config_no_worker_annotations and updates executors.TestKubernetesWorkerConfiguration.test_make_pod_with_empty_executor_config
> Annotations for k8s executors should support extended alphabet (like '/') > Key: AIRFLOW-5147 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5147 > Project: Apache Airflow > Issue Type: Bug > Components: executor-kubernetes, executors > Affects Versions: 1.10.3, 1.10.4 > Reporter: Andrei Loginov > Assignee: Daniel Imberman > Priority: Major > > The fix to introduce k8s annotations for executors (https://github.com/apache/airflow/pull/4589 for https://issues.apache.org/jira/browse/AIRFLOW-3766) limited the character set allowed for the annotation key to the [-._a-zA-Z0-9] set. However, many annotations contain `/`, for example: > {code:java} > injector.tumblr.com/request{code} > or > {code:java} > iam.amazonaws.com/role{code} > which would not be allowed in the current solution. > > I believe the original solution should be completely revisited. Instead of using a separate *kubernetes_annotations* section, there should be a key which contains a set of key:value annotations in some format, e.g. json: > {code:java} > [kubernetes] > annotations = { "iam.amazonaws.com/role": "arn:aws:iam:::role/some-role-CKU5HL9BIPXG", "some-other-anno-key": "some/value" } > {code} > > Supported character set for annotations: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set
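The JSON-based format proposed above could be parsed roughly as follows. This is a sketch only: the helper name and the validation regex are assumptions modeled on the Kubernetes annotation syntax rules linked above, not Airflow's actual implementation.

```python
import json
import re

# Kubernetes annotation keys are an optional DNS-subdomain prefix plus a name
# segment, e.g. "iam.amazonaws.com/role". This check is a simplified sketch of
# the rules in the Kubernetes docs, not an exact reimplementation.
_KEY_RE = re.compile(
    r"^([a-z0-9]([-a-z0-9.]*[a-z0-9])?/)?[A-Za-z0-9]([-._A-Za-z0-9]*[A-Za-z0-9])?$"
)

def parse_worker_annotations(raw):
    """Parse a hypothetical [kubernetes] annotations config value (JSON dict)."""
    if not raw:
        return {}
    annotations = json.loads(raw)
    for key in annotations:
        if not _KEY_RE.match(key):
            raise ValueError("Invalid annotation key: {}".format(key))
    return annotations

raw = '{"iam.amazonaws.com/role": "arn:aws:iam:::role/some-role", "some-other-anno-key": "some/value"}'
parsed = parse_worker_annotations(raw)
```

Keeping the whole mapping in one JSON-valued config key sidesteps the earlier restriction, since the annotation keys are no longer themselves airflow config keys.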
[GitHub] [airflow] andrei-l opened a new pull request #5819: [AIRFLOW-5147] extended character set for for k8s worker pods annotations
andrei-l opened a new pull request #5819: [AIRFLOW-5147] extended character set for for k8s worker pods annotations URL: https://github.com/apache/airflow/pull/5819
[GitHub] [airflow] danfrankj commented on issue #5815: [AIRFLOW-5210] Make finding template files more efficient
danfrankj commented on issue #5815: [AIRFLOW-5210] Make finding template files more efficient URL: https://github.com/apache/airflow/pull/5815#issuecomment-521300533 @BasPH was something wrong with this PR? - I'm seeing a message above about a revert
[jira] [Assigned] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
[ https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] venkata Bonu reassigned AIRFLOW-5213: - Assignee: venkata Bonu > DockerOperator failing when the docker default logging drivers are other than 'journald','json-file' > Key: AIRFLOW-5213 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5213 > Project: Apache Airflow > Issue Type: Bug > Components: DAG, operators > Affects Versions: 1.10.4 > Reporter: venkata Bonu > Assignee: venkata Bonu > Priority: Major > Labels: easyfix > > Background: > Docker can be configured with multiple logging drivers: syslog, local, json-file, journald, gelf, fluentd, awslogs, splunk, etwlogs, gcplogs, and Logentries. But reading docker logs is supported only with the drivers local, json-file, and journald. > Docker documentation: https://docs.docker.com/config/containers/logging/configure/ > > Description: > When Docker is configured with a logging driver other than local, json-file, or journald, Airflow tasks which use DockerOperator fail with the error > _docker.errors.APIError: 501 Server Error: Not Implemented ("configured logging driver does not support reading")_ > The issue is in the lines below, where the operator tries to read the logs by attaching to the container:
> {code:python}
> line = ''
> for line in self.cli.attach(container=self.container['Id'], stdout=True, stderr=True, stream=True):
>     line = line.strip()
>     if hasattr(line, 'decode'):
>         line = line.decode('utf-8')
>     self.log.info(line)
> {code}
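One possible fix direction for the issue above (a sketch, not the eventual patch): check whether the container's logging driver supports reading before attaching, and skip log collection otherwise. The readable-driver list comes from the Docker documentation cited above; the helper name and the fallback behaviour for an unset driver are assumptions.

```python
# Drivers for which the Docker daemon supports reading logs back,
# per the Docker logging documentation cited above.
READABLE_LOG_DRIVERS = {"local", "json-file", "journald"}

def can_read_container_logs(host_config):
    """Return True if the container's log driver supports reading.

    `host_config` is the dict-like HostConfig section of `docker inspect`
    output. An unset LogConfig falls back to the daemon default, which this
    sketch conservatively treats as readable (an assumption).
    """
    log_config = (host_config or {}).get("LogConfig") or {}
    driver = log_config.get("Type")
    if driver is None:
        return True  # daemon default; assumed readable here
    return driver in READABLE_LOG_DRIVERS

# The operator could then guard the attach loop, e.g.:
#   if can_read_container_logs(inspect_result["HostConfig"]):
#       for line in self.cli.attach(...): ...
#   else:
#       self.log.warning("Log driver does not support reading; skipping logs")
```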
[jira] [Assigned] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
[ https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] venkata Bonu reassigned AIRFLOW-5213: - Assignee: (was: venkata Bonu)
[jira] [Commented] (AIRFLOW-5210) Resolving Template Files for large DAGs hurts performance
[ https://issues.apache.org/jira/browse/AIRFLOW-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907336#comment-16907336 ] ASF subversion and git services commented on AIRFLOW-5210: -- Commit 577970c210c9160be9e2382ecfd3ae79b01e4d88 in airflow's branch refs/heads/revert-5815-df_resolve_template_files from Bas Harenslak [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=577970c ] Revert "[AIRFLOW-5210] Make finding template files more efficient (#5815)" This reverts commit eeac82318a6440b2d65f9a35b3437b91813945f4. > Resolving Template Files for large DAGs hurts performance > Key: AIRFLOW-5210 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5210 > Project: Apache Airflow > Issue Type: Bug > Components: DAG > Affects Versions: 1.10.4 > Reporter: Daniel Frank > Priority: Major > Fix For: 1.10.5 > > During task execution, "resolve_template_files" runs for all tasks in a given DAG. For large DAGs this takes a long time and is not necessary for tasks that do not use the template_ext field
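The optimization the PR aims at can be illustrated with a minimal sketch: skip template-file resolution for tasks whose operator declares no `template_ext`. The class and function names below are stand-ins, not Airflow's actual code.

```python
# Minimal illustration of the optimization: only tasks whose operator declares
# template extensions need the expensive per-field template-file resolution.
class PlainTask:
    template_ext = ()  # most operators: nothing to resolve

class ScriptTask:
    template_ext = (".sh", ".bash")  # e.g. operators that load script files

def tasks_needing_resolution(tasks):
    resolved = []
    for task in tasks:
        # Cheap class-attribute check avoids walking every templated field
        # of every task in the DAG.
        if not task.template_ext:
            continue
        resolved.append(task)  # expensive resolve_template_files would go here
    return resolved

tasks = [PlainTask(), ScriptTask(), PlainTask()]
```

For a DAG with thousands of tasks where only a handful use `template_ext`, this turns an O(tasks x fields) scan into a near no-op for the common case.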
[jira] [Resolved] (AIRFLOW-5211) Add pass_value to template_fields -- BigQueryValueCheckOperator
[ https://issues.apache.org/jira/browse/AIRFLOW-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Damon Liao resolved AIRFLOW-5211. - Resolution: Fixed > Add pass_value to template_fields -- BigQueryValueCheckOperator > Key: AIRFLOW-5211 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5211 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib > Affects Versions: 1.10.4 > Reporter: Damon Liao > Assignee: Damon Liao > Priority: Minor > Fix For: 1.10.5, 1.10.4 > > There are use cases that fill *pass_value* from *XCom* when using *BigQueryValueCheckOperator*, so add pass_value to template_fields.
[jira] [Commented] (AIRFLOW-5182) "KubernetesOperator" isn't implemented
[ https://issues.apache.org/jira/browse/AIRFLOW-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907329#comment-16907329 ] Ash Berlin-Taylor commented on AIRFLOW-5182: There never has been a KubernetesOperator, and the import is otherwise unused in the doc, so that line should just be removed from the docs. > "KubernetesOperator" isn't implemented > Key: AIRFLOW-5182 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5182 > Project: Apache Airflow > Issue Type: Bug > Components: documentation > Affects Versions: 1.10.3, 1.10.4 > Reporter: Esfahan > Priority: Minor > > h2. Problem > I encountered the following error with `KubernetesOperator`: > {code:java} > Broken DAG: [/root/airflow/dags/sample_k8s.py] cannot import name KubernetesOperator > {code} > h2. Investigation > The following document contains a sample code describing how to use the Kubernetes Executor: > https://airflow.apache.org/kubernetes.html#kubernetes-operator > There is a line `import KubernetesOperator`, but I think it isn't implemented in Airflow, and it isn't used in this script: > {code:java} > from airflow.contrib.operators import KubernetesOperator > {code} > I couldn't find `KubernetesOperator` in the following dirs: > * https://github.com/apache/airflow/tree/1.10.4/airflow/contrib/operators > * https://github.com/apache/airflow/tree/1.10.4/airflow/operators > Could you check it?
[jira] [Created] (AIRFLOW-5214) Airflow leaves too many TIME_WAIT TCP connections
Oliver Ricken created AIRFLOW-5214: -- Summary: Airflow leaves too many TIME_WAIT TCP connections Key: AIRFLOW-5214 URL: https://issues.apache.org/jira/browse/AIRFLOW-5214 Project: Apache Airflow Issue Type: Bug Components: DagRun, database Affects Versions: 1.10.4, 1.10.2 Environment: CentOS 7, Airflow 1.10.4, MariaDB Reporter: Oliver Ricken Dear experts, in Airflow versions 1.10.2 and 1.10.4 we experience a severe problem with the limitation of the number of concurrent tasks. We observe that when more than 8 tasks are started and executed in parallel, the majority of those tasks fail with the error "Can't connect to MySQL server" and error code 2006(99). This error code boils down to "Cannot bind socket to resource", which is why we started looking into the TCP connections of our Airflow host (a single node that hosts the webserver, scheduler and worker). When the 8 tasks are running simultaneously, we observe more than 15,000 TIME_WAIT connections while fewer than 50 are established. Given that the number of available ports is somewhat smaller than 30,000, this large number of blocked but unused TCP connections would explain the failure of further task executions. Can anyone explain how these many open connections blocking ports/sockets come about? Given that we have connection pooling enabled, we do not see any explanation yet. Your help is very much appreciated; this issue strongly limits our current performance! Cheers Oliver
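A quick way to quantify the symptom described above is to count sockets per TCP state. The sketch below parses `/proc/net/tcp`-format data (Linux-specific; the hex state code 06 is TIME_WAIT); the function name is illustrative.

```python
# Count TCP sockets in TIME_WAIT by parsing /proc/net/tcp-format text.
# The state is the hex code in the 4th whitespace-separated column ("st");
# 06 corresponds to TIME_WAIT.
TIME_WAIT = "06"

def count_time_wait(proc_net_tcp_text):
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count

# On a live host you would read the real table:
#   with open("/proc/net/tcp") as f:
#       print(count_time_wait(f.read()))
```

Watching this number while the 8 parallel tasks run would confirm whether each task execution churns through short-lived DB connections despite pooling.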
[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability URL: https://github.com/apache/airflow/pull/5743#discussion_r313909969
## File path: airflow/dag/serialization/serialized_baseoperator.py ##
@@ -45,6 +45,8 @@ def __init__(self, *args, **kwargs):
         self.ui_color = BaseOperator.ui_color
         self.ui_fgcolor = BaseOperator.ui_fgcolor
         self.template_fields = BaseOperator.template_fields
+        # Not None for SubDagOperator.
Review comment: Added in tests https://github.com/apache/airflow/pull/5743/commits/c68ee581e0534b91e41f5e696394a9b1d6e12baa
[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
[ https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] venkata Bonu updated AIRFLOW-5213: -- Attachment: (was: Screen Shot 2019-08-14 at 7.08.44 AM.png)
[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
[ https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] venkata Bonu updated AIRFLOW-5213: -- Attachment: (was: Screen Shot 2019-08-14 at 7.10.01 AM.png)
[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
[ https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] venkata Bonu updated AIRFLOW-5213: -- Attachment: Screen Shot 2019-08-14 at 7.08.44 AM.png -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
[ https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] venkata Bonu updated AIRFLOW-5213: -- Attachment: Screen Shot 2019-08-14 at 7.09.27 AM.png -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
[ https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] venkata Bonu updated AIRFLOW-5213: -- Attachment: (was: Screen Shot 2019-08-14 at 7.09.27 AM.png) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[GitHub] [airflow] potiuk commented on a change in pull request #5808: [AIRFLOW-5205] Check xml files depends on AIRFLOW-5161, AIRFLOW-5170, AIRFLOW-5180, AIRFLOW-5204,
potiuk commented on a change in pull request #5808: [AIRFLOW-5205] Check xml files depends on AIRFLOW-5161, AIRFLOW-5170, AIRFLOW-5180, AIRFLOW-5204, URL: https://github.com/apache/airflow/pull/5808#discussion_r313897477 ## File path: airflow/_vendor/slugify/slugify.py ## @@ -1,3 +1,6 @@ +# -*- coding: utf-8 -*- +# pylint: skip-file +"""Slugify !""" Review comment: Yeah. I will split those and exclude vendor from the original change. I thought I did that everywhere but I might have corrected vendor accidentally. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[jira] [Created] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'
venkata Bonu created AIRFLOW-5213: - Summary: DockerOperator failing when the docker default logging drivers are other than 'journald','json-file' Key: AIRFLOW-5213 URL: https://issues.apache.org/jira/browse/AIRFLOW-5213 Project: Apache Airflow Issue Type: Bug Components: DAG, operators Affects Versions: 1.10.4 Reporter: venkata Bonu Assignee: venkata Bonu Attachments: Screen Shot 2019-08-14 at 7.10.01 AM.png -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (AIRFLOW-5179) Top level __init__.py breaks imports
[ https://issues.apache.org/jira/browse/AIRFLOW-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907291#comment-16907291 ] ASF subversion and git services commented on AIRFLOW-5179: -- Commit 4e03d2390fc77e6a911fb97d8585fad482c589a6 in airflow's branch refs/heads/master from Ash Berlin-Taylor [ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4e03d23 ] [AIRFLOW-5179] Remove top level __init__.py (#5818) The recent commit 3724c2aa to master introduced a __init__.py file in the project root folder, which basically breaks all imports in local development (`pip install -e .`) as it turns the project root into a package. [ci skip] > Top level __init__.py breaks imports > > > Key: AIRFLOW-5179 > URL: https://issues.apache.org/jira/browse/AIRFLOW-5179 > Project: Apache Airflow > Issue Type: Bug > Components: build >Affects Versions: 2.0.0 >Reporter: Cedrik Neumann >Assignee: Ash Berlin-Taylor >Priority: Blocker > > The recent commit > [3724c2aaf4cfee4a60f6c7231777bfb256090c7c|https://github.com/apache/airflow/commit/3724c2aaf4cfee4a60f6c7231777bfb256090c7c] > to master introduced a {{__init__.py}} file in the project root folder, > which basically breaks all imports in local development ({{pip install -e > .}}) as it turns the project root into a package. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
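The failure mode described above (a root-level `__init__.py` turning the whole checkout into a package, breaking `pip install -e .` imports) is easy to regress into. A minimal sketch of a guard one could run in CI is below; the helper name `has_stray_root_init` is hypothetical and not part of the actual fix, and a throwaway temporary directory stands in for a checkout:

```python
# Hypothetical CI guard: detect a stray __init__.py in the repository root,
# which would make Python treat the checkout itself as a package.
import os
import tempfile

def has_stray_root_init(repo_root):
    """Return True if repo_root itself contains an __init__.py file."""
    return os.path.isfile(os.path.join(repo_root, '__init__.py'))

# Demonstration with a temporary directory standing in for a checkout:
with tempfile.TemporaryDirectory() as root:
    clean = has_stray_root_init(root)               # no __init__.py yet
    open(os.path.join(root, '__init__.py'), 'w').close()
    broken = has_stray_root_init(root)              # root is now a package
```

Run against the real project root, a `True` result would flag the same condition that AIRFLOW-5179 fixed by removing the file.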
[jira] [Resolved] (AIRFLOW-5179) Top level __init__.py breaks imports
[ https://issues.apache.org/jira/browse/AIRFLOW-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-5179. Resolution: Fixed -- This message was sent by Atlassian JIRA (v7.6.14#76016)