[jira] [Created] (AIRFLOW-5658) Links issue through reverse proxy access

2019-10-15 Thread Jira
Mikołaj Morawski created AIRFLOW-5658:
-

 Summary: Links issue through reverse proxy access 
 Key: AIRFLOW-5658
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5658
 Project: Apache Airflow
  Issue Type: Bug
  Components: ui
Affects Versions: 1.10.5, 1.10.4
Reporter: Mikołaj Morawski
 Attachments: image-2019-10-15-16-02-45-731.png

Access through a reverse proxy does not work for the following two URIs:

/configuration

/view

 

I think that the root cause is here:

[https://github.com/apache/airflow/blob/master/airflow/www/app.py]
{code:python}
appbuilder.add_link("Configurations",
                    href='/configuration',
                    category="Admin",
                    category_icon="fa-user")

appbuilder.add_link('Version',
                    href='/version',
                    category='About',
                    category_icon='fa-th')
{code}
The "href" parameter is a hard-coded path here rather than a reference to the 
registered views, so the reverse-proxy prefix is never applied; add_link 
should not be used with a raw path for these routes.
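
A minimal sketch of one direction a fix could take, assuming Flask-AppBuilder's menu resolves an href that is not a literal path through url_for() (falling back to the raw string otherwise), and that url_for() honours the prefix a reverse proxy sets via SCRIPT_NAME. The endpoint names below are illustrative guesses, not the actual view names:

{code:python}
# Hedged sketch, not the actual fix: pass endpoint names instead of
# hard-coded paths so the menu URLs are built by url_for(), which
# respects the reverse-proxy prefix. 'Airflow.conf' / 'Airflow.version'
# are assumed endpoint names used only for illustration.
appbuilder.add_link("Configurations",
                    href='Airflow.conf',      # endpoint name, resolved via url_for()
                    category="Admin",
                    category_icon="fa-user")

appbuilder.add_link('Version',
                    href='Airflow.version',   # endpoint name, resolved via url_for()
                    category='About',
                    category_icon='fa-th')
{code}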

 

 The second problem is that the "Task Tries" icon is not displayed properly. 

 

Regards,

Mikolaj



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (AIRFLOW-5658) Links issue through reverse proxy access

2019-10-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikołaj Morawski updated AIRFLOW-5658:
--
 Attachment: image-2019-10-15-16-02-45-731.png
Description: 
Access through a reverse proxy does not work for the following two URIs:

/configuration

/view

 

I think that the root cause is here:

[https://github.com/apache/airflow/blob/master/airflow/www/app.py]
{code:python}
appbuilder.add_link("Configurations",
                    href='/configuration',
                    category="Admin",
                    category_icon="fa-user")

appbuilder.add_link('Version',
                    href='/version',
                    category='About',
                    category_icon='fa-th')
{code}
The "href" parameter is a hard-coded path here rather than a reference to the 
registered views, so the reverse-proxy prefix is never applied; add_link 
should not be used with a raw path for these routes.

 

 The second problem is that the "Task Tries" icon is not displayed properly. 

  !image-2019-10-15-16-02-45-731.png!

Regards,

Mikolaj

  was:
Access through a reverse proxy does not work for the following two URIs:

/configuration

/view

 

I think that the root cause is here:

[https://github.com/apache/airflow/blob/master/airflow/www/app.py]
{code:python}
appbuilder.add_link("Configurations",
                    href='/configuration',
                    category="Admin",
                    category_icon="fa-user")

appbuilder.add_link('Version',
                    href='/version',
                    category='About',
                    category_icon='fa-th')
{code}
The "href" parameter is a hard-coded path here rather than a reference to the 
registered views, so the reverse-proxy prefix is never applied; add_link 
should not be used with a raw path for these routes.

 

 The second problem is that the "Task Tries" icon is not displayed properly. 

 

Regards,

Mikolaj


> Links issue through reverse proxy access 
> -
>
> Key: AIRFLOW-5658
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5658
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.10.4, 1.10.5
>Reporter: Mikołaj Morawski
>Priority: Trivial
> Attachments: image-2019-10-15-16-02-45-731.png
>
>
> Access through a reverse proxy does not work for the following two URIs:
> /configuration
> /view
>  
> I think that the root cause is here:
> [https://github.com/apache/airflow/blob/master/airflow/www/app.py]
> {code:python}
> appbuilder.add_link("Configurations",
>                     href='/configuration',
>                     category="Admin",
>                     category_icon="fa-user")
> appbuilder.add_link('Version',
>                     href='/version',
>                     category='About',
>                     category_icon='fa-th')
> {code}
> The "href" parameter is a hard-coded path here rather than a reference to the 
> registered views, so the reverse-proxy prefix is never applied; add_link 
> should not be used with a raw path for these routes.
>  
>  The second problem is that the "Task Tries" icon is not displayed properly. 
>   !image-2019-10-15-16-02-45-731.png!
> Regards,
> Mikolaj



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (AIRFLOW-5658) Links issue through reverse proxy access

2019-10-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikołaj Morawski updated AIRFLOW-5658:
--
Description: 
Access through a reverse proxy does not work for the following two URIs:

/configuration

/version

 

I think that the root cause is here:

[https://github.com/apache/airflow/blob/master/airflow/www/app.py]
{code:python}
appbuilder.add_link("Configurations",
                    href='/configuration',
                    category="Admin",
                    category_icon="fa-user")

appbuilder.add_link('Version',
                    href='/version',
                    category='About',
                    category_icon='fa-th')
{code}
The "href" parameter is a hard-coded path here rather than a reference to the 
registered views, so the reverse-proxy prefix is never applied; add_link 
should not be used with a raw path for these routes.

 

 The second problem is that the "Task Tries" icon is not displayed properly. 

  !image-2019-10-15-16-02-45-731.png!

Regards,

Mikolaj

  was:
Access through a reverse proxy does not work for the following two URIs:

/configuration

/view

 

I think that the root cause is here:

[https://github.com/apache/airflow/blob/master/airflow/www/app.py]
{code:python}
appbuilder.add_link("Configurations",
                    href='/configuration',
                    category="Admin",
                    category_icon="fa-user")

appbuilder.add_link('Version',
                    href='/version',
                    category='About',
                    category_icon='fa-th')
{code}
The "href" parameter is a hard-coded path here rather than a reference to the 
registered views, so the reverse-proxy prefix is never applied; add_link 
should not be used with a raw path for these routes.

 

 The second problem is that the "Task Tries" icon is not displayed properly. 

  !image-2019-10-15-16-02-45-731.png!

Regards,

Mikolaj


> Links issue through reverse proxy access 
> -
>
> Key: AIRFLOW-5658
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5658
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.10.4, 1.10.5
>Reporter: Mikołaj Morawski
>Priority: Trivial
> Attachments: image-2019-10-15-16-02-45-731.png
>
>
> Access through a reverse proxy does not work for the following two URIs:
> /configuration
> /version
>  
> I think that the root cause is here:
> [https://github.com/apache/airflow/blob/master/airflow/www/app.py]
> {code:python}
> appbuilder.add_link("Configurations",
>                     href='/configuration',
>                     category="Admin",
>                     category_icon="fa-user")
> appbuilder.add_link('Version',
>                     href='/version',
>                     category='About',
>                     category_icon='fa-th')
> {code}
> The "href" parameter is a hard-coded path here rather than a reference to the 
> registered views, so the reverse-proxy prefix is never applied; add_link 
> should not be used with a raw path for these routes.
>  
>  The second problem is that the "Task Tries" icon is not displayed properly. 
>   !image-2019-10-15-16-02-45-731.png!
> Regards,
> Mikolaj



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5636) Allow adding or overriding existing Extra Operator Links for existing Operators

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951953#comment-16951953
 ] 

ASF GitHub Bot commented on AIRFLOW-5636:
-

kaxil commented on pull request #6302: [AIRFLOW-5636] Allow adding or 
overriding existing Operator Links for…
URL: https://github.com/apache/airflow/pull/6302
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow adding or overriding existing Extra Operator Links for existing 
> Operators
> ---
>
> Key: AIRFLOW-5636
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5636
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, plugins, ui
>Affects Versions: 2.0.0, 1.10.5
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Major
> Fix For: 2.0.0, 1.10.6
>
>
> Currently, we can add Operator link globally for all tasks or add operator 
> links in existing tasks.
> We should be able to add Operator links on a Single Operator or override 
> existing Operator link on an operator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] kaxil merged pull request #6302: [AIRFLOW-5636] Allow adding or overriding existing Operator Links for…

2019-10-15 Thread GitBox
kaxil merged pull request #6302: [AIRFLOW-5636] Allow adding or overriding 
existing Operator Links for…
URL: https://github.com/apache/airflow/pull/6302
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (AIRFLOW-5636) Allow adding or overriding existing Extra Operator Links for existing Operators

2019-10-15 Thread Kaxil Naik (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaxil Naik resolved AIRFLOW-5636.
-
Resolution: Fixed

> Allow adding or overriding existing Extra Operator Links for existing 
> Operators
> ---
>
> Key: AIRFLOW-5636
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5636
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, plugins, ui
>Affects Versions: 2.0.0, 1.10.5
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Major
> Fix For: 1.10.6
>
>
> Currently, we can add Operator link globally for all tasks or add operator 
> links in existing tasks.
> We should be able to add Operator links on a Single Operator or override 
> existing Operator link on an operator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5636) Allow adding or overriding existing Extra Operator Links for existing Operators

2019-10-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951954#comment-16951954
 ] 

ASF subversion and git services commented on AIRFLOW-5636:
--

Commit 1de210b8a96b8fff0ef59b1e740c8a08c14b6394 in airflow's branch 
refs/heads/master from Kaxil Naik
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=1de210b ]

[AIRFLOW-5636] Allow adding or overriding existing Operator Links (#6302)



> Allow adding or overriding existing Extra Operator Links for existing 
> Operators
> ---
>
> Key: AIRFLOW-5636
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5636
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, plugins, ui
>Affects Versions: 2.0.0, 1.10.5
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Major
> Fix For: 2.0.0, 1.10.6
>
>
> Currently, we can add Operator link globally for all tasks or add operator 
> links in existing tasks.
> We should be able to add Operator links on a Single Operator or override 
> existing Operator link on an operator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (AIRFLOW-5636) Allow adding or overriding existing Extra Operator Links for existing Operators

2019-10-15 Thread Kaxil Naik (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaxil Naik updated AIRFLOW-5636:

Fix Version/s: (was: 2.0.0)

> Allow adding or overriding existing Extra Operator Links for existing 
> Operators
> ---
>
> Key: AIRFLOW-5636
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5636
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, plugins, ui
>Affects Versions: 2.0.0, 1.10.5
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Major
> Fix For: 1.10.6
>
>
> Currently, we can add Operator link globally for all tasks or add operator 
> links in existing tasks.
> We should be able to add Operator links on a Single Operator or override 
> existing Operator link on an operator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (AIRFLOW-5659) Add support for ephemeral storage on KubernetesPodOperator

2019-10-15 Thread Leonardo Miguel (Jira)
Leonardo Miguel created AIRFLOW-5659:


 Summary: Add support for ephemeral storage on KubernetesPodOperator
 Key: AIRFLOW-5659
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5659
 Project: Apache Airflow
  Issue Type: New Feature
  Components: operators
Affects Versions: 2.0.0
Reporter: Leonardo Miguel
Assignee: Leonardo Miguel


KubernetesPodOperator currently doesn't support requests and limits for the 
'ephemeral-storage' resource.
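
For illustration, a hypothetical sketch of what such support could look like on the operator; the ephemeral-storage resource keys below do not exist today, they are exactly what this issue requests:

{code:python}
# Hypothetical sketch only: the ephemeral-storage keys are NOT supported
# by KubernetesPodOperator today; adding them is what this issue asks for.
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

task = KubernetesPodOperator(
    task_id='ephemeral_storage_example',
    name='ephemeral-storage-example',
    namespace='default',
    image='python:3.7',
    cmds=['python', '-c', 'print("hello")'],
    resources={
        'request_ephemeral_storage': '1Gi',  # proposed, does not exist yet
        'limit_ephemeral_storage': '2Gi',    # proposed, does not exist yet
    },
)
{code}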



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] adityav commented on a change in pull request #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
adityav commented on a change in pull request #6340: [Airflow-5660] Try to find 
the task in DB before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#discussion_r335107805
 
 

 ##
 File path: airflow/executors/kubernetes_executor.py
 ##
 @@ -545,6 +545,26 @@ def _labels_to_key(self, labels):
             return None
 
         with create_session() as session:
+            # check if we can find the task directly before scanning
+            # every task with the execution_date
+            task = (
+                session
+                .query(TaskInstance)
+                .filter_by(task_id=task_id, dag_id=dag_id, execution_date=ex_time)
+                .first()
+            )
+            if task:
+                self.log.info(
+                    'Found matching task %s-%s (%s) with current state of %s',
+                    task.dag_id, task.task_id, task.execution_date, task.state
+                )
+                return (dag_id, task_id, ex_time, try_num)
+            else:
+                self.log.warning(
+                    "Unable to find Task in db directly for %s-%s (%s). Scan all tasks",
 
 Review comment:
   sure. I will rework the code a bit to check for this explicitly.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335179058
 
 

 ##
 File path: docs/howto/enable-dag-serialization.rst
 ##
 @@ -0,0 +1,109 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+
+
+Enable DAG Serialization
+
+
+Add the following settings in ``airflow.cfg``:
+
+.. code-block:: ini
+
+[core]
+store_serialized_dags = True
+min_serialized_dag_update_interval = 30
+
+*   ``store_serialized_dags``: This flag decides whether to serialises DAGs 
and persist them in DB.
+If set to True, Webserver reads from DB instead of parsing DAG files
+*   ``min_serialized_dag_update_interval``: This flag sets the minimum 
interval (in seconds) after which
+the serialized DAG in DB should be updated. This helps in reducing 
database write rate.
+
+If you are updating Airflow from <1.10.6, please do not forget to run 
``airflow db upgrade``.
+
+
+How it works
+
+
+In order to make Airflow Webserver stateless (almost!), Airflow >=1.10.6 
supports
+DAG Serialization and DB Persistence.
+
+.. image:: ../img/dag_serialization.png
+
+As shown in the image above in Vanilla Airflow, the Webserver and the 
Scheduler both
 
 Review comment:
   ```suggestion
   In Vanilla Airflow, the Webserver and the Scheduler both
   ```
   
   The term "Vanilla Airflow" seems strange to me. Are you sure you need to 
introduce this term in the documentation? I would prefer "Airflow with DAG 
persistence in DB" and "Airflow without DAG persistence in DB".


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
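
As background for the review thread above, a hedged sketch of the read path the docs describe: the webserver fetching a serialized DAG from the metadata DB instead of parsing files. SerializedDagModel and its methods are assumed from the PR and may differ in the merged code:

```python
# Hedged sketch of the webserver-side read path described in the docs
# under review; SerializedDagModel.get() and .dag are assumptions
# based on the PR, not a confirmed API.
from airflow.models.serialized_dag import SerializedDagModel

def get_dag_for_ui(dag_id):
    # Read the JSON blob from the serialized_dag table and inflate it,
    # instead of re-parsing the DAG file on every webserver request.
    record = SerializedDagModel.get(dag_id)
    return record.dag if record else None
```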


[GitHub] [airflow] mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335179483
 
 

 ##
 File path: docs/howto/enable-dag-serialization.rst
 ##
 @@ -0,0 +1,109 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+
+
+Enable DAG Serialization
+
+
+Add the following settings in ``airflow.cfg``:
+
+.. code-block:: ini
+
+[core]
+store_serialized_dags = True
+min_serialized_dag_update_interval = 30
+
+*   ``store_serialized_dags``: This flag decides whether to serialises DAGs 
and persist them in DB.
+If set to True, Webserver reads from DB instead of parsing DAG files
+*   ``min_serialized_dag_update_interval``: This flag sets the minimum 
interval (in seconds) after which
+the serialized DAG in DB should be updated. This helps in reducing 
database write rate.
+
+If you are updating Airflow from <1.10.6, please do not forget to run 
``airflow db upgrade``.
+
+
+How it works
+
+
+In order to make Airflow Webserver stateless (almost!), Airflow >=1.10.6 
supports
+DAG Serialization and DB Persistence.
+
+.. image:: ../img/dag_serialization.png
+
+As shown in the image above in Vanilla Airflow, the Webserver and the 
Scheduler both
+needs access to the DAG files. Both the scheduler and webserver parses the DAG 
files.
+
+With **DAG Serialization** we aim to decouple the webserver from DAG parsing
+which would make the Webserver very light-weight.
+
+As shown in the image above, when using the dag_serilization feature,
 
 Review comment:
   ```suggestion
   As shown in the image above, when using this feature,
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] jaketf commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
jaketf commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-542428297
 
 
   @potiuk I'd love to work on this further as I have time.
   I'd really like input on next steps: 
   1. I don't know how to fix the failing docs build.
   1. Do people think the proposed XCom for tracking external resource id makes 
sense?
   1. Are there additional features we need to make this viable? (See point 6) 
   1. Is there a good candidate operator to rework to use this? (there was 
discussion about dataflow and dataproc above but either the hooks are not set 
up well or the operators are actively being reworked)
   1. Is there more test coverage needed? I tested what I considered new and 
not already covered by the BaseSensorOperator tests.
   1. Do we have any process for "scale testing" a feature like this? My 
concern is that while this might pass basic unit tests, if I have tons of DAGs 
using async operators, will it make the scheduler fall over? Certainly this 
could be mitigated by using different pools / priorities, but should we be 
building a lower priority weight into the BaseAsyncOperator defaults? Should 
there be a mechanism that specifies a submit priority weight vs a poke priority 
weight? (A rough sketch of this idea follows this list.)
   1. I think this feature would be awesome but will be confusing to users 
sometimes. We should probably add a section to the docs on BaseAsyncOperator, 
when it makes sense to use it and what assumptions it makes about a hook. Where 
would this belong? Should docs only PRs be separate?
   1. Validate my assumption that this would not need a retroactive AIP.
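   
   Purely to illustrate that last idea, a hypothetical sketch; neither parameter exists in the PR, this is only the shape of the proposal:
   
   ```python
   # Hypothetical sketch of split submit/poke priorities; neither
   # parameter exists in the PR, this only illustrates the proposal.
   from airflow.sensors.base_sensor_operator import BaseSensorOperator

   class BaseAsyncOperator(BaseSensorOperator):
       def __init__(self,
                    submit_priority_weight=2,  # weight while kicking the job off
                    poke_priority_weight=1,    # lower weight for later polls
                    **kwargs):
           kwargs.setdefault('priority_weight', submit_priority_weight)
           super(BaseAsyncOperator, self).__init__(**kwargs)
           self.poke_priority_weight = poke_priority_weight
   ```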


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
kaxil commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized 
DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-542431579
 
 
   @mik-laj Agree with most of your comments, will update it by end of this 
week.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] houqp opened a new pull request #6342: [AIRFLOW-5662] fix incorrect naming for scheduler used slot metric

2019-10-15 Thread GitBox
houqp opened a new pull request #6342: [AIRFLOW-5662] fix incorrect naming for 
scheduler used slot metric
URL: https://github.com/apache/airflow/pull/6342
 
 
   
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-5662
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   other than the naming fix, this patch also adds:
   
   * metric collection code optimization to avoid unnecessary SQL query
   * queued and occupied slots metrics
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5662) Scheduler incorrectly emits occupied slots count metric as used slot count

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952230#comment-16952230
 ] 

ASF GitHub Bot commented on AIRFLOW-5662:
-

houqp commented on pull request #6342: [AIRFLOW-5662] fix incorrect naming for 
scheduler used slot metric
URL: https://github.com/apache/airflow/pull/6342
 
 
   
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-5662
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   other than the naming fix, this patch also adds:
   
   * metric collection code optimization to avoid unnecessary SQL query
   * queued and occupied slots metrics
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Scheduler incorrectly emits occupied slots count metric as used slot count
> --
>
> Key: AIRFLOW-5662
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5662
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.0.0
>Reporter: QP Hou
>Assignee: QP Hou
>Priority: Trivial
>
> The value of the pool.used_slots metric is actually the occupied slot count. There is 
> also a redundant SQL query to the database that can be removed in that loop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] SaturnFromTitan closed pull request #6320: [AIRFLOW-5623]: attempt to add a failing test

2019-10-15 Thread GitBox
SaturnFromTitan closed pull request #6320: [AIRFLOW-5623]: attempt to add a 
failing test
URL: https://github.com/apache/airflow/pull/6320
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5623) latest_only_operator fails for schedule_interval='@once'

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952246#comment-16952246
 ] 

ASF GitHub Bot commented on AIRFLOW-5623:
-

SaturnFromTitan commented on pull request #6320: [AIRFLOW-5623]: attempt to add 
a failing test
URL: https://github.com/apache/airflow/pull/6320
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> latest_only_operator fails for schedule_interval='@once'
> 
>
> Key: AIRFLOW-5623
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5623
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: operators
>Affects Versions: 1.10.5
>Reporter: Gerben Oostra
>Assignee: Martin Winkel
>Priority: Minor
>
> Observation: In a dag with schedule_interval set to @once, the 
> `latest_only_operator` fails with the following error:
>  
> {{[2019-10-09 09:51:37,346] \{latest_only_operator.py:48} INFO - Checking 
> latest only with left_window: None right_window: None now: 2019-10-09 
> 07:51:37.346697+00:00
> [2019-10-09 09:51:37,347] \{models.py:1736} ERROR - '<' not supported between 
> instances of 'NoneType' and 'datetime.datetime'
> Traceback (most recent call last):
>   File 
> "//anaconda/envs/airflow/lib/python3.6/site-packages/airflow/models.py", 
> line 1633, in _run_raw_task
> result = task_copy.execute(context=context)
>   File 
> "//anaconda/envs/airflow/lib/python3.6/site-packages/airflow/operators/latest_only_operator.py",
>  line 51, in execute
> if not left_window < now <= right_window:
> TypeError: '<' not supported between instances of 'NoneType' and 
> 'datetime.datetime'
> [2019-10-09 09:51:37,363] \{models.py:1756} INFO - Marking task as 
> UP_FOR_RETRY}}
> I expected it to succeed and allow the remainder of the DAG to be run (if 
> an @once DAG is running, it is always the latest).
> Root cause analysis:
> If the `schedule_interval` of a dag is `@once`, the dag's field 
> `self._schedule_interval` is set to `None`.
> The `latest_only_operator` determines the window by passing the execution 
> date to the dag's `following_schedule()`. There the dag's 
> `self._schedule_interval` type is compared to `six.string_types` and 
> `timedelta`. Both type checks fail, so nothing (`None`) is returned,
> causing the time window comparison to fail.
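
A condensed sketch of that failure path, simplified from DAG.following_schedule in Airflow 1.10 as an assumption (the real method also resolves cron expressions via croniter):

{code:python}
# Condensed sketch of the root cause described above; simplified,
# assumed shape of DAG.following_schedule in Airflow 1.10.
from datetime import timedelta
import six

class Dag(object):
    def __init__(self, schedule_interval):
        # '@once' is normalised to None internally
        self._schedule_interval = None if schedule_interval == '@once' else schedule_interval

    def following_schedule(self, dttm):
        if isinstance(self._schedule_interval, six.string_types):
            pass  # cron expression: next fire time computed with croniter
        elif isinstance(self._schedule_interval, timedelta):
            return dttm + self._schedule_interval
        # '@once' fails both checks, so None falls out, and the
        # LatestOnlyOperator later evaluates `left_window < now` with
        # left_window=None, raising the TypeError quoted above.
{code}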



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335177865
 
 

 ##
 File path: docs/howto/enable-dag-serialization.rst
 ##
 @@ -0,0 +1,109 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+
+
+Enable DAG Serialization
 
 Review comment:
   I think that you should first describe the feature and then describe how to 
enable it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335181941
 
 

 ##
 File path: docs/howto/enable-dag-serialization.rst
 ##
 @@ -0,0 +1,109 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+
+
+Enable DAG Serialization
+
+
+Add the following settings in ``airflow.cfg``:
+
+.. code-block:: ini
+
+[core]
+store_serialized_dags = True
+min_serialized_dag_update_interval = 30
+
+*   ``store_serialized_dags``: This flag decides whether to serialises DAGs 
and persist them in DB.
+If set to True, Webserver reads from DB instead of parsing DAG files
+*   ``min_serialized_dag_update_interval``: This flag sets the minimum 
interval (in seconds) after which
+the serialized DAG in DB should be updated. This helps in reducing 
database write rate.
+
+If you are updating Airflow from <1.10.6, please do not forget to run 
``airflow db upgrade``.
+
+
+How it works
+
+
+In order to make Airflow Webserver stateless (almost!), Airflow >=1.10.6 
supports
+DAG Serialization and DB Persistence.
+
+.. image:: ../img/dag_serialization.png
+
+As shown in the image above in Vanilla Airflow, the Webserver and the 
Scheduler both
+needs access to the DAG files. Both the scheduler and webserver parses the DAG 
files.
+
+With **DAG Serialization** we aim to decouple the webserver from DAG parsing
+which would make the Webserver very light-weight.
+
+As shown in the image above, when using the dag_serilization feature,
+the Scheduler parses the DAG files, serializes them in JSON format and saves 
them in the Metadata DB.
+
+The Webserver now instead of having to parse the DAG file again, reads the
+serialized DAGs in JSON, de-serializes them and create the DagBag and uses it
+to show in the UI.
+
+One of the key features that is implemented as the part of DAG Serialization 
is that
+instead of loading an entire DagBag when the WebServer starts we only load 
each DAG on demand from the
+Serialized Dag table. This helps reduce Webserver startup time and memory. The 
reduction is notable
+when you have large number of DAGs.
+
+Below is the screenshot of the ``serialized_dag`` table in Metadata DB:
+
+.. image:: ../img/serialized_dag_table.png
+
+Limitations
+---
+The Webserver will still need access to DAG files in the following cases,
+which is why we said "almost" stateless.
+
+*   **Rendered Template** tab will still have to parse Python file as it needs 
all the details like
+the execution date and even the data passed by the upstream task using 
Xcom.
+*   **Code View** will read the DAG File & show it using Pygments.
+However, it does not need to Parse the Python file so it is still a small 
operation.
+*   :doc:`Extra Operator Links ` would not work out of
+the box. They need to be defined in Airflow Plugin.
+
+**Existing Airflow Operators**:
 
 Review comment:
   I don't understand the relationship of this text to DAG serialization


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] dstandish commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
dstandish commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-542409947
 
 
   > @dstandish mentioned 
([slack](https://apache-airflow.slack.com/archives/CCY359SCV/p1569608756022600?thread_ts=1569540521.006000=CCY359SCV))
 if polling / retrieval is lower priority than kicking the job off, this could be 
a reason to have a separate operator + sensor.
   
   In the case I was talking about, an async operator would have worked great. 
Splitting into two tasks at the time was a hacky way to sort of achieve the same 
behavior that this operator provides in one task.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335183403
 
 

 ##
 File path: docs/howto/enable-dag-serialization.rst
 ##
 @@ -0,0 +1,109 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+
+
+Enable DAG Serialization
+
+
+Add the following settings in ``airflow.cfg``:
+
+.. code-block:: ini
+
+[core]
+store_serialized_dags = True
+min_serialized_dag_update_interval = 30
+
+*   ``store_serialized_dags``: This flag decides whether to serialises DAGs 
and persist them in DB.
+If set to True, Webserver reads from DB instead of parsing DAG files
+*   ``min_serialized_dag_update_interval``: This flag sets the minimum 
interval (in seconds) after which
+the serialized DAG in DB should be updated. This helps in reducing 
database write rate.
+
+If you are updating Airflow from <1.10.6, please do not forget to run 
``airflow db upgrade``.
+
+
+How it works
+
+
+In order to make Airflow Webserver stateless (almost!), Airflow >=1.10.6 
supports
+DAG Serialization and DB Persistence.
+
+.. image:: ../img/dag_serialization.png
+
+As shown in the image above in Vanilla Airflow, the Webserver and the 
Scheduler both
+needs access to the DAG files. Both the scheduler and webserver parses the DAG 
files.
+
+With **DAG Serialization** we aim to decouple the webserver from DAG parsing
+which would make the Webserver very light-weight.
+
+As shown in the image above, when using the dag_serilization feature,
+the Scheduler parses the DAG files, serializes them in JSON format and saves 
them in the Metadata DB.
+
+The Webserver now instead of having to parse the DAG file again, reads the
+serialized DAGs in JSON, de-serializes them and create the DagBag and uses it
+to show in the UI.
+
+One of the key features that is implemented as the part of DAG Serialization 
is that
+instead of loading an entire DagBag when the WebServer starts we only load 
each DAG on demand from the
+Serialized Dag table. This helps reduce Webserver startup time and memory. The 
reduction is notable
+when you have large number of DAGs.
+
+Below is the screenshot of the ``serialized_dag`` table in Metadata DB:
+
+.. image:: ../img/serialized_dag_table.png
+
+Limitations
+---
+The Webserver will still need access to DAG files in the following cases,
+which is why we said "almost" stateless.
+
+*   **Rendered Template** tab will still have to parse Python file as it needs 
all the details like
+the execution date and even the data passed by the upstream task using 
Xcom.
+*   **Code View** will read the DAG File & show it using Pygments.
+However, it does not need to Parse the Python file so it is still a small 
operation.
+*   :doc:`Extra Operator Links ` would not work out of
+the box. They need to be defined in Airflow Plugin.
+
+**Existing Airflow Operators**:
+To make extra operator links work with existing operators like BigQuery, 
copy all
+the classes that are defined in ``operator_extra_links`` property.
+
+For :class:`~airflow.gcp.operators.bigquery.BigQueryOperator` you can 
create the following plugin for extra links:
+
+.. code-block:: python
+
+    from airflow.plugins_manager import AirflowPlugin
+    from airflow.models.baseoperator import BaseOperatorLink
+    from airflow.gcp.operators.bigquery import BigQueryOperator
+
+    class BigQueryConsoleLink(BaseOperatorLink):
+        """
+        Helper class for constructing BigQuery link.
+        """
+        name = 'BigQuery Console'
+        operators = [BigQueryOperator]
+
+        def get_link(self, operator, dttm):
+            ti = TaskInstance(task=operator, execution_date=dttm)
+            job_id = ti.xcom_pull(task_ids=operator.task_id, key='job_id')
+            return BIGQUERY_JOB_DETAILS_LINK_FMT.format(job_id=job_id) if job_id else ''
 
 Review comment:
   This code is not 100% valid. It has been updated. 
https://github.com/apache/airflow/pull/5906/files


[GitHub] [airflow] ikurian-coatue opened a new pull request #6343: Cron scheduler

2019-10-15 Thread GitBox
ikurian-coatue opened a new pull request #6343: Cron scheduler
URL: https://github.com/apache/airflow/pull/6343
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-4761) Airflow Task Clear function throws error

2019-10-15 Thread Ben Storrie (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952347#comment-16952347
 ] 

Ben Storrie commented on AIRFLOW-4761:
--

I was not. It generally happens on clearing any task within DAGs that have 
certain kinds of hooks or operators. The S3Hook is a good example: if I have 
multiple S3 ops, I have to clear an entire dagrun rather than a single task 
within it, because they fail on deepcopy.
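
A minimal Python 2 reproduction of this failure mode, assuming the object being copied carries a bound method (which is what the copy.copy(v) frame in the traceback above suggests):

{code:python}
# Minimal Python 2 repro (assumed): copying a bound method goes through
# copy_reg.__newobj__ and raises the same TypeError as the traceback.
import copy

class Op(object):
    def work(self):
        pass

copy.copy(Op().work)
# Python 2: TypeError: instancemethod expected at least 2 arguments, got 0
{code}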

> Airflow Task Clear function throws error
> 
>
> Key: AIRFLOW-4761
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4761
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, DagRun
>Affects Versions: 1.10.3
> Environment: CentOS 7, Python 2.7.10
>Reporter: Ben Storrie
>Priority: Major
>
> When using the airflow webserver to clear a task inside a dagrun, an error is 
> thrown on certain types of tasks:
>  
> {code:java}
> Traceback (most recent call last):
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 2311, in wsgi_app
> response = self.full_dispatch_request()
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1834, in full_dispatch_request
> rv = self.handle_user_exception(e)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1737, in handle_user_exception
> reraise(exc_type, exc_value, tb)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1832, in full_dispatch_request
> rv = self.dispatch_request()
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1818, in dispatch_request
> return self.view_functions[rule.endpoint](**req.view_args)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_admin/base.py",
>  line 69, in inner
> return self._run_view(f, *args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_admin/base.py",
>  line 368, in _run_view
> return fn(self, *args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_login/utils.py",
>  line 261, in decorated_view
> return func(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/utils.py",
>  line 275, in wrapper
> return f(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/utils.py",
>  line 322, in wrapper
> return f(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/views.py",
>  line 1202, in clear
> include_upstream=upstream)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 3830, in sub_dag
> dag = copy.deepcopy(self)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 174, in deepcopy
> y = copier(memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 3815, in __deepcopy__
> setattr(result, k, copy.deepcopy(v, memo))
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 163, in deepcopy
> y = copier(x, memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 257, in _deepcopy_dict
> y[deepcopy(key, memo)] = deepcopy(value, memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 174, in deepcopy
> y = copier(memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 2492, in __deepcopy__
> setattr(result, k, copy.copy(v))
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 96, in copy
> return _reconstruct(x, rv, 0)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 329, in _reconstruct
> y = callable(*args)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy_reg.py",
>  line 93, in __newobj__
> return cls.__new__(cls, *args)
> TypeError: instancemethod expected at least 2 arguments, got 0{code}
>  
> I had expected AIRFLOW-2060 being resolved to resolve this on upgrade to 
> 1.10.3:
> {code:java}
> (my-hadoop-airflow) [user@hostname ~]$ pip freeze | grep pendulum
> pendulum==1.4.4{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] aKumpan edited a comment on issue #6256: [AIRFLOW-5590] Add run_id to trigger DAG run API response

2019-10-15 Thread GitBox
aKumpan edited a comment on issue #6256: [AIRFLOW-5590] Add run_id to trigger 
DAG run API response
URL: https://github.com/apache/airflow/pull/6256#issuecomment-538399007
 
 
   Hi @XD-DENG, could you please provide more details on why you do not like the 
[AIRFLOW-5593 proposal](https://issues.apache.org/jira/browse/AIRFLOW-5593)? That 
would give me a better understanding of whether it is worth implementing.
   
   The only concern I have is that the implementation will introduce breaking API 
changes. But since the API is experimental, it is better to do it now than never.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5662) Scheduler incorrectly emits occupied slots count metric as used slot count

2019-10-15 Thread QP Hou (Jira)
QP Hou created AIRFLOW-5662:
---

 Summary: Scheduler incorrectly emits occupied slots count metric 
as used slot count
 Key: AIRFLOW-5662
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5662
 Project: Apache Airflow
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.0
Reporter: QP Hou
Assignee: QP Hou


The value of the pool.used_slots metric is actually the occupied slot count. There is also 
a redundant SQL query to the database that can be removed in that loop.
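
A hedged sketch of the renaming this implies: report occupied slots under their own metric name instead of mislabelling them as used slots. The Pool accessors and Stats facade shown are assumptions based on this description, not the merged code:

{code:python}
# Hedged sketch; Pool.open_slots() / occupied_slots() as accessors and
# settings.Stats as the statsd facade are assumptions for illustration.
from airflow import settings
from airflow.models import Pool

session = settings.Session()
for pool in session.query(Pool).all():
    settings.Stats.gauge('pool.open_slots.%s' % pool.pool, pool.open_slots())
    # previously this value was emitted under pool.used_slots.<name>
    settings.Stats.gauge('pool.occupied_slots.%s' % pool.pool, pool.occupied_slots())
{code}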



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1695#comment-1695
 ] 

Aditya Vishwakarma commented on AIRFLOW-5660:
-

[~dimberman] Can you tell me more about the scale testing? I am trying to run 
very large DAGs and also trying to run a lot of them. In essence, I am running 
into scaling issues in prod. This is one of the few issues I faced.

For example, the next one on my list is the DagFileProcessor: it can take 300-500 
seconds to process a large DAG. I would love to be able to experiment with a 
performance testing framework.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> 
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can 
> take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and it does this for every task in 
> progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million tasks from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before regressing to a 
> full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] dimberman commented on issue #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
dimberman commented on issue #6340: [Airflow-5660] Try to find the task in DB 
before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#issuecomment-542403513
 
 
   LGTM. Rerunning tests.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] potiuk commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
potiuk commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-542356008
 
 
   Hey Jake - are you going to work on it further? I think this is a very cool 
idea and we should make it happen :).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
mik-laj commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335182483
 
 

 ##
 File path: docs/howto/enable-dag-serialization.rst
 ##
 @@ -0,0 +1,109 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+
+
+Enable DAG Serialization
+========================
+
+Add the following settings in ``airflow.cfg``:
+
+.. code-block:: ini
+
+[core]
+store_serialized_dags = True
+min_serialized_dag_update_interval = 30
+
+*   ``store_serialized_dags``: This flag decides whether to serialise DAGs 
and persist them in the DB.
+If set to True, the Webserver reads from the DB instead of parsing the DAG 
files.
+*   ``min_serialized_dag_update_interval``: This flag sets the minimum 
interval (in seconds) after which
+the serialized DAG in the DB should be updated. This helps reduce the 
database write rate.
+
+If you are updating Airflow from a version earlier than 1.10.6, do not forget 
to run ``airflow upgradedb``.
+
+
+How it works
+------------
+
+In order to make the Airflow Webserver stateless (almost!), Airflow >=1.10.6 
supports
+DAG Serialization and DB Persistence.
+
+.. image:: ../img/dag_serialization.png
+
+As shown in the image above, in vanilla Airflow the Webserver and the 
Scheduler both
+need access to the DAG files. Both the Scheduler and the Webserver parse the 
DAG files.
+
+With **DAG Serialization** we aim to decouple the Webserver from DAG parsing,
+which makes the Webserver very light-weight.
+
+As shown in the image above, when using the DAG Serialization feature,
+the Scheduler parses the DAG files, serializes them in JSON format and saves 
them in the Metadata DB.
+
+The Webserver, instead of having to parse the DAG files again, reads the
+serialized DAGs in JSON, de-serializes them, creates the DagBag and uses it
+to populate the UI.
+
+One of the key features implemented as part of DAG Serialization is that,
+instead of loading the entire DagBag when the Webserver starts, we only load 
each DAG on demand from the
+Serialized DAG table. This helps reduce Webserver startup time and memory 
usage. The reduction is notable
+when you have a large number of DAGs.
+
+Below is a screenshot of the ``serialized_dag`` table in the Metadata DB:
+
+.. image:: ../img/serialized_dag_table.png
+
+Limitations
+-----------
+The Webserver will still need access to the DAG files in the following cases,
+which is why we said "almost" stateless.
+
+*   The **Rendered Template** tab will still have to parse the Python file, as 
it needs
+all the details like the execution date and even the data passed by the 
upstream task using XCom.
+*   **Code View** will read the DAG file & show it using Pygments.
+However, it does not need to parse the Python file, so it is still a small 
operation.
+*   :doc:`Extra Operator Links ` would not work out of
+the box. They need to be defined in an Airflow Plugin.
+
+**Existing Airflow Operators**:
+To make extra operator links work with existing operators like BigQuery, copy 
all
+the classes that are defined in the ``operator_extra_links`` property.
+
+For :class:`~airflow.gcp.operators.bigquery.BigQueryOperator`, you can 
create the following plugin for extra links:
+
+.. code-block:: python
+
+from airflow.plugins_manager import AirflowPlugin
 
 Review comment:
   I'm not a fan of posting Python code in rst files. We use ``literalinclude`` for 
the GCP guides, as this allows us to automatically test code snippets from the 
documentation.
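   For context, the pattern the guide is describing registers link classes 
through a plugin. A minimal sketch of that pattern (not the snippet elided 
above; the GoogleLink class and names here are hypothetical):
   
   ```python
   from airflow.models.baseoperator import BaseOperatorLink
   from airflow.plugins_manager import AirflowPlugin
   
   
   class GoogleLink(BaseOperatorLink):
       # Toy extra link; a real plugin would point at something useful,
       # e.g. the BigQuery console page for a given job.
       name = 'Google'
   
       def get_link(self, operator, dttm):
           return 'https://www.google.com'
   
   
   class ExtraLinkPlugin(AirflowPlugin):
       name = 'extra_link_plugin'
       operator_extra_links = [GoogleLink()]
   ```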


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] adityav commented on issue #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
adityav commented on issue #6340: [Airflow-5660] Try to find the task in DB 
before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#issuecomment-542429056
 
 
   Looks like there is a problem with CI? I have no idea why this build failed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] adityav edited a comment on issue #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
adityav edited a comment on issue #6340: [Airflow-5660] Try to find the task in 
DB before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#issuecomment-542429056
 
 
   Looks like there is a problem with CI? I have no idea why the build failed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] ashb commented on issue #4437: AIRFLOW-3632 Adding replace_microseconds parameter to dag_runs endpoint

2019-10-15 Thread GitBox
ashb commented on issue #4437: AIRFLOW-3632 Adding replace_microseconds 
parameter to dag_runs endpoint
URL: https://github.com/apache/airflow/pull/4437#issuecomment-542429417
 
 
   Someone is welcome to take up this PR, add unit tests and address the 
comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Vishwakarma updated AIRFLOW-5660:

Summary: Scheduler becomes unresponsive when processing large DAGs on 
kubernetes.  (was: Scheduler becomes responsive when processing large DAGs on 
kubernetes.)

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> 
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop 
> can take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory. It does this for every task in 
> progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million tasks from the DB on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the DB directly before regressing to 
> a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow-site] kgabryje commented on a change in pull request #76: Add list components

2019-10-15 Thread GitBox
kgabryje commented on a change in pull request #76: Add list components
URL: https://github.com/apache/airflow-site/pull/76#discussion_r335048342
 
 

 ##
 File path: landing-pages/site/layouts/partials/boxes/commiter.html
 ##
 @@ -0,0 +1,34 @@
+{{/*
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing,
+* software distributed under the License is distributed on an
+* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+* KIND, either express or implied.  See the License for the
+* specific language governing permissions and limitations
+* under the License.
+*/}}
+
+
+
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow-site] kgabryje commented on a change in pull request #76: Add list components

2019-10-15 Thread GitBox
kgabryje commented on a change in pull request #76: Add list components
URL: https://github.com/apache/airflow-site/pull/76#discussion_r335048389
 
 

 ##
 File path: landing-pages/site/assets/scss/_list-boxes.scss
 ##
 @@ -0,0 +1,59 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+@import "colors";
+
+.list-container {
+  display: grid;
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow-site] kgabryje commented on a change in pull request #76: Add list components

2019-10-15 Thread GitBox
kgabryje commented on a change in pull request #76: Add list components
URL: https://github.com/apache/airflow-site/pull/76#discussion_r335048420
 
 

 ##
 File path: landing-pages/site/data/commiters.json
 ##
 @@ -0,0 +1,194 @@
+[
+  {
+"name": "Maxime “Max” Beauchemin",
+"nick": "mistercrunch",
+"image": "/stock-guy.jpg",
+"social_media": [
+  {
+"icon": "icons/github.svg"
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] adityav closed pull request #6338: [Airflow-5660] Try to find the task in DB before regressing to searching every task

2019-10-15 Thread GitBox
adityav closed pull request #6338: [Airflow-5660] Try to find the task in DB 
before regressing to searching every task
URL: https://github.com/apache/airflow/pull/6338
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-4761) Airflow Task Clear function throws error

2019-10-15 Thread Connie Chen (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952339#comment-16952339
 ] 

Connie Chen commented on AIRFLOW-4761:
--

[~brstorrie] were you able to fix this? I am getting this error too when 
upgrading to 1.10.5, but I can't figure out why it's happening to some tasks 
and not others.

> Airflow Task Clear function throws error
> 
>
> Key: AIRFLOW-4761
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4761
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, DagRun
>Affects Versions: 1.10.3
> Environment: CentOS 7, Python 2.7.10
>Reporter: Ben Storrie
>Priority: Major
>
> When using the airflow webserver to clear a task inside a dagrun, an error is 
> thrown on certain types of tasks:
>  
> {code:java}
> Traceback (most recent call last):
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 2311, in wsgi_app
> response = self.full_dispatch_request()
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1834, in full_dispatch_request
> rv = self.handle_user_exception(e)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1737, in handle_user_exception
> reraise(exc_type, exc_value, tb)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1832, in full_dispatch_request
> rv = self.dispatch_request()
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1818, in dispatch_request
> return self.view_functions[rule.endpoint](**req.view_args)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_admin/base.py",
>  line 69, in inner
> return self._run_view(f, *args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_admin/base.py",
>  line 368, in _run_view
> return fn(self, *args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_login/utils.py",
>  line 261, in decorated_view
> return func(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/utils.py",
>  line 275, in wrapper
> return f(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/utils.py",
>  line 322, in wrapper
> return f(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/views.py",
>  line 1202, in clear
> include_upstream=upstream)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 3830, in sub_dag
> dag = copy.deepcopy(self)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 174, in deepcopy
> y = copier(memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 3815, in __deepcopy__
> setattr(result, k, copy.deepcopy(v, memo))
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 163, in deepcopy
> y = copier(x, memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 257, in _deepcopy_dict
> y[deepcopy(key, memo)] = deepcopy(value, memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 174, in deepcopy
> y = copier(memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 2492, in __deepcopy__
> setattr(result, k, copy.copy(v))
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 96, in copy
> return _reconstruct(x, rv, 0)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 329, in _reconstruct
> y = callable(*args)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy_reg.py",
>  line 93, in __newobj__
> return cls.__new__(cls, *args)
> TypeError: instancemethod expected at least 2 arguments, got 0{code}
>  
> I had expected the resolution of AIRFLOW-2060 to fix this on upgrade to 
> 1.10.3:
> {code:java}
> (my-hadoop-airflow) [user@hostname ~]$ pip freeze | grep pendulum
> pendulum==1.4.4{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] enricapq commented on a change in pull request #6327: [AIRFLOW-4675] Make airflow/lineage Pylint compatible

2019-10-15 Thread GitBox
enricapq commented on a change in pull request #6327: [AIRFLOW-4675] Make 
airflow/lineage Pylint compatible
URL: https://github.com/apache/airflow/pull/6327#discussion_r335116081
 
 

 ##
 File path: airflow/lineage/__init__.py
 ##
 @@ -16,10 +16,16 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+#
+"""Define the objects `prepare_lineage` and `apply_lineage`
 
 Review comment:
   thank you Jarek, I'll apply it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] potiuk commented on a change in pull request #6285: [AIRFLOW-XXX] Updates to Breeze documentation from GSOD

2019-10-15 Thread GitBox
potiuk commented on a change in pull request #6285: [AIRFLOW-XXX] Updates to 
Breeze documentation from GSOD
URL: https://github.com/apache/airflow/pull/6285#discussion_r335135865
 
 

 ##
 File path: BREEZE.rst
 ##
 @@ -46,409 +46,586 @@ Here is the short 10 minute video about Airflow Breeze
 Prerequisites
 =
 
-Docker
---
+Docker Community Edition
+
 
-You need latest stable Docker Community Edition installed and on the PATH. It 
should be
-configured to be able to run ``docker`` commands directly and not only via 
root user. Your user
-should be in the ``docker`` group. See `Docker installation guide 
`_
+- **Version**: Install the latest stable Docker Community Edition and add it 
to the PATH.
+- **Permissions**: Configure to run the ``docker`` commands directly and not 
only via root user.
+  Your user should be in the ``docker`` group.
+  See `Docker installation guide `_ for 
details.
+- **Disk space**: On macOS, increase your available disk space before starting 
to work with
+  the environment. At least 128 GB of free disk space is recommended. You can 
also get by with a
+  smaller space but make sure to clean up the Docker disk space periodically.
+  See also `Docker for Mac - Space 
`_ for details
+  on increasing disk space available for Docker on Mac.
+- **Docker problems**: Sometimes it is not obvious that space is an issue when 
you run into
+  a problem with Docker. If you see a weird behaviour, try
+  `cleaning up the images <#cleaning-up-the-images>`_. Also see
+  `pruning `_ instructions from 
Docker.
+
+Docker Compose
+--
 
-When you develop on Mac OS you usually have not enough disk space for Docker 
if you start using it
-seriously. You should increase disk space available before starting to work 
with the environment.
-Usually you have weird problems of docker containers when you run out of Disk 
space. It might not be
-obvious that space is an issue. At least 128 GB of Disk space is recommended. 
You can also get by with smaller space but you should more
-often clean the docker disk space periodically.
+- **Version**: Install the latest stable Docker Compose and add it to the PATH.
+  See `Docker Compose Installation Guide 
`_ for details.
 
-If you get into weird behaviour try `Cleaning up the images 
<#cleaning-up-the-images>`_.
+- **Permissions**: Configure to run the ``docker-compose`` command.
 
-See also `Docker for Mac - Space 
`_ for details of increasing
-disk space available for Docker on Mac.
+Docker Images Used by Breeze
+
 
-Docker compose
---
+For all development tasks, related integration tests and static code checks, 
we use Docker
+images maintained on the Docker Hub in the ``apache/airflow`` repository.
 
-Latest stable Docker Compose installed and on the PATH. It should be
-configured to be able to run ``docker-compose`` command.
-See `Docker compose installation guide 
`_
+There are three images that we are currently managing:
 
-Getopt and gstat
-
+* **Slim CI** image that is used for static code checks (size of ~500MB). Its 
tag follows the pattern
+  of ``-python-ci-slim`` (for example, 
``apache/airflow:master-python3.6-ci-slim``).
+  The image is built using the ``_ Dockerfile.
+* **Full CI image** that is used for testing. It contains a lot more 
test-related installed software
+  (size of ~1GB). Its tag follows the pattern of 
``-python-ci``
+  (for example, ``apache/airflow:master-python3.6-ci``). The image is built 
using the
+  ``_ Dockerfile.
+* **Checklicense image** that is used during license check with the Apache RAT 
tool. It does not
+  require any of the dependencies that the two CI images need so it is built 
using a different Dockerfile
+  ``_ and only contains Java + Apache RAT tool. The 
image is
+  labelled with ``checklicence``, for example: 
``apache/airflow:checklicence``. No versioning is used for
+  the Checklicence image.
 
-* If you are on MacOS
+Before you run tests, enter the environment, or run local static checks, the 
necessary local images should be
+pulled and built from Docker Hub. This happens automatically for the test 
environment, but you need to
+trigger it manually for static checks as described in `Building the images 
<#building-the-images>`_
+and `Pulling the latest images <#pulling-the-latest-images>`_.
+The static checks will fail and tell you what to do if the image is not yet 
built.
 
-  * you need gnu ``getopt`` and ``gstat`` to get Airflow Breeze running.
+Building the image for the first time pulls a pre-built version of the image 
from Docker Hub, which may take some
+time. But for subsequent source code changes, no 

[GitHub] [airflow] codecov-io commented on issue #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
codecov-io commented on issue #6340: [Airflow-5660] Try to find the task in DB 
before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#issuecomment-542393221
 
 
   # [Codecov](https://codecov.io/gh/apache/airflow/pull/6340?src=pr=h1) 
Report
   > Merging 
[#6340](https://codecov.io/gh/apache/airflow/pull/6340?src=pr=desc) into 
[master](https://codecov.io/gh/apache/airflow/commit/1de210b8a96b8fff0ef59b1e740c8a08c14b6394?src=pr=desc)
 will **increase** coverage by `0.37%`.
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/airflow/pull/6340/graphs/tree.svg?width=650=WdLKlKHOAU=150=pr)](https://codecov.io/gh/apache/airflow/pull/6340?src=pr=tree)
   
   ```diff
    @@            Coverage Diff             @@
    ##           master    #6340      +/-   ##
    ==========================================
    + Coverage   79.68%   80.06%    +0.37%   
    ==========================================
      Files         616      616             
      Lines       35798    35803       +5    
    ==========================================
    + Hits        28527    28666     +139    
    + Misses       7271     7137     -134    
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/airflow/pull/6340?src=pr=tree) | 
Coverage Δ | |
   |---|---|---|
   | 
[airflow/executors/kubernetes\_executor.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy9leGVjdXRvcnMva3ViZXJuZXRlc19leGVjdXRvci5weQ==)
 | `58.29% <0%> (-0.7%)` | :arrow_down: |
   | 
[airflow/models/taskinstance.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy9tb2RlbHMvdGFza2luc3RhbmNlLnB5)
 | `93.79% <0%> (+0.5%)` | :arrow_up: |
   | 
[airflow/hooks/dbapi\_hook.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy9ob29rcy9kYmFwaV9ob29rLnB5)
 | `88.13% <0%> (+0.84%)` | :arrow_up: |
   | 
[airflow/models/connection.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy9tb2RlbHMvY29ubmVjdGlvbi5weQ==)
 | `65% <0%> (+1.11%)` | :arrow_up: |
   | 
[airflow/jobs/scheduler\_job.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy9qb2JzL3NjaGVkdWxlcl9qb2IucHk=)
 | `74.55% <0%> (+1.19%)` | :arrow_up: |
   | 
[airflow/hooks/hive\_hooks.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy9ob29rcy9oaXZlX2hvb2tzLnB5)
 | `77.6% <0%> (+1.78%)` | :arrow_up: |
   | 
[airflow/utils/dag\_processing.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy91dGlscy9kYWdfcHJvY2Vzc2luZy5weQ==)
 | `58.9% <0%> (+2.66%)` | :arrow_up: |
   | 
[airflow/executors/\_\_init\_\_.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy9leGVjdXRvcnMvX19pbml0X18ucHk=)
 | `67.34% <0%> (+4.08%)` | :arrow_up: |
   | 
[airflow/utils/sqlalchemy.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy91dGlscy9zcWxhbGNoZW15LnB5)
 | `93.22% <0%> (+15.25%)` | :arrow_up: |
   | 
[airflow/utils/log/colored\_log.py](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree#diff-YWlyZmxvdy91dGlscy9sb2cvY29sb3JlZF9sb2cucHk=)
 | `93.18% <0%> (+20.45%)` | :arrow_up: |
   | ... and [3 
more](https://codecov.io/gh/apache/airflow/pull/6340/diff?src=pr=tree-more) 
| |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/airflow/pull/6340?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/airflow/pull/6340?src=pr=footer). 
Last update 
[1de210b...90d2553](https://codecov.io/gh/apache/airflow/pull/6340?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj commented on issue #3229: [AIRFLOW-2325] Add cloudwatch task handler (IN PROGRESS)

2019-10-15 Thread GitBox
mik-laj commented on issue #3229: [AIRFLOW-2325] Add cloudwatch task handler 
(IN PROGRESS)
URL: https://github.com/apache/airflow/pull/3229#issuecomment-542414169
 
 
   https://airflow.readthedocs.io/en/latest/howto/write-logs.html
   This PR lacks documentation. Can you add it?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5663) PythonVirtualenvOperator doesn't print logs until the operator is finished

2019-10-15 Thread Wei Jiang (Jira)
Wei Jiang created AIRFLOW-5663:
--

 Summary: PythonVirtualenvOperator doesn't print logs until the 
operator is finished
 Key: AIRFLOW-5663
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5663
 Project: Apache Airflow
  Issue Type: Bug
  Components: operators
Affects Versions: 1.10.5
Reporter: Wei Jiang


It seems the {{PythonVirtualenvOperator}} doesn't print out its log until the 
job is done. When we use logging in a {{PythonOperator}}, it does print out 
logging in real time. It would be nice for the {{PythonVirtualenvOperator}} to 
have the same functionality as the {{PythonOperator}}, so we can see the logging 
output as the PythonVirtualenvOperator makes progress.

[https://github.com/apache/airflow/blob/master/airflow/operators/python_operator.py#L332]
{code:python}
def _execute_in_subprocess(self, cmd):
try:
self.log.info("Executing cmd\n%s", cmd)
output = subprocess.check_output(cmd,
 stderr=subprocess.STDOUT,
 close_fds=True)
if output:
self.log.info("Got output\n%s", output)
except subprocess.CalledProcessError as e:
self.log.info("Got error output\n%s", e.output)
raise
{code}
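Because {{subprocess.check_output}} buffers everything until the subprocess 
exits, nothing reaches the log mid-run. A minimal sketch of a streaming 
alternative (an illustration only, not a committed fix):

{code:python}
import subprocess

def _execute_in_subprocess_streaming(self, cmd):
    # Stream output line by line instead of buffering it, so log lines
    # appear while the virtualenv task is still running.
    self.log.info("Executing cmd\n%s", cmd)
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            close_fds=True)
    for line in iter(proc.stdout.readline, b''):
        self.log.info(line.decode(errors='replace').rstrip())
    if proc.wait() != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
{code}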
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] ikurian-coatue closed pull request #6343: Cron scheduler

2019-10-15 Thread GitBox
ikurian-coatue closed pull request #6343: Cron scheduler
URL: https://github.com/apache/airflow/pull/6343
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] adityav opened a new pull request #6338: [Airflow-5660] Try to find the task in DB before regressing to searching every task

2019-10-15 Thread GitBox
adityav opened a new pull request #6338: [Airflow-5660] Try to find the task in 
DB before regressing to searching every task
URL: https://github.com/apache/airflow/pull/6338
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   Most task_ids and dag_ids tend to be safe to use as labels in Kubernetes, 
which means we can search for those tasks directly in the DB rather than 
scanning every task with the same execution date. This fix takes advantage 
of that fact.
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow-site] kgabryje commented on a change in pull request #76: Add list components

2019-10-15 Thread GitBox
kgabryje commented on a change in pull request #76: Add list components
URL: https://github.com/apache/airflow-site/pull/76#discussion_r335048447
 
 

 ##
 File path: landing-pages/site/layouts/examples/list.html
 ##
 @@ -1,10 +1,62 @@
+{{/*
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing,
+* software distributed under the License is distributed on an
+* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+* KIND, either express or implied.  See the License for the
+* specific language governing permissions and limitations
+* under the License.
+*/}}
+
 {{ define "main" }}
 {{ with .Content }}
 {{ . }}
 {{ end }}
-
-{{ partial "buttons/button-filled" (dict "text" "Learn more") }}
-{{ partial "buttons/button-hollow" (dict "text" "Learn more") }}
-{{ partial "buttons/button-with-icon" (dict "text" "Suggest a change 
on this page") }}
+
+
+
+{{ range .Site.Data.meetups }}
+{{ partial "boxes/event" . }}
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952101#comment-16952101
 ] 

Aditya Vishwakarma commented on AIRFLOW-5660:
-

[~dimberman] That is a good plan. 

I am a bit confused about which branch to send the PR to. I forked v1-10-test to 
implement the fix, as this issue was critical for us. Does this one work? 
https://github.com/apache/airflow/pull/6338

I will also send a PR for the master branch.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> 
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop 
> can take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory. It does this for every task in 
> progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million tasks from the DB on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the DB directly before regressing to 
> a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] aaltay commented on issue #6334: [AIRFLOW-5657] Update the upper bound for dill

2019-10-15 Thread GitBox
aaltay commented on issue #6334: [AIRFLOW-5657] Update the upper bound for dill
URL: https://github.com/apache/airflow/pull/6334#issuecomment-542307377
 
 
   Thank you for the review @kaxil. I am not familiar with the process. Who 
could merge this change?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] dimberman commented on a change in pull request #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
dimberman commented on a change in pull request #6340: [Airflow-5660] Try to 
find the task in DB before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#discussion_r335072357
 
 

 ##
 File path: airflow/executors/kubernetes_executor.py
 ##
 @@ -545,6 +545,26 @@ def _labels_to_key(self, labels):
 return None
 
 with create_session() as session:
+# check if we can find the task directly before scanning
+# every task with the execution_date
+task = (
+session
+.query(TaskInstance)
+.filter_by(task_id=task_id, dag_id=dag_id, 
execution_date=ex_time)
+.first()
+)
+if task:
+self.log.info(
+'Found matching task %s-%s (%s) with current state of %s',
+task.dag_id, task.task_id, task.execution_date, task.state
+)
+return (dag_id, task_id, ex_time, try_num)
+else:
+self.log.warning(
+"Unable to find Task in db directly for %s-%s (%s). Scan 
all tasks",
 
 Review comment:
   can you emphasize that this can cause significant slowdowns?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] dimberman commented on a change in pull request #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
dimberman commented on a change in pull request #6340: [Airflow-5660] Try to 
find the task in DB before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#discussion_r335073047
 
 

 ##
 File path: airflow/executors/kubernetes_executor.py
 ##
 @@ -545,6 +545,26 @@ def _labels_to_key(self, labels):
 return None
 
 with create_session() as session:
+# check if we can find the task directly before scanning
+# every task with the execution_date
+task = (
+session
+.query(TaskInstance)
+.filter_by(task_id=task_id, dag_id=dag_id, 
execution_date=ex_time)
+.first()
+)
+if task:
+self.log.info(
+'Found matching task %s-%s (%s) with current state of %s',
+task.dag_id, task.task_id, task.execution_date, task.state
+)
+return (dag_id, task_id, ex_time, try_num)
+else:
+self.log.warning(
+"Unable to find Task in db directly for %s-%s (%s). Scan 
all tasks",
 
 Review comment:
   And add a link to k8s labeling guidelines?  
https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set
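   Taken together, the two suggestions would make the fallback warning read 
roughly like this (a sketch only):
   
   ```python
   self.log.warning(
       'Unable to find TaskInstance directly for %s-%s (%s); falling back to '
       'scanning every task with this execution_date, which can cause '
       'significant slowdowns. See https://kubernetes.io/docs/concepts/'
       'overview/working-with-objects/labels/#syntax-and-character-set',
       dag_id, task_id, ex_time
   )
   ```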


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] SaturnFromTitan commented on issue #6315: [AIRFLOW-5640] fix get_email_address_list types

2019-10-15 Thread GitBox
SaturnFromTitan commented on issue #6315: [AIRFLOW-5640] fix 
get_email_address_list types
URL: https://github.com/apache/airflow/pull/6315#issuecomment-542291917
 
 
   🤦‍♂️ Yes, of course. Ready for another round @ashb - now with succeeding 
tests.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5657) Upgrade dill dependency

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952152#comment-16952152
 ] 

ASF GitHub Bot commented on AIRFLOW-5657:
-

kaxil commented on pull request #6334: [AIRFLOW-5657] Update the upper bound 
for dill
URL: https://github.com/apache/airflow/pull/6334
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade dill dependency
> ---
>
> Key: AIRFLOW-5657
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5657
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: dependencies
>Affects Versions: 1.10.6
>Reporter: Ahmet Altay
>Priority: Major
>
> Airflow depends on "dill>=0.2.2, <0.3'," 
> (https://github.com/apache/airflow/blob/1a4c16432b8718189269e63c3afcd2709eed7379/setup.py#L359).
> However, there is a new 0.3.x range of dill (https://pypi.org/project/dill/).
> I would like to upgrade the version to the same range that matches Apache 
> Beam's version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (AIRFLOW-5657) Upgrade dill dependency

2019-10-15 Thread Kaxil Naik (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaxil Naik resolved AIRFLOW-5657.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

> Upgrade dill dependency
> ---
>
> Key: AIRFLOW-5657
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5657
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: dependencies
>Affects Versions: 1.10.6
>Reporter: Ahmet Altay
>Priority: Major
> Fix For: 2.0.0
>
>
> Airflow depends on "dill>=0.2.2, <0.3'," 
> (https://github.com/apache/airflow/blob/1a4c16432b8718189269e63c3afcd2709eed7379/setup.py#L359).
> However, there is a new 0.3.x range of dill (https://pypi.org/project/dill/).
> I would like to upgrade the version to the same range that matches Apache 
> Beam's version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] kaxil merged pull request #6334: [AIRFLOW-5657] Update the upper bound for dill

2019-10-15 Thread GitBox
kaxil merged pull request #6334: [AIRFLOW-5657] Update the upper bound for dill
URL: https://github.com/apache/airflow/pull/6334
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] Fortschritt69 opened a new pull request #6341: Updated airflow

2019-10-15 Thread GitBox
Fortschritt69 opened a new pull request #6341: Updated airflow
URL: https://github.com/apache/airflow/pull/6341
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] ToxaZ opened a new pull request #6336: [AIRFLOW-5073] Change SQLSensor to optionally treat NULL as keep poking

2019-10-15 Thread GitBox
ToxaZ opened a new pull request #6336: [AIRFLOW-5073] Change SQLSensor to 
optionally treat NULL as keep poking
URL: https://github.com/apache/airflow/pull/6336
 
 
   Fixing a bug that treats `"NULL"` as `NULL` in the SQL sensor.
   
   ### Jira
   
   - [x] My PR addresses the following:
 - https://issues.apache.org/jira/browse/AIRFLOW-5073
   
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI 
changes:
   - fixes bugs from previous PR - 
https://github.com/apache/airflow/pull/5913#discussion_r319019205
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] leonardoam opened a new pull request #6337: [AIRFLOW-5659] - Add support for ephemeral storage on KubernetesPodOp…

2019-10-15 Thread GitBox
leonardoam opened a new pull request #6337: [AIRFLOW-5659] - Add support for 
ephemeral storage on KubernetesPodOp…
URL: https://github.com/apache/airflow/pull/6337
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW-5659) issues and references 
them in the PR title.
   
   ### Description
   
   - [X] Add support for specifying requests and limits on ephemeral-storage 
for KubernetesPodOperator.
   
   ### Tests
   
   - [X] My PR adds check on unit tests: test_gen_pod and 
test_gen_pod_extract_xcom
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] adityav opened a new pull request #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
adityav opened a new pull request #6340: [Airflow-5660] Try to find the task in 
DB before regressing to search…
URL: https://github.com/apache/airflow/pull/6340
 
 
   …ing every task
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5659) Add support for ephemeral storage on KubernetesPodOperator

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951971#comment-16951971
 ] 

ASF GitHub Bot commented on AIRFLOW-5659:
-

leonardoam commented on pull request #6337: [AIRFLOW-5659] - Add support for 
ephemeral storage on KubernetesPodOp…
URL: https://github.com/apache/airflow/pull/6337
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW-5659) issues and references 
them in the PR title.
   
   ### Description
   
   - [X] Add support for specifying requests and limits on ephemeral-storage 
for KubernetesPodOperator.
   
   ### Tests
   
   - [X] My PR adds check on unit tests: test_gen_pod and 
test_gen_pod_extract_xcom
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add support for ephemeral storage on KubernetesPodOperator
> --
>
> Key: AIRFLOW-5659
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5659
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: operators
>Affects Versions: 2.0.0
>Reporter: Leonardo Miguel
>Assignee: Leonardo Miguel
>Priority: Minor
>
> KubernetesPodOperator currently doesn't support requests and limits for 
> resource 'ephemeral-storage'.
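A minimal sketch of what the feature would look like from a DAG (the 
request_/limit_ key names are assumptions extrapolated from the operator's 
existing resource keys, not the merged API):

{code:python}
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

task = KubernetesPodOperator(
    task_id='heavy_io',
    name='heavy-io',
    namespace='default',
    image='python:3.7',
    cmds=['python', '-c', 'print("hello")'],
    # Hypothetical keys mirroring request_memory/limit_memory:
    resources={
        'request_ephemeral_storage': '1Gi',
        'limit_ephemeral_storage': '2Gi',
    },
)
{code}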



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] ToxaZ commented on a change in pull request #5913: [AIRFLOW-5073] Change SQLSensor to treat NULL failure

2019-10-15 Thread GitBox
ToxaZ commented on a change in pull request #5913: [AIRFLOW-5073] Change 
SQLSensor to treat NULL failure
URL: https://github.com/apache/airflow/pull/5913#discussion_r334991386
 
 

 ##
 File path: airflow/sensors/sql_sensor.py
 ##
 @@ -102,6 +98,4 @@ def poke(self, context):
 return self.success(first_cell)
 else:
 raise AirflowException("self.success is present, but not 
callable -> {}".format(self.success))
-if self.allow_null:
-return str(first_cell) not in ('0', '')
 return str(first_cell) not in ('0', '', 'None')
 
 Review comment:
   fixed here https://github.com/apache/airflow/pull/6336


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5073) Change SQLSensor behavior returning 'success' for NULL result

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951970#comment-16951970
 ] 

ASF GitHub Bot commented on AIRFLOW-5073:
-

ToxaZ commented on pull request #6336: [AIRFLOW-5073] Change SQLSensor to 
optionally treat NULL as keep poking
URL: https://github.com/apache/airflow/pull/6336
 
 
   Fixes a bug that treated the string `"NULL"` as a real `NULL` in the SQL sensor.
   
   ### Jira
   
   - [x] My PR addresses the following:
 - https://issues.apache.org/jira/browse/AIRFLOW-5073
   
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI 
changes:
   - Fixes bugs from the previous PR: 
https://github.com/apache/airflow/pull/5913#discussion_r319019205
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards-incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Change SQLSensor behavior returning 'success' for NULL result 
> --
>
> Key: AIRFLOW-5073
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5073
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.10.3
>Reporter: Anton Zayniev
>Priority: Major
>
> Change SQLSensor to treat NULL as a failure.
> Add a boolean _allow_null_ parameter to support the legacy behavior of 
> treating NULL as success.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] feluelle commented on a change in pull request #6315: [AIRFLOW-5640] fix get_email_address_list types

2019-10-15 Thread GitBox
feluelle commented on a change in pull request #6315: [AIRFLOW-5640] fix 
get_email_address_list types
URL: https://github.com/apache/airflow/pull/6315#discussion_r335028135
 
 

 ##
 File path: airflow/utils/email.py
 ##
 @@ -120,13 +122,22 @@ def send_MIME_email(e_from, e_to, mime_msg, 
dryrun=False):
 s.quit()
 
 
-def get_email_address_list(address_string):
-if isinstance(address_string, str):
-if ',' in address_string:
-address_string = [address.strip() for address in 
address_string.split(',')]
-elif ';' in address_string:
-address_string = [address.strip() for address in 
address_string.split(';')]
-else:
-address_string = [address_string]
+def get_email_address_list(addresses: Union[str, Iterable[str]]) -> List[str]:
+if isinstance(addresses, str):
+return _get_email_list_from_str(addresses)
 
-return address_string
+elif isinstance(addresses, Iterable):
+if not all(isinstance(item, str) for item in addresses):
+raise TypeError("The items in your iterable must be strings.")
+return list(addresses)
+
+received_type = type(addresses).__name__
+raise TypeError("Unexpected argument type: Received 
'{}'.".format(received_type))
+
+
+def _get_email_list_from_str(addresses: str) -> List[str]:
+delimiters = [",", ";"]
+for delimiter in delimiters:
+if delimiter in addresses:
+return [address.strip() for address in addresses.split(delimiter)]
+return [addresses]
 
 Review comment:
   Agree.
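
   A quick usage sketch of the behavior the new signature implies (assuming the 
function is importable from airflow.utils.email, as in the diff):
   ```python
   from airflow.utils.email import get_email_address_list

   assert get_email_address_list('a@x.com, b@x.com') == ['a@x.com', 'b@x.com']
   assert get_email_address_list('a@x.com; b@x.com') == ['a@x.com', 'b@x.com']
   assert get_email_address_list(['a@x.com', 'b@x.com']) == ['a@x.com', 'b@x.com']

   # Non-string items and unsupported types now raise TypeError instead of
   # passing through silently:
   #   get_email_address_list(['a@x.com', 42])  -> TypeError
   #   get_email_address_list(42)               -> TypeError
   ```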


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Daniel Imberman (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952097#comment-16952097
 ] 

Daniel Imberman commented on AIRFLOW-5660:
--

Also to note: This will be the type of bug we catch before it hits in the wild 
once our scale testing is all set up.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> 
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can 
> take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and does so for every task in 
> progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million tasks from the DB on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the DB directly before falling back to 
> a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] codecov-io edited a comment on issue #6057: [AIRFLOW-5442] implementing get_pandas_df method for druid broker hook

2019-10-15 Thread GitBox
codecov-io edited a comment on issue #6057: [AIRFLOW-5442] implementing 
get_pandas_df method for druid broker hook
URL: https://github.com/apache/airflow/pull/6057#issuecomment-529290281
 
 
   # [Codecov](https://codecov.io/gh/apache/airflow/pull/6057?src=pr&el=h1) Report
   > Merging [#6057](https://codecov.io/gh/apache/airflow/pull/6057?src=pr&el=desc) into 
[master](https://codecov.io/gh/apache/airflow/commit/76fe45e1d127b657b1aad5c0fd657e011f5a09bc?src=pr&el=desc) 
will **decrease** coverage by `0.02%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/airflow/pull/6057/graphs/tree.svg?width=650&token=WdLKlKHOAU&height=150&src=pr)](https://codecov.io/gh/apache/airflow/pull/6057?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #6057      +/-   ##
   ==========================================
   - Coverage   80.05%   80.03%   -0.03%     
   ==========================================
     Files         610      594      -16     
     Lines       35264    34745     -519     
   ==========================================
   - Hits        28232    27807     -425     
   + Misses       7032     6938      -94
   ```
   
   | [Impacted Files](https://codecov.io/gh/apache/airflow/pull/6057?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [airflow/hooks/druid\_hook.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9ob29rcy9kcnVpZF9ob29rLnB5) | `88.88% <ø> (+1.05%)` | :arrow_up: |
   | [...rflow/contrib/operators/bigquery\_check\_operator.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL29wZXJhdG9ycy9iaWdxdWVyeV9jaGVja19vcGVyYXRvci5weQ==) | `0% <0%> (-100%)` | :arrow_down: |
   | [airflow/contrib/sensors/bigquery\_sensor.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL3NlbnNvcnMvYmlncXVlcnlfc2Vuc29yLnB5) | `0% <0%> (-100%)` | :arrow_down: |
   | [airflow/contrib/hooks/bigquery\_hook.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL2hvb2tzL2JpZ3F1ZXJ5X2hvb2sucHk=) | `70.74% <0%> (-29.26%)` | :arrow_down: |
   | [airflow/contrib/operators/gcs\_download\_operator.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL29wZXJhdG9ycy9nY3NfZG93bmxvYWRfb3BlcmF0b3IucHk=) | `72.5% <0%> (-27.5%)` | :arrow_down: |
   | [...ow/contrib/operators/bigquery\_to\_mysql\_operator.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL29wZXJhdG9ycy9iaWdxdWVyeV90b19teXNxbF9vcGVyYXRvci5weQ==) | `73.46% <0%> (-26.54%)` | :arrow_down: |
   | [airflow/contrib/hooks/gcs\_hook.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL2hvb2tzL2djc19ob29rLnB5) | `83.6% <0%> (-16.41%)` | :arrow_down: |
   | [airflow/contrib/operators/bigquery\_get\_data.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL29wZXJhdG9ycy9iaWdxdWVyeV9nZXRfZGF0YS5weQ==) | `83.78% <0%> (-16.22%)` | :arrow_down: |
   | [airflow/contrib/sensors/python\_sensor.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy9jb250cmliL3NlbnNvcnMvcHl0aG9uX3NlbnNvci5weQ==) | `85% <0%> (-15%)` | :arrow_down: |
   | [airflow/utils/sqlalchemy.py](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree#diff-YWlyZmxvdy91dGlscy9zcWxhbGNoZW15LnB5) | `79.06% <0%> (-14.16%)` | :arrow_down: |
   | ... and [294 more](https://codecov.io/gh/apache/airflow/pull/6057/diff?src=pr&el=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/airflow/pull/6057?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/airflow/pull/6057?src=pr&el=footer). 
Last update 
[76fe45e...9a746d2](https://codecov.io/gh/apache/airflow/pull/6057?src=pr&el=lastupdated). 
Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)
Aditya Vishwakarma created AIRFLOW-5660:
---

 Summary: Scheduler becomes unresponsive when processing large DAGs 
on kubernetes.
 Key: AIRFLOW-5660
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
 Project: Apache Airflow
  Issue Type: Bug
  Components: executor-kubernetes
Affects Versions: 1.10.5
Reporter: Aditya Vishwakarma
Assignee: Daniel Imberman


For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can 
take more than 5-10 minutes. 

It seems that the `_labels_to_key` function in kubernetes_executor loads all tasks 
with a given execution date into memory, and does so for every task in progress. 
So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a 
million tasks from the DB on every tick of the scheduler.

[https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]

A quick fix is to search for the task in the DB directly before falling back to a 
full scan. I can submit a PR for it.

A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
dag_id, task_id, execution_date) somewhere, probably in the metadatabase.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Daniel Imberman (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952090#comment-16952090
 ] 

Daniel Imberman commented on AIRFLOW-5660:
--

Hi [~adivish], thank you for catching this. Yeah, this just looks like an 
implementation bug. If I'm reading this correctly, the issue is that there's no 
guarantee that the task id label is the same as the task id in the DB. I think 
we can solve this for the most part with the following steps:

1. Do exactly what you said and first do a task_id/dag_id search of the DB to 
significantly reduce search time.
2. In the `_make_safe_label_value` function, add a warning if a task_id or 
dag_id will require hashing (which will slow down processing).
3. If there is no database match, do a full scan for that day.

[~ash] does that sound good?

[~adivish] If you can make a PR I'll gladly review :)
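
For what it's worth, a rough sketch of step 1 (a direct DB lookup before any 
full scan), assuming a SQLAlchemy session and the TaskInstance model; names are 
illustrative, not the eventual PR:
{code:python}
from airflow.models import TaskInstance
from airflow.utils.db import provide_session


@provide_session
def find_task_instance(dag_id, task_id, execution_date, session=None):
    """Point lookup by the raw ids. Returns None when the pod labels were
    hashed and there is no direct match, in which case the existing
    full-scan-per-execution-date path (step 3) would take over."""
    return session.query(TaskInstance).filter(
        TaskInstance.dag_id == dag_id,
        TaskInstance.task_id == task_id,
        TaskInstance.execution_date == execution_date,
    ).one_or_none()
{code}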

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> 
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can 
> take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and does so for every task in 
> progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million tasks from the DB on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the DB directly before falling back to 
> a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Daniel Imberman (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952107#comment-16952107
 ] 

Daniel Imberman commented on AIRFLOW-5660:
--

[~adivish] fork from master and then we can cherry-pick it to the v1-10 branch. 
The v1-10 branches are only for releases.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> 
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can 
> take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and does so for every task in 
> progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million tasks from the DB on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the DB directly before falling back to 
> a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] nuclearpinguin opened a new pull request #6339: [AIRFLOW-5661] Fix create_cluster method of GKEClusterHook

2019-10-15 Thread GitBox
nuclearpinguin opened a new pull request #6339: [AIRFLOW-5661] Fix 
create_cluster method of GKEClusterHook
URL: https://github.com/apache/airflow/pull/6339
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-5661
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   This PR fixes the `create_cluster` method of `GKEClusterHook`. The method uses 
`get_cluster`, which already returns a string, so calling `self_link` on the 
result raises an error. 
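   A sketch of the shape of the fix (paraphrased; argument names illustrative, 
not the exact diff):
   ```python
   # In the AlreadyExists branch, get_cluster() already returns the cluster's
   # self link as a str, so the extra attribute access has to go.

   # Before -- raises AttributeError: 'str' object has no attribute 'self_link':
   #     return self.get_cluster(name=cluster_name).self_link

   # After -- return the string directly:
   #     return self.get_cluster(name=cluster_name)
   ```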
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards-incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5661) Fix create_cluster method of GKEClusterHook

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952076#comment-16952076
 ] 

ASF GitHub Bot commented on AIRFLOW-5661:
-

nuclearpinguin commented on pull request #6339: [AIRFLOW-5661] Fix 
create_cluster method of GKEClusterHook
URL: https://github.com/apache/airflow/pull/6339
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-5661
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   This PR fixes the `create_cluster` method of `GKEClusterHook`. The method uses 
`get_cluster`, which already returns a string, so calling `self_link` on the 
result raises an error. 
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards-incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix create_cluster method of GKEClusterHook
> ---
>
> Key: AIRFLOW-5661
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5661
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: gcp
>Affects Versions: 2.0.0
>Reporter: Tomasz Urbaszek
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (AIRFLOW-5661) Fix create_cluster method of GKEClusterHook

2019-10-15 Thread Tomasz Urbaszek (Jira)
Tomasz Urbaszek created AIRFLOW-5661:


 Summary: Fix create_cluster method of GKEClusterHook
 Key: AIRFLOW-5661
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5661
 Project: Apache Airflow
  Issue Type: Improvement
  Components: gcp
Affects Versions: 2.0.0
Reporter: Tomasz Urbaszek






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952109#comment-16952109
 ] 

Aditya Vishwakarma commented on AIRFLOW-5660:
-

PR for master: https://github.com/apache/airflow/pull/6340

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> 
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can 
> take more than 5-10 minutes. 
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and does so for every task in 
> progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million tasks from the DB on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the DB directly before falling back to 
> a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5657) Upgrade dill dependency

2019-10-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952153#comment-16952153
 ] 

ASF subversion and git services commented on AIRFLOW-5657:
--

Commit 76fe5e2cc059f738132550c376e97344d63d6a52 in airflow's branch 
refs/heads/master from Ahmet Altay
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=76fe5e2 ]

[AIRFLOW-5657] Update the upper bound for dill (#6334)



> Upgrade dill dependency
> ---
>
> Key: AIRFLOW-5657
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5657
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: dependencies
>Affects Versions: 1.10.6
>Reporter: Ahmet Altay
>Priority: Major
> Fix For: 2.0.0
>
>
> Airflow depends on "dill>=0.2.2, <0.3'," 
> (https://github.com/apache/airflow/blob/1a4c16432b8718189269e63c3afcd2709eed7379/setup.py#L359).
> However there is a new 0.3.x range for dill (https://pypi.org/project/dill/)
> I would like to upgrade the version to the same range that matches Apache 
> Beam's version.
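
The change itself is presumably a one-line bump of the dependency's upper bound 
in setup.py, along these lines (the '<0.4' bound is an assumption; see commit 
76fe5e2 for the merged value):
{code:python}
# setup.py (sketch) -- relax the dill pin so the new 0.3.x range is allowed.
install_requires = [
    # ...
    'dill>=0.2.2, <0.4',  # was: 'dill>=0.2.2, <0.3'
    # ...
]
{code}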



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5636) Allow adding or overriding existing Extra Operator Links for existing Operators

2019-10-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952176#comment-16952176
 ] 

ASF subversion and git services commented on AIRFLOW-5636:
--

Commit f5a7beee365bfcf4d2661dfa28cbba2eb2aaf296 in airflow's branch 
refs/heads/v1-10-test from Kaxil Naik
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=f5a7bee ]

[AIRFLOW-5636] Allow adding or overriding existing Operator Links (#6302)

(cherry-picked from 1de210b8a96b8fff0ef59b1e740c8a08c14b6394)


> Allow adding or overriding existing Extra Operator Links for existing 
> Operators
> ---
>
> Key: AIRFLOW-5636
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5636
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, plugins, ui
>Affects Versions: 2.0.0, 1.10.5
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Major
> Fix For: 1.10.6
>
>
> Currently, we can add an Operator Link globally for all tasks or add operator 
> links to existing tasks.
> We should be able to add Operator Links to a single Operator or override an 
> existing Operator Link on an operator.
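
For illustration, a sketch of the plugin-based usage this enables (link name, 
operator, and URL are illustrative assumptions, not the exact merged interface):
{code:python}
from airflow.models.baseoperator import BaseOperatorLink
from airflow.operators.bash_operator import BashOperator
from airflow.plugins_manager import AirflowPlugin


class EchoDocsLink(BaseOperatorLink):
    """An extra link scoped to a single operator class."""
    name = 'Echo Docs'
    # Attach to (or override a same-named link on) this operator only:
    operators = [BashOperator]

    def get_link(self, operator, dttm):
        return 'https://example.com/docs/{}'.format(operator.task_id)


class EchoLinkPlugin(AirflowPlugin):
    name = 'echo_link_plugin'
    operator_extra_links = [EchoDocsLink()]
{code}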



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-4761) Airflow Task Clear function throws error

2019-10-15 Thread Connie Chen (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-4761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952351#comment-16952351
 ] 

Connie Chen commented on AIRFLOW-4761:
--

[~brstorrie] the weird part is that it works with the CLI, and that some tasks 
within the same DAG work and some don't. I have one task that is used across 
two DAGs; in one it works and in the other it doesn't. 

 

> Airflow Task Clear function throws error
> 
>
> Key: AIRFLOW-4761
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4761
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, DagRun
>Affects Versions: 1.10.3
> Environment: CentOS 7, Python 2.7.10
>Reporter: Ben Storrie
>Priority: Major
>
> When using the airflow webserver to clear a task inside a dagrun, an error is 
> thrown on certain types of tasks:
>  
> {code:java}
> Traceback (most recent call last):
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 2311, in wsgi_app
> response = self.full_dispatch_request()
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1834, in full_dispatch_request
> rv = self.handle_user_exception(e)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1737, in handle_user_exception
> reraise(exc_type, exc_value, tb)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1832, in full_dispatch_request
> rv = self.dispatch_request()
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask/app.py",
>  line 1818, in dispatch_request
> return self.view_functions[rule.endpoint](**req.view_args)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_admin/base.py",
>  line 69, in inner
> return self._run_view(f, *args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_admin/base.py",
>  line 368, in _run_view
> return fn(self, *args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/flask_login/utils.py",
>  line 261, in decorated_view
> return func(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/utils.py",
>  line 275, in wrapper
> return f(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/utils.py",
>  line 322, in wrapper
> return f(*args, **kwargs)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/www/views.py",
>  line 1202, in clear
> include_upstream=upstream)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 3830, in sub_dag
> dag = copy.deepcopy(self)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 174, in deepcopy
> y = copier(memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 3815, in __deepcopy__
> setattr(result, k, copy.deepcopy(v, memo))
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 163, in deepcopy
> y = copier(x, memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 257, in _deepcopy_dict
> y[deepcopy(key, memo)] = deepcopy(value, memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 174, in deepcopy
> y = copier(memo)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/site-packages/airflow/models/__init__.py",
>  line 2492, in __deepcopy__
> setattr(result, k, copy.copy(v))
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 96, in copy
> return _reconstruct(x, rv, 0)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy.py", 
> line 329, in _reconstruct
> y = callable(*args)
> File 
> "/opt/my-miniconda/miniconda/envs/my-hadoop-airflow/lib/python2.7/copy_reg.py",
>  line 93, in __newobj__
> return cls.__new__(cls, *args)
> TypeError: instancemethod expected at least 2 arguments, got 0{code}
>  
> I had expected the resolution of AIRFLOW-2060 to fix this after upgrading to 
> 1.10.3:
> {code:java}
> (my-hadoop-airflow) [user@hostname ~]$ pip freeze | grep pendulum
> pendulum==1.4.4{code}
>  
>  
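
For reference, the bottom of that trace reproduces outside Airflow: on Python 2.7, 
the copy module cannot reconstruct a bound method, which a DAG deepcopy hits when 
a task attribute holds one (e.g. an instance method used as a callback). A minimal 
repro sketch:
{code:python}
import copy


class Foo(object):
    def bar(self):
        pass


copy.copy(Foo().bar)
# Python 2.7: TypeError: instancemethod expected at least 2 arguments, got 0
# Python 3 copies bound methods without error, so the failure is 2.7-specific.
{code}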



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] mik-laj edited a comment on issue #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
mik-laj edited a comment on issue #6210: [AIRFLOW-5567] BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-542448658
 
 
   1. I prepared a fix for you. To add the commit to your branch, please run 
the following command:
   ```
   curl https://pastebin.com/raw/12CfEUn8 | git am
   ```
   2. That sounds very sensible to me.
   3. Cloud Build seems to be a good and simple candidate.
   6. Performance testing is a topic that has not yet been addressed in the 
community. My team would like to take care of it, but it will take a while. I 
think we will not be able to set priorities because they may depend on business 
needs.
   7. This should be in the same PR. The documentation is currently being 
rewritten. @KKcorps Can you help with that?
   8. In my opinion, an AIP is not necessary.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
mik-laj commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-542448658
 
 
   1. I prepared a fix for you. To add the commit to your branch, please run 
the following command:
   ```
   curl https://pastebin.com/raw/12CfEUn8 | git am
   ```
   2. That sounds very sensible to me.
   3. Cloud Build seems to be a good and simple candidate.
   6. Performance testing is a topic that has not yet been addressed in the 
community. My team would like to take care of it, but it will take a while. I 
think we will not be able to set priorities because they may depend on business 
needs.
   7. This should be in the same PR. The documentation is currently being 
rewritten. @KKcorps Can you help with that?
   8. In my opinion, an AIP is not necessary.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] milton0825 merged pull request #6297: [AIRFLOW-5224] Add encoding parameter to GoogleCloudStorageToBigQuery…

2019-10-15 Thread GitBox
milton0825 merged pull request #6297: [AIRFLOW-5224] Add encoding parameter to 
GoogleCloudStorageToBigQuery…
URL: https://github.com/apache/airflow/pull/6297
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] 
BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#discussion_r335262026
 
 

 ##
 File path: airflow/models/base_async_operator.py
 ##
 @@ -0,0 +1,161 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Base Asynchronous Operator for kicking off a long running
+operations and polling for completion with reschedule mode.
+"""
+
+from abc import abstractmethod
+from typing import Dict, List, Optional, Union
+
+from airflow.models import SkipMixin, TaskReschedule
+from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY
+from airflow.sensors.base_sensor_operator import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+
+class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
+"""
+AsyncOperators are derived from this class and inherit these attributes.
+AsyncOperators should be used for long running operations where the task
+can tolerate a longer poke interval. They use the task rescheduling
+mechanism similar to sensors to avoid occupying a worker slot between
+pokes.
+
+Developing concrete operators that provide parameterized flexibility
+for synchronous or asynchronous poking depending on the invocation is
+possible by programing against this `BaseAsyncOperator` interface,
+and overriding the execute method as demonstrated below.
+
+```python3
+class DummyFlexiblePokingOperator(BaseAsyncOperator):
+  def __init__(self, async=False, *args, **kwargs):
+self.async = async
+super().__init(*args, **kwargs)
+
+  def execute(self, context: Dict) -> None:
+if self.async:
+  # use the BaseAsyncOperator's execute
+  super().execute(context)
+else:
+  self.submit_request(context)
+  while not self.poke():
+time.sleep(self.poke_interval)
+  self.process_results(context)
+
+  def sumbit_request(self, context: Dict) -> Optional[str]:
 
 Review comment:
   nit: typo in function name


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] 
BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#discussion_r335277825
 
 

 ##
 File path: airflow/models/base_async_operator.py
 ##
 @@ -0,0 +1,168 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Base Asynchronous Operator for kicking off a long running
+operations and polling for completion with reschedule mode.
+"""
+
+from abc import abstractmethod
+from typing import Dict, List, Optional, Union
+
+from airflow.models import SkipMixin, TaskReschedule
+from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY
+from airflow.sensors.base_sensor_operator import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+PLACEHOLDER_RESOURCE_ID = 'RESOURCE_ID_NOT_APPLICABLE'
+
+
+class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
+"""
+AsyncOperators are derived from this class and inherit these attributes.
+AsyncOperators should be used for long running operations where the task
+can tolerate a longer poke interval. They use the task rescheduling
+mechanism similar to sensors to avoid occupying a worker slot between
+pokes.
+
+Developing concrete operators that provide parameterized flexibility
+for synchronous or asynchronous poking depending on the invocation is
+possible by programing against this `BaseAsyncOperator` interface,
+and overriding the execute method as demonstrated below.
+
+```python3
+class DummyFlexiblePokingOperator(BaseAsyncOperator):
+  def __init__(self, async=False, *args, **kwargs):
+self.async = async
+super().__init(*args, **kwargs)
+
+  def execute(self, context: Dict) -> None:
+if self.async:
+  # use the BaseAsyncOperator's execute
+  super().execute(context)
+else:
+  self.submit_request(context)
+  while not self.poke():
+time.sleep(self.poke_interval)
+  self.process_results(context)
+
+  def sumbit_request(self, context: Dict) -> Optional[str]:
+return None
+
+  def poke(self, context: Dict) -> bool:
+return bool(random.getrandbits(1))
+```
+
+AsyncOperators must override the following methods:
+:py:meth:`submit_request`: fire a request for a long running operation
+:py:meth:`poke`: a method to check if the long running operation is
+complete it should return True when a success criteria is met.
+
+Optionally, AsyncOperators can override:
+:py:meth: `process_result` to perform any operations after the success
+criteria is met in :py:meth: `poke`
+
+:py:meth: `poke` is executed at a time interval and succeed when a
+criteria is met and fail if and when they time out.
+
+:param soft_fail: Set to true to mark the task as SKIPPED on failure
+:type soft_fail: bool
+:param poke_interval: Time in seconds that the job should wait in
+between each tries
+:type poke_interval: int
+:param timeout: Time, in seconds before the task times out and fails.
+:type timeout: int
+
+"""
+ui_color = '#9933ff'  # type: str
+
+@apply_defaults
+def __init__(self,
+ *args,
+ **kwargs) -> None:
+super().__init__(mode='reschedule', *args, **kwargs)
+
+@abstractmethod
+def submit_request(self, context: Dict) -> Optional[Union[str, List, 
Dict]]:
+"""
+This method should kick off a long running operation.
+This method should return the ID for the long running operation if
+applicable.
+Context is the same dictionary used as when rendering jinja templates.
+
+Refer to get_template_context for more context.
+
+:returns: a resource_id for the long running operation.
+:rtype: Optional[Union[String, List, Dict]]
+"""
+raise NotImplementedError
+
+def process_result(self, context: Dict):
+"""
+This method can optionally be overriden to process the result of a 
long running operation.
+Context is the same dictionary used as when rendering jinja templates.
+
+Refer to get_template_context for more context.

[jira] [Created] (AIRFLOW-5665) Add path_exists method to SFTPHook

2019-10-15 Thread Tobiasz Kedzierski (Jira)
Tobiasz Kedzierski created AIRFLOW-5665:
---

 Summary: Add path_exists method to SFTPHook
 Key: AIRFLOW-5665
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5665
 Project: Apache Airflow
  Issue Type: Improvement
  Components: contrib
Affects Versions: 1.10.5
Reporter: Tobiasz Kedzierski
Assignee: Tobiasz Kedzierski
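
No description was provided; presumably the new method simply delegates to 
pysftp, whose Connection object already exposes exists(). A sketch under that 
assumption (not necessarily the merged implementation):
{code:python}
from airflow.contrib.hooks.sftp_hook import SFTPHook


class SFTPHookWithExists(SFTPHook):
    """Illustrative subclass; the real change would add this to SFTPHook."""

    def path_exists(self, path):
        """Return True if the remote file or directory exists."""
        conn = self.get_conn()  # a pysftp.Connection
        return conn.exists(path)
{code}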






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] 
BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#discussion_r335271755
 
 

 ##
 File path: airflow/models/base_async_operator.py
 ##
 @@ -0,0 +1,96 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+Base Asynchronous Operator for kicking off a long running
+operations and polling for completion with reschedule mode.
+"""
+from functools import wraps
+from airflow.sensors.base_sensor_operator import BaseSensorOperator
+from airflow.exceptions import AirflowException
+from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY
+
+class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
 
 Review comment:
   Ref. the name, I agree that it's important for people to consider this as 
something that takes action and that's implied by the Operator suffix. That's 
why I consider BaseSensorOperator in its current form to have a misleading 
name: it implies that it's both for taking action and sensing. Luckily the 
actual derived sensor implementations are all suffixed with Sensor - e.g. 
EmrStepSensor - which doesn't mislead.
   
   I feel like having "Async" in your new class highlights the wrong thing. 
It's not that it is async in the sense of non-synchronous or rescheduled 
(Airflow terminology) that makes this class unique, since all sensors derived 
from BaseSensorOperator *can* operate as async/rescheduled: it implies not 
occupying a task slot. What differentiates your class is that it combines 
action and sensing. That's why I tend to think that the already used 
BaseSensorOperator would actually be a great name to describe what your new 
class does; so why not enhance the existing class to optionally do the action 
part?
   
   There might be technical reasons to have a separate class - e.g. code is 
easier to read or to reduce risk by extending rather than modifying (SOLID) - 
but I can't think of a better alternative to "Async": BaseDoThenWaitOperator 
(hah); BaseActionSensor; eventually back to BaseSensorOperator.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] 
BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#discussion_r335278117
 
 

 ##
 File path: airflow/models/base_async_operator.py
 ##
 @@ -0,0 +1,161 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Base Asynchronous Operator for kicking off a long running
+operations and polling for completion with reschedule mode.
+"""
+
+from abc import abstractmethod
+from typing import Dict, List, Optional, Union
+
+from airflow.models import SkipMixin, TaskReschedule
+from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY
+from airflow.sensors.base_sensor_operator import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+
+class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
+"""
+AsyncOperators are derived from this class and inherit these attributes.
+AsyncOperators should be used for long running operations where the task
+can tolerate a longer poke interval. They use the task rescheduling
+mechanism similar to sensors to avoid occupying a worker slot between
+pokes.
+
+Developing concrete operators that provide parameterized flexibility
+for synchronous or asynchronous poking depending on the invocation is
+possible by programing against this `BaseAsyncOperator` interface,
+and overriding the execute method as demonstrated below.
+
+```python3
+class DummyFlexiblePokingOperator(BaseAsyncOperator):
+  def __init__(self, async=False, *args, **kwargs):
+self.async = async
+super().__init(*args, **kwargs)
+
+  def execute(self, context: Dict) -> None:
+if self.async:
+  # use the BaseAsyncOperator's execute
+  super().execute(context)
+else:
+  self.submit_request(context)
+  while not self.poke():
+time.sleep(self.poke_interval)
+  self.process_results(context)
+
+  def sumbit_request(self, context: Dict) -> Optional[str]:
+return None
+
+  def poke(self, context: Dict) -> bool:
+return bool(random.getrandbits(1))
+```
+
+AsyncOperators must override the following methods:
+:py:meth:`submit_request`: fire a request for a long running operation
+:py:meth:`poke`: a method to check if the long running operation is
+complete it should return True when a success criteria is met.
+
+Optionally, AsyncOperators can override:
+:py:meth: `process_result` to perform any operations after the success
+criteria is met in :py:meth: `poke`
+
+:py:meth: `poke` is executed at a time interval and succeed when a
+criteria is met and fail if and when they time out.
+
+:param soft_fail: Set to true to mark the task as SKIPPED on failure
+:type soft_fail: bool
+:param poke_interval: Time in seconds that the job should wait in
+between each tries
+:type poke_interval: int
+:param timeout: Time, in seconds before the task times out and fails.
+:type timeout: int
+
+"""
+ui_color = '#9933ff'  # type: str
+
+@apply_defaults
+def __init__(self,
+ *args,
+ **kwargs) -> None:
+super().__init__(mode='reschedule', *args, **kwargs)
+
+@abstractmethod
+def submit_request(self, context: Dict) -> Optional[Union[str, List, 
Dict]]:
+"""
+This method should kick off a long running operation.
+This method should return the ID for the long running operation if
+applicable.
+Context is the same dictionary used as when rendering jinja templates.
+
+Refer to get_template_context for more context.
+
+:returns: a resource_id for the long running operation.
+:rtype: Optional[Union[String, List, Dict]]
+"""
+raise NotImplementedError
+
+def process_result(self, context: Dict):
+"""
+This method can optionally be overriden to process the result of a 
long running operation.
+Context is the same dictionary used as when rendering jinja templates.
+
+Refer to get_template_context for more context.
+"""
+self.log.info('Using 

[GitHub] [airflow] JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] 
BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#discussion_r335277825
 
 

 ##
 File path: airflow/models/base_async_operator.py
 ##
 @@ -0,0 +1,168 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Base Asynchronous Operator for kicking off a long running
+operations and polling for completion with reschedule mode.
+"""
+
+from abc import abstractmethod
+from typing import Dict, List, Optional, Union
+
+from airflow.models import SkipMixin, TaskReschedule
+from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY
+from airflow.sensors.base_sensor_operator import BaseSensorOperator
+from airflow.utils.decorators import apply_defaults
+
+PLACEHOLDER_RESOURCE_ID = 'RESOURCE_ID_NOT_APPLICABLE'
+
+
+class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
+"""
+AsyncOperators are derived from this class and inherit these attributes.
+AsyncOperators should be used for long running operations where the task
+can tolerate a longer poke interval. They use the task rescheduling
+mechanism similar to sensors to avoid occupying a worker slot between
+pokes.
+
+Developing concrete operators that provide parameterized flexibility
+for synchronous or asynchronous poking depending on the invocation is
+possible by programing against this `BaseAsyncOperator` interface,
+and overriding the execute method as demonstrated below.
+
+```python3
+class DummyFlexiblePokingOperator(BaseAsyncOperator):
+  def __init__(self, async=False, *args, **kwargs):
+self.async = async
+super().__init(*args, **kwargs)
+
+  def execute(self, context: Dict) -> None:
+if self.async:
+  # use the BaseAsyncOperator's execute
+  super().execute(context)
+else:
+  self.submit_request(context)
+  while not self.poke():
+time.sleep(self.poke_interval)
+  self.process_results(context)
+
+  def sumbit_request(self, context: Dict) -> Optional[str]:
+return None
+
+  def poke(self, context: Dict) -> bool:
+return bool(random.getrandbits(1))
+```
+
+AsyncOperators must override the following methods:
+:py:meth:`submit_request`: fire a request for a long running operation
+:py:meth:`poke`: a method to check if the long running operation is
+complete it should return True when a success criteria is met.
+
+Optionally, AsyncOperators can override:
+:py:meth: `process_result` to perform any operations after the success
+criteria is met in :py:meth: `poke`
+
+:py:meth: `poke` is executed at a time interval and succeed when a
+criteria is met and fail if and when they time out.
+
+:param soft_fail: Set to true to mark the task as SKIPPED on failure
+:type soft_fail: bool
+:param poke_interval: Time in seconds that the job should wait in
+between each tries
+:type poke_interval: int
+:param timeout: Time, in seconds before the task times out and fails.
+:type timeout: int
+
+"""
+ui_color = '#9933ff'  # type: str
+
+@apply_defaults
+def __init__(self,
+ *args,
+ **kwargs) -> None:
+super().__init__(mode='reschedule', *args, **kwargs)
+
+@abstractmethod
+def submit_request(self, context: Dict) -> Optional[Union[str, List, 
Dict]]:
+"""
+This method should kick off a long running operation.
+This method should return the ID for the long running operation if
+applicable.
+Context is the same dictionary used as when rendering jinja templates.
+
+Refer to get_template_context for more context.
+
+:returns: a resource_id for the long running operation.
+:rtype: Optional[Union[String, List, Dict]]
+"""
+raise NotImplementedError
+
+def process_result(self, context: Dict):
+"""
+This method can optionally be overriden to process the result of a 
long running operation.
+Context is the same dictionary used as when rendering jinja templates.
+
+Refer to get_template_context for more context.

[jira] [Commented] (AIRFLOW-5224) gcs_to_bq.GoogleCloudStorageToBigQueryOperator - Specify Encoding for BQ ingestion

2019-10-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952474#comment-16952474
 ] 

ASF subversion and git services commented on AIRFLOW-5224:
--

Commit 8b0c9cbb555bf6d43bfb901b8c5fda5e2da48031 in airflow's branch 
refs/heads/master from TobKed
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=8b0c9cb ]

[AIRFLOW-5224] Add encoding parameter to GoogleCloudStorageToBigQuery… (#6297)



> gcs_to_bq.GoogleCloudStorageToBigQueryOperator - Specify Encoding for BQ 
> ingestion
> --
>
> Key: AIRFLOW-5224
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5224
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, gcp
>Affects Versions: 1.10.0
> Environment: airflow software platform
>Reporter: Anand Kumar
>Priority: Blocker
>
> Hi,
> The current business project we are enabling has been built completely on GCP 
> components with Composer, with Airflow being one of the key processes. We have 
> built various data pipelines using Airflow for multiple work-streams where 
> data is ingested from a GCS bucket to BigQuery.
> Based on recent updates on the Google BQ infra end, there seem to be some 
> tightened validations on UTF-8 characters, which have resulted in multiple 
> failures of our existing business processes.
> On further analysis we found that while ingesting data to BQ from a 
> Google bucket the encoding now needs to be explicitly specified, but 
> the operator below currently doesn't supply any params to specify an explicit 
> encoding:
> _*gcs_to_bq.GoogleCloudStorageToBigQueryOperator*_
> Could someone please treat this as a priority and help us with a fix to 
> bring us back into BAU mode.
>  
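
For illustration, a usage sketch of the new parameter added by #6297 (values 
illustrative; the default encoding is presumably UTF-8):
{code:python}
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

load_latin1_export = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq_latin1',
    bucket='my-bucket',  # illustrative values throughout
    source_objects=['exports/data.csv'],
    destination_project_dataset_table='proj.dataset.table',
    encoding='ISO-8859-1',  # the new parameter: how BQ should decode the files
)
{code}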



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
JonnyIncognito commented on a change in pull request #6210: [AIRFLOW-5567] 
BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#discussion_r335271755
 
 

 ##
 File path: airflow/models/base_async_operator.py
 ##
 @@ -0,0 +1,96 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+Base Asynchronous Operator for kicking off a long running
+operations and polling for completion with reschedule mode.
+"""
+from functools import wraps
+from airflow.sensors.base_sensor_operator import BaseSensorOperator
+from airflow.exceptions import AirflowException
+from airflow.models.xcom import XCOM_EXTERNAL_RESOURCE_ID_KEY
+
+class BaseAsyncOperator(BaseSensorOperator, SkipMixin):
 
 Review comment:
   Ref. the name, I agree that it's important for people to consider this as 
something that takes action and that's implied by the Operator suffix. That's 
why I consider BaseSensorOperator in its current form to have a misleading 
name: it implies that it's both for taking action and sensing. Luckily the 
actual derived sensor implementations are all suffixed with Sensor - e.g. 
EmrStepSensor - which doesn't mislead.
   
   I feel like having "Async" in your new class highlights the wrong thing. 
It's not being async (in the Airflow sense of rescheduled, i.e. not occupying 
a task slot) that makes this class unique, since all sensors derived from 
BaseSensorOperator *can* operate as async/rescheduled. What differentiates 
your class is that it combines action and sensing. That's why I tend to think 
that the already-used BaseSensorOperator would actually be a great name to 
describe what your new class does; so why not enhance the existing class to 
optionally do the action part?
   
   There might be technical reasons to have a separate class - e.g. code is 
easier to read or to reduce risk by extending rather than modifying (SOLID) - 
but I can't think of a better alternative to "Async": BaseDoThenWaitOperator 
(hah); BaseActionSensor; maybe BaseAtomicOperator; eventually back to 
BaseSensorOperator.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] sann3 commented on issue #6259: [AIRFLOW-XXX] Example for BigQuery to BigQuery operator

2019-10-15 Thread GitBox
sann3 commented on issue #6259: [AIRFLOW-XXX] Example for BigQuery to BigQuery 
operator
URL: https://github.com/apache/airflow/pull/6259#issuecomment-542513553
 
 
   Is anything pending from my side?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] milton0825 commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
milton0825 commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335280049
 
 

 ##
 File path: airflow/models/serialized_dag.py
 ##
 @@ -0,0 +1,214 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialzed DAG table in database."""
+
+import hashlib
+from datetime import timedelta
+from typing import TYPE_CHECKING, Any, Dict, List, Optional
+
+# from sqlalchemy import Column, Index, Integer, String, Text, JSON, and_, exc
+from sqlalchemy import JSON, Column, Index, Integer, String, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import ID_LEN, Base
+from airflow.utils import db, timezone
+from airflow.utils.log.logging_mixin import LoggingMixin
+from airflow.utils.sqlalchemy import UtcDateTime
+
+if TYPE_CHECKING:
+from airflow.models import DAG  # noqa: F401; # pylint: 
disable=cyclic-import
+from airflow.serialization import SerializedDAG  # noqa: F401
+
+
+log = LoggingMixin().log
+
+
+class SerializedDagModel(Base):
+"""A table for serialized DAGs.
+
+serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+This feature is controlled by:
+
+* ``[core] store_serialized_dags = True``: enable this feature
+* ``[core] min_serialized_dag_update_interval = 30`` (s):
+  serialized DAGs are updated in DB when a file gets processed by 
scheduler,
+  to reduce DB write rate, there is a minimal interval of updating 
serialized DAGs.
+* ``[scheduler] dag_dir_list_interval = 300`` (s):
+  interval of deleting serialized DAGs in DB when the files are deleted, 
suggest
+  to use a smaller interval such as 60
+
+It is used by webserver to load dagbags when 
``store_serialized_dags=True``.
+Because reading from database is lightweight compared to importing from 
files,
+it solves the webserver scalability issue.
+"""
+__tablename__ = 'serialized_dag'
+
+dag_id = Column(String(ID_LEN), primary_key=True)
+fileloc = Column(String(2000), nullable=False)
+# The max length of fileloc exceeds the limit of indexing.
+fileloc_hash = Column(Integer, nullable=False)
+data = Column(JSON, nullable=False)
+last_updated = Column(UtcDateTime, nullable=False)
+
+__table_args__ = (
+Index('idx_fileloc_hash', fileloc_hash, unique=False),
+)
+
+def __init__(self, dag: 'DAG'):
+from airflow.serialization import SerializedDAG  # noqa # pylint: 
disable=redefined-outer-name
+
+self.dag_id = dag.dag_id
+self.fileloc = dag.full_filepath
+self.fileloc_hash = self.dag_fileloc_hash(self.fileloc)
+self.data = SerializedDAG.to_dict(dag)
+self.last_updated = timezone.utcnow()
+
+@staticmethod
+def dag_fileloc_hash(full_filepath: str) -> int:
+Hashing file location for indexing.
+
+:param full_filepath: full filepath of DAG file
+:return: hashed full_filepath
+"""
+# hashing is needed because the length of fileloc is 2000 as an 
Airflow convention,
+# which is over the limit of indexing. If we can reduce the length of 
fileloc, then
+# hashing is not needed.
+return int.from_bytes(
+hashlib.sha1(full_filepath.encode('utf-8')).digest()[-2:], 
byteorder='big', signed=False)
+
+@classmethod
+@db.provide_session
+def write_dag(cls, dag: 'DAG', min_update_interval: Optional[int] = None, 
session=None):
+"""Serializes a DAG and writes it into database.
+
+:param dag: a DAG to be written into database
+:param min_update_interval: minimal interval in seconds to update 
serialized DAG
+:param session: ORM Session
+"""
+log.debug("Writing DAG: %s to the DB", dag)
 
 Review comment:
   ```suggestion
   log.debug("Writing DAG: %s to the DB", dag.dag_id)
   ```
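   
   As a side note for readers, the `dag_fileloc_hash` scheme quoted above maps 
any path to an unsigned 16-bit value. A standalone sketch of the same idea 
(not code from the PR):
   ```python
   import hashlib

   def fileloc_hash(full_filepath: str) -> int:
       # Last 2 bytes of the SHA-1 digest -> an int in [0, 65535]: small
       # enough to index cheaply; collisions are resolved by comparing
       # the full fileloc column itself.
       return int.from_bytes(
           hashlib.sha1(full_filepath.encode('utf-8')).digest()[-2:],
           byteorder='big', signed=False)

   print(fileloc_hash('/usr/local/airflow/dags/example_dag.py'))  # value in [0, 65535]
   ```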


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For 

[jira] [Created] (AIRFLOW-5666) Create a BaseTransferToS3Operator

2019-10-15 Thread Chao-Han Tsai (Jira)
Chao-Han Tsai created AIRFLOW-5666:
--

 Summary: Create a BaseTransferToS3Operator
 Key: AIRFLOW-5666
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5666
 Project: Apache Airflow
  Issue Type: Improvement
  Components: aws
Affects Versions: 1.10.5
Reporter: Chao-Han Tsai
Assignee: Chao-Han Tsai


Create a BaseTransferToS3Operator so that operators such as 
DynamodbToS3Operator can inherit and share common code logic to upload files to 
S3.
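
A hypothetical shape for such a base class (all names below are illustrative, 
not from this issue; only S3Hook.load_string is an existing Airflow API):

{code:python}
from abc import abstractmethod

from airflow.hooks.S3_hook import S3Hook
from airflow.models import BaseOperator


class BaseTransferToS3Operator(BaseOperator):
    """Holds the shared upload-to-S3 logic; subclasses only read the source."""

    def __init__(self, s3_bucket, s3_key, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key

    @abstractmethod
    def get_source_data(self, context):
        """Read records from the source system (e.g. DynamoDB)."""

    def execute(self, context):
        data = self.get_source_data(context)
        S3Hook().load_string(data, key=self.s3_key, bucket_name=self.s3_bucket)
{code}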



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] ashb commented on a change in pull request #6340: [Airflow-5660] Try to find the task in DB before regressing to search…

2019-10-15 Thread GitBox
ashb commented on a change in pull request #6340: [Airflow-5660] Try to find 
the task in DB before regressing to search…
URL: https://github.com/apache/airflow/pull/6340#discussion_r335210463
 
 

 ##
 File path: airflow/executors/kubernetes_executor.py
 ##
 @@ -545,6 +545,27 @@ def _labels_to_key(self, labels):
 return None
 
 with create_session() as session:
+task = (
+    session
+    .query(TaskInstance)
+    .filter_by(task_id=task_id, dag_id=dag_id, execution_date=ex_time)
+    .first()
 
 Review comment:
   ```suggestion
   .one_or_none()
   ```
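   
   For reference, the practical difference between the two is standard 
SQLAlchemy behaviour (not code from this PR):
   ```python
   # Given a query that may accidentally match several TaskInstance rows:
   ti = query.first()        # silently returns the first row (LIMIT 1), or None
   ti = query.one_or_none()  # returns the row or None, but raises
                             # MultipleResultsFound if more than one row matches
   ```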


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow-site] mik-laj opened a new pull request #78: Add insert-license lint rule

2019-10-15 Thread GitBox
mik-laj opened a new pull request #78: Add insert-license lint rule
URL: https://github.com/apache/airflow-site/pull/78
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow-site] mik-laj opened a new pull request #79: Add .dockerignore

2019-10-15 Thread GitBox
mik-laj opened a new pull request #79: Add .dockerignore
URL: https://github.com/apache/airflow-site/pull/79
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5224) gcs_to_bq.GoogleCloudStorageToBigQueryOperator - Specify Encoding for BQ ingestion

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952473#comment-16952473
 ] 

ASF GitHub Bot commented on AIRFLOW-5224:
-

milton0825 commented on pull request #6297: [AIRFLOW-5224] Add encoding 
parameter to GoogleCloudStorageToBigQuery…
URL: https://github.com/apache/airflow/pull/6297
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> gcs_to_bq.GoogleCloudStorageToBigQueryOperator - Specify Encoding for BQ 
> ingestion
> --
>
> Key: AIRFLOW-5224
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5224
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, gcp
>Affects Versions: 1.10.0
> Environment: airflow software platform
>Reporter: Anand Kumar
>Priority: Blocker
>
> Hi,
> The current business project we are enabling has been built completely on GCP 
> components with Composer, with Airflow being one of the key processes. We have 
> built various data pipelines using Airflow for multiple work-streams where 
> data is ingested from a GCS bucket to BigQuery.
> Based on recent updates on the Google BQ infra end, there seem to be some 
> tightened validations on UTF-8 characters, which have resulted in multiple 
> failures of our existing business processes.
> On further analysis we found that while ingesting data to BQ from a 
> Google bucket the encoding now needs to be explicitly specified, but 
> the below operator currently doesn't supply any params to specify explicit 
> encoding:
> _*gcs_to_bq.GoogleCloudStorageToBigQueryOperator*_
>  Could someone please treat this as a priority and help us with a fix to 
> bring us back to BAU mode.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] milton0825 commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
milton0825 commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335283942
 
 

 ##
 File path: docs/howto/enable-dag-serialization.rst
 ##
 @@ -0,0 +1,109 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+
+
+Enable DAG Serialization
+========================
+
+Add the following settings in ``airflow.cfg``:
+
+.. code-block:: ini
+
+[core]
+store_serialized_dags = True
+min_serialized_dag_update_interval = 30
+
+*   ``store_serialized_dags``: This flag decides whether to serialize DAGs
+    and persist them in the DB. If set to True, the Webserver reads from the
+    DB instead of parsing DAG files.
+*   ``min_serialized_dag_update_interval``: This flag sets the minimum
+    interval (in seconds) after which the serialized DAG in the DB should be
+    updated. This helps in reducing the database write rate.
+
+If you are updating Airflow from <1.10.6, please do not forget to run 
``airflow db upgrade``.
+
+
+How it works
+============
+
+In order to make the Airflow Webserver stateless (almost!), Airflow >=1.10.6
+supports DAG Serialization and DB Persistence.
+
+.. image:: ../img/dag_serialization.png
+
+As shown in the image above, in vanilla Airflow the Webserver and the Scheduler
+both need access to the DAG files, and both of them parse the DAG files.
+
+With **DAG Serialization** we aim to decouple the Webserver from DAG parsing,
+which would make the Webserver very light-weight.
+
+As shown in the image above, when using the dag_serialization feature,
+the Scheduler parses the DAG files, serializes them in JSON format and saves
+them in the Metadata DB.
+
+The Webserver, instead of having to parse the DAG files again, reads the
+serialized DAGs in JSON, de-serializes them, creates the DagBag and uses it
+to show in the UI.
+
+One of the key features implemented as part of DAG Serialization is that,
+instead of loading an entire DagBag when the Webserver starts, we only load
+each DAG on demand from the Serialized DAG table. This helps reduce Webserver
+startup time and memory usage. The reduction is notable when you have a large
+number of DAGs.
+
+Below is the screenshot of the ``serialized_dag`` table in Metadata DB:
+
+.. image:: ../img/serialized_dag_table.png
+
+Limitations
+-----------
+The Webserver will still need access to DAG files in the following cases,
+which is why we said "almost" stateless.
+
+*   The **Rendered Template** tab will still have to parse the Python file as
+    it needs all the details, like the execution date and even the data passed
+    by the upstream task using XCom.
+*   **Code View** will read the DAG file & show it using Pygments.
+    However, it does not need to parse the Python file, so it is still a small
+    operation.
+*   :doc:`Extra Operator Links ` would not work out of
+    the box. They need to be defined in an Airflow Plugin.
+
+**Existing Airflow Operators**:
 
 Review comment:
   I think @kaxil is referring to existing operators with `operator extra links`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5664) postgres_to_gcs operator drops milliseconds from timestamps

2019-10-15 Thread Joseph (Jira)
Joseph created AIRFLOW-5664:
---

 Summary: postgres_to_gcs operator drops milliseconds from 
timestamps
 Key: AIRFLOW-5664
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5664
 Project: Apache Airflow
  Issue Type: Bug
  Components: operators
Affects Versions: 1.10.5
Reporter: Joseph


Postgres stores timestamps with microsecond resolution. When using the 
postgres_to_gcs operator, timestamps are converted to epoch/unix time using the 
datetime.timetuple() method. This method drops the microseconds and so you'll 
end up with a storage object that looks like this:
{code:java}
{"id": 1, "last_modified": 1571038537.0}
{"id": 2, "last_modified": 1571038537.0}
{"id": 3, "last_modified": 1571038537.0}
{code}
When it should look like this:
{code:java}
{"id": 1, "last_modified": 1571038537.123}
{"id": 2, "last_modified": 1571038537.400}
{"id": 3, "last_modified": 1571038537.455}
{code}
It would be useful to keep the timestamps' full resolution.

I believe the same issue may occur with airflow.operators.mysql_to_gcs.
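
A minimal sketch of the truncation, assuming a UTC-aware timestamp (this is 
plain Python, not the operator's code):
{code:python}
import calendar
from datetime import datetime, timezone

dt = datetime(2019, 10, 14, 7, 35, 37, 123000, tzinfo=timezone.utc)

# timetuple() truncates to whole seconds, so the fraction is lost:
calendar.timegm(dt.timetuple())  # 1571038537

# timestamp() keeps the sub-second part:
dt.timestamp()                   # 1571038537.123
{code}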



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] mik-laj edited a comment on issue #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
mik-laj edited a comment on issue #6210: [AIRFLOW-5567] BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-542448658
 
 
   1.  I prepared a fix for you. If you haven't added any commits to your 
branch since, please run the following command:
   ```
   curl https://pastebin.com/raw/12CfEUn8 | git am
   ```
   You should also do a rebase.
   2. That sounds very sensible to me.
   3.  Cloud Build seems to be a good and simple candidate.
   6. Performance testing is a topic that has not yet been addressed in the 
community. My team would like to take care of it, but it will take a while. I 
think we will not be able to set priorities, because they may depend on 
business needs.
   7. This should be in the same PR. The documentation is currently being 
rewritten. @KKcorps Can you help with that?
   8. In my opinion, an AIP is not necessary.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] dstandish commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator

2019-10-15 Thread GitBox
dstandish commented on issue #6210: [AIRFLOW-5567] BaseAsyncOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-542495934
 
 
   I gave this a try ... one curious thing: while the task is in "waiting 
for retry" status, the log is not visible.
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5665) Add path_exists method to SFTPHook

2019-10-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952488#comment-16952488
 ] 

ASF GitHub Bot commented on AIRFLOW-5665:
-

TobKed commented on pull request #6344: [AIRFLOW-5665] Add path_exists method 
to SFTPHook
URL: https://github.com/apache/airflow/pull/6344
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-5665
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add path_exists method to SFTPHook
> --
>
> Key: AIRFLOW-5665
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5665
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: contrib
>Affects Versions: 1.10.5
>Reporter: Tobiasz Kedzierski
>Assignee: Tobiasz Kedzierski
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [airflow] TobKed opened a new pull request #6344: [AIRFLOW-5665] Add path_exists method to SFTPHook

2019-10-15 Thread GitBox
TobKed opened a new pull request #6344: [AIRFLOW-5665] Add path_exists method 
to SFTPHook
URL: https://github.com/apache/airflow/pull/6344
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-5665
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] milton0825 commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-10-15 Thread GitBox
milton0825 commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r335280952
 
 

 ##
 File path: airflow/models/serialized_dag.py
 ##
 @@ -0,0 +1,214 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialzed DAG table in database."""
+
+import hashlib
+from datetime import timedelta
+from typing import TYPE_CHECKING, Any, Dict, List, Optional
+
+# from sqlalchemy import Column, Index, Integer, String, Text, JSON, and_, exc
+from sqlalchemy import JSON, Column, Index, Integer, String, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import ID_LEN, Base
+from airflow.utils import db, timezone
+from airflow.utils.log.logging_mixin import LoggingMixin
+from airflow.utils.sqlalchemy import UtcDateTime
+
+if TYPE_CHECKING:
+from airflow.models import DAG  # noqa: F401; # pylint: 
disable=cyclic-import
+from airflow.serialization import SerializedDAG  # noqa: F401
+
+
+log = LoggingMixin().log
+
+
+class SerializedDagModel(Base):
+"""A table for serialized DAGs.
+
+serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+This feature is controlled by:
+
+* ``[core] store_serialized_dags = True``: enable this feature
+* ``[core] min_serialized_dag_update_interval = 30`` (s):
+  serialized DAGs are updated in DB when a file gets processed by 
scheduler,
+  to reduce DB write rate, there is a minimal interval of updating 
serialized DAGs.
+* ``[scheduler] dag_dir_list_interval = 300`` (s):
+  interval of deleting serialized DAGs in DB when the files are deleted, 
suggest
+  to use a smaller interval such as 60
+
+It is used by webserver to load dagbags when 
``store_serialized_dags=True``.
+Because reading from database is lightweight compared to importing from 
files,
+it solves the webserver scalability issue.
+"""
+__tablename__ = 'serialized_dag'
+
+dag_id = Column(String(ID_LEN), primary_key=True)
+fileloc = Column(String(2000), nullable=False)
+# The max length of fileloc exceeds the limit of indexing.
+fileloc_hash = Column(Integer, nullable=False)
+data = Column(JSON, nullable=False)
+last_updated = Column(UtcDateTime, nullable=False)
+
+__table_args__ = (
+Index('idx_fileloc_hash', fileloc_hash, unique=False),
+)
+
+def __init__(self, dag: 'DAG'):
+from airflow.serialization import SerializedDAG  # noqa # pylint: 
disable=redefined-outer-name
+
+self.dag_id = dag.dag_id
+self.fileloc = dag.full_filepath
+self.fileloc_hash = self.dag_fileloc_hash(self.fileloc)
+self.data = SerializedDAG.to_dict(dag)
+self.last_updated = timezone.utcnow()
+
+@staticmethod
+def dag_fileloc_hash(full_filepath: str) -> int:
+Hashing file location for indexing.
+
+:param full_filepath: full filepath of DAG file
+:return: hashed full_filepath
+"""
+# hashing is needed because the length of fileloc is 2000 as an 
Airflow convention,
+# which is over the limit of indexing. If we can reduce the length of 
fileloc, then
+# hashing is not needed.
+return int.from_bytes(
+hashlib.sha1(full_filepath.encode('utf-8')).digest()[-2:], 
byteorder='big', signed=False)
+
+@classmethod
+@db.provide_session
+def write_dag(cls, dag: 'DAG', min_update_interval: Optional[int] = None, 
session=None):
+"""Serializes a DAG and writes it into database.
+
+:param dag: a DAG to be written into database
+:param min_update_interval: minimal interval in seconds to update 
serialized DAG
+:param session: ORM Session
+"""
+log.debug("Writing DAG: %s to the DB", dag)
+# Checks if (Current Time - Time when the DAG was written to DB) < 
min_update_interval
+# If Yes, does nothing
+# If No or the DAG does not exists, updates / writes Serialized DAG to 
DB
+if min_update_interval is not None:
+if session.query(exists().where(
+and_(cls.dag_id == 
