date:20191121

[GitHub] [airflow] mik-laj edited a comment on issue #6622: [AIRLFOW-6024] Do not use the logger in CLI

2019-11-21 Thread GitBox

mik-laj edited a comment on issue #6622: [AIRLFOW-6024] Do not use the logger 
in CLI
URL: https://github.com/apache/airflow/pull/6622#issuecomment-556976441
 
 
   There is no need to use the logger because all messages are intended to be 
displayed on the console. This should not be saved to a file or other external 
system e.g. Stackdriver. If you need to write to a file, you can redirect the 
stream.
   
   We want the messages also to be displayed also when the following command is 
used.
   ```
   AIRFLOW__CORE__LOG_LEVEL=fatal airflow pools list
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Created] (AIRFLOW-6028) Add a Looker Hook.

2019-11-21 Thread Nathan Hadfield (Jira)

Nathan Hadfield created AIRFLOW-6028:


 Summary: Add a Looker Hook.
 Key: AIRFLOW-6028
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6028
 Project: Apache Airflow
  Issue Type: New Feature
  Components: hooks
Affects Versions: 2.0.0
Reporter: Nathan Hadfield
 Fix For: 2.0.0


This addition of a hook for Looker ([https://looker.com/]) will enable the 
integration of Airflow with the Looker SDK.  This can then for the basis for a 
suite of operators to automate common Looker actions, e.g. sending a Looker 
dashboard via email.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (AIRFLOW-6028) Add a Looker Hook.

2019-11-21 Thread Nathan Hadfield (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Hadfield reassigned AIRFLOW-6028:


Assignee: Nathan Hadfield

> Add a Looker Hook.
> --
>
> Key: AIRFLOW-6028
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6028
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: hooks
>Affects Versions: 2.0.0
>Reporter: Nathan Hadfield
>Assignee: Nathan Hadfield
>Priority: Minor
> Fix For: 2.0.0
>
>
> This addition of a hook for Looker ([https://looker.com/]) will enable the 
> integration of Airflow with the Looker SDK.  This can then for the basis for 
> a suite of operators to automate common Looker actions, e.g. sending a Looker 
> dashboard via email.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] stale[bot] commented on issue #6254: [AIRFLOW-5591] Remove Glyphicons in favor of FontAwesome icons

2019-11-21 Thread GitBox

stale[bot] commented on issue #6254: [AIRFLOW-5591] Remove Glyphicons in favor 
of FontAwesome icons
URL: https://github.com/apache/airflow/pull/6254#issuecomment-557027825
 
 
   This issue has been automatically marked as stale because it has not had 
recent activity. It will be closed if no further activity occurs. Thank you for 
your contributions.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] ashb commented on issue #6623: Donating Google's Airflow K8s Operator to Apache Foundation

2019-11-21 Thread GitBox

ashb commented on issue #6623: Donating Google's Airflow K8s Operator to Apache 
Foundation
URL: https://github.com/apache/airflow/pull/6623#issuecomment-556995070
 
 
   Excited to see we finally got here!
   
   A couple of thoughts/comments:
   
   - Would this be better off in a different repo to Airflow, say 
`apache/airflow-kuberentes-controler`
   - We're in the tricky situation where "Operator" means different things in 
Kube and Airflow :)
   - We need licesnse headers in _every_ file :( (The pre-commit hooks set up 
in this repo should automatically add almost all of them for us, or can be 
extended to add them for `.go` files if it doesn't already support that.
   - How much of the architecture is configurable? For example if we want to 
pre-bake the dags in to an image and don't want the DAG-pv.
   - Does Kube on AWS (EKS? Too many acronyms) support ReadMany PVs?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] ashb edited a comment on issue #6623: Donating Google's Airflow K8s Operator to Apache Foundation

2019-11-21 Thread GitBox

ashb edited a comment on issue #6623: Donating Google's Airflow K8s Operator to 
Apache Foundation
URL: https://github.com/apache/airflow/pull/6623#issuecomment-556995070
 
 
   Excited to see we finally got here!
   
   A couple of thoughts/comments:
   
   - Would this be better off in a different repo to Airflow, say 
`apache/airflow-kuberentes-controler`
   - We're in the tricky situation where "Operator" means different things in 
Kube and Airflow :)
   - We need licesnse headers in _every_ file :( (The pre-commit hooks set up 
in this repo should automatically add almost all of them for us, or can be 
extended to add them for `.go` files if it doesn't already support that.
   - How much of the architecture is configurable? For example if we want to 
pre-bake the dags in to an image and don't want the DAG-pv.
   - Does Kube on AWS (EKS? Too many acronyms) support ReadMany PVs?
   - Why did you choose Stateful sets for the scheduler and UI, rather than 
Deployements?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (AIRFLOW-6028) Add a Looker Hook.

2019-11-21 Thread Nathan Hadfield (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Hadfield updated AIRFLOW-6028:
-
Description: This addition of a hook for Looker ([https://looker.com/]) 
will enable the integration of Airflow with the Looker SDK.  This can then form 
the basis for a suite of operators to automate common Looker actions, e.g. 
sending a Looker dashboard via email.  (was: This addition of a hook for Looker 
([https://looker.com/]) will enable the integration of Airflow with the Looker 
SDK.  This can then for the basis for a suite of operators to automate common 
Looker actions, e.g. sending a Looker dashboard via email.)

> Add a Looker Hook.
> --
>
> Key: AIRFLOW-6028
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6028
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: hooks
>Affects Versions: 2.0.0
>Reporter: Nathan Hadfield
>Assignee: Nathan Hadfield
>Priority: Minor
> Fix For: 2.0.0
>
>
> This addition of a hook for Looker ([https://looker.com/]) will enable the 
> integration of Airflow with the Looker SDK.  This can then form the basis for 
> a suite of operators to automate common Looker actions, e.g. sending a Looker 
> dashboard via email.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] mik-laj commented on issue #6622: [AIRLFOW-6024] Do not use the logger in CLI

2019-11-21 Thread GitBox

mik-laj commented on issue #6622: [AIRLFOW-6024] Do not use the logger in CLI
URL: https://github.com/apache/airflow/pull/6622#issuecomment-556976441
 
 
   There is no need to use the logger because all messages are intended to be 
displayed on the screen. This should not be saved to a file. If you need to 
write to a file, you can redirect the stream.
   
   We want the messages to be displayed also when the following command is used.
   ```
   AIRFLOW__CORE__LOG_LEVEL=fatal airflow pools list
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] nuclearpinguin commented on a change in pull request #6622: [AIRLFOW-6024] Do not use the logger in CLI

2019-11-21 Thread GitBox

nuclearpinguin commented on a change in pull request #6622: [AIRLFOW-6024] Do 
not use the logger in CLI
URL: https://github.com/apache/airflow/pull/6622#discussion_r348963624
 
 

 ##
 File path: airflow/cli/commands/dag_command.py
 ##
 @@ -104,16 +104,15 @@ def dag_trigger(args):
 :return:
 """
 api_client = get_current_api_client()
-log = LoggingMixin().log
 try:
 message = api_client.trigger_dag(dag_id=args.dag_id,
  run_id=args.run_id,
  conf=args.conf,
  execution_date=args.exec_date)
+print(message)
 except OSError as err:
-log.error(err)
+print(str(err))
 raise AirflowException(err)
 
 Review comment:
   Do we need print here? The same exception with message will be rise.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] nuclearpinguin commented on a change in pull request #6622: [AIRLFOW-6024] Do not use the logger in CLI

2019-11-21 Thread GitBox

nuclearpinguin commented on a change in pull request #6622: [AIRLFOW-6024] Do 
not use the logger in CLI
URL: https://github.com/apache/airflow/pull/6622#discussion_r348963675
 
 

 ##
 File path: airflow/cli/commands/dag_command.py
 ##
 @@ -125,16 +124,15 @@ def dag_delete(args):
 :return:
 """
 api_client = get_current_api_client()
-log = LoggingMixin().log
 if args.yes or input(
 "This will drop all existing records related to the specified DAG. 
"
 "Proceed? (y/n)").upper() == "Y":
 try:
 message = api_client.delete_dag(dag_id=args.dag_id)
+print(message)
 except OSError as err:
-log.error(err)
+print(str(err))
 raise AirflowException(err)
 
 Review comment:
   Do we need print here? The same exception with message will be rise.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Created] (AIRFLOW-6029) Allow webserver to bind to multiple address

2019-11-21 Thread Mathieu Cinquin (Jira)

Mathieu Cinquin created AIRFLOW-6029:


 Summary: Allow webserver to bind to multiple address
 Key: AIRFLOW-6029
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6029
 Project: Apache Airflow
  Issue Type: Improvement
  Components: webserver
Affects Versions: 1.10.6
Reporter: Mathieu Cinquin


Hello,

 

A great move will be to allow webserver to bind to multiple address.

eg: 
{code:java}
web_server_host: 1.1.1.1, 1.1.1.2 {code}
 

Use _0.0.0.0_ is definitively not a good security approach. 

 

Gunicorn is already able to bind on multiple specified ip addresses: 
[https://github.com/benoitc/gunicorn/commit/b7b51adf13e92044211b267ba07e3498585f219a]

 

Regards



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] ratb3rt commented on a change in pull request #6609: [AIRFLOW-5950] AIP-21 Change import paths for "apache/cassandra" modules

2019-11-21 Thread GitBox

ratb3rt commented on a change in pull request #6609: [AIRFLOW-5950] AIP-21 
Change import paths for "apache/cassandra" modules
URL: https://github.com/apache/airflow/pull/6609#discussion_r349019087
 
 

 ##
 File path: scripts/ci/pylint_todo.txt
 ##
 @@ -1,4 +1,5 @@
 ./airflow/configuration.py
+./airflow/configuration.py:1:0:
 
 Review comment:
   pylint script ticketed, fixed, PRed and merged :)
   This instance has been corrected.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj edited a comment on issue #6622: [AIRLFOW-6024] Do not use the logger in CLI

2019-11-21 Thread GitBox

mik-laj edited a comment on issue #6622: [AIRLFOW-6024] Do not use the logger 
in CLI
URL: https://github.com/apache/airflow/pull/6622#issuecomment-556976441
 
 
   There is no need to use the logger because all messages are intended to be 
displayed on the console. This should not be saved to a file or other external 
system e.g. Stackdeiver. If you need to write to a file, you can redirect the 
stream.
   
   We want the messages to be displayed also when the following command is used.
   ```
   AIRFLOW__CORE__LOG_LEVEL=fatal airflow pools list
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj edited a comment on issue #6622: [AIRLFOW-6024] Do not use the logger in CLI

2019-11-21 Thread GitBox

mik-laj edited a comment on issue #6622: [AIRLFOW-6024] Do not use the logger 
in CLI
URL: https://github.com/apache/airflow/pull/6622#issuecomment-556976441
 
 
   There is no need to use the logger because all messages are intended to be 
displayed on the console. This should not be saved to a file or other external 
system e.g. Stackdriver. If you need to write to a file, you can redirect the 
stream.
   
   We want the messages to be displayed also when the following command is used.
   ```
   AIRFLOW__CORE__LOG_LEVEL=fatal airflow pools list
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] codecov-io edited a comment on issue #6124: [AIRFLOW-5501] in_cluster default value in KubernetesPodOperator overwrites configuration

2019-11-21 Thread GitBox

codecov-io edited a comment on issue #6124: [AIRFLOW-5501] in_cluster default 
value in KubernetesPodOperator overwrites configuration
URL: https://github.com/apache/airflow/pull/6124#issuecomment-556979532
 
 
   # [Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=h1) 
Report
   > Merging 
[#6124](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=desc) into 
[master](https://codecov.io/gh/apache/airflow/commit/fab957e763f40bf2a2398770312b4834fbd613e1?src=pr=desc)
 will **decrease** coverage by `0.32%`.
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/airflow/pull/6124/graphs/tree.svg?width=650=WdLKlKHOAU=150=pr)](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=tree)
   
   ```diff
   @@Coverage Diff @@
   ##   master#6124  +/-   ##
   ==
   - Coverage83.8%   83.48%   -0.33% 
   ==
 Files 669  669  
 Lines   3756437566   +2 
   ==
   - Hits3148031361 -119 
   - Misses   6084 6205 +121
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=tree) | 
Coverage Δ | |
   |---|---|---|
   | 
[...rflow/contrib/operators/kubernetes\_pod\_operator.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9jb250cmliL29wZXJhdG9ycy9rdWJlcm5ldGVzX3BvZF9vcGVyYXRvci5weQ==)
 | `75% <0%> (-23.58%)` | :arrow_down: |
   | 
[airflow/kubernetes/volume\_mount.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3ZvbHVtZV9tb3VudC5weQ==)
 | `44.44% <0%> (-55.56%)` | :arrow_down: |
   | 
[airflow/kubernetes/volume.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3ZvbHVtZS5weQ==)
 | `52.94% <0%> (-47.06%)` | :arrow_down: |
   | 
[airflow/kubernetes/pod\_launcher.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3BvZF9sYXVuY2hlci5weQ==)
 | `45.25% <0%> (-46.72%)` | :arrow_down: |
   | 
[airflow/kubernetes/refresh\_config.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3JlZnJlc2hfY29uZmlnLnB5)
 | `50.98% <0%> (-23.53%)` | :arrow_down: |
   | 
[airflow/configuration.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9jb25maWd1cmF0aW9uLnB5)
 | `89.13% <0%> (-3.63%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=footer). 
Last update 
[fab957e...8043c49](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] codecov-io commented on issue #6124: [AIRFLOW-5501] in_cluster default value in KubernetesPodOperator overwrites configuration

2019-11-21 Thread GitBox

codecov-io commented on issue #6124: [AIRFLOW-5501] in_cluster default value in 
KubernetesPodOperator overwrites configuration
URL: https://github.com/apache/airflow/pull/6124#issuecomment-556979532
 
 
   # [Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=h1) 
Report
   > Merging 
[#6124](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=desc) into 
[master](https://codecov.io/gh/apache/airflow/commit/fab957e763f40bf2a2398770312b4834fbd613e1?src=pr=desc)
 will **decrease** coverage by `0.32%`.
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/airflow/pull/6124/graphs/tree.svg?width=650=WdLKlKHOAU=150=pr)](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=tree)
   
   ```diff
   @@Coverage Diff @@
   ##   master#6124  +/-   ##
   ==
   - Coverage83.8%   83.48%   -0.33% 
   ==
 Files 669  669  
 Lines   3756437566   +2 
   ==
   - Hits3148031361 -119 
   - Misses   6084 6205 +121
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=tree) | 
Coverage Δ | |
   |---|---|---|
   | 
[...rflow/contrib/operators/kubernetes\_pod\_operator.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9jb250cmliL29wZXJhdG9ycy9rdWJlcm5ldGVzX3BvZF9vcGVyYXRvci5weQ==)
 | `75% <0%> (-23.58%)` | :arrow_down: |
   | 
[airflow/kubernetes/volume\_mount.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3ZvbHVtZV9tb3VudC5weQ==)
 | `44.44% <0%> (-55.56%)` | :arrow_down: |
   | 
[airflow/kubernetes/volume.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3ZvbHVtZS5weQ==)
 | `52.94% <0%> (-47.06%)` | :arrow_down: |
   | 
[airflow/kubernetes/pod\_launcher.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3BvZF9sYXVuY2hlci5weQ==)
 | `45.25% <0%> (-46.72%)` | :arrow_down: |
   | 
[airflow/kubernetes/refresh\_config.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3JlZnJlc2hfY29uZmlnLnB5)
 | `50.98% <0%> (-23.53%)` | :arrow_down: |
   | 
[airflow/configuration.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9jb25maWd1cmF0aW9uLnB5)
 | `89.13% <0%> (-3.63%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=footer). 
Last update 
[fab957e...8043c49](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] codecov-io edited a comment on issue #6124: [AIRFLOW-5501] in_cluster default value in KubernetesPodOperator overwrites configuration

2019-11-21 Thread GitBox

codecov-io edited a comment on issue #6124: [AIRFLOW-5501] in_cluster default 
value in KubernetesPodOperator overwrites configuration
URL: https://github.com/apache/airflow/pull/6124#issuecomment-556979532
 
 
   # [Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=h1) 
Report
   > Merging 
[#6124](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=desc) into 
[master](https://codecov.io/gh/apache/airflow/commit/fab957e763f40bf2a2398770312b4834fbd613e1?src=pr=desc)
 will **decrease** coverage by `0.32%`.
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/airflow/pull/6124/graphs/tree.svg?width=650=WdLKlKHOAU=150=pr)](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=tree)
   
   ```diff
   @@Coverage Diff @@
   ##   master#6124  +/-   ##
   ==
   - Coverage83.8%   83.48%   -0.33% 
   ==
 Files 669  669  
 Lines   3756437566   +2 
   ==
   - Hits3148031361 -119 
   - Misses   6084 6205 +121
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=tree) | 
Coverage Δ | |
   |---|---|---|
   | 
[...rflow/contrib/operators/kubernetes\_pod\_operator.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9jb250cmliL29wZXJhdG9ycy9rdWJlcm5ldGVzX3BvZF9vcGVyYXRvci5weQ==)
 | `75% <0%> (-23.58%)` | :arrow_down: |
   | 
[airflow/kubernetes/volume\_mount.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3ZvbHVtZV9tb3VudC5weQ==)
 | `44.44% <0%> (-55.56%)` | :arrow_down: |
   | 
[airflow/kubernetes/volume.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3ZvbHVtZS5weQ==)
 | `52.94% <0%> (-47.06%)` | :arrow_down: |
   | 
[airflow/kubernetes/pod\_launcher.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3BvZF9sYXVuY2hlci5weQ==)
 | `45.25% <0%> (-46.72%)` | :arrow_down: |
   | 
[airflow/kubernetes/refresh\_config.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9rdWJlcm5ldGVzL3JlZnJlc2hfY29uZmlnLnB5)
 | `50.98% <0%> (-23.53%)` | :arrow_down: |
   | 
[airflow/configuration.py](https://codecov.io/gh/apache/airflow/pull/6124/diff?src=pr=tree#diff-YWlyZmxvdy9jb25maWd1cmF0aW9uLnB5)
 | `89.13% <0%> (-3.63%)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=footer). 
Last update 
[fab957e...8043c49](https://codecov.io/gh/apache/airflow/pull/6124?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] qlemaire22 commented on issue #6124: [AIRFLOW-5501] in_cluster default value in KubernetesPodOperator overwrites configuration

2019-11-21 Thread GitBox

qlemaire22 commented on issue #6124: [AIRFLOW-5501] in_cluster default value in 
KubernetesPodOperator overwrites configuration
URL: https://github.com/apache/airflow/pull/6124#issuecomment-556979944
 
 
   The tests are successful now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (AIRFLOW-5072) gcs_hook causes out-of-memory error when downloading huge files

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-5072:
---
Fix Version/s: (was: 2.0.0)
   1.10.7

> gcs_hook causes out-of-memory error when downloading huge files
> ---
>
> Key: AIRFLOW-5072
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5072
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: gcp
>Affects Versions: 1.10.3
>Reporter: Tobias Kaymak
>Assignee: Tobias Kaymak
>Priority: Major
> Fix For: 1.10.7
>
>
> Possibly there is an "else" missing here, but the gcs_hook's `download` 
> method *always* downloads a blob as a string even when a filename was 
> supplied. This causes the method to take twice as long when a filename is 
> supplied and for huge blobs it can even cause out-of-memory errors.
> I think that there is an else missing?
> [https://github.com/apache/airflow/blob/05c01a97497e992c7d8b05a39a7855343dee1603/airflow/contrib/hooks/gcs_hook.py#L176]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] nuclearpinguin commented on issue #6621: [AIRFLOW-6025] Add label to uniquely identify creator of Pod

2019-11-21 Thread GitBox

nuclearpinguin commented on issue #6621: [AIRFLOW-6025] Add label to uniquely 
identify creator of Pod
URL: https://github.com/apache/airflow/pull/6621#issuecomment-557094437
 
 
   It seems that we have a flaky test:
   ```
  Traceback (most recent call last):
   tests/task/task_runner/test_standard_task_runner.py line 129 in 
test_on_kill
 with open(path, "r") as f:
  FileNotFoundError: [Errno 2] No such file or directory: 
'/tmp/airflow_on_kill'
   
   >> begin captured stdout << -
  [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s
  [2019-11-21 13:02:57,945] {test_task_view_type_check.py:49} INFO - 
class_instance type: 
  [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s
  [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s
  [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s
  [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s
  [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s
  [%(asctime)s] {{%(filename)s:%(lineno)d}} %(levelname)s - %(message)s
   
  - >> end captured stdout << --
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Commented] (AIRFLOW-5931) Spawning new python interpreter for every task slow

2019-11-21 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/AIRFLOW-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979356#comment-16979356
 ] 

ASF GitHub Bot commented on AIRFLOW-5931:
-

ashb commented on pull request #6627: [AIRFLOW-5931] Use os.fork when 
appropriate to speed up task execution.
URL: https://github.com/apache/airflow/pull/6627
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] https://issues.apache.org/jira/browse/AIRFLOW-5931
   
   ### Description
   
   - [x] Rather than running a fresh python interpreter which then has to 
re-load
 all of Airflow and its dependencies we should use os.fork when it is
 available/suitable which should speed up task running, espeically for
 short lived tasks.
   
 I've profiled this and it took the task duration (as measured by the
 `duration` column in the TI table) from an average of 14.063s down to
 just 0.932s!
   
 I _could_ make this change deeper and bypass the `CLIFactory`/go directly 
to `_run_raw_task`, but this makes the change the minimum needed to work.
   
   ### Tests
   
   - [x] No unit tests added. Hopefully existing tests good enough. Manual 
testing shows this working
   
   Other tests I need to perform:
   
   - [ ] Check if `os._exit` is right (this doesn't run atexit callbacks) - so 
I need to check if logging in the subprocess istidied up properly.
   - [ ] Test if this leaves "dangling"/broken DB connections.
   - [ ] Check remote log uploading
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Spawning new python interpreter for every task slow
> ---
>
> Key: AIRFLOW-5931
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5931
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: executors, worker
>Affects Versions: 2.0.0
>Reporter: Ash Berlin-Taylor
>Assignee: Ash Berlin-Taylor
>Priority: Major
>
> There are a number of places in the Executors and Task Runners where we spawn 
> a whole new python interpreter.
> My profiling has shown that this is slow. Rather than running a fresh python 
> interpreter which then has to re-load all of Airflow and its dependencies we 
> should use {{os.fork}} when it is available/suitable which should speed up 
> task running, espeically for short lived tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] ashb opened a new pull request #6627: [AIRFLOW-5931] Use os.fork when appropriate to speed up task execution.

2019-11-21 Thread GitBox

ashb opened a new pull request #6627: [AIRFLOW-5931] Use os.fork when 
appropriate to speed up task execution.
URL: https://github.com/apache/airflow/pull/6627
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] https://issues.apache.org/jira/browse/AIRFLOW-5931
   
   ### Description
   
   - [x] Rather than running a fresh python interpreter which then has to 
re-load
 all of Airflow and its dependencies we should use os.fork when it is
 available/suitable which should speed up task running, espeically for
 short lived tasks.
   
 I've profiled this and it took the task duration (as measured by the
 `duration` column in the TI table) from an average of 14.063s down to
 just 0.932s!
   
 I _could_ make this change deeper and bypass the `CLIFactory`/go directly 
to `_run_raw_task`, but this makes the change the minimum needed to work.
   
   ### Tests
   
   - [x] No unit tests added. Hopefully existing tests good enough. Manual 
testing shows this working
   
   Other tests I need to perform:
   
   - [ ] Check if `os._exit` is right (this doesn't run atexit callbacks) - so 
I need to check if logging in the subprocess istidied up properly.
   - [ ] Test if this leaves "dangling"/broken DB connections.
   - [ ] Check remote log uploading
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Comment Edited] (AIRFLOW-6013) Last heartbeat check does not account for execution time of session.commit()

2019-11-21 Thread Oliver Frost (Jira)



[ 
https://issues.apache.org/jira/browse/AIRFLOW-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979290#comment-16979290
 ] 

Oliver Frost edited comment on AIRFLOW-6013 at 11/21/19 1:57 PM:
-

Just filed the same issue. I think an additional sleep time in the LocalTaskJob 
is fine, but should be reverted to something like the logic before commit 
68b8ec5f4 [1], without interferring with the desired self-termination.

[1] 
https://github.com/apache/airflow/commit/68b8ec5f415795e4fa4ff7df35a3e75c712a7bad


was (Author: ofrost):
Just wanted to file the same issue. I think an additional sleep time in the 
LocalTaskJob is fine, but should be reverted to something like the logic before 
commit 68b8ec5f4 [1], without interferring with the desired self-termination.

[1] 
https://github.com/apache/airflow/commit/68b8ec5f415795e4fa4ff7df35a3e75c712a7bad

> Last heartbeat check does not account for execution time of session.commit()
> 
>
> Key: AIRFLOW-6013
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6013
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: worker
>Affects Versions: 1.10.6
>Reporter: Alex B
>Priority: Minor
>
> Assuming the last hearbeat is not older than the heartbeat_time_limit, this 
> warning will Always fire:
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/local_task_job.py#L120]
> There's a few commands between:
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/base_job.py#L195]
> and
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/local_task_job.py#L111]
> so _(timezone.utcnow() - self.latest_heartbeat).total_seconds()_ will always 
> be some small but non-0 number.
>  
> We get many log warnings in our task-logs similar to:
> {code:java}
> WARNING - Time since last heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 
> 4.991735 s{code}
>  
> Does local_task_job need the extra check on last_heartbeat?
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/local_task_job.py#L121]
> Since base_job is already making sure to sleep through the gap:
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/base_job.py#L187]
> ?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] ashb commented on issue #6627: [AIRFLOW-5931] Use os.fork when appropriate to speed up task execution.

2019-11-21 Thread GitBox

ashb commented on issue #6627: [AIRFLOW-5931] Use os.fork when appropriate to 
speed up task execution.
URL: https://github.com/apache/airflow/pull/6627#issuecomment-557132298
 
 
   My benchmarking results of this change below.
   
   All my tests were of `BashOperator(..., bash_command="true")`. This change 
helps bring down the average (14.062799955 to 0.9318446075, x15 faster) and the 
maximum time (from 22.927918 to 1.865752, x12 faster!)
   
   ** Without  change, 40 dag runs**:
   ```
   $ psql airflow -c "select dag_id, task_id, execution_date, state, count(*), 
min(duration) as min_dur, avg(duration) as avg_dur, max(duration) as max_dur, 
sum(duration) as running_time, max(end_date)-min(start_date) as total_duration, 
avg(start_date - queued_dttm) as queue_delay from task_instance where state is 
not null group by grouping sets ((dag_id, task_id, state), 
(dag_id,execution_date, state), (dag_id, state)) order by 1,2,3"
   
  dag_id|   task_id| execution_date |  state  | count |  
min_dur  |   avg_dur|  max_dur  | running_time | total_duration  |   
queue_delay
   
-+--++-+---+---+--+---+--+-+-
example_dag | print_date1  || success |10 | 
13.540218 |   14.2083814 | 14.745729 |   142.083814 | 00:00:17.130217 | 
00:00:14.927132
example_dag | print_date10 || success |10 | 
11.480039 |   13.4748527 | 14.403048 |   134.748527 | 00:01:45.278854 | 
00:09:42.822216
example_dag | print_date11 || success |10 | 
10.084806 |   13.2927993 | 14.148362 |   132.927993 | 00:01:43.667161 | 
00:09:44.954266
example_dag | print_date12 || success |10 |  
7.529389 |   13.1063045 | 14.152888 |   131.063045 | 00:01:41.729084 | 
00:09:48.473788
example_dag | print_date13 || success |10 | 
13.352645 |14.171775 | 15.294458 |141.71775 | 00:05:34.40845  | 
00:07:12.078185
example_dag | print_date14 || success |10 | 
12.685846 |   14.1814867 |  15.92092 |   141.814867 | 00:05:54.331766 | 
00:07:00.183005
example_dag | print_date15 || success |10 | 
13.261651 |   13.6241582 |   14.4745 |   136.241582 | 00:06:01.605317 | 
00:06:46.927206
example_dag | print_date16 || success |10 | 
13.463467 |   13.9461268 | 14.327437 |   139.461268 | 00:06:03.748068 | 
00:06:43.670468
example_dag | print_date17 || success |10 | 
12.838915 |   13.7431276 | 14.127847 |   137.431276 | 00:06:16.971633 | 
00:06:39.727049
example_dag | print_date18 || success |10 | 
13.478566 |   16.0986632 | 19.783983 |   160.986632 | 00:06:39.378521 | 
00:06:20.19122
example_dag | print_date19 || success |10 | 
13.427385 |   15.7544223 | 22.927918 |   157.544223 | 00:06:40.99252  | 
00:06:52.527796
example_dag | print_date2  || success |10 | 
13.692921 |   14.3700994 | 14.829447 |   143.700994 | 00:00:46.73697  | 
00:00:24.922475
example_dag | print_date20 || success |10 | 
13.749788 |   15.9622433 |  21.03353 |   159.622433 | 00:06:46.79555  | 
00:06:47.494261
example_dag | print_date21 || success |10 | 
13.533584 |   15.3698928 | 20.041363 |   153.698928 | 00:07:01.543776 | 
00:06:40.238809
example_dag | print_date22 || success |10 | 
13.292851 |   14.3784663 | 15.897209 |   143.784663 | 00:07:07.038294 | 
00:06:22.919986
example_dag | print_date23 || success |10 | 
13.571403 |   14.7570801 | 16.540063 |   147.570801 | 00:06:58.41123  | 
00:06:17.252477
example_dag | print_date24 || success |10 | 
13.698992 |   14.5122418 | 17.094243 |   145.122418 | 00:06:51.823565 | 
00:06:14.655425
example_dag | print_date25 || success |10 | 
11.696784 |   13.6606921 | 14.262454 |   136.606921 | 00:07:14.91126  | 
00:06:10.941776
example_dag | print_date26 || success |10 | 
13.115977 |   13.8040275 | 14.581811 |   138.040275 | 00:07:22.865953 | 
00:05:55.113162
example_dag | print_date27 || success |10 | 
13.203122 |   13.8406964 | 14.061522 |   138.406964 | 00:07:21.852219 | 
00:05:50.572565
example_dag | print_date28 || success |10 | 
13.684101 |   14.1242237 | 14.625835 |   141.242237 | 00:07:52.563805 | 
00:05:45.37103
example_dag | print_date29 || success |10 | 
13.392751 |   13.8983641 | 14.356764 |   138.983641 | 00:07:43.298189 | 
00:05:33.6447
example_dag | print_date3  || success |10

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349062357
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349052059
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349049920
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
 
 Review comment:
   ```suggestion
   You should avoid usage of Variables outside an operator's ``execute()`` 
method or Jinja templates if possible, as Variables create a connection to 
metadata DB of Airflow to fetch the value which can slow down parsing and place 
extra load on the DB.
   ```
   
   (Using Variables is common in dynamic dags.)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349062029
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349063292
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349049272
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
 
 Review comment:
   Please expand this to include something like "and then push a path to the 
remote file in Xcom to use in downstream tasks"


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349062851
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349052555
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[jira] [Updated] (AIRFLOW-5931) Spawning new python interpreter for every task slow

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-5931:
---
Summary: Spawning new python interpreter for every task slow  (was: 
Spawning new python interpreter is slow)

> Spawning new python interpreter for every task slow
> ---
>
> Key: AIRFLOW-5931
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5931
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: executors, worker
>Affects Versions: 2.0.0
>Reporter: Ash Berlin-Taylor
>Assignee: Ash Berlin-Taylor
>Priority: Major
>
> There are a number of places in the Executors and Task Runners where we spawn 
> a whole new python interpreter.
> My profiling has shown that this is slow. Rather than running a fresh python 
> interpreter which then has to re-load all of Airflow and its dependencies we 
> should use {{os.fork}} when it is available/suitable which should speed up 
> task running, espeically for short lived tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (AIRFLOW-6030) add ability to interact with Marketo

2019-11-21 Thread lovk korm (Jira)

lovk korm created AIRFLOW-6030:
--

 Summary: add ability to interact with Marketo
 Key: AIRFLOW-6030
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6030
 Project: Apache Airflow
  Issue Type: Bug
  Components: hooks, operators
Affects Versions: 1.10.6
Reporter: lovk korm


I found this 
[https://github.com/astronomer/airflow-guides/blob/master/guides/marketo-to-redshift.md]

[https://github.com/airflow-plugins/marketo_plugin/blob/master/hooks/marketo_hook.py]

 

but it's very basic.. there is no marketo connection in the UI etc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] stale[bot] commented on issue #6273: [AIRFLOW-5039] Fix broken XCom JSON serialization

2019-11-21 Thread GitBox

stale[bot] commented on issue #6273: [AIRFLOW-5039] Fix broken XCom JSON 
serialization
URL: https://github.com/apache/airflow/pull/6273#issuecomment-557049250
 
 
   This issue has been automatically marked as stale because it has not had 
recent activity. It will be closed if no further activity occurs. Thank you for 
your contributions.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] KKcorps commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

KKcorps commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349088616
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
 
 Review comment:
   From the airflow code, it seems like a new session is created per variable


This is an automated message from the Apache Git Service.
To respond to the

[jira] [Closed] (AIRFLOW-5859) Tasks locking and heartbeat warnings

2019-11-21 Thread Jacob Ward (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Ward closed AIRFLOW-5859.
---
Resolution: Invalid

> Tasks locking and heartbeat warnings
> 
>
> Key: AIRFLOW-5859
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5859
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DagRun
>Affects Versions: 1.10.6
> Environment: Airflow using LocalExecutor and Postgres
>Reporter: Jacob Ward
>Priority: Major
>
> Having two potentially related issues.
> Issue 1:
> Some of my tasks (has only been PythonOperators so far) are starting and then 
> doing nothing. I had a task that usually executes in 10-30 minutes running 
> for over 24hrs without any error messages in the logs (other than the 
> heartbeat warnings shown below) and without failing.
> So far this has only happened to tasks inside sub-dags, not sure if that's to 
> do with it?
> The logs for the sub-dag show: 
> {code:}
> [2019-11-05 16:41:34,364] {logging_mixin.py:112} INFO - [2019-11-05 
> 16:41:34,364] {backfill_job.py:363} INFO - [backfill progress] | finished run 
> 0 of 1 | tasks waiting: 7 | succeeded: 5 | running: 1 | failed: 0 | skipped: 
> 1 | deadlocked: 0 | not ready: 7
> [2019-11-05 16:41:34,831] {logging_mixin.py:112} INFO - [2019-11-05 
> 16:41:34,830] {local_task_job.py:124} WARNING - Time since last 
> heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 4.986693 s
> [2019-11-05 16:41:39,376] {logging_mixin.py:112} INFO - [2019-11-05 
> 16:41:39,376] {backfill_job.py:363} INFO - [backfill progress] | finished run 
> 0 of 1 | tasks waiting: 7 | succeeded: 5 | running: 1 | failed: 0 | skipped: 
> 1 | deadlocked: 0 | not ready: 7
> [2019-11-05 16:41:39,859] {logging_mixin.py:112} INFO - [2019-11-05 
> 16:41:39,859] {local_task_job.py:124} WARNING - Time since last 
> heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 4.986141 s
> {code}
> repeated during that 24hr period.
> Issue 2:
> In all of the logs this warning message is printed every 5 seconds:
> {code:}
> [2019-11-06 14:25:29,466] {logging_mixin.py:112} INFO - [2019-11-06 
> 14:25:29,465] {local_task_job.py:124} WARNING - Time since last 
> heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 4.987354 s {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] JavierLopezT commented on a change in pull request #6396: [AIRFLOW-5726] Delete table as file name in RedshiftToS3Transfer

2019-11-21 Thread GitBox

JavierLopezT commented on a change in pull request #6396: [AIRFLOW-5726] Delete 
table as file name in RedshiftToS3Transfer
URL: https://github.com/apache/airflow/pull/6396#discussion_r349121915
 
 

 ##
 File path: tests/operators/test_redshift_to_s3_operator.py
 ##
 @@ -31,7 +32,8 @@ class TestRedshiftToS3Transfer(unittest.TestCase):
 
 @mock.patch("boto3.session.Session")
 @mock.patch("airflow.hooks.postgres_hook.PostgresHook.run")
-def test_execute(self, mock_run, mock_session):
+@parameterized([(True, ), (False, )])
 
 Review comment:
   Unfortunately, it is still failing =(


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] JavierLopezT commented on a change in pull request #6396: [AIRFLOW-5726] Delete table as file name in RedshiftToS3Transfer

2019-11-21 Thread GitBox

JavierLopezT commented on a change in pull request #6396: [AIRFLOW-5726] Delete 
table as file name in RedshiftToS3Transfer
URL: https://github.com/apache/airflow/pull/6396#discussion_r346397018
 
 

 ##
 File path: tests/operators/test_redshift_to_s3_operator.py
 ##
 @@ -51,23 +52,25 @@ def test_execute(self, mock_run, mock_session):
 redshift_conn_id="redshift_conn_id",
 aws_conn_id="aws_conn_id",
 task_id="task_id",
+table_as_file_name=table_as_file_name,
 dag=None
 ).execute(None)
 
 unload_options = '\n\t\t\t'.join(unload_options)
+s3_key = '{}/{}_'.format(s3_key, table) if table_as_file_name else 
s3_key
 
 Review comment:
   Thank you very much ashb for your help. I have removed the changes and added 
a case for table_as_file_name = False. However, I am not sure that I have 
understood you completely
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] JavierLopezT commented on a change in pull request #6396: [AIRFLOW-5726] Delete table as file name in RedshiftToS3Transfer

2019-11-21 Thread GitBox

JavierLopezT commented on a change in pull request #6396: [AIRFLOW-5726] Delete 
table as file name in RedshiftToS3Transfer
URL: https://github.com/apache/airflow/pull/6396#discussion_r345707738
 
 

 ##
 File path: airflow/operators/redshift_to_s3_operator.py
 ##
 @@ -103,19 +108,33 @@ def execute(self, context):
 credentials = s3_hook.get_credentials()
 unload_options = '\n\t\t\t'.join(self.unload_options)
 select_query = "SELECT * FROM 
{schema}.{table}".format(schema=self.schema, table=self.table)
-unload_query = """
-UNLOAD ('{select_query}')
-TO 's3://{s3_bucket}/{s3_key}/{table}_'
-with credentials
-
'aws_access_key_id={access_key};aws_secret_access_key={secret_key}'
-{unload_options};
-""".format(select_query=select_query,
-   table=self.table,
-   s3_bucket=self.s3_bucket,
-   s3_key=self.s3_key,
-   access_key=credentials.access_key,
-   secret_key=credentials.secret_key,
-   unload_options=unload_options)
+if self.table_as_file_name:
 
 Review comment:
   I have applied your suggestion of simplification.
   
   The _ was already there. I guess it is for the cases where you set 'parallel 
on' the files have every numerical identificator separated by a '_'. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Closed] (AIRFLOW-5308) Pass credentials object to pandas_gbq

2019-11-21 Thread Kamil Bregula (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Bregula closed AIRFLOW-5308.
--
Fix Version/s: 1.10.7
   Resolution: Fixed

> Pass credentials object to pandas_gbq
> -
>
> Key: AIRFLOW-5308
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5308
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: gcp
>Affects Versions: 1.10.4
>Reporter: Kamil Bregula
>Priority: Major
> Fix For: 1.10.7
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] nuclearpinguin opened a new pull request #6626: Test commit

2019-11-21 Thread GitBox

nuclearpinguin opened a new pull request #6626: Test commit
URL: https://github.com/apache/airflow/pull/6626
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Closed] (AIRFLOW-5877) Improve job_id detection in DataflowRunner

2019-11-21 Thread Kamil Bregula (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Bregula closed AIRFLOW-5877.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

> Improve job_id detection in DataflowRunner
> --
>
> Key: AIRFLOW-5877
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5877
> Project: Apache Airflow
>  Issue Type: Sub-task
>  Components: gcp
>Affects Versions: 1.10.6
>Reporter: Kamil Bregula
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Closed] (AIRFLOW-5310) Add PrestoToGoogleStorageOperator

2019-11-21 Thread Kamil Bregula (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Bregula closed AIRFLOW-5310.
--
Resolution: Duplicate

> Add PrestoToGoogleStorageOperator
> -
>
> Key: AIRFLOW-5310
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5310
> Project: Apache Airflow
>  Issue Type: Wish
>  Components: contrib, gcp, operators
>Affects Versions: 1.10.4
>Reporter: lovk korm
>Priority: Major
>
> Please add PrestoToGoogleStorageOperator



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349061827
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053006
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053499
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053798
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349048944
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
 
 Review comment:
   Unforutnately we can't be so bold as to say "Always" -- not every system is 
supported, so "Where at all possible" might be the best we can say.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349052009
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053904
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349048684
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
 
 Review comment:
   The directory bit isn't true, but the issue here is as you mention tasks can 
be executed on different machines.
   
   And even if using the a LocalExecutor, storing files on local disk can make 
retries harder (especially if another task might have deleted the file in the 
mean time)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349051235
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
 
 Review comment:
   ```suggestion
   The best way of using variables is via a Jinja template which will delay 
reading the value until the task

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349050914
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
 
 Review comment:
   ```
   Airflow parses all the DAG files in a loop, trying to parse each file every 
``processor_poll_interval`` seconds (default 1 second). During parsing, Airflow 
will open and close a new connection to the metadata DB for

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349062979
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349061642
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349052306
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349062575
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349052656
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to 
make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053376
 
 

 ##
 File path: docs/best-practices.rst
 ##
 @@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==
+
+Running Airflow in production is seamless. It comes bundled with all the 
plugins and configs
+necessary to run most of the DAGs. However, you can come across certain 
pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these 
pitfalls, starting from writing the DAG 
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^
+Creating a new DAG in Airflow is quite simple. However, there are many things 
that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---
+
+You should treat tasks in Airflow equivalent to transactions in a database. It 
implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete 
data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same 
outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to 
duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data 
in a task. 
+  Someone may update the input data between re-runs, which results in 
different outputs. 
+  A better way is to read the input data from a specific partition. You can 
use ``execution_date`` as a partition. 
+  You should follow this partitioning method while writing data in S3/HDFS, as 
well.
+* The python datetime ``now()`` function gives the current datetime object. 
+  This function should never be used inside a task, especially to do the 
critical computation, as it leads to different outcomes on each run. 
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 
paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+
+
+Never delete a task from a DAG. In case of deletion, the historical 
information of the task disappears from the Airflow UI. 
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+--
+
+Airflow executes tasks of a DAG in different directories, which can even be 
present 
+on different servers in case you are using :doc:`Kubernetes executor 
<../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. 
+Therefore, you should not store any file or config in the local filesystem — 
for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to 
communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as 
passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in 
Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or 
Jinja templates. Variables create a connection to metadata DB of Airflow to 
fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is 
by default 1 second. During parsing, Airflow creates a new connection to the 
metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja 
template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+

[jira] [Created] (AIRFLOW-6031) Check for additional sleep time in LocalTaskJob is flawed

2019-11-21 Thread Oliver Frost (Jira)

Oliver Frost created AIRFLOW-6031:
-

 Summary: Check for additional sleep time in LocalTaskJob is flawed
 Key: AIRFLOW-6031
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6031
 Project: Apache Airflow
  Issue Type: Bug
  Components: worker
Affects Versions: 1.10.6
Reporter: Oliver Frost


Since the PR for issue AIRFLOW-5102 the implementation of the heartbeat is such 
that the check for additional sleep time in the LocalTaskJob always triggers, 
because it uses the time _after_ the heartbeat's sleep as reference.

Hence a warning like 
{noformat}
{local_task_job.py:124} WARNING - Time since last heartbeat(0.01 s) < 
heartrate(10.0 s), sleeping for 9.9881 s
{noformat}
 
Others already noticed and reported that the warning issued on that code path 
was flooding their logging systems in AIRFLOW-5690.

Sadly the pull request associated with AIRFLOW-5690 only silences the warning 
instead of realizing that the implementation logic is actually amiss.

The check for additional sleep time should be reverted to the logic before the 
PR related to AIRFLOW-5102, while keeping the self termination route intact.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (AIRFLOW-6013) Last heartbeat check does not account for execution time of session.commit()

2019-11-21 Thread Oliver Frost (Jira)



[ 
https://issues.apache.org/jira/browse/AIRFLOW-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979290#comment-16979290
 ] 

Oliver Frost commented on AIRFLOW-6013:
---

Just wanted to file the same issue. I think an additional sleep time in the 
LocalTaskJob is fine, but should be reverted to something like the logic before 
commit 68b8ec5f4 [1], without interferring with the desired self-termination.

[1] 
https://github.com/apache/airflow/commit/68b8ec5f415795e4fa4ff7df35a3e75c712a7bad

> Last heartbeat check does not account for execution time of session.commit()
> 
>
> Key: AIRFLOW-6013
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6013
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: worker
>Affects Versions: 1.10.6
>Reporter: Alex B
>Priority: Minor
>
> Assuming the last hearbeat is not older than the heartbeat_time_limit, this 
> warning will Always fire:
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/local_task_job.py#L120]
> There's a few commands between:
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/base_job.py#L195]
> and
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/local_task_job.py#L111]
> so _(timezone.utcnow() - self.latest_heartbeat).total_seconds()_ will always 
> be some small but non-0 number.
>  
> We get many log warnings in our task-logs similar to:
> {code:java}
> WARNING - Time since last heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 
> 4.991735 s{code}
>  
> Does local_task_job need the extra check on last_heartbeat?
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/local_task_job.py#L121]
> Since base_job is already making sure to sleep through the gap:
> [https://github.com/apache/airflow/blob/1.10.6/airflow/jobs/base_job.py#L187]
> ?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] eserdyuk-exos commented on issue #6614: [AIRFLOW-6020] fix python 3 KubernetesExecutor iteritems exception

2019-11-21 Thread GitBox

eserdyuk-exos commented on issue #6614: [AIRFLOW-6020] fix python 3 
KubernetesExecutor iteritems exception
URL: https://github.com/apache/airflow/pull/6614#issuecomment-557099468
 
 
   @mik-laj well, it's fixed now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] dimberman commented on issue #6627: [AIRFLOW-5931] Use os.fork when appropriate to speed up task execution.

2019-11-21 Thread GitBox

dimberman commented on issue #6627: [AIRFLOW-5931] Use os.fork when appropriate 
to speed up task execution.
URL: https://github.com/apache/airflow/pull/6627#issuecomment-557130023
 
 
   I have tested this locally and it seems to work fine. 
   
   @ashb when are situations where CAN_FORK is false besides when doing 
run_as_user?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (AIRFLOW-5082) add subject in aws sns hook

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-5082:
---
Fix Version/s: (was: 2.0.0)
   1.10.7

> add subject in aws sns hook
> ---
>
> Key: AIRFLOW-5082
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5082
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws
>Affects Versions: 1.10.4
>Reporter: MOHAMMD SHAKEEL SHAIK
>Assignee: MOHAMMD SHAKEEL SHAIK
>Priority: Major
> Fix For: 1.10.7
>
>
> While sending SNS notification to AWS. The subject is an optional field. If 
> we don't send Subject AWS will add default SNS Subject to email "*AWS 
> Notification Message*". If anyone wants to add a different Subject. They can 
> send Subject parameter in AWS SNS hook. 
>  
> It is also optional only



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (AIRFLOW-4938) next_execution_date is not a Pendulum object

2019-11-21 Thread jack (Jira)



[ 
https://issues.apache.org/jira/browse/AIRFLOW-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979371#comment-16979371
 ] 

jack commented on AIRFLOW-4938:
---

Was fixed in 1.10.4:

https://issues.apache.org/jira/browse/AIRFLOW-4788

> next_execution_date is not a Pendulum object
> 
>
> Key: AIRFLOW-4938
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4938
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: models
>Affects Versions: 1.10.2
> Environment: Airlfow 1.10.2 in Docker using puckel docker image.
> Configured for timezone Copenhagen
>Reporter: Adam Andersen Læssøe
>Priority: Major
>
> When templating, it seems `execution_date` refers to a Pendulum object while 
> `next_execution_date` refers to a native `datetime` object.
>  This is inconsistent, and contrary to what the docs say, so I'm thinking 
> it's a bug.
> *Observation*
>  Airflow is configured to run in timezone Copenhagen.
>  I have a where clause like
>  ```
> {code:java}
> WHERE inserted_at >= '{{execution_date}}' AND inserted_at < 
> '{{next_execution_date}}'{code}
> I execute the task using `airflow test ... 2019-07-01`.
>  The clause is rendered as
> {code:java}
> WHERE inserted_at >= '2019-07-01T00:00:00+02:00' AND inserted_at < 
> '2019-07-01 22:00:00+00:00'{code}
>  
> Note how `execution_date` is printed as UTC+2, while `next_execution_date` is 
> printed in UTC. I believe the timestamps actually decribe the correct 
> interval, but to be certain I tried explicitly converting to UTC:
> {code:java}
> WHERE inserted_at >= '{{execution_date.in_tz('UTC')}}' AND inserted_at < 
> '{{next_execution_date.in_tz('UTC')}}'{code}
> I then get an error that datetime.datetime does not have an in_tz method.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (AIRFLOW-5474) Add Basic Auth to Druid hook

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-5474:
---
Fix Version/s: (was: 2.0.0)
   1.10.7

> Add Basic Auth to Druid hook
> 
>
> Key: AIRFLOW-5474
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5474
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: hooks
>Affects Versions: 1.10.5
>Reporter: Adam Welsh
>Assignee: Adam Welsh
>Priority: Minor
> Fix For: 1.10.7
>
>
> Use login and password from druid ingestion connection to add Basic HTTP auth 
> to druid hook. If login and/or password is None then ensure hook still works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] zacharya19 commented on a change in pull request #6489: [AIRFLOW-3959] [AIRFLOW-4026] Add filter by DAG tags

2019-11-21 Thread GitBox

zacharya19 commented on a change in pull request #6489: [AIRFLOW-3959] 
[AIRFLOW-4026] Add filter by DAG tags
URL: https://github.com/apache/airflow/pull/6489#discussion_r349197932
 
 

 ##
 File path: airflow/www/views.py
 ##
 @@ -213,6 +218,18 @@ def get_int_arg(value, default=0):
 
 arg_current_page = request.args.get('page', '0')
 arg_search_query = request.args.get('search', None)
+arg_tags_filter = request.args.getlist('tags', None)
+flask_session.permanent = True
 
 Review comment:
   Yes, changed the entire flask session cookie to be permanent (with the 
expiration set in `PERMANENT_SESSION_LIFETIME`).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] kaxil commented on a change in pull request #6621: [AIRFLOW-6025] Add label to uniquely identify creator of Pod

2019-11-21 Thread GitBox

kaxil commented on a change in pull request #6621: [AIRFLOW-6025] Add label to 
uniquely identify creator of Pod
URL: https://github.com/apache/airflow/pull/6621#discussion_r349232197
 
 

 ##
 File path: airflow/contrib/operators/kubernetes_pod_operator.py
 ##
 @@ -127,6 +128,15 @@ def execute(self, context):
  
cluster_context=self.cluster_context,
  config_file=self.config_file)
 
+# Add Airflow Version to the label
+# And a label to identify that pod is launched by 
KubernetesPodOperator
+self.labels.update(
+{
+'airflow_version': airflow_version.replace('+', '-'),
 
 Review comment:
   
https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
 recommend some but given we already use labels and don't follow such 
conventions, I made this change :
   
   
https://github.com/apache/airflow/blob/fab957e763f40bf2a2398770312b4834fbd613e1/airflow/kubernetes/worker_configuration.py#L372-L378


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] codecov-io edited a comment on issue #6489: [AIRFLOW-3959] [AIRFLOW-4026] Add filter by DAG tags

2019-11-21 Thread GitBox

codecov-io edited a comment on issue #6489: [AIRFLOW-3959] [AIRFLOW-4026] Add 
filter by DAG tags
URL: https://github.com/apache/airflow/pull/6489#issuecomment-552128422
 
 
   # [Codecov](https://codecov.io/gh/apache/airflow/pull/6489?src=pr=h1) 
Report
   > Merging 
[#6489](https://codecov.io/gh/apache/airflow/pull/6489?src=pr=desc) into 
[master](https://codecov.io/gh/apache/airflow/commit/fab957e763f40bf2a2398770312b4834fbd613e1?src=pr=desc)
 will **increase** coverage by `0.01%`.
   > The diff coverage is `96.55%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/airflow/pull/6489/graphs/tree.svg?width=650=WdLKlKHOAU=150=pr)](https://codecov.io/gh/apache/airflow/pull/6489?src=pr=tree)
   
   ```diff
   @@Coverage Diff @@
   ##   master#6489  +/-   ##
   ==
   + Coverage83.8%   83.81%   +0.01% 
   ==
 Files 669  669  
 Lines   3756437609  +45 
   ==
   + Hits3148031523  +43 
   - Misses   6084 6086   +2
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/airflow/pull/6489?src=pr=tree) | 
Coverage Δ | |
   |---|---|---|
   | 
[airflow/example\_dags/example\_pig\_operator.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9waWdfb3BlcmF0b3IucHk=)
 | `100% <ø> (ø)` | :arrow_up: |
   | 
[airflow/example\_dags/example\_python\_operator.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9weXRob25fb3BlcmF0b3IucHk=)
 | `63.33% <ø> (ø)` | :arrow_up: |
   | 
[...ample\_dags/example\_branch\_python\_dop\_operator\_3.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9icmFuY2hfcHl0aG9uX2RvcF9vcGVyYXRvcl8zLnB5)
 | `75% <ø> (ø)` | :arrow_up: |
   | 
[...le\_dags/example\_passing\_params\_via\_test\_command.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9wYXNzaW5nX3BhcmFtc192aWFfdGVzdF9jb21tYW5kLnB5)
 | `100% <ø> (ø)` | :arrow_up: |
   | 
[airflow/example\_dags/example\_branch\_operator.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9icmFuY2hfb3BlcmF0b3IucHk=)
 | `100% <ø> (ø)` | :arrow_up: |
   | 
[airflow/example\_dags/example\_gcs\_to\_gcs.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9nY3NfdG9fZ2NzLnB5)
 | `100% <ø> (ø)` | :arrow_up: |
   | 
[...low/example\_dags/example\_trigger\_controller\_dag.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV90cmlnZ2VyX2NvbnRyb2xsZXJfZGFnLnB5)
 | `100% <ø> (ø)` | :arrow_up: |
   | 
[airflow/example\_dags/example\_bash\_operator.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9iYXNoX29wZXJhdG9yLnB5)
 | `94.44% <ø> (ø)` | :arrow_up: |
   | 
[airflow/example\_dags/example\_subdag\_operator.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV9zdWJkYWdfb3BlcmF0b3IucHk=)
 | `100% <ø> (ø)` | :arrow_up: |
   | 
[airflow/example\_dags/example\_trigger\_target\_dag.py](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree#diff-YWlyZmxvdy9leGFtcGxlX2RhZ3MvZXhhbXBsZV90cmlnZ2VyX3RhcmdldF9kYWcucHk=)
 | `90% <ø> (ø)` | :arrow_up: |
   | ... and [14 
more](https://codecov.io/gh/apache/airflow/pull/6489/diff?src=pr=tree-more) 
| |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/airflow/pull/6489?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/airflow/pull/6489?src=pr=footer). 
Last update 
[fab957e...bd060b1](https://codecov.io/gh/apache/airflow/pull/6489?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Commented] (AIRFLOW-3185) Add chunking to DBAPI_hook by implementing fetchmany and pandas chunksize

2019-11-21 Thread jack (Jira)



[ 
https://issues.apache.org/jira/browse/AIRFLOW-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979489#comment-16979489
 ] 

jack commented on AIRFLOW-3185:
---

[~tomanizer] do you have a final version to PR?

> Add chunking to DBAPI_hook by implementing fetchmany and pandas chunksize
> -
>
> Key: AIRFLOW-3185
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3185
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: hooks
>Affects Versions: 1.10.0
>Reporter: Thomas Haederle
>Assignee: Thomas Haederle
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> DbApiHook currently implements get_records and get_pandas_df, where both 
> methods fetch all records into memory.
> We should implement two new methods which return a generator with a 
> configurable chunksize:
> - def get_many_records(self, sql, parameters=None, chunksize=20, 
> iterate_singles=False):
> - def get_pandas_df_chunks(self, sql, parameters=None, chunksize=20)
> this should work for all DB hooks which inherit from this class.
> We could also adapt existing methods, but that could be problematic because 
> these methods will return a generator whereas the others return either 
> records or dataframes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (AIRFLOW-5451) Spark Submit Hook don't set namespace if default

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-5451.

Fix Version/s: (was: 2.0.0)
   1.10.7
   Resolution: Fixed

> Spark Submit Hook don't set namespace if default
> 
>
> Key: AIRFLOW-5451
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5451
> Project: Apache Airflow
>  Issue Type: Task
>  Components: hooks
>Affects Versions: 1.10.5
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.10.7
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (AIRFLOW-6028) Add and Looker Hook and Operators.

2019-11-21 Thread Nathan Hadfield (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Hadfield updated AIRFLOW-6028:
-
Component/s: operators

> Add and Looker Hook and Operators.
> --
>
> Key: AIRFLOW-6028
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6028
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: hooks, operators
>Affects Versions: 2.0.0
>Reporter: Nathan Hadfield
>Assignee: Nathan Hadfield
>Priority: Minor
> Fix For: 2.0.0
>
>
> This addition of a hook for Looker ([https://looker.com/]) will enable the 
> integration of Airflow with the Looker SDK.  This can then form the basis for 
> a suite of operators to automate common Looker actions, e.g. sending a Looker 
> dashboard via email.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (AIRFLOW-6028) Add and Looker Hook and Operators.

2019-11-21 Thread Nathan Hadfield (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Hadfield updated AIRFLOW-6028:
-
Summary: Add and Looker Hook and Operators.  (was: Add a Looker Hook.)

> Add and Looker Hook and Operators.
> --
>
> Key: AIRFLOW-6028
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6028
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: hooks
>Affects Versions: 2.0.0
>Reporter: Nathan Hadfield
>Assignee: Nathan Hadfield
>Priority: Minor
> Fix For: 2.0.0
>
>
> This addition of a hook for Looker ([https://looker.com/]) will enable the 
> integration of Airflow with the Looker SDK.  This can then form the basis for 
> a suite of operators to automate common Looker actions, e.g. sending a Looker 
> dashboard via email.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (AIRFLOW-6023) Remove deprecated Celery configs

2019-11-21 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/AIRFLOW-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979465#comment-16979465
 ] 

ASF subversion and git services commented on AIRFLOW-6023:
--

Commit 1d8b8cfcbc0d1d81758e42fcf7a789efd797c931 in airflow's branch 
refs/heads/master from Kaxil Naik
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=1d8b8cf ]

[AIRFLOW-6023] Remove deprecated Celery configs (#6620)



> Remove deprecated Celery configs
> 
>
> Key: AIRFLOW-6023
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6023
> Project: Apache Airflow
>  Issue Type: Task
>  Components: configuration
>Affects Versions: 1.10.6
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Some of the celery configs have been deprecated since 1.10 
> https://github.com/apache/airflow/blob/master/UPDATING.md#celery-config



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] kaxil merged pull request #6620: [AIRFLOW-6023] Remove deprecated Celery configs

2019-11-21 Thread GitBox

kaxil merged pull request #6620: [AIRFLOW-6023] Remove deprecated Celery configs
URL: https://github.com/apache/airflow/pull/6620
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Commented] (AIRFLOW-6023) Remove deprecated Celery configs

2019-11-21 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/AIRFLOW-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979464#comment-16979464
 ] 

ASF GitHub Bot commented on AIRFLOW-6023:
-

kaxil commented on pull request #6620: [AIRFLOW-6023] Remove deprecated Celery 
configs
URL: https://github.com/apache/airflow/pull/6620
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove deprecated Celery configs
> 
>
> Key: AIRFLOW-6023
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6023
> Project: Apache Airflow
>  Issue Type: Task
>  Components: configuration
>Affects Versions: 1.10.6
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Some of the celery configs have been deprecated since 1.10 
> https://github.com/apache/airflow/blob/master/UPDATING.md#celery-config



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (AIRFLOW-6034) Fix Deprecation Elasticsearch configs on Master

2019-11-21 Thread Kaxil Naik (Jira)

Kaxil Naik created AIRFLOW-6034:
---

 Summary: Fix Deprecation Elasticsearch configs on Master
 Key: AIRFLOW-6034
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6034
 Project: Apache Airflow
  Issue Type: New Feature
  Components: configuration
Affects Versions: 2.0.0
Reporter: Kaxil Naik
Assignee: Kaxil Naik


This has already been fixed in the master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (AIRFLOW-6035) Remove comand method in TaskInstance

2019-11-21 Thread Kamil Bregula (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Bregula updated AIRFLOW-6035:
---
Summary: Remove comand method in TaskInstance  (was: Remove comand method 
in Task)

> Remove comand method in TaskInstance
> 
>
> Key: AIRFLOW-6035
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6035
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.10.6
>Reporter: Kamil Bregula
>Priority: Trivial
>
> This method is not used. In addition, this method does not work properly 
> because the arguments should be processed using the shlex.quote function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (AIRFLOW-5701) Don't clear xcom explicitly before execution

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-5701:
---
Fix Version/s: (was: 2.0.0)

> Don't clear xcom explicitly before execution
> 
>
> Key: AIRFLOW-5701
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5701
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: xcom
>Affects Versions: 1.10.5
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Reopened] (AIRFLOW-5701) Don't clear xcom explicitly before execution

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor reopened AIRFLOW-5701:


> Don't clear xcom explicitly before execution
> 
>
> Key: AIRFLOW-5701
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5701
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: xcom
>Affects Versions: 1.10.5
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (AIRFLOW-5701) Don't clear xcom explicitly before execution

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-5701.

Resolution: Won't Fix

> Don't clear xcom explicitly before execution
> 
>
> Key: AIRFLOW-5701
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5701
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: xcom
>Affects Versions: 1.10.5
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Reopened] (AIRFLOW-4940) DynamoDB to S3 backup operator

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor reopened AIRFLOW-4940:


> DynamoDB to S3 backup operator
> --
>
> Key: AIRFLOW-4940
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4940
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: aws
>Affects Versions: 1.10.4
>Reporter: Chao-Han Tsai
>Assignee: Chao-Han Tsai
>Priority: Major
>
> Add an Airflow operator that back up DynamoDB table to S3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (AIRFLOW-4940) DynamoDB to S3 backup operator

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-4940.

Fix Version/s: 1.10.7
   Resolution: Fixed

> DynamoDB to S3 backup operator
> --
>
> Key: AIRFLOW-4940
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4940
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: aws
>Affects Versions: 1.10.4
>Reporter: Chao-Han Tsai
>Assignee: Chao-Han Tsai
>Priority: Major
> Fix For: 1.10.7
>
>
> Add an Airflow operator that back up DynamoDB table to S3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (AIRFLOW-6033) UI crashes at "Landing Time" after switching task_id caps/small letters

2019-11-21 Thread ivan de los santos (Jira)

ivan de los santos created AIRFLOW-6033:
---

 Summary: UI crashes at "Landing Time" after switching task_id 
caps/small letters
 Key: AIRFLOW-6033
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6033
 Project: Apache Airflow
  Issue Type: Bug
  Components: DAG, ui
Affects Versions: 1.10.6
Reporter: ivan de los santos


Airflow UI will crash in the browser returning "Oops" message and the Traceback 
of the crashing error.

This is caused by modifying a task_id with a capital/small letter, I will point 
out some examples that will cause airflow to crash:
 - task_id = "DUMMY_TASK" to task_id = "dUMMY_TASK"
 - task_id = "Dummy_Task" to task_id = "dummy_Task" or "Dummy_task",...
 - task_id = "Dummy_task" to task_id = "Dummy_tASk"

_

If you change the name of the task_id to something different such as, in our 
example:
 - task_id = "Dummy_Task" to task_id = "DummyTask" or "Dummytask"

It won't fail since it will be recognized as new tasks, which is the expected 
behaviour.

If we switch back the modified name to the original name it won't crash since 
it will access to the correct tasks instances. I will explain in next 
paragraphs where this error is located.

_

 *How to replicate*: 
 # Launch airflow webserver -p 8080
 # Go to the Airflow-UI
 # Create an example DAG with a task_id name up to your choice in small letters 
(ex. "run")
 # Launch the DAG and wait its execution to finish
 # Modify the task_id inside the DAG with the first letter to capital letter 
(ex. "Run")
 # Refresh the DAG
 # Go to "Landing Times" inside the DAG menu in the UI
 # You will get an "oops" message with the Traceback.

 

*File causing the problem*:  
[https://github.com/apache/airflow/blob/master/airflow/www/views.py] (lines 
1643 - 1654)

 

*Reasons of the problem*:
 #  KeyError: 'run', meaning a dictionary does not contain the task_id "run", 
it will get more into the details of where this comes from.

{code:python}
Traceback (most recent call last):
  File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 2446, 
in wsgi_app
response = self.full_dispatch_request()
  File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 1951, 
in full_dispatch_request
rv = self.handle_user_exception(e)
  File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 1820, 
in handle_user_exception
reraise(exc_type, exc_value, tb)
  File "/home/rde/.local/lib/python3.6/site-packages/flask/_compat.py", line 
39, in reraise
raise value
  File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 1949, 
in full_dispatch_request
rv = self.dispatch_request()
  File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 1935, 
in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/rde/.local/lib/python3.6/site-packages/flask_admin/base.py", line 
69, in inner
return self._run_view(f, *args, **kwargs)
  File "/home/rde/.local/lib/python3.6/site-packages/flask_admin/base.py", line 
368, in _run_view
return fn(self, *args, **kwargs)
  File "/home/rde/.local/lib/python3.6/site-packages/flask_login/utils.py", 
line 258, in decorated_view
return func(*args, **kwargs)
  File "/home/rde/.local/lib/python3.6/site-packages/airflow/www/utils.py", 
line 295, in wrapper
return f(*args, **kwargs)
  File "/home/rde/.local/lib/python3.6/site-packages/airflow/utils/db.py", line 
74, in wrapper
return func(*args, **kwargs)
  File "/home/rde/.local/lib/python3.6/site-packages/airflow/www/views.py", 
line 1921, in landing_times
x[ti.task_id].append(dttm)
KeyError: 'run'

{code}
_
h2. Code
{code:python}
for task in dag.tasks:
y[task.task_id] = []
x[task.task_id] = []

for ti in task.get_task_instances(start_date=min_date, end_date=base_date):

ts = ti.execution_date
if dag.schedule_interval and dag.following_schedule(ts):
ts = dag.following_schedule(ts)
if ti.end_date:
dttm = wwwutils.epoch(ti.execution_date)
secs = (ti.end_date - ts).total_seconds()
x[ti.task_id].append(dttm)
y[ti.task_id].append(secs)

{code}
 
We can see in first two lines inside the first for loop, how the dictionary x 
and y is being filled with tasks_id attributes which comes from the actual DAG.

*The problem actually comes in the second for loop* when you get the task 
instances from a DAG, I am not sure about this next part and I wish someone to 
clarify my question about this.

I think that the task instances (ti) received from get_task_instances() 
function comes from the information stored into the database, that is the 
reason of crash when you access to "Landing Times" page, is that the x and y 
where filled

[jira] [Assigned] (AIRFLOW-6033) UI crashes at "Landing Time" after switching task_id caps/small letters

2019-11-21 Thread ivan de los santos (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ivan de los santos reassigned AIRFLOW-6033:
---

Assignee: ivan de los santos

> UI crashes at "Landing Time" after switching task_id caps/small letters
> ---
>
> Key: AIRFLOW-6033
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6033
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, ui
>Affects Versions: 1.10.6
>Reporter: ivan de los santos
>Assignee: ivan de los santos
>Priority: Minor
>
> Airflow UI will crash in the browser returning "Oops" message and the 
> Traceback of the crashing error.
> This is caused by modifying a task_id with a capital/small letter, I will 
> point out some examples that will cause airflow to crash:
>  - task_id = "DUMMY_TASK" to task_id = "dUMMY_TASK"
>  - task_id = "Dummy_Task" to task_id = "dummy_Task" or "Dummy_task",...
>  - task_id = "Dummy_task" to task_id = "Dummy_tASk"
> _
> If you change the name of the task_id to something different such as, in our 
> example:
>  - task_id = "Dummy_Task" to task_id = "DummyTask" or "Dummytask"
> It won't fail since it will be recognized as new tasks, which is the expected 
> behaviour.
> If we switch back the modified name to the original name it won't crash since 
> it will access to the correct tasks instances. I will explain in next 
> paragraphs where this error is located.
> _
>  *How to replicate*: 
>  # Launch airflow webserver -p 8080
>  # Go to the Airflow-UI
>  # Create an example DAG with a task_id name up to your choice in small 
> letters (ex. "run")
>  # Launch the DAG and wait its execution to finish
>  # Modify the task_id inside the DAG with the first letter to capital letter 
> (ex. "Run")
>  # Refresh the DAG
>  # Go to "Landing Times" inside the DAG menu in the UI
>  # You will get an "oops" message with the Traceback.
>  
> *File causing the problem*:  
> [https://github.com/apache/airflow/blob/master/airflow/www/views.py] (lines 
> 1643 - 1654)
>  
> *Reasons of the problem*:
>  #  KeyError: 'run', meaning a dictionary does not contain the task_id "run", 
> it will get more into the details of where this comes from.
> {code:python}
> Traceback (most recent call last):
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 2446, in wsgi_app
> response = self.full_dispatch_request()
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1951, in full_dispatch_request
> rv = self.handle_user_exception(e)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1820, in handle_user_exception
> reraise(exc_type, exc_value, tb)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/_compat.py", line 
> 39, in reraise
> raise value
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1949, in full_dispatch_request
> rv = self.dispatch_request()
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1935, in dispatch_request
> return self.view_functions[rule.endpoint](**req.view_args)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask_admin/base.py", 
> line 69, in inner
> return self._run_view(f, *args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask_admin/base.py", 
> line 368, in _run_view
> return fn(self, *args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask_login/utils.py", 
> line 258, in decorated_view
> return func(*args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/airflow/www/utils.py", 
> line 295, in wrapper
> return f(*args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/airflow/utils/db.py", 
> line 74, in wrapper
> return func(*args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/airflow/www/views.py", 
> line 1921, in landing_times
> x[ti.task_id].append(dttm)
> KeyError: 'run'
> {code}
> _
> h2. Code
> {code:python}
> for task in dag.tasks:
> y[task.task_id] = []
> x[task.task_id] = []
> for ti in task.get_task_instances(start_date=min_date, 
> end_date=base_date):
> ts = ti.execution_date
> if dag.schedule_interval and dag.following_schedule(ts):
> ts = dag.following_schedule(ts)
> if ti.end_date:
> dttm = wwwutils.epoch(ti.execution_date)
> secs = (ti.end_date - ts).total_seconds()
> x[ti.task_id].append(dttm)
> y[ti.task_id].append(secs)
> {code}
>  
> We can see in first two lines inside the first for loop, how the dictionary x 
> and y is being filled with

[jira] [Assigned] (AIRFLOW-6033) UI crashes at "Landing Time" after switching task_id caps/small letters

2019-11-21 Thread ivan de los santos (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ivan de los santos reassigned AIRFLOW-6033:
---

Assignee: (was: ivan de los santos)

> UI crashes at "Landing Time" after switching task_id caps/small letters
> ---
>
> Key: AIRFLOW-6033
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6033
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, ui
>Affects Versions: 1.10.6
>Reporter: ivan de los santos
>Priority: Minor
>
> Airflow UI will crash in the browser returning "Oops" message and the 
> Traceback of the crashing error.
> This is caused by modifying a task_id with a capital/small letter, I will 
> point out some examples that will cause airflow to crash:
>  - task_id = "DUMMY_TASK" to task_id = "dUMMY_TASK"
>  - task_id = "Dummy_Task" to task_id = "dummy_Task" or "Dummy_task",...
>  - task_id = "Dummy_task" to task_id = "Dummy_tASk"
> _
> If you change the name of the task_id to something different such as, in our 
> example:
>  - task_id = "Dummy_Task" to task_id = "DummyTask" or "Dummytask"
> It won't fail since it will be recognized as new tasks, which is the expected 
> behaviour.
> If we switch back the modified name to the original name it won't crash since 
> it will access to the correct tasks instances. I will explain in next 
> paragraphs where this error is located.
> _
>  *How to replicate*: 
>  # Launch airflow webserver -p 8080
>  # Go to the Airflow-UI
>  # Create an example DAG with a task_id name up to your choice in small 
> letters (ex. "run")
>  # Launch the DAG and wait its execution to finish
>  # Modify the task_id inside the DAG with the first letter to capital letter 
> (ex. "Run")
>  # Refresh the DAG
>  # Go to "Landing Times" inside the DAG menu in the UI
>  # You will get an "oops" message with the Traceback.
>  
> *File causing the problem*:  
> [https://github.com/apache/airflow/blob/master/airflow/www/views.py] (lines 
> 1643 - 1654)
>  
> *Reasons of the problem*:
>  #  KeyError: 'run', meaning a dictionary does not contain the task_id "run", 
> it will get more into the details of where this comes from.
> {code:python}
> Traceback (most recent call last):
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 2446, in wsgi_app
> response = self.full_dispatch_request()
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1951, in full_dispatch_request
> rv = self.handle_user_exception(e)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1820, in handle_user_exception
> reraise(exc_type, exc_value, tb)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/_compat.py", line 
> 39, in reraise
> raise value
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1949, in full_dispatch_request
> rv = self.dispatch_request()
>   File "/home/rde/.local/lib/python3.6/site-packages/flask/app.py", line 
> 1935, in dispatch_request
> return self.view_functions[rule.endpoint](**req.view_args)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask_admin/base.py", 
> line 69, in inner
> return self._run_view(f, *args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask_admin/base.py", 
> line 368, in _run_view
> return fn(self, *args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/flask_login/utils.py", 
> line 258, in decorated_view
> return func(*args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/airflow/www/utils.py", 
> line 295, in wrapper
> return f(*args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/airflow/utils/db.py", 
> line 74, in wrapper
> return func(*args, **kwargs)
>   File "/home/rde/.local/lib/python3.6/site-packages/airflow/www/views.py", 
> line 1921, in landing_times
> x[ti.task_id].append(dttm)
> KeyError: 'run'
> {code}
> _
> h2. Code
> {code:python}
> for task in dag.tasks:
> y[task.task_id] = []
> x[task.task_id] = []
> for ti in task.get_task_instances(start_date=min_date, 
> end_date=base_date):
> ts = ti.execution_date
> if dag.schedule_interval and dag.following_schedule(ts):
> ts = dag.following_schedule(ts)
> if ti.end_date:
> dttm = wwwutils.epoch(ti.execution_date)
> secs = (ti.end_date - ts).total_seconds()
> x[ti.task_id].append(dttm)
> y[ti.task_id].append(secs)
> {code}
>  
> We can see in first two lines inside the first for loop, how the dictionary x 
> and y is being filled with tasks_id attributes which comes from

[jira] [Work started] (AIRFLOW-5931) Spawning new python interpreter for every task slow

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on AIRFLOW-5931 started by Ash Berlin-Taylor.
--
> Spawning new python interpreter for every task slow
> ---
>
> Key: AIRFLOW-5931
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5931
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: executors, worker
>Affects Versions: 2.0.0
>Reporter: Ash Berlin-Taylor
>Assignee: Ash Berlin-Taylor
>Priority: Major
>
> There are a number of places in the Executors and Task Runners where we spawn 
> a whole new python interpreter.
> My profiling has shown that this is slow. Rather than running a fresh python 
> interpreter which then has to re-load all of Airflow and its dependencies we 
> should use {{os.fork}} when it is available/suitable which should speed up 
> task running, espeically for short lived tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] kaxil commented on issue #6628: [AIRFLOW-6034] Fix Deprecation Elasticsearch configs on Master

2019-11-21 Thread GitBox

kaxil commented on issue #6628: [AIRFLOW-6034] Fix Deprecation Elasticsearch 
configs on Master
URL: https://github.com/apache/airflow/pull/6628#issuecomment-557211496
 
 
   I had fixed incorrect label in https://github.com/apache/airflow/pull/6620 
but I didn't know that they were wrong i. key were actually values and vice 
versa.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (AIRFLOW-4897) Location not used to create empty dataset by bigquery_hook cursor

2019-11-21 Thread Kamil Bregula (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Bregula updated AIRFLOW-4897:
---
Component/s: gcp

> Location not used to create empty dataset by bigquery_hook cursor
> -
>
> Key: AIRFLOW-4897
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4897
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: gcp, hooks
>Affects Versions: 1.10.2, 1.10.3
> Environment: composer-1.7.1-airflow-1.10.2
> Python 3
>Reporter: Benjamin
>Priority: Major
>
> {code:java}
> bq_cursor = BigQueryHook(use_legacy_sql=False,
>  bigquery_conn_id='google_cloud_default',
>  location=EU").get_conn().cursor()
> print(f'Location Cursor : {bq_cursor.location}') // EU is printed
> bq_cursor.create_empty_dataset(dataset_id, project_id){code}
> 'EU' is printed but my empty dataset has been created in location : 'US'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] deshraj commented on issue #6380: [AIRFLOW-3632] Allow replace_microseconds in trigger_dag REST request

2019-11-21 Thread GitBox

deshraj commented on issue #6380: [AIRFLOW-3632] Allow replace_microseconds in 
trigger_dag REST request
URL: https://github.com/apache/airflow/pull/6380#issuecomment-557221923
 
 
   @ashb may I ask you about the release timeline for this feature? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Created] (AIRFLOW-6035) Remove comand method in Task

2019-11-21 Thread Kamil Bregula (Jira)

Kamil Bregula created AIRFLOW-6035:
--

 Summary: Remove comand method in Task
 Key: AIRFLOW-6035
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6035
 Project: Apache Airflow
  Issue Type: Bug
  Components: core
Affects Versions: 1.10.6
Reporter: Kamil Bregula


This method is not used. In addition, this method does not work properly 
because the arguments should be processed using the shlex.quote function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] kaxil commented on a change in pull request #6396: [AIRFLOW-5726] Delete table as file name in RedshiftToS3Transfer

2019-11-21 Thread GitBox

kaxil commented on a change in pull request #6396: [AIRFLOW-5726] Delete table 
as file name in RedshiftToS3Transfer
URL: https://github.com/apache/airflow/pull/6396#discussion_r349161658
 
 

 ##
 File path: tests/operators/test_redshift_to_s3_operator.py
 ##
 @@ -31,7 +32,8 @@ class TestRedshiftToS3Transfer(unittest.TestCase):
 
 @mock.patch("boto3.session.Session")
 @mock.patch("airflow.hooks.postgres_hook.PostgresHook.run")
-def test_execute(self, mock_run, mock_session):
+@parameterized.expand([(True, ), (False, )])
+def test_execute(self, mock_run, mock_session, boolean_value):
 
 Review comment:
   ```diff
   -@mock.patch("boto3.session.Session")
   -@mock.patch("airflow.hooks.postgres_hook.PostgresHook.run")
   -@parameterized.expand([(True, ), (False, )])
   -def test_execute(self, mock_run, mock_session, boolean_value):
   
   +@parameterized.expand([(True, ), (False, )])
   +@mock.patch("boto3.session.Session")
   +@mock.patch("airflow.hooks.postgres_hook.PostgresHook.run")
   +def test_execute(self, boolean_value, mock_run, mock_session):
   ```
   
   I going to try and run this my test on my machine and let you know


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] OmerJog commented on issue #5317: [AIRFLOW-4562 ] Fix missing try_number parameter in TaskInstance.log_filepath method

2019-11-21 Thread GitBox

OmerJog commented on issue #5317: [AIRFLOW-4562 ] Fix missing try_number 
parameter in TaskInstance.log_filepath method
URL: https://github.com/apache/airflow/pull/5317#issuecomment-557143006
 
 
   @XD-DENG @nurikk Did you mean for this PR to be forgotten? :( 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (AIRFLOW-5936) Allow explicit get_pty in SSHOperator

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-5936:
---
Fix Version/s: (was: 2.0.0)
   1.10.7

> Allow explicit get_pty in SSHOperator
> -
>
> Key: AIRFLOW-5936
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5936
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: operators
>Affects Versions: 1.10.6
>Reporter: Zikun Zhu
>Priority: Major
> Fix For: 1.10.7
>
>
> Currently when execution_timeout is reached for an SSHOperator task, the ssh 
> connection will be closed but the remote process continues to run. In many 
> scenarios, users might want the process to be killed upon task timeout. 
> Giving users an explicit get_pty option achieves this goal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (AIRFLOW-5928) hive hooks load_file short circuit

2019-11-21 Thread Ash Berlin-Taylor (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-5928.

Fix Version/s: 1.10.7
   Resolution: Fixed

> hive hooks load_file short circuit
> --
>
> Key: AIRFLOW-5928
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5928
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: hooks
>Affects Versions: 1.10.6
>Reporter: zhongjiajie
>Assignee: zhongjiajie
>Priority: Major
> Fix For: 1.10.7
>
>
> If function `load_file` with parameter `create` and `recreate` are `False`, 
> `hql = ''` and should not call `HiveCliHook.run_cli`
> Due to `recreate` in two `if` statement, `HiveCliHook.run_cli` only need to 
> in the last one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] kaxil commented on a change in pull request #6396: [AIRFLOW-5726] Delete table as file name in RedshiftToS3Transfer

2019-11-21 Thread GitBox

kaxil commented on a change in pull request #6396: [AIRFLOW-5726] Delete table 
as file name in RedshiftToS3Transfer
URL: https://github.com/apache/airflow/pull/6396#discussion_r349161658
 
 

 ##
 File path: tests/operators/test_redshift_to_s3_operator.py
 ##
 @@ -31,7 +32,8 @@ class TestRedshiftToS3Transfer(unittest.TestCase):
 
 @mock.patch("boto3.session.Session")
 @mock.patch("airflow.hooks.postgres_hook.PostgresHook.run")
-def test_execute(self, mock_run, mock_session):
+@parameterized.expand([(True, ), (False, )])
+def test_execute(self, mock_run, mock_session, boolean_value):
 
 Review comment:
   
   
   I going to try and run this my test on my machine and let you know


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] kaxil commented on issue #6396: [AIRFLOW-5726] Delete table as file name in RedshiftToS3Transfer

2019-11-21 Thread GitBox

kaxil commented on issue #6396: [AIRFLOW-5726] Delete table as file name in 
RedshiftToS3Transfer
URL: https://github.com/apache/airflow/pull/6396#issuecomment-557171010
 
 
   @JavierLopezT Apply the following code and the test will pass. Passes on my 
local machine:
   
   
   ```diff
   diff --git a/tests/operators/test_redshift_to_s3_operator.py 
b/tests/operators/test_redshift_to_s3_operator.py
   index 5fd8d46e3..baa4aad32 100644
   --- a/tests/operators/test_redshift_to_s3_operator.py
   +++ b/tests/operators/test_redshift_to_s3_operator.py
   @@ -30,10 +30,13 @@ from airflow.utils.tests import 
assertEqualIgnoreMultipleSpaces
   
class TestRedshiftToS3Transfer(unittest.TestCase):
   
   +@parameterized.expand([
   +[True, "key/table_"],
   +[False, "key"],
   +])
@mock.patch("boto3.session.Session")
@mock.patch("airflow.hooks.postgres_hook.PostgresHook.run")
   -@parameterized.expand([(True, ), (False, )])
   -def test_execute(self, mock_run, mock_session, boolean_value):
   +def test_execute(self, table_as_file_name, expected_s3_key, mock_run, 
mock_session,):
access_key = "aws_access_key_id"
secret_key = "aws_secret_access_key"
mock_session.return_value = Session(access_key, secret_key)
   @@ -42,7 +45,6 @@ class TestRedshiftToS3Transfer(unittest.TestCase):
s3_bucket = "bucket"
s3_key = "key"
unload_options = ['HEADER', ]
   -table_as_file_name = boolean_value
   
RedshiftToS3Transfer(
schema=schema,
   @@ -62,14 +64,14 @@ class TestRedshiftToS3Transfer(unittest.TestCase):
select_query = "SELECT * FROM 
{schema}.{table}".format(schema=schema, table=table)
unload_query = """
UNLOAD ('{select_query}')
   -TO 's3://{s3_bucket}/{s3_key}/{table}_'
   +TO 's3://{s3_bucket}/{s3_key}'
with credentials

'aws_access_key_id={access_key};aws_secret_access_key={secret_key}'
{unload_options};
""".format(select_query=select_query,
   -   table=table,
   +   # table=table,
   s3_bucket=s3_bucket,
   -   s3_key=s3_key,
   +   s3_key=expected_s3_key,
   access_key=access_key,
   secret_key=secret_key,
   unload_options=unload_options)
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Work started] (AIRFLOW-5947) Make the json backend pluggable for DAG Serialization

2019-11-21 Thread Kaxil Naik (Jira)



 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on AIRFLOW-5947 started by Kaxil Naik.
---
> Make the json backend pluggable for DAG Serialization
> -
>
> Key: AIRFLOW-5947
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5947
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, scheduler
>Affects Versions: 2.0.0, 1.10.7
>Reporter: Kaxil Naik
>Assignee: Kaxil Naik
>Priority: Major
>
> Allow users the option to choose the JSON library of their choice for DAG 
> Serialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [airflow] ashb commented on a change in pull request #6620: [AIRFLOW-6023] Remove deprecated Celery configs

2019-11-21 Thread GitBox

ashb commented on a change in pull request #6620: [AIRFLOW-6023] Remove 
deprecated Celery configs
URL: https://github.com/apache/airflow/pull/6620#discussion_r349228457
 
 

 ##
 File path: airflow/configuration.py
 ##
 @@ -114,14 +112,7 @@ class AirflowConfigParser(ConfigParser):
 # new_name, the old_name will be checked to see if it exists. If it does a
 # DeprecationWarning will be issued and the old name will be used instead
 deprecated_options = {
-'celery': {
-# Remove these keys in Airflow 1.11
-'worker_concurrency': 'celeryd_concurrency',
-'result_backend': 'celery_result_backend',
-'broker_url': 'celery_broker_url',
-'ssl_active': 'celery_ssl_active',
-'ssl_cert': 'celery_ssl_cert',
-'ssl_key': 'celery_ssl_key',
+'elasticsearch': {
 
 Review comment:
   (we should do this on its commit on v1.10 too)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

1 2 >

1 - 100 of 182 matches

Mail list logo