[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r350232483 ## File path: docs/best-practices.rst ## @@ -0,0 +1,296 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow can retry a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem as the next task is likely to run on a different server without access to it — for example, a task that downloads the data file that the next task processes. +In the case of :class:`Local executor `, +storing a file on disk can make retries harder e.g., your task requires a config file that is deleted by another task in DAG. 
+ +If possible, use ``XCom`` to communicate small messages between tasks and a good way of passing larger data between tasks is to use a remote storage such as S3/HDFS. +For example, if we have a task that stores processed data in S3 that task can push the S3 path for the output data in ``Xcom``, +and the downstream tasks can pull the path from XCom and use it to read the data. + +The tasks should also not store any authentication parameters such as passwords or token inside them. +Where at all possible, use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id. + + +Variables +- + +You should avoid usage of Variables outside an operator's ``execute()`` method or Jinja templates if possible, +as Variables create a connection to metadata DB of Airflow to fetch the value, which can slow down parsing and place
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349801794 ## File path: docs/best-practices.rst ## @@ -0,0 +1,296 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. +In the case of :class:`Local executor `, +storing a file on disk can make retries harder e.g., your task requires a config file that is deleted by another task in DAG. + +If possible, use ``XCom`` to communicate small messages between tasks or S3/HDFS to communicate large messages/files. 
+For example, a task that stores processed data in S3. The task can push the S3 path for the latest data in ``Xcom``, +and the downstream tasks can pull the path from XCom and use it to read the data. + +The tasks should also not store any authentication parameters such as passwords or token inside them. +Where at all possible, use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id. + + +Variables +- + +You should avoid usage of Variables outside an operator's ``execute()`` method or Jinja templates if possible, +as Variables create a connection to metadata DB of Airflow to fetch the value, which can slow down parsing and place extra load on the DB. + +Airflow parses all the DAGs in the background at a specific period. +The default period is set using
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349800313 ## File path: docs/best-practices.rst ## @@ -0,0 +1,296 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. +In the case of :class:`Local executor `, +storing a file on disk can make retries harder e.g., your task requires a config file that is deleted by another task in DAG. + +If possible, use ``XCom`` to communicate small messages between tasks or S3/HDFS to communicate large messages/files. 
Review comment:

```suggestion
If possible, use ``XCom`` to communicate small messages between tasks and a good way of passing larger data between tasks is to use a remote storage such as S3/HDFS.
```

We should probably find a link for xcom too.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
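A minimal sketch of the remote-storage half of this suggestion, combined with the partitioning advice in the quoted text. The bucket name, path layout, and the helper `build_output_path` are hypothetical and only illustrate the idea; they are not part of Airflow.

```python
from datetime import date


def build_output_path(bucket: str, dataset: str, execution_date: date) -> str:
    """Return a remote path partitioned by the run's execution date."""
    return f"s3://{bucket}/{dataset}/dt={execution_date.isoformat()}/part-0000.parquet"


# The same execution date always maps to the same partition, so a re-run
# overwrites its own output instead of reading or writing "latest" data.
path = build_output_path("example-bucket", "events", date(2019, 11, 25))
print(path)
```

Only a short string like this path would need to travel through XCom; the data itself stays in remote storage.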
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349798847 ## File path: docs/best-practices.rst ## @@ -0,0 +1,296 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
Review comment: Can, if configured. Default is to not retry though.
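As a sketch of what "if configured" can look like: `retries` and `retry_delay` are standard operator arguments that are commonly supplied through `default_args`. The values and the owner name below are illustrative assumptions, not recommendations.

```python
from datetime import timedelta

# Illustrative values only. Without an explicit ``retries`` setting,
# a failed task is not retried.
default_args = {
    "owner": "airflow",
    "retries": 2,                         # retry a failed task up to two times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}

print(default_args["retries"], default_args["retry_delay"])
```

Passing this dictionary as `default_args` when creating the DAG applies the same retry policy to every task in it, which is why each task needs to produce the same outcome on a re-run.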
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349800893 ## File path: docs/best-practices.rst ## @@ -0,0 +1,271 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG in different directories, which can even be present +on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. + +Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files. + +The tasks should also not store any authentication parameters such as passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id. + + +Variables +- + +You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value. +Airflow parses all the DAGs in the background at a specific period. +The default period is set using ``processor_poll_interval`` config, which is by default 1 second. During parsing, Airflow creates a new connection to the metadata DB for each Variable. +It can result in a lot of open connections.

Review comment: We configure things to use a session pool, or _should_. But I have not tested this to see what actually happens in a long time :)
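A rough sketch of why the placement of a Variable lookup matters. It counts lookups rather than database connections, since the connection behaviour is exactly what the comment above leaves open. `FakeVariable` is a hypothetical stand-in for `airflow.models.Variable` so that the example runs without Airflow, and `report_bucket` is an invented key.

```python
class FakeVariable:
    """Stand-in for airflow.models.Variable that counts metadata lookups."""

    lookups = 0

    @classmethod
    def get(cls, key):
        cls.lookups += 1
        return f"value-of-{key}"


def parse_eager():
    # The lookup happens while the DAG file is being parsed.
    bucket = FakeVariable.get("report_bucket")
    return lambda: bucket


def parse_lazy():
    # The lookup is deferred until the task body actually runs.
    return lambda: FakeVariable.get("report_bucket")


# Simulate three parsing passes of each DAG file.
for _ in range(3):
    parse_eager()
eager_lookups = FakeVariable.lookups

FakeVariable.lookups = 0
for _ in range(3):
    task = parse_lazy()
lazy_lookups = FakeVariable.lookups

print(eager_lookups, lazy_lookups)
```

In a real DAG the deferred form corresponds to calling `Variable.get` inside `execute()` or using a Jinja template in a templated field, so that parsing alone triggers no lookup.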
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349799603 ## File path: docs/best-practices.rst ## @@ -0,0 +1,296 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. Review comment: ```suggestion Therefore, you should not store any file or config in the local filesystem as the next task is likely to run on a different server without access to it — for example, a task that downloads the data file that the next task processes. ``` This is an automated message from the Apache Git Service. 
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349800563 ## File path: docs/best-practices.rst ## @@ -0,0 +1,296 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. +In the case of :class:`Local executor `, +storing a file on disk can make retries harder e.g., your task requires a config file that is deleted by another task in DAG. + +If possible, use ``XCom`` to communicate small messages between tasks or S3/HDFS to communicate large messages/files. 
+For example, a task that stores processed data in S3. The task can push the S3 path for the latest data in ``Xcom``,

Review comment:

```suggestion
For example, if we have a task that stores processed data in S3 that task can push the S3 path for the output data in ``Xcom``,
```
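The pattern in this suggestion could be sketched as follows. The task id, bucket, and path layout are hypothetical, and `FakeTaskInstance` is a stand-in that lets the example run without Airflow. In a real DAG the two functions would be Python callables that receive the task context (including `ds` and `ti`), and the return value of the first would be pushed to XCom.

```python
def store_processed_data(**context):
    """Upstream task: write output to remote storage and return only its path."""
    output_path = f"s3://example-bucket/processed/dt={context['ds']}/data.csv"
    # ... upload the processed data to output_path here ...
    # Returning the path is what makes it available to downstream tasks.
    return output_path


def read_processed_data(**context):
    """Downstream task: pull the path from XCom and read from remote storage."""
    path = context["ti"].xcom_pull(task_ids="store_processed_data")
    # ... download and process the data at ``path`` here ...
    return path


class FakeTaskInstance:
    """Stand-in for Airflow's TaskInstance so the sketch runs on its own."""

    def __init__(self, values):
        self._values = values

    def xcom_pull(self, task_ids):
        return self._values[task_ids]


pushed = store_processed_data(ds="2019-11-25")
ti = FakeTaskInstance({"store_processed_data": pushed})
print(read_processed_data(ti=ti))
```

Only the path passes between the tasks, so the downstream task works regardless of which server it runs on.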
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349801887 ## File path: docs/best-practices.rst ## @@ -0,0 +1,296 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. +In the case of :class:`Local executor `, +storing a file on disk can make retries harder e.g., your task requires a config file that is deleted by another task in DAG. + +If possible, use ``XCom`` to communicate small messages between tasks or S3/HDFS to communicate large messages/files. 
+For example, a task that stores processed data in S3. The task can push the S3 path for the latest data in ``Xcom``, +and the downstream tasks can pull the path from XCom and use it to read the data. + +The tasks should also not store any authentication parameters such as passwords or token inside them. +Where at all possible, use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id. + + +Variables +- + +You should avoid usage of Variables outside an operator's ``execute()`` method or Jinja templates if possible, +as Variables create a connection to metadata DB of Airflow to fetch the value, which can slow down parsing and place extra load on the DB. + +Airflow parses all the DAGs in the background at a specific period. +The default period is set using
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349062029 ## File path: docs/best-practices.rst ## @@ -0,0 +1,271 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG in different directories, which can even be present +on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. + +Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files. + +The tasks should also not store any authentication parameters such as passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id. + + +Variables +- + +You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value. +Airflow parses all the DAGs in the background at a specific period. +The default period is set using ``processor_poll_interval`` config, which is by default 1 second. During parsing, Airflow creates a new connection to the metadata DB for each Variable. +It can result in a lot of open connections. + +If you really want to use Variables, we advice to use them from a Jinja template with the syntax : + +.. code:: + +{{ var.value. }} + +or if you need to deserialize a json object from the variable : + +.. code:: + +{{ var.json. }} + +
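The re-run advice in the quoted section (replace INSERT with UPSERT, and read and write a partition keyed on ``execution_date``) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module with SQLite's `INSERT OR REPLACE` as the upsert variant, and the table `daily_totals`, the bucket `example-bucket`, and both helper functions are hypothetical names, not part of Airflow or of the PR.

```python
import sqlite3
from datetime import date


def partition_path(bucket: str, execution_date: date) -> str:
    """Build an output path keyed on the run's execution date, not on now()."""
    return f"s3://{bucket}/daily_totals/dt={execution_date.isoformat()}/part.csv"


def write_daily_total(conn: sqlite3.Connection, execution_date: date, total: int) -> None:
    """Write one row per execution date; a retry replaces the row instead of duplicating it."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_totals (dt, total) VALUES (?, ?)",
        (execution_date.isoformat(), total),
    )
    conn.commit()


# Hypothetical in-memory table standing in for the task's target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (dt TEXT PRIMARY KEY, total INTEGER)")

run_date = date(2019, 11, 25)
write_daily_total(conn, run_date, 42)
write_daily_total(conn, run_date, 42)  # simulated retry of the same task instance

row_count = conn.execute("SELECT COUNT(*) FROM daily_totals").fetchone()[0]
output_path = partition_path("example-bucket", run_date)
```

Because both the row key and the output path are derived from the execution date, running the task a second time for the same date leaves a single row and targets the same path.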
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready URL: https://github.com/apache/airflow/pull/6515#discussion_r349049920 ## File path: docs/best-practices.rst ## @@ -0,0 +1,271 @@ + .. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +Best Practices +== + +Running Airflow in production is seamless. It comes bundled with all the plugins and configs +necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors. +Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG +to the actual deployment in the production environment. + + +Writing a DAG +^^ +Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of +to ensure the DAG run or failure does not produce unexpected results. + +Creating a task +--- + +You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce +incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task. + +Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run. 
+Some of the ways you can avoid producing a different result - + +* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database. + Replace it with UPSERT. +* Read and write in a specific partition. Never read the latest available data in a task. + Someone may update the input data between re-runs, which results in different outputs. + A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition. + You should follow this partitioning method while writing data in S3/HDFS, as well. +* The python datetime ``now()`` function gives the current datetime object. + This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run. + It's fine to use it, for example, to generate a temporary log. + +.. tip:: + +You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task. +The ``default_args`` help to avoid mistakes such as typographical errors. + + +Deleting a task + + +Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI. +It is advised to create a new DAG in case the tasks need to be deleted. + + +Communication +-- + +Airflow executes tasks of a DAG in different directories, which can even be present +on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`. +Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes. + +Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files. + +The tasks should also not store any authentication parameters such as passwords or token inside them. 
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+-
+
+You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value.

Review comment:

```suggestion
You should avoid usage of Variables outside an operator's ``execute()`` method or Jinja templates if possible,
as Variables create a connection to metadata DB of Airflow to fetch the value which can slow down parsing and place extra load on the DB.
```

(Using Variables is common in dynamic dags.)
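The cost the suggestion describes (a lookup on every parse of the DAG file versus a lookup only when the task runs) can be illustrated with a plain-Python stand-in. This sketch does not use Airflow's API: `get_variable`, `FAKE_METADATA_DB`, and the two `parse_*` functions are hypothetical names that only simulate a metadata DB query and a parse of a DAG file.

```python
from typing import Callable

# Counts simulated metadata DB queries.
lookups = {"count": 0}
FAKE_METADATA_DB = {"target_table": "events"}


def get_variable(key: str) -> str:
    """Stand-in for a Variable lookup; each call represents one metadata DB query."""
    lookups["count"] += 1
    return FAKE_METADATA_DB[key]


def parse_eager_dag_file() -> Callable[[], str]:
    """Top-level lookup: the query runs every time the file is parsed."""
    table = get_variable("target_table")

    def execute() -> str:
        return f"loading into {table}"

    return execute


def parse_lazy_dag_file() -> Callable[[], str]:
    """Lookup deferred into execute(): parsing alone issues no query."""

    def execute() -> str:
        return f"loading into {get_variable('target_table')}"

    return execute


# Simulate the scheduler parsing each file five times.
for _ in range(5):
    parse_eager_dag_file()
eager_lookups = lookups["count"]

lookups["count"] = 0
for _ in range(5):
    task = parse_lazy_dag_file()
lazy_lookups_after_parsing = lookups["count"]

result = task()  # the single lookup happens only when the task actually runs
lazy_lookups_after_run = lookups["count"]
```

In this simulation the eager version queries once per parse, while the deferred version queries once per task run, which is the load difference the review comment points at.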
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349049272

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
[...]
+Communication
+-------------
+
+Airflow executes tasks of a DAG in different directories, which can even be present
+on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`.
+Therefore, you should not store any file or config in the local filesystem - for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.

Review comment:
Please expand this to include something like "and then push a path to the remote file in XCom to use in downstream tasks".

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
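
A minimal sketch of the pattern the reviewer asks for: the upstream task places the file in remote storage and pushes only its path through XCom, and the downstream task pulls that path. ``xcom_push`` and ``xcom_pull`` are the task-instance methods Airflow provides; the bucket name, key layout, task ids, and the ``FakeTaskInstance`` stand-in (used only so the snippet runs without Airflow installed) are assumptions for illustration, and the actual upload is left as a comment.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance, for illustration only."""

    def __init__(self, store, task_id):
        self._store = store      # shared dict playing the role of the XCom table
        self.task_id = task_id

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self._store.get((task_ids, key))


def download_jar(ti, ds):
    """Upstream task: put the JAR in remote storage, push only its path."""
    remote_path = f"s3://example-bucket/jars/{ds}/job.jar"  # assumed layout
    # ... upload the file to remote_path here (for example with an S3 hook) ...
    ti.xcom_push(key="jar_path", value=remote_path)


def run_jar(ti):
    """Downstream task: pull the path instead of relying on local disk."""
    jar_path = ti.xcom_pull(task_ids="download_jar", key="jar_path")
    return f"spark-submit {jar_path}"


xcom_table = {}
download_jar(FakeTaskInstance(xcom_table, "download_jar"), ds="2019-11-21")
command = run_jar(FakeTaskInstance(xcom_table, "run_jar"))
print(command)
```

Because only a short string travels through XCom, the two tasks can run on different workers, and a retry of the downstream task does not depend on what is left on any local disk.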
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349052555

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
[...]
+The tasks should also not store any authentication parameters such as passwords or token inside them.
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+---------
+
+You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is by default 1 second. During parsing, Airflow creates a new connection to the metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja template with the syntax :
+
+.. code::
+
+    {{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+    {{ var.json. }}
+
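
To illustrate why the quoted section discourages top-level Variable lookups: code at module level runs every time the DAG file is parsed, while code inside a task callable (or a Jinja template) runs only when the task executes. The sketch below uses a hypothetical ``CountingVariables`` class in place of Airflow's Variable model, purely to count simulated metadata DB lookups; it is not Airflow API, and the parse count of 100 is an arbitrary assumption.

```python
class CountingVariables:
    """Hypothetical stand-in for Airflow Variables; counts simulated DB lookups."""

    def __init__(self, values):
        self._values = values
        self.lookups = 0

    def get(self, key):
        self.lookups += 1  # in Airflow, each call would hit the metadata DB
        return self._values[key]


def parse_dag_eager(variables):
    """DAG file that reads the Variable at the top level (discouraged)."""
    bucket = variables.get("bucket")           # runs on every parse
    return lambda: f"writing to {bucket}"      # the task callable


def parse_dag_lazy(variables):
    """DAG file that defers the read until the task executes."""
    return lambda: f"writing to {variables.get('bucket')}"


eager = CountingVariables({"bucket": "example-bucket"})
lazy = CountingVariables({"bucket": "example-bucket"})

# Simulate the scheduler re-parsing each DAG file 100 times.
for _ in range(100):
    eager_task = parse_dag_eager(eager)
    lazy_task = parse_dag_lazy(lazy)

# Run each task once.
eager_result = eager_task()
lazy_result = lazy_task()

print(eager.lookups, lazy.lookups)
```

Both tasks produce the same result; the difference is only in how many lookups the parsing loop triggers.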
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349052009

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
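
The quoted draft's advice to read and write a specific partition, and to keep ``now()`` out of anything that affects the result, can be sketched as follows. The bucket name and ``dt=`` path layout are assumptions; ``execution_date`` is passed in as a plain ``datetime`` here, standing in for the value Airflow supplies to the task.

```python
from datetime import datetime


def partition_path(execution_date, base="s3://example-bucket/events"):
    """Build the partition a task run should read from and write to."""
    return f"{base}/dt={execution_date:%Y-%m-%d}"


def nondeterministic_path(base="s3://example-bucket/events"):
    """Anti-pattern: the path depends on when the task happens to run."""
    return f"{base}/dt={datetime.now():%Y-%m-%d}"


run_date = datetime(2019, 11, 21)
first_attempt = partition_path(run_date)
retry = partition_path(run_date)  # a retry, possibly days later
print(first_attempt)
```

Since the path is derived only from the run's own date, a retry or a backfill touches the same partition as the first attempt, whereas ``nondeterministic_path`` would drift with the wall clock.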
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053904

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349048684

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
[...]
+Communication
+-------------
+
+Airflow executes tasks of a DAG in different directories, which can even be present

Review comment:
The directory bit isn't true, but the issue here is, as you mention, that tasks can be executed on different machines. And even if using the LocalExecutor, storing files on local disk can make retries harder (especially if another task might have deleted the file in the meantime).
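
On the subject of retries, the quoted draft's first bullet (replace INSERT with UPSERT so a re-run does not duplicate rows) can be sketched with the standard-library ``sqlite3`` module. SQLite's ``INSERT OR REPLACE`` stands in for UPSERT here; other databases spell it differently. The table and column names are assumptions for illustration.

```python
import sqlite3


def load_daily_total(conn, day, total):
    """Idempotent load: a re-run overwrites the row instead of adding one."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_totals (day, total) VALUES (?, ?)",
        (day, total),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total INTEGER)")

load_daily_total(conn, "2019-11-21", 42)   # first attempt
load_daily_total(conn, "2019-11-21", 42)   # retry of the same task run

rows = conn.execute("SELECT day, total FROM daily_totals").fetchall()
print(rows)
```

The primary key on ``day`` is what makes the second call replace rather than append; with a plain INSERT and no key, the retry would leave two rows for the same day.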
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349051235

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
[...]
+The tasks should also not store any authentication parameters such as passwords or token inside them.
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+---------
+
+You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is by default 1 second. During parsing, Airflow creates a new connection to the metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja template with the syntax :

Review comment: ```suggestion The best way of using variables is via a Jinja template which will delay reading the value until the task
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349050914

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
[...]
+The tasks should also not store any authentication parameters such as passwords or token inside them.
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id. + + +Variables +- + +You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value. +Airflow parses all the DAGs in the background at a specific period. +The default period is set using ``processor_poll_interval`` config, which is by default 1 second. During parsing, Airflow creates a new connection to the metadata DB for each Variable. +It can result in a lot of open connections. Review comment: ``` Airflow parses all the DAG files in a loop, trying to parse each file every ``processor_poll_interval`` seconds (default 1 second). During parsing, Airflow will open and close a new connection to the metadata DB for
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349062979

## File path: docs/best-practices.rst ##
@@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ .. http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+
+Best Practices
+==============
+
+Running Airflow in production is seamless. It comes bundled with all the plugins and configs
+necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^^^^^^^^^^^^
+Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---------------
+
+You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data in a task.
+  Someone may update the input data between re-runs, which results in different outputs.
+  A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition.
+  You should follow this partitioning method while writing data in S3/HDFS, as well.
+* The python datetime ``now()`` function gives the current datetime object.
+  This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run.
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task.
+The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+---------------
+
+Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI.
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+-------------
+
+Airflow executes tasks of a DAG in different directories, which can even be present
+on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`.
+Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as passwords or token inside them.
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+---------
+
+You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is by default 1 second. During parsing, Airflow creates a new connection to the metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja template with the syntax :
+
+.. code::
+
+{{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+{{ var.json. }}
+
+
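The parse-time cost described in the quoted Variables section can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not import or call Airflow, and ``FakeMetadataDB``, the two ``dag_file_*`` builders, ``render`` and ``count_lookups`` are hypothetical names invented here. It assumes that a lookup written at the top level of a DAG file runs on every parse, while a ``{{ var.value.<key> }}`` template is resolved only when the task executes.

```python
"""Simulation only: nothing here imports or calls Airflow.

FakeMetadataDB, the dag_file_* builders, render and count_lookups are
hypothetical stand-ins invented for this illustration.
"""
import re

# Matches templates of the form {{ var.value.some_key }}.
_VAR_TEMPLATE = re.compile(r"\{\{\s*var\.value\.(\w+)\s*\}\}")


class FakeMetadataDB:
    """Stand-in for the metadata DB that counts every variable lookup."""

    def __init__(self, values):
        self._values = dict(values)
        self.lookups = 0

    def get_variable(self, key):
        self.lookups += 1
        return self._values[key]


def dag_file_with_top_level_lookup(db):
    # The lookup sits in top-level code, so it runs on every parse.
    bucket = db.get_variable("bucket")
    return {"bash_command": "echo " + bucket}


def dag_file_with_template(db):
    # Parsing only stores the template string; no lookup happens here.
    return {"bash_command": "echo {{ var.value.bucket }}"}


def render(command, db):
    # Stand-in for template rendering when the task actually executes.
    return _VAR_TEMPLATE.sub(lambda m: db.get_variable(m.group(1)), command)


def count_lookups(build_task, parses):
    """Parse the DAG file `parses` times (at least once), run the task once.

    Returns the rendered command and the total number of lookups.
    """
    db = FakeMetadataDB({"bucket": "my-bucket"})
    task = None
    for _ in range(parses):
        task = build_task(db)
    rendered = render(task["bash_command"], db)
    return rendered, db.lookups
```

In this simulation, 100 parses followed by one execution cost 100 lookups for the top-level version and 1 lookup for the templated version, while both produce the same command string. Real scheduler behaviour depends on the Airflow version and configuration, so the numbers only show the shape of the problem.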
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349061642

## File path: docs/best-practices.rst ##
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053376

## File path: docs/best-practices.rst ##
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349063292

## File path: docs/best-practices.rst ##
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349062851

## File path: docs/best-practices.rst ##
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349061827

## File path: docs/best-practices.rst ##
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349053006

## File path: docs/best-practices.rst ##

@@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+ .. http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Best Practices
+==============
+
+Running Airflow in production is seamless. It comes bundled with all the plugins and configs
+necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^^^^^^^^^^^^
+Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---------------
+
+You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data in a task.
+  Someone may update the input data between re-runs, which results in different outputs.
+  A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition.
+  You should follow this partitioning method while writing data in S3/HDFS, as well.
+* The python datetime ``now()`` function gives the current datetime object.
+  This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run.
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+    You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task.
+    The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+---------------
+
+Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI.
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+-------------
+
+Airflow executes tasks of a DAG in different directories, which can even be present
+on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`.
+Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as passwords or token inside them.
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id.
+
+
+Variables
+---------
+
+You should avoid usage of Variables outside an operator's execute() method or Jinja templates. Variables create a connection to metadata DB of Airflow to fetch the value.
+Airflow parses all the DAGs in the background at a specific period.
+The default period is set using ``processor_poll_interval`` config, which is by default 1 second. During parsing, Airflow creates a new connection to the metadata DB for each Variable.
+It can result in a lot of open connections.
+
+If you really want to use Variables, we advice to use them from a Jinja template with the syntax :
+
+.. code::
+
+    {{ var.value. }}
+
+or if you need to deserialize a json object from the variable :
+
+.. code::
+
+    {{ var.json. }}
+
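The Variables section quoted above argues that a lookup at DAG-file top level costs one metadata DB connection per parse, while a Jinja template defers the lookup to task execution. The following is a minimal stdlib-only sketch of that difference. It does not use Airflow: `FakeMetadataDB`, `parse_dag_eager`, `parse_dag_templated`, `render`, `count_lookups`, and the `greeting` variable are all hypothetical names introduced for illustration, and `render` is a simplified stand-in for real Jinja rendering.

```python
import re


class FakeMetadataDB:
    """Hypothetical stand-in for the Airflow metadata DB; it only counts lookups."""

    def __init__(self, values):
        self._values = values
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self._values[key]


def parse_dag_eager(db):
    # The lookup sits at the top level of the DAG file, so it runs on every parse.
    return {"bash_command": "echo " + db.get("greeting")}


def parse_dag_templated(db):
    # Only a template string is stored; nothing is fetched while parsing.
    return {"bash_command": "echo {{ var.value.greeting }}"}


def render(command, db):
    # Simplified substitute for template rendering at task execution time.
    pattern = r"\{\{ var\.value\.(\w+) \}\}"
    return re.sub(pattern, lambda match: db.get(match.group(1)), command)


def count_lookups(parse_fn, parses):
    """Parse the DAG `parses` times, run the task once, and report DB lookups."""
    db = FakeMetadataDB({"greeting": "hello"})
    for _ in range(parses):
        task = parse_fn(db)
    lookups_during_parsing = db.lookups
    rendered = render(task["bash_command"], db)
    return lookups_during_parsing, db.lookups, rendered
```

With 100 parses, the eager variant performs 100 lookups before the task ever runs, while the templated variant performs none during parsing and a single lookup when the command is rendered. Both end up with the same command string.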
[GitHub] [airflow] ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
ashb commented on a change in pull request #6515: [AIRFLOW-XXX] GSoD: How to make DAGs production ready
URL: https://github.com/apache/airflow/pull/6515#discussion_r349048944

## File path: docs/best-practices.rst ##

@@ -0,0 +1,271 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+ .. http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Best Practices
+==============
+
+Running Airflow in production is seamless. It comes bundled with all the plugins and configs
+necessary to run most of the DAGs. However, you can come across certain pitfalls, which can cause occasional errors.
+Let's take a look at what you need to do at various stages to avoid these pitfalls, starting from writing the DAG
+to the actual deployment in the production environment.
+
+
+Writing a DAG
+^^^^^^^^^^^^^
+Creating a new DAG in Airflow is quite simple. However, there are many things that you need to take care of
+to ensure the DAG run or failure does not produce unexpected results.
+
+Creating a task
+---------------
+
+You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce
+incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task.
+
+Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run.
+Some of the ways you can avoid producing a different result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database.
+  Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data in a task.
+  Someone may update the input data between re-runs, which results in different outputs.
+  A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition.
+  You should follow this partitioning method while writing data in S3/HDFS, as well.
+* The python datetime ``now()`` function gives the current datetime object.
+  This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run.
+  It's fine to use it, for example, to generate a temporary log.
+
+.. tip::
+
+    You should define repetitive parameters such as ``connection_id`` or S3 paths in ``default_args`` rather than declaring them for each task.
+    The ``default_args`` help to avoid mistakes such as typographical errors.
+
+
+Deleting a task
+---------------
+
+Never delete a task from a DAG. In case of deletion, the historical information of the task disappears from the Airflow UI.
+It is advised to create a new DAG in case the tasks need to be deleted.
+
+
+Communication
+-------------
+
+Airflow executes tasks of a DAG in different directories, which can even be present
+on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`.
+Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes.
+
+Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.
+
+The tasks should also not store any authentication parameters such as passwords or token inside them.
+Always use :ref:`Connections ` to store data securely in Airflow backend and retrieve them using a unique connection id.

Review comment:
Unfortunately we can't be so bold as to say "Always" -- not every system is supported, so "Where at all possible" might be the best we can say.
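To make the "retrieve them using a unique connection id" part of the quoted line concrete, here is a minimal stdlib-only sketch. It relies on one Airflow behaviour: a connection can be supplied through an environment variable named `AIRFLOW_CONN_<CONN_ID>` holding a URI. Everything else is an assumption for illustration: `connection_from_env` is a hypothetical helper, not Airflow's own parser, and the `my_db` id and the example URI are made up.

```python
import os
from urllib.parse import urlsplit


def connection_from_env(conn_id, environ=os.environ):
    """Look up AIRFLOW_CONN_<CONN_ID> and split its URI into named parts.

    Hypothetical helper: a simplified stand-in for what a task would
    otherwise obtain from Airflow by connection id.
    """
    uri = environ.get("AIRFLOW_CONN_" + conn_id.upper())
    if uri is None:
        raise KeyError("no connection defined for conn_id %r" % conn_id)
    parts = urlsplit(uri)
    return {
        "conn_type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "login": parts.username,
        "password": parts.password,
        "schema": parts.path.lstrip("/"),
    }
```

A task written this way carries only the id (`my_db` here), so the password never appears in the DAG file. For systems that have no matching connection type, which is the case the review comment raises, a different secret store would still be needed.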