This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new 02ce3a0238 Edited timetable docs (#38505)
02ce3a0238 is described below

commit 02ce3a0238ef0c170b2121a3ce80b77f54fa1071
Author: Laura Zdanski <25642903+lzdan...@users.noreply.github.com>
AuthorDate: Thu Apr 4 07:00:09 2024 -0400

    Edited timetable docs (#38505)

    ---------

    Co-authored-by: Collin McNulty <collin.mcnu...@gmail.com>
    Co-authored-by: Tzu-ping Chung <uranu...@gmail.com>
---
 .../authoring-and-scheduling/timetable.rst | 135 +++++++++++----------
 1 file changed, 72 insertions(+), 63 deletions(-)

diff --git a/docs/apache-airflow/authoring-and-scheduling/timetable.rst b/docs/apache-airflow/authoring-and-scheduling/timetable.rst
index 78234910a6..9ca7d436db 100644
--- a/docs/apache-airflow/authoring-and-scheduling/timetable.rst
+++ b/docs/apache-airflow/authoring-and-scheduling/timetable.rst
@@ -19,9 +19,9 @@
 Timetables
 ==========

-For DAGs with time-based schedules (as opposed to event-driven), the scheduling
-decisions are driven by its internal "timetable". The timetable also
-determines the data interval and the logical date of each run created for the DAG.
+For a DAG with a time-based schedule (as opposed to event-driven), the DAG's internal "timetable"
+drives scheduling. The timetable also determines the data interval and the logical date of
+each run created for the DAG.

 DAGs scheduled with a cron expression or ``timedelta`` object are
 internally converted to always use a timetable.
@@ -29,39 +29,39 @@ internally converted to always use a timetable.
 If a cron expression or ``timedelta`` is sufficient for your use case, you don't need to worry about writing a custom timetable because Airflow has default timetables that handle those cases. But for more complicated scheduling requirements,
-you may create your own timetable class and pass that to the DAG's ``schedule`` argument.
+you can create your own timetable class and pass that to the DAG's ``schedule`` argument.

-Here are some examples of when custom timetable implementations are useful:
+Some examples of when custom timetable implementations are useful:

-* Data intervals with "holes" between. (Instead of continuous, as both the cron
-  expression and ``timedelta`` schedules represent.)
-* Run tasks at different times each day. For example, an astronomer may find it
+* Task runs that occur at different times each day. For example, an astronomer might find it
   useful to run a task at dawn to process data collected from the previous night-time period.
-* Schedules not following the Gregorian calendar. For example, create a run for
+* Schedules that don't follow the Gregorian calendar. For example, create a run for
   each month in the `Traditional Chinese Calendar`_. This is conceptually
-  similar to the sunset case above, but for a different time scale.
-* Rolling windows, or overlapping data intervals. For example, one may want to
+  similar to the sunrise case, but for a different time scale.
+* Rolling windows, or overlapping data intervals. For example, you might want to
   have a run each day, but make each run cover the period of the previous seven
-  days. It is possible to "hack" this with a cron expression, but a custom data
-  interval would be a more natural representation.
+  days. It is possible to hack this with a cron expression, but a custom data
+  interval provides a more natural representation.
+* Data intervals with "holes" between intervals instead of a continuous interval, as both the cron
+  expression and ``timedelta`` schedules represent continuous intervals. See :ref:`data-interval`.

 .. _`Traditional Chinese Calendar`: https://en.wikipedia.org/wiki/Chinese_calendar

-As such, Airflow allows for custom timetables to be written in plugins and used by
-DAGs. An example demonstrating a custom timetable can be found in the
+Airflow allows you to write custom timetables in plugins and used by
+DAGs. You can find an example demonstrating a custom timetable in the
 :doc:`/howto/timetable` how-to guide.

 .. note::

-    As a general rule, always access Variables, Connections etc or anything that would access
+    As a general rule, always access Variables, Connections, or anything else that needs access
     to the database as late as possible in your code. See :ref:`best_practices/timetables` for more best practices to follow.

 Built-in Timetables
 -------------------

-Airflow comes with several common timetables built in to cover the most common use cases. Additional timetables
+Airflow comes with several common timetables built-in to cover the most common use cases. Additional timetables
 may be available in plugins.

 .. _CronTriggerTimetable:
@@ -82,9 +82,8 @@ A timetable that accepts a cron expression, and triggers DAG runs according to i
     def example_dag():
         pass

-It is also possible to provide a static data interval to the timetable. The optional ``interval`` argument
-must be a :class:`datetime.timedelta` or ``dateutil.relativedelta.relativedelta``. If given, a triggered DAG
-run's data interval would span the specified duration, and *ends* with the trigger time.
+You can also provide a static data interval to the timetable. The optional ``interval`` argument
+must be a :class:`datetime.timedelta` or ``dateutil.relativedelta.relativedelta``. When using these arguments, a triggered DAG run's data interval spans the specified duration, and *ends* with the trigger time.

 .. code-block:: python

@@ -111,11 +110,11 @@ run's data interval would span the specified duration, and *ends* with the trigg
 DeltaDataIntervalTimetable
 ^^^^^^^^^^^^^^^^^^^^^^^^^^

-Schedules data intervals with a time delta. Can be selected by providing a
+A timetable that schedules data intervals with a time delta. You can select it by providing a
 :class:`datetime.timedelta` or ``dateutil.relativedelta.relativedelta`` to the ``schedule`` parameter of a DAG.

-This timetable is more focused on the data interval value and does not necessarily align execution dates with
-arbitrary bounds such as start of day or of hour.
+This timetable focuses on the data interval value and does not necessarily align execution dates with
+arbitrary bounds, such as the start of day or of hour.

 .. seealso:: `Differences between the cron and delta data interval timetables`_

@@ -136,8 +135,8 @@ trigger points, and triggers a DAG run at the end of each data interval.
 .. seealso:: `Differences between the two cron timetables`_
 .. seealso:: `Differences between the cron and delta data interval timetables`_

-This can be selected by providing a string that is a valid cron expression to the ``schedule``
-parameter of a DAG as described in the :doc:`../core-concepts/dags` documentation.
+Select this timetable by providing a valid cron expression as a string to the ``schedule``
+parameter of a DAG, as described in the :doc:`../core-concepts/dags` documentation.

 .. code-block:: python

@@ -148,13 +147,13 @@ parameter of a DAG as described in the :doc:`../core-concepts/dags` documentatio
 EventsTimetable
 ^^^^^^^^^^^^^^^

-Simply pass a list of ``datetime``\s for the DAG to run after. Useful for timing based on sporting
-events, planned communication campaigns, and other schedules that are arbitrary and irregular but predictable.
+Pass a list of ``datetime``\s for the DAG to run after. This can be useful for timing based on sporting
+events, planned communication campaigns, and other schedules that are arbitrary and irregular, but predictable.

-The list of events must be finite and of reasonable size as it must be loaded every time the DAG is parsed. Optionally,
-the ``restrict_to_events`` flag can be used to force manual runs of the DAG to use the time of the most recent (or very
-first) event for the data interval, otherwise manual runs will run with a ``data_interval_start`` and
-``data_interval_end`` equal to the time at which the manual run was begun. You can also name the set of events using the
+The list of events must be finite and of reasonable size as it must be loaded every time the DAG is parsed. Optionally, use
+the ``restrict_to_events`` flag to force manual runs of the DAG that use the time of the most recent, or very
+first, event for the data interval. Otherwise, manual runs begin with a ``data_interval_start`` and
+``data_interval_end`` equal to the time at which the manual run started. You can also name the set of events using the
 ``description`` parameter, which will be displayed in the Airflow UI.

 .. code-block:: python
@@ -181,9 +180,9 @@ first) event for the data interval, otherwise manual runs will run with a ``data
 Dataset event based scheduling with time based scheduling
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Combining conditional dataset expressions with time-based schedules enhances scheduling flexibility:
+Combining conditional dataset expressions with time-based schedules enhances scheduling flexibility.

-The ``DatasetOrTimeSchedule`` is a specialized timetable allowing for the scheduling of DAGs based on both time-based schedules and dataset events. It facilitates the creation of scheduled runs (as per traditional timetables) and dataset-triggered runs, which operate independently.
+The ``DatasetOrTimeSchedule`` is a specialized timetable that allows for the scheduling of DAGs based on both time-based schedules and dataset events. It also facilitates the creation of both scheduled runs, as per traditional timetables, and dataset-triggered runs, which operate independently.

 This feature is particularly useful in scenarios where a DAG needs to run on dataset updates and also at periodic intervals. It ensures that the workflow remains responsive to data changes and consistently runs regular checks or updates.

@@ -210,52 +209,62 @@ Here's an example of a DAG using ``DatasetOrTimeSchedule``:
 Timetables comparisons
 ----------------------

-
 .. _Differences between the two cron timetables:

 Differences between the two cron timetables
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-There are two timetables `CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ that accepts a cron expression.
-There are some differences between the two:
-- `CronTriggerTimetable`_ does not take care of *Data Interval*, while `CronDataIntervalTimetable`_ does.
-- The time when a DAG run is triggered by `CronTriggerTimetable`_ is more intuitive and more similar to what people
-expect cron to behave than that of `CronDataIntervalTimetable`_ (when ``catchup`` is ``False``).
+Airflow has two timetables `CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ that accept a cron expression.
+
+However, there are differences between the two:
+- `CronTriggerTimetable`_ does not address *Data Interval*, while `CronDataIntervalTimetable`_ does.
+- The timestamp in the ``run_id``, the ``logical_date`` for `CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ are defined differently based on how they handle the data interval, as described in :ref:`timetables_run_id_logical_date`.

 Whether taking care of *Data Interval*
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-`CronTriggerTimetable`_ *does not* care the idea of *data interval*. It means the value of ``data_interval_start``,
-``data_interval_end`` and legacy ``execution_date`` are the same - the time when a DAG run is triggered.
+`CronTriggerTimetable`_ *does not* include *data interval*. This means that the value of ``data_interval_start`` and
+``data_interval_end`` (and the legacy ``execution_date``) are the same; the time when a DAG run is triggered.
+
+However, `CronDataIntervalTimetable`_ *does* include *data interval*. This means the value of
+``data_interval_start`` and ``data_interval_end`` (and legacy ``execution_date``) are different. ``data_interval_start`` is the time when a
+DAG run is triggered and ``data_interval_end`` is the end of the interval.
+
+*Catchup* behavior
+^^^^^^^^^^^^^^^^^^
+
+Whether you're using `CronTriggerTimetable`_ or `CronDataIntervalTimetable`_, there is no difference when ``catchup`` is ``True``.

-On the other hand, `CronDataIntervalTimetable`_ *does* care the idea of *data interval*. It means the value of
-``data_interval_start`` and ``data_interval_end`` (and legacy ``execution_date``) are different. They are the start
-and end of the interval respectively.
+You might want to use ``False`` for ``catchup`` for certain scenarios, to prevent running unnecessary DAGs:
+- If you create a new DAG with a start date in the past, and don't want to run DAGs for the past. If ``catchup`` is ``True``, Airflow runs all DAGs that would have run in that time interval.
+- If you pause an existing DAG, and then restart it at a later date, and don't want to If ``catchup`` is ``True``,
+
+In these scenarios, the ``logical_date`` in the ``run_id`` are based on how `CronTriggerTimetable`_ or `CronDataIntervalTimetable`_ handle the data interval.
+
+See :ref:`dag-catchup` for more information about how DAG runs are triggered when using ``catchup``.
+
+.. _timetables_run_id_logical_date:

 The time when a DAG run is triggered
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-There is no difference between the two when ``catchup`` is ``True``. :ref:`dag-catchup` tells you how DAG runs are
-triggered when ``catchup`` is ``True``.
+`CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ trigger DAG runs at the same time. However, the timestamp for the ``run_id`` is different for each.
+
+- `CronTriggerTimetable`_ has a ``run_id`` timestamp, the ``logical_date``, showing when DAG run is able to start.
+- `CronTriggerTimetable`_ and `CronDataIntervalTimetable`_ trigger DAG runs at the same time. However, the timestamp for the ``run_id`` (``logical_date``) is different for each.

-When ``catchup`` is ``False``, there is difference in how a new DAG run is triggered. `CronTriggerTimetable`_ triggers
-a new DAG run *after* the current time, while `CronDataIntervalTimetable`_ does *before* the current time (assuming
-the value of ``start_date`` is past time).
+For example, suppose there is a cron expression ``@daily`` or ``0 0 * * *``, which is scheduled to run at 12AM every day. If you enable DAGs using the two timetables at 3PM on January
+31st,
+- `CronTriggerTimetable`_ triggers a new DAG run at 12AM on February 1st. The ``run_id`` timestamp is midnight, on February 1st.
+- `CronDataIntervalTimetable`_ immediately triggers a new DAG run, because a DAG run for the daily time interval beginning at 12AM on January 31st did not occur yet. The ``run_id`` timestamp is midnight, on January 31st, since that is the beginning of the data interval.

-Here is an example showing how the first DAG run is triggered. Supposes there is a cron expression ``@daily`` or
-``0 0 * * *``, which is aimed to run at 12AM every day. If you enable DAGs using the two timetables at 3PM on January
-31st, `CronTriggerTimetable`_ will trigger a new DAG run at 12AM on February 1st. `CronDataIntervalTimetable`_, on the other
-hand, will immediately trigger a new DAG run which is supposed to trigger at 12AM on January 31st if the DAG had been
-enabled beforehand.
+This is another example showing the difference in the case of skipping DAG runs.

-This is another example showing the difference in the case of skipping DAG runs. Suppose there are two running DAGs
-using the two timetables with a cron expression ``@daily`` or ``0 0 * * *``. If you pause the DAGs at 3PM on January
-31st and re-enable them at 3PM on February 2nd, `CronTriggerTimetable`_ skips the DAG runs which are supposed to
-trigger on February 1st and 2nd. The next DAG run will be triggered at 12AM on February 3rd. `CronDataIntervalTimetable`_,
-on the other hand, skips the DAG runs which are supposed to trigger on February 1st only. A DAG run for February 2nd
-is immediately triggered after you re-enable the DAG.
+Suppose there are two running DAGs with a cron expression ``@daily`` or ``0 0 * * *`` that use the two different timetables. If you pause the DAGs at 3PM on January 31st and re-enable them at 3PM on February 2nd,
+- `CronTriggerTimetable`_ skips the DAG runs that were supposed to trigger on February 1st and 2nd. The next DAG run will be triggered at 12AM on February 3rd.
+- `CronDataIntervalTimetable`_ skips the DAG runs that were supposed to trigger on February 1st only. A DAG run for February 2nd is immediately triggered after you re-enable the DAG.

-By these examples, you see how `CronTriggerTimetable`_ triggers DAG runs is more intuitive and more similar to what
+In these examples, you see how `CronTriggerTimetable`_ triggers DAG runs is more intuitive and more similar to what
 people expect cron to behave than how `CronDataIntervalTimetable`_ does.


@@ -265,8 +274,8 @@ Differences between the cron and delta data interval timetables:
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Choosing between `DeltaDataIntervalTimetable`_ and `CronDataIntervalTimetable`_ depends on your use case.
-If you enable a DAG at 01:05 on February 1st, the following table summarizes the DAG runs created (and the
-data interval that they cover), depending on 3 arguments: ``schedule``, ``start_date`` and ``catchup``.
+If you enable a DAG at 01:05 on February 1st, the following table summarizes the DAG runs created and the
+data interval that they cover, depending on 3 arguments: ``schedule``, ``start_date`` and ``catchup``.

 .. list-table::
     :header-rows: 1