[ https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bolke de Bruin updated AIRFLOW-20:
----------------------------------
    Summary: Improving the scheduler by make dag runs more coherent  (was: Align start_date with the schedule_interval)

> Improving the scheduler by make dag runs more coherent
> ------------------------------------------------------
>
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Bolke de Bruin
>              Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counterintuitive
> and leads to a lot of questions and issue creation, even though it is
> documented. If we can fix this with little or no impact on current
> setups, I think that should be preferred.
> The dependency explainer is really great work, but it doesn’t address the
> core issue.
> If you consider a DAG a description of cohesion between work items (in
> Java OOP terms, a class), then a DagRun is the instantiation of a DAG in
> time (in Java OOP terms, an instance). A Task is then the description of
> a work item, and a TaskInstance is the instantiation of that Task in time.
> In my opinion, issues pop up because the current paradigm treats the
> TaskInstance as the smallest unit of work and asks it to maintain its own
> state in relation to other TaskInstances in the same DagRun and in a
> previous DagRun, of which it has no (real) perception. Tasks are
> instantiated by a Cartesian product with the dates of the DagRuns instead
> of with the DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved
> while maintaining the flexibility to run tasks without a DagRun. This
> would help with a couple of things:
> 1. start_date can be used as an ‘execution_date’ or as a point in time
>    from which to start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depend_on_past could be removed, as a task will now
>    know whether it really is the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun, allowing it to connect to its
>    predecessor
> 2. Adding a dag_run_id to TaskInstances, so a TaskInstance knows about
>    its DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date
>    falls on the interval, in which case start_date itself is the first
>    run date

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
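
For illustration only, the three PR-1431 points in the issue description could be sketched in Python roughly as follows. This is not the code from the PR. The names first_run_date, is_on_interval and is_first are hypothetical, the interval is assumed to be a fixed timedelta rather than a cron expression, and "on the interval" is assumed to mean "a whole number of intervals after the Unix epoch", which the issue text does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: schedule boundaries are counted from the Unix epoch.
# The issue text does not say what the interval is aligned to.
EPOCH = datetime(1970, 1, 1)


def is_on_interval(moment: datetime, interval: timedelta) -> bool:
    """True if `moment` lies exactly on a schedule boundary (hypothetical helper)."""
    return (moment - EPOCH) % interval == timedelta(0)


def first_run_date(start_date: datetime, interval: timedelta) -> datetime:
    """Point 3: start_date itself if it is on the interval, else start_date + interval."""
    if is_on_interval(start_date, interval):
        return start_date
    return start_date + interval


@dataclass
class DagRun:
    execution_date: datetime
    # Point 1: a link to the predecessor run; None for the very first run.
    previous: Optional["DagRun"] = None


@dataclass
class TaskInstance:
    task_id: str
    # Point 2: optional, because tasks may still run without a DagRun.
    dag_run: Optional[DagRun] = None

    def is_first(self) -> bool:
        """True only when attached to a DagRun that has no predecessor."""
        return self.dag_run is not None and self.dag_run.previous is None


daily = timedelta(days=1)
aligned = first_run_date(datetime(2016, 1, 1), daily)
unaligned = first_run_date(datetime(2016, 1, 1, 10, 30), daily)

run1 = DagRun(execution_date=aligned)
run2 = DagRun(execution_date=aligned + daily, previous=run1)
ti1 = TaskInstance("load", dag_run=run1)
ti2 = TaskInstance("load", dag_run=run2)
```

With the predecessor link in place, a task instance can tell from its own DagRun whether it is the first one, which is the reasoning behind dropping ignore_first_depend_on_past in the list above.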