Fundamentally, tasks are defined by the code itself, so a worker process
can only determine what to execute by parsing the Python code that defines
the task. There may be some cases where a task is well defined outside of
the full context of the DAG that contains it, but that doesn't hold in
general.

Chris
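
For readers of the archive, here is a minimal sketch of the caching idea
discussed further down the thread: build the expensive config once, pickle
it, and let each later parse of the DAG file load the pickle while it is
still recent. It uses only the Python standard library. The names
load_or_build_config, cache_path, max_age_seconds and build_fn are
hypothetical and are not taken from Airflow or from either poster's code.

```python
import os
import pickle
import tempfile
import time


def load_or_build_config(cache_path, max_age_seconds, build_fn):
    """Return the cached config if the cache file is recent, else rebuild it.

    All names here are illustrative assumptions, not Airflow APIs.
    """
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, "rb") as f:
                return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        pass  # missing, unreadable, or corrupt cache: fall through and rebuild

    config = build_fn()  # the expensive step, e.g. the DB queries

    # Write to a temporary file and rename it into place, so that task
    # processes parsing the DAG file concurrently never read a partial pickle.
    cache_dir = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(config, f)
    os.replace(tmp_path, cache_path)
    return config
```

In a DAG definition file this might be called as
load_or_build_config("/tmp/mydag_config.pkl", 3600, query_table_configs),
where query_table_configs is a placeholder for whatever function runs the
queries. This only helps when the tasks run on a machine that can see the
cache file, and a pickle should only be loaded from a location you trust.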

On Thu, Jan 30, 2020 at 3:31 PM Reed Villanueva <[email protected]>
wrote:

> Thanks for the clarification.
>
> Yes, ultimately pickling the configs I use to build the graph was what I
> did (the graph is created in a loop that does DB queries to make DAG
> branches for a set of tables, so queries are involved and were causing
> overhead b/c the queries were being done for every single task when I only
> really needed to do them once to make the config dict for the DAG).
>
> Could you explain a bit as to why any task would even care about the DAG
> structure? I would think that if the scheduler sets them to run, then they
> should run and have no reason to care about the overall structure of the
> DAG.
>
> On Thu, Jan 30, 2020 at 5:33 AM Shaw, Damian P. <
> [email protected]> wrote:
>
>> Yes, every task is run in process isolation (and could be running across
>> separate machines), so every task builds the DAG from scratch.
>>
>>
>>
>> If you don’t expect your DAG to change over a given period and the tasks
>> run on the same machine, you could cache / pickle the DAG object and,
>> before trying to build the DAG, check whether the cache / pickle file is
>> available and recent and load it from there. Or I am sure there are many
>> other solutions.
>>
>>
>>
>> Damian
>>
>>
>>
>> *From:* Reed Villanueva <[email protected]>
>> *Sent:* Thursday, January 30, 2020 00:14
>> *To:* [email protected]
>> *Subject:* How often is dag definition file read during a single dag run?
>>
>>
>>
>> How often is a dag definition file read during a single dag run?
>>
>> I have a large dag that takes a long time to build (~1-3 min).
>> Looking at the logs of each task as the dag is running it appears that the
>> dag definition file is being executed for every task before it runs...
>>
>> *** Reading local file: 
>> /home/airflow/airflow/logs/mydag/mytask/2020-01-30T04:51:34.621883+00:00/1.log
>>
>> [2020-01-29 19:02:10,844] {taskinstance.py:655} INFO - Dependencies all met 
>> for <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [queued]>
>>
>> [2020-01-29 19:02:10,866] {taskinstance.py:655} INFO - Dependencies all met 
>> for <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [queued]>
>>
>> [2020-01-29 19:02:10,866] {taskinstance.py:866} INFO -
>>
>> --------------------------------------------------------------------------------
>>
>> [2020-01-29 19:02:10,866] {taskinstance.py:867} INFO - Starting attempt 1 of 
>> 1
>>
>> [2020-01-29 19:02:10,866] {taskinstance.py:868} INFO -
>>
>> --------------------------------------------------------------------------------
>>
>> [2020-01-29 19:02:10,883] {taskinstance.py:887} INFO - Executing 
>> <Task(BashOperator): precheck_db_perms> on 2020-01-30T04:51:34.621883+00:00
>>
>> [2020-01-29 19:02:10,887] {standard_task_runner.py:52} INFO - Started 
>> process 140570 to run task
>>
>> [2020-01-29 19:02:11,048] {logging_mixin.py:112} INFO - [2020-01-29 
>> 19:02:11,047] {dagbag.py:403} INFO - Filling up the DagBag from 
>> /home/airflow/airflow/dags/mydag.py
>>
>> [2020-01-29 19:02:11,052] {logging_mixin.py:112} INFO - <output from my dag 
>> definition file>
>>
>> [2020-01-29 19:02:11,101] {logging_mixin.py:112} INFO - <more output from my 
>> dag definition file>
>>
>> ....
>>
>> ....
>>
>> ....
>>
>> [2020-01-29 19:02:58,651] {logging_mixin.py:112} INFO - Running %s on host 
>> %s <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [running]> 
>> airflowetl.co.local
>>
>> [2020-01-29 19:02:58,674] {bash_operator.py:81} INFO - Tmp dir root location:
>>
>>  /tmp
>>
>> [2020-01-29 19:02:58,674] {bash_operator.py:91} INFO - Exporting the 
>> following env vars:
>>
>> [email protected]
>>
>> AIRFLOW_CTX_DAG_OWNER=me
>>
>> AIRFLOW_CTX_DAG_ID=mydag
>>
>> AIRFLOW_CTX_TASK_ID=mytask
>>
>> AIRFLOW_CTX_EXECUTION_DATE=2020-01-30T04:51:34.621883+00:00
>>
>> AIRFLOW_CTX_DAG_RUN_ID=manual__2020-01-30T04:51:34.621883+00:00
>>
>> [2020-01-29 19:02:58,675] {bash_operator.py:105} INFO - Temporary script 
>> location: /tmp/airflowtmphwu1ckty/mytaskbmnsizw5
>>
>> <only now does the actual task logic output seem to start>
>>
>> where the first whole part of the log seems to imply that the dag file is
>> being run each time a new task is run (I see this for every task).
>>
>> Is this indeed what is happening here? Is this normal / expected
>> behavior? Note that since my dag takes some time to build, this would mean
>> that that time is being multiplied across every task in the dag (of which
>> there are many in this case), which makes me think this is either not
>> normal or there is some best practice I am not using here. Could anyone
>> with more airflow experience help explain what I'm seeing here?
>>
>>
>> This electronic message is intended only for the named
>> recipient, and may contain information that is confidential or
>> privileged. If you are not the intended recipient, you are
>> hereby notified that any disclosure, copying, distribution or
>> use of the contents of this message is strictly prohibited. If
>> you have received this message in error or are not the named
>> recipient, please notify us immediately by contacting the
>> sender at the electronic mail address noted above, and delete
>> and destroy all copies of this message. Thank you.
>>
