Fundamentally tasks are defined by the code itself, and so a worker process can only determine what code to execute when running by parsing the python code that defines it. Maybe there are some cases where a task can be well defined outside of the full context of the dag that contains it but that doesn't apply in general.
Chris On Thu, Jan 30, 2020 at 3:31 PM Reed Villanueva <[email protected]> wrote: > Thanks for the clarification. > > Yes, ultimately pickling the configs I use to build the graph was what I > did (the graph is created in a loop that does DB queries to make DAG > braches for a set of tables, so queries are involved ans was causing > overhead b/c the queries were being done for every single task when only > really needed to do it once to make the config dict for the DAG). > > Could you explain a bit as to why any task would even care about the DAG > structure? I would think that if the scheduler sets them to run, then they > should run and have no reason to care about the overall structure of the > DAG. > > On Thu, Jan 30, 2020 at 5:33 AM Shaw, Damian P. < > [email protected]> wrote: > >> Yes, every task is run in process isolation (and could be running across >> separate machines) so every tasks builds the DAG from scratch. >> >> >> >> If you don’t expect your DAG to change across an amount of time and they >> run on the same machine you could cache / pickle the DAG object and before >> trying to build the DAG check if the cache / pickle file is available and >> recent and load it from there. Or I am sure there are many other >> solutions. >> >> >> >> Damian >> >> >> >> *From:* Reed Villanueva <[email protected]> >> *Sent:* Thursday, January 30, 2020 00:14 >> *To:* [email protected] >> *Subject:* How often is dag definition file read during a single dag run? >> >> >> >> How often is a dag definition file read during a single dag run? >> >> Have a large dag that takes long amount of time to build (~1-3min). >> Looking at the logs of each task as the dag is running it appears that the >> dag definition file is being executed for every task before it runs... >> >> *** Reading local file: >> /home/airflow/airflow/logs/mydag/mytask/2020-01-30T04:51:34.621883+00:00/1.log >> >> [2020-01-29 19:02:10,844] {taskinstance.py:655} INFO - Dependencies all met >> for <TaskInstance: mydag.mytask2020-01-30T04:51:34.621883+00:00 [queued]> >> >> [2020-01-29 19:02:10,866] {taskinstance.py:655} INFO - Dependencies all met >> for <TaskInstance: mydag.mytask2020-01-30T04:51:34.621883+00:00 [queued]> >> >> [2020-01-29 19:02:10,866] {taskinstance.py:866} INFO - >> >> -------------------------------------------------------------------------------- >> >> [2020-01-29 19:02:10,866] {taskinstance.py:867} INFO - Starting attempt 1 of >> 1 >> >> [2020-01-29 19:02:10,866] {taskinstance.py:868} INFO - >> >> -------------------------------------------------------------------------------- >> >> [2020-01-29 19:02:10,883] {taskinstance.py:887} INFO - Executing >> <Task(BashOperator): precheck_db_perms> on 2020-01-30T04:51:34.621883+00:00 >> >> [2020-01-29 19:02:10,887] {standard_task_runner.py:52} INFO - Started >> process 140570 to run task >> >> [2020-01-29 19:02:11,048] {logging_mixin.py:112} INFO - [2020-01-29 >> 19:02:11,047] {dagbag.py:403} INFO - Filling up the DagBag from >> /home/airflow/airflow/dags/mydag.py >> >> [2020-01-29 19:02:11,052] {logging_mixin.py:112} INFO - <output from my dag >> definition file> >> >> [2020-01-29 19:02:11,101] {logging_mixin.py:112} INFO - <more output from my >> dag definition file> >> >> .... >> >> .... >> >> .... >> >> [2020-01-29 19:02:58,651] {logging_mixin.py:112} INFO - Running %s on host >> %s <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [running]> >> airflowetl.co.local >> >> [2020-01-29 19:02:58,674] {bash_operator.py:81} INFO - Tmp dir root location: >> >> /tmp >> >> [2020-01-29 19:02:58,674] {bash_operator.py:91} INFO - Exporting the >> following env vars: >> >> [email protected] >> >> AIRFLOW_CTX_DAG_OWNER=me >> >> AIRFLOW_CTX_DAG_ID=mydag >> >> AIRFLOW_CTX_TASK_ID=mytask >> >> AIRFLOW_CTX_EXECUTION_DATE=2020-01-30T04:51:34.621883+00:00 >> >> AIRFLOW_CTX_DAG_RUN_ID=manual__2020-01-30T04:51:34.621883+00:00 >> >> [2020-01-29 19:02:58,675] {bash_operator.py:105} INFO - Temporary script >> location: /tmp/airflowtmphwu1ckty/mytaskbmnsizw5 >> >> <only now does the actual task logic output seem to start> >> >> where the first whole part of the log seems to imply that the dag file is >> being run each time a new task is run (I see this for every task). >> >> Is this indeed what is happening here? Is this normal / expected >> behavior? Note that since my dag takes some time to build, this would mean >> that that time is being multiplied across every task in the dag (of which >> there are many in this case), which makes me think this is either not >> normal or there is some best practice I am not using here. Could anyone >> with more airflow experience help explain what I'm seeing here? >> >> >> This electronic message is intended only for the named >> recipient, and may contain information that is confidential or >> privileged. If you are not the intended recipient, you are >> hereby notified that any disclosure, copying, distribution or >> use of the contents of this message is strictly prohibited. If >> you have received this message in error or are not the named >> recipient, please notify us immediately by contacting the >> sender at the electronic mail address noted above, and delete >> and destroy all copies of this message. Thank you. >> >> >> >> >> ============================================================================== >> Please access the attached hyperlink for an important electronic >> communications disclaimer: >> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html >> >> ============================================================================== >> > > This electronic message is intended only for the named > recipient, and may contain information that is confidential or > privileged. If you are not the intended recipient, you are > hereby notified that any disclosure, copying, distribution or > use of the contents of this message is strictly prohibited. If > you have received this message in error or are not the named > recipient, please notify us immediately by contacting the > sender at the electronic mail address noted above, and delete > and destroy all copies of this message. Thank you. >
