I can see how my first email was confusing, where I said:
"Our first attempt at productionizing Airflow used the vanilla DAGs folder,
including all the deps of all the DAGs with the airflow binary itself"
What I meant is that we have a separate DAGs deployment, but we are being
forced to package all of the DAGs' dependencies alongside the Airflow
binary itself.
Our DAG deployment is already a separate deployment from Airflow itself.
The issue is that the Airflow binary (whether acting as webserver,
scheduler, or worker) is the one that *reads* the DAG files. So if you
have, for example, a DAG that has this import statement in it:
import mylib.foobar
then every Airflow process that parses that file must be able to import
mylib, which is why the DAGs' dependencies end up packaged with the
Airflow binary itself.
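As a concrete illustration (mylib, its foobar module, and the run entry
point are all made up for this sketch), a DAG file along these lines
fails at parse time on any Airflow process that doesn't have mylib
installed:

# Hypothetical DAG file, e.g. dags/mylib_example.py. Every Airflow
# process that parses it (webserver, scheduler, worker) must be able
# to import mylib, so mylib has to ship with the Airflow install.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

import mylib.foobar  # raises ImportError at parse time if mylib is missing

dag = DAG(
    dag_id="mylib_example",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

run_foobar = PythonOperator(
    task_id="run_foobar",
    python_callable=mylib.foobar.run,  # assumed entry point in mylib
    dag=dag,
)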
We're running into a lot of pain with this. We have a CI system that
enables very rapid iteration on DAG code, but whenever you need to modify
plugin code, it requires a re-ship of all of the infrastructure, which
takes at least 10x longer than a DAG-deployment Jenkins build.
I think that Airflow should allow DAG dependencies and plugin code to be
deployed separately from the core infrastructure.
Thanks Kelvin and Max for your inputs!
To Kelvin’s questions:
1. “Shard by # of files may not yield same load”: fully agree with you. This
concern was also raised by other co-workers on my team, but given that this
is a preliminary trial, we haven’t considered it yet.
2. We haven’t started to look into this yet.
>> 1. “Shard by # of files may not yield same load”: fully agree with you.
This concern was also raised by other co-workers on my team, but given that
this is a preliminary trial, we haven’t considered it yet.
One issue here is deciding when to add one more shard. I think if
you monitor the load on each shard (e.g., how long its DAG files take to
parse), that gives you a signal for when to split one.
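As a rough sketch of that idea (nothing here is from the thread: the
shard count, DAGs path, and parse budget are all hypothetical), one could
approximate a shard's load by timing DAG file parses, which also shows
why equal file counts don't imply equal load:

# Estimate each shard's load by timing how long its DAG files take to
# execute (roughly what the scheduler does on every parse loop).
# NUM_SHARDS, DAGS_FOLDER, and PARSE_BUDGET_SECS are hypothetical.
import glob
import os
import time
import zlib

NUM_SHARDS = 4
DAGS_FOLDER = "/path/to/dags"
PARSE_BUDGET_SECS = 30.0

def shard_for(path):
    # Hashing equalizes file *counts* across shards, not parse *cost*.
    return zlib.crc32(path.encode("utf-8")) % NUM_SHARDS

def parse_seconds(path):
    with open(path) as f:
        source = f.read()
    start = time.time()
    exec(compile(source, path, "exec"), {})
    return time.time() - start

loads = [0.0] * NUM_SHARDS
for path in glob.glob(os.path.join(DAGS_FOLDER, "*.py")):
    loads[shard_for(path)] += parse_seconds(path)

for shard, load in enumerate(loads):
    flag = "  <-- over budget, consider adding a shard" if load > PARSE_BUDGET_SECS else ""
    print("shard %d: %.1fs%s" % (shard, load, flag))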
Saving logins on Airflow:
Can hooks do this job? Any code snippets or ideas would be useful.
Example from the official documentation:
>>> import airflow
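The quoted session cuts off at the import, so here is a hedged sketch of
one way to do it: persist the login as an Airflow Connection in the
metadata database, then read it back through the hooks machinery. The
conn_id "my_service" and the credentials are made up:

# A minimal sketch, assuming a hypothetical conn_id "my_service" and
# made-up credentials: store the login as a Connection, then let any
# hook look it up by conn_id.
from airflow import settings
from airflow.models import Connection
from airflow.hooks.base_hook import BaseHook

conn = Connection(
    conn_id="my_service",
    conn_type="http",
    host="example.com",
    login="my_user",
    password="my_password",  # stored encrypted if a fernet key is configured
)
session = settings.Session()
session.add(conn)
session.commit()
session.close()

# Hooks resolve logins by conn_id, so task code never hardcodes them.
stored = BaseHook.get_connection("my_service")
print("login for my_service: %s" % stored.login)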