Flow-based Airflow?

2017-01-23 Thread Bolke de Bruin
Hi All, I came across a write-up of some of the downsides of current workflow management systems like Airflow and Luigi (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they argue that dependencies should be between the inputs and outputs of tasks rather than between tasks

Re: Flow-based Airflow?

2017-01-23 Thread Edwards, Jesse
When we first started using Airflow, we had legacy systems using Autosys and cloud systems using Airflow, and we needed to bridge the gap between them. We also wanted to move away from time-based scheduling / processing to more of a dependency-driven model. Our

Re: Flow-based Airflow?

2017-01-23 Thread Bolke de Bruin
Oh, that’s interesting! I think the way Airflow uses tasks doesn’t entirely fit with the Flow model, e.g. in Luigi it is normal to derive from a Task. In Tasks you can just add the inlets (data dependency) you require for your particular dag. In Airflow we use templating more extensively and

Re: Flow-based Airflow?

2017-01-23 Thread Van Klaveren, Brian N.
I can give some insight from the physics world as far as this goes. First off, I think the dataflow puck is moving to platforms like Apache Beam. The main reason people (in science) don't just use Beam would be because they don't control the clusters they execute on. This is almost always true

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Hey all, I've upgraded on production. Things seem to be working so far (only been an hour), but I am seeing this in the scheduler logs: File Path PID Runtime Last Runtime Last Run

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Digging. Might be a bit. On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin wrote: > Slow query log? Db load? > > B. > > Sent from my iPad > > > On 23 Jan 2017 at 21:59, Chris Riccomini > wrote: > > > > Note: 6.5 million

Re: Experiences with 1.8.0

2017-01-23 Thread Arthur Wiedmer
Chris, Just double checking, you mean more than 15 seconds not 15 minutes, right? Best, Arthur On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini wrote: > Hey all, > > I've upgraded on production. Things seem to be working so far (only been an > hour), but I am seeing

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Also, seeing this in EVERY task that runs: [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this instance has been externally set to queued. Taking the poison pill. So long. [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with return code 0 All successful tasks are

Re: Flow-based Airflow?

2017-01-23 Thread Maxime Beauchemin
Just commented on the blog post: I agree that workflow engines should expose a way to document the data objects they read from and write to, so that they can be aware of the full graph of tasks and data objects and how it all relates. This metadata allows for clarity around

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Hey Bolke, Re: system usage, it's pretty quiet <5% CPU usage. Mem is almost all free as well. I am thinking that this is DB related, given that it's pausing when executing an update. Was looking at the update_state method in models.py, which logs right before the 15s pause. Cheers, Chris On

Re: Experiences with 1.8.0

2017-01-23 Thread Bolke de Bruin
Slow query log? Db load? B. Sent from my iPad > On 23 Jan 2017 at 21:59, Chris Riccomini > wrote: > > Note: 6.5 million TIs in the task_instance table. > > On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini >

Re: Experiences with 1.8.0

2017-01-23 Thread Maxime Beauchemin
Can you rebuild your indexes and recompute the table's stats and see if the optimizer is still off track? Assuming InnoDB and from memory: OPTIMIZE TABLE task_instance; ANALYZE TABLE task_instance; Max On Mon, Jan 23, 2017 at 3:45 PM, Arthur Wiedmer wrote: >

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
OK, it's using `state` instead of PRIMARY. Using PRIMARY with a hint, query takes .47s. Without hint, 10s. Going to try and patch. On Mon, Jan 23, 2017 at 2:57 PM, Chris Riccomini wrote: > This inner query takes 10s: > > SELECT task_instance.task_id AS task_id,

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
With this patch: $ git diff diff --git a/airflow/jobs.py b/airflow/jobs.py index f1de333..9d08e75 100644 --- a/airflow/jobs.py +++ b/airflow/jobs.py @@ -544,6 +544,7 @@ class SchedulerJob(BaseJob): .query( TI.task_id,

Re: Airflow Meetup @ Paypal (San Jose)

2017-01-23 Thread Jayesh Senjaliya
I am actually up for both; Paypal can host after Strata. Waiting for the community to comment as well. Thanks Jayesh On Mon, Jan 23, 2017 at 3:45 PM, Russell Jurney wrote: > I reached out and am waiting to hear if they have space. They did say that > attendees of

Re: Flow-based Airflow?

2017-01-23 Thread Maxime Beauchemin
A few other thoughts related to this. Early on in the project, I had designed but never launched a feature called "data lineage annotations", allowing people to define a list of sources and a list of targets related to each task for documentation purposes. My idea was to use a simple annotation

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
This inner query takes 10s: SELECT task_instance.task_id AS task_id, max(task_instance.execution_date) AS max_ti FROM task_instance WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success' AND task_instance.task_id IN ('t1', 't2') GROUP BY task_instance.task_id Explain seems OK:
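For readers who want to try the shape of that inner query, here is a minimal sketch on a toy table. It uses SQLite from the Python standard library as a stand-in for MySQL, so it says nothing about the MySQL optimizer or the 10s timing. The schema is an assumption: only the columns named in the query are included, the primary key layout is a guess, and `ti_state` is a hypothetical name for the index on `state` mentioned in the thread. The helper names `build_demo_db` and `latest_success_per_task` are likewise made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with an ASSUMED, simplified task_instance schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        " task_id TEXT NOT NULL,"
        " dag_id TEXT NOT NULL,"
        " execution_date TEXT NOT NULL,"
        " state TEXT,"
        " PRIMARY KEY (task_id, dag_id, execution_date))"
    )
    # Hypothetical index name; stands in for the `state` index from the thread.
    conn.execute("CREATE INDEX ti_state ON task_instance (state)")
    rows = [
        ("t1", "dag1", "2017-01-21", "success"),
        ("t1", "dag1", "2017-01-22", "success"),
        ("t1", "dag1", "2017-01-23", "failed"),   # ignored: not a success
        ("t2", "dag1", "2017-01-23", "success"),
        ("t3", "dag1", "2017-01-23", "success"),  # ignored: task not requested
        ("t1", "dag2", "2017-01-24", "success"),  # ignored: different dag
    ]
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", rows)
    return conn


def latest_success_per_task(conn, dag_id, task_ids):
    """Run the quoted inner query: latest successful execution_date per task."""
    placeholders = ", ".join("?" for _ in task_ids)
    sql = (
        "SELECT task_instance.task_id AS task_id, "
        "max(task_instance.execution_date) AS max_ti "
        "FROM task_instance "
        "WHERE task_instance.dag_id = ? "
        "AND task_instance.state = 'success' "
        f"AND task_instance.task_id IN ({placeholders}) "
        "GROUP BY task_instance.task_id "
        # ORDER BY is added here only to make the demo output deterministic.
        "ORDER BY task_instance.task_id"
    )
    return conn.execute(sql, [dag_id, *task_ids]).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for task_id, max_ti in latest_success_per_task(demo, "dag1", ["t1", "t2"]):
        print(task_id, max_ti)
```

The query filters on `dag_id`, `state` and `task_id` at once, which is why the choice between an index on `state` alone and the primary key matters so much on a table with millions of rows.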

Re: Experiences with 1.8.0

2017-01-23 Thread Arthur Wiedmer
Maybe we can start with " .with_hint(TI, 'USE INDEX (PRIMARY)', dialect_name='mysql')" and see if other databases exhibit the same query plan issue ? Best, Arthur On Mon, Jan 23, 2017 at 3:27 PM, Chris Riccomini wrote: > With this patch: > > $ git diff > diff --git

Re: Airflow Meetup @ Paypal (San Jose)

2017-01-23 Thread Russell Jurney
I reached out and am waiting to hear if they have space. They did say that attendees of meetups in the evening do NOT need to have a Strata pass. I'm new here, so I don't want to hijack your meetup. If you guys want Paypal, let's have Paypal host. I'm sure it will be great either way. On Fri,

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Oops, yes, 15 seconds, sorry. Operating without much sleep. :P On Mon, Jan 23, 2017 at 12:35 PM, Arthur Wiedmer wrote: > Chris, > > Just double checking, you mean more than 15 seconds not 15 minutes, right? > > Best, > Arthur > > On Mon, Jan 23, 2017 at 12:27 PM,

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Note: 6.5 million TIs in the task_instance table. On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini wrote: > Hey Bolke, > > Re: system usage, it's pretty quiet <5% CPU usage. Mem is almost all free > as well. > > I am thinking that this is DB related, given that it's

Re: Experiences with 1.8.0

2017-01-23 Thread Chris Riccomini
Can confirm it's a slow query on task_instance table. Still digging. Unfortunately, the query is truncated in my UI right now: SELECT task_instance.task_id AS task_instance_... On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini wrote: > Digging. Might be a bit. > > On Mon,

QCon London

2017-01-23 Thread siddharth anand
Hi Folks! I will be attending QCon London Mar 5-8. Happy to meet locals and talk Airflow and data infrastructure if there is interest. FYI, I'm also a co-chair for QCon London and would be very interested in getting a deeper understanding of the local (London and environs) tech scene and potential

Re: Flow-based Airflow?

2017-01-23 Thread Laura Lorenz
We were struggling with the same problem and came up with fileflow, which we wrote to deal with passing data down a DAG in Airflow. We co-opt Airflow's task dependency system to represent the data dependencies and let fileflow handle knowing where

Re: Flow-based Airflow?

2017-01-23 Thread Glenn McClements
We’ve just started using Airflow as a platform to replace some older internally built systems, but one of the things we also looked at was a _newer_ internally built system which basically did the below. In fact it came as a surprise when I started looking around at open source systems like

Re: Flow-based Airflow?

2017-01-23 Thread Boris Tyukin
This is a good discussion. Most traditional ETL tools (SSIS, Informatica, DataStage etc.) have both control flow (or task dependency) and data flow. Some tools like SSIS make a clear distinction between them: you create a control flow that calls data flows as part of the overall control flow.