fantastic! congrats to all. great xmas news, enjoy the holidays.
Sent from my iPhone
> On 20 Dec 2018, at 18:43, Kaxil Naik wrote:
>
> Congratulations all.
>
>> On Thu, Dec 20, 2018, 21:41 Driesprong, Fokko >
>> Awesome! Congrats!
>>
>> Cheers, Fokko
>>
>> On Thu, 20 Dec 2018 at 22:40
There was a discussion about a unit testing approach last year, in 2017 I
believe. If you dig through the mail archives, you can find it.
My take is:
- You should test "hooks" against some real system, which can be a docker
container. Make sure the behavior is predictable when talking against that
system.
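To make that concrete, here is a minimal sketch of such a hook test. It assumes a throwaway Postgres docker container is already running and that an Airflow connection named "test_postgres" points at it; both names are made up for illustration:

from airflow.hooks.postgres_hook import PostgresHook


def test_get_records_returns_predictable_rows():
    # "test_postgres" is assumed to point at the dockerized test database
    hook = PostgresHook(postgres_conn_id="test_postgres")
    rows = hook.get_records("SELECT 1 AS answer")
    assert rows == [(1,)]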
(
> task_id=shift_callable.__name__,
> python_callable=adwords_callable,
> ) >> TriggerDagRunOperator(
> task_id='retry_dag_on_failure',
> trigger_dag_id=dag_id,
> trigger_rule=TriggerRule.ONE_FAILED,
>
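For readers landing on this thread, a hedged reconstruction of the pattern in the quoted snippet above: a task chained to a TriggerDagRunOperator that re-triggers the same DAG when any upstream task failed. The dag id and callable are placeholders, and depending on your Airflow version the TriggerDagRunOperator may also require a python_callable argument:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.utils.trigger_rule import TriggerRule

dag_id = "adwords_shift"  # made-up dag id
dag = DAG(dag_id=dag_id, start_date=datetime(2018, 1, 1), schedule_interval="@daily")


def adwords_callable(**context):
    """Placeholder for the real work."""
    pass


work = PythonOperator(
    task_id="shift_adwords_data",
    python_callable=adwords_callable,
    provide_context=True,
    dag=dag,
)

# When any upstream task failed, re-trigger this same DAG.
retry = TriggerDagRunOperator(
    task_id="retry_dag_on_failure",
    trigger_dag_id=dag_id,
    trigger_rule=TriggerRule.ONE_FAILED,
    dag=dag,
)

work >> retry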
Have you looked into pools? Pools allow you to specify how many tasks at
any given time should use a common resource.
That way you could limit this to 1, 2, or 3 for example. Pools are not
dynamic however, so they only let you put an upper limit on how many
clients are going to hit the API at any one time.
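A small sketch of how a task gets attached to a pool (the pool name and DAG are made up; the pool itself is created beforehand under Admin -> Pools in the UI, or via the CLI):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("api_consumer", start_date=datetime(2018, 1, 1), schedule_interval="@daily")


def my_api_callable():
    pass  # placeholder for the real API call


call_api = PythonOperator(
    task_id="call_external_api",
    python_callable=my_api_callable,
    pool="external_api",  # at most <pool slot count> of these run at the same time
    dag=dag,
)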
We are using two cluster instances. One cluster is for the engineering
teams in the "tech" wing, which rigorously follow the tech principles; the
other instance is for business analysts doing more ad-hoc, experimental
work, who do not necessarily follow those principles. We have a nomad
Hi all,
I have a question regarding the processing of individual files:
We collect some flat files from different sources in csv, raw and
unstructured formats.
These files are stored in a "{process}//MM/DD/" hierarchy and we've built
a GCSToGCSTransform operator, which runs a
Really good stuff. About 1.5 years ago I talked to some googlers about how
awesome it would be to
integrate the principles of airflow into GCP and maybe even make it
available as some sort of
"launcher". Looks like you went beyond that and made it a core product
instead!
Going to look into this over
fantastic effort, much appreciated! go go go
On Mon, Apr 23, 2018 at 11:08 AM, Ace Haidrey wrote:
> Great work Fokko and Bolke!
>
> Sent from my iPhone
>
> > On Apr 23, 2018, at 11:07 AM, Sid Anand wrote:
> >
> > Awesome!
> > -s
> >
> >> On Mon, Apr 23,
Someone contacted me, looking specifically for someone with airflow
experience. If anyone is interested, you can contact him through the email
below:
For one of our clients in Stockholm we're currently searching for a
contractor to assist with Airflow ETL orchestration. It's an
Yesterday I finished the draft of a new example on the "ETL with airflow"
site. This example explores the concept of a "Data vault" methodology on
top of Hive, 100% orchestrated by airflow:
https://gtoonstra.github.io/etl-with-airflow/datavault2.html
The theory of the data vault is that you can
y, but could it be the location of where max_active_runs
> is specified? In our DAGs we pass it directly as an argument to the DAG()
> call, not via default_arguments and it behaves itself for us. I think
> I should check that!
>
> -ash
>
>
> > On 14 Feb 2018, at 13:43, Gera
A user on airflow 1.9.0 reports that 'max_active_runs' isn't respected. I
remembered having fixed something related to this ages ago; that fix is
here:
https://issues.apache.org/jira/browse/AIRFLOW-137
That however was related to backfills and clearing the dagruns.
I watched him in the scenario
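For reference, a minimal sketch of setting max_active_runs directly on the DAG object, as the quoted reply further up suggests (dag id and dates are made up):

from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="example_max_active_runs",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,  # at most one DagRun of this DAG in flight at a time
)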
As long as the differences are in API methods and not a rearrangement of
the package structure, the latter option would work. This is because the
operators would be imported by the scheduler, just not executed (and would
therefore perhaps not call the specific operator methods).
If you serialize the
What we need is an airflow flamethrower
On Sat, Jan 27, 2018 at 2:21 AM, Hbw
wrote:
> Me too!
>
> Had an Ant hat for years...
>
> Sent from a device with less than stellar autocorrect
>
> > On Jan 26, 2018, at 2:43 PM, Trent Robbins
This was a question I put in a survey I once conducted. The survey is
available here (including the individual results at the bottom):
https://cwiki.apache.org/confluence/display/AIRFLOW/Apache+Airflow+survey+2017-06-24
1. I recommend not using different versions in separate directories, but
I've added a new example using a "Data Vault" methodology available here:
https://gtoonstra.github.io/etl-with-airflow/datavault.html
What I find compelling about DataVault is how it enables you to store data
in a flexible way and regenerate some downstream star schema on the fly
from scratch
Hi Sid,
Is this open to Europeans as well, those who don't necessarily have a US visa?
Rgds,
gerard
Sent from my iPhone
> On 2 Jan 2018, at 20:13, Sid Anand wrote:
>
> Hi Folks!
> I'm looking for a few folks who want to work on Apache Airflow (@ PayPal
> scale), which is
Hello,
Has anyone looked at / implemented / ruled out SAML 2.0 authentication
for airflow? I did a search on google, but this didn't return anything
specific.
Rgds,
Gerard
accept everyone and then we can
start writing your ideas, references,
things you looked at, tried, etc.
See you there!
Gerard
On Sun, Dec 3, 2017 at 11:40 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:
> I'm def in.
>
> Thanks for organising Gerard!
>
> On Sun, 3 Dec 2017 at 07:
Good morning,
The meeting has been scheduled for Wednesday, 6th December:
London 17:00:00 UTC, Amsterdam 18:00:00 UTC+1, San Francisco 09:00:00 PST,
New York 12:00:00 EST
Gerard Toonstra is inviting you to a scheduled Zoom meeting.
Join from PC, Mac, Linux, iOS or Android: https
working on sql scanners, extractors and other tools that
> >> allow me
> >>> to
> >>>>> populate the database
> >>>>> '''
> >>>>>
> >>>>> Very cool. Cloudera Navigator ( not an open source product) does
Hi all,
So something that really drew my attention recently is a "data portal" as
described by a team from airbnb sometime in May. The idea is basically a
"facebook of data":
https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
Unfortunately it looks like it's not
; AIRFLOW__GOOGLE__OAUTH_CALLBACK_ROUTE=/
> AIRFLOW__GOOGLE__DOMAIN=XXXXX
>
> -----Original Message-
> From: Gerard Toonstra [mailto:gtoons...@gmail.com]
> Sent: Thursday, November 9, 2017 1:59 PM
> To: dev@airflow.incubator.apache.org
> Subject: Re: Airflow configuration in envi
What's the variable key you are using? Does it follow this convention?
https://airflow.apache.org/configuration.html
That's AIRFLOW (two underscores) configuration section (two underscores)
env var.
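A small illustration of that convention (values are made up); these variables have to be present in the environment that launches the webserver, scheduler and workers:

import os

# equivalent to "parallelism = 8" under [core] in airflow.cfg
os.environ["AIRFLOW__CORE__PARALLELISM"] = "8"
# equivalent to "sql_alchemy_conn" under [core]
os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = "postgresql://user:pass@db/airflow"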
G>
On Thu, Nov 9, 2017 at 8:30 AM, Somasundaram Sekar <
somasundar.se...@tigeranalytics.com>
wrote:
> Any chance your talk was recorded?
>
> Thanks,
> Mike
>
>
> > On Oct 29, 2017, at 6:29 AM, Gerard Toonstra <gtoons...@gmail.com>
> wrote:
> >
> > Hi all,
> >
> > Thursday the 26/10 my employer Coolblue organized a "Behind
Hey,
As Bolke said, with LE and tasks consuming variable amounts of memory, you
can run into memory issues on a container. I'd reconsider running on a
containerized
environment at all, because with the LE and the scheduler, you need to set
up a huge one for that to work. You're probably better
ntered when migrating from Azkaban to Airflow?
>
> Kind regards,
> Fokko Driesprong
>
> 2017-10-29 11:29 GMT+01:00 Gerard Toonstra <gtoons...@gmail.com>:
>
> > Hi all,
> >
> > Thursday the 26/10 my employer Coolblue organized a "Behind the Scene
Hi all,
Thursday the 26/10 my employer Coolblue organized a "Behind the Scenes"
event. It is an opportunity for engineers to talk about stuff they work on
and usually they provide two presentations.
This event was about BigData and Processing. As (now) team lead of Data
Platform, I decided to
I'll be doing a little talk about Apache Airflow at the London BigDataWeek
conference on 13th of October, focusing on how airflow is designed around
following some important ETL principles and showcasing an example
deployment solution architecture on AWS:
Hi Larry,
The important thing to question is what kind of interface that other system
has. It is a little bit unusual in the sense that this DAG processes across
multiple days.
The potential issue I foresee here is that you don't mention a consistent
start date for the DAG and you expect this to
Hi David,
When tasks are put on the MQ, they are out of the control of the scheduler.
The scheduler sets the state of that task instance to "queued".
What happens next:
1. A worker picks up the task to run and tries to run it.
2. It first executes a couple of checks against the DB prior to
send me an invite too!
On Thu, Jul 20, 2017 at 8:17 PM, Jeremiah Lowin wrote:
> I'm interested as well.
>
> On Thu, Jul 20, 2017 at 1:51 PM Marc Bollinger wrote:
>
> > +1 We're in the middle of moving some services to k8s, and have had our
> > eye on
It would be really good if you'd share experiences on how to run this on
kubernetes and ECS.
I'm not aware of a good guide on how to run this on either for example, but
it's a very useful and
quick setup to start with, especially combining that with deployment
manager and cloudformation
>>
>>> > @chris: Thank you! My wiki name is dimberman.
>>> > @gerard: I've started writing out my reply but there's a fair amount to
>>> > respond to so I'll need a few minutes :).
>>> >
>>> > On Wed, Jul 5, 2017 at 1:17 PM Chris Riccom
There is an API where you can get table details in Python. There are multiple
APIs, all using the underlying REST one. The one I'm talking about is where you
can call exists and get the row count and creation and modification details.
Saves some money and time perhaps.
Sent from my iPhone
> On 5 Jul 2017, at
_port='hostnme.jp.local:9553'
> cdna_daily_common.env='dev'
> cdna_daily_common.alert_email='dev-dsd-...@mail.com'
> cdna_daily_common.spdb_sync_prefix='echo SPDBSync'
> cdna_daily_common.post_validate_prefix='echo PostVal'
> cdna_daily_common.schedule_interval='0 2 * * *'
> cdna_daily_common.d
For airflow to find dags, a .py file is read. The file should contain
either "DAG" or "airflow" somewhere to be considered a potential dag file.
Then there are some additional rules on whether this actually gets scheduled
or not. The log files for the dag file processors are not the same
as the main
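For illustration, a minimal file that satisfies the discovery heuristic described above: it is a .py file in the dags folder and contains both "airflow" and "DAG" (names and dates are made up):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="minimal_example",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

DummyOperator(task_id="noop", dag=dag)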
Hi all,
The Apache Airflow survey period expired. The results are collected and, as
promised, I'm sharing the full results of the survey. It's on the wiki with
the raw survey results available at the bottom of the page:
Hi all,
I was talking about a dev project I was working on and there's some
progress:
https://github.com/gtoonstra/airflow-hovercraft
There are two types of tests:
1. behavior tests: These test the behavior of operators against a stubbed
out "hook", which is driven through python "behave"
f we do not use it in
> production yet but going to, are we eligible to take it? :)
> Boris
>
> On Sat, Jun 10, 2017 at 6:40 AM, Gerard Toonstra <gtoons...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I'm curious how others are using and deploying airfl
Hi all,
I'm curious how others are using and deploying airflow. Rather than
inviting people to reply to this email on a dev list, I created a survey
with 10 questions (all visible on first page):
https://goo.gl/forms/FeSBMfI7O8oe8wZu2
I'm going to close and share the results of that survey in 2
and I can
add you.
Rgds,
Gerard
On Thu, May 18, 2017 at 2:00 PM, Gerard Toonstra <gtoons...@gmail.com>
wrote:
>
>> On Tue, May 9, 2017 at 9:46 PM, Arthur Wiedmer <arthur.wied...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I would love to
Is anyone at or going to visit the Strata conference London? I'll be there
tomorrow and Thursday, would be good to connect with others using airflow,
share experiences, have a coffee, etc.
Rgds,
Gerard
ce
and if/how this contributes to the bottom line of higher quality?
Rgds,
Gerard
On Wed, May 10, 2017 at 8:27 AM, Gerard Toonstra <gtoons...@gmail.com>
wrote:
> Hi Laura,
>
> Yes, testing hooks and operators is about the basic behavior of those, so
> you look for infrastruct
scheduler go at it, and
> > then
> > > check the metadata database for what workflow happened (and, if we had
> > test
> > > integration services, maybe also check the output against the known
> > output
> > > for the seeded input). I can defini
Very interesting video. I was unable to take part. I watched only part of
it for now.
Let us know where the discussion is being moved to.
The confluence does indeed seem to be the place to put final conclusions
and thoughts.
For airflow, I like to make a distinction between "platform" and
There was a discussion on google groups about that:
https://groups.google.com/forum/#!topic/airbnb_airflow/GRdoW30PNUI
On Thu, Apr 27, 2017 at 9:42 AM, Devjyoti Patra
wrote:
> Hi,
>
> I am trying to pass the following configuration parameters to Airflow CLI
> while
a dag...
> >>
> >>
> >> Sent from my iPhone
> >>
> >> > On 24 Apr 2017, at 22:46, Dan Davydov <dan.davy...@airbnb.com.
> INVALID>
> >> wrote:
> >> >
> >> > One idea to solve this is to use a daemon that uses
Hey,
I've seen some people complain about DAG file processing times. An issue
was raised about this today:
https://issues.apache.org/jira/browse/AIRFLOW-1139
I attempted to provide a good explanation of what's going on. Feel free to
validate and comment.
I'm noticing that the file processor is a
Very nice. I noticed a bug in one line:
pip install pip install airflow==1.8.0
I see you've used the plugin class to add operators, so they appear in the
airflow.operators namespace.
I'm wondering about what other people are doing there and what the best way
is to add custom operators
to
ags_are_paused_at_creation = True
> non_pooled_task_slot_count = 128
> max_active_runs_per_dag = 16
> ...
>
> Pretty much the defaults; I've never tweaked these values.
>
>
>
> -N
> nik.hodgkin...@collectivehealth.com
>
> On Mon, Mar 27, 2017 at 12:12 PM, Gerard Toonstr
; done.
> > > > Attaching to program: /usr/bin/python, process 2391
> > > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
> > from
> > > > /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > > done.
> > > > Loaded s
>
>
> By the way, I remember that the scheduler would only spawn one or three
> processes, but I may be wrong.
> Right now when I start, it spawns 7 separate processes for the scheduler
> (8 total) with some additional
> ones spawned when the dag file processor starts.
>
>
These other processes
What may be helpful to dive into this a bit more is "pyrasite". You need
gdb installed on the machine, but afterwards you can attach to a running
process and
then use python "payloads" to investigate what's going on, for example dump
the stack trace per thread:
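The payload below is a hand-written sketch of that kind of stack-dump payload, not the one shipped with pyrasite; it simply walks all threads in the target process and prints their tracebacks:

import sys
import traceback

for thread_id, frame in sys._current_frames().items():
    print("Thread %s" % thread_id)
    traceback.print_stack(frame)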
Hi Rui,
I worked a bit on the scheduler and added some of my comments below.
On Tue, Mar 14, 2017 at 11:08 PM, Rui Wang
wrote:
> Hi,
> The design doc below I created is trying to make airflow scheduler more
> centralized. Briefly speaking, I propose moving state
You can also generalize the tasks to make them more reusable:
1. an operator that runs a query and stores the result in a file in a
generally available location (for all workers); see the sketch after this list.
2. an operator extending the email operator that pulls in the file from the
general location to the local worker and
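A rough sketch of the first idea, assuming the query runs against Postgres and that all workers can see the destination path; the operator name and connection id are made up:

import csv

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class QueryToFileOperator(BaseOperator):
    """Runs a SQL query and writes the result set to a CSV on shared storage."""

    template_fields = ("sql", "dest_path")

    @apply_defaults
    def __init__(self, sql, dest_path, postgres_conn_id="postgres_default",
                 *args, **kwargs):
        super(QueryToFileOperator, self).__init__(*args, **kwargs)
        self.sql = sql
        self.dest_path = dest_path
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        records = hook.get_records(self.sql)
        with open(self.dest_path, "w") as f:
            csv.writer(f).writerows(records)
        return self.dest_path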
Hi Ali,
Sounds like a better thing to do. If the response gets too big, you'll
probably want to store the results as a file or otherwise
immediately process them in a bit of python code.
Whether you want to write a custom operator for this depends on how many times
you wish to reuse this
35c99e51ea936c756c00332c4a4a/airflow/models.py#L1489>
>
> Bolke
>
> > On 21 Feb 2017, at 22:41, Gerard Toonstra <gtoons...@gmail.com> wrote:
> >
> > Hey all,
> >
> > I'm writing up a bit more about best practices for airflow and realize
> that
Hey all,
I'm writing up a bit more about best practices for airflow and realize that
there may be one important macro that's missing, but which sounds really
useful. This is a list of the default macros:
https://airflow.incubator.apache.org/code.html#macros
The "execution_date" or "ds" is some
>
>
>
> Per this email thread, it almost sounds like a slack team/discourse for
> data engineering might be useful.
>
>
I certainly would not mind getting more knowledge on this topic and I'd
like to be invited to such a slack group (or google group).
You mentioned Vertica and Parquet. Is it recommended to use these newer
tools even when the DWH is not BigData
size (about 150G)?
So there are a couple of good benefits, but are there any downsides and
disadvantages you have to take into account
comparing Vertica vs. SQL Server for
More ideas:
- An "airflow" plugin at the moment is more of an extension; operators,
hooks, macros.
Consider an additional plugin API + default implementation for code
inside airflow that
has a cross-cutting concern, like:
* We start to use datadog for heavier monitoring of what's going
+1 on driving everything through a REST API including the UI. This unifies
the access to the scheduler and increases stability.
Consider running a very small webserver (node.js + socket.io), which
enables airflow to communicate scheduler events as they happen
to anything that connects to it
Also in 1.7.1.3, there's the ShortCircuitOperator, which can give you an
example.
https://github.com/apache/incubator-airflow/blob/1.7.1.3/airflow/operators/python_operator.py
You'd have to modify this to your needs, but the way it works is that if
the condition evaluates to True, none of the
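A hedged usage sketch (dag and task names are made up): the callable decides whether the rest of the branch runs at all.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import ShortCircuitOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("short_circuit_example", start_date=datetime(2017, 1, 1),
          schedule_interval="@daily")

check = ShortCircuitOperator(
    task_id="only_run_on_mondays",
    python_callable=lambda **ctx: ctx["execution_date"].weekday() == 0,
    provide_context=True,
    dag=dag,
)

do_work = DummyOperator(task_id="do_work", dag=dag)

# When the callable returns False, "do_work" and everything downstream of it
# is marked skipped instead of being run.
check >> do_work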
.@apache.org>
> wrote:
>
> > Gerard,
> > Please sign up for a CWiki <https://github.com/apache/incubator-airflow>
> > account and reply to this email with your user name. I just searched
> > for "Gerard
> > Toonstra" and didn't find a user on
Hey Jason,
Let me try to answer them for you. I hope I get everything 100% right,
because I'm also pretty new to airflow.
Hopefully someone on the list corrects me if it's horribly wrong.
On Wed, Nov 2, 2016 at 9:24 PM, Jason Chen
wrote:
> Hi Airflow team,
>
> We are
They are both on the project page of the airflow documentation under resources
& links, and on the wiki; the wiki is a bit
richer in that regard. Maybe link to the wiki from the doc pages instead,
so it's all in one place?
https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Links
David,
When you say "massive" oversubscribing, are you running a lot of dags in
parallel that
use your configured pools? Access to pools is not atomic at the moment.
Can you also quantify "massive"? Not that it matters, but to get a better
idea.
Rgds,
Gerard
On Tue, Nov 1, 2016 at 4:09 PM,
I was looking at trying to fix AIRFLOW-137 (max_active_runs not respected),
but quickly noticed that the code that does all the scheduling is rather
complex with state updates going on across multiple source files in
multiple threads, etc.
It's then best to find a suitable way to visualize all
cron expressions:
https://airflow.incubator.apache.org/scheduler.html#dag-runs
On Tue, Oct 25, 2016 at 12:23 PM, Manning, Kieran (Consultant) <
kieran.mann...@consultant.renre.com> wrote:
> Hi all,
>
> Sorry to piggyback on this question thread but I have something similar.
> We need to be able
What some people do is give the new dag a new version in the name, like _v1
or _v2 at the end. Then it's treated like another dag and you can disable
the old one. If you make changes to dags, it's possible that old
operators/tasks are no longer visible in the UI and you no longer have
access to
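For illustration, the convention looks like this (dag name and dates are made up); the old "sales_import_v1" stays around, simply switched off in the UI:

from datetime import datetime

from airflow import DAG

# Incompatible new structure gets a new dag_id with a bumped version suffix.
dag = DAG(
    dag_id="sales_import_v2",
    start_date=datetime(2016, 10, 1),
    schedule_interval="@daily",
)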
Even when the
pool is 10 and the number
of instances 7, it takes longer for the instances to actually run.
Looking forward to your comments on how some approaches could be improved.
Rgds,
Gerard
On Wed, Oct 19, 2016 at 8:17 AM, Gerard Toonstra <gtoons...@gmail.com>
wrote:
>
> T
m hoping to see some best practices for the design of incremental
> > loads
> > > and using timestamps from source database systems (not being on UTC so
> > > still confused about it in Airflow). Also practical use of subdags and
> > > dynamic generation of t
ices for the design of incremental loads
> and using timestamps from source database systems (not being on UTC so
> still confused about it in Airflow). Also practical use of subdags and
> dynamic generation of tasks using some external metadata (maybe describe in
> details something simi
Hi all,
About a year ago, I contributed the HTTPOperator/Sensor and I've been
tracking airflow since. Right now it looks like we're going to adopt
airflow at the company I'm currently working at.
In preparation for that, I've done a bit of research work on how airflow
pipelines should fit together,
Dinesh,
Interesting use case. I'm not sure how this will work out for you
eventually compared to a specialized workflow tool,
but here are some considerations that you should make to evaluate your
chances of success:
A complex business workflow will at some point require some more complex
input
The scheduler is probably single threaded, but it's a good idea to make
sure and investigate postgres (or mysql) locks:
https://wiki.postgresql.org/wiki/Lock_Monitoring
On Wed, Sep 7, 2016 at 8:30 AM, Bolke de Bruin wrote:
> Thanks!
>
> Apache scrubs attachments. Can you
Hi all,
I did a demo of airflow for an organisation where they currently use
azkaban and they liked the project and demonstrated interest in using it.
The installation however was considered a bit more work than they wanted:
mysql db, celery, rabbitMQ and scheduler that all had to be puppetized