Song:
You can put an operator as the very first node in the DAG, and have
everything else in the DAG depend on it. For example, this is the approach
we use to only execute DAG tasks on stock market trading days.
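For illustration, the gate's callable can be as simple as a predicate like this (a minimal pure-Python sketch; the `is_trading_day` name and the holiday set are made up here, and in practice you'd consult a real exchange calendar):

```python
from datetime import date

# Illustrative holiday set -- a real deployment would use an exchange calendar.
MARKET_HOLIDAYS = {date(2018, 7, 4), date(2018, 12, 25)}

def is_trading_day(d: date) -> bool:
    """Return True if the market is open: a weekday that is not a holiday."""
    return d.weekday() < 5 and d not in MARKET_HOLIDAYS
```

In Airflow, you'd wrap a predicate like this in a ShortCircuitOperator placed as the first node, so every downstream task is skipped on non-trading days.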
-James M.
On Fri, May 11, 2018 at 3:57 AM, Song Liu wrote:
>
Hi folks,
At Velocity New York in October, I will be presenting about how Quantopian
uses Airflow for financial data:
https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/70048
We couldn't have adopted Airflow so quickly without the hard work of
contributors who have made it
Hi folks,
I got an email from our email administrator that:
"However, it looks like the AirFlow mailing list isn't rewriting email
headers in messages sent to the list, such that all messages sent to the
list from domains that use DMARC are non-compliant.
At some point we're going to have to
ne is
> taking longer than X hours, not just if each task is taking more than X
> hours. Is it still feasible to do that?
>
> > On May 24, 2018, at 12:12 PM, James Meickle <jmeic...@quantopian.com>
> wrote:
> >
> > Just giving this a bump; it's a pretty
Just giving this a bump; it's a pretty major rework so I'd love to know
whether this effort is likely to be accepted if I bring it to a PR-able
state, before I invest more time.
On Wed, May 23, 2018 at 1:59 PM, James Meickle <jmeic...@quantopian.com>
wrote:
> Hi folks,
>
> I've c
n Wed, May 9, 2018 at 12:43 PM, James Meickle <jmeic...@quantopian.com>
wrote:
> Hi all,
>
> Since the response so far has been positive or neutral, I intend to submit
> one or more PRs targeting 2.0 (I think that some parts will be separable
> from a larger SLA refactor). I int
Hi,
I have a sandbox cluster at work (3 EC2 VMs + Celery on Elasticache) that I
have been running 1.10 on. This is because we want to test in advance the
Kubernetes operator and RBAC. Happy to lend some assistance with that.
-James M.
On Fri, Jun 15, 2018 at 6:28 AM, Driesprong, Fokko
wrote:
AM, James Meickle
wrote:
> While Googling something Airflow-related a few weeks ago, I noticed that
> someone's Airflow dashboard had been indexed by Google and was accessible
> to the outside world without authentication. A little more Googling
> revealed a handful of other index
We have to use a lot of time sensors like this, for reports that shouldn't
be filed to a third party before a certain time of day. Since these sensors
are themselves tasks, they can fail to be scheduled, or can fail outright
(for example, if the underlying worker instance dies). I would recommend double checking
> > Taylor
> >
> > *Taylor Edmiston*
> > Blog <https://blog.tedmiston.com/> | CV
> > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > <https://angel.co/taylor> | Stac
et. Is this one of
> the points you are making?
> Thanks.
>
> On Tue, Jun 5, 2018 at 9:41 AM James Meickle
> wrote:
>
> > We have to use a lot of time sensors like this, for reports that
> shouldn't
> > be filed to a third party before a certain time of day. Since these
An important consideration here is that there are several settings that are
cluster-wide. In particular, cluster-wide concurrency settings could result
in Team B's DAG refusing to schedule based on an error in Team A's DAG.
Do your teams follow similar practices in how eagerly they ship code, or
One way to do this would be to have your DAG file return two nearly
identical DAGs (like put it in a factory function and call it twice). The
difference would be that the "final" run would have a conditional to add an
extra time sensor at the DAG root, to wait N days for the data to finalize.
The
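A minimal sketch of that factory shape (plain Python dicts standing in for real DAG and sensor objects; all names here are illustrative):

```python
# Hypothetical sketch: build_dag() is a factory, called once per variant.
# The "final" variant adds an extra wait step at the root.
def build_dag(dag_id: str, final: bool):
    tasks = []
    if final:
        # Wait N days for upstream data to be finalized before anything runs.
        tasks.append("wait_for_finalized_data")
    tasks += ["extract", "transform", "load"]
    return {"dag_id": dag_id, "tasks": tasks}

preliminary = build_dag("report_preliminary", final=False)
final_run = build_dag("report_final", final=True)
```

In a real DAG file you'd return actual DAG objects from the factory and assign both to module-level names so Airflow picks them up.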
I installed v1-10-test at work. I'm having an issue with it:
File
"XXX/airflow/dags/airflow-sandbox/submodules/quantflow/tests/operators/test_zipline_operators.py",
line 6, in
from freezegun import freeze_time
ImportError: No module named 'freezegun'
From the same dir:
$ cat
ile'
},
'consumer_key': '{{ vault_airflow_google_oauth_key }}',
'consumer_secret': '{{ vault_airflow_google_oauth_secret }}'
}
}]
On Mon, Apr 30, 2018 at 12:54 PM, Jørn A Hansen <jornhan...@gmail.com>
wrote:
> On Mon, 30 Apr 2018 at 15.56, James Meickle <jmeic...@quantopian.com>
> wrote:
>
> > Ins
At Quantopian we use Airflow to produce artifacts based on the previous
day's stock market data. These artifacts are required for us to trade on
today's stock market. Therefore, I've been investing time in improving
Airflow notifications (such as writing PagerDuty and Slack integrations).
My
Installed this off of the branch, and I do get the Kubernetes executor
(incl. demo DAG) and some bug fixes - but I don't see any RBAC feature
anywhere I'd think to look. Do I need to set up some config to get that to
show up?
On Mon, Apr 23, 2018 at 2:06 PM, Bolke de Bruin
> about,
> > expected execution times
> >
> > Also SLA trigger on backfills and manual reruns of tasks
> >
> > I see this as a critical feature for production monitoring so would love
> to
> > see this get improved
> >
> > On Wed, May 2, 2018, 12:00 PM James
I have made this mistake a few times. I think it would be great if Airflow
warned about DAG-level arguments being passed into tasks they don't apply
to, since that would indicate an easily fixable mistake.
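A sketch of what such a warning could look like (hypothetical; `warn_on_unknown_args` and `fake_operator` are made-up names, not an Airflow API): compare the arguments a task receives against the parameters its operator's constructor actually accepts.

```python
import inspect
import warnings

# Hypothetical sketch: flag any default_args entry the operator's
# __init__ does not accept, since it will silently do nothing.
def warn_on_unknown_args(operator_init, default_args):
    accepted = set(inspect.signature(operator_init).parameters)
    for name in default_args:
        if name not in accepted:
            warnings.warn(f"argument {name!r} is not accepted by this operator")

# Stand-in for an operator constructor, for demonstration only.
def fake_operator(task_id, retries=0):
    pass
```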
On Wed, Feb 14, 2018 at 9:22 AM, Gerard Toonstra
wrote:
> One of
Can we make it a policy going forward to push GH tags for all RCs as part
of the release announcement? I deploy via the incubator-airflow repo, but
currently it only has tags for up to RC2, which means I have to look up and
then specify an ugly-looking commit to deploy an RC :(
On Wed, Aug 15,
Not a vote, but a comment: it might be worth noting that the new
environment variable is also required if you have any Airflow plugin test
suites that install Airflow as part of their dependencies. In my case, I
had to set the new env var outside of tox and add this:
```
[testenv]
passenv =
If you want to run (daily, rolling weekly, rolling monthly) backups on a
daily basis, and they're mostly the same but have some additional
dependencies, you can write a DAG factory method, which you call three
times. Certain nodes only get added to the longer-than-daily backups.
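Sketched out, the factory might look like this (plain Python standing in for real DAG objects; the task names and `cadence` parameter are illustrative):

```python
# Hypothetical sketch of the factory approach described above.
BASE_TASKS = ["snapshot", "upload", "verify"]

def build_backup_dag(cadence: str):
    tasks = list(BASE_TASKS)
    if cadence in ("weekly", "monthly"):
        # Longer-than-daily backups get an extra rollup step.
        tasks.append(f"rollup_{cadence}")
    return {"dag_id": f"backup_{cadence}", "tasks": tasks}

dags = [build_backup_dag(c) for c in ("daily", "weekly", "monthly")]
```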
On Wed, Aug 8,
t all until the first 7
> daily
> > rollups worth of data have built up?
> >
Really excited for this one - we have a lot of internal access controls and
this will help us implement them properly. It's going to be great being
able to give everyone access to see the overall state of DAG progress
without seeing its parameters or logs!
On Tue, Jul 17, 2018 at 12:48 AM, Ruiqin
We use LatestOnlyOperator in production. Generally our data is available on
a regular schedule, and we update production services with it as soon as it
is available; we might occasionally want to re-run historical days, in
which case we want to run the same DAG but without interacting with live
Thank you! Shared that with the team here.
On Thu, Jul 19, 2018 at 7:57 AM, Sumit Maheshwari
wrote:
> Yeah, sorry missed that the webinar is in IST.
>
> Anyway, got the recording link
> https://www.brighttalk.com/webcast/15789/330901
>
>
>
> On Tue, Jul 10, 2018 a
I am in the gitter chat most work days and there's always activity.
I would be fine with switching to a Slack with permanent retention for
searchability, but I don't see the point of switching without that feature.
On Fri, Aug 31, 2018, 12:59 Sid Anand wrote:
> For a while now, we have had an Airflow
-1 unless we get a Slack with full retention.
On Thu, Sep 6, 2018 at 8:16 AM Daniel Cohen
wrote:
> +1
>
> On Thu, Sep 6, 2018 at 3:04 PM Adam Boscarino
> wrote:
>
> > +1
> >
> > On Wed, Sep 5, 2018 at 10:30 PM Sid Anand wrote:
> >
> > > Hi Folks!
> > > In the Apache tradition, I'd like to ask
Airflow logs are stored on the worker filesystem. When a worker starts, it
runs a subprocess that serves logs via Flask:
https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L985
If you use the remote logging feature, the logs are (instead? also?) stored
in S3.
Postgres
I think it's going to be an antipattern to write Python configuration in
Airflow to configure a Kubernetes deployment since even a "simple"
deployment would likely be more classes/objects than the DAG itself. I do
like the idea of having a more featured operator than the PodOperator, but
if I were
Looking at that diff, it seems like the function as a whole needs some
love, even if that commit were reverted. The use of os.walk means it's
going to crawl the entire tree every time, and use accumulated patterns to
check against each file in each folder. The behavior should be to use
patterns to
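Concretely, one way to avoid crawling the whole tree is to prune ignored directories in place during the walk (a minimal sketch; the exact-name matching here is a simplification of Airflow's pattern handling, and `walk_pruned` is a made-up name):

```python
import os

# Sketch: mutating `dirnames` in place tells os.walk (topdown mode) to
# skip those subtrees entirely, so their files are never even listed.
def walk_pruned(root, ignored_dirs):
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in ignored_dirs]
        for name in filenames:
            yield os.path.join(dirpath, name)
```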
Hi folks,
Based on my earlier email to the list, I have submitted a PR that splits
`sla=` into three independent SLA parameters, as well as heavily
restructuring other parts of the SLA feature:
https://github.com/apache/incubator-airflow/pull/3584
This is my first Airflow PR and I'm still
To my mind, it's not a great idea to clear a resource that you're
dynamically using to determine the contents of a DAG. Is there any way that
you can refactor the table to be immutable? Instead of querying all rows in
the table, you would query records in an "unprocessed" state. Instead of
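The unprocessed-state pattern might look like this (a minimal sketch using an in-memory SQLite table; the `jobs` schema is illustrative): rows are never deleted, the DAG only reads rows still marked 'unprocessed' and flips their state instead of clearing the table.

```python
import sqlite3

# Append-only jobs table: state transitions replace deletion.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany("INSERT INTO jobs (state) VALUES (?)",
                 [("unprocessed",), ("unprocessed",), ("processed",)])

# The DAG builds itself only from rows not yet handled...
pending = conn.execute(
    "SELECT id FROM jobs WHERE state = 'unprocessed'").fetchall()
# ...and marks them processed rather than clearing the table.
for (job_id,) in pending:
    conn.execute("UPDATE jobs SET state = 'processed' WHERE id = ?", (job_id,))
```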
To my mind, I would expect the MVP of per-DAG RBAC to include three
settings: viewing DAG state, executing or modifying DAGs, and viewing tasks
within the DAG (logs/code/details). For instance we would love to expose a
view of the production dataload state to our engineers, without exposing
This is super exciting for us, as we want one of our non-technical teams to
be able to re-run failed DAGs. Will be giving this a try soon as I'm back
from SREcon!
On Fri, Mar 23, 2018 at 9:45 PM, Joy Gao wrote:
> Hey guys!
>
> The RBAC UI
I'm very excited about the possibility of implementing a DAGFetcher (per
prior thread about this) that is aware of dynamic data sources, and can
handle abstracting/caching/deploying them itself, rather than having each
Airflow process run the query for each DAG refresh.
On Thu, Mar 22, 2018 at
While Googling something Airflow-related a few weeks ago, I noticed that
someone's Airflow dashboard had been indexed by Google and was accessible
to the outside world without authentication. A little more Googling
revealed a handful of other indexed instances in various states of
security. I did
Another reason you would want separated infrastructure is that there are a
lot of ways to exhaust Airflow resources or otherwise cause contention -
like having too many sensors or sub-DAGs using up all available tasks.
Doesn't seem like a great idea to push for having different teams with
As long as the Airflow process can't find the DAG as a top-level object in
the module, it won't be registered. For example, we have a function that
returns DAGs; the function returns nothing if it's not in the right
environment.
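A minimal sketch of that gating function (plain Python; `make_dag` and the environment names are illustrative, and the dict stands in for a real DAG object):

```python
# Airflow only registers DAGs it finds as top-level objects in the module,
# so returning None in the wrong environment keeps the DAG invisible there.
def make_dag(env: str):
    if env != "production":
        return None  # nothing top-level to register
    return {"dag_id": "prod_only_dag"}  # stand-in for a real DAG object

dag = make_dag("production")
```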
On Sun, Oct 7, 2018 at 2:31 PM Shah Altaf wrote:
> Hi all,
>
>
> > > I'd still love to get some eyes on this one if anyone has time. Definitely
> > > needs some direction as to what is required before merging, since this is a
> > > higher-level API change...
> > >
> > > -James M.
> >
We use Airflow for a mixture of traditional data pipeline work, and
orchestration tasks that need to express complex date and dependency logic.
The former is a bit better supported, but it's still a great tool for the
latter.
But it sounds like your format is "this task should happen whenever
For something to add to Airflow itself: I would love a more flexible
mapping between data time and processing time. The default is "n-1" (day
over day, you're aiming to process yesterday's data) but people post other
use cases on this mailing list quite frequently.
On Fri, Oct 12, 2018 at 7:46 AM
We're running into a lot of pain with this. We have a CI system that
enables very rapid iteration on DAG code, but modifying plugin code
requires re-shipping all of the infrastructure, which takes at least 10x
longer than a DAG deployment Jenkins build.
I think that Airflow
Just a guess, but do you need to reload supervisorctl itself before
restarting the service? If you add an env var to the supervisor config, and
then restart the supervisor-managed service, it will actually be running
with the old supervisor config file still. The supervisor daemon itself
must be
are common problems
> many of the airflow users/developers face.
> @James let's catch up sometime on slack/hangout to
> discuss how these enhancements can be done.
>
>
> On Thu, 15 Nov 2018 at 00:10, James Meickle .invalid>
> wrote:
>
> > As the author of the fir
As the author of the first linked PR, I think your points are good. Here is
my attempt to address them:
1: It is possible to do this today if you write a Slack callback. I would
be happy to share my code for this if you're having trouble integrating
Slack. That being said, it would be great if
I suggest not adopting pipenv. It has a nice "first five minutes" demo but
it's simply not baked enough to depend on as a drop-in pip replacement. We
are in the process of removing it after finding several serious bugs in our
POC of it.
On Thu, Oct 4, 2018, 20:30 Alex Guziel
wrote:
> FWIW,
Harasymiv <
vharasy...@activecampaign.com> wrote:
> Please do share slides/video after, James!
>
> On Thu, May 17, 2018 at 6:53 AM, James Meickle
> wrote:
>
> > Hi folks,
> >
> > At Velocity New York in October, I will be presenting about how
> Quantopian
>
he airflow db, it will get a row on the UI with a black "I'
> >> indicating that the dag is missing. As far as I know, then only way
> >> to remove it is to manually edit the database.
> >> On Mon, Oct 8, 2018 at 9:43 AM Chris Palmer wrote:
> >> >
Anthony:
Could you just have the "success" path be declared with "all_success" (the
default), and the "failure" side branches be declared with "all_failed"
depending on the previous task? This will have the same branching structure
you want but with less intermediary operators.
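The semantics of those two trigger rules can be modeled in a few lines (a pure-Python sketch of the behavior only, not Airflow's implementation; `should_run` is a made-up name):

```python
# Model of Airflow trigger-rule evaluation for two rules: a task runs when
# its upstream tasks' terminal states satisfy the rule.
def should_run(upstream_states, trigger_rule):
    if trigger_rule == "all_success":
        return all(s == "success" for s in upstream_states)
    if trigger_rule == "all_failed":
        return all(s == "failed" for s in upstream_states)
    raise ValueError(f"unknown trigger rule: {trigger_rule}")
```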
-James M.
On
I'm personally against having some kind of auto-increment numeric ID for
DAGs. While this makes a lot of sense for systems where creation is a
database activity (like a POST request), in Airflow, DAG creation is
actually a code ship activity. There are all kinds of complex scenarios
around that:
Cheers to y'all, keep up the great work!
On Thu, Dec 20, 2018, 16:13 Jakob Homan wrote:
> Hey all-
> The Board minutes haven't been published yet (probably due to
> Holiday-related slowness), but I can see through the admin tool that
> our Graduation resolution was approved yesterday at the
I suggest at least adding a commit to remove the broken S3 logging section
I just reported here: https://issues.apache.org/jira/browse/AIRFLOW-3449
On Wed, Nov 28, 2018 at 5:41 PM Kaxil Naik wrote:
> Hi everyone,
>
> I'm starting the process of gathering fixes for a 1.10.2. So far the list
> of
I would be very interested in helping draft a rearchitecting AIP. Of
course, that's a vague statement. I am interested in several specific areas
of Airflow functionality that would be hard to modify without some
refactoring taking place first:
1) Improving Airflow's data model so it's easier to
Definitely agree with this. I'm not always opposed to JIRA for projects,
but the way it's being used for this project makes it very hard to break
into contributing. The split between GH and JIRA is also painful since
there's no automatic integration of them.
On Sun, Sep 16, 2018 at 9:29 AM
So in favor of just using Python modules for operators. I initially wrote
mine as Airflow plugin compatible, and eventually had to un-write them that
way, so it's really a new-user trap.
I've had at least a half dozen times installing/testing/operating Airflow
where we had some issue based on an
ight better fit much of Airflow's user base.
> >>>>>
> >>>>>
> >>>>> On Sun, Sep 16, 2018, 9:21 AM Jeff Payne wrote:
> >>>>>
> >>>>>> I agree that Jira could be better utilized. I read the original
> >>>>>