Re: How to know the DAG is starting to run

2018-05-11 Thread James Meickle
Song: You can put an operator as the very first node in the DAG, and have everything else in the DAG depend on it. For example, this is the approach we use to only execute DAG tasks on stock market trading days. -James M. On Fri, May 11, 2018 at 3:57 AM, Song Liu wrote: >
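
A minimal sketch of the gating idea, assuming the first node only has to answer "is this execution date a trading day?". The holiday set and both function names are hypothetical, and `run_gated` is a plain-Python stand-in for the DAG. In Airflow 1.x a callable like `is_trading_day` could be handed to a short-circuiting first task such as `ShortCircuitOperator`; the message does not say which operator was actually used.

```python
from datetime import date

# Hypothetical holiday list for illustration; a real setup would load an
# exchange calendar instead.
MARKET_HOLIDAYS = {date(2018, 5, 28), date(2018, 7, 4)}


def is_trading_day(execution_date, holidays=MARKET_HOLIDAYS):
    """Return True for a weekday that is not a listed market holiday."""
    is_weekday = execution_date.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and execution_date not in holidays


def run_gated(execution_date, downstream_tasks):
    """Stand-in for the DAG: run downstream callables only if the gate passes."""
    if not is_trading_day(execution_date):
        return []  # the gate failed, so nothing downstream runs
    return [task(execution_date) for task in downstream_tasks]
```

Because every other task depends on the gate, a single check at the root decides whether the whole run does any work.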

Airflow presentation at Velocity New York

2018-05-17 Thread James Meickle
Hi folks, At Velocity New York in October, I will be presenting about how Quantopian uses Airflow for financial data: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/70048 We couldn't have adopted Airflow so quickly without the hard work of contributors who have made it

Airflow dev mailing list DMARC settings

2018-05-16 Thread James Meickle
Hi folks, I got an email from our email administrator that: "However, it looks like the AirFlow mailing list isn't rewriting email headers in messages sent to the list, such that all messages sent to the list from domains that use DMARC are non-compliant. At some point we're going to have to

Re: Improving Airflow SLAs

2018-05-24 Thread James Meickle
ne is > taking longer than X hours, not just if each task is taking more than X > hours. Is this still possible, to do that feasibly? > > > On May 24, 2018, at 12:12 PM, James Meickle <jmeic...@quantopian.com> > wrote: > > > > Just giving this a bump; it's a pretty

Re: Improving Airflow SLAs

2018-05-24 Thread James Meickle
Just giving this a bump; it's a pretty major rework so I'd love to know whether this effort is likely to be accepted if I bring it to a PR-able state, before I invest more time. On Wed, May 23, 2018 at 1:59 PM, James Meickle <jmeic...@quantopian.com> wrote: > Hi folks, > > I've c

Re: Improving Airflow SLAs

2018-05-23 Thread James Meickle
n Wed, May 9, 2018 at 12:43 PM, James Meickle <jmeic...@quantopian.com> wrote: > Hi all, > > Since the response so far has been positive or neutral, I intend to submit > one or more PRs targeting 2.0 (I think that some parts will be separable > from a larger SLA refactor). I int

Re: Airflow 1.10.0

2018-06-15 Thread James Meickle
Hi, I have a sandbox cluster at work (3 EC2 VMs + Celery on Elasticache) that I have been running 1.10 on. This is because we want to test in advance the Kubernetes operator and RBAC. Happy to lend some assistance with that. -James M. On Fri, Jun 15, 2018 at 6:28 AM, Driesprong, Fokko wrote:

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread James Meickle
AM, James Meickle wrote: > While Googling something Airflow-related a few weeks ago, I noticed that > someone's Airflow dashboard had been indexed by Google and was accessible > to the outside world without authentication. A little more Googling > revealed a handful of other index

Re: Dealing with data latency

2018-06-05 Thread James Meickle
We have to use a lot of time sensors like this, for reports that shouldn't be filed to a third party before a certain time of day. Since these sensors are themselves tasks, they can fail to be scheduled or can fail, like if the underlying worker instance dies. I would recommend double checking

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread James Meickle
t; > > Taylor > > > > *Taylor Edmiston* > > Blog <https://blog.tedmiston.com/> | CV > > <https://stackoverflow.com/cv/taylor> | LinkedIn > > <https://www.linkedin.com/in/tedmiston/> | AngelList > > <https://angel.co/taylor> | Stac

Re: Dealing with data latency

2018-06-06 Thread James Meickle
et. Is this one of > the points you are making? > Thanks. > > On Tue, Jun 5, 2018 at 9:41 AM James Meickle > wrote: > > > We have to use a lot of time sensors like this, for reports that > shouldn't > > be filed to a third party before a certain time of day. Since these >

Re: Single Airflow Instance Vs Multiple Airflow Instance

2018-06-06 Thread James Meickle
An important consideration here is that there are several settings that are cluster-wide. In particular, cluster-wide concurrency settings could result in Team B's DAG refusing to schedule based on an error in Team A's DAG. Do your teams follow similar practices in how eagerly they ship code, or

Re: Capturing data changes that happen after the initial data pull

2018-06-07 Thread James Meickle
One way to do this would be to have your DAG file return two nearly identical DAGs (like put it in a factory function and call it twice). The difference would be that the "final" run would have a conditional to add an extra time sensor at the DAG root, to wait N days for the data to finalize. The

Re: Apache Airflow 1.10.0b2

2018-06-22 Thread James Meickle
I installed v1-10-test at work. I'm having an issue with it: File "XXX/airflow/dags/airflow-sandbox/submodules/quantflow/tests/operators/test_zipline_operators.py", line 6, in <module> from freezegun import freeze_time ImportError: No module named 'freezegun' From the same dir: $ cat

Re: 1.10.0beta1 now available for download

2018-05-01 Thread James Meickle
ile' }, 'consumer_key': '{{ vault_airflow_google_oauth_key }}', 'consumer_secret': '{{ vault_airflow_google_oauth_secret }}' } }] On Mon, Apr 30, 2018 at 12:54 PM, Jørn A Hansen <jornhan...@gmail.com> wrote: > On Mon, 30 Apr 2018 at 15.56, James Meickle <jmeic...@quantopian.com> > wrote: > > > Ins

Improving Airflow SLAs

2018-05-02 Thread James Meickle
At Quantopian we use Airflow to produce artifacts based on the previous day's stock market data. These artifacts are required for us to trade on today's stock market. Therefore, I've been investing time in improving Airflow notifications (such as writing PagerDuty and Slack integrations). My

Re: 1.10.0beta1 now available for download

2018-04-30 Thread James Meickle
Installed this off of the branch, and I do get the Kubernetes executor (incl. demo DAG) and some bug fixes - but I don't see any RBAC feature anywhere I'd think to look. Do I need to set up some config to get that to show up? On Mon, Apr 23, 2018 at 2:06 PM, Bolke de Bruin

Re: Improving Airflow SLAs

2018-05-03 Thread James Meickle
> about, > > expected execution times > > > > Also SLA trigger on backfills and manual reruns of tasks > > > > I see this as a critical feature for production monitoring so would love > to > > see this get improved > > > > On Wed, May 2, 2018, 12:00 PM James

Re: max_active_runs

2018-02-15 Thread James Meickle
I have made this mistake a few times. I think it would be great if Airflow warned about DAG-level arguments being passed into tasks they don't apply to, since that would indicate an easily fixable mistake. On Wed, Feb 14, 2018 at 9:22 AM, Gerard Toonstra wrote: > One of

Re: apache-airflow v1.10.0 on PyPi?

2018-08-15 Thread James Meickle
Can we make it a policy going forward to push GH tags for all RCs as part of the release announcement? I deploy via the incubator-airflow repo, but currently it only has tags for up to RC2, which means I have to look up and then specify an ugly-looking commit to deploy an RC :( On Wed, Aug 15,

Re: [VOTE] Airflow 1.10.0rc3

2018-08-06 Thread James Meickle
Not a vote, but a comment: it might be worth noting that the new environment variable is also required if you have any Airflow plugin test suites that install Airflow as part of their dependencies. In my case, I had to set the new env var outside of tox and add this: ``` [testenv] passenv =

Re: Basic modeling question

2018-08-08 Thread James Meickle
If you want to run (daily, rolling weekly, rolling monthly) backups on a daily basis, and they're mostly the same but have some additional dependencies, you can write a DAG factory method, which you call three times. Certain nodes only get added to the longer-than-daily backups. On Wed, Aug 8,
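
A minimal sketch of the factory pattern described here, using a plain dict as a stand-in for a DAG object so the shape is visible without Airflow installed. The function name, task names, and window sizes are all hypothetical.

```python
def build_backup_plan(name, window_days):
    """Hypothetical factory: returns a dict standing in for a DAG object."""
    tasks = ["snapshot", "upload"]
    if window_days > 1:
        # Extra nodes are added only to the longer-than-daily variants.
        tasks.insert(1, "wait_for_{}_daily_rollups".format(window_days))
        tasks.append("prune_older_than_window")
    return {"dag_id": "backup_{}".format(name), "tasks": tasks}


# Call the factory three times, once per cadence.
PLANS = {
    name: build_backup_plan(name, days)
    for name, days in [("daily", 1), ("weekly", 7), ("monthly", 30)]
}
```

In a real DAG file each returned DAG would also need to be bound to a module-level name, since Airflow discovers DAGs by looking at the top-level objects of the module.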

Re: Basic modeling question

2018-08-08 Thread James Meickle
t all until the first 7 > daily > > rollups worth of data have built up? > > > > *Taylor Edmiston* > > Blog <https://blog.tedmiston.com/> | CV > > <https://stackoverflow.com/cv/taylor> | LinkedIn > > <https://www.linkedin.com/in/tedmiston/>

Re: DAG Level permissions (was Re: RBAC Update)

2018-07-17 Thread James Meickle
Really excited for this one - we have a lot of internal access controls and this will help us implement them properly. It's going to be great being able to give everyone access to see the overall state of DAG progress without seeing its parameters or logs! On Tue, Jul 17, 2018 at 12:48 AM, Ruiqin

Re: Catchup By default = False vs LatestOnlyOperator

2018-07-23 Thread James Meickle
We use LatestOnlyOperator in production. Generally our data is available on a regular schedule, and we update production services with it as soon as it is available; we might occasionally want to re-run historical days, in which case we want to run the same DAG but without interacting with live

Re: [Live Webinar Tomorrow] AIRFlow at Scale - Register Now

2018-07-19 Thread James Meickle
Thank you! Shared that with the team here. On Thu, Jul 19, 2018 at 7:57 AM, Sumit Maheshwari wrote: > Yeah, sorry missed that the webinar is in IST. > > Anyway, got the recording link > https://www.brighttalk.com/webcast/15789/330901 > > > > On Tue, Jul 10, 2018 a

Re: Retiring Airflow Gitter?

2018-08-31 Thread James Meickle
I am in the gitter chat most work days and there's always activity. I would be fine with switching to permanent retention slack for searchability but don't see the point of switching without that feature. On Fri, Aug 31, 2018, 12:59 Sid Anand wrote: > For a while now, we have had an Airflow

Re: [VOTE] Replace with Gitter with Slack?

2018-09-06 Thread James Meickle
-1 unless we get a Slack with full retention. On Thu, Sep 6, 2018 at 8:16 AM Daniel Cohen wrote: > +1 > > On Thu, Sep 6, 2018 at 3:04 PM Adam Boscarino > wrote: > > > +1 > > > > On Wed, Sep 5, 2018 at 10:30 PM Sid Anand wrote: > > > > > Hi Folks! > > > In the Apache tradition, I'd like to ask

Re: What information is passed around different components of Airflow?

2018-07-05 Thread James Meickle
Airflow logs are stored on the worker filesystem. When a worker starts, it runs a subprocess that serves logs via Flask: https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L985 If you use the remote logging feature, the logs are (instead? also?) stored in S3. Postgres

Re: K8S deployment operator proposal

2018-07-05 Thread James Meickle
I think it's going to be an antipattern to write Python configuration in Airflow to configure a Kubernetes deployment since even a "simple" deployment would likely be more classes/objects than the DAG itself. I do like the idea of having a more featured operator than the PodOperator, but if I were

Re: [VOTE] Airflow 1.10.0rc1

2018-07-12 Thread James Meickle
Looking at that diff, it seems like the function as a whole needs some love, even if that commit were reverted. The use of os.walk means it's going to crawl the entire tree every time, and use accumulated patterns to check against each file in each folder. The behavior should be to use patterns to
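
For illustration, one common way to stop `os.walk` from crawling subtrees that are going to be ignored anyway is to prune `dirnames` in place. The snippet above is cut off before it states the proposed behavior, so this is a sketch of the general technique under that assumption, not the function under discussion; `find_python_files` and its regex-based patterns are hypothetical.

```python
import os
import re


def find_python_files(root, ignore_patterns):
    """Yield .py files under root without descending into ignored directories."""
    compiled = [re.compile(p) for p in ignore_patterns]

    def ignored(path):
        return any(rx.search(path) for rx in compiled)

    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Slice assignment mutates the list os.walk iterates over, so
        # ignored subtrees are skipped rather than crawled.
        dirnames[:] = [d for d in dirnames if not ignored(os.path.join(dirpath, d))]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if name.endswith(".py") and not ignored(full):
                yield full
```

Pruning only works with `topdown=True` (the default), because the directory list has to be edited before `os.walk` descends into it.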

PR for refactoring Airflow SLAs

2018-07-09 Thread James Meickle
Hi folks, Based on my earlier email to the list, I have submitted a PR that splits `sla=` into three independent SLA parameters, as well as heavily restructuring other parts of the SLA feature: https://github.com/apache/incubator-airflow/pull/3584 This is my first Airflow PR and I'm still

Re: How to have dynamic downstream tasks that depend on data generated upstream

2018-03-15 Thread James Meickle
To my mind, it's not a great idea to clear a resource that you're dynamically using to determine the contents of a DAG. Is there any way that you can refactor the table to be immutable? Instead of querying all rows in the table, you would query records in an "unprocessed" state. Instead of
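
A minimal sketch of the "query by state" idea, using an in-memory SQLite table. The table and column names are hypothetical, and marking rows as processed instead of deleting them is an assumption about where the truncated sentence was heading.

```python
import sqlite3


def fetch_unprocessed(conn):
    """Return ids of rows still waiting to be handled, oldest first."""
    rows = conn.execute(
        "SELECT id FROM work_items WHERE state = 'unprocessed' ORDER BY id"
    )
    return [row[0] for row in rows]


def mark_processed(conn, item_id):
    """Flip the state flag rather than deleting the row (assumed approach)."""
    conn.execute("UPDATE work_items SET state = 'processed' WHERE id = ?", (item_id,))
    conn.commit()


# Hypothetical table with three pending rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_items (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
conn.executemany("INSERT INTO work_items (state) VALUES (?)", [("unprocessed",)] * 3)
conn.commit()
```

Since rows are never removed, anything that reads the table to decide the shape of a DAG sees a stable history rather than a table that empties out mid-run.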

Re: RBAC Update

2018-04-02 Thread James Meickle
To my mind, I would expect the MVP of per-DAG RBAC to include three settings: viewing DAG state, executing or modifying DAGs, and viewing tasks within the DAG (logs/code/details). For instance we would love to expose a view of the production dataload state to our engineers, without exposing

Re: RBAC Update

2018-03-26 Thread James Meickle
This is super exciting for us, as we want one of our non-technical teams to be able to re-run failed DAGs. Will be giving this a try soon as I'm back from SREcon! On Fri, Mar 23, 2018 at 9:45 PM, Joy Gao wrote: > Hey guys! > > The RBAC UI

Re: Submitting 1000+ tasks to airflow programatically

2018-03-22 Thread James Meickle
I'm very excited about the possibility of implementing a DAGFetcher (per prior thread about this) that is aware of dynamic data sources, and can handle abstracting/caching/deploying them itself, rather than having each Airflow process run the query for each DAG refresh. On Thu, Mar 22, 2018 at

PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-03-23 Thread James Meickle
While Googling something Airflow-related a few weeks ago, I noticed that someone's Airflow dashboard had been indexed by Google and was accessible to the outside world without authentication. A little more Googling revealed a handful of other indexed instances in various states of security. I did

Re: About the project support in Airflow

2018-04-25 Thread James Meickle
Another reason you would want separated infrastructure is that there are a lot of ways to exhaust Airflow resources or otherwise cause contention - like having too many sensors or sub-DAGs using up all available tasks. Doesn't seem like a great idea to push for having different teams with

Re: Can a DAG be conditionally hidden from the UI?

2018-10-08 Thread James Meickle
As long as the Airflow process can't find the DAG as a top-level object in the module, it won't be registered. For example, we have a function that returns DAGs; the function returns nothing if it's not in the right environment. On Sun, Oct 7, 2018 at 2:31 PM Shah Altaf wrote: > Hi all, > >
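
A minimal sketch of that pattern, with a dict standing in for a DAG object. The environment variable name `DEPLOY_ENV` and both helper names are hypothetical.

```python
import os


def make_dag(dag_id, allowed_env, current_env=None):
    """Hypothetical factory: returns a stand-in DAG, or None outside allowed_env."""
    if current_env is None:
        # Assumed variable name; use whatever your deployment exposes.
        current_env = os.environ.get("DEPLOY_ENV", "dev")
    if current_env != allowed_env:
        return None
    return {"dag_id": dag_id}


def register(namespace, dag):
    """Bind the DAG to a top-level name only when the factory returned one."""
    if dag is not None:
        namespace[dag["dag_id"]] = dag
    return namespace
```

In a DAG file this would be called as `register(globals(), make_dag("prod_only_report", "production"))`, so that in the wrong environment no top-level DAG object exists for Airflow to find.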

Re: PR for refactoring Airflow SLAs

2018-10-08 Thread James Meickle
> > I'd still love to get some eyes on this one if anyone has time. > Definitely > > > > needs some direction as to what is required before merging, since this > is a > > > > higher-level API change... > > > > > > > > -James M. > > >

Re: Is airflow a fit?

2018-10-18 Thread James Meickle
We use Airflow for a mixture of traditional data pipeline work, and orchestration tasks that need to express complex date and dependency logic. The former is a bit better supported, but it's still a great tool for the latter. But it sounds like your format is "this task should happen whenever

Re: Ingest daily data, but delivery is always delayed by two days

2018-10-12 Thread James Meickle
For something to add to Airflow itself: I would love a more flexible mapping between data time and processing time. The default is "n-1" (day over day, you're aiming to process yesterday's data) but people post other use cases on this mailing list quite frequently. On Fri, Oct 12, 2018 at 7:46 AM

Re: Deployment / Execution Model

2018-11-01 Thread James Meickle
We're running into a lot of pain with this. We have a CI system that enables very rapid iteration on DAG code. Whenever you need to modify plugin code, it requires a re-ship of all of the infrastructure, which takes at least 10x longer than a DAG deployment Jenkins build. I think that Airflow

Re: Moving Airflow Config to Database.

2018-11-15 Thread James Meickle
Just a guess, but do you need to reload supervisorctl itself before restarting the service? If you add an env var to the supervisor config, and then restart the supervisor-managed service, it will actually be running with the old supervisor config file still. The supervisor daemon itself must be

Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

2018-11-14 Thread James Meickle
are common problems > many of the airflow users/developers face. > @James let's catch up sometime on slack/hangout to > discuss how these enhancements can be done. > > > On Thu, 15 Nov 2018 at 00:10, James Meickle > wrote: > > > As the author of the fir

Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

2018-11-14 Thread James Meickle
As the author of the first linked PR, I think your points are good. Here is my attempt to address them: 1: It is possible to do this today if you write a Slack callback. I would be happy to share my code for this if you're having trouble integrating Slack. That being said, it would be great if

Re: Pinning dependencies for Apache Airflow

2018-10-04 Thread James Meickle
I suggest not adopting pipenv. It has a nice "first five minutes" demo but it's simply not baked enough to depend on as a swap in pip replacement. We are in the process of removing it after finding several serious bugs in our POC of it. On Thu, Oct 4, 2018, 20:30 Alex Guziel wrote: > FWIW,

Re: Airflow presentation at Velocity New York

2018-10-08 Thread James Meickle
Harasymiv < vharasy...@activecampaign.com> wrote: > Please do share slides/video after, James! > > On Thu, May 17, 2018 at 6:53 AM, James Meickle > wrote: > > > Hi folks, > > > > At Velocity New York in October, I will be presenting about how > Quantopian >

Re: Can a DAG be conditionally hidden from the UI?

2018-10-09 Thread James Meickle
he airflow db, it will get a row on the UI with a black "I" > >> indicating that the dag is missing. As far as I know, the only way > >> to remove it is to manually edit the database. > >> On Mon, Oct 8, 2018 at 9:43 AM Chris Palmer wrote: > >> > >

Re: error handling flow in DAG

2018-10-08 Thread James Meickle
Anthony: Could you just have the "success" path be declared with "all_success" (the default), and the "failure" side branches be declared with "all_failed" depending on the previous task? This will have the same branching structure you want but with less intermediary operators. -James M. On
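
A tiny plain-Python model of the two trigger rules mentioned, to make the branching concrete. `should_run` is a hypothetical helper for illustration and not part of Airflow; in Airflow itself the rule is set with the `trigger_rule` argument on the operator.

```python
def should_run(trigger_rule, upstream_states):
    """Model two Airflow trigger rules over a list of upstream task states."""
    if trigger_rule == "all_success":
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_failed":
        # Airflow counts both failed and upstream_failed parents here.
        return all(state in ("failed", "upstream_failed") for state in upstream_states)
    raise ValueError("unsupported trigger rule: {}".format(trigger_rule))
```

With one upstream task, exactly one of the two branches fires for a given outcome: the "success" path when the parent succeeds, the "failure" path when it fails.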

Re: Fundamental change - Separate DAG name and id.

2018-09-20 Thread James Meickle
I'm personally against having some kind of auto-increment numeric ID for DAGs. While this makes a lot of sense for systems where creation is a database activity (like a POST request), in Airflow, DAG creation is actually a code ship activity. There are all kinds of complex scenarios around that:

Re: Graduation resolution passed - Airflow is a TLP

2018-12-20 Thread James Meickle
Cheers to y'all, keep up the great work! On Thu, Dec 20, 2018, 16:13 Jakob Homan wrote: > Hey all- >The Board minutes haven't been published yet (probably due to > Holiday-related slowness), but I can see through the admin tool that > our Graduation resolution was approved yesterday at the

Re: Call for fixes for Airflow 1.10.2

2018-12-06 Thread James Meickle
I suggest at least adding a commit to remove the broken S3 logging section I just reported here: https://issues.apache.org/jira/browse/AIRFLOW-3449 On Wed, Nov 28, 2018 at 5:41 PM Kaxil Naik wrote: > Hi everyone, > > I'm starting the process of gathering fixes for a 1.10.2. So far the list > of

Re: programmatically creating and airflow quirks

2018-11-28 Thread James Meickle
I would be very interested in helping draft a rearchitecting AIP. Of course, that's a vague statement. I am interested in several specific areas of Airflow functionality that would be hard to modify without some refactoring taking place first: 1) Improving Airflow's data model so it's easier to

Re: It's very hard to become a committer on the project

2018-09-16 Thread James Meickle
Definitely agree with this. I'm not always opposed to JIRA for projects, but the way it's being used for this project makes it very hard to break into contributing. The split between GH and JIRA is also painful since there's no automatic integration of them. On Sun, Sep 16, 2018 at 9:29 AM

Re: Guidelines on Contrib vs Non-contrib

2018-09-18 Thread James Meickle
So in favor of just using Python modules for operators. I initially wrote mine as Airflow plugin compatible, and eventually had to un-write them that way, so it's really a new-user trap. I've had at least a half dozen times installing/testing/operating Airflow where we had some issue based on an

Re: It's very hard to become a committer on the project

2018-09-18 Thread James Meickle
ight better fit much of Airflow's user base. > >>>>> > >>>>> > >>>>> On Sun, Sep 16, 2018, 9:21 AM Jeff Payne wrote: > >>>>> > >>>>>> I agree that Jira could be better utilized. I read the original > >>>>>>