Re: Is `airflow backfill` disfunctional?

2018-06-05 Thread Bolke de Bruin
Thinking out loud here, because it has been a while since I worked on backfills. There were some real issues with backfills: 1. Tasks were running in non-deterministic order, ending up in regular deadlocks 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs could not be

Re: Pentaho to Airflow

2018-06-05 Thread Arash Soheili
I have looked through those and didn't find what I needed, although there is the mysql operator and I have used that to implement an insert or update. I was looking for something like this https://wiki.pentaho.com/plugins/servlet/mobile?contentId=8292089#content/view/8292089 . A way to bulk

Re: Pentaho to Airflow

2018-06-05 Thread Taylor Edmiston
Hey Arash - There are some common operators built in to Airflow and some in contrib as well. We also maintain a community-sourced GitHub

Pentaho to Airflow

2018-06-05 Thread Arash Soheili
Hi, I'm new to Airflow and helping to set up our organization to transition away from using Pentaho Data Integration for our ETL. Although there are a lot of things I don't like about Pentaho, they do have some nice standard modules like batch database insert/update, which are common ETL tasks. As

Re: Airflow-related Talk: Functional Data Engineering, a set of Best Practices

2018-06-05 Thread Naik Kaxil
Thanks for sharing this Max. On 06/06/2018, 01:23, "Maxime Beauchemin" wrote: I'm taking the freedom to share my talk from DataEngConf 2018 with this group since it's somewhat related to Airflow. https://www.youtube.com/watch?v=4Spo2QRTz1k Related blog post:

Airflow-related Talk: Functional Data Engineering, a set of Best Practices

2018-06-05 Thread Maxime Beauchemin
I'm taking the freedom to share my talk from DataEngConf 2018 with this group since it's somewhat related to Airflow. https://www.youtube.com/watch?v=4Spo2QRTz1k Related blog post:

Re: Dealing with data latency

2018-06-05 Thread Pedro Machado
Hi James, I've noticed that some dags fail if the services are restarted while a sensor is waiting. Originally I didn't think retries would be relevant for a time sensor but it sounds like if the worker crashes, the only way for the sensor to rerun is if the retry count hasn't been met. Is this

Re: Is `airflow backfill` disfunctional?

2018-06-05 Thread Scott Halgrim
The request was for opposition, but I’d like to weigh in on the side of “it’s a better behavior [to have failed tasks re-run when cleared in a backfill]” On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin wrote: > @Jeremiah Lowin & @Bolke de Bruin I > think you may have some context on why this

Re: Is `airflow backfill` disfunctional?

2018-06-05 Thread Maxime Beauchemin
@Jeremiah Lowin & @Bolke de Bruin I think you may have some context on why this may have changed at some point. I'm assuming that when DagRun handling was added to the backfill logic, the behavior just happened to change to what it is now. Any opposition to moving back towards re-running failed

Re: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Naik Kaxil
+1 for securing it by default. I also suggest creating an admin user password while running "initdb". On 05/06/2018, 22:49, "Bolke de Bruin" wrote: Tbh I like to go to a setup where it is secure by default. Airflow is getting more and more used so it also increases the attack surface.

Re: Is `airflow backfill` disfunctional?

2018-06-05 Thread Tao Feng
After discussing with Max, we think it would be great if `airflow backfill` could automatically pick up and rerun those failed tasks. Currently, it will throw exceptions (https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489) without rerunning the failed tasks. But since

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Alex Guziel
I suggest reading the section on password complexity here https://pages.nist.gov/800-63-3/sp800-63b.html which recommends just a minimum length and a check against a list of the most common passwords. On Tue, Jun 5, 2018 at 3:14 PM, Maxime Beauchemin < maximebeauche...@gmail.com> wrote: >
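
A minimal sketch of the kind of check described above: a minimum length plus a lookup against a list of common passwords. The function name, the 8-character default, and the tiny sample list are assumptions for illustration only and are not Airflow code; a real deployment would load a much larger list.

```python
# Hypothetical helper, not part of Airflow.
MIN_LENGTH = 8  # assumed default minimum length

# Assumed sample entries; a real list would hold many thousands of passwords.
COMMON_PASSWORDS = frozenset({"password", "12345678", "qwertyuiop", "airflow123"})


def is_acceptable_password(candidate, min_length=MIN_LENGTH, blocklist=COMMON_PASSWORDS):
    """Accept a password that is long enough and not on the common-password list."""
    if len(candidate) < min_length:
        return False
    # Compare case-insensitively so "Password" is rejected along with "password".
    return candidate.lower() not in blocklist
```

Note that the check imposes no composition rules (digits, symbols, mixed case); it only enforces length and rejects known-common choices.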

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Maxime Beauchemin
Agreed, secured by default is ideal. Though I wouldn't want people to get an unreasonable sense of safety and open their instance to the web. I like the idea of generating a temporary key/token and exposing it in the console where the process was started. Other option is to use the
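
A rough sketch of the temporary key/token idea, assuming the token is generated when the process starts and printed to the console it was started from. The function names are hypothetical and nothing here is existing Airflow behavior; only the standard-library `secrets` and `hmac` modules are used.

```python
import hmac
import secrets


def generate_startup_token(num_bytes=32):
    """Return a random URL-safe token (hypothetical helper, not an Airflow API)."""
    return secrets.token_urlsafe(num_bytes)


def token_matches(expected, supplied):
    """Compare two tokens in constant time to avoid leaking timing information."""
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))


if __name__ == "__main__":
    token = generate_startup_token()
    # Shown only in the console where the process was started.
    print("Temporary access token:", token)
```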

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Bolke de Bruin
Tbh I like to go to a setup where it is secure by default. Airflow is getting more and more used so it also increases the attack surface. If you run “initdb” or “resetdb” it is easy to provide a generated password. I don’t see a reason anymore for having an unsecured version. B.

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Christopher Bockman
+1 to being able to disable--we have authentication in place, but use a separate solution that (probably?) Airflow won't realize is enabled, so having a continuous giant warning banner would be rather unfortunate. On Tue, Jun 5, 2018 at 2:05 PM, Alek Storm wrote: > This is a great idea, but

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Alek Storm
This is a great idea, but we'd appreciate a setting that disables the banner even if those conditions aren't met - our instance is deployed without authentication, but is only accessible via our intranet. Alek On Tue, Jun 5, 2018, 3:35 PM James Meickle wrote: > I think that a banner

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread James Meickle
I think that a banner notification would be a fair penalty if you access Airflow without authentication, or have API authentication turned off, or are accessing via http:// with a non-localhost `Host:`. (Are there any other circumstances to think of?) I would also suggest serving a default
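
The three conditions listed above could be combined roughly as follows. This is only a sketch: the function and parameter names are hypothetical, and how a real webserver would learn whether UI or API authentication is enabled is left out.

```python
# Hosts treated as local for the purpose of the plain-http check (assumed set).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "[::1]"}


def _hostname(host_header):
    """Strip an optional port from a Host header value."""
    host = host_header.strip().lower()
    if host.startswith("["):
        # Bracketed IPv6 literal, e.g. "[::1]:8080".
        return host[: host.find("]") + 1]
    return host.split(":", 1)[0]


def should_show_banner(ui_auth_enabled, api_auth_enabled, scheme, host_header):
    """Return True if any of the insecure conditions holds."""
    if not ui_auth_enabled or not api_auth_enabled:
        return True
    # Plain http is tolerated only when the request targets a local host.
    return scheme == "http" and _hostname(host_header) not in LOCAL_HOSTS
```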

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Christian Barra
> On 5 Jun 2018, at 19:51, Maxime Beauchemin wrote: > > What about a clear alert on the UI showing when auth is off? Perhaps a > large red triangle-exclamation icon on the navbar with a tooltip > "Authentication is off, this Airflow instance is not secure." and clicking > takes you to the

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Maxime Beauchemin
What about a clear alert on the UI showing when auth is off? Perhaps a large red triangle-exclamation icon on the navbar with a tooltip "Authentication is off, this Airflow instance is not secure." and clicking takes you to the doc's security page. Well and then of course people should make sure

Python3 and Pig Hook

2018-06-05 Thread miqbal88
Hi, I have a small update for the Pig Hook to make it compatible with Python3, and it seems complicated to set up a test for, b/c I would have to configure Pig in a testing suite. I noticed there weren't any tests for it listed before. Would it be okay for me to skip the test in this case?

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Taylor Edmiston
One of our engineers wrote a blog post about the UMG mistakes as well. https://www.astronomer.io/blog/universal-music-group-airflow-leak/ I know that best practices are well known here, but I second James' suggestion that we add some docs, code, or config so that the framework optimizes for

Re: Dealing with data latency

2018-06-05 Thread James Meickle
We have to use a lot of time sensors like this, for reports that shouldn't be filed to a third party before a certain time of day. Since these sensors are themselves tasks, they can fail to be scheduled or can fail, like if the underlying worker instance dies. I would recommend double checking

Re: Dealing with data latency

2018-06-05 Thread Pedro Machado
Thanks, Max! On Mon, Jun 4, 2018 at 12:47 PM Maxime Beauchemin < maximebeauche...@gmail.com> wrote: > The common standard is to have the execution_date aligned with the > partition date in the database (say 2018-08-08) and contain data from > 2018-08-08T00:00:000 > to 2018-08-09T23:59:999. > >

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread James Meickle
Bumping this one because now Airflow is in the news over it... https://www.bleepingcomputer.com/news/security/contractor-exposes-credentials-for-universal-music-groups-it-infrastructure/ On Fri, Mar 23, 2018 at 9:33