Airflow Multi-Tenancy to filter dags by team not users

2018-08-09 Thread ayush . chauhan
Hi, 

I am trying to implement Airflow multi-tenancy and Google authentication in 
Airflow. I am using Airflow 1.9.0 and have the following questions about 
implementing it:

1) How can I filter DAGs by team instead of by individual user? I know a workaround 
would be to create a user for each team and have the team members log in 
to Airflow using the team email id. But can we do it any other way?

2) How do I create superusers in the Google authentication flow? Or is every user 
created as a superuser in this flow? I have looked at the contents of the users 
table but found nothing about how to give different rights to different users.


Re: [VOTE] Airflow 1.10.0rc4

2018-08-09 Thread Naik Kaxil
+1 (binding) Tested it on Python2.7 with flask UI

On 09/08/2018, 23:58, "Daniel Imberman"  wrote:

-1 (non-binding),

There is a k8s bug fix that should be PRed by @jordan.zucker this weekend
(relating to the tracking of resourceVersions). There have also been
multiple users requesting the ability to pre-bake docker images which I
will make a PR for this weekend.

On Thu, Aug 9, 2018 at 11:22 AM Bolke de Bruin  wrote:

> 0.5?? Can we score fractions :-) ? Sorry I missed this Ash. I think Fokko
> really wants a 1.10.1 quickly so better include it then? Can you make your
> vote +1?
>
> Thx
> Bolke
>
> > On 9 Aug 2018, at 14:06, Ash Berlin-Taylor  wrote:
> >
> > +0.5 (binding) from me.
> >
> > Tested upgrading from 1.9.0 metadb on Py3.5. Timezones behaving
> > themselves on Postgres. Have not tested the Rbac-based UI.
> >
> > https://github.com/apache/incubator-airflow/commit/d9fecba14c5eb56990508573a91b13ab27ca5153
> > (expanding on UPDATING.md for Logging changes) isn't in the release, but
> > would only affect people who look at the UPDATING.md in the source
> > tarball, which isn't going to be very many - most people will check in
> > the repo and just install via PyPi I'd guess?
> >
> > -ash
> >
> >> On 8 Aug 2018, at 19:21, Bolke de Bruin  wrote:
> >>
> >> Hey all,
> >>
> >> I have cut Airflow 1.10.0 RC4. This email is calling a vote on the
> >> release, which will last for 72 hours. Consider this my (binding) +1.
> >>
> >> Airflow 1.10.0 RC 4 is available at:
> >>
> >> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc4/
> >>
> >> apache-airflow-1.10.0rc4+incubating-source.tar.gz is a source release
> >> that comes with INSTALL instructions.
> >> apache-airflow-1.10.0rc4+incubating-bin.tar.gz is the binary Python
> >> "sdist" release.
> >>
> >> Public keys are available at:
> >>
> >> https://dist.apache.org/repos/dist/release/incubator/airflow/
> >>
> >> The amount of JIRAs fixed is over 700. Please have a look at the
> >> changelog. Since RC3 the following has been fixed:
> >>
> >> [AIRFLOW-2870] Use abstract TaskInstance for migration
> >> [AIRFLOW-2859] Implement own UtcDateTime
> >> [AIRFLOW-2140] Don't require kubernetes for the SparkSubmit hook
> >> [AIRFLOW-2869] Remove smart quote from default config
> >> [AIRFLOW-2857] Fix Read the Docs env
> >>
> >> Please note that the version number excludes the `rcX` string as well
> >> as the "+incubating" string, so it's now simply 1.10.0. This will allow
> >> us to rename the artifact without modifying the artifact checksums when
> >> we actually release.
> >>
> >> WARNING: Due to licensing requirements you will need to set
> >> SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
> >> installing or upgrading. We will try to remove this requirement for the
> >> next release.
> >>
> >> Cheers,
> >> Bolke
> >
>
>








Re: [VOTE] Airflow 1.10.0rc4

2018-08-09 Thread Daniel Imberman
-1 (non-binding),

There is a k8s bug fix that should be PRed by @jordan.zucker this weekend
(relating to the tracking of resourceVersions). There have also been
multiple users requesting the ability to pre-bake docker images which I
will make a PR for this weekend.

On Thu, Aug 9, 2018 at 11:22 AM Bolke de Bruin  wrote:

> 0.5?? Can we score fractions :-) ? Sorry I missed this Ash. I think Fokko
> really wants a 1.10.1 quickly so better include it then? Can you make your
> vote +1?
>
> Thx
> Bolke
>
> > On 9 Aug 2018, at 14:06, Ash Berlin-Taylor  wrote:
> >
> > +0.5 (binding) from me.
> >
> > Tested upgrading from 1.9.0 metadb on Py3.5. Timezones behaving
> > themselves on Postgres. Have not tested the Rbac-based UI.
> >
> > https://github.com/apache/incubator-airflow/commit/d9fecba14c5eb56990508573a91b13ab27ca5153
> > (expanding on UPDATING.md for Logging changes) isn't in the release, but
> > would only affect people who look at the UPDATING.md in the source
> > tarball, which isn't going to be very many - most people will check in
> > the repo and just install via PyPi I'd guess?
> >
> > -ash
> >
> >> On 8 Aug 2018, at 19:21, Bolke de Bruin  wrote:
> >>
> >> Hey all,
> >>
> >> I have cut Airflow 1.10.0 RC4. This email is calling a vote on the
> >> release, which will last for 72 hours. Consider this my (binding) +1.
> >>
> >> Airflow 1.10.0 RC 4 is available at:
> >>
> >> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc4/
> >>
> >> apache-airflow-1.10.0rc4+incubating-source.tar.gz is a source release
> >> that comes with INSTALL instructions.
> >> apache-airflow-1.10.0rc4+incubating-bin.tar.gz is the binary Python
> >> "sdist" release.
> >>
> >> Public keys are available at:
> >>
> >> https://dist.apache.org/repos/dist/release/incubator/airflow/
> >>
> >> The amount of JIRAs fixed is over 700. Please have a look at the
> >> changelog. Since RC3 the following has been fixed:
> >>
> >> [AIRFLOW-2870] Use abstract TaskInstance for migration
> >> [AIRFLOW-2859] Implement own UtcDateTime
> >> [AIRFLOW-2140] Don't require kubernetes for the SparkSubmit hook
> >> [AIRFLOW-2869] Remove smart quote from default config
> >> [AIRFLOW-2857] Fix Read the Docs env
> >>
> >> Please note that the version number excludes the `rcX` string as well
> >> as the "+incubating" string, so it's now simply 1.10.0. This will allow
> >> us to rename the artifact without modifying the artifact checksums when
> >> we actually release.
> >>
> >> WARNING: Due to licensing requirements you will need to set
> >> SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
> >> installing or upgrading. We will try to remove this requirement for the
> >> next release.
> >>
> >> Cheers,
> >> Bolke
> >
>
>


Podling Report Reminder - August 2018

2018-08-09 Thread jmclean
Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 15 August 2018, 10:30 am PDT.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, August 01).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently board members to review and digest. Again, the
very latest you should submit your report is 2 weeks prior to the board
meeting.

Candidate names should not be made public before people are actually
elected, so please do not include the names of potential committers or
PPMC members in your report.

Thanks,

The Apache Incubator PMC

Submitting your Report
---

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
the project or necessarily of its field
*   A list of the three most important issues to address in the move
towards graduation.
*   Any issues that the Incubator PMC or ASF Board might wish/need to be
aware of
*   How has the community developed since the last report
*   How has the project developed since the last report.
*   How does the podling rate their own maturity.

This should be appended to the Incubator Wiki page at:

https://wiki.apache.org/incubator/August2018

Note: This is manually populated. You may need to wait a little before
this page is created from a template.

Mentors
---

Mentors should review reports for their project(s) and sign them off on
the Incubator wiki page. Signing off reports shows that you are
following the project - projects that are not signed may raise alarms
for the Incubator PMC.

Incubator PMC


Re: apache-airflow v1.10.0 on PyPi?

2018-08-09 Thread Krish Sigler
Got it, will use the mailing list in the future.  Thanks for the info

On Thu, Aug 9, 2018 at 2:42 PM, Bolke de Bruin  wrote:

> Hi Kris,
>
> Please use the mailing list for these kind of questions.
>
> Airflow 1.10.0 hasn’t been released yet. We are going through the motions,
> but it will take a couple of days before it’s official (if all goes well).
>
> B.
>
> Sent from my iPad
>
> On 9 Aug 2018, at 23:33, Krish Sigler wrote:
>
> Hi,
>
> First, I apologize if this is weird.  I saw on the Airflow github page
> that you most recently updated the v1.10.0 changelog, and I found your
> email using the instructions here
> (https://www.sourcecon.com/how-to-find-almost-any-github-users-email-address/).
> If that's too weird feel free to tell me and/or ignore this.
>
> I'm emailing because I'm working with the apache-airflow project,
> specifically for setting up pipelines involving GCP packages.  My
> environment uses Python3, and I've been running into the issue outlined in
> this PR: https://github.com/apache/incubator-airflow/pull/3273.  I
> noticed that the fix is part of the v1.10.0 changelog.
> However, the latest version available on PyPi is 1.9.0.  On the Airflow
> wiki page I read that the project is intended to be updated every ~6
> months, and v1.9.0 was released in January.
>
> So my question, if you're at liberty to tell me, is can I expect v1.10.0
> to be available on PyPi in the near future?  If so then great!  That would
> solve my package dependency problem.  If not, then I'll look into some
> workaround for my issue.
>
> Thanks,
> Krish
>
>


Re: apache-airflow v1.10.0 on PyPi?

2018-08-09 Thread Bolke de Bruin
Hi Kris,

Please use the mailing list for these kinds of questions.

Airflow 1.10.0 hasn’t been released yet. We are going through the motions, but 
it will take a couple of days before it’s official (if all goes well).

B.

Sent from my iPad

> On 9 Aug 2018, at 23:33, Krish Sigler wrote:
> 
> Hi,
> 
> First, I apologize if this is weird.  I saw on the Airflow github page that 
> you most recently updated the v1.10.0 changelog, and I found your email using 
> the instructions here 
> (https://www.sourcecon.com/how-to-find-almost-any-github-users-email-address/).
>   If that's too weird feel free to tell me and/or ignore this.
> 
> I'm emailing because I'm working with the apache-airflow project, 
> specifically for setting up pipelines involving GCP packages.  My environment 
> uses Python3, and I've been running into the issue outlined in this PR: 
> https://github.com/apache/incubator-airflow/pull/3273.  I noticed that the 
> fix is part of the v1.10.0 changelog.
> However, the latest version available on PyPi is 1.9.0.  On the Airflow wiki 
> page I read that the project is intended to be updated every ~6 months, and 
> v1.9.0 was released in January.
> 
> So my question, if you're at liberty to tell me, is can I expect v1.10.0 to 
> be available on PyPi in the near future?  If so then great!  That would solve 
> my package dependency problem.  If not, then I'll look into some workaround 
> for my issue.
> 
> Thanks,
> Krish


Re: Broken DAG message won't go away in webserver

2018-08-09 Thread Alex Guziel
IIRC the scheduler sets these messages in the error table in the db.

On Thu, Aug 9, 2018 at 2:13 PM, Ben Laird  wrote:

> The messages persist even after restarting the webserver. I've verified
> with other airflow users in the office that they'd have to manually delete
> records from the 'import_error' table.
>
> When you say 'sync your DAGs', what do you mean exactly? When we fix a DAG,
> we'd normally kill the webserver process, push a zip containing our dag
> directory (with the fixed code), unzip and restart the webserver.
>
> Thanks
>
> On Thu, Aug 9, 2018 at 4:43 PM, Taylor Edmiston 
> wrote:
>
> > Yeah, you definitely shouldn't need to do a resetdb for that.
> >
> > Did you try restarting the webserver?
> >
> > How do you sync your DAGs to the webserver?  Is it possible the fixed DAG
> > didn't get synced there?
> >
> > For me, IIRC, the error stops persisting once the DAG is fixed and
> synced.
> >
> > *Taylor Edmiston*
> > 
> >
> >
> > On Thu, Aug 9, 2018 at 3:35 PM, Ben Laird  wrote:
> >
> > > Hello -
> > >
> > > I've noticed this several times and not sure what the solution is. If I
> > > have a DAG error at some point, I'll see message in the webserver that
> > > says "Broken DAG: [Error]". However, after fixing the code, restarting
> > > the webserver, etc, the error persists. After closing it out, it will
> > > just pop up again after reloading.
> > >
> > > The only way I was able to delete was by doing a `airflow resetdb`. I'd
> > > like to avoid manually deleting records from the DB, as now in prod we
> > > cannot just kill the DB state.
> > >
> > > Any suggestions?
> > >
> > > Thanks,
> > > Ben Laird
> > >
> >
>


Re: Broken DAG message won't go away in webserver

2018-08-09 Thread Ben Laird
The messages persist even after restarting the webserver. I've verified
with other airflow users in the office that they'd have to manually delete
records from the 'import_error' table.
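
For anyone who does end up clearing these records by hand, a targeted delete is less drastic than `airflow resetdb`. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the metadata DB, and the column names (`id`, `timestamp`, `filename`, `stacktrace`) and the helper name `clear_import_errors` are assumptions, so check the real schema before running anything like this against prod.

```python
import sqlite3


def clear_import_errors(conn, filename):
    """Delete import_error rows for one DAG file; return how many were removed."""
    cur = conn.execute("DELETE FROM import_error WHERE filename = ?", (filename,))
    conn.commit()
    return cur.rowcount


# Stand-in for the metadata DB (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import_error ("
    "id INTEGER PRIMARY KEY, timestamp TEXT, filename TEXT, stacktrace TEXT)"
)
conn.executemany(
    "INSERT INTO import_error (timestamp, filename, stacktrace) VALUES (?, ?, ?)",
    [
        ("2018-08-09 15:35:00", "/dags/fixed_dag.py", "SyntaxError: ..."),
        ("2018-08-09 15:36:00", "/dags/still_broken.py", "ImportError: ..."),
    ],
)

# Remove only the rows that belong to the DAG file that has been fixed.
removed = clear_import_errors(conn, "/dags/fixed_dag.py")
remaining = [row[0] for row in conn.execute("SELECT filename FROM import_error")]
```

Scoping the delete to a single filename leaves errors for DAGs that are still broken in place.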

When you say 'sync your DAGs', what do you mean exactly? When we fix a DAG,
we'd normally kill the webserver process, push a zip containing our dag
directory (with the fixed code), unzip and restart the webserver.

Thanks

On Thu, Aug 9, 2018 at 4:43 PM, Taylor Edmiston  wrote:

> Yeah, you definitely shouldn't need to do a resetdb for that.
>
> Did you try restarting the webserver?
>
> How do you sync your DAGs to the webserver?  Is it possible the fixed DAG
> didn't get synced there?
>
> For me, IIRC, the error stops persisting once the DAG is fixed and synced.
>
> *Taylor Edmiston*
> 
>
>
> On Thu, Aug 9, 2018 at 3:35 PM, Ben Laird  wrote:
>
> > Hello -
> >
> > I've noticed this several times and not sure what the solution is. If I
> > have a DAG error at some point, I'll see message in the webserver that
> > says "Broken DAG: [Error]". However, after fixing the code, restarting
> > the webserver, etc, the error persists. After closing it out, it will
> > just pop up again after reloading.
> >
> > The only way I was able to delete was by doing a `airflow resetdb`. I'd
> > like to avoid manually deleting records from the DB, as now in prod we
> > cannot just kill the DB state.
> >
> > Any suggestions?
> >
> > Thanks,
> > Ben Laird
> >
>


Re: Broken DAG message won't go away in webserver

2018-08-09 Thread Taylor Edmiston
Yeah, you definitely shouldn't need to do a resetdb for that.

Did you try restarting the webserver?

How do you sync your DAGs to the webserver?  Is it possible the fixed DAG
didn't get synced there?

For me, IIRC, the error stops persisting once the DAG is fixed and synced.

*Taylor Edmiston*



On Thu, Aug 9, 2018 at 3:35 PM, Ben Laird  wrote:

> Hello -
>
> I've noticed this several times and not sure what the solution is. If I
> have a DAG error at some point, I'll see message in the webserver that says
> "Broken DAG: [Error]". However, after fixing the code, restarting the
> webserver, etc, the error persists. After closing it out, it will just pop
> up again after reloading.
>
> The only way I was able to delete was by doing a `airflow resetdb`. I'd
> like to avoid manually deleting records from the DB, as now in prod we
> cannot just kill the DB state.
>
> Any suggestions?
>
> Thanks,
> Ben Laird
>


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Maxime Beauchemin
The change on perf for the DAG table would be extremely negligible.

Maybe for task_instances (large table with millions of rows, 3 fields
composite key) it *could* be a decent idea. Though you'd then need to have
two indexes to store and maintain and we may have to change the code to
actually use and reference that new more efficient pk in places where it's
more efficient to use that index (some of it SQLAlchemy would do right out
of the box).

This mostly affects the index size (btree(id) is much smaller than
btree(dag_id, task_id, execution_date)), not the key lookup time, which is
log(n) either way. We'd still have to use the composite btree when we want
to do range scans, which we use frequently to get sets of tasks for a dag or
a specific dag task. Since lookups are log(n), and since we need to maintain
that composite btree anyway for range scans, I don't see where that would
really help. It would be a better index (fewer pages, less memory usage,
...) if we didn't need that other composite one, which we do.
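
As a rough illustration of the sizing argument above, the sketch below estimates leaf pages and tree height for a simplified B-tree. Everything numeric here is an assumption made for the example (an 8 KB page, 8 bytes of per-entry overhead, 10 million rows, and about 108 bytes for an average composite key); real indexes have extra overhead and fill factors, so treat the output as an order-of-magnitude guide only.

```python
import math

PAGE_BYTES = 8192      # assumed page size
POINTER_BYTES = 8      # assumed per-entry overhead (row pointer etc.)


def btree_estimate(n_rows, key_bytes):
    """Return (leaf_pages, height) for a simplified B-tree model."""
    fanout = PAGE_BYTES // (key_bytes + POINTER_BYTES)
    leaf_pages = math.ceil(n_rows / fanout)
    # Count levels by grouping pages under parents until one root remains.
    height, pages = 1, leaf_pages
    while pages > 1:
        pages = math.ceil(pages / fanout)
        height += 1
    return leaf_pages, height


N_ROWS = 10_000_000              # assumed task_instance row count
INT_KEY = 4                      # integer surrogate id
COMPOSITE_KEY = 50 + 50 + 8      # assumed avg dag_id + task_id + execution_date

int_pages, int_height = btree_estimate(N_ROWS, INT_KEY)
comp_pages, comp_height = btree_estimate(N_ROWS, COMPOSITE_KEY)
print(f"id index:        {int_pages} leaf pages, height {int_height}")
print(f"composite index: {comp_pages} leaf pages, height {comp_height}")
```

Under these assumptions the integer index needs roughly a tenth of the leaf pages, while the tree is only one level shallower, which matches the point that the gain is in index size rather than lookup time.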

Max

On Thu, Aug 9, 2018 at 8:05 AM Vardan Gupta 
wrote:

> Point well taken on backward compatibility, we will have to take this
> change very diligently, if implemented.
>
> On Thu, Aug 9, 2018 at 7:29 PM Юли Волкова  wrote:
>
> > Because what you described says nothing about backward compatibility.
> > Do you think everyone who uses Airflow needs to change all their DAGs?
> > It's very strange, because you are proposing one of the most critical
> > changes and it will affect everyone. If you want an id, call it
> > dag_metadata_id and add it. But do not propose a change that lacks
> > backward compatibility. It's too strange.
> >
> > On Thu, Aug 9, 2018 at 7:04 AM vardangupta...@gmail.com <
> > vardangupta...@gmail.com> wrote:
> >
> > >
> > >
> > > On 2018/08/09 11:55:11, Ash Berlin-Taylor  wrote:
> > > > Absolutely - there will still need to be a human-readable DAG id,
> > > > even if we end up with an auto-incrementing integer ID column
> > > > internally for table join performance reasons.
> > > >
> > > > -ash
> > > >
> > > > > On 9 Aug 2018, at 12:35, Юли Волкова  wrote:
> > > > >
> > > > > How will you understand what your DAG 2 doing enter to it? For
> > > each of
> > > > > 100, for example?
> > > > > Especially, if you are not a developer, who create it. You are a
> > > support
> > > > > team and have 120 DAGs.
> > > > >
> > > > > The first time, when want to also send the answer to dev-mail list.
> > > Please,
> > > > > don't do it.
> > > > >
> > > > > I think it's will be really bad to all who use dag_id like a saying
> > > name of
> > > > > dag. If I will be looked at 0329313 this does not say anything
> useful
> > > for
> > > > > me and it will be very very complicated to identify for which
> process
> > > dag
> > > > > using.  It could be another id for the indexes in DB if it's real
> > > problem
> > > > > for somebody. But, please, do not change dag_id.
> > > > >
> > > > > On Mon, Aug 6, 2018 at 1:32 AM vardangupta...@gmail.com <
> > > > > vardangupta...@gmail.com> wrote:
> > > > >
> > > > >> Hi Everyone,
> > > > >>
> > > > >> Do we have any plan to change type of dag_id from String to
> Number,
> > > this
> > > > >> will make queries on metadata more performant, proposal could be
> > > generating
> > > > >> an auto-incremental value in dag table and this id getting used in
> > > rest of
> > > > >> the other tables?
> > > > >>
> > > > >>
> > > > >> Regards,
> > > > >> Vardan Gupta
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > _
> > > > >
> > > > > С уважением, Юлия Волкова
> > > > > Тел. : +7 (911) 116-71-82
> > > >
> > > >
> > >
> > > Thanks Ash for your reply, I am aligned with what you're saying.
> > >
> > > I was not proposing to take away human readable dag_id instead I was
> > > thinking, why can't we create another field like dag_name which will
> hold
> > > this information at all front facing sites while dag_id is changed to
> > > integer, this will help in making joins work faster in metastore.
> Though,
> > > currently dag_id is indexed but still indexing int (4 bytes) vs
> > > varchar(250) are going to take more index blocks and therefore more
> look
> > up
> > > time. Also, if dag_id is not trivial to change to int, let it be
> present
> > > and let's introduce another col which is actually integer in type and
> let
> > > joining happen on this column across all tables.
> > >
> >
> >
> > --
> > _
> >
> > С уважением, Юлия Волкова
> > Тел. : +7 (911) 116-71-82
>


Broken DAG message won't go away in webserver

2018-08-09 Thread Ben Laird
Hello -

I've noticed this several times and not sure what the solution is. If I
have a DAG error at some point, I'll see message in the webserver that says
"Broken DAG: [Error]". However, after fixing the code, restarting the
webserver, etc, the error persists. After closing it out, it will just pop
up again after reloading.

The only way I was able to delete was by doing a `airflow resetdb`. I'd
like to avoid manually deleting records from the DB, as now in prod we
cannot just kill the DB state.

Any suggestions?

Thanks,
Ben Laird


Re: [VOTE] Airflow 1.10.0rc4

2018-08-09 Thread Bolke de Bruin
0.5?? Can we score fractions :-) ? Sorry I missed this Ash. I think Fokko 
really wants a 1.10.1 quickly so better include it then? Can you make your vote 
+1?

Thx
Bolke

> On 9 Aug 2018, at 14:06, Ash Berlin-Taylor  wrote:
> 
> +0.5 (binding) from me.
> 
> Tested upgrading from 1.9.0 metadb on Py3.5. Timezones behaving themselves on 
> Postgres. Have not tested the Rbac-based UI.
> 
> https://github.com/apache/incubator-airflow/commit/d9fecba14c5eb56990508573a91b13ab27ca5153
> (expanding on UPDATING.md for Logging changes) isn't in the release, but 
> would only affect people who look at the UPDATING.md in the source tarball, 
> which isn't going to be very many - most people will check in the repo and 
> just install via PyPi I'd guess?
> 
> -ash
> 
>> On 8 Aug 2018, at 19:21, Bolke de Bruin  wrote:
>> 
>> Hey all,
>> 
>> I have cut Airflow 1.10.0 RC4. This email is calling a vote on the release,
>> which will last for 72 hours. Consider this my (binding) +1.
>> 
>> Airflow 1.10.0 RC 4 is available at:
>> 
>> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc4/ 
>> 
>> 
>> apache-airflow-1.10.0rc4+incubating-source.tar.gz is a source release that
>> comes with INSTALL instructions.
>> apache-airflow-1.10.0rc4+incubating-bin.tar.gz is the binary Python "sdist"
>> release.
>> 
>> Public keys are available at:
>> 
>> https://dist.apache.org/repos/dist/release/incubator/airflow/ 
>> 
>> 
>> The amount of JIRAs fixed is over 700. Please have a look at the changelog. 
>> Since RC3 the following has been fixed:
>> 
>> [AIRFLOW-2870] Use abstract TaskInstance for migration
>> [AIRFLOW-2859] Implement own UtcDateTime
>> [AIRFLOW-2140] Don't require kubernetes for the SparkSubmit hook
>> [AIRFLOW-2869] Remove smart quote from default config
>> [AIRFLOW-2857] Fix Read the Docs env
>> 
>> Please note that the version number excludes the `rcX` string as well
>> as the "+incubating" string, so it's now simply 1.10.0. This will allow us
>> to rename the artifact without modifying the artifact checksums when we
>> actually release.
>> 
>> WARNING: Due to licensing requirements you will need to set 
>> SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
>> installing or upgrading. We will try to remove this requirement for the 
>> next release.
>> 
>> Cheers,
>> Bolke
> 



Re: Modeling rate limited api calls in airflow

2018-08-09 Thread Gerard Toonstra
Have you looked into pools?  Pools allow you to specify how many tasks
may use a common resource at any given time.
That way you could limit this to 1, 2, or 3, for example. Pools are not
dynamic, however, so they only let you set an upper limit on how many
clients hit the API at any moment; they cannot determine how many should
run while the rate limit is in effect
(unless you use code to reconfigure the pool on demand, i.e. change the
number of clients when you hit the rate limit, but I'm not sure I
should recommend that).  It sounds as if this logic is
best introduced at the hook level, where the hook hands
out an API interface only when the rate limit is not in place, and
operators specify how many retries should occur.
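
A minimal sketch of that hook-level idea follows. It is plain Python rather than Airflow code, and the names `RateLimitGate`, `RateLimited`, and the key `adwords_rate_limited_until` are hypothetical. The store is a dict standing in for something shared such as the Redis key from the original setup. In a real task, letting the exception propagate would fail the attempt so that the operator's `retries` and `retry_delay` settings take over.

```python
import time


class RateLimited(Exception):
    """Raised when the API is still in its cool-down window."""

    def __init__(self, retry_after):
        super().__init__("rate limited, retry in %.0fs" % retry_after)
        self.retry_after = retry_after


class RateLimitGate:
    """Hypothetical gate a hook could consult before handing out a client."""

    def __init__(self, store, key="adwords_rate_limited_until", clock=time.time):
        self.store = store    # any dict-like store shared between tasks
        self.key = key
        self.clock = clock

    def block_for(self, seconds):
        """Record that the API reported a rate limit lasting `seconds`."""
        self.store[self.key] = self.clock() + seconds

    def check(self):
        """Raise RateLimited while the cool-down is active, else do nothing."""
        remaining = self.store.get(self.key, 0) - self.clock()
        if remaining > 0:
            raise RateLimited(remaining)


# Usage with a fake clock so the behaviour is easy to follow.
now = [1000.0]
gate = RateLimitGate(store={}, clock=lambda: now[0])
gate.check()               # no limit recorded yet, so this passes
gate.block_for(30)         # the API said: back off for 30 seconds
try:
    gate.check()
    blocked = False
except RateLimited as exc:
    blocked = True
    wait = exc.retry_after
now[0] += 31               # the cool-down has passed
gate.check()               # passes again
```

Combining a gate like this with a small pool keeps the number of concurrent clients bounded while still backing off when the limit is actually hit.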

The Adwords API does allow increasing the rate limit threshold though and
you're probably better off negotiating
with Google to up that threshold, explaining your business case etc.?

Gerard



On Thu, Aug 9, 2018 at 10:43 AM r...@goshift.com  wrote:

> Hello,
>
> I am in the process of migrating a bespoke data pipe line built around
> celery into airflow.
>
> We have a number of different tasks which interact with the Adwords API
> which has a rate limiting policy. The policy isn't a fixed number of
> requests its variable.
>
> In our celery code we have handled this by capturing a rate limit error
> response and setting a key in redis to make sure that no tasks execute
> against the API until it's expired. Any task that does get executed checks
> for the presence of the key and if the key exists issues a retry for when
> the rate limit is due to expire.
>
> Moving over to Airflow I can't find a way to go about scheduling a task to
> retry in a specific amount of time. Doing some reading it seems a Sensor
> could work to prevent other dags from executing whilst the rate limit is
> present.
>
> I also can't seem to find an example of handling different exceptions from
> a python task and adapting the retry logic accordingly.
>
> Any pointers would be much appreciated,
>
> Rob
>


Re: [VOTE] Airflow 1.10.0rc4

2018-08-09 Thread Ash Berlin-Taylor
+0.5 (binding) from me.

Tested upgrading from a 1.9.0 metadb on Py3.5. Timezones are behaving themselves on 
Postgres. I have not tested the RBAC-based UI.

https://github.com/apache/incubator-airflow/commit/d9fecba14c5eb56990508573a91b13ab27ca5153
(expanding on UPDATING.md for Logging changes) isn't in the release, but that would 
only affect people who look at the UPDATING.md in the source tarball, which 
isn't going to be very many - most people will check in the repo and just 
install via PyPI, I'd guess?

-ash

> On 8 Aug 2018, at 19:21, Bolke de Bruin  wrote:
> 
> Hey all,
> 
> I have cut Airflow 1.10.0 RC4. This email is calling a vote on the release,
> which will last for 72 hours. Consider this my (binding) +1.
> 
> Airflow 1.10.0 RC 4 is available at:
> 
> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc4/ 
> 
> 
> apache-airflow-1.10.0rc4+incubating-source.tar.gz is a source release that
> comes with INSTALL instructions.
> apache-airflow-1.10.0rc4+incubating-bin.tar.gz is the binary Python "sdist"
> release.
> 
> Public keys are available at:
> 
> https://dist.apache.org/repos/dist/release/incubator/airflow/ 
> 
> 
> The amount of JIRAs fixed is over 700. Please have a look at the changelog. 
> Since RC3 the following has been fixed:
> 
> [AIRFLOW-2870] Use abstract TaskInstance for migration
> [AIRFLOW-2859] Implement own UtcDateTime
> [AIRFLOW-2140] Don't require kubernetes for the SparkSubmit hook
> [AIRFLOW-2869] Remove smart quote from default config
> [AIRFLOW-2857] Fix Read the Docs env
> 
> Please note that the version number excludes the `rcX` string as well
> as the "+incubating" string, so it's now simply 1.10.0. This will allow us
> to rename the artifact without modifying the artifact checksums when we
> actually release.
> 
> WARNING: Due to licensing requirements you will need to set 
> SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
> installing or upgrading. We will try to remove this requirement for the 
> next release.
> 
> Cheers,
> Bolke



Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Vardan Gupta
Point well taken on backward compatibility; we will have to approach this
change very diligently, if it is implemented.

On Thu, Aug 9, 2018 at 7:29 PM Юли Волкова  wrote:

> Because what you described says nothing about backward compatibility.
> Do you think everyone who uses Airflow needs to change all their DAGs? It's
> very strange, because you are proposing one of the most critical changes and
> it will affect everyone. If you want an id, call it dag_metadata_id and add
> it. But do not propose a change that lacks backward compatibility. It's too
> strange.
>
> On Thu, Aug 9, 2018 at 7:04 AM vardangupta...@gmail.com <
> vardangupta...@gmail.com> wrote:
>
> >
> >
> > On 2018/08/09 11:55:11, Ash Berlin-Taylor  wrote:
> > > Absolutely - there will still need to be a human-readable DAG id, even
> > > if we end up with an auto-incrementing integer ID column internally
> > > for table join performance reasons.
> > >
> > > -ash
> > >
> > > > On 9 Aug 2018, at 12:35, Юли Волкова  wrote:
> > > >
> > > > How will you understand what your DAG 2 doing enter to it? For
> > each of
> > > > 100, for example?
> > > > Especially, if you are not a developer, who create it. You are a
> > support
> > > > team and have 120 DAGs.
> > > >
> > > > The first time, when want to also send the answer to dev-mail list.
> > Please,
> > > > don't do it.
> > > >
> > > > I think it's will be really bad to all who use dag_id like a saying
> > name of
> > > > dag. If I will be looked at 0329313 this does not say anything useful
> > for
> > > > me and it will be very very complicated to identify for which process
> > dag
> > > > using.  It could be another id for the indexes in DB if it's real
> > problem
> > > > for somebody. But, please, do not change dag_id.
> > > >
> > > > On Mon, Aug 6, 2018 at 1:32 AM vardangupta...@gmail.com <
> > > > vardangupta...@gmail.com> wrote:
> > > >
> > > >> Hi Everyone,
> > > >>
> > > >> Do we have any plan to change type of dag_id from String to Number,
> > this
> > > >> will make queries on metadata more performant, proposal could be
> > generating
> > > >> an auto-incremental value in dag table and this id getting used in
> > rest of
> > > >> the other tables?
> > > >>
> > > >>
> > > >> Regards,
> > > >> Vardan Gupta
> > > >>
> > > >
> > > >
> > > > --
> > > > _
> > > >
> > > > С уважением, Юлия Волкова
> > > > Тел. : +7 (911) 116-71-82
> > >
> > >
> >
> > Thanks Ash for your reply, I am aligned with what you're saying.
> >
> > I was not proposing to take away human readable dag_id instead I was
> > thinking, why can't we create another field like dag_name which will hold
> > this information at all front facing sites while dag_id is changed to
> > integer, this will help in making joins work faster in metastore. Though,
> > currently dag_id is indexed but still indexing int (4 bytes) vs
> > varchar(250) are going to take more index blocks and therefore more look
> up
> > time. Also, if dag_id is not trivial to change to int, let it be present
> > and let's introduce another col which is actually integer in type and let
> > joining happen on this column across all tables.
> >
>
>
> --
> _
>
> С уважением, Юлия Волкова
> Тел. : +7 (911) 116-71-82


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Vardan Gupta
Absolutely, I'll work on producing some results. Also, it's not just a
matter of joining tables: even point queries on individual tables like
task_instance, dag_run, fag_failure will be faster with an integer identifier.
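
For anyone who wants to gather such numbers, here is a minimal sketch of one
possible measurement. It uses Python's built-in sqlite3 module rather than a
real Airflow metadata database, and the table layout, the dag_pk column and
the DAG names are all made up for illustration, so the timings it prints say
nothing definitive about MySQL or Postgres.

```python
import sqlite3
import time

N_DAGS = 200        # hypothetical number of DAGs
RUNS_PER_DAG = 50   # hypothetical rows per DAG


def dag_name(i):
    # Hypothetical human-readable DAG id.
    return "team_reporting_pipeline_%05d" % i


def build_db():
    """Create an in-memory table indexed by a string dag_id and an integer dag_pk."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance "
        "(dag_id VARCHAR(250), dag_pk INTEGER, state TEXT)"
    )
    rows = [
        (dag_name(d), d, "success")
        for d in range(N_DAGS)
        for _ in range(RUNS_PER_DAG)
    ]
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX ti_dag_id ON task_instance (dag_id)")
    conn.execute("CREATE INDEX ti_dag_pk ON task_instance (dag_pk)")
    conn.commit()
    return conn


def time_query(conn, sql, keys, repeat=20):
    """Run the point query for every key, `repeat` times; return (seconds, rows counted)."""
    total = 0
    start = time.perf_counter()
    for _ in range(repeat):
        for key in keys:
            total += conn.execute(sql, (key,)).fetchone()[0]
    return time.perf_counter() - start, total


conn = build_db()
str_secs, str_rows = time_query(
    conn,
    "SELECT COUNT(*) FROM task_instance WHERE dag_id = ?",
    [dag_name(d) for d in range(N_DAGS)],
)
int_secs, int_rows = time_query(
    conn,
    "SELECT COUNT(*) FROM task_instance WHERE dag_pk = ?",
    list(range(N_DAGS)),
)
print("string key: %.4fs, integer key: %.4fs" % (str_secs, int_secs))
```

Both queries count the same rows, so any difference in the printed timings
comes from the key type alone. A real comparison would have to run against
the actual metadata database with realistic row counts.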

On Thu, Aug 9, 2018 at 7:59 PM Ash Berlin-Taylor 
wrote:

> Since this is a big change that would touch much of the code base, before
> we do this we need to see some hard numbers - timing or benchmarks of
> queries etc.
>
> Also how often do we actually do such a join etc?
>
> -ash
>
> > On 9 Aug 2018, at 13:04, vardangupta...@gmail.com wrote:
> >
> > Thanks Ash for your reply, I am aligned with what you're saying.
> >
> > I was not proposing to take away human readable dag_id instead I was
> thinking, why can't we create another field like dag_name which will hold
> this information at all front facing sites while dag_id is changed to
> integer, this will help in making joins work faster in metastore. Though,
> currently dag_id is indexed but still indexing int (4 bytes) vs
> varchar(250) are going to take more index blocks and therefore more look up
> time. Also, if dag_id is not trivial to change to int, let it be present
> and let's introduce another col which is actually integer in type and let
> joining happen on this column across all tables.


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Ash Berlin-Taylor
Since this is a big change that would touch much of the code base, before we do 
this we need to see some hard numbers - timing or benchmarks of queries etc.

Also how often do we actually do such a join etc?

-ash

> On 9 Aug 2018, at 13:04, vardangupta...@gmail.com 
>  wrote:
> 
> Thanks Ash for your reply, I am aligned with what you're saying. 
> 
> I was not proposing to take away human readable dag_id instead I was 
> thinking, why can't we create another field like dag_name which will hold 
> this information at all front facing sites while dag_id is changed to 
> integer, this will help in making joins work faster in metastore. Though, 
> currently dag_id is indexed but still indexing int (4 bytes) vs varchar(250) 
> are going to take more index blocks and therefore more look up time. Also, if 
> dag_id is not trivial to change to int, let it be present and let's introduce 
> another col which is actually integer in type and let joining happen on this 
> column across all tables.



Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Юли Волкова
Because what you described says nothing about backward compatibility.
Do you think everyone who uses Airflow should have to change all their DAGs?
It's very strange, because you are proposing one of the most critical changes
possible, and it would affect everyone. If you want an id, call it
dag_metadata_id and add it. But don't propose a change that isn't backward
compatible. It's too strange.

On Thu, Aug 9, 2018 at 7:04 AM vardangupta...@gmail.com <
vardangupta...@gmail.com> wrote:

>
>
> On 2018/08/09 11:55:11, Ash Berlin-Taylor  wrote:
> > Absolutely - there will still need to be a human-readable DAG id, even
> we end up with an auto-icrementing integer ID column internally and for
> table join performance reasons.
> >
> > -ash
> >
> > > On 9 Aug 2018, at 12:35, Юли Волкова  wrote:
> > >
> > > How will you understand what your DAG 2 doing enter to it? For
> each of
> > > 100, for example?
> > > Especially, if you are not a developer, who create it. You are a
> support
> > > team and have 120 DAGs.
> > >
> > > The first time, when want to also send the answer to dev-mail list.
> Please,
> > > don't do it.
> > >
> > > I think it's will be really bad to all who use dag_id like a saying
> name of
> > > dag. If I will be looked at 0329313 this does not say anything useful
> for
> > > me and it will be very very complicated to identify for which process
> dag
> > > using.  It could be another id for the indexes in DB if it's real
> problem
> > > for somebody. But, please, do not change dag_id.
> > >
> > > On Mon, Aug 6, 2018 at 1:32 AM vardangupta...@gmail.com <
> > > vardangupta...@gmail.com> wrote:
> > >
> > >> Hi Everyone,
> > >>
> > >> Do we have any plan to change type of dag_id from String to Number,
> this
> > >> will make queries on metadata more performant, proposal could be
> generating
> > >> an auto-incremental value in dag table and this id getting used in
> rest of
> > >> the other tables?
> > >>
> > >>
> > >> Regards,
> > >> Vardan Gupta
> > >>
> > >
> > >
> > > --
> > > _
> > >
> > > > Best regards, Юлия Волкова
> > > > Tel.: +7 (911) 116-71-82
> >
> >
>
> Thanks Ash for your reply, I am aligned with what you're saying.
>
> I was not proposing to take away human readable dag_id instead I was
> thinking, why can't we create another field like dag_name which will hold
> this information at all front facing sites while dag_id is changed to
> integer, this will help in making joins work faster in metastore. Though,
> currently dag_id is indexed but still indexing int (4 bytes) vs
> varchar(250) are going to take more index blocks and therefore more look up
> time. Also, if dag_id is not trivial to change to int, let it be present
> and let's introduce another col which is actually integer in type and let
> joining happen on this column across all tables.
>


-- 
_

Best regards, Юлия Волкова
Tel.: +7 (911) 116-71-82


Re: [VOTE] Airflow 1.10.0rc4

2018-08-09 Thread Driesprong, Fokko
Good point Bolke, Sid. It seems that there are still a few issues with
Tenacity as well,
therefore I would like to change my vote:

+1 (binding)

Cheers, Fokko

2018-08-09 14:08 GMT+02:00 Ash Berlin-Taylor :

> +0.5 (binding) from me.
>
> Tested upgrading form 1.9.0 metadb on Py3.5. Timezones behaving themselves
> on Postgres. Have not tested the Rbac-based UI.
>
> https://github.com/apache/incubator-airflow/commit/d9fecba14c5eb56990508573a91b13ab27ca5153
> (expanding on UPDATING.md for Logging changes) isn't in the release, but
> would only affect people who look at the UPDATING.md in the source tarball,
> which isn't going to be very many - most people will check in the repo and
> just install via PyPi I'd guess?
>
> -ash
>
> > On 8 Aug 2018, at 19:21, Bolke de Bruin  wrote:
> >
> > Hey all,
> >
> > I have cut Airflow 1.10.0 RC4. This email is calling a vote on the
> release,
> > which will last for 72 hours. Consider this my (binding) +1.
> >
> > Airflow 1.10.0 RC 4 is available at:
> >
> > https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc4/ <
> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc4/>
> >
> > apache-airflow-1.10.0rc4+incubating-source.tar.gz is a source release
> that
> > comes with INSTALL instructions.
> > apache-airflow-1.10.0rc4+incubating-bin.tar.gz is the binary Python
> "sdist"
> > release.
> >
> > Public keys are available at:
> >
> > https://dist.apache.org/repos/dist/release/incubator/airflow/ <
> https://dist.apache.org/repos/dist/release/incubator/airflow/>
> >
> > The amount of JIRAs fixed is over 700. Please have a look at the
> changelog.
> > Since RC3 the following has been fixed:
> >
> > [AIRFLOW-2870] Use abstract TaskInstance for migration
> > [AIRFLOW-2859] Implement own UtcDateTime
> > [AIRFLOW-2140] Don't require kubernetes for the SparkSubmit hook
> > [AIRFLOW-2869] Remove smart quote from default config
> > [AIRFLOW-2857] Fix Read the Docs env
> >
> > Please note that the version number excludes the `rcX` string as well
> > as the "+incubating" string, so it's now simply 1.10.0. This will allow
> us
> > to rename the artifact without modifying the artifact checksums when we
> > actually release.
> >
> > WARNING: Due to licensing requirements you will need to set
> > SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
> > installing or upgrading. We will try to remove this requirement for the
> > next release.
> >
> > Cheers,
> > Bolke
>
>


Re: [VOTE] Airflow 1.10.0rc4

2018-08-09 Thread Ash Berlin-Taylor
+0.5 (binding) from me.

Tested upgrading from 1.9.0 metadb on Py3.5. Timezones behaving themselves on
Postgres. Have not tested the RBAC-based UI.

https://github.com/apache/incubator-airflow/commit/d9fecba14c5eb56990508573a91b13ab27ca5153
(expanding on UPDATING.md for Logging changes) isn't in the release, but would
only affect people who look at the UPDATING.md in the source tarball, which
isn't going to be very many - most people will check in the repo and just
install via PyPI I'd guess?

-ash

> On 8 Aug 2018, at 19:21, Bolke de Bruin  wrote:
> 
> Hey all,
> 
> I have cut Airflow 1.10.0 RC4. This email is calling a vote on the release,
> which will last for 72 hours. Consider this my (binding) +1.
> 
> Airflow 1.10.0 RC 4 is available at:
> 
> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0rc4/ 
> 
> 
> apache-airflow-1.10.0rc4+incubating-source.tar.gz is a source release that
> comes with INSTALL instructions.
> apache-airflow-1.10.0rc4+incubating-bin.tar.gz is the binary Python "sdist"
> release.
> 
> Public keys are available at:
> 
> https://dist.apache.org/repos/dist/release/incubator/airflow/ 
> 
> 
> The amount of JIRAs fixed is over 700. Please have a look at the changelog. 
> Since RC3 the following has been fixed:
> 
> [AIRFLOW-2870] Use abstract TaskInstance for migration
> [AIRFLOW-2859] Implement own UtcDateTime
> [AIRFLOW-2140] Don't require kubernetes for the SparkSubmit hook
> [AIRFLOW-2869] Remove smart quote from default config
> [AIRFLOW-2857] Fix Read the Docs env
> 
> Please note that the version number excludes the `rcX` string as well
> as the "+incubating" string, so it's now simply 1.10.0. This will allow us
> to rename the artifact without modifying the artifact checksums when we
> actually release.
> 
> WARNING: Due to licensing requirements you will need to set 
> SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when
> installing or upgrading. We will try to remove this requirement for the 
> next release.
> 
> Cheers,
> Bolke



Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread vardanguptacse



On 2018/08/09 11:55:11, Ash Berlin-Taylor  wrote: 
> Absolutely - there will still need to be a human-readable DAG id, even we end 
> up with an auto-icrementing integer ID column internally and for table join 
> performance reasons.
> 
> -ash
> 
> > On 9 Aug 2018, at 12:35, Юли Волкова  wrote:
> > 
> > How will you understand what your DAG 2 doing enter to it? For each of
> > 100, for example?
> > Especially, if you are not a developer, who create it. You are a support
> > team and have 120 DAGs.
> > 
> > The first time, when want to also send the answer to dev-mail list. Please,
> > don't do it.
> > 
> > I think it's will be really bad to all who use dag_id like a saying name of
> > dag. If I will be looked at 0329313 this does not say anything useful for
> > me and it will be very very complicated to identify for which process dag
> > using.  It could be another id for the indexes in DB if it's real problem
> > for somebody. But, please, do not change dag_id.
> > 
> > On Mon, Aug 6, 2018 at 1:32 AM vardangupta...@gmail.com <
> > vardangupta...@gmail.com> wrote:
> > 
> >> Hi Everyone,
> >> 
> >> Do we have any plan to change type of dag_id from String to Number, this
> >> will make queries on metadata more performant, proposal could be generating
> >> an auto-incremental value in dag table and this id getting used in rest of
> >> the other tables?
> >> 
> >> 
> >> Regards,
> >> Vardan Gupta
> >> 
> > 
> > 
> > -- 
> > _
> > 
> > > Best regards, Юлия Волкова
> > > Tel.: +7 (911) 116-71-82
> 
> 

Thanks Ash for your reply, I agree with what you're saying.

I was not proposing to take away the human-readable dag_id. Instead I was
thinking: why can't we create another field, like dag_name, which would hold
this information at all front-facing places, while dag_id is changed to an
integer? This would help make joins faster in the metastore. dag_id is
currently indexed, but an index on varchar(250) takes more index blocks than
an index on an int (4 bytes), and therefore more lookup time. Also, if dag_id
is not trivial to change to int, let it stay as it is and let's introduce
another column which is actually an integer, and let joins happen on that
column across all tables.
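
The second variant (keep dag_id, add an integer column and join on it) could
look roughly like the sketch below. It runs on an in-memory SQLite database
purely for illustration; the dag_pk column name, the reduced table layouts
and the sample DAG are assumptions, not the actual Airflow schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dag (
        dag_pk INTEGER PRIMARY KEY AUTOINCREMENT,  -- hypothetical integer key
        dag_id VARCHAR(250) NOT NULL UNIQUE        -- human-readable id, unchanged
    );
    CREATE TABLE task_instance (
        task_id TEXT NOT NULL,
        dag_pk  INTEGER NOT NULL REFERENCES dag (dag_pk),
        state   TEXT
    );
    CREATE INDEX ti_dag_pk ON task_instance (dag_pk);
    """
)

# Register a DAG under its readable name and look up the generated integer key.
conn.execute("INSERT INTO dag (dag_id) VALUES (?)", ("daily_sales_report",))
dag_pk = conn.execute(
    "SELECT dag_pk FROM dag WHERE dag_id = ?", ("daily_sales_report",)
).fetchone()[0]

# Child tables store only the integer key.
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [("extract", dag_pk, "success"), ("load", dag_pk, "running")],
)

# Joins use the integer column; the readable dag_id is still what gets displayed.
rows = conn.execute(
    """
    SELECT d.dag_id, ti.task_id, ti.state
    FROM task_instance AS ti
    JOIN dag AS d ON d.dag_pk = ti.dag_pk
    ORDER BY ti.task_id
    """
).fetchall()
print(rows)
```

Users would keep writing and reading dag_id exactly as today; only the
internal joins would move to the integer column.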


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Ash Berlin-Taylor
Absolutely - there will still need to be a human-readable DAG id, even if we
end up with an auto-incrementing integer ID column internally for table join
performance reasons.

-ash

> On 9 Aug 2018, at 12:35, Юли Волкова  wrote:
> 
> How will you understand what your DAG 2 doing enter to it? For each of
> 100, for example?
> Especially, if you are not a developer, who create it. You are a support
> team and have 120 DAGs.
> 
> The first time, when want to also send the answer to dev-mail list. Please,
> don't do it.
> 
> I think it's will be really bad to all who use dag_id like a saying name of
> dag. If I will be looked at 0329313 this does not say anything useful for
> me and it will be very very complicated to identify for which process dag
> using.  It could be another id for the indexes in DB if it's real problem
> for somebody. But, please, do not change dag_id.
> 
> On Mon, Aug 6, 2018 at 1:32 AM vardangupta...@gmail.com <
> vardangupta...@gmail.com> wrote:
> 
>> Hi Everyone,
>> 
>> Do we have any plan to change type of dag_id from String to Number, this
>> will make queries on metadata more performant, proposal could be generating
>> an auto-incremental value in dag table and this id getting used in rest of
>> the other tables?
>> 
>> 
>> Regards,
>> Vardan Gupta
>> 
> 
> 
> -- 
> _
> 
> > Best regards, Юлия Волкова
> > Tel.: +7 (911) 116-71-82



Re: Identifying delay between schedule & run instances

2018-08-09 Thread vardanguptacse



On 2018/08/09 06:27:30, Bolke de Bruin  wrote: 
> Hi vardang,
> 
> What do you intent to gain from this metric? There are many influences that 
> influence a difference between execution date and start date. You named one 
> of them, but there are also functional ones (limits reached etc). We are not 
> a real time system so we never really purposefully aimed for lowering a 
> difference because.
> 
> B.
> 
> Sent from my iPad
> 
> > On 9 Aug 2018, at 08:04, vardangupta...@gmail.com wrote:
> > 
> > 
> > 
> >> On 2018/08/06 07:07:05, vardangupta...@gmail.com 
> >>  wrote: 
> >> Hi Everyone,
> >> 
> >> We just wanted to calculate a metric which can talk about what's the 
> >> delay(if any) between DAG getting active in scheduler & server and then 
> >> tasks of DAG actually getting kicked off (let's suppose start_date was of 
> >> 1 hour earlier and schedule was every 10 minutes).
> >> 
> >> Currently task_instance table has execution_date, start_date, end_date & 
> >> queued_dttm, we can easily get this metric from the difference of 
> >> start_date  & execution_date but in case of back fill, execution_date will 
> >> be of previous schedule occurrence and difference of start_date & 
> >> execution_date will be skewed, though it will be okay for any future runs 
> >> to get the delay in scheduling but for back fills, this number won't be 
> >> trustworthy, any suggestions how to smartly identify this metric, may be 
> >> by knowing somehow back fill details? Even in DAG table, there is no 
> >> create_date & update_date notion which can tell me when this DAG was 
> >> originally brought to existence?
> >> 
> >> 
> >> Regards,
> >> Vardan Gupta
> >> 
> > Can someone look at the issue?
> 
Yes, you're right. Airflow is not meant for scheduling real-time scenarios,
but as a service provider in our organization we wanted to arrive at a figure
before talking to our internal teams, so that we could say, for example, that
at the 95th percentile the scheduling delay will be no more than x minutes.
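
As a rough sketch of how such a figure could be computed from task_instance
rows: a scheduled run for a given execution_date only becomes due once its
schedule interval has elapsed, so the delay is taken here as start_date minus
(execution_date + interval). The sample rows, the 10 minute interval and the
function names are assumptions for illustration, and backfill runs are assumed
to have been filtered out beforehand.

```python
import math
from datetime import datetime, timedelta


def scheduling_delays(rows, interval):
    """Seconds between the moment each run became due and its actual start."""
    delays = []
    for execution_date, start_date in rows:
        due = execution_date + interval  # run is due at the end of its interval
        delays.append((start_date - due).total_seconds())
    return delays


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


interval = timedelta(minutes=10)
base = datetime(2018, 8, 9, 10, 0, 0)
# Hypothetical (execution_date, start_date) pairs from scheduled runs only.
sample_rows = [
    (base + i * interval, base + (i + 1) * interval + timedelta(seconds=lag))
    for i, lag in enumerate([5, 12, 8, 90, 20])
]

delays = scheduling_delays(sample_rows, interval)
p95 = percentile(delays, 95)
print("95th percentile scheduling delay: %.0f seconds" % p95)
```

With only five sample rows the 95th percentile is simply the largest delay;
a meaningful figure would need many more runs.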


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread vardanguptacse



On 2018/08/09 06:29:45, Tao Feng  wrote: 
> +1 on Bolke. I don't think we have such plan. And I believe dag id has been
> indexed already in many tables.
> 
> On Wed, Aug 8, 2018 at 11:22 PM, Bolke de Bruin  wrote:
> 
> > No we don’t have such plan. Dag ids are used to have a readable
> > identifier. If you think it makes such a big difference in speed please
> > show us numbers from Airflow running with and without.
> >
> > Thx
> > B.
> >
> > Sent from my iPad
> >
> > > On 6 Aug 2018, at 08:31, vardangupta...@gmail.com <
> > vardangupta...@gmail.com> wrote:
> > >
> > > Hi Everyone,
> > >
> > > Do we have any plan to change type of dag_id from String to Number, this
> > will make queries on metadata more performant, proposal could be generating
> > an auto-incremental value in dag table and this id getting used in rest of
> > the other tables?
> > >
> > >
> > > Regards,
> > > Vardan Gupta
> >
> 
Sure, I'll post a few numbers soon, thanks for replying!


Re: SubdagOperator and Pools

2018-08-09 Thread Andreas Koeltringer

Hi,

to clarify, I created a Gist with instructions for how to reproduce this 
issue:


https://gist.github.com/akoeltringer/63fcf0340ae219c112b2a5377e6d2715

thanks, regards
Andreas


On 08/09/2018 07:41 AM, Andreas Koeltringer wrote:

Hi Tao,

thanks for your response.

That's just the thing: I am talking about ONE SubdagOperator: the tasks 
within it execute in parallel. That's what confuses me.



Kind regards,
Andreas


On 08/08/2018 06:41 PM, Tao Feng wrote:

Hi Andreas,

The default executor for SubdagOperator is SequentialExecutor which makes
sure all the tasks within subdag are executed in sequential order. But if
you have too many subdags within a single DAG and want to control them with
pooling (https://airflow.apache.org/concepts.html#pools), SubdagOperator
unfortunately doesn't respect pooling
(https://issues.apache.org/jira/browse/AIRFLOW-2371) at the moment. My
understanding is that Airflow uses the backfill scheduler to schedule
SubdagOperator instead of the normal scheduler, and the backfill scheduler has
certain discrepancies with the normal scheduler on pooling support.

Best,
-Tao

On Wed, Aug 8, 2018 at 9:14 AM, Andreas Koeltringer <
andreas.koeltrin...@n-fuse.co> wrote:


Hi,

we have a SubdagOperator with lots of tasks in it. We want to limit the
parallelism, with which these tasks execute. Therefore we created a pool
and added the tasks within the SubdagOperator to this pool.

However, this setting is not respected (see image attached).

Now we are wondering why that is. In 'subdag_operator.py' on the master
branch there is a comment that

 "Airflow pool is not honored by SubDagOperator."

This comment is not in the file in v1.9.0 (which I am using).

So this means that Pools are not respected for Subdags?

On the other hand, it states that Subdags use the SequentialExecutor,
which *should* execute tasks sequentially?

Can anyone clarify this, please?
And if pools do not work, what options do we have to limit parallelism in
a Subdag?

Thanks in advance,
Andreas







Modeling rate limited api calls in airflow

2018-08-09 Thread rob
Hello,

I am in the process of migrating a bespoke data pipeline built around celery 
into airflow.

We have a number of different tasks which interact with the Adwords API which 
has a rate limiting policy. The policy isn't a fixed number of requests; it's 
variable.

In our celery code we have handled this by capturing a rate limit error 
response and setting a key in redis to make sure that no tasks execute against 
the API until it's expired. Any task that does get executed checks for the 
presence of the key and if the key exists issues a retry for when the rate 
limit is due to expire.
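
To make the pattern concrete, here is a minimal sketch of that logic. It is
not our production code: an in-memory object stands in for the Redis key with
an expiry, and the class names, the retry_after attribute and the injectable
clock are all assumptions for illustration.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the API reports a rate limit."""

    def __init__(self, retry_after):
        super().__init__("rate limited for %s seconds" % retry_after)
        self.retry_after = retry_after


class RetryLater(Exception):
    """Signals that the task should be retried after `delay` seconds."""

    def __init__(self, delay):
        super().__init__("retry in %s seconds" % delay)
        self.delay = delay


class RateLimitGate:
    """In-memory stand-in for a shared key with an expiry (Redis in our setup)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._blocked_until = 0.0

    def block_for(self, seconds):
        self._blocked_until = max(self._blocked_until, self._clock() + seconds)

    def seconds_remaining(self):
        return max(0.0, self._blocked_until - self._clock())


def run_task(gate, api_call):
    """Skip the call while the gate is closed; close the gate on a rate limit error."""
    remaining = gate.seconds_remaining()
    if remaining > 0:
        raise RetryLater(remaining)
    try:
        return api_call()
    except RateLimited as err:
        gate.block_for(err.retry_after)
        raise RetryLater(err.retry_after)


def always_limited():
    # Hypothetical API call that always hits the rate limit.
    raise RateLimited(30)


gate = RateLimitGate()
try:
    run_task(gate, always_limited)
except RetryLater as exc:
    print("retry in %.0f seconds" % exc.delay)
```

The first task to hit the limit closes the gate, and every later task sees
the closed gate and asks to be retried without calling the API at all.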

Moving over to Airflow I can't find a way to go about scheduling a task to 
retry in a specific amount of time. Doing some reading it seems a Sensor could 
work to prevent other dags from executing whilst the rate limit is present.

I also can't seem to find an example of handling different exceptions from a 
python task and adapting the retry logic accordingly.

Any pointers would be much appreciated,

Rob


Re: Custom authentication with RBAC

2018-08-09 Thread Ravi Kotecha
Hi Gabriel,

We have extended the auth backend for FAB to support OpenIDConnect here:
https://github.com/ministryofjustice/fab-oidc
and you can see how to configure it in our helm chart.
What auth scheme are you using? Maybe we can upstream the most common ones?


On Wed, Aug 8, 2018 at 10:31 PM Gabriel Silk 
wrote:

> Hello Airflow devs,
>
> It seems that it is not possible to use a custom auth backend with the new
> RBAC web server, like it was with the old.
>
> In the old webserver, you could simply set "webserver.auth_backend" to a
> classname and implement any logic you like.
>
> The absence of this feature is a blocker for adopting RBAC.
>
> Is there any easy fix for this? Is it possible to extend FAB in a similar
> way?
>
> Thanks!
>


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Tao Feng
+1 on Bolke. I don't think we have such a plan. And I believe dag id has
already been indexed in many tables.

On Wed, Aug 8, 2018 at 11:22 PM, Bolke de Bruin  wrote:

> No we don’t have such plan. Dag ids are used to have a readable
> identifier. If you think it makes such a big difference in speed please
> show us numbers from Airflow running with and without.
>
> Thx
> B.
>
> Sent from my iPad
>
> > On 6 Aug 2018, at 08:31, vardangupta...@gmail.com <
> vardangupta...@gmail.com> wrote:
> >
> > Hi Everyone,
> >
> > Do we have any plan to change type of dag_id from String to Number, this
> will make queries on metadata more performant, proposal could be generating
> an auto-incremental value in dag table and this id getting used in rest of
> the other tables?
> >
> >
> > Regards,
> > Vardan Gupta
>


Re: Identifying delay between schedule & run instances

2018-08-09 Thread Bolke de Bruin
Hi vardang,

What do you intend to gain from this metric? Many factors influence the
difference between execution date and start date. You named one of them, but
there are also functional ones (limits reached etc). We are not a real-time
system, so we never really purposefully aimed at lowering that difference.

B.

Sent from my iPad

> On 9 Aug 2018, at 08:04, vardangupta...@gmail.com wrote:
> 
> 
> 
>> On 2018/08/06 07:07:05, vardangupta...@gmail.com  
>> wrote: 
>> Hi Everyone,
>> 
>> We just wanted to calculate a metric which can talk about what's the 
>> delay(if any) between DAG getting active in scheduler & server and then 
>> tasks of DAG actually getting kicked off (let's suppose start_date was of 1 
>> hour earlier and schedule was every 10 minutes).
>> 
>> Currently task_instance table has execution_date, start_date, end_date & 
>> queued_dttm, we can easily get this metric from the difference of start_date 
>>  & execution_date but in case of back fill, execution_date will be of 
>> previous schedule occurrence and difference of start_date & execution_date 
>> will be skewed, though it will be okay for any future runs to get the delay 
>> in scheduling but for back fills, this number won't be trustworthy, any 
>> suggestions how to smartly identify this metric, may be by knowing somehow 
>> back fill details? Even in DAG table, there is no create_date & update_date 
>> notion which can tell me when this DAG was originally brought to existence?
>> 
>> 
>> Regards,
>> Vardan Gupta
>> 
> Can someone look at the issue?


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Bolke de Bruin
No, we don't have such a plan. Dag ids are used to have a readable identifier.
If you think it makes such a big difference in speed, please show us numbers
from Airflow running with and without.

Thx
B.

Sent from my iPad

> On 6 Aug 2018, at 08:31, vardangupta...@gmail.com wrote:
> 
> Hi Everyone,
> 
> Do we have any plan to change type of dag_id from String to Number, this will 
> make queries on metadata more performant, proposal could be generating an 
> auto-incremental value in dag table and this id getting used in rest of the 
> other tables?
> 
> 
> Regards,
> Vardan Gupta


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread vardanguptacse



On 2018/08/06 06:31:58, vardangupta...@gmail.com  
wrote: 
> Hi Everyone,
> 
> Do we have any plan to change type of dag_id from String to Number, this will 
> make queries on metadata more performant, proposal could be generating an 
> auto-incremental value in dag table and this id getting used in rest of the 
> other tables?
> 
> 
> Regards,
> Vardan Gupta
> 

Can someone look at the issue?


Re: Identifying delay between schedule & run instances

2018-08-09 Thread vardanguptacse



On 2018/08/06 07:07:05, vardangupta...@gmail.com  
wrote: 
> Hi Everyone,
> 
> We just wanted to calculate a metric which can talk about what's the delay(if 
> any) between DAG getting active in scheduler & server and then tasks of DAG 
> actually getting kicked off (let's suppose start_date was of 1 hour earlier 
> and schedule was every 10 minutes).
> 
> Currently task_instance table has execution_date, start_date, end_date & 
> queued_dttm, we can easily get this metric from the difference of start_date  
> & execution_date but in case of back fill, execution_date will be of previous 
> schedule occurrence and difference of start_date & execution_date will be 
> skewed, though it will be okay for any future runs to get the delay in 
> scheduling but for back fills, this number won't be trustworthy, any 
> suggestions how to smartly identify this metric, may be by knowing somehow 
> back fill details? Even in DAG table, there is no create_date & update_date 
> notion which can tell me when this DAG was originally brought to existence?
> 
> 
> Regards,
> Vardan Gupta
> 
Can someone look at the issue?