Re: Python 3

2017-01-09 Thread Maycock, Luke
Thank you, Conor! We managed to get this working after looking at the
configuration you provided: we swapped librabbitmq out for pyamqp.


Also, thank you to everyone else who responded to this thread!


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Conor Nash 
Sent: 03 January 2017 14:51
To: dev@airflow.incubator.apache.org
Subject: Re: Python 3

We are using Airflow with Python 3.5 and RabbitMQ (CeleryExecutor). We're
using Celery version 3.1.24. For the BROKER_URL connection string we use
"pyamqp://:@host:5671/vhost" and for CELERY_RESULT_BACKEND we
use "rpc://".

Is Redis the preferred/better supported backend?

Cheers,

Conor Nash
Data Science Consulting
www.conornash.com | @conornash <https://twitter.com/conornash>

On Sun, Jan 1, 2017 at 9:12 PM, George Leslie-Waksman <
geo...@cloverhealth.com.invalid> wrote:

> We are using Airflow with Python3, PostgreSQL and Redis; it works just
> fine.
>
> On Thu, Dec 1, 2016 at 10:30 AM Maycock, Luke <
> luke.mayc...@affiliate.oliverwyman.com> wrote:
>
> > Thanks Dennis. We will perhaps try Redis as there has been a ticket to
> > make librabbitmq Python 3 compatible since 2012.
> >
> >
> > Cheers,
> > Luke Maycock
> > OLIVER WYMAN
> > luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> > www.oliverwyman.com<http://www.oliverwyman.com/>
> >
> >
> >
> > 
> > From: Dennis O'Brien 
> > Sent: 01 December 2016 04:45
> > To: dev@airflow.incubator.apache.org
> > Subject: Re: Python 3
> >
> > Hi Luke,
> >
> > I've been running Airflow on Python 3.5 with Celery.  I'm using Redis as
> > the message broker (and ElastiCache Redis in production).  I haven't
> tried
> > RabbitMQ so I can't speak to its compatibility with Python 3.5.
> >
> > On Mon, Nov 28, 2016 at 9:18 AM Maycock, Luke <
> > luke.mayc...@affiliate.oliverwyman.com> wrote:
> >
> > > I thought it would be worth following up on this given our experience
> of
> > > testing Python 3 compatibility.
> > >
> > >
> > > We are using Celery and RabbitMQ. Celery uses librabbitmq for
> interaction
> > > with RabbitMQ but librabbitmq is not compatible with Python 3 (see
> > > https://github.com/celery/librabbitmq/issues/13). Has anyone else been
> > > able to get Airflow/Python 3/Celery working?
> > >
> > >
> > >
> > > Cheers,
> > > Luke Maycock
> > > OLIVER WYMAN
> > > luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> > > www.oliverwyman.com<http://www.oliverwyman.com/>
> > >
> > >
> > >
> > > 
> > > From: Maycock, Luke 
> > > Sent: 28 November 2016 12:33
> > > To: dev@airflow.incubator.apache.org
> > > Subject: Re: Python 3
> > >
> > > Thank you Dmitriy and Li.
> > >
> > > Luke Maycock
> > > OLIVER WYMAN
> > > luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> > > www.oliverwyman.com<http://www.oliverwyman.com/>
> > >
> > >
> > >
> > > 
> > > From: Li Xuan Ji 
> > > Sent: 28 November 2016 12:20
> > > To: dev@airflow.incubator.apache.org
> > > Subject: Re: Python 3
> > >
> > > Airflow is supposed to be Python 3 compatible and any
> > > incompatibilities are bugs on our side.
> > >
> > > I haven't heard of anyone running airflow on windows, but I'm not sure
> > > what the official stance is.
> > >
> > > On 28 November 2016 at 07:01, Dmitriy Krasnikov <
> dkrasni...@hotmail.com>
> > > wrote:
> > > > I have been using it and developing plugins for it in Python 3.5 with
> > no
> > > problems.
> > > >
> > > >
> > > > -Original Message-
> > > > From: Maycock, Luke [mailto:luke.mayc...@affiliate.oliverwyman.com]
> > > > Sent: Monday, November 28, 2016 6:58 AM
> > > > To: dev@airflow.incubator.apache.org
> > > > Subject: Python 3
> > > >
> > > > Hi All,
> > > >
> > > >
> > > > I have been using Airflow for a while now and have a 

Re: How to abort a DagRun?

2016-12-13 Thread Maycock, Luke
Hi Christian,


I believe you can achieve this by going to 'Browse > Dag Runs'. Find the DAG 
run you wish to stop and click the check box on the left. Next, click 'With 
selected' and set it to either 'success' or 'failed'.


I believe this will prevent any further tasks in the DAG run from running but 
will let the currently running tasks complete.


If you wish to restart the DAG run from where it was stopped, just do the same 
again but set the state to 'running'.
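For reference, the same state change can be made directly against the
metadata database; a minimal ORM sketch, assuming a 'my_dag' DAG and a known
execution date:

from datetime import datetime

from airflow import settings
from airflow.models import DagRun
from airflow.utils.state import State

session = settings.Session()
run = session.query(DagRun).filter(
    DagRun.dag_id == 'my_dag',                       # assumed DAG id
    DagRun.execution_date == datetime(2016, 12, 13)  # run to stop
).one()
run.state = State.FAILED  # or State.SUCCESS; State.RUNNING restarts it
session.commit()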


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: Christian Trebing 
Sent: 13 December 2016 11:58
To: dev@airflow.incubator.apache.org
Subject: How to abort a DagRun?

Hi,

We sometimes have the situation that a DagRun is started but then needs to
be aborted. Our current way of handling that situation is to stop Airflow,
reinitialize the database and restart Airflow. Is there a better way to
handle this in the Airflow UI without erasing the history of DagRuns?

Regards,
 Christian




Re: Skip task

2016-12-12 Thread Maycock, Luke
Excellent - thanks for your help Bolke, much appreciated!


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Bolke de Bruin 
Sent: 12 December 2016 10:40
To: dev@airflow.incubator.apache.org
Subject: Re: Skip task

Have a look at:
https://github.com/apache/incubator-airflow/blob/master/airflow/migrations/versions/4addfa1236f1_add_fractional_seconds_to_mysql_tables.py

Make sure to include "type_=mysql.DATETIME(fsp=6)" for your DateTime types on
MySQL.
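For an existing table, a minimal Alembic sketch of that fix, assuming the
task_exclusion column names from Luke's migration:

from alembic import op
from sqlalchemy.dialects import mysql


def upgrade():
    conn = op.get_bind()
    if conn.dialect.name == 'mysql':
        # fsp=6 keeps microseconds instead of rounding to whole seconds
        op.alter_column(table_name='task_exclusion',
                        column_name='exclusion_start_date',
                        type_=mysql.DATETIME(fsp=6))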

- Bolke



> On 12 Dec 2016, at 11:33, Maycock, Luke wrote:
>
> It is a new table named 'TaskExclusion'. The migration script for this is as 
> follows:
>
> def upgrade():
>     op.create_table(
>         'task_exclusion',
>         sa.Column('id', sa.Integer(), nullable=False),
>         sa.Column('dag_id', sa.String(length=250), nullable=False),
>         sa.Column('task_id', sa.String(length=250), nullable=False),
>         sa.Column('exclusion_type', sa.String(length=32), nullable=False),
>         sa.Column('exclusion_start_date', sa.DateTime(), nullable=False),
>         sa.Column('exclusion_end_date', sa.DateTime(), nullable=False),
>         sa.Column('created_by', sa.String(length=256), nullable=False),
>         sa.Column('created_on', sa.DateTime(), nullable=False),
>         sa.PrimaryKeyConstraint('id'))
>
>
> def downgrade():
>     op.drop_table('task_exclusion')
>
> This is the PR for the exclusion of a task. We review our code internally 
> before setting up a PR into the main repo for the next review, hence the PR 
> being in our fork. The PR does not yet contain our unit tests.
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
>
> 
> From: Bolke de Bruin 
> Sent: 09 December 2016 20:54
> To: dev@airflow.incubator.apache.org
> Subject: Re: Skip task
>
> What table was this? I recently pushed a fix that allows fractional seconds 
> in our minimum supported version of MySQL (5.6.4 and beyond).
>
> I might have missed something.
>
> Thanks
> Bolke
>
> Sent from my iPhone
>
>> On 9 Dec 2016, at 14:27, Maycock, Luke 
>>  wrote:
>>
>> I found the issue to be that, for MySQL, the datetime was being rounded to 
>> the nearest second. The strange thing is that if a datetime without the 
>> microseconds was passed to SQLAlchemy, the insertion into MySQL failed; but 
>> when a datetime with microseconds was passed, the microseconds are removed 
>> by rounding to the nearest second.
>>
>>
>> Hopefully, this will prevent someone else going down the same rabbit hole 
>> that I did.
>>
>>
>> Cheers,
>> Luke Maycock
>> OLIVER WYMAN
>> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
>> www.oliverwyman.com<http://www.oliverwyman.com/>
>>
>>
>> 
>> From: Maycock, Luke 
>> Sent: 08 December 2016 10:44:32
>> To: dev@airflow.incubator.apache.org
>> Subject: Re: Skip task
>>
>> Hi All,
>>
>>
>> We have implemented a solution for allowing the exclusion of individual 
>> tasks during a DAG run. However, when writing unit tests for this, we are 
>> encountering an issue with MySQL, which I am hoping someone is able to help 
>> us with.
>>
>>
>> For our solution, we have a new 'TaskExclusion' table in the meta-data. Our 
>> unit tests were run by Travis, not locally.
>>
>>
>> The code block under test:
>>
>>
>> class TaskExclusion(Base):
>>     """
>>     This class is used to define objects that can be used to specify not to
>>     run a given task in a given dag on a variety of execution date conditions.
>>     These objects will be stored in the backend database in the task_exclusion
>>     table.
>>     Static methods are provided for the creation, removal and investigation of
>>     these objects.
>>     """
>>
>>     __tablename__ = "task_exclusion"
>>
>>     id = Column(Integer(), primary_key=True)
>>     dag_id = Column(String(ID_LEN), nullable=

Re: Skip task

2016-12-12 Thread Maycock, Luke
It is a new table named 'TaskExclusion'. The migration script for this is as 
follows:

def upgrade():
    op.create_table(
        'task_exclusion',
        sa.Column('id', sa.Integer(), nullable=False),
        sa.Column('dag_id', sa.String(length=250), nullable=False),
        sa.Column('task_id', sa.String(length=250), nullable=False),
        sa.Column('exclusion_type', sa.String(length=32), nullable=False),
        sa.Column('exclusion_start_date', sa.DateTime(), nullable=False),
        sa.Column('exclusion_end_date', sa.DateTime(), nullable=False),
        sa.Column('created_by', sa.String(length=256), nullable=False),
        sa.Column('created_on', sa.DateTime(), nullable=False),
        sa.PrimaryKeyConstraint('id'))


def downgrade():
    op.drop_table('task_exclusion')

This is the PR for the exclusion of a task. We review our code internally 
before setting up a PR into the main repo for the next review, hence the PR 
being in our fork. The PR does not yet contain our unit tests.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Bolke de Bruin 
Sent: 09 December 2016 20:54
To: dev@airflow.incubator.apache.org
Subject: Re: Skip task

What table was this? I recently pushed a fix that allows fractional seconds in 
our minimum supported version of MySQL (5.6.4 and beyond).

I might have missed something.

Thanks
Bolke

Sent from my iPhone

> On 9 Dec 2016, at 14:27, Maycock, Luke 
>  wrote:
>
> I found the issue to be that, for MySQL, the datetime was being rounded to 
> the nearest second. The strange thing is that if a datetime without the 
> microseconds was passed to SQLAlchemy, the insertion into MySQL failed; but 
> when a datetime with microseconds was passed, the microseconds are removed by 
> rounding to the nearest second.
>
>
> Hopefully, this will prevent someone else going down the same rabbit hole 
> that I did.
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
> 
> From: Maycock, Luke 
> Sent: 08 December 2016 10:44:32
> To: dev@airflow.incubator.apache.org
> Subject: Re: Skip task
>
> Hi All,
>
>
> We have implemented a solution for allowing the exclusion of individual tasks 
> during a DAG run. However, when writing unit tests for this, we are 
> encountering an issue with MySQL, which I am hoping someone is able to help 
> us with.
>
>
> For our solution, we have a new 'TaskExclusion' table in the meta-data. Our 
> unit tests were run by Travis, not locally.
>
>
> The code block under test:
>
>
> class TaskExclusion(Base):
>     """
>     This class is used to define objects that can be used to specify not to
>     run a given task in a given dag on a variety of execution date conditions.
>     These objects will be stored in the backend database in the task_exclusion
>     table.
>     Static methods are provided for the creation, removal and investigation of
>     these objects.
>     """
>
>     __tablename__ = "task_exclusion"
>
>     id = Column(Integer(), primary_key=True)
>     dag_id = Column(String(ID_LEN), nullable=False)
>     task_id = Column(String(ID_LEN), nullable=False)
>     exclusion_type = Column(String(32), nullable=False)
>     exclusion_start_date = Column(DateTime, nullable=True)
>     exclusion_end_date = Column(DateTime, nullable=True)
>     created_by = Column(String(256), nullable=False)
>     created_on = Column(DateTime, nullable=False)
>
>     @classmethod
>     @provide_session
>     def set(
>             cls,
>             dag_id,
>             task_id,
>             exclusion_type,
>             exclusion_start_date,
>             exclusion_end_date,
>             created_by,
>             session=None):
>         """
>         Add a task exclusion to prevent a task running under certain
>         circumstances.
>         :param dag_id: The dag_id of the DAG containing the task to exclude
>         from execution.
>         :param task_id: The task_id of the task to exclude from execution.
>         :param exclusion_type: The type of circumstances to exclude the task
>         from execution under. See the TaskExclusionType class for more detail.
>         :param exclusion_start_date: The execution_date to start excluding on.
>         This will be ignored if the exclusion_type is INDEFINITE.
>         :param exclusion_end_date: The execution_date to stop excluding on.
>         This will be ignored if the ex

Re: Skip task

2016-12-09 Thread Maycock, Luke
I found the issue to be that, for MySQL, the datetime was being rounded to the 
nearest second. The strange thing is that if a datetime without the 
microseconds was passed to SQLAlchemy, the insertion into MySQL failed; but 
when a datetime with microseconds was passed, the microseconds are removed by 
rounding to the nearest second.


Hopefully, this will prevent someone else going down the same rabbit hole that 
I did.
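For reference, a minimal SQLAlchemy (pre-2.0 API) sketch of the pitfall, with
a placeholder MySQL URL and a throwaway table:

from datetime import datetime
import sqlalchemy as sa

engine = sa.create_engine('mysql://user:password@localhost/airflow_test')
meta = sa.MetaData()
demo = sa.Table('demo', meta, sa.Column('ts', sa.DateTime()))
meta.create_all(engine)

now = datetime.now()                          # carries microseconds
engine.execute(demo.insert().values(ts=now))
stored = engine.execute(sa.select([demo.c.ts])).scalar()

# Without fsp=6 on the column, MySQL rounds to the nearest second, so an
# exact-match query on `now` finds nothing.
print(now, stored, now == stored)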


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>


____
From: Maycock, Luke 
Sent: 08 December 2016 10:44:32
To: dev@airflow.incubator.apache.org
Subject: Re: Skip task

Hi All,


We have implemented a solution for allowing the exclusion of individual tasks 
during a DAG run. However, when writing unit tests for this, we are 
encountering an issue with MySQL, which I am hoping someone is able to help us 
with.


For our solution, we have a new 'TaskExclusion' table in the meta-data. Our 
unit tests were run by Travis, not locally.


The code block under test:


class TaskExclusion(Base):
    """
    This class is used to define objects that can be used to specify not to
    run a given task in a given dag on a variety of execution date conditions.
    These objects will be stored in the backend database in the task_exclusion
    table.
    Static methods are provided for the creation, removal and investigation of
    these objects.
    """

    __tablename__ = "task_exclusion"

    id = Column(Integer(), primary_key=True)
    dag_id = Column(String(ID_LEN), nullable=False)
    task_id = Column(String(ID_LEN), nullable=False)
    exclusion_type = Column(String(32), nullable=False)
    exclusion_start_date = Column(DateTime, nullable=True)
    exclusion_end_date = Column(DateTime, nullable=True)
    created_by = Column(String(256), nullable=False)
    created_on = Column(DateTime, nullable=False)

    @classmethod
    @provide_session
    def set(
            cls,
            dag_id,
            task_id,
            exclusion_type,
            exclusion_start_date,
            exclusion_end_date,
            created_by,
            session=None):
        """
        Add a task exclusion to prevent a task running under certain
        circumstances.
        :param dag_id: The dag_id of the DAG containing the task to exclude
        from execution.
        :param task_id: The task_id of the task to exclude from execution.
        :param exclusion_type: The type of circumstances to exclude the task
        from execution under. See the TaskExclusionType class for more detail.
        :param exclusion_start_date: The execution_date to start excluding on.
        This will be ignored if the exclusion_type is INDEFINITE.
        :param exclusion_end_date: The execution_date to stop excluding on.
        This will be ignored if the exclusion_type is INDEFINITE or
        SINGLE_DATE.
        :param created_by: Who is creating this exclusion. Stored with the
        exclusion record for auditing/debugging purposes.
        :return: None.
        """

        session.expunge_all()

        # Set up execution date range correctly
        if exclusion_type == TaskExclusionType.SINGLE_DATE:
            if exclusion_start_date:
                exclusion_end_date = exclusion_start_date
            else:
                raise AirflowException(
                    "No exclusion_start_date "
                )
        elif exclusion_type == TaskExclusionType.DATE_RANGE:
            if exclusion_start_date > exclusion_end_date:
                raise AirflowException(
                    "The exclusion_start_date is after the exclusion_end_date"
                )
        elif exclusion_type == TaskExclusionType.INDEFINITE:
            exclusion_start_date = None
            exclusion_end_date = None
        else:
            raise AirflowException(
                "The exclusion_type, {}, is not recognised."
                .format(exclusion_type)
            )

        # remove any duplicate exclusions
        session.query(cls).filter(
            cls.dag_id == dag_id,
            cls.task_id == task_id,
            cls.exclusion_type == exclusion_type,
            cls.exclusion_start_date == exclusion_start_date,
            cls.exclusion_end_date == exclusion_end_date
        ).delete()

        # insert new exclusion
        session.add(TaskExclusion(
            dag_id=dag_id,
            task_id=task_id,
            exclusion_type=exclusion_type,
            exclusion_start_date=exclusion_start_date,
            exclusion_end_date=exclusion_end_date,
            created_by=created_by,
            created_on=datetime.now())
        )

        session.commit()


The unit test:

class TaskExclusionTest(unittest.TestCase):
    def test_set_exclusion(self, session=None):

        session = settings.Session()

        session.expunge_all()

        dag_id = 'test_task_exclude'
        task_id = 'test_task_exclude'
        exec_date = datetime.datetime.now()

        TaskExclusion.set(dag_id=dag_id,

Re: Skip task

2016-12-08 Thread Maycock, Luke
self.assertTrue(exclusion)


The unit test passes for PostgreSQL and SQLite but fails for MySQL. I have
checked, and the 'exclusion' variable contains a TaskExclusion object for
PostgreSQL and SQLite but is set to 'None' for MySQL. Any suggestions on what
could be causing this would be much appreciated.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: siddharth anand 
Sent: 16 November 2016 00:40
To: dev@airflow.incubator.apache.org
Subject: Re: Skip task

If your requirement is to skip a portion of tasks in a DagRun based on some
state encountered while executing that DagRun, that is what
BranchPythonOperator or ShortCircuitOperator (optionally paired with a
trigger rule specified on a downstream task) is made for.

These operators take a custom Python callable as an argument. The callable
can check for the existence of data or files that should have been
generated by an external system or an upstream task in the same DAG. The
callables need to return a Boolean value in the case of the
ShortCircuitOperator, or a selected choice (i.e. the branch to take) in the
case of the BranchPythonOperator.

If you have 20 tasks that all depend on the presence of 20 different files,
you would need 20 ShortCircuitOperator or BranchPythonOperator tasks, each
either sharing a common callable or each with its own callable.

One could argue that these tasks are "overhead" because they just encompass
some conditional or control logic and that DAGs should only contain
workhorse tasks (i.e. tasks that do some work). DAGs with workhorse-only
tasks are more of a pure dataflow approach -- i.e. no control-logic
operators. However, I don't see another option.

In the current system, a callable registered with a ShortCircuitOperator
would check for the presence of a file -- if the file were not available,
then a series of downstream tasks would be skipped in that DagRun, until a
task with trigger_rule="all_done" were encountered, downstream of which
tasks would no longer be skipped for the DagRun.
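A minimal sketch of that pattern, assuming a daily DAG and a hypothetical
/data/incoming/daily.csv file:

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import ShortCircuitOperator


def file_has_arrived():
    return os.path.exists('/data/incoming/daily.csv')

dag = DAG('load_daily_file', start_date=datetime(2016, 11, 1),
          schedule_interval='@daily')

check = ShortCircuitOperator(task_id='check_file',
                             python_callable=file_has_arrived, dag=dag)
load = BashOperator(task_id='load_file',
                    bash_command='echo loading', dag=dag)
# trigger_rule='all_done' lets processing resume even when 'load_file'
# was skipped because the file never arrived
report = BashOperator(task_id='report', bash_command='echo done',
                      trigger_rule='all_done', dag=dag)

check.set_downstream(load)
load.set_downstream(report)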

I hope this makes sense.

A long time ago, I proposed UI functionality to skip a series of DAG runs
via the UI, because I knew that no data was available for that time range
from an external system. I wanted to essentially specify a "blackout"
period in terms of a time range that covered multiple DagRuns. My intention
was for backfills to skip those days. It turns out that my company did not
end up having such a requirement, so I dropped the feature request.

If this is what you are asking for, then I am +1. Please implement it and
submit a PR.

On Tue, Nov 15, 2016 at 2:50 AM, Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> Thank you for taking the time to respond. This is a great approach if you
> know at the time of creating the DAG which tasks you expect to need to
> skip. However, I don't think this is exactly the use case I have. For
> example, I may be expecting a file to arrive in an FTP folder for loading
> into a database but one day it doesn't arrive so I just want to skip that
> task on that day.
>
>
> Our workflows commonly have around 20 of these types of tasks in. I could
> configure all of these tasks in the way you suggested in case I ever need
> to skip one of them. However, I'd prefer not to have to set the tasks up
> this way and instead have the ability just to skip a task on an ad-hoc
> basis. I could then also use this functionality to add the ability to run
> from a certain point in a DAG or to a certain point in the DAG.
>
>
>
> Thanks,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.
> mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
>
> 
> From: siddharth anand 
> Sent: 14 November 2016 19:48
> To: dev@airflow.incubator.apache.org
> Subject: Re: Skip task
>
> For cases like this, we (Agari) use the following approach :
>
>   1. Create a Variable in the UI of type boolean such as *enable_feature_x*
>   2. Use a ShortCircuitOperator (or BranchPythonOperator) to Skip
>   downstream processing based on the value of *enable_feature_x*
>   3. Assuming that you don't want to skip ALL downstream tasks, you can
>   use a trigger_rule of all_done to resume processing some portion of your
>   downstream DAG after skipping an upstream portion
>
> In other words, there is already a means to achieve what you are asking for
> today. You can change the value of *enable_feature_x* via the UI. If you'd
> like to enhance the UI to better capture this pattern, pls submit a PR.
> -s
>
> On Thu, Nov 10, 201

Logging Improvements

2016-12-06 Thread Maycock, Luke
Hi All,


We have carried out some work around logging. Highlights:

 *   Removed altering of the root logger as this would affect logging for any 
parent processes also using this root logger object as detailed in 
https://issues.apache.org/jira/browse/AIRFLOW-409
 *   Replaced use of the root logger with context-specific logger objects - 
these inherit from a base 'Airflow' logger object, allowing for centralised 
control of the logging configuration (see the sketch after this list).
 *   Added a logging controller that can be used to configure the 'Airflow' 
base logger, which will propagate throughout Airflow.
 *   Added Airflow configuration options for configuring the 'Airflow' base 
logger.
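A minimal sketch of the logger hierarchy this describes, using the stdlib
logging module (the logger names here are assumptions):

import logging

# in module code: a context-specific child logger
log = logging.getLogger('Airflow.jobs.SchedulerJob')

# central configuration on the base 'Airflow' logger
base = logging.getLogger('Airflow')
base.setLevel(logging.INFO)
base.addHandler(logging.StreamHandler())

log.info('propagates up to the handlers on the base logger')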

The PR is here:
https://github.com/apache/incubator-airflow/pull/1921

Landscape failed to run for this PR (maybe it is too large). However, we are 
linting locally so I would expect there to be no issues.

If people feel this is an improvement to the existing logging, it would be 
great to get this merged as soon as possible so we don't have to keep changing 
new pieces of logging in Airflow to fit this new model.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com





Re: API additions

2016-12-05 Thread Maycock, Luke
Hi Max,


Thank you for this information. I had not come across the
TriggerDagRunOperator, so thank you for pointing it out - it looks like a
useful operator. We are planning to interact with Airflow from non-Python
apps, so we believe the API is necessary.


Cheers,
Luke


From: Maxime Beauchemin 
Sent: 05 December 2016 16:28
To: dev@airflow.incubator.apache.org
Subject: Re: API additions

Hi Oliver,

Have you looked into:
https://airflow.incubator.apache.org/code.html#airflow.operators.TriggerDagRunOperator

The idea here is to have a specific high-frequency simple DAG with this
operator (with a schedule_interval of every 5 minutes or so) that triggers
another non-scheduled (schedule_interval=None) DAG when the criteria are
met. Your `every_five_minutes` DAG can be reused for anything that follows
that pattern.

Also note that DagRuns have a payload field that should be easy to access
(at least from PythonOperator and templates) in which you can make
parameters available for tasks in your DagRun (similar to XCom).
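A minimal sketch of that polling pattern; the DAG ids and the criteria_met()
check are assumptions, and the callable receives (context, dag_run_obj) as
described above:

from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator


def criteria_met():
    return False  # placeholder for the real check


def maybe_trigger(context, dag_run_obj):
    if criteria_met():
        dag_run_obj.payload = {'source': 'every_five_minutes'}
        return dag_run_obj  # returning the object fires the target DAG
    # returning None means no trigger this interval

poller = DAG('every_five_minutes', start_date=datetime(2016, 12, 1),
             schedule_interval='*/5 * * * *')

TriggerDagRunOperator(task_id='check_and_trigger',
                      trigger_dag_id='target_dag',
                      python_callable=maybe_trigger,
                      dag=poller)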

It's great to have the API, and vital if the triggering app isn't written
in Python, but you can also use the ORM directly `from airflow.models
import DagRun`, of course there are pros/cons to that approach.

Max

On Mon, Dec 5, 2016 at 6:31 AM, Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> Hello,
>
> We've had a requirement for Airflow to be able to start DAG Runs and check
> on the state of tasks in a specific DAG Run via API. We'd seen that an
> experimental API had been put in place and expanded on this, but there has
> since been a significant overhaul, mostly around security and the setup in
> this PR: https://github.com/apache/incubator-airflow/pull/1783
>
> Before we rework our additions (https://github.com/owlabs/
> incubator-airflow/pull/11, https://github.com/owlabs/
> incubator-airflow/pull/11, https://github.com/owlabs/
> incubator-airflow/pull/14) , we thought it might be worth asking if
> anybody had any thoughts about what we've done so far and whether we're
> treading on anybody else's toes with these features.
>
> The endpoints that we've written are:
>
> - Get details of a task instance
> - Create a DAG Run (this is somewhat covered by the trigger_dag endpoint
> that has now been added, but we also have a version to allow you to specify
> the execution date)
> - Add a value to xcom for a task instance (initially required the task to
> exist, but we had a use to be able to add xcom for tasks that didn't exist,
> so we removed that restriction)
>
> Is anybody else working on these endpoints or any other significant
> changes to this area that we'd do better to wait for before we rework these
> for PR?
>
> Thanks,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.
> mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
> 
>




Re: API additions

2016-12-05 Thread Maycock, Luke
Hi Bolke,


Thanks for the quick and detailed feedback! We will get on with the necessary 
changes soon.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Bolke de Bruin 
Sent: 05 December 2016 15:03
To: dev@airflow.incubator.apache.org
Subject: Re: API additions

Hey Luke,

That's great news, and excellent that you are building on the work of the PR!
The functionality you are proposing seems reasonable; obviously not
replicating too much would be smart, e.g. extend trigger_dag rather than
introduce a new function that does almost the same.

A couple of guidelines that I had in mind:

- The API resides in airflow/api/common. This means endpoints (code defined in 
an endpoint) cannot access the database directly. It should always import from 
airflow.api.common. This decouples the API definition from its server side 
implementation.
- Client side API must be implemented in airflow.api.client.XXX (JSON + Local). 
While for now this is not DRY it allows for a smoother migration path away from 
direct DB access in the CLI.

A quick glance over the code that was written by you:

- /createdagrun/dag/<dag_id>/executiondate/<execution_date> —> this is
indeed trigger_dag, with an extra field in the JSON
- /taskstate/dag/<dag_id>/task/<task_id>/executiondate/<execution_date> —>
already implemented as task_info; might need extending
- make sure to use nouns and proper HTTP actions (PUT, GET, POST). For some
best practices have a look at
http://www.vinaysahni.com/best-practices-for-a-pragmatic-restful-api

In addition, considering the work Max is doing with Flask Application Builder I 
expect the “endpoints” to change in the future. That makes it even more 
important to make sure your DB access code resides in airflow.api.common. Also 
some rework might be required again, it is marked experimental after all :-). 
Still, I would like to invite you to implement it as we need the drive of the 
community to point us where we need to go with this.

Cheers
Bolke



> On 5 Dec 2016, at 14:31, Maycock, Luke wrote:
>
> Hello,
>
> We've had a requirement for Airflow to be able to start DAG Runs and check on 
> the state of tasks in a specific DAG Run via API. We'd seen that an 
> experimental API had been put in place and expanded on this, but there has 
> since been a significant overhaul, mostly around security and the setup in 
> this PR: https://github.com/apache/incubator-airflow/pull/1783
>
> Before we rework our additions 
> (https://github.com/owlabs/incubator-airflow/pull/11, 
> https://github.com/owlabs/incubator-airflow/pull/11, 
> https://github.com/owlabs/incubator-airflow/pull/14) , we thought it might be 
> worth asking if anybody had any thoughts about what we've done so far and 
> whether we're treading on anybody else's toes with these features.
>
> The endpoints that we've written are:
>
> - Get details of a task instance
> - Create a DAG Run (this is somewhat covered by the trigger_dag endpoint that 
> has now been added, but we also have a version to allow you to specify the 
> execution date)
> - Add a value to xcom for a task instance (initially required the task to 
> exist, but we had a use to be able to add xcom for tasks that didn't exist, 
> so we removed that restriction)
>
> Is anybody else working on these endpoints or any other significant changes 
> to this area that we'd do better to wait for before we rework these for PR?
>
> Thanks,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
> 





Re: Travis build failure

2016-12-05 Thread Maycock, Luke
Great news. Thanks Bolke.

Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Emanuele Palese 
Sent: 05 December 2016 16:05
To: dev@airflow.incubator.apache.org
Subject: Re: Travis build failure

It's now working:
https://travis-ci.org/epalese/incubator-airflow/builds/181378555
I've also marked the issue https://issues.apache.org/jira/browse/AIRFLOW-671
as resolved.

Thanks Bolke.

2016-12-05 15:21 GMT+00:00 Bolke de Bruin :

> I just pushed a fix for this. Travis has changed the way it makes the
> jdk available. PRs need a force push or empty commit to trigger a new build.
>
> - Bolke
>
> > On 5 Dec 2016, at 14:38, Emanuele Palese wrote:
> >
> > Hey,
> > I am experiencing the same issue. I've opened a ticket too:
> > https://issues.apache.org/jira/browse/AIRFLOW-671
> > I've tried to download minicluster manually from
> > https://github.com/bolkedebruin/minicluster/releases/download/1.1/
> minicluster-1.1-SNAPSHOT-bin.zip
> > and I have the same issue that is beeline can't connect to the local Hive
> > server.
> >
> > 2016-12-05 14:22 GMT+00:00 Bolke de Bruin :
> >
> >> the java binary seems to be missing from the path for some reason. I'll
> >> have a look to see if I can fix that.
> >>
> >>> Op 5 dec. 2016, om 14:20 heeft Maycock, Luke  >> oliverwyman.com> het volgende geschreven:
> >>>
> >>> I have only noticed this. All recent PRs have this error. At a glance,
> I
> >> could not see any individual commit that caused this so maybe this is an
> >> environment issue. I have not really spent any time looking into the
> issue
> >> yet but if I find a solution, I will let you know.
> >>>
> >>>
> >>> Cheers,
> >>>
> >>> Luke Maycock
> >>> OLIVER WYMAN
> >>> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.
> >> mayc...@affiliate.oliverwyman.com>
> >>> www.oliverwyman.com<http://www.oliverwyman.com/>
> >>>
> >>>
> >>>
> >>> 
> >>> From: Vijay Bhat 
> >>> Sent: 05 December 2016 13:46
> >>> To: dev@airflow.incubator.apache.org
> >>> Subject: Travis build failure
> >>>
> >>> I'm getting the following travis build error - it's unable to connect
> to
> >>> Hive.
> >>>
> >>> https://travis-ci.org/apache/incubator-airflow/builds/181336068
> >>>
> >>> Anybody else seeing this or have ideas on why it's happening?
> >>>
> >>> INFO  [root] Connecting to jdbc:hive2://localhost:1/default
> >>> INFO  [root] HS2 may be unavailable, check server status
> >>> INFO  [root] Error: Could not open client transport with JDBC Uri:
> >>> jdbc:hive2://localhost:1/default: java.net.ConnectException:
> >>> Connection refused (Connection refused) (state=08S01,code=0)
> >>>
> >>> Thanks,
> >>>
> >>> Vijay
> >>>
> >>> 
> >>
> >>
>
>




API additions

2016-12-05 Thread Maycock, Luke
Hello,

We've had a requirement for Airflow to be able to start DAG Runs and check on 
the state of tasks in a specific DAG Run via API. We'd seen that an 
experimental API had been put in place and expanded on this, but there has 
since been a significant overhaul, mostly around security and the setup in this 
PR: https://github.com/apache/incubator-airflow/pull/1783

Before we rework our additions 
(https://github.com/owlabs/incubator-airflow/pull/11, 
https://github.com/owlabs/incubator-airflow/pull/11, 
https://github.com/owlabs/incubator-airflow/pull/14), we thought it might be 
worth asking if anybody had any thoughts about what we've done so far and 
whether we're treading on anybody else's toes with these features.

The endpoints that we've written are:

- Get details of a task instance
- Create a DAG Run (this is somewhat covered by the trigger_dag endpoint that 
has now been added, but we also have a version to allow you to specify the 
execution date)
- Add a value to xcom for a task instance (initially required the task to 
exist, but we had a use to be able to add xcom for tasks that didn't exist, so 
we removed that restriction)

Is anybody else working on these endpoints or any other significant changes to 
this area that we'd do better to wait for before we rework these for PR?

Thanks,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com





Re: Travis build failure

2016-12-05 Thread Maycock, Luke
I have also noticed this. All recent PRs have this error. At a glance, I could
not see any individual commit that caused it, so maybe this is an environment
issue. I have not really spent any time looking into the issue yet, but if I
find a solution, I will let you know.


Cheers,

Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: Vijay Bhat 
Sent: 05 December 2016 13:46
To: dev@airflow.incubator.apache.org
Subject: Travis build failure

I'm getting the following travis build error - it's unable to connect to
Hive.

https://travis-ci.org/apache/incubator-airflow/builds/181336068

Anybody else seeing this or have ideas on why it's happening?

INFO  [root] Connecting to jdbc:hive2://localhost:1/default
INFO  [root] HS2 may be unavailable, check server status
INFO  [root] Error: Could not open client transport with JDBC Uri:
jdbc:hive2://localhost:1/default: java.net.ConnectException:
Connection refused (Connection refused) (state=08S01,code=0)

Thanks,

Vijay




Re: Python 3

2016-12-01 Thread Maycock, Luke
Thanks Dennis. We will perhaps try Redis as there has been a ticket to make 
librabbitmq Python 3 compatible since 2012.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Dennis O'Brien 
Sent: 01 December 2016 04:45
To: dev@airflow.incubator.apache.org
Subject: Re: Python 3

Hi Luke,

I've been running Airflow on Python 3.5 with Celery.  I'm using Redis as
the message broker (and ElastiCache Redis in production).  I haven't tried
RabbitMQ so I can't speak to its compatibility with Python 3.5.

On Mon, Nov 28, 2016 at 9:18 AM Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> I thought it would be worth following up on this given our experience of
> testing Python 3 compatibility.
>
>
> We are using Celery and RabbitMQ. Celery uses librabbitmq for interaction
> with RabbitMQ but librabbitmq is not compatible with Python 3 (see
> https://github.com/celery/librabbitmq/issues/13). Has anyone else been
> able to get Airflow/Python 3/Celery working?
>
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
>
> 
> From: Maycock, Luke 
> Sent: 28 November 2016 12:33
> To: dev@airflow.incubator.apache.org
> Subject: Re: Python 3
>
> Thank you Dmitriy and Li.
>
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
>
> 
> From: Li Xuan Ji 
> Sent: 28 November 2016 12:20
> To: dev@airflow.incubator.apache.org
> Subject: Re: Python 3
>
> Airflow is supposed to be Python 3 compatible and any
> incompatibilities are bugs on our side.
>
> I haven't heard of anyone running airflow on windows, but I'm not sure
> what the official stance is.
>
> On 28 November 2016 at 07:01, Dmitriy Krasnikov 
> wrote:
> > I have been using it and developing plugins for it in Python 3.5 with no
> problems.
> >
> >
> > -Original Message-
> > From: Maycock, Luke [mailto:luke.mayc...@affiliate.oliverwyman.com]
> > Sent: Monday, November 28, 2016 6:58 AM
> > To: dev@airflow.incubator.apache.org
> > Subject: Python 3
> >
> > Hi All,
> >
> >
> > I have been using Airflow for a while now and have a couple of questions
> that I am hoping someone knows the answer to:
> >
> >  1. Is Airflow now Python 3 compatible? The documentation used to state
> that Airflow was only compatible with Python 2.7 (
> https://web.archive.org/web/20160502024539/http://pythonhosted.org/airflow/start.html)
> but that statement has since been removed.
> >  2.  I know Gunicorn does not run on Windows - is this the only
> component of Airflow that does not run on Windows?
> >
> >
> > Cheers,
> > Luke Maycock
> > OLIVER WYMAN
> > luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> > www.oliverwyman.com<http://www.oliverwyman.com/>
> >
> >
> > 
>
>
>
> --
> Im Xuan Ji!
>
> 
>
> 

Re: Python 3

2016-11-28 Thread Maycock, Luke
I thought it would be worth following up on this given our experience of 
testing Python 3 compatibility.


We are using Celery and RabbitMQ. Celery uses librabbitmq for interaction with 
RabbitMQ but librabbitmq is not compatible with Python 3 (see 
https://github.com/celery/librabbitmq/issues/13). Has anyone else been able to 
get Airflow/Python 3/Celery working?



Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>



____
From: Maycock, Luke 
Sent: 28 November 2016 12:33
To: dev@airflow.incubator.apache.org
Subject: Re: Python 3

Thank you Dmitriy and Li.

Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Li Xuan Ji 
Sent: 28 November 2016 12:20
To: dev@airflow.incubator.apache.org
Subject: Re: Python 3

Airflow is supposed to be Python 3 compatible and any
incompatibilities are bugs on our side.

I haven't heard of anyone running airflow on windows, but I'm not sure
what the official stance is.

On 28 November 2016 at 07:01, Dmitriy Krasnikov  wrote:
> I have been using it and developing plugins for it in Python 3.5 with no 
> problems.
>
>
> -Original Message-
> From: Maycock, Luke [mailto:luke.mayc...@affiliate.oliverwyman.com]
> Sent: Monday, November 28, 2016 6:58 AM
> To: dev@airflow.incubator.apache.org
> Subject: Python 3
>
> Hi All,
>
>
> I have been using Airflow for a while now and have a couple of questions that 
> I am hoping someone knows the answer to:
>
>  1. Is Airflow now Python 3 compatible? The documentation used to state that 
> Airflow was only compatible with Python 2.7 
> (https://web.archive.org/web/20160502024539/http://pythonhosted.org/airflow/start.html)
> but that statement has since been removed.
>  2.  I know Gunicorn does not run on Windows - is this the only component of 
> Airflow that does not run on Windows?
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
> 



--
Im Xuan Ji!






Re: Python 3

2016-11-28 Thread Maycock, Luke
Thank you Dmitriy and Li.

Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
www.oliverwyman.com<http://www.oliverwyman.com/>




From: Li Xuan Ji 
Sent: 28 November 2016 12:20
To: dev@airflow.incubator.apache.org
Subject: Re: Python 3

Airflow is supposed to be Python 3 compatible and any
incompatibilities are bugs on our side.

I haven't heard of anyone running airflow on windows, but I'm not sure
what the official stance is.

On 28 November 2016 at 07:01, Dmitriy Krasnikov  wrote:
> I have been using it and developing plugins for it in Python 3.5 with no 
> problems.
>
>
> -Original Message-
> From: Maycock, Luke [mailto:luke.mayc...@affiliate.oliverwyman.com]
> Sent: Monday, November 28, 2016 6:58 AM
> To: dev@airflow.incubator.apache.org
> Subject: Python 3
>
> Hi All,
>
>
> I have been using Airflow for a while now and have a couple of questions that 
> I am hoping someone knows the answer to:
>
>  1.  Is Airflow now Python 3 compatible? The documentation used to state that 
> Airflow was only compatible with Python 2.7 
> (https://web.archive.org/web/20160502024539/http://pythonhosted.org/airflow/start.html)
> but that statement has since been removed.
>  2.  I know Gunicorn does not run on Windows - is this the only component of 
> Airflow that does not run on Windows?
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com<mailto:luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com<http://www.oliverwyman.com/>
>
>
> 



--
Im Xuan Ji!




Python 3

2016-11-28 Thread Maycock, Luke
Hi All,


I have been using Airflow for a while now and have a couple of questions that I 
am hoping someone knows the answer to:

 1.  Is Airflow now Python 3 compatible? The documentation used to state that 
Airflow was only compatible with Python 2.7 
(https://web.archive.org/web/20160502024539/http://pythonhosted.org/airflow/start.html)
 but that statement has since been removed.
 2.  I know Gunicorn does not run on Windows - is this the only component of 
Airflow that does not run on Windows?


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com





Re: Airflow 2.0

2016-11-24 Thread Maycock, Luke
> Add FK to dag_run to the task_instance table on Postgres so that
task_instances can be uniquely attributed to dag runs.


+ 1


Also, I believe xcoms would need to be addressed in the same way at the same
time - I have added a comment to that effect on
https://issues.apache.org/jira/browse/AIRFLOW-642


I believe this would be implemented for all supported back-ends, not just 
PostgreSQL.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: Arunprasad Venkatraman 
Sent: 21 November 2016 18:16
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow 2.0

> Add FK to dag_run to the task_instance table on Postgres so that
task_instances can be uniquely attributed to dag runs.
>  Ensure scheduler can be run continuously without needing restarts.
>  Ensure scheduler can handle tens of thousands of active workflows

+1

We are planning to run around 40,000 tasks a day using Airflow, and some of
them are critical for giving quick feedback to developers. Currently, using
the execution date to uniquely identify tasks does not work for us, since we
mainly trigger dags (instead of running them on schedule) and we collide
with the 1 sec granularity on several occasions. Having a task uuid, or
associating dag_run with task_instance as Sergei suggested, would help
mitigate this issue for us and would make it easy for us to update task
results too. We would be happy to start working on this if it makes sense.

Also, we are wondering whether any work has been done in the community to
support multiple schedulers (or alternatives to MySQL/Postgres), because one
scheduler does not scale well for us and we sometimes see slowdowns of up to
a couple of minutes when there are several pending tasks.

Thanks



On Mon, Nov 21, 2016 at 9:57 AM, Chris Riccomini 
wrote:

> > Ensure scheduler can be run continuously without needing restarts
>
> +1
>
> On Mon, Nov 21, 2016 at 5:25 AM, David Batista  wrote:
> > A small request, which might be handy.
> >
> > Having the possibility to select multiple tasks and mark them as
> > Success/Clear/etc.
> >
> > Allow the UI to select individual tasks (i.e., inside the Tree View) and
> > then have a button to mark them as Success/Clear/etc.
> >
> > On 21 November 2016 at 14:22, Sergei Iakhnin  wrote:
> >
> >> I've been running Airflow on 1500 cores in the context of scientific
> >> workflows for the past year and a half. Features that would be
> important to
> >> me for 2.0:
> >>
> >> - Add FK to dag_run to the task_instance table on Postgres so that
> >> task_instances can be uniquely attributed to dag runs.
> >> - Ensure scheduler can be run continuously without needing restarts.
> Right
> >> now it gets into some ill-determined bad state forcing me to restart it
> >> every 20 minutes.
> >> - Ensure scheduler can handle tens of thousands of active workflows.
> Right
> >> now this results in extremely long scheduling times and inconsistent
> >> scheduling even at 2 thousand active workflows.
> >> - Add more flexible task scheduling prioritization. The default
> >> prioritization is the opposite of the behaviour I want. I would prefer
> that
> >> downstream tasks always have higher priority than upstream tasks to
> cause
> >> entire workflows to tend to complete sooner, rather than scheduling
> tasks
> >> from other workflows. Having a few scheduling prioritization strategies
> >> would be beneficial here.
> >> - Provide better support for manually-triggered DAGs on the UI i.e. by
> >> showing them as queued.
> >> - Provide some resource management capabilities via something like slots
> >> that can be defined on workers and occupied by tasks. Using celery's
> >> concurrency parameter at the airflow server level is too coarse-grained
> as
> >> it forces all workers to be the same, and does not allow proper resource
> >> management when different workflow tasks have different resource
> >> requirements thus hurting utilization (a worker could run 8 parallel
> tasks
> >> with small memory footprint, but only 1 task with large memory footprint
> >> for instance).
> >>
> >> With best regards,
> >>
> >> Sergei.
> >>
> >>
> >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
> >> ext-pavlo.ryabc...@here.com>
> >> wrote:
> >>
> >> > -1. We rely heavily on data profiling as a pipeline health monitoring
> >> > tool
> >> >
> >> > -Original Message-
> >> > From: Chris Riccomini [mailto:criccom...@apache.org]
> >> > Sent: Saturday, November 19, 2016 1:57 AM
> >> > To: dev@airflow.incubator.apache.org
> >> > Subject: Re: Airflow 2.0
> >> >
> >> > > RIP out the charting application and the data profiler
> >> >
> >> > Yes please! +1
> >> >
> >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> >> > maximebeauche...@gmail.com> wrote:
> >> > > Another point that may be controversial for Airflow 2.0: RIP out the
> >> > > charting application and the data profiler.

Re: Skip task

2016-11-15 Thread Maycock, Luke
Thank you for taking the time to respond. This is a great approach if you know 
at the time of creating the DAG which tasks you expect to need to skip. 
However, I don't think this is exactly the use case I have. For example, I may 
be expecting a file to arrive in an FTP folder for loading into a database but 
one day it doesn't arrive so I just want to skip that task on that day.


Our workflows commonly contain around 20 of these types of tasks. I could
configure all of these tasks in the way you suggested in case I ever need to
skip one of them. However, I'd prefer not to have to set the tasks up this way
and instead have the ability to skip a task on an ad-hoc basis. I could then
also use this functionality to add the ability to run from a certain point in
a DAG, or up to a certain point in the DAG.



Thanks,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: siddharth anand 
Sent: 14 November 2016 19:48
To: dev@airflow.incubator.apache.org
Subject: Re: Skip task

For cases like this, we (Agari) use the following approach :

  1. Create a Variable in the UI of type boolean such as *enable_feature_x*
  2. Use a ShortCircuitOperator (or BranchPythonOperator) to Skip
  downstream processing based on the value of *enable_feature_x*
  3. Assuming that you don't want to skip ALL downstream tasks, you can
  use a trigger_rule of all_done to resume processing some portion of your
  downstream DAG after skipping an upstream portion

In other words, there is already a means to achieve what you are asking for
today. You can change the value of *enable_feature_x* via the UI. If you'd
like to enhance the UI to better capture this pattern, please submit a PR.
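
Something like the following minimal sketch covers steps 1 and 2 (the task id
and callable name are illustrative, and dag is assumed to be defined
elsewhere):

from airflow.models import Variable
from airflow.operators.python_operator import ShortCircuitOperator


def feature_x_enabled():
    # Returning False makes the ShortCircuitOperator skip all downstream
    # tasks; returning True lets the DAG continue normally.
    return Variable.get('enable_feature_x', default_var='false').lower() == 'true'


check_feature_x = ShortCircuitOperator(
    task_id='check_feature_x',
    python_callable=feature_x_enabled,
    dag=dag,
)

# Per step 3, a downstream task that should resume after the skipped
# section can be given trigger_rule='all_done'.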
-s

On Thu, Nov 10, 2016 at 1:20 PM, Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> Hi Gerard,
>
>
> I see the new status as having a number of uses:
>
>  1.  A user can manually set a task to skip in a DAG run via the UI.
>  2.  We can then make use of this new status to add the following
>      functionality to Airflow:
>      *   Run a DAG run up to a certain point and have the rest of the tasks
>          have the new status.
>      *   Run a DAG run from a certain task to the end, setting all
>          pre-requisite tasks to have this new status.
>
> I am happy to be challenged on the above use cases if there are better
> ways to achieve the same things.
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com
> www.oliverwyman.com
>
>
>
> 
> From: Gerard Toonstra 
> Sent: 09 November 2016 18:08
> To: dev@airflow.incubator.apache.org
> Subject: Re: Skip task
>
> Hey Luke,
>
> Who or what makes the decision to skip processing that task?
>
> Rgds,
>
> Gerard
>
> On Wed, Nov 9, 2016 at 2:39 PM, Maycock, Luke <
> luke.mayc...@affiliate.oliverwyman.com> wrote:
>
> > Hi Gerard,
> >
> >
> > Thank you for your quick response.
> >
> >
> > I am not trying to implement this for a specific operator but rather
> > trying to add it as a feature for any task in any DAG.
> >
> >
> > Given that the skipped states propagate where all directly upstream tasks
> > are skipped, I don't think this is the state we want to use. For the
> > functionality I'm looking for, I think I'll need to introduce a new
> status,
> > maybe 'disabled'.
> >
> >
> > Again, thanks for your response.
> >
> >
> > Cheers,
> > Luke Maycock
> > OLIVER WYMAN
> > luke.mayc...@affiliate.oliverwyman.com
> > www.oliverwyman.com
> >
> >
> >
> > 
> > From: Gerard Toonstra 
> > Sent: 08 November 2016 18:19
> > To: dev@airflow.incubator.apache.org
> > Subject: Re: Skip task
> >
> > Also in 1.7.1.3, there's the ShortCircuitOperator, which can give you an
> > example.
> >
> > https://github.com/apache/incubator-airflow/blob/1.7.1.3/airflow/operators/python_operator.py
> >
> > You'd have to modify this to your needs, but the way it works is that if
> > the condition evaluates to False, none of the
> > downstream tasks are actually executed; they'd be skipped. The reason for
> > putting them into SKIPPED state is that
> > the DAG's final result would still be SUCCESS and not failed.
> >
> > You could cop

Re: Skip task

2016-11-10 Thread Maycock, Luke
Hi Gerard,


I see the new status as having a number of uses:

 1.  A user can manually set a task to skip in a DAG run via the UI.
 2.  We can then make use of this new status to add the following
 functionality to Airflow:
     *   Run a DAG run up to a certain point and have the rest of the tasks
         have the new status.
     *   Run a DAG run from a certain task to the end, setting all
         pre-requisite tasks to have this new status.

I am happy to be challenged on the above use cases if there are better ways to 
achieve the same things.

Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: Gerard Toonstra 
Sent: 09 November 2016 18:08
To: dev@airflow.incubator.apache.org
Subject: Re: Skip task

Hey Luke,

Who or what makes the decision to skip processing that task?

Rgds,

Gerard

On Wed, Nov 9, 2016 at 2:39 PM, Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> Hi Gerard,
>
>
> Thank you for your quick response.
>
>
> I am not trying to implement this for a specific operator but rather
> trying to add it as a feature for any task in any DAG.
>
>
> Given that the skipped states propagate where all directly upstream tasks
> are skipped, I don't think this is the state we want to use. For the
> functionality I'm looking for, I think I'll need to introduce a new status,
> maybe 'disabled'.
>
>
> Again, thanks for your response.
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com
> www.oliverwyman.com
>
>
>
> 
> From: Gerard Toonstra 
> Sent: 08 November 2016 18:19
> To: dev@airflow.incubator.apache.org
> Subject: Re: Skip task
>
> Also in 1.7.1.3, there's the ShortCircuitOperator, which can give you an
> example.
>
> https://github.com/apache/incubator-airflow/blob/1.7.1.3/airflow/operators/python_operator.py
>
> You'd have to modify this to your needs, but the way it works is that if
> the condition evaluates to False, none of the
> downstream tasks are actually executed; they'd be skipped. The reason for
> putting them into SKIPPED state is that
> the DAG's final result would still be SUCCESS and not failed.
>
> You could copy the operator from there and, instead of doing the full "for
> loop", only pick the tasks immediately downstream from this operator and
> skip those. Or, if you need to skip additional tasks downstream, add a
> parameter "num_tasks" that decides on a halting condition for the for loop.
>
> I believe that should work. I didn't try that here, but you can test that
> and see what it does for you.
>
>
> If you want this as a UI capability... for example have a human operator
> decide on skipping this yes or not, then
> maybe the best way forward would be some kind of highly custom plugin with
> its own view. In the end, you'd basically
> do the same action in the backend, whether the python condition triggers the
> skip or the button is clicked.
>
> In the plugin case though, you'd have to keep the UI and the structure of
> the DAG in sync and aligned, otherwise
> it'd become a mess. Airflow wasn't really developed for workflow/human
> interaction, but for workflows where only
> automated processes are involved. That doesn't mean that you can't do
> anything like that, but it may be costly resource
> wise to get this done. For example, on the basis of the BranchOperator, you
> could call an external API to verify if a decision
> was taken on a case, then follow branch A or B if the decision is there or
> put the state back into UP_FOR_RETRY.
> At the moment though, there's no programmatic way to reschedule that task
> to some minutes or hours into the future before
> it's looked at again, unless you really dive into airflow, scheduling
> semantics (@once vs. other schedules) and how
> the scheduler works.
>
> Rgds,
>
> Gerard
>
>
>
>
> On Tue, Nov 8, 2016 at 5:30 PM, Maycock, Luke <
> luke.mayc...@affiliate.oliverwyman.com> wrote:
>
> > Hi All,
> >
> >
> > I am using Airflow 1.7.1.3 and have a particular requirement, which I
> > don't think is currently supported by Airflow but just wanted to check in
> > case I was missing something.
> >
> >
> > I occasionally wish to skip a particular task in a given DAG run such that
> > the task does not run for that DAG run. Is this functionality available in
> > Airflow?

Re: Skip task

2016-11-09 Thread Maycock, Luke
Hi Gerard,


Thank you for your quick response.


I am not trying to implement this for a specific operator but rather trying to 
add it as a feature for any task in any DAG.


Given that the skipped states propagate where all directly upstream tasks are 
skipped, I don't think this is the state we want to use. For the functionality 
I'm looking for, I think I'll need to introduce a new status, maybe 'disabled'.


Again, thanks for your response.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: Gerard Toonstra 
Sent: 08 November 2016 18:19
To: dev@airflow.incubator.apache.org
Subject: Re: Skip task

Also in 1.7.1.3, there's the ShortCircuitOperator, which can give you an
example.

https://github.com/apache/incubator-airflow/blob/1.7.1.3/airflow/operators/python_operator.py

You'd have to modify this to your needs, but the way it works is that if
the condition evaluates to False, none of the
downstream tasks are actually executed; they'd be skipped. The reason for
putting them into SKIPPED state is that
the DAG's final result would still be SUCCESS and not failed.

You could copy the operator from there and, instead of doing the full "for
loop", only pick the tasks immediately downstream from this operator and
skip those. Or, if you need to skip additional tasks downstream, add a
parameter "num_tasks" that decides on a halting condition for the for loop.

I believe that should work. I didn't try that here, but you can test that
and see what it does for you.
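
As a rough, untested sketch of that modification, adapted from the 1.7.1.3
operator linked above (the class name is made up, and the imports may need
adjusting for your Airflow version):

from datetime import datetime

from airflow import settings
from airflow.models import TaskInstance
from airflow.operators.python_operator import PythonOperator
from airflow.utils.state import State


class SkipImmediateDownstreamOperator(PythonOperator):
    def execute(self, context):
        condition = super(SkipImmediateDownstreamOperator, self).execute(context)
        if condition:
            return  # condition holds, downstream tasks run normally
        session = settings.Session()
        # Unlike ShortCircuitOperator, only mark the direct children as
        # skipped rather than every downstream relative.
        for task in context['task'].downstream_list:
            ti = TaskInstance(task, execution_date=context['ti'].execution_date)
            ti.state = State.SKIPPED
            ti.start_date = ti.end_date = datetime.now()
            session.merge(ti)
        session.commit()
        session.close()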


If you want this as a UI capability... for example have a human operator
decide on skipping this yes or not, then
maybe the best way forward would be some kind of highly custom plugin with
its own view. In the end, you'd basically
do the same action in the backend, whether the python condition triggers the
skip or the button is clicked.

In the plugin case though, you'd have to keep the UI and the structure of
the DAG in sync and aligned, otherwise
it'd become a mess. Airflow wasn't really developed for workflow/human
interaction, but for workflows where only
automated processes are involved. That doesn't mean that you can't do
anything like that, but it may be costly resource
wise to get this done. For example, on the basis of the BranchOperator, you
could call an external API to verify if a decision
was taken on a case, then follow branch A or B if the decision is there or
put the state back into UP_FOR_RETRY.
At the moment though, there's no programmatic way to reschedule that task
to some minutes or hours into the future before
it's looked at again, unless you really dive into airflow, scheduling
semantics (@once vs. other schedules) and how
the scheduler works.

Rgds,

Gerard




On Tue, Nov 8, 2016 at 5:30 PM, Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> Hi All,
>
>
> I am using Airflow 1.7.1.3 and have a particular requirement, which I
> don't think is currently supported by Airflow but just wanted to check in
> case I was missing something.
>
>
> I occasionally wish to skip a particular task in a given DAG run such that
> the task does not run for that DAG run. Is this functionality available in
> Airflow?
>
>
> I am aware of the BranchPythonOperator
> (https://airflow.incubator.apache.org/concepts.html#branching) but I don't
> believe this is exactly what I am looking for.
>
>
> I am thinking that a button in the UI alongside the 'Mark Success' and
> 'Run' buttons would be appropriate.
>
>
> If the functionality does not exist, does anyone have any suggestions on
> ways to implement this?
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com
> www.oliverwyman.com
>
>
> 
>




Skip task

2016-11-08 Thread Maycock, Luke
Hi All,


I am using Airflow 1.7.1.3 and have a particular requirement, which I don't 
think is currently supported by Airflow but just wanted to check in case I was 
missing something.


I occasionally wish to skip a particular task in a given DAG run such that the 
task does not run for that DAG run. Is this functionality available in Airflow?


I am aware of the BranchPythonOperator 
(https://airflow.incubator.apache.org/concepts.html#branching) but I don't
believe this is exactly what I am looking for.


I am thinking that a button in the UI alongside the 'Mark Success' and 'Run' 
buttons would be appropriate.


If the functionality does not exist, does anyone have any suggestions on ways 
to implement this?


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com





Re: Airflow Logging

2016-10-18 Thread Maycock, Luke
Thank you for the response Maxime. We will consider the points you made when 
making the changes and may come back with further thoughts as we work on it.


I can't say I am too familiar with Kibana, Logstash and Elasticsearch, but I
am assuming Airflow would still produce logs and you would use Logstash to
process them?


Thanks,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: Maxime Beauchemin 
Sent: 14 October 2016 17:02
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow Logging

Another consideration is for configuration as code in `settings.py`
in-place or in conjunction with `airflow.cfg` to allow more dynamic
configuration, like passing your own custom log handler[s]. Many folks at
the meetup (including Airbnb) spoke about their intent to ship the logging
to Kibana and/or other logging backend.

Side note about `airflow.cfg` vs `settings.py`: it would be great to move
completely away from the cfgs for Airflow 2.0. It turns out that Python's
ConfigParser may be one of the worst modules in the standard library and the
cfg not-so-standard standard isn't great. We have many instances where we
allow for people to pass objects or function as configuration parameters
anyways, so let's keep the whole configuration in one place!
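
Purely to illustrate the idea, a hypothetical settings.py fragment
(LOGGING_HANDLERS is an invented name, not an existing Airflow setting):

import logging.handlers

# Configuration as code: pass real handler objects rather than strings in
# a cfg file, e.g. to ship log records toward a Logstash TCP listener.
LOGGING_HANDLERS = [
    logging.handlers.SocketHandler('logstash.example.com', 5000),
]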

Max

On Fri, Oct 14, 2016 at 7:43 AM, Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> Carrying on from the below. We believe it would be useful to have a number
> of additional configuration options for logs as below. Do you think there
> are any important options we have missed or do you have any other feedback?
> If so, please let us know.
>
>
> Global
>
> CLEAR_INHERITED_LOGGING_SETTINGS (Boolean)
> Clears any handlers, inherited from the root logger, from the Airflow
> logger object and its children.
>
> LOG_TO_DEBUG_FILE (Boolean)
>
> Turn on/off logging of all events to a file.
>
> LOG_TO_ERROR_FILE (Boolean)
> Turn on/off the logging of all errors to a file.
>
> LOG_TO_CONSOLE (Boolean)
> Turn on/off the logging of all events to the console.
>
> Per log file
> The Python logging library, being used for logging in the Airflow code,
> has the ability to rotate log files based on time:
>
>
> • TimedRotatingFileHandler - (https://docs.python.org/2.6/
> library/logging.html#timedrotatingfilehandler) - rotates based on the
> product of when and interval (see below).
>
> The TimedRotatingFileHandler takes the following arguments:
>
>  *   Filename – the filename
>  *   When – used to specify the type of interval
>  *   Interval – defines the interval
>  *   BackupCount - If backupCount is nonzero, at most backupCount files
> will be kept, and if more would be created when rollover occurs, the oldest
> one is deleted. The deletion logic uses the interval to determine which
> files to delete, so changing the interval may leave old files lying around.
>  *   Encoding - if encoding is not ‘None’, it is used to open the file
> with that encoding.
>  *   Delay - if delay is true, then file opening is deferred until the
> first call to emit() – I don’t think this is suitable to expose as an
> Airflow configuration option.
>  *   UTC - if the UTC argument is true, times in UTC will be used;
> otherwise local time is used.
> I believe all of the above, except ‘delay’, should be exposed as Airflow
> configuration.
>
>
> Cheers,
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com
> www.oliverwyman.com
>
>
>
> 
> From: Maycock, Luke
> Sent: 13 October 2016 14:52
> To: dev@airflow.incubator.apache.org
> Subject: Airflow Logging
>
>
> Hi All,
>
> We (owlabs - fork: https://github.com/owlabs/incubator-airflow) have a
> high level design for how to improve the logging throughout the Airflow
> code to be more consistent, maintainable and extensible. We'd really
> appreciate any feedback on the design.
>
> Design for Consolidating Logging Setup:
>
>  *   Create a Class in the utils\logging.py to instantiate and handle
> setup of an "airflow" logger object. This allows us to easily find all
> setup code for further modification and extension in the future.
>  *   This class would be where any general logging configuration (from
> Airflow.cfg) would be applied.
>  *   Instantiate this class in this file, so that importing it allows easy
> control of the logging setup from anywhere else in the application.
>  *   Move all setup of the logging in general to this class w

Re: String formatting

2016-10-17 Thread Maycock, Luke
Thank you for the quick responses!


My understanding from the latest Python documentation is that '.format'
supersedes the use of '%'. Therefore, it seemed a little strange that the
Landscape configuration was advising the use of '%', which is why I asked the
question. I understand having a check in place to ensure consistency, but I
would have expected it to be checking that '.format' is being used, not '%'.


I would not be against Jeremiah's suggestion of disabling this warning. If 
no-one has any strong objections, we will go ahead and make the change then 
create a PR.


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: Jeremiah Lowin 
Sent: 17 October 2016 15:23
To: dev@airflow.incubator.apache.org
Cc: Maycock, Luke
Subject: Re: String formatting

Indeed -- though I think the larger question from Luke is whether or not we
want to enforce a certain style of logging message (variable arguments vs
formatting the string itself). Since there's nothing to stop users from
formatting the string on one line and passing it to logging on the next,
not to mention that I don't really see the performance benefit, I think
this warning is silly and should be removed and people can log however they
feel most comfortable. Just my $0.02.

On Mon, Oct 17, 2016 at 10:20 AM Andrew Phillips 
wrote:

> Perhaps I stand corrected! -- though I don't see where it actually says
> this approach is preferred. In any case, the Python 3 docs explicitly
> state
> that the behavior is only maintained for backwards compatibility:
> https://docs.python.org/3/howto/logging.html#logging-variable-data

Ah, interesting! From [1], it appears that it's now possible to change
the placeholder syntax so that the messages look somewhat more "modern".
As far as I can tell, variable components of the log messages are still
intended to be passed as arguments, though?

So with the style change, the log message could look like this:

logging.info("A message with a param: {}", param)

Regards

ap

[1]
https://docs.python.org/3/howto/logging-cookbook.html#formatting-styles




String formatting

2016-10-17 Thread Maycock, Luke
Hi Dev List,

We're currently working on removing all of the new Landscape.io warnings from 
some of our code and we're noticing the following quite a lot:

"Use % formatting in logging functions and pass the % parameters as arguments"

This is being flagged up on lines such as:

"logging.info("Running command:\n {}".format(hql))"

We're just wondering whether it is indeed preferred to use the % parameters as 
suggested or whether this is an issue with the Landscape.io setup.
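
For illustration, the two styles side by side (hql here is just a placeholder
value):

import logging

hql = "SELECT 1"

# The flagged style: the message is formatted eagerly, even when the
# log level means the record will never be emitted.
logging.info("Running command:\n {}".format(hql))

# The style the warning asks for: '%' placeholders with arguments, so
# formatting is deferred until the record is actually emitted.
logging.info("Running command:\n %s", hql)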

Can anybody give a judgement on this? If the .format is preferred, then we'll 
look into changing the landscape.io settings.

Thanks,

Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com





Re: scheduler questions

2016-10-17 Thread Maycock, Luke
This is great! I have also upvoted 
https://github.com/apache/incubator-airflow/pull/1830.


If I would like to get feedback on a design or if I have an Airflow question, 
is emailing this dev list the correct way to go?


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com




From: siddharth anand 
Sent: 13 October 2016 18:11
To: dev@airflow.incubator.apache.org
Subject: Re: scheduler questions

Boris,

*Question 1*
Only_Run_Latest is in master -
https://github.com/apache/incubator-airflow/commit/edf033be65b575f44aa221d5d0ec9ecb6b32c67a.
That will solve your problem.

Releases come out once a quarter, sometimes once every two quarters, so I would
recommend that you run off master or off your own fork.

You could also achieve this yourself with the following code snippet. It
uses a ShortCircuitOperator that will skip downstream tasks if the DagRun
being executed is not the current one. It will work for any schedule. The
code below has essentially been implemented in the LatestOnlyOperator above
for convenience.

# Imports added for completeness; dag is assumed to be defined elsewhere.
import logging
from datetime import datetime

from airflow.operators.python_operator import ShortCircuitOperator


def skip_to_current_job(ds, **kwargs):
    now = datetime.now()
    left_window = kwargs['dag'].following_schedule(kwargs['execution_date'])
    right_window = kwargs['dag'].following_schedule(left_window)
    logging.info('Left Window {}, Now {}, Right Window {}'.format(
        left_window, now, right_window))
    # Skip downstream tasks unless this run is the most recent schedule.
    if not now <= right_window:
        logging.info('Not latest execution, skipping downstream.')
        return False
    return True


t0 = ShortCircuitOperator(
    task_id='short_circuit_if_not_current',
    provide_context=True,
    python_callable=skip_to_current_job,
    dag=dag,
)


-s


On Thu, Oct 13, 2016 at 7:46 AM, Boris Tyukin  wrote:

> Hello all and thanks for such an amazing project! I have been evaluating
> Airflow and spent a few days reading about it and playing with it and I
> have a few questions that I struggle to understand.
>
> Let's say I have a simple DAG that runs once a day and it is doing a full
> reload of tables from the source database so the process is not
> incremental.
>
> Let's consider this scenario:
>
> Day 1 - OK
>
> Day 2 - airflow scheduler or server with airflow is down for some reason (or
> DAG is paused)
>
> Day 3 - still down(or DAG is paused)
>
> Day 4 - server is up and now needs to run missing jobs.
>
>
> How can I make airflow to run only Day 4 job and not backfill Day 2 and 3?
>
>
> I tried to do depend_on_past = True but it does not seem to do this trick.
>
>
> I also found in a roadmap doc this but seems it is not made to the release
> yet:
>
>
>  Only Run Latest - Champion : Sid
>
> • For cases where we need to only run the latest in a series of task
> instance runs and mark the others as skipped. For example, we may have job
> to execute a DB snapshot every day. If the DAG is paused for 5 days and
> then unpaused, we don’t want to run all 5, just the latest. With this
> feature, we will provide “cron” functionality for task scheduling that is
> not related to ETL
>
>
> My second question, what if I have another DAG that does incremental loads
> from a source table:
>
>
> Day 1 - OK, loaded new/changed data for previous day
>
> Day 2 - source system is down (or DAG is paused), Airflow DagRun failed
>
> Day 3 - source system is down (or DAG is paused), Airflow DagRun failed
>
> Day 4 - source system is up, Airflow Dagrun succeeded
>
>
> My problem (unless I am missing something), Airflow on Day 4 would use
> execution time from Day 3, so the interval for incremental load would be
> since the last run (which was Failed). My hope it would use the last
> _successful_ run so on Day 4 it would go back to Day 1. Is it possible to
> achieve this?
>
> I am aware of a manual backfill command via CLI but I am not sure I want to
> use due to all the issues and inconsistencies I've read about it.
>
> Thanks!
>




Re: Airflow Logging

2016-10-14 Thread Maycock, Luke
Carrying on from the below. We believe it would be useful to have a number of 
additional configuration options for logs as below. Do you think there are any 
important options we have missed or do you have any other feedback? If so, 
please let us know.


Global

CLEAR_INHERITED_LOGGING_SETTINGS (Boolean)
Clears any handlers, inherited from the root logger, from the Airflow logger 
object and its children.

LOG_TO_DEBUG_FILE (Boolean)

Turn on/off logging of all events to a file.

LOG_TO_ERROR_FILE (Boolean)
Turn on/off the logging of all errors to a file.

LOG_TO_CONSOLE (Boolean)
Turn on/off the logging of all events to the console.

Per log file
The Python logging library, being used for logging in the Airflow code, has the 
ability to rotate log files based on time:


• TimedRotatingFileHandler - 
(https://docs.python.org/2.6/library/logging.html#timedrotatingfilehandler) - 
rotates based on the product of when and interval (see below).

The TimedRotatingFileHandler takes the following arguments:

 *   Filename – the filename
 *   When – used to specify the type of interval
 *   Interval – defines the interval
 *   BackupCount - If backupCount is nonzero, at most backupCount files will be 
kept, and if more would be created when rollover occurs, the oldest one is 
deleted. The deletion logic uses the interval to determine which files to 
delete, so changing the interval may leave old files lying around.
 *   Encoding - if encoding is not ‘None’, it is used to open the file with 
that encoding.
 *   Delay - if delay is true, then file opening is deferred until the first 
call to emit() – I don’t think this is suitable to expose as an Airflow 
configuration option.
 *   UTC - if the UTC argument is true, times in UTC will be used; otherwise 
local time is used.
I believe all of the above, except ‘delay’, should be exposed as Airflow 
configuration.
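
For reference, a minimal sketch of wiring those options straight into the
standard library handler (the file name and values are illustrative):

import logging
from logging.handlers import TimedRotatingFileHandler

handler = TimedRotatingFileHandler(
    filename='/var/log/airflow/airflow.log',  # Filename
    when='midnight',                          # When - the type of interval
    interval=1,                               # Interval - every 1 day
    backupCount=7,                            # keep at most 7 rotated files
    encoding='utf-8',                         # Encoding
    utc=True,                                 # use UTC for rollover times
)
logging.getLogger('airflow').addHandler(handler)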


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com



____
From: Maycock, Luke
Sent: 13 October 2016 14:52
To: dev@airflow.incubator.apache.org
Subject: Airflow Logging


Hi All,

We (owlabs - fork: https://github.com/owlabs/incubator-airflow) have a high 
level design for how to improve the logging throughout the Airflow code to be 
more consistent, maintainable and extensible. We'd really appreciate any 
feedback on the design.

Design for Consolidating Logging Setup:

 *   Create a Class in the utils\logging.py to instantiate and handle setup of 
an "airflow" logger object. This allows us to easily find all setup code for 
further modification and extension in the future.
 *   This class would be where any general logging configuration (from 
Airflow.cfg) would be applied.
 *   Instantiate this class in this file, so that importing it allows easy 
control of the logging setup from anywhere else in the application.
 *   Move all setup of the logging in general to this class with easy to call 
control methods to turn forms of logging on and off (e.g. 
console/debugFile/errorFile).
 *   Move any other helper functions related to logging that are still required 
into the utils\logging.py so that we can easily find them for modification and 
extension in the future.
 *   Ensure that all logging throughout the application is done via the
 "airflow" logger object, or by a child of this logger object. This allows us
 to:
     *   Leave all settings of the root logger object alone, so that we are
         friendly to any parent or child processes that are not part of the
         application, allowing them to safely manage their own overriding
         logging setup.
     *   Modify the settings of logging throughout the entire application via
         a single simple point of control.

Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com





Airflow Logging

2016-10-13 Thread Maycock, Luke
Hi All,

We (owlabs - fork: https://github.com/owlabs/incubator-airflow) have a high 
level design for how to improve the logging throughout the Airflow code to be 
more consistent, maintainable and extensible. We'd really appreciate any 
feedback on the design.

Design for Consolidating Logging Setup:

 *   Create a Class in the utils\logging.py to instantiate and handle setup of 
an "airflow" logger object. This allows us to easily find all setup code for 
further modification and extension in the future.
 *   This class would be where any general logging configuration (from 
Airflow.cfg) would be applied.
 *   Instantiate this class in this file, so that importing it allows easy 
control of the logging setup from anywhere else in the application.
 *   Move all setup of the logging in general to this class with easy to call 
control methods to turn forms of logging on and off (e.g. 
console/debugFile/errorFile).
 *   Move any other helper functions related to logging that are still required 
into the utils\logging.py so that we can easily find them for modification and 
extension in the future.
 *   Ensure that all logging throughout the application is done via the
 "airflow" logger object, or by a child of this logger object. This allows us
 to:
     *   Leave all settings of the root logger object alone, so that we are
         friendly to any parent or child processes that are not part of the
         application, allowing them to safely manage their own overriding
         logging setup.
     *   Modify the settings of logging throughout the entire application via
         a single simple point of control.
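
To make the design concrete, a minimal sketch of what the setup class could
look like (class and method names are placeholders, not actual code):

import logging
import sys


class AirflowLoggerSetup(object):
    """Owns the 'airflow' logger; application code logs via
    logging.getLogger('airflow') or a child such as 'airflow.jobs'."""

    def __init__(self, name='airflow'):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.DEBUG)
        # Leave the root logger untouched so parent/child processes can
        # manage their own logging configuration.
        self.logger.propagate = False

    def enable_console(self, level=logging.INFO):
        handler = logging.StreamHandler(sys.stdout)
        handler.setLevel(level)
        self.logger.addHandler(handler)

    def enable_error_file(self, path):
        handler = logging.FileHandler(path)
        handler.setLevel(logging.ERROR)
        self.logger.addHandler(handler)


# Instantiated here so importing this module gives a single point of
# control over logging setup anywhere in the application.
logger_setup = AirflowLoggerSetup()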

Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com





Re: Airflow Releases

2016-10-03 Thread Maycock, Luke
Hi All,


We are running master in our environment and have noticed something new (that 
wasn't present in release 1.7.1.3).


I have the following DAG, which I ran a backfill on:

#                    -- BuildTask6 -- BuildTask7 -- BuildTask8 -- BuildTask9
#                  /                                                      /
#   InitBuildTask -- BuildTask1 -- BuildTask2 -- BuildTask3              /
#                  \ \                                                  /
#                    -- BuildTask4 -- BuildTask5 -----------------------
#


When the backfill begins, the output gives warnings for all tasks that have
not yet had their dependencies met (full output attached), i.e. all tasks
except the one task that has no dependencies. This continues until the
backfill has completed. Is this expected behaviour?


Cheers,
Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com



From: Chris Riccomini 
Sent: 29 September 2016 18:14:02
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow Releases

Hey Luke,

> Is there anything we can do to help get the next release in place?

One thing that would definitely help is running master somewhere in
your environment, and reporting any issues that you see. Over the next few
weeks, AirBNB and a few other folks will be doing the same in an
effort to harden the 1.8 release.

Cheers,
Chris

On Thu, Sep 29, 2016 at 3:08 AM, Maycock, Luke
 wrote:
> Airflow Developers,
>
>
> We were looking at writing a workflow framework in Python when we found 
> Airflow. We have carried out some proof of concept work for using Airflow and 
> wish to continue using it as it comes with lots of great features 
> out-of-the-box.
>
>
> We have created our own fork here:
>
> https://github.com/owlabs/incubator-airflow
>
>
> So far, the only thing we have committed back to the main repository is the 
> following fix to the mssql_hook:
>
> https://github.com/apache/incubator-airflow/pull/1626
>
>
> Among other types of tasks, we wish to be able to run mssql tasks using 
> Airflow. In order to do so, the above and below fixes are required:
>
> https://github.com/apache/incubator-airflow/pull/1458
>
>
> We have created a Chef cookbook for configuring VMs with Airflow and its 
> prerequisites. As part of this cookbook, we are installing the latest release 
> of Airflow. However, it appears that the latest release does not have the 
> aforementioned fixes.
>
> Do you know when the next release of Airflow is expected? Is there anything 
> we can do to help get the next release in place?
>
>
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com
> www.oliverwyman.com
>
>
> 


DAG:

#Define dependencies
#
#                    -- BuildTask6 -- BuildTask7 -- BuildTask8 -- BuildTask9
#                  /                                                      /
#   InitBuildTask -- BuildTask1 -- BuildTask2 -- BuildTask3              /
#                  \ \                                                  /
#                    -- BuildTask4 -- BuildTask5 -----------------------
#



[2016-10-03 09:52:26,996] {models.py:168} INFO - Filling up the DagBag from /usr/local/lib/airflow/dags
[2016-10-03 09:52:27,378] {jobs.py:1680} INFO - Checking run 
[2016-10-03 09:52:27,387] {models.py:1032} WARNING - Dependencies not met for , dependency 

Re: Airflow Releases

2016-09-30 Thread Maycock, Luke
Hi Chris,


We are currently running master in our environment and we'll be sure to report 
any issues.


Do you know when the 1.8 release is due?


Cheers,

Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com



From: Chris Riccomini 
Sent: 29 September 2016 18:14:02
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow Releases

Hey Luke,

> Is there anything we can do to help get the next release in place?

One thing that would definitely help is running master somewhere in
your environment, and reporting any issues that you see. Over the next few
weeks, AirBNB and a few other folks will be doing the same in an
effort to harden the 1.8 release.

Cheers,
Chris

On Thu, Sep 29, 2016 at 3:08 AM, Maycock, Luke
 wrote:
> Airflow Developers,
>
>
> We were looking at writing a workflow framework in Python when we found 
> Airflow. We have carried out some proof of concept work for using Airflow and 
> wish to continue using it as it comes with lots of great features 
> out-of-the-box.
>
>
> We have created our own fork here:
>
> https://github.com/owlabs/incubator-airflow
>
>
> So far, the only thing we have committed back to the main repository is the 
> following fix to the mssql_hook:
>
> https://github.com/apache/incubator-airflow/pull/1626
>
>
> Among other types of tasks, we wish to be able to run mssql tasks using 
> Airflow. In order to do so, the above and below fixes are required:
>
> https://github.com/apache/incubator-airflow/pull/1458
>
>
> We have created a Chef cookbook for configuring VMs with Airflow and its 
> prerequisites. As part of this cookbook, we are installing the latest release 
> of Airflow. However, it appears that the latest release does not have the 
> aforementioned fixes.
>
> Do you know when the next release of Airflow is expected? Is there anything 
> we can do to help get the next release in place?
>
>
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com
> www.oliverwyman.com
>
>
> 




Airflow Releases

2016-09-29 Thread Maycock, Luke
Airflow Developers,


We were looking at writing a workflow framework in Python when we found 
Airflow. We have carried out some proof of concept work for using Airflow and 
wish to continue using it as it comes with lots of great features 
out-of-the-box.


We have created our own fork here:

https://github.com/owlabs/incubator-airflow


So far, the only thing we have committed back to the main repository is the 
following fix to the mssql_hook:

https://github.com/apache/incubator-airflow/pull/1626


Among other types of tasks, we wish to be able to run mssql tasks using 
Airflow. In order to do so, the above and below fixes are required:

https://github.com/apache/incubator-airflow/pull/1458


We have created a Chef cookbook for configuring VMs with Airflow and its 
prerequisites. As part of this cookbook, we are installing the latest release 
of Airflow. However, it appears that the latest release does not have the 
aforementioned fixes.

Do you know when the next release of Airflow is expected? Is there anything we 
can do to help get the next release in place?


Luke Maycock
OLIVER WYMAN
luke.mayc...@affiliate.oliverwyman.com
www.oliverwyman.com


