Re: Pinning dependencies for Apache Airflow

2018-10-19 Thread Maxime Beauchemin
Oh good to know! Scrap what I wrote then.

On Fri, Oct 19, 2018 at 9:08 AM Ash Berlin-Taylor  wrote:

> echo 'pandas==2.1.3' > constraints.txt
>
> pip install -c constraints.txt apache-airflow[pandas]
>
> That will ignore whatever we specify in setup.py and use 2.1.3.
> https://pip.pypa.io/en/latest/user_guide/#constraints-files
>
> (sorry for the brief message)
>
> > On 19 Oct 2018, at 17:02, Maxime Beauchemin 
> wrote:
> >
> >> releases in pip should have stable (pinned deps)
> > I think that's an issue. When setup.py (the only reqs that setuptools/pip
> > knows about) is restrictive, there's no way to change that in your
> > environment, install will just fail if you deviate (are there any
> > hacks/solutions around that that I don't know about???). For example if you
> > want a specific version of pandas in your env, and Airflow's setup.py has
> > another version of pandas pinned, you're out of luck. I think the only way
> > is to fork and make your own build at that point as you cannot alter
> > setup.py once it's installed. On the other hand, when a version range is
> > specified in setup.py, you're free to pin using your own reqs.txt within
> > the specified version range.
> >
> > I think pinning in setup.py is just not viable. setup.py should have
> > version ranges based on semantic versioning expectations (lib>=1.1.2,
> > <2.0.0). Personally I think we should always have 2 bounds, based on either
> > (1) the next semantic-versioning major release, or (2) a lower version than
> > prescribed by semver that we know breaks backwards-compatibility features we
> > require.
> >
> > I think we have consensus around something like pip-tools to generate a
> > "deterministic" `requirements.txt`. A caveat is we may need 2:
> > requirements.txt and requirements3.txt for Python 3 as some package
> > versions can be flagged as only py2 or only py3.
> >
> > Max
> >
> >
> > On Fri, Oct 19, 2018 at 1:47 AM Jarek Potiuk 
> > wrote:
> >
> >> I think I might have a proposal that could be acceptable by everyone in the
> >> discussion (hopefully :) ).  Let me summarise what I am leaning towards
> >> now:
> >>
> >> I think we can have a solution where it will be relatively easy to keep
> >> both "open" and "fixed" requirements (open in setup.py, fixed in
> >> requirements.txt). Possibly we can use pip-tools or poetry (including using
> >> of the poetry-setup, which seems to be able to generate
> >> setup.py/constraints.txt/requirements.txt from the poetry setup). Poetry is
> >> still "new" so it might not work; then we can try to get a similar approach
> >> with pip-tools or our own custom solution. Here are the basic assumptions:
> >>
> >>   - we can leave master with "open" requirements which makes it
> >>   potentially unstable with potential conflicting dependencies. We will also
> >>   document how to generate a stable set of requirements (hopefully
> >>   automatically) and a way to install from master using those. *This
> >>   addresses needs of people using master for active development with latest
> >>   libraries.*
> >>   - releases in pip should have stable (pinned deps). Upgrading pinned
> >>   releases to the latest "working" stable set should be part of the release
> >>   process (possibly automated with poetry). We can try it out and decide if
> >>   we want to pin only direct dependencies or also the transitive ones (I
> >>   think including transitive dependencies is a bit more stable). *This way
> >>   we keep long-term "install-ability" of releases and make the job of the
> >>   release maintainer easier*.
> >>   - CI builds will use the stable dependencies from requirements.txt. *This
> >>   way we keep CI from dependency-triggered failures.*
> >>   - we add documentation on how to use the pip --constraints mechanism for
> >>   anyone who would like to use airflow from PIP rather than sources, but
> >>   would also like to use other (up- or down-graded) versions of specific
> >>   dependencies. *This way we let active developers work with airflow
> >>   and more recent or older releases.*
> >>
> >> If we can have general consensus that we should try it, I might try to find
> >> some time next week to do some "real work". Rather than implement it and
> >> make a pull request immediately, I am thinking of a Proof of Concept branch
> >> showing how it would work (with some artificial rollback to older
> >> versions of requirements). I thought about a pre-flaskappbuilder upgrade in
> >> one commit and an update to the post-flaskappbuilder upgrade in a second,
> >> explaining the steps I've done to get to it. That would be much better for
> >> the community to discuss if that's the right approach.
> >>
> >> Does it sound good?
> >>
> >> J.
> >>
> >> On Wed, Oct 17, 2018 at 2:21 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
> >> lamb...@coupang.com> wrote:
> >>
> >>> On 10/17/18, 12:24 AM, "William Pursell" 
> >>> wrote:
> >>>
> >>>I'm jumping in a bit late 

Re: explicit_defaults_for_timestamp for mysql

2018-10-19 Thread Bolke de Bruin
Oh, and reading 

https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_explicit_defaults_for_timestamp
 


indicates that it can be set at the session level as well, so we could just
change the alembic scripts to try it. However,
MariaDB does not support it in a session, so we always need to check the
variable. We will also need to set it in *every*
alembic script that deals with datetimes in the future. Nevertheless, this might
be the easiest solution.
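A rough sketch of what that could look like inside a migration (assuming MySQL only; the existing MariaDB check would still be needed, and this is not taken from the actual 0e2a74e0fc9f migration):

# Hypothetical snippet for an alembic upgrade() in Airflow's migrations.
from alembic import op

conn = op.get_bind()
if conn.dialect.name == "mysql":
    # Session-scoped, so no global server setting is required.
    conn.execute("SET SESSION explicit_defaults_for_timestamp = 1")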

Does GCP’s MySQL also allow this setting in the session scope? 

B.

> On 19 Oct 2018, at 18:48, Deng Xiaodong  wrote:
> 
> I'm ok to test this.
> 
> @ash, could you kindly give some examples of what exact behaviour the testers
> should pay attention to? People like me may not know the full background of
> why this restriction & check was introduced, or what issue it was trying to
> address.
> 
> @Feng Lu, could you please advise whether you are still interested in
> preparing this PR?
> 
> Thanks!
> 
> 
> XD
> 
> On Sat, Oct 20, 2018 at 12:38 AM Ash Berlin-Taylor  wrote:
> 
>> This sounds sensible and would mean we could also run on GCP's MySQL
>> offering too.
>> 
>> This would need someone to try out and check that timezones behave
>> sensibly with this change made.
>> 
>> Any volunteers?
>> 
>> -ash
>> 
>>> On 19 Oct 2018, at 17:32, Deng Xiaodong  wrote:
>>> 
> >>> Wondering if there are any further thoughts about this proposal kindly
> >>> raised by Feng Lu earlier?
>>> 
>>> If we can skip this check & allow explicit_defaults_for_timestamp to be
>> 0, it would be helpful, especially for enterprise users in whose
>> environments it’s really hard to ask for a database global variable change
>> (like myself…).
>>> 
>>> 
>>> XD
>>> 
> >>> On 2018/08/28 15:23:10, Feng Lu  wrote:
> >>>> Bolke, a gentle ping..
> >>>> Thank you.
> >>>>
> >>>> On Thu, Aug 23, 2018, 23:01 Feng Lu  wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> After reading the MySQL documentation on the
> >>>>> explicit_defaults_for_timestamp, it appears that we can skip the check
> >>>>> on explicit_defaults_for_timestamp = 1
> >>>>> <https://github.com/apache/incubator-airflow/blob/master/airflow/migrations/versions/0e2a74e0fc9f_add_time_zone_awareness.py#L43>
> >>>>> by setting the column to accept NULL explicitly. For example:
> >>>>>
> >>>>> op.alter_column(table_name='chart', column_name='last_modified',
> >>>>> type_=mysql.TIMESTAMP(fsp=6)) -->
> >>>>> op.alter_column(table_name='chart', column_name='last_modified',
> >>>>> type_=mysql.TIMESTAMP(fsp=6), nullable=True)
> >>>>>
> >>>>> Here's why:
> >>>>> From the MySQL doc (when explicit_defaults_for_timestamp is set to True):
> >>>>> "TIMESTAMP columns not explicitly declared with the NOT NULL attribute are
> >>>>> automatically declared with the NULL attribute and permit NULL values.
> >>>>> Assigning such a column a value of NULL sets it to NULL, not the current
> >>>>> timestamp."
> >>>>>
> >>>>> Thanks and happy to shoot a PR if it makes sense.
> >>>>>
> >>>>> Feng



Re: explicit_defaults_for_timestamp for mysql

2018-10-19 Thread Deng Xiaodong
I'm ok to test this.

@ash, could you kindly give some examples of what exact behaviour the testers
should pay attention to? People like me may not know the full background of
why this restriction & check was introduced, or what issue it was trying to
address.

@Feng Lu, could you please advise whether you are still interested in
preparing this PR?

Thanks!


XD

On Sat, Oct 20, 2018 at 12:38 AM Ash Berlin-Taylor  wrote:

> This sounds sensible and would mean we could also run on GCP's MySQL
> offering too.
>
> This would need someone to try out and check that timezones behave
> sensibly with this change made.
>
> Any volunteers?
>
> -ash
>
> > On 19 Oct 2018, at 17:32, Deng Xiaodong  wrote:
> >
> > Wondering if there are any further thoughts about this proposal kindly
> > raised by Feng Lu earlier?
> >
> > If we can skip this check & allow explicit_defaults_for_timestamp to be
> 0, it would be helpful, especially for enterprise users in whose
> environments it’s really hard to ask for a database global variable change
> (like myself…).
> >
> >
> > XD
> >
> > On 2018/08/28 15:23:10, Feng Lu  wrote:
> >> Bolke, a gentle ping..
> >> Thank you.
> >>
> >> On Thu, Aug 23, 2018, 23:01 Feng Lu  wrote:
> >>
> >>> Hi all,
> >>>
> >>> After reading the MySQL documentation on the
> >>> explicit_defaults_for_timestamp, it appears that we can skip the check
> >>> on explicit_defaults_for_timestamp = 1
> >>> <https://github.com/apache/incubator-airflow/blob/master/airflow/migrations/versions/0e2a74e0fc9f_add_time_zone_awareness.py#L43>
> >>> by setting the column to accept NULL explicitly. For example:
> >>>
> >>> op.alter_column(table_name='chart', column_name='last_modified',
> >>> type_=mysql.TIMESTAMP(fsp=6)) -->
> >>> op.alter_column(table_name='chart', column_name='last_modified',
> >>> type_=mysql.TIMESTAMP(fsp=6), nullable=True)
> >>>
> >>> Here's why:
> >>> From the MySQL doc (when explicit_defaults_for_timestamp is set to True):
> >>> "TIMESTAMP columns not explicitly declared with the NOT NULL attribute are
> >>> automatically declared with the NULL attribute and permit NULL values.
> >>> Assigning such a column a value of NULL sets it to NULL, not the current
> >>> timestamp."
> >>>
> >>> Thanks and happy to shoot a PR if it makes sense.
> >>>
> >>> Feng
> 
> 
>
>


Re: explicit_defaults_for_timestamp for mysql

2018-10-19 Thread Ash Berlin-Taylor
This sounds sensible and would mean we could also run on GCP's MySQL offering 
too.

This would need someone to try out and check that timezones behave sensibly 
with this change made.

Any volunteers?

-ash

> On 19 Oct 2018, at 17:32, Deng Xiaodong  wrote:
> 
> Wondering if there are any further thoughts about this proposal kindly raised
> by Feng Lu earlier?
> 
> If we can skip this check & allow explicit_defaults_for_timestamp to be 0, it 
> would be helpful, especially for enterprise users in whose environments it’s 
> really hard to ask for a database global variable change (like myself…).
> 
> 
> XD
> 
> On 2018/08/28 15:23:10, Feng Lu  wrote:
>> Bolke, a gentle ping..
>> Thank you.
>> 
>> On Thu, Aug 23, 2018, 23:01 Feng Lu  wrote:
>> 
>>> Hi all,
>>>
>>> After reading the MySQL documentation on the
>>> explicit_defaults_for_timestamp, it appears that we can skip the check on
>>> explicit_defaults_for_timestamp = 1
>>> <https://github.com/apache/incubator-airflow/blob/master/airflow/migrations/versions/0e2a74e0fc9f_add_time_zone_awareness.py#L43>
>>> by setting the column to accept NULL explicitly. For example:
>>>
>>> op.alter_column(table_name='chart', column_name='last_modified',
>>> type_=mysql.TIMESTAMP(fsp=6)) -->
>>> op.alter_column(table_name='chart', column_name='last_modified',
>>> type_=mysql.TIMESTAMP(fsp=6), nullable=True)
>>>
>>> Here's why:
>>> From the MySQL doc (when explicit_defaults_for_timestamp is set to True):
>>> "TIMESTAMP columns not explicitly declared with the NOT NULL attribute are
>>> automatically declared with the NULL attribute and permit NULL values.
>>> Assigning such a column a value of NULL sets it to NULL, not the current
>>> timestamp."
>>>
>>> Thanks and happy to shoot a PR if it makes sense.
>>>
>>> Feng



Re: explicit_defaults_for_timestamp for mysql

2018-10-19 Thread Deng Xiaodong
Wondering if there are any further thoughts about this proposal kindly raised by
Feng Lu earlier?

If we can skip this check & allow explicit_defaults_for_timestamp to be 0, it 
would be helpful, especially for enterprise users in whose environments it’s 
really hard to ask for a database global variable change (like myself…).


XD

On 2018/08/28 15:23:10, Feng Lu  wrote:
> Bolke, a gentle ping..
> Thank you.
> 
> On Thu, Aug 23, 2018, 23:01 Feng Lu  wrote:
> 
> > Hi all,
> >
> > After reading the MySQL documentation on the
> > explicit_defaults_for_timestamp, it appears that we can skip the check on
> > explicit_defaults_for_timestamp = 1
> > <https://github.com/apache/incubator-airflow/blob/master/airflow/migrations/versions/0e2a74e0fc9f_add_time_zone_awareness.py#L43>
> > by setting the column to accept NULL explicitly. For example:
> >
> > op.alter_column(table_name='chart', column_name='last_modified',
> > type_=mysql.TIMESTAMP(fsp=6)) -->
> > op.alter_column(table_name='chart', column_name='last_modified',
> > type_=mysql.TIMESTAMP(fsp=6), nullable=True)
> >
> > Here's why:
> > From the MySQL doc (when explicit_defaults_for_timestamp is set to True):
> > "TIMESTAMP columns not explicitly declared with the NOT NULL attribute are
> > automatically declared with the NULL attribute and permit NULL values.
> > Assigning such a column a value of NULL sets it to NULL, not the current
> > timestamp."
> >
> > Thanks and happy to shoot a PR if it makes sense.
> >
> > Feng

Re: Pinning dependencies for Apache Airflow

2018-10-19 Thread Ash Berlin-Taylor
echo 'pandas==2.1.3' > constraints.txt

pip install -c constraints.txt apache-airflow[pandas]

That will ignore whatever we specify in setup.py and use 2.1.3. 
https://pip.pypa.io/en/latest/user_guide/#constraints-files

(sorry for the brief message) 

> On 19 Oct 2018, at 17:02, Maxime Beauchemin  
> wrote:
> 
>> releases in pip should have stable (pinned deps)
> I think that's an issue. When setup.py (the only reqs that setuptools/pip
> knows about) is restrictive, there's no way to change that in your
> environment, install will just fail if you deviate (are there any
> hacks/solutions around that that I don't know about???). For example if you
> want a specific version of pandas in your env, and Airflow's setup.py has
> another version of pandas pinned, you're out of luck. I think the only way
> is to fork and make your own build at that point as you cannot alter
> setup.py once it's installed. On the other hand, when a version range is
> specified in setup.py, you're free to pin using your own reqs.txt within
> the specified version range.
> 
> I think pinning in setup.py is just not viable. setup.py should have
> version ranges based on semantic versioning expectations (lib>=1.1.2,
> <2.0.0). Personally I think we should always have 2 bounds, based on either
> (1) the next semantic-versioning major release, or (2) a lower version than
> prescribed by semver that we know breaks backwards-compatibility features we require.
> 
> I think we have consensus around something like pip-tools to generate a
> "deterministic" `requirements.txt`. A caveat is we may need 2:
> requirements.txt and requirements3.txt for Python 3 as some package
> versions can be flagged as only py2 or only py3.
> 
> Max
> 
> 
> On Fri, Oct 19, 2018 at 1:47 AM Jarek Potiuk 
> wrote:
> 
>> I think i might have a proposal that could be acceptable by everyone in the
>> discussion (hopefully :) ).  Let me summarise what I am leaning towards
>> now:
>> 
>> I think we can have a solution where it will be relatively easy to keep
>> both "open" and "fixed" requirements (open in setup.py, fixed in
>> requirements.txt). Possibly we can use pip-tools or poetry (including using
>> of the poetry-setup, which seems
>> to be able to generate setup.py/constraints.txt/requirements.txt from
>> poetry setup). Poetry is still "new" so it might not work, then we can try
>> to get similar approach with pip-tools or our own custom solution. Here are
>> the basic assumptions:
>> 
>>   - we can leave master with "open" requirements which makes it
>>   potentially unstable with potential conflicting dependencies. We will
>> also
>>   document how to generate stable set of requirements (hopefully
>>   automatically) and a way how to install from master using those. *This
>>   addresses needs of people using master for active development with
>> latest
>>   libraries.*
>>   - releases in pip should have stable (pinned deps). Upgrading pinned
>>   releases to latest "working" stable set should be part of the release
>>   process (possibly automated with poetry). We can try it out and decide
>> if
>>   we want to pin only direct dependencies or also the transitive ones (I
>>   think including transitive dependencies is a bit more stable). *This way
>>   we keep long-term "install-ability" of releases and make job of release
>>   maintainer easier*.
>>   - CI builds will use the stable dependencies from requirements.txt.
>> *This
>>   way we keep CI from dependency-triggered failures.*
>>   - we add documentation on how to use pip --constraints mechanism by
>>   anyone who would like to use airflow from PIP rather than sources, but
>>   would like also to use other (up- or down- graded) versions of specific
>>   dependencies. *This way we let active developers to work with airflow
>>   and more recent/or older releases.*
>> 
>> If we can have general consensus that we should try it, I might try to find
>> some time next week to do some "real work". Rather than implement it and
>> make a pull request immediately, I think of a Proof Of Concept branch
>> showing how it would work (with some artificial going back to older
>> versions of requirements). I thought about pre-flaskappbuilder upgrade in
>> one commit and update to post-flaskappbuilder upgrade in second, explaining
>> the steps I've done to get to it. That would be much better for the
>> community to discuss if that's the right approach.
>> 
>> Does it sound good?
>> 
>> J.
>> 
>> On Wed, Oct 17, 2018 at 2:21 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
>> lamb...@coupang.com> wrote:
>> 
>>> On 10/17/18, 12:24 AM, "William Pursell" 
>>> wrote:
>>> 
>>>I'm jumping in a bit late here, and perhaps have missed some of the
>>>discussion, but I haven't seen any mention of the fact that pinning
>>>versions in setup.py isn't going to solve the problem.  Perhaps it's
>>>my lack of experience with pip, but currently pip doesn't provide any
>>>guarantee that the version of 

Re: Pinning dependencies for Apache Airflow

2018-10-19 Thread Maxime Beauchemin
> releases in pip should have stable (pinned deps)
I think that's an issue. When setup.py (the only reqs that setuptools/pip
knows about) is restrictive, there's no way to change that in your
environment, install will just fail if you deviate (are there any
hacks/solutions around that that I don't know about???). For example if you
want a specific version of pandas in your env, and Airflow's setup.py has
another version of pandas pinned, you're out of luck. I think the only way
is to fork and make your own build at that point as you cannot alter
setup.py once it's installed. On the other hand, when a version range is
specified in setup.py, you're free to pin using your own reqs.txt within
the specified version range.

I think pinning in setup.py is just not viable. setup.py should have
version ranges based on semantic versioning expectations (lib>=1.1.2,
<2.0.0). Personally I think we should always have 2 bounds, based on either
(1) the next semantic-versioning major release, or (2) a lower version than
prescribed by semver that we know breaks backwards-compatibility features we require.
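Purely as an illustration (the package names and bounds below are made up, not our actual pins), the setup.py side could look like:

# setup.py (sketch only)
install_requires=[
    'pandas>=0.17.1, <1.0.0',          # upper bound at the next semver major
    'flask-appbuilder>=1.11.1, <2.0.0',
    'somelib>=1.1.2, <1.4.0',          # tighter than semver: 1.4 broke something we rely on
]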

I think we have consensus around something like pip-tools to generate a
"deterministic" `requirements.txt`. A caveat is we may need 2:
requirements.txt and requirements3.txt for Python 3 as some package
versions can be flagged as only py2 or only py3.
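For example, something along these lines with pip-tools (exact flags worth double-checking against the pip-tools docs):

pip-compile --output-file requirements.txt setup.py    # run under Python 2
pip-compile --output-file requirements3.txt setup.py   # run under Python 3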

Max


On Fri, Oct 19, 2018 at 1:47 AM Jarek Potiuk 
wrote:

> I think i might have a proposal that could be acceptable by everyone in the
> discussion (hopefully :) ).  Let me summarise what I am leaning towards
> now:
>
> I think we can have a solution where it will be relatively easy to keep
> both "open" and "fixed" requirements (open in setup.py, fixed in
> requirements.txt). Possibly we can use pip-tools or poetry (including using
> of the poetry-setup, which seems
> to be able to generate setup.py/constraints.txt/requirements.txt from
> poetry setup). Poetry is still "new" so it might not work, then we can try
> to get similar approach with pip-tools or our own custom solution. Here are
> the basic assumptions:
>
>- we can leave master with "open" requirements which makes it
>potentially unstable with potential conflicting dependencies. We will
> also
>document how to generate stable set of requirements (hopefully
>automatically) and a way how to install from master using those. *This
>addresses needs of people using master for active development with
> latest
>libraries.*
>- releases in pip should have stable (pinned deps). Upgrading pinned
>releases to latest "working" stable set should be part of the release
>process (possibly automated with poetry). We can try it out and decide
> if
>we want to pin only direct dependencies or also the transitive ones (I
>think including transitive dependencies is a bit more stable). *This way
>we keep long-term "install-ability" of releases and make job of release
>maintainer easier*.
>- CI builds will use the stable dependencies from requirements.txt.
> *This
>way we keep CI from dependency-triggered failures.*
>- we add documentation on how to use pip --constraints mechanism by
>anyone who would like to use airflow from PIP rather than sources, but
>would like also to use other (up- or down- graded) versions of specific
>dependencies. *This way we let active developers to work with airflow
>and more recent/or older releases.*
>
> If we can have general consensus that we should try it, I might try to find
> some time next week to do some "real work". Rather than implement it and
> make a pull request immediately, I think of a Proof Of Concept branch
> showing how it would work (with some artificial going back to older
> versions of requirements). I thought about pre-flaskappbuilder upgrade in
> one commit and update to post-flaskappbuilder upgrade in second, explaining
> the steps I've done to get to it. That would be much better for the
> community to discuss if that's the right approach.
>
> Does it sound good?
>
> J.
>
> On Wed, Oct 17, 2018 at 2:21 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
> lamb...@coupang.com> wrote:
>
> > On 10/17/18, 12:24 AM, "William Pursell" 
> > wrote:
> >
> > I'm jumping in a bit late here, and perhaps have missed some of the
> > discussion, but I haven't seen any mention of the fact that pinning
> > versions in setup.py isn't going to solve the problem.  Perhaps it's
> > my lack of experience with pip, but currently pip doesn't provide any
> > guarantee that the version of a dependency specified in setup.py will
> > be the version that winds up being installed.  Is this a known issue
> > that is being intentionally ignored because it's hard (and out of
> > scope) to solve?  I agree that versions should be pinned in setup.py
> > for stable releases, but I think we need to be aware that this won't
> > solve the problem.
> >
> > So the problem is going to 

Re: Mocking airflow (similar to moto for AWS)

2018-10-19 Thread Christophe Blefari
Hi Jarek, I think that you can unit test your DAGs without a DB. You will
have to patch some connections, but it's feasible. On my side I test the
topological order of my DAGs to be sure of the ordering.

And I patch the xcom_push and xcom_pull methods to be sure that everything
between tasks will be OK.
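That patching can be as simple as something like this (illustrative names, not my actual tests; my_task is the operator instance built elsewhere in the test):

from unittest import mock
from airflow.models import BaseOperator

with mock.patch.object(BaseOperator, 'xcom_pull', return_value='upstream-value'), \
     mock.patch.object(BaseOperator, 'xcom_push') as xcom_push:
    my_task.execute(context={})     # the operator under test
    xcom_push.assert_called_once()  # the value never touches a real metadata DB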

If your hooks are tested well, I think it's OK.

For instance, I have this kind of code to test the topological order:
https://gist.github.com/Bl3f/acd3d4b251eb565c96168635d84d0513.
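A minimal version of that kind of test could look like this (the DAG id and task ids are placeholders, not the ones from my gist):

from airflow.models import DagBag

def test_dag_topology():
    dag = DagBag(include_examples=False).get_dag('my_dag')    # placeholder dag_id
    order = [task.task_id for task in dag.topological_sort()]
    # 'extract' must be scheduled before 'load'
    assert order.index('extract') < order.index('load')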

Regards,
Christophe

Le ven. 19 oct. 2018 à 10:23, Jarek Potiuk  a
écrit :

> Thanks! I like the suggestion about testing hooks rather than whole DAGs -
> we will certainly use it in the future. And BDD is the approach I really
> like - thanks for the code examples! We might also use it in the near
> future. Super helpful!
>
> So far we mocked hooks in our unit tests only (for example here
> <
> https://github.com/PolideaInternal/incubator-airflow/blob/master/tests/contrib/operators/test_gcp_compute_operator.py#L241
> >)
> - that helps to test the logic of more complex operators.
> @Anthony - we also use a modified docker-based environment to run the tests
> (https://github.com/PolideaInternal/airflow-breeze/tree/integration-tests)
> including running full Dags. And yeah missing import was just an
> exaggerated example :) we also use IDE/lints to catch those early :D.
>
> I think there is still a need to run whole DAGs on top of testing operators
> and hooks separately. This is to test a bit more complex interactions between
> the operators. In our case we use example dags for both documentation and
> running full e2e integration tests (for example here
>
> https://github.com/PolideaInternal/incubator-airflow/blob/master/airflow/contrib/example_dags/example_gcp_compute.py
> ).
> Those are simple examples but we will have a bit more complex interactions
> and it would be great to be able to run them quicker. However if we get the
> hook tests automated/unit-testable as well, maybe our current approach
> where we run them in the full dockerized environment will be good enough.
>
> J.
>
>
> On Thu, Oct 18, 2018 at 5:44 PM Anthony Brown <
> anthony.br...@johnlewis.co.uk>
> wrote:
>
> > I have pylint set up in my IDE which catches most silly errors like
> missing
> > imports
> > I also use a docker image so I can start up airflow locally and manually
> > test any changes before trying to deploy them. I use a slightly modified
> > version of https://github.com/puckel/docker-airflow to control it. This
> > only works on connections I have access to from my machine
> > Finally I have a suite of tests based on
> >
> >
> https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
> > which I can run to test DAGs are valid and any unit tests I can put in.
> The
> > tests are run in a docker container which runs a local db instance so I
> > have access to xcoms etc
> >
> > As part of my deployment pipeline, I run pylint and tests again before
> > deploying anywhere to make sure nobody has forgotten to run them locally
> >
> > Gerard - I like the suggestion about using mocked hooks and BDD. I will
> > look into this further
> >
> > On Thu, 18 Oct 2018 at 15:12, Gerard Toonstra 
> wrote:
> >
> > > There was a discussion about a unit testing approach last year 2017 I
> > > believe. If you dig the mail archives, you can find it.
> > >
> > > My take is:
> > >
> > > - You should test "hooks" against some real system, which can be a
> docker
> > > container. Make sure the behavior is predictable when talking against
> > that
> > > system. Hook tests are not part of general CI tests because of the
> > > complexity of the CI setup you'd have to make, so they are run on local
> > > boxes.
> > > - Maybe add additional "mock" hook tests, mocking out the connected
> > > systems.
> > > - When hooks are tested, operators can use 'mocked' hooks that no
> longer
> > > need access to actual systems. You can then set up an environment where
> > you
> > > have predictable inputs and outputs and test how the operators act on
> > them.
> > > I've used "behave" to do that with very simple record sets, but you can
> > > make these as complex as you want.
> > > - Then you know your hooks and operators work functionally. Testing if
> > your
> > > workflow works in general can be implemented by adding "check"
> operators.
> > > The benefit here is that you don't test the workflow once, but you test
> > for
> > > data consistency every time the dag runs. If you have complex workflows
> > > where the correct behavior of the flow is worrisome, then you may need
> to
> > > go deeper into it.
> > >
> > > The above doesn't depend on DAGS that need to be scheduled and the
> delays
> > > involving that.
> > >
> > > All of the above is implemented in my repo
> > > https://github.com/gtoonstra/airflow-hovercraft  , using "behave" as a
> > BDD
> > > method of testing, so you can peruse that.
> > >
> > > Rgds,
> > >

Re: Transferring files between S3 and GCS

2018-10-19 Thread Chris Fei
I ran into the same issue and ended up building a separate operator that
works as you describe, though I haven't submitted it as a PR. Happy to
share my implementation with you.
I found that it's useful to have both ways of transferring data.
Initially, I migrated all of my S3ToGCS tasks to use the transfer
service, but I found that its performance can be unreliable with some
combination of 1) transferring smaller datasets and 2) invoking many
transfers in parallel. The transfer service is a bit of a black box, so
when it doesn't work as expected you're stuck. Because of this, I ended
up migrating some of my tasks to the original implementation. I would
definitely keep both options around--I don't think I have a preference
between new operator vs a param on the existing operator.
Chris


On Fri, Oct 19, 2018, at 7:09 AM, Conrad Lee wrote:
> Hello Airflow community,
> 
> I'm interested in transferring data between S3 and Google Cloud Storage. I
> want to transfer data on the scale of hundreds of gigabytes to a few
> terabytes.
> 
> Airflow already has an operator that could be used for this use-case:
> the S3ToGoogleCloudStorageOperator. However, looking over its implementation
> it appears that all the data to be transferred actually passes through the
> machine running airflow. That seems completely unnecessary to me, and will
> place a lot of burden on the airflow workers and will be bottlenecked by the
> bandwidth of the workers. It could even lead to out-of-disk errors like this
> one.
> 
> I would much rather use Google Cloud's 'Transfer Service' for doing
> this--that way the airflow operator just needs to make an API call and
> (optionally) keep polling the API until the transfer is done (this last bit
> could be done in a sensor). The heavy work of performing the transfer is
> offloaded to the Transfer Service.
> 
> Was it an intentional design decision to avoid using the Google Transfer
> Service? If I create a PR that adds the ability to perform transfers with
> the Google Transfer Service, should it
> 
>   - replace the existing operator
>   - be an option on the existing operator (i.e., add an argument that
>   toggles between 'local worker transfer' and 'google hosted transfer')
>   - make a new operator
> 
> Thanks,
> Conrad Lee



Transferring files between S3 and GCS

2018-10-19 Thread Conrad Lee
Hello Airflow community,

I'm interested in transferring data between S3 and Google Cloud Storage.  I
want to transfer data on the scale of hundreds of gigabytes to a few
terabytes.

Airflow already has an operator that could be used for this use-case:
the S3ToGoogleCloudStorageOperator.
However, looking over its implementation it appears that all the data to be
transferred actually passes through the machine running airflow.  That
seems completely unnecessary to me, and will place a lot of burden on the
airflow workers and will be bottlenecked by the bandwidth of the workers.
It could even lead to out-of-disk errors like this one.

I would much rather use Google Cloud's 'Transfer Service' for doing
this--that way the airflow operator just needs to make an API call and
(optionally) keep polling the API until the transfer is done (this last bit
could be done in a sensor).  The heavy work of performing the transfer is
offloaded to the Transfer Service.
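For reference, the kind of call such an operator would end up making, sketched against the public storagetransfer v1 API via google-api-python-client (project, buckets and schedule below are placeholders, and credential handling is omitted):

from googleapiclient.discovery import build

client = build('storagetransfer', 'v1')
job = client.transferJobs().create(body={
    'projectId': 'my-gcp-project',
    'status': 'ENABLED',
    'transferSpec': {
        'awsS3DataSource': {'bucketName': 'my-s3-bucket'},   # AWS credentials omitted
        'gcsDataSink': {'bucketName': 'my-gcs-bucket'},
    },
    'schedule': {'scheduleStartDate': {'year': 2018, 'month': 10, 'day': 19}},
}).execute()
# ...then poll client.transferOperations().list(...) until the transfer finishes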

Was it an intentional design decision to avoid using the Google Transfer
Service?  If I create a PR that adds the ability to perform transfers with
the Google Transfer Service, should it

   - replace the existing operator
   - be an option on the existing operator (i.e., add an argument that
   toggles between 'local worker transfer' and 'google hosted transfer')
   - make a new operator

Thanks,
Conrad Lee


Re: Mocking airflow (similar to moto for AWS)

2018-10-19 Thread Jarek Potiuk
Thanks! I like the suggestion about testing hooks rather than whole DAGs -
we will certainly use it in the future. And BDD is the approach I really
like - thanks for the code examples! We might also use it in the near
future. Super helpful!

So far we mocked hooks in our unit tests only (for example here:
https://github.com/PolideaInternal/incubator-airflow/blob/master/tests/contrib/operators/test_gcp_compute_operator.py#L241)
- that helps to test the logic of more complex operators.
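The general pattern is roughly this (the module, operator and hook names here are hypothetical, only loosely mirroring the linked test):

from unittest import mock
from airflow.contrib.operators.my_gcp_operator import MyGcpOperator   # hypothetical module

@mock.patch('airflow.contrib.operators.my_gcp_operator.MyGcpHook')    # hypothetical hook
def test_operator_calls_hook(mock_hook):
    op = MyGcpOperator(task_id='start', resource_id='res-1')          # hypothetical operator
    op.execute(context=None)
    # the operator's logic runs, but no real GCP call is made
    mock_hook.return_value.start_resource.assert_called_once_with('res-1')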
@Anthony - we also use a modified docker-based environment to run the tests
(https://github.com/PolideaInternal/airflow-breeze/tree/integration-tests)
including running full Dags. And yeah missing import was just an
exaggerated example :) we also use IDE/lints to catch those early :D.
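(The cheap guard for exactly that kind of mistake is a DagBag import test, roughly:)

from airflow.models import DagBag

def test_no_import_errors():
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)   # path is illustrative
    assert not dag_bag.import_errors, dag_bag.import_errors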

I think there is still a need to run whole DAGs on top of testing operators
and hooks separately. This is to test a bit more complex interactions between
the operators. In our case we use example dags for both documentation and
running full e2e integration tests (for example here
https://github.com/PolideaInternal/incubator-airflow/blob/master/airflow/contrib/example_dags/example_gcp_compute.py).
Those are simple examples but we will have a bit more complex interactions
and it would be great to be able to run them quicker. However if we get the
hook tests automated/unit-testable as well, maybe our current approach
where we run them in the full dockerized environment will be good enough.

J.


On Thu, Oct 18, 2018 at 5:44 PM Anthony Brown 
wrote:

> I have pylint set up in my IDE which catches most silly errors like missing
> imports
> I also use a docker image so I can start up airflow locally and manually
> test any changes before trying to deploy them. I use a slightly modified
> version of https://github.com/puckel/docker-airflow to control it. This
> only works on connections I have access to from my machine
> Finally I have a suite of tests based on
>
> https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
> which I can run to test DAGs are valid and any unit tests I can put in. The
> tests are run in a docker container which runs a local db instance so I
> have access to xcoms etc
>
> As part of my deployment pipeline, I run pylint and tests again before
> deploying anywhere to make sure nobody has forgotten to run them locally
>
> Gerard - I like the suggestion about using mocked hooks and BDD. I will
> look into this further
>
> On Thu, 18 Oct 2018 at 15:12, Gerard Toonstra  wrote:
>
> > There was a discussion about a unit testing approach last year 2017 I
> > believe. If you dig the mail archives, you can find it.
> >
> > My take is:
> >
> > - You should test "hooks" against some real system, which can be a docker
> > container. Make sure the behavior is predictable when talking against
> that
> > system. Hook tests are not part of general CI tests because of the
> > complexity of the CI setup you'd have to make, so they are run on local
> > boxes.
> > - Maybe add additional "mock" hook tests, mocking out the connected
> > systems.
> > - When hooks are tested, operators can use 'mocked' hooks that no longer
> > need access to actual systems. You can then set up an environment where
> you
> > have predictable inputs and outputs and test how the operators act on
> them.
> > I've used "behave" to do that with very simple record sets, but you can
> > make these as complex as you want.
> > - Then you know your hooks and operators work functionally. Testing if
> your
> > workflow works in general can be implemented by adding "check" operators.
> > The benefit here is that you don't test the workflow once, but you test
> for
> > data consistency every time the dag runs. If you have complex workflows
> > where the correct behavior of the flow is worrisome, then you may need to
> > go deeper into it.
> >
> > The above doesn't depend on DAGS that need to be scheduled and the delays
> > involving that.
> >
> > All of the above is implemented in my repo
> > https://github.com/gtoonstra/airflow-hovercraft  , using "behave" as a
> BDD
> > method of testing, so you can peruse that.
> >
> > Rgds,
> >
> > G>
> >
> >
> > On Thu, Oct 18, 2018 at 2:43 PM Jarek Potiuk 
> > wrote:
> >
> > > I am also looking to have (I think) similar workflow. Maybe someone has
> > > done something similar and can give some hints on how to do it the
> > easiest
> > > way?
> > >
> > > Context:
> > >
> > > While developing operators I am using example test DAGs that talk to
> GCP.
> > > So far our "integration tests" require copying the dag folder and
> > > restarting the airflow servers, unpausing the dag and waiting for it to
> > > start. That takes a lot of time, sometimes just to find out that you
> > missed
> > > one import.
> > >
> > > Ideal workflow:
> > >
> > > Ideally I'd love to have a "unit" test (i.e possible to run via
> nosetests
> > > or IDE integration/PyCharm) that:
> > >
> > >- should