[jira] [Reopened] (AIRFLOW-539) Add support for BigQuery Standard SQL
[ https://issues.apache.org/jira/browse/AIRFLOW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Riccomini reopened AIRFLOW-539:
-------------------------------------

> Add support for BigQuery Standard SQL
> -------------------------------------
>
>          Key: AIRFLOW-539
>          URL: https://issues.apache.org/jira/browse/AIRFLOW-539
>      Project: Apache Airflow
>   Issue Type: Improvement
>   Components: contrib
>     Reporter: Sam McVeety
>     Assignee: Ilya Rakoshes
>     Priority: Minor
>       Labels: gcp
>      Fix For: Airflow 1.8
>
> Many of the operators in
> https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/operators
> are implicitly using the "useLegacySql" option to BigQuery. Providing the
> option to negate this will allow users to migrate to Standard SQL
> (https://cloud.google.com/bigquery/sql-reference/enabling-standard-sql).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
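[Editor's note] The flag this ticket refers to is the `useLegacySql` key of a BigQuery query job configuration. The helper below is a hypothetical sketch (the function name and structure are the editor's, not Airflow's API); only the `useLegacySql` key itself mirrors the BigQuery REST API:

```python
# Sketch: expose the "useLegacySql" flag to callers instead of
# hard-coding it, which is what the ticket asks for. Hypothetical
# helper; only the 'useLegacySql' key matches the BigQuery REST API.

def build_query_config(sql, use_legacy_sql=False):
    """Return the 'configuration' payload for a BigQuery query job."""
    return {
        'query': {
            'query': sql,
            # False enables Standard SQL; True keeps the legacy dialect.
            'useLegacySql': use_legacy_sql,
        }
    }

config = build_query_config('SELECT 1', use_legacy_sql=False)
print(config['query']['useLegacySql'])  # False -> Standard SQL
```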
[jira] [Commented] (AIRFLOW-564) DagRunOperator and ExternalTaskSensor Incompatibility
[ https://issues.apache.org/jira/browse/AIRFLOW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565656#comment-15565656 ]

Martin Grayson commented on AIRFLOW-564:
----------------------------------------

Thanks for the inspiration [~lauralorenz]! It didn't quite occur to me that I could pass the execution date of the parent dag into the child (via the payload) and subsequently use that in the ExternalTaskSensor. This could be a workaround for me, but it does complicate the situation. I agree with you that inheritance of execution_date could be a better solution.

> DagRunOperator and ExternalTaskSensor Incompatibility
> -----------------------------------------------------
>
>              Key: AIRFLOW-564
>              URL: https://issues.apache.org/jira/browse/AIRFLOW-564
>          Project: Apache Airflow
>       Issue Type: Task
>       Components: DagRun
> Affects Versions: Airflow 1.7.1.3
>         Reporter: Martin Grayson
>      Attachments: 1.controller.png, 2. par.png, 3. dep.png, dep_dag.py, parent_dag.py, test_dag_scheduler.py
>
> I have an hourly Dag that acts to orchestrate other dags, triggering dag runs if a precondition is met.
> The controller dag (Dag CD) uses a `TriggerDagRunOperator` to do this, launching 2 other dags (Dag A, Dag B) in my example (10+ in real life).
> Within these dags they have their own dependencies; for example, Dag B is dependent on Dag A's completion. This is handled using an `ExternalTaskSensor`.
> Since Dag A and B are executed dynamically with the `TriggerDagRunOperator`, they're assigned very granular execution dates, rather than dd-mm-yyyy hh:00:00 as the scheduler would assign. Due to this, the `ExternalTaskSensor` will not succeed unless Dag A and B managed to have exactly the same execution_date (down to the second).
> As far as I can tell, I'm not able to force the `ExternalTaskSensor` to poke for the correct execution date. The only possible solution would be to implement an `execution_date_fn` that looked through the metadata db to find the exact execution time. This seems massively complicated for a relatively simple requirement.
> Can this scenario be modelled with these components?
> I've attached a set of example dags and some screenshots showing the issue.
> Thanks,
> Martin
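[Editor's note] One workaround along the lines Martin describes is an `execution_date_fn`-style helper that normalizes the triggered run's second-granular timestamp back to the hour boundary the scheduler would have used. This is a sketch under the assumption that the runs only differ by sub-hour seconds; it is plain-Python and not part of Airflow:

```python
from datetime import datetime

def floor_to_hour(execution_date):
    """Round a DagRun's execution_date down to hh:00:00, so a
    dynamically triggered run can be matched against the
    hour-aligned date the scheduler would have assigned."""
    return execution_date.replace(minute=0, second=0, microsecond=0)

# A TriggerDagRunOperator-created run gets a second-granular date...
triggered = datetime(2016, 10, 11, 14, 37, 52)
# ...which a sensor's execution_date_fn could normalize before poking:
print(floor_to_hour(triggered))  # 2016-10-11 14:00:00
```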
[jira] [Commented] (AIRFLOW-564) DagRunOperator and ExternalTaskSensor Incompatibility
[ https://issues.apache.org/jira/browse/AIRFLOW-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565570#comment-15565570 ]

Laura Lorenz commented on AIRFLOW-564:
--------------------------------------

For us this hasn't leaked into the problem as described here, but we do have an execution_date-aware DAG that is triggered via another DAG's TriggerDagRunOperator, and we have to send the parent DAG's execution_date as part of the payload to its child. I can see how it could escalate into the problem [~martingrayson] describes if we threw an ExternalTaskSensor into the mix. I think it makes a lot more intuitive sense for the child DAG's DagRun to inherit the execution_date of its parent DAG, rather than one that mirrors the DagRun's start_date.
[jira] [Assigned] (AIRFLOW-563) Dag stuck in "Filling up the DagBag from .." state
[ https://issues.apache.org/jira/browse/AIRFLOW-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Martinez reassigned AIRFLOW-563:
----------------------------------------

    Assignee: Vincent Martinez

> Dag stuck in "Filling up the DagBag from .." state
> --------------------------------------------------
>
>          Key: AIRFLOW-563
>          URL: https://issues.apache.org/jira/browse/AIRFLOW-563
>      Project: Apache Airflow
>   Issue Type: Bug
>     Reporter: Vincent Martinez
>     Assignee: Vincent Martinez
>     Priority: Critical
>  Attachments: 2016-10-10T11_00_00.txt, dagrun.png, load_dims.py
>
> Hello,
> I have scheduled a dag hourly but the task doesn't run at all. It stays at the "Filling up the DagBag from .." step, and each hour I have one more Dag id in running state (stuck as well). Do you have any idea where the problem comes from?
> You'll find attached screenshots, the log of the dag for one run, and the dag file.
> If you need anything else feel free to ask.
> Thanks
> Best regards,
> Vincent
[jira] [Commented] (AIRFLOW-565) DockerOperator doesn't work on Python3.4
[ https://issues.apache.org/jira/browse/AIRFLOW-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566109#comment-15566109 ]

Vitor Baptista commented on AIRFLOW-565:
----------------------------------------

Went ahead and sent a pull request for it: https://github.com/apache/incubator-airflow/pull/1832

> DockerOperator doesn't work on Python3.4
> ----------------------------------------
>
>              Key: AIRFLOW-565
>              URL: https://issues.apache.org/jira/browse/AIRFLOW-565
>          Project: Apache Airflow
>       Issue Type: Bug
>       Components: docker, operators
> Affects Versions: Airflow 1.7.1.3
>      Environment: Python 3.4.3
>         Reporter: Vitor Baptista
>
> On {{DockerOperator.execute()}} we have:
> {code}
> if self.force_pull or len(self.cli.images(name=image)) == 0:
>     logging.info('Pulling docker image ' + image)
>     for l in self.cli.pull(image, stream=True):
>         output = json.loads(l)
>         logging.info("{}".format(output['status']))
> {code}
> https://github.com/apache/incubator-airflow/blob/master/airflow/operators/docker_operator.py#L152-L156
> The {{self.cli.pull()}} method returns {{bytes}} in Python 3.4, and {{json.loads()}} expects a string, so we end up with:
> {code}
> Traceback (most recent call last):
>   File "/home/vitor/Projetos/okfn/opentrials/airflow/env/lib/python3.4/site-packages/airflow/models.py", line 1245, in run
>     result = task_copy.execute(context=context)
>   File "/home/vitor/Projetos/okfn/opentrials/airflow/env/lib/python3.4/site-packages/airflow/operators/docker_operator.py", line 142, in execute
>     logging.info("{}".format(output['status']))
>   File "/usr/lib/python3.4/json/__init__.py", line 312, in loads
>     s.__class__.__name__))
> TypeError: the JSON object must be str, not 'bytes'
> {code}
> To avoid this, we could simply change it to {{output = json.loads(l.decode('utf-8'))}} (note {{decode}}, since {{l}} is {{bytes}}). This hardcodes the encoding as UTF-8, which should be fine, considering the JSON spec requires the use of UTF-8, UTF-16 or UTF-32. As we're dealing with a Docker server, we can assume it will be well behaved.
> I'm happy to submit a pull request for this if you agree with the solution.
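[Editor's note] The proposed fix can be demonstrated in isolation. Note that the call must be `decode` (bytes to str), not `encode`, since `pull(..., stream=True)` yields `bytes` under Python 3; the simulated line below stands in for docker-py's real stream output:

```python
import json

# Simulate one line from docker-py's pull(..., stream=True), which
# yields bytes under Python 3:
line = b'{"status": "Pulling from library/alpine"}'

# json.loads(line) raises TypeError on Python 3.4/3.5, so decode first.
output = json.loads(line.decode('utf-8'))
print(output['status'])  # Pulling from library/alpine
```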
[jira] [Assigned] (AIRFLOW-565) DockerOperator doesn't work on Python3.4
[ https://issues.apache.org/jira/browse/AIRFLOW-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitor Baptista reassigned AIRFLOW-565:
--------------------------------------

    Assignee: Vitor Baptista
[jira] [Commented] (AIRFLOW-513) ExternalTaskSensor tasks should not count towards parallelism limit
[ https://issues.apache.org/jira/browse/AIRFLOW-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565966#comment-15565966 ]

Laura Lorenz commented on AIRFLOW-513:
--------------------------------------

I think that sensors, including ExternalTaskSensors, should be counted in the core parallelism limit; they are, after all, using resources. I think the resource bottlenecks long-waiting ExternalTaskSensors can create should be managed with pools/queues or actual DAG dependencies. To the latter point, in your case you may want to use the TriggerDagRunOperator instead, so you are pushing to create a DagRun instance for DAG 2 instead of polling from DAG 2 for DAG 1. We do use some ExternalTaskSensors in the way you describe, but we increase worker limits each time we add one, and in any case we have only 4 running at a time, so it hasn't reached your level of contention yet.

> ExternalTaskSensor tasks should not count towards parallelism limit
> -------------------------------------------------------------------
>
>          Key: AIRFLOW-513
>          URL: https://issues.apache.org/jira/browse/AIRFLOW-513
>      Project: Apache Airflow
>   Issue Type: Improvement
>  Environment: Ubuntu 14.04, Version 1.7.0
>     Reporter: Kevin Yuen
>
> Hi,
> We are using airflow version 1.7.0 and we are using `ExternalTaskSensor` pretty heavily to manage dependencies between our DAGs.
> We have recently experienced a case where the external task sensors caused the DAGs to go into a limbo state because they took up all the execution slots defined via `AIRFLOW__CORE__PARALLELISM`.
> For example, given we have 2 DAGs, the first with 16 python operator tasks and the other with 16 sensors, and we set `PARALLELISM` to 16: if the scheduler chooses to schedule all 16 sensors first, the dag runs will never complete.
> There are a couple of workarounds for this:
> # staggering the DAGs so that the first dag with python operators runs first
> # lowering the TaskSensor timeout thresholds and relying on retries
> Both of these options seem less than ideal to us. We wonder if `ExternalTaskSensor` should really be counting towards the `PARALLELISM` limit?
> Cheers,
> Kevin
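[Editor's note] Kevin's deadlock scenario can be sketched with a toy slot model (pure Python, not Airflow's scheduler; the names and the greedy policy are illustrative assumptions):

```python
# Toy model of PARALLELISM slot accounting: sensors hold a slot until
# the task they wait on has run. If all slots go to sensors first,
# nothing the sensors depend on can ever start, so the run livelocks.

PARALLELISM = 16

def schedule(tasks):
    """Greedily fill slots in the given order; report what happens."""
    running = tasks[:PARALLELISM]       # sensors never yield their slot
    waiting = tasks[PARALLELISM:]       # real work never gets a slot
    ran = [t for t in running if not t.startswith('sensor')]
    stuck = [t for t in running if t.startswith('sensor')]
    return ran, stuck, waiting

# Worst case from the ticket: all 16 sensors are scheduled first.
tasks = ['sensor_%d' % i for i in range(16)] + ['work_%d' % i for i in range(16)]
ran, stuck, waiting = schedule(tasks)
print(len(stuck), len(waiting))  # 16 16 -> every slot held by a sensor
```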
[jira] [Commented] (AIRFLOW-560) DbApiHook should provide access to url / sqlalchemy
[ https://issues.apache.org/jira/browse/AIRFLOW-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566026#comment-15566026 ]

ASF subversion and git services commented on AIRFLOW-560:
---------------------------------------------------------

Commit c6de16b563cb1f681c62696b755a9e2eb3c80341 in incubator-airflow's branch refs/heads/master from [~gwax]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=c6de16b ]

[AIRFLOW-560] Get URI & SQLA engine from Connection

Closes #1256 from gwax/sqlalchemy_conn

> DbApiHook should provide access to url / sqlalchemy
> ---------------------------------------------------
>
>          Key: AIRFLOW-560
>          URL: https://issues.apache.org/jira/browse/AIRFLOW-560
>      Project: Apache Airflow
>   Issue Type: Improvement
>     Reporter: George Leslie-Waksman
>     Assignee: George Leslie-Waksman
>     Priority: Minor
[jira] [Resolved] (AIRFLOW-560) DbApiHook should provide access to url / sqlalchemy
[ https://issues.apache.org/jira/browse/AIRFLOW-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Anand resolved AIRFLOW-560.
-------------------------------------

    Resolution: Fixed

Issue resolved by pull request #1256
[https://github.com/apache/incubator-airflow/pull/1256]
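[Editor's note] The shape of the resolved feature, deriving a database URI from connection parts, can be sketched as follows. The field names mirror Airflow's Connection model, but the function body is the editor's illustration, not the merged code:

```python
# Sketch of turning connection parts into an RFC-1738 database URI,
# the kind of value a get_uri()/get_sqlalchemy_engine() pair would
# build on. Illustrative only; not the code from the merged PR.

def build_uri(conn_type, login, password, host, port, schema):
    auth = ''
    if login:
        auth = login + (':' + password if password else '') + '@'
    netloc = host + (':' + str(port) if port else '')
    return '{}://{}{}/{}'.format(conn_type, auth, netloc, schema or '')

uri = build_uri('postgresql', 'airflow', 'secret', 'localhost', 5432, 'airflow')
print(uri)  # postgresql://airflow:secret@localhost:5432/airflow
```

A URI in this form can be handed straight to `sqlalchemy.create_engine`.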
[jira] [Commented] (AIRFLOW-470) Frequent multiple dispatching of the same task to celery
[ https://issues.apache.org/jira/browse/AIRFLOW-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566196#comment-15566196 ]

Laura Lorenz commented on AIRFLOW-470:
--------------------------------------

Dup of AIRFLOW-471?

> Frequent multiple dispatching of the same task to celery
> --------------------------------------------------------
>
>              Key: AIRFLOW-470
>              URL: https://issues.apache.org/jira/browse/AIRFLOW-470
>          Project: Apache Airflow
>       Issue Type: Bug
>       Components: celery, scheduler
> Affects Versions: Airflow 1.7.1.3
>         Reporter: Jasmine Tsai
>         Priority: Critical
>
> We are seeing a lot of frequent dispatching of the same task to celery within a very short time frame (the same task instance by Airflow's definition, but a different celery task uuid), which is causing a lot of unexpected behavior for us. Most of these dispatches are annoying but harmless: sometimes they clear xcom data and overwrite logs, but for the most part they rely on the db metadata and do not try to run the task multiple times. We are seeing this behavior frequently; some tasks are getting scheduled 5 times within the span of two minutes. The issue seems to be exacerbated by the use of pools.
> We have even seen the same task being dispatched twice within a second of each other, causing real race conditions because the second try didn't yet see the task instance starting to run in the metadata db.
> It seems from other issues submitted here that people definitely see problems with the same tasks running multiple times, but this problem seems to be getting worse for us. Is it a known issue for the multiple dispatching to be so frequent/severe (or maybe even an intentional design/side effect)? Are there things that we could be doing that might make this worse? (One of our primary suspects is the scheduler, for which we have set num_runs to 1.)
[jira] [Commented] (AIRFLOW-441) DagRuns are marked as failed as soon as one task fails
[ https://issues.apache.org/jira/browse/AIRFLOW-441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566210#comment-15566210 ]

Laura Lorenz commented on AIRFLOW-441:
--------------------------------------

I haven't looked into the source at all, but this has definitely bitten me too. I would prefer the DagRun failure state to occur only after all tasks have been attempted and are in either 'success', 'failed', or 'skipped' status.

> DagRuns are marked as failed as soon as one task fails
> ------------------------------------------------------
>
>          Key: AIRFLOW-441
>          URL: https://issues.apache.org/jira/browse/AIRFLOW-441
>      Project: Apache Airflow
>   Issue Type: Bug
>     Reporter: Jeff Balogh
>
> https://github.com/apache/incubator-airflow/pull/1514 added a [{{verify_integrity}} function|https://github.com/apache/incubator-airflow/blob/fcf645b/airflow/models.py#L3850-L3877] that greedily creates {{TaskInstance}} objects for all tasks in a dag.
> This does not interact well with the assumptions in the new [{{update_state}} function|https://github.com/apache/incubator-airflow/blob/fcf645b/airflow/models.py#L3816-L3825]. The guard for {{if len(tis) == len(dag.active_tasks)}} is no longer effective; in the old world of lazily-created tasks this code would only run once all the tasks in the dag had run. Now it runs all the time, and as soon as one task in a dag run fails the whole DagRun fails. This is bad, since the scheduler stops processing the DagRun after that.
> In retrospect, the old code was also buggy: if your dag ends with a bunch of queued tasks the DagRun could be marked as failed prematurely.
> I suspect the fix is to update the guard to look at tasks whose state is success or failed. Otherwise we're evaluating and failing the dag based on up_for_retry/queued/scheduled tasks.
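[Editor's note] The fix Jeff suggests, evaluating the DagRun only against finished tasks, can be sketched as a small decision function. This is the editor's hypothetical illustration, not the code from models.py:

```python
# Decide a DagRun's overall state only from tasks in a terminal
# state; any non-terminal task keeps the run 'running', so one early
# failure cannot fail the whole DagRun prematurely.

TERMINAL = {'success', 'failed', 'skipped'}

def dagrun_state(task_states):
    """Return 'running' until every task is terminal, then
    'failed' if any task failed, else 'success'."""
    if not all(s in TERMINAL for s in task_states):
        return 'running'
    return 'failed' if 'failed' in task_states else 'success'

# One failure among still-running tasks must NOT fail the run yet:
print(dagrun_state(['failed', 'running', 'queued']))   # running
print(dagrun_state(['failed', 'success', 'skipped']))  # failed
```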
incubator-airflow git commit: closes apache/incubator-airflow#1590 *no movement from submitter*
Repository: incubator-airflow
Updated Branches:
  refs/heads/master c6de16b56 -> 9b689d05a

closes apache/incubator-airflow#1590 *no movement from submitter*

Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/9b689d05
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/9b689d05
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/9b689d05

Branch: refs/heads/master
Commit: 9b689d05a5f3ef3ce73250578c5c9b83ec6a9b3a
Parents: c6de16b
Author: Siddharth Anand
Authored: Tue Oct 11 17:11:25 2016 -0700
Committer: Siddharth Anand
Committed: Tue Oct 11 17:11:25 2016 -0700
[jira] [Updated] (AIRFLOW-559) Add support for BigQuery kwarg parameters
[ https://issues.apache.org/jira/browse/AIRFLOW-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Van Boxel updated AIRFLOW-559:
-----------------------------------

    Component/s: gcp

> Add support for BigQuery kwarg parameters
> -----------------------------------------
>
>          Key: AIRFLOW-559
>          URL: https://issues.apache.org/jira/browse/AIRFLOW-559
>      Project: Apache Airflow
>   Issue Type: Improvement
>   Components: contrib, gcp
>     Reporter: Sam McVeety
>     Assignee: Sam McVeety
>     Priority: Minor
>      Fix For: Airflow 1.8
>
> Many of the operators in
> https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/operators
> add parameters over time, and plumbing these through multiple layers of calls isn't always a high priority.
> The operators (and hooks) should support an end-to-end kwargs parameter that allows new fields (e.g. useLegacySql, defaultDataset) to be added by users without needing to change the underlying code.
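[Editor's note] The end-to-end kwargs pass-through proposed here can be sketched with a plain operator/hook pair. The class and method names are hypothetical; this only illustrates the plumbing, not Airflow's actual classes:

```python
# Each layer forwards unknown keyword arguments untouched, so a new
# API field (e.g. useLegacySql, defaultDataset) reaches the request
# body without any code change in the intermediate layers.

class FakeHook:
    def run_query(self, sql, **api_kwargs):
        request = {'query': sql}
        request.update(api_kwargs)   # new fields land here unchanged
        return request

class FakeOperator:
    def __init__(self, sql, **api_kwargs):
        self.sql = sql
        self.api_kwargs = api_kwargs

    def execute(self):
        return FakeHook().run_query(self.sql, **self.api_kwargs)

op = FakeOperator('SELECT 1', useLegacySql=False, defaultDataset='my_ds')
print(op.execute())
```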
[jira] [Commented] (AIRFLOW-539) Add support for BigQuery Standard SQL
[ https://issues.apache.org/jira/browse/AIRFLOW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564730#comment-15564730 ]

Alex Van Boxel commented on AIRFLOW-539:
----------------------------------------

[~criccomini]: although this ticket is closed, can you add a "gcp" label to it? You're probably an admin. Thanks.
[jira] [Updated] (AIRFLOW-563) Dag stuck in "Filling up the DagBag from .." state
[ https://issues.apache.org/jira/browse/AIRFLOW-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Martinez updated AIRFLOW-563:
-------------------------------------

    Attachment: dagrun.png
                2016-10-10T11_00_00.txt