[jira] [Created] (AIRFLOW-2036) "Broken Dag" messages are basically useless for troubleshooting

2018-01-26 Thread Neil Hanlon (JIRA)
Neil Hanlon created AIRFLOW-2036:


 Summary: "Broken Dag" messages are basically useless for 
troubleshooting
 Key: AIRFLOW-2036
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2036
 Project: Apache Airflow
  Issue Type: Improvement
Affects Versions: Airflow 1.8, Airflow 1.7.1
Reporter: Neil Hanlon


Numerous times we've been plagued with the dreaded "Broken dag: " messages popping up in our WebServer UI.

Until last night, we all thought this message meant the airflow webserver was 
unable to process these dags, and had literally no idea why; because not only 
could I enter airflow's virtual environment and import the packages it was 
looking for, but i could trigger dag runs and they'd work perfectly.

 

Turns out the Broken Dag messages in the UI can come from any airflow process 
that happens to kick off and be unable to import packages.

 

I suggest Airflow modify the {{import_errors}} table to add a hostname, which 
would at the very least help with troubleshooting and prevent operators from 
pulling their hair out trying to figure out why their web server is reporting 
no errors but somehow getting Broken Dag messages.

 

Airflow is "highly distributed", but that doesn't mean much when you can't find 
out which of your nodes is having an issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (AIRFLOW-246) dag_stats endpoint has a terrible query

2016-06-17 Thread Neil Hanlon (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336229#comment-15336229
 ] 

Neil Hanlon edited comment on AIRFLOW-246 at 6/17/16 2:43 PM:
--

[~sekikn] Looks good to me. Query only takes ~4 seconds to run on our db (~750k 
rows).

Obviously with more task instances the actual size of this query will increase 
which will require database tuning to allow it to run (max_allowed_packet) and 
all that... so maybe it should also be looked at whether or not the whole thing 
should be rearchitected... both of the queries (original, and the one you 
wrote) are pretty ugly. I'm not saying all queries have to be pretty, but...

Let me know if you get a code fix for this and I'd be happy to test it.

(new query profile for posterity)

{code:sql}
airflow-rwdb01:airflow> show profile for query 1;
++--+
| Status | Duration |
++--+
| starting   | 0.33 |
| Waiting for query cache lock   | 0.05 |
| Waiting on query cache mutex   | 0.04 |
| checking query cache for query | 0.002673 |
| checking permissions   | 0.06 |
| checking permissions   | 0.04 |
| checking permissions   | 0.03 |
| checking permissions   | 0.06 |
| Opening tables | 0.25 |
| System lock| 0.001740 |
| optimizing | 0.05 |
| statistics | 0.17 |
| preparing  | 0.08 |
| executing  | 0.07 |
| Sorting result | 0.04 |
| Sending data   | 0.000855 |
| optimizing | 0.09 |
| statistics | 0.09 |
| preparing  | 0.10 |
| executing  | 0.04 |
| Sending data   | 0.000622 |
| optimizing | 0.73 |
| statistics | 1.371602 |
| preparing  | 0.000239 |
| executing  | 0.08 |
| Sending data   | 0.407134 |
| optimizing | 0.000145 |
| statistics | 1.176737 |
| preparing  | 0.000174 |
| executing  | 0.06 |
| Sending data   | 0.395361 |
| optimizing | 0.11 |
| statistics | 0.14 |
| preparing  | 0.12 |
| executing  | 0.05 |
| Sending data   | 0.000654 |
| removing tmp table | 0.21 |
| Sending data   | 0.13 |
| Waiting for query cache lock   | 0.05 |
| Waiting on query cache mutex   | 0.05 |
| Sending data   | 0.001048 |
| init   | 0.23 |
| optimizing | 0.06 |
| statistics | 0.07 |
| preparing  | 0.08 |
| Creating tmp table | 0.000386 |
| executing  | 0.06 |
| Copying to tmp table   | 0.005354 |
| Sorting result | 0.91 |
| Sending data   | 0.90 |
| end| 0.04 |
| removing tmp table | 0.000169 |
| end| 0.06 |
| query end  | 0.07 |
| closing tables | 0.04 |
| removing tmp table | 0.07 |
| closing tables | 0.04 |
| removing tmp table | 0.07 |
| closing tables | 0.04 |
| removing tmp table | 0.05 |
| closing tables | 0.13 |
| freeing items  | 0.000144 |
| Waiting for query cache lock   | 0.04 |
| Waiting on query cache mutex   | 0.03 |
| freeing items  | 0.57 |
| Waiting for query cache lock   | 0.04 |
| Waiting on query cache mutex   | 0.05 |
| freeing items  | 0.04 |
| storing result in query cache  | 0.07 |
| logging slow query | 0.04 |
| logging slow query | 0.000221 |
| cleaning up| 0.13 |
++--+
72 rows in set (0.00 sec)
{code}


was (Author: nhanlon):
[~sekikn] Looks good to me. Query only takes ~4 seconds to run on our db (~750k 
rows).

Obviously with more task instances the actual size of this query will increase 
which will require database tuning to allow it to run (max_allowed_packet) and 
all that... so maybe it should also be looked at whether or not the whole thing 
should be rearchitected... both of the queries (original, and the one you 
wrote) are pretty ugly. I'm not saying all queries have to be pretty, but...

Let me know if you get a 

[jira] [Commented] (AIRFLOW-246) dag_stats endpoint has a terrible query

2016-06-17 Thread Neil Hanlon (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336229#comment-15336229
 ] 

Neil Hanlon commented on AIRFLOW-246:
-

[~sekikn] Looks good to me. Query only takes ~4 seconds to run on our db (~750k 
rows).

Obviously with more task instances the actual size of this query will increase 
which will require database tuning to allow it to run (max_allowed_packet) and 
all that... so maybe it should also be looked at whether or not the whole thing 
should be rearchitected... both of the queries (original, and the one you 
wrote) are pretty ugly. I'm not saying all queries have to be pretty, but...

Let me know if you get a code fix for this and I'd be happy to test it.

(new query profile for posterity)

{pre}
airflow-rwdb01:airflow> show profile for query 1;
++--+
| Status | Duration |
++--+
| starting   | 0.33 |
| Waiting for query cache lock   | 0.05 |
| Waiting on query cache mutex   | 0.04 |
| checking query cache for query | 0.002673 |
| checking permissions   | 0.06 |
| checking permissions   | 0.04 |
| checking permissions   | 0.03 |
| checking permissions   | 0.06 |
| Opening tables | 0.25 |
| System lock| 0.001740 |
| optimizing | 0.05 |
| statistics | 0.17 |
| preparing  | 0.08 |
| executing  | 0.07 |
| Sorting result | 0.04 |
| Sending data   | 0.000855 |
| optimizing | 0.09 |
| statistics | 0.09 |
| preparing  | 0.10 |
| executing  | 0.04 |
| Sending data   | 0.000622 |
| optimizing | 0.73 |
| statistics | 1.371602 |
| preparing  | 0.000239 |
| executing  | 0.08 |
| Sending data   | 0.407134 |
| optimizing | 0.000145 |
| statistics | 1.176737 |
| preparing  | 0.000174 |
| executing  | 0.06 |
| Sending data   | 0.395361 |
| optimizing | 0.11 |
| statistics | 0.14 |
| preparing  | 0.12 |
| executing  | 0.05 |
| Sending data   | 0.000654 |
| removing tmp table | 0.21 |
| Sending data   | 0.13 |
| Waiting for query cache lock   | 0.05 |
| Waiting on query cache mutex   | 0.05 |
| Sending data   | 0.001048 |
| init   | 0.23 |
| optimizing | 0.06 |
| statistics | 0.07 |
| preparing  | 0.08 |
| Creating tmp table | 0.000386 |
| executing  | 0.06 |
| Copying to tmp table   | 0.005354 |
| Sorting result | 0.91 |
| Sending data   | 0.90 |
| end| 0.04 |
| removing tmp table | 0.000169 |
| end| 0.06 |
| query end  | 0.07 |
| closing tables | 0.04 |
| removing tmp table | 0.07 |
| closing tables | 0.04 |
| removing tmp table | 0.07 |
| closing tables | 0.04 |
| removing tmp table | 0.05 |
| closing tables | 0.13 |
| freeing items  | 0.000144 |
| Waiting for query cache lock   | 0.04 |
| Waiting on query cache mutex   | 0.03 |
| freeing items  | 0.57 |
| Waiting for query cache lock   | 0.04 |
| Waiting on query cache mutex   | 0.05 |
| freeing items  | 0.04 |
| storing result in query cache  | 0.07 |
| logging slow query | 0.04 |
| logging slow query | 0.000221 |
| cleaning up| 0.13 |
++--+
72 rows in set (0.00 sec)
{pre}

> dag_stats endpoint has a terrible query
> ---
>
> Key: AIRFLOW-246
> URL: https://issues.apache.org/jira/browse/AIRFLOW-246
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: webserver
>Affects Versions: Airflow 1.7.1
> Environment: MySQL Backend through sqlalchemy
>Reporter: Neil Hanlon
>
> Hitting this endpoint creates a series of queries on the database which take 
> over 20 seconds to run, causing the page to not load for that entire time. 
> 

[jira] [Updated] (AIRFLOW-246) dag_stats endpoint has a terrible query

2016-06-15 Thread Neil Hanlon (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Hanlon updated AIRFLOW-246:

Description: 
Hitting this endpoint creates a series of queries on the database which take 
over 20 seconds to run, causing the page to not load for that entire time. 
Luckily the main page (which includes this under "Recent Statuses") loads this 
synchronously, but still... waiting almost half a minute (at times more) to see 
the statuses for dags is really not fun.

We have less than a million rows in the task_instance table--so it's not even a 
problem with that.

Here's a query profile for the query:

https://gist.github.com/NeilHanlon/613f12724e802bc51c23fca7d46d28bf

We've done some optimizations on the database, but to no avail.

The query:

{code:sql}
SELECT task_instance.dag_id AS task_instance_dag_id, task_instance.state AS 
task_instance_state, count(task_instance.task_id) AS count_1 FROM task_instance 
LEFT OUTER JOIN (SELECT dag_run.dag_id AS dag_id, dag_run.execution_date AS 
execution_date FROM dag_run WHERE dag_run.state = 'running') AS running_dag_run 
ON running_dag_run.dag_id = task_instance.dag_id AND 
running_dag_run.execution_date = task_instance.execution_date LEFT OUTER JOIN 
(SELECT dag_run.dag_id AS dag_id, max(dag_run.execution_date) AS execution_date 
FROM dag_run GROUP BY dag_run.dag_id) AS last_dag_run ON last_dag_run.dag_id = 
task_instance.dag_id AND last_dag_run.execution_date = 
task_instance.execution_date WHERE task_instance.task_id IN ... AND 
(running_dag_run.dag_id IS NOT NULL OR last_dag_run.dag_id IS NOT NULL) GROUP 
BY task_instance.dag_id, task_instance.state;
{code}

  was:
Hitting this endpoint creates a series of queries on the database which take 
over 20 seconds to run, causing the page to not load for that entire time. 
Luckily the main page (which includes this under "Recent Statuses") loads this 
synchronously, but still... waiting almost half a minute (at times more) to see 
the statuses for dags is really not fun.

We have less than a million rows in the task_instance table--so it's not even a 
problem with that.

Here's a query profile for the query:

https://gist.github.com/NeilHanlon/613f12724e802bc51c23fca7d46d28bf

We've done some optimizations on the database, but to no avail.

The query:

[pre]
SELECT task_instance.dag_id AS task_instance_dag_id, task_instance.state AS 
task_instance_state, count(task_instance.task_id) AS count_1 FROM task_instance 
LEFT OUTER JOIN (SELECT dag_run.dag_id AS dag_id, dag_run.execution_date AS 
execution_date FROM dag_run WHERE dag_run.state = 'running') AS running_dag_run 
ON running_dag_run.dag_id = task_instance.dag_id AND 
running_dag_run.execution_date = task_instance.execution_date LEFT OUTER JOIN 
(SELECT dag_run.dag_id AS dag_id, max(dag_run.execution_date) AS execution_date 
FROM dag_run GROUP BY dag_run.dag_id) AS last_dag_run ON last_dag_run.dag_id = 
task_instance.dag_id AND last_dag_run.execution_date = 
task_instance.execution_date WHERE task_instance.task_id IN ... AND 
(running_dag_run.dag_id IS NOT NULL OR last_dag_run.dag_id IS NOT NULL) GROUP 
BY task_instance.dag_id, task_instance.state;
[/pre]


> dag_stats endpoint has a terrible query
> ---
>
> Key: AIRFLOW-246
> URL: https://issues.apache.org/jira/browse/AIRFLOW-246
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: webserver
>Affects Versions: Airflow 1.7.1
> Environment: MySQL Backend through sqlalchemy
>Reporter: Neil Hanlon
>
> Hitting this endpoint creates a series of queries on the database which take 
> over 20 seconds to run, causing the page to not load for that entire time. 
> Luckily the main page (which includes this under "Recent Statuses") loads 
> this synchronously, but still... waiting almost half a minute (at times more) 
> to see the statuses for dags is really not fun.
> We have less than a million rows in the task_instance table--so it's not even 
> a problem with that.
> Here's a query profile for the query:
> https://gist.github.com/NeilHanlon/613f12724e802bc51c23fca7d46d28bf
> We've done some optimizations on the database, but to no avail.
> The query:
> {code:sql}
> SELECT task_instance.dag_id AS task_instance_dag_id, task_instance.state AS 
> task_instance_state, count(task_instance.task_id) AS count_1 FROM 
> task_instance LEFT OUTER JOIN (SELECT dag_run.dag_id AS dag_id, 
> dag_run.execution_date AS execution_date FROM dag_run WHERE dag_run.state = 
> 'running') AS running_dag_run ON running_dag_run.dag_id = 
> task_instance.dag_id AND running_dag_run.execution_date = 
> task_instance.execution_date LEFT OUTER JOIN (SELECT dag_run.dag_id AS 
> dag_id, max(dag_run.execution_date) AS execution_date FROM dag_run GROUP BY 
> dag_run.dag_id) AS last_dag_run ON last_dag_run.dag_id = 

[jira] [Created] (AIRFLOW-246) dag_stats endpoint has a terrible query

2016-06-15 Thread Neil Hanlon (JIRA)
Neil Hanlon created AIRFLOW-246:
---

 Summary: dag_stats endpoint has a terrible query
 Key: AIRFLOW-246
 URL: https://issues.apache.org/jira/browse/AIRFLOW-246
 Project: Apache Airflow
  Issue Type: Bug
  Components: webserver
Affects Versions: Airflow 1.7.1
 Environment: MySQL Backend through sqlalchemy
Reporter: Neil Hanlon


Hitting this endpoint creates a series of queries on the database which take 
over 20 seconds to run, causing the page to not load for that entire time. 
Luckily the main page (which includes this under "Recent Statuses") loads this 
synchronously, but still... waiting almost half a minute (at times more) to see 
the statuses for dags is really not fun.

We have less than a million rows in the task_instance table--so it's not even a 
problem with that.

Here's a query profile for the query:

https://gist.github.com/NeilHanlon/613f12724e802bc51c23fca7d46d28bf

We've done some optimizations on the database, but to no avail.

The query:

[pre]
SELECT task_instance.dag_id AS task_instance_dag_id, task_instance.state AS 
task_instance_state, count(task_instance.task_id) AS count_1 FROM task_instance 
LEFT OUTER JOIN (SELECT dag_run.dag_id AS dag_id, dag_run.execution_date AS 
execution_date FROM dag_run WHERE dag_run.state = 'running') AS running_dag_run 
ON running_dag_run.dag_id = task_instance.dag_id AND 
running_dag_run.execution_date = task_instance.execution_date LEFT OUTER JOIN 
(SELECT dag_run.dag_id AS dag_id, max(dag_run.execution_date) AS execution_date 
FROM dag_run GROUP BY dag_run.dag_id) AS last_dag_run ON last_dag_run.dag_id = 
task_instance.dag_id AND last_dag_run.execution_date = 
task_instance.execution_date WHERE task_instance.task_id IN ... AND 
(running_dag_run.dag_id IS NOT NULL OR last_dag_run.dag_id IS NOT NULL) GROUP 
BY task_instance.dag_id, task_instance.state;
[/pre]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)