[jira] [Commented] (AIRFLOW-3518) Toposort is very slow

2018-12-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722184#comment-16722184
 ] 

ASF GitHub Bot commented on AIRFLOW-3518:
-

ashb closed pull request #4322: [AIRFLOW-3518] Performance fixes for 
topological_sort
URL: https://github.com/apache/incubator-airflow/pull/4322
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/airflow/models.py b/airflow/models.py
index 74555902a5..47ebd9713c 100755
--- a/airflow/models.py
+++ b/airflow/models.py
@@ -23,7 +23,7 @@
 from __future__ import unicode_literals
 
 import copy
-from collections import defaultdict, namedtuple
+from collections import defaultdict, namedtuple, OrderedDict
 
 from builtins import ImportError as BuiltinImportError, bytes, object, str
 from future.standard_library import install_aliases
@@ -2662,10 +2662,10 @@ def __init__(
 }
 
 def __eq__(self, other):
-return (
-type(self) == type(other) and
-all(self.__dict__.get(c, None) == other.__dict__.get(c, None)
-for c in self._comps))
+if (type(self) == type(other) and
+self.task_id == other.task_id):
+return all(self.__dict__.get(c, None) == other.__dict__.get(c, 
None) for c in self._comps)
+return False
 
 def __ne__(self, other):
 return not self == other
@@ -3443,12 +3443,13 @@ def __repr__(self):
 return "".format(self=self)
 
 def __eq__(self, other):
-return (
-type(self) == type(other) and
+if (type(self) == type(other) and
+self.dag_id == other.dag_id):
+
 # Use getattr() instead of __dict__ as __dict__ doesn't return
 # correct values for properties.
-all(getattr(self, c, None) == getattr(other, c, None)
-for c in self._comps))
+return all(getattr(self, c, None) == getattr(other, c, None) for c 
in self._comps)
+return False
 
 def __ne__(self, other):
 return not self == other
@@ -3904,8 +3905,8 @@ def topological_sort(self):
 :return: list of tasks in topological order
 """
 
-# copy the the tasks so we leave it unmodified
-graph_unsorted = self.tasks[:]
+# convert into an OrderedDict to speedup lookup while keeping order 
the same
+graph_unsorted = OrderedDict((task.task_id, task) for task in 
self.tasks)
 
 graph_sorted = []
 
@@ -3928,14 +3929,14 @@ def topological_sort(self):
 # not, we need to bail out as the graph therefore can't be
 # sorted.
 acyclic = False
-for node in list(graph_unsorted):
+for node in list(graph_unsorted.values()):
 for edge in node.upstream_list:
-if edge in graph_unsorted:
+if edge.task_id in graph_unsorted:
 break
 # no edges in upstream tasks
 else:
 acyclic = True
-graph_unsorted.remove(node)
+del graph_unsorted[node.task_id]
 graph_sorted.append(node)
 
 if not acyclic:


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Toposort is very slow
> -
>
> Key: AIRFLOW-3518
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3518
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: Niels Zeilemaker
>Assignee: Niels Zeilemaker
>Priority: Major
>
> At a client we've discovered that for larger DAGs toposort is very slow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-3518) Toposort is very slow

2018-12-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721536#comment-16721536
 ] 

ASF GitHub Bot commented on AIRFLOW-3518:
-

NielsZeilemaker opened a new pull request #4322: [AIRFLOW-3518] Performance 
fixes for topological_sort
URL: https://github.com/apache/incubator-airflow/pull/4322
 
 
   For larger DAGs topological_sort was found to be very inefficient. Made
   some small changes to the code to improve the datastructures used in the
   method.
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-3518
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI 
changes:
   While running a larger DAG, topological_sort was found to be very slow. This 
fixup improves the performance on my local machine from ~5-10 seconds to 0.01s.
   Moreover, i've modified the __eq__ operators to have a shortcut for 
dag_id/task_id. This should improve all other cases where equality is used.
   
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   No tests, as I would expect topological_sort to be covered already.
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - When adding new operators/hooks/sensors, the autoclass documentation 
generation needs to be added.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
   
   ### Code Quality
   
   - [X] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Toposort is very slow
> -
>
> Key: AIRFLOW-3518
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3518
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: scheduler
>Reporter: Niels Zeilemaker
>Assignee: Niels Zeilemaker
>Priority: Major
>
> At a client we've discovered that for larger DAGs toposort is very slow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)