Sushanth Sowmyan created HIVE-17095:
---------------------------------------

             Summary: Long chain repl loads do not complete in a timely fashion
                 Key: HIVE-17095
                 URL: https://issues.apache.org/jira/browse/HIVE-17095
             Project: Hive
          Issue Type: Bug
          Components: Query Planning, repl
            Reporter: sapin amin
            Assignee: Sushanth Sowmyan


Per performance testing done by [~sapinamin] (thus, I'm setting him as 
reporter), we were able to discover an important bug affecting replication. It 
also has the potential to affect any other large DAG of Tasks that Hive 
generates, if that DAG has multiple paths to the same child Task nodes.

Basically, we find that incremental REPL LOAD does not finish in a timely 
fashion. The test in this case was to add 400 partitions and replicate them. 
Associated with each partition, there was an ADD PTN and an ALTER PTN event. 
For each of the ADD PTN events, we'd generate a DDLTask, a CopyTask and a 
MoveTask. For each ALTER PTN, there'd be a single DDLTask. And since order of 
execution is important, dependency collection tasks get chained in between 
phases. A sketch of the resulting graph shape follows below.
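To make the shape concrete, here's a minimal stand-alone sketch of that graph 
in plain Java. The Task here is a hypothetical stand-in, not Hive's actual 
org.apache.hadoop.hive.ql.exec.Task, and ReplDagShape.build just mirrors the 
shape described above:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Hive task node; in the real plan these would be
// DDLTask, CopyTask, MoveTask and DependencyCollectionTask instances.
class Task {
    final String name;
    final List<Task> childTasks = new ArrayList<>();

    Task(String name) { this.name = name; }

    void addDependentTask(Task child) { childTasks.add(child); }
}

class ReplDagShape {
    // Builds: [width ADD-PTN chains in parallel] -> DEP col -> [next stage] -> ...
    static List<Task> build(int stages, int width) {
        List<Task> roots = new ArrayList<>();
        Task prevDepCol = null;
        for (int s = 0; s < stages; s++) {
            Task depCol = new Task("DEP-col-" + s);
            for (int p = 0; p < width; p++) {
                // ADD PTN chain: DDLTask -> CopyTask -> MoveTask, joining on the DEP col.
                Task ddl = new Task("DDL-" + s + "-" + p);
                Task copy = new Task("Copy-" + s + "-" + p);
                Task move = new Task("Move-" + s + "-" + p);
                ddl.addDependentTask(copy);
                copy.addDependentTask(move);
                move.addDependentTask(depCol);
                if (prevDepCol == null) {
                    roots.add(ddl);                   // first stage: these are the root tasks
                } else {
                    prevDepCol.addDependentTask(ddl); // later stages fan out of the prior DEP col
                }
            }
            prevDepCol = depCol;
        }
        return roots;
    }
}
{code}

Note that every DEP col node ends up with {{width}} parents, which is exactly 
the "multiple paths to the same child" situation described above.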

Trying to root-cause this shows that execution seems to stall forever at 
Driver instantiation time; it almost looks like the thread never proceeds past 
that point.

Looking at the logs, it seems that the traversal, as written, walks every 
subtree of every node without checking for duplicates, and it does this simply 
to get the number of execution tasks!

And thus, the task visitor will visit every subtree of every node, which is 
fine for graphs that are actually trees, but is horrible for us, since we have 
dependency collection tasks between each phase. Effectively, this is what's 
happening:

We have a DAG, say, like this:

4 tasks in parallel -> DEP col -> 4 tasks in parallel -> DEP col -> ...

This means that for each of the 4 root tasks, we do a full traversal of the 
entire subgraph past the DEP col (not just a single visit of each node), and 
this happens recursively, leading to exponential growth in the number of node 
visits as the length and breadth of the graph increase. In our case, we had 
about 800 tasks in the graph, with a width of roughly 2-3 across 200 stages, 
each with a dep collection before and after. That meant the leaf nodes of this 
DAG could be reached in somewhere between 2^200 and 3^200 distinct ways, and 
we would visit them via every one of those ways. And all this simply to count 
the number of tasks to schedule; we then hit this function multiple more 
times, once per hook, once for the MapReduceCompiler and once for the 
TaskCompiler.
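Continuing the sketch above, a dedup-free recursive count (analogous in shape 
to the traversal described here, not the actual Hive code) makes the blowup 
visible even on a toy graph:

{code:java}
import java.util.List;

class NaiveCountDemo {
    // Counts node visits the way a dedup-free visitor would: the full subtree
    // of every node is re-walked once per path that reaches it.
    static long naiveCount(List<Task> tasks) {
        long visits = 0;
        for (Task t : tasks) {
            visits += 1 + naiveCount(t.childTasks);
        }
        return visits;
    }

    public static void main(String[] args) {
        // At width 2, visits roughly double with each added stage, since each
        // DEP col node has two parents and its subtree is walked once per parent.
        for (int stages = 4; stages <= 20; stages += 4) {
            long visits = naiveCount(ReplDagShape.build(stages, 2));
            System.out.println(stages + " stages -> " + visits + " node visits");
        }
    }
}
{code}

At 20 stages this already does about 8.4 million visits to count only 140 
tasks; at 200 stages it would effectively never finish.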

We have not been sending such large DAGs to the Driver so far, which is why 
this has not been a problem yet, and there are upcoming changes that will 
reduce the number of tasks replication generates (as part of a memory 
addressing issue), but we should still fix the way we do Task traversal so 
that a large DAG cannot cripple us. A sketch of one possible fix follows.
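Sketching the fix direction against the same stand-in model (an illustration 
of the idea, not an actual patch): track visited tasks so each node is 
expanded exactly once, which makes the traversal linear in the number of 
tasks.

{code:java}
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class DedupCountDemo {
    // Visited-set traversal: each task is expanded exactly once, no matter how
    // many paths lead to it through the DEP col nodes.
    static long dedupCount(List<Task> roots) {
        Set<Task> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Task> stack = new ArrayDeque<>(roots);
        long count = 0;
        while (!stack.isEmpty()) {
            Task t = stack.pop();
            if (!seen.add(t)) {
                continue; // already counted via another path
            }
            count++;
            stack.addAll(t.childTasks);
        }
        return count;
    }

    public static void main(String[] args) {
        // 200 stages of width 2 is now trivially cheap: exactly one visit per task.
        System.out.println(dedupCount(ReplDagShape.build(200, 2))); // prints 1400
    }
}
{code}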



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
