Re: Experimental results using TPC-DS (versus Spark and Presto)

Prasanth Jayachandran Tue, 31 Jan 2017 15:56:56 -0800

Hi Dongwon

Thanks for the presentation! Very insightful.
I just filed a bug for query72. Hive’s CBO seems to be selecting wrong join 
order. https://issues.apache.org/jira/browse/HIVE-15771


In the following link you can find a rewrite for the query which gives much 
better runtime (in my testing I was able to run in 130s on 1TB scale on 6 node 
LLAP cluster).
I also disabled the date filter that turns into NULL > NULL + 5 expression in 
the queries.
Ideally, we want CBO to pick up join order in the rewritten query (which should 
be fixed with HIVE-15771).

https://transfer.sh/E0XfV/q72.zip

The above link contains
- original query
- modified query
- the explain output (txt and svg) for original and modified

Thanks for reporting the issue and I hope this helps.

Thanks
Prasanth

On Jan 30, 2017, at 9:39 PM, 김동원 
<eastcirc...@gmail.com<mailto:eastcirc...@gmail.com>> wrote:

gopal :
In the attached  gopal.tar.gz, I put two svg images and two text files after 
rerunning query72 with and without the following inequation:
and d3.d_date > d1.d_date + 5

FYI, I already did Hive experiments with and without the inequation because 
Presto doesn't allow it at the time of query submission,
but Hive's running times are not that different.
<to_gopal.tar.gz>
- Dongwon

2017. 1. 31. 오후 12:48, Gopal Vijayaraghavan 
<gop...@apache.org<mailto:gop...@apache.org>> 작성:


Gopal : (yarn logs -application $APPID) doesn't contain a line
containing HISTORY so it doesn't produce svg file. Should I turn on
some option to get the lines containing HISTORY in yarn application
log?

There's a config option tez.am.log.level=INFO which controls who much data is 
written to the log there.

I think there's an interval type clause in the 72 query, which might be a 
problem.

and d3.d_date > d1.d_date + 5

That might be doing UDFToDouble(d_date) > UDFToDouble(d_date) + 5, which will 
evaluate into

NULL > NULL + 5

Because UDFToDouble("1997-01-01") is NULL.

So, seeing your explain would go a long way in finding out what's going on.

The swimlane raw data is also somewhat interesting to me, because I also draw a 
differen t set of graphs from the same HISTORY data.

http://people.apache.org/~gopalv/q21_suppliers_who_kept_orders_waiting.svg

to locate bottlenecks in the system.

Cheers,
Gopal

Re: Experimental results using TPC-DS (versus Spark and Presto)

Reply via email to