I use the default (planner) optimizer, not ORCA. The resource manager setting is 'none', not YARN.
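For reference, both settings can be confirmed from a psql session. This is a sketch: `optimizer` is the GUC that toggles ORCA in GPDB/HAWQ, and `hawq_global_rm_type` is, to my understanding, the HAWQ resource manager mode GUC; the names are worth double-checking against your version's documentation.

```sql
-- 'off' means the default (planner) optimizer; 'on' means ORCA
SHOW optimizer;

-- 'none' means standalone resource management; 'yarn' means YARN mode
SHOW hawq_global_rm_type;
```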
After analyzing all tables, the problem persists.

2017-01-16 21:50 GMT+08:00 Ruilong Huo <[email protected]>:

> Which optimizer do you use in them respectively? Are the tables analyzed
> before you run the query?
>
> Best regards,
> Ruilong Huo
>
> On Mon, Jan 16, 2017 at 9:32 PM, Guo Kai <[email protected]> wrote:
>
>>> Hi, All.
>>>
>>> I ran TPC-H on a HAWQ cluster. The cluster consists of 1 master node and
>>> 3 segment nodes, co-located with the HDFS cluster. I also built a GPDB
>>> cluster on the same nodes.
>>>
>>> Each node (each VM, to be exact) has the same configuration -- 4 GB
>>> memory, 1 vCPU, and shared host networking. I used dbgen to generate
>>> 3 GB of data, then loaded it into both HAWQ and GPDB.
>>>
>>> I ran Query 4 of TPC-H. GPDB finished it in a few seconds, but HAWQ got
>>> stuck for tens of minutes or more.
>>>
>>> Query 4 is as follows:
>>>
>>> select
>>>     o_orderpriority,
>>>     count(*) as order_count
>>> from
>>>     orders
>>> where
>>>     o_orderdate >= date '1996-03-01'
>>>     and o_orderdate < date '1996-03-01' + interval '3 month'
>>>     and exists (
>>>         select
>>>             *
>>>         from
>>>             lineitem
>>>         where
>>>             l_orderkey = o_orderkey
>>>             and l_commitdate < l_receiptdate
>>>     )
>>> group by
>>>     o_orderpriority
>>> order by
>>>     o_orderpriority
>>> LIMIT 1;
>>>
>>> Moreover, I retrieved both query plans. I found that HAWQ produced a
>>> plan with a huge cost while GPDB produced a cheap one. The obvious
>>> difference between them is that HAWQ chose a nested loop join while
>>> GPDB chose a hash join. I checked that the parameter 'enable_nestloop'
>>> is off by default.
>>>
>>> The GPDB plan is as follows:
>>>
>>>                                    QUERY PLAN
>>> ----------------------------------------------------------------------------------
>>>  Limit  (cost=407633.93..407633.96 rows=1 width=72)
>>>    ->  Gather Motion 12:1  (slice3; segments: 12)  (cost=407633.93..407633.96 rows=1 width=72)
>>>          Merge Key: partial_aggregation.o_orderpriority
>>>          ->  Limit  (cost=407633.93..407633.94 rows=1 width=72)
>>>                ->  Sort  (cost=407633.93..407633.94 rows=1 width=72)
>>>                      Sort Key (Limit): partial_aggregation.o_orderpriority
>>>                      ->  HashAggregate  (cost=407633.91..407633.92 rows=1 width=72)
>>>                            Group By: orders.o_orderpriority
>>>                            ->  Redistribute Motion 12:12  (slice2; segments: 12)  (cost=407633.88..407633.90 rows=1 width=72)
>>>                                  Hash Key: orders.o_orderpriority
>>>                                  ->  HashAggregate  (cost=407633.88..407633.88 rows=1 width=72)
>>>                                        Group By: orders.o_orderpriority
>>>                                        ->  HashAggregate  (cost=404041.00..406436.25 rows=19961 width=74)
>>>                                              Group By: orders.ctid::bigint, orders.gp_segment_id
>>>                                              ->  Hash Join  (cost=89634.53..402843.37 rows=19961 width=26)
>>>                                                    Hash Cond: lineitem.l_orderkey = orders.o_orderkey
>>>                                                    ->  Append-only Columnar Scan on lineitem  (cost=0.00..295217.60 rows=499906 width=4)
>>>                                                          Filter: l_commitdate < l_receiptdate
>>>                                                    ->  Hash  (cost=87388.56..87388.56 rows=14974 width=34)
>>>                                                          ->  Redistribute Motion 12:12  (slice1; segments: 12)  (cost=0.00..87388.56 rows=14974 width=34)
>>>                                                                Hash Key: orders.o_orderkey
>>>                                                                ->  Append-only Columnar Scan on orders  (cost=0.00..83795.00 rows=14974 width=34)
>>>                                                                      Filter: o_orderdate >= '1996-03-01'::date AND o_orderdate < '1996-06-01 00:00:00'::timestamp without time zone
>>> (23 rows)
>>>
>>> The HAWQ plan is as follows:
>>>
>>>                                    QUERY PLAN
>>> ----------------------------------------------------------------------------------
>>>  Limit  (cost=424555364732.86..424555364732.88 rows=1 width=72)
>>>    ->  Gather Motion 18:1  (slice4; segments: 18)  (cost=424555364732.86..424555364732.88 rows=1 width=72)
>>>          Merge Key: partial_aggregation.o_orderpriority
>>>          ->  Limit  (cost=424555364732.86..424555364732.86 rows=1 width=72)
>>>                ->  Sort  (cost=424555364732.86..424555364732.86 rows=1 width=72)
>>>                      Sort Key (Limit): partial_aggregation.o_orderpriority
>>>                      ->  HashAggregate  (cost=424555364732.84..424555364732.85 rows=1 width=72)
>>>                            Group By: orders.o_orderpriority
>>>                            ->  Redistribute Motion 18:18  (slice3; segments: 18)  (cost=424555364732.80..424555364732.82 rows=1 width=72)
>>>                                  Hash Key: orders.o_orderpriority
>>>                                  ->  HashAggregate  (cost=424555364732.80..424555364732.80 rows=1 width=72)
>>>                                        Group By: orders.o_orderpriority
>>>                                        ->  HashAggregate  (cost=424555361238.53..424555363568.04 rows=12942 width=74)
>>>                                              Group By: orders.ctid::bigint, orders.gp_segment_id
>>>                                              ->  Result  (cost=122726.35..424555360073.77 rows=12942 width=26)
>>>                                                    ->  Redistribute Motion 18:18  (slice2; segments: 18)  (cost=122726.35..424555360073.77 rows=12942 width=26)
>>>                                                          Hash Key: orders.ctid
>>>                                                          ->  Nested Loop  (cost=122726.35..424555355414.73 rows=12942 width=26)
>>>                                                                Join Filter: lineitem.l_orderkey = orders.o_orderkey
>>>                                                                ->  Parquet table Scan on lineitem  (cost=0.00..318758.60 rows=333271 width=4)
>>>                                                                      Filter: l_commitdate < l_receiptdate
>>>                                                                ->  Materialize  (cost=122726.35..154180.79 rows=174747 width=34)
>>>                                                                      ->  Broadcast Motion 18:18  (slice1; segments: 18)  (cost=0.00..119580.91 rows=174747 width=34)
>>>                                                                            ->  Parquet table Scan on orders  (cost=0.00..86379.00 rows=9709 width=34)
>>>                                                                                  Filter: o_orderdate >= '1996-03-01'::date AND o_orderdate < '1996-06-01 00:00:00'::timestamp without time zone
>>>  Settings:  default_hash_table_bucket_number=18
>>> (26 rows)
>>>
>>> Please suggest explanations for why this happens. Thank you!
>>>
>

-- 
Guo Kai, [email protected]
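One diagnostic for the nested-loop plan above: Query 4's correlated EXISTS can be rewritten as an uncorrelated IN subquery, which sometimes lets the planner consider a hash semi-join where it otherwise picks a nested loop. This is only a sketch for comparison; for this query the two forms return the same rows, since the test is a positive membership check on `l_orderkey`.

```sql
select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1996-03-01'
    and o_orderdate < date '1996-03-01' + interval '3 month'
    and o_orderkey in (
        select l_orderkey
        from lineitem
        where l_commitdate < l_receiptdate
    )
group by
    o_orderpriority
order by
    o_orderpriority
LIMIT 1;
```

Comparing EXPLAIN output for this variant against the original would show whether the plan shape, rather than costing of the tables themselves, drives the difference.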

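Since a nested-loop choice at this scale usually points to missing or stale statistics, it may also help to confirm that ANALYZE actually populated the catalogs. A sketch using the standard PostgreSQL catalogs that HAWQ and GPDB inherit; `reltuples` remains 0 for a table that has never been analyzed:

```sql
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname IN ('orders', 'lineitem');
```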