Which optimizer are you using in each of them? Were the tables analyzed
before you ran the query?
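
For reference, a minimal sketch of how to check both points in each system,
assuming the stock 'optimizer' GUC and the standard TPC-H table names:

    -- 'on' means GPORCA is in use, 'off' means the legacy planner
    SHOW optimizer;

    -- Refresh statistics on the two tables referenced by Q4
    ANALYZE orders;
    ANALYZE lineitem;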

Best regards,
Ruilong Huo

On Mon, Jan 16, 2017 at 9:32 PM, Guo Kai <[email protected]> wrote:

>
>> Hi, All.
>>
>> I ran TPC-H on a HAWQ cluster. The cluster consists of 1 master node and
>> 3 segment nodes, and the HDFS cluster has the same layout. I also built a
>> GPDB cluster on the same nodes.
>>
>> Each node (actually each VM) has the same configuration -- 4GB memory, 1
>> vCPU, and a shared host network. I used dbgen to generate 3GB of data, then
>> loaded it into HAWQ and GPDB.
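>>
>> For reference, a minimal load sketch ("dbgen -s 3" emits roughly 3GB of
>> pipe-delimited .tbl files; the paths below are placeholders):
>>
>> -- dbgen writes a trailing '|' on every line, which may need to be
>> -- stripped before COPY will accept the files
>> COPY orders FROM '/data/tpch/orders.tbl' WITH DELIMITER '|';
>> COPY lineitem FROM '/data/tpch/lineitem.tbl' WITH DELIMITER '|';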
>>
>> I ran Query 4 of TPC-H. GPDB finished it in a few seconds, but HAWQ was
>> still running after tens of minutes.
>>
>> Query 4 is as follows:
>>
>> select
>>         o_orderpriority,
>>         count(*) as order_count
>> from
>>         orders
>> where
>>         o_orderdate >= date '1996-03-01'
>>         and o_orderdate < date '1996-03-01' + interval '3 month'
>>         and exists (
>>                 select
>>                         *
>>                 from
>>                         lineitem
>>                 where
>>                         l_orderkey = o_orderkey
>>                         and l_commitdate < l_receiptdate
>>         )
>> group by
>>         o_orderpriority
>> order by
>>         o_orderpriority
>> LIMIT 1;
>>
>> Moreover, I retrieved both query plans. HAWQ produced a plan with a huge
>> cost, while GPDB produced a much cheaper one. The obvious difference between
>> them is that HAWQ chose a nested loop join while GPDB chose a hash join. I
>> checked that the parameter 'enable_nestloop' is off by default.
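>>
>> For reference, a sketch of these checks, using the standard GUC and EXPLAIN:
>>
>> -- Confirm the nested-loop setting on each system
>> SHOW enable_nestloop;
>>
>> -- Prefix Q4 with EXPLAIN to get the plan without executing the query, e.g.
>> -- EXPLAIN select o_orderpriority, count(*) as order_count from ... ;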
>>
>> The GPDB plan is as follows:
>>
>>                                                                                QUERY PLAN
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=407633.93..407633.96 rows=1 width=72)
>>    ->  Gather Motion 12:1  (slice3; segments: 12)  (cost=407633.93..407633.96 rows=1 width=72)
>>          Merge Key: partial_aggregation.o_orderpriority
>>          ->  Limit  (cost=407633.93..407633.94 rows=1 width=72)
>>                ->  Sort  (cost=407633.93..407633.94 rows=1 width=72)
>>                      Sort Key (Limit): partial_aggregation.o_orderpriority
>>                      ->  HashAggregate  (cost=407633.91..407633.92 rows=1 width=72)
>>                            Group By: orders.o_orderpriority
>>                            ->  Redistribute Motion 12:12  (slice2; segments: 12)  (cost=407633.88..407633.90 rows=1 width=72)
>>                                  Hash Key: orders.o_orderpriority
>>                                  ->  HashAggregate  (cost=407633.88..407633.88 rows=1 width=72)
>>                                        Group By: orders.o_orderpriority
>>                                        ->  HashAggregate  (cost=404041.00..406436.25 rows=19961 width=74)
>>                                              Group By: orders.ctid::bigint, orders.gp_segment_id
>>                                              ->  Hash Join  (cost=89634.53..402843.37 rows=19961 width=26)
>>                                                    Hash Cond: lineitem.l_orderkey = orders.o_orderkey
>>                                                    ->  Append-only Columnar Scan on lineitem  (cost=0.00..295217.60 rows=499906 width=4)
>>                                                          Filter: l_commitdate < l_receiptdate
>>                                                    ->  Hash  (cost=87388.56..87388.56 rows=14974 width=34)
>>                                                          ->  Redistribute Motion 12:12  (slice1; segments: 12)  (cost=0.00..87388.56 rows=14974 width=34)
>>                                                                Hash Key: orders.o_orderkey
>>                                                                ->  Append-only Columnar Scan on orders  (cost=0.00..83795.00 rows=14974 width=34)
>>                                                                      Filter: o_orderdate >= '1996-03-01'::date AND o_orderdate < '1996-06-01 00:00:00'::timestamp without time zone
>> (23 rows)
>>
>> The HAWQ plan is as follows:
>>
>>                                                                                QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=424555364732.86..424555364732.88 rows=1 width=72)
>>    ->  Gather Motion 18:1  (slice4; segments: 18)  (cost=424555364732.86..424555364732.88 rows=1 width=72)
>>          Merge Key: partial_aggregation.o_orderpriority
>>          ->  Limit  (cost=424555364732.86..424555364732.86 rows=1 width=72)
>>                ->  Sort  (cost=424555364732.86..424555364732.86 rows=1 width=72)
>>                      Sort Key (Limit): partial_aggregation.o_orderpriority
>>                      ->  HashAggregate  (cost=424555364732.84..424555364732.85 rows=1 width=72)
>>                            Group By: orders.o_orderpriority
>>                            ->  Redistribute Motion 18:18  (slice3; segments: 18)  (cost=424555364732.80..424555364732.82 rows=1 width=72)
>>                                  Hash Key: orders.o_orderpriority
>>                                  ->  HashAggregate  (cost=424555364732.80..424555364732.80 rows=1 width=72)
>>                                        Group By: orders.o_orderpriority
>>                                        ->  HashAggregate  (cost=424555361238.53..424555363568.04 rows=12942 width=74)
>>                                              Group By: orders.ctid::bigint, orders.gp_segment_id
>>                                              ->  Result  (cost=122726.35..424555360073.77 rows=12942 width=26)
>>                                                    ->  Redistribute Motion 18:18  (slice2; segments: 18)  (cost=122726.35..424555360073.77 rows=12942 width=26)
>>                                                          Hash Key: orders.ctid
>>                                                          ->  Nested Loop  (cost=122726.35..424555355414.73 rows=12942 width=26)
>>                                                                Join Filter: lineitem.l_orderkey = orders.o_orderkey
>>                                                                ->  Parquet table Scan on lineitem  (cost=0.00..318758.60 rows=333271 width=4)
>>                                                                      Filter: l_commitdate < l_receiptdate
>>                                                                ->  Materialize  (cost=122726.35..154180.79 rows=174747 width=34)
>>                                                                      ->  Broadcast Motion 18:18  (slice1; segments: 18)  (cost=0.00..119580.91 rows=174747 width=34)
>>                                                                            ->  Parquet table Scan on orders  (cost=0.00..86379.00 rows=9709 width=34)
>>                                                                                  Filter: o_orderdate >= '1996-03-01'::date AND o_orderdate < '1996-06-01 00:00:00'::timestamp without time zone
>>  Settings:  default_hash_table_bucket_number=18
>> (26 rows)
>>
>> Please suggest explanations for why this happens. Thank you!
>>
>>
