Hi Junjie,
The following error is caused by an old Hive metastore schema version. It can be
resolved by running the metastore schema upgrade script.
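For the schema-mismatch error quoted later in this thread (missing `BIT_VECTOR` column in `TAB_COL_STATS`), the usual fix is Hive's schematool. A sketch, assuming a MySQL-backed metastore; adjust `-dbType` (postgres, oracle, derby) to match your deployment:

```shell
# Back up the metastore database first; the schema upgrade is not reversible.
# -dbType mysql is an assumption here, not confirmed in this thread.
$HIVE_HOME/bin/schematool -dbType mysql -info           # show current schema version
$HIVE_HOME/bin/schematool -dbType mysql -upgradeSchema  # apply the upgrade scripts
```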

Hi Mostafa,
About “compute stats”: is it used to help Impala choose the join strategy 
(broadcast join or partitioned join), since better statistics are then available?
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_perf_joins.html#perf_joins
From the attached plan picture, the join between store_sales and 
catalog_sales is already a partitioned join.
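To see which strategy the planner picks once stats are in place, one can compute stats on both tables and inspect the plan. A minimal sketch in Impala SQL, using the table names from this thread:

```sql
-- Gather table and column stats so the planner can cost both strategies.
COMPUTE STATS store_sales;
COMPUTE STATS catalog_sales;

-- The plan annotates the exchange feeding the hash join's build side
-- with either BROADCAST or PARTITIONED.
EXPLAIN
SELECT COUNT(*)
FROM store_sales ss
JOIN catalog_sales cs ON ss.ss_item_sk = cs.cs_item_sk;
```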

Also from this picture, the broadcast of catalog_sales is very fast, and I 
think this table is used as the build side while store_sales is probed. The MAX 
and AVG times for the broadcast and exchange are much lower than those for the 
hash join (MAX 2h, AVG 2h). I'm not sure whether we can increase the parallelism 
of the probe side. If only single-threaded mode is supported now, are there any 
tuning tips for this case (a join of two big tables)? Thank you!
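One knob worth trying (a suggestion on my part, not something confirmed in this thread) is Impala's join hints, which override the planner's choice of distribution strategy:

```sql
-- Force a partitioned (shuffle) join: both inputs are hash-partitioned
-- on the join key, so each node probes only its share of the data.
SELECT COUNT(*)
FROM store_sales ss
JOIN [shuffle] catalog_sales cs ON ss.ss_item_sk = cs.cs_item_sk;

-- Alternatively, force a broadcast join (only sensible when the
-- right-hand table fits in memory on every node).
SELECT COUNT(*)
FROM store_sales ss
JOIN [broadcast] catalog_sales cs ON ss.ss_item_sk = cs.cs_item_sk;
```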

Best Regards
Ferdinand Xu

From: 俊杰陈 [mailto:[email protected]]
Sent: Friday, October 27, 2017 1:38 PM
To: [email protected]; [email protected]
Subject: Re: performance issue on big table join

Thanks, let me put the details here.

Yes, the query is intended to verify the parallelism of a partitioned join. We 
want to know how many fragment instances are started for a hash join on a single 
node, and how many threads within a fragment instance perform the join. I'm not 
sure whether only one thread participates in the hash join; if so, it cannot 
maximize CPU utilization.
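The per-operator host counts and times can be read from the plan and profile. A sketch of how to look in impala-shell (the commands are standard; the interpretation is mine):

```sql
-- A more detailed plan shows, per operator, how many hosts run it.
SET EXPLAIN_LEVEL=2;
EXPLAIN
SELECT COUNT(*)
FROM store_sales ss
JOIN catalog_sales cs ON ss.ss_item_sk = cs.cs_item_sk;

-- After actually running the query, these impala-shell commands show
-- the exec summary (#Hosts, Avg/Max time per operator) and full profile,
-- including the HASH JOIN node's timings.
SUMMARY;
PROFILE;
```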

The compute stats command hit the following error:

Query: compute stats store_sales
WARNINGS: ImpalaRuntimeException: Error making 'updateTableColumnStatistics' 
RPC to Hive Metastore:
CAUSED BY: MetaException: Insert of object 
"org.apache.hadoop.hive.metastore.model.MTableColumnStatistics@49fe55a1"
 using statement "INSERT INTO `TAB_COL_STATS` 
(`CS_ID`,`AVG_COL_LEN`,`BIT_VECTOR`,`COLUMN_NAME`,`COLUMN_TYPE`,`DB_NAME`,`BIG_DECIMAL_HIGH_VALUE`,`BIG_DECIMAL_LOW_VALUE`,`DOUBLE_HIGH_VALUE`,`DOUBLE_LOW_VALUE`,`LAST_ANALYZED`,`LONG_HIGH_VALUE`,`LONG_LOW_VALUE`,`MAX_COL_LEN`,`NUM_DISTINCTS`,`NUM_FALSES`,`NUM_NULLS`,`NUM_TRUES`,`TBL_ID`,`TABLE_NAME`)
 VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)" failed : Unknown column 
'BIT_VECTOR' in 'field list'


2017-10-27 12:02 GMT+08:00 Mostafa Mokhtar <[email protected]>:
Hi,

Looks like you are joining store_sales with catalog_sales on item_sk. This
kind of join condition is many-to-many, which means the number of output
rows will be much larger than the number of input rows; not sure if this is
intended.

Also did you run "compute stats [TABLE_NAME]" on both tables?

For a more comprehensive query, try TPC-DS Q17:

select  i_item_id
       ,i_item_desc
       ,s_state
       ,count(ss_quantity) as store_sales_quantitycount
       ,avg(ss_quantity) as store_sales_quantityave
       ,stddev_samp(ss_quantity) as store_sales_quantitystdev
       ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
       ,count(sr_return_quantity) as store_returns_quantitycount
       ,avg(sr_return_quantity) as store_returns_quantityave
       ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
       ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov
       ,count(cs_quantity) as catalog_sales_quantitycount
       ,avg(cs_quantity) as catalog_sales_quantityave
       ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
       ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
 from store_sales
     ,store_returns
     ,catalog_sales
     ,date_dim d1
     ,date_dim d2
     ,date_dim d3
     ,store
     ,item
 where d1.d_quarter_name = '2000Q1'
   and d1.d_date_sk = ss_sold_date_sk
   and i_item_sk = ss_item_sk
   and s_store_sk = ss_store_sk
   and ss_customer_sk = sr_customer_sk
   and ss_item_sk = sr_item_sk
   and ss_ticket_number = sr_ticket_number
   and sr_returned_date_sk = d2.d_date_sk
   and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
   and sr_customer_sk = cs_bill_customer_sk
   and sr_item_sk = cs_item_sk
   and cs_sold_date_sk = d3.d_date_sk
   and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
 group by i_item_id
         ,i_item_desc
         ,s_state
 order by i_item_id
         ,i_item_desc
         ,s_state
limit 100;


I recommend moving this kind of discussion to [email protected].

On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <[email protected]> wrote:

> The profile file is damaged. Here is a screenshot of the exec summary.
>
> [inline screenshot: exec summary]
>
> 2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>:
>
>> Hi Devs
>>
>> I met a performance issue with a big table join. The query takes more than 3
>> hours on Impala but only 3 minutes on Spark SQL on the same 5-node cluster.
>> When running the query, the left scanner and exchange node are very slow.
>> Did I miss some key arguments?
>>
>> You can see the profile file in the attachment.
>>
>>
>> --
>> Thanks & Best Regards
>>
>
>
>
> --
> Thanks & Best Regards
>



--
Thanks & Best Regards
