See my response on the other thread you started. The probe side of a join is executed in a single thread per host. Impala can run multiple builds in parallel, but each build uses only a single thread. A single query might not be able to max out your CPU, but most realistic workloads run several queries concurrently.
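
If you want to see this for yourself, one rough way is to check how many hosts and fragment instances each join operator gets. A minimal sketch below - the store_sales/catalog_sales join is just an assumed stand-in for your actual query - raises the explain level so the plan shows per-fragment host counts; after the query runs, "summary;" and "profile;" in impala-shell show how many instances each operator actually ran with:

  -- show per-fragment host counts in the plan (extended explain)
  set explain_level=2;
  explain
  select count(*)
  from store_sales ss
    join catalog_sales cs
      on ss.ss_item_sk = cs.cs_item_sk;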
On Thu, Oct 26, 2017 at 10:38 PM, 俊杰陈 <[email protected]> wrote:

> Thanks, let me put it here.
>
> Yes, the query is intended to verify the parallelism of a partitioned
> join. We want to know how many fragment instances are started for a hash
> join on a single node, and how many threads in a fragment instance
> perform the join. I'm not sure whether only one thread participates in
> the hash join, since that would not maximize CPU utilization.
>
> The compute stats ran into the following error:
>
> Query: compute stats store_sales
> WARNINGS: ImpalaRuntimeException: Error making
> 'updateTableColumnStatistics' RPC to Hive Metastore:
> CAUSED BY: MetaException: Insert of object
> "org.apache.hadoop.hive.metastore.model.MTableColumnStatistics@49fe55a1"
> using statement "INSERT INTO `TAB_COL_STATS`
> (`CS_ID`,`AVG_COL_LEN`,`BIT_VECTOR`,`COLUMN_NAME`,`COLUMN_TYPE`,`DB_NAME`,
> `BIG_DECIMAL_HIGH_VALUE`,`BIG_DECIMAL_LOW_VALUE`,`DOUBLE_HIGH_VALUE`,
> `DOUBLE_LOW_VALUE`,`LAST_ANALYZED`,`LONG_HIGH_VALUE`,`LONG_LOW_VALUE`,
> `MAX_COL_LEN`,`NUM_DISTINCTS`,`NUM_FALSES`,`NUM_NULLS`,`NUM_TRUES`,
> `TBL_ID`,`TABLE_NAME`) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
> failed : Unknown column 'BIT_VECTOR' in 'field list'
>
> 2017-10-27 12:02 GMT+08:00 Mostafa Mokhtar <[email protected]>:
>
>> Hi,
>>
>> It looks like you are joining store_sales with catalog_sales on item_sk.
>> This kind of join condition is many-to-many, which means the number of
>> output rows will be much larger than the number of input rows; not sure
>> if that is intended.
>>
>> Also, did you run "compute stats [TABLE_NAME]" on both tables?
>>
>> For a more comprehensive query, try TPCDS Q17:
>>
>> select i_item_id
>>       ,i_item_desc
>>       ,s_state
>>       ,count(ss_quantity) as store_sales_quantitycount
>>       ,avg(ss_quantity) as store_sales_quantityave
>>       ,stddev_samp(ss_quantity) as store_sales_quantitystdev
>>       ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
>>       ,count(sr_return_quantity) as store_returns_quantitycount
>>       ,avg(sr_return_quantity) as store_returns_quantityave
>>       ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
>>       ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov
>>       ,count(cs_quantity) as catalog_sales_quantitycount
>>       ,avg(cs_quantity) as catalog_sales_quantityave
>>       ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
>>       ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
>> from store_sales
>>     ,store_returns
>>     ,catalog_sales
>>     ,date_dim d1
>>     ,date_dim d2
>>     ,date_dim d3
>>     ,store
>>     ,item
>> where d1.d_quarter_name = '2000Q1'
>>   and d1.d_date_sk = ss_sold_date_sk
>>   and i_item_sk = ss_item_sk
>>   and s_store_sk = ss_store_sk
>>   and ss_customer_sk = sr_customer_sk
>>   and ss_item_sk = sr_item_sk
>>   and ss_ticket_number = sr_ticket_number
>>   and sr_returned_date_sk = d2.d_date_sk
>>   and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>>   and sr_customer_sk = cs_bill_customer_sk
>>   and sr_item_sk = cs_item_sk
>>   and cs_sold_date_sk = d3.d_date_sk
>>   and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>> group by i_item_id
>>         ,i_item_desc
>>         ,s_state
>> order by i_item_id
>>         ,i_item_desc
>>         ,s_state
>> limit 100;
>>
>> I recommend moving this kind of discussion to [email protected].
>>
>> On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <[email protected]> wrote:
>>
>> > The profile file is damaged.
>> > Here is a screenshot of the exec summary.
>> >
>> > 2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>:
>> >
>> >> Hi Devs,
>> >>
>> >> I ran into a performance issue on a big table join. The query takes
>> >> more than 3 hours on Impala and only 3 minutes on Spark SQL on the
>> >> same 5-node cluster. While the query is running, the left scanner and
>> >> the exchange node are very slow. Did I miss some key arguments?
>> >>
>> >> You can see the profile file in the attachment.
>> >>
>> >> --
>> >> Thanks & Best Regards
>> >
>> >
>> > --
>> > Thanks & Best Regards
>
>
> --
> Thanks & Best Regards
