Thanks, Alex.

Cheng, build-phase multithreading should depend on the Disk IO manager.


2017-10-27 13:50 GMT+08:00 Xu, Cheng A <[email protected]>:

> Thanks Alexander for the reply. Is there any configuration we can use to
> determine the parallelism level for the build phase? Thank you!
>
>
>
> Thanks
>
> Ferdinand Xu
>
>
>
> *From:* Alexander Behm [mailto:[email protected]]
> *Sent:* Friday, October 27, 2017 1:46 PM
> *To:* [email protected]
> *Cc:* Mostafa Mokhtar <[email protected]>
> *Subject:* Re: performance issue on big table join
>
>
>
> See my response on the other thread you started. The probe side of a join
> is executed in a single thread per host. Impala can run multiple builds in
> parallel, but each build uses only a single thread.
>
> A single query might not be able to max out your CPU, but most realistic
> workloads run several queries concurrently.
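A minimal sketch of how one might check this per-host parallelism from impala-shell (hedged: the MT_DOP query option exists in Impala, but in the versions current to this thread it was rejected for plans containing joins, so this only applies to scan/aggregation plans; `store_sales` is the table from the thread):

```sql
-- Hypothetical impala-shell session: request more fragment instances
-- per host. At the time of this thread, Impala rejected MT_DOP > 0
-- for plans with joins, so this is scan/agg-only.
SET MT_DOP=4;
SELECT count(*) FROM store_sales;
-- After the query runs, the exec summary's "#Inst" column shows how
-- many fragment instances executed each operator across the cluster.
```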
>
>
>
>
>
> On Thu, Oct 26, 2017 at 10:38 PM, 俊杰陈 <[email protected]> wrote:
>
> Thanks, let me put here.
>
>
>
> Yes, the query is intended to verify the parallelism of a partitioned join.
> We want to know how many fragment instances are started for a hash join on
> a single node, and how many threads in a fragment instance perform the
> join. I'm not sure whether only one thread participates in the hash join,
> since that would not maximize CPU utilization.
>
>
>
> The compute stats command hit the following error:
>
>
>
> Query: compute stats store_sales
>
> WARNINGS: ImpalaRuntimeException: Error making
> 'updateTableColumnStatistics' RPC to Hive Metastore:
> CAUSED BY: MetaException: Insert of object
> "org.apache.hadoop.hive.metastore.model.MTableColumnStatistics@49fe55a1"
> using statement "INSERT INTO `TAB_COL_STATS`
> (`CS_ID`,`AVG_COL_LEN`,`BIT_VECTOR`,`COLUMN_NAME`,`COLUMN_TYPE`,`DB_NAME`,
> `BIG_DECIMAL_HIGH_VALUE`,`BIG_DECIMAL_LOW_VALUE`,`DOUBLE_HIGH_VALUE`,
> `DOUBLE_LOW_VALUE`,`LAST_ANALYZED`,`LONG_HIGH_VALUE`,`LONG_LOW_VALUE`,
> `MAX_COL_LEN`,`NUM_DISTINCTS`,`NUM_FALSES`,`NUM_NULLS`,`NUM_TRUES`,
> `TBL_ID`,`TABLE_NAME`) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
> failed : Unknown column 'BIT_VECTOR' in 'field list'
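This error suggests a metastore schema mismatch rather than a query problem: the `BIT_VECTOR` column was added to `TAB_COL_STATS` in the Hive 2.x metastore schema, so a Hive Metastore running newer jars against an older backing database fails this insert. The supported fix is Hive's `schematool -upgradeSchema`; the statement below is only a sketch of what that upgrade includes, assuming a MySQL-backed metastore:

```sql
-- Hypothetical repair for a MySQL-backed metastore whose schema predates
-- the HMS jars in use: add the column the newer code expects. Prefer
-- running Hive's schematool upgrade, which applies this among other DDL.
ALTER TABLE TAB_COL_STATS ADD COLUMN BIT_VECTOR blob;
```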
>
>
>
>
>
> 2017-10-27 12:02 GMT+08:00 Mostafa Mokhtar <[email protected]>:
>
> Hi,
>
> Looks like you are joining store_sales with catalog_sales on item_sk. This
> kind of join condition is many-to-many, which means the number of output
> rows will be much larger than the number of input rows; not sure if this
> is intended.
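A many-to-many join on item_sk alone multiplies the matching rows per key, so the fan-out can be computed exactly from per-key counts. A quick sanity check, sketched against the TPC-DS table names used above:

```sql
-- Exact output cardinality of a join on item_sk alone: for each key,
-- |store_sales rows| x |catalog_sales rows|, summed over all keys.
SELECT sum(ss.cnt * cs.cnt) AS join_output_rows
FROM (SELECT ss_item_sk AS item_sk, count(*) AS cnt
      FROM store_sales GROUP BY ss_item_sk) ss
JOIN (SELECT cs_item_sk AS item_sk, count(*) AS cnt
      FROM catalog_sales GROUP BY cs_item_sk) cs
  ON ss.item_sk = cs.item_sk;
```

If this number dwarfs the input row counts, the join condition is likely missing predicates (Q17 below joins through customer and ticket keys as well, which keeps the output bounded).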
>
> Also did you run "compute stats [TABLE_NAME]" on both tables?
>
> For a more comprehensive query, try TPC-DS Q17:
>
> select  i_item_id
>        ,i_item_desc
>        ,s_state
>        ,count(ss_quantity) as store_sales_quantitycount
>        ,avg(ss_quantity) as store_sales_quantityave
>        ,stddev_samp(ss_quantity) as store_sales_quantitystdev
>        ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
>        ,count(sr_return_quantity) as store_returns_quantitycount
>        ,avg(sr_return_quantity) as store_returns_quantityave
>        ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
>        ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov
>        ,count(cs_quantity) as catalog_sales_quantitycount
>        ,avg(cs_quantity) as catalog_sales_quantityave
>        ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
>        ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
>  from store_sales
>      ,store_returns
>      ,catalog_sales
>      ,date_dim d1
>      ,date_dim d2
>      ,date_dim d3
>      ,store
>      ,item
>  where d1.d_quarter_name = '2000Q1'
>    and d1.d_date_sk = ss_sold_date_sk
>    and i_item_sk = ss_item_sk
>    and s_store_sk = ss_store_sk
>    and ss_customer_sk = sr_customer_sk
>    and ss_item_sk = sr_item_sk
>    and ss_ticket_number = sr_ticket_number
>    and sr_returned_date_sk = d2.d_date_sk
>    and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>    and sr_customer_sk = cs_bill_customer_sk
>    and sr_item_sk = cs_item_sk
>    and cs_sold_date_sk = d3.d_date_sk
>    and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>  group by i_item_id
>          ,i_item_desc
>          ,s_state
>  order by i_item_id
>          ,i_item_desc
>          ,s_state
> limit 100;
>
>
> I recommend moving this kind of discussion to
> [email protected].
>
> On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <[email protected]> wrote:
>
> > The profile file is damaged. Here is a screenshot of the exec summary.
> >
> >
> > 2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>:
> >
> >> Hi Devs
> >>
> >> I met a performance issue on a big table join. The query takes more
> >> than 3 hours on Impala but only 3 minutes on Spark SQL on the same
> >> 5-node cluster. When running the query, the left scan and exchange
> >> nodes are very slow. Did I miss some key arguments?
> >>
> >> You can see the profile file in the attachment.
> >>
> >>
>
> >> --
> >> Thanks & Best Regards
> >>
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
>
>
>
>
>
> --
>
> Thanks & Best Regards
>
>
>



-- 
Thanks & Best Regards
