Thanks, Alex. Cheng, build-phase multithreading should depend on the Disk IO manager.
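[Editor's aside on the configuration question in the quoted thread below: the query options in this sketch (MT_DOP, NUM_SCANNER_THREADS) are real Impala session options, but Impala releases of this era may reject MT_DOP > 0 for plans that contain joins, so treat this as a pointer to the available knobs rather than a confirmed fix for build-phase parallelism; the values are illustrative.]

```shell
# Sketch only (values are illustrative, not tuned recommendations).
# MT_DOP and NUM_SCANNER_THREADS are real Impala query options, but
# 2017-era Impala may reject MT_DOP > 0 for plans with joins -- verify
# against your version's documentation and the query profile.
MT_DOP=4
SCANNER_THREADS=8
echo "SET MT_DOP=${MT_DOP};"                        # fragment instances per host
echo "SET NUM_SCANNER_THREADS=${SCANNER_THREADS};"  # threads feeding each scan node
```

Whether any of these actually parallelizes the hash-join build should be checked in the exec summary rather than assumed.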
2017-10-27 13:50 GMT+08:00 Xu, Cheng A <[email protected]>:

> Thanks Alexander for the reply. Is there any configuration we can use
> to determine the parallelism level for the build phase? Thank you!
>
> Thanks,
> Ferdinand Xu

From: Alexander Behm [mailto:[email protected]]
Sent: Friday, October 27, 2017 1:46 PM
To: [email protected]
Cc: Mostafa Mokhtar <[email protected]>
Subject: Re: performance issue on big table join

> See my response on the other thread you started. The probe side of a
> join is executed in a single thread per host. Impala can run multiple
> builds in parallel, but each build uses only a single thread.
>
> A single query might not be able to max out your CPU, but most
> realistic workloads run several queries concurrently.

On Thu, Oct 26, 2017 at 10:38 PM, 俊杰陈 <[email protected]> wrote:

> Thanks, let me put it here.
>
> Yes, the query is intended to verify the parallelism of a partitioned
> join. We want to know how many fragment instances are started for a
> hash join on a single node, and how many threads in a fragment
> instance perform the join. I'm not sure whether only one thread
> participates in the hash join, since that would not maximize CPU
> utilization.
>
> The compute stats command hit the following error:
>
>   Query: compute stats store_sales
>   WARNINGS: ImpalaRuntimeException: Error making
>   'updateTableColumnStatistics' RPC to Hive Metastore:
>   CAUSED BY: MetaException: Insert of object
>   "org.apache.hadoop.hive.metastore.model.MTableColumnStatistics@49fe55a1"
>   using statement "INSERT INTO `TAB_COL_STATS`
>   (`CS_ID`,`AVG_COL_LEN`,`BIT_VECTOR`,`COLUMN_NAME`,`COLUMN_TYPE`,
>   `DB_NAME`,`BIG_DECIMAL_HIGH_VALUE`,`BIG_DECIMAL_LOW_VALUE`,
>   `DOUBLE_HIGH_VALUE`,`DOUBLE_LOW_VALUE`,`LAST_ANALYZED`,
>   `LONG_HIGH_VALUE`,`LONG_LOW_VALUE`,`MAX_COL_LEN`,`NUM_DISTINCTS`,
>   `NUM_FALSES`,`NUM_NULLS`,`NUM_TRUES`,`TBL_ID`,`TABLE_NAME`) VALUES
>   (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)" failed : Unknown column
>   'BIT_VECTOR' in 'field list'

2017-10-27 12:02 GMT+08:00 Mostafa Mokhtar <[email protected]>:

> Hi,
>
> It looks like you are joining store_sales with catalog_sales on
> item_sk. This kind of join condition is many-to-many, which means the
> number of output rows will be much larger than the number of input
> rows; I'm not sure if this is intended.
>
> Also, did you run "compute stats [TABLE_NAME]" on both tables?
>
> For a more comprehensive query, try TPC-DS Q17:
>
>   select i_item_id
>         ,i_item_desc
>         ,s_state
>         ,count(ss_quantity) as store_sales_quantitycount
>         ,avg(ss_quantity) as store_sales_quantityave
>         ,stddev_samp(ss_quantity) as store_sales_quantitystdev
>         ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
>         ,count(sr_return_quantity) as store_returns_quantitycount
>         ,avg(sr_return_quantity) as store_returns_quantityave
>         ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
>         ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov
>         ,count(cs_quantity) as catalog_sales_quantitycount
>         ,avg(cs_quantity) as catalog_sales_quantityave
>         ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
>         ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
>   from store_sales
>       ,store_returns
>       ,catalog_sales
>       ,date_dim d1
>       ,date_dim d2
>       ,date_dim d3
>       ,store
>       ,item
>   where d1.d_quarter_name = '2000Q1'
>     and d1.d_date_sk = ss_sold_date_sk
>     and i_item_sk = ss_item_sk
>     and s_store_sk = ss_store_sk
>     and ss_customer_sk = sr_customer_sk
>     and ss_item_sk = sr_item_sk
>     and ss_ticket_number = sr_ticket_number
>     and sr_returned_date_sk = d2.d_date_sk
>     and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>     and sr_customer_sk = cs_bill_customer_sk
>     and sr_item_sk = cs_item_sk
>     and cs_sold_date_sk = d3.d_date_sk
>     and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>   group by i_item_id
>           ,i_item_desc
>           ,s_state
>   order by i_item_id
>           ,i_item_desc
>           ,s_state
>   limit 100;
>
> I recommend moving this kind of discussion to
> [email protected].

On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <[email protected]> wrote:

> The profile file is damaged. Here is a screenshot of the exec summary.

2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>:

> Hi Devs,
>
> I hit a performance issue on a big-table join. The query takes more
> than 3 hours on Impala but only 3 minutes on Spark SQL on the same
> 5-node cluster. While the query is running, the left scanner and
> exchange node are very slow. Did I miss some key arguments?
>
> You can see the profile file in the attachment.
>
> --
> Thanks & Best Regards

-- 
Thanks & Best Regards
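[Editor's note on the "Unknown column 'BIT_VECTOR'" MetaException quoted above: that column was added to the metastore's TAB_COL_STATS table in the Hive 2.x schema, so the error usually means the metastore's backing database schema is older than the Hive libraries in use. A hedged sketch of the usual remedy with Hive's schematool follows; the -dbType value is an assumption, and the metastore database should be backed up before upgrading.]

```shell
# Hypothetical recovery sketch for the BIT_VECTOR MetaException: bring
# the Hive metastore schema up to the version of the Hive jars so that
# TAB_COL_STATS gains the BIT_VECTOR column. Commands are echoed rather
# than run; -dbType mysql is an assumption (could be postgres, oracle,
# mssql, or derby in your deployment).
DB_TYPE=mysql
echo "schematool -dbType ${DB_TYPE} -info"            # compare schema vs. jar versions
echo "schematool -dbType ${DB_TYPE} -upgradeSchema"   # apply the upgrade scripts
```

After the upgrade, rerunning "compute stats store_sales" should be the quickest way to confirm the fix.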
