See my response on the other thread you started. The probe side of a join is executed in a single thread per host. Impala can run multiple builds in parallel, but each build uses only a single thread. A single query might not be able to max out your CPU, but most realistic workloads run several queries concurrently.
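
If you want to see this for yourself, one rough way is to check how many hosts and fragment instances each join operator gets. A minimal sketch below - the store_sales/catalog_sales join is just an assumed stand-in for your actual query - raises the explain level so the plan shows per-fragment host counts; after the query runs, "summary;" and "profile;" in impala-shell show how many instances each operator actually ran with:

  -- show per-fragment host counts in the plan (extended explain)
  set explain_level=2;
  explain
  select count(*)
  from store_sales ss
    join catalog_sales cs
      on ss.ss_item_sk = cs.cs_item_sk;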
On Thu, Oct 26, 2017 at 10:38 PM, 俊杰陈 <[email protected]> wrote:

> Thanks, let me put it here.
>
> Yes, the query is intended to verify the parallelism of a partitioned
> join. We want to know how many fragment instances are started for a hash
> join on a single node, and how many threads in a fragment instance
> perform the join. I'm not sure whether only one thread participates in
> the hash join, since that would not maximize CPU utilization.
>
> The compute stats ran into the following error:
>
> Query: compute stats store_sales
> WARNINGS: ImpalaRuntimeException: Error making
> 'updateTableColumnStatistics' RPC to Hive Metastore:
> CAUSED BY: MetaException: Insert of object
> "org.apache.hadoop.hive.metastore.model.MTableColumnStatistics@49fe55a1"
> using statement "INSERT INTO `TAB_COL_STATS`
> (`CS_ID`,`AVG_COL_LEN`,`BIT_VECTOR`,`COLUMN_NAME`,`COLUMN_TYPE`,`DB_NAME`,
> `BIG_DECIMAL_HIGH_VALUE`,`BIG_DECIMAL_LOW_VALUE`,`DOUBLE_HIGH_VALUE`,
> `DOUBLE_LOW_VALUE`,`LAST_ANALYZED`,`LONG_HIGH_VALUE`,`LONG_LOW_VALUE`,
> `MAX_COL_LEN`,`NUM_DISTINCTS`,`NUM_FALSES`,`NUM_NULLS`,`NUM_TRUES`,
> `TBL_ID`,`TABLE_NAME`) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
> failed : Unknown column 'BIT_VECTOR' in 'field list'
>
> 2017-10-27 12:02 GMT+08:00 Mostafa Mokhtar <[email protected]>:
>
>> Hi,
>>
>> It looks like you are joining store_sales with catalog_sales on item_sk.
>> This kind of join condition is many-to-many, which means the number of
>> output rows will be much larger than the number of input rows; not sure
>> if that is intended.
>>
>> Also, did you run "compute stats [TABLE_NAME]" on both tables?
>>
>> For a more comprehensive query, try TPCDS Q17:
>>
>> select i_item_id
>>       ,i_item_desc
>>       ,s_state
>>       ,count(ss_quantity) as store_sales_quantitycount
>>       ,avg(ss_quantity) as store_sales_quantityave
>>       ,stddev_samp(ss_quantity) as store_sales_quantitystdev
>>       ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
>>       ,count(sr_return_quantity) as store_returns_quantitycount
>>       ,avg(sr_return_quantity) as store_returns_quantityave
>>       ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
>>       ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov
>>       ,count(cs_quantity) as catalog_sales_quantitycount
>>       ,avg(cs_quantity) as catalog_sales_quantityave
>>       ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
>>       ,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
>> from store_sales
>>     ,store_returns
>>     ,catalog_sales
>>     ,date_dim d1
>>     ,date_dim d2
>>     ,date_dim d3
>>     ,store
>>     ,item
>> where d1.d_quarter_name = '2000Q1'
>>   and d1.d_date_sk = ss_sold_date_sk
>>   and i_item_sk = ss_item_sk
>>   and s_store_sk = ss_store_sk
>>   and ss_customer_sk = sr_customer_sk
>>   and ss_item_sk = sr_item_sk
>>   and ss_ticket_number = sr_ticket_number
>>   and sr_returned_date_sk = d2.d_date_sk
>>   and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>>   and sr_customer_sk = cs_bill_customer_sk
>>   and sr_item_sk = cs_item_sk
>>   and cs_sold_date_sk = d3.d_date_sk
>>   and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
>> group by i_item_id
>>         ,i_item_desc
>>         ,s_state
>> order by i_item_id
>>         ,i_item_desc
>>         ,s_state
>> limit 100;
>>
>> I recommend moving this kind of discussion to [email protected].
>>
>> On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <[email protected]> wrote:
>>
>> > The profile file is damaged.
>> > Here is a screenshot of the exec summary.
>> >
>> > 2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>:
>> >
>> >> Hi Devs,
>> >>
>> >> I ran into a performance issue on a big table join. The query takes
>> >> more than 3 hours on Impala and only 3 minutes on Spark SQL on the
>> >> same 5-node cluster. While the query is running, the left scanner and
>> >> the exchange node are very slow. Did I miss some key arguments?
>> >>
>> >> You can see the profile file in the attachment.
>> >>
>> >> --
>> >> Thanks & Best Regards
>> >
>> >
>> > --
>> > Thanks & Best Regards
>
>
> --
> Thanks & Best Regards
