Want to make sure there are no misunderstandings. When I say "builds can be run in parallel" I mean that for queries with a series of joins, multiple independent join builds can run in parallel. But each individual build is still run in a singe thread. The degree of build parallelism is dynamically adjusted by our ThreadResourceMgr based on the availability of threads. You can control how many threads are available for the ThreadResourceMgr to allocate based on these paremeters: --num_cores --num_threads_per_core
The total number of threads for the ThreadResourceMgr is num_cores*num_threads_per_core. That said, I strongly advise against adjusting those config params for the purpose of "controlling" the build parallelism because those params affect many more things i the system (e.g. scan parallelism). On Thu, Oct 26, 2017 at 11:26 PM, 俊杰陈 <[email protected]> wrote: > Thank Alex. > > Cheng, build phase multithreading should depend on Disk IO manager. > > > 2017-10-27 13:50 GMT+08:00 Xu, Cheng A <[email protected]>: > >> Thanks Alexander for the reply. Is there any configuration we can use to >> determine the parallelism level for build phase? Thank you! >> >> >> >> Thanks >> >> Ferdinand Xu >> >> >> >> *From:* Alexander Behm [mailto:[email protected]] >> *Sent:* Friday, October 27, 2017 1:46 PM >> *To:* [email protected] >> *Cc:* Mostafa Mokhtar <[email protected]> >> *Subject:* Re: performance issue on big table join >> >> >> >> See my response on the other thread you started. The probe side of joins >> are are executed in a single thread per host. Impala can run multiple >> builds in parallel - but each build uses only a single thread. >> >> A single query might not be able to max out your CPU, but most realistic >> workloads run several queries concurrently. >> >> >> >> >> >> On Thu, Oct 26, 2017 at 10:38 PM, 俊杰陈 <[email protected]> wrote: >> >> Thanks, let me put here. >> >> >> >> Yes, the query is intended to verify parallelism of partitioned join. We >> want to know how many fragment instances started for a hash join on a >> single node, and how many threads in a fragment instance perform the join. >> I'm not sure whether there is only one thread participate in hash join, >> since thus it can not maximize the CPU utilization. >> >> >> >> The compute stats met following error: >> >> >> >> Query: compute stats store_sales >> >> WARNINGS: ImpalaRuntimeException: Error making >> 'updateTableColumnStatistics' RPC to Hive Metastore: >> >> CAUSED BY: MetaException: Insert of object "org.apache.hadoop.hive.metast >> ore.model.MTableColumnStatistics@49fe55a1" using statement "INSERT INTO >> `TAB_COL_STATS` (`CS_ID`,`AVG_COL_LEN`,`BIT_VE >> CTOR`,`COLUMN_NAME`,`COLUMN_TYPE`,`DB_NAME`,`BIG_DECIMAL_HIG >> H_VALUE`,`BIG_DECIMAL_LOW_VALUE`,`DOUBLE_HIGH_VALUE`,`DOUBLE >> _LOW_VALUE`,`LAST_ANALYZED`,`LONG_HIGH_VALUE`,`LONG_LOW_ >> VALUE`,`MAX_COL_LEN`,`NUM_DISTINCTS`,`NUM_FALSES`,`NUM_ >> NULLS`,`NUM_TRUES`,`TBL_ID`,`TABLE_NAME`) VALUES >> (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)" failed : Unknown column >> 'BIT_VECTOR' in 'field list' >> >> >> >> >> >> 2017-10-27 12:02 GMT+08:00 Mostafa Mokhtar <[email protected]>: >> >> Hi, >> >> Looks like you are joining store_sales with catalog_sales on item_sk, this >> kind of join condition is a many to many, which means the output number of >> rows will be much larger then input number of rows, not sure if this is >> intended. >> >> Also did you run "compute stats [TABLE_NAME]" on both tables? >> >> For a more comprehensive query try TPCDS Q17 >> >> select i_item_id >> >> ,i_item_desc >> >> ,s_state >> >> ,count(ss_quantity) as store_sales_quantitycount >> >> ,avg(ss_quantity) as store_sales_quantityave >> >> ,stddev_samp(ss_quantity) as store_sales_quantitystdev >> >> ,stddev_samp(ss_quantity)/avg(ss_quantity) as >> store_sales_quantitycov >> >> ,count(sr_return_quantity) as store_returns_quantitycount >> >> ,avg(sr_return_quantity) as store_returns_quantityave >> >> ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev >> >> ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as >> store_returns_quantitycov >> >> ,count(cs_quantity) as catalog_sales_quantitycount >> ,avg(cs_quantity) as catalog_sales_quantityave >> >> ,stddev_samp(cs_quantity) as catalog_sales_quantitystdev >> >> ,stddev_samp(cs_quantity)/avg(cs_quantity) as >> catalog_sales_quantitycov >> >> from store_sales >> >> ,store_returns >> >> ,catalog_sales >> >> ,date_dim d1 >> >> ,date_dim d2 >> >> ,date_dim d3 >> >> ,store >> >> ,item >> >> where d1.d_quarter_name = '2000Q1' >> >> and d1.d_date_sk = ss_sold_date_sk >> >> and i_item_sk = ss_item_sk >> >> and s_store_sk = ss_store_sk >> >> and ss_customer_sk = sr_customer_sk >> >> and ss_item_sk = sr_item_sk >> >> and ss_ticket_number = sr_ticket_number >> >> and sr_returned_date_sk = d2.d_date_sk >> >> and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3') >> >> and sr_customer_sk = cs_bill_customer_sk >> >> and sr_item_sk = cs_item_sk >> >> and cs_sold_date_sk = d3.d_date_sk >> >> and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3') >> >> group by i_item_id >> >> ,i_item_desc >> >> ,s_state >> >> order by i_item_id >> >> ,i_item_desc >> >> ,s_state >> >> limit 100; >> >> >> I recommend moving this kind of discussion on >> [email protected]. >> >> On Thu, Oct 26, 2017 at 7:25 PM, 俊杰陈 <[email protected]> wrote: >> >> > The profile file is damaged. Here is a screenshot for exec summary >> > >> > >> > >> > 2017-10-27 10:04 GMT+08:00 俊杰陈 <[email protected]>: >> > >> >> Hi Devs >> >> >> >> I met a performance issue on big table join. The query takes more than >> 3 >> >> hours on Impala and only 3 minutes on Spark SQL on the same 5 nodes >> >> cluster. when running query, the left scanner and exchange node are >> very >> >> slow. Did I miss some key arguments? >> >> >> >> you can see profile file in attachment. >> >> >> >> >> >> >> >> >> -- >> >> Thanks & Best Regards >> >> >> > >> > >> > >> > -- >> > Thanks & Best Regards >> > >> >> >> >> >> >> -- >> >> Thanks & Best Regards >> >> >> > > > > -- > Thanks & Best Regards >
