Hey Joseph, Yes, basically everything except scans uses one thread per operator. IMPALA-3902 tracks multithreading. You can trick impala to use more threads for the same job by altering the query (break it up into smaller tasks by limiting the scans, then union), but it's not really a viable/nice solution.
On Mon, Mar 13, 2017 at 2:50 PM, Joseph Naegele <[email protected]> wrote: > Thanks Maurin, > > Does this imply that my example query can't use multi-threading in Impala > 2.5? Is this the case for all non-join queries? > > --- > Joe Naegele > Grier Forensics > > -----Original Message----- > From: Maurin Lenglart [mailto:[email protected]] > Sent: Friday, March 10, 2017 6:39 PM > To: [email protected] > Subject: Re: low CPU usage > > Hi, > You can look at the MT_DOP flag (only avaible in the latest impala release > 2.8 with cdh 5.10) > It enable multi-threading for queries that are not using joins. > > On 3/10/17, 9:52 AM, "Joseph Naegele" <[email protected]> wrote: > > Hi folks, > > I'm running a very rough comparison of Spark and Impala for a specific > operation in our application. I have a 2+ billion record table on which I run > a window function to remove duplicate records. AFAIK this is similar to a > self-join. The data doesn't fit entirely in memory but Spark is able to spill > to disk and complete the operation in 40 minutes on this one machine while > the equivalent query in Impala takes 2.3 hours. The machine has approx. 250 > GB RAM and 56 vcores. While running in Impala I observe very low (1-core) CPU > usage throughout. > > My dataset is around 52 GB compressed (2-3x that uncompressed) in > Parquet+Gzip format spread across 1200 files. My query looks like this: > > CREATE TABLE compacted STORED AS PARQUET LOCATION "/tmp/compacted" AS > SELECT a, b, c, d, e FROM ( > SELECT row_number() OVER ( > PARTITION BY c, d ORDER BY e DESC > ) rank, * FROM duplicates > ) q WHERE q.rank=1; > > Where fields a, b, c, ... are strings or long integers. Also, I've run > COMPUTE STATS on the table. > > I'm using the Cloudera quickstart Docker image. Impalad reports version > as impalad version 2.5.0-cdh5.7.0 RELEASE. I've allocated 6 local disks to > Impalad's scratch space (since I expect spillage). Otherwise the > configuration is vanilla. I'm monitoring CPU usage using htop and CPU > utilization fluctuates between 1/56 and 5/56 of available CPU. I have the > Impala query profile available in the web UI, but I wanted to ask if there's > anything obvious I'm missing before posting too much detail. > > Is there anything obvious I should look into or configure to achieve > higher CPU utilization and faster overall performance? > > Thanks > --- > Joe Naegele > Grier Forensics > > > > >
