Hi folks,

I'm running a very rough comparison of Spark and Impala for a specific 
operation in our application. I have a 2+ billion record table on which I run a 
window function to remove duplicate records. AFAIK this is similar to a 
self-join. The data doesn't fit entirely in memory but Spark is able to spill 
to disk and complete the operation in 40 minutes on this one machine while the 
equivalent query in Impala takes 2.3 hours. The machine has approx. 250 GB RAM 
and 56 vcores. While running in Impala I observe very low (1-core) CPU usage 
throughout.

My dataset is around 52 GB compressed (2-3x that uncompressed) in Parquet+Gzip 
format spread across 1200 files. My query looks like this:

CREATE TABLE compacted STORED AS PARQUET LOCATION "/tmp/compacted" AS
SELECT a, b, c, d, e FROM (
    SELECT row_number() OVER (
        PARTITION BY c, d ORDER BY e DESC
    ) rank, * FROM duplicates
) q WHERE q.rank=1;

Where fields a, b, c, ... are strings or long integers. Also, I've run COMPUTE 
STATS on the table.

I'm using the Cloudera quickstart Docker image. Impalad reports version as 
impalad version 2.5.0-cdh5.7.0 RELEASE. I've allocated 6 local disks to 
Impalad's scratch space (since I expect spillage). Otherwise the configuration 
is vanilla. I'm monitoring CPU usage using htop and CPU utilization fluctuates 
between 1/56 and 5/56 of available CPU. I have the Impala query profile 
available in the web UI, but I wanted to ask if there's anything obvious I'm 
missing before posting too much detail.

Is there anything obvious I should look into or configure to achieve higher CPU 
utilization and faster overall performance?

Thanks
---
Joe Naegele
Grier Forensics


Reply via email to