Hi folks,
I'm running a very rough comparison of Spark and Impala for one specific
operation in our application. I have a table of over 2 billion records on which
I run a window function to remove duplicate records, which AFAIK is similar to
a self-join. The data doesn't fit entirely in memory, but Spark is able to
spill to disk and complete the operation in 40 minutes on a single machine,
while the equivalent query in Impala takes 2.3 hours. The machine has approx.
250 GB RAM and 56 vcores. While the Impala query runs, I observe very low
(roughly one core) CPU usage throughout.
My dataset is around 52 GB compressed (2-3x that uncompressed) in Parquet+Gzip
format spread across 1200 files. My query looks like this:
CREATE TABLE compacted STORED AS PARQUET LOCATION "/tmp/compacted" AS
SELECT a, b, c, d, e FROM (
  SELECT row_number() OVER (
    PARTITION BY c, d ORDER BY e DESC
  ) AS rank, *
  FROM duplicates
) q WHERE q.rank = 1;
where the fields a, b, c, ... are strings or long integers. I've also run
COMPUTE STATS on the table.
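In case the intent of the query isn't clear, here is the dedup semantics
sketched in plain Python (the rows and values are made up; only the column
names c, d, e match the real table):

```python
from itertools import groupby
from operator import itemgetter

# Toy stand-ins for the real 2+ billion Parquet records.
rows = [
    {"c": 1, "d": "x", "e": 10, "a": "older"},
    {"c": 1, "d": "x", "e": 20, "a": "newest"},
    {"c": 2, "d": "y", "e": 5,  "a": "only"},
]

def dedup(rows):
    """Keep one row per (c, d) partition: the one with the highest e.

    This mirrors row_number() OVER (PARTITION BY c, d ORDER BY e DESC)
    followed by WHERE rank = 1 (ties broken arbitrarily, as in the query).
    """
    key = itemgetter("c", "d")
    out = []
    # groupby needs its input sorted by the same key it groups on.
    for _, group in groupby(sorted(rows, key=key), key=key):
        out.append(max(group, key=itemgetter("e")))
    return out

compacted = dedup(rows)
# One survivor per (c, d): the e=20 row for (1, "x") and the (2, "y") row.
```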
I'm using the Cloudera quickstart Docker image; impalad reports its version as
2.5.0-cdh5.7.0 RELEASE. I've allocated 6 local disks to impalad's scratch
space (since I expect it to spill). Otherwise the configuration is vanilla.
I'm monitoring CPU usage with htop, and utilization fluctuates between 1/56
and 5/56 of the available cores. I have the Impala query profile available in
the web UI, but I wanted to ask whether there's anything obvious I'm missing
before posting too much detail.
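For reference, the scratch space is configured via impalad's startup flag;
my setup looks roughly like this (the mount paths are placeholders for my
actual local disks):

```shell
# Comma-separated list of scratch directories, one per local disk.
# /diskN/impala are illustrative paths, not my real mount points.
impalad --scratch_dirs=/disk1/impala,/disk2/impala,/disk3/impala,/disk4/impala,/disk5/impala,/disk6/impala
```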
Is there anything obvious I should look into or configure to achieve higher CPU
utilization and faster overall performance?
Thanks
---
Joe Naegele
Grier Forensics