Hi folks,

I’m trying to debug why the performance of our new Impala setup is performing a 
bit worse on the same queries as Presto. We’re running Impala 2.10.0 and 
starting it up with 225G of memory (-mem-limit). It runs on the same set of 
nodes as Presto (4 nodes, 48 cores each, 10G ethernet). We noticed a couple of 
queries in Impala were fairly slow in the hdfs-scan stage so we tried to 
isolate the behavior with a slightly simpler query:
select max(hour + nb_display) from my_large_parquet_table where day = 
'2017-10-04';

This query takes around 28-30 mins on Impala and seems to run in around 8.5 
mins on Presto. The input table is 11.64TB and uses Parquet (snappy compressed).

Looking at the performance profile (attached in case anyone’s interested), I 
noticed that during the query Impala seems to be averaging around 1.1 - 1.2 
MB/s per thread and the total read throughput is around 2.13 MB/s. I tried 
bumping up the number of scanner threads (from 48 to 96) but that seemed to 
only help marginally (improved runtimes by ~10s). Running “sar –n DEV 1” on 
some of our hosts while the query is running, seems to show that Impala is 
reading at a rate of 30-50 MB/s (whereas we see this go up to 60-125 MB/s on 
our other runs).

We haven’t tweaked our Impala setup much beyond the defaults that come out of 
the box. I’m wondering if I’m missing some tuning settings that help improve 
read rates from HDFS when Impala is running outside of the datanodes. If anyone 
has any ideas / suggestions, they’d be welcome. I am happy to provide more 
details if needed as well.

Thanks,

-- Piyush

Attachment: impala_perf_profile
Description: impala_perf_profile

Reply via email to