Hi all,

I am really excited about Apache Drill and how easy it makes it to bring 
SQL on top of different storage technologies. I am in the phase of 
learning/evaluating Apache Drill and I have come across a case where 
performance drops significantly. I would therefore like to share my 
results with you and get hints about how to improve performance.

I have installed Drill on a cluster of 12 nodes and have assigned 8GB 
to Drill per node. The main steps of our data pipeline are:
1. Import data into HDFS as Parquet files with Sqoop. For the evaluation 
tests I have a dataset of Parquet files with ~11M rows and 50 columns. 
The total size is ~1GB.
2. Query Parquet files with Drill.

I have tried different types of queries, and even for very complicated 
ones the response time is around 5sec or less. However, I have noticed 
that the response time rises to ~80sec for queries that have both of the 
following characteristics:
1. Sort the resultset (ORDER BY)
2. Get all columns

For example a query with the following pattern:

SELECT *
FROM table
ORDER BY column
LIMIT 1;

Interestingly, the more columns I put in the SELECT clause, the longer 
the query takes to respond. If I don't sort, or if I select just a couple 
of columns, the response time drops from ~80s to ~3s.
Please note that I limit the result set to 1 row in order to rule out 
network traffic delays.
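
To make the comparison concrete, the two fast variants look roughly like 
the sketch below. As in the pattern above, "table" is a placeholder, and 
col_a and col_b are hypothetical names standing in for any couple of the 
50 columns, not actual column names from my dataset.

```sql
-- Variant 1: still sorted, but only a couple of columns are selected (~3s).
-- col_a and col_b are placeholder column names.
SELECT col_a, col_b
FROM table
ORDER BY col_a
LIMIT 1;

-- Variant 2: all columns are selected, but there is no sort (~3s).
SELECT *
FROM table
LIMIT 1;
```

Only the combination of SELECT * and ORDER BY shows the ~80s response time.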

I have checked the Query Profiler and the most time-consuming operators 
are:
HASH_PARTITION_SENDER with Avg Process Time: 38sec
PARQUET_ROW_GROUP_SCAN with Avg Process Time: 42sec

Do you have any idea how I can improve performance for this kind of 
query? If you like, I can also provide the full JSON profile.

Thanks,
Nikos

