Hi all, I am really excited about Apache Drill its easiness to bring SQL on top of different storage technologies. I am in the phase of learning/evaluating Apache Drill and I have come up with a case where the performance drops significantly. Therefore, I would like to share with you my results and get hints about how to improve performance.
I have installed Drill in a cluster of 12 nodes and I have assigned 8GB for Drill per node.The main steps of our data pipeline are: 1. Import data on HDFS as Parquet files with Sqoop. For the evaluation tests I have a dataset of Parquet files with ~11M rows and 50 columns. The total size is ~1GB. 2. Query Parquet files with Drill. I have tried different types of queries and even in very complicated ones the response time is around or less than 5sec. However I have noticed that the response time rises to ~80sec if I try queries which have the following 2 characteristics: 1. Sort the resultset (ORDER BY) 2. Get all columns For example a query with the following pattern: SELECT * FROM table ORDER BY columng LIMIT 1; It is interesting that the more columns I put in the select clause the more time it needs to respond. If I don't sort or if I get just a couple of columns then the response time drops from ~80s to ~3s. Please notice that I limit the resultset to 1 row, in order to avoid network traffic delays. I have checked the Query Profiler and the most time consuming operations are: HASH_PARTITION_SENDER with Avg Process Time: 38sec PARQUET_ROW_GROUP_SCAN with Avg Process Time: 42sec Do you have any idea how I can improve performance in the case of my query (if you like I can also provide a Full Json Profile). Thanks, Nikos
