Assuming all your queries contain a LIMIT, when there is no ORDER BY then Drill doesn't need to read all rows to produce the results, but with ORDER BY there is no alternative than to read 11M rows from disk. Parquet is a columnar format, so Drill, is able to only read the columns you selected that's why you still get descent response times when you only select a couple of columns. Also, the less columns you select, the less data needs to be sent through the network.
How many parquet files do you have ? Drill will try to run multiple parquet reading threads but it can only do so if you have multiple parquet files in your dataset. Sharing the query profile will definitely help provide more advice about how to improve the performance of your query. For instance, do you see any excessive wait time in parquet_row_group_scan or any of the senders ? Thanks On Thu, Jul 7, 2016 at 2:22 AM, Nikos Livathinos <[email protected]> wrote: > Hi all, > > I am really excited about Apache Drill its easiness to bring SQL on top of > different storage technologies. I am in the phase of learning/evaluating > Apache Drill and I have come up with a case where the performance drops > significantly. Therefore, I would like to share with you my results and > get hints about how to improve performance. > > I have installed Drill in a cluster of 12 nodes and I have assigned 8GB > for Drill per node.The main steps of our data pipeline are: > 1. Import data on HDFS as Parquet files with Sqoop. For the evaluation > tests I have a dataset of Parquet files with ~11M rows and 50 columns. > The total size is ~1GB. > 2. Query Parquet files with Drill. > > I have tried different types of queries and even in very complicated ones > the response time is around or less than 5sec. However I have noticed that > the response time rises to ~80sec if I try queries which have the > following 2 characteristics: > 1. Sort the resultset (ORDER BY) > 2. Get all columns > > For example a query with the following pattern: > > SELECT * > FROM table > ORDER BY columng > LIMIT 1; > > It is interesting that the more columns I put in the select clause the > more time it needs to respond. If I don't sort or if I get just a couple > of columns then the response time drops from ~80s to ~3s. > Please notice that I limit the resultset to 1 row, in order to avoid > network traffic delays. > > I have checked the Query Profiler and the most time consuming operations > are: > HASH_PARTITION_SENDER with Avg Process Time: 38sec > PARQUET_ROW_GROUP_SCAN with Avg Process Time: 42sec > > Do you have any idea how I can improve performance in the case of my query > (if you like I can also provide a Full Json Profile). > > Thanks, > Nikos > > > -- Abdelhakim Deneche Software Engineer <http://www.mapr.com/> Now Available - Free Hadoop On-Demand Training <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
