Hello Charles, I ran into the same performance issues some time ago and made some discoveries:
* Drill is good at pulling only the byte ranges it needs out of the file system. Sadly, s3a in Hadoop 2.7 translates a request for the byte range (x, y) into an HTTP request to S3 for the byte range (x, end-of-file). For Parquet, this means that each column in each row group is read from the beginning of its column chunk to the end of the file. Overall, this amounted to traffic of 10-20x the size of the actual file for me.

* Hadoop 2.8/3.0 introduces a new, experimental S3 random access mode that really improves performance, as it only sends requests for (x, y+readahead.range) to S3. You can activate it with fs.s3a.experimental.input.fadvise=random.

* I played a bit with fs.s3a.readahead.range, the optimistic extra range included in each request, but found that I could keep it at its default of 65536 bytes, as Drill often requests all the bytes it needs at once, so reading ahead did not improve the situation.

* This random access mode plays well with Parquet files but sadly slowed down reading the metadata cache drastically, as only requests of size 65540 were made to S3. Therefore I had to add is.setReadahead(filesize); after https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L593 to ensure that the metadata cache is read from S3 in one request.

* Also, https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L662 seems to have always been true in my case, causing a refresh of the cache on every query. As I have quite a big dataset, this added a large constant to every query. This might simply be due to the fact that S3 does not have the concept of directories. I have not dug deeper into this, but added as a dirty workaround that once the cache exists, it is never updated automatically.
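To make the first point concrete, here is a small standalone sketch (not Drill or Hadoop code; the file size and column-chunk offsets are made-up numbers) of how reads of (x, y) turned into requests for (x, end-of-file) inflate the bytes actually fetched:

```java
// Sketch of the s3a (Hadoop 2.7) read amplification: a read of the range
// (start, start+len) becomes an HTTP request for (start, end-of-file).
// All sizes and offsets below are hypothetical, just for illustration.
public class S3aReadAmplification {

    // Total bytes fetched when every read starting at chunkStart[i]
    // is expanded to run until end-of-file.
    static long fetchedWithEofReads(long fileSize, long[] chunkStart) {
        long fetched = 0;
        for (long start : chunkStart) {
            fetched += fileSize - start;
        }
        return fetched;
    }

    public static void main(String[] args) {
        long fileSize = 1_000_000L; // hypothetical 1 MB Parquet file

        // Five hypothetical column chunks of 200 KB each, back to back.
        long[] chunkStart = {0, 200_000, 400_000, 600_000, 800_000};
        long wanted = 1_000_000L; // Drill actually needs the whole file once

        long fetched = fetchedWithEofReads(fileSize, chunkStart);
        System.out.println("wanted  = " + wanted);   // 1000000
        System.out.println("fetched = " + fetched);  // 3000000, i.e. 3x
        System.out.println("amplification = " + (double) fetched / wanted);
    }
}
```

With only five chunks the amplification is 3x; with many columns and row groups it grows roughly with half the number of chunks, which is how 10-20x happens in practice.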
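For reference, enabling the random access mode could look like the following core-site.xml fragment (a sketch; the property names match what I used with the Hadoop 2.8 s3a client, but check your own Hadoop version's docs):

```xml
<!-- core-site.xml fragment: enable s3a random access (Hadoop 2.8+) -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
<property>
  <name>fs.s3a.readahead.range</name>
  <value>65536</value> <!-- the default; I saw no benefit from raising it -->
</property>
```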
Locally I have made my own Drill build based on the Hadoop 2.8 libraries, but sadly some unit tests failed; at least for the S3 testing, everything seems to work. My work is still based on the 1.11 release sources, and some code has changed since then. I will have some time in the next days/weeks to look at this again and might open some PRs (don't expect me to be the one to open the Hadoop-Update PR; I'm a full-time Python dev, so this is a bit out of my comfort zone :D ). At least in my basic tests, this resulted in a quite performant setup for me (in both embedded and distributed mode).

Cheers
Uwe

On Sun, Nov 5, 2017, at 02:29 AM, Charles Givre wrote:
> Hello everyone,
> I’m experimenting with Drill on S3 and I’ve been pretty disappointed with
> the performance. I’m curious as to what kind of performance I can
> expect? Also what can be done to improve performance on S3. My current
> config is I am using Drill in embedded mode with a corporate S3 bucket.
> Thanks,
> — C
