Hello Charles, I ran into the same performance issues some time ago and made some discoveries:
* Drill is good at pulling only the byte ranges it needs out of the file system. Sadly, s3a in Hadoop 2.7 translates a request for the byte range (x, y) into an HTTP request to S3 for the byte range (x, end-of-file). For Parquet, this means that each column in each row group is read from the beginning of its column chunk to the end of the file. Overall, this amounted to traffic of 10-20x the size of the actual file for me.

* Hadoop 2.8/3.0 introduces a new, experimental S3 random access mode that really improves performance, as it only sends requests for (x, y+readahead.range) to S3. You can activate it with fs.s3a.experimental.input.fadvise=random.

* I played a bit with fs.s3a.readahead.range, the optimistic extra range included in each request, but found that I could keep it at its default of 65536 bytes, as Drill often requests all the bytes it needs at once, so reading ahead did not improve the situation.

* This random access mode plays well with Parquet files but sadly slowed down reading the metadata cache drastically, as only requests of size 65540 were made to S3. Therefore I had to add is.setReadahead(filesize); after https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L593 to ensure that the metadata cache is read from S3 in one request.

* Also, https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L662 seems to have always been true in my case, causing a refresh of the cache on every query. As I have quite a big dataset, this added a large constant to every query. This might simply be due to the fact that S3 does not have the concept of directories. I have not dug deeper into this, but added as a dirty workaround that once the cache exists, it is never updated automatically.
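To make the first point concrete, here is a small standalone sketch (not Drill or Hadoop code; the file size and column-chunk offsets are made-up numbers) of how reads of (x, y) turned into requests for (x, end-of-file) inflate the bytes actually fetched:

```java
// Sketch of the s3a (Hadoop 2.7) read amplification: a read of the range
// (start, start+len) becomes an HTTP request for (start, end-of-file).
// All sizes and offsets below are hypothetical, just for illustration.
public class S3aReadAmplification {

    // Total bytes fetched when every read starting at chunkStart[i]
    // is expanded to run until end-of-file.
    static long fetchedWithEofReads(long fileSize, long[] chunkStart) {
        long fetched = 0;
        for (long start : chunkStart) {
            fetched += fileSize - start;
        }
        return fetched;
    }

    public static void main(String[] args) {
        long fileSize = 1_000_000L; // hypothetical 1 MB Parquet file

        // Five hypothetical column chunks of 200 KB each, back to back.
        long[] chunkStart = {0, 200_000, 400_000, 600_000, 800_000};
        long wanted = 1_000_000L; // Drill actually needs the whole file once

        long fetched = fetchedWithEofReads(fileSize, chunkStart);
        System.out.println("wanted  = " + wanted);   // 1000000
        System.out.println("fetched = " + fetched);  // 3000000, i.e. 3x
        System.out.println("amplification = " + (double) fetched / wanted);
    }
}
```

With only five chunks the amplification is 3x; with many columns and row groups it grows roughly with half the number of chunks, which is how 10-20x happens in practice.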
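For reference, enabling the random access mode could look like the following core-site.xml fragment (a sketch; the property names match what I used with the Hadoop 2.8 s3a client, but check your own Hadoop version's docs):

```xml
<!-- core-site.xml fragment: enable s3a random access (Hadoop 2.8+) -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
<property>
  <name>fs.s3a.readahead.range</name>
  <value>65536</value> <!-- the default; I saw no benefit from raising it -->
</property>
```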
Locally I have made my own Drill build based on the Hadoop 2.8 libraries, but sadly some unit tests failed; at least for the S3 testing, everything seems to work. My work is still based on the 1.11 release sources, and some code has changed since then. I will have some time in the next days/weeks to look at this again and might open some PRs (don't expect me to be the one to open the Hadoop-Update PR; I'm a full-time Python dev, so this is a bit out of my comfort zone :D ). At least in my basic tests, this resulted in a quite performant setup for me (in both embedded and distributed mode).

Cheers
Uwe

On Sun, Nov 5, 2017, at 02:29 AM, Charles Givre wrote:
> Hello everyone,
> I’m experimenting with Drill on S3 and I’ve been pretty disappointed with
> the performance. I’m curious as to what kind of performance I can
> expect? Also what can be done to improve performance on S3. My current
> config is I am using Drill in embedded mode with a corporate S3 bucket.
> Thanks,
> — C
