Hello,

We ran some tests with Drill 1.8 over the last few days and wanted to share 
the experience with you, as we made some interesting findings that astonished 
us. We ran on our internal company cluster and thus used the S3 API to access 
our internal storage cluster, not AWS (the behavior should still be the same).

Setup experience: Awesome. It took me less than 30 minutes to have a multinode 
Drill setup running on Mesos+Aurora with S3 configured. Really nice.

Performance with the 1.8 release: Awful. Compared to the queries I ran locally 
with Drill on a small dataset, runtimes were orders of magnitude higher than 
on my laptop. After some debugging, I saw that hadoop-s3a always requests from 
S3 the byte range from the position where we want to start reading up to the 
end of the file. This gave the following HTTP pattern:
 * GET bytes=8k-100M
 * GET bytes=2M-100M
 * GET bytes=4M-100M
Although the HTTP requests were normally aborted before all the data was sent 
by the server, it was still about 10-15x the size of the input files that went 
over the network. Using Parquet, I had actually hoped to achieve the opposite, 
i.e. that less than the whole file would be transferred (my test queries were 
only using 2 of 15 columns).
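
To make the pattern concrete, here is a minimal sketch (hypothetical bucket, 
file name, and offsets) of the positioned reads a Parquet column scan issues; 
under the default S3A "sequential" mode, each seek() reopens a GET from the 
new position to the end of the file:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3AReadPattern {
      public static void main(String[] args) throws Exception {
        // Hypothetical bucket/file; the offsets mirror the pattern above.
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"),
                                       new Configuration());
        try (FSDataInputStream in =
                 fs.open(new Path("s3a://bucket/table/part-0.parquet"))) {
          byte[] buf = new byte[2 * 1024 * 1024];
          in.seek(8 * 1024);        // sequential mode: GET bytes=8k-<EOF>
          in.readFully(buf, 0, buf.length);
          in.seek(4 * 1024 * 1024); // stream aborted, new GET bytes=4M-<EOF>
          in.readFully(buf, 0, buf.length);
        }
      }
    }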

In Hadoop 3.0.0-alpha1 [2], there are a lot of improvements w.r.t. S3 access. 
You can now select, via fs.s3a.experimental.input.fadvise=random, a new 
reading mode that only requests from S3 the range that was actually asked for, 
plus a small readahead buffer. While this keeps the number of requests 
constant, we now only transfer the data we actually need. With that, 
performance is not amazing, but in an acceptable range.
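
For reference, a minimal sketch of how the mode can be switched on 
programmatically (the same key/value can equally go into core-site.xml):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RandomFadviseConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 3.0.0-alpha1+: in "random" mode S3A fetches only the
        // requested range plus a small readahead buffer instead of
        // streaming to end-of-file.
        conf.set("fs.s3a.experimental.input.fadvise", "random");
        // Any FileSystem created from this conf now uses random mode
        // (bucket name is hypothetical).
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), conf);
      }
    }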

Still, query planning always took at least 35s. This turned out to be an 
effect of fs.s3a.experimental.input.fadvise=random. While the Parquet access 
specifies quite precisely which ranges it wants to read, the parser for the 
metadata cache requests only 8000 bytes at a time, which led to several 
thousand HTTP requests for a single sequential read. As a workaround, we added 
a call to FSDataInputStream.setReadahead(metadata-filesize) to limit the 
access to a single request. This brought reading the metadata down to 3s.
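
A minimal sketch of the workaround, assuming a metadata cache file per table 
directory (the path and file name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetadataReadahead {
      public static void main(String[] args) throws Exception {
        // Hypothetical path to the Parquet metadata cache file.
        Path metaFile =
            new Path("s3a://bucket/table/.drill.parquet_metadata");
        Configuration conf = new Configuration();
        conf.set("fs.s3a.experimental.input.fadvise", "random");
        FileSystem fs = metaFile.getFileSystem(conf);
        long size = fs.getFileStatus(metaFile).getLen();
        try (FSDataInputStream in = fs.open(metaFile)) {
          // In random mode S3A fetches read-size + readahead per request;
          // a readahead equal to the file size turns the whole sequential
          // parse into a single ranged GET instead of thousands of 8 kB
          // ones.
          in.setReadahead(size);
          // ... hand "in" to the metadata parser ...
        }
      }
    }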

Another problem with the metadata cache was that it was actually rebuilt on 
every query. Drill relies here on the modification timestamp of the directory, 
which is not supported by S3 [1], so the current time was always returned as 
the modification date of the directory.
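
Our reading of the mechanism, as a hypothetical illustration (Drill's actual 
staleness check lives in its Parquet metadata code and may differ in detail):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StaleCacheCheck {
      public static void main(String[] args) throws Exception {
        Path table = new Path("s3a://bucket/table"); // hypothetical
        FileSystem fs = table.getFileSystem(new Configuration());
        long dirMtime = fs.getFileStatus(table).getModificationTime();
        long cacheMtime = fs.getFileStatus(
            new Path(table, ".drill.parquet_metadata"))
            .getModificationTime();
        // S3 does not track directory modification times, so S3A reports
        // the current time for dirMtime; this check is then true on every
        // query and the metadata cache is rebuilt each time.
        if (dirMtime > cacheMtime) {
          System.out.println("cache considered stale -> rebuild");
        }
      }
    }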

These were just our initial, basic findings with Drill. At the moment it looks 
promising enough that we will probably do some more usability and performance 
testing. If we already did something wrong in the initial S3 tests, it would 
be nice to get some pointers on what it could have been. The bad S3 I/O 
performance was really surprising to us.

Kind regards,
Uwe

[1] 
https://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-aws/tools/hadoop-aws/index.html#Warning_2:_Because_Object_stores_dont_track_modification_times_of_directories
[2] From here on, the tests were made with 
Drill-master+hadoop-3.0.0-alpha1+aws-sdk-1.11.35, i.e. custom Drill and Hadoop 
builds, so that we had newer versions of these dependencies.
