Hello,

We ran some tests with Drill 1.8 over the last few days and wanted to share 
the experience with you, as we made some interesting findings that astonished 
us. We ran on our internal company cluster and thus used the S3 API to access 
our internal storage cluster, not AWS (the behavior should still be the same).

Setup experience: Awesome. It took me less than 30 minutes to have a multinode 
Drill setup running on Mesos+Aurora with S3 configured. Really nice.

Performance with the 1.8 release: Awful. Compared to the queries I ran locally 
with Drill on a small dataset, runtimes were orders of magnitude higher than 
on my laptop. After some debugging, I saw that hadoop-s3a always requests from 
S3 the byte range from the position where we want to start reading up to the 
end of the file. This gave the following HTTP pattern:
 * GET bytes=8k-100M
 * GET bytes=2M-100M
 * GET bytes=4M-100M
Although the HTTP requests were normally aborted before all the data was sent 
by the server, it was still about 10-15x the size of the input files that went 
over the network. Using Parquet, I had actually hoped to achieve the opposite, 
i.e. that less than the whole file would be transferred (my test queries were 
only using 2 of 15 columns).
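
To make the pattern concrete, here is a minimal sketch (hypothetical bucket, 
file name, and offsets) of the positioned reads a Parquet column scan issues; 
under the default S3A "sequential" mode, each seek() reopens a GET from the 
new position to the end of the file:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3AReadPattern {
      public static void main(String[] args) throws Exception {
        // Hypothetical bucket/file; the offsets mirror the pattern above.
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"),
                                       new Configuration());
        try (FSDataInputStream in =
                 fs.open(new Path("s3a://bucket/table/part-0.parquet"))) {
          byte[] buf = new byte[2 * 1024 * 1024];
          in.seek(8 * 1024);        // sequential mode: GET bytes=8k-<EOF>
          in.readFully(buf, 0, buf.length);
          in.seek(4 * 1024 * 1024); // stream aborted, new GET bytes=4M-<EOF>
          in.readFully(buf, 0, buf.length);
        }
      }
    }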

In Hadoop 3.0.0-alpha1 [2], there are a lot of improvements w.r.t. S3 access. 
You can now select, via fs.s3a.experimental.input.fadvise=random, a new 
reading mode that only requests from S3 the range that was actually asked for, 
plus a small readahead buffer. While this keeps the number of requests 
constant, we now only transfer the data we actually need. With that, 
performance is not amazing, but in an acceptable range.
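
For reference, a minimal sketch of how the mode can be switched on 
programmatically (the same key/value can equally go into core-site.xml):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RandomFadviseConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 3.0.0-alpha1+: in "random" mode S3A fetches only the
        // requested range plus a small readahead buffer instead of
        // streaming to end-of-file.
        conf.set("fs.s3a.experimental.input.fadvise", "random");
        // Any FileSystem created from this conf now uses random mode
        // (bucket name is hypothetical).
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), conf);
      }
    }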

Still, query planning always took at least 35s. This turned out to be an 
effect of fs.s3a.experimental.input.fadvise=random. While the Parquet access 
specifies quite precisely which ranges it wants to read, the parser for the 
metadata cache requests only 8000 bytes at a time, which led to several 
thousand HTTP requests for a single sequential read. As a workaround, we added 
a call to FSDataInputStream.setReadahead(metadata-filesize) to limit the 
access to a single request. This brought reading the metadata down to 3s.
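
A minimal sketch of the workaround, assuming a metadata cache file per table 
directory (the path and file name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetadataReadahead {
      public static void main(String[] args) throws Exception {
        // Hypothetical path to the Parquet metadata cache file.
        Path metaFile =
            new Path("s3a://bucket/table/.drill.parquet_metadata");
        Configuration conf = new Configuration();
        conf.set("fs.s3a.experimental.input.fadvise", "random");
        FileSystem fs = metaFile.getFileSystem(conf);
        long size = fs.getFileStatus(metaFile).getLen();
        try (FSDataInputStream in = fs.open(metaFile)) {
          // In random mode S3A fetches read-size + readahead per request;
          // a readahead equal to the file size turns the whole sequential
          // parse into a single ranged GET instead of thousands of 8 kB
          // ones.
          in.setReadahead(size);
          // ... hand "in" to the metadata parser ...
        }
      }
    }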

Another problem with the metadata cache was that it was actually rebuilt on 
every query. Drill relies here on the modification timestamp of the directory, 
which is not supported by S3 [1], so the current time was always returned as 
the modification date of the directory.
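
Our reading of the mechanism, as a hypothetical illustration (Drill's actual 
staleness check lives in its Parquet metadata code and may differ in detail):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StaleCacheCheck {
      public static void main(String[] args) throws Exception {
        Path table = new Path("s3a://bucket/table"); // hypothetical
        FileSystem fs = table.getFileSystem(new Configuration());
        long dirMtime = fs.getFileStatus(table).getModificationTime();
        long cacheMtime = fs.getFileStatus(
            new Path(table, ".drill.parquet_metadata"))
            .getModificationTime();
        // S3 does not track directory modification times, so S3A reports
        // the current time for dirMtime; this check is then true on every
        // query and the metadata cache is rebuilt each time.
        if (dirMtime > cacheMtime) {
          System.out.println("cache considered stale -> rebuild");
        }
      }
    }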

These were just our initial, basic findings with Drill. At the moment it looks 
promising enough that we will probably do some more usability and performance 
testing. If we already did something wrong in the initial S3 tests, it would 
be nice to get some pointers on what it could have been. The bad S3 I/O 
performance was really surprising to us.

Kind regards,
Uwe

[1] 
https://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-aws/tools/hadoop-aws/index.html#Warning_2:_Because_Object_stores_dont_track_modification_times_of_directories
[2] From here on, the tests were made with 
Drill-master+hadoop-3.0.0-alpha1+aws-sdk-1.11.35, i.e. custom Drill and Hadoop 
builds, so that we had newer versions of these dependencies.
