Re: Question on querying parquet files

Jason Altekruse Thu, 14 Apr 2016 08:30:40 -0700

Hi Johannes,

Currently Drill does not implement any caching itself. There have been
previous discussions about using a general caching layer like Tachyon to
cache reads of files from S3, and it looks like it should also be able to
cache data from HDFS [1]. I believe there was also some discussion of a
shortcoming where Tachyon did not cache parquet files, because we were
ignoring a tiny part at the beginning of the file that holds just the 4
magic bytes. I tried to find that discussion, but I think it actually
happened on Slack. We need to get better at reflecting those discussions
on the public mailing list to make them more easily searchable.
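
For anyone unfamiliar with the magic bytes mentioned above, here is a
minimal sketch, not Drill or Tachyon code. It assumes an unencrypted
parquet file, which begins and ends with the 4 ASCII bytes "PAR1". The
function name has_parquet_magic is made up for illustration.

```python
import os

# 4-byte marker found at both the start and the end of an unencrypted Parquet file
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the 4-byte Parquet magic."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        # A file needs at least 8 bytes to hold both markers
        if f.tell() < 8:
            return False
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A reader that goes straight to the footer never needs those first 4 bytes,
which is how a read of a whole file can end up skipping them.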

We will not parallelize reads of a single Parquet file unless it is
splittable. For parquet this means the file has multiple row groups, which
are supposed to line up with HDFS blocks. Drill has never produced such
files; we always write a single row group per file. Other tools have taken
advantage of this feature in parquet. That being said, it has some
downsides: the format did not pad row groups to the HDFS block size, which
could cause duplicated reads of data that spans a block boundary (which is
inevitable if you don't block align by padding). Here is some relevant
discussion on the topic from the parquet list [2].
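
To make the alignment problem concrete, here is a rough sketch in Python.
It is not how Drill or parquet computes anything; the function names, the
128-byte block size and the row group offsets are all made-up values for
illustration. It just shows which unpadded row groups end up straddling a
block boundary.

```python
def blocks_touched(offset, length, block_size):
    """Return the indices of the fixed-size blocks that a byte range overlaps."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


def straddling_row_groups(row_groups, block_size):
    """Return indices of row groups whose bytes span more than one block.

    row_groups is a list of (start_offset, total_byte_size) pairs.
    """
    return [
        i
        for i, (offset, size) in enumerate(row_groups)
        if len(blocks_touched(offset, size, block_size)) > 1
    ]


# Hypothetical layout: 128-byte blocks and three unpadded 100-byte row groups
# written back to back after the 4 magic bytes at the start of the file.
BLOCK_SIZE = 128
unpadded = [(4, 100), (104, 100), (204, 100)]
print(straddling_row_groups(unpadded, BLOCK_SIZE))  # prints [1, 2]
```

If each row group instead started on a block boundary, for example at
offsets 0, 128 and 256 in this toy layout, the same call would return an
empty list, which is what padding to the block size buys you.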

[1] - http://alluxio.org/  (formerly Tachyon)
[2] -
https://mail-archives.apache.org/mod_mbox/parquet-dev/201601.mbox/%3CCA+CA-8vN4heWcLc7=fzp==z895whyaucnd8jdq6q4-spt0j...@mail.gmail.com%3E

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Apr 14, 2016 at 5:52 AM, Johannes Zillmann <[email protected]> wrote:

> Hey there,
>
> 2 more questions on querying parquet files.
>
> (1) Cache files locally ?
> So far I have tested Drill/Parquet only with a local file system. If Drill
> loads a file from HDFS, how much of an overhead is that? Does it load the
> file from HDFS for the first query only and then keep the file cached
> locally, or does it touch HDFS for each query ?
>
> (2) Parallelization on a single Parquet File ?
> If I query a single file, does Drill split the workload across its
> drillbits, or does that only happen when querying multiple files ?
>
> Johannes
