Hi Johannes,

Currently Drill does not implement any caching itself. There have previously been discussions about general caching layers like Tachyon for caching reads of files from S3, and it looks like it should also be able to cache data from HDFS [1]. I believe there was some discussion of a shortcoming where Tachyon would not cache Parquet files, because we were skipping a tiny part at the beginning of the file that holds just the 4 magic bytes. I tried to find that discussion, and I think it actually happened on Slack. We need to get better at reflecting those discussions on the public mailing list so they are more easily searchable.
We will not parallelize reads of a single Parquet file unless it is splittable. For Parquet this means the file has multiple row groups, which are supposed to line up with HDFS blocks. Drill has never produced such files; we always write a single row group per file. Other tools have taken advantage of this feature in Parquet. That said, it has some downsides: the format did not pad row groups to the HDFS block size, which could cause duplicated reading of data that spanned a block boundary (which is inevitable if you don't block-align by padding). Here is some relevant discussion on the topic from the Parquet list [2].

[1] - http://alluxio.org/ (formerly Tachyon)
[2] - https://mail-archives.apache.org/mod_mbox/parquet-dev/201601.mbox/%3CCA+CA-8vN4heWcLc7=fzp==z895whyaucnd8jdq6q4-spt0j...@mail.gmail.com%3E

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Apr 14, 2016 at 5:52 AM, Johannes Zillmann <[email protected] > wrote:

> Hey there,
>
> 2 more questions on querying parquet files.
>
> (1) Cache files locally ?
> So far i tested Drill/Parquet only with a local file-system. If Drill
> loads a file from HDFS, how much of a overhead is that... Does it load the
> file from HDFS for the first query only and then keeps the file cached
> locally or does it touch HDFS for each query ?
>
> (2) Parallelization on a single Parquet File ?
> In case i query a single file, does Drill split the workload across its
> drillbits or does that only happen querying multiple files ?
>
> Johannes
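
P.S. To make the block-alignment point above a bit more concrete, here is a minimal Python sketch (not Drill or Parquet code) that checks which row groups would straddle an HDFS block boundary. The 128 MB block size, the row group offsets and lengths, and the function names are all made-up assumptions for illustration; in a real file the offsets and sizes would come from the Parquet footer metadata.

```python
from typing import List, Tuple

MB = 1024 * 1024
# Assumed block size for illustration; the real value is a cluster/file setting.
HDFS_BLOCK_SIZE = 128 * MB


def blocks_touched(offset: int, length: int, block_size: int = HDFS_BLOCK_SIZE) -> range:
    """Return the indices of the blocks that the byte range [offset, offset + length) overlaps."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


def spanning_row_groups(
    row_groups: List[Tuple[int, int]], block_size: int = HDFS_BLOCK_SIZE
) -> List[int]:
    """Return the indices of row groups whose bytes cross a block boundary.

    Each row group is given as a hypothetical (offset, length) pair in bytes.
    """
    return [
        i
        for i, (offset, length) in enumerate(row_groups)
        if len(blocks_touched(offset, length, block_size)) > 1
    ]


# Unpadded layout: three 100 MB row groups written back to back,
# starting right after the 4 magic bytes at the start of the file.
unpadded = [(4, 100 * MB), (4 + 100 * MB, 100 * MB), (4 + 200 * MB, 100 * MB)]

# Padded layout: every row group after the first starts on a block boundary.
padded = [(4, 100 * MB), (HDFS_BLOCK_SIZE, 100 * MB), (2 * HDFS_BLOCK_SIZE, 100 * MB)]

print(spanning_row_groups(unpadded))  # row groups that cross a block boundary
print(spanning_row_groups(padded))    # none cross when row groups are block-aligned
```

In the unpadded layout the second and third row groups each overlap two blocks, so a reader assigned to one block would have to fetch part of a neighbouring block as well. Padding avoids that at the cost of some wasted space.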
