Hi Adam,

I have a few thoughts that might explain the difference in query times. Drill can read a subset of the data in a Parquet file: when you select only a few columns out of a wide file, the reader skips the columns you didn't ask for, so a query over 3 columns reads less data than one over 10. However, we are still working on optimizing the reader further by making use of the statistics in the block and page metadata. The Parquet writer can store min/max values for blocks of data, which would let us skip reading parts of a column entirely.
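As a rough sketch (the file path and column names here are made-up placeholders, not from your setup), the read cost tracks the projected columns:

    -- reads only the pages for these three columns; the file's
    -- other columns are never touched
    SELECT order_id, customer_id, amount
    FROM dfs.`/data/sample.parquet`;

    -- reads every column in the file
    SELECT *
    FROM dfs.`/data/sample.parquet`;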
If your query summing over a single column was faster, the reason is that it avoided a large number of individual value copies: when we filter, the records that survive the filter have to be copied out one by one. That filtering currently happens in a separate filter operator, and it should be pushed down into the read operation so we can use the file metadata to eliminate some of the reads altogether (a sketch of the two query shapes follows below Adam's message).

-Jason

On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <[email protected]> wrote:

> Hi guys,
>
> I have a question re Parquet. I'm not sure if this is a Drill question
> or a Parquet one, but thought I'd start here.
>
> I have a sample dataset of ~100M rows in a Parquet file. It's quick to
> sum a single column across the whole dataset.
>
> I have a column with approximately 100 unique values (e.g. a customer
> ID). When I filter on that column by one of those values (reducing the
> set to ~1M rows), the query takes longer.
>
> This doesn't make a lot of sense to me - I would have expected the
> Parquet format to only bring back the segments that match and only sum
> those values. I would expect that to make the query orders of magnitude
> faster, not slower.
>
> Other columnar formats I've used (e.g. ORCFile, SQL Server Columnstore)
> behave this way, so I can't quite understand why Parquet doesn't act
> the same.
>
> Can anyone suggest what I'm doing wrong?
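P.S. To make the comparison concrete, the two query shapes under discussion look roughly like this (a sketch only: the path and the columns `amount` and `customer_id` are hypothetical placeholders):

    -- fast today: scans a single column, no per-value filtering
    SELECT SUM(amount)
    FROM dfs.`/data/sample.parquet`;

    -- slower today: Drill reads both columns in full, then a separate
    -- filter operator copies out the ~1% of records that match; with
    -- min/max pushdown it could skip non-matching blocks instead
    SELECT SUM(amount)
    FROM dfs.`/data/sample.parquet`
    WHERE customer_id = 42;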
