Out of interest, is there a reason Drill effectively implemented its own
Parquet reader as opposed to using the reading classes from the Parquet
project itself?  Were there particular performance reasons for this?

On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <[email protected]>
wrote:

> Just made one, I put some comments there from the design discussions we
> have had in the past.
>
> https://issues.apache.org/jira/browse/DRILL-1950
>
> - Jason Altekruse
>
> On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <[email protected]>
> wrote:
>
> > Just a quick follow up on this - is there a JIRA item for implementing
> > push-down predicates for Parquet scans, or do we need to create one?
> >
> > On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <[email protected]>
> > wrote:
> >
> > > Hi Adam,
> > >
> > > I have a few thoughts that might explain the difference in query times.
> > > Drill is able to read a subset of the data from a Parquet file when
> > > selecting only a few columns out of a large file; in terms of read
> > > performance, Drill will give you faster results if you ask for 3 columns
> > > instead of 10. However, we are still working on further optimizing the
> > > reader by making use of the statistics contained in the block and page
> > > metadata, which will allow us to skip reading parts of a column, as the
> > > Parquet writer can store min/max values for blocks of data.
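> > >
> > > As a concrete sketch of what statistics-based skipping looks like at
> > > the file level (assuming parquet-mr's footer API; the file path, the
> > > column name "customer_id", and the target value are hypothetical, and
> > > this is not Drill's actual reader code):
> > >
> > >   import org.apache.hadoop.conf.Configuration;
> > >   import org.apache.hadoop.fs.Path;
> > >   import org.apache.parquet.column.statistics.Statistics;
> > >   import org.apache.parquet.hadoop.ParquetFileReader;
> > >   import org.apache.parquet.hadoop.metadata.BlockMetaData;
> > >   import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
> > >   import org.apache.parquet.hadoop.metadata.ParquetMetadata;
> > >
> > >   // Read only the footer - no data pages are touched.
> > >   Configuration conf = new Configuration();
> > >   ParquetMetadata footer = ParquetFileReader.readFooter(
> > >       conf, new Path("/data/sample.parquet"));
> > >
> > >   long target = 42L;
> > >   for (BlockMetaData block : footer.getBlocks()) {
> > >     for (ColumnChunkMetaData col : block.getColumns()) {
> > >       if (!col.getPath().toDotString().equals("customer_id")) {
> > >         continue;
> > >       }
> > >       Statistics stats = col.getStatistics();
> > >       if (stats == null || stats.isEmpty()) {
> > >         break;  // no min/max recorded: the block must be read
> > >       }
> > >       // The block can only contain target if min <= target <= max.
> > >       boolean mayMatch =
> > >           stats.genericGetMin().compareTo(target) <= 0
> > >           && stats.genericGetMax().compareTo(target) >= 0;
> > >       System.out.println(block.getRowCount() + " rows: "
> > >           + (mayMatch ? "must read" : "can skip"));
> > >     }
> > >   }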
> > >
> > > If you ran a query that was summing over a column, the reason it was
> > > faster is that it avoided a bunch of individual value copies as we
> > > filtered out the records that were not needed. This filtering currently
> > > takes place in a separate filter operator and should be pushed down into
> > > the read operation to make use of the file metadata and eliminate some
> > > of the reads.
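> > >
> > > For reference, parquet-mr itself exposes a mechanism for exactly this
> > > kind of push-down via its filter2 API (shown here only as an
> > > illustration of the idea, not as what Drill's reader does; the column
> > > name and value are hypothetical):
> > >
> > >   import static org.apache.parquet.filter2.predicate.FilterApi.eq;
> > >   import static org.apache.parquet.filter2.predicate.FilterApi.longColumn;
> > >
> > >   import org.apache.parquet.filter2.compat.FilterCompat;
> > >   import org.apache.parquet.filter2.predicate.FilterPredicate;
> > >
> > >   // customer_id = 42, evaluated inside the scan rather than in a
> > >   // separate downstream filter operator.
> > >   FilterPredicate pred = eq(longColumn("customer_id"), 42L);
> > >   FilterCompat.Filter filter = FilterCompat.get(pred);
> > >   // A reader built with this filter can drop whole row groups whose
> > >   // statistics rule out a match, e.g.
> > >   //   ParquetReader.builder(readSupport, path).withFilter(filter).build();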
> > >
> > > -Jason
> > >
> > >
> > >
> > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <[email protected]>
> > > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > I have a question re Parquet.  I'm not sure if this is a Drill
> > > > question or a Parquet one, but thought I'd start here.
> > > >
> > > > I have a sample dataset of ~100M rows in a Parquet file.  It's quick
> > > > to sum a single column across the whole dataset.
> > > >
> > > > I have a column which has approx. 100 unique values (e.g. a customer
> > > > ID).  When I filter on that column by one of those values (reducing
> > > > the set to ~1M rows), the query takes longer.
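> > > >
> > > > For concreteness, the two queries I'm comparing look roughly like
> > > > this (run via Drill's JDBC driver; the path and column names are
> > > > placeholders, not my real schema):
> > > >
> > > >   import java.sql.Connection;
> > > >   import java.sql.DriverManager;
> > > >   import java.sql.ResultSet;
> > > >   import java.sql.Statement;
> > > >
> > > >   Connection conn =
> > > >       DriverManager.getConnection("jdbc:drill:zk=local");
> > > >   Statement stmt = conn.createStatement();
> > > >
> > > >   // Fast: sum one column across all ~100M rows.
> > > >   ResultSet fast = stmt.executeQuery(
> > > >       "SELECT SUM(amount) FROM dfs.`/data/sample.parquet`");
> > > >
> > > >   // Unexpectedly slower: the same sum over ~1M matching rows.
> > > >   ResultSet slow = stmt.executeQuery(
> > > >       "SELECT SUM(amount) FROM dfs.`/data/sample.parquet`"
> > > >       + " WHERE customer_id = 42");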
> > > >
> > > > This doesn't make a lot of sense to me - I would have expected the
> > > > Parquet format to bring back only the segments that match the filter
> > > > and sum only those values.  I would expect that to make the query
> > > > orders of magnitude faster, not slower.
> > > >
> > > > Other columnar formats I've used (e.g. ORCFile, SQL Server
> > > > Columnstore) have acted this way, so I can't quite understand why
> > > > Parquet doesn't act the same.
> > > >
> > > > Can anyone suggest what I'm doing wrong?
> > > >
> > >
> >
>
