Re: Performance querying a single column out of a parquet file

Aman Sinha Mon, 11 Apr 2016 10:37:38 -0700

There is a JIRA related to one aspect of this: DRILL-1950 (filter pushdown
into parquet scan).  I believe this is still a work in progress.  Once that
is implemented, the scan will produce only the filtered rows.


Regarding column projections, currently in Drill, the columns referenced
anywhere in the query (including SELECT list) need to be produced by the
table scan, so the scan will read all those columns, not just the ones in
the filter condition.   You can see what columns are being produced by the
Scan node from the EXPLAIN plan.

What would help for the SELECT * case is *late materialization of columns*,
i.e., even if the filter does not get pushed down into the scan, we could read
only the 'id' column from the table first, do the filtering that supposedly
selects 1 row, then do a late materialization of all other columns just for
that 1 row by using a row-id based lookup (if the underlying storage format
supports row-id based lookup).  This would be a feature request. I am not
sure whether a JIRA already exists for it.
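
To make the difference concrete, here is a toy sketch in Python. It is not
Drill code and it ignores Parquet specifics (row groups, pages, encodings);
the table is just a dict of in-memory column lists, and the names
make_table, eager_scan and late_materialized_scan are made up for
illustration. It only counts how many column values each strategy has to
read for a query like SELECT * ... WHERE id = 23.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical in-memory stand-in for a columnar file: column name -> values.
Table = Dict[str, List[Any]]


def make_table(num_rows: int, num_cols: int) -> Table:
    """Build a table with an 'id' column plus (num_cols - 1) payload columns."""
    table: Table = {"id": list(range(num_rows))}
    for c in range(1, num_cols):
        table[f"c{c}"] = [f"r{i}c{c}" for i in range(num_rows)]
    return table


def eager_scan(table: Table, filter_col: str, value: Any) -> Tuple[List[dict], int]:
    """Read every column of every row, then filter the assembled rows."""
    values_read = 0
    rows = []
    for i in range(len(table[filter_col])):
        row = {}
        for name, col in table.items():
            row[name] = col[i]
            values_read += 1
        if row[filter_col] == value:
            rows.append(row)
    return rows, values_read


def late_materialized_scan(table: Table, filter_col: str, value: Any) -> Tuple[List[dict], int]:
    """Read only the filter column, then fetch other columns by row id."""
    filter_values = table[filter_col]
    values_read = len(filter_values)  # one pass over the filter column
    row_ids = [i for i, v in enumerate(filter_values) if v == value]
    rows = []
    for i in row_ids:
        row = {}
        for name, col in table.items():
            row[name] = col[i]
            if name != filter_col:
                values_read += 1  # row-id lookup into a non-filter column
        rows.append(row)
    return rows, values_read
```

With 9 columns and a filter that matches a single row, the eager version
reads 9 values per row, while the late version reads 1 value per row plus 8
more for the matching row. Both return the same result. Whether a real
implementation sees a similar gain depends on how cheap the row-id lookup
is in the storage format.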

-Aman

On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <[email protected]> wrote:

> I just replicated these results. Full table scans with aggregation take
> pretty much exactly the same amount of time with or without filtering.
>
>
>
> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <
> [email protected]
> > wrote:
>
> > Hey Ted,
> >
> > Sorry i mixed up row and column!
> >
> > Queries are like that:
> >         (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> >         (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
> >
> > (1) is 14 sec and (2) is 1.5 sec.
> > Using drill-1.6.
> > So it looks like Drill is extracting the columns before filtering which i
> > didn’t expect…
> > Is there any way to change that behaviour?
> >
> > Johannes
> >
> >
> >
> > > On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]> wrote:
> > >
> > > Did you mean that you are doing a select to find a single column? What
> > you
> > > typed was row, but that seems out of line with the rest of what you
> > wrote.
> > >
> > > If you are truly asking about filtering down to a single row, whether
> it
> > > costs more to return all of the columns rather than just one from a
> > single
> > > row will depend on whether Drill is extracting columns before filtering
> > or
> > > after.
> > >
> > >
> > >
> > > On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <
> > [email protected]
> > >> wrote:
> > >
> > >> Hey there,
> > >>
> > >> I’m currently doing some performance measurements on Drill.
> > >> In my case it’s a single parquet file with a single local Drill Bit.
> > >>
> > >> Now in one case i have unexpected results and i’m curious if somebody
> > has
> > >> a good explanation for it!
> > >>
> > >> So i have a file with 10 million rows with 9 columns.
> > >> Now i’m doing a select statement to find one single row.
> > >> Runtime with select * : ~ 14.79 s
> > >> Runtime with select(filterField) : ~ 1.5 sec
> > >>
> > >> So i’m surprised that there is so much variance depending on the
> fields
> > i
> > >> select, since i thought Drill needs most time for finding that one
> > element,
> > >> and then deserialize the other fields only on a hit…
> > >> But for deserialising 8 more hits 10 sec seems way too much!?!?!?
> > >>
> > >> best
> > >> Johannes
> > >>
> > >>
> >
> >
>
