Re: Performance querying a single column out of a parquet file

John Omernik Fri, 01 Jul 2016 06:11:07 -0700

Hey all, some colleagues are looking at this on Impala (IMPALA-2017)and
asked if Drill could do this. (Late/Lazy Materialization of columns).


While the performance gain on tables with less columns may not be huge ,
when you are looking at really wide tables, with disparate date types, this
can be huge.   For example, on one of my tables, if I do  "select id from
table where id = 12 and location between 10 and 200" Drill will return in
30 seconds. When I run select * from from table where id = 12 and location
between 10 and 200" and this query is well into 14 minutes of run time.
That's a huge difference.

Now, the initial answer may be "train user only to select the columns they
need"  and yes, we will be working on that... HOWEVER as anyone who works
in infosec knows, user training can be the best there is, and you will get
people who don't follow the instructions. And, since this is such a intense
query, those hit or miss queries with select * can then cause a large
impact on the performance of a cluster.

Do we have a JIRA open on late/lazy materialization of fields in Parquet?

John

On Thu, Apr 14, 2016 at 9:57 AM, Ted Dunning <[email protected]> wrote:

> Not quite.
>
> With a fix for DRILL_1950, no rows would necessarily be materialized at all
> for the filter columns. Rows would only be materialized for the projection
> columns when the filter matches.
>
> In some cases, the pushdown might be implemented by fully materializing the
> values referenced by the filter, but hopefully not.
>
>
> On Thu, Apr 14, 2016 at 1:42 PM, Johannes Zillmann <
> [email protected]
> > wrote:
>
> > Ok, thanks for the information!
> >
> > Am i right that in case DRILL-1950 would be fixed, Drill would
> > automatically only materialize only those rows/columns which match the
> > filter ?
> >
> > If not so, would the late materialization you described for the filter
> > case be possible to implement with the current Hooks/API ?
> >
> > Johannes
> >
> > > On 11 Apr 2016, at 19:36, Aman Sinha <[email protected]> wrote:
> > >
> > > There is a JIRA related to one aspect of this: DRILL-1950 (filter
> > pushdown
> > > into parquet scan).  This is still work in progress I believe.  Once
> that
> > > is implemented, the scan will produce the filtered rows only.
> > >
> > > Regarding column projections, currently in Drill, the columns
> referenced
> > > anywhere in the query (including SELECT list) need to be produced by
> the
> > > table scan, so the scan will read all those columns, not just the ones
> in
> > > the filter condition.   You can see what columns are being produced by
> > the
> > > Scan node from the EXPLAIN plan.
> > >
> > > What would help for the SELECT * case is* late materialization of
> > columns*.
> > > i.e even if the filter does not get pushed down into scan,  we could
> read
> > > only the 'id' column from the table first, do the filtering that
> > supposedly
> > > selects 1 row, then do a late materialization of all other columns just
> > for
> > > that 1 row by using a row-id based lookup (if the underlying storage
> > format
> > > supports rowid based lookup).   This would be a feature request..I am
> not
> > > sure if a JIRA already exists for it or not.
> > >
> > > -Aman
> > >
> > > On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <[email protected]>
> > wrote:
> > >
> > >> I just replicated these results. Full table scans with aggregation
> take
> > >> pretty much exactly the same amount of time with or without filtering.
> > >>
> > >>
> > >>
> > >> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <
> > >> [email protected]
> > >>> wrote:
> > >>
> > >>> Hey Ted,
> > >>>
> > >>> Sorry i mixed up row and column!
> > >>>
> > >>> Queries are like that:
> > >>>        (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> > >>>        (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
> > >>>
> > >>> (1) is 14 sec and (2) is 1.5 sec.
> > >>> Using drill-1.6.
> > >>> So it looks like Drill is extracting the columns before filtering
> > which i
> > >>> didn’t expect…
> > >>> Is there anyway to change that behaviour ?
> > >>>
> > >>> Johannes
> > >>>
> > >>>
> > >>>
> > >>>> On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]>
> wrote:
> > >>>>
> > >>>> Did you mean that you are doing a select to find a single column?
> What
> > >>> you
> > >>>> typed was row, but that seems out of line with the rest of what you
> > >>> wrote.
> > >>>>
> > >>>> If you are truly asking about filtering down to a single row,
> whether
> > >> it
> > >>>> costs more to return all of the columns rather than just one from a
> > >>> single
> > >>>> row will depend on whether Drill is extracting columns before
> > filtering
> > >>> or
> > >>>> after.
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <
> > >>> [email protected]
> > >>>>> wrote:
> > >>>>
> > >>>>> Hey there,
> > >>>>>
> > >>>>> i currently doing some performance measurements on Drill.
> > >>>>> In my case its a single parquet file with a single local Drill Bit.
> > >>>>>
> > >>>>> Now in one case i have unexpected results and i’m curious if
> somebody
> > >>> has
> > >>>>> a good explanation for it!
> > >>>>>
> > >>>>> So i have a file with 10 mio rows with 9 columns .
> > >>>>> Now i’m doing a select statement to find one single row.
> > >>>>> Runtime with select * : ~ 14.79 s
> > >>>>> Runtime with select(filterField) : ~ 1.5 sec
> > >>>>>
> > >>>>> So i’m surprised that there is so much variance depending on the
> > >> fields
> > >>> i
> > >>>>> select, since i thought Drill needs most time for finding that one
> > >>> element,
> > >>>>> and then deserialize the other fields only on a hit…
> > >>>>> But for deserialising 8 more hits 10 sec seem way to much!?!?!?
> > >>>>>
> > >>>>> best
> > >>>>> Johannes
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >>
> >
> >
>

Re: Performance querying a single column out of a parquet file

Reply via email to