Re: Performance querying a single column out of a parquet file

Ted Dunning Thu, 14 Apr 2016 07:58:49 -0700

Not quite.

With a fix for DRILL-1950, no rows would necessarily be materialized at all
for the filter columns. Rows would only be materialized for the projection
columns when the filter matches.


In some cases, the pushdown might be implemented by fully materializing the
values referenced by the filter, but hopefully not.
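
To make the difference concrete, here is a minimal sketch in plain Python. It is not Drill code. The toy column store, the function names and the "decoded values" counter are all assumptions made up for illustration. It also shows only the simpler variant in which the filter column is fully read. A real pushdown could skip even that, for instance by using column statistics.

```python
# Toy column store: every column is a Python list, and each value that a
# strategy touches is counted as one "decoded" value.
# All names here are hypothetical; this is not how Drill is implemented.

def make_table(num_rows, num_cols):
    """Build a table with an integer 'id' column plus num_cols - 1 string columns."""
    table = {"id": list(range(num_rows))}
    for c in range(1, num_cols):
        table[f"c{c}"] = [f"r{r}c{c}" for r in range(num_rows)]
    return table

def eager_scan(table, filter_col, predicate):
    """Materialize every column of every row first, then apply the filter."""
    names = list(table)
    decoded = 0
    rows = []
    for i in range(len(table[filter_col])):
        row = {name: table[name][i] for name in names}
        decoded += len(names)
        if predicate(row[filter_col]):
            rows.append(row)
    return rows, decoded

def late_materialization(table, filter_col, predicate):
    """Read only the filter column, then fetch the other columns for matches."""
    decoded = 0
    matching_positions = []
    for i, value in enumerate(table[filter_col]):
        decoded += 1
        if predicate(value):
            matching_positions.append(i)
    rows = []
    for i in matching_positions:
        # Row-position lookup into each column; the filter value was
        # already decoded above, so only the other columns are counted.
        rows.append({name: table[name][i] for name in table})
        decoded += len(table) - 1
    return rows, decoded

if __name__ == "__main__":
    table = make_table(1000, 9)
    print(eager_scan(table, "id", lambda v: v == 23)[1])
    print(late_materialization(table, "id", lambda v: v == 23)[1])
```

Both functions return the same single row for `id = 23`, but the eager scan touches every value in the table while the late variant touches the filter column plus the remaining columns of the one matching row.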


On Thu, Apr 14, 2016 at 1:42 PM, Johannes Zillmann <[email protected]> wrote:

> Ok, thanks for the information!
>
> Am I right that if DRILL-1950 were fixed, Drill would automatically
> materialize only those rows/columns which match the filter?
>
> If not, would the late materialization you described for the filter
> case be possible to implement with the current hooks/API?
>
> Johannes
>
> > On 11 Apr 2016, at 19:36, Aman Sinha <[email protected]> wrote:
> >
> > There is a JIRA related to one aspect of this: DRILL-1950 (filter pushdown
> > into parquet scan).  This is still work in progress I believe.  Once that
> > is implemented, the scan will produce the filtered rows only.
> >
> > Regarding column projections, currently in Drill, the columns referenced
> > anywhere in the query (including SELECT list) need to be produced by the
> > table scan, so the scan will read all those columns, not just the ones in
> > the filter condition.   You can see what columns are being produced by the
> > Scan node from the EXPLAIN plan.
> >
> > What would help for the SELECT * case is *late materialization of columns*,
> > i.e. even if the filter does not get pushed down into the scan, we could read
> > only the 'id' column from the table first, do the filtering that supposedly
> > selects 1 row, then do a late materialization of all other columns just for
> > that 1 row by using a row-id based lookup (if the underlying storage format
> > supports row-id based lookup).  This would be a feature request. I am not
> > sure if a JIRA already exists for it or not.
> >
> > -Aman
> >
> > On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <[email protected]> wrote:
> >
> >> I just replicated these results. Full table scans with aggregation take
> >> pretty much exactly the same amount of time with or without filtering.
> >>
> >>
> >>
> >> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <[email protected]> wrote:
> >>
> >>> Hey Ted,
> >>>
> >>> Sorry, I mixed up row and column!
> >>>
> >>> The queries look like this:
> >>>        (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> >>>        (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
> >>>
> >>> (1) is 14 sec and (2) is 1.5 sec.
> >>> Using drill-1.6.
> >>> So it looks like Drill is extracting the columns before filtering, which I
> >>> didn’t expect…
> >>> Is there any way to change that behaviour?
> >>>
> >>> Johannes
> >>>
> >>>
> >>>
> >>>> On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]> wrote:
> >>>>
> >>>> Did you mean that you are doing a select to find a single column? What you
> >>>> typed was row, but that seems out of line with the rest of what you wrote.
> >>>>
> >>>> If you are truly asking about filtering down to a single row, whether it
> >>>> costs more to return all of the columns rather than just one from a single
> >>>> row will depend on whether Drill is extracting columns before filtering or
> >>>> after.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <[email protected]> wrote:
> >>>>
> >>>>> Hey there,
> >>>>>
> >>>>> I am currently doing some performance measurements on Drill.
> >>>>> In my case it's a single parquet file with a single local Drillbit.
> >>>>>
> >>>>> Now in one case I have unexpected results and I’m curious if somebody has
> >>>>> a good explanation for it!
> >>>>>
> >>>>> So I have a file with 10 million rows and 9 columns.
> >>>>> Now I’m doing a select statement to find one single row.
> >>>>> Runtime with select * : ~ 14.79 s
> >>>>> Runtime with select(filterField) : ~ 1.5 s
> >>>>>
> >>>>> So I’m surprised that there is so much variance depending on the fields I
> >>>>> select, since I thought Drill needs most of its time for finding that one
> >>>>> element, and then deserializes the other fields only on a hit…
> >>>>> But for deserialising 8 more fields, 10 sec seems way too much!?
> >>>>>
> >>>>> best
> >>>>> Johannes
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>
>
