Re: Performance querying a single column out of a parquet file

John Omernik Fri, 01 Jul 2016 11:05:30 -0700

I created a JIRA for discussion. This could be a huge performance win if it
were possible.


https://issues.apache.org/jira/browse/DRILL-4758



On Fri, Jul 1, 2016 at 12:33 PM, Parth Chandra <pchan...@maprtech.com>
wrote:

> This has come up in the past in some other context. At the moment though,
> there is no JIRA for this.
>
> On Fri, Jul 1, 2016 at 6:10 AM, John Omernik <j...@omernik.com> wrote:
>
> > Hey all, some colleagues are looking at this on Impala (IMPALA-2017) and
> > asked if Drill could do this (late/lazy materialization of columns).
> >
> > While the performance gain on tables with fewer columns may not be huge,
> > when you are looking at really wide tables with disparate data types, it
> > can be huge. For example, on one of my tables, if I run "select id from
> > table where id = 12 and location between 10 and 200", Drill returns in
> > 30 seconds. When I run "select * from table where id = 12 and location
> > between 10 and 200", the query is well into 14 minutes of run time.
> > That's a huge difference.
> >
> > Now, the initial answer may be "train users to select only the columns
> > they need", and yes, we will be working on that. HOWEVER, as anyone who
> > works in infosec knows, user training can be the best there is and you
> > will still get people who don't follow the instructions. And since this
> > is such an intense query, those hit-or-miss queries with select * can
> > cause a large impact on the performance of a cluster.
> >
> > Do we have a JIRA open on late/lazy materialization of fields in Parquet?
> >
> > John
> >
> > On Thu, Apr 14, 2016 at 9:57 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > Not quite.
> > >
> > > With a fix for DRILL-1950, no rows would necessarily be materialized at
> > > all for the filter columns. Rows would only be materialized for the
> > > projection columns when the filter matches.
> > >
> > > In some cases, the pushdown might be implemented by fully materializing
> > > the values referenced by the filter, but hopefully not.
> > >
> > >
> > > On Thu, Apr 14, 2016 at 1:42 PM, Johannes Zillmann <
> > > jzillm...@googlemail.com
> > > > wrote:
> > >
> > > > Ok, thanks for the information!
> > > >
> > > > Am I right that once DRILL-1950 is fixed, Drill would automatically
> > > > materialize only those rows/columns which match the filter?
> > > >
> > > > If not, would the late materialization you described for the filter
> > > > case be possible to implement with the current hooks/API?
> > > >
> > > > Johannes
> > > >
> > > > > On 11 Apr 2016, at 19:36, Aman Sinha <amansi...@apache.org> wrote:
> > > > >
> > > > > There is a JIRA related to one aspect of this: DRILL-1950 (filter
> > > > > pushdown into parquet scan). This is still work in progress, I
> > > > > believe. Once that is implemented, the scan will produce the
> > > > > filtered rows only.
> > > > >
> > > > > Regarding column projections, currently in Drill, the columns
> > > > > referenced anywhere in the query (including the SELECT list) need to
> > > > > be produced by the table scan, so the scan will read all those
> > > > > columns, not just the ones in the filter condition. You can see what
> > > > > columns are being produced by the Scan node from the EXPLAIN plan.
> > > > >
> > > > > What would help for the SELECT * case is *late materialization of
> > > > > columns*, i.e. even if the filter does not get pushed down into the
> > > > > scan, we could read only the 'id' column from the table first, do
> > > > > the filtering that supposedly selects 1 row, then do a late
> > > > > materialization of all other columns just for that 1 row by using a
> > > > > row-id based lookup (if the underlying storage format supports
> > > > > row-id based lookup). This would be a feature request. I am not
> > > > > sure if a JIRA already exists for it or not.
> > > > >
> > > > > -Aman
> > > > >
> > > > > On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <
> ted.dunn...@gmail.com>
> > > > wrote:
> > > > >
> > > > > >> I just replicated these results. Full table scans with aggregation
> > > > > >> take pretty much exactly the same amount of time with or without
> > > > > >> filtering.
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <
> > > > >> jzillm...@googlemail.com
> > > > >>> wrote:
> > > > >>
> > > > >>> Hey Ted,
> > > > >>>
> > > > > >>> Sorry, I mixed up row and column!
> > > > > >>>
> > > > > >>> The queries look like this:
> > > > > >>>        (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> > > > > >>>        (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
> > > > > >>>
> > > > > >>> (1) takes 14 sec and (2) takes 1.5 sec.
> > > > > >>> Using drill-1.6.
> > > > > >>> So it looks like Drill is extracting the columns before filtering,
> > > > > >>> which I didn’t expect…
> > > > > >>> Is there any way to change that behaviour?
> > > > >>>
> > > > >>> Johannes
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>> On 11 Apr 2016, at 16:42, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > > > >>>>
> > > > > >>>> Did you mean that you are doing a select to find a single
> > > > > >>>> column? What you typed was row, but that seems out of line with
> > > > > >>>> the rest of what you wrote.
> > > > > >>>>
> > > > > >>>> If you are truly asking about filtering down to a single row,
> > > > > >>>> whether it costs more to return all of the columns rather than
> > > > > >>>> just one from a single row will depend on whether Drill is
> > > > > >>>> extracting columns before filtering or after.
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <
> > > > >>> jzillm...@googlemail.com
> > > > >>>>> wrote:
> > > > >>>>
> > > > >>>>> Hey there,
> > > > >>>>>
> > > > > >>>>> I am currently doing some performance measurements on Drill.
> > > > > >>>>> In my case it's a single parquet file with a single local
> > > > > >>>>> Drillbit.
> > > > > >>>>>
> > > > > >>>>> Now in one case I have unexpected results and I’m curious if
> > > > > >>>>> somebody has a good explanation for it!
> > > > > >>>>>
> > > > > >>>>> So I have a file with 10 million rows and 9 columns.
> > > > > >>>>> Now I’m doing a select statement to find one single row.
> > > > > >>>>> Runtime with select * : ~ 14.79 s
> > > > > >>>>> Runtime with select(filterField) : ~ 1.5 sec
> > > > > >>>>>
> > > > > >>>>> So I’m surprised that there is so much variance depending on the
> > > > > >>>>> fields I select, since I thought Drill needs most of the time
> > > > > >>>>> for finding that one element, and then deserializes the other
> > > > > >>>>> fields only on a hit…
> > > > > >>>>> But for deserialising 8 more fields, 10 sec seems way too much!?
> > > > >>>>>
> > > > >>>>> best
> > > > >>>>> Johannes
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>
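
The late materialization idea described by Aman in the thread (filter on the
'id' column first, then fetch the remaining columns only for matching row ids)
can be sketched roughly as follows. This is a toy Python illustration over an
in-memory dict of columns, not Drill's or Parquet's actual reader code. The
names Table, eager_select_all, and late_select_all are hypothetical, and the
sketch assumes the storage supports lookup by row id, which the thread notes
is a precondition.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory column store: column name -> list of values.
# Index i across all lists forms row i (assumption for this sketch).
Table = Dict[str, List[object]]


def eager_select_all(table: Table, filter_col: str,
                     predicate: Callable[[object], bool]) -> List[dict]:
    """Materialize every row with all columns, then apply the filter."""
    names = list(table)
    n_rows = len(table[filter_col])
    rows = [{name: table[name][i] for name in names} for i in range(n_rows)]
    return [row for row in rows if predicate(row[filter_col])]


def late_select_all(table: Table, filter_col: str,
                    predicate: Callable[[object], bool]) -> List[dict]:
    """Scan only the filter column, then look up other columns by row id."""
    matching_ids = [i for i, value in enumerate(table[filter_col])
                    if predicate(value)]
    # Only the matching row ids touch the non-filter columns.
    return [{name: column[i] for name, column in table.items()}
            for i in matching_ids]


table: Table = {
    "id": [10, 23, 42],
    "location": [5, 150, 300],
    "payload": ["a", "b", "c"],
}
result = late_select_all(table, "id", lambda v: v == 23)
```

Both functions return the same rows; the difference is that the late variant
reads the wide columns only for row ids that passed the filter, which is where
the saving on very wide tables would come from.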
