Re: Performance querying a single column out of a parquet file

Parth Chandra Fri, 01 Jul 2016 10:33:50 -0700

This has come up in the past in some other context. At the moment though,
there is no JIRA for this.


On Fri, Jul 1, 2016 at 6:10 AM, John Omernik <[email protected]> wrote:

> Hey all, some colleagues are looking at this on Impala (IMPALA-2017)and
> asked if Drill could do this. (Late/Lazy Materialization of columns).
>
> While the performance gain on tables with less columns may not be huge ,
> when you are looking at really wide tables, with disparate date types, this
> can be huge.   For example, on one of my tables, if I do  "select id from
> table where id = 12 and location between 10 and 200" Drill will return in
> 30 seconds. When I run select * from from table where id = 12 and location
> between 10 and 200" and this query is well into 14 minutes of run time.
> That's a huge difference.
>
> Now, the initial answer may be "train user only to select the columns they
> need"  and yes, we will be working on that... HOWEVER as anyone who works
> in infosec knows, user training can be the best there is, and you will get
> people who don't follow the instructions. And, since this is such a intense
> query, those hit or miss queries with select * can then cause a large
> impact on the performance of a cluster.
>
> Do we have a JIRA open on late/lazy materialization of fields in Parquet?
>
> John
>
> On Thu, Apr 14, 2016 at 9:57 AM, Ted Dunning <[email protected]>
> wrote:
>
> > Not quite.
> >
> > With a fix for DRILL_1950, no rows would necessarily be materialized at
> all
> > for the filter columns. Rows would only be materialized for the
> projection
> > columns when the filter matches.
> >
> > In some cases, the pushdown might be implemented by fully materializing
> the
> > values referenced by the filter, but hopefully not.
> >
> >
> > On Thu, Apr 14, 2016 at 1:42 PM, Johannes Zillmann <
> > [email protected]
> > > wrote:
> >
> > > Ok, thanks for the information!
> > >
> > > Am i right that in case DRILL-1950 would be fixed, Drill would
> > > automatically only materialize only those rows/columns which match the
> > > filter ?
> > >
> > > If not so, would the late materialization you described for the filter
> > > case be possible to implement with the current Hooks/API ?
> > >
> > > Johannes
> > >
> > > > On 11 Apr 2016, at 19:36, Aman Sinha <[email protected]> wrote:
> > > >
> > > > There is a JIRA related to one aspect of this: DRILL-1950 (filter
> > > pushdown
> > > > into parquet scan).  This is still work in progress I believe.  Once
> > that
> > > > is implemented, the scan will produce the filtered rows only.
> > > >
> > > > Regarding column projections, currently in Drill, the columns
> > referenced
> > > > anywhere in the query (including SELECT list) need to be produced by
> > the
> > > > table scan, so the scan will read all those columns, not just the
> ones
> > in
> > > > the filter condition.   You can see what columns are being produced
> by
> > > the
> > > > Scan node from the EXPLAIN plan.
> > > >
> > > > What would help for the SELECT * case is* late materialization of
> > > columns*.
> > > > i.e even if the filter does not get pushed down into scan,  we could
> > read
> > > > only the 'id' column from the table first, do the filtering that
> > > supposedly
> > > > selects 1 row, then do a late materialization of all other columns
> just
> > > for
> > > > that 1 row by using a row-id based lookup (if the underlying storage
> > > format
> > > > supports rowid based lookup).   This would be a feature request..I am
> > not
> > > > sure if a JIRA already exists for it or not.
> > > >
> > > > -Aman
> > > >
> > > > On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <[email protected]>
> > > wrote:
> > > >
> > > >> I just replicated these results. Full table scans with aggregation
> > take
> > > >> pretty much exactly the same amount of time with or without
> filtering.
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <
> > > >> [email protected]
> > > >>> wrote:
> > > >>
> > > >>> Hey Ted,
> > > >>>
> > > >>> Sorry i mixed up row and column!
> > > >>>
> > > >>> Queries are like that:
> > > >>>        (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> > > >>>        (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
> > > >>>
> > > >>> (1) is 14 sec and (2) is 1.5 sec.
> > > >>> Using drill-1.6.
> > > >>> So it looks like Drill is extracting the columns before filtering
> > > which i
> > > >>> didn’t expect…
> > > >>> Is there anyway to change that behaviour ?
> > > >>>
> > > >>> Johannes
> > > >>>
> > > >>>
> > > >>>
> > > >>>> On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]>
> > wrote:
> > > >>>>
> > > >>>> Did you mean that you are doing a select to find a single column?
> > What
> > > >>> you
> > > >>>> typed was row, but that seems out of line with the rest of what
> you
> > > >>> wrote.
> > > >>>>
> > > >>>> If you are truly asking about filtering down to a single row,
> > whether
> > > >> it
> > > >>>> costs more to return all of the columns rather than just one from
> a
> > > >>> single
> > > >>>> row will depend on whether Drill is extracting columns before
> > > filtering
> > > >>> or
> > > >>>> after.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <
> > > >>> [email protected]
> > > >>>>> wrote:
> > > >>>>
> > > >>>>> Hey there,
> > > >>>>>
> > > >>>>> i currently doing some performance measurements on Drill.
> > > >>>>> In my case its a single parquet file with a single local Drill
> Bit.
> > > >>>>>
> > > >>>>> Now in one case i have unexpected results and i’m curious if
> > somebody
> > > >>> has
> > > >>>>> a good explanation for it!
> > > >>>>>
> > > >>>>> So i have a file with 10 mio rows with 9 columns .
> > > >>>>> Now i’m doing a select statement to find one single row.
> > > >>>>> Runtime with select * : ~ 14.79 s
> > > >>>>> Runtime with select(filterField) : ~ 1.5 sec
> > > >>>>>
> > > >>>>> So i’m surprised that there is so much variance depending on the
> > > >> fields
> > > >>> i
> > > >>>>> select, since i thought Drill needs most time for finding that
> one
> > > >>> element,
> > > >>>>> and then deserialize the other fields only on a hit…
> > > >>>>> But for deserialising 8 more hits 10 sec seem way to much!?!?!?
> > > >>>>>
> > > >>>>> best
> > > >>>>> Johannes
> > > >>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
>

Re: Performance querying a single column out of a parquet file

Reply via email to