This has come up in the past in some other context. At the moment though, there is no JIRA for this.
On Fri, Jul 1, 2016 at 6:10 AM, John Omernik <[email protected]> wrote: > Hey all, some colleagues are looking at this on Impala (IMPALA-2017)and > asked if Drill could do this. (Late/Lazy Materialization of columns). > > While the performance gain on tables with less columns may not be huge , > when you are looking at really wide tables, with disparate date types, this > can be huge. For example, on one of my tables, if I do "select id from > table where id = 12 and location between 10 and 200" Drill will return in > 30 seconds. When I run select * from from table where id = 12 and location > between 10 and 200" and this query is well into 14 minutes of run time. > That's a huge difference. > > Now, the initial answer may be "train user only to select the columns they > need" and yes, we will be working on that... HOWEVER as anyone who works > in infosec knows, user training can be the best there is, and you will get > people who don't follow the instructions. And, since this is such a intense > query, those hit or miss queries with select * can then cause a large > impact on the performance of a cluster. > > Do we have a JIRA open on late/lazy materialization of fields in Parquet? > > John > > On Thu, Apr 14, 2016 at 9:57 AM, Ted Dunning <[email protected]> > wrote: > > > Not quite. > > > > With a fix for DRILL_1950, no rows would necessarily be materialized at > all > > for the filter columns. Rows would only be materialized for the > projection > > columns when the filter matches. > > > > In some cases, the pushdown might be implemented by fully materializing > the > > values referenced by the filter, but hopefully not. > > > > > > On Thu, Apr 14, 2016 at 1:42 PM, Johannes Zillmann < > > [email protected] > > > wrote: > > > > > Ok, thanks for the information! > > > > > > Am i right that in case DRILL-1950 would be fixed, Drill would > > > automatically only materialize only those rows/columns which match the > > > filter ? > > > > > > If not so, would the late materialization you described for the filter > > > case be possible to implement with the current Hooks/API ? > > > > > > Johannes > > > > > > > On 11 Apr 2016, at 19:36, Aman Sinha <[email protected]> wrote: > > > > > > > > There is a JIRA related to one aspect of this: DRILL-1950 (filter > > > pushdown > > > > into parquet scan). This is still work in progress I believe. Once > > that > > > > is implemented, the scan will produce the filtered rows only. > > > > > > > > Regarding column projections, currently in Drill, the columns > > referenced > > > > anywhere in the query (including SELECT list) need to be produced by > > the > > > > table scan, so the scan will read all those columns, not just the > ones > > in > > > > the filter condition. You can see what columns are being produced > by > > > the > > > > Scan node from the EXPLAIN plan. > > > > > > > > What would help for the SELECT * case is* late materialization of > > > columns*. > > > > i.e even if the filter does not get pushed down into scan, we could > > read > > > > only the 'id' column from the table first, do the filtering that > > > supposedly > > > > selects 1 row, then do a late materialization of all other columns > just > > > for > > > > that 1 row by using a row-id based lookup (if the underlying storage > > > format > > > > supports rowid based lookup). This would be a feature request..I am > > not > > > > sure if a JIRA already exists for it or not. > > > > > > > > -Aman > > > > > > > > On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <[email protected]> > > > wrote: > > > > > > > >> I just replicated these results. Full table scans with aggregation > > take > > > >> pretty much exactly the same amount of time with or without > filtering. > > > >> > > > >> > > > >> > > > >> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann < > > > >> [email protected] > > > >>> wrote: > > > >> > > > >>> Hey Ted, > > > >>> > > > >>> Sorry i mixed up row and column! > > > >>> > > > >>> Queries are like that: > > > >>> (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23" > > > >>> (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23" > > > >>> > > > >>> (1) is 14 sec and (2) is 1.5 sec. > > > >>> Using drill-1.6. > > > >>> So it looks like Drill is extracting the columns before filtering > > > which i > > > >>> didn’t expect… > > > >>> Is there anyway to change that behaviour ? > > > >>> > > > >>> Johannes > > > >>> > > > >>> > > > >>> > > > >>>> On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]> > > wrote: > > > >>>> > > > >>>> Did you mean that you are doing a select to find a single column? > > What > > > >>> you > > > >>>> typed was row, but that seems out of line with the rest of what > you > > > >>> wrote. > > > >>>> > > > >>>> If you are truly asking about filtering down to a single row, > > whether > > > >> it > > > >>>> costs more to return all of the columns rather than just one from > a > > > >>> single > > > >>>> row will depend on whether Drill is extracting columns before > > > filtering > > > >>> or > > > >>>> after. > > > >>>> > > > >>>> > > > >>>> > > > >>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann < > > > >>> [email protected] > > > >>>>> wrote: > > > >>>> > > > >>>>> Hey there, > > > >>>>> > > > >>>>> i currently doing some performance measurements on Drill. > > > >>>>> In my case its a single parquet file with a single local Drill > Bit. > > > >>>>> > > > >>>>> Now in one case i have unexpected results and i’m curious if > > somebody > > > >>> has > > > >>>>> a good explanation for it! > > > >>>>> > > > >>>>> So i have a file with 10 mio rows with 9 columns . > > > >>>>> Now i’m doing a select statement to find one single row. > > > >>>>> Runtime with select * : ~ 14.79 s > > > >>>>> Runtime with select(filterField) : ~ 1.5 sec > > > >>>>> > > > >>>>> So i’m surprised that there is so much variance depending on the > > > >> fields > > > >>> i > > > >>>>> select, since i thought Drill needs most time for finding that > one > > > >>> element, > > > >>>>> and then deserialize the other fields only on a hit… > > > >>>>> But for deserialising 8 more hits 10 sec seem way to much!?!?!? > > > >>>>> > > > >>>>> best > > > >>>>> Johannes > > > >>>>> > > > >>>>> > > > >>> > > > >>> > > > >> > > > > > > > > >
