I created a JIRA for discussion. This could be a huge performance win if it were possible.
https://issues.apache.org/jira/browse/DRILL-4758 On Fri, Jul 1, 2016 at 12:33 PM, Parth Chandra <pchan...@maprtech.com> wrote: > This has come up in the past in some other context. At the moment though, > there is no JIRA for this. > > On Fri, Jul 1, 2016 at 6:10 AM, John Omernik <j...@omernik.com> wrote: > > > Hey all, some colleagues are looking at this on Impala (IMPALA-2017)and > > asked if Drill could do this. (Late/Lazy Materialization of columns). > > > > While the performance gain on tables with less columns may not be huge , > > when you are looking at really wide tables, with disparate date types, > this > > can be huge. For example, on one of my tables, if I do "select id from > > table where id = 12 and location between 10 and 200" Drill will return in > > 30 seconds. When I run select * from from table where id = 12 and > location > > between 10 and 200" and this query is well into 14 minutes of run time. > > That's a huge difference. > > > > Now, the initial answer may be "train user only to select the columns > they > > need" and yes, we will be working on that... HOWEVER as anyone who works > > in infosec knows, user training can be the best there is, and you will > get > > people who don't follow the instructions. And, since this is such a > intense > > query, those hit or miss queries with select * can then cause a large > > impact on the performance of a cluster. > > > > Do we have a JIRA open on late/lazy materialization of fields in Parquet? > > > > John > > > > On Thu, Apr 14, 2016 at 9:57 AM, Ted Dunning <ted.dunn...@gmail.com> > > wrote: > > > > > Not quite. > > > > > > With a fix for DRILL_1950, no rows would necessarily be materialized at > > all > > > for the filter columns. Rows would only be materialized for the > > projection > > > columns when the filter matches. > > > > > > In some cases, the pushdown might be implemented by fully materializing > > the > > > values referenced by the filter, but hopefully not. > > > > > > > > > On Thu, Apr 14, 2016 at 1:42 PM, Johannes Zillmann < > > > jzillm...@googlemail.com > > > > wrote: > > > > > > > Ok, thanks for the information! > > > > > > > > Am i right that in case DRILL-1950 would be fixed, Drill would > > > > automatically only materialize only those rows/columns which match > the > > > > filter ? > > > > > > > > If not so, would the late materialization you described for the > filter > > > > case be possible to implement with the current Hooks/API ? > > > > > > > > Johannes > > > > > > > > > On 11 Apr 2016, at 19:36, Aman Sinha <amansi...@apache.org> wrote: > > > > > > > > > > There is a JIRA related to one aspect of this: DRILL-1950 (filter > > > > pushdown > > > > > into parquet scan). This is still work in progress I believe. > Once > > > that > > > > > is implemented, the scan will produce the filtered rows only. > > > > > > > > > > Regarding column projections, currently in Drill, the columns > > > referenced > > > > > anywhere in the query (including SELECT list) need to be produced > by > > > the > > > > > table scan, so the scan will read all those columns, not just the > > ones > > > in > > > > > the filter condition. You can see what columns are being produced > > by > > > > the > > > > > Scan node from the EXPLAIN plan. > > > > > > > > > > What would help for the SELECT * case is* late materialization of > > > > columns*. > > > > > i.e even if the filter does not get pushed down into scan, we > could > > > read > > > > > only the 'id' column from the table first, do the filtering that > > > > supposedly > > > > > selects 1 row, then do a late materialization of all other columns > > just > > > > for > > > > > that 1 row by using a row-id based lookup (if the underlying > storage > > > > format > > > > > supports rowid based lookup). This would be a feature request..I > am > > > not > > > > > sure if a JIRA already exists for it or not. > > > > > > > > > > -Aman > > > > > > > > > > On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning < > ted.dunn...@gmail.com> > > > > wrote: > > > > > > > > > >> I just replicated these results. Full table scans with aggregation > > > take > > > > >> pretty much exactly the same amount of time with or without > > filtering. > > > > >> > > > > >> > > > > >> > > > > >> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann < > > > > >> jzillm...@googlemail.com > > > > >>> wrote: > > > > >> > > > > >>> Hey Ted, > > > > >>> > > > > >>> Sorry i mixed up row and column! > > > > >>> > > > > >>> Queries are like that: > > > > >>> (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23" > > > > >>> (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23" > > > > >>> > > > > >>> (1) is 14 sec and (2) is 1.5 sec. > > > > >>> Using drill-1.6. > > > > >>> So it looks like Drill is extracting the columns before filtering > > > > which i > > > > >>> didn’t expect… > > > > >>> Is there anyway to change that behaviour ? > > > > >>> > > > > >>> Johannes > > > > >>> > > > > >>> > > > > >>> > > > > >>>> On 11 Apr 2016, at 16:42, Ted Dunning <ted.dunn...@gmail.com> > > > wrote: > > > > >>>> > > > > >>>> Did you mean that you are doing a select to find a single > column? > > > What > > > > >>> you > > > > >>>> typed was row, but that seems out of line with the rest of what > > you > > > > >>> wrote. > > > > >>>> > > > > >>>> If you are truly asking about filtering down to a single row, > > > whether > > > > >> it > > > > >>>> costs more to return all of the columns rather than just one > from > > a > > > > >>> single > > > > >>>> row will depend on whether Drill is extracting columns before > > > > filtering > > > > >>> or > > > > >>>> after. > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann < > > > > >>> jzillm...@googlemail.com > > > > >>>>> wrote: > > > > >>>> > > > > >>>>> Hey there, > > > > >>>>> > > > > >>>>> i currently doing some performance measurements on Drill. > > > > >>>>> In my case its a single parquet file with a single local Drill > > Bit. > > > > >>>>> > > > > >>>>> Now in one case i have unexpected results and i’m curious if > > > somebody > > > > >>> has > > > > >>>>> a good explanation for it! > > > > >>>>> > > > > >>>>> So i have a file with 10 mio rows with 9 columns . > > > > >>>>> Now i’m doing a select statement to find one single row. > > > > >>>>> Runtime with select * : ~ 14.79 s > > > > >>>>> Runtime with select(filterField) : ~ 1.5 sec > > > > >>>>> > > > > >>>>> So i’m surprised that there is so much variance depending on > the > > > > >> fields > > > > >>> i > > > > >>>>> select, since i thought Drill needs most time for finding that > > one > > > > >>> element, > > > > >>>>> and then deserialize the other fields only on a hit… > > > > >>>>> But for deserialising 8 more hits 10 sec seem way to much!?!?!? > > > > >>>>> > > > > >>>>> best > > > > >>>>> Johannes > > > > >>>>> > > > > >>>>> > > > > >>> > > > > >>> > > > > >> > > > > > > > > > > > > > >