OK, thanks for the information! Am I right that once DRILL-1950 is fixed, Drill would automatically materialize only those rows/columns which match the filter?
If not, would the late materialization you described for the filter case be possible to implement with the current hooks/API?

Johannes

> On 11 Apr 2016, at 19:36, Aman Sinha <[email protected]> wrote:
>
> There is a JIRA related to one aspect of this: DRILL-1950 (filter pushdown
> into parquet scan). This is still work in progress I believe. Once that
> is implemented, the scan will produce the filtered rows only.
>
> Regarding column projections, currently in Drill, the columns referenced
> anywhere in the query (including SELECT list) need to be produced by the
> table scan, so the scan will read all those columns, not just the ones in
> the filter condition. You can see what columns are being produced by the
> Scan node from the EXPLAIN plan.
>
> What would help for the SELECT * case is *late materialization of columns*,
> i.e. even if the filter does not get pushed down into scan, we could read
> only the 'id' column from the table first, do the filtering that supposedly
> selects 1 row, then do a late materialization of all other columns just for
> that 1 row by using a row-id based lookup (if the underlying storage format
> supports rowid based lookup). This would be a feature request. I am not
> sure if a JIRA already exists for it or not.
>
> -Aman
>
> On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <[email protected]> wrote:
>
>> I just replicated these results. Full table scans with aggregation take
>> pretty much exactly the same amount of time with or without filtering.
>>
>> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <[email protected]> wrote:
>>
>>> Hey Ted,
>>>
>>> Sorry, I mixed up row and column!
>>>
>>> Queries are like this:
>>> (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
>>> (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
>>>
>>> (1) is 14 sec and (2) is 1.5 sec.
>>> Using drill-1.6.
>>> So it looks like Drill is extracting the columns before filtering, which I
>>> didn’t expect…
>>> Is there any way to change that behaviour?
>>>
>>> Johannes
>>>
>>>> On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]> wrote:
>>>>
>>>> Did you mean that you are doing a select to find a single column? What you
>>>> typed was row, but that seems out of line with the rest of what you wrote.
>>>>
>>>> If you are truly asking about filtering down to a single row, whether it
>>>> costs more to return all of the columns rather than just one from a single
>>>> row will depend on whether Drill is extracting columns before filtering or
>>>> after.
>>>>
>>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <[email protected]> wrote:
>>>>
>>>>> Hey there,
>>>>>
>>>>> I am currently doing some performance measurements on Drill.
>>>>> In my case it’s a single Parquet file with a single local Drillbit.
>>>>>
>>>>> Now in one case I have unexpected results and I’m curious if somebody has
>>>>> a good explanation for it!
>>>>>
>>>>> So I have a file with 10 million rows and 9 columns.
>>>>> Now I’m doing a select statement to find one single row.
>>>>> Runtime with select * : ~ 14.79 s
>>>>> Runtime with select(filterField) : ~ 1.5 sec
>>>>>
>>>>> So I’m surprised that there is so much variance depending on the fields I
>>>>> select, since I thought Drill needs most of the time for finding that one
>>>>> element, and then deserializes the other fields only on a hit…
>>>>> But for deserialising 8 more hits, 10 sec seems way too much!?
>>>>>
>>>>> best
>>>>> Johannes
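
For illustration, here is a minimal sketch of the late materialization idea described in the thread, compared with reading all columns before filtering. It is not Drill code and does not use any Drill or Parquet API. The `CountingTable` class, the `make_table` helper, and the column names `c1`..`c8` are all hypothetical, and the table is just an in-memory dict of lists that counts how many cell values get read.

```python
from typing import Any, Callable, Dict, List

# Hypothetical in-memory columnar table: column name -> list of values.
ColumnarTable = Dict[str, List[Any]]


class CountingTable:
    """Wraps a columnar table and counts how many cell values are read."""

    def __init__(self, columns: ColumnarTable) -> None:
        self.columns = columns
        self.cells_read = 0

    def read_column(self, name: str) -> List[Any]:
        # Reading a whole column costs one read per row.
        col = self.columns[name]
        self.cells_read += len(col)
        return col

    def read_cell(self, name: str, row_id: int) -> Any:
        # Row-id based lookup: assumes the storage supports it cheaply.
        self.cells_read += 1
        return self.columns[name][row_id]


def early_materialization(table: CountingTable, filter_col: str,
                          predicate: Callable[[Any], bool]) -> List[dict]:
    # Read every column in full, then filter the assembled rows.
    names = list(table.columns)
    data = {n: table.read_column(n) for n in names}
    n_rows = len(data[filter_col])
    return [{n: data[n][i] for n in names}
            for i in range(n_rows) if predicate(data[filter_col][i])]


def late_materialization(table: CountingTable, filter_col: str,
                         predicate: Callable[[Any], bool]) -> List[dict]:
    # Read only the filter column, collect the matching row ids, then
    # fetch the remaining columns for those rows alone.
    filter_values = table.read_column(filter_col)
    row_ids = [i for i, v in enumerate(filter_values) if predicate(v)]
    others = [n for n in table.columns if n != filter_col]
    rows = []
    for i in row_ids:
        row = {filter_col: filter_values[i]}
        for n in others:
            row[n] = table.read_cell(n, i)
        rows.append(row)
    return rows


def make_table(n_rows: int) -> ColumnarTable:
    # 9 columns, as in the thread: an 'id' column plus 8 made-up columns.
    cols: ColumnarTable = {"id": list(range(n_rows))}
    for k in range(1, 9):
        cols[f"c{k}"] = [f"r{i}c{k}" for i in range(n_rows)]
    return cols


if __name__ == "__main__":
    early = CountingTable(make_table(1000))
    late = CountingTable(make_table(1000))
    early_materialization(early, "id", lambda v: v == 23)
    late_materialization(late, "id", lambda v: v == 23)
    print("early cells read:", early.cells_read)
    print("late cells read:", late.cells_read)
```

With one matching row, the late variant touches the full `id` column plus one cell from each remaining column, while the early variant touches every cell. Whether a real storage format can do the per-row lookup cheaply is exactly the caveat raised in the thread.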
