I just replicated these results. Full table scans with aggregation take pretty much exactly the same amount of time with or without filtering.

On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <[email protected]> wrote:

> Hey Ted,
>
> Sorry i mixed up row and column!
>
> Queries are like that:
> (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
> (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
>
> (1) is 14 sec and (2) is 1.5 sec.
> Using drill-1.6.
> So it looks like Drill is extracting the columns before filtering which i
> didn’t expect…
> Is there anyway to change that behaviour ?
>
> Johannes
>
> > On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]> wrote:
> >
> > Did you mean that you are doing a select to find a single column? What you
> > typed was row, but that seems out of line with the rest of what you wrote.
> >
> > If you are truly asking about filtering down to a single row, whether it
> > costs more to return all of the columns rather than just one from a single
> > row will depend on whether Drill is extracting columns before filtering or
> > after.
> >
> > On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <[email protected]> wrote:
> >
> >> Hey there,
> >>
> >> i currently doing some performance measurements on Drill.
> >> In my case its a single parquet file with a single local Drill Bit.
> >>
> >> Now in one case i have unexpected results and i’m curious if somebody has
> >> a good explanation for it!
> >>
> >> So i have a file with 10 mio rows with 9 columns .
> >> Now i’m doing a select statement to find one single row.
> >> Runtime with select * : ~ 14.79 s
> >> Runtime with select(filterField) : ~ 1.5 sec
> >>
> >> So i’m surprised that there is so much variance depending on the fields i
> >> select, since i thought Drill needs most time for finding that one element,
> >> and then deserialize the other fields only on a hit…
> >> But for deserialising 8 more hits 10 sec seem way to much!?!?!?
> >>
> >> best
> >> Johannes
