Re: Performance querying a single column out of a parquet file

2016-07-01 Thread John Omernik
I created a JIRA for discussion. This could be a huge performance win if it were possible. https://issues.apache.org/jira/browse/DRILL-4758 On Fri, Jul 1, 2016 at 12:33 PM, Parth Chandra wrote: > This has come up in the past in some other context. At the moment though, > there is no JIRA for

Re: Performance querying a single column out of a parquet file

2016-07-01 Thread Parth Chandra
This has come up in the past in some other context. At the moment though, there is no JIRA for this. On Fri, Jul 1, 2016 at 6:10 AM, John Omernik wrote: > Hey all, some colleagues are looking at this on Impala (IMPALA-2017) and > asked if Drill could do this. (Late/Lazy Materialization of columns

Re: Performance querying a single column out of a parquet file

2016-07-01 Thread John Omernik
Hey all, some colleagues are looking at this on Impala (IMPALA-2017) and asked if Drill could do this (late/lazy materialization of columns). While the performance gain on tables with fewer columns may not be huge, when you are looking at really wide tables, with disparate data types, this can be

Re: Performance querying a single column out of a parquet file

2016-04-14 Thread Ted Dunning
Not quite. With a fix for DRILL-1950, no rows would necessarily be materialized at all for the filter columns. Rows would only be materialized for the projection columns when the filter matches. In some cases, the pushdown might be implemented by fully materializing the values referenced by the f
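To illustrate the idea discussed above (evaluate the filter first, then materialize the projection columns only for matching rows), here is a minimal sketch in plain Python. It is not Drill's or Parquet's implementation: the in-memory `table`, the `read_column` helper, and the `late_materialize` function are all hypothetical names made up for this example, and a real reader would fetch only the matching pages or values rather than whole columns.

```python
# Hypothetical columnar table: each column is stored as its own list,
# loosely mimicking how a columnar file keeps columns separate.
table = {
    "id": [7, 23, 42, 23],
    "name": ["a", "b", "c", "d"],
    "payload": ["w", "x", "y", "z"],
}

decoded = []  # records which columns were actually read (for illustration only)


def read_column(name):
    """Stand-in for the expensive step of decoding one column."""
    decoded.append(name)
    return table[name]


def late_materialize(filter_col, predicate, projection):
    # Step 1: read only the filter column and collect matching row positions.
    filter_values = read_column(filter_col)
    matches = [i for i, v in enumerate(filter_values) if predicate(v)]

    # Step 2: if nothing matched, no projection column is touched at all.
    if not matches:
        return []

    # Step 3: read the remaining projected columns and build only the matching rows.
    columns = {filter_col: filter_values}
    for name in projection:
        if name not in columns:
            columns[name] = read_column(name)
    return [{name: columns[name][i] for name in projection} for i in matches]


rows = late_materialize("id", lambda v: v == 23, ["id", "name", "payload"])
```

With a predicate that matches nothing, only the `id` column is read in this sketch, which is the case where late materialization would save the most work on a wide table.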

Re: Performance querying a single column out of a parquet file

2016-04-14 Thread Johannes Zillmann
Ok, thanks for the information! Am I right that if DRILL-1950 were fixed, Drill would automatically materialize only those rows/columns that match the filter? If not, would the late materialization you described for the filter case be possible to implement with the current Ho

Re: Performance querying a single column out of a parquet file

2016-04-11 Thread Jacques Nadeau
There was a major conflict between the patch and the metadata caching feature that came in right at the same time (right before it). I believe there was a discussion about this on the list. It would be great if a developer could pick this up. -- Jacques Nadeau CTO and Co-Founder, Dremio On Mon, A

Re: Performance querying a single column out of a parquet file

2016-04-11 Thread Ted Dunning
On Mon, Apr 11, 2016 at 10:36 AM, Aman Sinha wrote: > There is a JIRA related to one aspect of this: DRILL-1950 (filter pushdown > into parquet scan). This is still work in progress I believe. > Actually, it looks like there was a patch from the community nearly a year ago. Hard to understand

Re: Performance querying a single column out of a parquet file

2016-04-11 Thread Aman Sinha
There is a JIRA related to one aspect of this: DRILL-1950 (filter pushdown into parquet scan). This is still work in progress I believe. Once that is implemented, the scan will produce the filtered rows only. Regarding column projections, currently in Drill, the columns referenced anywhere in th

Re: Performance querying a single column out of a parquet file

2016-04-11 Thread Ted Dunning
I just replicated these results. Full table scans with aggregation take pretty much exactly the same amount of time with or without filtering. On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann wrote: > Hey Ted, > > Sorry i mixed up row and column! > > Queries are like that: > (1) "SEL

Re: Performance querying a single column out of a parquet file

2016-04-11 Thread Johannes Zillmann
Hey Ted, Sorry, I mixed up row and column! The queries look like this: (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23" (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23" (1) takes 14 sec and (2) takes 1.5 sec. Using drill-1.6. So it looks like Drill is extracting the columns b

Re: Performance querying a single column out of a parquet file

2016-04-11 Thread Ted Dunning
Did you mean that you are doing a select to find a single column? What you typed was row, but that seems out of line with the rest of what you wrote. If you are truly asking about filtering down to a single row, whether it costs more to return all of the columns rather than just one from a single

Performance querying a single column out of a parquet file

2016-04-11 Thread Johannes Zillmann
Hey there, I am currently doing some performance measurements on Drill. In my case it's a single Parquet file with a single local Drillbit. Now in one case I have unexpected results and I’m curious if somebody has a good explanation for it! So I have a file with 10 million rows and 9 columns. Now I’