Hi Jason,

Understood - so currently Drill doesn't do predicate pushdown for Parquet?
Regards,

Adam Gilmore

On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <[email protected]> wrote:

> Hi Adam,
>
> I have a few thoughts that might explain the difference in query times.
> Drill is able to read a subset of the data from a Parquet file when
> selecting only a few columns out of a large file, so Drill will give you
> faster read performance if you ask for 3 columns instead of 10. However,
> we are still working on further optimizing the reader by making use of
> the statistics contained in the block and page metadata; because the
> Parquet writer can store min/max values for blocks of data, those
> statistics will let us skip reading subsets of a column entirely.
>
> The query that simply summed over a column was faster because it avoided
> a bunch of individual value copies that the filtering query incurs while
> discarding the records that are not needed. That filtering currently
> takes place in a separate filter operator; it should be pushed down into
> the read operation to make use of the file metadata and eliminate some
> of the reads.
>
> -Jason
>
> On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <[email protected]>
> wrote:
>
> > Hi guys,
> >
> > I have a question re Parquet. I'm not sure if this is a Drill question
> > or a Parquet question, but thought I'd start here.
> >
> > I have a sample dataset of ~100M rows in a Parquet file. It's quick to
> > sum a single column across the whole dataset.
> >
> > I have a column with approximately 100 unique values (e.g. a customer
> > ID). When I filter on that column by one of those values (reducing the
> > set to ~1M rows), the query takes longer.
> >
> > This doesn't make a lot of sense to me - I would have expected the
> > Parquet format to read back only the segments that match the filter
> > and sum just those values. I would expect that to make the query
> > orders of magnitude faster, not slower.
> >
> > Other columnar formats I've used (e.g. ORCFile, SQL Server
> > Columnstore) behave this way, so I can't quite understand why Parquet
> > doesn't act the same.
> >
> > Can anyone suggest what I'm doing wrong?
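
[To make the point about block metadata concrete, here is a minimal
sketch, in Java, of how the min/max statistics in a Parquet footer can
be used to skip whole row groups for a filter like the one described
above. This is not Drill's actual reader code; it assumes the
parquet-mr API, an INT64 column named customer_id, and a hypothetical
file path and predicate value.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.LongStatistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class RowGroupSkipSketch {
    public static void main(String[] args) throws Exception {
        long predicateValue = 42L;                    // WHERE customer_id = 42 (hypothetical)
        Path file = new Path("/data/sample.parquet"); // hypothetical path

        // The footer holds per-row-group ("block") metadata, including
        // optional min/max statistics written per column chunk.
        ParquetMetadata footer =
            ParquetFileReader.readFooter(new Configuration(), file);

        for (BlockMetaData block : footer.getBlocks()) {
            for (ColumnChunkMetaData col : block.getColumns()) {
                if (!col.getPath().toDotString().equals("customer_id")) {
                    continue;
                }
                // Assumes an INT64 column, whose statistics are longs.
                LongStatistics stats = (LongStatistics) col.getStatistics();
                if (stats != null && !stats.isEmpty()
                        && (predicateValue < stats.getMin()
                            || predicateValue > stats.getMax())) {
                    // No row in this row group can match the filter, so a
                    // statistics-aware reader never reads its pages at all.
                    System.out.println("skip row group at offset "
                        + block.getStartingPos());
                } else {
                    System.out.println("must read row group at offset "
                        + block.getStartingPos());
                }
            }
        }
    }
}

This is the same idea behind ORCFile's stripe-level indexes: the reader
consults the footer first, and any row group whose [min, max] range
excludes the predicate value is never fetched. Once this check is pushed
down into the scan, a selective filter should make the query faster
rather than slower.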
