Hi Adam,

I have a few thoughts that might explain the difference in query times. Drill can read a subset of the data in a Parquet file: when you select only a few columns out of a wide file, the reader skips the columns you didn't ask for, so a query over 3 columns reads less data than one over 10. However, we are still working on optimizing the reader further by making use of the statistics in the block and page metadata. The Parquet writer can store min/max values for blocks of data, which would let us skip reading parts of a column entirely.
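As a rough sketch (the file path and column names here are made-up placeholders, not from your setup), the read cost tracks the projected columns:

    -- reads only the pages for these three columns; the file's
    -- other columns are never touched
    SELECT order_id, customer_id, amount
    FROM dfs.`/data/sample.parquet`;

    -- reads every column in the file
    SELECT *
    FROM dfs.`/data/sample.parquet`;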
If your query summing over a single column was faster, the reason is that it avoided a large number of individual value copies: when we filter, the records that survive the filter have to be copied out one by one. That filtering currently happens in a separate filter operator, and it should be pushed down into the read operation so we can use the file metadata to eliminate some of the reads altogether (a sketch of the two query shapes follows below Adam's message).

-Jason

On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <[email protected]> wrote:

> Hi guys,
>
> I have a question re Parquet. I'm not sure if this is a Drill question
> or a Parquet one, but thought I'd start here.
>
> I have a sample dataset of ~100M rows in a Parquet file. It's quick to
> sum a single column across the whole dataset.
>
> I have a column with approximately 100 unique values (e.g. a customer
> ID). When I filter on that column by one of those values (reducing the
> set to ~1M rows), the query takes longer.
>
> This doesn't make a lot of sense to me - I would have expected the
> Parquet format to only bring back the segments that match and only sum
> those values. I would expect that to make the query orders of magnitude
> faster, not slower.
>
> Other columnar formats I've used (e.g. ORCFile, SQL Server Columnstore)
> behave this way, so I can't quite understand why Parquet doesn't act
> the same.
>
> Can anyone suggest what I'm doing wrong?
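P.S. To make the comparison concrete, the two query shapes under discussion look roughly like this (a sketch only: the path and the columns `amount` and `customer_id` are hypothetical placeholders):

    -- fast today: scans a single column, no per-value filtering
    SELECT SUM(amount)
    FROM dfs.`/data/sample.parquet`;

    -- slower today: Drill reads both columns in full, then a separate
    -- filter operator copies out the ~1% of records that match; with
    -- min/max pushdown it could skip non-matching blocks instead
    SELECT SUM(amount)
    FROM dfs.`/data/sample.parquet`
    WHERE customer_id = 42;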
