Predicate pushdown into the Parquet input format is turned off by default because there is a bug in the current Parquet library that throws null pointer exceptions when a row group consists entirely of nulls.
https://issues.apache.org/jira/browse/SPARK-4258

You can turn it on if you want (see the configuration section of the SQL
programming guide; a sketch of enabling it follows the quoted message below):
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

On Mon, Jan 5, 2015 at 3:38 PM, Adam Gilmore <[email protected]> wrote:

> Hi all,
>
> I have a question regarding predicate pushdown for Parquet.
>
> My understanding was that this would use the metadata in Parquet's
> blocks/pages to skip entire chunks that can't match, without needing to
> decode and filter every value in the table.
>
> I was testing a scenario with 100M rows in a Parquet file.
>
> Summing over a column took about 2-3 seconds.
>
> I also have a column (e.g. customer ID) with approximately 100 unique
> values. My assumption, though the effect wouldn't be exactly linear, was
> that filtering on this column would reduce the query time significantly,
> since entire segments would be skipped based on the metadata.
>
> In fact, it took much longer - somewhere in the vicinity of 4-5 seconds -
> which suggests it is reading all the values for the key column (100M
> values), then filtering, then reading all the relevant segments/values
> for the "measure" column, hence the increase in time.
>
> In the logs, I could see it was successfully pushing down a Parquet
> predicate, so I'm not sure I understand why this is taking longer.
>
> Could anyone shed some light on this or point out where I'm going wrong?
> Thanks!
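
A minimal sketch of how that setting could be applied, assuming a Spark
1.2-era SQLContext in Scala; the application name, file path, table, and
column names below are invented for illustration and are not from the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Local mode is used here only so the sketch runs standalone.
val sc = new SparkContext(
  new SparkConf().setAppName("parquet-pushdown-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Predicate pushdown into the Parquet input format is off by default
// (SPARK-4258), so it has to be enabled explicitly via the SQL configuration.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// Hypothetical data set: 100M rows with a low-cardinality customer_id column.
val events = sqlContext.parquetFile("/data/events.parquet")
events.registerTempTable("events")

// With pushdown enabled, the customer_id = 42 filter can be handed to the
// Parquet reader, which can skip row groups whose min/max statistics exclude
// 42 instead of decoding every value and filtering afterwards.
sqlContext.sql("SELECT SUM(amount) FROM events WHERE customer_id = 42").collect()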
