Parquet filter pushdown not working and statistics not generated for any column with Spark 1.6 on CDH 5.7

2017-11-21 Thread Rabin Banerjee
Hi All, I am using CDH 5.7, which comes with Spark 1.6.0. I am saving my data set as Parquet and then querying it. The query executes fine, but when I checked the files generated by Spark, I found that statistics (min/max) are missing for all the columns, and hence filters are not
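A minimal sketch of the setup being described, assuming Spark 1.6's SQLContext API; the path, the DataFrame df, and the column name id are hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext

    // Pushdown is on by default in Spark 1.6, but it does not hurt to be explicit.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // Write the data set as Parquet, then read it back with a filter.
    // If min/max statistics had been written, row groups entirely outside
    // the predicate range could be skipped.
    df.write.parquet("/tmp/mydata.parquet")
    val filtered = sqlContext.read
      .parquet("/tmp/mydata.parquet")
      .filter("id > 100")

Whether row groups are actually skipped depends on the writer having produced valid statistics, which is exactly the problem being reported here.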

Parquet Filter PushDown

2017-03-30 Thread Rahul Nandi
Hi, I have around 2 million records stored as a Parquet file in S3. The file structure is roughly:

    id  data
    1   abc
    2   cdf
    3   fas

Now I want to filter and take the records where the id matches one of my required ids:

    val requiredDataId = Array(1, 2) // might go up to 100s of records
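A hedged sketch of one way to express that filter so it can be pushed down to the Parquet reader; the DataFrame df is assumed to have the id/data schema above:

    import org.apache.spark.sql.functions.col

    val requiredDataId = Array(1, 2) // might grow to 100s of ids

    // isin on a plain column translates to an In/Or predicate that Spark
    // can push down to Parquet; a filter wrapped in a UDF, by contrast,
    // cannot be pushed down and forces a full scan.
    val matched = df.filter(col("id").isin(requiredDataId: _*))
    matched.show()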

Re: Expected benefit of parquet filter pushdown?

2016-09-01 Thread Christon DeWan
Thanks for the references; that explains a great deal. I can verify that using integer keys in this use case does work as expected w/r/t run time and bytes read. Hopefully this all works in the next Spark release! Thanks, Xton > On Aug 31, 2016, at 3:41 PM, Robert Kruszewski
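A minimal sketch of the workaround confirmed here, with hypothetical column names and assuming the keys are numeric strings: store the key as an integer instead of a binary/UTF8 column, since the PARQUET-251 corruption affected binary statistics, leaving integer min/max stats usable for pushdown.

    import org.apache.spark.sql.functions.col

    // Rewrite the data set with an INT32 key so row-group min/max statistics
    // are trusted by the reader and predicates on the key can skip row groups.
    df.withColumn("key1_int", col("key1").cast("int"))
      .drop("key1")
      .write.parquet("/tmp/data_with_int_keys.parquet")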

Re: Expected benefit of parquet filter pushdown?

2016-08-31 Thread Robert Kruszewski
Your statistics seem corrupted. The creator field doesn't match the version spec, and as such Parquet is not using it to filter. I would check whether you have references to PARQUET-251 or PARQUET-297 in your executor logs. This bug existed between Parquet 1.5.0 and 1.8.0. Checkout
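A hedged sketch of how one might inspect the creator field and per-column statistics in a file footer, using parquet-hadoop's 1.x readFooter API; the path is hypothetical, and on Parquet releases before 1.7 these classes live under the parquet.hadoop package rather than org.apache.parquet.hadoop:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    val footer = ParquetFileReader.readFooter(
      new Configuration(), new Path("/tmp/part-00000.parquet"))

    // The created_by string is what the reader checks when deciding whether
    // statistics from this writer version are trustworthy (PARQUET-251/297).
    println(footer.getFileMetaData.getCreatedBy)

    // Per-column min/max statistics of the first row group.
    footer.getBlocks.get(0).getColumns.asScala.foreach { col =>
      println(s"${col.getPath}: ${col.getStatistics}")
    }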

Expected benefit of parquet filter pushdown?

2016-08-31 Thread Christon DeWan
I have a data set stored in Parquet with several short key fields and one relatively large (several KB) blob field. The data set is sorted by key1, key2.

    message spark_schema {
      optional binary key1 (UTF8);
      optional binary key2;
      optional binary blob;
    }

One use case of
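A sketch of the kind of read that should benefit from pushdown with this layout, assuming valid statistics and an existing sqlContext; the path and key value are hypothetical. Because the data is sorted by key1, an equality predicate on key1 should match only a few row groups, so the large blob column in every other row group never has to be read:

    val rows = sqlContext.read
      .parquet("s3://bucket/dataset")
      .filter("key1 = 'some-key'") // candidate for Parquet row-group skipping
      .select("key2", "blob")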