Hi All,
I am using CDH 5.7, which comes with Spark 1.6.0. I am saving my
data set as Parquet and then querying it. The query executes fine, but
when I checked the files generated by Spark, I found that the statistics
(min/max) are missing for all the columns, and hence filters are not
effective.
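For reference, one way to inspect the footer statistics, sketched with
parquet-hadoop (the path is a placeholder; point it at one of the part
files Spark wrote):

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/data/out/part-00000.gz.parquet"))

for (block <- footer.getBlocks.asScala; col <- block.getColumns.asScala) {
  // getStatistics comes back empty when the writer recorded
  // no min/max for that column chunk.
  println(s"${col.getPath}: ${col.getStatistics}")
}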
Hi,
I have around 2 million records stored as Parquet files in S3. The file
structure is somewhat like
id data
1 abc
2 cdf
3 fas
Now I want to filter and take the records whose id matches one of my
required ids.
val requiredDataId = Array(1, 2) // might go up to 100s of ids
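A minimal sketch of one way to do this in Spark, assuming the data is
read from a hypothetical s3a path (isin takes varargs, so the array is
splatted into the call):

import org.apache.spark.sql.functions.col

val df = sqlContext.read.parquet("s3a://my-bucket/data/") // placeholder path

// With valid min/max statistics in the footers, this predicate lets
// Parquet skip row groups whose id range misses all requested values.
val filtered = df.filter(col("id").isin(requiredDataId: _*))
filtered.show()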
Thanks for the references; that explains a great deal. I can verify that using
integer keys in this use case does work as expected w/r/t run time and bytes
read. Hopefully this all works in the next Spark release!
Thanks,
Xton
> On Aug 31, 2016, at 3:41 PM, Robert Kruszewski
Your statistics seem corrupted. The creator field doesn’t match the version
spec, and as such Parquet is not using the statistics for filtering. I would
check whether you have references to PARQUET-251 or PARQUET-297 in your
executor logs. This bug existed between Parquet 1.5.0 and 1.8.0. Check out
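A quick way to see the creator string that gets compared against the
version spec, sketched with parquet-hadoop (the path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Print created_by from the file metadata; a writer version caught
// by the corrupt-statistics check means min/max will be ignored.
val meta = ParquetFileReader.readFooter(
  new Configuration(), new Path("/data/out/part-00000.gz.parquet"))
println(meta.getFileMetaData.getCreatedBy)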
I have a data set stored in Parquet with several short key fields and one
relatively large (several KB) blob field. The data set is sorted by key1, key2
(see the write sketch after the schema).
message spark_schema {
  optional binary key1 (UTF8);
  optional binary key2;
  optional binary blob;
}
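A minimal sketch of how this layout might be produced; the input
DataFrame (df) and output path are assumptions. A global sort before
the write keeps each row group's key range narrow, so the min/max
statistics can prune row groups for key-range queries:

// df is assumed to already hold key1, key2 and blob columns.
df.select("key1", "key2", "blob")
  .sort("key1", "key2")
  .write
  .parquet("/data/blobs") // placeholder output path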
One use case of