I was delighted that Spark 1.3.1 with Parquet 1.6.0 would "partition" data into folders, so I set up some Parquet data partitioned by date. This enabled us to reference a single day/month/year, minimizing how much data was scanned.
eg:

    val myDataFrame = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07/01")

or

    val myDataFrame = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07")

However, since upgrading to Spark 1.4.0 it doesn't seem to work the same way. The first line works; in the "01" folder are all the normal files:

    2015-06-02 20:01       0 s3://myBucket/myPath/2014/07/01/_SUCCESS
    2015-06-02 20:01    2066 s3://myBucket/myPath/2014/07/01/_common_metadata
    2015-06-02 20:01 1077190 s3://myBucket/myPath/2014/07/01/_metadata
    2015-06-02 19:57  119933 s3://myBucket/myPath/2014/07/01/part-r-00001.parquet
    2015-06-02 19:57   48478 s3://myBucket/myPath/2014/07/01/part-r-00002.parquet
    2015-06-02 19:57  576878 s3://myBucket/myPath/2014/07/01/part-r-00003.parquet
    ...

But if I now use the second line above, to read in all days, it comes back empty. Is there an option I can set somewhere to fix this?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-Parquet-partitions-folder-hierarchy-changed-from-1-3-1-tp23558.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
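For context, Spark 1.4's automatic partition discovery expects Hive-style `column=value` directory names (e.g. `year=2014/month=07/day=01`) rather than bare `2014/07/01` folders, which may be why reading the parent path comes back empty. A minimal sketch of the layout it expects, assuming the bucket/path names from the post above; `partitionPath` is a hypothetical helper for illustration, not a Spark API:

```scala
// Sketch of the Hive-style partition layout that Spark 1.4's
// partition discovery recognizes. partitionPath is a hypothetical
// helper that joins column=value segments onto a base path.
object PartitionLayout {
  def partitionPath(base: String, parts: (String, String)*): String =
    (base +: parts.map { case (k, v) => s"$k=$v" }).mkString("/")

  def main(args: Array[String]): Unit = {
    val day = partitionPath("s3n://myBucket/myPath",
      "year" -> "2014", "month" -> "07", "day" -> "01")
    // day == "s3n://myBucket/myPath/year=2014/month=07/day=01"
    println(day)
    // With directories named this way, you can read the base path
    // once and filter on the partition columns, letting Spark prune
    // the directories it scans:
    //   hiveContext.read.parquet("s3n://myBucket/myPath")
    //     .filter("year = 2014 AND month = 7")
  }
}
```

If you rewrite the data with `df.write.partitionBy("year", "month", "day").parquet("s3n://myBucket/myPath")`, Spark produces this directory naming itself and the partition columns then appear in the DataFrame's schema.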