Hi,

After migrating from Spark 1.5.2 to 1.6.1 I ran into a strange issue. I have a Parquet directory with partitions. Each partition (one month) is the subject of an incremental ETL that takes the current Avro files and replaces the corresponding Parquet files.
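For context, the incremental step is roughly the sketch below (Spark 1.6 API; the paths and the spark-avro data source are illustrative assumptions, not the exact job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    val sc = new SparkContext(new SparkConf().setAppName("incremental-etl"))
    val sqlContext = new SQLContext(sc)

    // Read the current Avro files for the month being refreshed
    // (com.databricks:spark-avro assumed to be on the classpath)
    val monthAvro = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("hdfs://demo.sample/apps/demo/avro/month=2015-09-01")

    // Replace the corresponding Parquet partition in place
    monthAvro.write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs://demo.sample/apps/demo/transactions/month=2015-09-01")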
Now there is a problem that appeared in 1.6.x. I have a couple of derived DataFrames. After the ETL finishes, all RDDs and DataFrames are properly recreated, but for some reason the originally captured file paths are retained. Because of the overwrite, some of those paths no longer exist, so I get exceptions like the one below. As I mentioned, this all worked flawlessly in Spark 1.5.x, i.e. after the ETL the engine nicely read the new directory structure. Is there any setting to restore the previous behaviour?

Regards,
Piotr

org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 32.0 failed 1 times, most recent failure: Lost task 7.0 in stage 32.0 (TID 386, localhost): java.io.FileNotFoundException: File does not exist: hdfs://demo.sample/apps/demo/transactions/month=2015-09-01/part-r-00026-792365f9-d1f5-4a70-a3d4-e0b87f6ee087.gz.parquet
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
    at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
    at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
    at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
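P.S. For reference, the post-ETL rebuild of the derived frames looks roughly like the sketch below (the frame names and the aggregation are illustrative, not the real job); even after a fresh read like this, the tasks still try to open the replaced part files:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("post-etl-read"))
    val sqlContext = new SQLContext(sc)

    // Fresh read of the whole partitioned directory after the ETL has run
    val transactions = sqlContext.read
      .parquet("hdfs://demo.sample/apps/demo/transactions")

    // A derived frame rebuilt from the fresh base frame
    val monthlyCounts = transactions.groupBy("month").count()

    // Triggering an action still fails with the FileNotFoundException above,
    // as if the part files captured before the overwrite were still being resolved
    monthlyCounts.show()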