As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x, I noticed that Spark behaves differently when reading Parquet directories that contain a .metadata directory.
It seems that in Spark 1.2.x it would just ignore the .metadata directory, but now that I'm using Spark 1.3, reading these directories causes the following exceptions:

scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation:

java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125]
  parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
  parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
  org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
  org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
  scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
  scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
  scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
  . . .

java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125]
  parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
  parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
  org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
  org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
  scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
  scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
  scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
  . . .

java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
  parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
  parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
  org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
  org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
  scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
  scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
  scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
  scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
  . . .

  at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
  at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
  at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
  at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
  at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
  at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
  at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
  at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
  at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
  at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

When I remove the .metadata directory, Spark reads these Parquet files just fine.
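For reference, [80, 65, 82, 49] is ASCII for "PAR1", the magic bytes Parquet expects at the tail of every file; the bytes it found instead are just the tail of the Avro schema/properties text, so the new footer-reading code is clearly descending into the hidden .metadata directory. In the meantime I'm working around it by listing the directory myself and skipping hidden entries before calling parquetFile. A rough sketch from spark-shell (untested beyond my layout, and it assumes the part files sit directly under the directory rather than in nested partition subdirectories):

import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("/user/ddrak/parq_dir")
val fs = FileSystem.get(sc.hadoopConfiguration)

// Keep only visible entries; names starting with "." or "_" are
// hidden by Hadoop convention (.metadata, _SUCCESS, _logs, ...)
val dataFiles = fs.listStatus(dir)
  .map(_.getPath)
  .filterNot(p => p.getName.startsWith(".") || p.getName.startsWith("_"))
  .map(_.toString)

// parquetFile accepts multiple paths (String*) in Spark 1.3
val d = sqlContext.parquetFile(dataFiles: _*)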
I feel that Spark should ignore dot files and directories when reading Parquet data; Hadoop's own FileInputFormat already filters out paths whose names start with "." or "_", so I'd expect the same convention here. I'm seeing this on CDH 5.4.2 (Spark 1.3.0 plus patches).

Thoughts?

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143