While upgrading a cluster from CDH 5.3.x to CDH 5.4.x, I noticed that Spark
behaves differently when reading Parquet directories that contain a
.metadata directory.

It seems that Spark 1.2.x would just ignore the .metadata directory, but now
that I'm on Spark 1.3, reading these directories causes the following
exceptions:

scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown
during a parallel computation: java.lang.RuntimeException:
hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a
Parquet file. expected magic number at tail [80, 65, 82, 49] but found
[116, 34, 10, 125]
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
...

java.lang.RuntimeException:
hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not a
Parquet file. expected magic number at tail [80, 65, 82, 49] but found
[116, 34, 10, 125]
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
...

java.lang.RuntimeException:
hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties is
not a Parquet file. expected magic number at tail [80, 65, 82, 49] but
found [117, 101, 116, 10]
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
...
        at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
        at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
        at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
        at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
        at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
        at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
        at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
        at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
        at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
        at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

When I remove the .metadata directory, Spark reads these Parquet files just
fine.
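
A stopgap that avoids deleting anything would be to list the directory
myself and hand parquetFile only the non-dot entries. Something like this
from the shell (untested sketch; assumes a flat directory layout and that
parquetFile accepts multiple paths in 1.3):

import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("hdfs://nameservice1/user/ddrak/parq_dir")
val fs  = FileSystem.get(dir.toUri, sc.hadoopConfiguration)

// Keep only entries whose names do not start with a dot.
val dataPaths = fs.listStatus(dir)
  .map(_.getPath)
  .filterNot(_.getName.startsWith("."))
  .map(_.toString)

val d = sqlContext.parquetFile(dataPaths: _*)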

I feel that Spark should ignore dot files and directories when attempting to
read these Parquet files. I'm seeing this on CDH 5.4.2 (Spark 1.3.0 +
patches).
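
By "ignore" I mean the usual hidden-path convention, i.e. dropping
dot-prefixed names before trying to read footers. Just to illustrate the
kind of check I have in mind (not a patch, and not necessarily where Spark
would hook it in):

import org.apache.hadoop.fs.{Path, PathFilter}

// Accept only paths whose names do not start with a dot.
val skipDotPaths = new PathFilter {
  override def accept(p: Path): Boolean = !p.getName.startsWith(".")
}

// e.g. fs.listStatus(dir, skipDotPaths) instead of fs.listStatus(dir)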

Thoughts?

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143
