This issue was recently fixed for Spark 1.4:
https://github.com/apache/spark/pull/6581
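For anyone who can't upgrade yet: the essence of the fix, as I understand it (the actual patch may differ in its details), is to skip hidden files and directories, i.e. names starting with "." or "_", when collecting Parquet footers. A minimal sketch of that predicate in Scala:

// Sketch only: treat paths whose names start with "." or "_" as
// hidden metadata and exclude them from footer discovery.
def isHiddenName(name: String): Boolean =
  name.startsWith(".") || name.startsWith("_")

Seq("part-r-00000.parquet", ".metadata", "_SUCCESS").filterNot(isHiddenName)
// -> List("part-r-00000.parquet")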
Cheng
On 6/5/15 12:38 AM, Marcelo Vanzin wrote:
I talked to Don off-list, and he says he's seeing this with Apache
Spark 1.3 too (not just CDH Spark), so there appears to be a real
issue here.
On Wed, Jun 3, 2015 at 1:39 PM, Don Drake <dondr...@gmail.com> wrote:
As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x, I
noticed that Spark behaves differently when reading Parquet
directories that contain a .metadata directory.
Spark 1.2.x seems to have simply ignored the .metadata directory,
but now that I'm using Spark 1.3, reading these directories causes
the following exceptions:
scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
further details.
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation:

java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125]
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
...
java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125]
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
...
java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
...
    at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
    at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
    at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
    at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
    at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
    at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
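Decoding the byte arrays in those messages shows what's going on: the
expected tail bytes are the Parquet footer magic, and the bytes
actually found are just the last characters of the text files inside
.metadata:

// The "expected" bytes are the 4-byte Parquet footer magic:
Array(80, 65, 82, 49).map(_.toChar).mkString   // "PAR1"
// The bytes actually found are the tails of plain text files:
Array(116, 34, 10, 125).map(_.toChar).mkString // "t\"\n}" -- end of the schema.avsc JSON
Array(117, 101, 116, 10).map(_.toChar).mkString // "uet\n" -- end of descriptor.properties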
When I remove the .metadata directory, Spark reads these Parquet
files just fine.
I feel that Spark should ignore dot files/directories when reading
Parquet files. I'm seeing this in CDH 5.4.2 (Spark 1.3.0 + patches).
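In the meantime, one possible workaround (an untested sketch, assuming
the varargs parquetFile overload in 1.3) is to list the directory
myself, drop hidden entries, and pass only the visible paths:

import org.apache.hadoop.fs.{FileSystem, Path}

// spark-shell sketch: sc and sqlContext are provided by the shell.
val fs = FileSystem.get(sc.hadoopConfiguration)
val visible = fs.listStatus(new Path("/user/ddrak/parq_dir"))
  .map(_.getPath)
  .filterNot(p => p.getName.startsWith(".") || p.getName.startsWith("_"))
  .map(_.toString)

val d = sqlContext.parquetFile(visible: _*)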
Thoughts?
--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143
--
Marcelo