This issue has recently been fixed in Spark 1.4: https://github.com/apache/spark/pull/6581
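
Until you can pick up that fix, one possible workaround on 1.3 is to list the directory yourself, skip hidden entries (anything whose name starts with "." or "_"), and pass only the data files to parquetFile. Below is a rough, untested sketch of that idea; it assumes the Parquet part files sit directly under the directory, with no nested partition directories:

import org.apache.hadoop.fs.{FileSystem, Path}

// List the top-level entries and drop hidden ones such as .metadata and _SUCCESS.
val dir = new Path("/user/ddrak/parq_dir")
val fs  = dir.getFileSystem(sc.hadoopConfiguration)
val dataFiles = fs.listStatus(dir)
  .map(_.getPath)
  .filterNot(p => p.getName.startsWith(".") || p.getName.startsWith("_"))
  .map(_.toString)

// parquetFile accepts multiple paths, so hand it only the real data files.
val d = sqlContext.parquetFile(dataFiles: _*)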

Cheng

On 6/5/15 12:38 AM, Marcelo Vanzin wrote:
I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real issue here.

On Wed, Jun 3, 2015 at 1:39 PM, Don Drake <dondr...@gmail.com> wrote:

    As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x, I
    noticed that Spark behaves differently when reading Parquet
    directories that contain a .metadata directory.

    It seems that Spark 1.2.x would just ignore the .metadata
    directory, but now that I'm using Spark 1.3, reading these
    directories causes the following exceptions:

    scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")

    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

    SLF4J: Defaulting to no-operation (NOP) logger implementation

    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

    scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation:

    java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125]
        parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
        parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
        org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
        org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
        scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
        scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
        scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
        ...

    java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125]
        parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
        parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
        org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
        org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
        scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
        scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
        scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
        ...

    java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
        parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
        parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)
        org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)
        org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)
        scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
        scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
        scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
        scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
        ...

            at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
            at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
            at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
            at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
            at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
            at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
            at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
            at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
            at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
            at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
            at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
            at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
            at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
            at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




    When I remove the .metadata directory, it is able to read these
    Parquet files just fine.
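
    For reference, the check that fails there is Parquet's footer
    validation: a Parquet file must end with the 4 ASCII bytes "PAR1"
    (80, 65, 82, 49), while the .metadata files end in ordinary text
    (the tails reported above decode to the end of the JSON schema
    and of the properties file). A rough sketch of the same check over
    HDFS, using only the Hadoop FileSystem API rather than the actual
    parquet-mr code:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Return true if the file ends with the 4-byte Parquet magic "PAR1".
    // Hand-rolled sketch of the footer check that fails above; not the
    // parquet-mr implementation itself.
    def endsWithParquetMagic(fs: FileSystem, file: Path): Boolean = {
      val len = fs.getFileStatus(file).getLen
      if (len < 4) {
        false
      } else {
        val in = fs.open(file)
        try {
          in.seek(len - 4)
          val tail = new Array[Byte](4)
          in.readFully(tail)
          java.util.Arrays.equals(tail, "PAR1".getBytes("US-ASCII"))
        } finally {
          in.close()
        }
      }
    }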

    I feel that Spark should ignore dot files and directories when
    attempting to read these Parquet files. I'm seeing this in CDH
    5.4.2 (Spark 1.3.0 + patches).

    Thoughts?


    --
    Donald Drake
    Drake Consulting
    http://www.drakeconsulting.com/
    http://www.MailLaunder.com/
    800-733-2143




--
Marcelo
