It was giving the same error, which made me realise it is the driver that runs out of memory, but the driver running on the Hadoop cluster, not the local one. So I added

    --conf spark.driver.memory=8g

and now it is processing the files!
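
For context, here is a minimal spark-shell sketch of why that setting matters (the path below is a hypothetical stand-in for the real input glob): just computing the partitions of a binaryFiles RDD makes the driver list every matching file up front, so with over a million files the listing alone needs a sizeable heap on the YARN-side driver.

    // Minimal sketch, hypothetical path: asking for the partitions forces
    // BinaryFileRDD.getPartitions -> FileInputFormat.listStatus, i.e. a full
    // file listing built in the driver's heap (the YARN-side driver here).
    val files = sc.binaryFiles("hdfs:///path/to/files/*")
    println(files.partitions.length)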

Cheers


On 08/06/15 15:52, Ewan Leith wrote:
Can you do a simple

sc.binaryFiles("hdfs:///path/to/files/*").count()

in the spark-shell and verify that part works?

Ewan



-----Original Message-----
From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com]
Sent: 08 June 2015 15:40
To: Ewan Leith; user@spark.apache.org
Subject: Re: spark times out maybe due to binaryFiles() with more than 1 million files in HDFS

No luck, I'm afraid. After giving the namenode 16GB of RAM, I am still getting an out-of-memory exception, though a somewhat different one:

15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
      at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1351)
      at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
      at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
      at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
      at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
      at com.sun.proxy.$Proxy10.getListing(Unknown Source)
      at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
      at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
      at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724)
      at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
      at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
      at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
      at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
      at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
      at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)
      at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:292)
      at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
      at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:47)
      at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:43)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)


and on Spark's 2nd retry, a similar exception:

java.lang.OutOfMemoryError: GC overhead limit exceeded
      at com.google.protobuf.LiteralByteString.toString(LiteralByteString.java:148)
      at com.google.protobuf.ByteString.toStringUtf8(ByteString.java:572)
      at org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$HdfsFileStatusProto.getOwner(HdfsProtos.java:21558)
      at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
      at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
      at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
      at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
      at com.sun.proxy.$Proxy10.getListing(Unknown Source)
      at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
      at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
      at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724)
      at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
      at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
      at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
      at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
      at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
      at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)
      at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:292)
      at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
      at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:47)
      at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:43)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)


Any ideas which part of Hadoop is running out of memory?
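
To narrow it down, here is a minimal spark-shell sketch (hypothetical path) that runs just the listing shown in the traces, outside of Spark's partition computation. If this alone hits the GC overhead limit, the memory pressure is in the client JVM doing the listing (here the Spark driver / ApplicationMaster) rather than in the namenode itself:

    // Minimal sketch, hypothetical path: the same globStatus/listStatus call
    // chain seen in the traces, run directly against HDFS. Every matching
    // file becomes a FileStatus object held in this JVM's heap.
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val statuses = fs.globStatus(new Path("hdfs:///path/to/files/*"))
    println(statuses.length)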

