Hello Andrew,

I wish I could share the code, but for proprietary reasons I can't. I can
give some idea of what I am trying to do, though. The job reads a file and
processes each of its lines. I am not doing anything intense in the
"processLogs" function.

import argonaut._
import argonaut.Argonaut._
import org.apache.spark.{SparkConf, SparkContext}

/* All of these case classes are created from JSON strings extracted from
 * the line in the processLogs() function.
 */
case class struct1…
case class struct2…
case class value1(s1: struct1, s2: struct2)

def processLogs(line: String): Option[(key1, value1)] = {…
}

def run(sparkMaster: String, appName: String, executorMemory: String, jarsPath: Seq[String]): Unit = {
  val sparkConf = new SparkConf()
  sparkConf.setMaster(sparkMaster)
  sparkConf.setAppName(appName)
  sparkConf.set("spark.executor.memory", executorMemory)
  sparkConf.setJars(jarsPath) // This includes all the relevant jars.
  val sc = new SparkContext(sparkConf)

  val rawLogs = sc.textFile("hdfs://<my-hadoop-namenode>:8020/myfile.txt")

  rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/writebackForTesting")

  rawLogs.flatMap(processLogs).saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/outfile.txt")
}
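
To give a better sense of the shape of processLogs without sharing the real
code, here is a minimal sketch of what it does. The field names, the key,
and the JSON layout are all made up for illustration; only the general
pattern (one Argonaut decode per line) matches the real job:

import argonaut._, Argonaut._

// Hypothetical stand-ins for the real case classes.
case class Struct1(host: String, level: String)
case class Struct2(message: String)
case class Value1(s1: Struct1, s2: Struct2)

// Argonaut codecs so the case classes can be decoded from JSON.
implicit val struct1Codec: CodecJson[Struct1] =
  casecodec2(Struct1.apply, Struct1.unapply)("host", "level")
implicit val struct2Codec: CodecJson[Struct2] =
  casecodec1(Struct2.apply, Struct2.unapply)("message")
implicit val value1Codec: CodecJson[Value1] =
  casecodec2(Value1.apply, Value1.unapply)("struct1", "struct2")

// Decode one log line; lines that fail to parse are dropped by flatMap.
def processLogs(line: String): Option[(String, Value1)] =
  line.decodeOption[Value1].map(v => (v.s1.host, v))

The real function is roughly that much work per line: one decode and a
tuple, nothing heavier.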

If I switch to "local" mode, the code runs just fine, it fails with the
error I pasted above. In the cluster mode, even writing back the file we
just read fails
(rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode:8020:writebackForTesting")

I still believe this is a ClassNotFoundException in disguise.
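
One thing I plan to try, to separate "real memory pressure" from "missing
classes on the executors", is the pair of tiny jobs below. This is only a
sketch against the same sc and processLogs as above; it assumes nothing
beyond them:

// 1. A job that ships no user classes and reads nothing from HDFS.
//    If even this fails on the cluster, the problem is not in my code.
sc.parallelize(1 to 100).count()

// 2. A job that forces processLogs (and therefore the Argonaut and
//    case-class jars) onto the executors, still without touching HDFS.
//    If (1) succeeds and this one blows up, missing jars on the executor
//    classpath look more likely than genuine heap exhaustion.
sc.parallelize(Seq("not json", "also not json")).flatMap(processLogs).count()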

Thanks
Shivani



On Wed, Jun 18, 2014 at 2:49 PM, Andrew Ash <and...@andrewash.com> wrote:

> Wait, so the file only has four lines and the job is running out of heap
> space?  Can you share the code you're running that does the processing?
> I'd guess that you're doing some intense processing on every line, but
> just writing parsed case classes back to disk sounds very lightweight.
>
>
> On Wed, Jun 18, 2014 at 5:17 PM, Shivani Rao <raoshiv...@gmail.com> wrote:
>
>> I am trying to process a file that contains 4 log lines (not very long)
>> and then write my parsed out case classes to a destination folder, and I
>> get the following error:
>>
>>
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
>>     at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2244)
>>     at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
>>     at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
>>     at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>>     at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
>>     at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
>>
>> Sadly, several folks have faced this error while trying to execute Spark
>> jobs, and there are various proposed solutions, none of which work for
>> me:
>>
>>
>> a) I tried changing the number of partitions in my RDD by using
>> coalesce(8), as suggested in
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-td7735.html#a7736,
>> and the error persisted.
>>
>> b) I tried changing SPARK_WORKER_MEM to 2g and SPARK_EXECUTOR_MEMORY to
>> 10g, and neither worked.
>>
>> c) I strongly suspect there is a classpath error
>> (http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-td4719.html),
>> mainly because the call stack is repetitive. Maybe the OOM error is a
>> disguise?
>>
>> d) I checked that I am not out of disk space and that I do not have too
>> many open files (ulimit -u << sudo ls /proc/<spark_master_process_id>/fd |
>> wc -l).
>>
>>
>> I am also noticing multiple rounds of reflection happening to find the
>> right class, I guess, so it could be a ClassNotFound error disguising
>> itself as a memory error.
>>
>>
>> Here are other threads describing the same situation, none of which have
>> been resolved so far:
>>
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/no-response-in-spark-web-UI-td4633.html
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html
>>
>>
>> Any help is greatly appreciated. I am especially calling out to the
>> creators of Spark and the Databricks folks. This seems like a "known bug"
>> waiting to happen.
>>
>>
>> Thanks,
>>
>> Shivani
>>
>> --
>> Software Engineer
>> Analytics Engineering Team@ Box
>> Mountain View, CA
>>
>
>


-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA
