I have an RDD that contains instances of a class I wrote. I was able to write
it to HDFS using saveAsObjectFile, but when I try to read the file back using
sc.objectFile, it throws a ClassNotFoundException saying it can't find the
class. Here is a simple program that reproduces the problem:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val list = mutable.MutableList[MyClass]()

    for (x <- 1 to 10) {
      val obj = new MyClass()
      list += obj
    }

    val rdd = sc.parallelize(list)
    rdd.saveAsObjectFile(outFile)

    val objs: RDD[MyClass] = sc.objectFile(outFile)  // <== This throws ClassNotFoundException

MyClass is just a simple class that extends Serializable (with no member 
fields).
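For reference, here is a sketch of what the class looks like (the real one lives in my application jar under the com.mycom package; the name and package here just match the stack trace below):

```scala
// Sketch of MyClass as used in the repro above -- the real class is in
// my application jar under com.mycom and has no member fields of its own.
class MyClass extends Serializable
```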

Here is the stack trace I get:

java.lang.ClassNotFoundException: com.mycom.MyClass
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:623)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1610)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1661)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1342)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.util.Utils$.deserialize(Utils.scala:60)
        at org.apache.spark.SparkContext$$anonfun$objectFile$1.apply(SparkContext.scala:474)
        at org.apache.spark.SparkContext$$anonfun$objectFile$1.apply(SparkContext.scala:474)
        at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:679)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:677)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
        at org.apache.spark.scheduler.ResultTask.run(ResultTask.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
        at java.lang.Thread.run(Thread.java:724)


Is there any workaround for this?

Thanks,
Hiroko
