Good afternoon,

I'm attempting to get the wordcount example working, and I keep getting an
error in the "reduceByKey(_ + _)" call. I've scoured the mailing lists and
haven't been able to find a surefire solution, unless I'm missing
something big. I did find something close, but it didn't appear to work in
my case. The error is:

org.apache.spark.SparkException: Job aborted: Task 2.0:3 failed 4 times
(most recent failure: Exception failure: java.lang.ClassNotFoundException:
SimpleApp$$anonfun$3)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I've commented the reduceByKey line out and back in to confirm it is the
cause, and it is. If I take it out, my program compiles and runs without a
problem; if I put it back in, I get the above error across all my nodes.
The spark-shell actually processes the same line correctly, so I assumed I
was missing an "import" statement. The closest report I could find was
from someone with the same symptoms whose problem was fixed by adding
"import scala.collection.JavaConversions._", but that didn't appear to
work for me. Can anyone shed some light on this? I'm pulling my hair out
trying to figure out what I'm missing.

The section of code I'm trying to get to work is:

    val JCountRes = logData.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)

"logData" is just an RDD pointing to a large (2gb) file in HDFS.

Thanks,

Ian
