I am getting a strange null pointer exception when trying to list the first
entry of a JavaPairRDD after calling groupByKey on it. Here is my code:


JavaPairRDD<Tuple3<String, String, String>, List<String>> KeyToAppList =
        KeyToApp.distinct().groupByKey();

// System.out.println("First member of the key-val list: " + KeyToAppList.first());
// The above call to .first() causes a null pointer exception.

JavaRDD<Integer> KeyToAppCount = KeyToAppList.map(
        new Function<Tuple2<Tuple3<String, String, String>, List<String>>, Integer>() {
            @Override
            public Integer call(
                    Tuple2<Tuple3<String, String, String>, List<String>> tupleOfTupAndList)
                    throws Exception {
                // Count the distinct apps in each group.
                List<String> apps = tupleOfTupAndList._2;
                Set<String> uniqueApps = new HashSet<String>(apps);
                return uniqueApps.size();
            }
        });

System.out.println("First member of the key-val list: " + KeyToAppCount.first());
// The above call to .first() prints the first element all right.
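
One detail from the trace below: the failure is reported "while deserializing
and fetching task" results, so it may be specific to shipping the grouped
values back to the driver rather than to computing them. A check I intend to
try next (just a sketch, against the same KeyToAppList as above):

    // count() runs the full job but only sends a long back to the driver,
    // so it never has to deserialize the grouped List<String> values there.
    System.out.println("Number of groups: " + KeyToAppList.count());

    // first() is, as far as I can tell, implemented via take(1) and has to
    // fetch an actual (key, list) pair to the driver, which is where mine
    // blows up.
    System.out.println("First group: " + KeyToAppList.first());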


The call to KeyToAppList.first() immediately after groupByKey results in a
null pointer exception. However, if I comment out that call and instead
proceed to apply the map function, the subsequent call to
KeyToAppCount.first() doesn't raise any exception. Why the null pointer
exception immediately after applying groupByKey?
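
To help narrow this down, here is a minimal, self-contained version of the
same pattern that I would expect to exercise the same code path. This is a
sketch only: it assumes a local master and the Spark 0.9-style Java API where
groupByKey returns List values (as in my snippet above), and the data is made
up:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class GroupByKeyFirstRepro {
        public static void main(String[] args) {
            // Local context purely for illustration.
            JavaSparkContext sc = new JavaSparkContext("local", "GroupByKeyFirstRepro");

            // Tiny made-up key/value data standing in for KeyToApp.
            JavaPairRDD<String, String> keyToApp = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<String, String>("k1", "appA"),
                    new Tuple2<String, String>("k1", "appB"),
                    new Tuple2<String, String>("k2", "appA")));

            // Same shape as my failing code: distinct, groupByKey, then first().
            JavaPairRDD<String, List<String>> grouped = keyToApp.distinct().groupByKey();
            System.out.println("First entry: " + grouped.first());

            sc.stop();
        }
    }

If this minimal version also fails in someone else's environment, that would
point at the groupByKey/first combination itself rather than at my data or
cluster setup.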

The null pointer exception looks as follows:
Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Exception while deserializing and fetching task: java.lang.NullPointerException
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



