Hi All,

I have successfully accessed my MongoDB instance from Spark. After creating a NewHadoopRDD and calling first(), I get the data back from the database correctly. However, if I call first() a second time (without calling anything else in between), Spark crashes with the following message:

org.apache.spark.rdd.NewHadoopRDD[java.lang.Object,org.bson.BSONObject] = NewHadoopRDD[1] at NewHadoopRDD at <console>:36

scala> a.first()
13/10/09 16:58:49 INFO spark.SparkContext: Starting job: first at <console>:39
13/10/09 16:58:49 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:39) with 1 output partitions (allowLocal=true)
13/10/09 16:58:49 INFO scheduler.DAGScheduler: Final stage: Stage 1 (first at <console>:39)
13/10/09 16:58:49 INFO scheduler.DAGScheduler: Parents of final stage: List()
13/10/09 16:58:49 INFO scheduler.DAGScheduler: Missing parents: List()
13/10/09 16:58:49 INFO scheduler.DAGScheduler: Computing the requested partition locally
13/10/09 16:58:49 INFO rdd.NewHadoopRDD: Input split: MongoInputSplit{URI=mongodb://mongo12.mit.edu/local.testCollection, keyField=_id, min=null, max=null, query={ }, sort={ }, fields={ }, limit=0, skip=0, notimeout=false}
13/10/09 16:58:49 INFO scheduler.DAGScheduler: Failed to run first at <console>:39
java.lang.NullPointerException
    at com.mongodb.DBApiLayer$Result.hasNext(DBApiLayer.java:416)
    at com.mongodb.DBCursor._hasNext(DBCursor.java:464)
    at com.mongodb.DBCursor.hasNext(DBCursor.java:484)
    at com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:75)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:96)
    at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:381)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.collection.Iterator$$anon$18.foreach(Iterator.scala:379)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
    at scala.collection.Iterator$$anon$18.toBuffer(Iterator.scala:379)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
    at scala.collection.Iterator$$anon$18.toArray(Iterator.scala:379)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:768)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:768)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:758)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:484)
    at org.apache.spark.scheduler.DAGScheduler$$anon$2.run(DAGScheduler.scala:470)
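For context, the RDD is created roughly like this in the shell (a minimal sketch, not my exact code: the URI is taken from the input split in the log above, and the configuration key and key/value classes are the standard mongo-hadoop connector ones):

import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// sc is the SparkContext provided by the Spark shell.
// "mongo.input.uri" is the mongo-hadoop connector's standard config key;
// the URI here matches the MongoInputSplit in the log above.
val config = new Configuration()
config.set("mongo.input.uri", "mongodb://mongo12.mit.edu/local.testCollection")

// Key/value classes match the RDD type printed by the REPL:
// NewHadoopRDD[java.lang.Object, org.bson.BSONObject]
val a = sc.newAPIHadoopRDD(
  config,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

a.first()  // returns a document correctly the first time
a.first()  // the second call throws the NullPointerException shown above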

Any ideas what I'm doing wrong? Is this a Mongo driver problem or a Spark problem?

Best,

Yadid
