I tried using RDD#mapPartitions, but my job completes prematurely and without error, as if nothing gets done. What I have is fairly simple:
    sc.textFile(inputFile)
      .map(parser.parse)
      .mapPartitions(bulkLoad)

But the Iterator[T] my mapPartitions function receives is always empty, even
though I know the preceding map is generating records.

On Thu Nov 20 2014 at 9:25:54 PM Soumya Simanta <soumya.sima...@gmail.com> wrote:

> On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson <ben.d.tho...@gmail.com>
> wrote:
>
>> I'm trying to use MongoDB as a destination for an ETL I'm writing in
>> Spark. It appears I'm gaining a lot of overhead in my system databases
>> (and possibly in the primary documents themselves); I can only assume
>> it's because I'm left using PairRDD.saveAsNewAPIHadoopFile.
>>
>> - Is there a way to batch some of the data together and use Casbah
>>   natively so I can use bulk inserts?
>
> Why can't you write to Mongo in RDD#mapPartitions?
>
>> - Is there maybe a less "hacky" way to load to MongoDB (instead of
>>   using saveAsNewAPIHadoopFile)?
>
> If the latency (the time by which all data must be in Mongo) is not a
> concern, you can try a separate process that uses Akka/Casbah to write
> from HDFS into Mongo.
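A note for anyone who finds this thread later: two things commonly produce
exactly these symptoms. First, mapPartitions is a lazy transformation, so if
no action ever consumes its output, the partition function never runs and the
job finishes without error having done nothing. Second, the Iterator[T] handed
to the function is single-pass, so anything that traverses it before the real
work (a size or isEmpty check, say) leaves it empty. A minimal sketch of the
eager-evaluation fix, with bulkLoad as a hypothetical stand-in for the
original (unshown) function:

    // mapPartitions is lazy: the body below runs only when an action
    // (count, collect, saveAs*, foreach*) consumes its output.
    val written = sc.textFile(inputFile)
      .map(parser.parse)
      .mapPartitions { records =>
        val n = bulkLoad(records)  // hypothetical: writes the batch, returns a count
        Iterator.single(n)         // the function must return an Iterator, not Unit
      }
    written.count()                // the action that actually forces the writes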
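And on the bulk-insert question itself: since the write is a side effect,
foreachPartition (an action, so it can't be skipped by lazy evaluation) is
arguably a better fit than mapPartitions. A sketch of what that could look
like with Casbah; the host, port, database name, collection name, and batch
size are all placeholders, and it assumes parser.parse yields Casbah
DBObjects:

    import com.mongodb.casbah.Imports._

    sc.textFile(inputFile)
      .map(parser.parse)
      .foreachPartition { records =>
        // One client per partition, not per record.
        val client = MongoClient("localhost", 27017)  // placeholder host/port
        val coll = client("etl_db")("documents")      // placeholder db/collection
        try {
          // Casbah's varargs insert sends each group as a single bulk write.
          records.grouped(1000).foreach(batch => coll.insert(batch: _*))
        } finally {
          client.close()
        }
      }

Grouping the partition keeps memory bounded while still batching the round
trips, which should cut most of the per-document overhead you were seeing.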