On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson <ben.d.tho...@gmail.com>
wrote:

> I'm trying to use MongoDB as a destination for an ETL I'm writing in
> Spark. It appears I'm gaining a lot of overhead in my system databases
> (and possibly in the primary documents themselves); I can only assume it's
> because I'm left using PairRDD.saveAsNewAPIHadoopFile.
>
> - Is there a way to batch some of the data together and use Casbah
> natively so I can use bulk inserts?
>

Why can't you write to Mongo inside RDD#mapPartitions (or foreachPartition,
since you don't need a result RDD)?

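A minimal sketch of that approach, assuming Casbah 2.7+ and that `rdd` is the
RDD your ETL produces; the host, database, collection, batch size, and the
record-to-document mapping are placeholders, not anything from your setup:

    import com.mongodb.casbah.Imports._

    rdd.foreachPartition { records =>
      // One client per partition, created on the executor
      // (not serialized from the driver).
      val client = MongoClient("mongo-host", 27017)
      val coll   = client("etl_db")("output")
      try {
        // Group the partition's records so each call is a
        // multi-document insert rather than one insert per record.
        records.grouped(1000).foreach { batch =>
          coll.insert(batch.map(r => MongoDBObject("value" -> r.toString)): _*)
        }
      } finally {
        client.close()
      }
    }

foreachPartition is just the side-effect-only variant of mapPartitions, so you
skip materializing an output RDD you don't need.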

>
> - Is there maybe a less "hacky" way to load to MongoDB (instead of
> using saveAsNewAPIHadoopFile)?
>
>
If latency (the time by which all data must be in Mongo) is not a concern, you
can run a separate process that uses Akka/Casbah to write from HDFS into
Mongo.
