Hi Everyone,

I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process
that GridFS data with a Java Spark MapReduce job. I have previously
processed normal MongoDB collections (not GridFS) with Apache Spark using
the Mongo-Hadoop connector, but I am unable to handle GridFS collections
as input.
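
In case the write side matters, the file was stored through the driver's
standard GridFS API, along these lines (just a sketch; it assumes the 2.x
Java driver, and "book.pdf" / the local path are placeholders):

  import java.io.File;
  import com.mongodb.DB;
  import com.mongodb.MongoClient;
  import com.mongodb.gridfs.GridFS;
  import com.mongodb.gridfs.GridFSInputFile;

  public class GridFsSave {
    public static void main(String[] args) throws Exception {
      MongoClient client = new MongoClient("localhost", 27017);
      DB db = client.getDB("pdfbooks");
      // GridFS splits the file into fixed-size binary pieces:
      // metadata goes to fs.files, the pieces go to fs.chunks.
      GridFS gridFs = new GridFS(db); // default "fs" bucket
      GridFSInputFile in = gridFs.createFile(new File("/tmp/book.pdf")); // placeholder path
      in.save();
      client.close();
    }
  }

So on disk the PDF only exists as fs.files metadata plus fs.chunks binary
pieces, which I believe is why my Spark job only ever sees raw chunk
documents.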

My questions are:

1) How do I pass MongoDB GridFS data as input to a Spark application?
2) Is there a separate RDD or InputFormat for handling GridFS binary data?

I'm trying the following snippet, but I'm unable to get at the actual
file data:

  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.FlatMapFunction;
  import org.bson.BSONObject;
  import com.mongodb.hadoop.MongoInputFormat;
  import com.mongodb.hadoop.util.MongoConfigUtil;
  import scala.Tuple2;

  // Point the connector at the GridFS chunks collection.
  MongoConfigUtil.setInputURI(config,
      "mongodb://localhost:27017/pdfbooks.fs.chunks");
  MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);

  JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
      MongoInputFormat.class, Object.class, BSONObject.class);

  JavaRDD<String> words = mongoRDD.flatMap(
      new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
        @Override
        public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
          // Each record is one chunk document ({files_id, n, data}),
          // so this prints raw BSON, not the PDF contents.
          System.out.println(arg._2.toString());
          ...
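
What I think I'm missing is a step that regroups the chunks back into
whole files before processing. Is something along these lines the
intended approach? (Just a sketch: it assumes the standard chunk schema
{files_id, n, data}, that the connector decodes the data field as byte[],
and note that for a single 2 GB file this pulls everything onto one
executor, so it may not even fit in memory.)

  import java.io.ByteArrayOutputStream;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.function.Function;
  import org.apache.spark.api.java.function.PairFunction;
  import org.bson.BSONObject;
  import scala.Tuple2;

  // Key every chunk by its owning file, then stitch the chunks together.
  JavaPairRDD<Object, byte[]> files = mongoRDD
      .mapToPair(new PairFunction<Tuple2<Object, BSONObject>,
          Object, Tuple2<Integer, byte[]>>() {
        @Override
        public Tuple2<Object, Tuple2<Integer, byte[]>> call(
            Tuple2<Object, BSONObject> t) {
          BSONObject chunk = t._2;
          Integer n = ((Number) chunk.get("n")).intValue(); // chunk sequence number
          byte[] data = (byte[]) chunk.get("data");         // assumption: decoded as byte[]
          return new Tuple2<Object, Tuple2<Integer, byte[]>>(
              chunk.get("files_id"), new Tuple2<Integer, byte[]>(n, data));
        }
      })
      .groupByKey()
      .mapValues(new Function<Iterable<Tuple2<Integer, byte[]>>, byte[]>() {
        @Override
        public byte[] call(Iterable<Tuple2<Integer, byte[]>> chunks) throws Exception {
          List<Tuple2<Integer, byte[]>> sorted = new ArrayList<Tuple2<Integer, byte[]>>();
          for (Tuple2<Integer, byte[]> c : chunks) {
            sorted.add(c);
          }
          // Chunks arrive unordered; "n" restores the original byte order.
          Collections.sort(sorted, new Comparator<Tuple2<Integer, byte[]>>() {
            public int compare(Tuple2<Integer, byte[]> a, Tuple2<Integer, byte[]> b) {
              return a._1.compareTo(b._1);
            }
          });
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          for (Tuple2<Integer, byte[]> c : sorted) {
            out.write(c._2);
          }
          return out.toByteArray(); // full file bytes, ready for a PDF parser
        }
      });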

Please suggest any available ways to do this. Thank you in advance!
