Hi everyone. I saved a 2 GB PDF file into MongoDB using GridFS, and now I want to process that GridFS collection with Java Spark MapReduce. Previously I successfully processed normal MongoDB collections (not GridFS) with Apache Spark using the Mongo-Hadoop connector, but now I'm unable to handle GridFS collections as input.
My questions are:

1) How do I pass MongoDB GridFS data as input to a Spark application?
2) Is there a separate RDD (or InputFormat) for handling GridFS binary data?

I'm trying the following snippet, but I'm unable to get at the actual data:

```java
MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/pdfbooks.fs.chunks");
MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
        com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
JavaRDD<String> words = mongoRDD.flatMap(
        new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
    @Override
    public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
        System.out.println(arg._2.toString());
        ...
```

Please suggest available ways to do this. Thank you in advance!
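For context, here is the reassembly logic I understand is needed once the `fs.chunks` documents are read: each chunk document carries a `files_id` (which file it belongs to), an `n` (the chunk's index), and a `data` field with the raw bytes, so the chunks of one file must be grouped by `files_id`, sorted by `n`, and concatenated. This is a minimal sketch of that step only; it uses plain `Map<String, Object>` values as stand-ins for the `BSONObject` documents the Mongo-Hadoop connector would produce, since it runs without a MongoDB instance:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GridFsReassemble {

    // Rebuild a file's bytes from its fs.chunks documents.
    // Chunks may arrive in any order, so sort by "n" before concatenating.
    static byte[] reassemble(List<Map<String, Object>> chunks) {
        List<Map<String, Object>> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt(c -> (Integer) c.get("n")));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map<String, Object> chunk : sorted) {
            byte[] data = (byte[]) chunk.get("data");
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }

    // Helper to build a chunk document with the fields GridFS uses.
    static Map<String, Object> chunk(int n, byte[] data) {
        Map<String, Object> m = new HashMap<>();
        m.put("n", n);
        m.put("data", data);
        return m;
    }

    public static void main(String[] args) {
        // Two out-of-order chunks of one (tiny, illustrative) file.
        List<Map<String, Object>> chunks = Arrays.asList(
                chunk(1, " world".getBytes()),
                chunk(0, "Hello".getBytes()));
        System.out.println(new String(reassemble(chunks))); // Hello world
    }
}
```

In a Spark job the same idea would be a `groupBy` on the `files_id` field of each `BSONObject` followed by this sort-and-concatenate over each group; the exact Java type of the `data` field as returned by the connector (e.g. `byte[]` vs. a BSON binary wrapper) is an assumption here and worth checking against what `arg._2.get("data")` actually returns.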