Very cool, thanks for writing this. I’ll link it from our website.

Matei
On Feb 18, 2014, at 12:44 PM, Sampo Niskanen <[email protected]> wrote:

> Hi,
>
> Since getting Spark + MongoDB to work together was not very obvious (at least
> to me) I wrote a tutorial about it in my blog with an example application:
> http://codeforhire.com/2014/02/18/using-spark-with-mongodb/
>
> Hope it's of use to someone else as well.
>
> Cheers,
>
> Sampo Niskanen
> Lead developer / Wellmo
>
> [email protected]
> +358 40 820 5291
>
>
> On Tue, Feb 4, 2014 at 10:46 PM, Tathagata Das <[email protected]>
> wrote:
>
> Can you try using sc.newAPIHadoop** ?
>
> There are two kinds of classes because the Hadoop API for input and output
> format had undergone a significant change a few years ago.
>
> TD
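For concreteness, here is a minimal sketch of what that suggestion could look like against the Java API, assuming the new-API (org.apache.hadoop.mapreduce) MongoInputFormat from the mongo-hadoop connector; the mongo.input.uri value is a placeholder for your own database and collection:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoInputFormat;

    public class MongoReadSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

            // mongo-hadoop reads its source collection from this property;
            // the URI below is a placeholder, not a real deployment.
            Configuration config = new Configuration();
            config.set("mongo.input.uri",
                    "mongodb://localhost:27017/mydb.mycollection");

            // com.mongodb.hadoop.MongoInputFormat extends the new-API
            // org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>,
            // so it satisfies the bound on newAPIHadoopRDD.
            JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
                    config, MongoInputFormat.class,
                    Object.class, BSONObject.class);

            System.out.println("Document count: " + documents.count());
        }
    }

The key point is that newAPIHadoopRDD is bounded by the mapreduce-package InputFormat while hadoopRDD is bounded by the mapred-package one, so each Spark method accepts only the matching MongoInputFormat variant.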
> On Tue, Feb 4, 2014 at 5:58 AM, Sampo Niskanen <[email protected]>
> wrote:
>
> Hi,
>
> Thanks for the pointer. However, I'm still unable to generate the RDD using
> MongoInputFormat. I'm trying to add the mongo-hadoop connector to the Java
> SimpleApp in the quickstart at
> http://spark.incubator.apache.org/docs/latest/quick-start.html
>
> The mongo-hadoop connector contains two versions of MongoInputFormat, one
> extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>, the
> other extending org.apache.hadoop.mapred.InputFormat<Object, BSONObject>.
> Neither of them is accepted by the compiler, and I'm unsure why:
>
>     JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
>     sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>             Object.class, BSONObject.class);
>     sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class,
>             Object.class, BSONObject.class);
>
> Eclipse gives the following error for both of the latter two lines:
>
>     Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>, Class<K>,
>     Class<V>) of type JavaSparkContext is not applicable for the arguments
>     (JobConf, Class<MongoInputFormat>, Class<Object>, Class<BSONObject>). The
>     inferred type MongoInputFormat is not a valid substitute for the bounded
>     parameter <F extends InputFormat<K,V>>
>
> I'm using Spark 0.9.0. Might this be caused by a conflict of Hadoop
> versions? I downloaded the mongo-hadoop connector for Hadoop 2.2. I haven't
> figured out how to select which Hadoop version Spark uses when it is
> required from an sbt file. (The sbt file is the one described in the
> quickstart.)
>
> Thanks for any help.
>
> Best regards,
> Sampo N.
>
>
> On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das <[email protected]>
> wrote:
>
> I walked through the example in the second link you gave. The Treasury Yield
> example referred to there is here. Note the InputFormat and OutputFormat
> used in the job configuration. These specify how to read data from and write
> data to MongoDB. You should be able to use the same InputFormat and
> OutputFormat classes in Spark as well. To save files to MongoDB, use
> yourRDD.saveAsHadoopFile(... specify the output format class ...), and to
> read from MongoDB, use sparkContext.hadoopFile(... specify the input format
> class ...).
>
> TD
>
>
> On Thu, Jan 30, 2014 at 12:36 PM, Sampo Niskanen <[email protected]>
> wrote:
>
> Hi,
>
> We're starting to build an analytics framework for our wellness service.
> While our data is not yet Big, we'd like to use a framework that will scale
> as needed, and Spark seems to be the best around.
>
> I'm new to Hadoop and Spark, and I'm having difficulty figuring out how to
> use Spark in connection with MongoDB. Apparently, I should be able to use
> the mongo-hadoop connector (https://github.com/mongodb/mongo-hadoop) with
> Spark as well, but I haven't figured out how.
>
> I've run through the Spark tutorials and been able to set up a
> single-machine Hadoop system with the MongoDB connector as instructed at
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
> and
> http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
>
> Could someone give some instructions or pointers on how to configure and use
> the mongo-hadoop connector with Spark? I haven't been able to find any
> documentation about this.
>
> Thanks.
>
> Best regards,
> Sampo N.
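Pulling TD's two pointers together, the write path would look roughly like the sketch below, assuming the connector's new-API MongoOutputFormat and saveAsNewAPIHadoopFile (the new-API counterpart of the saveAsHadoopFile call TD mentions); the mongo.output.uri value is again a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoOutputFormat;

    public class MongoWriteSketch {
        // Saves a pair RDD of (id, document) back to MongoDB.
        static void save(JavaPairRDD<Object, BSONObject> documents) {
            Configuration config = new Configuration();
            // mongo-hadoop takes the target collection from this
            // property; the URI below is a placeholder.
            config.set("mongo.output.uri",
                    "mongodb://localhost:27017/mydb.output");

            // MongoOutputFormat writes straight to MongoDB and ignores
            // the path argument, so a dummy value is fine here.
            documents.saveAsNewAPIHadoopFile("file:///unused",
                    Object.class, BSONObject.class,
                    MongoOutputFormat.class, config);
        }
    }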

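As for selecting the Hadoop version from sbt, a hypothetical build.sbt along the lines of the quickstart's could pin hadoop-client next to spark-core. The artifact coordinates and version numbers below are assumptions, not something confirmed in this thread; check the mongo-hadoop README for the exact ones:

    // Hypothetical build.sbt sketch; versions and coordinates are
    // assumptions from the Spark 0.9 era, not taken from this thread.
    name := "Simple App"

    version := "1.0"

    scalaVersion := "2.10.3"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      // Pin hadoop-client to the version the mongo-hadoop connector
      // was built against (here, Hadoop 2.2):
      "org.apache.hadoop" % "hadoop-client" % "2.2.0",
      "org.mongodb" % "mongo-hadoop-core" % "1.2.0"
    )

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"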