Very cool, thanks for writing this. I’ll link it from our website.

Matei
On Feb 18, 2014, at 12:44 PM, Sampo Niskanen <[email protected]> wrote:

> Hi,
>
> Since getting Spark + MongoDB to work together was not very obvious (at least
> to me) I wrote a tutorial about it in my blog with an example application:
> http://codeforhire.com/2014/02/18/using-spark-with-mongodb/
>
> Hope it's of use to someone else as well.
>
> Cheers,
>
> Sampo Niskanen
> Lead developer / Wellmo
>
> [email protected]
> +358 40 820 5291
>
>
> On Tue, Feb 4, 2014 at 10:46 PM, Tathagata Das <[email protected]>
> wrote:
>
> Can you try using sc.newAPIHadoop** ?
>
> There are two kinds of classes because the Hadoop API for input and output
> format had undergone a significant change a few years ago.
>
> TD
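For concreteness, here is a minimal sketch of what that suggestion could look like against the Java API, assuming the new-API (org.apache.hadoop.mapreduce) MongoInputFormat from the mongo-hadoop connector; the mongo.input.uri value is a placeholder for your own database and collection:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoInputFormat;

    public class MongoReadSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

            // mongo-hadoop reads its source collection from this property;
            // the URI below is a placeholder, not a real deployment.
            Configuration config = new Configuration();
            config.set("mongo.input.uri",
                    "mongodb://localhost:27017/mydb.mycollection");

            // com.mongodb.hadoop.MongoInputFormat extends the new-API
            // org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>,
            // so it satisfies the bound on newAPIHadoopRDD.
            JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
                    config, MongoInputFormat.class,
                    Object.class, BSONObject.class);

            System.out.println("Document count: " + documents.count());
        }
    }

The key point is that newAPIHadoopRDD is bounded by the mapreduce-package InputFormat while hadoopRDD is bounded by the mapred-package one, so each Spark method accepts only the matching MongoInputFormat variant.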
> On Tue, Feb 4, 2014 at 5:58 AM, Sampo Niskanen <[email protected]>
> wrote:
>
> Hi,
>
> Thanks for the pointer. However, I'm still unable to generate the RDD using
> MongoInputFormat. I'm trying to add the mongo-hadoop connector to the Java
> SimpleApp in the quickstart at
> http://spark.incubator.apache.org/docs/latest/quick-start.html
>
> The mongo-hadoop connector contains two versions of MongoInputFormat, one
> extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>, the
> other extending org.apache.hadoop.mapred.InputFormat<Object, BSONObject>.
> Neither of them is accepted by the compiler, and I'm unsure why:
>
>     JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
>     sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>             Object.class, BSONObject.class);
>     sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class,
>             Object.class, BSONObject.class);
>
> Eclipse gives the following error for both of the latter two lines:
>
>     Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>, Class<K>,
>     Class<V>) of type JavaSparkContext is not applicable for the arguments
>     (JobConf, Class<MongoInputFormat>, Class<Object>, Class<BSONObject>). The
>     inferred type MongoInputFormat is not a valid substitute for the bounded
>     parameter <F extends InputFormat<K,V>>
>
> I'm using Spark 0.9.0. Might this be caused by a conflict of Hadoop
> versions? I downloaded the mongo-hadoop connector for Hadoop 2.2. I haven't
> figured out how to select which Hadoop version Spark uses when it is
> required from an sbt file. (The sbt file is the one described in the
> quickstart.)
>
> Thanks for any help.
>
> Best regards,
> Sampo N.
>
>
> On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das <[email protected]>
> wrote:
>
> I walked through the example in the second link you gave. The Treasury Yield
> example referred to there is here. Note the InputFormat and OutputFormat
> used in the job configuration. These specify how to read data from and write
> data to MongoDB. You should be able to use the same InputFormat and
> OutputFormat classes in Spark as well. To save files to MongoDB, use
> yourRDD.saveAsHadoopFile(... specify the output format class ...), and to
> read from MongoDB, use sparkContext.hadoopFile(... specify the input format
> class ...).
>
> TD
>
>
> On Thu, Jan 30, 2014 at 12:36 PM, Sampo Niskanen <[email protected]>
> wrote:
>
> Hi,
>
> We're starting to build an analytics framework for our wellness service.
> While our data is not yet Big, we'd like to use a framework that will scale
> as needed, and Spark seems to be the best around.
>
> I'm new to Hadoop and Spark, and I'm having difficulty figuring out how to
> use Spark in connection with MongoDB. Apparently, I should be able to use
> the mongo-hadoop connector (https://github.com/mongodb/mongo-hadoop) with
> Spark as well, but I haven't figured out how.
>
> I've run through the Spark tutorials and been able to set up a
> single-machine Hadoop system with the MongoDB connector as instructed at
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
> and
> http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
>
> Could someone give some instructions or pointers on how to configure and use
> the mongo-hadoop connector with Spark? I haven't been able to find any
> documentation about this.
>
> Thanks.
>
> Best regards,
> Sampo N.
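Pulling TD's two pointers together, the write path would look roughly like the sketch below, assuming the connector's new-API MongoOutputFormat and saveAsNewAPIHadoopFile (the new-API counterpart of the saveAsHadoopFile call TD mentions); the mongo.output.uri value is again a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoOutputFormat;

    public class MongoWriteSketch {
        // Saves a pair RDD of (id, document) back to MongoDB.
        static void save(JavaPairRDD<Object, BSONObject> documents) {
            Configuration config = new Configuration();
            // mongo-hadoop takes the target collection from this
            // property; the URI below is a placeholder.
            config.set("mongo.output.uri",
                    "mongodb://localhost:27017/mydb.output");

            // MongoOutputFormat writes straight to MongoDB and ignores
            // the path argument, so a dummy value is fine here.
            documents.saveAsNewAPIHadoopFile("file:///unused",
                    Object.class, BSONObject.class,
                    MongoOutputFormat.class, config);
        }
    }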

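As for selecting the Hadoop version from sbt, a hypothetical build.sbt along the lines of the quickstart's could pin hadoop-client next to spark-core. The artifact coordinates and version numbers below are assumptions, not something confirmed in this thread; check the mongo-hadoop README for the exact ones:

    // Hypothetical build.sbt sketch; versions and coordinates are
    // assumptions from the Spark 0.9 era, not taken from this thread.
    name := "Simple App"

    version := "1.0"

    scalaVersion := "2.10.3"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      // Pin hadoop-client to the version the mongo-hadoop connector
      // was built against (here, Hadoop 2.2):
      "org.apache.hadoop" % "hadoop-client" % "2.2.0",
      "org.mongodb" % "mongo-hadoop-core" % "1.2.0"
    )

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"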