Hi,
Since getting Spark + MongoDB to work together was not very obvious (at
least to me), I wrote a tutorial about it on my blog, with an example
application:
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/
Hope it's of use to someone else as well.
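
In short, the pattern that ended up working for me looks roughly like the
sketch below (the class name, URIs, and database/collection names are just
placeholders; adjust them to your own setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;

    public class SparkMongoSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

            // Reading: the connector picks up mongo.input.uri from the Configuration
            Configuration inputConfig = new Configuration();
            inputConfig.set("mongo.input.uri",
                "mongodb://localhost:27017/mydb.mycollection");
            JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
                inputConfig, MongoInputFormat.class, Object.class, BSONObject.class);

            // Writing: mongo.output.uri tells MongoOutputFormat where to write;
            // the path argument is required by the API but not used by the connector
            Configuration outputConfig = new Configuration();
            outputConfig.set("mongo.output.uri",
                "mongodb://localhost:27017/mydb.output");
            docs.saveAsNewAPIHadoopFile("file:///unused", Object.class,
                BSONObject.class, MongoOutputFormat.class, outputConfig);
        }
    }

The full example application is in the blog post.
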
Cheers,
Sampo Niskanen
Lead developer / Wellmo
[email protected]
+358 40 820 5291
On Tue, Feb 4, 2014 at 10:46 PM, Tathagata Das
<[email protected]> wrote:
> Can you try using sc.newAPIHadoop**? There are two kinds of classes
> because the Hadoop API for input and output formats underwent a
> significant change a few years ago.
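>
> For example, roughly (just a sketch; "conf" and "jobConf" stand in for
> whatever configuration objects you are using):
>
>     // new (mapreduce) API class pairs with newAPIHadoopRDD
>     sc.newAPIHadoopRDD(conf, com.mongodb.hadoop.MongoInputFormat.class,
>         Object.class, BSONObject.class);
>
>     // old (mapred) API class pairs with hadoopRDD and a JobConf
>     sc.hadoopRDD(jobConf, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>         Object.class, BSONObject.class);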
>
> TD
>
>
> On Tue, Feb 4, 2014 at 5:58 AM, Sampo Niskanen
> <[email protected]> wrote:
>
>> Hi,
>>
>> Thanks for the pointer. However, I'm still unable to generate the RDD
>> using MongoInputFormat. I'm trying to add the mongo-hadoop connector to
>> the Java SimpleApp in the quickstart at
>> http://spark.incubator.apache.org/docs/latest/quick-start.html
>>
>>
>> The mongo-hadoop connector contains two versions of MongoInputFormat, one
>> extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>,
>> the other extending org.apache.hadoop.mapred.InputFormat<Object,
>> BSONObject>. Neither of them is accepted by the compiler, and I'm
>> unsure why:
>>
>> JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
>> // old (mapred) API version of MongoInputFormat
>> sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>>     Object.class, BSONObject.class);
>> // new (mapreduce) API version of MongoInputFormat
>> sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class,
>>     Object.class, BSONObject.class);
>>
>> Eclipse gives the following error for both of the latter two lines:
>>
>> Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>,
>> Class<K>, Class<V>) of type JavaSparkContext is not applicable for the
>> arguments (JobConf, Class<MongoInputFormat>, Class<Object>,
>> Class<BSONObject>). The inferred type MongoInputFormat is not a valid
>> substitute for the bounded parameter <F extends InputFormat<K,V>>
>>
>>
>>
>> I'm using Spark 0.9.0. Might this be caused by a conflict of Hadoop
>> versions? I downloaded the mongo-hadoop connector for Hadoop 2.2, and I
>> haven't figured out how to select which Hadoop version Spark uses when it
>> is pulled in as a dependency in an sbt file. (The sbt file is the one
>> described in the quickstart.)
>>
>>
>> Thanks for any help.
>>
>>
>> Best regards,
>> Sampo N.
>>
>>
>>
>> On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das
>> <[email protected]> wrote:
>>
>>> I walked through the example in the second link you gave. The Treasury
>>> Yield example referred to there is here:
>>> https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java
>>> Note the InputFormat and OutputFormat used in the job configuration. These
>>> classes specify how data is read from and written to MongoDB, and you
>>> should be able to use the same InputFormat and OutputFormat classes in
>>> Spark as well. To save data to MongoDB, use
>>> yourRDD.saveAsHadoopFile(... specify the output format class ...), and to
>>> read from MongoDB, use sparkContext.hadoopFile(... specify the input
>>> format class ...).
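>>>
>>> A rough sketch of what those calls could look like with the connector's
>>> old-API classes (this assumes mapred-package versions of both
>>> MongoInputFormat and MongoOutputFormat, and uses the JobConf-based
>>> hadoopRDD variant since MongoDB input has no real file path; the URIs
>>> are placeholders):
>>>
>>>     JobConf mongoConf = new JobConf();
>>>     mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.in");
>>>     mongoConf.set("mongo.output.uri", "mongodb://localhost:27017/mydb.out");
>>>
>>>     // read: key is the document _id, value is the BSON document
>>>     JavaPairRDD<Object, BSONObject> rdd = sc.hadoopRDD(mongoConf,
>>>         com.mongodb.hadoop.mapred.MongoInputFormat.class,
>>>         Object.class, BSONObject.class);
>>>
>>>     // write: the path argument is required by the API, but the connector
>>>     // writes to mongo.output.uri instead
>>>     rdd.saveAsHadoopFile("file:///unused", Object.class, BSONObject.class,
>>>         com.mongodb.hadoop.mapred.MongoOutputFormat.class, mongoConf);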
>>>
>>> TD
>>>
>>>
>>> On Thu, Jan 30, 2014 at 12:36 PM, Sampo Niskanen
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're starting to build an analytics framework for our wellness
>>>> service. While our data is not yet Big, we'd like to use a framework that
>>>> will scale as needed, and Spark seems to be the best around.
>>>>
>>>> I'm new to Hadoop and Spark, and I'm having difficulty figuring out how
>>>> to use Spark in connection with MongoDB. Apparently I should be able to
>>>> use the mongo-hadoop connector (https://github.com/mongodb/mongo-hadoop)
>>>> with Spark as well, but I haven't figured out how.
>>>>
>>>> I've run through the Spark tutorials and been able to set up a
>>>> single-machine Hadoop system with the MongoDB connector as instructed at
>>>>
>>>> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>>>> and
>>>> http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
>>>>
>>>> Could someone give some instructions or pointers on how to configure
>>>> and use the mongo-hadoop connector with Spark? I haven't been able to find
>>>> any documentation about this.
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Best regards,
>>>> Sampo N.
>>>>
>>>>
>>>>
>>>
>>
>