Hi,
Since getting Spark + MongoDB to work together was not very obvious (at
least to me), I wrote a tutorial about it on my blog, with an example
application:
http://codeforhire.com/2014/02/18/using-spark-with-mongodb/
Hope it's of use to someone else as well.
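
In short, the pattern that ended up working for me looks roughly like the
sketch below (the class name, URIs, and database/collection names are just
placeholders; adjust them to your own setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;

    public class SparkMongoSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

            // Reading: the connector picks up mongo.input.uri from the Configuration
            Configuration inputConfig = new Configuration();
            inputConfig.set("mongo.input.uri",
                "mongodb://localhost:27017/mydb.mycollection");
            JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
                inputConfig, MongoInputFormat.class, Object.class, BSONObject.class);

            // Writing: mongo.output.uri tells MongoOutputFormat where to write;
            // the path argument is required by the API but not used by the connector
            Configuration outputConfig = new Configuration();
            outputConfig.set("mongo.output.uri",
                "mongodb://localhost:27017/mydb.output");
            docs.saveAsNewAPIHadoopFile("file:///unused", Object.class,
                BSONObject.class, MongoOutputFormat.class, outputConfig);
        }
    }

The full example application is in the blog post.
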
Cheers,
Sampo Niskanen
Lead developer / Wellmo
[email protected]
+358 40 820 5291
On Tue, Feb 4, 2014 at 10:46 PM, Tathagata Das
<[email protected]> wrote:
> Can you try using sc.newAPIHadoop**? There are two kinds of classes
> because the Hadoop API for input and output formats underwent a
> significant change a few years ago.
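>
> For example, roughly (just a sketch; "conf" and "jobConf" stand in for
> whatever configuration objects you are using):
>
>     // new (mapreduce) API class pairs with newAPIHadoopRDD
>     sc.newAPIHadoopRDD(conf, com.mongodb.hadoop.MongoInputFormat.class,
>         Object.class, BSONObject.class);
>
>     // old (mapred) API class pairs with hadoopRDD and a JobConf
>     sc.hadoopRDD(jobConf, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>         Object.class, BSONObject.class);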
>
> TD
>
>
> On Tue, Feb 4, 2014 at 5:58 AM, Sampo Niskanen
> <[email protected]> wrote:
>
>> Hi,
>>
>> Thanks for the pointer. However, I'm still unable to generate the RDD
>> using MongoInputFormat. I'm trying to add the mongo-hadoop connector to
>> the Java SimpleApp in the quickstart at
>> http://spark.incubator.apache.org/docs/latest/quick-start.html
>>
>>
>> The mongo-hadoop connector contains two versions of MongoInputFormat, one
>> extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>,
>> the other extending org.apache.hadoop.mapred.InputFormat<Object,
>> BSONObject>. Neither of them is accepted by the compiler, and I'm
>> unsure why:
>>
>> JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
>> // old (mapred) API version of MongoInputFormat
>> sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>>     Object.class, BSONObject.class);
>> // new (mapreduce) API version of MongoInputFormat
>> sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class,
>>     Object.class, BSONObject.class);
>>
>> Eclipse gives the following error for both of the latter two lines:
>>
>> Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>,
>> Class<K>, Class<V>) of type JavaSparkContext is not applicable for the
>> arguments (JobConf, Class<MongoInputFormat>, Class<Object>,
>> Class<BSONObject>). The inferred type MongoInputFormat is not a valid
>> substitute for the bounded parameter <F extends InputFormat<K,V>>
>>
>>
>>
>> I'm using Spark 0.9.0. Might this be caused by a conflict of Hadoop
>> versions? I downloaded the mongo-hadoop connector for Hadoop 2.2, and I
>> haven't figured out how to select which Hadoop version Spark uses when it
>> is pulled in as a dependency in an sbt file. (The sbt file is the one
>> described in the quickstart.)
>>
>>
>> Thanks for any help.
>>
>>
>> Best regards,
>> Sampo N.
>>
>>
>>
>> On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das
>> <[email protected]> wrote:
>>
>>> I walked through the example in the second link you gave. The Treasury
>>> Yield example referred to there is here:
>>> https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java
>>> Note the InputFormat and OutputFormat used in the job configuration. These
>>> classes specify how data is read from and written to MongoDB, and you
>>> should be able to use the same InputFormat and OutputFormat classes in
>>> Spark as well. To save data to MongoDB, use
>>> yourRDD.saveAsHadoopFile(... specify the output format class ...), and to
>>> read from MongoDB, use sparkContext.hadoopFile(... specify the input
>>> format class ...).
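>>>
>>> A rough sketch of what those calls could look like with the connector's
>>> old-API classes (this assumes mapred-package versions of both
>>> MongoInputFormat and MongoOutputFormat, and uses the JobConf-based
>>> hadoopRDD variant since MongoDB input has no real file path; the URIs
>>> are placeholders):
>>>
>>>     JobConf mongoConf = new JobConf();
>>>     mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.in");
>>>     mongoConf.set("mongo.output.uri", "mongodb://localhost:27017/mydb.out");
>>>
>>>     // read: key is the document _id, value is the BSON document
>>>     JavaPairRDD<Object, BSONObject> rdd = sc.hadoopRDD(mongoConf,
>>>         com.mongodb.hadoop.mapred.MongoInputFormat.class,
>>>         Object.class, BSONObject.class);
>>>
>>>     // write: the path argument is required by the API, but the connector
>>>     // writes to mongo.output.uri instead
>>>     rdd.saveAsHadoopFile("file:///unused", Object.class, BSONObject.class,
>>>         com.mongodb.hadoop.mapred.MongoOutputFormat.class, mongoConf);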
>>>
>>> TD
>>>
>>>
>>> On Thu, Jan 30, 2014 at 12:36 PM, Sampo Niskanen
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're starting to build an analytics framework for our wellness
>>>> service. While our data is not yet Big, we'd like to use a framework that
>>>> will scale as needed, and Spark seems to be the best around.
>>>>
>>>> I'm new to Hadoop and Spark, and I'm having difficulty figuring out how
>>>> to use Spark in connection with MongoDB. Apparently I should be able to
>>>> use the mongo-hadoop connector (https://github.com/mongodb/mongo-hadoop)
>>>> with Spark as well, but I haven't figured out how.
>>>>
>>>> I've run through the Spark tutorials and been able to set up a
>>>> single-machine Hadoop system with the MongoDB connector as instructed at
>>>>
>>>> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>>>> and
>>>> http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
>>>>
>>>> Could someone give some instructions or pointers on how to configure
>>>> and use the mongo-hadoop connector with Spark? I haven't been able to find
>>>> any documentation about this.
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Best regards,
>>>> Sampo N.
>>>>
>>>>
>>>>
>>>
>>
>