Can you try using sc.newAPIHadoop**? There are two kinds of classes because the Hadoop API for input and output formats underwent a significant change a few years ago.
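For reference, here is a minimal sketch of what the new-API call could look like with the mapreduce-based com.mongodb.hadoop.MongoInputFormat (the mongo.input.uri value and collection name below are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;

public class MongoReadSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

        // mongo.input.uri points the connector at a collection;
        // the URI here is illustrative only.
        Configuration config = new Configuration();
        config.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection");

        // newAPIHadoopRDD expects an org.apache.hadoop.mapreduce.InputFormat,
        // which is the API that com.mongodb.hadoop.MongoInputFormat implements.
        JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
                config, MongoInputFormat.class, Object.class, BSONObject.class);

        System.out.println("Document count: " + documents.count());
    }
}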
TD

On Tue, Feb 4, 2014 at 5:58 AM, Sampo Niskanen <[email protected]> wrote:

> Hi,
>
> Thanks for the pointer. However, I'm still unable to generate the RDD
> using MongoInputFormat. I'm trying to add the mongo-hadoop connector to
> the Java SimpleApp in the quickstart at
> http://spark.incubator.apache.org/docs/latest/quick-start.html
>
> The mongo-hadoop connector contains two versions of MongoInputFormat, one
> extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>, the
> other extending org.apache.hadoop.mapred.InputFormat<Object, BSONObject>.
> Neither of them is accepted by the compiler, and I'm unsure why:
>
>     JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
>     sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class,
>         Object.class, BSONObject.class);
>     sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class,
>         Object.class, BSONObject.class);
>
> Eclipse gives the following error for both of the latter two lines:
>
>     Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>,
>     Class<K>, Class<V>) of type JavaSparkContext is not applicable for the
>     arguments (JobConf, Class<MongoInputFormat>, Class<Object>,
>     Class<BSONObject>). The inferred type MongoInputFormat is not a valid
>     substitute for the bounded parameter <F extends InputFormat<K,V>>
>
> I'm using Spark 0.9.0. Might this be caused by a conflict of Hadoop
> versions? I downloaded the mongo-hadoop connector for Hadoop 2.2. I
> haven't figured out how to select which Hadoop version Spark uses when it
> is required from an sbt file. (The sbt file is the one described in the
> quickstart.)
>
> Thanks for any help.
>
> Best regards,
> Sampo N.
>
>
> On Fri, Jan 31, 2014 at 5:34 AM, Tathagata Das <[email protected]> wrote:
>
>> I walked through the example in the second link you gave. The Treasury
>> Yield example referred to there is here:
>> https://github.com/mongodb/mongo-hadoop/blob/master/examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfigV2.java
>> Note the InputFormat and OutputFormat used in the job configuration. These
>> specify how to read data from and write data to MongoDB. You should be
>> able to use the same InputFormat and OutputFormat classes in Spark as
>> well. To save files to MongoDB, use yourRDD.saveAsHadoopFile(... specify
>> the output format class ...), and to read from MongoDB, use
>> sparkContext.hadoopFile(... specify the input format class ...).
>>
>> TD
>>
>>
>> On Thu, Jan 30, 2014 at 12:36 PM, Sampo Niskanen <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We're starting to build an analytics framework for our wellness service.
>>> While our data is not yet Big, we'd like to use a framework that will
>>> scale as needed, and Spark seems to be the best around.
>>>
>>> I'm new to Hadoop and Spark, and I'm having difficulty figuring out how
>>> to use Spark in connection with MongoDB. Apparently I should be able to
>>> use the mongo-hadoop connector (https://github.com/mongodb/mongo-hadoop)
>>> with Spark as well, but I haven't figured out how.
>>>
>>> I've run through the Spark tutorials and have been able to set up a
>>> single-machine Hadoop system with the MongoDB connector as instructed at
>>> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>>> and
>>> http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
>>>
>>> Could someone give some instructions or pointers on how to configure and
>>> use the mongo-hadoop connector with Spark? I haven't been able to find
>>> any documentation about this.
>>>
>>> Thanks.
>>>
>>> Best regards,
>>> Sampo N.
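For the write path discussed above, the new-API counterpart of saveAsHadoopFile is saveAsNewAPIHadoopFile. A rough sketch, assuming com.mongodb.hadoop.MongoOutputFormat and the mongo.output.uri setting; the URI, key, and document are illustrative, and the path argument is only there to satisfy the API since the connector writes to the configured URI:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;

import com.mongodb.hadoop.MongoOutputFormat;

import scala.Tuple2;

public class MongoWriteSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

        // mongo.output.uri names the target collection; illustrative URI.
        Configuration config = new Configuration();
        config.set("mongo.output.uri", "mongodb://localhost:27017/mydb.output");

        // A tiny key/value RDD; the BSON value is the document to store.
        BSONObject doc = new BasicBSONObject("greeting", "hello");
        JavaPairRDD<Object, BSONObject> out = sc.parallelizePairs(
                Arrays.asList(new Tuple2<Object, BSONObject>("doc-1", doc)));

        // The output path is required by the save API, but the data itself
        // should end up in the collection configured above.
        out.saveAsNewAPIHadoopFile("file:///tmp/unused", Object.class,
                BSONObject.class, MongoOutputFormat.class, config);
    }
}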
