Ideally you would pass the MongoClient object along with your data in the
mapper (Python should try to serialize your MongoClient, but explicit is
better). If the client is serializable then all should end well; if not,
then you are better off using mapPartitions, initializing the driver once
per partition, and loading that partition's data there. There was a similar
discussion on the list in the past.
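A minimal sketch of that mapPartitions-style approach, assuming pymongo is
installed on the workers; the database/collection names mirror the ones in
the question below, and the helper names here are hypothetical:

```python
def to_doc(line, index_map):
    # Pure conversion: one CSV line -> a dict keyed by index_map.
    values = line.strip().split(",")
    return {index_map[i]: v for i, v in enumerate(values)}

def save_partition(lines, index_map):
    # Import and connect *inside* the partition function, so the
    # (non-picklable) MongoClient is never shipped from the driver.
    from pymongo import MongoClient
    client = MongoClient()
    collec = client['spark_test_db']['programs']
    docs = [to_doc(l, index_map) for l in lines]
    if docs:
        collec.insert_many(docs)
    client.close()

# On the driver, e.g.:
# rdd.foreachPartition(lambda part: save_partition(part, indexMap))
```

This way each partition opens one connection and does one bulk insert,
instead of one connection (or one pickled client) per record.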
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Sat, May 17, 2014 at 8:58 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Where's your driver code (the code interacting with the RDDs)? Are you
> getting serialization errors?
>
> On Saturday, May 17, 2014, Samarth Mailinglist <mailinglistsama...@gmail.com>
> wrote:
>
> Hi all,
>>
>> I am trying to store the results of a reduce into mongo.
>> I want to share the variable "collection" in the mappers.
>>
>>
>> Here's what I have so far (I'm using pymongo)
>>
>> from pymongo import MongoClient
>>
>> db = MongoClient()['spark_test_db']
>> collec = db['programs']
>>
>> def mapper(val):
>>     asc = val.encode('ascii','ignore')
>>     json = convertToJSON(asc, indexMap)
>>     collec.insert(json)  # this is not working
>>
>> def convertToJSON(string, indexMap):
>>     values = string.strip().split(",")
>>     json = {}
>>     for i in range(len(values)):
>>         json[indexMap[i]] = values[i]
>>     return json
>>
>> How do I do this?
>>
>