Thanks for the recommendations. I had been focused on solving the problem
"within Spark" but a distributed database sounds like a better solution.

Jeff

On Sat, Aug 29, 2015 at 11:47 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Not sure if the race condition you mentioned is related to Cassandra's
> data consistency model.
>
> If hbase is used as the external key value store, atomicity is guaranteed.
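>
> For example, a single-row read-modify-write can be made conditional with
> checkAndPut, so a concurrent update between your lookup and your write is
> detected. Rough sketch, not tested; table name, column family and values
> are placeholders:
>
> import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
> import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
> import org.apache.hadoop.hbase.util.Bytes
>
> val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
> val table = conn.getTable(TableName.valueOf("lookup"))
>
> val put = new Put(Bytes.toBytes("row1"))
> put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes("new"))
>
> // Only applied if the cell still holds the expected current value.
> val applied = table.checkAndPut(Bytes.toBytes("row1"), Bytes.toBytes("f"),
>   Bytes.toBytes("v"), Bytes.toBytes("old"), put)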
>
> Cheers
>
> On Sat, Aug 29, 2015 at 7:40 AM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> We are using Cassandra for a similar kind of problem and it works well...
>> You need to take care of the race condition between updating the store and
>> looking it up...
>> On Aug 29, 2015 1:31 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>
>>> +1 on Jason's suggestion.
>>>
>>> bq. this large variable is broadcast many times during the lifetime
>>>
>>> Please consider making this large variable more granular. That is, reduce
>>> the amount of data transferred between the key value store and your app
>>> during each update.
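>>>
>>> For instance (rough and untested, with Jedis only as an example client),
>>> you could push just the keys that changed in an iteration back to the
>>> store, instead of shipping the whole table each time:
>>>
>>> import org.apache.spark.rdd.RDD
>>> import redis.clients.jedis.Jedis
>>>
>>> // Write only this iteration's deltas, not the full lookup table.
>>> def pushDeltas(changed: RDD[(String, String)]): Unit =
>>>   changed.foreachPartition { iter =>
>>>     val jedis = new Jedis("redis-host", 6379)  // hostname is a placeholder
>>>     iter.foreach { case (k, v) => jedis.set(k, v) }
>>>     jedis.close()
>>>   }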
>>>
>>> Cheers
>>>
>>> On Fri, Aug 28, 2015 at 12:44 PM, Jason <ja...@jasonknight.us> wrote:
>>>
>>>> You could try using an external key value store (like HBase or Redis) and
>>>> perform lookups/updates inside your mappers (you'd need to create the
>>>> connection within a mapPartitions code block to avoid per-record connection
>>>> setup/teardown overhead)?
>>>>
>>>> I haven't done this myself though, so I'm just throwing the idea out
>>>> there.
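>>>>
>>>> Something like this, roughly (untested; Jedis is just an example Redis
>>>> client, and Entry/refine are placeholders for your own types and logic):
>>>>
>>>> import redis.clients.jedis.Jedis
>>>>
>>>> val refined = entries.mapPartitions { iter =>
>>>>   val jedis = new Jedis("redis-host", 6379)    // one connection per partition
>>>>   val out = iter.map { entry =>
>>>>     val looked = Option(jedis.get(entry.key))  // point lookup in the store
>>>>     refine(entry, looked)
>>>>   }.toList                                     // force evaluation before close
>>>>   jedis.close()
>>>>   out.iterator
>>>> }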
>>>>
>>>> On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff <j...@atware.co.jp>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am working on a Spark application that is using a large (~3G)
>>>>> broadcast variable as a lookup table. The application refines the data in
>>>>> this lookup table in an iterative manner. So this large variable is
>>>>> broadcast many times during the lifetime of the application process.
>>>>>
>>>>> From what I have observed, perhaps 60% of the execution time is spent
>>>>> waiting for the variable to broadcast in each iteration. My reading of a
>>>>> Spark performance article[1] suggests that the time spent broadcasting
>>>>> will increase with the number of nodes I add.
>>>>>
>>>>> My question for the group - what would you suggest as an alternative
>>>>> to broadcasting a large variable like this?
>>>>>
>>>>> One approach I have considered is segmenting my RDD and adding a copy
>>>>> of the lookup table for every X values to process. So, for example, if I
>>>>> have a list of 1 million entries to process (e.g., RDD[Entry]), I could
>>>>> split this into segments of 100K entries, pair each segment with a copy
>>>>> of the lookup table, and make that an RDD[(Lookup, Array[Entry])].
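>>>>>
>>>>> In code I picture it roughly like this (untested; Entry, Lookup, and the
>>>>> driver-side `lookup` value and `entries` RDD are placeholders for my real
>>>>> types and data):
>>>>>
>>>>> import org.apache.spark.rdd.RDD
>>>>>
>>>>> case class Entry(id: Long, value: Double)
>>>>> case class Lookup(table: Map[Long, Double])
>>>>>
>>>>> val segmentSize = 100000L
>>>>> val segmented: RDD[(Lookup, Array[Entry])] =
>>>>>   entries
>>>>>     .zipWithIndex()                                // (Entry, index)
>>>>>     .map { case (e, i) => (i / segmentSize, e) }   // bucket into 100K segments
>>>>>     .groupByKey()
>>>>>     .map { case (_, es) => (lookup, es.toArray) }  // table copy per segment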
>>>>>
>>>>> Another solution I am looking at is making the lookup table an RDD
>>>>> instead of a broadcast variable. Perhaps I could use an IndexedRDD[2] to
>>>>> improve performance. One issue with this approach is that I would have to
>>>>> rewrite my application code to use two RDDs so that I do not reference the
>>>>> lookup RDD from within the closure of another RDD.
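>>>>>
>>>>> With a plain join it might look something like this (untested; `lookupRdd`
>>>>> is the lookup table keyed by entry id, and `refine` is again a placeholder
>>>>> for the refinement step); IndexedRDD would presumably replace the shuffle
>>>>> join with point lookups:
>>>>>
>>>>> val keyed:  RDD[(Long, Entry)]           = entries.keyBy(_.id)
>>>>> val joined: RDD[(Long, (Entry, Double))] = keyed.join(lookupRdd)
>>>>> val refined: RDD[Entry] = joined.map { case (_, (e, v)) => refine(e, v) }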
>>>>>
>>>>> Any other recommendations?
>>>>>
>>>>> Jeff
>>>>>
>>>>>
>>>>> [1]
>>>>> http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf
>>>>>
>>>>> [2] https://github.com/amplab/spark-indexedrdd
>>>>>
>>>>
>>>
>
