Be careful with mapPartitions, though: since it runs on the worker nodes, you
may not see its side effects locally on the driver.
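For example, a rough sketch of per-partition state with mapPartitions (names
follow the original post below; this is illustrative, not tested):

    def index_partition(rows):
        incr = 0                 # lives on the worker, local to this partition
        for row in rows:
            incr += 1            # any per-partition state change goes here
            yield (incr, row)

    out_rdd = inp_rdd.mapPartitions(index_partition)

Note that the counter restarts in every partition, so it is only unique within
a partition, and any driver-side copy of the variable is never touched.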
Is it not possible to represent your state changes as part of your RDD's
transformations, i.e., return a tuple containing the modified data and some
accumulated state? If that really doesn't work, I would second accumulators.
Check out
http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka,
which also tells you how to define your own for custom data types.

On Wed, Jan 27, 2016 at 7:22 PM, Krishna <research...@gmail.com> wrote:
> mapPartitions(...) seems like a good candidate, since it processes a whole
> partition while maintaining state across map(...) calls.
>
> On Wed, Jan 27, 2016 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> Initially I thought of using accumulators.
>>
>> Since the state change can be anything, how about storing the state in an
>> external NoSQL store such as HBase?
>>
>> On Wed, Jan 27, 2016 at 6:37 PM, Krishna <research...@gmail.com> wrote:
>>>
>>> Thanks; what I'm looking for is a way to see changes to the state of
>>> some variable during the map(...) phase.
>>> I simplified the scenario in my example by making row_index() increment
>>> "incr" by 1, but in reality the change to "incr" can be anything.
>>>
>>> On Wed, Jan 27, 2016 at 6:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>> Have you looked at this method?
>>>>
>>>> * Zips this RDD with its element indices. The ordering is first based
>>>> on the partition index
>>>> ...
>>>> def zipWithIndex(): RDD[(T, Long)] = withScope {
>>>>
>>>> On Wed, Jan 27, 2016 at 6:03 PM, Krishna <research...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a scenario where I need to maintain state that is local to a
>>>>> worker and can change during a map operation. What's the best way to
>>>>> handle this?
>>>>>
>>>>> incr = 0
>>>>> def row_index():
>>>>>     global incr
>>>>>     incr += 1
>>>>>     return incr
>>>>>
>>>>> out_rdd = inp_rdd.map(lambda x: row_index()).collect()
>>>>>
>>>>> "out_rdd" in this case contains only 1s, but I would like it to hold
>>>>> the index of each row in "inp_rdd".
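To sketch the custom-accumulator approach the programming guide describes (the
class and all names here are illustrative, not from the guide; untested):

    from pyspark import SparkContext, AccumulatorParam

    class ListParam(AccumulatorParam):
        # accumulates arbitrary per-row state changes into one list
        def zero(self, initial):
            return []

        def addInPlace(self, a, b):
            a.extend(b)
            return a

    sc = SparkContext(appName="accum-sketch")
    state = sc.accumulator([], ListParam())
    inp_rdd = sc.parallelize(["a", "b", "c"])

    def process(row):
        state.add([("visited", row)])   # any state change, not just a counter
        return row

    out = inp_rdd.map(process).collect()
    print(state.value)   # readable only on the driver, after the action runs

One caveat: updates made inside a transformation like map() can be applied
more than once if a task is retried, so counts gathered this way are not
guaranteed to be exact.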