mapPartitions(...) seems like a good candidate, since it processes an entire partition while maintaining state across per-element calls.
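A minimal sketch of that idea: the mutable state lives inside the function passed to mapPartitions, so it persists across elements of one partition but is never shared across workers. The generator itself is plain Python, so it can be exercised without a Spark cluster; the commented Spark usage assumes an existing SparkContext named `sc`.

```python
# Sketch: partition-local mutable state via mapPartitions.
# `count` exists for the lifetime of one partition's iterator only.

def number_rows(iterator):
    """Yield (local_index, row) pairs; the counter is partition-local state."""
    count = 0
    for row in iterator:
        count += 1
        yield (count, row)

# With Spark (assuming a SparkContext `sc`), this would look like:
#   out_rdd = inp_rdd.mapPartitions(number_rows)
# Note: the counter restarts at 1 in every partition, so these indices
# are unique only within a partition, not globally.

# Exercising the generator directly, without Spark:
result = list(number_rows(["a", "b", "c"]))
```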
On Wed, Jan 27, 2016 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Initially I thought of using accumulators.
>
> Since the state change can be anything, how about storing state in an
> external NoSQL store such as HBase?
>
> On Wed, Jan 27, 2016 at 6:37 PM, Krishna <research...@gmail.com> wrote:
>
>> Thanks; what I'm looking for is a way to see changes to the state of some
>> variable during the map(...) phase.
>> I simplified the scenario in my example by making row_index() increment
>> "incr" by 1, but in reality the change to "incr" can be anything.
>>
>> On Wed, Jan 27, 2016 at 6:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Have you looked at this method?
>>>
>>>   * Zips this RDD with its element indices. The ordering is first based
>>>     on the partition index
>>>   ...
>>>   def zipWithIndex(): RDD[(T, Long)] = withScope {
>>>
>>> On Wed, Jan 27, 2016 at 6:03 PM, Krishna <research...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a scenario where I need to maintain state that is local to a
>>>> worker and can change during a map operation. What's the best way to
>>>> handle this?
>>>>
>>>>   incr = 0
>>>>   def row_index():
>>>>       global incr
>>>>       incr += 1
>>>>       return incr
>>>>
>>>>   out_rdd = inp_rdd.map(lambda x: row_index()).collect()
>>>>
>>>> "out_rdd" in this case only contains 1s, but I would like it to have
>>>> the index of each row in "inp_rdd".
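For the row-index use case specifically, the zipWithIndex approach Ted quoted already solves it: in PySpark it is simply inp_rdd.zipWithIndex(). Below is a pure-Python sketch of the two-pass scheme behind it, with a list of lists standing in for an RDD's partitions (that stand-in, and the helper name, are illustrative, not Spark API):

```python
# Sketch of how zipWithIndex can derive global indices in two passes:
# first count each partition, then give each partition a starting offset.

def zip_with_index(partitions):
    """partitions: list of lists, standing in for an RDD's partitions."""
    # Pass 1: per-partition element counts (Spark runs a small job for this).
    sizes = [len(p) for p in partitions]
    # Exclusive prefix sums give each partition's starting global index.
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]
    # Pass 2: pair each element with offset + local position,
    # matching Spark's (element, index) ordering.
    return [
        (x, offsets[i] + j)
        for i, part in enumerate(partitions)
        for j, x in enumerate(part)
    ]

pairs = zip_with_index([["a", "b"], ["c"], ["d", "e"]])
# -> [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)]
```

This is also why the global-counter version in the original post fails: the closure (including incr = 0) is serialized out to each task, so every worker increments its own private copy instead of one shared variable.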