mapPartitions(...) seems like a good candidate, since it processes an entire partition while maintaining state across per-element calls.
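A minimal sketch of that idea: the mutable state lives inside the function passed to mapPartitions, so it persists across elements of one partition but is never shared across workers. The generator itself is plain Python, so it can be exercised without a Spark cluster; the commented Spark usage assumes an existing SparkContext named `sc`.

```python
# Sketch: partition-local mutable state via mapPartitions.
# `count` exists for the lifetime of one partition's iterator only.

def number_rows(iterator):
    """Yield (local_index, row) pairs; the counter is partition-local state."""
    count = 0
    for row in iterator:
        count += 1
        yield (count, row)

# With Spark (assuming a SparkContext `sc`), this would look like:
#   out_rdd = inp_rdd.mapPartitions(number_rows)
# Note: the counter restarts at 1 in every partition, so these indices
# are unique only within a partition, not globally.

# Exercising the generator directly, without Spark:
result = list(number_rows(["a", "b", "c"]))
```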
On Wed, Jan 27, 2016 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Initially I thought of using accumulators.
>
> Since the state change can be anything, how about storing state in an
> external NoSQL store such as HBase?
>
> On Wed, Jan 27, 2016 at 6:37 PM, Krishna <research...@gmail.com> wrote:
>
>> Thanks; what I'm looking for is a way to see changes to the state of some
>> variable during the map(...) phase.
>> I simplified the scenario in my example by making row_index() increment
>> "incr" by 1, but in reality the change to "incr" can be anything.
>>
>> On Wed, Jan 27, 2016 at 6:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Have you looked at this method?
>>>
>>>   * Zips this RDD with its element indices. The ordering is first based
>>>     on the partition index
>>>   ...
>>>   def zipWithIndex(): RDD[(T, Long)] = withScope {
>>>
>>> On Wed, Jan 27, 2016 at 6:03 PM, Krishna <research...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a scenario where I need to maintain state that is local to a
>>>> worker and can change during a map operation. What's the best way to
>>>> handle this?
>>>>
>>>>   incr = 0
>>>>   def row_index():
>>>>       global incr
>>>>       incr += 1
>>>>       return incr
>>>>
>>>>   out_rdd = inp_rdd.map(lambda x: row_index()).collect()
>>>>
>>>> "out_rdd" in this case only contains 1s, but I would like it to have
>>>> the index of each row in "inp_rdd".
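For the row-index use case specifically, the zipWithIndex approach Ted quoted already solves it: in PySpark it is simply inp_rdd.zipWithIndex(). Below is a pure-Python sketch of the two-pass scheme behind it, with a list of lists standing in for an RDD's partitions (that stand-in, and the helper name, are illustrative, not Spark API):

```python
# Sketch of how zipWithIndex can derive global indices in two passes:
# first count each partition, then give each partition a starting offset.

def zip_with_index(partitions):
    """partitions: list of lists, standing in for an RDD's partitions."""
    # Pass 1: per-partition element counts (Spark runs a small job for this).
    sizes = [len(p) for p in partitions]
    # Exclusive prefix sums give each partition's starting global index.
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]
    # Pass 2: pair each element with offset + local position,
    # matching Spark's (element, index) ordering.
    return [
        (x, offsets[i] + j)
        for i, part in enumerate(partitions)
        for j, x in enumerate(part)
    ]

pairs = zip_with_index([["a", "b"], ["c"], ["d", "e"]])
# -> [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)]
```

This is also why the global-counter version in the original post fails: the closure (including incr = 0) is serialized out to each task, so every worker increments its own private copy instead of one shared variable.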