Thank you, TD. This is important information for us. Will keep an eye on that.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Yes, this is a limitation of the current implementation. But this will
> be improved a lot when we have IndexedRDD
> <https://github.com/apache/spark/pull/1297> in Spark, which allows
> faster single value updates to a key-value (within each partition, without
> processing the entire partition).
>
> Soon.....
>
> TD
>
>
> On Thu, Jul 17, 2014 at 5:57 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
>> Hi TD,
>>
>> Thank you. Yes, it behaves as you described. Sorry for missing this
>> point.
>>
>> Then my only concern is on the performance side. Since Spark Streaming
>> operates on all the keys every time a new batch comes in, I think it is
>> fine when the state size is small. When the state size becomes big, say,
>> a few GBs, if we still go through the whole key list, would the operation
>> be a little inefficient then? Maybe I am missing something in Spark
>> Streaming that handles this situation.
>>
>> Cheers,
>>
>> Fang, Yan
>> yanfang...@gmail.com
>> +1 (206) 849-4108
>>
>>
>> On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> The updateFunction given in updateStateByKey should be called on ALL the
>>> keys that are in the state, even if there is no new data in the batch
>>> for some key. Is that not the behavior you see?
>>>
>>> What do you mean by "show all the existing states"? You have access to
>>> the latest state RDD by doing stateStream.foreachRDD(...). There you can
>>> do whatever operation you want on all the key-state pairs.
>>>
>>> TD
>>>
>>>
>>> On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang <yanfang...@gmail.com> wrote:
>>>
>>>> Hi TD,
>>>>
>>>> Thank you for the quick reply and for backing my approach. :)
>>>>
>>>> 1) The example is this:
>>>>
>>>> 1. In the first 2 second interval, after updateStateByKey, I get a few
>>>> keys and their states, say, ("a" -> 1, "b" -> 2, "c" -> 3)
>>>> 2. In the following 2 second interval, I only receive "c" and "d" and
>>>> their values. But I want to update/display the state of "a" and "b"
>>>> accordingly.
>>>> * It seems I have no way to "access" "a" and "b" and get their
>>>> states.
>>>> * Also, do I have a way to show all the existing states?
>>>>
>>>> I guess the approach to solve this will be similar to what you
>>>> mentioned for 2). But the difficulty is that, if I want to display all
>>>> the existing states, I need to bundle all the remaining keys into one
>>>> key.
>>>>
>>>> Thank you.
>>>>
>>>> Cheers,
>>>>
>>>> Fang, Yan
>>>> yanfang...@gmail.com
>>>> +1 (206) 849-4108
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>>>
>>>>> For accessing the previous version, I would do it the same way. :)
>>>>>
>>>>> 1. Can you elaborate on what you mean by that with an example? What do
>>>>> you mean by "accessing" keys?
>>>>>
>>>>> 2. Yeah, that is hard to do without the ability to do point lookups
>>>>> into an RDD, which we don't support yet. You could try embedding the
>>>>> related key in the values of the keys that need it. That is, B will be
>>>>> present in the value of key A. Then put this transformed DStream
>>>>> through updateStateByKey.
>>>>>
>>>>> TD
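
To illustrate the behavior described in the thread above (the update function runs for every key already in the state, even when a batch carries no new data for that key), here is a minimal plain-Python model using the "a", "b", "c", "d" example. It is only a sketch and does not use Spark: `update_state_by_key` and `running_sum` are hypothetical names made up for this illustration, the running-sum update and the input values for the second batch are assumptions, and the real updateStateByKey operates on DStreams rather than on dictionaries.

```python
from collections import defaultdict


def update_state_by_key(state, batch, update_func):
    """Plain-Python model of one batch step (hypothetical, not the Spark API)."""
    # Group the new values of this batch by key.
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)

    new_state = {}
    # Visit every key that is already in the state OR appears in this batch.
    # This full pass over the state is the cost discussed in the thread.
    for key in sorted(set(state) | set(grouped)):
        result = update_func(grouped.get(key, []), state.get(key))
        if result is not None:  # in this model, returning None drops the key
            new_state[key] = result
    return new_state


def running_sum(new_values, old_state):
    # Assumed update logic: add the new values to the previous total.
    return sum(new_values) + (old_state or 0)


state = {}
# First 2-second interval: "a", "b" and "c" arrive.
state = update_state_by_key(state, [("a", 1), ("b", 2), ("c", 3)], running_sum)
# Second interval: only "c" and "d" arrive, yet "a" and "b" are still visited
# (with an empty list of new values) and stay in the state.
state = update_state_by_key(state, [("c", 10), ("d", 4)], running_sum)
print(state)
```

Because every key is visited on every batch, the work per batch in this model grows with the total number of keys in the state rather than with the number of keys in the batch, which is the performance concern raised in the 5:57 PM message.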