Thank you, TD. This is important information for us. Will keep an eye on
that.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> Yes, this is a limitation of the current implementation. But this will
> improve a lot once we have IndexedRDD
> <https://github.com/apache/spark/pull/1297> in Spark, which allows
> faster updates to a single key-value pair (within each partition, without
> processing the entire partition).
>
> Soon.....
>
> TD
>
>
> On Thu, Jul 17, 2014 at 5:57 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
>> Hi TD,
>>
>> Thank you. Yes, it behaves as you described. Sorry for missing this
>> point.
>>
>> Then my only concern is on the performance side. Since Spark Streaming
>> operates on all the keys every time a new batch comes, I think it is fine
>> when the state size is small. When the state size becomes big, say, a few
>> GBs, if we still go through the whole key list, would the operation be a
>> little inefficient? Maybe I am missing something in Spark Streaming that
>> handles this situation.
>>
>> Cheers,
>>
>> Fang, Yan
>> yanfang...@gmail.com
>> +1 (206) 849-4108
>>
>>
>> On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> The updateFunction given in updateStateByKey should be called on ALL the
>>> keys that are in the state, even if there is no new data in the batch for
>>> some key. Is that not the behavior you see?
>>>
>>> What do you mean by "show all the existing states"? You have access to
>>> the latest state RDD by doing stateStream.foreachRDD(...). There you can do
>>> whatever operation on all the key-state pairs.
>>>
>>> TD
>>>
>>>
>>>
>>>
>>> On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang <yanfang...@gmail.com> wrote:
>>>
>>>> Hi TD,
>>>>
>>>> Thank you for the quick reply and for backing my approach. :)
>>>>
>>>> 1) The example is this:
>>>>
>>>> 1. In the first 2 second interval, after updateStateByKey, I get a few
>>>> keys and their states, say, ("a" -> 1, "b" -> 2, "c" -> 3)
>>>> 2. In the following 2 second interval, I only receive "c" and "d" and
>>>> their values. But I want to update/display the state of "a" and "b"
>>>> accordingly.
>>>> * It seems I have no way to "access" the "a" and "b" and get their
>>>> states.
>>>> * also, do I have a way to show all the existing states?
>>>>
>>>> I guess the approach to solve this will be similar to what you
>>>> mentioned for 2). But the difficulty is that, if I want to display all the
>>>> existing states, I need to bundle all the remaining keys into one key.
>>>>
>>>> Thank you.
>>>>
>>>> Cheers,
>>>>
>>>> Fang, Yan
>>>> yanfang...@gmail.com
>>>> +1 (206) 849-4108
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das <
>>>> tathagata.das1...@gmail.com> wrote:
>>>>
>>>>> For accessing the previous version, I would do it the same way. :)
>>>>>
>>>>> 1. Can you elaborate on what you mean by that with an example? What do
>>>>> you mean by "accessing" keys?
>>>>>
>>>>> 2. Yeah, that is hard to do without the ability to do point lookups into
>>>>> an RDD, which we don't support yet. You could try embedding the related
>>>>> key in the values of the keys that need it. That is, B will be present in
>>>>> the value of key A. Then put this transformed DStream through
>>>>> updateStateByKey.
>>>>>
>>>>> TD
>>>>>
>>>>
>>>>
>>>
>>
>
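
The behaviour discussed in this thread (the update function runs for every key in the state on every batch, including keys with no new data) can be modelled in a few lines. The sketch below is plain Python, not the Spark API: `update_state_by_key` and `running_sum` are hypothetical names, and the second-batch values for "c" and "d" are made up for illustration. It also shows why the cost of a batch grows with the size of the state rather than with the size of the batch.

```python
# Plain-Python model of the behaviour described in the thread.
# update_state_by_key is a hypothetical stand-in, NOT the Spark API.

def update_state_by_key(state, batch, update_func):
    """Return the new state after one batch of (key, value) records."""
    # Group this batch's values by key.
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state = {}
    # Visit every key already in the state, then keys that are new in this batch.
    keys = list(state) + [k for k in grouped if k not in state]
    for key in keys:
        result = update_func(grouped.get(key, []), state.get(key))
        if result is not None:  # in this model, returning None drops the key
            new_state[key] = result
    return new_state


calls = []  # records every (new_values, old_state) pair the update function sees


def running_sum(new_values, old_state):
    calls.append((list(new_values), old_state))
    return (old_state or 0) + sum(new_values)


# First interval: "a", "b" and "c" arrive, as in the example from the thread.
state = update_state_by_key({}, [("a", 1), ("b", 2), ("c", 3)], running_sum)

# Second interval: only "c" and "d" arrive (values are illustrative).
calls.clear()
state = update_state_by_key(state, [("c", 10), ("d", 4)], running_sum)

print(state)       # "a" and "b" are still present, with unchanged values
print(len(calls))  # one call per key in the state, not per key in the batch
```

In the second interval the update function is still invoked for "a" and "b", each time with an empty list of new values. That is the place to update or display their state, and it is also the full pass over the state that the performance question is about.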

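One possible reading of the "embed the related key in the values" suggestion is sketched below, again in plain Python rather than Spark code. The original question 2 is not part of this thread, so the setup is an assumption: key "a" is taken to need the latest value of key "b". The names `NEEDS`, `embed_related`, `group_by_key` and `update` are all hypothetical. Each record is re-keyed so that a tagged copy also lands under every key that depends on it, which lets a per-key update function see the related value without a point lookup.

```python
from collections import defaultdict

# Hypothetical dependency table: updating "a" also needs the latest value of "b".
NEEDS = {"a": ["b"]}


def embed_related(batch, needs):
    """Re-key records so that dependent keys receive a tagged copy of each value."""
    # Invert the table: source key -> keys that want to see its values.
    wanted_by = defaultdict(list)
    for dependent, sources in needs.items():
        for source in sources:
            wanted_by[source].append(dependent)

    out = []
    for key, value in batch:
        out.append((key, (None, value)))          # own record, tagged with None
        for dependent in wanted_by.get(key, []):
            out.append((dependent, (key, value)))  # copy tagged with its source key
    return out


def group_by_key(records):
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return dict(grouped)


def update(tagged_values, old_state):
    """Keep a running total of own values and the last value of each related key."""
    total = old_state["total"] if old_state else 0
    related = dict(old_state["related"]) if old_state else {}
    for source, value in tagged_values:
        if source is None:
            total += value
        else:
            related[source] = value
    return {"total": total, "related": related}


transformed = embed_related([("a", 1), ("b", 5)], NEEDS)
new_state = {key: update(values, None)
             for key, values in group_by_key(transformed).items()}
print(new_state)
```

The trade-off is data duplication: every value of "b" is copied once per key that depends on it, so this only stays cheap when the dependency table is small.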