So Seq[V] contains only "new" tuples. I initially thought that whenever a new tuple was found, it would add it to Seq and call the update function immediately so there wouldn't be more than 1 update to Seq per function call.
Say I want to sum tuples with the same key is an RDD using updateStateByKey, Then (1) Seq[V] would contain the numbers for a particular key and my S state could be the sum? Or would (2) Seq contain partial sums (say sum per partition?) which I then need to sum into the final sum? After writing this out and thinking a little more about it I think #2 is correct. Can you confirm? Thanks again! -A -----Original Message----- From: Sean Owen [mailto:so...@cloudera.com] Sent: April-30-14 4:30 PM To: user@spark.apache.org Subject: Re: What is Seq[V] in updateStateByKey? S is the previous count, if any. Seq[V] are potentially many new counts. All of them have to be added together to keep an accurate total. It's as if the count were 3, and I tell you I've just observed 2, 5, and 1 additional occurrences -- the new count is 3 + (2+5+1) not 1 + 1. I butted in since I'd like to ask a different question about the same line of code. Why: val currentCount = values.foldLeft(0)(_ + _) instead of val currentCount = values.sum This happens a few places in the code. sum seems equivalent and likely quicker. Same with things like "filter(_ == 200).size" instead of "count(_ == 200)"... pretty trivial but hey. On Wed, Apr 30, 2014 at 9:23 PM, Adrian Mocanu <amoc...@verticalscope.com> wrote: > Hi TD, > > Why does the example keep recalculating the count via fold? > > Wouldn’t it make more sense to get the last count in values Seq and > add 1 to it and save that as current count? > > > > From what Sean explained I understand that all values in Seq have the > same key. Then when a new value for that key is found it is added to > this Seq collection and the update function is called. > > > > Is my understanding correct?