>>> import itertools
>>> l = [1, 1, 1, 2, 2, 3, 4, 4, 5, 1]
>>> [(n, sum(1 for _ in it)) for n, it in itertools.groupby(l)]
[(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]

import itertools

def groupCount(l):
    # One (value, run_length) pair per consecutive run; groupby only
    # merges adjacent equal elements, so the original order is kept.
    return [(n, sum(1 for _ in it)) for n, it in itertools.groupby(l)]
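To get the two separate columns asked for below, the pairs can be unzipped; a minimal sketch using the groupCount above:

>>> ids, counts = zip(*groupCount(l))
>>> list(ids)
[1, 2, 3, 4, 5, 1]
>>> list(counts)
[3, 2, 1, 2, 1, 1]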

If you have an RDD, you can use RDD.mapPartitions(groupCount).collect(). One caveat: mapPartitions runs groupCount on each partition independently, so a run that crosses a partition boundary will show up as two separate counts.
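A minimal PySpark sketch of that call (assuming a local SparkContext; numSlices=1 is just to keep every run in one partition for this small input):

from pyspark import SparkContext

sc = SparkContext("local", "segmented-fold-count")
# A single partition keeps every run intact for this toy example.
rdd = sc.parallelize([1, 1, 1, 2, 2, 3, 4, 4, 5, 1], numSlices=1)
print(rdd.mapPartitions(groupCount).collect())
# [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]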

On Sun, Aug 17, 2014 at 10:34 PM, fil <f...@pobox.com> wrote:
> Can anyone assist with a scan of the following kind (Python preferred, but
> whatever..)? I'm looking for a kind of segmented fold count.
>
> Input: [1,1,1,2,2,3,4,4,5,1]
> Output: [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
> or preferably two output columns:
> id: [1,2,3,4,5,1]
> count: [3,2,1,2,1,1]
>
> I can use a groupby/count, except that I just want to scan, not
> re-sort. Ideally this would be as low-level as possible and run in a
> single pass. It also needs to retain the original sort order.
>
> Thoughts?

