Thanks, yes. I was using Int for my V and didn't get the second param in the second closure right :)
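For anyone following along, here is that corrected call end to end, run against Marco's sample data from the bottom of the thread. This is only a rough spark-shell sketch: `input` and `uniqueCounts` are illustrative names, and `sc` is the shell's usual SparkContext.

scala> val input = sc.parallelize(Seq(
         (1,"alpha"),(1,"beta"),(1,"foo"),(1,"alpha"),
         (2,"alpha"),(2,"alpha"),(2,"bar"),(3,"foo")))

scala> val uniqueCounts = input.distinct().combineByKey(
         (s: String) => 1,                       // createCombiner: first distinct value seen for a key counts as 1
         (agg: Int, s: String) => agg + 1,       // mergeValue: each further distinct value for that key adds 1
         (agg1: Int, agg2: Int) => agg1 + agg2)  // mergeCombiners: sum partial counts across partitions

scala> uniqueCounts.collect()                    // Array((1,3), (2,2), (3,1)), ordering may vary
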
On Mon, Apr 13, 2015 at 1:55 PM, Dean Wampler <deanwamp...@gmail.com> wrote:

> That appears to work, with a few changes to get the types correct:
>
> input.distinct().combineByKey((s: String) => 1, (agg: Int, s: String) =>
> agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2)
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Mon, Apr 13, 2015 at 3:24 PM, Victor Tso-Guillen <v...@paxata.com> wrote:
>
>> How about this?
>>
>> input.distinct().combineByKey((v: V) => 1, (agg: Int, x: Int) => agg + 1,
>> (agg1: Int, agg2: Int) => agg1 + agg2).collect()
>>
>> On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
>>
>>> The problem with using collect is that it will fail for large data sets,
>>> as you'll attempt to copy the entire RDD into the memory of your driver
>>> program. The following works (Scala syntax, but similar to Python):
>>>
>>> scala> val i1 = input.distinct.groupByKey
>>> scala> i1.foreach(println)
>>> (1,CompactBuffer(beta, alpha, foo))
>>> (3,CompactBuffer(foo))
>>> (2,CompactBuffer(alpha, bar))
>>>
>>> scala> val i2 = i1.map(tup => (tup._1, tup._2.size))
>>> scala> i2.foreach(println)
>>> (1,3)
>>> (3,1)
>>> (2,2)
>>>
>>> The "i2" line passes a function that takes a tuple argument, then
>>> constructs a new output tuple from the first element and the size of the
>>> second (each CompactBuffer). An alternative pattern-match syntax would be:
>>>
>>> scala> val i2 = i1.map { case (key, buffer) => (key, buffer.size) }
>>>
>>> This should work as long as none of the CompactBuffers is too large,
>>> which could happen for extremely large data sets.
>>>
>>> dean
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>> Typesafe <http://typesafe.com>
>>> @deanwampler <http://twitter.com/deanwampler>
>>> http://polyglotprogramming.com
>>>
>>> On Mon, Apr 13, 2015 at 11:45 AM, Marco Shaw <marco.s...@gmail.com> wrote:
>>>
>>>> **Learning the ropes**
>>>>
>>>> I'm trying to grasp the concept of using the pipeline in PySpark...
>>>>
>>>> Simplified example:
>>>>
>>>> list=[(1,"alpha"),(1,"beta"),(1,"foo"),(1,"alpha"),(2,"alpha"),(2,"alpha"),(2,"bar"),(3,"foo")]
>>>>
>>>> Desired outcome:
>>>> [(1,3),(2,2),(3,1)]
>>>>
>>>> Basically, for each key I want the number of unique values.
>>>>
>>>> I've tried different approaches, but am I really using Spark
>>>> effectively? I wondered if I should do something like:
>>>>
>>>> input=sc.parallelize(list)
>>>> input.groupByKey().collect()
>>>>
>>>> Then I wondered if I could do something like a foreach over each key's
>>>> values, and then map the actual values and reduce them. Pseudo-code:
>>>>
>>>> input.groupbykey()
>>>>   .keys
>>>>   .foreach(_.values
>>>>     .map(lambda x: (x,1))
>>>>     .reducebykey(lambda a,b: a+b)
>>>>     .count()
>>>>   )
>>>>
>>>> I was somehow hoping that each key would get the current value of count,
>>>> and thus be the count of the unique values, which is exactly what I think
>>>> I'm looking for.
>>>>
>>>> Am I way off base on how I could accomplish this?
>>>>
>>>> Marco
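
P.S. The map-and-reduce idea in Marco's pseudo-code above also works once it is applied to the pair RDD directly rather than inside a foreach: map each distinct (key, value) pair to (key, 1) and reduce by key. This is a different route than the combineByKey one; a rough Scala sketch, reusing the same assumed sample `input` and spark-shell `sc` as in the earlier snippet (the PySpark version has the same shape):

scala> val counts = input.distinct()
         .map { case (k, _) => (k, 1) }  // one (key, 1) per distinct (key, value) pair
         .reduceByKey(_ + _)             // add up the 1s per key

scala> counts.collect()                  // Array((1,3), (2,2), (3,1)), ordering may vary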