I ended up with the following:

def firstValue(items: Iterable[String]) =
  for (i <- items) yield (i, items.head)

data.groupByKey().map { case (a, b) => (a, firstValue(b)) }.collect()
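
For completeness, flattening the grouped result back to one row per
input record (a sketch, assuming the sample data from the quoted
message below) would produce the three-column output shown there:

// Flatten each group to (key, value, first_value) rows. Note that
// groupByKey does not guarantee the order of values within a group,
// so "first" here means first in iteration order, not input order.
data.groupByKey()
    .flatMap { case (k, vs) => firstValue(vs).map { case (v, fv) => (k, v, fv) } }
    .collect()
// e.g. Array((A,A1,A1), (A,A2,A1), (A,A3,A1), (B,B1,B1), (B,B2,B1), (C,C1,C1))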

More details:
http://dmtolpeko.com/2015/02/17/first_value-last_value-lead-and-lag-in-spark/
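
Since the post also covers LEAD and LAG, the same groupByKey pattern
seems to extend to them. Here is a rough sketch for LAG(1) (my own
illustration, not taken from the post; lagValue is a hypothetical name):

// LAG(1): pair each value with the previous value in its group,
// using an empty string as the default for the first row. Groups
// from groupByKey are non-empty, so init is safe here.
def lagValue(items: Iterable[String]): Iterable[(String, String)] =
  items.zip(Iterable("") ++ items.init)

data.groupByKey().map { case (a, b) => (a, lagValue(b)) }.collect()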

I would appreciate any feedback.

Dmitry

On Fri, Feb 13, 2015 at 11:54 AM, Dmitry Tolpeko <dmtolp...@gmail.com>
wrote:

> Hello,
>
> To convert existing Map Reduce jobs to Spark, I need to implement window
> functions such as FIRST_VALUE, LEAD, LAG and so on. For example,
> FIRST_VALUE function:
>
> Source (1st column is key):
>
> A, A1
> A, A2
> A, A3
> B, B1
> B, B2
> C, C1
>
> and the result should be
>
> A, A1, A1
> A, A2, A1
> A, A3, A1
> B, B1, B1
> B, B2, B1
> C, C1, C1
>
> You can see that the first value in a group is repeated in each row.
>
> My current Spark/Scala code:
>
> def firstValue(b: Iterable[String]): List[(String, String)] = {
>   val c = scala.collection.mutable.MutableList[(String, String)]()
>   var f = ""
>   // Remember the first value seen, then pair every value with it
>   b.foreach(d => { if (f.isEmpty()) f = d; c += d -> f })
>   c.toList
> }
>
> val data = sc.parallelize(List(
>    ("A", "A1"),
>    ("A", "A2"),
>    ("A", "A3"),
>    ("B", "B1"),
>    ("B", "B2"),
>    ("C", "C1")))
>
> data.groupByKey().map { case (a, b) => (a, firstValue(b)) }.collect()
>
> So I create a new list after groupByKey. Is this the right approach
> in Spark? Are there any other options? Please point out any drawbacks.
>
> Thanks,
>
> Dmitry
>
>
