Your example requires each row to be exactly the same length, since zipped
will truncate to the shorter of its two arguments.
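
A quick illustration of the truncation:

    (List(1, 0, 1), List(1, 1)).zipped.map(_ + _)
    // List(2, 1) -- the extra element of the longer row is silently dropped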

The second solution is elegant, but reduceByKey involves shuffling a bunch of
data around to group records by key.  I suspect it would be a lot slower.
But you could save yourself from adding up a bunch of zeros:

    val sparseRows = spark.textFile("myfile.tsv").map(line =>
      line.split("\t").zipWithIndex.filter(_._1.length > 0))
    sparseRows.reduce(mergeAdd(_, _))

You'll have to write a mergeAdd function.  This might not be any faster,
but it does allow variable-length rows.
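
If it helps, here is an untested sketch of what mergeAdd might look like.  I'm
assuming each row first gets mapped to a sparse Map of column index -> 1
(rather than keeping the raw cell values), so that reduce always works with a
single type:

    // one Map per row: column index -> 1 for each populated cell
    val sparseCounts = spark.textFile("myfile.tsv").map(line =>
      line.split("\t").zipWithIndex.filter(_._1.length > 0)
        .map { case (_, idx) => (idx, 1) }.toMap)

    // add two sparse count maps together, key by key
    def mergeAdd(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
      (a.keySet ++ b.keySet).map(k =>
        k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap

    val counts = sparseCounts.reduce(mergeAdd)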


On Fri, Nov 8, 2013 at 3:23 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> It would be a bit more straightforward to write it like this:
>
> val columns = [same as before]
>
> val counts = columns.flatMap(emit (col_id, 0 or 1) for each
> column).reduceByKey(_ + _)
>
> Basically look at each row and emit several records using flatMap.
> Each record has an ID for the column (maybe its index) and a flag for
> whether it's present.
>
> Then you reduce by key to get the per-column count. Then you can
> collect at the end.
>
> - Patrick
>
> On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <philip.og...@oracle.com>
> wrote:
> > Hi Spark coders,
> >
> > I wrote my first little Spark job that takes columnar data and counts up
> > how many times each column is populated in an RDD.  Here is the code I
> > came up with:
> >
> >     //RDD of List[String] corresponding to tab delimited values
> >     val columns = spark.textFile("myfile.tsv").map(line =>
> >       line.split("\t").toList)
> >     //RDD of List[Int] corresponding to populated columns (1 for
> >     //populated and 0 for not populated)
> >     val populatedColumns = columns.map(row => row.map(column =>
> >       if (column.length > 0) 1 else 0))
> >     //List[Int] contains sums of the 1's in each column
> >     val counts = populatedColumns.reduce((row1, row2) =>
> >       (row1, row2).zipped.map(_ + _))
> >
> > Any thoughts about the fitness of this code snippet?  I'm a little
> > annoyed by creating an RDD full of 1's and 0's in the second line.  The
> > if statement feels awkward too.  I was happy to find the zipped method
> > for the reduce step.  Any feedback you might have on how to improve this
> > code is appreciated.  I'm a newbie to both Scala and Spark.
> >
> > Thanks,
> > Philip
> >
>
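
For concreteness, an untested sketch of what Patrick's flatMap/reduceByKey
suggestion might look like, reusing the columns RDD from the original snippet:

    val counts = columns.flatMap(row =>
      row.zipWithIndex.map { case (col, idx) =>
        // (columnIndex, 1) if the cell is populated, (columnIndex, 0) otherwise
        (idx, if (col.length > 0) 1 else 0)
      }).reduceByKey(_ + _).collect()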
