Hey Tom, reduceByKey will reduce locally on each node (Spark applies map-side combiners), so there won't be any data movement except to combine the per-node totals at the end.
- Patrick

On Fri, Nov 8, 2013 at 1:35 PM, Tom Vacek <[email protected]> wrote:
> Your example requires each row to be exactly the same length, since zipped
> will truncate to the shorter of its two arguments.
>
> The second solution is elegant, but reduceByKey involves flying a bunch of
> data around to sort the keys. I suspect it would be a lot slower. But you
> could save yourself from adding up a bunch of zeros:
>
>     val sparseRows = spark.textFile("myfile.tsv").map(line =>
>       line.split("\t").zipWithIndex.filter(_._1.length > 0))
>     sparseRows.reduce(mergeAdd(_, _))
>
> You'll have to write a mergeAdd function. This might not be any faster, but
> it does allow variable-length rows.
>
> On Fri, Nov 8, 2013 at 3:23 PM, Patrick Wendell <[email protected]> wrote:
>> It would be a bit more straightforward to write it like this:
>>
>>     val columns = [same as before]
>>
>>     val counts = columns.flatMap(emit (col_id, 0 or 1) for each
>>       column).reduceByKey(_ + _)
>>
>> Basically, look at each row and emit several records using flatMap.
>> Each record has an ID for the column (maybe its index) and a flag for
>> whether it's present.
>>
>> Then you reduce by key to get the per-column count. Then you can
>> collect at the end.
>>
>> - Patrick
>>
>> On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <[email protected]> wrote:
>> > Hi Spark coders,
>> >
>> > I wrote my first little Spark job that takes columnar data and counts up
>> > how many times each column is populated in an RDD. Here is the code I
>> > came up with:
>> >
>> >     // RDD of List[String] corresponding to tab-delimited values
>> >     val columns = spark.textFile("myfile.tsv").map(line =>
>> >       line.split("\t").toList)
>> >     // RDD of List[Int] corresponding to populated columns
>> >     // (1 for populated and 0 for not populated)
>> >     val populatedColumns = columns.map(row => row.map(column =>
>> >       if (column.length > 0) 1 else 0))
>> >     // List[Int] containing the sum of the 1's in each column
>> >     val counts = populatedColumns.reduce((row1, row2) =>
>> >       (row1, row2).zipped.map(_ + _))
>> >
>> > Any thoughts about the fitness of this code snippet? I'm a little
>> > annoyed by creating an RDD full of 1's and 0's in the second line. The
>> > if statement feels awkward too. I was happy to find the zipped method
>> > for the reduce step. Any feedback you might have on how to improve this
>> > code is appreciated. I'm a newbie to both Scala and Spark.
>> >
>> > Thanks,
>> > Philip
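[Editor's note] Patrick's flatMap/reduceByKey suggestion above is pseudocode; one way it might be fleshed out is sketched below. To keep the sketch self-contained it runs on plain Scala collections with a local stand-in for `reduceByKey` (on a real RDD that operation is built in, and `lines` would come from `spark.textFile(...)`). Following Tom's point about not adding up zeros, it emits a pair only for populated columns. The names `ColumnCounts` and `countPopulated` are illustrative, not from the thread.

```scala
object ColumnCounts {
  // Local stand-in for RDD.reduceByKey: group pairs by key and reduce
  // the values for each key with f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

  def countPopulated(lines: Seq[String]): Map[Int, Int] = {
    // For each row, emit (columnIndex, 1) only for populated columns.
    // split("\t", -1) keeps trailing empty columns, so indices line up.
    val pairs = lines.flatMap { line =>
      line.split("\t", -1).zipWithIndex.collect {
        case (value, idx) if value.nonEmpty => (idx, 1)
      }
    }
    // Sum the 1's per column index.
    reduceByKey(pairs)(_ + _)
  }
}
```

Because rows only contribute pairs for the columns they actually populate, this also tolerates variable-length rows, which Philip's zipped version does not.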
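[Editor's note] Tom leaves `mergeAdd` as an exercise. A minimal sketch of one possible implementation is below, under the assumption that each sparse row is first turned into a `Map[Int, Int]` of columnIndex -> 1 (the string values aren't needed once empty columns are filtered out), so that reducing with `mergeAdd` yields per-column populated counts. It uses plain collections so it runs without a SparkContext; on an RDD the final step would be `sparseRows.reduce(mergeAdd(_, _))` as in Tom's snippet. `SparseCounts` and `countPopulated` are illustrative names.

```scala
object SparseCounts {
  // Merge two sparse count maps, summing counts for shared column indices.
  def mergeAdd(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
    b.foldLeft(a) { case (acc, (idx, n)) =>
      acc.updated(idx, acc.getOrElse(idx, 0) + n)
    }

  def countPopulated(lines: Seq[String]): Map[Int, Int] = {
    // Each row becomes a sparse map: columnIndex -> 1 for populated columns.
    val sparseRows = lines.map { line =>
      line.split("\t", -1).zipWithIndex
        .collect { case (v, idx) if v.nonEmpty => idx -> 1 }
        .toMap
    }
    sparseRows.reduce(mergeAdd)
  }
}
```

Since the maps carry explicit column indices, rows of different lengths merge correctly, unlike the positional zipped approach.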
