Messed up. Should be:

val sparseRows = spark.textFile("myfile.tsv").map(line =>
  line.split("\t").zipWithIndex.flatMap(tt =>
    if (tt._1.length > 0) Some((tt._2, 1)) else None))

Then reduce with a mergeAdd.
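Since mergeAdd isn't a library function, you'd have to roll your own. A minimal, untested sketch that merges two arrays of (columnIndex, count) pairs:

// Sketch of a mergeAdd: combine two arrays of (columnIndex, count)
// pairs, summing the counts where an index appears in both.
def mergeAdd(a: Array[(Int, Int)], b: Array[(Int, Int)]): Array[(Int, Int)] = {
  val sums = scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)
  for ((idx, n) <- a) sums(idx) += n
  for ((idx, n) <- b) sums(idx) += n
  sums.toArray.sortBy(_._1)
}

val counts = sparseRows.reduce(mergeAdd(_, _))

It's associative and commutative, which is what reduce needs.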
On Fri, Nov 8, 2013 at 3:35 PM, Tom Vacek <minnesota...@gmail.com> wrote:
> Your example requires each row to be exactly the same length, since zipped
> will truncate to the shorter of its two arguments.
>
> The second solution is elegant, but reduceByKey involves flying a bunch of
> data around to sort the keys. I suspect it would be a lot slower. But you
> could save yourself from adding up a bunch of zeros:
>
> val sparseRows = spark.textFile("myfile.tsv").map(line =>
>   line.split("\t").zipWithIndex.filter(_._1.length > 0))
> sparseRows.reduce(mergeAdd(_, _))
>
> You'll have to write a mergeAdd function. This might not be any faster,
> but it does allow variable length rows.
>
>
> On Fri, Nov 8, 2013 at 3:23 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>> It would be a bit more straightforward to write it like this:
>>
>> val columns = [same as before]
>>
>> val counts = columns.flatMap(emit (col_id, 0 or 1) for each
>> column).reduceByKey(_ + _)
>>
>> Basically look at each row and emit several records using flatMap.
>> Each record has an ID for the column (maybe its index) and a flag for
>> whether it's present.
>>
>> Then you reduce by key to get the per-column count. Then you can
>> collect at the end.
>>
>> - Patrick
>>
>> On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <philip.og...@oracle.com>
>> wrote:
>> > Hi Spark coders,
>> >
>> > I wrote my first little Spark job that takes columnar data and counts
>> > up how many times each column is populated in an RDD. Here is the code
>> > I came up with:
>> >
>> > // RDD of List[String] corresponding to tab delimited values
>> > val columns = spark.textFile("myfile.tsv").map(line =>
>> >   line.split("\t").toList)
>> > // RDD of List[Int] corresponding to populated columns (1 for
>> > // populated and 0 for not populated)
>> > val populatedColumns = columns.map(row => row.map(column =>
>> >   if (column.length > 0) 1 else 0))
>> > // List[Int] contains sums of the 1's in each column
>> > val counts = populatedColumns.reduce((row1, row2) =>
>> >   (row1, row2).zipped.map(_ + _))
>> >
>> > Any thoughts about the fitness of this code snippet? I'm a little
>> > annoyed by creating an RDD full of 1's and 0's in the second line. The
>> > if statement feels awkward too. I was happy to find the zipped method
>> > for the reduce step. Any feedback you might have on how to improve
>> > this code is appreciated. I'm a newbie to both Scala and Spark.
>> >
>> > Thanks,
>> > Philip
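For completeness, a concrete (untested) take on Patrick's flatMap/reduceByKey sketch, assuming `spark` is the SparkContext as in the earlier snippets:

// Emit (columnIndex, 1) for each populated cell, then sum per column.
// Assumes the implicits from org.apache.spark.SparkContext._ are in
// scope so that reduceByKey is available on the pair RDD.
val counts = spark.textFile("myfile.tsv")
  .flatMap(line => line.split("\t").zipWithIndex.collect {
    case (cell, idx) if cell.length > 0 => (idx, 1)
  })
  .reduceByKey(_ + _)
  .collect()

Same caveat as the sparse versions above: a column that is never populated won't appear in the result at all, so you'd need to fill in missing indices with zero if you want them.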