Messed up. Should be:

val sparseRows = spark.textFile("myfile.tsv").map(line =>
  line.split("\t").zipWithIndex.flatMap(tt =>
    if (tt._1.length > 0) Some((tt._2, 1)) else None))

Then reduce with a mergeAdd.
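Since mergeAdd isn't a library function, you'd have to roll your own. A minimal, untested sketch that merges two arrays of (columnIndex, count) pairs:

// Sketch of a mergeAdd: combine two arrays of (columnIndex, count)
// pairs, summing the counts where an index appears in both.
def mergeAdd(a: Array[(Int, Int)], b: Array[(Int, Int)]): Array[(Int, Int)] = {
  val sums = scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)
  for ((idx, n) <- a) sums(idx) += n
  for ((idx, n) <- b) sums(idx) += n
  sums.toArray.sortBy(_._1)
}

val counts = sparseRows.reduce(mergeAdd(_, _))

It's associative and commutative, which is what reduce needs.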
On Fri, Nov 8, 2013 at 3:35 PM, Tom Vacek <minnesota...@gmail.com> wrote:
> Your example requires each row to be exactly the same length, since zipped
> will truncate to the shorter of its two arguments.
>
> The second solution is elegant, but reduceByKey involves flying a bunch of
> data around to sort the keys. I suspect it would be a lot slower. But you
> could save yourself from adding up a bunch of zeros:
>
> val sparseRows = spark.textFile("myfile.tsv").map(line =>
>   line.split("\t").zipWithIndex.filter(_._1.length > 0))
> sparseRows.reduce(mergeAdd(_, _))
>
> You'll have to write a mergeAdd function. This might not be any faster,
> but it does allow variable length rows.
>
>
> On Fri, Nov 8, 2013 at 3:23 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>> It would be a bit more straightforward to write it like this:
>>
>> val columns = [same as before]
>>
>> val counts = columns.flatMap(emit (col_id, 0 or 1) for each
>> column).reduceByKey(_ + _)
>>
>> Basically look at each row and emit several records using flatMap.
>> Each record has an ID for the column (maybe its index) and a flag for
>> whether it's present.
>>
>> Then you reduce by key to get the per-column count. Then you can
>> collect at the end.
>>
>> - Patrick
>>
>> On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren <philip.og...@oracle.com>
>> wrote:
>> > Hi Spark coders,
>> >
>> > I wrote my first little Spark job that takes columnar data and counts
>> > up how many times each column is populated in an RDD. Here is the code
>> > I came up with:
>> >
>> > // RDD of List[String] corresponding to tab delimited values
>> > val columns = spark.textFile("myfile.tsv").map(line =>
>> >   line.split("\t").toList)
>> > // RDD of List[Int] corresponding to populated columns (1 for
>> > // populated and 0 for not populated)
>> > val populatedColumns = columns.map(row => row.map(column =>
>> >   if (column.length > 0) 1 else 0))
>> > // List[Int] contains sums of the 1's in each column
>> > val counts = populatedColumns.reduce((row1, row2) =>
>> >   (row1, row2).zipped.map(_ + _))
>> >
>> > Any thoughts about the fitness of this code snippet? I'm a little
>> > annoyed by creating an RDD full of 1's and 0's in the second line. The
>> > if statement feels awkward too. I was happy to find the zipped method
>> > for the reduce step. Any feedback you might have on how to improve
>> > this code is appreciated. I'm a newbie to both Scala and Spark.
>> >
>> > Thanks,
>> > Philip
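For completeness, a concrete (untested) take on Patrick's flatMap/reduceByKey sketch, assuming `spark` is the SparkContext as in the earlier snippets:

// Emit (columnIndex, 1) for each populated cell, then sum per column.
// Assumes the implicits from org.apache.spark.SparkContext._ are in
// scope so that reduceByKey is available on the pair RDD.
val counts = spark.textFile("myfile.tsv")
  .flatMap(line => line.split("\t").zipWithIndex.collect {
    case (cell, idx) if cell.length > 0 => (idx, 1)
  })
  .reduceByKey(_ + _)
  .collect()

Same caveat as the sparse versions above: a column that is never populated won't appear in the result at all, so you'd need to fill in missing indices with zero if you want them.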