My first thought would be to create 10 RDDs and run your word count on each of them. I think the Spark scheduler will resolve the dependencies in parallel and launch 10 jobs.
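A minimal sketch of what I mean, assuming Spark 1.3+ and the file path from your mail (the output paths column-0 .. column-9 are just placeholders): submit each column's word-count job from its own thread, for example with Scala Futures, so the scheduler is free to run the 10 jobs concurrently instead of one after another:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object ParallelColumnWordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parallel-column-wordcount"))

        // Read once and cache, since all 10 jobs scan the same file.
        val data = sc.textFile("hdfs://namenode/data.txt").cache()

        // Submit one word-count job per column from its own thread;
        // Spark can then schedule the resulting jobs concurrently.
        val jobs = (0 until 10).map { i =>
          Future {
            data.map(_.split("\t", -1)(i))
                .map((_, 1))
                .reduceByKey(_ + _)
                .saveAsTextFile(s"column-$i")
          }
        }

        // Block until every column's job has finished.
        Await.result(Future.sequence(jobs), Duration.Inf)
        sc.stop()
      }
    }

Caching the input RDD keeps the 10 jobs from re-reading the file from HDFS each time; you can also enable the FAIR scheduler if you want the concurrent jobs to share the cluster more evenly.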
Best
Ayan

On 18 May 2015 23:41, "Laeeq Ahmed" <[email protected]> wrote:
> Hi,
>
> Consider I have a tab-delimited text file with 10 columns, where each
> column is a set of text. I would like to do a word count for each column.
> In Scala, I would do the following RDD transformation and action:
>
>     val data = sc.textFile("hdfs://namenode/data.txt")
>     for (i <- 0 until 10) {
>       data.map(_.split("\t", -1)(i))
>           .map((_, 1))
>           .reduceByKey(_ + _)
>           .saveAsTextFile(i.toString)
>     }
>
> Within the for loop each column's word count is itself a parallel job, but
> the columns are processed sequentially from 0 to 9.
>
> Is there any way to process multiple columns in parallel in Spark? I saw a
> posting about using Akka, but Spark itself already uses Akka. Any pointers
> would be appreciated.
>
> Regards,
> Laeeq
