Hi,
Consider a tab-delimited text file with 10 columns, where each column contains a set of text. I would like to do a word count for each column. In Scala, I would do the following RDD transformation and action:
val data = sc.textFile("hdfs://namenode/data.txt")
for (i <- 0 until 10) {
  data.map(_.split("\t", -1)(i))
      .map((_, 1))
      .reduceByKey(_ + _)      // count per word, rather than a single global reduce
      .saveAsTextFile(s"$i")   // one output directory per column
}
Within the for loop, each job runs in parallel, but the columns themselves are processed sequentially, from 0 to 9.
Is there any way I can process multiple columns in parallel in Spark? I saw a posting about using Akka, but Spark itself already uses Akka internally. Any pointers would be appreciated.
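
For example, would submitting the per-column jobs from a Scala parallel collection be a reasonable approach? Below is a rough sketch of what I have in mind (the output paths are just placeholders, and I am not sure whether submitting jobs from multiple threads like this is safe):

// Cache the input so every column's job reuses it instead of re-reading from HDFS
val data = sc.textFile("hdfs://namenode/data.txt").cache()

// .par submits the ten word-count jobs from separate threads;
// if SparkContext is thread-safe, the scheduler could run them concurrently
(0 until 10).par.foreach { i =>
  data.map(_.split("\t", -1)(i))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(s"hdfs://namenode/wordcount/column_$i")
}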
Regards,
Laeeq