Hi,
Consider a tab-delimited text file with 10 columns, where each column contains a set of text. I would like to do a word count for each column. In Scala, I would do the following RDD transformation and action:
val data = sc.textFile("hdfs://namenode/data.txt")
for (i <- 0 until 10) {
  data.map(_.split("\t", -1)(i))
      .map((_, 1))
      .reduceByKey(_ + _)      // count per word, rather than a single global reduce
      .saveAsTextFile(s"$i")   // one output directory per column
}
Within the for loop, each job runs in parallel, but the columns themselves are processed sequentially, from 0 to 9.
Is there any way I can process multiple columns in parallel in Spark? I saw a posting about using Akka, but Spark itself already uses Akka internally. Any pointers would be appreciated.
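
For example, would submitting the per-column jobs from a Scala parallel collection be a reasonable approach? Below is a rough sketch of what I have in mind (the output paths are just placeholders, and I am not sure whether submitting jobs from multiple threads like this is safe):

// Cache the input so every column's job reuses it instead of re-reading from HDFS
val data = sc.textFile("hdfs://namenode/data.txt").cache()

// .par submits the ten word-count jobs from separate threads;
// if SparkContext is thread-safe, the scheduler could run them concurrently
(0 until 10).par.foreach { i =>
  data.map(_.split("\t", -1)(i))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(s"hdfs://namenode/wordcount/column_$i")
}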
Regards,
Laeeq