My first thought would be to create 10 RDDs and run your word count on each of them. I think the Spark scheduler will resolve the dependencies in parallel and launch 10 jobs.
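A minimal sketch of what I mean, assuming Spark 1.3+ and the file path from your mail (the output paths column-0 .. column-9 are just placeholders): submit each column's word-count job from its own thread, for example with Scala Futures, so the scheduler is free to run the 10 jobs concurrently instead of one after another:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object ParallelColumnWordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parallel-column-wordcount"))

        // Read once and cache, since all 10 jobs scan the same file.
        val data = sc.textFile("hdfs://namenode/data.txt").cache()

        // Submit one word-count job per column from its own thread;
        // Spark can then schedule the resulting jobs concurrently.
        val jobs = (0 until 10).map { i =>
          Future {
            data.map(_.split("\t", -1)(i))
                .map((_, 1))
                .reduceByKey(_ + _)
                .saveAsTextFile(s"column-$i")
          }
        }

        // Block until every column's job has finished.
        Await.result(Future.sequence(jobs), Duration.Inf)
        sc.stop()
      }
    }

Caching the input RDD keeps the 10 jobs from re-reading the file from HDFS each time; you can also enable the FAIR scheduler if you want the concurrent jobs to share the cluster more evenly.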
Best
Ayan

On 18 May 2015 23:41, "Laeeq Ahmed" <[email protected]> wrote:
> Hi,
>
> Consider I have a tab-delimited text file with 10 columns, where each
> column is a set of text. I would like to do a word count for each column.
> In Scala, I would do the following RDD transformation and action:
>
>     val data = sc.textFile("hdfs://namenode/data.txt")
>     for (i <- 0 until 10) {
>       data.map(_.split("\t", -1)(i))
>           .map((_, 1))
>           .reduceByKey(_ + _)
>           .saveAsTextFile(i.toString)
>     }
>
> Within the for loop each column's word count is itself a parallel job, but
> the columns are processed sequentially from 0 to 9.
>
> Is there any way to process multiple columns in parallel in Spark? I saw a
> posting about using Akka, but Spark itself already uses Akka. Any pointers
> would be appreciated.
>
> Regards,
> Laeeq
