How about parallelising the range in the for loop? The driver would then 
kick off the word counts independently of one another.
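
A minimal sketch of that idea, assuming the per-column jobs are submitted from separate driver threads. It uses Scala Futures rather than a parallel range, because `(0 until 10).par` is only built in on older Scala versions. `runPerColumn` and `job` are hypothetical names, not part of Spark; `job` stands for the body of the loop.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical helper: calls `job` once per column index, each call as a
// Future on the global thread pool, and blocks until all have finished.
// Results come back in the same order as the column indices.
def runPerColumn[A](columns: Range)(job: Int => A): Seq[A] = {
  val futures = columns.map(i => Future(job(i)))
  Await.result(Future.sequence(futures), Duration.Inf)
}
```

In spark-shell the loop body would be passed as `job`, for example runPerColumn(0 until 10) { i => ... } with the map, reduce and save steps for column i inside the braces. Spark accepts jobs submitted from several threads of one driver, but with the default FIFO scheduler a later job only gets the resources that earlier jobs leave free, and the global thread pool is sized to the number of driver cores, so not all ten jobs necessarily run at once.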

Regards,
Guy Needham | Data Discovery
Virgin Media   | Technology and Transformation | Data

From: ayan guha
Sent: 18 May 2015 15:46
To: Laeeq Ahmed
Subject: Re: Processing multiple columns in parallel


My first thought would be to create 10 RDDs and run your word count on each of 
them. I think the Spark scheduler will resolve the dependencies in parallel and 
launch 10 jobs.

Best
Ayan
On 18 May 2015 23:41, "Laeeq Ahmed" wrote:
Hi,

Consider a tab-delimited text file with 10 columns, each containing a set of 
text. I would like to do a word count for each column. In Scala, I would do 
the following RDD transformation and action:

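val data = sc.textFile("hdfs://namenode/data.txt")
for (i <- 0 until 10) {  // indices 0 to 9, one per column
   data.map(_.split("\t", -1)(i))  // value of column i on each line
     .map((_, 1))
     .reduceByKey(_ + _)           // sum the counts per distinct value
     .saveAsTextFile(i.toString)   // one output directory per column
}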

Each iteration of the for loop runs as a parallel Spark job, but the columns 
themselves are processed sequentially, from 0 to 9.

Is there any way to process multiple columns in parallel in Spark? I saw a 
post about using Akka, but RDDs already use Akka internally. Any pointers 
would be appreciated.


Regards,
Laeeq
