Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.

Thanks
Best Regards

On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote:

 Thanks very much, Akhil.

 That solved my problem.

 Best,
 Rex



 On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Something like this?

 val huge_data = sc.textFile("/path/to/first.csv").map(x =>
   (x.split("\t")(1), x.split("\t")(0)))
 val gender_data = sc.textFile("/path/to/second.csv").map(x =>
   (x.split("\t")(0), x))

 val joined_data = huge_data.join(gender_data)

 joined_data.take(1000)


 It's Scala, btw; the Python API should be similar.

 Thanks
 Best Regards

 On Sat, Jun 13, 2015 at 12:16 AM, Rex X dnsr...@gmail.com wrote:

 To be concrete, say we have a folder with thousands of tab-delimited csv
 files in the following format (each csv file is about 10GB):

 id    name    address    city    ...
 1     Matt    add1       LA      ...
 2     Will    add2       LA      ...
 3     Lucy    add3       SF      ...
 ...

 And we have a lookup table keyed on the name column above:

 name    gender
 Matt    M
 Lucy    F
 ...

 Now we want to output the top 1000 rows of each csv file in the
 following format:

 id    name    gender
 1     Matt    M
 ...

 Can we use pyspark to efficiently handle this?







Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this?

val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x))

val joined_data = huge_data.join(gender_data)

joined_data.take(1000)


It's Scala, btw; the Python API should be similar.
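
A rough PySpark equivalent of the snippet above (an untested sketch; it assumes
the same tab-delimited layout, the same illustrative paths, and a SparkContext
sc as provided by the pyspark shell):

# (name, id) pairs from the big file
huge_data = sc.textFile("/path/to/first.csv") \
    .map(lambda x: (x.split("\t")[1], x.split("\t")[0]))

# (name, whole line) pairs from the lookup file
gender_data = sc.textFile("/path/to/second.csv") \
    .map(lambda x: (x.split("\t")[0], x))

joined_data = huge_data.join(gender_data)

joined_data.take(1000)

As with the Scala version, take(1000) simply pulls the first 1000 joined rows
back to the driver.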

Thanks
Best Regards

On Sat, Jun 13, 2015 at 12:16 AM, Rex X dnsr...@gmail.com wrote:

 To be concrete, say we have a folder with thousands of tab-delimited csv
 files in the following format (each csv file is about 10GB):

 id    name    address    city    ...
 1     Matt    add1       LA      ...
 2     Will    add2       LA      ...
 3     Lucy    add3       SF      ...
 ...

 And we have a lookup table keyed on the name column above:

 name    gender
 Matt    M
 Lucy    F
 ...

 Now we want to output the top 1000 rows of each csv file in the
 following format:

 id    name    gender
 1     Matt    M
 ...

 Can we use pyspark to efficiently handle this?





How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-12 Thread Rex X
To be concrete, say we have a folder with thousands of tab-delimited csv
files in the following format (each csv file is about 10GB):

id    name    address    city    ...
1     Matt    add1       LA      ...
2     Will    add2       LA      ...
3     Lucy    add3       SF      ...
...

And we have a lookup table keyed on the name column above:

name    gender
Matt    M
Lucy    F
...

Now we want to output the top 1000 rows of each csv file in the
following format:

id    name    gender
1     Matt    M
...

Can we use pyspark to efficiently handle this?