Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning.

Thanks
Best Regards

On Mon, Jun 15, 2015 at 10:28 PM, Rex X wrote:
> Thanks very much, Akhil.
>
> That solved my problem.
>
> Best,
> Rex
>
> On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das wrote:
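[Editor's sketch, not from the thread: the linked tuning guide's first recommendation is switching to Kryo serialization and registering the classes that get shuffled. A minimal illustration, assuming a standalone driver; the app name and the Record class are placeholders.]

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative placeholder for a record type that will be shuffled.
case class Record(id: Int, name: String)

// Use Kryo instead of Java serialization and register the classes
// involved, as the tuning guide suggests.
val conf = new SparkConf()
  .setAppName("csv-filter")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Record]))
val sc = new SparkContext(conf)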

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this?

val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)

It's Scala btw; Python has the same API.
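[Editor's sketch, not from the thread: to connect this snippet to the "filter N columns, top M rows" part of the question, a hedged continuation. It assumes joined_data has the RDD[(String, (String, String))] shape produced by the join above; M, the columns kept, and the ordering key are all assumptions.]

// Flatten the joined pairs to a few columns, then keep the M rows
// with the largest join key. M = 1000 and key ordering are assumptions.
val M = 1000
val projected = joined_data.map { case (key, (first, second)) => (key, first, second) }
val topM = projected.top(M)(Ordering.by(_._1))
topM.foreach(println)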

How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-12 Thread Rex X
To be concrete, say we have a folder with thousands of tab-delimited csv files in the following attribute format (each csv file is about 10GB):

id   name   address   city   ...
1    Matt   add1      LA     ...
2    Will   add2      LA     ...
3    Lucy   add3      SF     ...
...

And we have a second lookup csv file, keyed on id, that we want to join against, keeping only N of the columns and the top M rows of each file.
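[Editor's sketch, not from the thread: the folder-level read the question asks about is straightforward because sc.textFile accepts a directory or glob, so all the csv files load as one RDD. The path and the column indices kept here are placeholders.]

// Every tab-delimited file matching the glob becomes part of one RDD
// of lines; splitting on tab recovers the columns.
val rows = sc.textFile("/path/to/folder/*.csv").map(_.split("\t"))
// Keep N columns, e.g. id, name, city (indices are assumptions):
val nColumns = rows.map(r => (r(0), r(1), r(3)))
nColumns.take(5).foreach(println)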