Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning.

Thanks
Best Regards

On Mon, Jun 15, 2015 at 10:28 PM, Rex X <dnsr...@gmail.com> wrote:
> Thanks very much, Akhil. That solved my problem.
>
> Best,
> Rex
Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?
Something like this?

    val huge_data = sc.textFile("/path/to/first.csv")
      .map(x => (x.split("\t")(1), x.split("\t")(0)))
    val gender_data = sc.textFile("/path/to/second.csv")
      .map(x => (x.split("\t")(0), x))
    val joined_data = huge_data.join(gender_data)
    joined_data.take(1000)

It's Scala, btw; the Python API should be similar.

Thanks
Best Regards

On Sat, Jun 13, 2015 at 12:16 AM, Rex X <dnsr...@gmail.com> wrote:
> [...]
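Since the question asks about PySpark, here is a minimal local Python sketch of the same map/join flow the Scala snippet runs on Spark: key the big file by name, key the lookup table by name, inner-join on that key, then take the top rows. The inline sample lines are made-up stand-ins for the CSV files; on a real cluster you would read them with sc.textFile instead.

```python
# Stand-in data mimicking the tab-delimited files from the question
# (hypothetical sample rows, not real files).
huge_lines = [
    "1\tMatt\tadd1\tLA",
    "2\tWill\tadd2\tLA",
    "3\tLucy\tadd3\tSF",
]
gender_lines = [
    "Matt\tM",
    "Lucy\tF",
]

# map step: (name, id) pairs, mirroring huge_data in the Scala snippet
huge_data = [(x.split("\t")[1], x.split("\t")[0]) for x in huge_lines]

# lookup table: name -> gender, mirroring gender_data
gender_data = dict(x.split("\t") for x in gender_lines)

# join on name; names absent from the lookup (e.g. Will) are dropped,
# matching the inner-join semantics of RDD.join
joined = [(row_id, name, gender_data[name])
          for name, row_id in huge_data
          if name in gender_data]

# take(1000) equivalent
top = joined[:1000]
print(top)  # [('1', 'Matt', 'M'), ('3', 'Lucy', 'F')]
```

The same shape carries over to PySpark almost line for line: the list comprehensions become rdd.map calls, the dict lookup becomes rdd.join, and the slice becomes take(1000).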
How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?
To be concrete, say we have a folder with thousands of tab-delimited CSV files in the following format (each CSV file is about 10 GB):

    id    name    address    city    ...
    1     Matt    add1       LA      ...
    2     Will    add2       LA      ...
    3     Lucy    add3       SF      ...
    ...

And we have a lookup table keyed on the name column above:

    name    gender
    Matt    M
    Lucy    F
    ...

Now we want to output the top 1000 rows of each CSV file in the following format:

    id    name    gender
    1     Matt    M
    ...

Can we use PySpark to handle this efficiently?