You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.
Thanks
Best Regards
On Mon, Jun 15, 2015 at 10:28 PM, Rex X wrote:
> Thanks very much, Akhil.
>
> That solved my problem.
>
> Best,
> Rex
>
>
>
> On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das
> wrote:
Something like this?
val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)
It's Scala btw; the Python API works the same way.
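For intuition, the keying-and-join logic in the snippet above can be sketched without a Spark cluster, on plain Scala collections. The sample rows and field positions here are assumptions for illustration, not taken from the actual files:

```scala
object JoinSketch {
  // Mirrors the first map above: key each huge-file row by column 1,
  // keep column 0 as the value (hypothetical column layout).
  def keyHuge(line: String): (String, String) = {
    val cols = line.split("\t")
    (cols(1), cols(0))
  }

  // Mirrors the second map: key each lookup row by column 0,
  // keep the whole line as the value.
  def keyLookup(line: String): (String, String) =
    (line.split("\t")(0), line)

  def main(args: Array[String]): Unit = {
    val huge   = Seq("1\tMatt\tadd1\tLA", "2\tWill\tadd2\tLA").map(keyHuge)
    val lookup = Seq("Matt\tM", "Will\tM").map(keyLookup).toMap

    // Inner join: keep huge rows whose key appears in the lookup table,
    // pairing each with its matching lookup line (what RDD.join returns per key).
    val joined = huge.flatMap { case (k, v) => lookup.get(k).map(g => (k, (v, g))) }
    joined.foreach(println)
  }
}
```

In real Spark the same shape applies, but `join` shuffles both RDDs by key; if the gender file is small, a broadcast of the lookup map avoids that shuffle.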
To be concrete, say we have a folder with thousands of tab-delimited csv
files with the following format (each csv file is about 10GB):
id	name	address	city	...
1	Matt	add1	LA	...
2	Will	add2	LA	...
3	Lucy	add3	SF	...
...
And we have a lo