You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.
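Since the question asked about pyspark: below is a minimal plain-Python sketch of the keyed inner join that Spark's RDD.join performs, keyed on "name" as in the Scala snippet quoted below. The sample rows and column positions are assumptions made up for illustration.

```python
# Hypothetical sample rows standing in for the tab-delimited files.
huge_rows = ["1\tMatt\tadd1\tLA", "2\tWill\tadd2\tLA", "3\tLucy\tadd3\tSF"]
gender_rows = ["Matt\tM", "Lucy\tF"]

# Key each dataset by "name", mirroring the map() calls in the Scala snippet.
huge_kv = [(r.split("\t")[1], r.split("\t")[0]) for r in huge_rows]
gender_kv = [(r.split("\t")[0], r) for r in gender_rows]

# Inner join on the key: only names present in both datasets survive,
# which is exactly what RDD.join does with (key, value) pairs.
gender_map = dict(gender_kv)
joined = [(name, (id_, gender_map[name])) for name, id_ in huge_kv
          if name in gender_map]
# joined -> [('Matt', ('1', 'Matt\tM')), ('Lucy', ('3', 'Lucy\tF'))]
```

In real pyspark you would build the two keyed datasets with sc.textFile(...).map(...) and call .join() on them, then .take(1000); the pairing logic is the same as above.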
Thanks
Best Regards
On Mon, Jun 15, 2015 at 10:28 PM, Rex X wrote:
> Thanks very much, Akhil.
>
> That solved my problem.
>
> Best,
> Rex
>
>
>
> On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das
> wrote:
>
>> Something like this?
>>
>> val huge_data = sc.textFile("/path/to/first.csv").map(x =>
>> (x.split("\t")(1), x.split("\t")(0)))
>> val gender_data = sc.textFile("/path/to/second.csv").map(x =>
>> (x.split("\t")(0), x))
>>
>> val joined_data = huge_data.join(gender_data)
>>
>> joined_data.take(1000)
>>
>>
>> It's Scala btw; the Python API should be similar.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Jun 13, 2015 at 12:16 AM, Rex X wrote:
>>
>>> To be concrete, say we have a folder with thousands of tab-delimited csv
>>> files in the following format (each csv file is about 10GB):
>>>
>>> id    name    address    city    ...
>>> 1     Matt    add1       LA      ...
>>> 2     Will    add2       LA      ...
>>> 3     Lucy    add3       SF      ...
>>> ...
>>>
>>> And we have a lookup table based on the "name" column above:
>>>
>>> name    gender
>>> Matt    M
>>> Lucy    F
>>> ...
>>>
>>> Now we want to output the top 1000 rows of each csv file in the
>>> following format:
>>>
>>> id    name    gender
>>> 1     Matt    M
>>> ...
>>>
>>> Can we use pyspark to efficiently handle this?
>>>
>>>
>>>
>>
>