Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.
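For example, one of the first things that guide covers is switching to the Kryo
serializer. A minimal sketch of setting it when building the context (the app
name and memory value here are purely illustrative assumptions, not
recommendations for this workload):

from pyspark import SparkConf, SparkContext

# illustrative settings only; tune spark.executor.memory for your own cluster
conf = (SparkConf()
        .setAppName("csv-filter")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.executor.memory", "4g"))
sc = SparkContext(conf=conf)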

Thanks
Best Regards

On Mon, Jun 15, 2015 at 10:28 PM, Rex X  wrote:

> Thanks very much, Akhil.
>
> That solved my problem.
>
> Best,
> Rex
>
>
>
> On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das 
> wrote:
>
>> Something like this?
>>
>> // key each line of the big file by name (column 1), keeping id (column 0)
>> val huge_data = sc.textFile("/path/to/first.csv").map(x =>
>>   (x.split("\t")(1), x.split("\t")(0)))
>> // key the lookup file by name (column 0), keeping the whole line
>> val gender_data = sc.textFile("/path/to/second.csv").map(x =>
>>   (x.split("\t")(0), x))
>>
>> val joined_data = huge_data.join(gender_data)
>>
>> joined_data.take(1000)
>>
>>
>> It's Scala, btw; the Python API should be similar.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Jun 13, 2015 at 12:16 AM, Rex X  wrote:
>>
>>> To be concrete, say we have a folder with thousands of tab-delimited csv
>>> files with the following attributes (each csv file is about 10 GB):
>>>
>>> id    name    address    city    ...
>>> 1     Matt    add1       LA      ...
>>> 2     Will    add2       LA      ...
>>> 3     Lucy    add3       SF      ...
>>> ...
>>>
>>> And we have a lookup table based on "name" above
>>>
>>> name    gender
>>> Matt    M
>>> Lucy    F
>>> ...
>>>
>>> Now we want to output the top 1000 rows of each csv file in the
>>> following format:
>>>
>>> id    name    gender
>>> 1     Matt    M
>>> ...
>>>
>>> Can we use pyspark to efficiently handle this?
>>>
>>>
>>>
>>
>


Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this?

// key each line of the big file by name (column 1), keeping id (column 0)
val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))
// key the lookup file by name (column 0), keeping the whole line
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x))

val joined_data = huge_data.join(gender_data)

joined_data.take(1000)


It's Scala, btw; the Python API should be similar.
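
For reference, a rough PySpark equivalent of the snippet above might look like
this (an untested sketch; the paths and tab-delimited column positions are the
same assumptions as in the Scala version):

# key the big file by name (column 1), value = id (column 0)
huge_data = sc.textFile("/path/to/first.csv") \
    .map(lambda x: (x.split("\t")[1], x.split("\t")[0]))
# key the lookup file by name (column 0), value = the whole line
gender_data = sc.textFile("/path/to/second.csv") \
    .map(lambda x: (x.split("\t")[0], x))

joined_data = huge_data.join(gender_data)
joined_data.take(1000)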

Thanks
Best Regards

On Sat, Jun 13, 2015 at 12:16 AM, Rex X  wrote:

> To be concrete, say we have a folder with thousands of tab-delimited csv
> files with the following attributes (each csv file is about 10 GB):
>
> id    name    address    city    ...
> 1     Matt    add1       LA      ...
> 2     Will    add2       LA      ...
> 3     Lucy    add3       SF      ...
> ...
>
> And we have a lookup table based on "name" above
>
> name    gender
> Matt    M
> Lucy    F
> ...
>
> Now we want to output the top 1000 rows of each csv file in the
> following format:
>
> id    name    gender
> 1     Matt    M
> ...
>
> Can we use pyspark to efficiently handle this?
>
>
>