Yes, it's all just RDDs under the covers.  DataFrames/SQL is just a more
concise way to express your parallel programs.
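
For instance, you can see that machinery from PySpark directly (a minimal
sketch; "df" here stands for whatever DataFrame you have built):

    # Every DataFrame is backed by an RDD of Row objects, split into
    # partitions that are processed in parallel across the cluster.
    print(df.rdd.getNumPartitions())  # number of parallel partitions
    df.explain()                      # physical plan, i.e. the RDD operators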

On Sat, Jun 13, 2015 at 5:25 PM, Rex X <dnsr...@gmail.com> wrote:

> Thanks, Don! Does the SQL implementation in Spark do parallel processing on
> records by default?
>
> -Rex
>
>
>
> On Sat, Jun 13, 2015 at 10:13 AM, Don Drake <dondr...@gmail.com> wrote:
>
>> Take a look at https://github.com/databricks/spark-csv to read in the
>> tab-delimited files (you'll need to change the default delimiter), and
>> once you have them as a DataFrame, SQL can do the rest.
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html
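>>
>> As a rough sketch (untested; assuming Spark 1.4+ with the spark-csv
>> package on the classpath, e.g. --packages
>> com.databricks:spark-csv_2.10:1.0.3 -- the paths and temp table names
>> below are made up):
>>
>>     # Read every tab-delimited file under the folder as one DataFrame.
>>     df = sqlContext.read.format("com.databricks.spark.csv") \
>>         .options(header="true", delimiter="\t") \
>>         .load("/path/to/folder/*.csv")
>>
>>     # Load the name -> gender lookup table the same way.
>>     lookup = sqlContext.read.format("com.databricks.spark.csv") \
>>         .options(header="true", delimiter="\t") \
>>         .load("/path/to/lookup.csv")
>>
>>     # Register both as temp tables and let SQL do the join/projection.
>>     df.registerTempTable("people")
>>     lookup.registerTempTable("genders")
>>     result = sqlContext.sql(
>>         "SELECT p.id, p.name, g.gender "
>>         "FROM people p JOIN genders g ON p.name = g.name")
>>
>> One caveat: a LIMIT in that query would apply to the combined DataFrame,
>> not to each file, so to take the top 100K rows of every file you would
>> load and limit the files one at a time in a loop.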
>>
>> -Don
>>
>>
>> On Fri, Jun 12, 2015 at 8:46 PM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I want to use Spark to select N columns and the top M rows of all CSV
>>> files under a folder.
>>>
>>> To be concrete, say we have a folder with thousands of tab-delimited CSV
>>> files in the following attribute format (each CSV file is about 10 GB):
>>>
>>>     id    name    address    city...
>>>     1    Matt    add1    LA...
>>>     2    Will    add2    LA...
>>>     3    Lucy    add3    SF...
>>>     ...
>>>
>>> And we have a lookup table keyed on "name" above:
>>>
>>>     name    gender
>>>     Matt    M
>>>     Lucy    F
>>>     ...
>>>
>>> Now we want to output the top 100K rows of each CSV file in the
>>> following format:
>>>
>>>     id    name    gender
>>>     1    Matt    M
>>>     ...
>>>
>>> Can we use PySpark to handle this efficiently?
>>>
>>>
>>>
>>
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> http://www.MailLaunder.com/
>> 800-733-2143
>>
>
>
