Yes, it's all just RDDs under the covers, so the work is spread across
partitions in parallel by default.  DataFrames/SQL is just a more concise
way to express your parallel programs.
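
To make the "just RDDs" point concrete for the use case quoted below, here is
a rough sketch rather than a tested recipe. It assumes every file starts with
a header line, that the files sit on a filesystem every executor can open
directly (a shared mount, not HDFS), and that the lookup table is small enough
to broadcast. The names top_rows_with_gender and run are made up for
illustration; they are not part of any Spark API.

```python
import glob
import itertools
import os


def top_rows_with_gender(path, gender_by_name, m, sep="\t"):
    """Return (id, name, gender) for the first m data rows of one file."""
    out = []
    with open(path, encoding="utf-8") as f:
        # Assumption: the first line is a header naming the columns.
        header = f.readline().rstrip("\n").split(sep)
        id_idx, name_idx = header.index("id"), header.index("name")
        # islice stops after m rows, so the rest of the file is never read.
        for line in itertools.islice(f, m):
            fields = line.rstrip("\n").split(sep)
            name = fields[name_idx]
            # Inner-join semantics: skip names missing from the lookup table.
            if name in gender_by_name:
                out.append((fields[id_idx], name, gender_by_name[name]))
    return out


def run(folder, lookup_path, m=100000):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("top-rows").getOrCreate()
    sc = spark.sparkContext

    # Assumption: the lookup file has a "name<TAB>gender" header and fits in memory.
    with open(lookup_path, encoding="utf-8") as f:
        next(f)  # skip the header line
        lookup = dict(line.rstrip("\n").split("\t")[:2] for line in f if line.strip())
    bc = sc.broadcast(lookup)

    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    # One task per file; each task reads only the first m rows of its file.
    rows = sc.parallelize(paths, max(len(paths), 1)).flatMap(
        lambda p: top_rows_with_gender(p, bc.value, m)
    )
    return spark.createDataFrame(rows, ["id", "name", "gender"])
```

Calling run("/path/to/folder", "/path/to/lookup.tsv") (both paths are
placeholders) should give back a DataFrame you can register as a table and
query with SQL. Reading each file in its own task avoids scanning all 10GB
just to keep the first 100K rows, at the cost of bypassing Spark's own file
readers.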

On Sat, Jun 13, 2015 at 5:25 PM, Rex X <> wrote:

> Thanks, Don! Does Spark's SQL implementation process records in parallel
> by default?
> -Rex
> On Sat, Jun 13, 2015 at 10:13 AM, Don Drake <> wrote:
>> Take a look at to read in the
>> tab-delimited file (change the default delimiter)
>> and once you have that as a DataFrame, SQL can do the rest.
>> -Don
>> On Fri, Jun 12, 2015 at 8:46 PM, Rex X <> wrote:
>>> Hi,
>>> I want to use Spark to select N columns and the top M rows of every csv
>>> file under a folder.
>>> To be concrete, say we have a folder with thousands of tab-delimited csv
>>> files, each about 10GB, with the following attributes:
>>>     id    name    address    city...
>>>     1    Matt    add1    LA...
>>>     2    Will    add2    LA...
>>>     3    Lucy    add3    SF...
>>>     ...
>>> And we have a lookup table based on "name" above
>>>     name    gender
>>>     Matt    M
>>>     Lucy    F
>>>     ...
>>> Now we would like to output the top 100K rows of each csv file in the
>>> following format:
>>>     id    name    gender
>>>     1    Matt    M
>>>     ...
>>> Can we use pyspark to handle this efficiently?
>> --
>> Donald Drake
>> Drake Consulting
>> 800-733-2143
