Take a look at https://github.com/databricks/spark-csv to read in the
tab-delimited files (just change the default delimiter from "," to a tab).
Once you have them loaded as a DataFrame, Spark SQL can do the rest:

https://spark.apache.org/docs/latest/sql-programming-guide.html
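
Something along these lines should work. It's an untested sketch that
assumes Spark 1.4+ with the spark-csv package on the classpath (e.g.
launched via pyspark --packages com.databricks:spark-csv_2.10:1.0.3,
adjusting the version as needed); the paths and the lookup-file location
are placeholders for your own:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="csv-join")
    sqlContext = SQLContext(sc)

    def read_tsv(path):
        # spark-csv with the delimiter overridden from "," to tab
        return (sqlContext.read
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("delimiter", "\t")
                .load(path))

    people = read_tsv("/data/csv_folder/*.csv")  # glob picks up every file
    lookup = read_tsv("/data/name_gender.csv")

    people.registerTempTable("people")
    lookup.registerTempTable("lookup")

    # Join on name and keep only the columns of interest. Note that
    # LIMIT here is global across all files; to take the top 100K rows
    # of *each* file you'd loop over the file list and union the
    # per-file results instead.
    result = sqlContext.sql("""
        SELECT p.id, p.name, l.gender
        FROM people p
        JOIN lookup l ON p.name = l.name
        LIMIT 100000
    """)

    result.show()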

-Don


On Fri, Jun 12, 2015 at 8:46 PM, Rex X <dnsr...@gmail.com> wrote:

> Hi,
>
> I want to use Spark to select N columns and the top M rows of all the csv
> files under a folder.
>
> To be concrete, say we have a folder with thousands of tab-delimited csv
> files in the following format (each csv file is about 10GB):
>
>     id    name    address    city...
>     1    Matt    add1    LA...
>     2    Will    add2    LA...
>     3    Lucy    add3    SF...
>     ...
>
> And we have a lookup table keyed on the "name" column above:
>
>     name    gender
>     Matt    M
>     Lucy    F
>     ...
>
> Now we want to output the top 100K rows of each csv file in the
> following format:
>
>     id    name    gender
>     1    Matt    M
>     ...
>
> Can we use PySpark to handle this efficiently?
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143
