Take a look at https://github.com/databricks/spark-csv to read in the tab-delimited files (it lets you change the default delimiter),
and once you have that as a DataFrame, SQL can do the rest.
https://spark.apache.org/docs/latest/sql-programming-guide.html

-Don

On Fri, Jun 12, 2015 at 8:46 PM, Rex X <dnsr...@gmail.com> wrote:
> Hi,
>
> I want to use Spark to select N columns and the top M rows of all CSV
> files under a folder.
>
> To be concrete, say we have a folder with thousands of tab-delimited CSV
> files with the following attribute format (each CSV file is about 10 GB):
>
> id    name    address    city...
> 1     Matt    add1       LA...
> 2     Will    add2       LA...
> 3     Lucy    add3       SF...
> ...
>
> And we have a lookup table based on "name" above:
>
> name    gender
> Matt    M
> Lucy    F
> ...
>
> Now we are interested in outputting the top 100K rows of each CSV file in
> the following format:
>
> id    name    gender
> 1     Matt    M
> ...
>
> Can we use PySpark to handle this efficiently?

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143