Yes, it's all just RDDs under the covers. DataFrames/SQL is just a more concise way to express your parallel programs.
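
To make that concrete, here is a rough sketch of the same job written both ways for a single file. The function names (to_output, rdd_version, dataframe_version) and all arguments are hypothetical, and the DataFrame variant assumes a newer Spark (2.x+) with the built-in CSV reader rather than the spark-csv package mentioned below. Note that limit() on a DataFrame is not strictly guaranteed to return the first rows of the file in order, so treat this as an illustration, not a tested recipe.

```python
def to_output(line, gender_by_name):
    """Parse one tab-delimited record and attach gender from a lookup dict."""
    fields = line.rstrip("\n").split("\t")
    id_, name = fields[0], fields[1]
    # Names missing from the lookup table get None.
    return (id_, name, gender_by_name.get(name))


def rdd_version(sc, path, gender_by_name, m):
    """Explicit RDD form: Spark applies to_output to each partition in parallel."""
    lookup = sc.broadcast(gender_by_name)  # ship the small lookup to every executor
    lines = sc.textFile(path)
    header = lines.first()
    return (lines.filter(lambda l: l != header)  # drop the header line
                 .map(lambda l: to_output(l, lookup.value))
                 .take(m))  # first m records, returned to the driver


def dataframe_version(spark, path, lookup_path, m):
    """Same job with DataFrames (assumes Spark 2.x+ and its built-in CSV reader)."""
    from pyspark.sql.functions import broadcast
    rows = spark.read.csv(path, sep="\t", header=True).limit(m)
    lookup = spark.read.csv(lookup_path, sep="\t", header=True)
    # Broadcast join on "name"; a left join keeps rows with no lookup match.
    return rows.join(broadcast(lookup), "name", "left").select("id", "name", "gender")
```

For the whole folder you would call one of these per file and combine the results (for example with union on the DataFrames), since the top-M cut is per file.
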
On Sat, Jun 13, 2015 at 5:25 PM, Rex X <dnsr...@gmail.com> wrote:

> Thanks, Don! Does SQL implementation of spark do parallel processing on
> records by default?
>
> -Rex
>
> On Sat, Jun 13, 2015 at 10:13 AM, Don Drake <dondr...@gmail.com> wrote:
>
>> Take a look at https://github.com/databricks/spark-csv to read in the
>> tab-delimited file (change the default delimiter)
>>
>> and once you have that as a DataFrame, SQL can do the rest.
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> -Don
>>
>> On Fri, Jun 12, 2015 at 8:46 PM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I want to use spark to select N columns, top M rows of all csv files
>>> under a folder.
>>>
>>> To be concrete, say we have a folder with thousands of tab-delimited csv
>>> files with following attributes format (each csv file is about 10GB):
>>>
>>>     id    name    address    city...
>>>     1     Matt    add1       LA...
>>>     2     Will    add2       LA...
>>>     3     Lucy    add3       SF...
>>>     ...
>>>
>>> And we have a lookup table based on "name" above
>>>
>>>     name    gender
>>>     Matt    M
>>>     Lucy    F
>>>     ...
>>>
>>> Now we are interested to output from top 100K rows of each csv file into
>>> following format:
>>>
>>>     id    name    gender
>>>     1     Matt    M
>>>     ...
>>>
>>> Can we use pyspark to efficiently handle this?
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> http://www.MailLaunder.com/
>> 800-733-2143