As Sean said, this is just a few lines of code. You can see an example here: https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DF.scala#L660
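
In case a spelled-out version of the manual route Sean describes below is useful, here is a minimal sketch (not the bigdf code linked above): read the CSVs with textFile, drop the header, parse each line with Commons CSV, keep the columns whose names match a pattern, and write the result back out. The paths, the "sensor_.*" pattern, and the object name are placeholders I made up for illustration.

import org.apache.commons.csv.{CSVFormat, CSVParser}
import org.apache.spark.{SparkConf, SparkContext}

object SelectCsvColumns {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-column-select"))

    // Read every CSV in the folder (all files are assumed to share one header).
    val lines   = sc.textFile("hdfs:///data/input/*.csv")
    val header  = lines.first()
    val columns = header.split(",").map(_.trim)

    // Keep only the columns whose names match the pattern, remembering their indices.
    val keep    = columns.zipWithIndex.filter { case (name, _) => name.matches("sensor_.*") }
    val keepIdx = keep.map(_._2)

    val selected = lines
      .filter(_ != header)  // drop the header line of every file
      .map { line =>
        // Commons CSV handles quoting and embedded commas.
        val record = CSVParser.parse(line, CSVFormat.DEFAULT).getRecords.get(0)
        keepIdx.map(i => record.get(i)).mkString(",")
      }

    // Re-attach the selected column names as the first record and save.
    val withHeader = sc.parallelize(Seq(keep.map(_._1).mkString(","))) ++ selected
    withHeader.saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}

The output still lands as part-nnnnn files; merging those into one CSV with the header on top is what Charles's links below cover.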

> On Feb 6, 2015, at 7:29 AM, Charles Feduke <charles.fed...@gmail.com> wrote:
>
> I've been doing a bunch of work with CSVs in Spark, mostly saving them as a
> merged CSV (instead of the various part-nnnnn files). You might find the
> following links useful:
>
> - This article is about combining the part files and outputting a header as
> the first line in the merged results:
>
> http://java.dzone.com/articles/spark-write-csv-file-header
>
> - This was my take on the previous author's original article, but it doesn't
> yet handle the header row:
>
> http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
>
> spark-csv helps with reading CSV data and mapping a schema for Spark SQL, but
> as of now doesn't save CSV data.
>
> On Fri Feb 06 2015 at 9:49:06 AM Sean Owen <so...@cloudera.com> wrote:
> You can do this manually without much trouble: get your files on a
> distributed store like HDFS, read them with textFile, filter out
> headers, parse with a CSV library like Commons CSV, select columns,
> format and store the result. That's tens of lines of code.
>
> However you probably want to start by looking at
> https://github.com/databricks/spark-csv which may make it even easier
> than that and give you a richer query syntax.
>
> On Fri, Feb 6, 2015 at 8:37 AM, Spico Florin <spicoflo...@gmail.com> wrote:
> > Hi!
> > I'm new to Spark. I have a case study where the data is stored in CSV
> > files. These files have headers with more than 1000 columns. I would like
> > to know the best practices for parsing them, in particular the
> > following points:
> > 1. Getting and parsing all the files from a folder
> > 2. What CSV parser do you use?
> > 3. I would like to select just the columns whose names match a pattern
> > and then pass the selected columns' values (plus the column names) to the
> > processing and save the output to a CSV (preserving the selected columns).
> >
> > If you have any experience with the points above, it would be really helpful
> > (for me and for others who encounter the same cases) if you could
> > share your thoughts.
> > Thanks.
> > Regards,
> > Florin
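
P.S. Following up on the merged-CSV links Charles posted: for anyone who just wants the part-nnnnn files collapsed into a single CSV after the job runs, here is a minimal sketch using Hadoop's FileUtil.copyMerge. This is my own illustration, not the code from either article above; the paths are placeholders, and note that copyMerge exists in Hadoop 1.x/2.x but was removed in Hadoop 3.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object MergeParts {
  // Concatenate every file under srcDir into a single dstFile on the same filesystem.
  def merge(srcDir: String, dstFile: String): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)
    FileUtil.copyMerge(fs, new Path(srcDir), fs, new Path(dstFile),
      false /* deleteSource: keep the part files */, conf, null)
  }
}

// e.g. MergeParts.merge("hdfs:///data/output", "hdfs:///data/output.csv")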