As Sean said, this is just a few lines of code. You can see an example here:

https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DF.scala#L660
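
For what that boils down to, here is a minimal sketch of the idea (mine, not the bigdf code at that link; the quoting rule is an assumption):

    import org.apache.spark.rdd.RDD

    // Sketch only: escape and quote each field, join with commas, and let
    // Spark write the part files.
    def saveAsCsv(rows: RDD[Seq[String]], path: String): Unit =
      rows
        .map(_.map(f => "\"" + f.replace("\"", "\"\"") + "\"").mkString(","))
        .saveAsTextFile(path)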


> On Feb 6, 2015, at 7:29 AM, Charles Feduke <charles.fed...@gmail.com> wrote:
> 
> I've been doing a bunch of work with CSVs in Spark, mostly saving them as a 
> merged CSV (instead of the various part-nnnnn files). You might find the 
> following links useful:
> 
> - This article is about combining the part files and outputting a header as 
> the first line in the merged results (a rough sketch follows these links):
> 
> http://java.dzone.com/articles/spark-write-csv-file-header
> 
> - This was my take on the previous author's original article, but it doesn't 
> yet handle the header row:
> 
> http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
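> 
> For reference, a rough sketch of the first article's approach (my paraphrase, 
> not its code; the 0_header trick and the Hadoop 2.x copyMerge sort order are 
> assumptions worth verifying):
> 
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
> 
>     // Merge the part-nnnnn files of a saveAsTextFile output into one CSV,
>     // header first. Hadoop 2.x: FileUtil.copyMerge was removed in Hadoop 3.
>     def mergeWithHeader(srcDir: String, dstFile: String, header: String): Unit = {
>       val conf = new Configuration()
>       val fs   = FileSystem.get(conf)
>       // Drop the header into the source dir under a name that sorts before
>       // "part-00000"; copyMerge concatenates files in sorted name order.
>       val out = fs.create(new Path(srcDir, "0_header"))
>       out.write((header + "\n").getBytes("UTF-8"))
>       out.close()
>       FileUtil.copyMerge(fs, new Path(srcDir), fs, new Path(dstFile),
>         false /* deleteSource */, conf, null)
>     }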
> 
> spark-csv helps with reading CSV data and mapping a schema for Spark SQL, but 
> as of now doesn't save CSV data.
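> 
> Reading and projecting would look roughly like this (a sketch assuming the 
> spark-csv implicit API of the time, an existing SparkContext sc, and a 
> made-up path and column pattern):
> 
>     import org.apache.spark.sql.SQLContext
>     import com.databricks.spark.csv._  // adds csvFile to SQLContext
> 
>     val sqlContext = new SQLContext(sc)
>     val data = sqlContext.csvFile("hdfs:///data/sample.csv", useHeader = true)
> 
>     // Keep only the columns whose names match a pattern, via plain SQL.
>     val wanted = data.schema.fields.map(_.name).filter(_.matches("col_.*"))
>     data.registerTempTable("data")
>     val selected = sqlContext.sql(s"SELECT ${wanted.mkString(", ")} FROM data")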
> 
> On Fri Feb 06 2015 at 9:49:06 AM Sean Owen <so...@cloudera.com> wrote:
> You can do this manually without much trouble: get your files on a
> distributed store like HDFS, read them with textFile, filter out
> headers, parse with a CSV library like Commons CSV, select columns,
> format and store the result. That's tens of lines of code.
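> 
> In sketch form (the paths, the col_.* pattern, and the naive unquoted output
> are assumptions; sc is an existing SparkContext):
> 
>     import org.apache.commons.csv.{CSVFormat, CSVParser}
>     import scala.collection.JavaConverters._
> 
>     // Parse a single CSV line into its fields with Commons CSV.
>     def parseLine(line: String): Array[String] =
>       CSVParser.parse(line, CSVFormat.DEFAULT).getRecords.asScala
>         .head.iterator().asScala.toArray
> 
>     val lines  = sc.textFile("hdfs:///data/*.csv")
>     val header = lines.first()          // assumes every file shares this header
>     val names  = parseLine(header)
>     // Indices of the columns whose names match the pattern of interest.
>     val keep   = names.zipWithIndex.collect { case (n, i) if n.matches("col_.*") => i }
> 
>     lines.filter(_ != header)           // drop the header line(s)
>       .map(parseLine)
>       .map(f => keep.map(f(_)).mkString(","))  // naive: assumes fields contain no commas
>       .saveAsTextFile("hdfs:///out/selected")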
> 
> However, you probably want to start by looking at
> https://github.com/databricks/spark-csv which may make it even easier
> than that and give you a richer query syntax.
> 
> On Fri, Feb 6, 2015 at 8:37 AM, Spico Florin <spicoflo...@gmail.com> wrote:
> > Hi!
> >   I'm new to Spark. I have a case study where the data is stored in CSV
> > files. These files have headers with more than 1,000 columns. I would like
> > to know the best practices for parsing them, in particular the following
> > points:
> > 1. Getting and parsing all the files from a folder
> > 2. What CSV parser do you use?
> > 3. I would like to select just the columns whose names match a pattern
> > and then pass the selected columns' values (plus the column names) to the
> > processing, saving the output to a CSV (preserving the selected columns).
> >
> > If you have any experience with the points above, it would be really helpful
> > (for me and for others who run into the same cases) if you could share
> > your thoughts.
> > Thanks.
> >   Regards,
> >  Florin
> >
> 
