Re: removing header from csv file

2016-05-03 Thread Michael Segel
Hi, Another silly question… Don’t you want to use the header line to help create a schema for the RDD? Thx -Mike > On May 3, 2016, at 8:09 AM, Mathieu Longtin wrote: > > This only works if the files are "unsplittable". For example gzip files, each partition is

Re: removing header from csv file

2016-05-03 Thread Mathieu Longtin
This only works if the files are "unsplittable". For example, with gzip files each partition is one file (if you have more partitions than files), so the first line of each partition is the header. The spark-csv extension reads the very first line of the RDD, assumes it's the header, and then filters
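For the unsplittable case described above, a minimal sketch might look like the following. It assumes every partition is exactly one whole file that starts with a header line; the function name dropHeaderPerPartition is made up for illustration. With splittable input this would silently drop a data line from every partition.

```scala
import org.apache.spark.rdd.RDD

// ASSUMPTION: each partition holds exactly one complete file (e.g. one
// gzip file per partition) and each file begins with a header line.
// Under that assumption, the first line of every partition is a header.
def dropHeaderPerPartition(data: RDD[String]): RDD[String] =
  data.mapPartitions(lines => lines.drop(1))
```

If the assumption does not hold for your input, one of the index-based or filter-based approaches elsewhere in this thread is the safer choice.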

Re: removing header from csv file

2016-05-03 Thread Abhishek Anand
You can use this function to remove the header from your dataset (applicable to RDDs): def dropHeader(data: RDD[String]): RDD[String] = { data.mapPartitionsWithIndex((idx, lines) => { if (idx == 0) lines.drop(1) else lines }) } Abhi On Wed, Apr 27, 2016 at
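Laid out with its import, the function above could be sketched as follows. The important detail is that the result of drop(1) has to be the value returned for partition 0: calling drop(1) and then returning the original iterator is not reliable, because an iterator should not be reused after drop has been called on it. The sketch assumes the header is the first line of partition 0, which is the case for a single text file read with sc.textFile.

```scala
import org.apache.spark.rdd.RDD

// Drop the first line of partition 0 only; every other partition is
// passed through unchanged.
// ASSUMPTION: the header is the first line of the first partition.
def dropHeader(data: RDD[String]): RDD[String] =
  data.mapPartitionsWithIndex { (idx, lines) =>
    if (idx == 0) lines.drop(1) else lines
  }
```

Usage would be along the lines of dropHeader(sc.textFile(path)), where path is whatever file you are reading. When several files are read into one RDD, only the header of the first file is removed this way.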

Re: removing header from csv file

2016-04-27 Thread Marco Mistroni
If you are using the Scala API you can do myRdd.zipWithIndex().filter(_._2 > 0).map(_._1). It may be a little complicated, but it will do the trick. As for spark-csv, you will get back a DataFrame, which you can convert back to an RDD. HTH, Marco On 27 Apr 2016 6:59 am, "nihed mbarek" wrote: > You
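Spelled out as a function, the zipWithIndex suggestion might look like this sketch. The name dropFirstLine is hypothetical. Note that RDD.zipWithIndex has to run a Spark job when the RDD has more than one partition, so this does a little more work than the partition-index approach.

```scala
import org.apache.spark.rdd.RDD

// Pair every line with its global position in the RDD, keep everything
// except position 0, then strip the index again.
def dropFirstLine(data: RDD[String]): RDD[String] =
  data.zipWithIndex()                      // RDD[(String, Long)]
    .filter { case (_, index) => index > 0 }
    .map { case (line, _) => line }
```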

Re: removing header from csv file

2016-04-27 Thread Nachiketa
Why "without sqlcontext"? Could you please describe what it is that you are trying to accomplish? Thanks. Regards, Nachiketa On Wed, Apr 27, 2016 at 10:54 AM, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header and

Re: removing header from csv file

2016-04-27 Thread Hyukjin Kwon
There are two ways to do so. First, this way makes sure the header is skipped cleanly, but of course the use of mapPartitionsWithIndex decreases performance: rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter } Second, you can do val header = rdd.first() val data =
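The second snippet is cut off in the archive, so what follows is not a reconstruction of it. It is only a sketch of a common idiom that starts the same way: read the first line with first() and filter out every line equal to it. The function name is made up for illustration.

```scala
import org.apache.spark.rdd.RDD

// Fetch the first line, then keep only lines that differ from it.
// CAVEAT: any data line identical to the header is dropped as well.
// first() throws an exception on an empty RDD.
def dropLinesEqualToHeader(data: RDD[String]): RDD[String] = {
  val header = data.first()
  data.filter(line => line != header)
}
```

One side effect of comparing by content is that repeated copies of the same header, for example from several files with identical headers, are removed too.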

RE: removing header from csv file

2016-04-27 Thread Mishra, Abhishek
Sent: Wednesday, April 27, 2016 11:29 AM To: Divya Gehlot Cc: Ashutosh Kumar; user @spark Subject: Re: removing header from csv file You can add a filter on a string that you are sure appears only in the header. On Wednesday, 27 April 2016, Divya Gehlot <divya.htco...@gmail.com<mailto:divy

Re: removing header from csv file

2016-04-26 Thread nihed mbarek
You can add a filter on a string that you are sure appears only in the header. On Wednesday, 27 April 2016, Divya Gehlot wrote: > yes, you can remove the headers by removing the first row > > you can use first() or head() to do that > > > Thanks, > Divya > > On 27 April 2016
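A sketch of the filter idea, assuming you know some text that occurs in the header and nowhere else. Both the function name and the marker parameter are hypothetical; the caller has to supply a marker that really is unique to the header, otherwise data lines are lost.

```scala
import org.apache.spark.rdd.RDD

// Keep only the lines that do not contain the marker string.
// ASSUMPTION: `marker` appears in the header line(s) and in no data line.
def dropHeaderByMarker(data: RDD[String], marker: String): RDD[String] =
  data.filter(line => !line.contains(marker))
```

This works regardless of how the input is partitioned or how many files it comes from, at the cost of a string check on every line.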

Re: removing header from csv file

2016-04-26 Thread Divya Gehlot
Yes, you can remove the headers by removing the first row. You can use first() or head() to do that. Thanks, Divya On 27 April 2016 at 13:24, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header and processing of csv files. But it

Re: removing header from csv file

2016-04-26 Thread Praveen Devarao
tosh Kumar <kmr.ashutos...@gmail.com> To: "user @spark" <user@spark.apache.org> Date: 27/04/2016 10:55 am Subject: removing header from csv file I see there is a library spark-csv which can be used for removing header and processing of csv files. But it see

Re: removing header from csv file

2016-04-26 Thread Takeshi Yamamuro
Hi, What do you mean by "with sqlcontext only"? Do you mean you'd like to load CSV data as an RDD (SparkContext) or something? // maropu On Wed, Apr 27, 2016 at 2:24 PM, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header and

removing header from csv file

2016-04-26 Thread Ashutosh Kumar
I see there is a library, spark-csv, which can be used for removing the header and processing CSV files. But it seems it works with SQLContext only. Is there a way to remove the header from CSV files without SQLContext? Thanks, Ashutosh