Hi,
Another silly question…
Don’t you want to use the header line to help create a schema for the RDD?
Thx
-Mike
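A rough sketch of that idea, for what it's worth: build a schema out of the header line, treating every column as a string. The file path is a placeholder, and `sc`/`sqlContext` are assumed to come from the 2016-era Spark shell:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder path; sc and sqlContext are provided by the Spark shell.
val lines = sc.textFile("people.csv")
val header = lines.first()

// Build a schema from the header line, treating every column as a string.
val schema = StructType(
  header.split(",").map(name => StructField(name.trim, StringType, nullable = true)))

// Drop the header, split each remaining line, and apply the schema.
val rows = lines.filter(_ != header).map(line => Row.fromSeq(line.split(",").toSeq))
val df = sqlContext.createDataFrame(rows, schema)
```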
> On May 3, 2016, at 8:09 AM, Mathieu Longtin wrote:
>
This only works if the files are "unsplittable". For example gzip files,
each partition is one file (if you have more partitions than files), so the
first line of each partition is the header.
The spark-csv extension reads the very first line of the RDD, assumes it's the
header, and then filters it out.
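For reference, the usual way of letting spark-csv consume the header itself looks roughly like this (the file path is a placeholder, and `sqlContext` comes from the Spark shell):

```scala
// spark-csv (the Databricks package) treats the first line as the header:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line is the header, not data
  .option("inferSchema", "true") // optional: guess the column types
  .load("people.csv")
```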
You can use this function to remove the header from your dataset (applicable
to RDDs):
import org.apache.spark.rdd.RDD

def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex { (idx, lines) =>
    // Only the first partition starts with the header line.
    if (idx == 0) lines.drop(1) else lines
  }
}
Abhi
On Wed, Apr 27, 2016 at
If you are using the Scala API you can do
myRdd.zipWithIndex.filter(_._2 > 0).map(_._1)
Maybe a little bit complicated, but it will do the trick.
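The same trick can be sanity-checked on a plain Scala collection, where zipWithIndex, filter, and map behave the same way (the sample data is made up):

```scala
// Stand-in for the RDD: drop the element at index 0 (the header).
val lines = Seq("id,name", "1,alice", "2,bob")
val noHeader = lines.zipWithIndex.filter(_._2 > 0).map(_._1)
// noHeader: Seq("1,alice", "2,bob")
```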
As for spark-csv, you will get back a DataFrame, which you can convert back to
an RDD.
Hth
Marco
On 27 Apr 2016 6:59 am, "nihed mbarek" wrote:
Why "without sqlcontext"? Could you please describe what it is that you are
trying to accomplish? Thanks.
Regards,
Nachiketa
On Wed, Apr 27, 2016 at 10:54 AM, Ashutosh Kumar
wrote:
There are two ways to do so.
Firstly, this way makes sure it cleanly skips the header, but of course the
use of mapPartitionsWithIndex decreases performance:
rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else
iter }
Secondly, you can do
val header = rdd.first()
val data = rdd.filter(_ != header)
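One caveat with the second approach, shown here on a plain Scala collection: filtering on the header value removes every line equal to it. That also cleans up repeated headers when several csv files are read together, but it would drop a data row that happens to match the header exactly:

```scala
// Header appears twice, as when two csv files are concatenated.
val lines = Seq("id,name", "1,alice", "id,name", "2,bob")
val header = lines.head
val data = lines.filter(_ != header)
// data: Seq("1,alice", "2,bob")
```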
Sent: Wednesday, April 27, 2016 11:29 AM
To: Divya Gehlot
Cc: Ashutosh Kumar; user @spark
Subject: Re: removing header from csv file
You can add a filter with a string that you are sure is available only in the header.
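For instance, on a plain Scala collection (the column names here are made up), keep only the lines that don't start with the known header prefix:

```scala
val lines = Seq("id,name,age", "1,alice,30", "2,bob,25")
// Keep everything that does not look like the header.
val data = lines.filter(!_.startsWith("id,name"))
// data: Seq("1,alice,30", "2,bob,25")
```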
On Wednesday, 27 April 2016, Divya Gehlot wrote:
Yes, you can remove the header by removing the first row.
You can use first() or head() to do that.
Thanks,
Divya
On 27 April 2016 at 13:24, Ashutosh Kumar wrote:
From: Ashutosh Kumar <kmr.ashutos...@gmail.com>
To: "user @spark" <user@spark.apache.org>
Date: 27/04/2016 10:55 am
Subject: removing header from csv file
I see there is a library spark-csv which can be used for removing header
and processing of csv files. But it seems it works with sqlcontext only.
Hi,
What do you mean "with sqlcontext only"?
You mean you'd like to load the csv data as an RDD (via the SparkContext) or something?
// maropu
On Wed, Apr 27, 2016 at 2:24 PM, Ashutosh Kumar
wrote:
I see there is a library spark-csv which can be used for removing header
and processing of csv files. But it seems it works with sqlcontext only. Is
there a way to remove header from csv files without sqlcontext ?
Thanks
Ashutosh