Re: Optimal solution for getting the header from CSV with Spark

Spico Florin Wed, 25 Mar 2015 00:39:39 -0700

Hello!
  Thanks for your responses. I was afraid that, due to partitioning, I would
lose the guarantee that the first element is the header. I see that
rdd.first calls rdd.take(1) behind the scenes.
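
For reference, here is a minimal sketch of the approach discussed in this
thread. It assumes the header is the first line of partition 0, which should
hold for a single non-empty file read with sc.textFile. The object name
CsvHeader, the dropHeader helper, and the in-memory sample data are
hypothetical and only serve the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CsvHeader {
  // first() delegates to take(1), so only as many partitions as needed are
  // scanned. It throws UnsupportedOperationException on an empty RDD.
  def getHeader(data: RDD[String]): String = data.first()

  // Hypothetical helper: drops the first line of partition 0 only.
  // Assumption: the header sits at the start of partition 0.
  def dropHeader(data: RDD[String]): RDD[String] =
    data.mapPartitionsWithIndex { (idx, lines) =>
      if (idx == 0) lines.drop(1) else lines
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("csv-header").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // In-memory stand-in for sc.textFile(...) on a real CSV file.
      val data = sc.parallelize(Seq("id,name", "1,a", "2,b", "3,c"), 2)
      println(getHeader(data))
      dropHeader(data).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If the header could appear somewhere other than the start of partition 0
(for example, when several files are read together), dropHeader as written
would not be enough and the rows would need to be filtered differently.
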
Best regards,
  Florin


On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler <[email protected]> wrote:

> Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
> read the whole file, use data.take(1), which is simpler.
>
> From the RDD.take documentation, it works by first scanning one partition,
> and using the results from that partition to estimate the number of
> additional partitions needed to satisfy the limit. In this case, it will
> trivially stop at the first.
>
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin <[email protected]>
> wrote:
>
>> Hello!
>>
>> I would like to know what the optimal solution is for getting the header
>> from a CSV file with Spark. My approach was:
>>
>> def getHeader(data: RDD[String]): String = {
>>   data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString("")
>> }
>>
>> Thanks.
>>
>
>
