Hello!

Thanks for your responses. I was afraid that, because of partitioning, I would lose the guarantee that the first element is the header. I see that rdd.first calls rdd.take(1) behind the scenes.

Best regards,
Florin

On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler <[email protected]> wrote:

> Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
> read the whole file, use data.take(1), which is simpler.
>
> From the RDD.take documentation, it works by first scanning one partition,
> and using the results from that partition to estimate the number of
> additional partitions needed to satisfy the limit. In this case, it will
> trivially stop at the first.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin <[email protected]>
> wrote:
>
>> Hello!
>>
>> I would like to know what is the optimal solution for getting the header
>> from a CSV file with Spark? My approach was:
>>
>> def getHeader(data: RDD[String]): String = {
>>   data.zipWithIndex().filter(_._2==0).map(x=>x._1).take(1).mkString("")
>> }
>>
>> Thanks.
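
For reference, the suggestion in this thread can be sketched roughly as follows. This is not code posted by either author. getHeader simply uses first(), which delegates to take(1). dropHeader is a hypothetical helper added for illustration; it assumes the RDD comes from a single CSV file read with sc.textFile, so the header can only sit at the start of partition 0.

```scala
import org.apache.spark.rdd.RDD

// Returns the header line. first() delegates to take(1), so Spark scans
// partitions starting from the first one and stops once it has one element.
// Note: first() throws UnsupportedOperationException on an empty RDD.
def getHeader(data: RDD[String]): String = data.first()

// Hypothetical helper: removes the header line from the data.
// Assumption: a single input file, so only partition 0 starts with the header.
def dropHeader(data: RDD[String]): RDD[String] =
  data.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }
```

With several input files, each file would carry its own header, so filtering out lines equal to the header may be a safer choice than dropping by partition index.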
