Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Felix C
The spark-csv package can handle the header row, and the code is at the link below.
It can also use the header to infer field names for the schema.

https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvRelation.scala
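
For reference, a minimal sketch of how spark-csv is typically wired up through
the DataFrame reader (assumes Spark 1.4+ with the spark-csv package on the
classpath; the file path and sc are placeholders):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)            // sc: an existing SparkContext
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")                    // treat the first line as the header row
  .option("inferSchema", "true")               // infer column types from the data
  .load("cars.csv")                            // placeholder path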

--- Original Message ---

From: Dean Wampler deanwamp...@gmail.com
Sent: March 24, 2015 9:19 AM
To: Sean Owen so...@cloudera.com
Cc: Spico Florin spicoflo...@gmail.com, user user@spark.apache.org
Subject: Re: Optimal solution for getting the header from CSV with Spark

Good point. There's no guarantee that you'll get the actual first
partition. That's one reason why I wouldn't allow a CSV header line in a real
data file, if I could avoid it.

Back to Spark, a safer approach is RDD.foreachPartition, which takes a
function expecting an iterator. You'll only need to grab the first element
(being careful that the partition isn't empty!) and then determine which of
those first lines has the header info.
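
A rough sketch of that idea, using mapPartitionsWithIndex rather than
foreachPartition so the candidate first lines can be brought back to the
driver (looksLikeHeader is a placeholder predicate you would supply):

import org.apache.spark.rdd.RDD

def findHeader(data: RDD[String], looksLikeHeader: String => Boolean): Option[String] = {
  // Grab at most one line per partition, skipping empty partitions.
  val firstLines = data.mapPartitionsWithIndex { (idx, iter) =>
    if (iter.hasNext) Iterator((idx, iter.next())) else Iterator.empty
  }.collect()
  // Walk the candidates in partition order and keep the first one that looks like a header.
  firstLines.sortBy(_._1).map(_._2).find(looksLikeHeader)
}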

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 11:12 AM, Sean Owen so...@cloudera.com wrote:

 I think this works in practice, but I don't know that the first block
 of the file is guaranteed to be in the first partition. Certainly
 later down the pipeline that won't be true, but presumably this is
 happening right after reading the file.

 I've always just written some filter that matches only the header,
 which assumes the header can be distinguished from data lines, but it usually can be.

 On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com
 wrote:
  Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
  read the whole file, use data.take(1), which is simpler.
 
  From the RDD.take documentation: it works by first scanning one
 partition,
  and using the results from that partition to estimate the number of
  additional partitions needed to satisfy the limit. In this case, it will
  trivially stop at the first.
 
 
  Dean Wampler, Ph.D.
  Author: Programming Scala, 2nd Edition (O'Reilly)
  Typesafe
  @deanwampler
  http://polyglotprogramming.com
 
  On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com
 wrote:
 
  Hello!
 
  I would like to know what is the optimal solution for getting the header
  from a CSV file with Spark? My approach was:
 
  def getHeader(data: RDD[String]): String = {
  data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() }
 
  Thanks.
 
 



Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Spico Florin
Hello!
  Thanks for your responses. I was afraid that, due to partitioning, I would
lose the guarantee that the first element is the header. I observe that
rdd.first calls rdd.take(1) behind the scenes.
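
In other words, assuming the first element returned really is the header, the
equivalent one-liner would be:

val header = data.first()   // RDD.first() just wraps take(1) and fails on an empty RDD
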
Best regards,
  Florin

On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler deanwamp...@gmail.com wrote:

 Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
 read the whole file, use data.take(1), which is simpler.

 From the RDD.take documentation: it works by first scanning one partition,
 and using the results from that partition to estimate the number of
 additional partitions needed to satisfy the limit. In this case, it will
 trivially stop at the first.


 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com
 wrote:

 Hello!

 I would like to know what is the optimal solution for getting the header
 from a CSV file with Spark? My approach was:

 def getHeader(data: RDD[String]): String = {
 data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() }

 Thanks.





Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Spico Florin
Hello!

I would like to know what is the optimal solution for getting the header
from a CSV file with Spark? My approach was:

def getHeader(data: RDD[String]): String = {
data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() }
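
For context, a commented version of this approach; when the RDD has more than
one partition, zipWithIndex triggers an extra job to compute per-partition
counts, which is why it ends up scanning the whole file, as noted elsewhere in
the thread:

import org.apache.spark.rdd.RDD

def getHeader(data: RDD[String]): String = {
  data.zipWithIndex()        // extra job: needs per-partition counts to assign global indices
      .filter(_._2 == 0)     // keep only the element whose global index is 0
      .map(_._1)             // drop the index, keep the line itself
      .take(1)               // pull that single line back to the driver
      .mkString()            // Array("header") -> "header"; "" if the RDD is empty
}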

Thanks.


Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Sean Owen
I think this works in practice, but I don't know that the first block
of the file is guaranteed to be in the first partition. Certainly
later down the pipeline that won't be true, but presumably this is
happening right after reading the file.

I've always just written some filter that matches only the header,
which assumes the header can be distinguished from data lines, but it usually can be.
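
A small sketch of that pattern; the column names are made up, so substitute
something that only your header line contains:

// Split header and data by filtering on text that only the header line matches.
val isHeader = (line: String) => line.startsWith("id,name,price")   // hypothetical columns
val header   = data.filter(isHeader).first()
val rows     = data.filter(line => !isHeader(line))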

On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com wrote:
 Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
 read the whole file, use data.take(1), which is simpler.

 From the RDD.take documentation: it works by first scanning one partition,
 and using the results from that partition to estimate the number of
 additional partitions needed to satisfy the limit. In this case, it will
 trivially stop at the first.


 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition (O'Reilly)
 Typesafe
 @deanwampler
 http://polyglotprogramming.com

 On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote:

 Hello!

 I would like to know what is the optimal solution for getting the header
 from a CSV file with Spark? My approach was:

 def getHeader(data: RDD[String]): String = {
 data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() }

 Thanks.






Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
Good point. There's no guarantee that you'll get the actual first
partition. That's one reason why I wouldn't allow a CSV header line in a real
data file, if I could avoid it.

Back to Spark, a safer approach is RDD.foreachPartition, which takes a
function expecting an iterator. You'll only need to grab the first element
(being careful that the partition isn't empty!) and then determine which of
those first lines has the header info.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 11:12 AM, Sean Owen so...@cloudera.com wrote:

 I think this works in practice, but I don't know that the first block
 of the file is guaranteed to be in the first partition. Certainly
 later down the pipeline that won't be true, but presumably this is
 happening right after reading the file.

 I've always just written some filter that matches only the header,
 which assumes the header can be distinguished from data lines, but it usually can be.

 On Tue, Mar 24, 2015 at 2:41 PM, Dean Wampler deanwamp...@gmail.com
 wrote:
  Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
  read the whole file, use data.take(1), which is simpler.
 
  From the RDD.take documentation: it works by first scanning one
 partition,
  and using the results from that partition to estimate the number of
  additional partitions needed to satisfy the limit. In this case, it will
  trivially stop at the first.
 
 
  Dean Wampler, Ph.D.
  Author: Programming Scala, 2nd Edition (O'Reilly)
  Typesafe
  @deanwampler
  http://polyglotprogramming.com
 
  On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com
 wrote:
 
  Hello!
 
  I would like to know what is the optimal solution for getting the header
  from a CSV file with Spark? My approach was:
 
  def getHeader(data: RDD[String]): String = {
  data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() }
 
  Thanks.
 
 



Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
Instead of data.zipWithIndex().filter(_._2==0), which will cause Spark to
read the whole file, use data.take(1), which is simpler.

From the RDD.take documentation: it works by first scanning one partition,
and using the results from that partition to estimate the number of
additional partitions needed to satisfy the limit. In this case, it will
trivially stop at the first.
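
Concretely, a simplified getHeader along those lines (returns an empty string
if the RDD is empty):

import org.apache.spark.rdd.RDD

def getHeader(data: RDD[String]): String =
  data.take(1).headOption.getOrElse("")   // scans only as many partitions as needed, usually one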


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote:

 Hello!

 I would like to know what is the optimal solution for getting the header
 from a CSV file with Spark? My approach was:

 def getHeader(data: RDD[String]): String = {
 data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() }

 Thanks.