Re: skipping header from each file

2015-01-09 Thread Sean Owen
I think this was already answered on Stack Overflow:
http://stackoverflow.com/questions/27854919/skipping-header-file-from-each-csv-file-in-spark
One additional idea:


If there is just one header line, in the first record, then the most
efficient way to filter it out is:

rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

Of course, this doesn't help if there are many files, each with its own
header line. In that case you can build one RDD per file this way and
union the three.
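
A minimal sketch of that per-file variant, assuming each path names a
single file whose first line is the header. The object and method names
(SkipHeaders, dropFirstLine, readWithoutHeaders) are made up for
illustration; only textFile, mapPartitionsWithIndex and union are Spark
calls.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object SkipHeaders {
  // Drops the first element of partition 0 only; other partitions pass through.
  def dropFirstLine(idx: Int, iter: Iterator[String]): Iterator[String] =
    if (idx == 0) iter.drop(1) else iter

  // Reads each path as its own RDD, strips the first line of that RDD,
  // then unions the per-file results into one RDD.
  // Assumption: each path is one file, so its header sits in partition 0.
  def readWithoutHeaders(sc: SparkContext, paths: Seq[String]): RDD[String] = {
    val perFile = paths.map { path =>
      sc.textFile(path).mapPartitionsWithIndex(dropFirstLine _)
    }
    sc.union(perFile)
  }
}
```

Reading the files one by one matters here: when all paths go into a
single textFile call, only the very first partition has index 0, so the
headers of the second and third files would stay in the data.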


On Fri, Jan 9, 2015 at 6:18 AM, Hafiz Mujadid <hafizmujadi...@gmail.com> wrote:
 Suppose I give three file paths to the Spark context to read, and each file
 has its schema in the first row. How can we skip the schema line at the
 head of each file?


 val rdd = sc.textFile("file1,file2,file3")



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/skipping-header-from-each-file-tp21051.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





RE: skipping header from each file

2015-01-09 Thread Somnath Pandeya
Maybe you can use the wholeTextFiles method, which returns the file name and
content of each file as a pair RDD; you can then remove the first line from
each file's content.
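
A rough sketch of this idea, assuming the files are small enough to be
held in memory one at a time (wholeTextFiles loads each file as a single
string). WholeFileHeaders and its methods are hypothetical names used
only for this example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object WholeFileHeaders {
  // Splits one file's content into lines and discards the first (header) line.
  def linesWithoutHeader(content: String): Iterator[String] =
    content.split("\\r?\\n").iterator.drop(1)

  // wholeTextFiles yields (fileName, content) pairs; the file name is
  // ignored and every file contributes its lines minus the header.
  def readWithoutHeaders(sc: SparkContext, paths: String): RDD[String] =
    sc.wholeTextFiles(paths).flatMap { case (_, content) =>
      linesWithoutHeader(content)
    }
}
```

Because each file is handled as a unit, the header is removed per file
no matter how many files the path matches, at the cost of reading every
file whole.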



-----Original Message-----
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com]
Sent: Friday, January 09, 2015 11:48 AM
To: user@spark.apache.org
Subject: skipping header from each file

Suppose I give three file paths to the Spark context to read, and each file
has its schema in the first row. How can we skip the schema line at the
head of each file?


val rdd = sc.textFile("file1,file2,file3")






Re: skipping header from each file

2015-01-08 Thread Akhil Das
Did you try something like:

val file = sc.textFile("/home/akhld/sigmoid/input")

val skipped = file.filter(row => !row.contains("header"))

skipped.take(10).foreach(println)
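
A variant of the same filter, sketched under the assumption that every
file starts with the identical header line: take the first line of the
RDD as the header and drop every row equal to it. FilterHeaders, isData
and dropHeaderLines are illustrative names, not Spark API.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object FilterHeaders {
  // True for rows that differ from the header line.
  def isData(header: String)(row: String): Boolean = row != header

  // Uses the first line of the RDD as the header and filters out every
  // row equal to it. Assumption: all files share that exact header.
  // Note: first() throws on an empty RDD.
  def dropHeaderLines(lines: RDD[String]): RDD[String] = {
    val header = lines.first()
    lines.filter(isData(header) _)
  }
}
```

An exact comparison avoids dropping data rows that merely contain the
header text, but it still removes any data row that happens to be
identical to the header.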

Thanks
Best Regards

On Fri, Jan 9, 2015 at 11:48 AM, Hafiz Mujadid <hafizmujadi...@gmail.com>
wrote:

 Suppose I give three file paths to the Spark context to read, and each file
 has its schema in the first row. How can we skip the schema line at the
 head of each file?


 val rdd = sc.textFile("file1,file2,file3")


