Re: skipping header from each file
I think this was already answered on Stack Overflow: http://stackoverflow.com/questions/27854919/skipping-header-file-from-each-csv-file-in-spark

One additional idea: if there were just one header line, in the first record, then the most efficient way to filter it out is:

rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }

Of course, this doesn't help if there are many files with many header lines inside. In that case you can build one RDD per file this way and union the three.

On Fri, Jan 9, 2015 at 6:18 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote:

Suppose I give three file paths to the Spark context to read, and each file has its schema in the first row. How can we skip the schema lines (the headers)?

val rdd=sc.textFile(file1,file2,file3);

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/skipping-header-from-each-file-tp21051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
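The per-file variant suggested in the reply above could be sketched roughly as follows. This is only a minimal sketch: the object and method names (SkipHeaderPerFile, dropFirstLine, readWithoutHeader, readAllWithoutHeaders) are made up for illustration, and it assumes each file has exactly one header line and that the list of paths is not empty.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; none of these names come from Spark itself.
object SkipHeaderPerFile {

  // Drops the first element of partition 0 only; all other partitions pass through.
  def dropFirstLine(idx: Int, iter: Iterator[String]): Iterator[String] =
    if (idx == 0) iter.drop(1) else iter

  // Reads a single file and removes its first line (assumed to be the header).
  def readWithoutHeader(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path).mapPartitionsWithIndex(dropFirstLine _)

  // Reads each file separately, strips its header, and unions the results.
  def readAllWithoutHeaders(sc: SparkContext, paths: Seq[String]): RDD[String] =
    sc.union(paths.map(p => readWithoutHeader(sc, p)))
}
```

With the three paths from the question this would be called as SkipHeaderPerFile.readAllWithoutHeaders(sc, Seq(file1, file2, file3)). Reading the files one by one matters here: partition 0 is only guaranteed to start with the header when the RDD was built from a single file.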
RE: skipping header from each file
Maybe you can use the wholeTextFiles method, which returns the file name and the content of each file as a pair RDD; you can then remove the first line from each file.
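A rough sketch of the wholeTextFiles idea might look like the following. The names SkipHeaderWholeFiles, linesWithoutHeader and read are hypothetical, and the sketch assumes one header line per file. Note that wholeTextFiles loads each file into memory as a single string, so this is probably only reasonable for fairly small files.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are for illustration only.
object SkipHeaderWholeFiles {

  // Splits one file's content into lines and drops the first line.
  def linesWithoutHeader(content: String): Iterator[String] =
    content.split("\r?\n").iterator.drop(1)

  // wholeTextFiles yields (fileName, content) pairs; the file name is ignored here.
  def read(sc: SparkContext, paths: String): RDD[String] =
    sc.wholeTextFiles(paths).flatMap { case (_, content) =>
      linesWithoutHeader(content)
    }
}
```

Here paths is a directory or a comma-separated list of paths, passed as one string.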
Re: skipping header from each file
Did you try something like:

val file = sc.textFile("/home/akhld/sigmoid/input")
val skipped = file.filter(row => !row.contains("header"))
skipped.take(10).foreach(println)

Thanks
Best Regards
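A possible variation on the filter approach is to compare each row against the actual header line rather than a substring, so that data rows which merely contain the header text are kept. This is a sketch under the assumption that all files share the same header line and that the input is not empty (first() fails on an empty RDD); the names SkipHeaderByFilter, isData and read are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are for illustration only.
object SkipHeaderByFilter {

  // A line counts as data when it differs from the header line.
  def isData(header: String, line: String): Boolean = line != header

  // Takes the first line as the header, then filters out every line equal to it.
  def read(sc: SparkContext, paths: String): RDD[String] = {
    val lines = sc.textFile(paths)
    val header = lines.first()
    lines.filter(line => isData(header, line))
  }
}
```

Since textFile accepts a comma-separated list of paths in one string, all three files can be read in a single call. The trade-off is that every row is compared against the header, and any data row identical to the header would be dropped as well.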