RDD.tail()

2014-04-14 Thread Philip Ogren
Has there been any thought to adding a tail() method to RDD?  It would 
be really handy to skip over the first item in an RDD when it contains 
header information.  Even better would be a drop(int) function that 
would allow you to skip over several lines of header information.  Our 
attempts to do something equivalent with a filter() call seem a bit 
contorted.  Any thoughts?


Thanks,
Philip


Re: RDD.tail()

2014-04-14 Thread Ethan Jewett
We have similar needs but IIRC, I came to the conclusion that this would
only work on ordered RDDs, and then you would still have to figure out
which partition is the first one. I ended up deciding it would be best to
just drop the header lines from a Scala iterator before creating an RDD
based on it. Not sure if this was the right thing to do, but would that
work for you?

Regards,
Ethan
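
A minimal sketch of the approach Ethan describes, dropping header lines from a plain Scala iterator before any RDD exists. The object and helper names (`DropBeforeRdd`, `withoutHeader`) and the sample lines are made up for illustration, and the Spark call appears only in a comment, assuming a SparkContext named `sc`.

```scala
object DropBeforeRdd {
  // Hypothetical helper: discard the first `n` header lines from a line iterator.
  def withoutHeader(lines: Iterator[String], n: Int): Iterator[String] =
    lines.drop(n)

  def main(args: Array[String]): Unit = {
    // Illustrative input: two header lines followed by two records.
    val raw = Iterator("# sample export", "name,age", "ann,31", "bob,27")

    // Materialize the remaining records on the driver.
    val records = withoutHeader(raw, 2).toList
    records.foreach(println)

    // Assuming a SparkContext named `sc`, the RDD would then be built with:
    //   val rdd = sc.parallelize(records)
  }
}
```

Because the records are collected on the driver before being parallelized, this probably only suits inputs that fit in driver memory.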


On Mon, Apr 14, 2014 at 10:24 AM, Philip Ogren philip.og...@oracle.com wrote:

 Has there been any thought to adding a tail() method to RDD?  It would be
 really handy to skip over the first item in an RDD when it contains header
 information.  Even better would be a drop(int) function that would allow
 you to skip over several lines of header information.  Our attempts to do
 something equivalent with a filter() call seem a bit contorted.  Any
 thoughts?

 Thanks,
 Philip



Re: RDD.tail()

2014-04-14 Thread Matei Zaharia
You can use mapPartitionsWithIndex and look at the partition index (0 will be 
the first partition) to decide whether to skip the first line.

Matei
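
A rough sketch of the mapPartitionsWithIndex suggestion above. The `dropHeader` helper is hypothetical, not part of Spark, and the partitions are simulated with plain Scala sequences so the example runs without a cluster. The real call is shown in a comment and assumes an RDD[String] named `lines`.

```scala
object DropHeader {
  // Hypothetical helper: builds a function with the shape that
  // mapPartitionsWithIndex expects. It skips the first `n` elements
  // of partition 0 and passes every other partition through unchanged.
  def dropHeader[T](n: Int): (Int, Iterator[T]) => Iterator[T] =
    (index, iter) => if (index == 0) iter.drop(n) else iter

  def main(args: Array[String]): Unit = {
    // Stand-in for an RDD's partitions; the header sits in partition 0.
    val partitions = Seq(
      Seq("name,age", "ann,31"),
      Seq("bob,27", "cy,45")
    )

    val skip = dropHeader[String](1)
    val result = partitions.zipWithIndex.flatMap { case (part, idx) =>
      skip(idx, part.iterator)
    }
    result.foreach(println)

    // With a real RDD (assuming `lines: RDD[String]`), the call would be:
    //   val body = lines.mapPartitionsWithIndex(dropHeader[String](1))
  }
}
```

This assumes all header lines fall inside partition 0. If the RDD is built from several files that each carry their own header, only the header in the first partition would be skipped.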

On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote:

 We have similar needs but IIRC, I came to the conclusion that this would only 
 work on ordered RDDs, and then you would still have to figure out which 
 partition is the first one. I ended up deciding it would be best to just drop 
 the header lines from a Scala iterator before creating an RDD based on it. 
 Not sure if this was the right thing to do, but would that work for you?
 
 Regards,
 Ethan
 
 