Sorry, but the accumulator is still going to require you to walk through the RDD to get an accurate count, right? Its not being persisted?
On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya <[email protected]> wrote: > Alternative to doing a naive toArray is to declare an accumulator per > partition and use that. It's specifically what they were designed to do. See > the programming guide. > > > > Sent with Good (www.good.com) > > > -----Original Message----- > From: Tobias Pfeiffer [[email protected]] > Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time > To: Kevin Burton > Cc: Ganelin, Ilya; [email protected] > Subject: Re: quickly counting the number of rows in a partition? > > Hi, > > On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya <[email protected]> > wrote: > Use the mapPartitions function. It returns an iterator to each partition. > Then just get that length by converting to an array. > > On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton <[email protected]> wrote: > Doesn’t that just read in all the values? The count isn’t pre-computed? It’s > not the end of the world if it’s not but would be faster. > > Well, "converting to an array" may not work due to memory constraints, > counting the items in the iterator may be better. However, there is no > "pre-computed" value. For counting, you need to compute all values in the > RDD, in general. If you think of > > items.map(x => /* throw exception */).count() > > then even though the count you want to get does not necessarily require the > evaluation of the function in map() (i.e., the number is the same), you may > not want to get the count if that code actually fails. > > Tobias > > The information contained in this e-mail is confidential and/or proprietary > to Capital One and/or its affiliates. The information transmitted herewith is > intended only for use by the individual or entity to which it is addressed. > If the reader of this message is not the intended recipient, you are hereby > notified that any review, retransmission, dissemination, distribution, > copying or other use of, or taking of any action in reliance upon this > information is strictly prohibited. If you have received this communication > in error, please contact the sender and delete the material from your > computer.
