Alternative to doing a naive toArray is to declare an accumulator per partition and use that. It's specifically what they were designed to do. See the programming guide.
Sent with Good (www.good.com) -----Original Message----- From: Tobias Pfeiffer [t...@preferred.jp<mailto:t...@preferred.jp>] Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time To: Kevin Burton Cc: Ganelin, Ilya; user@spark.apache.org Subject: Re: quickly counting the number of rows in a partition? Hi, On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya <ilya.gane...@capitalone.com<mailto:ilya.gane...@capitalone.com>> wrote: Use the mapPartitions function. It returns an iterator to each partition. Then just get that length by converting to an array. On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton <bur...@spinn3r.com<mailto:bur...@spinn3r.com>> wrote: Doesn’t that just read in all the values? The count isn’t pre-computed? It’s not the end of the world if it’s not but would be faster. Well, "converting to an array" may not work due to memory constraints, counting the items in the iterator may be better. However, there is no "pre-computed" value. For counting, you need to compute all values in the RDD, in general. If you think of items.map(x => /* throw exception */).count() then even though the count you want to get does not necessarily require the evaluation of the function in map() (i.e., the number is the same), you may not want to get the count if that code actually fails. Tobias ________________________________________________________ The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.