Re: quickly counting the number of rows in a partition?

2015-01-14 Thread Ilya Ganelin
> > > > Sent with Good (www.good.com) > > > -Original Message- > *From: *Tobias Pfeiffer [t...@preferred.jp] > *Sent: *Tuesday, January 13, 2015 08:06 PM Eastern Standard Time > *To: *Kevin Burton > *Cc: *Ganelin, Ilya; user@spark.apache.org > *Subject: *Re: quickly

Re: quickly counting the number of rows in a partition?

2015-01-14 Thread Michael Segel
me > To: Kevin Burton > Cc: Ganelin, Ilya; user@spark.apache.org > Subject: Re: quickly counting the number of rows in a partition? > > Hi, > > On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya > wrote: > Use the mapPartitions function. It returns an iterator to ea

RE: quickly counting the number of rows in a partition?

2015-01-13 Thread Ganelin, Ilya
mailto:t...@preferred.jp>] Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time To: Kevin Burton Cc: Ganelin, Ilya; user@spark.apache.org Subject: Re: quickly counting the number of rows in a partition? Hi, On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya mailto:ilya.gane...@capitalone.com>

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi again,

On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer wrote:
> If you think of
>   items.map(x => /* throw exception */).count()
> then even though the count you want to get does not necessarily require
> the evaluation of the function in map() (i.e., the number is the same), you
> may n

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi,

On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya wrote:
> Use the mapPartitions function. It returns an iterator to each partition.
> Then just get that length by converting to an array.

On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton wrote:
> Doesn’t that just read in all the values? The

Re: quickly counting the number of rows in a partition?

2015-01-12 Thread Kevin Burton
length by converting to an array. > > > > Sent with Good (www.good.com) > > > > -Original Message- > *From: *Kevin Burton [bur...@spinn3r.com] > *Sent: *Monday, January 12, 2015 09:55 PM Eastern Standard Time > *To: *user@spark.apache.org > *Subject:

Re: quickly counting the number of rows in a partition?

2015-01-12 Thread Sven Krasser
Yes, using mapPartitionsWithIndex, e.g. in PySpark:

>>> sc.parallelize(xrange(0,1000), 4).mapPartitionsWithIndex(lambda idx,iter: ((idx, len(list(iter))),)).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

(This is not the most efficient way to get the length of an iterator, but you get the ide
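For anyone worried about len(list(iter)) pulling a whole partition into memory, here is a minimal sketch of one alternative: count the records as the iterator streams past instead of building a list. The helper name count_partition and the stand-in data are made up for illustration and are not from the thread. The Spark call in the trailing comment assumes a live SparkContext named sc, which this snippet does not create.

```python
def count_partition(index, records):
    """Yield a single (partition_index, record_count) pair.

    `records` is the per-partition iterator that mapPartitionsWithIndex
    hands to its callback. Nothing is collected into a list.
    """
    # sum() pulls one record at a time, so memory use stays constant
    # regardless of how many records the partition holds.
    yield index, sum(1 for _ in records)


# Stand-alone check: plain Python iterators stand in for Spark partitions
# (hypothetical data, four equal slices of range(1000)).
fake_partitions = [iter(range(start, start + 250)) for start in range(0, 1000, 250)]
counts = [
    pair
    for index, part in enumerate(fake_partitions)
    for pair in count_partition(index, part)
]
print(counts)

# With a SparkContext `sc` available (an assumption, not set up here),
# the same function could be passed straight to the RDD:
#   sc.parallelize(range(1000), 4).mapPartitionsWithIndex(count_partition).collect()
```

This still has to iterate over every record in the partition, so it avoids the memory cost of materializing a list but not the cost of reading the data.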

quickly counting the number of rows in a partition?

2015-01-12 Thread Kevin Burton
Is there a way to compute the total number of records in each RDD partition?

So say I had 4 partitions.. I’d want to have

partition 0: 100 records
partition 1: 104 records
partition 2: 90 records
partition 3: 140 records

Kevin

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: htt