partitionedSource is an RDD, right? If so, then
partitionedSource.count should return the number of elements in the
RDD, regardless of how many partitions it’s split into.

If you want to count the number of elements per partition, you’ll need to
use RDD.mapPartitions, I believe.
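
For example, here's a minimal sketch (assuming a SparkContext named sc
is in scope; the numbers are just illustrative):

    // 88 elements split into 2 slices
    val rdd = sc.parallelize(1 to 88, 2)

    // count returns the total number of elements across all partitions
    println(rdd.count)  // 88

    // mapPartitions can emit one value per partition, e.g. its size
    val partitionSizes = rdd.mapPartitions(iter => Iterator(iter.size))
    println(partitionSizes.collect.mkString(", "))  // e.g. 44, 44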


On Sat, May 24, 2014 at 10:18 AM, Wisc Forum <wiscfo...@gmail.com> wrote:

> Hi, dear user group:
>
> I recently try to use the parallelize method of SparkContext to slice
> original data into small pieces for further handling. Something like the
> below:
>
> val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize)
>
> The size of my original testing data is 88 objects.
>
> I know the default value (if I don't specify the sparkPartitionSize
> value) of numSlices is 10.
>
> What happens is that when I specify numSlices to be 2 (as I use 2
> slave nodes) and do something like this:
> println("partitionedSource.count: " + partitionedSource.count)
>
>
> The output is partitionedSource.count: 44. The subtasks, though, are
> correctly created as 2.
>
> My intention is to get two slices where each slice has 44 objects, and
> thus partitionedSource.count should be 2, shouldn't it? So does this
> result of 44 mean that I have 44 slices, or 44 objects in each slice?
> How can the second case be? What if I have 89 objects? Maybe I didn't
> use it correctly?
>
> Can somebody help me on this?
>
> Thanks,
> Xiao Bing
>
