It's actually a set of 2171 S3 files, with an average size of about 18MB.

On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> What do you get for rdd1._jrdd.splits().size()? You might think you’re
> getting > 100 partitions, but it may not be happening.
>
>
> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisv...@gmail.com>
> wrote:
>
>> With the following pseudo-code,
>>
>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>> val rdd2 = rdd1.coalesce(100)
>> val rdd3 = rdd2 map { ... }
>> val rdd4 = rdd3.coalesce(2)
>> rdd4.saveAsTextFile(...) // returns Unit; want only two output files
>>
>> I would expect the parallelism of the map() operation to be 100
>> concurrent tasks, and the parallelism of the save() operation to be 2.
>>
>> However, it appears the parallelism of the entire chain is 2 -- I only
>> see two tasks created for the save() operation and those tasks appear to
>> execute the map() operation as well.
>>
>> Assuming what I'm seeing is as-specified (meaning, how things are meant
>> to be), what's the recommended way to force a parallelism of 100 on the
>> map() operation?
>>
>> thanks!
>>
>>
>>
>
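For reference, this is Spark's documented behavior: coalesce(n) without a shuffle creates a narrow dependency, so the whole upstream chain is folded into the downstream task count (2 here). The usual workaround is to pass shuffle = true to coalesce (or equivalently call repartition), which inserts a stage boundary so the map still runs at the wider parallelism. A hedged sketch, reusing the partition counts from the thread; the S3 paths and the SequenceFile key/value types are placeholders, not from the original code:

```scala
val rdd1 = sc.sequenceFile[String, String]("s3n://bucket/in")  // path/types assumed
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2.map { case (k, v) => /* ... */ (k, v) }

// shuffle = true forces a stage boundary: the map above runs as 100 tasks,
// and only the write after the shuffle runs as 2 tasks.
rdd3.coalesce(2, shuffle = true).saveAsTextFile("s3n://bucket/out")
```

Note the trade-off: the shuffle moves the mapped data across the network, so this only pays off when the map is expensive enough to be worth parallelizing beyond the final task count.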
