It's actually a set of 2171 S3 files, with an average size of about 18MB.
On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> What do you get for rdd1._jrdd.splits().size()? You might think you're
> getting > 100 partitions, but it may not be happening.
>
> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisv...@gmail.com> wrote:
>
>> With the following pseudo-code,
>>
>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>> val rdd2 = rdd1.coalesce(100)
>> val rdd3 = rdd2 map { ... }
>> val rdd4 = rdd3.coalesce(2)
>> val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
>>
>> I would expect the parallelism of the map() operation to be 100
>> concurrent tasks, and the parallelism of the save() operation to be 2.
>>
>> However, it appears the parallelism of the entire chain is 2 -- I only
>> see two tasks created for the save() operation, and those tasks appear
>> to execute the map() operation as well.
>>
>> Assuming what I'm seeing is as specified (meaning, how things are meant
>> to be), what's the recommended way to force a parallelism of 100 on the
>> map() operation?
>>
>> thanks!
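For what it's worth, the behavior described is expected: coalesce() without a shuffle creates a narrow dependency, so Spark pipelines the map() into the 2 downstream tasks. One way to force the map() to run at a parallelism of 100 is to put a shuffle boundary after it, using the optional shuffle flag on coalesce() (or equivalently repartition()). A sketch in the same pseudo-code style as above:

```scala
val rdd1 = sc.sequenceFile(...)           // has > 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { ... }
// shuffle = true inserts a stage boundary, so the map() runs as
// 100 tasks and only the post-shuffle save runs as 2 tasks:
val rdd4 = rdd3.coalesce(2, shuffle = true)
rdd4.saveAsTextFile(...)                  // still only two output files
```

The trade-off is that the shuffle materializes and moves the map() output across the network, so whether it's a win depends on how expensive the map() is relative to shuffling ~40GB of data.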