You have 8 files, not 8 partitions. It does not follow that they should be
read as 8 partitions: they are presumably large, and you would be stuck
using at most 8 tasks to process them in parallel. The number of
partitions is determined by the Hadoop input splits, which generally means
one partition per block of data. If you know that this number is too high,
you can request a number of partitions when you read the data. Don't
coalesce afterwards; just read the desired number from the start.
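
A minimal sketch of the read side (note that the second argument to
objectFile is minPartitions, a hint passed down to the underlying Hadoop
input format, so the actual count still depends on the input splits):

import org.apache.spark.mllib.regression.LabeledPoint

// minPartitions is a suggested minimum number of input splits;
// the actual partition count comes from the Hadoop splits.
val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
println(rdd2.partitions.size)  // inspect the actual partition count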
On Jan 21, 2015 4:32 PM, "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com> wrote:

>  Why sc.objectFile(…) return a Rdd with thousands of partitions?
>
>
>
> I save an RDD to the file system using
>
>
>
> rdd.saveAsObjectFile("file:///tmp/mydir")
>
>
>
> Note that the RDD contains 7 million objects. I checked the directory
> /tmp/mydir/; it contains 8 part files
>
>
>
> part-00000  part-00002  part-00004  part-00006  _SUCCESS
>
> part-00001  part-00003  part-00005  part-00007
>
>
>
> I then load the RDD back using
>
>
>
> val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)
>
>
>
> I expect rdd2 to have 8 partitions, but from the master UI I see that
> rdd2 has over 1000 partitions. This is very inefficient. How can I limit it
> to 8 partitions, matching what is stored on the file system?
>
>
>
> Regards,
>
>
>
> *Ningjun Wang*
>
> Consulting Software Engineer
>
> LexisNexis
>
> 121 Chanlon Road
>
> New Providence, NJ 07974-1541
>
>
>
