If sampling whole partitions (or part of a partition) is sufficient, then sure, you could use mapPartitionsWithIndex, decide from each partition's index whether to process it at all, and skip the rest. That's much faster.
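
A rough sketch of that idea in Python. The helper names choose_partitions and keep_only, the 10% fraction and the seed are just for illustration, and the Spark calls at the bottom assume you already have an RDD called rdd:

```python
import random


def choose_partitions(num_partitions, fraction, seed=None):
    """Pick a random subset of partition indices (at least one when any exist)."""
    k = min(num_partitions, max(1, int(round(num_partitions * fraction))))
    return set(random.Random(seed).sample(range(num_partitions), k))


def keep_only(chosen):
    """Build a function suitable for mapPartitionsWithIndex."""
    def per_partition(index, records):
        # Chosen partitions pass through unchanged; for the others we return
        # an empty iterator and leave the input iterator unconsumed.
        return records if index in chosen else iter(())
    return per_partition


# With Spark (hypothetical usage, assuming an existing RDD named `rdd`):
# chosen = choose_partitions(rdd.getNumPartitions(), fraction=0.1, seed=42)
# sampled = rdd.mapPartitionsWithIndex(keep_only(chosen))
```

Note that a task is still scheduled for every partition; the skipped ones just return nothing.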
On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:
> I don't need it to be 100% random. How about randomly picking a few partitions and
> returning all docs in those partitions? Is
> rdd.mapPartitionsWithIndex() the right method to use to process just a small
> portion of the partitions?
>
> Ningjun