Re: rdd.sample() methods very slow

Sean Owen Thu, 21 May 2015 11:37:20 -0700

If sampling whole partitions (or part of a partition) is sufficient,
then sure, you could use mapPartitionsWithIndex and decide whether to
process a partition at all based on its index, skipping the rest.
That's much faster.
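
A minimal sketch of that idea in PySpark, assuming you want k partitions picked uniformly at random. The helper names (choose_partitions, keep_if_chosen, sample_partitions) are made up for illustration and are not part of Spark; only getNumPartitions() and mapPartitionsWithIndex() are RDD methods.

```python
import random


def choose_partitions(num_partitions, k, seed=None):
    """Pick k distinct partition indices uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    return set(rng.sample(range(num_partitions), min(k, num_partitions)))


def keep_if_chosen(chosen):
    """Build a function suitable for rdd.mapPartitionsWithIndex (hypothetical helper)."""
    def f(index, iterator):
        # Chosen partitions are passed through unchanged; for the others we
        # return an empty iterator without iterating over their records.
        return iterator if index in chosen else iter(())
    return f


def sample_partitions(rdd, k, seed=None):
    """Return an RDD holding every record of k randomly chosen partitions."""
    chosen = choose_partitions(rdd.getNumPartitions(), k, seed)
    return rdd.mapPartitionsWithIndex(keep_if_chosen(chosen))
```

Usage would be something like sample_partitions(docs, 5), where docs is your RDD. Note that the result is not a uniform sample of records: if the data is ordered or clustered by partition, the sample inherits that bias.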


On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
<ningjun.w...@lexisnexis.com> wrote:
> I don't need it to be 100% random. How about randomly picking a few
> partitions and returning all docs in those partitions? Is
> rdd.mapPartitionsWithIndex() the right method to use to process just a
> small portion of the partitions?
>
> Ningjun

