Thanks, everyone. Both `submitJob` and `PartitionPruningRDD` work for me.

On Thu, Apr 7, 2016 at 8:22 AM, Hemant Bhanawat <hemant9...@gmail.com>
wrote:

> Apparently, there is another way to do it. You can try creating a
> PartitionPruningRDD, passing it a partition filter function. This RDD does
> the same thing that I suggested in my earlier mail, but you will not have
> to write a custom RDD yourself.
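>
> A minimal sketch of that (assuming an existing RDD `rdd` and a target
> partition index `targetPartition`; note PartitionPruningRDD is marked as a
> developer API):
>
> import org.apache.spark.rdd.PartitionPruningRDD
>
> // Tasks are scheduled only for partitions whose index passes the filter.
> val pruned = PartitionPruningRDD.create(rdd, idx => idx == targetPartition)
> pruned.collect()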
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Wed, Apr 6, 2016 at 5:35 PM, Sun, Rui <rui....@intel.com> wrote:
>
>> Maybe you can try SparkContext.submitJob:
>>
>> def submitJob[T, U, R](
>>     rdd: RDD[T],
>>     processPartition: Iterator[T] => U,
>>     partitions: Seq[Int],
>>     resultHandler: (Int, U) => Unit,
>>     resultFunc: => R): SimpleFutureAction[R]
>>
>> RDD: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html
>> SimpleFutureAction: http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SimpleFutureAction.html
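>>
>> A minimal usage sketch (assuming a SparkContext `sc`, an RDD[Int] `rdd`,
>> and a target partition index `targetPartition`; only that partition is
>> computed):
>>
>> import scala.concurrent.Await
>> import scala.concurrent.duration.Duration
>>
>> val results = scala.collection.mutable.ArrayBuffer[Array[Int]]()
>> val job = sc.submitJob(
>>   rdd,
>>   (it: Iterator[Int]) => it.toArray,             // processPartition
>>   Seq(targetPartition),                          // run on this partition only
>>   (_: Int, data: Array[Int]) => results += data, // resultHandler
>>   results.flatten.toSeq                          // resultFunc, evaluated on completion
>> )
>> val sample = Await.result(job, Duration.Inf)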
>>
>> From: Hemant Bhanawat [mailto:hemant9...@gmail.com]
>> Sent: Wednesday, April 6, 2016 7:16 PM
>> To: Andrei <faithlessfri...@gmail.com>
>> Cc: user <user@spark.apache.org>
>> Subject: Re: How to process one partition at a time?
>>
>> Instead of doing it in compute, you could override the getPartitions
>> method of your RDD and return only the target partitions. That way, tasks
>> are created only for the target partitions; currently, in your case, tasks
>> are created for all partitions.
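>>
>> A minimal sketch of that idea (class and parameter names here are
>> illustrative; the built-in PartitionPruningRDD takes the same approach and
>> additionally fixes up the parent dependency, which this sketch glosses
>> over):
>>
>> import scala.reflect.ClassTag
>> import org.apache.spark.{Partition, TaskContext}
>> import org.apache.spark.rdd.RDD
>>
>> // Wraps a kept parent partition, re-indexed from 0, since Spark
>> // expects partitions(i).index == i.
>> class PrunedPartition(override val index: Int, val parent: Partition)
>>   extends Partition
>>
>> class PrunedRDD[T: ClassTag](prev: RDD[T], keep: Int => Boolean)
>>   extends RDD[T](prev) {
>>
>>   override def getPartitions: Array[Partition] =
>>     prev.partitions.filter(p => keep(p.index)).zipWithIndex
>>       .map { case (p, i) => new PrunedPartition(i, p): Partition }
>>
>>   override def compute(split: Partition, context: TaskContext): Iterator[T] =
>>     prev.iterator(split.asInstanceOf[PrunedPartition].parent, context)
>> }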
>>
>> I hope this helps. I would like to hear about it if you take some other
>> approach.
>>
>> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>> www.snappydata.io
>>
>> On Wed, Apr 6, 2016 at 3:49 PM, Andrei <faithlessfri...@gmail.com> wrote:
>>
>> I'm writing a kind of sampler which in most cases will require only one
>> partition, sometimes two, and very rarely more. So it doesn't make sense
>> to process all partitions in parallel. What is the easiest way to limit
>> computation to a single partition?
>>
>> So far the best idea I have come up with is to create a custom RDD whose
>> `compute` method looks something like:
>>
>> def compute(split: Partition, context: TaskContext): Iterator[T] = {
>>     if (split.index == targetPartition) {
>>         // do computation
>>     } else {
>>         Iterator.empty  // return an empty iterator for all other partitions
>>     }
>> }
>>
>> But it's quite ugly, and I'm unlikely to be the first person with such a
>> need. Is there an easier way to do it?