Re: How to filter a sorted RDD

Mark Hamstra Mon, 04 Nov 2013 09:20:03 -0800

scala> val rdd = sc.parallelize(1 to 1000000, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12


scala> rdd.mapPartitions { itr => itr.takeWhile(_ < 10) }.count
.
.
.
res1: Long = 9

In the first partition, 9 elements pass the predicate and the iteration
stops at the 10th, so the rest of that partition is never examined; in
each of the other three partitions, only the first element is examined.
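
To make the element counts above concrete, here is a minimal plain-Scala
sketch that needs no Spark. It models each partition as an Iterator, which
is what the function passed to mapPartitions receives, and counts how often
the predicate runs. The names TakeWhileSketch, scanPartition and partitions
are hypothetical, and the even four-way split of 1 to 1000000 is an
assumption that mirrors the parallelize call above.

```scala
// Plain-Scala sketch (no Spark required). Each "partition" is an Iterator,
// like the argument of the function passed to mapPartitions.
object TakeWhileSketch {
  // Hypothetical helper: runs takeWhile over one partition and returns
  // (number of elements kept, number of predicate evaluations).
  def scanPartition(part: Iterator[Int], limit: Int): (Int, Int) = {
    var examined = 0
    val kept = part.takeWhile { x => examined += 1; x < limit }.size
    (kept, examined)
  }

  // Splits 1..n into `parts` contiguous ranges.
  // Assumption: `parts` divides `n` evenly.
  def partitions(n: Int, parts: Int): Seq[Iterator[Int]] = {
    val size = n / parts
    (0 until parts).map(i => (i * size + 1 to (i + 1) * size).iterator)
  }

  // One (kept, examined) pair per partition.
  val results: Seq[(Int, Int)] =
    partitions(1000000, 4).map(scanPartition(_, 10))
}
```

With these inputs, results holds (9, 10) for the first partition and (0, 1)
for each of the other three. The kept counts sum to 9, matching res1. Note
that takeWhile gives the same answer as filter only because the data is
sorted, so the predicate never becomes true again once it has failed.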



On Sun, Nov 3, 2013 at 11:32 PM, Xiang Huo <[email protected]> wrote:

> Hi Mark,
>
> Could you give me more detail on how to short-circuit the
> filtering?
>
> Thanks.
>
> Xiang
>
>
> 2013/11/4 Mark Hamstra <[email protected]>
>
>> You could short-circuit the filtering within the iterator function
>> supplied to mapPartitions.
>>
>>
>> On Sunday, November 3, 2013, Xiang Huo wrote:
>>
>>> Hi all,
>>>
>>> I am trying to filter a smaller RDD data set out of a large RDD data set,
>>> and the large one is sorted. Is there any way to make the filter method
>>> stop checking every element in the RDD and instead discard all remaining
>>> elements as soon as one element fails the filter condition?
>>> Because the large data set is sorted, once one element fails the
>>> condition, none of the following elements can meet it.
>>> Checking them one by one takes a relatively long time.
>>> So is there any way to save time on this part?
>>>
>>> Thanks,
>>>
>>> Xiang
>>>
>>> --
>>> Xiang Huo
>>> Department of Computer Science
>>> University of Illinois at Chicago(UIC)
>>> Chicago, Illinois
>>> US
>>> Email: [email protected]
>>>            or [email protected]
>>>
>>
>
>
