Another option is to save your data in a layout that can be selected with
file path patterns, e.g. by hour, zone, etc., so that you only load what you
need.
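
As a rough illustration of the idea, here is a minimal sketch in plain Python
rather than Spark. The hour=/zone= directory layout, the part.txt file name and
the helper names write_partitioned and load_matching are all made up for the
example; the only point is that a path pattern picks out the files to read, so
the rest are never opened.

```python
import tempfile
from pathlib import Path


def write_partitioned(root, records):
    """Append each (hour, zone, line) record to root/hour=HH/zone=Z/part.txt.

    The directory naming scheme is an assumption made for this sketch.
    """
    for hour, zone, line in records:
        part_dir = Path(root) / f"hour={hour:02d}" / f"zone={zone}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part.txt", "a") as f:
            f.write(line + "\n")


def load_matching(root, pattern):
    """Read only the files under root whose relative path matches the glob."""
    lines = []
    for path in sorted(Path(root).glob(pattern)):
        lines.extend(path.read_text().splitlines())
    return lines


root = tempfile.mkdtemp()
write_partitioned(root, [
    (9, "east", "a"),
    (9, "west", "b"),
    (10, "east", "c"),
])

# One hour and one zone: a single file is read.
east_at_9 = load_matching(root, "hour=09/zone=east/part.txt")

# Every hour, one zone: the west files are never opened.
all_east = load_matching(root, "hour=*/zone=east/part.txt")
```

In Spark, a similar glob can be passed as the path to sc.textFile, so only the
matching files are loaded into the RDD in the first place.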
On Jan 24, 2016 10:00 PM, "Ilya Ganelin" <[email protected]> wrote:

> The solution I normally use is to call zipWithIndex() and then filter on the
> index. Filter runs on each partition in parallel, so each task does O(m) work,
> where m is the size of its partition, rather than a single O(N) pass.
>
> -Ilya Ganelin
>
> On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel <[email protected]>
> wrote:
>
>> The problem is that I have an RDD of about 10M rows and it keeps growing.
>> Every time we want to query and compute on a subset of the data, we have to
>> use filter and then some aggregation. As far as I know, filter goes through
>> every partition and every row of the RDD, which may not be efficient at all.
>>
>> Since Spark has ordered RDD functions, I don't see why it's so difficult to
>> implement such a function. Cassandra/HBase have had it for years: they can
>> fetch data only from certain partitions based on your row key. Scala's
>> TreeMap has a range function that does the same.
>>
>> I think people have been looking for this for a while. I see several posts
>> asking about it.
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-td20170.html#a26048
>>
>> By the way, I assume there
>> Thanks
>> Nirav
>>