The solution I normally use is to call zipWithIndex() and then filter on the index. Filter runs independently on each partition, so each task does O(m) work, where m is the size of its partition, rather than one sequential O(N) pass over the whole RDD.
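
To make the cost argument concrete, here is a rough plain-Python model of the idea. It is not Spark code: the helper names `zip_with_index` and `filter_partitions`, the sample rows, and the index range are all made up for illustration. It only mimics how zipWithIndex() numbers rows across partitions and how a filter then scans each partition on its own.

```python
from typing import Callable, List, Tuple

Partition = List[str]
IndexedPartition = List[Tuple[str, int]]


def zip_with_index(partitions: List[Partition]) -> List[IndexedPartition]:
    """Model of zipWithIndex(): pair every row with its global position,
    counting through the partitions in order."""
    indexed, offset = [], 0
    for part in partitions:
        indexed.append([(row, offset + i) for i, row in enumerate(part)])
        offset += len(part)
    return indexed


def filter_partitions(
    partitions: List[IndexedPartition],
    predicate: Callable[[Tuple[str, int]], bool],
) -> Tuple[List[IndexedPartition], List[int]]:
    """Model of filter(): each partition is scanned independently.
    Also returns how many rows each partition had to look at."""
    kept, scanned = [], []
    for part in partitions:
        kept.append([item for item in part if predicate(item)])
        scanned.append(len(part))
    return kept, scanned


# Hypothetical data: 8 rows spread over 3 partitions.
partitions = [["a", "b", "c"], ["d", "e", "f"], ["g", "h"]]

indexed = zip_with_index(partitions)
kept, scanned = filter_partitions(indexed, lambda pair: 2 <= pair[1] < 5)

# Drop the index again and flatten the surviving rows.
rows = [row for part in kept for row, _ in part]
print(rows)     # ['c', 'd', 'e']
print(scanned)  # [3, 3, 2]
```

Every row is still examined once, but no single partition scans more than its own m rows (3 here, against N = 8), which is the work that runs in parallel on a cluster.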
-Ilya Ganelin

On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel <[email protected]> wrote:

> Problem is I have an RDD of about 10M rows and it keeps growing. Every time
> we want to perform a query and compute on a subset of the data we have to use
> filter and then some aggregation. Here I know filter goes through each
> partition and every row of the RDD, which may not be efficient at all.
>
> With Spark having Ordered RDD functions, I don't see why it's so difficult to
> implement such a function. Cassandra/HBase have had it for years, where they
> can fetch data only from certain partitions based on your rowkey. Scala TreeMap
> has a Range function to do the same.
>
> I think people have been looking for this for a while. I see several posts
> asking this.
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-td20170.html#a26048
>
> By the way, I assume there
> Thanks
> Nirav
