Couldn't you use a CASE statement to generate a virtual column (like 
partition_num), and then use an analytic SQL function partitioned by this 
virtual column? In that case, the full dataset would only be scanned once.
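Something along these lines (a rough sketch only -- the table name, the value 
column and avg() are my own assumptions, the data is assumed to be registered 
as a temp table, and window functions in Spark 1.x need a HiveContext if I 
remember correctly):

    // Derive partition_num once in a subquery, then partition the
    // analytic function by it; the data is only scanned a single time.
    val bucketed = sqlContext.sql("""
      SELECT key, value, partition_num,
             avg(value) OVER (PARTITION BY partition_num) AS bucket_avg
      FROM (SELECT key, value,
                   CASE WHEN key BETWEEN  1 AND  10 THEN 1
                        WHEN key BETWEEN 11 AND  20 THEN 2
                        -- ... remaining buckets ...
                        WHEN key BETWEEN 91 AND 100 THEN 10
                   END AS partition_num
            FROM my_table) t""")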

Yong

Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: user@spark.apache.org

Hi,
I have a pretty large data set (2M entities) in my RDD. The data is already 
partitioned by a specific key, and the key has a range (type long). Now I want 
to create a bunch of key buckets. For example, if the key range is
    1 -> 100,
I will break the whole range into the buckets below:
    1 -> 10
    11 -> 20
    ...
    90 -> 100
I want to run some analytic SQL functions over the data owned by each key 
range, so I came up with two approaches (both sketched below):
1) Run RDD's filter() on the full data set RDD; each filter creates the RDD 
corresponding to one key bucket, and from each of those RDDs I can create a 
DataFrame and run the SQL.

2) Create a DataFrame for the whole RDD and use a bunch of SQL queries to do the job:
    SELECT * FROM XXXX WHERE key >= key1 AND key < key2
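Roughly, just to make the two options concrete (the variable names are made 
up; "rdd" holds (key, value) pairs and the same data is registered as the 
temp table XXXX):

    // 1) one filtered RDD -> DataFrame per key bucket
    val bucketStarts = 1L to 100L by 10L          // 1, 11, ..., 91
    val perBucketDFs = bucketStarts.map { lo =>
      val bucketRdd = rdd.filter { case (key, _) => key >= lo && key <= lo + 9 }
      sqlContext.createDataFrame(bucketRdd).toDF("key", "value")
    }

    // 2) one DataFrame for everything, one WHERE-clause query per bucket
    val perBucketQueries = bucketStarts.map { lo =>
      sqlContext.sql(s"SELECT * FROM XXXX WHERE key >= $lo AND key <= ${lo + 9}")
    }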
So my question is: which one is better from a performance perspective?
Thanks
-- 
--Anfernee
