RE: RDD's filter() or using 'where' condition in SparkSQL
You can do the SQL like following: select *, case when key >= 1 and key <=10 then 1 when key >= 11 and key <= 20 then 2 .. else 10 end as bucket_idfrom your table See the conditional functions "case" in the HIVE. After you have "bucket_id" column, then you can do whatever analytic function you want. Yong Date: Thu, 29 Oct 2015 12:53:35 -0700 Subject: Re: RDD's filter() or using 'where' condition in SparkSQL From: anfernee...@gmail.com To: java8...@hotmail.com CC: user@spark.apache.org Thanks Yong for your response. Let me see if I can understand what you're suggesting, so the whole data set, when I load them into Spark(I'm using custom Hadoop InputFormat), I will add an extra field to each element in RDD, like bucket_id. For example Key: 1 - 10, bucket_id=111-20, bucket_id=2...90-100, butcket_id =10 then I can re-partition the RDD with a partitioner that will put all records with the same bucket_id in the same partition, after I get DataFrame from the RDD, the partition is still preserved(is it correct?) then reset of work is only issue SQL query like SELECT * from XXX where bucket_id=1SELECT * from XXX where bucket_id=2 .. Am I right? Thanks Anfernee On Thu, Oct 29, 2015 at 11:07 AM, java8964 wrote: Won't you be able to use case statement to generate a virtual column (like partition_num), then use analytic SQL partition by this virtual column? In this case, the full dataset will be just scanned once. Yong Date: Thu, 29 Oct 2015 10:51:53 -0700 Subject: RDD's filter() or using 'where' condition in SparkSQL From: anfernee...@gmail.com To: user@spark.apache.org Hi, I have a pretty large data set(2M entities) in my RDD, the data has already been partitioned by a specific key, the key has a range(type in long), now I want to create a bunch of key buckets, for example, the key has range 1 -> 100, I will break the whole range into below buckets 1 -> 1011 -> 20 ...90 -> 100 I want to run some analytic SQL functions over the data that owned by each key range, so I come up with 2 approaches, 1) run RDD's filter() on the full data set RDD, the filter will create the RDD corresponding to each key bucket, and with each RDD, I can create DataFrame and run the sql. 2) create a DataFrame for the whole RDD, and using a buch of SQL's to do my job. SELECT * from where key>=key1 AND key
Re: RDD's filter() or using 'where' condition in SparkSQL
Thanks Yong for your response. Let me see if I can understand what you're suggesting, so the whole data set, when I load them into Spark(I'm using custom Hadoop InputFormat), I will add an extra field to each element in RDD, like bucket_id. For example Key: 1 - 10, bucket_id=1 11-20, bucket_id=2 ... 90-100, butcket_id =10 then I can re-partition the RDD with a partitioner that will put all records with the same bucket_id in the same partition, after I get DataFrame from the RDD, the partition is still preserved(is it correct?) then reset of work is only issue SQL query like SELECT * from XXX where bucket_id=1 SELECT * from XXX where bucket_id=2 .. Am I right? Thanks Anfernee On Thu, Oct 29, 2015 at 11:07 AM, java8964 wrote: > Won't you be able to use case statement to generate a virtual column (like > partition_num), then use analytic SQL partition by this virtual column? > > In this case, the full dataset will be just scanned once. > > Yong > > -- > Date: Thu, 29 Oct 2015 10:51:53 -0700 > Subject: RDD's filter() or using 'where' condition in SparkSQL > From: anfernee...@gmail.com > To: user@spark.apache.org > > > Hi, > > I have a pretty large data set(2M entities) in my RDD, the data has > already been partitioned by a specific key, the key has a range(type in > long), now I want to create a bunch of key buckets, for example, the key > has range > > 1 -> 100, > > I will break the whole range into below buckets > > 1 -> 10 > 11 -> 20 > ... > 90 -> 100 > > I want to run some analytic SQL functions over the data that owned by > each key range, so I come up with 2 approaches, > > 1) run RDD's filter() on the full data set RDD, the filter will create the > RDD corresponding to each key bucket, and with each RDD, I can create > DataFrame and run the sql. > > > 2) create a DataFrame for the whole RDD, and using a buch of SQL's to do > my job. > > SELECT * from where key>=key1 AND key > So my question is which one is better from performance perspective? > > Thanks > > -- > --Anfernee > -- --Anfernee
RE: RDD's filter() or using 'where' condition in SparkSQL
Won't you be able to use case statement to generate a virtual column (like partition_num), then use analytic SQL partition by this virtual column? In this case, the full dataset will be just scanned once. Yong Date: Thu, 29 Oct 2015 10:51:53 -0700 Subject: RDD's filter() or using 'where' condition in SparkSQL From: anfernee...@gmail.com To: user@spark.apache.org Hi, I have a pretty large data set(2M entities) in my RDD, the data has already been partitioned by a specific key, the key has a range(type in long), now I want to create a bunch of key buckets, for example, the key has range 1 -> 100, I will break the whole range into below buckets 1 -> 1011 -> 20 ...90 -> 100 I want to run some analytic SQL functions over the data that owned by each key range, so I come up with 2 approaches, 1) run RDD's filter() on the full data set RDD, the filter will create the RDD corresponding to each key bucket, and with each RDD, I can create DataFrame and run the sql. 2) create a DataFrame for the whole RDD, and using a buch of SQL's to do my job. SELECT * from where key>=key1 AND key