subject:"RE\: RDD's filter\(\) or using 'where' condition in SparkSQL"

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964

You can do the SQL like following:
select *, case when key >= 1 and key <=10 then 1 when key >= 11 and key <= 20 
then 2 .. else 10 end as bucket_idfrom your table
See the conditional functions "case" in the HIVE.
After you have "bucket_id" column, then you can do whatever analytic function 
you want.
Yong

Date: Thu, 29 Oct 2015 12:53:35 -0700
Subject: Re: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: java8...@hotmail.com
CC: user@spark.apache.org

Thanks Yong for your response.
Let me see if I can understand what you're suggesting, so the whole data set, 
when I load them into Spark(I'm using custom Hadoop InputFormat), I will add an 
extra field to each element in RDD, like bucket_id.
For example
Key:
1 - 10, bucket_id=111-20, bucket_id=2...90-100, butcket_id =10
then I can re-partition the RDD with a partitioner that will put all records 
with the same bucket_id in the same partition, after I get DataFrame from the 
RDD, the partition is still preserved(is it correct?)
then reset of work is only issue SQL query like
SELECT * from XXX where bucket_id=1SELECT * from XXX where bucket_id=2

..
Am I right?
Thanks
Anfernee
On Thu, Oct 29, 2015 at 11:07 AM, java8964  wrote:

Won't you be able to use case statement to generate a virtual column (like 
partition_num), then use analytic SQL partition by this virtual column?
In this case, the full dataset will be just scanned once.

Yong

Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: user@spark.apache.org

Hi,
I have a pretty large data set(2M entities) in my RDD, the data has already 
been partitioned by a specific key, the key has a range(type in long), now I 
want to create a bunch of key buckets, for example, the key has range 
1 -> 100,
I will break the whole range into below buckets   1 ->  1011 -> 20
...90 -> 100
 I want to run some analytic SQL functions over the data that owned by each key 
range, so I come up with 2 approaches,
1) run RDD's filter() on the full data set RDD, the filter will create the RDD 
corresponding to each key bucket, and with each RDD, I can create DataFrame and 
run the sql.

2) create a DataFrame for the whole RDD, and using a buch of SQL's to do my job.
SELECT * from  where key>=key1 AND key

Re: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu

Thanks Yong for your response.

Let me see if I can understand what you're suggesting, so the whole data
set, when I load them into Spark(I'm using custom Hadoop InputFormat), I
will add an extra field to each element in RDD, like bucket_id.

For example

Key:

1 - 10, bucket_id=1
11-20, bucket_id=2
...
90-100, butcket_id =10

then I can re-partition the RDD with a partitioner that will put all
records with the same bucket_id in the same partition, after I get
DataFrame from the RDD, the partition is still preserved(is it correct?)

then reset of work is only issue SQL query like

SELECT * from XXX where bucket_id=1
SELECT * from XXX where bucket_id=2

..

Am I right?

Thanks

Anfernee

On Thu, Oct 29, 2015 at 11:07 AM, java8964  wrote:

> Won't you be able to use case statement to generate a virtual column (like
> partition_num), then use analytic SQL partition by this virtual column?
>
> In this case, the full dataset will be just scanned once.
>
> Yong
>
> --
> Date: Thu, 29 Oct 2015 10:51:53 -0700
> Subject: RDD's filter() or using 'where' condition in SparkSQL
> From: anfernee...@gmail.com
> To: user@spark.apache.org
>
>
> Hi,
>
> I have a pretty large data set(2M entities) in my RDD, the data has
> already been partitioned by a specific key, the key has a range(type in
> long), now I want to create a bunch of key buckets, for example, the key
> has range
>
> 1 -> 100,
>
> I will break the whole range into below buckets
>
> 1 ->  10
> 11 -> 20
> ...
> 90 -> 100
>
>  I want to run some analytic SQL functions over the data that owned by
> each key range, so I come up with 2 approaches,
>
> 1) run RDD's filter() on the full data set RDD, the filter will create the
> RDD corresponding to each key bucket, and with each RDD, I can create
> DataFrame and run the sql.
>
>
> 2) create a DataFrame for the whole RDD, and using a buch of SQL's to do
> my job.
>
> SELECT * from  where key>=key1 AND key 
> So my question is which one is better from performance perspective?
>
> Thanks
>
> --
> --Anfernee
>



-- 
--Anfernee

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964

Won't you be able to use case statement to generate a virtual column (like 
partition_num), then use analytic SQL partition by this virtual column?
In this case, the full dataset will be just scanned once.

Yong

Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: user@spark.apache.org

Hi,
I have a pretty large data set(2M entities) in my RDD, the data has already 
been partitioned by a specific key, the key has a range(type in long), now I 
want to create a bunch of key buckets, for example, the key has range 
1 -> 100,
I will break the whole range into below buckets   1 ->  1011 -> 20
...90 -> 100
 I want to run some analytic SQL functions over the data that owned by each key 
range, so I come up with 2 approaches,
1) run RDD's filter() on the full data set RDD, the filter will create the RDD 
corresponding to each key bucket, and with each RDD, I can create DataFrame and 
run the sql.

2) create a DataFrame for the whole RDD, and using a buch of SQL's to do my job.
SELECT * from  where key>=key1 AND key

RE: RDD's filter() or using 'where' condition in SparkSQL

Re: RDD's filter() or using 'where' condition in SparkSQL

RE: RDD's filter() or using 'where' condition in SparkSQL

3 matches

Site Navigation

Mail list logo

Footer information