Hi Avrilia,

In your SELECT query you need to append a TABLESAMPLE clause:

    TABLESAMPLE(BUCKET x OUT OF y)

It tells Hive which bucket (or buckets) to scan to get the data. With partitions, if you issue a SELECT on a particular partition, only that partition is scanned without any hint; with sampling/buckets, you need to tell the query explicitly which bucket to scan. You can adjust the values of x and y to scan more than one bucket in a query.

Just a suggestion from my end: from my understanding of the requirement you put forth, I feel PARTITIONS would suit you better than SAMPLING, since your query is centered on a particular value of one column. SAMPLING is good when you want to run some operation over a small sample of the whole data; say you need an approximate average of the whole data set, you can use sampling to avoid scanning the whole table. With partitions, use dynamic partitions to load data from the source table into the target table's partitions on the fly.
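As a concrete sketch against the partRC table from the original mail (the ON clause and the specific bucket number are my assumptions; with Hive's default integer hashing, key 3 mod 512 should land in bucket 4 in Hive's 1-based numbering, but verify against your Hive version):

```sql
-- Hedged example: read a single bucket instead of the whole table.
-- ON P_PARTKEY matches the table's CLUSTERED BY column, which is what
-- lets Hive prune to the named bucket rather than sample every file.
SELECT *
FROM partRC TABLESAMPLE(BUCKET 4 OUT OF 512 ON P_PARTKEY)
WHERE P_PARTKEY = 3;
```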
Hope it helps!

Regards
Bejoy K S

________________________________
From: Avrilia Floratou <[email protected]>
To: [email protected]
Sent: Monday, October 10, 2011 7:13 AM
Subject: Problem with query on bucketed table

Hello,

I have a question regarding the execution of some queries on bucketed tables. I've created a compressed bucketed table using the following statement:

create external table partRC (
  P_PARTKEY BIGINT, P_NAME STRING, P_MFGR STRING, P_BRAND STRING,
  P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING,
  P_RETAILPRICE DOUBLE, P_COMMENT STRING)
CLUSTERED BY (P_PARTKEY) SORTED BY (P_PARTKEY) INTO 512 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerde'
STORED AS RCFILE
LOCATION '/tpch/partRC';

I inserted all the data by reading from another table (PART) stored as a textfile, using the following script:

set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
set mapred.output.compress = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.output = true;
INSERT OVERWRITE TABLE partRC select * from part cluster by (P_PARTKEY);

I verified that 512 buckets were created (512 reducers were used) and then I ran the following selection:

select * from partRC where P_PARTKEY = 3;

I noticed that the whole table is scanned. Is this expected? I was expecting that only the bucket containing the value 3 would be scanned (and actually not even the whole bucket, since it is sorted). Am I doing something wrong here?

Thanks,
Avrilia
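For reference, the way a row lands in a bucket can be sketched in a few lines. This assumes Hive's default behavior for integral clustering keys, where the hash of the value is the value itself, so the bucket is simply key mod numBuckets (worth verifying against your Hive version):

```python
# Sketch of Hive-style bucket assignment for an integer CLUSTERED BY key.
# Assumption: for integral types the default hash equals the value, so
# bucket = key % num_buckets (0-based file index on disk).
def bucket_for(key: int, num_buckets: int) -> int:
    return key % num_buckets

# P_PARTKEY = 3 with 512 buckets lands in 0-based bucket 3,
# i.e. BUCKET 4 in TABLESAMPLE's 1-based numbering.
print(bucket_for(3, 512))
```

This is why a bucket-pruned scan for P_PARTKEY = 3 only needs to open one of the 512 bucket files.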
