Hi Avrilia,

In your SELECT query you need to append a TABLESAMPLE clause:

    TABLESAMPLE(BUCKET x OUT OF y)

It tells Hive which bucket (or buckets) to scan to get the data. With partitions, if you issue a SELECT on a particular partition, only that partition is scanned without any hint; with sampling/buckets, you need to tell the query explicitly which bucket to scan. You can adjust the values of x and y to scan more than one bucket in a query.

Just a suggestion from my end: from my understanding of the requirement you put forth, I feel PARTITIONS would suit you better than SAMPLING, since your query is centered on a particular value of one column. SAMPLING is good when you want to run some operation over a small sample of the whole data; say you need an approximate average of the whole data set, you can use sampling to avoid scanning the whole table. With partitions, use dynamic partitions to load data from the source table into the target table's partitions on the fly.
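As a concrete sketch against the partRC table from the original mail (the ON clause and the specific bucket number are my assumptions; with Hive's default integer hashing, key 3 mod 512 should land in bucket 4 in Hive's 1-based numbering, but verify against your Hive version):

```sql
-- Hedged example: read a single bucket instead of the whole table.
-- ON P_PARTKEY matches the table's CLUSTERED BY column, which is what
-- lets Hive prune to the named bucket rather than sample every file.
SELECT *
FROM partRC TABLESAMPLE(BUCKET 4 OUT OF 512 ON P_PARTKEY)
WHERE P_PARTKEY = 3;
```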
Hope it helps!

Regards
Bejoy K S

________________________________
From: Avrilia Floratou <[email protected]>
To: [email protected]
Sent: Monday, October 10, 2011 7:13 AM
Subject: Problem with query on bucketed table

Hello,

I have a question regarding the execution of some queries on bucketed tables. I've created a compressed bucketed table using the following statement:

create external table partRC (
  P_PARTKEY BIGINT, P_NAME STRING, P_MFGR STRING, P_BRAND STRING,
  P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING,
  P_RETAILPRICE DOUBLE, P_COMMENT STRING)
CLUSTERED BY (P_PARTKEY) SORTED BY (P_PARTKEY) INTO 512 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerde'
STORED AS RCFILE
LOCATION '/tpch/partRC';

I inserted all the data by reading from another table (PART) stored as a textfile, using the following script:

set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
set mapred.output.compress = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.output = true;
INSERT OVERWRITE TABLE partRC select * from part cluster by (P_PARTKEY);

I verified that 512 buckets were created (512 reducers were used) and then I ran the following selection:

select * from partRC where P_PARTKEY = 3;

I noticed that the whole table is scanned. Is this expected? I was expecting that only the bucket containing the value 3 would be scanned (and actually not even the whole bucket, since it is sorted). Am I doing something wrong here?

Thanks,
Avrilia
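For reference, the way a row lands in a bucket can be sketched in a few lines. This assumes Hive's default behavior for integral clustering keys, where the hash of the value is the value itself, so the bucket is simply key mod numBuckets (worth verifying against your Hive version):

```python
# Sketch of Hive-style bucket assignment for an integer CLUSTERED BY key.
# Assumption: for integral types the default hash equals the value, so
# bucket = key % num_buckets (0-based file index on disk).
def bucket_for(key: int, num_buckets: int) -> int:
    return key % num_buckets

# P_PARTKEY = 3 with 512 buckets lands in 0-based bucket 3,
# i.e. BUCKET 4 in TABLESAMPLE's 1-based numbering.
print(bucket_for(3, 512))
```

This is why a bucket-pruned scan for P_PARTKEY = 3 only needs to open one of the 512 bucket files.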
