Hello, I have a question regarding the execution of some queries on bucketed tables.
I've created a compressed bucketed table using the following statement: create external table partRC (P_PARTKEY BIGINT,P_NAME STRING, P_MFGR STRING, P_BRAND STRING, P_TYPE STRING, P_SIZE INT, P_CONTAINER STRING, P_RETAILPRICE DOUBLE, P_COMMENT STRING) CLUSTERED BY (P_PARTKEY) SORTED BY (P_PARTKEY) INTO 512 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerde' stored as RCFILE LOCATION '/tpch/partRC'; I inserted all the data by reading from another table (PART) stored as textfile using the following script: set hive.enforce.bucketing = true; set hive.enforce.sorting = true; set mapred.output.compress=true; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set hive.exec.compress.output=true; INSERT OVERWRITE TABLE partRC select * from part cluster by (P_PARTKEY); I verified that 512 buckets where cretaed (512 reducers where used) and then I ran the following selection: select * from partRC where P_PARTKEY = 3; I noticed that the whole table is scanned. Is this expected? I was expecting that only the bucket that contains the value 3 would be scanned (and actually not the whole bucket -- since it is sorted). Am I doing sth wrong here? Thanks, Avrilia
