[ https://issues.apache.org/jira/browse/HIVE-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565200#comment-14565200 ]
Xuefu Zhang edited comment on HIVE-10283 at 5/29/15 9:29 PM:
-------------------------------------------------------------

[~xuefuz] && [~szehon], could you find someone who knows this part well to work on the issue?

Currently, in upstream master code, the number of buckets is not respected even with insert overwrite (insert overwrite creates only 1 bucket file, while the table definition specifies 2 buckets).

Reproduce:
{noformat}
create table buckettest (data string) partitioned by (state string) clustered by (data) into 2 buckets;
set hive.enforce.bucketing = true;
insert overwrite table buckettest partition(state='MA') select code from jsmall limit 10;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
0: jdbc:hive2://localhost:10000> select * from buckettest a join buckettestoutput2 b on (a.data=b.data);
Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of buckets for table buckettest partition state=MA is 2, whereas the number of files is 1 (state=42000,code=10141)
{noformat}

was (Author: ychena):
[~xuefuz] && [~szehon], could you find someone who knows this part well to work on the issue?

Currently, in upstream master code, the number of buckets is not respected even with insert overwrite (insert overwrite creates only 1 bucket file, while the table definition specifies 2 buckets).

Reproduce:
{noformat}
create table buckettest (data string) partitioned by (state string) clustered by (data) into 2 buckets;
set hive.enforce.bucketing = true;
insert overwrite table buckettest partition(state='MA') select code from jsmall limit 10;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
0: jdbc:hive2://localhost:10000> select * from buckettest a join buckettestoutput2 b on (a.data=b.data);
Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of buckets for table buckettest partition state=MA is 2, whereas the number of files is 1 (state=42000,code=10141)

> HIVE-4240 may be causing issue with bucketed tables
> ----------------------------------------------------
>
>                 Key: HIVE-10283
>                 URL: https://issues.apache.org/jira/browse/HIVE-10283
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Ryan P
>
> I suspect that HIVE-4240, by removing the reducer, may be causing issues.
> Because of this, inserts will not consolidate 'buckets' into single files,
> which is problematic when attempting to use bucketmapjoin.
> CREATE TABLE IF NOT EXISTS buckettestinput(
> data string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput1(
> data string
> )CLUSTERED BY(data)
> INTO 2 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput2(
> data string
> )CLUSTERED BY(data)
> INTO 2 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> Then I inserted the following data into the "buckettestinput" table:
> firstinsert1
> firstinsert2
> firstinsert3
> firstinsert4
> firstinsert5
> firstinsert6
> firstinsert7
> firstinsert8
> secondinsert1
> secondinsert2
> secondinsert3
> secondinsert4
> secondinsert5
> secondinsert6
> secondinsert7
> secondinsert8
> set hive.enforce.bucketing = true;
> set hive.enforce.sorting=true;
> insert into table buckettestoutput1
> select * from buckettestinput where data like 'first%';
> SELECT *
> FROM buckettestoutput1 TABLESAMPLE(BUCKET 1 OUT OF 1 ON data) s;
> insert into table buckettestoutput1
> select * from buckettestinput where data like 'second%';
> Check the results of the table sample query.
> For the sort merge bucket map join:
> set hive.auto.convert.sortmerge.join=true;
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> set hive.auto.convert.sortmerge.join.noconditionaltask=true;
> select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);
> hive> select * from buckettestoutput1 a join buckettestoutput2 b on (a.data=b.data);
> FAILED: SemanticException [Error 10141]: Bucketed table metadata is not
> correct. Fix the metadata or don't use bucketed mapjoin, by setting
> hive.enforce.bucketmapjoin to false. The number of buckets for table
> buckettestoutput1 is 2, whereas the number of files is 4
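
To make the file-count mismatch in the description concrete, here is a rough Python simulation of the suspected behavior. It is only a sketch under assumptions, not Hive code: `bucket_for`, `insert_into`, and `check_bucket_metadata` are hypothetical names, CRC32 stands in for whatever hash Hive actually uses, and it assumes each insert writes one new file per bucket and never merges into files that already exist. The check itself only mirrors the condition stated in the error text (file count must equal bucket count).

```python
from zlib import crc32


def bucket_for(value: str, num_buckets: int) -> int:
    """Assign a row to a bucket (CRC32 is a stand-in, not Hive's hash)."""
    return crc32(value.encode("utf-8")) % num_buckets


def insert_into(table_files: list, rows: list, num_buckets: int) -> None:
    """Append one new file per bucket; existing files are never merged (assumed)."""
    new_files = [[] for _ in range(num_buckets)]
    for row in rows:
        new_files[bucket_for(row, num_buckets)].append(row)
    table_files.extend(new_files)


def check_bucket_metadata(table_files: list, num_buckets: int) -> bool:
    """Mirror the condition in the error text: file count must equal bucket count."""
    return len(table_files) == num_buckets


NUM_BUCKETS = 2
files = []

# First insert: rows matching 'first%'
insert_into(files, [f"firstinsert{i}" for i in range(1, 9)], NUM_BUCKETS)
after_first = len(files)
ok_after_first = check_bucket_metadata(files, NUM_BUCKETS)

# Second insert: rows matching 'second%'
insert_into(files, [f"secondinsert{i}" for i in range(1, 9)], NUM_BUCKETS)
after_second = len(files)
ok_after_second = check_bucket_metadata(files, NUM_BUCKETS)

print(after_first, ok_after_first)    # 2 True
print(after_second, ok_after_second)  # 4 False
```

Under these assumptions the table is consistent after the first insert (2 files for 2 buckets) and inconsistent after the second (4 files for 2 buckets), which matches the numbers reported in the error above. Whether HIVE-4240 is actually the cause is not something this sketch can show.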