[ https://issues.apache.org/jira/browse/HIVE-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198542#comment-13198542 ]
xiaoyu wang commented on HIVE-2775:
-----------------------------------

{code}
index d0ff67e..bcddc5b 100644
@@ -349,7 +349,25 @@ public class Partition implements Serializable {
    * we are just storing it as a property of the table as a short term measure.
    */
   public int getBucketCount() {
-    return table.getNumBuckets();
+    int logicalBucketNumber = table.getNumBuckets();
+    String pathPattern = this.getPartitionPath().toString() + "/*";
+    try {
+      FileSystem fs = FileSystem.get(this.table.getDataLocation(),Hive.get().getConf());
+      FileStatus srcs[] = fs.globStatus(new Path(pathPattern));
+      int physicalBucketNumber = srcs.length;
+      if ((physicalBucketNumber/logicalBucketNumber) * logicalBucketNumber == physicalBucketNumber){
+        return physicalBucketNumber;
+      } else {
+        throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName() +
+            " logical bucket is " + logicalBucketNumber + " physical bucket number is " + physicalBucketNumber);
+      }
+    }catch (Exception e)
+    {
+      throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName(), e) ;
+    }
+
+
+//    return table.getNumBuckets();
     /*
      * TODO: Keeping this code around for later use when we will support
      * sampling on tables which are not created with CLUSTERED INTO clause
{code}

> allow the number of files to be a multiple of bucketed table
> ------------------------------------------------------------
>
>                 Key: HIVE-2775
>                 URL: https://issues.apache.org/jira/browse/HIVE-2775
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: xiaoyu wang
>
> Currently, a Hive bucketed table requires the number of files to match the
> bucket number in order for sampling to be correct. This is very restrictive:
> e.g. we can only populate the table using a fixed number of reducers, which
> can be a bottleneck.
> The idea is to introduce the concepts of a "physical bucket" and a "logical bucket".
> The "physical bucket" count is the number of files, and the "logical bucket"
> count is the number of buckets stored in the metadata for the bucketed table.
> By allowing the "physical bucket" count to be a multiple of the "logical
> bucket" count, we can still sample correctly while also scaling up.
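
The rule described above can be sketched in isolation from Hive. The class and method names below (`BucketCounts`, `isValidPhysicalCount`, `logicalBucketOf`) are hypothetical and are not part of the patch or of Hive. The sketch assumes both counts are positive and that a row lands in the file numbered `hash mod physicalCount`; under that assumption, file `i` holds only rows belonging to logical bucket `i mod logicalCount`, which is why sampling by logical bucket would stay correct.

```java
// Hypothetical helper for illustration only; not part of Hive or of the patch.
final class BucketCounts {
    private BucketCounts() {}

    /**
     * Returns true when the number of files (physical buckets) is a positive
     * multiple of the declared bucket count (logical buckets). For positive
     * ints this is equivalent to the (p / l) * l == p test in the patch.
     */
    static boolean isValidPhysicalCount(int physicalCount, int logicalCount) {
        if (physicalCount <= 0 || logicalCount <= 0) {
            return false;
        }
        return physicalCount % logicalCount == 0;
    }

    /**
     * Assumed mapping: if rows are written to file (hash mod physicalCount) and
     * physicalCount is a multiple of logicalCount, then file i contains only
     * rows whose hash mod logicalCount equals i mod logicalCount.
     */
    static int logicalBucketOf(int physicalIndex, int logicalCount) {
        if (physicalIndex < 0 || logicalCount <= 0) {
            throw new IllegalArgumentException("index must be >= 0 and count must be > 0");
        }
        return physicalIndex % logicalCount;
    }
}
```

For example, with 8 logical buckets and 32 files the check passes, and files 5, 13, 21 and 29 would all belong to logical bucket 5; with 30 files the check fails, matching the case where the patch throws.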