[jira] [Commented] (HIVE-2775) allow the number of files to be a multiple of bucketed table

xiaoyu wang (Commented) (JIRA) Wed, 01 Feb 2012 21:24:32 -0800

    [ https://issues.apache.org/jira/browse/HIVE-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198542#comment-13198542 ]


xiaoyu wang commented on HIVE-2775:
-----------------------------------

{code}
index d0ff67e..bcddc5b 100644
@@ -349,7 +349,25 @@ public class Partition implements Serializable {
    * we are just storing it as a property of the table as a short term measure.
    */
   public int getBucketCount() {
-    return table.getNumBuckets();
+      int logicalBucketNumber = table.getNumBuckets();
+      String pathPattern = this.getPartitionPath().toString() + "/*";
+      try {
+          FileSystem fs = FileSystem.get(this.table.getDataLocation(),Hive.get().getConf());
+          FileStatus srcs[] = fs.globStatus(new Path(pathPattern));
+          int physicalBucketNumber = srcs.length;
+          if ((physicalBucketNumber/logicalBucketNumber) * logicalBucketNumber ==  physicalBucketNumber){
+              return physicalBucketNumber;
+          } else {
+              throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName() +
+                      " logical bucket is " + logicalBucketNumber + " physical bucket number is " + physicalBucketNumber);
+          }
+      }catch (Exception e)
+      {
+          throw new RuntimeException("Cannot get bucket count for table " + this.table.getTableName(), e) ;
+      }
+
+
+//    return table.getNumBuckets();
     /*
      * TODO: Keeping this code around for later use when we will support
      * sampling on tables which are not created with CLUSTERED INTO clause
{code}
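
The divisibility test in the patch can be read in isolation from the Hadoop and Hive classes. The following is only a minimal sketch of that check, not part of the patch: the class name BucketCountCheck and the method resolveBucketCount are hypothetical, the file count is passed in rather than obtained from fs.globStatus, and the guard against a non-positive logical bucket count is an assumption of this sketch (the patch itself has no such guard). The modulo test is equivalent to the divide-then-multiply comparison used above.

```java
public final class BucketCountCheck {
    private BucketCountCheck() {}

    /**
     * Hypothetical helper: returns the physical file count when it is a
     * multiple of the logical bucket count, and throws otherwise.
     */
    public static int resolveBucketCount(int logicalBuckets, int physicalFiles) {
        // Assumption of this sketch: reject a non-positive logical count up front
        // instead of letting the arithmetic fail or pass by accident.
        if (logicalBuckets <= 0) {
            throw new IllegalArgumentException(
                "logical bucket count must be positive: " + logicalBuckets);
        }
        if (physicalFiles < 0) {
            throw new IllegalArgumentException(
                "physical file count must not be negative: " + physicalFiles);
        }
        // Same condition as (physical / logical) * logical == physical.
        if (physicalFiles % logicalBuckets != 0) {
            throw new IllegalStateException(
                "logical bucket is " + logicalBuckets
                    + " physical bucket number is " + physicalFiles);
        }
        return physicalFiles;
    }

    public static void main(String[] args) {
        // 8 files for 4 logical buckets is accepted.
        System.out.println(resolveBucketCount(4, 8));
    }
}
```

With 4 logical buckets, 4, 8, or 12 files are accepted, while 6 files are rejected.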
                
> allow the number of files to be a multiple of bucketed table
> ------------------------------------------------------------
>
>                 Key: HIVE-2775
>                 URL: https://issues.apache.org/jira/browse/HIVE-2775
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: xiaoyu wang
>
> Currently, a Hive bucketed table requires the number of files to match the 
> bucket count for sampling to be correct. This is very restrictive: 
> for example, the table can only be populated using a fixed number of reducers, 
> which can be a bottleneck. 
> The idea is to introduce the concepts of a "physical bucket" and a "logical bucket". 
> The "physical bucket" count is the number of files, and the "logical bucket" count 
> is the number of buckets stored in the metadata for the bucketed table. By allowing 
> the "physical bucket" count to be a multiple of the "logical bucket" count, we can 
> sample correctly as well as scale up. 
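
The issue does not spell out how files would map back to logical buckets. As an illustration of why the "multiple" requirement keeps sampling correct, here is a minimal sketch under one assumption: rows are assigned to a file by a non-negative hash modulo the physical file count. In that case file i holds only rows of logical bucket i % logical, because the logical count divides the physical count. The class and method names are hypothetical and are not taken from Hive.

```java
public final class LogicalBucketMapping {
    private LogicalBucketMapping() {}

    /** Assumed assignment: the file a row lands in, by hash modulo the file count. */
    public static int physicalFileFor(int hash, int physicalFiles) {
        return (hash & Integer.MAX_VALUE) % physicalFiles;
    }

    /** The logical bucket the same row belongs to, by hash modulo the bucket count. */
    public static int logicalBucketFor(int hash, int logicalBuckets) {
        return (hash & Integer.MAX_VALUE) % logicalBuckets;
    }

    /** Logical bucket served by a given file, valid when physical is a multiple of logical. */
    public static int logicalBucketOfFile(int fileIndex, int logicalBuckets) {
        return fileIndex % logicalBuckets;
    }

    public static void main(String[] args) {
        int logical = 4;
        int physical = 12; // a multiple of the logical count
        for (int hash = -1000; hash <= 1000; hash++) {
            int file = physicalFileFor(hash, physical);
            if (logicalBucketOfFile(file, logical) != logicalBucketFor(hash, logical)) {
                throw new AssertionError("mismatch for hash " + hash);
            }
        }
        System.out.println("every file maps to exactly one logical bucket");
    }
}
```

If the file count were not a multiple (say 6 files for 4 logical buckets), a single file could contain rows from more than one logical bucket, and sampling by file would no longer match sampling by bucket.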

