Danny Teok created HIVE-5774:
--------------------------------

             Summary: INSERT OVERWRITE DYNAMIC PARTITION on LARGE DATA
                 Key: HIVE-5774
                 URL: https://issues.apache.org/jira/browse/HIVE-5774
             Project: Hive
          Issue Type: Bug
          Components: Database/Schema
         Environment: debian 6.0.7
            Reporter: Danny Teok
            Priority: Critical


After several rounds of forensic analysis, we are convinced that there is a bug 
when rebuilding a table using dynamic partitioning over more than 30 days. Row 
counts do not match.

In detail (row counts per date partition):
Part A -- original_table
2013-01-01; 394,755 rows
2013-01-02; 424,448
2013-01-03; 427,201
...
2013-10-30; 3,234,472

Part B -- copy_of_original_table_new
2013-01-01; 372,628 rows
2013-01-02; 400,553
2013-01-03; 403,495
...
2013-10-30; 2,865,877
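
The comparison above can be reproduced with a small script. This is only a 
minimal sketch: the helper name compare_counts is hypothetical, and the 
dictionaries hold just the four dates listed in this report, not the full 
303-day range.

```python
# Per-partition row counts copied from the report (only the dates shown there).
original = {
    "2013-01-01": 394755,
    "2013-01-02": 424448,
    "2013-01-03": 427201,
    "2013-10-30": 3234472,
}
rebuilt = {
    "2013-01-01": 372628,
    "2013-01-02": 400553,
    "2013-01-03": 403495,
    "2013-10-30": 2865877,
}


def compare_counts(expected, actual):
    """Return {partition: (missing_rows, missing_fraction)} for mismatches.

    Assumes every expected count is non-zero. A partition absent from
    `actual` is treated as having zero rows.
    """
    mismatches = {}
    for dt in sorted(expected):
        exp = expected[dt]
        act = actual.get(dt, 0)
        if exp != act:
            missing = exp - act
            mismatches[dt] = (missing, missing / exp)
    return mismatches


mismatches = compare_counts(original, rebuilt)
for dt, (missing, fraction) in mismatches.items():
    print(f"{dt}: {missing:,} rows missing ({fraction:.2%} of original)")
```

In practice the two dictionaries would be filled from a per-partition count 
on each table (for example, grouping by dt), which is an assumption about the 
setup rather than something shown in this report.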

The query used to populate the original table is the same one used to populate 
the "copy_of_original_table_new" table. When we rebuilt for 1 day, e.g. 
2013-01-01, the row count of copy_of_original_table_new matched original_table 
exactly.
When we rebuilt for 7 days, the row counts matched exactly.
When we rebuilt for 15 days, the row counts matched exactly.
When we rebuilt for 303 days (10 months), none of the row counts matched.
When we rebuilt for 35 days, 80% of the dates matched exactly. The other 20% 
were off by hundreds to tens of thousands of rows (a variance of up to 3%).

In other words, the more days that are covered by the WHERE dt BETWEEN 
dateStart AND dateEnd clause, the more dates end up with a row count that does 
not match original_table.

However, we then rebuilt each of the mismatched 20% statically, one date at a 
time, with the corresponding date. The result was surprising: each of them 
matched the original_table row count!

Apologies in advance if this is not technical enough, but I hope the message is 
clear. We believe there is a bug. We are not sure how to check our Hive version, 
but our Hadoop version is "Hadoop 2.0.0-cdh4.1.1".

The INSERT OVERWRITE SQL can be seen here:
http://pastebin.com/g1qxsUm2



--
This message was sent by Atlassian JIRA
(v6.1#6144)