Earlier this week I was surprised to find that, after dumping tons of data from a Hive table to an HBase table, about half of the data didn't end up in HBase.
So, yesterday, I created a new Phoenix table. This time I'm pre-splitting on the first 6 characters of the key, which gives me about 1,700 regions (spread across 6 fairly beefy region servers). My 7 billion Hive rows live in 125 5 GB CSV files on HDFS. I copied 35 of them to a separate folder and ran the CsvBulkLoadTool against that folder. The application manager tells me the job ran to completion: 1042/1042 successful maps and 1792/1792 successful reduces. However, when I run mapreduce.RowCounter against the new table, it only shows about 300 million rows. I should see 35/125 * 7 billion ≈ 1.96 billion rows, and these are not primary key collisions. Can someone please help me understand what is going on?
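For reference, the two invocations looked roughly like this (the table name, input path, and ZooKeeper quorum below are placeholders, not my real values):

```sh
# Bulk load the 35 CSV files that were copied to a separate HDFS folder
# (table name, path, and ZooKeeper quorum are placeholders)
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table MY_TABLE \
    --input /data/subset \
    --zookeeper zk1,zk2,zk3

# Count the rows that actually landed in the HBase table
hbase org.apache.hadoop.hbase.mapreduce.RowCounter MY_TABLE
```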
