Earlier this week I was surprised to find that, after dumping tons of data from a Hive table to an HBase table, about half of the data didn't end up in HBase.
So, yesterday, I created a new Phoenix table. This time I'm pre-splitting on the first 6 characters of the key, which gives me about 1,700 regions (spread across 6 fairly beefy region servers). My 7 billion Hive rows live in 125 5 GB CSV files on HDFS. I copied 35 of them to a separate folder and ran the CsvBulkLoadTool against that folder. The application manager tells me the job ran to completion: 1042/1042 successful maps and 1792/1792 successful reduces. However, when I run mapreduce.RowCounter against the new table, it only shows about 300 million rows. I should see 35/125 * 7 billion ≈ 1.96 billion rows, and these are not primary key collisions. Can someone please help me understand what is going on?
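For reference, the two invocations looked roughly like this (the table name, input path, and ZooKeeper quorum below are placeholders, not my real values):

```sh
# Bulk load the 35 CSV files that were copied to a separate HDFS folder
# (table name, path, and ZooKeeper quorum are placeholders)
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table MY_TABLE \
    --input /data/subset \
    --zookeeper zk1,zk2,zk3

# Count the rows that actually landed in the HBase table
hbase org.apache.hadoop.hbase.mapreduce.RowCounter MY_TABLE
```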
