For #2: hbase org.apache.hadoop.hbase.mapreduce.RowCounter <TABLE_NAME>
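
For reference, the full invocation would look something like this (the table name here is a placeholder; note that Phoenix upper-cases unquoted table names, so the HBase table is most likely the upper-case version of your Phoenix name):

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter MY_TABLE

RowCounter runs as a MapReduce job, so it avoids the client-side timeout that is probably killing your select count(*). If you do want the count through Phoenix instead, raising phoenix.query.timeoutMs in the client-side hbase-site.xml gives the query more time to complete.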
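Regarding the per-file loading you describe below: a minimal sketch of how that workflow can be scripted is shown here. The paths, table name, and Phoenix client jar version are placeholders, not taken from your setup, and I'm assuming the Hive export uses the default comma delimiter (CsvBulkLoadTool also defaults to comma; it takes -d to override):

    # 1 - export the Hive table as comma-delimited text on HDFS
    hive -e "INSERT OVERWRITE DIRECTORY '/tmp/my_table_csv'
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             SELECT * FROM my_table;"

    # 2 - load one HDFS file at a time, so each MapReduce job stays
    #     small enough to avoid the map-task timeouts described below
    for f in $(hdfs dfs -ls /tmp/my_table_csv | grep '^-' | awk '{print $NF}'); do
      hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table MY_TABLE \
        --input "$f"
    done
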
On Mon, Jun 22, 2015 at 11:34 AM, Riesland, Zack <[email protected]> wrote:
> I had a very large Hive table that I needed in HBase.
>
> After asking around, I came to the conclusion that my best bet was to:
>
> 1 – export the hive table to a CSV ‘file’/folder on the HDFS
> 2 – Use the org.apache.phoenix.mapreduce.CsvBulkLoadTool to import the data.
>
> I found that if I tried to pass the entire folder (~1/2 TB of data) to the CsvBulkLoadTool, my job would eventually fail.
>
> Empirically, it seems that on our particular cluster, 20-30 GB of data is the most that the CsvBulkLoadTool can handle at one time without so many map jobs timing out that the entire operation fails.
>
> So I passed one sub-file at a time and eventually got all the data into HBase.
>
> I tried doing a select count(*) on the table to see whether all of the rows were transferred, but this eventually fails.
>
> Today, I believe I found a set of data that is in Hive but NOT in HBase.
>
> So, I have 2 questions:
>
> 1) Are there any known errors with the CsvBulkLoadTool such that it might skip some data without getting my attention with some kind of error?
>
> 2) Is there a straightforward way to count the rows in my Phoenix table so that I can compare the Hive table with the HBase table?
>
> Thanks in advance!

-- 
Thanks & Regards,
Anil Gupta
