For #2: hbase org.apache.hadoop.hbase.mapreduce.RowCounter <TABLE_NAME>
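
For example (MY_TABLE is a placeholder; note that Phoenix upper-cases
unquoted table names, so pass the upper-cased name to RowCounter):

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter MY_TABLE

This runs as a MapReduce job over the HBase table rather than through a
client connection, so it should not hit the timeout that tends to kill a
long-running select count(*). The total appears in the job counters at the
end of the output (look for the ROWS counter).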

On Mon, Jun 22, 2015 at 11:34 AM, Riesland, Zack <[email protected]>
wrote:

>  I had a very large Hive table that I needed in HBase.
>
>
>
> After asking around, I came to the conclusion that my best bet was to:
>
>
>
> 1 – Export the Hive table to a CSV ‘file’/folder on HDFS
>
> 2 – Use the org.apache.phoenix.mapreduce.CsvBulkLoadTool to import the
> data.
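>
> For reference, the two steps looked roughly like this (table and path
> names here are illustrative):
>
>   -- Step 1: export the Hive table as comma-delimited text on HDFS
>   INSERT OVERWRITE DIRECTORY '/tmp/my_table_csv'
>   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>   SELECT * FROM my_hive_table;
>
>   # Step 2: bulk-load the exported folder into the Phoenix table
>   hadoop jar phoenix-<version>-client.jar \
>     org.apache.phoenix.mapreduce.CsvBulkLoadTool \
>     --table MY_TABLE --input /tmp/my_table_csv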
>
>
>
> I found that if I tried to pass the entire folder (~ 1/2 TB of data) to
> the CsvBulkLoadTool, my job would eventually fail.
>
>
>
> Empirically, it seems that on our particular cluster, 20-30 GB of data is
> the most that the CsvBulkLoadTool can handle at one time without so many
> map tasks timing out that the entire operation fails.
>
>
>
> So I passed one sub-file at a time and eventually got all the data into
> HBase.
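>
> The per-file loads were driven by a small shell loop along these lines
> (again, names are illustrative):
>
>   # list the exported files and bulk-load each one separately
>   for f in $(hdfs dfs -ls /tmp/my_table_csv | awk '{print $NF}' | grep '^/'); do
>     hadoop jar phoenix-<version>-client.jar \
>       org.apache.phoenix.mapreduce.CsvBulkLoadTool \
>       --table MY_TABLE --input "$f"
>   done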
>
>
>
> I tried running a select count(*) on the table to see whether all of the
> rows were transferred, but this eventually fails.
>
>
>
> Today, I believe I found a set of data that is in Hive but NOT in HBase.
>
>
>
> So, I have 2 questions:
>
>
>
> 1) Are there any known issues with the CsvBulkLoadTool that could cause it
> to skip some data without raising any kind of error?
>
>
>
> 2) Is there a straightforward way to count the rows in my Phoenix table so
> that I can compare the Hive table with the HBase table?
>
>
>
> Thanks in advance!
>



-- 
Thanks & Regards,
Anil Gupta
