Hive can connect to HBase and insert data directly, in either direction. I
don't know if it also works via Phoenix...
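
For reference, the direct route is usually a Hive table declared with the
HBase storage handler and populated with a plain INSERT ... SELECT. A rough
sketch over JDBC (the HiveServer2 URL, table names and column mapping below
are placeholders, not anything from your cluster):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveToHBaseDirect {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // Hive table backed by an HBase table via the HBase storage handler.
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS hbase_copy (rowkey STRING, val STRING) "
          + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
          + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val') "
          + "TBLPROPERTIES ('hbase.table.name' = 'HBASE_COPY')");

      // The INSERT runs as a normal Hive job, but each row is written to
      // HBase as an individual Put.
      stmt.execute(
          "INSERT OVERWRITE TABLE hbase_copy "
          + "SELECT rowkey, val FROM hive_source");
    }
  }
}

Because every row goes through the HBase write path as a Put, this tends to
be much slower than a bulk load, so for half a terabyte the HFile route you
took below is still the right call.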

Counting is too slow as a single-threaded job from the command line - you
should write a MapReduce job with a filter that returns only the row keys;
that is really fast.
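
Something along these lines - a minimal sketch of such a counting job,
assuming the HBase 1.x client and MapReduce APIs, with the HBase table name
taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class KeyOnlyRowCount {

  // Map-only job: each mapper scans one region and bumps a counter per row.
  static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    enum Counters { ROWS }
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
      ctx.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "key-only row count");
    job.setJarByClass(KeyOnlyRowCount.class);

    Scan scan = new Scan();
    scan.setCaching(5000);                    // fewer RPC round trips
    scan.setCacheBlocks(false);               // don't churn the block cache
    scan.setFilter(new FirstKeyOnlyFilter()); // ship only one cell per row

    TableMapReduceUtil.initTableMapperJob(
        args[0], scan, CountMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

HBase actually ships with essentially this job as
org.apache.hadoop.hbase.mapreduce.RowCounter, so running that against the
HBase table behind your Phoenix table is the quickest way to get a number to
compare with Hive.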

A MapReduce job is also the solution for loading data from Hive into HBase
(read from HDFS rather than through Hive, prepare the output in Phoenix
format, and bulk-load the results).
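
The CsvBulkLoadTool mentioned below is already such a job: it reads the CSV
files from HDFS, encodes each line into Phoenix's key/value format, writes
HFiles and hands them to HBase's bulk loader. If you ever want to drive it
from code rather than the command line, a sketch like this should work (the
table name, HDFS path and ZooKeeper quorum are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Runs the same MapReduce job as the command-line invocation.
    int exit = ToolRunner.run(conf, new CsvBulkLoadTool(), new String[] {
        "--table", "MY_PHOENIX_TABLE",                       // placeholder
        "--input", "/apps/hive/warehouse/export/chunk-000",  // placeholder
        "--zookeeper", "zk1,zk2,zk3:2181"                    // placeholder
    });
    System.exit(exit);
  }
}

If memory serves, the tool also has a -g/--ignore-errors option; if that was
set on your runs it will keep going past bad input lines, which would be one
way for rows to go missing without an obvious failure (your question 1).
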
On 22 Jun 2015, 9:34 p.m., "Riesland, Zack" <[email protected]>
wrote:

>  I had a very large Hive table that I needed in HBase.
>
>
>
> After asking around, I came to the conclusion that my best bet was to:
>
>
>
> 1 – export the Hive table to a CSV ‘file’/folder on HDFS
>
> 2 – Use the org.apache.phoenix.mapreduce.CsvBulkLoadTool to import the
> data.
>
>
>
> I found that if I tried to pass the entire folder (~ 1/2 TB of data) to
> the CsvBulkLoadTool, my job would eventually fail.
>
>
>
> Empirically, it seems that on our particular cluster, 20-30 GB of data is
> the most that the CsvBulkLoadTool can handle at one time without so many
> map tasks timing out that the entire operation fails.
>
>
>
> So I passed one sub-file at a time and eventually got all the data into
> HBase.
>
>
>
> I tried doing a select count(*) on the table to see whether all of the
> rows were transferred, but this eventually fails.
>
>
>
> Today, I believe I found a set of data that is in Hive but NOT in HBase.
>
>
>
> So, I have 2 questions:
>
>
>
> 1) Are there any known issues with the CsvBulkLoadTool that might cause it
> to skip some data without surfacing any kind of error?
>
>
>
> 2) Is there a straightforward way to count the rows in my Phoenix table so
> that I can compare the Hive table with the HBase table?
>
>
>
> Thanks in advance!
>
