How many disk drives do you have per node?
Generally, a node should have 12 drives, configured neither as RAID nor as
LVM.

The files could be larger (4 GB, or better 40 GB; your NameNode will thank
you, since it has fewer files to track), or you could use Hadoop Archives
(HAR).
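
To put rough numbers on that, here is a quick sketch. It assumes an average
file size of 350 MB (the midpoint of the 300-400 MB you quoted), and
files_needed is just a hypothetical helper. It only estimates how many files
the NameNode would have to track, not the number of mappers, which depends on
the input split size.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def files_needed(total_bytes, target_file_bytes):
    """Number of files of roughly target_file_bytes needed to hold total_bytes."""
    return math.ceil(total_bytes / target_file_bytes)


current_files = 6020       # file count from the original message
avg_file_size = 350 * MB   # ASSUMPTION: midpoint of the quoted 300-400 MB range
total = current_files * avg_file_size

for target in (4 * GB, 40 * GB):
    count = files_needed(total, target)
    print(f"{target // GB} GB files: about {count} files")
```

Under that assumption the same data fits in a few hundred files at 4 GB each,
or a few dozen at 40 GB each, instead of 6,020.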

I am not sure about the latest status of Phoenix, but maybe you can make the
HBase tables directly available as external tables in Hive. You would save a
lot of time by not converting to CSV.
You could also explore using Sqoop to move the data from Hive to Phoenix (via
JDBC) or to HBase.

> On 13 Feb 2016, at 13:41, Riesland, Zack <[email protected]> wrote:
> 
> On a daily basis, we move large amounts of data from hive to hbase, via 
> phoenix.
>  
> In order to do this, we create an external hive table with the data we need 
> to move (all a subset of 1 compressed ORC table), and then use the Phoenix 
> CsvBulkUpload utility. From everything I've read, this is the best approach.
>  
> My question is: how can I optimize my external table to make the bulk upload 
> as efficient as possible?
>  
> For example, today, my external table is backed by 6,020 files in HDFS, each 
> about 300-400mb.
>  
> This results in a mapreduce operation with 12,209 mappers that takes about 3 
> hours (we don't have a huge cluster – 13 data nodes currently).
>  
> Would it be better to have more, smaller files? Fewer, larger files?
