RE: Optimizing external table structure

Riesland, Zack Sat, 13 Feb 2016 07:08:16 -0800

Thanks.

We have 16 disks per node, to answer your question.
________________________________________
From: Jörn Franke [[email protected]]
Sent: Saturday, February 13, 2016 9:46 AM
To: [email protected]
Subject: Re: Optimizing external table structure


How many disk drives do you have / node?
Generally, one node should have 12 drives, configured neither as RAID nor as 
LVM.

Files could be a little bit larger (4 GB or, better, 40 GB - your namenode will 
thank you), or you could use Hadoop Archive (HAR).
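
To put rough numbers on that, here is a small sketch using the figures from the 
quoted question (6,020 files of 300-400 MB). The 350 MB average is an assumed 
midpoint, and files_needed is a hypothetical helper for illustration only, not 
part of any Hadoop tool.

```python
import math

# Figures from the original question: 6,020 files of roughly 300-400 MB each.
FILE_COUNT = 6020
AVG_FILE_MB = 350  # assumption: midpoint of the stated 300-400 MB range


def files_needed(total_mb, target_file_gb):
    """Number of files needed to hold total_mb at the given target file size."""
    return math.ceil(total_mb / (target_file_gb * 1024))


# Estimated total size of the data set, in MB (roughly 2 TB).
total_mb = FILE_COUNT * AVG_FILE_MB

for target_gb in (4, 40):
    count = files_needed(total_mb, target_gb)
    print(f"{target_gb} GB files: about {count} files")
```

Under these assumptions the namenode would track a few hundred files at 4 GB, 
or a few dozen at 40 GB, instead of about six thousand.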

I am not sure about the latest status of Phoenix, but maybe you can make 
HBase tables directly available as external tables in Hive - you would save a 
lot of time by not converting to CSV.
You could also explore using Sqoop (import from Hive to JDBC / Phoenix or to 
HBase).

On 13 Feb 2016, at 13:41, Riesland, Zack <[email protected]> wrote:

On a daily basis, we move large amounts of data from Hive to HBase, via Phoenix.

In order to do this, we create an external Hive table with the data we need to 
move (all of it a subset of one compressed ORC table), and then use the Phoenix 
CsvBulkUpload utility. From everything I've read, this is the best approach.

My question is: how can I optimize my external table to make the bulk upload as 
efficient as possible?

For example, today, my external table is backed by 6,020 files in HDFS, each 
about 300-400 MB.

This results in a MapReduce operation with 12,209 mappers that takes about 3 
hours (we don't have a huge cluster – 13 data nodes currently).

Would it be better to have more, smaller files? Fewer, larger files?
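
As a rough way to reason about the mapper count above, the sketch below 
compares the observed mappers per file with a simplified model of one mapper 
per input split. The split sizes of 128 MB and 256 MB and the 350 MB file size 
are assumptions, and splits_per_file is a hypothetical helper; real split 
calculation also depends on the file format and on whether the compression is 
splittable.

```python
import math

# Figures from the message above.
FILES = 6020
MAPPERS = 12209


def splits_per_file(file_mb, split_mb):
    """Simplified model: one mapper per split, with no slack on the last split."""
    return math.ceil(file_mb / split_mb)


# Observed average number of mappers launched per input file.
observed = MAPPERS / FILES
print(f"observed mappers per file: {observed:.2f}")

# Assumed values: a 350 MB file and two common split sizes.
for split_mb in (128, 256):
    n = splits_per_file(350, split_mb)
    print(f"split size {split_mb} MB: {n} mappers per 350 MB file")
```

In this model the mapper count follows the total data volume divided by the 
split size, so merging files on its own would be expected to reduce the number 
of files the namenode tracks more than the number of mappers.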
