How many disk drives do you have / node? Generally one node should have 12 drives not configured as raid and not configured as lvm.
Files could be a little bit larger (4 or better 40 gb - your namenode will thank you) or use Hadoop Archive (HAR). I am not sure about the latest status of Phoenix but maybe you can can make hbase tables directly available as external tables in Hive - you would save a lot of time by not converting to csv. You could also explore using sqoop (import from hive to jdbc / Phoenix or to hbase). > On 13 Feb 2016, at 13:41, Riesland, Zack <[email protected]> wrote: > > On a daily basis, we move large amounts of data from hive to hbase, via > phoenix. > > In order to do this, we create an external hive table with the data we need > to move (all a subset of 1 compressed ORC table), and then use the Phoenix > CsvBulkUpload utility. From everything I've read, this is the best approach. > > My question is: how can I optimize my external table to make the bulk upload > as efficient as possible? > > For example, today, my external table is backed by 6,020 files in HDFS, each > about 300-400mb. > > This results in a mapreduce operation with 12,209 mappers that takes about 3 > hours (we don't have a huge cluster – 13 data nodes currently). > > Would it be better to have more, smaller files? Fewer, larger files?
