I am reading data off of HDFS that doesn't all get loaded into a single table. With the current way of bulk loading I can load into the table where most of the data will end up, and then use the client API (i.e., Put) to load the remaining data from the file into the other tables.
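To be concrete, this is roughly what I mean by using the client API for the leftover rows (just a sketch with the older HTable-based API; the table name, column family, and qualifier are placeholders for my real schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OtherTableLoader {
  // Write one leftover record to a secondary table with the client API.
  // "other_table", "cf", and "q" stand in for the actual schema.
  public static void putRecord(String rowKey, String value) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "other_table");
    try {
      Put put = new Put(Bytes.toBytes(rowKey));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value));
      table.put(put);
    } finally {
      table.close();
    }
  }
}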
The current bulk loading process creates the same number of reducers as there are regions in the specified table, and I think I understand that once the appropriate region servers adopt the HFiles, minor compactions will merge them into the regions' storefiles. It seems like you could instead set the number of reducers to the total number of regions across all the tables involved. Then you would write the partitions file as key-value pairs where the key is the destination table and the value is a region start key (instead of the key being the start key and the value being NullWritable). Mappers could then prefix each row with its destination table before doing a context.write() (rough sketch below), and the TotalOrderPartitioner would need to be modified to account for all these changes. I have a feeling this is an overly complicated approach, and I'm not sure it would even work. Maybe you could do it without all those changes and just use MultipleOutputs? Has anyone else thought about or done bulk loading into multiple tables?
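Here is a sketch of the mapper side of the prefixing idea, just to show what I have in mind. The input layout (tab-separated tableName, rowKey, value), the separator byte, and the column family/qualifier are all made up, and the partitioner/reducer changes (stripping the prefix, or writing each table's HFiles under its own output directory for LoadIncrementalHFiles) aren't shown:

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultiTableMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  // Separator between the table-name prefix and the real row key.
  private static final byte[] SEP = new byte[] { 0 };

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Hypothetical input: tableName \t rowKey \t value
    String[] fields = line.toString().split("\t");
    byte[] table = Bytes.toBytes(fields[0]);
    byte[] rowKey = Bytes.toBytes(fields[1]);

    // Prefix the sort/partition key with the destination table so a modified
    // TotalOrderPartitioner could route it to that table's reducers; within
    // one table the ordering is unchanged since the prefix is constant.
    byte[] prefixedKey = Bytes.add(table, SEP, rowKey);

    // The KeyValue itself keeps the unprefixed row key for the HFile.
    KeyValue kv = new KeyValue(rowKey,
        Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[2]));
    context.write(new ImmutableBytesWritable(prefixedKey), kv);
  }
}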
