I am reading data off of HDFS that doesn't all get loaded into a single table. With the current way of bulk loading I can load into the table where most of the data will end up, and then use the client API (i.e., Put) to load the remaining data from the file into the other tables.
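To be concrete, this is roughly what I mean by using the client API for the leftover rows (just a sketch with the older HTable-based API; the table name, column family, and qualifier are placeholders for my real schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OtherTableLoader {
  // Write one leftover record to a secondary table with the client API.
  // "other_table", "cf", and "q" stand in for the actual schema.
  public static void putRecord(String rowKey, String value) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "other_table");
    try {
      Put put = new Put(Bytes.toBytes(rowKey));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value));
      table.put(put);
    } finally {
      table.close();
    }
  }
}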
The current bulk loading process creates the same number of reducers as there are regions in the specified table, and I think I understand that once the appropriate region servers adopt the HFiles, minor compactions will merge them into the regions' storefiles. It seems like you could instead set the number of reducers to the total number of regions across all the tables involved. Then you would write the partitions file as key-value pairs where the key is the destination table and the value is a region start key (instead of the key being the start key and the value being NullWritable). Mappers could then prefix each row with its destination table before doing a context.write() (rough sketch below), and the TotalOrderPartitioner would need to be modified to account for all these changes. I have a feeling this is an overly complicated approach, and I'm not sure it would even work. Maybe you could do it without all those changes and just use MultipleOutputs? Has anyone else thought about or done bulk loading into multiple tables?
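Here is a sketch of the mapper side of the prefixing idea, just to show what I have in mind. The input layout (tab-separated tableName, rowKey, value), the separator byte, and the column family/qualifier are all made up, and the partitioner/reducer changes (stripping the prefix, or writing each table's HFiles under its own output directory for LoadIncrementalHFiles) aren't shown:

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultiTableMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  // Separator between the table-name prefix and the real row key.
  private static final byte[] SEP = new byte[] { 0 };

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Hypothetical input: tableName \t rowKey \t value
    String[] fields = line.toString().split("\t");
    byte[] table = Bytes.toBytes(fields[0]);
    byte[] rowKey = Bytes.toBytes(fields[1]);

    // Prefix the sort/partition key with the destination table so a modified
    // TotalOrderPartitioner could route it to that table's reducers; within
    // one table the ordering is unchanged since the prefix is constant.
    byte[] prefixedKey = Bytes.add(table, SEP, rowKey);

    // The KeyValue itself keeps the unprefixed row key for the HFile.
    KeyValue kv = new KeyValue(rowKey,
        Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[2]));
    context.write(new ImmutableBytesWritable(prefixedKey), kv);
  }
}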
