Hi,
     We are still very new to all of this HBase/Hadoop/MapReduce stuff, and we are
looking for the best practices that fit our requirements. We are currently
using the latest Cloudera VMware image (single node) for our development tests.

The problem is as follows: 

We have multiple sources in different formats (XML, CSV, etc.), which are dumps
of existing systems. As one might expect, there will be an initial "import" of
the data into HBase, and afterwards the systems would most likely dump whatever
data they have accumulated since the initial import or since the last data dump.
Another thing: we need an intermediary step so that we can ensure all of a
source's data can be successfully processed, something which would look like:

XML data file --(MR JOB)--> Intermediate (hbase table or hfile?) --(MR JOB)--> 
production tables in hbase

We're guessing we can't use something like a transaction in HBase, so we
thought about using an intermediate step: is that how things are normally done?

As we import data into HBase, we will be populating several tables that link
data parts together (account X in System 1 == account Y in System 2) as tuples
in 3 tables. Currently, this is done by a MapReduce job which reads the XML
source and uses MultiTableOutputFormat to "put" data into those 3 HBase tables
(a rough sketch of the mapper is below). This method isn't very fast on our
test sample (2 minutes for 5 MB), so we are looking at optimizing the loading
of the data.
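The mapper currently looks roughly like the following sketch (the table names,
column family and row keys below are made-up placeholders, and the actual XML
parsing is omitted). The output key is the destination table name and the
value is the Put for that table:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Writes to several tables through MultiTableOutputFormat:
// key = destination table name, value = the Put for that table.
public class AccountLinkMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final ImmutableBytesWritable TABLE_A =
      new ImmutableBytesWritable(Bytes.toBytes("accounts"));       // placeholder name
  private static final ImmutableBytesWritable TABLE_B =
      new ImmutableBytesWritable(Bytes.toBytes("accounts_index")); // placeholder name

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // ... parse one XML record from 'value' (omitted) ...
    byte[] rowKey = Bytes.toBytes("system1|accountX");             // placeholder row key

    Put p = new Put(rowKey);
    p.add(Bytes.toBytes("cf"), Bytes.toBytes("linked_account"),
        Bytes.toBytes("system2|accountY"));
    context.write(TABLE_A, p);                                     // goes to "accounts"

    Put idx = new Put(Bytes.toBytes("system2|accountY"));
    idx.add(Bytes.toBytes("cf"), Bytes.toBytes("linked_account"),
        Bytes.toBytes("system1|accountX"));
    context.write(TABLE_B, idx);                                   // goes to "accounts_index"
  }
}

The driver is a map-only job with
job.setOutputFormatClass(MultiTableOutputFormat.class) and
job.setNumReduceTasks(0).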

We have been researching bulk loading, but we are unsure of a couple of things.

Once we process an XML file and populate our 3 "production" HBase tables, could
we bulk load another XML file and append this new data to our 3 tables, or
would it overwrite what was written before?

In order to bulk load, we need to output files using HFileOutputFormat. Since
MultiHFileOutputFormat doesn't seem to officially exist yet (still in the
works, right?), should we process our input XML file with 3 MapReduce jobs
instead of 1 and output an HFile for each? Those HFiles could then become our
intermediate step (if all 3 HFiles were created without errors, the process was
successful: bulk load into HBase).

Also, can you experiment with bulk loading on a VMware image? We're running
into a problem with the partitions file not being found, with the following
exception:

java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.IllegalArgumentException: Can't read partitions file
        at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:588)
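
For reference, this is roughly how we understand the bulk-load wiring is
supposed to look (the table name, paths and mapper class here are placeholders;
as far as we can tell, configureIncrementalLoad() is what sets up the
TotalOrderPartitioner and writes the partitions file that the exception above
complains about):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "xml-to-hfiles");
    job.setJarByClass(BulkLoadDriver.class);
    job.setMapperClass(XmlToPutMapper.class);          // placeholder mapper emitting (row key, Put)
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // XML input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output directory

    // Reads the table's region boundaries, configures the TotalOrderPartitioner
    // and writes the partitions file it needs.
    HTable table = new HTable(conf, "accounts");        // placeholder table name
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the generated HFiles into the table's regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
    }
  }
}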

We also tried another idea to speed things up: what if, instead of doing
individual puts, we passed a list of puts to put() (e.g. htable.put(putList))?
Internally in HBase, would there be less overhead versus multiple calls to
put()? It seems to be faster; however, since we're not going through
context.write(), I'm guessing this will lead to problems later on, right?
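
Concretely, the idea looks something like this minimal sketch (the table name,
column family and batch size are made up; the HTable is written to directly in
the mapper instead of going through context.write()):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchedPutMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;                              // written to directly, bypassing context.write()
  private List<Put> putList = new ArrayList<Put>();

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(
        HBaseConfiguration.create(context.getConfiguration()), "accounts"); // placeholder table
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException {
    // ... parse the record (omitted) and build one Put ...
    Put p = new Put(Bytes.toBytes("some-row-key"));  // placeholder row key
    p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value.toString()));
    putList.add(p);

    if (putList.size() >= 1000) {                    // flush in batches instead of one put() per record
      table.put(putList);
      putList.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (!putList.isEmpty()) {
      table.put(putList);                            // flush whatever is left
    }
    table.close();
  }
}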

Turning off WAL on puts to speed things up isn't an option, since data loss 
would be unacceptable, even if the chances of a failure occurring are slim.

Thanks, David
