bq. Once we process an XML file and populate our 3 "production" HBase tables, could we bulk load another XML file and append this new data to our 3 tables, or would it overwrite what was written before?
You can bulk load another XML file.

bq. should we process our input xml file with 3 MapReduce jobs instead of 1

You don't need to use 3 jobs.

It looks like you are using CDH. Mind telling us the version numbers for HBase and Hadoop? Thanks

On Fri, May 31, 2013 at 1:19 PM, David Poisson <[email protected]> wrote:

> Hi,
> We are still very new at all of this HBase/Hadoop/MapReduce stuff. We are
> looking for the best practices that fit our requirements. We are
> currently using the latest Cloudera VMware image (single node) for our
> development tests.
>
> The problem is as follows:
>
> We have multiple sources in different formats (XML, CSV, etc.), which are
> dumps of existing systems. As one might expect, there will be an initial
> "import" of the data into HBase, and afterwards the systems would most
> likely dump whatever data they have accumulated since the initial import
> or since the last data dump. Another thing: we would require an
> intermediary step, so that we can ensure all of a source's data can be
> successfully processed. It would look like:
>
> XML data file --(MR job)--> intermediate (HBase table or HFile?) --(MR
> job)--> production tables in HBase
>
> We're guessing we can't use something like a transaction in HBase, so we
> thought about using an intermediate step: is that how things are normally
> done?
>
> As we import data into HBase, we will be populating several tables that
> link data parts together (account X in System 1 == account Y in System 2)
> as tuples in 3 tables. Currently, this is done by a MapReduce job which
> reads the XML source and uses MultiTableOutputFormat to "put" data into
> those 3 HBase tables. This method isn't that fast with our test sample
> (2 minutes for 5 MB), so we are looking at optimizing the loading of the
> data.
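For reference, the MultiTableOutputFormat setup described above can be sketched roughly like this (table, family and qualifier names here are made-up placeholders, and the XML parsing is elided):

```java
// Sketch of a mapper feeding MultiTableOutputFormat (HBase 0.94-era API).
// The output key names the destination table for each Put.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlImportMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final ImmutableBytesWritable ACCOUNTS =
      new ImmutableBytesWritable(Bytes.toBytes("accounts"));
  private static final ImmutableBytesWritable LINKS =
      new ImmutableBytesWritable(Bytes.toBytes("links"));

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Parsing elided; assume a row key and value were derived from the XML.
    byte[] row = Bytes.toBytes(line.toString());
    Put put = new Put(row);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), row);
    context.write(ACCOUNTS, put);  // the same mapper can also write to LINKS
  }
}
// In the driver: job.setOutputFormatClass(MultiTableOutputFormat.class);
```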
>
> We have been researching bulk loading, but we are unsure of a couple of
> things:
> Once we process an XML file and populate our 3 "production" HBase tables,
> could we bulk load another XML file and append this new data to our 3
> tables, or would it overwrite what was written before?
> In order to bulk load, we need to output a file using HFileOutputFormat.
> Since MultiHFileOutputFormat doesn't seem to officially exist yet (still
> in the works, right?), should we process our input XML file with 3
> MapReduce jobs instead of 1 and output an HFile for each, which could
> then become our intermediate step (if all 3 HFiles were created without
> errors, then the process was successful: bulk load into HBase)? Can you
> experiment with bulk loading on a VMware image? We're experiencing
> problems with the partitions file not being found, with the following
> exception:
>
> java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
> Caused by: java.lang.IllegalArgumentException: Can't read partitions file
>     at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:588)
>
> We also tried another idea to speed things up: what if, instead of doing
> individual puts, we passed a list of puts to put() (e.g.
> htable.put(putList))? Internally in HBase, would there be less overhead
> versus multiple calls to put()? It seems to be faster; however, since
> we're not using context.write, I'm guessing this will lead to problems
> later on, right?
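On the list-of-puts question: HTable.put(List&lt;Put&gt;) goes through the same client-side write buffer as single puts, so the client can group the buffered puts per region server and issue fewer RPCs. A rough sketch of that pattern (table name, family and row contents are placeholders):

```java
// Sketch: handing HTable a List<Put> instead of one Put per call
// (HBase 0.94-era client API).
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "accounts");
    table.setAutoFlush(false);          // buffer writes client-side too

    List<Put> puts = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
      puts.add(put);
    }
    table.put(puts);                    // one batched round of writes
    table.flushCommits();               // push anything still buffered
    table.close();
  }
}
```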
>
> Turning off the WAL on puts to speed things up isn't an option, since
> data loss would be unacceptable, even if the chances of a failure
> occurring are slim.
>
> Thanks,
> David
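On the bulk-load half of the question: a bulk load moves new HFiles into the table's regions alongside the existing store files, so successive loads accumulate rather than overwrite. A sketch of the driver side (output path and table name are assumptions; mapper/input setup is elided). Note that HFileOutputFormat.configureIncrementalLoad() is what wires in TotalOrderPartitioner and writes its partitions file from the table's region boundaries, so if you call it yourself, the "Can't read partitions file" error may be related to running under the LocalJobRunner rather than a real cluster:

```java
// Sketch of the bulk-load path: run an HFileOutputFormat job, then use
// LoadIncrementalHFiles to move the resulting HFiles into the table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "xml-to-hfiles");
    // ... set mapper class, input format and input path here ...

    HTable table = new HTable(conf, "accounts");
    // Sets the reducer, TotalOrderPartitioner, and the partitions file
    // based on the table's current region boundaries.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    Path out = new Path("/tmp/hfiles");
    FileOutputFormat.setOutputPath(job, out);

    if (job.waitForCompletion(true)) {
      // Moves the new HFiles into the regions; existing data stays in
      // place, so repeated loads append rather than overwrite.
      new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
    }
    table.close();
  }
}
```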
