bq. Once we process an XML file and populate our 3 "production" HBase tables, could we bulk load another XML file and append this new data to our 3 tables, or would it overwrite what was written before?
You can bulk load another XML file.

bq. should we process our input xml file with 3 MapReduce jobs instead of 1

You don't need to use 3 jobs.

It looks like you are using CDH. Mind telling us the version numbers for HBase and Hadoop? Thanks

On Fri, May 31, 2013 at 1:19 PM, David Poisson <[email protected]> wrote:

> Hi,
> We are still very new at all of this HBase/Hadoop/MapReduce stuff. We are
> looking for the best practices that fit our requirements. We are
> currently using the latest Cloudera VMware image (single node) for our
> development tests.
>
> The problem is as follows:
>
> We have multiple sources in different formats (XML, CSV, etc.), which are
> dumps of existing systems. As one might expect, there will be an initial
> "import" of the data into HBase, and afterwards the systems would most
> likely dump whatever data they have accumulated since the initial import
> or since the last data dump. Another thing: we would require an
> intermediary step, so that we can ensure all of a source's data can be
> successfully processed. It would look like:
>
> XML data file --(MR job)--> intermediate (HBase table or HFile?) --(MR
> job)--> production tables in HBase
>
> We're guessing we can't use something like a transaction in HBase, so we
> thought about using an intermediate step: is that how things are normally
> done?
>
> As we import data into HBase, we will be populating several tables that
> link data parts together (account X in System 1 == account Y in System 2)
> as tuples in 3 tables. Currently, this is done by a MapReduce job which
> reads the XML source and uses MultiTableOutputFormat to "put" data into
> those 3 HBase tables. This method isn't that fast with our test sample
> (2 minutes for 5 MB), so we are looking at optimizing the loading of the
> data.
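For reference, the MultiTableOutputFormat setup described above can be sketched roughly like this (table, family and qualifier names here are made-up placeholders, and the XML parsing is elided):

```java
// Sketch of a mapper feeding MultiTableOutputFormat (HBase 0.94-era API).
// The output key names the destination table for each Put.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlImportMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final ImmutableBytesWritable ACCOUNTS =
      new ImmutableBytesWritable(Bytes.toBytes("accounts"));
  private static final ImmutableBytesWritable LINKS =
      new ImmutableBytesWritable(Bytes.toBytes("links"));

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Parsing elided; assume a row key and value were derived from the XML.
    byte[] row = Bytes.toBytes(line.toString());
    Put put = new Put(row);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), row);
    context.write(ACCOUNTS, put);  // the same mapper can also write to LINKS
  }
}
// In the driver: job.setOutputFormatClass(MultiTableOutputFormat.class);
```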
>
> We have been researching bulk loading, but we are unsure of a couple of
> things:
> Once we process an XML file and populate our 3 "production" HBase tables,
> could we bulk load another XML file and append this new data to our 3
> tables, or would it overwrite what was written before?
> In order to bulk load, we need to output a file using HFileOutputFormat.
> Since MultiHFileOutputFormat doesn't seem to officially exist yet (still
> in the works, right?), should we process our input XML file with 3
> MapReduce jobs instead of 1 and output an HFile for each, which could
> then become our intermediate step (if all 3 HFiles were created without
> errors, then the process was successful: bulk load into HBase)? Can you
> experiment with bulk loading on a VMware image? We're experiencing
> problems with the partitions file not being found, with the following
> exception:
>
> java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
> Caused by: java.lang.IllegalArgumentException: Can't read partitions file
>     at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:588)
>
> We also tried another idea to speed things up: what if, instead of doing
> individual puts, we passed a list of puts to put() (e.g.
> htable.put(putList))? Internally in HBase, would there be less overhead
> versus multiple calls to put()? It seems to be faster; however, since
> we're not using context.write, I'm guessing this will lead to problems
> later on, right?
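On the list-of-puts question: HTable.put(List&lt;Put&gt;) goes through the same client-side write buffer as single puts, so the client can group the buffered puts per region server and issue fewer RPCs. A rough sketch of that pattern (table name, family and row contents are placeholders):

```java
// Sketch: handing HTable a List<Put> instead of one Put per call
// (HBase 0.94-era client API).
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "accounts");
    table.setAutoFlush(false);          // buffer writes client-side too

    List<Put> puts = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
      puts.add(put);
    }
    table.put(puts);                    // one batched round of writes
    table.flushCommits();               // push anything still buffered
    table.close();
  }
}
```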
>
> Turning off the WAL on puts to speed things up isn't an option, since
> data loss would be unacceptable, even if the chances of a failure
> occurring are slim.
>
> Thanks,
> David
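On the bulk-load half of the question: a bulk load moves new HFiles into the table's regions alongside the existing store files, so successive loads accumulate rather than overwrite. A sketch of the driver side (output path and table name are assumptions; mapper/input setup is elided). Note that HFileOutputFormat.configureIncrementalLoad() is what wires in TotalOrderPartitioner and writes its partitions file from the table's region boundaries, so if you call it yourself, the "Can't read partitions file" error may be related to running under the LocalJobRunner rather than a real cluster:

```java
// Sketch of the bulk-load path: run an HFileOutputFormat job, then use
// LoadIncrementalHFiles to move the resulting HFiles into the table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "xml-to-hfiles");
    // ... set mapper class, input format and input path here ...

    HTable table = new HTable(conf, "accounts");
    // Sets the reducer, TotalOrderPartitioner, and the partitions file
    // based on the table's current region boundaries.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    Path out = new Path("/tmp/hfiles");
    FileOutputFormat.setOutputPath(job, out);

    if (job.waitForCompletion(true)) {
      // Moves the new HFiles into the regions; existing data stays in
      // place, so repeated loads append rather than overwrite.
      new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
    }
    table.close();
  }
}
```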
