RE: Best Way to Insert data into Hbase using Map Reduce

Michael Segel Mon, 08 Nov 2010 11:23:22 -0800

Ok.

You have a couple of issues.
First, each file is a single record. That doesn't make for a good map/reduce 
job, although you can pass in the directory and then for each file you'd get a 
map/reduce task, assuming that you're processing all of the files at the same 
time.


Having millions of fields... I'm not sure that your XML file holds 
well-structured data, or that you really want to create one row per record.

Part of your speed problem is probably that building a DOM tree with millions 
of fields takes a long time. (You have to load the entire document into memory, 
and you pay for the time it takes to build the tree.) Then you have to 
determine your mapping from the JDOM object to your HBase table.

Switching to Stax will make your code more efficient.
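
To make the Stax point concrete, here is a minimal sketch using the JDK's javax.xml.stream API. It is not code from this thread: the class name StaxFieldReader, the sink callback, and the sample element names are assumptions for illustration. It streams the document and hands each leaf element to the sink as it is read, so no tree is held in memory. It is a flat loop that only handles simple leaf elements; nested or mixed content would need the more careful recursive handling mentioned elsewhere in this thread.

```java
import java.io.StringReader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams XML and passes each leaf element
// (name, trimmed text) to a sink without building a DOM tree.
final class StaxFieldReader {

    // Returns the number of leaf elements handed to the sink.
    static int readFields(String xml, BiConsumer<String, String> sink) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            String leaf = null;                 // element currently open, if it may be a leaf
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // A new start tag means any enclosing element is not a leaf.
                        leaf = reader.getLocalName();
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        // Text may arrive in several chunks, so append.
                        if (leaf != null) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (leaf != null) {
                            sink.accept(leaf, text.toString().trim());
                            count++;
                        }
                        leaf = null;
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample document; the <adc> root matches the tag used in the job config.
        String xml = "<adc><counter>42</counter><name> cpu </name></adc>";
        readFields(xml, (name, value) -> System.out.println(name + " = " + value));
    }
}
```

In a real job the sink would be whatever maps a field to an HBase column, which is the part you still have to design.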

With respect to the buffer caching:

What that will do is cache your writes on the client side. I'm not sure that 
makes sense when you're processing an entire file that is going to be larger 
than your cache.
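
As a rough illustration of what a client-side write buffer does, here is a small model in plain Java. It is not the HBase implementation: the class name WriteBufferModel, the sink callback, and the 64-byte cell size are assumptions for illustration. The only figure taken from the thread is the buffer size, 1024*1024*24 bytes (24 MB). The model shows that when the input is larger than the buffer, the buffer simply fills and flushes repeatedly, so it batches round trips rather than holding the whole file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of a client-side write buffer; not HBase code.
final class WriteBufferModel {
    private final long limitBytes;
    private final Consumer<List<byte[]>> sink;   // stands in for one batched round trip
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;

    WriteBufferModel(long limitBytes, Consumer<List<byte[]>> sink) {
        this.limitBytes = limitBytes;
        this.sink = sink;
    }

    // Queue one cell; flush automatically once the buffer reaches its limit.
    void put(byte[] cell) {
        pending.add(cell);
        pendingBytes += cell.length;
        if (pendingBytes >= limitBytes) {
            flush();
        }
    }

    // Send everything queued so far as a single batch.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));
        pending.clear();
        pendingBytes = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        long limit = 1024L * 1024 * 24;          // buffer size quoted in the thread
        long[] sent = {0};
        WriteBufferModel buffer =
                new WriteBufferModel(limit, batch -> sent[0] += batch.size());
        byte[] cell = new byte[64];              // assumed cell size, illustration only
        for (int i = 0; i < 1_000_000; i++) {
            buffer.put(cell);
        }
        buffer.flush();                          // push out the final partial batch
        System.out.println(buffer.flushCount() + " flushes for " + sent[0] + " cells");
    }
}
```

The final explicit flush matters: anything still queued when the task ends is never sent unless you flush it yourself.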

I don't believe that it is going to be your performance issue. Having both a 
bad XML schema and a bad HBase schema will be an issue.

> Date: Mon, 8 Nov 2010 21:59:22 +0500
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> From: [email protected]
> To: [email protected]
> 
> One more thing I want to ask: I have found that people use the following
> buffer size.
> 
>   table.setWriteBufferSize(1024*1024*24);
>   table.setAutoFlush(false);
> 
> Is there any specific reason for choosing this buffer size, and how much RAM
> is required for it? I have given 4 GB to each region server, and I can see
> the used heap value for the region servers keep increasing until the region
> servers crash.
> 
> On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <[email protected]> wrote:
> 
> > Ok
> > Well... I am getting hundreds of files daily, all of which need to be
> > processed. That's why I am using Hadoop, so that it manages the
> > distribution of the processing itself.
> > Yes, one record has millions of fields.
> >
> > Thanks for comments.
> >
> >
> > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <[email protected]> wrote:
> >
> >>
> >> Switch out the JDOM for a Stax parser.
> >>
> >> Ok, having said that...
> >> You said you have a single record per file. Ok that means you have a lot
> >> of fields.
> >> Because you have 1 record, this isn't a map/reduce problem. You're better
> >> off writing a single threaded app
> >> to read the file, parse the file using Stax, and then write the fields to
> >> HBase.
> >>
> >> I'm not sure why you have millions of put()s.
> >> Do you have millions of fields in this one record?
> >>
> >> Writing a good stax parser and then mapping the fields to your hbase
> >> column(s) will help.
> >>
> >> HTH
> >>
> >> -Mike
> >> PS. A good stax implementation would be a recursive/re-entrant piece of
> >> code.
> >> While the code may look simple, it takes a skilled developer to write and
> >> maintain.
> >>
> >>
> >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > From: [email protected]
> >> > To: [email protected]
> >> >
> >> > HI
> >> >
> >> > I have used the JDOM library to parse the XML in the mapper. In my
> >> > case, one single file consists of 1 record, so I give one complete file
> >> > to the map process and extract the information I need from it. I have
> >> > only 2 column families in my schema, and the bottleneck was the put
> >> > statements, which run millions of times for each file. When I comment
> >> > out the put statement, the job completes within minutes, but with the
> >> > put statement it was taking about 7 hours to complete the same job.
> >> > Anyhow, I have changed the code according to the suggestion given by
> >> > Michael and am now using the Java API to write the data instead of the
> >> > table output format. I collect the puts in a list and flush them every
> >> > 1000 records, and this reduces the time significantly. Now the whole
> >> > job takes approximately 1 hour and 45 minutes, but still not minutes.
> >> > So is there anything left that I might apply to increase performance?
> >> >
> >> > Thanks
> >> >
> >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <[email protected]> wrote:
> >> >
> >> > > Good points.
> >> > > Before we can make any rational suggestion, we need to know where the
> >> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
> >> > > personally favor Michael's suggestion to split the ingest and the
> >> > > parsing parts of your job, and to switch to a parser that is faster
> >> > > than a DOM parser (SAX or Stax). But, without knowing what the
> >> > > bottleneck actually is, all of these suggestions are shots in the dark.
> >> > >
> >> > > What is the network load, the CPU load, the disk load?  Have you at
> >> > > least installed Ganglia or some equivalent so that you can see what
> >> > > the load is across the cluster?
> >> > >
> >> > > Dave
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Michael Segel [mailto:[email protected]]
> >> > > Sent: Friday, November 05, 2010 9:49 AM
> >> > > To: [email protected]
> >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > >
> >> > >
> >> > > I don't think using the buffered client is going to help a lot with
> >> > > performance.
> >> > >
> >> > > I'm a little confused because it doesn't sound like Shuja is using a
> >> > > map/reduce job to parse the file.
> >> > > That is... he says he parses the file in to a dom tree. Usually your
> >> > > map job parses each record and then in the mapper you parse out the
> >> > > record. Within the m/r job we don't parse out the fields in the
> >> > > records because we do additional processing which 'dedupes' the data
> >> > > so we don't have to further process the data.
> >> > > The second job only has to parse a portion of the original records.
> >> > >
> >> > > So assuming that Shuja is actually using a map reduce job, and each
> >> > > xml record is being parsed within the mapper() there are a couple of
> >> > > things...
> >> > > 1) Reduce the number of column families that you are using. (Each
> >> > >    column family is written to a separate file)
> >> > > 2) Set up the HTable instance in Mapper.setup()
> >> > > 3) Switch to a different dom class (not all java classes are equal) or
> >> > >    switch to Stax.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > > From: [email protected]
> >> > > > To: [email protected]
> >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Have you tried turning off auto flush, and managing the flush in
> >> > > > your own code (say every 1000 puts?)
> >> > > > Dave
> >> > > >
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: Shuja Rehman [mailto:[email protected]]
> >> > > > Sent: Friday, November 05, 2010 8:04 AM
> >> > > > To: [email protected]
> >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Michael
> >> > > >
> >> > > > Hmm... so you are storing the XML record in HBase and parsing it in
> >> > > > a second job. But in my case I am also parsing it in the first
> >> > > > phase. What I do is get the XML file, parse it using JDOM, and then
> >> > > > put the data in HBase. So parsing and putting are both done in one
> >> > > > phase, in the mapper code.
> >> > > >
> >> > > > My actual problem is that after parsing the file, I need to call
> >> > > > put millions of times, and I think each call connects to HBase and
> >> > > > then inserts, which might be the reason for the slow processing. So
> >> > > > I am trying to figure out some way to first buffer the data and
> >> > > > then insert it in batches. That is, in one put statement I could
> >> > > > insert many records, and I think the process would then be very
> >> > > > fast.
> >> > > >
> >> > > > Secondly, what does this mean? "we write the raw record in via a
> >> > > > single put() so the map() method is a null writable."
> >> > > >
> >> > > > Can you explain it more?
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > > >
> >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <[email protected]> wrote:
> >> > > >
> >> > > > >
> >> > > > > Shuja,
> >> > > > >
> >> > > > > Just did a quick glance.
> >> > > > >
> >> > > > > What is it that you want to do exactly?
> >> > > > >
> >> > > > > Here's how we do it... (at a high level.)
> >> > > > >
> >> > > > > Input is an XML file where we want to store the raw XML records
> >> > > > > in hbase, one record per row.
> >> > > > >
> >> > > > > Instead of using the output of the map() method, we write the raw
> >> > > > > record in via a single put() so the map() method is a null
> >> > > > > writable.
> >> > > > >
> >> > > > > It's pretty fast. However fast is relative.
> >> > > > >
> >> > > > > Another thing... we store the xml record as a string (converted
> >> > > > > to bytes) rather than a serialized object.
> >> > > > >
> >> > > > > Then you can break it down into individual fields in a second
> >> > > > > batch job. (You can start with a DOM parser, and later move to a
> >> > > > > Stax parser. Depending on which DOM parser you have and the size
> >> > > > > of the record, it should be 'fast enough'. A good implementation
> >> > > > > of Stax tends to be recursive/re-entrant code which is harder to
> >> > > > > maintain.)
> >> > > > >
> >> > > > > HTH
> >> > > > >
> >> > > > > -Mike
> >> > > > >
> >> > > > >
> >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> >> > > > > > From: [email protected]
> >> > > > > > To: [email protected]
> >> > > > > >
> >> > > > > > Hi
> >> > > > > >
> >> > > > > > I am reading data from raw XML files and inserting it into
> >> > > > > > HBase using TableOutputFormat in a map reduce job, but due to
> >> > > > > > the heavy put statements it takes many hours to process the
> >> > > > > > data. Here is my sample code.
> >> > > > > >
> >> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> >> > > > > > conf.set("xmlinput.start", "<adc>");
> >> > > > > > conf.set("xmlinput.end", "</adc>");
> >> > > > > > conf.set("io.serializations",
> >> > > > > >     "org.apache.hadoop.io.serializer.JavaSerialization,"
> >> > > > > >     + "org.apache.hadoop.io.serializer.WritableSerialization");
> >> > > > > >
> >> > > > > > Job job = new Job(conf, "Populate Table with Data");
> >> > > > > >
> >> > > > > > FileInputFormat.setInputPaths(job, input);
> >> > > > > > job.setJarByClass(ParserDriver.class);
> >> > > > > > job.setMapperClass(MyParserMapper.class);
> >> > > > > > job.setNumReduceTasks(0);
> >> > > > > > job.setInputFormatClass(XmlInputFormat.class);
> >> > > > > > job.setOutputFormatClass(TableOutputFormat.class);
> >> > > > > >
> >> > > > > >
> >> > > > > > *and mapper code*
> >> > > > > >
> >> > > > > > public class MyParserMapper extends
> >> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> >> > > > > >
> >> > > > > >     @Override
> >> > > > > >     public void map(LongWritable key, Text value1, Context context)
> >> > > > > >             throws IOException, InterruptedException {
> >> > > > > >         // doing some processing
> >> > > > > >         while (rItr.hasNext()) {
> >> > > > > >             // and this put statement runs 132,622,560 times
> >> > > > > >             // to insert the data.
> >> > > > > >             context.write(NullWritable.get(),
> >> > > > > >                 new Put(rowId).add(Bytes.toBytes("CounterValues"),
> >> > > > > >                     Bytes.toBytes(counter.toString()),
> >> > > > > >                     Bytes.toBytes(rElement.getTextTrim())));
> >> > > > > >         }
> >> > > > > >     }
> >> > > > > > }
> >> > > > > >
> >> > > > > > Is there any other way of doing this task so that I can improve
> >> > > > > > the performance?
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Regards
> >> > > > > > Shuja-ur-Rehman Baig
> >> > > > > > <http://pk.linkedin.com/in/shujamughal>
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Regards
> >> > > > Shuja-ur-Rehman Baig
> >> > > > <http://pk.linkedin.com/in/shujamughal>
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Regards
> >> > Shuja-ur-Rehman Baig
> >> > <http://pk.linkedin.com/in/shujamughal>
> >>
> >>
> >
> >
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>
> >
> >
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
