Ok, well... I am getting hundreds of files daily which all need to be processed; that's why I am using Hadoop, so it manages the distribution of the processing itself. Yes, one record has millions of fields.
Thanks for the comments.

On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <[email protected]> wrote:

> Switch out the JDOM for a Stax parser.
>
> Ok, having said that...
> You said you have a single record per file. Ok, that means you have a lot of fields.
> Because you have 1 record, this isn't a map/reduce problem. You're better off writing a single-threaded app
> to read the file, parse it using Stax, and then write the fields to HBase.
>
> I'm not sure why you have millions of put()s.
> Do you have millions of fields in this one record?
>
> Writing a good Stax parser and then mapping the fields to your HBase column(s) will help.
>
> HTH
>
> -Mike
> PS. A good Stax implementation would be a recursive/re-entrant piece of code.
> While the code may look simple, it takes a skilled developer to write and maintain.
>
> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > From: [email protected]
> > To: [email protected]
> >
> > Hi,
> >
> > I have used the JDOM library to parse the XML in the mapper. In my case, a single file consists of one record, so I give one complete file to the map process and extract the information I need from it. I have only 2 column families in my schema, and the bottleneck was the put statements, which run millions of times for each file. When I comment out the put statement, the job completes within minutes, but with the put statement it was taking about 7 hours to complete the same job. Anyhow, I have changed the code according to the suggestion given by Michael: I now use the Java API to write the data instead of TableOutputFormat, build a list of puts, and flush them every 1000 records, and that reduced the time significantly. The whole job now finishes in about 1 hour and 45 minutes, but still not in minutes. Is there anything left that I might apply to increase performance further?
> >
> > Thanks
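For reference, the list-of-puts batching described above looks roughly like this with the plain HTable client. This is only a sketch: the table name and column family are taken from the thread, the loop over parsed fields is a placeholder, and the exact constructor/factory signatures vary a little between HBase releases.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedPutExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
        HTable table = new HTable(conf, "mytable");          // table name from the thread
        table.setAutoFlush(false);                           // buffer puts on the client side
        table.setWriteBufferSize(8 * 1024 * 1024);           // e.g. an 8 MB write buffer

        List<Put> batch = new ArrayList<Put>(1000);
        for (int i = 0; i < 100000; i++) {                   // placeholder for iterating the parsed fields
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.add(Bytes.toBytes("CounterValues"),            // column family from the thread
                  Bytes.toBytes("field-" + i),               // qualifier: placeholder
                  Bytes.toBytes("value-" + i));              // value: placeholder
          batch.add(put);
          if (batch.size() == 1000) {                        // flush every 1000 puts, as described above
            table.put(batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          table.put(batch);
        }
        table.flushCommits();                                // push anything still buffered
        table.close();
      }
    }

The gain comes from batching the RPCs to the region servers instead of paying one round trip per put().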
> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <[email protected]> wrote:
> >
> > > Good points.
> > > Before we can make any rational suggestion, we need to know where the bottleneck is, so we can make suggestions to move it elsewhere. I personally favor Michael's suggestion to split the ingest and the parsing parts of your job, and to switch to a parser that is faster than a DOM parser (SAX or Stax). But, without knowing what the bottleneck actually is, all of these suggestions are shots in the dark.
> > >
> > > What is the network load, the CPU load, the disk load? Have you at least installed Ganglia or some equivalent so that you can see what the load is across the cluster?
> > >
> > > Dave
> > >
> > > -----Original Message-----
> > > From: Michael Segel [mailto:[email protected]]
> > > Sent: Friday, November 05, 2010 9:49 AM
> > > To: [email protected]
> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >
> > > I don't think using the buffered client is going to help a lot with performance.
> > >
> > > I'm a little confused, because it doesn't sound like Shuja is using a map/reduce job to parse the file. That is... he says he parses the file into a DOM tree. Usually your map job parses each record, and then in the mapper you parse out the record. Within the m/r job we don't parse out the fields in the records, because we do additional processing which 'dedupes' the data so we don't have to further process the data. The second job only has to parse a portion of the original records.
> > >
> > > So assuming that Shuja is actually using a map reduce job, and each XML record is being parsed within the mapper(), there are a couple of things:
> > > 1) Reduce the number of column families that you are using. (Each column family is written to a separate file.)
> > > 2) Set up the HTable instance in Mapper.setup().
> > > 3) Switch to a different DOM class (not all Java classes are equal), or switch to Stax.
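Putting suggestions 1 and 2 together with the client-side buffering already discussed, a mapper that writes straight to HBase (rather than through TableOutputFormat) might look roughly like the sketch below. The class name, table name, single column family, and qualifier are assumptions made for illustration, and the real parsing of the XML record would replace the placeholder in map().

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DirectPutMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      private HTable table;

      @Override
      protected void setup(Context context) throws IOException {
        // One HTable per map task, opened once in setup() rather than per record.
        // Assumes the HBase settings (hbase-site.xml) are visible to the task.
        table = new HTable(HBaseConfiguration.create(context.getConfiguration()), "mytable");
        table.setAutoFlush(false);                 // let the client buffer puts
        table.setWriteBufferSize(8 * 1024 * 1024); // send them to the region servers in large chunks
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Placeholder for the real parsing: each input line stands in for one parsed field.
        Put put = new Put(Bytes.toBytes(key.get()));
        put.add(Bytes.toBytes("CounterValues"),    // a single column family, per suggestion 1
                Bytes.toBytes("someField"),        // qualifier: placeholder
                Bytes.toBytes(value.toString()));
        table.put(put);                            // buffered; no RPC per call
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.flushCommits();                      // push whatever is still buffered
        table.close();
      }
    }

With this shape the job no longer needs TableOutputFormat at all, and setNumReduceTasks(0) keeps it map-only.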
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > > >
> > > > Have you tried turning off auto flush, and managing the flush in your own code (say every 1000 puts)?
> > > > Dave
> > > >
> > > > -----Original Message-----
> > > > From: Shuja Rehman [mailto:[email protected]]
> > > > Sent: Friday, November 05, 2010 8:04 AM
> > > > To: [email protected]
> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > > >
> > > > Michael,
> > > >
> > > > Hmm... so you are storing the XML record in HBase and parsing it in a second job, but in my case I am parsing it in the first phase as well. What I do is get the XML file, parse it using JDOM, and then put the data into HBase. So parsing + putting are both in one phase, in the mapper code.
> > > >
> > > > My actual problem is that after parsing the file, I need to use the put statement millions of times, and I think for each statement it connects to HBase and then inserts, which might be the reason for the slow processing. So I am trying to figure out some way I can first buffer the data and then insert it in batch fashion; it means in one put statement I can insert many records, and I think if I do it this way the process will be very fast.
> > > >
> > > > Secondly, what does this mean: "we write the raw record in via a single put() so the map() method is a null writable"?
> > > >
> > > > Can you explain it more?
> > > >
> > > > Thanks
> > > >
> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <[email protected]> wrote:
> > > > >
> > > > > Shuja,
> > > > >
> > > > > Just did a quick glance.
> > > > >
> > > > > What is it that you want to do exactly?
> > > > >
> > > > > Here's how we do it... (at a high level.)
> > > > >
> > > > > Input is an XML file where we want to store the raw XML records in HBase, one record per row.
> > > > >
> > > > > Instead of using the output of the map() method, we write the raw record in via a single put(), so the map() method is a null writable.
> > > > >
> > > > > It's pretty fast. However, fast is relative.
> > > > >
> > > > > Another thing... we store the XML record as a string (converted to bytes) rather than a serialized object.
> > > > >
> > > > > Then you can break it down into individual fields in a second batch job. (You can start with a DOM parser, and later move to a Stax parser. Depending on which DOM parser you have and the size of the record, it should be 'fast enough'. A good implementation of Stax tends to be recursive/re-entrant code, which is harder to maintain.)
> > > > >
> > > > > HTH
> > > > >
> > > > > -Mike
> > > > >
> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > > > > From: [email protected]
> > > > > > To: [email protected]
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am reading data from raw XML files and inserting it into HBase using TableOutputFormat in a map reduce job, but due to the heavy put statements it takes many hours to process the data. Here is my sample code.
> > > > > >
> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > > > > conf.set("xmlinput.start", "<adc>");
> > > > > > conf.set("xmlinput.end", "</adc>");
> > > > > > conf.set("io.serializations",
> > > > > >     "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > > > > >
> > > > > > Job job = new Job(conf, "Populate Table with Data");
> > > > > > FileInputFormat.setInputPaths(job, input);
> > > > > > job.setJarByClass(ParserDriver.class);
> > > > > > job.setMapperClass(MyParserMapper.class);
> > > > > > job.setNumReduceTasks(0);
> > > > > > job.setInputFormatClass(XmlInputFormat.class);
> > > > > > job.setOutputFormatClass(TableOutputFormat.class);
> > > > > >
> > > > > > and mapper code:
> > > > > >
> > > > > > public class MyParserMapper extends
> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > > > > >
> > > > > >   @Override
> > > > > >   public void map(LongWritable key, Text value1, Context context)
> > > > > >       throws IOException, InterruptedException {
> > > > > >     // doing some processing
> > > > > >     while (rItr.hasNext()) {
> > > > > >       // this put statement runs 132,622,560 times to insert the data
> > > > > >       context.write(NullWritable.get(),
> > > > > >           new Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > > > >               Bytes.toBytes(counter.toString()),
> > > > > >               Bytes.toBytes(rElement.getTextTrim())));
> > > > > >     }
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > Is there any other way of doing this task so I can improve the performance?
> > > > > >
> > > > > > --
> > > > > > Regards
> > > > > > Shuja-ur-Rehman Baig
> > > > > > <http://pk.linkedin.com/in/shujamughal>
> > > >
> > > > --
> > > > Regards
> > > > Shuja-ur-Rehman Baig
> > > > <http://pk.linkedin.com/in/shujamughal>
> >
> > --
> > Regards
> > Shuja-ur-Rehman Baig
> > <http://pk.linkedin.com/in/shujamughal>

--
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>
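Finally, since the thread keeps coming back to replacing JDOM with a streaming parser, here is a minimal StAX sketch using the standard javax.xml.stream API. It walks the record element by element and builds one Put per leaf field without ever materializing a DOM tree. Treating each element's local name as the column qualifier is purely an assumption for illustration; a real record would need its own mapping from elements to row keys and qualifiers.

    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StaxFieldParser {

      // Streams the XML once and returns one Put per element that carries text content.
      public List<Put> parse(InputStream in, byte[] rowId) throws Exception {
        List<Put> puts = new ArrayList<Put>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String currentElement = null;
        StringBuilder text = new StringBuilder();

        while (reader.hasNext()) {
          switch (reader.next()) {
            case XMLStreamConstants.START_ELEMENT:
              currentElement = reader.getLocalName();            // remember which field we are inside
              text.setLength(0);
              break;
            case XMLStreamConstants.CHARACTERS:
              text.append(reader.getText());
              break;
            case XMLStreamConstants.END_ELEMENT:
              if (currentElement != null && text.toString().trim().length() > 0) {
                Put put = new Put(rowId);
                put.add(Bytes.toBytes("CounterValues"),          // family from the thread
                        Bytes.toBytes(currentElement),           // qualifier = element name (assumption)
                        Bytes.toBytes(text.toString().trim()));
                puts.add(put);
              }
              currentElement = null;                             // only leaf elements produce puts here
              break;
            default:
              break;
          }
        }
        reader.close();
        return puts;
      }
    }

The returned list can then be written with the same batched HTable.put(List<Put>) / flushCommits() pattern shown earlier, so the parser change and the write batching stay independent of each other.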
