One more thing I want to ask: I have found that people set the following buffer size.
table.setWriteBufferSize(1024*1024*24);
table.setAutoFlush(false);

Is there any specific reason for choosing such a buffer size, and how much RAM does it require? I have given 4 GB to each region server, and I can see the used heap of each region server growing and growing until the region servers crash.

On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <[email protected]> wrote:

> Ok
> Well... I am getting hundreds of files daily which all need to be processed;
> that's why I am using Hadoop, so it manages the distribution of the processing
> itself.
> Yes, one record has millions of fields.
>
> Thanks for the comments.
>
>
> On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <[email protected]> wrote:
>
>>
>> Switch out the JDOM for a Stax parser.
>>
>> Ok, having said that...
>> You said you have a single record per file. Ok, that means you have a lot
>> of fields.
>> Because you have 1 record, this isn't a map/reduce problem. You're better
>> off writing a single-threaded app to read the file, parse it using Stax,
>> and then write the fields to HBase.
>>
>> I'm not sure why you have millions of put()s.
>> Do you have millions of fields in this one record?
>>
>> Writing a good Stax parser and then mapping the fields to your HBase
>> column(s) will help.
>>
>> HTH
>>
>> -Mike
>> PS. A good Stax implementation would be a recursive/re-entrant piece of
>> code. While the code may look simple, it takes a skilled developer to
>> write and maintain.
>>
>>
>> > Date: Mon, 8 Nov 2010 14:36:34 +0500
>> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>> > From: [email protected]
>> > To: [email protected]
>> >
>> > Hi
>> >
>> > I have used the JDOM library to parse the XML in the mapper. In my case,
>> > one single file consists of one record, so I give one complete file to the
>> > map process and extract the information I need from it. I have only 2
>> > column families in my schema, and the bottleneck was the put statements,
>> > which run millions of times for each file. When I comment out the put
>> > statement, the job completes within minutes, but with the put statement it
>> > was taking about 7 hours to complete the same job. Anyhow, I have changed
>> > the code according to the suggestion given by Michael and am now using the
>> > Java API to dump the data instead of the table output format, building a
>> > list of puts and flushing them every 1000 records, and that reduces the
>> > time significantly. Now the whole job finishes in roughly 1 hour and 45
>> > minutes, but still not in minutes. So is there anything left which I might
>> > apply to increase performance further?
>> >
>> > Thanks
>> >
>> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <[email protected]> wrote:
>> >
>> > > Good points.
>> > > Before we can make any rational suggestion, we need to know where the
>> > > bottleneck is, so we can make suggestions to move it elsewhere. I
>> > > personally favor Michael's suggestion to split the ingest and the parsing
>> > > parts of your job, and to switch to a parser that is faster than a DOM
>> > > parser (SAX or Stax). But, without knowing what the bottleneck actually is,
>> > > all of these suggestions are shots in the dark.
>> > >
>> > > What is the network load, the CPU load, the disk load? Have you at least
>> > > installed Ganglia or some equivalent so that you can see what the load is
>> > > across the cluster?
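To make the buffering question at the top concrete, here is a minimal sketch of what setAutoFlush(false) plus setWriteBufferSize() does on the client side. The table name "mytable" and the column family "CounterValues" are taken from the thread; the row keys, qualifiers, and values are placeholders, and the 24 MB figure is simply the value quoted above, not a recommendation. Note that this buffer is allocated in the client JVM that issues the puts (here, the mapper tasks), not in the region server heap; region-server heap growth is dominated by the MemStores, which is a separate tuning question.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteBufferSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");          // table name from the thread

        // Puts now accumulate in a client-side buffer and are shipped in
        // batches when the buffer fills or flushCommits() is called.
        table.setAutoFlush(false);
        table.setWriteBufferSize(1024 * 1024 * 24);          // the 24 MB value quoted above

        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 10000; i++) {                    // placeholder loop over fields
            Put put = new Put(Bytes.toBytes("row-" + i));    // placeholder row key
            put.add(Bytes.toBytes("CounterValues"),          // column family from the thread
                    Bytes.toBytes("field-" + i),             // placeholder qualifier
                    Bytes.toBytes("value-" + i));            // placeholder value
            batch.add(put);
            if (batch.size() == 1000) {                      // hand puts to the client 1000 at a time
                table.put(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            table.put(batch);
        }
        table.flushCommits();                                // drain whatever is still buffered
        table.close();
    }
}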
>> > >
>> > > Dave
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Michael Segel [mailto:[email protected]]
>> > > Sent: Friday, November 05, 2010 9:49 AM
>> > > To: [email protected]
>> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
>> > >
>> > >
>> > > I don't think using the buffered client is going to help a lot with
>> > > performance.
>> > >
>> > > I'm a little confused, because it doesn't sound like Shuja is using a
>> > > map/reduce job to parse the file.
>> > > That is... he says he parses the file into a DOM tree. Usually your map
>> > > job reads each record, and then in the mapper you parse out the record.
>> > > Within the m/r job we don't parse out the fields in the records, because
>> > > we do additional processing which 'dedupes' the data, so we don't have to
>> > > process the data further.
>> > > The second job only has to parse a portion of the original records.
>> > >
>> > > So assuming that Shuja is actually using a map/reduce job, and each XML
>> > > record is being parsed within the mapper(), there are a couple of things...
>> > > 1) Reduce the number of column families that you are using. (Each column
>> > > family is written to a separate file.)
>> > > 2) Set up the HTable instance in Mapper.setup().
>> > > 3) Switch to a different DOM class (not all Java classes are equal), or
>> > > switch to Stax.
>> > >
>> > >
>> > > > From: [email protected]
>> > > > To: [email protected]
>> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
>> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
>> > > >
>> > > > Have you tried turning off auto flush, and managing the flush in your
>> > > > own code (say every 1000 puts)?
>> > > > Dave
>> > > >
>> > > >
>> > > > -----Original Message-----
>> > > > From: Shuja Rehman [mailto:[email protected]]
>> > > > Sent: Friday, November 05, 2010 8:04 AM
>> > > > To: [email protected]
>> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>> > > >
>> > > > Michael,
>> > > >
>> > > > Hmm... so you are storing the XML record in HBase and parsing it in a
>> > > > second job, but in my case I am parsing it in the first phase as well.
>> > > > What I do is get the XML file, parse it using JDOM, and then put the
>> > > > data into HBase, so parsing + putting are both in one phase, in the
>> > > > mapper code.
>> > > >
>> > > > My actual problem is that after parsing the file I need to use the put
>> > > > statement millions of times, and I think each statement connects to
>> > > > HBase and then inserts, which might be the reason for the slow
>> > > > processing. So I am trying to figure out some way I can first buffer the
>> > > > data and then insert it in batch fashion; it means that in one put
>> > > > statement I can insert many records, and I think if I do it this way the
>> > > > process will be very fast.
>> > > >
>> > > > Secondly, what does this mean: "we write the raw record in via a single
>> > > > put() so the map() method is a null writable"?
>> > > >
>> > > > Can you explain it more?
>> > > >
>> > > > Thanks
>> > > >
>> > > >
>> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <[email protected]> wrote:
>> > > >
>> > > > >
>> > > > > Shuja,
>> > > > >
>> > > > > Just did a quick glance.
>> > > > >
>> > > > > What is it that you want to do exactly?
>> > > > >
>> > > > > Here's how we do it... (at a high level.)
>> > > > >
>> > > > > Input is an XML file where we want to store the raw XML records in
>> > > > > HBase, one record per row.
>> > > > >
>> > > > > Instead of using the output of the map() method, we write the raw
>> > > > > record in via a single put(), so the map() method emits a NullWritable.
>> > > > >
>> > > > > It's pretty fast. However, fast is relative.
>> > > > >
>> > > > > Another thing... we store the XML record as a string (converted to
>> > > > > bytes) rather than as a serialized object.
>> > > > >
>> > > > > Then you can break it down into individual fields in a second batch
>> > > > > job. (You can start with a DOM parser, and later move to a Stax parser.
>> > > > > Depending on which DOM parser you have and the size of the record, it
>> > > > > should be 'fast enough'. A good implementation of Stax tends to be
>> > > > > recursive/re-entrant code, which is harder to maintain.)
>> > > > >
>> > > > > HTH
>> > > > >
>> > > > > -Mike
>> > > > >
>> > > > >
>> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
>> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
>> > > > > > From: [email protected]
>> > > > > > To: [email protected]
>> > > > > >
>> > > > > > Hi
>> > > > > >
>> > > > > > I am reading data from raw XML files and inserting it into HBase
>> > > > > > using TableOutputFormat in a map reduce job, but due to the heavy put
>> > > > > > statements, it takes many hours to process the data. Here is my
>> > > > > > sample code.
>> > > > > >
>> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
>> > > > > > conf.set("xmlinput.start", "<adc>");
>> > > > > > conf.set("xmlinput.end", "</adc>");
>> > > > > > conf.set(
>> > > > > >     "io.serializations",
>> > > > > >     "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
>> > > > > >
>> > > > > > Job job = new Job(conf, "Populate Table with Data");
>> > > > > >
>> > > > > > FileInputFormat.setInputPaths(job, input);
>> > > > > > job.setJarByClass(ParserDriver.class);
>> > > > > > job.setMapperClass(MyParserMapper.class);
>> > > > > > job.setNumReduceTasks(0);
>> > > > > > job.setInputFormatClass(XmlInputFormat.class);
>> > > > > > job.setOutputFormatClass(TableOutputFormat.class);
>> > > > > >
>> > > > > > and the mapper code:
>> > > > > >
>> > > > > > public class MyParserMapper extends
>> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
>> > > > > >
>> > > > > >   @Override
>> > > > > >   public void map(LongWritable key, Text value1, Context context)
>> > > > > >       throws IOException, InterruptedException {
>> > > > > >     // doing some processing
>> > > > > >     while (rItr.hasNext()) {
>> > > > > >       // and this put statement runs 132,622,560 times to insert the data.
>> > > > > >       context.write(NullWritable.get(),
>> > > > > >           new Put(rowId).add(Bytes.toBytes("CounterValues"),
>> > > > > >               Bytes.toBytes(counter.toString()),
>> > > > > >               Bytes.toBytes(rElement.getTextTrim())));
>> > > > > >     }
>> > > > > >   }
>> > > > > > }
>> > > > > >
>> > > > > > Is there any other way of doing this task so I can improve the performance?
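Pulling the suggestions above together (open the HTable in Mapper.setup(), turn off auto-flush, collect puts and flush them in batches of 1000 rather than going through TableOutputFormat), here is a minimal sketch of how the mapper could look. The table name and column family come from the snippet above; the buffer size, row keys, qualifiers, and values are placeholders, and the XML parsing is elided just as in the original. With this pattern the job would typically use NullOutputFormat, since the mapper writes to HBase itself.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BufferedParserMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;
    private List<Put> buffer = new ArrayList<Put>();

    @Override
    protected void setup(Context context) throws IOException {
        // One HTable instance per mapper, as suggested above.
        table = new HTable(context.getConfiguration(), "mytable");
        table.setAutoFlush(false);
        table.setWriteBufferSize(1024 * 1024 * 12);        // placeholder size
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parsing elided, as in the original snippet; in the real job this
        // would loop over the parsed fields of the record.
        Put put = new Put(Bytes.toBytes("row"));           // placeholder row key
        put.add(Bytes.toBytes("CounterValues"),            // family from the snippet
                Bytes.toBytes("qualifier"),                // placeholder qualifier
                Bytes.toBytes("value"));                   // placeholder value
        buffer.add(put);

        if (buffer.size() >= 1000) {                       // flush every 1000 puts
            table.put(buffer);
            buffer.clear();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (!buffer.isEmpty()) {
            table.put(buffer);
        }
        table.flushCommits();                              // drain the client-side buffer
        table.close();
    }
}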
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Regards
>> > > > > > Shuja-ur-Rehman Baig
>> > > > > > <http://pk.linkedin.com/in/shujamughal>
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Regards
>> > > > Shuja-ur-Rehman Baig
>> > > > <http://pk.linkedin.com/in/shujamughal>
>> > >
>> >
>> > --
>> > Regards
>> > Shuja-ur-Rehman Baig
>> > <http://pk.linkedin.com/in/shujamughal>
>>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>

--
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>
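For reference, a minimal sketch of the StAX-style (streaming) parsing that the thread repeatedly recommends in place of JDOM. The <adc> element matches the xmlinput.start/end values above; the field names inside it are invented for illustration, and in the real job each (name, text) pair would become a Put qualifier and value instead of being printed. Unlike a DOM or JDOM parse, this never builds the whole record tree in memory.

import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxRecordParser {

    // Walks one <adc> record and prints each (element name, text) pair.
    public static void parse(String xmlRecord) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xmlRecord));

        String currentElement = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                currentElement = reader.getLocalName();
            } else if (event == XMLStreamConstants.CHARACTERS
                    && currentElement != null
                    && !reader.isWhiteSpace()) {
                System.out.println(currentElement + " = " + reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                currentElement = null;
            }
        }
        reader.close();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny example record; field names are invented for illustration.
        parse("<adc><counter1>42</counter1><counter2>7</counter2></adc>");
    }
}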
