Check the requirements: http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
You can confirm that you have an xcievers problem by grepping the
datanode logs for the error message pasted in the last bullet point of
that page. If so, it will explain a lot! A rough sketch of the check
follows Jimmy's message below.

J-D

On Thu, Jul 1, 2010 at 5:49 PM, Jinsong Hu <[email protected]> wrote:
>
> I do have some errors, such as:
>
> 2010-07-01 22:53:30,187 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.110.8.85:50010
> java.io.EOFException
>
> 2010-07-01 23:00:49,976 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection timed out
> 2010-07-01 23:04:13,356 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection timed out
>
> They all seem to be hadoop datanode errors.
>
> I searched, and people say I need to increase dfs.datanode.max.xcievers
> to 2K and increase ulimit to 32K (currently it is set at 16K).
>
> I will get that done and do more testing.
>
> Jimmy.
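A minimal sketch of the checks discussed above, assuming stock log
locations; the paths, the exact message wording, and the values shown
are assumptions to adapt to your install:

    # Look for the xcievers error in the datanode logs; the message reads
    # roughly "xceiverCount N exceeds the limit of concurrent xcievers M".
    grep "exceeds the limit of concurrent xcievers" /var/log/hadoop/*datanode*.log

    # The limit itself lives in hdfs-site.xml on every datanode (note the
    # property name's intentional misspelling), for example:
    #   <property>
    #     <name>dfs.datanode.max.xcievers</name>
    #     <value>2048</value>
    #   </property>

    # The open-file limit is per user and is typically raised in
    # /etc/security/limits.conf; verify it as the user running the daemons:
    ulimit -n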
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <[email protected]>
> Sent: Thursday, July 01, 2010 5:41 PM
> To: <[email protected]>
> Subject: Re: dilemma of memory and CPU for hbase.
>
>> When I start HBase I usually just tail the master log (a sketch
>> follows this exchange), but it's actually just a few seconds, then
>> another few seconds for .META., then it starts assigning all the
>> other regions.
>>
>> Did you make sure your master log was clean of errors?
>>
>> J-D
>>
>> On Thu, Jul 1, 2010 at 5:40 PM, Jinsong Hu <[email protected]> wrote:
>>>
>>> Yes, it terminated correctly. There was no exception while running
>>> add_table.
>>>
>>> Are you saying that after a restart I need to wait for some time for
>>> -ROOT- to be assigned? Usually how long do I need to wait?
>>>
>>> Jimmy
>>>
>>> --------------------------------------------------
>>> From: "Jean-Daniel Cryans" <[email protected]>
>>> Sent: Thursday, July 01, 2010 5:27 PM
>>> To: <[email protected]>
>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>
>>>> Did you see any exception when you ran add_table? Did it even
>>>> terminate correctly?
>>>>
>>>> After a restart, the regions aren't readily available. If something
>>>> blocked the master from assigning -ROOT-, it should be pretty evident
>>>> by looking at the master log.
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 1, 2010 at 5:23 PM, Jinsong Hu <[email protected]> wrote:
>>>>>
>>>>> After I ran add_table.rb, I refreshed the master's UI page and then
>>>>> clicked on the table to show the regions. I expected that all the
>>>>> regions would be there. But I found that there were significantly
>>>>> fewer regions; lots of regions that were there before were gone.
>>>>>
>>>>> I then restarted the whole hbase master and region servers, and now
>>>>> it is even worse: the master UI page doesn't even load, saying the
>>>>> -ROOT- and .META. regions are not served by any regionserver. The
>>>>> whole cluster is not in a usable state.
>>>>>
>>>>> That forced me to rename /hbase to /hbase-0.20.4, restart all the
>>>>> hbase masters and regionservers, recreate all tables, etc.,
>>>>> essentially starting from scratch.
>>>>>
>>>>> Jimmy
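The master-log check J-D mentions above, as a rough sketch; the log
path assumes a default tarball layout with $HBASE_HOME set:

    # Follow the master log during startup; -ROOT- and .META. assignment
    # (and anything blocking it) shows up here within a few seconds.
    tail -f "$HBASE_HOME"/logs/hbase-*-master-*.log | grep -iE "ROOT|META|assign"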
>>>>>
>>>>> --------------------------------------------------
>>>>> From: "Jean-Daniel Cryans" <[email protected]>
>>>>> Sent: Thursday, July 01, 2010 5:10 PM
>>>>> To: <[email protected]>
>>>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>>>
>>>>>> add_table.rb doesn't actually write much to the file system; all
>>>>>> your data is still there. It just wipes all the .META. entries and
>>>>>> replaces them with the .regioninfo files found in every region
>>>>>> directory.
>>>>>>
>>>>>> Can you define what you mean by "corrupted"? It's really an
>>>>>> overloaded term.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, Jul 1, 2010 at 5:01 PM, Jinsong Hu <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, Jean:
>>>>>>> Thanks! I will run add_table.rb and see if it fixes the problem.
>>>>>>> Our namenode is backed up with HA and DRBD, and the hbase master
>>>>>>> machine is colocated with the namenode and job tracker, so we are
>>>>>>> not wasting resources.
>>>>>>>
>>>>>>> The region hole probably comes from the previous 0.20.4 hbase
>>>>>>> operation. The 0.20.4 hbase was very unstable; lots of times the
>>>>>>> master said a region was not there while the region server said it
>>>>>>> was serving that region.
>>>>>>>
>>>>>>> I followed the instructions and ran commands like:
>>>>>>>
>>>>>>> bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
>>>>>>>
>>>>>>> After the execution, I found all my tables were corrupted and I
>>>>>>> couldn't use them any more. Restarting hbase didn't help either. I
>>>>>>> had to wipe out the whole /hbase directory and start from scratch.
>>>>>>>
>>>>>>> It looks like add_table.rb can corrupt the whole hbase. Anyway, I
>>>>>>> am regenerating the data from scratch; let's see if it works out.
>>>>>>>
>>>>>>> Jimmy.
>>>>>>>
>>>>>>> --------------------------------------------------
>>>>>>> From: "Jean-Daniel Cryans" <[email protected]>
>>>>>>> Sent: Thursday, July 01, 2010 2:17 PM
>>>>>>> To: <[email protected]>
>>>>>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>>>>>
>>>>>>>> (taking the conversation back to the list after receiving logs and
>>>>>>>> heap dump)
>>>>>>>>
>>>>>>>> The issue here is actually much nastier than it seems. But before
>>>>>>>> I describe the problem, you said:
>>>>>>>>
>>>>>>>>> I have 3 machines as hbase master (only 1 is active), 3
>>>>>>>>> zookeepers, 8 regionservers.
>>>>>>>>
>>>>>>>> If those are all distinct machines, you are wasting a lot of
>>>>>>>> hardware. Unless you have an HA Namenode (I highly doubt it), you
>>>>>>>> already have a SPOF there, so you might as well put every service
>>>>>>>> on that single node (1 master, 1 zookeeper). You might be afraid
>>>>>>>> of using only 1 ZK node, but unless you share the zookeeper
>>>>>>>> ensemble between clusters, losing the Namenode is as bad as losing
>>>>>>>> ZK, so you might as well put them together. At StumbleUpon we have
>>>>>>>> 2-3 clusters using the same ensembles, so it makes more sense for
>>>>>>>> us to put them in an HA setup.
>>>>>>>>
>>>>>>>> That said, in your log I see:
>>>>>>>>
>>>>>>>> 2010-06-29 00:00:00,064 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts interrupted at index=0 because:Requested row out of range for HRegion Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>>>>> ...
>>>>>>>> 2010-06-29 12:26:13,352 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts interrupted at index=0 because:Requested row out of range for HRegion Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>>>>>
>>>>>>>> So for 12 hours (and probably more), the same row was requested
>>>>>>>> almost every 100ms, but it always failed on a WrongRegionException
>>>>>>>> (that's the name of what we see here). You probably use the write
>>>>>>>> buffer since you want to import as fast as possible, so all these
>>>>>>>> buffers are left unused after the clients terminate their RPC.
>>>>>>>> That rate of failed insertion must have kept your garbage
>>>>>>>> collector _very_ busy, and at some point the JVM OOMEd. This is
>>>>>>>> the stack from your OOME:
>>>>>>>>
>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
>>>>>>>>
>>>>>>>> This is where we deserialize client data, so it correlates with
>>>>>>>> what I just described.
>>>>>>>>
>>>>>>>> Now, this means that you probably have a hole (or more) in your
>>>>>>>> .META. table. It usually happens after a region server fails while
>>>>>>>> it was carrying .META. (since data loss is possible with that
>>>>>>>> version of HDFS), or if a bug in the master messes up the .META.
>>>>>>>> region. Now, 2 things:
>>>>>>>>
>>>>>>>> - It would be nice to know why you have a hole. Look at your
>>>>>>>> .META. table around the row in your region server log (a sketch
>>>>>>>> follows this message); you should see that the start/end keys
>>>>>>>> don't match. Then you can look in the master log from yesterday to
>>>>>>>> search for what went wrong: maybe some exceptions, or maybe a
>>>>>>>> region server failed for some reason while it was hosting .META.
>>>>>>>>
>>>>>>>> - You probably want to fix your table. Use the bin/add_table.rb
>>>>>>>> script (other people on this list have used it in the past; search
>>>>>>>> the archive for more info).
>>>>>>>>
>>>>>>>> Finally (whew!), if you are still developing your solution around
>>>>>>>> HBase, you might want to try out one of our dev releases that
>>>>>>>> works with a durable Hadoop release. See
>>>>>>>> http://hbase.apache.org/docs/r0.89.20100621/ for more info.
>>>>>>>> Cloudera's CDH3b2 also has everything you need.
>>>>>>>>
>>>>>>>> J-D
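A sketch of the .META. inspection suggested above, using 0.20-era shell
syntax; the start row is taken from the region name in the log excerpt,
and the LIMIT is an arbitrary choice:

    # Scan .META. starting at the troubled table and compare neighbouring
    # regions: a hole shows up as an end key that does not equal the next
    # region's start key.
    echo "scan '.META.', {STARTROW => 'Spam_MsgEventTable,', LIMIT => 50}" | bin/hbase shell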
>>>>>>>>
>>>>>>>> On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> 653 regions is very low; even if you had a total of 3 region
>>>>>>>>> servers I wouldn't expect any problem.
>>>>>>>>>
>>>>>>>>> So to me it seems to point towards either a configuration issue
>>>>>>>>> or a usage issue. Can you:
>>>>>>>>>
>>>>>>>>> - Put the log of one region server that OOMEd on a public server.
>>>>>>>>> - Tell us more about your setup: # of nodes, hardware,
>>>>>>>>> configuration file.
>>>>>>>>> - Tell us more about how you insert data into HBase.
>>>>>>>>>
>>>>>>>>> And BTW, are you trying to do an initial import of your data set?
>>>>>>>>> If so, have you considered using HFileOutputFormat?
>>>>>>>>>
>>>>>>>>> Thx,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, Sir:
>>>>>>>>>> I am using hbase 0.20.5, and this morning I found that 3 of my
>>>>>>>>>> region servers ran out of memory. Each regionserver is given 6G
>>>>>>>>>> of memory, and I have 653 regions in total; the max store size
>>>>>>>>>> is 256M. I analyzed the dump, and it shows that there are too
>>>>>>>>>> many HRegions in memory.
>>>>>>>>>>
>>>>>>>>>> Previously I set the max store size to 2G, but then I found the
>>>>>>>>>> region server constantly does minor compactions and the CPU
>>>>>>>>>> usage is very high. It also blocks the heavy client record
>>>>>>>>>> insertion.
>>>>>>>>>>
>>>>>>>>>> So now I am limited on one side by memory and on the other side
>>>>>>>>>> by CPU. Is there any way to get out of this dilemma?
>>>>>>>>>>
>>>>>>>>>> Jimmy.
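For reference, a sketch of where the two settings from the original
question live; the values are simply the ones quoted in this thread,
not recommendations, and the mapping of "max store size" to
hbase.hregion.max.filesize is an assumption:

    # hbase-env.sh: the 6 GB region server heap described above (value in MB)
    export HBASE_HEAPSIZE=6000

    # hbase-site.xml: the region split threshold is hbase.hregion.max.filesize,
    # 256 MB here (the 0.20 default), 2 GB in the earlier test:
    #   <property>
    #     <name>hbase.hregion.max.filesize</name>
    #     <value>268435456</value>
    #   </property>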
