Thanks for your help. I pushed up my upgrade plans and just finished installing 0.90.1 (cdh3b4), which solved the EOF error and also gave a general performance boost in my initial testing.
-chris

On Mar 2, 2011, at 9:18 AM, Jean-Daniel Cryans wrote:

> I think you could try applying both patches instead on whatever you're
> running right now, they are pretty small.
>
> Another option is using the version of 0.89 we're using here in
> production that's already patched https://github.com/stumbleupon/hbase
>
> J-D
>
> On Wed, Mar 2, 2011 at 8:55 AM, Chris Tarnas <[email protected]> wrote:
>> If HBASE-3038 is the problem is there anything I should be aware of during
>> upgrading while this region is in this state?
>>
>> thanks,
>> -chris
>>
>> On Mar 2, 2011, at 8:22 AM, Chris Tarnas wrote:
>>
>>> I'm pretty sure I hit HBASE-3038, the recovered.edits file is over 2GB
>>>
>>> I'll push up my upgrade plans.
>>>
>>> -chris
>>>
>>> On Mar 2, 2011, at 2:44 AM, Chris Tarnas wrote:
>>>
>>>> Actually I see now that this EOFException is keeping a region offline, is
>>>> there any way around this error to bring the region back online? I don't
>>>> have the logs from the regionservers when it went offline but here is the
>>>> section of the master log from then:
>>>>
>>>> http://pastebin.com/4ZBKGbnZ
>>>>
>>>> thanks again
>>>> -chris
>>>>
>>>> On Mar 2, 2011, at 1:03 AM, Chris Tarnas wrote:
>>>>
>>>>> Under heavy loads I've seen a few EOFException errors in my
>>>>> regionserver logs:
>>>>>
>>>>> 2011-03-02 02:27:03,669 ERROR
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>> sequence,h7BpVjo07UDYrkBZBLwWfg\x09fc00fc97be11e00d731605f8e061462c-A2610001-1\x09,1298335975607.8a5d1e4a300792d74f516ba26de869c8.
>>>>> java.io.EOFException:
>>>>> hdfs://lxbt006-pvt:8020/hbase/sequence/8a5d1e4a300792d74f516ba26de869c8/recovered.edits/0000000000054475364,
>>>>> entryStart=2336278916, pos=2336278916, end=4672557832, edit=13370
>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>>>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>>
>>>>> Checking the same timeframe in the namenode logs on lcbt006-pvt reveals
>>>>> no ominous messages (no warns, errors, anything), just the same file
>>>>> being opened by a different node:
>>>>>
>>>>> 2011-03-02 02:27:05,466 INFO
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop
>>>>> ip=/10.56.24.13 cmd=open
>>>>> src=/hbase/sequence/8a5d1e4a300792d74f516ba26de869c8/recovered.edits/0000000000054475364
>>>>> dst=null perm=null
>>>>>
>>>>>
>>>>> The Troubleshooting Wiki mentions it is related to swapping, but none of
>>>>> the nodes are swapping - they all have plenty of RAM. Are there other
>>>>> common causes? Is this anything I should be worried about or just
>>>>> "normal" exceptions, anything else I should look for? I'm on cdh3b3 and
>>>>> will be moving to b4 once I get a chance to run it through a test cluster.
>>>>>
>>>>> thank you,
>>>>> -chris
>>>>
>>>
>>
>>
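A note on the numbers in the EOFException above: the reported offsets are consistent with the "recovered.edits file is over 2GB" diagnosis in the thread. The sketch below is plain arithmetic on the values copied from the log line. It does not reproduce any HBase code, and the helper names (`exceeds_int32`, `wrap_to_int32`) are made up for illustration. It only checks that the read position lies beyond what a signed 32-bit integer can hold and notes that the reported `end` is exactly twice `pos`. Reading that as a sign of a 32-bit position calculation going wrong is an assumption based on the HBASE-3038 diagnosis, not something the log itself proves.

```python
# Offsets copied verbatim from the EOFException message in the thread.
entry_start = 2336278916
pos = 2336278916
end = 4672557832

# Largest value a signed 32-bit integer (a Java `int`) can represent.
INT32_MAX = 2**31 - 1


def exceeds_int32(offset):
    """Return True if the offset cannot be stored in a signed 32-bit int."""
    return offset > INT32_MAX


def wrap_to_int32(value):
    """Keep the low 32 bits and read them as signed two's complement,
    which is what a narrowing cast to a 32-bit int would yield."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


if __name__ == "__main__":
    print("pos beyond 32-bit range:", exceeds_int32(pos))
    print("pos in GiB:", round(pos / 2**30, 2))
    print("end is exactly 2 * pos:", end == 2 * pos)
    print("pos squeezed into 32 bits:", wrap_to_int32(pos))
```

Any offset that fails the `exceeds_int32` check would turn negative if it were ever squeezed into a 32-bit value, so a file of this size is exactly the kind of input that would trip such a bug.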
