Ran the same job on hbase over hadoop: everything works like a charm. I can come to two conclusions: 1. there is some bug in standalone mode, or 2. memory, but I don't think that is the case (disks are the same, memory is the same, machine is the same, workload is the same, but the result differs).
Later I'll try to write a test case.

2010/9/22 Andrey Stepachev <[email protected]>:
> Very strange. With hbase over hadoop there are no such checksum errors.
> Very strange. I'll recheck on another big family.
>
> 2010/9/22 Andrey Stepachev <[email protected]>:
>> Thanks. Now I run the same job on hbase 0.89 over cloudera hadoop
>> instead of standalone mode. Maybe there is some bug in standalone mode
>> which prevents correct data from being written to disk. And later I'll
>> check the memory.
>>
>> Btw, linux is opensuse 11.0, 2.6.25.20-0.7-default, 64 bit.
>>
>> 2010/9/22 Ryan Rawson <[email protected]>:
>>> So the client code looks good; hard to say what exactly is going on.
>>>
>>> BTW I opened this JIRA to address the confusing exception in this case:
>>> https://issues.apache.org/jira/browse/HBASE-3029
>>>
>>> It's hard to say why you get that exception under load... some systems
>>> have been known to give weird flaky faults under load. It used to be
>>> that compiling the linux kernel was a simple benchmark for RAM problems.
>>> If you have time you could try memtest86 to see if the memory has
>>> issues, since that is a common source of errors.
>>>
>>> -ryan
>>>
>>> On Wed, Sep 22, 2010 at 2:29 AM, Andrey Stepachev <[email protected]> wrote:
>>>> One more note. This database was 0.20.6 before; then I started 0.89
>>>> over it. (But the table with the wrong checksum was created in 0.89
>>>> hbase.)
>>>>
>>>> 2010/9/22 Andrey Stepachev <[email protected]>:
>>>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>>>> why are you using such expensive disks? raid + hdfs = lower
>>>>>> performance than non-raid.
>>>>>
>>>>> It was a database server before we migrated to hbase; it was designed
>>>>> for postgresql. Now, with compression and the nature of hbase, our
>>>>> database is 12GB instead of 180GB in pg. So this server was not
>>>>> designed for hbase. In production (0.20.6) we use much lighter
>>>>> servers (3) with simple dual SATA drives.
>>>>>
>>>>>> how's your ram? how are your network switches? NICs? etc.
>>>>>> anything along the data path can introduce errors.
>>>>>
>>>>> No, everything is on one machine. 17GB RAM (5GB for hbase).
>>>>>
>>>>>> in this case we did the right thing and threw exceptions, but it
>>>>>> looks like your client continues to call next() despite getting
>>>>>> exceptions... can you check your client code to verify this?
>>>>>
>>>>> Hm, I checked, but I use only a simple wrapper around ResultScanner:
>>>>> http://pastebin.org/1074628. It should bail out on any exception
>>>>> (except ScannerTimeoutException).
>>>>>
>>>>>> On Wed, Sep 22, 2010 at 2:14 AM, Andrey Stepachev <[email protected]> wrote:
>>>>>>> hp proliant, raid 10 with 4 sas 15k drives, smartarray 6i, 2 cpu / 4 cores.
>>>>>>>
>>>>>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>>>>>> generally checksum errors are due to hardware faults of one kind
>>>>>>>> or another.
>>>>>>>>
>>>>>>>> what is your hardware like?
>>>>>>>>
>>>>>>>> On Wed, Sep 22, 2010 at 2:08 AM, Andrey Stepachev <[email protected]> wrote:
>>>>>>>>> But why is it bad? Split/compaction? I made my own
>>>>>>>>> RetryResultIterator which reopens the scanner on timeout. But what
>>>>>>>>> is the best way to reopen a scanner? Can you point me to where I
>>>>>>>>> can find all these exceptions? Or is there already some sort of
>>>>>>>>> recoverable iterator?
>>>>>>>>>
>>>>>>>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>>>>>>>> ah ok i think i get it... basically at this point your scanner is
>>>>>>>>>> bad and iterating on it again won't work. the scanner should
>>>>>>>>>> probably close itself so you don't get tons of additional
>>>>>>>>>> exceptions, but currently we don't do that.
>>>>>>>>>>
>>>>>>>>>> there is probably a better fix for this, i'll ponder it.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2010 at 1:57 AM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>> very strange... looks like a bad block ended up in your scanner
>>>>>>>>>>> and subsequent nexts were failing due to that short read.
>>>>>>>>>>>
>>>>>>>>>>> did you have to kill the regionserver or did things recover and
>>>>>>>>>>> continue normally?
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 22, 2010 at 1:37 AM, Andrey Stepachev <[email protected]> wrote:
>>>>>>>>>>>> Hi All.
>>>>>>>>>>>>
>>>>>>>>>>>> I get org.apache.hadoop.fs.ChecksumException for a table under
>>>>>>>>>>>> heavy write in standalone mode. Table tmp.bsn.main was created
>>>>>>>>>>>> at 2010-09-22 10:42:28,860, and then 5 threads write data to it.
>>>>>>>>>>>> At some moment the exception is thrown.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey.
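
[Editor's note: for reference, below is a minimal sketch of a scanner wrapper along the lines discussed in the thread: it reopens the scanner after a ScannerTimeoutException (expired lease) and lets every other exception propagate, since a scanner that has thrown any other IOException (e.g. one caused by a ChecksumException) is no longer usable. The class name RetryResultScanner and the resume logic are assumptions for illustration; this is not the code behind the pastebin link or the RetryResultIterator mentioned above.]

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.ScannerTimeoutException;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Illustrative sketch only (names are assumptions): reopens the scanner
 * after a ScannerTimeoutException and rethrows everything else, because
 * a scanner that has thrown any other IOException is no longer usable.
 */
public class RetryResultScanner {

  private final HTable table;
  private final Scan scan;
  private ResultScanner scanner;
  private byte[] lastRow;   // last row successfully returned to the caller

  public RetryResultScanner(HTable table, Scan scan) throws IOException {
    this.table = table;
    this.scan = scan;
    this.scanner = table.getScanner(scan);
  }

  /** Returns the next Result, or null at the end of the scan. */
  public Result next() throws IOException {
    // Any IOException other than ScannerTimeoutException (e.g. one caused
    // by a ChecksumException) propagates to the caller.
    try {
      return remember(scanner.next());
    } catch (ScannerTimeoutException e) {
      // Lease expired: reopen just past the last row we saw and retry once.
      scanner.close();
      byte[] resume = (lastRow == null)
          ? scan.getStartRow()
          : Bytes.add(lastRow, new byte[] { 0 }); // smallest row after lastRow
      scan.setStartRow(resume);
      scanner = table.getScanner(scan);
      return remember(scanner.next());
    }
  }

  private Result remember(Result r) {
    if (r != null) {
      lastRow = r.getRow();
    }
    return r;
  }

  public void close() {
    scanner.close();
  }
}

A caller would invoke next() in a loop and let any non-timeout IOException abort the job, which matches the "bail out on exception (except ScannerTimeoutException)" behaviour described in the thread.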
