One more note: this database was on 0.20.6 before; then I started 0.89 over it. (The table with the wrong checksum, though, was created in 0.89 hbase.)
2010/9/22 Andrey Stepachev <[email protected]>:
> 2010/9/22 Ryan Rawson <[email protected]>:
>> why are you using such expensive disks? raid + hdfs = lower
>> performance than non-raid.
>
> It was a database server before we migrated to hbase; it was designed
> for postgresql. Now, with compression and hbase's nature, our database
> is 12GB instead of 180GB in pg. So this server was not designed for hbase.
> In production (0.20.6) we use much lighter servers (3) with simple dual
> sata drives.
>
>> how's your ram? how are your network switches? NICs? etc etc.
>> anything along the data path can introduce errors.
>
> no. all things are on one machine. 17GB ram (5GB hbase)
>
>> in this case we did the right thing and threw exceptions, but it looks
>> like your client continues to call next() despite getting
>> exceptions... can you check your client code to verify this?
>
> hm, I checked, but I use only a simple wrapper around ResultScanner:
> http://pastebin.org/1074628. It should bail out on any exception
> (except ScannerTimeoutException).
>
>> On Wed, Sep 22, 2010 at 2:14 AM, Andrey Stepachev <[email protected]> wrote:
>>> hp proliant, raid 10 with 4 sas 15k, smartarray 6i, 2 cpu / 4 cores.
>>>
>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>> generally checksum errors are due to hardware faults of one kind or
>>>> another.
>>>>
>>>> what is your hardware like?
>>>>
>>>> On Wed, Sep 22, 2010 at 2:08 AM, Andrey Stepachev <[email protected]> wrote:
>>>>> But why is it bad? Split/compaction? I made my own RetryResultIterator
>>>>> which reopens the scanner on timeout. But what is the best way to
>>>>> reopen a scanner? Can you point me to where I can find all these
>>>>> exceptions? Or is there already some sort of recoverable iterator?
>>>>>
>>>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>>>> ah ok i think i get it... basically at this point your scanner is bad
>>>>>> and iterating on it again won't work. the scanner should probably
>>>>>> close itself so you get tons of additional exceptions, but instead
>>>>>> we don't.
>>>>>>
>>>>>> there is probably a better fix for this, i'll ponder
>>>>>>
>>>>>> On Wed, Sep 22, 2010 at 1:57 AM, Ryan Rawson <[email protected]> wrote:
>>>>>>> very strange... looks like a bad block ended up in your scanner and
>>>>>>> subsequent nexts were failing due to that short read.
>>>>>>>
>>>>>>> did you have to kill the regionserver, or did things recover and
>>>>>>> continue normally?
>>>>>>>
>>>>>>> -ryan
>>>>>>>
>>>>>>> On Wed, Sep 22, 2010 at 1:37 AM, Andrey Stepachev <[email protected]>
>>>>>>> wrote:
>>>>>>>> Hi All.
>>>>>>>>
>>>>>>>> I get org.apache.hadoop.fs.ChecksumException for a table on heavy
>>>>>>>> write in standalone mode.
>>>>>>>> The table tmp.bsn.main was created at 2010-09-22 10:42:28,860, and
>>>>>>>> then 5 threads write data to it.
>>>>>>>> At some moment the exception is thrown.
>>>>>>>>
>>>>>>>> Andrey.
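
[Editor's sketch] For the "reopen the scanner on timeout" wrapper discussed above: the pastebin code is not reproduced here, but a minimal sketch could look like the following, assuming the 0.20/0.89-era client API (HTable, Scan, ResultScanner, ScannerTimeoutException). The class name RetryingScanner and the zero-byte start-row trick are illustrative, not from the thread. The key point is the narrow catch clause: only ScannerTimeoutException triggers a reopen; any other exception (e.g. an IOException caused by the ChecksumException) propagates so the caller stops calling next() on a scanner that is already bad.

import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.ScannerTimeoutException;

/**
 * Sketch: wraps a ResultScanner and reopens it just past the last row
 * returned when the scanner lease times out. Every other exception is
 * rethrown so iteration stops instead of hammering a broken scanner.
 */
public class RetryingScanner implements Iterable<Result> {

    private final HTable table;
    private final Scan scan;
    private ResultScanner scanner;
    private byte[] lastRow;

    public RetryingScanner(HTable table, Scan scan) throws IOException {
        this.table = table;
        this.scan = scan;
        this.scanner = table.getScanner(scan);
    }

    private Result nextResult() throws IOException {
        Result r;
        try {
            r = scanner.next();
        } catch (ScannerTimeoutException e) {
            // Lease expired: reopen just past the last row already returned.
            reopen();
            r = scanner.next();
        }
        if (r != null) {
            lastRow = r.getRow();
        }
        return r;
    }

    private void reopen() throws IOException {
        scanner.close();
        if (lastRow != null) {
            // Start row is inclusive, so append a zero byte to skip the
            // row we have already handed to the caller.
            scan.setStartRow(Arrays.copyOf(lastRow, lastRow.length + 1));
        }
        scanner = table.getScanner(scan);
    }

    public void close() {
        scanner.close();
    }

    @Override
    public Iterator<Result> iterator() {
        return new Iterator<Result>() {
            private Result next;

            @Override
            public boolean hasNext() {
                if (next == null) {
                    try {
                        next = nextResult();
                    } catch (IOException e) {
                        // Do not keep iterating on a broken scanner.
                        throw new RuntimeException(e);
                    }
                }
                return next != null;
            }

            @Override
            public Result next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Result r = next;
                next = null;
                return r;
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}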
