Ran the same job on hbase over hadoop: everything works like a charm. I can come to two conclusions: 1. there is some bug in standalone mode, or 2. memory, but I don't think that is the case (disks are the same, memory is the same, machine is the same, workload is the same, but the result differs).
Later I'll try to write a test case.

2010/9/22 Andrey Stepachev <[email protected]>:
> Very strange. With hbase over hadoop there are no such checksum errors.
> Very strange. I'll recheck on another big family.
>
> 2010/9/22 Andrey Stepachev <[email protected]>:
>> Thanks. Now I run the same job on hbase 0.89 over cloudera hadoop
>> instead of standalone mode. Maybe there is some bug in standalone mode
>> which prevents correct data from being written to disk. And later I'll
>> check the memory.
>>
>> Btw, linux is opensuse 11.0, 2.6.25.20-0.7-default, 64 bit.
>>
>> 2010/9/22 Ryan Rawson <[email protected]>:
>>> So the client code looks good; hard to say what exactly is going on.
>>>
>>> BTW I opened this JIRA to address the confusing exception in this case:
>>> https://issues.apache.org/jira/browse/HBASE-3029
>>>
>>> It's hard to say why you get that exception under load... some systems
>>> have been known to give weird flaky faults under load. It used to be
>>> that compiling the linux kernel was a simple benchmark for RAM problems.
>>> If you have time you could try memtest86 to see if the memory has
>>> issues, since that is a common source of errors.
>>>
>>> -ryan
>>>
>>> On Wed, Sep 22, 2010 at 2:29 AM, Andrey Stepachev <[email protected]> wrote:
>>>> One more note. This database was 0.20.6 before; then I started 0.89
>>>> over it. (But the table with the wrong checksum was created in 0.89
>>>> hbase.)
>>>>
>>>> 2010/9/22 Andrey Stepachev <[email protected]>:
>>>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>>>> why are you using such expensive disks? raid + hdfs = lower
>>>>>> performance than non-raid.
>>>>>
>>>>> It was a database server before we migrated to hbase; it was designed
>>>>> for postgresql. Now, with compression and the nature of hbase, our
>>>>> database is 12GB instead of 180GB in pg. So this server was not
>>>>> designed for hbase. In production (0.20.6) we use much lighter
>>>>> servers (3) with simple dual SATA drives.
>>>>>
>>>>>> how's your ram? how are your network switches? NICs? etc.
>>>>>> anything along the data path can introduce errors.
>>>>>
>>>>> No, everything is on one machine. 17GB RAM (5GB for hbase).
>>>>>
>>>>>> in this case we did the right thing and threw exceptions, but it
>>>>>> looks like your client continues to call next() despite getting
>>>>>> exceptions... can you check your client code to verify this?
>>>>>
>>>>> Hm, I checked, but I use only a simple wrapper around ResultScanner:
>>>>> http://pastebin.org/1074628. It should bail out on any exception
>>>>> (except ScannerTimeoutException).
>>>>>
>>>>>> On Wed, Sep 22, 2010 at 2:14 AM, Andrey Stepachev <[email protected]> wrote:
>>>>>>> hp proliant, raid 10 with 4 sas 15k drives, smartarray 6i, 2 cpu / 4 cores.
>>>>>>>
>>>>>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>>>>>> generally checksum errors are due to hardware faults of one kind
>>>>>>>> or another.
>>>>>>>>
>>>>>>>> what is your hardware like?
>>>>>>>>
>>>>>>>> On Wed, Sep 22, 2010 at 2:08 AM, Andrey Stepachev <[email protected]> wrote:
>>>>>>>>> But why is it bad? Split/compaction? I made my own
>>>>>>>>> RetryResultIterator which reopens the scanner on timeout. But what
>>>>>>>>> is the best way to reopen a scanner? Can you point me to where I
>>>>>>>>> can find all these exceptions? Or is there already some sort of
>>>>>>>>> recoverable iterator?
>>>>>>>>>
>>>>>>>>> 2010/9/22 Ryan Rawson <[email protected]>:
>>>>>>>>>> ah ok i think i get it... basically at this point your scanner is
>>>>>>>>>> bad and iterating on it again won't work. the scanner should
>>>>>>>>>> probably close itself so you don't get tons of additional
>>>>>>>>>> exceptions, but currently we don't do that.
>>>>>>>>>>
>>>>>>>>>> there is probably a better fix for this, i'll ponder it.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2010 at 1:57 AM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>> very strange... looks like a bad block ended up in your scanner
>>>>>>>>>>> and subsequent nexts were failing due to that short read.
>>>>>>>>>>>
>>>>>>>>>>> did you have to kill the regionserver or did things recover and
>>>>>>>>>>> continue normally?
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 22, 2010 at 1:37 AM, Andrey Stepachev <[email protected]> wrote:
>>>>>>>>>>>> Hi All.
>>>>>>>>>>>>
>>>>>>>>>>>> I get org.apache.hadoop.fs.ChecksumException for a table under
>>>>>>>>>>>> heavy write in standalone mode. Table tmp.bsn.main was created
>>>>>>>>>>>> at 2010-09-22 10:42:28,860, and then 5 threads write data to it.
>>>>>>>>>>>> At some moment the exception is thrown.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey.
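
[Editor's note: for reference, below is a minimal sketch of a scanner wrapper along the lines discussed in the thread: it reopens the scanner after a ScannerTimeoutException (expired lease) and lets every other exception propagate, since a scanner that has thrown any other IOException (e.g. one caused by a ChecksumException) is no longer usable. The class name RetryResultScanner and the resume logic are assumptions for illustration; this is not the code behind the pastebin link or the RetryResultIterator mentioned above.]

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.ScannerTimeoutException;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Illustrative sketch only (names are assumptions): reopens the scanner
 * after a ScannerTimeoutException and rethrows everything else, because
 * a scanner that has thrown any other IOException is no longer usable.
 */
public class RetryResultScanner {

  private final HTable table;
  private final Scan scan;
  private ResultScanner scanner;
  private byte[] lastRow;   // last row successfully returned to the caller

  public RetryResultScanner(HTable table, Scan scan) throws IOException {
    this.table = table;
    this.scan = scan;
    this.scanner = table.getScanner(scan);
  }

  /** Returns the next Result, or null at the end of the scan. */
  public Result next() throws IOException {
    // Any IOException other than ScannerTimeoutException (e.g. one caused
    // by a ChecksumException) propagates to the caller.
    try {
      return remember(scanner.next());
    } catch (ScannerTimeoutException e) {
      // Lease expired: reopen just past the last row we saw and retry once.
      scanner.close();
      byte[] resume = (lastRow == null)
          ? scan.getStartRow()
          : Bytes.add(lastRow, new byte[] { 0 }); // smallest row after lastRow
      scan.setStartRow(resume);
      scanner = table.getScanner(scan);
      return remember(scanner.next());
    }
  }

  private Result remember(Result r) {
    if (r != null) {
      lastRow = r.getRow();
    }
    return r;
  }

  public void close() {
    scanner.close();
  }
}

A caller would invoke next() in a loop and let any non-timeout IOException abort the job, which matches the "bail out on exception (except ScannerTimeoutException)" behaviour described in the thread.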
