A network issue? That's weird, because reads/writes are working well and not raising errors (I'll double-check it).
Regards

Cyril SCETBON

On Jul 9, 2012, at 10:55 PM, Jean-Daniel Cryans wrote:

> We've been running with distributed splitting here for >6 months and
> never had this issue. Also, the exceptions you are seeing come from
> HDFS and not HBase, and the fact that it worked from the master but not
> from the region servers seems to point to a network configuration issue,
> because the actual splitting code is really the same.
>
> J-D
>
> On Sun, Jul 8, 2012 at 2:25 PM, Cyril Scetbon <[email protected]> wrote:
>> I've finally succeeded in starting my cluster by disabling
>> hbase.master.distributed.log.splitting.
>>
>> It took less than 10 minutes to start, compared to a whole night without
>> any success with distributed log splitting enabled. Don't you agree that
>> it's just buggy?
>>
>> Thanks
>>
>> Cyril SCETBON
>>
>> On Jul 6, 2012, at 8:40 PM, Cyril Scetbon wrote:
>>
>>> As you can see in the master log, region servers are in charge of
>>> splitting the log files (not found, I suppose), and the split is retried
>>> several times (I didn't check whether it's always redone) on different
>>> region servers. You can, for example, follow a failing split concerning
>>> a file not found in the Hadoop filesystem:
>>>
>>> http://pastebin.com/RbcLdbcs
>>>
>>> Regards
>>>
>>> Cyril SCETBON
>>>
>>> On Jul 6, 2012, at 8:17 PM, Cyril Scetbon wrote:
>>>
>>>> Here are the log files you asked for:
>>>>
>>>> http://pastebin.com/xRBuQdNS <---- hbase-master.log
>>>>
>>>> http://pastebin.com/u6WYQT6R <---- hdfs-namenode.log
>>>>
>>>> If you find the fix for this damn issue, I'll be delighted!
>>>>
>>>> Thanks
>>>>
>>>> Cyril SCETBON
>>>>
>>>> On Jul 5, 2012, at 11:44 PM, Jean-Daniel Cryans wrote:
>>>>
>>>>> Interesting... Can you read the file? Try a "hadoop dfs -cat" on it
>>>>> and see if it goes to the end of it.
>>>>>
>>>>> It could also be useful to see a bigger portion of the master log; for
>>>>> all I know, maybe it handles it somehow and there's a problem
>>>>> elsewhere.
>>>>>
>>>>> Finally, which Hadoop version are you using?
>>>>>
>>>>> Thx,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, Jul 5, 2012 at 1:58 PM, Cyril Scetbon <[email protected]>
>>>>> wrote:
>>>>>> Yes:
>>>>>>
>>>>>> /hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.134143064971
>>>>>>
>>>>>> I ran an fsck, and here is the report:
>>>>>>
>>>>>> Status: HEALTHY
>>>>>>  Total size:    618827621255 B (Total open files size: 868 B)
>>>>>>  Total dirs:    4801
>>>>>>  Total files:   2825 (Files currently being written: 42)
>>>>>>  Total blocks (validated):      11479 (avg. block size 53909541 B) (Total open file blocks (not validated): 41)
>>>>>>  Minimally replicated blocks:   11479 (100.0 %)
>>>>>>  Over-replicated blocks:        1 (0.008711561 %)
>>>>>>  Under-replicated blocks:       0 (0.0 %)
>>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>>  Default replication factor:    4
>>>>>>  Average block replication:     4.0000873
>>>>>>  Corrupt blocks:                0
>>>>>>  Missing replicas:              0 (0.0 %)
>>>>>>  Number of data-nodes:          12
>>>>>>  Number of racks:               1
>>>>>> FSCK ended at Thu Jul 05 20:56:35 UTC 2012 in 795 milliseconds
>>>>>>
>>>>>> The filesystem under path '/hbase' is HEALTHY
>>>>>>
>>>>>> Cyril SCETBON
>>>>>>
>>>>>> On Jul 5, 2012, at 7:59 PM, Jean-Daniel Cryans wrote:
>>>>>>
>>>>>>> Does this file really exist in HDFS?
>>>>>>>
>>>>>>> hdfs://hb-zk1:54310/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711
>>>>>>>
>>>>>>> If so, did you run fsck in HDFS?
>>>>>>>
>>>>>>> It would be weird if HDFS doesn't report anything bad but somehow the
>>>>>>> clients (like HBase) can't read it.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Thu, Jul 5, 2012 at 12:45 AM, Cyril Scetbon <[email protected]>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I can no longer start my cluster correctly and get messages like
>>>>>>>> http://pastebin.com/T56wrJxE (taken on one region server).
>>>>>>>>
>>>>>>>> I suppose HBase is not designed to be fully stopped, only to have
>>>>>>>> some nodes go down? HDFS is not complaining; it's only HBase that
>>>>>>>> can't start correctly :(
>>>>>>>>
>>>>>>>> I suppose some data has not been flushed, and that's not really
>>>>>>>> important to me. Is there a way to fix these errors even if I
>>>>>>>> lose data?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Cyril SCETBON
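[Editor's note] The workaround Cyril describes, disabling distributed log splitting so the master performs all splits itself, corresponds to an hbase-site.xml entry along these lines (a sketch for the HBase 0.92-era configuration named in the thread; the property defaults to true):

```xml
<!-- hbase-site.xml: fall back to master-side (non-distributed) WAL splitting. -->
<property>
  <name>hbase.master.distributed.log.splitting</name>
  <value>false</value>
</property>
```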
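[Editor's note] J-D's "read it to the end" check can be scripted. A sketch against the cluster in the thread (the path is the one quoted above; the flags are standard Hadoop FsShell/fsck options of that era, and this obviously requires access to the live HDFS):

```shell
# Full sequential read of the suspect WAL; pipe through wc -c so the output
# is just a byte count. A short read or an exception here, despite a clean
# fsck, would confirm a client-side (network/DN access) problem.
hadoop dfs -cat \
  '/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711' \
  | wc -c

# Block-level detail for the same file, including which datanodes hold
# each replica, to see whether the failing region servers can reach them.
hadoop fsck \
  '/hbase/.logs/hb-d12,60020,1341429679981-splitting/hb-d12%2C60020%2C1341429679981.1341430649711' \
  -files -blocks -locations
```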
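[Editor's note] On the final question, starting the cluster even at the cost of unflushed edits: a common last-resort approach at the time was to sideline the `-splitting` directories so the master stops trying to replay them. This is a sketch only (the sideline path is an arbitrary choice, and the server directory name is the one quoted above); any edits still in those logs are permanently lost:

```shell
# LAST RESORT: move pending-split WAL directories out of HBase's view.
# Unflushed edits in these logs are irrecoverably discarded.
hadoop dfs -mkdir /hbase-sidelined-logs
hadoop dfs -mv '/hbase/.logs/hb-d12,60020,1341429679981-splitting' /hbase-sidelined-logs/
# Repeat for every *-splitting directory, then restart the master.
```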
