RE: wrong region exception

Robert Gonzalez Thu, 02 Jun 2011 10:58:14 -0700

OK, I think I know why it gets stuck there: that's where the hole is in the 
original table.  I skipped past the hole, and it is off and running again.  
Knock on wood!
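In this context, a "hole" is a gap in the table's region chain: one region's end key is not equal to the next region's start key, so no region covers the rows in between. The following is a minimal sketch of that continuity check. It assumes region boundaries are available as (start key, end key) string pairs, like the ones shown on the table's web page. The `Region` class and `findProblems` method are hypothetical illustrations, not HBase or hbck APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RegionHoleCheck {
    // Hypothetical region descriptor: "" as startKey = first region, "" as endKey = last region.
    static final class Region {
        final String startKey;
        final String endKey;

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Sorts regions by start key and reports holes (gaps) and overlaps between neighbours.
    // Keys are compared as Strings, which matches byte order for ASCII hex keys.
    static List<String> findProblems(List<Region> regions) {
        List<Region> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparing((Region r) -> r.startKey));
        List<String> problems = new ArrayList<>();
        if (sorted.isEmpty()) {
            return problems;
        }
        if (!sorted.get(0).startKey.isEmpty()) {
            problems.add("hole before " + sorted.get(0).startKey);
        }
        for (int i = 1; i < sorted.size(); i++) {
            String prevEnd = sorted.get(i - 1).endKey;
            String start = sorted.get(i).startKey;
            if (prevEnd.isEmpty()) {
                // The previous region already extends to the end of the table.
                problems.add("overlap: region starting at " + start + " follows the last region");
                continue;
            }
            int cmp = prevEnd.compareTo(start);
            if (cmp < 0) {
                problems.add("hole: [" + prevEnd + ", " + start + ")");
            } else if (cmp > 0) {
                problems.add("overlap: [" + start + ", " + prevEnd + ")");
            }
        }
        String lastEnd = sorted.get(sorted.size() - 1).endKey;
        if (!lastEnd.isEmpty()) {
            problems.add("hole after " + lastEnd);
        }
        return problems;
    }

    public static void main(String[] args) {
        List<Region> regions = List.of(
                new Region("", "7FD3"),
                new Region("7FD3", "7FE6"),
                new Region("7FF5", "")); // nothing covers 7FE6..7FF5
        System.out.println(findProblems(regions)); // prints [hole: [7FE6, 7FF5)]
    }
}
```

Region end keys are exclusive, so a reported hole `[a, b)` is exactly the key range that has no region to hold it.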


Robert


-----Original Message-----
From: Robert Gonzalez [mailto:[email protected]] 
Sent: Thursday, June 02, 2011 12:07 PM
To: '[email protected]'
Subject: RE: wrong region exception

Also, note the output of my copy, which is now stuck on the final line.  
The first column is the total number of rows copied; the second column is the key value:
total:904600000 7FECD7A2D11FFD850FDC7CA899CA3138
total:904700000 7FF0787C8EC28FF760BF0E38BB1F95C8
total:904800000 7FF418DDFCB134EFA7F1304762EA4A20
total:904900000 7FF7B7BC506DC77272DC9CBAE27DDD2D
total:905000000 7FFB5E24CC30B1FF8A9AE73068EFDB0B
total:905100000 7FFF0085ECE908C208BA083A99C05E42
total:905200000 8002A1540C309D99F587DAA712167091
total:905300000 800644A00B083B496A07B0633A51B528
total:905400000 8009E8A2EAC96846405476D294FDD999
total:905500000 800D8E0D6DB9259F16775B1080AE6968
total:905600000 80112E5D7E36AFB915DE4906BFF9F41C

But the HBase web page for the table urlhashv4 (the one we are copying into) 
shows that it only got this far:

urlhashv4,7FC19684831DF6E8ACCE0E690EF5BCAB,1306993100934.63e52326bc86f332a2f38056d934cbf3.
    server: c1-s06.atxd.maxpointinteractive.com:60030
    start: 7FC19684831DF6E8ACCE0E690EF5BCAB  end: 7FD3A81AD94CD99BEA6B4DA485BDDEBE
urlhashv4,7FD3A81AD94CD99BEA6B4DA485BDDEBE,1306993171155.0296cf0214b2d7dfe8ad8adac3ad7bf5.
    server: c1-s06.atxd.maxpointinteractive.com:60030
    start: 7FD3A81AD94CD99BEA6B4DA485BDDEBE  end: 7FE65457BB3A9492B6A0437124D6F5C7
urlhashv4,7FE65457BB3A9492B6A0437124D6F5C7,1306993227676.c038dcb619eabbd6d862634b83ba412e.
    server: c1-s06.atxd.maxpointinteractive.com:60030
    start: 7FE65457BB3A9492B6A0437124D6F5C7  end: 7FF5A0C88779F27177F1E5E8159680BE
urlhashv4,7FF5A0C88779F27177F1E5E8159680BE,1306993227676.5348549ea1080dca60d2e043da973258.
    server: c1-s06.atxd.maxpointinteractive.com:60030
    start: 7FF5A0C88779F27177F1E5E8159680BE  end: (empty)


That, together with the messages on the slave that is inserting the data, 
suggests to me that it is about to hit the same WrongRegionException 
again.
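The row keys in the copy output look like 128-bit hashes written as 32 hex digits. That is an assumption based on the table name urlhashv4 and the key format; the thread does not state it. Under that assumption, and assuming the copy started at the lowest key, the current key alone gives a rough measure of how far through the key space the copy has progressed. The sketch below illustrates the arithmetic. `KeyProgress`, `fractionCopied`, and `estimateTotalRows` are hypothetical names, and the estimate is only meaningful if keys are spread roughly uniformly.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

public class KeyProgress {
    // Assumes 32-hex-digit (128-bit) row keys spread uniformly over the key space.
    static final BigInteger KEY_SPACE = BigInteger.ONE.shiftLeft(128);

    // Fraction of the key space that lies below the given key.
    static double fractionCopied(String hexKey) {
        BigInteger key = new BigInteger(hexKey, 16);
        return new BigDecimal(key)
                .divide(new BigDecimal(KEY_SPACE), MathContext.DECIMAL64)
                .doubleValue();
    }

    // Rough total-row estimate from the rows copied so far and the current key.
    static long estimateTotalRows(long rowsSoFar, String hexKey) {
        double fraction = fractionCopied(hexKey);
        if (fraction == 0.0) {
            throw new IllegalArgumentException("cannot extrapolate from the lowest key");
        }
        return Math.round(rowsSoFar / fraction);
    }

    public static void main(String[] args) {
        String key = "80112E5D7E36AFB915DE4906BFF9F41C"; // last line of the copy output
        System.out.printf("fraction=%.4f estimatedTotal=%d%n",
                fractionCopied(key), estimateTotalRows(905_600_000L, key));
    }
}
```

For the last key in the copy output (80112E5D...), the fraction works out to just over 0.5.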


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Wednesday, June 01, 2011 5:34 PM
To: [email protected]
Subject: Re: wrong region exception

We can't reach the server carrying .META. within 60 seconds.  What's going on 
on that server?  The next time the CatalogJanitor below runs, does it succeed, 
or does it always fail?

St.Ack

On Wed, Jun 1, 2011 at 2:27 PM, Robert Gonzalez <[email protected]> wrote:
> This is basically it (for the first time it died while copying); we log at 
> WARN level and above:
>
> 2011-05-27 16:44:27,565 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.100.2.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy6.delete(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaEditor.deleteDaughterReferenceInParent(MetaEditor.java:201)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.removeDaughterFromParent(CatalogJanitor.java:233)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.hasReferences(CatalogJanitor.java:275)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.checkDaughter(CatalogJanitor.java:202)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.cleanParent(CatalogJanitor.java:166)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:120)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> Sent: Wednesday, June 01, 2011 12:29 PM
> To: [email protected]
> Subject: Re: wrong region exception
>
> On Wed, Jun 1, 2011 at 7:32 AM, Robert Gonzalez <[email protected]> wrote:
>> We have a table copy program that copies the data from one table to another, 
>> and we can give it the start/end keys.  In this case we created a new blank 
>> table with the essential column families and let it run with start/end set to 
>> the whole range, 0 to maxkey.  At about 30% of the way through, which is 
>> roughly 600 million rows, it died trying to write to the new table with a 
>> WrongRegionException.  When we tried to restart the copy from that key + 
>> some delta, it still crapped out.  There was no explanation in the logs the 
>> first time, but there was a series of timeouts in the second run.  Now we are 
>> trying the copy again with a new table.
>>
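Restarting "from that key + some delta" amounts to treating the hex row key as an unsigned integer and adding an offset to it. The following is a minimal sketch, assuming 32-hex-digit upper-case keys like the ones quoted in this thread. `ResumeKey` and `advanceKey` are hypothetical helpers; they are not part of the copy program described here or of HBase.

```java
import java.math.BigInteger;

public class ResumeKey {
    static final int KEY_HEX_DIGITS = 32; // keys in this thread are 32 hex characters
    static final BigInteger MAX_KEY =
            BigInteger.ONE.shiftLeft(4 * KEY_HEX_DIGITS).subtract(BigInteger.ONE);

    // Returns the key `delta` positions after `hexKey`, upper-case and zero-padded
    // to 32 digits, clamped to the largest representable key.
    static String advanceKey(String hexKey, BigInteger delta) {
        if (delta.signum() < 0) {
            throw new IllegalArgumentException("delta must be non-negative");
        }
        BigInteger next = new BigInteger(hexKey, 16).add(delta);
        if (next.compareTo(MAX_KEY) > 0) {
            next = MAX_KEY;
        }
        String hex = next.toString(16).toUpperCase();
        StringBuilder padded = new StringBuilder();
        for (int i = hex.length(); i < KEY_HEX_DIGITS; i++) {
            padded.append('0');
        }
        return padded.append(hex).toString();
    }

    public static void main(String[] args) {
        // Skip 2^104 keys past the start key of the last region listed earlier in the thread.
        String resume = advanceKey("7FF5A0C88779F27177F1E5E8159680BE", BigInteger.ONE.shiftLeft(104));
        System.out.println(resume); // prints 7FF5A1C88779F27177F1E5E8159680BE
    }
}
```

Padding and upper-casing keep the computed key in the same textual form as the stored keys. This matters if the keys are stored as ASCII hex, because such keys sort as strings.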
>
> Robert:
>
> Do you still have the master logs for this copy run?  If so, put them 
> somewhere I can pull them from (or send them to me) and I'll take a 
> look.  I'd like to see the logs from the cluster you were copying the 
> data to.
>
> St.Ack
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>> Sent: Tuesday, May 31, 2011 6:42 PM
>> To: [email protected]
>> Subject: Re: wrong region exception
>>
>> So, what about this new WrongRegionException in the new cluster?  Can you 
>> figure out how it came about?  In the new cluster, is there also a hole?  Did 
>> you start the new cluster fresh or copy from the old cluster?
>>
>> St.Ack
>>
>> On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez <[email protected]> wrote:
>>> Yeah, we learned the hard way early last year to follow the guidelines 
>>> religiously.  I've gone over the requirements and checked off everything.  
>>> We even redid our tables to have only 4 column families, down from four times 
>>> that many.  We are at a loss to explain why we seem to be cursed when it 
>>> comes to HBase.  Hadoop is performing like a charm; pretty much every 
>>> machine is busy 24/7.
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>>> Sent: Tuesday, May 31, 2011 3:03 PM
>>> To: [email protected]
>>> Subject: Re: wrong region exception
>>>
>>> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <[email protected]> wrote:
>>>> Now I'm getting the WrongRegionException on the new table that I'm 
>>>> copying the old table into.  Running hbck reveals an inconsistency in the 
>>>> new table.  The frustration is unbelievable.  Like I said before, it 
>>>> doesn't appear that HBase is ready for prime time.  I don't know how 
>>>> companies are using this successfully; it doesn't seem plausible.
>>>>
>>>
>>>
>>> Sorry you are not having a good experience.  I've not seen a 
>>> WrongRegionException in ages (grep these lists yourself), which makes me 
>>> suspect your environment.  Are you sure you've read the requirements section 
>>> in the manual and set up ulimits, nprocs, and xceivers?
>>>
>>> St.Ack
>>>
>>
>
