Ted,
From the master log, there was a compaction attempt around that time:
2014-08-09 22:50:51,176 DEBUG [827019302@qtp-63557232-287]
client.HBaseAdmin: Trying to compact {ENCODED =>
12c9a609765ad0bbd6468d93368f860a, NAME =>
'm_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.',
STARTKEY => '2fd811c2b1d7476efb16499ccb2b823d', ENDKEY =>
'3328d07989225a29067b7b7981150052'}:
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region
m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
is not online
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2585)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3952)
at org.apache.hadoop.hbase.regionserver.HRegionServer.compactRegion(HRegionServer.java:3750)
at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:19803)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
Also, hbase hbck shows a lot of errors. In particular, I see
ERROR: Region { meta =>
m_hashes,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.,
hdfs =>
hdfs://cluster01/apps/hbase/data/data/default/m_data/12c9a609765ad0bbd6468d93368f860a,
deployed => } not deployed on any region server.
...
ERROR: There is a hole in the region chain between
2fd811c2b1d7476efb16499ccb2b823d and 3328d07989225a29067b7b7981150052.
You need to create a new .regioninfo and region dir in hdfs to plug
the hole.
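As an aside, here is a minimal sketch for pulling the hole boundaries out of saved hbck output, so they can be compared against the region's STARTKEY and ENDKEY. It is not part of hbck itself: `find_holes` and the regular expression are hypothetical helpers, and the message format is assumed to match the error text quoted above.

```python
import re
from typing import List, Tuple

# Assumed message format, copied from the hbck error above; the message may
# be wrapped across lines, so whitespace is matched loosely.
HOLE_RE = re.compile(
    r"hole in the region chain between\s+(\S+)\s+and\s+(\S+)\.(?=\s|$)"
)


def find_holes(hbck_output: str) -> List[Tuple[str, str]]:
    """Return (start_key, end_key) pairs for every hole reported in the text."""
    return HOLE_RE.findall(hbck_output)


if __name__ == "__main__":
    sample = (
        "ERROR: There is a hole in the region chain between\n"
        "2fd811c2b1d7476efb16499ccb2b823d and 3328d07989225a29067b7b7981150052.\n"
        "You need to create a new .regioninfo and region dir in hdfs to plug\n"
        "the hole.\n"
    )
    for start, end in find_holes(sample):
        print("hole between", start, "and", end)
```

Here the reported hole matches the STARTKEY and ENDKEY of region 12c9a609765ad0bbd6468d93368f860a exactly, which suggests the hole is that one undeployed region rather than missing data.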
It looks like the data is still there:
[hbase@db03 ~]$ hadoop fs -du /apps/hbase/data/data/default/m_data/12c9a609765ad0bbd6468d93368f860a
105         /apps/hbase/data/data/default/m_data/12c9a609765ad0bbd6468d93368f860a/.regioninfo
4023827732  /apps/hbase/data/data/default/m_data/12c9a609765ad0bbd6468d93368f860a/cf1
1773806     /apps/hbase/data/data/default/m_data/12c9a609765ad0bbd6468d93368f860a/recovered.edits
I wonder whether hbase hbck -repairHoles can fix this kind of thing?
thomas
On Sun, Aug 10, 2014 at 5:17 PM, Ted Yu <[email protected]> wrote:
> bq. it's host dn29.manage.com,60020,1407600154728 is dead but not processed
> yet
>
> Can you look back (from 22:50:51) in master log to see what happened to
> dn29 ?
>
> Thanks
>
>
> On Sun, Aug 10, 2014 at 2:51 PM, Thomas Kwan <[email protected]> wrote:
>
>> Thanks for your help Ted.
>>
>> From the master's log, I see
>>
>> 2014-08-09 22:50:51,176 DEBUG [827019302@qtp-63557232-287]
>> client.HBaseAdmin: Trying to compact {ENCODED =>
>> 12c9a609765ad0bbd6468d93368f860a, NAME =>
>>
>> 'm_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.',
>> STARTKEY => '2fd811c2b1d7476efb16499ccb2b823d', ENDKEY =>
>> '3328d07989225a29067b7b7981150052'}:
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region
>>
>> m_hashes,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
>> is not online
>> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2585)
>> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3952)
>> at org.apache.hadoop.hbase.regionserver.HRegionServer.compactRegion(HRegionServer.java:3750)
>> at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:19803)
>> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>>
>> at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source)
>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
>> at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1647)
>> at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1623)
>> at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1504)
>> at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1491)
>> at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:111)
>> at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>> at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>> at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>> at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>> at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
>> at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>> at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>> at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>> at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>> at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>> at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>> at org.mortbay.jetty.Server.handle(Server.java:326)
>> at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>> at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>> at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>> at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>> ...
>> 2014-08-09 23:11:29,846 INFO [AM.-pool1-t3] master.AssignmentManager:
>> Skip assigning {ENCODED => d5887dd2b5897d14a6d2a041fc2ace1f, NAME =>
>>
>> 'm_data,2f03f0fa374de8af4880ba49401cd441,1406839342141.d5887dd2b5897d14a6d2a041fc2ace1f.',
>> STARTKEY => '2f03f0fa374de8af4880ba49401cd441', ENDKEY =>
>> '2fd811c2b1d7476efb16499ccb2b823d'}, we couldn't close it:
>> {d5887dd2b5897d14a6d2a041fc2ace1f state=FAILED_CLOSE,
>> ts=1407651089846, server=dn05.manage.com,60020,1407649977124}
>> ...
>> 2014-08-10 07:49:17,589 INFO [RpcServer.handler=237,port=60000]
>> master.AssignmentManager: Skip assigning
>>
>> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.,
>> it's host dn29.manage.com,60020,1407600154728 is dead but not
>> processed yet
>>
>> And I checked dn29 via the HBase UI running at
>> http://dn29.manage.com:60030/rs-status; it looks like there are no regions
>> on dn29.
>>
>> thanks
>> thomas
>>
>>
>> On Sun, Aug 10, 2014 at 12:28 PM, Ted Yu <[email protected]> wrote:
>> > Can you check master log to see why 'm_data,2fd811c2b1d7476efb16499ccb2b823d'
>> > went offline ?
>> >
>> > Thanks
>> >
>> >
>> > On Sun, Aug 10, 2014 at 12:13 PM, Thomas Kwan <[email protected]>
>> > wrote:
>> >
>> >> Hi Ted,
>> >>
>> >> Hbase version is 0.96.0.2.0
>> >>
>> >> Nothing interesting in the HBase log on dn29, and I confirmed that the
>> >> region server is running on dn29.
>> >>
>> >> When I do a 'get', I see
>> >>
>> >> hbase(main):001:0> get 'm_data','2fd811c2b1d7476efb16499ccb2b823d'
>> >>
>> >> COLUMN CELL
>> >>
>> >> ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region
>> >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
>> >> is not online
>> >> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2585)
>> >> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3952)
>> >> at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
>> >> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26925)
>> >> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>> >> at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>> >>
>> >> On Sun, Aug 10, 2014 at 10:32 AM, Ted Yu <[email protected]> wrote:
>> >> > bq. if I can just rmr stuff under /hbase-unsecure/splitWAL/...
>> >> >
>> >> > Please don't.
>> >> >
>> >> > Have you checked region server log on dn29.manage.com ?
>> >> >
>> >> > What hbase version are you using ?
>> >> >
>> >> > Cheers
>> >> >
>> >> >
>> >> > On Sun, Aug 10, 2014 at 10:27 AM, Thomas Kwan <[email protected]>
>> >> > wrote:
>> >> >
>> >> >> And I have a program that does some read operations and it hangs. And
>> >> >> I am seeing
>> >> >>
>> >> >> 2014-08-10 12:22:05,359 DEBUG [main]
>> >> >> client.HConnectionManager$HConnectionImplementation: Removed all
>> >> >> cached region locations that map to
>> >> >> dn29.manage.com,60020,1407600154728
>> >> >> 2014-08-10 12:22:06,173 DEBUG [main]
>> >> >> client.HConnectionManager$HConnectionImplementation: Removed
>> >> >> dn29.manage.com:60020 as a location of
>> >> >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
>> >> >> for tableName=m_data from cache
>> >> >> 2014-08-10 12:22:07,180 DEBUG [main]
>> >> >> client.HConnectionManager$HConnectionImplementation: Removed
>> >> >> dn29.manage.com:60020 as a location of
>> >> >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
>> >> >> for tableName=m_data from cache
>> >> >> 2014-08-10 12:22:09,193 DEBUG [main]
>> >> >> client.HConnectionManager$HConnectionImplementation: Removed
>> >> >> dn29.manage.com:60020 as a location of
>> >> >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
>> >> >> for tableName=m_data from cache
>> >> >> 2014-08-10 12:22:09,196 DEBUG [main]
>> >> >> client.HConnectionManager$HConnectionImplementation: Removed all
>> >> >> cached region locations that map to
>> >> >> dn29.manage.com,60020,1407600154728
>> >> >> 2014-08-10 12:22:13,208 DEBUG [main]
>> >> >> client.HConnectionManager$HConnectionImplementation: Removed all
>> >> >> cached region locations that map to
>> >> >> dn29.manage.com,60020,1407600154728
>> >> >>
>> >> >> I am seeing the following in the hbase master also
>> >> >>
>> >> >> 2014-08-10 10:22:25,016 INFO
>> >> >> [master02.manage.com,60000,1407690402682.splitLogManagerTimeoutMonitor]
>> >> >> master.SplitLogManager: total tasks = 1 unassigned = 0
>> >> >> tasks={/hbase-unsecure/splitWAL/WALs%2Fdn29.manage.com%2C60020%2C1407600154728-splitting%2Fdn29.manage.com%252C60020%252C1407600154728.1407621759364=last_update
>> >> >> = 1407690428226 last_version = 53 cur_worker_name =
>> >> >> dn21.manage.com,60020,1407650188526 status = in_progress incarnation =
>> >> >> 3 resubmits = 3 batch = installed = 1 done = 0 error = 0}
>> >> >>
>> >> >> I wonder if I can just rmr stuff under /hbase-unsecure/splitWAL/...
>> >> >>
>> >> >> thanks
>> >> >> thomas
>> >> >>
>> >>
>>