Eran: You didn't run hbck during the enabling of gs_raw_events table, right ?
I saw: 2011-06-29 16:43:50,395 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction (major) requested for gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. because User-triggered major compaction; priority=1, compaction queue size=1248 The above might be related to: >> 2011-06-29 16:43:57,880 INFO org.apache.hadoop.hbase. master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. Thanks On Tue, Jul 5, 2011 at 7:19 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Eran: > I logged https://issues.apache.org/jira/browse/HBASE-4060 for you. > > > On Mon, Jul 4, 2011 at 2:30 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Thanks for the understanding. >> >> Can you log a JIRA and put your ideas below in it ? >> >> >> >> On Jul 4, 2011, at 12:42 AM, Eran Kutner <e...@gigya.com> wrote: >> >> > Thanks for the explanation Ted, >> > >> > I will try to apply HBASE-3789 and hope for the best but my >> understanding is >> > that it doesn't really solve the problem, it only reduces the >> probability of >> > it happening, at least in one particular scenario. I would hope for a >> more >> > robust solution. >> > My concern is that the region allocation process seems to rely too much >> on >> > timing considerations and doesn't seem to take enough measures to >> guarantee >> > conflicts do not occur. I understand that in a distributed environment, >> when >> > you don't get a timely response from a remote machine you can't know for >> > sure if it did or did not receive the request, however there are things >> that >> > can be done to mitigate this and reduce the conflict time significantly. >> For >> > example, when I run dbck it knows that some regions are multiply >> assigned, >> > the master could do the same and try to resolve the conflict. Another >> > approach would be to handle late responses, even if the response from >> the >> > remote machine arrives after it was assumed to be dead the master should >> > have enough information to know it had created a conflict by assigning >> the >> > region to another server. An even better solution, I think, is for the >> RS to >> > periodically test that it is indeed the rightful owner of every region >> it >> > holds and relinquish control over the region if it's not. >> > Obviously a state where two RSs hold the same region is pathological and >> can >> > lead to data loss, as demonstrated in my case. The system should be able >> to >> > actively protect itself against such a scenario. It probably doesn't >> need >> > saying but there is really nothing worse for a data storage system than >> data >> > loss. >> > >> > In my case the problem didn't happen in the initial phase but after >> > disabling and enabling a table with about 12K regions. >> > >> > -eran >> > >> > >> > >> > On Sun, Jul 3, 2011 at 23:49, Ted Yu <yuzhih...@gmail.com> wrote: >> > >> >> Let me try to answer some of your questions. >> >> The two paragraphs below were written along my reasoning which is in >> >> reverse >> >> order of the actual call sequence. >> >> >> >> For #4 below, the log indicates that the following was executed: >> >> private void assign(final RegionState state, final boolean >> setOfflineInZK, >> >> final boolean forceNewPlan) { >> >> for (int i = 0; i < this.maximumAssignmentAttempts; i++) { >> >> if (setOfflineInZK && !*setOfflineInZooKeeper*(state)) return; >> >> >> >> The above was due to the timeout which you noted in #2 which would have >> >> caused >> >> TimeoutMonitor.chore() to run this code (line 1787) >> >> >> >> for (Map.Entry<HRegionInfo, Boolean> e: assigns.entrySet()){ >> >> assign(e.getKey(), false, e.getValue()); >> >> } >> >> >> >> This means there is lack of coordination between >> >> assignmentManager.TimeoutMonitor and OpenedRegionHandler >> >> >> >> The reason I mention HBASE-3789 is that it is marked as Incompatible >> change >> >> and is in TRUNK already. >> >> The application of HBASE-3789 to 0.90 branch would change the behavior >> >> (timing) of region assignment. >> >> >> >> I think it makes sense to evaluate the effect of HBASE-3789 in 0.90.4 >> >> >> >> BTW were the incorrect region assignments observed for a table with >> >> multiple >> >> initial regions ? >> >> If so, I have HBASE-4010 in TRUNK which speeds up initial region >> assignment >> >> by about 50%. >> >> >> >> Cheers >> >> >> >> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <e...@gigya.com> wrote: >> >> >> >>> Ted, >> >>> So if I understand correctly the the theory is that because of the >> issue >> >>> fixed in HBASE-3789 the master took too long to detect that the region >> >> was >> >>> successfully opened by the first server so it forced closed it and >> >>> transitioned to a second server, but there are a few things about this >> >>> scenario I don't understand, probably because I don't know enough >> about >> >> the >> >>> inner workings of the region transition process and would appreciate >> it >> >> if >> >>> you can help me understand: >> >>> 1. The RS opened the region at 16:37:49. >> >>> 2. The master started handling the opened event at 16:39:54 - this >> delay >> >>> can >> >>> probably be explained by HBASE-3789 >> >>> 3. At 16:39:54 the master log says: Opened region gs_raw_events,..... >> on >> >>> hadoop1-s05.farm-ny.gigya.com >> >>> 4. Then at 16:40:00 the master log says: >> master:60000-0x13004a31d7804c4 >> >>> Creating (or updating) unassigned node for >> 584dac5cc70d8682f71c4675a843c3 >> >>> 09 with OFFLINE state - why did it decide to take the region offline >> >> after >> >>> learning it was successfully opened? >> >>> 5. Then it tries to reopen the region on hadoop1-s05, which indicates >> in >> >>> its >> >>> log that the open request failed because the region was already open - >> >> why >> >>> didn't the master use that information to learn that the region was >> >> already >> >>> open? >> >>> 6. At 16:43:57 the master decides the region transition timed out and >> >>> starts >> >>> forcing the transition - HBASE-3789 again? >> >>> 7. Now the master forces the transition of the region to hadoop1-s02 >> but >> >>> there is no sign of that on hadoop1-s05 - why doesn't the old RS >> >>> (hadoop1-s05) detect that it is no longer the master and relinquishes >> >>> control of the region? >> >>> >> >>> Thanks. >> >>> >> >>> -eran >> >>> >> >>> >> >>> >> >>> On Sun, Jul 3, 2011 at 20:09, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> >> >>>> HBASE-3789 should have sped up region assignment. >> >>>> The patch for 0.90 is attached to that JIRA. >> >>>> >> >>>> You may prudently apply that patch. >> >>>> >> >>>> Regards >> >>>> >> >>>> On Sun, Jul 3, 2011 at 10:01 AM, Eran Kutner <e...@gigya.com> wrote: >> >>>> >> >>>>> Thanks Ted, but, as stated before, I'm already using 0.90.3, so >> >> either >> >>>> it's >> >>>>> not fixed or it's not the same thing. >> >>>>> >> >>>>> -eran >> >>>>> >> >>>>> >> >>>>> >> >>>>> On Sun, Jul 3, 2011 at 17:27, Ted Yu <yuzhih...@gmail.com> wrote: >> >>>>> >> >>>>>> Eran: >> >>>>>> I was thinking of this: >> >>>>>> HBASE-3789 Cleanup the locking contention in the master >> >>>>>> >> >>>>>> though it doesn't directly handle 'PENDING_OPEN for too long' case. >> >>>>>> >> >>>>>> https://issues.apache.org/jira/browse/HBASE-3741 is in 0.90.3 and >> >>>>> actually >> >>>>>> close to the symptom you described. >> >>>>>> >> >>>>>> On Sun, Jul 3, 2011 at 12:00 AM, Eran Kutner <e...@gigya.com> >> >> wrote: >> >>>>>> >> >>>>>>> It does seem that both servers opened the same region around the >> >>> same >> >>>>>> time. >> >>>>>>> The region was offline because I disabled the table so I can >> >> change >> >>>> its >> >>>>>>> TTL. >> >>>>>>> >> >>>>>>> Here is the log from haddop1-s05 : >> >>>>>>> 2011-06-29 16:37:12,576 INFO >> >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received >> >>> request >> >>>> to >> >>>>>>> open >> >>>>>>> region: >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:37:12,680 DEBUG >> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >> >>>>>> Processing >> >>>>>>> open of >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:37:12,680 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:37:12,711 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:37:12,711 DEBUG >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> Opening region: REGION => {NAME => >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', >> >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => >> >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED => >> >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => >> >>> 'gs_raw_events', >> >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', >> >>>>> REPLICATION_SCOPE >> >>>>>> => >> >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', >> >>>> BLOCKSIZE >> >>>>> => >> >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} >> >>>>>>> 2011-06-29 16:37:12,711 DEBUG >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> Instantiated >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:37:12,847 DEBUG >> >>>>> org.apache.hadoop.hbase.regionserver.Store: >> >>>>>>> loaded >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, >> >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062, >> >>>>>>> majorCompaction=false >> >>>>>>> 2011-06-29 16:37:12,848 INFO >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> Onlined >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; >> >>>>>>> next sequenceid=1162228063 >> >>>>>>> 2011-06-29 16:37:12,849 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:37:12,875 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:37:12,951 INFO >> >>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >> >>>>>>> Updated row >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> in region .META.,,1 with server= >> >>> hadoop1-s05.farm-ny.gigya.com:60020, >> >>>>>>> startcode=1307349217076 >> >>>>>>> 2011-06-29 16:37:12,951 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENED >> >>>>>>> 2011-06-29 16:37:12,964 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENED >> >>>>>>> 2011-06-29 16:37:12,964 DEBUG >> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >> >>>> Opened >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:40:00,878 INFO >> >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received >> >>> request >> >>>> to >> >>>>>>> open >> >>>>>>> region: >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:40:00,878 DEBUG >> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >> >>>>>> Processing >> >>>>>>> open of >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:40:01,079 WARN >> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >> >>>>> Attempted >> >>>>>>> open of >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> but already online on this server >> >>>>>>> 2011-06-29 16:43:50,395 DEBUG >> >>>>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: >> >> Compaction >> >>>>>> (major) >> >>>>>>> requested for >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> because User-triggered major compaction; priority=1, compaction >> >>> queue >> >>>>>>> size=1248 >> >>>>>>> 2011-06-29 20:19:49,906 INFO >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> Starting major compaction on region >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 20:19:49,906 INFO >> >>>>> org.apache.hadoop.hbase.regionserver.Store: >> >>>>>>> Started compaction of 1 file(s) in cf=events into >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp, >> >>>>>>> seqid=1162228062, totalSize=98.3m >> >>>>>>> 2011-06-29 20:19:49,906 DEBUG >> >>>>> org.apache.hadoop.hbase.regionserver.Store: >> >>>>>>> Compacting >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, >> >>>>>>> keycount=6882816, bloomtype=NONE, size=98.3m >> >>>>>>> 2011-06-29 20:19:59,920 INFO >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> completed compaction on region >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> after 10sec >> >>>>>>> >> >>>>>>> And here is the one from hadoop1-s02: >> >>>>>>> 2011-06-29 16:43:57,935 INFO >> >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received >> >>> request >> >>>> to >> >>>>>>> open >> >>>>>>> region: >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:43:58,990 DEBUG >> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >> >>>>>> Processing >> >>>>>>> open of >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:43:58,990 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:43:59,002 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:43:59,002 DEBUG >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> Opening region: REGION => {NAME => >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', >> >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => >> >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED => >> >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => >> >>> 'gs_raw_events', >> >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', >> >>>>> REPLICATION_SCOPE >> >>>>>> => >> >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', >> >>>> BLOCKSIZE >> >>>>> => >> >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} >> >>>>>>> 2011-06-29 16:43:59,003 DEBUG >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> Instantiated >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> 2011-06-29 16:43:59,204 DEBUG >> >>>>> org.apache.hadoop.hbase.regionserver.Store: >> >>>>>>> loaded >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, >> >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062, >> >>>>>>> majorCompaction=false >> >>>>>>> 2011-06-29 16:43:59,205 INFO >> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >> >>>>>>> Onlined >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; >> >>>>>>> next sequenceid=1162228063 >> >>>>>>> 2011-06-29 16:43:59,205 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:43:59,212 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENING >> >>>>>>> 2011-06-29 16:43:59,214 INFO >> >>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >> >>>>>>> Updated row >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> in region .META.,,1 with server= >> >>> hadoop1-s02.farm-ny.gigya.com:60020, >> >>>>>>> startcode=1306919627544 >> >>>>>>> 2011-06-29 16:43:59,214 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENED >> >>>>>>> 2011-06-29 16:43:59,224 DEBUG >> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned >> >> node >> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >> >>>>>>> RS_ZK_REGION_OPENED >> >>>>>>> 2011-06-29 16:43:59,224 DEBUG >> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >> >>>> Opened >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>> java.io.IOException: Got error in response to OP_READ_BLOCK >> >> self=/ >> >>>>>>> 10.1.104.2:33356, remote=/10.1.104.2:50010 for file >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 >> >>>>>>> for block 3674866614142268536_674205 >> >>>>>>> >> >>>>>>> >> >>>>>>> Ted, can you please point me to J-D's bug fix you mentioned? Are >> >>> you >> >>>>>>> positive it's the same scenario - data loss is a very serious >> >>> problem >> >>>>> for >> >>>>>> a >> >>>>>>> DB. >> >>>>>>> I'd really like to apply that patch ASAP, because when I run hbck >> >> I >> >>>> get >> >>>>>>> over >> >>>>>>> 400 regions which are multiply assigned. >> >>>>>>> Last question, I understand the region's data is lost but is >> >> there >> >>> a >> >>>>> way >> >>>>>> to >> >>>>>>> at least make the table consistent again by some how removing the >> >>>> lost >> >>>>>>> region? >> >>>>>>> >> >>>>>>> Thanks. >> >>>>>>> >> >>>>>>> -eran >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Sat, Jul 2, 2011 at 01:46, Ted Yu <yuzhih...@gmail.com> >> >> wrote: >> >>>>>>> >> >>>>>>>>>> 2011-06-29 16:43:57,880 INFO >> >>>>>>>> org.apache.hadoop.hbase. >> >>>>>>>> master.AssignmentManager: Region has been >> >>>>>>>> PENDING_OPEN for too long, reassigning >> >>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>> >> >>>>>>>> The double assignment should have been fixed by J-D's recent >> >>>> checkin. >> >>>>>>>> >> >>>>>>>> On Fri, Jul 1, 2011 at 3:14 PM, Stack <st...@duboce.net> >> >> wrote: >> >>>>>>>> >> >>>>>>>>> Is >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>> the region that was having the issue? If so, if you looked >> >> in >> >>>>>>>>> hadoop1-s05's logs, was this region opened around 2011-06-29 >> >>>>>> 16:43:57? >> >>>>>>>>> Was it also opened hadoop1-s02 not long after? Did you say >> >>> what >> >>>>>>>>> version of hbase you are on? >> >>>>>>>>> >> >>>>>>>>> St.Ack >> >>>>>>>>> >> >>>>>>>>> On Fri, Jul 1, 2011 at 5:08 AM, Eran Kutner <e...@gigya.com> >> >>>>> wrote: >> >>>>>>>>>> Hi Stack, >> >>>>>>>>>> I'm not sure what the log means. I do see references to two >> >>>>>> different >> >>>>>>>>>> servers, but that would probably happen if there was normal >> >>>>>>> transition >> >>>>>>>> I >> >>>>>>>>>> assume. I'm using version 0.90.3 >> >>>>>>>>>> Here are the relevant lines from the master logs: >> >>>>>>>>>> >> >>>>>>>>>> 2011-06-19 21:39:37,164 INFO >> >>>>>>>>> org.apache.hadoop.hbase.master.ServerManager: >> >>>>>>>>>> Received REGION_SPLIT: >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.: >> >>>>>>>>>> Daughters; >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518810_1249_WEB204,1308533970928.46f876a4e97be04edb35eb8f8959d482. >> >>>>>>>>>> from hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >> >>>>>>>>>> 2011-06-19 21:43:12,983 INFO >> >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >> >>>>>>>>>> Deleted daughter reference >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >> >>>>>>>>>> qualifier=splitA, from parent >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658. >> >>>>>>>>>> 2011-06-29 16:29:36,143 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Starting >> >>>>>>> unassignment >> >>>>>>>>> of >> >>>>>>>>>> region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> (offlining) >> >>>>>>>>>> 2011-06-29 16:29:36,146 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Sent >> >> CLOSE >> >>> to >> >>>>>>>>> serverName= >> >>>>>>>>>> hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >> >>>>>> load=(requests=0, >> >>>>>>>>>> regions=1654, usedHeap=1870, maxHeap=12483) for region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >> >>> new >> >>>>>>>> unassigned >> >>>>>>>>>> node: /hbase/unassigned/584dac5cc70d8682f71c4675a843c309 >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> (region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >> >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >> >>>>>>>>>> state=RS_ZK_REGION_CLOSED) >> >>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >> >>>>>>>>>> transition=RS_ZK_REGION_CLOSED, >> >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 >> >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: >> >>>>>> Handling >> >>>>>>>>> CLOSED >> >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309 >> >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Table >> >> being >> >>>>>>> disabled >> >>>>>>>> so >> >>>>>>>>>> deleting ZK node and removing from regions in transition, >> >>>>> skipping >> >>>>>>>>>> assignment of region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned >> >>>> node >> >>>>>> for >> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state >> >>>>>>>>>> RS_ZK_REGION_CLOSED >> >>>>>>>>>> 2011-06-29 16:30:53,801 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted >> >>> unassigned >> >>>>> node >> >>>>>>> for >> >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state >> >>>>>>>>>> RS_ZK_REGION_CLOSED >> >>>>>>>>>> 2011-06-29 16:34:01,453 INFO >> >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >> >>>>>>>>>> Updated region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> in META >> >>>>>>>>>> 2011-06-29 16:37:12,247 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating) >> >>>> unassigned >> >>>>>> node >> >>>>>>>> for >> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state >> >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No >> >> previous >> >>>>>>>> transition >> >>>>>>>>>> plan was found (or we are ignoring an existing plan) for >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> so generated a random one; >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >> >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com >> >>> ,60020,1307349217076; >> >>>> 5 >> >>>>>>>>> (online=5, >> >>>>>>>>>> exclude=null) available servers >> >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning >> >>>>> region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >> >>>>>>>>>> 2011-06-29 16:37:13,102 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >> >>>>>>>>>> transition=RS_ZK_REGION_OPENED, >> >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 >> >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >> >>>>>> Handling >> >>>>>>>>> OPENED >> >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309; deleting >> >>> unassigned >> >>>>>> node >> >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned >> >>>> node >> >>>>>> for >> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state >> >>>>>>>>>> RS_ZK_REGION_OPENED >> >>>>>>>>>> 2011-06-29 16:39:54,192 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted >> >>> unassigned >> >>>>> node >> >>>>>>> for >> >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state >> >>>>>>>>>> RS_ZK_REGION_OPENED >> >>>>>>>>>> 2011-06-29 16:39:54,326 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >> >>>>> Opened >> >>>>>>>> region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >> >>>>>>>>>> 2011-06-29 16:40:00,598 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating) >> >>>> unassigned >> >>>>>> node >> >>>>>>>> for >> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state >> >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No >> >> previous >> >>>>>>>> transition >> >>>>>>>>>> plan was found (or we are ignoring an existing plan) for >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> so generated a random one; >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >> >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com >> >>> ,60020,1307349217076; >> >>>> 5 >> >>>>>>>>> (online=5, >> >>>>>>>>>> exclude=null) available servers >> >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning >> >>>>> region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >> >>>>>>>>>> 2011-06-29 16:43:57,879 INFO >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Regions >> >> in >> >>>>>>> transition >> >>>>>>>>>> timed out: >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> state=PENDING_OPEN, ts=1309380052723 >> >>>>>>>>>> 2011-06-29 16:43:57,880 INFO >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Region >> >> has >> >>>> been >> >>>>>>>>>> PENDING_OPEN for too long, reassigning >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Forcing >> >>>>> OFFLINE; >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> was=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> state=PENDING_OPEN, ts=1309380052723 >> >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No >> >> previous >> >>>>>>>> transition >> >>>>>>>>>> plan was found (or we are ignoring an existing plan) for >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> so generated a random one; >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >> >>>>>>>>>> src=, dest=hadoop1-s02.farm-ny.gigya.com >> >>> ,60020,1306919627544; >> >>>> 5 >> >>>>>>>>> (online=5, >> >>>>>>>>>> exclude=null) available servers >> >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning >> >>>>> region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> to hadoop1-s02.farm-ny.gigya.com,60020,1306919627544 >> >>>>>>>>>> 2011-06-29 16:43:59,022 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >> >>>>>>>>>> transition=RS_ZK_REGION_OPENING, >> >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, >> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 >> >>>>>>>>>> 2011-06-29 16:43:59,221 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >> >>>>>>>>>> transition=RS_ZK_REGION_OPENING, >> >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, >> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 >> >>>>>>>>>> 2011-06-29 16:43:59,226 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >> >>>>>>>>>> transition=RS_ZK_REGION_OPENED, >> >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, >> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 >> >>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >> >>>>>> Handling >> >>>>>>>>> OPENED >> >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309; deleting >> >>> unassigned >> >>>>>> node >> >>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned >> >>>> node >> >>>>>> for >> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state >> >>>>>>>>>> RS_ZK_REGION_OPENED >> >>>>>>>>>> 2011-06-29 16:43:59,296 DEBUG >> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted >> >>> unassigned >> >>>>> node >> >>>>>>> for >> >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state >> >>>>>>>>>> RS_ZK_REGION_OPENED >> >>>>>>>>>> 2011-06-29 16:43:59,375 WARN >> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: >> >> Overwriting >> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 on >> >>>>>>>>>> serverName=hadoop1-s05.farm-ny.gigya.com >> >>> ,60020,1307349217076, >> >>>>>>>>>> load=(requests=0, regions=1273, usedHeap=2676, >> >> maxHeap=12483) >> >>>>>>>>>> 2011-06-29 16:43:59,375 DEBUG >> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >> >>>>> Opened >> >>>>>>>> region >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >> >>>>>>>>>> on hadoop1-s02.farm-ny.gigya.com,60020,1306919627544 >> >>>>>>>>>> >> >>>>>>>>>> Thanks. >> >>>>>>>>>> >> >>>>>>>>>> -eran >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On Fri, Jul 1, 2011 at 09:05, Stack <st...@duboce.net> >> >>> wrote: >> >>>>>>>>>> >> >>>>>>>>>>> So, Eran, it seems as though two RegionServers were >> >> carrying >> >>>> the >> >>>>>>>>>>> region? One deleted a file (compaction on its side)? Can >> >>> you >> >>>>>>> figure >> >>>>>>>>>>> if indeed two servers had same region? (Check master logs >> >>> for >> >>>>>> this >> >>>>>>>>>>> regions assignments). >> >>>>>>>>>>> >> >>>>>>>>>>> What version of hbase? >> >>>>>>>>>>> >> >>>>>>>>>>> St.Ack >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> On Thu, Jun 30, 2011 at 3:58 AM, Eran Kutner < >> >>> e...@gigya.com> >> >>>>>>> wrote: >> >>>>>>>>>>>> Hi, >> >>>>>>>>>>>> I have a cluster of 5 nodes with one large table that >> >>>>> currently >> >>>>>>> has >> >>>>>>>>>>> around >> >>>>>>>>>>>> 12000 regions. Everything was working fine for >> >> relatively >> >>>> long >> >>>>>>> time, >> >>>>>>>>>>> until >> >>>>>>>>>>>> now. >> >>>>>>>>>>>> Yesterday I significantly reduced the TTL on the table >> >> and >> >>>>>>> initiated >> >>>>>>>>>>> major >> >>>>>>>>>>>> compaction. This should have reduced the table size to >> >>> about >> >>>>> 20% >> >>>>>>> of >> >>>>>>>>> its >> >>>>>>>>>>>> original size. >> >>>>>>>>>>>> Today, I'm getting errors of inaccessible files on HDFS, >> >>> for >> >>>>>>>> example: >> >>>>>>>>>>>> java.io.IOException: Got error in response to >> >>> OP_READ_BLOCK >> >>>>>> self=/ >> >>>>>>>>>>>> 10.1.104.2:58047, remote=/10.1.104.2:50010 for file >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 >> >>>>>>>>>>>> for block 3674866614142268536_674205 >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>> >> >>>>>>> >> >>>>> >> >>> >> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948) >> >>>>>>>>>>>> at >> >>>>> java.io.DataInputStream.read(DataInputStream.java:132) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105) >> >>>>>>>>>>>> at >> >>>>>>>>> >> >> java.io.BufferedInputStream.fill(BufferedInputStream.java:218) >> >>>>>>>>>>>> at >> >>>>>>>>> >> >> java.io.BufferedInputStream.read(BufferedInputStream.java:237) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:128) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:68) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) >> >>>>>>>>>>>> at >> >>>>>>>>> >> >> java.io.BufferedInputStream.read1(BufferedInputStream.java:256) >> >>>>>>>>>>>> at >> >>>>>>>>> >> >> java.io.BufferedInputStream.read(BufferedInputStream.java:317) >> >>>>>>>>>>>> at >> >>>>>> org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>> >> >>>>>>> >> >>>>> >> >>> >> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>> >> >>>>>>> >> >>>>> >> >>> >> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1433) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:139) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>> >> >>>>> >> >> org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1341) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2269) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1126) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1118) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1102) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1781) >> >>>>>>>>>>>> at >> >>>> sun.reflect.GeneratedMethodAccessor46.invoke(Unknown >> >>>>>>>> Source) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> >>>>>>>>>>>> at >> >> java.lang.reflect.Method.invoke(Method.java:597) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>> >> >> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) >> >>>>>>>>>>>> at >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039) >> >>>>>>>>>>>> >> >>>>>>>>>>>> I checked and the file, indeed doesn't exist on HDFS, >> >> here >> >>>> is >> >>>>>> the >> >>>>>>>> name >> >>>>>>>>>>> node >> >>>>>>>>>>>> logs for this block, apparently because it was deleted: >> >>>>>>>>>>>> 2011-06-19 21:39:36,651 INFO >> >>>>> org.apache.hadoop.hdfs.StateChange: >> >>>>>>>>> BLOCK* >> >>>>>>>>>>>> NameSystem.allocateBlock: >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp/2096863423111131624. >> >>>>>>>>>>>> blk_3674866614142268536_674205 >> >>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO >> >>>>> org.apache.hadoop.hdfs.StateChange: >> >>>>>>>>> BLOCK* >> >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated: >> >>>> 10.1.104.2:50010 >> >>>>> is >> >>>>>>>>> added >> >>>>>>>>>>> to >> >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864 >> >>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO >> >>>>> org.apache.hadoop.hdfs.StateChange: >> >>>>>>>>> BLOCK* >> >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated: >> >>>> 10.1.104.3:50010 >> >>>>> is >> >>>>>>>>> added >> >>>>>>>>>>> to >> >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864 >> >>>>>>>>>>>> 2011-06-19 21:40:11,955 INFO >> >>>>> org.apache.hadoop.hdfs.StateChange: >> >>>>>>>>> BLOCK* >> >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated: >> >>>> 10.1.104.5:50010 >> >>>>> is >> >>>>>>>>> added >> >>>>>>>>>>> to >> >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864 >> >>>>>>>>>>>> 2011-06-29 20:20:01,662 INFO >> >>>>> org.apache.hadoop.hdfs.StateChange: >> >>>>>>>>> BLOCK* >> >>>>>>>>>>> ask >> >>>>>>>>>>>> 10.1.104.2:50010 to delete >> >>> blk_3674866614142268536_674205 >> >>>>>>>>>>>> 2011-06-29 20:20:13,671 INFO >> >>>>> org.apache.hadoop.hdfs.StateChange: >> >>>>>>>>> BLOCK* >> >>>>>>>>>>> ask >> >>>>>>>>>>>> 10.1.104.5:50010 to delete >> >>> blk_-4056387895369608597_675174 >> >>>>>>>>>>>> blk_-5017882805850873821_672281 >> >>>> blk_702373987100607684_672288 >> >>>>>>>>>>>> blk_-5357157478043290010_668506 >> >>>> blk_7118175133735412789_674903 >> >>>>>>>>>>>> blk_-3569812563715986384_675231 >> >>>> blk_8296855057240604851_669285 >> >>>>>>>>>>>> blk_-6483679172530609101_674268 >> >>>> blk_8738539715363739108_673682 >> >>>>>>>>>>>> blk_1744841904626813502_675238 >> >>>> blk_-6035315106100051103_674266 >> >>>>>>>>>>>> blk_-1789501623010070237_674908 >> >>>> blk_1944054629336265129_673689 >> >>>>>>>>>>>> blk_3674866614142268536_674205 >> >>>> blk_7930425446738143892_647410 >> >>>>>>>>>>>> blk_-3007186753042268449_669296 >> >>>>> blk_-5482302621772778061_647416 >> >>>>>>>>>>>> blk_-3765735404924932181_672004 >> >>>> blk_7476090998956811081_675169 >> >>>>>>>>>>>> blk_7862291659285127712_646890 >> >>>> blk_-2666244746343584727_672013 >> >>>>>>>>>>>> blk_6039172613960915602_674206 >> >>>> blk_-8470884397893086564_646899 >> >>>>>>>>>>>> blk_4558230221166712802_668510 >> >>>>>>>>>>>> 2011-06-29 20:20:46,698 INFO >> >>>>> org.apache.hadoop.hdfs.StateChange: >> >>>>>>>>> BLOCK* >> >>>>>>>>>>> ask >> >>>>>>>>>>>> 10.1.104.3:50010 to delete >> >>> blk_-7851606440036350812_671552 >> >>>>>>>>>>>> blk_9214649160203453845_647566 >> >>> blk_702373987100607684_672288 >> >>>>>>>>>>>> blk_5958099369749234073_668143 >> >>>> blk_-5172218034084903173_673109 >> >>>>>>>>>>>> blk_-2934555181472719276_646476 >> >>>>> blk_-1409986679370073931_672552 >> >>>>>>>>>>>> blk_-2786034325506235869_669086 >> >>>> blk_3674866614142268536_674205 >> >>>>>>>>>>>> blk_510158930393283118_673225 >> >>> blk_916244738216205237_677068 >> >>>>>>>>>>>> blk_-4317027806407316617_670379 >> >>>> blk_8555705688850972639_673485 >> >>>>>>>>>>>> blk_-3765735404924932181_672004 >> >>>>> blk_-5482302621772778061_647416 >> >>>>>>>>>>>> blk_-2461801145731752623_674605 >> >>>>> blk_-8737702908048998927_672549 >> >>>>>>>>>>>> blk_-8470884397893086564_646899 >> >>>> blk_4558230221166712802_668510 >> >>>>>>>>>>>> blk_-4056387895369608597_675174 >> >>>>> blk_-8675430610673886073_647695 >> >>>>>>>>>>>> blk_-6642870230256028318_668211 >> >>>>> blk_-3890408516362176771_677483 >> >>>>>>>>>>>> blk_-3569812563715986384_675231 >> >>>>> blk_-5007142629771321873_674548 >> >>>>>>>>>>>> blk_-3345355191863431669_667066 >> >>>> blk_8296855057240604851_669285 >> >>>>>>>>>>>> blk_-6595462308187757470_672420 >> >>>>> blk_-2583945228783203947_674607 >> >>>>>>>>>>>> blk_-346988625120916345_677063 >> >>>> blk_4449525876338684218_674496 >> >>>>>>>>>>>> blk_2617172363857549730_668201 >> >>>> blk_8738539715363739108_673682 >> >>>>>>>>>>>> blk_-208904675456598428_679286 >> >>>> blk_-497549341281882641_646477 >> >>>>>>>>>>>> blk_-6035315106100051103_674266 >> >>>>> blk_-2356539038067297411_672388 >> >>>>>>>>>>>> blk_-3881703084497103249_668137 >> >>>> blk_2214397881104950315_646643 >> >>>>>>>>>>>> blk_-5907671443455357710_673223 >> >>>>> blk_-2431880309956605679_669204 >> >>>>>>>>>>>> blk_6039172613960915602_674206 >> >>>> blk_5053643911633142711_669194 >> >>>>>>>>>>>> blk_-2636977729205236686_674664 >> >>>>>>>>>>>> >> >>>>>>>>>>>> I assume the file loss is somehow related to this change >> >>> and >> >>>>> the >> >>>>>>>> major >> >>>>>>>>>>>> compaction that followed because the same scan that is >> >>>> failing >> >>>>>> now >> >>>>>>>> was >> >>>>>>>>>>>> working fine yesterday and that is the only changed that >> >>>>>> happened >> >>>>>>> on >> >>>>>>>>> the >> >>>>>>>>>>>> cluster. Any suggestions what to do now? >> >>>>>>>>>>>> >> >>>>>>>>>>>> Thanks. >> >>>>>>>>>>>> >> >>>>>>>>>>>> -eran >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> > >