Appreciate it, sorry I didn't get to it sooner. Had some crazy days :) -eran
On Tue, Jul 5, 2011 at 17:19, Ted Yu <[email protected]> wrote: > Eran: > I logged https://issues.apache.org/jira/browse/HBASE-4060 for you. > > On Mon, Jul 4, 2011 at 2:30 AM, Ted Yu <[email protected]> wrote: > > > Thanks for the understanding. > > > > Can you log a JIRA and put your ideas below in it ? > > > > > > > > On Jul 4, 2011, at 12:42 AM, Eran Kutner <[email protected]> wrote: > > > > > Thanks for the explanation Ted, > > > > > > I will try to apply HBASE-3789 and hope for the best but my > understanding > > is > > > that it doesn't really solve the problem, it only reduces the > probability > > of > > > it happening, at least in one particular scenario. I would hope for a > > more > > > robust solution. > > > My concern is that the region allocation process seems to rely too much > > on > > > timing considerations and doesn't seem to take enough measures to > > guarantee > > > conflicts do not occur. I understand that in a distributed environment, > > when > > > you don't get a timely response from a remote machine you can't know > for > > > sure if it did or did not receive the request, however there are things > > that > > > can be done to mitigate this and reduce the conflict time > significantly. > > For > > > example, when I run dbck it knows that some regions are multiply > > assigned, > > > the master could do the same and try to resolve the conflict. Another > > > approach would be to handle late responses, even if the response from > the > > > remote machine arrives after it was assumed to be dead the master > should > > > have enough information to know it had created a conflict by assigning > > the > > > region to another server. An even better solution, I think, is for the > RS > > to > > > periodically test that it is indeed the rightful owner of every region > it > > > holds and relinquish control over the region if it's not. > > > Obviously a state where two RSs hold the same region is pathological > and > > can > > > lead to data loss, as demonstrated in my case. The system should be > able > > to > > > actively protect itself against such a scenario. It probably doesn't > need > > > saying but there is really nothing worse for a data storage system than > > data > > > loss. > > > > > > In my case the problem didn't happen in the initial phase but after > > > disabling and enabling a table with about 12K regions. > > > > > > -eran > > > > > > > > > > > > On Sun, Jul 3, 2011 at 23:49, Ted Yu <[email protected]> wrote: > > > > > >> Let me try to answer some of your questions. > > >> The two paragraphs below were written along my reasoning which is in > > >> reverse > > >> order of the actual call sequence. > > >> > > >> For #4 below, the log indicates that the following was executed: > > >> private void assign(final RegionState state, final boolean > > setOfflineInZK, > > >> final boolean forceNewPlan) { > > >> for (int i = 0; i < this.maximumAssignmentAttempts; i++) { > > >> if (setOfflineInZK && !*setOfflineInZooKeeper*(state)) return; > > >> > > >> The above was due to the timeout which you noted in #2 which would > have > > >> caused > > >> TimeoutMonitor.chore() to run this code (line 1787) > > >> > > >> for (Map.Entry<HRegionInfo, Boolean> e: assigns.entrySet()){ > > >> assign(e.getKey(), false, e.getValue()); > > >> } > > >> > > >> This means there is lack of coordination between > > >> assignmentManager.TimeoutMonitor and OpenedRegionHandler > > >> > > >> The reason I mention HBASE-3789 is that it is marked as Incompatible > > change > > >> and is in TRUNK already. > > >> The application of HBASE-3789 to 0.90 branch would change the behavior > > >> (timing) of region assignment. > > >> > > >> I think it makes sense to evaluate the effect of HBASE-3789 in 0.90.4 > > >> > > >> BTW were the incorrect region assignments observed for a table with > > >> multiple > > >> initial regions ? > > >> If so, I have HBASE-4010 in TRUNK which speeds up initial region > > assignment > > >> by about 50%. > > >> > > >> Cheers > > >> > > >> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <[email protected]> wrote: > > >> > > >>> Ted, > > >>> So if I understand correctly the the theory is that because of the > > issue > > >>> fixed in HBASE-3789 the master took too long to detect that the > region > > >> was > > >>> successfully opened by the first server so it forced closed it and > > >>> transitioned to a second server, but there are a few things about > this > > >>> scenario I don't understand, probably because I don't know enough > about > > >> the > > >>> inner workings of the region transition process and would appreciate > it > > >> if > > >>> you can help me understand: > > >>> 1. The RS opened the region at 16:37:49. > > >>> 2. The master started handling the opened event at 16:39:54 - this > > delay > > >>> can > > >>> probably be explained by HBASE-3789 > > >>> 3. At 16:39:54 the master log says: Opened region gs_raw_events,..... > > on > > >>> hadoop1-s05.farm-ny.gigya.com > > >>> 4. Then at 16:40:00 the master log says: > master:60000-0x13004a31d7804c4 > > >>> Creating (or updating) unassigned node for > > 584dac5cc70d8682f71c4675a843c3 > > >>> 09 with OFFLINE state - why did it decide to take the region offline > > >> after > > >>> learning it was successfully opened? > > >>> 5. Then it tries to reopen the region on hadoop1-s05, which indicates > > in > > >>> its > > >>> log that the open request failed because the region was already open > - > > >> why > > >>> didn't the master use that information to learn that the region was > > >> already > > >>> open? > > >>> 6. At 16:43:57 the master decides the region transition timed out and > > >>> starts > > >>> forcing the transition - HBASE-3789 again? > > >>> 7. Now the master forces the transition of the region to hadoop1-s02 > > but > > >>> there is no sign of that on hadoop1-s05 - why doesn't the old RS > > >>> (hadoop1-s05) detect that it is no longer the master and relinquishes > > >>> control of the region? > > >>> > > >>> Thanks. > > >>> > > >>> -eran > > >>> > > >>> > > >>> > > >>> On Sun, Jul 3, 2011 at 20:09, Ted Yu <[email protected]> wrote: > > >>> > > >>>> HBASE-3789 should have sped up region assignment. > > >>>> The patch for 0.90 is attached to that JIRA. > > >>>> > > >>>> You may prudently apply that patch. > > >>>> > > >>>> Regards > > >>>> > > >>>> On Sun, Jul 3, 2011 at 10:01 AM, Eran Kutner <[email protected]> > wrote: > > >>>> > > >>>>> Thanks Ted, but, as stated before, I'm already using 0.90.3, so > > >> either > > >>>> it's > > >>>>> not fixed or it's not the same thing. > > >>>>> > > >>>>> -eran > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Sun, Jul 3, 2011 at 17:27, Ted Yu <[email protected]> wrote: > > >>>>> > > >>>>>> Eran: > > >>>>>> I was thinking of this: > > >>>>>> HBASE-3789 Cleanup the locking contention in the master > > >>>>>> > > >>>>>> though it doesn't directly handle 'PENDING_OPEN for too long' > case. > > >>>>>> > > >>>>>> https://issues.apache.org/jira/browse/HBASE-3741 is in 0.90.3 and > > >>>>> actually > > >>>>>> close to the symptom you described. > > >>>>>> > > >>>>>> On Sun, Jul 3, 2011 at 12:00 AM, Eran Kutner <[email protected]> > > >> wrote: > > >>>>>> > > >>>>>>> It does seem that both servers opened the same region around the > > >>> same > > >>>>>> time. > > >>>>>>> The region was offline because I disabled the table so I can > > >> change > > >>>> its > > >>>>>>> TTL. > > >>>>>>> > > >>>>>>> Here is the log from haddop1-s05 : > > >>>>>>> 2011-06-29 16:37:12,576 INFO > > >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received > > >>> request > > >>>> to > > >>>>>>> open > > >>>>>>> region: > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:37:12,680 DEBUG > > >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > >>>>>> Processing > > >>>>>>> open of > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:37:12,680 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:37:12,711 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:37:12,711 DEBUG > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> Opening region: REGION => {NAME => > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', > > >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => > > >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED => > > >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => > > >>> 'gs_raw_events', > > >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', > > >>>>> REPLICATION_SCOPE > > >>>>>> => > > >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', > > >>>> BLOCKSIZE > > >>>>> => > > >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} > > >>>>>>> 2011-06-29 16:37:12,711 DEBUG > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> Instantiated > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:37:12,847 DEBUG > > >>>>> org.apache.hadoop.hbase.regionserver.Store: > > >>>>>>> loaded > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, > > >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062, > > >>>>>>> majorCompaction=false > > >>>>>>> 2011-06-29 16:37:12,848 INFO > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> Onlined > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; > > >>>>>>> next sequenceid=1162228063 > > >>>>>>> 2011-06-29 16:37:12,849 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:37:12,875 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:37:12,951 INFO > > >>>>> org.apache.hadoop.hbase.catalog.MetaEditor: > > >>>>>>> Updated row > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> in region .META.,,1 with server= > > >>> hadoop1-s05.farm-ny.gigya.com:60020, > > >>>>>>> startcode=1307349217076 > > >>>>>>> 2011-06-29 16:37:12,951 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENED > > >>>>>>> 2011-06-29 16:37:12,964 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENED > > >>>>>>> 2011-06-29 16:37:12,964 DEBUG > > >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > >>>> Opened > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:40:00,878 INFO > > >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received > > >>> request > > >>>> to > > >>>>>>> open > > >>>>>>> region: > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:40:00,878 DEBUG > > >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > >>>>>> Processing > > >>>>>>> open of > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:40:01,079 WARN > > >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > >>>>> Attempted > > >>>>>>> open of > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> but already online on this server > > >>>>>>> 2011-06-29 16:43:50,395 DEBUG > > >>>>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: > > >> Compaction > > >>>>>> (major) > > >>>>>>> requested for > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> because User-triggered major compaction; priority=1, compaction > > >>> queue > > >>>>>>> size=1248 > > >>>>>>> 2011-06-29 20:19:49,906 INFO > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> Starting major compaction on region > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 20:19:49,906 INFO > > >>>>> org.apache.hadoop.hbase.regionserver.Store: > > >>>>>>> Started compaction of 1 file(s) in cf=events into > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp, > > >>>>>>> seqid=1162228062, totalSize=98.3m > > >>>>>>> 2011-06-29 20:19:49,906 DEBUG > > >>>>> org.apache.hadoop.hbase.regionserver.Store: > > >>>>>>> Compacting > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, > > >>>>>>> keycount=6882816, bloomtype=NONE, size=98.3m > > >>>>>>> 2011-06-29 20:19:59,920 INFO > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> completed compaction on region > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> after 10sec > > >>>>>>> > > >>>>>>> And here is the one from hadoop1-s02: > > >>>>>>> 2011-06-29 16:43:57,935 INFO > > >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received > > >>> request > > >>>> to > > >>>>>>> open > > >>>>>>> region: > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:43:58,990 DEBUG > > >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > >>>>>> Processing > > >>>>>>> open of > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:43:58,990 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:43:59,002 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:43:59,002 DEBUG > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> Opening region: REGION => {NAME => > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', > > >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => > > >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED => > > >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => > > >>> 'gs_raw_events', > > >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', > > >>>>> REPLICATION_SCOPE > > >>>>>> => > > >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', > > >>>> BLOCKSIZE > > >>>>> => > > >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} > > >>>>>>> 2011-06-29 16:43:59,003 DEBUG > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> Instantiated > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> 2011-06-29 16:43:59,204 DEBUG > > >>>>> org.apache.hadoop.hbase.regionserver.Store: > > >>>>>>> loaded > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, > > >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062, > > >>>>>>> majorCompaction=false > > >>>>>>> 2011-06-29 16:43:59,205 INFO > > >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: > > >>>>>>> Onlined > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; > > >>>>>>> next sequenceid=1162228063 > > >>>>>>> 2011-06-29 16:43:59,205 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:43:59,212 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENING > > >>>>>>> 2011-06-29 16:43:59,214 INFO > > >>>>> org.apache.hadoop.hbase.catalog.MetaEditor: > > >>>>>>> Updated row > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> in region .META.,,1 with server= > > >>> hadoop1-s02.farm-ny.gigya.com:60020, > > >>>>>>> startcode=1306919627544 > > >>>>>>> 2011-06-29 16:43:59,214 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENED > > >>>>>>> 2011-06-29 16:43:59,224 DEBUG > > >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned > > >> node > > >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > >>>>>>> RS_ZK_REGION_OPENED > > >>>>>>> 2011-06-29 16:43:59,224 DEBUG > > >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > >>>> Opened > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>> java.io.IOException: Got error in response to OP_READ_BLOCK > > >> self=/ > > >>>>>>> 10.1.104.2:33356, remote=/10.1.104.2:50010 for file > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 > > >>>>>>> for block 3674866614142268536_674205 > > >>>>>>> > > >>>>>>> > > >>>>>>> Ted, can you please point me to J-D's bug fix you mentioned? Are > > >>> you > > >>>>>>> positive it's the same scenario - data loss is a very serious > > >>> problem > > >>>>> for > > >>>>>> a > > >>>>>>> DB. > > >>>>>>> I'd really like to apply that patch ASAP, because when I run hbck > > >> I > > >>>> get > > >>>>>>> over > > >>>>>>> 400 regions which are multiply assigned. > > >>>>>>> Last question, I understand the region's data is lost but is > > >> there > > >>> a > > >>>>> way > > >>>>>> to > > >>>>>>> at least make the table consistent again by some how removing the > > >>>> lost > > >>>>>>> region? > > >>>>>>> > > >>>>>>> Thanks. > > >>>>>>> > > >>>>>>> -eran > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> On Sat, Jul 2, 2011 at 01:46, Ted Yu <[email protected]> > > >> wrote: > > >>>>>>> > > >>>>>>>>>> 2011-06-29 16:43:57,880 INFO > > >>>>>>>> org.apache.hadoop.hbase. > > >>>>>>>> master.AssignmentManager: Region has been > > >>>>>>>> PENDING_OPEN for too long, reassigning > > >>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>> > > >>>>>>>> The double assignment should have been fixed by J-D's recent > > >>>> checkin. > > >>>>>>>> > > >>>>>>>> On Fri, Jul 1, 2011 at 3:14 PM, Stack <[email protected]> > > >> wrote: > > >>>>>>>> > > >>>>>>>>> Is > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>> the region that was having the issue? If so, if you looked > > >> in > > >>>>>>>>> hadoop1-s05's logs, was this region opened around 2011-06-29 > > >>>>>> 16:43:57? > > >>>>>>>>> Was it also opened hadoop1-s02 not long after? Did you say > > >>> what > > >>>>>>>>> version of hbase you are on? > > >>>>>>>>> > > >>>>>>>>> St.Ack > > >>>>>>>>> > > >>>>>>>>> On Fri, Jul 1, 2011 at 5:08 AM, Eran Kutner <[email protected]> > > >>>>> wrote: > > >>>>>>>>>> Hi Stack, > > >>>>>>>>>> I'm not sure what the log means. I do see references to two > > >>>>>> different > > >>>>>>>>>> servers, but that would probably happen if there was normal > > >>>>>>> transition > > >>>>>>>> I > > >>>>>>>>>> assume. I'm using version 0.90.3 > > >>>>>>>>>> Here are the relevant lines from the master logs: > > >>>>>>>>>> > > >>>>>>>>>> 2011-06-19 21:39:37,164 INFO > > >>>>>>>>> org.apache.hadoop.hbase.master.ServerManager: > > >>>>>>>>>> Received REGION_SPLIT: > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.: > > >>>>>>>>>> Daughters; > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518810_1249_WEB204,1308533970928.46f876a4e97be04edb35eb8f8959d482. > > >>>>>>>>>> from hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > >>>>>>>>>> 2011-06-19 21:43:12,983 INFO > > >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor: > > >>>>>>>>>> Deleted daughter reference > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > >>>>>>>>>> qualifier=splitA, from parent > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658. > > >>>>>>>>>> 2011-06-29 16:29:36,143 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Starting > > >>>>>>> unassignment > > >>>>>>>>> of > > >>>>>>>>>> region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> (offlining) > > >>>>>>>>>> 2011-06-29 16:29:36,146 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Sent > > >> CLOSE > > >>> to > > >>>>>>>>> serverName= > > >>>>>>>>>> hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > >>>>>> load=(requests=0, > > >>>>>>>>>> regions=1654, usedHeap=1870, maxHeap=12483) for region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling > > >>> new > > >>>>>>>> unassigned > > >>>>>>>>>> node: /hbase/unassigned/584dac5cc70d8682f71c4675a843c309 > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > (region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > >>>>>>>>>> state=RS_ZK_REGION_CLOSED) > > >>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling > > >>>>>>>>>> transition=RS_ZK_REGION_CLOSED, > > >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 > > >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: > > >>>>>> Handling > > >>>>>>>>> CLOSED > > >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309 > > >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Table > > >> being > > >>>>>>> disabled > > >>>>>>>> so > > >>>>>>>>>> deleting ZK node and removing from regions in transition, > > >>>>> skipping > > >>>>>>>>>> assignment of region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned > > >>>> node > > >>>>>> for > > >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state > > >>>>>>>>>> RS_ZK_REGION_CLOSED > > >>>>>>>>>> 2011-06-29 16:30:53,801 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted > > >>> unassigned > > >>>>> node > > >>>>>>> for > > >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state > > >>>>>>>>>> RS_ZK_REGION_CLOSED > > >>>>>>>>>> 2011-06-29 16:34:01,453 INFO > > >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor: > > >>>>>>>>>> Updated region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> in META > > >>>>>>>>>> 2011-06-29 16:37:12,247 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating) > > >>>> unassigned > > >>>>>> node > > >>>>>>>> for > > >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state > > >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No > > >> previous > > >>>>>>>> transition > > >>>>>>>>>> plan was found (or we are ignoring an existing plan) for > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> so generated a random one; > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com > > >>> ,60020,1307349217076; > > >>>> 5 > > >>>>>>>>> (online=5, > > >>>>>>>>>> exclude=null) available servers > > >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning > > >>>>> region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > >>>>>>>>>> 2011-06-29 16:37:13,102 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling > > >>>>>>>>>> transition=RS_ZK_REGION_OPENED, > > >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 > > >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > >>>>>> Handling > > >>>>>>>>> OPENED > > >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309; deleting > > >>> unassigned > > >>>>>> node > > >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned > > >>>> node > > >>>>>> for > > >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state > > >>>>>>>>>> RS_ZK_REGION_OPENED > > >>>>>>>>>> 2011-06-29 16:39:54,192 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted > > >>> unassigned > > >>>>> node > > >>>>>>> for > > >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state > > >>>>>>>>>> RS_ZK_REGION_OPENED > > >>>>>>>>>> 2011-06-29 16:39:54,326 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > >>>>> Opened > > >>>>>>>> region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > >>>>>>>>>> 2011-06-29 16:40:00,598 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating) > > >>>> unassigned > > >>>>>> node > > >>>>>>>> for > > >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state > > >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No > > >> previous > > >>>>>>>> transition > > >>>>>>>>>> plan was found (or we are ignoring an existing plan) for > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> so generated a random one; > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com > > >>> ,60020,1307349217076; > > >>>> 5 > > >>>>>>>>> (online=5, > > >>>>>>>>>> exclude=null) available servers > > >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning > > >>>>> region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > >>>>>>>>>> 2011-06-29 16:43:57,879 INFO > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Regions > > >> in > > >>>>>>> transition > > >>>>>>>>>> timed out: > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> state=PENDING_OPEN, ts=1309380052723 > > >>>>>>>>>> 2011-06-29 16:43:57,880 INFO > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Region > > >> has > > >>>> been > > >>>>>>>>>> PENDING_OPEN for too long, reassigning > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Forcing > > >>>>> OFFLINE; > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > was=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> state=PENDING_OPEN, ts=1309380052723 > > >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No > > >> previous > > >>>>>>>> transition > > >>>>>>>>>> plan was found (or we are ignoring an existing plan) for > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> so generated a random one; > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > >>>>>>>>>> src=, dest=hadoop1-s02.farm-ny.gigya.com > > >>> ,60020,1306919627544; > > >>>> 5 > > >>>>>>>>> (online=5, > > >>>>>>>>>> exclude=null) available servers > > >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning > > >>>>> region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> to hadoop1-s02.farm-ny.gigya.com,60020,1306919627544 > > >>>>>>>>>> 2011-06-29 16:43:59,022 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling > > >>>>>>>>>> transition=RS_ZK_REGION_OPENING, > > >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, > > >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 > > >>>>>>>>>> 2011-06-29 16:43:59,221 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling > > >>>>>>>>>> transition=RS_ZK_REGION_OPENING, > > >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, > > >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 > > >>>>>>>>>> 2011-06-29 16:43:59,226 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling > > >>>>>>>>>> transition=RS_ZK_REGION_OPENED, > > >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, > > >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 > > >>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > >>>>>> Handling > > >>>>>>>>> OPENED > > >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309; deleting > > >>> unassigned > > >>>>>> node > > >>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned > > >>>> node > > >>>>>> for > > >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state > > >>>>>>>>>> RS_ZK_REGION_OPENED > > >>>>>>>>>> 2011-06-29 16:43:59,296 DEBUG > > >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: > > >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted > > >>> unassigned > > >>>>> node > > >>>>>>> for > > >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state > > >>>>>>>>>> RS_ZK_REGION_OPENED > > >>>>>>>>>> 2011-06-29 16:43:59,375 WARN > > >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: > > >> Overwriting > > >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 on > > >>>>>>>>>> serverName=hadoop1-s05.farm-ny.gigya.com > > >>> ,60020,1307349217076, > > >>>>>>>>>> load=(requests=0, regions=1273, usedHeap=2676, > > >> maxHeap=12483) > > >>>>>>>>>> 2011-06-29 16:43:59,375 DEBUG > > >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > >>>>> Opened > > >>>>>>>> region > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > >>>>>>>>>> on hadoop1-s02.farm-ny.gigya.com,60020,1306919627544 > > >>>>>>>>>> > > >>>>>>>>>> Thanks. > > >>>>>>>>>> > > >>>>>>>>>> -eran > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 1, 2011 at 09:05, Stack <[email protected]> > > >>> wrote: > > >>>>>>>>>> > > >>>>>>>>>>> So, Eran, it seems as though two RegionServers were > > >> carrying > > >>>> the > > >>>>>>>>>>> region? One deleted a file (compaction on its side)? Can > > >>> you > > >>>>>>> figure > > >>>>>>>>>>> if indeed two servers had same region? (Check master logs > > >>> for > > >>>>>> this > > >>>>>>>>>>> regions assignments). > > >>>>>>>>>>> > > >>>>>>>>>>> What version of hbase? > > >>>>>>>>>>> > > >>>>>>>>>>> St.Ack > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> On Thu, Jun 30, 2011 at 3:58 AM, Eran Kutner < > > >>> [email protected]> > > >>>>>>> wrote: > > >>>>>>>>>>>> Hi, > > >>>>>>>>>>>> I have a cluster of 5 nodes with one large table that > > >>>>> currently > > >>>>>>> has > > >>>>>>>>>>> around > > >>>>>>>>>>>> 12000 regions. Everything was working fine for > > >> relatively > > >>>> long > > >>>>>>> time, > > >>>>>>>>>>> until > > >>>>>>>>>>>> now. > > >>>>>>>>>>>> Yesterday I significantly reduced the TTL on the table > > >> and > > >>>>>>> initiated > > >>>>>>>>>>> major > > >>>>>>>>>>>> compaction. This should have reduced the table size to > > >>> about > > >>>>> 20% > > >>>>>>> of > > >>>>>>>>> its > > >>>>>>>>>>>> original size. > > >>>>>>>>>>>> Today, I'm getting errors of inaccessible files on HDFS, > > >>> for > > >>>>>>>> example: > > >>>>>>>>>>>> java.io.IOException: Got error in response to > > >>> OP_READ_BLOCK > > >>>>>> self=/ > > >>>>>>>>>>>> 10.1.104.2:58047, remote=/10.1.104.2:50010 for file > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 > > >>>>>>>>>>>> for block 3674866614142268536_674205 > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>> > > >>> > > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948) > > >>>>>>>>>>>> at > > >>>>> java.io.DataInputStream.read(DataInputStream.java:132) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105) > > >>>>>>>>>>>> at > > >>>>>>>>> > > >> java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > > >>>>>>>>>>>> at > > >>>>>>>>> > > >> java.io.BufferedInputStream.read(BufferedInputStream.java:237) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:128) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:68) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) > > >>>>>>>>>>>> at > > >>>>>>>>> > > >> java.io.BufferedInputStream.read1(BufferedInputStream.java:256) > > >>>>>>>>>>>> at > > >>>>>>>>> > > >> java.io.BufferedInputStream.read(BufferedInputStream.java:317) > > >>>>>>>>>>>> at > > >>>>>> org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>> > > >>> > > org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>> > > >>> > > org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1433) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:139) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>> > > >>>>> > > >> org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1341) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2269) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1126) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1118) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1102) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1781) > > >>>>>>>>>>>> at > > >>>> sun.reflect.GeneratedMethodAccessor46.invoke(Unknown > > >>>>>>>> Source) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > >>>>>>>>>>>> at > > >> java.lang.reflect.Method.invoke(Method.java:597) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>> > > >> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) > > >>>>>>>>>>>> at > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039) > > >>>>>>>>>>>> > > >>>>>>>>>>>> I checked and the file, indeed doesn't exist on HDFS, > > >> here > > >>>> is > > >>>>>> the > > >>>>>>>> name > > >>>>>>>>>>> node > > >>>>>>>>>>>> logs for this block, apparently because it was deleted: > > >>>>>>>>>>>> 2011-06-19 21:39:36,651 INFO > > >>>>> org.apache.hadoop.hdfs.StateChange: > > >>>>>>>>> BLOCK* > > >>>>>>>>>>>> NameSystem.allocateBlock: > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp/2096863423111131624. > > >>>>>>>>>>>> blk_3674866614142268536_674205 > > >>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO > > >>>>> org.apache.hadoop.hdfs.StateChange: > > >>>>>>>>> BLOCK* > > >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated: > > >>>> 10.1.104.2:50010 > > >>>>> is > > >>>>>>>>> added > > >>>>>>>>>>> to > > >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864 > > >>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO > > >>>>> org.apache.hadoop.hdfs.StateChange: > > >>>>>>>>> BLOCK* > > >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated: > > >>>> 10.1.104.3:50010 > > >>>>> is > > >>>>>>>>> added > > >>>>>>>>>>> to > > >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864 > > >>>>>>>>>>>> 2011-06-19 21:40:11,955 INFO > > >>>>> org.apache.hadoop.hdfs.StateChange: > > >>>>>>>>> BLOCK* > > >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated: > > >>>> 10.1.104.5:50010 > > >>>>> is > > >>>>>>>>> added > > >>>>>>>>>>> to > > >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864 > > >>>>>>>>>>>> 2011-06-29 20:20:01,662 INFO > > >>>>> org.apache.hadoop.hdfs.StateChange: > > >>>>>>>>> BLOCK* > > >>>>>>>>>>> ask > > >>>>>>>>>>>> 10.1.104.2:50010 to delete > > >>> blk_3674866614142268536_674205 > > >>>>>>>>>>>> 2011-06-29 20:20:13,671 INFO > > >>>>> org.apache.hadoop.hdfs.StateChange: > > >>>>>>>>> BLOCK* > > >>>>>>>>>>> ask > > >>>>>>>>>>>> 10.1.104.5:50010 to delete > > >>> blk_-4056387895369608597_675174 > > >>>>>>>>>>>> blk_-5017882805850873821_672281 > > >>>> blk_702373987100607684_672288 > > >>>>>>>>>>>> blk_-5357157478043290010_668506 > > >>>> blk_7118175133735412789_674903 > > >>>>>>>>>>>> blk_-3569812563715986384_675231 > > >>>> blk_8296855057240604851_669285 > > >>>>>>>>>>>> blk_-6483679172530609101_674268 > > >>>> blk_8738539715363739108_673682 > > >>>>>>>>>>>> blk_1744841904626813502_675238 > > >>>> blk_-6035315106100051103_674266 > > >>>>>>>>>>>> blk_-1789501623010070237_674908 > > >>>> blk_1944054629336265129_673689 > > >>>>>>>>>>>> blk_3674866614142268536_674205 > > >>>> blk_7930425446738143892_647410 > > >>>>>>>>>>>> blk_-3007186753042268449_669296 > > >>>>> blk_-5482302621772778061_647416 > > >>>>>>>>>>>> blk_-3765735404924932181_672004 > > >>>> blk_7476090998956811081_675169 > > >>>>>>>>>>>> blk_7862291659285127712_646890 > > >>>> blk_-2666244746343584727_672013 > > >>>>>>>>>>>> blk_6039172613960915602_674206 > > >>>> blk_-8470884397893086564_646899 > > >>>>>>>>>>>> blk_4558230221166712802_668510 > > >>>>>>>>>>>> 2011-06-29 20:20:46,698 INFO > > >>>>> org.apache.hadoop.hdfs.StateChange: > > >>>>>>>>> BLOCK* > > >>>>>>>>>>> ask > > >>>>>>>>>>>> 10.1.104.3:50010 to delete > > >>> blk_-7851606440036350812_671552 > > >>>>>>>>>>>> blk_9214649160203453845_647566 > > >>> blk_702373987100607684_672288 > > >>>>>>>>>>>> blk_5958099369749234073_668143 > > >>>> blk_-5172218034084903173_673109 > > >>>>>>>>>>>> blk_-2934555181472719276_646476 > > >>>>> blk_-1409986679370073931_672552 > > >>>>>>>>>>>> blk_-2786034325506235869_669086 > > >>>> blk_3674866614142268536_674205 > > >>>>>>>>>>>> blk_510158930393283118_673225 > > >>> blk_916244738216205237_677068 > > >>>>>>>>>>>> blk_-4317027806407316617_670379 > > >>>> blk_8555705688850972639_673485 > > >>>>>>>>>>>> blk_-3765735404924932181_672004 > > >>>>> blk_-5482302621772778061_647416 > > >>>>>>>>>>>> blk_-2461801145731752623_674605 > > >>>>> blk_-8737702908048998927_672549 > > >>>>>>>>>>>> blk_-8470884397893086564_646899 > > >>>> blk_4558230221166712802_668510 > > >>>>>>>>>>>> blk_-4056387895369608597_675174 > > >>>>> blk_-8675430610673886073_647695 > > >>>>>>>>>>>> blk_-6642870230256028318_668211 > > >>>>> blk_-3890408516362176771_677483 > > >>>>>>>>>>>> blk_-3569812563715986384_675231 > > >>>>> blk_-5007142629771321873_674548 > > >>>>>>>>>>>> blk_-3345355191863431669_667066 > > >>>> blk_8296855057240604851_669285 > > >>>>>>>>>>>> blk_-6595462308187757470_672420 > > >>>>> blk_-2583945228783203947_674607 > > >>>>>>>>>>>> blk_-346988625120916345_677063 > > >>>> blk_4449525876338684218_674496 > > >>>>>>>>>>>> blk_2617172363857549730_668201 > > >>>> blk_8738539715363739108_673682 > > >>>>>>>>>>>> blk_-208904675456598428_679286 > > >>>> blk_-497549341281882641_646477 > > >>>>>>>>>>>> blk_-6035315106100051103_674266 > > >>>>> blk_-2356539038067297411_672388 > > >>>>>>>>>>>> blk_-3881703084497103249_668137 > > >>>> blk_2214397881104950315_646643 > > >>>>>>>>>>>> blk_-5907671443455357710_673223 > > >>>>> blk_-2431880309956605679_669204 > > >>>>>>>>>>>> blk_6039172613960915602_674206 > > >>>> blk_5053643911633142711_669194 > > >>>>>>>>>>>> blk_-2636977729205236686_674664 > > >>>>>>>>>>>> > > >>>>>>>>>>>> I assume the file loss is somehow related to this change > > >>> and > > >>>>> the > > >>>>>>>> major > > >>>>>>>>>>>> compaction that followed because the same scan that is > > >>>> failing > > >>>>>> now > > >>>>>>>> was > > >>>>>>>>>>>> working fine yesterday and that is the only changed that > > >>>>>> happened > > >>>>>>> on > > >>>>>>>>> the > > >>>>>>>>>>>> cluster. Any suggestions what to do now? > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks. > > >>>>>>>>>>>> > > >>>>>>>>>>>> -eran > > >>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > >
