Eran:
I logged https://issues.apache.org/jira/browse/HBASE-4060 for you.

On Mon, Jul 4, 2011 at 2:30 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Thanks for the understanding.
>
> Can you log a JIRA and put your ideas below in it ?
>
>
>
> On Jul 4, 2011, at 12:42 AM, Eran Kutner <e...@gigya.com> wrote:
>
> > Thanks for the explanation Ted,
> >
> > I will try to apply HBASE-3789 and hope for the best but my understanding
> is
> > that it doesn't really solve the problem, it only reduces the probability
> of
> > it happening, at least in one particular scenario. I would hope for a
> more
> > robust solution.
> > My concern is that the region allocation process seems to rely too much
> on
> > timing considerations and doesn't seem to take enough measures to
> guarantee
> > conflicts do not occur. I understand that in a distributed environment,
> when
> > you don't get a timely response from a remote machine you can't know for
> > sure if it did or did not receive the request, however there are things
> that
> > can be done to mitigate this and reduce the conflict time significantly.
> For
> > example, when I run dbck it knows that some regions are multiply
> assigned,
> > the master could do the same and try to resolve the conflict. Another
> > approach would be to handle late responses, even if the response from the
> > remote machine arrives after it was assumed to be dead the master should
> > have enough information to know it had created a conflict by assigning
> the
> > region to another server. An even better solution, I think, is for the RS
> to
> > periodically test that it is indeed the rightful owner of every region it
> > holds and relinquish control over the region if it's not.
> > Obviously a state where two RSs hold the same region is pathological and
> can
> > lead to data loss, as demonstrated in my case. The system should be able
> to
> > actively protect itself against such a scenario. It probably doesn't need
> > saying but there is really nothing worse for a data storage system than
> data
> > loss.
> >
> > In my case the problem didn't happen in the initial phase but after
> > disabling and enabling a table with about 12K regions.
> >
> > -eran
> >
> >
> >
> > On Sun, Jul 3, 2011 at 23:49, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> >> Let me try to answer some of your questions.
> >> The two paragraphs below were written along my reasoning which is in
> >> reverse
> >> order of the actual call sequence.
> >>
> >> For #4 below, the log indicates that the following was executed:
> >> private void assign(final RegionState state, final boolean
> setOfflineInZK,
> >>     final boolean forceNewPlan) {
> >>   for (int i = 0; i < this.maximumAssignmentAttempts; i++) {
> >>     if (setOfflineInZK && !*setOfflineInZooKeeper*(state)) return;
> >>
> >> The above was due to the timeout which you noted in #2 which would have
> >> caused
> >> TimeoutMonitor.chore() to run this code (line 1787)
> >>
> >>     for (Map.Entry<HRegionInfo, Boolean> e: assigns.entrySet()){
> >>       assign(e.getKey(), false, e.getValue());
> >>     }
> >>
> >> This means there is lack of coordination between
> >> assignmentManager.TimeoutMonitor and OpenedRegionHandler
> >>
> >> The reason I mention HBASE-3789 is that it is marked as Incompatible
> change
> >> and is in TRUNK already.
> >> The application of HBASE-3789 to 0.90 branch would change the behavior
> >> (timing) of region assignment.
> >>
> >> I think it makes sense to evaluate the effect of HBASE-3789 in 0.90.4
> >>
> >> BTW were the incorrect region assignments observed for a table with
> >> multiple
> >> initial regions ?
> >> If so, I have HBASE-4010 in TRUNK which speeds up initial region
> assignment
> >> by about 50%.
> >>
> >> Cheers
> >>
> >> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <e...@gigya.com> wrote:
> >>
> >>> Ted,
> >>> So if I understand correctly the the theory is that because of the
> issue
> >>> fixed in HBASE-3789 the master took too long to detect that the region
> >> was
> >>> successfully opened by the first server so it forced closed it and
> >>> transitioned to a second server, but there are a few things about this
> >>> scenario I don't understand, probably because I don't know enough about
> >> the
> >>> inner workings of the region transition process and would appreciate it
> >> if
> >>> you can help me understand:
> >>> 1. The RS opened the region at 16:37:49.
> >>> 2. The master started handling the opened event at 16:39:54 - this
> delay
> >>> can
> >>> probably be explained by HBASE-3789
> >>> 3. At 16:39:54 the master log says: Opened region gs_raw_events,.....
> on
> >>> hadoop1-s05.farm-ny.gigya.com
> >>> 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4
> >>> Creating (or updating) unassigned node for
> 584dac5cc70d8682f71c4675a843c3
> >>> 09 with OFFLINE state - why did it decide to take the region offline
> >> after
> >>> learning it was successfully opened?
> >>> 5. Then it tries to reopen the region on hadoop1-s05, which indicates
> in
> >>> its
> >>> log that the open request failed because the region was already open -
> >> why
> >>> didn't the master use that information to learn that the region was
> >> already
> >>> open?
> >>> 6. At 16:43:57 the master decides the region transition timed out and
> >>> starts
> >>> forcing the transition - HBASE-3789 again?
> >>> 7. Now the master forces the transition of the region to hadoop1-s02
> but
> >>> there is no sign of that on hadoop1-s05 - why doesn't the old RS
> >>> (hadoop1-s05) detect that it is no longer the master and relinquishes
> >>> control of the region?
> >>>
> >>> Thanks.
> >>>
> >>> -eran
> >>>
> >>>
> >>>
> >>> On Sun, Jul 3, 2011 at 20:09, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>
> >>>> HBASE-3789 should have sped up region assignment.
> >>>> The patch for 0.90 is attached to that JIRA.
> >>>>
> >>>> You may prudently apply that patch.
> >>>>
> >>>> Regards
> >>>>
> >>>> On Sun, Jul 3, 2011 at 10:01 AM, Eran Kutner <e...@gigya.com> wrote:
> >>>>
> >>>>> Thanks Ted, but, as stated before, I'm already using 0.90.3, so
> >> either
> >>>> it's
> >>>>> not fixed or it's not the same thing.
> >>>>>
> >>>>> -eran
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sun, Jul 3, 2011 at 17:27, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>
> >>>>>> Eran:
> >>>>>> I was thinking of this:
> >>>>>> HBASE-3789  Cleanup the locking contention in the master
> >>>>>>
> >>>>>> though it doesn't directly handle 'PENDING_OPEN for too long' case.
> >>>>>>
> >>>>>> https://issues.apache.org/jira/browse/HBASE-3741 is in 0.90.3 and
> >>>>> actually
> >>>>>> close to the symptom you described.
> >>>>>>
> >>>>>> On Sun, Jul 3, 2011 at 12:00 AM, Eran Kutner <e...@gigya.com>
> >> wrote:
> >>>>>>
> >>>>>>> It does seem that both servers opened the same region around the
> >>> same
> >>>>>> time.
> >>>>>>> The region was offline because I disabled the table so I can
> >> change
> >>>> its
> >>>>>>> TTL.
> >>>>>>>
> >>>>>>> Here is the log from haddop1-s05 :
> >>>>>>> 2011-06-29 16:37:12,576 INFO
> >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received
> >>> request
> >>>> to
> >>>>>>> open
> >>>>>>> region:
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:37:12,680 DEBUG
> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> >>>>>> Processing
> >>>>>>> open of
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:37:12,680 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:37:12,711 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:37:12,711 DEBUG
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> Opening region: REGION => {NAME =>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.',
> >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY =>
> >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED =>
> >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME =>
> >>> 'gs_raw_events',
> >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE',
> >>>>> REPLICATION_SCOPE
> >>>>>> =>
> >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800',
> >>>> BLOCKSIZE
> >>>>> =>
> >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
> >>>>>>> 2011-06-29 16:37:12,711 DEBUG
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> Instantiated
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:37:12,847 DEBUG
> >>>>> org.apache.hadoop.hbase.regionserver.Store:
> >>>>>>> loaded
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360,
> >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062,
> >>>>>>> majorCompaction=false
> >>>>>>> 2011-06-29 16:37:12,848 INFO
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> Onlined
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.;
> >>>>>>> next sequenceid=1162228063
> >>>>>>> 2011-06-29 16:37:12,849 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:37:12,875 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:37:12,951 INFO
> >>>>> org.apache.hadoop.hbase.catalog.MetaEditor:
> >>>>>>> Updated row
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> in region .META.,,1 with server=
> >>> hadoop1-s05.farm-ny.gigya.com:60020,
> >>>>>>> startcode=1307349217076
> >>>>>>> 2011-06-29 16:37:12,951 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENED
> >>>>>>> 2011-06-29 16:37:12,964 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENED
> >>>>>>> 2011-06-29 16:37:12,964 DEBUG
> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> >>>> Opened
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:40:00,878 INFO
> >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received
> >>> request
> >>>> to
> >>>>>>> open
> >>>>>>> region:
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:40:00,878 DEBUG
> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> >>>>>> Processing
> >>>>>>> open of
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:40:01,079 WARN
> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> >>>>> Attempted
> >>>>>>> open of
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> but already online on this server
> >>>>>>> 2011-06-29 16:43:50,395 DEBUG
> >>>>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> >> Compaction
> >>>>>> (major)
> >>>>>>> requested for
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> because User-triggered major compaction; priority=1, compaction
> >>> queue
> >>>>>>> size=1248
> >>>>>>> 2011-06-29 20:19:49,906 INFO
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> Starting major compaction on region
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 20:19:49,906 INFO
> >>>>> org.apache.hadoop.hbase.regionserver.Store:
> >>>>>>> Started compaction of 1 file(s) in cf=events  into
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp,
> >>>>>>> seqid=1162228062, totalSize=98.3m
> >>>>>>> 2011-06-29 20:19:49,906 DEBUG
> >>>>> org.apache.hadoop.hbase.regionserver.Store:
> >>>>>>> Compacting
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360,
> >>>>>>> keycount=6882816, bloomtype=NONE, size=98.3m
> >>>>>>> 2011-06-29 20:19:59,920 INFO
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> completed compaction on region
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> after 10sec
> >>>>>>>
> >>>>>>> And here is the one from hadoop1-s02:
> >>>>>>> 2011-06-29 16:43:57,935 INFO
> >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received
> >>> request
> >>>> to
> >>>>>>> open
> >>>>>>> region:
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:43:58,990 DEBUG
> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> >>>>>> Processing
> >>>>>>> open of
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:43:58,990 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:43:59,002 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:43:59,002 DEBUG
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> Opening region: REGION => {NAME =>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.',
> >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY =>
> >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED =>
> >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME =>
> >>> 'gs_raw_events',
> >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE',
> >>>>> REPLICATION_SCOPE
> >>>>>> =>
> >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800',
> >>>> BLOCKSIZE
> >>>>> =>
> >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
> >>>>>>> 2011-06-29 16:43:59,003 DEBUG
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> Instantiated
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> 2011-06-29 16:43:59,204 DEBUG
> >>>>> org.apache.hadoop.hbase.regionserver.Store:
> >>>>>>> loaded
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360,
> >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062,
> >>>>>>> majorCompaction=false
> >>>>>>> 2011-06-29 16:43:59,205 INFO
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>>>>> Onlined
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.;
> >>>>>>> next sequenceid=1162228063
> >>>>>>> 2011-06-29 16:43:59,205 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:43:59,212 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENING
> >>>>>>> 2011-06-29 16:43:59,214 INFO
> >>>>> org.apache.hadoop.hbase.catalog.MetaEditor:
> >>>>>>> Updated row
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> in region .META.,,1 with server=
> >>> hadoop1-s02.farm-ny.gigya.com:60020,
> >>>>>>> startcode=1306919627544
> >>>>>>> 2011-06-29 16:43:59,214 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENED
> >>>>>>> 2011-06-29 16:43:59,224 DEBUG
> >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned
> >> node
> >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to
> >>>>>>> RS_ZK_REGION_OPENED
> >>>>>>> 2011-06-29 16:43:59,224 DEBUG
> >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> >>>> Opened
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>> java.io.IOException: Got error in response to OP_READ_BLOCK
> >> self=/
> >>>>>>> 10.1.104.2:33356, remote=/10.1.104.2:50010 for file
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360
> >>>>>>> for block 3674866614142268536_674205
> >>>>>>>
> >>>>>>>
> >>>>>>> Ted, can you please point me to J-D's bug fix you mentioned? Are
> >>> you
> >>>>>>> positive it's the same scenario - data loss is a very serious
> >>> problem
> >>>>> for
> >>>>>> a
> >>>>>>> DB.
> >>>>>>> I'd really like to apply that patch ASAP, because when I run hbck
> >> I
> >>>> get
> >>>>>>> over
> >>>>>>> 400 regions which are multiply assigned.
> >>>>>>> Last question, I understand the region's data is lost but is
> >> there
> >>> a
> >>>>> way
> >>>>>> to
> >>>>>>> at least make the table consistent again by some how removing the
> >>>> lost
> >>>>>>> region?
> >>>>>>>
> >>>>>>> Thanks.
> >>>>>>>
> >>>>>>> -eran
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, Jul 2, 2011 at 01:46, Ted Yu <yuzhih...@gmail.com>
> >> wrote:
> >>>>>>>
> >>>>>>>>>> 2011-06-29 16:43:57,880 INFO
> >>>>>>>> org.apache.hadoop.hbase.
> >>>>>>>> master.AssignmentManager: Region has been
> >>>>>>>> PENDING_OPEN for too long, reassigning
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>
> >>>>>>>> The double assignment should have been fixed by J-D's recent
> >>>> checkin.
> >>>>>>>>
> >>>>>>>> On Fri, Jul 1, 2011 at 3:14 PM, Stack <st...@duboce.net>
> >> wrote:
> >>>>>>>>
> >>>>>>>>> Is
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>> the region that was having the issue?  If so, if you looked
> >> in
> >>>>>>>>> hadoop1-s05's logs, was this region opened around 2011-06-29
> >>>>>> 16:43:57?
> >>>>>>>>> Was it also opened hadoop1-s02 not long after?  Did you say
> >>> what
> >>>>>>>>> version of hbase you are on?
> >>>>>>>>>
> >>>>>>>>> St.Ack
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 1, 2011 at 5:08 AM, Eran Kutner <e...@gigya.com>
> >>>>> wrote:
> >>>>>>>>>> Hi Stack,
> >>>>>>>>>> I'm not sure what the log means. I do see references to two
> >>>>>> different
> >>>>>>>>>> servers, but that would probably happen if there was normal
> >>>>>>> transition
> >>>>>>>> I
> >>>>>>>>>> assume. I'm using version 0.90.3
> >>>>>>>>>> Here are the relevant lines from the master logs:
> >>>>>>>>>>
> >>>>>>>>>> 2011-06-19 21:39:37,164 INFO
> >>>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>>>>>> Received REGION_SPLIT:
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.:
> >>>>>>>>>> Daughters;
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.,
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518810_1249_WEB204,1308533970928.46f876a4e97be04edb35eb8f8959d482.
> >>>>>>>>>> from hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
> >>>>>>>>>> 2011-06-19 21:43:12,983 INFO
> >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor:
> >>>>>>>>>> Deleted daughter reference
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.,
> >>>>>>>>>> qualifier=splitA, from parent
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.
> >>>>>>>>>> 2011-06-29 16:29:36,143 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Starting
> >>>>>>> unassignment
> >>>>>>>>> of
> >>>>>>>>>> region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> (offlining)
> >>>>>>>>>> 2011-06-29 16:29:36,146 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Sent
> >> CLOSE
> >>> to
> >>>>>>>>> serverName=
> >>>>>>>>>> hadoop1-s05.farm-ny.gigya.com,60020,1307349217076,
> >>>>>> load=(requests=0,
> >>>>>>>>>> regions=1654, usedHeap=1870, maxHeap=12483) for region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >>> new
> >>>>>>>> unassigned
> >>>>>>>>>> node: /hbase/unassigned/584dac5cc70d8682f71c4675a843c309
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> (region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.,
> >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076,
> >>>>>>>>>> state=RS_ZK_REGION_CLOSED)
> >>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >>>>>>>>>> transition=RS_ZK_REGION_CLOSED,
> >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076,
> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309
> >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler:
> >>>>>> Handling
> >>>>>>>>> CLOSED
> >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309
> >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Table
> >> being
> >>>>>>> disabled
> >>>>>>>> so
> >>>>>>>>>> deleting ZK node and removing from regions in transition,
> >>>>> skipping
> >>>>>>>>>> assignment of region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned
> >>>> node
> >>>>>> for
> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state
> >>>>>>>>>> RS_ZK_REGION_CLOSED
> >>>>>>>>>> 2011-06-29 16:30:53,801 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted
> >>> unassigned
> >>>>> node
> >>>>>>> for
> >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state
> >>>>>>>>>> RS_ZK_REGION_CLOSED
> >>>>>>>>>> 2011-06-29 16:34:01,453 INFO
> >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor:
> >>>>>>>>>> Updated region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> in META
> >>>>>>>>>> 2011-06-29 16:37:12,247 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating)
> >>>> unassigned
> >>>>>> node
> >>>>>>>> for
> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state
> >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No
> >> previous
> >>>>>>>> transition
> >>>>>>>>>> plan was found (or we are ignoring an existing plan) for
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> so generated a random one;
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.,
> >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com
> >>> ,60020,1307349217076;
> >>>> 5
> >>>>>>>>> (online=5,
> >>>>>>>>>> exclude=null) available servers
> >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning
> >>>>> region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
> >>>>>>>>>> 2011-06-29 16:37:13,102 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >>>>>>>>>> transition=RS_ZK_REGION_OPENED,
> >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076,
> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309
> >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
> >>>>>> Handling
> >>>>>>>>> OPENED
> >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309; deleting
> >>> unassigned
> >>>>>> node
> >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned
> >>>> node
> >>>>>> for
> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state
> >>>>>>>>>> RS_ZK_REGION_OPENED
> >>>>>>>>>> 2011-06-29 16:39:54,192 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted
> >>> unassigned
> >>>>> node
> >>>>>>> for
> >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state
> >>>>>>>>>> RS_ZK_REGION_OPENED
> >>>>>>>>>> 2011-06-29 16:39:54,326 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
> >>>>> Opened
> >>>>>>>> region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
> >>>>>>>>>> 2011-06-29 16:40:00,598 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating)
> >>>> unassigned
> >>>>>> node
> >>>>>>>> for
> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state
> >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No
> >> previous
> >>>>>>>> transition
> >>>>>>>>>> plan was found (or we are ignoring an existing plan) for
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> so generated a random one;
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.,
> >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com
> >>> ,60020,1307349217076;
> >>>> 5
> >>>>>>>>> (online=5,
> >>>>>>>>>> exclude=null) available servers
> >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning
> >>>>> region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
> >>>>>>>>>> 2011-06-29 16:43:57,879 INFO
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Regions
> >> in
> >>>>>>> transition
> >>>>>>>>>> timed out:
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> state=PENDING_OPEN, ts=1309380052723
> >>>>>>>>>> 2011-06-29 16:43:57,880 INFO
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Region
> >> has
> >>>> been
> >>>>>>>>>> PENDING_OPEN for too long, reassigning
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Forcing
> >>>>> OFFLINE;
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> was=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> state=PENDING_OPEN, ts=1309380052723
> >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No
> >> previous
> >>>>>>>> transition
> >>>>>>>>>> plan was found (or we are ignoring an existing plan) for
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> so generated a random one;
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.,
> >>>>>>>>>> src=, dest=hadoop1-s02.farm-ny.gigya.com
> >>> ,60020,1306919627544;
> >>>> 5
> >>>>>>>>> (online=5,
> >>>>>>>>>> exclude=null) available servers
> >>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning
> >>>>> region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> to hadoop1-s02.farm-ny.gigya.com,60020,1306919627544
> >>>>>>>>>> 2011-06-29 16:43:59,022 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >>>>>>>>>> transition=RS_ZK_REGION_OPENING,
> >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544,
> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309
> >>>>>>>>>> 2011-06-29 16:43:59,221 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >>>>>>>>>> transition=RS_ZK_REGION_OPENING,
> >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544,
> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309
> >>>>>>>>>> 2011-06-29 16:43:59,226 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> >>>>>>>>>> transition=RS_ZK_REGION_OPENED,
> >>>>>>>>>> server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544,
> >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309
> >>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
> >>>>>> Handling
> >>>>>>>>> OPENED
> >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309; deleting
> >>> unassigned
> >>>>>> node
> >>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned
> >>>> node
> >>>>>> for
> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state
> >>>>>>>>>> RS_ZK_REGION_OPENED
> >>>>>>>>>> 2011-06-29 16:43:59,296 DEBUG
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted
> >>> unassigned
> >>>>> node
> >>>>>>> for
> >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state
> >>>>>>>>>> RS_ZK_REGION_OPENED
> >>>>>>>>>> 2011-06-29 16:43:59,375 WARN
> >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager:
> >> Overwriting
> >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 on
> >>>>>>>>>> serverName=hadoop1-s05.farm-ny.gigya.com
> >>> ,60020,1307349217076,
> >>>>>>>>>> load=(requests=0, regions=1273, usedHeap=2676,
> >> maxHeap=12483)
> >>>>>>>>>> 2011-06-29 16:43:59,375 DEBUG
> >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
> >>>>> Opened
> >>>>>>>> region
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> >>>>>>>>>> on hadoop1-s02.farm-ny.gigya.com,60020,1306919627544
> >>>>>>>>>>
> >>>>>>>>>> Thanks.
> >>>>>>>>>>
> >>>>>>>>>> -eran
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 1, 2011 at 09:05, Stack <st...@duboce.net>
> >>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> So, Eran, it seems as though two RegionServers were
> >> carrying
> >>>> the
> >>>>>>>>>>> region?  One deleted a file (compaction on its side)?  Can
> >>> you
> >>>>>>> figure
> >>>>>>>>>>> if indeed two servers had same region?  (Check master logs
> >>> for
> >>>>>> this
> >>>>>>>>>>> regions assignments).
> >>>>>>>>>>>
> >>>>>>>>>>> What version of hbase?
> >>>>>>>>>>>
> >>>>>>>>>>> St.Ack
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Jun 30, 2011 at 3:58 AM, Eran Kutner <
> >>> e...@gigya.com>
> >>>>>>> wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> I have a cluster of 5 nodes with one large table that
> >>>>> currently
> >>>>>>> has
> >>>>>>>>>>> around
> >>>>>>>>>>>> 12000 regions. Everything was working fine for
> >> relatively
> >>>> long
> >>>>>>> time,
> >>>>>>>>>>> until
> >>>>>>>>>>>> now.
> >>>>>>>>>>>> Yesterday I significantly reduced the TTL on the table
> >> and
> >>>>>>> initiated
> >>>>>>>>>>> major
> >>>>>>>>>>>> compaction. This should have reduced the table size to
> >>> about
> >>>>> 20%
> >>>>>>> of
> >>>>>>>>> its
> >>>>>>>>>>>> original size.
> >>>>>>>>>>>> Today, I'm getting errors of inaccessible files on HDFS,
> >>> for
> >>>>>>>> example:
> >>>>>>>>>>>> java.io.IOException: Got error in response to
> >>> OP_READ_BLOCK
> >>>>>> self=/
> >>>>>>>>>>>> 10.1.104.2:58047, remote=/10.1.104.2:50010 for file
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360
> >>>>>>>>>>>> for block 3674866614142268536_674205
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
> >>>>>>>>>>>>       at
> >>>>> java.io.DataInputStream.read(DataInputStream.java:132)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
> >>>>>>>>>>>>       at
> >>>>>>>>>
> >> java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >>>>>>>>>>>>       at
> >>>>>>>>>
> >> java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:128)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:68)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> >>>>>>>>>>>>       at
> >>>>>>>>>
> >> java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>>>>>>>>>>>       at
> >>>>>>>>>
> >> java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>>>>>>>>>>>       at
> >>>>>> org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1433)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:139)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>
> >>>>>
> >> org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1341)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2269)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1126)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1118)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1102)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1781)
> >>>>>>>>>>>>       at
> >>>> sun.reflect.GeneratedMethodAccessor46.invoke(Unknown
> >>>>>>>> Source)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>>>>>>>>       at
> >> java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>
> >> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >>>>>>>>>>>>       at
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >>>>>>>>>>>>
> >>>>>>>>>>>> I checked and the file, indeed doesn't exist on HDFS,
> >> here
> >>>> is
> >>>>>> the
> >>>>>>>> name
> >>>>>>>>>>> node
> >>>>>>>>>>>> logs for this block, apparently because it was deleted:
> >>>>>>>>>>>> 2011-06-19 21:39:36,651 INFO
> >>>>> org.apache.hadoop.hdfs.StateChange:
> >>>>>>>>> BLOCK*
> >>>>>>>>>>>> NameSystem.allocateBlock:
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp/2096863423111131624.
> >>>>>>>>>>>> blk_3674866614142268536_674205
> >>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO
> >>>>> org.apache.hadoop.hdfs.StateChange:
> >>>>>>>>> BLOCK*
> >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated:
> >>>> 10.1.104.2:50010
> >>>>> is
> >>>>>>>>> added
> >>>>>>>>>>> to
> >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864
> >>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO
> >>>>> org.apache.hadoop.hdfs.StateChange:
> >>>>>>>>> BLOCK*
> >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated:
> >>>> 10.1.104.3:50010
> >>>>> is
> >>>>>>>>> added
> >>>>>>>>>>> to
> >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864
> >>>>>>>>>>>> 2011-06-19 21:40:11,955 INFO
> >>>>> org.apache.hadoop.hdfs.StateChange:
> >>>>>>>>> BLOCK*
> >>>>>>>>>>>> NameSystem.addStoredBlock: blockMap updated:
> >>>> 10.1.104.5:50010
> >>>>> is
> >>>>>>>>> added
> >>>>>>>>>>> to
> >>>>>>>>>>>> blk_3674866614142268536_674205 size 67108864
> >>>>>>>>>>>> 2011-06-29 20:20:01,662 INFO
> >>>>> org.apache.hadoop.hdfs.StateChange:
> >>>>>>>>> BLOCK*
> >>>>>>>>>>> ask
> >>>>>>>>>>>> 10.1.104.2:50010 to delete
> >>> blk_3674866614142268536_674205
> >>>>>>>>>>>> 2011-06-29 20:20:13,671 INFO
> >>>>> org.apache.hadoop.hdfs.StateChange:
> >>>>>>>>> BLOCK*
> >>>>>>>>>>> ask
> >>>>>>>>>>>> 10.1.104.5:50010 to delete
> >>> blk_-4056387895369608597_675174
> >>>>>>>>>>>> blk_-5017882805850873821_672281
> >>>> blk_702373987100607684_672288
> >>>>>>>>>>>> blk_-5357157478043290010_668506
> >>>> blk_7118175133735412789_674903
> >>>>>>>>>>>> blk_-3569812563715986384_675231
> >>>> blk_8296855057240604851_669285
> >>>>>>>>>>>> blk_-6483679172530609101_674268
> >>>> blk_8738539715363739108_673682
> >>>>>>>>>>>> blk_1744841904626813502_675238
> >>>> blk_-6035315106100051103_674266
> >>>>>>>>>>>> blk_-1789501623010070237_674908
> >>>> blk_1944054629336265129_673689
> >>>>>>>>>>>> blk_3674866614142268536_674205
> >>>> blk_7930425446738143892_647410
> >>>>>>>>>>>> blk_-3007186753042268449_669296
> >>>>> blk_-5482302621772778061_647416
> >>>>>>>>>>>> blk_-3765735404924932181_672004
> >>>> blk_7476090998956811081_675169
> >>>>>>>>>>>> blk_7862291659285127712_646890
> >>>> blk_-2666244746343584727_672013
> >>>>>>>>>>>> blk_6039172613960915602_674206
> >>>> blk_-8470884397893086564_646899
> >>>>>>>>>>>> blk_4558230221166712802_668510
> >>>>>>>>>>>> 2011-06-29 20:20:46,698 INFO
> >>>>> org.apache.hadoop.hdfs.StateChange:
> >>>>>>>>> BLOCK*
> >>>>>>>>>>> ask
> >>>>>>>>>>>> 10.1.104.3:50010 to delete
> >>> blk_-7851606440036350812_671552
> >>>>>>>>>>>> blk_9214649160203453845_647566
> >>> blk_702373987100607684_672288
> >>>>>>>>>>>> blk_5958099369749234073_668143
> >>>> blk_-5172218034084903173_673109
> >>>>>>>>>>>> blk_-2934555181472719276_646476
> >>>>> blk_-1409986679370073931_672552
> >>>>>>>>>>>> blk_-2786034325506235869_669086
> >>>> blk_3674866614142268536_674205
> >>>>>>>>>>>> blk_510158930393283118_673225
> >>> blk_916244738216205237_677068
> >>>>>>>>>>>> blk_-4317027806407316617_670379
> >>>> blk_8555705688850972639_673485
> >>>>>>>>>>>> blk_-3765735404924932181_672004
> >>>>> blk_-5482302621772778061_647416
> >>>>>>>>>>>> blk_-2461801145731752623_674605
> >>>>> blk_-8737702908048998927_672549
> >>>>>>>>>>>> blk_-8470884397893086564_646899
> >>>> blk_4558230221166712802_668510
> >>>>>>>>>>>> blk_-4056387895369608597_675174
> >>>>> blk_-8675430610673886073_647695
> >>>>>>>>>>>> blk_-6642870230256028318_668211
> >>>>> blk_-3890408516362176771_677483
> >>>>>>>>>>>> blk_-3569812563715986384_675231
> >>>>> blk_-5007142629771321873_674548
> >>>>>>>>>>>> blk_-3345355191863431669_667066
> >>>> blk_8296855057240604851_669285
> >>>>>>>>>>>> blk_-6595462308187757470_672420
> >>>>> blk_-2583945228783203947_674607
> >>>>>>>>>>>> blk_-346988625120916345_677063
> >>>> blk_4449525876338684218_674496
> >>>>>>>>>>>> blk_2617172363857549730_668201
> >>>> blk_8738539715363739108_673682
> >>>>>>>>>>>> blk_-208904675456598428_679286
> >>>> blk_-497549341281882641_646477
> >>>>>>>>>>>> blk_-6035315106100051103_674266
> >>>>> blk_-2356539038067297411_672388
> >>>>>>>>>>>> blk_-3881703084497103249_668137
> >>>> blk_2214397881104950315_646643
> >>>>>>>>>>>> blk_-5907671443455357710_673223
> >>>>> blk_-2431880309956605679_669204
> >>>>>>>>>>>> blk_6039172613960915602_674206
> >>>> blk_5053643911633142711_669194
> >>>>>>>>>>>> blk_-2636977729205236686_674664
> >>>>>>>>>>>>
> >>>>>>>>>>>> I assume the file loss is somehow related to this change
> >>> and
> >>>>> the
> >>>>>>>> major
> >>>>>>>>>>>> compaction that followed because the same scan that is
> >>>> failing
> >>>>>> now
> >>>>>>>> was
> >>>>>>>>>>>> working fine yesterday and that is the only changed that
> >>>>>> happened
> >>>>>>> on
> >>>>>>>>> the
> >>>>>>>>>>>> cluster. Any suggestions what to do now?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -eran
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>

Reply via email to