We have resolved the issue.  Details follow as "lessons learned".

Yes, Stack, the info:splitA/B columns followed, as part of the offlined
parent's .META. row.  But the regions that splitA/B point to *did not
exist* in .META.

Also, after the original email, we checked in HDFS and found the parent
region directory with 4 data files, plus a directory for each daughter
region (with 4 small files each, presumably references to the original).

So, it looks like (not totally sure about the exact order, but something
like):

   1. split started on region A
   2. region A was offlined
   3. The daughter regions were created in HDFS with the reference files
   4. .META. was updated for region A
   5. **** server crashed

So, the new daughter entries were never added to .META.
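
A minimal sketch of the resulting inconsistency, in plain Ruby rather than
against the HBase API.  The hash layout and the missing_daughters helper are
hypothetical, made up for illustration; only the encoded region names come
from the .META. and hbck output quoted below.

```ruby
# Hypothetical in-memory model of the relevant .META. rows, keyed by encoded
# region name.  This is not how HBase stores them; it only mirrors the state
# described above: an offlined split parent whose daughters were never added.
meta_rows = {
  "6147d3696ba9db3a85e3afd08d0bc59a" => {
    offline: true,
    split: true,
    split_a: "3b278f1b0ea78af239409efc4f0b2a3d",
    split_b: "9d41646202a363812d068792311d3a9b",
  },
}

# Returns the encoded names of daughters that a split parent points to via
# splitA/splitB but that have no row of their own in the model.
def missing_daughters(meta_rows)
  meta_rows.flat_map do |_encoded, row|
    next [] unless row[:offline] && row[:split]
    [row[:split_a], row[:split_b]].compact.reject { |d| meta_rows.key?(d) }
  end
end

puts missing_daughters(meta_rows)
```

Once rows for both daughters are present in the model, the helper returns an
empty list, which corresponds to the state we reached after completing the
split by hand.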

We first tried to online region A with the shell command "assign", figuring
that hbase would just find and split region A again.  This seemed to have no
effect... not sure why, maybe because region A already had splitA/B
entries?  Region A remained offline.  We also tried to force it to split
region A, using the shell command "split".  Again no effect.

Finally we tried to manually complete the split that had started.  Peter
manually inserted the two daughter regions into .META.  We then tried to
force a compact from the shell; this failed with an NSRE
(NotServingRegionException).  So we onlined region A with the "assign"
command, and it worked this time.  And now we seem to be up again: compact
works, data loads work, hbck checks out!

As a side note, hbck gave me some good feedback to help investigate the
problem; although the "-fix" didn't help in this case.  It would be nice if
there was a tool or shell command to create a region given name, hdfs-path,
start and end keys.

Also, check_meta.rb threw me off track, because it did not detect any holes
when they did in fact exist.  This made me discount the most obvious
scenario, since I believed there were no holes.  Looking at the source for
bin/check_meta.rb, I see the issue:

if oldHRI.isOffline() && Bytes.equals(oldHRI.getStartKey(), hri.getStartKey())
  # Presume offlined parent
elsif Bytes.equals(oldHRI.getEndKey(), hri.getStartKey())
  # Start key of next matches end key of previous
...

When checking for holes, it does not properly account for offline regions.
In our case oldHRI is the offlined parent and hri is the region that follows
the missing daughters.  The first condition doesn't apply because
oldHRI.start != hri.start.  The second condition does apply (oldHRI.end ==
hri.start), so it continues on thinking there is no "problem" here.  Instead,
I think the second condition should be:

...
elsif !oldHRI.isOffline() && Bytes.equals(oldHRI.getEndKey(), hri.getStartKey())
  # Start key of next matches end key of previous
...
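
To illustrate the difference, here is a simplified sketch in plain Ruby.  It
is not the real check_meta.rb: Region is a stand-in Struct for HRegionInfo,
find_holes and its strict flag are hypothetical names, and the keys are made
up.  It only models the two conditions above.

```ruby
# Stand-in for HRegionInfo with just the fields the check looks at.
Region = Struct.new(:name, :start_key, :end_key, :offline)

# Walks adjacent rows in .META. order and returns the names of regions after
# which a hole is reported.  strict = false mimics the current second
# condition; strict = true adds the proposed !offline guard.
def find_holes(regions, strict)
  holes = []
  regions.each_cons(2) do |old, cur|
    if old.offline && old.start_key == cur.start_key
      # Presume offlined parent followed by its first daughter
    elsif (!strict || !old.offline) && old.end_key == cur.start_key
      # Start key of next matches end key of previous
    else
      holes << old.name
    end
  end
  holes
end

# Parent A was offlined by a split, but its daughters never reached .META.
broken = [
  Region.new("A", "a", "m", true),
  Region.new("B", "m", "z", false),
]

# A split that completed: parent A, daughters A1 and A2, then B.
healthy = [
  Region.new("A",  "a", "m", true),
  Region.new("A1", "a", "g", false),
  Region.new("A2", "g", "m", false),
  Region.new("B",  "m", "z", false),
]

p find_holes(broken, false)   # current condition
p find_holes(broken, true)    # proposed condition
p find_holes(healthy, true)   # completed split under the proposed condition
```

In this model the current condition reports nothing for the broken case,
while the proposed one flags the offlined parent and still accepts a split
that completed normally.  One caveat: it would also flag an offline region
that is not a split parent, so it may need a further check on the split flag.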

Marc


On Sun, Mar 6, 2011 at 9:18 AM, Stack <[email protected]> wrote:

> So, yeah Marc, what are the rows that follow the ones you post below?
> Are they the info:splitA and info:splitB or something else?
> Thanks,
> St.Ack
>
> On Sat, Mar 5, 2011 at 4:22 PM, Marc Limotte <[email protected]> wrote:
> > We had an issue a day ago with some OOMEs on the region servers.  The
> > master shut down ok, but most of the RegionServers didn't, so we eventually
> > had to kill -9 them.  We brought it all back up and ran a major compaction
> > to change the hbase block size.  This seemed to work, but now we have an
> > inconsistency which is preventing bulk loads from continuing.
> >
> > hbase hbck -details finds an inconsistency.  I tried -fix, but no help.
> > *Chain of regions in table opx_ad_event_v2 is broken; edges does not
> contain
> > advertiser^BOpenX PSA^Acountry^BSerbia^Apublisher^BDSNR - Filesharing
> >
> ROW^Adomain^Bthejesperbay.com^Aadvertiser_tag^Bmmx.travel^Astarttime^B1295175600
> > *
> > hbck also notes that this region is offline:
> >
> > *11/03/05 23:33:16 DEBUG util.HBaseFsck: Region
> > opx_ad_event_v2,advertiser\x02OpenX
> > PSA\x01country\x02Serbia\x01publisher\x02DSNR - Filesharing
> ROW\x01domain\
> > x02thejesperbay.com\x01advertiser_tag\x02mmx.travel
> \x01starttime\x021295175600,1297185243218.6147d3696ba9db3a85e3afd08d0bc59a.
> > offline, split, parent, ignoring.*
> >
> > Looking in .META. I see that the region is indeed offline, and appears to
> be
> > split:
> >
> > info:regioninfo                           timestamp=1299301154675 ...
> > OFFLINE => true,
> > info:splitA                                  timestamp=1299283401019
> > info:splitB                                  timestamp=1299283401019
> > (full .META. row below)
> >
> > So, I'm guessing that it was in the midst of splitting and did not
> complete.
> >
> > How can I recover from this situation?
> >
> > thanks,
> > Marc
> >
> > ----------- .META. output ----------------
> >
> > hbase(main):001:0> get '.META.' , "opx_ad_event_v2,advertiser\x02OpenX
> > PSA\x01country\x02Serbia\x01publisher\x02DSNR - Filesharing
> ROW\x01domain\
> > x02thejesperbay.com\x01advert""
> > COLUMN
> > CELL
> >
> >  info:regioninfo                              timestamp=1299301154675,
> > value=REGION => {NAME => 'opx_ad_event_v2,advertiser\x02OpenX
> > PSA\x01country\x02Serbia\x01publisher\x02DS
> >                                              NR - Filesharing
> > ROW\x01domain\x02thejesperbay.com\x01advertiser_tag\x02mmx.travel
> > \x01starttime\x021295175600,1297185243218.6147d3
> >
>  696ba9db3a85e3afd08d0bc59a.',
> > STARTKEY => 'advertiser\x02OpenX
> > PSA\x01country\x02Serbia\x01publisher\x02DSNR - Filesharing ROW\x01
> >                                              domain\x02thejesperbay.com
> > \x01advertiser_tag\x02mmx.travel\x01starttime\x021295175600', ENDKEY =>
> > 'advertiser\x02OpenX PSA\x01coun
> >                                              try\x02United Arab
> > Emirates\x01publisher\
> > x02www.sixbillionsecrets.com/\x01advertiser_tag\x02mmx.arts and
> > entertainment\x01publishe
> >                                              r_tag\x02mmx.arts and
> > entertainment\x01starttime\x021295877600', ENCODED =>
> > 6147d3696ba9db3a85e3afd08d0bc59a, OFFLINE => true, SPL
> >                                              IT => true, TABLE => {{NAME
> =>
> > 'opx_ad_event_v2', FAMILIES => [{NAME => 'metrics', BLOOMFILTER =>
> 'NONE',
> > REPLICATION_SCOPE => '0'
> >                                              , VERSIONS => '1',
> COMPRESSION
> > => 'GZ', TTL => '2147483647', BLOCKSIZE => '1048576', IN_MEMORY =>
> 'false',
> > BLOCKCACHE => 'true'},
> >                                              {NAME => 'topn', BLOOMFILTER
> > => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'GZ', VERSIONS =>
> '1',
> > TTL => '2147483647', BLOCK
> >                                              SIZE => '1048576', IN_MEMORY
> > => 'false', BLOCKCACHE =>
> > 'true'}]}}
> >  info:server                                  timestamp=1299181144063,
> > value=ip-10-17-24-121.ec2.internal:60020
> >
> >  info:serverstartcode                         timestamp=1299181144063,
> > value=1299180905510
> >
> >  info:splitA                                  timestamp=1299283401019,
> > value=REGION => {NAME => 'opx_ad_event_v2,advertiser\x02OpenX
> > PSA\x01country\x02Serbia\x01publisher\x02DS
> >                                              NR - Filesharing
> > ROW\x01domain\x02thejesperbay.com\x01advertiser_tag\x02mmx.travel
> > \x01starttime\x021295175600,1299283399612.3b278f
> >
>  1b0ea78af239409efc4f0b2a3d.',
> > STARTKEY => 'advertiser\x02OpenX
> > PSA\x01country\x02Serbia\x01publisher\x02DSNR - Filesharing ROW\x01
> >                                              domain\x02thejesperbay.com
> > \x01advertiser_tag\x02mmx.travel\x01starttime\x021295175600', ENDKEY =>
> > 'advertiser\x02OpenX PSA\x01coun
> >                                              try\x02Taiwan\x01domain\
> > x02kanzhongguo.com\x01advertiser_tag\x02mmx.arts and
> > entertainment\x01publisher_tag\x02\x01starttime\x0212
> >                                              96910800', ENCODED =>
> > 3b278f1b0ea78af239409efc4f0b2a3d, TABLE => {{NAME => 'opx_ad_event_v2',
> > FAMILIES => [{NAME => 'metrics', BLO
> >                                              OMFILTER => 'NONE',
> > REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'GZ', TTL =>
> > '2147483647', BLOCKSIZE => '65536', IN_
> >                                              MEMORY => 'false',
> BLOCKCACHE
> > => 'true'}, {NAME => 'topn', BLOOMFILTER => 'NONE', REPLICATION_SCOPE =>
> > '0', VERSIONS => '1', COMPR
> >                                              ESSION => 'GZ', TTL =>
> > '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
> > 'true'}]}}
> >  info:splitB                                  timestamp=1299283401019,
> > value=REGION => {NAME => 'opx_ad_event_v2,advertiser\x02OpenX
> > PSA\x01country\x02Taiwan\x01domain\x02kanzh
> >
> > ongguo.com\x01advertiser_tag\x02mmx.arts
> > and
> >
> entertainment\x01publisher_tag\x02\x01starttime\x021296910800,1299283399612.9d4164620
> >                                              2a363812d068792311d3a9b.',
> > STARTKEY => 'advertiser\x02OpenX PSA\x01country\x02Taiwan\x01domain\
> > x02kanzhongguo.com\x01advertiser_ta
> >                                              g\x02mmx.arts and
> > entertainment\x01publisher_tag\x02\x01starttime\x021296910800', ENDKEY =>
> > 'advertiser\x02OpenX PSA\x01country\x0
> >                                              2United Arab
> > Emirates\x01publisher\
> > x02www.sixbillionsecrets.com/\x01advertiser_tag\x02mmx.arts and
> > entertainment\x01publisher_tag\
> >                                              x02mmx.arts and
> > entertainment\x01starttime\x021295877600', ENCODED =>
> > 9d41646202a363812d068792311d3a9b, TABLE => {{NAME => 'opx_ad
> >                                              _event_v2', FAMILIES =>
> [{NAME
> > => 'metrics', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS
> =>
> > '1', COMPRESSION => 'GZ'
> >                                              , TTL => '2147483647',
> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME
> =>
> > 'topn', BLOOMFILTER => 'NONE',
> >                                              REPLICATION_SCOPE => '0',
> > VERSIONS => '1', COMPRESSION => 'GZ', TTL => '2147483647', BLOCKSIZE =>
> > '65536', IN_MEMORY => 'false', B
> >                                              LOCKCACHE =>
> > 'true'}]}}
> >
> > 5 row(s) in 0.3960 seconds
> >
> >
> >
> > ------------ hbck -details output -------------
> > ...
> > ERROR: Region
> >
> hdfs://ip-10-17-5-253.ec2.internal:9000/hbase/opx_ad_event_v2/34a0ffe60da97431a809f0ffe8e5328a
> > on HDFS, but not listed in META or deployed on any region server.
> > ERROR: Region
> >
> hdfs://ip-10-17-5-253.ec2.internal:9000/hbase/opx_ad_event_v2/3b278f1b0ea78af239409efc4f0b2a3d
> > on HDFS, but not listed in META or deployed on any region server.
> > *11/03/05 23:33:16 DEBUG util.HBaseFsck: Region
> > opx_ad_event_v2,advertiser\x02OpenX
> > PSA\x01country\x02Serbia\x01publisher\x02DSNR - Filesharing
> ROW\x01domain\
> > x02thejesperbay.com\x01advertiser_tag\x02mmx.travel
> \x01starttime\x021295175600,1297185243218.6147d3696ba9db3a85e3afd08d0bc59a.
> > offline, split, parent, ignoring.
> > *ERROR: Region
> >
> hdfs://ip-10-17-5-253.ec2.internal:9000/hbase/opx_ad_event_v2/9d41646202a363812d068792311d3a9b
> > on HDFS, but not listed in META or deployed on any region server.
> > *Chain of regions in table opx_ad_event_v2 is broken; edges does not
> contain
> > advertiser^BOpenX PSA^Acountry^BSerbia^Apublisher^BDSNR - Filesharing
> >
> ROW^Adomain^Bthejesperbay.com^Aadvertiser_tag^Bmmx.travel^Astarttime^B1295175600*
> > ERROR: Found inconsistency in table opx_ad_event_v2
> > Summary:
> >  -ROOT- is okay.
> >    Number of regions: 1
> >    Deployed on:  ip-10-17-24-121.ec2.internal:60020
> >  .META. is okay.
> >    Number of regions: 1
> >    Deployed on:  ip-10-17-5-252.ec2.internal:60020
> > ...
> > Chain of regions in table opx_ad_event_v2 is broken; edges does not
> contain
> > advertiser^BOpenX PSA^Acountry^BSerbia^Apublisher^BDSNR - Filesharing
> >
> ROW^Adomain^Bthejesperbay.com^Aadvertiser_tag^Bmmx.travel^Astarttime^B1295175600
> > Table opx_ad_event_v2 is inconsistent.
> >    Number of regions: 1612
> >    Deployed on: ...
> > 4 inconsistencies detected.
> > Status: INCONSISTENT
> >
>
