We're seeing them build up in the actual data directory. Our first
thought was the cleaner chores, since we were seeing them get gummed up
on EMRFS exceptions. We wrote some scripting to monitor the logs and
help the cleaner chores along, and that seemed to resolve the repeated
exceptions we were seeing in the logs. However, we haven't seen any data
drain out.
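The monitoring piece is nothing fancy - essentially a watch on the
master log for cleaner errors, roughly along these lines (the log path
will vary by install):

    tail -F /var/log/hbase/hbase-hbase-master-*.log | grep -i 'cleaner'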
We ran a snapshot export today for one of the tables, and the snapshot
output size is the same as the S3 size, so HBase thinks it needs the
files in some capacity. The S3 size is still over 3x larger than what's
reported in the HFile stats in the web UI.
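For reference, the size check was roughly the following, with the table,
snapshot, and bucket names below as placeholders:

    hbase shell> snapshot 'my_table', 'my_table_snap'
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
        -snapshot my_table_snap -copy-to s3://scratch-bucket/snapshots
    aws s3 ls --recursive --summarize s3://hbase-root-bucket/hbase/data/default/my_table/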
We also run our DynamoDB table On-Demand, as we've seen the same spiky
behavior.
We're fairly confident the stuck regions will be resolved by a master
restart, since the in-memory state doesn't seem to match what's in the
hbase:meta table, but we're not in a position to take a maintenance
window right now.
Thanks for your help,
Austin
On 5/28/20 6:08 AM, Jacob LeBlanc wrote:
Are the files building up in the archive "directory" in S3 or in the actual data directory? In the past we've had
issues where the cleaner chore thread on the master runs into EMRFS inconsistencies and eventually seems to give up, causing a
buildup of files in the archive directory over time. Restarting the master gets the cleaner thread going again, which surfaces the
inconsistencies in the log, and you can then clean them up in a targeted way (using "emrfs diff", "emrfs sync",
and "emrfs delete"). Or, if you have time during a maintenance window, you can try running "emrfs sync" on the
whole hbase directory, but that sometimes takes a long time depending on the size of your data, and Amazon has warned us
against doing that while the cluster is running.
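For reference, the targeted cleanup is just the emrfs CLI pointed at the affected prefixes - the bucket and
table names below are only examples:

    emrfs diff s3://hbase-root-bucket/hbase
    emrfs delete s3://hbase-root-bucket/hbase/archive/data/default/my_table
    emrfs sync s3://hbase-root-bucket/hbase/archive/data/default/my_table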
In terms of why EMRFS inconsistencies occur, that might be a question for AWS
support. We used to have a lot of problems with them because of DynamoDB
throttling on the EmrFsMetadata table. We saw fewer once we went from
provisioned capacity to on-demand capacity (the demand seems to be very spiky)
and after upgrading to newer EMR versions (I wish Amazon published EMRFS bugs,
but oh well - again, AWS support might be able to help here). But even with
these changes we still sometimes see inconsistencies.
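If it helps, flipping the metadata table to on-demand is a one-liner with the AWS CLI (the table name here
assumes the default EMRFS metadata table name in our setup):

    aws dynamodb update-table --table-name EmrFsMetadata --billing-mode PAY_PER_REQUEST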
If you are seeing parents hanging around in the actual data directory, that's
something I've never seen before.
As for your regions stuck in transition, we also see this occasionally. Most of the time it can
be fixed just by running "assign '<encoded_region>'" in the hbase shell. Or, if you prefer, you can
run "hbase hbck -fixAssignments", which basically does the same thing: it tries to assign any
regions still in transition. Both of those can be done with the cluster running
- no need to roll the master.
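If it helps, the whole flow from the shell looks something like this (the encoded region name is just a
placeholder - grab the real one from the master UI or the status output):

    hbase shell
    > status 'detailed'            # lists regions currently in transition
    > assign 'abcdef1234567890'    # encoded region name

or, equivalently, from the command line:

    hbase hbck -fixAssignments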
Hope this helps.
--Jacob
-----Original Message-----
From: Austin Heyne [mailto:[email protected]]
Sent: Thursday, May 21, 2020 10:28 AM
To: [email protected]
Subject: HBase not cleaning up split parents
We're running HBase 1.4.8 on S3 (EMR 5.20) and we're seeing that after a series
of splits and a major compaction the split parents are not getting removed. The
on-disk size of some of our tables is 6x what HBase is reporting in the table
details. The RS_COMPACTED_FILES_DISCHARGER threads are all parked waiting and
we haven't seen a reduction in size in well over a week. The only thing of note
on the cluster is that we have two regions stuck in transition until we can get
a maintenance window to roll the master.
Has anyone experienced this, or does anyone have a way to encourage the
regionservers to start the cleanup process?
Thanks,
Austin