We're seeing them build up in the actual data directory. Our first thought was the cleaner chores, since we were seeing them get gummed up on EMRFS exceptions. We wrote some scripting to monitor the logs and help the cleaner chores along, which seemed to resolve the repeated exceptions we were seeing in the logs. However, we haven't seen any data drain out.
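
For context, the monitoring is roughly along these lines - the log path, grep patterns, and alert are placeholders rather than exactly what we run:

    # watch the master log for the cleaner chores tripping over EMRFS
    MASTER_LOG=/var/log/hbase/hbase-hbase-master-*.log

    tail -n 20000 $MASTER_LOG \
      | grep -E "cleaner\.(CleanerChore|HFileCleaner|LogCleaner)" \
      | grep -qiE "exception|error" \
      && echo "cleaner chore is hitting EMRFS exceptions, $(date)"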

We ran a snapshot export today for one of the tables and the snapshot output size is the same as the S3 size, so HBase thinks it needs the files in some capacity. The S3 size is still over 3x larger than what's reported in the HFile stats in the web UI.
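
For reference, this is roughly what we ran - the table, snapshot, and bucket names here are made up:

    # take the snapshot from the hbase shell
    echo "snapshot 'my_table', 'my_table_snap'" | hbase shell

    # export it to a separate prefix so it can be sized on its own
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot my_table_snap \
      -copy-to s3://our-backup-bucket/hbase-snapshots \
      -mappers 16

    # then compare against what S3 reports for the live table
    aws s3 ls --recursive --summarize --human-readable \
      s3://our-backup-bucket/hbase-snapshots/ | tail -n 2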

We also run our DynamoDB table On-Demand, as we've seen the same spiky behavior.

We're fairly confident the stuck regions will be resolved by a master restart, since the in-memory state doesn't seem to match what's in the hbase:meta table, but we're not in a position to take a maintenance window right now.

Thanks for your help,
Austin

On 5/28/20 6:08 AM, Jacob LeBlanc wrote:
Are the files building up in the archive "directory" in S3 or in the actual data directory? In the past we've had 
issues where the cleaner chore thread on the master runs into EMRFS inconsistencies and eventually seems to give up, 
causing a buildup of files in the archive directory over time. Restarting the master gets the cleaner thread going 
again, which surfaces the inconsistencies in the log so you can then clean them up in a targeted way (using 
"emrfs diff", "emrfs sync", and "emrfs delete"). Or, if you have time during a maintenance window, you can try running 
"emrfs sync" on the whole hbase directory, but that can take a long time depending on the size of your data, and 
Amazon has warned us against running it while the cluster is running.
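
To make the targeted cleanup concrete, it usually looks something like this for us - the path below is just an example, point it at whatever the cleaner is complaining about in the master log:

    # see where the EMRFS metadata disagrees with what's actually in S3
    emrfs diff s3://my-hbase-bucket/hbase/archive/data/default/my_table

    # re-sync the metadata for just that path
    emrfs sync s3://my-hbase-bucket/hbase/archive/data/default/my_table

    # or drop the metadata entries for that path entirely and let them be re-created
    emrfs delete s3://my-hbase-bucket/hbase/archive/data/default/my_table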

In terms of why EMRFS inconsistencies occur, that might be a question for AWS 
support. We used to have a lot of problems with that because of DynamoDB 
throttling on the EmrFsMetadata table. We've seen fewer since we went from 
provisioned capacity to on-demand capacity (the demand seems to be very spiky) 
and since upgrading to newer versions (I wish Amazon published EMRFS bugs, but 
oh well - again, AWS support might be able to help here). But even with these 
changes we still sometimes see inconsistencies.
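
If you want to confirm throttling before changing anything, CloudWatch on the metadata table will show it. The table name below assumes whatever fs.s3.consistent.metadata.tableName is set to (EmrFSMetadata by default), and the dates are just an example window:

    # throttle events against the EMRFS metadata table over a day
    aws cloudwatch get-metric-statistics \
      --namespace AWS/DynamoDB \
      --metric-name ReadThrottleEvents \
      --dimensions Name=TableName,Value=EmrFSMetadata \
      --start-time 2020-05-20T00:00:00Z --end-time 2020-05-21T00:00:00Z \
      --period 3600 --statistics Sum

    # switching the table to on-demand capacity is a one-liner
    aws dynamodb update-table --table-name EmrFSMetadata --billing-mode PAY_PER_REQUEST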

If you are seeing parents hanging around in the actual data directory, that's 
something I've never seen before.

As for your regions stuck in transition, we also see this occasionally. Most of the time it can be fixed just by 
running "assign '<encoded_region>'" in the hbase shell. Or, if you prefer, you can run "hbase hbck -fixAssignments", 
which basically does the same thing: it tries to assign any regions still stuck in transition. Both of those can be 
done with the cluster running - no need to roll the master.
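
For example (the encoded region name is the hash at the end of the full region name in the master UI - the one below is a placeholder):

    # from the hbase shell, assign one stuck region by its encoded name
    hbase shell
    > assign 'd8d3d2f0c1a9b7e5f4a3b2c1d0e9f8a7'

    # or sweep up everything still stuck in transition
    hbase hbck -fixAssignments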

Hope this helps.

--Jacob

-----Original Message-----
From: Austin Heyne [mailto:[email protected]]
Sent: Thursday, May 21, 2020 10:28 AM
To: [email protected]
Subject: HBase not cleaning up split parents

We're running HBase 1.4.8 on S3 (EMR 5.20) and we're seeing that after a series 
of splits and a major compaction the split parents are not getting removed. The 
on-disk size of some of our tables is 6x what HBase is reporting in the table 
details. The RS_COMPACTED_FILES_DISCHARGER threads are all parked waiting and 
we haven't seen a reduction in size in well over a week. The only thing of note 
on the cluster is that we have two regions stuck in transition until we have a 
maintenance window to roll the master.
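
For what it's worth, this is roughly how we're measuring it - the bucket and table names are placeholders:

    # what S3 reports for the table vs. what the HBase UI shows
    aws s3 ls --recursive --summarize --human-readable \
      s3://my-hbase-bucket/hbase/data/default/my_table/ | tail -n 2

    # the compacted-files discharger threads on a regionserver, all parked
    # (run as the user that owns the regionserver process)
    jstack $(pgrep -f HRegionServer) | grep -A 2 RS_COMPACTED_FILES_DISCHARGER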

Has anyone experienced this or have a way to encourage the regionservers to 
start the cleanup process?

Thanks,
Austin
