Are the files building up in the archive "directory" in S3 or in the actual data directory? In the past we've had issues where the cleaner chore thread on the master runs into EMRFS inconsistencies and eventually seems to give up, causing a buildup of files in the archive directory over time. Restarting the master gets the cleaner thread going again, which surfaces the inconsistencies in the log, and you can then clean them up in a targeted way (using "emrfs diff", "emrfs sync", and "emrfs delete"). Or, if you have time during a maintenance window, you can try running "emrfs sync" on the whole hbase directory, but that can take a long time depending on the size of your data, and Amazon has warned us against running it while the cluster is running.
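For reference, the targeted cleanup looks roughly like this, run from the master node (the s3://your-bucket/hbase/archive path below is a placeholder for your own bucket layout):

    # Report where the EMRFS metadata and S3 disagree under the archive prefix
    emrfs diff s3://your-bucket/hbase/archive

    # Re-sync the metadata for just that prefix
    emrfs sync s3://your-bucket/hbase/archive

    # Or drop the tracked metadata entries for the paths diff flags
    # (this removes the metadata entries, not the S3 objects themselves)
    emrfs delete s3://your-bucket/hbase/archive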
In terms of why EMRFS inconsistencies occur, that might be a question for AWS support. We used to have a lot of problems with them because of DynamoDB throttling on the EmrFsMetadata table. We saw fewer once we went from provisioned capacity to on-demand capacity (the demand seems to be very spiky) and after upgrading to newer versions (I wish Amazon published EMRFS bugs, but oh well - again, AWS support might be able to help here). Even with those changes we still sometimes see inconsistencies. If you are seeing split parents hanging around in the actual data directory, though, that's something I've never seen before.

As for your regions stuck in transition, we also see this occasionally. Most of the time it can be fixed just by running "assign '<encoded_region>'" in hbase shell. Or, if you prefer, you can run "hbase hbck -fixAssignments", which basically does the same thing: it tries to assign regions still in a transition state (example commands after the quoted message below). Both of those can be done with the cluster running - no need to roll the master.

Hope this helps.

--Jacob

-----Original Message-----
From: Austin Heyne [mailto:[email protected]]
Sent: Thursday, May 21, 2020 10:28 AM
To: [email protected]
Subject: HBase not cleaning up split parents

We're running HBase 1.4.8 on S3 (EMR 5.20) and we're seeing that, after a series of splits and a major compaction, the split parents are not getting removed. The on-disk size of some of our tables is 6x what HBase is reporting in the table details. The RS_COMPACTED_FILES_DISCHARGER threads are all parked waiting, and we haven't seen a reduction in size in well over a week. The only thing of note on the cluster is that we have two regions stuck in a transition state until we have a maintenance window to roll the master.

Has anyone experienced this, or does anyone have a way to encourage the regionservers to start the cleanup process?

Thanks,
Austin
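Example commands for the two assignment fixes mentioned above; both can be run on the master node while the cluster is up ('<encoded_region>' is a placeholder for the real encoded region name, which you can pull from the master UI or logs):

    # Option 1: inside hbase shell, manually assign the stuck region
    assign '<encoded_region>'

    # Option 2: from the command line, have hbck retry all regions in transition
    hbase hbck -fixAssignments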
