Part 3 of the blog series is now published. Thanks to the co-writers, Duo Zhang and Andrew Purtell.
Part 3:
https://engineering.salesforce.com/evolution-of-region-assignment-in-the-apache-hbase-architecture-part-3-e03b814ae92

On Mon, Sep 13, 2021 at 10:53 PM Viraj Jasani <vjas...@apache.org> wrote:

> Thanks Duo for your offer to coordinate on writing "Part 3" of this
> series, sounds great!
> Although I see TRSP#assign being used by SCP directly while assigning the
> regions, I have yet to take a detailed look into HBASE-20881
> <https://issues.apache.org/jira/browse/HBASE-20881> and the relevant
> work. Let me reach out to you over Slack and we can take it from there.
>
> On Sun, Sep 12, 2021 at 7:02 PM 张铎(Duo Zhang) <palomino...@gmail.com>
> wrote:
>
>> Thank you Viraj and Andrew, the blog posts are outstanding!
>>
>> And I think we'd better have a part 3, about the
>> ServerCrashProcedure (SCP) :)
>>
>> In 2.0 and 2.1, we used MoveRegionProcedure, AssignRegionProcedure and
>> UnassignRegionProcedure, and one of the reasons why we removed them all
>> and introduced a single TRSP to do assign/unassign/move/reopen is SCP.
>>
>> If a region server has crashed, obviously, we cannot assign regions to
>> it any more, so we need a way to stop the procedures that are still
>> trying to assign regions to the dead server. And even for unassigning a
>> region, we still need to make it online first and then unassign it. For
>> example, when disabling a table, we must make sure that all the data in
>> the memstore has been flushed to storage, so we need to bring the region
>> online and then do a clean close.
>> In 2.0 and 2.1, we had 3 procedures for region assignment, and there
>> were lots of corner cases when we wanted to interrupt them from SCP,
>> which made the code really hard to understand and buggy. So finally, we
>> introduced a TRSP to replace them all, so that SCP only needs to
>> interrupt one type of procedure.
>>
>> This is the story :)
>>
>> I could help if you guys want to write the part 3 about SCP :)
>>
>> Thanks.
>>
>> Viraj Jasani <vjas...@apache.org> wrote on Wed, Sep 8, 2021 at 2:27 AM:
>>
>> > As some HBase users are still running HBase 1.x versions in their
>> > production environments, and branch-1 is trending toward EOL, now is
>> > the right time to evaluate and understand the features and core
>> > design changes provided by HBase 2.x versions.
>> >
>> > As the majority of us are already aware, one of the key features with
>> > significant architectural changes provided by HBase 2 is
>> > AssignmentManagerV2 (AMv2).
>> > However, we don't seem to have one place explaining 1) *the evolution
>> > of AM* and 2) how it manages region assignments with better
>> > scalability, reliability and fault-tolerance.
>> > Keeping this in mind, Andrew and I have published a two-part series of
>> > blog posts explaining this evolution. Part 1 provides a) a basic
>> > introduction to HBase concepts, and b) AM and its shortcomings in
>> > previous versions that AMv2 is trying to resolve. Part 2 provides
>> > detailed info about Pv2 and how AMv2 leverages it, along with state
>> > diagrams explaining some of the complex region assignment workflows.
>> > The state diagrams are intended to help devs/users a) understand
>> > region assignment workflows in depth, b) walk through the code more
>> > easily, and c) debug and root-cause issues with better knowledge.
>> >
>> > Part 1:
>> > https://engineering.salesforce.com/evolution-of-region-assignment-in-the-apache-hbase-architecture-part-1-c43b1becc522
>> > Part 2:
>> > https://engineering.salesforce.com/evolution-of-region-assignment-in-the-apache-hbase-architecture-part-2-9568fb3790b
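
To make Duo's point about SCP in the thread above concrete, here is a toy Python model of the idea. It is not HBase code: `Op`, `RegionTransition`, `ToyMaster` and their methods are hypothetical names invented for illustration, and the real TRSP and SCP are Java procedures with far more state. The sketch only shows why a single transition type lets the crash handler use one interruption path instead of one per procedure type.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Op(Enum):
    """Kinds of region transition that one procedure type covers."""
    ASSIGN = auto()
    UNASSIGN = auto()
    MOVE = auto()
    REOPEN = auto()


@dataclass
class RegionTransition:
    """Hypothetical stand-in for a TRSP: one class for every transition kind."""
    region: str
    op: Op
    target_server: Optional[str] = None
    interrupted: bool = False

    def server_crashed(self, dead_server: str) -> bool:
        """Interrupt this transition if it targets the dead server."""
        if self.target_server != dead_server:
            return False
        self.interrupted = True
        self.target_server = None  # a new target has to be picked later
        return True


@dataclass
class ToyMaster:
    """Hypothetical holder of in-flight transitions."""
    in_flight: List[RegionTransition] = field(default_factory=list)

    def submit(self, transition: RegionTransition) -> RegionTransition:
        self.in_flight.append(transition)
        return transition

    def handle_server_crash(self, dead_server: str) -> List[str]:
        """Toy stand-in for SCP: one loop, no per-procedure-type branches."""
        return [t.region for t in self.in_flight if t.server_crashed(dead_server)]
```

With three separate procedure classes, `handle_server_crash` would need a distinct interruption rule for each of them; here every in-flight transition is asked the same question, whatever its `op` is.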