Hi All,

I had some questions about the hbase architecture that I am a little
confused about.

After doing reading over the internet / HBase book etc, my understanding of
regions is the following ->

When the cluster initially starts up (with no data), the regionservers come
online.

When the data starts flowing into HBase, the data starts being written into
a single region. Each region is composed of a memstore, storefiles and a
WAL. As memstores get filled up, they are written to storefiles on disk. A
major compaction combines these storefiles together into a single store
file. Storefiles/memstores exist on a per column family basis.

What I do not understand is the following - the StoreFiles for a particular
region/column family are written into HDFS as an HFile. This HFile then
exists in HDFS as chunks (usually of 64 MB size) across the HDFS data nodes.
When the region "splits", what happens to these store files ?

If a region is already split into multiple store files and a compaction runs
and a split was supposed to occur, the compaction process will only compact
half of the files ?
If a region only has a single store file, how is the data split into two ?
Is half of the data written into a separate file and then the original file
truncated ? Isn't this an expensive operation ?

Secondly, what does it mean to "move" a region from one regionserver to
another. Since a region is composed of a set of HFiles on the HDFS
filesystem, and can potentially exist anywhere across the datanode cluster,
is any data physically being moved from one machine to another ? or is it
just that the regionserver that the region has "moved" to takes control of
the file on HDFS that needs to be written to ?

I guess what I am not being able to fully understand is how is HBase
effectively using HDFS when it comes to region distribution. If a
regionserver is hosting a hot region whose blocks sit on a certain set of
datanodes, how does moving this region to another regionserver help ? The
traffic ultimately would land on the same set of datanodes holding the
different chunks of the region .. shouldn't Hbase then be doing load
balancing based on where the regions are physically located on the datanodes
as opposed to where they are *logically* located ?

Please let me know if I am missing something truly fundamental here.

Thank you,

Sam

Reply via email to