Hi there,

I think you probably want to look at this...

HBase catalog metadata...
http://hbase.apache.org/book.html#arch.catalog

How data is stored internally...
http://hbase.apache.org/book.html#regions.arch

Lots of versioning description here...
http://hbase.apache.org/book.html#datamodel

Long story short, the client talks directly to the RegionServers, and HBase looks at multiple StoreFiles.

On 6/1/12 4:27 PM, "S Ahmed" <[email protected]> wrote:

>(reference:
>http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
>
>A row consists of a key, and column families, along with a timestamp.
>
>So for example:
>
>key = com.example.com/some/path
>
>cf: outboundlinks {
>  com.example.com/link1,
>  com.example.com/link2,
>  ..
>}
>
>Data is stored like this:
>
>Region Server -> Store -> StoreFile -> HFile
>
>Now when a client requests a particular key, the hmaster figures out which
>region server holds the data, this information is returned to the client
>(which saves it locally), and then it makes a request to the region
>server.
>
>Now since the actual data files are immutable, if you modify a particular
>value in a CF, it is tombstoned (not sure how that works but understand
>it at a high level).
>
>So if I make a request for a given key, going with the example above, a
>particular url on the website example.com, and I want all the
>outboundlinks, I reference the column family "outboundlinks" which can
>store millions of urls.
>
>What process/service/class is in charge of assembling the various files to
>get all the correct data?
>
>Summary of my question:
>What I am trying to understand is, if a particular CF has millions of
>values, and if a single value is mutated, a new file has to be created.
>So this means, if I query for that value i.e. it is included in my result
>set, how does hbase know where to look for the latest data?
>
>So basically from what I understand, making a get request for a particular
>key, cf will have to potentially look at more than one StoreFile (or
>HFile?) correct?
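
P.S. To make the "looks at multiple StoreFiles" point concrete, here is a toy sketch in Python of how a read can be assembled from several immutable files plus the in-memory store. This is not HBase code: `Cell`, `read_cell`, and the sample data are made up for illustration, and I am assuming a simplified rule where the cell with the highest timestamp wins and a delete marker (tombstone) at that timestamp hides the value. The real RegionServer does the merge with sorted scanners over the MemStore and StoreFiles rather than dictionaries.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    """One version of a value. Hypothetical type, not an HBase class."""
    timestamp: int
    value: Optional[str]  # None stands in for a delete marker (tombstone)


# A "source" is either the in-memory store or one immutable StoreFile,
# modelled here as a mapping from (row key, column qualifier) to cell versions.
Source = Dict[Tuple[str, str], List[Cell]]


def read_cell(sources: List[Source], row: str, qualifier: str) -> Optional[str]:
    """Return the newest live value for (row, qualifier), or None if absent/deleted."""
    candidates: List[Cell] = []
    for source in sources:
        # Every source may hold a version of the cell, so all are consulted.
        candidates.extend(source.get((row, qualifier), []))
    if not candidates:
        return None
    # Highest timestamp wins; on a tie the delete marker takes precedence.
    newest = max(candidates, key=lambda c: (c.timestamp, c.value is None))
    return newest.value


row = "com.example.com/some/path"

# Oldest file: both links written at timestamp 1.
older_file: Source = {
    (row, "com.example.com/link1"): [Cell(1, "v1")],
    (row, "com.example.com/link2"): [Cell(1, "v1")],
}
# A later flush: link1 was updated, so a new version lands in a new file.
newer_file: Source = {
    (row, "com.example.com/link1"): [Cell(2, "v2")],
}
# Not yet flushed: link2 was deleted, recorded as a tombstone.
memstore: Source = {
    (row, "com.example.com/link2"): [Cell(3, None)],
}

sources = [memstore, newer_file, older_file]
link1 = read_cell(sources, row, "com.example.com/link1")
link2 = read_cell(sources, row, "com.example.com/link2")
```

Note that mutating one value only adds one new cell to the newest source; the older files are left untouched and the merge at read time decides which version is visible.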
