This is a common pitfall for those new to HBase, and I think it would be a good thing to have in the HBase book. Once someone realizes that you can store multiple values for the same cell, each with its own timestamp, there is a natural tendency to think "hey, I can store a one-to-many using multiple versions of a cell". That's not the intent of versioned cell values.
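To make that concrete, here is a toy in-memory model of a single versioned cell. It is only a sketch and not the HBase client API: the class `VersionedCell`, its method names, and the limit of 3 are made up for illustration, and it trims old versions eagerly on write, whereas HBase discards excess versions during compactions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of ONE versioned cell (hypothetical; not the HBase client API). */
class VersionedCell {
    private final int maxVersions;
    // timestamp -> value, sorted so the newest timestamp comes first
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.<Long>reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Store a value at a timestamp, then drop the oldest versions beyond the limit. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry holds the oldest timestamp
        }
    }

    /** Models a default read: only the newest value, or null if nothing is stored. */
    String latest() {
        Map.Entry<Long, String> newest = versions.firstEntry();
        return newest == null ? null : newest.getValue();
    }

    /** Up to n stored values, newest first. */
    List<String> history(int n) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() == n) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCell status = new VersionedCell(3);
        status.put(1L, "created");
        status.put(2L, "paid");
        status.put(3L, "shipped");
        status.put(4L, "delivered");
        System.out.println(status.latest());    // delivered
        System.out.println(status.history(10)); // [delivered, shipped, paid]
    }
}
```

For an order status this behaves sensibly: you normally want "delivered" and only occasionally the history. If those four values were four independent events belonging to a user, the oldest one would be gone and a plain read would show just one of them.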
Versioned cell values are best thought of as a way to keep a history of changes for a single entity that has only one value at any given time, such as tracking a state change over time. For a one-to-many relationship (e.g., a user with many events), favor either multiple rows or multiple columns instead.

Bill

On Fri, Aug 26, 2011 at 9:16 AM, Buttler, David <[email protected]> wrote:

> Physically, you will be storing the same data. Hbase stores everything as
> key-value pairs. The cell identifier is "row key, column family, column
> qualifier, timestamp"
>
> However, by storing items in different rows it is more convenient to query
> and delete old values. By default you only get the most recent version of a
> column during a scan.
>
> One way to think about it is: versions are for when you don't want to
> forget previous versions, but you typically only want the most recent
> version. If you want to be continuously accessing old versions, you would
> be better off putting them in separate rows.
>
> Dave
>
> -----Original Message-----
> From: Sheng Chen [mailto:[email protected]]
> Sent: Friday, August 26, 2011 1:38 AM
> To: [email protected]
> Subject: Re: Versioning
>
> Hi, I just saw your recent update of the hbase book on the version number
> question, and I'm also confused about it.
> As said on the book (HBASE-4251), it is not recommended setting the number
> of versions to an exceedingly high level (e.g., hundreds or more) unless
> those old values are very dear to you because this will greatly increase
> StoreFile size.
>
> But sometimes, we do need to save multiple versions of values, such as
> logging events, or messages of Facebook. In these cases, what is the trade
> off between saving them in different rows, and in different versions of one
> row?
>
> Thank you.
> Sean
>
>
> 2011/8/18 Doug Meil <[email protected]>
>
> >
> > Versioning can be used to see the previous state of a record. Some people
> > need this feature, others don't.
> >
> > One thing that may be worth a review is this...
> >
> > http://hbase.apache.org/book.html#keysize
> >
> > ... and specifically the fact about all the values being freighted with
> > timestamp (aka version) too. I don't know your use case, and I'm not sure
> > I have the time to understand it, but 1 million versions seems like a lot.
> > You're going to use a lot of space doing that.
> >
> >
> > On 8/17/11 11:53 AM, "Mark" <[email protected]> wrote:
> >
> > >I'm trying to fully understand all the possibilities of what HBase has
> > >to offer but I can determine a valid use case for multiple versions. Can
> > >someone please explain some real life use cases for this?
> > >
> > >Also, at what point is there "too many versions". For example to store
> > >all the queries a user has performed couldn't we create a column family
> > >and have max versions set to something really high (1M). Using this
> > >method we could then ask for the last X amount of queries by setting the
> > >max versions to X. It seems like this can also be accomplished by
> > >creating a separate row for each query but I'm not sure why one strategy
> > >would be better than the other.
> > >
> > >Please help me understand. Thanks!

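For the "last X queries" case at the bottom of the thread, here is a minimal sketch of the row-per-query layout. Nothing in it comes from the thread beyond the idea of one row per item: the key format (user id, a `|` separator, then a zero-padded reversed timestamp) is one possible design that I am assuming, the class and method names are hypothetical, and a sorted map stands in for an HBase table so the example runs without a cluster. It also assumes ASCII user ids that do not contain `|` and non-negative timestamps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a row-per-query layout; a TreeMap stands in for an HBase table. */
class QueryLog {
    // Rows are kept sorted by key, which is what makes the prefix scan below work.
    private final TreeMap<String, String> rows = new TreeMap<>();

    /** Hypothetical key format: userId, '|', zero-padded (Long.MAX_VALUE - timestamp). */
    static String rowKey(String userId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // 19 digits is the width of Long.MAX_VALUE, so keys compare correctly as text.
        return userId + "|" + String.format("%019d", reversed);
    }

    /** One row per query; nothing is overwritten or capped by a version limit. */
    void record(String userId, long timestampMillis, String query) {
        rows.put(rowKey(userId, timestampMillis), query);
    }

    /** Prefix scan: because timestamps are reversed, the first rows are the newest. */
    List<String> lastQueries(String userId, int limit) {
        String prefix = userId + "|";
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.tailMap(prefix, true).entrySet()) {
            if (out.size() == limit || !row.getKey().startsWith(prefix)) {
                break;
            }
            out.add(row.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        QueryLog log = new QueryLog();
        log.record("u1", 1000L, "hbase versions");
        log.record("u1", 2000L, "hbase row key design");
        log.record("u1", 3000L, "hbase compaction");
        log.record("u2", 2500L, "hadoop");
        System.out.println(log.lastQueries("u1", 2));
        // [hbase compaction, hbase row key design]
    }
}
```

With this layout "the last X queries" is a short scan starting at the user's prefix, and old queries can be deleted row by row, which is the convenience Dave describes above.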