Hi, I'm a newbie to HBase and have a question about rowkey design; I hope it isn't too newbie-like for this list. It is a question that can't be answered by knowledge of the code alone, but only by experience with large databases, hence this mail.
For the sake of explanation, here is a small example. Suppose you want to design a small "blogging" platform. You just want to store the name of the user and a small text, and of course you want to fetch all postings of one user. Furthermore we have 4 users, let's call them A, B, C, D (and you can trust that the length of the username is fixed). Now let's say A, B and C each have N postings, and D has 7*N postings. BUT: the data of A is fetched 3 times more often than that of each of the other users! If you create an HBase cluster with 10 nodes, every node holds N postings (of course I know that the data is held redundantly, but this is not so important for the question).

Rowkey design #1: the i-th posting of user X gets the rowkey "$X$i", e.g. "A003". The table would simply be: "create 'postings', 'text'". With this design the first node would hold the data of A, the second that of B, the third that of C, and the fourth to the tenth node the data of D. Fetching the data would be very easy, but half of the read traffic would hit the first node.

Rowkey design #2: the rowkey is random, e.g. a UUID. The table design would now be: "create 'postings', 'user', 'text'". Fetching the data becomes a "real" MapReduce job: check the user column, emit the matching rows, etc. So each fetch costs more computation cycles and more IO, but the traffic is spread evenly across all 10 servers.

If N (the number of postings) is large enough that disk space becomes critical, I'm also not able to adjust the key regions so that, e.g., the data of D sits only on the last server while the key space of A spans the first 5 nodes, or to make the replication very broad (e.g. 10-fold in this case).

So basically the question is: what's the better plan? Avoiding the computation cycles of map-reducing by getting the key design straight, or scaling the computation but doing more IO? I hope the small example helped to make the question more vivid.
Best wishes Wilm
