Disclaimer: I'm not a master at HBase schema design, so someone more knowledgable feel free to refute the below.
I'd say the first thing you should do is take a step back; jumping into HBase is not as easy as jumping into MySQL. If you are going to be working at scale, the efficiency of your design matters greatly, such as row key size, column qualifier size, number of columns, separation of columns by column family, etc. A lot of what MySQL abstracts away is put right in your face with HBase. This is great in a lot of ways, but provides for a steep learning curve for someone looking to get in on the scene. Before contemplating such a project, I'd recommend taking a few days studying the various documentation out there: http://hbase.apache.org/book.html and http://ofps.oreilly.com/titles/9781449396107/ are a good place to start. Learning the inner architecture of how KeyValues work and how data is retrieved by HBase is very important. One of the first things you'll hopefully learn is that HBase/NoSQL is not relational. It doesn't make sense to have a category_id "foreign key" column like you have in your current approach. There is no "third normal form" and such for NoSQL like there is for relation databases. To that point, compiling a list of SQL queries as a starting point for a NoSQL project is probably not the best starting point. If your data is highly relational, while its certainly possible to make it work in a non-relational system, it may not be recommended without some real expertise or time to learn the new paradigms. A couple of tips: 1) You mention sorting on columns. You're right, this is not provided by HBase. In HBase there is a single primary key, and that is the row key. This is sorted lexicographically, as you have already found out. Keep in mind that when you sort in MySQL, it takes your data set and loads it into memory to be sorted. HBase doesn't do this for you, but you could easily do it from your client, once you have the data you want. You will see this pattern a lot: MySQL does things for you that you now need to handle yourself, and realize that there isn't much magical about how MySQL is doing it. 2) There are filters that can do some of what you want, such as returning only rows where a column is empty. There are also coprocessors, depending on the release you are using, which can do some extra work on the region server before sending over the wire (such as more complex filtering or data manipulation that might be expensive to do locally). -- I'll take a very quick and naive stab at your specific example. There should be no id columns. If you want to have a list of categories and categories can have multiple children, maybe each child category would be a column on the parent category's row. e.g. rowkey = category name; columns = one column (qualifer = 0x00 byte array) for the main category data, and child categories are extra columns where qualifiers are the name. The value could be a protobuf or avro message with whatever fields are important for a category. If all keywords are linked to a category, you might have that be part of the protobuf/avro message. So for protobuf your message would be: message Category { repeated Keyword keyword = 1; optional string some_other_per_category_field = 2; } message Keyword { optional string name = 1; optional int32 score = 2; } Like I said, this was quick and naive. Some of the queries you mentioned above would be expensive with this approach. You could always keep another table (or even rowkey schema within the same table, or another column family) to keep separate incremented counters for particular statistics you are interested in. Just an untested idea. Hope this helps, Bryan On Fri, Dec 16, 2011 at 11:26 AM, Alwin Roosen <[email protected]> wrote: > Hello, > > > I have been suggested to use HBase for a project, but after reading > some manuals/guidelines, I am still not sure how to design the > database and getting more confused by the minute. I am new to any form > of NoSQL database and having a hard time figuring this one out. > > I am hoping that someone can suggest a HBase design to me based on the > info below. It would also be nice to guide me to some of the HBase > classes/methods I need to use to get the results I need. > > I basically have two tables, a category table and a keyword table. The > category table only contains a few hundred records, but the keyword > table could contain millions over time (hence the HBase suggestion). > > The project is a bit more complex then this, but if I can get started > and understand the NoSQL concept for this example, I hope I am able to > figure out the rest by myself. > > SQL approach: > > =================================== > category > =================================== > id name parent > ----------------------------------- > 1 cat1 NULL > 2 cat2 NULL > 3 cat1-1 1 > 4 cat3 NULL > 5 cat3-1 4 > 6 cat3-2 4 > ... > > =================================== > keyword > =================================== > name category score > ----------------------------------- > book 1 23 > house 4 14 > cup 5 75 > shoe 2 3 > phone 1 58 > tablet 1 NULL > ... > > I need to be able to query HBase with the following example SQL scenarios: > > 1) Get the root categories > > SELECT * FROM `category` WHERE `parent` IS NULL > > 2) Get the child categories for a certain root category (one level) > > SELECT * FROM `category` WHERE `parent`=4 > > 3) Get a list of root categories sorted by the total highest score > from their keywords > > SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS > `c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE > `parent` IS NULL GROUP BY `c`.`id` ORDER BY `cat_score` DESC > > 4) Get a list of child categories sorted by the total highest score > from their keywords > > SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS > `c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE > `parent`=4 GROUP BY `c`.`id` ORDER BY `cat_score` DESC > > 5) Get a list of keywords that do not have a score yet > > SELECT * FROM `keyword` WHERE `score` IS NULL > > 6) Get the total number of categories: > > SELECT COUNT(`id`) FROM `category` > > 7) Get the total number of root categories: > > SELECT COUNT(`id`) FROM `category` WHERE `parent` IS NULL > > 8) Get the total number of keywords: > > SELECT COUNT(`name`) FROM `keyword` > > 9) Get the total number of keywords without a score: > > SELECT COUNT(`name`) FROM `keyword` WHERE `score` IS NULL > > > HBase approach (what I have so far but is most certainly very wrong): > > =================================== > category > =================================== > key = category name > columns: > - info:id (id of the category) > - relation:parent (id of the parent category) > > I am able to parse a category-tree in java (using table.getScanner()) > by just selecting all the rows (small table, no big deal) and creating > a tree. > > Since the table is sorted by key, I get a nice category-tree > alphabetically sorted. > > =================================== > keyword > =================================== > key = keyword name > columns: > - info:name (name of the keyword) > - info:score (score is available, otherwise not set) > - relation:category (id of the category) > > As I understand, you cannot sort on column values (info:score for > example), so how should I approach this? It would be possible to > process this in java, but I have a feeling this is not the correct > approach and the database design should be very different. > > > Any help would be much appreciated! >
