Ian, thanks for your detailed response!
Let me give you feedback on each point:

> 1. You could denormalize the additional information (e.g. course
> name) into the students table. Then, you're simply reading the
> student row, and all the info you need is there. That places an extra
> burden of write time and disk space, and does make you do a lot more
> work when a course name changes.

That's exactly what I thought about, and that's why I avoid it.
The students-and-courses example shows up in several places on the web
to illustrate how relations translate from an RDBMS into a key-value
store.
In fact, everything you model with a key-value store like HBase,
Cassandra etc. can also be modeled as an RDBMS schema.
Since a lot of people, like me, are coming from that side, we must
re-learn several basic things.
It starts with understanding that you model a key-value store around the
way you want to access the data, not around how the data items relate to
each other (in general terms), and it ends with translating the
connections between the data into a key-value schema as well as possible.

> 2. You could do what you're talking about in your HBase access code:
> find the list of course IDs you need for the student, and do a multi
> get on the course table. Fundamentally, this won't be much more
> efficient to do in batch mode, because the courses are likely to be
> evenly spread out over the region servers (orthogonal to the
> students). You're essentially doing a hash join, except that it's a
> lot less pleasant than on a relational DB b/c you've got network
> round trips for each GET. The disk blocks from the course table (I'm
> assuming it's the smaller side) will likely be cached so at least
> that part will be fast--you'll be answering those questions from
> memory, not via disk IO.

Whoa, what?
I thought a multi-get would reduce network round trips, as it only
accesses each region *one* time, fetching all the queried keys and
values from there.
If your data is randomly distributed, this could cost about as much as
doing several Gets in a loop, but it should work better if several keys
are part of the same region.
Am I right, or did I misunderstand the concept?

> 3. You could also let a higher client layer worry about this. For
> example, your data layer query just returns a student with a list of
> their course IDs, and then another process in your client code looks
> up each course by ID to get the name. You can then put an external
> caching layer (like memcached) in the middle and make things a lot
> faster (though that does put the burden on you to have the code path
> for changing course info also flush the relevant cache entries). In
> your example, it's unlikely any institution would have more than a
> few thousand courses, so they'd probably all stay in memory and be
> served instantaneously.

Hm, assuming that the number of courses is small enough to fit in RAM,
in what way does this give me an advantage over using HBase?
I know that Memcached is optimized for this purpose and might have much
faster response times, no doubt.
However, from a conceptual point of view: why does Memcached handle the
key-value distribution more efficiently than an HBase with warmed caches?
Hopefully this question isn't that hard :).

> This might seem laborious, and to a degree it is. But note that it's
> difficult to see the utility of HBase with toy examples like this; if
> you're really storing courses and students, don't use HBase (unless
> you've got billions of students and courses, which seems unlikely).
> The extra thought you have to put in to making schemas work for you
> in HBase is only worth it when it gives you the ability to scale to
> gigantic data sets where other solutions wouldn't.

Well, the background is a private project.
I know that it's a lot easier to do what I want in an RDBMS and that
there is no real need for a highly scalable beast like HBase.
However, I want to learn something new, and since I do not break
anyone's business by trying out new technology privately, I want to go
with HStack.
Without ever trying it, you never get a real feeling for when a tool is
the right one.
Using a good tool for the wrong problem can be an interesting
experience, since you learn some of the do's and don'ts of the software
you use.

Since I am a reader of the MEAP edition of HBase in Action, I am aware
of the TwitBase example application presented in that book.
I am very interested in seeing the author present a solution for
efficiently accessing the tweets of the people I follow.
This is an n:m relation.
You have n users with m tweets, and each user sees his own tweets as
well as the tweets of the people he follows, in descending order by
timestamp.
In an RDBMS this must be done with a join (and maybe in HBase as well),
since I cannot think of another scalable way of doing it.
However, if you do this with a join, a person who follows 40,000
accounts needs a batch request consisting of 40,000 Get objects.
That's huge, and I bet it is anything but fast or scalable.
It sounds broken by design when designing for Big Data.
Therefore I am interested in general best practices for such problems.
Maybe this is a better example for showing the possibilities of HBase
than the students-and-courses example.

Thanks for sharing your insights!

Em

On 29.05.2012 17:08, Ian Varley wrote:
> Em,
>
> What you're describing is a classic relational database nested loop or hash
> join; the only difference is that relational databases have this feature
> built in, and can do it very efficiently because they typically run on a
> single machine, not a distributed cluster. By moving to HBase, you're
> explicitly making a tradeoff that's worse for this kind of usage, in exchange
> for having horizontally scalable data storage (i.e. you can scale to TB or PB
> of data).
> But the reality is that this makes what you're describing a lot
> harder to do.
>
> A real answer to this question would involve talking a lot about JOIN theory
> in relational databases: when do optimizers choose nested loop joins vs. hash
> joins or merge joins? How do you know which side of a join to drive from
> (HBase doesn't keep stats, nor does it have an optimizer for that matter).
> There's not really a general "what's the right way to do this", divorced from
> those kinds of questions.
>
> That said, I can see at least a couple ways to make this particular operation
> (get all courses for one student) efficient in HBase:
>
> 1. You could denormalize the additional information (e.g. course name) into
> the students table. Then, you're simply reading the student row, and all the
> info you need is there. That places an extra burden of write time and disk
> space, and does make you do a lot more work when a course name changes.
>
> 2. You could do what you're talking about in your HBase access code: find the
> list of course IDs you need for the student, and do a multi get on the course
> table. Fundamentally, this won't be much more efficient to do in batch mode,
> because the courses are likely to be evenly spread out over the region
> servers (orthogonal to the students). You're essentially doing a hash join,
> except that it's a lot less pleasant than on a relational DB b/c you've got
> network round trips for each GET. The disk blocks from the course table (I'm
> assuming it's the smaller side) will likely be cached so at least that part
> will be fast--you'll be answering those questions from memory, not via disk
> IO.
>
> 3. You could also let a higher client layer worry about this. For example,
> your data layer query just returns a student with a list of their course IDs,
> and then another process in your client code looks up each course by ID to
> get the name.
> You can then put an external caching layer (like memcached) in
> the middle and make things a lot faster (though that does put the burden on
> you to have the code path for changing course info also flush the relevant
> cache entries). In your example, it's unlikely any institution would have
> more than a few thousand courses, so they'd probably all stay in memory and
> be served instantaneously.
>
> This might seem laborious, and to a degree it is. But note that it's
> difficult to see the utility of HBase with toy examples like this; if you're
> really storing courses and students, don't use HBase (unless you've got
> billions of students and courses, which seems unlikely). The extra thought
> you have to put in to making schemas work for you in HBase is only worth it
> when it gives you the ability to scale to gigantic data sets where other
> solutions wouldn't.
>
> Ian
>
> On May 29, 2012, at 9:28 AM, Em wrote:
>
>> Hi,
>>
>> thanks for your help.
>> Yes, I know these slides.
>> However I can not find an answer to how to access such schemas efficiently.
>> In case of the given schema for students and courses as in those slides,
>> they say that each column contains the student's id / course's id.
>> However, when you want to build a GUI, you want to get all the courses
>> for a given student and display their names.
>> You *have* the column-names which represent the ids of the courses,
>> however to get the human readable name of a course, you have to access
>> the course-table.
>>
>> I understand the schema, agree with it, but my question was how to
>> access this data efficiently within an application / how to implement
>> the needed behaviour efficiently.
>>
>> Thanks!
>> :)
>> Em
>>
>> On 29.05.2012 12:49, shashwat shriparv wrote:
>>> Check out this link may be it will help you somewhat:
>>>
>>> http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies
>>>
>>> On Tue, May 29, 2012 at 4:09 PM, Michel Segel <[email protected]> wrote:
>>>
>>>> Depends...
>>>> Try looking at a hierarchical model rather than a relational model...
>>>>
>>>> One thing to remember is that joins are expensive in HBase.
>>>>
>>>>
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On May 28, 2012, at 12:50 PM, Em <[email protected]> wrote:
>>>>
>>>>> Hello list,
>>>>>
>>>>> I have some time now to try out HBase and want to use it for a private
>>>>> project.
>>>>>
>>>>> Questions like "How to I transfer one-to-many or many-to-many relations
>>>>> from my RDBMS's schema to HBase?" seem to be common.
>>>>>
>>>>> I hope we can throw all the best practices that are out there in this
>>>>> thread.
>>>>>
>>>>> As the wiki states:
>>>>> One should create two tables.
>>>>> One for students, another for courses.
>>>>>
>>>>> Within the students' table, one should add one column per selected
>>>>> course with the course_id besides some columns for the student itself
>>>>> (name, birthday, sex etc.).
>>>>>
>>>>> On the other hand one fills the courses table with one column per
>>>>> student_id besides some columns which describe the course itself (name,
>>>>> teacher, begin, end, year, location etc.).
>>>>>
>>>>> So far, so good.
>>>>>
>>>>> How do I access these tables efficiently?
>>>>>
>>>>> A common case would be to show all courses per student.
>>>>>
>>>>> To do so, one has to access the student-table and get all the student's
>>>>> courses-columns.
>>>>> Let's say their names are prefixed ids. One has to remove the prefix and
>>>>> then one accesses the courses-table to get all the courses and their
>>>>> metadata (name, teacher, location etc.).
>>>>>
>>>>> How do I do this kind of operation efficiently?
>>>>> The naive and brute force approach seems to be using a Get-object per
>>>>> course and fetch the neccessary data.
>>>>> Another approach seems to be using the HTable-class and unleash the
>>>>> power of "multigets" by using the batch()-method.
>>>>>
>>>>> All of the information above is theoretically, since I did not used it
>>>>> in code (I currently learn more about the fundamentals of HBase).
>>>>>
>>>>> That's why I give the question to you: How do you do this kind of
>>>>> operation by using HBase?
>>>>>
>>>>> Kind regards,
>>>>> Em
>>>>>
>>>>
>>>
>>>
>>>
>
>
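
The multi-get question in this thread (does batching only save round
trips when keys share a region?) can be illustrated with a small Python
sketch. It is a toy model, not HBase client code: the region start keys,
the course-NNNN row keys and the function names are all made up for
illustration, and it simply assumes one request per region touched, as
the reply above supposes. How the real client groups a batch is not
settled in this thread.

```python
from bisect import bisect_right

# Hypothetical region layout: each region starts at the given row key and
# runs up to the next start key (the first region starts at the empty key).
REGION_START_KEYS = ["", "course-2500", "course-5000", "course-7500"]


def region_for(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


def round_trips_single_gets(row_keys):
    """One request per key, as with issuing a Get per course in a loop."""
    return len(row_keys)


def round_trips_multi_get(row_keys):
    """One request per distinct region touched (the assumed batching model)."""
    return len({region_for(key) for key in row_keys})


# Four courses that sort next to each other land in one region ...
clustered = ["course-0001", "course-0002", "course-0003", "course-0004"]
# ... while four courses spread over the key space hit four regions.
scattered = ["course-0001", "course-2600", "course-5100", "course-7600"]

print(round_trips_single_gets(clustered), round_trips_multi_get(clustered))  # 4 1
print(round_trips_single_gets(scattered), round_trips_multi_get(scattered))  # 4 4
```

Under this assumption, a batch whose keys fall into one region costs a
single request, while a batch whose keys are spread over all regions
costs as many requests as a loop of single Gets.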
