Em,

What you're describing is a classic relational database nested loop or hash 
join; the only difference is that relational databases have this feature built 
in, and can do it very efficiently because they typically run on a single 
machine, not a distributed cluster. By moving to HBase, you're explicitly 
making a tradeoff that's worse for this kind of usage, in exchange for having 
horizontally scalable data storage (i.e. you can scale to TB or PB of data). 
But the reality is that this makes what you're describing a lot harder to do.

A real answer to this question would involve talking a lot about JOIN theory in 
relational databases: when do optimizers choose nested loop joins vs. hash 
joins or merge joins? How do you know which side of a join to drive from? (HBase 
doesn't keep stats, nor does it have an optimizer, for that matter.) There 
isn't really a general "right way to do this" divorced from those kinds of 
questions.

That said, I can see at least three ways to make this particular operation 
(get all courses for one student) efficient in HBase:

1. You could denormalize the additional information (e.g. course name) into the 
students table. Then you're simply reading the student row, and all the info 
you need is there. That costs extra write time and disk space, and it means a 
lot more work when a course name changes, since every student row that carries 
a copy of that name has to be rewritten.

2. You could do what you're talking about in your HBase access code: find the 
list of course IDs you need for the student, and do a multi-get on the course 
table. Fundamentally, doing this in batch mode won't be much more efficient 
than issuing the GETs one at a time, because the courses are likely to be 
evenly spread out over the region servers (orthogonal to the students). You're 
essentially doing a hash join, except that it's a lot less pleasant than on a 
relational DB because you've got network round trips for each GET. The disk 
blocks from the course table (I'm assuming it's the smaller side) will likely 
be cached, so at least that part will be fast: you'll be answering those 
lookups from memory, not via disk IO.

3. You could also let a higher client layer worry about this. For example, your 
data layer query just returns a student with a list of their course IDs, and 
then another process in your client code looks up each course by ID to get the 
name. You can then put an external caching layer (like memcached) in the middle 
and make things a lot faster (though that does put the burden on you to have 
the code path for changing course info also flush the relevant cache entries). 
In your example, it's unlikely any institution would have more than a few 
thousand courses, so they'd probably all stay in memory and be served 
instantaneously.
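
Options 2 and 3 above can be sketched together roughly as follows. This is a minimal illustration in Python, not HBase client code: plain dictionaries stand in for the two tables, and the "course:" column prefix, the CountingTable wrapper, and all function names are assumptions made up for the example. The in-process dict used as a cache is only a stand-in for an external layer like memcached.

```python
# Stand-ins for HBase tables: row key -> {column qualifier -> value}.
# Assumed (hypothetical) layout: a student row has one column per course,
# named "course:<course_id>"; a course row has a "name" column.
STUDENTS = {
    "s1": {"name": "Ada", "course:c1": "", "course:c3": ""},
}
COURSES = {
    "c1": {"name": "Algebra"},
    "c2": {"name": "Biology"},
    "c3": {"name": "Chemistry"},
}
COURSE_PREFIX = "course:"


class CountingTable:
    """Wraps a dict of rows and counts client calls made against it."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = 0

    def get(self, key):
        self.calls += 1
        return self.rows.get(key, {})

    def multi_get(self, keys):
        # One client call here; on a real cluster the keys would still be
        # fanned out to whichever region servers hold them.
        self.calls += 1
        return {k: self.rows[k] for k in keys if k in self.rows}


def course_ids_for(student_row):
    """Strip the column prefix to recover the course IDs, in sorted order."""
    return sorted(
        q[len(COURSE_PREFIX):] for q in student_row if q.startswith(COURSE_PREFIX)
    )


def course_names(students, courses, student_id, cache):
    """Client-side join: read the student row, then look up only uncached courses."""
    ids = course_ids_for(students.get(student_id))
    missing = [cid for cid in ids if cid not in cache]
    if missing:
        for cid, row in courses.multi_get(missing).items():
            cache[cid] = row["name"]
    return [cache[cid] for cid in ids if cid in cache]


def rename_course(courses, cache, course_id, new_name):
    """Write path: update the course row and flush the stale cache entry."""
    courses.rows[course_id]["name"] = new_name
    cache.pop(course_id, None)
```

The point of the sketch is the shape of the read and write paths: the read path only goes to the course table for IDs the cache has not seen, and the write path that changes a course name is also responsible for dropping the cached entry.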

This might seem laborious, and to a degree it is. But note that it's difficult 
to see the utility of HBase with toy examples like this; if you're really 
storing courses and students, don't use HBase (unless you've got billions of 
students and courses, which seems unlikely). The extra thought you have to put 
into making schemas work for you in HBase is only worth it when it gives you 
the ability to scale to gigantic data sets where other solutions wouldn't.

Ian

On May 29, 2012, at 9:28 AM, Em wrote:

> Hi,
> 
> thanks for your help.
> Yes, I know these slides.
> However, I cannot find an answer to how to access such schemas efficiently.
> In case of the given schema for students and courses as in those slides,
> they say that each column contains the student's id / course's id.
> However, when you want to build a GUI, you want to get all the courses
> for a given student and display their names.
> You *have* the column-names which represent the ids of the courses,
> however to get the human readable name of a course, you have to access
> the course-table.
> 
> I understand the schema, agree with it, but my question was how to
> access this data efficiently within an application / how to implement
> the needed behaviour efficiently.
> 
> Thanks! :)
> Em
> 
> On 29.05.2012 12:49, shashwat shriparv wrote:
>> Check out this link; maybe it will help you somewhat:
>> 
>> http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies
>> 
>> On Tue, May 29, 2012 at 4:09 PM, Michel Segel 
>> <[email protected]> wrote:
>> 
>>> Depends...
>>> Try looking at a hierarchical model rather than a relational model...
>>> 
>>> One thing to remember is that joins are expensive in HBase.
>>> 
>>> 
>>> 
>>> Sent from a remote device. Please excuse any typos...
>>> 
>>> Mike Segel
>>> 
>>> On May 28, 2012, at 12:50 PM, Em <[email protected]> wrote:
>>> 
>>>> Hello list,
>>>> 
>>>> I have some time now to try out HBase and want to use it for a private
>>>> project.
>>>> 
>>>> Questions like "How do I transfer one-to-many or many-to-many relations
>>>> from my RDBMS's schema to HBase?" seem to be common.
>>>> 
>>>> I hope we can throw all the best practices that are out there in this
>>>> thread.
>>>> 
>>>> As the wiki states:
>>>> One should create two tables.
>>>> One for students, another for courses.
>>>> 
>>>> Within the students' table, one should add one column per selected
>>>> course with the course_id besides some columns for the student itself
>>>> (name, birthday, sex etc.).
>>>> 
>>>> On the other hand one fills the courses table with one column per
>>>> student_id besides some columns which describe the course itself (name,
>>>> teacher, begin, end, year, location etc.).
>>>> 
>>>> So far, so good.
>>>> 
>>>> How do I access these tables efficiently?
>>>> 
>>>> A common case would be to show all courses per student.
>>>> 
>>>> To do so, one has to access the student-table and get all the student's
>>>> courses-columns.
>>>> Let's say their names are prefixed ids. One has to remove the prefix and
>>>> then one accesses the courses-table to get all the courses and their
>>>> metadata (name, teacher, location etc.).
>>>> 
>>>> How do I do this kind of operation efficiently?
>>>> The naive, brute-force approach seems to be using one Get object per
>>>> course and fetching the necessary data.
>>>> Another approach seems to be using the HTable-class and unleash the
>>>> power of "multigets" by using the batch()-method.
>>>> 
>>>> All of the information above is theoretical, since I have not used it
>>>> in code yet (I am currently learning more about the fundamentals of HBase).
>>>> 
>>>> That's why I am putting the question to you: How do you do this kind of
>>>> operation using HBase?
>>>> 
>>>> Kind regards,
>>>> Em
>>>> 
>>> 
>> 
>> 
>> 
