Ian,

thanks for your detailed response!

Let me give you feedback on each point:
> 1. You could denormalize the additional information (e.g. course
> name) into the students table. Then, you're simply reading the
> student row, and all the info you need is there. That places an extra
> burden of write time and disk space, and does make you do a lot more
> work when a course name changes.
That's exactly what I thought about, and that's why I avoided it. The
students-and-courses example is one you find at several points on the
web when describing the differences between relations in an RDBMS and
their translation into a key-value store.
In fact, everything you model with a key-value store like HBase,
Cassandra etc. can also be modeled as an RDBMS schema.
Since a lot of people, like me, are coming from that side, we must
re-learn several basic things.
It starts with understanding that you model a key-value store the way
you want to access the data, not the way the data relates to each other
(in general terms), and ends with translating the connections between
data into a key-value schema as well as possible.
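To illustrate the trade-off with the toy example (plain Python dicts standing in for the two tables; nothing here is real HBase API, and all names and data are invented):

```python
# Sketch: two ways to model the student -> courses relation in a
# key-value store, using plain dicts as stand-in "tables".

# Normalized: the student row stores only course IDs; getting the
# course names needs one extra read per course.
courses = {"c1": {"name": "Databases"}, "c2": {"name": "Algebra"}}
students_normalized = {"s1": {"name": "Em", "course_ids": ["c1", "c2"]}}

def course_names_normalized(student_id):
    ids = students_normalized[student_id]["course_ids"]
    return [courses[cid]["name"] for cid in ids]  # extra read per course

# Denormalized: the course name is copied into the student row, so one
# read answers the query -- at the cost of touching every student row
# when a course is renamed.
students_denormalized = {
    "s1": {"name": "Em", "courses": {"c1": "Databases", "c2": "Algebra"}}
}

def course_names_denormalized(student_id):
    return list(students_denormalized[student_id]["courses"].values())

def rename_course(course_id, new_name):
    courses[course_id]["name"] = new_name          # one write, normalized
    for row in students_denormalized.values():     # many writes, denorm.
        if course_id in row["courses"]:
            row["courses"][course_id] = new_name
```

The read path you optimize for decides which shape you pick, which is exactly the "model for access, not for relations" point.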


> 2. You could do what you're talking about in your HBase access code:
> find the list of course IDs you need for the student, and do a multi
> get on the course table. Fundamentally, this won't be much more
> efficient to do in batch mode, because the courses are likely to be
> evenly spread out over the region servers (orthogonal to the
> students). You're essentially doing a hash join, except that it's a
> lot less pleasant than on a relational DB b/c you've got network
> round trips for each GET. The disk blocks from the course table (I'm
> assuming it's the smaller side) will likely be cached so at least
> that part will be fast--you'll be answering those questions from
> memory, not via disk IO.

Whoa, what?
I thought a multiget would reduce network round trips because it
accesses each region only *once*, fetching all the queried keys and
values from there. If your data is randomly distributed, this could
cost the same as doing several Gets in a loop, but it should work
better when several keys belong to the same region.
Am I right, or did I misunderstand the concept?
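My mental model of the grouping is something like this toy Python sketch (the region boundaries and function names are made up, and the real client also groups by region server, which I'm glossing over here):

```python
# Toy model of multiget batching: the client groups requested keys by
# region, so round trips scale with the number of regions touched, not
# with the number of keys. Boundaries below are invented for the sketch.
from collections import defaultdict

REGIONS = [("a", "h"), ("h", "p"), ("p", "{")]  # [start, end) key ranges

def region_of(key):
    for start, end in REGIONS:
        if start <= key < end:
            return (start, end)
    raise KeyError(key)

def multiget_round_trips(keys):
    by_region = defaultdict(list)
    for key in keys:
        by_region[region_of(key)].append(key)
    return len(by_region)  # one batched request per region touched

def looped_get_round_trips(keys):
    return len(keys)       # one request per key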

> 3. You could also let a higher client layer worry about this. For
> example, your data layer query just returns a student with a list of
> their course IDs, and then another process in your client code looks
> up each course by ID to get the name. You can then put an external
> caching layer (like memcached) in the middle and make things a lot
> faster (though that does put the burden on you to have the code path
> for changing course info also flush the relevant cache entries). In
> your example, it's unlikely any institution would have more than a
> few thousand courses, so they'd probably all stay in memory and be
> served instantaneously.
Hm, in what way does this give me an advantage over using HBase,
assuming that the number of courses is small enough to fit in RAM?
I know that Memcached is optimized for this purpose and might have much
faster response times - no doubt.
However, from a conceptual point of view: why does Memcached handle the
key-value distribution more efficiently than HBase with warmed caches?
Hopefully this question isn't that hard :).
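Just to make sure we mean the same thing, here is how I picture the cache-aside layer you describe (plain dicts standing in for Memcached and for the HBase course table; all names invented):

```python
# Sketch of the cache-aside pattern: the cache sits in the client tier,
# so a hit costs no request to a region server at all, whereas even a
# fully cached HBase read is still a network hop. The dicts below are
# stand-ins for Memcached and for the course table.

hbase_courses = {"c1": "Databases", "c2": "Algebra"}  # stand-in table
cache = {}                                            # stand-in Memcached

def get_course_name(course_id):
    if course_id in cache:             # cache hit: no round trip
        return cache[course_id]
    name = hbase_courses[course_id]    # cache miss: read through
    cache[course_id] = name
    return name

def update_course_name(course_id, new_name):
    hbase_courses[course_id] = new_name
    cache.pop(course_id, None)         # the burden you mention: writers
                                       # must invalidate the cache entry
```

Is the advantage then purely that hits never leave the application tier, rather than anything about the key-value distribution itself?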

> This might seem laborious, and to a degree it is. But note that it's
> difficult to see the utility of HBase with toy examples like this; if
> you're really storing courses and students, don't use HBase (unless
> you've got billions of students and courses, which seems unlikely).
> The extra thought you have to put in to making schemas work for you
> in HBase is only worth it when it gives you the ability to scale to
> gigantic data sets where other solutions wouldn't.
Well, the background is a private project. I know that it would be a lot
easier to do what I want in an RDBMS and that there is no real need for
a highly scalable beast like HBase.
However, I want to learn something new, and since I don't break anyone's
business by trying out new technology privately, I want to go with
HStack.
Without ever doing it, you never get a real feeling for when to use the
right tool.
Using a good tool for the wrong problem can be an interesting
experience, since you learn some of the do's and don'ts of the software
you use.

Since I am a reader of the MEAP edition of HBase in Action, I am aware
of the TwitBase example application presented in that book.
I am very interested in seeing the author present a solution for
efficiently accessing the tweets of the persons I follow.
This is an n:m relation:
you have n users with m tweets, and each user sees his own tweets as
well as the tweets of followed persons, in descending order by
timestamp.
In an RDBMS this must be done with a join (and maybe in HBase as well),
since I cannot think of another scalable way of doing it.
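For the read side, the only scalable shape I can picture is a heap merge over per-user, time-sorted tweet lists - something like this Python sketch (data and names invented; in HBase the per-user descending order would presumably come from the rowkey design, e.g. a reversed timestamp):

```python
# Sketch: merge the already-sorted (newest-first) tweet lists of the
# followed users into one timeline, without sorting everything again.
import heapq
from itertools import islice

tweets = {  # per-user tweet lists, newest first: (timestamp, text)
    "alice": [(50, "a3"), (30, "a2"), (10, "a1")],
    "bob":   [(40, "b2"), (20, "b1")],
}

def timeline(followed_users, limit):
    # Each input list is sorted by descending timestamp, so a heap
    # merge yields the combined timeline lazily, newest first.
    streams = [tweets[user] for user in followed_users]
    merged = heapq.merge(*streams, key=lambda t: -t[0])
    return [text for _, text in islice(merged, limit)]
```

The merge itself is cheap; the part that worries me is fetching the head of every followed user's list in the first place, which is the join below.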

However, if you do this with a join, a person who follows 40,000 users
needs a batch request consisting of 40,000 Get objects. That's huge,
and I bet it is anything but fast or scalable. It sounds broken by
design when designing for Big Data.
Therefore I am interested in general best practices for such problems.

Maybe this is a better example for showing the possibilities of HBase
than a students and courses example.

Thanks for sharing your insights!

Em


On 29.05.2012 17:08, Ian Varley wrote:
> Em,
> 
> What you're describing is a classic relational database nested loop or hash 
> join; the only difference is that relational databases have this feature 
> built in, and can do it very efficiently because they typically run on a 
> single machine, not a distributed cluster. By moving to HBase, you're 
> explicitly making a tradeoff that's worse for this kind of usage, in exchange 
> for having horizontally scalable data storage (i.e. you can scale to TB or PB 
> of data). But the reality is that this makes what you're describing a lot 
> harder to do.
> 
> A real answer to this question would involve talking a lot about JOIN theory 
> in relational databases: when do optimizers choose nested loop joins vs. hash 
> joins or merge joins? How do you know which side of a join to drive from 
> (HBase doesn't keep stats, nor does it have an optimizer for that matter). 
> There's not really a general "what's the right way to do this", divorced from 
> those kinds of questions.
> 
> That said, I can see at least a couple ways to make this particular operation 
> (get all courses for one student) efficient in HBase:
> 
> 1. You could denormalize the additional information (e.g. course name) into 
> the students table. Then, you're simply reading the student row, and all the 
> info you need is there. That places an extra burden of write time and disk 
> space, and does make you do a lot more work when a course name changes.
> 
> 2. You could do what you're talking about in your HBase access code: find the 
> list of course IDs you need for the student, and do a multi get on the course 
> table. Fundamentally, this won't be much more efficient to do in batch mode, 
> because the courses are likely to be evenly spread out over the region 
> servers (orthogonal to the students). You're essentially doing a hash join, 
> except that it's a lot less pleasant than on a relational DB b/c you've got 
> network round trips for each GET. The disk blocks from the course table (I'm 
> assuming it's the smaller side) will likely be cached so at least that part 
> will be fast--you'll be answering those questions from memory, not via disk 
> IO.
> 
> 3. You could also let a higher client layer worry about this. For example, 
> your data layer query just returns a student with a list of their course IDs, 
> and then another process in your client code looks up each course by ID to 
> get the name. You can then put an external caching layer (like memcached) in 
> the middle and make things a lot faster (though that does put the burden on 
> you to have the code path for changing course info also flush the relevant 
> cache entries). In your example, it's unlikely any institution would have 
> more than a few thousand courses, so they'd probably all stay in memory and 
> be served instantaneously.
> 
> This might seem laborious, and to a degree it is. But note that it's 
> difficult to see the utility of HBase with toy examples like this; if you're 
> really storing courses and students, don't use HBase (unless you've got 
> billions of students and courses, which seems unlikely). The extra thought 
> you have to put in to making schemas work for you in HBase is only worth it 
> when it gives you the ability to scale to gigantic data sets where other 
> solutions wouldn't.
> 
> Ian
> 
> On May 29, 2012, at 9:28 AM, Em wrote:
> 
>> Hi,
>>
>> thanks for your help.
>> Yes, I know these slides.
>> However I can not find an answer to how to access such schemas efficiently.
>> In case of the given schema for students and courses as in those slides,
>> they say that each column contains the student's id / course's id.
>> However, when you want to build a GUI, you want to get all the courses
>> for a given student and display their names.
>> You *have* the column-names which represent the ids of the courses,
>> however to get the human readable name of a course, you have to access
>> the course-table.
>>
>> I understand the schema, agree with it, but my question was how to
>> access this data efficiently within an application / how to implement
>> the needed behaviour efficiently.
>>
>> Thanks! :)
>> Em
>>
>> On 29.05.2012 12:49, shashwat shriparv wrote:
>>> Check out this link may be it will help you somewhat:
>>>
>>> http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies
>>>
>>> On Tue, May 29, 2012 at 4:09 PM, Michel Segel 
>>> <[email protected]>wrote:
>>>
>>>> Depends...
>>>> Try looking at a hierarchical model rather than a relational model...
>>>>
>>>> One thing to remember is that joins are expensive in HBase.
>>>>
>>>>
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On May 28, 2012, at 12:50 PM, Em <[email protected]> wrote:
>>>>
>>>>> Hello list,
>>>>>
>>>>> I have some time now to try out HBase and want to use it for a private
>>>>> project.
>>>>>
>>>>> Questions like "How do I transfer one-to-many or many-to-many relations
>>>>> from my RDBMS's schema to HBase?" seem to be common.
>>>>>
>>>>> I hope we can throw all the best practices that are out there in this
>>>>> thread.
>>>>>
>>>>> As the wiki states:
>>>>> One should create two tables.
>>>>> One for students, another for courses.
>>>>>
>>>>> Within the students' table, one should add one column per selected
>>>>> course with the course_id besides some columns for the student itself
>>>>> (name, birthday, sex etc.).
>>>>>
>>>>> On the other hand one fills the courses table with one column per
>>>>> student_id besides some columns which describe the course itself (name,
>>>>> teacher, begin, end, year, location etc.).
>>>>>
>>>>> So far, so good.
>>>>>
>>>>> How do I access these tables efficiently?
>>>>>
>>>>> A common case would be to show all courses per student.
>>>>>
>>>>> To do so, one has to access the student-table and get all the student's
>>>>> courses-columns.
>>>>> Let's say their names are prefixed ids. One has to remove the prefix and
>>>>> then one accesses the courses-table to get all the courses and their
>>>>> metadata (name, teacher, location etc.).
>>>>>
>>>>> How do I do this kind of operation efficiently?
>>>>> The naive and brute force approach seems to be using a Get-object per
>>>>> course and fetching the necessary data.
>>>>> Another approach seems to be using the HTable-class and unleash the
>>>>> power of "multigets" by using the batch()-method.
>>>>>
>>>>> All of the information above is theoretical, since I have not used it
>>>>> in code (I am currently learning more about the fundamentals of HBase).
>>>>>
>>>>> That's why I give the question to you: How do you do this kind of
>>>>> operation by using HBase?
>>>>>
>>>>> Kind regards,
>>>>> Em
>>>>>
>>>>
>>>
>>>
>>>
> 
> 
