Nick,
Short answer:
It all depends on how closely your key design overlaps with the use cases you
want to address. If (all of) your use cases map very closely to your key
design, you're in good hands; otherwise some tricks are warranted, like more
tables with duplicated data, pre-computations through M/R jobs, etc.
Very long answer:
In my experience, schema design [actually index-key design] is one of the
trickiest parts of HBase. It is unfortunate that one needs to understand the
internal architecture to extract optimal utilization and performance from
HBase for all but the most siloed use cases. It is ironic too, as schema
flexibility is one of the pillars on which the NoSQL movement stands, and
HBase provides it only partially: the schema is dynamically extensible, but
you are wedded to your index key from the start.
Now don't get me wrong. There are pros and cons with every technology. A bit
of such insight and trickery is required on the SQL side too. For example:
de-normalization and dropping FK references in SQL schemas go against the
best practices but work out much better in practice at scale. It is just
that there is a better knowledge base now, since SQL stores have been in
deployment for a long time. That's why schema design for SQL stores seems
like a "mostly solved" problem. I bet that in the 70s, when the technology
was coming up, schema design was not as commonly understood.
Anyway, here is how I understand HBase:
Features:
- Sorted key-value pair storage [sorted on key].
- Data retrieval by specifying a key [pattern].
- Composite key design [see the sketch after this list].
- Storage is hierarchically grouped based on what elements comprise the key,
so optimization is naturally possible along those lines.
  * The hierarchy is limited to 3-4 levels depending on how you count.
- Only one way to sort: the key that you define; thus effectively only one
index per table.
- Distributed storage - scales horizontally with data volume.
- Multiversioned cells: the same row+column combination can store many
versions of the data [mostly versioned by timestamp].
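To make the composite-key point concrete, here is a minimal sketch, with
everything made up for illustration: a hypothetical "events" table keyed by
userId plus a reversed timestamp, so that the newest events for a user sort
first (HBase keeps rows in ascending byte order of the key):

import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical composite row key: <userId><Long.MAX_VALUE - timestamp>.
// Subtracting the timestamp from Long.MAX_VALUE reverses the sort order,
// so the newest events for a user come first in a scan.
public class EventKey {
  public static byte[] make(String userId, long timestampMillis) {
    byte[] user = Bytes.toBytes(userId);
    byte[] reversedTs = Bytes.toBytes(Long.MAX_VALUE - timestampMillis);
    return Bytes.add(user, reversedTs);
  }
}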
Good for:
- GET calls based on a specific key work great for real-time lookups [sketch
after this list].
- Less contention between PUTs and GETs on the same "row". I think(?) the
contention is at the cell level.
- Storage patterns that form a sparse matrix, where you are interested in
only a group of columns at a time per row.
- Exploiting Hadoop's strength of M/R jobs on the same data: so no data
duplication.*
- Other Hadoop benefits like redundancy, replication, etc.
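Here is how a real-time point lookup on that hypothetical key might look, as
a sketch against the 0.94-era Java client [table, family, and qualifier
names are made up; EventKey is the helper from the earlier sketch]:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PointLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");  // hypothetical table
    try {
      Get get = new Get(EventKey.make("user42", 1352380000000L));
      // Ask only for the family we need: on a sparse table this avoids
      // reading the column families we don't care about.
      get.addFamily(Bytes.toBytes("d"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("msg"));
      System.out.println(value == null ? "miss" : Bytes.toString(value));
    } finally {
      table.close();
    }
  }
}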
Not useful for:
- Range queries in real time across lots of rows [esp. when the range filter
criteria don't align with the key design; see the scan sketch after this
list].
- GETs requiring all columns of that sparse table all the time.
- Group By/Top-K/count(*) kinds of real-time queries.
- Sorting/counting on values for real-time queries [esp. across rows].
- Sorting on a different combination of key elements than how they are laid
out in the key.
- Joins across tables.
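To show what I mean about range queries having to align with the key design:
with the key above, all events for one user form a contiguous slice of the
table, so the scan below is cheap; filtering on anything that is not a
prefix of the key would force a full-table scan instead. Again a sketch with
made-up names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class UserEventScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");  // same hypothetical table
    try {
      // The [start, stop) range "user42".."user43" covers every composite
      // key that begins with user42, because rows sort lexicographically.
      Scan scan = new Scan(Bytes.toBytes("user42"), Bytes.toBytes("user43"));
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          System.out.println(Bytes.toStringBinary(r.getRow()));
        }
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
  }
}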
So, HBase is very good where its strengths are, but I certainly won't say
all SQL loads can be transferred to HBase with the same or better
performance expectations. From your previous mail, it seems your queries are
more SQL-like, and, at the risk of being considered an outcast here, my
honest advice would be to also look into more document-oriented data stores
like Mongo, which can scale to the volume you mentioned and may be able to
support the range queries on multiple indexes that you are looking for.
Hth,
Abhishek
-----Original Message-----
From: Jerry Lam [mailto:[email protected]]
Sent: Thursday, November 08, 2012 7:32 AM
To: user
Subject: Re: Nosqls schema design
Hi Nick:
Your question is a good and tough one. I haven't found anything that helps
in guiding schema design in the NoSQL world. There are general concepts, but
none of them is close to SQL schema design, where you can apply some rules
to guide your decisions.
The best presentation I have found about the general concepts in HBase
schema design is at
http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-hbasecon-2012.html
(search for Schema Design there). From this presentation, you can learn why
it is so difficult to come up with a suggestion for your problem and learn
some best practices to start your own design.
HTH,
Jerry
On Thu, Nov 8, 2012 at 10:17 AM, Nick maillard <
[email protected]> wrote:
> Thanks for the answers.
>
> I'm trying to really make sense of NoSQL, and HBase in particular. The
> software part has a lot of loopholes and I'm still fighting off the
> compaction storm issue, so right now I would not say HBase is fast when
> it comes to writing.
>
> But my post was more about NoSQL schema thoughts: after so long on SQL
> schemas it takes a little time to stop thinking that way, in terms of
> schema but also in terms of questions, or of interactions if you'd
> rather.
> So contrary to SQL, I cannot think up a logical model for the data and
> figure out later what I'll want out of it.
>
> In my case I stated 10 TB, but this is very likely to grow since it is
> the starting scenario. I do believe having a 30-minute latency before
> ingesting logs is not an issue; however, the questions to HBase
> must be answered in a real-time manner.
>
> I have been trying to play with my questions and see how they can fit
> in a row key and/or column families, but since they are different in
> nature and purpose, I ended up supposing they would land in a number of
> different HBase tables in order to address the scope of questions. One
> table for one to three questions.
> The questions have joins and filters embedded in them.
>
> My post was about getting your insight on how you would go about
> answering these types of issues, and what your schemas might be; overall,
> how to switch from a SQL vision to a NoSQL vision.
> Coprocessors to create a couple of tables on the fly for all questions
> are an interesting way. To mapreduce the logs, however, I am afraid the
> performance would be too slow. I was thinking of answering in
> milliseconds if possible. But this might be me being new and not
> evaluating correctly.
>