Nick


Short answer:

It all depends on how well your key design overlaps with the use cases you 
want to address. If (all of) your use cases map very closely to your key 
design, you're in good hands; otherwise some tricks are warranted, like 
additional tables with duplicated data, pre-computation through M/R jobs, etc.



Very Long answer:



In my experience, schema design [actually index-key design] is one of the 
trickiest parts of HBase. It is unfortunate that one needs to understand the 
internal architecture to be able to extract optimal utilization and performance 
from HBase for all but the most siloed use cases. It is ironic too, as schema 
flexibility is one of the pillars on which the NoSQL movement stands, and 
HBase provides it only partially: the schema is dynamically extensible, pretty 
much, but the index key is locked in from the start.



Now don't get me wrong. There are pros and cons with every technology, and a 
bit of such insight and trickery is required on the SQL side too. For example: 
de-normalization and dropping FK references in SQL schemas go against the best 
practices but work out much better in practice at scale. It is just that there 
is a better knowledge base now, due to SQL stores being in deployment for a 
long time. That's why schema design for SQL stores seems like a "mostly 
solved" problem. I bet that in the 70s, when the technology was coming up, 
schema design was not as commonly understood.





Anyways, here is how I understand HBase:



Features:



- Sorted key-value pair storage [sorted on key]

- Data retrieval by specifying key [pattern].

- Composite key design.

- Storage is hierarchically grouped based on which elements comprise the key. 
Thus optimization is naturally possible along those lines.

       * The hierarchy is limited to 3-4 levels, depending on how you count.

- Only one way to sort: the key that you define; thus effectively only one 
index per table.

- Distributed storage - scales horizontally with data volume

- Multiversioned cells: the same row+column combination can store many 
versions of the data [mostly versioned by timestamp]
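To make the "sorted on key" point concrete, here is a minimal sketch -- plain Python standing in for HBase byte ordering, not the HBase API, and the user-id/timestamp fields are hypothetical -- of a composite row key whose byte order is the one and only sort order you get:

```python
import struct

# Hypothetical composite row key: <user_id><reversed timestamp>.
# Big-endian packing makes lexicographic byte order match numeric order,
# and reversing the timestamp makes the newest row sort first per user.
def row_key(user_id: int, ts_millis: int) -> bytes:
    return struct.pack(">I", user_id) + struct.pack(">Q", 2**64 - 1 - ts_millis)

keys = sorted(row_key(u, t) for u, t in [(2, 100), (1, 200), (1, 300)])
# All of user 1's rows cluster together, newest first, then user 2's rows.
```

Whatever access pattern that byte layout serves well is the one you get for free; everything else needs the tricks from the short answer.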



Good for:



- GET calls based on a specific key work great for real-time lookups.

- Less contention between PUTs and GETs on the same "row".  I think(?) the 
contention is at the cell level.

- If your storage pattern is a sparse matrix and you are interested only in a 
group of columns at a time per row.

- Exploit Hadoop's strength of M/R jobs on the same data: so no data 
duplication.*

- Other Hadoop benefits like redundancy, replication etc.
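The "retrieval by key [pattern]" strength above can be sketched in a few lines -- plain Python mimicking what a region server does with a start-row scan, with made-up row names:

```python
import bisect

def prefix_scan(sorted_keys: list, prefix: bytes) -> list:
    # Seek to the first key >= prefix, then read forward while the
    # prefix still matches: one seek plus a sequential read.
    i = bisect.bisect_left(sorted_keys, prefix)
    out = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out

rows = [b"user1#a", b"user1#b", b"user2#a"]  # already key-sorted
```

This is why lookups and short scans that lead with the key are cheap: the store never touches rows outside the matching range.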



Not useful for:

- Range queries in real time across lots of rows [esp. when the range-filter 
criteria don't align well with the index design]

- GETs requiring all columns of that sparse table all the time.

- Group By/Top-K/count(*) kind of real-time queries

- Sorting/counting on value for real time queries [ esp. across rows]

- Sorting on a different combination of key elements than how they are laid 
out in the key.

-  Joins across tables.
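When the access pattern does not match the key, the usual workaround (the "more tables with duplicated data" trick from the short answer) is to write the same payload into a second table keyed differently. A toy sketch, with Python dicts standing in for tables and made-up field names:

```python
# Two "tables" keyed for different access patterns; the payload is
# duplicated on write so each pattern gets its own effective index.
events_by_user = {}  # row key: (user_id, ts)
events_by_host = {}  # row key: (host, ts)

def put_event(user_id: str, host: str, ts: int, payload: str) -> None:
    events_by_user[(user_id, ts)] = payload
    events_by_host[(host, ts)] = payload  # duplicate write, extra storage

put_event("u1", "web-03", 42, "login ok")
```

The cost is extra storage and a double write (which you must keep consistent yourself); the benefit is that both lookups stay key-driven and real-time.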



So, HBase is very good where its strengths are, but I certainly won't say all 
SQL loads can be transferred to HBase with the same or better performance 
expectations. From your previous mail, it seems your queries are more 
SQL-like, and, at the risk of being considered an outcast here, my honest 
advice would be to also look into more document-oriented data stores like 
Mongo, which can scale to the volume you mentioned and may be able to support 
the range queries on multiple indexes that you are looking for.



Hth,

Abhishek





-----Original Message-----
From: Jerry Lam [mailto:[email protected]]
Sent: Thursday, November 08, 2012 7:32 AM
To: user
Subject: Re: Nosqls schema design



Hi Nick:



Your question is a good and tough one. I haven't found anything that helps in 
guiding schema design in the NoSQL world. There are general concepts, but 
none of them is close to SQL schema design, where you can apply some rules to 
guide your decision.



The best presentation I have found about the general concepts in HBase schema 
design is 
http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-hbasecon-2012.html

and

search for Schema Design. From this presentation, you can learn why it is so 
difficult to come up with a suggestion for your problem, and learn some best 
practices to start your own design.



HTH,



Jerry





On Thu, Nov 8, 2012 at 10:17 AM, Nick maillard 
<[email protected]> wrote:



> Thanks for the answers.

>

> I'm trying to really make sense of NoSQL, and HBase in particular. The

> software part has a lot of loopholes and I'm still fighting off the

> compaction storm issue, so right now I would not say HBase is fast when it

> comes to writing.

>

> But my post was more about NoSQL schema thoughts; after so long on SQL

> schemas it does take a little time to stop thinking that way, in terms

> of schema but also in terms of questions, or of interactions if you'd

> rather.

> So contrary to SQL, I cannot think up a logical model for the data and

> figure out later what I'll want out of it.

>

> In my case I stated 10 TB, but this is very likely to grow since it is

> the starting scenario. I do believe having a 30-minute latency before

> ingesting logs is not an issue; however, the questions to HBase

> must be answered in a real-time manner.

>

> I have been trying to play with my questions and see how they can fit

> in a row key and/or column families, but them being different in nature

> and purpose, I ended up supposing they would end up in a number of

> different HBase tables in order to address the scope of questions. One

> table for one to three questions.

> The questions have joins and filters embedded in them.

>

> My post was about getting your insight on how you would go about

> answering these types of issues, and what your schemas might be. Overall,

> how to switch from the SQL vision to the NoSQL vision.

> Coprocessors to create a couple of tables on the fly for all questions

> are an interesting way. To map-reduce the logs, however, I am afraid the

> performance would be too slow. I was thinking of answering in

> milliseconds if possible. But this might be me being new and not

> evaluating correctly.
