RE: Stargate+hbase

Buttler, David Fri, 25 Mar 2011 10:18:37 -0700

Hmmm.... maybe my mental model is deficient.  How do you propose building a 
secondary index without a transaction?

The reason indexes work is that they store the data in a different way than the 
primary table.  That implies a second, independent data storage.  Without a 
transaction you can't be guaranteed that the second data structure is always 
updated in sync with the primary table.

I suppose you could roll the multiple data writes into the initial data write 
-- that would work if you have write-once data.  But if you partially update 
the data then you have the issue that you may not have enough information in 
the update to correctly write the key for the secondary data stores.  This 
would mean (in general) that you would have to read an entire row before you 
update any part of it so that you can maintain the secondary structures.  Do 
you see the performance problem here? (or that you are introducing a limited 
transactional / eventually consistent state into the data store)
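
To make the read-before-write cost concrete, here is a minimal sketch. It is 
not HBase code: plain Python dicts stand in for the primary and index tables, 
and the names (primary, index, INDEXED_COLUMN, put) are hypothetical. It 
shows a partial update that has to read the old row before it can fix the 
index.

```python
# Hypothetical in-memory stand-ins for a primary table and an index table.
primary = {}   # row key -> {column: value}
index = {}     # (indexed value, row key) -> None

INDEXED_COLUMN = "city"  # assumed name of the column being indexed


def put(row_key, updates):
    """Apply a partial update and keep the index entry in step with it."""
    # Read first: the update may not carry the old indexed value, and
    # without that value the stale index entry cannot be located.
    old_row = primary.get(row_key, {})
    old_value = old_row.get(INDEXED_COLUMN)

    new_row = {**old_row, **updates}
    new_value = new_row.get(INDEXED_COLUMN)

    primary[row_key] = new_row                      # write 1: primary table
    if old_value != new_value:
        if old_value is not None:
            index.pop((old_value, row_key), None)   # write 2: drop stale entry
        if new_value is not None:
            index[(new_value, row_key)] = None      # write 3: add new entry
```

Every put costs one read plus up to three writes, and nothing here makes those 
writes atomic: a failure between write 1 and write 3 leaves the index out of 
step with the primary table.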

There may be optimizations where you could skip that part of the code if there 
were no indexes.  But now you are talking about greatly increasing the 
complexity of the codebase for a use case that is somewhat specialized.  Hence, 
you see that people who really care about secondary indexes / transactional 
HBase maintain separate packages.  They probably don't do the job as well as 
would be possible if the code were rolled into HBase proper, but on the other 
hand, neither do they increase the complexity of the main code branch (hence 
they don't slow down the core development work).

I will stand by my point that there are engineering trade-offs to be made.  
Take the unix philosophy: small components, loosely coupled. If you need 
indexes, build them on top of HBase, not inside of HBase.  Using things like 
co-processors allows you to extend the capabilities of HBase in a way that does 
not impact the core product and hurt all of the other users. An example of 
building on top is OpenTSDB.  It is a time-series database that uses HBase 
under the covers, but it doesn't ask that HBase support its needs in some 
special way.  It is very instructive to see how it was constructed.
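
As a rough illustration of building the index on top, here is a sketch of the 
recipe quoted further down (write to a secondary table keyed on the searched 
value plus the primary key, then range-scan it), applied to the keyword 
question that started this thread.  It is a toy, not HBase code: SortedTable, 
SEP, write_row, rows_with and rows_with_all are hypothetical names, and a 
sorted Python list stands in for a table sorted by row key.

```python
import bisect

SEP = "\x00"  # assumed separator that never occurs in keywords or row keys


class SortedTable:
    """Toy stand-in for a table whose keys are kept in sorted order."""

    def __init__(self):
        self._keys = []

    def put(self, key):
        # Insert the key at its sorted position unless it is already there.
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def scan_prefix(self, prefix):
        # Range scan: start at the first key >= prefix, stop at the first
        # key that no longer begins with the prefix.
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i]
            i += 1


primary = {}                    # row key -> set of keywords
keyword_index = SortedTable()   # compound keys: keyword + SEP + row key


def write_row(row_key, keywords):
    """Write to the primary table, then to the secondary (index) table."""
    primary[row_key] = set(keywords)
    for kwd in keywords:
        keyword_index.put(kwd + SEP + row_key)


def rows_with(kwd):
    """Row keys of all primary rows that contain the given keyword."""
    prefix = kwd + SEP
    return {key[len(prefix):] for key in keyword_index.scan_prefix(prefix)}


def rows_with_all(*kwds):
    """Row keys of all primary rows that contain every given keyword."""
    sets = [rows_with(k) for k in kwds]
    return set.intersection(*sets) if sets else set()
```

With row1 written with kwd1, kwd2, kwd3 and row2 with kwd1, kwd2, kwd4, kwd5, 
rows_with_all("kwd1", "kwd2") returns both row keys.  Making the primary key 
part of the compound key keeps index entries unique even when many rows share 
a keyword.  The two writes in write_row are still not atomic, so this index 
is only as consistent as the application makes it.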

Dave

-----Original Message-----
From: Weishung Chung [mailto:[email protected]] 
Sent: Friday, March 25, 2011 9:27 AM
To: [email protected]
Subject: Re: Stargate+hbase

Thank you so much for the informative info. It really helps me out.

For secondary index, even without transaction, I would think one could still
build a secondary index on another key especially if we have row level
locking. Correct me if I am wrong.

Also, I have read about the clustered B-Tree used in InnoDB to implement
secondary indexes, but I know that the B-Tree is the primary limitation when
it comes to scalability and the main reason why NoSQL systems have discarded
it. But it would be super nice to be able to build the secondary index without
using another secondary table in HBase.

I am not complaining, but I would love to see HBase continue to be the top
NoSQL solution out there :D
Way to go HBase !

On Fri, Mar 25, 2011 at 10:39 AM, Buttler, David <[email protected]> wrote:

> Do you know what it means to make secondary indexing a feature?  There are
> two reasonable outcomes:
> 1) adding ACID semantics (and thus killing scalability)
> 2) allowing the secondary index to be out of date (leading to every naïve
> user claiming that there is a serious bug that must be fixed).
>
> Secondary indexes are basically another way of storing (part of) the data.
>  E.g. another table, sorted on the field(s) that you want to search on.  In
> order to ensure consistency between the primary table and the secondary
> table (index), you have to guarantee that when you mutate the primary table,
> the secondary table is mutated in the same atomic transaction.  Since
> HBase only has row-level locks, this can't be guaranteed across tables.
>
> The situation is not hopeless, because in many cases you don't need to have
> perfectly consistent data and can afford to wait for cleanup tasks.  For
> some applications, you can ensure that the index is updated close enough to
> the table update (using external transactions, or something similar) that
> users would never notice.  One way to implement an eventually consistent
> secondary index would be to mimic the way cluster replication is done.
>
> However, what I have described is difficult to do generically -- and there
> are engineering tradeoffs that need to be made.  If you absolutely need a
> transactional and consistent secondary index, I would suggest using Oracle,
> MySQL, or another relational database, where this was designed in as a
> primary feature.  Just don't complain that they are too slow or don't scale
> as well as HBase.
>
> </rant>
>
> Sorry for the rant.  If you want to have a secondary index here is what you
> need to do:
> Modify your application so that every time you write to the primary table,
> you also write to a secondary table, keyed off of the values you want to
> search on.  If you can't guarantee that the values form a secondary key
> (i.e. are unique across your entire table), you can make your key a compound
> key (see, for example, how "tsuna" designed OpenTSDB) with your primary key
> as a component.
>
> Then, when you need to query, you can do range queries over the secondary
> table to retrieve the keys in the primary table to return the full data row.
>
> Dave
>
> -----Original Message-----
> From: Wei Shung Chung [mailto:[email protected]]
> Sent: Friday, March 25, 2011 12:04 AM
> To: [email protected]
> Subject: Re: Stargate+hbase
>
> I need to use secondary indexing too, hopefully this important feature
> will be made available soon :)
>
> Sent from my iPhone
>
> On Mar 25, 2011, at 12:48 AM, Stack <[email protected]> wrote:
>
> > There is no native support for secondary indices in HBase (currently).
> > You will have to manage it yourself.
> > St.Ack
> >
> > On Thu, Mar 24, 2011 at 10:47 PM, sreejith P. K. <[email protected]> wrote:
> >> I have tried secondary indexing. It seems I am missing some points. Could
> >> you please explain how it is possible using secondary indexing?
> >>
> >>
> >> I have tried like,
> >>
> >>
> >> row1         ColumnFamily1:kwd1
> >>              ColumnFamily1:kwd2
> >>              ColumnFamily1:kwd3
> >>              ColumnFamily1:kwd2
> >>
> >> row2         ColumnFamily1:kwd1
> >>              ColumnFamily1:kwd2
> >>              ColumnFamily1:kwd4
> >>              ColumnFamily1:kwd5
> >>
> >>
> >> I need to get all rows which contain kwd1 and kwd2
> >>
> >> Please help.
> >> Thanks
> >>
> >>
> >> On Thu, Mar 24, 2011 at 9:57 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >>
> >>> What you are asking for is a secondary index, and it doesn't exist at
> >>> the moment in HBase (let alone REST). Googling a bit for "hbase
> >>> secondary indexing" will show you how people usually do it.
> >>>
> >>> J-D
> >>>
> >>> On Thu, Mar 24, 2011 at 6:18 AM, sreejith P. K. <[email protected]> wrote:
> >>>> Is it possible, using the Stargate interface to HBase, to fetch all
> >>>> rows where more than one column family:<qualifier> is present?
> >>>>
> >>>> For example: select rows which contain keyword:a and keyword:b?
> >>>>
> >>>> Thanks
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Sreejith PK
> >> Nesote Technologies (P) Ltd
> >>
>
