+1 Thank you, David, for the great explanation. It's complicated. I am pretty new to this Big Data space; I find it really interesting and always want to learn more about it. I will definitely look into OpenTSDB as suggested. Thanks again :D
On Fri, Mar 25, 2011 at 12:18 PM, Buttler, David <[email protected]> wrote:
> Hmmm.... maybe my mental model is deficient.  How do you propose building a
> secondary index without a transaction?
>
> The reason indexes work is that they store the data in a different way than
> the primary table.  That implies a second, independent data storage.
> Without a transaction you can't be guaranteed that the second data
> structure is always updated in sync with the primary table.
>
> I suppose you could roll the multiple data writes into the initial data
> write -- that would work if you have write-once data.  But if you partially
> update the data then you have the issue that you may not have enough
> information in the update to correctly write the key for the secondary data
> stores.  This would mean (in general) that you would have to read an entire
> row before you update any part of it so that you can maintain the secondary
> structures.  Do you see the performance problem here?  (or that you are
> introducing a limited transactional / eventually consistent state into the
> data store)
>
> There may be optimizations where you could skip that part of the code if
> there were no indexes.  But now you are talking about greatly increasing the
> complexity of the codebase for a use case that is somewhat specialized.
> Hence, you see that people who really care about secondary indexes /
> transactional hbase have separate packages.  They probably don't do the job
> as well as is ideally possible by rolling the code into hbase proper, but on
> the other hand, neither do they increase the complexity of the main code
> branch (hence they don't slow down the core development work).
>
> I will stand by my point that there are engineering trade-offs to be made.
> Take the unix philosophy: small components, loosely coupled.  If you need
> indexes, build them on top of HBase, not inside of HBase.
> Using things like co-processors allows you to extend the capabilities of
> HBase in a way that does not impact the core product and hurt all of the
> other users.  An example of this is OpenTSDB.  It is a time-series database
> that uses hbase under the covers, but it doesn't ask that hbase support its
> needs in some special way.  It is very instructive to see how it was
> constructed.
>
> Dave
>
>
> -----Original Message-----
> From: Weishung Chung [mailto:[email protected]]
> Sent: Friday, March 25, 2011 9:27 AM
> To: [email protected]
> Subject: Re: Stargate+hbase
>
> Thank you so much for the informative info. It really helps me out.
>
> For secondary index, even without transaction, I would think one could still
> build a secondary index on another key especially if we have row level
> locking. Correct me if I am wrong.
>
> Also, I have read about clustered B-Tree used in InnoDB to implement
> secondary index but I know that B-Tree is the primary limitation when it
> comes to scalability and the main reason why NoSQL have discarded B-Tree.
> But it would be super nice to be able to build the secondary index without
> using another secondary table in HBase.
>
> I am not complaining but I would love to see HBase continue to be the top
> NoSQL solution out there :D
> Way to go HBase !
>
> On Fri, Mar 25, 2011 at 10:39 AM, Buttler, David <[email protected]> wrote:
>
> > Do you know what it means to make secondary indexing a feature?  There are
> > two reasonable outcomes:
> > 1) adding ACID semantics (and thus killing scalability)
> > 2) allowing the secondary index to be out of date (leading to every naïve
> > user claiming that there is a serious bug that must be fixed).
> >
> > Secondary indexes are basically another way of storing (part of) the data.
> > E.g. another table, sorted on the field(s) that you want to search on.
> > In order to ensure consistency between the primary table and the secondary
> > table (index), you have to guarantee that when you mutate the primary
> > table, the secondary table is mutated in the same atomic transaction.
> > Since HBase only has row-level locks, this can't be guaranteed across
> > tables.
> >
> > The situation is not hopeless, because in many cases you don't need to have
> > perfectly consistent data and can afford to wait for cleanup tasks.  For
> > some applications, you can ensure that the index is updated close enough to
> > the table update (using external transactions, or something similar) that
> > users would never notice.  One way to implement an eventually consistent
> > secondary index would be to mimic the way cluster replication is done.
> >
> > However, what I have described is difficult to do generically -- and there
> > are engineering tradeoffs that need to be made.  If you absolutely need a
> > transactional and consistent secondary index, I would suggest using Oracle,
> > MySQL, or another relational database, where this was designed in as a
> > primary feature.  Just don't complain that they are too slow or don't scale
> > as well as HBase.
> >
> > </rant>
> >
> > Sorry for the rant.  If you want to have a secondary index here is what you
> > need to do:
> > Modify your application so that every time you write to the primary table,
> > you also write to a secondary table, keyed off of the values you want to
> > search on.  If you can't guarantee that the values form a secondary key
> > (i.e. are unique across your entire table), you can make your key a
> > compound key (see, for example, how "tsuna" designed OpenTSDB) with your
> > primary key as a component.
> >
> > Then, when you need to query, you can do range queries over the secondary
> > table to retrieve the keys in the primary table to return the full data
> > row.
> >
> > Dave
> >
> > -----Original Message-----
> > From: Wei Shung Chung [mailto:[email protected]]
> > Sent: Friday, March 25, 2011 12:04 AM
> > To: [email protected]
> > Subject: Re: Stargate+hbase
> >
> > I need to use secondary indexing too, hopefully this important feature
> > will be made available soon :)
> >
> > Sent from my iPhone
> >
> > On Mar 25, 2011, at 12:48 AM, Stack <[email protected]> wrote:
> >
> > > There is no native support for secondary indices in HBase (currently).
> > > You will have to manage it yourself.
> > > St.Ack
> > >
> > > On Thu, Mar 24, 2011 at 10:47 PM, sreejith P. K. <[email protected]> wrote:
> > >> I have tried secondary indexing. It seems I miss some points. Could
> > >> you please explain how it is possible using secondary indexing?
> > >>
> > >> I have tried like,
> > >>
> > >>         Columnamilty1:kwd1
> > >>         Columnamilty1:kwd2
> > >> row1    Columnamilty1:kwd3
> > >>         Columnamilty1:kwd2
> > >>
> > >>         Columnamilty1:kwd1
> > >>         Columnamilty1:kwd2
> > >> row2    Columnamilty1:kwd4
> > >>         Columnamilty1:kwd5
> > >>
> > >> I need to get all rows which contain kwd1 and kwd2
> > >>
> > >> Please help.
> > >> Thanks
> > >>
> > >> On Thu, Mar 24, 2011 at 9:57 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > >>
> > >>> What you are asking for is a secondary index, and it doesn't exist at
> > >>> the moment in HBase (let alone REST). Googling a bit for "hbase
> > >>> secondary indexing" will show you how people usually do it.
> > >>>
> > >>> J-D
> > >>>
> > >>> On Thu, Mar 24, 2011 at 6:18 AM, sreejith P. K. <[email protected]> wrote:
> > >>>> Is it possible using stargate interface to hbase, fetch all rows where
> > >>>> more than one column family:<qualifier> must be present?
> > >>>>
> > >>>> like :select rows which contains keyword:a and keyword:b ?
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>
> > >>
> > >> --
> > >> Sreejith PK
> > >> Nesote Technologies (P) Ltd
> > >>
> > >
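
The recipe Dave describes in the quoted thread (write to a secondary table keyed on the value you search by, use a compound key that includes the primary key, then range-scan the secondary table) can be sketched roughly as below. This is only an illustration under loud assumptions: `SortedTable` is a hypothetical in-memory stand-in for a table sorted by row key, not the HBase client API, and `SEP`, `write_row`, `rows_with` and `rows_with_all` are made-up names. It also shows the read-before-write cost Dave mentions for partial updates, and the "kwd1 and kwd2" lookup from the original question.

```python
import bisect

SEP = "\x00"  # hypothetical separator; assumed never to appear in a keyword


class SortedTable:
    """In-memory stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            del self._keys[bisect.bisect_left(self._keys, key)]

    def scan_prefix(self, prefix):
        """Range scan: yield every key that starts with prefix, in order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i]
            i += 1


primary = SortedTable()   # row key -> set of keywords
index = SortedTable()     # compound key "keyword<SEP>row key" -> empty marker


def write_row(row_key, keywords):
    """Write the primary row and keep the index table in step with it."""
    new = set(keywords)
    # Read-before-write: the old row is needed to find stale index entries.
    old = primary.get(row_key) or set()
    for kw in old - new:
        index.delete(kw + SEP + row_key)
    primary.put(row_key, new)
    # These are separate, non-atomic writes; a failure between them would
    # leave the index out of date until some cleanup pass repairs it.
    for kw in new - old:
        index.put(kw + SEP + row_key, "")


def rows_with(keyword):
    """Return the primary row keys indexed under one keyword."""
    prefix = keyword + SEP
    return {key[len(prefix):] for key in index.scan_prefix(prefix)}


def rows_with_all(*keywords):
    """Return the primary row keys that carry every given keyword."""
    matches = [rows_with(kw) for kw in keywords]
    return set.intersection(*matches) if matches else set()


write_row("row1", ["kwd1", "kwd2", "kwd3"])
write_row("row2", ["kwd1", "kwd2", "kwd4", "kwd5"])
print(sorted(rows_with_all("kwd1", "kwd2")))  # ['row1', 'row2']
print(sorted(rows_with_all("kwd1", "kwd4")))  # ['row2']
```

Putting the primary key inside the index key is what makes every index entry unique even when many rows share a keyword, and the separator keeps a scan for kwd1 from also matching a longer keyword such as kwd10. Nothing here is transactional, so the index can lag the primary table exactly as discussed in the thread.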
