Secondary indexes are only simple if you ignore concurrent updates and failing clients. A client could manage to write the index row first and then fail before writing the main row. That can be handled by always rechecking the main row and always scanning all versions of the index rows, which is hard and expensive in a scan. You can also keep a WAL, which you check on each read in order to reapply all outstanding changes. (Secondary index updates are nice in that they are idempotent.)

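To make the "write the index first, recheck the main row on read" idea concrete, here is a minimal in-memory sketch. It is not HBase code: `main_table`, `index_table`, `put_with_index` and `lookup` are hypothetical names, plain Python containers stand in for the two tables, and cell versions are ignored entirely.

```python
# Toy stand-ins for an HBase main table and an index table (assumptions, not HBase APIs).
main_table = {}      # row key -> {column: value}
index_table = set()  # (indexed value, row key) entries; re-adding an entry is a no-op


def put_with_index(row_key, row, column, fail_before_main=False):
    """Write the index entry first, then the main row."""
    index_table.add((row[column], row_key))  # idempotent index update
    if fail_before_main:
        # Simulates a client that dies between the two writes.
        raise RuntimeError("client died before writing the main row")
    main_table[row_key] = dict(row)


def lookup(column, value):
    """Return matching row keys, rechecking every index hit against the main row."""
    candidates = sorted(k for (v, k) in index_table if v == value)
    return [k for k in candidates
            if main_table.get(k, {}).get(column) == value]


put_with_index("12345", {"state": "Ohio"}, "state")
try:
    put_with_index("67890", {"state": "Ohio"}, "state", fail_before_main=True)
except RuntimeError:
    pass  # the index entry for 67890 is now dangling

put_with_index("12345", {"state": "Texas"}, "state")  # old Ohio entry is now stale

print(lookup("state", "Ohio"))   # dangling and stale entries are filtered out
print(lookup("state", "Texas"))
```

The recheck is what makes the dangling and stale index entries harmless, at the cost of one extra main-table read per index hit.
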
Similarly, there are other scenarios that make this hard, which is the reason why HBase doesn't have them.

We've been thinking about primitives to add to HBase to make building and using secondary indexes easier/feasible.

Should indexes be global (i.e. it is up to a client or coprocessor to gather the matches and requery the actual rows)? Or local (which means a query needs to farm out many queries in parallel to all index sites)? Both have pros and cons.

I think the key point of the Fuzzy filter is that it can actually seek ahead (using the HBase Filter seek hints), which has the potential to be far more efficient than a full scan. In fact, local indexes would probably be implemented that way: you always scan the main table and use the index information to seek ahead.

Just my $0.02, though. :)

-- Lars

----- Original Message -----
From: Michael Segel <[email protected]>
To: [email protected]; Otis Gospodnetic <[email protected]>
Cc:
Sent: Monday, August 13, 2012 5:28 PM
Subject: Re: Secondary indexes suggestions

Not really a good idea or anything new. Essentially it is a full table scan where you're doing a closer inspection of the key to see if it matches your search regex, before actually fetching the entire row and returning it.

Secondary indexes are pretty straightforward. You have your primary key and then your value. A secondary index is a table where the key is one of your values from the main base table, and the value is the key from the base table.

So if your main key is 12345, and you store {'Fred', 'Cleveland', 'Ohio'} == {Name, City, State}, you could create an index on State where you store 'Ohio' as the key, and a column value of 12345.

Then if you search the second table for the row with the key 'Ohio', you'll get the keys of all the rows in the base table that have a matching record. In this example, a row with the key '12345' ...

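The index-table layout described above can be sketched with plain dictionaries. This is only an illustration, not HBase client code: `build_index` and `query` are hypothetical helpers, and every row other than 12345 is made up for the example.

```python
from collections import defaultdict

# Base table: primary key -> record. Only row 12345 comes from the example above;
# the other two rows are invented so the index has something to distinguish.
base_table = {
    "12345": {"Name": "Fred", "City": "Cleveland", "State": "Ohio"},
    "67890": {"Name": "Wilma", "City": "Columbus", "State": "Ohio"},
    "24680": {"Name": "Barney", "City": "Austin", "State": "Texas"},
}


def build_index(table, column):
    """Index table: indexed value -> set of base-table keys (one 'column' per key)."""
    index = defaultdict(set)
    for row_key, row in table.items():
        index[row[column]].add(row_key)
    return index


def query(table, index, value):
    """Read the index row for value, then fetch the matching base-table rows."""
    return {key: table[key] for key in sorted(index.get(value, ()))}


state_index = build_index(base_table, "State")
ohio_rows = query(base_table, state_index, "Ohio")
print(sorted(ohio_rows))
```

Note the two-step read: one lookup in the index table, then one fetch per base-table key it returns.
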
HTH

On Aug 13, 2012, at 4:49 PM, Otis Gospodnetic <[email protected]> wrote:

> Lukáš, have a look at this recent post on this topic:
>
> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
>
>> ________________________________
>> From: Lukáš Drbal <[email protected]>
>> To: [email protected]
>> Sent: Sunday, August 12, 2012 8:15 AM
>> Subject: Secondary indexes suggestions
>>
>> Hi all,
>>
>> I am a new user of HBase and I need help with secondary indexes.
>>
>> For example, I have messages and users. Each user has many messages.
>> The data structure will be like this:
>>
>> Message:
>> - String id
>> - Long sender_id
>> - Long recipient_id
>> - String text
>> - Timestamp created_at
>> [...]
>>
>> User:
>> - Long id
>> - String username
>> [...]
>>
>> I need to create secondary indexes for reading all messages:
>> a) inbox (by recipient_id) in a time range
>> b) outbox (by sender_id) in a time range
>>
>> Can someone give me suggestions for these index(es) and attributes for
>> the columnFamily?
>> I expect 500M messages and 50M users here.
>>
>> Thanks a lot for your response.
>>
>> P.S. Sorry for my bad English, it isn't my primary language.
>>
>> Lukas Drbal
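
For the inbox question at the bottom of the thread, one possible layout (not something anyone in the thread proposed, just the index-table idea applied to the Message schema) is an index keyed on recipient_id followed by created_at, with the message id as the value. Because rows sort by key, a time-range read becomes a contiguous range scan. The sketch below assumes this layout and uses a sorted Python list in place of an index table; `inbox_index`, `index_message` and `inbox` are hypothetical names, and timestamps are plain integers.

```python
import bisect

# Sorted list of (recipient_id, created_at, message_id), standing in for an
# index table whose row key is recipient_id + created_at (an assumed layout).
inbox_index = []


def index_message(message_id, recipient_id, created_at):
    """Add one index entry, keeping the list sorted by key."""
    bisect.insort(inbox_index, (recipient_id, created_at, message_id))


def inbox(recipient_id, start_ts, end_ts):
    """Message ids for recipient_id with start_ts <= created_at < end_ts."""
    # A 2-tuple sorts before any 3-tuple sharing its prefix, so these two
    # probes behave like the start row and stop row of a range scan.
    lo = bisect.bisect_left(inbox_index, (recipient_id, start_ts))
    hi = bisect.bisect_left(inbox_index, (recipient_id, end_ts))
    return [message_id for (_, _, message_id) in inbox_index[lo:hi]]


index_message("m1", 7, 100)
index_message("m2", 7, 150)
index_message("m3", 8, 120)
index_message("m4", 7, 200)

print(inbox(7, 100, 200))
print(inbox(8, 0, 1000))
```

The outbox would be a second index of the same shape keyed on sender_id. Whether this holds up at 500M messages depends on things the sketch ignores, such as key encoding and how writes spread across regions.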
