Re: Secondary indexes suggestions

lars hofhansl Mon, 13 Aug 2012 17:43:28 -0700

Secondary indexes are only simple when you ignore concurrent updates and 
failing clients.
A client could manage to write the index first and then fail before writing the 
main row (that can be handled by always rechecking the main row and always 
scanning all versions of the index rows, which is hard/expensive in a scan).
You can also keep a WAL, which you check upon each read to reapply all 
outstanding changes. (2ndary index updates are nice in that they are 
idempotent.)
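
A minimal sketch of the failure case and the two remedies above, assuming a toy
in-memory model: the dicts, the set, and the names put, replay_wal and lookup
are hypothetical stand-ins, not HBase APIs.

```python
# Toy in-memory model; these containers stand in for HBase tables (hypothetical).
main = {}        # row key -> {"state": ...}
index = set()    # (indexed value, row key) pairs; set.add is idempotent
wal = []         # pending (value, row_key) writes, logged before being applied


def put(row_key, state, fail_after_index=False):
    """Log the intent, write the index entry first, then the main row."""
    wal.append((state, row_key))
    index.add((state, row_key))
    if fail_after_index:
        return  # simulated client crash: the main row is never written
    main[row_key] = {"state": state}
    wal.remove((state, row_key))


def replay_wal():
    """Reapply outstanding changes; safe to run any number of times."""
    for state, row_key in list(wal):
        index.add((state, row_key))
        main[row_key] = {"state": state}
        wal.remove((state, row_key))


def lookup(state, recheck=True):
    """Return row keys for a state, optionally rechecking the main row."""
    keys = sorted(k for v, k in index if v == state)
    if recheck:
        keys = [k for k in keys if main.get(k, {}).get("state") == state]
    return keys
```

With recheck=True a dangling index entry is filtered out on read; calling
replay_wal() instead completes the half-finished write, and calling it twice
changes nothing because both writes are idempotent.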

Similarly, there are other scenarios that make this hard, which is the reason 
why HBase doesn't have them.
We've been thinking about primitives to add to HBase to make building/using of 
2ndary indexes easier/feasible.

Should indexes be global (i.e. it is up to a client or coprocessor to gather 
the matches and requery the actual rows)? Or local (which means a query needs 
to farm out many queries in parallel to all index sites)?
Both have pros and cons.
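
To make the trade-off concrete, here is a rough sketch under stated
assumptions: two dicts play the role of regions, and build_index, query_global
and query_local are hypothetical helper names, not anything HBase provides.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical "regions", each a dict of row key -> row.
regions = [
    {"12345": {"state": "Ohio"}, "20000": {"state": "Utah"}},
    {"67890": {"state": "Ohio"}},
]


def build_index(rows):
    """Map each indexed value to the sorted row keys that carry it."""
    idx = {}
    for key, row in rows.items():
        idx.setdefault(row["state"], []).append(key)
    return {value: sorted(keys) for value, keys in idx.items()}


# Global: one index over all regions; the caller requeries the actual rows.
global_index = build_index({k: r for reg in regions for k, r in reg.items()})


def query_global(state):
    keys = global_index.get(state, [])
    return [k for k in keys if any(k in reg for reg in regions)]


# Local: one index per region; the query fans out to every region in parallel.
local_indexes = [build_index(reg) for reg in regions]


def query_local(state):
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda idx: idx.get(state, []), local_indexes))
    return sorted(k for part in parts for k in part)
```

The global variant reads one index but needs a second round trip for the rows;
the local variant keeps index and rows together but has to ask every region.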

I think the key point of the fuzzy filter is that it can actually seek ahead 
(using the HBase Filter seek hints), which has the potential to be far more 
efficient than a full scan.
In fact, local indexes would probably be implemented that way: you always scan 
the main table and use the index information to seek ahead.
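
The seek-ahead idea can be sketched outside HBase. The following is a toy
model only, not the FuzzyRowFilter API: it assumes fixed-width string keys made
of a numeric prefix plus a suffix field, a sorted list in place of a table, and
a hypothetical function name seek_scan.

```python
from bisect import bisect_left


def seek_scan(sorted_keys, suffix, prefix_len=4):
    """Return (matches, keys_examined) for keys ending in `suffix`.

    Keys are assumed to be fixed-width strings: a numeric prefix of
    `prefix_len` digits followed by the suffix field.
    """
    matches, examined, i = [], 0, 0
    while i < len(sorted_keys):
        key = sorted_keys[i]
        examined += 1
        prefix, tail = key[:prefix_len], key[prefix_len:]
        if tail == suffix:
            matches.append(key)
            i += 1
            continue
        if tail < suffix:
            hint = prefix + suffix          # still reachable under this prefix
        else:
            nxt = int(prefix) + 1           # move on to the next prefix
            if nxt >= 10 ** prefix_len:
                break                       # no larger prefix exists
            hint = str(nxt).zfill(prefix_len) + suffix
        # The hint is the smallest key that could still match; jump to it.
        i = bisect_left(sorted_keys, hint, i + 1)
    return matches, examined
```

On 30 keys built from 3 prefixes and 10 suffixes, this examines only a handful
of keys, because every non-matching key yields a hint that skips the rest of
its run.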

Just my $0.02, though. :)

-- Lars

----- Original Message -----
From: Michael Segel <[email protected]>
To: [email protected]; Otis Gospodnetic <[email protected]>
Cc: 
Sent: Monday, August 13, 2012 5:28 PM
Subject: Re: Secondary indexes suggestions

Not really a good idea or anything new. 
Essentially a full table scan where you're doing a closer inspection on the key 
to see if it matches your search regex, before actually fetching the entire row 
and returning it. 

Secondary indexes are pretty straightforward. 
You have your primary key and then your value. 
A secondary index is a table where the key is one of your values from the main 
base table, and the value is the key from the base table. 

So if your main key is 12345, and you store {'Fred', 'Cleveland', 'Ohio'}  == 
{Name, City, State}

You could create an index on State where you store 'Ohio' as the key, and a 
column value of 12345.

Then if you search the second table for the row with the key 'Ohio', you'll get 
the keys of all the rows in the base table that have that value. In this 
example, a row with the key '12345' ...
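
The example above can be sketched with two plain dicts standing in for the
base table and the index table. The helper names index_row and find_by_state
are hypothetical, and this ignores the concurrency and failure cases entirely.

```python
from collections import defaultdict

# Base table: primary key -> row (names and values from the example above).
base = {"12345": {"Name": "Fred", "City": "Cleveland", "State": "Ohio"}}

# Index table: indexed value -> set of base-table keys.
state_index = defaultdict(set)


def index_row(key, row, column="State"):
    """Record the base-table key under the row's value for `column`."""
    state_index[row[column]].add(key)


def find_by_state(state):
    """Look up keys in the index, then fetch the rows from the base table."""
    return {key: base[key] for key in sorted(state_index.get(state, ()))}


for key, row in base.items():
    index_row(key, row)
```

Searching the index for 'Ohio' yields the key '12345', which is then used to
fetch the full row from the base table.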

HTH

On Aug 13, 2012, at 4:49 PM, Otis Gospodnetic <[email protected]> 
wrote:

> Lukáš, have a look at this recent post on this topic:
> 
> 
> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
>  
> 
> 
> Otis 
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase - 
> http://sematext.com/spm 
> 
> 
> 
>> ________________________________
>> From: Lukáš Drbal <[email protected]>
>> To: [email protected] 
>> Sent: Sunday, August 12, 2012 8:15 AM
>> Subject: Secondary indexes suggestions
>> 
>> Hi all,
>> 
>> I am a new user of HBase and I need help with secondary indexes.
>> 
>> For example i have messages and users. Each user has many messages.
>> Data structure will be like this:
>> 
>> Message:
>> - String id
>> - Long sender_id
>> - Long recipient_id
>> - String text
>> - Timestamp created_at
>> [...]
>> 
>> User:
>> - Long id
>> - String username
>> [...]
>> 
>> I need to create secondary indexes for reading all messages:
>> a) inbox (by recipient_id) in timerange.
>> b) outbox (by sender_id) in timerange
>> 
>> Can someone give me suggestions for this index(es) and attributes for
>> columnFamily?
>> I expect here 500M messages and 50M users.
>> 
>> Thanks a lot for your response.
>> 
>> 
>> P.S. Sorry for my bad English; it isn't my primary language.
>> 
>> 
>> Lukas Drbal
>> 
>>
