Sorry for the delay... I had a full day yesterday.
In a nutshell: tough nut to crack. I can give you a solution which you can
probably enhance.
To start, ignore coprocessors for now.
So what you end up doing is the following.
General solution, for N indexes:
Create a temp table in HBase (1 column, foo).
Assuming that you have a simple K,V index, you just need to do a simple
get() against the index to get the list of row ids.
For each index, fetch the row ids.
For each row id, write a row keyed by that row id to the temp table and
increment a counter in column foo.
Then scan the temp table for rows where foo's counter >= N. Note that it
should be == N, but just in case...
Now you have the rows that were found in all N indexes.
Having said that...
Again assuming your indexes are simple K,V pairs where V is a set of row ids...
Create a hash map of <rowid, count>.
For each index:
    Get() the index row based on the key
    For each rowid in that row:
        If map.fetch(rowid) is null, then add (rowid, 1)
        Else increment the count for that rowid
For each (rowid, count) in the map:
    If count == number of indexes N,
        then add rowid to the result set.
Now just return the rows whose rowid is in the result set.
You can do that in a coprocessor, but you may have a memory issue, depending
on the number of rowids in your index.
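
Here is a rough sketch of the hash map version in plain Java, just to make the
counting concrete. It does not touch HBase at all: each Set<String> stands in
for the row ids you got back from one index get(), and the class and method
names (IndexIntersection, intersect) are made up for this example. Using a set
per index matters, because a rowid must be counted at most once per index for
the "count == N" test to be valid.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase: counts how many indexes
// each rowid appears in and keeps the ones that appear in all of them.
final class IndexIntersection {

    // indexResults: one set of row ids per index (assumed already fetched).
    static Set<String> intersect(List<Set<String>> indexResults) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> rowIds : indexResults) {
            for (String rowId : rowIds) {
                // Add (rowid, 1) if absent, else increment the count.
                counts.merge(rowId, 1, Integer::sum);
            }
        }

        int n = indexResults.size(); // number of indexes N
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == n) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("row1", "row2", "row3"));
        Set<String> b = new HashSet<>(Arrays.asList("row2", "row3", "row4"));
        Set<String> c = new HashSet<>(Arrays.asList("row2", "row3", "row5"));
        // TreeSet only to get a stable print order.
        System.out.println(new TreeSet<>(intersect(Arrays.asList(a, b, c))));
    }
}
```

The map holds one entry per distinct rowid seen across all indexes, which is
where the memory issue comes from when the indexes are large.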
Does that help?
Sent from a remote device. Please excuse any typos...
Mike Segel
On May 14, 2012, at 8:20 AM, fding hbase <[email protected]> wrote:
> Hi Michel,
>
> I indexed each column within a column family of a table, so we can query a
> row with specific column value.
> By multi-index I mean using multiple indexes at the same time on a single
> query. That looks like a SQL select
> with two *where* clauses of two indexed columns.
>
> The row key of index table is made up of column value and row key of
> indexed table. For set intersection
> I used the utility class from the Apache Commons Collections package
> CollectionUtils.intersection(). There's no
> assumption on sort order on indices. A scan with column value as startKey
> and column value+1 as endKey
> applied to index table will return all rows in indexed table with that
> column value.
>
> For multi-index queries, previously I tried to use a scan for each index
> column and intersect of those
> result sets to get the rows that I want. But the query time is too long. So
> I decided to move the computation of
> intersection to server side and reduce the amount of data transferred.
>
> Do you have any better idea?
>
> On Mon, May 14, 2012 at 8:17 PM, Michel Segel
> <[email protected]> wrote:
>
>> Need a little clarification...
>>
>> You said that you need to do multi-index queries.
>>
>> Did you mean to say multiple people running queries at the same time, or
>> did you mean you wanted to do multi-key indexes where the key is a
>> multi-key part.
>>
>> Or did you mean that you really wanted to use multiple indexes at the same
>> time on a single query?
>>
>> If it's the latter, not really a good idea...
>> How do you handle the intersection of the two sets? (3 sets or more?)
>> Can you assume that the indexes are in sort order?
>>
>> What happens when the results from the indexes exceed the amount of
>> allocated memory?
>>
>> What I am suggesting you to do is to set aside the underpinnings of HBase
>> and look at the problem you are trying to solve in general terms. Not an
>> easy one...
>>
>>
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 14, 2012, at 4:35 AM, fding hbase <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Is it possible to use table scanner (different from the host table region)
>>> or execute coprocessor of another table, in the endpoint coprocessor?
>>> It looks like chaining coprocessors. But I found a possible deadlock!
>>> Can anyone help me with this?
>>>
>>> In my testing environment I deployed the 0.92.0 version from CDH.
>>> I wrote an Endpoint coprocessor to do composite secondary index queries.
>>> The index is stored in another table and the index update is maintained
>>> by the client through an extended HTable. While a single index query
>>> works fine through Scanners of index table, soon after we realized
>>> we need to do multi-index queries at the same time.
>>> At first we tried to pull every row key queried from a single index table
>>> and do the merge (just set intersection) on the client,
>>> but that overruns the network bandwidth. So I proposed to try
>>> the endpoint coprocessor. The idea is to use coprocessors, one
>>> in master table (the indexed table) and the other for each index table
>>> regions.
>>> Each master table region coprocessor instance invokes the index table
>>> coprocessor instances with its regioninfo (the startKey and endKey) and
>>> the scan, the index table region coprocessor instance scans and returns
>>> the row keys within the range of startKey and endKey passed in.
>>>
>>> The cluster sometimes blocks when invoking the index table coprocessor. I
>>> traced into the code and found that when HConnection locates regions it
>>> will rpc to the same regionserver.
>>>
>>> (After a while I found the index table coprocessor is equivalent to
>>> just a plain scan with filter, so I switched to scanners with filter, but
>>> the problem remains.)
>>
>
>
>
> --
>
> Best Regards!
>
> Fei Ding
> [email protected]