Ok... I think you need to step away from your solution and take a look at the problem from a different perspective.

From my limited understanding of coprocessors, this doesn't fit well with what you want to do. I don't believe you want to run an M/R query within a coprocessor.

In short, if I understood your problem, your goal is to pull data efficiently from a table using the intersection of 2 or more indexes. Note: most people create composite indexes, but it's possible that you want to index data against a column value along with a different type of index... like geospatial.

So here you need to capture the intersection of the index lists and then use that resulting subset as input to an M/R job to return the underlying data. (Note: you can do this in a single child too.) If you use an M/R job to fetch and process the result set, you would need to take your intersection into a Java object like an ordered list, which you can then split and pass off to each node.
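If it helps, the "ordered list that you split and hand to each node" part is nothing exotic. Here is a rough, untested sketch; plain Strings stand in for row keys purely for illustration, and it assumes the intersection itself has already been computed:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitForMapReduce {

    // Sort the intersected row ids and carve them into one slice per map task / node.
    public static List<List<String>> split(List<String> intersectedRowIds, int numSlices) {
        List<String> ordered = new ArrayList<String>(intersectedRowIds);
        Collections.sort(ordered);

        List<List<String>> slices = new ArrayList<List<String>>();
        int sliceSize = (ordered.size() + numSlices - 1) / numSlices;  // ceiling division
        for (int start = 0; start < ordered.size(); start += sliceSize) {
            slices.add(new ArrayList<String>(
                ordered.subList(start, Math.min(start + sliceSize, ordered.size()))));
        }
        return slices;
    }
}

Each slice then becomes the input for one map task, which just fetches the underlying rows for its row ids.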
On May 16, 2012, at 1:12 AM, fding hbase wrote:

> Hi Michel,
>
> Thanks for your reply. I believe your idea works both in theory and in
> practice. But the problem I'm worried about isn't memory usage, it's
> network performance. If I query all the indexed rows from the index
> tables, pull all of them to the client, and push them into the temp
> table, the client-side network overhead is heavy. If I can move the
> calculation to the server side, the result set will already be much
> smaller after the intersection.
>
> But sadly, HBase IPC doesn't allow a coprocessor chaining mechanism...
> Someone mentioned on
> http://grokbase.com/t/hbase/user/116hrhhf8m/coprocessor-failure-question-and-examples
> :
>
> If a RegionObserver issues RPC to another table from any of the hooks
> that are called out of RPC handlers (for Gets, Puts, Deletes, etc.), you
> risk deadlock. Whatever activity you want to check should be in the same
> region as account data to avoid that. (Or HBase RPC needs to change.)
>
> So that means the deadlock is unavoidable under the current
> circumstances. Coprocessors are still limited.
>
> What I'm seeking is a possible extension of coprocessors, or a
> workaround for situations where an extra RPC is needed inside the RPC
> handlers.
>
> By the way, the idea you described looks like what Apache
> commons-collections CollectionUtils.intersection() does.
>
> On Tue, May 15, 2012 at 8:23 PM, Michel Segel
> <[email protected]> wrote:
>
>> Sorry for the delay... had a full day yesterday...
>>
>> In a nutshell... a tough nut to crack. I can give you a solution which
>> you can probably enhance...
>>
>> At the start, ignore coprocessors for now...
>>
>> What you end up doing is the following.
>>
>> General solution, N indexes:
>> Create a temp table in HBase (1 column, foo).
>>
>> Assuming that you have a simple K,V index, you just need to do a simple
>> get() against the index to get the list of rows...
>>
>> For each index, fetch the rows.
>> For each row, write the rowid and then auto-increment a counter in
>> column foo.
>>
>> Then scan the temp table for rows where foo's counter >= N. Note that it
>> should be == N, but just in case...
>>
>> Now you have the rows that were hit by all of the indexes.
>>
>> Having said that...
>> Again assuming your indexes are a simple K,V pair where V is a set of
>> row ids...
>>
>> Create a hash map of <rowid, count>
>> For each index:
>>   Get() the row based on the key
>>   For each rowid in the row:
>>     If map.fetch(rowid) is null then add (rowid, 1)
>>     Else increment the count
>>
>> For each rowid in map(rowid, count):
>>   If count == number of indexes N
>>   Then add rowid to the result set.
>>
>> Now just return the rows whose rowid is in the result set.
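To make that last loop concrete, in plain Java it is roughly the following. Untested sketch; it assumes each index Get() has already been unpacked into a list of row ids, and it uses Strings for the row ids purely for illustration:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountingIntersection {

    // rowIdsPerIndex holds one list of row ids per index, as in the pseudo code above.
    public static List<String> intersect(List<List<String>> rowIdsPerIndex) {
        int n = rowIdsPerIndex.size();
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String> rowIds : rowIdsPerIndex) {
            for (String rowId : rowIds) {
                Integer c = counts.get(rowId);
                counts.put(rowId, c == null ? 1 : c + 1);
            }
        }
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == n) {      // the rowid appeared in every index
                result.add(e.getKey());
            }
        }
        return result;
    }
}

One caveat: if the same rowid can appear more than once in a single index's list, dedupe each list first or the counts will be off.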
>> That you can do in a coprocessor... but you may have a memory issue,
>> depending on the number of rowids in your index.
>>
>> Does that help?
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 14, 2012, at 8:20 AM, fding hbase <[email protected]> wrote:
>>
>>> Hi Michel,
>>>
>>> I indexed each column within a column family of a table, so we can
>>> query a row by a specific column value. By multi-index I mean using
>>> multiple indexes at the same time on a single query. That looks like a
>>> SQL select with two *where* clauses on two indexed columns.
>>>
>>> The row key of an index table is made up of the column value and the
>>> row key of the indexed table. For the set intersection I used the
>>> utility class from the Apache commons-collections package,
>>> CollectionUtils.intersection(). There's no assumption about sort order
>>> on the indexes. A scan of the index table with the column value as
>>> startKey and the column value+1 as endKey will return all rows in the
>>> indexed table that have that column value.
>>>
>>> For multi-index queries, I previously tried to run a scan for each
>>> indexed column and intersect those result sets to get the rows I want,
>>> but the query time is too long. So I decided to move the computation of
>>> the intersection to the server side and reduce the amount of data
>>> transferred.
>>>
>>> Do you have any better idea?
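Just to make sure I'm reading your index layout correctly, I take the per-index lookup you describe to be roughly the following. This is an untested sketch against the 0.92 client API; the key layout it assumes is the value bytes immediately followed by the data row key, and the class and table names are only placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexLookup {

    // Returns the data-table row keys for one column value, by scanning the
    // index table over [value, value+1) and stripping the value prefix.
    public static List<byte[]> lookup(Configuration conf, String indexTableName,
                                      byte[] value) throws IOException {
        HTable indexTable = new HTable(conf, indexTableName);
        try {
            Scan scan = new Scan(value, nextPrefix(value));
            scan.setFilter(new FirstKeyOnlyFilter());   // only the row keys are needed

            List<byte[]> dataRowKeys = new ArrayList<byte[]>();
            ResultScanner scanner = indexTable.getScanner(scan);
            try {
                for (Result r : scanner) {
                    byte[] indexRow = r.getRow();
                    dataRowKeys.add(Bytes.tail(indexRow, indexRow.length - value.length));
                }
            } finally {
                scanner.close();
            }
            return dataRowKeys;
        } finally {
            indexTable.close();
        }
    }

    // Smallest row key that sorts after every key starting with 'prefix',
    // i.e. the "column value + 1" end key.
    private static byte[] nextPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] next = Arrays.copyOf(prefix, i + 1);
                next[i]++;
                return next;
            }
        }
        return HConstants.EMPTY_END_ROW;  // all 0xFF: scan to the end of the table
    }
}

The nextPrefix() helper is just the "column value + 1" end key you mention.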
>>> On Mon, May 14, 2012 at 8:17 PM, Michel Segel
>>> <[email protected]> wrote:
>>>
>>>> Need a little clarification...
>>>>
>>>> You said that you need to do multi-index queries.
>>>>
>>>> Did you mean multiple people running queries at the same time, or did
>>>> you mean multi-key indexes where the key is made up of multiple parts?
>>>> Or did you mean that you really wanted to use multiple indexes at the
>>>> same time on a single query?
>>>>
>>>> If it's the latter, that's not really a good idea...
>>>> How do you handle the intersection of the two sets? (3 sets or more?)
>>>> Can you assume that the indexes are in sort order?
>>>>
>>>> What happens when the results from the indexes exceed the amount of
>>>> allocated memory?
>>>>
>>>> What I am suggesting is that you set aside the underpinnings of HBase
>>>> and look at the problem you are trying to solve in general terms. Not
>>>> an easy one...
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On May 14, 2012, at 4:35 AM, fding hbase <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Is it possible to use a table scanner (against a table other than the
>>>>> host region's table), or to execute a coprocessor of another table,
>>>>> from inside an endpoint coprocessor? It looks like chaining
>>>>> coprocessors, but I found a possible deadlock! Can anyone help me
>>>>> with this?
>>>>>
>>>>> In my testing environment I deployed the 0.92.0 version from CDH.
>>>>> I wrote an endpoint coprocessor to do composite secondary index
>>>>> queries. The index is stored in another table, and index updates are
>>>>> maintained by the client through an extended HTable. While a single
>>>>> index query works fine through scanners on the index table, we soon
>>>>> realized we need to run multi-index queries at the same time.
>>>>>
>>>>> At first we tried to pull all the row keys matched by each single
>>>>> index table and do the merge (just a set intersection) on the client,
>>>>> but that overruns the network bandwidth. So I proposed to try the
>>>>> endpoint coprocessor. The idea is to use coprocessors, one on the
>>>>> master table (the indexed table) and one on each index table region.
>>>>> Each master table region coprocessor instance invokes the index table
>>>>> coprocessor instances with its region info (the startKey and endKey)
>>>>> and the scan; each index table region coprocessor instance scans and
>>>>> returns the row keys within the range of the startKey and endKey
>>>>> passed in.
>>>>>
>>>>> The cluster sometimes blocks while invoking the index table
>>>>> coprocessor. I traced into the code and found that when HConnection
>>>>> locates regions it issues an RPC to the same region server.
>>>>>
>>>>> (After a while I realized the index table coprocessor is equivalent
>>>>> to a plain scan with a filter, so I switched to scanners with a
>>>>> filter, but the problem remains.)
>>>>
>>>
>>> --
>>>
>>> Best Regards!
>>>
>>> Fei Ding
>>> [email protected]
>>
>
> --
>
> Best Regards!
>
> Fei Ding
> [email protected]
