Ok... I think you need to step away from your solution and take a look at the problem from a different perspective.

From my limited understanding of coprocessors, this doesn't fit well with what you want to do. I don't believe you want to run an M/R query within a coprocessor.

In short, if I understood your problem, your goal is to pull data efficiently from a table using the intersection of 2 or more indexes. Note: most people create composite indexes, but it's possible that you want to index data against a column value along with a different type of index... like geospatial.

So here you need to capture the intersection of the index lists and then use that resulting subset as input to an M/R job to return the underlying data. (Note: you can do this in a single child too.) If you use an M/R job to fetch and process the result set, you would need to take your intersection into a Java object like an ordered list, which you can then split and pass off to each node.
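If it helps, the "ordered list that you split and hand to each node" part is nothing exotic. Here is a rough, untested sketch; plain Strings stand in for row keys purely for illustration, and it assumes the intersection itself has already been computed:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitForMapReduce {

    // Sort the intersected row ids and carve them into one slice per map task / node.
    public static List<List<String>> split(List<String> intersectedRowIds, int numSlices) {
        List<String> ordered = new ArrayList<String>(intersectedRowIds);
        Collections.sort(ordered);

        List<List<String>> slices = new ArrayList<List<String>>();
        int sliceSize = (ordered.size() + numSlices - 1) / numSlices;  // ceiling division
        for (int start = 0; start < ordered.size(); start += sliceSize) {
            slices.add(new ArrayList<String>(
                ordered.subList(start, Math.min(start + sliceSize, ordered.size()))));
        }
        return slices;
    }
}

Each slice then becomes the input for one map task, which just fetches the underlying rows for its row ids.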
On May 16, 2012, at 1:12 AM, fding hbase wrote:

> Hi Michel,
>
> Thanks for your reply. I believe your idea works both in theory and in
> practice. But the problem I'm worried about isn't memory usage, it's
> network performance. If I query all the indexed rows from the index
> tables, pull all of them to the client, and push them into the temp
> table, the client-side network overhead is heavy. If I can move the
> calculation to the server side, the result set will already be much
> smaller after the intersection.
>
> But sadly, HBase IPC doesn't allow a coprocessor chaining mechanism...
> Someone mentioned on
> http://grokbase.com/t/hbase/user/116hrhhf8m/coprocessor-failure-question-and-examples
> :
>
> If a RegionObserver issues RPC to another table from any of the hooks
> that are called out of RPC handlers (for Gets, Puts, Deletes, etc.), you
> risk deadlock. Whatever activity you want to check should be in the same
> region as account data to avoid that. (Or HBase RPC needs to change.)
>
> So that means the deadlock is unavoidable under the current
> circumstances. Coprocessors are still limited.
>
> What I'm seeking is a possible extension of coprocessors, or a
> workaround for situations where an extra RPC is needed inside the RPC
> handlers.
>
> By the way, the idea you described looks like what Apache
> commons-collections CollectionUtils.intersection() does.
>
> On Tue, May 15, 2012 at 8:23 PM, Michel Segel
> <[email protected]> wrote:
>
>> Sorry for the delay... had a full day yesterday...
>>
>> In a nutshell... a tough nut to crack. I can give you a solution which
>> you can probably enhance...
>>
>> At the start, ignore coprocessors for now...
>>
>> What you end up doing is the following.
>>
>> General solution, N indexes:
>> Create a temp table in HBase (1 column, foo).
>>
>> Assuming that you have a simple K,V index, you just need to do a simple
>> get() against the index to get the list of rows...
>>
>> For each index, fetch the rows.
>> For each row, write the rowid and then auto-increment a counter in
>> column foo.
>>
>> Then scan the temp table for rows where foo's counter >= N. Note that it
>> should be == N, but just in case...
>>
>> Now you have the rows that were hit by all of the indexes.
>>
>> Having said that...
>> Again assuming your indexes are a simple K,V pair where V is a set of
>> row ids...
>>
>> Create a hash map of <rowid, count>
>> For each index:
>>   Get() the row based on the key
>>   For each rowid in the row:
>>     If map.fetch(rowid) is null then add (rowid, 1)
>>     Else increment the count
>>
>> For each rowid in map(rowid, count):
>>   If count == number of indexes N
>>   Then add rowid to the result set.
>>
>> Now just return the rows whose rowid is in the result set.
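To make that last loop concrete, in plain Java it is roughly the following. Untested sketch; it assumes each index Get() has already been unpacked into a list of row ids, and it uses Strings for the row ids purely for illustration:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountingIntersection {

    // rowIdsPerIndex holds one list of row ids per index, as in the pseudo code above.
    public static List<String> intersect(List<List<String>> rowIdsPerIndex) {
        int n = rowIdsPerIndex.size();
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String> rowIds : rowIdsPerIndex) {
            for (String rowId : rowIds) {
                Integer c = counts.get(rowId);
                counts.put(rowId, c == null ? 1 : c + 1);
            }
        }
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == n) {      // the rowid appeared in every index
                result.add(e.getKey());
            }
        }
        return result;
    }
}

One caveat: if the same rowid can appear more than once in a single index's list, dedupe each list first or the counts will be off.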
>> That you can do in a coprocessor... but you may have a memory issue,
>> depending on the number of rowids in your index.
>>
>> Does that help?
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 14, 2012, at 8:20 AM, fding hbase <[email protected]> wrote:
>>
>>> Hi Michel,
>>>
>>> I indexed each column within a column family of a table, so we can
>>> query a row by a specific column value. By multi-index I mean using
>>> multiple indexes at the same time on a single query. That looks like a
>>> SQL select with two *where* clauses on two indexed columns.
>>>
>>> The row key of an index table is made up of the column value and the
>>> row key of the indexed table. For the set intersection I used the
>>> utility class from the Apache commons-collections package,
>>> CollectionUtils.intersection(). There's no assumption about sort order
>>> on the indexes. A scan of the index table with the column value as
>>> startKey and the column value+1 as endKey will return all rows in the
>>> indexed table that have that column value.
>>>
>>> For multi-index queries, I previously tried to run a scan for each
>>> indexed column and intersect those result sets to get the rows I want,
>>> but the query time is too long. So I decided to move the computation of
>>> the intersection to the server side and reduce the amount of data
>>> transferred.
>>>
>>> Do you have any better idea?
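Just to make sure I'm reading your index layout correctly, I take the per-index lookup you describe to be roughly the following. This is an untested sketch against the 0.92 client API; the key layout it assumes is the value bytes immediately followed by the data row key, and the class and table names are only placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexLookup {

    // Returns the data-table row keys for one column value, by scanning the
    // index table over [value, value+1) and stripping the value prefix.
    public static List<byte[]> lookup(Configuration conf, String indexTableName,
                                      byte[] value) throws IOException {
        HTable indexTable = new HTable(conf, indexTableName);
        try {
            Scan scan = new Scan(value, nextPrefix(value));
            scan.setFilter(new FirstKeyOnlyFilter());   // only the row keys are needed

            List<byte[]> dataRowKeys = new ArrayList<byte[]>();
            ResultScanner scanner = indexTable.getScanner(scan);
            try {
                for (Result r : scanner) {
                    byte[] indexRow = r.getRow();
                    dataRowKeys.add(Bytes.tail(indexRow, indexRow.length - value.length));
                }
            } finally {
                scanner.close();
            }
            return dataRowKeys;
        } finally {
            indexTable.close();
        }
    }

    // Smallest row key that sorts after every key starting with 'prefix',
    // i.e. the "column value + 1" end key.
    private static byte[] nextPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] next = Arrays.copyOf(prefix, i + 1);
                next[i]++;
                return next;
            }
        }
        return HConstants.EMPTY_END_ROW;  // all 0xFF: scan to the end of the table
    }
}

The nextPrefix() helper is just the "column value + 1" end key you mention.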
>>> On Mon, May 14, 2012 at 8:17 PM, Michel Segel
>>> <[email protected]> wrote:
>>>
>>>> Need a little clarification...
>>>>
>>>> You said that you need to do multi-index queries.
>>>>
>>>> Did you mean multiple people running queries at the same time, or did
>>>> you mean multi-key indexes where the key is made up of multiple parts?
>>>> Or did you mean that you really wanted to use multiple indexes at the
>>>> same time on a single query?
>>>>
>>>> If it's the latter, that's not really a good idea...
>>>> How do you handle the intersection of the two sets? (3 sets or more?)
>>>> Can you assume that the indexes are in sort order?
>>>>
>>>> What happens when the results from the indexes exceed the amount of
>>>> allocated memory?
>>>>
>>>> What I am suggesting is that you set aside the underpinnings of HBase
>>>> and look at the problem you are trying to solve in general terms. Not
>>>> an easy one...
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On May 14, 2012, at 4:35 AM, fding hbase <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Is it possible to use a table scanner (against a table other than the
>>>>> host region's table), or to execute a coprocessor of another table,
>>>>> from inside an endpoint coprocessor? It looks like chaining
>>>>> coprocessors, but I found a possible deadlock! Can anyone help me
>>>>> with this?
>>>>>
>>>>> In my testing environment I deployed the 0.92.0 version from CDH.
>>>>> I wrote an endpoint coprocessor to do composite secondary index
>>>>> queries. The index is stored in another table, and index updates are
>>>>> maintained by the client through an extended HTable. While a single
>>>>> index query works fine through scanners on the index table, we soon
>>>>> realized we need to run multi-index queries at the same time.
>>>>>
>>>>> At first we tried to pull all the row keys matched by each single
>>>>> index table and do the merge (just a set intersection) on the client,
>>>>> but that overruns the network bandwidth. So I proposed to try the
>>>>> endpoint coprocessor. The idea is to use coprocessors, one on the
>>>>> master table (the indexed table) and one on each index table region.
>>>>> Each master table region coprocessor instance invokes the index table
>>>>> coprocessor instances with its region info (the startKey and endKey)
>>>>> and the scan; each index table region coprocessor instance scans and
>>>>> returns the row keys within the range of the startKey and endKey
>>>>> passed in.
>>>>>
>>>>> The cluster sometimes blocks while invoking the index table
>>>>> coprocessor. I traced into the code and found that when HConnection
>>>>> locates regions it issues an RPC to the same region server.
>>>>>
>>>>> (After a while I realized the index table coprocessor is equivalent
>>>>> to a plain scan with a filter, so I switched to scanners with a
>>>>> filter, but the problem remains.)
>>>>
>>>
>>> --
>>>
>>> Best Regards!
>>>
>>> Fei Ding
>>> [email protected]
>>
>
> --
>
> Best Regards!
>
> Fei Ding
> [email protected]
