Hi Michel,

Thanks for your reply. I believe your idea works both in theory and in
practice. But the problem I'm worried about is not memory usage, it is
network performance. If I query all the indexed rows from the index tables,
pull all of them to the client, and push them to the temp table, the network
overhead on the client is heavy. If I can move the computation to the server
side, the result set will be much smaller after the intersection.
But sadly, HBase IPC doesn't allow a coprocessor chaining mechanism...
Someone mentioned on
http://grokbase.com/t/hbase/user/116hrhhf8m/coprocessor-failure-question-and-examples
:

  If a RegionObserver issues RPC to another table from any of the hooks that
  are called out of RPC handlers (for Gets, Puts, Deletes, etc.), you risk
  deadlock. Whatever activity you want to check should be in the same region
  as account data to avoid that. (Or HBase RPC needs to change.)

So that means the deadlock is inevitable under the current circumstances.
Coprocessors are still limited. What I'm seeking is possible extensions of
coprocessors, or a workaround for situations where an extra RPC is needed in
the RPC handlers.

By the way, the idea you described looks like what Apache commons-collections
CollectionUtils.intersection() does.

On Tue, May 15, 2012 at 8:23 PM, Michel Segel <[email protected]> wrote:

> Sorry for the delay... Had a full day yesterday...
>
> In a nutshell... Tough nut to crack. I can give you a solution which you
> can probably enhance...
>
> At the start, ignore coprocessors for now...
>
> So what you end up doing is the following.
>
> General solution... N indexes..
> Create a temp table in HBase. (1 column foo)
>
> Assuming that you have a simple K,V index, so you just need to do a simple
> get() against the index to get the list of rows ...
>
> For each index, fetch the rows.
> For each row, write the rowid and then auto increment a counter in a
> column foo.
>
> Then scan the table where foo's counter >= N. Note that it should == N but
> just in case...
>
> Now you have found multiple indexes.
>
> Having said that...
> Again assuming your indexes are a simple K,V pair where V is a set of row
> ids...
>
> Create a hash map of <rowid, count>
> For each index:
>   Get() row based on key
>   For each rowid in row:
>     If map.fetch(rowid) is null then add ( rowid, 1)
>     Else increment the value in count;
>   ;
> ;
> For each rowid in map(rowid, count):
>   If count == number of indexes N
>   Then add rowid to result set.
> ;
>
> Now just return the rows where you have its rowid in the result set.
>
> That you can do in a coprocessor...
> but you may have a memory issue... Depending on the number of
> rowid in your index.
>
>
> does that help?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 14, 2012, at 8:20 AM, fding hbase <[email protected]> wrote:
>
> > Hi Michel,
> >
> > I indexed each column within a column family of a table, so we can query
> > a row with specific column value.
> > By multi-index I mean using multiple indexes at the same time on a single
> > query. That looks like a SQL select
> > with two *where* clauses of two indexed columns.
> >
> > The row key of index table is made up of column value and row key of
> > indexed table. For set intersection
> > I used the utility class from Apache commons-collections package
> > CollectionUtils.intersection(). There's no
> > assumption on sort order on indices. A scan with column value as startKey
> > and column value+1 as endKey
> > applied to index table will return all rows in indexed table with that
> > column value.
> >
> > For multi-index queries, previously I tried to use a scan for each index
> > column and intersect of those
> > result sets to get the rows that I want. But the query time is too long.
> > So I decided to move the computation of
> > intersection to server side and reduce the amount of data transferred.
> >
> > Do you have any better idea?
> >
> > On Mon, May 14, 2012 at 8:17 PM, Michel Segel <[email protected]> wrote:
> >
> >> Need a little clarification...
> >>
> >> You said that you need to do multi-index queries.
> >>
> >> Did you mean to say multiple people running queries at the same time, or
> >> did you mean you wanted to do multi-key indexes where the key is a
> >> multi-key part.
> >>
> >> Or did you mean that you really wanted to use multiple indexes at the
> >> same time on a single query?
> >>
> >> If it's the latter, not really a good idea...
> >> How do you handle the intersection of the two sets? (3 sets or more?)
> >> Can you assume that the indexes are in sort order?
> >>
> >> What happens when the results from the indexes exceed the amount of
> >> allocated memory?
> >>
> >> What I am suggesting you do is to set aside the underpinnings of HBase
> >> and look at the problem you are trying to solve in general terms. Not an
> >> easy one...
> >>
> >>
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On May 14, 2012, at 4:35 AM, fding hbase <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Is it possible to use a table scanner (different from the host table
> >>> region) or execute a coprocessor of another table, in the endpoint
> >>> coprocessor?
> >>> It looks like chaining coprocessors. But I found a possible deadlock!
> >>> Can anyone help me with this?
> >>>
> >>> In my testing environment I deployed the 0.92.0 version from CDH.
> >>> I wrote an Endpoint coprocessor to do composite secondary index
> >>> queries.
> >>> The index is stored in another table and the index update is maintained
> >>> by the client through an extended HTable. While a single index query
> >>> works fine through Scanners of index table, soon after we realized
> >>> we need to do multi-index queries at the same time.
> >>> At first we tried to pull every row key queried from a single index
> >>> table and do the merge (just set intersection) on the client,
> >>> but that overruns the network bandwidth. So I proposed to try
> >>> the endpoint coprocessor.
> >>> The idea is to use coprocessors, one
> >>> in the master table (the indexed table) and another for each index
> >>> table region.
> >>> Each master table region coprocessor instance invokes the index table
> >>> coprocessor instances with its regioninfo (the startKey and endKey) and
> >>> the scan,
> >>> the index table region coprocessor instance scans and returns the row
> >>> keys within the range of startKey and endKey passed in.
> >>>
> >>> The cluster blocks sometimes in invoking the index table coprocessor. I
> >>> traced
> >>> into the code and found that when HConnection locates regions it will
> >>> RPC to the same regionserver.
> >>>
> >>> (After a while I found the index table coprocessor is equivalent to
> >>> just a plain scan with filter, so I switched to scanners with filter,
> >>> but the problem
> >>> remains.)
> >>
> >
> >
> >
> > --
> >
> > Best Regards!
> >
> > Fei Ding
> > [email protected]
>

--

Best Regards!

Fei Ding
[email protected]
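
For reference, the hash-map counting approach from Michel's message can be
sketched in plain Java roughly as below. This is only an illustration under
assumptions, not code from the thread: the class name MultiIndexIntersect,
the method intersect(), and the use of String row ids are all hypothetical,
and each inner list stands in for the row ids that one index lookup would
return. No HBase calls are made.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MultiIndexIntersect {

    /**
     * Returns the row ids that appear in every index result.
     * Each element of indexResults is a stand-in (assumption) for the
     * row ids returned by one index lookup.
     */
    public static Set<String> intersect(List<? extends Iterable<String>> indexResults) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Iterable<String> rowIds : indexResults) {
            // De-duplicate within one index so a repeated row id counts once.
            Set<String> seen = new HashSet<String>();
            for (String rowId : rowIds) {
                if (seen.add(rowId)) {
                    Integer c = counts.get(rowId);
                    counts.put(rowId, c == null ? 1 : c + 1);
                }
            }
        }
        // Keep only the row ids counted once per index, i.e. count == N.
        int n = indexResults.size();
        Set<String> result = new TreeSet<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == n) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> lookups = new ArrayList<List<String>>();
        lookups.add(Arrays.asList("row1", "row2", "row3"));
        lookups.add(Arrays.asList("row2", "row3", "row4"));
        lookups.add(Arrays.asList("row3", "row2"));
        System.out.println(intersect(lookups)); // prints [row2, row3]
    }
}
```

Note that the map holds every distinct row id seen across all indexes, which
is the memory concern Michel raises, and that this sketch does nothing about
the RPC deadlock discussed above.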
