Jason: What about the double trip done in solr, es, nutch, etc., where the first query gathers the term frequency in each index (region) and the second query is the actual search w/ the term distribution factored in? Will you need to do something equivalent?
St.Ack

On Tue, Mar 15, 2011 at 2:53 PM, Jason Rutherglen <[email protected]> wrote:
> Gary,
>
> Thanks for your descriptive definition of how the HBase RPC works.
>
>> They will issue RPC calls in parallel to
>> all of the regions in the range starting with the region containing
>> "startRow" and ending with the region containing "endRow" (again using
>> the row keys for the region lookups).
>
> I think I'll start with the single region RPC call. Once that works,
> use the 'execute in parallel' for distributed search, e.g., it will take
> care of the underlying details, then the code that needs to be written
> will only entail merging the results.
>
> Thanks,
>
> Jason
>
> On Tue, Mar 15, 2011 at 10:11 AM, Gary Helmling <[email protected]> wrote:
>> Hi Jason,
>>
>> That's basically correct. To export your own RPC methods from a
>> coprocessor, you:
>>
>> 1) Define an interface containing the RPC methods. This interface must
>> extend CoprocessorProtocol (which only requires you to implement
>> getProtocolVersion())
>>
>> 2) Implement the defined RPC interface in your coprocessor
>>
>> 3) To call the RPC methods, you have 3 options:
>>
>>   T HTable.coprocessorProxy(Class<T> protocol, byte[] row)
>>
>>   Map<byte[],R> HTable.coprocessorExec(Class<T> protocol, byte[] startRow,
>>       byte[] endRow, Batch.Call<T,R> callable)
>>
>>   void HTable.coprocessorExec(Class<T> protocol, byte[] start, byte[] end,
>>       Batch.Call<T,R> call, Batch.Callback<R> callback)
>>
>>
>> The first returns a proxy instance of T to call methods against the
>> coprocessor instance in the region containing the row key given by "byte[]
>> row", as you described. So the row key is just used to look up the region.
>>
>> The last two are a bit trickier. They will issue RPC calls in parallel to
>> all of the regions in the range starting with the region containing
>> "startRow" and ending with the region containing "endRow" (again using
>> the row keys for the region lookups). However, to call the RPC methods,
>> you still need a proxy instance of T, which instead of being returned
>> directly, you get access to by providing an implementation of
>> Batch.Call<T,R>, where T is the protocol interface and R is the type
>> returned from your Batch.Call implementation's call(T instance) method.
>>
>> For each region in the range startRow to endRow, the HBase client will call
>> your instance of Batch.Call.call(T proxy) passing a proxy instance connected
>> to that region.
>>
>> So in TestCoprocessorEndpoint, we define a simple RPC protocol:
>>
>>   public interface ColumnAggregationProtocol extends CoprocessorProtocol {
>>     public long sum(byte[] family, byte[] qualifier) throws IOException;
>>   }
>>
>>
>> Then we invoke it using HTable.coprocessorExec() with an anonymous
>> Batch.Call instance like so:
>>
>>   new Batch.Call<ColumnAggregationProtocol, Long>() {
>>     public Long call(ColumnAggregationProtocol instance)
>>         throws IOException {
>>       return instance.sum(TEST_FAMILY, TEST_QUALIFIER);
>>     }
>>   }
>>
>> We then get back a Map<byte[],Long> from coprocessorExec() containing the
>> result of each call() invocation, keyed by the region name.
>>
>> Take a look at the Batch.forMethod() helper, which will generate a
>> Batch.Call instance for you if you just want to invoke a single method
>> across multiple regions with coprocessorExec().
>>
>>
>> --gh
>>
>>
>> On Tue, Mar 15, 2011 at 9:15 AM, Jason Rutherglen <[email protected]> wrote:
>>
>>> I'm taking a look at TestCoprocessorEndpoint for example, in trying to
>>> figure out how the Coprocessor RPC works. I think
>>> HTable.coprocessorProxy should be used? Which will return an
>>> interface that when called performs the network marshaling etc. The
>>> purpose of the row byte[] in coprocessorProxy is to find the correct
>>> region to make the call to?
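The fan-out and merge pattern Gary describes (one call() per region, results keyed by region name, then a client-side merge) can be sketched without an HBase cluster. This is only a local imitation under stated assumptions: SumProtocol, Call, FakeRegion, exec and mergeSums are hypothetical stand-ins invented for illustration, not HBase classes, and the loop here runs sequentially where the real client issues the calls in parallel.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FanOutMergeSketch {

    /** Hypothetical stand-in for an RPC protocol interface such as ColumnAggregationProtocol. */
    public interface SumProtocol {
        long sum(String family, String qualifier);
    }

    /** Hypothetical stand-in for Batch.Call<T,R>: invoked once per region with that region's proxy. */
    public interface Call<T, R> {
        R call(T instance);
    }

    /** Hypothetical in-memory "region": a name plus the values it holds. */
    public static final class FakeRegion implements SumProtocol {
        final byte[] name;
        final long[] values;

        FakeRegion(String name, long... values) {
            this.name = name.getBytes(StandardCharsets.UTF_8);
            this.values = values;
        }

        @Override
        public long sum(String family, String qualifier) {
            long total = 0;
            for (long v : values) {
                total += v;
            }
            return total;
        }
    }

    /** Unsigned lexicographic order, so byte[] region names can key a sorted map. */
    static final Comparator<byte[]> BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    };

    /** Local imitation of the Map-returning exec: one call() per region, keyed by region name. */
    public static <R> Map<byte[], R> exec(List<FakeRegion> regions, Call<SumProtocol, R> callable) {
        Map<byte[], R> results = new TreeMap<>(BYTES);
        for (FakeRegion region : regions) {
            results.put(region.name, callable.call(region));
        }
        return results;
    }

    /** The client-side merge step: collapse per-region partial sums into one total. */
    public static long mergeSums(Map<byte[], Long> perRegion) {
        long total = 0;
        for (long partial : perRegion.values()) {
            total += partial;
        }
        return total;
    }

    /** Builds three fake regions (one of them empty) and returns per-region sums. */
    public static Map<byte[], Long> demoPerRegion() {
        List<FakeRegion> regions = Arrays.asList(
            new FakeRegion("region-a", 1, 2, 3),
            new FakeRegion("region-b", 10, 20),
            new FakeRegion("region-c"));
        return exec(regions, proxy -> proxy.sum("f", "q"));
    }

    public static void main(String[] args) {
        Map<byte[], Long> perRegion = demoPerRegion();
        System.out.println("regions answered: " + perRegion.size());
        System.out.println("merged total: " + mergeSums(perRegion));
    }
}
```

For a search use case the merge would presumably be a top-N merge of scored hits rather than a sum, but the shape stays the same: the per-region work lives behind the protocol interface and the client only combines the map values.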

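On the double-trip question at the top of the thread, the idea can be illustrated with a small local sketch: round trip 1 collects document frequencies and document counts from every region, the client turns them into a global idf, and round trip 2 scores with that shared value. Everything here is an assumption made for illustration: RegionIndex and its methods are hypothetical, the smoothed log idf is just one common choice, and nothing is taken from how Solr, ES, Nutch or any HBase search code actually implements it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TwoPhaseSearchSketch {

    /** Hypothetical per-region index: each document is a list of terms. */
    public static final class RegionIndex {
        final List<List<String>> docs;

        RegionIndex(List<List<String>> docs) {
            this.docs = docs;
        }

        /** Phase 1 call: number of local documents containing the term. */
        int docFreq(String term) {
            int n = 0;
            for (List<String> doc : docs) {
                if (doc.contains(term)) {
                    n++;
                }
            }
            return n;
        }

        /** Phase 1 call: number of local documents. */
        int docCount() {
            return docs.size();
        }

        /** Phase 2 call: best local tf * idf score, using an idf supplied by the client. */
        double bestScore(String term, double idf) {
            double best = 0;
            for (List<String> doc : docs) {
                int tf = Collections.frequency(doc, term);
                best = Math.max(best, tf * idf);
            }
            return best;
        }
    }

    /** Smoothed idf from document frequency and document count (illustrative formula). */
    public static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    /** Round trip 1: sum the statistics over all regions, then compute one global idf. */
    public static double globalIdf(List<RegionIndex> regions, String term) {
        long df = 0;
        long n = 0;
        for (RegionIndex region : regions) {
            df += region.docFreq(term);
            n += region.docCount();
        }
        return idf(df, n);
    }

    /** Round trip 2: every region scores with the same global idf; the client keeps the best. */
    public static double search(List<RegionIndex> regions, String term) {
        double idf = globalIdf(regions, term);
        double best = 0;
        for (RegionIndex region : regions) {
            best = Math.max(best, region.bestScore(term, idf));
        }
        return best;
    }

    /** Two small hypothetical regions used by main and by the checks. */
    public static List<RegionIndex> demoRegions() {
        RegionIndex a = new RegionIndex(Arrays.asList(
            Arrays.asList("hbase", "rpc"),
            Arrays.asList("solr")));
        RegionIndex b = new RegionIndex(Arrays.asList(
            Arrays.asList("hbase", "hbase", "region")));
        return Arrays.asList(a, b);
    }

    public static void main(String[] args) {
        List<RegionIndex> regions = demoRegions();
        System.out.println("global idf: " + globalIdf(regions, "hbase"));
        System.out.println("best score: " + search(regions, "hbase"));
    }
}
```

Without the first trip, each region would score against its own local statistics, so the same term could get a different weight in different regions and the merged ranking would not be comparable across them.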