On Tue, Jul 24, 2012 at 7:59 AM, Bertrand Dechoux <[email protected]> wrote:

> First, I thought coprocessors needed a restart but it seems a shell can be
> used to add/remove them without requiring a restart. However, at the moment
> the coprocessors are defined within a jar and can not be dynamically created.
> Could you confirm that?
You can dynamically load new coprocessors by deploying a jarfile to HDFS, then using the shell to disable the table, add the coprocessor, and enable the table again. To remove a coprocessor from a table, use the shell to disable the table, remove the coprocessor, and then enable the table again. However, whatever classes were loaded by the JVM will remain resident until the regionserver process is restarted.

> (I am thinking about the Cascading way of creating
> the implementation which will then be serialized, sent and executed.)

... as a MapReduce job. MR jobs in Hadoop are really individual submissions of application code to run on the cluster, each and every time. In contrast, HBase coprocessors can be thought of like Linux loadable kernel modules: you add them to your infrastructure. HBase becomes more like an application deployment platform, where the details of colocating data with your application code at scale are handled for you automatically, as is client side dispatch to the appropriate locations.

An early design of coprocessors considered code shipping at request time, but that doesn't fit the extension model above well. Consider also that HBase is a short-request system: the latency of processing each individual RPC is important and expected to be as short as possible. For a table where you want to extend server side function, imagine the overhead if that extension were shipped with every request. Each RPC would be what, 10x? 100x? larger. And there would be the client side latency of computing the transitive closure of classes to send up, then the server side latency of installing the bytecode for execution and later removing it for GC.

> Second, I didn't see any way to give parameters to coprocessors. Is that
> really the case? If not, how would the parameters be handled?

A coprocessor can be an Observer, linked into server side function; parameters are handed to your installed extension via upcall from HBase code. Or, a coprocessor can be an Endpoint.
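For the first two questions, the dynamic load/unload dance looks roughly like this from the HBase shell (the table name, jar location, class name, and key=value arguments here are all hypothetical, and the jar must already be in HDFS):

```
hbase> disable 'mytable'
hbase> alter 'mytable', METHOD => 'table_att',
  'coprocessor' => 'hdfs:///user/hbase/cp/my-cp.jar|com.example.MyObserver|1001|arg1=1,arg2=2'
hbase> enable 'mytable'

# Removing it later means unsetting the table attribute the alter created:
hbase> disable 'mytable'
hbase> alter 'mytable', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
hbase> enable 'mytable'
```

The coprocessor attribute value is of the form jar-path|class|priority|arguments; the optional key=value pairs after the priority are made available to the coprocessor as configuration when it is loaded, which is one way to parameterize an installed extension per table.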
This is a dynamic RPC endpoint. You can send up any parameter to an endpoint via Exec as long as HBase RPC can serialize it. For more information see: https://blogs.apache.org/hbase/entry/coprocessor_introduction

> Third, I assume coprocessors are using the process/thread of the region
> server. Does that mean that, if multiple blocks need to be processed,
> MapReduce should be more efficient? Are there other ways to know whether
> coprocessors or MapReduce should be chosen?

Coprocessors operate on requests (RPCs), not blocks. If you address a coprocessor request to the whole table, whatever happens will happen on all regionservers in parallel. This is as far as the similarity to MapReduce goes. Conceivably you could implement a map() and reduce() interface on top of HBase using coprocessors, but CPs themselves are a lower level extension framework.

> Fourth, I know this is a really broad question but how would you compare
> coprocessors to YARN? I have yet to know more about both subjects but I
> feel that the concepts are not totally unrelated.

Coprocessors are a low level extension framework; YARN is a general purpose high level cluster resource manager. Not in the same engineering ballpark.

> Lastly, this is an implementation detail but how does the client side wait for
> the results? Is it possible to perform early aggregation or does the client
> need to receive all the information before doing anything else?
>
> Regards
>
> Bertrand
>
> PS: My two sources for that subject are, for HBase 0.92:
> * https://blogs.apache.org/hbase/entry/coprocessor_introduction
> * HBase: The Definitive Guide.

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
