Answers inline below.

On Wed, Jul 25, 2012 at 1:09 AM, Bertrand Dechoux <[email protected]> wrote:

> #1
>
> As Andrew pointed out, Cascading is indeed for MapReduce. I know the use
> case was discussed; I wanted to know what the state is now. (The blog
> entry is from 2010.) The use case is simple. I am doing log analysis and
> would like to perform fast aggregations. These aggregations are common
> (count/sum/average) but what exactly is aggregated will depend on the type
> of logs. With Apache, HTTP error codes may be interesting. With other logs,
> request durations could be. I (well, any user) would like to be able to
> reuse common code and write the coprocessors in a concise (domain
> specific) language.
Implementing DSLs on top of Java is at best problematic. However, if
someone were willing and able to do the work, the coprocessor environments
could be plumbed to Clojure or Scala or any other language that targets the
JVM, can efficiently translate between native and Java types, and is better
suited for building DSLs.

> So I was checking not to miss any new
> projects/development on that idea. From what I understand of your answers,
> implementing a kind of Cascading for coprocessors may be possible but not
> done, and may not really be pertinent/safe/efficient with the current
> architecture of coprocessors.

Actually, we have considered creating a "Cascading for Coprocessors":

    https://issues.apache.org/jira/browse/HBASE-3131

The difference is in how code shipping up to the cluster would work. It
would not be like MapReduce, where each job is a one-shot code deployment.
That doesn't mean you cannot install coprocessors and then map flows over
them (via Exec).

> #2
>
> I forgot that the shell still requires the table to be offline. Thanks for
> pointing that out. So, coprocessors are not meant to be loaded that often.

Correct. However, given ongoing work like online schema changes, the
introduction of a ServiceLoader (in HBASE-4050), and separate classloaders
(HBASE-6308), a more dynamic loading scheme for coprocessors could happen
once the supporting pieces are put in place.

> #3
>
> I am not sure I understand your answers. I have read about the
> Bigtable/HBase architecture, but I may also not have expressed my problem
> correctly. The way I see it, coprocessors would allow me to aggregate
> information from recent logs. The problem I have with vanilla MapReduce is
> that if the logs do not fill a full HDFS block then MapReduce is a bit
> overkill. I thought that for those cases coprocessors would be more
> appropriate. Is that a right way to see it? If so, is there any rule of
> thumb for knowing when to select MapReduce versus coprocessors?
> On the other side of the scale, I also assume that if I had 1 terabyte of
> data, MapReduce would be faster because it allows more parallelism.
> Well... I hope my concern is clearer now.

If you receive a lot of bulk data and need to transform it before storing
it into HBase, then a MapReduce process is the efficient option. Even with
an identity transform, it is more efficient to drop all of the new data
into place in one transaction rather than one transaction per item; this is
the rationale for HBase bulk loading.

On the other hand, if the data arrives in a streaming fashion, then
coprocessors make it possible to conveniently transform it inline as it is
persisted, via Observers. Observers may need to be reconfigured at runtime
or may need a side channel for communication. So we designed Endpoints
(i.e. Exec) to enable registration of dynamic/user RPC protocols at
runtime.

Endpoints have also been used for running aggregation functions over region
data on demand; see AggregationProtocol. Simple functions which return
quickly make sense, but this is not a replacement for a generalized
framework like MapReduce. Long-running server-side computations can
interact with leases and client-side RPC management in problematic ways.
However, those issues can be addressed by client- and server-side changes
layered on coprocessors, which could be incorporated into the framework.
Hence, HBASE-3131.

> #4
>
> Ok
>
> #5
>
> I was talking specifically of coprocessorExec.
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#coprocessorExec%28java.lang.Class,%20byte[],%20byte[],%20org.apache.hadoop.hbase.client.coprocessor.Batch.Call%29
>
> Since the return value is a Map, I should assume that all the results are
> gathered before returning it. So that would be a wait for all servers to
> complete their work.

See also the Exec method that takes a callback. The callback will be
invoked as results are returned from each individual RegionServer.
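To illustrate the difference between the two calling styles, here is a
self-contained sketch. Note this does NOT use the real HBase client: the
Call/Callback interfaces and the two exec methods below are simplified
stand-ins for Batch.Call, Batch.Callback, and the two coprocessorExec
overloads on HTableInterface, with region names standing in for row-key
ranges. It only shows the shape of "gather everything into a Map" versus
"fold partial results as each RegionServer responds".

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-ins for the real Batch.Call / Batch.Callback types in
// org.apache.hadoop.hbase.client.coprocessor.Batch.
interface Call<R> { R call(String region) throws Exception; }
interface Callback<R> { void update(String region, R result); }

public class ExecStyles {
    // Stand-in for coprocessorExec(protocol, startKey, endKey, call):
    // gathers every region's result into a Map before returning,
    // so the caller waits for all servers to finish.
    static <R> Map<String, R> execGather(String[] regions, Call<R> call)
            throws Exception {
        Map<String, R> results = new TreeMap<>();
        for (String region : regions) {
            results.put(region, call.call(region));
        }
        return results;
    }

    // Stand-in for the overload that also takes a callback: the callback
    // fires per region as results arrive, so the client can aggregate
    // incrementally instead of waiting for the complete Map.
    static <R> void execStreaming(String[] regions, Call<R> call,
            Callback<R> cb) throws Exception {
        for (String region : regions) {
            cb.update(region, call.call(region));
        }
    }

    public static void main(String[] args) throws Exception {
        String[] regions = { "region-a", "region-b", "region-c" };
        // Pretend per-region aggregate (e.g. a row count from an Endpoint).
        Call<Long> countRows = region -> (long) region.length();

        // Style 1: block until the full Map of per-region results is back.
        Map<String, Long> all = execGather(regions, countRows);
        long total = all.values().stream().mapToLong(Long::longValue).sum();

        // Style 2: fold each region's result into a running total as it
        // arrives, without materializing the Map.
        long[] running = { 0 };
        execStreaming(regions, countRows, (region, n) -> running[0] += n);

        System.out.println(total + " " + running[0]);
    }
}
```

Both styles compute the same aggregate; the callback form simply lets the
client start folding results before the slowest RegionServer has answered,
which is the point made above.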
You don't need to wait for all results to be gathered into a Map if you do
not want that.

> But theoretically, it should be possible to return early results so that
> the one calling the method could perform early aggregation of the results
> while waiting for the remaining results to come. (Or I may be
> misunderstanding something.)
>
> Thanks for the previous feedback. That's already clearer for me.
>
> Regards
>
> Bertrand

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
