All clear now, thanks a lot.

Bertrand
On Wed, Jul 25, 2012 at 7:28 PM, Andrew Purtell <[email protected]> wrote:

> Answers inline below.
>
> On Wed, Jul 25, 2012 at 1:09 AM, Bertrand Dechoux <[email protected]> wrote:
> > #1
> >
> > As Andrew pointed out, Cascading is indeed for MapReduce. I know the use
> > case was discussed; I wanted to know what the state is now. (The blog
> > entry is from 2010.) The use case is simple: I am doing log analysis and
> > would like to perform fast aggregations. These aggregations are common
> > (count/sum/average), but what exactly is aggregated will depend on the
> > type of logs. With Apache logs, HTTP error codes may be interesting; with
> > other logs, request durations could be. I (well, any user) would like to
> > be able to reuse common code and write the coprocessors in a concise
> > (domain-specific) language.
>
> Implementing DSLs on top of Java is at best problematic. However, if
> someone were willing and able to do the work, the coprocessor
> environments could be plumbed to Clojure or Scala or any other
> language that targets the JVM, can efficiently translate between
> native and Java types, and is better suited for building DSLs.
>
> > So I was checking not to miss any new
> > projects/developments on that idea. From what I understand of your
> > answers, implementing a kind of Cascading for coprocessors may be
> > possible but has not been done, and may not really be
> > pertinent/safe/efficient with the current architecture of coprocessors.
>
> Actually, we have considered creating a "Cascading for Coprocessors":
> https://issues.apache.org/jira/browse/HBASE-3131
>
> The difference is in how shipping code up to the cluster would work. It
> would not be like MapReduce, where each job is a one-shot code
> deployment. That doesn't mean that you cannot install coprocessors and
> then map flows over them (via Exec).
>
> > #2
> >
> > I forgot that the shell still requires the table to be offline. Thanks
> > for pointing that out.
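The reusable count/sum/average aggregations Bertrand describes in #1 can be sketched in plain Java, independent of any coprocessor plumbing. This is only an illustration of the "common code, varying field" idea; `Aggregates` and its methods are hypothetical names, not an HBase API:

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical reusable aggregation helpers: the aggregation logic
// (count/sum/average) is shared, while the field being aggregated
// varies per log type (HTTP status, request duration, ...).
final class Aggregates {
    static <T> long count(List<T> records) {
        return records.size();
    }

    static <T> double sum(List<T> records, ToDoubleFunction<T> field) {
        return records.stream().mapToDouble(field).sum();
    }

    static <T> double average(List<T> records, ToDoubleFunction<T> field) {
        return records.isEmpty() ? 0.0 : sum(records, field) / records.size();
    }
}
```

For Apache access logs the caller would pass a lambda extracting the HTTP status; for other logs, one extracting the request duration, so only the field extractor changes per log type.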
> > So, coprocessors are not meant to be loaded that often.
>
> Correct. However, given ongoing work like online schema changes, the
> introduction of a ServiceLoader (HBASE-4050), and separate
> classloaders (HBASE-6308), a more dynamic loading scheme for
> coprocessors could happen once the supporting pieces are put in place.
>
> > #3
> > I am not sure I understand your answers. I have read about the
> > BigTable/HBase architecture, but I may also not have expressed my
> > problem correctly. The way I see it, coprocessors would allow me to
> > aggregate information from recent logs. The problem I have with vanilla
> > MapReduce is that if the logs do not fill a full HDFS block, then
> > MapReduce is a bit overkill. I thought that for those cases coprocessors
> > would be more appropriate. Is that the right way to see it? If so, is
> > there any rule of thumb for knowing when to select MapReduce versus
> > coprocessors? On the other side of the scale, I also assume that if I
> > had 1 terabyte of data, MapReduce would be faster because it allows more
> > parallelism. Well... I hope my concern is clearer now.
>
> If you receive a lot of bulk data and need to transform it before
> storing it into HBase, then a MapReduce process is the efficient option.
> Even with an identity transform, it is more efficient to drop all of
> the new data into place in one transaction rather than a transaction
> for each item; this is the rationale for HBase bulk loading. On the
> other hand, if the data arrives in a streaming fashion, then
> coprocessors make it possible to conveniently transform it inline as
> it is persisted, via Observers.
>
> Observers may need to be reconfigured at runtime or may need a side
> channel for communication. So, we designed Endpoints (i.e. Exec) to
> enable registration of dynamic/user RPC protocols at runtime.
>
> Endpoints have also been used for running aggregation functions over
> the region data on demand; see AggregationProtocol.
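The "transform inline as it is persisted, via Observers" idea can be mimicked with a toy stand-in. This is not the real HBase Observer API; `MiniRegion` and its `prePut`-style hook are invented here purely to illustrate the streaming case Andrew contrasts with bulk loading:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy stand-in for a region with an Observer-style hook: each record is
// transformed at write time, instead of being rewritten later in bulk.
final class MiniRegion {
    private final List<String> store = new ArrayList<>();
    private final UnaryOperator<String> prePutHook; // loosely analogous to an Observer's prePut

    MiniRegion(UnaryOperator<String> prePutHook) {
        this.prePutHook = prePutHook;
    }

    void put(String record) {
        store.add(prePutHook.apply(record)); // transform inline as data streams in
    }

    List<String> contents() {
        return store;
    }
}
```

The design point from the email carries over: this per-record hook is convenient for streaming arrivals, while a large bulk drop is better handled by transforming everything first and loading it in one shot.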
> Simple functions which return quickly make sense, but this is not a
> replacement for a generalized framework like MapReduce. Long-running
> server-side computations can interact with leases and client-side RPC
> management in problematic ways. However, those issues can be addressed
> by client- and server-side changes layered on coprocessors, which could
> be incorporated into the framework. Hence, HBASE-3131.
>
> > #4
> > Ok
> >
> > #5
> > I was talking specifically about coprocessorExec:
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#coprocessorExec%28java.lang.Class,%20byte[],%20byte[],%20org.apache.hadoop.hbase.client.coprocessor.Batch.Call%29
> >
> > Since the return value is a Map, I should assume that all the results
> > are gathered before it returns. So that would mean waiting for all
> > servers to complete their work.
>
> See also the coprocessorExec method that takes a callback. The callback
> will be invoked as results are returned from each individual
> RegionServer. You don't need to wait for all results to be gathered
> into a Map if you do not want that.
>
> > But theoretically, it should be possible to return early results so
> > that the caller could perform early aggregation of the results while
> > waiting for the remaining results to come. (Or I may be
> > misunderstanding something.)
> >
> > Thanks for the previous feedback. That's already clearer for me.
> >
> > Regards
> >
> > Bertrand
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)

--
Bertrand Dechoux
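The callback pattern Andrew points to, folding each RegionServer's partial result into a running total as it arrives rather than first materializing the full Map, can be simulated in plain Java. `PartialResultCallback` and the fake per-region loop below are hypothetical stand-ins for HBase's `Batch.Callback` and RPC fan-out, not the real API:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Simulates the callback flavor of coprocessorExec: each "region" returns a
// partial count, and the callback aggregates it immediately on arrival, so
// the client never waits for a complete per-region Map.
final class EarlyAggregation {
    interface PartialResultCallback { // stand-in for Batch.Callback<Long>
        void update(String region, long partial);
    }

    static void execOnRegions(List<Long> partialCounts, PartialResultCallback cb) {
        for (int i = 0; i < partialCounts.size(); i++) {
            cb.update("region-" + i, partialCounts.get(i)); // invoked per region as it responds
        }
    }

    static long totalCount(List<Long> partialCounts) {
        AtomicLong total = new AtomicLong();
        execOnRegions(partialCounts, (region, partial) -> total.addAndGet(partial));
        return total.get();
    }
}
```

This is exactly the "early aggregation while waiting for the remaining results" Bertrand asks about in #5: the accumulator is updated per response, under the assumption (true for the callback variant, per Andrew) that results are delivered incrementally.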
