All clear now, thanks a lot.

Bertrand

On Wed, Jul 25, 2012 at 7:28 PM, Andrew Purtell <[email protected]> wrote:

> Answers inline below.
>
> On Wed, Jul 25, 2012 at 1:09 AM, Bertrand Dechoux <[email protected]>
> wrote:
> > #1
> >
> > As Andrew pointed out, Cascading is indeed for MapReduce. I know the use
> > case was discussed; I wanted to know what the state is now. (The blog
> > entry is from 2010.) The use case is simple. I am doing log analysis and
> > would like to perform fast aggregations. These aggregations are common
> > (count/sum/average) but what exactly is aggregated will depend on the
> > type of logs. With Apache logs, HTTP error codes may be interesting;
> > with other logs, request durations could be. I (well, any user) would
> > like to be able to reuse common code and write the coprocessors in a
> > concise (domain specific) language.
>
> Implementing DSLs on top of Java is at best problematic. However, if
> someone were willing and able to do the work, the coprocessor
> environments could be plumbed into Clojure, Scala, or any other
> language that targets the JVM, can efficiently translate between
> native and Java types, and is better suited to building DSLs.
>
> > So I was checking that I had not missed any new projects/developments
> > around that idea. From what I understand of your answers, implementing
> > a kind of Cascading for coprocessors may be possible but has not been
> > done, and may not really be pertinent/safe/efficient with the current
> > architecture of coprocessors.
>
> Actually we have considered creating a "Cascading for Coprocessors":
> https://issues.apache.org/jira/browse/HBASE-3131
>
> The difference is how code shipping up to the cluster would work. It
> would not be like MapReduce where each job is a one-shot code
> deployment. That doesn't mean that you cannot install coprocessors and
> then map flows over them (via Exec).
>
> > #2
> >
> > I forgot that the shell still requires the table to be offline. Thanks
> > for pointing that out. So, coprocessors are not meant to be loaded that
> > often.
>
> Correct. However, given ongoing work like online schema changes, the
> introduction of a ServiceLoader (HBASE-4050), and separate
> classloaders (HBASE-6308), a more dynamic loading scheme for
> Coprocessors could happen once the supporting pieces are in place.
>
> > #3
> > I am not sure I understand your answer. I have read about the
> > Bigtable/HBase architecture, but I may also not have expressed my
> > problem correctly. The way I see it, coprocessors would allow me to
> > aggregate information from recent logs. The problem I have with
> > vanilla MapReduce is that if the logs do not fill a full HDFS block,
> > then MapReduce is a bit overkill. I thought that for those cases,
> > coprocessors would be more appropriate. Is that the right way to see
> > it? If so, is there any rule of thumb for knowing when to select
> > MapReduce versus Coprocessors? On the other side of the scale, I also
> > assume that if I had 1 terabyte of data, MapReduce would be faster
> > because it allows more parallelism. Well... I hope my concern is
> > clearer now.
>
> If you receive a lot of bulk data and need to transform it before
> storing it into HBase, then a MapReduce process is the efficient
> option. Even with an identity transform, it is more efficient to drop
> all of the new data into place in one transaction rather than a
> transaction for each item; this is the rationale for HBase bulk
> loading. On the other hand, if the data arrives in a streaming
> fashion, then Coprocessors make it possible to conveniently transform
> it inline as it is persisted, via Observers.
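For illustration, such an inline transform might look like the following sketch against the 0.92/0.94-era RegionObserver API. Everything specific here is hypothetical: the class name, the `log` family, and the `status`/`status_class` qualifiers are made up for the log-analysis use case discussed above, and the class would still need to be registered on the table like any other coprocessor.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical Observer: as each log row is written, derive a coarse
// "status class" column (e.g. "404" -> "4xx") from the HTTP status code,
// so later aggregations can scan a pre-computed column.
public class StatusClassObserver extends BaseRegionObserver {
  private static final byte[] FAMILY = Bytes.toBytes("log");
  private static final byte[] STATUS = Bytes.toBytes("status");
  private static final byte[] STATUS_CLASS = Bytes.toBytes("status_class");

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    if (put.has(FAMILY, STATUS)) {
      // Take the first value supplied for log:status in this Put.
      byte[] value = put.get(FAMILY, STATUS).get(0).getValue();
      String statusClass = value.length > 0
          ? (char) value[0] + "xx"   // first digit gives the class
          : "unknown";
      // The derived column is persisted as part of the same Put.
      put.add(FAMILY, STATUS_CLASS, Bytes.toBytes(statusClass));
    }
  }
}
```

Because prePut runs before the write is applied, the derived column costs no extra round trip and is covered by the same WAL entry as the original data.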
>
> Observers may need to be reconfigured at runtime or may need a side
> channel for communication. So, we designed Endpoints (i.e. Exec) to
> enable registration of dynamic/user RPC protocols at runtime.
>
> Endpoints have also been used for running aggregation functions over
> the region data on demand; see AggregationProtocol. Simple functions
> which return quickly make sense, but this is not a replacement for a
> generalized framework like MapReduce. Long-running server-side
> computations can interact with leases and client-side RPC management
> in problematic ways. However, those issues can be addressed by client
> and server side changes layered on Coprocessors, which could be
> incorporated into the framework. Hence, HBASE-3131.
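A sketch of the AggregationProtocol usage mentioned above, assuming the 0.92/0.94-era client API. The table name `access_logs` and the `log:duration` column are hypothetical, and the table must have AggregateImplementation loaded as a coprocessor for these calls to succeed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical client-side use of AggregationProtocol against a table
// on which AggregateImplementation has been loaded.
public class LogAggregates {
  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    AggregationClient client = new AggregationClient(conf);

    // Restrict the aggregation to the column we care about.
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("log"), Bytes.toBytes("duration"));

    byte[] table = Bytes.toBytes("access_logs");  // hypothetical table
    // Each call fans out to every region hosting the table and combines
    // the per-region partial results on the client.
    long rows = client.rowCount(table, new LongColumnInterpreter(), scan);
    long sum  = client.sum(table, new LongColumnInterpreter(), scan);
    System.out.println("rows=" + rows + " sum(duration)=" + sum);
  }
}
```

Because each region computes its partial result locally and only the partials travel over the wire, this fits the "simple functions which return quickly" case; anything heavier runs into the lease issues described above.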
>
> > #4
> > Ok
> >
> > #5
> > I was talking specifically of coprocessorExec.
> >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableInterface.html#coprocessorExec%28java.lang.Class,%20byte[],%20byte[],%20org.apache.hadoop.hbase.client.coprocessor.Batch.Call%29
> >
> > Since the return value is a Map, I should assume that all the results
> > are gathered before returning it. So that would mean waiting for all
> > servers to complete their work.
>
> See also the Exec method that takes a callback. The callback will be
> invoked as results are returned from each individual RegionServer. You
> don't need to wait for all results to be gathered into a Map if you do
> not want that.
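As a sketch of the callback variant against the pre-0.96 Exec API: `CountProtocol` and `countErrors` below are hypothetical names for a user-defined dynamic RPC protocol whose Endpoint implementation would be deployed on the RegionServers, and `access_logs` is an illustrative table name.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamingExec {
  // Hypothetical user-defined dynamic RPC protocol, implemented by an
  // Endpoint loaded on the RegionServers.
  public interface CountProtocol extends CoprocessorProtocol {
    long countErrors() throws IOException;
  }

  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, Bytes.toBytes("access_logs"));
    final AtomicLong total = new AtomicLong();

    // With the Batch.Callback variant, update() fires as each region's
    // result arrives, so aggregation starts before the slowest region
    // has replied. Null start/end keys cover the whole table.
    table.coprocessorExec(CountProtocol.class, null, null,
        new Batch.Call<CountProtocol, Long>() {
          public Long call(CountProtocol instance) throws IOException {
            return instance.countErrors();
          }
        },
        new Batch.Callback<Long>() {
          public void update(byte[] region, byte[] row, Long result) {
            total.addAndGet(result);  // running, early aggregation
          }
        });

    System.out.println("errors=" + total.get());
    table.close();
  }
}
```

This is exactly the "early aggregation while waiting for the remaining results" pattern asked about below: the client never needs the full Map of per-region results in memory.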
>
> > But theoretically, it should be possible to return early results so
> > that the caller could perform early aggregation of the results while
> > waiting for the remaining results to come. (Or I may be
> > misunderstanding something.)
> >
> > Thanks for the previous feedback. That's already clearer for me.
> >
> > Regards
> >
> > Bertrand
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
>



-- 
Bertrand Dechoux
