William.

For your particular request:

> each region server can
> calculate the mean of rows it contains on itself instead of transporting
> every row back to the client

A related JIRA issue is open here:
https://issues.apache.org/jira/browse/HBASE-1512

Now it's a sub-ticket of HBASE-2000.

You have two options for computing an aggregate such as a mean over a region on the region server:

1) Wait for the MapReduce framework for coprocessors. However, it won't be available soon, since it isn't our highest priority right now. Andy had a prototype, but we decided to take it out of the HBASE-2001 patch. You may want to contribute to its new design. (I will create a new JIRA for the mapred framework for coprocessors.)

2) Use the CommandTarget: there is a simple CommandTarget sample that performs column aggregation on the region server. But you need to know some HBase internals to build a CommandTarget. I think this piece will be checked in to TRUNK soon.
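
To make the idea in William's request concrete, here is a minimal sketch of how a mean can be split into per-region work plus a cheap merge on the client. It is plain Java and does not use any HBase or coprocessor API; the class name PartialMean and its methods are hypothetical, chosen only for illustration. Each region returns a (sum, count) pair rather than a mean, because averaging per-region means directly would give the wrong answer when regions hold different numbers of rows.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not HBase API: each "region" reduces its own rows
// to a (sum, count) pair, and the client merges those small pairs.
final class PartialMean {
    final double sum;
    final long count;

    PartialMean(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Would run where the data lives: one pass over the region's values.
    static PartialMean ofRegion(double[] values) {
        double s = 0.0;
        for (double v : values) {
            s += v;
        }
        return new PartialMean(s, values.length);
    }

    // Would run on the client: combines the per-region results.
    static PartialMean merge(List<PartialMean> parts) {
        double s = 0.0;
        long n = 0;
        for (PartialMean p : parts) {
            s += p.sum;
            n += p.count;
        }
        return new PartialMean(s, n);
    }

    double mean() {
        if (count == 0) {
            throw new IllegalStateException("no rows");
        }
        return sum / count;
    }

    public static void main(String[] args) {
        double[] regionA = {1.0, 2.0, 3.0};
        double[] regionB = {10.0};
        PartialMean total = merge(Arrays.asList(ofRegion(regionA), ofRegion(regionB)));
        System.out.println(total.mean());
    }
}
```

Only two numbers per region cross the network in this model, regardless of how many rows each region holds.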

Thanks,
Mingjie

On 10/07/2010 12:44 PM, William Kang wrote:
Hi St. Ack,
Thanks a lot for your information. I will look them up. If the coprocessors
can work with the 0.90 manual balanced hbase, that would be really nice.


William

On Thu, Oct 7, 2010 at 2:31 PM, Stack<[email protected]>  wrote:

William:

Coprocessors will be committed to TRUNK sometime in the next few days.
They are well documented.  I suggest you start with this
package-info.html posted to HBASE-2001 by Andrew and Mingjie:
https://issues.apache.org/jira/secure/attachment/12456164/packge-info.html
It serves as a good intro to the utility coprocessors add and has
good example uses, including examples that strongly resemble what
you would like to do, described below.

St.Ack


On Wed, Oct 6, 2010 at 11:08 PM, William Kang<[email protected]>
wrote:
Ryan, thanks for your explanation. It is very clear and helpful.

Andy, I think HBASE-2000 is exactly what I was asking for. In general, MR is
not built for low-latency purposes. But our applications do need something
fast and lightweight. For example, we might just want to know the mean of our
query results over some values inside rows. If each region server can
calculate the mean of rows it contains on itself instead of transporting
every row back to the client, it would be much faster to get the final
result. Will HBASE-2000 be able to do it? And would you please share more
information about the development process and how I may contribute to it?
Many thanks.


William

On Wed, Oct 6, 2010 at 11:57 AM, Andrew Purtell<[email protected]>
wrote:

Hi William,

I think you are asking about HBASE-2000:
https://issues.apache.org/jira/browse/HBASE-2000

Work on an in-process parallel execution framework for HBase is in
progress, yes. We have some initial patches up for review which are the
start of this.

Best regards,

    - Andy


--- On Tue, 10/5/10, Ryan Rawson<[email protected]>  wrote:

From: Ryan Rawson<[email protected]>
Subject: Re: Parallel computing on HBase
To: [email protected]
Date: Tuesday, October 5, 2010, 11:10 PM
You understand the hbase data model, yes?  Each region gets a mapper,
and each mapper reads the rows for that region, feeding them into the map
functions.  On the output side, each reducer just writes to hbase. The
parallelism can support millions of row reads/second.
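
As a rough illustration of the one-mapper-per-region model described above, here is a toy sketch in plain Java. A sorted map stands in for a table and a list of split keys stands in for region boundaries; the class RegionMappers and its methods are hypothetical and are not HBase or Hadoop API. A real job would be configured through the HBase MapReduce integration instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: none of this is HBase or Hadoop API.
final class RegionMappers {

    // Cuts the sorted table into contiguous key ranges ("regions").
    static List<NavigableMap<String, Long>> regions(
            TreeMap<String, Long> table, List<String> splitKeys) {
        List<NavigableMap<String, Long>> out = new ArrayList<>();
        String start = null;
        for (String split : splitKeys) {
            out.add(start == null
                    ? table.headMap(split, false)
                    : table.subMap(start, true, split, false));
            start = split;
        }
        out.add(start == null ? table : table.tailMap(start, true));
        return out;
    }

    // One task per region; each task reads only its own rows.
    static long sumAll(TreeMap<String, Long> table, List<String> splitKeys) {
        List<NavigableMap<String, Long>> parts = regions(table, splitKeys);
        ExecutorService pool = Executors.newFixedThreadPool(parts.size());
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (NavigableMap<String, Long> region : parts) {
                Callable<Long> mapper = () -> {
                    long s = 0;
                    for (long v : region.values()) {
                        s += v;
                    }
                    return s;
                };
                results.add(pool.submit(mapper));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("region task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Long> table = new TreeMap<>();
        table.put("a", 1L);
        table.put("g", 2L);
        table.put("m", 3L);
        table.put("t", 4L);
        System.out.println(sumAll(table, Arrays.asList("h", "p")));
    }
}
```

The point of the sketch is that no single reader sees the whole table: the degree of parallelism follows the number of regions.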

I don't understand the rest of your question, unfortunately.

good luck!
-ryan

On Tue, Oct 5, 2010 at 9:40 PM, William Kang<[email protected]>
wrote:
Can you tell me a little about how HBase works with MR? If the MR
source/sink has to go through just ONE region client, then it is not what I am
looking for. But if MR can plug directly into the region server containing
specific rows, then it might work. Furthermore, MR is a heavyweight process
with lots of overhead. Ideally, we want something lightweight that can get
results fast. Many thanks.


William

On Wed, Oct 6, 2010 at 12:01 AM, Jeff Zhang<[email protected]>
wrote:

You can combine MapReduce with HBase for parallel computing.



On Wed, Oct 6, 2010 at 11:24 AM, William Kang
<[email protected]>
wrote:
Hi guys,
Is there any project going on for co-processing on region servers? Right now,
we have to transfer all data from the region servers to the region client after
a query, is that right? This can be slow. Furthermore, the CPUs on the region
servers are not fully used. If we could distribute the computation along with
the data on the region servers, that would be really handy for some problems.
Is it possible to do so? Many thanks.


William




--
Best Regards

Jeff Zhang










