Re: Distribution of regions to servers

Dan Han Sun, 30 Sep 2012 10:35:40 -0700

Ramkrishna, I got your meaning. Thanks very much for your reply.

Best Wishes
Dan Han


On Thu, Sep 27, 2012 at 10:21 PM, Ramkrishna.S.Vasudevan <
[email protected]> wrote:

> Hi Dan
>
> I am not very sure whether my answer was in fact relevant to your problem.
> Anyway, I can try answering the question about the 'region being redundant'.
> No two regions can be responsible for the same range of data in one table.
> That is why if any region is not available that portion of data is not
> available to the clients.
>
> "when you go with coprocessor on a collocated regions, the caching and
> rpc timeout needs to be set accordingly."
> What I meant here was that every scan will now hit two regions and, as per
> your use case, one is going to be dense and the other will return quickly.
> We may need to make sure that the overall scan does not time out.
>
> Regards
> Ram
>
>
> > -----Original Message-----
> > From: Dan Han [mailto:[email protected]]
> > Sent: Friday, September 28, 2012 3:05 AM
> > To: [email protected]
> > Subject: Re: Distribution of regions to servers
> >
> > Hi Ramkrishna,
> >
> >   I think relocating regions is based on the queries and queried data.
> > The relocation can scatter the regions involved in the query across
> > region servers, which might help large queries get better load balance.
> > For small queries, the distribution of regions can also impact the
> > throughput.
> >
> > On this point, I actually have a question: can a region be redundant?
> > For example, can two regions be responsible for the same range of data?
> >
> > I don't quite understand this: "when you go with coprocessor on
> > a collocated regions, the caching and rpc timeout needs to be set
> > accordingly."
> > Could you please explain it further? Thanks in advance.
> >
> > Best Wishes
> > Dan Han
> >
> >
> > On Wed, Sep 26, 2012 at 10:49 PM, Ramkrishna.S.Vasudevan <
> > [email protected]> wrote:
> >
> > > Just trying out here,
> > >
> > > Is it possible for you to collocate the region of the 1st schema and the
> > > region of the 2nd schema, so that the whole query execution happens on a
> > > single RS and there is not much IO?
> > > Also, when you go with coprocessor on a collocated regions, the caching
> > > and rpc timeout needs to be set accordingly.
> > >
> > > Regards
> > > Ram
> > > > -----Original Message-----
> > > > From: Dan Han [mailto:[email protected]]
> > > > Sent: Thursday, September 27, 2012 7:00 AM
> > > > To: [email protected]
> > > > Subject: Re: Distribution of regions to servers
> > > >
> > > > Hi Eugeny,
> > > >
> > > >    Thanks for your response. I answered your questions inline in Blue.
> > > > And I'd like to give an example to describe my problem.
> > > >
> > > > Let's think about two data schemas for the same dataset.
> > > > The two data schemas have different composite row keys, but both
> > > > share one part, which represents a sequence ID.
> > > > In the 1st schema, one row contains 1KB of information,
> > > > while in the 2nd schema, one row contains 10KB of information.
> > > > So the number of rows in one region in the 1st schema is larger than
> > > > that in the 2nd schema, right? If the queried data is selected by
> > > > sequence ID, then, as one region in the 1st schema is responsible for
> > > > more rows than one in the 2nd schema, there would be more computation
> > > > and a longer execution time for the corresponding coprocessor.
> > > > So in this case, if the regions are not distributed well,
> > > > some region servers will suffer from excess workload.
> > > > That is why I want to do some management of regions to get better
> > > > load balance for large queries.
> > > >
> > > > Hope it makes sense to you.
> > > >
> > > > Best Wishes
> > > > Dan Han
> > > >
> > > >
> > > > On Wed, Sep 26, 2012 at 3:19 PM, Eugeny Morozov
> > > > <[email protected]> wrote:
> > > >
> > > > > Dan,
> > > > >
> > > > > I have additional questions.
> > > > > What is the access pattern of your queries? I mean that, e.g.,
> > > > > PrefixFilters have to be applied to all KeyValue pairs in HFiles,
> > > > > which could be slow. Or, e.g., the scanner setCaching option can
> > > > > decrease the number of network hops needed to get data from the
> > > > > RegionServer.
> > > > >
> > > >
> > > >     I set the range of the rows and the related columns to narrow down
> > > > the scan scope,
> > > >     and I used PrefixFilter/ColumnFilter/BinaryFilter to get the rows.
> > > >     I set a small cache (5KB), but I kept it the same for all evaluated
> > > > data schemas,
> > > >     because I mainly focus on evaluating the performance of queries
> > > > under the different data schemas.
> > > >
> > > >
> > > > > Additionally, coprocessors are able to use InternalScanner instead
> > > > > of ResultScanner, which could also help greatly.
> > > > >
> > > >
> > > >     yes, I used InternalScanner.
> > > >
> > > > >
> > > > > Also, the more dimensions you specify, the more precise your query
> > > > > is and the less data has to be processed: family, columns,
> > > > > timeranges, etc.
> > > > >
> > > > >
> > > > > On Wed, Sep 26, 2012 at 7:39 PM, Dan Han <[email protected]> wrote:
> > > > >
> > > > > >   Thanks for your swift response, Ramkrishna and Anoop. I will
> > > > > > explain what we are doing now below.
> > > > > >
> > > > > >    We are trying to explore a systematic way to design the
> > > > > > appropriate data schema for various applications in HBase. So we
> > > > > > first designed several data schemas for each dataset and evaluated
> > > > > > them with the same queries. The queries are designed based on the
> > > > > > requirements, such as selecting the data with a matching expression
> > > > > > or finding the difference between two snapshots. The queries were
> > > > > > processed with a user-level Coprocessor.
> > > > > >
> > > > > >    In our experiments, we found that under some data schemas, the
> > > > > > queries cannot get any results because of connection timeouts and,
> > > > > > sometimes, RS crashes. We observed that in this case, the queried
> > > > > > data were concentrated in a few regions located on a few region
> > > > > > servers. We think the failure might be caused by the excess workload
> > > > > > on these few region servers and by inappropriate load balancing. To
> > > > > > the best of our knowledge, this case can be avoided by distributing
> > > > > > the regions well across the region servers.
> > > > > >
> > > > > >   Therefore, we have been thinking of adding a monitoring and
> > > > > > management component between the client and server, which can
> > > > > > schedule the queries/jobs from the client side and distribute the
> > > > > > regions dynamically according to the current workload of each region
> > > > > > server, the incoming queries and data locality.
> > > > > >
> > > > > >   Does it make sense? Just my two cents. Any comments?
> > > > > >
> > > > > > Best Wishes
> > > > > > Dan Han
> > > > > >
> > > > > > On Tue, Sep 25, 2012 at 10:44 PM, Anoop Sam John <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi
> > > > > > > Can you share more details, please? What work are you doing
> > > > > > > within the CPs?
> > > > > > >
> > > > > > > -Anoop-
> > > > > > > ________________________________________
> > > > > > > From: Dan Han [[email protected]]
> > > > > > > Sent: Wednesday, September 26, 2012 5:55 AM
> > > > > > > To: [email protected]
> > > > > > > Subject: Distribution of regions to servers
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > >    I am doing some experiments on HBase with Coprocessor. I found
> > > > > > > that the performance of Coprocessor is impacted much by the
> > > > > > > distribution of the regions. I am interested in going deeper into
> > > > > > > this problem to see if I can do something.
> > > > > > >
> > > > > > >   I only found the discussion at the following link.
> > > > > > >
> > > > > > > http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers
> > > > > > >
> > > > > > > I am wondering if there is any further discussion or any on-going
> > > > > > > work? Can someone point me to it if there is?
> > > > > > > Thanks in advance.
> > > > > > >
> > > > > > > Best Wishes
> > > > > > > Dan Han
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Evgeny Morozov
> > > > > Developer Grid Dynamics
> > > > > Skype: morozov.evgeny
> > > > > www.griddynamics.com
> > > > > [email protected]
> > > > >
> > >
> > >
>
>
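
The point made in the thread about setting scanner caching and the RPC
timeout together can be illustrated with plain arithmetic. The sketch below
is not HBase code and does not call the HBase API: the class ScanBudget, its
method names, the 60-second timeout and the per-row costs are all assumptions
chosen for illustration. It only shows why a caching value that is fine for a
region that returns quickly may exceed the timeout on a dense region.

```java
// Hypothetical helper for illustration only; not part of the HBase API.
final class ScanBudget {

    /** Round trips needed to fetch `rows` rows when each RPC returns up to `caching` rows. */
    static long rpcCount(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    /** Worst-case server time for one RPC, i.e. a full batch of `caching` rows. */
    static long millisPerRpc(int caching, double millisPerRow) {
        return (long) Math.ceil(caching * millisPerRow);
    }

    /** Largest caching value whose full batch still fits within the RPC timeout. */
    static int maxSafeCaching(long rpcTimeoutMillis, double millisPerRow) {
        if (rpcTimeoutMillis <= 0 || millisPerRow <= 0) {
            throw new IllegalArgumentException("timeout and per-row cost must be > 0");
        }
        long c = (long) Math.floor(rpcTimeoutMillis / millisPerRow);
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, c));
    }

    public static void main(String[] args) {
        long rpcTimeoutMillis = 60_000;  // assumed RPC timeout
        double denseMsPerRow = 50.0;     // assumed cost per row in the dense region
        double quickMsPerRow = 0.5;      // assumed cost per row in the quick region
        int caching = 5_000;             // assumed caching value shared by both scans

        System.out.println("dense region, ms per RPC: "
                + millisPerRpc(caching, denseMsPerRow));
        System.out.println("quick region, ms per RPC: "
                + millisPerRpc(caching, quickMsPerRow));
        System.out.println("max safe caching for dense region: "
                + maxSafeCaching(rpcTimeoutMillis, denseMsPerRow));
        System.out.println("RPCs to read 1,000,000 rows at that caching: "
                + rpcCount(1_000_000, maxSafeCaching(rpcTimeoutMillis, denseMsPerRow)));
    }
}
```

Under these assumed numbers, a smaller caching value keeps each call on the
dense region within the timeout at the price of more round trips, which is
the trade-off between network hops and timeouts discussed above.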
