Ramkrishna,

I got your meaning. Thanks very much for your reply.

Best Wishes
Dan Han

On Thu, Sep 27, 2012 at 10:21 PM, Ramkrishna.S.Vasudevan <[email protected]> wrote:

> Hi Dan
>
> Am not very sure whether my answer was in fact relevant to your problem.
> Anyway, I can try answering about the 'region being redundant'.
> No two regions can be responsible for the same range of data in one table.
> That is why, if any region is not available, that portion of data is not
> available to the clients.
>
> "when you go with coprocessor on a collocated regions, the caching and rpc
> timeout needs to be set accordingly."
> What I meant here was that now every scan will hit two regions and, as per
> your use case, one is going to be dense and the other one will return quickly.
> Maybe we need to make sure that the overall scan does not time out.
>
> Regards
> Ram
>
> > -----Original Message-----
> > From: Dan Han [mailto:[email protected]]
> > Sent: Friday, September 28, 2012 3:05 AM
> > To: [email protected]
> > Subject: Re: Distribution of regions to servers
> >
> > Hi Ramkrishna,
> >
> > I think relocating regions is based on the queries and queried data.
> > The relocation can scatter the regions involved in the query across
> > region servers, which might enable large queries to get better load balance.
> > For small queries, distribution of regions can also impact the throughput.
> >
> > To this point, I actually have a question here: can the region be redundant?
> > For example, can there be two regions which are responsible for the same
> > range of data?
> >
> > I don't quite understand this: "when you go with coprocessor on
> > a collocated regions, the caching and rpc timeout needs to be set accordingly."
> > Could you please explain it further? Thanks in advance.
> >
> > Best Wishes
> > Dan Han
> >
> > On Wed, Sep 26, 2012 at 10:49 PM, Ramkrishna.S.Vasudevan <[email protected]> wrote:
> >
> > > Just trying out here,
> > >
> > > Is it possible for you to collocate the region of the 1st schema and the
> > > region of the 2nd schema so that overall the total query execution happens
> > > on a single RS and there is not much IO?
> > > Also when you go with coprocessor on a collocated regions, the caching and
> > > rpc timeout needs to be set accordingly.
> > >
> > > Regards
> > > Ram
> > >
> > > > -----Original Message-----
> > > > From: Dan Han [mailto:[email protected]]
> > > > Sent: Thursday, September 27, 2012 7:00 AM
> > > > To: [email protected]
> > > > Subject: Re: Distribution of regions to servers
> > > >
> > > > Hi, Eugeny,
> > > >
> > > > Thanks for your response. I answered your questions inline in Blue.
> > > > And I'd like to give an example to describe my problem.
> > > >
> > > > Let's think about two data schemas for the same dataset.
> > > > The two data schemas have different composite row keys, but there is
> > > > a common part in both schemas, which represents a sequence ID.
> > > > In the 1st schema, one row contains 1KB of information,
> > > > while in the 2nd schema, one row contains 10KB of information.
> > > > So the number of rows in one region in the 1st schema is more than
> > > > that in the 2nd schema, right? If the queried data is based on the
> > > > sequence ID, then as one region in the 1st schema is responsible for
> > > > more rows than one in the 2nd schema,
> > > > there would be more computation and a longer execution time for the
> > > > corresponding coprocessor.
> > > > So in this case, if the regions are not distributed well,
> > > > some region servers will suffer from excess workload.
> > > > That is why I want to do some management of regions to get better load
> > > > balance based on large queries.
> > > >
> > > > Hope it makes sense to you.
> > > >
> > > > Best Wishes
> > > > Dan Han
> > > >
> > > > On Wed, Sep 26, 2012 at 3:19 PM, Eugeny Morozov <[email protected]> wrote:
> > > >
> > > > > Dan,
> > > > >
> > > > > I have additional questions.
> > > > > What is the access pattern of your queries? I mean that, e.g., PrefixFilters
> > > > > have to be applied to all KeyValue pairs in HFiles, which could be slow.
> > > > > Or, e.g., the scanner setCaching option is able to decrease the number of
> > > > > network hops to get data from the RegionServer.
> > > > >
> > > >
> > > > I set the range of the rows and the related columns to narrow down the
> > > > scan scope,
> > > > and I used PrefixFilter/ColumnFilter/BinaryFilter to get the rows.
> > > > I set a little cache (5KB), but I kept it the same for all evaluated
> > > > data schemas,
> > > > because I mainly focus on evaluating the performance of queries under the
> > > > different data schemas.
> > > >
> > > > > Additionally, coprocessors are able to use InternalScanner instead of
> > > > > ResultScanner, which could also help greatly.
> > > > >
> > > >
> > > > yes, I used InternalScanner.
> > > >
> > > > > Also, the more dimensions you specify, the more precise your query is and
> > > > > the less data has to be processed - family, columns, timeranges, etc.
> > > > >
> > > > >
> > > > > On Wed, Sep 26, 2012 at 7:39 PM, Dan Han <[email protected]> wrote:
> > > > >
> > > > > > Thanks for your swift response, Ramkrishna and Anoop. And I will
> > > > > > explicate what we are doing now below.
> > > > > >
> > > > > > We are trying to explore a systematic way to design the appropriate data
> > > > > > schema for various applications in HBase. So we first designed several data
> > > > > > schemas for each dataset and evaluated them with the same queries.
> > > > > > The queries are designed based on the requirements, such as selecting the
> > > > > > data with a matching expression, or finding the difference between two
> > > > > > snapshots. The queries were processed with user-level Coprocessor.
> > > > > >
> > > > > > In our experiments, we found that under some data schemas, the queries
> > > > > > sometimes cannot get any results because of connection timeouts and RS
> > > > > > crashes. We observed that in this case, the queried data were centered in
> > > > > > a few regions located on a few region servers. We think the failure might
> > > > > > be caused by the excess workload on these few region servers and the
> > > > > > inappropriate load balance. To the best of our knowledge, this case can be
> > > > > > avoided and improved by well-distributed regions across the region servers.
> > > > > >
> > > > > > Therefore, we have been thinking of adding a monitoring and management
> > > > > > component between the client and server, which can schedule the
> > > > > > queries/jobs from the client side and distribute the regions dynamically
> > > > > > according to the current workload of each region server, the incoming
> > > > > > queries and data locality.
> > > > > >
> > > > > > Does it make sense? Just my two cents. Any comments?
> > > > > >
> > > > > > Best Wishes
> > > > > > Dan Han
> > > > > >
> > > > > > On Tue, Sep 25, 2012 at 10:44 PM, Anoop Sam John <[email protected]> wrote:
> > > > > >
> > > > > > > Hi
> > > > > > > Can u share more details pls?
> > > > > > > What work you are doing within the CPs
> > > > > > >
> > > > > > > -Anoop-
> > > > > > > ________________________________________
> > > > > > > From: Dan Han [[email protected]]
> > > > > > > Sent: Wednesday, September 26, 2012 5:55 AM
> > > > > > > To: [email protected]
> > > > > > > Subject: Distribution of regions to servers
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I am doing some experiments on HBase with Coprocessor. I found that the
> > > > > > > performance of Coprocessor is impacted much by the distribution of the
> > > > > > > regions. I am kind of interested in
> > > > > > > going deep into this problem and see if I can do something.
> > > > > > >
> > > > > > > I only searched out the discussion in the following link.
> > > > > > >
> > > > > > > http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers
> > > > > > >
> > > > > > > I am wondering if there is any further discussion or any on-going work?
> > > > > > > Can someone point it to me if there is?
> > > > > > > Thanks in advance.
> > > > > > >
> > > > > > > Best Wishes
> > > > > > > Dan Han
> > > > >
> > > > > --
> > > > > Evgeny Morozov
> > > > > Developer Grid Dynamics
> > > > > Skype: morozov.evgeny
> > > > > www.griddynamics.com
> > > > > [email protected]
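
A rough illustration of two points raised in the thread: with a fixed region size, 1KB rows give roughly ten times as many rows per region as 10KB rows, and a larger scanner caching value (rows returned per RPC, as with setCaching) reduces the number of network round trips. This is only a back-of-the-envelope sketch: the 256 MB region size, the caching values, and all class and method names are assumptions made for illustration, storage overhead is ignored, and nothing here calls the HBase API.

```java
public class RegionScanEstimate {

    // Rows that fit in a region of the given size.
    // Simplification: ignores key, column and storage overhead.
    static long rowsPerRegion(long regionBytes, long rowBytes) {
        return regionBytes / rowBytes;
    }

    // Scanner round trips needed to read `rows` rows when each
    // RPC returns at most `caching` rows (ceiling division).
    static long roundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long regionBytes = 256L * 1024 * 1024; // assumed region size, not from the thread

        long rowsSchema1 = rowsPerRegion(regionBytes, 1024);      // 1KB rows (1st schema)
        long rowsSchema2 = rowsPerRegion(regionBytes, 10 * 1024); // 10KB rows (2nd schema)

        System.out.println("rows per region, 1st schema: " + rowsSchema1);
        System.out.println("rows per region, 2nd schema: " + rowsSchema2);

        // Hypothetical caching values, to show the effect on round trips.
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching
                    + " -> round trips: 1st schema " + roundTrips(rowsSchema1, caching)
                    + ", 2nd schema " + roundTrips(rowsSchema2, caching));
        }
    }
}
```

Under these assumptions the dense region of the 1st schema needs about ten times the round trips of the 2nd schema's region at the same caching value, which is why the caching and RPC timeout settings have to allow for the slower of the two collocated regions.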
