Thanks for your advice, Eugeny.

Best Wishes
Dan Han

On Thu, Sep 27, 2012 at 2:34 AM, Eugeny Morozov <[email protected]> wrote:

> Dan, see inlined.
>
> On Thu, Sep 27, 2012 at 5:30 AM, Dan Han <[email protected]> wrote:
>
> > Hi, Eugeny,
> >
> > Thanks for your response. I answered your questions inline in Blue.
> > And I'd like to give an example to describe my problem.
> >
> > Let's think about two data schemas for the same dataset.
> > The two data schemas have different composite row keys.
>
> Just the first idea. If you have different schemas, then it would be much
> simpler to have two different tables with these schemas, because in this
> case HBase itself automatically distributes each of the tables' regions
> evenly across the cluster. You could actually use the same coprocessor for
> both of the tables.
>
> In case you're using two different column families, you could specify
> a different BLOCKSIZE for each (default value is '65536'). You could make
> this option differ by a factor of 10 between the CFs (matching the
> difference between your schemas). I believe this would decrease the number
> of reads for larger data chunks.
>
> In general it is actually not good to have two (or more) column families
> that are really different in size, because compaction and flushing are
> based on region, which means that if HBase starts compacting the small
> column family it will do the same for the big one.
> http://hbase.apache.org/book.html#number.of.cfs
>
> BTW, I don't think that coprocessors are a good choice for data mining.
> The reason is that it is kind of dangerous. Since coprocessors are
> server-side creatures - they live in the Region Server - they could simply
> bring the whole system down. Expensive analysis creates heap and CPU
> pressure, which in turn leads to GC pauses and even more CPU pressure.
>
> Consider using Pig and HBaseStorage to load data from HBase.
>
> > But there is
> > a same part in both schemas, which represents a sequence ID.
> > In 1st schema, one row contains 1KB information;
> > while in 2nd schema, one row contains 10KB information.
> > So the number of rows in one region in 1st schema is more than
> > that in 2nd schema, right? If the queried data is based on the
> > sequence ID, as one region in 1st schema is responsible for more rows
> > than that in 2nd schema, there would be more computation and longer
> > execution time for the corresponding coprocessor.
> > So in this case, if the regions are not distributed well,
> > some region servers will suffer from excess workload.
> > That is why I want to do some management of regions to get better load
> > balance based on large queries.
> >
> > Hope it makes sense to you.
> >
> > Best Wishes
> > Dan Han
> >
> > On Wed, Sep 26, 2012 at 3:19 PM, Eugeny Morozov <[email protected]> wrote:
> >
> > > Dan,
> > >
> > > I have additional questions.
> > > What is the access pattern of your queries? I mean that, e.g.,
> > > PrefixFilters have to be applied to all KeyValue pairs in HFiles,
> > > which could be slow. Or, e.g., the scanner setCaching option is able
> > > to decrease the number of network hops to get data from the
> > > RegionServer.
> >
> > I set the range of the rows and the related columns to narrow down the
> > scan scope, and I used PrefixFilter/ColumnFilter/BinaryFilter to get
> > the rows. I set a little cache (5KB), but I kept it the same for all
> > evaluated data schemas, because I mainly focus on evaluating the
> > performance of queries under the different data schemas.
> >
> > > Additionally, coprocessors are able to use InternalScanner instead of
> > > ResultScanner, which could also help greatly.
> >
> > Yes, I used InternalScanner.
> >
> > > Also, the more dimensions you specify, the more precise your query is
> > > and the less data is about to be processed - family, columns,
> > > timeranges, etc.
> > >
> > > On Wed, Sep 26, 2012 at 7:39 PM, Dan Han <[email protected]> wrote:
> > >
> > > > Thanks for your swift response, Ramkrishna and Anoop. And I will
> > > > explicate what we are doing now below.
> > > >
> > > > We are trying to explore a systematic way to design the appropriate
> > > > data schema for various applications in HBase. So we first designed
> > > > several data schemas for each dataset and evaluated them with the
> > > > same queries. The queries are designed based on the requirements,
> > > > such as selecting the data with a matching expression, or finding
> > > > the difference between two snapshots. The queries were processed
> > > > with a user-level Coprocessor.
> > > >
> > > > In our experiments, we found that under some data schemas, the
> > > > queries sometimes cannot get any results because of connection
> > > > timeouts and RS crashes. We observed that in this case, the queried
> > > > data were concentrated in a few regions located on a few region
> > > > servers. We think the failure might be caused by the excess workload
> > > > on these few region servers and the inappropriate load balance. To
> > > > the best of our knowledge, this case can be avoided and improved by
> > > > well-distributed regions across the region servers.
> > > >
> > > > Therefore, we have been thinking of adding a monitoring and
> > > > management component between the client and server, which can
> > > > schedule the queries/jobs from the client side and distribute the
> > > > regions dynamically according to the current workload of each region
> > > > server, the incoming queries and data locality.
> > > >
> > > > Does it make sense? Just my two cents. Any comments?
> > > >
> > > > Best Wishes
> > > > Dan Han
> > > >
> > > > On Tue, Sep 25, 2012 at 10:44 PM, Anoop Sam John <[email protected]> wrote:
> > > >
> > > > > Hi
> > > > > Can u share more details pls?
> > > > > What work are you doing within the CPs?
> > > > >
> > > > > -Anoop-
> > > > > ________________________________________
> > > > > From: Dan Han [[email protected]]
> > > > > Sent: Wednesday, September 26, 2012 5:55 AM
> > > > > To: [email protected]
> > > > > Subject: Distribution of regions to servers
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I am doing some experiments on HBase with Coprocessor. I found
> > > > > that the performance of Coprocessor is impacted much by the
> > > > > distribution of the regions. I am kind of interested in going deep
> > > > > into this problem and seeing if I can do something.
> > > > >
> > > > > I only found the discussion in the following link.
> > > > >
> > > > > http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers
> > > > >
> > > > > I am wondering if there is any further discussion or any on-going
> > > > > work? Can someone point me to it if there is?
> > > > > Thanks in advance.
> > > > >
> > > > > Best Wishes
> > > > > Dan Han
> > >
> > > --
> > > Evgeny Morozov
> > > Developer Grid Dynamics
> > > Skype: morozov.evgeny
> > > www.griddynamics.com
> > > [email protected]
>
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [email protected]
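
For reference, the row-count argument in this thread (1KB rows vs 10KB rows) and the idea of placing regions by estimated workload instead of by region count can be illustrated with a small worked example. This is only a sketch in plain Python, not HBase code: the 256 MB region size, the region and server names, and the helpers `rows_per_region` and `assign_by_load` are assumptions made for illustration, and the greedy rule shown here is not a description of how the HBase balancer works.

```python
import heapq

KB = 1024
MB = 1024 * KB


def rows_per_region(region_bytes, row_bytes):
    """Approximate number of rows that fit in one region of the given size."""
    return region_bytes // row_bytes


def assign_by_load(region_loads, servers):
    """Greedy placement: heaviest region first, onto the least-loaded server.

    region_loads maps a (hypothetical) region name to an estimated cost,
    here the number of rows a coprocessor would have to scan.
    Returns (assignment, total_load_per_server).
    """
    heap = [(0, s) for s in servers]
    heapq.heapify(heap)
    assignment = {s: [] for s in servers}
    # Sort by descending load; break ties by name so the result is deterministic.
    ordered = sorted(region_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for region, load in ordered:
        total, server = heapq.heappop(heap)
        assignment[server].append(region)
        heapq.heappush(heap, (total + load, server))
    totals = {
        s: sum(region_loads[r] for r in regions)
        for s, regions in assignment.items()
    }
    return assignment, totals


REGION_BYTES = 256 * MB  # assumed region size, for illustration only

rows_schema1 = rows_per_region(REGION_BYTES, 1 * KB)   # 1KB rows  -> 262144 rows
rows_schema2 = rows_per_region(REGION_BYTES, 10 * KB)  # 10KB rows -> 26214 rows

# Two regions per schema, spread over two region servers (all names hypothetical).
region_loads = {
    "s1-r1": rows_schema1, "s1-r2": rows_schema1,
    "s2-r1": rows_schema2, "s2-r2": rows_schema2,
}
assignment, totals = assign_by_load(region_loads, ["rs1", "rs2"])
print(assignment)
print(totals)
```

With these assumed numbers, a region of the 1st schema holds roughly ten times as many rows as a region of the 2nd, so a placement that only evens out the region count could put both row-heavy regions on the same server. Weighting by estimated rows pairs one heavy and one light region per server instead.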
