Thanks for your advice, Eugeny.

Best Wishes
Dan Han

On Thu, Sep 27, 2012 at 2:34 AM, Eugeny Morozov
<[email protected]> wrote:

> Dan, see inlined.
>
> On Thu, Sep 27, 2012 at 5:30 AM, Dan Han <[email protected]> wrote:
>
> > Hi, Eugeny,
> >
> >    Thanks for your response. I answered your questions inline in Blue.
> > And I'd like to give an example to describe my problem.
> >
> > Let's think about two data schemas for the same dataset.
> > The two data schemas have different composite row keys.
>
>
> Just a first idea. If you have different schemas, then it would be much
> simpler to have two different tables with these schemas, because in that
> case HBase itself automatically distributes each table's regions
> evenly across the cluster. You could actually use the same coprocessor for
> both of the tables.
>
> If you're using two different column families, you could specify a
> different BLOCKSIZE for each (the default value is '65536'). You could make
> this option differ by a factor of 10 between the CFs (matching the size
> difference between your schemas). I believe this would decrease the number
> of reads for the larger data chunks.
>
> In general it is actually not good to have two (or more) column families
> that differ greatly in size, because flushing and compaction are done on a
> per-region basis, which means that if HBase starts compacting the small
> column family, it will do the same for the big one.
> http://hbase.apache.org/book.html#number.of.cfs
>
> BTW, I don't think that coprocessors are a good choice for data mining.
> The reason is that it is kind of dangerous. Since coprocessors are
> server-side creatures (they live in the Region Server), they could simply
> bring the whole system down. Expensive analysis creates heap and CPU
> pressure, which in turn leads to GC pauses and even more CPU pressure.
>
> Consider using Pig and HBaseStorage to load data from HBase.
>
> > But there is a part common to both schemas, which represents a sequence ID.
> > In the 1st schema, one row contains 1KB of information,
> > while in the 2nd schema, one row contains 10KB of information.
> > So one region holds more rows in the 1st schema than in the 2nd schema,
> > right? If the query is based on the sequence ID, then since one region in
> > the 1st schema is responsible for more rows than one in the 2nd schema,
> > the corresponding coprocessor would do more computation and take longer
> > to execute.
> > So in this case, if the regions are not distributed well,
> > some region servers will suffer from excess workload.
> > That is why I want to do some management of regions to get better load
> > balance based on large queries.
> >
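To put rough numbers on the row-count argument above, here is a minimal sketch of the arithmetic. The 1 GB region size is purely an assumption for illustration (the thread does not state one), and `rows_per_region` is a hypothetical helper, not an HBase API. It ignores key, column and storage-format overhead.

```python
def rows_per_region(region_bytes, row_bytes):
    """Approximate number of rows that fit in one region, ignoring overhead."""
    if row_bytes <= 0:
        raise ValueError("row_bytes must be positive")
    return region_bytes // row_bytes


KB = 1024
REGION_BYTES = 1024 * 1024 * KB  # assumed 1 GB region, for illustration only

rows_schema_1 = rows_per_region(REGION_BYTES, 1 * KB)   # 1st schema: 1KB rows
rows_schema_2 = rows_per_region(REGION_BYTES, 10 * KB)  # 2nd schema: 10KB rows

# How many times more rows one region holds under the 1st schema.
ratio = rows_schema_1 / rows_schema_2
```

Under these assumptions a region in the 1st schema holds roughly ten times as many rows, so a coprocessor that does per-row work over a whole region would have roughly ten times as many rows to visit.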
> > Hope it makes sense to you.
> >
> > Best Wishes
> > Dan Han
> >
> >
> > On Wed, Sep 26, 2012 at 3:19 PM, Eugeny Morozov
> > <[email protected]> wrote:
> >
> > > Dan,
> > >
> > > I have additional questions.
> > > What is the access pattern of your queries? I mean that, for example,
> > > PrefixFilters have to be applied to all KeyValue pairs in HFiles, which
> > > could be slow. Or, for example, the scanner's setCaching option can
> > > decrease the number of network round trips needed to get data from the
> > > RegionServer.
> > >
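A rough sketch of why the caching setting matters for client-side scans: `Scan.setCaching(n)` asks the region server to return up to n rows per RPC, so the number of round trips falls roughly in proportion to n. `scan_round_trips` below is a hypothetical helper that only models this relationship. It is not part of HBase, and it ignores response size limits and the extra call a real scanner may make to detect the end of the scan.

```python
def scan_round_trips(total_rows, caching):
    """Lower bound on fetch RPCs when each RPC returns up to `caching` rows."""
    if caching <= 0:
        raise ValueError("caching must be positive")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    # Integer ceiling division: the last, partially filled batch still costs one RPC.
    return (total_rows + caching - 1) // caching


trips_uncached = scan_round_trips(10000, 1)    # one row per RPC
trips_cached = scan_round_trips(10000, 100)    # 100 rows per RPC
```

Larger caching values trade fewer round trips for more client and server memory per call, so the best value depends on the row size.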
> >
> >     I set the row range and the related columns to narrow down the scan
> >     scope, and I used PrefixFilter/ColumnFilter/BinaryFilter to get the rows.
> >     I set a small cache (5KB), but I kept it the same for all evaluated
> >     data schemas, because I mainly focus on evaluating the performance of
> >     queries under the different data schemas.
> >
> >
> > > Additionally, coprocessors are able to use InternalScanner instead of
> > > ResultScanner, which could also help greatly.
> > >
> >
> >     Yes, I used InternalScanner.
> >
> > >
> > > Also, the more dimensions you specify (family, columns, time ranges,
> > > etc.), the more precise your query is and the less data has to be
> > > processed.
> > >
> > >
> > > On Wed, Sep 26, 2012 at 7:39 PM, Dan Han <[email protected]> wrote:
> > >
> > > >   Thanks for your swift response, Ramkrishna and Anoop. I will
> > > > explain what we are doing below.
> > > >
> > > >    We are trying to explore a systematic way to design an appropriate
> > > > data schema for various applications in HBase. So we first designed
> > > > several data schemas for each dataset and evaluated them with the same
> > > > queries. The queries are designed based on the requirements, such as
> > > > selecting the data with a matching expression or finding the difference
> > > > between two snapshots. The queries were processed with user-level
> > > > coprocessors.
> > > >
> > > >    In our experiments, we found that under some data schemas, the
> > > > queries sometimes cannot return any results because of connection
> > > > timeouts and RS crashes. We observed that in this case, the queried data
> > > > were concentrated in a few regions located on a few region servers. We
> > > > think the failure might be caused by the excess workload on these few
> > > > region servers and by inappropriate load balancing. To the best of our
> > > > knowledge, this case can be avoided, or at least improved, by
> > > > distributing the regions well across the region servers.
> > > >
> > > >   Therefore, we have been thinking of adding a monitoring and management
> > > > component between the client and server, which can schedule the
> > > > queries/jobs from the client side and distribute the regions dynamically
> > > > according to the current workload of each region server, the incoming
> > > > queries and data locality.
> > > >
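One simple way to picture the workload-aware placement described above is a greedy assignment: take regions from heaviest to lightest and always give the next one to the server with the least load so far. This is only a sketch under stated assumptions. It is not how HBase's own balancer works, the per-region load numbers are hypothetical estimates of query cost, and it ignores data locality and the cost of moving regions. `assign_regions` and the region/server names are made up for illustration.

```python
import heapq
from typing import Dict, List


def assign_regions(region_loads: Dict[str, float],
                   servers: List[str]) -> Dict[str, List[str]]:
    """Greedy placement: heaviest region first, onto the least-loaded server."""
    if not servers:
        raise ValueError("need at least one server")
    # Min-heap of (total load so far, server name).
    heap = [(0.0, s) for s in servers]
    heapq.heapify(heap)
    assignment = {s: [] for s in servers}
    # Sort by descending load; break ties by region name for determinism.
    for region, load in sorted(region_loads.items(),
                               key=lambda kv: (-kv[1], kv[0])):
        total, server = heapq.heappop(heap)
        assignment[server].append(region)
        heapq.heappush(heap, (total + load, server))
    return assignment


# Hypothetical estimated query cost per region.
loads = {"r1": 10.0, "r2": 9.0, "r3": 2.0, "r4": 1.0}
plan = assign_regions(loads, ["rs1", "rs2"])
```

With these made-up numbers both servers end up with the same total estimated load, whereas balancing by region count alone could put the two heavy regions on the same server.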
> > > >   Does it make sense? Just my two cents. Any comments?
> > > >
> > > > Best Wishes
> > > > Dan Han
> > > >
> > > > On Tue, Sep 25, 2012 at 10:44 PM, Anoop Sam John <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi
> > > > > Can you share more details, please? What work are you doing within the
> > > > > CPs?
> > > > >
> > > > > -Anoop-
> > > > > ________________________________________
> > > > > From: Dan Han [[email protected]]
> > > > > Sent: Wednesday, September 26, 2012 5:55 AM
> > > > > To: [email protected]
> > > > > Subject: Distribution of regions to servers
> > > > >
> > > > > Hi all,
> > > > >
> > > > >    I am doing some experiments on HBase with coprocessors. I found that
> > > > > the performance of a coprocessor is strongly affected by the
> > > > > distribution of the regions. I am interested in going deeper into
> > > > > this problem to see if I can do something about it.
> > > > >
> > > > >   I only found the discussion at the following link.
> > > > >
> > > > > http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers
> > > > >
> > > > > I am wondering if there is any further discussion or any ongoing work.
> > > > > Can someone point me to it if there is?
> > > > > Thanks in advance.
> > > > >
> > > > > Best Wishes
> > > > > Dan Han
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Evgeny Morozov
> > > Developer Grid Dynamics
> > > Skype: morozov.evgeny
> > > www.griddynamics.com
> > > [email protected]
> > >
> >
>
>
>
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> [email protected]
>
