My recommendation is to keep up to date with the latest HBase release, and
to wait for 0.96, which has improvements in almost every area. I talked
about this in a blog post.[1]
I think Coprocessors can be very helpful in your use case. In Lars George's
book "HBase: The Definitive Guide", Chapter 4 explains how to use Counters
and Coprocessors. You should read it.
A great introduction to Coprocessors was posted on the HBase blog,[2] and a
great example of HBase performance tuning, including the use of
Coprocessors, was posted by Hari Kumar from Ericsson Research on its Data
and Knowledge blog.[3]
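As a rough sketch of the row-splitting idea raised later in this thread (breaking wide rows into finer-grained rows and switching from single GETs to prefix SCANs), the composite row key could look like this. All names here are hypothetical illustrations, not from the thread or the HBase API:

```java
// Sketch: turn one wide row (userKey + up to ~1M qualifiers) into many
// narrow rows by folding each qualifier into a composite row key.
public class CompositeKeySketch {
    // Separator must never occur inside the original key.
    private static final char SEP = '#';

    // Old layout: row = userKey, column qualifier = cq, value = null.
    // New layout: row = userKey#cq, one fixed column, value = null.
    static String compositeKey(String userKey, String qualifier) {
        return userKey + SEP + qualifier;
    }

    // A prefix scan over "userKey#" replaces the old single GET.
    static String scanStartRow(String userKey) {
        return userKey + SEP;
    }

    // Exclusive stop row: bump the separator to the next byte value.
    static String scanStopRow(String userKey) {
        return userKey + (char) (SEP + 1);
    }

    public static void main(String[] args) {
        System.out.println(compositeKey("user42", "cq0001"));
        System.out.println(scanStartRow("user42"));
        System.out.println(scanStopRow("user42"));
    }
}
```

With the HBase client API, the start/stop strings would bound a Scan over the narrow rows. The trade-off is that each read becomes a short scan instead of one GET, but every row stays small and predictable in size.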
Best wishes
[1] http://marcosluis2186.posterous.com/some-upcoming-features-in-hbase-096
[2] https://blogs.apache.org/hbase/entry/coprocessor_introduction
[3] http://labs.ericsson.com/blog/hbase-performance-tuners
On 02/07/2013 11:34 PM, Michael Ellery wrote:
thanks for reminding me of the HBase version in CDH4 - that's something we'll
definitely take into consideration.
-Mike
On Feb 7, 2013, at 5:09 PM, Ted Yu wrote:
Thanks Michael for this information.
FYI CDH4 (as of now) is based on HBase 0.92.x which doesn't have the two
features I cited below.
On Thu, Feb 7, 2013 at 5:02 PM, Michael Ellery <[email protected]> wrote:
There is only one CF in this schema.
Yes, we are looking at upgrading to CDH4, but it is not trivial since we
cannot have cluster downtime. Our current upgrade plan involves additional
hardware with side-by-side clusters until everything is exported/imported.
Thanks,
Mike
On Feb 7, 2013, at 4:34 PM, Ted Yu wrote:
How many column families are involved ?
Have you considered upgrading to 0.94.4, where you would be able to benefit
from lazy seek, Data Block Encoding, etc.?
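For reference, Data Block Encoding is enabled per column family. A minimal HBase shell fragment (table and family names are hypothetical; requires HBase 0.94+):

```
alter 'mytable', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF'}
```

FAST_DIFF is one of the available encodings; it trades a little CPU for smaller blocks, which often helps read-heavy workloads with many similar keys.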
Thanks
On Thu, Feb 7, 2013 at 3:47 PM, Michael Ellery <[email protected]>
wrote:
I'm looking for some advice about per-row CQ (column qualifier) count
guidelines. Our current schema design means we have a HIGHLY variable CQ
count per row -- some rows have one or two CQs and some rows have upwards
of 1 million. Each CQ is on the order of 100 bytes (for round numbers) and
the cell values are null. We see highly variable and too often unacceptable
read performance using this schema. I don't know for a fact that the CQ
count variability is the source of our problems, but I am suspicious.
I'm curious about others' experience with CQ counts per row -- are there
some best practices/guidelines about how to optimally size the number of
CQs per row? The other obvious solution would involve breaking this data
into finer-grained rows, which means shifting from GETs to SCANs -- are
there performance trade-offs in such a change?
We are currently using CDH3u4, if that is relevant. All of our loading is
done via HFile loading (bulk), so we have not had to tune write performance
beyond using bulk loads. Any advice appreciated, including what metrics we
should be looking at to further diagnose our read performance challenges.
Thanks,
Mike Ellery
--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>