I'm looking for some advice about per-row CQ (column qualifier) count 
guidelines. Our current schema design gives us a HIGHLY variable CQ count 
per row -- some rows have one or two CQs and some have upwards of 1 
million. Each CQ is on the order of 100 bytes (for round numbers) and the 
cell values are null, so the widest rows work out to roughly 100 MB. We see 
highly variable and too often unacceptable read performance with this 
schema. I don't know for a fact that the CQ count variability is the source 
of our problems, but I am suspicious.
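
For context, here is a simplified sketch of our current read pattern -- the 
table, family, and key names are made up for illustration, not our real 
schema, and error handling is omitted:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "events");          // hypothetical table name
  Get get = new Get(Bytes.toBytes("entity-12345"));   // one of the wide rows
  get.addFamily(Bytes.toBytes("d"));                  // hypothetical family
  Result result = table.get(get);
  // For a worst-case row, this single Result materializes ~1M KeyValues
  // (~100 MB) in one Get, which is where I suspect our pain comes from.
  System.out.println("cells in row: " + result.size());
  table.close();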

I'm curious about others' experience with CQ counts per row -- are there 
best practices or guidelines for sizing the number of CQs per row? The 
other obvious option is to break this data into finer-grained rows, which 
means shifting from GETs to SCANs -- are there performance trade-offs in 
such a change?
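
To make that alternative concrete, here is the kind of finer-grained layout 
I have in mind -- the rowkey format <entity>#<old CQ> is hypothetical, just 
to show the Get-to-Scan shift (continuing the snippet above):

  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  // Narrow-row layout: rowkey = entity + '#' + old qualifier, one small cell per row.
  // '$' is the next byte after '#', so this stop row bounds the prefix range.
  Scan scan = new Scan(Bytes.toBytes("entity-12345#"),
                       Bytes.toBytes("entity-12345$"));
  scan.setCaching(1000);   // fetch rows in batches instead of one RPC per row
  ResultScanner scanner = table.getScanner(scan);
  try {
    for (Result r : scanner) {
      // each Result is now one small row rather than one ~100 MB row
    }
  } finally {
    scanner.close();
  }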

We are currently using CDH3u4, if that is relevant. All of our loading is 
done via HFile bulk loads, so we have not had to tune write performance. 
Any advice is appreciated, including which metrics we should be looking at 
to further diagnose our read performance problems.

Thanks,
Mike Ellery
