Re: One-table w/ multi-CF or multi-table w/ one-CF?

Michael Segel Tue, 09 Sep 2014 14:03:14 -0700

Locality? 

Then the data should be in the same column family.  That’s as local as you can 
get.


I would suggest that you think of the following:

What’s the predominant use case? 
How are you querying the data? 
If you’re always hitting multiple CFs to get the data… then you should have it 
in the same table. 

I think more people would benefit if they took more time to think about their 
design and how the data is being used and stored. 
It also helps to know that there really isn’t a single ‘right’ answer. Just a 
lot of wrong ones. ;-) 


Most people still try to think of HBase in terms of relational modeling and not 
in terms of records and more of a hierarchical system. 
Things like CFs and Versioning are often misused because people see them as 
shortcuts. 

Also, people tend to think of their data in HBase in terms of 2D rather than 
3D. 
(CFs would be 2+D.) 

The one question which really hasn’t been answered is how fat is fat in terms 
of a row’s width, and when is it too fat? 
This may seem like a simple thing, but it can impact a couple of things in your 
design. (I never got a good answer, and it’s one of those questions like when 
your wife asks if the pants she’s wearing make her look fat: it’s time to run 
for the hills because you can’t win no matter how you answer!) 
Seriously though, the optimal width of a row is not an easy question to answer, 
and sometimes you just have to guess which would be the better design. 

One of the problems with CFs is that if there’s an imbalance in the size of the 
data being stored in each CF, you can run into issues. 
CFs are stored in separate files and split when the base CF splits. (Assuming 
you have a base CF and then multiple CFs that are related but store smaller 
records per row.) 
And then there’s the issue that each CF is stored separately. (If memory 
serves, it’s a separate file per CF, but right now my last living brain cell 
decided to call it quits and went on strike for more beer.) 
[Damn you last brain cell!!!] :-) 

Again the idea is to follow KISS. 

HTH

-Mike
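
A minimal sketch of the "make type part of the RK" idea from this thread, for illustration only. It is not HBase client code: the helper names (make_row_key, type_prefix), the MAX_TS ceiling, and the sample events are all assumptions made up for this example. It only shows that when the type leads the row key, all rows of one type sort into one contiguous range, newest first, without needing a CF per type.

```python
# Hypothetical helpers; the layout <type>#<rev_timestamp>#<id> follows the
# row key format quoted later in the thread.
MAX_TS = 2**63 - 1  # assumed ceiling used to reverse timestamps


def make_row_key(record_type: str, timestamp_ms: int, record_id: str) -> bytes:
    """Build <type>#<rev_timestamp>#<id>; rows of one type sort newest first."""
    rev_ts = MAX_TS - timestamp_ms
    # Zero-pad to 19 digits so byte order matches numeric order.
    return f"{record_type}#{rev_ts:019d}#{record_id}".encode("utf-8")


def type_prefix(record_type: str) -> bytes:
    """Prefix that selects every row of one type as a contiguous key range."""
    return f"{record_type}#".encode("utf-8")


# Made-up sample events: (type, timestamp in ms, id).
events = [
    ("login", 1410000000000, "a1"),
    ("payment", 1410000005000, "a1"),
    ("login", 1410000009000, "b2"),
]

# HBase keeps rows sorted by key; sorted() stands in for that here.
keys = sorted(make_row_key(t, ts, i) for t, ts, i in events)
login_rows = [k for k in keys if k.startswith(type_prefix("login"))]
```

Whether this beats separate CFs still depends on the access pattern; the sketch only shows the key layout, not the read path.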

On Sep 8, 2014, at 7:17 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> Locality is important; that's why I chose CFs to put related data into one
> group. I can surely put the CF part at the head of the rowkey to achieve a
> similar result, but since the number of types is fixed, I don't see any
> benefit in doing that.
> 
> With setLoadColumnFamiliesOnDemand, which I learned about from Ted, it looks
> like the performance should be similar.
> 
> Am I missing something? Please enlighten me.
> 
> Jianshi
> 
> On Mon, Sep 8, 2014 at 3:41 AM, Michael Segel <michael_se...@hotmail.com>
> wrote:
> 
>> I would suggest rethinking column families and looking at the potential for
>> a slightly different row key.
>> 
>> Going with column families doesn’t really make sense.
>> 
>> Also how wide are the rows? (worst case?)
>> 
>> one idea is to make type part of the RK…
>> 
>> HTH
>> 
>> -Mike
>> 
>> On Sep 7, 2014, at 2:40 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>> 
>>> Hi Michael,
>>> 
>>> Thanks for the questions.
>>> 
>>> I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a
>>> timestamp, and I can query things like events between A and B for the last
>>> 7 days.
>>> 
>>> CFs are used for grouping different types of data for the same account.
>>> However, I have a lot of skew in the data. To avoid having too much in the
>>> same row, I had to move what was in CQs into RKs. So a CF now acts more like
>>> a table.
>>> 
>>> There's one CF containing a sequence of events ordered by timestamp, and
>>> this CF is quite different, as its use case is mostly in MapReduce jobs.
>>> 
>>> Jianshi
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel <michael_se...@hotmail.com>
>>> wrote:
>>> 
>>>> Again, a silly question.
>>>> 
>>>> Why are you using column families?
>>>> 
>>>> Just to play devil’s advocate in terms of design, why are you not treating
>>>> your row as a record? Think hierarchical, not relational.
>>>> 
>>>> This really gets into some design theory.
>>>> 
>>>> Think of a column family as a way to group data that has the same row key
>>>> and references the same thing, yet the data in each column family is used
>>>> separately.
>>>> The example I always turn to when teaching is an order entry system at a
>>>> retailer.
>>>> 
>>>> You generate data which is segmented by business process (order entry,
>>>> pick slips, shipping, invoicing). All reflect a single order, yet the data
>>>> in each process tends to be accessed separately.
>>>> (You don’t need the order entry when using the pick slip to pull orders
>>>> from the warehouse.) So here, the data access pattern is that each column
>>>> family is used separately, except when generating the data (the order entry
>>>> is used to generate the pick slip(s) and set up things like backorders, and
>>>> then the pick process generates the shipping slip(s), etc.). And since they
>>>> are all focused on the same order, they have the same row key.
>>>> 
>>>> So it’s reasonable to ask how you are accessing the data and how you are
>>>> designing your HBase model.
>>>> 
>>>> Many times, developers create a model using column families because they
>>>> are thinking in terms of relationships, not access patterns on the data.
>>>> 
>>>> Does this make sense?
>>>> 
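
The order entry example above can be pictured with a toy in-memory model. This is only a sketch of the shape of the data (row key, then column family, then qualifier), not HBase code; the family names, qualifiers, values, and the read_family helper are all invented for illustration.

```python
# Toy model of one HBase table: row key -> column family -> qualifier -> value.
# All names and values below are made up for illustration.
table = {
    "order-000042": {
        "order_entry": {"customer": "C-1001", "sku": "WIDGET-7", "qty": "3"},
        "pick_slip": {"bin": "A-14", "picked_qty": "3"},
        "shipping": {"carrier": "ACME", "tracking": "T123"},
        "invoice": {"amount": "59.97"},
    }
}


def read_family(table: dict, row_key: str, family: str) -> dict:
    """Return a copy of a single column family of one row.

    This mirrors the access pattern in the example: each business process
    reads its own family and ignores the others, even though every family
    shares the same row key (the order).
    """
    return dict(table[row_key][family])


# The warehouse only needs the pick slip, not the order entry or invoice.
pick = read_family(table, "order-000042", "pick_slip")
```

If most reads needed several of these families at once, the same model would suggest keeping that data together instead.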
>>>> 
>>>> On Sep 6, 2014, at 7:46 PM, Jianshi Huang <jianshi.hu...@gmail.com>
>>>> wrote:
>>>> 
>>>>> BTW, a little explanation about the binning I mentioned.
>>>>> 
>>>>> Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.
>>>>> 
>>>>> And with binning, it looks like
>>>>> <bin_number>#<type_of_events>#<rev_timestamp>#<id>. The bin_number could
>>>>> be id % 256 or timestamp % 256, and the table could be pre-split. So
>>>>> future ingestions could do parallel insertion into the #<bin> regions,
>>>>> even without pre-split.
>>>>> 
>>>>> 
>>>>> Jianshi
>>>>> 
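
A rough sketch of the binned row key described above, assuming integer ids and bin_number = id % 256. The function names, the zero-padding widths, and the one-region-per-bin split points are assumptions for illustration; the thread does not specify them.

```python
NUM_BINS = 256      # fixed when the table is created, as suggested in the thread
MAX_TS = 2**63 - 1  # assumed ceiling used to reverse timestamps


def binned_row_key(event_type: str, timestamp_ms: int, record_id: int,
                   num_bins: int = NUM_BINS) -> bytes:
    """Build <bin_number>#<type_of_events>#<rev_timestamp>#<id>."""
    bin_number = record_id % num_bins
    rev_ts = MAX_TS - timestamp_ms
    # Fixed-width fields keep lexicographic order equal to numeric order.
    return f"{bin_number:03d}#{event_type}#{rev_ts:019d}#{record_id}".encode("utf-8")


def split_points(num_bins: int = NUM_BINS) -> list:
    """Candidate pre-split keys: one boundary at the start of each bin but the first."""
    return [f"{b:03d}#".encode("utf-8") for b in range(1, num_bins)]


key = binned_row_key("login", 1410000000000, 513)  # 513 % 256 == 1
splits = split_points()
```

Writes for different ids then land in different key ranges, which is what spreads ingestion across regions. The cost is that a time-ordered read has to visit every bin.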
>>>>> 
>>>>> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Each range might span multiple regions, depending on the size of the
>>>>>> data I want to scan for MR jobs.
>>>>>> 
>>>>>> The ranges are dynamic, specified by the user, but the number of bins can
>>>>>> be static (set when the table/schema is created).
>>>>>> 
>>>>>> Jianshi
>>>>>> 
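
To make the "one range per bin" point concrete, here is a sketch that derives the (start row, stop row) pairs for a user-specified time window, assuming the <bin_number>#<type_of_events>#<rev_timestamp>#<id> layout with zero-padded fields. The scan_ranges function and its parameters are hypothetical, and how the ranges are handed to a MapReduce job is deliberately left out.

```python
MAX_TS = 2**63 - 1  # assumed ceiling used to reverse timestamps


def scan_ranges(event_type: str, start_ms: int, end_ms: int, num_bins: int = 256):
    """Return one (start_row, stop_row) pair per bin.

    Start rows are inclusive and stop rows exclusive, as in an HBase scan.
    Because timestamps are reversed, the newest event sorts first, so each
    range covers events with start_ms < timestamp <= end_ms.
    """
    rev_newest = MAX_TS - end_ms    # smallest reversed value in the window
    rev_oldest = MAX_TS - start_ms  # first reversed value outside the window
    ranges = []
    for b in range(num_bins):
        prefix = f"{b:03d}#{event_type}#"
        start_row = f"{prefix}{rev_newest:019d}".encode("utf-8")
        stop_row = f"{prefix}{rev_oldest:019d}".encode("utf-8")
        ranges.append((start_row, stop_row))
    return ranges


# Example: a window over made-up timestamps with only 4 bins.
ranges = scan_ranges("login", 1000, 2000, num_bins=4)
```

A "last 7 days" query would call this with end_ms set to now and start_ms seven days earlier, giving as many ranges as there are bins.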
>>>>>> 
>>>>>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>> 
>>>>>>> bq. 16 to 256 ranges
>>>>>>> 
>>>>>>> Would each range be within a single region, or may a range span
>>>>>>> regions? Are the ranges dynamic?
>>>>>>> 
>>>>>>> Using the command line for multiple ranges would be out of the question.
>>>>>>> A file with ranges is needed.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Thanks Ted for the reference.
>>>>>>>> 
>>>>>>>> That's right, extend row.start and row.stop to specify multiple ranges,
>>>>>>>> and also getSplits.
>>>>>>>> 
>>>>>>>> I would probably bin the event sequence CF into 16 to 256 bins. So 16
>>>>>>>> to 256 ranges.
>>>>>>>> 
>>>>>>>> Jianshi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Please refer to HBASE-5416 (Filter on one CF and if a match, then load
>>>>>>>>> and return full row).
>>>>>>>>> 
>>>>>>>>> bq. to extend TableInputFormat to accept multiple row ranges
>>>>>>>>> 
>>>>>>>>> You mean extending hbase.mapreduce.scan.row.start and
>>>>>>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be specified?
>>>>>>>>> How many such ranges do you normally need?
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks Ted,
>>>>>>>>>> 
>>>>>>>>>> I'll pre-split the table during ingestion. The reason to keep the
>>>>>>>>>> rowkey monotonic is to make it easier to work with TableInputFormat;
>>>>>>>>>> otherwise I would've binned it into 256 splits. (Well, I think a good
>>>>>>>>>> way is to extend TableInputFormat to accept multiple row ranges. If
>>>>>>>>>> there's an existing efficient implementation, please let me know :)
>>>>>>>>>> 
>>>>>>>>>> Would you elaborate a little more on the heap memory usage during
>>>>>>>>>> scan? Is there any reference for that?
>>>>>>>>>> 
>>>>>>>>>> Jianshi
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> If you use monotonically increasing rowkeys, separating out the
>>>>>>>>>>> column family into a new table would give you the same issue you're
>>>>>>>>>>> facing today.
>>>>>>>>>>> 
>>>>>>>>>>> Using a single table, the essential column family feature would
>>>>>>>>>>> reduce the amount of heap memory used during a scan. With two
>>>>>>>>>>> tables, there is no such facility.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, that's the table having RegionTooBusyExceptions :) But the
>>>>>>>>>>>> performance I care most about is scan performance.
>>>>>>>>>>>> 
>>>>>>>>>>>> It's mostly for analytics, so I don't care much about atomicity
>>>>>>>>>>>> currently.
>>>>>>>>>>>> 
>>>>>>>>>>>> What's your suggestion?
>>>>>>>>>>>> 
>>>>>>>>>>>> Jianshi
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Is this the same table you mentioned in the thread about
>>>>>>>>>>>>> RegionTooBusyException?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If you move the column family to another table, you may have to
>>>>>>>>>>>>> handle atomicity yourself - currently atomic operations are within
>>>>>>>>>>>>> region boundaries.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm currently putting everything into one table (to make
>>>>>>>>>>>>>> cross-reference queries easier), and there's one CF which contains
>>>>>>>>>>>>>> rowkeys very different from the rest. Currently it works well, but
>>>>>>>>>>>>>> I'm wondering if it will cause performance issues in the future.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So my questions are:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1) Will there be performance penalties with the way I'm doing it?
>>>>>>>>>>>>>> 2) Should I move that CF to a separate table?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jianshi Huang
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> LinkedIn: jianshi
>>>>>>>>>>>>>> Twitter: @jshuang
>>>>>>>>>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>>>>>>>>> 

Re: One-table w/ multi-CF or multi-table w/ one-CF?
