Locality is important; that's why I chose CFs to put related data into one
group. I could certainly put the CF part at the head of the rowkey to achieve
a similar result, but since the number of types is fixed, I don't see any
benefit in doing that.
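
To make the rowkey alternative concrete, here is a rough sketch in plain
Python (not HBase client code). Only the <type>#<rev_timestamp>#<id> layout
comes from this thread; the names make_rowkey and MAX_TS, the sample event
types, and the zero-padding width are assumptions for illustration. Since
HBase sorts rows lexicographically by key, rows sharing a type prefix end up
adjacent, which is roughly the locality I get from a CF today.

```python
# Illustrative only: plain Python, no HBase client involved.
MAX_TS = 2**63 - 1  # assumed timestamp upper bound (same value as Java's Long.MAX_VALUE)

def make_rowkey(event_type: str, ts_millis: int, entity_id: str) -> bytes:
    """Build <type>#<rev_timestamp>#<id>.

    Zero-padding the reversed timestamp to a fixed width makes
    lexicographic byte order agree with numeric order.
    """
    rev_ts = MAX_TS - ts_millis
    return f"{event_type}#{rev_ts:019d}#{entity_id}".encode("utf-8")

# Hypothetical sample rows.
keys = [
    make_rowkey("login", 1410000000000, "a1"),
    make_rowkey("payment", 1410000005000, "a1"),
    make_rowkey("login", 1410000009000, "b2"),
]

# Sorting the byte strings mimics the order in which HBase stores rows:
# all "login" rows first (newest first), then all "payment" rows.
ordered = sorted(keys)
```

With this layout a scan over one type is a single contiguous range, at the
cost of the type no longer being selectable per column family.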

With setLoadColumnFamiliesOnDemand, which I learned about from Ted, it looks
like the performance should be similar.

Am I missing something? Please enlighten me.

Jianshi

On Mon, Sep 8, 2014 at 3:41 AM, Michael Segel <michael_se...@hotmail.com>
wrote:

> I would suggest rethinking column families and looking at your potential for
> a slightly different row key.
>
> Going with column families doesn’t really make sense.
>
> Also how wide are the rows? (worst case?)
>
> one idea is to make type part of the RK…
>
> HTH
>
> -Mike
>
> On Sep 7, 2014, at 2:40 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
> > Hi Michael,
> >
> > Thanks for the questions.
> >
> > I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a
> > timestamp, and I can query things like events between A and B for the last 7
> > days.
> >
> > CFs are used for grouping different types of data for the same account.
> > However, the data is heavily skewed. To avoid having too much in the same
> > row, I had to move what was in the CQs into the RKs, so a CF now acts more
> > like a table.
> >
> > There's one CF containing a sequence of events ordered by timestamp, and this
> > CF is quite different, as its use case is mostly MapReduce jobs.
> >
> > Jianshi
> >
> >
> >
> >
> > On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel <michael_se...@hotmail.com>
> > wrote:
> >
> >> Again, a silly question.
> >>
> >> Why are you using column families?
> >>
> >> Just to play devil’s advocate in terms of design, why are you not treating
> >> your row as a record? Think hierarchical, not relational.
> >>
> >> This really gets into some design theory.
> >>
> >> Think of a column family as a way to group data that has the same row key and
> >> references the same thing, yet where the data in each column family is used
> >> separately.
> >> The example I always turn to when teaching is to think of an order entry
> >> system at a retailer.
> >>
> >> You generate data which is segmented by business process (order entry,
> >> pick slips, shipping, invoicing). All reflect a single order, yet the data
> >> in each process tends to be accessed separately.
> >> (You don’t need the order entry when using the pick slip to pull orders
> >> from the warehouse.) So here, the data access pattern is that each column
> >> family is used separately, except in generating the data (the order entry
> >> is used to generate the pick slip(s) and set up things like backorders, and
> >> then the pick process generates the shipping slip(s), etc.). And since they
> >> are all focused on the same order, they have the same row key.
> >>
> >> So it's reasonable to ask how you are accessing the data and how you are
> >> designing your HBase model.
> >>
> >> Many times, developers create a model using column families because they
> >> are thinking in terms of relationships, not access patterns on the data.
> >>
> >> Does this make sense?
> >>
> >>
> >> On Sep 6, 2014, at 7:46 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >>
> >>> BTW, a little explanation about the binning I mentioned.
> >>>
> >>> Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.
> >>>
> >>> And with binning, it looks like
> >>> <bin_number>#<type_of_events>#<rev_timestamp>#<id>. The bin_number could be
> >>> id % 256 or timestamp % 256. And the table could be pre-split. So future
> >>> ingestions could do parallel insertion into #<bin> regions, even without a
> >>> pre-split.
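
(Inline note on the binning above.) A rough sketch of the binned key in plain
Python, not HBase client code. The function names, the three-digit bin
padding, the "login" event type, and the assumption that the id is numeric
are all mine; only the <bin_number>#<type_of_events>#<rev_timestamp>#<id>
layout and the "id % 256" rule come from the message quoted here.

```python
import bisect

NUM_BINS = 256      # assumed; the thread mentions 16 to 256 bins
MAX_TS = 2**63 - 1  # assumed timestamp upper bound

def binned_rowkey(event_type: str, ts_millis: int, numeric_id: int) -> bytes:
    """Build <bin>#<type>#<rev_timestamp>#<id> with bin = id % NUM_BINS."""
    bin_number = numeric_id % NUM_BINS
    rev_ts = MAX_TS - ts_millis
    return f"{bin_number:03d}#{event_type}#{rev_ts:019d}#{numeric_id}".encode("utf-8")

def split_points() -> list:
    """Candidate pre-split keys: one boundary in front of every bin except bin 0."""
    return [f"{b:03d}#".encode("utf-8") for b in range(1, NUM_BINS)]

def region_index(rowkey: bytes) -> int:
    """Index of the pre-split region a key would land in (0 .. NUM_BINS - 1)."""
    return bisect.bisect_right(split_points(), rowkey)

# Hypothetical example: id 513 falls into bin 513 % 256 == 1.
key = binned_rowkey("login", 1410000000000, 513)
```

Because consecutive ids land in different bins, concurrent writers spread
over up to NUM_BINS regions instead of all hitting the newest region.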
> >>>
> >>>
> >>> Jianshi
> >>>
> >>>
> >>> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>> wrote:
> >>>
> >>>> Each range might span multiple regions, depending on the size of the data
> >>>> I want to scan in MR jobs.
> >>>>
> >>>> The ranges are dynamic, specified by the user, but the number of bins can
> >>>> be static (fixed when the table/schema is created).
> >>>>
> >>>> Jianshi
> >>>>
> >>>>
> >>>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>
> >>>>> bq. 16 to 256 ranges
> >>>>>
> >>>>> Would each range be within a single region, or may a range span regions?
> >>>>> Are the ranges dynamic?
> >>>>>
> >>>>> Using the command line for multiple ranges would be out of the question. A
> >>>>> file with ranges is needed.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>>
> >>>>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks Ted for the reference.
> >>>>>>
> >>>>>> That's right: extend row.start and row.stop to specify multiple ranges,
> >>>>>> and also getSplits.
> >>>>>>
> >>>>>> I would probably bin the event sequence CF into 16 to 256 bins, so 16 to
> >>>>>> 256 ranges.
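
(Inline note on the ranges above.) A rough sketch, in plain Python, of how one
user-specified time window could turn into one (start, stop) row range per
bin. It assumes the binned key layout <bin>#<type>#<rev_timestamp>#<id> with
zero-padded fields; scan_ranges, in_range, and the default of 16 bins are
hypothetical names and values, and this is not an existing TableInputFormat
feature. It relies on the stop row of a scan being exclusive.

```python
MAX_TS = 2**63 - 1  # assumed timestamp upper bound

def rev(ts_millis: int) -> int:
    """Reversed timestamp: newer events get smaller values and sort first."""
    return MAX_TS - ts_millis

def scan_ranges(event_type: str, from_ts: int, to_ts: int, num_bins: int = 16) -> list:
    """Return one (start_row, stop_row) pair per bin covering from_ts <= ts <= to_ts."""
    ranges = []
    for b in range(num_bins):
        prefix = f"{b:03d}#{event_type}#"
        # Start at the newest timestamp in the window (inclusive).
        start = f"{prefix}{rev(to_ts):019d}".encode("utf-8")
        # Stop just past the oldest timestamp in the window (exclusive).
        stop = f"{prefix}{rev(from_ts) + 1:019d}".encode("utf-8")
        ranges.append((start, stop))
    return ranges

def in_range(rowkey: bytes, start: bytes, stop: bytes) -> bool:
    """Same containment rule as a scan: start inclusive, stop exclusive."""
    return start <= rowkey < stop

# Hypothetical window; an input format would emit one or more splits per range.
ranges = scan_ranges("login", 1410000000000, 1410000009000)
```

Each pair could then back its own scan, so the number of ranges equals the
number of bins regardless of how wide the requested window is.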
> >>>>>>
> >>>>>> Jianshi
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Please refer to HBASE-5416 (Filter on one CF and if a match, then load
> >>>>>>> and return full row).
> >>>>>>>
> >>>>>>> bq. to extend TableInputFormat to accept multiple row ranges
> >>>>>>>
> >>>>>>> You mean extending hbase.mapreduce.scan.row.start and
> >>>>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be specified?
> >>>>>>> How many such ranges do you normally need?
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks Ted,
> >>>>>>>>
> >>>>>>>> I'll pre-split the table during ingestion. The reason to keep the rowkey
> >>>>>>>> monotonic is to make it easier to work with TableInputFormat; otherwise I
> >>>>>>>> would've binned it into 256 splits. (Well, I think a good way is to extend
> >>>>>>>> TableInputFormat to accept multiple row ranges. If there's an existing
> >>>>>>>> efficient implementation, please let me know :)
> >>>>>>>>
> >>>>>>>> Would you elaborate a little more on the heap memory usage during a scan?
> >>>>>>>> Is there any reference for that?
> >>>>>>>>
> >>>>>>>> Jianshi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> If you use monotonically increasing rowkeys, separating out the column
> >>>>>>>>> family into a new table would give you the same issue you're facing today.
> >>>>>>>>>
> >>>>>>>>> With a single table, the essential column family feature would reduce the
> >>>>>>>>> amount of heap memory used during a scan. With two tables, there is no such
> >>>>>>>>> facility.
> >>>>>>>>>
> >>>>>>>>> Cheers
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Ted,
> >>>>>>>>>>
> >>>>>>>>>> Yes, that's the table having RegionTooBusyExceptions :) But the
> >>>>>>>>>> performance I care about most is scan performance.
> >>>>>>>>>>
> >>>>>>>>>> It's mostly for analytics, so I don't care much about atomicity currently.
> >>>>>>>>>>
> >>>>>>>>>> What's your suggestion?
> >>>>>>>>>>
> >>>>>>>>>> Jianshi
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Is this the same table you mentioned in the thread about
> >>>>>>>>>>> RegionTooBusyException?
> >>>>>>>>>>>
> >>>>>>>>>>> If you move the column family to another table, you may have to handle
> >>>>>>>>>>> atomicity yourself - currently atomic operations are within region
> >>>>>>>>>>> boundaries.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm currently putting everything into one table (to make cross-reference
> >>>>>>>>>>>> queries easier), and there's one CF which contains rowkeys very different
> >>>>>>>>>>>> from the rest. Currently it works well, but I'm wondering if it will cause
> >>>>>>>>>>>> performance issues in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So my questions are:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) Will there be performance penalties with the way I'm doing it?
> >>>>>>>>>>>> 2) Should I move that CF to a separate table?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Jianshi Huang
> >>>>>>>>>>>>
> >>>>>>>>>>>> LinkedIn: jianshi
> >>>>>>>>>>>> Twitter: @jshuang
> >>>>>>>>>>>> Github & Blog: http://huangjs.github.com/
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
