Locality is important, that's why I chose CFs to put related data into one group. I could certainly move the CF part to the head of the rowkey to achieve a similar result, but since the number of types is fixed, I don't see any benefit in doing that.
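As a minimal sketch of the rowkey-prefix alternative mentioned above: the type names, the account ids, and the `#` separator are illustrative assumptions rather than the actual schema, and no HBase API is called. It only shows that byte-ordered keys with a type prefix keep each type contiguous, so one type can be read with a single start/stop range.

```python
from typing import Tuple

# Hypothetical type names; the real schema's types are not given in the thread.
TYPES = ["edge", "event", "vertex"]
SEP = b"#"

def type_prefixed_key(data_type: str, account_id: str) -> bytes:
    """Build <type>#<account_id>; rowkeys are compared as raw bytes."""
    return data_type.encode() + SEP + account_id.encode()

def prefix_range(prefix: bytes) -> Tuple[bytes, bytes]:
    """Return (start, exclusive stop) covering every key that starts with prefix.

    The stop key is the prefix with its last byte incremented, which is
    valid as long as that byte is not 0xFF.
    """
    if not prefix or prefix[-1] == 0xFF:
        raise ValueError("prefix must be non-empty and must not end in 0xFF")
    return prefix, prefix[:-1] + bytes([prefix[-1] + 1])

# Sorting simulates the byte ordering of rows inside a table.
keys = sorted(type_prefixed_key(t, a) for t in TYPES for a in ["a1", "a2", "b7"])
start, stop = prefix_range(b"event" + SEP)
events = [k for k in keys if start <= k < stop]
```

With a fixed, small set of types this gives the same grouping as separate CFs, which is why the two layouts look comparable for locality.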
With setLoadColumnFamiliesOnDemand, which I learned about from Ted, it looks like the performance should be similar.

Am I missing something? Please enlighten me.

Jianshi


On Mon, Sep 8, 2014 at 3:41 AM, Michael Segel <michael_se...@hotmail.com> wrote:

> I would suggest rethinking column families and look at your potential for
> a slightly different row key.
>
> Going with column families doesn’t really make sense.
>
> Also how wide are the rows? (worst case?)
>
> one idea is to make type part of the RK…
>
> HTH
>
> -Mike
>
> On Sep 7, 2014, at 2:40 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
> > Hi Michael,
> >
> > Thanks for the questions.
> >
> > I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a
> > timestamp, and I can query things like events between A and B for the last
> > 7 days.
> >
> > CFs are used for grouping different types of data for the same account.
> > However, I have a lot of skew in the data; to avoid having too much in the
> > same row, I had to move what was in CQs into RKs. So a CF now acts more
> > like a table.
> >
> > There's one CF containing a sequence of events ordered by timestamp, and
> > this CF is quite different as the use case is mostly in mapreduce jobs.
> >
> > Jianshi
> >
> > On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel <michael_se...@hotmail.com> wrote:
> >
> >> Again, a silly question.
> >>
> >> Why are you using column families?
> >>
> >> Just to play devil’s advocate in terms of design, why are you not treating
> >> your row as a record? Think hierarchical, not relational.
> >>
> >> This really gets into some design theory.
> >>
> >> Think Column Family as a way to group data that has the same row key,
> >> reference the same thing, yet the data in each column family is used
> >> separately.
> >> The example I always turn to when teaching is to think of an order entry
> >> system at a retailer.
> >>
> >> You generate data which is segmented by business process (order entry,
> >> pick slips, shipping, invoicing). All reflect a single order, yet the data
> >> in each process tends to be accessed separately.
> >> (You don’t need the order entry when using the pick slip to pull orders
> >> from the warehouse.) So here, the data access pattern is that each column
> >> family is used separately, except in generating the data (the order entry
> >> is used to generate the pick slip(s) and set up things like backorders, and
> >> then the pick process generates the shipping slip(s), etc.). And since they
> >> are all focused on the same order, they have the same row key.
> >>
> >> So it's reasonable to ask how you are accessing the data and how you are
> >> designing your HBase model?
> >>
> >> Many times, developers create a model using column families because the
> >> developer is thinking in terms of relationships, not access patterns on
> >> the data.
> >>
> >> Does this make sense?
> >>
> >> On Sep 6, 2014, at 7:46 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >>
> >>> BTW, a little explanation about the binning I mentioned.
> >>>
> >>> Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.
> >>>
> >>> And with binning, it looks like
> >>> <bin_number>#<type_of_events>#<rev_timestamp>#<id>. The bin_number could
> >>> be id % 256 or timestamp % 256. And the table could be pre-split. So
> >>> future ingestions could do parallel insertion to #<bin> regions, even
> >>> without pre-split.
> >>>
> >>> Jianshi
> >>>
> >>> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >>>
> >>>> Each range might span multiple regions, depending on the data size I
> >>>> want to scan for MR jobs.
> >>>>
> >>>> The ranges are dynamic, specified by the user, but the number of bins
> >>>> can be static (when the table/schema is created).
> >>>>
> >>>> Jianshi
> >>>>
> >>>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>
> >>>>> bq. 16 to 256 ranges
> >>>>>
> >>>>> Would each range be within a single region, or may the range span
> >>>>> regions? Are the ranges dynamic?
> >>>>>
> >>>>> Using command line for multiple ranges would be out of question. A file
> >>>>> with ranges is needed.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >>>>>
> >>>>>> Thanks Ted for the reference.
> >>>>>>
> >>>>>> That's right, extend the row.start and row.end to specify multiple
> >>>>>> ranges and also getSplits.
> >>>>>>
> >>>>>> I would probably bin the event sequence CF into 16 to 256 bins. So 16
> >>>>>> to 256 ranges.
> >>>>>>
> >>>>>> Jianshi
> >>>>>>
> >>>>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Please refer to HBASE-5416 Filter on one CF and if a match, then load
> >>>>>>> and return full row
> >>>>>>>
> >>>>>>> bq. to extend TableInputFormat to accept multiple row ranges
> >>>>>>>
> >>>>>>> You mean extending hbase.mapreduce.scan.row.start and
> >>>>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be
> >>>>>>> specified? How many such ranges do you normally need?
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Thanks Ted,
> >>>>>>>>
> >>>>>>>> I'll pre-split the table during ingestion. The reason to keep the
> >>>>>>>> rowkey monotonic is that it makes working with TableInputFormat
> >>>>>>>> easier; otherwise I would've binned it into 256 splits. (Well, I
> >>>>>>>> think a good way is to extend TableInputFormat to accept multiple
> >>>>>>>> row ranges. If there's an existing efficient implementation, please
> >>>>>>>> let me know :)
> >>>>>>>>
> >>>>>>>> Would you elaborate a little more on the heap memory usage during
> >>>>>>>> scan? Is there any reference to that?
> >>>>>>>>
> >>>>>>>> Jianshi
> >>>>>>>>
> >>>>>>>> On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> If you use monotonically increasing rowkeys, separating out the
> >>>>>>>>> column family into a new table would give you the same issue you're
> >>>>>>>>> facing today.
> >>>>>>>>>
> >>>>>>>>> Using a single table, the essential column family feature would
> >>>>>>>>> reduce the amount of heap memory used during scan. With two tables,
> >>>>>>>>> there is no such facility.
> >>>>>>>>>
> >>>>>>>>> Cheers
> >>>>>>>>>
> >>>>>>>>> On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Ted,
> >>>>>>>>>>
> >>>>>>>>>> Yes, that's the table having RegionTooBusyExceptions :) But the
> >>>>>>>>>> performance I care most about is scan performance.
> >>>>>>>>>>
> >>>>>>>>>> It's mostly for analytics, so I don't care much about atomicity
> >>>>>>>>>> currently.
> >>>>>>>>>>
> >>>>>>>>>> What's your suggestion?
> >>>>>>>>>>
> >>>>>>>>>> Jianshi
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Is this the same table you mentioned in the thread about
> >>>>>>>>>>> RegionTooBusyException?
> >>>>>>>>>>>
> >>>>>>>>>>> If you move the column family to another table, you may have to
> >>>>>>>>>>> handle atomicity yourself - currently atomic operations are
> >>>>>>>>>>> within region boundaries.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm currently putting everything into one table (to make cross
> >>>>>>>>>>>> reference queries easier) and there's one CF which contains
> >>>>>>>>>>>> rowkeys very different to the rest. Currently it works well, but
> >>>>>>>>>>>> I'm wondering if it will cause performance issues in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So my questions are
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) will there be performance penalties in the way I'm doing?
> >>>>>>>>>>>> 2) should I move that CF to a separate table?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Jianshi Huang
> >>>>>>>>>>>>
> >>>>>>>>>>>> LinkedIn: jianshi
> >>>>>>>>>>>> Twitter: @jshuang
> >>>>>>>>>>>> Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
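
For reference, the binned rowkey layout described in the thread, <bin_number>#<type_of_events>#<rev_timestamp>#<id>, can be sketched as follows, along with how one (type, time window) query fans out into one row range per bin. This is only a sketch under assumptions: the 8-byte big-endian encodings, `MAX_TS`, the half-open time window, non-negative ids, type names without `#`, and all function names are hypothetical, and no HBase API is called.

```python
import struct
from typing import List, Tuple

NUM_BINS = 256        # static, chosen when the table/schema is created
MAX_TS = 2**63 - 1    # assumption: timestamps are non-negative 64-bit millis

def rev_timestamp(ts_ms: int) -> bytes:
    """Encode MAX_TS - ts as 8 big-endian bytes so newer events sort first."""
    return struct.pack(">q", MAX_TS - ts_ms)

def binned_key(event_type: str, ts_ms: int, event_id: int) -> bytes:
    """Build <bin>#<type>#<rev_timestamp>#<id> with bin = id % NUM_BINS."""
    bin_byte = bytes([event_id % NUM_BINS])
    return b"#".join([bin_byte, event_type.encode(),
                      rev_timestamp(ts_ms), struct.pack(">q", event_id)])

def scan_ranges(event_type: str, t_from_ms: int, t_to_ms: int) -> List[Tuple[bytes, bytes]]:
    """One (start, exclusive stop) row range per bin for t_from < ts <= t_to.

    Timestamps are reversed, so the newest bound gives the start key and
    the oldest bound gives the stop key.
    """
    ranges = []
    for b in range(NUM_BINS):
        prefix = bytes([b]) + b"#" + event_type.encode() + b"#"
        ranges.append((prefix + rev_timestamp(t_to_ms),
                       prefix + rev_timestamp(t_from_ms)))
    return ranges

# Usage: pick out the "click" events in the window (1000, 3000].
events = [("click", 1000, 7), ("click", 2000, 263),
          ("click", 3000, 512), ("view", 2000, 7)]
keys = {binned_key(*e): e for e in events}
ranges = scan_ranges("click", 1000, 3000)
hits = sorted(e for k, e in keys.items() if any(s <= k < t for s, t in ranges))
```

Each of the 256 ranges would correspond to one scan (or one input split) in a TableInputFormat extended to accept multiple row ranges, which is the extension discussed in the thread.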