Thanks, Ted, for the reference. That's right: extend hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.stop to specify multiple ranges, and extend getSplits as well.
I would probably bin the event sequence CF into 16 to 256 bins. So 16 to 256 ranges.

Jianshi

On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <[email protected]> wrote:

> Please refer to HBASE-5416 Filter on one CF and if a match, then load and
> return full row
>
> bq. to extend TableInputFormat to accept multiple row ranges
>
> You mean extending hbase.mapreduce.scan.row.start and
> hbase.mapreduce.scan.row.stop so that multiple ranges can be specified ?
> How many such ranges do you normally need ?
>
> Cheers
>
> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <[email protected]> wrote:
>
> > Thanks Ted,
> >
> > I'll pre-split the table during ingestion. The reason to keep the rowkey
> > monotonic is for easier working with TableInputFormat, otherwise I would've
> > binned it into 256 splits. (well, I think a good way is to extend
> > TableInputFormat to accept multiple row ranges, if there's an existing
> > efficient implementation, please let me know :)
> >
> > Would you elaborate a little more on the heap memory usage during scan?
> > Is there any reference to that?
> >
> > Jianshi
> >
> > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <[email protected]> wrote:
> >
> > > If you use monotonically increasing rowkeys, separating out the column
> > > family into a new table would give you same issue you're facing today.
> > >
> > > Using a single table, essential column family feature would reduce the
> > > amount of heap memory used during scan. With two tables, there is no
> > > such facility.
> > >
> > > Cheers
> > >
> > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <[email protected]> wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > Yes, that's the table having RegionTooBusyExceptions :) But the
> > > > performance I care most are scan performance.
> > > >
> > > > It's mostly for analytics, so I don't care much about atomicity
> > > > currently.
> > > >
> > > > What's your suggestion?
> > > >
> > > > Jianshi
> > > >
> > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <[email protected]> wrote:
> > > >
> > > > > Is this the same table you mentioned in the thread about
> > > > > RegionTooBusyException ?
> > > > >
> > > > > If you move the column family to another table, you may have to
> > > > > handle atomicity yourself - currently atomic operations are within
> > > > > region boundaries.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm currently putting everything into one table (to make cross
> > > > > > reference queries easier) and there's one CF which contains
> > > > > > rowkeys very different to the rest. Currently it works well, but
> > > > > > I'm wondering if it will cause performance issues in the future.
> > > > > >
> > > > > > So my questions are
> > > > > >
> > > > > > 1) will there be performance penalties in the way I'm doing?
> > > > > > 2) should I move that CF to a separate table?
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > Jianshi Huang
> > > > > >
> > > > > > LinkedIn: jianshi
> > > > > > Twitter: @jshuang
> > > > > > Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
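
For illustration, here is a minimal sketch of how a binned rowkey turns one logical scan into one row range per bin, which is what a multi-range TableInputFormat would need to be given. It is plain Python byte arithmetic, not HBase code. The one-byte prefix layout, the CRC32-based bin choice, and the names bin_prefix, salted_key and scan_ranges are all assumptions made for this example and do not come from the thread.

```python
import zlib
from typing import List, Tuple


def _check_bins(num_bins: int) -> None:
    # The bin id is stored in a single prefix byte, so at most 256 bins.
    if not 1 <= num_bins <= 256:
        raise ValueError("num_bins must be between 1 and 256")


def bin_prefix(key: bytes, num_bins: int) -> bytes:
    """One-byte bin id derived from a stable hash of the unsalted key (assumed scheme)."""
    _check_bins(num_bins)
    return bytes([zlib.crc32(key) % num_bins])


def salted_key(key: bytes, num_bins: int) -> bytes:
    """Rowkey as stored: bin prefix followed by the original monotonic key."""
    return bin_prefix(key, num_bins) + key


def scan_ranges(start: bytes, stop: bytes, num_bins: int) -> List[Tuple[bytes, bytes]]:
    """Return one (start_row, stop_row) pair per bin.

    start is inclusive and stop is exclusive. An empty stop means
    "to the end of this bin"; for the last possible prefix (0xff) the
    stop row stays empty, meaning "to the end of the table".
    """
    _check_bins(num_bins)
    ranges = []
    for b in range(num_bins):
        prefix = bytes([b])
        if stop:
            stop_row = prefix + stop
        elif b < 255:
            stop_row = bytes([b + 1])  # first possible key of the next bin
        else:
            stop_row = b""
        ranges.append((prefix + start, stop_row))
    return ranges


if __name__ == "__main__":
    for start_row, stop_row in scan_ranges(b"20140901", b"20140907", 16):
        print(start_row, stop_row)
```

With 16 bins, a scan over one key interval becomes 16 ranges; with 256 bins, 256 ranges. A getSplits override would then have to produce at least one split per range, intersected with the region boundaries.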
