Please see this post: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
On Wed, Apr 30, 2014 at 10:28 AM, Software Dev <[email protected]> wrote:

> I did not know of the FuzzyRowFilter.. that looks like it may be my best
> bet.
>
> Anyone know what Sematext's HBaseWD uses to perform efficient scanning?
>
> On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <[email protected]> wrote:
>
> > I would recommend pre-splitting the tables and then hashing your key
> > and putting that in the front, i.e.
> >
> >     [hash(20140429:Country:US)][2014042901:Country:US]
> >
> > (notice you're not hashing the sequence number)
> >
> > Some pseudo Python code (in Python 3 md5 needs bytes, hence the
> > .encode()):
> >
> >     >>> import hashlib
> >     >>> key = "2014042901:Country:US"
> >     >>> ckey = "20140429:Country:US"
> >     >>> hbase_key = "%s%s" % (hashlib.md5(ckey.encode()).hexdigest()[:5], key)
> >     >>> hbase_key
> >     '887d82014042901:Country:US'
> >
> > Now when you want to find something, you can just recompute the hash
> > prefix ('887d8') and use FuzzyRowFilter to find it!
> >
> > cheers,
> > liam
> >
> > On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <[email protected]> wrote:
> >
> >> Any improvements in the row key design?
> >>
> >> If I always know we will be querying by country, could/should I prefix
> >> the row key with the country to help with hotspotting?
> >>
> >>     FR/2014042901
> >>     FR/2014042902
> >>     ...
> >>     US/2014042901
> >>     US/2014042902
> >>     ...
> >>
> >> Is this preferred over adding it in a column, i.e. 2014042901:Country:US?
> >>
> >> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <[email protected]> wrote:
> >>
> >> > Ok, didn't know if the sheer number of gets would be a limiting factor.
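Liam's salt-then-fuzzy-scan idea can be sketched as self-contained Python. The `fuzzy_match` helper below is a toy stand-in for HBase's FuzzyRowFilter, not the real (Java) API; as I understand the filter's semantics, a mask byte of 0 marks a position that must match the template and 1 marks a wildcard, but verify against your client version's docs:

```python
import hashlib

def salted_key(coarse_key: str, full_key: str) -> str:
    """Liam's scheme: first 5 hex chars of md5(coarse key), prepended."""
    return hashlib.md5(coarse_key.encode()).hexdigest()[:5] + full_key

def fuzzy_match(row: bytes, template: bytes, mask: bytes) -> bool:
    """Toy model of FuzzyRowFilter matching: mask byte 0 = row byte must
    equal the template byte, 1 = any byte is accepted at that position."""
    return len(row) >= len(mask) and all(
        m == 1 or r == t for r, t, m in zip(row, template, mask))

key = salted_key("20140429:Country:US", "2014042901:Country:US")

# Fix the 5-byte salt + 8-byte date, wildcard the 2-byte hour, fix the rest.
template = key[:13].encode() + b"??" + key[15:].encode()
mask = b"\x00" * 13 + b"\x01" * 2 + b"\x00" * (len(key) - 15)

print(fuzzy_match(key.encode(), template, mask))  # True: any hour matches
```

Because the salt is derived from the day-level key, it is recomputable at read time, which is what makes the fixed-prefix part of the mask possible.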
> >> > Thanks
> >> >
> >> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <[email protected]> wrote:
> >> >
> >> >> As I said this afternoon, see the following API in HTable for
> >> >> batching Gets:
> >> >>
> >> >>     public Result[] get(List<Get> gets) throws IOException
> >> >>
> >> >> Cheers
> >> >>
> >> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <[email protected]> wrote:
> >> >>
> >> >>> Nothing against your code. I just meant that if we are doing a scan,
> >> >>> say for hourly metrics across a 6 month period, we are talking about
> >> >>> 4K+ gets. Is that something that can easily be handled?
> >> >>>
> >> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <[email protected]> wrote:
> >> >>>
> >> >>> >> Gets a bit hairy when doing say a shitload of gets though.. no?
> >> >>> >
> >> >>> > If by "hairy" you mean the code is ugly, it was written for maximal
> >> >>> > clarity. I think you'll find a few sensible loops make it fairly
> >> >>> > clean. Otherwise I'm not sure what you mean.
> >> >>> >
> >> >>> > -----Original Message-----
> >> >>> > From: Software Dev [mailto:[email protected]]
> >> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
> >> >>> > To: [email protected]
> >> >>> > Subject: Re: Help with row and column design
> >> >>> >
> >> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
> >> >>> >> to pre-store every level of aggregation you care about.
> >> >>> >
> >> >>> > Ok, I think this makes sense. Gets a bit hairy when doing say a
> >> >>> > shitload of gets though.. no?
> >> >>> >
> >> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <[email protected]> wrote:
> >> >>> >
> >> >>> >> You don't do a scan, you do a series of gets, which I believe you
> >> >>> >> can batch into one call.
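Ted's point, that thousands of point lookups collapse into a single batched round trip via `HTable#get(List<Get>)`, can be illustrated with plain Python key construction (the keys only; the actual batch call would be the Java API quoted above, or the equivalent in your client library):

```python
import hashlib
from datetime import date, timedelta

def salted_hour_keys(start: date, end: date) -> list:
    """Build one salted row key per hour from start to end (inclusive).
    The salt is derived from the day, so all 24 hourly keys of a given
    day share a recomputable prefix."""
    keys = []
    d = start
    while d <= end:
        day = d.strftime("%Y%m%d")
        salt = hashlib.md5(day.encode()).hexdigest()[:5]
        keys.extend("%s%s%02d" % (salt, day, hour) for hour in range(24))
        d += timedelta(days=1)
    return keys

# Roughly six months of hourly metrics: 4K+ keys, built in milliseconds.
# Each would be wrapped in a Get and the whole list sent as one batch,
# e.g. in Java:  Result[] results = table.get(gets);
keys = salted_hour_keys(date(2013, 11, 1), date(2014, 4, 29))
print(len(keys))  # 4320
```

The expensive part is the server-side lookups, not building the batch, and the batch API lets the client group Gets by region server internally.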
> >> >>> >>
> >> >>> >> Last 5 days query in pseudocode:
> >> >>> >>
> >> >>> >>     res1 = Get( hash("2014-04-29") + "2014-04-29" )
> >> >>> >>     res2 = Get( hash("2014-04-28") + "2014-04-28" )
> >> >>> >>     res3 = Get( hash("2014-04-27") + "2014-04-27" )
> >> >>> >>     res4 = Get( hash("2014-04-26") + "2014-04-26" )
> >> >>> >>     res5 = Get( hash("2014-04-25") + "2014-04-25" )
> >> >>> >>
> >> >>> >> For each result you look for the particular column or columns
> >> >>> >> you are interested in:
> >> >>> >>
> >> >>> >>     total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
> >> >>> >>     total_female_usa = res1.get("c:usa:sex:f") + ...
> >> >>> >>
> >> >>> >> "What happens when we add more fields? Do we just keep adding in
> >> >>> >> more column qualifiers? If so, how would we filter across columns
> >> >>> >> to get an aggregate total?"
> >> >>> >>
> >> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
> >> >>> >> to pre-store every level of aggregation you care about.
> >> >>> >>
> >> >>> >> -----Original Message-----
> >> >>> >> From: Software Dev [mailto:[email protected]]
> >> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
> >> >>> >> To: [email protected]
> >> >>> >> Subject: Re: Help with row and column design
> >> >>> >>
> >> >>> >>> The downside is it still has a hotspot when inserting, but when
> >> >>> >>> reading a range of time it does not.
> >> >>> >>
> >> >>> >> How can you do a scan query between dates when you hash the date?
> >> >>> >>
> >> >>> >>> Column qualifiers are just the collection of items you are
> >> >>> >>> aggregating on. Values are increments. In your case qualifiers
> >> >>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
> >> >>> >>> c:italy:sex:f, c:italy.
> >> >>> >>
> >> >>> >> What happens when we add more fields? Do we just keep adding in
> >> >>> >> more column qualifiers? If so, how would we filter across columns
> >> >>> >> to get an aggregate total?
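Carlos's pre-aggregation scheme can be mimicked end to end with plain Python dicts standing in for HBase rows (the qualifier names follow his examples; the dict-of-dicts and `record_event` helper are illustrative stand-ins for HBase Increment and Get, not the real API):

```python
from collections import defaultdict

# One dict per day-row: column qualifier -> counter, mimicking HBase
# Increment on qualifiers like "c:usa" and "c:usa:sex:f".
table = defaultdict(lambda: defaultdict(int))

def record_event(day: str, country: str, sex: str) -> None:
    """Bump every level of aggregation we care about, as Carlos describes:
    each new field means more pre-stored qualifiers, not read-time filtering."""
    row = table[day]
    row["c:%s" % country] += 1
    row["c:%s:sex:%s" % (country, sex)] += 1

record_event("2014-04-29", "usa", "f")
record_event("2014-04-29", "usa", "m")
record_event("2014-04-28", "usa", "f")

# Reading "total female USA for the last 2 days" = batched Gets + a sum.
days = ["2014-04-29", "2014-04-28"]
total_usa = sum(table[d]["c:usa"] for d in days)
total_female_usa = sum(table[d]["c:usa:sex:f"] for d in days)
print(total_usa, total_female_usa)  # 3 2
```

The trade-off is the one the thread circles around: writes fan out to one counter per aggregation level, so reads never have to filter across columns, but every aggregate you might query must be chosen up front.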
