Please see this post: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
On Wed, Apr 30, 2014 at 10:28 AM, Software Dev <[email protected]> wrote:

> I did not know of the FuzzyRowFilter.. that looks like it may be my best
> bet.
>
> Anyone know what Sematext's HBaseWD uses to perform efficient scanning?
>
> On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <[email protected]> wrote:
>
> > I would recommend pre-splitting the tables and then hashing your key
> > and putting that in the front, i.e.
> >
> >     [hash(20140429:Country:US)][2014042901:Country:US]
> >
> > (notice you're not hashing the sequence number)
> >
> > Some pseudo Python code (in Python 3 md5 needs bytes, hence the
> > .encode()):
> >
> >     >>> import hashlib
> >     >>> key = "2014042901:Country:US"
> >     >>> ckey = "20140429:Country:US"
> >     >>> hbase_key = "%s%s" % (hashlib.md5(ckey.encode()).hexdigest()[:5], key)
> >     >>> hbase_key
> >     '887d82014042901:Country:US'
> >
> > Now when you want to find something, you can just recompute the hash
> > prefix ('887d8') and use FuzzyRowFilter to find it!
> >
> > cheers,
> > liam
> >
> > On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <[email protected]> wrote:
> >
> >> Any improvements in the row key design?
> >>
> >> If I always know we will be querying by country, could/should I prefix
> >> the row key with the country to help with hotspotting?
> >>
> >>     FR/2014042901
> >>     FR/2014042902
> >>     ...
> >>     US/2014042901
> >>     US/2014042902
> >>     ...
> >>
> >> Is this preferred over adding it in a column, i.e. 2014042901:Country:US?
> >>
> >> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <[email protected]> wrote:
> >>
> >> > Ok, didn't know if the sheer number of gets would be a limiting factor.
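Liam's salt-then-fuzzy-scan idea can be sketched as self-contained Python. The `fuzzy_match` helper below is a toy stand-in for HBase's FuzzyRowFilter, not the real (Java) API; as I understand the filter's semantics, a mask byte of 0 marks a position that must match the template and 1 marks a wildcard, but verify against your client version's docs:

```python
import hashlib

def salted_key(coarse_key: str, full_key: str) -> str:
    """Liam's scheme: first 5 hex chars of md5(coarse key), prepended."""
    return hashlib.md5(coarse_key.encode()).hexdigest()[:5] + full_key

def fuzzy_match(row: bytes, template: bytes, mask: bytes) -> bool:
    """Toy model of FuzzyRowFilter matching: mask byte 0 = row byte must
    equal the template byte, 1 = any byte is accepted at that position."""
    return len(row) >= len(mask) and all(
        m == 1 or r == t for r, t, m in zip(row, template, mask))

key = salted_key("20140429:Country:US", "2014042901:Country:US")

# Fix the 5-byte salt + 8-byte date, wildcard the 2-byte hour, fix the rest.
template = key[:13].encode() + b"??" + key[15:].encode()
mask = b"\x00" * 13 + b"\x01" * 2 + b"\x00" * (len(key) - 15)

print(fuzzy_match(key.encode(), template, mask))  # True: any hour matches
```

Because the salt is derived from the day-level key, it is recomputable at read time, which is what makes the fixed-prefix part of the mask possible.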
> >> > Thanks
> >> >
> >> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <[email protected]> wrote:
> >> >
> >> >> As I said this afternoon, see the following API in HTable for
> >> >> batching Gets:
> >> >>
> >> >>     public Result[] get(List<Get> gets) throws IOException
> >> >>
> >> >> Cheers
> >> >>
> >> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <[email protected]> wrote:
> >> >>
> >> >>> Nothing against your code. I just meant that if we are doing a scan,
> >> >>> say for hourly metrics across a 6 month period, we are talking about
> >> >>> 4K+ gets. Is that something that can easily be handled?
> >> >>>
> >> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <[email protected]> wrote:
> >> >>>
> >> >>> >> Gets a bit hairy when doing say a shitload of gets though.. no?
> >> >>> >
> >> >>> > If by "hairy" you mean the code is ugly, it was written for maximal
> >> >>> > clarity. I think you'll find a few sensible loops make it fairly
> >> >>> > clean. Otherwise I'm not sure what you mean.
> >> >>> >
> >> >>> > -----Original Message-----
> >> >>> > From: Software Dev [mailto:[email protected]]
> >> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
> >> >>> > To: [email protected]
> >> >>> > Subject: Re: Help with row and column design
> >> >>> >
> >> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
> >> >>> >> to pre-store every level of aggregation you care about.
> >> >>> >
> >> >>> > Ok, I think this makes sense. Gets a bit hairy when doing say a
> >> >>> > shitload of gets though.. no?
> >> >>> >
> >> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <[email protected]> wrote:
> >> >>> >
> >> >>> >> You don't do a scan, you do a series of gets, which I believe you
> >> >>> >> can batch into one call.
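Ted's point, that thousands of point lookups collapse into a single batched round trip via `HTable#get(List<Get>)`, can be illustrated with plain Python key construction (the keys only; the actual batch call would be the Java API quoted above, or the equivalent in your client library):

```python
import hashlib
from datetime import date, timedelta

def salted_hour_keys(start: date, end: date) -> list:
    """Build one salted row key per hour from start to end (inclusive).
    The salt is derived from the day, so all 24 hourly keys of a given
    day share a recomputable prefix."""
    keys = []
    d = start
    while d <= end:
        day = d.strftime("%Y%m%d")
        salt = hashlib.md5(day.encode()).hexdigest()[:5]
        keys.extend("%s%s%02d" % (salt, day, hour) for hour in range(24))
        d += timedelta(days=1)
    return keys

# Roughly six months of hourly metrics: 4K+ keys, built in milliseconds.
# Each would be wrapped in a Get and the whole list sent as one batch,
# e.g. in Java:  Result[] results = table.get(gets);
keys = salted_hour_keys(date(2013, 11, 1), date(2014, 4, 29))
print(len(keys))  # 4320
```

The expensive part is the server-side lookups, not building the batch, and the batch API lets the client group Gets by region server internally.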
> >> >>> >>
> >> >>> >> Last 5 days query in pseudocode:
> >> >>> >>
> >> >>> >>     res1 = Get( hash("2014-04-29") + "2014-04-29" )
> >> >>> >>     res2 = Get( hash("2014-04-28") + "2014-04-28" )
> >> >>> >>     res3 = Get( hash("2014-04-27") + "2014-04-27" )
> >> >>> >>     res4 = Get( hash("2014-04-26") + "2014-04-26" )
> >> >>> >>     res5 = Get( hash("2014-04-25") + "2014-04-25" )
> >> >>> >>
> >> >>> >> For each result you look for the particular column or columns
> >> >>> >> you are interested in:
> >> >>> >>
> >> >>> >>     total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
> >> >>> >>     total_female_usa = res1.get("c:usa:sex:f") + ...
> >> >>> >>
> >> >>> >> "What happens when we add more fields? Do we just keep adding in
> >> >>> >> more column qualifiers? If so, how would we filter across columns
> >> >>> >> to get an aggregate total?"
> >> >>> >>
> >> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
> >> >>> >> to pre-store every level of aggregation you care about.
> >> >>> >>
> >> >>> >> -----Original Message-----
> >> >>> >> From: Software Dev [mailto:[email protected]]
> >> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
> >> >>> >> To: [email protected]
> >> >>> >> Subject: Re: Help with row and column design
> >> >>> >>
> >> >>> >>> The downside is it still has a hotspot when inserting, but when
> >> >>> >>> reading a range of time it does not.
> >> >>> >>
> >> >>> >> How can you do a scan query between dates when you hash the date?
> >> >>> >>
> >> >>> >>> Column qualifiers are just the collection of items you are
> >> >>> >>> aggregating on. Values are increments. In your case qualifiers
> >> >>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
> >> >>> >>> c:italy:sex:f, c:italy.
> >> >>> >>
> >> >>> >> What happens when we add more fields? Do we just keep adding in
> >> >>> >> more column qualifiers? If so, how would we filter across columns
> >> >>> >> to get an aggregate total?
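Carlos's pre-aggregation scheme can be mimicked end to end with plain Python dicts standing in for HBase rows (the qualifier names follow his examples; the dict-of-dicts and `record_event` helper are illustrative stand-ins for HBase Increment and Get, not the real API):

```python
from collections import defaultdict

# One dict per day-row: column qualifier -> counter, mimicking HBase
# Increment on qualifiers like "c:usa" and "c:usa:sex:f".
table = defaultdict(lambda: defaultdict(int))

def record_event(day: str, country: str, sex: str) -> None:
    """Bump every level of aggregation we care about, as Carlos describes:
    each new field means more pre-stored qualifiers, not read-time filtering."""
    row = table[day]
    row["c:%s" % country] += 1
    row["c:%s:sex:%s" % (country, sex)] += 1

record_event("2014-04-29", "usa", "f")
record_event("2014-04-29", "usa", "m")
record_event("2014-04-28", "usa", "f")

# Reading "total female USA for the last 2 days" = batched Gets + a sum.
days = ["2014-04-29", "2014-04-28"]
total_usa = sum(table[d]["c:usa"] for d in days)
total_female_usa = sum(table[d]["c:usa:sex:f"] for d in days)
print(total_usa, total_female_usa)  # 3 2
```

The trade-off is the one the thread circles around: writes fan out to one counter per aggregation level, so reads never have to filter across columns, but every aggregate you might query must be chosen up front.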
