FuzzyRowFilter is not part of the REST client, so this may not be an option for us. Any alternatives?
On Wed, Apr 30, 2014 at 10:28 AM, Software Dev <[email protected]> wrote:
> I did not know of the FuzzyRowFilter. That looks like it may be my best bet.
>
> Does anyone know what Sematext's HBaseWD uses to perform efficient scanning?
>
> On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <[email protected]> wrote:
>> I would recommend pre-splitting the tables and then hashing your key and
>> putting that in the front, i.e.
>>
>> [hash(20140429:Country:US)][2014042901:Country:US]  # notice you're not
>> hashing the sequence number
>>
>> Some pseudo Python code:
>>
>>>>> import hashlib
>>>>> key = "2014042901:Country:US"
>>>>> ckey = "20140429:Country:US"
>>>>> hbase_key = "%s%s" % (hashlib.md5(ckey).hexdigest()[:5], key)
>>>>> hbase_key
>> '887d82014042901:Country:US'
>>
>> Now when you want to find something, you can just create the hash ('887d8')
>> and use FuzzyRowFilter to find it!
>>
>> cheers,
>> liam
>>
>> On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <[email protected]> wrote:
>>
>>> Any improvements in the row key design?
>>>
>>> If I always know we will be querying by country, could/should I prefix
>>> the row key with the country to help with hotspotting?
>>>
>>> FR/2014042901
>>> FR/2014042902
>>> ...
>>> US/2014042901
>>> US/2014042902
>>> ...
>>>
>>> Is this preferred over adding it in a column, i.e. 2014042901:Country:US?
>>>
>>> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <[email protected]> wrote:
>>> > Ok, I didn't know if the sheer number of gets would be a limiting
>>> > factor. Thanks.
>>> >
>>> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <[email protected]> wrote:
>>> >> As I said this afternoon, see the following API in HTable for
>>> >> batching Gets:
>>> >>
>>> >>     public Result[] get(List<Get> gets) throws IOException
>>> >>
>>> >> Cheers
>>> >>
>>> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <[email protected]> wrote:
>>> >>
>>> >>> Nothing against your code.
>>> >>> I just meant that if we are doing a scan, say for hourly metrics
>>> >>> across a 6-month period, we are talking about 4K+ gets. Is that
>>> >>> something that can easily be handled?
>>> >>>
>>> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB)
>>> >>> <[email protected]> wrote:
>>> >>> >> Gets a bit hairy when doing say a shitload of gets though.. no?
>>> >>> >
>>> >>> > If by "hairy" you mean the code is ugly, it was written for maximal
>>> >>> > clarity. I think you'll find a few sensible loops make it fairly
>>> >>> > clean. Otherwise I'm not sure what you mean.
>>> >>> >
>>> >>> > -----Original Message-----
>>> >>> > From: Software Dev [mailto:[email protected]]
>>> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
>>> >>> > To: [email protected]
>>> >>> > Subject: Re: Help with row and column design
>>> >>> >
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>>> >>> >> to pre-store every level of aggregation you care about.
>>> >>> >
>>> >>> > Ok, I think this makes sense. Gets a bit hairy when doing say a
>>> >>> > shitload of gets though.. no?
>>> >>> >
>>> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB)
>>> >>> > <[email protected]> wrote:
>>> >>> >> You don't do a scan, you do a series of gets, which I believe you
>>> >>> >> can batch into one call.
>>> >>> >>
>>> >>> >> Last-5-days query in pseudocode:
>>> >>> >>
>>> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29" )
>>> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28" )
>>> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27" )
>>> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26" )
>>> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25" )
>>> >>> >>
>>> >>> >> For each result you look for the particular column or columns you
>>> >>> >> are interested in:
>>> >>> >> Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
>>> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>>> >>> >>
>>> >>> >> "What happens when we add more fields? Do we just keep adding in
>>> >>> >> more column qualifiers? If so, how would we filter across columns
>>> >>> >> to get an aggregate total?"
>>> >>> >>
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>>> >>> >> to pre-store every level of aggregation you care about.
>>> >>> >>
>>> >>> >> -----Original Message-----
>>> >>> >> From: Software Dev [mailto:[email protected]]
>>> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>>> >>> >> To: [email protected]
>>> >>> >> Subject: Re: Help with row and column design
>>> >>> >>
>>> >>> >>> The downside is it still has a hotspot when inserting, but when
>>> >>> >>> reading a range of time it does not.
>>> >>> >>
>>> >>> >> How can you do a scan query between dates when you hash the date?
>>> >>> >>
>>> >>> >>> Column qualifiers are just the collection of items you are
>>> >>> >>> aggregating on. Values are increments. In your case qualifiers
>>> >>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>>> >>> >>> c:italy:sex:f, c:italy
>>> >>> >>
>>> >>> >> What happens when we add more fields? Do we just keep adding in
>>> >>> >> more column qualifiers? If so, how would we filter across columns
>>> >>> >> to get an aggregate total?
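For reference outside the quoted thread, Liam's salting scheme works as-is in runnable Python 3 with one change: hashlib.md5 there takes bytes, so the salt source must be encoded. The 5-character salt length is just the value used in the thread, not anything special.

```python
import hashlib

def salted_row_key(key: str, salt_source: str, salt_len: int = 5) -> str:
    """Prefix `key` with the first `salt_len` hex chars of md5(salt_source).

    Salting on the day-level portion (salt_source) rather than the full
    hourly key means every hour of one day/country gets the same prefix,
    so related rows stay adjacent under that salt.
    """
    salt = hashlib.md5(salt_source.encode("utf-8")).hexdigest()[:salt_len]
    return salt + key

key = "2014042901:Country:US"   # hourly key (sequence number included)
ckey = "20140429:Country:US"    # same key minus the hour -> stable salt
hbase_key = salted_row_key(key, ckey)
```

Because the salt is deterministic, a reader can recompute it from the date and country before issuing a Get or building a FuzzyRowFilter, exactly as described above.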
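The "4K+ gets" above is roughly one row per hour over six months, and batching them is mostly a matter of generating the key list up front. A sketch, reusing the thread's salting idea; the YYYYMMDDHH:Country:XX layout and the helper name are assumptions for illustration:

```python
import hashlib
from datetime import date, datetime, timedelta

def hourly_row_keys(start: date, end: date, country: str = "US") -> list[str]:
    """Build one salted row key per hour in [start, end),
    following the thread's hourly key layout."""
    keys = []
    t = datetime(start.year, start.month, start.day)
    stop = datetime(end.year, end.month, end.day)
    while t < stop:
        key = f"{t:%Y%m%d%H}:Country:{country}"
        ckey = f"{t:%Y%m%d}:Country:{country}"  # salt on the day only
        salt = hashlib.md5(ckey.encode()).hexdigest()[:5]
        keys.append(salt + key)
        t += timedelta(hours=1)
    return keys

# 181 days * 24 hours = 4344 keys -- the "4K+" mentioned above
keys = hourly_row_keys(date(2013, 11, 1), date(2014, 5, 1))
```

On the Java client, each of these keys would become one Get in the single batched HTable.get(List<Get>) call Ted points at, rather than 4K+ round trips.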
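Carlos's pre-aggregation advice can be illustrated without a cluster: on every event you increment one counter per aggregation level you care about, so reads never have to combine columns after the fact. A plain-dict sketch, where the qualifier names follow the thread and the dict stands in for an HBase table of atomic counters:

```python
from collections import defaultdict

# row key -> {column qualifier -> counter}; a stand-in for an HBase
# table whose values are updated with atomic increments.
table: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

def record_event(day: str, country: str, sex: str) -> None:
    """Increment every pre-stored aggregation level for one event."""
    table[day]["c:" + country] += 1               # e.g. c:usa
    table[day][f"c:{country}:sex:{sex}"] += 1     # e.g. c:usa:sex:f

record_event("2014-04-29", "usa", "f")
record_event("2014-04-29", "usa", "m")
record_event("2014-04-29", "usa", "f")

total_usa = table["2014-04-29"]["c:usa"]               # 3
total_female_usa = table["2014-04-29"]["c:usa:sex:f"]  # 2
```

Adding a new field means adding (and writing) new qualifiers for each combination you want to query, which is the trade-off the thread ends on: write amplification in exchange for cheap reads.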
