FuzzyRowFilter is not part of the REST client, so this may not be an option for us. Any alternatives?
On Wed, Apr 30, 2014 at 10:28 AM, Software Dev <[email protected]> wrote:
> I did not know of the FuzzyRowFilter. That looks like it may be my best bet.
>
> Does anyone know what Sematext's HBaseWD uses to perform efficient scanning?
>
> On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <[email protected]> wrote:
>> I would recommend pre-splitting the tables and then hashing your key and
>> putting that in the front, i.e.
>>
>> [hash(20140429:Country:US)][2014042901:Country:US]  # notice you're not
>> hashing the sequence number
>>
>> Some pseudo Python code:
>>
>>>>> import hashlib
>>>>> key = "2014042901:Country:US"
>>>>> ckey = "20140429:Country:US"
>>>>> hbase_key = "%s%s" % (hashlib.md5(ckey).hexdigest()[:5], key)
>>>>> hbase_key
>> '887d82014042901:Country:US'
>>
>> Now when you want to find something, you can just create the hash ('887d8')
>> and use FuzzyRowFilter to find it!
>>
>> cheers,
>> liam
>>
>> On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <[email protected]> wrote:
>>
>>> Any improvements in the row key design?
>>>
>>> If I always know we will be querying by country, could/should I prefix
>>> the row key with the country to help with hotspotting?
>>>
>>> FR/2014042901
>>> FR/2014042902
>>> ...
>>> US/2014042901
>>> US/2014042902
>>> ...
>>>
>>> Is this preferred over adding it in a column, i.e. 2014042901:Country:US?
>>>
>>> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <[email protected]> wrote:
>>> > Ok, I didn't know if the sheer number of gets would be a limiting
>>> > factor. Thanks.
>>> >
>>> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <[email protected]> wrote:
>>> >> As I said this afternoon, see the following API in HTable for
>>> >> batching Gets:
>>> >>
>>> >>     public Result[] get(List<Get> gets) throws IOException
>>> >>
>>> >> Cheers
>>> >>
>>> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <[email protected]> wrote:
>>> >>
>>> >>> Nothing against your code.
>>> >>> I just meant that if we are doing a scan, say for hourly metrics
>>> >>> across a 6-month period, we are talking about 4K+ gets. Is that
>>> >>> something that can easily be handled?
>>> >>>
>>> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB)
>>> >>> <[email protected]> wrote:
>>> >>> >> Gets a bit hairy when doing say a shitload of gets though.. no?
>>> >>> >
>>> >>> > If by "hairy" you mean the code is ugly, it was written for maximal
>>> >>> > clarity. I think you'll find a few sensible loops make it fairly
>>> >>> > clean. Otherwise I'm not sure what you mean.
>>> >>> >
>>> >>> > -----Original Message-----
>>> >>> > From: Software Dev [mailto:[email protected]]
>>> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
>>> >>> > To: [email protected]
>>> >>> > Subject: Re: Help with row and column design
>>> >>> >
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>>> >>> >> to pre-store every level of aggregation you care about.
>>> >>> >
>>> >>> > Ok, I think this makes sense. Gets a bit hairy when doing say a
>>> >>> > shitload of gets though.. no?
>>> >>> >
>>> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB)
>>> >>> > <[email protected]> wrote:
>>> >>> >> You don't do a scan, you do a series of gets, which I believe you
>>> >>> >> can batch into one call.
>>> >>> >>
>>> >>> >> Last-5-days query in pseudocode:
>>> >>> >>
>>> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29" )
>>> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28" )
>>> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27" )
>>> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26" )
>>> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25" )
>>> >>> >>
>>> >>> >> For each result you look for the particular column or columns you
>>> >>> >> are interested in:
>>> >>> >> Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
>>> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>>> >>> >>
>>> >>> >> "What happens when we add more fields? Do we just keep adding in
>>> >>> >> more column qualifiers? If so, how would we filter across columns
>>> >>> >> to get an aggregate total?"
>>> >>> >>
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>>> >>> >> to pre-store every level of aggregation you care about.
>>> >>> >>
>>> >>> >> -----Original Message-----
>>> >>> >> From: Software Dev [mailto:[email protected]]
>>> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>>> >>> >> To: [email protected]
>>> >>> >> Subject: Re: Help with row and column design
>>> >>> >>
>>> >>> >>> The downside is it still has a hotspot when inserting, but when
>>> >>> >>> reading a range of time it does not.
>>> >>> >>
>>> >>> >> How can you do a scan query between dates when you hash the date?
>>> >>> >>
>>> >>> >>> Column qualifiers are just the collection of items you are
>>> >>> >>> aggregating on. Values are increments. In your case qualifiers
>>> >>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>>> >>> >>> c:italy:sex:f, c:italy
>>> >>> >>
>>> >>> >> What happens when we add more fields? Do we just keep adding in
>>> >>> >> more column qualifiers? If so, how would we filter across columns
>>> >>> >> to get an aggregate total?
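For reference outside the quoted thread, Liam's salting scheme works as-is in runnable Python 3 with one change: hashlib.md5 there takes bytes, so the salt source must be encoded. The 5-character salt length is just the value used in the thread, not anything special.

```python
import hashlib

def salted_row_key(key: str, salt_source: str, salt_len: int = 5) -> str:
    """Prefix `key` with the first `salt_len` hex chars of md5(salt_source).

    Salting on the day-level portion (salt_source) rather than the full
    hourly key means every hour of one day/country gets the same prefix,
    so related rows stay adjacent under that salt.
    """
    salt = hashlib.md5(salt_source.encode("utf-8")).hexdigest()[:salt_len]
    return salt + key

key = "2014042901:Country:US"   # hourly key (sequence number included)
ckey = "20140429:Country:US"    # same key minus the hour -> stable salt
hbase_key = salted_row_key(key, ckey)
```

Because the salt is deterministic, a reader can recompute it from the date and country before issuing a Get or building a FuzzyRowFilter, exactly as described above.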
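The "4K+ gets" above is roughly one row per hour over six months, and batching them is mostly a matter of generating the key list up front. A sketch, reusing the thread's salting idea; the YYYYMMDDHH:Country:XX layout and the helper name are assumptions for illustration:

```python
import hashlib
from datetime import date, datetime, timedelta

def hourly_row_keys(start: date, end: date, country: str = "US") -> list[str]:
    """Build one salted row key per hour in [start, end),
    following the thread's hourly key layout."""
    keys = []
    t = datetime(start.year, start.month, start.day)
    stop = datetime(end.year, end.month, end.day)
    while t < stop:
        key = f"{t:%Y%m%d%H}:Country:{country}"
        ckey = f"{t:%Y%m%d}:Country:{country}"  # salt on the day only
        salt = hashlib.md5(ckey.encode()).hexdigest()[:5]
        keys.append(salt + key)
        t += timedelta(hours=1)
    return keys

# 181 days * 24 hours = 4344 keys -- the "4K+" mentioned above
keys = hourly_row_keys(date(2013, 11, 1), date(2014, 5, 1))
```

On the Java client, each of these keys would become one Get in the single batched HTable.get(List<Get>) call Ted points at, rather than 4K+ round trips.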
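Carlos's pre-aggregation advice can be illustrated without a cluster: on every event you increment one counter per aggregation level you care about, so reads never have to combine columns after the fact. A plain-dict sketch, where the qualifier names follow the thread and the dict stands in for an HBase table of atomic counters:

```python
from collections import defaultdict

# row key -> {column qualifier -> counter}; a stand-in for an HBase
# table whose values are updated with atomic increments.
table: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

def record_event(day: str, country: str, sex: str) -> None:
    """Increment every pre-stored aggregation level for one event."""
    table[day]["c:" + country] += 1               # e.g. c:usa
    table[day][f"c:{country}:sex:{sex}"] += 1     # e.g. c:usa:sex:f

record_event("2014-04-29", "usa", "f")
record_event("2014-04-29", "usa", "m")
record_event("2014-04-29", "usa", "f")

total_usa = table["2014-04-29"]["c:usa"]               # 3
total_female_usa = table["2014-04-29"]["c:usa:sex:f"]  # 2
```

Adding a new field means adding (and writing) new qualifiers for each combination you want to query, which is the trade-off the thread ends on: write amplification in exchange for cheap reads.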
