Thanks for the good comments!

Jason

On Tue, Oct 2, 2012 at 7:36 PM, Pamecha, Abhishek <[email protected]> wrote:
> For 1: I wouldn't worry about that problem until it really happens. Just my
> opinion. If you really want to solve it, you will need to generate a unique ID
> per row-key 'put' outside of HBase (say, some hash of server IP + timestamp,
> etc.) and append it to the end of your row key.
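
A minimal sketch of the suffix idea above, assuming Python on the client side. The function names, the "|" separator, the 8-character truncation and the random nonce are all illustrative assumptions, not something from HBase itself. The nonce is added because server IP + timestamp alone can still collide when one server handles both requests in the same millisecond.

```python
import hashlib
import time
import uuid


def unique_suffix(server_ip: str, now_ms: int, nonce: str) -> str:
    """Hash of server IP + timestamp + nonce, truncated to 8 hex chars.

    Truncation keeps the key short but makes uniqueness probabilistic,
    not guaranteed (hypothetical trade-off chosen for this sketch).
    """
    raw = f"{server_ip}|{now_ms}|{nonce}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()[:8]


def make_row_key(first, last, dob_mmddyyyy, server_ip, now_ms=None, nonce=None):
    """Build {first6}_{last6}_{MMDDYYYY}_{timestamp}_{suffix} outside of HBase."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if nonce is None:
        nonce = uuid.uuid4().hex  # random per-put value (assumption, see above)
    prefix = f"{first[:6]}_{last[:6]}_{dob_mmddyyyy}_{now_ms}"
    return f"{prefix}_{unique_suffix(server_ip, now_ms, nonce)}"
```

Because the suffix goes at the end, scans by name and birthday prefix still work as before.
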
>
> For 2: You can investigate bloom filters, which can help you filter out
> invalid rows faster. Also, there are ways to organize names based on
> phonetics. You could, for example, build a secondary table in the background
> with phonetic keys as row keys.
> http://en.wikipedia.org/wiki/Soundex
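
To make the phonetic-key suggestion concrete, here is a sketch of American Soundex in Python, plus a hypothetical `phonetic_index_key` helper for the secondary table. The key layout (last name code, first name code, birthday) is an assumption for illustration. The secondary row would store the primary row key as its value.

```python
# Letter-to-digit table for American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return "".join(out)[:4].ljust(4, "0")


def phonetic_index_key(first: str, last: str, dob_mmddyyyy: str) -> str:
    """Hypothetical row key for the secondary (phonetic) table."""
    return f"{soundex(last)}_{soundex(first)}_{dob_mmddyyyy}"
```

With this, "Smith" and "Smyth" map to the same code, so a "sounds-like" lookup becomes a prefix scan on the secondary table.
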
>
>
> hth,
> Abhishek
>
>
> -----Original Message-----
> From: Jason Huang [mailto:[email protected]]
> Sent: Tuesday, October 02, 2012 2:38 PM
> To: [email protected]
> Subject: Re: HBase table row key design question.
>
> Thanks Mohammad.
>
> The issue with a phone number is that it tends to change over time, and we
> think name and DOB are more reliable. An SSN is closer to unique, but we
> can't force the user to provide it. Basically, we have limited
> information that can be used.
>
> thanks,
>
> Jason
>
> On Tue, Oct 2, 2012 at 3:30 PM, Mohammad Tariq <[email protected]> wrote:
>> Hello Sir,
>>
>>      Although we should always try to keep the rowkey as short as
>> possible, a short key that doesn't help with faster data access is of
>> no use either. So it depends entirely on the particular use case.
>> However, in your case, how about using "phone number" as the rowkey?
>> Since it is always unique, you will always get the correct result with
>> a much shorter rowkey. It's just that in this case you will have to ask
>> for the user's phone number instead of name and DOB.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>>
>> On Tue, Oct 2, 2012 at 7:58 PM, Jason Huang <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am designing an HBase table for users and hope to get some
>>> suggestions for my row key design. Thanks...
>>>
>>> This user table will have columns for user information such as
>>> name, birthday, gender, address, phone number, etc. The first time
>>> a user comes to us, we will ask for all this information and
>>> generate a new row in the table with a unique row key. The next time
>>> the same user comes in, we will ask for his/her name and birthday,
>>> and our application should quickly get the row(s) in the table
>>> that match the name and birthday provided.
>>>
>>> Here is what I am thinking as row key:
>>>
>>> {first 6 characters of user's first name}_{first 6 characters of
>>> user's last name}_{birthday in MMDDYYYY}_{timestamp when user comes
>>> in for the first time}
>>>
>>> However, I see a few questions from this row key:
>>>
>>> (1) Although it is not very likely, there is a small chance that
>>> two users with the same name and birthday come in on the same day,
>>> and that the two requests to generate a new user arrive at the same
>>> time (the timestamps are defined in the HTable API and happen to
>>> have the same value before calling the put method). This means the
>>> row key design above won't guarantee a unique row key. Any
>>> suggestions on how to modify it and ensure a unique ID?
>>>
>>> (2) Sometimes we will only have part of the user's first name and/or
>>> last name. In that case, we will need to perform a scan and return
>>> multiple matches to the client. To avoid scanning the whole table, if
>>> we have the user's first name, we can set the start/stop row
>>> accordingly. But if we only have the user's last name, we can't set
>>> up a good start/stop row.
>>> What's even worse, if the user provides a "sounds-like" first or last
>>> name, then our scan won't be able to return good possible matches.
>>> Has anyone used names as part of the row key and encountered this
>>> type of issue?
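
To illustrate the start/stop row point in (2): a known first-name prefix maps to one contiguous key range, which can be sketched as below. This is a plain-Python illustration, not the HBase client API; `prefix_scan_range` is a hypothetical helper. It assumes the usual HBase conventions that the stop row is exclusive and that an empty stop row means "scan to the end of the table".

```python
def prefix_scan_range(prefix: bytes):
    """Return (start_row, stop_row) covering every key that starts with prefix.

    The stop row is the prefix with its last byte incremented, after
    dropping trailing 0xFF bytes that cannot be incremented.
    """
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return prefix, b""  # no upper bound: scan to the end of the table
    stop = stripped[:-1] + bytes([stripped[-1] + 1])
    return prefix, stop


def keys_in_range(sorted_keys, start: bytes, stop: bytes):
    """Simulate a scan over lexicographically sorted row keys."""
    return [k for k in sorted_keys if k >= start and (stop == b"" or k < stop)]
```

A last name sits in the middle of the key, so rows with the same last name are scattered across the table and no single start/stop pair covers them. That is why a last-name-only lookup falls back to a full scan with this key layout.
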
>>>
>>> (3) The row key seems to be long (30+ chars). Will this affect our
>>> read/write performance? Maybe it will increase the storage a bit (say
>>> we have 3 million rows per month)? In other words, does the length of
>>> the row key matter a lot?
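
A rough back-of-the-envelope estimate for (3). HBase stores the row key with every cell, so the key length is multiplied by the number of columns written per row. The figure of 10 cells per row and the sample key are assumptions for illustration only, and the result ignores compression, block encoding and HDFS replication.

```python
def row_key_bytes_per_month(rows: int, cells_per_row: int, key_len: int) -> int:
    """Raw bytes spent on repeated row keys, before compression/replication."""
    return rows * cells_per_row * key_len


# Sample key in the proposed format (6 + 1 + 6 + 1 + 8 + 1 + 13 characters).
sample_key = "Jonath_Smiths_01021980_1349200000000"

# 3 million rows per month (from the question), 10 cells per row (assumed).
monthly_bytes = row_key_bytes_per_month(3_000_000, 10, len(sample_key))
print(monthly_bytes / 1e9, "GB of row-key bytes per month")
```

Under these assumptions the keys cost on the order of 1 GB per month, so the length is noticeable but unlikely to dominate at this volume.
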
>>>
>>> thanks!
>>>
>>> Jason
>>>
