Thanks for the good comments!

Jason

On Tue, Oct 2, 2012 at 7:36 PM, Pamecha, Abhishek <[email protected]> wrote:
> For 1. I wouldn't worry about that problem until it really happens. Just my
> opinion. If you really want to solve it, you will need to generate a unique id
> per row-key 'put' outside of HBase [say, some hash of server IP + timestamp,
> etc.] and append it to the end of your row key.
>
> For 2. You can investigate bloom filters, which can help you filter out
> invalid rows faster. Also, there are ways to organize names based on
> phonetics. You could maybe build a secondary table in the background with
> phonetic keys as row keys.
> http://en.wikipedia.org/wiki/Soundex
>
> hth,
> Abhishek
>
>
> -----Original Message-----
> From: Jason Huang [mailto:[email protected]]
> Sent: Tuesday, October 02, 2012 2:38 PM
> To: [email protected]
> Subject: Re: HBase table row key design question.
>
> Thanks Mohammad.
>
> The issue with the phone number is that it tends to change over time, and we
> think name and DOB are more reliable. SSN is more likely to be unique, but the
> issue is that we can't force the user to provide it. Basically we have limited
> information that can be used.
>
> thanks,
>
> Jason
>
> On Tue, Oct 2, 2012 at 3:30 PM, Mohammad Tariq <[email protected]> wrote:
>> Hello Sir,
>>
>> Although we should always try to keep the rowkey as short as
>> possible, a short key that doesn't help much with faster
>> data access is also of no use. So, it totally depends on the
>> particular use case. However, in your case, how about using "phone number"
>> as the rowkey?
>> Since it is always unique, you will always get the correct result with
>> a much shorter rowkey. It's just that in this case you will have to ask
>> for the user's phone number instead of name and DOB.
>>
>> Regards,
>> Mohammad Tariq
>>
>>
>> On Tue, Oct 2, 2012 at 7:58 PM, Jason Huang <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am designing an HBase table for users and hope to get some
>>> suggestions for my row key design. Thanks...
>>>
>>> This user table will have columns which include user information such
>>> as names, birthday, gender, address, phone number, etc. The first
>>> time a user comes to us we will ask for all this information and we
>>> should generate a new row in the table with a unique row key. The next
>>> time the same user comes in again we will ask for his/her names and
>>> birthday, and our application should quickly get the row(s) in the
>>> table which match the name and birthday provided.
>>>
>>> Here is what I am thinking of as the row key:
>>>
>>> {first 6 characters of user's first name}_{first 6 characters of user's
>>> last name}_{birthday in MMDDYYYY}_{timestamp when user comes in for the
>>> first time}
>>>
>>> However, I see a few questions about this row key:
>>>
>>> (1) Although it is not very likely, there is a small chance that
>>> two users with the same name and birthday come in on the
>>> same day, and that the two requests to generate a new user arrive at
>>> the same time (the timestamps were defined in the HTable API and
>>> happened to be of the same value before calling the put method). This
>>> means the row key design above won't guarantee a unique row key. Any
>>> suggestions on how to modify it and ensure a unique ID?
>>>
>>> (2) Sometimes we will only have part of the user's first name and/or
>>> last name. In that case, we will need to perform a scan and return
>>> multiple matches to the client. To avoid scanning the whole table, if
>>> we have the user's first name, we can set the start/stop row
>>> accordingly. But if we only have the user's last name, we can't set up
>>> a good start/stop row.
>>> What's even worse, if the user provides a "sounds-like" first or last
>>> name, then our scan won't be able to return good possible matches.
>>> Has anyone used names as part of the row key and encountered this
>>> type of issue?
>>>
>>> (3) The row key seems to be long (30+ chars). Will this affect our
>>> read/write performance? Maybe it will increase the storage a bit (say
>>> we have 3 million rows per month)? In other words, does the length of
>>> the row key matter a lot?
>>>
>>> thanks!
>>>
>>> Jason
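
For reference, the row key from the original question, combined with the unique-suffix idea from the reply to (1), could be sketched roughly as below. This is only an illustration in plain Python with no HBase client involved. The function names (name_part, build_row_key, prefix_scan_range), the choice of SHA-1 truncated to 8 hex characters, and the extra random nonce are all assumptions made here, not something proposed in the thread. The nonce is added because server IP plus timestamp alone could still repeat within one millisecond on one server. The second helper shows how a start/stop row pair for a known first-name prefix might be derived, assuming the usual convention that the start row is inclusive and the stop row is exclusive.

```python
import hashlib
import uuid


def name_part(name: str, width: int = 6) -> str:
    """First `width` characters of a name, lower-cased (shorter names stay short)."""
    return name.strip().lower()[:width]


def build_row_key(first, last, birthday_mmddyyyy, timestamp_ms, server_ip, nonce=None):
    """Row key from the thread plus a short hash suffix for uniqueness.

    The suffix hashes server IP + timestamp + a random nonce. The nonce is an
    assumption of this sketch; pass one explicitly to get repeatable output.
    """
    if nonce is None:
        nonce = uuid.uuid4().hex
    raw = f"{server_ip}|{timestamp_ms}|{nonce}".encode("utf-8")
    suffix = hashlib.sha1(raw).hexdigest()[:8]
    return "_".join(
        [name_part(first), name_part(last), birthday_mmddyyyy, str(timestamp_ms), suffix]
    )


def prefix_scan_range(prefix: str):
    """Return (start_row, stop_row) bytes covering every key that begins with prefix.

    The stop row is the prefix with its last byte incremented, so it is the
    smallest key that no longer matches. An empty stop row means "no upper bound".
    """
    start = prefix.encode("utf-8")
    stop = bytearray(start)
    while stop and stop[-1] == 0xFF:
        stop.pop()  # a trailing 0xFF byte cannot be incremented, so drop it
    if not stop:
        return start, b""
    stop[-1] += 1
    return start, bytes(stop)


if __name__ == "__main__":
    key = build_row_key("Jason", "Huang", "10021980", 1349200000000, "10.0.0.1")
    print(key)
    print(prefix_scan_range("jason_"))
```

Because the first name leads the key, a range like this only helps when the first name is known. A last-name-only lookup would still need a full scan or a second table keyed differently, which is the limitation raised in (2).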

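The phonetic secondary table mentioned in the reply to (2) depends on turning a name into a "sounds-like" code. Below is a minimal sketch of American Soundex, as described on the linked Wikipedia page, again in plain Python. The helper phonetic_index_key and its key layout (Soundex of first name, Soundex of last name, birthday) are hypothetical, chosen only to show what a row key in such a secondary table might look like. Each row there would presumably store the row key of the matching entry in the main table.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for digit, letters in (
    ("1", "bfpv"),
    ("2", "cgjkqsxz"),
    ("3", "dt"),
    ("4", "l"),
    ("5", "mn"),
    ("6", "r"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of `name` ("" if it has no a-z letters)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return result.ljust(4, "0")


def phonetic_index_key(first: str, last: str, birthday_mmddyyyy: str) -> str:
    """Hypothetical row key for a secondary 'sounds-like' lookup table."""
    return "_".join([soundex(first), soundex(last), birthday_mmddyyyy])


if __name__ == "__main__":
    print(soundex("Jason"), soundex("Jayson"))
    print(phonetic_index_key("Jason", "Huang", "10021980"))
```

With keys like this, "Jason" and "Jayson" land on the same phonetic prefix, so a sounds-like query becomes a short prefix scan on the secondary table followed by gets on the main table.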