Addresses will mainly be accessed independently, and sometime only, with the other data.
I'm not sure either to prefer the "versions" option. So if I go with a 2nd table, does it mean it's better to have more rows than more columns? Here are the 2 options now. Both with a new table. 1) I store the key "personID" and a:a1 to a:an for the addresses. 2) I store the key "personID" + "address In both I will have the same amount of data. In #1 total size will be smaller since the key will be stored only once. In #1 I will have more columns where in #2 I will have more rows. Is there one better than the other one? Also, if I go with option 1, why is it better to have a 2nd table instead of a 2nd column familly? JM 2012/7/2, Amandeep Khurana <[email protected]>: > Responses inline > > > On Monday, July 2, 2012 at 12:53 PM, Jean-Marc Spaggiari wrote: > >> Hi Amandeep, >> >> Thanks for your prompt reply. >> >> I forgot to add that all the addresses are valid at the same time. >> There is no orders int the addresses. They are all active addresses at >> the same time. If one is not valid any more, it's removed. If there is >> a new one, it's added to the list, not replacing any other. So it's >> not "the last address", but I have to consider all the addresses when >> I will process them. >> >> Regarding adding the address count in the first CF, don't ask me why I >> have put in in 'a'. I have no clue why I did not tought about adding >> it to 'b' directly. I agree that it's useless to have it in 'a'. >> >> The idea of the hash as a column name was just to have something to >> put there. It's like the '1' in the second solution. A random number >> will do the same thing. >> >> I'm accessing the data in 2 ways. >> 1) I acces the person information to update them or retreive all of >> them to display them >> 2) I access only the address the compute some statistiques about it. >> Which mean usually I read ALL the address for one person and not just >> one address at a time. >> > So, that means that the addresses are accessed independently of the other > information and you always access all the addresses together? Or does that > mean that the addresses are accessed along with the other information to > display or retrieve and they are also accessed separately for the stats > calculation? > > You could consider the following ideas: > > 1. Store everything in 'a' and let all addresses go into the column > 'a:address'. Increase the versions to N, where N is the max number of > addresses you want to store for any user. > > OR > > 2. Store addresses in an entirely different table with the rowkey being > user+address. The column qualifier and cell value could be just a simple 1 > for the sake of having something there. When you want to get all addresses > for a user, you just scan from start key 'user' to end key 'user+1'. > > I'm not a fan of the first schema option that you outlined earlier because > of the complexity involved in the client code. That approach works with > relational databases where you have the ability to do transactions. In the > HBase world, not so much. >> >> So basically, there all the 3 options almost the same thing. If I >> store the number of addresses, I will have more work when I have to >> add/remove one entry, same amount of work when I want to parse the >> entries, and less work when I want to count the entries. >> >> Difficult choice. I don't find any schema better than the other one >> because all of them have pros and cons. For now, my prefered one is #1 >> because it's sound more "natural" to store the number of columns, then >> parse them by name, etc. but I think I need to think about it a little >> be more before taking any decision... >> >> JM >> >> 2012/7/2, Amandeep Khurana <[email protected] (mailto:[email protected])>: >> > Jean-Marc, >> > >> > These are great questions! Find my answers (and some questions for you) >> > inline. >> > >> > -ak >> > >> > >> > On Monday, July 2, 2012 at 12:04 PM, Jean-Marc Spaggiari wrote: >> > >> > > Hi, >> > > >> > > I have a question regarding the best way to design a table. >> > > >> > > Let's imagine I want to store all the people in the world on a >> > > database. >> > > >> > > Everyone has a name, last name, phone number, lot of flags (sex, age, >> > > etc.). >> > > >> > > Now, people can have one address, but they can also have 2, or 3, or >> > > even more... But they will never have thousands of addresses. Let's >> > > say, usually, they have between 1 and 10. >> > > >> > >> > >> > The point to think about here is - what will be your read access >> > pattern? >> > Will you always want the latest address? Or will you want all addresses >> > every time? And then also defining the maximum number of addresses to >> > be >> > stored. >> > > >> > > My table is designes like that. >> > > >> > > create 'person', {NAME => 'a', VERSIONS => 1}, {NAME => 'b', VERSIONS >> > > => 1, COMPRESSION => 'gz'} >> > > >> > >> > >> > You could easily bump up the versions to a number that limits the max >> > number >> > of addresses you will store. >> > Having two separate column families is not the way to solve this problem >> > in >> > my opinion. Reason being - the concept of column families enables you >> > to >> > isolate data with different access patterns. If that's what you desire >> > here, >> > multiple families make sense. But again, this goes back to defining >> > your >> > read patterns. Are you going to access all the data together or are the >> > addresses going to be accessed independently of the rest of the >> > information. >> > > >> > > The 'a' CF will contain all the informations exepct the address. >> > > The 'b' CF will contain only the address. >> > > >> > > I have few options to store the addresses. >> > > I can: >> > > - Store in CF 'a' a flag to tell how many addresses there is and >> > > store >> > > "add1" to "addx" in the 'b' CF will each cell containing the address. >> > > >> > >> > >> > This sort of becomes a case where you'll need to build a transaction >> > like >> > logic in your client code. When you want to store an additional >> > address, >> > you'll need to do the following: >> > 1. read counter from 'a'. Let's say that is n. >> > 2. store next address with CQ as add[n++] >> > 3. store n++ as the counter >> > >> > That complicates the client code and is undesirable. Moreover, you are >> > accessing both column families at the time of any access to the address >> > info. It is probably better to store the counter in 'b' instead of 'a' >> > in >> > this approach but you still have the complication of the transaction >> > like >> > logic. >> > > - Store in CF 'b' the addresses using an hash as the column >> > > identifier. >> > >> > >> > The hash doesn't buy you anything. How do you ensure that you are >> > reading >> > the latest address? Again, goes back to defining the read patterns. >> > > - Store in CF 'b' the addresses as the column identifier and simply >> > > put '1' in the cell, or a hash. >> > > >> > >> > Same problem as the last approach. >> > > >> > > The first option give me very quick information about the number of >> > > addresses, but if I need to add one address, I have to update the 2 >> > > CF. Same if I have to remove one. >> > > The second option will allow me to add any address even without >> > > checking if it's already there. I can remove one very quickly and add >> > > one very quickly. If I want to know the number of addresses, I have >> > > to >> > > retreive all the columns in the CF and count them. However, I'm >> > > storing almost the same information twice. One time with the address, >> > > one time with the hash (CRC32). >> > > The 3rd option has all the advantages of the second one but also, >> > > it's >> > > not storing the information twice. However, that might result in VERY >> > > long column names. And I'm not sure it's good. Like, if I just want >> > > to >> > > know how many address this person has, I will still need to download >> > > them totally on the client side to count them. >> > > >> > >> > >> > Long column qualifiers are perfectly fine and take the same amount of >> > disk >> > space as storing the data in the cells. I don't believe that should be >> > a >> > concern. >> > > >> > > I'm not able to find which solution I should use. All of them have >> > > some pros and cons. And I'm not advanced enought in HBase to forsee >> > > the issues I will have later with one format or the other. >> > > >> > > If I look at the online documentation ( >> > > http://hbase.apache.org/book.html#keysize ) it seems the 3rd option >> > > is >> > > not a good one. So I might have to choose between the 2 first one. >> > > >> > > Does anyone have any advice/recommandation regarding the best of the >> > > 2 >> > > formats I should use? Or even maybe there is some other options I >> > > have >> > > not yet figured? >> > > >> > > Thanks >> > > >> > > JM > >
