Sorry, but you missed the point. (Note: This is why I keep trying to give a talk on schema design at Strata and the other conferences, yet for some reason... it just doesn't seem important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc ... ;-)
Look, the issue is what column families are and how to use them. Since each one lives in separate HFiles that use the same key, the question is why you need one and when you want to use it.

The answer unfortunately is a bit more complicated than the questions. You have to ask yourself: when do you have a series of tables which have the same key value? How do you access this data? It gets more involved, but just looking at the answers to those two questions is a start.

Like I said, think about the order entry example and how the data is used in those column families.

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL.

Sorry to shout that last part, but it's a very important concept. You need to stop thinking in terms of ERD when there is no relationship. Column families tend to create a weak relationship... which makes them a bit more confusing....

On Jul 5, 2013, at 11:16 AM, Aji Janis <[email protected]> wrote:

> I understand that there shouldn't be an unlimited number of column families. I
> am using this example on purpose to see how it comes into play.
>
> On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel
> <[email protected]> wrote:
>
>> Why do you have so many column families (CF)?
>>
>> It's not a question of the physical limitations, but more of the issue of
>> data design.
>>
>> There aren't that many really good examples of where you would have
>> multiple column families that would require more than a handful of CFs.
>>
>> When I teach or lecture, the example I use is an order entry system,
>> where you would have the same key on order entry, pick slips, shipping,
>> and invoice.
>>
>> That's probably the best example of where CFs come into play.
>>
>> I'd suggest that you go back and rethink the design if you're having more
>> than a handful.
>>
>> On Jul 5, 2013, at 8:53 AM, Aji Janis <[email protected]> wrote:
>>
>>> Asaf,
>>>
>>> I am using the Genre/Author stuff as an example but yes at the moment I
>>> only have 5 column families.
>>> However, over time I may have more (no upper limit decided at this
>>> point). See below for more responses.
>>>
>>> On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <[email protected]> wrote:
>>>
>>>> Do you have only 5 static author names?
>>>> Keep in mind the column family name is defined when creating the table.
>>>>
>>>> Regarding the tall vs wide debate:
>>>> HBase is first and foremost a Key Value database, thus it reads and writes
>>>> at the column-value level. So it doesn't really care about rows.
>>>> But that's not entirely true. Rows come into play in the following
>>>> situations:
>>>> Splitting a region is per row and not per column, thus a row will be saved
>>>> as a whole on a region. If you have a really large row, the region size
>>>> granularity is dependent on it. It doesn't seem to be the case here.
>>>> Put/Delete creates a lock until finished. If you are intensive on inserts
>>>> to the same row at the same time, this might be bad for you. Keeping your
>>>> rows slimmer can reduce contention, but again, only if you make a lot of
>>>> concurrent modifications to the same row.
>>>>
>>>
>>> I expect batches of Put/Delete to the same row to happen by at most one
>>> thread at a time based on user's current behavior. So locking shouldn't be
>>> an issue. However, not sure if the saving row to a region with enough space
>>> topic is really an issue I need to worry about (probably because I just
>>> don't know much about it yet).
>>>
>>>> Filtering - if you need a filter which needs all the row (there is a method
>>>> you override in Filter to mark that) then a fat row will be more memory
>>>> intensive. If you needed only 1/5 of your row, then maybe splitting it into 5
>>>> rows to begin with would have made a better schema design in terms of
>>>> memory and I/O.
>>>>
>>>
>>> Currently, my access pattern is to get all data for a given row. It's
>>> possible in the future we may want to apply (family/qualifier) filters.
>>> There is a lot of uncertainty on use cases (client side) at this point
>>> which is why I am not entirely sure on how things will look months from
>>> now. I am not sure I follow this statement
>>>
>>> "if you need a filter which needs all the row (there is a method you
>>> override in Filter to mark that) then a fat row will be more memory
>>> intensive."
>>>
>>> Can you please explain? Thank you for these suggestions btw, good food for
>>> thought!
>>>
>>>> On Wednesday, July 3, 2013, Aji Janis wrote:
>>>>
>>>>> I have a major typo in the question so I apologize. I meant to say 5
>>>>> families with 1000+ qualifiers each.
>>>>>
>>>>> Let's work with an example (not the greatest example here but still). Let's
>>>>> say we have a Genre class like this:
>>>>>
>>>>> class HistoryBooks {
>>>>>
>>>>>     ArrayList<Books> author1;
>>>>>     ArrayList<Books> author2;
>>>>>     ArrayList<Books> author3;
>>>>>     ArrayList<Books> author4;
>>>>>     ArrayList<Books> author5;
>>>>>
>>>>>     ...}
>>>>>
>>>>> Each author is a column family (let's say we only allow 5 authors per
>>>>> <T>Book class). Book per author ends up being the qualifier. In this case, I
>>>>> know I have a max family count but my qualifiers have no upper limit. So is
>>>>> this scenario a case for tall or wide table? Why? Thank you.
>>>>>
>>>>> On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> If they are accessed mostly together they should all be a single column
>>>>>> family. The key with tall or wide is based on the total byte size of each
>>>>>> KeyValue. Your cells would need to be quite large for 50 to become a
>>>>>> problem. I still would recommend using a single CF though.
>>>>>> --
>>>>>> Sent from iPhone
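
For anyone following the order entry example from this thread, here is a rough sketch of the idea. It is plain Java, not the HBase client API, and it only assumes a toy in-memory table in which every column family keeps its own sorted store under a shared row key. The class name `OrderTableSketch`, the family names (`order`, `pick`, `ship`, `invoice`), and the sample row key and values are all made up for illustration.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of a table with column families.
// NOT the HBase client API; every name here is hypothetical.
class OrderTableSketch {
    // family -> (rowKey -> (qualifier -> value)).
    // One sorted store per family, all of them keyed by the same row key.
    private final Map<String, NavigableMap<String, NavigableMap<String, String>>> stores =
            new TreeMap<>();

    // The set of families is fixed when the table is created.
    OrderTableSketch(String... families) {
        for (String family : families) {
            stores.put(family, new TreeMap<>());
        }
    }

    // Looks up the store for a family, rejecting families that were never declared.
    private NavigableMap<String, NavigableMap<String, String>> storeFor(String family) {
        NavigableMap<String, NavigableMap<String, String>> store = stores.get(family);
        if (store == null) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        return store;
    }

    // Writes one cell. Qualifiers are free-form and created on first use.
    void put(String rowKey, String family, String qualifier, String value) {
        storeFor(family).computeIfAbsent(rowKey, k -> new TreeMap<>()).put(qualifier, value);
    }

    // Reads one family of one row. Only that family's store is consulted.
    NavigableMap<String, String> get(String rowKey, String family) {
        NavigableMap<String, String> cells = storeFor(family).get(rowKey);
        return cells == null ? new TreeMap<>() : cells;
    }

    public static void main(String[] args) {
        OrderTableSketch table = new OrderTableSketch("order", "pick", "ship", "invoice");
        // Same key across all four families, as in the order entry example.
        table.put("order-0001", "order", "customer", "ACME");
        table.put("order-0001", "pick", "bin", "A7");
        table.put("order-0001", "ship", "carrier", "UPS");
        table.put("order-0001", "invoice", "total", "42.00");
        // The shipping desk reads only the shipping data for that order.
        System.out.println(table.get("order-0001", "ship"));
    }
}
```

The point of the sketch is the access pattern: the four families share a key, but each consumer (order entry, picking, shipping, invoicing) reads its own family without pulling in the others. If the data were always read together, a single family would do, as suggested earlier in the thread.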
