Having 1000 CFs in HBase does not sound like a good idea. [category + timestamp] sounds like the better of the 2 options you have thought of.
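As a rough illustration of the [category + timestamp] option, here is a minimal sketch of how such a composite row key could be built and how a per-category time window maps to a start/stop key pair. It is plain Python with no HBase client involved. The helper names (make_row_key, scan_range, rows_in_range), the zero-byte separator, and the 8-byte big-endian millisecond timestamp are all assumptions made for the example, not something taken from your setup. It relies on HBase sorting rows by the raw bytes of the key, and it assumes category names never contain a zero byte.

```python
import struct

# Assumed separator between category and timestamp.
# Category names must never contain this byte.
SEP = b"\x00"


def make_row_key(category: str, ts_millis: int) -> bytes:
    """Build a composite key: category bytes, separator, 8-byte big-endian timestamp.

    Big-endian packing makes byte-wise order match numeric order for
    non-negative timestamps, so rows of one category sort by time.
    """
    return category.encode("utf-8") + SEP + struct.pack(">Q", ts_millis)


def scan_range(category: str, start_millis: int, stop_millis: int):
    """Return (start_key, stop_key) for one category; start inclusive, stop exclusive."""
    return make_row_key(category, start_millis), make_row_key(category, stop_millis)


def rows_in_range(sorted_keys, start_key: bytes, stop_key: bytes):
    """Mimic a range scan over keys that are already sorted byte-wise."""
    return [k for k in sorted_keys if start_key <= k < stop_key]


if __name__ == "__main__":
    # Hypothetical categories and timestamps, purely for illustration.
    keys = sorted([
        make_row_key("news", 3000),
        make_row_key("ads", 2000),
        make_row_key("news", 1000),
        make_row_key("newsletter", 1500),
    ])
    start, stop = scan_range("news", 0, 4000)
    for key in rows_in_range(keys, start, stop):
        print(key)
```

With a real table, the two keys returned by scan_range would be used as the start and stop rows of a scan, so a query for one category only touches the contiguous block of rows for that category.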
Can you tell us a little more about your data?

Regards,
Dhaval

________________________________
From: Kamal Bahadur <[email protected]>
To: [email protected]
Sent: Monday, 23 December 2013 6:01 PM
Subject: Schema Design Newbie Question

Hello,

I am just starting to use HBase and I am coming from the Cassandra world. Here is some quick background on my data:

My system will be storing data that belongs to a certain category. Currently I have around 1000 categories. Also note that some categories produce a lot more data than others. To be precise, 10% of the categories provide more than 65% of the total data in the system.

Data access queries always include the category.

I have listed 2 options to design the schema:

1. Add category as the first component of the row key [category + timestamp] so that my data is sorted by category for fast retrieval.
2. Add category as a column family so that I can just use timestamp as the row key. This option will however create more HFiles since I have so many categories.

I am leaning towards option 2. I like the idea that HBase separates the data for each CF into its own HFiles. However, I am still worried about the number of HFiles that will be created on the server. Will it cause any other side effects?

I would like to hear from the user community as to which option would be best in my case.

Kamal
