I am now convinced that option 1 will be the best option for my data. Thanks Lars!
Kamal

On Mon, Dec 23, 2013 at 4:12 PM, lars hofhansl <[email protected]> wrote:

> The HDFS NameNode will have to deal with lots of small files (currently
> HBase cannot flush column families independently, so if one is flushed all
> of them are).
> The other reason is that scanning will be slow (if your scan involves
> many column families, due to the merge sort HBase needs to perform).
>
> Option #1 should be better. HBase will be smart, just scanning the HFiles
> necessary for the key range you provide (Category + Timestamp).
>
> -- Lars
>
> ________________________________
> From: Kamal Bahadur <[email protected]>
> To: user <[email protected]>; Dhaval Shah <[email protected]>
> Sent: Monday, December 23, 2013 3:47 PM
> Subject: Re: Schema Design Newbie Question
>
> Hi Dhaval,
>
> Thanks for the quick response!
>
> Why do you think having more files is not a good idea? Is it because of OS
> restrictions?
>
> I get around 50 million records a day and each record contains ~25
> columns. Values for each column are ~30 characters.
>
> Kamal
>
> On Mon, Dec 23, 2013 at 3:35 PM, Dhaval Shah <[email protected]> wrote:
>
> > A 1000 CFs with HBase does not sound like a good idea.
> >
> > category + timestamp sounds like the better of the 2 options you have
> > thought of.
> >
> > Can you tell us a little more about your data?
> >
> > Regards,
> >
> > Dhaval
> >
> > ________________________________
> > From: Kamal Bahadur <[email protected]>
> > To: [email protected]
> > Sent: Monday, 23 December 2013 6:01 PM
> > Subject: Schema Design Newbie Question
> >
> > Hello,
> >
> > I am just starting to use HBase and I am coming from the Cassandra world.
> > Here is some quick background on my data:
> >
> > My system will be storing data that belongs to a certain category.
> > Currently I have around 1000 categories. Also note that some categories
> > produce a lot more data than others. To be precise, 10% of the categories
> > provide more than 65% of the total data in the system.
> >
> > Data access queries always contain this category in the query. I have
> > listed 2 options to design the schema:
> >
> > 1. Add category as the first component of the row key [category +
> >    timestamp] so that my data is sorted by category for fast retrieval.
> > 2. Add category as a column family so that I can just use the timestamp
> >    as the row key. This option will however create more HFiles since I
> >    have many categories.
> >
> > I am leaning towards option 2. I like the idea that HBase separates data
> > for each CF into its own HFiles. However, I am still worried about the
> > number of HFiles that will be created on the server. Will it cause any
> > other side effects? I would like to hear from the user community as to
> > which option will be the best option in my case.
> >
> > Kamal
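
For readers following the thread, here is a minimal sketch of what an option 1 row key [category + timestamp] could look like at the byte level. It does not call HBase at all. The 2-byte numeric category id, the millisecond timestamps, and the helper names `make_row_key` and `scan_range` are assumptions made for illustration, not anything stated in the thread. The point it tries to show is that HBase sorts rows by raw key bytes, so a big-endian, fixed-width encoding keeps each category contiguous and ordered by time, and a [start, stop) key pair covers one category's time window.

```python
import struct

# Hypothetical layout: 2-byte unsigned category id followed by an 8-byte
# unsigned timestamp (epoch millis). Big-endian (">") so that byte-wise
# ordering matches numeric ordering.
KEY_FORMAT = ">HQ"


def make_row_key(category_id: int, timestamp_ms: int) -> bytes:
    """Encode [category + timestamp] as a fixed-width 10-byte row key."""
    return struct.pack(KEY_FORMAT, category_id, timestamp_ms)


def scan_range(category_id: int, start_ms: int, stop_ms: int):
    """Return (start_row, stop_row) for one category; stop is exclusive."""
    return make_row_key(category_id, start_ms), make_row_key(category_id, stop_ms)


# Simulate byte-wise row ordering by sorting the encoded keys.
keys = sorted(
    make_row_key(c, t) for c, t in [(7, 2000), (3, 5000), (7, 1000), (3, 4000)]
)
decoded = [struct.unpack(KEY_FORMAT, k) for k in keys]

# Select the rows for category 7 in the window [1000, 2000).
start, stop = scan_range(7, 1000, 2000)
in_range = [struct.unpack(KEY_FORMAT, k) for k in keys if start <= k < stop]

print(decoded)
print(in_range)
```

In a real client the two byte strings from `scan_range` would be passed as the start and stop rows of a scan. A variable-length category name would need a separator or padding to keep the same ordering property, which this sketch sidesteps by assuming numeric ids.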
