I am now convinced that option 1 will be the best option for my data.
Thanks Lars!

Kamal
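
For anyone reading this thread later, here is a minimal sketch of how option 1's [category + timestamp] row key could be laid out so that a scan for one category and time window covers a single contiguous key range. It is plain Python that only imitates byte-wise sorted row keys; it does not use the HBase client. The 16-byte category width, the helper names (row_key, scan_range, scan) and the sample categories are all assumptions made for illustration, not something taken from the thread.

```python
import struct
from bisect import bisect_left

# Assumption: categories are padded to a fixed width so that keys sort by
# category first and by timestamp second.
CATEGORY_WIDTH = 16


def row_key(category: str, ts_millis: int) -> bytes:
    """Compose [category + timestamp] as bytes that sort lexicographically."""
    cat = category.encode("utf-8")
    if len(cat) > CATEGORY_WIDTH:
        raise ValueError("category longer than the assumed fixed width")
    # NUL padding keeps shorter names first; a big-endian unsigned 64-bit
    # integer keeps timestamps in numeric order under byte comparison.
    return cat.ljust(CATEGORY_WIDTH, b"\x00") + struct.pack(">Q", ts_millis)


def scan_range(category: str, start_ms: int, stop_ms: int):
    """Start key (inclusive) and stop key (exclusive) for one category."""
    return row_key(category, start_ms), row_key(category, stop_ms)


def scan(sorted_keys, start: bytes, stop: bytes):
    """Imitate a range scan over sorted keys: return keys in [start, stop)."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Hypothetical sample data: (category, timestamp in ms).
samples = [("sports", 300), ("news", 100), ("sports", 100),
           ("news", 200), ("sports", 200)]
keys = sorted(row_key(c, t) for c, t in samples)

start, stop = scan_range("sports", 100, 300)
hits = scan(keys, start, stop)
print(len(hits), "rows matched for category 'sports'")
```

Because all rows of a category are adjacent in sort order, the scan never has to look at keys belonging to other categories, which is the property Lars describes below.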


On Mon, Dec 23, 2013 at 4:12 PM, lars hofhansl <[email protected]> wrote:

> The HDFS NameNode will have to deal with lots of small files (currently
> HBase cannot flush column families independently, so if one is flushed all
> of them are).
> The other reason is that scanning will be slow (if your scan involves
> many column families, due to the merge sort HBase needs to perform).
>
> Option #1 should be better. HBase is smart enough to scan only the HFiles
> necessary for the key range you provide (Category + Timestamp).
>
> -- Lars
>
>
>
> ________________________________
>  From: Kamal Bahadur <[email protected]>
> To: user <[email protected]>; Dhaval Shah <[email protected]>
> Sent: Monday, December 23, 2013 3:47 PM
> Subject: Re: Schema Design Newbie Question
>
>
> Hi Dhaval,
>
> Thanks for the quick response!
>
> Why do you think having more files is not a good idea? Is it because of OS
> restrictions?
>
> I get around 50 million records a day, and each record contains ~25
> columns. The value in each column is ~30 characters long.
>
> Kamal
>
>
>
> On Mon, Dec 23, 2013 at 3:35 PM, Dhaval Shah <[email protected]> wrote:
>
> > Having 1000 CFs in HBase does not sound like a good idea.
> >
> > category + timestamp sounds like the better of the 2 options you have
> > thought of.
> >
> > Can you tell us a little more about your data?
> >
> > Regards,
> >
> > Dhaval
> >
> >
> > ________________________________
> >  From: Kamal Bahadur <[email protected]>
> > To: [email protected]
> > Sent: Monday, 23 December 2013 6:01 PM
> > Subject: Schema Design Newbie Question
> >
> >
> > Hello,
> >
> > I am just starting to use HBase and I am coming from the Cassandra world.
> > Here is some quick background on my data:
> >
> > My system will be storing data that belongs to a certain category.
> > Currently I have around 1000 categories. Also note that some categories
> > produce a lot more data than others. To be precise, 10% of the categories
> > provide more than 65% of the total data in the system.
> >
> > Data access queries always include the category. I have listed 2 options
> > for the schema design:
> >
> > 1. Add category as the first component of the row key [category +
> > timestamp] so that my data is sorted by category for fast retrieval.
> > 2. Add category as a column family so that I can just use timestamp as
> > the row key. This option will, however, create more HFiles since I have
> > many categories.
> >
> > I am leaning towards option 2. I like the idea that HBase separates data
> > for each CF into its own HFiles. However, I am still worried about the
> > number of HFiles that will be created on the server. Will it cause any
> > other side effects? I would like to hear from the user community as to
> > which option will be the best in my case.
> >
> > Kamal
> >
>
