Hello!
We have almost the same scenario as Aditya, but with some differences:
1. Our files are documents in any format (xls, pdf, doc, html, etc.).
2. We expect to have more than 5 million of these documents.
3. Their sizes vary as follows:
70% of them are smaller than 1 MB
29% of them are between 1 MB and 10 MB
1% of them are larger than 10 MB (some can reach 100 MB)
4. We have to index all of these files.
5. We have to extract some metadata from just a subset of them, given a
client key as input.
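
To give a rough idea of the total volume, here is a back-of-the-envelope
estimate. The average size per bucket (0.5 MB, 5 MB, 50 MB) and the exact
count of 5 million are assumptions for illustration, not measured values:

```python
# Rough storage estimate for the document set described above.
# ASSUMPTIONS (not measured): exactly 5 million documents and a guessed
# average size for each size bucket.
TOTAL_DOCS = 5_000_000

# bucket name -> (share of documents, assumed average size in MB)
BUCKETS = {
    "under 1 MB": (0.70, 0.5),
    "1 MB to 10 MB": (0.29, 5.0),
    "over 10 MB": (0.01, 50.0),
}


def estimate_total_mb(total_docs, buckets):
    """Sum (documents in bucket) x (assumed average size), in MB."""
    return sum(total_docs * share * avg_mb for share, avg_mb in buckets.values())


total_mb = estimate_total_mb(TOTAL_DOCS, BUCKETS)
print(f"Estimated raw volume: {total_mb / 1_000_000:.2f} TB")
```

Under these guesses the raw data comes to roughly 11.5 TB before HDFS
replication, with most of the bytes in the 1 MB to 10 MB bucket.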
Given the above scenario, is it suitable to store the files (their content)
directly in HBase?
We have used HDFS map files to store them, but that does not fit the fifth
requirement well. Therefore we are trying to move to HBase.
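
To make the fifth requirement concrete, this is roughly the access pattern
we hope to get. The sketch below is plain Python, not the HBase client API:
it only imitates the fact that HBase keeps rows sorted by row key, so rows
sharing a key prefix sit next to each other. The key layout
"<client key>|<document id>" and all sample names are hypothetical.

```python
import bisect

# Simulated table: row keys kept in sorted order, as HBase stores them.
# The "<client_key>|<doc_id>" layout is a hypothetical design choice.
rows = sorted([
    "clientA|doc-001",
    "clientA|doc-002",
    "clientB|doc-001",
    "clientC|doc-007",
    "clientB|doc-104",
])


def prefix_scan(sorted_keys, client_key):
    """Return all row keys of one client via a range over the sorted keys."""
    start = client_key + "|"
    # "}" directly follows "|" in ASCII, so this bound sorts just after
    # every key that starts with the prefix.
    stop = client_key + "}"
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


print(prefix_scan(rows, "clientB"))
```

Only the rows of the requested client are touched, which is the behaviour
we could not get easily from the map files.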
In conclusion, we need your expertise in this matter to help us make the
right decision and pick the right "tool".
Thank you,
Florin
--- On Tue, 6/28/11, Buttler, David <[email protected]> wrote:
> From: Buttler, David <[email protected]>
> Subject: RE: HBase region size
> To: "[email protected]" <[email protected]>
> Date: Tuesday, June 28, 2011, 1:12 PM
> My understanding is the following
> (which will hopefully be corrected by those with more
> experience):
> * you should try to limit your cell size to less than 1
> MB. This is not a hard and fast rule, but there are
> certain limits you don't want to exceed: you don't want a
> row exceeding your region size.  If you have only one
> CF/column holding a single 5MB object, then you should be
> fine.
>
> * the larger your region size, the less overhead there is
> for storage, and the fewer total regions you will
> need. The drawback is that random access will be
> slower. Given your object size and the assumption that
> this is archival storage, you might want to consider a 4GB
> region size.
>
> * for standard HDFS, you should aim to keep the total
> number of files less than 50M. This argues for the
> larger region sizes as well. With a 4GB region size,
> one HFile per region, you could store a maximum of 200PB in
> HBase.  A 256MB region size would drop your maximum to
> closer to 13 PB.  Given all of the other overhead that
> maximum doesn't consider, larger sizes are again preferred.
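
[Inline note] Working this multiplication through as a sanity check,
assuming decimal units (1 PB = 10^15 bytes), exactly one HFile per region,
and the 50M-file guideline mentioned above:

```python
# Upper bound on HBase capacity if HDFS holds at most MAX_FILES files
# and every region is backed by exactly one HFile (a simplification).
MAX_FILES = 50_000_000  # HDFS file-count guideline quoted above
PB = 10**15             # decimal petabyte, an assumption


def max_capacity_pb(region_size_bytes, max_files=MAX_FILES):
    """Capacity in PB when each file is one full region."""
    return max_files * region_size_bytes / PB


for label, size in [("256 MB", 256 * 10**6), ("4 GB", 4 * 10**9)]:
    print(f"{label} regions: up to {max_capacity_pb(size):.1f} PB")
```

This is only the theoretical ceiling. A real cluster keeps more than one
file per region, so the practical limit would be lower.
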
>
> Your other questions are highly dependent on usage patterns
> that I can't even guess at. As a general rule of
> thumb, I would say go with 1 core for 1-4 disk spindles
> (depending on your estimated CPU usage). Think
> carefully about where your bottlenecks (disk / network / CPU)
> are likely to be and then test out your assumptions.
>
> Dave
>
>
> -----Original Message-----
> From: Aditya Karanth A [mailto:[email protected]]
>
> Sent: Monday, June 27, 2011 11:38 PM
> To: [email protected]
> Subject: HBase region size
>
>
> Hi,
>
> We have been using Hadoop in our project as a DFS
> cluster to store some
> critical information.
> This critical information is stored as zip files of about
> 3-5 MB in size
> each. The number of these files would grow to more than a
> billion files and
> more than 1 petabyte of storage. We are aware of the
> “too many small files”
> problem in HDFS and hence have considered moving to HBase
> to store these
> files for the following reasons:
> 1. Indexed reads. Also, this information is archived data,
> which will not be
> read very often.
> 2. The Regions managed by HBase would help ensure that we
> don’t end up
> having too many files on the DFS.
>
> In order to move from an HDFS to an HBase cluster, we are
> considering the following setup. We would like someone to
> validate it and let us know of better configurations, if any:
>
> 1. The setup would have a 20 node HBase cluster.
> 2. The HBase region size would be 256 MB.
> 3. Each datanode would have at least 32 TB (terabytes) of disk
> space. (We may
> add more data nodes to accommodate > 1PB)
>
> The question here is: if we have a region size of
> 256MB, will we still
> have a problem of "too many small files" in Hadoop
> because of the number of
> regions it may start generating?
> What is the optimum size of the region to go with, given
> the above scenario?
> Given that we may not be accessing the HBase cluster in a
> highly concurrent
> environment, can we increase the region size?
>
> I have heard that the bigger the size of the regionserver,
> the more time it takes
> for region splitting and the slower the reads are. Is this
> true?
> (I have not been able to experiment with all these in our
> environments yet,
> but if anyone has been there and done that, would be good
> to know)
>
> Is it better to have smaller clusters with larger disk
> space, or more
> clusters with less disk space?
>
> Any help appreciated.
>
> Thanks & Regards,
> Aditya
> --
> View this message in context:
> http://old.nabble.com/HBase-region-size-tp31943719p31943719.html
> Sent from the HBase User mailing list archive at
> Nabble.com.
>
>