Hello! Thank you for your responses. We are going to implement the solution by storing the metadata in HBase and the file contents in HDFS MapFiles, keeping a reference to each MapFile in HBase. A rough sketch of what we have in mind follows below. Kind regards, Florin
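A minimal sketch of that plan, assuming the Hadoop MapFile API with Text/BytesWritable records and a hypothetical "file_meta" table (table and family names are illustrative; in practice you would batch many files into one MapFile writer rather than open one per file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileBlobWriter {
  private static final byte[] FAM = Bytes.toBytes("m");  // hypothetical metadata family

  // Append file contents to a MapFile (keys must arrive in sorted order),
  // then record only the MapFile location in the HBase metadata row.
  public static void store(Configuration conf, String mapFileDir,
                           String fileName, byte[] content) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, mapFileDir, Text.class, BytesWritable.class);
    try {
      writer.append(new Text(fileName), new BytesWritable(content));
    } finally {
      writer.close();
    }

    HTable meta = new HTable(conf, "file_meta");  // hypothetical table name
    Put p = new Put(Bytes.toBytes(fileName));
    p.add(FAM, Bytes.toBytes("mapfile"), Bytes.toBytes(mapFileDir));  // the reference
    p.add(FAM, Bytes.toBytes("size"), Bytes.toBytes((long) content.length));
    meta.put(p);
    meta.close();
  }
}

To read a file back, you would fetch the metadata row to get the MapFile directory, then look the key up with MapFile.Reader.get(new Text(fileName), new BytesWritable()).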
--- On Fri, 7/1/11, Andrew Purtell <[email protected]> wrote:

> From: Andrew Purtell <[email protected]>
> Subject: Re: HBase region size
> To: "[email protected]" <[email protected]>
> Date: Friday, July 1, 2011, 4:23 AM
>
> > From: Stack <[email protected]>
> >
> > >> 3. The size of them varies like this:
> > >> 70% of them have length < 1 MB
> > >> 29% of them have length between 1 MB and 10 MB
> > >> 1% of them have length > 10 MB (they can also reach 100 MB)
> >
> > What David says above; though Jack in his yfrog presentation today
> > talks of storing all images in hbase up to 5 MB in size.
> >
> > Karthick in his presentation at hadoop summit talked about how, once
> > cells cross a certain size -- he didn't say what the threshold was, I
> > believe -- only the metadata is stored in hbase and the content goes
> > to their "big stuff" system.
> >
> > Try it, I'd say. If there are only a few instances of 100 MB, HBase
> > might be fine.
>
> I've seen problematic behavior in the past if you store values larger
> than 100 MB and then do concurrent scans over table(s) containing many
> such objects. The default KeyValue size limit is 10 MB, which is
> usually sufficient. For webtable-like applications I may raise it to
> 50 MB; larger objects are not interesting anyway (to me).
>
> One reasonable way to handle native storage of large objects in HBase
> would be to introduce a layer of indirection. Break the large object
> up into chunks. Store the chunks in a manner that gets good
> distribution in the keyspace, maybe by SHA-1 hash of the content. Then
> store an index to the chunks with the key of your choice. Get the key
> to retrieve the index, then use a MultiAction to retrieve the
> referenced chunks in parallel. Given large objects, you are going to
> need a number of round trips over the network to pull all of the data
> anyway; adding a couple more at the front may not push the result
> outside the performance bound of your application.
>
> However, you will put your client under heap pressure that way, as
> objects in HBase are transferred to the client in full, at once, in
> the RPC response. Another option is to store large objects directly in
> HDFS and keep only the path to them in HBase. A benefit of this
> approach is that you can stream the data out of HDFS with as little or
> as much buffering in your application as you would like.
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
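A minimal sketch of the chunked-storage indirection Andrew describes, assuming an HBase 0.90-era client API, 1 MB chunks, and hypothetical "chunks" and "chunk_index" tables. The client-side parallel fetch here uses HTable.get(List<Get>), which exercises the multi-action path server-side:

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ChunkedBlobStore {
  private static final int CHUNK = 1024 * 1024;          // 1 MB chunks (assumed size)
  private static final byte[] FAM = Bytes.toBytes("d");  // hypothetical column family

  private final HTable chunks;  // row key = SHA-1(chunk): spreads load over the keyspace
  private final HTable index;   // row key = caller's key: ordered list of chunk hashes

  public ChunkedBlobStore(Configuration conf) throws Exception {
    chunks = new HTable(conf, "chunks");
    index  = new HTable(conf, "chunk_index");
  }

  // Break the value into chunks, store each chunk under its SHA-1 hash,
  // then write one index row mapping chunk ordinal -> chunk hash.
  public void write(byte[] key, byte[] value) throws Exception {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    Put idx = new Put(key);
    for (int off = 0, n = 0; off < value.length; off += CHUNK, n++) {
      byte[] chunk = Arrays.copyOfRange(value, off, Math.min(off + CHUNK, value.length));
      byte[] hash = sha1.digest(chunk);  // digest() also resets for the next chunk
      chunks.put(new Put(hash).add(FAM, Bytes.toBytes("c"), chunk));
      idx.add(FAM, Bytes.toBytes(n), hash);
    }
    index.put(idx);
  }

  // One round trip for the index row, then a parallel multi-get for the chunks.
  public byte[] read(byte[] key) throws Exception {
    Result idx = index.get(new Get(key));
    List<Get> gets = new ArrayList<Get>();
    for (int n = 0; idx.getValue(FAM, Bytes.toBytes(n)) != null; n++) {
      gets.add(new Get(idx.getValue(FAM, Bytes.toBytes(n))));
    }
    Result[] parts = chunks.get(gets);  // results come back in request order
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Result r : parts) {
      out.write(r.getValue(FAM, Bytes.toBytes("c")));
    }
    return out.toByteArray();
  }
}

A side benefit of content-addressing the chunks by SHA-1 is that identical chunks written by different objects deduplicate into a single row.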
