On Wed, Jun 29, 2011 at 10:08 PM, Florin P <[email protected]> wrote:
> We have almost the same scenario as Aditya, but with some differences.
> 1. Our files are documents in any format (xls, pdf, doc, html, etc.)
> 2. We are expecting to have more than 5 million of these documents.
This is not many docs. Will your document set be steady-state once it hits 5M?

> 3. The size of them varies like this:
>    70% of them have length < 1MB
>    29% of them have length between 1MB and 10MB
>    1% of them have length > 10MB (they can also reach 100MB)

What David says above holds, though Jack in his yfrog presentation today talks of storing all images in HBase up to 5MB in size. Karthick in his presentation at the Hadoop Summit talked about how once cells cross a certain size -- he didn't say what the threshold was, I believe -- only the metadata is stored in HBase and the content goes to their "big stuff" system.

Try it, I'd say. If there are only a few instances of 100MB, HBase might be fine.

> 4. We have to index all these files.
> 5. We have to extract some metadata from just a subset of them, having as
> input a client key.

One time or ongoing?

St.Ack
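For what it's worth, the pattern Karthick described can be sketched roughly like this. This is a minimal illustration, not his actual system: the 10MB threshold, the column names, and the store interfaces are all assumptions made up for the example.

```python
# Sketch of the "metadata in HBase, big content elsewhere" pattern.
# The 10MB threshold and store interfaces are illustrative assumptions only.

INLINE_THRESHOLD = 10 * 1024 * 1024  # assumed cutoff, not a known HBase limit

def store_document(doc_id, content, hbase_put, blob_put):
    """Store small docs inline in the HBase cell; for big ones, keep only a pointer."""
    meta = {"doc:id": doc_id, "doc:size": str(len(content))}
    if len(content) < INLINE_THRESHOLD:
        meta["doc:content"] = content          # small: content lives in the cell
    else:
        blob_ref = blob_put(doc_id, content)   # big: content goes to the blob store
        meta["doc:blob_ref"] = blob_ref        # the cell holds only the reference
    hbase_put(doc_id, meta)
    return meta

# Toy in-memory stand-ins for the two stores:
hbase = {}
blobs = {}

def blob_put(key, value):
    blobs[key] = value
    return key  # the reference stored back in HBase

store_document("d1", b"x" * 100, hbase.__setitem__, blob_put)
store_document("d2", b"x" * (20 * 1024 * 1024), hbase.__setitem__, blob_put)
```

With a mix like yours (99% under 10MB), most writes take the inline branch, and only the rare large document pays the extra round trip to the blob store.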
