Toke, thanks. Comments embedded (hope that's okay):

On Tue, Oct 11, 2011 at 10:52 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> > Greetings. I have a paltry 23,000 database records that point to a
> > voluminous 300GB worth of PDF, Word, Excel, and other documents. We are
> > planning on indexing the records and the documents they point to. I have no
> > clue on how we can calculate what kind of server we need for this. I
> > imagine the index isn't going to be bigger than the documents (is it?)
>
> Sanity check: Let's say your average document is 200 pages with 1000
> words of 5 characters each. That gives you 200 * 1000 * 5 * 23,000 ~=
> 21GB of raw text, which is a far cry from the 300GB.
>
> Either your documents are extremely text heavy or they contain
> illustrations and other elements that are not to be indexed. Is it
> possible for you to estimate the number of characters in your corpus?

Yes. We estimate each of the 23K DB records has 600 pages of text for the
combined documents, 300 words per page, 5 characters per word. Which
coincidentally works out to about 21GB, so good guessing there. :)

> But what kind of processing power and memory might we need?
>
> I am not well-versed in Tika and other PDF/Word/etc analyzing
> frameworks, so I'll just focus on the search part here. Guessing wildly,
> you're aiming for a low number of running updates or even just a nightly
> batch update. Response times should be below 200 ms and the number of
> concurrent searches is 2 to 4 at most.

The way it works is we have researchers modifying the DB records during the
day, and they may upload documents at that time. We estimate 50-60 uploads
throughout the day. If possible, we'd like to index them as they are
uploaded, but if that would negatively affect the search, then we can
rebuild the index nightly. Which is better?

> Bold claim: Assuming that your corpus is more 20GB of raw text than
> 300GB, you'll get by just fine with an i7 machine with 8GB of RAM, a 1TB
> 7200 RPM drive for storage and a 256GB consumer SSD for search. That is
> more or less what we use for our 10M documents/60GB+ index, with a load
> as I described above.
>
> I've always been wary of having to dictate hardware up front for such
> projects. It is a lot easier and cheaper to just build the software,
> then measure and buy hardware after that.

We have a very beefy VM server that we will use for benchmarking, but your
specs provide a starting point. Thanks very much for that.

cheers,
Travis
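
P.S. For anyone finding this in the archives: the corpus estimate above is just
straight multiplication. A minimal Python sketch with the per-record figures we
quoted (records, pages, words per page, characters per word):

    # Rough raw-text estimate for the corpus discussed above:
    # 23,000 DB records, ~600 pages of attachments per record,
    # ~300 words per page, ~5 characters per word.
    records = 23000
    pages_per_record = 600
    words_per_page = 300
    chars_per_word = 5

    raw_text_bytes = records * pages_per_record * words_per_page * chars_per_word
    print("Estimated raw text: %.1f GB" % (raw_text_bytes / 1e9))  # ~20.7 GB

So roughly 21GB of text to index, not 300GB of files.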
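P.P.S. On the "index as they are uploaded" question, this is only a sketch of
what I have in mind, not something we've built: push each upload through Solr's
ExtractingRequestHandler and let commitWithin batch the commits, so 50-60
uploads a day shouldn't disturb searchers. The URL, id scheme, and the 60-second
window below are placeholders; if commitWithin isn't honored on the extract
handler in your Solr version, autoCommit in solrconfig.xml does the same job.

    # Hypothetical sketch only -- URL, id scheme and parameters are made up.
    # Sends one uploaded file to Solr Cell (/update/extract) and asks Solr to
    # fold the commit in within 60s instead of committing once per upload.
    import requests

    SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

    def index_upload(record_id, path):
        params = {
            "literal.id": record_id,   # carry our DB record id into the index
            "commitWithin": 60000,     # ms; batch commits rather than one per upload
        }
        with open(path, "rb") as f:
            resp = requests.post(SOLR_EXTRACT_URL, params=params,
                                 files={"file": (path, f)})
        resp.raise_for_status()

    # e.g. index_upload("rec-12345", "/tmp/report.pdf") from the upload handler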