Toke, thanks.  Comments embedded (hope that's okay):

On Tue, Oct 11, 2011 at 10:52 AM, Toke Eskildsen
<t...@statsbiblioteket.dk> wrote:

> > Greetings.  I have a paltry 23,000 database records that point to a
> > voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
> > planning on indexing the records and the documents they point to.  I have
> > no clue on how we can calculate what kind of server we need for this.  I
> > imagine the index isn't going to be bigger than the documents (is it?)
>
> Sanity check: Let's say your average document is 200 pages with 1000
> words of 5 characters each. That gives you 200 * 1000 * 5 * 23,000 ~=
> 21GB of raw text, which is a far cry from the 300GB.
>
> Either your documents are extremely text heavy or they contain
> illustrations and other elements that are not to be indexed. Is it
> possible for you to estimate the number of characters in your corpus?
>

Yes.  We estimate each of the 23K DB records has about 600 pages of text
across its combined documents, at 300 words per page and 5 characters per
word, which coincidentally works out to about 21GB, so good guessing
there. :)
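
(Spelling out the arithmetic: 23,000 records * 600 pages * 300 words/page
* 5 chars/word ~= 20.7 billion characters, i.e. roughly 21GB of raw text.)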

> >  But what kind of processing power and memory might we need?
>
> I am not well-versed in Tika and other PDF/Word/etc analyzing
> frameworks, so I'll just focus on the search part here. Guessing wildly,
> you're aiming for a low number of running updates or even just a nightly
> batch update. Response times should be below 200 ms and the number of
> concurrent searches is 2 to 4 at most.
>

The way it works is that researchers modify the DB records during the day
and may upload documents at that time.  We estimate 50-60 uploads
throughout the day.  If possible, we'd like to index documents as they are
uploaded, but if that would negatively affect search performance, we can
rebuild the index nightly instead.

Which is better?
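
For concreteness, the index-as-they-upload path I have in mind looks
roughly like the sketch below.  This assumes we end up on Solr with its
Tika-based extract handler; the URL, ids and field names are placeholders
rather than our real setup, and the nightly alternative would just be a
cron job that re-feeds the day's uploads and commits once at the end.

    # Hedged sketch: push an uploaded file to Solr's ExtractingRequestHandler
    # (Solr Cell / Tika) right after a researcher uploads it.
    import requests

    SOLR_URL = "http://localhost:8983/solr"  # placeholder host/core

    def index_upload(doc_id, path):
        """Send one uploaded document to Solr for text extraction and indexing."""
        with open(path, "rb") as f:
            resp = requests.post(
                SOLR_URL + "/update/extract",
                params={
                    "literal.id": doc_id,  # attach our DB record id to the Solr doc
                    "commit": "false",     # don't commit per upload; batch commits instead
                },
                files={"file": f},
            )
        resp.raise_for_status()

    def commit():
        """Make pending uploads searchable; run periodically (or nightly)."""
        resp = requests.post(
            SOLR_URL + "/update",
            data="<commit/>",
            headers={"Content-Type": "text/xml"},
        )
        resp.raise_for_status()

Either way, 50-60 uploads a day is a trivial write load; the real question
is how often we commit, since each commit opens a new searcher and that is
what could affect search performance.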


> Bold claim: Assuming that your corpus is more like 20GB of raw text than
> 300GB, you'll get by just fine with an i7 machine with 8GB of RAM, a 1TB
> 7200 RPM drive for storage and a 256GB consumer SSD for search. That is
> more or less what we use for our 10M documents/60GB+ index, with a load
> as I described above.
>
> I've always been wary of having to dictate hardware up front for such
> projects. It is a lot easier and cheaper to just build the software,
> then measure and buy hardware after that.
>

We have a very beefy VM server that we will use for benchmarking, but your
specs provide a starting point.  Thanks very much for that.

cheers,

Travis
