Tom, this is completely unrelated, but given that you have such huge
documents, and I see you have exceeded term count limits in Lucene, I can't
help but wonder whether you have ever considered Andrzej's index pruning patch?
(It is simply a tool you can run on your index.)

Depending upon requirements, it seems like it might be a good fit.

http://issues.apache.org/jira/browse/LUCENE-1812
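
As a rough aside on the original sizing question (50 million OCRed pages), the
figures in Tom's message quoted below can be turned into a back-of-envelope
estimate. This is only a sketch under loud assumptions: that an A4 page has a
similar token density to a HathiTrust book page, that the OCR text is not
stored, that positions are indexed, and that "GB" means decimal gigabytes. The
function names are made up for illustration.

```python
# Back-of-envelope index sizing from the figures quoted in this thread.
# Assumption: decimal gigabytes; the thread does not say which unit is meant.
GB = 10**9


def bytes_per_page(index_gb, books, pages_per_book):
    """Average index bytes per page for one shard."""
    return index_gb * GB / (books * pages_per_book)


def estimate_index_gb(pages, per_page_bytes):
    """Projected index size in GB for a given page count."""
    return pages * per_page_bytes / GB


# Quoted figures: 250-300 GB per 500,000 books, 200-300 pages per book.
low = bytes_per_page(250, 500_000, 300)   # smallest index, most pages
high = bytes_per_page(300, 500_000, 200)  # largest index, fewest pages

target_pages = 50_000_000  # page count from the original question
print(estimate_index_gb(target_pages, low), estimate_index_gb(target_pages, high))
```

Under those assumptions the estimate comes out at very roughly 80 to 150 GB
for 50 million pages. Real numbers will depend heavily on the analysis chain
and OCR quality.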

On Thu, Feb 11, 2010 at 3:11 PM, Tom Burton-West <tburtonw...@gmail.com> wrote:

>
> The HathiTrust Large Search indexes the OCR from 5 million volumes, with an
> average of 200-300 pages per volume. So the total number of pages indexed
> would be over 1 billion. However, we are not using pages as Solr documents,
> we are using the entire book, so we only have 5 million rather than 1
> billion Solr documents.
>
> We also are not storing the OCRed text. Since the total size of the index
> for 5 million volumes is over 2 terabytes, we split the index into 10
> shards, each indexing about half a million documents.
>
> Given all that, our indexes are about 250-300GB for each 500,000 books.
> About 85% of that is the *prx position index. Unless you have enough
> memory to get a significant amount of the index into the OS disk
> cache, disk I/O is the big bottleneck, especially for phrase queries with
> common words.
> See http://www.hathitrust.org/blogs/large-scale-search for more details.
>
> Have you considered storing the OCR separately rather than in the Solr
> index, or does your use case require storing the OCR in the index?
>
>
> Tom Burton-West
> Digital Library Production Service
> University of Michigan
>
>
>
> Wick2804 wrote:
> >
> > We are thinking of creating a Lucene Solr project to store 50 million
> > full-text OCRed A4 pages. Is there anyone out there who could provide some
> > kind of guidance on the size of index we are likely to generate, and are
> > there any gotchas in the standard analysis engines for load and query that
> > will cause us issues? Do large indexes cause memory issues on servers? Any
> > help or advice greatly appreciated.
> >
>
> --
> View this message in context:
> http://old.nabble.com/Solr-Performance-and-Scalability-tp27552013p27553353.html
> Sent from the Solr - Dev mailing list archive at Nabble.com.
>
>


-- 
Robert Muir
rcm...@gmail.com
