Tom, this is completely unrelated, but given that you have such huge documents, and I see you have exceeded term count limits in Lucene, I can't help but wonder whether you have ever considered Andrzej's index pruning patch? (It is simply a tool you can run on your index.)
Depending upon requirements, it seems like it might be a good fit.

http://issues.apache.org/jira/browse/LUCENE-1812

On Thu, Feb 11, 2010 at 3:11 PM, Tom Burton-West <tburtonw...@gmail.com> wrote:

> The HathiTrust Large Search indexes the OCR from 5 million volumes, with an
> average of 200-300 pages per volume. So the total number of pages indexed
> would be over 1 billion. However, we are not using pages as Solr documents,
> we are using the entire book, so we only have 5 million rather than 1
> billion Solr documents.
>
> We also are not storing the OCRed text. Since the total size of the index
> for 5 million volumes is over 2 terabytes, we split the index into 10
> shards, each indexing about 1/2 million documents.
>
> Given all that, our indexes are about 250-300GB for each 500,000 books.
> About 85% of that is the *prx position index. Unless you have enough
> memory on the OS to get a significant amount of the index into the disk OS
> cache, disk I/O is the big bottleneck, especially for phrase queries with
> common words.
> See http://www.hathitrust.org/blogs/large-scale-search for more details.
>
> Have you considered storing the OCR separately rather than in the Solr
> index, or does your use case require storing the OCR in the index?
>
> Tom Burton-West
> Digital Library Production Service
> University of Michigan
>
> Wick2804 wrote:
> >
> > We are thinking of creating a Lucene Solr project to store 50 million full
> > text OCRed A4 pages. Is there anyone out there who could provide some kind
> > of guidance on the size of index we are likely to generate, and are there
> > any gotchas in the standard analysis engines for load and query that will
> > cause us issues? Do large indexes cause memory issues on servers? Any
> > help or advice greatly appreciated.
> >
>
> --
> View this message in context:
> http://old.nabble.com/Solr-Performance-and-Scalability-tp27552013p27553353.html
> Sent from the Solr - Dev mailing list archive at Nabble.com.

--
Robert Muir
rcm...@gmail.com
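
For a rough sense of how the figures Tom quotes might carry over to the 50 million pages in the original question, here is a back-of-envelope sketch. It assumes index size scales linearly with page count and that an OCRed A4 page is comparable to a HathiTrust book page with unstored text; neither assumption is stated in the thread, and the function names are made up for illustration.

```python
# Back-of-envelope index size estimate from the figures quoted in the thread.
# ASSUMPTIONS (not from the thread): linear scaling with page count, and
# A4 pages roughly comparable to HathiTrust book pages.

def bytes_per_page(index_gb, books, pages_per_book):
    """Average index bytes per page for one shard (1 GB taken as 1e9 bytes)."""
    return index_gb * 1e9 / (books * pages_per_book)

def estimate_index_gb(pages, per_page_bytes):
    """Projected index size in GB for a given number of pages."""
    return pages * per_page_bytes / 1e9

# Quoted figures: 250-300 GB per 500,000 books, 200-300 pages per book.
low_bpp = bytes_per_page(250, 500_000, 300)   # smallest index, most pages
high_bpp = bytes_per_page(300, 500_000, 200)  # largest index, fewest pages

target_pages = 50_000_000
low_gb = estimate_index_gb(target_pages, low_bpp)
high_gb = estimate_index_gb(target_pages, high_bpp)

# About 85% of the index is quoted as position (*prx) data.
PRX_SHARE = 0.85

print(f"bytes per page: {low_bpp:.0f} to {high_bpp:.0f}")
print(f"estimated index: {low_gb:.0f} to {high_gb:.0f} GB")
print(f"of which positions: {low_gb * PRX_SHARE:.0f} to {high_gb * PRX_SHARE:.0f} GB")
```

Indexing one page per Solr document, rather than one book per document as HathiTrust does, would change the per-document overhead, so treat the range as an order-of-magnitude guide only.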