Index Layout Question
I am in the process of indexing about 1.5 million documents, and have started down the path of indexing these by month. Each month has between 100,000 and 200,000 documents. From a performance standpoint, is this the right approach? This allows me to use MultiSearcher (or ParallelMultiSearcher), but I'm not sure if the performance gains are really there. Would one monolithic index be better?

Thanks.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]
Re: Index Layout Question
"Jerry Jalenak" <[EMAIL PROTECTED]> writes:

> I am in the process of indexing about 1.5 million documents, and have
> started down the path of indexing these by month. Each month has between
> 100,000 and 200,000 documents. From a performance standpoint, is this the
> right approach? This allows me to use MultiSearcher (or
> ParallelMultiSearcher), but I'm not sure if the performance gains are really
> there. Would one monolithic index be better?

Depends on your search infrastructure. Doug Cutting has sent out some basic optimization guidelines on this list which should be in the archives... simply, you need to think about how many CPUs and spindles are involved.

1.5m documents isn't a challenge for Lucene to index or search on a single machine with a monolithic index. I indexed about 1.6m web pages in 22 hours on a single machine with all data local, and search with a single IndexSearcher was instantaneous.

We've also done some testing with a larger collection (25m pages) and ParallelMultiSearchers on several machines, and likewise on a fast network haven't felt a slowdown, but we haven't actually benchmarked it.

Ian
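For readers weighing the two layouts, here is a rough sketch of the extra step a per-month layout implies at query time: each monthly index returns its own ranked hits, and those lists have to be merged into one overall ranking. This is plain Java and is not Lucene's MultiSearcher implementation. The names `MonthlyIndexSketch`, `Hit`, and `mergeTopN` are hypothetical, and the sketch assumes scores from different partitions are directly comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Lucene.
class MonthlyIndexSketch {

    /** A scored hit; "partition" names the (hypothetical) monthly index it came from. */
    static final class Hit {
        final String partition;
        final int docId;
        final double score;

        Hit(String partition, int docId, double score) {
            this.partition = partition;
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Merge the result lists returned by each partition and keep the n
     * best-scoring hits overall. Assumes scores are comparable across partitions.
     */
    public static List<Hit> mergeTopN(List<List<Hit>> perPartition, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perPartition) {
            all.addAll(hits);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> january = Arrays.asList(
                new Hit("2005-01", 1, 0.9), new Hit("2005-01", 2, 0.2));
        List<Hit> february = Arrays.asList(
                new Hit("2005-02", 7, 0.5));

        for (Hit h : mergeTopN(Arrays.asList(january, february), 2)) {
            System.out.println(h.partition + " doc " + h.docId + " score " + h.score);
        }
    }
}
```

With a single monolithic index this merge step disappears, which is one reason the per-month split only pays off when the partitions can actually be searched in parallel on separate CPUs or disks.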
RE: Index Layout Question
That's good to know. I'm indexing on 11 fields (9 keyword, 2 text). The documents themselves are between 1K and 2K in size. Is there a point at which IndexSearcher performance begins to fall off (in terms of the number of index records)?

Jerry Jalenak

-----Original Message-----
From: Ian Soboroff [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 27, 2005 10:31 AM
To: Lucene Users List
Subject: Re: Index Layout Question
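As a back-of-envelope check on the figures in this thread, the raw text volume can be estimated from the document count and the per-document size. The sketch below assumes "1K" means 1024 bytes and uses the hypothetical names `CorpusSizeEstimate`, `rawBytes`, and `toGiB`. It says nothing about the size of the resulting index, which depends on the analyzer and on which fields are stored.

```java
// Hypothetical helper for rough sizing; not part of any library.
class CorpusSizeEstimate {

    /** Raw text volume in bytes for a given document count and per-document size. */
    public static long rawBytes(long documents, long bytesPerDocument) {
        return documents * bytesPerDocument;
    }

    /** Convert bytes to GiB (1 GiB = 2^30 bytes). */
    public static double toGiB(long bytes) {
        return bytes / (double) (1L << 30);
    }

    public static void main(String[] args) {
        long documents = 1_500_000L;           // "about 1.5 million documents"
        long low = rawBytes(documents, 1024L);  // assuming 1K = 1024 bytes
        long high = rawBytes(documents, 2048L); // assuming 2K = 2048 bytes

        System.out.printf("Raw text: %.2f to %.2f GiB%n", toGiB(low), toGiB(high));
    }
}
```

Under these assumptions the raw corpus comes to roughly 1.4 to 2.9 GiB of text.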