RE: Indexing a large number of DB records
There were other reasons for my choice of going with a temp index. Namely, I was getting terrible write times to my live index as it was stored on a different server; also, while I was writing to my live index, people trying to search on it were getting file-not-found exceptions. So rather than spend hours or days trying to fix that, I took the easiest route of creating a temp index on the server which had the application and merging it into the live index on the other server. This greatly increased my indexing speed.

Best of luck
Garrett

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 15 December 2004 18:43
To: Lucene Users List
Subject: RE: Indexing a large number of DB records

Note that this really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size, because new segments are created as documents are added, and existing segments don't need to be updated (only when merges happen). Again, I'd run your app under a profiler to see where the time and memory are going.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:
[snip - quoted in full in the next message]
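(A minimal sketch of the single-IndexWriter approach Otis describes, in the same C# style as Homam's snippet quoted below; the paths, connection string, query, and the BrowseFieldNames array are placeholders, and the class names are assumed from the dotlucene-era port, so treat it as illustrative rather than tested:)

    using System.Data.SqlClient;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    // One writer for the whole run: no temp index, no intermediate
    // optimize; segments are flushed and merged as documents are added.
    IndexWriter iw = new IndexWriter(@"c:\idx", new StandardAnalyzer(), true);
    SqlConnection conn = new SqlConnection(connectionString);  // placeholder
    conn.Open();
    SqlCommand cmd = new SqlCommand("SELECT [fields] FROM [tables]", conn);
    SqlDataReader rdr = cmd.ExecuteReader();
    while (rdr.Read())
    {
        Document doc = new Document();
        for (int i = 0; i < BrowseFieldNames.Length; i++)
            doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
        iw.AddDocument(doc);  // roughly constant cost per document
    }
    rdr.Close();
    conn.Close();
    iw.Optimize();  // once, at the very end
    iw.Close();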
RE: Indexing a large number of DB records
Hi Homan I had a similar problem as you in that I was indexing A LOT of data Essentially how I got round it was to batch the index. What I was doing was to add 10,000 documents to a temporary index, use addIndexes() to merge to temporary index into the live index (which also optimizes the live index) then delete the temporary index. On the next loop I'd only query rows from the db above the id in the maxdoc of the live index and set the max rows of the query to to 10,000 i.e SELECT TOP 1 [fields] FROM [tables] WHERE [id_field] {ID from Index.MaxDoc()} ORDER BY [id_field] ASC Ensuring that the documents go into the index sequentially your problem is solved and memory usage on mine (dotlucene 1.3) is low Regards Garrett -Original Message- From: Homam S.A. [mailto:[EMAIL PROTECTED] Sent: 15 December 2004 02:43 To: Lucene Users List Subject: Indexing a large number of DB records I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields, most of them are UnStored and represent small strings or numbers. No huge DB Text fields. But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector is getting a chance to clean these up. How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB): private void IndexRow(SqlDataReader rdr, IndexWriter iw) { Document doc = new Document(); for (int i = 0; i BrowseFieldNames.Length; i++) { doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString())); } iw.AddDocument(doc); } __ Do you Yahoo!? Yahoo! Mail - Find what you need with new enhanced search. http://info.mail.yahoo.com/mail_250 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
C# Ports
I was just wondering what tools (JLCA?) people are using to port Lucene to C#, as I'd be well interested in converting things like the Snowball stemmers, WordNet, etc.

Thanks
Garrett
maxDoc()
Can anyone please explain to me why maxDoc returns 0 when Luke shows 239,473 documents? maxDoc returns the correct number until I delete a document, and I have called optimize after the delete, but still the problem remains. Strange.

Any ideas greatly appreciated
Garrett
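(Not an answer, but a note on what the two counters mean: maxDoc() counts document slots including deleted ones, while numDocs() excludes deletions, so the two only diverge after a delete and converge again once the deletions are merged away. A small sketch, with method casing assumed from the Lucene.Net port:)

    using Lucene.Net.Index;

    IndexReader reader = IndexReader.Open(@"c:\idx");
    System.Console.WriteLine(reader.MaxDoc());   // includes deleted slots
    System.Console.WriteLine(reader.NumDocs());  // live documents only
    reader.Delete(0);                            // mark the first document deleted
    System.Console.WriteLine(reader.MaxDoc());   // unchanged: the slot still exists
    System.Console.WriteLine(reader.NumDocs());  // one lower
    reader.Close();  // optimizing with an IndexWriter would make the two match again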
RE: addIndexes() Size
Ok, I upgraded to 1.4.3 but that didn't solve the issue - I was still ending up with huge indexes. So I changed approach: instead of handing the addIndexes method IndexReaders, I gave it the directory, as the code is slightly different - and now the index size is what I would expect it to be. I haven't had time to check out fully why yet, but from what I can see the major difference between the two methods is that addIndexes(IndexReader[]) uses the following:

    if (segmentInfos.size() == 1) // add existing index, if any
        merger.add(new SegmentReader(segmentInfos.info(0)));

Perhaps this is resulting in an unnecessary ballooning of the index? I'll leave it for someone with a better understanding of the underlying file system to answer...

Thanks
Garrett

-----Original Message-----
From: Garrett Heaver [mailto:[EMAIL PROTECTED]]
Sent: 06 December 2004 17:32
To: 'Lucene Users List'
Subject: RE: addIndexes() Size

[snip - the rest of this thread appears in full below]
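(For anyone wanting to try the same switch, the call-site difference is just which overload the temp index is handed to; a sketch with assumed Lucene.Net names and placeholder paths:)

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    IndexWriter live = new IndexWriter(@"\\ServerA\idx", new StandardAnalyzer(), false);

    // IndexReader[] overload - the one that gave the huge index above:
    // live.AddIndexes(new IndexReader[] { IndexReader.Open(@"d:\idx\temp") });

    // Directory[] overload - the one that gave the expected size:
    live.AddIndexes(new Directory[] { FSDirectory.GetDirectory(@"d:\idx\temp", false) });
    live.Close();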
addIndexes() Size
Hi. It's probably really simple to explain this, but since I'm not up to speed on the way Lucene stores the data, I'm a little confused.

I'm building an index which resides on Server A, with the Lucene service running on Server B. Not to bore you with the details, but because of the network transfer rate etc. I'm running the actual index on \\ServerA\idx and building a temp index at \\ServerB\idx\temp (obviously because the local FS is much faster for the service), and then calling addIndexes to import the temp index into the ServerA index before destroying the ServerB index, holding for a bit, and then checking for new documents.

All works grand, BUT the size of the resultant index on ServerA is HUGE in comparison to one I'd build from start to finish (i.e. a simple addDocument index) - 38 gigs for 220,000 UnStored items cannot be right (to give you an idea of how mad this seems, the backed-up version of the database from which the data is pulled is only 2 gigs).

I've considered it being perhaps the number of items that had to be integrated each time addIndexes was called - right now I'm adding around 10,000 at a time (I had done 1,000 at a time but this looked like it was going to end up even larger still). I'm holding off twiddling the minMergeDocs and mergeFactor until I can get a better understanding of what's going on here.

Many thanks for any replies
Garrett
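(For reference, the two knobs mentioned at the end are public fields on IndexWriter in the 1.x API; a sketch, with field names, casing, and defaults assumed to mirror the Java originals:)

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;

    IndexWriter iw = new IndexWriter(@"\\ServerA\idx", new StandardAnalyzer(), false);
    iw.mergeFactor = 10;    // segments that accumulate before an on-disk merge (default 10)
    iw.minMergeDocs = 100;  // documents buffered in RAM before a segment is flushed (default 10)
    // ... AddDocument loop ...
    iw.Close();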
RE: addIndexes() Size
No, there are no duplicate copies - I've the correct number when I view through Luke, and I don't overlap - the temporary index is destroyed after it is added to the main index. I'm currently at index version 159 and it seems that all of my .prx files come in at around 1435 megs (ouch).

Thanks
Garrett

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 06 December 2004 17:12
To: Lucene Users List
Subject: Re: addIndexes() Size

If I were you, I would first use Luke to peek at the index. You may find something obvious there, like multiple copies of the same Document. Does your temp index 'overlap' with the A index in terms of Documents? If so, you will end up with multiple copies, as the addIndexes method doesn't detect and remove duplicate Documents.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:
[snip - quoted in full in the message above]
RE: addIndexes() Size
Cheers for that Erik - believe it or not I'm still back at v1.3 (doh!!!). Will try 1.4.3 tomorrow.

Thanks
Garrett

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: 06 December 2004 17:27
To: Lucene Users List
Subject: Re: addIndexes() Size

There was a bug in 1.4 (and maybe 1.4.1?) that kept some index files around that were not used. Are you using Lucene 1.4.3? If not, try that and see if it helps.

Erik

On Dec 6, 2004, at 12:17 PM, Garrett Heaver wrote:
[snip - quoted in full in the messages above]